
Lessons from Hyperscale, Part 1: NVMe as a Service

Can the technologies and strategies used by hyperscalers be deployed by the rest of us? We say, yes!

Sam Machiz, Director, Equinix Metal Solutions

Receive the answer to any question almost before you even ask it. Get a ride anywhere, anytime, within minutes. Detect diseases sooner and more accurately with AI.

A decade ago, these experiences would have sounded like science fiction. Now, it’s pretty much the world in which we live. People are actually flying around in jet packs and cars are (mainly) driving themselves! 

As massive amounts of data are served up to ever more powerful and networked computers, real-time experiences and Machine Learning (ML) are reshaping what it means for businesses to be “tech enabled.” 

And yet only a handful of companies have the infrastructure and software prowess to make experiences possible at a global scale. Google, Amazon, and Microsoft—the “hyperscaler” clouds du jour—excel at pushing the limits of compute, storage, and networking technologies to feed our hyper-connected world. In addition to huge investments in datacenters and global networks, they have charted new territory in computing architecture that has pushed the boundaries of performance while increasing efficiency.

But can the technologies and strategies used by hyperscalers be deployed by the rest of us? The short answer is yes, and this series will show how.
Let’s look at one of the most common bottlenecks first: storage.

The Inefficiencies of Direct Attached Storage (DAS)

As technology advances and makes new experiences possible, we want to process, transmit, and store increased amounts of data faster and more securely. But where does the data live? 

For performance reasons, many deployments rely on direct-attached storage (DAS), which delivers consistently high performance. The downside of this approach is that each compute node must contain the storage it uses, so every new compute server has to ship with its own storage.

In general, a DAS strategy leads to storage being underutilized. By some estimates, companies use only 40% to 50% of their local storage capacity, leaving a lot of money on the table while making scaling an expensive, inefficient process.  

In addition to underutilization, DAS creates logistical challenges: a need for more storage means additional servers that have to be maintained, upgraded, and protected.

Talking to Fast Flash...via Fabrics! 

A few years ago, we started putting high-performance SSDs (called NVMe SSDs) into servers, replacing slower media. To support this innovation, the industry developed a new protocol that enables substantially lower latency and higher throughput than the SAS/SATA standards used until that time.

With these fast drives sitting inside each server, the next wave of optimization was disaggregation—accessing pooled NVMe storage from various compute nodes over a high performance network. This became known as the “NVMe over Fabrics” (NVMe-oF) standard. 

These “fabrics,” such as Fibre Channel (FC), InfiniBand, and Remote Direct Memory Access (RDMA), enabled significant improvements in performance and scalability. Unfortunately, the fabric-specific investments in hardware and network changes drove up costs and made planning more challenging. Retrofitting existing environments was rarely cost-effective, so the promise of disaggregated NVMe was difficult to realize, and a lot of NVMe remained direct-attached.

The Secret Super-Power of TCP

So what’s the solution? How can you achieve consistently high performance without breaking the bank? The answer: run NVMe-oF using a standard ‘boring’ Ethernet network. Yup, good old TCP! 

With NVMe/TCP, the compute node becomes simpler while the storage gets a lot smarter and faster: low-latency, PCIe-connected NVMe drives served over plain TCP (or, as the nerds say, “NVMe/TCP”).
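To make that concrete, here is a minimal sketch of what attaching a remote NVMe/TCP target looks like on a stock Linux host, using the in-kernel nvme-tcp initiator and the standard nvme-cli tooling (nothing Lightbits-specific). The target address, port, and subsystem NQN below are placeholders, and the commands assume nvme-cli is installed and the script runs with root privileges.

# Minimal sketch: attach a remote NVMe/TCP target on Linux with nvme-cli.
# Assumes nvme-cli is installed and this runs as root; the address, port,
# and subsystem NQN are placeholders, not a real endpoint.
import subprocess

TARGET_ADDR = "192.0.2.10"                    # placeholder target IP
TARGET_PORT = "4420"                          # common NVMe-oF service port
SUBSYS_NQN = "nqn.2019-06.example.com:pool0"  # placeholder subsystem NQN

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

run(["modprobe", "nvme-tcp"])                 # load the in-kernel NVMe/TCP initiator
run(["nvme", "discover", "-t", "tcp",
     "-a", TARGET_ADDR, "-s", TARGET_PORT])   # ask the target what subsystems it exports
run(["nvme", "connect", "-t", "tcp", "-n", SUBSYS_NQN,
     "-a", TARGET_ADDR, "-s", TARGET_PORT])   # attach one subsystem over plain TCP
run(["nvme", "list"])                         # remote namespaces now appear as /dev/nvmeXnY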

We met the innovators of this approach, Lightbits Labs, back in 2018 through a shared investor (we both have funding from Dell Technologies Capital). As a bare metal provider, we were immediately struck by the promise of their technology. With Lightbits, our customers could scale their infrastructure more effectively by taking advantage of NVMe drives that performed like local disks … but were not! All without a massive investment in proprietary networking and optics.
 
With our focus on bringing hyperscale-style capabilities to enterprises and to constrained environments at the edge (where DAS is a huge issue), we had to know more. So we began working closely with Lightbits to dig into NVMe/TCP and their products.

Making the Technology Consumable

The promise of NVMe/TCP is clear, but making it work in reality for end users requires a good deal of polish. It is data, after all, and nobody likes to misplace it!  

That’s why, in addition to innovating on the core NVMe/TCP protocol, Lightbits also offers an “easy button” called LightOS. This software-defined storage solution works on standard servers and SSDs and provides rich services like data reduction, thin provisioning, and erasure coding protection.

For instance, if you are a fast-growing SaaS company, you might work with Packet to configure servers with only minimal local SSD capacity and instead access an expandable pool of NVMe drives through Lightbits for more efficient utilization and easier management, all without any loss in performance compared to a traditional DAS setup.
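To illustrate the “performs like a local disk” point: once a host is connected (as in the sketch above), the remote namespace shows up as an ordinary block device, for example /dev/nvme1n1 (the exact name varies), and can be formatted and mounted like any local drive. A minimal, destructive sketch, assuming root privileges and a hypothetical device name:

# Continuing the sketch above: the connected namespace is assumed to have
# appeared as /dev/nvme1n1 (device names vary). Destructive; run as root.
import os
import subprocess

DEVICE = "/dev/nvme1n1"     # placeholder: namespace served over NVMe/TCP
MOUNTPOINT = "/mnt/nvmeof"

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

run(["mkfs.ext4", "-F", DEVICE])        # format exactly like a local SSD (wipes the device)
os.makedirs(MOUNTPOINT, exist_ok=True)  # create the mount point
run(["mount", DEVICE, MOUNTPOINT])      # applications simply see a normal filesystem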

For use cases with even greater performance needs, Lightbits has another hyperscale trick up its sleeve: the LightField™ hardware accelerator, an FPGA that improves network processing, boosts IOPS, and reduces latency. Fancy!

Photo: LightField accelerator hardware

The Results: PB&J

It’s one thing to promise “hyperscale” benefits and quite another to deliver them. But when the right technologies (and partners) come together, it’s magical.

Disaggregating storage and compute, and scaling them independently across standard networks, delivers compelling results: easy adoption, outstanding performance, and improved economics through better utilization of resources.

Sounds like hyperscale to us, but deployable in decidedly non-hyperscale environments: your local Packet edge site, or the nearby Equinix, perhaps!

But don't take our word for it. Lightbits has a lab in Packet’s EWR1 facility (fully stocked with compute and storage infrastructure) to help you kick the tires. Request access here.
 


Lightbits Labs, founded in 2016, develops proprietary storage solutions deployed in industry-leading cloud datacenters around the globe. With strategic investors including Dell Technologies Capital, Cisco Investments, and Micron, and with backing from Chairman and Founder Avigdor Willenz, Lightbits Labs is focused on disaggregating storage and compute to improve performance and TCO.
 

Published on 15 January 2020