Deploying AI Infrastructure to Launch a Global Cloud Service in a Few Months

If you’ve been paying attention to NVIDIA in recent years, you know it has been transforming into a company known for a lot more than computer graphics. One way it has gone about that is by developing AI software tools that run on its accelerated computing platforms. And the way modern software tools of any kind are offered is through the cloud model, also known as SaaS.

The SaaS model is especially valuable for a company like NVIDIA, whose bread and butter is no longer just hardware sales but, increasingly, software sales too. Without a SaaS capability, a potential customer wanting to test drive NVIDIA’s AI software would have to invest in deploying AI infrastructure on premises for a trial. It can take months to hammer out a statement of work (SoW) and then build and ship servers, especially now, with all the supply chain disruptions. Then, the prospect would need to dedicate precious IT team resources to set up and run the trial environment.

AI Infrastructure and Software as a Service

The ability to offer test drives through a SaaS model circumvents all that. NVIDIA’s answer is NVIDIA LaunchPad, a portfolio of free trials and curated labs of AI software and infrastructure tools recently launched in nine markets around the world. Like anybody launching a new SaaS product, the people behind LaunchPad had to decide how best to source the hardware it would run on, the data centers it would live in, and the networks it would connect to.

NVIDIA deployed its DGX A100 AI systems in two Equinix IBX data centers to support NVIDIA DGX Foundry running NVIDIA Base Command™, the managed end-to-end AI development software platform that’s offered with monthly subscriptions. Free trials of the Base Command software are available as part of the LaunchPad program.

For NVIDIA AI Enterprise, which enables users to run an end-to-end suite of AI and data analytics software on industry-standard servers with VMware vSphere, and NVIDIA Fleet Command, a cloud platform for deploying and scaling AI applications distributed across many edge sites, NVIDIA turned to Equinix Metal. (NVIDIA AI Enterprise and Fleet Command are also part of the LaunchPad portfolio.)

We worked with NVIDIA to set up a global fleet of demo machines inside Equinix data centers, enabling potential customers to access them over the internet. Because this AI infrastructure is in Metal sites across nine metros, it’s easy for buyers to run LaunchPad curated labs in a location close to them to minimize latency.

When a customer is ready to move to production, we can quickly provide the underlying infrastructure in our IBX data centers, so the deployment experience looks and feels like SaaS. That extends to the balance sheet: the capital investment required to buy hardware is replaced by an operating expense, billed monthly and calculated based on actual use.

AI Infrastructure in Nine Data Center Metros in Three Months

To help NVIDIA push LaunchPad live in nine markets, we deployed 180 customized, high-performance, accelerated NVIDIA-Certified servers around the world within the span of three months—amidst a pandemic-induced tech supply chain meltdown.

NVIDIA knew which of its GPU accelerators would power its new AI SaaS product. We worked together to optimize the rest of the infrastructure stack. The teams’ hardware experts recommended configuring servers using Intel Ice Lake CPUs. (Older-generation CPUs could become harder to support in the coming years.) Metal recommended off-the-shelf mainstream servers that could be acquired quickly, without running into supply-chain delays.

With GPUs, CPUs, and mainboards sorted out, the remaining big piece of the AI infrastructure puzzle was networking. NVIDIA wanted each LaunchPad server to have four NIC ports: two 100-gigabit ones to streamline east-west traffic (within the LaunchPad infrastructure) and two 25-gigabit ones for north-south traffic (to and from users).
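To make the port layout concrete, here is a minimal sketch in Python that models the four NIC ports described above and totals the bandwidth per traffic direction. The port names (`ew0`, `ns0`, and so on) and the `NicPort` type are hypothetical, chosen for illustration only; the speeds and roles come from the paragraph above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NicPort:
    name: str        # hypothetical label, not an actual interface name
    speed_gbps: int  # nominal line rate in gigabits per second
    role: str        # "east-west" or "north-south"


# Four ports per server, as described in the article.
LAUNCHPAD_PORTS = [
    NicPort("ew0", 100, "east-west"),
    NicPort("ew1", 100, "east-west"),
    NicPort("ns0", 25, "north-south"),
    NicPort("ns1", 25, "north-south"),
]


def bandwidth_by_role(ports):
    """Sum nominal port speeds for each traffic role."""
    totals = {}
    for port in ports:
        totals[port.role] = totals.get(port.role, 0) + port.speed_gbps
    return totals


if __name__ == "__main__":
    for role, gbps in bandwidth_by_role(LAUNCHPAD_PORTS).items():
        print(f"{role}: {gbps} Gb/s nominal")
```

These are nominal line rates only. Whether the paired ports are bonded for throughput or kept as redundant links is not stated in the article, so usable bandwidth may differ.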

Not only did this call for more NIC ports than a typical server networking setup requires, it also needed high-bandwidth switches capable of supporting 100-gigabit port speeds. The Metal team solved this by deploying the same switches we use to configure our Open19 infrastructure (which uses 4x25-gigabit port aggregation).
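As a rough illustration of why 100-gigabit-capable switches matter here, the sketch below estimates how many 100-gigabit switch ports a group of servers would consume. It assumes, purely for illustration, that each 100-gigabit NIC port takes a full switch port and that the 25-gigabit NIC ports share switch ports through a 4x25-gigabit breakout. The article does not say how the 25-gigabit links were actually cabled, and the function name and server counts are hypothetical.

```python
import math


def switch_ports_needed(servers, lanes_per_port=4):
    """Estimate 100G switch ports for `servers` LaunchPad-style servers.

    Assumptions (not from the article): every server has two 100G links
    that each use one full switch port, and two 25G links that share
    100G switch ports via a 4x25G breakout. Redundancy across multiple
    switches and uplink ports are ignored.
    """
    native_100g = 2 * servers
    breakout_ports = math.ceil(2 * servers / lanes_per_port)
    return native_100g + breakout_ports


if __name__ == "__main__":
    for count in (1, 8, 20):  # hypothetical server counts
        print(count, "servers ->", switch_ports_needed(count), "switch ports")
```

In a real deployment the two links of each pair would likely land on different switches for redundancy, so treat this as a back-of-the-envelope count, not a cabling plan.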

The team began with a pilot deployment of just one proof-of-concept rack in one Metal location. From there, we scaled up by building two pilot racks in that location, and then expanded over the following weeks to all nine locations.

Things went so smoothly that the biggest delay in the whole project was a slow shipment of the 100-gigabit network cables needed for LaunchPad’s east-west traffic, a result of the supply chain issues enterprises are all too familiar with these days. When you’re deploying AI infrastructure in data centers around the world to stand up a global cloud platform and your biggest problem is Ethernet cables taking a while to arrive in the mail, you know you’ve done something right.

Prospective customers spun up dozens of NVIDIA LaunchPad environments within 10 days of the platform’s launch.

Curated labs are available free of charge to enterprises interested in trying AI on NVIDIA LaunchPad.
