
What We Learned Running a Virtual Networking IaaS

And what we’re doing to improve Network Edge performance, reliability and upgradeability.

Mandar Joshi, Senior Manager, Product Management, Network Edge

Back in 2018, as more and more customers were starting to transform their networks for a hybrid-cloud world, it became clear that they needed an easier way to deploy and manage complex network infrastructure at Equinix. Without one, network transformation required each customer to put racks, cables, switches, servers, connectivity and various other components in multiple places around the world, sometimes dozens of them. There had to be a better way.

We started exploring how we could abstract all this work away and put it behind an “easy button” for customers to build, manage and evolve their networks. Of course, if you know anything about carrier and enterprise networks, you know that there is nothing “easy” about them. Our customers’ needs varied widely. Some were Cisco shops, while others were adopting Palo Alto, for example. Some wanted a mix of things, and some were already thinking about multicloud networking. So, we embarked on a journey to build a service that would let each of them deploy their preferred networking across the global Equinix footprint and leverage our powerful interconnection options, all with a few mouse clicks.

A year later we launched Network Edge, an IaaS platform built on a network-optimized OpenStack distribution and designed to meet the intense networking requirements of our interconnected, hybrid-cloud-obsessed customers. It provided optimized virtual machines with plenty of virtual interfaces, direct access to Equinix Fabric (our platform for software-defined connectivity to clouds, carriers and others) and a marketplace of virtual network function (VNF) products from our partners.

At first, to build our operational chops, we deployed Network Edge in only a few markets with only a few VNF partners and early-access customers. The initial results were a mixed bag. The good? Customers loved it! The days of waiting six to twelve months to build a new network point of presence (POP), or to test and add a new device to a topology, were over. The bad? Operating highly available multitenant network infrastructure in software was hard, and the expectations for uptime, performance, scalability and new features were at times overwhelming.

We have since learned a lot about the technology stack we chose. We discovered architectural challenges that kept us from hitting the sweet spot of feature velocity, performance and resiliency. Upgrading our core software platforms has carried risk, and it has been difficult to do confidently without leaning on many mitigating tools for customers and our own operators.

Three Big Lessons 

We learned a few major lessons over the past two years, as we scaled Network Edge to dozens of metros and hundreds of customer deployments. 

1. Upgrades need to happen more frequently and without disruption 

We operate cloud infrastructure to deliver Network Edge services: specifically, an OpenStack cloud in each Network Edge-enabled market. The vendor technology stack we’ve been using makes updates more disruptive than we would like them to be.

One lesson we have taken to heart is that we need smaller, incremental and less impactful upgrades, made frequently, to keep the software stack current. Another is that we must own the most critical pieces of the technology stack we use to deliver Network Edge services. We have started the work to make it so, with the goal of getting there in the second half of 2023.

2. We can really use live migration 

Live migration would significantly reduce the service impact of our latest upgrade, as it would in many other areas of operations. Alas, our current infrastructure stack doesn’t support it. We are keen to upgrade our infrastructure and ultimately support live migration, both for our own operations teams and for customer use.
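
While Network Edge doesn’t expose this capability today, the sketch below shows what live migration looks like on a stock OpenStack cloud, using the openstacksdk Python library. The cloud name and server name are hypothetical placeholders, and the call assumes an environment where Nova live migration is enabled:

```python
# Illustrative sketch only: Network Edge's current stack does not support
# live migration. This shows the equivalent operation on a stock OpenStack
# cloud via openstacksdk. "my-cloud" and the server name are hypothetical.
import openstack

conn = openstack.connect(cloud="my-cloud")  # reads credentials from clouds.yaml

server = conn.compute.find_server("customer-vnf-primary")

# Ask Nova to move the running VM to another hypervisor without a reboot.
# host=None lets the scheduler pick the target; block_migration="auto"
# works whether or not the instance uses shared storage.
conn.compute.live_migrate_server(server, host=None, block_migration="auto")

# Block until the migration completes and the server is ACTIVE again.
conn.compute.wait_for_server(server, status="ACTIVE")
```

With support like this in place, an operator could drain a hypervisor before patching it, with no reboot visible to the tenant.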

3. Local redundancy is insufficient 

Despite our best efforts working with our technology vendor partner, we have been unable to perform upgrades without taking multiple servers down simultaneously. This means that customers whose primary and secondary instances happen to sit on a set of servers being upgraded at the same time experience traffic disruption. Upgrading one server at a time would have substantially widened an already extended maintenance window. The better solution is for each customer to keep their primary instance and its backup running in different geographic locations, which ensures there is always a second instance running in a data center that is not being upgraded.
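
To make the distinction concrete, here is a minimal, self-contained sketch (all names hypothetical) of the placement rule this lesson implies: a pair that merely sits on different servers in the same metro can still be caught in one maintenance event, while a metro-diverse pair cannot:

```python
# Hypothetical sketch of the placement rule: a device pair survives a
# site-wide maintenance only if its two instances run in different metros,
# not merely on different servers within the same metro.
from dataclasses import dataclass

@dataclass
class Device:
    name: str
    metro: str   # e.g., "SV" (Silicon Valley), "DC" (Ashburn)
    server: str  # hypervisor the instance runs on

def survives_metro_maintenance(primary: Device, secondary: Device) -> bool:
    """True if at least one instance stays up while an entire metro is upgraded."""
    return primary.metro != secondary.metro

# Local redundancy: different servers, same metro -> both can go down at once.
local = (Device("fw-a", "SV", "host-01"), Device("fw-b", "SV", "host-07"))
# Geographic redundancy: different metros -> one side always stays up.
geo = (Device("fw-a", "SV", "host-01"), Device("fw-b", "DC", "host-22"))

print(survives_metro_maintenance(*local))  # False
print(survives_metro_maintenance(*geo))    # True
```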

The Plan Going Forward 

As we work toward owning the technology stack, we must continue the current upgrade to ensure business continuity. 

Given our global footprint and lengthy upgrades, we have thought carefully about timing them to minimize customer impact. But our customers are highly diverse and often have conflicting requirements, so it’s been difficult to schedule upgrades in a way that accommodates everybody. (Some, for example, prefer upgrades on weekends, while for others, such as retailers, weekends are prime business hours.)

We understand and appreciate that many Network Edge customers route critical business data through the platform, and that even short periods of disruption to critical business applications can translate into material revenue loss, or even life-safety issues. Therefore, as part of these extensive upgrades, we recommend that customers who have not already implemented geographic redundancy do so now. Fortunately, with virtual infrastructure like Network Edge, implementing geo-redundancy is straightforward: deploy additional devices in a second metro, connect them with our Device Link Groups and proactively reroute traffic before and after the maintenance.
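
As a rough sketch of what those steps could look like when automated, the Python example below drives a hypothetical pair of REST calls. The endpoint paths, payload fields and token handling here are illustrative assumptions, not the exact Network Edge API contract; consult the documentation page linked at the end of this post for the real interface:

```python
# Illustrative sketch only: endpoint paths, payload fields and the token
# below are assumptions, not the exact Network Edge API contract.
import requests

API = "https://api.equinix.com"  # Equinix API gateway
HEADERS = {"Authorization": "Bearer <access-token>"}  # obtained via OAuth2

def create_device(metro_code: str, name: str) -> str:
    """Create a virtual device in the given metro; returns its UUID."""
    payload = {
        "metroCode": metro_code,       # e.g., "SV" or "DC"
        "deviceTypeCode": "CSR1000V",  # hypothetical VNF type from the marketplace
        "virtualDeviceName": name,
    }
    resp = requests.post(f"{API}/ne/v1/devices", json=payload, headers=HEADERS)
    resp.raise_for_status()
    return resp.json()["uuid"]

# Deploy the backup in a different metro from the primary...
primary = create_device("SV", "edge-fw-primary")
backup = create_device("DC", "edge-fw-backup")

# ...then tie the two together with a Device Link Group so traffic can be
# rerouted to the backup before and after the maintenance window.
link = {
    "groupName": "edge-fw-geo-pair",
    "links": [{"deviceUuid": primary}, {"deviceUuid": backup}],
}
requests.post(f"{API}/ne/v1/links", json=link, headers=HEADERS).raise_for_status()
```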

To make it easier, we are providing geo-redundant devices and circuits free of charge for 30 days. These devices and their associated virtual circuits can be used to avoid disruption during the maintenance, and we’re confident that customers running business-critical applications will continue to operate them afterward as a best practice for managing risk in cloud networking infrastructure.

Now that we have a few years of experience under our belts running our virtual networking IaaS cloud, we have a much clearer picture of our original architecture’s limitations and what we must do to address them. Our top three priorities are to enable frequent, non-disruptive upgrades, to support live migration and to get all our users to set up geo-diverse backups. As we work on these goals, our feedback dashboard is open for you to share other ways we can make Network Edge work better for you and your use case!

You can find more details about the infrastructure upgrade on this Network Edge documentation page.

Published on 21 December 2022