A few years back, when I first started at Equinix Metal, I encountered a systems problem that appeared simple enough, at least on the surface: a subset of machines exhibited “flapping” failure behavior when provisioning. Provisioning the same host with identical traits (such as the operating system) several times in a row would fail more than half of the time. The problem wasn’t limited to machines of a certain type or to where they were hosted (e.g. certain data centers), and it didn’t correlate with firmware or other version-specific attributes of the machines. This caused more than a bit of head scratching.
As a detail-oriented person who can often be found deep inside technical rabbit holes together with Cheshire Cats, I commenced my tumble into wonderland by gathering some initial data points about the problem:
- The same host and OS could fail at different points in the provisioning process when repeatedly provisioned.
- A machine with identical specs in a different facility might not exhibit the same behavior.
- The failure points clustered around requests to external services, e.g. our metadata service.
- It seemed as though DNS lookups in general could intermittently fail regardless of where the request was destined for.
The team I lead at Equinix oversees the health of tens of thousands of servers in dozens of globally distributed data centers, and those numbers are growing fast. Dealing with such rapidly growing scale means spending a lot of time looking at trends (like failures!), relationships between systems, feature functionality that spans multiple engineering teams, microservices, and hardware and firmware sets, and the compounding compatibility concerns between all of those things.
Continuing to build features which delight our customers while provisioning physical hardware with a high degree of consistency (our target global average success rate is >=99% across upwards of 100,000 provisions each day) requires keeping an eye on these trends. As you might have guessed, unexplained failure trends are kinda my team’s kryptonite!
A Breadcrumb Trail to IP Anycast
The first major discovery in my investigation was a trend: when we hit the failure(s), most commonly the hosts were attempting to connect to endpoints that used an anycast IP advertised in each of our global data centers.
In case you’re unfamiliar with anycast (also known as “IP anycast” or “anycast routing”), it is an IP network addressing scheme that allows multiple servers to share the same IP address, allowing for various physical destination servers to be logically identified by a single IP. An IP anycast router decides which server to send a user request to based on a least-cost analysis that includes the number of hops required to reach a server, physical distance, transit cost and latency. IP anycast is associated with the core routing capabilities of the Border Gateway Protocol (BGP). The BGP anycast IP address, or prefix, is advertised from multiple locations. This route propagates across the internet, enabling BGP to advertise awareness of the shortest path to the advertised prefix and publish multiple secondary-source paths to the destination IP address. This ensures that a user’s data request goes to an anycast server that’s “relatively close” to the request’s origin.
Seeing an IP anycast-related provision failure trend was counterintuitive. These types of services are normally our most highly available and resilient. These IPs were advertised in every data center, so if the service or BGP advertisement in one facility went down, anycast would route to the next-closest endpoint. Something didn’t smell right…
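To make that expectation concrete, here is a minimal sketch of the failover behavior we were counting on. It is a toy model, not how BGP actually computes paths: the `pick_endpoint` function, the single numeric "cost" per facility, and the `REMOTE` site name are all assumptions made up for illustration.

```python
from typing import Optional


def pick_endpoint(costs: dict[str, int], advertising: set[str]) -> Optional[str]:
    """Return the lowest-cost facility that still advertises the anycast prefix.

    `costs` is a hypothetical per-facility path cost as seen from one client;
    real routers weigh many more attributes than a single number.
    """
    candidates = {site: cost for site, cost in costs.items() if site in advertising}
    if not candidates:
        return None  # nobody advertises the prefix, so the request has nowhere to go
    return min(candidates, key=candidates.get)


# Hypothetical costs from a host sitting in SV16.
costs_from_sv16 = {"SV16": 1, "SV15": 2, "REMOTE": 20}

# Everything healthy: the request stays in the building.
local = pick_endpoint(costs_from_sv16, {"SV16", "SV15", "REMOTE"})

# SV16 withdraws its advertisement: the request moves to the next-closest site.
failover = pick_endpoint(costs_from_sv16, {"SV15", "REMOTE"})

print(local, failover)
```

Under this model a withdrawn advertisement should be a non-event for the client, which is exactly why failures clustering around anycast endpoints looked so strange.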
We run services in this way because we want customer hosts / client connections in each facility to have a low-latency in-building connection point and avoid requiring requests to leave the facility (or traverse the public internet). One of the problems that can arise from these types of services is that you aren’t always certain which endpoint a request has gone to, unless you have built in a mechanism for responding with a hint. Care to guess if any of our anycast services had such hints to rely on when I started this journey?
Narrowing down the problem, I identified several instances where traffic would travel intermittently between facilities (we host Equinix Metal in multiple buildings in most metros). For example, when a host in our SV16 data center made a DNS request or a request for metadata, that traffic could be observed intermittently traveling to both our SV16 endpoint and our SV15 endpoint (in a building nearby). In other words, traffic that left the facility sometimes wouldn’t come back!
One of the unique challenges in provisioning bare metal servers at scale involves network addressing. In the not-so-distant future we aim to let customers bring and configure their own private overlapping IP space or network segment, e.g. VPC-style functionality. But today we control most of the private host addressing through routable, ACLed address blocks. We assign several private addresses to hosts for a number of reasons (security, isolation, seamless experiences, etc.) and one of those IPs is out of the 172.16.0.0/12 subnet, which is used during various provisioning lifecycle changes. This allows traffic destined for our microservices to originate from an IP assigned to the host from this block during certain machine states / events.
None of this is terribly interesting or magical, except that there was some lurking debt in there… (Cue dramatic music.)
- A decision was made (or not) more than five years ago at little tiny Packet (Equinix Metal in its adolescence) to not uniquely address the IPs assigned to servers out of this 172 block.
- This decision was (presumably) made because when that IP was used at the time it likely didn’t matter that the IP space overlapped.
- As our feature sets and customer needs grew over time, we built certain capabilities that began to rely on using this 172 address more broadly, and because we advertised the same anycast IPs in every facility, everything for the most part “just worked.”
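Why “just worked” came with an asterisk can be sketched with a deliberately simplified model. The assumption here (mine, for illustration) is that a reply addressed to a 172 address is delivered inside the serving facility whenever that facility has its own host with the same address. The function name and the sample addresses are hypothetical.

```python
def reply_reaches_client(
    client_site: str,
    client_ip: str,
    serving_site: str,
    assigned: dict[str, set[str]],
) -> bool:
    """Toy model: does the reply get back to the host that sent the request?"""
    if serving_site == client_site:
        # Request was answered in-building, so the overlap never matters.
        return True
    # Request was answered elsewhere. If that facility also has a host with
    # this address, the reply is assumed to stay there and never come back.
    return client_ip not in assigned[serving_site]


# Hypothetical assignments: both facilities started numbering from the same place.
assigned = {
    "SV16": {"172.16.0.10", "172.16.0.11", "172.16.0.12"},
    "SV15": {"172.16.0.10", "172.16.0.11"},
}

print(reply_reaches_client("SV16", "172.16.0.10", "SV16", assigned))  # answered locally
print(reply_reaches_client("SV16", "172.16.0.10", "SV15", assigned))  # overlapping address
print(reply_reaches_client("SV16", "172.16.0.12", "SV15", assigned))  # address unique to SV16
```

As long as anycast keeps every request in the building, the first case is the only one that ever happens, and the overlap stays invisible.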
I was now facing the enemy: a half-decade-old oversight, a highly repeated pattern of not carving up and assigning unique IPs out of one of our private subnets, which manifested as failed provisions only when anycast traffic got “drawn out” of the facility. Let’s face it, IP anycast was doing what it was supposed to do! A hilariously annoying aspect of pinning this problem down was that our data centers vary in size. As we enumerated further into the IP block(s), the larger facilities ended up with genuinely unique 172 addresses, since most of the other facilities never got that high in the subnet simply because they had fewer servers. So in larger facilities you couldn’t “reproduce” this problem.
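The size asymmetry is easy to see in a small sketch. It assumes every facility hands out addresses sequentially from the start of the same 172.16.0.0/12 block, which is a simplification of whatever our allocator actually did, and the facility names and server counts are invented.

```python
import ipaddress
from itertools import islice

BLOCK = ipaddress.ip_network("172.16.0.0/12")


def allocate(count: int) -> set[ipaddress.IPv4Address]:
    """Hand out the first `count` host addresses of the shared block."""
    return set(islice(BLOCK.hosts(), count))


# Hypothetical facilities, each numbering its servers from the same starting point.
facilities = {"small": allocate(50), "large": allocate(400)}


def unique_addresses(name: str) -> set[ipaddress.IPv4Address]:
    """Addresses in `name` that no other facility has also assigned."""
    others = set().union(
        *(addrs for site, addrs in facilities.items() if site != name)
    )
    return facilities[name] - others


for name in facilities:
    print(name, len(facilities[name]), "assigned,", len(unique_addresses(name)), "unique")
```

In this model every address in the small facility collides with one in the large facility, while most of the large facility's addresses collide with nothing, so a test host picked there will usually behave perfectly.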
The answer was obvious: time to re-IP all of the servers.
Easy, right? Okay…deep breaths. Count to ten. Pause.
I genuinely love what I do and the problem space we work in. The culture we’ve created around a passion for fundamental infrastructure automation, first at Packet and now Equinix, is pretty darn exciting. That said, sometimes I want to fly to one of our data centers and punch some silly servers in their silly blinking faceplates. That’s what I was feeling when I realized we would need to re-IP thousands of servers, in production, with customers actively using them. Frankly, we had to re-IP every machine in our fleet: not only was the problem causing failures for our customers, it meant we weren’t taking full advantage of our IP anycast architecture, which we still believe is a solid and reliable approach for many of our distributed microservices.
Let’s skip past the boring administrative bits about analyzing all our existing usage out of that subnet, sizing the subnets per facility, carving up all the blocks and getting them assigned. (Isolating parts of the problem to tackle also included considering when hosts would be using this IP, which was very tedious!) Luckily, a provisioned host in a customer’s account was assigned the IP in our systems but it wasn’t actually present on the switch, because these specific IPs are unused after a host is provisioned. This meant that any customer server could have the “old 172” IP we assigned out of this block updated to the “new unique 172” via our existing APIs. It was a logical change in our systems only: nothing needed to get pushed out to switches or other ancillary systems. Whew!
Some of the scenarios where this IP was commonly used were during “deprovisioning” and when a host sat in a “provisionable” state, waiting for action. Neither of these states is easily visible to customers, and various private IP schemes are used to ensure that when moving a host through these necessary parts of the lifecycle the origin addressing is tightly controlled. This meant that we could push an update to our backend systems with the new IP and, upon triggering a state transition on the host, our software automation would update all the respective dependencies (the requisite switches, routing tables, DHCP, etc.). After fully cycling the host back into its pre-reassignment state, it would have its new and shiny unique IP!
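For the curious, the sizing and carving step can be sketched in a few lines with Python's standard `ipaddress` module. This is not the tooling we used: the `carve` and `prefix_for` helpers, the largest-first strategy, and the server counts are all assumptions for illustration, and a real plan would leave headroom for growth.

```python
import ipaddress

BLOCK = ipaddress.ip_network("172.16.0.0/12")


def prefix_for(hosts_needed: int) -> int:
    """Smallest prefix length whose subnet fits the hosts plus network/broadcast."""
    size = hosts_needed + 2
    return 32 - (size - 1).bit_length()


def carve(server_counts: dict[str, int]) -> dict[str, ipaddress.IPv4Network]:
    """Give each facility its own non-overlapping subnet out of BLOCK."""
    plan = {}
    cursor = int(BLOCK.network_address)
    # Allocating largest first keeps every subnet aligned to its own size.
    for site, count in sorted(server_counts.items(), key=lambda kv: -kv[1]):
        subnet = ipaddress.ip_network((cursor, prefix_for(count)))
        if not subnet.subnet_of(BLOCK):
            raise ValueError(f"{BLOCK} is too small for this plan")
        plan[site] = subnet
        cursor += subnet.num_addresses
    return plan


# Hypothetical server counts per facility.
plan = carve({"SV15": 50, "SV16": 400})
for site, subnet in plan.items():
    print(site, subnet)
```

With these made-up counts, SV16 gets 172.16.0.0/23 and SV15 gets 172.16.2.0/26, so no address can ever be assigned in both buildings.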
If you’re unfamiliar with bare metal provisioning, this exercise is like a root canal—but a root canal we were able to perform safely for our customers and across our fleet. Now, when we pull the BGP advertisements for our IP anycast services in a facility, they reroute properly from all of our addresses and provide the resilient and highly available behavior you’d expect.
While I may sound semi-intelligent on the topic of networking, I’d be remiss not to mention how much of the identification and resolution of this problem I owe to our amazing Network Operations and Network Architecture teams, specifically Shelby Lindsey and Luca Salvatore, who assisted me with basically everything I wrote above. (The word “assisted” doesn’t quite convey the extent of their help, since they manage our production network infrastructure and were the ones who did the lift around leaking all our new routes, getting them propagated across our backbone, etc.)
For more information, read how other companies, like Cloudflare, use IP anycast.