I have the privilege of managing a very talented team of engineers at Packet (now Equinix Metal) who have a big responsibility: create and maintain a platform that can provision thousands of servers, each with their own unique combination of facility, hardware manufacturer, server configuration, processor architecture, network configuration, operating system, user data, and more.
Every single one of these variables is a potential point of failure, sitting on top of the messy world of physical servers - in which drives die, firmware versions conflict, and technical gremlins thrive. Throw in global datacenters and distributed networks, and it’s a good thing that our passion is automating fundamental infrastructure.
Putting a Stake in the Ground
Since our founding about three years ago, the goal has always been that 100% of attempted provisions succeed. An API call starts the process, and somewhere between 5 and 10 minutes later, out pop however many instances were requested, each with its own special snowflake elements.
For the last year, we hovered somewhere between a 93% and 97% success ratio. With expanded features, lots of new operating systems, an ARMv8 system, and more complex networking configurations, we got as low as 89%. Plus, certain very common setups (Ubuntu 16.04 on Type 0, for example) were skewing the data in our favor. The reliability of installs for Windows 2012 R2 on Type 3s (which have NVMe drives and other variables), for instance, was nowhere near where we needed it to be.
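To make the skew concrete, here is a minimal sketch of how an aggregate success ratio can hide a weak configuration. The record layout, the configuration names, and every count below are invented for illustration; none of them are real Packet figures.

```python
from collections import defaultdict

# Hypothetical provision records: (operating system, server type, succeeded).
# The counts are made up purely to illustrate the skew.
records = (
    [("ubuntu_16_04", "type_0", True)] * 940
    + [("ubuntu_16_04", "type_0", False)] * 10
    + [("windows_2012_r2", "type_3", True)] * 30
    + [("windows_2012_r2", "type_3", False)] * 20
)


def success_rates(rows):
    """Return (overall_rate, {(os, server_type): rate}) for provision rows."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for os_name, server_type, ok in rows:
        key = (os_name, server_type)
        totals[key] += 1
        successes[key] += int(ok)
    overall = sum(successes.values()) / sum(totals.values())
    per_config = {key: successes[key] / totals[key] for key in totals}
    return overall, per_config


overall, per_config = success_rates(records)
print(f"overall: {overall:.1%}")
# List the weakest configurations first.
for (os_name, server_type), rate in sorted(per_config.items(), key=lambda kv: kv[1]):
    print(f"{os_name} on {server_type}: {rate:.1%}")
```

With these made-up numbers the overall ratio looks healthy while one configuration fails far more often, which is the reason to track success per configuration rather than only in aggregate.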
Add to that the ever-growing bucket of important features and technical debt that needs to be addressed with the expectation that everything should “just work” all the time with no downtime - and we had our work cut out. We needed to ensure that failed provisions wouldn’t break our promise to customers.
So in January, we made it a goal: get to 100% successful provisions for one day. Here’s how we did it.
Adding Color to Failure
Provisioning tens of thousands of bare metal servers each month is a tricky task. Where we get into trouble is in dealing with things that are “nondeterministic”. For instance, you provision a machine and it starts going through the required boot-up processes: bringing up the NICs, bringing up the network, etc.
All of a sudden the RAM, which was working perfectly fine on the last install, is now not working. What gives? Something, of course, is wrong! But what the user sees is: failure.
The simple truth is that there is a wide variety of elements in flux: hardware goes bad, drives start to fail, RAM becomes “unfluid”, network switches go down, an Ethernet cable wiggles out after a year in service, and ports go down.
Add in "userdata" - data that gets executed on first boot to set up the system the way the user wants it - and there's a hornet's nest of things that can go wrong, any of which can negatively impact a provision.
Peeking Under the Hood
Despite early flirtations with OpenStack as the basis of our provisioning system (read all about our failures here), we found that the only way to control our success and velocity was to isolate each step of the process in the famous "Linux mindset." So we designed and built a series of microservices (mainly in Go) that do one thing, and do that one thing well.
Here’s what we have today:
- PBNJ - Reboots machines, sets boot order, and performs BMC/BIOS related tasks.
- Tinkerbell - Handles DHCP requests, hands out IPs, serves up iPXE.
- OSIE - Installs all of our operating systems and handles deprovisioning.
- Narwhal - Configures all switches with proper ACLs, DHCP groups, BGP, Layer 2/3.
- Pensieve - Sets up reverse DNS for all instances.
- Soren - Bandwidth billing.
- Doorman - Provides VPN to customers.
- Kant - Our EC2-style metadata service.
- SOS - Provides out-of-band access to instances, if public connectivity is down.
- Magnum IP - Acts as our authoritative IPAM, hands out IP reservations and assignments.
- PacketBot - Our version of chaos monkey, which provisions all OSes, in all facilities, all day, every day.
This microservices-based approach allows our team to control each aspect of the provisioning service, which makes changes and updates faster and easier. The best aspect is that each service does only what it needs to do, follows its assigned role, and doesn’t introduce additional complications into the provisioning process.
And yet, even with this framework in place, we were struggling to reach our goal.
Architecting for 100%
Our customer success and engineering teams look at literally every failed install to determine the cause (it’s also a nice mind-numbing task in case you ever need one!). So we had plenty of clues about what was causing failures. We found that in most cases, one of these three major things was the culprit:
- User Data
We’ve all done it: a stray space or missing character in our cloud-init results in a failure to successfully provision. This was a pretty common cause of install failures, which - while not technically our fault - was frustrating for users who didn't know whose fault it was and had no ability to troubleshoot. We also had a lot of long-running user data which, while valid, blew past our ten-minute-install threshold.
To help solve these issues, we added two phone-home processes - one based on cloud-init, and a simpler shell-based backup. This ensures that even if the user passes us bad user data, the device will still phone home and the user will be able to troubleshoot.
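The redundancy idea can be sketched roughly as follows. This is not our actual implementation (the real mechanisms live in cloud-init and a shell script); it only models "try the primary reporter, fall back to the simpler one". The function names and the payload fields are hypothetical.

```python
def phone_home(reporters, payload):
    """Try each (name, callable) reporter in order.

    Returns the name of the first reporter that succeeds, or None if all fail.
    """
    for name, report in reporters:
        try:
            report(payload)
            return name
        except Exception as exc:  # a broken reporter must not block the next one
            print(f"{name} failed: {exc}")
    return None


# Hypothetical reporters standing in for the two real mechanisms.
def cloud_init_reporter(payload):
    # Simulates cloud-init aborting because the user data was malformed.
    raise RuntimeError("cloud-init could not parse user data")


delivered = []


def shell_reporter(payload):
    # Simulates the simpler shell-based backup, which does not depend on cloud-init.
    delivered.append(payload)


used = phone_home(
    [("cloud-init", cloud_init_reporter), ("shell-fallback", shell_reporter)],
    {"device": "example-device", "state": "booted"},
)
print(used)
```

The point of the design is that the fallback has as few dependencies as possible, so a mistake in user data cannot take both paths down at once.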
- Hardware
Our PacketBot does a lovely job of testing the provisioning and deprovisioning of every piece of hardware we have. During deprovisioning, we go through a series of checks to ensure that the hardware is in the best possible state before we attempt the next provision.
This means that the drives themselves have been completely wiped and that there are no known health issues reported from the CPU, RAM, NICs, disks, or NVMe drives. It also means that we return the hardware to a trusted state - checking that firmware and BIOS settings are as we expect.
If any part of this process doesn’t pass muster, then the hardware is marked as problematic and we put it aside for manual investigation by our operations team.
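As a rough illustration of that gate, the sketch below collects the checks described above into one decision. The report fields, the state names, and the expected firmware and BIOS values are all assumptions made for the example, not our real schema.

```python
from dataclasses import dataclass


@dataclass
class DeprovisionReport:
    # Hypothetical summary of what the deprovision checks found.
    disks_wiped: bool
    component_health: dict  # e.g. {"cpu": True, "ram": False}
    firmware_version: str
    bios_settings: dict


def find_problems(report, expected_firmware, expected_bios):
    """Return a list of problems; an empty list means the hardware is trusted."""
    problems = []
    if not report.disks_wiped:
        problems.append("disks not wiped")
    for component, healthy in sorted(report.component_health.items()):
        if not healthy:
            problems.append(f"{component} reports a health issue")
    if report.firmware_version != expected_firmware:
        problems.append("unexpected firmware version")
    for setting, expected in sorted(expected_bios.items()):
        if report.bios_settings.get(setting) != expected:
            problems.append(f"BIOS setting {setting} is not as expected")
    return problems


def next_state(problems):
    """Only hardware with no problems goes back into the provisionable pool."""
    return "provisionable" if not problems else "needs_manual_investigation"


report = DeprovisionReport(
    disks_wiped=True,
    component_health={"cpu": True, "ram": False, "nic": True},
    firmware_version="1.2.3",
    bios_settings={"boot_mode": "uefi"},
)
problems = find_problems(report, expected_firmware="1.2.3", expected_bios={"boot_mode": "uefi"})
print(next_state(problems), problems)
```

Returning the full list of problems, rather than stopping at the first one, gives whoever investigates the machine the complete picture in a single pass.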
- DNS
One of our favorite haikus is:
It's not DNS
There's no way it's DNS
It was DNS
When we originally built our platform, we relied heavily on DNS at every part of the provisioning process. DNS is always reliable, right? Not so fast. Between intermittent network issues, timeouts, and oddball lookup failures - a single DNS failure meant the entire provisioning process would fail.
We implemented a few things to resolve this issue: caching DNS entries, relying on asynchronous lookups, and setting up facility-local, dedicated DNS servers. Since implementing this, we've seen no failed provisions due to bad DNS.
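A minimal sketch of the caching part of that fix is below. It assumes a simple policy of "serve the last known answer if a fresh lookup fails"; the class name, TTL, and hostname are illustrative, and the asynchronous lookups and facility-local servers are not modeled here.

```python
import socket
import time


class CachingResolver:
    """Cache lookups and fall back to a stale entry when resolution fails."""

    def __init__(self, lookup=socket.gethostbyname, ttl_seconds=60.0, clock=time.monotonic):
        self._lookup = lookup
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache = {}  # hostname -> (address, time stored)

    def resolve(self, hostname):
        now = self._clock()
        cached = self._cache.get(hostname)
        if cached and now - cached[1] < self._ttl:
            return cached[0]  # fresh cache hit, no lookup needed
        try:
            address = self._lookup(hostname)
        except OSError:  # socket.gaierror is a subclass of OSError
            if cached:
                return cached[0]  # stale answer beats a failed provision
            raise
        self._cache[hostname] = (address, now)
        return address


# Demo with a fake lookup so the example runs without touching the network.
answers = iter(["192.0.2.10"])


def flaky_lookup(hostname):
    try:
        return next(answers)
    except StopIteration:
        raise OSError("simulated lookup failure")


fake_now = [0.0]
resolver = CachingResolver(lookup=flaky_lookup, ttl_seconds=60.0, clock=lambda: fake_now[0])
first = resolver.resolve("api.example.internal")
fake_now[0] = 120.0  # the cache entry has expired and the next lookup fails
second = resolver.resolve("api.example.internal")
print(first, second)
```

Serving a stale address is a trade-off: it is only reasonable for names whose addresses change rarely, which is an assumption of this sketch.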
Of course, this doesn't cover everything we've done. When you're dealing with a globally distributed system that relies heavily on a lot of moving parts, there are many things that can go wrong at any time. As our own Ed Vielmetti observed:
Among the things that we've had to deal with and fix are:
DHCP - When a request to provision comes in, one of the first things it does is DHCP. This makes a call to our API to look up various things about the device: IP address details (of course), device state, CPU architecture, operating system, etc. If the API is down or too slow, DHCP fails and the provision can't continue. We've heavily optimized these calls to speed the DHCP process.
Network - We run an extremely reliable, global network, and when you provision into our Tokyo datacenter, we need to communicate certain details back to our headquarters in Newark, NJ. Undersea cables, intermittent latency, and peering issues all contribute to occasionally lost packets or reset connections. To combat this, we've pushed as much data and core logic as we can to our remote datacenters to reduce the number of real-time dependencies on being able to communicate back to HQ.
Operating System Idiosyncrasies - Just when you think you can predict how an operating system will behave...disk devices start to show up unordered, teaming can't obtain a MAC address reliably and networking breaks, grub mysteriously can't install to the expected device, and you discover that older operating systems rely on features that don't exist in newer kernels.
All of these things cause us grief and consternation occasionally, which is why we publish and consume copious amounts of links and gifs in our #random Slack channel. There's no silver bullet to fix these - it just requires patience and slow, methodical work.
Also whiskey. (I prefer Monkey Shoulder, and am currently accepting donations.)
CI for our OSes
Needless to say, it doesn’t take much for an install to go haywire.
Small changes in hardware or our operating system images can collide and send us down some seriously long rabbit holes. To avoid that happening in production, we've set up Drone, which helps us run tests on our operating system install environment for each of our configurations and provisioning features.
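To give a feel for how quickly "each of our configurations and provisioning features" multiplies, here is a small sketch that enumerates a test matrix. The lists are hypothetical and far shorter than a real catalog, and this is not a Drone configuration, just an illustration of the combinations a CI job would need to cover.

```python
from itertools import product

# Hypothetical, heavily shortened lists; a real catalog is much larger.
OPERATING_SYSTEMS = ["ubuntu_16_04", "windows_2012_r2"]
SERVER_TYPES = ["type_0", "type_3"]
FEATURES = ["default", "custom_user_data"]


def build_matrix(operating_systems, server_types, features):
    """Return one test case per (os, server type, feature) combination."""
    return [
        {"os": os_name, "server_type": server_type, "feature": feature}
        for os_name, server_type, feature in product(operating_systems, server_types, features)
    ]


matrix = build_matrix(OPERATING_SYSTEMS, SERVER_TYPES, FEATURES)
print(len(matrix), "install tests")
for case in matrix:
    print(case)
```

Every new operating system or server type multiplies the matrix rather than adding to it, which is why running these installs automatically matters.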
Roundup and What’s Next
With all of this work, we hit our first day of 100% success on January 16th. Since then we’ve hit it another 7 times, and our overall success ratio is averaging 98%.
The difference between 95% and 98% might not seem like a lot, but when you're installing tens of thousands of servers, you're talking about eliminating hundreds of provisioning failures that all contribute to a bad user experience.
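As a quick back-of-the-envelope check, assume a hypothetical month with 20,000 provisions (the real monthly volume is not stated beyond "tens of thousands"):

```python
def failed_provisions(provisions, success_percent):
    """Number of failed provisions at a given whole-number success percentage."""
    return provisions * (100 - success_percent) // 100


monthly_provisions = 20_000  # assumed volume, for illustration only
failures_at_95 = failed_provisions(monthly_provisions, 95)
failures_at_98 = failed_provisions(monthly_provisions, 98)
avoided = failures_at_95 - failures_at_98
print(failures_at_95, failures_at_98, avoided)
```

At that assumed volume, moving from 95% to 98% takes failures from 1,000 down to 400, so 600 fewer customers hit a broken install each month.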
What’s also clear is that this effort is never “done.” As our footprint continues to grow in size, hardware variety, locations, customers, and operating systems, we will simply continue to do everything we can to ensure our provisioning process remains consistent and reproducible.
And you can be sure we'll celebrate those days with zero failed installs!