Inside Our Network: Maintenance Events
Preventive maintenance on a global network isn’t all that different from the maintenance you probably do for your car. If deprioritized for too long, issues can start to compound.
Preventive maintenance on a network isn’t all that different from the preventive maintenance you might do on your car: it keeps a complex operation running smoothly and at optimal performance. If deprioritized for too long, issues can compound and cause unplanned impact and short-notice maintenance, which can rack up quite a “bill” in both work and downtime. Just like car maintenance, network maintenance is better done when you have planned for it than when you’re running late for an appointment.
While there are many similarities, there is one big difference between the two: maintaining a network is like doing your vehicle’s maintenance while driving it down the road! As much as possible, networks are built with enough redundancy to route around devices that are taken offline for maintenance. When bonded ports uplink to pairs of top-of-rack switches (as they do at Equinix Metal), this redundancy can extend all the way down to the server.
If you wait for an incident to take care of regular maintenance, not only are you working on a car while you’re driving it, but the car is also likely on fire!
Under the Hood of Our Network
At Equinix Metal, we operate a network architecture that follows common service provider best practices. We have multiple “Backbone” or “Edge” routers on which we terminate transit, peering, and transport connections. We trunk those connections down to “Core” or “Spine” routers that help us aggregate traffic, then ultimately break that all the way down to our “Edge” or “Top of Rack” devices, which bring dual 10G or 25G links to each server’s NICs.
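To make the redundancy idea concrete, here is a minimal sketch of the kind of check you might run before taking a device offline: it models the three tiers as a small graph and asks whether every server can still reach a backbone router once one device is drained. The device names, the topology, and the function names are all invented for illustration; this is not a description of Equinix Metal’s actual tooling.

```python
from collections import deque

# Hypothetical device names; each entry lists that device's upstream neighbors.
UPLINKS = {
    "server-1": ["tor-a", "tor-b"],   # dual links to a top-of-rack pair
    "server-2": ["tor-a", "tor-b"],
    "tor-a": ["spine-1", "spine-2"],
    "tor-b": ["spine-1", "spine-2"],
    "spine-1": ["bsr-1", "bsr-2"],
    "spine-2": ["bsr-1", "bsr-2"],
}
BACKBONE = {"bsr-1", "bsr-2"}
SERVERS = ["server-1", "server-2"]


def reaches_backbone(uplinks, backbone, start, drained):
    """Breadth-first search upward from `start`, skipping drained devices."""
    queue, seen = deque([start]), {start}
    while queue:
        node = queue.popleft()
        if node in backbone:
            return True
        for up in uplinks.get(node, []):
            if up not in drained and up not in seen:
                seen.add(up)
                queue.append(up)
    return False


def safe_to_drain(uplinks, backbone, servers, device):
    """True if every server still has a path to the backbone without `device`."""
    return all(
        reaches_backbone(uplinks, backbone, server, {device})
        for server in servers
    )


print(safe_to_drain(UPLINKS, BACKBONE, SERVERS, "tor-a"))  # True
```

In this toy model, a server with only one uplink would make its top-of-rack switch unsafe to drain, which is the situation the paired, bonded design avoids.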
Within the Network Operations team, we manage a few hardware vendors, multiple platforms, several operating systems, and a number of varying configurations. To keep things running in tip-top shape, we find regular maintenance works best.
That Maintenance Cadence
On the third Sunday morning of every month, we roll up our sleeves and take on the work that keeps our network in fighting shape. We follow a few key guidelines that determine what gets selected:
- We choose maintenance work that is planned to be non-disruptive. Even though we use a weekend day, we know that daylight business hours are critical for many of our customers. Disruptive maintenance is typically planned for overnight slots.
- We set goals on a quarterly basis. For the second quarter of 2021, we worked through standardizing the software running all of our Backbone Switch/Routers (BSRs) to our gold standard version. On June 20th, we upgraded the last of these paired devices in our Atlanta metro.
- We leave ourselves open to other opportunities as they arise. One of the many great things about building new facilities in Equinix IBXs is that we can shift traffic to more optimal routes, upgrade bandwidth to 100G connections, and more. These changes make our network more scalable and reliable.
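If you want to line your own plans up against a “third Sunday of the month” cadence like the one described above, the date is easy to compute. The sketch below uses only the Python standard library; the function name is our own choice for illustration.

```python
from datetime import date


def third_sunday(year: int, month: int) -> date:
    """Return the date of the third Sunday in the given month."""
    # date.weekday() numbers Monday as 0 and Sunday as 6.
    first_weekday = date(year, month, 1).weekday()
    # Days from the 1st to the first Sunday (0 if the 1st is a Sunday).
    offset = (6 - first_weekday) % 7
    # The third Sunday falls two weeks after the first one.
    return date(year, month, 1 + offset + 14)


print(third_sunday(2021, 6))  # 2021-06-20
```

For June 2021 this gives June 20th, the date of the Atlanta upgrade mentioned in the list above.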
Our Customers Want to Know “How”
The biggest lesson we’ve learned from the last year of running a monthly maintenance window isn’t that it helped us ensure network uptime - we figured that would be the case.
What we didn’t expect is how interested customers would be in the details of the maintenance activity we perform. In the same way you would ask a mechanic how they filled the holes in your tires, cleaned the carburetor, or got that annoying “check engine” light to turn off, our customers are curious about what maintenance we are doing and how we are doing it. While we still keep our notices for these maintenance windows quite general, we recently added a line that calls out the scope of the maintenance activity, so customers are aware of the specific metros in which we’re making these adjustments. Here’s a sample from before we made the addition:
“This includes backbone level route adjustments, non-impacting network infrastructure upgrades, and capacity enhancements. No customer impact is expected, although you may experience route changes or adjustments.”
As time goes on, we remain open to feedback on what information is helpful to our customers as they make their own business and maintenance plans.