
Considerations for a Highly Available Control Plane on Self-Managed Kubernetes

In this article, we look at how to make the control plane of your self-managed Kubernetes cluster highly available using kube-vip and BGP.


Running your own Kubernetes cluster on Equinix Metal can be a fun but challenging endeavour. The first step to getting a Kubernetes cluster online is building a strong foundation with your control plane. You want your control plane to be highly available and resilient to failures. Best practices within the Cloud Native world encourage us to use Infrastructure as Code (IaC) and automate any manual processes, but this presents us with a chicken and egg problem.

The Problem

Running a single-node control plane is rather trivial: you can run kubeadm init and you’re good to go. No further configuration is required because everything the Kubernetes control plane needs is available over localhost, so there’s no "discovery" involved. Unfortunately, for a highly available control plane, each of our control plane nodes needs to be able to communicate with the others. The reason is that each node runs an etcd member, and running etcd in a clustered, highly available setup requires consensus via Raft. This means that each etcd member must be able to communicate directly with every other member, which in turn means that each one needs to know the IP address of every other node in the cluster.
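To make that requirement concrete, here is a rough sketch of what statically bootstrapping an etcd member looks like; the names and addresses are purely illustrative, but the flags are standard etcd ones. Every peer URL has to be known before the first member starts:

# Illustrative only: static etcd bootstrapping needs every peer's IP up front.
etcd --name node-1 \
  --initial-advertise-peer-urls http://10.0.0.10:2380 \
  --listen-peer-urls http://10.0.0.10:2380 \
  --initial-cluster "node-1=http://10.0.0.10:2380,node-2=http://10.0.0.11:2380,node-3=http://10.0.0.12:2380" \
  --initial-cluster-state new

kubeadm hides most of this from us, but it still needs to know where the existing control plane lives before a new member can join.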

So if we’re adopting best practice and using IaC to create and bootstrap these nodes as a single atomic action, then the IPs won’t be known until execution time; more concretely, we can’t load the three IP addresses of the nodes into the user data of every other node.

Fortunately, etcd and the kube-apiserver can handle this for us if we solve one simple thing: provide a single endpoint that resolves to the API server.
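In kubeadm terms, that single endpoint becomes the cluster’s control plane endpoint. As a minimal sketch (VIP_ADDRESS is a placeholder for the shared address we’ll reserve and advertise later in this article), initialising the first node could look like this:

# VIP_ADDRESS is a placeholder for the address we'll advertise over BGP.
kubeadm init \
  --control-plane-endpoint "VIP_ADDRESS:6443" \
  --upload-certs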

So what can we do?

Using Border Gateway Protocol (BGP), we can advertise the same IP address from each of our control plane nodes and allow the API server and etcd to do their jobs.

Equinix Metal offers BGP services per project, so we can use our IaC to request a global IP address and advertise it from any device within the project. We can use the open-source kube-vip project to handle the BGP advertisements for us.

Let’s take a look.

Note: We’ll use Terraform for any IaC and cloud-init for the user-data / shell scripts. You can adapt and use whatever methods you wish, but these examples should serve as a good base.

Enabling BGP

The first thing we need to do is enable BGP and request a global IP.

resource "equinix_metal_project" "our_project" {
  name = "our-project"
  bgp_config {
    deployment_type = "local"
    asn = 65000
  }
}

The bgp_config requires a deployment_type, which can be "local" or "global". We’re using "local" because we’re keeping everything inside Equinix Metal. If your needs are more bespoke (bring your own IP or ASN), you can discuss your global BGP requirements with the Equinix Metal Solution Engineers. You also need to set the asn to 65000. This value is essentially fixed, but it could change in the future, so check the Terraform docs as required.

Next we need to request an IP to use for the BGP advertisement.

resource "equinix_metal_reserved_ip_block" "bgp_ip" {
  project_id = equinix_metal_project.our_project.id
  type = "global_ipv4"
  quantity = 1
}
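The reserved block exposes its addressing as attributes, so we can surface the address for later steps, for example to template it into user data. A minimal sketch, taking the first (and only) address of the block; the output name is just illustrative:

output "bgp_vip" {
  # The global block we reserved contains a single address.
  value = split("/", equinix_metal_reserved_ip_block.bgp_ip.cidr_notation)[0]
}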

Lastly, we need to enable the BGP session on our device.

resource "equinix_metal_bgp_session" "bgp_session" {
  device_id = equinix_metal_device.our_device.id
  address_family = "ipv4"
}
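The bgp_session above references a device we haven’t shown yet. As a rough sketch, assuming placeholder values for the hostname, plan, metro and operating system, and a cloud-config.yaml that carries the shell snippets from the next section, a control plane device might look like this:

resource "equinix_metal_device" "our_device" {
  hostname = "control-plane-1"
  project_id = equinix_metal_project.our_project.id
  plan = "c3.small.x86"
  metro = "am"
  operating_system = "ubuntu_22_04"
  billing_cycle = "hourly"
  # The cloud-config built in the next section is passed in as user data.
  user_data = file("cloud-config.yaml")
}

For a three-node control plane you would repeat this (or use count / for_each), giving each device its own BGP session.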

Preparing the Devices

To send the BGP advertisements, we need to ensure the device is prepared. What do I mean by prepared? Well, we can use the metadata API to confirm that the BGP configuration is enabled and to add the routes needed to reach the BGP peers within the metro. This just takes a couple of commands within your cloud-config.

First, let’s fetch and store the metadata response. When using IaC, there’s a small chicken and egg problem with the BGP enablement on the project: sometimes your device can spin up before BGP is fully enabled. To handle that, do a quick loop:

# Fetch the device metadata, then re-fetch until the BGP peering details appear.
curl -o /tmp/metadata.json -fsSL https://metadata.platformequinix.com/metadata

until jq -r -e ".bgp_neighbors" /tmp/metadata.json
do
  sleep 2
  curl -o /tmp/metadata.json -fsSL https://metadata.platformequinix.com/metadata
done

Next, we want to ensure that the routes are correct to communicate with the BGP peers. We can grab the peer and gateway information from the downloaded metadata.

# Route traffic to each BGP peer via the private network gateway.
GATEWAY_IP=$(jq -r ".network.addresses[] | select(.public == false) | .gateway" /tmp/metadata.json)

for i in $(jq -r '.bgp_neighbors[0].peer_ips[]' /tmp/metadata.json); do
  ip route add $i via $GATEWAY_IP
done

BGP Advertisements

Now that everything is in place, we can ask kube-vip to start advertising our BGP IP.

We’ll be using Kubernetes static manifests to run kube-vip as part of our Kubernetes control plane. Kube-vip, as one could infer from its name, is intended to run in this fashion and provides convenience functions for us to do this easily.

So let’s pull down the image and generate our static manifest.

ctr image pull ghcr.io/kube-vip/kube-vip:v0.6.0

ctr run \
      --rm \
      --net-host \
      ghcr.io/kube-vip/kube-vip:v0.6.0 \
      vip /kube-vip manifest pod \
      --interface lo \
      --address $(jq -r '.network.addresses | map(select(.public==true and .management==true)) | first | .address' /tmp/metadata.json) \
      --controlplane \
      --bgp \
      --peerAS $(jq -r '.bgp_neighbors[0].peer_as' /tmp/metadata.json) \
      --peerAddress $(jq -r '.bgp_neighbors[0].peer_ips[0]' /tmp/metadata.json) \
      --localAS $(jq -r '.bgp_neighbors[0].customer_as' /tmp/metadata.json) \
      --bgpRouterID $(jq -r '.bgp_neighbors[0].customer_ip' /tmp/metadata.json) | tee /etc/kubernetes/manifests/kube-vip.yaml

# This is needed to avoid a port conflict
sed -ri 's#- manager#- manager\n    - --prometheusHTTPServer=:2113#g' /etc/kubernetes/manifests/kube-vip.yaml
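Once the kubelet has started the static pod, you can sanity-check that kube-vip has picked up the address and is advertising it. A rough sketch, with VIP_ADDRESS standing in for the address passed to --address above:

# The advertised address should appear on the loopback interface.
ip addr show dev lo

# From another machine in the project, the API server should answer on that address.
curl -k https://VIP_ADDRESS:6443/version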

This runs kube-vip on each control plane node in its --controlplane mode. This enables each control plane node to discover the first API server, which takes ownership of the lease, and in turn discover the etcd members for consensus.
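With the shared address advertised, additional control plane nodes can be joined against that single endpoint rather than against any individual node. A hedged sketch, with placeholder values for the endpoint, token, CA hash and certificate key (use the values printed by your own kubeadm init):

# Placeholder values: substitute the token, hash and key from your own cluster.
kubeadm join VIP_ADDRESS:6443 \
  --control-plane \
  --token abcdef.0123456789abcdef \
  --discovery-token-ca-cert-hash sha256:<hash> \
  --certificate-key <key>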

Conclusion

Running Kubernetes in a highly available fashion on bare metal doesn’t need to be complicated: you just need to understand how BGP can open up new patterns, and get to know the tools emerging in this space that make traditional networking patterns more Cloud Native friendly.

Kube-vip is a fantastic tool that brings BGP to Cloud Native and Kubernetes organisations.

Last updated: 03 June, 2024
