
How to Build Network Services in the User Space

Why we decided to process packets outside the OS kernel and what we’ve learned doing it.

Bernardo Soares, Senior Staff Engineer, Product Software

When writing an application that transfers data between different nodes, be it an HTTP server or client, an API gateway, a storage service or something else, some kind of networking subsystem has to be involved. The data needs to be serialized and prepended with various packet headers to make sure it reaches its final destination on the other end of the connection.

In Linux systems, it’s the OS that handles all the network-specific tasks, from encapsulating the application data into network protocol headers to directing data from received packets to the right applications. Application developers commonly offload all networking-related logic to the kernel, which is suitable for the vast majority of use cases. Dealing with the actual packets and encapsulation intricacies isn’t the problem your application needs to solve. There are, however, applications that need to handle packet processing at very high rates, and that is a challenge we have been working on as we develop our cloud-native network offerings.

Over the years, new protocols were invented, standardized and made their way into released kernels; more code was added. The more code we run in the packet processing logic, the longer it takes to process a packet. Being general purpose, the Linux network stack has a whole bunch of features, which sometimes get in the way when you’re trying to optimize it for bulk packet forwarding. On top of that, because networking code lives in the kernel, any updates or extensions to the network stack may require a kernel upgrade.

We realized that if we wanted a fast, extensible and decoupled networking stack we would need to look for it elsewhere. As the title suggests, we went down the path of taking the kernel out of the picture for packet processing tasks and doing it all in the user space. In this article we’ll share some of what we have learned so far in our journey and highlight some of the tools we’ve been using to power the next generation of Equinix products.


Before we get into the solution, let’s make sure you understand the two big reasons we think the Linux network stack wouldn’t be a sensible choice for our use case.

Reason 1: Packet Processing Overhead

Much has to happen before application data reaches the right process or is forwarded to the destination interface. On the ingress side (that is, receiving traffic), the packet is received by the network card. The network driver will usually interrupt the kernel so it knows there is a packet waiting to be processed. This interrupt causes the CPU to pause what it’s doing and go process that packet; this is called a “context switch.” The kernel then allocates memory and copies the packet data received from the NIC into it. This part of the process is called SKB allocation, and it basically means that the kernel now has an internal structure to represent the packet.

The packet now sitting in memory is also parsed and filled with lots of metadata, which might be consumed by different parts of the kernel subsystems. Here we find references to the input interfaces, hashing value, tunnel keys and so on. As the packet moves up the processing chain, several operations are performed on it in order to determine where that data is supposed to go: either to a remote system (that is, to be forwarded) or to a local socket that is interested in packets matching certain criteria (a combination of address and port). Both forwarding and socket delivery also consume precious CPU time.

The long list of things that happen between packet arrival and packet forwarding or application delivery includes ingress queuing, route lookups, firewall filters, routing rules, socket matching, header rewrite, encapsulation, error handling and much more. A lot of CPU time is spent just figuring out what to do with a single packet.

This diagram should give you a good idea of the level of complexity involved:

Packet flow in Netfilter and general networking
Credit: Jan Engelhardt. Source: Wikimedia Commons

You can try to tune the system (using NAPI or GRO/GSO) or use lower-level sockets (such as AF_PACKET), but in most cases that is not enough (exceptions being XDP hooks and the zero-copy modes supported by some socket families). A big part of the CPU time and effort is spent copying data from one memory block to another (from the NIC to the kernel, from the kernel to application memory or to the egress NIC buffer). The same is true for data originating from an application and headed toward a remote system.

Reason 2: Upgrades

It’s natural to want to change a networking product to fit new use cases, fix bugs and support new features. If these new features had to go into the kernel code, keeping all that code up to date would be a lot of work, and releasing it would be even more work, as the entire operating system would change as a result. On the other hand, if the thing that runs the product (a user-space process) is not tightly coupled to the thing that runs it (the kernel), it is easier to upgrade one without changing the other. New functionality can go live faster and more smoothly.

Taking Packet Processing Over From the Kernel

Using the kernel for networking is relatively simple. There are multiple ways to influence the packet path in the kernel (routes, IP rules, packet manipulation with TC, firewalls and so on), and lots of different customizations are possible. ARP, routing, ICMP handling and other control protocols are all implemented in the kernel out of the box. But this also means that if you want to take over a NIC from the kernel and handle packet processing yourself, you will likely need to do some extra work in order to speak fluent “networkish”: resolve MAC addresses, interpret and respond to ICMP messages, build tables and keep them fresh, and so on. That job is far from straightforward.

When a Linux-based operating system boots up, it probes the devices connected to its motherboard. After discovering them, it loads a suitable driver (also called a kernel module) in order to use these devices. When the kernel discovers a network card, it loads a driver supported by that NIC, and the result is that the kernel can now use it to both receive and transmit packets. Ingress packets from the NIC will then follow the path along the networking subsystem that will decide whether to process that packet (deliver to a socket, forward it, handle control packets such as ARP/ND or ICMP) or drop it. Some NICs go even further and take over some extra work with offload capabilities.

But if we wish to set the kernel’s packet processing aside, we need to pull the NIC out of the kernel and use it directly from our application. This can be done by loading a passthrough-like driver for the NIC (the vfio-pci driver, for example). With such a driver in place, you can use the DPDK (Data Plane Development Kit) poll mode drivers to configure the NIC (and its queues) as well as access the packet descriptors on that device using DMA (Direct Memory Access), making packet access very fast. All the kernel knows is that there’s a PCI device—it doesn’t even know it’s a network card at this point—and it doesn’t try to do anything related to networking at all.
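
As a quick, hedged illustration of what that binding step looks like in practice, here is how a NIC is typically handed over to vfio-pci using DPDK’s dpdk-devbind.py utility (the PCI address is a placeholder; use the --status output to find your own devices):

$ sudo modprobe vfio-pci
$ dpdk-devbind.py --status
$ sudo dpdk-devbind.py --bind=vfio-pci 0000:1a:06.0

Once the NIC is bound to vfio-pci, the kernel no longer treats it as a network device, and a DPDK application is free to claim it.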

Using DPDK

One popular use case for DPDK is a virtual switch inside a virtualized or containerized system. Traditionally, these technologies rely on a host kernel to expose both hardware resources and networking connectivity. The networking connections of the guest systems take the form of a virtual NIC plugged into a software bridge in the kernel or some other similar abstraction (macvlan, taps, etc.). One problem here is that the host kernel becomes a bottleneck, handling packet forwarding for an increasing number of applications and guest systems.

A number of projects have emerged in the past few years that address this shortcoming, and most of them consist of replacing the Linux host networking stack with a faster software bridge. “Faster” here really means doing only the essentials rather than everything the kernel does, and doing them more efficiently. Examples of projects that use DPDK to enable fast packet processing include Snabb, OVS and FD.io (VPP). A more complete list of DPDK projects is available on the DPDK website.

Due to its wide range of protocols and a decent level of maturity, we chose VPP to be the underlying forwarding engine of our next-generation platforms. The sections that follow detail why we made this decision.

VPP

VPP is a high-performance packet processing stack. It was originally developed by Cisco and powers a fair amount of its product portfolio. This technology was open sourced, and it’s now maintained by the FD.io project.

Its core idea is to break packet processing into small steps and join those steps together in what it calls a node graph. To process a packet, we walk a path through the node graph, depending on what’s in the packet, and when we get to the end we have done all the work that packet needs.

VPP grabs many packets at once (grouped in a vector) and, instead of walking each of them one by one through all the work, it does one small piece of work on all the packets in the vector before moving to the next node in the graph. This is much more efficient on modern CPUs, largely because the instruction cache stays warm while a node processes the whole vector and the per-node overhead is amortized across many packets, which means we can process more packets with less CPU time. For a more detailed description of the technology, refer to the official documentation.

In order to use VPP, you can either install it as a package or build it yourself from source.
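
For example, on a Debian or Ubuntu host the packaged route looks roughly like this (the repository setup script and package names may differ for your distribution and VPP release):

$ curl -s https://packagecloud.io/install/repositories/fdio/release/script.deb.sh | sudo bash
$ sudo apt-get install vpp vpp-plugin-core vpp-plugin-dpdk
$ sudo systemctl start vpp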

Using VPP

If you’re familiar with configuring network equipment (hardware routers, switches) through a CLI, VPP has a CLI that can be used to manage its configuration. If you are more comfortable writing code, you can interact with VPP using its C or Golang APIs. On startup, VPP creates socket files (cli.sock and api.sock) that the tools use to communicate with it. Here is an example of using the CLI socket to connect to VPP:

$ vppctl -s /run/vpp/cli.sock
    _______    _        _   _____  ___
 __/ __/ _ \  (_)__    | | / / _ \/ _ \
 _/ _// // / / / _ \   | |/ / ___/ ___/
 /_/ /____(_)_/\___/   |___/_/  /_/

vpp# show interfaces
              Name               Idx    State  MTU (L3/IP4/IP6/MPLS)     Counter          Count

VirtualFunctionEthernet1a/6/0     1      up          9000/0/0/0     rx packets                    62
                                                                    rx bytes                    7320
                                                                    tx packets                     5
                                                                    tx bytes                     606
                                                                    drops                         59
                                                                    punt                           1
                                                                    ip6                           62
VirtualFunctionEthernet1a/e/3     2     down         9000/0/0/0
local0                            0     down          0/0/0/0

vpp#
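
You don’t have to stay in the interactive shell, either; vppctl also accepts a command as its arguments, which is handy for scripting. For example:

$ vppctl -s /run/vpp/cli.sock show version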

As in any decent networking stack, VPP supports packet capture and counters, and will even tell you which nodes in the graph are being run and how often. These are very useful for debugging and troubleshooting.
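
To give a rough idea, these are some of the built-in commands we reach for most often when debugging: show runtime prints per-node statistics (calls, vectors and clock cycles spent in each graph node), show errors lists per-node error and drop counters, and the trace commands record and display the exact graph path taken by individual packets (exact names and output can vary slightly between VPP releases):

vpp# show runtime
vpp# show errors
vpp# trace add dpdk-input 10
vpp# show trace
vpp# clear trace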

Next, let’s take a quick look at some of the basic configurations we will often see in VPP instances.

Exposing NICs to VPP

As mentioned before, VPP uses DPDK libraries to manage NICs. Before starting VPP, we need to configure it to bind to the right interfaces. This can be done by either providing the interface PCI addresses in the configuration file (startup.conf) or by not specifying any interface and letting VPP take over all interfaces it can find. Refer to the DPDK section in the documentation for more info.
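
If you want to pin VPP to specific NICs, a minimal dpdk stanza in startup.conf looks roughly like this (the driver and PCI addresses are illustrative; substitute your own):

dpdk {
  uio-driver vfio-pci
  dev 0000:1a:06.0
  dev 0000:1a:0e.3
}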

In this example, I am running VPP in a container and giving it two NICs by moving them into the container. VPP tries to bind to all interfaces, but that is fine, because the host won’t let it take over interfaces we haven’t explicitly allowed:

vpp#   sh log

2023/09/29 09:21:10:028 notice     vat-plug/load  Loaded plugin: acl_test_plugin.so
2023/09/29 09:21:10:030 notice     dpdk           EAL: Detected CPU lcores: 56
2023/09/29 09:21:10:030 notice     dpdk           EAL: Detected NUMA nodes: 2
2023/09/29 09:21:10:030 notice     dpdk           EAL: Detected static linkage of DPDK
2023/09/29 09:21:10:030 notice     dpdk           EAL: Selected IOVA mode 'VA'
2023/09/29 09:21:10:030 notice     dpdk           EAL: No free 1048576 kB hugepages reported on node 0
2023/09/29 09:21:10:030 notice     dpdk           EAL: No free 1048576 kB hugepages reported on node 1
2023/09/29 09:21:10:030 notice     dpdk           EAL: VFIO support initialized
2023/09/29 09:21:10:030 notice     dpdk           EAL: Using IOMMU type 1 (Type 1)
2023/09/29 09:21:10:030 notice     dpdk           EAL: Probe PCI driver: net_iavf (8086:154c) device: 0000:1a:06.0 (socket 0)
2023/09/29 09:21:10:030 notice     dpdk           EAL: Failed to open VFIO group 142
2023/09/29 09:21:10:030 notice     dpdk           EAL: 0000:1a:06.1 not managed by VFIO driver, skipping
2023/09/29 09:21:10:030 notice     dpdk           EAL: Failed to open VFIO group 143
2023/09/29 09:21:10:030 notice     dpdk           EAL: 0000:1a:06.2 not managed by VFIO driver, skipping
2023/09/29 09:21:10:030 notice     dpdk           EAL: Failed to open VFIO group 144
2023/09/29 09:21:10:030 notice     dpdk           EAL: 0000:1a:06.3 not managed by VFIO driver, skipping
2023/09/29 09:21:10:030 notice     dpdk           EAL: Failed to open VFIO group 145
2023/09/29 09:21:10:030 notice     dpdk           EAL: 0000:1a:0e.0 not managed by VFIO driver, skipping
2023/09/29 09:21:10:030 notice     dpdk           EAL: Failed to open VFIO group 146
2023/09/29 09:21:10:030 notice     dpdk           EAL: 0000:1a:0e.1 not managed by VFIO driver, skipping
2023/09/29 09:21:10:030 notice     dpdk           EAL: Failed to open VFIO group 147
2023/09/29 09:21:10:030 notice     dpdk           EAL: 0000:1a:0e.2 not managed by VFIO driver, skipping
2023/09/29 09:21:10:030 notice     dpdk           EAL: Probe PCI driver: net_iavf (8086:154c) device: 0000:1a:0e.3 (socket 0)

The logs say that VPP was able to bind to two NICs: 0000:1a:06.0 and 0000:1a:0e.3. We can also check them with the show hardware-interfaces command:

vpp# show hardware-interfaces

              Name                Idx   Link  Hardware
VirtualFunctionEthernet1a/6/0      1     up   VirtualFunctionEthernet1a/6/0
  Link speed: 10 Gbps
  RX Queues:
    queue thread         mode
    0     main (0)       polling
  TX Queues:
    TX Hash: [name: hash-eth-l34 priority: 50 description: Hash ethernet L34 headers]
    queue shared thread(s)
    0     no     0
  Ethernet address e6:ff:30:5c:e3:65
  Intel iAVF
    carrier up full duplex max-frame-size 9022
    flags: admin-up maybe-multiseg tx-offload intel-phdr-cksum rx-ip4-cksum int-unmaskable
    rx: queues 1 (max 256), desc 1024 (min 64 max 4096 align 32)
    tx: queues 1 (max 256), desc 1024 (min 64 max 4096 align 32)
    pci: device 8086:154c subsystem 1028:0000 address 0000:1a:06.00 numa 0
    max rx packet len: 9728
    promiscuous: unicast off all-multicast on
    vlan offload: strip off filter off qinq off
    rx offload avail:  vlan-strip ipv4-cksum udp-cksum tcp-cksum qinq-strip
                       outer-ipv4-cksum vlan-filter vlan-extend scatter rss-hash
    rx offload active: ipv4-cksum scatter
    tx offload avail:  vlan-insert ipv4-cksum udp-cksum tcp-cksum sctp-cksum
                       tcp-tso outer-ipv4-cksum qinq-insert vxlan-tnl-tso
                       gre-tnl-tso ipip-tnl-tso geneve-tnl-tso multi-segs
                       mbuf-fast-free outer-udp-cksum
    tx offload active: ipv4-cksum udp-cksum tcp-cksum multi-segs
    rss avail:         ipv4-frag ipv4-tcp ipv4-udp ipv4-sctp ipv4-other ipv4
                       ipv6-frag ipv6-tcp ipv6-udp ipv6-sctp ipv6-other ipv6
    rss active:        none
    tx burst function: (not available)
    rx burst function: (not available)


    tx frames ok                                           5
    tx bytes ok                                          606
    rx frames ok                                          67
    rx bytes ok                                         7910
    extended stats:
      rx_good_packets                                     67
      tx_good_packets                                      5
      rx_good_bytes                                     7910
      tx_good_bytes                                      606
      rx_bytes                                          8178
      rx_unicast_packets                                   4
      rx_multicast_packets                                63
      tx_bytes                                           606
      tx_unicast_packets                                   3
      tx_multicast_packets                                 2
VirtualFunctionEthernet1a/e/3      2     up   VirtualFunctionEthernet1a/e/3
  Link speed: 10 Gbps
  RX Queues:
    queue thread         mode
    0     main (0)       polling
  TX Queues:
    TX Hash: [name: hash-eth-l34 priority: 50 description: Hash ethernet L34 headers]
    queue shared thread(s)
    0     no     0
  Ethernet address e6:86:4e:83:8b:74
  Intel iAVF
    carrier up full duplex max-frame-size 9022
    flags: maybe-multiseg tx-offload intel-phdr-cksum rx-ip4-cksum int-unmaskable
    rx: queues 1 (max 256), desc 1024 (min 64 max 4096 align 32)
    tx: queues 1 (max 256), desc 1024 (min 64 max 4096 align 32)
    pci: device 8086:154c subsystem 1028:0000 address 0000:1a:0e.03 numa 0
    max rx packet len: 9728
    promiscuous: unicast off all-multicast off
    vlan offload: strip off filter off qinq off
    rx offload avail:  vlan-strip ipv4-cksum udp-cksum tcp-cksum qinq-strip
                       outer-ipv4-cksum vlan-filter vlan-extend scatter rss-hash
    rx offload active: ipv4-cksum scatter
    tx offload avail:  vlan-insert ipv4-cksum udp-cksum tcp-cksum sctp-cksum
                       tcp-tso outer-ipv4-cksum qinq-insert vxlan-tnl-tso
                       gre-tnl-tso ipip-tnl-tso geneve-tnl-tso multi-segs
                       mbuf-fast-free outer-udp-cksum
    tx offload active: ipv4-cksum udp-cksum tcp-cksum multi-segs
    rss avail:         ipv4-frag ipv4-tcp ipv4-udp ipv4-sctp ipv4-other ipv4
                       ipv6-frag ipv6-tcp ipv6-udp ipv6-sctp ipv6-other ipv6
    rss active:        none
    tx burst function: (not available)
    rx burst function: (not available)


local0                             0    down  local0
  Link speed: unknown
  local

The interfaces passed by the host are usually connected to the network (a network switch or router). This means that VPP has direct access to the network as if it were running on bare metal. The container gives us some isolation, making sure we only use the interfaces we’re allowed to, but it adds no overhead to the packet path.
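
From here we can treat these interfaces like ports on any router: bring them up, give them addresses and start forwarding. A minimal sketch, using the first interface from the output above and made-up addresses:

vpp# set interface state VirtualFunctionEthernet1a/6/0 up
vpp# set interface ip address VirtualFunctionEthernet1a/6/0 192.0.2.10/24
vpp# ping 192.0.2.1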

Plugging Workloads Into VPP

If we want to connect VM-based workloads to VPP, it supports creating or connecting to a vhost-user socket, which the hypervisor can then expose to a virtual machine as a NIC. Exposing the vhost-user socket to the VM as a network interface is done by the hypervisor itself. Inside the VM, the kernel loads the appropriate device driver (in this case virtio-net) to communicate with that NIC, just like any Linux kernel would do with any NIC. The result is that VPP sees all packets coming out of that virtio NIC, and it can send packets to the VM through it. From here onward, it's business as usual. Let’s create a vhost-user interface in our VPP instance:

vpp# create vhost-user socket /run/vpp/mysocket.sock server

VirtualEthernet0/0/0
vpp# show int VirtualEthernet0/0/0
              Name               Idx    State  MTU (L3/IP4/IP6/MPLS)     Counter          Count
VirtualEthernet0/0/0              3     down         9000/0/0/0
vpp#

And we also see our socket in the target directory:

$ ls -al /run/vpp/
total 28
drwxr-xr-x. 1 vpp  vpp  4096 Sep 29 11:12 .
drwxr-xr-x. 1 root root 4096 Sep 29 09:21 ..
srwxr-xr-x. 1 vpp  vpp     0 Sep 29 09:21 api.sock
srwxr-xr-x. 1 vpp  vpp     0 Sep 29 09:21 cli.sock
srwxr-xr-x. 1 vpp  vpp     0 Sep 29 11:05 mysocket.sock

You can read more about vhost-user usage in the official VPP documentation.
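
To sketch how the pieces fit together (the addresses and QEMU options below are illustrative, not taken from a real deployment): on the VPP side we bring the new interface up and give it an address, and on the hypervisor side QEMU attaches to the socket with a vhost-user netdev backed by shared hugepage memory. Because we created the socket with the server keyword, VPP listens on it and QEMU connects as the client.

vpp# set interface state VirtualEthernet0/0/0 up
vpp# set interface ip address VirtualEthernet0/0/0 10.10.0.1/24

$ qemu-system-x86_64 ... \
    -object memory-backend-file,id=mem0,size=1G,mem-path=/dev/hugepages,share=on \
    -numa node,memdev=mem0 \
    -chardev socket,id=chr0,path=/run/vpp/mysocket.sock \
    -netdev vhost-user,id=net0,chardev=chr0 \
    -device virtio-net-pci,netdev=net0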

Conclusion

In this post we talked about what user space networking is, why the Linux network stack doesn’t meet our needs and how the DPDK libraries can help overcome that. We also demonstrated a live VPP instance, gave it a DPDK NIC and created a vhost-user socket.

There are other uses for VPP, including IPsec termination, CGNAT, network tunnel termination, and MPLS and SRv6 forwarding, but covering all that would make this post extremely long and boring. You can learn more in the official documentation.

Published on 30 April 2024
