eBPF Explained: Enhancing System Observability and Monitoring

Extended Berkeley Packet Filter, or eBPF, is a lightweight and flexible way to run sandboxed observability programs inside the Linux kernel. Your programs execute in the kernel's privileged context, but only after a verifier has checked them, and they reach kernel and user-space data through a restricted set of helper functions rather than through unrestricted access to memory and resources.

Compared to traditional methods, such as kernel modules and user-space agents, eBPF offers significantly lower overhead and memory usage by executing only when specific events trigger it, all within verified, secure kernel sandboxes. Its low overhead and efficient data collection minimize the impact of observability scripts on your systems while providing deep visibility into kernel-level events, network packets and system calls.

Typically, eBPF captures and processes events in real time within the kernel and uses asynchronous mechanisms, such as ring buffers, to transfer data to user-space applications for analysis. Access to kernel-level data enables more granularity and near-real-time insights into your system's operations. In addition, eBPF's ability to enforce security policies at the kernel level makes it a powerful tool for efficient security monitoring and network security enforcement. It can complement, and in some cases replace, older kernel data collection tools like SystemTap and perf. In this article, you'll learn how eBPF works and explore some examples of it in action, enhancing observability and monitoring in real-world large-scale environments.

What Is eBPF?

eBPF is an improvement on the original Berkeley Packet Filter (BPF), which was developed for network packet filtering. It has since evolved into a generalized tool that extends Linux kernel functionality without modifying your kernel's source code, loading kernel modules, or otherwise disrupting regular program executions. eBPF also offers an efficient kernel-userspace communication model, using shared memory maps and ring buffers to enable data exchange between your kernel and userspace.
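The ring buffer idea can be illustrated with a small user-space simulation. This is a sketch in Python, not eBPF code: the class name `EventRingBuffer` and its methods are hypothetical, and the only behavior it tries to mirror is that a producer cannot add events to a full buffer, so the consumer has to drain it fast enough to avoid drops.

```python
from collections import deque


class EventRingBuffer:
    """Toy fixed-capacity event buffer (a simulation, not the kernel's ring buffer)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._events = deque()
        self.dropped = 0  # events rejected because the buffer was full

    def submit(self, event):
        """Producer side: store the event, or count a drop if there is no room."""
        if len(self._events) >= self.capacity:
            self.dropped += 1
            return False
        self._events.append(event)
        return True

    def drain(self):
        """Consumer side: return all pending events in arrival order and empty the buffer."""
        drained = list(self._events)
        self._events.clear()
        return drained


buffer = EventRingBuffer(capacity=2)
results = [buffer.submit({"pid": pid}) for pid in (101, 102, 103)]
pending = buffer.drain()
```

In this run the third event is dropped because the consumer did not drain the buffer in time, which is why sizing the buffer and polling it regularly matter in practice.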

Typically, you're restricted from accessing kernel space, which is reserved for the operating system and privileged processes. While meant to safeguard your system, this restriction can limit your ability to monitor kernel-level networking and security events, leaving you unable to identify and fix potential issues.

eBPF gives you access to your systems’ inner processes, enabling you to address a wide range of networking, observability and security challenges. For instance, you can attach an eBPF program to your kernel that intercepts system calls. Within your program, you can set conditions that trigger actions on the intercepted system calls, such as logging the event or terminating the offending process.
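The condition-and-action logic described above can be sketched as follows. This is a plain Python simulation of the decision step only; a real program would be compiled to eBPF bytecode and attached to a kernel hook. The `SyscallEvent` type, the `decide` function and the path prefixes are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyscallEvent:
    pid: int
    syscall: str
    path: str


# Hypothetical paths this example policy treats as sensitive.
SENSITIVE_PREFIXES = ("/etc/shadow", "/root/")


def decide(event):
    """Return the action for an intercepted system call: terminate, log or allow."""
    if event.syscall in {"open", "openat"} and event.path.startswith(SENSITIVE_PREFIXES):
        return "terminate"
    if event.syscall == "execve":
        return "log"
    return "allow"


events = [
    SyscallEvent(pid=200, syscall="openat", path="/etc/shadow"),
    SyscallEvent(pid=201, syscall="execve", path="/usr/bin/curl"),
    SyscallEvent(pid=202, syscall="openat", path="/tmp/report.txt"),
]
actions = [decide(event) for event in events]
```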

eBPF is open source and has a rich ecosystem of applications and tools designed to enhance observability, networking and security. With an eBPF runtime integrated into the Linux kernel, applications like Cilium, Falco and Calico use it to secure containers and Kubernetes networks.

How eBPF Works

The main eBPF components are your program's bytecode, a loader, the verifier, hooks, maps, a just-in-time (JIT) compiler that translates bytecode into native machine code, and kernel helper functions. You can write eBPF bytecode directly, but it is more common to write the program in a higher-level language and compile it to bytecode with a compiler such as LLVM, or to use applications like Cilium that ship their own eBPF programs.

eBPF architectural diagram. Credit: Sooter Saalu

Once your bytecode has been written, a loader loads it into the kernel, where it goes through a verification process to ensure safety and compliance with security requirements. This verification checks that the program has the necessary privileges and capabilities, does not contain harmful instructions and avoids behavior that could destabilize the system (unbounded loops, for example).
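To give a feel for what this kind of static check involves, here is a deliberately tiny verifier for a made-up instruction set, written in Python. It is a sketch under heavy assumptions: the opcodes, the `verify` function and the instruction cap are hypothetical, and the real kernel verifier does far more, including simulating execution paths and tracking register and pointer types. The toy version only rejects out-of-range jumps, backward jumps (a crude stand-in for loop detection) and programs that do not end with an exit.

```python
MAX_INSTRUCTIONS = 4096  # illustrative cap chosen for this sketch


def verify(program):
    """Return the reasons a toy program is rejected; an empty list means accepted.

    Each instruction is a (opcode, argument) tuple. For "jump", the argument is
    an offset relative to the next instruction.
    """
    if not program:
        return ["empty program"]
    errors = []
    if len(program) > MAX_INSTRUCTIONS:
        errors.append("program too long")
    for index, (opcode, argument) in enumerate(program):
        if opcode != "jump":
            continue
        target = index + 1 + argument
        if not 0 <= target < len(program):
            errors.append(f"instruction {index}: jump out of bounds")
        elif target <= index:
            errors.append(f"instruction {index}: backward jump (possible unbounded loop)")
    if program[-1][0] != "exit":
        errors.append("program does not end with exit")
    return errors


accepted = verify([("load", 1), ("jump", 0), ("exit", None)])
rejected = verify([("load", 1), ("jump", -2), ("exit", None)])
```

The first program is accepted and the second is rejected because its jump targets instruction 0, which would let it loop forever.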

Once verified, the bytecode is either compiled into native machine instructions by the JIT compiler or, where JIT compilation is unavailable, executed by the kernel's eBPF interpreter. The verified program is then attached to your specified kernel hook (system calls, network events, etc.), which serves as an entry point for the eBPF program to execute in response to relevant events.

eBPF maps facilitate data sharing and storage between your eBPF program and user-space applications. eBPF supports various map types, each designed for specific use cases, providing flexible data handling for programs. These maps require the memory layout of each record to match exactly between kernel and user space. This alignment is simple when both sides share the same C headers, but user-space code written in other languages must reproduce the kernel-side structures byte for byte, including any padding.
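The layout-matching requirement is easy to get wrong, so a small example may help. The sketch below uses Python's standard `struct` module to decode a hypothetical kernel-side record; the `struct event` definition in the comment is an assumption made for illustration, laid out the way a C compiler would typically arrange it on a 64-bit little-endian target. The point is the four padding bytes after `pid`, which a naive format string misses.

```python
import struct

# Hypothetical kernel-side record (an assumption for this example):
#
#   struct event {
#       __u32 pid;        /* offset 0, followed by 4 bytes of padding */
#       __u64 ts_ns;      /* offset 8 */
#       char  comm[16];   /* offset 16 */
#   };                    /* total size: 32 bytes */
EVENT_LAYOUT = struct.Struct("<I4xQ16s")  # explicit padding matches the C layout
NAIVE_LAYOUT = struct.Struct("<IQ16s")    # forgets the padding, so sizes disagree


def decode_event(raw):
    """Decode one 32-byte record into a dictionary."""
    pid, ts_ns, comm = EVENT_LAYOUT.unpack(raw)
    # comm is a NUL-padded C string; keep only the bytes before the first NUL.
    return {"pid": pid, "ts_ns": ts_ns, "comm": comm.split(b"\0", 1)[0].decode()}


raw_record = EVENT_LAYOUT.pack(1234, 987654321, b"nginx")
event = decode_event(raw_record)
```

Comparing `EVENT_LAYOUT.size` with the size reported by the C compiler is a cheap sanity check before trusting any decoded field.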

Finally, kernel helper functions extend your eBPF program's functionality within the kernel space. They allow eBPF programs to interact with the kernel as if calling a function, enabling you to access system information with probes and tracepoints, perform network stack operations and manipulate kernel data structures like maps. You can also use helper functions to direct the control flow within your program, enabling you to support operations like redirecting traffic to different interfaces, logging kernel events or calling other eBPF programs dynamically through tail calls.

eBPF Use Cases with Real-World Examples

Let's take a look at some of the ways teams use eBPF to manage and fix observability and security challenges.

eBPF in Infrastructure Monitoring

eBPF can efficiently instrument your infrastructure and provide data for deep real-time insight into system health and performance. By attaching eBPF probes to critical system functions, you can continuously assess and monitor your CPU, memory and I/O usage without adding significant overhead. You essentially get a live feed into your systems and can detect issues and bottlenecks immediately.

Continuous data collection also enables you to analyze trends and patterns in your system's operations for optimization and greater accuracy in predicting issues and threats. eBPF programs are flexible and granular, which allows you to create custom data variables and metrics.
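As one example of a custom metric, latency samples are often summarized into power-of-two buckets so that only a handful of counters need to be kept per metric. The sketch below shows that aggregation in plain Python; the function names and the sample values are hypothetical, and it runs in user space rather than in the kernel.

```python
from collections import Counter


def log2_bucket(value):
    """Return floor(log2(value)) for integers >= 1; values 0 and 1 share bucket 0."""
    return max(int(value).bit_length() - 1, 0)


def log2_histogram(values):
    """Count how many samples fall into each power-of-two bucket."""
    counts = Counter(log2_bucket(value) for value in values)
    return dict(sorted(counts.items()))


def bucket_range(bucket):
    """Return the inclusive (low, high) value range that a bucket covers."""
    return (1 << bucket, (1 << (bucket + 1)) - 1)


latencies_us = [1, 2, 3, 4, 100, 1000]  # made-up I/O latencies in microseconds
histogram = log2_histogram(latencies_us)
```

Here the 100 microsecond sample lands in bucket 6, which covers 64 to 127, and the 1000 microsecond sample lands in bucket 9, which covers 512 to 1023.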

LinkedIn uses eBPF for infrastructure monitoring. For a platform like LinkedIn, which handles billions of connections and requests, traditional monitoring approaches would introduce significant overhead and blind spots. eBPF's kernel-level visibility and low overhead allow LinkedIn to monitor its entire infrastructure without impacting service performance. Their eBPF agents collect granular data about network flows, syscalls and service interactions without the performance penalties of user-space monitoring tools or the risk of kernel modules. This allows its team to analyze traffic patterns, identify traffic-heavy services and detect network bottlenecks at scale. In addition, LinkedIn uses eBPF data to visualize network interconnections by creating a service dependency graph. This graph helps the team understand how services operate within the infrastructure and enables better network and security optimization, as well as capacity planning and failure prevention.
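A service dependency graph of this kind can be derived from flow records with very little code. The following Python sketch is not LinkedIn's implementation; the record format (source service, destination service, byte count), the service names and the function names are assumptions made purely for illustration.

```python
from collections import defaultdict


def build_dependency_graph(flows):
    """Aggregate (source, destination, bytes) records into {source: {destination: total_bytes}}."""
    graph = defaultdict(lambda: defaultdict(int))
    for source, destination, byte_count in flows:
        graph[source][destination] += byte_count
    return {source: dict(targets) for source, targets in graph.items()}


def heaviest_talkers(graph, top=3):
    """Rank services by the total number of bytes they send."""
    totals = {source: sum(targets.values()) for source, targets in graph.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:top]


# Made-up flow records, as a collection agent might report them.
flows = [
    ("web", "api", 1200),
    ("api", "db", 800),
    ("web", "api", 300),
    ("api", "cache", 100),
]
graph = build_dependency_graph(flows)
ranking = heaviest_talkers(graph)
```

The resulting edges show which services depend on which, and the ranking highlights the traffic-heavy ones.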

eBPF in System Security

eBPF’s kernel-level access can enable a highly efficient approach to enforcing system security, surpassing application-level monitoring alone. By analyzing system calls and network packets in real time, eBPF offers deep visibility into system and network operations and granular control over which operations are allowed to execute within your infrastructure. This capability enables dynamic enforcement of security policies: you can adjust them based on observed behavior without relying solely on preset rules or deploying additional application agents. eBPF can implement fine-grained security policies by examining factors like protocols and application data, immediately blocking unauthorized access attempts and significantly reducing the system's exposure to vulnerabilities.

You can implement strict security monitoring, track network activity, collate audit logs of security events and identify intrusion patterns. The policies you set can be unique to your infrastructure, since you have granular control over where your program attaches to and the triggers used to restrict or allow processes. You can enforce complex rules based on various parameters, such as source/destination IP, port, protocol and even application-layer data. This level of control enables fine-grained network segmentation, preventing unauthorized access and mitigating the risk of security breaches.
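The rule evaluation described above can be sketched as a first-match policy with a default deny. This is a user-space Python illustration using the standard `ipaddress` module, not code that runs in the kernel; the `Rule` fields, the `evaluate` function and the address ranges are hypothetical.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    source: str                # source network in CIDR notation
    dest_port: Optional[int]   # None matches any destination port
    protocol: Optional[str]    # None matches any protocol
    action: str                # "allow" or "deny"


def evaluate(rules, src_ip, dest_port, protocol):
    """Return the action of the first matching rule; deny if nothing matches."""
    address = ipaddress.ip_address(src_ip)
    for rule in rules:
        if address not in ipaddress.ip_network(rule.source):
            continue
        if rule.dest_port is not None and rule.dest_port != dest_port:
            continue
        if rule.protocol is not None and rule.protocol != protocol:
            continue
        return rule.action
    return "deny"


# Made-up policy: isolate one subnet, allow HTTPS from the rest of 10.0.0.0/8.
rules = [
    Rule(source="10.0.5.0/24", dest_port=None, protocol=None, action="deny"),
    Rule(source="10.0.0.0/8", dest_port=443, protocol="tcp", action="allow"),
]
decisions = [
    evaluate(rules, "10.0.5.7", 443, "tcp"),
    evaluate(rules, "10.1.2.3", 443, "tcp"),
    evaluate(rules, "10.1.2.3", 22, "tcp"),
    evaluate(rules, "192.168.1.1", 443, "tcp"),
]
```

Rule order matters in this design: the narrower deny rule is listed first so that the broader allow rule cannot override it.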

eBPF is extensible and can be integrated into several other cloud-native solutions. Container Network Interface (CNI) plugins like Cilium leverage eBPF to implement Kubernetes network policies with beneficial security features, such as Layer 7 filtering and identity-based security. Integrating these policies helps you enforce granular network policies within Kubernetes clusters and ensure that only authorized traffic can access containerized applications.

Apple uses eBPF for security-conscious access to the kernel and flexibility in extending kernel functionality, allowing the company’s engineers to solve critical security monitoring challenges in their large-scale infrastructure. With millions of devices and strict security requirements, Apple favors eBPF for its sandboxed execution and adherence to the principle of least privilege. Because eBPF programs run inside the kernel's own verified runtime, the company avoids third-party kernel dependencies that could introduce vulnerabilities or crash systems, although the features available still depend on the kernel version. This is crucial for Apple's ecosystem, where a single security incident could affect millions of users. Specifically, Apple uses Falco, which builds on eBPF access with additional security and threat detection programs and enables Apple to extend granular policies across its Kubernetes and cloud-based infrastructure.

eBPF in Network Traffic Analysis

Your network gets more complex as you scale, especially in cloud environments where resource details like an IP address can be ephemeral. eBPF enables you to build "bottom-up" visibility in your infrastructure, with all your resources instrumented and emitting the necessary metrics with every operation. Pointing this capability at your network traffic enables kernel-level analysis of your network packets: you can inspect each call and its performance during operations such as large file deliveries or downloads. This enables optimization and detection of performance bottlenecks, security threats and protocol-specific anomalies.

Your service mesh helps manage microservices in your infrastructure, and eBPF enhances observability within this software layer. Service meshes typically collect high-level metrics (like latency and request counts) and often rely on network proxies to gather this data and track communication between services.

eBPF offers flexibility in giving you deeper insights into system and security metrics by tapping directly into network, storage and CPU-level events. Unlike external services that require additional proxies, eBPF operates with lower overhead and better integration. Since it integrates directly with the Linux kernel, eBPF can observe, filter and manipulate network traffic without the need for external proxies or the context switches typically required for proxy interactions.

As mentioned previously, your eBPF data can be as granular and specific as you need it to be, and it can fit into the kind of network and instances you have in your infrastructure. This gives you the freedom to collect multiple metric categories to suit the spread of systems in your organization. For example, your infrastructure network could include TCP and UDP-connected applications to fit file transfer and streaming/video conferencing services, respectively. With eBPF, you can collate and use particular metrics, such as round-trip time (RTT) and packet loss rate, to optimize your application-specific network performance.
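Once raw measurements are available, turning them into the metrics mentioned above is straightforward. The Python sketch below computes a smoothed round-trip time with an exponentially weighted moving average (the default weight of 1/8 is the one TCP uses for its own RTT estimate) and a simple packet loss rate. The function names and sample values are hypothetical, and the inputs are assumed to come from whatever collection agent you run.

```python
def smoothed_rtt(samples_ms, alpha=0.125):
    """Exponentially weighted moving average of RTT samples; None if there are no samples."""
    srtt = None
    for sample in samples_ms:
        if srtt is None:
            srtt = sample  # the first measurement initializes the estimate
        else:
            srtt = (1 - alpha) * srtt + alpha * sample
    return srtt


def loss_rate(packets_sent, packets_lost):
    """Fraction of sent packets that were lost; 0.0 when nothing was sent."""
    return packets_lost / packets_sent if packets_sent else 0.0


rtt_ms = smoothed_rtt([100, 200])  # made-up samples in milliseconds
loss = loss_rate(packets_sent=1000, packets_lost=25)
```

A smoothed value reacts to sustained changes in latency while ignoring one-off spikes, which makes it a steadier signal for alerting than individual samples.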

Netflix uses eBPF to enhance network observability across its fleet of systems, cloud services and instances. Traditional network monitoring tools would either miss critical data or impose unsustainable overhead costs for a platform streaming billions of hours of content to over 230 million subscribers globally. Its eBPF-based solution collects and correlates billions of eBPF network logs per hour to provide visibility into its network availability, performance and security. All this is done while using less than one percent of CPU and memory on any instance in its infrastructure. This efficient monitoring enables Netflix to address critical operational challenges by providing deep visibility into application dependencies and data flows, allowing for early detection of systemic issues before they impact streaming quality. It supports improvements in reliability, security and capacity planning and offers insights into network bottlenecks and approaching limits. This translates to better streaming quality, faster incident response and efficient capacity planning across their global infrastructure.

eBPF in Container and Kubernetes Monitoring

Kubernetes can be a complex ecosystem, with its interconnected components generating logs and metrics in different ways. Monitoring your system performance and behavior can require multiple tools, often with significant overhead. eBPF offers a unified and efficient solution for Kubernetes observability. With Linux-based containers often used in Kubernetes deployments, eBPF can be easily integrated to grant deep visibility into both the application layer and the underlying infrastructure without intrusive instrumentation or additional agents.

Using eBPF programs deployed on your cluster nodes, where they run in the kernel shared by the pods on each node, you can trace data flows in real time and gain insights into the communication patterns between pods. You can also troubleshoot and optimize communication within your cluster by visualizing the inter-pod network flows. In addition, monitoring syscalls is easy with eBPF, which gives you foundational insights into how your operations are executed within your cluster.

eBPF can also be used to measure the performance of service-to-service interactions within a Kubernetes cluster without depending entirely on additional proxies or sidecars. This enables in-depth performance analysis of your service mesh and its applications.

Microsoft, for instance, uses eBPF to collect critical data from Kubernetes clusters to solve complex monitoring and security challenges in its massive cloud infrastructure. eBPF enables Microsoft to observe pod-to-pod communications, track container lifecycle events and monitor system calls with minimal performance impact across its clusters. By iterating over multiple Kubernetes objects with eBPF programs, Microsoft collects data for further enrichment by downstream services. This process gives deep visibility and helps manage its Kubernetes clusters while fueling optimization and security efforts. Additionally, Microsoft's initiative to extend eBPF functionality to Windows aims to solve cross-platform observability challenges in hybrid environments, enabling consistent monitoring and security policies across Linux and Windows workloads, which is crucial for its enterprise customers running mixed infrastructure.

Conclusion

In this article, you learned how eBPF works and saw how it can serve as a foundational tool to emit critical data for observability and security purposes. You can, however, face a few challenges when building your eBPF programs. As noted in the LinkedIn case study, key features such as eBPF's Compile Once, Run Everywhere (CO-RE) can be limited by incomplete support on some CPU architectures. In addition, eBPF features vary across kernel versions, Linux distributions and CPU architectures, which affects compatibility and functionality.
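One cautious way to cope with these differences is to check the running kernel before enabling a feature. The Python sketch below parses a kernel release string and compares it against the upstream versions that introduced two features: bounded loops in Linux 5.3 and the ring buffer map type in Linux 5.8. The function names and the feature table are assumptions made for this example, and because distributions often backport features, a version check like this is only a heuristic; probing for the feature directly is more reliable.

```python
import platform
import re

# Upstream kernel versions that introduced each feature (distributions may backport them).
MIN_KERNEL = {
    "bounded_loops": (5, 3),
    "ring_buffer_map": (5, 8),
}


def parse_kernel_release(release):
    """Extract (major, minor) from a string like '5.15.0-91-generic'; None if it does not parse."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        return None
    return (int(match.group(1)), int(match.group(2)))


def supports(kernel_version, feature):
    """Heuristic check: is the kernel at least as new as the feature's upstream version?"""
    if kernel_version is None:
        return False
    return kernel_version >= MIN_KERNEL[feature]


running_kernel = parse_kernel_release(platform.release())
can_use_ring_buffer = supports(running_kernel, "ring_buffer_map")
```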
