
How to Design for Low Latency at Scale

The basics of keeping latency low, from application design through hardware and platform optimization.

Damaso Sanoja, Engineer and Author

From online gaming to digital advertising to AI inference, the need to deploy low-latency applications at massive scale is quickly growing. Ensuring no millisecond is wasted isn’t easy and requires expertise at every level: the code itself, the platform it runs on and the infrastructure and hardware underneath. 

Latency is an important performance metric for applications in general. In many cases, however, it’s critical, and every additional split-second delay can have serious consequences. Those range from a streaming service losing subscribers to an electronic-trading firm missing out on a lucrative deal to a grid operator being unable to balance the load on their system.

In this article we will highlight the common approaches to ensuring low latency at scale across the relevant layers: application design, platform, hardware, geographic workload placement and connectivity.

Best Practices for Keeping Latency Low in Large-Scale Applications

Let's start with the basics: application-level optimizations.

Application Design for Low Latency

There are multiple tactics that can be deployed in the application-design phase to make a big impact on latency.

Implement Efficient and Reusable Data Structures

An application’s performance is tied to its ability to process data efficiently. One important element to keep an eye on is whether the data structures you use fit the way the data is accessed.

For instance, a search engine must rapidly sift through large volumes of data. In this case, hash maps provide average-case constant-time insertion and retrieval, whereas looking up an item in an unsorted list takes linear time. This can dramatically reduce the time taken to process user queries and directly impact the perceived speed of the engine.
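The difference can be seen in a minimal Python sketch. The data set (100,000 document IDs mapped to titles) and the function names are assumptions made up for illustration; the point is only that the list version walks the entries one by one while the dict version jumps to the entry by hash.

```python
import timeit

# Hypothetical data set: 100,000 document IDs mapped to titles.
doc_ids = list(range(100_000))
titles_list = [(doc_id, f"title-{doc_id}") for doc_id in doc_ids]
titles_map = dict(titles_list)


def find_in_list(doc_id):
    """Linear scan: inspects entries one by one until it finds a match."""
    for key, title in titles_list:
        if key == doc_id:
            return title
    return None


def find_in_map(doc_id):
    """Hash lookup: average-case constant time regardless of size."""
    return titles_map.get(doc_id)


if __name__ == "__main__":
    target = doc_ids[-1]  # last entry, the worst case for the linear scan
    list_time = timeit.timeit(lambda: find_in_list(target), number=20)
    map_time = timeit.timeit(lambda: find_in_map(target), number=20)
    print(f"list scan: {list_time:.4f}s, hash map: {map_time:.6f}s")
```

Exact timings depend on the machine, but the gap between the two should widen as the data set grows.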

Use Asynchronous Processing and Multithreading

Asynchronous processing allows an application to handle multiple tasks concurrently rather than queuing them up to be executed sequentially. This is particularly effective in web applications where I/O operations can be a bottleneck.

With asynchronous I/O, a web server can continue to process new incoming requests without waiting for responses to previous ones, like fetching data from a database. This keeps the system responsive even under heavy load.
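As a rough illustration, the sketch below uses Python's asyncio. The function fetch_from_database is a hypothetical stand-in for a real database call, and asyncio.sleep simulates the time spent waiting on I/O.

```python
import asyncio
import time


async def fetch_from_database(query_id, delay=0.1):
    """Stand-in for a database call; asyncio.sleep simulates the I/O wait."""
    await asyncio.sleep(delay)
    return f"result-{query_id}"


async def handle_requests(n):
    """Start n lookups at once and wait for all of them together."""
    return await asyncio.gather(*(fetch_from_database(i) for i in range(n)))


if __name__ == "__main__":
    start = time.perf_counter()
    results = asyncio.run(handle_requests(10))
    elapsed = time.perf_counter() - start
    # The waits overlap, so the total is close to one delay rather than ten.
    print(f"{len(results)} results in {elapsed:.2f}s")
```

While one request is waiting, the event loop is free to make progress on the others, which is what keeps the server responsive under load.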

Similarly, multithreading enables an application to perform several operations in parallel. For instance, during a flash sale, an e-commerce platform can use multithreading to handle multiple user checkouts simultaneously, preventing the delay that could otherwise result from single-threaded queue processing.
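A thread pool is one common way to do this. The following Python sketch assumes a hypothetical process_checkout function whose time.sleep call stands in for a payment-gateway round trip; the pool size of eight workers is arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def process_checkout(order_id):
    """Hypothetical checkout step; time.sleep stands in for payment I/O."""
    time.sleep(0.05)
    return {"order_id": order_id, "status": "confirmed"}


def process_all(order_ids, workers=8):
    """Run checkouts on a pool of worker threads; results keep input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_checkout, order_ids))


if __name__ == "__main__":
    start = time.perf_counter()
    confirmed = process_all(range(16))
    elapsed = time.perf_counter() - start
    print(f"{len(confirmed)} checkouts in {elapsed:.2f}s")
```

In standard CPython, threads mainly help with I/O-bound work like this; CPU-bound work usually needs multiple processes or a runtime without a global interpreter lock to run truly in parallel.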

Favor Data Caching

Caching frequently accessed data in memory can drastically reduce the need for repeat data retrieval operations, which is especially beneficial when dealing with networked database systems. For example, when user profiles on a social network are cached, the system avoids the latency introduced by querying a remote database every time a user's profile is accessed.
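A minimal in-memory cache with a time-to-live (TTL) might look like the Python sketch below. The class name TTLCache, the loader callback and the 60-second TTL are assumptions for illustration; production systems often use a dedicated cache service instead and need an explicit invalidation strategy.

```python
import time


class TTLCache:
    """Keeps values in memory and reloads them once they are older than the TTL."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit: no trip to the database
        value = loader(key)  # cache miss or expired: go to the slow source
        self._entries[key] = (now + self._ttl, value)
        return value


def load_profile_from_db(user_id):
    """Hypothetical slow lookup standing in for a remote database query."""
    time.sleep(0.05)
    return {"id": user_id, "name": f"user-{user_id}"}


if __name__ == "__main__":
    cache = TTLCache(ttl_seconds=60)
    cache.get_or_load(42, load_profile_from_db)  # slow: hits the "database"
    cache.get_or_load(42, load_profile_from_db)  # fast: served from memory
```

The TTL bounds how stale a cached profile can get, which is the usual trade-off between latency and freshness.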

Design Effective Database Indexing Strategies

Well-designed database indexing can significantly expedite search queries, which is critical when dealing with large data sets. An e-commerce site, for instance, may index its inventory database based on product categories, enabling rapid query results when users filter their searches.
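The effect of an index can be observed with SQLite, which ships with Python. The table layout, the index name and the sample data below are assumptions for illustration. EXPLAIN QUERY PLAN asks SQLite how it intends to run the query; the exact wording of its answer varies by SQLite version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (id INTEGER PRIMARY KEY, category TEXT, name TEXT)"
)
conn.executemany(
    "INSERT INTO products (category, name) VALUES (?, ?)",
    [(f"category-{i % 50}", f"product-{i}") for i in range(10_000)],
)


def query_plan(connection):
    """Return SQLite's plan for a category filter as a single string."""
    rows = connection.execute(
        "EXPLAIN QUERY PLAN SELECT name FROM products WHERE category = ?",
        ("category-7",),
    ).fetchall()
    return " ".join(row[-1] for row in rows)  # last column holds the detail text


# Without an index, the filter has to scan the whole table.
plan_before = query_plan(conn)

# With an index on category, SQLite can search the index instead.
conn.execute("CREATE INDEX idx_products_category ON products (category)")
plan_after = query_plan(conn)

if __name__ == "__main__":
    print("before:", plan_before)
    print("after: ", plan_after)
```

Indexes speed up reads at the cost of extra storage and slower writes, so they are usually added for the columns that queries actually filter or sort on.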

Each of these practices serves to streamline the flow of data and enable applications to serve users with minimal delay.

Platform Optimization for Low Latency

Equally critical to optimizing apps for low latency is having an infrastructure in which computing, networking and storage can handle the demand. The following are several recommended best practices for platform optimization.

Use Content Delivery Networks

Content delivery networks (CDNs) can play a critical role in reducing latency. 

CDNs are an essential element of online gaming platforms, for example. They store game assets on distributed servers globally in order to enable players to download content from the nearest location, thereby reducing latency. A popular massively multiplayer online (MMO) game might use a CDN to serve up new patches or game assets, which allows for faster download times and reduces the strain on the central servers. This ensures that even during a massive, global rollout of an update, players experience minimal lag.

Use Load Balancing

Load balancing is a critical practice for maintaining low latency. By distributing incoming traffic across multiple servers, load balancers ensure no single server gets more requests than it can handle without excessive delay.
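Two common selection strategies are round robin and least connections. The Python sketch below shows the core of each; the class names and server labels are made up for illustration, and real load balancers add health checks, weighting and connection draining on top.

```python
import itertools


class RoundRobinBalancer:
    """Hands out servers in a fixed rotation."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def pick(self):
        return next(self._cycle)


class LeastConnectionsBalancer:
    """Sends each request to the server with the fewest active connections."""

    def __init__(self, servers):
        self._active = {server: 0 for server in servers}

    def acquire(self):
        # On a tie, min() returns the first server in insertion order.
        server = min(self._active, key=self._active.get)
        self._active[server] += 1
        return server

    def release(self, server):
        self._active[server] -= 1


if __name__ == "__main__":
    rr = RoundRobinBalancer(["app-1", "app-2", "app-3"])
    print([rr.pick() for _ in range(5)])
```

Round robin works well when requests cost about the same; least connections tends to cope better when some requests take much longer than others.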

Autoscale with Kubernetes

Kubernetes is an orchestration platform that excels in scenarios requiring autoscaling capabilities. Going back to the online gaming example, where the number of concurrent users can fluctuate dramatically, Kubernetes can automatically scale the gaming service up or down based on demand. For instance, when a new game launches or an online event occurs, Kubernetes can spin up additional pods to handle the influx of players, ensuring the latency remains low as the load increases.

In financial services, Kubernetes can help manage workloads during market spikes. When trading volume peaks, the autoscaling feature of Kubernetes can dynamically allocate more resources to the trading applications. This ensures that the transaction processing doesn't suffer from increased latency, which is crucial in a sector where timing is everything.

In both use cases, Kubernetes not only helps maintain consistent performance but also optimizes cost, as resources are scaled down during periods of low demand. The result is a platform that is responsive, cost effective and capable of adapting to varying loads with minimal latency.
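The scaling decision made by the Kubernetes Horizontal Pod Autoscaler is documented as desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue). The Python sketch below reproduces that arithmetic. The function name, the replica bounds and the example numbers are assumptions for illustration, and the 10% tolerance mirrors the documented default; this is not the controller's actual code.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Approximate the Horizontal Pod Autoscaler's replica calculation."""
    ratio = current_metric / target_metric
    # Skip scaling when the metric is already close enough to the target.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp the result to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # Four pods averaging 90% CPU against a 60% target.
    print(desired_replicas(4, 90, 60, max_replicas=20))
```

Scaling on a metric that tracks user-facing load, rather than on a fixed schedule, is what lets the platform add capacity before queues build up and latency rises.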

Overall, platform optimization for low latency involves a combination of strategies suited to the specific demands of the application domain. Whether it's a gaming platform requiring speedy content delivery or a financial service needing fast transaction processing, CDNs, load balancing and autoscaling are best practices that help maintain a competitive edge.

Location Strategy and Connectivity for Low Latency

While optimizing your apps and the platform they run on is essential, physical locations of the various “nodes” that comprise your infrastructure—and how they all connect to each other—also greatly affect latency.

Deploy at the Edge

Edge computing brings processing closer to the end user, minimizing latency by reducing the distance data must travel. How this plays out depends on the workload. An online gaming company would deploy its multiplayer servers at the edge, across different regions, to ensure response times fast enough for a pleasant gaming experience. A trading firm would deploy its software in the same region, or even the same data center, as the trading engines of the markets it trades on, so that its algorithm can react to market changes faster than other traders. A SaaS provider would deploy its servers in metropolitan areas with a high concentration of end users, so they can enjoy a fast and smooth experience using its service.
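Distance sets a hard floor on latency, because signals in optical fiber travel at roughly two-thirds the speed of light in a vacuum. The Python sketch below estimates that floor; the 0.67 velocity factor is an approximation, the distances are hypothetical, and real routes add delay from indirect fiber paths, switching and queuing.

```python
SPEED_OF_LIGHT_KM_PER_S = 299_792.458
FIBER_VELOCITY_FACTOR = 0.67  # assumed: light in fiber travels at about 2/3 of c


def min_round_trip_ms(distance_km):
    """Lower bound on round-trip time over a fiber path of the given length."""
    one_way_s = distance_km / (SPEED_OF_LIGHT_KM_PER_S * FIBER_VELOCITY_FACTOR)
    return 2 * one_way_s * 1000


if __name__ == "__main__":
    for distance_km in (10, 100, 1_000, 10_000):  # hypothetical distances
        rtt = min_round_trip_ms(distance_km)
        print(f"{distance_km:>6} km -> at least {rtt:.2f} ms round trip")
```

As a rule of thumb, every 100 km of fiber between the user and the server adds about 1 ms to the round trip, which no amount of software tuning can recover.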

Use Private Connectivity for Clouds and Edge Nodes

If your application runs in an environment that combines private IT infrastructure and a public cloud (or multiple public clouds), how the different environments are connected to each other greatly affects latency. All major cloud providers give customers the option to connect to their networks using direct, private network connections (AWS Direct Connect, Azure ExpressRoute, Google Cloud Interconnect, IBM Cloud Direct Link, Oracle FastConnect). While adding to your overall operating cost, these cloud onramps are faster, more reliable and provide lower and more consistent latency than the public internet.

Private connectivity also helps lower latency when connecting edge nodes to the core. A user, for example, may use private connectivity to link their private enterprise environment and their cloud provider(s), and to link their core and edge nodes, while using the internet only to serve user traffic at the edge. Another user may directly connect private storage arrays to a cloud service to run analytics on that data. Having the cloud onramp in the same data center as the arrays greatly reduces that analytics workload’s latency. 

There are many more use cases for private connectivity to the edge and to and between clouds. If your strategy for lowering application latency includes using direct network connections, you can do so in Equinix data centers around the world, launching and managing the connections virtually and remotely. Private connectivity to all the major cloud providers and network operators is available in these facilities to ensure the shortest data-transfer paths possible. The Network-as-a-Service capabilities Equinix provides enable a set of powerful networking building blocks to be consumed on demand and managed using a powerful API and all the common Infrastructure-as-Code tools.

Hardware Configuration for Low Latency

Running your workloads on optimally configured hardware (CPU, NIC, memory, storage) plays a critical role in low-latency applications.

Rightsize RAM and CPU Core Count

If your workload requires a lot of data to be kept in memory and benefits from running many processes in parallel, it’s best to go with high-core-count CPUs and lots of RAM. These are applications like analytical or Online Transaction Processing (OLTP) databases, which need quick access to data in memory and the ability to process many queries simultaneously; hypervisors, which need to allocate memory and CPU resources to multiple VMs and avoid resource contention; and scientific and engineering simulations, which involve complex mathematical calculations on lots of data and are often parallelized, using separate CPU cores to process different parts of a simulation in parallel.

Select the Right Network Interfaces

How quickly the network card can transfer data between your servers and the network is a crucial factor. When deciding on a server configuration, make sure to choose Network Interface Cards, or NICs, with enough bandwidth for the volume of data they will move back and forth, so that they do not become a bottleneck.
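A back-of-the-envelope calculation shows how link speed affects the time needed to put a payload on the wire. The Python sketch below ignores protocol overhead, congestion and propagation delay; the 1 MB payload and the link speeds are example values only.

```python
def transfer_time_ms(payload_bytes, link_gbps):
    """Time to serialize a payload onto a link, ignoring protocol overhead."""
    bits = payload_bytes * 8
    return bits / (link_gbps * 1e9) * 1000


if __name__ == "__main__":
    payload = 1_000_000  # 1 MB, an example payload size
    for gbps in (1, 10, 25, 100):
        ms = transfer_time_ms(payload, gbps)
        print(f"{gbps:>3} Gbps: {ms:.3f} ms")
```

At 1 Gbps the 1 MB payload takes about 8 ms to send, while at 10 Gbps it takes under 1 ms, so for large responses the NIC's speed can dominate the latency budget.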

Use SSD or NVMe Storage

Slow storage can create I/O bottlenecks that severely impact application performance. The faster stored data can be retrieved, the lower the overall latency. Systems that handle large volumes of data, or where every last microsecond of latency matters, benefit immensely from solid-state storage (SATA SSDs or, faster still, NVMe drives) in place of spinning hard disk drives.


Optimizing for low latency is a necessity when delivering large-scale applications and services to a global user base.

From leveraging CDNs and optimizing platform configurations to fine-tuning hardware, the best practices described in this article help applications remain responsive and efficient. For a global infrastructure platform with enough flexibility to fine-tune your scaled applications for the lowest latency possible, consider Equinix dedicated cloud.

Published on

02 February 2024

