Leveraging OpenTelemetry Tracing to Tune Our Rails Monolith
How a team of engineers set out to get load times down to acceptable levels.
Back in the early days of Packet, long before our evolution to Equinix Metal was even dreamed of, our server fleet was measured in the hundreds and even slow code was relatively fast. However, as we’ve grown from measuring in hundreds of machines to measuring in tens of thousands (and from a handful of individual sites to dozens), performance on our console.equinix.com customer portal degraded significantly. By early 2022, some pages were taking tens of seconds to load. Portal load times were especially bad for large customers with a high number of instances.
Since Equinix Metal internal organizations and projects can contain hundreds of test instances and other debugging data, Metal's own sales and customer success teams felt this pain as well. They escalated their concerns to leadership, who asked a handful of engineers to set aside planned sprint work for a few weeks to get load times down to acceptable levels.
It Wasn't (Entirely) the Portal's Fault
The team realized early on that portal response time was slow largely because backend response time was slow.
While parts of our backend monolith were already in the process of being replaced by new services with a more efficient architecture and data model, we couldn't wait for that migration to complete—the code was still being written. Our teammates and customers were feeling this pain now, so we needed to act.
More on distributed tracing and OpenTelemetry:
- OpenTelemetry and the Ghost of Provisioning Past
- OpenTelemetry: What’s a Collector and Why Would I Want One?
As we discussed possible sources for latency, we were reminded of some known inefficiencies. Engineers around the org had long suspected that some endpoints were generating too many database queries per request, which can add up even if individual queries are fast. One ticket linked to an OpenTelemetry trace with over 23,000 spans for a single request, nearly all of which represented database queries. Here’s a video of us scrolling through the trace at 2x speed.
Warning: We do not recommend watching this video if you're photosensitive, due to frequent white flashes.
That ginormous trace was our smoking gun. It reeked of N+1 queries, which ActiveRecord and other object-relational mapping (ORM) systems are notorious for introducing. Much of our API monolith code was written long ago and simply wasn’t designed to handle tens or hundreds of thousands of servers. Our object model was literally slowing us down!
Our ActiveRecord instrumentation worked beautifully to demonstrate the problem in these huge traces, but it came at a cost. Alas, the explosion in the number of OpenTelemetry spans per request put too much pressure on the service and infrastructure. Faced with other priorities at the time, the team disabled the ActiveRecord instrumentation soon after those giant traces were observed.
Additionally, this 23,000-span example trace wasn't conclusive evidence that N+1 queries were a problem for all of our slow endpoints. In order to debug our performance issues, we needed better observability into the problematic parts of the app.
ActiveRecord Tracing
For readers who are new to tracing, the OpenTelemetry docs provide a good overview of the concepts and benefits.
Our ActiveRecord tracing is custom code that behaves like auto-instrumentation, thanks to the ActiveSupport instrumentation built into Rails. It works by subscribing to ActiveSupport::Notifications for sql.active_record events and then generating spans from that data. This is similar to how other Ruby instrumentation libraries hook into Rails, such as Honeycomb's Ruby Beeline, New Relic's Ruby agent, and OpenTelemetry's own ActiveRecord instrumentation, which was added to OpenTelemetry Ruby late last year.
Here's what ours looks like:
event = ActiveSupport::Notifications::Event.new(*args)
name = "DB"
name.concat(" #{event.payload[:name]}").strip! if event.payload[:name]

@tracer.start_span(
  name: name,
  kind: OpenTelemetry::Trace::SpanKind::CLIENT,
  start_timestamp: event.time,
  attributes: {
    "db.system" => "postgresql",
    "db.statement" => event.payload[:sql],
    "cached" => event.payload[:cached],
  },
)
The instrumentation code generates a new span for each ActiveRecord event with an attribute for the SQL statement being run, plus it tracks whether or not ActiveRecord hit the Rails query cache for that event.
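For readers who want to see how those pieces fit together, here's a minimal sketch of wiring up such a subscriber; the class name and setup below are illustrative assumptions, not our production code:

class ActiveRecordTraceSubscriber
  # Illustrative sketch: subscribe to every SQL event ActiveRecord publishes
  # and emit one client span per event.
  def self.subscribe!(tracer = OpenTelemetry.tracer_provider.tracer("active_record"))
    ActiveSupport::Notifications.subscribe("sql.active_record") do |*args|
      event = ActiveSupport::Notifications::Event.new(*args)
      name = "DB"
      name = "#{name} #{event.payload[:name]}".strip if event.payload[:name]

      span = tracer.start_span(
        name,
        kind: OpenTelemetry::Trace::SpanKind::CLIENT,
        start_timestamp: event.time,
        attributes: {
          "db.system" => "postgresql",
          "db.statement" => event.payload[:sql],
          "cached" => event.payload[:cached] || false,
        }
      )
      span.finish(end_timestamp: event.end)
    end
  end
end

# Called once during boot, e.g. from an initializer:
# ActiveRecordTraceSubscriber.subscribe!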
Here's an example span (represented as JSON) that was generated by this instrumentation:
{
  "Timestamp": "2022-03-29T22:56:19.784382499Z",
  "duration_ms": 0.390804,
  "name": "DB Network::IPAddress Load",
  "db.statement": "SELECT \"ip_addresses\".* FROM \"ip_addresses\" WHERE \"ip_addresses\".\"deleted_at\" = '0001-01-01 00:00:00 BC' AND \"ip_addresses\".\"deleted_at\" = '0001-01-01 00:00:00 BC' AND \"ip_addresses\".\"id\" = '...' ORDER BY \"ip_addresses\".\"networkable_type\" ASC, \"ip_addresses\".\"networkable_id\" ASC, \"ip_addresses\".\"public\" DESC, \"ip_addresses\".\"address_family\" ASC, \"ip_addresses\".\"management\" DESC, \"ip_addresses\".\"address\" ASC LIMIT 1",
  "db.system": "postgresql",
  "service.name": "PacketApi",
  "trace.parent_id": "43f...",
  "trace.span_id": "c8a...",
  "trace.trace_id": "e1c..."
  ...
}
We may decide to move to OpenTelemetry's ActiveRecord integration now that it's available. It wasn't on our radar when we began this investigation.
Minimizing the Blast Radius
As mentioned above, we disabled the ActiveRecord instrumentation soon after releasing it last year because of the pressure it put on the API service. Enabling it would offer much better observability into the service's behavior, but after a year of OpenTelemetry-related incidents and with ongoing customer pain, we needed to tread lightly.
To minimize any negative impact from our ActiveRecord auto-instrumentation, we wrapped it in a conditional that could be set dynamically via a LaunchDarkly feature flag. This allowed us to enable ActiveRecord tracing only for requests from specific user IDs—namely, the handful of us investigating these performance issues.
def with_active_record_tracing
  # LaunchDarkly requires a user, so only enable ActiveRecord tracing on requests with a user.
  return yield unless current_user && LaunchDarklyClient.instance

  begin
    user_record = { key: current_user.id }
    Trace::ActiveRecordSubscriber.enable(
      LaunchDarklyClient.instance.variation(ACTIVE_RECORD_TRACING_KEY, user_record, false)
    )
    yield
  ensure
    # restore thread-local state
    Trace::ActiveRecordSubscriber.enable(false)
  end
end
While it might not be the ideal use case for feature flags, it allowed us to quickly proceed without taking out the Ruby garbage collector, OpenTelemetry's BatchSpanProcessor, or Honeycomb's ingest API.
LaunchDarkly's requirement for a user ID meant that we couldn't enable ActiveRecord tracing on requests that don't have a logged-in user, like some background jobs. We discussed ways to get around it, but a user-based approach was actually perfect for testing. Long-term we want to get the API service to a place where we can confidently enable ActiveRecord tracing everywhere, but in the meantime the feature flag gives us much-needed guardrails.
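For illustration, here's one way such a wrapper could be hooked into request handling as a Rails around_action; the controller wiring below is an assumption on our part, not necessarily how the monolith does it:

class ApplicationController < ActionController::API
  # Hypothetical wiring: evaluate the feature flag for every request and let the
  # ensure block in with_active_record_tracing reset the thread-local state afterwards.
  around_action :with_active_record_tracing
end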
Performance Overhead
Even with the small volume of requests from a limited number of users, we wanted to make sure this additional trace data didn't have a significant negative impact on API performance. OpenTelemetry is designed to be safe and performant for production workloads, but it never hurts to double-check, especially since we were generating thousands of spans per request.
While we didn't have a large sample size, we observed that enabling ActiveRecord tracing didn't meaningfully affect latency for the same requests. The following screenshot shows a list of traces from our early testing, all making the same request to the problematic /ips endpoint. Requests with ActiveRecord tracing enabled (those with thousands of spans) took about the same amount of time as the same requests with it disabled (those with only two spans).
Seeing similar results on the other endpoints was enough for us to feel confident and proceed with our investigation.
Detangling Rails
With ActiveRecord tracing enabled for a handful of team members, we could see how many times the service was querying the database or hitting the cache for individual requests. This level of observability allowed us to quickly confirm that our slowest and most customer-impacting performance issues were caused by excessive database activity.
As mentioned earlier, the N+1 query problem is common in Rails applications, particularly in a scaling scenario like ours. At its core, the issue comes from the ActiveRecord ORM lazily loading associated records one query at a time unless it's explicitly told to prefetch them.
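As a toy illustration (the model and association names are made up for the example, not actual portal code), lazy loading turns one logical lookup into one query per row:

# One query loads the project's devices...
project.devices.each do |device|
  # ...but each loop iteration lazily fires its own ip_addresses query,
  # so a project with N devices issues 1 + N queries in total.
  device.ip_addresses.each { |ip| puts ip.address }
end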
The official Rails recommendation is to cache everything. There is another path, though: with tracing we can see exactly where things happen over the entire lifecycle of each request, turning an intractable problem into a manageable optimization process.
Identifying the Culprit
Looking closely at the traces, it was clear that our database problems were a product of our templating architecture. At Metal we use RABL to define our responses, which are composed by including child templates in objects and collections. For example, this is our template for loading a facility:
object @facility
cache
attributes :id, :name, :code, :features
deleted_at
node(:ip_ranges) { |facility| facility.ip_ranges.map(&:address_with_subnet) }
node(:address) do |facility|
  partial('addresses/show', object: facility.try(:address))
end unless exclude?(:address)
show_only_on_include :labels, as: :labels, template: 'labels/show'
child(metro: :metro) { extends 'metros/show' } unless exclude?(:metro)
Nesting makes it easy to define a response template in one place, but it also means that modifying a template changes the response everywhere that template is used. When combined with the N+1 query problem and the growth in complexity of the Metal API over several years, this design decision had turned into a major engineering problem.
Fixing the Slow Paths
Getting excess database querying out of a years-old Rails monolith is equal parts science and art. In some cases, simply adding includes to a method called in many places was enough to have an impact.
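The following is a hedged reconstruction of that kind of change; the method, scope, and association names are hypothetical, not our actual diff:

# Before: every caller that later touches a facility's address or metro
# triggers one extra query per facility.
def visible_facilities
  Facility.where(code: ALLOWED_CODES)
end

# After: prefetch the associations the response templates are known to touch,
# so the same callers issue a constant number of queries.
def visible_facilities
  Facility.where(code: ALLOWED_CODES).includes(:address, :metro)
end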
Other situations required a bit more rewriting. A common pattern was code that did a redundant query to look up an object's ID in several places instead of referencing the object directly.
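Here's a hedged before-and-after sketch of that pattern (the names are hypothetical):

# Before: re-loading the parent record just to read its id costs a query
# every time this helper is called.
def project_id_for(device)
  Project.find(device.project_id).id
end

# After: the device already carries the foreign key, so no query is needed at all.
def project_id_for(device)
  device.project_id
end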
These minimal examples belie the extent of the changes. With such a large amount of code that could be hiding extra queries, we needed a way to test which changes had a measurable impact on performance.
Canarying and Verifying
The N+1 query problem is inherently a scale problem.
In a dev or test environment, it's rare for the volume of data to match what's present in a production environment. Maintaining a parallel environment with a large volume of test data and fake traffic requires significant work and resources, and in the end you're still unlikely to reproduce all of the quirks that inevitably manifest with real users. Even if we had wanted to do that work (and for some orgs it's absolutely worth the effort), it wouldn't have been feasible in a short time. To find out if our code-side optimizations were working, we needed to test in production.
Thankfully, our system already supported canary deploys. The central API service has an api-internal deployment that's not exposed to customer traffic, so any change we deploy there runs against production data without directly impacting customers.
The Process
Canarying provided us with a tight (and very granular) feedback loop for verifying optimizations in the code. The verification process went like this:
- Before canarying, manually cURL a bunch of requests to api-internal from a user account that has the ActiveRecord tracing feature flag enabled
- Query Honeycomb for these baseline traces so we have them handy
- Deploy the PR image to api-internal
- Make the same cURL requests again
- Compare the traces before and after canarying
The comparison we did involved choosing exemplars from the baseline and canary traces and then further querying the spans within each exemplar trace. To see how many times ActiveRecord was querying the database or hitting the cache in a request, we looked at the count of ActiveRecord spans in that trace, grouping by the span name (which references the Rails model being looked up) or by the database query statement itself.
Here's what the data looked like for a request hitting the /ips endpoint, which was probably the worst-performing of the ones we worked on:
Note that cached is true for all the queries shown above. ActiveRecord's query caching is scoped to the controller action, so in the case of PriceMatrix on this request, it queried the database once and then hit the query cache 2,299 times for those exact same query results:
Query caching alone wasn't enough to maintain fast performance as we scaled. It turns out that doing anything hundreds or thousands of times within one request adds up, even if each one takes microseconds. Go figure!
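To make the caching behavior concrete, here's a hedged sketch of what those repeated lookups look like inside a single controller action (the column name is invented for the example):

# The first query goes to PostgreSQL; identical repeats within the same
# request are answered from ActiveRecord's query cache instead.
PriceMatrix.where(currency: "USD").to_a # "cached" => false, real round trip
PriceMatrix.where(currency: "USD").to_a # "cached" => true, served from memory
# Cheap per call, but a couple thousand cache hits in one request still add up.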
The Results
The first canarying we did was on the /plans endpoint, and we saw an immediate drop in the query count compared to the baseline. This chart shows the count from a representative request hitting different versions of the code.
A tight feedback loop, granular traces, and the ability to query high-cardinality data made it possible to quickly iterate and test improvements, with some really great results. This is a comparison of latency for requests to the /ips endpoint at the end of January 2022 vs. the end of February 2022:
There's ongoing work to continue improving performance in the API service as well as larger projects to re-architect parts of the system and improve our data model, which will be important as we continue to grow. Still, we're thrilled to hear from customers as well as folks within Metal that they're already experiencing a noticeable improvement in portal load times.
Database Optimizations
While this post has mostly covered the code-side changes we made to address the N+1 query problem, we also saw significant performance gains thanks to some very strategic database optimizations.
Beyond letting us observe the volume of different queries, ActiveRecord tracing gives us visibility into actual query parameters without needing to increase logging at the database. Having complete query details inside traces makes it easy to get query plans from the database, both for typical query executions and for the outliers.
Looking at both ActiveRecord trace data and pganalyze (a PostgreSQL performance monitoring tool), our database engineers identified two areas of clear opportunity: planner statistics and indexing.
- Planner statistics – There are numerous blog posts and mailing list threads about how the PostgreSQL query planner uses the statistics gathered on each column to develop a query plan. In more recent versions of PostgreSQL, the ability to create extended statistics across two or more correlated columns can dramatically improve the planner's choices. Identifying a suboptimal query plan can still be challenging, but when detailed traces are just a few clicks away it gets much easier.
- Indexing – Indexing is closely related to planner statistics. For a given query, the planner has to estimate whether scanning an index and then fetching the matching rows from the table will be cheaper than simply scanning the whole table. Generally speaking, smaller indexes can be far more effective than larger ones, as the planner is more likely to prefer them. By replacing large composite indexes (which may span many columns) with partial indexes, our indexes have become significantly smaller and much faster.
- More indexing – Another useful feature of PostgreSQL indexes is the INCLUDE clause when creating a new index. Indexes are typically built on the columns you filter by, but in certain cases you can further improve join performance by also including the column used in the join condition. Using INCLUDE rather than a traditional composite index can be slightly more efficient when you aren't filtering on the additional column. Included columns are also handy when you want a single unique index on a set of columns but need to carry along one more column that shouldn't count toward row uniqueness. Since indexes aren't free, you want to be mindful of their maintenance cost. (A migration sketch of these techniques follows this list.)
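To make these concrete, here's a hedged sketch of what such changes can look like in a Rails migration; the table, column, index, and statistics names, as well as the soft-delete predicate, are invented for illustration and don't reflect our actual schema changes:

class AddPlannerStatsAndLeanIndexes < ActiveRecord::Migration[6.1]
  def up
    # Extended statistics (PostgreSQL 10+): tell the planner these columns are
    # correlated so its row estimates improve for combined filters.
    execute <<~SQL
      CREATE STATISTICS ip_addresses_project_facility_stats (dependencies)
        ON project_id, facility_id FROM ip_addresses;
    SQL
    execute "ANALYZE ip_addresses;" # extended statistics are only populated on analyze

    # Partial index: index only the rows we actually filter on, keeping the
    # index small and therefore more attractive to the planner.
    add_index :ip_addresses, :project_id,
              where: "deleted_at IS NULL",
              name: "index_live_ip_addresses_on_project_id"

    # Covering index (PostgreSQL 11+): carry project_id along via INCLUDE
    # instead of widening the key with another composite column.
    execute <<~SQL
      CREATE INDEX index_ip_addresses_on_address_incl_project
        ON ip_addresses (address) INCLUDE (project_id);
    SQL
  end

  def down
    execute "DROP STATISTICS IF EXISTS ip_addresses_project_facility_stats;"
    remove_index :ip_addresses, name: "index_live_ip_addresses_on_project_id"
    execute "DROP INDEX IF EXISTS index_ip_addresses_on_address_incl_project;"
  end
end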
Impact
Moving the needle in the right direction always feels good.
It's worth noting that these database-side optimizations occurred independently of the code-side optimization work, and were live in production weeks before the code changes rolled out. In fact, one day of database optimization work reduced overall API latency to the point that we reached our Q1 performance target overnight. Engineering leadership has since lowered the latency target by 40% and this new, stricter target is being met consistently.
We're No Heroes. We Just Had Better Data.
Database bottlenecks in Rails applications are a well-known problem, but they aren’t something customers need to live with or engineers need to paper over with caching alone. With tracing and a little bit of domain knowledge, even the most tangled monoliths can be cleaned up in a methodical, repeatable fashion.
As engineers, it's rare that we get the opportunity to make a major impact on performance in such a short amount of time. In this case, entropy and scale joined together to slow down our service and cause pain for our customers and internal teams. These were long-running, known issues that, until recently, never got bad enough to require this level of intervention. And while the team has done plenty of work over the years to reduce latency, they lacked the observability and tight feedback loops that enabled us to make such a big difference now.
The performance gains alone are worth celebrating, but this has also been an important milestone in our observability journey as an engineering organization. As Metal continues to grow in size and complexity, it'll be increasingly important for our engineers to be able to identify and debug issues across the entire stack. This performance work was a proving ground for our investments into OpenTelemetry tracing and modern observability tooling. It's not about having a couple heroes who save the day—it's about giving everyone better tools and better data, so we can all have an easier time working toward production awesomeness.
Bonus Graphs
All three graphs show data from January 28th up to March 29th.
IPs
Note the shape of the heatmap. Customers who experienced poor performance were pretty much only looking at IPs on weekdays, whereas we get fast traffic at a much more consistent volume throughout the week.
Also, a note: the increase in traffic mid-February was limited to a few customers who weren't among the outliers experiencing poor performance.
Devices
Initially we have two upper bands of slow traffic that merge into a single, relatively faster one around February 10th, which was when the database-side optimizations took place. The lower band is fast traffic, which also gets faster around that time.
We maintain those two bands until the end of March, when both drop even further thanks to a code-side improvement in how we load device data.
Plans
It's like someone wrapped our traffic in a SpaceSaver bag and vacuumed out all the latency. We shipped the code-side improvements for /plans first, before the other endpoints, but this graph is so cool that we had to save it for last.