Distributed Tracing with Hypertrace

Prashant Pandey
Razorpay Engineering
15 min read · Apr 7, 2022


As a continuation of our previous article on Razorpay’s distributed tracing journey, this article covers Hypertrace, an OSS distributed tracing platform from traceable.ai that forms the backbone of our current distributed tracing infrastructure.

Once Upon a Time, There was Jaeger

As stated in the previous blog, Razorpay’s distributed tracing journey started with Jaeger. There were multiple reasons behind choosing it at that time:

  1. Jaeger was (and still is) one of the most popular tools for distributed tracing.
  2. It is compatible with OpenTracing, OpenTelemetry and Zipkin. We were especially cautious about using open standards, as using any proprietary instrumentation would have meant a difficult migration down the line.
  3. It uses a familiar tech stack and is easy to get up and running. We already had sufficient expertise in-house with Kafka and Cassandra.
  4. It is OSS with active community support.

At that time, when the platform’s adoption at Razorpay was sub-optimal and our developers were still warming up to the idea of distributed tracing (as a means of application debugging and performance monitoring), the focus was on the platform’s evangelisation and training developers on effective instrumentation, rather than finding the best tracing platform out there. So going with Jaeger was a no-brainer for us. However, as the platform’s adoption grew (both in complexity and scale), we quickly realized that Jaeger was going to be insufficient for our long-term needs.

From Jaeger to Hypertrace

The fundamental reason we moved from Jaeger to Hypertrace was that our eventual observability platform was envisioned to be capable of more than just end-to-end trace debugging and service dependency analysis – the primary features of Jaeger.

While the fundamental building blocks would remain the same (traces), we wanted the system to derive much more juice from them than we could from Jaeger. To achieve this, some features were needed that Jaeger fell short on:

  1. Derived Metrics: Jaeger does not provide metrics. You cannot use it to answer questions like:
  • What is the P99 duration and error rate of all calls to a particular DB?
  • How has my service’s P99 latency deviated from the normal in the last one hour?
  • What is the rate of failure between two services shown in the dependency graph?
  • What is the P95 response time between them?
  • What are the top 5 slowest endpoints of my service?
At the Time of Evaluation, Jaeger Didn’t Do Much for Metrics

Hypertrace, on the other hand, derives RED (rate, error, duration) metrics about your services out-of-the-box, as can be seen in the following dashboard.

Having these metrics right alongside their traces keeps developers from having to switch between metrics and tracing dashboards, saving time and reducing context-switch fatigue.

The above screenshot shows the overview of splitz-prod in Hypertrace. Two things are of particular interest:

  • Radar Chart (under HEALTH): The radar chart illustrates how your service’s RED metrics have changed over time. In this example, it compares these metrics over the last 15 minutes to their values over the last hour. Such information can be handy, for example, after a new deployment, when you want to monitor whether your service is deviating from the norm.
  • Top Endpoints By…: You can sort your endpoints based on metrics like calls count, error rate, latency, etc.

Hypertrace also derives metrics from custom tags. For example, if your traces have a merchantId tag, you can use Hypertrace to get metrics like duration, call count, error count, etc. for a particular merchant. The ability to do such analysis on high-cardinality tags like merchantId and customerId is a powerful feature of Hypertrace that we’ll explore in detail in later sections.
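For instance, with the OpenTelemetry Go SDK, attaching such a tag is a single attribute on the span. The handler, tracer name, span name, and attribute key below are illustrative, not prescribed by Hypertrace; this is just a minimal sketch of where the business context gets added.

```go
package payments

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

// chargePayment is a hypothetical handler used only to show where the tag is
// set; the tracer name, span name, and attribute key are illustrative.
func chargePayment(ctx context.Context, merchantID string) {
	// otel.Tracer returns the globally registered tracer (a no-op if no
	// TracerProvider has been configured), so this snippet compiles as-is.
	ctx, span := otel.Tracer("payments").Start(ctx, "charge-payment")
	defer span.End()

	// High-cardinality business context: Hypertrace can later slice duration,
	// call count, and error count by this tag in Explorer.
	span.SetAttributes(attribute.String("merchantId", merchantID))

	// ... actual payment logic, using ctx to propagate the span ...
}
```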

2. Custom Trace Enrichment: We often need to enrich our traces with custom tags that are derived from existing tags. For example, our traces can have an IP address attribute, and we might want to add a confidence score to it. A low value of this score indicates that the IP might belong to a malicious source. Such information can be very handy for security analysis later. Hypertrace lets us add such custom attributes to our traces in the ingestion pipeline; there is no way to do this in Jaeger. A conceptual sketch of such an enrichment step follows.
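The sketch below (in Go, purely for illustration) shows the shape of such an enrichment step: a function from existing span attributes to derived ones. The attribute names and the reputation lookup are hypothetical; Hypertrace’s actual enrichers are Java components running inside the ingestion pipeline.

```go
package enrich

import "strconv"

// ipReputationScore is a hypothetical lookup; in practice it might call an
// internal threat-intelligence service or a local reputation database.
func ipReputationScore(ip string) float64 {
	// ... lookup elided ...
	return 1.0
}

// Enrich derives a confidence score from an existing ip.address attribute and
// attaches it to the span's attributes. Attribute names are illustrative.
func Enrich(attrs map[string]string) map[string]string {
	if ip, ok := attrs["ip.address"]; ok {
		score := ipReputationScore(ip)
		attrs["ip.confidence_score"] = strconv.FormatFloat(score, 'f', 2, 64)
		if score < 0.3 { // low score: possibly a malicious source
			attrs["ip.suspicious"] = "true"
		}
	}
	return attrs
}
```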

3. Entity-Based Drilldowns: Entities are a standard set of tags that have a bounded set of values – for example, databases, endpoints, HTTP methods, etc. Many production debugging scenarios start with the developer knowing just an entity name. For example, you may observe that calls to a particular DB are failing in high numbers. With Jaeger, there is no easy way to use this information to get a head start: all we can do is search on the tag db.name = someDB. With Hypertrace, however, we can search for this specific DB in the Backends view:

Now we can get both metrics and traces for this DB:

Similarly, starting with the endpoint name, you can look for it in API Endpoints and drill down from there.

4. Service Dependency Analysis: Service dependency diagrams generated by Jaeger are pretty basic, offering nothing more than application dataflow and call counts. The following two screenshots compare the service dependency diagrams generated by Jaeger and Hypertrace for the ever-famous HotROD app.

Hypertrace not only shows dependencies among the system components but also derives the RED metrics corresponding to these interactions. Such metrics are very useful for quickly getting an idea of how well a service is communicating with its dependencies. In the above example, the developer can quickly see what percentage of calls from the driver service to Redis are failing simply by hovering over it. Why look there first? Because Hypertrace color-codes Redis in red, indicating something is wrong with it. These service interaction diagrams are also generated on a per-service basis.

Service Dependency Graph for splitz

In addition to the above advantages over Jaeger, Hypertrace is also:

  1. K8s native, simple, and easy to operate without a huge learning curve.
  2. OSS and has an active Slack community.
  3. Available as a PaaS offering, which would be useful if we decide to move to a managed solution down the line.

Keeping these points in mind, Hypertrace was the best choice under the prevailing circumstances. We transitioned from Jaeger to Hypertrace in January 2021.

Traces

Traces are the bread-and-butter of any distributed tracing platform, and Hypertrace is no exception. Users can see the traces associated with an entity (backend, endpoint, service) by going to that entity’s trace view. For example, traces in beethoven-prod over the past hour look like this:

Trace View lets us do quite a bit of slicing and dicing on traces. It lets us filter them by specifying predicates on attributes like duration, status code, tags, etc. It also lets us sort on comparable attributes like Duration, Start Time, etc. Clicking on a trace generates its waterfall model.

A waterfall model illustrates relationships among spans (of a trace) by using the causality information derived by Hypertrace in the backend (the OpenTracing data model is a fantastic read to learn about these causal relationships in detail).

A Span’s Context

Clicking on any span brings up its context. This context comprises attributes added by the developer during instrumentation and any additional attributes added in the ingestion pipeline.

Godmode with Explorer

Hypertrace empowers developers to ask arbitrary questions about their services using Explorer, perhaps one of its most powerful features. If your applications are instrumented with the right business and operational context, Explorer can give you deep insights into them. Let us see this with some examples.

Say we are seeing an unusually high number of create-payment calls and first want to find out which merchant(s) are making them. We start by specifying the following predicates in Explorer:

  1. Service Name = API
  2. Tags.http.path = /v1/orders

To get the temporal distribution of call rate, we choose Calls as the metric and Rate(sec) as the aggregate under Select metrics. Hypertrace provides several other metrics and aggregates, such as Duration, Exception Count, Average, Pxx, etc. We finally group the results by the tag merchant. You can group by a whole lot of attributes like Endpoint Name, HTTP Method, Service Name, etc. From the final results, we can see that merchant TtPtHDFd1d has been calling this endpoint at a high rate.

Let us figure out the routes with the highest error rate in the last hour in API:

Note that we’ve specified a regex in the tag’s value, as we want to look for any status code in the 5xx range. Hypertrace supports regex search on tags as well. Grouping the result by endpoint yields what we wanted. We can immediately see that v1/settlements/{id} has the highest error rate in the last hour.

With the right instrumentation in place, Explorer empowers developers to debug their services effectively without depending on experts. A huge win!

Operating Hypertrace at Scale

Ingestion

The journey of a span in Hypertrace starts with the collector pushing spans to the ingestion pipeline. The collector receives spans from agents running as daemons on application nodes. The ingestion pipeline is a stream of spans/traces in Kafka, with various transformers acting on it:

  1. Span Normalizer: It adds a standard set of attributes to spans as they pass through it. Span normalizer is compatible with different trace formats like OpenTracing, OpenTelemetry, Jaeger or Zipkin. This compatibility came in very handy for us, as no instrumentation changes were needed when we migrated from Jaeger to Hypertrace.
  2. Raw Spans Grouper: This is the only stateful transformation in the entire ingestion pipeline. It consumes “raw” spans emitted by Span Normalizer and groups them by traceId to create a StructuredTrace. In addition, RSG can be used to implement tail sampling in the pipeline. Tail sampling looks at the attributes of a fully assembled trace at the end of the workflow and decides whether it should be kept. For example, we might want to drop all traces with status code = 200 and keep only those with status code = 5xx. Since RSG has access to the entire trace, this logic can be implemented here (a sketch of such a predicate follows this list).
  3. Trace Enricher: As explained in the previous sections, Trace Enricher enriches your traces with custom attributes. For example, Trace Enricher will add a label db.instance = db_name for all calls to a particular database. It will also label a span as an ENTRY or EXIT span.
  4. All Views Generator: This component consumes from Trace Enricher and creates views that map 1:1 to Pinot tables.
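As promised above, here is a minimal sketch of what a tail-sampling predicate could look like. The real transformers are Kafka Streams applications in Java; the Go types below are simplified stand-ins used only to illustrate the decision, not the pipeline’s actual data model.

```go
package sampling

// Span and Trace are simplified stand-ins for the structures the Raw Spans
// Grouper assembles; the actual pipeline is written in Kafka Streams (Java).
type Span struct {
	StatusCode int
	Tags       map[string]string
}

type Trace struct {
	TraceID string
	Spans   []Span
}

// KeepTrace is a tail-sampling predicate: because the grouper sees the fully
// assembled trace, it can inspect every span before deciding. This example
// policy keeps any trace containing a 5xx span and drops purely successful
// ones; a real policy would be configurable per service.
func KeepTrace(t Trace) bool {
	for _, s := range t.Spans {
		if s.StatusCode >= 500 {
			return true // sample traces that contain errors
		}
	}
	return false // drop all-success traces
}
```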

The following diagram illustrates this series of transformations in greater detail. Dotted arrows represent synchronous communication, while solid arrows represent asynchronous communication.

All of these transformers are real-time streaming applications written in KStreams. With an average ingestion of ~100k spans/s, our P99 trace freshness remains under a minute.

Querying

Hypertrace exposes GraphQL endpoints to query data from Pinot. Requests to these endpoints are translated into PQL (Pinot Query Language) by the Query Service that interfaces with Pinot using a PinotBasedRequestHandler. Query Service can interface with other datastores using different implementations of RequestHandler (PinotBasedRequestHandler implements this interface).

Query Patterns

The following histogram shows the distribution of query intervals over a period of one week:

96% of the queries fell within the last 24 hours, and just 1.6% of the total calls reached back as far as the last 30 days. This skewed distribution gives us an opportunity to optimize our storage: optimizing for the common case first, we can store the most frequently queried data in fast storage (like SSDs) and less frequently accessed data in slow storage (like HDDs). Trading off query latency for lower cost will affect only a minuscule fraction of our queries (less than 5%). Doing this becomes even more important as our ingestion rate grows with more services onboarding the platform and as the retention period increases.

Here’s an example of such storage separation currently in the works. A Spark job writes data coming in from Kafka to the data lake (S3). The data lake has much larger retention (currently 30d). Both Pinot and data lake are queried by Presto. Alluxio is used to bring the data lake closer to Presto (compute). This speeds up reads.

The query service queries both sources based on a time boundary. For example, if the user requests data older than 7 days, it will only query the data lake. If they request data within the last 7 days, it will only query Pinot. If the requested period straddles this 7-day boundary, both data sources are queried and the results are merged. The tables in the data lake are created 1:1 with Pinot’s, so only minimal changes are needed in the query generator.
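A minimal sketch of this routing decision is shown below. The type names and structure are assumptions made for illustration; the actual query service implements this inside its RequestHandler plumbing.

```go
package querysvc

import "time"

// Source identifies a datastore; hypothetical names for illustration.
type Source int

const (
	Pinot Source = iota
	DataLake
)

// boundary is the cut-off between "hot" data in Pinot and older data in the
// S3 data lake (7 days in the current setup).
const boundary = 7 * 24 * time.Hour

// PickSources decides which stores to query for a [from, to) time range and
// whether results need to be merged. A sketch of the routing described above,
// not the actual implementation.
func PickSources(from, to, now time.Time) []Source {
	cutoff := now.Add(-boundary)
	switch {
	case !to.After(cutoff): // entire range is older than 7 days
		return []Source{DataLake}
	case !from.Before(cutoff): // entire range is within the last 7 days
		return []Source{Pinot}
	default: // range straddles the boundary: query both and merge
		return []Source{DataLake, Pinot}
	}
}
```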

Further, Pinot recognizes this heterogeneity as well and automatically moves its segments (data partitions) between multiple storage tiers. More recent segments are stored in faster, more expensive tiers (with SSDs), while older segments are moved to slower, cheaper tiers (with HDDs). Combined with the previous solution, this optimizes storage costs to a large extent.

High Cardinality Tags

One particular feature worth mentioning is how Hypertrace handles high-cardinality tags. In systems like Prometheus, queries become very slow if the data has many labels with high cardinality, because the number of time series Prometheus needs to store and index grows multiplicatively with each label's cardinality (for instance, a merchantId label across 500k merchants combined with 50 routes already yields 25M distinct series for a single metric). Due to this limitation, one cannot use a label like merchantId that can have thousands of values. So if you would like to have separate dashboards for a particularly high-value merchant, Prometheus isn’t the right tool for it.

Hypertrace faces no such problems when dealing with such high cardinality tags. This lets developers add tags like merchantId and customerId that can have millions of unique values, which can later be used to do all sorts of slicing and dicing using Explorer.

In the above example, we use Explorer to get the count of calls per endpoint for a particular merchant in API.

The ability to handle such high-cardinality tags combined with its powerful analytic capabilities makes Hypertrace a very capable system for fine-grained monitoring and analysis.

Challenges and Learnings

Over the course of the last year, from the time the need for a distributed tracing platform was realized at Razorpay to today, when Hypertrace has been adopted by ~35% of the services in the org, the journey has had its fair share of challenges, as explained below:

  1. Sampling: We sample 100% of traces in production. While this may seem wasteful and expensive, it was done for the following reasons:
  • We do not have a very good heuristic for head-based sampling. For example, we cannot sample for a specific set of merchants for specific routes, as this number can explode quickly (For 500k merchants and 50 routes, we have around 25M combinations to decide on). However, we do skip /health and /status endpoints, in addition to any other endpoints that developers explicitly ask to omit.
  • We could have resorted to tail sampling using Raw Spans Grouper but that would have required us to add more compute resources, as each and every trace would need to be analyzed for certain predicates. With 100% sampling, the trade-off was more storage costs, which is still cheaper than the other option.
  • Hypertrace is a first-class debugging tool for us, and any missing data can lower developer confidence in the platform. Additionally, as Hypertrace derives its metrics from spans, missing spans can lead to incorrect metrics. This is again unacceptable as Hypertrace is a source of preliminary metrics for many teams.
  • Extending the previous point, developers also use Hypertrace to set up alerts for their services. Inaccurate metrics will make these alerts effectively useless.

We will be covering sampling in detail in a future blog post.
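As a concrete illustration of the endpoint-skipping mentioned above, the following sketch drops spans for /health and /status at the head, using the OpenTelemetry Go SDK’s Sampler interface. It assumes span names start with the HTTP route; our actual skip mechanism and skip list may differ.

```go
package sampling

import (
	"strings"

	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// endpointSkippingSampler samples everything except spans for endpoints we
// explicitly don't care about (e.g. /health and /status). The skip list is
// illustrative; in practice it is extended per service on request.
type endpointSkippingSampler struct {
	skip []string
}

func (s endpointSkippingSampler) ShouldSample(p sdktrace.SamplingParameters) sdktrace.SamplingResult {
	for _, prefix := range s.skip {
		if strings.HasPrefix(p.Name, prefix) {
			return sdktrace.SamplingResult{Decision: sdktrace.Drop}
		}
	}
	// Everything else is sampled, i.e. effectively 100% sampling.
	return sdktrace.SamplingResult{Decision: sdktrace.RecordAndSample}
}

func (s endpointSkippingSampler) Description() string {
	return "sample everything except health/status endpoints"
}

// Usage (assuming span names start with the HTTP route):
//   tp := sdktrace.NewTracerProvider(
//       sdktrace.WithSampler(endpointSkippingSampler{skip: []string{"/health", "/status"}}),
//   )
```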

2. Long Term Retention: 100% sampling at our scale leads to increased storage requirements. As described in the ‘Querying’ section, data from the last 24 hours is queried most frequently. Our current retention period is 7 days, but we have had instances in which users want to debug older issues. To support such requirements, we need to increase the retention period (to 30 days for now, though it should be configurable). However, the cost of storing a month’s worth of data on Pinot’s servers can be prohibitive. Also, both the daily ingestion volume and the retention period can increase in the future, so our storage implementation has to be independent of these values. In addition to the solutions explained in the previous sections, we are looking at two more ways to solve this:

  • Using the deep-store (S3 in our case) as a possible (and only) storage tier, besides Pinot servers. “Freezing” old segments into the deep-store and “thawing” them on-demand is a WIP. This solution has the drawback that “thawing” segments on-demand will have high response times. However, for our use case, it will affect only a minuscule number of queries (less than 5%).
  • Use StarTree’s distribution of Pinot that already has tiered storage productionized.

3. Quality of Instrumentation: As we have seen in the section on Explorer, having the right metadata in your traces is essential for leveraging Hypertrace fully. The job of effective instrumentation lies with developers, who often are not aware of this “caveat” of Hypertrace. This leads to multiple rounds of instrumentation (and disappointment), with developers reinstrumenting their applications once they find out what they need is not actually present in Hypertrace. We have tried to address the problem in two ways:

  • Creating a standard tracing package for Golang and PHP. This package adopts a “convention over configuration” approach by adding important sets of tags for different types of applications. For example, for Kafka, it adds a type=consumer/producer tag by default (a sketch of such a helper follows this list).
  • Allocating additional bandwidth to help developers with any instrumentation issues they might be facing. This also includes taking ownership of any code reviews related to Hypertrace integration.
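To show what “convention over configuration” looks like in practice, here is a sketch of the kind of helper such a package can expose, written with the OpenTelemetry Go API. The helper name and exact tag keys are illustrative, not our package’s actual interface.

```go
package tracing

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

// StartKafkaConsumerSpan is a hypothetical helper: callers pass only the
// topic, and the conventional tags (type=consumer, the topic name, the span
// kind) are added for them by default.
func StartKafkaConsumerSpan(ctx context.Context, topic string) (context.Context, trace.Span) {
	return otel.Tracer("kafka").Start(ctx, "kafka.consume",
		trace.WithSpanKind(trace.SpanKindConsumer),
		trace.WithAttributes(
			attribute.String("type", "consumer"), // the default tag mentioned above
			attribute.String("messaging.destination", topic),
		),
	)
}
```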

Notwithstanding these measures, effective instrumentation is a constantly moving target.

Razorpay and Hypertrace

Hypertrace is a strategic platform for Razorpay, and we have committed to contributing to it in the long term. Most of our contributions are the result of our own requirements. Some of them include: drill-downs in charts to help investigate anomalies, improved caching for better UX, a widget for loading a custom iFrame into the layout, multiple markers and tooltip correlation across graphs to help narrow down on the root cause, amending Explorer's behavior to retain certain filters when switching tabs, analytics telemetry support (adding support for Rudderstack), custom dashboards from saved queries (WIP), support for Postgres in the document store, support for the follows-from construct, etc.

The community can be found on Slack.

Edit: Jaeger now has support for metrics. However, this capability was added after we evaluated both tools in December 2020.

We have written on our observability stack earlier in the following posts:

  1. Scaling to Trillions of Metric Data Points.
  2. Tracing 100k Spans in a Single Request.
  3. Tracing our Observability Journey.

A special thanks to Arjun for helping us with the design elements in this post.

We are Hiring!

If you love working on exciting challenges, we are actively hiring for our Engineering team. We are a bunch of ignited minds simplifying financial infrastructure for businesses, and we’re always looking out for great folks. We are looking for both Engineering Managers and Senior Engineers to break barriers and build for the future. Apply now.
