How Razorpay’s Notification Service Handles Increasing Load

Anand Prakash
Razorpay Engineering
7 min read · May 16, 2022

What is the Notification Service?

The notification service is a platform service that handles all customer notification requirements, such as SMS, email, webhooks, etc. The system was designed a while ago and has been performing reliably. But as the number of transactions grew exponentially, a few challenges popped up, especially with webhooks, the most widely used form of notification (this blog focuses on webhooks only). To ensure we meet our SLAs, and for better long-term evolution of the service, we had to go back to the drawing board and rethink the design.

Webhooks (also called web callbacks, HTTP push APIs, or reverse APIs) are a way for one web application to send information to another application in real time when a specific event happens.

Example
Suppose you have subscribed to the order.paid webhook event; you will then receive a notification every time a user pays you for an order.
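
To make this concrete, here is a minimal sketch of what a merchant-side receiver for such an event could look like. This is purely illustrative and not Razorpay's documented integration: the route path, the helper function, and the payload handling are assumptions, and signature verification is omitted.

```python
# Minimal sketch of a merchant-side webhook receiver (illustrative only).
# Route path, payload handling, and the helper below are assumptions;
# signature verification is deliberately omitted for brevity.
from flask import Flask, request

app = Flask(__name__)

@app.route("/webhooks/razorpay", methods=["POST"])
def handle_webhook():
    event = request.get_json(force=True)
    if event.get("event") == "order.paid":
        mark_order_paid(event)  # hypothetical merchant-side fulfilment logic
    # Respond quickly with a 2xx so the sender does not treat the delivery as failed.
    return "", 200

def mark_order_paid(event: dict) -> None:
    """Placeholder for the merchant's own order-fulfilment logic."""
    pass
```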

Existing Notifications Flow

  1. API pods receive the notification request; after validation, it is pushed to an SQS queue, from which a worker reads it and sends the notification.
  2. Workers write the result of this execution to the database and push it to the data lake.
  3. A scheduler runs periodically, reads the requests that have to be retried, and pushes them back to SQS for processing.
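
For orientation, the worker side of this flow can be pictured roughly as in the sketch below. This is a minimal sketch, not the actual service code: the queue URL, payload fields, and the result-recording helper are assumptions.

```python
# Minimal sketch of the existing worker loop (illustrative only).
# Queue URL, payload fields, and record_attempt() are assumptions.
import json
import boto3
import requests

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.ap-south-1.amazonaws.com/123456789012/notifications"  # hypothetical

def record_attempt(notification_id: str, status: str) -> None:
    """Placeholder: persist the attempt result to the DB and push it to the data lake (step 2)."""
    pass

def process_forever() -> None:
    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20)
        for msg in resp.get("Messages", []):
            event = json.loads(msg["Body"])
            try:
                # Deliver the webhook to the customer's endpoint.
                r = requests.post(event["callback_url"], json=event["payload"], timeout=30)
                status = "success" if r.ok else "failed"
            except requests.RequestException:
                status = "failed"
            record_attempt(event["id"], status)
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
            # Failed attempts are later picked up by the scheduler and re-queued (step 3).
```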

The existing system could handle a load of up to 2K TPS, and we generally experience a peak load of up to 1K TPS. But at these levels, system performance starts degrading: the p99 latency increases from ~2 seconds to ~4 seconds, and it can degrade further depending on the load, customers' response times, and so on.

Challenges We Found While Scaling Up

While we were onboarding new use cases and adoption of the service grew, we encountered a few scalability issues:

  1. Read query performance degraded as the data in the table grew. Although there is a replica DB, it also peaks as the number of write operations increases.
  2. The worker pods can, in principle, be scaled without limit. However, because the DB is constrained by IOPS, the pods could not actually be scaled up, since adding pods degrades the DB. Each time the DB became a bottleneck due to high IOPS, our short-term fix was to increase the size of the database from 2xlarge to 4xlarge to 8xlarge, but this was not a long-term solution.
  3. In the payments system, all operations must be performed within a specified SLA. But variable factors, like customers' response times for webhooks and the incoming TPS, made it challenging to complete operations within the SLAs, as we could not scale up due to the DB limitations.
  4. The load also increases unexpectedly on specific events or days, such as quarter closings, special market trading occasions like festivals, annual flagship e-commerce sales, the World Cup, the IPL, etc. On such days, notifications were getting impacted.

The solution was tricky: we were not looking to build a system for a fixed load, as the load would only keep increasing year on year. What could a possible solution look like? We wanted to tame the problem, and picked up a few initiatives in that direction:

  • Prioritizing incoming load
  • Reducing DB bottleneck
  • Managing SLAs with unexpected response times from customer’s servers
  • Reducing detection and resolution times

Prioritizing Incoming Load

We made two significant findings about the incoming requests:

  1. Not all notification requests are equally critical.
  2. One type of notification should not impact the delivery of other notifications.

To act on these findings, we tried to understand the consumers' SLA expectations and the impact of each notification. Once we had a fair understanding of each notification, its impact, and its SLA, we created priority queues and pushed events into them based on their criticality.

P0: All the critical events with the highest impact on business metrics are pushed to this queue. This is an exclusive queue, and only a few notification types go into it.

P1: This is the default queue; all notifications other than P0 go into it.

P2: All burst events (very high TPS in a short span) go to this queue.
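
A simplified way to picture this routing is a static criticality map with P1 as the default, as in the sketch below. The event names and queue URLs are hypothetical, not the actual configuration.

```python
# Illustrative priority-based routing; event names and queue URLs are hypothetical.
P0_QUEUE = "https://sqs.ap-south-1.amazonaws.com/123456789012/notify-p0"
P1_QUEUE = "https://sqs.ap-south-1.amazonaws.com/123456789012/notify-p1"
P2_QUEUE = "https://sqs.ap-south-1.amazonaws.com/123456789012/notify-p2"

# Only a handful of business-critical event types map to P0,
# known bursty events map to P2, and everything else defaults to P1.
PRIORITY_MAP = {
    "payment.captured": P0_QUEUE,   # hypothetical critical event
    "bulk.report.ready": P2_QUEUE,  # hypothetical burst event
}

def queue_for(event_type: str) -> str:
    return PRIORITY_MAP.get(event_type, P1_QUEUE)
```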

Thus, by segregating the priorities, we reduced the impact of any rogue event on other events. But what if one of the events within a queue goes rogue? That would still lead to SLA breaches. To reduce this impact, we introduced a rate limiter on these queues and events.

Rate Limiter Workflow

  1. For the P0 queue, every event has a limit of X. If an event breaches this limit, it is sent to the Rate Limit queue. This ensures that other events are not impacted.
  2. For the P1 queue, every event has a limit of Y. The number of events in this queue is much larger, and since these events are not of high priority, their maximum limit is lower than that of P0.

Further, the rate limits can be customised based on consumers' requests and the load pattern observed for an event.
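
A minimal sketch of such a per-event limit is shown below: a fixed-window counter keyed by event type, with over-limit events diverted to a separate rate-limit queue. The limit values, window size, and queue name are assumptions; a production version would typically keep the counters in a shared store such as Redis and expire old windows.

```python
# Illustrative per-event rate limiter (fixed one-second window, in-memory).
# Limits and the queue URL are assumptions; old windows are not pruned in this sketch.
import time
from collections import defaultdict

RATE_LIMIT_QUEUE = "https://sqs.ap-south-1.amazonaws.com/123456789012/notify-rate-limited"  # hypothetical

class EventRateLimiter:
    def __init__(self, limit_per_second: int):
        self.limit = limit_per_second
        self.window = defaultdict(int)   # (event_type, second) -> count

    def allow(self, event_type: str) -> bool:
        key = (event_type, int(time.time()))
        self.window[key] += 1
        return self.window[key] <= self.limit

p0_limiter = EventRateLimiter(limit_per_second=100)  # the "X" limit for P0 events (hypothetical value)
p1_limiter = EventRateLimiter(limit_per_second=50)   # the "Y" limit for P1 events (hypothetical value)

def route(event_type: str, preferred_queue: str, limiter: EventRateLimiter) -> str:
    # Over-limit events are diverted so they cannot starve other events in the same queue.
    return preferred_queue if limiter.allow(event_type) else RATE_LIMIT_QUEUE
```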

Event prioritisation with rate limiting secured the system against DoS attacks and helped us build a multi-tenant system.

Reducing DB Bottleneck

In our earlier implementation, as traffic on the system increased, the worker pods that process events kept auto-scaling along with the load. With the increase in pod count, IOPS on the DB also increased and slowed the DB down, which pushed up the overall latency of the system. Of course, we scaled the DB vertically to keep up with this growth. This approach worked well for some time, but soon we reached a cliff where increasing the size of the database was not helping anymore!

We needed either a horizontally scalable DB or a change in architecture to contain the ever-increasing load on the system, while working within time constraints and keeping the risk of regressions low. We finally found a fix: writing the data to the DB asynchronously. The diagram below explains the new flow:

In this flow, we introduced a stream between the workers and the DB writes. All messages are pushed to the stream, and a DB writer consumes them and writes them to the DB (all messages pushed to Kinesis are idempotent). This gave us control over the rate at which we write to the DB.
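
A rough sketch of the producer and the throttled DB writer is below. The stream name, record shape, rate cap, and upsert helper are assumptions; the key ideas are that workers only put records on the stream, and a single writer controls the DB write rate using idempotent writes.

```python
# Illustrative sketch of the async DB-write path; stream name, schema, and rates are assumptions.
import json
import time
import boto3

kinesis = boto3.client("kinesis")
STREAM_NAME = "notification-attempts"  # hypothetical

def publish_attempt(attempt: dict) -> None:
    # Workers no longer touch the DB; they only put the attempt record on the stream.
    kinesis.put_record(
        StreamName=STREAM_NAME,
        Data=json.dumps(attempt).encode("utf-8"),
        PartitionKey=attempt["notification_id"],
    )

def db_writer(records, max_writes_per_second: int = 500) -> None:
    # A single consumer drains the stream and writes at a bounded rate,
    # so DB IOPS stay flat no matter how many worker pods are running.
    for record in records:
        upsert_attempt(json.loads(record["Data"]))
        time.sleep(1.0 / max_writes_per_second)

def upsert_attempt(attempt: dict) -> None:
    """Placeholder: an idempotent write, e.g. an upsert keyed by the attempt id."""
    pass
```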

And the Results…

  1. As the IOPS on the database can now be throttled, the number of messages written to the DB stays constant even when the load is high. Hence, the DB is protected from degradation, and we could scale up better with the existing DB.
  2. This approach moves the work of sending data to the data lake and writing the attempt audit to the DB out of the worker, giving a better separation of concerns.
  3. Worker pods can be scaled to higher numbers without degrading database performance.
  4. This also segregated the retry queue from the main queue, increasing the overall service quality of the system.

Managing SLAs With Unexpected Response Times from Customer’s Servers

One of the key variables in maintaining the delivery SLAs is the customer's response time. When a webhook call is made, the system waits for the customer's response to ensure that the execution is complete. But some customer servers might not respond well and take longer than expected, blocking our workers. Such slowness from a few customers has the potential to affect overall system performance.

We came up with the concept of quality of service (QoS) for customers. If the response time from a customer's servers goes beyond a limit, say 5 minutes, we decrease the priority of that customer's events for the next few minutes. After the time limit, we re-check whether the customer's servers are responding well.
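
In spirit, the check can be as simple as the sketch below: track the last observed response time per customer, and if it crosses the threshold, demote that customer's events to a lower-priority queue until a cooldown expires. The threshold value, cooldown duration, and queue arguments are assumptions.

```python
# Illustrative QoS-based demotion; threshold, cooldown, and queue names are assumptions.
import time

SLOW_THRESHOLD_SECONDS = 300      # e.g. the "5 minutes" limit mentioned above
DEMOTION_COOLDOWN_SECONDS = 600   # how long the customer stays demoted (hypothetical value)

_demoted_until: dict[str, float] = {}   # customer_id -> epoch seconds until which they stay demoted

def record_response_time(customer_id: str, response_seconds: float) -> None:
    if response_seconds > SLOW_THRESHOLD_SECONDS:
        _demoted_until[customer_id] = time.time() + DEMOTION_COOLDOWN_SECONDS

def queue_for_customer(customer_id: str, normal_queue: str, low_priority_queue: str) -> str:
    # After the cooldown, the customer automatically gets normal priority back
    # and is re-evaluated based on their next response times.
    if time.time() < _demoted_until.get(customer_id, 0.0):
        return low_priority_queue
    return normal_queue
```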

And the Results

  1. One or more misbehaving customers' systems no longer impact all the events. Thus, we could better control our SLA misses, reducing the GMV loss.
  2. When a few customers' servers degrade, alerts can be configured to inform their stakeholders. This helps the customers correct issues at their end.

Reducing the Detection and Resolution Time

Good observability is essential in a critical system like the notification service to reduce the time taken to detect and resolve issues. We worked on all three pillars of observability for better monitoring of the system:

  1. Designed dashboards and alerts in Grafana to detect anomalies and monitor the system's health regularly, such as alerts on rogue events, rate-limited events, customers with high response times, success rates, etc.
  2. Created dashboards on the logs to detect various error patterns.
  3. Integrated distributed tracing to better understand the behaviour of the various components of the system.
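
As a small illustration of the metrics side, delivery outcomes and customer response times can be exported as Prometheus-style metrics and then graphed and alerted on in Grafana. The metric names and labels below are hypothetical, not the service's actual instrumentation.

```python
# Illustrative metrics instrumentation; metric names and labels are hypothetical.
from prometheus_client import Counter, Histogram

WEBHOOK_DELIVERIES = Counter(
    "webhook_deliveries_total",
    "Webhook delivery attempts by event type and outcome",
    ["event_type", "status"],          # status: success / failed / rate_limited
)
CUSTOMER_RESPONSE_TIME = Histogram(
    "webhook_customer_response_seconds",
    "Time taken by the customer's server to respond to a webhook",
    ["customer_id"],
)

def observe_delivery(event_type: str, status: str, customer_id: str, response_seconds: float) -> None:
    WEBHOOK_DELIVERIES.labels(event_type=event_type, status=status).inc()
    CUSTOMER_RESPONSE_TIME.labels(customer_id=customer_id).observe(response_seconds)
```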

Notifications are a critical part of any financial system, and they must be delivered within their time limits. With the above few initiatives, we could reduce the quantum of the problem without making any breaking changes. This gave us the space to think deeper and devise better solutions for the long term. If there is a big problem and the answer is complex, try reducing the size of the problem. At times, that alone yields results, as it did in our case!

Up for the challenge? 🚀

This wouldn’t have been possible without our awesome team (Ashish Tyagi, Abdul Rauf, Manas Bagai, Varun Mukundhan, Anand Prakash, and Gorang). If our work excites you, we are actively hiring for our Engineering team. We are a bunch of ignited minds simplifying financial infrastructure for businesses and we’re always looking out for great folks. Apply via our jobs page.
