Go Consuming All Your Resources?

Akshay Gupta
Razorpay Engineering
6 min read · Jun 8, 2022

COVID-19 acted as a catalyst for accelerating digital payments in India, and the traffic served by Razorpay grew multifold. One of our primary microservices started underperforming at scale and became a bottleneck for the overall growth of payments.

We had recently decommissioned a microservice, which was acting as a Rule Engine platform. We converted and abstracted the logic of this Rule Engine as a library instead and incorporated it into a new host microservice. This rule engine library, which was memory/CPU heavy by design, began to create unexpected performance issues on the host service:

  • ~60% drop in performance
  • Memory usage spiked to ~4 GB
  • Response latencies crossed 2 seconds
  • Aggressive horizontal scaling was required during peak traffic, which meant expensive infra
  • Error rates kept increasing

Our infrastructure setup was standard: a Go web service running on Kubernetes (EKS) over on-demand EC2 instances. The issue was happening in one of the most critical services of our core payment ecosystem, so it was time to act, fast.

It was a rollercoaster ride, discarding one theory after another before we cracked it.

Problem Identification and Fix

Memory Leak

The memory graph from a non-production environment (a similar pattern showed up in production) told an interesting story: memory usage stayed high even after the traffic stopped!

Our first assumption was a memory leak, so we decided to profile the service.
  • Heap (in-use) before traffic surge: 62.46 MB
  • Heap (in-use) after traffic surge: 67.38 MB

Findings: Analysing the profiling data captured with pprof before and after the traffic surge gave us the numbers above. This was clearly not a memory leak, because heap (in-use) before (~62 MB) and after (~67 MB) was almost the same.
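
For context, heap profiles like these are typically captured by exposing the standard net/http/pprof endpoints in the service. A minimal sketch (the port and wiring here are illustrative, not necessarily how our service exposes it):

    package main

    import (
        "log"
        "net/http"
        _ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
    )

    func main() {
        // Serve the profiling endpoints on an internal-only port.
        go func() {
            log.Println(http.ListenAndServe("localhost:6060", nil))
        }()

        // ... rest of the service ...
        select {}
    }

A before/after pair of heap snapshots can then be compared with go tool pprof http://localhost:6060/debug/pprof/heap.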

Conclusion: The memory leak theory was discarded.

Further Investigation

After ruling out a memory leak as the root cause, we continued investigating by adding extensive log traces to capture metrics such as heap memory, idle memory and released memory.

To do that, we added a loop to the microservice that dumped various garbage collector metrics (in MB) every 2 minutes.
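
A sketch of that loop, built around runtime.ReadMemStats (the exact fields and log format here are an approximation of what we logged):

    package main

    import (
        "log"
        "runtime"
        "time"
    )

    func bToMb(b uint64) uint64 { return b / 1024 / 1024 }

    // dumpMemStats logs key garbage-collector metrics (in MB) every 2 minutes.
    func dumpMemStats() {
        for {
            var m runtime.MemStats
            runtime.ReadMemStats(&m)
            log.Printf(
                "HeapAlloc=%vMB HeapInuse=%vMB HeapIdle=%vMB HeapReleased=%vMB Sys=%vMB NumGC=%v",
                bToMb(m.HeapAlloc), bToMb(m.HeapInuse), bToMb(m.HeapIdle),
                bToMb(m.HeapReleased), bToMb(m.Sys), m.NumGC,
            )
            time.Sleep(2 * time.Minute)
        }
    }

    func main() {
        go dumpMemStats() // started once at service boot
        select {}         // stand-in for the rest of the service
    }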

Memory stats before traffic surge (in MB)
Memory stats after traffic surge (in MB)

Findings: These stats confirmed that the in-use heap was back to normal (consistent with what pprof showed). But idle heap memory was drastically different: it was the outlier, holding on to the remaining memory.

What exactly is HeapIdle?

  • As per the official Go documentation, HeapIdle is bytes in idle (unused) spans. Idle spans have no objects in them. These spans could be (and may already have been) returned to the OS, or they can be reused for heap allocations, or they can be reused as stack memory.
  • Also, HeapIdle minus HeapReleased estimates the amount of memory that could be returned to the OS, but is being retained by the runtime so it can grow the heap without requesting more memory from the OS.
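
In terms of the stats loop above, that estimate is just the difference of two fields. A trivial helper (shown only to make the relationship explicit; it reuses the runtime.MemStats value from the dump loop):

    // retainedHeapMB estimates how much heap (in MB) the runtime is holding on
    // to but could return to the OS, per the HeapIdle/HeapReleased definitions above.
    func retainedHeapMB(m *runtime.MemStats) uint64 {
        return (m.HeapIdle - m.HeapReleased) / 1024 / 1024
    }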

Now, the next question was why the free memory in HeapIdle was not getting released to the OS (even after a cool-off period of 5–10 minutes).

On further analysis, we found this in the Go 1.12 release notes:

  • On Linux, the runtime now uses MADV_FREE to release unused memory. This is more efficient but may result in higher reported RSS. The kernel will reclaim the unused data when it is needed. To revert to the Go 1.11 behavior (i.e. MADV_DONTNEED), set the environment variable GODEBUG=madvdontneed=1.
  • So, unless other memory-hungry services are running on the same instance, the RSS, which is essentially the apparent amount of memory the service is consuming, will not drop.
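
For completeness, reverting to the 1.11 behaviour is only a matter of setting that variable in the process environment before the binary starts (the binary name below is illustrative; on Kubernetes it would go into the container's env section of the deployment spec):

    # Revert the Go runtime to MADV_DONTNEED (Go 1.11 behaviour) for this process
    GODEBUG=madvdontneed=1 ./payment-service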

Setting the flag mentioned in the 1.12 release notes was simple and could have resolved the issue. But it was a ticking time bomb: the backward-compatibility flag could be removed in any future release, leaving us with the same problem all over again. (The microservice in question was running on Go 1.13.)

Fast-forward three Go versions: the change was reverted in Go 1.16 (see the Go 1.16 release notes), which made MADV_DONTNEED the default again. But 1.16 had only just been released while we were working on this and was too new to be included in our tech stack.

Conclusion: Although we could have reverted to the 1.11 behaviour or simply upgraded to 1.16, we decided to find a long-term solution instead and started working on preventing the application's memory from growing this high in the first place.

Finding Answers

Why was heap idle spiking during peak traffic?

Root Cause 1
There was one object in memory (an array of big hash-maps holding important data for rule engine processing) that was being created and destroyed for every payment request. During traffic surges, this churn rate spiked sharply and pushed HeapIdle up to 3–4 GB.
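
In heavily simplified form, the allocation pattern looked something like this (the types, names and sizes below are invented for illustration; the real object lives inside the rule engine library):

    package main

    import "fmt"

    type payment struct{ amount int }

    // buildRuleSet allocates a fresh slice of large maps. Doing this for every
    // request means each payment churns megabytes of short-lived heap objects.
    func buildRuleSet(n int) []map[string]string {
        rules := make([]map[string]string, n)
        for i := range rules {
            rules[i] = make(map[string]string, 10000) // "big" map; size is made up
        }
        return rules
    }

    func evaluate(p payment) bool {
        rules := buildRuleSet(50) // created per payment request...
        _ = rules                 // ...used once, then left for the GC to reclaim
        return p.amount > 0
    }

    func main() {
        fmt.Println(evaluate(payment{amount: 100}))
    }

Under a traffic surge the GC keeps up with freeing these objects (so heap in-use stays low), but the freed spans pile up as HeapIdle instead of being returned to the OS.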

Here is a sneak peek at how heap (in-use) and heap (idle) changed with the count of hash-maps in the array.

Note: CPU and request/sec were kept constant with no limits on memory

Root Cause 2
We observed another interesting fact while running the above tests: with the maximum memory fixed, the count of hash-maps, the CPU required and the requests per second the service could sustain were all related.

Note: Memory allowed was constant with no limits on CPU
  • Count of hash-maps ∝ CPU required ∝ 1 / (requests per second)
  • Memory (in-use) remained under 250 MB and did not deviate much

Findings

  • The microservice was consuming an insane amount of resources (primarily memory). Since the microservice was critical, the horizontal autoscaling rules were very lenient: we scaled the application aggressively, so a large number of servers got spawned during high traffic.
  • Every newly spawned server drove up database CPU usage with a burst of calls, since the application needed to warm its cache during boot-up.
  • This created a ripple effect impacting multiple services and hence the overall stability of the payment system.

This inevitably increased infrastructure cost, and every traffic surge became a nightmare of noisy alerts for the on-call engineers.

Here Comes the Fix

Fixing an issue becomes easy once the cause is known. Our RCA boiled down to the following points:

  • We could keep HeapIdle within limits (under 300 MB) by reducing the number of hash-maps created and destroyed per payment request (a sketch of the general idea follows this list).
  • We could fix the cascading issues, such as database CPU crossing its threshold, by provisioning more CPU per server, so that fewer servers need to be spawned during peak traffic.
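
As an illustration of the first point (a sketch of the general idea, not the exact change we shipped): if the rule data is effectively read-only, the maps can be built once and shared across requests; sync.Pool is the usual alternative when per-request mutation is needed.

    package main

    import "sync"

    var (
        rulesOnce sync.Once
        rules     []map[string]string
    )

    // buildRuleSet is the same hypothetical builder from the earlier sketch.
    func buildRuleSet(n int) []map[string]string {
        rs := make([]map[string]string, n)
        for i := range rs {
            rs[i] = make(map[string]string, 10000)
        }
        return rs
    }

    // sharedRuleSet builds the big maps exactly once and reuses them for every
    // payment request, so HeapIdle no longer balloons with short-lived copies.
    func sharedRuleSet() []map[string]string {
        rulesOnce.Do(func() {
            rules = buildRuleSet(50)
        })
        return rules
    }

    func main() {
        // Every request now reads from the same shared rule set.
        _ = sharedRuleSet()
    }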

And, that’s it. We fixed both of these problems and eventually our infra was at peace with Golang internals!

To Summarise

The new changes that we incorporated had a direct impact on reducing the memory usage and the number of servers, thereby reducing the infrastructure cost drastically.

As fewer servers were spawned, thanks to the new scaling configurations, database calls during boot-up had also reduced significantly.

The numbers say it all…

Happy gophers all around :)

Come Work With Us! 🚀

If the work we do excites you, we are actively hiring for our engineering team. We are always looking out for great folks. Apply via our jobs page or reach out to us at tech-hiring@razorpay.com

