Reducing Data Platform Cost by $2M

Aakash Mandlik
Razorpay Engineering
9 min read · Jun 21, 2023


Welcome to the Razorpay technology blog! This post focuses on the cost savings delivered by the Data Platform team.

We have some exciting news to share with you! Through our efforts to reduce costs, we successfully managed to cut down our platform expenses by approximately $2M per year.

In this blog, we will reveal the secrets behind our success and provide valuable insights into how we improved our operations, increased efficiency, and paved the way for a more sustainable and cost-effective future.

High-level Data Platform architecture

Data Platform

  • We collect data from transactional (OLTP) systems such as MySQL and push it to S3 via CDC pipelines. These CDC pipelines replicate changes from MySQL to Kafka using the Maxwell replicator.
  • Spark pipelines run every few minutes, read the latest data from Kafka, push it to S3 in Parquet/Delta Lake format, and update the event metadata (a minimal sketch of such a pipeline follows this list).
  • We have Trino as a query engine, which powers the majority of analytics at Razorpay.
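
A minimal sketch of the kind of ingestion pipeline described above, assuming a Spark Structured Streaming job writing Delta files to S3; the broker, topic, schema, and paths are illustrative placeholders, not our actual configuration:

```python
# Sketch only: read Maxwell CDC events from Kafka and append them to a Delta
# table on S3 in micro-batches. Broker, topic, schema, and paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("cdc-to-delta").getOrCreate()

# Maxwell emits one JSON document per row change; only a few fields are modelled here.
cdc_schema = StructType([
    StructField("database", StringType()),
    StructField("table", StringType()),
    StructField("type", StringType()),   # insert / update / delete
    StructField("data", StringType()),   # changed row as a JSON string
    StructField("ts", TimestampType()),
])

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")  # placeholder broker
    .option("subscribe", "mysql.cdc.events")          # placeholder topic
    .load()
    .select(from_json(col("value").cast("string"), cdc_schema).alias("e"))
    .select("e.*")
)

(
    events.writeStream
    .format("delta")
    .option("checkpointLocation", "s3://example-bucket/checkpoints/cdc_events/")
    .trigger(processingTime="5 minutes")                # "runs every few minutes"
    .start("s3://example-bucket/datalake/cdc_events/")  # placeholder table path
)
```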

Merchant Reporting

  • We expose a dashboard where merchants can download reports on the transactions made on their platform via Razorpay.
  • We serve almost 10k reports daily to merchants.
  • A report is essentially data fetched from data-lake tables and delivered to the merchant in the required format; the data read for a single report can run into a few TBs (see the sketch after this list).
  • Maintaining data at this scale, and serving reads while handling traffic spikes, is a challenging task.
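
For illustration, a single report run boils down to something like the sketch below; the table name, columns, and paths are assumptions, not the actual reporting schema:

```python
# Sketch only: read the denormalised data-lake table, filter by merchant and date
# range, and write the result in the format requested by the merchant.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("merchant-report").getOrCreate()

report_df = (
    spark.read.table("reporting.payments_denormalised")  # hypothetical table
    .where(col("created_date").between("2023-06-01", "2023-06-07"))
    .where(col("merchant_id") == "MID12345")              # placeholder merchant
    .select("payment_id", "amount", "currency", "status", "created_at")
)

# Deliver a single CSV file that the merchant can download from the dashboard.
report_df.coalesce(1).write.mode("overwrite").option("header", True).csv(
    "s3://example-bucket/reports/MID12345/2023-06-07/"
)
```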

Cost Concerns

As part of the organisation-wide cost optimisation initiative at Razorpay, we calculated the cost of running the workloads and managing the platform. The total cost of operating the data platform was very high.

As visible in the chart below, the AWS and DBU cost for the data platform increased every month from August 2022 to December 2022. The major concern was to optimise the cost so that it does not rise again in the future and stays stable without ongoing manual effort.

Thus our main goal was to reduce the cost without affecting the availability, success metrics, and SLA guarantees.

Identification Of Problems

One of the trickiest questions is “where do we start with cost optimisation?”. More often than not, things are not set up ideally in the first place, so identifying and segregating the problems is the first step. The scope of this section is problem identification only; the solutions follow in the next section.

The two major components where we started looking were:

Storage

S3

  • S3 storage is generally cheap, which is exactly why most of us take it for granted.
  • We use S3 to store development and testing data. The common pattern was to run a test and leave the files in staging buckets, never to be used again.
  • Stakeholder check: we reached out to stakeholders to find out whether they needed the data for longer, or whether we could set a lifecycle policy to purge data that is no longer used.
  • Merchant reporting data locality: merchant reporting data is served from denormalised (pre-joined) tables built on top of raw tables. Most merchant-reporting queries on the data lake filter on merchant_id, but there is no data locality: a single file can contain data for many merchants in no particular order. Reading data for one merchant therefore requires reading all the files and then shuffling and filtering to get the required rows.

EBS savings

Amazon Elastic Block Store (Amazon EBS) is an easy-to-use, scalable, high-performance block-storage service designed for Amazon Elastic Compute Cloud (Amazon EC2).

How did we use it?

  • Jobs had been configured with EBS autoscaling on, which means storage can scale up and down multiple times while a job is running.
  • Because of this, jobs were scaling to more storage than they actually needed. For example, some jobs scaled EBS storage to 3 volumes of 100 GB each, while the actual need was lower.

Compute

EC2 savings

  • For every job, we analysed cluster utilisation for both memory and CPU over the whole run.
  • We used Ganglia metrics for this; a Ganglia metrics dashboard is provided for every job.
  • The Ganglia dashboard showed that cluster configurations were larger than necessary: memory and CPU usage were low and the load was evenly distributed across servers.
  • It is also possible to select a particular node and inspect its CPU and memory usage. For many applications, only some nodes are underutilised; often the driver is underutilised while the workers are not, or vice versa.
  • Merchant report generation runs on Databricks job clusters with Spark as the processing engine. For each report request, a new job cluster starts from scratch, which requires instance negotiation and init-script initialisation, and can take even longer depending on the availability of EC2 instances in the region.

Workload optimisation

  • As part of merchant reporting, we also let merchants query raw data-lake tables directly, for data that is not available through the denormalised tables.
  • Many of these queries were unoptimised, either missing filters on partition columns of joined tables or containing unnecessary nested sub-queries.

Reduce, Remove, Replace and Reuse

We split the solution into four parts: Reduce, Remove, Replace and Reuse. We can reduce infrastructure by cutting down workers, remove unused S3 objects, reuse the same cluster to run multiple jobs in sequence and save on cluster startup time, or replace one instance type with another in Databricks.

Let’s look at each of these approaches.

Storage

Remove unused stage objects

We ran a rigorous exercise with the team to account for the data lying in S3 buckets and categorised stale and unaccounted data. This led to a cleanup of around 300–400 GB. But what is stopping this data from accumulating again?

Enter Lifecycle (LC) policies and strict processes for staging and dev buckets. We set lifecycle policies on most buckets so that data cannot grow beyond a certain size, and established a process for writing to dev and staging buckets so that data lands under agreed prefixes and tags and is handled by the associated LC policy. We also found TBs of application logs being written and, after due diligence, applied an LC policy to logs as well.
Total Savings by LC and deletion of stage data = ~2PB
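
As a sketch of what such a rule looks like, the snippet below sets expiration rules on a staging bucket with boto3; the bucket name, prefixes, and retention periods are illustrative, not our production values:

```python
# Sketch only: expire objects under agreed prefixes after a fixed number of days,
# so scratch data and application logs cannot silently accumulate again.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-stage-bucket",                # placeholder bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-stage-scratch-data",
                "Filter": {"Prefix": "scratch/"},  # agreed write prefix
                "Status": "Enabled",
                "Expiration": {"Days": 30},        # purge after 30 days
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
            },
            {
                "ID": "expire-application-logs",
                "Filter": {"Prefix": "logs/"},
                "Status": "Enabled",
                "Expiration": {"Days": 90},
            },
        ]
    },
)
```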

[Reduce] Data skipping while reading S3 objects via data reordering

All of our reports have date and merchant_id filters. For most reports, we read from denormalised tables to create the report. These tables are partitioned on created_date, so the date-filter optimisation is already in place.

However, we do not partition on merchant_id, even though it is equally important as a filter. The challenge is that merchant_id is a high-cardinality column, so partitioning on it would make writes very costly.

So we went ahead with Z-ordering, provided by Databricks on Delta Lake.

What is Z-ordering?

Z-ordering is a technique to colocate related information in the same set of files. This co-locality is automatically used by Delta Lake on Databricks data-skipping algorithms. This behaviour dramatically reduces the amount of data that Delta Lake on Databricks needs to read. To Z-order data, you specify the columns to order on in the Z-ORDER BY clause:

OPTIMIZE events WHERE date >= current_timestamp() - INTERVAL 1 day ZORDER BY (eventType)

If you expect a column to be commonly used in query predicates and if that column has high cardinality (that is, a large number of distinct values), then use Z-ORDER BY.

For example, Delta Lake automatically maintains the min and max values for up to 32 fields in the Delta table and stores those values as part of the metadata. By leveraging these min-max ranges, Delta Lake can skip files whose ranges fall outside the queried field values (data skipping). To make this effective, data can be clustered by Z-order columns so that min-max ranges are narrow and, ideally, non-overlapping. To cluster data, run the OPTIMIZE command with Z-order columns.

OPTIMIZE ENGAGEMENT_DATA ZORDER BY (EngagementDate)

In the snapshot above, each file's metadata records min and max values for the engagement date, which lets the engine skip files whose range falls entirely outside the filter. For example, if we query engagement_date = "2019_08_29", only the 2nd file is read and the other two are skipped.
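
Applied to the merchant reporting tables, the maintenance run looks roughly like the sketch below; the table name and the optimisation window are illustrative:

```python
# Sketch only: compact the recent partitions of the denormalised reporting table
# and Z-order them on merchant_id, so one merchant's rows land in as few files
# as possible and the min/max metadata lets most files be skipped at read time.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("""
    OPTIMIZE reporting.payments_denormalised
    WHERE created_date >= current_date() - INTERVAL 7 DAYS
    ZORDER BY (merchant_id)
""")
```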

Benefits

  • Reduced report creation time: since many files are skipped when reading data from S3, the average report creation time dropped by approximately 20%. With fewer files to read, there are few or no shuffles to gather a single merchant's data, so both input and shuffle volumes shrink. Cost also came down by approximately 25%.
  • Reduced cost of S3 GetObject and ListBucket operations: because files are skipped based on metadata, fewer ListBucket and GetObject calls are made, lowering the overall S3 API cost for these operations.

Computation

EBS Savings

EBS autoscaling in Databricks overestimated the EBS volumes for the majority of workloads. As a result, most provisioned EBS volumes were underutilised, leading to extra cost.

We reduced EBS provisioning by disabling the “Enable autoscaling local storage” checkbox; with autoscaling off, we specify the static EBS volume each job needs.

A static EBS volume works for us because our jobs are light: they read a single denormalised table without any joins, so shuffles are few as well.
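
A hedged example of such a job-cluster configuration is shown below; the instance type, worker count, and volume size are illustrative, not our production values:

```python
# Sketch only: a Databricks job-cluster spec with autoscaling local storage left
# off and a small, fixed general-purpose SSD volume per instance instead.
report_job_cluster = {
    "spark_version": "11.3.x-scala2.12",
    "node_type_id": "r5a.2xlarge",
    "num_workers": 4,
    "aws_attributes": {
        # Pinning explicit EBS volumes corresponds to unticking
        # "Enable autoscaling local storage" in the cluster UI.
        "ebs_volume_type": "GENERAL_PURPOSE_SSD",
        "ebs_volume_count": 1,
        "ebs_volume_size": 100,  # GB per instance, sized from observed usage
    },
}
```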

EC2 Savings

“Few lost their fortunes in gambling, few in drinking, while a select few left their EC2 instances on”

We run all our workloads on EC2 instances; all Databricks jobs and Trino clusters use them. Having seen how we identified the EC2 saving opportunities, let's look at how we actually implemented them.

Reduce Trino workers when there is no load

  1. We migrated all our Trino clusters to a multi-arch model, which helped us leverage Graviton instances to reduce infrastructure costs without affecting performance.
  2. Apart from this, we also tweaked our autoscaling rules to be more aggressive during downscaling events to avoid resource wastage when there is no load, helping us save up further on compute costs.
  3. Both these changes along with minor cluster capacity changes helped us reduce Trino Infrastructure costs by more than 60%.

Reduce the driver

  1. We halved the driver size for the Databricks Spark jobs that generate reports.
  2. The driver went from r5a.4xlarge to r5a.2xlarge.
  3. We made this call based on the Ganglia metrics described earlier.

Replace the memory-optimized instances with compute-optimized instances

  1. Merchant reporting workloads read a single denormalised table and apply a lot of transformations to get the final desired result.
  2. For this kind of use case, we generally need compute-optimized instances.
  3. Compute-optimized instances worked for us as they are cheaper for the same CPU with less memory (a hedged cluster-spec sketch covering this and the driver change follows this list).
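
A hedged sketch of a job-cluster spec combining the two changes above; the exact compute-optimized worker type shown is an assumption:

```python
# Sketch only: driver halved from r5a.4xlarge to r5a.2xlarge, with
# compute-optimized workers in place of memory-optimized ones.
report_job_cluster = {
    "spark_version": "11.3.x-scala2.12",
    "driver_node_type_id": "r5a.2xlarge",  # was r5a.4xlarge
    "node_type_id": "c5a.4xlarge",         # hypothetical compute-optimized worker
    "num_workers": 4,
}
```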

Reuse the clusters

  1. In the case of merchant reporting, we can keep a single job cluster alive to serve more than one report run.
  2. A single job cluster can sequentially create 5 reports, which reduces the total EC2 instance wait time by 60%.
  3. This also reduces cost by a good amount (a hedged sketch of a shared job cluster follows this list).
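
A hedged sketch of what cluster reuse can look like with a shared Databricks job cluster, where several report tasks reference the same job_cluster_key and run back to back so instance negotiation and init scripts happen only once; job, task, and file names are illustrative:

```python
# Sketch only: one shared job cluster, five chained report tasks.
shared_cluster = {
    "spark_version": "11.3.x-scala2.12",
    "driver_node_type_id": "r5a.2xlarge",
    "node_type_id": "c5a.4xlarge",   # hypothetical compute-optimized worker
    "num_workers": 4,
}

report_job = {
    "name": "merchant-report-batch",
    "job_clusters": [
        {"job_cluster_key": "shared_report_cluster", "new_cluster": shared_cluster}
    ],
    "tasks": [
        {
            "task_key": f"report_{i}",
            "job_cluster_key": "shared_report_cluster",
            # Chain the tasks so the 5 reports run sequentially on one cluster.
            "depends_on": [{"task_key": f"report_{i - 1}"}] if i > 0 else [],
            "spark_python_task": {
                "python_file": "dbfs:/jobs/generate_report.py",  # placeholder
                "parameters": ["--report-id", f"R{i}"],
            },
        }
        for i in range(5)
    ],
}
```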

Workload optimisation — Raw Queries

  • Raw queries needed partition filters (created_date) added for every table used in the SQL query.
  • We removed unnecessary nested sub-queries and moved the heavy lifting into common code (an illustrative before/after follows this list).
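
An illustrative before/after of such a fix; table names, columns, and dates are placeholders:

```python
# Sketch only: the rewritten query filters every joined table on its partition
# column (created_date) and drops a needless nested sub-query.
unoptimised = """
    SELECT p.payment_id, r.refund_id
    FROM raw.payments p
    JOIN (SELECT * FROM raw.refunds) r ON r.payment_id = p.payment_id
    WHERE p.merchant_id = 'MID12345'
"""

optimised = """
    SELECT p.payment_id, r.refund_id
    FROM raw.payments p
    JOIN raw.refunds r
      ON r.payment_id = p.payment_id
     AND r.created_date BETWEEN '2023-06-01' AND '2023-06-07'
    WHERE p.merchant_id = 'MID12345'
      AND p.created_date BETWEEN '2023-06-01' AND '2023-06-07'
"""
```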

Conclusion

The Razorpay data team has come a long way in optimizing the total cost of operation. We have reduced and reused the infrastructure wherever applicable. We still want to optimize the cost even further and become the most optimal version of ourselves.

After this exercise spanning a few quarters, we ended up reducing the overall cost of operating the data platform by a good amount. We had many learnings along the way, and this is a small effort to share them with the world.

We would love to hear from you about what more we can do on our path to being the most cost-optimal data platform around.
