Just a couple of months ago, when most of the world was mandated to work from home where possible and avoid unnecessary travel due to the COVID-19 pandemic, Index Exchange’s data pipeline experienced a surge of traffic.
The High Tide of Data Volume
As a large proportion of the global population remained inside, people turned to the internet more than ever, creating a surge in digital content consumption. This translated to a massive influx of online transactions and, in turn, a substantial increase in data for IX.
At this time, the volume coming into our data pipeline surged more than 50% month over month. Each year, IX forecasts data growth based on historical trends. Despite the preparations we had made heading into 2020, nothing could have prepared us for hitting the volume originally forecast for year-end 2020 within just a few short weeks.
Early Warning Signs
Well before the data volume peak in March, data volume growth, cluster resource consumption, and job-specific metrics such as runtime and input data size had started to fluctuate and become more volatile. This level of visibility into system and application metrics provided key insights and allowed us to respond to the abnormality proactively. Our monitoring systems also helped us identify bottlenecks and guided us to focus on the components where optimization would yield the highest return. Most importantly, they suggested that a bigger storm might be on the horizon.
During the initial increase in data volume, everything was smooth sailing, and our data pipeline operated normally. Part of this was due to our regular practice of capacity planning and forecasting. Planned memory upgrades were also underway at the time, giving us a much-needed boost in cluster resources (the resources required to process data and jobs on the pipeline). Because our monitoring system constantly surfaced the health and pulse of our data pipeline, we had enough lead time to prepare for what was to come.
Because of this head start and proactive approach, we delivered a series of quick performance gains in the early phase of the traffic increase. Our foundational delivery process was critical to this early success. Our delivery pipeline is fully integrated with CI/CD and has an automated testing framework in place, so we were able to deploy changes with a very short lead time. The automated testing framework validates each change against a versioned master copy of test data and its corresponding expected output. This minimized the risk of data inaccuracy and let us maintain our high standard of data quality through frequent, quick production pushes. Our performance metrics dashboards served as a fast feedback loop to validate and quantify the optimization gain from each release. These small wins gave us the cluster resources and time we needed to pave the way for a much-needed big win.
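To make the idea of validating against a versioned master copy of test data concrete, here is a minimal sketch in Python. It is an illustration only, not our actual framework: the job `aggregate_revenue`, the fixture store `GOLDEN_FIXTURES`, and the helper `check_against_golden` are all hypothetical names, and a real setup would load fixtures from versioned storage rather than from an in-memory dictionary.

```python
# Hypothetical transformation under test: total revenue per site.
def aggregate_revenue(records):
    totals = {}
    for rec in records:
        site = rec["site"]
        totals[site] = totals.get(site, 0) + rec["revenue_cents"]
    return totals


# Hypothetical versioned fixtures: each version pairs a fixed input
# with the output the job is expected to produce for that input.
GOLDEN_FIXTURES = {
    "v1": {
        "input": [
            {"site": "a.example", "revenue_cents": 120},
            {"site": "b.example", "revenue_cents": 75},
            {"site": "a.example", "revenue_cents": 30},
        ],
        "expected": {"a.example": 150, "b.example": 75},
    },
}


def check_against_golden(job, version):
    """Run `job` on the fixture input and compare with the expected output.

    Returns (passed, actual) so a CI step can fail the build and show
    the mismatching output when a change alters the results.
    """
    fixture = GOLDEN_FIXTURES[version]
    actual = job(fixture["input"])
    return actual == fixture["expected"], actual


passed, actual = check_against_golden(aggregate_revenue, "v1")
```

In a CI step, a failed comparison would block the release, which is what allows frequent production pushes without manually re-verifying data accuracy each time.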
Riding the Wave
By late March, our data pipeline reached a saturation point. Our systems began to alert us to impending impacts on our Service Level Objectives. In the midst of developing plans to pivot in response to this surge, we saw an opportunity. Acting quickly, we gathered a team of data engineers together with our Engineering Lead to come up with a creative approach to processing our data, with the goal of optimizing our cluster resource usage.
After some deliberation, we decided to restructure one of our largest data assets into a nested column and process it at a different phase of the pipeline, which resulted in a 90% reduction in the number of records processed. Table joins became less expensive, and the number of reads and fetches in downstream aggregation jobs dropped significantly. This design improved our jobs’ runtime, throughput, and resource consumption.
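The effect of nesting can be illustrated with a small Python sketch. This is not our production code, and the schema is an assumption made up for the example: the field names `auction_id`, `bidder`, and `price_cents`, the `items` column, and the helper `nest_by_key` are all hypothetical. The point is only that collapsing many flat rows into one row per key, with the details held in a nested column, reduces the number of records that joins and downstream aggregations have to touch.

```python
from collections import defaultdict


def nest_by_key(flat_rows, key):
    """Collapse flat rows that share `key` into one row with a nested column.

    Each output row keeps the key and holds the remaining fields of its
    source rows as a list under the hypothetical column name "items".
    """
    grouped = defaultdict(list)
    for row in flat_rows:
        child = {k: v for k, v in row.items() if k != key}
        grouped[row[key]].append(child)
    return [{key: k, "items": children} for k, children in grouped.items()]


# Hypothetical flat data: one row per (auction, bidder) pair.
flat_rows = [
    {"auction_id": 1, "bidder": "x", "price_cents": 10},
    {"auction_id": 1, "bidder": "y", "price_cents": 12},
    {"auction_id": 1, "bidder": "z", "price_cents": 9},
    {"auction_id": 2, "bidder": "x", "price_cents": 7},
    {"auction_id": 2, "bidder": "y", "price_cents": 11},
]

nested_rows = nest_by_key(flat_rows, "auction_id")

# A downstream aggregation now reads one record per auction instead of
# one record per bid.
top_price = {
    row["auction_id"]: max(item["price_cents"] for item in row["items"])
    for row in nested_rows
}
```

Here five flat rows become two nested rows. The reduction in a real dataset depends on how many detail rows share each key, so the ratio in this toy example says nothing about the 90% figure above.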
Implementation and performance testing showed promising results. By the second week, this optimization release was ready for deployment. The design change required deploying seven major jobs on our data pipeline. Even with a delivery pipeline fully integrated with CI/CD, the team did everything they could to ensure success, including dry runs and data validation in the staging environment.
These dry runs led to several improvements, such as a one-click Zeppelin notebook for data validation checkpointing, which minimized the risk and delay of rolling back the changes in the event of unexpected results. In the end, thanks to the significant amount of time invested in automation and preparation, the deployment was a success.
Ultimately, we regained 20% of our cluster resources within two weeks. The volatile spikes and uncertainty in job performance subsided following the release.
Considering the short amount of time we had to address the issue, the team’s efforts can only be described as impressive. Having mobilized to understand the critical business needs and work towards a common goal, one dedicated engineering team quickly pivoted to build a solution to an issue that no one could have anticipated. Their strong sense of focus, dedication, and ownership kept IX’s systems operating smoothly. Their commitment, determination, and ability to act proactively brought us over the finish line.