How Insider Scaled a Production-Grade Elasticsearch Cluster on AWS

In this article, I will detail Insider’s two-year journey scaling a production-grade Elasticsearch cluster that powers our recommendation and search products. I will also outline the challenges we faced and the solutions we implemented by changing our configurations and architecture. Insider is an AWS ISV Partner with the Digital Customer Experience Competency, and we use Amazon OpenSearch Service (the successor to Amazon Elasticsearch Service) to run our clusters. The managed service lets us scale with minimal downtime and reduces operational overhead, so our team can concentrate on product development.

This account draws on the experience of Insider’s machine learning engineers and aims to give AWS customers planning similar initiatives useful insights and considerations. By sharing how we scaled Elasticsearch to keep up with growing customer demand, we hope to help you do the research and pinpoint the performance issues in your own stack before making changes to improve its scalability.

Insider’s Elasticsearch Architecture

Our Elasticsearch clusters primarily serve customer-facing APIs, which perform the read operations, and Spark clusters, which handle the write operations. The clusters sit in the same virtual private cloud (VPC) as the services that use them, which improves security, lowers network costs, and keeps latency low. Data-heavy clusters run across three AWS Availability Zones, with M5 series instances as master nodes and I3 series instances as data nodes, and most indices use a single shard.
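
For concreteness, here is a minimal sketch of what a domain with this layout can look like when created with the AWS SDK for Python (boto3). The domain name, instance types, counts, subnets, and security group below are placeholders for illustration, not our exact production values.

```python
import boto3

# Illustrative domain configuration mirroring the layout described above:
# dedicated M5 master nodes, I3 data nodes, and three-AZ zone awareness.
client = boto3.client("opensearch")

client.create_domain(
    DomainName="insider-recommendations",            # placeholder name
    EngineVersion="Elasticsearch_7.10",
    ClusterConfig={
        "InstanceType": "i3.2xlarge.search",          # data nodes (instance storage)
        "InstanceCount": 6,
        "DedicatedMasterEnabled": True,
        "DedicatedMasterType": "m5.large.search",     # dedicated master nodes
        "DedicatedMasterCount": 3,
        "ZoneAwarenessEnabled": True,
        "ZoneAwarenessConfig": {"AvailabilityZoneCount": 3},
    },
    EBSOptions={"EBSEnabled": False},                 # I3 nodes use local NVMe storage
    VPCOptions={                                      # keep the domain inside the same VPC
        "SubnetIds": ["subnet-aaa", "subnet-bbb", "subnet-ccc"],
        "SecurityGroupIds": ["sg-xxx"],
    },
)
```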

Our architecture evolved over time rather than being built in one go. Initially, we relied on M-series data nodes without dedicated master nodes, and write operations went through direct connections to the cluster, which worked adequately until product usage began to accelerate.

The Problem

As the number of partners utilizing our product and their data volume increased, so did the frequency and volume of write operations. Eventually, we observed alarming spikes in CPU utilization, surging from 20 percent to 90 percent.

The Investigation

We began our analysis by examining the cluster health metrics and found a direct correlation between the CPU spikes and heavy write workloads. Digging into our pipeline, we discovered that too many parallel jobs were writing directly to the same cluster, producing high-volume direct batch writes.
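
A quick way to see this kind of correlation, assuming the domain publishes its metrics to CloudWatch as Amazon OpenSearch Service domains do, is to pull CPU utilization and indexing rate side by side. The domain name and account ID below are placeholders.

```python
import boto3
from datetime import datetime, timedelta

# Sketch: fetch CPUUtilization and IndexingRate for the domain and print them
# side by side to spot write bursts that line up with CPU spikes.
cloudwatch = boto3.client("cloudwatch")

def domain_metric(metric_name, stat="Average"):
    return cloudwatch.get_metric_statistics(
        Namespace="AWS/ES",
        MetricName=metric_name,
        Dimensions=[
            {"Name": "DomainName", "Value": "insider-recommendations"},  # placeholder
            {"Name": "ClientId", "Value": "123456789012"},               # placeholder account ID
        ],
        StartTime=datetime.utcnow() - timedelta(hours=6),
        EndTime=datetime.utcnow(),
        Period=300,
        Statistics=[stat],
    )["Datapoints"]

cpu = {p["Timestamp"]: p["Average"] for p in domain_metric("CPUUtilization")}
indexing = {p["Timestamp"]: p["Average"] for p in domain_metric("IndexingRate")}

for ts in sorted(cpu):
    print(ts, f"CPU={cpu[ts]:.0f}%", f"IndexingRate={indexing.get(ts, 0):.0f}")
```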

Quick Solution: Update Elasticsearch Configurations

Our immediate response was to adjust the Elasticsearch configuration. The default refresh interval, which determines how often newly indexed documents become visible to search, is one second; because every refresh creates a new segment, refreshing that often adds constant overhead under heavy indexing. We extended the refresh interval to 10 seconds to reduce the load during write operations.
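
The change itself is a single call to the index settings API. A minimal sketch, using a placeholder endpoint and index name and leaving out request signing and access control for brevity:

```python
import requests

# Raise the index refresh interval from the 1 s default to 10 s.
ES_ENDPOINT = "https://search-example-domain.eu-west-1.es.amazonaws.com"  # placeholder

resp = requests.put(
    f"{ES_ENDPOINT}/recommendations/_settings",        # placeholder index name
    json={"index": {"refresh_interval": "10s"}},
    timeout=10,
)
resp.raise_for_status()
print(resp.json())  # expect {"acknowledged": true}
```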

Real Solution: Refactor the Architecture

While adjusting the configuration provided temporary relief, we recognized the need for a more substantial architectural change. We decided to funnel all write operations through a single entry point to the Elasticsearch cluster. To achieve this, we used Amazon Kinesis Data Streams to queue documents together with their metadata, and AWS Lambda to process these events in batches, decoupling the writes from the many parallel operations upstream.
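
A simplified sketch of what such a Lambda consumer can look like is shown below; the payload layout, environment variable, and error handling are assumptions for illustration, not our exact production code.

```python
import base64
import json
import os

import requests  # assumed to be packaged with the function

# Decode the Kinesis batch, turn each record into a bulk action,
# and send a single _bulk request to the cluster.
ES_ENDPOINT = os.environ["ES_ENDPOINT"]  # placeholder configuration

def handler(event, context):
    bulk_lines = []
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        # Each payload is assumed to carry the target index, document ID, and document body.
        bulk_lines.append(json.dumps({"index": {"_index": payload["index"], "_id": payload["id"]}}))
        bulk_lines.append(json.dumps(payload["doc"]))

    body = "\n".join(bulk_lines) + "\n"
    resp = requests.post(
        f"{ES_ENDPOINT}/_bulk",
        data=body,
        headers={"Content-Type": "application/x-ndjson"},
        timeout=30,
    )
    resp.raise_for_status()
    return {"indexed": len(event["Records"])}
```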

This revised approach afforded us considerable flexibility. The Lambda function allowed us to control the batch size and flow rate for writes, enabling adjustments in real-time. Additionally, during any incidents or migrations on the production cluster, we could halt write operations without disrupting computations, ensuring all requests remained queued on Kinesis Data Streams.
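
These levers map to the Kinesis event source mapping on the Lambda function. A hedged example with boto3, using a placeholder mapping UUID:

```python
import boto3

lambda_client = boto3.client("lambda")

# Tune how many queued documents each invocation writes in one bulk request.
lambda_client.update_event_source_mapping(
    UUID="00000000-0000-0000-0000-000000000000",  # placeholder mapping UUID
    BatchSize=500,
)

# During an incident or migration, pause consumption; records stay queued on the stream.
lambda_client.update_event_source_mapping(
    UUID="00000000-0000-0000-0000-000000000000",
    Enabled=False,
)
```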

The Problem Occurred Again

Although consolidating write operations initially resolved our CPU issues, the problem resurfaced six months later as data volumes kept growing. As a quick fix, we switched our data nodes from the M series to the C series for more CPU headroom, but this made things worse: C-series instances have a lower memory-to-vCPU ratio, and on further investigation we found that JVM memory pressure on the data nodes was excessively high.

We analyzed the JVM garbage collection process and identified that the old generation pool was filling up too quickly, triggering frequent full garbage collections. The root cause was linked to insufficient memory and heap space. Consequently, we transitioned to R instances, which are memory-optimized, and added three dedicated master nodes to bolster cluster stability. This comprehensive strategy restored CPU utilization and JVM memory pressure to acceptable levels, normalizing our operations.
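
One way to make that pressure visible, shown here as an illustrative sketch rather than our exact tooling, is the nodes stats API, which reports heap usage and old-generation garbage collection activity per node. The endpoint is a placeholder and request signing is omitted.

```python
import requests

ES_ENDPOINT = "https://search-example-domain.eu-west-1.es.amazonaws.com"  # placeholder

# Per-node JVM heap usage and old-generation GC counters.
stats = requests.get(f"{ES_ENDPOINT}/_nodes/stats/jvm", timeout=10).json()

for node_id, node in stats["nodes"].items():
    heap_pct = node["jvm"]["mem"]["heap_used_percent"]
    old_gc = node["jvm"]["gc"]["collectors"]["old"]
    print(
        node["name"],
        f"heap_used={heap_pct}%",
        f"old_gc_count={old_gc['collection_count']}",
        f"old_gc_time_ms={old_gc['collection_time_in_millis']}",
    )
```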
