Domain Streamlines Scaling for Mobile API Services Using Amazon ECS


This post was contributed by Samuel Gray, Senior DevOps Engineer from the Cloud Platform Team, Chanci Turner, Principal Solutions Architect for Large Enterprises, and Emily White, Solutions Architect for Large Enterprises.

Domain offers real estate information and services to Australians through mobile platforms, websites, and print media. Their offerings include property marketing, search tools, customer relationship management systems for real estate professionals, as well as data and research services targeted at buyers, sellers, real estate agencies, government entities, and financial markets. The company is partially owned by Nine Entertainment Co and employs between 500 and 1,000 people.

The Cloud Platforms Team oversees a diverse array of applications, including some hosted on a collection of older Amazon EC2 instances. One such application provides consumer-facing API services for Android and iOS applications.

Challenges Faced

We faced three primary challenges:

  1. Upgrading to Newer Amazon EC2 Instances
    Our attempt to transition from third-generation to fifth-generation EC2 instances was hindered by compatibility issues with our existing setup.
  2. Eliminating Scaling Dependencies
    Each scaling event prompted a series of processes to configure the operating system and application. Any failure in these processes could lead to application runtime issues. For instance, one evening, two installation dependencies caused every scaling action to fail, resulting in significant service disruptions for mobile users.
  3. Enhancing Scaling Speed, Reducing Unused Capacity, and Minimizing Costs
    Our standalone EC2 configuration struggled to scale rapidly enough during peak traffic, which led to performance issues and resource shortages. We implemented a scheduled scaling action before peak times, but this often resulted in excessive unused capacity.

Solution Overview

Objectives

  • Enhance the efficiency and performance of mobile workloads by achieving near-real-time scaling using Amazon ECS Capacity Providers.
  • Implement changes to Amazon ECS via Infrastructure as Code with AWS CloudFormation.
  • Optimize networking performance by appropriately sizing EC2 instances for ECS clusters.
  • Decrease costs by quickly scaling down unnecessary ECS tasks after peak traffic.
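To make the first two objectives more concrete, here is a minimal sketch of how an ECS capacity provider with managed scaling might be described for AWS CloudFormation. It is not Domain's actual template: the logical IDs (`MobileApiCapacityProvider`, `MobileApiAsg`) and the chosen target capacity are assumptions for illustration, and the Auto Scaling group resource itself is left out. The fragment is built as a Python dictionary and serialized to JSON.

```python
import json


def capacity_provider_template(asg_logical_id: str, target_capacity: int = 100) -> dict:
    """Build a CloudFormation template fragment for an ECS capacity provider.

    asg_logical_id is a hypothetical logical ID of an Auto Scaling group that
    would be defined elsewhere in the same template.
    """
    if not 1 <= target_capacity <= 100:
        raise ValueError("target_capacity must be between 1 and 100")
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": {
            # Hypothetical logical ID, chosen for this example only.
            "MobileApiCapacityProvider": {
                "Type": "AWS::ECS::CapacityProvider",
                "Properties": {
                    "AutoScalingGroupProvider": {
                        # Ref on an Auto Scaling group resolves to its name.
                        "AutoScalingGroupArn": {"Ref": asg_logical_id},
                        "ManagedScaling": {
                            # Let ECS adjust the group's desired capacity.
                            "Status": "ENABLED",
                            # Percentage of instance capacity ECS aims to keep in use.
                            "TargetCapacity": target_capacity,
                        },
                        "ManagedTerminationProtection": "DISABLED",
                    }
                },
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(capacity_provider_template("MobileApiAsg"), indent=2))
```

A target capacity of 100 asks ECS to keep the instances fully used, while a lower value leaves headroom so that new tasks can start without waiting for a new instance.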

Risks

We aimed to mitigate two significant risks:

  • The infrastructure not scaling swiftly enough to meet demand, leading to poor user experiences during peak times.
  • Scaling dependencies failing, causing unplanned outages and, given the high volume of consumer traffic, potential reputational damage.

Implementation Steps

  • The existing Infrastructure as Code setup was converted into a container build process for deployment to Amazon ECS, using Amazon Elastic Container Registry (Amazon ECR).
  • Load testing was conducted by the Cloud Platforms Team to ensure scalability under production loads.
  • Traffic was gradually redirected over two weeks, increasing the percentage of traffic to Amazon ECS daily through a canary deployment approach.
  • A task placement strategy was chosen to make scaling in easier once peak traffic subsided. The binpack strategy packs tasks onto as few hosts as possible and, on scale-in, removes tasks first from the hosts that end up with the most spare capacity, typically those running the fewest tasks. Those hosts empty out, which lets the ECS Capacity Provider scale down instances after high-traffic events.
  • The application uses Amazon ElastiCache for Redis. Upgrading to fifth-generation cache nodes improved cache efficiency and performance under the higher connection counts that come from running a larger number of smaller containers.
  • Amazon CloudWatch Container Insights showed the resource usage of individual tasks, which let us assign resources to each task efficiently and keep refining the resource reservations.
  • Using AWS Cost Explorer, we confirmed a 25% reduction in costs due to right-sizing the ECS cluster instances and scaling according to demand.
  • We monitored scaling activity with Amazon CloudWatch to ensure alignment with actual load.
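The effect of the binpack strategy on scale-in can be illustrated with a small simulation. This is a simplified sketch, not the ECS scheduler: it assumes every task reserves the same amount of memory, considers memory only, and uses hypothetical instance names and sizes.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    name: str
    memory_mib: int
    tasks: list = field(default_factory=list)  # memory reserved by each task

    @property
    def free(self) -> int:
        return self.memory_mib - sum(self.tasks)


def place_binpack(instances, task_mib):
    """Place a task on the instance with the least free memory that still fits it."""
    candidates = [i for i in instances if i.free >= task_mib]
    if not candidates:
        return None  # no room: in ECS, this is where more capacity would be requested
    target = min(candidates, key=lambda i: i.free)
    target.tasks.append(task_mib)
    return target


def remove_binpack(instances, task_mib):
    """Remove a task from the instance that already has the most free memory."""
    candidates = [i for i in instances if task_mib in i.tasks]
    if not candidates:
        return None
    target = max(candidates, key=lambda i: i.free)
    target.tasks.remove(task_mib)
    return target


def idle_instances(instances):
    """Names of instances with no tasks left, which are candidates for termination."""
    return [i.name for i in instances if not i.tasks]


if __name__ == "__main__":
    cluster = [Instance(n, 4096) for n in ("a", "b", "c")]
    for _ in range(9):          # peak traffic: nine 1 GiB tasks
        place_binpack(cluster, 1024)
    for _ in range(5):          # traffic subsides: five tasks are stopped
        remove_binpack(cluster, 1024)
    print(idle_instances(cluster))
```

Because removals are concentrated on the emptiest hosts, whole instances become idle instead of every instance staying partly used, and idle instances are what the Auto Scaling group can remove.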

Architecture Diagram

The solution encompasses:

  • Elastic Load Balancing for traffic distribution among running containers (Tasks).
  • Amazon CloudWatch alarms to monitor resource usage, adding or removing containers as needed.
  • ECS Capacity Providers combined with an Auto Scaling group to manage the container host capacity necessary for the desired number of containers.
  • Logs and metrics published to CloudWatch Logs and Container Insights.
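How a capacity provider decides whether the Auto Scaling group needs more or fewer hosts can be sketched with the metric that AWS describes for ECS cluster auto scaling, CapacityProviderReservation = M / N × 100, where M is the number of instances needed for the tasks and N is the number currently running. The code below is a rough illustration under simplifying assumptions: all tasks are the same size, and `tasks_per_instance` is a hypothetical value rather than a figure from this post.

```python
import math


def instances_needed(task_count: int, tasks_per_instance: int) -> int:
    """Estimate M, assuming uniform tasks that pack evenly onto instances."""
    return math.ceil(task_count / tasks_per_instance)


def capacity_provider_reservation(needed: int, running: int) -> float:
    """Compute M / N * 100, with the special cases AWS documents for N = 0."""
    if running == 0:
        # Nothing running: 100 if nothing is needed, 200 to force a scale-out.
        return 100.0 if needed == 0 else 200.0
    return 100 * needed / running


def scaling_direction(reservation: float, target_capacity: int = 100) -> str:
    """Compare the metric with the target capacity configured on the provider."""
    if reservation > target_capacity:
        return "scale out"
    if reservation < target_capacity:
        return "scale in"
    return "steady"


if __name__ == "__main__":
    needed = instances_needed(task_count=25, tasks_per_instance=4)
    reservation = capacity_provider_reservation(needed, running=5)
    print(needed, reservation, scaling_direction(reservation))
```

With a target capacity of 100, a value above 100 means the tasks need more instances than are running, and a value below 100 means some instances could be removed.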

Reliability

A major focus of this initiative was enhancing reliability, particularly during peak times. Our effective scaling strategies alleviated the impact of significant traffic increases, resulting in fewer alerts and an improved user experience.

Performance

One key goal was to ensure our infrastructure scales with usage, avoiding costs associated with unused capacity. The correlation between request numbers and running tasks was evident, along with the scaling of container instances in response to task counts.

Cost Efficiency

While reliability was paramount, we achieved a 25% reduction in costs and optimized scaling based on usage. We also lowered our EC2 expenses by analyzing custom reports via AWS Cost Explorer. Moving away from outdated architecture has opened doors for future cost-saving initiatives, such as:

  • Enhanced client caching and better utilization of Amazon ElastiCache.
  • Implementing Amazon EC2 Spot Instances for a portion of the workload.
  • Utilizing AWS Graviton2 processor EC2 instances and Bottlerocket AMI for ECS.

Conclusion and Future Opportunities

We are exploring the integration of this workflow with AWS CodePipeline and AWS CDK to automate Amazon ECS deployment. Opportunities to apply this architecture to existing and future workloads within Domain are promising.
