Creating a Resilient Architecture Using the Bulkhead Pattern on AWS App Mesh

NOTICE: October 04, 2024 – This article is no longer the best guidance for configuring a service mesh with Amazon ECS and Amazon EKS, and its examples may not function as described. For workloads on Amazon ECS, please consult the latest content on Amazon ECS Service Connect, and for Amazon EKS workloads, refer to Amazon VPC Lattice.

When deploying APIs within containerized services, it’s common for a single service to handle multiple responsibilities or interact with numerous downstream dependencies. In these situations, a failure in one area can lead to a cascade of issues affecting the entire application. For instance, consider an e-commerce platform that manages pricing with a REST API featuring two main endpoints served by the same containerized code:

  • GET /price/$id – retrieves the latest listing price from an in-memory cache – a lightweight, quick request
  • POST /price – creates or updates a listing price – a long-running request since it requires ensuring the price is stored and the cache updated

The write endpoint is more resource-intensive, since it must persist updates to a database and invalidate the cache. If a surge of traffic hits the write endpoint, it can exhaust the available connection pools and memory, effectively blocking requests to the other endpoints. This disrupts not only users trying to update prices but also those merely reading the latest pricing information.

To address this challenge, it’s crucial to segregate specific functions across distinct resource pools. This article will demonstrate how the bulkhead pattern, applied at the service mesh level, can facilitate this separation and provide a practical implementation within Amazon Elastic Kubernetes Service (Amazon EKS).

The Bulkhead Pattern

The bulkhead pattern derives its name from a naval engineering technique, where a ship's hull is divided into internal watertight chambers. This design prevents water from spreading throughout the ship if the hull is breached. In software, the goal is analogous: isolate resources and dependencies so that a failure in one area cannot spread through the whole system, improving availability and fault tolerance. The isolation is achieved by partitioning resources into dedicated pools and assigning workloads to them, based on CPU, memory, network connections, or any other resource that one workload could otherwise exhaust for the rest.
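The effect is easiest to see in miniature. The following Python sketch (a simplified analogy, not the App Mesh implementation) contrasts a single shared worker pool with two bulkheaded pools, using sleep times loosely modeled on the fast GET and slow POST endpoints above:

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical timings: reads are fast, writes are slow.
def read_price():
    time.sleep(0.01)
    return "23.10"

def write_price():
    time.sleep(0.5)
    return "created"

def run_shared_pool():
    """One pool for everything: slow writes occupy every worker,
    so the read queues behind them."""
    pool = ThreadPoolExecutor(max_workers=4)
    for _ in range(4):
        pool.submit(write_price)  # saturate all workers
    start = time.monotonic()
    pool.submit(read_price).result()  # read must wait for a free worker
    latency = time.monotonic() - start
    pool.shutdown()
    return latency

def run_bulkheads():
    """Bulkhead: a dedicated pool per endpoint, so writes
    cannot starve reads."""
    write_pool = ThreadPoolExecutor(max_workers=4)
    read_pool = ThreadPoolExecutor(max_workers=4)
    for _ in range(4):
        write_pool.submit(write_price)
    start = time.monotonic()
    read_pool.submit(read_price).result()  # served immediately
    latency = time.monotonic() - start
    write_pool.shutdown()
    read_pool.shutdown()
    return latency

if __name__ == "__main__":
    print(f"shared pool read latency:  {run_shared_pool():.2f}s")
    print(f"bulkheaded read latency:   {run_bulkheads():.2f}s")
```

With the shared pool, the read waits roughly the duration of a write before it even starts; with bulkheads, it completes in milliseconds. The rest of this article applies the same idea at the infrastructure level instead of in application code.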

Solution Overview

The following solution exemplifies how to implement the aforementioned scenario using AWS App Mesh alongside Amazon EKS. This is just one approach to illustrate the concept, which can be applied from edge traffic management to individual lines of code, effectively isolating resources to reduce the impact of failures.

The focus here is on practicality; rather than modifying code across all applications that could benefit from this, implementing it at the infrastructure level simplifies the process and avoids unnecessary complexity.

To bolster the solution’s resilience, leveraging the newly released App Mesh Circuit Breaker capabilities can help set traffic thresholds for application nodes, preventing overload during high-traffic periods and ensuring some customers can still access features instead of risking a total failure.
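As a sketch of what that configuration can look like with the App Mesh controller for Kubernetes, the circuit breaker is expressed on the VirtualNode as a connection pool plus outlier detection. The field names follow the appmesh.k8s.aws/v1beta2 CRD; the names, port, and thresholds below are illustrative assumptions, not values taken from the deployment script:

```yaml
apiVersion: appmesh.k8s.aws/v1beta2
kind: VirtualNode
metadata:
  name: price-write          # hypothetical node name
  namespace: bulkhead-pattern
spec:
  podSelector:
    matchLabels:
      app: price-write
  listeners:
    - portMapping:
        port: 8080           # assumed container port
        protocol: http
      connectionPool:        # cap concurrent work reaching the node
        http:
          maxConnections: 10
          maxPendingRequests: 5
      outlierDetection:      # eject hosts that keep returning errors
        maxServerErrors: 5
        interval:
          value: 30
          unit: s
        baseEjectionDuration:
          value: 30
          unit: s
        maxEjectionPercent: 50
  serviceDiscovery:
    dns:
      hostname: price-write.bulkhead-pattern.svc.cluster.local
```

Requests beyond the connection-pool limits are rejected quickly by the Envoy sidecar rather than queuing until the node collapses, which is what keeps part of the traffic flowing during an overload.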

The architecture is depicted in the following diagram:

[Diagram of an AWS App Mesh level bulkhead isolating resources by routes]

Implementing the Solution

As mentioned earlier, this solution utilizes Amazon EKS with AWS App Mesh. However, it could also be implemented using other container orchestration tools, such as Amazon Elastic Container Service (Amazon ECS) or even with Amazon EC2 instances. The service mesh serves as the infrastructure layer that manages access, while Kubernetes Deployments handle the computation reservations.
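Concretely, the computation reservation side of the bulkhead is just a pair of Deployments, one per endpoint, each with its own resource requests and limits. The manifest below is a hedged sketch: the image reference, replica count, and resource figures are placeholders, not the values used by the demo's deployment script:

```yaml
# One Deployment per endpoint gives each route its own pods,
# so a flood of writes cannot consume the read pods' resources.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: price-read
  namespace: bulkhead-pattern
spec:
  replicas: 1
  selector:
    matchLabels: {app: price-read}
  template:
    metadata:
      labels: {app: price-read}
    spec:
      containers:
        - name: app
          image: <your-ecr-repo>/price-api:latest  # placeholder image
          resources:
            requests: {cpu: 250m, memory: 256Mi}   # illustrative sizing
            limits:   {cpu: 500m, memory: 512Mi}
```

A second, near-identical Deployment named price-write (typically with more generous limits, since writes are heavier) completes the pair; the mesh then decides which pool each request reaches.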

Once AWS App Mesh and Amazon EKS are established, follow the steps outlined below:

  1. Deploy the setup
  2. Test bulkhead failure isolation
  3. Configure the circuit breaker
  4. Test additional resiliency

Prerequisites

To implement this solution, you will need an operational Amazon EKS cluster with AWS App Mesh configured. The blog post “Getting Started with AWS App Mesh and Amazon EKS” is an excellent resource to guide you through the setup. Here’s what you’ll need:

  • An AWS account
  • An Amazon EKS cluster
  • AWS App Mesh configured for use with Amazon EKS
  • A terminal with:
    • The latest version of the AWS CLI
    • Docker
    • Git
    • jq
    • kubectl
    • httpie

Deployment Setup

Begin by deploying the demo application into your cluster. Ensure that kubectl is configured correctly for the intended cluster and that the default AWS CLI user has the necessary permissions.

The deployment script will:

  • Build a Docker image for the sample application
  • Create an ECR repository
  • Push the image to the ECR repository
  • Create an EKS namespace named “bulkhead-pattern”
  • Establish and configure the application with two deployments: price-read and price-write
  • Set up a mesh called bulkhead-pattern with a virtual gateway, virtual service, virtual routers, and specific virtual nodes
  • Create a load balancer to expose the virtual gateway via a public URL
  • Deploy a Vegeta instance to generate load and simulate failures (Vegeta is an HTTP load-testing tool)
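The route-level separation the script creates can be pictured as a VirtualRouter that matches on HTTP method and sends each endpoint to its own virtual node. This is an illustrative sketch in the appmesh.k8s.aws/v1beta2 CRD format, not the script's exact output; names and the port are assumptions:

```yaml
apiVersion: appmesh.k8s.aws/v1beta2
kind: VirtualRouter
metadata:
  name: price               # hypothetical router name
  namespace: bulkhead-pattern
spec:
  listeners:
    - portMapping: {port: 8080, protocol: http}
  routes:
    - name: price-read-route     # fast GETs -> read pool
      httpRoute:
        match:
          prefix: /price/
          method: GET
        action:
          weightedTargets:
            - virtualNodeRef: {name: price-read}
              weight: 1
    - name: price-write-route    # slow POSTs -> write pool
      httpRoute:
        match:
          prefix: /price
          method: POST
        action:
          weightedTargets:
            - virtualNodeRef: {name: price-write}
              weight: 1
```

Because each route targets a different virtual node, and each virtual node fronts a different Deployment, a stressed write path has no shared resources to exhaust on the read path.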

To clone the repository and execute the deployment script:

git clone https://github.com/aws/aws-app-mesh-examples.git
cd aws-app-mesh-examples/blogs/eks-bulkhead-pattern-circuit-breaker/
AWS_ACCOUNT_ID=? AWS_DEFAULT_REGION=? ./deploy.sh

Check the deployed resources:

kubectl -n bulkhead-pattern get deployments

You should see:

NAME          READY   AVAILABLE
ingress-gw    1/1     1
price-read    1/1     1
price-write   1/1     1
vegeta        1/1     1

Retrieve the load balancer endpoint:

PRICE_SERVICE=$(kubectl -n bulkhead-pattern get services | grep ingress-gw | sed 's/|/ /' | awk '{print $4}')

To test the API GET endpoint (note that it may take a few minutes for the DNS to propagate):

http GET $PRICE_SERVICE/price/7

You should receive:

HTTP/1.1 200 OK
server: envoy
x-envoy-upstream-service-time: 2
{
   "value": "23.10"
}

To test the API POST endpoint, which may take around five seconds to respond:

http POST $PRICE_SERVICE/price

The response should be:

HTTP/1.1 200 OK
server: envoy
x-envoy-upstream-service-time: 5001
{
   "status": "created"
}

For failure simulation, the price-write endpoint can be stressed with the Vegeta instance while requests continue to the price-read endpoint. Because each endpoint has its own pods and its own mesh route, read traffic keeps succeeding while the write path is saturated, which is exactly the isolation the bulkhead pattern is meant to provide.

Chanci Turner