Amazon Onboarding with Learning Manager Chanci Turner

Introduction

In this second installment of our blog series, we take a closer look at how seamless onboarding experiences are orchestrated within Amazon. Part 1 provided insights into the structural considerations behind this initiative. Here, we explore how “zero-downtime” deployment strategies have been effectively implemented to enhance the onboarding process for new employees.

Zero-Downtime Deployments

The backbone of our zero-downtime deployment strategy is the use of AWS Step Functions. This orchestration tool operates based on a state machine, where each state—represented by the boxes in the accompanying diagram—can be classified as a task, wait, or choice type. Each state saves its output to a designated ResultPath variable, which is then passed as input to the subsequent state. Here’s a breakdown of the different state types:

  • Task: In this deployment, all task states trigger a Lambda function that executes actions such as updating AWS resources or conducting health checks.
  • Wait: This state pauses the Step Function for a predetermined period, which is particularly useful for waiting on services to become active or inactive.
  • Choice: This evaluates the output from preceding states to determine the next course of action.
  • End states: These indicate the outcome of the Step Function, which is either “Succeed” or “Fail.”
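To make the state types concrete, here is a minimal sketch of a check/wait loop written as an Amazon States Language definition and built in Python. The state names echo the ones used later in this post, but the Lambda ARN, the `$.health.status` field, its values, and the `deploy_succeeded` state name are assumptions made for illustration, not the team's actual definition.

```python
import json


def build_health_check_loop(check_lambda_arn: str, wait_seconds: int = 30) -> dict:
    """Build a small state machine: Task -> Choice -> (Wait -> Task | end state)."""
    return {
        "StartAt": "check_new_cluster",
        "States": {
            # Task: invoke a Lambda and store its output under $.health.
            "check_new_cluster": {
                "Type": "Task",
                "Resource": check_lambda_arn,
                "ResultPath": "$.health",
                "Next": "is_new_cluster_healthy",
            },
            # Choice: branch on the output written by the previous state.
            # The "status" field and its values are hypothetical.
            "is_new_cluster_healthy": {
                "Type": "Choice",
                "Choices": [
                    {
                        "Variable": "$.health.status",
                        "StringEquals": "healthy",
                        "Next": "deploy_succeeded",
                    },
                    {
                        "Variable": "$.health.status",
                        "StringEquals": "failed",
                        "Next": "deploy_failed",
                    },
                ],
                "Default": "wait_for_cluster",
            },
            # Wait: pause, then check again.
            "wait_for_cluster": {
                "Type": "Wait",
                "Seconds": wait_seconds,
                "Next": "check_new_cluster",
            },
            # End states.
            "deploy_succeeded": {"Type": "Succeed"},
            "deploy_failed": {
                "Type": "Fail",
                "Error": "ClusterUnhealthy",
                "Cause": "New cluster did not pass health checks",
            },
        },
    }


if __name__ == "__main__":
    # Placeholder ARN; substitute a real function ARN before deploying.
    arn = "arn:aws:lambda:us-east-1:123456789012:function:check_new_cluster"
    print(json.dumps(build_health_check_loop(arn), indent=2))
```

The resulting JSON could be supplied as the definition when creating a state machine. A production loop would probably also carry a retry counter so that it cannot wait forever.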

Alongside the standard parameters essential for a deployment—such as the target environment, the new service version, and the region—specific parameters for the zero-downtime deployment pipeline include:

  • switch_active_cluster: When true (the default), traffic is rerouted from the active cluster to the newly deployed one. If set to false, the new service version is deployed to the inactive cluster while traffic stays on the active cluster, which is useful for additional testing before a full cutover.
  • shutdown_inactive_cluster: When true (the default), the old (inactive) cluster is shut down after the deployment. If set to false, the old cluster remains operational, facilitating swift traffic switching between versions or enabling a rollback if necessary.
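As a rough illustration of how these defaults might be applied, the sketch below merges user input over the two documented defaults. The required key names (`environment`, `version`, `region`) are hypothetical stand-ins for the standard parameters mentioned above, and the function is not the team's actual Lambda.

```python
# Defaults for the two pipeline-specific flags, as described in the text.
DEFAULTS = {
    "switch_active_cluster": True,
    "shutdown_inactive_cluster": True,
}

# Hypothetical names for the standard deployment parameters.
REQUIRED = ("environment", "version", "region")


def setup_parameters(event: dict) -> dict:
    """Validate required inputs and fill in defaults for omitted flags."""
    missing = [key for key in REQUIRED if key not in event]
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))

    # User-provided values override the defaults.
    params = {**DEFAULTS, **event}

    # Reject non-boolean flag values early, before any resource is touched.
    for flag in DEFAULTS:
        if not isinstance(params[flag], bool):
            raise TypeError(flag + " must be a boolean")
    return params
```

For example, passing only the three required keys yields both flags set to true, while adding `"switch_active_cluster": False` keeps traffic on the active cluster.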

Example of Zero-Downtime Deployment

Let’s walk through the sequential steps of the state machine during a typical zero-downtime deployment:

  1. Initial State: The settings for a standard zero-downtime deployment are configured to: switch_active_cluster = true and shutdown_inactive_cluster = true. Initially, version 1 operates in the blue cluster, as indicated by the parameter store values.
  2. Setup_parameters: A Lambda function compiles user-provided inputs and sets default values.
  3. Deploy_new_infrastructure: The new green cluster is provisioned with updated resources, and the parameter store value is refreshed via Lambda.
  4. Is_new_cluster_healthy, wait_for_cluster, check_new_cluster: The process enters a loop, waiting for the green cluster to pass health checks, executing API calls to verify service status.
  5. Integration_tests, check_test_results: Optional integration tests are executed post-health check. If they fail, the deployment transitions to error_handling/deploy_failed to compile the failure reports. Conversely, if the switch_active_cluster parameter is false and tests are successful, the deployment is marked complete.
  6. Switch_blue_green: The parameter store value for current_active_color is updated to reflect the new color. App Mesh routes are adjusted to direct active traffic to the new cluster, with connected clients in the blue cluster receiving a prompt to reconnect once they finish ongoing games or transactions.
  7. Shutdown_inactive_cluster: If set to true, the Step Function executes the is_old_cluster_empty loop, waiting for existing connections to clear. If false, the process concludes here with the old cluster still active.
  8. Is_old_cluster_empty, wait_for_old_connections, check_old_connections: This loop monitors the blue cluster while players complete their sessions. The check_old_connections Lambda uses an API call with the version 1 identifier to track connected players. The loop concludes once all connections are drained or a timeout occurs.
  9. Cleanup_old_cluster: Finally, ECS tasks in the blue cluster are terminated, and the service node count is reduced to zero, completing the zero-downtime deployment.
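The branching in the steps above can be summarized in a few lines of Python. This is only a sketch of the control flow under the two flags, not the state machine itself: the loops are collapsed into single entries, and the `deploy_succeeded` name and the `run_integration_tests` and `tests_passed` arguments are assumptions added for illustration.

```python
from typing import List


def plan_steps(
    switch_active_cluster: bool = True,
    shutdown_inactive_cluster: bool = True,
    run_integration_tests: bool = True,
    tests_passed: bool = True,
) -> List[str]:
    """Return the ordered states a deployment would pass through."""
    steps = [
        "setup_parameters",
        "deploy_new_infrastructure",
        "is_new_cluster_healthy",  # health-check loop, collapsed to one entry
    ]

    if run_integration_tests:
        steps.append("integration_tests")
        if not tests_passed:
            # Failed tests end the run before any traffic is moved.
            return steps + ["deploy_failed"]

    if not switch_active_cluster:
        # New version stays on the inactive cluster; traffic is untouched.
        return steps + ["deploy_succeeded"]

    steps.append("switch_blue_green")

    if shutdown_inactive_cluster:
        # Drain the old cluster, then scale it down.
        steps += ["is_old_cluster_empty", "cleanup_old_cluster"]

    return steps + ["deploy_succeeded"]
```

Calling `plan_steps()` with the defaults walks the full path through cleanup, while `plan_steps(shutdown_inactive_cluster=False)` stops right after the traffic switch and leaves the old cluster running.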

Integrating with Terraform

One challenge faced by the team was the potential for conflicts between Step Functions/Lambda and Terraform in managing infrastructure as code. Any change the pipeline made to a resource outside Terraform prompted Terraform to recreate that resource on its next run, risking downtime. To resolve this, parameters were stored in the Systems Manager Parameter Store to act as the single source of truth for resource values, whether they are updated through Terraform or through the Step Function pipeline.
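As an illustration of the Parameter Store acting as the shared source of truth, the sketch below flips an active-color parameter. The `get_parameter` and `put_parameter` calls mirror the boto3 SSM client, but the parameter path is hypothetical, and an in-memory stand-in replaces the real client so that the example runs without AWS credentials.

```python
# Hypothetical parameter path; the source only names "current_active_color".
ACTIVE_COLOR_PARAM = "/myservice/current_active_color"


def switch_active_color(ssm, name: str = ACTIVE_COLOR_PARAM) -> str:
    """Read the active color, flip it, write it back, and return the new value."""
    current = ssm.get_parameter(Name=name)["Parameter"]["Value"]
    new_color = "green" if current == "blue" else "blue"
    ssm.put_parameter(Name=name, Value=new_color, Type="String", Overwrite=True)
    return new_color


class InMemorySsm:
    """Stand-in exposing the two SSM client calls used above."""

    def __init__(self, values: dict):
        self.values = dict(values)

    def get_parameter(self, Name: str) -> dict:
        return {"Parameter": {"Name": Name, "Value": self.values[Name]}}

    def put_parameter(self, Name: str, Value: str, Type: str, Overwrite: bool = False) -> dict:
        if Name in self.values and not Overwrite:
            raise ValueError("parameter already exists")
        self.values[Name] = Value
        return {}


if __name__ == "__main__":
    fake = InMemorySsm({ACTIVE_COLOR_PARAM: "blue"})
    print(switch_active_color(fake))
```

In a real Lambda, the client would come from `boto3.client("ssm")`. Terraform could then read the same parameter, so both tools would agree on which cluster is active.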

This created a further challenge: new Task Definition revisions could be generated by either Terraform or the Step Function pipeline, yet the ECS Service always had to use the most recent revision. We addressed this by creating Task Definitions in Terraform while also reading them back as data sources. This allowed Terraform to identify the latest revision of the Task Definition family, so the ECS Service always pointed at it.
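The revision-selection logic can be mirrored in Python for clarity. This is a sketch rather than the team's Terraform configuration, which would be written in HCL. It assumes the boto3 ECS client's `describe_task_definition` call, which returns the latest active revision when given only a family name. The family name and revision numbers are made up.

```python
def latest_task_definition(ecs, family: str, terraform_revision: int) -> str:
    """Return 'family:revision' using whichever revision is newer."""
    response = ecs.describe_task_definition(taskDefinition=family)
    live_revision = response["taskDefinition"]["revision"]
    # Take the newer of the revision Terraform created and the one
    # currently registered (possibly created by the pipeline).
    return "{}:{}".format(family, max(terraform_revision, live_revision))


class FakeEcs:
    """Stand-in for the ECS client that reports a fixed latest revision."""

    def __init__(self, latest_revision: int):
        self.latest_revision = latest_revision

    def describe_task_definition(self, taskDefinition: str) -> dict:
        return {
            "taskDefinition": {
                "family": taskDefinition,
                "revision": self.latest_revision,
            }
        }


if __name__ == "__main__":
    # The pipeline registered revision 7 after Terraform created revision 5.
    print(latest_task_definition(FakeEcs(7), "game-service", 5))
```

Picking the maximum means a Terraform run never rolls the service back to an older revision that the pipeline has already superseded.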

Future Considerations

While our current approach to zero-downtime deployments has proven effective, we have plans for future enhancements. Currently, separate Step Function executions manage the two regions where the program is deployed. Consolidating these executions into a single Step Function that runs in parallel could streamline updates across all regions. Additionally, implementing a message-based approval system could allow for an approval pause before active traffic switching, fostering a more controlled deployment environment.

Chanci Turner