Provisioning AWS Glue Workflows with AWS CloudFormation and AWS CodePipeline

In this article, we present a guide to modeling and provisioning AWS Glue workflows with AWS DevOps tools, following the DevOps principle of infrastructure as code (IaC): templates, source control, and automation. The cloud resources are defined in AWS CloudFormation templates and provisioned using the automation features of AWS CodePipeline and AWS CodeBuild. These DevOps tools are flexible and interchangeable, so you can deploy AWS Glue workflows consistently across environments such as development, testing, and production, which usually live in separate AWS accounts and Regions.
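
One way to wire the deployment into CodeBuild is a buildspec that validates and deploys the template with the AWS CLI. The following is a minimal sketch, assuming a stack name of covid19-glue-workflow; it is an illustration, not the pipeline definition from this solution:

# buildspec.yml - a minimal sketch; the stack name is an assumed value
version: 0.2

phases:
  build:
    commands:
      # Catch template syntax errors before attempting a deployment
      - aws cloudformation validate-template --template-body file://glue-workflow-stack.yml
      # Create or update the stack; CAPABILITY_NAMED_IAM is required because the template provisions IAM roles
      - >-
        aws cloudformation deploy
        --template-file glue-workflow-stack.yml
        --stack-name covid19-glue-workflow
        --capabilities CAPABILITY_NAMED_IAM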

AWS Glue workflows manage dependencies among the components of an end-to-end ETL data pipeline, consolidating related jobs, crawlers, and triggers into a single logical unit of execution. Many organizations begin by defining their workflows in the AWS Management Console and then monitor and troubleshoot them through the console, the AWS APIs, or the AWS Command Line Interface (CLI).

Overview of the Solution

The focus of this solution is on COVID-19 datasets. For detailed information about these datasets, you can explore the public data lake for the analysis of COVID-19 data, a centralized repository of free and up-to-date curated datasets provided by the AWS Data Lake team. Our primary objective is to demonstrate how to model and provision AWS Glue workflows using AWS CloudFormation and CodePipeline, rather than the intricate transformations that AWS Glue jobs can perform. The accompanying Python scripts keep the business logic simple for clarity and extensibility, so you can easily identify the functions that aggregate data over monthly and quarterly periods.

The ETL pipeline reads the source COVID-19 datasets directly and writes only the aggregated data to your S3 bucket. The datasets used by the solution are listed in the following table:

| Table Name | Description | Dataset Location | Provider |
| --- | --- | --- | --- |
| countrycode | Lookup table for country codes | s3://covid19-lake/static-datasets/csv/countrycode/ | Rearc |
| countypopulation | Lookup table for the population of each US county | s3://covid19-lake/static-datasets/csv/CountyPopulation/ | Rearc |
| state_abv | Lookup table for US state abbreviations | s3://covid19-lake/static-datasets/json/state-abv/ | Rearc |
| rearc_covid_19_nyt_data_in_usa_us_counties | COVID-19 cases at the county level in the US | s3://covid19-lake/rearc-covid-19-nyt-data-in-usa/csv/us-counties/ | Rearc |
| rearc_covid_19_nyt_data_in_usa_us_states | COVID-19 cases at the state level in the US | s3://covid19-lake/rearc-covid-19-nyt-data-in-usa/csv/us-states/ | Rearc |
| rearc_covid_19_testing_data_states_daily | COVID-19 testing data at the state level in the US | s3://covid19-lake/rearc-covid-19-testing-data/csv/states_daily/ | Rearc |
| rearc_covid_19_testing_data_us_daily | Daily trend of total US tests | s3://covid19-lake/rearc-covid-19-testing-data/csv/us_daily/ | Rearc |
| rearc_covid_19_testing_data_us_total_latest | Latest US total tests | s3://covid19-lake/rearc-covid-19-testing-data/csv/us-total-latest/ | Rearc |
| rearc_covid_19_world_cases_deaths_testing | World cases, deaths, and testing data | s3://covid19-lake/rearc-covid-19-world-cases-deaths-testing/ | Rearc |
| rearc_usa_hospital_beds | Hospital beds and their utilization in the US | s3://covid19-lake/rearc-usa-hospital-beds/ | Rearc |
| world_cases_deaths_aggregates | Monthly and quarterly aggregates of world cases, deaths, and testing | s3://<your-bucket>/covid19/world-cases-deaths-aggregates/ | Aggregate |

Prerequisites

To follow this guide, you should have the following:

  • Access to an AWS account
  • The AWS CLI (optional)
  • Permissions to create a CloudFormation stack
  • Permissions to create AWS resources, including AWS Identity and Access Management (IAM) roles, Amazon Simple Storage Service (S3) buckets, and various other resources
  • Basic familiarity with AWS Glue resources such as triggers, crawlers, and jobs

Architecture

The CloudFormation template glue-workflow-stack.yml specifies all the AWS Glue resources depicted in the associated architecture diagram.
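
Before diving into individual resources, it helps to see the overall shape of the template. The outline below is a structural sketch only (resource properties are elided, so it is not deployable as-is); the resource names match the snippets that follow:

# Structural outline of glue-workflow-stack.yml (properties elided; a sketch, not the full template)
Parameters:
  GlueWorkflowName: {Type: String, Default: Covid_19}

Resources:
  Covid19Workflow:                  # AWS::Glue::Workflow - the single logical unit of execution
    Type: AWS::Glue::Workflow
  TriggerJobCovid19WorkflowStart:   # AWS::Glue::Trigger - scheduled entry point (t_Start)
    Type: AWS::Glue::Trigger
  TriggerCrawlersGroupA:            # AWS::Glue::Trigger - conditional trigger (t_GroupA)
    Type: AWS::Glue::Trigger
  CountyPopulation:                 # AWS::Glue::Crawler - catalogs a source dataset
    Type: AWS::Glue::Crawler
  Countrycode:                      # AWS::Glue::Crawler
    Type: AWS::Glue::Crawler
  Covid19WorkflowStarted:           # AWS::Glue::Job - the job started by t_Start
    Type: AWS::Glue::Job
  # ... additional triggers, crawlers, and jobs for the remaining pipeline stages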

Modeling the AWS Glue Workflow with AWS CloudFormation

Let's examine the template used to model the AWS Glue workflow, glue-workflow-stack.yml. The following snippets concentrate on two resource types:

  • AWS::Glue::Workflow
  • AWS::Glue::Trigger

From a logical standpoint, a workflow encompasses one or more triggers that are responsible for invoking crawlers and jobs. The process of building a workflow begins with defining the crawlers and jobs as resources within the template, subsequently associating them with triggers.
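
For example, a crawler and a job might be declared as follows. This is a minimal sketch: the IAM role, Data Catalog database, and script location are placeholder assumptions rather than values from the actual template:

# A sketch of one crawler and one job; role, database, and script path are assumed placeholders
CountyPopulation:
  Type: AWS::Glue::Crawler
  Properties:
    Name: CountyPopulation
    Role: !GetAtt GlueExecutionRole.Arn   # assumed IAM role resource defined elsewhere in the template
    DatabaseName: covid_19                # assumed Glue Data Catalog database
    Targets:
      S3Targets:
        - Path: s3://covid19-lake/static-datasets/csv/CountyPopulation/

Covid19WorkflowStarted:
  Type: AWS::Glue::Job
  Properties:
    Name: Covid19_workflow_started
    Role: !GetAtt GlueExecutionRole.Arn
    GlueVersion: '2.0'
    Command:
      Name: glueetl                       # Spark ETL job
      PythonVersion: '3'
      ScriptLocation: s3://<your-bucket>/scripts/covid19_workflow_started.py  # assumed location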

Defining the Workflow

This stage marks the initiation of the workflow definition. In the snippet below, we specify the type as AWS::Glue::Workflow and the property Name as a reference to the parameter GlueWorkflowName.

Parameters:
  GlueWorkflowName:
    Type: String
    Description: Glue workflow that tracks all triggers, jobs, crawlers as a single entity
    Default: Covid_19

Resources:
  Covid19Workflow:
    Type: AWS::Glue::Workflow
    Properties: 
      Description: Glue workflow that tracks specified triggers, jobs, and crawlers as a single entity
      Name: !Ref GlueWorkflowName 
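
If other stacks or pipeline stages need the workflow's name, one option (an addition of ours, not part of the original template) is to export it through an Outputs section:

# Optional Outputs section - a sketch; the export name is an assumption
Outputs:
  WorkflowName:
    Description: Name of the AWS Glue workflow provisioned by this stack
    Value: !Ref Covid19Workflow          # Ref on AWS::Glue::Workflow returns the workflow name
    Export:
      Name: !Sub '${AWS::StackName}-GlueWorkflowName'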

Defining the Triggers

At this point, we define each trigger and link it to the workflow. In the snippet below, the property WorkflowName on each trigger references the parameter GlueWorkflowName, the same name assigned to the workflow itself, which places each trigger inside that workflow. Triggers let us chain dependent jobs and crawlers through the properties Actions and Predicate.

The trigger t_Start has the type SCHEDULED, meaning it fires at a defined time (in this case, once daily at 8:00 AM UTC). Each time it runs, it starts the job with the logical ID Covid19WorkflowStarted. The trigger t_GroupA, by contrast, has the type CONDITIONAL, meaning it fires when the resources listed in its Predicate reach the specified state: here, when the job Covid19WorkflowStarted completes with the state SUCCEEDED. Each time t_GroupA fires, it starts the crawlers with the logical IDs CountyPopulation and Countrycode, as designated in the Actions property.

TriggerJobCovid19WorkflowStart:
  Type: AWS::Glue::Trigger
  Properties:
    Name: t_Start
    Type: SCHEDULED
    Schedule: cron(0 8 * * ? *) # Executes daily at 8 AM UTC
    StartOnCreation: true
    WorkflowName: !Ref GlueWorkflowName
    Actions:
      - JobName: !Ref Covid19WorkflowStarted

TriggerCrawlersGroupA:
  Type: AWS::Glue::Trigger
  Properties:
    Name: t_GroupA
    Type: CONDITIONAL
    StartOnCreation: true
    WorkflowName: !Ref GlueWorkflowName
    Actions:
      - CrawlerName: !Ref CountyPopulation
      - CrawlerName: !Ref Countrycode
    Predicate:
      Conditions:
        - JobName: !Ref Covid19WorkflowStarted
          State: SUCCEEDED
          LogicalOperator: EQUALS
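
Subsequent stages follow the same pattern. The sketch below shows one possible next trigger (a hypothetical t_GroupB, not taken from the actual template) that waits for both group A crawlers to finish before starting a downstream job. Note that crawler conditions use the property CrawlState rather than State, and Logical: AND requires every condition to hold:

# A hypothetical follow-on trigger, sketched to show chaining; resource names are assumptions
TriggerJobsGroupB:
  Type: AWS::Glue::Trigger
  Properties:
    Name: t_GroupB
    Type: CONDITIONAL
    StartOnCreation: true
    WorkflowName: !Ref GlueWorkflowName
    Actions:
      - JobName: !Ref AggregateJob        # assumed downstream job resource
    Predicate:
      Logical: AND                        # fire only when every condition below is met
      Conditions:
        - CrawlerName: !Ref CountyPopulation
          CrawlState: SUCCEEDED           # crawler conditions use CrawlState, not State
          LogicalOperator: EQUALS
        - CrawlerName: !Ref Countrycode
          CrawlState: SUCCEEDED
          LogicalOperator: EQUALS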
