In this article, we present a comprehensive guide on utilizing AWS DevOps tools to model and provision AWS Glue workflows, while adhering to the DevOps principle of infrastructure as code (IaC). This approach emphasizes the use of templates, source control, and automation for efficient resource management. The cloud resources are defined through AWS CloudFormation templates and provisioned using automation features from AWS CodePipeline and AWS CodeBuild. These DevOps tools are designed to be flexible and interchangeable, ensuring seamless deployment of AWS Glue workflows across various environments, including development, testing, and production, which usually exist in distinct AWS accounts and regions.
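For example, the CodeBuild stage of such a pipeline can deploy the template with the AWS CLI. The buildspec below is a minimal sketch, assuming the template sits at the repository root; the stack name covid19-glue-workflow is a placeholder, not part of this solution:

version: 0.2
phases:
  build:
    commands:
      # Catch template syntax errors before attempting a deployment
      - aws cloudformation validate-template --template-body file://glue-workflow-stack.yml
      # Create or update the stack; the IAM capability is required because the template creates IAM roles
      - aws cloudformation deploy --template-file glue-workflow-stack.yml --stack-name covid19-glue-workflow --capabilities CAPABILITY_NAMED_IAM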
AWS Glue workflows are integral for managing dependencies among the components of an end-to-end ETL data pipeline. They let you consolidate related jobs, crawlers, and triggers into a single logical unit of execution. Many organizations that adopt AWS Glue workflows start by defining their pipelines in the AWS Management Console, then monitor and troubleshoot them through the console, the AWS APIs, or the AWS Command Line Interface (AWS CLI).
Overview of the Solution
This solution focuses on COVID-19 datasets. For details, see the public data lake for analysis of COVID-19 data, a centralized repository of free, up-to-date, curated datasets maintained by the AWS Data Lake team. Our primary objective is to demonstrate how to model and provision AWS Glue workflows using AWS CloudFormation and CodePipeline, rather than the intricate transformations that AWS Glue jobs can perform. The accompanying Python scripts keep the business logic deliberately simple and extensible, so you can easily identify the functions that aggregate data over monthly and quarterly periods.
The ETL pipeline reads the source COVID-19 datasets directly and writes only the aggregated data to your S3 bucket. The solution exposes the datasets through the tables listed below:
Table Name | Description | Dataset Location | Provider |
---|---|---|---|
countrycode | Lookup table for country codes | s3://covid19-lake/static-datasets/csv/countrycode/ | Rearc |
countypopulation | Lookup table for population of each county | s3://covid19-lake/static-datasets/csv/CountyPopulation/ | Rearc |
state_abv | Lookup table for US state abbreviations | s3://covid19-lake/static-datasets/json/state-abv/ | Rearc |
rearc_covid_19_nyt_data_in_usa_us_counties | Data on COVID-19 cases at the county level in the US | s3://covid19-lake/rearc-covid-19-nyt-data-in-usa/csv/us-counties/ | Rearc |
rearc_covid_19_nyt_data_in_usa_us_states | Data on COVID-19 cases at the state level in the US | s3://covid19-lake/rearc-covid-19-nyt-data-in-usa/csv/us-states/ | Rearc |
rearc_covid_19_testing_data_states_daily | Daily COVID-19 testing data at the state level in the US | s3://covid19-lake/rearc-covid-19-testing-data/csv/states_daily/ | Rearc |
rearc_covid_19_testing_data_us_daily | US total test daily trend | s3://covid19-lake/rearc-covid-19-testing-data/csv/us_daily/ | Rearc |
rearc_covid_19_testing_data_us_total_latest | US total tests | s3://covid19-lake/rearc-covid-19-testing-data/csv/us-total-latest/ | Rearc |
rearc_covid_19_world_cases_deaths_testing | Worldwide COVID-19 cases, deaths, and testing data | s3://covid19-lake/rearc-covid-19-world-cases-deaths-testing/ | Rearc |
rearc_usa_hospital_beds | Hospital beds and their utilization in the US | s3://covid19-lake/rearc-usa-hospital-beds/ | Rearc |
world_cases_deaths_aggregates | Monthly and quarterly aggregates of worldwide cases, deaths, and testing data | s3://<your-S3-bucket>/covid19/world-cases-deaths-aggregates/ | Aggregate |
Prerequisites
To follow this guide, you should have the following:
- Access to an AWS account
- The AWS CLI (optional)
- Permissions to create a CloudFormation stack
- Permissions to create AWS resources, including AWS Identity and Access Management (IAM) roles, Amazon Simple Storage Service (S3) buckets, and various other resources
- Basic familiarity with AWS Glue resources such as triggers, crawlers, and jobs
Architecture
The CloudFormation template glue-workflow-stack.yml specifies all the AWS Glue resources depicted in the associated architecture diagram.
Modeling the AWS Glue Workflow with AWS CloudFormation
Let’s delve into the template used to model the AWS Glue workflow, glue-workflow-stack.yml. We concentrate on two resource types in the subsequent snippets:
- AWS::Glue::Workflow
- AWS::Glue::Trigger
From a logical standpoint, a workflow encompasses one or more triggers that are responsible for invoking crawlers and jobs. The process of building a workflow begins with defining the crawlers and jobs as resources within the template, subsequently associating them with triggers.
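Before we associate anything with a trigger, each crawler and job must be declared as its own resource. The snippet below is a minimal sketch of that shape, not an excerpt from the published template: the role parameter GlueRoleArn, the database covid19_db, and the script location are assumptions that would be defined elsewhere in your stack.

  CountyPopulation:
    Type: AWS::Glue::Crawler
    Properties:
      Name: county_population
      Role: !Ref GlueRoleArn # assumed parameter holding an IAM role for AWS Glue
      DatabaseName: covid19_db # assumed Glue Data Catalog database
      Targets:
        S3Targets:
          - Path: s3://covid19-lake/static-datasets/csv/CountyPopulation/

  Covid19WorkflowStarted:
    Type: AWS::Glue::Job
    Properties:
      Name: covid19_workflow_started
      Role: !Ref GlueRoleArn
      Command:
        Name: glueetl # Spark ETL job
        ScriptLocation: s3://<your-artifact-bucket>/scripts/workflow_started.py # assumed location
        PythonVersion: "3"
      GlueVersion: "2.0"

With resources like these in place, the workflow and triggers defined next can reference them by logical ID.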
Defining the Workflow
This stage marks the initiation of the workflow definition. In the snippet below, we specify the type as AWS::Glue::Workflow and set the property Name to a reference to the parameter GlueWorkflowName.
Parameters:
  GlueWorkflowName:
    Type: String
    Description: Glue workflow that tracks all triggers, jobs, crawlers as a single entity
    Default: Covid_19

Resources:
  Covid19Workflow:
    Type: AWS::Glue::Workflow
    Properties:
      Description: Glue workflow that tracks specified triggers, jobs, and crawlers as a single entity
      Name: !Ref GlueWorkflowName
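If the jobs in a workflow need shared configuration, such as the destination S3 bucket for the aggregated output, the workflow itself can carry it. The following variant is a minimal sketch rather than part of the published template; the TargetBucketName parameter is hypothetical:

  Covid19Workflow:
    Type: AWS::Glue::Workflow
    Properties:
      Description: Glue workflow that tracks specified triggers, jobs, and crawlers as a single entity
      Name: !Ref GlueWorkflowName
      # Default run properties are visible to every job run that this workflow starts
      DefaultRunProperties:
        s3_target_bucket: !Ref TargetBucketName # hypothetical parameter holding your bucket name
      # Serialize runs so the daily schedule never overlaps a still-running workflow
      MaxConcurrentRuns: 1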
Defining the Triggers
At this point, we define each trigger and link it to the workflow. In the snippet below, we set the property WorkflowName on each trigger to a reference to the parameter GlueWorkflowName, which is also the name assigned to the workflow resource Covid19Workflow. The triggers enable us to create a chain of dependent jobs and crawlers, as specified by the properties Actions and Predicate.
The trigger t_Start has the type SCHEDULED, meaning it fires at a defined time (in this case, once daily at 8:00 AM UTC). Each time it runs, it starts the job with the logical ID Covid19WorkflowStarted. The trigger t_GroupA, by contrast, has the type CONDITIONAL, meaning it fires when the resources specified in its Predicate property reach a certain state (here, when the job Covid19WorkflowStarted reaches the state SUCCEEDED). Each time t_GroupA runs, it starts the crawlers with the logical IDs CountyPopulation and Countrycode, as designated in its Actions property.
TriggerJobCovid19WorkflowStart:
  Type: AWS::Glue::Trigger
  Properties:
    Name: t_Start
    Type: SCHEDULED
    Schedule: cron(0 8 * * ? *) # Executes daily at 8 AM UTC
    StartOnCreation: true
    WorkflowName: !Ref GlueWorkflowName
    Actions:
      - JobName: !Ref Covid19WorkflowStarted

TriggerCrawlersGroupA:
  Type: AWS::Glue::Trigger
  Properties:
    Name: t_GroupA
    Type: CONDITIONAL
    StartOnCreation: true
    WorkflowName: !Ref GlueWorkflowName
    Actions:
      - CrawlerName: !Ref CountyPopulation
      - CrawlerName: !Ref Countrycode
    Predicate:
      Conditions:
        - JobName: !Ref Covid19WorkflowStarted
          LogicalOperator: EQUALS
          State: SUCCEEDED
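Subsequent triggers in the workflow follow the same pattern. As a sketch of how the chain continues (the trigger name t_GroupB and the job AggregateJob are illustrative, not from the published template), a follow-on conditional trigger can wait for both crawlers to finish; note that crawler conditions use CrawlState rather than State:

TriggerJobsGroupB:
  Type: AWS::Glue::Trigger
  Properties:
    Name: t_GroupB
    Type: CONDITIONAL
    StartOnCreation: true
    WorkflowName: !Ref GlueWorkflowName
    Actions:
      - JobName: !Ref AggregateJob # hypothetical downstream aggregation job
    Predicate:
      Logical: AND # fire only when every condition below is met
      Conditions:
        - CrawlerName: !Ref CountyPopulation
          LogicalOperator: EQUALS
          CrawlState: SUCCEEDED
        - CrawlerName: !Ref Countrycode
          LogicalOperator: EQUALS
          CrawlState: SUCCEEDED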