ETL (extract, transform, load) orchestration is a common pattern for building big data pipelines. Orchestrating parallel ETL processes typically means coordinating several tools, each performing a different operation. AWS Glue workflows simplify this coordination. This article shows how to orchestrate parallel ETL using AWS Glue workflows and triggers, and how to use custom classifiers with AWS Glue crawlers to classify fixed-width data files.
AWS Glue workflows serve as both a visual and programmatic tool for creating data pipelines by integrating AWS Glue crawlers for schema discovery and AWS Glue Spark and Python shell jobs for data transformation. A workflow comprises one or more task nodes organized in a graph structure. You can define relationships and pass parameters between these nodes, enabling the construction of pipelines with varying levels of complexity. Workflows can be triggered on a schedule or initiated on demand. You can monitor the progress of individual nodes or the entire workflow, simplifying the troubleshooting of your pipelines.
If your data does not conform to AWS Glue’s built-in classifiers, you will need to define a custom classifier to automatically create a table definition. For instance, if your data comes from a mainframe system that employs a COBOL copybook data structure, you will need a custom classifier when crawling the data to extract its schema. AWS Glue crawlers allow you to provide a custom classifier for data classification. You can create a custom classifier using a Grok pattern, XML tag, JSON, or CSV. Upon starting, the crawler invokes the custom classifier, and if it recognizes the data, it saves the classification and schema in the AWS Glue Data Catalog.
Use Case
In this article, we will use the ingestion of Automated Clearing House (ACH) and check payment data as a case study. ACH is a computer-based electronic network for transaction processing, while check payments are negotiable transactions drawn against deposited funds to pay the recipient a specific amount on demand. Both ACH and check payment data files, formatted in fixed-width style, need to be ingested incrementally over time. These two data types must be merged to create a consolidated view of all payments. The consolidated ACH and check records are stored in a table suitable for business analytics via Amazon Athena.
Solution Overview
We begin by defining an AWS Glue crawler with a custom classifier for each data type. An AWS Glue workflow orchestrates the entire process, triggering the crawlers to run concurrently. Once the crawlers finish, the workflow initiates an AWS Glue ETL job to process the input data files. The workflow monitors the completion of the ETL job, which performs the data transformation and updates the table metadata in the AWS Glue Data Catalog.
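Conceptually, the workflow implements a fan-out/join pattern: two crawlers run in parallel, and the ETL job starts only after both complete. The following pure-Python sketch illustrates that pattern (the function names are illustrative placeholders, not AWS Glue APIs):

```python
from concurrent.futures import ThreadPoolExecutor

# Placeholder tasks standing in for the two AWS Glue crawlers.
def run_ach_crawler():
    return "ach"      # the real crawler would populate the ach table

def run_check_crawler():
    return "check"    # the real crawler would populate the check table

def run_etl_job(tables):
    # The ETL job runs only after both crawler tasks have finished.
    return f"processed:{'+'.join(sorted(tables))}"

# Fan out: start both crawler tasks concurrently.
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(run_ach_crawler), pool.submit(run_check_crawler)]
    tables = [f.result() for f in futures]  # join point: wait for both

result = run_etl_job(tables)
print(result)  # processed:ach+check
```

In the actual workflow, AWS Glue triggers express this same dependency declaratively: a conditional trigger fires the ETL job only when both crawler nodes report success.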
The accompanying diagram illustrates a typical workflow for ETL workloads.
This post includes an AWS CloudFormation template that creates the resources needed for the AWS Glue workflow architecture. AWS CloudFormation allows you to model, provision, and manage AWS resources by treating infrastructure as code.
The CloudFormation template generates the following resources:
- An on-demand AWS Glue workflow trigger. When the workflow is started, the trigger concurrently launches two crawlers that process the ACH payment and check payment data files.
- Custom classifiers for parsing incoming fixed-width files containing ACH and check data.
- AWS Glue crawlers:
  - A crawler that classifies ACH payments using the custom classifier defined for ACH payment raw data. This crawler creates a table named ach in the Data Catalog’s RAW database.
  - A crawler that classifies check payments using the custom classifier defined for check payment raw data. This crawler creates a table named check in the same RAW database.
- An AWS Glue ETL job that executes when both crawlers are complete. This ETL job reads the ACH and check tables, applies transformations using PySpark DataFrames, writes the output to a designated Amazon S3 location, and updates the Data Catalog for the processed payment table with a new hourly partition.
- S3 buckets named RawDataBucket, ProcessedBucket, and ETLBucket. RawDataBucket stores the raw payment data received from the source system, while ProcessedBucket holds the output after AWS Glue transformations. This data is ready for end-user consumption through Athena. ETLBucket contains the AWS Glue ETL code used for data processing within the workflow.
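The heart of the ETL job is the consolidation step: tagging each record with its payment type, merging the two sources, and assigning the hourly partition. The actual job uses PySpark DataFrames; the following is a simplified pure-Python sketch of the same logic, with fabricated field names beyond acct_num:

```python
from datetime import datetime, timezone

def consolidate(ach_records, check_records):
    """Tag each record with its payment type, merge the two sources,
    and attach the hourly partition value the ETL job writes to S3."""
    hour = datetime.now(timezone.utc).strftime("%Y-%m-%d-%H")
    merged = []
    for rec in ach_records:
        merged.append({**rec, "pymt_type": "ACH", "partition_hour": hour})
    for rec in check_records:
        merged.append({**rec, "pymt_type": "CHECK", "partition_hour": hour})
    return merged

# Fabricated sample records for illustration.
ach = [{"acct_num": "1111222233334444", "amount": 120.00}]
check = [{"acct_num": "5555666677778888", "amount": 75.50}]
payments = consolidate(ach, check)
print(len(payments))  # 2
```

In PySpark, the equivalent operations would be a `withColumn` to add the payment type on each DataFrame followed by a union, writing the result partitioned by hour.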
Create Resources with AWS CloudFormation
To create your resources with the CloudFormation template, follow these steps:
- Choose Launch Stack.
- Choose Next.
- Choose Next again.
- On the Review page, select the option acknowledging that AWS CloudFormation may create IAM resources.
- Choose Create stack.
Examine Custom Classifiers for Fixed-Width Files
Let’s review the definition of the custom classifier.
- Navigate to the AWS Glue console and select Crawlers.
- Choose the crawler named ach-crawler.
- Select the RawACHClassifier classifier and examine the Grok pattern.
This pattern assumes that the first 16 characters in the fixed-width file are allocated for acct_num, and the subsequent 10 characters are reserved for orig_pmt_date. When a crawler identifies a matching classifier, the classification string and schema are used in the table definitions saved in your Data Catalog.
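A Grok pattern is essentially a named regular expression. To see how such a pattern slices a fixed-width record, here is a pure-Python equivalent using named groups (field widths as described above; the sample record is fabricated):

```python
import re

# First 16 characters: acct_num; next 10 characters: orig_pmt_date.
FIXED_WIDTH = re.compile(r"^(?P<acct_num>.{16})(?P<orig_pmt_date>.{10})")

line = "1234567890123456" + "2023-01-15"  # 16-char account + 10-char date
m = FIXED_WIDTH.match(line)
print(m.group("acct_num"))       # 1234567890123456
print(m.group("orig_pmt_date"))  # 2023-01-15
```

The crawler applies the Grok pattern in the same positional fashion, mapping each named capture to a column in the resulting table definition.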
Run the Workflow
To execute your workflow, complete the following:
- In the AWS Glue console, select the workflow created by the CloudFormation template.
- From the Actions menu, select Run.
This action initiates the workflow.
Upon completion, navigate to the History tab and select View run details to see a graphical representation of the workflow.
Examine the Tables
In the Databases section of the AWS Glue console, locate the database named glue-database-raw, which contains two tables named ach and check. These tables are created by the respective AWS Glue crawler using the specified custom classification pattern.
Query Processed Data
To query your data, follow these steps:
- In the AWS Glue console, select the database glue-database-processed.
- From the Action menu, choose View data.
This action opens the Athena console. If you are new to Athena, you will need to set up the S3 bucket to store query results.
In the query editor, run the following query:
```sql
SELECT acct_num, pymt_type, COUNT(pymt_type)
FROM glue_database_processed.processedpayment
GROUP BY acct_num, pymt_type;
```
The query returns the count of each payment type for each account in the processedpayment table.
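If you want to see what this aggregation does before running it in Athena, you can reproduce it locally with SQLite on a few fabricated rows (Athena’s engine differs, but the GROUP BY semantics are the same):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processedpayment (acct_num TEXT, pymt_type TEXT)")
conn.executemany(
    "INSERT INTO processedpayment VALUES (?, ?)",
    [("A1", "ACH"), ("A1", "ACH"), ("A1", "CHECK"), ("B2", "CHECK")],
)
rows = conn.execute(
    """SELECT acct_num, pymt_type, COUNT(pymt_type)
       FROM processedpayment
       GROUP BY acct_num, pymt_type
       ORDER BY acct_num, pymt_type"""
).fetchall()
print(rows)  # [('A1', 'ACH', 2), ('A1', 'CHECK', 1), ('B2', 'CHECK', 1)]
```

Each output row gives one (account, payment type) pair with the number of payments of that type for that account.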
Clean Up
To prevent ongoing charges, it’s essential to clean up your infrastructure by deleting the CloudFormation stack. However, you must first empty your S3 buckets.
- In the Amazon S3 console, select each bucket created by the CloudFormation stack.
- Choose Empty.
- Navigate to the AWS CloudFormation console, select the stack you created, and choose Delete.
Conclusion
In this article, we showed how AWS Glue workflows and triggers enable parallel ETL orchestration, and how custom classifiers let AWS Glue crawlers derive schemas from fixed-width ACH and check payment files. You can adapt the CloudFormation template, classifiers, and workflow shown here to your own incremental ingestion pipelines.