Streamline AWS Glue Job Management and Oversight with Amazon MWAA


Organizations across many sectors face intricate data processing challenges that span multiple analytics platforms, including data lakes on AWS, data warehouses such as Amazon Redshift, search systems such as Amazon OpenSearch Service, NoSQL databases such as Amazon DynamoDB, and machine learning tools such as Amazon SageMaker. Data professionals are tasked with extracting insights from data housed in these distributed systems to enhance user experiences in a secure and cost-effective manner. For instance, media companies aim to merge and analyze datasets from internal and external databases to build cohesive customer profiles that drive innovative features and boost user engagement.

In such contexts, customers seeking a serverless data integration solution rely on AWS Glue as a fundamental tool for data processing and cataloging. AWS Glue integrates seamlessly with AWS services and partner products, offering low-code/no-code ETL (extract, transform, and load) capabilities to support analytics, machine learning, and application development workflows. AWS Glue ETL jobs often form part of a larger pipeline, making orchestration and dependency management between these elements a crucial aspect of any data strategy. Amazon Managed Workflows for Apache Airflow (Amazon MWAA) orchestrates data pipelines that span distributed technologies, including on-premises resources, AWS services, and third-party components.

This post outlines how to enhance the monitoring of an AWS Glue job managed by Airflow through the latest capabilities of Amazon MWAA.

Solution Overview

In this article, we will cover:

  • How to upgrade an Amazon MWAA environment to version 2.4.3.
  • How to orchestrate an AWS Glue job using an Airflow Directed Acyclic Graph (DAG).
  • The observability improvements introduced in the Airflow Amazon provider package within Amazon MWAA, allowing you to consolidate AWS Glue job run logs on the Airflow console, making troubleshooting data pipelines much easier. Previously, support teams had to navigate the AWS Management Console and perform manual steps for this visibility. This feature is automatically available starting from Amazon MWAA version 2.4.3.

The diagram below illustrates our solution architecture.

Prerequisites

You will need:

  • Access to a console with permissions to create AWS Glue jobs, IAM roles, and policies, as well as manage or launch an Amazon MWAA environment.
  • Amazon Athena configured with a workgroup.
  • AWS CloudTrail set up with a trail logging into an Amazon Simple Storage Service (Amazon S3) bucket.

Setting Up the Amazon MWAA Environment

For guidance on creating your environment, see the instructions for creating an Amazon MWAA environment. If you are an existing user, we recommend upgrading to version 2.4.3 to leverage the observability enhancements discussed here.

The upgrade path to Amazon MWAA version 2.4.3 depends on whether your current environment runs version 1.10.12 or a 2.x release (2.0.2 or 2.2.2). We cover both scenarios in this article.

Prerequisites for Setting Up an Amazon MWAA Environment

Ensure you meet the following requirements:

  • The Amazon MWAA environment should be version 2.2.2 or higher; while this post demonstrates an upgrade to 2.4.3, you can work with 2.2.2 by adjusting the apache-airflow-providers-amazon package constraints in requirements.txt.
  • The apache-airflow-providers-amazon package version must be 6.0.0 or greater.
  • The Amazon MWAA execution role needs permissions for glue:StartJobRun and glue:GetJobRun.
  • The requirements.txt used by Amazon MWAA should include any additional Python modules required by your DAGs.
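A minimal IAM policy statement granting the two Glue permissions listed above could look like the following sketch. The Region, account ID, and job name in the Resource ARN are placeholders you must replace with your own values; you can also scope the resource more broadly with a wildcard if the role runs several jobs:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowGlueJobRunFromMwaa",
      "Effect": "Allow",
      "Action": ["glue:StartJobRun", "glue:GetJobRun"],
      "Resource": "arn:aws:glue:<region>:<account-id>:job/<your-glue-job-name>"
    }
  ]
}
```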

Upgrading from Version 1.10.12 to 2.4.3

If your Amazon MWAA is currently at version 1.10.12, please refer to the guide on migrating to a new Amazon MWAA environment for instructions on upgrading to 2.4.3.

Upgrading from Version 2.0.2 or 2.2.2 to 2.4.3

If your environment runs Amazon MWAA version 2.0.2 or 2.2.2, follow these steps:

  1. Create a requirements.txt file for any custom dependencies with specific versions necessary for your DAGs.
  2. Upload this file to Amazon S3 in the designated location where the Amazon MWAA environment retrieves its requirements.txt for dependency installation.
  3. Follow the migration instructions to select version 2.4.3.
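The requirements.txt from step 1 can be as small as a single pinned line. For the provider version assumed in this post, it might contain the following (verify the pin against the published Amazon MWAA constraints file for Airflow 2.4.3):

```
apache-airflow-providers-amazon==6.0.0
```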

Updating Your DAGs

If you’ve upgraded from an older Amazon MWAA environment, you may need to adjust existing DAGs. An Airflow 2.4.3 environment defaults to version 6.0.0 of the Amazon provider package, which includes breaking changes such as renamed operators. For instance, AwsGlueJobOperator has been deprecated and replaced by GlueJobOperator. To ensure compatibility, update your Airflow DAGs by replacing any deprecated or unsupported operators with their successors. Here’s how:

  1. Visit Amazon AWS Operators.
  2. Select the appropriate version installed in your Amazon MWAA instance (6.0.0 by default) to see the list of supported Airflow operators.
  3. Make the necessary changes to your existing DAG code and upload the updated files to the DAG location in Amazon S3.

Orchestrating the AWS Glue Job from Airflow

This section delves into the specifics of orchestrating an AWS Glue job within Airflow DAGs. Airflow simplifies the development of data pipelines with dependencies across various systems, including on-premises operations, external dependencies, and other AWS services.

Orchestrating CloudTrail Log Aggregation with AWS Glue and Amazon MWAA

In this example, we will demonstrate a use case where Amazon MWAA orchestrates an AWS Glue Python Shell job that aggregates metrics based on CloudTrail logs. CloudTrail offers visibility into the AWS API calls made within your account. A typical use case for this data is to collect usage metrics for auditing and regulatory purposes.

As CloudTrail events are logged, they are stored as JSON files in Amazon S3, which are not ideal for analytical queries. Our goal is to aggregate this data and save it as Parquet files to optimize query performance. Initially, we can use Athena to conduct preliminary queries on the data before performing further aggregations in our AWS Glue job. For more information on creating an AWS Glue Data Catalog table, refer to the guide on creating the table for CloudTrail logs in Athena using partition projection data. Once we’ve analyzed the data via Athena and identified the metrics we want to retain in aggregated tables, we can create an AWS Glue job.
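The aggregation itself can be sketched in plain Python. The function below is illustrative rather than the post's actual Glue job: it reduces raw CloudTrail events (one JSON document each) to call counts per service and API action, the kind of metric the job would then write out as Parquet. The field names follow the CloudTrail event schema; everything else is an assumption:

```python
import json
from collections import Counter

def aggregate_api_calls(event_lines):
    """Count CloudTrail events per (eventSource, eventName) pair.

    event_lines: iterable of JSON strings, one CloudTrail event each.
    This mirrors the kind of reduction a Glue Python shell job would
    perform before persisting the aggregated metrics as Parquet.
    """
    counts = Counter()
    for line in event_lines:
        event = json.loads(line)
        counts[(event["eventSource"], event["eventName"])] += 1
    return counts

# Example usage with two synthetic events:
events = [
    json.dumps({"eventSource": "glue.amazonaws.com", "eventName": "StartJobRun"}),
    json.dumps({"eventSource": "glue.amazonaws.com", "eventName": "StartJobRun"}),
]
```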

Creating a CloudTrail Table in Athena

First, we need a table in our Data Catalog that lets us query CloudTrail data through Athena. The sample DDL statement below creates a table with two partition keys, the Region and the date (referred to as snapshot_date), and uses partition projection so new partitions do not need to be added manually. Replace the placeholders for your CloudTrail bucket, AWS account ID, and CloudTrail table name:

create external table if not exists `<<table_name>>`(
  `eventversion` string comment 'from deserializer',
  `useridentity` struct<type:string,principalid:string,arn:string,accountid:string,invokedby:string,accesskeyid:string,username:string,sessioncontext:struct<attributes:struct<mfaauthenticated:string,creationdate:string>>> comment 'from deserializer',
  `eventtime` string comment 'from deserializer',
  `eventsource` string comment 'from deserializer',
  `eventname` string comment 'from deserializer',
  `awsregion` string comment 'from deserializer',
  `sourceipaddress` string comment 'from deserializer',
  `useragent` string comment 'from deserializer',
  `errorcode` string comment 'from deserializer',
  `errormessage` string comment 'from deserializer')
partitioned by (`region` string, `snapshot_date` string)
row format serde 'com.amazon.emr.hive.serde.CloudTrailSerde'
stored as inputformat 'com.amazon.emr.cloudtrail.CloudTrailInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
location 's3://<<bucket_name>>/AWSLogs/<<account_id>>/CloudTrail/'
tblproperties (
  'projection.enabled'='true',
  'projection.region.type'='enum',
  'projection.region.values'='us-east-1,us-west-2',
  'projection.snapshot_date.type'='date',
  'projection.snapshot_date.format'='yyyy/MM/dd',
  'projection.snapshot_date.range'='2020/01/01,now',
  'storage.location.template'='s3://<<bucket_name>>/AWSLogs/<<account_id>>/CloudTrail/${region}/${snapshot_date}')

The column list above is abridged to the fields used in this example, and the projection Region values are placeholders for your own Regions; see the Athena documentation on querying CloudTrail logs for the full schema.

