Streamlining Your Data Pipeline on AWS: An Overview of Lake Formation, Glue, and dbt Core

Chanci Turner, Amazon IXD – VGT2 Learning Manager

In the realm of modern data management, dbt has emerged as a leading tool for analytics engineering. It simplifies the development and deployment of complex data processing pipelines, built predominantly in SQL, and gives developers an intuitive interface to create, test, document, and evolve their workflows. For further details, visit the dbt documentation. While dbt has primarily targeted cloud data warehouses such as Amazon Redshift and Snowflake, it now extends to AWS data lakes, facilitated by two significant services:

  1. AWS Glue Interactive Sessions: A serverless Apache Spark environment managed by AWS Glue, offering on-demand access with a minimal billing duration of one minute.
  2. AWS Lake Formation: A service designed to swiftly establish a secure data lake.

In this post, we will explore the deployment of a data pipeline within your modern data platform using the dbt-glue adapter, developed by the AWS Professional Services team in collaboration with dbt Labs. This open-source adapter lets developers use dbt against data lakes efficiently, paying only for the compute resources they consume, with no need for extensive data movement. The benefits of dbt remain intact: a streamlined local development experience, comprehensive documentation, testing features, incremental data processing, Git integration, CI/CD capabilities, and more.

The dbt-glue adapter is a trusted adapter in dbt Cloud, having passed a rigorous evaluation covering development, documentation, user experience, and maintenance criteria.

Solution Overview

The architecture of this solution is illustrated in the accompanying diagram. The workflow comprises the following steps:

  1. The data team sets up a local Python virtual environment and constructs a data pipeline using dbt.
  2. The dbt-glue adapter collaborates with Lake Formation to manage all structural modifications, including database, table, or view creation.
  3. AWS Glue interactive sessions serve as the backend for data processing.
  4. Data is stored in Amazon S3 in the Parquet open file format.
  5. The data team can query all information stored in the data lake utilizing Amazon Athena.

Walkthrough Overview

This post guides you through running a data pipeline that generates metrics from NYC taxi data in the following steps:

  1. Deploy the provided AWS CloudFormation stack in the us-east-1 region.
  2. Configure your Amazon CloudShell environment.
  3. Install the dbt CLI and the dbt-glue adapter.
  4. Clone the project using CloudShell and adjust it to align with your account settings.
  5. Execute dbt to run the data pipeline.
  6. Utilize Athena to query the data.

For this demonstration, we will use data from the New York City Taxi Records dataset, available in the Registry of Open Data on AWS (RODA), a repository of publicly available datasets hosted on AWS. The CloudFormation template creates the nyctaxi database in your AWS Glue Data Catalog and a table (records) that points to the public dataset, so you don't need to host the data in your own account.
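Once the stack is created, a quick way to confirm that the nyctaxi database and records table are wired up is to preview a few rows from Athena. The query below is illustrative; run it in the Athena workgroup the template creates:

```sql
-- Sanity check: preview rows from the public NYC taxi dataset
-- registered by the CloudFormation template
SELECT *
FROM nyctaxi.records
LIMIT 10;
```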

Prerequisites

The CloudFormation template employed in this project configures the AWS Identity and Access Management (IAM) role “GlueInteractiveSessionRole” with all necessary permissions. Further information on permissions for AWS Glue interactive sessions can be found in the article on securing AWS Glue interactive sessions with IAM.

Deploying Resources with AWS CloudFormation

The CloudFormation stack provisions the essential infrastructure, which includes:

  • An IAM role with the requisite permissions to operate an AWS Glue interactive session and the dbt-glue adapter.
  • An AWS Glue database and table for metadata related to the NYC taxi records dataset.
  • An S3 bucket designated for output and storage of processed data.
  • An Athena configuration (including a workgroup and an S3 bucket for output storage) to facilitate dataset exploration.
  • An AWS Lambda function serving as a custom resource to update all partitions in the AWS Glue table.

To create these resources, choose Launch Stack and follow the provided instructions.

Configuring the CloudShell Environment

To begin using CloudShell, follow these steps:

  1. Log in to the AWS Management Console and launch CloudShell by either:
    • Clicking the CloudShell icon in the console navigation bar.
    • Typing “cloudshell” in the Find Services box and selecting the CloudShell option.
  2. Verify your Python version, as dbt and the dbt-glue adapter are compatible with Python versions 3.7, 3.8, and 3.9:
    $ python3 --version
  3. Set up a Python virtual environment to isolate package versions and dependencies:
    $ sudo yum install git -y
    $ python3 -m venv dbt_venv
    $ source dbt_venv/bin/activate
    $ python3 -m pip install --upgrade pip
  4. Install the aws-glue-session package:
    $ sudo yum install gcc krb5-devel.x86_64 python3-devel.x86_64 -y
    $ pip3 install --no-cache-dir --upgrade boto3
    $ pip3 install --no-cache-dir --upgrade aws-glue-sessions
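Because dbt and the dbt-glue adapter support Python 3.7, 3.8, and 3.9 (per step 2 above), a setup script can fail fast on an unsupported interpreter before creating the virtual environment. The check_py helper below is a hypothetical sketch, not part of the project:

```shell
# Hypothetical guard: report whether a Python minor version is in the
# 3.7-3.9 range that dbt and dbt-glue support.
check_py() {
  case "$1" in
    3.7|3.8|3.9) echo "supported" ;;
    *)           echo "unsupported" ;;
  esac
}

# Feed it the running interpreter's version, e.g.:
#   check_py "$(python3 -c 'import sys; print("%d.%d" % sys.version_info[:2])')"
check_py 3.8   # → supported
check_py 3.11  # → unsupported
```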

Installing dbt and the dbt Adapter

The dbt CLI is the open-source command-line interface for managing dbt projects. To install it, execute:

$ pip3 install --no-cache-dir dbt-core

For additional information, see the dbt installation guide and the What is dbt? overview.

To install the dbt adapter, use:

$ pip3 install --no-cache-dir dbt-glue

Cloning the Project

The dbt AWS Glue interactive session demo project contains a sample data pipeline that generates metrics based on the NYC taxi dataset. Clone it with:

$ git clone https://github.com/aws-samples/dbtgluenyctaxidemo

This project includes a configuration example located at:

dbtgluenyctaxidemo/profile/profiles.yml

The table below summarizes the parameters for the adapter:

Option | Description | Mandatory
------ | ----------- | ---------
project_name | The name of the dbt project. Must match the one configured in dbt. | Yes
type | The driver to use (glue for this adapter). | Yes
query-comment | A string to include as a comment in every query that dbt runs. | No
role_arn | The ARN of the interactive session role created by the CloudFormation template. | Yes
region | The AWS Region in which the data pipeline runs. | Yes
workers | The number of workers of the defined worker type allocated when a job runs. | Yes
worker_type | The predefined worker type allocated when a job runs. Accepts Standard, G.1X, or G.2X. | Yes
schema | The schema used to organize the data stored in Amazon S3. | Yes
database | The name of the AWS Glue database. | Yes
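To make the parameters concrete, a profiles.yml for this project might look like the sketch below. The profile name, role ARN, account ID, and schema/database values are placeholders to adapt to your account; the actual example ships with the cloned project at dbtgluenyctaxidemo/profile/profiles.yml:

```yaml
dbtgluenyctaxidemo:                 # must match project_name configured in dbt
  target: dev
  outputs:
    dev:
      type: glue                    # the dbt-glue adapter
      query-comment: dbt-glue-demo  # optional comment added to every query
      role_arn: arn:aws:iam::123456789012:role/GlueInteractiveSessionRole
      region: us-east-1
      workers: 2                    # worker count for the interactive session
      worker_type: G.1X             # Standard, G.1X, or G.2X
      schema: nyctaxi_metrics       # placeholder schema for data in Amazon S3
      database: nyctaxi_metrics     # placeholder AWS Glue database name
```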
