Introducing Native Support for Apache Hudi, Delta Lake, and Apache Iceberg on AWS Glue for Apache Spark: Part 2 – AWS Glue Studio Visual Editor

In the initial installment of this series, we explored how AWS Glue for Apache Spark integrates with Apache Hudi, the Linux Foundation’s Delta Lake, and Apache Iceberg datasets through their native support. This capability streamlines the process of reading and writing data within these data lake frameworks, enabling you to effortlessly construct and manage your data lakes while maintaining transactional consistency. With this feature, the hassle of installing separate connectors is eliminated, and the configuration steps for utilizing these frameworks in AWS Glue for Apache Spark jobs are significantly reduced.

These data lake frameworks store data more efficiently and give you faster access to it than basic file formats such as Apache Parquet, CSV, and JSON. While those simpler formats can hold large volumes of data, data lake frameworks organize the distributed files into tabular structures that bring fundamental database constructs, such as tables and transactional updates, to your data lake.

Building on the capabilities announced at AWS re:Invent 2022, AWS Glue now offers native support for Hudi, Delta Lake, and Iceberg directly through the AWS Glue Studio visual editor. If you prefer using a visual tool for creating AWS Glue for Apache Spark jobs, you can now select any of these three data lake frameworks as a source or target via a graphical user interface (GUI) without needing to write custom code.

Even if you are new to Hudi, Delta Lake, or Iceberg, you can easily work through typical use cases. In this post, we illustrate how to ingest data into a Hudi table using the AWS Glue Studio visual editor. If you prefer a code-based approach, see the first post in this series.

Example Scenario

To showcase the visual editor experience, we will work with the Global Historical Climatology Network Daily (GHCN-D) dataset, which is publicly available through an Amazon Simple Storage Service (Amazon S3) bucket. More details can be found in the Registry of Open Data on AWS. You can also explore how to visualize over 200 years of global climate data using Amazon Athena and Amazon QuickSight.

The Amazon S3 location s3://noaa-ghcn-pds/csv/by_year/ contains all observations from 1763 to the present, organized in CSV files—one for each year. Here’s a glimpse of the records:

ID,DATE,ELEMENT,DATA_VALUE,M_FLAG,Q_FLAG,S_FLAG,OBS_TIME
AE000041196,20220101,TAVG,204,H,,S,
AEM00041194,20220101,TAVG,211,H,,S,
AEM00041217,20220101,TAVG,209,H,,S,
AEM00041218,20220101,TAVG,207,H,,S,
...
AE000041196,20221231,TMAX,243,,,S,
AE000041196,20221231,PRCP,0,D,,S,
AE000041196,20221231,TAVG,202,H,,S,

The records contain fields such as ID, DATE, ELEMENT, and more. Each unique combination of ID, DATE, and ELEMENT corresponds to a distinct record in this dataset. For instance, the record with ID AE000041196, ELEMENT TAVG, and DATE 20220101 is unique.

In this tutorial, we will assume that the files are updated daily with new records and will focus on storing only the latest record based on the primary key (ID and ELEMENT) to make the latest snapshot data easily queryable. While one might consider performing an INSERT for all historical data and calculating the latest records during queries, this can introduce overhead. Instead, it’s more efficient to perform an UPSERT (update and insert) based on the primary key and DATE field to avoid duplicate entries and maintain a single updated row of data.
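
To make the upsert semantics concrete, here is a minimal PySpark sketch of what that logic achieves: deduplicate on the record key (ID and ELEMENT) and keep only the row with the newest DATE. This is purely an illustration, the variable names are ours, and the Hudi table you configure later in this post does the equivalent work for you at write time.

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Read one year of GHCN-D observations from the public bucket
df = spark.read.option("header", "true").csv("s3://noaa-ghcn-pds/csv/by_year/2022.csv")

# Keep only the newest DATE per (ID, ELEMENT), which is what a Hudi upsert with
# record key fields ID, ELEMENT and precombine field DATE produces automatically
w = Window.partitionBy("ID", "ELEMENT").orderBy(F.col("DATE").desc())
latest = (
    df.withColumn("rn", F.row_number().over(w))
      .where(F.col("rn") == 1)
      .drop("rn")
)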

Prerequisites

Before proceeding with this tutorial, you will need to set up the following AWS resources:

  • An AWS Identity and Access Management (IAM) role for your ETL job or notebook, as per the guidelines in Set up IAM permissions for AWS Glue Studio.
  • An S3 bucket for data storage.
  • An AWS Glue database named hudi_native.
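
If you prefer to create the database programmatically instead of in the console, a minimal boto3 sketch looks like the following (it assumes your AWS credentials and Region are already configured):

import boto3

glue = boto3.client("glue")

# Create the Glue database used in this tutorial; skip this if it already exists
glue.create_database(DatabaseInput={"Name": "hudi_native"})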

Processing a Hudi Dataset Using the AWS Glue Studio Visual Editor

Let’s create an AWS Glue job to read daily records from 2022 and write the latest snapshot into a Hudi table in your S3 bucket using the UPSERT method. Follow these steps:

  1. Open AWS Glue Studio.
  2. Select Jobs.
  3. Choose Visual with a source and target.
  4. For both Source and Target, select Amazon S3, then click Create.

A new visual job configuration screen will appear. The next step is to set up the data source to read an example dataset:

  1. Under Visual, select Data source – S3 bucket.
  2. In Node properties, for S3 source type, choose S3 location.
  3. For S3 URL, input s3://noaa-ghcn-pds/csv/by_year/2022.csv.
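
For reference, the visual source node corresponds roughly to the following AWS Glue script snippet. This is a sketch rather than the exact code that AWS Glue Studio generates, and the variable name source_dyf is ours:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Read the 2022 GHCN-D observations (CSV with a header row) from the public bucket
source_dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://noaa-ghcn-pds/csv/by_year/2022.csv"]},
    format="csv",
    format_options={"withHeader": True, "separator": ","},
)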

With the data source configured, the next step is to set up the data target for ingesting data into Apache Hudi on your S3 bucket:

  1. Choose Data target – S3 bucket.
  2. Under Data target properties for S3, select Apache Hudi for Format.
  3. Enter ghcn for Hudi Table Name.
  4. Choose Copy on write for Hudi Storage Type.
  5. Select Upsert for Hudi Write Operation.
  6. Choose ID and ELEMENT for Hudi Record Key Fields.
  7. For Hudi Precombine Key Field, select DATE.
  8. Choose GZIP for Compression Type.
  9. Enter s3://<Your S3 bucket name>/<Your S3 bucket prefix>/hudi_native/ghcn/ for S3 Target location (make sure to include your S3 bucket name and prefix).

To facilitate sample data discovery and ensure queryability from Athena, configure the job to create a table definition in the AWS Glue Data Catalog:

  1. For Data Catalog update options, select Create a table in the Data Catalog and on subsequent runs, update the schema and add new partitions.
  2. Choose hudi_native for Database.
  3. Enter ghcn for Table name.
  4. For optional Partition keys, select ELEMENT.
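
The visual editor generates the job script for you, but it can help to see the same write expressed with the plain Spark DataFrame API and standard Hudi options. The sketch below mirrors the settings chosen above; it is not the exact generated script, latest_df stands for your prepared DataFrame (for example, source_dyf.toDF() from the earlier read sketch), and the bucket and prefix placeholders must be replaced with your own values:

hudi_options = {
    "hoodie.table.name": "ghcn",
    "hoodie.datasource.write.storage.type": "COPY_ON_WRITE",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "ID,ELEMENT",
    "hoodie.datasource.write.precombine.field": "DATE",
    "hoodie.datasource.write.partitionpath.field": "ELEMENT",
    "hoodie.parquet.compression.codec": "gzip",
    # Register and update the table in the AWS Glue Data Catalog
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.mode": "hms",
    "hoodie.datasource.hive_sync.use_jdbc": "false",
    "hoodie.datasource.hive_sync.database": "hudi_native",
    "hoodie.datasource.hive_sync.table": "ghcn",
    "hoodie.datasource.hive_sync.partition_fields": "ELEMENT",
}

(
    latest_df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://<Your S3 bucket name>/<Your S3 bucket prefix>/hudi_native/ghcn/")
)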

Now your data integration job is fully set up in the visual editor. Let’s finalize the IAM role setting and run the job:

  1. Under Job details, choose your IAM role.
  2. Click Save, then Run.

Next, navigate to the Runs tab to monitor the job’s progress and wait for it to complete.
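
You can also check the run status from code. The following boto3 sketch fetches the most recent run; the job name shown is a placeholder for whatever you named your job:

import boto3

glue = boto3.client("glue")

# Retrieve the most recent run of the job and print its state
runs = glue.get_job_runs(JobName="<Your job name>", MaxResults=1)
print(runs["JobRuns"][0]["JobRunState"])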

Querying the Table Using Athena

Once the job has successfully created the Hudi table, you can query it through various engines, including Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum, in addition to AWS Glue for Apache Spark.

To query using Athena, follow these steps:

  1. Open the query editor on the Athena console.
  2. Enter the following SQL command and click Run:
SELECT * FROM "hudi_native"."ghcn" limit 10;
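
If you prefer to submit the same query programmatically, a minimal boto3 sketch looks like this (the query results location is a placeholder you need to replace):

import boto3

athena = boto3.client("athena")

# Submit the query; Athena writes the results to the S3 location you specify
response = athena.start_query_execution(
    QueryString='SELECT * FROM "hudi_native"."ghcn" limit 10',
    QueryExecutionContext={"Database": "hudi_native"},
    ResultConfiguration={"OutputLocation": "s3://<Your S3 bucket name>/athena-results/"},
)
print(response["QueryExecutionId"])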

The results will provide an overview of the data. To focus on specific records, particularly those with ID='AE000041196', you can run the following query:

SELECT * FROM "hudi_native"."ghcn" WHERE ID='AE000041196';

This query will yield the latest records corresponding to the specified ID. Although the original source file 2022.csv contains historical records for ID='AE000041196' from 20220101 through 20221231, the query should show only the most recent four records, one for each ELEMENT, at the latest snapshot date of either 20221230 or 20221231. This efficient result is made possible by the UPSERT write option configured earlier.
