A data lake serves as a centralized repository where you can store both structured and unstructured data at any scale. This allows you to keep your data in its native format and perform various analytics to gain valuable business insights. Over the years, Amazon Simple Storage Service (Amazon S3) has emerged as the go-to repository for enterprise data, widely utilized for diverse analytics and machine learning applications. S3 enables access to a variety of datasets, aids in building business intelligence dashboards, and promotes the adoption of a modern data architecture or data mesh pattern on Amazon Web Services (AWS).
As analytics needs continue to evolve, there is often a demand for continuously ingesting data from different sources into a data lake while querying that data simultaneously through multiple analytics tools with transactional capabilities. Traditionally, data lakes built on Amazon S3 are immutable and lack the transactional features required to adapt to changing use cases. Consequently, customers are seeking methods to not only transfer new or incremental data to data lakes as transactions but also to convert existing data—currently in Apache Parquet format—to a transactional framework. Open table formats like Apache Iceberg provide a viable solution. Apache Iceberg facilitates transactions on data lakes and streamlines data storage, management, ingestion, and processing.
This article details how to convert existing data in an Amazon S3 data lake from Apache Parquet format to Apache Iceberg format, thus enabling transactional capabilities, using Jupyter Notebook-based interactive sessions via AWS Glue 4.0.
Migrating Existing Parquet Data to Iceberg
There are two primary strategies for migrating existing data from a data lake in Apache Parquet format to Apache Iceberg format, effectively transitioning the data lake to a transactional table format.
- In-place Data Upgrade
In an in-place migration strategy, existing datasets are transitioned to Apache Iceberg format without reprocessing or modifying the existing data. The data files within the data lake remain untouched throughout the migration, while the Apache Iceberg metadata files (manifest files, manifest lists, and table metadata files) are generated separately to track them. This approach is generally more cost-effective than rewriting all the data files. The existing data must be in one of the supported formats: Apache Parquet, Apache ORC, or Apache Avro. An in-place migration can be conducted in one of two ways:
- Using add_files: This procedure adds existing data files to an existing Iceberg table as a new snapshot that includes those files. Unlike the migrate or snapshot methods, add_files can import files from specific partitions without creating a new Iceberg table. The procedure does not analyze the schema of the files for compatibility with the Iceberg table schema. Once it completes, the Iceberg table treats these files as part of the table.
- Using migrate: This procedure replaces a table with an Apache Iceberg table containing the source's data files. The table's schema, partitioning, properties, and location are copied from the source. Supported formats are Avro, Parquet, and ORC. By default, the original table is retained under the name table_BACKUP_. To leave the original table unmodified during the process, use the snapshot procedure instead to create a new temporary table with the same source data files and schema.
In this article, we will demonstrate how to use the Iceberg add_files method for an in-place data upgrade. Note that the migrate method is not supported within the AWS Glue Data Catalog.
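As a rough preview of what this looks like in a Jupyter-based AWS Glue 4.0 session, the sketch below calls the Iceberg add_files procedure through Spark SQL. It is only an illustration: the catalog configuration, the target table name target_iceberg, and the S3 paths are placeholder assumptions, not the exact names used later in this walkthrough.

```python
# Sketch: in-place upgrade with the Iceberg add_files procedure (AWS Glue 4.0,
# PySpark). Assumes the session was started with Iceberg support enabled
# (--datalake-formats iceberg) and that the target Iceberg table already exists.
# All names and paths below are illustrative placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.io-impl",
            "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.glue_catalog.warehouse",
            "s3://demo-blog-post-XXXXXXXX/iceberg/")
    .getOrCreate()
)

# Register the existing Parquet files with the Iceberg table as a new snapshot.
# The files themselves are not rewritten; only Iceberg metadata is created.
spark.sql("""
    CALL glue_catalog.system.add_files(
        table        => 'ghcn_db.target_iceberg',
        source_table => '`parquet`.`s3://demo-blog-post-XXXXXXXX/parquet/`'
    )
""").show(truncate=False)
```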
- CTAS Migration of Data
The Create Table As Select (CTAS) migration approach generates all of the necessary Iceberg metadata while also rewriting (restating) all of the data files into a new table. The new table shadows the source dataset and is built up in batches; once the shadow has caught up, you can swap the original dataset for the shadowed one.
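For comparison, a CTAS migration can be expressed as a single Spark SQL statement in the same kind of Glue session. This is a hedged sketch only: the target table name ghcn_iceberg_ctas and the partition columns shown are assumptions for illustration, not the exact statement from this walkthrough.

```python
# Sketch: CTAS migration. Spark reads the cataloged Parquet table and writes a
# brand-new Iceberg table (new data files plus Iceberg metadata). Reuses the
# `spark` session configured above and assumes the AWS Glue Data Catalog is the
# session's metastore; names are illustrative placeholders.
spark.sql("""
    CREATE TABLE glue_catalog.ghcn_db.ghcn_iceberg_ctas
    USING iceberg
    PARTITIONED BY (year, element)
    AS SELECT * FROM ghcn_db.source_parquet
""")
```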
Prerequisites
To follow along with this guide, you will need:
- An AWS account with a role that grants sufficient access to provision the necessary resources.
- We will be using the AWS Region us-east-1.
- An AWS Identity and Access Management (IAM) role for your notebook as outlined in the “Set up IAM permissions for AWS Glue Studio” guide.
- For demonstration purposes, we will utilize the NOAA Global Historical Climatology Network Daily (GHCN-D) dataset, which is available in Apache Parquet format in an S3 bucket (s3://noaa-ghcn-pds/parquet/by_year/).
- The AWS Command Line Interface (AWS CLI), configured to interact with AWS services.
You can check the data size with the following command in the AWS CLI or AWS CloudShell:
aws s3 ls --summarize --human-readable --recursive s3://noaa-ghcn-pds/parquet/by_year/YEAR=2023
As of this writing, there are 107 objects totaling approximately 70 MB for the year 2023 in the specified S3 path.
Keep in mind that before implementing this solution, you must complete several preparatory steps.
Deploying Resources with AWS CloudFormation
Follow these steps to create the S3 bucket and the AWS IAM role and policy needed for the solution:
- Sign in to your AWS account and choose “Launch Stack” to initiate the CloudFormation template.
- Enter a name for your stack.
- Keep the parameters at their default settings; should you change the defaults, ensure you make corresponding adjustments throughout the remaining steps.
- Click “Next” to create your stack.
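If you prefer to script the deployment rather than use the console, the boto3 sketch below shows an equivalent call. The file name template.yaml and the stack name are hypothetical placeholders, since this walkthrough distributes the template through the Launch Stack link rather than as a local file.

```python
# Sketch: create the stack with boto3 instead of the console "Launch Stack"
# button. Assumes the CloudFormation template has been saved locally as
# template.yaml (hypothetical name) and that your credentials target us-east-1.
import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")

with open("template.yaml") as f:
    template_body = f.read()

cfn.create_stack(
    StackName="demo-blog-post-stack",       # choose your own stack name
    TemplateBody=template_body,
    Capabilities=["CAPABILITY_NAMED_IAM"],  # required because IAM resources are created
)

# Block until stack creation finishes.
cfn.get_waiter("stack_create_complete").wait(StackName="demo-blog-post-stack")
```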
This AWS CloudFormation template will deploy the following resources:
- An S3 bucket named demo-blog-post-XXXXXXXX (where XXXXXXXX represents the AWS account ID used).
- Two folders named parquet and iceberg within the bucket.
- An IAM role and policy titled demoblogpostrole and demoblogpostscoped, respectively.
- An AWS Glue database named ghcn_db.
- An AWS Glue crawler named demopostcrawlerparquet.
Once the AWS CloudFormation template deployment is complete:
- Transfer the data into the newly created S3 bucket using the following command in the AWS CLI or AWS CloudShell, replacing XXXXXXXX with your AWS account ID. Note: In this example, we only copy data for the year 2023, but you can follow the same instructions for the entire dataset.
aws s3 sync s3://noaa-ghcn-pds/parquet/by_year/YEAR=2023/ s3://demo-blog-post-XXXXXXXX/parquet/year=2023
- Open the AWS Management Console and navigate to the AWS Glue console.
- In the navigation pane, select “Crawlers.”
- Run the crawler named demopostcrawlerparquet.
- After the AWS Glue crawler runs successfully, the metadata for the Apache Parquet data is cataloged under the ghcn_db AWS Glue database, with the table named source_parquet. You will notice that the table is partitioned based on the year and element columns (as in the S3 bucket).
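If you'd rather script this step than use the console, a boto3 equivalent (a sketch, assuming credentials for the same account and Region) looks like this:

```python
# Sketch: start the Glue crawler programmatically and wait for it to finish.
# The crawler name comes from the CloudFormation template in this walkthrough.
import time

import boto3

glue = boto3.client("glue", region_name="us-east-1")
glue.start_crawler(Name="demopostcrawlerparquet")

# Poll until the crawler returns to the READY state (i.e., the run is done).
while glue.get_crawler(Name="demopostcrawlerparquet")["Crawler"]["State"] != "READY":
    time.sleep(30)
```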
For verification, query the newly cataloged table from the Amazon Athena console; a simple row count against ghcn_db.source_parquet is enough to confirm that the data is queryable. If you're using Amazon Athena for the first time in your AWS account, you first need to configure an Amazon S3 location for your query results.