Learn About Amazon VGT2 Learning Manager Chanci Turner
In today’s fast-paced business environment, organizations face the challenge of extracting real-time insights from their data while also performing in-depth analytics. This dual necessity raises an important question: how can companies effectively connect streaming data with analytical workloads without the hassle of complicated and difficult-to-maintain data pipelines? In this article, we introduce a simplified approach using Amazon Data Firehose (Firehose) to channel streaming data directly into Apache Iceberg tables within Amazon SageMaker Lakehouse. This method creates a more straightforward pipeline, significantly minimizing complexity and maintenance.
Real-time streaming data is essential for AI and machine learning (ML) models as it allows them to learn and adapt swiftly, which is vital for applications that demand immediate insights or adaptive responses. This capability opens new avenues for business agility and innovation. Key applications include predicting equipment failures from sensor data, real-time supply chain monitoring, and enabling AI to dynamically react to fluctuating conditions. Access to real-time streaming data empowers customers to make rapid decisions, fundamentally transforming business competition in real-time markets.
Amazon Data Firehose efficiently captures, transforms, and delivers data streams to lakehouses, data lakes, data warehouses, and analytical services, ensuring automatic scaling and delivery in seconds. For analytical tasks, the lakehouse architecture has emerged as a powerful solution, merging the best attributes of data lakes and data warehouses. Apache Iceberg, an open table format, facilitates this transition by offering transactional guarantees, schema evolution, and effective metadata management—features once limited to traditional data warehouses. SageMaker Lakehouse integrates your data across Amazon Simple Storage Service (Amazon S3) data lakes, Amazon Redshift data warehouses, and other sources, allowing you to access your data in-place with Iceberg-compatible tools and engines. By leveraging SageMaker Lakehouse, organizations can take advantage of Iceberg’s capabilities while enjoying the scalability and flexibility of a cloud-based solution. This integration dissolves traditional barriers between data storage and ML processes, enabling data professionals to interact directly with Iceberg tables in their preferred tools and notebooks.
In this post, we guide you through the steps to create Iceberg tables in Amazon SageMaker Unified Studio and stream data to those tables using Firehose. Working in this shared environment lets data engineers, analysts, and data scientists collaborate seamlessly, building comprehensive analytics and ML workflows within SageMaker Unified Studio. The approach eliminates traditional silos and accelerates the path from data ingestion to production ML models.
Solution Overview
The following diagram illustrates how Firehose can deliver real-time data to the SageMaker Lakehouse.
This article also includes an AWS CloudFormation template to set up necessary resources, enabling Firehose to stream data to Iceberg tables. You can review and customize this template to fit your needs. The template performs the following operations:
- Creates an AWS Identity and Access Management (IAM) role with the necessary permissions for Firehose to write to an S3 bucket.
- Sets up resources for the Amazon Kinesis Data Generator to transmit sample streaming data to Firehose.
- Grants AWS Lake Formation permissions on the Iceberg tables created in SageMaker Unified Studio to the Firehose IAM role (a sketch of an equivalent grant follows this list).
- Establishes an S3 bucket to back up records that fail to deliver.
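Of the steps above, the Lake Formation grant is the one you are most likely to repeat outside of CloudFormation, for example when you add more Iceberg tables later. The following is a minimal boto3 sketch of an equivalent grant; the role ARN is a placeholder and the exact permission set is an assumption, so check what the template actually grants before reusing it.

import boto3

lakeformation = boto3.client("lakeformation")

# Placeholder ARN -- substitute the Firehose role created by the template.
firehose_role_arn = "arn:aws:iam::123456789012:role/firehose-iceberg-role"

lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": firehose_role_arn},
    Resource={
        "Table": {
            "DatabaseName": "streaming_datalake",  # project AWS Glue database
            "Name": "firehose_events",             # Iceberg table created later in this post
        }
    },
    # Assumed permission set: Firehose needs to read table metadata and write new data.
    Permissions=["SELECT", "INSERT", "ALTER", "DESCRIBE"],
)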
Prerequisites
Before starting, ensure you have the following:
- An AWS account – If you don’t have one, you can create one.
- A SageMaker Unified Studio domain – For detailed instructions, refer to the guide on creating an Amazon SageMaker Unified Studio domain.
- A demo project – Set up a demo project in your SageMaker Unified Studio domain. For guidance, see the project creation instructions. For this example, we will use “All capabilities” in the project profile and “streaming_datalake” as the AWS Glue database name.
Once you have completed these prerequisites, verify that you can log in to SageMaker Unified Studio and that your project has been created successfully. Each project in SageMaker Unified Studio has its own Amazon S3 location and IAM role.
Creating an Iceberg Table
For this solution, we will utilize Amazon Athena as our query editor engine. Follow these steps to create your Iceberg table:
- In SageMaker Unified Studio, navigate to the Build menu and select Query Editor.
- Choose Athena as the engine for the query editor and select the AWS Glue database you created for your project.
- Execute the following SQL statement to create the Iceberg table. Be sure to specify your project’s AWS Glue database and S3 location (found on the project overview page). A programmatic alternative using the Athena API is sketched after the statement:
CREATE TABLE firehose_events (
    type struct<device: string, event: string, action: string>,
    customer_id string,
    event_timestamp timestamp,
    region string)
LOCATION '<PROJECT_S3_LOCATION>/iceberg/events'
TBLPROPERTIES (
    'table_type'='iceberg',
    'write_compression'='zstd'
);
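If you prefer to submit the DDL programmatically rather than through the query editor, the same statement can be run through the Athena API. The following boto3 sketch makes assumptions about where it runs: it needs credentials with access to your project’s AWS Glue database, and the output location for query results is a placeholder.

import boto3

athena = boto3.client("athena")

# The same CREATE TABLE statement shown above.
ddl = """
CREATE TABLE firehose_events (
    type struct<device: string, event: string, action: string>,
    customer_id string,
    event_timestamp timestamp,
    region string)
LOCATION '<PROJECT_S3_LOCATION>/iceberg/events'
TBLPROPERTIES ('table_type'='iceberg', 'write_compression'='zstd')
"""

response = athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "streaming_datalake"},
    # Placeholder location where Athena writes query results and metadata.
    ResultConfiguration={"OutputLocation": "<PROJECT_S3_LOCATION>/athena-results/"},
)
print("Started query:", response["QueryExecutionId"])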
Deploying Supporting Resources
Next, deploy the required resources into your AWS environment using the CloudFormation template. Follow these steps; a scripted alternative is sketched after the list:
- Click on Launch Stack.
- Proceed to the next step.
- Keep the stack name as “firehose-lakehouse.”
- Input the username and password you wish to use for accessing the Amazon Kinesis Data Generator application.
- Enter the AWS Glue database name for DatabaseName.
- For ProjectBucketName, input the project bucket name from the SageMaker Unified Studio project details page.
- Specify the table name created in SageMaker Unified Studio for TableName.
- Click Next.
- Confirm that AWS CloudFormation may create IAM resources and proceed.
- Complete the stack setup.
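If you would rather script the deployment than click through the console, the stack can also be created with the AWS SDK. The sketch below assumes you have downloaded the template locally as firehose-lakehouse.yaml; the file name and the Kinesis Data Generator user parameters are assumptions, while DatabaseName, ProjectBucketName, and TableName are the parameters described in the steps above.

import boto3

cloudformation = boto3.client("cloudformation")

# Assumed local copy of the CloudFormation template.
with open("firehose-lakehouse.yaml") as f:
    template_body = f.read()

cloudformation.create_stack(
    StackName="firehose-lakehouse",
    TemplateBody=template_body,
    Parameters=[
        {"ParameterKey": "DatabaseName", "ParameterValue": "streaming_datalake"},
        {"ParameterKey": "ProjectBucketName", "ParameterValue": "<PROJECT_BUCKET_NAME>"},
        {"ParameterKey": "TableName", "ParameterValue": "firehose_events"},
        # Add the Kinesis Data Generator username and password parameters here,
        # using the parameter keys defined in the template you downloaded.
    ],
    # The template creates IAM roles, so this capability must be acknowledged
    # (use CAPABILITY_NAMED_IAM instead if the template names its roles).
    Capabilities=["CAPABILITY_IAM"],
)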
Creating a Firehose Stream
Follow these steps to create a Firehose stream to deliver data to Amazon S3:
- On the Firehose console, select Create Firehose stream.
- For Source, choose Direct PUT.
- For Destination, select Apache Iceberg Tables.
- Enter “firehose-iceberg-events” as the Firehose stream name.
- Collect the database and table names from your SageMaker Unified Studio project for the next step.
- In the Destination settings, enable Inline parsing for routing information and supply the database and table names from the previous step.
Be sure to enclose the database and table names in double quotes if you want to direct data to a single database and table. Amazon Data Firehose can also route records to different tables based on record content. For further details, refer to the section on routing incoming records to different Iceberg tables.
- Under Buffer hints, reduce the buffer size to 1 MiB and shorten the buffer interval so that records are delivered to the Iceberg table more quickly while you test. Once the stream is active, you can send it test records, as sketched below.
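After the stream becomes active, you can send it a record whose shape matches the firehose_events table. In this walkthrough the Kinesis Data Generator produces the traffic, so the following boto3 sketch is only meant to show what an individual record looks like; the field values are made up for illustration.

import json
from datetime import datetime, timezone

import boto3

firehose = boto3.client("firehose")

# A sample event matching the columns of the firehose_events Iceberg table.
record = {
    "type": {"device": "mobile", "event": "click", "action": "checkout"},
    "customer_id": "c-1001",
    # Sent as an ISO-8601 string; adjust if your delivery configuration
    # expects a different timestamp format.
    "event_timestamp": datetime.now(timezone.utc).isoformat(),
    "region": "us-east-1",
}

firehose.put_record(
    DeliveryStreamName="firehose-iceberg-events",
    Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
)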
With this integration, data professionals can now leverage the full potential of real-time data processing in their workflows.