Build Visual ETL Flows in Amazon SageMaker Unified Studio
April 2025: This article has been updated with the latest General Availability (GA) experience.
Amazon SageMaker Unified Studio provides a comprehensive environment for data and AI development within Amazon SageMaker. Through the Unified Studio, teams can collaborate effectively and accelerate their workflows using familiar AWS tools for model development, generative AI, data processing, and SQL analytics. A standout feature is the visual ETL (extract, transform, load) interface, which makes it straightforward for data engineers to create, run, and monitor ETL data integration flows. Users can design flows that move and transform data through an intuitive visual interface, all on serverless compute. You can also generate visual flows from plain-English prompts using generative AI powered by Amazon Q. Visual ETL automatically translates your directed acyclic graph (DAG) into Spark-native scripts, so developers who prefer coding can continue their work seamlessly in notebooks.
This article demonstrates how to construct a low-code and no-code (LCNC) visual ETL flow that streamlines data ingestion and transformation from various sources. The tutorial covers:
- Connecting to multiple data sources
- Executing table joins
- Applying customized filters
- Exporting aggregated data to Amazon Simple Storage Service (Amazon S3)
Additionally, we will explore how generative AI can enhance your LCNC visual ETL development, creating an intuitive workflow that optimizes the entire development process.
Use Case Overview
In this example, we use Amazon SageMaker Unified Studio to build a visual ETL flow. The pipeline reads data from files in Amazon S3, applies transformations, and writes the processed data to an AWS Glue Data Catalog table backed by Amazon S3. We use the venue and event files from the TICKIT dataset for this demonstration.
The TICKIT dataset captures sales activities on the fictional TICKIT website, where users buy and sell tickets for various events like sports games, shows, and concerts. Analysts can leverage this dataset to monitor ticket sales trends, assess seller performance, and identify the most successful events, venues, and seasons in terms of ticket sales.
The workflow joins the venue and event files from the TICKIT dataset, filters the merged data to a specific geographic region, and then aggregates it to compute the event count per venue name. Finally, the transformed data is written to Amazon S3 and registered as a new AWS Glue Data Catalog table.
The following diagram illustrates the architecture:
Prerequisites
To proceed with the instructions, ensure you meet the following prerequisites:
- An active AWS account
- A SageMaker Unified Studio domain
- A SageMaker Unified Studio project configured for data analytics and machine learning
Building a Visual ETL Flow
Follow these steps to create a new visual ETL flow using a sample dataset:
- In the SageMaker Unified Studio console, navigate to the top menu and select Build.
- Under DATA ANALYSIS & INTEGRATION, click on Visual ETL flows.
- Select your project and click Continue.
- Choose Create visual ETL flow.
- For this walkthrough, define the ETL flow manually rather than generating it from an Amazon Q prompt.
- In the top left, click the + icon in the circle. Under Data sources, select Amazon S3. An Amazon S3 source node appears on the canvas.
- Click on the Amazon S3 source node and input the following values:
- S3 URI: s3://aws-blogs-artifacts-public/artifacts/BDB-4798/data/venue.csv
- Format: CSV
- Delimiter: ,
- Multiline: Enabled
- Header: Disabled
- Leave the remaining settings as default.
- Wait for the data preview to load at the bottom of the screen.
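Because Header is disabled, the flow treats every row as data and assigns positional column names (_c0, _c1, and so on), which is why a rename step follows. A minimal Python sketch of that naming convention (the sample venue record here is invented for illustration):

```python
import csv
import io

# A made-up, comma-delimited venue row with no header line.
raw = "1,Capital One Arena,Washington,DC,20356\n"
reader = csv.reader(io.StringIO(raw), delimiter=",")
row = next(reader)

# Mimic the default positional column names assigned when no header is present.
record = {f"_c{i}": value for i, value in enumerate(row)}
print(record)
# {'_c0': '1', '_c1': 'Capital One Arena', '_c2': 'Washington', '_c3': 'DC', '_c4': '20356'}
```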
- Click the + icon next to the Amazon S3 node. Under Transforms, select Rename Columns.
- In the Rename Columns node, click Add new rename pair and enter:
- Current name: _c0; New name: venueid
- Current name: _c1; New name: venuename
- Current name: _c2; New name: venuecity
- Current name: _c3; New name: venuestate
- Current name: _c4; New name: venueseats
- Click the + icon to the right of the Rename Columns node. Under Transforms, select Filter.
- Add a new filter condition. For Key, select venuestate; for Operation, choose ==; for Value, input DC.
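Together, the Rename Columns and Filter nodes amount to a positional-to-descriptive column mapping followed by a row filter on venuestate. A rough Python equivalent (the sample rows are invented, not taken from the dataset):

```python
# The rename pairs from the Rename Columns node as a mapping.
rename_pairs = {
    "_c0": "venueid",
    "_c1": "venuename",
    "_c2": "venuecity",
    "_c3": "venuestate",
    "_c4": "venueseats",
}

rows = [
    {"_c0": "1", "_c1": "Capital One Arena", "_c2": "Washington", "_c3": "DC", "_c4": "20356"},
    {"_c0": "2", "_c1": "Madison Square Garden", "_c2": "New York", "_c3": "NY", "_c4": "20000"},
]

# Apply the rename pairs, then keep only rows where venuestate == 'DC'.
renamed = [{rename_pairs[key]: value for key, value in row.items()} for row in rows]
dc_rows = [row for row in renamed if row["venuestate"] == "DC"]
print(dc_rows)
```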
- Repeat the Amazon S3 source and Rename Columns steps above to add a source node for the events table:
- S3 URI: s3://aws-blogs-artifacts-public/artifacts/BDB-4798/data/events.csv
- Format: CSV
- Delimiter: ,
- Multiline: Enabled
- Header: Disabled
- For the Rename Columns node, add new rename pairs:
- Current name: _c0; New name: eventid
- Current name: _c1; New name: e_venueid
- Current name: _c2; New name: catid
- Current name: _c3; New name: dateid
- Current name: _c4; New name: eventname
- Current name: _c5; New name: starttime
- Click the + icon to the right of the events table's Rename Columns node. Under Transforms, select Join.
- Drag the + icon from the right of the Filter node and drop it on the left of the Join node. Set the Join type to Inner, and configure the join condition: e_venueid for the left data source and venueid for the right data source.
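The Join node performs an inner join, matching e_venueid from the events side against venueid from the filtered venue side; rows with no match on either side are dropped. A rough equivalent in plain Python (sample rows are invented):

```python
# Output of the Filter node: venues in the chosen region.
filtered_venues = [{"venueid": "1", "venuename": "Capital One Arena"}]

events = [
    {"eventid": "100", "e_venueid": "1"},
    {"eventid": "101", "e_venueid": "9"},  # no matching venue, so an inner join drops it
]

# Inner join: keep only event/venue pairs whose key values match.
joined = [
    {**event, **venue}
    for event in events
    for venue in filtered_venues
    if event["e_venueid"] == venue["venueid"]
]
print(joined)
```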
- Click the + icon to the right of the Join node. Under Transforms, select SQL Query, and input the following SQL statement:
SELECT venuename, COUNT(DISTINCT eventid) AS eventid_count FROM {myDataSource} GROUP BY venuename
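In this query, {myDataSource} is a placeholder that the flow resolves to the output of the upstream Join node. The GROUP BY with COUNT(DISTINCT ...) can be sketched in Python as follows (the sample joined rows are invented):

```python
# Invented output of the Join node, including a duplicate event row.
joined = [
    {"venuename": "Capital One Arena", "eventid": "100"},
    {"venuename": "Capital One Arena", "eventid": "100"},  # duplicate; COUNT(DISTINCT) ignores it
    {"venuename": "Capital One Arena", "eventid": "101"},
]

# COUNT(DISTINCT eventid) ... GROUP BY venuename
distinct_ids = {}
for row in joined:
    distinct_ids.setdefault(row["venuename"], set()).add(row["eventid"])
eventid_count = {name: len(ids) for name, ids in distinct_ids.items()}
print(eventid_count)  # {'Capital One Arena': 2}
```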
- Click the + icon to the right of the SQL Query node. Under Data target, select Amazon S3 and configure it with:
- S3 URI: specify your own output location
- Format: Parquet
- Compression: Snappy
- Mode: Overwrite
- Update catalog: True
- Database: Choose your database
- Table: venue_event_agg
At this stage, you should see the complete visual flow. You can now publish it.
- In the top right, select Save to project to store your draft flow. Optionally, you can modify the name and add a description before saving.
Your visual ETL flow is now saved to the project.
Executing the Flow
This section outlines how to run the visual ETL flow you have created.
- In the top right, click Run.
- The run status will display at the bottom of the screen, transitioning from Provisioning to Running, and finally to Finished.
- Wait for the run to complete.