Ten New Visual Transforms Available in AWS Glue Studio


AWS Glue Studio is a user-friendly graphical interface that simplifies the creation, execution, and monitoring of extract, transform, and load (ETL) tasks in AWS Glue. It allows users to visually design data transformation workflows using nodes that represent various data handling steps, which are then automatically translated into code for execution.

Recently, AWS Glue Studio introduced ten additional visual transforms that empower users to create more complex jobs visually, without requiring coding expertise. In this article, we will explore potential use cases that align with typical ETL requirements.

The new transforms featured in this post include: Concatenate, Split String, Array to Columns, Add Current Timestamp, Pivot Rows to Columns, Unpivot Columns to Rows, Lookup, Explode Array or Map Into Columns, Derived Column, and Autobalance Processing.

Solution Overview

For this use case, we have a collection of JSON files detailing stock option transactions. Our goal is to perform several transformations on this data to facilitate easier analysis and generate a summary dataset.

In this dataset, each row corresponds to a trade involving option contracts. Options are financial instruments granting the right—but not the obligation—to purchase or sell shares of stock at a predetermined price (known as the strike price) before a specified expiration date.

Input Data

The data follows this schema:

  • order_id: A unique identifier
  • symbol: A short code representing the corporation that issues the underlying stock shares
  • instrument: The name identifying the specific option being traded
  • currency: The ISO currency code for the price
  • price: The purchase price for each option contract (typically, one contract allows the buying or selling of 100 shares)
  • exchange: The code for the trading venue where the option was executed
  • sold: A list of contracts allocated to fulfill the sell order for a sell transaction
  • bought: A list of contracts allocated to fulfill the buy order for a buy transaction

Here’s a sample of the generated synthetic data:

{"order_id": 1679931512485, "symbol": "AMZN", "instrument": "AMZN MAR 24 23 102 PUT", "currency": "usd", "price": 17.18, "exchange": "EDGX", "bought": [18, 38]}
{"order_id": 1679931512486, "symbol": "BMW.DE", "instrument": "BMW.DE MAR 24 23 96 PUT", "currency": "eur", "price": 2.98, "exchange": "XETR", "bought": [28]}
{"order_id": 1679931512487, "symbol": "BMW.DE", "instrument": "BMW.DE APR 28 23 101 CALL", "currency": "eur", "price": 14.71, "exchange": "XETR", "sold": [9, 59, 54]}
{"order_id": 1679931512489, "symbol": "JPM", "instrument": "JPM JUN 30 23 140 CALL", "currency": "usd", "price": 11.83, "exchange": "EDGX", "bought": [33, 42, 55, 67]}
{"order_id": 1679931512490, "symbol": "SIE.DE", "instrument": "SIE.DE MAR 24 23 149 CALL", "currency": "eur", "price": 13.68, "exchange": "XETR", "bought": [96, 89, 82]}
{"order_id": 1679931512491, "symbol": "NKE", "instrument": "NKE MAR 24 23 112 CALL", "currency": "usd", "price": 3.23, "exchange": "EDGX", "sold": [67]}
{"order_id": 1679931512492, "symbol": "AMZN", "instrument": "AMZN MAY 26 23 95 CALL", "currency": "usd", "price": 11.44, "exchange": "EDGX", "sold": [41, 62, 12]}
{"order_id": 1679931512493, "symbol": "JPM", "instrument": "JPM MAR 24 23 121 PUT", "currency": "usd", "price": 1.0, "exchange": "EDGX", "bought": [61, 34]}
{"order_id": 1679931512494, "symbol": "SAP.DE", "instrument": "SAP.DE MAR 24 23 132 CALL", "currency": "eur", "price": 15.9, "exchange": "XETR", "bought": [69, 33]}
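Notice that each instrument name packs several attributes into a single space-delimited string. As a rough plain-Python sketch (not Glue code; the `parse_instrument` helper name is illustrative), the kind of normalization the Split String and Array To Columns transforms achieve looks like this:

```python
def parse_instrument(instrument):
    # Split "AMZN MAR 24 23 102 PUT" into its component fields
    symbol, month, day, year, strike, option_type = instrument.split(" ")
    return {
        "symbol": symbol,
        "expiration": f"{month} {day} {year}",
        "strike_price": float(strike),
        "option_type": option_type,
    }

print(parse_instrument("AMZN MAR 24 23 102 PUT"))
# {'symbol': 'AMZN', 'expiration': 'MAR 24 23', 'strike_price': 102.0, 'option_type': 'PUT'}
```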

ETL Requirements

This data has several characteristics, often found in legacy systems, that make it harder to use. The ETL requirements are:

  • The instrument names contain valuable information intended for human interpretation; we aim to normalize these into separate columns for simplified analysis.
  • The attributes for bought and sold contracts are mutually exclusive; we plan to consolidate them into a single column with the contract quantities and an additional column indicating whether the contracts were bought or sold in each entry.
  • We want to keep the information about individual contract allocations as separate rows rather than as an array. We could aggregate the numbers, but that would obscure how each order was filled, which is an indicator of market liquidity. We therefore denormalize the data so each row carries a single contract quantity, splitting orders with multiple allocations into distinct rows. In a compressed columnar format, the extra repetition usually adds negligible size once compression is applied, so making the dataset easier to query is a reasonable trade-off.
  • We aim to create a summary table reflecting the volume for each option type (call and put) for each stock, providing insights into market sentiment for individual stocks, and the broader market (greed vs. fear).
  • To facilitate overall trade summaries, we want to provide a grand total for each operation and standardize currencies to US dollars, using an approximate conversion reference.
  • We intend to include the date when these transformations occur, which may be useful, for instance, for referencing when currency conversions were made.
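To make the bought/sold consolidation and the denormalization concrete, here is a plain-Python sketch (not Glue code; the `denormalize` helper name is illustrative) of what those two requirements produce for a single order:

```python
def denormalize(order):
    # The bought/sold arrays are mutually exclusive, so derive a single
    # "action" column and emit one row per contract allocation
    action = "bought" if "bought" in order else "sold"
    base = {k: v for k, v in order.items() if k not in ("bought", "sold")}
    return [{**base, "action": action, "contracts": qty} for qty in order[action]]

order = {"order_id": 1679931512487, "symbol": "BMW.DE",
         "instrument": "BMW.DE APR 28 23 101 CALL", "currency": "eur",
         "price": 14.71, "exchange": "XETR", "sold": [9, 59, 54]}
for row in denormalize(order):
    print(row["contracts"], row["action"])
```

The order with three allocations becomes three rows, each keeping the original order attributes plus the derived `action` and `contracts` columns.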

Based on these requirements, the job will produce two outputs:

  1. A CSV file summarizing the number of contracts for each symbol and type.
  2. A catalog table to maintain a history of the orders after applying the indicated transformations.
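As a rough illustration of the first output (assuming rows that already carry the derived `option_type` and `contracts` columns), the summary pivot amounts to the following sketch; `summarize_volume` is a hypothetical helper, not a Glue API:

```python
from collections import defaultdict

def summarize_volume(rows):
    # Total contract volume per symbol, pivoted into CALL/PUT columns
    totals = defaultdict(lambda: {"CALL": 0, "PUT": 0})
    for r in rows:
        totals[r["symbol"]][r["option_type"]] += r["contracts"]
    return dict(totals)

rows = [
    {"symbol": "AMZN", "option_type": "PUT", "contracts": 18},
    {"symbol": "AMZN", "option_type": "PUT", "contracts": 38},
    {"symbol": "NKE", "option_type": "CALL", "contracts": 67},
]
print(summarize_volume(rows))
# {'AMZN': {'CALL': 0, 'PUT': 56}, 'NKE': {'CALL': 67, 'PUT': 0}}
```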

Prerequisites

To follow along with this use case, you will need your own S3 bucket. For instructions on creating a new bucket, refer to the Creating a bucket guide.

Generating Synthetic Data

To experiment with this dataset, you can synthetically generate the required data. The following Python script can be executed in a Python environment with Boto3 installed and access to Amazon Simple Storage Service (Amazon S3).

To generate the data, follow these steps:

  1. In AWS Glue Studio, create a new job using the Python shell script editor.
  2. Assign a name to the job, select an appropriate role, and specify a name for the Python script on the Job details tab.
  3. Expand Advanced properties in the Job details section and scroll down to Job parameters.
  4. Input a parameter named --bucket and assign it the value of the bucket you’ll use to store the sample data.
  5. Paste the following script into the AWS Glue shell editor:
import argparse
import io
import json
import random
from datetime import datetime

import boto3

# Configuration
parser = argparse.ArgumentParser()
parser.add_argument('--bucket')
args, ignore = parser.parse_known_args()
if not args.bucket:
    raise Exception("This script requires an argument --bucket specifying the S3 bucket where the generated files are stored")

data_bucket = args.bucket
data_path = "transformsblog/inputdata"
samples_per_file = 1000

# Create a single file with synthetic data samples
s3 = boto3.client('s3')
buff = io.BytesIO()

sample_stocks = [("AMZN", 95, "usd"), ("NKE", 120, "usd"), ("JPM", 130, "usd"), ("KO", 130, "usd"),
                 ("BMW.DE", 95, "eur"), ("SIE.DE", 140, "eur"), ("SAP.DE", 115, "eur")]
option_type = ["PUT", "CALL"]
operations = ["sold", "bought"]
dates = ["MAR 24 23", "APR 28 23", "MAY 26 23", "JUN 30 23"]
for i in range(samples_per_file):
    stock = random.choice(sample_stocks)
    symbol = stock[0]
    ref_price = stock[1]
    currency = stock[2]
    strike_price = round(ref_price * 1.1)
    # Build the human-readable instrument name from its components
    option = random.choice(option_type)
    date = random.choice(dates)
    instrument = f"{symbol} {date} {strike_price} {option}"
    # Each order is either fully bought or fully sold, and is filled
    # by one or more contract allocations
    operation = random.choice(operations)
    contracts = [random.randint(1, 100) for _ in range(random.randint(1, 4))]
    order = {
        "order_id": 1679931512485 + i,
        "symbol": symbol,
        "instrument": instrument,
        "currency": currency,
        "price": round(random.uniform(1, 20), 2),
        "exchange": "XETR" if currency == "eur" else "EDGX",
        operation: contracts,
    }
    buff.write(json.dumps(order).encode())
    buff.write(b"\n")

# Upload the generated JSON Lines file to the bucket
s3.put_object(Body=buff.getvalue(), Bucket=data_bucket,
              Key=f"{data_path}/{datetime.now().isoformat()}.json")

  6. Run the job. When it completes, a JSON file with the synthetic data is available in your bucket under the transformsblog/inputdata prefix.

These new transforms expand what you can build visually in AWS Glue Studio, making it a stronger option for streamlining ETL processes without writing code.

