Data lakes have become a pivotal element in the industry for managing essential business information. The primary purpose of a data lake is to store all types of data—ranging from unprocessed raw data to preprocessed and postprocessed forms. These data lakes accommodate both structured and unstructured formats. By providing a centralized repository, modern big data applications can efficiently load, transform, and process various data types. This flexibility allows data to be stored in its original form, eliminating the need for prior structuring or transformation. Most importantly, data lakes facilitate regulated access to data for diverse analytics and machine learning (ML) processes, aiding in improved decision-making.
Numerous vendors have developed data lake architectures, including AWS Lake Formation. Furthermore, open-source solutions enable organizations to easily access, load, and share their data. One of the popular options available for data storage in the AWS Cloud is Delta Lake. The Delta Lake library supports reading and writing in the open-source Apache Parquet file format while offering features such as ACID transactions, scalable metadata handling, and seamless integration of streaming and batch data processing. It provides a storage layer API that can be used to store data atop object storage solutions like Amazon Simple Storage Service (Amazon S3).
Data serves as the backbone of ML; without access to quality historical data—often found in data lakes—training traditional supervised models becomes impossible. Amazon SageMaker is a fully managed service that acts as a versatile platform for developing ML solutions, equipped with specialized tools for data ingestion, processing, model training, and hosting. Apache Spark is a robust tool for modern data processing, featuring a comprehensive API for loading and manipulating data. SageMaker can handle data at petabyte scale using Spark, enabling ML workflows to operate in a highly distributed manner. This article discusses how to leverage the features provided by Delta Lake through Amazon SageMaker Studio.
Solution Overview
In this article, we will demonstrate how to utilize SageMaker Studio notebooks to effortlessly load and transform data stored in Delta Lake format. By using a standard Jupyter notebook, we will execute Apache Spark commands to read and write table data in both CSV and Parquet formats. The open-source library delta-spark allows direct access to this data in its native format, facilitating various API operations for data transformations, schema changes, and time-travel queries to retrieve specific data versions.
In our example notebook, we will load raw data into a Spark DataFrame, create a Delta table, execute queries, display audit history, demonstrate schema evolution, and illustrate methods for updating table data. The DataFrame API from the PySpark library will be employed to ingest and transform dataset attributes. The delta-spark library will handle reading and writing data in Delta Lake format and modifying the underlying table structure, known as the schema.
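To give a flavor of what those delta-spark operations look like, here is a hedged sketch of the audit-history and time-travel calls; it assumes a Delta-enabled SparkSession named spark (created later in the notebook) and a placeholder table path rather than the notebook's actual location:

from delta.tables import DeltaTable

# Placeholder path to a table already written in Delta Lake format
delta_table_uri = "s3a://<your-default-bucket>/delta-lake-demo/loans-delta/"

# Display the table's audit history (one row per committed version)
delta_table = DeltaTable.forPath(spark, delta_table_uri)
delta_table.history().show()

# Time travel: read the table as it existed at version 0
loans_v0 = (spark.read
    .format("delta")
    .option("versionAsOf", 0)
    .load(delta_table_uri))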
We will utilize SageMaker Studio, which is the integrated development environment (IDE) provided by SageMaker, to create and execute Python code within a Jupyter notebook. A GitHub repository has been established containing this notebook and additional resources for you to run this example independently. The notebook clearly outlines how to interact with data stored in Delta Lake format, allowing for in-place table access without the necessity of duplicating data across various datastores.
For our demonstration, we use a publicly available dataset from Lending Club that contains customer loan data. We downloaded the accepted data file (accepted_2007_to_2018Q4.csv.gz) and selected a subset of the original attributes. This dataset is available under the Creative Commons (CC0) license.
Prerequisites
Before using the delta-spark functionality, you need to install a few dependencies in your Studio environment, which runs as a Dockerized container accessed through a Jupyter Gateway app:
- OpenJDK for Java access and related libraries
- PySpark (Spark for Python) library
- Delta Spark open-source library
You can install these libraries using either conda or pip; they are publicly available on conda-forge, PyPI, or Maven repositories. This notebook is designed to run within SageMaker Studio. After you launch the notebook, make sure you select the Python 3 (Data Science) kernel type. We recommend using an instance type with at least 16 GB of RAM (such as ml.g4dn.xlarge) so the PySpark commands run faster. Use the following commands, which constitute the first few cells of the notebook, to install the necessary dependencies:
%conda install openjdk -q -y
%pip install pyspark==3.2.0
%pip install delta-spark==1.1.0
%pip install -U "sagemaker>2.72"
Once the installation commands are executed, we can proceed to implement the core logic in the notebook.
Implementing the Solution
To execute Apache Spark commands, we first need to create a SparkSession object. After the necessary import statements, we configure the SparkSession with additional configuration parameters. The parameter spark.jars.packages lists the extra libraries Spark needs to run Delta commands. The initial lines of code that follow assemble a list of these packages using standard Maven coordinates (groupId:artifactId:version) and pass it to the SparkSession:
from pyspark.sql import SparkSession

# Maven coordinates for the Delta Lake and Hadoop S3 support packages
pkg_list = []
pkg_list.append("io.delta:delta-core_2.12:1.1.0")
pkg_list.append("org.apache.hadoop:hadoop-aws:3.2.2")
packages = ",".join(pkg_list)
print("packages: " + packages)

spark = (SparkSession
    .builder
    .appName("PySparkApp")
    # Pull in the extra JARs listed above
    .config("spark.jars.packages", packages)
    # Enable the Delta Lake SQL extensions and catalog
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    # Use the Studio container's IAM credentials for s3a access
    .config("fs.s3a.aws.credentials.provider",
            "com.amazonaws.auth.ContainerCredentialsProvider")
    .getOrCreate())

sc = spark.sparkContext
print("Spark version: " + str(sc.version))
Next, we will upload a file containing a subset of the Lending Club consumer loans dataset to our default S3 bucket. Because the original dataset is quite large (over 600 MB), we provide a single representative file (2.6 MB) for this notebook's use. PySpark reads from Amazon S3 through the Hadoop s3a filesystem connector, so we change each native S3 URI from the s3 protocol to s3a throughout this notebook.
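As a minimal illustration (the bucket and key names below are placeholders, not taken from the original notebook), the conversion is a simple prefix replacement:

# Hypothetical URI; the actual bucket and key come from the notebook setup
raw_csv_uri = "s3://<your-default-bucket>/delta-lake-demo/loans_subset.csv"
s3a_raw_csv = raw_csv_uri.replace("s3://", "s3a://", 1)
print(s3a_raw_csv)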
We use Spark to read the raw data (with options for either CSV or Parquet files) with the following code, which returns a Spark DataFrame named loans_df:
loans_df = spark.read.csv(s3a_raw_csv, header=True)
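The notebook then writes this DataFrame out in Delta Lake format, which creates the Delta table used in the rest of the example. A minimal sketch of that step might look like the following; the destination path is a placeholder rather than the notebook's actual location:

# Hypothetical s3a destination for the Delta table
s3a_delta_table_uri = "s3a://<your-default-bucket>/delta-lake-demo/loans-delta/"

(loans_df.write
    .format("delta")
    .mode("overwrite")
    .save(s3a_delta_table_uri))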