We are excited to announce that AWS Glue now supports Scala scripts for ETL (extract, transform, and load) jobs. Scala fans can rejoice: they have one more powerful tool for their data management needs. As the native language of Apache Spark, the engine that powers AWS Glue's data transformations, Scala offers several advantages.
The first advantage of Scala over Python is speed. Scala is faster for custom transformations that do a lot of heavy processing, because there is no need to move data between Python and Apache Spark's Scala runtime, the Java Virtual Machine (JVM). Calling functions in external Java class libraries is also more straightforward from Scala: the two languages compile to the same bytecode, so no data structure conversions are required.
To demonstrate these advantages, we will explore an example that analyzes a recent sample of the GitHub public timeline, sourced from the GitHub archive. This archive records over 35 different event types, including commits, forks, issues, and comments.
In this post, we will develop a Scala script that identifies particularly negative issues within the timeline. This script will extract issue events, analyze their titles using sentiment prediction functions from the Stanford CoreNLP libraries, and highlight the most negative issues.
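To make this concrete, here is a minimal sketch of the kind of sentiment helper the script relies on. It is written against the Stanford CoreNLP 3.8.0 API; the object name SentimentSketch and the scoring convention (averaging the per-sentence classes, which range from 0 for very negative to 4 for very positive) are choices made for this illustration, not part of the AWS Glue API.

import java.util.Properties
import edu.stanford.nlp.ling.CoreAnnotations
import edu.stanford.nlp.neural.rnn.RNNCoreAnnotations
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}
import edu.stanford.nlp.sentiment.SentimentCoreAnnotations
import scala.collection.JavaConverters._

object SentimentSketch {
  // Build the pipeline once per JVM; the parse and sentiment models are expensive to load.
  lazy val pipeline: StanfordCoreNLP = {
    val props = new Properties()
    props.setProperty("annotators", "tokenize, ssplit, parse, sentiment")
    new StanfordCoreNLP(props)
  }

  // Average sentiment class over the sentences in a snippet of text:
  // 0 = very negative ... 4 = very positive, with 2 (neutral) as the empty-text fallback.
  def sentiment(text: String): Double = {
    val annotation: Annotation = pipeline.process(text)
    val sentences = annotation.get(classOf[CoreAnnotations.SentencesAnnotation]).asScala
    val scores = sentences.map { sentence =>
      val tree = sentence.get(classOf[SentimentCoreAnnotations.SentimentAnnotatedTree])
      RNNCoreAnnotations.getPredictedClass(tree).toDouble
    }
    if (scores.isEmpty) 2.0 else scores.sum / scores.size
  }
}

Because this is plain Scala calling a Java library directly, the same code can run unchanged in an AWS Glue job once the CoreNLP jars are on the classpath, which is exactly what the setup steps below arrange.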
Getting Started
Before diving into script writing, we first utilize AWS Glue crawlers to understand the data’s structure and characteristics. We also set up a development endpoint and connect an Apache Zeppelin notebook to interactively explore the data and author our script.
Crawling the Data
The dataset used in this example has been downloaded from the GitHub archive website and stored in our sample dataset bucket on Amazon S3, located at:
s3://aws-glue-datasets-<region>/examples/scala-blog/githubarchive/data/
To proceed, replace <region> with your specific working region, for instance, us-east-1. Crawl this folder and save the results into a database named githubarchive in the AWS Glue Data Catalog, as outlined in the AWS Glue Developer Guide. The folder contains 12 hours of timeline data from January 22, 2017, organized by year, month, and day.
Once complete, navigate to the table named data in the githubarchive database using the AWS Glue console. You will see eight top-level columns common to each event type, along with three partition columns corresponding to year, month, and day.
When you select the payload column, you’ll notice its complex schema—reflecting the union of payloads from the event types present in the crawled data. Keep in mind that the schema generated by crawlers is only a subset of the actual schema since they sample only a portion of the data.
Setting Up the Library, Development Endpoint, and Notebook
Next, you will need to download and configure the libraries required for sentiment estimation in text snippets. The Stanford CoreNLP libraries include various human language processing tools, such as sentiment prediction.
Download the Stanford CoreNLP libraries, unzip the .zip file, and you should see a directory filled with jar files. For this example, you will need the following jars:
- stanford-corenlp-3.8.0.jar
- stanford-corenlp-3.8.0-models.jar
- ejml-0.23.jar
Upload these files to an Amazon S3 path that AWS Glue can access so that it can load these libraries when needed. In this example, they are located at s3://glue-sample-other/corenlp/.
Development endpoints provide static Spark-based environments that can serve as the backend for data exploration. You can attach notebooks to these endpoints for interactive command execution and data analysis. These endpoints use the same configuration as AWS Glue’s job execution system, ensuring that commands and scripts function identically when registered and run as jobs in AWS Glue.
To set up an endpoint and a Zeppelin notebook, follow the instructions in the AWS Glue Developer Guide. Ensure you specify the jar locations in the Dependent jars path as a comma-separated list when creating the endpoint; otherwise, the libraries won’t load properly.
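For example, assuming the jars were uploaded to the sample path shown earlier, the Dependent jars path field would contain a single comma-separated value like this (no spaces between the entries):

s3://glue-sample-other/corenlp/stanford-corenlp-3.8.0.jar,s3://glue-sample-other/corenlp/stanford-corenlp-3.8.0-models.jar,s3://glue-sample-other/corenlp/ejml-0.23.jar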
After setting up the notebook server, access the Zeppelin notebook by selecting Dev Endpoints in the left navigation pane of the AWS Glue console. Choose your created endpoint, then click the Notebook Server URL, which directs you to the Zeppelin server. Log in using the notebook username and password specified during the notebook creation. Finally, create a new note to experiment with this example.
Each notebook is a collection of paragraphs, and each paragraph contains a sequence of commands and their output. In addition, if you set up the Zeppelin server through the console, each notebook comes with several pre-configured interpreters, including the (Python-based) pyspark and (Scala-based) spark interpreters, with pyspark as the default. Because of that default, the first paragraph below starts with the %spark prefix; for brevity, the remaining snippets in this post omit it.
Working with the Data
In this section, we will utilize AWS Glue extensions to Spark to manipulate the dataset. We will examine the actual schema of the data and filter for the pertinent event types for our analysis.
Start with some boilerplate code to import the necessary libraries:
%spark
import com.amazonaws.services.glue.DynamicRecord
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.types._
import org.apache.spark.SparkContext
Next, create the Spark and AWS Glue contexts required for data manipulation:
@transient val spark: SparkContext = SparkContext.getOrCreate()
val glueContext: GlueContext = new GlueContext(spark)
The @transient annotation is necessary on the SparkContext when using Zeppelin; otherwise, you'll encounter serialization errors when executing commands.
Dynamic Frames
This section illustrates how to create a dynamic frame that encapsulates the GitHub records from the table you previously crawled. A dynamic frame serves as the foundational data structure in AWS Glue scripts. It resembles an Apache Spark dataframe but is specifically optimized for data cleaning and transformation tasks. Dynamic frames are particularly suitable for representing semi-structured datasets like the GitHub timeline.
A dynamic frame is essentially a collection of dynamic records. In Spark terminology, it is akin to a resilient distributed dataset (RDD) of DynamicRecords. Each dynamic record is self-describing, encoding its columns and types, allowing for unique schemas across records within the dynamic frame. This feature proves convenient and often more efficient for datasets like the GitHub timeline, where payloads can significantly vary between event types.
The following code creates a dynamic frame named github_events from your table:
val github_events = glueContext
.getCatalogSource(database = "githubarchive", tableName = "data")
.getDynamicFrame()
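With the dynamic frame in hand, you can inspect the schema that AWS Glue computes from the records themselves and start narrowing the data down to issue events, as planned above. The following is a brief sketch of that next step using the printSchema and filter methods of the Glue Scala DynamicFrame API; the val name issue_events and the "IssuesEvent" string reflect this example's GitHub data rather than anything required by the API.

// Print the schema computed from the data itself (a superset of the crawler's sampled schema).
github_events.printSchema()

// Keep only issue events. Each DynamicRecord is self-describing, so the
// filter can look up the record's "type" field directly.
val issue_events = github_events.filter(event => {
  event.getField("type").exists(_ == "IssuesEvent")
})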