Customize Your Libraries and Dependencies for Spark and Hive on Amazon EMR Serverless with Tailored Images


Amazon EMR Serverless provides the capability to run popular big data frameworks, such as Apache Spark and Apache Hive, without the need to manage clusters or servers. Many users who operate Spark and Hive applications seek to integrate their own libraries and dependencies into the application runtime. For instance, you might wish to incorporate well-known open-source extensions for Spark or add a bespoke encryption-decryption module that your application requires.

We are thrilled to introduce a new feature that allows you to customize the runtime image used by EMR Serverless by adding the libraries and dependencies your applications need. This enhancement offers several benefits:

  • Maintain a controlled set of libraries that can be reused across all EMR Serverless jobs as part of the EMR Serverless runtime.
  • Add popular extensions to the open-source Spark and Hive frameworks, such as pandas, NumPy, and matplotlib, enhancing the functionality of your EMR Serverless applications.
  • Leverage established CI/CD practices to build, test, and deploy your customized libraries into the EMR Serverless runtime.
  • Implement recognized security measures, including image scanning, ensuring compliance and governance within your organization.
  • Utilize different versions of runtime components (for example, the JDK runtime or the Python SDK runtime) compared to the versions available by default in EMR Serverless.

In this post, we illustrate how to use this new capability.

Overview of the Solution

To take advantage of this feature, you will customize the EMR Serverless base image using Amazon Elastic Container Registry (Amazon ECR), a fully managed container registry that simplifies sharing and deploying container images for your developers. Amazon ECR alleviates the burden of managing your own container repositories or scaling the underlying infrastructure. Once your custom image is uploaded to the container registry, you can specify this image when creating your EMR Serverless applications.
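Once your image is in Amazon ECR, you associate it with an application at creation time. The following is a minimal sketch using the AWS CLI's image configuration option for EMR Serverless; the application name, repository, and tag are placeholders:

# Create a Spark application that points at the custom image
$ aws emr-serverless create-application \
    --name <application-name> \
    --type SPARK \
    --release-label emr-6.9.0 \
    --image-configuration '{"imageUri": "<your AWS account ID>.dkr.ecr.<region>.amazonaws.com/emr-serverless-ci-examples:<tag>"}'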

The following diagram outlines the steps involved in using custom images for your EMR Serverless applications.

In the subsequent sections, we will examine how to use custom images with Amazon EMR Serverless to tackle three prevalent use cases:

  1. Incorporate popular open-source Python libraries into the EMR Serverless runtime image.
  2. Use an alternative or newer version of the Java runtime for the EMR Serverless application.
  3. Install a Prometheus agent and customize the Spark runtime to transmit Spark JMX metrics to Amazon Managed Service for Prometheus, allowing visualization of these metrics in a Grafana dashboard.

General Prerequisites

Before proceeding with the steps below, ensure you have met the following prerequisites for using custom images with EMR Serverless:

  • Create an AWS Identity and Access Management (IAM) role with the necessary permissions for Amazon EMR Serverless applications, Amazon ECR, and Amazon S3, specifically for the aws-bigdata-blog bucket and any S3 bucket in your account where you will store application artifacts.
  • Install or update to the latest version of the AWS Command Line Interface (AWS CLI) and install the Docker service on an Amazon Linux 2 based Amazon Elastic Compute Cloud (Amazon EC2) instance. Make sure to attach the IAM role created in the previous step to this EC2 instance.
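On Amazon Linux 2, Docker can typically be installed as follows (a sketch; verify the package source for your AMI version):

# Install Docker on Amazon Linux 2
$ sudo yum update -y
$ sudo amazon-linux-extras install docker -y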
  • Choose a base EMR Serverless image from the public Amazon ECR repository. Execute the following commands on the EC2 instance with Docker installed to verify you can pull the base image from the public repository:
# Start the docker service if it's not already running
$ sudo service docker start 

# Check if you can pull the latest EMR 6.9.0 runtime base image 
$ sudo docker pull public.ecr.aws/emr-serverless/spark/emr-6.9.0:latest 
  • Log in to Amazon ECR with the following commands and create a repository named emr-serverless-ci-examples, substituting your AWS account ID and Region:
$ sudo aws ecr get-login-password --region <region> | sudo docker login --username AWS --password-stdin <your AWS account ID>.dkr.ecr.<region>.amazonaws.com

$ aws ecr create-repository --repository-name emr-serverless-ci-examples --region <region> 
  • Grant IAM permissions to the EMR Serverless service principal for the Amazon ECR repository. Navigate to the Amazon ECR console, select Permissions under Repositories, choose Edit policy JSON, and enter the following JSON before saving:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Emr Serverless Custom Image Support",
      "Effect": "Allow",
      "Principal": {
        "Service": "emr-serverless.amazonaws.com"
      },
      "Action": [
        "ecr:BatchGetImage",
        "ecr:DescribeImages",
        "ecr:GetDownloadUrlForLayer"
      ]
    }
  ]
}

Ensure that the policy is updated in the Amazon ECR console. For production workloads, consider adding a condition to the repository policy so that only authorized EMR Serverless applications can access the image, as in the sketch that follows.
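One option is to scope the policy to a specific application ARN with the aws:SourceArn condition key. The following command is a sketch; the application ID is a placeholder for your own value:

# Re-apply the repository policy with a condition that limits access
# to a single EMR Serverless application
$ aws ecr set-repository-policy \
    --repository-name emr-serverless-ci-examples \
    --region <region> \
    --policy-text '{
      "Version": "2012-10-17",
      "Statement": [{
        "Sid": "Emr Serverless Custom Image Support",
        "Effect": "Allow",
        "Principal": {"Service": "emr-serverless.amazonaws.com"},
        "Action": ["ecr:BatchGetImage", "ecr:DescribeImages", "ecr:GetDownloadUrlForLayer"],
        "Condition": {
          "StringEquals": {
            "aws:SourceArn": "arn:aws:emr-serverless:<region>:<your AWS account ID>:/applications/<application-id>"
          }
        }
      }]
    }'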

Next, we will create and utilize custom images in our EMR Serverless applications for the three distinct use cases.

Use Case 1: Implement Data Science Applications

A common application of Spark on Amazon EMR is executing data science and machine learning (ML) tasks at scale. For large datasets, SparkML provides a suite of ML algorithms for training models in a distributed manner. However, you often need to run many iterations of simple classifiers for hyperparameter tuning, ensembles, and multi-class solutions over small to medium-sized datasets (roughly 100,000 to 1 million records), and Spark is an excellent engine for running those iterations in parallel. In this scenario, we demonstrate how to use Spark to run multiple iterations of an XGBoost model in order to select the best-performing parameters. Packaging the Python dependencies the application needs (such as xgboost, sk-dist, pandas, and NumPy) into the EMR Serverless image simplifies dependency management.

Prerequisites

Ensure that the EMR Serverless job runtime IAM role has permission to access the S3 bucket where you will store your PySpark file and application logs:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AccessToS3Buckets",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::",
                "arn:aws:s3:::/*"
            ]
        }
    ]
}
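You can attach this policy to the job runtime role with a command along the following lines (a sketch; the role name and policy file name are placeholders for your own values):

$ aws iam put-role-policy \
    --role-name <your-job-runtime-role> \
    --policy-name AccessToS3Buckets \
    --policy-document file://s3-access-policy.json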

Create an Image to Install ML Dependencies

We will create a custom image based on the EMR Serverless base image to install the dependencies the SparkML application needs. On the EC2 instance that runs the Docker daemon, create a new directory named datascience and add a Dockerfile inside it with the following content. Note that EMR Serverless expects the image to run as the hadoop user, so the Dockerfile switches back from root at the end:

FROM public.ecr.aws/emr-serverless/spark/emr-6.9.0:latest

USER root

# Install Python packages
RUN pip3 install boto3 pandas numpy
RUN pip3 install -U scikit-learn==0.23.2 scipy
RUN pip3 install sk-dist
RUN pip3 install xgboost

# EMR Serverless runs the image as the hadoop user,
# so switch back from root before the image is finalized
USER hadoop:hadoop
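
With the Dockerfile in place, build the image and push it to the repository you created earlier. The commands below are a sketch; the local image name and tag are illustrative, and the build path assumes the datascience directory sits under /home/ec2-user:

# Build the custom image from the datascience directory
$ sudo docker build -t local/emr-serverless-ci-ml /home/ec2-user/datascience/ --no-cache --pull

# Tag the image and push it to the Amazon ECR repository
$ sudo docker tag local/emr-serverless-ci-ml:latest <your AWS account ID>.dkr.ecr.<region>.amazonaws.com/emr-serverless-ci-examples:emr-serverless-ci-ml
$ sudo docker push <your AWS account ID>.dkr.ecr.<region>.amazonaws.com/emr-serverless-ci-examples:emr-serverless-ci-ml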

In conclusion, the ability to customize the EMR Serverless runtime image adds significant flexibility to your big data applications: you can maintain a governed, reusable set of libraries, apply established CI/CD and image-scanning practices, and tailor runtime components such as the JDK or Python libraries to your workloads.
