Amazon SageMaker provides multiple avenues for executing distributed data processing tasks utilizing Apache Spark, which is a well-known framework for handling big data. You can run Spark applications interactively within Amazon SageMaker Studio by linking SageMaker Studio notebooks and AWS Glue Interactive Sessions to execute Spark jobs on a serverless cluster. This interactive session option allows you to select either Apache Spark or Ray for efficient processing of extensive datasets, all without the hassle of managing the underlying cluster.
If you require more control over your environment, a pre-built SageMaker Spark container can be employed to execute Spark applications as batch jobs on a fully managed distributed cluster via Amazon SageMaker Processing. This method grants you the flexibility to choose from various instance types (compute optimized, memory optimized, etc.), determine the number of nodes in your cluster, and configure the cluster settings, thus providing greater adaptability for data processing and model training.
Spark applications can also be run by connecting Studio notebooks to Amazon EMR clusters or by deploying your own Spark cluster on Amazon Elastic Compute Cloud (Amazon EC2). Each of these methods can generate and store Spark event logs, which you can analyze through the web-based Spark UI. The Spark UI is served by a Spark History Server, which lets you monitor the progress of Spark applications, track resource usage, and debug errors.
In this article, we will outline how to install and run the Spark History Server on SageMaker Studio and access the Spark UI directly from the SageMaker Studio IDE, allowing you to analyze Spark logs generated by various AWS services (AWS Glue Interactive Sessions, SageMaker Processing jobs, and Amazon EMR) and stored in an Amazon Simple Storage Service (Amazon S3) bucket.
Solution Overview
This solution integrates Spark History Server into the Jupyter Server application within SageMaker Studio. It enables users to access Spark logs directly from the SageMaker Studio IDE. The integrated Spark History Server supports the following functionalities:
- Accessing logs generated by SageMaker Processing Spark jobs
- Accessing logs produced by AWS Glue Spark applications
- Accessing logs created by self-managed Spark clusters and Amazon EMR
A command-line interface (CLI) tool called sm-spark-cli is also available to facilitate interaction with the Spark UI from the SageMaker Studio terminal. This CLI allows you to manage the Spark History Server seamlessly within SageMaker Studio.
Installation Steps for Spark UI in a SageMaker Studio Domain
To host the Spark UI on SageMaker Studio, follow these steps:
- Open the System terminal from the SageMaker Studio launcher.
- Execute the following commands in the terminal:
curl -LO https://github.com/aws-samples/amazon-sagemaker-spark-ui/releases/download/v0.1.0/amazon-sagemaker-spark-ui-0.1.0.tar.gz
tar -xvzf amazon-sagemaker-spark-ui-0.1.0.tar.gz
cd amazon-sagemaker-spark-ui-0.1.0/install-scripts
chmod +x install-history-server.sh
./install-history-server.sh
These commands will take a few seconds to complete.
After the installation succeeds, start the Spark UI with sm-spark-cli by running the following command, then open it in a web browser:
sm-spark-cli start s3://DOC-EXAMPLE-BUCKET/<SPARK_EVENT_LOGS_LOCATION>
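The start command above can be wrapped in a small script so that the bucket and prefix are assembled consistently. The following is a minimal sketch, not part of the sm-spark-cli tool: the helper name build_event_log_uri and the example prefix spark-events are hypothetical, and the script only calls sm-spark-cli when it finds it on the PATH.

```shell
# Hypothetical helper (not part of sm-spark-cli): assemble the S3 URI
# that is passed to `sm-spark-cli start`.
build_event_log_uri() {
    bucket=$1
    prefix=$2
    if [ -z "$bucket" ]; then
        echo "error: bucket name is required" >&2
        return 1
    fi
    # Strip one leading and one trailing slash from the prefix, if present.
    prefix=${prefix#/}
    prefix=${prefix%/}
    if [ -n "$prefix" ]; then
        printf 's3://%s/%s\n' "$bucket" "$prefix"
    else
        printf 's3://%s\n' "$bucket"
    fi
}

# Example values; replace them with your own bucket and log prefix.
EVENT_LOG_URI=$(build_event_log_uri "DOC-EXAMPLE-BUCKET" "/spark-events/")
echo "Spark event logs: $EVENT_LOG_URI"

# Start the Spark UI only when the CLI has been installed.
if command -v sm-spark-cli >/dev/null 2>&1; then
    sm-spark-cli start "$EVENT_LOG_URI"
else
    echo "sm-spark-cli not found; run the installation steps first" >&2
fi
```

Keeping the URI in one variable makes it easier to reuse the same location when you configure the Spark applications that write the logs.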
The S3 location for the event logs generated by SageMaker Processing, AWS Glue, or Amazon EMR can be specified when executing Spark applications. For SageMaker Studio notebooks and AWS Glue Interactive Sessions, the Spark event log location can be configured directly within the notebook using the sparkmagic kernel, which offers tools for interacting with remote Spark clusters through notebooks.
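As a rough illustration of what that configuration can look like, the sketch below prints two fragments for a given log location. The helper names are hypothetical. The first fragment assumes an AWS Glue Interactive Session configured through a %%configure cell with the AWS Glue job parameters --enable-spark-ui and --spark-event-logs-path. The second assumes a Spark cluster configured through the standard spark.eventLog.enabled and spark.eventLog.dir properties. Whether the URI scheme should be s3:// or s3a:// depends on your platform, so treat the value as a placeholder.

```shell
# Hypothetical helper: print a %%configure notebook cell that turns on
# Spark event logging for an AWS Glue Interactive Session.
glue_spark_ui_configure_cell() {
    log_uri=$1
    cat <<EOF
%%configure
{
  "--enable-spark-ui": "true",
  "--spark-event-logs-path": "$log_uri"
}
EOF
}

# Hypothetical helper: print spark-defaults style properties that turn on
# event logging for a self-managed or Amazon EMR Spark cluster.
spark_event_log_properties() {
    log_uri=$1
    printf 'spark.eventLog.enabled true\n'
    printf 'spark.eventLog.dir %s\n' "$log_uri"
}

# Example log location; replace it with your own bucket and prefix.
LOG_URI="s3://DOC-EXAMPLE-BUCKET/spark-events"
glue_spark_ui_configure_cell "$LOG_URI"
spark_event_log_properties "$LOG_URI"
```

Paste the first fragment into a notebook cell before the session starts, or add the second to the Spark configuration of your cluster. Either way, pass the same location to sm-spark-cli start.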
For more information about SageMaker Processing, AWS Glue Interactive Sessions, and Amazon EMR, refer to the AWS documentation for each service.
Select the URL generated by the command to open the Spark UI. To check the status of the Spark History Server, run the sm-spark-cli status command in the SageMaker Studio terminal.
Automating Spark UI Installation for Users in a SageMaker Studio Domain
As an IT administrator, you can automate the installation process for SageMaker Studio users through a lifecycle configuration. This can be applied to all user profiles under a specific SageMaker Studio domain or targeted profiles. For detailed instructions, see Customize Amazon SageMaker Studio using Lifecycle Configurations.
To create a lifecycle configuration from the install-history-server.sh script and attach it to an existing SageMaker Studio domain, run the following commands from a terminal with the AWS Command Line Interface (AWS CLI) configured, replacing the placeholders in braces with your own values:
curl -LO https://github.com/aws-samples/amazon-sagemaker-spark-ui/releases/download/v0.1.0/amazon-sagemaker-spark-ui-0.1.0.tar.gz
tar -xvzf amazon-sagemaker-spark-ui-0.1.0.tar.gz
cd amazon-sagemaker-spark-ui-0.1.0/install-scripts
LCC_CONTENT=$(openssl base64 -A -in install-history-server.sh)
aws sagemaker create-studio-lifecycle-config \
    --studio-lifecycle-config-name install-spark-ui-on-jupyterserver \
    --studio-lifecycle-config-content "$LCC_CONTENT" \
    --studio-lifecycle-config-app-type JupyterServer \
    --query 'StudioLifecycleConfigArn'
aws sagemaker update-domain \
    --region {YOUR_AWS_REGION} \
    --domain-id {YOUR_STUDIO_DOMAIN_ID} \
    --default-user-settings \
    '{
      "JupyterServerAppSettings": {
        "DefaultResourceSpec": {
          "LifecycleConfigArn": "arn:aws:sagemaker:{YOUR_AWS_REGION}:{YOUR_AWS_ACCOUNT_ID}:studio-lifecycle-config/install-spark-ui-on-jupyterserver",
          "InstanceType": "system"
        },
        "LifecycleConfigArns": [
          "arn:aws:sagemaker:{YOUR_AWS_REGION}:{YOUR_AWS_ACCOUNT_ID}:studio-lifecycle-config/install-spark-ui-on-jupyterserver"
        ]
      }
    }'
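One way to avoid typing the lifecycle configuration ARN by hand is to capture the value returned by create-studio-lifecycle-config and reuse it when updating the domain. The sketch below illustrates that idea under some assumptions: the helper build_default_user_settings and the APPLY_LCC switch are hypothetical, AWS_REGION and STUDIO_DOMAIN_ID are environment variables you would set yourself, and the account ID in the dry-run ARN is a placeholder. Without APPLY_LCC=1 the script only prints the JSON it would send.

```shell
# Hypothetical helper: build the --default-user-settings JSON document
# for a given lifecycle configuration ARN.
build_default_user_settings() {
    lcc_arn=$1
    cat <<EOF
{
  "JupyterServerAppSettings": {
    "DefaultResourceSpec": {
      "LifecycleConfigArn": "$lcc_arn",
      "InstanceType": "system"
    },
    "LifecycleConfigArns": ["$lcc_arn"]
  }
}
EOF
}

# APPLY_LCC is an assumed opt-in switch so that nothing is changed in
# your AWS account unless you explicitly ask for it.
if [ "${APPLY_LCC:-0}" = "1" ]; then
    # Capture the ARN as plain text instead of a quoted JSON string.
    LCC_ARN=$(aws sagemaker create-studio-lifecycle-config \
        --studio-lifecycle-config-name install-spark-ui-on-jupyterserver \
        --studio-lifecycle-config-content "$(openssl base64 -A -in install-history-server.sh)" \
        --studio-lifecycle-config-app-type JupyterServer \
        --query 'StudioLifecycleConfigArn' \
        --output text)
    aws sagemaker update-domain \
        --region "$AWS_REGION" \
        --domain-id "$STUDIO_DOMAIN_ID" \
        --default-user-settings "$(build_default_user_settings "$LCC_ARN")"
else
    # Dry run: print the settings for a placeholder ARN.
    build_default_user_settings "arn:aws:sagemaker:us-east-1:111122223333:studio-lifecycle-config/install-spark-ui-on-jupyterserver"
fi
```

Run the script from the install-scripts directory with APPLY_LCC=1 once the dry-run output looks right.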
After the Jupyter Server restarts, the Spark UI and the sm-spark-cli will be accessible in your SageMaker Studio environment.
Cleaning Up
In this section, we describe how to remove the Spark UI from a SageMaker Studio domain, either through manual or automated means.
Manual Uninstallation of Spark UI
To manually uninstall the Spark UI in SageMaker Studio, perform the following steps:
- Open the System terminal in the SageMaker Studio launcher.
- Execute the commands below:
cd amazon-sagemaker-spark-ui-0.1.0/install-scripts
chmod +x uninstall-history-server.sh
./uninstall-history-server.sh
This will ensure the Spark UI is properly removed from your environment.