Amazon Onboarding with Learning Manager Chanci Turner


When you set up a cluster, Amazon EMR lets you select the applications that run on it. But what if you want to deploy your own custom application? This article walks through building a custom application for EMR with Apache Bigtop, focusing on EMR releases 4.x and later. Because EMR nodes are built on the Amazon Linux AMI, the deployment uses RPM packages, with Elasticsearch as the sample application.

Understanding Apache Bigtop

Apache Bigtop is a community-driven project for packaging big data components such as Hadoop, HBase, and Spark. It supports Linux packaging systems such as RPM and Deb, and it uses Puppet to deploy and configure the packaged applications on clusters.

Step-by-Step Guide

The diagram below illustrates the process for creating a Bigtop package.

To build a Bigtop package for EMR, follow these steps:

Launch a development EMR cluster.
Clone the Bigtop public repository.
Add the application definition to bigtop.bom.
Create directories and configuration files for the application.
Create an RPM package.
Establish a Yum repository.
Move the output repository to S3 for accessibility on any new cluster where the application needs to be installed.
Test the application.
Develop a bootstrap script.
Start an EMR cluster using the bootstrap script.

Start by launching an EMR cluster dedicated to development. This setup gives you the tools needed to build and test the Bigtop application, including Maven and Gradle.
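After logging in to the development node, a quick check along these lines can confirm that the build tools are on the PATH. This is a minimal sketch: the function name check_tools is hypothetical, and the tool list in the usage comment (mvn, gradle, git) is an assumption based on the steps in this article.

```shell
# Report whether each named tool is available on the PATH.
# Returns 0 if all tools are found, 1 if any is missing.
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found:   $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# Example usage on the development node (assumed tool list):
#   check_tools mvn gradle git
```

A non-zero return value makes the function easy to use as a guard at the top of a longer build script.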

Launching a Development EMR Cluster

Use the AWS CLI to launch the development cluster:

aws emr create-cluster --name "EMR_Bigtop_Dev" \
  --release-label emr-4.7.2 \
  --instance-type m3.xlarge \
  --instance-count 1 \
  --ec2-attributes KeyName=<YOUR-KEY-PAIR> \
  --log-uri s3://<YOUR-BUCKET>/ \
  --no-auto-terminate \
  --use-default-roles \
  --bootstrap-actions Name="Install EMR DEV Tools",Path=s3://us-west-2.awssupportdatasvcs.com/bootstrap-actions/EMR_Dev/setup_EMR_Dev.sh
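The create-cluster command prints a cluster ID, and the next step needs the master node's address. The following is a minimal sketch, assuming the AWS CLI is already configured with credentials and a region; the function name master_dns and the cluster ID in the usage comment are hypothetical placeholders.

```shell
# Wait until the cluster is running, then print the master node's public DNS name.
# $1 is the cluster ID printed by "aws emr create-cluster".
master_dns() {
  cluster_id="$1"
  aws emr wait cluster-running --cluster-id "$cluster_id" &&
    aws emr describe-cluster --cluster-id "$cluster_id" \
      --query 'Cluster.MasterPublicDnsName' --output text
}

# Example usage (placeholder cluster ID and key file):
#   ssh -i <YOUR-KEY-PAIR>.pem hadoop@"$(master_dns j-XXXXXXXXXXXXX)"
```

The wait subcommand polls until the cluster reaches the running state, so the DNS lookup is not attempted before the master node exists.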

Cloning the Bigtop Public Repository

Once the cluster is operational, SSH into the EMR Bigtop dev master node and clone the Bigtop public repository:

git clone https://github.com/apache/bigtop.git

Adding the Application Definition to bigtop.bom

In the directory created by the clone command (/home/hadoop/bigtop/), locate the file named bigtop.bom. This file contains definitions for all applications available in the current Bigtop version. Within the components section, append an ‘elasticsearch’ entry as follows:

'elasticsearch' {
  name    = 'elasticsearch'
  relNotes = 'Search and Analytics engine'
  version { base = '1.6.0'; pkg = base; release = 1 }
  tarball { destination = "$name-${version.base}.tar.gz"
            source      = "v${version.base}.zip" }
  url     { site = "https://github.com/elastic/elasticsearch/archive"
            archive = site }
}

This configuration details the application’s name, version, tarball destination, and source URL.
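To see what the placeholders in this entry resolve to, the substitution can be reproduced by hand. The sketch below mirrors the values from the entry in shell variables; the variable names are illustrative, and joining the archive URL and the source file name with a slash is an assumption about how Bigtop forms the download URL.

```shell
# Values copied from the 'elasticsearch' entry in bigtop.bom.
name="elasticsearch"
version_base="1.6.0"
site="https://github.com/elastic/elasticsearch/archive"

# tarball.destination: "$name-${version.base}.tar.gz"
destination="${name}-${version_base}.tar.gz"

# tarball.source: "v${version.base}.zip"
source_file="v${version_base}.zip"

# Assumed download URL: the archive URL joined with the source file name.
download_url="${site}/${source_file}"

echo "destination:  ${destination}"
echo "download URL: ${download_url}"
```

Checking that the resulting URL actually exists before running a build can save a long wait on a failed download.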

Testing the Repository

To verify that Gradle and all necessary tools for building a Bigtop application are installed, execute the following command:

gradle tasks | grep elasticsearch

The first run can take a while. When it finishes, the output should list the Gradle tasks defined for Elasticsearch, such as elasticsearch-rpm.

Creating Directories and Configuration Files for the Application

Deploying an application for Bigtop involves two primary tasks: creating RPM packages and Puppet scripts.

Creating RPM Packages for the Application

This example uses a customized RPM SPEC file for Elasticsearch. If the application you want to package already provides an RPM, you can adapt its SPEC file for Bigtop; otherwise, you need to write a SPEC file from scratch. The default directory for these files is:

bigtop-packages/src/rpm/<application-name>/SPECS

Common scripts executed during the package building process include:

do-component-build: This contains environment configuration and build commands used when creating a package. For example: mvn clean install -DskipTests -Dhadoop.version=$HADOOP_VERSION "$@"
install-<application-name>.sh: This script defines the package directory structure and how files are distributed.
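As an illustration of what an install script does, the sketch below copies build output into a package directory tree. It is a hypothetical, simplified example rather than the script used for Elasticsearch: the function name install_layout, its arguments, and the usr/lib and etc target paths are assumptions chosen for illustration.

```shell
# Hypothetical sketch of the core of an install-<application-name>.sh script.
#   $1: directory containing the build output
#   $2: staging root for the package (the "prefix")
#   $3: application name
install_layout() {
  build_dir="$1"
  prefix="$2"
  app="$3"

  # Assumed layout: binaries and libraries under usr/lib, configuration under etc.
  lib_dir="${prefix}/usr/lib/${app}"
  conf_dir="${prefix}/etc/${app}/conf"

  mkdir -p "$lib_dir" "$conf_dir"

  # Copy the contents of the build directory into the library directory.
  cp -r "${build_dir}/." "${lib_dir}/"
}

# Example usage (placeholder paths):
#   install_layout ./build /tmp/stage elasticsearch
```

Staging files under a prefix instead of writing to the real filesystem root lets the SPEC file collect them into the package without touching the build host.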

For comprehensive guidance, refer to the Fedora documentation on creating RPM packages or, if you’re just beginning, check out their tutorial on creating a GNU Hello RPM package.

Developing the Puppet Scripts

Puppet manages the installation and configuration processes for the application. Each application has a main init.pp script where you declare installation procedures, configuration file population, and service management, among other tasks. The default directory for the init.pp script is:

bigtop-deploy/puppet/modules/<application-name>/manifests/

The ‘templates’ directory is also central to the Puppet module structure. Templates combine code and data to generate the configuration files that Puppet deploys. The default location for templates is:

bigtop-deploy/puppet/modules/<application-name>/templates/

For more information on Puppet templates, see the documentation, and for beginners, explore Puppet Hello World.
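When packaging an application of your own, the default directories named above have to exist in the Bigtop checkout. The following sketch creates them for a given application name; the function name scaffold_bigtop_app is hypothetical, and it creates only the three directories mentioned in this article, not the files inside them.

```shell
# Create the default Bigtop directories for a new application.
#   $1: path to the Bigtop checkout
#   $2: application name
scaffold_bigtop_app() {
  repo_root="$1"
  app="$2"
  mkdir -p \
    "${repo_root}/bigtop-packages/src/rpm/${app}/SPECS" \
    "${repo_root}/bigtop-deploy/puppet/modules/${app}/manifests" \
    "${repo_root}/bigtop-deploy/puppet/modules/${app}/templates"
}

# Example usage (placeholder application name):
#   scaffold_bigtop_app ~/bigtop myapp
```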

Establishing the File and Directory Structure

For this example, the required files and directories are provided in the AWS Big Data Blog repository. Clone it with the following commands:

cd ~
git clone https://github.com/awslabs/aws-big-data-blog.git

After cloning the necessary structure, copy it to the local Bigtop repository to build the application from there:

cd aws-big-data-blog/aws-blog-bigtop-application-emr/
cp -r bigtop-packages/* ~/bigtop/bigtop-packages/
cp -r bigtop-deploy/* ~/bigtop/bigtop-deploy/

Creating an RPM Package for the New Application

With all configuration files in place, run the command to build the new application. This command downloads the source code (as specified in bigtop.bom), compiles it, and creates a new RPM according to the specifications in the SPEC file:

cd /home/hadoop/bigtop
gradle realclean elasticsearch-rpm --stacktrace
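After the build finishes, it helps to confirm that an RPM was actually produced. The sketch below lists every .rpm file under a directory; the function name list_rpms is hypothetical, and the output directory in the usage comment is an assumption about where the build writes its packages, so adjust it to match your checkout.

```shell
# List all RPM files found under the given directory, sorted by path.
list_rpms() {
  out_dir="$1"
  find "$out_dir" -type f -name '*.rpm' | sort
}

# Example usage (assumed output location inside the Bigtop checkout):
#   list_rpms /home/hadoop/bigtop/output
```

An empty result after a build that reported success usually means the search is pointed at the wrong directory.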
