Indexing Common Crawl Metadata on Amazon EMR Utilizing Cascading and Elasticsearch

In a prior article, we explored the initial steps to set up Elasticsearch and Kibana on Amazon EMR. This guide will demonstrate how to develop a straightforward application using Cascading to read Common Crawl metadata, index it in Elasticsearch, and query the indexed data with Kibana.

What is Common Crawl?

Common Crawl is an open repository of web crawl data, freely accessible on Amazon S3 under its own terms of use. The data is available in several formats; this example focuses on the WAT response format, which holds metadata about the crawled HTML content. Indexing this metadata in Elasticsearch lets you extract insights from a large number of websites across the Internet.
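
To make the WAT layout more concrete, the following Python sketch parses a heavily trimmed record and walks a dotted field path. The field names are the ones used later in this article; the sample values, the exact nesting of HTML-Metadata, and the helper name get_path are assumptions made for illustration only.

```python
import json

# Hypothetical, heavily trimmed WAT payload. Real records carry many more
# fields; the values here are invented for illustration.
SAMPLE_WAT_JSON = """
{
  "Envelope": {
    "Payload-Metadata": {
      "HTTP-Response-Metadata": {
        "Headers": {"Server": "nginx", "X-Powered-By": "PHP/5.4"},
        "HTML-Metadata": {
          "Head": {"Metas": [{"name": "keywords", "content": "hello, world"}]}
        }
      }
    }
  }
}
"""


def get_path(record, dotted_path, default=None):
    """Follow a dotted path such as 'Envelope.Payload-Metadata' through nested dicts."""
    node = record
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


record = json.loads(SAMPLE_WAT_JSON)
headers_path = "Envelope.Payload-Metadata.HTTP-Response-Metadata.Headers"
print(get_path(record, headers_path + ".Server"))
print(get_path(record, headers_path + ".X-Powered-By"))
```

The same dotted paths appear as field names once the records are indexed, which is why they show up in the Kibana steps later on.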

What is Cascading?

Cascading is a platform designed for developing data applications on Apache Hadoop. In this guide, you will utilize it to create a simple application that indexes JSON files in Elasticsearch without requiring complex MapReduce logic.

Launching an EMR Cluster with Elasticsearch, Maven, and Kibana

Similar to the previous tutorial, you will launch an EMR cluster with Elasticsearch and Kibana pre-installed. Additionally, Maven will be installed for compiling the application and a script will be run to resolve library dependencies between Elasticsearch and Cascading. All bootstrap actions are publicly available, allowing you to download the code and verify installation steps whenever necessary.

To initiate the cluster, use the AWS CLI and execute the following command:

aws emr create-cluster --name "Elasticsearch_Getting_Started" --ami-version 3.11.0 \
--instance-type=m3.xlarge --instance-count 3 \
--ec2-attributes KeyName=your-key \
--log-uri s3://your-bucket/logs/ \
--bootstrap-action Name="Install EMR Dev Tools",Path=s3://awssupportdatasvcs.com/bootstrap-actions/EMR_Dev/setup_EMR_Dev.sh \
Name="Install Cascading",Path=s3://awssupportdatasvcs.com/bootstrap-actions/Cascading/cascading-install.sh \
Name="Configure Cascading Classpath",Path=s3://awssupportdatasvcs.com/bootstrap-actions/Cascading/cascading-set-classpath.sh \
Name="Install Elasticsearch",Path=s3://support.elasticmapreduce/bootstrap-actions/other/elasticsearch_install.rb \
Name="Install Kibana",Path=s3://support.elasticmapreduce/bootstrap-actions/other/kibananginx_install.rb \
--no-auto-terminate --use-default-roles --region us-east-1

Compiling Cascading Source Code with Maven

Once the cluster is operational, connect via SSH to the master node to compile and execute the application. Your Cascading application applies a filter to remove the WARC envelope and yield plain JSON output before starting the indexing process. For further details about the code, refer to the GitHub repository.
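
The idea behind that filter can be sketched in a few lines of Python. This is not the repository's Java code, only an illustration under the assumption that each JSON payload sits on a single line between WARC header lines; the sample lines and the function name extract_json_records are invented.

```python
import json

# Hypothetical excerpt of a decompressed WAT file: WARC header lines
# followed by a single-line JSON payload. Values are invented.
SAMPLE_LINES = [
    "WARC/1.0",
    "WARC-Type: metadata",
    "Content-Type: application/json",
    "Content-Length: 52",
    "",
    '{"Envelope": {"Format": "WARC", "Payload-Metadata": {}}}',
    "",
]


def extract_json_records(lines):
    """Yield parsed JSON payloads, skipping WARC header and blank lines."""
    for line in lines:
        text = line.strip()
        if not text.startswith("{"):
            continue  # WARC envelope line, not a JSON payload
        try:
            yield json.loads(text)
        except ValueError:
            continue  # looked like JSON but was malformed; skip it


records = list(extract_json_records(SAMPLE_LINES))
print(len(records))
```

For a real file, the same function could be fed from gzip.open(path, "rt"), since the WAT files are gzip-compressed.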

Clone the repository:

$ git clone https://github.com/awslabs/aws-big-data-blog.git

Compile the code:

$ cd aws-big-data-blog/aws-blog-elasticsearch-cascading-commoncrawl/commoncrawl.cascading.elasticsearch
$ mvn clean && mvn assembly:assembly -Dmaven.test.skip=true -Ddescriptor=./src/main/assembly/job.xml -e

The compiled application is written to the following directory: aws-big-data-blog/aws-blog-elasticsearch-cascading-commoncrawl/commoncrawl.cascading.elasticsearch/target. Listing this directory should show the packaged job JAR, commoncrawl.cascading.elasticsearch-0.0.1-SNAPSHOT-job.jar.

Indexing Common Crawl Metadata in Elasticsearch

Using the compiled application, you can index either a single Common Crawl file or an entire directory, depending on the S3 path you pass as the argument. The following commands show both cases.

To index a single file:

hadoop jar /home/hadoop/aws-big-data-blog/aws-blog-elasticsearch-cascading-commoncrawl/commoncrawl.cascading.elasticsearch/target/commoncrawl.cascading.elasticsearch-0.0.1-SNAPSHOT-job.jar com.amazonaws.bigdatablog.indexcommoncrawl.Main s3://commoncrawl/crawl-data/CC-MAIN-2014-52/segments/1419447563504.69/wat/CC-MAIN-20141224185923-00099-ip-10-231-17-201.ec2.internal.warc.wat.gz

To index a complete directory:

hadoop jar /home/hadoop/aws-big-data-blog/aws-blog-elasticsearch-cascading-commoncrawl/commoncrawl.cascading.elasticsearch/target/commoncrawl.cascading.elasticsearch-0.0.1-SNAPSHOT-job.jar com.amazonaws.bigdatablog.indexcommoncrawl.Main s3://commoncrawl/crawl-data/CC-MAIN-2014-52/segments/1419447563504.69/wat/

Executing the command to index a single file produces output showing the application writing each JSON entry directly into Elasticsearch using the Cascading and Hadoop connectors.
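
The connector handles the writes for you, but it can help to picture what sending documents in bulk to Elasticsearch looks like. The Python sketch below builds a newline-delimited body in the shape the Elasticsearch bulk API expects. It assumes an older Elasticsearch release that still uses mapping types (the _type field); the index name, type name, and function name are hypothetical and are not taken from the application.

```python
import json


def build_bulk_body(documents, index_name, doc_type):
    """Return a newline-delimited bulk request body for the given documents."""
    lines = []
    for doc in documents:
        # Action line: tells Elasticsearch where the next document goes.
        lines.append(json.dumps({"index": {"_index": index_name, "_type": doc_type}}))
        # Source line: the document itself, serialized on a single line.
        lines.append(json.dumps(doc))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical index and type names; the application in this article
# chooses its own.
docs = [{"Envelope": {"Format": "WARC"}}, {"Envelope": {"Format": "WARC"}}]
body = build_bulk_body(docs, "commoncrawl-example", "wat")
print(body)
```

Each document becomes two lines, an action line followed by the document source, which is why the payloads have to be plain single-line JSON before indexing.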

Checking Indexes and Mappings

The index in Elasticsearch is automatically created with the default configuration. Run a few commands in the console to inspect the index and mappings.

List all indexes:

curl 'localhost:9200/_cat/indices?v'

View the mappings:

curl -XGET 'http://localhost:9200/_all/_mapping' | python -m json.tool | more

The mapping output aligns with the structure outlined in the Common Crawl WAT metadata description. This mapping can be viewed in the Kibana menu, enabling navigation through the various metadata entries.
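
The mapping output is deeply nested, so it can be easier to read as a flat list of dotted field names. The Python sketch below does that for the "properties" section of a mapping. The sample mapping is a small invented fragment, and flatten_properties is a hypothetical helper; the exact wrapper keys around "properties" depend on your Elasticsearch version and index name.

```python
def flatten_properties(properties, prefix=""):
    """Return dotted field names for every leaf field in a mapping's properties."""
    names = []
    for field, spec in sorted(properties.items()):
        path = prefix + field
        nested = spec.get("properties")
        if nested:
            # Object field: descend and extend the dotted path.
            names.extend(flatten_properties(nested, path + "."))
        else:
            names.append(path)
    return names


# Hypothetical fragment of a mapping's "properties" section.
SAMPLE_PROPERTIES = {
    "Envelope": {"properties": {
        "Payload-Metadata": {"properties": {
            "HTTP-Response-Metadata": {"properties": {
                "Headers": {"properties": {
                    "Server": {"type": "string"},
                    "X-Powered-By": {"type": "string"},
                }},
            }},
        }},
    }},
}

for name in flatten_properties(SAMPLE_PROPERTIES):
    print(name)
```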

Querying Indexed Content

With the Kibana bootstrap action configured to use port 80, direct your browser to the public DNS address of the master node to access the Kibana interface. In Kibana, click on Sample Dashboard to explore the content indexed earlier.

A sample dashboard will display basic extracted information. You can search for occurrences of “hello” in the HTML-Metadata.Head.Metas headers by entering “HTML-Metadata.Head.Metas AND keywords AND hello” in the search box. This returns all records containing ‘keywords’ and ‘hello’ within the “HTML-Metadata.Head.Metas” header.
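
The same search can be expressed outside Kibana as an Elasticsearch query_string query. The Python sketch below only builds the request body; it assumes the Kibana search box passes its text through as a Lucene-style query string, and the function name build_query_string_search is hypothetical.

```python
import json


def build_query_string_search(query, size=10):
    """Build a search request body that runs a Lucene-style query string."""
    return {"size": size, "query": {"query_string": {"query": query}}}


body = build_query_string_search("HTML-Metadata.Head.Metas AND keywords AND hello")
print(json.dumps(body, indent=2))
```

The printed JSON could be saved to a file and sent to the _search endpoint on localhost:9200 with curl, in the same way as the earlier index and mapping checks.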

Additionally, to see which server technologies the indexed sites use, click on “Envelope.Payload-Metadata.HTTP-Response-Metadata.Headers.Server” for a ranking. Clicking the magnifying glass icon reveals detailed information on the selected entry. You can also obtain the top ten technologies used by the indexed web applications by selecting “Envelope.Payload-Metadata.HTTP-Response-Metadata.Headers.X-Powered-By”.
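
A ranking like this corresponds roughly to a terms aggregation in Elasticsearch. The Python sketch below builds such a request and reads the buckets out of a response. It assumes an Elasticsearch version that supports aggregations; the aggregation name top_values, the function names, and the sample response with its counts are all invented for illustration.

```python
def build_top_terms_request(field, top_n=10):
    """Build a request body asking for the most frequent values of a field."""
    return {
        "size": 0,  # return no documents, only the aggregation result
        "aggs": {"top_values": {"terms": {"field": field, "size": top_n}}},
    }


def parse_top_terms(response):
    """Turn the aggregation buckets into (value, count) pairs."""
    buckets = response["aggregations"]["top_values"]["buckets"]
    return [(bucket["key"], bucket["doc_count"]) for bucket in buckets]


SERVER_FIELD = "Envelope.Payload-Metadata.HTTP-Response-Metadata.Headers.Server"
request_body = build_top_terms_request(SERVER_FIELD)

# Hypothetical response fragment with invented counts.
sample_response = {"aggregations": {"top_values": {"buckets": [
    {"key": "apache", "doc_count": 120},
    {"key": "nginx", "doc_count": 80},
]}}}
print(parse_top_terms(sample_response))
```

With default mappings, string fields may be analyzed, in which case the buckets are counted per token rather than per full header value.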

Conclusion

This article illustrated how EMR enables the development and compilation of a basic Cascading application, which can be utilized to index Common Crawl metadata within an Elasticsearch cluster. Cascading simplifies the application layer over Hadoop, facilitating the process of pulling data directly from the S3 repository, while Kibana provides an interface for investigating the indexed data in various ways.

If you have any questions or feedback, feel free to leave a comment below.