Indexing Common Crawl Metadata on Amazon EMR with Cascading and Elasticsearch

Chanci Turner Amazon IXD – VGT2 learning

Published on 28 MAY 2015

Category: Amazon EMR, Amazon OpenSearch Service, AWS Big Data

In a previous article, we explored how to set up Elasticsearch and Kibana on Amazon EMR. This time, we will demonstrate how to create a straightforward application using Cascading to read Common Crawl metadata, index it in Elasticsearch, and leverage Kibana to query the indexed data.

Understanding Common Crawl

Common Crawl is an open-access repository of web crawl data that is freely accessible on Amazon S3, adhering to Common Crawl’s terms of use. The dataset is available in various formats; in this example, we will focus on the WAT response format, which provides metadata for crawled HTML content. This metadata is valuable for building an Elasticsearch index, enabling the extraction of critical information from numerous websites across the Internet.

What is Cascading?

Cascading is a platform for developing data applications on Apache Hadoop. In this article, we’ll utilize it to create a simple application for indexing JSON files into Elasticsearch without needing to engage with MapReduce directly.

Launching an EMR Cluster with Elasticsearch, Maven, and Kibana

As detailed in the prior article, you'll begin by launching a cluster with Elasticsearch and Kibana pre-installed. The cluster also installs Maven to compile the application and runs a script that resolves library dependency conflicts between Elasticsearch and Cascading. All bootstrap actions are publicly available, so you can download the scripts and review the installation steps at any time.

To launch the cluster, execute the following AWS CLI command:

aws emr create-cluster --name "Elasticsearch_Getting_Started" --ami-version 3.11.0 \
  --instance-type=m3.xlarge --instance-count 3 \
  --ec2-attributes KeyName=your-key \
  --log-uri s3://your-bucket/logs/ \
  --bootstrap-action Name="Install EMR Dev Tools",Path=s3://awssupportdatasvcs.com/bootstrap-actions/EMR_Dev/setup_EMR_Dev.sh \
  Name="Install Cascading",Path=s3://awssupportdatasvcs.com/bootstrap-actions/Cascading/cascading-install.sh \
  Name="Configure Cascading Classpath",Path=s3://awssupportdatasvcs.com/bootstrap-actions/Cascading/cascading-set-classpath.sh \
  Name="Install Elasticsearch",Path=s3://support.elasticmapreduce/bootstrap-actions/other/elasticsearch_install.rb \
  Name="Install Kibana",Path=s3://support.elasticmapreduce/bootstrap-actions/other/kibananginx_install.rb \
  --no-auto-terminate --use-default-roles --region us-east-1

Compiling Cascading Source Code with Maven

Once your cluster is operational, connect via SSH to the master node to compile and execute the application. The Cascading application will filter the data prior to indexing, removing the WARC envelope and yielding a clean JSON output. For code details, refer to the GitHub repository.
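To make the envelope-stripping step concrete, here is a minimal Python sketch. It is not the repository's Cascading code. It assumes that each WAT record carries its JSON payload on a single line beginning with "{", and the function name extract_json_records is hypothetical.

```python
import json


def extract_json_records(lines):
    """Yield parsed JSON payloads, skipping WARC header lines.

    Assumption: each payload sits on one line that starts with '{'.
    """
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # WARC header or blank separator line
        try:
            yield json.loads(line)
        except ValueError:
            continue  # looked like JSON but did not parse; skip it


# Illustrative input: a few WARC header lines followed by one JSON payload.
sample = [
    "WARC/1.0",
    "WARC-Type: metadata",
    "Content-Type: application/json",
    "",
    '{"Envelope": {"Format": "WARC"}}',
    "",
]
records = list(extract_json_records(sample))
print(len(records))
```

Only the JSON line survives the filter, which is the "clean JSON output" the indexing step needs.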

Clone the repository:

$ git clone https://github.com/awslabs/aws-big-data-blog.git

Compile the code:

$ cd aws-big-data-blog/aws-blog-elasticsearch-cascading-commoncrawl/commoncrawl.cascading.elasticsearch
$ mvn clean && mvn assembly:assembly -Dmaven.test.skip=true -Ddescriptor=./src/main/assembly/job.xml -e

The compiled application will be located in:

aws-big-data-blog/aws-blog-elasticsearch-cascading-commoncrawl/commoncrawl.cascading.elasticsearch/target

Listing this directory should display the packaged application.

Indexing Common Crawl Metadata in Elasticsearch

Using the compiled application, you can index either a single Common Crawl file or an entire directory by changing the final argument, which is the S3 path to read from.

Index a single file:

hadoop jar /home/hadoop/aws-big-data-blog/aws-blog-elasticsearch-cascading-commoncrawl/commoncrawl.cascading.elasticsearch/target/commoncrawl.cascading.elasticsearch-0.0.1-SNAPSHOT-job.jar com.amazonaws.bigdatablog.indexcommoncrawl.Main s3://commoncrawl/crawl-data/CC-MAIN-2014-52/segments/1419447563504.69/wat/CC-MAIN-20141224185923-00099-ip-10-231-17-201.ec2.internal.warc.wat.gz

Index a complete directory:

hadoop jar /home/hadoop/aws-big-data-blog/aws-blog-elasticsearch-cascading-commoncrawl/commoncrawl.cascading.elasticsearch/target/commoncrawl.cascading.elasticsearch-0.0.1-SNAPSHOT-job.jar com.amazonaws.bigdatablog.indexcommoncrawl.Main s3://commoncrawl/crawl-data/CC-MAIN-2014-52/segments/1419447563504.69/wat/

Executing the command for a single file will produce output indicating that each JSON entry is being directly indexed into Elasticsearch using Cascading and Hadoop connectors.
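For readers curious about what indexing many JSON entries looks like on the wire, the sketch below builds a request body in the newline-delimited format that the Elasticsearch _bulk endpoint accepts. Whether the connector used here batches documents in exactly this way is an assumption. The index name "commoncrawl", the type name "wat", and the function build_bulk_body are hypothetical.

```python
import json


def build_bulk_body(docs, index, doc_type):
    """Serialize docs as a newline-delimited _bulk request body."""
    lines = []
    for doc in docs:
        # Action line: tells Elasticsearch where to index the next document.
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        # Source line: the document itself, on a single line.
        lines.append(json.dumps(doc))
    # The bulk format requires the body to end with a newline.
    return "\n".join(lines) + "\n"


body = build_bulk_body([{"a": 1}, {"a": 2}], "commoncrawl", "wat")
print(body)
```

Each document contributes two lines, an action line and a source line, so two documents produce a four-line body.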

Checking Indexes and Mappings

The index in Elasticsearch is generated automatically using the default settings. You can now run a few commands to verify the index and mappings created.

List all indexes:

$ curl 'localhost:9200/_cat/indices?v'

View the mappings:

curl -XGET 'http://localhost:9200/_all/_mapping' | python -m json.tool | more

Inspecting the mapping output reveals a structure consistent with the WAT metadata description published by Common Crawl.
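The nested mapping can be hard to read in raw form. As a rough aid, the following Python sketch flattens a mapping's "properties" tree into dotted field names like the ones shown in Kibana. The sample dictionary is a hand-written illustration, not real output from the cluster, and flatten_properties is a hypothetical helper.

```python
def flatten_properties(properties, prefix=""):
    """Return sorted dotted field paths from a mapping 'properties' dict."""
    paths = []
    for name, spec in properties.items():
        path = prefix + name
        if "properties" in spec:
            # Object field: recurse into its sub-fields.
            paths.extend(flatten_properties(spec["properties"], path + "."))
        else:
            # Leaf field: record its full dotted path.
            paths.append(path)
    return sorted(paths)


# Illustrative fragment only; a real mapping has many more fields.
sample = {
    "Envelope": {
        "properties": {
            "Format": {"type": "string"},
            "Payload-Metadata": {
                "properties": {"Actual-Content-Type": {"type": "string"}}
            },
        }
    }
}
fields = flatten_properties(sample)
print(fields)
```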

This mapping is accessible in the Kibana interface, allowing you to explore various metadata entries.

Querying Indexed Content

Since the Kibana bootstrap action configures the cluster to use port 80, you can access the Kibana console by directing your browser to the public DNS address of the master node. On the Kibana console, click “Sample Dashboard” to begin exploring the indexed data.

A sample dashboard will appear, showing basic information extracted from the data. To search the HTML-Metadata.Head.Metas entries for occurrences of “hello”, enter HTML-Metadata.Head.Metas AND keywords AND hello in the search box.

This search returns all records containing both ‘keywords’ and ‘hello’ under the “HTML-Metadata.Head.Metas” header.
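The same query string can also be sent to Elasticsearch directly through its URI search. The sketch below only builds the URL. It assumes Elasticsearch listens on localhost:9200, as in the curl commands earlier, and build_search_url is a hypothetical helper.

```python
from urllib.parse import urlencode


def build_search_url(query, host="localhost:9200", size=10):
    """Build a URI-search URL carrying the query in the 'q' parameter."""
    params = urlencode({"q": query, "size": size})
    return "http://{}/_search?{}".format(host, params)


url = build_search_url("HTML-Metadata.Head.Metas AND keywords AND hello")
print(url)
```

Passing the resulting URL to curl on the master node should return the matching documents as JSON, provided the index has been populated.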

Another useful method for information retrieval is through the mapping index. Clicking “Envelope.Payload-Metadata.HTTP-Response-Metadata.Headers.Server” will present a ranking of the various server technologies utilized by the indexed sites.

Clicking the magnifier icon next to a selected entry will provide further details. Alternatively, you can ascertain the top ten technologies employed in the indexed web applications by accessing “Envelope.Payload-Metadata.HTTP-Response-Metadata.Headers.X-Powered-By”.
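A "top ten" ranking of this kind corresponds to counting the most frequent values of a field. The sketch below shows two things under stated assumptions: a request body for an Elasticsearch terms aggregation, and a local Python equivalent run on a few made-up documents. The helper names and sample values are hypothetical. With default mappings the field is likely analyzed, so Elasticsearch may rank individual tokens rather than whole header values.

```python
from collections import Counter

FIELD = "Envelope.Payload-Metadata.HTTP-Response-Metadata.Headers.X-Powered-By"


def terms_aggregation(field, size=10):
    """Request body asking for the most frequent values of one field."""
    return {
        "size": 0,  # return no hits, only the aggregation result
        "aggs": {"top_values": {"terms": {"field": field, "size": size}}},
    }


def top_values(docs, field, size=10):
    """Count values at a dotted path across docs, most frequent first."""
    counts = Counter()
    for doc in docs:
        value = doc
        for key in field.split("."):
            value = value.get(key) if isinstance(value, dict) else None
            if value is None:
                break  # this document lacks the field
        if value is not None:
            counts[value] += 1
    return counts.most_common(size)


def make_doc(powered_by):
    """Build a made-up document with only the X-Powered-By header set."""
    headers = {"X-Powered-By": powered_by}
    return {"Envelope": {"Payload-Metadata": {
        "HTTP-Response-Metadata": {"Headers": headers}}}}


docs = [make_doc("PHP/5.3"), make_doc("ASP.NET"), make_doc("PHP/5.3"),
        {"Envelope": {}}]
ranking = top_values(docs, FIELD)
print(ranking)
```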

Conclusion

This article has demonstrated how EMR enables the creation and compilation of a straightforward Cascading application to index Common Crawl metadata in an Elasticsearch cluster. Cascading offers a user-friendly application layer atop Hadoop, facilitating the parallelization of the data-fetching process directly from the S3 repository. Meanwhile, Kibana presents an interface for in-depth exploration of the indexed data.

If you have queries or suggestions, please leave a comment below.