Transferring Large Datasets from Google Cloud Storage to Amazon S3 with Amazon EMR

by Chanci Turner

on 26 OCT 2021

in Amazon EMR, Amazon Simple Storage Service (S3), Migration, Technical How-to

Updated 4/17/2024: The migration solution outlined in this article for transferring data to Amazon S3 from Google Cloud Storage is most effective if you have an EMR cluster and are proficient in developing and executing your data transfer solution. Additionally, utilizing EMR on Amazon EC2 Spot Instances can be beneficial for cost savings. If you’re looking for a secure managed service that offers data validation, integrated auditing and monitoring features, as well as the ability to transfer modified data, you might want to consider AWS DataSync for your migration needs. Alternatively, if you’re well-versed with AWS Glue and prefer a serverless option for data migration, AWS Glue is also a suitable choice for transferring data to Amazon S3 from Google Cloud Storage.

Many businesses have data stored across various sources in different formats. While data plays a vital role in decision-making, it is often dispersed across multiple public clouds. As a result, organizations are seeking tools that facilitate the seamless and cost-effective transfer of large datasets between cloud providers. With the help of Amazon EMR and the Hadoop file copy tools Apache DistCp and S3DistCp, it is possible to migrate substantial datasets from Google Cloud Storage (GCS) to Amazon Simple Storage Service (Amazon S3).

Apache DistCp is an open-source tool designed for Hadoop clusters that enables both inter-cluster and intra-cluster data transfers. AWS has extended this tool through S3DistCp, which is specifically optimized for Amazon S3. Both tools leverage Hadoop MapReduce to parallelize file and directory copying in a distributed manner. By utilizing Hadoop’s native support for S3 object storage along with a Google-provided Hadoop connector for GCS, data migration between GCS and Amazon S3 can be accomplished efficiently. This article will guide you through the configuration of an EMR cluster for DistCp and S3DistCp, detail the settings and parameters for both tools, execute a copy of a test dataset amounting to 9.4 TB, and analyze the performance of the transfer.
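As a preview of where this setup leads, the copy itself is a single Hadoop job submitted on the cluster. The invocations look roughly like the following sketch; the bucket names are placeholders, and the exact flags available depend on your EMR release:

```shell
# Copy from GCS to Amazon S3 with Apache DistCp (placeholder bucket names):
hadoop distcp gs://<GCS BUCKET>/ s3://<S3 BUCKET>/

# The same copy with S3DistCp, which is optimized for writes to Amazon S3:
s3-dist-cp --src=gs://<GCS BUCKET>/ --dest=s3://<S3 BUCKET>/
```

Both tools accept further options (for example, to control the number of mappers) that you can use to tune throughput for your dataset.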

Prerequisites

To set up the EMR cluster, ensure that you meet the following prerequisites:

  1. Install the AWS Command Line Interface (AWS CLI) on your computer or server. For guidance, refer to the instructions for Installing, updating, and uninstalling the AWS CLI.
  2. Create an Amazon Elastic Compute Cloud (Amazon EC2) key pair for SSH access to your EMR nodes by following the steps outlined in Create a key pair using Amazon EC2.
  3. Set up an S3 bucket to hold the configuration files, bootstrap shell script, and GCS connector JAR file. Ensure this bucket exists in the same region where you intend to launch your EMR cluster.
  4. Create a shell script (copygcsjar.sh) that copies the GCS connector JAR file and Google Cloud Platform (GCP) credentials to the EMR cluster’s local storage during the bootstrapping process. Upload the shell script to your bucket at s3://<S3 BUCKET>/copygcsjar.sh. Below is a sample shell script:
#!/bin/bash
# Copy the GCS connector JAR and the GCP service account key from S3
# to the node's local storage during the bootstrap phase.
sudo aws s3 cp s3://<S3 BUCKET>/gcs-connector-hadoop3-latest.jar /tmp/gcs-connector-hadoop3-latest.jar
sudo aws s3 cp s3://<S3 BUCKET>/gcs.json /tmp/gcs.json
  5. Download the GCS connector JAR file for Hadoop 3.x (if using a different Hadoop version, you will need to find the appropriate JAR file) to enable reading files from GCS, and upload it to s3://<S3 BUCKET>/gcs-connector-hadoop3-latest.jar.
  6. Create GCP credentials for a service account that has access to the source GCS bucket. Name the credentials file gcs.json; it must be in JSON format. Upload the key to s3://<S3 BUCKET>/gcs.json. A sample key is as follows:
{
  "type": "service_account",
  "project_id": "project-id",
  "private_key_id": "key-id",
  "private_key": "-----BEGIN PRIVATE KEY-----\nprivate-key\n-----END PRIVATE KEY-----\n",
  "client_email": "service-account-email",
  "client_id": "client-id",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://accounts.google.com/o/oauth2/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/service-account-email"
}
  7. Create a JSON file named gcsconfiguration.json to enable the GCS connector in Amazon EMR. Ensure the file is located in the same directory where you will run your AWS CLI commands. Here is an example configuration file:
[
  {
    "Classification": "core-site",
    "Properties": {
      "fs.AbstractFileSystem.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS",
      "google.cloud.auth.service.account.enable": "true",
      "google.cloud.auth.service.account.json.keyfile": "/tmp/gcs.json",
      "fs.gs.status.parallel.enable": "true"
    }
  },
  {
    "Classification": "hadoop-env",
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "HADOOP_USER_CLASSPATH_FIRST": "true",
          "HADOOP_CLASSPATH": "$HADOOP_CLASSPATH:/tmp/gcs-connector-hadoop3-latest.jar"
        }
      }
    ]
  },
  {
    "Classification": "mapred-site",
    "Properties": {
      "mapreduce.application.classpath": "/tmp/gcs-connector-hadoop3-latest.jar"
    }
  }
]
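EMR rejects malformed configuration JSON at cluster-creation time, so it can help to confirm the file parses before you launch. The sketch below is self-contained for demonstration: it writes the core-site fragment from above to a temporary file and validates it with Python's built-in json.tool; run the same check against your real gcsconfiguration.json.

```shell
# Write the core-site fragment from the example above to a temp file
# (self-contained demonstration; use your real gcsconfiguration.json).
cat > /tmp/gcsconfiguration.json <<'EOF'
[
  {
    "Classification": "core-site",
    "Properties": {
      "fs.AbstractFileSystem.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS",
      "google.cloud.auth.service.account.enable": "true",
      "google.cloud.auth.service.account.json.keyfile": "/tmp/gcs.json",
      "fs.gs.status.parallel.enable": "true"
    }
  }
]
EOF

# python3 -m json.tool exits nonzero on invalid JSON.
python3 -m json.tool /tmp/gcsconfiguration.json > /dev/null && echo "configuration parses cleanly"
```

The same json.tool check is also worth running against gcs.json, since an unescaped newline in the private_key field is a common cause of connector authentication failures.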

Launching and Configuring Amazon EMR

For our testing purposes, we launch a basic cluster of one primary node and four core nodes, for a total of five c5n.xlarge instances. You should scale your copy workload by adding more core nodes and monitoring your copy job timings to determine the optimal cluster size for your dataset.

We will utilize the AWS CLI to launch and configure our EMR cluster (see the sample create-cluster command below):

aws emr create-cluster \
--name "My First EMR Cluster" \
--release-label emr-6.3.0 \
--applications Name=Hadoop \
--ec2-attributes KeyName=myEMRKeyPairName \
--instance-type c5n.xlarge \
--instance-count 5 \
--use-default-roles

To create a custom bootstrap action executed at the time of cluster creation, which copies the GCS connector JAR file and GCP credentials to the EMR cluster’s local storage, add the following parameter to the create-cluster command:

--bootstrap-actions Path="s3://<S3 BUCKET>/copygcsjar.sh"

For further details about this step, refer to the documentation on how to Create bootstrap actions to install additional software.
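Putting the pieces together, the bootstrap action and the configuration file from the prerequisites are both passed to the same create-cluster call. A combined command might look like the following sketch, where the bucket name and key pair name are the placeholders used in the earlier steps:

```shell
aws emr create-cluster \
--name "My First EMR Cluster" \
--release-label emr-6.3.0 \
--applications Name=Hadoop \
--ec2-attributes KeyName=myEMRKeyPairName \
--instance-type c5n.xlarge \
--instance-count 5 \
--use-default-roles \
--bootstrap-actions Path="s3://<S3 BUCKET>/copygcsjar.sh" \
--configurations file://gcsconfiguration.json
```

The --configurations parameter reads gcsconfiguration.json from the directory where you run the command, which is why the prerequisites ask you to keep the file there.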

By incorporating the right tools and strategies, transferring large datasets between cloud platforms can be efficient and effective. Navigating these processes adeptly can significantly improve your organization’s flexibility and agility in managing data across different environments.
