Data lakes, business intelligence, operational analytics, and data warehousing all share a fundamental requirement: the ability to extract, transform, and load (ETL) data for analytics. Since its launch in 2017, AWS Glue has provided a serverless data integration service that makes it easier to discover, prepare, and combine data for analytics, machine learning, and application development.

AWS Glue interactive sessions let developers build, test, and run data preparation and analytics applications. They provide on-demand access to fully managed, serverless Apache Spark. Advanced users get the same Apache Spark engine used in AWS Glue 2.0 and AWS Glue 3.0, with built-in cost controls and enhanced speed. Development teams can start working right away in the tools they already prefer.

This article shows how to use AWS Glue interactive sessions with PyCharm to author AWS Glue jobs.

Solution Overview

This walkthrough builds on Getting Started with AWS Glue Interactive Sessions and covers the following steps:

Create an AWS Identity and Access Management (IAM) policy that provides limited read privileges for Amazon Simple Storage Service (Amazon S3) and an associated role for AWS Glue.
Configure access to a development environment, whether on a desktop or an operating system in the AWS Cloud via Amazon Elastic Compute Cloud (Amazon EC2).
Integrate AWS Glue interactive sessions with an integrated development environment (IDE).

To validate the setup, we use the Jupyter notebook Validate_Glue_Interactive_Sessions.ipynb.

Prerequisites

Before you begin, make sure you have an AWS account. If you haven’t created one yet, refer to How do I create and activate a new AWS account? This guide also assumes that PyCharm and Python 3.7 or later are installed.
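The Python requirement can be confirmed with a short check run under the interpreter that PyCharm will use. This is a minimal sketch; the helper name meets_minimum is hypothetical and not part of any AWS tooling.

```python
import sys

MINIMUM = (3, 7)  # minimum Python version stated in the prerequisites


def meets_minimum(version_info=None, minimum=MINIMUM):
    """Return True if a (major, minor, ...) version is at least `minimum`."""
    if version_info is None:
        version_info = sys.version_info
    # Compare only major and minor, e.g. (3, 9) >= (3, 7)
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    status = "OK" if meets_minimum() else "too old"
    print("Python", sys.version.split()[0], status)
```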

Creating an IAM Policy

To begin, create an IAM policy that limits read access to the S3 bucket s3://awsglue-datasets, which hosts the AWS Glue public datasets. IAM is where you define the access policies and roles that AWS Glue uses.

Navigate to the IAM console and select Policies from the navigation pane.
Click Create policy.
On the JSON tab, input the following code:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:Get*",
                "s3:List*",
                "s3-object-lambda:Get*",
                "s3-object-lambda:List*"
            ],
            "Resource": ["arn:aws:s3:::awsglue-datasets/*"]
        }
    ]
}

Click Next: Tags.
Click Next: Review.
For Policy name, input glue_interactive_policy_limit_s3.
Provide a description and click Create policy.
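The console steps above can also be scripted. The following is a minimal sketch, assuming boto3 is installed and the caller has permission to create IAM policies; the function names and the default description text are hypothetical. The policy document mirrors the JSON shown above.

```python
import json

POLICY_NAME = "glue_interactive_policy_limit_s3"
BUCKET = "awsglue-datasets"


def build_s3_read_policy(bucket=BUCKET):
    """Return the read-only policy document from the console steps as a dict."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "s3:Get*",
                    "s3:List*",
                    "s3-object-lambda:Get*",
                    "s3-object-lambda:List*",
                ],
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
            }
        ],
    }


def create_policy(description="Read-only access to the AWS Glue public datasets"):
    """Create the managed policy and return its ARN (needs AWS credentials)."""
    import boto3  # imported lazily so the builder above works without boto3

    iam = boto3.client("iam")
    response = iam.create_policy(
        PolicyName=POLICY_NAME,
        PolicyDocument=json.dumps(build_s3_read_policy()),
        Description=description,
    )
    return response["Policy"]["Arn"]
```

Calling build_s3_read_policy() alone is enough to inspect the document before anything is created in the account.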

Creating an IAM Role for AWS Glue

Next, create a role for AWS Glue with restricted Amazon S3 read privileges:

In the IAM console, select Roles from the navigation pane.
Click Create role.
Choose AWS service for Trusted entity type.
Select Glue under Use cases for other AWS services.
Click Next.
On the Add permissions page, search for and select the AWS managed policy AWSGlueServiceRole and the customer managed policy glue_interactive_policy_limit_s3 that you created earlier.
Click Next.
Enter glue_interactive_role for Role name.
Click Create role.
Note the ARN of the role, which has the form arn:aws:iam::<account-id>:role/glue_interactive_role.
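For readers who prefer to script the role creation, here is a minimal sketch under the same assumptions (boto3 installed, IAM permissions available). The function names are hypothetical; custom_policy_arn is the ARN of glue_interactive_policy_limit_s3 in your account. The trust policy lets the AWS Glue service assume the role, which is what selecting Glue as the use case does in the console.

```python
import json

ROLE_NAME = "glue_interactive_role"
# AWS managed policy that the console steps attach to the role
GLUE_MANAGED_POLICY_ARN = "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"


def build_glue_trust_policy():
    """Trust policy that allows the AWS Glue service to assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "glue.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def create_glue_role(custom_policy_arn):
    """Create the role, attach both policies, and return the role ARN."""
    import boto3  # imported lazily; only needed when actually calling AWS

    iam = boto3.client("iam")
    role = iam.create_role(
        RoleName=ROLE_NAME,
        AssumeRolePolicyDocument=json.dumps(build_glue_trust_policy()),
    )
    for policy_arn in (GLUE_MANAGED_POLICY_ARN, custom_policy_arn):
        iam.attach_role_policy(RoleName=ROLE_NAME, PolicyArn=policy_arn)
    return role["Role"]["Arn"]
```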

Setting Up Development Environment Access

The second part of the access configuration is done in the developer’s environment, which can be either a desktop computer or an operating system running in the AWS Cloud on Amazon EC2. Follow the steps that match your setup.

Setting Up a Desktop Computer

For desktop setups, we recommend adhering to the instructions found in Getting Started with AWS Glue Interactive Sessions.

Setting Up an AWS Cloud-based Computer with Amazon EC2

This approach follows the best practice of granting access to cloud resources through IAM roles.

In the IAM console, navigate to Roles and click Create role.
Choose AWS service for Trusted entity type.
Select EC2 under Common use cases and click Next.
Attach the AWSGlueServiceRole policy to the newly created role.
Create an inline policy to allow the instance profile role to assume glue_interactive_role and save the role as ec2_glue_demo.

Your new policy should now appear under Permissions policies.
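The inline policy from the last step can be sketched as follows. This is an illustration only: it assumes the role ec2_glue_demo already exists, the account ID is supplied by you, and the inline policy name assume_glue_interactive_role and the function names are hypothetical.

```python
import json

EC2_ROLE_NAME = "ec2_glue_demo"
INLINE_POLICY_NAME = "assume_glue_interactive_role"  # hypothetical name


def build_assume_role_policy(account_id):
    """Inline policy letting the EC2 role assume glue_interactive_role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "sts:AssumeRole",
                "Resource": f"arn:aws:iam::{account_id}:role/glue_interactive_role",
            }
        ],
    }


def put_inline_policy(account_id):
    """Attach the inline policy to the existing EC2 role (needs credentials)."""
    import boto3  # imported lazily; only needed when actually calling AWS

    iam = boto3.client("iam")
    iam.put_role_policy(
        RoleName=EC2_ROLE_NAME,
        PolicyName=INLINE_POLICY_NAME,
        PolicyDocument=json.dumps(build_assume_role_policy(account_id)),
    )
```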

On the Amazon EC2 console, right-click the instance you wish to attach to the newly created role.
Go to Security and select Modify IAM role.
Choose ec2_glue_demo as the IAM role and click Save.
Return to the IAM console and edit the trust relationship for glue_interactive_role, adding the following entry to the Principal JSON key (replace <account-id> with your AWS account ID):
"AWS": ["arn:aws:iam::<account-id>:user/glue_interactive_user", "arn:aws:iam::<account-id>:role/ec2_glue_demo"]

Complete the steps outlined in Getting Started with AWS Glue Interactive Sessions.

You won’t need an AWS access key ID or secret access key for the remaining steps.
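The trust relationship edit described above can be sketched in code. This is a minimal example under stated assumptions: the account ID is a value you supply, the function names are hypothetical, and the existing Glue service principal is kept alongside the new AWS principals.

```python
import json

GLUE_ROLE_NAME = "glue_interactive_role"


def build_updated_trust_policy(account_id):
    """Trust policy with the Glue service plus the user and EC2 role principals."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Service": "glue.amazonaws.com",
                    "AWS": [
                        f"arn:aws:iam::{account_id}:user/glue_interactive_user",
                        f"arn:aws:iam::{account_id}:role/ec2_glue_demo",
                    ],
                },
                "Action": "sts:AssumeRole",
            }
        ],
    }


def update_trust_policy(account_id):
    """Replace the role's trust policy (needs credentials and IAM permissions)."""
    import boto3  # imported lazily; only needed when actually calling AWS

    iam = boto3.client("iam")
    iam.update_assume_role_policy(
        RoleName=GLUE_ROLE_NAME,
        PolicyDocument=json.dumps(build_updated_trust_policy(account_id)),
    )
```

IAM rejects a trust policy that names a principal that does not exist, so both the user and the EC2 role need to be in place before the update is applied.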

Integrating AWS Glue Interactive Sessions with an IDE

Now, it’s time to set up and validate your PyCharm integration with AWS Glue interactive sessions.

On the PyCharm welcome page, select New Project.
For Location, enter your project directory, glue-interactive-demo.
Expand Python Interpreter.
Select Previously configured interpreter and choose the virtual environment established earlier.
Click Create.

The New Project page will appear, reflecting your configuration.

Right-click on the project, select New, then Jupyter Notebook.
Name the notebook Validate_Glue_Interactive_Sessions.

The notebook shows a drop-down labeled Managed Jupyter server: auto-start, which means the Jupyter server starts when any notebook cell is run.

Run the following code:

print("This notebook will start the local Python kernel")

You should see that the Jupyter server has started running the cell.

On the Python 3 (ipykernel) drop-down, select Glue PySpark.
Execute the following code to initiate a Spark session:

spark

Wait for confirmation that a session ID has been generated.

In a new cell, run the boilerplate code required for AWS Glue:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
glueContext = GlueContext(SparkContext.getOrCreate())
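With glueContext available, a natural next cell reads from the public dataset bucket that the IAM policy allows. The following is a sketch rather than part of the original notebook: the helper names are hypothetical, and the object key is assumed to be the US legislators sample used in the AWS Glue documentation examples.

```python
def legislators_source_options(
    bucket="awsglue-datasets",
    key="examples/us-legislators/all/persons.json",  # assumed sample path
):
    """Build the S3 connection options for the sample dataset."""
    return {"paths": [f"s3://{bucket}/{key}"]}


def load_persons(glue_context):
    """Read the sample JSON file into a DynamicFrame.

    `glue_context` is the GlueContext created by the boilerplate cell, so this
    function only works inside a Glue PySpark session.
    """
    return glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options=legislators_source_options(),
        format="json",
    )
```

In the notebook, running persons = load_persons(glueContext) followed by persons.printSchema() should display the inferred schema if the session and role are configured correctly.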

This completes the walkthrough of using AWS Glue interactive sessions with PyCharm.