Amazon Onboarding with Learning Manager Chanci Turner

In this article, we explore how to implement precise access control in Amazon SageMaker Studio and Amazon EMR utilizing Apache Ranger and Microsoft Active Directory. By doing so, we empower data scientists and developers to manage their machine learning workflows effectively. Amazon SageMaker Studio offers a comprehensive integrated development environment (IDE) for various stages of the ML process, from data preparation to model training and deployment. Its seamless integration with Amazon EMR allows users to analyze extensive data volumes through tools like Apache Spark, Hive, and Presto, all within SageMaker Studio notebooks.

By leveraging Apache Ranger, users can enforce fine-grained access controls on both raw data in Amazon Simple Storage Service (Amazon S3) and structured data within a Hive metastore. This is achieved through an intuitive web interface for managing grant and revoke policies. In this guide, we demonstrate how to authenticate into SageMaker Studio using an existing Active Directory (AD) setup, allowing authorized access to both Amazon S3 and Hive cataloged data via Apache Ranger and AWS IAM Identity Center (previously known as AWS Single Sign-On).

This approach simplifies the management of multiple SageMaker environments and notebooks through a single set of credentials. Consequently, any Apache Spark jobs initiated from SageMaker Studio will only access data and resources allowed by the Apache Ranger policies tied to the AD credentials, including specific table and column access.
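As a rough illustration of what a table- and column-level policy might look like, the sketch below builds a policy document shaped like Apache Ranger's policy JSON and prints it. The service name, database, table, column names, and user name are hypothetical placeholders, not values from this solution. The field layout reflects our understanding of Ranger's policy model and should be checked against your Ranger version before use.

```python
import json


def build_select_policy(service, policy_name, database, table, columns, users):
    """Build a Ranger-style policy dict granting SELECT on specific columns.

    All argument values used below are illustrative assumptions.
    """
    return {
        "service": service,
        "name": policy_name,
        "isEnabled": True,
        "resources": {
            "database": {"values": [database]},
            "table": {"values": [table]},
            "column": {"values": list(columns)},
        },
        "policyItems": [
            {
                "users": list(users),
                "accesses": [{"type": "select", "isAllowed": True}],
            }
        ],
    }


policy = build_select_policy(
    service="example_spark_service",      # hypothetical Ranger service name
    policy_name="marketing-customers-nonsensitive",
    database="example_db",                # hypothetical database
    table="customers",
    columns=["customer_id", "segment"],   # hypothetical non-sensitive columns
    users=["example_ad_user"],            # hypothetical AD user name
)

print(json.dumps(policy, indent=2))
```

The sketch only constructs the document. In practice such a policy would be created through the Ranger web interface or its REST API.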

This capability facilitates a multi-tenant setup, where multiple users can connect to the same EMR cluster while only accessing datasets and resources assigned to them. The audit records of these activities are captured and displayed in Amazon CloudWatch, ensuring visibility and compliance. User session isolation enhances security by preventing users from accessing datasets allocated to others. This results in fewer required clusters and reduced administrative overhead, ultimately saving time and costs.

Solution Overview

We illustrate this solution through a practical example using a sample ecommerce dataset, available through provided AWS CloudFormation templates, which includes transaction data related to products, orders, and customers cataloged in a Hive metastore.

In our scenario, we have two users, Chanci Turner and Jordan, who have distinct data access needs:

Chanci, a data scientist on the marketing team, is focused on developing a model for customer lifetime value. Her access is restricted to non-sensitive data regarding customers, products, and orders.
Jordan, representing the sales team, is tasked with forecasting product demand, requiring access solely to product and orders data without any customer details.

The desired fine-grained access can be visualized in the accompanying architecture diagram.
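The access requirements described for the two users can also be written down as a small lookup table. The following is a minimal sketch of the intended outcome only, not of how Apache Ranger evaluates policies. The column names are assumptions made for illustration, since the dataset schema is not listed here.

```python
WILDCARD = "*"  # stands for "every column in the table"

# Intended access per user: table name -> set of readable columns.
# Column names are hypothetical; the real schema may differ.
ACCESS = {
    "chanci": {
        "customers": {"customer_id", "segment", "signup_date"},  # non-sensitive only
        "products": {WILDCARD},
        "orders": {WILDCARD},
    },
    "jordan": {
        "products": {WILDCARD},
        "orders": {WILDCARD},
        # no entry for "customers": Jordan has no access to customer details
    },
}


def can_read(user, table, column):
    """Return True if the user is meant to read the given column."""
    allowed = ACCESS.get(user, {}).get(table, set())
    return WILDCARD in allowed or column in allowed


for user, table, column in [
    ("chanci", "customers", "segment"),
    ("chanci", "customers", "email"),
    ("jordan", "customers", "customer_id"),
    ("jordan", "orders", "order_total"),
]:
    print(user, table, column, can_read(user, table, column))
```

A table like this can serve as a checklist when verifying the Ranger policies later: each row that should be denied ought to fail when queried from the corresponding user's notebook.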

This architecture is established as follows:

Microsoft Active Directory: Manages user authentication and authorizes access to AWS applications based on user and group membership for data secured by Apache Ranger.
Apache Ranger: Monitors and administers comprehensive data security across the Hadoop and Amazon EMR platforms.
Amazon EMR: Retrieves, prepares, and analyzes data from the Hive metastore using Spark.
SageMaker Studio: An integrated IDE equipped with specialized tools for developing AI/ML models.

The subsequent sections guide you through the setup of these components using the CloudFormation stack.

Prerequisites

Before diving in, ensure you have the following:

An AWS account.
An AWS Identity and Access Management (IAM) user with administrative privileges.

Create Resources with AWS CloudFormation

To build the solution in your environment, use the provided CloudFormation templates to create the necessary AWS resources. Note that running these templates may incur charges, and all steps must be performed in the same AWS Region.

Template 1

The first template typically takes about 15 minutes to complete and creates the following resources:

A Multi-AZ, multi-subnet VPC setup with managed NAT gateways in the public subnet for each Availability Zone.
S3 VPC endpoints and Elastic Network Interfaces.
A Windows Active Directory domain controller running on Amazon Elastic Compute Cloud (Amazon EC2) with cross-realm trust.
A Linux bastion host (Amazon EC2) within an Auto Scaling group.

To deploy this template:

Sign in to the AWS Management Console.
In the Amazon EC2 console, create an EC2 key pair.
Select “Launch Stack”.
Choose the target Region.
Verify the stack name and provide the required parameters including the key pair name created earlier. Record the passwords associated with cross-realm trust, Windows domain admin, LDAP bind, and default AD user for future use.
Choose a minimum of three Availability Zones based on your selected Region.
Review remaining parameters, making no changes unless desired, then proceed.
Confirm your parameters and select “Submit”.
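For readers who prefer scripting over the console, the deployment steps above could be approximated with the AWS SDK for Python (boto3). This is a sketch under stated assumptions: the parameter keys (`KeyPairName`, `AvailabilityZones`), the stack name, and the Availability Zone names are hypothetical, and the real template may define different parameters. The template URL is not given here, so it is left as an argument. Only the helper functions run when the script is executed; `launch_stack` is defined but not called.

```python
def validate_availability_zones(zones, minimum=3):
    """Return the sorted unique zones, requiring at least `minimum` of them."""
    unique = sorted(set(zones))
    if len(unique) < minimum:
        raise ValueError(
            f"need at least {minimum} Availability Zones, got {len(unique)}"
        )
    return unique


def to_cfn_parameters(values):
    """Convert a plain dict into the Parameters list that create_stack accepts."""
    return [
        {"ParameterKey": key, "ParameterValue": str(value)}
        for key, value in values.items()
    ]


def launch_stack(stack_name, template_url, parameters, region):
    """Create the stack. Requires boto3 and AWS credentials; not called below."""
    import boto3  # imported lazily so the helpers above work without the SDK

    client = boto3.client("cloudformation", region_name=region)
    return client.create_stack(
        StackName=stack_name,
        TemplateURL=template_url,
        Parameters=parameters,
        Capabilities=["CAPABILITY_NAMED_IAM"],  # assumed; depends on the template
    )


zones = validate_availability_zones(["us-east-1a", "us-east-1b", "us-east-1c"])
parameters = to_cfn_parameters(
    {
        "KeyPairName": "example-key-pair",    # hypothetical parameter key and value
        "AvailabilityZones": ",".join(zones),  # hypothetical parameter key
    }
)
print(parameters)
```

Passwords such as the cross-realm trust and domain admin passwords would be supplied as further parameters. They are best read from a secure source rather than hard-coded in a script.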

Template 2

The second template typically takes 30-60 minutes to complete and creates the following resources:

An Amazon Relational Database Service (Amazon RDS) for MySQL database for Apache Ranger and Hive metastore.
A self-managed standalone Apache Ranger server (2.x only).
SSL keys and certificates uploaded to AWS Secrets Manager for encrypted traffic between the Ranger server and agents.
A Kerberos-enabled EMR cluster with AWS-managed Ranger plugins.