Implementing Table-Level Access Control on Data Lake Tables with AWS Glue 5.0 and AWS Lake Formation


AWS Glue 5.0 introduces Full-Table Access (FTA) control in Apache Spark, based on policies defined in AWS Lake Formation. This capability allows read and write operations from AWS Glue 5.0 Spark jobs on tables registered in Lake Formation, provided the job role is granted full table access. FTA is particularly useful for scenarios that require compliance with security regulations at the table level. In addition, Spark features such as Resilient Distributed Datasets (RDDs), custom libraries, and user-defined functions (UDFs) can now be used with Lake Formation tables. This enables Data Definition Language (DDL) and Data Manipulation Language (DML) operations, such as CREATE, ALTER, DELETE, UPDATE, and MERGE INTO, on both Apache Hive and Apache Iceberg tables within the same Apache Spark application. Data teams can run complex, interactive Spark applications through Amazon SageMaker Unified Studio in compatibility mode while adhering to the table-level security model established by Lake Formation. This streamlines the security and governance of your data lakes.

In this article, we will guide you through the process of enforcing FTA control on AWS Glue 5.0 using Lake Formation permissions.

Understanding Access Control in AWS Glue

AWS Glue 5.0 provides two mechanisms to achieve access control via Lake Formation:

  1. Full-Table Access (FTA) Control
  2. Fine-Grained Access Control (FGAC)

At a glance, FTA provides access control at the table level, while FGAC supports control at the table, row, column, and cell levels. FGAC enforces a stricter security model based on user/system space isolation, in which only a limited set of Spark core classes is permitted. FGAC also requires extra configuration, such as passing the --enable-lakeformation-fine-grained-access parameter to the job; a sketch of passing this parameter follows below. For further details about FGAC, see Enforce fine-grained access control on data lake tables using AWS Glue 5.0 integrated with AWS Lake Formation.
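
To make the difference concrete, here is a minimal sketch that creates two AWS Glue 5.0 jobs with boto3: one opting into FGAC through the parameter above, and one relying on FTA. The job names, script location, and role ARN are placeholders, not values from this walkthrough.

import boto3

glue = boto3.client("glue")

common = dict(
    Role="arn:aws:iam::123456789012:role/Glue5FTAJobRole",  # placeholder role ARN
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://amzn-s3-demo-bucket/scripts/etl.py",  # placeholder script
        "PythonVersion": "3",
    },
    GlueVersion="5.0",
    WorkerType="G.1X",
    NumberOfWorkers=2,
)

# FGAC: opt in explicitly; the job then runs with user/system space isolation.
glue.create_job(
    Name="demo-job-fgac",
    DefaultArguments={"--enable-lakeformation-fine-grained-access": "true"},
    **common,
)

# FTA: no special parameter; Lake Formation enforces the table-level grants
# held by the job role.
glue.create_job(Name="demo-job-fta", **common)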

While granular control is essential for organizations that must comply with data governance and security regulations or that handle sensitive information, it can be excessive for those that only need table-level access control. To enforce table-level access without the performance, cost, and setup overhead that comes with FGAC's tighter security model, AWS Glue has introduced FTA. Let's take a closer look at FTA, the focus of this article.

How Full-Table Access (FTA) Functions in AWS Glue

Prior to AWS Glue 5.0, Lake Formation-based data access was managed through the GlueContext class, a utility provided by AWS Glue. With the release of AWS Glue 5.0, access to Lake Formation-based data is now achievable through native Spark SQL and Spark DataFrames.

With this change, if you have full table access to your tables through Lake Formation permissions, you no longer need to enable fine-grained access mode for your AWS Glue jobs or sessions. This removes the need for the system driver and system executors that fine-grained access requires, leading to improved performance and reduced costs. Additionally, while Lake Formation fine-grained access mode supports read operations only, FTA supports both read and write operations through statements such as CREATE, ALTER, DELETE, UPDATE, and MERGE INTO.
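
To make this concrete, here is a minimal sketch of what a Glue 5.0 script body can look like under FTA: plain Spark SQL for reads and an Iceberg MERGE INTO for writes, with no GlueContext involved. The table names are the demo tables used in this post (iceberg_datalake is an assumed name), and the job is assumed to be configured for Iceberg as described in Using AWS Glue with AWS Lake Formation for Full Table Access.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read: an ordinary Spark SQL query against a Lake Formation-registered table.
df = spark.sql(
    "SELECT product_id, product_name, quantity_available "
    "FROM glue5_fta_demo.raw_csv_input"
)
df.show()

# Write: DML against an Iceberg table works when the job role holds full table
# access on both tables (requires the Iceberg Spark extensions to be enabled).
spark.sql("""
    MERGE INTO glue5_fta_demo.iceberg_datalake AS t
    USING glue5_fta_demo.raw_csv_input AS s
    ON t.product_id = s.product_id
    WHEN MATCHED THEN UPDATE SET t.quantity_available = s.quantity_available
    WHEN NOT MATCHED THEN INSERT (product_id, category, product_name,
                                  quantity_available, last_update_time)
    VALUES (s.product_id, s.category, s.product_name, s.quantity_available,
            s.last_update_time)
""")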

To use FTA mode, third-party query engines must be allowed to access data without validation of AWS Identity and Access Management (IAM) session tags in Lake Formation. You can follow the guidance in Application integration for full table access to enable this.
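
If you prefer to script that setting, the following sketch flips it through the Lake Formation data lake settings API; it assumes you run it as a data lake administrator, and AllowFullTableExternalDataAccess is the programmatic counterpart of the console's application integration option.

import boto3

lf = boto3.client("lakeformation")

# Fetch the current settings, enable full-table access for external engines,
# and write the settings back.
settings = lf.get_data_lake_settings()["DataLakeSettings"]
settings["AllowFullTableExternalDataAccess"] = True
lf.put_data_lake_settings(DataLakeSettings=settings)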

Transitioning from AWS Glue 4.0 to AWS Glue 5.0 Native Spark FTA

The overarching steps for enabling the Spark native FTA feature are detailed in Using AWS Glue with AWS Lake Formation for Full Table Access. However, this section will provide a comprehensive example of migrating an AWS Glue 4.0 job that employs FTA through GlueContext to an AWS Glue 5.0 job utilizing Spark native FTA.
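
As a preview of the migration, the sketch below contrasts the two access patterns. It uses the demo table created later in this post; for Iceberg tables, the Glue 5.0 job would additionally carry the Spark catalog configuration described in the linked documentation.

# --- Before: AWS Glue 4.0, Lake Formation access through GlueContext ---
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
df = glue_context.create_data_frame.from_catalog(
    database="glue5_fta_demo",
    table_name="raw_csv_input",
)

# --- After: AWS Glue 5.0, native Spark access under FTA ---
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.table("glue5_fta_demo.raw_csv_input")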

Prerequisites

Before proceeding, ensure you have the following prerequisites:

  • An AWS account with the necessary IAM roles:
    • A Lake Formation data access IAM role that isn't a service-linked role.
    • An AWS Glue job execution role with the AWS managed policy AWSGlueServiceRole attached, along with the lakeformation:GetDataAccess permission. Make sure the trust policy allows the AWS Glue service (glue.amazonaws.com) to assume the role; a scripted sketch of this role follows the list.
  • The permissions required to:
    • Read or write to an Amazon Simple Storage Service (Amazon S3) bucket.
    • Create and execute AWS Glue jobs.
    • Manage AWS Glue Data Catalog databases and tables.
    • Manage Amazon Athena workgroups and run queries.
  • Lake Formation configured in your account, along with a Lake Formation administrator role or a similar role to follow the instructions in this article. If you want to learn more about setting up permissions for a data lake administrator role, see Create a data lake administrator.
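
For readers who script their setup, here is a hedged sketch of creating the job execution role described above with boto3. The role and policy names are placeholders, and your organization may scope the inline policy more tightly than Resource "*".

import json

import boto3

iam = boto3.client("iam")

# Trust policy allowing the AWS Glue service to assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "glue.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(
    RoleName="Glue5FTAJobRole",  # placeholder name
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Attach the AWS managed policy for Glue jobs.
iam.attach_role_policy(
    RoleName="Glue5FTAJobRole",
    PolicyArn="arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
)

# Grant lakeformation:GetDataAccess through an inline policy.
iam.put_role_policy(
    RoleName="Glue5FTAJobRole",
    PolicyName="LakeFormationGetDataAccess",
    PolicyDocument=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": "lakeformation:GetDataAccess",
            "Resource": "*",
        }],
    }),
)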

In this article, we will use the us-east-1 AWS Region, but you can adapt it to your preferred Region if the AWS services involved are available there. You’ll set up test data and an example AWS Glue 4.0 job using GlueContext; however, if you already have these, you may skip ahead to Migrate an AWS Glue 4.0 GlueContext FTA job to AWS Glue 5.0 native Spark FTA. With the prerequisites in place, you’re ready to begin the implementation steps.

Creating an S3 Bucket and Uploading Sample Data

To create an S3 bucket for raw input datasets and Iceberg tables, follow these steps:

  1. On the Amazon S3 console, choose Buckets in the navigation pane.
  2. Choose Create bucket.
  3. Enter a bucket name (for example, glue5-fta-demo-${AWS_ACCOUNT_ID}-${AWS_REGION_CODE}), leaving other settings at their defaults.
  4. Choose Create bucket.
  5. On the bucket details page, choose Create folder.
  6. Create two folders: raw-csv-input and iceberg-datalake.

Upload the LOAD00000001.csv file into the raw-csv-input folder of the bucket.
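
If you'd rather script the bucket setup and upload, the following boto3 sketch is equivalent. The bucket name is a placeholder, and in Regions other than us-east-1, create_bucket also needs a CreateBucketConfiguration.

import boto3

s3 = boto3.client("s3")
bucket = "glue5-fta-demo-123456789012-use1"  # substitute your own bucket name

s3.create_bucket(Bucket=bucket)
s3.upload_file("LOAD00000001.csv", bucket, "raw-csv-input/LOAD00000001.csv")
# The iceberg-datalake/ prefix is created implicitly when the Iceberg table
# writes its first data files, so an explicit folder object isn't required.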

Creating AWS Glue Database and Tables

To set up input and output sample tables in the Data Catalog, follow these steps:

  1. Navigate to the Athena console and open the query editor.
  2. Execute the following queries in sequence (substituting your S3 bucket name):
-- Create a database for the demo
CREATE DATABASE glue5_fta_demo;

-- Create an external table for input CSV files. Replace the S3 path with your bucket name
CREATE EXTERNAL TABLE glue5_fta_demo.raw_csv_input(
 op string, 
 product_id bigint, 
 category string, 
 product_name string, 
 quantity_available bigint, 
 last_update_time string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://<your-bucket-name>/raw-csv-input/'
TBLPROPERTIES (
 'areColumnsQuoted'='false', 
 'classification'='csv', 
 'columnsOrdered'='true', 
 'compressionType'='none', 
 'delimiter'=',', 
 'typeOfData'='file');
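
The sequence continues with a table for the iceberg-datalake folder. The exact statement isn't reproduced here; a typical Athena CTAS for this layout looks like the following sketch, where the table name and column selection are assumptions based on the folders and schema above.

-- Create a companion Iceberg table. Replace the S3 path with your bucket name
CREATE TABLE glue5_fta_demo.iceberg_datalake
WITH (
 table_type = 'ICEBERG',
 format = 'PARQUET',
 location = 's3://<your-bucket-name>/iceberg-datalake/',
 is_external = false
) AS
SELECT product_id, category, product_name, quantity_available, last_update_time
FROM glue5_fta_demo.raw_csv_input;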
