Automated Data Governance with AWS Glue Data Quality, Sensitive Data Detection, and AWS Lake Formation

Data governance ensures that an organization’s data remains accurate, accessible, usable, and secure. Given the massive influx of data into data lakes, establishing and maintaining governance policies becomes increasingly complex. The two primary pillars of effective data governance are data confidentiality and data quality. Data confidentiality emphasizes the safeguarding of sensitive information, particularly personally identifiable information (PII), against unauthorized access. Meanwhile, data quality is concerned with ensuring that data is accurate, consistent, and reliable, as poor data quality can lead to misguided decisions and deteriorating business performance.

Organizations must uphold data confidentiality throughout their data pipelines while ensuring that high-quality data is readily available to consumers. Much of this work is still performed manually, with data owners and stewards statically defining and implementing policies for each dataset. This laborious process can hinder data adoption across the enterprise.

This post demonstrates how to leverage AWS Glue alongside AWS Glue Data Quality, sensitive data detection transforms, and AWS Lake Formation’s tag-based access control to automate data governance.

Solution Overview

For illustration, we consider a hypothetical company, DataSphere. DataSphere operates multiple ingestion pipelines that populate various tables within its data lake and aims to ensure governance through robust data quality rules and access policies.

Multiple user personas within DataSphere, including business leaders, data scientists, data analysts, and data engineers, require different governance levels. Business leaders need highly accurate data, while data scientists must avoid PII and work with data that meets a specific quality threshold for model training. Data engineers should have access to all data except for PII.

Currently, these governance requirements are hard-coded and manually managed, and DataSphere wants to automate them. They are looking for a solution that ensures the following:

  • Governance policies, such as data quality checks and access controls, are automatically applied to any new data and tables added to the data lake. Data becomes accessible to consumers only once it is certified for consumption, with basic data quality checks applied to new tables and access granted based on their quality score.
  • Existing data profiles may change as source data evolves, necessitating automatic adjustments to governance. For instance, if a column previously marked as public is found to contain sensitive data, it should be reclassified as sensitive and access restricted for unauthorized users (see the detection sketch after this list).
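
To make that second requirement concrete, the following is a minimal sketch of how sensitive data detection could run inside an AWS Glue PySpark job. The database, table, entity types, and output column name are illustrative, and the EntityDetector call mirrors the Glue Studio Detect PII transform, so verify the method signature and the managed entity-type names against your Glue version.

```python
from awsglue.context import GlueContext
from awsglueml.transforms import EntityDetector  # Glue's sensitive data detection transform
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Load the target table from the Data Catalog (names are illustrative).
customers = glue_context.create_dynamic_frame.from_catalog(
    database="datasphere_autogov_temp", table_name="customers"
)

# Scan for common PII entity types; findings land in a new column so a
# downstream step can tag the affected columns and tables as sensitive.
detected = EntityDetector().detect(
    customers,
    ["PERSON_NAME", "EMAIL"],
    "DetectedEntities",
)
```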

The governance policies for this scenario include:

  • No PII data should exist in tables or columns designated as public.
  • Columns containing PII must be marked as sensitive, and the associated tables must also be tagged as sensitive.
  • The following data quality rules should apply to all tables (a DQDL sketch follows this list):
    • Each table must contain a minimum set of columns: data_key, data_load_date, and data_location.
    • The data_key must be unique and complete.
    • The data_location must match locations defined in a separate reference table.
    • The data_load_date column must be complete.
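
These baseline rules map naturally onto AWS Glue Data Quality's rule language, DQDL. The sketch below creates and evaluates such a ruleset with boto3; the ruleset name, role ARN, and reference table are placeholders, and the ReferentialIntegrity rule assumes the reference dataset is supplied under the alias reference at evaluation time.

```python
import boto3

glue = boto3.client("glue")

# Baseline DQDL rules derived from the governance policy above.
ruleset = """Rules = [
    ColumnExists "data_key",
    ColumnExists "data_load_date",
    ColumnExists "data_location",
    IsUnique "data_key",
    IsComplete "data_key",
    IsComplete "data_load_date",
    ReferentialIntegrity "data_location" "reference.data_location" = 1.0
]"""

glue.create_data_quality_ruleset(
    Name="baseline-governance-rules",  # illustrative name
    Ruleset=ruleset,
    TargetTable={"DatabaseName": "datasphere_autogov_temp", "TableName": "customers"},
)

glue.start_data_quality_ruleset_evaluation_run(
    DataSource={
        "GlueTable": {"DatabaseName": "datasphere_autogov_temp", "TableName": "customers"}
    },
    Role="arn:aws:iam::123456789012:role/GlueDataQualityRole",  # placeholder role ARN
    RulesetNames=["baseline-governance-rules"],
    # Reference dataset backing the ReferentialIntegrity rule (assumed alias and table name).
    AdditionalDataSources={
        "reference": {
            "GlueTable": {
                "DatabaseName": "datasphere_autogov_temp",
                "TableName": "reference_locations",
            }
        }
    },
)
```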

Access to tables will be controlled based on user categories:

| User Category | Can Access Sensitive Tables | Can Access Sensitive Columns | Minimum Data Quality Threshold |
|---------------|-----------------------------|------------------------------|--------------------------------|
| Category 1    | Yes                         | Yes                          | 100%                           |
| Category 2    | Yes                         | No                           | 50%                            |
| Category 3    | No                          | No                           | 0%                             |

In this post, we utilize AWS Glue Data Quality and sensitive data detection features along with Lake Formation tag-based access control to effectively manage access at scale.

The architecture diagram illustrates how the governance requirements translate into Lake Formation LF-Tags, ensuring compliance with the defined policies.
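
As a rough illustration of how such policies can be expressed programmatically, the following boto3 sketch defines LF-Tags, attaches one to a table, and grants tag-based SELECT access. The tag keys, tag values, and principal ARN are assumptions for this example rather than the exact tags used by the solution.

```python
import boto3

lf = boto3.client("lakeformation")

# Illustrative LF-Tags; key names and values are assumptions for this sketch.
lf.create_lf_tag(TagKey="Confidentiality", TagValues=["sensitive", "public"])
lf.create_lf_tag(TagKey="DataQuality", TagValues=["passed", "failed"])

# Tag a table according to the detection and data quality results.
lf.add_lf_tags_to_resource(
    Resource={"Table": {"DatabaseName": "datasphere_autogov_temp", "Name": "customers"}},
    LFTags=[{"TagKey": "Confidentiality", "TagValues": ["sensitive"]}],
)

# Category 3 users may only query public tables, so the grant is expressed
# against the tag policy rather than against individual tables.
lf.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/Category3"},  # placeholder ARN
    Resource={
        "LFTagPolicy": {
            "ResourceType": "TABLE",
            "Expression": [{"TagKey": "Confidentiality", "TagValues": ["public"]}],
        }
    },
    Permissions=["SELECT"],
)
```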

This post employs AWS Step Functions to orchestrate the governance jobs, but any orchestration tool can be used. To keep the walkthrough simple, we simulate data ingestion by manually placing files in an Amazon S3 bucket, which triggers the Step Functions state machine. In practical scenarios, the governance jobs can be integrated into a data ingestion pipeline or triggered by events such as AWS Glue crawler runs or Amazon S3 events.
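
For example, a lightweight AWS Lambda function subscribed to the bucket's object-created notifications could start the state machine for each newly landed file. This is a sketch only; the environment variable name is an assumption, and the event handling assumes the standard S3 notification format.

```python
import json
import os

import boto3

sfn = boto3.client("stepfunctions")


def handler(event, context):
    """Start the governance state machine for each object landed in S3."""
    for record in event.get("Records", []):
        payload = {
            "bucket": record["s3"]["bucket"]["name"],
            "key": record["s3"]["object"]["key"],
        }
        sfn.start_execution(
            stateMachineArn=os.environ["GOVERNANCE_STATE_MACHINE_ARN"],  # assumed env var
            input=json.dumps(payload),
        )
```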

An AWS Glue database named datasphere_autogov_temp and a target table named customers will be used to apply the governance rules. AWS CloudFormation is employed to provision resources, allowing management of infrastructure as code.

Prerequisites

Before you begin, ensure the following:

  1. Choose an AWS Region for resource creation and maintain consistency throughout the setup.
  2. Have an administrator role in Lake Formation to execute the CloudFormation template and grant permissions.

Sign in to the Lake Formation console and confirm that you are a Lake Formation data lake administrator. If this is your first time setting up Lake Formation in the Region, you can designate yourself in the pop-up window that appears when you first open the console. Otherwise, navigate to Administrative roles and tasks in the Lake Formation console to add data lake administrators.

Deploying the CloudFormation Stack

Launch the provided CloudFormation stack to set up the necessary resources, specifying a unique S3 bucket name and passwords for the three user personas (Category 1, Category 2, and Category 3). The stack provisions an S3 bucket that stores dummy data, AWS Glue scripts, sensitive data detection results, and Amazon Athena query results.
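
If you prefer to launch the stack programmatically rather than through the console, a boto3 call along the following lines would work. The stack name, template URL, and parameter keys are placeholders to adapt to the actual template.

```python
import boto3

cfn = boto3.client("cloudformation")

# Parameter names below are illustrative; match them to the template you deploy.
cfn.create_stack(
    StackName="datasphere-autogov",
    TemplateURL="https://example-bucket.s3.amazonaws.com/governance-template.yaml",  # placeholder
    Parameters=[
        {"ParameterKey": "DataLakeBucketName", "ParameterValue": "datasphere-autogov-<unique-suffix>"},
        {"ParameterKey": "Category1UserPassword", "ParameterValue": "<password>"},
        {"ParameterKey": "Category2UserPassword", "ParameterValue": "<password>"},
        {"ParameterKey": "Category3UserPassword", "ParameterValue": "<password>"},
    ],
    Capabilities=["CAPABILITY_NAMED_IAM"],  # the stack creates IAM principals
)
```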
