Empowering Enterprise-Grade Data Vaults with Amazon Redshift – Part 2

Chanci Turner

Amazon Redshift stands out as a prominent cloud data warehouse, providing a fully managed cloud service that integrates seamlessly with an organization’s Amazon Simple Storage Service (Amazon S3) data lake, real-time streams, machine learning (ML) workflows, transactional processes, and much more—all while delivering up to 7.9x better price-performance than other cloud data warehouses.

As with all AWS services, Amazon Redshift prioritizes customer needs, acknowledging that no single data model suits every client. This is why Amazon Redshift accommodates various data models, including Star Schemas, Snowflake Schemas, and Data Vault. This post explores the critical requirements in designing an enterprise-grade Data Vault and how Amazon Redshift, along with the AWS cloud, addresses those needs. The first post in this two-part series reviewed best practices for building enterprise-grade data vaults of various scales using Amazon Redshift.

Whether the goal is to maintain data lineage within the data warehouse, establish a source-system-agnostic data model, or simplify compliance with GDPR regulations, customers adopting a data vault model will find value in this post’s discussion of considerations, best practices, and the Amazon Redshift features and related AWS cloud capabilities relevant to building enterprise-grade data vaults. Although creating a basic version of any system may be straightforward, building something that meets enterprise-grade standards for scale, security, resilience, and performance requires knowledge of proven best practices and the correct application of tools and features in the appropriate contexts.

Data Vault Overview

For a succinct overview of the fundamental Data Vault concepts, please refer to the first post in this series.

In the following sections, we will delve into the most prevalent areas of consideration that are vital for Data Vault implementations at scale: data protection, performance and elasticity, analytical functionality, cost and resource management, availability, and scalability. While these areas are also crucial for any data warehouse model, our experience indicates that they possess unique aspects and specific needs to achieve effective data vault implementations at scale.

Data Protection

Security is always the top priority at AWS, and we consistently observe the same level of commitment to security from our clients. Data security encompasses numerous layers and dimensions, from encryption at rest and in transit to fine-grained access controls and beyond. In this section, we will examine the most common data security requirements within both raw and business data vaults alongside the features offered by Amazon Redshift to meet these needs.

Data Encryption

By default, Amazon Redshift encrypts data in transit. You can also enable encryption at rest with a simple configuration change, at any stage in a data warehouse’s lifecycle.

You have the option to utilize either AWS Key Management Service (AWS KMS) or Hardware Security Module (HSM) for data encryption at rest. If you choose AWS KMS, you can select either an AWS managed key or a customer-managed key. For further information, check out Amazon Redshift’s documentation on database encryption.

Additionally, you can adjust cluster encryption settings after the cluster is created. Also, Amazon Redshift Serverless is encrypted by default.
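For example, an existing unencrypted provisioned cluster can be switched to AWS KMS encryption with a single CLI call. The cluster identifier and key ARN below are placeholders for illustration:

aws redshift modify-cluster \
  --cluster-identifier my-redshift-cluster \
  --encrypted \
  --kms-key-id arn:aws:kms:us-east-1:111122223333:key/example-key-id

Amazon Redshift then migrates the cluster’s data to encrypted storage in the background.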

Fine-Grained Access Controls

To achieve fine-grained access controls at scale, Data Vaults typically require both static and dynamic access controls. Static access controls restrict access to databases, tables, rows, and columns for specific users, groups, or roles. Dynamic access controls, on the other hand, can mask part or all of a data item, such as a column, based on a user’s role or specific privileges.

Amazon Redshift has long supported static access controls through the GRANT and REVOKE commands at the database, schema, table, and column level. It also supports row-level security, which further restricts access to particular rows of visible columns, and role-based access control (RBAC) to simplify the management of security privileges in Amazon Redshift.

In the following example, we illustrate how to utilize GRANT and REVOKE statements to implement static access control in Amazon Redshift.

First, create a table and populate it with credit card values:

-- Create the credit cards table
CREATE TABLE credit_cards (
    customer_id INT,
    is_fraud BOOLEAN,
    credit_card TEXT
);

-- Populate the table with sample values
INSERT INTO credit_cards
VALUES
(100, 'n', '453299ABCDEF4842'),
(100, 'y', '471600ABCDEF5888'),
(102, 'n', '524311ABCDEF2649'),
(102, 'y', '601172ABCDEF4675'),
(102, 'n', '601137ABCDEF9710'),
(103, 'n', '373611ABCDEF6352');

Next, create the user user1 and check user1’s permissions on the credit_cards table:

-- Create user
CREATE USER user1 WITH PASSWORD '1234Test!';

-- Check access permissions for user1 on the credit_cards table
SET SESSION AUTHORIZATION user1; 
SELECT * FROM credit_cards; -- This returns a permission denied error

Then grant SELECT access on the credit_cards table to user1:

RESET SESSION AUTHORIZATION;
GRANT SELECT ON credit_cards TO user1;

Finally, verify access permissions on the credit_cards table for user1:

SET SESSION AUTHORIZATION user1;
SELECT * FROM credit_cards; -- Query will return rows
RESET SESSION AUTHORIZATION;
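Row-level security builds on the same foundation. As an illustrative sketch (the analyst role and hide_fraud policy names are hypothetical), the following restricts members of an analyst role to the non-fraudulent rows of the credit_cards table:

-- Create a role and grant it to user1
CREATE ROLE analyst;
GRANT ROLE analyst TO user1;

-- Create an RLS policy that exposes only non-fraudulent rows
CREATE RLS POLICY hide_fraud
WITH (is_fraud BOOLEAN)
USING (is_fraud = FALSE);

-- Attach the policy to the role and turn on row-level security for the table
ATTACH RLS POLICY hide_fraud ON credit_cards TO ROLE analyst;
ALTER TABLE credit_cards ROW LEVEL SECURITY ON;

With this in place, queries issued by members of the analyst role return only rows where is_fraud is false, while other access controls on the table continue to apply.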

Data Obfuscation

While static access controls help create strict boundaries around which user communities can access certain datasets (for instance, only users in the marketing group should access marketing data), what if access needs to be limited to only certain aspects of a field rather than the entire field? Amazon Redshift supports full, partial, or custom masking of a field through dynamic data masking. With masking policies, you control how Amazon Redshift displays sensitive data to users at query time, without altering the underlying data in the database.

In the following example, we achieve complete redaction of credit card numbers at runtime by applying a masking policy to the credit_cards table created earlier.

Create a masking policy that fully masks the credit card number:

CREATE MASKING POLICY mask_credit_card_full 
WITH (credit_card VARCHAR(256)) 
USING ('000000XXXX0000'::TEXT);

Next, attach mask_credit_card_full to the credit_cards table as the default policy. The default policy applies to all users unless a higher-priority masking policy is attached to a specific user or role:

ATTACH MASKING POLICY mask_credit_card_full 
ON credit_cards(credit_card) TO PUBLIC;

Users will see credit card details masked when executing the following query:

SELECT * FROM credit_cards;
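Partial masking follows the same pattern. The following sketch (the auditor role and mask_credit_card_partial policy names are illustrative) reveals only the first 6 and last 4 digits of the card number to members of an auditor role, using a higher priority so it overrides the default full-masking policy:

-- Create a role for users who may see partially masked card numbers
CREATE ROLE auditor;
GRANT ROLE auditor TO user1;

-- Mask the middle digits, keeping the first 6 and last 4 visible
CREATE MASKING POLICY mask_credit_card_partial
WITH (credit_card VARCHAR(256))
USING (SUBSTRING(credit_card, 1, 6) || 'XXXXXX' || SUBSTRING(credit_card, 13, 4));

-- Attach with a higher priority than the default policy
ATTACH MASKING POLICY mask_credit_card_partial
ON credit_cards(credit_card) TO ROLE auditor PRIORITY 10;

When multiple masking policies apply to the same column, Amazon Redshift uses the attached policy with the highest priority for that user.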

Centralized Security Policies

You can achieve considerable scalability by combining static and dynamic access controls across a wide range of user communities, datasets, and access scenarios. But what about datasets shared across multiple Redshift data warehouses? Amazon Redshift data sharing lets you share live data across warehouses, and its integration with AWS Lake Formation allows the security policies governing shared data to be defined and enforced centrally.

