
Creating an efficient data model for an enterprise data warehouse (EDW) traditionally requires considerable effort in design, development, and administration. Moreover, it is essential that the data model remains flexible and responsive to change while effectively managing vast amounts of data.

Data Vault serves as a methodology for expediting the design and implementation of data warehouse projects. Within this framework, the Data Vault 2.0 modeling standards are widely adopted as they focus on business keys and their relationships within the context of business processes. Data Vault allows for the rapid construction of data models through several mechanisms:

  • Entities based on well-defined patterns, each serving a specific purpose
  • Elimination of data silos by representing data in source system-independent structures
  • Parallel data loading with minimal interdependencies
  • Storage of historized data at its most granular level
  • Flexibility to apply business rules independently of data loading
  • Addition of new data sources without impacting the existing model

It is advisable to start from the business requirements when selecting the most appropriate data modeling approach. There are instances where Data Vault may not be the optimal choice for your enterprise data warehouse, and another modeling pattern could prove more effective.

In this article, we illustrate how to implement a Data Vault model in Amazon Redshift and optimize queries using the latest Amazon Redshift features, including compute-storage separation, seamless data sharing, automatic table optimization, and materialized views. For detailed best practices on designing enterprise-grade Data Vaults of varying scale with Amazon Redshift, refer to the accompanying blog series.

Overview of Data Vault Modeling

A data warehouse platform designed with Data Vault typically includes the following four layers:

  1. Staging – This layer contains the most recent changes from source systems. It does not retain historical data, and various transformations can be applied during its population, including data type adjustments, character set conversions, and adding metadata columns for future processing.
  2. Raw Data Vault – This layer stores a historized copy of all data from diverse source systems. At this stage, no filtering or business transformations have taken place; the data is merely stored in structures independent of the source system.
  3. Business Data Vault – While optional, this layer is frequently constructed. It contains business calculations and denormalizations aimed at enhancing speed and ease of access in the consumption layer, known as the Information Mart layer.
  4. Information Mart Layer – This is where data is primarily accessed by users, such as for reporting dashboards or data extracts. Multiple marts can be established from a single Data Vault Integration Layer, with Star/Kimball schemas being the most common modeling choice for these marts.
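
The four layers above can map one-to-one onto database schemas. The following is a minimal sketch; only raw_data_vault appears in the DDL later in this article, and the other schema names are illustrative:

```sql
-- One schema per Data Vault layer. Only raw_data_vault is used in the
-- examples that follow; the other names are illustrative choices.
create schema staging;
create schema raw_data_vault;
create schema business_data_vault;
create schema information_mart;
```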

Transforming a Third Normal Form Transactional Schema into a Data Vault Schema

The following entity relationship diagram exemplifies a transactional model that could be utilized by a sports ticket selling service:

The primary entities in this schema include sporting events, customers, and tickets. A customer can purchase one or more tickets for a sporting event. The transaction is documented by the Ticket Purchase History intersection entity. Furthermore, a sporting event can offer many tickets for sale and is held in a specific city.
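
Because the diagram itself is not reproduced here, the source model described above can be sketched as follows; the schema, table, and column names are illustrative assumptions, not the original diagram:

```sql
-- Illustrative 3NF source tables for the ticket-selling example.
-- All names and columns are assumptions based on the description above.
create table src.city (
  id   integer primary key
 ,name varchar(100)
);
create table src.sport_event (
  id       integer primary key
 ,city_id  integer references src.city(id)   -- event is held in a city
 ,start_ts timestamp
);
create table src.customer (
  id        integer primary key
 ,full_name varchar(100)
);
create table src.ticket (
  id             integer primary key
 ,sport_event_id integer references src.sport_event(id)  -- ticket offered for an event
 ,seat           varchar(10)
);
create table src.ticket_purchase_history (           -- intersection entity
  ticket_id   integer references src.ticket(id)
 ,customer_id integer references src.customer(id)
 ,purchase_ts timestamp
);
```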

To convert this source model into a Data Vault model, we identify the business keys, their descriptive attributes, and the associated business transactions. The three main entity types in the Raw Data Vault model are categorized as follows:

  • Hubs – A collection of business keys identified for each business entity.
  • Links – Business transactions occurring within the modeled process, typically involving two or more business keys (hubs) recorded at a specific time.
  • Satellites – Historized reference data related to either the business key (Hub) or business transaction (Link).

The following example illustrates the transformation of some sporting event entities into the corresponding Raw Data Vault objects.

Hub Entities

The hub entity holds the definitive, deduplicated list of business keys loaded into the Raw Data Vault from all source systems. A business key uniquely identifies a business entity; each key appears in the hub exactly once. In our example, the source system assigns a surrogate key field called Id to represent the business key, which is stored in the hub as the sport_event_id column. Common additional hub columns include Load DateTimeStamp, which captures when the business key was first seen, and Record Source, which records the source system from which the key was initially loaded. Although a surrogate (hash or sequence) primary key column is not mandatory, it is common Data Vault practice to hash the business key, as shown in the following code:

create table raw_data_vault.hub_sport_event 
(
  sport_event_pk  varchar(32) not null  -- MD5 hash of the business key
 ,sport_event_id  integer     not null  -- business key from the source system
 ,load_dts        timestamp   not null  -- when the key was first loaded
 ,record_source   varchar(10) not null  -- originating source system
);

Note the following:

  • The preceding code assumes the MD5 hashing algorithm. If FNV_HASH is used instead, the _PK datatype would be BIGINT.
  • The Id column represents the business key from the source feed, passed into the hashing function for the _PK column.
  • In this instance, there is a single value for the business key. If a compound key is necessary, more than one column can be included.
  • Load_DTS is populated via the staging schema or extract, transform, and load (ETL) code.
  • Record_Source is populated via the staging schema or ETL code.
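
Putting the notes above together, a hub load might look like the following sketch. The staging table staging.sport_event and the record-source value are assumptions for illustration:

```sql
-- Load only business keys not yet present, so the hub remains a
-- distinct list. staging.sport_event is a hypothetical staging table.
insert into raw_data_vault.hub_sport_event
select md5(trim(s.id::varchar)) as sport_event_pk
      ,s.id                     as sport_event_id
      ,current_timestamp        as load_dts
      ,'TICKET_SYS'             as record_source   -- assumed source name
from staging.sport_event s
where not exists (
  select 1
  from raw_data_vault.hub_sport_event h
  where h.sport_event_id = s.id
);
```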

Link Entities

The link entity records the occurrence of two or more business keys in a business transaction, such as purchasing a ticket for a sporting event. Each business key is referenced in its respective hub, and a primary key is generated for the link, typically by hashing the concatenation of all participating business keys (often separated by a delimiter such as ‘^’). Additional columns, like Load DateTimeStamp and Record Source, are included as before, as shown in the following code:

create table raw_data_vault.lnk_ticket_sport_event 
(
  ticket_sport_event_pk varchar(32)  not null  -- MD5 hash of the concatenated business keys
 ,ticket_fk             varchar(32)  not null  -- hash of the ticket business key (references its hub)
 ,sport_event_fk        varchar(32)  not null  -- hash of the sporting event business key (references its hub)
 ,load_dts              timestamp    not null  -- when the relationship was first loaded
 ,record_source         varchar(10)  not null  -- originating source system
);
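
Following the same pattern as the hub load, the link primary key can be derived by hashing the concatenated business keys with the ‘^’ delimiter mentioned above. The staging table staging.ticket, its columns, and the record-source value are assumptions for illustration:

```sql
-- Hash the compound business key (ticket id ^ sport event id) to form
-- the link primary key, and hash each key individually to reference
-- its hub. staging.ticket is a hypothetical staging table.
insert into raw_data_vault.lnk_ticket_sport_event
select md5(trim(s.id::varchar) || '^' || trim(s.sport_event_id::varchar)) as ticket_sport_event_pk
      ,md5(trim(s.id::varchar))             as ticket_fk
      ,md5(trim(s.sport_event_id::varchar)) as sport_event_fk
      ,current_timestamp                    as load_dts
      ,'TICKET_SYS'                         as record_source  -- assumed source name
from staging.ticket s;
```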
