Amazon’s Redshift data sharing feature enables secure and efficient sharing of live data across Redshift clusters for read-only purposes. As a fully managed cloud data warehouse, Amazon Redshift empowers users to analyze vast amounts of data using standard SQL and existing business intelligence (BI) tools. It supports complex analytic queries against terabytes to petabytes of structured data, leveraging advanced query optimization, columnar storage, and massively parallel query execution.
In this article, we explore how to use Amazon Redshift data sharing to isolate workloads across different analytical use cases while still meeting critical business SLAs. For more background on the feature, see Announcing Amazon Redshift data sharing (preview).
How to Utilize Amazon Redshift Data Sharing
Amazon Redshift data sharing allows a producer cluster to share data objects with one or more consumer clusters for read purposes, without copying the data. Workloads isolated on different clusters can therefore share and collaborate on the same live data, which supports analytical services for both internal and external stakeholders. You can share data at multiple levels, including databases, schemas, tables, views, columns, and user-defined functions, which gives you fine-grained access control for the different users and organizations that need access to Amazon Redshift data.
The data sharing process between Amazon Redshift clusters involves two key steps. First, an administrator from the producer cluster creates a data share, a newly introduced named object that serves as a unit of sharing. The producer cluster then incorporates the necessary database objects—such as schemas, tables, and views—into the data share and designates a list of consumer clusters for the shared data. Subsequently, users with appropriate privileges on the consumer clusters create a local database reference from the data share and grant access permissions on the database objects to relevant users and groups. Users can then list the shared objects as a part of standard metadata queries and start querying without delay.
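As a rough illustration of the "standard metadata queries" mentioned above, the following sketch lists inbound data shares and their objects from a consumer cluster. It assumes the SHOW DATASHARES command and the SVV_DATASHARES and SVV_DATASHARE_OBJECTS system views are available on your cluster version; the column selection is only one reasonable choice.

```sql
-- Run on a consumer cluster.

-- List the data shares visible to this cluster.
SHOW DATASHARES;

-- Inbound shares only, with the producer namespace that owns each one.
SELECT share_name, share_type, producer_namespace
FROM svv_datashares
WHERE share_type = 'INBOUND';

-- Objects (schemas, tables, views, functions) exposed through inbound shares.
SELECT share_name, object_type, object_name
FROM svv_datashare_objects
WHERE share_type = 'INBOUND'
ORDER BY share_name, object_type, object_name;
```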
Solution Overview
In our scenario, the producer cluster is a central ETL cluster that hosts enterprise sales data, a 3 TB Cloud DW benchmark dataset based on the TPC-DS benchmark. This cluster caters to multiple BI and data science clusters tailored for different business teams within the organization. One of these teams is sales BI, which runs reports using sales data generated in the ETL cluster and combines it with a product reviews dataset loaded into their managed BI cluster.
This strategy allows the sales BI team to maintain distinct data lifecycle management between the enterprise sales dataset in the ETL producer and the product reviews data they fully control in the BI consumer cluster, simplifying data stewardship. It also enhances agility, allows independent cluster sizing for workload isolation, and establishes a straightforward cost charge-back model.
As illustrated in the following diagram, the central ETL cluster named etl_cluster contains the sales data within a schema labeled sales. A superuser in etl_cluster creates a data share named salesdatashare, incorporates the bi_semantic schema along with all its objects into the data share, and grants usage permissions to the BI consumer cluster identified as bi_cluster. It’s crucial to note that a data share serves merely as a metadata container, representing the shared data from producer to consumer, without any actual data transfer.
The superuser in the BI consumer cluster then creates a local database reference called sales_semantic from the data share. BI users query the product reviews dataset in the local schema named product_reviews and join it with the shared bi_semantic data for reporting. A script for loading the product reviews dataset into bi_cluster accompanies this post, and the DW benchmark dataset can be loaded into etl_cluster from its GitHub repository. The instructions here assume that both datasets have already been loaded into their respective Amazon Redshift clusters.
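The consumer-side steps described above might look like the following sketch. The namespace placeholder, the bi_users group, and the product_reviews.reviews table with its product_title and star_rating columns are assumptions made for illustration; substitute the names used in your own bi_cluster.

```sql
-- Run on bi_cluster as a superuser.
-- The producer namespace GUID can be looked up on etl_cluster with:
--   SELECT current_namespace;

-- Create a local database reference from the data share.
CREATE DATABASE sales_semantic
FROM DATASHARE salesdatashare
OF NAMESPACE '<producer-namespace-guid>';  -- placeholder, not a real value

-- Let BI users query the shared database (bi_users is a hypothetical group).
GRANT USAGE ON DATABASE sales_semantic TO GROUP bi_users;

-- Join local review data with the shared view using three-part notation.
-- product_reviews.reviews and its columns are hypothetical.
SELECT ps.i_product_name,
       AVG(r.star_rating)     AS avg_star_rating,
       SUM(ps.ss_sales_price) AS total_sales
FROM sales_semantic.bi_semantic.product_sales ps
JOIN product_reviews.reviews r
  ON r.product_title = ps.i_product_name
GROUP BY ps.i_product_name
ORDER BY total_sales DESC
LIMIT 10;
```

Because the shared objects are addressed as database.schema.object, no data is copied into bi_cluster; the query reads live data from the producer.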
Building a BI Semantic Layer
A BI semantic layer serves as a representation of enterprise data, simplifying BI reporting needs while enhancing performance. In our example, the semantic layer transforms sales data into a denormalized customer dataset and another dataset capturing all store sales by product for a given year. The following queries are executed on etl_cluster to create the BI semantic layer.
- Create a new schema for housing BI semantic tables:
CREATE SCHEMA bi_semantic;
- Generate a denormalized customer view with the necessary columns for the sales BI team:
CREATE VIEW bi_semantic.customer_denorm AS
SELECT c_customer_sk, c_customer_id, c_birth_year, c_birth_country,
       c_last_review_date_sk,
       ca_city, ca_state, ca_zip, ca_country, ca_gmt_offset,
       cd_gender, cd_marital_status, cd_education_status
FROM sales.customer c
JOIN sales.customer_address ca
  ON c.c_current_addr_sk = ca.ca_address_sk
JOIN sales.customer_demographics cd
  ON c.c_current_cdemo_sk = cd.cd_demo_sk;
- Create a second view for all product sales with the required columns for the BI team:
CREATE VIEW bi_semantic.product_sales AS
SELECT i_item_id, i_product_name, i_current_price, i_wholesale_cost,
       i_brand_id, i_brand, i_category_id, i_category, i_manufact,
       d_date, d_moy, d_year, d_quarter_name,
       ss_customer_sk, ss_store_sk, ss_sales_price, ss_list_price,
       ss_net_profit, ss_quantity, ss_coupon_amt
FROM sales.store_sales ss
JOIN sales.item i
  ON ss.ss_item_sk = i.i_item_sk
JOIN sales.date_dim d
  ON ss.ss_sold_date_sk = d.d_date_sk;
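Before sharing the views, it can help to confirm on etl_cluster that they return what the BI team expects. The query below is a minimal sketch that uses only columns defined in the two views; the year 2002 is an arbitrary example value and may need adjusting to the dates present in your data.

```sql
-- Run on etl_cluster: net profit and units sold by category and customer state
-- for one year, joining the two semantic views on the customer key.
SELECT ps.i_category,
       c.ca_state,
       SUM(ps.ss_net_profit) AS total_net_profit,
       SUM(ps.ss_quantity)   AS total_quantity
FROM bi_semantic.product_sales ps
JOIN bi_semantic.customer_denorm c
  ON ps.ss_customer_sk = c.c_customer_sk
WHERE ps.d_year = 2002  -- example year, an assumption
GROUP BY ps.i_category, c.ca_state
ORDER BY total_net_profit DESC
LIMIT 20;
```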
Sharing Data Across Amazon Redshift Clusters
Now, let’s share the bi_semantic schema from etl_cluster with bi_cluster.
- While connected to etl_cluster, create a data share with the following command. Only superusers and database owners can create data share objects. PUBLICACCESSIBLE defaults to false; if the producer cluster is publicly accessible, you can add PUBLICACCESSIBLE = true to the command:
CREATE DATASHARE SalesDatashare;
- Add the BI semantic views to the data share. Add the schema to the data share before adding any objects from it. You can use ALTER DATASHARE to share an entire schema; to share tables, views, and functions in a specific schema; and to share objects from multiple schemas.
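A minimal sketch of this step and the grant to the consumer follows, assuming the ALTER DATASHARE and GRANT USAGE ON DATASHARE syntax from the Amazon Redshift documentation. The consumer namespace GUID is a placeholder, and the final statement applies only if the producer cluster is publicly accessible.

```sql
-- Run on etl_cluster as a superuser or the database owner.

-- Add the schema first, then the objects inside it.
ALTER DATASHARE SalesDatashare ADD SCHEMA bi_semantic;

-- ALL TABLES IN SCHEMA covers both tables and views in the schema.
ALTER DATASHARE SalesDatashare ADD ALL TABLES IN SCHEMA bi_semantic;

-- Grant the consumer cluster (bi_cluster) usage on the data share.
-- Look up the GUID on bi_cluster with: SELECT current_namespace;
GRANT USAGE ON DATASHARE SalesDatashare
TO NAMESPACE '<consumer-namespace-guid>';  -- placeholder, not a real value

-- Optional: only if the producer cluster is publicly accessible.
ALTER DATASHARE SalesDatashare SET PUBLICACCESSIBLE TRUE;
```

To share a single object instead of the whole schema, ALTER DATASHARE also accepts the form ADD TABLE bi_semantic.product_sales.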