Migrating from Greenplum to Amazon Redshift: Part 2


In the second installment of our multi-part series, we outline effective strategies for selecting the ideal Amazon Redshift cluster, designing data architecture, converting stored procedures, and optimizing SQL functions and queries commonly used in these migrations. If you missed the first post, check it out for valuable insights on planning, executing, and validating a large-scale migration from Greenplum to Amazon Redshift using the AWS Schema Conversion Tool (AWS SCT).

Selecting the Right Amazon Redshift Cluster

Amazon Redshift offers two cluster types: provisioned and serverless. Provisioned clusters require you to configure the necessary compute resources, while Redshift Serverless provides high-performance analytics without the need for managing data warehouse infrastructure. For further details, see the article on Introducing Amazon Redshift Serverless – Run Analytics At Any Scale Without Having to Manage Data Warehouse Infrastructure.

An Amazon Redshift cluster is made up of nodes, including a leader node and multiple compute nodes. The leader node processes queries from client applications, formulates execution plans, and coordinates parallel execution with compute nodes, aggregating results before returning them to the clients.
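To make the two options concrete, the following boto3 sketch provisions one RA3 cluster and one Redshift Serverless namespace and workgroup. The identifiers, node type, base capacity, database name, and credentials are illustrative placeholders, not values prescribed by this migration.

```python
# Minimal boto3 sketch: one provisioned RA3 cluster and one Redshift
# Serverless namespace/workgroup. All names, sizes, and credentials are
# placeholders.
import boto3

redshift = boto3.client("redshift")
serverless = boto3.client("redshift-serverless")

# Provisioned: you choose the node type and node count up front.
redshift.create_cluster(
    ClusterIdentifier="greenplum-migration-poc",   # placeholder
    NodeType="ra3.4xlarge",                        # RA3 with managed storage
    ClusterType="multi-node",
    NumberOfNodes=2,
    DBName="analytics",
    MasterUsername="admin",
    MasterUserPassword="...",                      # placeholder
)

# Serverless: you set a base capacity (in RPUs) and Redshift manages the rest.
serverless.create_namespace(
    namespaceName="migration-poc",
    adminUsername="admin",
    adminUserPassword="...",                       # placeholder
    dbName="analytics",
)
serverless.create_workgroup(
    workgroupName="migration-poc-wg",
    namespaceName="migration-poc",
    baseCapacity=32,                               # RPUs; placeholder value
)
```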

When selecting your cluster type, consider the following:

  • Assess your input data size (after compression) along with your vCPU and performance requirements. We currently recommend Amazon Redshift RA3 instances with managed storage, which let you scale compute and storage independently for improved query performance.
  • Amazon Redshift features an automated “Help me choose” tool to assist in selecting a cluster based on your data size.
  • One major advantage of utilizing a cloud-based Amazon Redshift data warehouse is the flexibility it offers compared to traditional hardware setups. You can experiment with various cluster configurations to find the most cost-effective and high-performing option.
  • During the initial development phase or pilot testing, it’s often feasible to begin with a smaller number of nodes. As you transition to production, you can adjust node counts based on usage patterns. For the best cost efficiency, consider purchasing reserved nodes once your workload is steady. The open-source Simple Replay utility can help assess performance across different cluster types and sizes by replaying your existing workloads.

Amazon Redshift also supports resizing, pausing, and resuming clusters based on workload patterns, along with concurrency scaling and workload management (WLM) for optimizing memory and query concurrency.
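These operations are also available through the API. As a sketch, the boto3 calls below pause a cluster during idle hours and resize it ahead of a heavy load window; the cluster identifier, node type, and node count are placeholders.

```python
# Minimal boto3 sketch of pause/resume and elastic resize.
# The cluster identifier, node type, and node count are placeholders.
import boto3

redshift = boto3.client("redshift")

# Pause the cluster during idle hours to save on compute costs.
redshift.pause_cluster(ClusterIdentifier="greenplum-migration-poc")

# Resume it before the next processing window.
redshift.resume_cluster(ClusterIdentifier="greenplum-migration-poc")

# Resize to add capacity ahead of a heavy workload.
redshift.resize_cluster(
    ClusterIdentifier="greenplum-migration-poc",
    ClusterType="multi-node",
    NodeType="ra3.4xlarge",
    NumberOfNodes=4,
    Classic=False,   # False requests an elastic resize where supported
)
```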

Creating Data Extraction Tasks with AWS SCT

Using AWS SCT extraction agents, you can migrate source tables in parallel. These agents authenticate through valid credentials on the data source, allowing you to manage the resources available for that user during extraction. The agents process data locally and upload it to Amazon Simple Storage Service (Amazon S3) via the network (preferably using AWS Direct Connect). We recommend ensuring consistent network bandwidth between your Greenplum system and your AWS region.

For tables with more than about 20 million rows or 1 TB in size, consider using AWS SCT’s virtual partitioning feature, which splits the extraction of a table into multiple sub-tasks that run in parallel. We therefore recommend creating two task groups for each schema you migrate: one for smaller tables and one for larger tables that use virtual partitions. For additional details, refer to Creating, running, and monitoring an AWS SCT data extraction task.
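To decide which tables belong in the larger task group, you can query the Greenplum catalog for row-count estimates. The sketch below uses psycopg2 and the standard pg_class.reltuples planner estimate (refreshed by ANALYZE); the connection details are placeholders, and the 20-million-row threshold mirrors the guidance above.

```python
# Sketch: list candidate tables for AWS SCT virtual partitioning based on
# the planner's row-count estimates in the Greenplum catalog.
# Connection details are placeholders; run ANALYZE first so reltuples is current.
import psycopg2

ROW_THRESHOLD = 20_000_000  # tables above this are candidates for virtual partitions

conn = psycopg2.connect(
    host="greenplum-master.example.com",  # placeholder
    dbname="warehouse",
    user="sct_extract",
    password="...",
    port=5432,
)

query = """
    SELECT n.nspname AS schema_name,
           c.relname AS table_name,
           c.reltuples::bigint AS estimated_rows
    FROM pg_class c
    JOIN pg_namespace n ON n.oid = c.relnamespace
    WHERE c.relkind = 'r'
      AND n.nspname NOT IN ('pg_catalog', 'information_schema', 'gp_toolkit')
      AND c.reltuples > %s
    ORDER BY c.reltuples DESC;
"""

with conn.cursor() as cur:
    cur.execute(query, (ROW_THRESHOLD,))
    for schema_name, table_name, estimated_rows in cur.fetchall():
        print(f"{schema_name}.{table_name}: ~{estimated_rows:,} rows")

conn.close()
```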

Data Architecture Considerations

To modernize and streamline your data architecture, keep in mind the following principles:

  • Assign accountability and authority to uphold enterprise data standards and policies.
  • Formalize the operating model for data and analytics across enterprise and business units.
  • Simplify your data technology ecosystem through the rationalization of tools and assets.
  • Foster organizational structures that enhance integration between business and delivery teams, leading to the development of data-oriented products that address business challenges.
  • Regularly back up data to ensure recoverability in case of issues.
  • Throughout the planning, execution, and maintenance phases, implement data quality management to achieve desired outcomes.
  • Embrace simplicity to create intuitive, efficient, and cost-effective solutions; simple processes scale more effectively than complex ones. For instance, when loading incremental data, identify a watermark and process only the new or changed rows rather than truncating and reloading the entire table, as shown in the sketch below.
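A minimal sketch of that watermark pattern follows. The run_on_redshift(), run_on_greenplum(), and load_delta_to_redshift() helpers are hypothetical wrappers around your own connections and loaders, and the table and column names are illustrative.

```python
# Hypothetical watermark-driven incremental load. run_on_redshift(),
# run_on_greenplum(), and load_delta_to_redshift() stand in for your own
# connection and loading helpers (assumed to return/accept lists of dict
# rows); the table and column names are illustrative.

def incremental_load() -> None:
    # 1. Read the watermark (the last successfully loaded change timestamp)
    #    from a small control table in Amazon Redshift.
    rows = run_on_redshift(
        "SELECT last_loaded_at FROM etl.load_watermark WHERE table_name = 'orders'"
    )
    watermark = rows[0]["last_loaded_at"]

    # 2. Pull only the rows that changed after the watermark from Greenplum.
    delta = run_on_greenplum(
        "SELECT * FROM sales.orders WHERE updated_at > %s", (watermark,)
    )
    if not delta:
        return

    # 3. Stage and merge the delta (for example, COPY from Amazon S3 into a
    #    staging table, then delete-and-insert or MERGE into the target).
    load_delta_to_redshift(delta)

    # 4. Advance the watermark only after the load commits, using the newest
    #    change timestamp in the delta so no rows are skipped on the next run.
    new_watermark = max(row["updated_at"] for row in delta)
    run_on_redshift(
        "UPDATE etl.load_watermark SET last_loaded_at = %s WHERE table_name = 'orders'",
        (new_watermark,),
    )
```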

Depending on your use case, you may require record-level operations for privacy compliance, simplified file management, or near-real-time data access. In such cases, consider purpose-built options such as Apache Hudi running on Amazon EMR or AWS Glue.
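As a sketch of what that can look like, the following PySpark snippet performs a record-level upsert with Apache Hudi. It assumes the Hudi libraries are already available to your Amazon EMR or AWS Glue job, and the S3 paths, table name, and key columns are illustrative.

```python
# Minimal PySpark sketch of a record-level upsert with Apache Hudi.
# Paths, table name, and key/partition columns are illustrative assumptions;
# the Hudi libraries must already be on the job's classpath.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-upsert-example").getOrCreate()

# Assumed incremental input containing new and changed records.
incoming = spark.read.parquet("s3://my-bucket/incoming/orders/")

hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.precombine.field": "updated_at",
    "hoodie.datasource.write.partitionpath.field": "order_date",
    "hoodie.datasource.write.operation": "upsert",
}

(incoming.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://my-bucket/hudi/orders/"))
```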

Migrating Stored Procedures

Here, we discuss best practices for migrating stored procedures from Greenplum to Amazon Redshift. Many data processing pipelines rely on stored procedures to execute complex business logic. We recommend modernizing your extract, transform, and load (ETL) processes using big data solutions like AWS Glue or Amazon EMR. For more insights, check out the Top 8 Best Practices for High-Performance ETL Processing Using Amazon Redshift.

If a time-sensitive migration to a cloud-native data warehouse like Amazon Redshift is necessary, moving your stored procedures from Greenplum directly to Redshift may be a practical solution. To ensure successful migration, adhere to Amazon Redshift’s best practices for stored procedures:

  • Specify the schema name upon creating stored procedures to enhance schema-level security and manage access controls.
  • To avoid naming conflicts, prefix your procedure names with “sp_.” This ensures that your procedure names do not clash with existing or future Amazon Redshift procedures.
  • Always qualify your database objects with the schema name within the stored procedure.
  • Adhere to the minimal required access principle and revoke unnecessary permissions. Ensure that stored procedure execution rights are not granted to ALL.
  • The SECURITY attribute governs a procedure’s access to database objects. When creating a stored procedure, set the SECURITY attribute to either DEFINER or INVOKER, depending on your access needs.
