Welcome to the age of data. The volume of information generated every day keeps growing, and platforms and solutions must evolve to keep pace. Services like Amazon Simple Storage Service (Amazon S3) offer a scalable, cost-effective way to manage these growing datasets. The Amazon Sustainability Data Initiative (ASDI) leverages Amazon S3's capabilities to store and share climate science datasets globally, free of charge to users, and the AWS Open Data Sponsorship Program enables organizations to host their data on AWS at no cost.
Over the past decade, data science frameworks have proliferated, gaining widespread acceptance within the data science community. One notable framework is Dask, known for its ability to orchestrate worker compute nodes, thereby speeding up complex analyses on large datasets.
In this article, we will guide you through deploying a custom AWS Cloud Development Kit (AWS CDK) solution that enhances Dask’s functionality for inter-Regional operations across Amazon’s extensive network. The AWS CDK solution establishes a network of Dask workers across two AWS Regions, connecting to a client Region. For further details, refer to our Guidance for Distributed Computing with Cross Regional Dask on AWS and explore the open-source code available in our GitHub repository.
Upon deployment, users will access a Jupyter notebook, enabling interaction with two datasets from ASDI on AWS: Coupled Model Intercomparison Project 6 (CMIP6) and ECMWF ERA5 Reanalysis. CMIP6 focuses on the sixth phase of global coupled ocean-atmosphere general circulation model ensembles, while ERA5 represents the fifth generation of ECMWF atmospheric reanalyses of the global climate, marking the first reanalysis produced as an operational service.
This solution was inspired by a collaboration with a key AWS partner, the National Weather Bureau. Founded in 1854, the Bureau serves as the national meteorological service, providing weather and climate predictions to help individuals make informed decisions. A partnership between the Bureau and EUMETSAT, detailed in “Data Proximate Computation on a Dask Cluster Distributed Between Data Centres”, emphasizes the necessity for a sustainable, efficient, and scalable data science solution. This approach brings compute resources closer to the data, rather than requiring data to be moved closer to computational resources, which incurs added costs and latency.
Solution Overview
Daily, the National Weather Bureau generates approximately 300 TB of weather and climate data, with a portion published to ASDI. These datasets are globally distributed and made available for public use. The Bureau aims to empower consumers to leverage their data for critical decision-making, addressing challenges such as improved preparation for climate change-induced wildfires and floods, and reducing food insecurity through enhanced crop yield analysis.
Current solutions, particularly in climate data management, are often time-consuming and inefficient because they replicate datasets across Regions. This unnecessary data transfer at petabyte scale is costly, slow, and energy-intensive. We estimate that if the Bureau's users adopted this data-proximate approach instead, they could save the equivalent of 40 households' daily power consumption each day, while also reducing data transfer between Regions.
The architecture of this solution can be segmented into three main components: client, workers, and network. Let’s examine each segment and how they integrate.
Client
The client represents the source Region where data scientists connect. This Region (Region A in the diagram) includes an Amazon SageMaker notebook, an Amazon OpenSearch Service domain, and a Dask scheduler as key elements. System administrators can access the built-in Dask dashboard through an Elastic Load Balancer.
Data scientists can utilize the Jupyter notebook hosted on SageMaker to connect and execute workloads on the Dask scheduler. The OpenSearch Service domain maintains metadata about the datasets linked across the Regions. Users can query this service to retrieve information such as the appropriate Region of Dask workers without needing prior knowledge of the data’s regional location.
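To make that lookup concrete, here is a minimal sketch using the OpenSearch JavaScript client (shown in TypeScript to match the rest of the code base); the domain endpoint, index naming, and document fields are illustrative assumptions, not the solution's actual schema, and the notebook shipped with the solution may implement this differently.
// Sketch: ask the OpenSearch domain which worker Region hosts a given dataset.
import { Client } from '@opensearch-project/opensearch';
const client = new Client({
  node: 'https://my-metadata-domain.us-east-1.es.amazonaws.com', // placeholder endpoint; authentication omitted for brevity
});
async function findWorkerRegion(datasetName: string): Promise<string | undefined> {
  // Each worker Region maintains its own index of dataset metadata.
  const response = await client.search({
    index: 'worker-datasets-*', // assumed per-Region index naming convention
    body: {
      query: { match: { dataset: datasetName } },
    },
  });
  const hit = response.body.hits.hits[0];
  return hit?._source?.region; // the Region whose Dask workers hold the data
}
findWorkerRegion('cmip6').then((region) => console.log(`Submit this query to workers in: ${region}`));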
Worker
Each worker Region (Regions B and C in the diagram) consists of an Amazon Elastic Container Service (Amazon ECS) cluster of Dask workers, an Amazon FSx for Lustre file system, and a standalone Amazon Elastic Compute Cloud (Amazon EC2) instance. FSx for Lustre links to the datasets' S3 buckets so that Dask workers can read Amazon S3 data through a high-performance file system, with sub-millisecond latencies, hundreds of gigabytes per second of throughput, and millions of IOPS. A key feature of Lustre is that only the file system's metadata is synchronized with S3; file contents are loaded on demand, when a worker first requests them.
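As a rough sketch (not the repository's actual stack) of how a worker Region's file system might be declared with the AWS CDK, the following links an FSx for Lustre file system to a dataset bucket through an import path; the bucket name, storage capacity, and deployment type are placeholder assumptions.
import { Stack, StackProps } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as fsx from 'aws-cdk-lib/aws-fsx';
import * as s3 from 'aws-cdk-lib/aws-s3';
export class WorkerFileSystemStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);
    const vpc = new ec2.Vpc(this, 'WorkerVpc');
    // Placeholder name -- substitute the public dataset bucket for this worker Region.
    const dataBucket = s3.Bucket.fromBucketName(this, 'DatasetBucket', 'my-climate-dataset-bucket');
    new fsx.LustreFileSystem(this, 'WorkerFileSystem', {
      vpc,
      vpcSubnet: vpc.privateSubnets[0],
      storageCapacityGiB: 1200, // placeholder capacity
      lustreConfiguration: {
        deploymentType: fsx.LustreDeploymentType.SCRATCH_2,
        // Link the file system to the bucket: only metadata is imported up front;
        // file contents are pulled from S3 lazily when a worker first reads them.
        importPath: dataBucket.s3UrlForObject(),
      },
    });
  }
}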
The worker clusters dynamically scale according to CPU usage, provisioning additional workers during peak demand and scaling down when resources are idle.
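A minimal sketch of that behavior, assuming an existing Fargate-based Dask worker service (the capacity bounds, target utilization, and cooldowns below are illustrative values, not the solution's actual settings):
import { Duration } from 'aws-cdk-lib';
import * as ecs from 'aws-cdk-lib/aws-ecs';
// Attach target-tracking scaling to the Dask worker service so the task count
// follows CPU utilization: scale out under load, scale in when workers sit idle.
export function addWorkerScaling(workerService: ecs.FargateService): void {
  const scaling = workerService.autoScaleTaskCount({ minCapacity: 2, maxCapacity: 20 });
  scaling.scaleOnCpuUtilization('DaskWorkerCpuScaling', {
    targetUtilizationPercent: 70,          // add tasks when sustained CPU exceeds 70%
    scaleOutCooldown: Duration.minutes(1), // react quickly to load spikes
    scaleInCooldown: Duration.minutes(5),  // drain idle workers more gradually
  });
}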
Each night at 00:00 UTC, a data synchronization job prompts the Lustre file system to resync with its attached S3 bucket and refresh the metadata catalog. This information is then pushed to that Region's index in the OpenSearch Service domain, which is what gives the client the details it needs to choose the pool of workers for a specific dataset.
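In CDK terms, that trigger can be sketched as an Amazon EventBridge rule on a cron schedule invoking a sync function; the function itself (which would start the FSx data repository task and update the Region's OpenSearch index) is assumed to be defined elsewhere and is not shown here.
import * as events from 'aws-cdk-lib/aws-events';
import * as targets from 'aws-cdk-lib/aws-events-targets';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import { Construct } from 'constructs';
// Invoke the (separately defined) sync function every night at 00:00 UTC.
export function scheduleNightlySync(scope: Construct, syncFunction: lambda.IFunction): events.Rule {
  return new events.Rule(scope, 'NightlyLustreSync', {
    schedule: events.Schedule.cron({ minute: '0', hour: '0' }),
    targets: [new targets.LambdaFunction(syncFunction)],
  });
}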
Network
Networking is fundamental to this solution, utilizing Amazon’s internal backbone network. By implementing AWS Transit Gateway, we effectively connect each of the Regions without traversing the public internet. Each worker can dynamically connect to the Dask scheduler, allowing data scientists to execute inter-regional queries through Dask.
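The sketch below, using the CloudFormation-level Transit Gateway constructs in CDK, shows the general shape of that wiring: each Region's stack creates a Transit Gateway, attaches it to the Region's VPC, and peers it with the gateway in another Region. Route tables, the peering accepter, and security groups are omitted, and the peer gateway ID and Region are passed in as placeholders; the repository's actual network stack is more involved.
import { Stack, StackProps } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
export interface RegionNetworkStackProps extends StackProps {
  peerTransitGatewayId: string; // the other Region's Transit Gateway (placeholder)
  peerRegion: string;           // e.g. a worker Region (placeholder)
}
export class RegionNetworkStack extends Stack {
  constructor(scope: Construct, id: string, props: RegionNetworkStackProps) {
    super(scope, id, props);
    const vpc = new ec2.Vpc(this, 'RegionVpc');
    // One Transit Gateway per Region, attached to that Region's VPC.
    const tgw = new ec2.CfnTransitGateway(this, 'RegionTgw', {
      description: 'Inter-Region backbone for Dask traffic',
    });
    new ec2.CfnTransitGatewayAttachment(this, 'VpcAttachment', {
      transitGatewayId: tgw.ref,
      vpcId: vpc.vpcId,
      subnetIds: vpc.privateSubnets.map((subnet) => subnet.subnetId),
    });
    // Peer this Region's Transit Gateway with the other Region's, keeping
    // traffic on the AWS backbone instead of the public internet.
    new ec2.CfnTransitGatewayPeeringAttachment(this, 'CrossRegionPeering', {
      transitGatewayId: tgw.ref,
      peerTransitGatewayId: props.peerTransitGatewayId,
      peerRegion: props.peerRegion,
      peerAccountId: this.account,
    });
  }
}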
Prerequisites
The AWS CDK package utilizes the TypeScript programming language. Follow the Getting Started guide for AWS CDK to set up your local environment and bootstrap your development account (make sure to bootstrap all Regions specified in the GitHub repository).
For successful deployment, Docker must be installed and running on your local machine.
Deploy the AWS CDK Package
Deploying an AWS CDK package is simple. After installing the prerequisites and bootstrapping your account, you can proceed to download the code base.
Download the GitHub repository:
# Command to clone the repository
git clone https://github.com/aws-solutions-library-samples/distributed-compute-on-aws-with-cross-regional-dask.git
cd distributed-compute-on-aws-with-cross-regional-dask
Install node modules:
npm install
Deploy the AWS CDK:
npx cdk deploy --all
The stack deployment may take over an hour and a half.
Code Walkthrough
In this section, we will review some key features of the code base. For a complete inspection, please refer to the GitHub repository.
Configure and Customize Your Stack
In the file bin/variables.ts, you will find two variable declarations: one for the client and one for the workers. The client declaration is a dictionary referencing a Region and a CIDR range; customizing these values changes the Region and CIDR range used for the client resources. The worker variable mirrors this structure, but as a list of dictionaries, so that datasets (and their Regions) can be added or removed as the user wishes.
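A rough sketch of that file's shape (field names, Regions, and CIDR ranges here are illustrative assumptions; see the repository for the real definitions):
// bin/variables.ts -- illustrative shape only
export interface RegionConfig {
  region: string;       // AWS Region code
  cidr: string;         // non-overlapping CIDR range for that Region's VPC
  datasets?: string[];  // hypothetical field: dataset buckets to link in a worker Region
}
// The client Region hosts the notebook, the OpenSearch Service domain, and the Dask scheduler.
export const client: RegionConfig = {
  region: 'us-east-1', // placeholder
  cidr: '10.0.0.0/16', // placeholder
};
// One entry per worker Region; add or remove entries to change which datasets
// (and therefore which Regions) the solution connects to.
export const workers: RegionConfig[] = [
  { region: 'us-west-2', cidr: '10.1.0.0/16', datasets: ['cmip6-dataset-bucket'] }, // placeholders
  { region: 'eu-west-1', cidr: '10.2.0.0/16', datasets: ['era5-dataset-bucket'] },  // placeholders
];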