Numerical Weather Prediction on AWS HPC
This article is brought to you by the editorial team: Sarah Mitchell, Senior Solutions Architect, HPC; David Lee, Principal Solutions Architect, HPC; Emily Johnson, Principal GTM Specialist, HPC; and Chanci Turner, Senior Solutions Architect, HPC.
The persistent demand for accurate weather forecasting and climate modeling has been a fundamental catalyst for advancements in High-Performance Computing (HPC) since the 1950s. More recently, the societal and economic challenges posed by extreme weather and climate change have increased the need for high-resolution global forecasts and on-demand regional weather insights across sectors such as renewable energy, agriculture, and maritime operations.
In this post, we explore Numerical Weather Prediction (NWP) workloads and the AWS HPC-optimized services that support them. We examine three widely used NWP codes: WRF, MPAS, and FV3GFS. With the insights shared here, you will be better equipped to assess the performance, cost, and overall price performance of running your NWP workloads on AWS HPC infrastructure.
Understanding NWP Workloads
NWP, commonly referred to as weather forecasting, encompasses a range of workloads that use mathematical models to analyze current weather data and predict future conditions, typically over periods ranging from 24 hours to 10 days. NWP output relies on current weather observations spanning temperature, precipitation, and numerous other meteorological variables. At their foundation, NWP models are structured as a 3-dimensional grid of cells representing the Earth's systems, with each cell characterized by multiple multi-physics processes. Computational results are exchanged between neighboring cells to simulate the transfer of energy and matter over time.
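To make that grid-of-cells picture concrete, here is a minimal Python sketch of a single scalar field on a 3-dimensional grid, where each cell exchanges values with its six face-neighbors at every time step. This is a toy diffusion-style update under simplified assumptions, not code from WRF, MPAS, or FV3GFS, and the grid dimensions and coefficients are arbitrary:

```python
import numpy as np

# Toy 3-D grid: one scalar field (e.g., temperature) per cell.
# Real NWP models track dozens of coupled variables per cell.
nx, ny, nz = 64, 64, 32
temp = np.random.rand(nx, ny, nz)

def step(field, dt=1.0, k=0.1):
    """One explicit time step: each interior cell exchanges with its six
    face-neighbors, a crude stand-in for energy/matter transfer."""
    new = field.copy()
    # Interior cells only; real models also apply boundary conditions.
    new[1:-1, 1:-1, 1:-1] += k * dt * (
        field[2:, 1:-1, 1:-1] + field[:-2, 1:-1, 1:-1] +
        field[1:-1, 2:, 1:-1] + field[1:-1, :-2, 1:-1] +
        field[1:-1, 1:-1, 2:] + field[1:-1, 1:-1, :-2] -
        6.0 * field[1:-1, 1:-1, 1:-1]
    )
    return new

for _ in range(10):  # advance the model ten time steps
    temp = step(temp)
```

Production NWP codes apply far richer physics per cell and distribute the grid across many nodes with MPI, which is why memory bandwidth and network performance matter so much for these workloads.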
The resolution of an NWP model is determined primarily by two factors: the grid cell size, which sets the spatial resolution (usually measured in kilometers), and the time step, which sets the temporal resolution (typically seconds to minutes). Reducing the grid cell size increases spatial detail, and shrinking the time step improves temporal accuracy, but both multiply the computational work required. The growing demand for higher-resolution NWP workloads calls for robust, elastic, and dependable HPC infrastructure.
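To see why resolution drives compute cost, consider a rough back-of-the-envelope calculation. The numbers below are illustrative rather than benchmark results; the time-step ratio follows the common WRF-style rule of thumb of roughly six seconds of time step per kilometer of grid spacing:

```python
# Rough cost scaling as resolution increases (illustrative numbers only).
# Halving the horizontal grid spacing quadruples the cell count in each
# vertical level, and the time step usually shrinks proportionally to
# keep the model numerically stable.
def relative_cost(dx_km, dt_s, base_dx_km=12.0, base_dt_s=72.0):
    cells = (base_dx_km / dx_km) ** 2   # horizontal cells per level
    steps = base_dt_s / dt_s            # time steps per forecast period
    return cells * steps

# Going from 12 km / 72 s to 3 km / 18 s costs roughly:
print(relative_cost(3.0, 18.0))   # -> 64.0 (16x cells * 4x steps)
```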
AWS HPC for NWP Workloads
NWP workloads demand features such as high memory bandwidth and a fast network interconnect, and AWS HPC services are built to provide them. In January 2022, AWS introduced the Amazon EC2 Hpc6a instance family, which provides 100 Gbps networking through Elastic Fabric Adapter (EFA) and is powered by third-generation AMD EPYC™ processors. The following table summarizes the configuration of the EC2 instance type used in this analysis.
| Instance Type | Processor | No. of Physical Cores (per instance) | Memory (GiB) | EFA Network Bandwidth (Gbps) |
| --- | --- | --- | --- | --- |
| hpc6a.48xlarge | AMD EPYC Milan | 96 | 384 | 100 |
To create an HPC cluster on AWS, we used AWS ParallelCluster, an open-source cluster orchestration tool. Along with the instance type above, we used Amazon FSx for Lustre, a managed high-performance file system that delivers high throughput and low-latency I/O. All tests were run with simultaneous multithreading turned off on the instances. For a comprehensive guide to setting up this environment, check out our NWP Workshop.
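The workshop walks through this setup step by step, and its template is the authoritative reference. As a rough sketch of what an AWS ParallelCluster 3 configuration for such a cluster can look like (the subnet ID, key name, and sizing values below are placeholders), consider:

```yaml
# Illustrative AWS ParallelCluster 3 configuration; all IDs, names, and
# sizes are placeholders. The workshop's template is the authoritative one.
Region: us-east-2
Image:
  Os: alinux2
HeadNode:
  InstanceType: c5.2xlarge             # the head node does not run the solver
  Networking:
    SubnetId: subnet-0123456789abcdef0
  Ssh:
    KeyName: my-keypair
Scheduling:
  Scheduler: slurm
  SlurmQueues:
    - Name: compute
      ComputeResources:
        - Name: hpc6a
          InstanceType: hpc6a.48xlarge # 96 physical cores, SMT disabled
          MinCount: 0
          MaxCount: 16
          Efa:
            Enabled: true              # 100 Gbps EFA networking
      Networking:
        SubnetIds:
          - subnet-0123456789abcdef0
        PlacementGroup:
          Enabled: true
SharedStorage:
  - MountDir: /fsx
    Name: fsx
    StorageType: FsxLustre
    FsxLustreSettings:
      StorageCapacity: 1200            # GiB
```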
Essential Components of the Workshop
Two additional components in the workshop streamline the creation of the HPC cluster and help manage application codes: PCluster Manager and Spack.
PCluster Manager allows users to create clusters, monitor jobs, and access infrastructure through a web UI. It simplifies a range of tasks, from mounting existing file systems to debugging cluster issues. The NWP workshop includes a template for PCluster Manager that builds a cluster optimized for NWP workloads. After a job completes, results are visualized using NICE DCV and NCL.
Spack is used to manage the installation of the various NWP codes. This package manager, tailored for HPC workflows, lets users customize software installations easily; for instance, the NWP workshop specifies WRF 4.3.3 compiled with the Intel compiler and Intel MPI. To speed up installation, we offer a Spack binary cache for WRF, MPAS, and FV3GFS, optimized for Amazon EC2 Hpc6a instances. This approach significantly reduces installation time from several days to just a few hours.
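In practice, the Spack workflow looks something like the following sketch. The mirror URL is a placeholder (the workshop supplies the real cache location), and the exact package spec may differ from the workshop's:

```bash
# Point Spack at a binary cache so pre-built packages are fetched
# instead of compiled from source (URL is a placeholder).
spack mirror add nwp-cache https://example-bucket.s3.amazonaws.com/spack-cache
spack buildcache keys --install --trust

# Install WRF 4.3.3 built with the Intel compiler and Intel MPI.
spack install wrf@4.3.3 %intel ^intel-oneapi-mpi
```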
Performance and Cost Analysis
To compare scaling performance and cost across WRF, MPAS, and FV3GFS, we measure Simulation Speed and Cost per Simulation, defined as follows:
- Simulation Speed = Forecast Time (s) / Wall-clock Time (s), where wall-clock time includes both compute and file I/O. A speed of 96, for example, means the model simulates 96 seconds of forecast for every second of wall-clock time.
- Cost per Simulation ($) = Wall-clock Time (hr) × EC2 On-Demand price per instance-hour (us-east-2 pricing) × number of instances.
Note that Cost per Simulation does not account for additional services such as Amazon EBS or FSx for Lustre.
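As a quick worked example of the two metrics, here is a short Python calculation. The forecast length, wall-clock time, instance count, and hourly rate are all hypothetical; check current EC2 On-Demand pricing for hpc6a.48xlarge in us-east-2 for real numbers:

```python
# Worked example of the two metrics above, using hypothetical numbers.
forecast_time_s = 48 * 3600      # a 48-hour forecast, in seconds
wallclock_s = 1800               # 30 minutes of compute plus file I/O
num_instances = 8
price_per_instance_hr = 2.88     # assumed $/instance-hour (placeholder)

simulation_speed = forecast_time_s / wallclock_s
cost_per_simulation = (wallclock_s / 3600) * price_per_instance_hr * num_instances

print(f"Simulation speed: {simulation_speed:.0f}x real time")  # 96x
print(f"Cost per simulation: ${cost_per_simulation:.2f}")      # $11.52
```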
Conclusion
In this post, we looked at NWP workloads, three widely used codes (WRF, MPAS, and FV3GFS), and the AWS HPC services that support them, including Amazon EC2 Hpc6a instances, AWS ParallelCluster, and FSx for Lustre. With the metrics defined above, Simulation Speed and Cost per Simulation, you can assess the price performance of running your own NWP workloads on AWS HPC infrastructure.