SAS® provides a robust platform for data science and analytics used by enterprise and government organizations. SAS Grid is known for its high availability and rapid processing, offering centralized management that balances workloads across compute nodes. The application suite covers data management, visual analytics, governance, security, forecasting, text mining, statistical analysis, and environment management. Recently, SAS and AWS tested a standard workload with SAS Grid Manager on AWS using the Amazon FSx for Lustre shared file system. For details on the results, refer to the whitepaper titled “Accelerating SAS Using High-Performing File Systems on Amazon Web Services.”
In this blog post, we will explore a method for deploying the necessary AWS infrastructure to run SAS Grid with FSx for Lustre, which can also be applied to similar applications that have demanding I/O requirements.
System Design Overview
Running high-performance workloads that depend on sustained throughput and are sensitive to network latency requires strategies beyond those used for typical applications. AWS generally recommends spanning multiple Availability Zones for high availability; for latency-sensitive, high-throughput applications, however, keeping traffic local delivers the best performance. To maximize throughput, consider the following (a sample AWS CLI sketch follows this list):
- Operate within a virtual private cloud (VPC) and utilize instance types that support enhanced networking.
- Deploy instances within the same Availability Zone.
- Use placement groups for your instances.
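As an illustration only, the AWS CLI commands below create a cluster placement group and launch enhanced-networking instances into a single subnet (and therefore a single Availability Zone). The group name, AMI ID, subnet ID, instance type, and count are hypothetical placeholders.

# Create a cluster placement group so grid nodes are placed close together (name is a placeholder).
$ aws ec2 create-placement-group --group-name sas-grid-pg --strategy cluster

# Launch grid nodes into one subnet with an ENA-capable instance type (AMI and subnet IDs are placeholders).
$ aws ec2 run-instances --image-id ami-0123456789abcdef0 --instance-type m5n.8xlarge --count 4 \
    --subnet-id subnet-0123456789abcdef0 --placement GroupName=sas-grid-pg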
The diagram below illustrates the architecture of SAS Grid with FSx for Lustre on AWS.
The architecture comprises mid-tier nodes, metadata servers, and Grid compute nodes. Mid-tier nodes are responsible for running the Platform Web Services (PWS) and Load Sharing Facility (LSF) components that dispatch jobs and return their statuses.
To run PWS and LSF effectively on mid-tier nodes, you need Amazon Elastic Compute Cloud (Amazon EC2) instances with substantial memory; the r5 instance family is a good fit. Metadata servers house the metadata repository containing the definitions for all SAS Grid Manager products, and the r5 instance family also fulfills this role well. Plan to meet or exceed the recommended memory requirement of 24 GB of RAM or 8 GB per physical core, whichever is larger. Metadata servers do not require compute-intensive resources or high I/O bandwidth, making the r5 family a balanced choice between cost and performance.
SAS Grid nodes execute the jobs dispatched by the grid, and the appropriate EC2 instances depend on the size, complexity, and volume of those jobs. To meet the minimum requirements for SAS Grid workloads, plan for at least 8 GB of physical RAM per core and I/O throughput of 100–125 MB/second per physical core. The m5n and r5n EC2 instance families meet both the RAM and throughput needs. You can store the SASDATA, SASWORK, and UTILLOC libraries in a shared file system; if you choose to offload SASWORK to instance storage instead, the i3en instance family provides over 1.2 TB of instance storage capacity. The next section details the throughput testing conducted to arrive at these EC2 instance recommendations with FSx for Lustre.
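If you do offload SASWORK to instance storage as described above, the local NVMe volume must be formatted and mounted before SAS starts. A minimal sketch follows; the device name /dev/nvme1n1 and the /saswork mount point are assumptions, so verify the device with lsblk on your instance first.

# Format the instance-store NVMe volume and mount it for SASWORK
# (device name and mount point are assumptions).
$ sudo mkfs -t xfs /dev/nvme1n1
$ sudo mkdir -p /saswork
$ sudo mount -o noatime /dev/nvme1n1 /saswork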
Steps to Maximize Storage I/O Performance
SAS Grid requires a shared file system. We benchmarked FSx for Lustre as that shared file system from the various EC2 instance families that meet the minimum specifications of 8 GB of physical RAM per core and 100–125 MB/second of throughput per physical core.
FSx for Lustre is a fully managed file storage service built for applications that require fast storage. Because it is POSIX-compliant, FSx for Lustre works with existing Linux-based applications without modification. Although you can choose between scratch and persistent file system types, we recommend a persistent FSx for Lustre file system for SAS Grid so that the SASWORK, SASDATA, and UTILLOC data and libraries are retained for long periods with high availability and data durability. Make sure you select a storage capacity and per-unit storage throughput that together deliver the aggregate throughput your grid nodes need to sustain 100–125 MB/second per physical core.
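As an illustration, a persistent file system can be created with the AWS CLI. The storage capacity and per-unit throughput below roughly mirror the file system used in the testing described later in this post, and the subnet ID is a placeholder; size yours to your own throughput needs.

# Create a persistent FSx for Lustre file system (subnet ID is a placeholder).
# --storage-capacity is in GiB; PerUnitStorageThroughput is in MB/s per TiB of storage.
$ aws fsx create-file-system \
    --file-system-type LUSTRE \
    --storage-capacity 100800 \
    --subnet-ids subnet-0123456789abcdef0 \
    --lustre-configuration DeploymentType=PERSISTENT_1,PerUnitStorageThroughput=200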
After configuring the file system, we advise mounting FSx for Lustre with the flock mount option. Here is an example of a mount command and its options for FSx for Lustre:
$ sudo mount -t lustre -o noatime,flock fs-0123456789abcd.fsx.us-west-2.amazonaws.com@tcp:/za3atbmv /fsx
You can verify the mount options by running the mount command:
$ mount -t lustre
172.31.41.37@tcp:/za3atbmv on /fsx type lustre
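To remount the file system automatically after a reboot, you can add an /etc/fstab entry; this sketch reuses the DNS name and mount name from the example above, so substitute your own values:

# /etc/fstab entry (DNS name and mount name are from the example above; replace with your own)
fs-0123456789abcd.fsx.us-west-2.amazonaws.com@tcp:/za3atbmv /fsx lustre defaults,noatime,flock,_netdev 0 0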
Throughput Testing and Results
To determine the best EC2 instances for running SAS Grid with FSx for Lustre, we performed a series of parallel network throughput tests from individual EC2 instances against a 100.8 TiB persistent file system with an aggregate throughput capacity of 19.688 GB/second. The tests were run in four AWS Regions using various EC2 instance families (c5, c5n, i3, i3en, m5, m5a, m5ad, m5n, m5dn, r5, r5a, r5ad, r5n, and r5dn). Each instance was tested for 3 hours, with the DataWriteBytes metric recorded every minute, and only one instance accessed the file system at a time. The p99.9 results were documented and remained consistent across all four Regions.
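For reference, per-minute DataWriteBytes samples like those used above can be retrieved from Amazon CloudWatch. A minimal sketch, with a placeholder file system ID and time window:

# Pull per-minute write-throughput samples for the file system
# (file system ID and time window are placeholders).
$ aws cloudwatch get-metric-statistics \
    --namespace AWS/FSx \
    --metric-name DataWriteBytes \
    --dimensions Name=FileSystemId,Value=fs-0123456789abcd \
    --statistics Sum \
    --period 60 \
    --start-time 2020-01-01T00:00:00Z \
    --end-time 2020-01-01T03:00:00Z

Dividing each per-minute Sum by 60 gives the average write throughput in bytes per second for that minute.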
The results indicated that the i3en, m5n, m5dn, r5n, and r5dn EC2 instance families either met or surpassed the minimum memory and network performance recommendations. For deeper insights into the performance results, consult the whitepaper “Accelerating SAS Using High-Performing File Systems on Amazon Web Services.” The i3 instance family nearly meets the minimum network performance criteria. If instance storage is intended for SASWORK and UTILLOC libraries, the i3en instances are a viable option.
The m5n and r5n families provide a commendable balance between cost and performance, and we recommend the m5n instance family for SAS Grid nodes. However, if your workload is memory-intensive, consider the r5n instances, which offer higher memory per physical core at a greater price point than the m5n instances.
Additionally, we executed rhel_iotest.sh, available from the SAS technical support samples tool repository (SASTSST), with the aforementioned FSx for Lustre configuration. The table below summarizes the read and write performance per physical core for various instance sizes in the m5n and r5n families.
| Instance Type | Read (MB/second) | Write (MB/second) |
|---|---|---|
| m5n.large | 850.20 | 357.07 |
| m5n.xlarge | 519.46 | 386.25 |
| m5n.2xlarge | 283.01 | 446.84 |
| m5n.4xlarge | 202.89 | 376.57 |
| m5n.8xlarge | 154.98 | 297.71 |
| r5n.large | 906.88 | 429.93 |
| r5n.xlarge | 488.36 | 455.76 |
| r5n.2xlarge | 256.96 | 471.65 |
By keeping these insights in mind, you can optimize your SAS Grid Manager deployment on AWS with Amazon FSx for Lustre, ensuring that your application performs efficiently and meets your organizational needs.