Amazon Onboarding with Learning Manager Chanci Turner


Two of the most widely used machine learning models today are BERT, which excels at natural language processing (NLP), and Mask R-CNN, a leading model for image recognition. Recently, AWS made substantial enhancements to its infrastructure, networking, machine learning (ML) frameworks, and model code, achieving the fastest training times reported on the cloud for these state-of-the-art models across TensorFlow, MXNet, and PyTorch. You can now harness the same hardware and software optimizations to train your own models with remarkable speed and efficiency.

Training time directly affects how quickly you can iterate and improve your models’ accuracy. The most effective way to reduce training time is to distribute the training workload across a large cluster of GPU instances, but this is difficult to do efficiently: when a training job is spread across many workers, the communication overhead between instances can erode the benefit of the additional GPU computing power.
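
To make the distributed setup concrete, here is a minimal data-parallel training sketch using PyTorch’s DistributedDataParallel. The tiny model, random data, and hyperparameters are placeholders rather than the configuration behind the results below; the point is that DDP inserts exactly the kind of inter-worker gradient communication that this overhead refers to.

```python
# Minimal data-parallel training sketch with PyTorch DistributedDataParallel (DDP).
# The model, data, and hyperparameters are placeholders. Launch with torchrun so that
# RANK, LOCAL_RANK, and WORLD_SIZE are set for every worker.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")           # NCCL handles GPU-to-GPU collectives
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    device = torch.device(f"cuda:{local_rank}")

    model = torch.nn.Linear(1024, 1024).to(device)
    model = DDP(model, device_ids=[local_rank])       # gradients are AllReduced across workers
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.MSELoss()

    for step in range(10):
        x = torch.randn(32, 1024, device=device)
        y = torch.randn(32, 1024, device=device)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()                               # gradient AllReduce overlaps with backward
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```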

BERT

BERT, or Bidirectional Encoder Representations from Transformers, is a prominent NLP model that set new accuracy records on several common tasks when it was released. Training BERT from scratch on a single Amazon EC2 P3dn.24xlarge instance, which is equipped with 8 NVIDIA V100 GPUs, typically takes several days. We reduced this to just over 60 minutes by scaling out to additional P3dn.24xlarge instances, leveraging network improvements from Elastic Fabric Adapter (EFA), and tuning how the model converges on larger clusters. This is currently the fastest BERT training time on the cloud while achieving state-of-the-art accuracy (an F1 score of 90.5 or higher on SQuAD v1.1 after training on BooksCorpus and English Wikipedia).

With TensorFlow, we scaled to 2,048 GPUs across 256 P3dn.24xlarge instances and trained BERT in 62 minutes. With PyTorch, we reduced the training time to 69 minutes using 1,536 GPUs across 192 P3dn.24xlarge instances. With all of our hardware and software optimizations for BERT training, we achieved an 85% scaling efficiency, meaning the frameworks can use most of the additional computational power from GPUs when scaling to more nodes. The improvements are summarized in the following table.

P3dn.24xlarge Nodes   NVIDIA GPUs   Time to train (PyTorch)   Time to train (TensorFlow)
1                     8             6.4 days                  7.5 days
192                   1536          69 min                    -
256                   2048          -                         62 min
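
Tuning the model’s convergence on larger clusters, as mentioned above, largely comes down to adapting the optimizer to a much larger global batch size. The sketch below is only an illustration of the common linear learning-rate scaling rule with warmup; the base values are placeholders, and the exact schedule and optimizer choices behind the results above are not described in this post.

```python
# Illustrative large-batch learning-rate schedule: linear scaling with warmup.
# All base values are placeholders, not the settings used for the results above.
def scaled_lr(base_lr, base_batch, global_batch):
    """Linear scaling rule: grow the learning rate with the global batch size."""
    return base_lr * global_batch / base_batch

def lr_at_step(step, peak_lr, warmup_steps, total_steps):
    """Linear warmup to peak_lr, then linear decay back toward zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / max(total_steps - warmup_steps, 1)

# Example: scaling from a 64-sample baseline batch to 2,048 GPUs with 8 samples each.
peak = scaled_lr(base_lr=1e-4, base_batch=64, global_batch=8 * 2048)
schedule = [lr_at_step(s, peak, warmup_steps=1000, total_steps=10000)
            for s in range(0, 10000, 2500)]
print(peak, schedule)
```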

Mask R-CNN

Mask R-CNN is a widely adopted instance segmentation model used in applications such as autonomous driving and motion capture, which require sophisticated object detection and segmentation capabilities. Training Mask R-CNN on a single P3dn.24xlarge instance (8 NVIDIA V100 GPUs) takes about six hours with MXNet, PyTorch, and TensorFlow. We reduced this training time to approximately 25 minutes by scaling Mask R-CNN training across all three ML frameworks to 24 P3dn.24xlarge instances, for a total of 192 GPUs. You can now iterate several times a day instead of waiting hours for each result. As of this writing, this is the fastest Mask R-CNN training time on the cloud, achieved while reaching state-of-the-art accuracy (0.377 Box min AP, 0.339 Mask min AP on the COCO2017 dataset). The improvements are summarized in the table below.

# of Nodes   # of GPUs   Time to train (MXNet)   Time to train (PyTorch)   Time to train (TensorFlow)
1            8           6.4 hrs                 5.4 hrs                   6.2 hrs
24           192         25 min                  26 min                    27 min
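
If you want to experiment with scaling instance segmentation yourself, the sketch below wraps torchvision’s off-the-shelf Mask R-CNN in DistributedDataParallel. The random images and targets are placeholders for a COCO-style dataset, and this is not the training recipe behind the numbers above.

```python
# Sketch: distributed training of torchvision's Mask R-CNN with DDP.
# Random images and targets stand in for a real dataset; launch with torchrun.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torchvision.models.detection import maskrcnn_resnet50_fpn

def random_target(device):
    # One dummy box, label, and mask per image, in the format the model expects.
    return {
        "boxes": torch.tensor([[10.0, 10.0, 100.0, 100.0]], device=device),
        "labels": torch.tensor([1], device=device),
        "masks": torch.zeros((1, 256, 256), dtype=torch.uint8, device=device),
    }

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    device = torch.device(f"cuda:{local_rank}")

    model = maskrcnn_resnet50_fpn(num_classes=91).to(device)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.02, momentum=0.9)

    model.train()
    for step in range(5):
        images = [torch.rand(3, 256, 256, device=device) for _ in range(2)]
        targets = [random_target(device) for _ in images]
        losses = model(images, targets)       # returns a dict of losses in train mode
        loss = sum(losses.values())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```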

Technology Stack

Achieving these results required optimizations to the underlying hardware, networking, and software stack. When training large models such as BERT, communication among the many GPUs involved often becomes a bottleneck. In distributed computing, AllReduce is a collective operation that reduces arrays (in this case, the gradients of the neural network’s parameters) from the different workers (GPUs) and distributes the resulting array back to all workers. Each iteration, which consists of a forward and backward pass through the network, requires an AllReduce.
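
To see the collective in framework code, the minimal sketch below calls torch.distributed.all_reduce on a small stand-in tensor; the tensor contents and process count are arbitrary, and this is roughly the call a framework issues on real gradient buffers every iteration.

```python
# Minimal AllReduce demonstration: every rank contributes a tensor and every rank
# ends up with the element-wise sum. Launch with: torchrun --nproc_per_node=4 allreduce_demo.py
import os
import torch
import torch.distributed as dist

def main():
    use_gpu = torch.cuda.is_available()
    dist.init_process_group(backend="nccl" if use_gpu else "gloo")
    rank = dist.get_rank()
    if use_gpu:
        torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    device = torch.device("cuda") if use_gpu else torch.device("cpu")

    # Stand-in for a gradient array: each rank holds a tensor filled with its own rank id.
    grad = torch.full((4,), float(rank), device=device)
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)   # after this call, every rank holds the same sum
    print(f"rank {rank}: {grad.tolist()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```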

AllReduce on GPUs typically uses the NVIDIA Collective Communications Library (NCCL) or MPI libraries such as Open MPI or the Intel MPI Library. These libraries are designed for homogeneous clusters, where the AllReduce runs on the same instances that are training the network. For a model the size of BERT, with roughly 340 million parameters, each AllReduce moves a large amount of gradient data between instances, and this traffic becomes a bottleneck during training.

AWS’s flexible interconnect allows any node to communicate with any other node at full bandwidth. For example, in a cluster of 128 P3dn instances, each instance can exchange data with any other instance at 100 Gbps. This flexibility calls for an AllReduce algorithm that takes full advantage of the AWS network. We developed a custom AllReduce algorithm tailored to the AWS environment that capitalizes on the 100 Gbps interconnect and cuts the data sent and received by each worker roughly in half. The compute phase of the AllReduce (the reduction itself) is offloaded to compute-optimized C5 instances, freeing the GPUs to compute gradients faster. Because the approach also requires fewer network hops to reduce gradients than traditional AllReduce implementations, it both lowers the total cost of training and speeds it up.
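
The post does not spell out the algorithm in detail, but the core idea of offloading the reduction can be illustrated with a toy, single-process simulation: each worker sends one shard of its gradient to each reducer and reads the summed shards back, so every worker transmits and receives the gradient roughly once, compared with roughly twice for a classic ring AllReduce. This is a sketch of the concept only, not AWS’s implementation.

```python
# Toy single-process simulation of a reducer-offloaded AllReduce.
# Each worker splits its gradient into one shard per reducer, every reducer sums the
# shard it owns across workers, and workers reassemble the fully reduced gradient.
import numpy as np

def reducer_allreduce(worker_grads, num_reducers):
    num_workers, grad_size = worker_grads.shape
    shards = np.array_split(np.arange(grad_size), num_reducers)   # indices each reducer owns

    # "Send" phase: each reducer sums its shard across all workers.
    reduced_shards = [worker_grads[:, idx].sum(axis=0) for idx in shards]

    # "Receive" phase: every worker reassembles the fully reduced gradient.
    reduced = np.concatenate(reduced_shards)
    return np.tile(reduced, (num_workers, 1))

grads = np.random.rand(8, 1000)                  # 8 workers, 1,000-element gradients
out = reducer_allreduce(grads, num_reducers=4)
assert np.allclose(out[0], grads.sum(axis=0))    # every worker holds the global sum
```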

Conclusion

Testing with BERT and Mask R-CNN demonstrated significant improvements in training efficiency: throughput scaled nearly linearly as the number of P3dn nodes increased from 1 to 256 instances, which in turn reduced model training times.

Chanci Turner