Running Distributed Training with Horovod and MXNet on AWS DL Containers and Deep Learning AMIs

By: Karan Smith and Chanci Turner

Date: 01 SEP 2023

Distributed training of large deep learning models has become essential for computer vision (CV) and natural language processing (NLP) tasks. Open-source frameworks like Horovod provide distributed training capabilities for Apache MXNet, PyTorch, and TensorFlow. Converting a non-distributed Apache MXNet training script to use distributed training with Horovod requires only an additional 4-5 lines of code. Horovod, developed by Uber, is an open-source distributed deep learning framework that uses efficient inter-GPU and inter-node communication techniques, such as the NVIDIA Collective Communications Library (NCCL) and the Message Passing Interface (MPI), to distribute and aggregate model parameters among workers. The primary goal of Horovod is to make distributed deep learning simple and fast: it lets you scale a single-GPU training script to run efficiently across many GPUs in parallel. If you're new to using Horovod alongside Apache MXNet for distributed training, it's advisable to review previous blog posts on this topic before diving into this guide.
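
To make those few lines concrete, the following minimal sketch shows the typical Horovod changes applied to a generic MXNet Gluon training script. It is an illustration, not code from the original post: the one-layer model and the hyperparameters are placeholders.

    import mxnet as mx
    from mxnet import gluon
    import horovod.mxnet as hvd

    # 1. Initialize Horovod and pin each worker to its own GPU (fall back to CPU).
    hvd.init()
    ctx = mx.gpu(hvd.local_rank()) if mx.context.num_gpus() > 0 else mx.cpu()

    # Placeholder model; any Gluon block is wrapped the same way.
    net = gluon.nn.Dense(10)
    net.initialize(ctx=ctx)

    # 2. Scale the learning rate by the number of workers and wrap the optimizer
    #    so gradients are averaged across all workers on every step.
    optimizer = mx.optimizer.create('sgd', learning_rate=0.01 * hvd.size())
    trainer = hvd.DistributedTrainer(net.collect_params(), optimizer)

    # 3. Broadcast the initial parameters from rank 0 so every worker starts
    #    from identical weights.
    hvd.broadcast_parameters(net.collect_params(), root_rank=0)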

MXNet integrates with Horovod through the common distributed training APIs defined in Horovod, so converting a non-distributed training script is largely a matter of adding those few lines in the right places. Nevertheless, other challenges can still hinder smooth distributed training. For instance, you may need to install additional software and libraries and resolve incompatibilities before distributed training works properly: Horovod requires a specific version of Open MPI, and if you aim for high-performance training on NVIDIA GPUs, the NCCL library must also be installed. Another potential issue arises when scaling the number of training nodes in your cluster, because all software and libraries on the new nodes must be correctly installed and configured.
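
A quick way to see what a given Horovod installation was built with is the --check-build flag of horovodrun, available in recent Horovod releases; it lists the frameworks, controllers (MPI, Gloo), and tensor operations (such as NCCL) compiled into the build:

    horovodrun --check-build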

AWS Deep Learning Containers (AWS DL Containers) have significantly streamlined the process of launching new training instances within a cluster, with the latest release including all the necessary libraries for running distributed training using MXNet with Horovod. The AWS Deep Learning AMIs (DLAMI) come equipped with popular open-source deep learning frameworks and pre-configured libraries like CUDA, cuDNN, Open MPI, and NCCL.

In this article, we will guide you through running distributed training with Horovod and MXNet using AWS DL Containers and DLAMIs.

Getting Started with AWS DL Containers

AWS DL Containers are a set of Docker images pre-installed with deep learning frameworks, which makes it easy to deploy custom machine learning environments quickly. These containers provide optimized environments with various deep learning frameworks (MXNet, TensorFlow, PyTorch), NVIDIA CUDA (for GPU instances), and Intel MKL (for CPU instances) libraries, and they are available in the Amazon Elastic Container Registry (Amazon ECR). You can deploy AWS DL Containers on Amazon Elastic Kubernetes Service (Amazon EKS), self-managed Kubernetes on Amazon Elastic Compute Cloud (Amazon EC2), and Amazon Elastic Container Service (Amazon ECS). For further details on launching AWS DL Containers, see the AWS Deep Learning Containers documentation.
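
Before pulling the training images used below, your Docker client needs to authenticate to the registry that hosts them. Assuming the us-east-1 region and the registry account used throughout this post, the login typically looks like the following (this requires the AWS CLI; adjust the region to match the repository you pull from):

    aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com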

Training an MXNet Model with Deep Learning Containers on Amazon EC2

The MXNet Deep Learning Container comes pre-installed with essential libraries such as MXNet, Horovod, NCCL, MPI, CUDA, and cuDNN. The following diagram illustrates this architecture.

To set up AWS DL Containers on an EC2 instance, refer to: Train a Deep Learning model with AWS Deep Learning Containers on Amazon EC2. For a practical tutorial on executing a Horovod training script, complete steps 1-5 from the previous post. When using the MXNet framework, proceed with the following for step 6:

CPU:

  1. Pull and run the Docker image from the Amazon ECR repository; this drops you into a shell inside the container.
    docker run -it 763104351884.dkr.ecr.us-east-1.amazonaws.com/mxnet-training:1.6.0-cpu-py27-ubuntu16.04
  2. In the terminal of the container, execute the following command to train the MNIST example.
    git clone --recursive https://github.com/horovod/horovod.git
    mpirun -np 1 -H localhost:1 --allow-run-as-root python horovod/examples/mxnet_mnist.py

GPU:

  1. Pull and run the Docker image from the Amazon ECR repository; this drops you into a shell inside the container.
    nvidia-docker run -it 763104351884.dkr.ecr.us-east-1.amazonaws.com/mxnet-training:1.6.0-gpu-py27-cu101-ubuntu16.04
  2. In the terminal of the container, run the following command to train the MNIST example.
    git clone --recursive https://github.com/horovod/horovod.git
    mpirun -np 4 -H localhost:4 --allow-run-as-root python horovod/examples/mxnet_mnist.py

If the final output appears as follows, you’ve successfully executed the training script:

[1,0]<stderr>:INFO:root:Epoch[4] Train: accuracy=0.987580 Validation: accuracy=0.988582
[1,0]<stderr>:INFO:root:Training finished with Validation Accuracy of 0.988582

For instructions on terminating the EC2 instances, complete step 7 from the previous post. You can follow the same process for your custom training script; a multi-node variant is sketched below.
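
To scale beyond a single instance, the Horovod documentation suggests an mpirun invocation along the following lines. Treat this as a sketch under stated assumptions: host1 and host2 are hypothetical hostnames for two 4-GPU instances that can reach each other over passwordless SSH, and train.py stands in for your own Horovod-enabled script.

    mpirun -np 8 -H host1:4,host2:4 \
        --allow-run-as-root -bind-to none -map-by slot \
        -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
        -mca pml ob1 -mca btl ^openib \
        python train.py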

Training an MXNet Model with Deep Learning Containers on Amazon EKS

Amazon EKS is a managed service that simplifies running Kubernetes on AWS without the need to install, operate, and maintain your own Kubernetes control plane or nodes. Kubernetes is an open-source system that automates the deployment, scaling, and management of containerized applications. This post will guide you through setting up a deep learning environment using Amazon EKS and AWS DL Containers. With Amazon EKS, you can scale a production-ready environment for multiple-node training and inference using Kubernetes containers.

For instructions on establishing a deep learning environment with Amazon EKS and AWS DL Containers, see Amazon EKS Setup. To create an Amazon EKS cluster, use the open-source tool eksctl; it is advisable to run it from an EC2 instance launched with the latest DLAMI. You can create either a GPU or a CPU cluster, depending on your requirements. Follow the Amazon EKS Setup instructions until you reach the Manage Your Cluster section.
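
For reference, creating a three-node GPU cluster with eksctl typically looks like the following. This is a sketch: the cluster name and key pair are placeholders, and you should confirm the flags against the eksctl version you have installed.

    eksctl create cluster \
        --name dl-training-cluster \
        --region us-east-1 \
        --node-type p3.8xlarge \
        --nodes 3 \
        --ssh-access \
        --ssh-public-key my-key-pair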

Once your Amazon EKS cluster is operational, you can run Horovod MXNet training on the cluster. For detailed instructions, refer to MXNet with Horovod distributed GPU training, which uses a Docker image containing a Horovod training script and a three-node cluster with node-type=p3.8xlarge. That tutorial runs the Horovod MNIST example for MXNet. The Horovod examples directory also includes an ImageNet script, which can be run on the same Amazon EKS cluster.

Getting Started with AWS DLAMI

The AWS DLAMI consists of machine learning images equipped with deep learning frameworks and their associated libraries, such as NVIDIA CUDA, NVIDIA cuDNN, NCCL, Intel MKL-DNN, and more. DLAMI serves as a comprehensive solution for deep learning in the cloud. You can launch EC2 instances with Ubuntu or Amazon Linux. The DLAMI offers pre-installed deep learning frameworks like Apache MXNet, TensorFlow, Keras, and PyTorch. You can train custom models, experiment with innovative deep learning algorithms, and acquire new deep learning skills and techniques.
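
As a quick illustration of working on a Conda-based DLAMI, the typical flow is to activate the framework's preinstalled environment and launch Horovod directly. The environment name below matches the MXNet Python 3.6 environment shipped with the Conda DLAMI at the time of writing; run conda env list on your instance to confirm, and substitute your own training script.

    # Activate the preinstalled MXNet environment on the Conda DLAMI.
    source activate mxnet_p36

    # Fetch the Horovod examples and launch a 4-process run on a single instance.
    git clone --recursive https://github.com/horovod/horovod.git
    horovodrun -np 4 python horovod/examples/mxnet_mnist.py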
