Starting today, Amazon SageMaker enables you to effortlessly train and deploy your PyTorch deep learning models. This addition marks the fourth deep learning framework supported by Amazon SageMaker, joining TensorFlow, Apache MXNet, and Chainer. Just like with these other frameworks, you can write your PyTorch scripts as you typically would, while Amazon SageMaker takes care of establishing the distributed training cluster, managing data migration, and optimizing hyperparameters. On the inference side, Amazon SageMaker offers a managed, highly available online endpoint that can scale automatically based on demand.
In addition to PyTorch, we are also introducing the latest stable versions of TensorFlow (1.7 and 1.8). With these updates, you can take advantage of new features such as tf.custom_gradient and the pre-made BoostedTree estimators right away. The default setup for Amazon SageMaker’s TensorFlow estimator uses the latest version, meaning you won’t need to modify your existing code.
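To give a feel for what tf.custom_gradient enables, here is a minimal sketch, adapted from the canonical log1pexp example rather than from this post, that attaches a hand-written, numerically stable gradient to an ordinary Python function:
import tensorflow as tf

# A minimal sketch of tf.custom_gradient: compute log(1 + e^x) while
# supplying a hand-written, numerically stable gradient instead of the
# one TensorFlow would derive automatically.
@tf.custom_gradient
def log1pexp(x):
    e = tf.exp(x)
    def grad(dy):
        # d/dx log(1 + e^x) = e^x / (1 + e^x), written in a stable form.
        return dy * (1 - 1 / (1 + e))
    return tf.log1p(e), grad

x = tf.constant(100.0)
y = log1pexp(x)
dy_dx = tf.gradients(y, x)

with tf.Session() as sess:
    print(sess.run([y, dy_dx]))  # the gradient stays finite even for large x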
Supporting a diverse range of deep learning frameworks is crucial for developers, as each framework has unique strengths. PyTorch is particularly favored by deep learning researchers, but it’s rapidly gaining traction among developers due to its flexibility and user-friendliness. TensorFlow remains a well-established option, continually adding valuable features with each release. We are committed to investing in these frameworks, along with other popular engines like MXNet and Chainer.
Using PyTorch in Amazon SageMaker
The PyTorch framework stands out as it employs reverse-mode auto-differentiation, allowing for dynamic neural network construction. Its deep integration with Python facilitates the use of standard Python control flows and the ability to create new network layers using Cython, Numba, or NumPy. Furthermore, PyTorch is optimized for speed, supporting acceleration libraries such as MKL, CuDNN, and NCCL. Notably, the team at fast.ai achieved impressive results in the DAWNBench Competition using PyTorch.
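A small, self-contained sketch of what this define-by-run style looks like in practice: ordinary Python control flow decides what ends up in the graph, and reverse-mode autodiff traces whatever actually ran.
import torch

# Dynamic graph construction: a standard Python loop determines, at run time,
# how many operations the graph contains before backward() differentiates it.
x = torch.randn(3, requires_grad=True)
y = x * 2
while y.norm() < 100:   # plain Python control flow, re-evaluated on every run
    y = y * 2
y.sum().backward()      # reverse-mode auto-differentiation through the dynamic graph
print(x.grad)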
Utilizing PyTorch within Amazon SageMaker is as simple as working with other pre-built deep learning containers. You just need to provide your training or hosting script, which consists of standard PyTorch code wrapped in helpful functions, and then use the PyTorch estimator from the Amazon SageMaker Python SDK as follows:
from sagemaker.pytorch import PyTorch

estimator = PyTorch(entry_point="pytorch_script.py",
                    role=role,
                    train_instance_count=2,
                    train_instance_type='ml.p2.xlarge',
                    hyperparameters={'epochs': 10,
                                     'lr': 0.01})
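Once constructed, the estimator is used like the other SageMaker framework estimators. A minimal usage sketch follows; the S3 prefix below is a placeholder, not a path from this post.
# Launch the managed training job; 'training' becomes a channel whose data is
# made available inside the training container (the S3 prefix is hypothetical).
estimator.fit({'training': 's3://your-bucket/pytorch-mnist/training'})

# Deploy the trained model to a managed, real-time endpoint.
predictor = estimator.deploy(initial_instance_count=1,
                             instance_type='ml.m4.xlarge')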
You can explore our example notebooks and documentation, or follow the example below for more insights.
Training and Deploying a Neural Network with PyTorch
For this example, we will train a simple convolutional neural network on the MNIST handwritten digits dataset. This dataset comprises 70,000 labeled 28×28 pixel grayscale images (60,000 for training and 10,000 for testing) across 10 classes, one for each digit from 0 to 9. The Amazon SageMaker PyTorch container operates in script mode, so it expects the input script in roughly the same form you would use to run it outside of SageMaker. Let’s examine the code, which is based on PyTorch’s own MNIST example with distributed training added. We will highlight the key components.
Entry Point Script
Starting with the main guard, we use a parser to read the hyperparameters passed to our Amazon SageMaker estimator when the training job is created. These hyperparameters are made available as arguments inside the training container. Here we look for hyperparameters such as batch size, epochs, learning rate, and momentum; if a value is not specified in the SageMaker estimator call, it falls back to the default provided here. We also use the training_env() method from the sagemaker_containers library, which supplies container-specific details such as the training and model directories and the instance configuration. You can also access these details through specific environment variables. For additional information, you can visit the SageMaker Containers GitHub repository.
import argparse

import sagemaker_containers

if __name__ == '__main__':
    parser = argparse.ArgumentParser()

    # Data and model checkpoints directories
    parser.add_argument('--batch-size', type=int, default=64, metavar='N',
                        help='input batch size for training (default: 64)')
    parser.add_argument('--test-batch-size', type=int, default=1000, metavar='N',
                        help='input batch size for testing (default: 1000)')
    parser.add_argument('--epochs', type=int, default=10, metavar='N',
                        help='number of epochs to train (default: 10)')
    parser.add_argument('--lr', type=float, default=0.01, metavar='LR',
                        help='learning rate (default: 0.01)')
    parser.add_argument('--momentum', type=float, default=0.5, metavar='M',
                        help='SGD momentum (default: 0.5)')
    parser.add_argument('--seed', type=int, default=1, metavar='S',
                        help='random seed (default: 1)')
    parser.add_argument('--log-interval', type=int, default=100, metavar='N',
                        help='how many batches to wait before logging training status')
    parser.add_argument('--backend', type=str, default=None,
                        help='backend for distributed training (tcp, gloo on cpu and gloo, nccl on gpu)')

    # Container environment
    env = sagemaker_containers.training_env()
    parser.add_argument('--hosts', type=list, default=env.hosts)
    parser.add_argument('--current-host', type=str, default=env.current_host)
    parser.add_argument('--model-dir', type=str, default=env.model_dir)
    parser.add_argument('--data-dir', type=str,
                        default=env.channel_input_dirs['training'])
    parser.add_argument('--num-gpus', type=int, default=env.num_gpus)

    train(parser.parse_args())
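As mentioned above, the same container details are also exposed as environment variables. Here is a brief sketch of that alternative; the SM_*-style names below follow the convention used by the SageMaker containers library and should be treated as an assumption if your container version differs.
import json
import os

# Reading container details directly from environment variables instead of
# sagemaker_containers.training_env() (variable names assumed, see above).
model_dir = os.environ.get('SM_MODEL_DIR')
data_dir = os.environ.get('SM_CHANNEL_TRAINING')
hosts = json.loads(os.environ.get('SM_HOSTS', '[]'))
current_host = os.environ.get('SM_CURRENT_HOST')
num_gpus = int(os.environ.get('SM_NUM_GPUS', '0'))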
Once we’ve defined our hyperparameters, we pass them to the train() function, which we also define in our input script. The train() function performs several tasks, including setting up resources correctly (GPU, distributed computing, etc.).
import logging
import os

import torch
import torch.distributed as dist

logger = logging.getLogger(__name__)


def train(args):
    is_distributed = len(args.hosts) > 1 and args.backend is not None
    logger.debug("Distributed training - {}".format(is_distributed))
    use_cuda = args.num_gpus > 0
    logger.debug("Number of gpus available - {}".format(args.num_gpus))
    kwargs = {'num_workers': 1, 'pin_memory': True} if use_cuda else {}
    device = torch.device("cuda" if use_cuda else "cpu")

    if is_distributed:
        # Initialize the distributed environment.
        world_size = len(args.hosts)
        os.environ['WORLD_SIZE'] = str(world_size)
        host_rank = args.hosts.index(args.current_host)
        dist.init_process_group(backend=args.backend,
                                rank=host_rank,
                                world_size=world_size)
        logger.info('Init distributed env: \'{}\' backend on {} nodes. '.format(
            args.backend, dist.get_world_size()) +
            'Current host rank is {}. Number of gpus: {}'.format(
                dist.get_rank(), args.num_gpus))

    # set the seed for generating random numbers
    torch.manual_seed(args.seed)
    if use_cuda:
        torch.cuda.manual_seed(args.seed)
    ...
Next, it proceeds to load our datasets.
    ...
    train_loader = _get_train_data_loader(args.batch_size,
                                          args.data_dir,
                                          ...)
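The body of _get_train_data_loader isn’t shown in this excerpt, so here is a hedged sketch of what such a helper could look like; its exact signature and the remaining arguments passed above are assumptions for illustration, not this post’s code.
import torch
import torch.utils.data
import torch.utils.data.distributed
from torchvision import datasets, transforms

# Illustrative only: load the MNIST training split from the data channel and
# shard it across hosts with a DistributedSampler when training is distributed.
def _get_train_data_loader(batch_size, data_dir, is_distributed, **kwargs):
    dataset = datasets.MNIST(data_dir, train=True, download=False,
                             transform=transforms.Compose([
                                 transforms.ToTensor(),
                                 transforms.Normalize((0.1307,), (0.3081,))]))
    sampler = (torch.utils.data.distributed.DistributedSampler(dataset)
               if is_distributed else None)
    return torch.utils.data.DataLoader(dataset, batch_size=batch_size,
                                       shuffle=(sampler is None),
                                       sampler=sampler, **kwargs)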