Deploying BLOOM-176B and OPT-30B on Amazon SageMaker with Large Model Inference Deep Learning Containers and DeepSpeed


In recent years, the field of deep learning has advanced at an unprecedented pace. Despite improvements in hardware, such as the latest NVIDIA and Amazon accelerators, machine learning (ML) practitioners often face challenges in deploying large deep learning models for applications like natural language processing (NLP).

In a previous post, we explored the capabilities and configurable settings in Amazon SageMaker model deployment that simplify inference with these extensive models. Today, we introduce a new Amazon SageMaker Deep Learning Container (DLC) designed to facilitate large model inference quickly. This DLC includes popular open-source libraries for model parallel inference, such as DeepSpeed and Hugging Face Accelerate.

In this article, we demonstrate how to deploy two leading large NLP models: BigScience’s BLOOM-176B and Meta’s OPT-30B from the Hugging Face repository. We utilize Deep Java Library (DJL) serving and tensor parallelism techniques from DeepSpeed to achieve an impressive 0.1-second latency per token in a text generation scenario. You can find our comprehensive example notebooks in our GitHub repository.

Large Model Inference Techniques

The size and popularity of language models have surged recently. With easy access to model zoos like Hugging Face and enhanced performance in NLP tasks like classification and text generation, practitioners are increasingly inclined to use these large models. However, models like BLOOM-176B often exceed the memory capacity of a single accelerator, requiring over 350 gigabytes of accelerator memory. This necessitates employing model parallel techniques from libraries such as DeepSpeed and Hugging Face Accelerate to distribute the model across multiple accelerators for inference. In this post, we leverage the SageMaker large model inference container to evaluate latency and throughput performance using both libraries.

DeepSpeed and Accelerate optimize large language models differently for inference. A significant distinction is DeepSpeed’s use of optimized kernels, which can substantially enhance inference latency by alleviating computation graph bottlenecks. While developing optimized kernels can be challenging and model-specific, DeepSpeed supports popular models like OPT and BLOOM with these kernels. Conversely, as of this writing, the Hugging Face Accelerate library does not include optimized kernels. Our results section elaborates on how this difference accounts for much of DeepSpeed’s performance advantage over Accelerate.
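
To make this concrete, the following is a minimal sketch of how DeepSpeed's kernel injection is typically enabled when preparing a supported Hugging Face model for inference. The model name, launcher invocation, and argument names are illustrative assumptions; the exact arguments (for example, mp_size versus newer tensor-parallel options) vary across DeepSpeed versions.

```python
# Illustrative sketch of DeepSpeed inference with optimized kernel injection.
# Assumes the script is launched with the DeepSpeed launcher, for example:
#   deepspeed --num_gpus 4 run_inference.py
import os

import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-30b"  # example model; BLOOM follows the same pattern
local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# replace_with_kernel_inject=True swaps supported transformer blocks for
# DeepSpeed's fused inference kernels; mp_size sets the tensor-parallel degree.
ds_engine = deepspeed.init_inference(
    model,
    mp_size=world_size,
    dtype=torch.float16,
    replace_with_kernel_inject=True,
)
model = ds_engine.module

inputs = tokenizer("Large model inference is", return_tensors="pt").to(f"cuda:{local_rank}")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```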

Additionally, the two libraries employ different forms of model parallelism. Accelerate implements pipeline parallelism to segment a model across its hidden layers, while DeepSpeed utilizes tensor parallelism to divide the layers themselves. Although pipeline parallelism is flexible and can enhance throughput with larger batch sizes, tensor parallelism can improve inference latency by engaging multiple GPUs concurrently.
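
For comparison, the following minimal sketch shows the Accelerate-style path, where device_map="auto" assigns contiguous groups of layers to the available GPUs (with optional CPU or disk offload) and a request flows through the devices in sequence. The model name is an example; BLOOM-176B follows the same pattern but needs far more aggregate memory.

```python
# Illustrative sketch of layer-wise sharding via Hugging Face Accelerate.
# Requires the accelerate library to be installed alongside transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-30b"  # example checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",         # split layers across all visible GPUs
    torch_dtype=torch.float16,
)

# The first shard of layers typically lives on GPU 0, so inputs start there.
inputs = tokenizer("Large model inference is", return_tensors="pt").to("cuda:0")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```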

Solution Overview

To efficiently host large language models, we need features and support in several crucial areas:

  1. Building and Testing Solutions – Given the iterative nature of ML development, the ability to construct, rapidly iterate, and test the inference endpoint is essential. Since hosting these models typically requires larger instances like p4d or g5, spinning up an inference instance can be time-consuming.
  2. Deploying and Running at Scale – Loading model files onto inference instances presents its own challenges, particularly because of their size. For example, it takes about an hour to create the BLOOM-176B model and roughly another hour to load it. An alternate mechanism for easy access to model files is necessary; the serving.properties sketch after this list shows one such option, pulling artifacts directly from Amazon S3.
  3. Loading the Model as Singleton – To prevent race conditions and conserve resources in a multi-worker process, the model must be loaded only once. We demonstrate a method to load directly from Amazon Simple Storage Service (Amazon S3), though this approach relies on the default settings of the DJL. Also, endpoint scaling should be swift, necessitating a reconsideration of model loading and distribution.
  4. Sharding Frameworks – Typically, these models need to be sharded, either through tensor parallelism or pipeline sharding techniques, with advanced concepts like ZeRO sharding built on top of tensor sharding. Various combinations and frameworks from NVIDIA, DeepSpeed, and others can be employed for this purpose.
  5. Hardware Selection – Hardware choices depend on the factors mentioned above, including traffic patterns, use case requirements, and model sizes.
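
With the SageMaker large model inference container and DJLServing, most of these concerns are expressed in a serving.properties file packaged with the model. The following is a minimal sketch for BLOOM-176B; the S3 path is a placeholder, and option names can differ slightly between container versions.

```
# serving.properties (illustrative)
engine = DeepSpeed
# Pull model artifacts directly from Amazon S3 instead of bundling them in the endpoint
option.s3url = s3://<your-bucket>/bloom-176b/
# Shard the model across the instance's 8 GPUs via tensor parallelism
option.tensor_parallel_degree = 8
# Load weights in half precision to reduce accelerator memory usage
option.dtype = fp16
```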

In this post, we utilize DeepSpeed’s optimized kernels and tensor parallelism to host BLOOM-176B and OPT-30B on SageMaker. We also benchmark results from Accelerate to highlight the performance benefits of optimized kernels and tensor parallelism. For further reading on DeepSpeed’s inference optimizations, refer to DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale.

We use DJLServing as our model serving solution in this example. DJLServing is a high-performance universal model serving option powered by the Deep Java Library (DJL), which is agnostic to programming languages. To learn more about DJL and DJLServing, check out Deploy large models on Amazon SageMaker using DJLServing and DeepSpeed model parallel inference.
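
When you bring your own inference code, DJLServing’s Python engine looks for a model.py that exposes a handle function. The sketch below assumes the djl_python Input/Output API and the property names passed through from serving.properties (for example, tensor_parallel_degree and model_dir); it also loads the model once per worker process, as described in the solution overview.

```python
# model.py -- illustrative DJLServing handler; property names are assumptions.
import torch
import deepspeed
from djl_python import Input, Output
from transformers import AutoModelForCausalLM, AutoTokenizer

model = None
tokenizer = None


def load_model(properties):
    # Load the model exactly once per worker and shard it with DeepSpeed.
    tensor_parallel = int(properties.get("tensor_parallel_degree", 1))
    model_location = properties.get("model_dir")
    tok = AutoTokenizer.from_pretrained(model_location)
    hf_model = AutoModelForCausalLM.from_pretrained(model_location, torch_dtype=torch.float16)
    engine = deepspeed.init_inference(
        hf_model,
        mp_size=tensor_parallel,
        dtype=torch.float16,
        replace_with_kernel_inject=True,
    )
    return engine.module, tok


def handle(inputs: Input) -> Output:
    global model, tokenizer
    if model is None:
        model, tokenizer = load_model(inputs.get_properties())
    if inputs.is_empty():
        # DJLServing sends an empty request on startup to trigger model loading.
        return None
    prompt = inputs.get_as_string()
    tokens = tokenizer(prompt, return_tensors="pt").to("cuda")
    generated = model.generate(**tokens, max_new_tokens=64)
    return Output().add(tokenizer.decode(generated[0], skip_special_tokens=True))
```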

It’s worth mentioning that utilizing optimized kernels may cause precision alterations and modify the computation graph, potentially leading to changes in model behavior. Although such differences in inference outcomes are not expected to significantly impact the model’s basic evaluation metrics, practitioners should ensure that the model outputs are as anticipated when applying these kernels.
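
One straightforward check is to run a small set of prompts through the model with and without kernel injection and compare the generated text (or token-level logits). A rough sketch, using a deliberately small example model to keep the comparison cheap:

```python
# Illustrative sanity check: compare greedy generations with and without
# DeepSpeed kernel injection. The model name is only an example.
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-1.3b"
prompts = ["Amazon SageMaker is", "Large language models can"]

tokenizer = AutoTokenizer.from_pretrained(model_name)


def generate(model, prompt):
    tokens = tokenizer(prompt, return_tensors="pt").to("cuda")
    with torch.no_grad():
        out = model.generate(**tokens, max_new_tokens=32, do_sample=False)
    return tokenizer.decode(out[0], skip_special_tokens=True)


baseline = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).to("cuda")
baseline_outputs = [generate(baseline, p) for p in prompts]

# init_inference modifies the model in place, so run the baseline first.
injected = deepspeed.init_inference(
    baseline, mp_size=1, dtype=torch.float16, replace_with_kernel_inject=True
).module
injected_outputs = [generate(injected, p) for p in prompts]

for prompt, a, b in zip(prompts, baseline_outputs, injected_outputs):
    print(f"prompt: {prompt}\n  baseline: {a}\n  injected: {b}\n  match: {a == b}")
```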

The following steps illustrate how to deploy a BLOOM-176B model in SageMaker using DJLServing and a SageMaker large model inference container. The complete example is also accessible in our GitHub repository.
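
At a high level, the deployment packages serving.properties (plus any model.py) as the model artifact and points SageMaker at the large model inference container. The condensed sketch below uses the SageMaker Python SDK; the container image URI, S3 path, IAM role, and instance type are placeholders, and parameter availability (such as the startup health check timeout) depends on your SDK version.

```python
# Condensed sketch of deploying the packaged model with the SageMaker Python SDK.
import sagemaker
from sagemaker.model import Model

session = sagemaker.Session()
role = "arn:aws:iam::<account-id>:role/<sagemaker-execution-role>"  # placeholder

# Large model inference DLC with DeepSpeed; look up the exact URI for your Region.
image_uri = "<lmi-deepspeed-container-image-uri>"

model = Model(
    image_uri=image_uri,
    model_data="s3://<your-bucket>/bloom-176b/code/model.tar.gz",  # serving.properties (+ model.py)
    role=role,
    sagemaker_session=session,
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.p4d.24xlarge",  # placeholder; pick an instance whose total GPU memory fits the sharded model
    endpoint_name="bloom-176b-djl-deepspeed",
    container_startup_health_check_timeout=3600,  # large models can take a long time to load
)
```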
