Organizations are always on the lookout for ways to optimize the deployment of their foundation models (FMs) in production. This quest often leads them to utilize the latest accelerators like AWS Inferentia and GPUs to cut costs and reduce response times, ultimately enhancing user experience. However, many FMs do not fully leverage the capabilities of the accelerators available on their deployed instances, leading to suboptimal use of hardware resources. Some companies opt to deploy multiple FMs on a single instance to maximize accelerator usage; however, this approach necessitates complex infrastructure orchestration, which can be both time-consuming and difficult to manage.
When multiple FMs share an instance, each has its own scaling requirements and usage patterns, making it hard to predict when to add or remove instances. For example, one model may power an application whose usage spikes during certain hours, while another sees consistent traffic. In addition to cost efficiency, customers want to minimize latency to improve the user experience. To do so, they often deploy several copies of an FM to handle user requests in parallel. Because FM outputs can range from a short sentence to a lengthy paragraph, inference request completion times vary significantly, which can cause unpredictable latency spikes when requests are routed randomly across instances.
Amazon SageMaker now introduces new inference capabilities that can help you reduce deployment costs and latency. You can create inference component-based endpoints and deploy multiple machine learning (ML) models to a single SageMaker endpoint. An inference component (IC) acts as an abstraction for your ML model, allowing you to assign CPUs, GPUs, or AWS Neuron accelerators, along with specific scaling policies, to each model. The advantages of inference components include:
- Optimized Resource Utilization: SageMaker will intelligently place and pack your models onto ML instances to maximize resource use, resulting in significant cost savings.
- Dynamic Scaling: SageMaker will adjust the scale of each model according to your configuration, meeting the demands of your ML applications.
- Efficient Instance Management: SageMaker can dynamically add and remove instances to ensure available capacity while minimizing idle compute resources.
- Resource Flexibility: You can scale down to zero copies of a model when not in use, freeing resources for other models. Additionally, you can prioritize crucial models to always remain loaded and ready to serve requests.
With these advancements, you can achieve an average reduction of 50% in model deployment costs, although the exact savings may vary based on your workload and traffic patterns. For instance, consider a chat application designed to assist tourists with local customs—utilizing two variants of Llama 2, one fine-tuned for European visitors and the other for American guests. Traffic for the European model peaks between 00:01–11:59 UTC, while the American model sees usage from 12:00–23:59 UTC. Instead of deploying these models on separate instances, which would lead to idle resources, you can use a single endpoint for deployment. This allows you to scale down the American model to zero when it’s not needed, optimizing capacity for the European model and vice versa. While this example involves two models, the concept can be extended to accommodate hundreds of models on a single endpoint that automatically adjusts with your workload.
In this post, we will delve into the new features of IC-based SageMaker endpoints. We’ll guide you through deploying multiple models using inference components and APIs while highlighting new observability features, auto-scaling policies, and effective instance management. You can also utilize our improved and user-friendly experience for model deployment. Furthermore, advanced routing capabilities are supported to enhance the latency and performance of your inference workloads.
Building Blocks
Let’s examine how these new features function. Here’s some terminology related to SageMaker hosting:
- Inference Component: A SageMaker hosting entity that deploys a model to an endpoint. You can create an inference component by providing:
  - The SageMaker model or a specification of a compatible image and model artifacts.
  - Compute resource needs, detailing requirements for each model copy, such as CPU cores, memory, and the number of accelerators.
- Model Copy: A runtime instance of an inference component capable of handling requests.
- Managed Instance Auto Scaling: A SageMaker feature that adjusts the number of compute instances for an endpoint. This scaling responds to the needs of inference components.
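To make these building blocks concrete, the following sketch uses the boto3 SageMaker client to create an endpoint with managed instance auto scaling and least-outstanding-requests routing enabled. The endpoint and config names, instance type, instance counts, and role ARN are illustrative placeholders, not values from the original walkthrough.

```python
import boto3

sagemaker_client = boto3.client("sagemaker")

# Endpoint configuration for an inference component-based endpoint.
# Note: no ModelName in the variant; models are attached later as ICs.
sagemaker_client.create_endpoint_config(
    EndpointConfigName="ic-endpoint-config",          # placeholder name
    ExecutionRoleArn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",  # placeholder
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "InstanceType": "ml.g5.12xlarge",          # illustrative choice
            "InitialInstanceCount": 1,
            # Let SageMaker add and remove instances as inference components scale
            "ManagedInstanceScaling": {
                "Status": "ENABLED",
                "MinInstanceCount": 1,
                "MaxInstanceCount": 4,
            },
            # Send each request to the model copy with the fewest outstanding
            # requests to smooth out latency spikes
            "RoutingConfig": {"RoutingStrategy": "LEAST_OUTSTANDING_REQUESTS"},
        }
    ],
)

sagemaker_client.create_endpoint(
    EndpointName="ic-endpoint",
    EndpointConfigName="ic-endpoint-config",
)
```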
To deploy a new inference component, specify a container image, model artifacts, and compute resource requirements. When deploying, you can specify a minimum number of copies so the model is always ready to handle requests. You can also establish policies that allow inference component copies to scale down to zero when not in use, freeing up resources for active workloads.
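As a sketch of that deployment step, the following call attaches a model to the endpoint created above as an inference component. The model name, memory, and accelerator figures are assumptions for illustration; in practice you would reference your own SageMaker model (or container and artifact details) and size the resources to it.

```python
# Deploy one model as an inference component on the shared endpoint.
sagemaker_client.create_inference_component(
    InferenceComponentName="llama2-european-ic",       # placeholder name
    EndpointName="ic-endpoint",
    VariantName="AllTraffic",
    Specification={
        "ModelName": "llama2-european",                 # an existing SageMaker model (assumed)
        "ComputeResourceRequirements": {
            "NumberOfAcceleratorDevicesRequired": 1,    # illustrative sizing
            "NumberOfCpuCoresRequired": 2,
            "MinMemoryRequiredInMb": 1024,
        },
    },
    # Start with one copy; scaling policies can adjust this later
    RuntimeConfig={"CopyCount": 1},
)
```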
As inference requests fluctuate, the number of copies of your ICs can scale accordingly based on your auto-scaling policies. SageMaker will optimize model packing for cost-effectiveness and availability. Additionally, with managed instance auto-scaling enabled, SageMaker adjusts compute instances based on the number of inference components needed, maintaining performance while optimizing costs.
SageMaker will also balance inference components and reduce instances when they are not necessary, leading to further savings.
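One way to sketch such a copy-level scaling policy is with the Application Auto Scaling APIs: register the inference component's copy count as a scalable target, then attach a target-tracking policy on invocations per copy. The capacity limits and metric target below are illustrative assumptions.

```python
autoscaling_client = boto3.client("application-autoscaling")

# Register the inference component's copy count as a scalable target
autoscaling_client.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId="inference-component/llama2-european-ic",
    ScalableDimension="sagemaker:inference-component:DesiredCopyCount",
    MinCapacity=1,   # illustrative; lower bound on model copies
    MaxCapacity=4,   # illustrative; upper bound on model copies
)

# Scale copies in and out to track a target number of invocations per copy
autoscaling_client.put_scaling_policy(
    PolicyName="llama2-european-ic-scaling",
    ServiceNamespace="sagemaker",
    ResourceId="inference-component/llama2-european-ic",
    ScalableDimension="sagemaker:inference-component:DesiredCopyCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 10.0,  # illustrative invocations-per-copy target
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerInferenceComponentInvocationsPerCopy"
        },
    },
)
```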
API Walkthrough
This feature also supports the scale-to-zero capability; for more information, see Unlock cost savings with the new scale down to zero feature in SageMaker Inference. The new InferenceComponent entity separates the hosting details of the ML model from the endpoint itself, giving you more flexibility. You specify the key properties of model hosting, including the SageMaker model or the container and artifact details, the number of copies to deploy, and the required accelerators or CPU cores.
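Once a copy is in service, requests target a specific inference component by name on the shared endpoint. The following minimal invocation sketch reuses the illustrative names from the examples above; the request payload format depends on the container serving your model.

```python
import json

runtime_client = boto3.client("sagemaker-runtime")

# Route the request to one specific inference component on the shared endpoint
response = runtime_client.invoke_endpoint(
    EndpointName="ic-endpoint",
    InferenceComponentName="llama2-european-ic",
    ContentType="application/json",
    Body=json.dumps({"inputs": "What are common greetings in Italy?"}),  # illustrative payload
)
print(response["Body"].read().decode("utf-8"))
```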