Develop Ultra-Low Latency Multimodal Generative AI Applications Using Sticky Session Routing in Amazon SageMaker

Amazon SageMaker is a fully managed machine learning (ML) service that enables data scientists and developers to efficiently build, train, and deploy ML models in a production-ready environment. With SageMaker, users can access a wide range of ML infrastructure and deployment options tailored to meet various inference needs, while also scaling model deployments effectively and reducing operational burdens.

Initially, large language models (LLMs) were limited to text inputs, but advances in AI have allowed these systems to evolve into multimodal models capable of handling diverse media types, including images, video, and audio. Multimodal inference, however, poses challenges such as significant data transfer overhead and increased response times. In a typical chatbot interaction, users often begin by submitting a multimedia file or a link, then follow up with a series of questions about that input. Resending a large multimedia file with each request can severely affect response times and frustrate users: a 500 MB input file could add 3–5 seconds to each response, which is detrimental for a chatbot striving for smooth and rapid interactions.

We are excited to introduce sticky session routing on Amazon SageMaker Inference, a feature designed to enhance the performance and user experience of generative AI applications by leveraging previously processed data. With this innovation, SageMaker simplifies the deployment of ML models, including foundation models (FMs), ensuring optimal price-performance for diverse use cases.

By enabling sticky session routing, all requests from the same session are directed to the same instance, allowing your ML application to reuse earlier processed information, thereby reducing latency and enhancing user experiences. This functionality is particularly beneficial for applications that involve large data payloads or require seamless interactive experiences. To utilize this feature, a session ID is created with the initial request, which then directs SageMaker to route all subsequent requests to the same instance. Sessions can also be terminated when completed to free up resources for new ones.

This feature is available across all AWS Regions where SageMaker operates. For more information on model deployment in SageMaker, visit Amazon SageMaker Model Deployment. For additional insights on this feature, check out Stateful sessions with Amazon SageMaker models.

Solution Overview

SageMaker streamlines model deployment, enabling chatbots and other applications to harness their multimodal capabilities effortlessly. Our robust solution merges two key strategies: sticky session routing in SageMaker with load balancing, and stateful sessions in TorchServe. Sticky session routing ensures all requests from a user session are handled by the same SageMaker server instance, while stateful sessions in TorchServe cache multimedia data in GPU memory from the start of the session, minimizing loading times for improved response speeds.

This approach prioritizes reducing data transfer overhead and enhancing response time, ensuring that the initial multimedia file is loaded and processed just once, allowing subsequent requests within the same session to utilize cached data.

Sequence of Events for a Sticky Session on SageMaker

  1. In the initial request, you call the Boto3 SageMaker runtime invoke_endpoint with session-id=NEW_SESSION in the header and a payload that indicates an open session request (a code sketch of this flow follows the list). SageMaker creates a new session and saves the session ID. The router then initiates an open session call (this API can be client-defined; it could also be named start_session) with the model server, in this case TorchServe, which responds with a 200 OK along with the session ID and a time to live (TTL), shared back with the client.
  2. For any further actions using the same session, you include the session ID in the invoke_endpoint call, allowing SageMaker to route all subsequent requests to the same model server instance.
  3. To close or delete a session, invoke_endpoint can be used with a payload that specifies a close session request along with the session ID. The SageMaker router first verifies if the session exists. If it does, the router initiates a close session call to the model server, receiving a successful 200 OK response along with the session ID, which is then sent back to the client. If the session ID does not exist, the router responds with a 400 error.
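
The flow above can be exercised with the Boto3 SageMaker runtime client. The sketch below is illustrative rather than the definitive client code: the endpoint name and the open_session/question/close_session payload fields are assumptions, and the exact response field or header that carries the new session ID can vary, so it is read defensively.

```python
import json

import boto3

smr = boto3.client("sagemaker-runtime")
endpoint_name = "llava-sticky-sessions"  # hypothetical endpoint name

# 1. Open a session: SessionId="NEW_SESSION" asks SageMaker to create one.
open_payload = {"type": "open_session", "image_url": "s3://my-bucket/demo.jpg"}  # assumed schema
response = smr.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps(open_payload),
    SessionId="NEW_SESSION",
)
# Read the new session ID defensively; the response field or header name may differ.
new_session = response.get("NewSessionId") or response["ResponseMetadata"]["HTTPHeaders"].get(
    "x-amzn-sagemaker-new-session-id", ""
)
session_id = new_session.split(";")[0]

# 2. Follow-up requests reuse the session ID, so they land on the same instance
#    and can use the data already cached there.
question = {"type": "question", "prompt": "What is shown in the image?"}
answer = smr.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps(question),
    SessionId=session_id,
)
print(answer["Body"].read().decode())

# 3. Close the session to free the cached state for new sessions.
smr.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType="application/json",
    Body=json.dumps({"type": "close_session"}),
    SessionId=session_id,
)
```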

In the following sections, we explore how to use sticky routing in SageMaker for stateful model inference. For this illustration, we use LLaVA: Large Language and Vision Assistant, a multimodal model that accepts both image and text prompts. The client uploads an image once and then asks a series of questions about it without resending the image on each request; the image is cached in GPU memory rather than CPU memory, avoiding the latency of transferring it between memory types on every call.

We employ TorchServe as our model server for this example. TorchServe is a high-performance, flexible, and user-friendly tool for serving PyTorch models in production. It includes advanced features such as dynamic batching, microbatching, model A/B testing, streaming, Torch XLA, TensorRT, ONNX, and IPEX. Furthermore, it integrates seamlessly with PyTorch’s large model solution, PiPPy, to efficiently manage large models. TorchServe also extends support to popular open-source libraries like DeepSpeed, Accelerate, Fast Transformers, and more, amplifying its capabilities even further.
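
On the model server side, the stateful caching can be sketched as a TorchServe custom handler that keeps each session's image tensor resident on the GPU. This is a minimal illustration under assumptions: the x-amzn-sagemaker-session-id header name, the request schema, and the decode_image helper are hypothetical, and the actual LLaVA model loading and generation code is omitted.

```python
import json

import torch
from ts.torch_handler.base_handler import BaseHandler


class LlavaStatefulHandler(BaseHandler):
    """Illustrative handler that caches each session's image tensor in GPU memory."""

    def initialize(self, context):
        # Loading the actual LLaVA weights and processor is omitted from this sketch.
        self.ctx = context
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.session_cache = {}  # session_id -> image tensor kept on the GPU
        self.initialized = True

    def preprocess(self, requests):
        batch = []
        for idx, req in enumerate(requests):
            # Assumed header carrying the sticky session ID forwarded by the router.
            session_id = self.ctx.get_request_header(idx, "x-amzn-sagemaker-session-id")
            body = req.get("body") or req.get("data")
            if isinstance(body, (bytes, bytearray)):
                body = json.loads(body)
            if body.get("type") == "open_session":
                # Decode and load the image once, then keep it resident on the GPU
                # so follow-up requests in the session skip the transfer entirely.
                image = self.decode_image(body["image_url"])  # hypothetical helper
                self.session_cache[session_id] = image.to(self.device)
            elif body.get("type") == "close_session":
                self.session_cache.pop(session_id, None)
            batch.append((self.session_cache.get(session_id), body.get("prompt", "")))
        return batch
```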

Steps to Deploy the LLaVA Model

The following steps outline how to deploy the LLaVA model, giving a conceptual overview of the deployment workflow before we dive into the practical implementation details.

  1. Build a TorchServe Docker container and push it to Amazon ECR. Because we use a custom model, we follow the bring your own container (BYOC) approach, using one of AWS’s provided deep learning containers as the base image, specifically pytorch-inference:2.3.0-gpu-py311-cu121-ubuntu20.04-sagemaker.
  2. Build TorchServe model artifacts and upload them to Amazon S3. We leverage torch-model-archiver to gather all necessary artifacts, including custom handlers, the LLaVA model code, request and response data types, model configuration, prediction API, and other utilities, which are then uploaded to Amazon Simple Storage Service (Amazon S3).
  3. Create the SageMaker endpoint (a minimal boto3 sketch follows this list).
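
Step 3 can be scripted with the boto3 SageMaker client, as sketched below. The model, endpoint config, and endpoint names, the instance type, and the image, artifact, and role URIs are placeholders; the image is the container pushed to Amazon ECR in step 1 and the model data is the archive uploaded to Amazon S3 in step 2.

```python
import boto3

sm = boto3.client("sagemaker")

# Placeholders: substitute your ECR image URI, S3 artifact path, and IAM role.
image_uri = "<account>.dkr.ecr.<region>.amazonaws.com/torchserve-llava:latest"
model_data = "s3://<bucket>/llava/model.tar.gz"
role_arn = "arn:aws:iam::<account>:role/SageMakerExecutionRole"

# Register the container and model artifacts as a SageMaker model.
sm.create_model(
    ModelName="llava-torchserve",
    PrimaryContainer={"Image": image_uri, "ModelDataUrl": model_data},
    ExecutionRoleArn=role_arn,
)

# Choose a GPU instance type large enough to hold the model and cached sessions.
sm.create_endpoint_config(
    EndpointConfigName="llava-sticky-config",
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": "llava-torchserve",
            "InstanceType": "ml.g5.2xlarge",
            "InitialInstanceCount": 1,
        }
    ],
)

# Create the real-time endpoint that the sticky session client will call.
sm.create_endpoint(
    EndpointName="llava-sticky-sessions",
    EndpointConfigName="llava-sticky-config",
)
```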
