Enhancing Salesforce’s Model Endpoints with Amazon SageMaker AI Inference Components

This article is a collaborative effort between Salesforce and AWS, published concurrently on the Salesforce Engineering Blog and the AWS Machine Learning Blog. The Salesforce AI Platform Model Serving team focuses on developing and managing the services that power large language models (LLMs) and other AI workloads within Salesforce. Their primary aim is to facilitate model onboarding, giving customers a robust infrastructure on which to host a wide range of ML models. Their mission emphasizes streamlining model deployment, boosting inference performance, and optimizing cost efficiency, ensuring smooth integration into Agentforce and other applications that rely on inference. The team is committed to improving model inference performance and overall efficiency by adopting cutting-edge solutions and working with leading technology providers, including open-source communities and cloud services such as Amazon Web Services (AWS), and integrating them into a cohesive AI platform. This ensures that Salesforce customers benefit from the most advanced AI technology available while maximizing the cost-performance of the serving infrastructure.

In this article, we highlight how the Salesforce AI Platform team optimized GPU utilization, improved resource efficiency, and achieved cost savings using Amazon SageMaker AI, particularly its inference components.

The Challenge of Hosting Models for Inference: Balancing Compute and Cost Efficiency with Performance

Efficient, reliable, and cost-effective model deployment is a major challenge for organizations of all sizes. The Salesforce AI Platform team is tasked with deploying their proprietary LLMs, such as CodeGen and XGen, on SageMaker AI and optimizing them for inference. Salesforce manages multiple models across single model endpoints (SMEs), accommodating a diverse array of model sizes from a few gigabytes (GB) to 30 GB, each presenting unique performance requirements and infrastructure needs.

The team encountered two primary optimization challenges. Their larger models (20–30 GB) with lower traffic patterns were running on high-performance GPUs, leaving multi-GPU instances underutilized and resources distributed inefficiently. Conversely, their medium-sized models (roughly 15 GB) served high-traffic workloads that required low-latency, high-throughput processing. Provisioning the same multi-GPU setups for these workloads drove up costs through over-provisioning. The following figure shows Salesforce’s large and medium SageMaker endpoints, illustrating the areas of under-utilization.

Currently utilizing Amazon EC2 P4d instances, with plans to transition to the latest P5en instances featuring NVIDIA H200 Tensor Core GPUs, the team sought a resource optimization strategy that would maximize GPU utilization across their SageMaker AI endpoints, allowing scalable AI operations and extracting maximum value from high-performance instances—all while maintaining performance and avoiding over-provisioning.

This challenge exemplifies the critical balance that enterprises must achieve when scaling AI operations: optimizing the performance of complex AI workloads while minimizing infrastructure costs and enhancing resource efficiency. Salesforce required a solution that not only addressed their immediate deployment challenges but also established a flexible foundation for their evolving AI initiatives.

To tackle these challenges, the Salesforce AI Platform team leveraged SageMaker AI inference components, enabling multiple foundation models (FMs) to be deployed on a single SageMaker AI endpoint with precise control over the number of accelerators and memory allocation per model. This approach enhances resource efficiency, reduces model deployment costs, and allows the scaling of endpoints alongside usage demands.

Solution: Streamlining Model Deployment with Amazon SageMaker AI Inference Components

With Amazon SageMaker AI inference components, organizations can deploy one or more FMs on the same SageMaker AI endpoint while managing the number of accelerators and memory for each FM. This strategy boosts resource utilization, cuts model deployment costs, and facilitates scaling endpoints in line with specific use cases. For each FM, distinct scaling policies can be defined to adapt to usage patterns, further optimizing infrastructure expenses. Below is an illustration of Salesforce’s large and medium SageMaker endpoints after improved utilization through inference components.

An inference component abstracts an ML model and allows CPUs, GPUs, and scaling policies to be assigned per model. The benefits of using inference components include:

  • SageMaker AI optimally places and packs models onto ML instances to maximize utilization, resulting in cost savings.
  • Each model scales independently based on tailored configurations, ensuring optimal resource allocation for specific application needs.
  • SageMaker AI dynamically scales to add and remove instances, maintaining availability while minimizing idle compute resources.
  • Organizations can scale a model down to zero copies to free resources for other models, or keep essential models always loaded to serve critical workloads (see the sketch after this list).
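
The scale-to-zero behavior in the last bullet can be driven directly through the SageMaker API. The following boto3 sketch is a minimal illustration, assuming a hypothetical inference component name; it sets the component’s copy count to zero to free its accelerators and later restores a single copy.

```python
import boto3

sagemaker = boto3.client("sagemaker")

# Hypothetical component name, used for illustration only.
component_name = "example-llm-inference-component"

# Scale the inference component down to zero copies, freeing its accelerators
# and memory for other models hosted on the same endpoint.
sagemaker.update_inference_component_runtime_config(
    InferenceComponentName=component_name,
    DesiredRuntimeConfig={"CopyCount": 0},
)

# Restore one copy when the model is needed to serve traffic again.
sagemaker.update_inference_component_runtime_config(
    InferenceComponentName=component_name,
    DesiredRuntimeConfig={"CopyCount": 1},
)
```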

Managing Inference Component Endpoints

To create the SageMaker AI endpoint, an endpoint configuration is established that outlines the instance type and initial instance count. The model is configured within a new construct, an inference component, where the number of accelerators and memory allocation for each model copy is specified, along with model artifacts, container images, and deployment counts.
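
As a minimal sketch of this flow, the boto3 calls below create an endpoint configuration and endpoint, then attach one model as an inference component. All names, the role ARN, the container image, the S3 artifact path, and the resource sizes are illustrative placeholders, not Salesforce’s actual configuration.

```python
import boto3

sagemaker = boto3.client("sagemaker")

# 1. Endpoint configuration: only the instance type and initial instance count
#    are defined here; no model is attached at this level.
sagemaker.create_endpoint_config(
    EndpointConfigName="shared-llm-endpoint-config",
    ExecutionRoleArn="arn:aws:iam::111122223333:role/SageMakerExecutionRole",
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "InstanceType": "ml.p4d.24xlarge",
            "InitialInstanceCount": 1,
        }
    ],
)

sagemaker.create_endpoint(
    EndpointName="shared-llm-endpoint",
    EndpointConfigName="shared-llm-endpoint-config",
)

# 2. Inference component: binds a model's artifacts and container to a slice of
#    the endpoint's accelerators and memory, with an initial number of copies.
sagemaker.create_inference_component(
    InferenceComponentName="example-llm-inference-component",
    EndpointName="shared-llm-endpoint",
    VariantName="AllTraffic",
    Specification={
        "Container": {
            "Image": "111122223333.dkr.ecr.us-east-1.amazonaws.com/llm-serving:latest",
            "ArtifactUrl": "s3://example-bucket/models/example-llm/model.tar.gz",
        },
        "ComputeResourceRequirements": {
            "NumberOfAcceleratorDevicesRequired": 1,
            "MinMemoryRequiredInMb": 16384,
        },
    },
    RuntimeConfig={"CopyCount": 2},
)
```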

As inference requests fluctuate, the copies of your inference components can scale according to your auto-scaling policies. SageMaker AI manages the placement to optimize model packing for availability and cost.
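
Per-model auto-scaling is configured through Application Auto Scaling by registering the inference component’s copy count as a scalable target. The sketch below reuses the hypothetical component name from above; the capacity bounds and target value are illustrative.

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

resource_id = "inference-component/example-llm-inference-component"

# Register the component's copy count as a scalable target.
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:inference-component:DesiredCopyCount",
    MinCapacity=1,
    MaxCapacity=8,
)

# Target-tracking policy: hold roughly four concurrent invocations per copy,
# adding or removing copies as traffic fluctuates.
autoscaling.put_scaling_policy(
    PolicyName="example-llm-copy-scaling",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:inference-component:DesiredCopyCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 4.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerInferenceComponentInvocationsPerCopy"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)
```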

Additionally, enabling managed instance auto-scaling allows SageMaker AI to adjust compute instances based on the number of inference components that need to be loaded at any time to accommodate traffic. SageMaker AI will scale up instances and optimize the packing of your models and components for cost efficiency while preserving model performance.
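
Managed instance auto-scaling is enabled on the endpoint configuration itself. Extending the hypothetical configuration from earlier, the variant below lets SageMaker AI grow and shrink the instance fleet between illustrative minimum and maximum bounds as inference component copies need to be placed.

```python
import boto3

sagemaker = boto3.client("sagemaker")

# Endpoint configuration with managed instance auto-scaling enabled on the
# variant; the instance counts here are illustrative, not Salesforce's settings.
sagemaker.create_endpoint_config(
    EndpointConfigName="shared-llm-endpoint-config-managed-scaling",
    ExecutionRoleArn="arn:aws:iam::111122223333:role/SageMakerExecutionRole",
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "InstanceType": "ml.p4d.24xlarge",
            "InitialInstanceCount": 1,
            "ManagedInstanceScaling": {
                "Status": "ENABLED",
                "MinInstanceCount": 1,
                "MaxInstanceCount": 4,
            },
            "RoutingConfig": {"RoutingStrategy": "LEAST_OUTSTANDING_REQUESTS"},
        }
    ],
)
```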

For more information, see the AWS blog post Reduce model deployment costs by 50% on average using the latest features of Amazon SageMaker.

Salesforce’s Implementation of Amazon SageMaker AI Inference Components

Salesforce operates several proprietary models, such as CodeGen, initially scattered across multiple SMEs. CodeGen is an in-house open-source LLM for code understanding and generation, enabling developers to translate natural language into programming languages like Python. Salesforce has developed an ensemble of CodeGen models (Inline for automatic code completion, BlockGen for code block generation, and FlowGPT for process flow generation) specifically optimized for the Apex programming language. These models are utilized in ApexGuru, a solution within the Salesforce platform that assists developers in addressing critical anti-patterns and hotspots in their Apex code.

Inference components allow multiple models to efficiently share GPU resources on the same endpoint. This consolidation not only reduces infrastructure costs through intelligent resource sharing but also provides a scalable, flexible framework for future AI developments.
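
At request time, a caller targets one of the consolidated models by naming its inference component. The sketch below is a minimal example against the hypothetical endpoint and component used above; the payload format depends on the serving container and is shown only for illustration.

```python
import boto3

runtime = boto3.client("sagemaker-runtime")

# Route the request to a specific model on the shared endpoint by naming its
# inference component; other models on the endpoint are unaffected.
response = runtime.invoke_endpoint(
    EndpointName="shared-llm-endpoint",
    InferenceComponentName="example-llm-inference-component",
    ContentType="application/json",
    Body=b'{"inputs": "Write an Apex trigger that logs account updates"}',
)

print(response["Body"].read().decode("utf-8"))
```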
