Improving the Latency and Throughput of Salesforce's Code Generation LLM with Amazon SageMaker
This article is a collaborative effort between Salesforce and AWS and is featured on both the Salesforce Engineering Blog and the AWS Machine Learning Blog.
Salesforce, Inc., based in San Francisco, California, offers cloud-based software for customer relationship management (CRM). The company is focused on integrating artificial intelligence (AI) into its offerings, adding predictive and generative capabilities to its software-as-a-service (SaaS) CRM solutions and working toward intelligent automation.
Salesforce Einstein encompasses a range of AI technologies that integrate with Salesforce's Customer Success Platform to boost productivity and customer engagement. With over 60 features across four main categories, machine learning (ML), natural language processing (NLP), computer vision, and automatic speech recognition, Einstein helps businesses create personalized and predictive customer experiences. It includes out-of-the-box capabilities such as sales email generation and automated service replies, along with tools such as Copilot Builder, Prompt Builder, and Model Builder, available in Einstein 1 Studio for custom AI development.
The Salesforce Einstein AI Platform team focuses on enhancing the performance and capabilities of AI models, particularly large language models (LLMs) used within Einstein products. Their goal is to continually refine these models by integrating cutting-edge solutions and collaborating with top technology providers, ensuring that Salesforce customers benefit from the latest advancements in AI technology.
In this article, we discuss how the Salesforce Einstein AI Platform team improved the latency and throughput of their code generation LLM utilizing Amazon SageMaker.
The Challenge of Hosting LLMs
At the start of 2023, the team began exploring solutions for hosting CodeGen, Salesforce's in-house open-source LLM for code understanding and code generation. The model translates natural language into programming languages such as Python. Because the team was already using AWS for inference with its smaller predictive models, it sought to extend the Einstein platform to accommodate CodeGen. Salesforce developed a suite of CodeGen models (Inline for automatic code completion, BlockGen for generating code blocks, and FlowGPT for process flow generation), all optimized for the Apex programming language, a certified framework for building SaaS applications on Salesforce's CRM platform.
They needed a secure hosting solution that could handle a high volume of inference requests and many concurrent requests at scale, while meeting the latency and throughput requirements of their co-pilot application, EinsteinGPT for Developers. This tool simplifies development by generating smart Apex code from natural language prompts, helping developers accelerate coding tasks and identify code vulnerabilities in real time within the Salesforce integrated development environment (IDE).
The Einstein team thoroughly evaluated various tools and services, including both open-source and commercial options. Ultimately, they determined that SageMaker offered the optimal access to GPUs, scalability, flexibility, and performance enhancements necessary to tackle their latency and throughput challenges.
Reasons for Choosing SageMaker
SageMaker provided several key features that were crucial for Salesforce’s needs:
- Diverse Serving Engines: SageMaker offers specialized deep learning containers (DLCs), libraries for model parallelism, and large model inference (LMI) containers. These high-performance Docker containers are tailored for LLM inference and support advanced open-source serving libraries. The team also appreciated the quick-start notebooks that enable rapid deployment of popular open-source models; a minimal deployment sketch follows this list.
- Advanced Batching Strategies: The SageMaker LMI containers let customers optimize LLM performance through batching, which groups multiple requests together before they reach the model. With dynamic batching, the server collects the requests that arrive within a configurable time window and runs them through the model as a single batch, improving GPU utilization and throughput in exchange for a small, bounded queuing delay. The deployment sketch after this list shows illustrative batching settings.
- Efficient Routing Strategy: SageMaker endpoints default to random routing but also support a least outstanding requests (LOR) strategy, which sends each request to the instance with the fewest in-flight requests. Combined with the ability to run multiple model instances across several GPUs, this spreads traffic evenly and prevents individual instances from becoming bottlenecks (see the routing sketch below).
- Access to High-End GPUs: SageMaker provides top-tier GPU instances that are essential for running LLMs efficiently, which was especially valuable given the shortage of high-end GPUs at the time. Auto scaling let the Einstein team match capacity to demand without manual intervention (see the auto scaling sketch below).
- Rapid Iteration and Deployment: While not directly related to latency, SageMaker's notebook-based workflow made it quick to test and deploy changes, shortening the development cycle and speeding up the delivery of performance improvements.
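To make the first two points concrete, here is a minimal deployment sketch, not Salesforce's production setup: it hosts an open-source CodeGen checkpoint on an LMI (DJL-Serving) container using the SageMaker Python SDK. The container version, model ID (`Salesforce/codegen-2B-mono`), instance type, endpoint name, and batching values are all illustrative assumptions.

```python
# Minimal sketch: hosting an open-source CodeGen checkpoint behind a SageMaker
# LMI (DJL-Serving) container. All names and values are illustrative
# assumptions, not Salesforce's production configuration.
import sagemaker
from sagemaker.model import Model
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

session = sagemaker.Session()
role = sagemaker.get_execution_role()  # IAM role with SageMaker permissions

# Resolve an LMI container image; the framework key and version depend on
# your SDK release, so check the SageMaker docs for current values.
image_uri = sagemaker.image_uris.retrieve(
    framework="djl-lmi", region=session.boto_region_name, version="0.28.0"
)

model = Model(
    image_uri=image_uri,
    role=role,
    env={
        # Open-source CodeGen checkpoint pulled from the Hugging Face Hub.
        "HF_MODEL_ID": "Salesforce/codegen-2B-mono",
        # Continuous (rolling) batching: requests join and leave the batch at
        # each token step instead of waiting for a fixed batch to fill.
        "OPTION_ROLLING_BATCH": "vllm",
        "OPTION_MAX_ROLLING_BATCH_SIZE": "32",
    },
    sagemaker_session=session,
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.2xlarge",     # single-GPU instance for a ~2B model
    endpoint_name="codegen-lmi-demo",  # hypothetical endpoint name
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)

# Example invocation: complete a Python function from a natural prompt.
print(predictor.predict({
    "inputs": "def fibonacci(n):",
    "parameters": {"max_new_tokens": 64},
}))
```

The `OPTION_*` environment variables mirror entries in an LMI `serving.properties` file (for example, `option.rolling_batch=vllm`), so the same batching behavior can be configured either way.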
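Routing is configured on the endpoint rather than in the container. The sketch below uses the low-level boto3 API, with hypothetical model, endpoint-config, and endpoint names, to enable the LOR strategy:

```python
import boto3

sm = boto3.client("sagemaker")

# Assumes a SageMaker model named "codegen-model" already exists.
sm.create_endpoint_config(
    EndpointConfigName="codegen-lor-config",
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "codegen-model",
        "InstanceType": "ml.g5.12xlarge",
        "InitialInstanceCount": 2,
        # Send each request to the instance with the fewest in-flight
        # requests instead of using the default random routing.
        "RoutingConfig": {"RoutingStrategy": "LEAST_OUTSTANDING_REQUESTS"},
    }],
)

sm.create_endpoint(
    EndpointName="codegen-endpoint",
    EndpointConfigName="codegen-lor-config",
)
```

With random routing, a slow instance can accumulate a deep queue while others sit idle; LOR keeps per-instance queues shallow, which matters most when request latencies vary widely, as they do for generation workloads.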
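Finally, auto scaling for an endpoint variant is wired through Application Auto Scaling rather than SageMaker itself. This sketch, again with hypothetical names and an illustrative target value, scales between one and four instances based on invocations per instance:

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# Endpoint and variant names match the endpoint created above (hypothetical).
resource_id = "endpoint/codegen-endpoint/variant/AllTraffic"

autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

autoscaling.put_scaling_policy(
    PolicyName="codegen-invocations-target",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        # Illustrative target: add instances when sustained invocations per
        # instance per minute exceed this value, remove them when it drops.
        "TargetValue": 100.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleOutCooldown": 60,
        "ScaleInCooldown": 300,
    },
)
```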
These combined features significantly enhance LLM performance by minimizing latency and boosting throughput, making Amazon SageMaker a robust solution for Salesforce Einstein’s needs.