Enhance Your LLMs with RAG at Scale Using AWS Glue

Large language models (LLMs) are expansive deep-learning architectures that have been pre-trained on extensive datasets. Their versatility allows them to handle a variety of tasks, including answering queries, summarizing text, translating languages, and completing sentences. LLMs hold significant potential to transform content creation and the way users interact with search engines and virtual assistants.

Retrieval Augmented Generation (RAG) enhances LLM output by incorporating references from an authoritative external knowledge base before generating responses. Although LLMs are trained on vast amounts of data and use billions of parameters to generate original output, RAG extends their powerful capabilities to specific domains or an organization’s internal knowledge repository without retraining the LLM. This makes it a fast and cost-effective way to keep LLM outputs relevant, accurate, and contextually useful. RAG adds an information retrieval component that first retrieves external data relevant to the user’s input. This external data can exist in various formats, such as files, database records, or long-form text. An embedding model converts the external data into numerical vector representations, which are stored in a vector database. This effectively creates a knowledge library that generative AI models can search and understand.
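To make the idea concrete, here is a minimal sketch of what embedding-based retrieval relies on. It assumes the langchain-aws Bedrock integration and the Titan text embedding model purely as examples; any embedding model exposing the same interface would work.

import numpy as np
from langchain_aws import BedrockEmbeddings

# Example embedding model (an illustrative assumption, not a requirement)
embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v2:0")

# Encode a knowledge-base passage and a user question as vectors
passage_vec = np.array(embeddings.embed_query("AWS Glue is a serverless data integration service."))
question_vec = np.array(embeddings.embed_query("What is AWS Glue?"))

# Cosine similarity between the two vectors; a higher score means the passage
# is semantically closer to the question and more likely to be retrieved
score = passage_vec @ question_vec / (np.linalg.norm(passage_vec) * np.linalg.norm(question_vec))
print(score)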

Implementing RAG also introduces additional data engineering necessities:

  • Scalable retrieval indexes must process massive text collections covering essential knowledge areas.
  • Data needs preprocessing for semantic search during inference, which includes normalization, vectorization, and index optimization.
  • These indexes continually accumulate documents, necessitating data pipelines that can seamlessly integrate new data on a large scale.
  • The diversity of data amplifies the need for customizable cleaning and transformation processes to address the unique characteristics of various sources.

In this article, we will discuss how to build a reusable RAG data pipeline utilizing LangChain—an open-source framework for LLM-based applications—while integrating it with AWS Glue and Amazon OpenSearch Serverless. The resulting solution serves as a reference architecture for scalable RAG indexing and deployment. Sample notebooks will be provided, covering ingestion, transformation, vectorization, and index management, allowing teams to effectively utilize disparate data in high-performing RAG applications.

Data Preprocessing for RAG

Effective data preprocessing is essential for responsible retrieval from external datasets with RAG. Clean, high-quality data leads to more accurate RAG results, while privacy and ethical considerations necessitate careful data filtering. This foundational step enables LLMs with RAG to maximize their potential in downstream applications.

A common practice to facilitate effective retrieval from external data is to first clean and sanitize documents. Tools like Amazon Comprehend or AWS Glue’s sensitive data detection feature can identify sensitive information, which you can then clean up using Spark. The next phase involves breaking documents into manageable chunks, converting these chunks into embeddings, and storing them in a vector index while maintaining a link to the original documents. These embeddings help measure semantic similarity between user queries and text from the data sources.
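As a rough illustration of those two steps, the sketch below masks PII detected by Amazon Comprehend and then splits the sanitized text into chunks with LangChain’s RecursiveCharacterTextSplitter. The sample document, chunk size, and redaction strategy are illustrative assumptions.

import boto3
from langchain_text_splitters import RecursiveCharacterTextSplitter

document = "Contact Jane Doe at jane@example.com about the Q3 sales report."

# Detect sensitive entities so they can be masked before indexing
comprehend = boto3.client("comprehend")
pii = comprehend.detect_pii_entities(Text=document, LanguageCode="en")

# Replace entities from the end of the string so earlier offsets stay valid
for entity in sorted(pii["Entities"], key=lambda e: e["BeginOffset"], reverse=True):
    document = document[:entity["BeginOffset"]] + "[REDACTED]" + document[entity["EndOffset"]:]

# Break the sanitized document into overlapping chunks for embedding
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_text(document)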

Solution Overview

In this solution, we leverage LangChain integrated with AWS Glue for Apache Spark and Amazon OpenSearch Serverless. By utilizing Apache Spark’s distributed capabilities and PySpark’s adaptable scripting features, our solution is both scalable and customizable. We are employing OpenSearch Serverless as a sample vector store alongside the Llama 3.1 model.

The advantages of this solution include:

  • Flexibility in data cleaning, sanitizing, and quality management, as well as chunking and embedding.
  • The ability to build and manage an incremental data pipeline that updates embeddings in the vector store at scale.
  • A diverse selection of embedding models to choose from.
  • Compatibility with various data sources, including databases, data warehouses, and SaaS applications supported in AWS Glue.

This solution encompasses:

  • Processing unstructured data formats such as HTML, Markdown, and text files using Apache Spark, including distributed data cleaning, sanitizing, chunking, and embedding vectors for downstream applications.
  • Integrating everything into a Spark pipeline that incrementally processes sources and publishes vectors to OpenSearch Serverless (a condensed sketch follows this list).
  • Querying the indexed content using your preferred LLM model to provide natural language responses.
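As a condensed sketch of the indexing step (not the full notebook), the snippet below assumes a Spark DataFrame chunks_df with one cleaned text chunk per row in a chunk_text column, and uses the langchain-aws and langchain-community integrations. The collection endpoint, index name, Region, and embedding model are placeholders to replace with your own values.

import boto3
from opensearchpy import AWSV4SignerAuth, RequestsHttpConnection
from langchain_aws import BedrockEmbeddings
from langchain_community.vectorstores import OpenSearchVectorSearch

OPENSEARCH_URL = "https://<collection-id>.<region>.aoss.amazonaws.com"
INDEX_NAME = "rag-index"  # hypothetical index name

def index_partition(rows):
    # Each Spark executor signs its own requests to the Serverless collection
    credentials = boto3.Session().get_credentials()
    auth = AWSV4SignerAuth(credentials, "<region>", "aoss")
    embeddings = BedrockEmbeddings(model_id="amazon.titan-embed-text-v2:0")
    vector_store = OpenSearchVectorSearch(
        opensearch_url=OPENSEARCH_URL,
        index_name=INDEX_NAME,
        embedding_function=embeddings,
        http_auth=auth,
        use_ssl=True,
        verify_certs=True,
        connection_class=RequestsHttpConnection,
    )
    texts = [row["chunk_text"] for row in rows]
    if texts:
        vector_store.add_texts(texts)

# Embed and index each partition in parallel across the Spark cluster
chunks_df.foreachPartition(index_partition)

Running the embedding calls and bulk indexing inside foreachPartition keeps the work distributed across executors, which is what lets the pipeline scale with the size of the corpus.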

Prerequisites

Before proceeding with this tutorial, ensure that you have created the following AWS resources:

  • An Amazon Simple Storage Service (Amazon S3) bucket for data storage.
  • An AWS Identity and Access Management (IAM) role for your AWS Glue notebook, as detailed in the IAM permissions setup for AWS Glue Studio. This role requires IAM permissions for Amazon OpenSearch Serverless. Below is an example policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "OpenSearchServerless",
      "Effect": "Allow",
      "Action": [
        "aoss:APIAccessAll",
        "aoss:CreateAccessPolicy",
        "aoss:CreateCollection",
        "aoss:CreateSecurityPolicy",
        "aoss:DeleteAccessPolicy",
        "aoss:DeleteCollection",
        "aoss:DeleteSecurityPolicy",
        "aoss:ListCollections"
      ],
      "Resource": "*"
    }
  ]
}

Follow these steps to launch an AWS Glue Studio notebook:

  1. Download the Jupyter Notebook file.
  2. In the AWS Glue console, navigate to Notebooks.
  3. Under Create job, select Notebook.
  4. For Options, choose Upload Notebook.
  5. Choose Create notebook. The notebook will initialize in a minute.

Run the first two cells to configure an AWS Glue interactive session. At this point, you have set up the necessary configurations for your AWS Glue notebook.
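For reference, the configuration cells typically look like the following; the Glue version, worker settings, and extra Python modules shown here are example values rather than requirements of this post.

%idle_timeout 60
%glue_version 4.0
%worker_type G.1X
%number_of_workers 5
%additional_python_modules langchain,langchain-aws,langchain-community,opensearch-py

# Running a statement after the magics starts the interactive session
from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session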

Vector Store Setup

The first step is to create a vector store, which facilitates efficient vector similarity searches through specialized indexes. RAG enhances LLMs with an external knowledge base typically structured using a vector database filled with vector-encoded knowledge articles.

For this example, we will use Amazon OpenSearch Serverless for its simplicity and scalability, enabling low-latency vector searches that can handle billions of vectors. To learn more, check out Amazon OpenSearch Service’s vector database capabilities explained.

Complete the following steps to set up OpenSearch Serverless:

  1. In the cell under Vector Store Setup, substitute <your-iam-role-arn> with your IAM role Amazon Resource Name (ARN), and <region> with your AWS Region, then run the cell.
  2. Execute the next cell to create the OpenSearch Serverless collection, security policies, and access policies.
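For orientation, here is a condensed sketch of what that provisioning cell does using the boto3 opensearchserverless client. The collection name is a hypothetical example, and <your-iam-role-arn> and <region> are the same placeholders you substituted in step 1. The demo-style policies below allow public network access and broad data access for the notebook role; you would tighten both for production.

import json
import boto3

aoss = boto3.client("opensearchserverless", region_name="<region>")
collection_name = "rag-collection"  # hypothetical name
role_arn = "<your-iam-role-arn>"

# Encryption policy: use an AWS-owned KMS key for the collection
aoss.create_security_policy(
    name=f"{collection_name}-enc",
    type="encryption",
    policy=json.dumps({
        "Rules": [{"ResourceType": "collection", "Resource": [f"collection/{collection_name}"]}],
        "AWSOwnedKey": True,
    }),
)

# Network policy: allow public access to the collection endpoint (demo only)
aoss.create_security_policy(
    name=f"{collection_name}-net",
    type="network",
    policy=json.dumps([{
        "Rules": [{"ResourceType": "collection", "Resource": [f"collection/{collection_name}"]}],
        "AllowFromPublic": True,
    }]),
)

# Data access policy: grant the notebook role permissions on the collection and its indexes
aoss.create_access_policy(
    name=f"{collection_name}-access",
    type="data",
    policy=json.dumps([{
        "Rules": [
            {"ResourceType": "collection", "Resource": [f"collection/{collection_name}"],
             "Permission": ["aoss:*"]},
            {"ResourceType": "index", "Resource": [f"index/{collection_name}/*"],
             "Permission": ["aoss:*"]},
        ],
        "Principal": [role_arn],
    }]),
)

# Finally, create the vector search collection itself
aoss.create_collection(name=collection_name, type="VECTORSEARCH")

Collection creation is asynchronous; the collection endpoint becomes usable once its status transitions to ACTIVE.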

You have successfully provisioned OpenSearch Serverless and are now ready to ingest documents.