Protecting Sensitive Data in RAG Applications with Amazon Bedrock Knowledge Bases
Retrieval Augmented Generation (RAG) applications have surged in popularity due to their capability to enhance generative AI tasks with contextually relevant information. However, when implementing RAG-based applications, it is imperative to prioritize security, especially regarding sensitive data. Protecting personally identifiable information (PII), protected health information (PHI), and confidential business data is essential, as this information traverses RAG systems. Neglecting these security measures can lead to serious risks and potential data breaches. For organizations in the healthcare and finance sectors, as well as those managing confidential information, these risks can result in regulatory compliance violations and erosion of customer trust. For further insights on unique security challenges tied to generative AI applications, see the OWASP Top 10 for Large Language Model Applications.
Creating a robust threat model for your generative AI applications can help identify potential vulnerabilities such as sensitive data leakage, prompt injections, and unauthorized data access. To support this initiative, AWS offers a variety of generative AI security strategies to help you establish effective threat models, including an example threat model tailored for a generative AI chatbot application.
Amazon Bedrock Knowledge Bases represents a fully managed capability designed to streamline the entire RAG workflow. This empowers organizations to provide foundation models (FMs) and agents with contextual information from private data sources, resulting in more accurate and relevant responses tailored to specific needs. Moreover, with Amazon Bedrock Guardrails, you can implement customized safeguards in your generative AI applications aligned with your use cases and responsible AI policies. For example, you can utilize Amazon Bedrock Guardrails to redact sensitive information such as PII, ensuring privacy protection.
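As a concrete illustration of the guardrail idea, the sketch below builds the request parameters for a guardrail whose sensitive-information policy anonymizes or blocks common PII entities. The guardrail name, entity selection, and messaging strings are illustrative assumptions; in a real deployment you would pass this dict to `create_guardrail` on a boto3 `bedrock` client.

```python
# Sketch of a Bedrock guardrail sensitive-information policy (illustrative).
# The name, entity choices, and messages below are assumptions; pass the dict
# to bedrock.create_guardrail(**params) with a boto3 client in practice.

def build_pii_guardrail_params(name: str) -> dict:
    """Build create_guardrail parameters that redact common PII entities."""
    return {
        "name": name,
        "description": "Redacts PII in model inputs and outputs",
        "sensitiveInformationPolicyConfig": {
            "piiEntitiesConfig": [
                {"type": "NAME", "action": "ANONYMIZE"},
                {"type": "EMAIL", "action": "ANONYMIZE"},
                {"type": "ADDRESS", "action": "ANONYMIZE"},
                {"type": "US_SOCIAL_SECURITY_NUMBER", "action": "BLOCK"},
            ]
        },
        "blockedInputMessaging": "Sorry, this request contains blocked content.",
        "blockedOutputsMessaging": "Sorry, the response contained blocked content.",
    }

params = build_pii_guardrail_params("pii-redaction-guardrail")
print(params["sensitiveInformationPolicyConfig"]["piiEntitiesConfig"][0]["type"])
```

The `ANONYMIZE` action replaces the detected entity with a placeholder in model output, while `BLOCK` rejects the content entirely; choosing between them per entity type lets you match the guardrail to your responsible AI policy.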
RAG Workflow: Transforming Data into Actionable Knowledge
The RAG process comprises two primary steps:
- Ingestion – This involves preprocessing unstructured data by converting it into text documents and dividing the documents into manageable chunks. These chunks are then encoded with an embedding model, transforming them into document embeddings. The encoded embeddings, along with the original text chunks, are stored in a vector store like Amazon OpenSearch Service.
- Augmented Retrieval – At query time, the user’s query is encoded with the same embedding model to create a query embedding. This embedding drives a similarity search against the stored document embeddings to retrieve semantically similar document chunks. The retrieved chunks then supplement the user prompt, giving the text generation model the context it needs to respond accurately. If sensitive data is not sanitized during ingestion, that information can be retrieved and exposed to unauthorized users through model responses.
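The two steps above can be sketched end to end in a few lines. The toy bag-of-words "embedding" and cosine similarity below stand in for a real embedding model and vector store such as OpenSearch Service; the documents and query are invented for illustration.

```python
# Minimal sketch of the two RAG steps. A toy bag-of-words "embedding" stands
# in for a real embedding model and vector store; illustration only.
from collections import Counter
import math

def embed(text: str) -> Counter:
    """Toy embedding: word counts stand in for a dense embedding vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Ingestion: chunk documents and store (embedding, chunk) pairs.
chunks = [
    "patients are scheduled through the clinic portal",
    "invoices are processed by the billing team",
]
store = [(embed(c), c) for c in chunks]

# Augmented retrieval: embed the query, fetch the most similar chunk,
# and prepend it to the prompt sent to the text generation model.
query = "how are invoices processed"
best = max(store, key=lambda pair: cosine(embed(query), pair[0]))[1]
augmented_prompt = f"Context: {best}\n\nQuestion: {query}"
print(best)
```

Note that whatever text lands in `store` is exactly what retrieval can surface later, which is why the redaction patterns in this article operate before or during this step.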
The following diagram illustrates the architectural workflow of a RAG system, detailing how a user’s query is processed through various stages to produce an informed response.
Solution Overview
In this article, we discuss two architectural patterns: data redaction at the storage level and role-based access, aimed at protecting sensitive data while building RAG-based applications with Amazon Bedrock Knowledge Bases.
- Data Redaction at Storage Level – This involves identifying and redacting sensitive information before it is stored in the vector store (ingestion) through Amazon Bedrock Knowledge Bases. This zero-trust approach minimizes the risk of unauthorized disclosure of sensitive data.
- Role-Based Access to Sensitive Data – This strategy governs selective access to sensitive information based on user roles and permissions during data retrieval. This method is particularly beneficial in contexts such as healthcare, where distinct user roles like administrators (doctors) and non-administrators (nurses or support personnel) exist.
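To make the role-based pattern concrete, the sketch below builds a metadata filter in the shape accepted by the Knowledge Bases Retrieve API (`vectorSearchConfiguration.filter`). The metadata key `access_level` and the role names are assumptions; a real knowledge base would tag each document with such metadata at ingestion time.

```python
# Sketch of a role-based retrieval filter for Amazon Bedrock Knowledge Bases.
# The "access_level" metadata key and role names are assumptions; documents
# would be tagged with this metadata when they are ingested.

def build_retrieval_config(user_role: str) -> dict:
    """Return a retrieve() configuration that scopes results by role."""
    if user_role == "doctor":
        # Administrators may see both public and sensitive documents.
        role_filter = {"in": {"key": "access_level", "value": ["public", "sensitive"]}}
    else:
        # Non-administrators are restricted to public documents only.
        role_filter = {"equals": {"key": "access_level", "value": "public"}}
    return {
        "vectorSearchConfiguration": {
            "numberOfResults": 5,
            "filter": role_filter,
        }
    }

# Pass as retrievalConfiguration to bedrock_agent_runtime.retrieve(...).
config = build_retrieval_config("nurse")
print(config["vectorSearchConfiguration"]["filter"])
```

Because the filter is applied server-side during the similarity search, chunks a user is not entitled to see are never returned to the application, rather than being removed after the fact.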
For all data stored in Amazon Bedrock, the AWS shared responsibility model applies.
Let’s explore how to effectively implement the data redaction at the storage level and role-based access architecture patterns.
Scenario 1: Identify and Redact Sensitive Data Prior to Ingestion into the Vector Store
The ingestion flow follows a four-step process to safeguard sensitive data when developing RAG applications using Amazon Bedrock:
- Source Document Processing – An AWS Lambda function monitors incoming text documents in a source Amazon Simple Storage Service (Amazon S3) bucket, triggering an Amazon Comprehend PII redaction job to identify and mask sensitive data. An Amazon EventBridge rule activates this Lambda function every five minutes. This pipeline only processes text documents; to manage documents with embedded images, additional preprocessing steps should be applied to extract and analyze images separately before ingestion.
- PII Identification and Redaction – The Amazon Comprehend PII redaction job assesses the text content to recognize and redact PII entities, such as names, email addresses, physical addresses, and financial data.
- Deep Security Scanning – Post-redaction, documents are transferred to another folder where Amazon Macie verifies the effectiveness of the redaction and flags any remaining sensitive data objects. Documents marked by Macie are sent to a quarantine bucket for manual review, while those cleared are stored in a redacted bucket ready for ingestion. For comprehensive data ingestion details, refer to Sync your data with your Amazon Bedrock knowledge base.
- Secure Knowledge Base Integration – Redacted documents are ingested into the knowledge base via a data ingestion job. For multi-modal content, consider implementing:
- A dedicated image extraction and processing pipeline.
- Image analysis to detect and redact sensitive visual information.
- Amazon Bedrock Guardrails to filter inappropriate content during retrieval.
This multi-layered approach emphasizes securing text content while underscoring the need for additional safeguards for image processing. Organizations should assess their multi-modal document requirements and extend their security frameworks accordingly.
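The routing decision in the deep security scanning step can be sketched as a small function: objects that Macie flags go to a quarantine bucket for manual review, while clean objects proceed to the redacted bucket that feeds the knowledge base. The bucket names are hypothetical, and a real pipeline would perform the move with `copy_object`/`delete_object` on a boto3 S3 client.

```python
# Sketch of the post-scan routing step. Bucket names are hypothetical; in a
# real pipeline the object would be moved with boto3 s3.copy_object calls.

def route_object(key: str, macie_finding_count: int) -> str:
    """Return the destination URI for an object after the Macie scan."""
    if macie_finding_count > 0:
        return f"s3://quarantine-bucket/{key}"  # flagged: manual review
    return f"s3://redacted-bucket/{key}"        # clean: ready for ingestion

print(route_object("reports/claim-001.txt", 0))
```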
Ingestion Flow
The following illustration depicts a secure document processing pipeline for managing sensitive data prior to ingestion into Amazon Bedrock Knowledge Bases. The high-level steps include:
- The document ingestion flow initiates when documents containing sensitive data are uploaded to a monitored inputs folder in the source bucket. An EventBridge rule triggers a Lambda function (ComprehendLambda).
- The ComprehendLambda function checks for new files in the inputs folder and moves landed files to a processing folder. It then launches an asynchronous Amazon Comprehend PII redaction job, tracking the job ID and status in an Amazon DynamoDB JobTracking table.
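The sketch below shows what ComprehendLambda submits: the parameters for the asynchronous Comprehend PII redaction job and the DynamoDB tracking item. The bucket URIs, role ARN, and table attribute names are assumptions; in the real function these dicts would be passed to `comprehend.start_pii_entities_detection_job(**params)` and `table.put_item(Item=item)` via boto3.

```python
# Sketch of the ComprehendLambda submission step. Bucket URIs, the role ARN,
# and the JobTracking table layout are assumptions; in a real function these
# dicts feed comprehend.start_pii_entities_detection_job(**params) and
# table.put_item(Item=item) via boto3.
from datetime import datetime, timezone

def build_redaction_job_params(input_uri: str, output_uri: str, role_arn: str) -> dict:
    """Build parameters for an asynchronous Comprehend PII redaction job."""
    return {
        "InputDataConfig": {"S3Uri": input_uri, "InputFormat": "ONE_DOC_PER_FILE"},
        "OutputDataConfig": {"S3Uri": output_uri},
        "Mode": "ONLY_REDACTION",
        "RedactionConfig": {
            "PiiEntityTypes": ["ALL"],
            "MaskMode": "REPLACE_WITH_PII_ENTITY_TYPE",
        },
        "DataAccessRoleArn": role_arn,
        "LanguageCode": "en",
    }

def build_tracking_item(job_id: str) -> dict:
    """Build the DynamoDB item that records the async job's progress."""
    return {
        "JobId": job_id,          # partition key in the JobTracking table
        "Status": "IN_PROGRESS",  # updated when the async job completes
        "SubmittedAt": datetime.now(timezone.utc).isoformat(),
    }

params = build_redaction_job_params(
    "s3://source-bucket/processing/",
    "s3://source-bucket/comprehend-output/",
    "arn:aws:iam::123456789012:role/ComprehendDataAccessRole",
)
print(params["Mode"])
```

Tracking the job in DynamoDB lets a follow-up Lambda poll the asynchronous job's status and advance each document to the Macie scanning stage once redaction completes.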