Information Extraction with LLMs through Amazon SageMaker JumpStart


Authors: Pooya Vahidi and Romina Sharifpour

Date: May 7, 2024

Categories: Amazon SageMaker JumpStart, Best Practices, Expert Insights, Generative AI, Technical Guides

Large language models (LLMs) have revolutionized the way we extract information from unstructured text data. While there is considerable buzz around LLMs in the context of generative AI, many of the core applications remain unchanged. Examples include routing support tickets, identifying customer intents in chatbot interactions, extracting key entities from contracts and invoices, and analyzing customer feedback—long-standing needs in various industries.

The transformative power of LLMs lies in their capability to deliver state-of-the-art results on these established tasks with minimal data and straightforward prompting, while also managing multiple tasks simultaneously. Unlike traditional methods that demand extensive feature engineering and data labeling, LLMs can be fine-tuned using small, domain-specific datasets, allowing for rapid adaptation to new applications. Platforms like Amazon SageMaker JumpStart facilitate this process by alleviating the complexities involved in fine-tuning and deploying these models.

SageMaker JumpStart serves as a machine learning (ML) hub, offering foundation models (FMs), built-in algorithms, and pre-configured ML solutions that can be deployed with just a few clicks. The platform allows users to evaluate, compare, and select models based on predefined quality and responsibility metrics for tasks such as article summarization and image generation.

In this article, we will explore examples of how to build information extraction use cases by integrating LLMs with prompt engineering and frameworks like LangChain. We will also assess the benefits of fine-tuning an LLM for specific extractive tasks. Whether your goal is to classify documents, extract keywords, identify and redact personally identifiable information (PII), or analyze semantic relationships, the use of LLMs can enhance your natural language processing (NLP) initiatives.

Prompt Engineering

Prompt engineering is the technique of designing instructions for LLMs to generate suggestions, explanations, or text completions interactively. This method relies on large pretrained language models that have been trained on vast amounts of text data. There isn’t a one-size-fits-all approach to crafting prompts; different LLMs may respond better or worse to various prompt styles. Therefore, prompts are typically refined through an iterative process of testing and adjustment to achieve optimal results. As a starting point, refer to the model documentation in SageMaker JumpStart, which typically includes prompting recommendations and examples.

In the sections that follow, we will concentrate on the prompt engineering techniques relevant for extractive use cases. These techniques help unlock the potential of LLMs by providing useful constraints and directing the model toward desired outcomes. We will cover the following applications:

  • Detection and redaction of sensitive information
  • Entity extraction, including both generic and specific entities in structured formats
  • Classification through prompt engineering and fine-tuning

Prerequisites

The source code for this example is accessible in a GitHub repository, which includes several Jupyter notebooks and a utils.py module containing shared code utilized across the notebooks.

The easiest way to execute this example is by using Amazon SageMaker Studio with the Data Science 3.0 kernel or an Amazon SageMaker notebook instance with the conda_python3 kernel. Default settings can be used for the instance type. In this example, we use ml.g5.2xlarge and ml.g5.48xlarge for the inference endpoints, and ml.g5.24xlarge for training. Ensure you have adequate quotas for these instance types in your chosen Region by checking the Service Quotas console.

Throughout this article, we will employ Jupyter notebooks. Before delving into the examples, it’s essential to verify that you have the latest version of the SageMaker Python SDK, which provides a user-friendly interface for training and deploying models on SageMaker. To install or update to the latest version, execute the following command in the first cell of your Jupyter notebook:

%pip install --quiet --upgrade sagemaker

Deploying Llama-2-70b-chat using SageMaker JumpStart

Amazon SageMaker JumpStart offers a variety of LLMs to choose from. In this example, we will utilize Llama-2-70b-chat, although you may select a different model based on your specific use case. For a comprehensive list of available models, refer to the JumpStart Available Model Table.

You can deploy a model from SageMaker JumpStart using either the APIs, as shown in this article, or through the SageMaker Studio UI. Once the model is deployed, you can test it by posing a question to the model, as shown in the query sketch that follows the deployment code:

from sagemaker import get_execution_role
from sagemaker.jumpstart.model import JumpStartModel

# IAM role the endpoint assumes; get_execution_role() works inside
# SageMaker Studio or a notebook instance (pass an explicit ARN elsewhere)
role_arn = get_execution_role()

model_id, model_version = "meta-textgeneration-llama-2-70b-f", "2.*"
endpoint_name = model_id
instance_type = "ml.g5.48xlarge"

# Create the JumpStart model and deploy it to a real-time endpoint
model = JumpStartModel(
    model_id=model_id, model_version=model_version, role=role_arn
)
predictor = model.deploy(
    endpoint_name=endpoint_name, instance_type=instance_type
)

If an instance type is not specified, the SageMaker JumpStart SDK will select the default option. In this example, we explicitly choose ml.g5.48xlarge.
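With the endpoint in service, you can send a chat-formatted request to it. The following is a minimal query sketch: the payload follows the Llama 2 chat schema used by JumpStart for this model version, the generation parameters are illustrative values, and custom_attributes="accept_eula=true" acknowledges the Llama 2 end-user license:

# Pose a question to the deployed Llama-2-70b-chat endpoint
payload = {
    "inputs": [[{"role": "user", "content": "What is Amazon SageMaker JumpStart?"}]],
    "parameters": {"max_new_tokens": 256, "temperature": 0.2, "top_p": 0.9},
}

# Llama 2 models require explicitly accepting the EULA on each request
response = predictor.predict(payload, custom_attributes="accept_eula=true")
print(response[0]["generation"]["content"])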

Sensitive Data Extraction and Redaction

LLMs exhibit great potential for extracting sensitive information for redaction purposes. Techniques such as prompt engineering can prime the model to understand the redaction task and provide examples that enhance performance. For instance, by instructing the model to “redact sensitive information” and demonstrating examples of redacting names, dates, and locations, the LLM can infer the task’s rules.

More sophisticated forms of priming involve providing both positive and negative examples, showcasing common errors, and employing in-context learning to convey the subtleties of effective redaction. With thoughtful prompt design, LLMs can effectively redact information while preserving the document’s readability and usefulness. However, in real-world applications, additional evaluation is often necessary to enhance the reliability and safety of LLMs in handling confidential data. This is usually accomplished through human reviews, as no automated method is entirely without flaws.
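To make this kind of priming concrete, the following sketch shows a hypothetical few-shot redaction prompt; the demonstration texts and the **** masking convention are illustrative assumptions, not part of the accompanying notebooks:

# A hypothetical few-shot prompt that primes the model for redaction.
# Each demonstration pairs raw text with its redacted counterpart so the
# model can infer the masking rules before seeing the real input.
few_shot_prompt = """
Redact all names, dates, and locations in the text by replacing them with ****.

Text: John Smith visited Berlin on June 3rd, 2021.
Redacted: **** visited **** on ****.

Text: The invoice was signed by Maria Lopez in Toronto on 2022-01-15.
Redacted: The invoice was signed by **** in **** on ****.

Text: {input_text}
Redacted:"""

print(few_shot_prompt.format(input_text="Alice flew from Paris on May 2nd."))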

Here are a few examples of applying prompt engineering for the extraction and redaction of PII. The prompt consists of two main parts: the report_sample, which contains the text from which you want to identify and mask PII, and instructions (or guidance) provided to the model as the system message:

report_sample = """
This month at AnyCompany, we have seen a significant surge in orders from a diverse clientele. On November 5th, 2023, customer Alice from US placed an order with total of $2190. Following her, on Nov 7th, Bob from UK ordered a bulk set of twenty-five ergonomic keyboards for his office setup with total of $1000. The trend continued with Jane from Australia, who on Nov 12th requested a shipment of ten high-definition monitors with total of $9000, emphasizing the need for environmentally friendly packaging. On the last day of that month, customer John, located in Singapore, finalized an order for fifteen USB-C docking stations, aiming to equip his design studio with the latest technology for total of $3600.
"""

system = """
Your task is to precisely identify Personally Identifiable Information (PII) and identifiable details, including name, address, and the person's country, in the provided report. Replace each identified detail with masking characters (****) and return the masked report.
"""
