Across many sectors, vast quantities of business documents are processed each day. Many of these documents are paper-based, scanned into digital formats, or stored in unstructured forms such as PDFs. Each organization typically applies its own rules, based on its business context, when processing these documents. Extracting this information accurately and processing it flexibly is a challenge many companies face.
Amazon Intelligent Document Processing (IDP) empowers users to leverage top-tier machine learning (ML) technology without requiring prior ML expertise. This post outlines a solution featured in the Amazon IDP workshop, demonstrating how to process documents while adhering to flexible business rules using Amazon AI services. A step-by-step Jupyter notebook is available to guide you through the lab.
Amazon Textract simplifies text extraction from various document types, while Amazon Augmented AI (Amazon A2I) facilitates the inclusion of human review for ML predictions. The standard Amazon A2I template enables you to establish a human review workflow based on criteria such as when the extraction confidence score falls below a predetermined threshold or when essential keys are absent. However, in a live environment, it’s crucial for the document processing pipeline to accommodate adaptable business rules, such as checking string formats, verifying data types and ranges, and cross-validating fields across documents. This post illustrates how to use Amazon Textract and Amazon A2I to customize a general document processing pipeline that supports various business rules.
Solution Overview
For our illustrative example, we use the Tax Form 990, a US IRS (Internal Revenue Service) document that provides public financial information about non-profit organizations. In this instance, we will focus solely on the extraction logic for specific fields on the first page of the form. Additional sample documents are available on the IRS website.
The diagram below depicts the IDP pipeline designed to support customized business rules with human review.
The architecture consists of three main stages:
- Extraction – Data is extracted from the 990 Tax Form (using page 1 as an example).
    – Retrieve a sample image stored in an Amazon Simple Storage Service (Amazon S3) bucket.
    – Invoke the Amazon Textract analyze_document API employing the Queries feature to extract text from the page.
- Validation – Implement flexible business rules with a human-in-the-loop review.
    – Validate the extracted data against business rules, such as checking the length of an ID field.
    – If any business rules fail, send the document to Amazon A2I for human review.
    – Reviewers utilize the Amazon A2I UI (which can be customized) to confirm the extraction results.
- BI Visualization – Use Amazon QuickSight to create a business intelligence (BI) dashboard that displays process insights.
Customizing Business Rules
You can define a generic business rule as a JSON-style dictionary. In the sample code, we present three rules:
- The first rule applies to the employer ID field and fails if the Amazon Textract confidence score is below 99%. For the purposes of this post, we’ve set the threshold high so that the rule fails intentionally. In real-world applications, you might consider lowering this threshold to around 90% to minimize unnecessary human effort.
- The second rule concerns the DLN field (the unique identifier for the tax form), which is necessary for downstream processing. This rule fails if the DLN field is absent or empty.
- The third rule is also related to the DLN field but checks its length. The rule fails if the DLN does not comprise exactly 16 characters.
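Here’s how our business rules appear in the sample code, written as a Python list of JSON-style dictionaries (note the use of Python's None rather than JSON's null):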
rules = [
    {
        "description": "Employer Id confidence score should be greater than 99",
        "field_name": "d.employer_id",
        "field_name_regex": None,
        "condition_category": "Confidence",
        "condition_type": "ConfidenceThreshold",
        "condition_setting": "99"
    },
    {
        "description": "dln is required",
        "field_name": "dln",
        "condition_category": "Required",
        "condition_type": "Required",
        "condition_setting": None
    },
    {
        "description": "dln length should be 16",
        "field_name": "dln",
        "condition_category": "LengthCheck",
        "condition_type": "ValueRegex",
        "condition_setting": "^[0-9a-zA-Z]{16}$"
    }
]
You can extend the solution by adding more business rules that follow this structure.
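As an illustration of extending the list, the sketch below adds a fourth rule that reuses the ValueRegex condition type shown above to check the format of the omb_no field. The rule itself, the "FormatCheck" category label, and the existing_rules placeholder are assumptions made for this example and are not part of the sample code.

```python
import re

# Hypothetical extra rule: omb_no should look like four digits, a hyphen,
# and four digits (for example "1545-0047").
omb_format_rule = {
    "description": "omb_no should match NNNN-NNNN",
    "field_name": "omb_no",
    "condition_category": "FormatCheck",  # hypothetical category label
    "condition_type": "ValueRegex",       # condition type taken from the rules above
    "condition_setting": r"^[0-9]{4}-[0-9]{4}$",
}

# Placeholder standing in for the three rules defined above.
existing_rules = []
extended_rules = existing_rules + [omb_format_rule]

# Quick sanity check of the pattern before relying on it in a pipeline.
omb_pattern = re.compile(omb_format_rule["condition_setting"])
sample_matches = omb_pattern.match("1545-0047") is not None
```

Checking a new pattern against a known-good sample value before adding the rule helps avoid sending every document to human review because of a typo in the regular expression.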
Extracting Text Using Amazon Textract Queries
In the sample solution, we use the Queries feature of the Amazon Textract analyze_document API to extract fields by asking specific questions. You do not need to know how the data is structured in the document (tables, forms, or nested data) or to account for variations between document versions and formats. Queries use visual, spatial, and linguistic cues to extract the requested information.
To obtain the value for the DLN field, you might submit a request with a natural language question like, “What is the DLN?” If it finds the relevant information in the image or document, Amazon Textract will return the text, confidence, and additional metadata. Below is an example of an Amazon Textract query request:
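import boto3

# data_bucket and s3_key are assumed to be defined elsewhere and to point
# to the sample image in Amazon S3.
textract = boto3.client("textract")

response = textract.analyze_document(
    Document={'S3Object': {'Bucket': data_bucket, 'Name': s3_key}},
    FeatureTypes=["QUERIES"],
    QueriesConfig={
        'Queries': [
            {
                'Text': 'What is the DLN?',
                'Alias': 'The DLN number - unique identifier of the form'
            }
        ]
    }
)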
Defining the Data Model
The sample solution organizes the data into a structured format to facilitate the evaluation of generic business rules. You can create a data model for each document page to retain extracted values. The following image illustrates how the text on page 1 maps to the JSON fields.
Each field corresponds to a text value, checkbox, or table/form cell on the page. The JSON object is structured as follows:
{
    "dln": {
        "value": "93493319020929",
        "confidence": 0.9765,
        "block": {}
    },
    "omb_no": {
        "value": "1545-0047",
        "confidence": 0.9435,
        "block": {}
    },
    ...
}
You can find the complete JSON structure definition in the GitHub repository.
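One way to populate a data model of this shape is to walk the analyze_document response and pair each QUERY block with the QUERY_RESULT block it points to. The following is a minimal sketch, not the repository's implementation. The function names are hypothetical, and it assumes each query was submitted with a short alias such as dln that can double as the field name (the request shown earlier uses a longer descriptive alias).

```python
from typing import Any, Dict, Optional


def _first_answer(query_block: Dict[str, Any],
                  blocks_by_id: Dict[str, Dict[str, Any]]) -> Optional[Dict[str, Any]]:
    """Return the first QUERY_RESULT block linked to a QUERY block, if any."""
    for relationship in query_block.get("Relationships", []):
        if relationship.get("Type") != "ANSWER":
            continue
        for answer_id in relationship.get("Ids", []):
            result = blocks_by_id.get(answer_id)
            if result is not None and result.get("BlockType") == "QUERY_RESULT":
                return result
    return None


def parse_query_answers(response: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Build {field: {"value", "confidence", "block"}} from a Textract response.

    The field key is the query alias when present, otherwise the question text.
    Questions without an answer get None for value, confidence, and block.
    """
    blocks = response.get("Blocks", [])
    blocks_by_id = {block["Id"]: block for block in blocks}
    fields: Dict[str, Dict[str, Any]] = {}

    for block in blocks:
        if block.get("BlockType") != "QUERY":
            continue
        query = block.get("Query", {})
        key = query.get("Alias") or query.get("Text")
        result = _first_answer(block, blocks_by_id)
        fields[key] = {
            "value": result.get("Text") if result else None,
            "confidence": result.get("Confidence") if result else None,
            "block": result,
        }
    return fields
```

Keeping unanswered questions in the result with a value of None, rather than dropping them, lets a later "required" rule detect the missing field.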
Evaluating Data Against Business Rules
The sample solution includes a Condition class—a generic rules engine that processes the extracted data (as defined in the data model) and the customized rules. It generates two lists identifying failed and satisfied conditions. The results can determine whether to forward the document to Amazon A2I for human review.
The source code for the Condition class is available in the sample GitHub repository. It supports fundamental validation logic, such as verifying string lengths, value ranges, and confidence score thresholds. The code can be modified to accommodate more condition types and intricate validation logic.
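To make the idea concrete, here is a minimal sketch of a rules evaluator covering the three condition types used in this post. It is not the Condition class from the repository. The function name evaluate_rules is hypothetical, field names such as d.employer_id are treated as flat dictionary keys, and confidence values are assumed to be on the same 0 to 100 scale as the threshold setting.

```python
import re
from typing import Any, Dict, List, Tuple

Rule = Dict[str, Any]
Field = Dict[str, Any]


def evaluate_rules(data: Dict[str, Field],
                   rules: List[Rule]) -> Tuple[List[Rule], List[Rule]]:
    """Return (failed, satisfied) lists of rules for the extracted data."""
    failed: List[Rule] = []
    satisfied: List[Rule] = []

    for rule in rules:
        # A field that was never extracted is treated as an empty record.
        field = data.get(rule["field_name"]) or {}
        value = field.get("value")
        confidence = field.get("confidence")
        condition_type = rule["condition_type"]

        if condition_type == "Required":
            passed = value is not None and str(value).strip() != ""
        elif condition_type == "ConfidenceThreshold":
            # Assumes confidence uses the same scale as the setting (0 to 100).
            threshold = float(rule["condition_setting"])
            passed = confidence is not None and confidence >= threshold
        elif condition_type == "ValueRegex":
            pattern = rule["condition_setting"]
            passed = value is not None and re.search(pattern, str(value)) is not None
        else:
            # Unknown condition types fail so the document still gets reviewed.
            passed = False

        (satisfied if passed else failed).append(rule)

    return failed, satisfied
```

With the sample DLN value shown earlier, which is 14 characters long, the length rule lands in the failed list, and a non-empty failed list is the signal to send the document for human review.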
Creating a Customized Amazon A2I Web UI
Amazon A2I allows you to personalize the reviewer’s web UI by defining a worker task template. This template serves as a static webpage built with HTML and JavaScript. You can pass data to the customized reviewer page using the API.
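As a rough sketch of passing data to the reviewer page, the code below builds an input payload and starts a human loop through the Amazon A2I runtime client's start_human_loop call. The helper names and the payload keys (taskObject, fields, failedRules) are assumptions made for this example; a worker task template would need to read the same keys from its task input. The flow definition ARN must come from a human review workflow you have already created.

```python
import json
import uuid
from typing import Any, Dict, List


def build_review_input(s3_uri: str,
                       data: Dict[str, Dict[str, Any]],
                       failed_rules: List[Dict[str, Any]]) -> Dict[str, str]:
    """Serialize the extracted fields and failed rules for the reviewer UI."""
    content = {
        "taskObject": s3_uri,  # hypothetical key: location of the page image
        "fields": {
            name: {"value": field.get("value"), "confidence": field.get("confidence")}
            for name, field in data.items()
        },
        "failedRules": [rule["description"] for rule in failed_rules],
    }
    # start_human_loop expects InputContent to be a JSON string.
    return {"InputContent": json.dumps(content)}


def start_review(a2i_client: Any, flow_definition_arn: str,
                 review_input: Dict[str, str]) -> str:
    """Start a human loop and return the generated loop name."""
    loop_name = f"idp-review-{uuid.uuid4()}"  # lowercase letters, digits, hyphens
    a2i_client.start_human_loop(
        HumanLoopName=loop_name,
        FlowDefinitionArn=flow_definition_arn,
        HumanLoopInput=review_input,
    )
    return loop_name


# Example wiring (requires AWS credentials and an existing flow definition):
# import boto3
# a2i = boto3.client("sagemaker-a2i-runtime")
# start_review(a2i, flow_definition_arn, build_review_input(uri, data, failed))
```

Passing the client in as a parameter keeps the helpers easy to exercise with a stub before pointing them at a real review workflow.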