Organizations frequently need to extract data from scanned documents, including forms and tables in PDFs. These documents range from audit forms and tax papers to customer reviews. With customer feedback such as product or movie reviews, for instance, analyzing the sentiment of the extracted text can reveal how users feel about a product or film.
Traditionally, data extraction involved manual entry, an approach that is slow, costly, and error-prone. Alternatively, one could use basic optical character recognition (OCR) techniques, which require manual adjustment for each input format. Even then, extracting meaningful insights from the data remains labor-intensive and typically requires expertise in data science, machine learning (ML), and natural language processing (NLP).
To address these challenges, AWS offers advanced AI services such as Amazon Textract and Amazon Comprehend. These pre-trained AI solutions provide built-in intelligence for your applications and workflows, leveraging the same deep learning technologies that drive Amazon.com’s services. Notably, these AI tools do not require prior ML experience, making them accessible for a broader range of users.
Amazon Textract employs ML to automatically extract data from documents, including printed text, handwriting, forms, and tables, without necessitating any manual intervention or custom coding. It captures complete text from documents while also providing essential details like page numbers and bounding box information. This data can be instrumental in segmenting text into logical sections, allowing for a deeper understanding of the content.
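To make the page number and bounding box details concrete, here is a minimal sketch that pulls the LINE blocks out of a Textract-style response. The field names (Blocks, BlockType, Text, Page, Geometry, BoundingBox) follow the Textract response format; the helper name extract_lines and the sample data are hypothetical and only for illustration.

```python
from typing import Any


def extract_lines(response: dict[str, Any]) -> list[dict[str, Any]]:
    """Collect text, page number, and bounding box for every LINE block."""
    lines = []
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "LINE":
            continue  # skip PAGE and WORD blocks
        box = block["Geometry"]["BoundingBox"]
        lines.append({
            "text": block["Text"],
            # Assumption: fall back to page 1 when no Page field is present.
            "page": block.get("Page", 1),
            # Coordinates are ratios of the page width and height (0 to 1).
            "left": box["Left"],
            "top": box["Top"],
            "width": box["Width"],
            "height": box["Height"],
        })
    return lines


# Hypothetical, hand-written sample shaped like a Textract response.
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE", "Page": 1},
        {
            "BlockType": "LINE",
            "Page": 1,
            "Text": "Movie One",
            "Geometry": {"BoundingBox": {
                "Width": 0.30, "Height": 0.04, "Left": 0.10, "Top": 0.10}},
        },
        {
            "BlockType": "WORD",
            "Page": 1,
            "Text": "Movie",
            "Geometry": {"BoundingBox": {
                "Width": 0.14, "Height": 0.04, "Left": 0.10, "Top": 0.10}},
        },
    ]
}

lines = extract_lines(sample_response)
```

Flattening each line into a small dictionary keeps the later segmentation logic independent of the raw response layout.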
In this discussion, we explore various techniques for segmenting paragraphs to enhance the insights derived from Amazon Textract, as well as the applications of Amazon Comprehend for sentiment analysis and entity detection. The techniques include:
- Identifying paragraphs based on font sizes from Amazon Textract responses.
- Segmenting text by analyzing indentation using bounding box data.
- Dividing content into segments according to line spacing.
- Recognizing paragraphs or statements through the presence of punctuation marks.
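Two of the techniques above can be sketched in a few lines. This example treats the height of a line's bounding box as a rough proxy for font size (Textract returns geometry, not font size, so this is an assumption of the sketch) and starts a new paragraph when the vertical gap between lines is large. The function name split_into_sections, the thresholds header_ratio and gap_ratio, and the sample lines are hypothetical, and the sketch assumes all lines come from a single page in reading order.

```python
from statistics import median


def split_into_sections(lines, header_ratio=1.3, gap_ratio=1.0):
    """Group lines into sections of the form {"header": ..., "paragraphs": [...]}.

    A line counts as a header when its height exceeds header_ratio times the
    median line height. A new paragraph starts when the gap to the previous
    line exceeds gap_ratio times the median line height.
    """
    if not lines:
        return []
    body_height = median(line["height"] for line in lines)
    sections = []
    current = {"header": None, "paragraphs": []}
    paragraph = []      # lines of the paragraph being built
    prev_bottom = None  # bottom edge of the previous line

    for line in lines:
        if line["height"] > header_ratio * body_height:
            # Header: close the open paragraph and the open section.
            if paragraph:
                current["paragraphs"].append(" ".join(paragraph))
                paragraph = []
            if current["header"] is not None or current["paragraphs"]:
                sections.append(current)
            current = {"header": line["text"], "paragraphs": []}
        else:
            gap = None if prev_bottom is None else line["top"] - prev_bottom
            if paragraph and gap is not None and gap > gap_ratio * body_height:
                # Large vertical gap: the previous paragraph has ended.
                current["paragraphs"].append(" ".join(paragraph))
                paragraph = []
            paragraph.append(line["text"])
        prev_bottom = line["top"] + line["height"]

    if paragraph:
        current["paragraphs"].append(" ".join(paragraph))
    if current["header"] is not None or current["paragraphs"]:
        sections.append(current)
    return sections


# Hypothetical lines; top and height are ratios of the page height.
lines = [
    {"text": "Movie One", "top": 0.10, "height": 0.04},
    {"text": "Reviewer A: Loved the pacing", "top": 0.16, "height": 0.02},
    {"text": "and the soundtrack.", "top": 0.185, "height": 0.02},
    {"text": "Reviewer B: Too long.", "top": 0.24, "height": 0.02},
    {"text": "Movie Two", "top": 0.30, "height": 0.04},
    {"text": "Reviewer C: A fun watch.", "top": 0.36, "height": 0.02},
]

sections = split_into_sections(lines)
```

The thresholds are relative to the median line height so that the same values can work across scans of different resolution; they would likely need tuning for real documents.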
Once the paragraphs are segmented, further insights can be gleaned using Amazon Comprehend, which can be applied in various scenarios, such as:
- Detecting key phrases in technical documents like whitepapers or request proposals.
- Identifying named entities in financial or legal texts, allowing for better organization and understanding.
- Conducting sentiment analysis on product or movie reviews to monitor shifts in user sentiment.
In this article, we specifically focus on the sentiment analysis aspect. We utilize two sample movie review PDFs available on GitHub, where movie titles serve as headers and reviews as paragraph content. By segmenting the text, we can determine the overall sentiment for each film and analyze individual reviews. Scoring an entire page as a single block of text can blur the sentiment of the individual reviews on it, so we segment the extracted text to identify each reviewer's name and comment.
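As a rough illustration of per-review analysis, the sketch below scores each paragraph separately and then takes the most frequent label as the overall sentiment for a film. It relies on the Amazon Comprehend detect_sentiment operation, which returns a Sentiment label and a SentimentScore map. The helper names and the majority-vote rule are assumptions of this sketch, not necessarily how the original solution aggregates results. The client is passed in as a parameter so the functions can be exercised without AWS credentials.

```python
from collections import Counter


def analyze_reviews(comprehend, paragraphs, language_code="en"):
    """Run sentiment detection on each paragraph separately.

    `comprehend` is expected to behave like boto3.client("comprehend").
    """
    results = []
    for text in paragraphs:
        response = comprehend.detect_sentiment(
            Text=text, LanguageCode=language_code
        )
        results.append({
            "text": text,
            "sentiment": response["Sentiment"],    # e.g. POSITIVE, NEGATIVE
            "scores": response["SentimentScore"],  # confidence per label
        })
    return results


def overall_sentiment(results):
    """Assumed aggregation rule: the most frequent label wins."""
    counts = Counter(result["sentiment"] for result in results)
    return counts.most_common(1)[0][0] if counts else None
```

In a real deployment the first argument would be created with boto3.client("comprehend"); averaging the SentimentScore values is another reasonable way to aggregate.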
Solution Overview
The solution employs a variety of AWS services and serverless technologies to build a scalable, cost-effective architecture:
- Amazon Comprehend: An NLP service that harnesses ML to derive insights and relationships within text.
- Amazon DynamoDB: A key-value and document database offering single-digit millisecond performance at any scale.
- AWS Lambda: Executes code in response to various triggers, such as data changes or user actions. Since Amazon S3 can trigger Lambda functions, it enables the creation of real-time serverless data-processing systems.
- Amazon Simple Notification Service (SNS): A fully managed messaging service that communicates the completion of the extraction process initiated by Amazon Textract.
- Amazon Simple Storage Service (S3): An object store for documents that allows for centralized management with precise access controls.
- Amazon Textract: Utilizes ML to extract text and data from scanned documents in formats like PDF, JPEG, or PNG.
The architecture of this solution follows a defined workflow:
- A movie review document is uploaded to a specified S3 bucket.
- This upload triggers a Lambda function via Amazon S3 Event Notifications.
- The Lambda function initiates an asynchronous Amazon Textract job to extract text from the document.
- Upon completion, Amazon Textract sends an SNS notification containing the job ID and status.
- Lambda listens for this notification and retrieves the extracted text and bounding box data.
- The bounding box data helps identify headers and paragraphs, which differ in formatting such as font size, indentation, and line spacing.
- Following the identification of headers and paragraphs, Lambda invokes Amazon Comprehend for sentiment analysis, with results stored in DynamoDB.
DynamoDB holds the extracted information and insights for each document, indexed by document name.
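The first half of this workflow can be sketched as three small functions: one that starts the asynchronous job from the S3 event, one that reads the completion message from the SNS event, and one that pages through the results. The Textract operations start_document_text_detection and get_document_text_detection and the event field names follow the AWS formats; the function names, the way the topic and role ARNs are passed in, and the decision to inject the client as a parameter are assumptions of this sketch.

```python
import json
from urllib.parse import unquote_plus


def start_extraction(event, textract, sns_topic_arn, role_arn):
    """Start an asynchronous Textract job for the object in an S3 event."""
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    # Object keys arrive URL-encoded in S3 event notifications.
    key = unquote_plus(record["object"]["key"])
    response = textract.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        NotificationChannel={"SNSTopicArn": sns_topic_arn, "RoleArn": role_arn},
    )
    return response["JobId"]


def parse_completion(event):
    """Read the job ID and status from the SNS message sent by Textract."""
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    return message["JobId"], message["Status"]


def fetch_all_blocks(textract, job_id):
    """Collect the blocks from every page of results for a finished job."""
    blocks = []
    token = None
    while True:
        kwargs = {"JobId": job_id}
        if token:
            kwargs["NextToken"] = token
        page = textract.get_document_text_detection(**kwargs)
        blocks.extend(page.get("Blocks", []))
        token = page.get("NextToken")
        if not token:
            return blocks
```

In a Lambda function, textract would be created with boto3.client("textract"), and the handler would check that the status is SUCCEEDED before fetching the blocks, segmenting them, and writing the results to DynamoDB.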
Deploying the Architecture with AWS CloudFormation
To provision the necessary AWS Identity and Access Management (IAM) roles and components, you can deploy an AWS CloudFormation template. This includes services such as Amazon S3, Lambda, Amazon Textract, and Amazon Comprehend.
To get started, launch the CloudFormation template in the US East (N. Virginia) region. When prompted for the BucketName, enter textract-demo- followed by a date suffix, which keeps the bucket name unique.