Text is a prevalent form of unstructured data, often challenging to manage because it lacks a defined format. For instance, web pages are full of text that analysts typically gather through web scraping, then pre-process with techniques such as lowercasing, stemming, and lemmatization. Once cleaned, this data becomes a valuable resource for data scientists and analysts striving to derive meaningful insights.
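The pre-processing steps mentioned above can be sketched with the standard library alone. This is a minimal, illustrative cleaner; a real pipeline would add stemming or lemmatization with a library such as NLTK or spaCy:

```python
import re
import string

def preprocess(text: str) -> str:
    """Minimal text pre-processing: lowercase, strip punctuation,
    and collapse whitespace. Stemming/lemmatization would be added
    here in a production pipeline."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

print(preprocess("Hello,   World! This is RAW text."))
# -> hello world this is raw text
```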

This article focuses on efficiently managing text data through a data lake architecture on Amazon Web Services (AWS). We will demonstrate how data teams can autonomously extract insights from textual documents using OpenSearch as the primary search and analytics platform. Additionally, we will discuss the processes of indexing and updating text data in OpenSearch, alongside evolving the architecture toward greater automation.

Architecture Overview

The architecture presented outlines how various AWS services can be utilized to establish a comprehensive text analytics solution, encompassing everything from data collection and ingestion to the consumption of data in OpenSearch.

  1. Data Collection: Gather information from diverse sources, such as SaaS applications, edge devices, logs, streaming media, and social networks.
  2. Data Ingestion Tools: Employ tools like the AWS Database Migration Service (AWS DMS), AWS DataSync, Amazon Kinesis, and Amazon AppFlow to channel data into the AWS data lake, based on the source type.
  3. Data Storage: Store the incoming data in the raw zone of Amazon Simple Storage Service (S3), serving as a temporary repository for data in its original format.
  4. Data Processing: Validate, clean, normalize, transform, and enrich this data through pre-processing steps using AWS Glue or Amazon EMR.
  5. Indexing: Transfer data ready for indexing to the indexing zone. AWS Lambda functions will then index these documents into OpenSearch and save them back in the data lake with unique identifiers.
  6. Data Consumption: This clean zone acts as the authoritative source for teams to analyze data and compute additional metrics.
  7. Machine Learning: Leverage Amazon SageMaker or AI services like Amazon Comprehend to build and train models that compute new metrics, then store those metrics in the enrich zone alongside the OpenSearch document identifiers.
  8. Updating OpenSearch: Utilize these identifiers from the initial indexing phase to locate the appropriate documents and update them in OpenSearch with the newly computed metrics via AWS Lambda.
  9. Visualization: Finally, OpenSearch enables searching through documents and visualizing metrics using OpenSearch Dashboards.
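Step 5 hinges on keeping the OpenSearch-assigned IDs together with the documents written back to the data lake, so later steps can update the right documents. A minimal sketch of that merge, assuming auto-generated IDs returned by the `_bulk` API in request order (field names are illustrative):

```python
def attach_ids(records, bulk_response):
    """Merge the auto-generated _id of each indexed document, as
    returned by the OpenSearch _bulk API in request order, back into
    the records before they are written to the clean zone."""
    ids = [item["index"]["_id"] for item in bulk_response["items"]]
    return [dict(rec, opensearch_id=doc_id)
            for rec, doc_id in zip(records, ids)]

records = [{"text": "great post"}, {"text": "needs work"}]
# Shape of a (truncated) _bulk API response:
response = {"items": [{"index": {"_id": "a1"}},
                      {"index": {"_id": "a2"}}]}
print(attach_ids(records, response))
```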

Collaboration Among Teams

This architecture fosters independent work among various data teams on text documents throughout their lifecycle. The data engineering team oversees the raw and indexing zones, managing data ingestion and preprocessing for indexing in OpenSearch. Once cleaned, data is stored in the clean zone, where data analysts and scientists can generate insights and compute new metrics. These metrics are then stored in the enrich zone and indexed as new fields within OpenSearch documents.

For example, a company could periodically gather comments from a blog site and conduct sentiment analysis using Amazon Comprehend. In this scenario:

  • Comments would be ingested into the raw zone of the data lake.
  • The data engineering team processes these comments and moves them to the indexing zone.
  • A Lambda function indexes the comments into OpenSearch, adds the OpenSearch document ID to each record, and saves them in the clean zone.
  • The data science team analyzes the comments for sentiment using Amazon Comprehend, storing the results in the enrich zone. A second Lambda function updates the comments in OpenSearch with these new metrics.
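The enrichment step above can be sketched as a small shaping function. In the real pipeline the response dict would come from boto3's `comprehend.detect_sentiment(Text=comment, LanguageCode="en")`; here it is stubbed so the sketch runs offline, and the record field names are assumptions:

```python
def to_enrich_record(opensearch_id, detect_sentiment_response):
    """Shape a Comprehend DetectSentiment response into an enrich-zone
    record keyed by the OpenSearch document ID, ready for the second
    Lambda function to apply as an update."""
    return {
        "opensearch_id": opensearch_id,
        "sentiment": detect_sentiment_response["Sentiment"],
        "sentiment_scores": detect_sentiment_response["SentimentScore"],
    }

# Stubbed response with the shape DetectSentiment returns:
resp = {"Sentiment": "POSITIVE",
        "SentimentScore": {"Positive": 0.97, "Negative": 0.01,
                           "Neutral": 0.01, "Mixed": 0.01}}
print(to_enrich_record("a1", resp))
```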

If the raw data requires no preprocessing, the indexing and clean zones can be merged.

Schema Evolution

As data transitions through various stages in the data lake, its schema transforms and gets enriched. In the raw zone, the data contains a raw text field sourced directly from the ingestion phase. Keeping a raw version is recommended as a backup for potential future processing needs. In the indexing zone, a cleaned text field replaces the raw text field. The clean zone introduces a new ID field generated during indexing to identify the OpenSearch document. The enrich zone mandates the ID field, with additional optional fields for new metrics calculated by other teams.
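The schema evolution described above can be made concrete with illustrative record shapes per zone (the field names are assumptions, not a prescribed schema):

```python
# How one document's record evolves across the data lake zones:
raw_record      = {"raw_text": "GREAT Product!!!"}            # as ingested
indexing_record = {"clean_text": "great product"}             # cleaned text replaces raw
clean_record    = {**indexing_record, "id": "a1"}             # OpenSearch doc ID added
enrich_record   = {"id": "a1", "sentiment": "POSITIVE"}       # ID required, metrics optional

for zone, rec in [("raw", raw_record), ("indexing", indexing_record),
                  ("clean", clean_record), ("enrich", enrich_record)]:
    print(zone, sorted(rec))
```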

Utilizing OpenSearch for Data Consumption

In OpenSearch, data is organized into indices, akin to tables in a relational database. Each index comprises documents, comparable to rows, and multiple fields resembling columns. Documents can be indexed and updated using various client APIs tailored for popular programming languages.
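To make the index/document analogy concrete, here are the two REST calls the architecture relies on, expressed as plain data so the sketch runs offline. The `comments` index name and document fields are illustrative; in practice these would be sent with an HTTP client or an OpenSearch client library:

```python
import json

# A hypothetical comment document:
doc = {"clean_text": "great product", "source": "blog"}

# Index (create) a document, letting OpenSearch auto-assign the ID:
index_request = ("POST", "/comments/_doc", json.dumps(doc))

# Partial update of an existing document by its ID, adding a new field:
update_request = ("POST", "/comments/_update/a1",
                  json.dumps({"doc": {"sentiment": "POSITIVE"}}))

print(index_request[1])
print(update_request[1])
```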

To illustrate how our architecture interacts with OpenSearch during the indexing and updating phases, consider the following:

Indexing and Updating Documents with Python

The index document API operation allows for the indexing of a document with a custom ID, or it can automatically assign one if none is provided. To optimize indexing, we can utilize the bulk index API, which indexes multiple documents in a single call. It’s crucial to retain the IDs from the index operation to later identify the documents for metric updates.

Here are two methods to achieve this:

  1. Use the requests library to call the REST Bulk Index API (recommended), as it returns the auto-generated IDs needed.
  2. Use the Python Low-Level Client for OpenSearch, which does not return IDs—they must be pre-assigned. An atomic counter in Amazon DynamoDB can facilitate this, allowing multiple Lambda functions to index documents concurrently without ID conflicts.
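For the first method, the `_bulk` REST API expects an NDJSON body: an action line followed by the document source, one pair per document, terminated by a newline. A minimal builder, assuming auto-generated IDs (no `_id` in the action line):

```python
import json

def build_bulk_body(index, docs):
    """Build the NDJSON body for the OpenSearch _bulk REST API.
    Omitting _id lets OpenSearch auto-generate IDs, which come back
    in the response's "items" list in request order."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # _bulk requires a trailing newline

body = build_bulk_body("comments", [{"text": "great"}, {"text": "meh"}])
print(body)
# POST this body to https://<domain-endpoint>/_bulk with the requests
# library and Content-Type: application/x-ndjson.
```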

As depicted, the Lambda function:

  • Increments the atomic counter by the number of documents to be indexed into OpenSearch.
  • Reads the new counter value returned by the update call.
  • Indexes the documents using IDs in the range (counter value − number of documents, counter value].
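The ID arithmetic behind those steps is simple but easy to get off by one, so here it is as a sketch. The counter value is assumed to be the one returned after the atomic increment (e.g. a DynamoDB `UpdateItem` with an `ADD` expression returning the new value):

```python
def id_range(counter_after_increment, n_docs):
    """IDs reserved for this batch: the n_docs values ending at the
    counter value returned by the atomic increment. Because the
    increment is atomic, concurrent Lambda functions get disjoint
    ranges."""
    start = counter_after_increment - n_docs + 1
    return list(range(start, counter_after_increment + 1))

# Two Lambdas incrementing by 5 concurrently never overlap:
print(id_range(105, 5))  # -> [101, 102, 103, 104, 105]
print(id_range(110, 5))  # -> [106, 107, 108, 109, 110]
```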

Automating Data Flow

As architectures advance toward automation, the data flow between the data lake stages can become event-driven. Using Amazon EventBridge and AWS Step Functions, we can trigger the pre-processing AWS Glue jobs automatically, ensuring seamless data transitions without manual intervention.
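One way to wire this up is an EventBridge rule that fires when a new object lands in the raw zone, targeting a Step Functions state machine that runs the Glue job. A sketch of the event pattern such a rule might use (the bucket name and `raw/` prefix are illustrative):

```python
import json

# Hypothetical EventBridge rule pattern matching S3 "Object Created"
# events for the raw zone of the data lake bucket. EventBridge content
# filtering supports prefix matching on the object key.
event_pattern = {
    "source": ["aws.s3"],
    "detail-type": ["Object Created"],
    "detail": {
        "bucket": {"name": ["text-data-lake"]},
        "object": {"key": [{"prefix": "raw/"}]},
    },
}
print(json.dumps(event_pattern, indent=2))
```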
