Text is one of the most prevalent forms of unstructured data in analytics. Because it rarely arrives in a predefined format, it can be challenging to acquire and process. For instance, web pages consist of text that analysts gather through web scraping and then pre-process with techniques such as lowercasing, stemming, and lemmatization. After cleaning, data scientists and analysts explore the text to extract valuable insights.
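As a quick illustration of those pre-processing steps, here is a minimal sketch using NLTK; the library choice and sample text are assumptions for demonstration purposes, not part of the architecture.

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

# One-time downloads for the lemmatizer's dictionaries.
nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)

text = "Analysts Were Scraping These Web Pages"
tokens = text.lower().split()  # lowercasing and naive tokenization

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print([stemmer.stem(t) for t in tokens])          # stemming
print([lemmatizer.lemmatize(t) for t in tokens])  # lemmatization
```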
This post discusses how to efficiently manage text data using a data lake architecture on Amazon Web Services (AWS). It shows how data teams can independently extract insights from text documents using OpenSearch as the core search and analytics service. It also outlines how to index and update text data in OpenSearch, and how to evolve the architecture towards automation.
Architecture Overview
The architecture presented here employs AWS services to create a comprehensive text analytics solution, covering everything from data collection and ingestion to data consumption in OpenSearch (see Figure 1).
- Data is collected from diverse sources, including SaaS applications, edge devices, logs, streaming media, and social networks.
- Various tools such as AWS Database Migration Service (AWS DMS), AWS DataSync, Amazon Kinesis, Amazon Managed Streaming for Apache Kafka (Amazon MSK), AWS IoT Core, and Amazon AppFlow facilitate data ingestion into the AWS data lake, depending on the source type.
- The ingested data is stored in the raw zone of the Amazon Simple Storage Service (Amazon S3) data lake, an area that preserves the data in its original form.
- Data undergoes validation, cleaning, normalization, transformation, and enrichment through pre-processing steps using AWS Glue or Amazon EMR.
- Prepared data is then placed in the indexing zone.
- AWS Lambda indexes the documents into OpenSearch and stores them back in the data lake with a unique identifier (a sketch of this step follows the list).
- The clean zone serves as the source of truth, allowing teams to access data and compute additional metrics.
- New metrics are computed using machine learning (ML) models that teams develop and train with Amazon SageMaker, or using AI services such as Amazon Comprehend.
- These metrics are saved in the enrich zone alongside the OpenSearch document identifier.
- The identifier from the initial indexing phase is used to locate the correct documents and update them in OpenSearch with the newly calculated metrics via AWS Lambda.
- Finally, OpenSearch allows for searching through documents and visualizing metrics using OpenSearch Dashboards.
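Here is a minimal sketch of the Lambda indexing step called out above, assuming an S3-triggered function. The endpoint, index, and bucket names are hypothetical, and authentication (for example, SigV4 signing) is omitted for brevity.

```python
import json

import boto3
from opensearchpy import OpenSearch

# Hypothetical resource names: replace with your own.
OPENSEARCH_ENDPOINT = "https://my-domain.us-east-1.es.amazonaws.com"
INDEX_NAME = "documents"
CLEAN_BUCKET = "my-data-lake-clean"

s3 = boto3.client("s3")
client = OpenSearch(hosts=[OPENSEARCH_ENDPOINT])  # auth omitted for brevity

def handler(event, context):
    # Each S3 event record points at a prepared document in the indexing zone.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        document = json.loads(body)

        # Index the document; OpenSearch auto-generates an ID.
        response = client.index(index=INDEX_NAME, body=document)

        # Store the document back in the data lake together with its ID.
        document["id"] = response["_id"]
        s3.put_object(
            Bucket=CLEAN_BUCKET,
            Key=key,
            Body=json.dumps(document).encode("utf-8"),
        )
```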
Considerations
Data Lake Orchestration Among Teams
This architecture empowers data teams to work independently on text documents at various lifecycle stages. The data engineering team oversees the raw and indexing zones, managing data ingestion and preprocessing for indexing in OpenSearch.
In the clean zone, data analysts and data scientists generate insights and compute new metrics. The metrics are stored in the enrich zone and indexed as new fields in the OpenSearch documents by the data engineering team (refer to Figure 2).
For example, consider a company that routinely retrieves comments from a blog site and conducts sentiment analysis using Amazon Comprehend. Here’s how the process unfolds:
- The comments are ingested into the raw zone of the data lake.
- The data engineering team processes the comments and stores them in the indexing zone.
- A Lambda function indexes the comments into OpenSearch, enriches them with the OpenSearch document ID, and saves them in the clean zone.
- The data science team reviews the comments and performs sentiment analysis using Amazon Comprehend (a sketch of this step follows the list).
- The sentiment analysis metrics are saved in the enrich zone of the data lake. A second Lambda function updates the comments in OpenSearch with the new metrics.
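As a sketch of the sentiment analysis step, the following shows how a function might call Amazon Comprehend for a single comment; the function name is hypothetical and the example assumes English text.

```python
import boto3

comprehend = boto3.client("comprehend")

def sentiment_metrics(comment_text: str) -> dict:
    """Compute sentiment metrics for one comment (assumes English text)."""
    response = comprehend.detect_sentiment(Text=comment_text, LanguageCode="en")
    return {
        "sentiment": response["Sentiment"],             # POSITIVE, NEGATIVE, NEUTRAL, or MIXED
        "sentiment_score": response["SentimentScore"],  # confidence per label
    }
```

These metrics, together with the document ID carried over from the clean zone, are what the second Lambda function writes back to OpenSearch.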
If the raw data does not require preprocessing, the indexing and clean zones can be merged.
Schema Evolution
As data transitions through the stages of the data lake, the schema evolves and gets enriched accordingly. Continuing with our previous example, Figure 3 illustrates how the schema develops.
In the raw zone, there is a raw text field received directly from the ingestion phase. It's advisable to retain a raw version of the data as a backup, or in case processing steps need to be repeated later. In the indexing zone, the clean text field replaces the raw text field after processing. In the clean zone, a new ID field generated during indexing identifies the OpenSearch document that contains the text. In the enrich zone, the ID field is mandatory, while the fields carrying metric names are optional; they represent new metrics calculated by other teams that will be added to OpenSearch.
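To make this concrete, here is a sketch of how a single comment from the earlier example might look in each zone; the field values and the sentiment metric are illustrative assumptions.

```python
raw_zone_doc = {"raw_text": "GREAT post!!! Loved it :)"}

indexing_zone_doc = {"clean_text": "great post loved it"}

clean_zone_doc = {
    "id": "Abc123",           # OpenSearch document ID assigned at indexing time
    "clean_text": "great post loved it",
}

enrich_zone_doc = {
    "id": "Abc123",           # mandatory: links the metric back to the document
    "sentiment": "POSITIVE",  # optional metric computed by another team
}
```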
Consumption Layer with OpenSearch
In OpenSearch, data is organized into indices, akin to tables in a relational database. Each index comprises documents, similar to rows in a table, and multiple fields, comparable to table columns. Documents are added to an index with the index operation and modified with the update operation, using client APIs available for popular programming languages.
Next, let’s examine how our architecture integrates with OpenSearch during the indexing and updating stages.
Indexing and Updating Documents Using Python
The index document API operation lets you index a document with a custom ID, or have OpenSearch auto-generate one if none is provided. To speed up indexing, the bulk API can index multiple documents in a single call.
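A minimal sketch of both variants with the opensearch-py client follows; the endpoint and index name are assumptions, and authentication is omitted.

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=["https://my-domain.us-east-1.es.amazonaws.com"])

# Index with a custom ID...
client.index(index="comments", id="comment-42", body={"clean_text": "great post"})

# ...or let OpenSearch generate one; the response carries the assigned ID.
response = client.index(index="comments", body={"clean_text": "great post"})
print(response["_id"])
```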
We must store the IDs returned from the index operation to later identify the documents we’ll update with new metrics. Here are two ways to accomplish this:
- Use the requests library to call the REST Bulk API (preferred): the response includes the auto-generated IDs we need (see the sketch after this list).
- Use the Python low-level client for OpenSearch: IDs are not returned, so they must be pre-assigned before indexing in order to store them. An atomic counter in Amazon DynamoDB makes this possible, allowing multiple Lambda functions to index documents in parallel without ID collisions.
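Here is a sketch of the first option: calling the REST _bulk endpoint with requests and reading back the auto-generated IDs. The endpoint is an assumption and authentication is again omitted.

```python
import json

import requests

OPENSEARCH_ENDPOINT = "https://my-domain.us-east-1.es.amazonaws.com"

def bulk_index(documents: list[dict], index: str = "comments") -> list[str]:
    """Bulk-index documents and return the auto-generated IDs, in order."""
    # The _bulk API expects newline-delimited JSON: one action line, then one
    # source line, per document.
    lines = []
    for doc in documents:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    payload = "\n".join(lines) + "\n"

    response = requests.post(
        f"{OPENSEARCH_ENDPOINT}/_bulk",
        data=payload,
        headers={"Content-Type": "application/x-ndjson"},
    )
    response.raise_for_status()

    # Each item in the response carries the ID that OpenSearch assigned.
    return [item["index"]["_id"] for item in response.json()["items"]]
```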
As depicted in Figure 4, and sketched in code after this list, the Lambda function:
- Increases the atomic counter by the number of documents indexed into OpenSearch.
- Reads the updated counter value from the API response.
- Indexes the documents using IDs from the range [counter value - number of documents + 1, counter value].
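The counter logic might look like the following sketch; the table, key, and attribute names are hypothetical.

```python
import boto3

dynamodb = boto3.client("dynamodb")

def reserve_ids(num_documents: int) -> range:
    """Atomically reserve a contiguous block of document IDs."""
    # ADD increments the counter atomically, so parallel Lambda invocations
    # each receive a disjoint block of IDs.
    response = dynamodb.update_item(
        TableName="id-counter",                       # hypothetical table
        Key={"pk": {"S": "opensearch-doc-counter"}},
        UpdateExpression="ADD doc_count :n",
        ExpressionAttributeValues={":n": {"N": str(num_documents)}},
        ReturnValues="UPDATED_NEW",
    )
    new_value = int(response["Attributes"]["doc_count"]["N"])
    # The reserved IDs run from (new_value - num_documents + 1) to new_value.
    return range(new_value - num_documents + 1, new_value + 1)
```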
Data Flow Automation
As architectures progress towards automation, the data flow between data lake stages becomes event-driven. Referring back to our previous example, we can automate the data processing steps when transitioning from the raw to the indexing zone (see Figure 5).
With Amazon EventBridge and AWS Step Functions, we can trigger our pre-processing AWS Glue jobs automatically, ensuring that the data gets pre-processed without manual intervention.
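As a sketch of what that wiring could look like, the following creates an EventBridge rule that starts a Step Functions state machine whenever a new object lands in the raw zone. The bucket name and ARNs are hypothetical, and the bucket must have EventBridge notifications enabled.

```python
import json

import boto3

events = boto3.client("events")

# Fire whenever a new object lands in the raw zone of the data lake.
event_pattern = {
    "source": ["aws.s3"],
    "detail-type": ["Object Created"],
    "detail": {"bucket": {"name": ["my-data-lake-raw"]}},  # hypothetical bucket
}

events.put_rule(
    Name="raw-zone-object-created",
    EventPattern=json.dumps(event_pattern),
)

# Target a Step Functions state machine that runs the Glue pre-processing job.
events.put_targets(
    Rule="raw-zone-object-created",
    Targets=[{
        "Id": "preprocessing-state-machine",
        "Arn": "arn:aws:states:us-east-1:123456789012:stateMachine:preprocess",  # hypothetical
        "RoleArn": "arn:aws:iam::123456789012:role/eventbridge-invoke-sfn",      # hypothetical
    }],
)
```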
In conclusion, the implementation of a data lake architecture with OpenSearch on AWS offers a streamlined and efficient approach to text analytics, allowing teams to harness the power of their unstructured data while maintaining flexibility and control throughout the data lifecycle.