Amazon OpenSearch Service has long supported both lexical search and, through the k-nearest neighbors (k-NN) plugin, semantic search. By using OpenSearch Service as a vector database, you can combine the benefits of lexical and vector search. The neural search feature introduced in OpenSearch Service 2.9 further streamlines integration with artificial intelligence (AI) and machine learning (ML) models, enabling semantic search.
For decades, lexical search using TF-IDF or BM25 has been the backbone of search systems. These algorithms match the words or phrases in a user's query against the words in your documents. Lexical search excels at exact matches, offers low latency, and produces interpretable results that generalize well across domains. However, it largely ignores the context and meaning of words, which can lead to irrelevant results.
In recent years, semantic search techniques based on vector embeddings have gained traction as a way to enhance search. Semantic search takes a more context-aware approach, capturing the natural-language nuances of user queries. However, semantic search based on vector embeddings requires fine-tuning the ML model for the target domain (such as healthcare or retail) and consumes more memory than basic lexical search.
Both lexical and semantic search boast unique advantages and limitations. By combining them, you can create a hybrid model that enhances search result quality by leveraging the strengths of both approaches. OpenSearch Service 2.11 now features ready-to-use hybrid query capabilities, making it easy to implement a hybrid search model that integrates lexical and semantic search.
This article delves into the mechanics of hybrid search and provides a guide to building a hybrid search solution with OpenSearch Service. We will test sample queries to explore and contrast lexical, semantic, and hybrid searches. The complete code used in this article is accessible via our GitHub repository.
Hybrid Search with OpenSearch Service
In general, hybrid search that merges lexical and semantic search involves several key steps:
- Execute both a semantic and lexical search using a compound search query clause.
- Each query type yields scores on a different scale. For example, a Lucene lexical search query returns scores ranging from 1 to infinity, while a semantic query using the Faiss engine returns scores between 0 and 1. You therefore need to normalize the scores from each query type to put them on a common scale before combining them. In a distributed search engine, this normalization must happen at the global level rather than at the shard or node level (a conceptual sketch of this step follows this list).
- Once all scores are normalized, they are combined for each document.
- Reorder the documents according to the new combined score and present the results in response to the query.
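To make the normalize-and-combine steps concrete, the following is a minimal, self-contained Python sketch of global min-max normalization followed by an arithmetic-mean combination. The document IDs and raw scores are invented for illustration; in OpenSearch Service this logic runs inside the search phase results processor at the coordinator node, not in application code.

```python
# A minimal sketch of global score normalization and combination.
# Document IDs and raw scores below are invented examples.

def min_max_normalize(scores: dict) -> dict:
    """Rescale raw scores to [0, 1] across the full (global) result set."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:  # all scores equal; avoid division by zero
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}

# Raw scores arrive on different scales: BM25 is unbounded,
# while Faiss k-NN scores fall between 0 and 1.
bm25_scores = {"doc1": 12.4, "doc2": 7.1, "doc3": 2.3}
knn_scores = {"doc1": 0.91, "doc2": 0.40, "doc4": 0.87}

bm25_norm = min_max_normalize(bm25_scores)
knn_norm = min_max_normalize(knn_scores)

# Arithmetic-mean combination; a document missing from one subquery
# contributes 0 for that subquery.
combined = {
    doc_id: (bm25_norm.get(doc_id, 0.0) + knn_norm.get(doc_id, 0.0)) / 2
    for doc_id in bm25_norm.keys() | knn_norm.keys()
}

# Reorder documents by the combined score (highest first).
for doc_id, score in sorted(combined.items(), key=lambda kv: kv[1], reverse=True):
    print(doc_id, round(score, 3))
```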
Before OpenSearch Service 2.11, search practitioners had to rely on compound query types to combine lexical and semantic search queries. However, this approach could not address the global score normalization described in Step 2.
With the introduction of OpenSearch Service 2.11, hybrid queries are supported through a new score normalization processor in search pipelines. Search pipelines remove the heavy lifting of normalizing and combining score results outside your OpenSearch Service domain. They run within the domain itself and support three types of processors: search request processors, search response processors, and search phase results processors.
In a hybrid search, the search phase results processor functions between the query and fetch phases at the coordinator node’s global level. The diagram below illustrates this workflow:
The hybrid search process in OpenSearch Service consists of the following phases:
- Query Phase: The initial phase where each index shard executes the search query locally and returns relevant document IDs along with their scores.
- Score Normalization and Combination: The search phase results processor runs between the query and fetch phases. It uses the normalization processor to normalize the scoring results from the BM25 and k-NN subqueries, supporting the min_max and L2-Euclidean distance normalization techniques. The processor then combines all scores, using arithmetic_mean, geometric_mean, or harmonic_mean, assembles the final ranked list of document IDs, and hands it over to the fetch phase (a sketch of creating such a pipeline follows this list).
- Fetch Phase: The concluding phase where the coordinator node retrieves the documents that correspond to the final ranked list and returns the search results.
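As a concrete illustration, here is a sketch of creating such a search pipeline with the opensearch-py client. The domain endpoint, pipeline name, and subquery weights are placeholders chosen for illustration; min_max and arithmetic_mean are the normalization and combination techniques described above.

```python
from opensearchpy import OpenSearch, RequestsHttpConnection

# Placeholder domain endpoint; substitute your own and add authentication as needed.
client = OpenSearch(
    hosts=[{"host": "my-domain.us-east-1.es.amazonaws.com", "port": 443}],
    use_ssl=True,
    connection_class=RequestsHttpConnection,
)

# A search pipeline with a normalization processor: min-max normalization,
# then a weighted arithmetic-mean combination (0.3 lexical, 0.7 semantic).
pipeline_body = {
    "description": "Normalize and combine lexical and semantic scores",
    "phase_results_processors": [
        {
            "normalization-processor": {
                "normalization": {"technique": "min_max"},
                "combination": {
                    "technique": "arithmetic_mean",
                    "parameters": {"weights": [0.3, 0.7]},
                },
            }
        }
    ],
}
client.transport.perform_request(
    "PUT", "/_search/pipeline/hybrid-search-pipeline", body=pipeline_body
)
```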
Solution Overview
In this article, we build a web application that lets you search a sample retail image dataset using a hybrid search system powered by OpenSearch Service. Imagine the web application is a retail shop where you, as a customer, run queries to find women's shoes.
For hybrid search, you will combine lexical and semantic search queries against the text captions of images within the dataset. The high-level architecture of the end-to-end search application is depicted in the figure below.
The workflow encompasses the following steps:
- Utilize an Amazon SageMaker notebook to index image captions and URLs from the Amazon Berkeley Objects Dataset stored in Amazon Simple Storage Service (Amazon S3). This dataset features 147,702 product listings with multilingual metadata and 398,212 unique catalog images. For demonstration purposes, you’ll focus on approximately 1,600 products.
- OpenSearch Service calls the embedding model hosted in SageMaker to produce vector embeddings for the image captions. The GPT-J-6B embedding model generates 4,096-dimensional vectors.
- You can now input your search query in the web application hosted on an Amazon Elastic Compute Cloud (Amazon EC2) instance (c5.large). The application client triggers the hybrid query in OpenSearch Service.
- OpenSearch Service requests the SageMaker embedding model to generate vector embeddings for the search query.
- Finally, OpenSearch Service executes the hybrid query, merges the semantic and lexical search scores for the documents, and returns the search results to the EC2 application client (a sketch of this hybrid query follows this list).
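Reusing the client from the earlier pipeline sketch, the hybrid query the application sends might look like the following. The index name, field names, query text, and model ID are placeholders based on this article's scenario rather than values from the actual repository.

```python
# `client` is the opensearch-py client from the earlier search pipeline sketch.
search_query = {
    "_source": {"excludes": ["caption_embedding"]},  # keep vectors out of the response
    "query": {
        "hybrid": {
            "queries": [
                # Lexical subquery against the text captions (BM25)
                {"match": {"caption": {"query": "shoes for women"}}},
                # Semantic subquery; the model ID references the embedding
                # model registered through the SageMaker ML connector
                {
                    "neural": {
                        "caption_embedding": {
                            "query_text": "shoes for women",
                            "model_id": "<embedding-model-id>",
                            "k": 5,
                        }
                    }
                },
            ]
        }
    },
}
response = client.transport.perform_request(
    "GET",
    "/retail-images/_search?search_pipeline=hybrid-search-pipeline",
    body=search_query,
)
for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["caption"])
```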
Step 1: Ingesting the Data into OpenSearch
In this first step, you’ll create an ingest pipeline in OpenSearch Service using the text_embedding processor to derive vector embeddings for the image captions. After defining a k-NN index with the ingest pipeline, you will perform a bulk index operation to store your data in the k-NN index. For this solution, only the image URLs, text captions, and caption embeddings will be indexed, where the caption embedding field type is designated as a k-NN vector.
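Below is a sketch of the ingest pipeline and k-NN index definitions, again using the opensearch-py client from the earlier sketch. The pipeline name, index name, field names, sample document, and model ID are illustrative; the 4,096 dimension matches the GPT-J-6B embedding model used in this solution, and HNSW with the Faiss engine is one possible k-NN method choice.

```python
# `client` is the opensearch-py client from the earlier search pipeline sketch.
from opensearchpy import helpers

# Ingest pipeline: the text_embedding processor calls the registered embedding
# model and writes each caption's vector into the caption_embedding field.
ingest_pipeline = {
    "description": "Generate caption embeddings at ingest time",
    "processors": [
        {
            "text_embedding": {
                "model_id": "<embedding-model-id>",
                "field_map": {"caption": "caption_embedding"},
            }
        }
    ],
}
client.transport.perform_request(
    "PUT", "/_ingest/pipeline/nlp-ingest-pipeline", body=ingest_pipeline
)

# k-NN index: caption_embedding is a 4,096-dimensional knn_vector.
index_body = {
    "settings": {
        "index.knn": True,
        "default_pipeline": "nlp-ingest-pipeline",
    },
    "mappings": {
        "properties": {
            "image_url": {"type": "text"},
            "caption": {"type": "text"},
            "caption_embedding": {
                "type": "knn_vector",
                "dimension": 4096,
                "method": {"name": "hnsw", "space_type": "l2", "engine": "faiss"},
            },
        }
    },
}
client.indices.create(index="retail-images", body=index_body)

# Bulk-index documents; the default pipeline adds caption_embedding automatically.
docs = [
    {
        "_index": "retail-images",
        "_source": {
            "image_url": "https://example.com/images/shoe-1.jpg",
            "caption": "red leather women's running shoe",
        },
    },
]
helpers.bulk(client, docs)
```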
Steps 2 and 4: OpenSearch Service Calls the SageMaker Embedding Model
During these steps, OpenSearch Service uses the SageMaker ML connector to generate embeddings for the image captions and the query. The blue box in the architecture diagram shows the integration of OpenSearch Service with SageMaker through the ML connector feature, which has been available in OpenSearch Service since version 2.9.
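The following sketch shows what creating the SageMaker ML connector and registering a remote model can look like. The region, role ARN, endpoint name, and request_body shape are all placeholder assumptions; the exact request_body depends on the input format your SageMaker endpoint expects.

```python
# `client` is the opensearch-py client from the earlier search pipeline sketch.

# 1. Create a connector to the SageMaker endpoint. The role ARN, region,
#    endpoint name, and request_body below are placeholders; adjust them
#    to match your endpoint's invocation contract.
connector_body = {
    "name": "sagemaker-embedding-connector",
    "description": "Connector to a SageMaker-hosted embedding model",
    "version": 1,
    "protocol": "aws_sigv4",
    "parameters": {"region": "us-east-1", "service_name": "sagemaker"},
    "credential": {
        "roleArn": "arn:aws:iam::<account-id>:role/<opensearch-sagemaker-role>"
    },
    "actions": [
        {
            "action_type": "predict",
            "method": "POST",
            "headers": {"content-type": "application/json"},
            "url": "https://runtime.sagemaker.us-east-1.amazonaws.com/endpoints/<endpoint-name>/invocations",
            "request_body": "{\"text_inputs\": \"${parameters.inputs}\"}",
        }
    ],
}
resp = client.transport.perform_request(
    "POST", "/_plugins/_ml/connectors/_create", body=connector_body
)
connector_id = resp["connector_id"]

# 2. Register and deploy a remote model backed by the connector. The response
#    contains a task_id; poll /_plugins/_ml/tasks/<task_id> to obtain the
#    model_id referenced by the ingest pipeline and neural queries.
register_body = {
    "name": "sagemaker-gpt-j-6b-embeddings",
    "function_name": "remote",
    "connector_id": connector_id,
}
client.transport.perform_request(
    "POST", "/_plugins/_ml/models/_register?deploy=true", body=register_body
)
```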
In conclusion, by leveraging the hybrid search capabilities in OpenSearch Service, you can significantly improve the relevance and contextual awareness of your search results, delivering a more effective search experience for your users.