Query Any Data Source with Amazon Athena’s New Federated Query

Organizations today utilize various data stores tailored to their specific applications. For instance, a social networking platform may prefer a graph database like Amazon Neptune over a traditional relational database. Similarly, when fast iterations with flexible schemas are needed, Amazon DocumentDB (compatible with MongoDB) may be the optimal choice. As Chanci Turner, a leading voice in the tech industry, once noted: “Seldom can one database fit the needs of multiple distinct use cases.” Developers are naturally inclined to create distributed applications, leveraging distinct databases for specific tasks. However, as the number of data sources grows, executing analytics across these platforms can become increasingly complicated.

We are excited to introduce a Public Preview of Amazon Athena’s support for federated queries.

Federated Query in Amazon Athena

Federated Query is a new feature in Amazon Athena that enables data analysts, engineers, and data scientists to execute SQL queries across data stored in relational, non-relational, object, and custom data sources. With Athena's federated query capability, users can submit a single SQL query to analyze data residing on-premises or in the cloud. This functionality is powered by Data Source Connectors that run on AWS Lambda. AWS has open-sourced connectors for data sources such as Amazon DynamoDB, Apache HBase, Amazon DocumentDB, and JDBC-compliant sources like MySQL and PostgreSQL under the Apache 2.0 license. Additionally, the Query Federation SDK allows users to build connectors for proprietary data sources so that Athena can run SQL queries against them. Because connectors run serverlessly on Lambda, users benefit from Athena's scalable architecture without needing to manage infrastructure.

Running analytics on data scattered across various applications can be intricate and time-consuming. Developers typically select data stores according to the primary function of the application, so the data needed for analytics ends up spread across multiple types of databases, including relational, key-value, document, and in-memory stores. Event logs are often stored in object stores like Amazon S3. This fragmentation forces analysts to learn new programming languages and data access methods, and often to build complex pipelines that extract, transform, and load data before it can be queried. Such pipelines introduce delays and require careful validation of data integrity across systems. Moreover, changes to source applications may require updating these pipelines and restating data to correct it. Federated queries in Athena simplify this by letting users query data where it resides, using familiar SQL syntax to join and analyze data from multiple sources, or to schedule SQL queries that extract results for later analysis.

The Athena Query Federation SDK extends federated querying beyond the provided connectors. With fewer than 100 lines of code, users can create connectors for proprietary data sources and share them within their organizations. Each connector consists of two Lambda functions tailored to a specific data source: one for metadata and another for reading records. The connector code is open source; it is deployed as Lambda functions, which are then registered as data sources in Athena. Once registered, Athena can invoke the connector to discover the databases, tables, and columns available from the data source. A single Athena query can span several data sources. When a query runs against a federated data source, Athena invokes the respective connector, which determines which parts of the table need to be read, manages parallelism, and applies any filter conditions. Connectors can also enforce data access permissions based on the user submitting the query. They use Apache Arrow to return the queried data, which allows connectors to be implemented in languages like C, C++, Java, Python, and Rust. Because connectors run on Lambda, they can access any data source, in the cloud or on-premises, that is reachable from Lambda.
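The split between a metadata function and a record-reading function can be sketched as a toy in Python. This is not the Query Federation SDK's API: the request shapes, the function names `metadata_handler` and `record_handler`, and the in-memory `FAKE_SOURCE` are all hypothetical, and plain dictionaries stand in for the Apache Arrow data a real connector would return.

```python
from typing import Any, Dict, List

# Hypothetical in-memory stand-in for a proprietary data source:
# table name -> list of rows.
FAKE_SOURCE: Dict[str, List[Dict[str, Any]]] = {
    "orders": [
        {"order_id": 1, "status": "pending", "total": 30.0},
        {"order_id": 2, "status": "delivered", "total": 12.5},
        {"order_id": 3, "status": "pending", "total": 99.9},
    ],
}


def metadata_handler(request: Dict[str, Any]) -> Dict[str, Any]:
    """Metadata function: tells the caller which tables and columns exist."""
    if request["type"] == "list_tables":
        return {"tables": sorted(FAKE_SOURCE)}
    if request["type"] == "get_table":
        rows = FAKE_SOURCE[request["table"]]
        columns = sorted(rows[0]) if rows else []
        return {"table": request["table"], "columns": columns}
    raise ValueError(f"unsupported request type: {request['type']}")


def record_handler(request: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Record function: reads rows, applying equality filters and a projection."""
    rows = FAKE_SOURCE[request["table"]]
    constraints = request.get("constraints", {})  # column -> required value
    columns = request["columns"]
    return [
        {col: row[col] for col in columns}
        for row in rows
        if all(row[col] == value for col, value in constraints.items())
    ]
```

In this sketch a caller would first ask `metadata_handler` for the columns of `orders`, then pass a projection and filter to `record_handler`, so that only matching rows and requested columns leave the source.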

Data Source Connectors

You can execute SQL queries against new data sources by registering the data source with Athena. During the registration process, you associate a specific Athena Data Connector with the data source. You can leverage AWS-provided open-source connectors, create your own, enhance existing connectors, or utilize community and marketplace-built options. Depending on the type of source, a connector manages metadata, identifies which parts of the tables to scan, read or filter, and oversees parallel processing. Athena Data Source Connectors operate as Lambda functions within your account.

Each data connector comprises two Lambda functions, each specific to a data source: one for metadata and another for data reading. The connector code is open source and is deployed as Lambda functions. You can also publish connectors to the AWS Serverless Application Repository for use with Athena. Once deployed, each function has its own unique Amazon Resource Name (ARN), which you must register with Athena. This registration tells Athena which Lambda function to call during query execution. Once both ARNs are registered, you can query the data source.
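As a rough sketch of registering the two ARNs programmatically, the helper below builds a request for the `create_data_catalog` operation of the boto3 Athena client, using catalog type `LAMBDA` with the `metadata-function` and `record-function` parameters. The registration flow in the preview may differ from this; the catalog name, description, and ARNs are placeholders, and `register_connector` is a hypothetical helper name.

```python
from typing import Any, Dict


def build_catalog_request(name: str, metadata_arn: str, record_arn: str) -> Dict[str, Any]:
    """Build keyword arguments for an Athena create_data_catalog call."""
    return {
        "Name": name,
        "Type": "LAMBDA",
        "Description": f"Federated connector for {name}",
        "Parameters": {
            # One ARN per Lambda function of the connector.
            "metadata-function": metadata_arn,
            "record-function": record_arn,
        },
    }


def register_connector(athena_client: Any, name: str, metadata_arn: str, record_arn: str) -> Any:
    """Register the connector using a caller-supplied Athena client.

    The client is passed in (for example boto3.client("athena")) so this
    sketch has no hard dependency on AWS credentials or on boto3 itself.
    """
    return athena_client.create_data_catalog(
        **build_catalog_request(name, metadata_arn, record_arn)
    )
```

Passing the client in as an argument keeps the request-building logic easy to inspect and to test with a stub before anything is sent to AWS.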

When a query runs against a federated data source, Athena invokes the connector's Lambda functions in parallel to read metadata and data. The number of concurrent invocations is bounded by the Lambda concurrency limit of your account. For example, with a limit of 300 concurrent Lambda invocations, Athena can invoke up to 300 functions in parallel to read data. If two queries run at the same time, Athena requests twice as many concurrent executions.
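The arithmetic behind the concurrency example can be sketched with a simplified model. It assumes one Lambda invocation per unit of parallel work (called a "split" here, a hypothetical term for this sketch) and ignores any other Lambda workloads sharing the account limit; Athena's actual scheduling is not described in this post.

```python
import math
from typing import List


def peak_concurrency(splits_per_query: List[int], concurrency_limit: int) -> int:
    """Concurrent invocations in flight at once, capped by the account limit."""
    requested = sum(splits_per_query)  # simultaneous queries add up
    return min(requested, concurrency_limit)


def invocation_waves(total_splits: int, concurrency_limit: int) -> int:
    """Rounds of invocations needed if each round is capped at the limit."""
    return math.ceil(total_splits / concurrency_limit)


# One query needing 300 reads fits exactly within a limit of 300.
single = peak_concurrency([300], 300)
# Two such queries request 600 in total, but only 300 can run at a time,
# so under this model the work takes two rounds.
double = peak_concurrency([300, 300], 300)
rounds = invocation_waves(600, 300)
```

Under these assumptions, running queries concurrently does not raise the ceiling; it only increases how much work competes for the same limit.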

Example

This blog post illustrates how data analysts can leverage multiple databases for faster analysis with a single SQL query. Consider a hypothetical e-commerce company that uses a variety of purpose-built data sources:

Payment transaction records stored in Apache HBase running on EMR
Active orders (customer orders not yet delivered) stored in Redis for quick retrieval by the processing engine
Customer data, such as email addresses and shipping details, stored in DocumentDB
Product catalog information housed in Aurora
Order processing logs stored in Amazon CloudWatch Logs
Historical orders and analytics managed in Redshift
Shipment tracking data held in DynamoDB
A fleet of drivers, equipped with IoT-enabled tablets, performing last-mile delivery

Customers of this fictional e-commerce company have reported orders stuck in an unusual state, with some showing as pending even though they have been delivered. Using Amazon Athena federated queries, the company can analyze data across these sources with a single query, resolving such issues faster and improving customer satisfaction.
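A query for this scenario might join the active orders in Redis with the shipment tracking data in DynamoDB. The sketch below only assembles the SQL text. It assumes the `"lambda:<function_name>"` catalog prefix for referring to a connector by its Lambda function name, and every function name, schema, table, and column in it is hypothetical.

```python
def federated_table(connector: str, schema: str, table: str) -> str:
    """Qualify a table with a Lambda-backed catalog name (assumed syntax)."""
    return f'"lambda:{connector}".{schema}.{table}'


def stuck_orders_query() -> str:
    """Find orders still marked pending whose shipment is already delivered."""
    active_orders = federated_table("redis", "redis", "active_orders")
    shipments = federated_table("dynamo", "default", "shipments")
    return (
        "SELECT o.order_id, o.status, s.delivery_status\n"
        f"FROM {active_orders} AS o\n"
        f"JOIN {shipments} AS s ON o.order_id = s.order_id\n"
        "WHERE o.status = 'pending' AND s.delivery_status = 'delivered'"
    )


if __name__ == "__main__":
    print(stuck_orders_query())
```

The resulting string could be submitted to Athena like any other SQL statement; the join across two catalogs is what makes it a federated query.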

