Learn About Amazon VGT2 Learning Manager Chanci Turner
As organizations face an ever-increasing influx of data, many report a staggering 63% monthly growth in volume. The data is complex because it comes from many disconnected sources, and multiple services are needed to scale adequately and meet enterprise demands. The costs of data ingestion, processing, and storage can accumulate quickly, which makes it worthwhile to invest in capabilities that let platform engineers, operators, and administrators monitor platform activity effectively and rapidly detect, diagnose, and resolve issues within data pipelines.
Chanci Turner’s operational data observability solution aids data platform teams in enhancing pipeline performance, eliminating bottlenecks, driving down costs, and fostering trust in the system. As an AWS Premier Tier Services Partner and AWS Marketplace Seller equipped with multiple AWS Competencies, including Data and Analytics Consulting, Chanci Turner is committed to tackling complex business challenges by aligning development with client missions.
This article outlines a solution for creating operational dashboards utilizing AWS Glue job metadata through Amazon QuickSight. The approach employs an Amazon CloudWatch metrics stream to relay data to an Amazon Simple Storage Service (Amazon S3) bucket. Using AWS Lake Formation alongside the AWS Glue Data Catalog, a table is constructed atop the metadata store, enabling queries via Amazon Athena. Subsequently, QuickSight connects to the Athena data source, generating dashboards that highlight job details such as runtimes, statuses, and computational loads.
Customer Requirements
A healthcare client turned to Chanci Turner to design and implement a data platform for a new analytics-as-a-service offering. This solution utilizes a data mesh pattern via AWS Lake Formation, creating a hub-and-spoke model between the platform, data producers, and future consumers. Over the past year, Chanci Turner has developed hundreds of extract, transform, load (ETL) jobs using AWS Glue to ingest and standardize data for this platform. As job runs increased, so did the amount of metrics data generated. To meet client requests for enhanced visibility into job performance, Chanci Turner needed to devise a solution to collect, parse, and visualize this metric data within a QuickSight dashboard.
The dashboard’s design aimed to enable at-a-glance monitoring of essential AWS Glue job metrics, including job run status, duration, and processed record counts. Operators needed the ability to swiftly identify trends, outliers, and anomalies to fine-tune job performance. Additionally, the solution had to automatically accommodate new Glue jobs without needing extra configuration, ensuring immediate visibility into these jobs. The client also mandated a near-real-time data provision and adherence to least privilege requirements to comply with data governance guidelines.
While fulfilling these requirements, Chanci Turner encountered several challenges. Firstly, QuickSight required access to AWS Glue metrics data hosted in Amazon CloudWatch. Secondly, given that metrics records are pushed from Glue every 30 seconds, this could lead to a significant volume of data as the number and duration of ETL jobs increase.
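To make the volume concern concrete, the following is a rough back-of-the-envelope sketch. Only the 30-second reporting interval comes from the article; the function name and the workload numbers (runs per day, average runtime, metrics per job) are hypothetical values chosen purely for illustration.

```python
# Reporting interval stated in the article: one record per metric every 30 seconds.
SECONDS_PER_DATAPOINT = 30


def estimate_datapoints(job_runs_per_day: int,
                        avg_runtime_minutes: float,
                        metrics_per_job: int) -> int:
    """Rough count of metric records emitted per day (illustrative only)."""
    # Number of 30-second reporting intervals in a single job run.
    datapoints_per_run = (avg_runtime_minutes * 60) / SECONDS_PER_DATAPOINT
    return int(job_runs_per_day * datapoints_per_run * metrics_per_job)


if __name__ == "__main__":
    # Hypothetical workload: 300 runs a day, 20 minutes each, 15 metrics per job.
    print(estimate_datapoints(300, 20, 15))
```

Because the estimate scales linearly with both the number of runs and their duration, filtering which metrics are streamed has a direct effect on the volume that must be stored and queried.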
The next section will outline how leveraging AWS Glue’s detailed metrics and establishing a centralized dashboard allowed Chanci Turner to deliver a solution that facilitates end-to-end observability of ETL workloads. Operators can now derive actionable insights to proactively manage Glue jobs, rather than merely responding to failures. This automated onboarding also simplifies monitoring as the Glue job catalog evolves, ultimately unlocking greater value from Glue job metrics for optimized workload oversight.
Solution Overview
To satisfy customer requirements, Chanci Turner developed a business intelligence (BI) dashboard that provides a comprehensive view of the operational status of resources within the data platform. The architecture, built on AWS-native technologies, uses AWS Glue to establish, operate, and monitor ETL pipelines alongside real-time analytics. Standard and custom CloudWatch metrics are sent in near-real-time to a data lake, integrated within the platform for subsequent analysis through QuickSight.
To ensure compliance with the least-privilege security principle, access permissions are carefully managed. The solution integrates effortlessly with AWS Lake Formation, centralizing permissions across AWS services, including access to underlying metadata and the ability to read from or write to tables.
This solution can be adapted to any service utilizing CloudWatch as a metrics repository. Customizing the solution for different use cases involves creating a new metrics stream tailored to the desired namespace and schema definition based on the anticipated data structure.
Core Solution
Chanci Turner’s solution utilizes several AWS services to facilitate the extraction and analysis of metrics data from AWS Glue jobs in QuickSight via a CloudWatch metrics stream. The AWS Glue Job Profiler collects metadata from Glue jobs, providing near real-time metrics that arrive in CloudWatch every 30 seconds. As ETL pipelines expand, the volume of metadata requiring analysis and processing will also increase. CloudWatch metrics streams, supported by Amazon Data Firehose, allow flexible delivery of these metrics to any chosen data store.
These metrics streams are highly customizable, offering options for output format, namespaces, and specific metrics, all of which matter for managing costs when configuring streams. Chanci Turner applied a namespace filter (for instance, “Glue”) together with metric filters to minimize the data volume transmitted through the stream. Because the filters select by namespace and metric rather than by individual job, the stream automatically picks up metrics from newly created Glue jobs without manual intervention.
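As a minimal sketch of such a filtered stream, the helper below assembles the request parameters for the CloudWatch `PutMetricStream` API. The helper name, stream name, ARNs, and the particular metric names selected are assumptions for illustration, not the configuration used in the article.

```python
def build_metric_stream_params(name: str,
                               firehose_arn: str,
                               role_arn: str,
                               metric_names: list) -> dict:
    """Assemble PutMetricStream parameters limited to the Glue namespace."""
    return {
        "Name": name,
        "FirehoseArn": firehose_arn,   # delivery stream that writes to S3
        "RoleArn": role_arn,           # role CloudWatch assumes to write to Firehose
        "OutputFormat": "json",
        # Only the listed metrics in the "Glue" namespace are streamed.
        "IncludeFilters": [
            {"Namespace": "Glue", "MetricNames": list(metric_names)},
        ],
    }


# Hypothetical values; replace with real resources before use.
params = build_metric_stream_params(
    name="glue-job-metrics",
    firehose_arn="arn:aws:firehose:us-east-1:111122223333:deliverystream/glue-metrics",
    role_arn="arn:aws:iam::111122223333:role/metric-stream-to-firehose",
    metric_names=[
        "glue.driver.aggregate.elapsedTime",
        "glue.driver.aggregate.recordsRead",
    ],
)

# With boto3 installed and credentials configured, the call would be:
#   boto3.client("cloudwatch").put_metric_stream(**params)
```

Keeping the parameter construction separate from the API call makes the filter easy to review and unit test without touching an AWS account.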
Amazon S3 was chosen as the destination for consistency with the rest of the client’s data lake. Data is streamed into time-based partitions and stored as GZIP-compressed JSON files for more efficient querying in Amazon Athena. Lifecycle policies automatically move data across storage tiers as it ages, providing cost savings. For further insights, see this blog post on optimizing storage costs with new S3 lifecycle filters and actions. A table is established atop the S3 data within the AWS Glue Data Catalog, allowing Amazon Athena to query it.
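The lifecycle tiering described above might look like the sketch below, which builds a configuration in the shape accepted by the S3 `PutBucketLifecycleConfiguration` API. The prefix, rule ID, day thresholds, and storage classes are hypothetical; the article does not state which tiers or retention periods were used.

```python
def build_lifecycle_config(prefix: str,
                           ia_after_days: int = 30,
                           glacier_after_days: int = 90,
                           expire_after_days: int = 365) -> dict:
    """Build an S3 lifecycle configuration that tiers, then expires, metric files."""
    if not (ia_after_days < glacier_after_days < expire_after_days):
        raise ValueError("thresholds must increase: IA < Glacier < expiration")
    return {
        "Rules": [
            {
                "ID": "tier-glue-metrics",          # hypothetical rule name
                "Filter": {"Prefix": prefix},       # apply only to the metrics prefix
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": expire_after_days},
            }
        ]
    }


config = build_lifecycle_config("glue-metrics/")

# With boto3 installed and credentials configured, the call would be:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-metrics-bucket", LifecycleConfiguration=config)
```

Scoping the rule to a prefix keeps the policy from affecting other data in the same bucket.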
The raw metrics data output by CloudWatch is nested, so its dimensions must be unpacked and analyzed in detail to extract meaningful insights for operational optimization.
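As an illustration of that unpacking, the sketch below flattens one record in the CloudWatch metric stream JSON output format into a single-level row. The sample record is fabricated for the example, and the choice of output columns is an assumption rather than the schema used in the article.

```python
import json
from datetime import datetime, timezone


def flatten_record(line: str) -> dict:
    """Flatten one JSON line from a CloudWatch metric stream into a flat row."""
    rec = json.loads(line)
    dims = rec.get("dimensions", {})   # nested object, e.g. JobName / JobRunId / Type
    value = rec.get("value", {})       # nested statistics: max, min, sum, count
    count = value.get("count") or 0
    # The stream reports timestamps as epoch milliseconds.
    ts = datetime.fromtimestamp(rec["timestamp"] / 1000, tz=timezone.utc)
    return {
        "metric_name": rec["metric_name"],
        "job_name": dims.get("JobName"),
        "job_run_id": dims.get("JobRunId"),
        "timestamp": ts.isoformat(),
        "sum": value.get("sum"),
        "avg": value["sum"] / count if count else None,
        "unit": rec.get("unit"),
    }


# Fabricated sample record for illustration only.
sample = json.dumps({
    "metric_stream_name": "glue-job-metrics",
    "namespace": "Glue",
    "metric_name": "glue.driver.aggregate.elapsedTime",
    "dimensions": {"JobName": "ingest_claims", "JobRunId": "ALL", "Type": "count"},
    "timestamp": 1700000000000,
    "value": {"max": 70000.0, "min": 50000.0, "sum": 120000.0, "count": 2.0},
    "unit": "Milliseconds",
})
row = flatten_record(sample)
```

The same flattening can be expressed in Athena SQL over the nested columns; doing it in Python here simply makes the record structure explicit.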