How the Metropolitan Police Department of Washington, DC (DC-MPD) Modernized Its Data Pipeline with AWS


The Metropolitan Police Department of Washington, DC (DC-MPD) ranks among the ten largest police agencies in the United States, serving as the primary law enforcement agency for the District of Columbia. With a commitment to adopting innovative technology, DC-MPD has merged evidence analysis methods with advanced information systems to enhance crime mapping, real-time crime statistics dashboards, and summary statistics, providing a comprehensive overview of crime trends. This modern approach is further enriched by a community policing philosophy that prioritizes building strong relationships between law enforcement and the community.

Central to DC-MPD’s operations is a robust data pipeline that effectively manages vast amounts of information from over 400 datasets sourced from the Mark43 records management system. The extraction and management of this data are facilitated by Amazon Web Services (AWS) solutions, including the AWS Database Migration Service (DMS), which allows for smooth data extraction and storage in an Amazon Simple Storage Service (S3) raw bucket.

Despite these advancements, the department faced notable challenges, such as duplicate entries in the staging bucket, insufficient error handling in Lambda functions, and intermittent disruptions in the curated bucket.

Solution Overview

To tackle these issues, DC-MPD partnered with AWS to design and implement a system aimed at refining their existing extract, transform, load (ETL) pipeline. The main goal was to eliminate duplicate records, introduce effective error detection mechanisms, and enhance overall error handling and orchestration within their data pipeline.

By utilizing AWS technologies such as Spark on AWS Lambda, AWS Glue, and AWS Step Functions, DC-MPD successfully transformed raw data into open table formats, orchestrated the pipeline using AWS Step Functions, and enabled user access to datasets via Amazon Athena.

The solution achieved several important objectives, including the creation of necessary Amazon S3 buckets for input, output, and error management, the implementation of a Glue job for data format conversion, and a DynamoDB table for overseeing data processing. Through architectural enhancements, the team established a more streamlined data processing workflow adaptable to various operational scenarios.

The data processing workflow leverages multiple AWS services to manage and process data based on specific conditions and events.

Data Ingestion into Amazon S3

Incoming data files are uploaded to an Amazon S3 bucket, organized by database and table-specific folders. This structure promotes logical segmentation for processing.
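This folder layout can be pictured as a simple key scheme. The sketch below is illustrative only: the `database/table/filename` prefix structure follows the description above, but the exact prefix names and separators in DC-MPD's bucket are assumptions.

```python
def raw_key(database: str, table: str, filename: str) -> str:
    """Build the object key for a file landing in the raw bucket.

    Mirrors the database- and table-specific folder layout described
    above; the exact prefix scheme is an assumption.
    """
    return f"{database}/{table}/{filename}"


def parse_key(key: str) -> tuple[str, str, str]:
    """Recover (database, table, filename) from an object key."""
    database, table, filename = key.split("/", 2)
    return database, table, filename
```

Keeping the key scheme in one place like this makes the segmentation logic easy to reuse from both the ingestion and the processing side.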

Event-based Processing via AWS Step Functions

When a new file is uploaded to Amazon S3, an Amazon EventBridge rule triggers an AWS Step Functions state machine. The Step Functions workflow then determines whether the file represents a full load or an incremental load based on its naming convention.
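The naming-convention check might look like the following minimal sketch. The `LOAD` prefix mirrors AWS DMS's default full-load file naming; the article does not state DC-MPD's actual convention, so treat the rule as an assumption to be adapted.

```python
def classify_load(filename: str) -> str:
    """Classify a file as 'full' or 'incremental' from its name.

    Assumes a DMS-style convention where full-load files begin with
    'LOAD' and change-data-capture (incremental) files are
    timestamp-named; adjust to the convention actually in use.
    """
    return "full" if filename.upper().startswith("LOAD") else "incremental"
```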

Tracking Pipeline State in Amazon DynamoDB

To effectively manage the workflow, the state machine references a DynamoDB table, which maintains:

  • Job Flags: Each folder or data group carries a job flag—G (process with AWS Glue), L (process with AWS Lambda), or P (pause).
  • File History: It records the names and timestamps of the most recently processed and any failed files.
  • Failure Handling: If a file fails to process, the pipeline sets its job flag to P, pausing further operations for investigation and remediation.
  • Job References: The table also stores resource references such as the Amazon Resource Names (ARN) of the AWS Glue job or Lambda function used for processing.
  • Lock Management: The pipeline uses DynamoDB to implement a locking mechanism, ensuring that only one transaction is processed at a time per table.
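The per-table lock amounts to a check-then-set on the table's item. In real DynamoDB this would be an `update_item` call guarded by a `ConditionExpression` such as `attribute_not_exists(lock_owner)`; the plain-dict simulation below conveys the semantics, and the attribute names are illustrative.

```python
def acquire_lock(item: dict, owner: str) -> bool:
    """Try to take the per-table lock; return True on success.

    Models DynamoDB's conditional write: the set succeeds only if no
    other owner currently holds the lock.
    """
    if item.get("lock_owner"):
        return False
    item["lock_owner"] = owner
    return True


def release_lock(item: dict, owner: str) -> bool:
    """Release the lock, but only if this owner actually holds it."""
    if item.get("lock_owner") != owner:
        return False
    del item["lock_owner"]
    return True
```

Because the condition and the write happen atomically on the DynamoDB side, two concurrent Step Functions executions cannot both believe they hold the lock.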

Handling Failure and Pausing Executions

In case of a failure, AWS Step Functions automatically sets the job flag in the DynamoDB table to P, preventing additional incremental loads from being processed. This pause mechanism safeguards data integrity and gives operations teams time to investigate the root cause.

Full Load vs. Incremental Load Processing

  • Full Load: When the state machine identifies a full-load file, incremental processing for that table is paused, and any incremental files that arrive are routed to an Amazon Simple Queue Service (SQS) queue. This lets the team run large-scale ingestion or backfill operations, with the queued increments applied after the full load finishes.
  • Incremental Load: For incremental load files, the Step Functions workflow checks the job flag of the folder. Depending on whether it is set to G or L, AWS Glue or AWS Lambda is invoked. This method optimizes processing according to file size and complexity.

AWS Glue (G flag) is utilized for larger files or more complex ETL tasks, taking advantage of its scalability and batch processing capabilities. Although Glue incurs costs based on Data Processing Units (DPUs), it is cost-effective for high-throughput, resource-intensive jobs. Conversely, AWS Lambda (L flag) is employed for smaller files or lightweight transformations, leveraging its economical pay-per-use model. However, if a full load is in progress, incoming incremental files are automatically sent to an Amazon SQS queue. These queued files are processed in order after the full load concludes, adhering to the standard G or L flag logic once resumed.
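Putting the flag and queueing rules together, the routing decision can be sketched as a single function. This is a hypothetical helper for illustration; in the actual pipeline the decision lives inside the Step Functions workflow.

```python
def route_incremental(job_flag: str, full_load_in_progress: bool) -> str:
    """Pick a processing target for an incremental file.

    Applies the rules described above: queue behind an active full
    load, otherwise dispatch on the folder's G/L flag, and halt on P.
    """
    if full_load_in_progress:
        return "sqs"      # defer until the full load completes
    if job_flag == "G":
        return "glue"     # larger files or heavier ETL
    if job_flag == "L":
        return "lambda"   # small files, lightweight transforms
    if job_flag == "P":
        return "pause"    # a prior failure needs investigation
    raise ValueError(f"unknown job flag: {job_flag!r}")
```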

Orchestration and Monitoring

After triggering AWS Glue or AWS Lambda, Step Functions monitors the execution status of the job. Once processing succeeds, the output file is moved to a staging Amazon S3 bucket for further processing. This design offers a serverless architecture for orchestrating both incremental and full-load pipelines, with mechanisms for failure detection and pause/resume, ensuring data quality and operational oversight.
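A trimmed Amazon States Language sketch of the routing portion of such a state machine is shown below. State names, the `$.jobFlag` input path, and the job/function names are all illustrative, and the error-handling, locking, and SQS-deferral states are omitted for brevity.

```json
{
  "StartAt": "CheckJobFlag",
  "States": {
    "CheckJobFlag": {
      "Type": "Choice",
      "Choices": [
        { "Variable": "$.jobFlag", "StringEquals": "G", "Next": "RunGlueJob" },
        { "Variable": "$.jobFlag", "StringEquals": "L", "Next": "InvokeLambda" }
      ],
      "Default": "PausePipeline"
    },
    "RunGlueJob": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": { "JobName": "curate-incremental" },
      "Next": "MoveToStaging"
    },
    "InvokeLambda": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": { "FunctionName": "spark-on-lambda-transform" },
      "Next": "MoveToStaging"
    },
    "MoveToStaging": { "Type": "Pass", "End": true },
    "PausePipeline": { "Type": "Succeed" }
  }
}
```

The `.sync` integration pattern on the Glue task is what lets Step Functions wait for the job to finish and observe its final status rather than merely firing it off.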

Unlocking Benefits by Utilizing AWS Technologies for Data Processing

Performance:

A significant advantage of this ETL pipeline is its high throughput and near real-time processing speed. By leveraging an open table format and executing Spark on AWS Lambda, one million records were processed in under 90 seconds, ensuring rapid data transformation. The serverless architecture scales automatically to match demand, allowing DC-MPD to process large datasets promptly.

Scalability:

The scalable architecture developed for DC-MPD efficiently manages large datasets and data processing tasks. By utilizing Spark on AWS Lambda, the system dynamically triggers Spark jobs in response to new data arrivals in the Amazon S3 input bucket, scaling with incoming data volume. Orchestration through AWS Step Functions keeps processing in the correct sequence, facilitating seamless integration with existing systems while accommodating varying data loads, whether incremental or full. This ability to scale and manage data processing in real time is vital for a law enforcement agency, where timely access to data can be critical to effective crime-fighting efforts.

