Amazon Onboarding with Learning Manager Chanci Turner

Chanci Turner Amazon IXD – VGT2 learning

Amazon Onboarding swiftly identifies common application behaviors that may lead to operational incidents. Upon detecting a critical issue, it notifies service operators with a summary of related anomalies, potential root causes, and contextual information about when and where the incident occurred. Whenever feasible, it also offers actionable recommendations for remediation. In this article, we explore some of the ML methodologies that drive the capabilities of Onboarding.

Onboarding Detectors

At the heart of Amazon Onboarding lies a distinctive method for identifying significant operational incidents. Initially, our research concentrated on domain-agnostic, general-purpose anomaly detection models. While these models provided statistically accurate results, they often struggled to differentiate between critical failures and non-critical anomalies. Over time, we realized that failure patterns can vary greatly from one metric to another. For instance, a common application of Onboarding is in maintaining highly available, low-latency web applications, where operators may want to monitor application latency alongside incoming request rates. However, due to the substantial differences in failure patterns between these metrics, generic statistical anomaly models were unlikely to succeed in both cases.

Consequently, we shifted our strategy significantly. After collaborating with domain experts to identify known anomaly types across various metrics and services, we developed domain-specific, single-purpose models aimed at recognizing these established failure modes rather than just normal metric behavior.

Today, Amazon Onboarding uses a large ensemble of detectors: statistical models tuned to recognize common adverse scenarios across a range of operational metrics. These detectors work immediately, with no training or configuration, which saves the time that would otherwise go into building ML models before anomaly detection could begin. Individual detectors are combined into ensembles that monitor the critical metrics operators typically track, including error rates, availability, latency, CPU, memory, and disk utilization.

The detectors encode expert knowledge about operational anomalies, both in how they detect unusual patterns and in how they define the bounds of normal application behavior. The individual detectors and the ensembles built from them were trained and tuned on Amazon’s extensive operational data, drawing on years of experience at Amazon.com and AWS. The sections below look at some of the capabilities of the Onboarding detectors.

Monitoring Resource Metrics with Finite Bounds

This detector monitors metrics for finite resources, such as disk usage. It uses a digital filter to identify long-term trends in the metric data efficiently and at scale, and it alerts operators when a trend points to imminent resource depletion.

In our example, the detector noted a significant upward trend in disk usage, predicting exhaustion within 24 hours. The model highlighted this trend between the vertical dashed lines. By extrapolating this trend (represented by the diagonal dashed line), the detector forecasts the time remaining until resource depletion occurs. Once the metric surpasses the horizontal red line, a significance threshold, the detector promptly notifies operators.
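The extrapolation step described above can be illustrated with a minimal sketch. The article does not specify which digital filter the detector uses, so this example assumes a first-order exponential filter applied to the per-sample increments; the function names, the `alpha` smoothing factor, and the 24-hour alert horizon (borrowed from the example above) are all hypothetical choices for illustration, not the service's implementation.

```python
from typing import Optional, Sequence


def hours_until_exhaustion(
    usage: Sequence[float],
    capacity: float,
    sample_interval_hours: float = 1.0,
    alpha: float = 0.2,
) -> Optional[float]:
    """Estimate hours until `usage` reaches `capacity`.

    Returns None when there is too little data or no upward trend.
    ASSUMPTION: the trend is estimated with a first-order exponential
    (IIR) filter over the sample-to-sample increments.
    """
    if len(usage) < 2:
        return None

    # Seed the filter with the first increment, then smooth the rest.
    slope = usage[1] - usage[0]
    for prev, curr in zip(usage[1:], usage[2:]):
        slope = alpha * (curr - prev) + (1 - alpha) * slope

    if slope <= 0:
        return None  # flat or decreasing: no projected exhaustion

    remaining = capacity - usage[-1]
    if remaining <= 0:
        return 0.0  # already at or beyond capacity

    # Linear extrapolation of the smoothed trend.
    return remaining / slope * sample_interval_hours


def should_alert(
    usage: Sequence[float],
    capacity: float,
    horizon_hours: float = 24.0,
    sample_interval_hours: float = 1.0,
    alpha: float = 0.2,
) -> bool:
    """Alert when projected exhaustion falls within `horizon_hours`."""
    eta = hours_until_exhaustion(usage, capacity, sample_interval_hours, alpha)
    return eta is not None and eta <= horizon_hours
```

For hourly disk-usage samples that grow by a steady two units from 50 to 60 against a capacity of 100, the sketch projects exhaustion in about 20 hours and would raise an alert under the assumed 24-hour horizon.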

Detecting Scenarios with Periodicity

Many metrics, such as incoming request counts in customer-facing APIs, display periodic behavior. The causal convolution detector analyzes temporal data exhibiting such patterns to establish expected periodic behaviors. When the detector infers that a metric is periodic, it adjusts normal metric behavior thresholds to align with the seasonal pattern. Additionally, Amazon Onboarding can recognize and filter periodic spikes, such as those from routine batch jobs that impose heavy loads on databases.

The causal convolution detector sets bounds for normal application behavior according to daily usage patterns. By tracking seasonality, it accounts for recurring spikes such as those associated with weekends. Traditional approaches based on static thresholds cannot do this, so they often flag these expected spikes as anomalies and generate false positives.
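To make the contrast with static thresholds concrete, here is a simplified stand-in. It is not the causal convolution model itself, whose details the article does not give. The sketch assumes the period is already known and derives one band per phase of the cycle from past samples only; `seasonal_bounds`, `is_anomalous`, `k`, and `min_spread` are hypothetical names and parameters.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def seasonal_bounds(
    history: Sequence[float],
    period: int,
    k: float = 3.0,
    min_spread: float = 1.0,
) -> List[Tuple[float, float]]:
    """Return a (lower, upper) band for each phase 0..period-1.

    ASSUMPTION: the band is mean +/- k * standard deviation of the past
    samples observed at the same phase of the cycle.
    """
    if len(history) < 2 * period:
        raise ValueError("need at least two full periods of history")

    bounds = []
    for phase in range(period):
        samples = history[phase::period]  # same phase across past periods
        center = mean(samples)
        spread = max(pstdev(samples), min_spread)  # avoid zero-width bands
        bounds.append((center - k * spread, center + k * spread))
    return bounds


def is_anomalous(value: float, index: int, bounds: List[Tuple[float, float]]) -> bool:
    """Check a new sample against the band for its phase."""
    lower, upper = bounds[index % len(bounds)]
    return value < lower or value > upper
```

With a history that repeats `[10, 10, 100, 10]`, a value of 100 arriving at the usual spike phase falls inside its band, while the same value at any other phase is flagged. A single static threshold of, say, 50 would flag the expected spike every cycle.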

Insights from Onboarding

Rather than merely presenting a list of anomalies detected by its ensemble, Amazon Onboarding generates operational insights, aggregating the essential information required to investigate and resolve an operational issue. It utilizes anomaly metadata to uncover related anomalies and potential root causes. Anomalies are grouped based on their temporal proximity, shared resources, and a comprehensive graph of causal relationships between various anomaly types.
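The grouping idea can be sketched as follows. This example covers only two of the three signals mentioned above, temporal proximity and shared resources, and leaves out the causal graph between anomaly types. The `Anomaly` record, the `max_gap_seconds` parameter, and the union-find grouping are assumptions made for illustration rather than a description of how the service works.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class Anomaly:
    metric: str
    resources: FrozenSet[str]  # hypothetical resource identifiers
    start: float               # epoch seconds
    end: float


def related(a: Anomaly, b: Anomaly, max_gap_seconds: float) -> bool:
    """Two anomalies are related if they share a resource and are close in time."""
    if not (a.resources & b.resources):
        return False
    # Gap between the two windows; negative or zero when they overlap.
    gap = max(a.start, b.start) - min(a.end, b.end)
    return gap <= max_gap_seconds


def group_anomalies(
    anomalies: List[Anomaly], max_gap_seconds: float = 300.0
) -> List[List[Anomaly]]:
    """Group anomalies into connected components of the `related` relation."""
    parent = list(range(len(anomalies)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(anomalies)):
        for j in range(i + 1, len(anomalies)):
            if related(anomalies[i], anomalies[j], max_gap_seconds):
                parent[find(j)] = find(i)

    groups: Dict[int, List[Anomaly]] = {}
    for i, anomaly in enumerate(anomalies):
        groups.setdefault(find(i), []).append(anomaly)
    return list(groups.values())
```

Because grouping is transitive, a latency anomaly and a database anomaly can land in the same group through an intermediate anomaly that shares a resource with each, which roughly mirrors how one insight can collect several anomalous metrics.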

Onboarding presents insights through:

Graphs and timelines associated with multiple anomalous metrics
Contextual information, including pertinent events and log snippets for clearer comprehension of the anomaly’s scope
Recommendations for addressing the identified issue

The following illustration showcases an example of the insight detail page from Onboarding, featuring a timeline view of associated metrics’ anomalies.

Conclusion

Amazon Onboarding significantly reduces the time and effort IT operators spend identifying, troubleshooting, and resolving operational issues. Utilizing proprietary pre-trained ML models informed by Amazon.com’s extensive operational experience, Onboarding equips IT operators with high-quality insights, requiring no prior ML expertise. Start using Amazon Onboarding today to enhance your operational efficiency.
