As applications grow in scale, customers increasingly require automated methods to ensure application availability and minimize the time and effort spent on identifying, debugging, and resolving operational issues. Organizations invest in monitoring tools and devote significant resources to training their teams on effective usage. When problems occur, operators must sift through an array of data sources—dashboards, documentation, runbooks, alerts, logs, and more. This lengthy process of pinpointing root causes can hinder troubleshooting and remediation, ultimately affecting application reliability and the customer experience.
Generative AI can help alleviate these challenges by processing and analyzing large volumes of data from various monitoring tools, generating insights, and automating responses. Amazon Bedrock is a fully managed service that provides access to high-performing foundation models (FMs) from top AI firms such as AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon, all through a single API. With Amazon Bedrock, customers can experiment with and assess leading foundation models, customize them with their own data through fine-tuning and Retrieval Augmented Generation (RAG), and create agents capable of performing tasks using enterprise systems and data sources.
Customers utilize Amazon Managed Service for Prometheus to securely and durably store application and infrastructure metrics collected from cloud, on-premises, and hybrid environments. To extract insights from these metrics, clients typically write PromQL queries or utilize Grafana. PromQL enables the execution of complex queries on time-series data, offering valuable insights into application health by filtering, aggregating, and manipulating metrics data in multiple ways. However, for beginners, the intricate syntax and the necessity to understand the Prometheus data model can be daunting.
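For instance, the following illustrative queries filter and aggregate typical Kubernetes container metrics (the metric names assume cAdvisor-style metrics and may differ in your environment):

# Per-pod CPU usage rate over the last 5 minutes
sum by (pod) (rate(container_cpu_usage_seconds_total[5m]))

# Pods whose working-set memory exceeds 1 GiB
container_memory_working_set_bytes > 1073741824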
In this blog post, we will explore how Amazon Bedrock can help you get answers about metrics stored in Amazon Managed Service for Prometheus without requiring knowledge of PromQL. By following the example provided in this post, you can generate PromQL queries from natural language descriptions of what you wish to monitor or analyze. You can also assess existing queries and receive suggestions for optimization and improvement.
Solution Overview
The following diagram depicts how the Amazon Bedrock agent derives insights from Amazon Managed Service for Prometheus.
At a high level, the process can be summarized in these steps:
- The AWS managed collector gathers metrics from workloads running on an Amazon EKS cluster and sends them to Amazon Managed Service for Prometheus.
- The user interacts with the Amazon Bedrock agent’s interface to inquire about the application’s health, such as CPU usage or memory utilization.
- The Amazon Bedrock agent formulates the necessary PromQL query based on the user’s request and forwards it to the action group.
- An action group specifies the actions the agent can assist the user with. In this post, you will use a Lambda function that authenticates with Amazon Managed Service for Prometheus and executes the PromQL query provided by the agent.
- The action group will then return the results to the agent, which will further enhance them using the knowledge base.
- Knowledge bases for Amazon Bedrock allow for the integration of proprietary information into generative AI applications. Using the Retrieval Augmented Generation (RAG) technique, a knowledge base searches through your data to extract the most relevant information to answer natural language queries. The agent will then process the results, adding the appropriate context, and present them in a natural language format back to the user.
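To illustrate step 2, a caller can invoke the agent programmatically with a natural language question. The following minimal Python sketch uses the boto3 bedrock-agent-runtime client; the agent ID, alias ID, and question are placeholders to replace with your own values.

import uuid
import boto3

client = boto3.client("bedrock-agent-runtime")

# Placeholder identifiers: substitute your agent's ID and alias ID.
response = client.invoke_agent(
    agentId="AGENT_ID",
    agentAliasId="AGENT_ALIAS_ID",
    sessionId=str(uuid.uuid4()),
    inputText="What is the average CPU usage across my pods over the last hour?",
)

# The response is an event stream; concatenate the returned text chunks.
answer = ""
for event in response["completion"]:
    if "chunk" in event:
        answer += event["chunk"]["bytes"].decode("utf-8")
print(answer)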
Prerequisites
For this guide, you will need the following:
- AWS Command Line Interface (AWS CLI) version 2
- Amazon EKS cluster
- Amazon Managed Service for Prometheus workspace
- Amazon Managed Grafana workspace
- Access to the Claude 3 Sonnet Model in Amazon Bedrock
- awscurl
- Amazon S3 bucket
Note: Although Amazon Managed Grafana will be set up as part of this blog post, it is optional.
Solution Walkthrough
Step 1: Setting Up Monitoring for Amazon EKS Cluster Using AWS Managed Collector & Amazon Managed Service for Prometheus
To begin, you will set up monitoring for the Amazon EKS cluster. You will use the Solution for Monitoring Amazon EKS infrastructure with Amazon Managed Grafana project, which establishes the Amazon EKS cluster and an AWS managed collector. The collector scrapes metrics and writes them to a pre-configured Amazon Managed Service for Prometheus workspace. The collected metrics offer insight into the health and performance of both the Kubernetes control and data planes, giving you visibility into your Amazon EKS cluster from the node level down to individual pods, including detailed monitoring of resource usage.
Let’s start by establishing a few environment variables:
export AMG_WORKSPACE_ID=<Your Grafana workspace ID, usually starts with g->
export AMG_API_KEY=$(aws grafana create-workspace-api-key \
  --key-name "grafana-operator-key" \
  --key-role "ADMIN" \
  --seconds-to-live 432000 \
  --workspace-id $AMG_WORKSPACE_ID \
  --query key \
  --output text)
After creating the API key, make it available to the AWS CDK by storing it in AWS Systems Manager Parameter Store with the following command. The $AMG_API_KEY variable was set in the previous step; replace $AWS_REGION with the Region your solution will run in, or export it beforehand.
aws ssm put-parameter --name "/observability-aws-solution-eks-infra/grafana-api-key" \
  --type "SecureString" \
  --value $AMG_API_KEY \
  --region $AWS_REGION \
  --overwrite
Next, you will deploy the observability stack using AWS CDK.
git clone https://github.com/aws-observability/observability-best-practices.git
cd observability-best-practices/solutions/oss/eks-infra/v3.0.0/iac/
export AWS_REGION=<Your region>
export AMG_ENDPOINT=<AMG_ENDPOINT>
export EKS_CLUSTER_NAME=<EKS_CLUSTER_NAME>
export AMP_WS_ARN=<ARN of Amazon Prometheus workspace>
make deps
make build && make pattern aws-observability-solution-eks-infra-$EKS_CLUSTER_NAME deploy
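Optionally, confirm the deployment finished before continuing. Assuming the CDK stack takes its name from the pattern above (an assumption, not documented output of the project), you can check its status:

aws cloudformation describe-stacks \
  --stack-name aws-observability-solution-eks-infra-$EKS_CLUSTER_NAME \
  --query "Stacks[0].StackStatus" \
  --output text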
This solution creates a scraper that collects metrics from your Amazon EKS cluster. Those metrics are stored in Amazon Managed Service for Prometheus and displayed in Amazon Managed Grafana dashboards.
To verify that the stack has been deployed successfully, you can use awscurl to query the Amazon Managed Service for Prometheus workspace and confirm that metrics are being ingested:
export AMP_QUERY_ENDPOINT=<AMP Query Endpoint>
awscurl -X POST --region <Your region> \
  --service aps "${AMP_QUERY_ENDPOINT}" -d 'query=up' \
  --header 'Content-Type: application/x-www-form-urlencoded'
You should see a response similar to:
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {
          "__name__": "up",
          "instance": "localhost:9090",
          "job": "prometheus",
          "monitor": "monitor"
        },
        "value": [
          1652452637.636,
          "1"
        ]
      }
    ]
  }
}
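Once the basic up check succeeds, any PromQL expression can be submitted the same way. For example, this query aggregates the up metric across all scrape targets:

awscurl -X POST --region <Your region> \
  --service aps "${AMP_QUERY_ENDPOINT}" -d 'query=sum(up)' \
  --header 'Content-Type: application/x-www-form-urlencoded'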
Step 2: Configure the Lambda Function as an Action Group for the Amazon Bedrock Agent
Next, you will create a Lambda function that serves as an action group for the Amazon Bedrock agent. This enables streamlined interaction with your application metrics, enhancing your ability to monitor and troubleshoot effectively.
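A minimal sketch of such a handler is shown below, assuming an OpenAPI-style action group that passes the PromQL expression as a parameter named query, and an AMP_QUERY_ENDPOINT environment variable pointing at the workspace's query API; both names are illustrative rather than prescribed by the service. The function signs the request with SigV4 (service name aps) and returns the result in the response format Bedrock agents expect.

import json
import os
import urllib.parse
import urllib.request

import boto3
from botocore.auth import SigV4Auth
from botocore.awsrequest import AWSRequest

# Illustrative variable, e.g.
# https://aps-workspaces.<region>.amazonaws.com/workspaces/<ws-id>/api/v1/query
AMP_QUERY_ENDPOINT = os.environ["AMP_QUERY_ENDPOINT"]
REGION = os.environ["AWS_REGION"]

session = boto3.Session()

def run_promql(query):
    """Sign a PromQL query with SigV4 and execute it against the workspace."""
    body = "query=" + urllib.parse.quote(query)
    request = AWSRequest(
        method="POST",
        url=AMP_QUERY_ENDPOINT,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )
    # The Lambda execution role needs aps:QueryMetrics on the workspace.
    SigV4Auth(session.get_credentials(), "aps", REGION).add_auth(request)
    req = urllib.request.Request(
        AMP_QUERY_ENDPOINT,
        data=body.encode("utf-8"),
        headers=dict(request.headers),
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

def lambda_handler(event, context):
    # Parameters arrive as a list of name/value pairs defined in the
    # action group's OpenAPI schema; we assume a single "query" parameter.
    params = {p["name"]: p["value"] for p in event.get("parameters", [])}
    result = run_promql(params["query"])
    return {
        "messageVersion": "1.0",
        "response": {
            "actionGroup": event["actionGroup"],
            "apiPath": event["apiPath"],
            "httpMethod": event["httpMethod"],
            "httpStatusCode": 200,
            "responseBody": {
                "application/json": {"body": json.dumps(result)}
            },
        },
    }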