Amazon EMR is a leading cloud-based big data solution that offers open-source frameworks such as Spark, Hive, Hudi, and Presto, fully managed and with per-second billing. Amazon EMR on Amazon EKS lets you deploy EMR on shared Amazon Elastic Kubernetes Service (Amazon EKS) clusters, improving resource utilization, reducing cost, and simplifying infrastructure management. EMR on EKS can deliver up to 5.37 times the performance of OSS Spark v3.3.1 with 76.8% cost savings. Jobs can be submitted with the StartJobRun API, or declaratively through a Kubernetes controller using AWS Controllers for Kubernetes for Amazon EMR on EKS.
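As a reference point, a StartJobRun submission through the AWS CLI looks roughly like the following sketch; the virtual cluster ID, IAM role ARN, S3 path, and release label are placeholders to replace with your own values.

# Minimal sketch of submitting a Spark job with the StartJobRun API via the AWS CLI.
# All identifiers below (virtual cluster ID, role ARN, S3 path) are placeholders.
aws emr-containers start-job-run \
  --virtual-cluster-id abcdef1234567890 \
  --name sample-spark-job \
  --execution-role-arn arn:aws:iam::111122223333:role/EMRContainersJobExecutionRole \
  --release-label emr-6.15.0-latest \
  --job-driver '{
    "sparkSubmitJobDriver": {
      "entryPoint": "s3://my-bucket/scripts/job.py",
      "sparkSubmitParameters": "--conf spark.executor.instances=2"
    }
  }'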
However, this consolidation introduces challenges in accurately measuring detailed costs for departmental chargeback or showback purposes. A survey conducted by CNCF and the FinOps Foundation revealed that 68% of Kubernetes users either depend on monthly estimates or do not monitor Kubernetes costs at all. Among the respondents actively tracking Kubernetes costs, AWS Cost Explorer and Kubecost emerged as the most widely used tools.
Currently, organizations can allocate costs per tenant using either a hard multi-tenancy approach—separate EKS clusters in distinct AWS accounts—or a soft multi-tenancy method that utilizes different node groups within a shared EKS cluster. An efficient option involves namespace-based segregation, wherein nodes are shared across various namespaces. Nonetheless, attributing costs to teams based on workloads or namespaces, while considering compute optimization strategies (like Savings Plans or Spot Instances) and the costs of AWS services such as EMR on EKS, proves to be a complex task.
In this article, we introduce a cost chargeback solution for EMR on EKS that integrates AWS-native features of AWS Cost and Usage Reports (AWS CUR) with comprehensive cost visibility provided by Kubecost on Amazon EKS.
Solution Overview
Cost incurred by jobs running on EMR on EKS can be categorized into two main areas: compute resources and a supplementary charge for EMR on EKS utilization. To monitor expenses related to these areas, we pull data from two key sources:
- AWS CUR: This report details the cost uplift associated with EMR on EKS jobs and helps reconcile compute costs with any savings plans or reserved instances. The essential infrastructure for CUR is deployed as outlined in the guide for Setting up Athena using AWS CloudFormation templates.
- Kubecost: This tool provides insights into the compute costs generated by executor and driver pods.
The cost allocation process includes several components:
- Compute costs are provided by Kubecost. To facilitate detailed analysis, we set up an hourly Kubernetes CronJob that retrieves data from Kubecost and saves it to Amazon Simple Storage Service (Amazon S3).
- CUR files are also stored in an S3 bucket.
- We utilize Amazon Athena to create a view that consolidates the total costs associated with running an EMR on EKS job (a query sketch follows this list).
- Finally, you can link your preferred business intelligence tools to Athena via JDBC or ODBC connections. For visualization, we leverage Amazon QuickSight’s native integration.
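To illustrate the Athena step, the consolidated view can also be queried directly from the command line; the database name, view name, and results bucket below are hypothetical stand-ins for the ones created in your account.

# Hypothetical example: query a consolidated cost view from the CLI instead of a BI tool.
# Replace the database, view, and results bucket with the names in your account.
aws athena start-query-execution \
  --query-string "SELECT job_id, compute_cost, emr_uplift_cost, total_cost FROM emr_eks_cost_db.emr_eks_total_cost_view LIMIT 20" \
  --query-execution-context Database=emr_eks_cost_db \
  --result-configuration OutputLocation=s3://my-athena-query-results/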
The diagram illustrates the overall architecture and how the components interact.
We provide a shell script to deploy the tracking solution. This script configures the infrastructure using an AWS CloudFormation template, AWS Command Line Interface (AWS CLI), along with eksctl and kubectl commands. It executes the following tasks:
- Initiates the CloudFormation deployment.
- Sets up and configures an AWS Cost and Usage Report.
- Configures and deploys Kubecost, backed by Amazon Managed Service for Prometheus.
- Deploys a Kubernetes CronJob.
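Once the script finishes, a few quick checks can confirm the deployment; the namespace and Region below assume a default Kubecost install and may differ in your environment.

# Optional post-deployment checks (namespace and Region are assumptions).
kubectl get pods -n kubecost        # Kubecost cost-analyzer pods should be Running
kubectl get cronjob -A              # the hourly export CronJob created by the script
aws cur describe-report-definitions --region us-east-1   # the Cost and Usage Report definition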
Prerequisites
To implement this solution, you will need:
- The following tools installed: Helm 3.9+, kubectl, and eksctl.
- Docker.
- An EKS cluster with the Amazon EBS CSI driver deployed.
- Your EKS cluster enabled to use AWS Identity and Access Management (IAM) roles for service accounts; a verification sketch for these last two prerequisites follows this list.
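A minimal verification sketch, assuming a cluster named my-eks-cluster:

# Confirm the EBS CSI driver add-on is installed on the cluster (cluster name is a placeholder).
aws eks describe-addon --cluster-name my-eks-cluster --addon-name aws-ebs-csi-driver
# IAM roles for service accounts require an IAM OIDC provider; the first command shows
# the cluster's OIDC issuer, and the second associates a provider if one is missing.
aws eks describe-cluster --name my-eks-cluster --query "cluster.identity.oidc.issuer" --output text
eksctl utils associate-iam-oidc-provider --cluster my-eks-cluster --approve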
This article presumes you already have an EKS cluster and are running EMR on EKS jobs. For those without a ready EKS cluster, we recommend starting with a standard EMR on EKS blueprint that configures a cluster for submitting EMR on EKS jobs.
Set Up the Solution
To execute the shell script, follow these steps:
- Clone the relevant GitHub repository.
- Navigate to the cost-tracking folder:
cd cost-tracking
- Execute the script, passing the Region, Kubecost version, EKS cluster name, and AWS account ID:
sh deploy-emr-eks-cost-tracking.sh REGION KUBECOST-VERSION EKS-CLUSTER-NAME ACCOUNT-ID
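For example, an invocation might look like the following; the Region, Kubecost version, cluster name, and account ID are illustrative values to replace with your own.

# Illustrative invocation only; substitute your own values.
sh deploy-emr-eks-cost-tracking.sh us-west-2 1.103.3 my-eks-cluster 111122223333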
After running the script, you’ll be equipped to utilize Kubecost and CUR data to gain insights into the costs associated with your EMR on EKS jobs.
Tracking Costs
This section outlines how to analyze compute costs from Kubecost, query EMR on EKS uplift data, and merge these insights for a comprehensive cost overview.
Compute Costs
Kubecost provides various methods to monitor costs per Kubernetes object. For instance, you can analyze costs by pod, controller, job, label, or deployment. It also highlights costs tied to idle resources, such as Amazon Elastic Compute Cloud (Amazon EC2) instances that aren't fully utilized. In this discussion, we assume that no nodes are provisioned unless an EMR on EKS job is active, and that Karpenter provisions nodes when jobs are submitted. Karpenter also performs bin packing, optimizing EC2 resource usage and minimizing idle resource costs.
To monitor compute costs linked to EMR on EKS pods, we query the Kubecost allocation API, passing pod and labels in the aggregate parameter. We specifically use the labels emr-containers.amazonaws.com/job.id and emr-containers.amazonaws.com/virtual-cluster-id, which are consistently present in executor and driver pods. These labels allow us to filter Kubecost data to focus solely on the costs associated with EMR on EKS pods. You can delve into various levels of granularity, from pod to job and virtual cluster levels, to discern costs associated with drivers versus executors or the use of Spot Instances.
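A sketch of such a query is shown below, assuming Kubecost was installed with the default Helm release name in the kubecost namespace; depending on your Kubecost version, the label keys may need to be passed in a sanitized form (dots and slashes replaced with underscores).

# Expose the Kubecost API locally, then request allocations for the last hour,
# aggregated by pod and the two EMR on EKS labels (service name assumes the default install).
kubectl port-forward -n kubecost svc/kubecost-cost-analyzer 9090:9090 &
curl -sG "http://localhost:9090/model/allocation" \
  --data-urlencode "window=1h" \
  --data-urlencode "accumulate=true" \
  --data-urlencode "aggregate=pod,label:emr-containers.amazonaws.com/job.id,label:emr-containers.amazonaws.com/virtual-cluster-id"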
Additionally, we provide the instance_id, instance size, and capacity type (On-Demand or Spot) for the pods. This information is valuable for understanding job execution patterns and preferred capacities. Data about pod running costs and assets is gathered through a Kubernetes CronJob that queries the Kubecost API, correlates allocation and assets data based on the instance_id, cleans the data, and saves it in CSV format to Amazon S3.
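The sketch below shows a simplified version of what the CronJob does each hour: pull asset (node) data for the same window and stage the output in S3. It stores the raw JSON rather than the cleaned CSV, and the bucket name is a placeholder.

# Pull asset data for the last hour and stage it in S3 (bucket is a placeholder;
# the real CronJob joins this with allocation data and writes cleaned CSV files).
curl -sG "http://localhost:9090/model/assets" \
  --data-urlencode "window=1h" \
  -o /tmp/kubecost-assets.json
aws s3 cp /tmp/kubecost-assets.json s3://my-cost-tracking-bucket/kubecost/assets/$(date +%Y-%m-%d-%H).json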
The compute cost data encompasses several critical fields, such as cpucost, ramcost (memory cost), pvcost (cost of Amazon EBS storage), CPU and RAM utilization efficiency, and total cost, which represents the aggregate expenses for all resources utilized at the pod, job, or virtual cluster level.
To visualize this data, follow these steps:
- Access the Athena console and navigate to the query editor.
- Choose athenacu.