How CoStar Leverages Karpenter for Enhanced Amazon EKS Resource Management

Introduction

CoStar is recognized as a leading provider of Commercial Real Estate data, but it also operates significant home, rental, and apartment platforms, including apartments.com. CoStar’s traditional Commercial Real Estate clientele consists of well-informed users who rely on extensive and intricate data to make pivotal business decisions. Successfully helping customers navigate which of the 6 million properties encompassing 130 billion sq. ft. of space to rent has established CoStar as a frontrunner in data and analytics technology. As CoStar embarked on developing the next iteration of its Apartments and Homes websites, it became apparent that the user demographics and customer expectations differed significantly from those of its long-established Commercial Real Estate clients. CoStar aimed to deliver the same level of decision-making support to this new audience, but for a vastly larger number of users and data sets. This spurred CoStar’s transition from legacy data centers to AWS in search of the speed and scalability needed to provide equivalent value to millions of users accessing hundreds of millions of properties.

Challenge

CoStar’s primary challenge has consistently been gathering data from hundreds of sources, enriching it with critical insights, and delivering it through a meaningful and user-friendly interface. The CoStar Suite’s Commercial Real Estate, Apartments, and Homes divisions utilize distinct data sources that refresh at varying times and volumes. The systems required to support this data ingestion and source updates must be fast, precise, and capable of scaling efficiently to remain cost-effective. Many of these systems are in the process of migrating from legacy data centers to CoStar’s AWS environment, necessitating operations on parallel and interoperable systems to prevent significant duplication of engineering efforts. These requirements underscored the need for running Kubernetes both on-premises and in AWS, with the ability to scale container clusters in response to usage fluctuations. After months of successful testing and deployment, CoStar opted to further optimize their engineering stack while maintaining as much parallel on-premises Kubernetes management as possible.

In a Kubernetes cluster, the control plane and its components manage cluster operations (e.g., scheduling containers, ensuring application availability, and storing cluster data), while worker nodes host the pods that run containerized application workloads. Amazon Elastic Kubernetes Service (Amazon EKS) is a managed Kubernetes service on AWS that handles the availability and scalability of the Kubernetes control plane. Customers can schedule Kubernetes pod workloads on various combinations of provisioned Amazon Elastic Compute Cloud (Amazon EC2) instances and AWS Fargate. In this post, we’ll look at how CoStar used the Karpenter autoscaling solution to provision Amazon EC2 instances for its worker nodes.

The conventional method for provisioning worker nodes is Amazon EKS managed node groups, which automate the provisioning and lifecycle management of the underlying Amazon EC2 instances via Amazon EC2 Auto Scaling groups. To dynamically adjust the number of Amazon EC2 instances, managed node groups can be paired with the Cluster Autoscaler solution. This autoscaler monitors pending pods awaiting compute capacity and identifies underutilized worker nodes. When pods are pending due to insufficient resources, the Cluster Autoscaler increases the desired count of the Amazon EC2 Auto Scaling group, provisioning new worker nodes on which those pods can be scheduled and run. The Cluster Autoscaler also terminates underutilized or idle nodes based on configurable utilization thresholds.
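
As a rough sketch of this pattern (not CoStar’s actual configuration), the eksctl-style cluster definition below shows a managed node group whose Auto Scaling group is tagged for Cluster Autoscaler auto-discovery; the cluster name, region, instance type, and sizing are placeholders.

```yaml
# Hypothetical eksctl ClusterConfig: a managed node group whose Auto Scaling group
# is discoverable by the Cluster Autoscaler via the standard auto-discovery tags.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: demo-cluster        # placeholder cluster name
  region: us-east-1         # placeholder region

managedNodeGroups:
  - name: general-purpose
    instanceType: m5.xlarge # a single instance type keeps the autoscaler's scheduling simulation accurate
    minSize: 2
    maxSize: 10
    desiredCapacity: 2
    tags:
      # Tags the Cluster Autoscaler uses to auto-discover this group's Auto Scaling group
      k8s.io/cluster-autoscaler/enabled: "true"
      k8s.io/cluster-autoscaler/demo-cluster: "owned"
```

With the auto-discovery tags in place, the Cluster Autoscaler deployment only needs to be told the cluster name to find and scale this group.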

For CoStar’s workloads running on Amazon EKS, the objective was to maximize availability and performance while optimizing resource utilization. Although the Cluster Autoscaler solution offers a degree of dynamic compute provisioning and cost-effectiveness, several considerations and limitations can make it difficult or restrictive to use. For instance, the Amazon EC2 instance types within a given node group must share similar Central Processing Unit (CPU), memory, and Graphics Processing Unit (GPU) specifications to avoid undesirable behavior, because the Cluster Autoscaler uses the first instance type specified in the node group policy to simulate pod scheduling. If the policy includes additional instance types with higher specs, node resources may be wasted after scaling out, since pods are scheduled only against the capacity of the first instance type. Conversely, if the policy includes lower-spec instance types, pods may fail to schedule on those nodes due to insufficient resources. To accommodate CoStar’s diverse pod resource needs, multiple node groups with similarly specified instance types were required. Additionally, the Cluster Autoscaler only deprovisions underutilized nodes; it does not replace them with more cost-effective instance types when workloads change. Furthermore, for CoStar’s stateless workloads, targeting Spot capacity for greater discounts over On-Demand was cumbersome to implement with node groups.
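
The node-group proliferation this implies can be illustrated with a hypothetical sketch: one group per resource profile, each restricted to instance types with matching CPU and memory shapes. The group names, instance types, and sizes below are placeholders rather than CoStar’s configuration.

```yaml
# Hypothetical managed node groups, one per workload resource profile. Each group is
# limited to instance types with the same vCPU/memory shape so the Cluster Autoscaler's
# scheduling simulation stays accurate.
managedNodeGroups:
  - name: compute-optimized
    instanceTypes: ["c5.2xlarge", "c5a.2xlarge"]   # same vCPU/memory shape
    minSize: 1
    maxSize: 20
    labels:
      workload-profile: cpu-bound
  - name: memory-optimized
    instanceTypes: ["r5.2xlarge", "r5a.2xlarge"]   # same vCPU/memory shape
    minSize: 1
    maxSize: 20
    labels:
      workload-profile: memory-bound
```

Every additional resource profile means another group like these to define, tag, patch, and monitor, which is the operational overhead that motivated the move to Karpenter.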

Solution Overview

Why Karpenter

CoStar required a more streamlined approach to node provisioning for their diverse workload demands without the complexity of managing multiple node groups. This was accomplished with the open-source Karpenter node provisioning solution. Karpenter is a flexible, high-performance Kubernetes cluster autoscaler that enables dynamic, groupless provisioning of worker node capacity in response to unschedulable pods. Thanks to Karpenter’s groupless architecture, CoStar was no longer constrained to similarly specified instance types. Karpenter continuously assesses the aggregate resource requirements of pending pods along with other scheduling constraints (e.g., node selectors, affinities, tolerations, and topology spread constraints) and provisions the optimal instance compute capacity as defined in the Provisioner Custom Resource Definition (CRD). This flexibility allows different teams within CoStar to use their own Provisioner configurations tailored to their application and scaling needs. Moreover, Karpenter provisions nodes directly through the Amazon EC2 Fleet application programming interface (API), eliminating the need for node groups and Amazon EC2 Auto Scaling groups. This results in faster provisioning and retry times (i.e., milliseconds instead of minutes), helping CoStar meet its performance service level agreements (SLAs). Additionally, the CoStar team opted to run the Karpenter controller on AWS Fargate, which removes the need for managed node groups altogether.

The diagram below illustrates how Karpenter monitors the aggregate resource requests of unschedulable pods, makes decisions to launch new nodes, and terminates nodes when they are no longer needed to reduce infrastructure costs.
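
To make the Provisioner model concrete, here is a minimal sketch of a Karpenter Provisioner and its companion AWSNodeTemplate using the v1alpha5 API generation referenced in this post; the resource names, discovery tags, instance categories, and limits are illustrative assumptions rather than CoStar’s settings.

```yaml
# Hypothetical Karpenter Provisioner (v1alpha5 API): groupless provisioning driven by
# the aggregate requirements of pending pods rather than by a fixed node group.
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  requirements:
    - key: kubernetes.io/arch
      operator: In
      values: ["amd64"]
    - key: karpenter.k8s.aws/instance-category
      operator: In
      values: ["c", "m", "r"]        # let Karpenter pick the best fit across these families
  limits:
    resources:
      cpu: "1000"                    # cap total vCPUs provisioned by this Provisioner
  consolidation:
    enabled: true                    # replace underutilized nodes with cheaper capacity
  providerRef:
    name: default
---
# Companion AWSNodeTemplate telling Karpenter which subnets and security groups to use.
apiVersion: karpenter.k8s.aws/v1alpha1
kind: AWSNodeTemplate
metadata:
  name: default
spec:
  subnetSelector:
    karpenter.sh/discovery: demo-cluster      # placeholder discovery tag
  securityGroupSelector:
    karpenter.sh/discovery: demo-cluster
```

Because the Karpenter controller itself can run on AWS Fargate, a Fargate profile matching the karpenter namespace is enough to bootstrap scheduling without any managed node groups.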

To achieve cost-effectiveness for CoStar’s stateless workloads and lower environments, the CoStar team configured the Karpenter Provisioner to prioritize Spot capacity, provisioning On-Demand capacity only when no Spot capacity is available. Karpenter employs the price-capacity-optimized allocation strategy for Spot, balancing the lowest price against the lowest likelihood of near-term interruption. For stateful workloads in production clusters, the Karpenter Provisioner selects from compute- and storage-optimized instance families running On-Demand, most of which is covered by Compute Savings Plans and Reserved Instances for discounted pricing.
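
A hedged sketch of how such capacity preferences can be expressed as Provisioner requirements follows: allowing both capacity types lets Karpenter favor Spot and fall back to On-Demand, while a second Provisioner pins production stateful workloads to On-Demand compute- and storage-optimized families. All names and values are illustrative, not CoStar’s actual configuration.

```yaml
# Hypothetical Provisioner for stateless / lower-environment workloads:
# Spot is preferred when both capacity types are allowed, with On-Demand as fallback.
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: stateless-spot
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot", "on-demand"]
  providerRef:
    name: default
---
# Hypothetical Provisioner for production stateful workloads:
# On-Demand only, restricted to compute- and storage-optimized families so usage
# lines up with Compute Savings Plan and Reserved Instance commitments.
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: stateful-on-demand
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["on-demand"]
    - key: karpenter.k8s.aws/instance-category
      operator: In
      values: ["c", "i"]           # compute-optimized (c) and storage-optimized (i) families
  providerRef:
    name: default
```

Workloads would then target the appropriate Provisioner through node selectors or taints and tolerations, as described above.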
