
How Karpenter optimized the management of our EKS infrastructure on AWS
Companies face daily challenges in managing Kubernetes infrastructure, especially when it comes to maintaining efficiency and reducing costs. Here at Softplan we found a solution that transformed the way we manage our EKS clusters on AWS: Karpenter.

Challenges in instance management

Before talking about Karpenter, we need to take a few steps back and explain what node auto scaling is. Suppose we have a cluster with some machines (instances) available, running our workloads. What happens if there is a spike in usage of our applications and we need to launch more instances (replicas) of our pods? Without autoscaling, we would need to provision a node and instruct it to join our cluster so that our pods could be scheduled on this new instance. Keep in mind that provisioning an instance is not instantaneous: there is a whole bootstrapping process for the machine, network configuration and several other steps before it becomes fully available. We have talked about usage peaks in our applications, but what about idle periods? Do we really want to leave these nodes running with underutilized computing power? To solve these and other problems, the concept of auto scalers comes into play.

Auto Scalers

Auto scaler implementations are essentially responsible for node provisioning and consolidation. Here we are talking about horizontal scaling, that is, adding more machines to our cluster. There are several implementations of node autoscaling, but in this article the focus is on the AWS implementation and why we decided to migrate to another solution. Below is a figure illustrating how node autoscaling works:

Figure 01: AWS autoscaling

Auto Scaling Groups

When defining a scaling group in AWS, we need to set several properties, such as the minimum and maximum number of node instances allowed for the group, the resources used, the disk type, network configuration (subnets, etc.) and many other details. For example, for a type of application that uses more CPU, we will configure a group containing instance types with more CPU than memory. In the end we will likely have several distinct groups for certain types of applications.

Putting the pieces together – Cluster Auto Scaler

For the cluster to be able to “talk” to the cloud provider (AWS in this example), we need a component called Cluster Auto Scaler, or CAS. This component was created by the community that maintains Kubernetes and is available here. A default CAS configuration, installed via Helm, can be seen below:

```yaml
nameOverride: cluster-autoscaler
awsRegion: us-east-1

autoDiscovery:
  clusterName: my-cluster

image:
  repository: registry.k8s.io/autoscaling/cluster-autoscaler
  tag: v1.30.1

tolerations:
  - key: infra
    operator: Exists
    effect: NoSchedule

nodeSelector:
  environment: "infra"

rbac:
  create: true
  serviceAccount:
    name: cluster-autoscaler
    annotations:
      eks.amazonaws.com/role-arn: "role-aws"

extraArgs:
  v: 1
  stderrthreshold: info
```

With this installed and configured, and our auto scaling groups created, we have enabled automatic management of our nodes!

Why we decided to migrate to Karpenter

Our use case here at Projuris is as follows: we have a development cluster and a production cluster. After migrating to Gitlab SaaS, we faced the challenge of how to provision runners to execute our pipelines, and we decided to use the development cluster to host these runners. In this “first version” we chose the Cluster Auto Scaler because it was simpler to configure and already matched our production setup.
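To give an idea of what this per-profile group configuration looks like, here is a minimal sketch of a managed node group declared with eksctl. This is not our actual configuration: the cluster name, instance types, labels, taints and sizes are hypothetical, and the tags shown are the ones the Cluster Auto Scaler uses for auto-discovery.

```yaml
# Hypothetical eksctl ClusterConfig sketch: one managed node group per workload profile.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster        # assumed cluster name, matching autoDiscovery.clusterName above
  region: us-east-1
managedNodeGroups:
  - name: pipelines-cpu                         # a CPU-heavy profile for pipeline runners
    instanceTypes: ["c5.xlarge", "c5.2xlarge"]  # more CPU than memory
    minSize: 1                                  # keeps one node always on
    maxSize: 10
    volumeType: gp3
    labels:
      environment: pipelines
    taints:
      - key: pipelines
        value: "true"
        effect: NoSchedule
    tags:
      # Tags used by the Cluster Auto Scaler to discover this group
      k8s.io/cluster-autoscaler/enabled: "true"
      k8s.io/cluster-autoscaler/my-cluster: "owned"
```

Each new pipeline profile means another entry like this one (and another Auto Scaling Group in AWS), which is precisely the management overhead described next.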
But then we started to face some problems with this choice:

Provisioning time: when a pipeline started, machine provisioning was somewhat slow. The key point is that the Cluster Auto Scaler pays a “toll” to the cloud provider to provision a new node.

Difficulty in configuring groups: since we have several pipeline “profiles”, this management became complicated, because each new profile requires a new node group.

Cost: to mitigate the slow startup of new nodes, we kept an “online” machine profile that was always running, even when no pipeline was executing.

What is Karpenter?

Karpenter is a cluster autoscaling solution created by AWS that promises to provision and consolidate nodes always at the lowest possible cost. It is smart enough to know, for example, that depending on the situation buying an on-demand machine on AWS can be more cost-effective than a spot one. And that is just one of the tool's features. Karpenter also works with the idea of “groups” of machines (which in the Karpenter world are called NodePools), but the difference is that we define them through Karpenter's own CRDs (custom resource definitions). In other words, we keep manifests inside our cluster with all of these configurations, eliminating the need for any node group created in AWS.

How did Karpenter help us overcome the challenges presented?

Provisioning time: since Karpenter talks directly to the cloud provider's APIs, there is no Cluster Auto Scaler toll to pay. We used to have many timeout issues when provisioning new nodes; after switching to Karpenter this problem simply disappeared, precisely because provisioning is more efficient.

Difficulty in configuring groups: with Karpenter's NodePool and NodeClass resources this configuration became trivial and, most importantly, versioned in our Gitlab repository. Need to include a new machine profile in the NodePool? No problem: one commit and Karpenter already considers it for new provisioning.

Cost: we were able to make better use of the machines, since runners with similar characteristics are now allocated to nodes that match the required memory and CPU. In other words, we really use all the computing power that each node provides. The same applies to node consolidation: with the Cluster Auto Scaler we had complex scripts to drain nodes before consolidation, while with Karpenter this is configured in the NodePool in a very simple way.

A great argument for management to justify investing in this kind of change is cost. Below is a cost comparison between the Cluster Auto Scaler and Karpenter in January 2025, in which we achieved total savings of 16%:

Figure 02: Period from 01/01 to 15/01 with Cluster Auto Scaler

Figure 03: Period from 16/01 to 31/01 with Karpenter

Final considerations

The migration to Karpenter was a wise choice. We were able to manage nodes with different profiles in a much simpler way. There is still room for improvement, such as using a single NodePool to simplify things even further and letting the runners themselves set labels for the machine profile that should be provisioned (more at https://kubernetes.io/docs/reference/labels-annotations-taints/).
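To illustrate what a NodePool and its NodeClass look like in practice, below is a minimal sketch assuming the Karpenter v1 API. The pool name, instance types, limits, IAM role and discovery tags are illustrative, not our production values.

```yaml
# Minimal sketch of a Karpenter NodePool + EC2NodeClass (Karpenter v1 API assumed).
# Names, instance types, limits and tags are hypothetical.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gitlab-runners
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]     # let Karpenter pick the cheapest option
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["c5.xlarge", "c5.2xlarge", "m5.xlarge"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  limits:
    cpu: "200"                              # cap the total CPU this pool may provision
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m                    # consolidation replaces the old drain scripts
---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  amiSelectorTerms:
    - alias: al2023@latest                  # Amazon Linux 2023 AMIs
  role: "KarpenterNodeRole-my-cluster"      # hypothetical node IAM role
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
```

A runner pod then simply requests the CPU and memory it needs, and Karpenter launches or reuses a node that fits; moving to a single NodePool, as mentioned above, would mostly mean widening the requirements list of a sketch like this one.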