Karpenter Overview
Karpenter is an open‑source node provisioning engine for Kubernetes that dynamically launches right‑sized compute instances based on real‑time pod demand. By integrating directly with cloud provider APIs, it eliminates the need for pre‑defined Auto Scaling groups, delivering faster scaling, higher utilization, and lower cost.
Migration Challenges at Salesforce Scale
Operating over a thousand Amazon EKS clusters, Salesforce faced growing complexity from thousands of node groups, high scale‑up latency, and inefficient bin‑packing. These constraints slowed development and increased operational toil, prompting a shift to a more responsive, policy‑driven autoscaling solution.
Legacy Autoscaling Limitations
The traditional Cluster Autoscaler relied on static Auto Scaling groups, causing multi‑minute provisioning delays during traffic spikes and leading to under‑utilized resources. Label length restrictions and rigid volume configurations further complicated migrations.
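The label length restriction can be checked mechanically before a migration. Kubernetes limits a label value to 63 characters, and a non-empty value must start and end with an alphanumeric character. The following is a minimal sketch of such a check; the function names and the sample labels are hypothetical and are not taken from Salesforce's tooling.

```python
import re

# Kubernetes allows at most 63 characters in a label value.
MAX_LABEL_VALUE_LEN = 63

# A non-empty value must begin and end with an alphanumeric character and may
# contain '-', '_' and '.' in between.
_LABEL_VALUE_RE = re.compile(r"[A-Za-z0-9]([A-Za-z0-9._-]*[A-Za-z0-9])?")


def label_value_problems(value: str) -> list[str]:
    """Return human-readable problems for one label value (empty list if valid)."""
    problems = []
    if len(value) > MAX_LABEL_VALUE_LEN:
        problems.append(f"too long: {len(value)} > {MAX_LABEL_VALUE_LEN} characters")
    # An empty label value is valid, so only non-empty values are pattern-checked.
    if value and not _LABEL_VALUE_RE.fullmatch(value):
        problems.append("invalid characters or non-alphanumeric first/last character")
    return problems


def invalid_labels(labels: dict[str, str]) -> dict[str, list[str]]:
    """Map each offending label key to the problems found in its value."""
    result = {}
    for key, value in labels.items():
        problems = label_value_problems(value)
        if problems:
            result[key] = problems
    return result


# Hypothetical labels carried over from a legacy node group definition.
sample_labels = {
    "team": "payments",
    "nodegroup": "a" * 64,   # one character over the limit
    "tier": "-frontend",     # starts with a non-alphanumeric character
}

for key, problems in invalid_labels(sample_labels).items():
    print(key, problems)
```

Running a check like this over every legacy node group name and label before generating Karpenter resources surfaces naming problems early, rather than at apply time.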
Automated Transition Tooling
Salesforce built a custom Karpenter transition tool to map existing ASG definitions to Karpenter EC2NodeClass specs, validate AMIs, and manage graceful pod eviction. A complementary patching check tool ensured node health before cut‑over, enabling repeatable, zero‑downtime migrations across all clusters.
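The source does not show the transition tool itself, so the following is only an illustrative sketch of what mapping an ASG definition to an EC2NodeClass could look like. The input dictionary shape (`name`, `ami_id`, `node_role`, `subnet_ids`, `security_group_ids`, `volumes`) and the function name are assumptions made for this example. The output uses the EC2NodeClass fields `role`, `amiSelectorTerms`, `subnetSelectorTerms`, `securityGroupSelectorTerms`, and `blockDeviceMappings`, and assumes the `karpenter.k8s.aws/v1` API version; older Karpenter releases use a different version string.

```python
import json


def asg_to_ec2nodeclass(asg: dict) -> dict:
    """Translate a simplified, hypothetical ASG summary into an EC2NodeClass manifest."""
    ami_id = asg["ami_id"]
    # Format check only. A real tool would also confirm that the AMI exists
    # and is usable in the target account and region.
    if not ami_id.startswith("ami-"):
        raise ValueError(f"unexpected AMI id: {ami_id!r}")

    return {
        "apiVersion": "karpenter.k8s.aws/v1",  # assumption: Karpenter v1 API
        "kind": "EC2NodeClass",
        "metadata": {"name": asg["name"]},
        "spec": {
            "role": asg["node_role"],
            "amiSelectorTerms": [{"id": ami_id}],
            "subnetSelectorTerms": [{"id": s} for s in asg["subnet_ids"]],
            "securityGroupSelectorTerms": [
                {"id": g} for g in asg["security_group_ids"]
            ],
            # Carry the ASG volume settings over unchanged so that storage
            # behaves the same before and after the cut-over.
            "blockDeviceMappings": [
                {
                    "deviceName": v["device_name"],
                    "ebs": {
                        "volumeSize": f"{v['size_gib']}Gi",
                        "volumeType": v["type"],
                        "encrypted": v["encrypted"],
                    },
                }
                for v in asg["volumes"]
            ],
        },
    }


# Hypothetical input; every identifier below is a placeholder.
example_asg = {
    "name": "general-purpose",
    "ami_id": "ami-0123456789abcdef0",
    "node_role": "ExampleNodeRole",
    "subnet_ids": ["subnet-aaa111", "subnet-bbb222"],
    "security_group_ids": ["sg-ccc333"],
    "volumes": [
        {"device_name": "/dev/xvda", "size_gib": 100, "type": "gp3", "encrypted": True}
    ],
}

print(json.dumps(asg_to_ec2nodeclass(example_asg), indent=2))
```

Emitting the manifest as plain data keeps the translation step easy to diff and review per cluster before anything is applied.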
Operational Outcomes
Post‑migration, scaling latency dropped from minutes to seconds, node utilization improved via advanced bin‑packing, and manual overhead shrank by 80%. The platform realized a 5% cost reduction in FY2026, with an additional 5‑10% projected for FY2027, while empowering developers with self‑service node pool definitions.
Key Best Practices
1. Adopt a phased rollout with sequential node cordoning.
2. Enforce Kubernetes label length limits in naming conventions.
3. Use OPA policies to validate Pod Disruption Budgets before node replacement.
4. Align storage settings between ASG volumes and Karpenter EC2NodeClass parameters.
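The Pod Disruption Budget check in item 3 guards against budgets that permit zero evictions, which would stall a node drain. The source describes doing this with OPA policies; the sketch below expresses the same idea in Python instead, purely for illustration. It assumes all replicas are currently healthy, and it rounds percentage values up, as Kubernetes does. The function names and the tuple-based input format are hypothetical.

```python
def _scaled(value, total: int) -> int:
    """Resolve an int or a percentage string such as '50%' against total, rounding up."""
    if isinstance(value, str) and value.endswith("%"):
        percent = int(value[:-1])
        return -(-percent * total // 100)  # integer ceiling division
    return int(value)


def allowed_disruptions(pdb_spec: dict, replicas: int) -> int:
    """Evictions the budget permits, assuming every replica is healthy."""
    if "maxUnavailable" in pdb_spec:
        allowed = _scaled(pdb_spec["maxUnavailable"], replicas)
    elif "minAvailable" in pdb_spec:
        allowed = replicas - _scaled(pdb_spec["minAvailable"], replicas)
    else:
        raise ValueError("PDB spec needs minAvailable or maxUnavailable")
    # Never report fewer than zero or more than the number of replicas.
    return max(0, min(allowed, replicas))


def blocking_pdbs(pdbs: list[tuple[str, dict, int]]) -> list[str]:
    """Names of PDBs that allow no evictions and would therefore block a drain."""
    return [name for name, spec, replicas in pdbs if allowed_disruptions(spec, replicas) == 0]


# Hypothetical workloads: (PDB name, PDB spec fragment, replica count).
pdbs = [
    ("checkout", {"minAvailable": 2}, 3),        # 1 eviction allowed
    ("ledger", {"maxUnavailable": 0}, 3),        # blocks draining
    ("search", {"minAvailable": "100%"}, 4),     # blocks draining
    ("frontend", {"maxUnavailable": "50%"}, 7),  # 4 evictions allowed
]

print(blocking_pdbs(pdbs))
```

Flagging such budgets before node replacement lets their owners relax them or add replicas, so a phased rollout is not held up mid-drain.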
Future Directions
Salesforce plans to extend heterogeneous instance support, incorporate GPU and ARM nodes within shared pools, and further refine IP efficiency by decoupling provisioning from fixed subnets. These steps continue the drive toward cost‑effective, high‑performance cloud‑native workloads.
For detailed guidance, refer to the Karpenter documentation and the Amazon EKS best‑practice guide.