Before We Start: A Reality Check
Kubernetes is probably the most over-adopted technology in the software industry right now. That is not a knock against it. It is excellent at what it does. But a significant percentage of the teams running K8s in production do not need it, and they are paying a complexity tax that exceeds the benefit they receive.
I have been running Kubernetes in production for two years across three different projects. Two on managed services (EKS and GKE), one self-managed with kubeadm on bare metal. I have seen it save months of manual infrastructure work. I have also seen it consume months of engineering time that could have been spent building product. What follows are the lessons I actually learned, not the ones Kubernetes marketing wants you to learn.
Lesson 1: The Declarative Model Actually Works
This is the core idea. You write YAML files describing what you want -- three replicas of this container, exposed on port 8080, with this much memory. Kubernetes reads that description and makes it happen. If a container crashes, K8s restarts it. If a node goes down, K8s reschedules the affected pods onto healthy nodes. If you change the desired state, K8s reconciles reality to match.
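As a concrete sketch, the description above translates into a Deployment manifest roughly like this (the name, image, and resource numbers are illustrative, not from any real project):

```yaml
# deployment.yaml -- minimal sketch of "three replicas, port 8080, this much memory"
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 3                  # desired state: three copies, always
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
        - name: web-app
          image: registry.example.com/web-app:1.0.0   # illustrative image reference
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: "250m"
              memory: "256Mi"
            limits:
              memory: "512Mi"
```

Apply it with `kubectl apply -f deployment.yaml` and the reconciliation loop takes over: kill a pod and a replacement appears, because three replicas is the declared state.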
In practice, this works exactly as advertised. During our production use, we had nodes fail, pods crash from out-of-memory conditions, and even a full availability zone outage on AWS. Each time, Kubernetes recovered automatically. Pods came back. Traffic rerouted. No manual intervention required. The self-healing is not a marketing claim. It is how the system works at a deep architectural level.
The catch is that "declarative" means you are writing a lot of YAML. A simple web application with a deployment, service, ingress, configmap, secret, and horizontal pod autoscaler is six YAML files before you add monitoring or TLS certificates. Multiply that across twenty microservices and you have a hundred configuration files to maintain. The YAML fatigue is real. Tools like Helm, Kustomize, and Timoni help manage it but add their own complexity layer.
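Kustomize, to take one example, tames some of the repetition by letting you keep a single base set of manifests and patch them per environment. A hypothetical production overlay might look like:

```yaml
# overlays/prod/kustomization.yaml -- hypothetical layout, not from a real repo
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: prod
resources:
  - ../../base                 # the shared deployment/service/ingress manifests
patches:
  - path: replica-count.yaml   # prod-only patch, e.g. bump replicas to 5
```

You trade twenty near-identical copies for one base plus small diffs, at the cost of learning another tool and another layer of indirection when something renders unexpectedly.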
Lesson 2: Managed Services Changed the Equation
Self-managing Kubernetes is a full-time job. Actually, it is multiple full-time jobs. Running the control plane (API server, etcd, scheduler, controller manager) with high availability requires deep understanding of distributed systems, careful backup procedures, security hardening, and version upgrade planning. We tried it on bare metal. It worked. It also consumed roughly 30% of one senior engineer's time just to keep things running smoothly.
Managed services (EKS, GKE, AKS) eliminate the control plane management entirely. You pay roughly $72/month per cluster and the cloud provider handles the control plane availability, security patches, and upgrades. This shifts the equation dramatically. Instead of needing a Kubernetes expert, you need someone who understands Kubernetes concepts and can write the YAML.
Between the three providers, GKE is the smoothest experience. Autopilot mode manages even the worker nodes for you, charging for pod resource requests instead of whole nodes. It is the closest thing to "serverless Kubernetes" available. EKS integrates most tightly with AWS IAM and VPC networking, which matters if the rest of your stack is on AWS. AKS offers a free control plane tier, making it the cheapest entry point.

In my experience, unless you have a strong reason to self-manage (regulatory requirements, air-gapped environments, extreme customization needs), use a managed service. The engineering time you save more than pays for the management fee.
Lesson 3: Autoscaling Is the Killer Feature Nobody Talks About Enough
Horizontal Pod Autoscaler (HPA) scales your application replicas based on CPU, memory, or custom metrics. Cluster Autoscaler adds or removes nodes based on pending pod demand. Together, they create a system that right-sizes itself to actual traffic.
We configured HPA for our API service with a target of 60% CPU utilization. During normal hours, it ran three replicas. During a promotional campaign that tripled traffic, it scaled to eleven replicas automatically. When traffic subsided, it scaled back down. No manual intervention. No 2 AM pages. The cost savings compared to running peak-capacity infrastructure 24/7 were significant -- roughly 40% lower compute costs.
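The setup described above maps to a short manifest. This is a sketch of an equivalent `autoscaling/v2` HorizontalPodAutoscaler (the service name and replica bounds are illustrative):

```yaml
# hpa.yaml -- scale on average CPU utilization, as described above
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-service          # the deployment being scaled (illustrative name)
  minReplicas: 3               # floor during quiet hours
  maxReplicas: 12              # ceiling for traffic spikes
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60   # add replicas when average CPU exceeds 60%
```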
Vertical Pod Autoscaler (VPA) is less mature but useful for right-sizing resource requests. It watches actual usage patterns and recommends (or automatically adjusts) CPU and memory allocations. This prevents the common problem of developers requesting 2 GB of memory for a service that uses 200 MB, which wastes cluster resources.
Lesson 4: Networking Will Make You Miserable at First
Kubernetes networking is the area where I spent the most debugging time. The model is elegant in theory: every pod gets its own IP, pods can talk to each other without NAT, Services provide stable endpoints. In practice, the number of things that can go wrong is staggering.
DNS resolution issues. Service mesh conflicts. Ingress controller misconfigurations. Network policies that silently block traffic you expected to flow. Load balancer health check failures. Each of these sent us down multi-hour debugging sessions in the first six months.
It gets better. Once you understand the networking model and have seen the common failure modes, debugging time drops dramatically. But the initial learning curve is steep enough that I would recommend any team new to K8s budget significant time for networking troubleshooting in their first few months.
A practical tip that saved us countless hours: learn kubectl debug early. This command attaches an ephemeral container to a running pod for troubleshooting. When your minimal distroless production image has no shell, no curl, and no dig, you can still inject a debug container with all the networking tools you need. Before we discovered this, our debugging workflow involved building special debug images and redeploying. With ephemeral containers, troubleshooting a network issue takes minutes instead of the deploy-test-redeploy cycle that used to eat entire afternoons.

Similarly, learn to read pod events and logs fluently. kubectl describe pod reveals scheduling decisions, resource limit violations, and probe failures that the pod's own logs never mention. The number of issues we resolved just by reading the events section carefully -- without looking at application logs at all -- was higher than I would have expected.
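A sketch of that workflow, assuming a hypothetical pod named `api-7d4b9` in a `prod` namespace and the popular netshoot tooling image:

```shell
# Attach an ephemeral debug container (with curl, dig, tcpdump, etc.)
# to a running distroless pod; pod name, namespace, and image are illustrative.
kubectl debug -it api-7d4b9 -n prod --image=nicolaka/netshoot --target=api

# Inside the debug container, test service DNS resolution directly:
#   dig api-service.prod.svc.cluster.local

# Separately, read scheduling decisions, probe failures, and OOM kills
# from the events section -- often enough to diagnose without app logs:
kubectl describe pod api-7d4b9 -n prod
```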
Network policies deserve a specific mention because they are disabled by default on most managed services, which gives teams a false sense that networking "just works." The moment you enable a network policy engine like Calico and start restricting traffic between namespaces, you discover all the implicit communication paths your services relied on. We recommend implementing network policies early in a project, not as an afterthought. Retrofitting them onto a running system with dozens of services and undocumented communication patterns is significantly harder than defining them from the start.
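For a sense of what "restricting traffic between namespaces" looks like, here is a minimal sketch: a policy that allows ingress to `api` pods only from a `frontend` namespace (all names and the port are illustrative). Everything not explicitly allowed is denied once a policy selects the pods.

```yaml
# networkpolicy.yaml -- allow only frontend -> api traffic on port 8080
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-allow-frontend
  namespace: prod
spec:
  podSelector:
    matchLabels:
      app: api               # the pods being protected
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: frontend   # auto-set namespace label
      ports:
        - protocol: TCP
          port: 8080
```

Apply one of these to a long-running system and watch which services suddenly cannot talk to each other; that is the inventory of implicit communication paths mentioned above.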
Lesson 5: The Ecosystem Is the Real Product
Kubernetes by itself is a container orchestrator. Kubernetes plus its ecosystem is a complete platform for running modern software. The CNCF landscape includes hundreds of projects, and many of them are genuinely excellent.
- Prometheus + Grafana for monitoring. The Prometheus Operator makes setup declarative. After the initial configuration, adding monitoring for a new service is a single ServiceMonitor YAML file.
- cert-manager for automatic TLS certificates. Set it up once with Let's Encrypt and never think about certificate renewal again. Saved us from at least two certificate expiration incidents.
- ArgoCD for GitOps deployments. Define your desired state in a Git repo. ArgoCD syncs the cluster to match. Every deployment is a Git commit. Every rollback is a revert. Clean, auditable, reliable.
- External DNS for automatic DNS record management. Create an Ingress, get a DNS record. Delete the Ingress, DNS record disappears. Simple and useful.
- Istio or Linkerd for service mesh. Mutual TLS between services, traffic splitting for canary deploys, observability without code changes. Istio is powerful but heavy. Linkerd is lighter and easier to operate.
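To make the GitOps point concrete, this is roughly what an ArgoCD Application looks like (the repository URL and paths are hypothetical):

```yaml
# application.yaml -- ArgoCD watches the repo and syncs the cluster to match it
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: web-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/deploy-configs.git   # illustrative repo
    targetRevision: main
    path: apps/web-app
  destination:
    server: https://kubernetes.default.svc   # the cluster ArgoCD runs in
    namespace: prod
  syncPolicy:
    automated:
      prune: true      # delete resources that were removed from Git
      selfHeal: true   # revert manual cluster drift back to the Git state
```

With `selfHeal` on, a hand-edited deployment quietly snaps back to what Git says, which is exactly the auditability being praised above: the repo is the only source of truth.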
The Custom Resource Definition (CRD) and operator pattern deserve special mention. CRDs let you extend the Kubernetes API with your own resource types. Operators encode operational knowledge into controllers that automate management of complex applications. The PostgreSQL operator from Zalando, for example, lets you deploy a production-grade PostgreSQL cluster with replication, automated failover, and backups by applying a single YAML file. That used to be a week of DBA work.
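That "single YAML file" is not an exaggeration. A sketch of a Zalando postgres-operator manifest, with illustrative names and sizes (check the operator's docs for the exact fields in your version):

```yaml
# postgres-cluster.yaml -- the operator turns this into a replicated HA cluster
apiVersion: acid.zalan.do/v1
kind: postgresql
metadata:
  name: acid-main-cluster     # by convention, prefixed with the teamId
spec:
  teamId: acid
  numberOfInstances: 3        # one primary, two replicas with automated failover
  volume:
    size: 50Gi
  postgresql:
    version: "15"
```

The operator's controller watches for resources of this custom type and does the week of DBA work: provisioning volumes, wiring up streaming replication, and promoting a replica when the primary dies.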
Lesson 6: Stateful Workloads Are Doable (With Caveats)
When I started with K8s, the conventional wisdom was "don't run databases on Kubernetes." That advice has not aged well. StatefulSets provide stable identities and persistent storage for each pod. CSI drivers integrate with every major storage backend. Operators automate database lifecycle management.

We ran PostgreSQL on GKE with the CloudNativePG operator for about eight months. It worked. Automated failover, point-in-time recovery, connection pooling built in. Performance was comparable to Cloud SQL for our workload. The cost was about 35% lower because we were using existing cluster resources rather than paying for a separate managed database instance.
The caveat: this only works if someone on your team understands both Kubernetes and database operations. When things go wrong with a database running on K8s, you need to debug at two layers simultaneously. If your team does not have that expertise, use a managed database service. The premium is worth the simplicity.
What K8s Actually Costs
Kubernetes itself is free and open-source. Everything else costs money.
Self-managed: free control plane + infrastructure costs. A three-node production cluster on AWS with m5.large instances runs about $300/month for compute alone. Add load balancers ($20/month each), persistent volumes ($0.10/GB/month), and data transfer ($0.09/GB outbound), and a modest setup lands around $500-800/month.
EKS and GKE: $0.10/hour (~$72/month) per cluster for the control plane, plus all the same infrastructure costs above. So add $72/month to whatever your self-managed costs would be, and subtract the salary of the engineer who was managing the control plane.
AKS: free control plane on the basic tier. Standard tier at $0.10/hour per cluster. This makes AKS the cheapest entry point for a managed K8s experience.
GKE Autopilot: no control plane fee, no node management. You pay per pod resource: $0.000017 per vCPU per second, $0.000002 per GB memory per second. For variable workloads, this can be significantly cheaper because you only pay for what your pods actually use.
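To make the per-second pricing tangible, here is a back-of-the-envelope calculation at the rates quoted above (actual GKE Autopilot pricing varies by region and changes over time, so treat the rates as placeholders):

```python
# Rough monthly cost of one always-on Autopilot pod at the rates quoted above.
# Rates and pod size are illustrative; check current GKE pricing before relying on this.
VCPU_PER_SEC = 0.000017   # $ per vCPU-second (rate quoted in the text)
MEM_PER_SEC = 0.000002    # $ per GB-second (rate quoted in the text)

SECONDS_PER_MONTH = 30 * 24 * 3600  # 2,592,000 seconds in a 30-day month


def monthly_cost(vcpu: float, mem_gb: float) -> float:
    """Cost of a pod running continuously for a 30-day month."""
    cpu_cost = vcpu * VCPU_PER_SEC * SECONDS_PER_MONTH
    mem_cost = mem_gb * MEM_PER_SEC * SECONDS_PER_MONTH
    return cpu_cost + mem_cost


# A pod requesting 0.5 vCPU and 1 GB of memory:
print(round(monthly_cost(0.5, 1.0), 2))  # -> 27.22 (dollars/month)
```

The same pod on a dedicated node would cost whatever the whole node costs, used or not, which is why the per-pod model wins for spiky or small workloads.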
Enterprise platforms (OpenShift, Rancher): add $50-200+ per node per month on top of infrastructure for enhanced security, multi-cluster management, and enterprise support.
Bottom line for most teams: a production-grade managed K8s setup with three to five nodes, monitoring, logging, and ingress runs $400-1,200/month on any major cloud. More nodes, more environments (dev/staging/prod), or high-traffic workloads push that into the thousands.
The Honest Conversation: Do You Need It?
When K8s pays for itself
- You run 10+ microservices and need automated deployment, scaling, and recovery
- Self-healing and autoscaling cut your infrastructure costs by 30-40%
- CRDs and operators automate operational tasks that used to require manual runbooks
- Cloud-agnostic portability means your infra knowledge transfers across providers
- The ecosystem (Prometheus, ArgoCD, cert-manager) creates a complete platform
- StatefulSets and CSI drivers make running databases on K8s genuinely viable
- Industry standard means abundant talent, documentation, and community support
When K8s costs you more than it saves
- Learning curve measured in months. Your team will be slower before they are faster.
- YAML configuration burden is real -- hundreds of files across dozens of services
- Networking debugging eats significant time in the first six months
- Overkill for a monolith or a handful of services that could run on a single VM
- Security surface area is large: RBAC, network policies, pod security, supply chain
- Without someone who owns the platform, configs drift and best practices erode
Here is the uncomfortable truth. If your application runs on fewer than five services, fits on two or three servers, and does not need multi-cloud portability, you probably do not need Kubernetes. AWS ECS with Fargate, Fly.io, Railway, or even a well-configured set of VMs behind a load balancer will serve you fine at a fraction of the complexity.
K8s makes sense when the complexity of managing your infrastructure manually exceeds the complexity of managing Kubernetes. For most organizations, that crossover happens somewhere around ten to fifteen services with real scaling requirements. Below that threshold, simpler alternatives offer better ROI.
My Two-Year Assessment
Our Verdict: 4.4 / 5
Kubernetes is the most capable infrastructure platform available. That is not hype. After two years of running it in production across different providers and workload types, I can say that the declarative model, self-healing architecture, autoscaling capabilities, and ecosystem depth solve real problems that every organization running containerized applications at scale faces. The managed services have lowered the operational bar enough that you no longer need a dedicated platform team just to keep the lights on.
But capability and appropriateness are different things. K8s demands a genuine investment in learning, tooling, and ongoing maintenance. The teams that get the most value are the ones that have real scale requirements, invest in platform engineering, and treat Kubernetes as infrastructure worth operating well rather than a checkbox on a tech stack. For those teams, 4.4 out of 5 may even undersell it. For teams that adopt it because it is trendy without having the scale or engineering capacity to operate it well, the rating would be considerably lower.
Be honest about what you actually need. If the answer is Kubernetes, invest properly. If the answer is something simpler, there is no shame in that. The best infrastructure is the one that matches your actual requirements, not the one with the most impressive architecture diagram.