Kubernetes Problem Solving

Fairwinds
Aug 6, 2020

When running Kubernetes, you will no doubt spend a good amount of time problem solving. Our team does a lot of problem solving across many different Kubernetes deployments. Here are a few common issues and how we solve them.

1. OOMKilled pod

The kubelet preserves node stability when available compute resources are low; it can proactively monitor for and prevent total starvation of a compute resource. In those cases, the kubelet can reclaim the starved resource by proactively failing one or more pods. The kubelet ranks pods for eviction first by whether or not their usage of the starved resource exceeds requests, then by priority.

To avoid this problem, you'll need to set memory requests and limits, and to get the values right: if your limits are too low, you'll get very familiar with OOMKilled, but if you set them too high, you're inherently wasting money by overallocating. Requests and limits often aren't set at all, so you'll run into instances where a pod gets killed simply because nothing was defined. Going through all your pods to check for memory requests and limits by hand takes a lot of time and is prone to human error, so we use Goldilocks, an open source tool that helps teams allocate resources to their Kubernetes deployments and get those resource calibrations just right.
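For reference, here is a minimal sketch of a pod with explicit memory requests and limits; the names and values are illustrative, not recommendations:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example-app            # hypothetical name
spec:
  containers:
    - name: app
      image: example/app:1.0.0 # hypothetical image
      resources:
        requests:
          memory: "256Mi"      # what the scheduler reserves for this container
          cpu: "250m"
        limits:
          memory: "512Mi"      # exceeding this gets the container OOMKilled
          cpu: "500m"
```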

2. Higher than expected costs

Cost reduction is one of the many benefits of Kubernetes, and you should have some idea of what your clusters ought to cost. If your bill at the end of the month shows skyrocketing costs, you likely have a problem.

To solve this problem, you must compare recommendations to actual costs. Many organizations set their CPU and memory requests and limits too high but don't know where to make changes. We use Goldilocks, as mentioned above, to review requests and limits. For cost analysis, though, we use Fairwinds Insights, which ingests Goldilocks data and helps us prioritize which workloads need their requests and limits tuned. We also use Cluster Autoscaler to ensure any extra nodes are removed when they are unused, which saves time and money.
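If you want to try Goldilocks, it works per namespace: it watches for an enablement label and, using the Vertical Pod Autoscaler in recommendation mode, publishes suggested requests and limits for the workloads inside. A minimal sketch, with a made-up namespace name:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: my-team                                # hypothetical namespace
  labels:
    goldilocks.fairwinds.com/enabled: "true"   # tells Goldilocks to generate recommendations here
```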

3. Knowing when to update Helm charts

Patching a Kubernetes add-on isn't typically that hard; keeping track of when to update is. While Kubernetes releases follow a regular quarterly schedule, Helm chart updates are incredibly hard to monitor and predict. Our Fairwinds team uses Nova, an open source command-line tool for cross-checking the Helm charts running in your cluster against the latest versions available. Nova will let you know if you're running a chart that's out-of-date or deprecated, so you can make sure you're always aware of updates. If you use Fairwinds Insights, it can ingest Nova data and provide alerts when new updates are available, rather than you having to run Nova periodically.
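Running Nova is a single command against your current cluster context; a quick sketch (exact flags and output columns may vary by Nova version):

```sh
# Cross-check the Helm releases in the current cluster
# against the newest chart versions upstream
nova find

# Wider output adds columns such as chart name and namespace
nova find --wide
```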

4. Handling an unexpected burst in traffic

Whether you have suddenly gone viral or are under a DDoS attack, Kubernetes offers autoscaling so your app won't fail. That is good if you've gone viral; it can be really bad if the traffic driving the scale-up is a DDoS attack.
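On the scaling side, this usually means a HorizontalPodAutoscaler, with Cluster Autoscaler adding nodes underneath it. A minimal sketch, assuming a hypothetical Deployment named web (names and thresholds are illustrative):

```yaml
apiVersion: autoscaling/v2    # use autoscaling/v2beta2 on older clusters
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa               # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                 # hypothetical Deployment to scale
  minReplicas: 2
  maxReplicas: 20             # caps how far a burst (or attack) can scale you
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # add replicas when average CPU passes 70%
```

Capping maxReplicas is worth thinking about: it bounds how much an attack can cost you, at the price of capping legitimate bursts too. Beyond scaling, a few defenses help keep a burst, legitimate or not, from overwhelming your services or your budget: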

  • Set up rate-limiting in your application, as well as in your service mesh or ingress controllers. This will prevent any single machine from using up too much bandwidth. For example, nginx-ingress can limit the requests per second or per minute, the payload size, and the number of concurrent connections from a single IP address (see the Ingress sketch after this list).
  • Run load testing to understand how well your application scales. You'll want to set up your application on a staging cluster and hit it with traffic. It's a bit harder to test against a distributed attack, where traffic could be coming from many different IPs at once, but there are a few services out there to help with this.
  • Enlist a third-party service like Cloudflare to do DoS/DDoS protection.
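As a concrete illustration of the first bullet, here is the Ingress sketch mentioned above. With the nginx-ingress controller, rate limits are set through annotations (the host, service name and values are made up):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web                                             # hypothetical name
  annotations:
    nginx.ingress.kubernetes.io/limit-rps: "10"         # requests per second per client IP
    nginx.ingress.kubernetes.io/limit-rpm: "300"        # requests per minute per client IP
    nginx.ingress.kubernetes.io/limit-connections: "20" # concurrent connections per client IP
    nginx.ingress.kubernetes.io/proxy-body-size: "1m"   # cap request payload size
spec:
  ingressClassName: nginx
  rules:
    - host: app.example.com                             # hypothetical host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web                               # hypothetical backend Service
                port:
                  number: 80
```

The limit-* annotations apply per client IP, which works well against a single noisy source; a genuinely distributed attack is better absorbed at the edge by a service like Cloudflare, as noted above.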

Referring back to point one, you'll also need to ensure your resource requests and limits are set correctly so that your regular traffic can still reach your services.

5. Missing tags

Docker allows you to overwrite a tag when you push an image; a common example is the latest tag. This is a dangerous practice because you can end up not knowing exactly what code is running in your containers. We use Polaris, another open source tool, which supports a number of checks related to the image specified by pods. If a tag is missing or mutable, Polaris will flag the workload with either images.tagNotSpecified or images.pullPolicyNotAlways.
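The remediation Polaris is nudging you toward is an explicit, immutable tag and an explicit pull policy. A minimal sketch, with made-up names:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pinned-example           # hypothetical name
spec:
  containers:
    - name: app
      image: example/app:1.4.2   # explicit version tag instead of the mutable :latest
      imagePullPolicy: Always    # satisfies Polaris's pullPolicyNotAlways check
```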

For those of you who don't have the resources to work with all of these open source projects, we've combined them all into Fairwinds Insights, a configuration validation tool that will scan your clusters and check for configurations that may be costing you money, leaving you vulnerable or causing downtime. You can check it out and try it by spinning up a cluster on GKE, AKS or EKS to see how it can help with Kubernetes problem solving.


Fairwinds — The Kubernetes Enablement Company | Editor of uptime 99