CKA Study notes - Troubleshooting

Continuing with my Certified Kubernetes Administrator exam preparations, I'm now going to take a look at the Troubleshooting objective. I've split this into three posts: Application Monitoring, Logging, and Troubleshooting (this post).

The Troubleshooting objective counts for 30%, which by weight makes it the most important objective in the exam, so be sure to spend some time studying it. The Kubernetes Documentation is as always the place to go, as it is available during the exam. Troubleshooting is often very situation specific, and oftentimes we need to combine multiple troubleshooting techniques. These posts will be fairly generic and, again, based on my studying for the CKA exam.

In this post we'll take a look at a few troubleshooting steps in Kubernetes. As mentioned, I'll use the CKA exam objectives as the starting point, and currently there are three specific sub-objectives mentioning troubleshooting:

  • Troubleshoot application failure
  • Troubleshoot cluster component failure
  • Troubleshoot networking

Note #1: I'm using documentation for version 1.19 in my references below as this is the version used in the current (January 2021) CKA exam. Please check the version applicable to your use case and/or environment.

Note #2: This post covers my study notes from preparing for the CKA exam and reflects my understanding of the topic and what I have focused on during my preparations.

Troubleshoot application failure

Kubernetes Documentation reference

If our application has an error, I'd first check whether the Pod is running and, in the case of multiple replicas, whether the issue affects all of them or just a few. One important thing here is to try to find out if the failure is caused by something inside the application, by an error running the Pod, or perhaps by the Service it is part of.

To check if a Pod is running we can start with the kubectl get pods command, which outputs the status of the Pod.

We can continue with a kubectl describe pod <pod-name> to investigate the Pod further. The status messages and events shown here often point to the cause.

We also have the kubectl logs <pod-name> command, which outputs the stdout and stderr from the containers in the Pod.
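
To illustrate, a minimal triage sequence could look like the following (the Pod name webapp-12345 is just a placeholder):

```bash
# List Pods and their status in the current namespace
kubectl get pods

# Inspect a specific Pod; the Events section at the bottom often explains failures
kubectl describe pod webapp-12345

# Show stdout/stderr from the Pod's container(s)
kubectl logs webapp-12345
```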

Note that application logging is very much up to the application and the developers of that application. Kubernetes cannot do any magic stuff inside an application.

Pod failure

Kubernetes Documentation reference

A Pod failure can show itself in a few ways. We'll quickly take a look at some of them; most of them give clues and details through the kubectl describe pod <pod-name> command:

  • Pod stays in Pending state
    • The pod doesn't get scheduled on a Node. Often because of insufficient resources
    • Fix by freeing up resources or adding more nodes to the cluster
    • Might also be because the Pod requests more resources than any node has available
  • Pod stays in Waiting state
    • Pod has been scheduled, but cannot start. Most commonly because of an issue with the image not being pulled
    • Check that the image name is correct, that the image is available in the registry, and that it can be downloaded
  • Pod is in ImagePullBackOff
    • As with the previous state, there's something wrong with the image
    • Check that the image name is correct, that the image exists in the registry, and that it can be downloaded
  • Pod is in CrashLoopBackOff
    • This status appears when there's an error in a container that causes the Pod to restart. If the error doesn't get fixed by a restart it'll go into a loop (depending on the restartPolicy of the Pod)
    • Describe the Pod to see if any clues can be found on why the Pod crashes
  • Pod is in Error
    • This can be anything. Describe the Pod and check the logs
    • If nothing can be found in the logs, consider whether the Pod can be recreated
    • Export the existing spec with kubectl get pod <pod-name> -o yaml > <file-name> and recreate the Pod from that file (see the sketch after this list). Note that the --export flag has been removed in recent kubectl versions
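
A minimal, hedged sketch of recreating a failing Pod from its saved spec (the Pod name web is just a placeholder):

```bash
# Save the current Pod spec to a file
kubectl get pod web -o yaml > web.yaml

# Delete the failing Pod and recreate it from the saved spec
# (if apply complains, strip runtime fields such as status, resourceVersion and uid from the file first)
kubectl delete pod web
kubectl apply -f web.yaml
```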

If a Pod is running we're off to check the logs with the kubectl logs <pod-name> command. If there are multiple containers you can specify which container to get logs from.

Note that if the Pod has crashed and restarted, Kubernetes will keep the logs from the previous container instance, which can be accessed by adding the --previous parameter.
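
For example (the Pod and container names are placeholders):

```bash
# Logs from a specific container in a multi-container Pod
kubectl logs webapp-12345 -c sidecar

# Logs from the previous, crashed instance of the container
kubectl logs webapp-12345 --previous
```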

If the container has a shell, or we know the location of a specific log file, or maybe have a specific debugging command that can be run, we can do so with the kubectl exec <pod-name> -- <command> command. If the Pod has multiple containers we specify the container we want by adding the -c <container-name> parameter.

To get a shell in a container we can run kubectl exec -it <pod-name> -- sh.

To output a specific log file we can run kubectl exec <pod-name> -- cat <path-to-log>.
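
Putting that together (the names and paths are placeholders, and the container image must actually contain a shell and the log file for this to work):

```bash
# Open an interactive shell inside the container
kubectl exec -it webapp-12345 -- sh

# Print a specific log file from inside the container
kubectl exec webapp-12345 -- cat /var/log/app.log
```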

Troubleshoot network failure

Service failure

Kubernetes Documentation reference

If a Service is not working as expected, the first step is to run the kubectl get service command to check the Service status. We continue with the kubectl describe service <service-name> command to get more details.
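
For example (the Service name my-service is a placeholder):

```bash
# List Services with their cluster IPs and ports
kubectl get service

# Show details for one Service, including its selector and endpoints
kubectl describe service my-service
```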

Based on the type of failure there are a few steps that can be tried (a sketch of several of these checks follows after the list):

  • If the service should be reachable by DNS name
    • Check if you can do an nslookup from a Pod in the same namespace. If this works, test from a different namespace
    • Check if other services are available through DNS. If not, there might be a cluster-level error
      • Test an nslookup to kubernetes.default. If this fails there's an error with the DNS service and not your service or application
  • If DNS lookup is working, check if the service is accessible by IP
    • Try a curl or wget from a Pod to the service IP
    • If this fails there's something wrong with your service or network
  • Check if the service is configured correctly; review the YAML and double-check things like the port, targetPort, and selector
  • Check if the service has any endpoints, kubectl get endpoints <service-name>
    • If endpoints are created, check that they point to the correct Pods, e.g. by comparing with kubectl get pods -o wide
    • If no endpoints are created, we might have a typo in the service selector so that the service doesn't find any matching pods
  • Check if the Pods are working, refer to the above sections
  • Finally, it might be worth checking kube-proxy
    • Check if the proxy is running, for instance with ps aux | grep kube-proxy
    • Check the system logs, e.g. with journalctl
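
A hedged sketch of several of these checks, run from a temporary test Pod (the busybox image, the Service name my-service, and the namespace are assumptions; substitute your own values):

```bash
# Start a throwaway Pod for DNS and connectivity tests
kubectl run tmp-shell --rm -it --image=busybox --restart=Never -- sh

# Inside the Pod: test DNS resolution of the Service, then the cluster DNS itself
nslookup my-service
nslookup my-service.my-namespace
nslookup kubernetes.default

# Inside the Pod: test connectivity to the Service's cluster IP and port
wget -qO- http://<service-cluster-ip>:<port>

# Back outside the Pod: check the Service endpoints, and on a node check that kube-proxy is running
kubectl get endpoints my-service
ps aux | grep kube-proxy
```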

CNI plugin

Early on in a cluster's life we might also have issues with the installed CNI plugin. Be sure to check the status of the installed plugin(s) and verify that they are working.
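
How to verify this depends on the plugin, but as a rough sketch (the exact Pod names depend on whether you run Calico, Flannel, Weave Net or something else):

```bash
# Most CNI plugins run as Pods/DaemonSets in kube-system; they should all be Running
kubectl get pods -n kube-system -o wide

# On a node: check that a CNI configuration has been written
ls /etc/cni/net.d/
```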

Troubleshoot cluster failure

Kubernetes Documentation reference

The first thing to check if we suspect a cluster issue is the kubectl get nodes command. All nodes should be in the Ready state.

Node failure

If a Node is in an incorrect state, run a kubectl describe node <node-name> command to learn more.

Check kubectl get events for errors.

We also have the kubectl cluster-info dump command which gives lots of details about the cluster, as well as the overall health.
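
For example (the node name is a placeholder, and dumping to a directory keeps the output manageable):

```bash
# Overall node state; all nodes should be Ready
kubectl get nodes

# Details and recent events for a misbehaving node
kubectl describe node worker-1
kubectl get events --all-namespaces

# Full cluster state dump, written to a directory instead of stdout
kubectl cluster-info dump --output-directory=/tmp/cluster-dump
```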

Basic troubleshooting, like ping and nslookup, can also be worth doing early on to rule out any obvious causes.

Also remember that one of the mindsets in a container world is to redeploy instead of fixing, so it might be worth just redeploying a node instead of trying to fix it

Services

Check the system services on the nodes. Both the container runtime and the kubelet need to be running.

```bash
systemctl status kubelet
```
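
If the kubelet isn't running, its journal usually shows why. A couple of hedged follow-up checks (containerd is an assumption; your runtime might be Docker or CRI-O instead):

```bash
# Recent kubelet logs from the systemd journal
journalctl -u kubelet --no-pager | tail -n 50

# Status of the container runtime (adjust the unit name to match your runtime)
systemctl status containerd
```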

Logs

The following logs can be investigated:

Control plane nodes

  • /var/log/kube-apiserver.log -- API server
  • /var/log/kube-scheduler.log -- Scheduler
  • /var/log/kube-controller-manager.log -- Controller that manages replication controllers

Worker nodes

  • /var/log/kubelet.log -- The kubelet is the service running the containers on a node
  • /var/log/kube-proxy.log -- Kube-proxy is responsible for service load balancing
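
Note that these file paths assume the components log directly to files. In a kubeadm-based cluster the control plane components typically run as static Pods and kube-proxy as a DaemonSet, so their logs are more easily reached through kubectl, while the kubelet logs go to the systemd journal. A hedged sketch (Pod names vary between clusters):

```bash
# Control plane component logs in a kubeadm-style cluster
kubectl logs -n kube-system kube-apiserver-<control-plane-node-name>
kubectl logs -n kube-system kube-scheduler-<control-plane-node-name>
kubectl logs -n kube-system kube-controller-manager-<control-plane-node-name>

# Kubelet logs on any node
journalctl -u kubelet
```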