CKA Study notes - Troubleshooting
Continuing with my Certified Kubernetes Administrator exam preparations, I'm now going to take a look at the Troubleshooting objective. I've split this into three posts: Application Monitoring, Logging, and Troubleshooting (this post).
The Troubleshooting objective counts for 30% of the exam, which based on weight makes it the most important objective, so be sure to spend some time studying it. The Kubernetes documentation is, as always, the place to go, as it is available during the exam. Troubleshooting is often very situation specific, and oftentimes we need to combine multiple troubleshooting techniques. These posts will be fairly generic and, again, based on my studying for the CKA exam.
In this post we'll take a look at a few troubleshooting steps in Kubernetes. As mentioned, I'll use the CKA exam objectives as the starting point, and currently there are three specific sub-objectives mentioning troubleshooting:
- Troubleshoot application failure
- Troubleshoot cluster component failure
- Troubleshoot networking
Note #1: I'm using documentation for version 1.19 in my references below as this is the version used in the current (January 2021) CKA exam. Please check the version applicable to your use case and/or environment.
Note #2: This is a post covering my study notes while preparing for the CKA exam, and it reflects my understanding of the topic and what I have focused on during my preparations.
Troubleshoot application failure
Kubernetes Documentation reference
If our application has an error, I'd first check whether the Pod is running and, in the case of multiple replicas, whether the issue affects all of them or just a few. One important thing here is to find out if the failure is caused by something inside the application, by an error running the Pod, or by the Service it is a part of.
To check if a Pod is running we can start with the `kubectl get pods` command, which outputs the status of the Pod.
We can continue with `kubectl describe pod <pod-name>` to investigate the Pod further; status messages appear here.
We also have `kubectl logs <pod-name>`, which outputs the stdout and stderr from the containers in the Pod.
Note that application logging is very much up to the application and the developers of that application. Kubernetes cannot do any magic stuff inside an application.
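As a very rough sketch of that first triage loop (the Pod name here is just a placeholder):

```bash
# List Pods and their status; add -n <namespace> or -A if the app isn't in the default namespace
kubectl get pods -o wide

# Dig into a specific Pod; the Events section at the bottom is often the most useful part
kubectl describe pod my-app-5d9c7b6f4-abcde

# Check what the containers write to stdout/stderr
kubectl logs my-app-5d9c7b6f4-abcde
```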
Pod failure
Kubernetes Documentation reference
A Pod failure can show itself in a few ways. We'll quickly take a look at some of them; most of them give clues and details through the `kubectl describe pod <pod-name>` command:
- Pod stays in Pending state
  - The Pod doesn't get scheduled on a Node, often because of insufficient resources
  - Fix by freeing up resources, or add more nodes to the cluster
  - Might also be because the Pod requests more resources than are available
- Pod stays in Waiting state
  - The Pod has been scheduled, but cannot start. Most commonly this is because of an issue with the image not being pulled
  - Check that the image name is correct, that the image is available in the registry, and that it can be downloaded
- Pod is in ImagePullBackOff
  - As with the previous state, there's something wrong with the image
  - Check that the image name is correct, that the image exists in the registry, and that it can be downloaded
- Pod is in CrashLoopBackOff
  - This status appears when an error in a container causes the Pod to restart. If the error doesn't get fixed by a restart it'll go in a loop (depending on the RestartPolicy of the Pod)
  - Describe the Pod to see if any clues can be found on why it crashes
- Pod is in Error
  - This can be anything. Describe the Pod and check the logs
  - If nothing can be found in the logs, can the Pod be recreated? Export the existing spec with `kubectl get pod <pod-name> -o yaml > <file-name>` (the old `--export` flag was removed in kubectl 1.18), as sketched below
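A minimal sketch of that export-and-recreate flow, assuming a standalone Pod named `my-app` (for Pods owned by a Deployment or ReplicaSet you'd normally just delete the Pod and let the controller recreate it):

```bash
# Save the current spec to a file (the old --export flag is gone in kubectl 1.18+)
kubectl get pod my-app -o yaml > my-app.yaml

# Remove the failing Pod
kubectl delete pod my-app

# Recreate it from the saved spec (clean out status and runtime metadata in the file if needed)
kubectl apply -f my-app.yaml
```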
If a Pod is running we're off to check the logs with the `kubectl logs <pod-name>` command. If there are multiple containers you can specify which container to get logs from.
Note that if the Pod has crashed and restarted, Kubernetes will keep the logs from the previous container instance, which can be accessed by adding the `--previous` parameter.
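For example, with a hypothetical Pod `my-app` that has a container named `web`:

```bash
# Logs from a specific container in a multi-container Pod
kubectl logs my-app -c web

# Logs from the previous container instance after a crash and restart
kubectl logs my-app --previous
```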
If the container has a shell, or we know the location of a specific log file, or maybe have a specific debugging command that can be run, we can do so with the `kubectl exec <pod-name> -- <command>` command. If the Pod has multiple containers we specify the container we want by adding the `-c <container-name>` parameter.
To get a shell in a container we can run `kubectl exec -it <pod-name> -- sh`.
To output a specific log file we can run `kubectl exec <pod-name> -- cat <path-to-log>`.
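A quick sketch, again with a hypothetical Pod `my-app`, container `web`, and a made-up log path:

```bash
# Run a one-off command in the Pod's (single) container
kubectl exec my-app -- ls /var/log

# Target a specific container in a multi-container Pod
kubectl exec my-app -c web -- cat /var/log/app/error.log

# Interactive shell, if the image ships one (sh here; some images have bash, some none at all)
kubectl exec -it my-app -- sh
```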
Troubleshoot network failure
Service failure
Kubernetes Documentation reference
If a service is not working as expected, the first step is to run the `kubectl get service` command to check your service status. We continue with the `kubectl describe service <service-name>` command to get more details.
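For example, with a hypothetical service called `my-service`:

```bash
# List services with their cluster IPs and ports
kubectl get service

# Selector, endpoints and events for a specific service
kubectl describe service my-service
```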
Based on the type of failure there are a few steps that can be tried (a command sketch follows the list):
- If the service should be reachable by DNS name
  - Check if you can do an nslookup from a Pod in the same namespace. If this works, test from a different namespace
  - Check if other services are available through DNS. If not, there might be a cluster-level error
  - Test an nslookup of `kubernetes.default`. If this fails, there's an error with the DNS service and not your service or application
- If DNS lookup is working, check if the service is accessible by IP
  - Try a curl or wget from a Pod to the service IP
  - If this fails, there's something wrong with your service or network
- Check if the service is configured correctly; review the yaml and double check
- Check if the service has any endpoints, `kubectl get endpoints <service-name>`
  - If endpoints are created, check if they point to the correct Pods, `kubectl get pods -o wide`
  - If no endpoints are created, we might have a typo in the service selector so that the service doesn't find any matching Pods
- Check if the Pods are working, refer to the sections above
- Finally it might be worth checking `kube-proxy`
  - Check if the proxy is running, `ps aux | grep kube-proxy`
  - Check the system logs, i.e. `journalctl`
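A rough command sketch of the steps above, assuming a service named `my-service` in the default namespace and a throwaway busybox Pod for testing (names and ports are placeholders):

```bash
# Start a temporary Pod with a shell for DNS and connectivity tests
kubectl run tmp-shell --rm -it --image=busybox -- sh

# From inside the test Pod:
nslookup my-service                   # same namespace
nslookup my-service.default          # qualified with the namespace
nslookup kubernetes.default          # cluster DNS sanity check
wget -qO- http://<service-ip>:<port> # test the service by IP

# Back outside the Pod: check endpoints and the Pods behind them
kubectl get endpoints my-service
kubectl get pods -o wide

# On a node: check kube-proxy
ps aux | grep kube-proxy
journalctl -u kube-proxy             # if kube-proxy runs under systemd; on kubeadm clusters it's a Pod in kube-system
```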
CNI plugin
Early on in a cluster's lifetime we might also have issues with the installed CNI plugin. Be sure to check the status of the installed plugin(s) and verify that they are working.
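One way to sanity check this (the CNI Pod names depend on which plugin is installed, e.g. Calico, Weave or Flannel):

```bash
# Most CNI plugins run as DaemonSet Pods in kube-system; they should all be Running
kubectl get pods -n kube-system -o wide

# If a CNI Pod is failing, describe it and check its logs
kubectl describe pod -n kube-system <cni-pod-name>
kubectl logs -n kube-system <cni-pod-name>
```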
Troubleshoot cluster failure
Kubernetes Documentation reference
The first thing to check if we suspect a cluster issue is the `kubectl get nodes` command. All nodes should be in the `Ready` state.
Node failure
If a Node has an incorrect state, run the `kubectl describe node <node-name>` command to learn more.
Check `kubectl get events` for errors.
We also have the `kubectl cluster-info dump` command, which gives lots of details about the cluster as well as its overall health.
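A sketch of those node-level checks, with a made-up node name:

```bash
# All nodes should report Ready
kubectl get nodes

# Node conditions (MemoryPressure, DiskPressure, PIDPressure, Ready) and events
kubectl describe node worker-01

# Recent cluster events, sorted by time
kubectl get events --sort-by=.metadata.creationTimestamp

# Full dump of cluster state; a lot of output, so consider writing it to a directory
kubectl cluster-info dump --output-directory=/tmp/cluster-dump
```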
Basic troubleshooting, like ping and nslookup, can also be worth doing early on to rule out any obvious causes.
Also remember that one of the mindsets in a container world is to redeploy instead of fixing, so it might be worth just redeploying a node instead of trying to fix it.
Services
Check the system services on the nodes. Both the container runtime and the kubelet need to be running.
```bash
systemctl status kubelet
```
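Along the same lines it can be worth checking the kubelet logs and the container runtime service; the runtime service name depends on the setup (docker, containerd, cri-o, ...):

```bash
# Recent kubelet logs when the kubelet is managed by systemd
journalctl -u kubelet --no-pager | tail -n 50

# The container runtime also has to be up (service name varies with the runtime)
systemctl status containerd
```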
Logs
The following logs can be investigated
Control plane nodes
- /var/log/kube-apiserver.log -- API server
- /var/log/kube-scheduler.log -- Scheduler
- /var/log/kube-controller-manager.log -- Controller that manages replication controllers
Worker nodes
- /var/log/kubelet.log -- The kubelet is the service running the containers on a node
- /var/log/kube-proxy.log -- Kube-proxy is responsible for service load balancing
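Note that on kubeadm-based clusters the control plane components typically run as static Pods instead of writing to these log files, so their logs are available through kubectl (the node name suffix below is a placeholder):

```bash
# Control plane component logs on a kubeadm cluster (static Pods in kube-system)
kubectl logs -n kube-system kube-apiserver-<control-plane-node-name>
kubectl logs -n kube-system kube-scheduler-<control-plane-node-name>
kubectl logs -n kube-system kube-controller-manager-<control-plane-node-name>
```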