CKA Study notes - Pod Scheduling

Overview

This post is part of my preparations for the Certified Kubernetes Administrator exam. I have also done a post on Daemon Sets, Replica Sets and Deployments, which are the objects we normally use for deploying workloads.

In this post we'll take a look at Pod scheduling and how we can work with the Kubernetes scheduler to decide how and where Pods are scheduled. This can also be incorporated in Daemon Sets, Replica Sets and Deployments, which is how you normally deploy workloads, but I'll keep it simple and just schedule Pods in these examples.

Note #1: I'm using documentation for version 1.19 in my references below as this is the version used in the current (Dec 2021) CKA exam. Please check the version applicable to your use case and/or environment

Note #2: This is a post covering my study notes preparing for the CKA exam and reflects my understanding of the topic, and what I have focused on during my preparations.

Kubernetes Scheduler

Kubernetes Documentation reference

First let's take a quick look at how Kubernetes schedules workloads.

The scheduler does not work with individual containers directly, the smallest entity Kubernetes will manage is a Pod. A Pod contains one or more containers.

To decide which Node a Pod will run on, Kubernetes works through several decisions which can be divided into two steps:

  1. Filtering
  2. Scoring

Filtering

The first operation is to filter which Nodes are candidates to run the Pod. These are called feasible nodes. In this step Kubernetes filters out nodes that don't have enough resources to meet the Pod's requirements. Resources include CPU and memory, but also storage mounts and networking.
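To make that a bit more concrete, the resource part of the filtering is driven by the resource requests in the podSpec. A minimal sketch could look like this (the name and values are just examples, not taken from the exercises below):

apiVersion: v1
kind: Pod
metadata:
  name: resource-demo
spec:
  containers:
    - name: app
      image: nginx
      resources:
        requests:
          cpu: "500m"     # only nodes with at least 0.5 CPU free are feasible
          memory: "256Mi" # only nodes with at least 256Mi allocatable memory are feasible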

In this step the scheduler also checks whether the Pod is requesting to run on a specific set of nodes. This can be controlled with Taints and Tolerations, Affinity, nodeSelectors etc.

Scoring

When the kube-scheduler has filtered out the list of feasible nodes it sorts that list based on scoring. The scheduler ranks the nodes to choose the most suitable one.

Custom schedulers

Kubernetes is extensible in its nature and allows for custom schedulers, so you can write your own scheduler to fit your exact needs if the default one falls short.
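If you do run a custom scheduler alongside the default one, the podSpec has a schedulerName field to pick which scheduler should handle the pod. A sketch, where my-custom-scheduler is just a placeholder name:

apiVersion: v1
kind: Pod
metadata:
  name: custom-scheduled-pod
spec:
  schedulerName: my-custom-scheduler # placeholder, must match a scheduler actually running in the cluster
  containers:
    - name: nginx
      image: nginx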

Taints and Tolerations

Kubernetes Documentation reference

Taints are a way for a Node to repel Pods from running on it, meaning you can add a taint to a node which instructs the scheduler not to run a specific (or any) pod on that node.

By default a Kubernetes cluster already has taints set on some nodes. The master/control plane nodes have a NoSchedule taint by default to prevent normal pods from being scheduled on them.

kubectl describe node <CONTROL-PLANE-NODE>
Check configuration of a control plane node

This control plane node has the taint master:NoSchedule. Because of this the scheduler will not schedule a Pod on it as long as the Pod doesn't have a matching Toleration

A Toleration is assigned to a Pod and allows the pod to be scheduled on a node with a corresponding taint. An example of this can be found in Daemon set pods, which oftentimes run on all nodes.

In my post on Daemon sets we used the calico networking component, whose pods run on all nodes. First let's identify a pod running on a control plane node

kubectl -n kube-system get pod -o wide
List of kube-system pods

Let's check the Tolerations on that pod

kubectl -n kube-system describe pod <POD-NAME>
Tolerations on a daemon set pod

This toleration matches the master:NoSchedule taint on the control plane nodes and thereby lets the kube-scheduler schedule the pod on that node.
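On kubeadm-based clusters the full taint key is typically node-role.kubernetes.io/master, so a toleration along these lines in the podSpec would match it. This is just a sketch, check the exact taint on your own control plane nodes:

tolerations:
  - key: "node-role.kubernetes.io/master"
    operator: "Exists"
    effect: "NoSchedule"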

Let's do a couple of our own examples

First we'll taint our worker nodes. We'll do the classic Prod/Dev example

kubectl taint node <PROD-NODE-1> env=dev:NoSchedule
kubectl taint node <PROD-NODE-2> env=dev:NoSchedule
Tainted nodes

With these taints the kube-a-04 and -05 nodes should not run dev pods, only production pods. Production pods can run on all three workers.
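To double-check that the taints are in place we can look for the Taints field in the node description, for example like this:

kubectl describe node <PROD-NODE-1> | grep -i taints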

Let's create a pod with a dev tag and see where the kube-scheduler puts it

apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    env: dev
spec:
  containers:
    - name: nginx
      image: nginx
      imagePullPolicy: IfNotPresent

Note the env: dev label. We'll create this pod with the kubectl create -f filename command, and then we'll retrieve our pods to see which node it got scheduled on.

kubectl create -f <file-name-1>

kubectl get pods -o wide
First dev pod created

As expected it was scheduled on node -06, which was not tainted earlier. Let's create a second pod with the same spec and label, just with a different name, to see where that gets scheduled

kubectl create -f <file-name-2>

kubectl get pods -o wide
Second dev pod created

Again the pod was scheduled on node -06

Now let's create a pod labeled with env: prod and see where that gets scheduled

Not working example

apiVersion: v1
kind: Pod
metadata:
  name: nginx-prod-1
  labels:
    env: prod
spec:
  containers:
    - name: nginx-prod-1
      image: nginx
      imagePullPolicy: IfNotPresent

kubectl create -f <file-name-3>

kubectl get pods -o wide
First prod pod scheduled

But wait! Why wasn't that scheduled on one of the prod tainted nodes?

Well, as I mentioned previously, a pod needs to have a Toleration matching the taint to be scheduled on a tainted node. Let's change the config for our prod pod and add that toleration

Working example

apiVersion: v1
kind: Pod
metadata:
  name: nginx-prod-1
  labels:
    env: prod
spec:
  containers:
    - name: nginx-prod-1
      image: nginx
      imagePullPolicy: IfNotPresent
  tolerations:
    - key: "env"
      operator: "Equal"
      value: "dev"
      effect: "NoSchedule"

And let's delete and recreate the prod pod and see if it will be scheduled on a "prod node" now

kubectl delete pod <prod-pod-name>

kubectl create -f <file-name-3>

kubectl get pods -o wide
Prod pod scheduled on correct node

Now we finally got the prod pod scheduled to one of the prod nodes.

It actually took me a little while to get a grip on this, which is why I wanted to document why it didn't work as expected the first time.

There are obviously some use cases for using Taints and Tolerations, with the control plane/master nodes as an example. More can be found here. To me it's a bit confusing, and I see it as a way to ensure that only a specific set of Pods gets scheduled on a set of nodes.

Taints are also used to keep pods away from nodes with certain conditions, e.g. memory pressure. Again this can be overridden with Tolerations. reference
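For illustration, a pod that should still be schedulable on a node under memory pressure would need a toleration in its podSpec similar to this sketch (normally you'd only do this for system-level pods):

tolerations:
  - key: "node.kubernetes.io/memory-pressure"
    operator: "Exists"
    effect: "NoSchedule"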

Another way of controlling scheduling can be to use node affinity which we'll take a look at next, but first we'll untaint our nodes

kubectl taint node <NODE> env-
Untaint nodes

Node affinity

So, let's see how we can explicitly say which nodes we want to schedule the pods on. To do this we'll use either the nodeSelector spec or an Affinity rule

As opposed to Taints and Tolerations, where we instruct the scheduler which nodes should NOT have pods assigned unless the pods have a specific Toleration, here we specify which Nodes a Pod MUST or SHOULD run on.

First let's see how to use the nodeSelector

NodeSelector

nodeSelector is a field in the podSpec. It works by tagging a node with a label, which is then referenced in the pod's nodeSelector. Again let's use our prod/dev example from before and label our nodes with prod and dev labels

kubectl label nodes <NODE-NAME> <KEY=VALUE>

kubectl get nodes --show-labels
Node labels

Now let's add the nodeSelector to the Pod configuration

apiVersion: v1
kind: Pod
metadata:
  name: nginx-dev-1
spec:
  containers:
    - name: nginx-dev-1
      image: nginx
      imagePullPolicy: IfNotPresent
  nodeSelector:
    env: dev

And let's create the Pod

kubectl create -f <file-name-1>

kubectl get pod -o wide
Dev pod with node selector

As expected the pod is scheduled on my dev node.

Let's create a couple of prod pods as well

apiVersion: v1
kind: Pod
metadata:
  name: nginx-prod-1
spec:
  containers:
    - name: nginx-prod-1
      image: nginx
      imagePullPolicy: IfNotPresent
  nodeSelector:
    env: prod

kubectl create -f prod-pod.yaml
kubectl create -f prod-pod2.yaml

kubectl get pod -o wide
Created prod pods

Again we can verify that the prod pods have been scheduled on a prod node while the dev pod has been scheduled on a dev node.

The nodeSelector approach is easier, and to me more understandable, however it might not fit more complex use cases. For those you might use Node Affinity rules instead.

It's also not as strict as the Taints and Tolerations approach, so you could end up scheduling a pod on the wrong node if the labels are not specified correctly.

Affinity/anti-affinity

Kubernetes Documentation reference

Node Affinity is similar to nodeSelector and also uses node labels. However, the possibilities for selecting the correct node(s) are more expressive.

Affinity (and anti-affinity) rules can use more matching rules, and we can also incorporate "soft" rules where we specify that a pod should (preferred) run on a node instead of must (required) run on it. There's also the possibility to have an affinity rule concerning pods instead of nodes, e.g. Pod A should not run on the same node as Pod B. For a VI admin this is quite similar to DRS rules, with both VM-to-Host and VM-to-VM rules.

So let's see it in action.

We'll add an affinity section to the podSpec where we specify a nodeAffinity that must match a node label. Notice I've used the "In" operator, indicating that we can have multiple values.

apiVersion: v1
kind: Pod
metadata:
  name: nginx-dev-2
  labels:
    env: dev
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
            - key: "env"
              operator: In
              values:
                - dev
  containers:
    - name: nginx-dev-2
      image: nginx
      imagePullPolicy: IfNotPresent

kubectl create -f <file-name-2>

kubectl get pods -o wide
Pod with Affinity rule
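For comparison, a "soft" version of the same rule would use preferredDuringSchedulingIgnoredDuringExecution together with a weight, which is taken into account in the scoring step instead of filtering out nodes. A sketch with an example pod name and an arbitrary weight:

apiVersion: v1
kind: Pod
metadata:
  name: nginx-dev-preferred
  labels:
    env: dev
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 1 # higher weight means a stronger preference when scoring nodes
          preference:
            matchExpressions:
              - key: "env"
                operator: In
                values:
                  - dev
  containers:
    - name: nginx-dev-preferred
      image: nginx
      imagePullPolicy: IfNotPresent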

Inter-pod affinity/anti-affinity

Kubernetes Documentation reference

So let's try the anti-affinity option where we instruct a pod to stay away from another pod.

We'll add a new dev pod which needs to stay away from other pods with the same env=dev label. This is done by adding a podAntiAffinity to the podSpec. We also need to specify a topologyKey from the node labels. I'm using the hostname label, but you could use others or create your own.

Similarly we could add a podAffinity if we wanted the pods to be together.

apiVersion: v1
kind: Pod
metadata:
  name: nginx-dev-3
  labels:
    env: dev
spec:
  containers:
    - name: nginx-dev-3
      image: nginx
      imagePullPolicy: IfNotPresent
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
            - key: "env"
              operator: In
              values:
                - dev
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: env
                operator: In
                values:
                  - dev
          topologyKey: "kubernetes.io/hostname"

Let's create this pod and see what happens

kubectl create -f <file-name-3>

kubectl get pods -o wide
Schedule dev pod with anti-affinity

The pod can't run because there's no node available that matches both the nodeAffinity AND the podAntiAffinity. This can be verified by checking the status of the pod with the kubectl describe pod command

kubectl describe pod <POD-NAME>
Pod describe

Now, let's try to re-label one of the prod nodes to be a dev node.

kubectl label node <PROD-NODE-NAME> env-
kubectl label node <PROD-NODE-NAME> env=dev #Could've used the --overwrite flag to do this in one operation
Relabel node and pod is running

Voila! Our pod has now been scheduled on the newly relabeled node. Check out a few other use cases in the Kubernetes documentation
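As mentioned earlier, a podAffinity rule for keeping pods together looks almost the same, just without the "anti" part. A sketch reusing the env=dev label and the hostname topologyKey (the pod name is just an example):

apiVersion: v1
kind: Pod
metadata:
  name: nginx-dev-4
  labels:
    env: dev
spec:
  containers:
    - name: nginx-dev-4
      image: nginx
      imagePullPolicy: IfNotPresent
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchExpressions:
              - key: env
                operator: In
                values:
                  - dev
          topologyKey: "kubernetes.io/hostname"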

A word of warning when it comes to using podAffinity rules: they require a significant amount of processing in the kube-scheduler, which can slow down scheduling in large clusters. reference

Summary

A quick summary:

Taints are a way for a node to keep pod(s) away, and can be overridden with Tolerations on a Pod

Labels can be used in a nodeSelector expression in the podSpec to tell the scheduler where a pod wants to run. Affinity and anti-affinity can be set up so that a pod gets scheduled together with, or away from, a specific set of other pods. Affinity rules can be written with a hard (required) or soft (preferred) requirement.

Kubernetes allows for custom schedulers

There's obviously a lot more to scheduling, nodeAffinity and podAffinity than what's covered here. Check out the documentation for more information and examples.
