Lets learn "About k8s Descheduler"



Introduction

One of the components of the Kubernetes stack is Kube-scheduler, which takes care of scheduling pods on k8s worker nodes based on certain strategies. These strategies include node resources, node affinity, etc. Basically, Kube-scheduler tries to schedule a pod on the node that best meets its scheduling requirements. The problem is that once a pod is scheduled on a node, it will never be automatically rescheduled onto another node unless it is manually deleted and recreated.

Now, one question that may come to mind is: in what scenarios would we want a pod to get rescheduled automatically?

  • The original scheduling decisions (made by Kube-scheduler) may no longer hold true because of the following reasons:
    • Node resources are changed.
    • New taints/labels are added to the node.
    • New nodes are added to the cluster, which makes existing nodes overutilized and new nodes underutilized.
    • Some nodes failed and were brought back, but their pods had already been rescheduled onto other nodes in the meantime.

So the conclusion is that there are scenarios where the original Kube-scheduler decisions are no longer valid, and the decisions need to be re-taken by some other entity. That is where the k8s DESCHEDULER comes into the picture.

If you are interested in going through the code for the k8s descheduler, you can find it at https://github.com/kubernetes-sigs/descheduler.

How does the descheduler for k8s work?

Basically, when the descheduler binary runs with a policy config file, it evicts pods based on the strategies defined in that file. Unlike Kube-scheduler, it is not a long-running process. So if you want to evict pods based on the defined strategies multiple times, you need to run the descheduler binary that many times.
In general, if we want to run it once, we can run it as a k8s Job. And if we want to run it periodically, we can create a k8s CronJob that runs the descheduler binary on a schedule and evicts pods based on the strategies provided. I've mentioned strategies multiple times; we'll go through them in detail later in this tutorial.

NOTE: The descheduler pod needs to run as a critical pod in the kube-system namespace so that it is not evicted by itself or by the kubelet in any scenario.

Descheduler as k8s CronJob

kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: descheduler-cluster-role
  namespace: kube-system
  labels:
    app: descheduler
rules:
- apiGroups: [""]
  resources: ["events"]
  verbs: ["create", "update"]
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get", "watch", "list"]
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "watch", "list", "delete"]
- apiGroups: [""]
  resources: ["pods/eviction"]
  verbs: ["create"]
- apiGroups: ["scheduling.k8s.io"]
  resources: ["priorityclasses"]
  verbs: ["get", "watch", "list"]
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: descheduler-sa
  namespace: kube-system
  labels:
    app: descheduler
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: descheduler-cluster-role-binding
  namespace: kube-system
  labels:
    app: descheduler
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: descheduler-cluster-role
subjects:
  - name: descheduler-sa
    kind: ServiceAccount
    namespace: kube-system
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: descheduler-policy-configmap
  namespace: kube-system
  labels:
    app: descheduler
data:
  policy.yaml: |
    apiVersion: "descheduler/v1alpha1"
    kind: "DeschedulerPolicy"
    strategies:
      "RemoveDuplicates":
         enabled: true
      "RemovePodsViolatingInterPodAntiAffinity":
         enabled: true
      "LowNodeUtilization":
         enabled: true
         params:
           nodeResourceUtilizationThresholds:
             thresholds:
               "pods": 20
             targetThresholds:
               "pods": 33

---
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: descheduler-cronjob
  namespace: kube-system
  labels:
    app: descheduler
spec:
  schedule: "*/2 * * * *" # Runs every 2 minutes
  concurrencyPolicy: "Forbid"
  jobTemplate:
    spec:
      template:
        metadata:
          name: descheduler-pod
        spec:
          priorityClassName: system-cluster-critical
          containers:
          - name: descheduler
            image: k8s.gcr.io/descheduler/descheduler:v0.19.0
            volumeMounts:
            - mountPath: /policy-dir
              name: policy-volume
            command:
              - "/bin/descheduler"
            args:
              - "--policy-config-file"
              - "/policy-dir/policy.yaml"
              - "--v"
              - "3"
          restartPolicy: "Never"
          serviceAccountName: descheduler-sa
          volumes:
          - name: policy-volume
            configMap:
              name: descheduler-policy-configmap

Descheduler as K8s Job

kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: descheduler-cluster-role
  namespace: kube-system
  labels:
    app: descheduler
rules:
- apiGroups: [""]
  resources: ["events"]
  verbs: ["create", "update"]
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get", "watch", "list"]
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "watch", "list", "delete"]
- apiGroups: [""]
  resources: ["pods/eviction"]
  verbs: ["create"]
- apiGroups: ["scheduling.k8s.io"]
  resources: ["priorityclasses"]
  verbs: ["get", "watch", "list"]
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: descheduler-sa
  namespace: kube-system
  labels:
    app: descheduler
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: descheduler-cluster-role-binding
  namespace: kube-system
  labels:
    app: descheduler
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: descheduler-cluster-role
subjects:
  - name: descheduler-sa
    kind: ServiceAccount
    namespace: kube-system
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: descheduler-policy-configmap
  namespace: kube-system
  labels:
    app: descheduler
data:
  policy.yaml: |
    apiVersion: "descheduler/v1alpha1"
    kind: "DeschedulerPolicy"
    strategies:
      "RemoveDuplicates":
         enabled: true
      "RemovePodsViolatingInterPodAntiAffinity":
         enabled: true
      "LowNodeUtilization":
         enabled: true
         params:
           nodeResourceUtilizationThresholds:
             thresholds:
               "pods": 20
             targetThresholds:
               "pods": 33
---
apiVersion: batch/v1
kind: Job
metadata:
  name: descheduler-job
  namespace: kube-system
spec:
  parallelism: 1
  completions: 1
  template:
    metadata:
      name: descheduler-pod
    spec:
      priorityClassName: system-cluster-critical
      containers:
        - name: descheduler
          image: k8s.gcr.io/descheduler/descheduler:v0.19.0
          volumeMounts:
          - mountPath: /policy-dir
            name: policy-volume
          command:
            - "/bin/descheduler"
          args:
            - "--policy-config-file"
            - "/policy-dir/policy.yaml"
            - "--v"
            - "3"
      restartPolicy: "Never"
      serviceAccountName: descheduler-sa
      volumes:
      - name: policy-volume
        configMap:
          name: descheduler-policy-configmap

Where can you find a Docker image for the descheduler?

The Docker image for the descheduler is available at k8s.gcr.io/descheduler/descheduler:<version>

Policy and Strategies for Descheduler

Now let's learn about the different strategies for the descheduler and their use cases in detail.

The descheduler's policy is configurable and includes strategies that can be enabled or disabled. There are a total of 7 strategies implemented as of today (9/27/20). As part of the policy, the parameters associated with each strategy can be configured too. Also, by default, all the strategies are enabled.
  1. RemoveDuplicates:
    • It is possible that multiple pods associated with the same ReplicaSet, ReplicationController, or k8s Job end up scheduled on the same node.
    • For example, if one of the k8s worker nodes goes down, the pods scheduled on it get rescheduled onto other active nodes, which can leave more than one similar pod on a single node. If you want such pods to be spread out again once the failed node comes back, you can use this strategy, and the descheduler will try to remove the duplicates from a node where possible.
    • Parameters:
      • excludeOwnerKinds
        Type: list( string )
        Details: List of ownerRefs Kind(s). Pods with any of these kinds won't be evicted.
      • namespaces
        Details: See namespace filtering section.
      • thresholdPriority
        Type: int
        Details: See priority filtering section.
      • thresholdPriorityClassName
        Type: string
        Details: See priority filtering section.
    • Example:

        apiVersion: "descheduler/v1alpha1"
        kind: "DeschedulerPolicy"
        strategies:
          "RemoveDuplicates":
            enabled: true
            params:
              removeDuplicates:
                excludeOwnerKinds:
                - "ReplicaSet"

  2. LowNodeUtilization:
    • This strategy is used to ensure the proper distribution of pods across all the k8s worker nodes. If this strategy is enabled, then based on the parameters provided to it, the descheduler classifies each node into one of three types:
      • 1) Over utilized - This means the percentage of pods scheduled on the node is above the target threshold.
      • 2) Under utilized - This means the percentage of pods scheduled on the node is below the threshold.
      • 3) Appropriately utilized - This means the percentage of pods scheduled on the node is fine (i.e. above the threshold and below the target threshold).
    • With this strategy enabled, the descheduler tries to make every node appropriately utilized by evicting pods from over-utilized nodes so that Kube-scheduler can place them on under-utilized nodes.
    • For example, if a node goes down and comes back up after some time, its pods will have been rescheduled onto other active nodes in the meantime. When the node comes back up, the descheduler will consider it under-utilized and will evict pods from any over-utilized nodes in the system so they can be scheduled onto it.
    • NOTE: You should tune the parameters of this strategy carefully to avoid unnecessary rescheduling of pods.
    • Parameters:
      • thresholds
        Type: map(string:int)
        Details: % of resources to mark node as under-utilized.
      • targetThresholds
        Type: map(string:int)
        Details: % of resources to mark node as over-utilized.
      • thresholdPriority
        Type: int
        Details: See priority filtering section.
      • thresholdPriorityClassName
        Type: string
        Details: See priority filtering section.
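    • Example (a minimal policy sketch; the cpu/memory/pods percentages are only illustrative values I'm assuming here, so tune them for your cluster):

        apiVersion: "descheduler/v1alpha1"
        kind: "DeschedulerPolicy"
        strategies:
          "LowNodeUtilization":
            enabled: true
            params:
              nodeResourceUtilizationThresholds:
                thresholds:        # a node below these for all resources is under-utilized
                  "cpu": 20
                  "memory": 20
                  "pods": 20
                targetThresholds:  # a node above these for any resource is over-utilized
                  "cpu": 50
                  "memory": 50
                  "pods": 50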
  3. RemovePodsViolatingInterPodAntiAffinity:
    • Inter-pod anti-affinity is used when we don't want two particular pods to run on the same node.
    • This strategy helps evict a pod podX when there is another pod podY on the same node and the anti-affinity rules say that podX and podY can't be scheduled on the same node.
    • One question that can come to mind is how podX and podY got scheduled on the same node in the first place if they have anti-affinity rules in their pod specs. This can happen when the anti-affinity rules are added after the pods have already been scheduled.
    • Parameters:
      • namespaces
        Details: See namespace filtering section.
      • thresholdPriority
        Type: int
        Details: See priority filtering section.
      • thresholdPriorityClassName
        Type: string
        Details: See priority filtering section.
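    • Example (a minimal sketch; this strategy needs no extra params beyond enabling it, exactly as in the ConfigMap above):

        apiVersion: "descheduler/v1alpha1"
        kind: "DeschedulerPolicy"
        strategies:
          "RemovePodsViolatingInterPodAntiAffinity":
            enabled: true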
  4. RemovePodsViolatingNodeAffinity:
    • Today, when one defines a node affinity rule of type requiredDuringSchedulingIgnoredDuringExecution, Kube-scheduler takes care of the rule when it schedules the pod, but after that the kubelet no longer checks it. Let's discuss with an example: say pod podX has a node affinity rule requiring a node with a particular label, and node N1 has that label, so Kube-scheduler schedules podX on N1. If that label is later removed from N1 for whatever reason, podX keeps running on N1 even though its affinity rule is now violated.
    • With this strategy enabled, the descheduler tries to ensure that node affinity rules are not violated during execution either: if a pod violates its node affinity and another node satisfying the rule is available, the pod is evicted so it can be rescheduled correctly. So it kind of provides requiredDuringSchedulingRequiredDuringExecution behaviour, which is great.
    • Parameters:
      • nodeAffinityType
        Type: list( string )
        Details: Type of node affinity to evaluate (e.g. requiredDuringSchedulingIgnoredDuringExecution).
      • namespaces
        Details: See namespace filtering section.
      • thresholdPriority
        Type: int
        Details: See priority filtering section.
      • thresholdPriorityClassName
        Type: string
        Details: See priority filtering section.
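    • Example (a sketch assuming the nodeAffinityType list format from the descheduler docs for this release):

        apiVersion: "descheduler/v1alpha1"
        kind: "DeschedulerPolicy"
        strategies:
          "RemovePodsViolatingNodeAffinity":
            enabled: true
            params:
              nodeAffinityType:
              - "requiredDuringSchedulingIgnoredDuringExecution"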
  5. RemovePodsHavingTooManyRestarts:
    • As the name suggests, the descheduler will evict pods having a restart count higher than the defined threshold so that they can be rescheduled onto other nodes.
    • One of the scenarios where this can be useful: a pod has no resource limits defined, or the node runs out of resources after the pod is scheduled. Due to the limited resources available, the process running in the pod might keep crashing, incrementing the restart count. This strategy checks the restart count of the pod, and if it is greater than the threshold, assuming that the pod is no longer loving the current node, the descheduler evicts it so it can be rescheduled onto another active node.
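    • Example (a sketch; podRestartThreshold and includingInitContainers are the parameters documented for this strategy, and 100 is just an illustrative value):

        apiVersion: "descheduler/v1alpha1"
        kind: "DeschedulerPolicy"
        strategies:
          "RemovePodsHavingTooManyRestarts":
            enabled: true
            params:
              podsHavingTooManyRestarts:
                podRestartThreshold: 100      # evict pods whose containers have restarted more than this many times
                includingInitContainers: true # count init container restarts too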
  6. PodLifeTime:
    • This strategy evicts pods whose lifetime is greater than the configured maxPodLifeTimeSeconds.
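    • Example (a sketch assuming the podLifeTime params block from the descheduler docs for this release; 86400 seconds is an illustrative value):

        apiVersion: "descheduler/v1alpha1"
        kind: "DeschedulerPolicy"
        strategies:
          "PodLifeTime":
            enabled: true
            params:
              podLifeTime:
                maxPodLifeTimeSeconds: 86400  # evict pods older than 1 day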
  7. RemovePodsViolatingNodeTaints:
    • If you are not familiar with taints and tolerations, it is better to understand them first here: https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/
    • Basically, this strategy ensures that pods violating NoSchedule taints on nodes are removed.
    • For example: a pod podX is scheduled on node N1 because podX's tolerations match all the NoSchedule taints on node N1. If the taints on N1 are later updated or new ones are added such that podX no longer tolerates them, podX needs to be evicted from N1. This strategy of the descheduler will take care of that.
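    • Example (a minimal sketch; like the anti-affinity strategy, it only needs to be enabled):

        apiVersion: "descheduler/v1alpha1"
        kind: "DeschedulerPolicy"
        strategies:
          "RemovePodsViolatingNodeTaints":
            enabled: true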
I HOPE YOU LEARNT SOMETHING NEW IN THIS TUTORIAL. IF YOU HAVE ANY DOUBTS, PLEASE PUT THEM IN THE COMMENTS BELOW :)
