Scheduling in a cluster is the process of binding pending pods to nodes. It is performed by a component called kube-scheduler or Volcano Scheduler, which uses a series of algorithms to compute the optimal node for running each pod. However, Kubernetes clusters are dynamic, and their state changes over time. For example, if a node needs maintenance, all pods on it are evicted to other nodes. After the maintenance is complete, the evicted pods do not automatically return to the node, because scheduling is not triggered again once a pod has been bound. As a result of such changes, the cluster load may become unbalanced after the cluster has run for a period of time.
CCE has resolved this issue by using Volcano Scheduler to evict pods that do not comply with the configured policy so that pods can be rescheduled. In this way, the cluster load is balanced and resource fragmentation is minimized.
Load-aware Descheduling
During Kubernetes cluster management, nodes can become over-utilized due to high CPU or memory usage, which affects the stable running of pods on those nodes and increases the probability of node faults. To dynamically balance resource usage across the nodes in a cluster, a cluster resource view based on node monitoring metrics is required. Real-time monitoring can then detect issues such as high resource usage on a node, node faults, and an excessive number of pods on a node, so that the system can take measures promptly, for example, by migrating some pods from an over-utilized node to under-utilized nodes.
When using this add-on, ensure the highThresholds value is greater than the lowThresholds value. Otherwise, the descheduler cannot work.
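The constraint above can be verified before applying the configuration. The helper below is a hypothetical illustration (it is not part of the add-on), using the CPU and memory values from the LoadAware example in this section:

```python
# Hypothetical pre-check for a load-aware descheduling configuration.
# The descheduler requires every high threshold to exceed the corresponding
# low threshold; otherwise it cannot distinguish over-utilized nodes
# (eviction sources) from under-utilized nodes (eviction targets).

def thresholds_valid(low: dict, high: dict) -> bool:
    """Return True if every resource's high threshold exceeds its low one."""
    return all(high.get(resource, 0) > value for resource, value in low.items())

# Example values from the LoadAware policy shown later in this section.
low = {"cpu": 30, "memory": 30}
high = {"cpu": 80, "memory": 85}
print(thresholds_valid(low, high))  # True: the configuration is usable
```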
HighNodeUtilization
This policy finds under-utilized nodes and evicts their pods in the hope that these pods will be scheduled compactly onto fewer nodes. This policy must be used with the bin packing policy of Volcano Scheduler or the MostAllocated policy of kube-scheduler. Thresholds can be configured for CPU and memory.
When configuring a load-aware descheduling policy, enable load-aware scheduling on Volcano Scheduler as follows:
```json
{
  "colocation_enable": "",
  "default_scheduler_conf": {
    "actions": "allocate, backfill, preempt",
    "tiers": [
      {
        "plugins": [
          { "name": "priority" },
          { "enablePreemptable": false, "name": "gang" },
          { "name": "conformance" }
        ]
      },
      {
        "plugins": [
          { "enablePreemptable": false, "name": "drf" },
          { "name": "predicates" },
          { "name": "nodeorder" },
          {
            "name": "usage",
            "enablePredicate": true,
            "arguments": {
              "usage.weight": 5,
              "cpu.weight": 1,
              "memory.weight": 1,
              "thresholds": {
                "cpu": 80,
                "mem": 80
              }
            }
          }
        ]
      },
      {
        "plugins": [
          { "name": "cce-gpu-topology-predicate" },
          { "name": "cce-gpu-topology-priority" },
          { "name": "cce-gpu" }
        ]
      },
      {
        "plugins": [
          { "name": "nodelocalvolume" },
          { "name": "nodeemptydirvolume" },
          { "name": "nodeCSIscheduling" },
          { "name": "networkresource" }
        ]
      }
    ]
  },
  "deschedulerPolicy": {
    "profiles": [
      {
        "name": "ProfileName",
        "pluginConfig": [
          {
            "args": {
              "ignorePvcPods": true,
              "nodeFit": true,
              "priorityThreshold": {
                "value": 100
              }
            },
            "name": "DefaultEvictor"
          },
          {
            "args": {
              "evictableNamespaces": {
                "exclude": ["kube-system"]
              },
              "metrics": {
                "type": "prometheus_adaptor"
              },
              "targetThresholds": {
                "cpu": 80,
                "memory": 85
              },
              "thresholds": {
                "cpu": 30,
                "memory": 30
              }
            },
            "name": "LoadAware"
          }
        ],
        "plugins": {
          "balance": {
            "enabled": ["LoadAware"]
          }
        }
      }
    ]
  },
  "descheduler_enable": "true",
  "deschedulingInterval": "10m"
}
```
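The deschedulerPolicy block above is plain JSON, so its structure can be inspected programmatically. The sketch below (illustrative only, not part of the add-on) extracts the LoadAware arguments from a trimmed-down copy of the policy and checks that the eviction thresholds are consistent:

```python
import json

# Trimmed-down copy of the deschedulerPolicy section from the example above.
policy = json.loads("""
{
  "profiles": [{
    "name": "ProfileName",
    "pluginConfig": [
      {"name": "DefaultEvictor", "args": {"ignorePvcPods": true, "nodeFit": true}},
      {"name": "LoadAware", "args": {
        "thresholds":       {"cpu": 30, "memory": 30},
        "targetThresholds": {"cpu": 80, "memory": 85}
      }}
    ],
    "plugins": {"balance": {"enabled": ["LoadAware"]}}
  }]
}
""")

profile = policy["profiles"][0]
load_aware = next(p for p in profile["pluginConfig"] if p["name"] == "LoadAware")
args = load_aware["args"]

# Nodes above targetThresholds are eviction candidates; nodes below
# thresholds are candidates to receive the evicted pods.
for resource, low in args["thresholds"].items():
    assert args["targetThresholds"][resource] > low
print("LoadAware thresholds are consistent")
```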
| Parameter | Description |
|---|---|
| descheduler_enable | Whether to enable a cluster descheduling policy. |
| deschedulingInterval | Descheduling period. |
| deschedulerPolicy | Cluster descheduling policy. For details, see Table 2. |
| Parameter | Description |
|---|---|
| profiles.[].plugins.balance.enabled.[] | Descheduling policy for a cluster. LoadAware: a load-aware descheduling policy is used. |
| profiles.[].pluginConfig.[].name | Name of the plugin to configure for the descheduling policy. Options: DefaultEvictor and LoadAware, as shown in the preceding example. |
| profiles.[].pluginConfig.[].args | Descheduling policy configuration of a cluster. |
When configuring a HighNodeUtilization policy, enable the bin packing policy on Volcano Scheduler as follows:
```json
{
  "colocation_enable": "",
  "default_scheduler_conf": {
    "actions": "allocate, backfill, preempt",
    "tiers": [
      {
        "plugins": [
          { "name": "priority" },
          { "enablePreemptable": false, "name": "gang" },
          { "name": "conformance" },
          {
            "arguments": {
              "binpack.weight": 5
            },
            "name": "binpack"
          }
        ]
      },
      {
        "plugins": [
          { "enablePreemptable": false, "name": "drf" },
          { "name": "predicates" },
          { "name": "nodeorder" }
        ]
      },
      {
        "plugins": [
          { "name": "cce-gpu-topology-predicate" },
          { "name": "cce-gpu-topology-priority" },
          { "name": "cce-gpu" }
        ]
      },
      {
        "plugins": [
          { "name": "nodelocalvolume" },
          { "name": "nodeemptydirvolume" },
          { "name": "nodeCSIscheduling" },
          { "name": "networkresource" }
        ]
      }
    ]
  },
  "deschedulerPolicy": {
    "profiles": [
      {
        "name": "ProfileName",
        "pluginConfig": [
          {
            "args": {
              "ignorePvcPods": true,
              "nodeFit": true,
              "priorityThreshold": {
                "value": 100
              }
            },
            "name": "DefaultEvictor"
          },
          {
            "args": {
              "evictableNamespaces": {
                "exclude": ["kube-system"]
              },
              "thresholds": {
                "cpu": 25,
                "memory": 25
              }
            },
            "name": "HighNodeUtilization"
          }
        ],
        "plugins": {
          "balance": {
            "enabled": ["HighNodeUtilization"]
          }
        }
      }
    ]
  },
  "descheduler_enable": "true",
  "deschedulingInterval": "10m"
}
```
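The binpack plugin makes the scheduler prefer nodes whose resources are already heavily requested, so that pods evicted by HighNodeUtilization land on fewer nodes. The sketch below illustrates the general idea behind this kind of MostAllocated-style scoring; it is a simplified model, not Volcano's exact algorithm:

```python
# Simplified illustration of bin-packing node scoring: nodes with a higher
# requested/allocatable ratio score higher, so new pods are packed onto
# already-busy nodes and under-utilized nodes can be drained and freed.
# This is a sketch of the principle, not Volcano's actual implementation.

def binpack_score(requested: dict, allocatable: dict) -> float:
    """Average requested/allocatable ratio across resources, scaled to 0-100."""
    ratios = [requested[r] / allocatable[r] for r in allocatable]
    return 100 * sum(ratios) / len(ratios)

# A busy node outscores a nearly empty one, so it receives new pods first.
node_a = binpack_score({"cpu": 3.0, "memory": 12}, {"cpu": 4.0, "memory": 16})
node_b = binpack_score({"cpu": 0.5, "memory": 2},  {"cpu": 4.0, "memory": 16})
print(node_a > node_b)  # True: the busier node is preferred
```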
| Parameter | Description |
|---|---|
| descheduler_enable | Whether to enable a cluster descheduling policy. |
| deschedulingInterval | Descheduling period. |
| deschedulerPolicy | Cluster descheduling policy. For details, see Table 4. |
| Parameter | Description |
|---|---|
| profiles.[].plugins.balance.enabled.[] | Descheduling policy for a cluster. HighNodeUtilization: the policy for minimizing CPU and memory fragmentation is used. |
| profiles.[].pluginConfig.[].name | Name of the plugin to configure for the descheduling policy. Options: DefaultEvictor and HighNodeUtilization, as shown in the preceding example. |
| profiles.[].pluginConfig.[].args | Descheduling policy configuration of a cluster. |
If an input parameter is incorrect, for example, the entered value is beyond the accepted value range or in an incorrect format, an event will be generated. In this case, modify the parameter setting as prompted.