Volcano Scheduler offers CPU and memory load-aware scheduling for pods and preferentially schedules pods to the node with the lightest load to balance node loads. This prevents an application or node failure due to heavy loads on a single node.
The native Kubernetes scheduler makes scheduling decisions based only on the resources that pods request. However, a pod's actual resource usage often differs greatly from its requests and limits, which causes load imbalance across the cluster.
Volcano addresses this issue by scheduling based on actual loads. If there are plenty of resources, pods are preferentially scheduled to nodes with the lightest load to balance the load on each node in the cluster.
The status, workload traffic, and requests of a cluster change dynamically, and the resource usage of nodes changes in real time. To prevent extreme load imbalance in a cluster after pod scheduling, Volcano provides load-aware hotspot descheduling for the optimal load balancing of cluster nodes. For details about hotspot descheduling, see Descheduling.
Load-aware scheduling is implemented using both Volcano and the CCE cloud native monitoring add-on (kube-prometheus-stack). After load-aware scheduling is enabled, metrics such as CPU and memory loads are defined by Prometheus adapter rules. The kube-prometheus-stack add-on then collects and saves the actual CPU and memory loads of each node based on these metric rules. Volcano scores and sorts nodes based on the metric values provided by the kube-prometheus-stack add-on and preferentially schedules pods to the node with the lightest load.
Load-aware scheduling scores each node using the weighted average of the CPU and memory metrics as well as the load-aware scheduling policy and preferentially selects the node with the highest score for scheduling. You can customize the weights of the CPU, memory, and load-aware scheduling policy on the Scheduling tab by choosing Settings in the navigation pane of the target cluster.
The formula for scoring a node is as follows:

Node score = Weight of the load-aware scheduling policy × [(1 - CPU usage) × CPU weight + (1 - Memory usage) × Memory weight] / (CPU weight + Memory weight)
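The scoring formula can be illustrated with a short Python sketch. This is not Volcano's implementation; the `NodeLoad` type and the `node_score` and `rank_nodes` functions are hypothetical names introduced here only for illustration. The sketch assumes that CPU and memory usage are expressed as fractions between 0 and 1, and it uses the default weights listed in the parameter table (5, 1, and 1).

```python
from dataclasses import dataclass


@dataclass
class NodeLoad:
    """Hypothetical container for a node's measured load (fractions, 0.0 to 1.0)."""
    name: str
    cpu_usage: float
    memory_usage: float


def node_score(node, policy_weight=5.0, cpu_weight=1.0, memory_weight=1.0):
    """Apply the documented formula: policy weight times the weighted average of idle capacity."""
    total_weight = cpu_weight + memory_weight
    if total_weight <= 0:
        raise ValueError("CPU weight + Memory weight must be greater than 0")
    weighted_idle = ((1 - node.cpu_usage) * cpu_weight
                     + (1 - node.memory_usage) * memory_weight)
    return policy_weight * weighted_idle / total_weight


def rank_nodes(nodes, **weights):
    """Sort nodes from the highest score (lightest load) to the lowest."""
    return sorted(nodes, key=lambda n: node_score(n, **weights), reverse=True)


if __name__ == "__main__":
    nodes = [
        NodeLoad("node-a", cpu_usage=0.2, memory_usage=0.4),
        NodeLoad("node-b", cpu_usage=0.7, memory_usage=0.5),
    ]
    for n in rank_nodes(nodes):
        print(f"{n.name}: {node_score(n):.2f}")
```

With equal CPU and memory weights, the lightly loaded node ranks first. Raising `cpu_weight` relative to `memory_weight` makes CPU usage dominate the ranking, which mirrors the effect described for the CPU Weight parameter.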
After the kube-prometheus-stack add-on is installed, enable the function of automatically obtaining resource metrics through the metrics API. For details, see Providing Resource Metrics Through the Metrics API.

    rules:
    - seriesQuery: '{__name__=~"node_cpu_seconds_total"}'
      resources:
        overrides:
          instance:
            resource: node
      name:
        matches: node_cpu_seconds_total
        as: node_cpu_usage_avg
      metricsQuery: avg_over_time((1 - avg (irate(<<.Series>>{mode="idle"}[5m])) by (instance))[10m:30s])
    - seriesQuery: '{__name__=~"node_memory_MemTotal_bytes"}'
      resources:
        overrides:
          instance:
            resource: node
      name:
        matches: node_memory_MemTotal_bytes
        as: node_memory_usage_avg
      metricsQuery: avg_over_time(((1-node_memory_MemAvailable_bytes/<<.Series>>))[10m:30s])
    resourceRules:
      cpu:
        containerQuery: sum(rate(container_cpu_usage_seconds_total{<<.LabelMatchers>>,container!="",pod!=""}[1m])) by (<<.GroupBy>>)
        nodeQuery: sum(rate(container_cpu_usage_seconds_total{<<.LabelMatchers>>, id='/'}[1m])) by (<<.GroupBy>>)
        resources:
          overrides:
            instance:
              resource: node
            namespace:
              resource: namespace
            pod:
              resource: pod
        containerLabel: container
      memory:
        containerQuery: sum(container_memory_working_set_bytes{<<.LabelMatchers>>,container!="",pod!=""}) by (<<.GroupBy>>)
        nodeQuery: sum(container_memory_working_set_bytes{<<.LabelMatchers>>,id='/'}) by (<<.GroupBy>>)
        resources:
          overrides:
            instance:
              resource: node
            namespace:
              resource: namespace
            pod:
              resource: pod
        containerLabel: container
      window: 1m

In the first rule (node_cpu_usage_avg), metricsQuery obtains the average CPU usage of each node in the target cluster over the last 10 minutes. To change the period, for example, to the last 5 or 30 minutes, change 10m in the [10m:30s] range to 5m or 30m.
In the second rule (node_memory_usage_avg), metricsQuery obtains the average memory usage of each node in the target cluster over the last 10 minutes. To change the period, for example, to the last 5 or 30 minutes, change 10m in the [10m:30s] range to 5m or 30m.
For optimal load-aware scheduling, disable bin packing. The bin packing policy preferentially schedules pods to the node that already has the most resources allocated, based on pods' requested resources, which partly counteracts load-aware scheduling. For details about the combination of multiple policies, see Configuration Cases for Resource Usage-based Scheduling.
| Parameter | Description | Default Value |
|---|---|---|
| Load-Aware Scheduling Policy Weight | A larger value indicates a higher weight of the load-aware policy in overall scheduling. | 5 |
| CPU Weight | A larger value indicates CPU resources will be preferentially balanced. | 1 |
| Memory Weight | A larger value indicates memory resources will be preferentially balanced. | 1 |
| Actual load threshold effective mode | | Hard constraint |
| Actual CPU Load Threshold | When a node's CPU usage goes beyond this threshold, pods will be preferentially or forcibly scheduled to other nodes based on how the load threshold takes effect. | 80 |
| Actual Memory Load Threshold | When a node's memory usage goes beyond this threshold, pods will be preferentially or forcibly scheduled to other nodes based on how the load threshold takes effect. | 80 |