Jobs can be classified into online jobs and offline jobs based on whether services are always online.
Many services see traffic surges. To ensure performance and stability, resources are often requested at the maximum needed. However, surges may be short-lived, and the requested resources, if not released, are wasted during off-peak hours. Especially for online jobs that request large quantities of resources to guarantee SLAs, resource utilization can be very low.
Resource oversubscription is the process of making use of idle requested resources. Oversubscribed resources are suitable for deploying offline jobs, which focus on throughput but have low SLA requirements and can tolerate certain failures.
Hybrid deployment of online and offline jobs in a cluster can better utilize cluster resources.
Hybrid deployment is supported, and CPU and memory resources can be oversubscribed. The key features are as follows:
- If both oversubscribed and non-oversubscribed nodes exist, the former score higher than the latter, so offline jobs are preferentially scheduled to oversubscribed nodes.
- Offline jobs can use both the oversubscribed and non-oversubscribed resources of an oversubscribed node.
- If both online and offline jobs exist, online jobs are scheduled first. When the node resource usage exceeds the upper limit and the node requests exceed 100%, offline jobs are evicted.
- CPU isolation: Online jobs can quickly preempt the CPU resources of offline jobs and suppress the CPU usage of the offline jobs.
- Memory isolation: When system memory is exhausted and OOM Kill is triggered, the kernel evicts offline jobs first.
After a pod is scheduled to a node, kubelet starts the pod only if the node resources can meet the pod request (predicateAdmitHandler.Admit), that is, when both of the following conditions are met:
If only hybrid deployment is used, you need to configure the label volcano.sh/colocation=true for the node and delete the node label volcano.sh/oversubscription or set its value to false.
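For example, the two labels can be adjusted with kubectl (the node name 192.168.0.0 follows the examples elsewhere in this section and is only illustrative):

```
# Enable hybrid deployment only (no oversubscription) on the node.
kubectl label node 192.168.0.0 volcano.sh/colocation=true
# Remove the oversubscription label; the trailing "-" deletes a label.
kubectl label node 192.168.0.0 volcano.sh/oversubscription-
```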
| Hybrid Deployment Enabled (volcano.sh/colocation=true) | Resource Oversubscription Enabled (volcano.sh/oversubscription=true) | Use Oversubscribed Resources? | Conditions for Evicting Offline Pods |
|---|---|---|---|
| No | No | No | None |
| Yes | No | No | The node resource usage exceeds the high threshold. |
| No | Yes | Yes | The node resource usage exceeds the high threshold, and the node request exceeds 100%. |
| Yes | Yes | Yes | The node resource usage exceeds the high threshold. |
If the label volcano.sh/oversubscription=true is configured for a node in the cluster, the oversubscription configuration must be added to the volcano add-on. Otherwise, the scheduling of oversubscribed nodes will be abnormal. For details about the related configuration, see Table 1.
```
# kubectl edit cm volcano-scheduler-configmap -n kube-system
apiVersion: v1
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: gang
      - name: priority
      - name: conformance
      - name: oversubscription
    - plugins:
      - name: drf
      - name: predicates
      - name: nodeorder
      - name: binpack
    - plugins:
      - name: cce-gpu-topology-predicate
      - name: cce-gpu-topology-priority
      - name: cce-gpu
```
Oversubscribed resources can be used only after the oversubscription feature is enabled for a node by configuring a label. Such nodes can be created only in a node pool. To enable the oversubscription feature, perform the following steps:
An oversubscribed node must carry the volcano.sh/oversubscription label. If this label is set on a node with the value true, the node is an oversubscribed node; otherwise, it is not.
kubectl label node 192.168.0.0 volcano.sh/oversubscription=true
An oversubscribed node also supports the oversubscription thresholds, as listed in Table 2. For example:
kubectl annotate node 192.168.0.0 volcano.sh/evicting-cpu-high-watermark=70
Querying the node information
```
# kubectl describe node 192.168.0.0
Name:         192.168.0.0
Roles:        <none>
Labels:       ...
              volcano.sh/oversubscription=true
Annotations:  ...
              volcano.sh/evicting-cpu-high-watermark: 70
```
| Name | Description |
|---|---|
| volcano.sh/evicting-cpu-high-watermark | When the CPU usage of a node exceeds this value, offline job eviction is triggered and the node becomes unschedulable. The default value is 80, indicating that eviction is triggered when the node's CPU usage exceeds 80%. |
| volcano.sh/evicting-cpu-low-watermark | After eviction is triggered, scheduling resumes once the node's CPU usage falls below this value. The default value is 30, indicating that scheduling resumes when the node's CPU usage is lower than 30%. |
| volcano.sh/evicting-memory-high-watermark | When the memory usage of a node exceeds this value, offline job eviction is triggered and the node becomes unschedulable. The default value is 60, indicating that eviction is triggered when the node's memory usage exceeds 60%. |
| volcano.sh/evicting-memory-low-watermark | After eviction is triggered, scheduling resumes once the node's memory usage falls below this value. The default value is 30, indicating that scheduling resumes when the node's memory usage is lower than 30%. |
| volcano.sh/oversubscription-types | Oversubscribed resource type (cpu, memory, or both). The default value is cpu,memory. |
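For instance, the eviction watermarks can be tuned per node via annotations, following the same pattern as the CPU watermark example above (the node name and values here are illustrative, not recommendations):

```
kubectl annotate node 192.168.0.0 volcano.sh/evicting-memory-high-watermark=70
kubectl annotate node 192.168.0.0 volcano.sh/evicting-memory-low-watermark=20
```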
To distinguish offline jobs, add the volcano.sh/qos-level annotation. The value is an integer ranging from -7 to 7. If the value is less than 0, the job is an offline job; if the value is greater than or equal to 0, the job is a high-priority (online) job. You do not need to set this annotation for online jobs. For both online and offline jobs, set schedulerName to volcano to enable the Volcano scheduler.
Priorities are not differentiated among online jobs or among offline jobs, and the value is not validated. If the volcano.sh/qos-level value of an offline job is not a negative integer from -7 to -1, the job is processed as an online job.
For an offline job:
```yaml
kind: Deployment
apiVersion: apps/v1
spec:
  replicas: 4
  template:
    metadata:
      annotations:
        metrics.alpha.kubernetes.io/custom-endpoints: '[{"api":"","path":"","port":"","names":""}]'
        volcano.sh/qos-level: "-1"   # Offline job annotation
    spec:
      schedulerName: volcano         # The Volcano scheduler is used.
      ...
```
For an online job:
```yaml
kind: Deployment
apiVersion: apps/v1
spec:
  replicas: 4
  template:
    metadata:
      annotations:
        metrics.alpha.kubernetes.io/custom-endpoints: '[{"api":"","path":"","port":"","names":""}]'
    spec:
      schedulerName: volcano         # The Volcano scheduler is used.
      ...
```
kubectl describe node <nodeIP>
```
# kubectl describe node 192.168.0.0
Name:         192.168.0.0
Roles:        <none>
Labels:       ...
              volcano.sh/oversubscription=true
Annotations:  ...
              volcano.sh/oversubscription-cpu:    2335
              volcano.sh/oversubscription-memory: 341753856
Allocatable:
  cpu:     3920m
  memory:  6263988Ki
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource  Requests      Limits
  --------  --------      ------
  cpu       4950m (126%)  4950m (126%)
  memory    1712Mi (27%)  1712Mi (27%)
```
The following uses an example to describe how to deploy online and offline jobs in hybrid mode.
```
# kubectl get node
NAME            STATUS   ROLES    AGE     VERSION
192.168.0.173   Ready    <none>   4h58m   v1.19.16-r2-CCE22.5.1
192.168.0.3     Ready    <none>   148m    v1.19.16-r2-CCE22.5.1
```
```
# kubectl describe node 192.168.0.173
Name:    192.168.0.173
Roles:   <none>
Labels:  beta.kubernetes.io/arch=amd64
         ...
         volcano.sh/oversubscription=true
```
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: offline
  namespace: default
spec:
  replicas: 2
  selector:
    matchLabels:
      app: offline
  template:
    metadata:
      labels:
        app: offline
      annotations:
        volcano.sh/qos-level: "-1"   # Offline job annotation
    spec:
      schedulerName: volcano         # The Volcano scheduler is used.
      containers:
        - name: container-1
          image: nginx:latest
          imagePullPolicy: IfNotPresent
          resources:
            requests:
              cpu: 500m
              memory: 512Mi
            limits:
              cpu: "1"
              memory: 512Mi
      imagePullSecrets:
        - name: default-secret
```
```
# kubectl get pod -o wide
NAME                       READY   STATUS    RESTARTS   AGE   IP               NODE
offline-69cdd49bf4-pmjp8   1/1     Running   0          5s    192.168.10.178   192.168.0.173
offline-69cdd49bf4-z8kxh   1/1     Running   0          5s    192.168.10.131   192.168.0.173
```
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: online
  namespace: default
spec:
  replicas: 2
  selector:
    matchLabels:
      app: online
  template:
    metadata:
      labels:
        app: online
    spec:
      schedulerName: volcano   # The Volcano scheduler is used.
      containers:
        - name: container-1
          image: resource_consumer:latest
          imagePullPolicy: IfNotPresent
          resources:
            requests:
              cpu: 1400m
              memory: 512Mi
            limits:
              cpu: "2"
              memory: 512Mi
      imagePullSecrets:
        - name: default-secret
```
```
# kubectl get pod -o wide
NAME                     READY   STATUS    RESTARTS   AGE   IP               NODE
online-ffb46f656-4mwr6   1/1     Running   0          5s    192.168.10.146   192.168.0.3
online-ffb46f656-dqdv2   1/1     Running   0          5s    192.168.10.67    192.168.0.3
```
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: online
  namespace: default
spec:
  replicas: 2
  selector:
    matchLabels:
      app: online
  template:
    metadata:
      labels:
        app: online
    spec:
      affinity:   # Submit the online job to the oversubscribed node.
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: kubernetes.io/hostname
                    operator: In
                    values:
                      - 192.168.0.173
      schedulerName: volcano   # The Volcano scheduler is used.
      containers:
        - name: container-1
          image: resource_consumer:latest
          imagePullPolicy: IfNotPresent
          resources:
            requests:
              cpu: 700m
              memory: 512Mi
            limits:
              cpu: 700m
              memory: 512Mi
      imagePullSecrets:
        - name: default-secret
```
```
# kubectl get pod -o wide
NAME                       READY   STATUS    RESTARTS   AGE     IP               NODE
offline-69cdd49bf4-pmjp8   1/1     Running   0          13m     192.168.10.178   192.168.0.173
offline-69cdd49bf4-z8kxh   1/1     Running   0          13m     192.168.10.131   192.168.0.173
online-6f44bb68bd-b8z9p    1/1     Running   0          3m4s    192.168.10.18    192.168.0.173
online-6f44bb68bd-g6xk8    1/1     Running   0          3m12s   192.168.10.69    192.168.0.173
```
```
# kubectl describe node 192.168.0.173
Name:         192.168.0.173
Roles:        <none>
Labels:       ...
              volcano.sh/oversubscription=true
Annotations:  ...
              volcano.sh/oversubscription-cpu:    2343
              volcano.sh/oversubscription-memory: 3073653200
...
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource  Requests      Limits
  --------  --------      ------
  cpu       4750m (121%)  7350m (187%)
  memory    3760Mi (61%)  4660Mi (76%)
...
```
```
# kubectl get pod -o wide
NAME                       READY   STATUS    RESTARTS   AGE   IP               NODE
offline-69cdd49bf4-bwdm7   1/1     Running   0          11m   192.168.10.208   192.168.0.3
offline-69cdd49bf4-pmjp8   0/1     Evicted   0          26m   <none>           192.168.0.173
offline-69cdd49bf4-qpdss   1/1     Running   0          11m   192.168.10.174   192.168.0.3
offline-69cdd49bf4-z8kxh   0/1     Evicted   0          26m   <none>           192.168.0.173
online-6f44bb68bd-b8z9p    1/1     Running   0          24m   192.168.10.18    192.168.0.173
online-6f44bb68bd-g6xk8    1/1     Running   0          24m   192.168.10.69    192.168.0.173
```
If the volcano add-on has been installed, click Edit to view or modify the parameter colocation_enable.
kubectl edit configmap -n kube-system volcano-agent-configuration
Example:
```yaml
cpuBurstConfig:
  enable: true
```
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
  namespace: default
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
      annotations:
        volcano.sh/enable-quota-burst: "true"
        volcano.sh/quota-burst-time: "200000"
    spec:
      containers:
        - name: container-1
          image: nginx:latest
          resources:
            limits:
              cpu: "4"
            requests:
              cpu: "2"
      imagePullSecrets:
        - name: default-secret
---
apiVersion: v1
kind: Service
metadata:
  name: nginx
  namespace: default
  labels:
    app: nginx
spec:
  selector:
    app: nginx
  ports:
    - name: cce-service-0
      targetPort: 80
      nodePort: 0
      port: 80
      protocol: TCP
  type: ClusterIP
```
| Annotation | Mandatory | Description |
|---|---|---|
| volcano.sh/enable-quota-burst=true | Yes | Enables CPU Burst for the workload. |
| volcano.sh/quota-burst-time=200000 | No | To keep CPU scheduling stable and reduce contention when multiple containers burst at the same time, the default CPU Burst value equals the CPU quota, that is, a container can use at most twice its CPU limit. By default, CPU Burst is set for all service containers in a pod. In this example, the CPU limit of the container is 4 cores, so the default value is 400,000 (1 core = 100,000), meaning up to four additional cores can be used after the CPU limit is reached. |
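The default burst value follows directly from the CPU limit. A minimal sketch of the arithmetic (1 core = 100,000 cgroup quota units):

```shell
# Derive the default CPU Burst value from a CPU limit of 4 cores.
# The default burst equals the quota, so the container may use up to
# twice its CPU limit while bursting.
cpu_limit_cores=4
burst=$((cpu_limit_cores * 100000))
echo "$burst"   # prints 400000
```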
You can use the wrk tool to increase the load on the workload and observe the service latency, throttling, and CPU limit overruns with CPU Burst enabled and disabled, respectively.
```
# Download and install the wrk tool on the node first.
# The Gzip compression module is enabled in the Apache configuration to simulate
# the server-side computation needed to process each request.
# Run the following command to increase the load. Replace the IP address with
# that of the target application.
wrk -H "Accept-Encoding: deflate, gzip" -t 4 -c 28 -d 120 --latency --timeout 2s http://$service_ip
```
kubectl get pods -n <namespace> <pod-name> -o jsonpath='{.metadata.uid}'
```
cat /sys/fs/cgroup/cpuacct/kubepods/$PodID/cpu.stat
nr_periods 0        # Number of scheduling periods
nr_throttled 0      # Number of times the container was throttled
throttled_time 0    # Total throttled duration (ns)
nr_bursts 0         # Number of times the CPU limit was exceeded
burst_time 0        # Total duration for which the CPU limit was exceeded
```
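As a quick sketch of how these counters can be pulled out for monitoring (the cpu.stat content below is hard-coded sample data, not read from a live cgroup):

```shell
# Sample cpu.stat content; on a real node this would come from
# /sys/fs/cgroup/cpuacct/kubepods/$PodID/cpu.stat
stat='nr_periods 1200
nr_throttled 986
throttled_time 14300000000
nr_bursts 0
burst_time 0'

# Extract the throttle count and total throttled time (ns).
nr_throttled=$(printf '%s\n' "$stat" | awk '$1 == "nr_throttled" {print $2}')
throttled_time=$(printf '%s\n' "$stat" | awk '$1 == "throttled_time" {print $2}')
echo "throttled ${nr_throttled} times for ${throttled_time} ns"
# prints: throttled 986 times for 14300000000 ns
```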
| CPU Burst | P99 Latency | nr_throttled (Throttle Count) | throttled_time (Throttled Duration) | nr_bursts (Burst Count) | burst_time (Total Burst Duration) |
|---|---|---|---|---|---|
| CPU Burst not enabled | 2.96 ms | 986 | 14.3s | 0 | 0 |
| CPU Burst enabled | 456 µs | 0 | 0 | 469 | 3.7s |
You can reduce the oversubscribed resource types only when the resource allocation rate does not exceed 100%.
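For example, to limit oversubscription to CPU only, the volcano.sh/oversubscription-types value from Table 2 can be changed on the node (node name is illustrative):

```
kubectl annotate node 192.168.0.0 volcano.sh/oversubscription-types=cpu --overwrite
```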