Volcano is a batch processing platform built on Kubernetes. It provides a suite of features required by machine learning, deep learning, bioinformatics, genomics, and other big data applications, serving as a powerful supplement to Kubernetes capabilities.
Volcano provides general-purpose computing capabilities such as high-performance job scheduling, heterogeneous chip management, and job lifecycle management. It integrates with computing frameworks used across industries such as AI, big data, genomics, and rendering, and can schedule up to 1,000 pods per second, greatly improving scheduling efficiency and resource utilization.
Volcano provides job scheduling, job management, and queue management for computing applications. Its main features are as follows:
Volcano has been open-sourced in GitHub at https://github.com/volcano-sh/volcano.
Install and configure the Volcano add-on in CCE clusters. For details, see Volcano Scheduling.
When using Volcano as a scheduler, use it to schedule all workloads in the cluster. This prevents resource scheduling conflicts caused by multiple schedulers running at the same time.
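For example, a workload can be directed to Volcano by setting the `schedulerName` field in its pod template. The following is a minimal sketch; the Deployment name and image are placeholders:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app                 # placeholder name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: demo-app
  template:
    metadata:
      labels:
        app: demo-app
    spec:
      schedulerName: volcano     # route scheduling to Volcano instead of the default scheduler
      containers:
      - name: app
        image: nginx:alpine      # placeholder image
```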
| Parameter | Description |
|---|---|
| Add-on Specifications | Select Standalone, Custom, or HA. |
| Pods | Number of pods created to match the selected add-on specifications. If you select Custom, you can adjust the number of pods as required. |
| Containers | CPU and memory quotas of the containers for the selected add-on specifications. If you select Custom, the recommended values for volcano-controller and volcano-scheduler are listed in the table below. |
| Nodes/Pods in a Cluster | Requested vCPUs (m) | vCPU Limit (m) | Requested Memory (MiB) | Memory Limit (MiB) |
|---|---|---|---|---|
| 50/5,000 | 500 | 2000 | 500 | 2000 |
| 100/10,000 | 1000 | 2500 | 1500 | 2500 |
| 200/20,000 | 1500 | 3000 | 2500 | 3500 |
| 300/30,000 | 2000 | 3500 | 3500 | 4500 |
| 400/40,000 | 2500 | 4000 | 4500 | 5500 |
| 500/50,000 | 3000 | 4500 | 5500 | 6500 |
| 600/60,000 | 3500 | 5000 | 6500 | 7500 |
| 700/70,000 | 4000 | 5500 | 7500 | 8500 |
```yaml
colocation_enable: ''
default_scheduler_conf:
  actions: 'allocate, backfill, preempt'
  tiers:
    - plugins:
        - name: 'priority'
        - name: 'gang'
        - name: 'conformance'
        - name: 'lifecycle'
          arguments:
            lifecycle.MaxGrade: 10
            lifecycle.MaxScore: 200.0
            lifecycle.SaturatedTresh: 1.0
            lifecycle.WindowSize: 10
    - plugins:
        - name: 'drf'
        - name: 'predicates'
        - name: 'nodeorder'
    - plugins:
        - name: 'cce-gpu-topology-predicate'
        - name: 'cce-gpu-topology-priority'
        - name: 'cce-gpu'
    - plugins:
        - name: 'nodelocalvolume'
        - name: 'nodeemptydirvolume'
        - name: 'nodeCSIscheduling'
        - name: 'networkresource'
tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 60
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 60
```
The parameters in the add-on configuration are described below.

colocation_enable: Whether to enable cloud native hybrid deployment.

default_scheduler_conf: The configuration used to schedule pods. It consists of a series of actions and tiers of plugins and is highly scalable. You can specify and implement actions and plugins based on your requirements.

actions: Actions to be executed in each scheduling phase. The configured action sequence is the scheduler execution sequence. The scheduler traverses all jobs to be scheduled and performs actions such as enqueue, allocate, preempt, and backfill in the configured sequence to find the most appropriate node for each job. For details, see Actions. Example:

```yaml
actions: 'allocate, backfill, preempt'
```

NOTE: When configuring actions, use either preempt or enqueue, but not both.

plugins: Implementation details of the algorithms used by actions in different scenarios. For details, see Plugins and Table 4.

tolerations: Tolerations of the add-on to node taints. By default, the add-on can run on nodes with the node.kubernetes.io/not-ready or node.kubernetes.io/unreachable taint (taint effect NoExecute), but it will be evicted after 60 seconds. Example:

```yaml
tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 60
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 60
```
The supported plugins (Table 4) are as follows.

binpack: Schedules pods to nodes with high resource usage (rather than to lightly loaded nodes) to reduce resource fragmentation. Demonstration:

```yaml
- plugins:
    - name: binpack
      arguments:
        binpack.weight: 10
        binpack.cpu: 1
        binpack.memory: 1
        binpack.resources: nvidia.com/gpu, example.com/foo
        binpack.resources.nvidia.com/gpu: 2
        binpack.resources.example.com/foo: 3
```

conformance: Prevents key pods, such as those in the kube-system namespace, from being preempted. Demonstration:

```yaml
- plugins:
    - name: 'priority'
    - name: 'gang'
      enablePreemptable: false
    - name: 'conformance'
```

lifecycle: Collects statistics on service scaling rules so that pods with similar lifecycles are preferentially scheduled to the same node. Combined with the horizontal scaling capability of the Autoscaler, resources can be quickly scaled in and released, reducing costs and improving resource utilization. Specifically, the plugin: (1) collects statistics on the lifecycles of pods in a workload and schedules pods with similar lifecycles to the same node; (2) for a cluster configured with an automatic scaling policy, adjusts the scale-in annotation of nodes so that nodes with low usage are preferentially scaled in. Demonstration:

```yaml
- plugins:
    - name: priority
    - name: gang
      enablePreemptable: false
    - name: conformance
    - name: lifecycle
      arguments:
        lifecycle.MaxGrade: 10
        lifecycle.MaxScore: 200.0
        lifecycle.SaturatedTresh: 1.0
        lifecycle.WindowSize: 10
```

gang: Considers a group of pods as a whole for resource allocation. The plugin checks whether the number of scheduled pods in a job meets the minimum requirement for running the job. If yes, all pods in the job are scheduled; if no, none of them are. NOTE: When gang scheduling is used, if the remaining cluster resources are greater than or equal to half of the minimum resources required to run a job but less than the full minimum, Autoscaler scale-outs will not be triggered. Demonstration:

```yaml
- plugins:
    - name: priority
    - name: gang
      enablePreemptable: false
      enableJobStarving: false
    - name: conformance
```

priority: Schedules based on custom workload priorities. Demonstration:

```yaml
- plugins:
    - name: priority
    - name: gang
      enablePreemptable: false
    - name: conformance
```

overcommit: Cluster resources are multiplied by a configured factor before scheduling to improve workload enqueuing efficiency. If all workloads are Deployments, remove this plugin or set the raising factor to 2.0. NOTE: This plugin is supported in Volcano 1.6.5 and later versions. Demonstration:

```yaml
- plugins:
    - name: overcommit
      arguments:
        overcommit-factor: 2.0
```

drf: The Dominant Resource Fairness (DRF) scheduling algorithm, which schedules jobs based on their dominant resource share. Jobs with a smaller share are scheduled with a higher priority. Demonstration:

```yaml
- plugins:
    - name: 'drf'
    - name: 'predicates'
    - name: 'nodeorder'
```

predicates: Determines whether a task can be bound to a node through a series of evaluation algorithms, such as node/pod affinity, taint toleration, node repetition, volume limits, and volume zone matching. Demonstration: see the drf example above.

nodeorder: A common algorithm for selecting nodes. Nodes are scored through simulated resource allocation to find the most suitable node for the current job. Demonstration with scoring weights:

```yaml
- plugins:
    - name: nodeorder
      arguments:
        leastrequested.weight: 1
        mostrequested.weight: 0
        nodeaffinity.weight: 2
        podaffinity.weight: 2
        balancedresource.weight: 1
        tainttoleration.weight: 3
        imagelocality.weight: 1
        podtopologyspread.weight: 2
```

cce-gpu-topology-predicate: GPU topology scheduling pre-selection algorithm.

cce-gpu-topology-priority: GPU topology scheduling priority algorithm.

cce-gpu: GPU resource allocation; supports fractional GPU configurations by working with the gpu add-on. Demonstration (shared by the three GPU plugins):

```yaml
- plugins:
    - name: 'cce-gpu-topology-predicate'
    - name: 'cce-gpu-topology-priority'
    - name: 'cce-gpu'
```

numa-aware: NUMA affinity scheduling. Demonstration:

```yaml
- plugins:
    - name: 'nodelocalvolume'
    - name: 'nodeemptydirvolume'
    - name: 'nodeCSIscheduling'
    - name: 'networkresource'
      arguments:
        NetworkType: vpc-router
    - name: numa-aware
      arguments:
        weight: 10
```

networkresource: Pre-selects and filters nodes based on ENI requirements. The parameters are passed by CCE and do not need to be configured manually. Demonstration:

```yaml
- plugins:
    - name: 'nodelocalvolume'
    - name: 'nodeemptydirvolume'
    - name: 'nodeCSIscheduling'
    - name: 'networkresource'
      arguments:
        NetworkType: vpc-router
```

nodelocalvolume: Filters out nodes that do not meet local volume requirements.

nodeemptydirvolume: Filters out nodes that do not meet emptyDir requirements.

nodeCSIscheduling: Filters out nodes where the Everest component is malfunctioning. Demonstration (shared by nodelocalvolume, nodeemptydirvolume, and nodeCSIscheduling):

```yaml
- plugins:
    - name: 'nodelocalvolume'
    - name: 'nodeemptydirvolume'
    - name: 'nodeCSIscheduling'
    - name: 'networkresource'
```
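As an illustration of how gang scheduling is consumed by workloads, the following is a minimal sketch of a Volcano Job whose minAvailable field sets the gang threshold; the job name, queue, and image are placeholders:

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: gang-demo                    # placeholder name
spec:
  schedulerName: volcano
  minAvailable: 4                    # all 4 pods must be schedulable before any of them starts
  queue: default                     # assumed queue name
  tasks:
    - replicas: 4
      name: worker
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: worker
              image: busybox:latest  # placeholder image
              command: ["sleep", "60"]
```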
| Parameter | Description |
|---|---|
| Multi AZ | - |
| Node Affinity | - |
| Toleration | Using both taints and tolerations allows (but does not force) the add-on Deployment to be scheduled to nodes with matching taints, and controls the eviction policy after the node where the Deployment runs is tainted. The add-on adds default tolerations for the node.kubernetes.io/not-ready and node.kubernetes.io/unreachable taints, respectively, with a toleration time window of 60s. For details, see Taints and Tolerations. |
| Component | Description | Resource Type |
|---|---|---|
| volcano-scheduler | Schedules pods. | Deployment |
| volcano-controller | Synchronizes CRDs. | Deployment |
| volcano-admission | Webhook server that validates and mutates resources such as pods and jobs. | Deployment |
| volcano-agent | Cloud native hybrid deployment agent used for node QoS assurance, CPU burst, and dynamic resource oversubscription. | DaemonSet |
| resource-exporter | Reports the NUMA topology information of nodes. | DaemonSet |
volcano-scheduler is the component responsible for pod scheduling. It consists of a series of actions and plugins: actions define the steps executed in each scheduling phase, and plugins provide the algorithm details for those actions in different scenarios. volcano-scheduler is highly scalable, and you can specify and implement actions and plugins based on your requirements.
Volcano allows you to configure the scheduler during installation, upgrade, and editing. The configuration will be synchronized to volcano-scheduler-configmap.
This section describes how to configure volcano-scheduler.
Only Volcano v1.7.1 and later supports this function. On the new add-on page, options such as resource_exporter_enable are replaced by default_scheduler_conf.
Log in to the CCE console and click the cluster name to access the cluster console. Choose Add-ons in the navigation pane. On the right of the page, locate Volcano Scheduler and click Install or Upgrade. In the Parameters area, configure the Volcano parameters.
```json
{
    "ca_cert": "",
    "default_scheduler_conf": {
        "actions": "allocate, backfill, preempt",
        "tiers": [
            {
                "plugins": [
                    { "name": "priority" },
                    { "name": "gang" },
                    { "name": "conformance" }
                ]
            },
            {
                "plugins": [
                    { "name": "drf" },
                    { "name": "predicates" },
                    { "name": "nodeorder" }
                ]
            },
            {
                "plugins": [
                    { "name": "cce-gpu-topology-predicate" },
                    { "name": "cce-gpu-topology-priority" },
                    { "name": "cce-gpu" },
                    { "name": "numa-aware" }   # Add this plugin to also enable resource_exporter.
                ]
            },
            {
                "plugins": [
                    { "name": "nodelocalvolume" },
                    { "name": "nodeemptydirvolume" },
                    { "name": "nodeCSIscheduling" },
                    { "name": "networkresource" }
                ]
            }
        ]
    },
    "server_cert": "",
    "server_key": ""
}
```
After this function is enabled, you can use the functions of both numa-aware and resource_exporter.
If you want to use the original configuration after the plugin is upgraded, perform the following steps:
```yaml
# kubectl edit cm volcano-scheduler-configmap -n kube-system
apiVersion: v1
data:
  default-scheduler.conf: |-
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: priority
      - name: gang
      - name: conformance
    - plugins:
      - name: drf
      - name: predicates
      - name: nodeorder
      - name: binpack
        arguments:
          binpack.cpu: 100
          binpack.weight: 10
          binpack.resources: nvidia.com/gpu
          binpack.resources.nvidia.com/gpu: 10000
    - plugins:
      - name: cce-gpu-topology-predicate
      - name: cce-gpu-topology-priority
      - name: cce-gpu
    - plugins:
      - name: nodelocalvolume
      - name: nodeemptydirvolume
      - name: nodeCSIscheduling
      - name: networkresource
```
```json
{
    "ca_cert": "",
    "default_scheduler_conf": {
        "actions": "enqueue, allocate, backfill",
        "tiers": [
            {
                "plugins": [
                    { "name": "priority" },
                    { "name": "gang" },
                    { "name": "conformance" }
                ]
            },
            {
                "plugins": [
                    { "name": "drf" },
                    { "name": "predicates" },
                    { "name": "nodeorder" },
                    {
                        "name": "binpack",
                        "arguments": {
                            "binpack.cpu": 100,
                            "binpack.weight": 10,
                            "binpack.resources": "nvidia.com/gpu",
                            "binpack.resources.nvidia.com/gpu": 10000
                        }
                    }
                ]
            },
            {
                "plugins": [
                    { "name": "cce-gpu-topology-predicate" },
                    { "name": "cce-gpu-topology-priority" },
                    { "name": "cce-gpu" }
                ]
            },
            {
                "plugins": [
                    { "name": "nodelocalvolume" },
                    { "name": "nodeemptydirvolume" },
                    { "name": "nodeCSIscheduling" },
                    { "name": "networkresource" }
                ]
            }
        ]
    },
    "server_cert": "",
    "server_key": ""
}
```
When this function is used, the original content in volcano-scheduler-configmap will be overwritten. Therefore, you must check whether volcano-scheduler-configmap has been modified during the upgrade. If yes, synchronize the modification to the upgrade page.
volcano-scheduler exposes Prometheus metrics through port 8080. You can build a Prometheus collector to identify and obtain volcano-scheduler scheduling metrics from http://{{volcano-schedulerPodIP}}:{{volcano-schedulerPodPort}}/metrics.
Prometheus metrics can be exposed only by the Volcano add-on of version 1.8.5 or later.
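As a sketch of how such a collector might be configured, the following Prometheus scrape job discovers scheduler pods in kube-system and scrapes port 8080. The job name and the pod label selector are assumptions and must match the labels actually set on your volcano-scheduler pods:

```yaml
scrape_configs:
  - job_name: 'volcano-scheduler'          # hypothetical job name
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: ['kube-system']
    relabel_configs:
      # Keep only volcano-scheduler pods; the label key/value is an assumption.
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: volcano-scheduler
        action: keep
      # Rewrite the target address to the metrics port (8080, as stated above).
      - source_labels: [__address__]
        regex: '([^:]+)(?::\d+)?'
        replacement: '${1}:8080'
        target_label: __address__
```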
| Metric | Type | Description | Label |
|---|---|---|---|
| e2e_scheduling_latency_milliseconds | Histogram | E2E scheduling latency (ms) (scheduling algorithm + binding) | None |
| e2e_job_scheduling_latency_milliseconds | Histogram | E2E job scheduling latency (ms) | None |
| e2e_job_scheduling_duration | Gauge | E2E job scheduling duration | labels=["job_name", "queue", "job_namespace"] |
| plugin_scheduling_latency_microseconds | Histogram | Plugin scheduling latency (µs) | labels=["plugin", "OnSession"] |
| action_scheduling_latency_microseconds | Histogram | Action scheduling latency (µs) | labels=["action"] |
| task_scheduling_latency_milliseconds | Histogram | Task scheduling latency (ms) | None |
| schedule_attempts_total | Counter | Number of pod scheduling attempts. unschedulable indicates that the pods cannot be scheduled; error indicates an internal scheduler fault. | labels=["result"] |
| pod_preemption_victims | Gauge | Number of selected preemption victims | None |
| total_preemption_attempts | Counter | Total number of preemption attempts in the cluster | None |
| unschedule_task_count | Gauge | Number of unschedulable tasks | labels=["job_id"] |
| unschedule_job_count | Gauge | Number of unschedulable jobs | None |
| job_retry_counts | Counter | Number of job retries | labels=["job_id"] |
After the add-on is uninstalled, all custom Volcano resources (see Table 8) will be deleted, including any resources you created. Reinstalling the add-on will not inherit or restore the tasks that existed before the uninstallation. It is a good practice to uninstall the Volcano add-on only when no custom Volcano resources are in use in the cluster.
| Item | API Group | API Version | Resource Level |
|---|---|---|---|
| Command | bus.volcano.sh | v1alpha1 | Namespaced |
| Job | batch.volcano.sh | v1alpha1 | Namespaced |
| Numatopology | nodeinfo.volcano.sh | v1alpha1 | Cluster |
| PodGroup | scheduling.volcano.sh | v1beta1 | Namespaced |
| Queue | scheduling.volcano.sh | v1beta1 | Cluster |