When a node runs many CPU-bound pods, the workload can move between CPU cores depending on whether the pod is throttled and which CPU cores are available at scheduling time. Many workloads are not sensitive to this migration and work fine without any intervention. However, for workloads where CPU cache affinity and scheduling latency significantly affect performance, extra latency is incurred when the allocated CPU cores come from different NUMA nodes. To resolve this issue, kubelet lets you configure a CPU management policy together with Topology Manager to determine NUMA-aware CPU allocation on the node.
Both the CPU Manager and Topology Manager are kubelet components, but they have the following limitations:
Volcano aims to lift these limitations by making the scheduler NUMA topology aware so that:
For more information, see https://github.com/volcano-sh/volcano/blob/master/docs/design/numa-aware.md.
After a topology policy is configured for a pod, Volcano predicts matching nodes based on that policy. The scheduling process is described in the following table and illustrated in the sketch after it:
| Volcano Topology Policy | Node Scheduling: 1. Filter nodes with the same policy. | Node Scheduling: 2. Check whether the node's CPU topology meets the policy requirements. |
|---|---|---|
| none | No filtering. | None. |
| best-effort | Filter the nodes with the best-effort topology policy. | Best-effort scheduling: Pods are preferentially scheduled to a single NUMA node. If a single NUMA node cannot meet the requested CPU cores, the pods can be scheduled to multiple NUMA nodes. |
| restricted | Filter the nodes with the restricted topology policy. | Restricted scheduling. |
| single-numa-node | Filter the nodes with the single-numa-node topology policy. | Pods can only be scheduled to a single NUMA node. |
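The following Go sketch illustrates this two-step prediction. It is a simplification for illustration only, not Volcano's implementation: the node type, its fields, and the capacity checks are assumptions, and the restricted policy is omitted because its per-node checks are not detailed here.

```go
package main

import "fmt"

// node is a simplified stand-in for a worker node. The real scheduler works
// with the full NUMA topology reported for each node, not just core counts.
type node struct {
	name         string
	policy       string // none, best-effort, restricted, or single-numa-node
	coresPerNUMA int    // allocatable CPU cores on each NUMA node
	numaCount    int    // number of NUMA nodes on the worker
}

// predictNodes mimics the two steps in the table above:
//  1. keep only the nodes whose topology policy matches the pod's policy
//     (the none policy skips filtering entirely);
//  2. check whether the node's CPU topology can satisfy the request.
func predictNodes(nodes []node, podPolicy string, requestedCores int) []node {
	var matched []node
	for _, n := range nodes {
		// Step 1: filter nodes with the same policy.
		if podPolicy != "none" && n.policy != podPolicy {
			continue
		}
		// Step 2: check the CPU topology against the policy.
		switch podPolicy {
		case "single-numa-node":
			// The request must fit on a single NUMA node.
			if requestedCores > n.coresPerNUMA {
				continue
			}
		case "best-effort":
			// A single NUMA node is preferred, but the pod may spread
			// across several NUMA nodes if the node has enough cores.
			if requestedCores > n.coresPerNUMA*n.numaCount {
				continue
			}
		}
		matched = append(matched, n)
	}
	return matched
}

func main() {
	nodes := []node{
		{name: "node-1", policy: "best-effort", coresPerNUMA: 16, numaCount: 2},
		{name: "node-2", policy: "restricted", coresPerNUMA: 16, numaCount: 2},
		{name: "node-4", policy: "single-numa-node", coresPerNUMA: 16, numaCount: 2},
	}
	// Only node-1 passes both steps for a best-effort pod requesting 20 cores.
	fmt.Println(predictNodes(nodes, "best-effort", 20))
}
```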
For example, suppose each worker node has two NUMA nodes that provide CPU resources, with a total of 32 CPU cores per node. The following table lists the resource allocation.
| Worker Node | Node Topology Policy | Total CPU Cores on NUMA Node 1 | Total CPU Cores on NUMA Node 2 |
|---|---|---|---|
| Node 1 | best-effort | 16 | 16 |
| Node 2 | restricted | 16 | 16 |
| Node 3 | restricted | 16 | 16 |
| Node 4 | single-numa-node | 16 | 16 |
Figure 1 shows the scheduling of a pod after a topology policy is configured.
A topology policy aims to schedule pods to the optimal node. In this example, each node is scored to identify the optimal one.
Principle: Schedule pods to the worker nodes that require the fewest NUMA nodes.
The scoring formula is as follows:
score = weight x (100 - 100 x numaNodeNum/maxNumaNodeNum)
Parameters:
- weight: weight of the NUMA Aware Plugin.
- numaNodeNum: number of NUMA nodes required to run the pod on the worker node.
- maxNumaNodeNum: maximum numaNodeNum among all candidate worker nodes for the pod.
For example, three nodes meet the CPU topology policy for a pod, and the weight of the NUMA Aware Plugin is set to 10.
According to the preceding formula, maxNumaNodeNum is 4.
Therefore, the optimal node is Node A.
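Worked through in code, the scoring looks like the sketch below. The numaNodeNum values for Node A, Node B, and Node C are hypothetical, chosen only to match the example above (weight 10, maxNumaNodeNum 4, Node A optimal); they are not taken from the documentation.

```go
package main

import "fmt"

// score implements the documented formula:
//   score = weight x (100 - 100 x numaNodeNum / maxNumaNodeNum)
func score(weight, numaNodeNum, maxNumaNodeNum int) float64 {
	return float64(weight) * (100 - 100*float64(numaNodeNum)/float64(maxNumaNodeNum))
}

func main() {
	weight := 10        // weight of the NUMA Aware Plugin
	maxNumaNodeNum := 4 // largest numaNodeNum among the candidate nodes
	// Hypothetical candidates: Node A needs 1 NUMA node, Node B needs 2, Node C needs 4.
	for _, c := range []struct {
		name        string
		numaNodeNum int
	}{
		{"Node A", 1},
		{"Node B", 2},
		{"Node C", 4},
	} {
		fmt.Printf("%s: %.0f\n", c.name, score(weight, c.numaNodeNum, maxNumaNodeNum))
	}
	// Output: Node A: 750, Node B: 500, Node C: 0 -> Node A is the optimal node.
}
```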
Valid topology policies include none, best-effort, restricted, and single-numa-node. For details, see Pod Scheduling Prediction.
Volcano 1.7.1 or later
{ "ca_cert": "", "default_scheduler_conf": { "actions": "allocate, backfill, preempt", "tiers": [ { "plugins": [ { "name": "priority" }, { "name": "gang" }, { "name": "conformance" } ] }, { "plugins": [ { "name": "drf" }, { "name": "predicates" }, { "name": "nodeorder" } ] }, { "plugins": [ { "name": "cce-gpu-topology-predicate" }, { "name": "cce-gpu-topology-priority" }, { "name": "cce-gpu" }, { // add this also enable resource_exporter "name": "numa-aware", // the weight of the NUMA Aware Plugin "arguments": { "weight": "10" } } ] }, { "plugins": [ { "name": "nodelocalvolume" }, { "name": "nodeemptydirvolume" }, { "name": "nodeCSIscheduling" }, { "name": "networkresource" } ] } ] }, "server_cert": "", "server_key": "" }
{ "plugins": { "eas_service": { "availability_zone_id": "", "driver_id": "", "enable": "false", "endpoint": "", "flavor_id": "", "network_type": "", "network_virtual_subnet_id": "", "pool_id": "", "project_id": "", "secret_name": "eas-service-secret" } }, "resource_exporter_enable": "true" }
kubectl get numatopo
NAME     AGE
node-1   4h8m
node-2   4h8m
node-3   4h8m
kind: ConfigMap
apiVersion: v1
metadata:
  name: volcano-scheduler-configmap
  namespace: kube-system
data:
  default-scheduler.conf: |-
    actions: "allocate, backfill, preempt"
    tiers:
    - plugins:
      - name: priority
      - name: gang
      - name: conformance
    - plugins:
      - name: overcommit
      - name: drf
      - name: predicates
      - name: nodeorder
    - plugins:
      - name: cce-gpu-topology-predicate
      - name: cce-gpu-topology-priority
      - name: cce-gpu
    - plugins:
      - name: nodelocalvolume
      - name: nodeemptydirvolume
      - name: nodeCSIscheduling
      - name: networkresource
        arguments:
          NetworkType: vpc-router
      - name: numa-aware      # Add this to enable the numa-aware plugin.
        arguments:
          weight: 10          # Weight of the NUMA Aware Plugin.
kind: Deployment
apiVersion: apps/v1
metadata:
  name: numa-tset
spec:
  replicas: 1
  selector:
    matchLabels:
      app: numa-tset
  template:
    metadata:
      labels:
        app: numa-tset
      annotations:
        volcano.sh/numa-topology-policy: single-numa-node    # Configure the topology policy.
    spec:
      containers:
      - name: container-1
        image: nginx:alpine
        resources:
          requests:
            cpu: 2            # The value must be an integer and must be the same as that in limits.
            memory: 2048Mi
          limits:
            cpu: 2            # The value must be an integer and must be the same as that in requests.
            memory: 2048Mi
      imagePullSecrets:
      - name: default-secret
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: vj-test
spec:
  schedulerName: volcano
  minAvailable: 1
  tasks:
  - replicas: 1
    name: "test"
    topologyPolicy: best-effort    # Set the topology policy for the task.
    template:
      spec:
        containers:
        - image: alpine
          command: ["/bin/sh", "-c", "sleep 1000"]
          imagePullPolicy: IfNotPresent
          name: running
          resources:
            limits:
              cpu: 20
              memory: "100Mi"
        restartPolicy: OnFailure
The following table shows example NUMA nodes.
| Worker Node | Topology Manager Policy | Allocatable CPU Cores on NUMA Node 0 | Allocatable CPU Cores on NUMA Node 1 |
|---|---|---|---|
| Node 1 | single-numa-node | 16 | 16 |
| Node 2 | best-effort | 16 | 16 |
| Node 3 | best-effort | 20 | 20 |
In the preceding examples:
- The Deployment uses the single-numa-node policy and requests 2 CPU cores, so its pod can only be scheduled to Node 1, the only node with the single-numa-node policy.
- The Volcano job uses the best-effort policy and requests 20 CPU cores, so only Node 2 and Node 3 pass filtering. On Node 2, the request would span two NUMA nodes (16 allocatable cores each), while Node 3 can satisfy it within a single NUMA node (20 allocatable cores). Following the scoring principle, the task is preferentially scheduled to Node 3.
Run the lscpu command to check the CPU and NUMA distribution of the current node.
# Check the CPU and NUMA distribution of the current node.
lscpu
...
CPU(s):              32
NUMA node(s):        2
NUMA node0 CPU(s):   0-15
NUMA node1 CPU(s):   16-31
Then, check the NUMA node usage.
# Check the CPU allocation of the current node.
cat /var/lib/kubelet/cpu_manager_state
{"policyName":"static","defaultCpuSet":"0,10-15,25-31","entries":{"777870b5-c64f-42f5-9296-688b9dc212ba":{"container-1":"16-24"},"fb15e10a-b6a5-4aaa-8fcd-76c1aa64e6fd":{"container-1":"1-9"}},"checksum":318470969}
The preceding example shows that two containers are running on the node. One container uses CPU cores 1 to 9 of NUMA node 0, and the other container uses CPU cores 16 to 24 of NUMA node 1.
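If you want to map that state file to NUMA nodes programmatically, the following Go sketch shows one way to do it. It assumes the NUMA layout shown by lscpu above (CPUs 0-15 on NUMA node 0, CPUs 16-31 on NUMA node 1) and parses only the fields that appear in the example output; it is not an official tool.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// cpuManagerState mirrors only the fields of /var/lib/kubelet/cpu_manager_state
// that appear in the example output above.
type cpuManagerState struct {
	PolicyName    string                       `json:"policyName"`
	DefaultCPUSet string                       `json:"defaultCpuSet"`
	Entries       map[string]map[string]string `json:"entries"` // pod UID -> container -> CPU set
}

// numaNodeOf maps a CPU to its NUMA node, assuming the lscpu layout above:
// CPUs 0-15 belong to NUMA node 0 and CPUs 16-31 to NUMA node 1.
func numaNodeOf(cpu int) int {
	if cpu < 16 {
		return 0
	}
	return 1
}

// expandCPUSet turns a CPU set string such as "1-9" or "0,10-15,25-31"
// into a list of CPU numbers.
func expandCPUSet(set string) []int {
	var cpus []int
	for _, part := range strings.Split(set, ",") {
		if lo, hi, isRange := strings.Cut(part, "-"); isRange {
			from, _ := strconv.Atoi(lo)
			to, _ := strconv.Atoi(hi)
			for c := from; c <= to; c++ {
				cpus = append(cpus, c)
			}
		} else {
			c, _ := strconv.Atoi(part)
			cpus = append(cpus, c)
		}
	}
	return cpus
}

func main() {
	raw, err := os.ReadFile("/var/lib/kubelet/cpu_manager_state")
	if err != nil {
		panic(err)
	}
	var state cpuManagerState
	if err := json.Unmarshal(raw, &state); err != nil {
		panic(err)
	}
	fmt.Println("CPU Manager policy:", state.PolicyName)
	for podUID, containers := range state.Entries {
		for container, cpuSet := range containers {
			numaNodes := map[int]bool{}
			for _, cpu := range expandCPUSet(cpuSet) {
				numaNodes[numaNodeOf(cpu)] = true
			}
			fmt.Printf("pod %s, container %s: CPUs %s, NUMA nodes %v\n",
				podUID, container, cpuSet, numaNodes)
		}
	}
}
```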