Reviewed-by: Eotvos, Oliver <oliver.eotvos@t-systems.com> Co-authored-by: qiujiandong1 <qiujiandong1@huawei.com> Co-committed-by: qiujiandong1 <qiujiandong1@huawei.com>
30 KiB
CCE AI Suite (NVIDIA GPU)
Add-on Overview
CCE AI Suite (NVIDIA GPU) is a device management add-on that supports GPUs in containers. To use GPU nodes in a cluster, this add-on must be installed.
Add-on Parameters
Parameter |
Mandatory |
Type |
Description |
---|---|---|---|
basic |
Yes |
object |
Basic add-on configuration parameters |
custom |
Yes |
Table 3 object |
Custom parameters |
Parameter |
Mandatory |
Type |
Description |
---|---|---|---|
cluster_version |
No |
String |
CCE cluster version |
device_version |
Yes |
String |
Add-on version |
driver_version |
Yes |
String |
Image tag of an add-on pod where a driver is installed. Generally, the value is the same as that of device_version. |
obs_url |
Yes |
String |
When a GPU driver is downloaded from the default driver address, the value is the GPU driver address. |
swr_addr |
Yes |
String |
Image repository address |
swr_user |
Yes |
String |
Tenant path of an image repository |
Parameter |
Mandatory |
Type |
Description |
---|---|---|---|
compatible_with_legacy_api |
No |
Bool |
API compatibility switch Default value: false true: The add-on supports the GPU native mode and xGPU virtualization. |
component_schedulername |
Yes |
String |
Name of the scheduler used by the add-on. Default value: default-scheduler |
disable_mount_path_v1 |
No |
Bool |
Default value: false true: /opt/cloud/cce/nvidia is not mounted to the /usr/lib/nvidia directory of a GPU container. |
disable_nvidia_gsp |
No |
Bool |
Default value: true true: The GPU GSP firmware is disabled. |
driver_mount_paths |
No |
String |
Driver file directory that needs to be automatically mounted to a GPU container Default value: "bin,lib64" |
enable_fault_isolation |
No |
Bool |
Default value: true true: The add-on detects hardware faults or driver issues of a GPU and then sets the GPU to be unavailable. |
enable_health_monitoring |
No |
Bool |
Default value: true true: The add-on detects hardware faults or driver issues of a GPU. |
enable_metrics_monitoring |
No |
Bool |
Default value: true true: The add-on collects GPU metrics and reports these metrics to Prometheus. |
enable_simple_lib64_mount |
No |
Bool |
Default value: true true: Only the libxxx.so.x file is mounted to a container. |
enable_xgpu |
No |
Bool |
Default value: false Whether to enable xGPU virtualization. |
gpu_driver_config |
No |
Map |
Configurations of the GPU driver for a single node pool Default value: {} |
health_check_xids_v2 |
No |
String |
GPU error range for the add-on health checks Default value: "74,79" |
inject_ld_Library_path |
No |
String |
Value of the LD_LIBRARY_PATH environment variable automatically injected by the add-on to a GPU container Default value: "" |
lib64_container_paths |
No |
String |
Mount path of NVIDIA lib64 in a GPU container Default value: "/usr/lib64,/usr/lib/x86_64-linux-gnu" |
metrics_delete_interval |
No |
int |
Timeout threshold for deleting a metric when the metric cannot be obtained. The unit is millisecond. Default value: 30000 |
metrics_monitor_interval |
No |
int |
Interval for obtaining metrics, in milliseconds. Default value: 15000 |
nvidia_driver_download_url |
Yes |
String |
Path for downloading the NVIDIA driver Default value: "" |
Example Request
{ "kind": "Addon", "apiVersion": "v3", "metadata": { "name": "gpu-beta", }, "spec": { "clusterID": "80c9e306-***-***-***-0255ac100043", "version": "2.0.69", "addonTemplateName": "gpu-beta", "values": { "basic": { "cluster_version": "v1.27", "device_version": "2.0.69", "driver_version": "2.0.69", "obs_url": "***", "region": "***", "swr_addr": "***", "swr_user": "***" }, "custom": { "compatible_with_legacy_api": true, "component_schedulername": "kube-scheduler", "disable_mount_path_v1": false, "disable_nvidia_gsp": true, "driver_mount_paths": "bin,lib64", "enable_fault_isolation": true, "enable_health_monitoring": true, "enable_metrics_monitoring": true, "enable_simple_lib64_mount": true, "enable_xgpu": true, "gpu_driver_config": {}, "health_check_xids_v2": "74,79", "inject_ld_Library_path": "", "lib64_container_paths": "/usr/lib64,/usr/lib/x86_64-linux-gnu", "metrics_delete_interval": 30000, "metrics_monitor_interval": 15000, "nvidia_driver_download_url": "" }, } } }