diff --git a/umn/source/_static/images/en-us_image_0000001274543860.png b/umn/source/_static/images/en-us_image_0000001274543860.png deleted file mode 100644 index 43183cc..0000000 Binary files a/umn/source/_static/images/en-us_image_0000001274543860.png and /dev/null differ diff --git a/umn/source/_static/images/en-us_image_0000001274544060.png b/umn/source/_static/images/en-us_image_0000001274544060.png deleted file mode 100644 index 3226bda..0000000 Binary files a/umn/source/_static/images/en-us_image_0000001274544060.png and /dev/null differ diff --git a/umn/source/_static/images/en-us_image_0000001274864616.png b/umn/source/_static/images/en-us_image_0000001274864616.png deleted file mode 100644 index 1af07fd..0000000 Binary files a/umn/source/_static/images/en-us_image_0000001274864616.png and /dev/null differ diff --git a/umn/source/_static/images/en-us_image_0000001482541956.png b/umn/source/_static/images/en-us_image_0000001482541956.png new file mode 100644 index 0000000..ca9934b Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001482541956.png differ diff --git a/umn/source/_static/images/en-us_image_0000001482701968.png b/umn/source/_static/images/en-us_image_0000001482701968.png new file mode 100644 index 0000000..9b7e01f Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001482701968.png differ diff --git a/umn/source/_static/images/en-us_image_0000001533181077.png b/umn/source/_static/images/en-us_image_0000001533181077.png new file mode 100644 index 0000000..47324ea Binary files /dev/null and b/umn/source/_static/images/en-us_image_0000001533181077.png differ diff --git a/umn/source/add-ons/index.rst b/umn/source/add-ons/index.rst index d82ef54..e622c8b 100644 --- a/umn/source/add-ons/index.rst +++ b/umn/source/add-ons/index.rst @@ -9,6 +9,7 @@ Add-ons - :ref:`coredns (System Resource Add-On, Mandatory) ` - :ref:`storage-driver (System Resource Add-On, Discarded) ` - :ref:`everest (System Resource Add-On, Mandatory) ` +- :ref:`npd ` - :ref:`autoscaler ` - :ref:`metrics-server ` - :ref:`gpu-beta ` @@ -22,6 +23,7 @@ Add-ons coredns_system_resource_add-on_mandatory storage-driver_system_resource_add-on_discarded everest_system_resource_add-on_mandatory + npd autoscaler metrics-server gpu-beta diff --git a/umn/source/add-ons/npd.rst b/umn/source/add-ons/npd.rst new file mode 100644 index 0000000..10d233e --- /dev/null +++ b/umn/source/add-ons/npd.rst @@ -0,0 +1,384 @@ +:original_name: cce_10_0132.html + +.. _cce_10_0132: + +npd +=== + +Introduction +------------ + +node-problem-detector (npd for short) is an add-on that monitors abnormal events of cluster nodes and connects to a third-party monitoring platform. It is a daemon running on each node. It collects node issues from different daemons and reports them to the API server. The npd add-on can run as a DaemonSet or a daemon. + +For more information, see `node-problem-detector `__. + +Notes and Constraints +--------------------- + +- When using this add-on, do not format or partition node disks. +- Each npd process occupies 30 mCPU and 100 MB memory. + +Permission Description +---------------------- + +To monitor kernel logs, the npd add-on needs to read the host **/dev/kmsg**. Therefore, the privileged mode must be enabled. For details, see `privileged `__. + +In addition, CCE mitigates risks according to the least privilege principle. Only the following privileges are available for npd running: + +- cap_dac_read_search: permission to access **/run/log/journal**. 
+- cap_sys_admin: permission to access **/dev/kmsg**. + +Installing the Add-on +--------------------- + +#. Log in to the CCE console, click the cluster name, and access the cluster console. Choose **Add-ons** in the navigation pane, locate **npd** on the right, and click **Install**. + +#. On the **Install Add-on** page, select the add-on specifications and set related parameters. + + - **Pods**: Set the number of pods based on service requirements. + - **Containers**: Select a proper container quota based on service requirements. + +#. Set the parameters according to the following table and click **Install**. + + Only 1.16.0 and later versions support the configurations. + + **npc.enable**: indicates whether to enable :ref:`Node-problem-controller `. + +npd Check Items +--------------- + +.. note:: + + Check items are supported only in 1.16.0 and later versions. + +Check items cover events and statuses. + +- Event-related + + For event-related check items, when a problem occurs, npd reports an event to the API server. The event type can be **Normal** (normal event) or **Warning** (abnormal event). + + .. table:: **Table 1** Event-related check items + + +-----------------------+-------------------------------------------------------+-----------------------+ + | Check Item | Function | Description | + +=======================+=======================================================+=======================+ + | OOMKilling | Check whether OOM events occur and are reported. | Warning event | + +-----------------------+-------------------------------------------------------+-----------------------+ + | TaskHung | Check whether taskHung events occur and are reported. | Warning event | + +-----------------------+-------------------------------------------------------+-----------------------+ + | KernelOops | Check kernel nil pointer panic errors. | Warning event | + +-----------------------+-------------------------------------------------------+-----------------------+ + | ConntrackFull | Check whether the conntrack table is full. | Warning event | + | | | | + | | | Interval: 30 seconds | + | | | | + | | | Threshold: 80% | + +-----------------------+-------------------------------------------------------+-----------------------+ + +- Status-related + + For status-related check items, when a problem occurs, npd reports an event to the API server and changes the node status synchronously. This function can be used together with :ref:`Node-problem-controller fault isolation ` to isolate nodes. + + **If the check period is not specified in the following check items, the default period is 30 seconds.** + + .. table:: **Table 2** Application and OS check items + + +---------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Check Item | Function | Description | + +===========================+===============================================================================================================================================================+============================================================================================================================================================+ + | FrequentKubeletRestart | Check whether kubelet restarts frequently by listening to journald logs. 
| - Interval: 5 minutes | + | | | | + | | | - Backtracking: 10 minutes | + | | | | + | | | - Threshold: 10 times | + | | | | + | | | If the system restarts for 10 times within the backtracking period, it indicates that the system restarts frequently and a fault alarm is generated. | + | | | | + | | | - Listening object: logs in the **/run/log/journal** directory | + | | | | + | | | .. note:: | + | | | | + | | | The Ubuntu OS does not support the preceding check items due to incompatible log formats. | + +---------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | FrequentDockerRestart | Check whether Docker restarts frequently by listening to journald logs. | | + +---------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | FrequentContainerdRestart | Check whether containerd restarts frequently by listening to journald logs. | | + +---------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | CRIProblem | Check the CRI component status. | Check object: Docker or containerd | + +---------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | KUBELETProblem | Check the kubelet status. | None | + +---------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | NTPProblem | Check the NTP and Chrony service status. | Threshold of the clock offset: 8000 ms | + | | | | + | | Check whether the node clock offsets. | | + +---------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | PIDProblem | Check whether PIDs are sufficient. | - Threshold: 90% | + | | | - Usage: nr_threads in /proc/loadavg | + | | | - Maximum value: smaller value between **/proc/sys/kernel/pid_max** and **/proc/sys/kernel/threads-max**. 
| + +---------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | FDProblem | Check whether file handles are sufficient. | - Threshold: 90% | + | | | - Usage: the first value in **/proc/sys/fs/file-nr** | + | | | - Maximum value: the third value in **/proc/sys/fs/file-nr** | + +---------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | MemoryProblem | Check whether the overall node memory is sufficient. | - Threshold: 90% | + | | | - Usage: **MemTotal-MemAvailable** in **/proc/meminfo** | + | | | - Maximum value: **MemTotal** in **/proc/meminfo** | + +---------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | ResolvConfFileProblem | Check whether the ResolvConf file is lost. | Object: **/etc/resolv.conf** | + | | | | + | | Check whether the ResolvConf file is normal. | | + | | | | + | | Exception definition: No upstream domain name resolution server (nameserver) is included. | | + +---------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | ProcessD | Check whether there is a process D on the node. | Source: | + | | | | + | | | - /proc/{PID}/stat | + | | | - Alternately, you can run **ps aux**. | + | | | | + | | | Exception scenario: ProcessD ignores the resident processes (heartbeat and update) that are in the D state that the SDI driver on the BMS node depends on. | + +---------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | ProcessZ | Check whether the node has processes in Z state. | | + +---------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | ScheduledEvent | Check whether host plan events exist on the node. | Source: | + | | | | + | | Typical scenario: The host is faulty, for example, the fan is damaged or the disk has bad sectors. As a result, cold and live migration is triggered for VMs. 
| - http://169.254.169.254/meta-data/latest/events/scheduled | + | | | | + | | | This check item is an Alpha feature and is disabled by default. | + +---------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+ + + .. table:: **Table 3** Network connection check items + + +------------------+------------------------------------------------------+-------------+ + | Check Item | Function | Description | + +==================+======================================================+=============+ + | CNIProblem | Check whether the CNI component is running properly. | None | + +------------------+------------------------------------------------------+-------------+ + | KUBEPROXYProblem | Check whether kube-proxy is running properly. | None | + +------------------+------------------------------------------------------+-------------+ + + .. table:: **Table 4** Storage check items + + +--------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Check Item | Function | Description | + +================================+====================================================================================================================================================================================================================================================================================================================================================================================================+====================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================+ + | ReadonlyFilesystem | Check whether the **Remount root filesystem read-only** error occurs in the system kernel by listening to the kernel logs. | Listening object: **/dev/kmsg** | + | | | | + | | Typical scenario: A user detaches a data disk from a node by mistake on the ECS, and applications continuously write data to the mount point of the data disk. As a result, an I/O error occurs in the kernel and the disk is reattached as a read-only disk. 
| Matching rule: **Remounting filesystem read-only** | + +--------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | DiskReadonly | Check whether the system disk, Docker disk, and kubelet disk are read-only. | Detection paths: | + | | | | + | | | - /mnt/paas/kubernetes/kubelet/ | + | | | - /var/lib/docker/ | + | | | - /var/lib/containerd/ | + | | | - /var/paas/sys/log/cceaddon-npd/ | + | | | | + | | | The temporary file **npd-disk-write-ping** is generated in the detection path. | + | | | | + | | | Currently, additional data disks are not supported. | + +--------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | DiskProblem | Check the usage of the system disk, Docker disk, and kubelet disk. | - Threshold: 80% | + | | | | + | | | - Source: | + | | | | + | | | .. code-block:: | + | | | | + | | | df -h | + | | | | + | | | Currently, additional data disks are not supported. 
| + +--------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | EmptyDirVolumeGroupStatusError | Check whether the ephemeral volume group on the node is normal. | - Detection period: 60s | + | | | | + | | Impact: The pod that depends on the storage pool cannot write data to the temporary volume. The temporary volume is remounted as a read-only file system by the kernel due to an I/O error. | - Source: | + | | | | + | | Typical scenario: When creating a node, a user configures two data disks as a temporary volume storage pool. The user deletes some data disks by mistake. As a result, the storage pool becomes abnormal. | .. code-block:: | + | | | | + | | | vgs -o vg_name, vg_attr | + | | | | + | | | - Principle: Check whether the VG (storage pool) is in the P state. If yes, some PVs (data disks) are lost. | + | | | | + | | | - Joint scheduling: The scheduler can automatically identify an abnormal node and prevent pods that depend on the storage pool from being scheduled to the node. | + | | | | + | | | - Exception scenario: The npd add-on cannot detect the loss of all PVs (data disks), resulting in the loss of VGs (storage pools). In this case, kubelet automatically isolates the node, detects the loss of VGs (storage pools), and updates the corresponding resources in **nodestatus.allocatable** to **0**. This prevents pods that depend on the storage pool from being scheduled to the node. The damage of a single PV cannot be detected. In this case, the ReadonlyFilesystem detection is abnormal. | + +--------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | LocalPvVolumeGroupStatusError | Check the PV group on the node. | | + | | | | + | | Impact: Pods that depend on the storage pool cannot write data to the persistent volume. 
The persistent volume is remounted as a read-only file system by the kernel due to an I/O error. | | + | | | | + | | Typical scenario: When creating a node, a user configures two data disks as a persistent volume storage pool. Some data disks are deleted by mistake. | | + +--------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | MountPointProblem | Check the mount point on the node. | Alternatively, you can run the following command: | + | | | | + | | Exception definition: You cannot access the mount point by running the **cd** command. | .. code-block:: | + | | | | + | | Typical scenario: Network File System (NFS), for example, obsfs and s3fs is mounted to a node. When the connection is abnormal due to network or peer NFS server exceptions, all processes that access the mount point are suspended. For example, during a cluster upgrade, a kubelet is restarted, and all mount points are scanned. If the abnormal mount point is detected, the upgrade fails. | for dir in `df -h | grep -v "Mounted on" | awk "{print \\$NF}"`;do cd $dir; done && echo "ok" | + +--------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | DiskHung | Check whether I/O faults occur on the disk of the node. | - Check object: all data disks | + | | | | + | | Definition of I/O faults: The system does not respond to disk I/O requests, and some processes are in the D state. | - Source: | + | | | | + | | Typical Scenario: Disks cannot respond due to abnormal OS hard disk drivers or severe faults on the underlying network. | /proc/diskstat | + | | | | + | | | Alternatively, you can run the following command: | + | | | | + | | | .. code-block:: | + | | | | + | | | iostat -xmt 1 | + | | | | + | | | - Threshold: | + | | | | + | | | - Average usage. The value of ioutil is greater than or equal to 0.99. 
| + | | | - Average I/O queue length. avgqu-sz >=1 | + | | | - Average I/O transfer volume, iops (w/s) + ioth (wMB/s) < = 1 | + | | | | + | | | .. note:: | + | | | | + | | | In some OSs, no data changes during I/O. In this case, calculate the CPU I/O time usage. The value of iowait is greater than 0.8. | + +--------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | DiskSlow | Check whether slow I/O occurs on the disk of the node. | - Check object: all data disks | + | | | | + | | Definition of slow I/O: The average response time exceeds the threshold. | - Source: | + | | | | + | | Typical scenario: EVS disks have slow I/Os due to network fluctuation. | /proc/diskstat | + | | | | + | | | Alternatively, you can run the following command: | + | | | | + | | | .. code-block:: | + | | | | + | | | iostat -xmt 1 | + | | | | + | | | - Threshold: | + | | | | + | | | Average I/O latency: await > = 5000 ms | + | | | | + | | | .. note:: | + | | | | + | | | If I/O requests are not responded and the **await** data is not updated. In this case, this check item is invalid. | + +--------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + + The kubelet component has the following default check items, which have bugs or defects. You can fix them by upgrading the cluster or using npd. + + .. 
table:: **Table 5** Default kubelet check items + + +-----------------------+------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Check Item | Function | Description | + +=======================+========================================================================+==========================================================================================================================================================================================================================================================================================================================+ + | PIDPressure | Check whether PIDs are sufficient. | - Interval: 10 seconds | + | | | - Threshold: 90% | + | | | - Defect: In community version 1.23.1 and earlier versions, this check item becomes invalid when over 65535 PIDs are used. For details, see `issue 107107 `__. In community version 1.24 and earlier versions, thread-max is not considered in this check item. | + +-----------------------+------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | MemoryPressure | Check whether the allocable memory for the containers is sufficient. | - Interval: 10 seconds | + | | | - Threshold: Max. 100 MiB | + | | | - Allocable = Total memory of a node - Reserved memory of a node | + | | | - Defect: This check item checks only the memory consumed by containers, and does not consider that consumed by other elements on the node. | + +-----------------------+------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | DiskPressure | Check the disk usage and inodes usage of the kubelet and Docker disks. | Interval: 10 seconds | + | | | | + | | | Threshold: 90% | + +-----------------------+------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +.. _cce_10_0132__section1471610580474: + +Node-problem-controller Fault Isolation +--------------------------------------- + +.. note:: + + Fault isolation is supported only by add-ons of 1.16.0 and later versions. + + When installing the npd add-on, set **npc.enable** to **true** to deploy dual Node-problem-controller (NPC). You can deploy NPC as single-instance but such NPC does not ensure high availability. 
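+   A minimal way to observe the isolation behavior is to inspect the conditions that npd reports and the taints that NPC applies, and to remove a leftover taint manually when it is no longer needed. The commands below are only a sketch: **<node-name>** is a placeholder, and the taint key shown is illustrative and depends on the fault isolation rules you enable.
+
+   .. code-block::
+
+      # View the node conditions reported by npd (for example, ReadonlyFilesystem or FrequentKubeletRestart).
+      kubectl describe node <node-name> | grep -A 20 "Conditions:"
+
+      # View the taints that NPC has added to the node.
+      kubectl get node <node-name> -o jsonpath='{.spec.taints}'
+
+      # Manually remove a leftover taint (replace the key with the one actually applied).
+      kubectl taint nodes <node-name> <taint-key>-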
+ + By default, if multiple nodes become faulty, NPC adds taints to only one node. You can set **npc.maxTaintedNode** to increase the threshold. When the fault is rectified, NPC is not running and taints remain. You need to manually clear the taints or start NPC. + +The open source NPD plug-in provides fault detection but not fault isolation. CCE enhances the node-problem-controller (NPC) based on the open source NPD. This component is implemented based on the Kubernetes `node controller `__. For faults reported by NPD, NPC automatically adds taints to nodes for node fault isolation. + +You can modify **add-onnpc.customConditionToTaint** according to the following table to configure fault isolation rules. + +.. table:: **Table 6** Parameters + + +--------------------------------+---------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | Default | + +================================+=========================================================+=========================================================================================================================================+ + | npc.enable | Whether to enable NPC | true | + +--------------------------------+---------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ + | npc.customCondtionToTaint | Fault isolation rules | See :ref:`Table 7 `. | + +--------------------------------+---------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ + | npc.customConditionToTaint[i] | Fault isolation rule items | N/A | + +--------------------------------+---------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ + | npc.customConditionToTaint[i]. | Fault status | true | + | | | | + | condition.status | | | + +--------------------------------+---------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ + | npc.customConditionToTaint[i]. | Fault type | N/A | + | | | | + | condition.type | | | + +--------------------------------+---------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ + | npc.customConditionToTaint[i]. | Whether to enable the fault isolation rule. | false | + | | | | + | enable | | | + +--------------------------------+---------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ + | npc.customConditionToTaint[i]. 
| Fault isolation effect | NoSchedule | + | | | | + | .taint.effect | NoSchedule, PreferNoSchedule, or NoExecute | Value options: **NoSchedule**, **PreferNoSchedule**, and **NoExecute** | + +--------------------------------+---------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ + | npc. maxTaintedNode | Number of nodes in a cluster that can be tainted by NPC | 1 | + | | | | + | | The int format and percentage format are supported. | Values: | + | | | | + | | | - The value is in int format and ranges from 1 to infinity. | + | | | - The value ranges from 1% to 100%, in percentage. The minimum value of this parameter multiplied by the number of cluster nodes is 1. | + +--------------------------------+---------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ + | Npc.affinity | Node affinity of the controller | N/A | + +--------------------------------+---------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+ + +.. _cce_10_0132__table147438134911: + +.. table:: **Table 7** Fault isolation rule configuration + + +---------------------------+---------------------------------------------------------------------+--------------------------------------+ + | Fault | Fault Details | Taint | + +===========================+=====================================================================+======================================+ + | DiskReadonly | Disk read-only | **NoSchedule**: No new pods allowed. | + +---------------------------+---------------------------------------------------------------------+--------------------------------------+ + | DiskProblem | The disk space is insufficient, and key logical disks are detached. | **NoSchedule**: No new pods allowed. | + +---------------------------+---------------------------------------------------------------------+--------------------------------------+ + | FrequentKubeletRestart | kubelet restarts frequently. | **NoSchedule**: No new pods allowed. | + +---------------------------+---------------------------------------------------------------------+--------------------------------------+ + | FrequentDockerRestart | Docker restarts frequently. | **NoSchedule**: No new pods allowed. | + +---------------------------+---------------------------------------------------------------------+--------------------------------------+ + | FrequentContainerdRestart | containerd restarts frequently. | **NoSchedule**: No new pods allowed. | + +---------------------------+---------------------------------------------------------------------+--------------------------------------+ + | KUBEPROXYProblem | kube-proxy is abnormal. | **NoSchedule**: No new pods allowed. | + +---------------------------+---------------------------------------------------------------------+--------------------------------------+ + | PIDProblem | Insufficient PIDs | **NoSchedule**: No new pods allowed. | + +---------------------------+---------------------------------------------------------------------+--------------------------------------+ + | FDProblem | Insufficient file handles | **NoSchedule**: No new pods allowed. 
| + +---------------------------+---------------------------------------------------------------------+--------------------------------------+ + | MemoryProblem | Insufficient node memory | **NoSchedule**: No new pods allowed. | + +---------------------------+---------------------------------------------------------------------+--------------------------------------+ + +Collecting Prometheus Metrics +----------------------------- + +The NPD daemon pod exposes Prometheus metric data on port 19901. By default, the NPD pod is added with the annotation **metrics.alpha.kubernetes.io/custom-endpoints: '[{"api":"prometheus","path":"/metrics","port":"19901","names":""}]'**. You can build a Prometheus collector to identify and obtain NPD metrics from **http://{{NpdPodIP}}:{{NpdPodPort}}/metrics**. + +.. note:: + + If the npd add-on version is earlier than 1.16.5, the exposed port of Prometheus metrics is **20257**. + +Currently, the metric data includes **problem_counter** and **problem_gauge**, as shown below. + +.. code-block:: + + # HELP problem_counter Number of times a specific type of problem have occurred. + # TYPE problem_counter counter + problem_counter{reason="DockerHung"} 0 + problem_counter{reason="DockerStart"} 0 + problem_counter{reason="EmptyDirVolumeGroupStatusError"} 0 + ... + # HELP problem_gauge Whether a specific type of problem is affecting the node or not. + # TYPE problem_gauge gauge + problem_gauge{reason="CNIIsDown",type="CNIProblem"} 0 + problem_gauge{reason="CNIIsUp",type="CNIProblem"} 0 + problem_gauge{reason="CRIIsDown",type="CRIProblem"} 0 + problem_gauge{reason="CRIIsUp",type="CRIProblem"} 0 + .. diff --git a/umn/source/add-ons/overview.rst b/umn/source/add-ons/overview.rst index d5cab0a..ce588ac 100644 --- a/umn/source/add-ons/overview.rst +++ b/umn/source/add-ons/overview.rst @@ -9,20 +9,22 @@ CCE provides multiple types of add-ons to extend cluster functions and meet feat .. table:: **Table 1** Add-on list - +-------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ - | Add-on Name | Introduction | - +=========================================================================+==============================================================================================================================================================================================================================================================================================+ - | :ref:`coredns (System Resource Add-On, Mandatory) ` | The coredns add-on is a DNS server that provides domain name resolution services for Kubernetes clusters. coredns chains plug-ins to provide additional features. | - +-------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ - | :ref:`storage-driver (System Resource Add-On, Discarded) ` | storage-driver is a FlexVolume driver used to support IaaS storage services such as EVS, SFS, and OBS. 
| - +-------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ - | :ref:`everest (System Resource Add-On, Mandatory) ` | Everest is a cloud native container storage system. Based on the Container Storage Interface (CSI), clusters of Kubernetes v1.15.6 or later obtain access to cloud storage services. | - +-------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ - | :ref:`autoscaler ` | The autoscaler add-on resizes a cluster based on pod scheduling status and resource usage. | - +-------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ - | :ref:`metrics-server ` | metrics-server is an aggregator for monitoring data of core cluster resources. | - +-------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ - | :ref:`gpu-beta ` | gpu-beta is a device management add-on that supports GPUs in containers. It supports only NVIDIA drivers. | - +-------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ - | :ref:`volcano ` | Volcano provides general-purpose, high-performance computing capabilities, such as job scheduling, heterogeneous chip management, and job running management, serving end users through computing frameworks for different industries, such as AI, big data, gene sequencing, and rendering. 
| - +-------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +-------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Add-on Name | Introduction | + +=========================================================================+=================================================================================================================================================================================================================================================================================================================================+ + | :ref:`coredns (System Resource Add-On, Mandatory) ` | The coredns add-on is a DNS server that provides domain name resolution services for Kubernetes clusters. coredns chains plug-ins to provide additional features. | + +-------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | :ref:`storage-driver (System Resource Add-On, Discarded) ` | storage-driver is a FlexVolume driver used to support IaaS storage services such as EVS, SFS, and OBS. | + +-------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | :ref:`everest (System Resource Add-On, Mandatory) ` | Everest is a cloud native container storage system. Based on the Container Storage Interface (CSI), clusters of Kubernetes v1.15.6 or later obtain access to cloud storage services. | + +-------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | :ref:`npd ` | node-problem-detector (npd for short) is an add-on that monitors abnormal events of cluster nodes and connects to a third-party monitoring platform. It is a daemon running on each node. It collects node issues from different daemons and reports them to the API server. The npd add-on can run as a DaemonSet or a daemon. 
| + +-------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | :ref:`autoscaler ` | The autoscaler add-on resizes a cluster based on pod scheduling status and resource usage. | + +-------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | :ref:`metrics-server ` | metrics-server is an aggregator for monitoring data of core cluster resources. | + +-------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | :ref:`gpu-beta ` | gpu-beta is a device management add-on that supports GPUs in containers. It supports only NVIDIA drivers. | + +-------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | :ref:`volcano ` | Volcano provides general-purpose, high-performance computing capabilities, such as job scheduling, heterogeneous chip management, and job running management, serving end users through computing frameworks for different industries, such as AI, big data, gene sequencing, and rendering. | + +-------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ diff --git a/umn/source/auto_scaling/scaling_a_workload/creating_an_hpa_policy_for_workload_auto_scaling.rst b/umn/source/auto_scaling/scaling_a_workload/creating_an_hpa_policy_for_workload_auto_scaling.rst index 6974b41..9bff10e 100644 --- a/umn/source/auto_scaling/scaling_a_workload/creating_an_hpa_policy_for_workload_auto_scaling.rst +++ b/umn/source/auto_scaling/scaling_a_workload/creating_an_hpa_policy_for_workload_auto_scaling.rst @@ -17,8 +17,6 @@ Notes and Constraints - HPA policies can be created only for clusters of v1.13 or later. -- Only one policy can be created for each workload. You can create an HPA policy. - - For clusters earlier than v1.19.10, if an HPA policy is used to scale out a workload with EVS volumes mounted, the existing pods cannot be read or written when a new pod is scheduled to another node. 
For clusters of v1.19.10 and later, if an HPA policy is used to scale out a workload with EVS volume mounted, a new pod cannot be started because EVS disks cannot be attached. @@ -36,60 +34,69 @@ Procedure .. table:: **Table 1** HPA policy parameters - +--------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ - | Parameter | Description | - +==============================================================+===========================================================================================================================================================================================================================+ - | Policy Name | Name of the policy to be created. Set this parameter as required. | - +--------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ - | Namespace | Namespace to which the workload belongs. | - +--------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ - | Associated Workload | Workload with which the HPA policy is associated. | - +--------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ - | Pod Range | Minimum and maximum numbers of pods. | - | | | - | | When a policy is triggered, the workload pods are scaled within this range. | - +--------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ - | Cooldown Period | Interval between a scale-in and a scale-out. The unit is minute. **The interval cannot be shorter than 1 minute.** | - | | | - | | **This parameter is available only for clusters of v1.15 and later. It is not supported in clusters of v1.13 or earlier.** | - | | | - | | This parameter indicates the interval between consecutive scaling operations. The cooldown period ensures that a scaling operation is initiated only when the previous one is completed and the system is running stably. | - +--------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ - | System Policy | - **Metric**: You can select **CPU usage** or **Memory usage**. | - | | | - | | .. note:: | - | | | - | | Usage = CPUs or memory used by pods/Requested CPUs or memory. | - | | | - | | - **Desired Value**: Enter the desired average resource usage. | - | | | - | | This parameter indicates the desired value of the selected metric. 
Number of pods to be scaled (rounded up) = (Current metric value/Desired value) x Number of current pods | - | | | - | | .. note:: | - | | | - | | When calculating the number of pods to be added or reduced, the HPA policy uses the maximum number of pods in the last 5 minutes. | - | | | - | | - **Tolerance Range**: Scaling is not triggered when the metric value is within the tolerance range. The desired value must be within the tolerance range. | - | | | - | | If the metric value is greater than the scale-in threshold and less than the scale-out threshold, no scaling is triggered. **This parameter is supported only in clusters of v1.15 or later.** | - +--------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ - | Custom Policy (supported only in clusters of v1.15 or later) | .. note:: | - | | | - | | Before setting a custom policy, you need to install an add-on that supports custom metric collection in the cluster, for example, prometheus add-on. | - | | | - | | - **Metric Name**: name of the custom metric. You can select a name as prompted. | - | | | - | | For details, see :ref:`Custom Monitoring `. | - | | | - | | - **Metric Source**: Select an object type from the drop-down list. You can select **Pod**. | - | | | - | | - **Desired Value**: the average metric value of all pods. Number of pods to be scaled (rounded up) = (Current metric value/Desired value) x Number of current pods | - | | | - | | .. note:: | - | | | - | | When calculating the number of pods to be added or reduced, the HPA policy uses the maximum number of pods in the last 5 minutes. | - | | | - | | - **Tolerance Range**: Scaling is not triggered when the metric value is within the tolerance range. The desired value must be within the tolerance range. | - +--------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + +--------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Parameter | Description | + +==============================================================+=========================================================================================================================================================================================================================================================================================================+ + | Policy Name | Name of the policy to be created. Set this parameter as required. | + +--------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Namespace | Namespace to which the workload belongs. 
| + +--------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Associated Workload | Workload with which the HPA policy is associated. | + +--------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Pod Range | Minimum and maximum numbers of pods. | + | | | + | | When a policy is triggered, the workload pods are scaled within this range. | + +--------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Cooldown Period | Interval between a scale-in and a scale-out. The unit is minute. **The interval cannot be shorter than 1 minute.** | + | | | + | | **This parameter is supported only from clusters of v1.15 to v1.23.** | + | | | + | | This parameter indicates the interval between consecutive scaling operations. The cooldown period ensures that a scaling operation is initiated only when the previous one is completed and the system is running stably. | + +--------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Scaling Behavior | **This parameter is supported only in clusters of v1.25 or later.** | + | | | + | | - **Default**: Scales workloads using the Kubernetes default behavior. For details, see `Default Behavior `__. | + | | - **Custom**: Scales workloads using custom policies such as stabilization window, steps, and priorities. Unspecified parameters use the values recommended by Kubernetes. | + | | | + | | - **Disable scale-out/scale-in**: Select whether to disable scale-out or scale-in. | + | | - **Stabilization Window**: A period during which CCE continuously checks whether the metrics used for scaling keep fluctuating. CCE triggers scaling if the desired state is not maintained for the entire window. This window restricts the unwanted flapping of pod count due to metric changes. | + | | - **Step**: specifies the scaling step. You can set the number or percentage of pods to be scaled in or out within a specified period. If there are multiple policies, you can select the policy that maximizes or minimizes the number of pods. 
| + +--------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | System Policy | - **Metric**: You can select **CPU usage** or **Memory usage**. | + | | | + | | .. note:: | + | | | + | | Usage = CPUs or memory used by pods/Requested CPUs or memory. | + | | | + | | - **Desired Value**: Enter the desired average resource usage. | + | | | + | | This parameter indicates the desired value of the selected metric. Number of pods to be scaled (rounded up) = (Current metric value/Desired value) x Number of current pods | + | | | + | | .. note:: | + | | | + | | When calculating the number of pods to be added or reduced, the HPA policy uses the maximum number of pods in the last 5 minutes. | + | | | + | | - **Tolerance Range**: Scaling is not triggered when the metric value is within the tolerance range. The desired value must be within the tolerance range. | + | | | + | | If the metric value is greater than the scale-in threshold and less than the scale-out threshold, no scaling is triggered. **This parameter is supported only in clusters of v1.15 or later.** | + +--------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ + | Custom Policy (supported only in clusters of v1.15 or later) | .. note:: | + | | | + | | Before setting a custom policy, you need to install an add-on that supports custom metric collection in the cluster, for example, prometheus add-on. | + | | | + | | - **Metric Name**: name of the custom metric. You can select a name as prompted. | + | | | + | | For details, see :ref:`Custom Monitoring `. | + | | | + | | - **Metric Source**: Select an object type from the drop-down list. You can select **Pod**. | + | | | + | | - **Desired Value**: the average metric value of all pods. Number of pods to be scaled (rounded up) = (Current metric value/Desired value) x Number of current pods | + | | | + | | .. note:: | + | | | + | | When calculating the number of pods to be added or reduced, the HPA policy uses the maximum number of pods in the last 5 minutes. | + | | | + | | - **Tolerance Range**: Scaling is not triggered when the metric value is within the tolerance range. The desired value must be within the tolerance range. | + +--------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ #. Click **Create**. 
diff --git a/umn/source/auto_scaling/using_hpa_and_ca_for_auto_scaling_of_workloads_and_nodes.rst b/umn/source/auto_scaling/using_hpa_and_ca_for_auto_scaling_of_workloads_and_nodes.rst index 6a2ccb3..5acb210 100644 --- a/umn/source/auto_scaling/using_hpa_and_ca_for_auto_scaling_of_workloads_and_nodes.rst +++ b/umn/source/auto_scaling/using_hpa_and_ca_for_auto_scaling_of_workloads_and_nodes.rst @@ -137,7 +137,7 @@ Creating a Node Pool and a Node Scaling Policy - **Max. Nodes**: Set it to **5**, indicating the maximum number of nodes in a node pool. - **Specifications**: 2 vCPUs \| 4 GiB - Retain the defaults for other parameters. For details, see `Creating a Node Pool `__. + Retain the defaults for other parameters. For details, see `Creating a Node Pool `__. #. Click **Add-ons** on the left of the cluster console, click **Edit** under the autoscaler add-on, modify the add-on configuration, enable **Auto node scale-in**, and configure scale-in parameters. For example, trigger scale-in when the node resource utilization is less than 50%. @@ -147,7 +147,7 @@ Creating a Node Pool and a Node Scaling Policy #. Click **Node Scaling** on the left of the cluster console and click **Create Node Scaling Policy** in the upper right corner. Node scaling policies added here trigger scale-out based on the CPU/memory allocation rate or periodically. - As shown in the following figure, when the cluster CPU allocation rate is greater than 70%, one node will be added. A node scaling policy needs to be associated with a node pool. Multiple node pools can be associated. When you need to scale nodes, node with proper specifications will be added or reduced from the node pool based on the minimum waste principle. For details, see `Creating a Node Scaling Policy `__. + As shown in the following figure, when the cluster CPU allocation rate is greater than 70%, one node will be added. A node scaling policy needs to be associated with a node pool. Multiple node pools can be associated. When you need to scale nodes, node with proper specifications will be added or reduced from the node pool based on the minimum waste principle. For details, see `Creating a Node Scaling Policy `__. |image3| @@ -372,7 +372,7 @@ Observing the Auto Scaling Process You can also view the HPA policy execution history on the console. Wait until the one node is reduced. - The reason why the other two nodes in the node pool are not reduced is that they both have pods in the kube-system namespace (and these pods are not created by DaemonSets). For details about node scale-in, see `Node Scaling Mechanisms `__. + The reason why the other two nodes in the node pool are not reduced is that they both have pods in the kube-system namespace (and these pods are not created by DaemonSets). For details, see `Node Scaling Mechanisms `__. Summary ------- @@ -380,6 +380,6 @@ Summary Using HPA and CA can easily implement auto scaling in most scenarios. In addition, the scaling process of nodes and pods can be easily observed. .. |image1| image:: /_static/images/en-us_image_0000001360670117.png -.. |image2| image:: /_static/images/en-us_image_0000001274543860.png -.. |image3| image:: /_static/images/en-us_image_0000001274544060.png -.. |image4| image:: /_static/images/en-us_image_0000001274864616.png +.. |image2| image:: /_static/images/en-us_image_0000001533181077.png +.. |image3| image:: /_static/images/en-us_image_0000001482541956.png +.. 
|image4| image:: /_static/images/en-us_image_0000001482701968.png diff --git a/umn/source/best_practice/auto_scaling/using_hpa_and_ca_for_auto_scaling_of_workloads_and_nodes.rst b/umn/source/best_practice/auto_scaling/using_hpa_and_ca_for_auto_scaling_of_workloads_and_nodes.rst index 91a6a04..1904a80 100644 --- a/umn/source/best_practice/auto_scaling/using_hpa_and_ca_for_auto_scaling_of_workloads_and_nodes.rst +++ b/umn/source/best_practice/auto_scaling/using_hpa_and_ca_for_auto_scaling_of_workloads_and_nodes.rst @@ -137,7 +137,7 @@ Creating a Node Pool and a Node Scaling Policy - **Max. Nodes**: Set it to **5**, indicating the maximum number of nodes in a node pool. - **Specifications**: 2 vCPUs \| 4 GiB - Retain the defaults for other parameters. For details, see `Creating a Node Pool `__. + Retain the defaults for other parameters. For details, see `Creating a Node Pool `__. #. Click **Add-ons** on the left of the cluster console, click **Edit** under the autoscaler add-on, modify the add-on configuration, enable **Auto node scale-in**, and configure scale-in parameters. For example, trigger scale-in when the node resource utilization is less than 50%. @@ -147,7 +147,7 @@ Creating a Node Pool and a Node Scaling Policy #. Click **Node Scaling** on the left of the cluster console and click **Create Node Scaling Policy** in the upper right corner. Node scaling policies added here trigger scale-out based on the CPU/memory allocation rate or periodically. - As shown in the following figure, when the cluster CPU allocation rate is greater than 70%, one node will be added. A node scaling policy needs to be associated with a node pool. Multiple node pools can be associated. When you need to scale nodes, node with proper specifications will be added or reduced from the node pool based on the minimum waste principle. For details, see `Creating a Node Scaling Policy `__. + As shown in the following figure, when the cluster CPU allocation rate is greater than 70%, one node will be added. A node scaling policy needs to be associated with a node pool. Multiple node pools can be associated. When you need to scale nodes, node with proper specifications will be added or reduced from the node pool based on the minimum waste principle. For details, see `Creating a Node Scaling Policy `__. |image3| @@ -372,7 +372,7 @@ Observing the Auto Scaling Process You can also view the HPA policy execution history on the console. Wait until the one node is reduced. - The reason why the other two nodes in the node pool are not reduced is that they both have pods in the kube-system namespace (and these pods are not created by DaemonSets). For details about node scale-in, see `Node Scaling Mechanisms `__. + The reason why the other two nodes in the node pool are not reduced is that they both have pods in the kube-system namespace (and these pods are not created by DaemonSets). For details, see `Node Scaling Mechanisms `__. Summary ------- @@ -380,6 +380,6 @@ Summary Using HPA and CA can easily implement auto scaling in most scenarios. In addition, the scaling process of nodes and pods can be easily observed. .. |image1| image:: /_static/images/en-us_image_0000001360670117.png -.. |image2| image:: /_static/images/en-us_image_0000001274543860.png -.. |image3| image:: /_static/images/en-us_image_0000001274544060.png -.. |image4| image:: /_static/images/en-us_image_0000001274864616.png +.. |image2| image:: /_static/images/en-us_image_0000001533181077.png +.. 
|image3| image:: /_static/images/en-us_image_0000001482541956.png +.. |image4| image:: /_static/images/en-us_image_0000001482701968.png diff --git a/umn/source/clusters/managing_a_cluster/cluster_overload_control.rst b/umn/source/clusters/managing_a_cluster/cluster_overload_control.rst index 2b33afd..a306066 100644 --- a/umn/source/clusters/managing_a_cluster/cluster_overload_control.rst +++ b/umn/source/clusters/managing_a_cluster/cluster_overload_control.rst @@ -27,31 +27,10 @@ When creating a cluster of v1.23 or later, you can enable overload control durin #. Log in to the CCE console and go to an existing cluster whose version is v1.23 or later. #. On the cluster information page, view the master node information. If overload control is not enabled, a message is displayed. You can click **Start Now** to enable the function. -Overload Monitoring -------------------- - -**Method 1: Using the CCE console** +Disabling Cluster Overload Control +---------------------------------- #. Log in to the CCE console and go to an existing cluster whose version is v1.23 or later. - -#. On the cluster information page, view the master node information. The overload level metric is displayed. - - The overload levels are as follows: - - - Circuit breaking: Rejects all external traffic. - - Severe overload: Rejects 75% external traffic. - - Moderate overload: Rejects 50% external traffic. - - Slight overload: Rejects 25% external traffic. - - Normal: Does not reject external traffic. - -**Method 2: Using the AOM concole** - -You can log in to the AOM console, create a dashboard, and add the metric named **vein_overload_level**. - -The meanings of the monitoring metrics are as follows: - -- 0: Circuit breaking: Rejects all external traffic. -- 1: Severe overload: Rejects 75% external traffic. -- 2: Moderate overload: Rejects 50% external traffic. -- 3: Slight overload: Rejects 25% external traffic. -- 4: Normal: Does not reject external traffic. +#. On the **Cluster Information** page, click **Manage** in the upper right corner. +#. Set **support-overload** to **false** under **kube-apiserver**. +#. Click **OK**. diff --git a/umn/source/networking/ingresses/using_kubectl_to_create_an_elb_ingress.rst b/umn/source/networking/ingresses/using_kubectl_to_create_an_elb_ingress.rst index b9af77d..b46a5f1 100644 --- a/umn/source/networking/ingresses/using_kubectl_to_create_an_elb_ingress.rst +++ b/umn/source/networking/ingresses/using_kubectl_to_create_an_elb_ingress.rst @@ -682,6 +682,8 @@ SNI allows multiple TLS-based access domain names to be provided for external sy You can enable SNI when the preceding conditions are met. The following uses the automatic creation of a load balancer as an example. In this example, **sni-test-secret-1** and **sni-test-secret-2** are SNI certificates. The domain names specified by the certificates must be the same as those in the certificates. +**For clusters of v1.21 or earlier:** + .. code-block:: apiVersion: networking.k8s.io/v1beta1 @@ -722,6 +724,51 @@ You can enable SNI when the preceding conditions are met. The following uses the property: ingress.beta.kubernetes.io/url-match-mode: STARTS_WITH +**For clusters of v1.23 or later:** + +.. 
code-block:: + + apiVersion: networking.k8s.io/v1 + kind: Ingress + metadata: + name: ingress-test + annotations: + kubernetes.io/elb.class: union + kubernetes.io/elb.port: '443' + kubernetes.io/elb.autocreate: + '{ + "type":"public", + "bandwidth_name":"cce-bandwidth-******", + "bandwidth_chargemode":"bandwidth", + "bandwidth_size":5, + "bandwidth_sharetype":"PER", + "eip_type":"5_bgp" + }' + kubernetes.io/elb.tls-ciphers-policy: tls-1-2 + spec: + tls: + - secretName: ingress-test-secret + - hosts: + - example.top # Domain name for which the certificate is issued + secretName: sni-test-secret-1 + - hosts: + - example.com # Domain name for which the certificate is issued + secretName: sni-test-secret-2 + rules: + - host: '' + http: + paths: + - path: '/' + backend: + service: + name: # Replace it with the name of your target Service. + port: + number: 8080 # Replace 8080 with the port number of your target Service. + property: + ingress.beta.kubernetes.io/url-match-mode: STARTS_WITH + pathType: ImplementationSpecific + ingressClassName: cce + Accessing Multiple Services --------------------------- diff --git a/umn/source/nodes/creating_a_node.rst index 8c3f15f..677daed 100644 --- a/umn/source/nodes/creating_a_node.rst +++ b/umn/source/nodes/creating_a_node.rst @@ -14,6 +14,7 @@ Prerequisites Notes and Constraints --------------------- +- The node must have at least 2 vCPUs and 4 GB of memory. - To ensure node stability, a certain amount of CCE node resources will be reserved for Kubernetes components (such as kubelet, kube-proxy, and docker) based on the node specifications. Therefore, the total number of node resources and assignable node resources in Kubernetes are different. The larger the node specifications, the more the containers deployed on the node. Therefore, more node resources need to be reserved to run Kubernetes components. For details, see :ref:`Formula for Calculating the Reserved Resources of a Node `. - The node networking (such as the VM networking and container networking) is taken over by CCE. You are not allowed to add and delete NICs or change routes. If you modify the networking configuration, the availability of CCE may be affected. For example, the NIC named **gw_11cbf51a@eth0** on the node is the container network gateway and cannot be modified. - During the node creation, software packages are downloaded from OBS using the domain name. You need to use a private DNS server to resolve the OBS domain name, and configure the subnet where the node resides with a private DNS server address. When you create a subnet, the private DNS server is used by default. If you change the subnet DNS, ensure that the DNS server in use can resolve the OBS domain name.
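To see the reserved resources described above on a running node, compare the node's total capacity with what Kubernetes actually lets pods use. The following command is a generic kubectl sketch; the node name is a placeholder, not a value from this guide.

.. code-block::

   # Replace <node-name> with a node listed by "kubectl get nodes".
   # "Capacity" shows the full node size; "Allocatable" excludes the resources
   # reserved for Kubernetes components such as kubelet, kube-proxy, and the container runtime.
   kubectl describe node <node-name> | grep -A 6 -E "^(Capacity|Allocatable)"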
diff --git a/umn/source/permissions_management/cluster_permissions_iam-based.rst index f79b314..93b372a 100644 --- a/umn/source/permissions_management/cluster_permissions_iam-based.rst +++ b/umn/source/permissions_management/cluster_permissions_iam-based.rst @@ -66,6 +66,10 @@ The system policies preset for CCE in IAM are **CCEFullAccess** and **CCEReadOnl - **CCE FullAccess**: common operation permissions on CCE cluster resources, excluding the namespace-level permissions for the clusters (with Kubernetes RBAC enabled) and the privileged administrator operations, such as agency configuration and cluster certificate generation - **CCE ReadOnlyAccess**: permissions to view CCE cluster resources, excluding the namespace-level permissions of the clusters (with Kubernetes RBAC enabled) +.. note:: + + The **CCE Admin** and **CCE Viewer** roles will be discarded soon. You are advised to use **CCE FullAccess** and **CCE ReadOnlyAccess**. + Custom Policies --------------- diff --git a/umn/source/storage_flexvolume/using_evs_disks_as_storage_volumes/kubectl_creating_a_pv_from_an_existing_evs_disk.rst index 7ea9ea4..cfe4975 100644 --- a/umn/source/storage_flexvolume/using_evs_disks_as_storage_volumes/kubectl_creating_a_pv_from_an_existing_evs_disk.rst +++ b/umn/source/storage_flexvolume/using_evs_disks_as_storage_volumes/kubectl_creating_a_pv_from_an_existing_evs_disk.rst @@ -147,7 +147,7 @@ Procedure | volumeName | Name of the PV. | +-----------------------------------------------+---------------------------------------------------------------------------------------------+ - **1.11 <= K8s version < 1.11.7** + **Clusters of v1.11 or later but earlier than v1.11.7** - .. _cce_10_0313__li19211184720504: