:original_name: cce_10_0659.html

.. _cce_10_0659:

Node Fault Detection Policy
===========================

The node fault detection function depends on the :ref:`node-problem-detector (npd) ` add-on. The add-on instances run on nodes and monitor the nodes. This section describes how to enable node fault detection.

Prerequisites
-------------

The :ref:`npd ` add-on has been installed in the cluster.

Enabling Node Fault Detection
-----------------------------

#. Log in to the CCE console and click the cluster name to access the cluster console.

#. In the navigation pane on the left, choose **Nodes**. Check whether the npd add-on has been installed in the cluster and whether it has been upgraded to the latest version. After the npd add-on is installed, you can use the fault detection function.

   |image1|

#. If the npd add-on is running properly, click **Node Fault Detection Policy** to view the current fault detection items. For details about the npd check item list, see :ref:`npd Check Items `.

#. If the check result of the current node is abnormal, a message is displayed in the node list, indicating that the metric is abnormal.

   |image2|

#. Click **Abnormal metrics** and rectify the fault as prompted.

   |image3|

Customized Check Items
----------------------

#. Log in to the CCE console and click the cluster name to access the cluster console.

#. In the navigation pane on the left, choose **Nodes** and click **Node Fault Detection Policy**.

#. On the displayed page, view the current check items. Click **Edit** in the **Operation** column and edit a check item.

   Currently, the following configurations are supported:

   - **Enable/Disable**: Enable or disable a check item.

   - **Target Node**: By default, check items run on all nodes. You can change the target nodes for special scenarios. For example, the spot price ECS interruption reclamation check item runs only on spot price ECS nodes.

     |image4|

   - **Trigger Threshold**: The default thresholds match common fault scenarios. You can customize and modify the fault thresholds as required. For example, change the threshold for triggering connection tracking table exhaustion from 90% to 80%.

     |image5|

   - **Check Period**: The default check period is 30 seconds. You can modify this parameter as required.

     |image6|

   - **Troubleshooting Strategy**: After a fault occurs, you can select one of the strategies listed in the following table.

     .. list-table:: **Table 1** Troubleshooting strategies
        :header-rows: 1

        * - Troubleshooting Strategy
          - Effect
        * - Prompting Exception
          - Reports the Kubernetes events.
        * - Disabling scheduling
          - Reports the Kubernetes events and adds the **NoSchedule** taint to the node.
        * - Evict Node Load
          - Reports the Kubernetes events and adds the **NoExecute** taint to the node. This operation will evict workloads on the node and interrupt services. Exercise caution when performing this operation.
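
After a check item whose troubleshooting strategy is **Disabling scheduling** or **Evict Node Load** reports a fault, you can confirm the reported event and the taint on the node with kubectl. The following is a minimal sketch; ``<node-name>`` is a placeholder:

.. code-block::

   # List the warning events reported in the cluster (npd reports faults as Kubernetes events).
   kubectl get events --all-namespaces --field-selector type=Warning
   # Check whether the NoSchedule or NoExecute taint has been added to the node.
   kubectl describe node <node-name> | grep -i taint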

.. _cce_10_0659__en-us_topic_0000001519314622_section321984418184:

npd Check Items
---------------

.. note::

   Check items are supported only by npd 1.16.0 and later versions.

Check items cover events and statuses.

- Event-related

  For event-related check items, when a problem occurs, npd reports an event to the API server. The event type can be **Normal** (normal event) or **Warning** (abnormal event).

  .. list-table:: **Table 2** Event-related check items
     :header-rows: 1

     * - Check Item
       - Function
       - Description
     * - OOMKilling
       - Listen to the kernel logs and check whether OOM events occur and are reported.

         Typical scenario: When the memory usage of a process in a container exceeds the limit, OOM is triggered and the process is terminated.
       - Warning event

         Listening object: **/dev/kmsg**

         Matching rule: "Killed process \\\\d+ (.+) total-vm:\\\\d+kB, anon-rss:\\\\d+kB, file-rss:\\\\d+kB.*"
     * - TaskHung
       - Listen to the kernel logs and check whether taskHung events occur and are reported.

         Typical scenario: Disk I/O suspension causes process suspension.
       - Warning event

         Listening object: **/dev/kmsg**

         Matching rule: "task \\\\S+:\\\\w+ blocked for more than \\\\w+ seconds\\\\."
     * - ReadonlyFilesystem
       - Check whether the **Remount root filesystem read-only** error occurs in the system kernel by listening to the kernel logs.

         Typical scenario: A user detaches a data disk from a node by mistake on the ECS, and applications continuously write data to the mount point of the data disk. As a result, an I/O error occurs in the kernel and the disk is remounted as a read-only disk.
       - Warning event

         Listening object: **/dev/kmsg**

         Matching rule: **Remounting filesystem read-only**
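
  The event-related check items match patterns in the kernel log. If such an event is reported, you can cross-check the kernel log on the node. A minimal sketch (the grep patterns simply mirror the matching rules above, and dmesg reads the same kernel log that npd listens to):

  .. code-block::

     # Search the kernel ring buffer for the messages matched by the event-related check items.
     dmesg | grep -iE "killed process|blocked for more than|remounting filesystem read-only"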

- Status-related

  For status-related check items, when a problem occurs, npd reports an event to the API server and changes the node status synchronously. This function can be used together with :ref:`Node-problem-controller fault isolation ` to isolate nodes.

  **If the check period is not specified in the following check items, the default period is 30 seconds.**

  .. list-table:: **Table 3** Checking system components
     :header-rows: 1

     * - Check Item
       - Function
       - Description
     * - Container network component error

         CNIProblem
       - Check the status of the CNI components (container network components).
       - None
     * - Container runtime component error

         CRIProblem
       - Check the status of Docker and containerd of the CRI components (container runtime components).
       - Check object: Docker or containerd
     * - Frequent restarts of kubelet

         FrequentKubeletRestart
       - Periodically backtrack system logs to check whether the key component kubelet restarts frequently.
       - - Default threshold: 10 restarts within 10 minutes

           If kubelet restarts 10 times within 10 minutes, the system is considered to restart frequently and a fault alarm is generated.

         - Listening object: logs in the **/run/log/journal** directory
     * - Frequent restarts of Docker

         FrequentDockerRestart
       - Periodically backtrack system logs to check whether the container runtime Docker restarts frequently.
       - None
     * - Frequent restarts of containerd

         FrequentContainerdRestart
       - Periodically backtrack system logs to check whether the container runtime containerd restarts frequently.
       - None
     * - kubelet error

         KubeletProblem
       - Check the status of the key component kubelet.
       - None
     * - kube-proxy error

         KubeProxyProblem
       - Check the status of the key component kube-proxy.
       - None
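
  To manually confirm a component fault reported by these check items, you can check the corresponding services on the node. A minimal sketch (the service unit names and log messages depend on the node OS and container runtime):

  .. code-block::

     # Check the status of the container runtime and kubelet on the node.
     systemctl status kubelet containerd
     # Backtrack recent kubelet restarts in the systemd journal (the logs the restart check items listen to).
     journalctl -u kubelet --since "10 min ago" | grep -i "started"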

  .. list-table:: **Table 4** Checking system metrics
     :header-rows: 1

     * - Check Item
       - Function
       - Description
     * - Conntrack table full

         ConntrackFullProblem
       - Check whether the conntrack table is full.
       - - Default threshold: 90%
         - Usage: **nf_conntrack_count**
         - Maximum value: **nf_conntrack_max**
     * - Insufficient disk resources

         DiskProblem
       - Check the usage of the system disk and CCE data disks (including the CRI logical disk and kubelet logical disk) on the node.
       - - Default threshold: 90%

         - Source:

           .. code-block::

              df -h

         Currently, additional data disks are not supported.
     * - Insufficient file handles

         FDProblem
       - Check whether FD file handles are used up.
       - - Default threshold: 90%
         - Usage: the first value in **/proc/sys/fs/file-nr**
         - Maximum value: the third value in **/proc/sys/fs/file-nr**
     * - Insufficient node memory

         MemoryProblem
       - Check whether memory is used up.
       - - Default threshold: 80%
         - Usage: **MemTotal-MemAvailable** in **/proc/meminfo**
         - Maximum value: **MemTotal** in **/proc/meminfo**
     * - Insufficient process resources

         PIDProblem
       - Check whether PID process resources are exhausted.
       - - Default threshold: 90%
         - Usage: **nr_threads in /proc/loadavg**
         - Maximum value: smaller value between **/proc/sys/kernel/pid_max** and **/proc/sys/kernel/threads-max**.
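
  The system metrics above are read from standard kernel interfaces, so an abnormal metric can be verified directly on the node. A minimal sketch (the conntrack files exist only when the nf_conntrack module is loaded):

  .. code-block::

     # Conntrack table usage and maximum (ConntrackFullProblem)
     cat /proc/sys/net/netfilter/nf_conntrack_count /proc/sys/net/netfilter/nf_conntrack_max
     # Allocated, unused, and maximum file handles (FDProblem)
     cat /proc/sys/fs/file-nr
     # Total and available memory (MemoryProblem)
     grep -E "MemTotal|MemAvailable" /proc/meminfo
     # PID and thread limits (PIDProblem)
     cat /proc/sys/kernel/pid_max /proc/sys/kernel/threads-max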

  .. list-table:: **Table 5** Checking the storage
     :header-rows: 1

     * - Check Item
       - Function
       - Description
     * - Disk read-only

         DiskReadonly
       - Periodically perform read and write tests on the system disk and CCE data disks (including the CRI logical disk and kubelet logical disk) of the node to check the availability of key disks.
       - Detection paths:

         - /mnt/paas/kubernetes/kubelet/
         - /var/lib/docker/
         - /var/lib/containerd/
         - /var/paas/sys/log/cceaddon-npd/

         The temporary file **npd-disk-write-ping** is generated in the detection path.

         Currently, additional data disks are not supported.

     * - Insufficient disk resources

         DiskProblem
       - Check the usage of the system disk and CCE data disks (including the CRI logical disk and kubelet logical disk) on the node.
       - - Default threshold: 90%

         - Source:

           .. code-block::

              df -h

         Currently, additional data disks are not supported.

     * - emptyDir storage pool error

         EmptyDirVolumeGroupStatusError
       - Check whether the ephemeral volume group on the node is normal.

         Impact: The pod that depends on the storage pool cannot write data to the temporary volume. The temporary volume is remounted as a read-only file system by the kernel due to an I/O error.

         Typical scenario: When creating a node, a user configures two data disks as a temporary volume storage pool. The user deletes some data disks by mistake. As a result, the storage pool becomes abnormal.
       - - Detection period: 30s

         - Source:

           .. code-block::

              vgs -o vg_name,vg_attr

         - Principle: Check whether the VG (storage pool) is in the P state. If yes, some PVs (data disks) are lost.

         - Joint scheduling: The scheduler can automatically identify a PV storage pool error and prevent pods that depend on the storage pool from being scheduled to the node.

         - Exceptional scenario: If all PVs (data disks) are lost and the VG (storage pool) is therefore lost, the npd add-on cannot detect the fault. In this case, kubelet automatically isolates the node, detects the loss of the VG (storage pool), and updates the corresponding resources in **node.status.allocatable** to **0**. This prevents pods that depend on the storage pool from being scheduled to the node. The damage of a single PV cannot be detected by this check item, but by the ReadonlyFilesystem check item.
     * - PV storage pool error

         LocalPvVolumeGroupStatusError
       - Check the PV group on the node.

         Impact: Pods that depend on the storage pool cannot write data to the persistent volume. The persistent volume is remounted as a read-only file system by the kernel due to an I/O error.

         Typical scenario: When creating a node, a user configures two data disks as a persistent volume storage pool. Some data disks are deleted by mistake.
       - None

     * - Mount point error

         MountPointProblem
       - Check the mount points on the node.

         Exceptional definition: You cannot access the mount point by running the **cd** command.

         Typical scenario: A network file system (for example, obsfs or s3fs) is mounted to the node. When the connection becomes abnormal due to a network fault or an exception on the peer NFS server, all processes that access the mount point are suspended. For example, during a cluster upgrade, kubelet is restarted and scans all mount points. If it accesses the abnormal mount point, the upgrade fails.
       - Alternatively, you can run the following command:

         .. code-block::

            for dir in `df -h | grep -v "Mounted on" | awk "{print \$NF}"`;do cd $dir; done && echo "ok"
     * - Suspended disk I/O

         DiskHung
       - Check whether I/O suspension occurs on all disks on the node, that is, whether I/O read and write requests receive no response.

         Definition of I/O suspension: The system does not respond to disk I/O requests, and some processes are in the D state (see the example after this table).

         Typical scenario: Disks cannot respond due to abnormal OS hard disk drivers or severe faults on the underlying network.
       - - Check object: all data disks

         - Source:

           /proc/diskstat

           Alternatively, you can run the following command:

           .. code-block::

              iostat -xmt 1

         - Threshold:

           - Average usage: ioutil >= 0.99
           - Average I/O queue length: avgqu-sz >= 1
           - Average I/O transfer volume: iops (w/s) + ioth (wMB/s) <= 1

           .. note::

              In some OSs, no data changes during I/O. In this case, calculate the CPU I/O time usage. The value of iowait should be greater than 0.8.

     * - Slow disk I/O

         DiskSlow
       - Check whether all disks on the node have slow I/Os, that is, whether I/Os respond slowly.

         Typical scenario: EVS disks have slow I/Os due to network fluctuation.
       - - Check object: all data disks

         - Source:

           /proc/diskstat

           Alternatively, you can run the following command:

           .. code-block::

              iostat -xmt 1

         - Default threshold:

           Average I/O latency: await >= 5000 ms

           .. note::

              If I/O requests receive no response and the **await** data is not updated, this check item cannot detect the fault.
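
  If DiskHung is reported, the processes blocked on the suspended disk are usually in the uninterruptible sleep (D) state. A minimal sketch for confirming this on the node:

  .. code-block::

     # List processes in the D state (uninterruptible sleep), which usually indicates blocked disk I/O.
     ps -eo state,pid,comm | awk '$1 == "D"'
     # Observe per-disk latency and utilization in real time.
     iostat -xmt 1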

  .. list-table:: **Table 6** Other check items
     :header-rows: 1

     * - Check Item
       - Function
       - Description
     * - Abnormal NTP

         NTPProblem
       - Check whether the node clock synchronization service ntpd or chronyd is running properly and whether the system time drifts.
       - Default clock offset threshold: 8000 ms
     * - Process D error

         ProcessD
       - Check whether there are processes in the D state on the node.
       - Default threshold: 10 abnormal processes detected for three consecutive times

         Source:

         - /proc/{PID}/stat
         - Alternatively, you can run the **ps aux** command.

         Exceptional scenario: ProcessD ignores the resident D processes (heartbeat and update) on which the SDI driver on the BMS node depends.
     * - Process Z error

         ProcessZ
       - Check whether the node has processes in the Z state.
       - None
     * - ResolvConf error

         ResolvConfFileProblem
       - Check whether the ResolvConf file is lost.

         Check whether the ResolvConf file is normal.

         Exceptional definition: No upstream domain name resolution server (nameserver) is included.
       - Object: **/etc/resolv.conf**
     * - Existing scheduled event

         ScheduledEvent
       - Check whether scheduled live migration events exist on the node. A live migration plan event is usually triggered by a hardware fault and is an automatic fault rectification method at the IaaS layer.

         Typical scenario: The host is faulty. For example, the fan is damaged or the disk has bad sectors. As a result, live migration is triggered for VMs.
       - Source:

         - http://169.254.169.254/meta-data/latest/events/scheduled

         This check item is an Alpha feature and is disabled by default.
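
  For the NTP and ResolvConf check items, you can verify the clock synchronization service and the DNS configuration directly on the node. A minimal sketch (whether ntpd or chronyd is present depends on the node image):

  .. code-block::

     # Check which time synchronization service is running and whether the clock is synchronized.
     systemctl status chronyd ntpd
     timedatectl
     # Confirm that /etc/resolv.conf contains at least one upstream nameserver.
     grep nameserver /etc/resolv.conf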

The kubelet component has the following default check items, which have bugs or defects. You can fix them by upgrading the cluster or using npd.

.. list-table:: **Table 7** Default kubelet check items
   :header-rows: 1

   * - Check Item
     - Function
     - Description
   * - Insufficient PID resources

       PIDPressure
     - Check whether PIDs are sufficient.
     - - Interval: 10 seconds

       - Threshold: 90%

       - Defect: In community version 1.23.1 and earlier versions, this check item becomes invalid when over 65535 PIDs are used. For details, see `issue 107107 `__. In community version 1.24 and earlier versions, thread-max is not considered in this check item.
   * - Insufficient memory

       MemoryPressure
     - Check whether the allocable memory for the containers is sufficient.
     - - Interval: 10 seconds

       - Threshold: max. 100 MiB

       - Allocable = Total memory of a node - Reserved memory of a node

       - Defect: This check item checks only the memory consumed by containers, and does not consider that consumed by other elements on the node.
   * - Insufficient disk resources

       DiskPressure
     - Check the disk usage and inodes usage of the kubelet and Docker disks.
     - - Interval: 10 seconds
       - Threshold: 90%
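
The kubelet check items above surface as the standard node conditions **PIDPressure**, **MemoryPressure**, and **DiskPressure**, alongside the conditions set by the status-related npd check items. A minimal sketch for viewing them (``<node-name>`` is a placeholder):

.. code-block::

   # Show the node conditions reported by kubelet and npd.
   kubectl describe node <node-name>
   # Or print only the condition types and statuses.
   kubectl get node <node-name> -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\n"}{end}'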

.. |image1| image:: /_static/images/en-us_image_0000001519067438.png
.. |image2| image:: /_static/images/en-us_image_0000001520080400.png
.. |image3| image:: /_static/images/en-us_image_0000001571360421.png
.. |image4| image:: /_static/images/en-us_image_0000001570344789.png
.. |image5| image:: /_static/images/en-us_image_0000001519063542.png
.. |image6| image:: /_static/images/en-us_image_0000001519544422.png