:original_name: cce_10_0132.html

.. _cce_10_0132:

npd
===

Introduction
------------

node-problem-detector (npd) is an add-on that monitors abnormal events on cluster nodes and can connect to a third-party monitoring platform. It runs as a daemon on each node, collects node problems from different daemons, and reports them to the API server. npd can run as a DaemonSet or as a standalone daemon.

For more information, see `node-problem-detector <https://github.com/kubernetes/node-problem-detector>`__.

Notes and Constraints
---------------------

-  When using this add-on, do not format or partition node disks.
-  Each npd process occupies 30 mCPU and 100 MB of memory.

Permission Description
----------------------

To monitor kernel logs, the npd add-on needs to read **/dev/kmsg** on the host, so privileged mode must be enabled. For details, see `privileged <https://kubernetes.io/docs/concepts/policy/pod-security-policy/#privileged>`__.

In addition, CCE mitigates risks by following the principle of least privilege. npd runs with only the following privileges:

-  cap_dac_read_search: permission to access **/run/log/journal**.
-  cap_sys_admin: permission to access **/dev/kmsg**.

Installing the Add-on
---------------------

#. Log in to the CCE console and access the cluster console. Choose **Add-ons** in the navigation pane, locate **npd** on the right, and click **Install**.

#. On the **Install Add-on** page, select the add-on specifications and set related parameters.

   -  **Pods**: Set the number of pods based on service requirements.
   -  **Containers**: Select a proper container quota based on service requirements.

#. Set the npd parameters and click **Install**.

   The parameters are configurable only in 1.16.0 and later versions. For details, see :ref:`Table 7 <cce_10_0132__en-us_topic_0000001244261007_table205378534248>`.

npd Check Items
---------------

.. note::

   Check items are supported only in 1.16.0 and later versions.

Check items cover events and statuses.

-  Event-related

   For event-related check items, when a problem occurs, npd reports an event to the API server. The event type can be **Normal** (normal event) or **Warning** (abnormal event).

   .. table:: **Table 1** Event-related check items

      +-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------+
      | Check Item            | Function                                                                                                                                                                                                                                                     | Description                                                                                           |
      +=======================+==============================================================================================================================================================================================================================================================+=======================================================================================================+
      | OOMKilling            | Listen to the kernel logs and check whether OOM events occur and are reported.                                                                                                                                                                               | Warning event                                                                                         |
      |                       |                                                                                                                                                                                                                                                              |                                                                                                       |
      |                       | Typical scenario: When the memory usage of a process in a container exceeds the limit, OOM is triggered and the process is terminated.                                                                                                                       | Listening object: **/dev/kmsg**                                                                       |
      |                       |                                                                                                                                                                                                                                                              |                                                                                                       |
      |                       |                                                                                                                                                                                                                                                              | Matching rule: "Killed process \\\\d+ (.+) total-vm:\\\\d+kB, anon-rss:\\\\d+kB, file-rss:\\\\d+kB.*" |
      +-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------+
      | TaskHung              | Listen to the kernel logs and check whether taskHung events occur and are reported.                                                                                                                                                                          | Warning event                                                                                         |
      |                       |                                                                                                                                                                                                                                                              |                                                                                                       |
      |                       | Typical scenario: Disk I/O suspension causes process suspension.                                                                                                                                                                                             | Listening object: **/dev/kmsg**                                                                       |
      |                       |                                                                                                                                                                                                                                                              |                                                                                                       |
      |                       |                                                                                                                                                                                                                                                              | Matching rule: "task \\\\S+:\\\\w+ blocked for more than \\\\w+ seconds\\\\."                         |
      +-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------+
      | ReadonlyFilesystem    | Check whether the **Remount root filesystem read-only** error occurs in the system kernel by listening to the kernel logs.                                                                                                                                   | Warning event                                                                                         |
      |                       |                                                                                                                                                                                                                                                              |                                                                                                       |
      |                       | Typical scenario: A user detaches a data disk from a node by mistake on the ECS, and applications continuously write data to the mount point of the data disk. As a result, an I/O error occurs in the kernel and the disk is remounted as a read-only disk. | Listening object: **/dev/kmsg**                                                                       |
      |                       |                                                                                                                                                                                                                                                              |                                                                                                       |
      |                       |                                                                                                                                                                                                                                                              | Matching rule: **Remounting filesystem read-only**                                                    |
      +-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------+
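
   The matching rules in Table 1 can be tried out against sample log lines. The following is a minimal sketch, not the add-on's implementation: it rewrites the three rules as Python raw strings and assumes a hypothetical helper named ``classify``. The log lines in the usage note are invented for illustration.

   .. code-block:: python

      import re

      # Matching rules from Table 1, written as Python raw strings.
      RULES = {
          "OOMKilling": re.compile(
              r"Killed process \d+ (.+) total-vm:\d+kB, anon-rss:\d+kB, file-rss:\d+kB.*"
          ),
          "TaskHung": re.compile(r"task \S+:\w+ blocked for more than \w+ seconds\."),
          "ReadonlyFilesystem": re.compile(r"Remounting filesystem read-only"),
      }


      def classify(line):
          """Return the first check item whose rule matches the line, else None."""
          for name, pattern in RULES.items():
              if pattern.search(line):
                  return name
          return None

   For example, ``classify("INFO: task jbd2/sda1-8:123 blocked for more than 120 seconds.")`` returns ``"TaskHung"``, and a line that matches none of the rules returns ``None``.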

-  Status-related

   For status-related check items, when a problem occurs, npd reports an event to the API server and changes the node status synchronously. This function can be used together with :ref:`Node-problem-controller fault isolation <cce_10_0132__en-us_topic_0000001244261007_section1471610580474>` to isolate nodes.

   **If the check period is not specified in the following check items, the default period is 30 seconds.**

   .. table:: **Table 2** Checking system components

      +-----------------------------------+-----------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
      | Check Item                        | Function                                                                                                  | Description                                                                                                                             |
      +===================================+===========================================================================================================+=========================================================================================================================================+
      | Container network component error | Check the status of the CNI components (container network components).                                    | None                                                                                                                                    |
      |                                   |                                                                                                           |                                                                                                                                         |
      | CNIProblem                        |                                                                                                           |                                                                                                                                         |
      +-----------------------------------+-----------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
      | Container runtime component error | Check the status of Docker and containerd of the CRI components (container runtime components).           | Check object: Docker or containerd                                                                                                      |
      |                                   |                                                                                                           |                                                                                                                                         |
      | CRIProblem                        |                                                                                                           |                                                                                                                                         |
      +-----------------------------------+-----------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
      | Frequent restarts of Kubelet      | Periodically backtrack system logs to check whether the key component Kubelet restarts frequently.        | -  Default threshold: 10 restarts within 10 minutes                                                                                     |
      |                                   |                                                                                                           |                                                                                                                                         |
      | FrequentKubeletRestart            |                                                                                                           |    If Kubelet restarts 10 times within 10 minutes, it indicates that the system restarts frequently and a fault alarm is generated.     |
      |                                   |                                                                                                           |                                                                                                                                         |
      |                                   |                                                                                                           | -  Listening object: logs in the **/run/log/journal** directory                                                                         |
      +-----------------------------------+-----------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
      | Frequent restarts of Docker       | Periodically backtrack system logs to check whether the container runtime Docker restarts frequently.     |                                                                                                                                         |
      |                                   |                                                                                                           |                                                                                                                                         |
      | FrequentDockerRestart             |                                                                                                           |                                                                                                                                         |
      +-----------------------------------+-----------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
      | Frequent restarts of containerd   | Periodically backtrack system logs to check whether the container runtime containerd restarts frequently. |                                                                                                                                         |
      |                                   |                                                                                                           |                                                                                                                                         |
      | FrequentContainerdRestart         |                                                                                                           |                                                                                                                                         |
      +-----------------------------------+-----------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
      | kubelet error                     | Check the status of the key component Kubelet.                                                            | None                                                                                                                                    |
      |                                   |                                                                                                           |                                                                                                                                         |
      | KubeletProblem                    |                                                                                                           |                                                                                                                                         |
      +-----------------------------------+-----------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
      | kube-proxy error                  | Check the status of the key component kube-proxy.                                                         | None                                                                                                                                    |
      |                                   |                                                                                                           |                                                                                                                                         |
      | KubeProxyProblem                  |                                                                                                           |                                                                                                                                         |
      +-----------------------------------+-----------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
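
   The default threshold for the frequent-restart check items (10 restarts within 10 minutes) can be pictured as a sliding time window. The sketch below only illustrates that idea and is not how npd is implemented. The class name ``RestartWindow`` and its parameters are hypothetical, and it assumes that restart timestamps (in seconds) have already been extracted from the system logs.

   .. code-block:: python

      from collections import deque


      class RestartWindow:
          """Track restart timestamps and flag frequent restarts (illustrative)."""

          def __init__(self, threshold=10, window_seconds=600):
              self.threshold = threshold
              self.window_seconds = window_seconds
              self._times = deque()

          def record(self, timestamp):
              """Record a restart; return True if the threshold is reached."""
              self._times.append(timestamp)
              # Drop restarts that fall outside the window ending at this restart.
              while timestamp - self._times[0] > self.window_seconds:
                  self._times.popleft()
              return len(self._times) >= self.threshold

   With the defaults, ten restarts recorded 30 seconds apart reach the threshold on the tenth call, whereas restarts spaced 2 minutes apart never do, because at most six of them fall within any 10-minute window.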

   .. table:: **Table 3** Checking system metrics

      +--------------------------------+------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------+
      | Check Item                     | Function                                                                                                                     | Description                                                                                                |
      +================================+==============================================================================================================================+============================================================================================================+
      | Conntrack table full           | Check whether the conntrack table is full.                                                                                   | -  Default threshold: 90%                                                                                  |
      |                                |                                                                                                                              |                                                                                                            |
      | ConntrackFullProblem           |                                                                                                                              | -  Usage: **nf_conntrack_count**                                                                           |
      |                                |                                                                                                                              | -  Maximum value: **nf_conntrack_max**                                                                     |
      +--------------------------------+------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------+
      | Insufficient disk resources    | Check the usage of the system disk and CCE data disks (including the CRI logical disk and kubelet logical disk) on the node. | -  Default threshold: 90%                                                                                  |
      |                                |                                                                                                                              |                                                                                                            |
      | DiskProblem                    |                                                                                                                              | -  Source:                                                                                                 |
      |                                |                                                                                                                              |                                                                                                            |
      |                                |                                                                                                                              |    .. code-block::                                                                                         |
      |                                |                                                                                                                              |                                                                                                            |
      |                                |                                                                                                                              |       df -h                                                                                                |
      |                                |                                                                                                                              |                                                                                                            |
      |                                |                                                                                                                              | Currently, additional data disks are not supported.                                                        |
      +--------------------------------+------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------+
      | Insufficient file handles      | Check whether FD file handles are used up.                                                                                   | -  Default threshold: 90%                                                                                  |
      |                                |                                                                                                                              | -  Usage: the first value in **/proc/sys/fs/file-nr**                                                      |
      | FDProblem                      |                                                                                                                              | -  Maximum value: the third value in **/proc/sys/fs/file-nr**                                              |
      +--------------------------------+------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------+
      | Insufficient node memory       | Check whether memory is used up.                                                                                             | -  Default threshold: 80%                                                                                  |
      |                                |                                                                                                                              | -  Usage: **MemTotal-MemAvailable** in **/proc/meminfo**                                                   |
      | MemoryProblem                  |                                                                                                                              | -  Maximum value: **MemTotal** in **/proc/meminfo**                                                        |
      +--------------------------------+------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------+
      | Insufficient process resources | Check whether PID process resources are exhausted.                                                                           | -  Default threshold: 90%                                                                                  |
      |                                |                                                                                                                              | -  Usage: **nr_threads** in **/proc/loadavg**                                                              |
      | PIDProblem                     |                                                                                                                              | -  Maximum value: smaller value between **/proc/sys/kernel/pid_max** and **/proc/sys/kernel/threads-max**. |
      +--------------------------------+------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------+
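
   The usage and maximum values in Table 3 can be combined into a usage ratio and compared with the threshold. The following sketch shows the arithmetic for file handles, memory, and process resources. The function names are hypothetical, the functions take the file contents as text so that they can be tried without access to a node, and treating a ratio equal to the threshold as exceeding it is an assumption rather than documented npd behavior.

   .. code-block:: python

      def fd_usage(file_nr_text):
          """Ratio of the first value to the third value in /proc/sys/fs/file-nr."""
          fields = file_nr_text.split()
          return int(fields[0]) / int(fields[2])


      def memory_usage(meminfo_text):
          """(MemTotal - MemAvailable) / MemTotal from /proc/meminfo content."""
          values = {}
          for line in meminfo_text.splitlines():
              key, _, rest = line.partition(":")
              if rest.strip():
                  values[key.strip()] = int(rest.split()[0])
          total = values["MemTotal"]
          return (total - values["MemAvailable"]) / total


      def pid_usage(loadavg_text, pid_max, threads_max):
          """nr_threads from /proc/loadavg over the smaller of the two limits."""
          # The fourth field looks like "running/total"; the total is nr_threads.
          nr_threads = int(loadavg_text.split()[3].split("/")[1])
          return nr_threads / min(pid_max, threads_max)


      def exceeds(usage, threshold):
          """Assumption: a ratio equal to the threshold counts as exceeding it."""
          return usage >= threshold

   For example, a **/proc/meminfo** reporting ``MemTotal: 8000000 kB`` and ``MemAvailable: 1200000 kB`` gives a usage of 0.85, which is above the default memory threshold of 80%.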

   .. table:: **Table 4** Checking the storage

      +--------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
      | Check Item                     | Function                                                                                                                                                                                                                                                                                                                                                                                           | Description                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |
      +================================+====================================================================================================================================================================================================================================================================================================================================================================================================+=======================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================+
      | Disk read-only                 | Periodically perform read and write tests on the system disk and CCE data disks (including the CRI logical disk and Kubelet logical disk) of the node to check the availability of key disks.                                                                                                                                                                                                      | Detection paths:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
      |                                |                                                                                                                                                                                                                                                                                                                                                                                                    |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
      | DiskReadonly                   |                                                                                                                                                                                                                                                                                                                                                                                                    | -  /mnt/paas/kubernetes/kubelet/                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
      |                                |                                                                                                                                                                                                                                                                                                                                                                                                    | -  /var/lib/docker/                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
      |                                |                                                                                                                                                                                                                                                                                                                                                                                                    | -  /var/lib/containerd/                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
      |                                |                                                                                                                                                                                                                                                                                                                                                                                                    | -  /var/paas/sys/log/cceaddon-npd/                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
      |                                |                                                                                                                                                                                                                                                                                                                                                                                                    |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
      |                                |                                                                                                                                                                                                                                                                                                                                                                                                    | The temporary file **npd-disk-write-ping** is generated in the detection path.                                                                                                                                                                                                                                                                                                                                                                                                                                        |
      |                                |                                                                                                                                                                                                                                                                                                                                                                                                    |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
      |                                |                                                                                                                                                                                                                                                                                                                                                                                                    | Currently, additional data disks are not supported.                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
      +--------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
      | Insufficient disk resources    | Check the usage of the system disk and CCE data disks (including the CRI logical disk and kubelet logical disk) on the node.                                                                                                                                                                                                                                                                       | -  Default threshold: 90%                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
      |                                |                                                                                                                                                                                                                                                                                                                                                                                                    |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
      | DiskProblem                    |                                                                                                                                                                                                                                                                                                                                                                                                    | -  Source:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |
      |                                |                                                                                                                                                                                                                                                                                                                                                                                                    |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
      |                                |                                                                                                                                                                                                                                                                                                                                                                                                    |    .. code-block::                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
      |                                |                                                                                                                                                                                                                                                                                                                                                                                                    |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
      |                                |                                                                                                                                                                                                                                                                                                                                                                                                    |       df -h                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |
      |                                |                                                                                                                                                                                                                                                                                                                                                                                                    |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
      |                                |                                                                                                                                                                                                                                                                                                                                                                                                    | Currently, additional data disks are not supported.                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
      +--------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
      | emptyDir storage pool error    | Check whether the ephemeral volume group on the node is normal.                                                                                                                                                                                                                                                                                                                                    | -  Detection period: 30s                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
      |                                |                                                                                                                                                                                                                                                                                                                                                                                                    |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
      | EmptyDirVolumeGroupStatusError | Impact: The pod that depends on the storage pool cannot write data to the temporary volume. The temporary volume is remounted as a read-only file system by the kernel due to an I/O error.                                                                                                                                                                                                        | -  Source:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |
      |                                |                                                                                                                                                                                                                                                                                                                                                                                                    |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
      |                                | Typical scenario: When creating a node, a user configures two data disks as a temporary volume storage pool. The user deletes some data disks by mistake. As a result, the storage pool becomes abnormal.                                                                                                                                                                                          |    .. code-block::                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
      |                                |                                                                                                                                                                                                                                                                                                                                                                                                    |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
      |                                |                                                                                                                                                                                                                                                                                                                                                                                                    |       vgs -o vg_name, vg_attr                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
      |                                |                                                                                                                                                                                                                                                                                                                                                                                                    |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
      |                                |                                                                                                                                                                                                                                                                                                                                                                                                    | -  Principle: Check whether the VG (storage pool) is in the P state. If yes, some PVs (data disks) are lost.                                                                                                                                                                                                                                                                                                                                                                                                          |
      |                                |                                                                                                                                                                                                                                                                                                                                                                                                    |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
      |                                |                                                                                                                                                                                                                                                                                                                                                                                                    | -  Joint scheduling: The scheduler can automatically identify a PV storage pool error and prevent pods that depend on the storage pool from being scheduled to the node.                                                                                                                                                                                                                                                                                                                                              |
      |                                |                                                                                                                                                                                                                                                                                                                                                                                                    |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
      |                                |                                                                                                                                                                                                                                                                                                                                                                                                    | -  Exceptional scenario: The npd add-on cannot detect the loss of all PVs (data disks), resulting in the loss of VGs (storage pools). In this case, kubelet automatically isolates the node, detects the loss of VGs (storage pools), and updates the corresponding resources in **nodestatus.allocatable** to **0**. This prevents pods that depend on the storage pool from being scheduled to the node. The damage of a single PV cannot be detected by this check item, but by the ReadonlyFilesystem check item. |
      +--------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
      | PV storage pool error          | Check the PV group on the node.                                                                                                                                                                                                                                                                                                                                                                    |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
      |                                |                                                                                                                                                                                                                                                                                                                                                                                                    |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
      | LocalPvVolumeGroupStatusError  | Impact: Pods that depend on the storage pool cannot write data to the persistent volume. The persistent volume is remounted as a read-only file system by the kernel due to an I/O error.                                                                                                                                                                                                          |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
      |                                |                                                                                                                                                                                                                                                                                                                                                                                                    |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
      |                                | Typical scenario: When creating a node, a user configures two data disks as a persistent volume storage pool. Some data disks are deleted by mistake.                                                                                                                                                                                                                                              |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
      +--------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
      | Mount point error              | Check the mount point on the node.                                                                                                                                                                                                                                                                                                                                                                 | Alternatively, you can run the following command:                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
      |                                |                                                                                                                                                                                                                                                                                                                                                                                                    |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
      | MountPointProblem              | Exceptional definition: You cannot access the mount point by running the **cd** command.                                                                                                                                                                                                                                                                                                           | .. code-block::                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
      |                                |                                                                                                                                                                                                                                                                                                                                                                                                    |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
      |                                | Typical scenario: Network File System (NFS), for example, obsfs and s3fs is mounted to a node. When the connection is abnormal due to network or peer NFS server exceptions, all processes that access the mount point are suspended. For example, during a cluster upgrade, a kubelet is restarted, and all mount points are scanned. If the abnormal mount point is detected, the upgrade fails. |    for dir in `df -h | grep -v "Mounted on" | awk "{print \\$NF}"`;do cd $dir; done && echo "ok"                                                                                                                                                                                                                                                                                                                                                                                                                      |
      +--------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
      | Suspended disk I/O             | Check whether I/O suspension occurs on all disks on the node, that is, whether I/O read and write operations are not responded.                                                                                                                                                                                                                                                                    | -  Check object: all data disks                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
      |                                |                                                                                                                                                                                                                                                                                                                                                                                                    |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
      | DiskHung                       | Definition of I/O suspension: The system does not respond to disk I/O requests, and some processes are in the D state.                                                                                                                                                                                                                                                                             | -  Source:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |
      |                                |                                                                                                                                                                                                                                                                                                                                                                                                    |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
      |                                | Typical scenario: Disks cannot respond due to abnormal OS hard disk drivers or severe faults on the underlying network.                                                                                                                                                                                                                                                                            |    /proc/diskstat                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
      |                                |                                                                                                                                                                                                                                                                                                                                                                                                    |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
      |                                |                                                                                                                                                                                                                                                                                                                                                                                                    |    Alternatively, you can run the following command:                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
      |                                |                                                                                                                                                                                                                                                                                                                                                                                                    |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
      |                                |                                                                                                                                                                                                                                                                                                                                                                                                    |    .. code-block::                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
      |                                |                                                                                                                                                                                                                                                                                                                                                                                                    |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
      |                                |                                                                                                                                                                                                                                                                                                                                                                                                    |       iostat -xmt 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
      |                                |                                                                                                                                                                                                                                                                                                                                                                                                    |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
      |                                |                                                                                                                                                                                                                                                                                                                                                                                                    | -  Threshold:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
      |                                |                                                                                                                                                                                                                                                                                                                                                                                                    |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
      |                                |                                                                                                                                                                                                                                                                                                                                                                                                    |    -  Average usage: ioutil >= 0.99                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
      |                                |                                                                                                                                                                                                                                                                                                                                                                                                    |    -  Average I/O queue length: avgqu-sz >= 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
      |                                |                                                                                                                                                                                                                                                                                                                                                                                                    |    -  Average I/O transfer volume: iops (w/s) + ioth (wMB/s) <= 1                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
      |                                |                                                                                                                                                                                                                                                                                                                                                                                                    |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
      |                                |                                                                                                                                                                                                                                                                                                                                                                                                    |    .. note::                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
      |                                |                                                                                                                                                                                                                                                                                                                                                                                                    |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
      |                                |                                                                                                                                                                                                                                                                                                                                                                                                    |       In some OSs, no data changes during I/O. In this case, calculate the CPU I/O time usage. The value of iowait should be greater than 0.8.                                                                                                                                                                                                                                                                                                                                                                        |
      +--------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
      | Slow disk I/O                  | Check whether all disks on the node have slow I/Os, that is, whether I/Os respond slowly.                                                                                                                                                                                                                                                                                                          | -  Check object: all data disks                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
      |                                |                                                                                                                                                                                                                                                                                                                                                                                                    |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
      | DiskSlow                       | Typical scenario: EVS disks have slow I/Os due to network fluctuation.                                                                                                                                                                                                                                                                                                                             | -  Source:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |
      |                                |                                                                                                                                                                                                                                                                                                                                                                                                    |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
      |                                |                                                                                                                                                                                                                                                                                                                                                                                                    |    /proc/diskstat                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
      |                                |                                                                                                                                                                                                                                                                                                                                                                                                    |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
      |                                |                                                                                                                                                                                                                                                                                                                                                                                                    |    Alternatively, you can run the following command:                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
      |                                |                                                                                                                                                                                                                                                                                                                                                                                                    |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
      |                                |                                                                                                                                                                                                                                                                                                                                                                                                    |    .. code-block::                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
      |                                |                                                                                                                                                                                                                                                                                                                                                                                                    |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
      |                                |                                                                                                                                                                                                                                                                                                                                                                                                    |       iostat -xmt 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
      |                                |                                                                                                                                                                                                                                                                                                                                                                                                    |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
      |                                |                                                                                                                                                                                                                                                                                                                                                                                                    | -  Default threshold:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |
      |                                |                                                                                                                                                                                                                                                                                                                                                                                                    |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
      |                                |                                                                                                                                                                                                                                                                                                                                                                                                    |    Average I/O latency: await >= 5000 ms                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
      |                                |                                                                                                                                                                                                                                                                                                                                                                                                    |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
      |                                |                                                                                                                                                                                                                                                                                                                                                                                                    | .. note::                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
      |                                |                                                                                                                                                                                                                                                                                                                                                                                                    |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
      |                                |                                                                                                                                                                                                                                                                                                                                                                                                    |    If I/O requests are not responded and the **await** data is not updated, this check item is invalid.                                                                                                                                                                                                                                                                                                                                                                                                               |
      +--------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
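
   The **await** value used by the DiskSlow check can be reproduced from two samples of the kernel disk counters. The following Python sketch is an illustration only, not the npd implementation: the function names are hypothetical, it reads the counters from **/proc/diskstats** (the kernel file that also backs **iostat**), and it evaluates every listed device rather than only data disks.

   .. code-block:: python

      def parse_diskstats(text):
          """Return {device: (completed_ios, io_time_ms)} from /proc/diskstats text."""
          stats = {}
          for line in text.splitlines():
              fields = line.split()
              if len(fields) < 11:
                  continue  # skip blank or truncated lines
              # fields[3] and fields[7]: reads and writes completed
              # fields[6] and fields[10]: milliseconds spent reading and writing
              completed = int(fields[3]) + int(fields[7])
              io_time_ms = int(fields[6]) + int(fields[10])
              stats[fields[2]] = (completed, io_time_ms)
          return stats


      def average_await_ms(before, after):
          """Average latency per completed I/O between two samples, per device."""
          awaits = {}
          for name, (ios_after, ms_after) in after.items():
              if name not in before:
                  continue
              ios_before, ms_before = before[name]
              completed = ios_after - ios_before
              # No completed I/O means there is no new await value to evaluate.
              awaits[name] = (ms_after - ms_before) / completed if completed > 0 else None
          return awaits


      def slow_disks(awaits, threshold_ms=5000):
          """Devices whose average await reaches the threshold (default 5000 ms)."""
          return sorted(n for n, v in awaits.items() if v is not None and v >= threshold_ms)


      def sample(path="/proc/diskstats"):
          """Read one sample of the counters from the running kernel."""
          with open(path) as f:
              return parse_diskstats(f.read())

   To use the sketch on a node, call **sample()** twice, a second or more apart, and pass both results to **average_await_ms()**. A device that completes no I/O between the two samples yields **None** rather than a latency, so it is never reported as slow. This corresponds to the note in the table: when requests hang and **await** is not updated, the threshold comparison cannot be made.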

   .. table:: **Table 5** Other check items

      +--------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
      | Check Item               | Function                                                                                                                                                                                                | Description                                                                                                                             |
      +==========================+=========================================================================================================================================================================================================+=========================================================================================================================================+
      | Abnormal NTP             | Check whether the node clock synchronization service ntpd or chronyd is running properly and whether a system time drift is caused.                                                                     | Default clock offset threshold: 8000 ms                                                                                                 |
      |                          |                                                                                                                                                                                                         |                                                                                                                                         |
      | NTPProblem               |                                                                                                                                                                                                         |                                                                                                                                         |
      +--------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
      | Process D error          | Check whether the node has processes in the D state (uninterruptible sleep).                                                                                                                            | Default threshold: 10 abnormal processes detected in three consecutive checks                                                           |
      |                          |                                                                                                                                                                                                         |                                                                                                                                         |
      | ProcessD                 |                                                                                                                                                                                                         | Source:                                                                                                                                 |
      |                          |                                                                                                                                                                                                         |                                                                                                                                         |
      |                          |                                                                                                                                                                                                         | -  /proc/{PID}/stat                                                                                                                     |
      |                          |                                                                                                                                                                                                         | -  Alternately, you can run the **ps aux** command.                                                                                     |
      |                          |                                                                                                                                                                                                         |                                                                                                                                         |
      |                          |                                                                                                                                                                                                         | Exceptional scenario: ProcessD ignores the resident D processes (heartbeat and update) on which the SDI driver on the BMS node depends. |
      +--------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
      | Process Z error          | Check whether the node has processes in Z state.                                                                                                                                                        |                                                                                                                                         |
      |                          |                                                                                                                                                                                                         |                                                                                                                                         |
      | ProcessZ                 |                                                                                                                                                                                                         |                                                                                                                                         |
      +--------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
      | ResolvConf error         | Check whether the ResolvConf file is lost.                                                                                                                                                              | Object: **/etc/resolv.conf**                                                                                                            |
      |                          |                                                                                                                                                                                                         |                                                                                                                                         |
      | ResolvConfFileProblem    | Check whether the ResolvConf file is normal.                                                                                                                                                            |                                                                                                                                         |
      |                          |                                                                                                                                                                                                         |                                                                                                                                         |
      |                          | Abnormal condition: The file does not list any upstream DNS server (nameserver).                                                                                                                        |                                                                                                                                         |
      +--------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
      | Existing scheduled event | Check whether scheduled live migration events exist on the node. A live migration plan event is usually triggered by a hardware fault and is an automatic fault rectification method at the IaaS layer. | Source:                                                                                                                                 |
      |                          |                                                                                                                                                                                                         |                                                                                                                                         |
      | ScheduledEvent           | Typical scenario: The host is faulty. For example, the fan is damaged or the disk has bad sectors. As a result, live migration is triggered for VMs.                                                    | -  http://169.254.169.254/meta-data/latest/events/scheduled                                                                             |
      |                          |                                                                                                                                                                                                         |                                                                                                                                         |
      |                          |                                                                                                                                                                                                         | This check item is an Alpha feature and is disabled by default.                                                                         |
      +--------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
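
   The ProcessD and ProcessZ items in Table 5 both depend on the state field of **/proc/{PID}/stat**. The following Python sketch shows one way to count processes in a given state and to apply the "three consecutive times" rule. It is an illustration under stated assumptions, not the npd implementation: the names are hypothetical, the threshold is assumed to be met when the count is at least 10, and the BMS exception for the heartbeat and update processes is not modeled.

   .. code-block:: python

      import glob


      def process_state(stat_line):
          """Return the one-letter state from the content of /proc/{PID}/stat."""
          # The command name is wrapped in parentheses and may itself contain
          # spaces or parentheses, so split after the last ')'.
          return stat_line.rsplit(")", 1)[1].split()[0]


      def count_in_state(stat_lines, state):
          """Count the processes whose state equals `state` (for example "D" or "Z")."""
          return sum(1 for line in stat_lines if process_state(line) == state)


      class ConsecutiveThreshold:
          """Report a problem once `limit` is reached in `times` checks in a row."""

          def __init__(self, limit=10, times=3):
              self.limit = limit
              self.times = times
              self.hits = 0

          def update(self, count):
              # A check below the limit resets the streak.
              self.hits = self.hits + 1 if count >= self.limit else 0
              return self.hits >= self.times


      def read_stat_lines():
          """Read the stat file of every process that is currently listed."""
          lines = []
          for path in glob.glob("/proc/[0-9]*/stat"):
              try:
                  with open(path) as f:
                      lines.append(f.read())
              except OSError:
                  continue  # the process exited between listing and reading
          return lines

   For example, feeding the counter the D-state counts 12, 10, and 11 from three successive checks reports a problem on the third check, and any later check below the limit resets the streak.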

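   The ResolvConfFileProblem condition in Table 5 reduces to two tests: the file exists, and it contains at least one nameserver entry. A minimal Python sketch under those assumptions follows. The function name and the returned strings are hypothetical and do not reflect how npd reports the condition.

   .. code-block:: python

      def resolv_conf_problem(path="/etc/resolv.conf"):
          """Return a reason string if the file looks abnormal, otherwise None."""
          try:
              with open(path) as f:
                  lines = f.read().splitlines()
          except FileNotFoundError:
              return "file is missing"
          for line in lines:
              fields = line.split()
              # A valid entry looks like: nameserver 10.0.0.2
              if len(fields) >= 2 and fields[0] == "nameserver":
                  return None
          return "no nameserver entry"

   Commented-out entries such as **#nameserver 10.0.0.2** are not counted, because the first field of such a line is not the **nameserver** keyword.
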
   The kubelet component provides the following default check items, which have known bugs or defects. You can address them by upgrading the cluster or by using npd.

   .. table:: **Table 6** Default kubelet check items

      +-----------------------------+------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
      | Check Item                  | Function                                                               | Description                                                                                                                                                                                                                                                                                                              |
      +=============================+========================================================================+==========================================================================================================================================================================================================================================================================================================================+
      | Insufficient PID resources  | Check whether PIDs are sufficient.                                     | -  Interval: 10 seconds                                                                                                                                                                                                                                                                                                  |
      |                             |                                                                        | -  Threshold: 90%                                                                                                                                                                                                                                                                                                        |
      | PIDPressure                 |                                                                        | -  Defect: In community version 1.23.1 and earlier versions, this check item becomes invalid when over 65535 PIDs are used. For details, see `issue 107107 <https://github.com/kubernetes/kubernetes/issues/107107>`__. In community version 1.24 and earlier versions, thread-max is not considered in this check item. |
      +-----------------------------+------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
      | Insufficient memory         | Check whether the allocable memory for the containers is sufficient.   | -  Interval: 10 seconds                                                                                                                                                                                                                                                                                                  |
      |                             |                                                                        | -  Threshold: max. 100 MiB                                                                                                                                                                                                                                                                                               |
      | MemoryPressure              |                                                                        | -  Allocable = Total memory of a node - Reserved memory of a node                                                                                                                                                                                                                                                        |
      |                             |                                                                        | -  Defect: This check item checks only the memory consumed by containers, and does not consider that consumed by other elements on the node.                                                                                                                                                                             |
      +-----------------------------+------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
      | Insufficient disk resources | Check the disk usage and inodes usage of the kubelet and Docker disks. | -  Interval: 10 seconds                                                                                                                                                                                                                                                                                                  |
      |                             |                                                                        | -  Threshold: 90%                                                                                                                                                                                                                                                                                                        |
      | DiskPressure                |                                                                        |                                                                                                                                                                                                                                                                                                                          |
      +-----------------------------+------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
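
The thresholds and the allocable-memory formula in the table can be illustrated with a small sketch. This is not how kubelet or npd implements the checks. The function names are hypothetical, and treating the 90% threshold as inclusive is an assumption.

.. code-block:: python

   def exceeds_threshold(used: int, capacity: int, threshold: float = 0.90) -> bool:
       """Return True when used/capacity reaches the threshold.

       Assumption: the threshold is inclusive (>=). The table only states "90%".
       """
       if capacity <= 0:
           raise ValueError("capacity must be positive")
       return used / capacity >= threshold


   def allocable_memory(total_bytes: int, reserved_bytes: int) -> int:
       """Allocable = total memory of a node - reserved memory of a node."""
       return total_bytes - reserved_bytes

For example, **exceeds_threshold(900, 1000)** reports pressure for a resource at 90% usage, which applies to both the PID and the disk checks.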

.. _cce_10_0132__en-us_topic_0000001244261007_section1471610580474:

Node-problem-controller Fault Isolation
---------------------------------------

.. note::

   Fault isolation is supported only by add-ons of 1.16.0 and later versions.

   By default, if multiple nodes become faulty, NPC adds taints to up to 10% of the nodes. You can set **npc.maxTaintedNode** to increase the threshold.

The open source NPD plug-in provides fault detection but not fault isolation. CCE therefore adds the node-problem-controller (NPC) on top of the open source NPD. NPC is implemented with reference to the Kubernetes `node controller <https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/#taint-based-evictions>`__. For faults reported by NPD, NPC automatically adds taints to the affected nodes to isolate them.

.. _cce_10_0132__en-us_topic_0000001244261007_table205378534248:

.. table:: **Table 7** Parameters

   +-----------------------+--------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
   | Parameter             | Description                                                                                                        | Default                                                                                                                                 |
   +=======================+====================================================================================================================+=========================================================================================================================================+
   | npc.enable            | Whether to enable NPC                                                                                              | true                                                                                                                                    |
   |                       |                                                                                                                    |                                                                                                                                         |
   |                       | NPC cannot be disabled in 1.18.0 or later versions.                                                                |                                                                                                                                         |
   +-----------------------+--------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
   | npc.maxTaintedNode    | Maximum number of nodes that NPC can taint when the same fault occurs on multiple nodes, to limit the impact.      | 10%                                                                                                                                     |
   |                       |                                                                                                                    |                                                                                                                                         |
   |                       | The int format and percentage format are supported.                                                                | Value range:                                                                                                                            |
   |                       |                                                                                                                    |                                                                                                                                         |
   |                       |                                                                                                                    | -  The value is in int format and ranges from 1 to infinity.                                                                            |
   |                       |                                                                                                                    | -  The value ranges from 1% to 100%, in percentage. The minimum value of this parameter multiplied by the number of cluster nodes is 1. |
   +-----------------------+--------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
   | npc.affinity          | Node affinity of the controller                                                                                    | N/A                                                                                                                                     |
   +-----------------------+--------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
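
The two accepted formats of **npc.maxTaintedNode** can be illustrated with a small sketch. It is not the NPC implementation. The function name is hypothetical, and rounding the percentage result down is an assumption. The only documented constraint is that the percentage form yields at least one node.

.. code-block:: python

   def max_tainted_nodes(value: str, cluster_nodes: int) -> int:
       """Resolve an npc.maxTaintedNode setting to a node count."""
       value = value.strip()
       if value.endswith("%"):
           percent = int(value[:-1])
           if not 1 <= percent <= 100:
               raise ValueError("percentage must be between 1% and 100%")
           # Assumption: round down, then apply the documented minimum of 1.
           return max(1, cluster_nodes * percent // 100)
       count = int(value)
       if count < 1:
           raise ValueError("integer value must be at least 1")
       return count

With the default **10%**, a 50-node cluster resolves to 5 nodes under these assumptions, and a 5-node cluster resolves to the minimum of 1.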

Collecting Prometheus Metrics
-----------------------------

The NPD daemon pod exposes Prometheus metric data on port 19901. By default, the annotation **metrics.alpha.kubernetes.io/custom-endpoints: '[{"api":"prometheus","path":"/metrics","port":"19901","names":""}]'** is added to the NPD pod. You can build a Prometheus collector that identifies NPD pods by this annotation and obtains their metrics from **http://{{NpdPodIP}}:{{NpdPodPort}}/metrics**.

.. note::

   If the npd add-on version is earlier than 1.16.5, the exposed port of Prometheus metrics is **20257**.
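
A collector could derive the scrape address from the annotation instead of hard-coding the port, which also covers add-on versions that expose a different port. The following is a minimal sketch. The function name and the pod IP address in the usage note are hypothetical, and the annotations are assumed to be available as a plain dictionary.

.. code-block:: python

   import json

   ANNOTATION_KEY = "metrics.alpha.kubernetes.io/custom-endpoints"


   def npd_metrics_urls(annotations: dict, pod_ip: str) -> list:
       """Build metrics URLs from the custom-endpoints annotation of a pod."""
       raw = annotations.get(ANNOTATION_KEY)
       if not raw:
           return []
       urls = []
       for endpoint in json.loads(raw):
           # Only Prometheus-style endpoints are relevant here.
           if endpoint.get("api") != "prometheus":
               continue
           urls.append(f"http://{pod_ip}:{endpoint['port']}{endpoint['path']}")
       return urls

For a pod with the default annotation and the hypothetical IP address 10.0.0.8, the function returns a single URL that points to port 19901 and the **/metrics** path.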

Currently, the metric data includes **problem_counter** and **problem_gauge**, as shown below.

.. code-block::

   # HELP problem_counter Number of times a specific type of problem have occurred.
   # TYPE problem_counter counter
   problem_counter{reason="DockerHung"} 0
   problem_counter{reason="DockerStart"} 0
   problem_counter{reason="EmptyDirVolumeGroupStatusError"} 0
   ...
   # HELP problem_gauge Whether a specific type of problem is affecting the node or not.
   # TYPE problem_gauge gauge
   problem_gauge{reason="CNIIsDown",type="CNIProblem"} 0
   problem_gauge{reason="CNIIsUp",type="CNIProblem"} 0
   problem_gauge{reason="CRIIsDown",type="CRIProblem"} 0
   problem_gauge{reason="CRIIsUp",type="CRIProblem"} 0
   ...
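
Output in this format can be parsed with a few lines of code. The following sketch uses only the Python standard library and handles just the two metric names listed above. It is not a complete Prometheus text-format parser, and the function names are hypothetical.

.. code-block:: python

   import re

   # Matches lines such as: problem_gauge{reason="X",type="Y"} 0
   _SAMPLE = re.compile(r'^(problem_counter|problem_gauge)\{([^}]*)\}\s+(\S+)$')
   _LABEL = re.compile(r'(\w+)="([^"]*)"')


   def parse_npd_metrics(text: str) -> list:
       """Return (metric name, labels, value) tuples from NPD metric text."""
       samples = []
       for line in text.splitlines():
           line = line.strip()
           # Skip blank lines and the HELP and TYPE comment lines.
           if not line or line.startswith("#"):
               continue
           match = _SAMPLE.match(line)
           if match is None:
               continue
           name, labels, value = match.groups()
           samples.append((name, dict(_LABEL.findall(labels)), float(value)))
       return samples


   def active_problems(samples: list) -> list:
       """List the reasons of gauges that currently report a problem (value > 0)."""
       return [
           labels.get("reason", "")
           for name, labels, value in samples
           if name == "problem_gauge" and value > 0
       ]

A collector could pass the body of the metrics response to **parse_npd_metrics** and raise an alarm for every reason returned by **active_problems**.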