:original_name: cce_10_0132.html
.. _cce_10_0132:
npd
===
Introduction
------------
node-problem-detector (npd for short) is an add-on that monitors abnormal events on cluster nodes and connects to a third-party monitoring platform. It runs as a daemon on each node, collects node issues from different daemons, and reports them to the API server. npd can run either as a DaemonSet or as a standalone daemon.
For more information, see `node-problem-detector <https://github.com/kubernetes/node-problem-detector>`__.
Notes and Constraints
---------------------
- When using this add-on, do not format or partition node disks.
- Each npd process occupies 30m CPU and 100 MB of memory.
Permission Description
----------------------
To monitor kernel logs, the npd add-on needs to read **/dev/kmsg** on the host. Therefore, privileged mode must be enabled. For details, see `privileged `__.
In addition, CCE mitigates risks according to the principle of least privilege. The running npd process is granted only the following capabilities:
- cap_dac_read_search: permission to access **/run/log/journal**.
- cap_sys_admin: permission to access **/dev/kmsg**.
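To confirm which capabilities the running npd process actually holds on a node, you can decode its capability mask. The following is a minimal sketch; the process name used for matching is an assumption and may differ in your deployment.

.. code-block::

   # Find the npd process and decode its effective capabilities
   # (the process name "node-problem-detector" is an assumption).
   PID=$(pgrep -f node-problem-detector | head -n 1)
   grep CapEff /proc/${PID}/status
   capsh --decode=$(grep CapEff /proc/${PID}/status | awk '{print $2}')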
Installing the Add-on
---------------------
#. Log in to the CCE console, click the cluster name, and access the cluster console. Choose **Add-ons** in the navigation pane, locate **npd** on the right, and click **Install**.
#. On the **Install Add-on** page, select the add-on specifications and set related parameters.
- **Pods**: Set the number of pods based on service requirements.
- **Containers**: Select a proper container quota based on service requirements.
#. Configure the following parameters and click **Install**.
These configurations are supported only by add-on version 1.16.0 and later.
**npc.enable**: indicates whether to enable :ref:`Node-problem-controller <cce_10_0132__section1471610580474>`.
npd Check Items
---------------
.. note::
Check items are supported only in 1.16.0 and later versions.
Check items cover events and statuses.
- Event-related
For event-related check items, when a problem occurs, npd reports an event to the API server. The event type can be **Normal** (normal event) or **Warning** (abnormal event).
.. table:: **Table 1** Event-related check items
+-----------------------+-------------------------------------------------------+-----------------------+
| Check Item | Function | Description |
+=======================+=======================================================+=======================+
| OOMKilling | Check whether OOM events occur and are reported. | Warning event |
+-----------------------+-------------------------------------------------------+-----------------------+
| TaskHung | Check whether taskHung events occur and are reported. | Warning event |
+-----------------------+-------------------------------------------------------+-----------------------+
| KernelOops | Check kernel nil pointer panic errors. | Warning event |
+-----------------------+-------------------------------------------------------+-----------------------+
| ConntrackFull | Check whether the conntrack table is full. | Warning event |
| | | |
| | | Interval: 30 seconds |
| | | |
| | | Threshold: 80% |
+-----------------------+-------------------------------------------------------+-----------------------+
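Event-related check results are reported as Kubernetes events, so you can review them with kubectl. For ConntrackFull, the 80% threshold can also be reproduced manually on a node. The following is a minimal sketch based on the items in Table 1; it is not the exact logic used by npd.

.. code-block::

   # List warning events that reference nodes.
   kubectl get events --field-selector type=Warning,involvedObject.kind=Node

   # Manually reproduce the ConntrackFull check (threshold: 80%).
   count=$(cat /proc/sys/net/netfilter/nf_conntrack_count)
   max=$(cat /proc/sys/net/netfilter/nf_conntrack_max)
   awk -v c="$count" -v m="$max" 'BEGIN {printf "conntrack usage: %.1f%%\n", c * 100 / m}'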
- Status-related
For status-related check items, when a problem occurs, npd reports an event to the API server and changes the node status synchronously. This function can be used together with :ref:`Node-problem-controller fault isolation <cce_10_0132__section1471610580474>` to isolate nodes.
**If the check period is not specified in the following check items, the default period is 30 seconds.**
.. table:: **Table 2** Application and OS check items
+---------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Check Item | Function | Description |
+===========================+===============================================================================================================================================================+============================================================================================================================================================+
| FrequentKubeletRestart | Check whether kubelet restarts frequently by listening to journald logs. | - Interval: 5 minutes |
| | | |
| | | - Backtracking: 10 minutes |
| | | |
| | | - Threshold: 10 times |
| | | |
| | | If the system restarts for 10 times within the backtracking period, it indicates that the system restarts frequently and a fault alarm is generated. |
| | | |
| | | - Listening object: logs in the **/run/log/journal** directory |
| | | |
| | | .. note:: |
| | | |
| | | The Ubuntu OS does not support the preceding check items due to incompatible log formats. |
+---------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+
| FrequentDockerRestart | Check whether Docker restarts frequently by listening to journald logs. | |
+---------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+
| FrequentContainerdRestart | Check whether containerd restarts frequently by listening to journald logs. | |
+---------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+
| CRIProblem | Check the CRI component status. | Check object: Docker or containerd |
+---------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+
| KUBELETProblem | Check the kubelet status. | None |
+---------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+
| NTPProblem | Check the NTP and Chrony service status. | Threshold of the clock offset: 8000 ms |
| | | |
| | Check whether the node clock offsets. | |
+---------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+
| PIDProblem | Check whether PIDs are sufficient. | - Threshold: 90% |
| | | - Usage: nr_threads in /proc/loadavg |
| | | - Maximum value: smaller value between **/proc/sys/kernel/pid_max** and **/proc/sys/kernel/threads-max**. |
+---------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+
| FDProblem | Check whether file handles are sufficient. | - Threshold: 90% |
| | | - Usage: the first value in **/proc/sys/fs/file-nr** |
| | | - Maximum value: the third value in **/proc/sys/fs/file-nr** |
+---------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+
| MemoryProblem | Check whether the overall node memory is sufficient. | - Threshold: 90% |
| | | - Usage: **MemTotal-MemAvailable** in **/proc/meminfo** |
| | | - Maximum value: **MemTotal** in **/proc/meminfo** |
+---------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ResolvConfFileProblem | Check whether the ResolvConf file is lost. | Object: **/etc/resolv.conf** |
| | | |
| | Check whether the ResolvConf file is normal. | |
| | | |
| | Exception definition: No upstream domain name resolution server (nameserver) is included. | |
+---------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ProcessD | Check whether there is a process D on the node. | Source: |
| | | |
| | | - /proc/{PID}/stat |
| | | - Alternately, you can run **ps aux**. |
| | | |
| | | Exception scenario: ProcessD ignores the resident processes (heartbeat and update) that are in the D state that the SDI driver on the BMS node depends on. |
+---------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ProcessZ | Check whether the node has processes in Z state. | |
+---------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ScheduledEvent | Check whether host plan events exist on the node. | Source: |
| | | |
| | Typical scenario: The host is faulty, for example, the fan is damaged or the disk has bad sectors. As a result, cold and live migration is triggered for VMs. | - http://169.254.169.254/meta-data/latest/events/scheduled |
| | | |
| | | This check item is an Alpha feature and is disabled by default. |
+---------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+
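Several of the thresholds in Table 2 can be reproduced manually on a node for troubleshooting. The following is a minimal sketch based on the sources and maximum values listed above; it is not the exact implementation used by npd.

.. code-block::

   # PIDProblem: usage is nr_threads; maximum is min(pid_max, threads-max); threshold 90%.
   nr_threads=$(awk -F'[ /]+' '{print $5}' /proc/loadavg)
   pid_max=$(cat /proc/sys/kernel/pid_max)
   threads_max=$(cat /proc/sys/kernel/threads-max)
   max=$(( pid_max < threads_max ? pid_max : threads_max ))
   echo "PID usage: $(( nr_threads * 100 / max ))%"

   # FDProblem: usage is the first value and maximum is the third value in file-nr; threshold 90%.
   read used _ fd_max < /proc/sys/fs/file-nr
   echo "FD usage: $(( used * 100 / fd_max ))%"

   # MemoryProblem: usage is MemTotal - MemAvailable; maximum is MemTotal; threshold 90%.
   mem_total=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
   mem_avail=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
   echo "Memory usage: $(( (mem_total - mem_avail) * 100 / mem_total ))%"

   # NTPProblem: clock offset threshold is 8000 ms; with chrony, inspect the current offset.
   chronyc tracking | grep "System time"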
.. table:: **Table 3** Network connection check items
+------------------+------------------------------------------------------+-------------+
| Check Item | Function | Description |
+==================+======================================================+=============+
| CNIProblem | Check whether the CNI component is running properly. | None |
+------------------+------------------------------------------------------+-------------+
| KUBEPROXYProblem | Check whether kube-proxy is running properly. | None |
+------------------+------------------------------------------------------+-------------+
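Table 3 does not specify how these components are probed. As a simple manual complement (not necessarily the method npd uses), you can query the kube-proxy health endpoint, which listens on port 10256 by default.

.. code-block::

   # Returns HTTP 200 with a timestamp when kube-proxy is healthy.
   curl -s http://127.0.0.1:10256/healthz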
.. table:: **Table 4** Storage check items
+--------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Check Item | Function | Description |
+================================+====================================================================================================================================================================================================================================================================================================================================================================================================+====================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================+
| ReadonlyFilesystem | Check whether the **Remount root filesystem read-only** error occurs in the system kernel by listening to the kernel logs. | Listening object: **/dev/kmsg** |
| | | |
| | Typical scenario: A user detaches a data disk from a node by mistake on the ECS, and applications continuously write data to the mount point of the data disk. As a result, an I/O error occurs in the kernel and the disk is reattached as a read-only disk. | Matching rule: **Remounting filesystem read-only** |
+--------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| DiskReadonly | Check whether the system disk, Docker disk, and kubelet disk are read-only. | Detection paths: |
| | | |
| | | - /mnt/paas/kubernetes/kubelet/ |
| | | - /var/lib/docker/ |
| | | - /var/lib/containerd/ |
| | | - /var/paas/sys/log/cceaddon-npd/ |
| | | |
| | | The temporary file **npd-disk-write-ping** is generated in the detection path. |
| | | |
| | | Currently, additional data disks are not supported. |
+--------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| DiskProblem | Check the usage of the system disk, Docker disk, and kubelet disk. | - Threshold: 80% |
| | | |
| | | - Source: |
| | | |
| | | .. code-block:: |
| | | |
| | | df -h |
| | | |
| | | Currently, additional data disks are not supported. |
+--------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| EmptyDirVolumeGroupStatusError | Check whether the ephemeral volume group on the node is normal. | - Detection period: 60s |
| | | |
| | Impact: The pod that depends on the storage pool cannot write data to the temporary volume. The temporary volume is remounted as a read-only file system by the kernel due to an I/O error. | - Source: |
| | | |
| | Typical scenario: When creating a node, a user configures two data disks as a temporary volume storage pool. The user deletes some data disks by mistake. As a result, the storage pool becomes abnormal. | .. code-block:: |
| | | |
| | | vgs -o vg_name, vg_attr |
| | | |
| | | - Principle: Check whether the VG (storage pool) is in the P state. If yes, some PVs (data disks) are lost. |
| | | |
| | | - Joint scheduling: The scheduler can automatically identify an abnormal node and prevent pods that depend on the storage pool from being scheduled to the node. |
| | | |
| | | - Exception scenario: The npd add-on cannot detect the loss of all PVs (data disks), resulting in the loss of VGs (storage pools). In this case, kubelet automatically isolates the node, detects the loss of VGs (storage pools), and updates the corresponding resources in **nodestatus.allocatable** to **0**. This prevents pods that depend on the storage pool from being scheduled to the node. The damage of a single PV cannot be detected. In this case, the ReadonlyFilesystem detection is abnormal. |
+--------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| LocalPvVolumeGroupStatusError | Check the PV group on the node. | |
| | | |
| | Impact: Pods that depend on the storage pool cannot write data to the persistent volume. The persistent volume is remounted as a read-only file system by the kernel due to an I/O error. | |
| | | |
| | Typical scenario: When creating a node, a user configures two data disks as a persistent volume storage pool. Some data disks are deleted by mistake. | |
+--------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| MountPointProblem | Check the mount point on the node. | Alternatively, you can run the following command: |
| | | |
| | Exception definition: You cannot access the mount point by running the **cd** command. | .. code-block:: |
| | | |
| | Typical scenario: Network File System (NFS), for example, obsfs and s3fs is mounted to a node. When the connection is abnormal due to network or peer NFS server exceptions, all processes that access the mount point are suspended. For example, during a cluster upgrade, a kubelet is restarted, and all mount points are scanned. If the abnormal mount point is detected, the upgrade fails. | for dir in `df -h | grep -v "Mounted on" | awk "{print \\$NF}"`;do cd $dir; done && echo "ok" |
+--------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| DiskHung | Check whether I/O faults occur on the disk of the node. | - Check object: all data disks |
| | | |
| | Definition of I/O faults: The system does not respond to disk I/O requests, and some processes are in the D state. | - Source: |
| | | |
| | Typical Scenario: Disks cannot respond due to abnormal OS hard disk drivers or severe faults on the underlying network. | /proc/diskstat |
| | | |
| | | Alternatively, you can run the following command: |
| | | |
| | | .. code-block:: |
| | | |
| | | iostat -xmt 1 |
| | | |
| | | - Threshold: |
| | | |
| | | - Average usage. The value of ioutil is greater than or equal to 0.99. |
| | | - Average I/O queue length. avgqu-sz >=1 |
| | | - Average I/O transfer volume, iops (w/s) + ioth (wMB/s) < = 1 |
| | | |
| | | .. note:: |
| | | |
| | | In some OSs, no data changes during I/O. In this case, calculate the CPU I/O time usage. The value of iowait is greater than 0.8. |
+--------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| DiskSlow | Check whether slow I/O occurs on the disk of the node. | - Check object: all data disks |
| | | |
| | Definition of slow I/O: The average response time exceeds the threshold. | - Source: |
| | | |
| | Typical scenario: EVS disks have slow I/Os due to network fluctuation. | /proc/diskstat |
| | | |
| | | Alternatively, you can run the following command: |
| | | |
| | | .. code-block:: |
| | | |
| | | iostat -xmt 1 |
| | | |
| | | - Threshold: |
| | | |
| | | Average I/O latency: await > = 5000 ms |
| | | |
| | | .. note:: |
| | | |
| | | If I/O requests are not responded and the **await** data is not updated. In this case, this check item is invalid. |
+--------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
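For the storage pool checks (EmptyDirVolumeGroupStatusError and LocalPvVolumeGroupStatusError), the principle described above can be verified manually: a volume group whose attribute string carries the partial flag (**p**) has lost one or more physical volumes. A minimal sketch:

.. code-block::

   # The fourth character of vg_attr is "p" (partial) when a PV backing the VG is missing.
   vgs -o vg_name,vg_attr --noheadings | awk 'substr($2, 4, 1) == "p" {print "VG " $1 " is missing a PV"}'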
The kubelet component provides the following default check items, which have known bugs or defects. You can work around them by upgrading the cluster or by using npd.
.. table:: **Table 5** Default kubelet check items
+-----------------------+------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Check Item | Function | Description |
+=======================+========================================================================+==========================================================================================================================================================================================================================================================================================================================+
| PIDPressure | Check whether PIDs are sufficient. | - Interval: 10 seconds |
| | | - Threshold: 90% |
| | | - Defect: In community version 1.23.1 and earlier versions, this check item becomes invalid when over 65535 PIDs are used. For details, see `issue 107107 `__. In community version 1.24 and earlier versions, thread-max is not considered in this check item. |
+-----------------------+------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| MemoryPressure | Check whether the allocable memory for the containers is sufficient. | - Interval: 10 seconds |
| | | - Threshold: Max. 100 MiB |
| | | - Allocable = Total memory of a node - Reserved memory of a node |
| | | - Defect: This check item checks only the memory consumed by containers, and does not consider that consumed by other elements on the node. |
+-----------------------+------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| DiskPressure | Check the disk usage and inodes usage of the kubelet and Docker disks. | Interval: 10 seconds |
| | | |
| | | Threshold: 90% |
+-----------------------+------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
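Status-related check results, including the kubelet defaults above, surface as node conditions, which you can inspect directly. For example:

.. code-block::

   # Replace <node-name> with an actual node name.
   kubectl get node <node-name> -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'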
.. _cce_10_0132__section1471610580474:
Node-problem-controller Fault Isolation
---------------------------------------
.. note::
Fault isolation is supported only by add-ons of 1.16.0 and later versions.
When installing the npd add-on, set **npc.enable** to **true** to deploy two NPC (Node-problem-controller) instances. NPC can also be deployed as a single instance, but a single-instance deployment does not ensure high availability.
By default, if multiple nodes become faulty, NPC adds taints to only one of them. You can set **npc.maxTaintedNode** to increase this threshold. If NPC is not running when a fault is rectified, the taints it added are not removed automatically; manually clear the taints or restart NPC, as shown below.
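To find nodes that NPC has isolated and to remove a leftover taint manually, you can use kubectl as sketched below. The taint key depends on the fault type and your NPC configuration, so it is shown as a placeholder.

.. code-block::

   # Show the taints on each node.
   kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints

   # Remove a leftover taint (replace <taint-key> with the key shown above).
   kubectl taint nodes <node-name> <taint-key>:NoSchedule-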
The open source NPD plug-in provides fault detection but not fault isolation. CCE enhances the node-problem-controller (NPC) based on the open source NPD. This component is implemented based on the Kubernetes `node controller `__. For faults reported by NPD, NPC automatically adds taints to nodes for node fault isolation.
You can modify the add-on parameter **npc.customConditionToTaint** according to the following table to configure fault isolation rules.
.. table:: **Table 6** Parameters
+--------------------------------+---------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
| Parameter | Description | Default |
+================================+=========================================================+=========================================================================================================================================+
| npc.enable | Whether to enable NPC | true |
+--------------------------------+---------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
| npc.customCondtionToTaint | Fault isolation rules | See :ref:`Table 7 `. |
+--------------------------------+---------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
| npc.customConditionToTaint[i] | Fault isolation rule items | N/A |
+--------------------------------+---------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
| npc.customConditionToTaint[i]. | Fault status | true |
| | | |
| condition.status | | |
+--------------------------------+---------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
| npc.customConditionToTaint[i]. | Fault type | N/A |
| | | |
| condition.type | | |
+--------------------------------+---------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
| npc.customConditionToTaint[i]. | Whether to enable the fault isolation rule. | false |
| | | |
| enable | | |
+--------------------------------+---------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
| npc.customConditionToTaint[i]. | Fault isolation effect | NoSchedule |
| | | |
| .taint.effect | NoSchedule, PreferNoSchedule, or NoExecute | Value options: **NoSchedule**, **PreferNoSchedule**, and **NoExecute** |
+--------------------------------+---------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
| npc. maxTaintedNode | Number of nodes in a cluster that can be tainted by NPC | 1 |
| | | |
| | The int format and percentage format are supported. | Values: |
| | | |
| | | - The value is in int format and ranges from 1 to infinity. |
| | | - The value ranges from 1% to 100%, in percentage. The minimum value of this parameter multiplied by the number of cluster nodes is 1. |
+--------------------------------+---------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
| Npc.affinity | Node affinity of the controller | N/A |
+--------------------------------+---------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
.. _cce_10_0132__table147438134911:
.. table:: **Table 7** Fault isolation rule configuration
+---------------------------+---------------------------------------------------------------------+--------------------------------------+
| Fault | Fault Details | Taint |
+===========================+=====================================================================+======================================+
| DiskReadonly | Disk read-only | **NoSchedule**: No new pods allowed. |
+---------------------------+---------------------------------------------------------------------+--------------------------------------+
| DiskProblem | The disk space is insufficient, and key logical disks are detached. | **NoSchedule**: No new pods allowed. |
+---------------------------+---------------------------------------------------------------------+--------------------------------------+
| FrequentKubeletRestart | kubelet restarts frequently. | **NoSchedule**: No new pods allowed. |
+---------------------------+---------------------------------------------------------------------+--------------------------------------+
| FrequentDockerRestart | Docker restarts frequently. | **NoSchedule**: No new pods allowed. |
+---------------------------+---------------------------------------------------------------------+--------------------------------------+
| FrequentContainerdRestart | containerd restarts frequently. | **NoSchedule**: No new pods allowed. |
+---------------------------+---------------------------------------------------------------------+--------------------------------------+
| KUBEPROXYProblem | kube-proxy is abnormal. | **NoSchedule**: No new pods allowed. |
+---------------------------+---------------------------------------------------------------------+--------------------------------------+
| PIDProblem | Insufficient PIDs | **NoSchedule**: No new pods allowed. |
+---------------------------+---------------------------------------------------------------------+--------------------------------------+
| FDProblem | Insufficient file handles | **NoSchedule**: No new pods allowed. |
+---------------------------+---------------------------------------------------------------------+--------------------------------------+
| MemoryProblem | Insufficient node memory | **NoSchedule**: No new pods allowed. |
+---------------------------+---------------------------------------------------------------------+--------------------------------------+
Collecting Prometheus Metrics
-----------------------------
The NPD daemon pod exposes Prometheus metric data on port 19901. By default, the NPD pod is annotated with **metrics.alpha.kubernetes.io/custom-endpoints: '[{"api":"prometheus","path":"/metrics","port":"19901","names":""}]'**. You can build a Prometheus collector to identify and scrape NPD metrics from **http://{{NpdPodIP}}:{{NpdPodPort}}/metrics**.
.. note::
If the npd add-on version is earlier than 1.16.5, the exposed port of Prometheus metrics is **20257**.
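To check the exposed metrics manually, you can query the endpoint from within the cluster. The namespace and label selector below are assumptions and may differ in your cluster.

.. code-block::

   # Look up an npd pod IP and query its metrics endpoint
   # (use port 20257 for add-on versions earlier than 1.16.5).
   POD_IP=$(kubectl -n kube-system get pod -l app=node-problem-detector -o jsonpath='{.items[0].status.podIP}')
   curl -s http://${POD_IP}:19901/metrics | grep "^problem_"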
Currently, the metric data includes **problem_counter** and **problem_gauge**, as shown below.
.. code-block::
# HELP problem_counter Number of times a specific type of problem have occurred.
# TYPE problem_counter counter
problem_counter{reason="DockerHung"} 0
problem_counter{reason="DockerStart"} 0
problem_counter{reason="EmptyDirVolumeGroupStatusError"} 0
...
# HELP problem_gauge Whether a specific type of problem is affecting the node or not.
# TYPE problem_gauge gauge
problem_gauge{reason="CNIIsDown",type="CNIProblem"} 0
problem_gauge{reason="CNIIsUp",type="CNIProblem"} 0
problem_gauge{reason="CRIIsDown",type="CRIProblem"} 0
problem_gauge{reason="CRIIsUp",type="CRIProblem"} 0
...