:original_name: cce_10_0659.html

.. _cce_10_0659:

Node Fault Detection Policy
===========================

The node fault detection function depends on the :ref:`node-problem-detector (npd) ` add-on. The add-on instances run on nodes and monitor the nodes. This section describes how to enable node fault detection.

Prerequisites
-------------

The :ref:`npd ` add-on has been installed in the cluster.

Enabling Node Fault Detection
-----------------------------

#. Log in to the CCE console and click the cluster name to access the cluster console.

#. In the navigation pane on the left, choose **Nodes**. Check whether the npd add-on has been installed in the cluster and whether it has been upgraded to the latest version. After the npd add-on is installed, you can use the fault detection function.

   |image1|

#. If the npd add-on is running properly, click **Node Fault Detection Policy** to view the current fault detection items. For details about the npd check item list, see :ref:`npd Check Items `.

#. If the check result of the current node is abnormal, a message is displayed in the node list, indicating that the metric is abnormal.

   |image2|

#. Click **Abnormal metrics** and rectify the fault as prompted.

   |image3|

Customized Check Items
----------------------

#. Log in to the CCE console and click the cluster name to access the cluster console.

#. In the navigation pane on the left, choose **Nodes** and click **Node Fault Detection Policy**.

#. On the displayed page, view the current check items. Click **Edit** in the **Operation** column and edit a check item.

   Currently, the following configurations are supported:

   - **Enable/Disable**: Enable or disable a check item.

   - **Target Node**: By default, check items run on all nodes. You can change the target nodes for special scenarios. For example, the spot price ECS interruption reclamation check item runs only on spot price ECS nodes.

     |image4|

   - **Trigger Threshold**: The default thresholds match common fault scenarios. You can customize and modify the fault thresholds as required. For example, change the threshold for triggering connection tracking table exhaustion from 90% to 80%.

     |image5|

   - **Check Period**: The default check period is 30 seconds. You can modify this parameter as required.

     |image6|

   - **Troubleshooting Strategy**: After a fault occurs, you can select one of the strategies listed in the following table.

     .. list-table:: **Table 1** Troubleshooting strategies
        :header-rows: 1

        * - Troubleshooting Strategy
          - Effect
        * - Prompting Exception
          - Reports the Kubernetes events.
        * - Disabling scheduling
          - Reports the Kubernetes events and adds the **NoSchedule** taint to the node.
        * - Evict Node Load
          - Reports the Kubernetes events and adds the **NoExecute** taint to the node. This operation will evict workloads on the node and interrupt services. Exercise caution when performing this operation.
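
After a check item whose troubleshooting strategy is **Disabling scheduling** or **Evict Node Load** reports a fault, you can confirm the reported event and the taint on the node with kubectl. The following is a minimal sketch; ``<node-name>`` is a placeholder:

.. code-block::

   # List the warning events reported in the cluster (npd reports faults as Kubernetes events).
   kubectl get events --all-namespaces --field-selector type=Warning
   # Check whether the NoSchedule or NoExecute taint has been added to the node.
   kubectl describe node <node-name> | grep -i taint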

.. _cce_10_0659__en-us_topic_0000001519314622_section321984418184:

npd Check Items
---------------

.. note::

   Check items are supported only by npd 1.16.0 and later versions.

Check items cover events and statuses.

- Event-related

  For event-related check items, when a problem occurs, npd reports an event to the API server. The event type can be **Normal** (normal event) or **Warning** (abnormal event).

  .. list-table:: **Table 2** Event-related check items
     :header-rows: 1

     * - Check Item
       - Function
       - Description
     * - OOMKilling
       - Listen to the kernel logs and check whether OOM events occur and are reported.

         Typical scenario: When the memory usage of a process in a container exceeds the limit, OOM is triggered and the process is terminated.
       - Warning event

         Listening object: **/dev/kmsg**

         Matching rule: "Killed process \\\\d+ (.+) total-vm:\\\\d+kB, anon-rss:\\\\d+kB, file-rss:\\\\d+kB.*"
     * - TaskHung
       - Listen to the kernel logs and check whether taskHung events occur and are reported.

         Typical scenario: Disk I/O suspension causes process suspension.
       - Warning event

         Listening object: **/dev/kmsg**

         Matching rule: "task \\\\S+:\\\\w+ blocked for more than \\\\w+ seconds\\\\."
     * - ReadonlyFilesystem
       - Check whether the **Remount root filesystem read-only** error occurs in the system kernel by listening to the kernel logs.

         Typical scenario: A user detaches a data disk from a node by mistake on the ECS, and applications continuously write data to the mount point of the data disk. As a result, an I/O error occurs in the kernel and the disk is remounted as a read-only disk.
       - Warning event

         Listening object: **/dev/kmsg**

         Matching rule: **Remounting filesystem read-only**
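
  The event-related check items match patterns in the kernel log. If such an event is reported, you can cross-check the kernel log on the node. A minimal sketch (the grep patterns simply mirror the matching rules above, and dmesg reads the same kernel log that npd listens to):

  .. code-block::

     # Search the kernel ring buffer for the messages matched by the event-related check items.
     dmesg | grep -iE "killed process|blocked for more than|remounting filesystem read-only"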

- Status-related

  For status-related check items, when a problem occurs, npd reports an event to the API server and changes the node status synchronously. This function can be used together with :ref:`Node-problem-controller fault isolation ` to isolate nodes.

  **If the check period is not specified in the following check items, the default period is 30 seconds.**

  .. list-table:: **Table 3** Checking system components
     :header-rows: 1

     * - Check Item
       - Function
       - Description
     * - Container network component error

         CNIProblem
       - Check the status of the CNI components (container network components).
       - None
     * - Container runtime component error

         CRIProblem
       - Check the status of Docker and containerd of the CRI components (container runtime components).
       - Check object: Docker or containerd
     * - Frequent restarts of kubelet

         FrequentKubeletRestart
       - Periodically backtrack system logs to check whether the key component kubelet restarts frequently.
       - - Default threshold: 10 restarts within 10 minutes

           If kubelet restarts 10 times within 10 minutes, the system is considered to restart frequently and a fault alarm is generated.

         - Listening object: logs in the **/run/log/journal** directory
     * - Frequent restarts of Docker

         FrequentDockerRestart
       - Periodically backtrack system logs to check whether the container runtime Docker restarts frequently.
       - None
     * - Frequent restarts of containerd

         FrequentContainerdRestart
       - Periodically backtrack system logs to check whether the container runtime containerd restarts frequently.
       - None
     * - kubelet error

         KubeletProblem
       - Check the status of the key component kubelet.
       - None
     * - kube-proxy error

         KubeProxyProblem
       - Check the status of the key component kube-proxy.
       - None
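
  To manually confirm a component fault reported by these check items, you can check the corresponding services on the node. A minimal sketch (the service unit names and log messages depend on the node OS and container runtime):

  .. code-block::

     # Check the status of the container runtime and kubelet on the node.
     systemctl status kubelet containerd
     # Backtrack recent kubelet restarts in the systemd journal (the logs the restart check items listen to).
     journalctl -u kubelet --since "10 min ago" | grep -i "started"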

  .. list-table:: **Table 4** Checking system metrics
     :header-rows: 1

     * - Check Item
       - Function
       - Description
     * - Conntrack table full

         ConntrackFullProblem
       - Check whether the conntrack table is full.
       - - Default threshold: 90%
         - Usage: **nf_conntrack_count**
         - Maximum value: **nf_conntrack_max**
     * - Insufficient disk resources

         DiskProblem
       - Check the usage of the system disk and CCE data disks (including the CRI logical disk and kubelet logical disk) on the node.
       - - Default threshold: 90%

         - Source:

           .. code-block::

              df -h

         Currently, additional data disks are not supported.
     * - Insufficient file handles

         FDProblem
       - Check whether FD file handles are used up.
       - - Default threshold: 90%
         - Usage: the first value in **/proc/sys/fs/file-nr**
         - Maximum value: the third value in **/proc/sys/fs/file-nr**
     * - Insufficient node memory

         MemoryProblem
       - Check whether memory is used up.
       - - Default threshold: 80%
         - Usage: **MemTotal-MemAvailable** in **/proc/meminfo**
         - Maximum value: **MemTotal** in **/proc/meminfo**
     * - Insufficient process resources

         PIDProblem
       - Check whether PID process resources are exhausted.
       - - Default threshold: 90%
         - Usage: **nr_threads in /proc/loadavg**
         - Maximum value: smaller value between **/proc/sys/kernel/pid_max** and **/proc/sys/kernel/threads-max**.
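
  The system metrics above are read from standard kernel interfaces, so an abnormal metric can be verified directly on the node. A minimal sketch (the conntrack files exist only when the nf_conntrack module is loaded):

  .. code-block::

     # Conntrack table usage and maximum (ConntrackFullProblem)
     cat /proc/sys/net/netfilter/nf_conntrack_count /proc/sys/net/netfilter/nf_conntrack_max
     # Allocated, unused, and maximum file handles (FDProblem)
     cat /proc/sys/fs/file-nr
     # Total and available memory (MemoryProblem)
     grep -E "MemTotal|MemAvailable" /proc/meminfo
     # PID and thread limits (PIDProblem)
     cat /proc/sys/kernel/pid_max /proc/sys/kernel/threads-max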

  .. list-table:: **Table 5** Checking the storage
     :header-rows: 1

     * - Check Item
       - Function
       - Description
     * - Disk read-only

         DiskReadonly
       - Periodically perform read and write tests on the system disk and CCE data disks (including the CRI logical disk and kubelet logical disk) of the node to check the availability of key disks.
       - Detection paths:

         - /mnt/paas/kubernetes/kubelet/
         - /var/lib/docker/
         - /var/lib/containerd/
         - /var/paas/sys/log/cceaddon-npd/

         The temporary file **npd-disk-write-ping** is generated in the detection path.

         Currently, additional data disks are not supported.

     * - Insufficient disk resources

         DiskProblem
       - Check the usage of the system disk and CCE data disks (including the CRI logical disk and kubelet logical disk) on the node.
       - - Default threshold: 90%

         - Source:

           .. code-block::

              df -h

         Currently, additional data disks are not supported.

     * - emptyDir storage pool error

         EmptyDirVolumeGroupStatusError
       - Check whether the ephemeral volume group on the node is normal.

         Impact: The pod that depends on the storage pool cannot write data to the temporary volume. The temporary volume is remounted as a read-only file system by the kernel due to an I/O error.

         Typical scenario: When creating a node, a user configures two data disks as a temporary volume storage pool. The user deletes some data disks by mistake. As a result, the storage pool becomes abnormal.
       - - Detection period: 30s

         - Source:

           .. code-block::

              vgs -o vg_name,vg_attr

         - Principle: Check whether the VG (storage pool) is in the P state. If yes, some PVs (data disks) are lost.

         - Joint scheduling: The scheduler can automatically identify a PV storage pool error and prevent pods that depend on the storage pool from being scheduled to the node.

         - Exceptional scenario: If all PVs (data disks) are lost and the VG (storage pool) is therefore lost, the npd add-on cannot detect the fault. In this case, kubelet automatically isolates the node, detects the loss of the VG (storage pool), and updates the corresponding resources in **node.status.allocatable** to **0**. This prevents pods that depend on the storage pool from being scheduled to the node. The damage of a single PV cannot be detected by this check item, but by the ReadonlyFilesystem check item.
     * - PV storage pool error

         LocalPvVolumeGroupStatusError
       - Check the PV group on the node.

         Impact: Pods that depend on the storage pool cannot write data to the persistent volume. The persistent volume is remounted as a read-only file system by the kernel due to an I/O error.

         Typical scenario: When creating a node, a user configures two data disks as a persistent volume storage pool. Some data disks are deleted by mistake.
       - None

     * - Mount point error

         MountPointProblem
       - Check the mount points on the node.

         Exceptional definition: You cannot access the mount point by running the **cd** command.

         Typical scenario: A network file system (for example, obsfs or s3fs) is mounted to the node. When the connection becomes abnormal due to a network fault or an exception on the peer NFS server, all processes that access the mount point are suspended. For example, during a cluster upgrade, kubelet is restarted and scans all mount points. If it accesses the abnormal mount point, the upgrade fails.
       - Alternatively, you can run the following command:

         .. code-block::

            for dir in `df -h | grep -v "Mounted on" | awk "{print \$NF}"`;do cd $dir; done && echo "ok"
     * - Suspended disk I/O

         DiskHung
       - Check whether I/O suspension occurs on all disks on the node, that is, whether I/O read and write requests receive no response.

         Definition of I/O suspension: The system does not respond to disk I/O requests, and some processes are in the D state (see the example after this table).

         Typical scenario: Disks cannot respond due to abnormal OS hard disk drivers or severe faults on the underlying network.
       - - Check object: all data disks

         - Source:

           /proc/diskstat

           Alternatively, you can run the following command:

           .. code-block::

              iostat -xmt 1

         - Threshold:

           - Average usage: ioutil >= 0.99
           - Average I/O queue length: avgqu-sz >= 1
           - Average I/O transfer volume: iops (w/s) + ioth (wMB/s) <= 1

           .. note::

              In some OSs, no data changes during I/O. In this case, calculate the CPU I/O time usage. The value of iowait should be greater than 0.8.

     * - Slow disk I/O

         DiskSlow
       - Check whether all disks on the node have slow I/Os, that is, whether I/Os respond slowly.

         Typical scenario: EVS disks have slow I/Os due to network fluctuation.
       - - Check object: all data disks

         - Source:

           /proc/diskstat

           Alternatively, you can run the following command:

           .. code-block::

              iostat -xmt 1

         - Default threshold:

           Average I/O latency: await >= 5000 ms

           .. note::

              If I/O requests receive no response and the **await** data is not updated, this check item cannot detect the fault.
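
  If DiskHung is reported, the processes blocked on the suspended disk are usually in the uninterruptible sleep (D) state. A minimal sketch for confirming this on the node:

  .. code-block::

     # List processes in the D state (uninterruptible sleep), which usually indicates blocked disk I/O.
     ps -eo state,pid,comm | awk '$1 == "D"'
     # Observe per-disk latency and utilization in real time.
     iostat -xmt 1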

  .. list-table:: **Table 6** Other check items
     :header-rows: 1

     * - Check Item
       - Function
       - Description
     * - Abnormal NTP

         NTPProblem
       - Check whether the node clock synchronization service ntpd or chronyd is running properly and whether the system time drifts.
       - Default clock offset threshold: 8000 ms
     * - Process D error

         ProcessD
       - Check whether there are processes in the D state on the node.
       - Default threshold: 10 abnormal processes detected for three consecutive times

         Source:

         - /proc/{PID}/stat
         - Alternatively, you can run the **ps aux** command.

         Exceptional scenario: ProcessD ignores the resident D processes (heartbeat and update) on which the SDI driver on the BMS node depends.
     * - Process Z error

         ProcessZ
       - Check whether the node has processes in the Z state.
       - None
     * - ResolvConf error

         ResolvConfFileProblem
       - Check whether the ResolvConf file is lost.

         Check whether the ResolvConf file is normal.

         Exceptional definition: No upstream domain name resolution server (nameserver) is included.
       - Object: **/etc/resolv.conf**
     * - Existing scheduled event

         ScheduledEvent
       - Check whether scheduled live migration events exist on the node. A live migration plan event is usually triggered by a hardware fault and is an automatic fault rectification method at the IaaS layer.

         Typical scenario: The host is faulty. For example, the fan is damaged or the disk has bad sectors. As a result, live migration is triggered for VMs.
       - Source:

         - http://169.254.169.254/meta-data/latest/events/scheduled

         This check item is an Alpha feature and is disabled by default.
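
  For the NTP and ResolvConf check items, you can verify the clock synchronization service and the DNS configuration directly on the node. A minimal sketch (whether ntpd or chronyd is present depends on the node image):

  .. code-block::

     # Check which time synchronization service is running and whether the clock is synchronized.
     systemctl status chronyd ntpd
     timedatectl
     # Confirm that /etc/resolv.conf contains at least one upstream nameserver.
     grep nameserver /etc/resolv.conf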

The kubelet component has the following default check items, which have bugs or defects. You can fix them by upgrading the cluster or using npd.

.. list-table:: **Table 7** Default kubelet check items
   :header-rows: 1

   * - Check Item
     - Function
     - Description
   * - Insufficient PID resources

       PIDPressure
     - Check whether PIDs are sufficient.
     - - Interval: 10 seconds

       - Threshold: 90%

       - Defect: In community version 1.23.1 and earlier versions, this check item becomes invalid when over 65535 PIDs are used. For details, see `issue 107107 `__. In community version 1.24 and earlier versions, thread-max is not considered in this check item.
   * - Insufficient memory

       MemoryPressure
     - Check whether the allocable memory for the containers is sufficient.
     - - Interval: 10 seconds

       - Threshold: max. 100 MiB

       - Allocable = Total memory of a node - Reserved memory of a node

       - Defect: This check item checks only the memory consumed by containers, and does not consider that consumed by other elements on the node.
   * - Insufficient disk resources

       DiskPressure
     - Check the disk usage and inodes usage of the kubelet and Docker disks.
     - - Interval: 10 seconds
       - Threshold: 90%
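
The kubelet check items above surface as the standard node conditions **PIDPressure**, **MemoryPressure**, and **DiskPressure**, alongside the conditions set by the status-related npd check items. A minimal sketch for viewing them (``<node-name>`` is a placeholder):

.. code-block::

   # Show the node conditions reported by kubelet and npd.
   kubectl describe node <node-name>
   # Or print only the condition types and statuses.
   kubectl get node <node-name> -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\n"}{end}'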

.. |image1| image:: /_static/images/en-us_image_0000001519067438.png
.. |image2| image:: /_static/images/en-us_image_0000001520080400.png
.. |image3| image:: /_static/images/en-us_image_0000001571360421.png
.. |image4| image:: /_static/images/en-us_image_0000001570344789.png
.. |image5| image:: /_static/images/en-us_image_0000001519063542.png
.. |image6| image:: /_static/images/en-us_image_0000001519544422.png