The node fault detection function depends on the NPD add-on, whose instances run on each node and monitor that node. This section describes how to enable node fault detection.
Prerequisite: the CCE Node Problem Detector add-on has been installed in the cluster.
Troubleshooting Strategy | Effect
---|---
Prompting exception | Kubernetes events are reported.
Disabling scheduling | Kubernetes events are reported and the NoSchedule taint is added to the node.
Evicting the node load | Kubernetes events are reported and the NoExecute taint is added to the node. This operation evicts workloads on the node and interrupts services. Exercise caution when performing it.
Check items are supported only in clusters of v1.16.0 and later.
Check items cover events and statuses.
For event-related check items, NPD reports an event to the API server when a problem occurs. The event type can be Normal (normal event) or Warning (abnormal event).
Check Item | Function | Description
---|---|---
OOMKilling | Listens to the kernel logs and checks whether OOM events occur and are reported. Typical scenario: when the memory usage of a process in a container exceeds the limit, OOM is triggered and the process is terminated. | Warning event. Listening object: /dev/kmsg. Matching rule: "Killed process \d+ (.+) total-vm:\d+kB, anon-rss:\d+kB, file-rss:\d+kB.*"
TaskHung | Listens to the kernel logs and checks whether taskHung events occur and are reported. Typical scenario: disk I/O suspension causes process suspension. | Warning event. Listening object: /dev/kmsg. Matching rule: "task \S+:\w+ blocked for more than \w+ seconds\."
ReadonlyFilesystem | Listens to the kernel logs and checks whether the "Remounting filesystem read-only" error occurs. Typical scenario: a user detaches a data disk from a node by mistake on the ECS console while applications continuously write data to the mount point of that disk; an I/O error occurs in the kernel and the disk is remounted read-only. NOTE: if the rootfs of node pods is of the device mapper type, detaching a data disk causes an error in the thin pool, which affects NPD and prevents it from detecting node faults. | Warning event. Listening object: /dev/kmsg. Matching rule: "Remounting filesystem read-only"
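The matching rules above are regular expressions applied to kernel log lines read from /dev/kmsg. As a rough illustration, the snippet below tests a sample OOM log line against the OOMKilling rule; the sample line is hypothetical, and the Go-style `\d` classes are translated to `[0-9]` so the rule works with `grep -E`:

```shell
# Hypothetical kernel log line of the kind the OOMKilling item looks for
line='Killed process 12345 (stress) total-vm:2097152kB, anon-rss:1048576kB, file-rss:0kB'

# POSIX ERE translation of the documented rule (\d+ becomes [0-9]+)
if printf '%s\n' "$line" | grep -Eq \
  'Killed process [0-9]+ (.+) total-vm:[0-9]+kB, anon-rss:[0-9]+kB, file-rss:[0-9]+kB'
then
  echo "match: a Warning event would be reported"
fi
```

A line that does not match the rule produces no event, which is why the rules anchor on the fixed phrases the kernel emits.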
For status-related check items, NPD reports an event to the API server when a problem occurs and synchronously changes the node status. This can be combined with Node-problem-controller fault isolation to isolate faulty nodes.
If the check period is not specified in the following check items, the default period is 30 seconds.
Check Item | Function | Description
---|---|---
Container network component error CNIProblem | Checks the status of the CNI components (container network components). | None
Container runtime component error CRIProblem | Checks the status of Docker and containerd, the CRI components (container runtime components). | Check object: Docker or containerd
Frequent kubelet restarts FrequentKubeletRestart | Periodically backtracks system logs to check whether the key component kubelet restarts frequently. | 
Frequent Docker restarts FrequentDockerRestart | Periodically backtracks system logs to check whether the container runtime Docker restarts frequently. | 
Frequent containerd restarts FrequentContainerdRestart | Periodically backtracks system logs to check whether the container runtime containerd restarts frequently. | 
kubelet error KubeletProblem | Checks the status of the key component kubelet. | None
kube-proxy error KubeProxyProblem | Checks the status of the key component kube-proxy. | None
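The frequent-restart items backtrack system logs over a time window and compare the number of restarts found against a threshold. A minimal sketch of that decision follows; the window result and threshold values are illustrative, since this document does not state NPD's actual defaults:

```shell
# Suppose the log backtrack counted this many kubelet restarts in the window
restarts_in_window=4
# Illustrative threshold, not NPD's documented default
threshold=3

if [ "$restarts_in_window" -ge "$threshold" ]; then
  echo "FrequentKubeletRestart: condition set on the node"
else
  echo "kubelet restart frequency is normal"
fi
```

The Docker and containerd variants apply the same comparison to their own restart counts.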
Check Item | Function | Description
---|---|---
Conntrack table full ConntrackFullProblem | Checks whether the conntrack table is full. | 
Insufficient disk resources DiskProblem | Checks the usage of the system disk and CCE data disks (including the CRI logical disk and kubelet logical disk) on the node. | Currently, additional data disks are not supported.
Insufficient file handles FDProblem | Checks whether FD file handles are used up. | 
Insufficient node memory MemoryProblem | Checks whether memory is used up. | 
Insufficient process resources PIDProblem | Checks whether PID process resources are exhausted. | 
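On Linux, the raw signals behind resource checks such as FDProblem and PIDProblem are exposed under /proc. The sketch below reads them directly; the exact thresholds NPD applies are not stated in this document:

```shell
# /proc/sys/fs/file-nr holds three fields: <allocated> <unused> <max>
read -r allocated unused max < /proc/sys/fs/file-nr
echo "file handles allocated: $allocated of $max"

# The PID ceiling consumed by PIDProblem-style checks
pid_max=$(cat /proc/sys/kernel/pid_max)
echo "pid_max: $pid_max"
```

Comparing the allocated count against the maximum gives the usage ratio such a check would evaluate.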
Check Item | Function | Description
---|---|---
Disk read-only DiskReadonly | Periodically performs write tests on the system disk and CCE data disks (including the CRI logical disk and kubelet logical disk) of the node to check the availability of key disks. | Detection paths: the temporary file npd-disk-write-ping is generated in each detection path. Currently, additional data disks are not supported.
emptyDir storage pool error EmptyDirVolumeGroupStatusError | Checks whether the ephemeral volume group on the node is normal. Impact: pods that depend on the storage pool cannot write data to the temporary volume, and the temporary volume is remounted read-only by the kernel due to an I/O error. Typical scenario: when creating a node, a user configures two data disks as a temporary volume storage pool; if some of the data disks are deleted by mistake, the storage pool becomes abnormal. | 
PV storage pool error LocalPvVolumeGroupStatusError | Checks the PV group on the node. Impact: pods that depend on the storage pool cannot write data to the persistent volume, and the persistent volume is remounted read-only by the kernel due to an I/O error. Typical scenario: when creating a node, a user configures two data disks as a persistent volume storage pool, and some of the data disks are deleted by mistake. | 
Mount point error MountPointProblem | Checks the mount points on the node. Exceptional definition: a mount point that cannot be accessed by running the cd command. Typical scenario: a network file system (NFS), for example obsfs or s3fs, is mounted to a node; when the connection becomes abnormal due to network or peer NFS server exceptions, all processes that access the mount point are suspended. For example, during a cluster upgrade, kubelet is restarted and scans all mount points; if an abnormal mount point is detected, the upgrade fails. | Alternatively, run the following command: for dir in $(df -h | grep -v "Mounted on" | awk '{print $NF}'); do cd "$dir"; done && echo "ok"
Suspended disk I/O DiskHung | Checks whether I/O suspension occurs on any disk on the node, that is, whether I/O read and write requests go unanswered. Definition of I/O suspension: the system does not respond to disk I/O requests, and some processes are in the D state. Typical scenario: disks cannot respond due to abnormal OS hard disk drivers or severe faults on the underlying network. | 
Slow disk I/O DiskSlow | Checks whether any disk on the node has slow I/O, that is, whether I/O responds slowly. Typical scenario: EVS disks have slow I/O due to network fluctuation. | NOTE: if I/O requests receive no response and the await data is not updated, this check item is invalid.
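The DiskReadonly write test can be mimicked manually. The sketch below probes a single path with a temporary file; the npd-disk-write-ping name comes from the table above, but probing /tmp instead of the real detection paths is a simplification for illustration:

```shell
# Try to create and write a small probe file, then clean it up.
# A failure here is the signal a DiskReadonly-style check would raise.
probe="/tmp/npd-disk-write-ping.$$"
if echo ping > "$probe" 2>/dev/null; then
  echo "writable: /tmp"
  rm -f "$probe"
else
  echo "read-only or write error: /tmp"
fi
```

On a healthy node the probe succeeds; on a filesystem remounted read-only (as in the ReadonlyFilesystem scenario above) the write fails.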
Check Item | Function | Description
---|---|---
Abnormal NTP NTPProblem | Checks whether the node clock synchronization service (ntpd or chronyd) is running properly and whether a system time drift has occurred. | Default clock offset threshold: 8000 ms
Process D error ProcessD | Checks whether there are D-state processes on the node. | Default threshold: 10 abnormal processes detected for three consecutive times. Source: Exceptional scenario: the ProcessD check item ignores the resident D-state processes (heartbeat and update) on which the SDI driver on the BMS node depends.
Process Z error ProcessZ | Checks whether the node has Z-state (zombie) processes. | 
ResolvConf error ResolvConfFileProblem | Checks whether the ResolvConf file is lost or abnormal. Exceptional definition: no upstream domain name resolution server (nameserver) is included. | Object: /etc/resolv.conf
Existing scheduled event ScheduledEvent | Checks whether scheduled live migration events exist on the node. A live migration event is usually triggered by a hardware fault and is an automatic fault rectification method at the IaaS layer. Typical scenario: the host is faulty (for example, the fan is damaged or the disk has bad sectors), so live migration is triggered for VMs. | Source: This check item is an Alpha feature and is disabled by default.
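The process-state items can be spot-checked with ps: a state beginning with D means uninterruptible sleep, and Z means a zombie process. This is a manual sketch of the underlying signal, not NPD's own collection method:

```shell
# Count D-state and Z-state processes, the raw signals behind
# the ProcessD and ProcessZ check items
d_count=$(ps -eo stat= | grep -c '^D')
z_count=$(ps -eo stat= | grep -c '^Z')
echo "D-state processes: ${d_count:-0}, Z-state processes: ${z_count:-0}"
```

On a healthy node both counts stay near zero; a sustained D-state count is the same symptom the DiskHung item describes.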
The kubelet component has the following default check items, which have bugs or defects in some versions. You can fix them by upgrading the cluster or by using NPD instead.
Check Item | Function | Description
---|---|---
Insufficient PID resources PIDPressure | Checks whether PIDs are sufficient. | 
Insufficient memory MemoryPressure | Checks whether the allocatable memory for containers is sufficient. | 
Insufficient disk resources DiskPressure | Checks the disk usage and inode usage of the kubelet and Docker disks. | 
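The signals a DiskPressure-style check consumes, disk-space usage and inode usage, can be inspected with df. The root filesystem is used here for illustration; on a real node the kubelet and Docker disks may be separate mounts:

```shell
# Space usage for one filesystem
df -h /
# Inode usage for the same filesystem
df -i /
```

Either dimension running out can trigger the pressure condition, so both are worth checking when diagnosing it.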