npd
Introduction
node-problem-detector (npd for short) is an add-on that monitors abnormal events on cluster nodes and can connect to a third-party monitoring platform. It runs as a daemon on each node, collects node issues from different daemons, and reports them to the API server. The npd add-on can run as a DaemonSet or as a daemon.
For more information, see node-problem-detector.
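For example, after the add-on is installed, you can check from the command line that the daemon pods are running and that node problems are being reported. This is a minimal sketch; the kube-system namespace and the app=npd label are assumptions that may differ in your cluster:
# List the npd daemon pods (assumed namespace and label selector)
kubectl get pods -n kube-system -l app=npd -o wide
# View the conditions and events that npd reports for a node
kubectl describe node <node-name>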
Notes and Constraints
- When using this add-on, do not format or partition node disks.
- Each npd process occupies 30m CPU and 100 MB of memory.
Permission Description
To monitor kernel logs, the npd add-on needs to read the host /dev/kmsg. Therefore, the privileged mode must be enabled. For details, see privileged.
In addition, CCE mitigates risks according to the least privilege principle. The running npd is granted only the following capabilities:
- cap_dac_read_search: permission to access /run/log/journal.
- cap_sys_admin: permission to access /dev/kmsg.
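To verify the capabilities that the npd container actually requests, you can inspect its pod template. A minimal sketch, assuming the workload is a DaemonSet named npd in the kube-system namespace (both names are assumptions):
# Print the security context of the first container in the npd DaemonSet
kubectl -n kube-system get ds npd -o jsonpath='{.spec.template.spec.containers[0].securityContext}'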
Installing the Add-on
Log in to the CCE console and access the cluster console. Choose Add-ons in the navigation pane, locate npd on the right, and click Install.
On the Install Add-on page, select the add-on specifications and set related parameters.
- Pods: Set the number of pods based on service requirements.
- Containers: Select a proper container quota based on service requirements.
Set the npd parameters and click Install.
The parameters can be configured only in versions 1.16.0 and later. For details, see Table 7.
npd Check Items
Note
Check items are supported only in 1.16.0 and later versions.
Check items cover events and statuses.
Event-related
For event-related check items, when a problem occurs, npd reports an event to the API server. The event type can be Normal (normal event) or Warning (abnormal event).
Table 1 Event-related check items

OOMKilling
- Function: Listen to the kernel logs and check whether OOM events occur and are reported.
  Typical scenario: When the memory usage of a process in a container exceeds the limit, OOM is triggered and the process is terminated.
- Description:
  - Warning event
  - Listening object: /dev/kmsg
  - Matching rule: "Killed process \\d+ (.+) total-vm:\\d+kB, anon-rss:\\d+kB, file-rss:\\d+kB.*"

TaskHung
- Function: Listen to the kernel logs and check whether taskHung events occur and are reported.
  Typical scenario: Disk I/O suspension causes process suspension.
- Description:
  - Warning event
  - Listening object: /dev/kmsg
  - Matching rule: "task \\S+:\\w+ blocked for more than \\w+ seconds\\."

ReadonlyFilesystem
- Function: Listen to the kernel logs and check whether the Remount root filesystem read-only error occurs.
  Typical scenario: A user detaches a data disk from a node by mistake on the ECS, and applications continuously write data to the mount point of the data disk. As a result, an I/O error occurs in the kernel and the disk is remounted as read-only.
- Description:
  - Warning event
  - Listening object: /dev/kmsg
  - Matching rule: Remounting filesystem read-only
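These event checks can be reproduced manually by matching similar patterns against the kernel log. A minimal sketch (the patterns are simplified versions of the matching rules above):
# OOMKilling: look for OOM kill records
dmesg -T | grep -E 'Killed process [0-9]+'
# TaskHung: look for hung-task reports
dmesg -T | grep -E 'blocked for more than [0-9]+ seconds'
# ReadonlyFilesystem: look for read-only remounts
dmesg -T | grep 'Remounting filesystem read-only'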
Status-related
For status-related check items, when a problem occurs, npd reports an event to the API server and changes the node status synchronously. This function can be used together with Node-problem-controller fault isolation to isolate nodes.
If the check period is not specified in the following check items, the default period is 30 seconds.
Table 2 Checking system components

Container network component error (CNIProblem)
- Function: Check the status of the CNI components (container network components).
- Description: None

Container runtime component error (CRIProblem)
- Function: Check the status of Docker and containerd of the CRI components (container runtime components).
- Description: Check object: Docker or containerd

Frequent restarts of Kubelet (FrequentKubeletRestart)
- Function: Periodically backtrack system logs to check whether the key component Kubelet restarts frequently.
- Description:
  - Default threshold: 10 restarts within 10 minutes
    If Kubelet restarts 10 times within 10 minutes, it indicates that the system restarts frequently and a fault alarm is generated.
  - Listening object: logs in the /run/log/journal directory

Frequent restarts of Docker (FrequentDockerRestart)
- Function: Periodically backtrack system logs to check whether the container runtime Docker restarts frequently.

Frequent restarts of containerd (FrequentContainerdRestart)
- Function: Periodically backtrack system logs to check whether the container runtime containerd restarts frequently.

kubelet error (KubeletProblem)
- Function: Check the status of the key component Kubelet.
- Description: None

kube-proxy error (KubeProxyProblem)
- Function: Check the status of the key component kube-proxy.
- Description: None

Table 3 Checking system metrics

Conntrack table full (ConntrackFullProblem)
- Function: Check whether the conntrack table is full.
- Description:
  - Default threshold: 90%
  - Usage: nf_conntrack_count
  - Maximum value: nf_conntrack_max

Insufficient disk resources (DiskProblem)
- Function: Check the usage of the system disk and CCE data disks (including the CRI logical disk and kubelet logical disk) on the node.
- Description:
  - Default threshold: 90%
  - Source: df -h

Insufficient file handles (FDProblem)
- Function: Check whether FD file handles are used up.
- Description:
  - Default threshold: 90%
  - Usage: the first value in /proc/sys/fs/file-nr
  - Maximum value: the third value in /proc/sys/fs/file-nr

Insufficient node memory (MemoryProblem)
- Function: Check whether memory is used up.
- Description:
  - Default threshold: 80%
  - Usage: MemTotal-MemAvailable in /proc/meminfo
  - Maximum value: MemTotal in /proc/meminfo

Insufficient process resources (PIDProblem)
- Function: Check whether PID process resources are exhausted.
- Description:
  - Default threshold: 90%
  - Usage: nr_threads in /proc/loadavg
  - Maximum value: the smaller value between /proc/sys/kernel/pid_max and /proc/sys/kernel/threads-max
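The metric checks in Table 3 can be approximated manually from the same sources. A minimal sketch of the usage calculations:
# ConntrackFullProblem: current entries vs. table size
echo "conntrack: $(cat /proc/sys/net/netfilter/nf_conntrack_count) / $(cat /proc/sys/net/netfilter/nf_conntrack_max)"
# FDProblem: the first value of file-nr is the usage, the third is the maximum
awk '{printf "file handles: %.0f%%\n", $1/$3*100}' /proc/sys/fs/file-nr
# MemoryProblem: (MemTotal - MemAvailable) / MemTotal
awk '/^MemTotal/{t=$2} /^MemAvailable/{a=$2} END{printf "memory: %.0f%%\n", (t-a)/t*100}' /proc/meminfo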
Table 4 Checking the storage

Disk read-only (DiskReadonly)
- Function: Periodically perform read and write tests on the system disk and CCE data disks (including the CRI logical disk and kubelet logical disk) of the node to check the availability of key disks.
- Description:
  - Detection paths:
    - /mnt/paas/kubernetes/kubelet/
    - /var/lib/docker/
    - /var/lib/containerd/
    - /var/paas/sys/log/cceaddon-npd/
  - The temporary file npd-disk-write-ping is generated in the detection path.
  - Currently, additional data disks are not supported.

Insufficient disk resources (DiskProblem)
- Function: Check the usage of the system disk and CCE data disks (including the CRI logical disk and kubelet logical disk) on the node.
- Description:
  - Default threshold: 90%
  - Source: df -h

emptyDir storage pool error (EmptyDirVolumeGroupStatusError)
- Function: Check whether the ephemeral volume group on the node is normal.
  Impact: The pod that depends on the storage pool cannot write data to the temporary volume. The temporary volume is remounted as a read-only file system by the kernel due to an I/O error.
  Typical scenario: When creating a node, a user configures two data disks as a temporary volume storage pool and then deletes some of these data disks by mistake. As a result, the storage pool becomes abnormal.
- Description:
  - Detection period: 30s
  - Source: vgs -o vg_name,vg_attr
  - Principle: Check whether the VG (storage pool) is in the P state. If it is, some PVs (data disks) are lost.
  - Joint scheduling: The scheduler can automatically identify a PV storage pool error and prevent pods that depend on the storage pool from being scheduled to the node.
  - Exceptional scenario: If all PVs (data disks) are lost and the VG (storage pool) is therefore lost, the npd add-on cannot detect it. In this case, kubelet automatically isolates the node, detects the loss of the VG (storage pool), and updates the corresponding resources in nodestatus.allocatable to 0. This prevents pods that depend on the storage pool from being scheduled to the node. The damage of a single PV cannot be detected by this check item, but by the ReadonlyFilesystem check item.

PV storage pool error (LocalPvVolumeGroupStatusError)
- Function: Check the PV group on the node.
  Impact: Pods that depend on the storage pool cannot write data to the persistent volume. The persistent volume is remounted as a read-only file system by the kernel due to an I/O error.
  Typical scenario: When creating a node, a user configures two data disks as a persistent volume storage pool. Some data disks are then deleted by mistake.

Mount point error (MountPointProblem)
- Function: Check the mount point on the node.
  Exceptional definition: The mount point cannot be accessed by running the cd command.
  Typical scenario: A network file system (for example, obsfs or s3fs) is mounted to a node. When the connection becomes abnormal due to a network or peer NFS server exception, all processes that access the mount point are suspended. For example, during a cluster upgrade, kubelet is restarted and scans all mount points. If an abnormal mount point is detected, the upgrade fails.
- Description: Alternatively, you can run the following command:
  for dir in `df -h | grep -v "Mounted on" | awk '{print $NF}'`; do cd "$dir"; done && echo "ok"

Suspended disk I/O (DiskHung)
- Function: Check whether I/O suspension occurs on all disks on the node, that is, whether I/O read and write operations are not responded to.
  Definition of I/O suspension: The system does not respond to disk I/O requests, and some processes are in the D state.
  Typical scenario: Disks cannot respond due to abnormal OS hard disk drivers or severe faults on the underlying network.
- Description:
  - Check object: all data disks
  - Source: /proc/diskstats
    Alternatively, you can run the following command:
    iostat -xmt 1
  - Threshold:
    - Average usage: ioutil >= 0.99
    - Average I/O queue length: avgqu-sz >= 1
    - Average I/O transfer volume: iops (w/s) + ioth (wMB/s) <= 1
  - Note: In some OSs, the disk I/O statistics do not change during I/O suspension. In this case, the CPU I/O time usage is calculated instead, and the iowait value should be greater than 0.8.

Slow disk I/O (DiskSlow)
- Function: Check whether all disks on the node have slow I/Os, that is, whether I/Os respond slowly.
  Typical scenario: EVS disks have slow I/Os due to network fluctuation.
- Description:
  - Check object: all data disks
  - Source: /proc/diskstats
    Alternatively, you can run the following command:
    iostat -xmt 1
  - Default threshold: average I/O latency: await >= 5000 ms
  - Note: If I/O requests are not responded to and the await data is not updated, this check item is invalid.
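The disk availability checks above can be approximated manually with the commands already listed. A minimal sketch (the test file name mirrors the npd-disk-write-ping file described for DiskReadonly):
# DiskReadonly: try writing a small test file in a detection path, then clean it up
touch /mnt/paas/kubernetes/kubelet/npd-disk-write-ping && rm -f /mnt/paas/kubernetes/kubelet/npd-disk-write-ping && echo "writable"
# EmptyDir/LocalPv storage pool checks: a VG whose attributes contain "p" has lost PVs
vgs -o vg_name,vg_attr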
Table 5 Other check items

Abnormal NTP (NTPProblem)
- Function: Check whether the node clock synchronization service ntpd or chronyd is running properly and whether a system time drift has occurred.
- Description: Default clock offset threshold: 8000 ms

Process D error (ProcessD)
- Function: Check whether there are processes in the D state on the node.
- Description:
  - Default threshold: 10 abnormal processes detected for three consecutive times
  - Source:
    - /proc/{PID}/stat
    - Alternatively, you can run the ps aux command.
  - Exceptional scenario: ProcessD ignores the resident D-state processes (heartbeat and update) on which the SDI driver on the BMS node depends.

Process Z error (ProcessZ)
- Function: Check whether the node has processes in the Z state.

ResolvConf error (ResolvConfFileProblem)
- Function: Check whether the ResolvConf file is lost and whether it is normal.
  Exceptional definition: No upstream domain name resolution server (nameserver) is included.
- Description: Object: /etc/resolv.conf

Existing scheduled event (ScheduledEvent)
- Function: Check whether scheduled live migration events exist on the node. A live migration plan event is usually triggered by a hardware fault and is an automatic fault rectification method at the IaaS layer.
  Typical scenario: The host is faulty. For example, the fan is damaged or the disk has bad sectors. As a result, live migration is triggered for VMs.
- Description: This check item is an Alpha feature and is disabled by default.
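A minimal sketch of how the process-state and clock checks can be reproduced manually:
# ProcessD: list processes in the uninterruptible sleep (D) state
ps -eo pid,stat,comm | awk '$2 ~ /^D/'
# ProcessZ: list zombie (Z) processes
ps -eo pid,stat,comm | awk '$2 ~ /^Z/'
# NTPProblem: check the clock offset reported by chronyd (if chrony is used)
chronyc tracking | grep "System time"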
The kubelet component provides the following default check items, which have known bugs or defects. You can work around them by upgrading the cluster or by using npd.
Table 6 Default kubelet check items

Insufficient PID resources (PIDPressure)
- Function: Check whether PIDs are sufficient.
- Description:
  - Interval: 10 seconds
  - Threshold: 90%
  - Defect: In community versions 1.23.1 and earlier, this check item becomes invalid when more than 65535 PIDs are used. For details, see issue 107107. In community versions 1.24 and earlier, thread-max is not considered in this check item.

Insufficient memory (MemoryPressure)
- Function: Check whether the allocatable memory for containers is sufficient.
- Description:
  - Interval: 10 seconds
  - Threshold: max. 100 MiB
  - Allocatable = Total memory of a node - Reserved memory of a node
  - Defect: This check item checks only the memory consumed by containers and does not consider the memory consumed by other elements on the node.

Insufficient disk resources (DiskPressure)
- Function: Check the disk usage and inode usage of the kubelet and Docker disks.
- Description:
  - Interval: 10 seconds
  - Threshold: 90%
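The kubelet check items surface as node conditions, which you can view with kubectl. For example:
# Show the PIDPressure, MemoryPressure, and DiskPressure conditions of a node
kubectl describe node <node-name> | grep -E 'PIDPressure|MemoryPressure|DiskPressure'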
Node-problem-controller Fault Isolation
Note
Fault isolation is supported only by add-ons of 1.16.0 and later versions.
By default, if multiple nodes become faulty, NPC adds taints to up to 10% of the nodes. You can set npc.maxTaintedNode to increase the threshold.
The open source NPD plug-in provides fault detection but not fault isolation. CCE enhances the node-problem-controller (NPC) based on the open source NPD. This component is implemented based on the Kubernetes node controller. For faults reported by NPD, NPC automatically adds taints to nodes for node fault isolation.
Parameter | Description | Default
---|---|---
npc.enable | Whether to enable NPC. NPC cannot be disabled in versions 1.18.0 or later. | true
npc.maxTaintedNode | The maximum number of nodes that NPC can add taints to when a single fault occurs on multiple nodes, to mitigate the impact. Both the int format and the percentage format are supported. | 10%
npc.affinity | Node affinity of the controller | N/A
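To see whether NPC has isolated any nodes, you can list the taints on all nodes. A minimal sketch:
# Print every node together with its taints (taints added by NPC appear here)
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints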
Collecting Prometheus Metrics
The NPD daemon pod exposes Prometheus metric data on port 19901. By default, the NPD pod is added with the annotation metrics.alpha.kubernetes.io/custom-endpoints: '[{"api":"prometheus","path":"/metrics","port":"19901","names":""}]'. You can build a Prometheus collector to identify and obtain NPD metrics from http://{{NpdPodIP}}:{{NpdPodPort}}/metrics.
Note
If the npd add-on version is earlier than 1.16.5, the exposed port of Prometheus metrics is 20257.
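For example, you can fetch the metrics directly from one pod. A minimal sketch, assuming the npd pods run in the kube-system namespace with the label app=npd (both are assumptions that may differ in your cluster):
# Resolve the IP of one npd pod and scrape its metrics endpoint
NPD_POD_IP=$(kubectl -n kube-system get pod -l app=npd -o jsonpath='{.items[0].status.podIP}')
curl "http://${NPD_POD_IP}:19901/metrics"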
Currently, the metric data includes problem_counter and problem_gauge, as shown below.
# HELP problem_counter Number of times a specific type of problem have occurred.
# TYPE problem_counter counter
problem_counter{reason="DockerHung"} 0
problem_counter{reason="DockerStart"} 0
problem_counter{reason="EmptyDirVolumeGroupStatusError"} 0
...
# HELP problem_gauge Whether a specific type of problem is affecting the node or not.
# TYPE problem_gauge gauge
problem_gauge{reason="CNIIsDown",type="CNIProblem"} 0
problem_gauge{reason="CNIIsUp",type="CNIProblem"} 0
problem_gauge{reason="CRIIsDown",type="CRIProblem"} 0
problem_gauge{reason="CRIIsUp",type="CRIProblem"} 0
...