:original_name: cce_10_0132.html
.. _cce_10_0132:
npd
===
Introduction
------------
node-problem-detector (npd for short) is an add-on that monitors abnormal events on cluster nodes and connects to a third-party monitoring platform. It runs as a daemon on each node, collects node issues from different daemons, and reports them to the API server. npd can run either as a DaemonSet or as a standalone daemon.
For more information, see `node-problem-detector <https://github.com/kubernetes/node-problem-detector>`__.
Notes and Constraints
---------------------
- When using this add-on, do not format or partition node disks.
- Each npd process occupies 30m CPU and 100 MB of memory.
Permission Description
----------------------
To monitor kernel logs, the npd add-on needs to read **/dev/kmsg** on the host. Therefore, privileged mode must be enabled. For details, see `privileged `__.
In addition, CCE mitigates risks according to the principle of least privilege. The running npd process is granted only the following capabilities:
- cap_dac_read_search: permission to access **/run/log/journal**.
- cap_sys_admin: permission to access **/dev/kmsg**.
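To confirm which capabilities the running npd process actually holds on a node, you can decode its capability mask. The following is a minimal sketch; the process name used for matching is an assumption and may differ in your deployment.

.. code-block::

   # Find the npd process and decode its effective capabilities
   # (the process name "node-problem-detector" is an assumption).
   PID=$(pgrep -f node-problem-detector | head -n 1)
   grep CapEff /proc/${PID}/status
   capsh --decode=$(grep CapEff /proc/${PID}/status | awk '{print $2}')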
Installing the Add-on
---------------------
#. Log in to the CCE console, click the cluster name, and access the cluster console. Choose **Add-ons** in the navigation pane, locate **npd** on the right, and click **Install**.
#. On the **Install Add-on** page, select the add-on specifications and set related parameters.
- **Pods**: Set the number of pods based on service requirements.
- **Containers**: Select a proper container quota based on service requirements.
#. Configure the following parameters and click **Install**.
These configurations are supported only by add-on version 1.16.0 and later.
**npc.enable**: indicates whether to enable :ref:`Node-problem-controller <cce_10_0132__section1471610580474>`.
npd Check Items
---------------
.. note::
Check items are supported only in 1.16.0 and later versions.
Check items cover events and statuses.
- Event-related
For event-related check items, when a problem occurs, npd reports an event to the API server. The event type can be **Normal** (normal event) or **Warning** (abnormal event).
.. table:: **Table 1** Event-related check items
+-----------------------+-------------------------------------------------------+-----------------------+
| Check Item | Function | Description |
+=======================+=======================================================+=======================+
| OOMKilling | Check whether OOM events occur and are reported. | Warning event |
+-----------------------+-------------------------------------------------------+-----------------------+
| TaskHung | Check whether taskHung events occur and are reported. | Warning event |
+-----------------------+-------------------------------------------------------+-----------------------+
| KernelOops | Check kernel nil pointer panic errors. | Warning event |
+-----------------------+-------------------------------------------------------+-----------------------+
| ConntrackFull | Check whether the conntrack table is full. | Warning event |
| | | |
| | | Interval: 30 seconds |
| | | |
| | | Threshold: 80% |
+-----------------------+-------------------------------------------------------+-----------------------+
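Event-related check results are reported as Kubernetes events, so you can review them with kubectl. For ConntrackFull, the 80% threshold can also be reproduced manually on a node. The following is a minimal sketch based on the items in Table 1; it is not the exact logic used by npd.

.. code-block::

   # List warning events that reference nodes.
   kubectl get events --field-selector type=Warning,involvedObject.kind=Node

   # Manually reproduce the ConntrackFull check (threshold: 80%).
   count=$(cat /proc/sys/net/netfilter/nf_conntrack_count)
   max=$(cat /proc/sys/net/netfilter/nf_conntrack_max)
   awk -v c="$count" -v m="$max" 'BEGIN {printf "conntrack usage: %.1f%%\n", c * 100 / m}'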
- Status-related
For status-related check items, when a problem occurs, npd reports an event to the API server and changes the node status synchronously. This function can be used together with :ref:`Node-problem-controller fault isolation <cce_10_0132__section1471610580474>` to isolate nodes.
**If the check period is not specified in the following check items, the default period is 30 seconds.**
.. table:: **Table 2** Application and OS check items
+---------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Check Item | Function | Description |
+===========================+===============================================================================================================================================================+============================================================================================================================================================+
| FrequentKubeletRestart | Check whether kubelet restarts frequently by listening to journald logs. | - Interval: 5 minutes |
| | | |
| | | - Backtracking: 10 minutes |
| | | |
| | | - Threshold: 10 times |
| | | |
| | | If the system restarts for 10 times within the backtracking period, it indicates that the system restarts frequently and a fault alarm is generated. |
| | | |
| | | - Listening object: logs in the **/run/log/journal** directory |
| | | |
| | | .. note:: |
| | | |
| | | The Ubuntu OS does not support the preceding check items due to incompatible log formats. |
+---------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+
| FrequentDockerRestart | Check whether Docker restarts frequently by listening to journald logs. | |
+---------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+
| FrequentContainerdRestart | Check whether containerd restarts frequently by listening to journald logs. | |
+---------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+
| CRIProblem | Check the CRI component status. | Check object: Docker or containerd |
+---------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+
| KUBELETProblem | Check the kubelet status. | None |
+---------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+
| NTPProblem | Check the NTP and Chrony service status. | Threshold of the clock offset: 8000 ms |
| | | |
| | Check whether the node clock offsets. | |
+---------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+
| PIDProblem | Check whether PIDs are sufficient. | - Threshold: 90% |
| | | - Usage: nr_threads in /proc/loadavg |
| | | - Maximum value: smaller value between **/proc/sys/kernel/pid_max** and **/proc/sys/kernel/threads-max**. |
+---------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+
| FDProblem | Check whether file handles are sufficient. | - Threshold: 90% |
| | | - Usage: the first value in **/proc/sys/fs/file-nr** |
| | | - Maximum value: the third value in **/proc/sys/fs/file-nr** |
+---------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+
| MemoryProblem | Check whether the overall node memory is sufficient. | - Threshold: 90% |
| | | - Usage: **MemTotal-MemAvailable** in **/proc/meminfo** |
| | | - Maximum value: **MemTotal** in **/proc/meminfo** |
+---------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ResolvConfFileProblem | Check whether the ResolvConf file is lost. | Object: **/etc/resolv.conf** |
| | | |
| | Check whether the ResolvConf file is normal. | |
| | | |
| | Exception definition: No upstream domain name resolution server (nameserver) is included. | |
+---------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ProcessD | Check whether there is a process D on the node. | Source: |
| | | |
| | | - /proc/{PID}/stat |
| | | - Alternately, you can run **ps aux**. |
| | | |
| | | Exception scenario: ProcessD ignores the resident processes (heartbeat and update) that are in the D state that the SDI driver on the BMS node depends on. |
+---------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ProcessZ | Check whether the node has processes in Z state. | |
+---------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+
| ScheduledEvent | Check whether host plan events exist on the node. | Source: |
| | | |
| | Typical scenario: The host is faulty, for example, the fan is damaged or the disk has bad sectors. As a result, cold and live migration is triggered for VMs. | - http://169.254.169.254/meta-data/latest/events/scheduled |
| | | |
| | | This check item is an Alpha feature and is disabled by default. |
+---------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+
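Several of the thresholds in Table 2 can be reproduced manually on a node for troubleshooting. The following is a minimal sketch based on the sources and maximum values listed above; it is not the exact implementation used by npd.

.. code-block::

   # PIDProblem: usage is nr_threads; maximum is min(pid_max, threads-max); threshold 90%.
   nr_threads=$(awk -F'[ /]+' '{print $5}' /proc/loadavg)
   pid_max=$(cat /proc/sys/kernel/pid_max)
   threads_max=$(cat /proc/sys/kernel/threads-max)
   max=$(( pid_max < threads_max ? pid_max : threads_max ))
   echo "PID usage: $(( nr_threads * 100 / max ))%"

   # FDProblem: usage is the first value and maximum is the third value in file-nr; threshold 90%.
   read used _ fd_max < /proc/sys/fs/file-nr
   echo "FD usage: $(( used * 100 / fd_max ))%"

   # MemoryProblem: usage is MemTotal - MemAvailable; maximum is MemTotal; threshold 90%.
   mem_total=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
   mem_avail=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
   echo "Memory usage: $(( (mem_total - mem_avail) * 100 / mem_total ))%"

   # NTPProblem: clock offset threshold is 8000 ms; with chrony, inspect the current offset.
   chronyc tracking | grep "System time"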
.. table:: **Table 3** Network connection check items
+------------------+------------------------------------------------------+-------------+
| Check Item | Function | Description |
+==================+======================================================+=============+
| CNIProblem | Check whether the CNI component is running properly. | None |
+------------------+------------------------------------------------------+-------------+
| KUBEPROXYProblem | Check whether kube-proxy is running properly. | None |
+------------------+------------------------------------------------------+-------------+
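Table 3 does not specify how these components are probed. As a simple manual complement (not necessarily the method npd uses), you can query the kube-proxy health endpoint, which listens on port 10256 by default.

.. code-block::

   # Returns HTTP 200 with a timestamp when kube-proxy is healthy.
   curl -s http://127.0.0.1:10256/healthz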
.. table:: **Table 4** Storage check items
+--------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Check Item | Function | Description |
+================================+====================================================================================================================================================================================================================================================================================================================================================================================================+====================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================================+
| ReadonlyFilesystem | Check whether the **Remount root filesystem read-only** error occurs in the system kernel by listening to the kernel logs. | Listening object: **/dev/kmsg** |
| | | |
| | Typical scenario: A user detaches a data disk from a node by mistake on the ECS, and applications continuously write data to the mount point of the data disk. As a result, an I/O error occurs in the kernel and the disk is reattached as a read-only disk. | Matching rule: **Remounting filesystem read-only** |
+--------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| DiskReadonly | Check whether the system disk, Docker disk, and kubelet disk are read-only. | Detection paths: |
| | | |
| | | - /mnt/paas/kubernetes/kubelet/ |
| | | - /var/lib/docker/ |
| | | - /var/lib/containerd/ |
| | | - /var/paas/sys/log/cceaddon-npd/ |
| | | |
| | | The temporary file **npd-disk-write-ping** is generated in the detection path. |
| | | |
| | | Currently, additional data disks are not supported. |
+--------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| DiskProblem | Check the usage of the system disk, Docker disk, and kubelet disk. | - Threshold: 80% |
| | | |
| | | - Source: |
| | | |
| | | .. code-block:: |
| | | |
| | | df -h |
| | | |
| | | Currently, additional data disks are not supported. |
+--------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| EmptyDirVolumeGroupStatusError | Check whether the ephemeral volume group on the node is normal. | - Detection period: 60s |
| | | |
| | Impact: The pod that depends on the storage pool cannot write data to the temporary volume. The temporary volume is remounted as a read-only file system by the kernel due to an I/O error. | - Source: |
| | | |
| | Typical scenario: When creating a node, a user configures two data disks as a temporary volume storage pool. The user deletes some data disks by mistake. As a result, the storage pool becomes abnormal. | .. code-block:: |
| | | |
| | | vgs -o vg_name, vg_attr |
| | | |
| | | - Principle: Check whether the VG (storage pool) is in the P state. If yes, some PVs (data disks) are lost. |
| | | |
| | | - Joint scheduling: The scheduler can automatically identify an abnormal node and prevent pods that depend on the storage pool from being scheduled to the node. |
| | | |
| | | - Exception scenario: The npd add-on cannot detect the loss of all PVs (data disks), resulting in the loss of VGs (storage pools). In this case, kubelet automatically isolates the node, detects the loss of VGs (storage pools), and updates the corresponding resources in **nodestatus.allocatable** to **0**. This prevents pods that depend on the storage pool from being scheduled to the node. The damage of a single PV cannot be detected. In this case, the ReadonlyFilesystem detection is abnormal. |
+--------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| LocalPvVolumeGroupStatusError | Check the PV group on the node. | |
| | | |
| | Impact: Pods that depend on the storage pool cannot write data to the persistent volume. The persistent volume is remounted as a read-only file system by the kernel due to an I/O error. | |
| | | |
| | Typical scenario: When creating a node, a user configures two data disks as a persistent volume storage pool. Some data disks are deleted by mistake. | |
+--------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| MountPointProblem | Check the mount point on the node. | Alternatively, you can run the following command: |
| | | |
| | Exception definition: You cannot access the mount point by running the **cd** command. | .. code-block:: |
| | | |
| | Typical scenario: Network File System (NFS), for example, obsfs and s3fs is mounted to a node. When the connection is abnormal due to network or peer NFS server exceptions, all processes that access the mount point are suspended. For example, during a cluster upgrade, a kubelet is restarted, and all mount points are scanned. If the abnormal mount point is detected, the upgrade fails. | for dir in `df -h | grep -v "Mounted on" | awk "{print \\$NF}"`;do cd $dir; done && echo "ok" |
+--------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| DiskHung | Check whether I/O faults occur on the disk of the node. | - Check object: all data disks |
| | | |
| | Definition of I/O faults: The system does not respond to disk I/O requests, and some processes are in the D state. | - Source: |
| | | |
| | Typical Scenario: Disks cannot respond due to abnormal OS hard disk drivers or severe faults on the underlying network. | /proc/diskstat |
| | | |
| | | Alternatively, you can run the following command: |
| | | |
| | | .. code-block:: |
| | | |
| | | iostat -xmt 1 |
| | | |
| | | - Threshold: |
| | | |
| | | - Average usage. The value of ioutil is greater than or equal to 0.99. |
| | | - Average I/O queue length. avgqu-sz >=1 |
| | | - Average I/O transfer volume, iops (w/s) + ioth (wMB/s) < = 1 |
| | | |
| | | .. note:: |
| | | |
| | | In some OSs, no data changes during I/O. In this case, calculate the CPU I/O time usage. The value of iowait is greater than 0.8. |
+--------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| DiskSlow | Check whether slow I/O occurs on the disk of the node. | - Check object: all data disks |
| | | |
| | Definition of slow I/O: The average response time exceeds the threshold. | - Source: |
| | | |
| | Typical scenario: EVS disks have slow I/Os due to network fluctuation. | /proc/diskstat |
| | | |
| | | Alternatively, you can run the following command: |
| | | |
| | | .. code-block:: |
| | | |
| | | iostat -xmt 1 |
| | | |
| | | - Threshold: |
| | | |
| | | Average I/O latency: await > = 5000 ms |
| | | |
| | | .. note:: |
| | | |
| | | If I/O requests are not responded and the **await** data is not updated. In this case, this check item is invalid. |
+--------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
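For the storage pool checks (EmptyDirVolumeGroupStatusError and LocalPvVolumeGroupStatusError), the principle described above can be verified manually: a volume group whose attribute string carries the partial flag (**p**) has lost one or more physical volumes. A minimal sketch:

.. code-block::

   # The fourth character of vg_attr is "p" (partial) when a PV backing the VG is missing.
   vgs -o vg_name,vg_attr --noheadings | awk 'substr($2, 4, 1) == "p" {print "VG " $1 " is missing a PV"}'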
The kubelet component provides the following default check items, which have known bugs or defects. You can work around them by upgrading the cluster or by using npd.
.. table:: **Table 5** Default kubelet check items
+-----------------------+------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Check Item | Function | Description |
+=======================+========================================================================+==========================================================================================================================================================================================================================================================================================================================+
| PIDPressure | Check whether PIDs are sufficient. | - Interval: 10 seconds |
| | | - Threshold: 90% |
| | | - Defect: In community version 1.23.1 and earlier versions, this check item becomes invalid when over 65535 PIDs are used. For details, see `issue 107107 `__. In community version 1.24 and earlier versions, thread-max is not considered in this check item. |
+-----------------------+------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| MemoryPressure | Check whether the allocable memory for the containers is sufficient. | - Interval: 10 seconds |
| | | - Threshold: Max. 100 MiB |
| | | - Allocable = Total memory of a node - Reserved memory of a node |
| | | - Defect: This check item checks only the memory consumed by containers, and does not consider that consumed by other elements on the node. |
+-----------------------+------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| DiskPressure | Check the disk usage and inodes usage of the kubelet and Docker disks. | Interval: 10 seconds |
| | | |
| | | Threshold: 90% |
+-----------------------+------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
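Status-related check results, including the kubelet defaults above, surface as node conditions, which you can inspect directly. For example:

.. code-block::

   # Replace <node-name> with an actual node name.
   kubectl get node <node-name> -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'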
.. _cce_10_0132__section1471610580474:
Node-problem-controller Fault Isolation
---------------------------------------
.. note::
Fault isolation is supported only by add-ons of 1.16.0 and later versions.
When installing the npd add-on, set **npc.enable** to **true** to deploy two NPC (Node-problem-controller) instances. NPC can also be deployed as a single instance, but a single-instance deployment does not ensure high availability.
By default, if multiple nodes become faulty, NPC adds taints to only one of them. You can set **npc.maxTaintedNode** to increase this threshold. If NPC is not running when a fault is rectified, the taints it added are not removed automatically; manually clear the taints or restart NPC, as shown below.
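To find nodes that NPC has isolated and to remove a leftover taint manually, you can use kubectl as sketched below. The taint key depends on the fault type and your NPC configuration, so it is shown as a placeholder.

.. code-block::

   # Show the taints on each node.
   kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints

   # Remove a leftover taint (replace <taint-key> with the key shown above).
   kubectl taint nodes <node-name> <taint-key>:NoSchedule-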
The open source NPD plug-in provides fault detection but not fault isolation. CCE enhances the node-problem-controller (NPC) based on the open source NPD. This component is implemented based on the Kubernetes `node controller `__. For faults reported by NPD, NPC automatically adds taints to nodes for node fault isolation.
You can modify the add-on parameter **npc.customConditionToTaint** according to the following table to configure fault isolation rules.
.. table:: **Table 6** Parameters
+--------------------------------+---------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
| Parameter | Description | Default |
+================================+=========================================================+=========================================================================================================================================+
| npc.enable | Whether to enable NPC | true |
+--------------------------------+---------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
| npc.customCondtionToTaint | Fault isolation rules | See :ref:`Table 7 `. |
+--------------------------------+---------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
| npc.customConditionToTaint[i] | Fault isolation rule items | N/A |
+--------------------------------+---------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
| npc.customConditionToTaint[i]. | Fault status | true |
| | | |
| condition.status | | |
+--------------------------------+---------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
| npc.customConditionToTaint[i]. | Fault type | N/A |
| | | |
| condition.type | | |
+--------------------------------+---------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
| npc.customConditionToTaint[i]. | Whether to enable the fault isolation rule. | false |
| | | |
| enable | | |
+--------------------------------+---------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
| npc.customConditionToTaint[i]. | Fault isolation effect | NoSchedule |
| | | |
| .taint.effect | NoSchedule, PreferNoSchedule, or NoExecute | Value options: **NoSchedule**, **PreferNoSchedule**, and **NoExecute** |
+--------------------------------+---------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
| npc. maxTaintedNode | Number of nodes in a cluster that can be tainted by NPC | 1 |
| | | |
| | The int format and percentage format are supported. | Values: |
| | | |
| | | - The value is in int format and ranges from 1 to infinity. |
| | | - The value ranges from 1% to 100%, in percentage. The minimum value of this parameter multiplied by the number of cluster nodes is 1. |
+--------------------------------+---------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
| Npc.affinity | Node affinity of the controller | N/A |
+--------------------------------+---------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
.. _cce_10_0132__table147438134911:
.. table:: **Table 7** Fault isolation rule configuration
+---------------------------+---------------------------------------------------------------------+--------------------------------------+
| Fault | Fault Details | Taint |
+===========================+=====================================================================+======================================+
| DiskReadonly | Disk read-only | **NoSchedule**: No new pods allowed. |
+---------------------------+---------------------------------------------------------------------+--------------------------------------+
| DiskProblem | The disk space is insufficient, and key logical disks are detached. | **NoSchedule**: No new pods allowed. |
+---------------------------+---------------------------------------------------------------------+--------------------------------------+
| FrequentKubeletRestart | kubelet restarts frequently. | **NoSchedule**: No new pods allowed. |
+---------------------------+---------------------------------------------------------------------+--------------------------------------+
| FrequentDockerRestart | Docker restarts frequently. | **NoSchedule**: No new pods allowed. |
+---------------------------+---------------------------------------------------------------------+--------------------------------------+
| FrequentContainerdRestart | containerd restarts frequently. | **NoSchedule**: No new pods allowed. |
+---------------------------+---------------------------------------------------------------------+--------------------------------------+
| KUBEPROXYProblem | kube-proxy is abnormal. | **NoSchedule**: No new pods allowed. |
+---------------------------+---------------------------------------------------------------------+--------------------------------------+
| PIDProblem | Insufficient PIDs | **NoSchedule**: No new pods allowed. |
+---------------------------+---------------------------------------------------------------------+--------------------------------------+
| FDProblem | Insufficient file handles | **NoSchedule**: No new pods allowed. |
+---------------------------+---------------------------------------------------------------------+--------------------------------------+
| MemoryProblem | Insufficient node memory | **NoSchedule**: No new pods allowed. |
+---------------------------+---------------------------------------------------------------------+--------------------------------------+
Collecting Prometheus Metrics
-----------------------------
The NPD daemon pod exposes Prometheus metric data on port 19901. By default, the NPD pod is annotated with **metrics.alpha.kubernetes.io/custom-endpoints: '[{"api":"prometheus","path":"/metrics","port":"19901","names":""}]'**. You can build a Prometheus collector to identify and scrape NPD metrics from **http://{{NpdPodIP}}:{{NpdPodPort}}/metrics**.
.. note::
If the npd add-on version is earlier than 1.16.5, the exposed port of Prometheus metrics is **20257**.
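To check the exposed metrics manually, you can query the endpoint from within the cluster. The namespace and label selector below are assumptions and may differ in your cluster.

.. code-block::

   # Look up an npd pod IP and query its metrics endpoint
   # (use port 20257 for add-on versions earlier than 1.16.5).
   POD_IP=$(kubectl -n kube-system get pod -l app=node-problem-detector -o jsonpath='{.items[0].status.podIP}')
   curl -s http://${POD_IP}:19901/metrics | grep "^problem_"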
Currently, the metric data includes **problem_counter** and **problem_gauge**, as shown below.
.. code-block::
# HELP problem_counter Number of times a specific type of problem have occurred.
# TYPE problem_counter counter
problem_counter{reason="DockerHung"} 0
problem_counter{reason="DockerStart"} 0
problem_counter{reason="EmptyDirVolumeGroupStatusError"} 0
...
# HELP problem_gauge Whether a specific type of problem is affecting the node or not.
# TYPE problem_gauge gauge
problem_gauge{reason="CNIIsDown",type="CNIProblem"} 0
problem_gauge{reason="CNIIsUp",type="CNIProblem"} 0
problem_gauge{reason="CRIIsDown",type="CRIProblem"} 0
problem_gauge{reason="CRIIsUp",type="CRIProblem"} 0
...