Node Fault Detection Policy

Node fault detection depends on the CCE Node Problem Detector (NPD) add-on. Add-on instances run on the nodes and monitor them. This section describes how to enable node fault detection.

Prerequisites

The CCE Node Problem Detector add-on has been installed in the cluster.

Enabling Node Fault Detection

  1. Log in to the CCE console and click the cluster name to access the cluster console.
  2. In the navigation pane, choose Nodes and then click the Nodes tab. Check whether the NPD add-on has been installed in the cluster and upgraded to the latest version. Node fault detection is available only after the NPD add-on has been installed.
  3. If the NPD add-on is running properly, click Node Fault Detection Policy to view the current fault detection items. For details about the NPD check item list, see NPD Check Items.
  4. If the check result of a node is abnormal, a message in the node list indicates that a metric is abnormal.
  5. Click Abnormal metrics and rectify the fault as prompted.
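
Outside the console, a similar view can be approximated by inspecting node conditions. The sketch below assumes that abnormal check results surface as Kubernetes node conditions (as the upstream node-problem-detector does for conditions such as KernelDeadlock) and that node objects were exported with `kubectl get nodes -o json`. The function name `abnormal_conditions` and the sample data are illustrative only.

```python
from __future__ import annotations

import json

# "Ready" is healthy when its status is "True". Every other condition
# (MemoryPressure, DiskPressure, NPD-reported problems) is healthy when "False".
HEALTHY_WHEN_TRUE = {"Ready"}


def abnormal_conditions(node: dict) -> list[str]:
    """Return the types of all node conditions that are not in their healthy state."""
    abnormal = []
    for cond in node.get("status", {}).get("conditions", []):
        healthy = "True" if cond["type"] in HEALTHY_WHEN_TRUE else "False"
        # A status of "Unknown" is also treated as abnormal.
        if cond["status"] != healthy:
            abnormal.append(cond["type"])
    return abnormal


# Hypothetical sample shaped like one item of `kubectl get nodes -o json`.
sample_node = json.loads("""
{
  "metadata": {"name": "node-1"},
  "status": {
    "conditions": [
      {"type": "Ready", "status": "True"},
      {"type": "MemoryPressure", "status": "False"},
      {"type": "KernelDeadlock", "status": "True"}
    ]
  }
}
""")

print(sample_node["metadata"]["name"], abnormal_conditions(sample_node))
```

For the sample node, only KernelDeadlock is reported, because Ready is "True" and MemoryPressure is "False".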

Customized Check Items

  1. Log in to the CCE console and click the cluster name to access the cluster console.
  2. In the navigation pane, choose Nodes and then click the Nodes tab. Then, click Fault Detection Policy.
  3. On the displayed page, view the current check items. To modify a check item, click Edit in the Operation column.

    Currently, the following configurations are supported:
    • Enable/Disable: Enable or disable a check item.
    • Target Node: By default, check items run on all nodes. You can restrict a check item to specific nodes for special scenarios. For example, the spot price ECS interruption reclamation check runs only on spot price ECS nodes.

    • Trigger Threshold: The default thresholds match common fault scenarios. You can customize them as required. For example, you can lower the threshold that triggers the connection tracking table exhaustion check from 90% to 80%.

    • Check Period: The default check period is 30 seconds. You can modify this parameter as required.

    • Troubleshooting Strategy: Select how a detected fault is handled. The available strategies are listed in the following table.
      Table 1 Troubleshooting strategies

      Troubleshooting Strategy   Effect
      Prompting Exception        Kubernetes events are reported.
      Disabling scheduling       Kubernetes events are reported and the NoSchedule taint is added to the node.
      Evict Node Load            Kubernetes events are reported and the NoExecute taint is added to the node. This operation will evict workloads on the node and interrupt services. Exercise caution when performing this operation.
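
To make the configuration fields above concrete, the following sketch models one check item and shows how its threshold and troubleshooting strategy could be interpreted. It is an illustration only, not the add-on's implementation: the `CheckItem` class, its field names, and the `evaluate` function are hypothetical, and treating usage equal to the threshold as a fault is an assumption. Only the 30-second default period, the 90% and 80% connection tracking thresholds, and the strategy-to-taint mapping come from the text above.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Strategy(Enum):
    PROMPTING_EXCEPTION = "Prompting Exception"
    DISABLING_SCHEDULING = "Disabling scheduling"
    EVICT_NODE_LOAD = "Evict Node Load"


# Taint effect added by each strategy (Table 1); None means events only.
TAINT_EFFECT = {
    Strategy.PROMPTING_EXCEPTION: None,
    Strategy.DISABLING_SCHEDULING: "NoSchedule",
    Strategy.EVICT_NODE_LOAD: "NoExecute",
}


@dataclass
class CheckItem:
    """Hypothetical model of one editable check item."""
    name: str
    enabled: bool = True
    threshold_percent: float = 90.0   # default connection tracking threshold
    period_seconds: int = 30          # default check period
    strategy: Strategy = Strategy.PROMPTING_EXCEPTION


def evaluate(item: CheckItem, used: int, capacity: int) -> Optional[dict]:
    """Return the actions for a fault, or None if the check is disabled or below threshold."""
    if not item.enabled or capacity <= 0:
        return None
    usage = 100.0 * used / capacity
    # Assumption: reaching the threshold exactly already counts as a fault.
    if usage < item.threshold_percent:
        return None
    return {
        "event": True,  # every strategy reports a Kubernetes event
        "taint": TAINT_EFFECT[item.strategy],
        "usage_percent": round(usage, 1),
    }


# Connection tracking table at 85% usage.
default_item = CheckItem("conntrack-full")
strict_item = CheckItem(
    "conntrack-full",
    threshold_percent=80.0,
    strategy=Strategy.DISABLING_SCHEDULING,
)

print(evaluate(default_item, 85_000, 100_000))  # below the 90% default
print(evaluate(strict_item, 85_000, 100_000))   # above the lowered 80% threshold
```

With the default 90% threshold, 85% usage produces no fault. After the threshold is lowered to 80% and the strategy is set to Disabling scheduling, the same usage yields an event plus the NoSchedule taint.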

NPD Check Items

Check items are supported only in NPD add-on 1.16.0 and later versions.

Check items cover events and statuses.