If the cluster status is available but some nodes in the cluster are unavailable, perform the following operations to rectify the fault:
Fault locating:
Excessively high CPU or memory usage on a node can cause high network latency or trigger the system OOM killer (see What Should I Do If the OOM Killer Is Triggered When a Container Uses Memory Resources More Than Limited?). As a result, the node is displayed as unavailable.
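To confirm this cause on the node itself, memory pressure and OOM killer activity can be checked from a shell. The following is a minimal sketch, not part of the CCE tooling: the helper names mem_used_percent and count_oom_events are hypothetical, and it assumes a Linux node that exposes /proc/meminfo with a MemAvailable field and whose kernel log contains the usual "Out of memory" or "oom-killer" messages.

```shell
# Hypothetical helpers for illustration only.

# Print the percentage of memory in use, computed from a meminfo-style file.
# Defaults to /proc/meminfo when no file is given.
mem_used_percent() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%d\n", (total - avail) * 100 / total }' \
        "${1:-/proc/meminfo}"
}

# Count kernel log lines read from stdin that indicate the OOM killer ran.
# grep -c exits non-zero when the count is 0, so the status is forced to success.
count_oom_events() {
    grep -c -i -E 'out of memory|oom-killer' || true
}
```

For example, run mem_used_percent to print the share of memory in use, and dmesg | count_oom_events to count OOM-related kernel messages (dmesg may require root privileges).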
Solution
After the node becomes available, the workload is restored.
Log in to the CCE console and choose Resource Management > Clusters in the navigation pane. On the page displayed, check whether the cluster is available.
If the node names are inconsistent and neither the password nor the key can be used to log in to the node, a Cloud-Init problem occurred when the ECS was created. In this case, restart the node and submit a service ticket to the ECS personnel to locate the root cause.
The name of this security group is in the format of cluster name-cce-control-ID.
Inbound rule parameter description:
After a node is created in a cluster of v1.7.3-r7 or later, a 100 GB data disk dedicated to Docker is attached to the node. If this data disk is detached or damaged, the Docker service becomes abnormal and the node becomes unavailable.
Click the node name and check whether the data disk has been detached from the node. If it has, attach a data disk to the node again and restart the node. The node then recovers.
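If you can still log in to the node, the same check can be made from a shell. The following is a minimal sketch under stated assumptions: the helper name device_mounted_at is hypothetical, and the mount point /var/lib/docker used in the usage note is the Docker default data directory, which may differ from the path used on your node.

```shell
# Hypothetical helper for illustration only.

# Print the device mounted at the given mount point, read from a
# /proc/mounts-style file (fields: device, mount point, file system type, ...).
# Prints nothing if no device is mounted there.
device_mounted_at() {
    awk -v mp="$1" '$2 == mp { print $1 }' "${2:-/proc/mounts}"
}
```

For example, device_mounted_at /var/lib/docker prints the backing device if a disk is mounted at that path. Empty output suggests that no dedicated disk is mounted there, which is consistent with the data disk having been detached.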
For details, see Logging In to a Linux ECS.
For v1.13, run the following command:
systemctl status kubelet
If the command fails, contact technical support. If it succeeds, the status of each component is displayed as active, as shown in the following figure.
If the component status is not active, run the following commands (using the faulty component canal as an example):
Run systemctl restart canal to restart the component.
After restarting the component, run systemctl status canal to check the status.
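The check-and-restart sequence above can be wrapped in a small function. This is a minimal sketch rather than an official procedure: the function name restart_if_inactive and the SYSTEMCTL override variable are hypothetical, and it relies only on the standard systemctl is-active and systemctl restart subcommands.

```shell
# Hypothetical helper for illustration only.

# Restart a systemd unit if it is not active, then report its final state.
# SYSTEMCTL may be set to another command (for example, a stub in tests);
# it defaults to the real systemctl.
restart_if_inactive() {
    unit="$1"
    ctl="${SYSTEMCTL:-systemctl}"

    if "$ctl" is-active --quiet "$unit"; then
        echo "$unit: active"
        return 0
    fi

    echo "$unit: not active, restarting"
    "$ctl" restart "$unit"

    if "$ctl" is-active --quiet "$unit"; then
        echo "$unit: active after restart"
    else
        echo "$unit: still not active, contact technical support"
        return 1
    fi
}
```

For example, restart_if_inactive canal restarts canal only when it is not active and then reports whether the restart helped.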
For versions earlier than v1.13, run the following command:
su paas -c '/var/paas/monit/bin/monit summary'
If the command fails, contact technical support. If it succeeds, the status of each component is displayed, as shown in the following figure.
If any service component is not in the Running state, restart the corresponding service. In the following figure, for example, the canal component is abnormal.
Run su paas -c '/var/paas/monit/bin/monit restart canal' to restart the canal component.
After the restart, run su paas -c '/var/paas/monit/bin/monit summary' to query the status of the canal component.
If the restart succeeds, the status of each component is Running, as shown in the following figure.
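When many components are listed, the ones that are not Running can be filtered out of the summary. The following is a minimal sketch, and the function name not_running_components is hypothetical. It assumes that the summary prints one line per component in the form Process 'name' followed by the status; if your monit version prints a different layout, the filter must be adapted.

```shell
# Hypothetical helper for illustration only.

# Read monit summary output on stdin and print the names of components
# whose status is not "Running".
# Assumed line format: Process 'name'   Status
not_running_components() {
    awk '$1 == "Process" && $NF != "Running" { print $2 }' | tr -d "'"
}
```

For example, su paas -c '/var/paas/monit/bin/monit summary' | not_running_components prints one component name per line, and prints nothing when every component is Running.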
ps -ef | grep monitrc
kill -s 9 `ps -ef | grep monitrc | grep -v grep | awk '{print $2}'`