If the cluster status is available but some nodes in the cluster are unavailable, perform the following operations to rectify the fault:
Kubernetes provides a heartbeat mechanism to help you determine node availability. For details about the mechanism and interval, see Heartbeats.
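If kubectl is configured for the cluster, you can also view the heartbeats reported by a node in its conditions. In the following sketch, <node-name> is a placeholder for the actual node name:
kubectl describe node <node-name>
The LastHeartbeatTime column in the Conditions section shows when the node last reported its status.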
The issues here are described in order of how likely they are to occur.
Check these causes one by one until you find the cause of the fault.
Symptom
The node connection in the cluster is abnormal. Multiple nodes report write errors, but services are not affected.
Fault Locating
Excessively high CPU or memory usage on a node results in high network latency or triggers a system OOM, which causes the node to be displayed as unavailable.
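As a quick check of whether a node is overloaded (assuming metrics-server is installed in the cluster), you can run:
kubectl top nodes
Alternatively, log in to the node and run top or free -m to check CPU and memory usage directly.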
Solution
After the node becomes available, the workload is restored.
Log in to the CCE console and check whether the cluster is available.
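If kubectl is configured for the cluster, you can also confirm cluster and node status from the command line as a supplementary check:
kubectl cluster-info
kubectl get nodes
Nodes in the NotReady state generally correspond to the nodes shown as unavailable on the console.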
If the node name is inconsistent and neither the password nor the key can be used to log in to the node, Cloud-Init problems occurred when the ECS was created. In this case, restart the node and submit a service ticket to the ECS team to locate the root cause.
Log in to the VPC console. In the navigation pane, choose Access Control > Security Groups and locate the security group of the cluster master node.
The name of this security group is in the format of Cluster name-cce-control-ID. You can search for the security group by cluster name and -cce-control-.
Check whether the security group rules have been modified. For details about security groups, see How Can I Configure a Security Group Rule in a Cluster?
Check whether the required security group rules exist.
When a node is added to an existing cluster and the node subnet belongs to an extended CIDR block that was added to the VPC, you need to add the following three security group rules to the master node security group (the group name is in the format of Cluster name-cce-control-Random number). These rules ensure that the nodes added to the cluster are available. (This step is not required if an extended CIDR block has already been added to the VPC during cluster creation.)
For details about security groups, see How Can I Configure a Security Group Rule in a Cluster?
A 100 GiB data disk dedicated to Docker is attached to each new node. If the data disk is detached or damaged, the Docker service becomes abnormal and the node becomes unavailable.
Click the node name to check whether the data disk attached to the node has been detached. If it has, attach a data disk to the node again and restart the node. The node can then be recovered.
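If you can still log in to the node, you can also confirm whether the data disk is present, for example:
lsblk
If the dedicated data disk (for example, vdb) is missing from the output, it has been detached.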
systemctl status kubelet
If the command is executed successfully, the component status is displayed as active.
If the component status is not active, run the following commands (using the faulty component canal as an example):
Run systemctl restart canal to restart the component.
After restarting the component, run systemctl status canal to check the status.
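To check several components at once, a sketch like the following can be used; the exact set of components may vary with the cluster version:
for svc in docker kubelet canal; do echo "$svc: $(systemctl is-active $svc)"; done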
ps -ef | grep monitrc
If the monitrc process exists, run the following command to kill this process. The monitrc process will be automatically restarted after it is killed.
kill -s 9 `ps -ef | grep monitrc | grep -v grep | awk '{print $2}'`
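After killing the process, you can confirm that it has been restarted automatically:
ps -ef | grep monitrc | grep -v grep
If a new monitrc process appears in the output, the restart was successful.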
cat /var/log/cloud-init-output.log | grep resolv
If the command output contains the following information, the domain name cannot be resolved:
Could not resolve host: Unknown error
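To narrow down the DNS problem, you can check the DNS configuration on the node and try resolving a domain manually; <domain> below is a placeholder for the domain shown in the error message:
cat /etc/resolv.conf
ping -c 1 <domain>
If ping reports that the host cannot be resolved, check whether the DNS server configured in /etc/resolv.conf is reachable.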
If the vdb disk on a node is deleted, you can refer to this topic to restore the node.
systemctl status docker
If the command fails or the Docker service status is not active, locate the cause or contact technical support if necessary.
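To see why the Docker service failed, you can check its recent logs (assuming the node uses systemd and the journal is available):
journalctl -u docker -n 100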
docker ps -a | wc -l
If the command hangs, takes a long time to execute, or shows more than 1,000 abnormal containers, check whether workloads are being repeatedly created and deleted. When a large number of containers are frequently created and deleted, abnormal containers can accumulate and may not be cleared in time.
In this case, stop repeatedly creating and deleting the workload, or add more nodes to share the load. The nodes are generally restored after a period of time. If necessary, run docker rm {container_id} to manually clear abnormal containers.
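As a sketch for clearing abnormal containers in bulk, you can list all exited containers and remove them in one pass (review the list first so that nothing important is deleted):
docker ps -a -f status=exited
docker ps -a -f status=exited -q | xargs -r docker rm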