
High-Risk Operations and Solutions

During service deployment or operation, you may perform high-risk operations at different levels, causing service faults or interruptions. To help you estimate and avoid operation risks, this section describes the consequences of high-risk operations and their solutions across multiple dimensions, such as clusters, nodes, networking, load balancing, logs, and EVS disks.

Clusters and Nodes

Table 1 High-risk operations and solutions
Category Operation Impact Solution
Master node Modifying the security group of a node in a cluster

The master node may become unavailable.

Note

A master node is named in the format: Cluster name-cce-control-Random number

Restore the security group by referring to the security group settings of a newly created cluster and allow traffic from the security group to pass through.
Letting the node expire or destroying the node The master node will be unavailable. This operation cannot be undone.
Reinstalling the OS Components on the master node will be deleted. This operation cannot be undone.
Upgrading components on the master or etcd node The cluster may be unavailable. Roll back to the original version.
Deleting or formatting core directory data such as /etc/kubernetes on the node The master node will be unavailable. This operation cannot be undone.
Changing the node IP address The master node will be unavailable. Change the IP address back to the original one.
Modifying parameters of core components (such as etcd, kube-apiserver, and docker) The master node may be unavailable. Restore the parameter settings to the recommended values. For details, see Cluster Configuration Management <cce_10_0213>.
Replacing the master or etcd certificate The cluster may become unavailable. This operation cannot be undone.
Worker node Modifying the security group of a node in a cluster

The node may become unavailable.

Note

A worker node is named in the format: Cluster name-cce-node-Random number

Restore the security group by referring to Creating a CCE Cluster <cce_10_0028> and allow traffic from the security group to pass through.
Deleting the node The node will become unavailable. This operation cannot be undone.
Reinstalling the OS Node components are deleted, and the node becomes unavailable. Reset the node. For details, see Resetting a Node <cce_10_0003>.
Upgrading the node kernel

The node may become unavailable, or the network may be abnormal.

Note

Nodes depend on a specific system kernel version to run properly. Unless absolutely necessary, do not run the yum update command to update or reinstall the kernel of a node. (Reinstalling the kernel using the original image or any other image is also a risky operation.)

Reset the node. For details, see Resetting a Node <cce_10_0003>.
Changing the node IP address The node will become unavailable. Change the IP address back to the original one.
Modifying parameters of core components (such as kubelet and kube-proxy) The node may become unavailable, and components may be insecure if security-related configurations are modified. Restore the parameter settings to the recommended values. For details, see Configuring a Node Pool <cce_10_0652>.
Modifying OS configuration The node may be unavailable. Restore the configuration items or reset the node. For details, see Resetting a Node <cce_10_0003>.
Deleting or modifying the /opt/cloud/cce and /var/paas directories, or deleting the data disk The node will become unready. Reset the node. For details, see Resetting a Node <cce_10_0003>.
Modifying the node directory permission and the container directory permission The permissions will be abnormal. You are not advised to modify the permissions. Restore the permissions if they are modified.
Formatting or partitioning system disks, Docker disks, and kubelet disks on nodes. The node may be unavailable. You can reset the node. For details, see Resetting a Node <cce_10_0003>.
Installing other software on nodes This may cause exceptions on Kubernetes components installed on the node, and make the node unavailable. Uninstall the software that has been installed and restore or reset the node. For details, see Resetting a Node <cce_10_0003>.
Modifying NetworkManager configurations The node will become unavailable. Reset the node. For details, see Resetting a Node <cce_10_0003>.
Deleting system images such as cfe-pause from the node Containers cannot be created, and the system images cannot be pulled. Copy the images from another healthy node to restore them.
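To guard against the accidental kernel upgrade described in the table above, routine package updates on yum-based node images can be configured to skip kernel packages entirely. A minimal sketch of a yum configuration fragment (this uses the standard yum exclude option; the placement and comment are illustrative, not CCE-specific guidance):

```ini
# /etc/yum.conf (fragment)
[main]
# Keep "yum update" from updating or reinstalling the node kernel,
# which the node's components depend on.
exclude=kernel*
```

The same effect can be achieved for a single run with yum update --exclude='kernel*'.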

Networking and Load Balancing

Table 2 High-risk operations and solutions
Operation Impact How to Avoid/Fix
Changing the value of the kernel parameter net.ipv4.ip_forward to 0 The network becomes inaccessible. Change the value to 1.
Changing the value of the kernel parameter net.ipv4.tcp_tw_recycle to 1 The NAT service becomes abnormal. Change the value to 0.
Changing the value of the kernel parameter net.ipv4.tcp_tw_reuse to 1 The network becomes abnormal. Change the value to 0.
Not configuring the node security group to allow UDP traffic on port 53 from the container CIDR block The DNS in the cluster cannot work properly. Restore the security group by referring to Creating a CCE Cluster <cce_10_0028> and allow traffic from the security group to pass through.
Creating a custom listener on the ELB console for the load balancer managed by CCE The modified items are reset by CCE or the ingress is faulty. Use the YAML file of the Service to automatically create a listener.
Binding a user-defined backend on the ELB console to the load balancer managed by CCE The modified items are reset by CCE or the ingress is faulty. Do not manually bind any backend.
Changing the ELB certificate on the ELB console for the load balancer managed by CCE The modified items are reset by CCE or the ingress is faulty. Use the YAML file of the ingress to manage certificates automatically.
Changing the listener name on the ELB console for the ELB listener managed by CCE The modified items are reset by CCE or the ingress is faulty. Do not change the name of an ELB listener managed by CCE.
Changing the description of load balancers, listeners, and forwarding policies managed by CCE on the ELB console The modified items are reset by CCE or the ingress is faulty. Do not modify the description of load balancers, listeners, or forwarding policies managed by CCE.
Deleting the CRD resources of the default-network network-attachment-definitions The container network is disconnected, or the cluster fails to be deleted. If the resources are deleted by mistake, use the correct configurations to re-create the default-network resources.
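The safe values for the three kernel parameters in the first rows of the table can be pinned with a sysctl configuration fragment so they survive reboots. A minimal sketch (the file name is illustrative; note that net.ipv4.tcp_tw_recycle was removed from Linux kernels 4.12 and later, so set it only where it still exists):

```ini
# /etc/sysctl.d/90-cce-network.conf (illustrative file name)
net.ipv4.ip_forward = 1      # container traffic forwarding must stay enabled
net.ipv4.tcp_tw_recycle = 0  # 1 breaks NAT; parameter absent on kernels >= 4.12
net.ipv4.tcp_tw_reuse = 0
```

Apply the settings with sysctl --system.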
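As an illustration of letting CCE create and own the ELB listener through the Service definition instead of the ELB console, here is a hedged sketch of a LoadBalancer Service. The annotation keys follow the kubernetes.io/elb.* convention used by CCE; the load balancer ID is a placeholder, and the class value is an assumption to be checked against your cluster version:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: nginx-elb                                     # illustrative name
  annotations:
    kubernetes.io/elb.id: <existing-elb-instance-id>  # placeholder: bind to an existing ELB
    kubernetes.io/elb.class: union                    # assumption: shared load balancer
spec:
  type: LoadBalancer
  selector:
    app: nginx
  ports:
    - name: http
      port: 80          # CCE creates the matching ELB listener for this port
      targetPort: 80
      protocol: TCP
```

Because the listener is generated from this YAML, any change made to it on the ELB console will be reset the next time CCE reconciles the Service.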

Logs

Table 3 High-risk operations and solutions
Operation Impact Solution Remarks
Deleting the /tmp/ccs-log-collector/pos directory on the host machine Logs are collected repeatedly. None The pos file records the location where files are to be collected.
Deleting the /tmp/ccs-log-collector/buffer directory on the host machine Logs are lost. None The buffer directory contains log cache files to be consumed.

EVS Disks

Table 4 High-risk operations and solutions
Operation Impact Solution
Manually unmounting an EVS disk on the console An I/O error is reported when pod data is written to the disk. Delete the mount path from the node and schedule the pod again.
Unmounting the disk mount path on the node Pod data is written to a local disk. Remount the corresponding path to the pod.
Operating EVS disks on the node Pod data is written to a local disk. None
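Before manually operating a disk on a node, it helps to confirm whether the path in question is still a live mount point, since writing through an unmounted path is exactly how pod data ends up on the node's local disk. A minimal POSIX shell sketch (the kubelet volume path is illustrative, not a guaranteed CCE path):

```shell
#!/bin/sh
# Return 0 if the given path is currently a mount point, judging from
# /proc/mounts, which lists "device mountpoint fstype options" per line.
is_mounted() {
  grep -qs " $1 " /proc/mounts
}

# Illustrative check before detaching the disk behind a pod volume path.
if is_mounted /mnt/paas/kubernetes/kubelet; then
  echo "still mounted: detaching now risks pod I/O errors"
else
  echo "not mounted"
fi
```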