11 KiB
High Availability
Availability Definition
Availability refers to the ability of a product or service to perform a specified function under specified conditions at a specified time or within a specified period of time. It is a measure of the reliability and maintainability of the product.
Service availability is generally measured by the SLA. Each type of cloud service provides services based on their SLA commitment. The following table lists the downtimes acceptable for different SLA commitments:
SLA | Weekly Downtime | Monthly Downtime | Yearly Downtime |
---|---|---|---|
99% | 1.68 hours | 7.2 hours | 3.65 days |
99.90% | 10.1 minutes | 43.2 minutes | 8.76 hours |
99.95% | 5 minutes | 21.6 minutes | 4.38 hours |
99.99% | 1.01 minutes | 4.32 minutes | 52.56 minutes |
100.00% | 6 seconds | 25.9 seconds | 5.26 minutes |
High Availability Solutions
Most Open Telekom Cloud services have HA deployment options available. They provide you with HA capabilities at multiple layers, including data centers, hardware, data, and self-service. Open Telekom Cloud data centers are deployed in Germany, Netherlands and Switzerland to meet resource requirements for different regions. Each region is divided into multiple availability zones (AZs). Each AZ has independent cooling, fire extinguishing, moisture-control, and electrical facilities, and the failure of one AZ does not affect other AZs. There are four types of HA deployments:
- Single-AZ HA: For services that do not require high availability, active/standby or cluster deployment models of cloud services can be used to quickly recover services in the event of a single service node failure. By using automatic fault detection and switchover of nodes in a cluster, single points of failure (SPOFs) are eliminated and service interruptions are prevented.
- Dual-AZ (intra-city) HA: For services that require high availability, you can deploy services in multiple equipment rooms in the same city. This way, service continuity is guaranteed even if the network, physical device, or power supply of one equipment room fails. Open Telekom Cloud users can deploy services across AZs. AZs are isolated from each other, so if one AZ fails, you can switch services to another AZ to quickly recover services. Most cloud service products have corresponding capabilities. Select your desired capabilities during purchase and complete deployment.
- Two-site three-center HA: For some ultra-large or commercial systems that demand extra robust protection, the dual-AZ (intra-city) HA solution is still insufficient. It cannot guard against regional disasters, such as an earthquake or a flood. A remote equipment room is required for that. So in a two-site three-center HA solution, you add a remote DR equipment room to the intra-city solution. That way, you are protected even in the event of a regional disaster.
- Cross-cloud HA: To meet the requirements for multi-cloud HA of some enterprises, Open Telekom Cloud also supports multi-cloud DR deployment. Enterprises can deploy production services on Open Telekom Cloud and deploy the DR site on a cloud vendor platform.
Single-AZ HA Solution Design
Solution description and key points of the design:
- Layered service deployment: web access layer, service layer, data layer, and management zone
- Service HA: No single-node deployment, HA deployment (in cluster or active/standby mode)
- Cloud service HA: The selected Open Telekom Cloud services are deployed in the HA mode.
From the perspective of service continuity and data availability, this solution achieves cluster and active/standby HA. Instead of using single nodes, applications are deployed in clusters or active/standby mode. This may be a more expensive option, but it significantly improves availability.
Key points of the HA design:
Item | HA Design Focuses |
---|---|
S ervice HA | Decoupled deployment: Different components are deployed on different ECSs. HA deployment: HA deployment is used for all nodes. If HA (active/standby or cluster) deployment is not supported, an emergency solution must be available, for example, cloud server backup (via CSBS) and an emergency environment. Layered deployment: Services are deployed separately on the web layer, service layer, and database layer. Auto scaling: The Auto Scaling service can be leveraged to adjust compute resources to deal with changes in service volume. |
Cloud s ervice HA |
|
Data relia bility |
|
Dual-AZ HA Solution Design
Solution description and key points of the design:
- Service modules: For services supporting cluster deployment, resources are deployed in two AZs, with the loads balanced using ELB. For single-node ECSs, SDRS is used for AZ-level DR.
- Cloud service HA: Primary and standby nodes are deployed in different AZs.
- Database synchronization: RDS is used, and RDS DB instances are deployed in primary/standby mode across AZs and the data kept synchronized.
- DR switchover: If an AZ fails, RDS database services automatically switch to the standby databases. Application services can be taken over by the DR servers automatically or in just a few clicks.
- DR drills: Users can perform DR drills with just a few clicks.
Two-Site Three-Center (Cross-Region) HA Solution Design
Solution description and key points of the design:
- The production centers and DR center are deployed in two different regions of Open Telekom Cloud.
- The production centers are deployed in two AZs, and the DR center is deployed in one AZ.
- RDS DB instances are deployed in both the production centers and DR center for 1:1:1 primary/standby replication.
- Configurations, logs, snapshots, and backups generated at the production and DR centers are replicated across regions using OBS.
- If one production AZ fails, services are switched to the other AZ, and a database switchover is performed.
- If both production centers fail, a database switchover is performed, and by modifying DNS configuration, 100% of the user traffic is directed to the DR center.
- After the production centers recover, a database switchover is performed, and DNS directs 100% of the user traffic to the production centers.
- To improve the utilization of the DR center, some read-only or analysis services can be distributed to the DR center.
Note
- This solution provides the highest possible service continuity and data availability. Data and services can be protected even in the event of regional disasters.
- The RPO it determined by the database replication interval. Servers at the DR center are always running, so an RTO of almost zero can be achieved. The time needed to complete a DR switchover depends on how long the DNS cache takes to update. It usually takes a few minutes and can be faster if GSLB is used.
Establishing HA DR capabilities is a complex project that includes ingress traffic control, service layer reconstruction, middleware and database control, and the collaboration of numerous systems. It requires professional skills to build HA DR systems. If customers lack related experience and want to quickly build an HA DR system, the Multi-cloud high Availability Service (MAS) is a good choice. This service is derived from the multi-cloud application HA solution of Open Telekom Cloud consumer services. It provides end-to-end service failover and DR drill capabilities that include everything from the traffic ingress and application layer, to the data layer. MAS ensures quick service recovery and improved service continuity.