diff --git a/doc/source/internal/index.rst b/doc/source/internal/index.rst index aec27c7..dbdd7cc 100644 --- a/doc/source/internal/index.rst +++ b/doc/source/internal/index.rst @@ -6,3 +6,4 @@ Internal Documentation helpcenter_training/index apimon_training/index + sd2_training/index diff --git a/doc/source/internal/sd2_training/index.rst b/doc/source/internal/sd2_training/index.rst new file mode 100644 index 0000000..0a51ce3 --- /dev/null +++ b/doc/source/internal/sd2_training/index.rst @@ -0,0 +1,8 @@ +=========================== +Status Dashboard 2 Training +=========================== + +.. toctree:: + :maxdepth: 1 + + onepager diff --git a/doc/source/internal/sd2_training/onepager.rst b/doc/source/internal/sd2_training/onepager.rst new file mode 100644 index 0000000..3d1d330 --- /dev/null +++ b/doc/source/internal/sd2_training/onepager.rst @@ -0,0 +1,243 @@ +OTC Status Dashboard 2: Cheat-Sheet for Squad Service Managers +============================================================== + +The Status Dashboard 2 (SD2) is a facility that monitors all OTC +services and is intended to give customers an overview of service +availability. It comprises a set of **monitoring zones**, each +monitoring the services of a **monitoring environment** (a.k.a. regions +like eu-de, eu-nl, etc.). The mapping of monitoring zones to monitoring +sites is configured with an HA approach in mind by the Ecosystem Squad +and is not described in technical detail in this document. + +Additionally, the web-based Dashboard itself serves the monitored data +in a frontend component visible to OTC customers. The general assumption +of the SD2 is that “no news is good news”. Technically speaking, the SD2 +doesn’t receive any monitoring metrics, but only **incidents**. Once the +SD2 receives such an incident, the associated service is marked with a +yellow or red semaphore; otherwise every service keeps its green +semaphore. 
+ +Each squad should appoint one or more colleagues for the role of +**Service Incident Engineer (SIE)**. The SIEs define the exact +conditions under which a yellow or red semaphore is raised. This +document is intended for them. + +As a secondary target group, this document may also be useful for +**Service Incident Managers (SIM)**. It’s this role’s responsibility to +react to incoming incidents, initiate mitigation activities, explain the +situation to customers, and eventually close incidents once they’re +resolved. For SIMs it might be useful to understand *why* incidents are +raised, but they may not need to know *how* exactly this happens. + +Simplified architectural overview and data flow +----------------------------------------------- + +The SD2 is a specific application of the much more general Stackmon +framework for cloud monitoring. It is licensed as open source software, +initiated by the OTC, and developed together with the Community. Due to +this design, the monitoring data flows through several stages. Most of +them can be configured and customized to a great extent to serve many +different purposes. However, SD2 comes with a number of assumptions and +pre-configurations to reduce complexity for SIEs and SIMs. + +The data flows through these stages: A plugin collects the raw metrics +from the live systems of the OTC. For the SD2 the EPMon plugin is used; +the name is an abbreviation for “endpoint monitoring”. This means that the +plugin sends HTTP GET requests to API endpoints that are listed in the +OTC service catalogue. Typically, simple “list” requests are sent, and +no actual resources are created or modified by the action. The +EPMon plugin records only the status code and the round-trip time of +the response. There is a maximum timeout configured. The results of the +probes are stored in a TSDB implemented by Graphite. By means of some +Graphite queries, the raw data is aggregated, resulting in several +**flags**. 
For example, if less than 90% of all queries to the +ABC service in the past 15 minutes were answered within a threshold of 300 ms, a flag +named “abc_unreliable” could be raised. Another example is ……………… . The +**metric processor** further aggregates the flags into minor incidents +(yellow) and major outages (red). A yellow semaphore means that a +service is degraded, dropping some requests or running into occasional +timeouts. However, the service itself is still responding. A red semaphore +indicates that a service is not available at all anymore. Note that this +is a very informal description of the semantics. The details are defined +in the service-specific configuration items covered later in this +document. Only if the metric processor actually creates an incident (of +whatever color) is it transmitted to and displayed on the SD2 website. +The incident is listed on the website and won’t go away automatically. +It requires manual intervention by the SIM to mark the issue as +resolved. The frontend supports the SIM in this process, as she or he may +post intermediate progress statements for the customers. The service data +of red semaphores is used to calculate an SLA value according to the +service description. + +The configuration of the backend is based on configuration files in Git +repositories hosted on GitHub (for the subject-based configuration) and +on Gitea (for OTC-related non-public data). Changes are requested and +tracked by GitOps methods. + +The components are distributed (to several regions and a non-OTC +platform, GCP) and designed redundantly to increase resilience against +outages of the platform. + +The SD2 frontend is connected to a Keycloak authentication proxy +instance, providing access to users listed in the OTC-LDAP directory or +optionally authenticated by GitHub as an external ID provider. The SD2 +neither stores nor processes any personal data, except for authentication when +personalized accounts are used. 
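The aggregation of raw probe results into a flag can be pictured with the following minimal Python sketch. It is only an illustration of the idea, not the Graphite queries the metric processor actually runs; the ``Probe`` type, the function names, and the sample data are hypothetical, and the 300 ms and 90% values are taken from the informal example in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    """One EPMon-style probe result: only status code and round-trip time."""
    status_code: int
    rtt_ms: float


def share_of_good_probes(probes, max_rtt_ms=300.0):
    """Fraction of probes that succeeded and were answered within max_rtt_ms."""
    good = sum(
        1 for p in probes
        if 200 <= p.status_code < 400 and p.rtt_ms <= max_rtt_ms
    )
    return good / len(probes)


def flag_raised(probes, min_share=0.9, max_rtt_ms=300.0):
    """Raise the flag when fewer than min_share of the probes were good.

    Assumption: an empty window (no data) does not raise a flag.
    """
    if not probes:
        return False
    return share_of_good_probes(probes, max_rtt_ms) < min_share


# Hypothetical 15-minute window: 8 fast answers, 1 slow answer, 1 error.
window = [Probe(200, 120.0)] * 8 + [Probe(200, 450.0), Probe(503, 80.0)]
print(flag_raised(window))
```

In this sketch only 8 of 10 probes count as good, so the flag is raised; with 9 good probes out of 10 the share would sit exactly at the 90% limit and the flag would stay down.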
+ +Accessing the platform and the configuration +-------------------------------------------- + +The SD2 for the regions eu-de and eu-nl is accessible via the Internet +at: + +:: + + https://status.cloudmon.eco.tsi-dev.otc-service.com/ + +The SD2 instance for the Swiss Cloud is available at: + +:: + + https://status-ch2.cloudmon.eco.tsi-dev.otc-service.com/ + +SIMs get access to edit or resolve incidents by appending +“/login/openid” to the dashboard URLs mentioned above. + +The public configuration repository for eu-de and eu-nl is at: + +:: + + https://github.com/opentelekomcloud-infra/stackmon-config + +Consult the upcoming sections to configure any service metrics, flags, +and semaphores. + +Customizing metrics +------------------- + +The actual data flow is slightly more complex than described in the +abstract sections before, but thankfully there are already working +defaults in place, so that only a little configuration has to be touched. +All configuration is formatted as YAML and can be found in the +repository + +:: + + https://github.com/opentelekomcloud-infra/stackmon-config + +There are a couple of questions to be answered to follow the metrics +through the subsystems. All metrics processed by the EPMon plugin are +based on the service catalogue of the OTC (see “openstack catalog list” +for reference). + +Question 1: What HTTP GET queries should be sent to the service? + +https://github.com/opentelekomcloud-infra/stackmon-config/blob/main/epmon/config.yaml + +This file lists the services under the top-level key ``elements``. The +important attribute here is a list of ``urls`` that get appended to the +service endpoint. With this list several aspects of the service can be +expressed. Only “list-type” queries should be listed here, as the plugin +just sends a GET request and discards the actual response body. + +.. 
code:: yaml + + antiddos: # simple regular antiddos + service_type: antiddos # service_type in the catalog + sdk_proxy: anti_ddos # how SDK proxy is named + urls: # which urls to test + - / + - /antiddos + - /antiddos/query_config_list + - /antiddos/default/config + - /antiddos/weekly + +If a service catalog entry should, for whatever reason, not be +queried, assign an empty list to the ``urls`` attribute. + +Question 2: What flags should be defined for a service? + +https://github.com/opentelekomcloud-infra/stackmon-config/blob/main/mp-prod/conf.d/flag_metrics.yaml + +Under the top-level attribute ``flag_metrics`` a long list of +``services`` is associated with conditions, each abstracted by a +``template``. By default, most often three flags are +defined: ``api_down``, ``api_slow``, and ``api_success_rate_low``. The +“implementation” of the flags’ semantics is externalized in templates +that contain complex Graphite queries. The implementation is not +important in the context of this primer. + +.. code:: yaml + + ### Anti-DDoS + - name: "api_down" + service: "antiddos" + template: + name: "api_down" + environments: + - name: "production_eu-de" + - name: "production_eu-nl" + + - name: "api_slow" + service: "antiddos" + template: + name: "api_slow" + environments: + - name: "production_eu-de" + - name: "production_eu-nl" + + - name: "api_success_rate_low" + service: "antiddos" + template: + name: "api_success_rate_low" + environments: + - name: "production_eu-de" + - name: "production_eu-nl" + +The flag ``api_down`` means that all queries of a test series have +failed without exception. The flag ``api_slow`` is raised when the +average RTT in a test series exceeded 300 ms. The flag +``api_success_rate_low`` is similar to ``api_down``, but a bit more relaxed, +as it is already raised if 90% or less of the queries succeed. 
In the +template file +(https://github.com/opentelekomcloud-infra/stackmon-config/blob/main/mp-prod/conf.d/metric_templates.yaml) +there are three additional flag definitions listed, but they are +currently not widely used. Custom queries could theoretically be added +with their own templates, but this is beyond the scope of this document. + +The flags are referenced in upcoming files as *service*.\ *name*, for +example as ``antiddos.api_slow`` for the second entry in the example above. + +Question 3: What is the impact of one or more raised flags? + +https://github.com/opentelekomcloud-infra/stackmon-config/blob/main/mp-prod/conf.d/health_metrics.yaml + +.. code:: yaml + + ### Anti-DDoS + antiddos: + service: antiddos + component_name: "Anti DDoS" + category: database + metrics: + - antiddos.api_down + - antiddos.api_slow + - antiddos.api_success_rate_low + expressions: + - expression: "antiddos.api_slow || antiddos.api_success_rate_low" + weight: 1 + - expression: "antiddos.api_down" + weight: 2 + +In this file, the top-level ``health_metrics`` key holds a long list of +semaphores. The ``weight`` values are mapped to the colors: 1 +means yellow (an incident) and 2 means red (an outage). +The configuration items define how this mapping is done: +The ``metrics`` from the previous section are listed as a declaration, and +the key ``expressions`` specifies the actual mapping. Typically not much +needs to be changed here unless new flags are introduced or the +semantics of red and yellow should be changed. + +Should the outcome of this mapping result in a yellow or red semaphore, +an incident for the corresponding service is created, sent to the SD2 +frontend, and displayed.
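The mapping from raised flags to a semaphore color can be illustrated with a small Python sketch. It mirrors the ``expressions`` and ``weight`` entries of the Anti-DDoS example, but it is not the metric processor's code: the ``Rule`` type and the function names are hypothetical, only ``||`` and ``&&`` without parentheses are handled, and the rule that the highest matching weight wins is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    expression: str  # e.g. "antiddos.api_slow || antiddos.api_success_rate_low"
    weight: int      # 1 = yellow, 2 = red


COLORS = {0: "green", 1: "yellow", 2: "red"}

# The two expressions of the Anti-DDoS health metric shown above.
ANTIDDOS_RULES = [
    Rule("antiddos.api_slow || antiddos.api_success_rate_low", 1),
    Rule("antiddos.api_down", 2),
]


def matches(expression, raised_flags):
    """Evaluate an OR of AND-groups over flag names.

    Simplification: parentheses and negation are not supported.
    """
    return any(
        all(term.strip() in raised_flags for term in group.split("&&"))
        for group in expression.split("||")
    )


def semaphore(rules, raised_flags):
    """Return the color of the highest-weighted matching rule (assumption)."""
    weight = max(
        (rule.weight for rule in rules if matches(rule.expression, raised_flags)),
        default=0,  # no rule matches: "no news is good news"
    )
    return COLORS[weight]


print(semaphore(ANTIDDOS_RULES, {"antiddos.api_slow"}))  # yellow
```

With no raised flags the sketch yields green, and as soon as ``antiddos.api_down`` is raised the result is red regardless of the other flags.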