OTC Status Dashboard 2: Cheat-Sheet for Squad Service Managers
==============================================================

The Status Dashboard 2 (SD2) is a monitoring facility for all OTC
services, intended to give customers an overview of service
availability. It comprises a set of **monitoring zones**, each
monitoring the services of a **monitoring environment** (a.k.a. regions
like eu-de, eu-nl, etc.). The mapping of monitoring zones to monitoring
sites is configured with an HA approach in mind by the Ecosystem Squad
and is not described in technical detail in this document. Additionally,
the web-based Dashboard itself serves the monitored data in a frontend
component visible to OTC customers.

The general assumption of the SD2 is that “no news is good news”.
Technically speaking, the SD2 doesn’t receive any monitoring metrics,
but only **incidents**. Once the SD2 receives such an incident, the
associated service is marked with a yellow or red semaphore; otherwise
every service stays on a green semaphore.

Each squad should appoint one or more colleagues for the role of
**Service Incident Engineer (SIE)**. The SIEs define the exact
conditions under which a yellow or red semaphore should be raised. This
document is intended for them.

As a secondary target group, this document may also be useful for
**Service Incident Managers (SIM)**. It is this role’s responsibility to
react to incoming incidents, initiate mitigation activities, explain the
situation to customers, and eventually close incidents once they’re
resolved. For SIMs it might be useful to understand *why* incidents are
raised, but they may not need to know *how* exactly this happens.

Simplified architectural overview and data flow
-----------------------------------------------

The SD2 is a specific application of the much more general Stackmon
framework for cloud monitoring. It is licensed as open source software,
initiated by the OTC, and developed together with the community.
Due to this design, the monitoring data flows through several stages.
Most of them can be configured and customized extensively to serve many
different purposes. However, SD2 comes with a number of assumptions and
pre-configurations to reduce complexity for SIEs and SIMs.

The data flows through these stages:

1. A plugin collects the raw metrics from the live systems of the OTC.
   For the SD2 the EPMon plugin is used; the name is an abbreviation for
   “endpoint monitoring”. The plugin sends HTTP GET requests to the API
   endpoints that are listed in the OTC service catalogue. Typically
   simple “list” requests are sent, and no actual resources are created
   or modified by the action. The EPMon plugin records only the status
   code and the round-trip time of the response. A maximum timeout is
   configured.

2. The results of the probes are stored in a TSDB implemented by
   Graphite.

3. By means of some Graphite queries, the raw data is aggregated,
   resulting in several **flags**. For example, if fewer than 90% of all
   queries to the ABC service in the past 15 minutes were answered
   within a threshold of 300 ms, a flag named “abc_unreliable” could be
   raised. Another example is ……………… .

4. The **metric processor** further aggregates the flags into minor
   incidents (yellow) and major outages (red). A yellow semaphore means
   that a service is degraded, dropping some requests or running into
   occasional timeouts, while the service itself is still responding. A
   red semaphore indicates that a service is not available at all. Note
   that this is a very informal description of the semantics. The
   details are defined in the service-specific configuration items
   covered later in this document.

5. Only if the metric processor actually creates an incident (of
   whatever color) is it transmitted to and displayed on the SD2
   website.

6. The incident is listed on the website and won’t go away
   automatically. It requires the manual intervention of the SIM to mark
   the issue as resolved. The frontend supports the SIM in this process,
   as she or he may report intermediate progress statements to the
   customers.

7. The service data of red semaphores is used to calculate an SLA value
   according to the service description.

The configuration of the backend is based on configuration files in Git
repositories hosted on GitHub (for the subject-based configuration) and
on Gitea (for OTC-related non-public data). Changes are requested and
tracked by GitOps methods.

The components are distributed (across several regions and a non-OTC
platform, GCP) and designed redundantly to increase resilience against
outages of the platform.

The SD2 frontend is connected to a Keycloak authentication proxy
instance, providing access to users listed in the OTC-LDAP directory or
optionally authenticated by GitHub as an external ID provider. The SD2
neither stores nor processes any personal data, except for
authentication when personalized accounts are used.

Accessing the platform and the configuration
--------------------------------------------

The SD2 for the regions eu-de and eu-nl is accessible via the Internet
at:

::

   https://status.cloudmon.eco.tsi-dev.otc-service.com/

The SD2 instance for the Swiss Cloud is available at:

::

   https://status-ch2.cloudmon.eco.tsi-dev.otc-service.com/

SIMs can edit or resolve incidents after logging in; to do so, append
“/login/openid” to the dashboard URLs mentioned above.

The public configuration repository for eu-de and eu-nl is at:

::

   https://github.com/opentelekomcloud-infra/stackmon-config

Consult the upcoming sections to configure any service metrics, flags,
and semaphores.

Customizing metrics
-------------------

The actual data flow is slightly more complex than described in the
abstract sections before, but thankfully there are already working
defaults in place, so that only little configuration has to be touched.
All configuration is formatted as YAML and can be found in the
repository

::

   https://github.com/opentelekomcloud-infra/stackmon-config

There are a couple of questions to be answered to follow the metrics
through the subsystems. All metrics processed by the EPMon plugin are
based on the service catalogue of the OTC (see “openstack catalog list”
for reference).

Question 1: What HTTP GET queries should be sent to the service?

https://github.com/opentelekomcloud-infra/stackmon-config/blob/main/epmon/config.yaml

This file lists the services under the top-level key ``elements``. The
important attribute here is the list of ``urls`` that get appended to
the service endpoint. With this list several aspects of the service can
be expressed. Only “list-type” queries should be listed here, as the
plugin just sends a GET request and discards the actual response body.

.. code:: yaml

   antiddos: # simple regular antiddos
     service_type: antiddos # service_type in the catalog
     sdk_proxy: anti_ddos # how SDK proxy is named
     urls: # which urls to test
       - /
       - /antiddos
       - /antiddos/query_config_list
       - /antiddos/default/config
       - /antiddos/weekly

If a service catalog entry should not be queried, for whatever reason,
assign an empty list to the ``urls`` attribute.

Question 2: What flags should be defined for a service?

https://github.com/opentelekomcloud-infra/stackmon-config/blob/main/mp-prod/conf.d/flag_metrics.yaml

Under the top-level attribute ``flag_metrics``, a long list of entries
associates each ``service`` with a condition, which is abstracted by a
``template``. By default, three flags are defined for most services:
``api_down``, ``api_slow``, and ``api_success_rate_low``. The
“implementation” of the flags’ semantics is externalized in templates,
which contain complex Graphite queries. The implementation is not
important in the context of this primer.

.. code:: yaml

   ### Anti-DDoS
   - name: "api_down"
     service: "antiddos"
     template:
       name: "api_down"
     environments:
       - name: "production_eu-de"
       - name: "production_eu-nl"
   - name: "api_slow"
     service: "antiddos"
     template:
       name: "api_slow"
     environments:
       - name: "production_eu-de"
       - name: "production_eu-nl"
   - name: "api_success_rate_low"
     service: "antiddos"
     template:
       name: "api_success_rate_low"
     environments:
       - name: "production_eu-de"
       - name: "production_eu-nl"

The flag ``api_down`` means that all queries of a test series have
failed without exception. The flag ``api_slow`` is raised when the
average RTT in a test series is longer than 300 ms. The flag
``api_success_rate_low`` is similar to ``api_down``, but more relaxed,
as it is raised as soon as 90% or less of the queries succeed.

In the template file
(https://github.com/opentelekomcloud-infra/stackmon-config/blob/main/mp-prod/conf.d/metric_templates.yaml)
there are three additional flag definitions listed, but they are
currently not widely used. Custom queries could theoretically be added
with their own templates, but this is beyond the scope of this document.

The flags are referenced in the upcoming files as *service*.\ *name*,
for example ``antiddos.api_slow`` for the second entry in the example
above.

Question 3: What is the impact of one or more raised flags?

https://github.com/opentelekomcloud-infra/stackmon-config/blob/main/mp-prod/conf.d/health_metrics.yaml

.. code:: yaml

   ### Anti-DDoS
   antiddos:
     service: antiddos
     component_name: "Anti DDoS"
     category: database
     metrics:
       - antiddos.api_down
       - antiddos.api_slow
       - antiddos.api_success_rate_low
     expressions:
       - expression: "antiddos.api_slow || antiddos.api_success_rate_low"
         weight: 1
       - expression: "antiddos.api_down"
         weight: 2

In this file, the top-level ``health_metrics`` key holds a long list of
semaphores. The value of a semaphore is mapped to a color: 1 means
yellow (an incident) and 2 means red (an outage).
The configuration items define how this mapping is done: the flags from
the previous section are listed under ``metrics`` as a declaration, and
the key ``expressions`` specifies the actual mapping. Typically not much
needs to be changed here unless new flags are introduced or the
semantics of red and yellow should be changed. Should the outcome of
this mapping be a yellow or red semaphore, an incident for the
corresponding service is created, sent to the SD2 frontend, and
displayed.
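
To make the chain from probe results to flags to semaphores more
tangible, the following Python sketch replays it for the Anti-DDoS
example. It is only an illustration under stated assumptions, not the
code of the metric processor: the function names ``derive_flags``,
``evaluate``, and ``semaphore`` are hypothetical, the decision whether a
single probe counts as successful is left to the caller, and the
expression evaluator only understands ``||`` and ``&&`` without
parentheses. In the real system the flags are computed by Graphite
queries in the templates.

.. code:: python

   from statistics import mean

   # Weight-to-color mapping as described for health_metrics.yaml;
   # weight 0 (no expression matched) is assumed to mean green.
   SEMAPHORE = {0: "green", 1: "yellow", 2: "red"}

   # Mirrors the "expressions" list of the antiddos example.
   ANTIDDOS_EXPRESSIONS = [
       {"expression": "antiddos.api_slow || antiddos.api_success_rate_low",
        "weight": 1},
       {"expression": "antiddos.api_down", "weight": 2},
   ]


   def derive_flags(service, probes, slow_ms=300.0, min_success_rate=0.9):
       """Derive the three default flags from one non-empty test series.

       probes is a list of (ok, rtt_ms) tuples. What makes a probe "ok"
       (status code, timeout) is an assumption left to the caller.
       """
       success_rate = sum(1 for ok, _ in probes if ok) / len(probes)
       flags = set()
       if success_rate == 0:                  # every query failed
           flags.add(f"{service}.api_down")
       if success_rate <= min_success_rate:   # 90% or less succeeded
           flags.add(f"{service}.api_success_rate_low")
       if mean(rtt for _, rtt in probes) > slow_ms:
           flags.add(f"{service}.api_slow")
       return flags


   def evaluate(expression, raised_flags):
       """Evaluate a flag expression built only from || and &&."""
       # || binds weaker than &&, so split on it first.
       return any(
           all(term.strip() in raised_flags for term in clause.split("&&"))
           for clause in expression.split("||")
       )


   def semaphore(expressions, raised_flags):
       """Return the color of the highest weight whose expression is true."""
       weight = max(
           (e["weight"] for e in expressions
            if evaluate(e["expression"], raised_flags)),
           default=0,
       )
       return SEMAPHORE[weight]


   # Eight fast successes and two slow failures: degraded, but not down.
   degraded = [(True, 120.0)] * 8 + [(False, 5000.0)] * 2
   print(semaphore(ANTIDDOS_EXPRESSIONS, derive_flags("antiddos", degraded)))  # yellow

   # Every probe failed: api_down is raised and the highest weight wins.
   outage = [(False, 100.0)] * 10
   print(semaphore(ANTIDDOS_EXPRESSIONS, derive_flags("antiddos", outage)))  # red

Taking the maximum weight is a design choice of this sketch: when both a
yellow and a red expression match, the more severe color is reported.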