OTC Status Dashboard 2: Cheat-Sheet for Squad Service Managers
==============================================================

The Status Dashboard 2 (SD2) is a monitoring facility for all OTC
services, intended to give customers an overview of service
availability. It comprises a set of **monitoring zones**, each
monitoring the services of a **monitoring environment** (a.k.a. a region
such as eu-de or eu-nl). The mapping of monitoring zones to monitoring
sites is configured with an HA approach in mind by the Ecosystem Squad
and is not described in technical detail in this document.

Additionally, the web-based dashboard itself presents the monitored data
in a frontend component visible to OTC customers. The general assumption
of the SD2 is that “no news is good news”. Technically speaking, the SD2
doesn’t receive any monitoring metrics, only **incidents**. Once the
SD2 receives such an incident, the associated service is marked with a
yellow or red semaphore; otherwise every service keeps its green
semaphore.

Each squad should appoint one or more colleagues for the role of
**Service Incident Engineer (SIE)**. The SIEs define the exact
conditions under which a yellow or red semaphore is raised. This
document is intended for them.

As a secondary target group, this document may also be useful for
**Service Incident Managers (SIM)**. This role is responsible for
reacting to incoming incidents, initiating mitigation activities,
explaining the situation to customers, and eventually closing incidents
once they’re resolved. For SIMs it might be useful to understand *why*
incidents are raised, but they may not need to know *how* exactly this
happens.

Simplified architectural overview and data flow
-----------------------------------------------

The SD2 is a specific application of the much more general Stackmon
framework for cloud monitoring. Stackmon is licensed as open source
software, was initiated by the OTC, and is developed together with the
community. Due to this design, the monitoring data flows through several
stages. Most of them can be configured and customized extensively to
serve many different purposes. However, SD2 comes with a number of
assumptions and pre-configurations to reduce complexity for SIEs and
SIMs.

The data flows through these stages:

A plugin collects the raw metrics from the live systems of the OTC. For
the SD2 the EPMon plugin is used; the name is an abbreviation for
“endpoint monitoring”. The plugin sends HTTP GET requests to the API
endpoints listed in the OTC service catalogue. Typically simple “list”
requests are sent, so no actual resources are created or modified. The
EPMon plugin records only the status code and the round-trip time of
each response. A maximum timeout is configured.

The results of the probes are stored in a time-series database (TSDB)
implemented by Graphite. Graphite queries aggregate the raw data into
several **flags**. For example, if fewer than 90% of all queries to the
ABC service in the past 15 minutes were answered within a threshold of
300 ms, a flag named “abc_unreliable” could be raised. Another example
is ……………… .

The **metric processor** further aggregates the flags into minor
incidents (yellow) and major outages (red). A yellow semaphore means
that a service is degraded, dropping some requests or running into
occasional timeouts, while the service itself is still responding. A red
semaphore indicates that a service is no longer available at all. Note
that this is a very informal description of the semantics; the details
are defined in the service-specific configuration items covered later in
this document.

Only if the metric processor actually creates an incident (of whatever
color) is it transmitted to and displayed on the SD2 website. The
incident stays listed on the website and does not go away automatically:
the SIM must manually mark the issue as resolved. The frontend supports
the SIM in this process, as she or he can publish intermediate progress
statements to the customers. The service data of red semaphores is used
to calculate an SLA value according to the service description.
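
The probing step described above can be pictured with a short Python
sketch. This is not the EPMon implementation: the function name
``probe``, the default timeout of 5 seconds, and the injectable
``opener`` parameter are assumptions made purely for illustration. The
sketch sends one GET request, never reads the response body, and keeps
only the status code and the round-trip time.

.. code:: python

   import time
   import urllib.error
   import urllib.request


   def probe(url, timeout=5.0, opener=urllib.request.urlopen):
       """Send one GET request; return (status_code_or_None, rtt_ms).

       Hypothetical helper, not part of Stackmon or EPMon.
       """
       start = time.monotonic()
       try:
           with opener(url, timeout=timeout) as response:
               status = response.status  # the body is never read
       except urllib.error.HTTPError as err:
           status = err.code  # 4xx/5xx answers still carry a status code
       except OSError:
           status = None  # timeout, refused connection, DNS failure, ...
       rtt_ms = (time.monotonic() - start) * 1000.0
       return status, rtt_ms

A probe that ends in a timeout or connection error yields ``None`` as
its status in this sketch, so that later stages can count it as a
failed query.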

The configuration of the backend is based on configuration files in Git
repositories hosted on GitHub (for the subject-based configuration) and
on Gitea (for OTC-related non-public data). Changes are requested and
tracked by GitOps methods.

The components are distributed across several regions and a non-OTC
platform (GCP) and are designed redundantly to increase resilience
against outages of the platform.

The SD2 frontend is connected to a Keycloak authentication proxy
instance, providing access to users listed in the OTC LDAP directory or
optionally authenticated by GitHub as an external ID provider. The SD2
neither stores nor processes any personal data, except for
authentication when personalized accounts are used.

Accessing the platform and the configuration
--------------------------------------------

The SD2 for the regions eu-de and eu-nl is accessible via the Internet
at:

::

   https://status.cloudmon.eco.tsi-dev.otc-service.com/

The SD2 instance for the Swiss Cloud is available at:

::

   https://status-ch2.cloudmon.eco.tsi-dev.otc-service.com/

SIMs can log in to edit or resolve incidents by appending
``/login/openid`` to the dashboard URLs mentioned above.

The public configuration repository for eu-de and eu-nl is at:

::

   https://github.com/opentelekomcloud-infra/stackmon-config

Consult the upcoming sections to configure any service metrics, flags,
and semaphores.

Customizing metrics
-------------------

The actual data flow is slightly more complex than described in the
overview above, but thankfully working defaults are already in place,
so only a little configuration needs to be touched. All configuration is
formatted as YAML and can be found in the repository:

::

   https://github.com/opentelekomcloud-infra/stackmon-config

A couple of questions need to be answered to follow the metrics through
the subsystems. All metrics processed by the EPMon plugin are based on
the service catalogue of the OTC (see ``openstack catalog list`` for
reference).

Question 1: What HTTP GET queries should be sent to the service?

https://github.com/opentelekomcloud-infra/stackmon-config/blob/main/epmon/config.yaml

This file lists the services under the top-level key ``elements``. The
important attribute here is the list of ``urls`` that are appended to
the service endpoint. With this list, several aspects of the service can
be probed. Only “list-type” queries should be listed here, as the plugin
just sends a GET request and discards the actual response body.

.. code:: yaml

    antiddos: # simple regular antiddos
       service_type: antiddos # service_type in the catalog
       sdk_proxy: anti_ddos # how SDK proxy is named
       urls: # which urls to test
         - /
         - /antiddos
         - /antiddos/query_config_list
         - /antiddos/default/config
         - /antiddos/weekly

If a service catalog entry should not be queried, for whatever reason,
assign an empty list to the ``urls`` attribute.
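
How the ``urls`` entries turn into probe targets can be sketched in a
few lines of Python. The helper ``build_probe_urls`` and the endpoint
``https://antiddos.example.com/v1`` are hypothetical, and the exact
joining rule used by the plugin is an assumption; the sketch simply
appends each path to the catalog endpoint.

.. code:: python

   def build_probe_urls(endpoint, urls):
       """Append each configured path to the service endpoint.

       Hypothetical helper; assumes every path starts with "/".
       """
       base = endpoint.rstrip("/")  # avoid a double slash when joining
       return [base + path for path in urls]


   # Hypothetical endpoint as it might appear in the service catalog.
   targets = build_probe_urls(
       "https://antiddos.example.com/v1",
       ["/", "/antiddos", "/antiddos/weekly"],
   )

   # An empty ``urls`` list yields no probe targets at all.
   skipped = build_probe_urls("https://antiddos.example.com/v1", [])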

Question 2: What flags should be defined for a service?

https://github.com/opentelekomcloud-infra/stackmon-config/blob/main/mp-prod/conf.d/flag_metrics.yaml

Under the top-level attribute ``flag_metrics``, a long list of
``services`` are associated with a condition, which is abstracted by a
``template``. By default, three flags are most often defined:
``api_down``, ``api_slow``, and ``api_success_rate_low``. The
“implementation” of each flag’s semantics is externalized in templates,
which contain complex Graphite queries. The implementation is not
important in the context of this primer.

.. code:: yaml

   ### Anti-DDoS
     - name: "api_down"
       service: "antiddos"
       template:
         name: "api_down"
       environments:
         - name: "production_eu-de"
         - name: "production_eu-nl"

     - name: "api_slow"
       service: "antiddos"
       template:
         name: "api_slow"
       environments:
         - name: "production_eu-de"
         - name: "production_eu-nl"

     - name: "api_success_rate_low"
       service: "antiddos"
       template:
         name: "api_success_rate_low"
       environments:
         - name: "production_eu-de"
         - name: "production_eu-nl"

The flag ``api_down`` means that all queries of a test series have
failed without exception. The flag ``api_slow`` is raised when the
average RTT in a test series exceeds 300 ms. The flag
``api_success_rate_low`` is similar to ``api_down``, but more relaxed,
as it is raised as soon as 90% or less of the queries succeed. In the
template file
(https://github.com/opentelekomcloud-infra/stackmon-config/blob/main/mp-prod/conf.d/metric_templates.yaml)
there are three additional flag definitions listed, but they are
currently not widely used. Custom queries could theoretically be added
with their own templates, but this is beyond the scope of this document.

The flags are referenced in the files discussed below as
``<service>.<name>``, for example ``antiddos.api_slow`` for the second
entry in the example above.
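
The semantics of the three default flags can be restated as a small
Python sketch. The real evaluation happens in Graphite queries inside
the templates, so this is only an illustration: the function
``evaluate_flags``, its parameters, and the rule that a query counts as
successful when it returns a status code below 400 are assumptions.

.. code:: python

   def evaluate_flags(service, series, slow_ms=300.0, min_success_rate=0.9):
       """Return the raised flags as a set of '<service>.<name>' strings.

       ``series`` is one test series: a list of (status, rtt_ms) tuples,
       where status is None if the query got no response at all.
       """
       if not series:
           return set()
       # Assumption: a query succeeded if it returned a status below 400.
       successes = [s is not None and s < 400 for s, _ in series]
       success_rate = sum(successes) / len(series)
       average_rtt = sum(rtt for _, rtt in series) / len(series)

       raised = set()
       if not any(successes):  # every query failed
           raised.add(f"{service}.api_down")
       if average_rtt > slow_ms:  # average RTT above 300 ms
           raised.add(f"{service}.api_slow")
       if success_rate <= min_success_rate:  # 90% or less succeeded
           raised.add(f"{service}.api_success_rate_low")
       return raised

Note that under these rules a completely failing service raises both
``api_down`` and ``api_success_rate_low``, since a success rate of 0%
is also below the 90% mark.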

Question 3: What is the impact of one or more raised flags?

https://github.com/opentelekomcloud-infra/stackmon-config/blob/main/mp-prod/conf.d/health_metrics.yaml

.. code:: yaml

   ### Anti-DDoS
     antiddos:
       service: antiddos
       component_name: "Anti DDoS"
       category: database
       metrics:
         - antiddos.api_down
         - antiddos.api_slow
         - antiddos.api_success_rate_low
       expressions:
         - expression: "antiddos.api_slow || antiddos.api_success_rate_low"
           weight: 1
         - expression: "antiddos.api_down"
           weight: 2

In this file, the top-level ``health_metrics`` key holds a long list of
semaphores. The ``weight`` values map to colors: 1 results in a yellow
semaphore (incident) and 2 in a red one (outage). The configuration
items define how this mapping is done: the flags from the previous
section are declared under ``metrics``, and the key ``expressions``
specifies the actual mapping. Typically not much needs to be changed
here unless new flags are introduced or the semantics of red and yellow
should be changed.

Should the outcome of this mapping be a yellow or red semaphore, an
incident for the corresponding service is created, sent to the SD2
frontend, and displayed.
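
The mapping from raised flags to a semaphore color can be illustrated
with the following Python sketch. It is not the metric processor’s
code: the function ``semaphore_weight`` is hypothetical, it understands
only the ``||`` operator that appears in the example above, and the
rule that the highest matching weight wins is an assumption.

.. code:: python

   def semaphore_weight(expressions, raised_flags):
       """Return 0 (green), 1 (yellow) or 2 (red) for one service."""
       weight = 0
       for item in expressions:
           # Only the "||" operator from the example is handled here.
           operands = [name.strip() for name in item["expression"].split("||")]
           if any(name in raised_flags for name in operands):
               # Assumption: the highest matching weight wins.
               weight = max(weight, item["weight"])
       return weight


   # The two expressions from the Anti-DDoS example, as Python data.
   ANTIDDOS_EXPRESSIONS = [
       {
           "expression": "antiddos.api_slow || antiddos.api_success_rate_low",
           "weight": 1,
       },
       {"expression": "antiddos.api_down", "weight": 2},
   ]
   COLORS = {0: "green", 1: "yellow", 2: "red"}

   color = COLORS[semaphore_weight(ANTIDDOS_EXPRESSIONS, {"antiddos.api_slow"})]

With only ``antiddos.api_slow`` raised, the sketch yields a yellow
semaphore; with no raised flags it stays green and no incident would be
created.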