OTC Status Dashboard 2: Cheat-Sheet for Squad Service Managers
==============================================================

The Status Dashboard 2 (SD2) is a facility for monitoring all OTC
services, intended to give customers an overview of service
availability. It comprises a set of **monitoring zones**, each
monitoring the services of a **monitoring environment** (a.k.a. regions
such as eu-de, eu-nl, etc.). The Ecosystem Squad configures the mapping
of monitoring zones to monitoring sites with high availability (HA) in
mind; it is not described in technical detail in this document.

Additionally, the web-based dashboard itself serves the monitored data
in a frontend component visible to OTC customers. The general assumption
of the SD2 is that “no news is good news”. Technically speaking, the SD2
doesn’t receive any monitoring metrics, only **incidents**. Once the
SD2 receives such an incident, the associated service is marked with a
yellow or red semaphore; otherwise every service keeps a green
semaphore.

Each squad should appoint one or more colleagues to the role of
**Service Incident Engineer (SIE)**. The SIEs define the exact
conditions under which a yellow or red semaphore should be raised. This
document is intended for them.

As a secondary target group, this document may also be useful for
**Service Incident Managers (SIM)**. It is this role’s responsibility to
react to incoming incidents, initiate mitigation activities, explain the
situation to customers, and eventually close incidents once they’re
resolved. For SIMs it might be useful to understand *why* incidents are
raised, but they may not need to know *how* exactly this happens.

Simplified architectural overview and data flow
-----------------------------------------------

The SD2 is a specific application of the much more general Stackmon
framework for cloud monitoring. Stackmon is licensed as open source
software, was initiated by the OTC, and is developed together with the
community. Due to this design, the monitoring data flows through several
stages. Most of them can be configured and customized to a great extent
to serve many different purposes. However, SD2 comes with a number of
assumptions and pre-configurations to reduce complexity for SIEs and
SIMs.

The data flows through the following stages. A plugin collects the raw
metrics from the live systems of the OTC. The SD2 uses the EPMon plugin,
short for “endpoint monitoring”: it sends HTTP GET requests to the API
endpoints listed in the OTC service catalogue. Typically these are
simple “list” requests, so no actual resources are created or modified.
The EPMon plugin records only the status code and the round-trip time of
each response, subject to a configured maximum timeout.

The results of the probes are stored in a time series database (TSDB)
implemented by Graphite. Graphite queries aggregate the raw data into
several **flags**. For example, if fewer than 90% of all queries to the
ABC service in the past 15 minutes were answered within a threshold of
300 ms, a flag named “abc_unreliable” could be raised. Another example
is ……………… .

The **metric processor** further aggregates the flags into minor
incidents (yellow) and major outages (red). A yellow semaphore means
that a service is degraded: it drops some requests or runs into
occasional timeouts, but the service itself is still responding. A red
semaphore indicates that a service is no longer available at all. Note
that this is a very informal description of the semantics; the details
are defined in the service-specific configuration items covered later in
this document.

Only if the metric processor actually creates an incident (of whatever
color) is it transmitted to and displayed on the SD2 website. The
incident stays listed on the website and does not go away automatically:
a SIM has to mark the issue as resolved manually. The frontend supports
the SIM in this process, as she or he may report intermediate progress
statements to the customers. The service data of red semaphores is used
to calculate an SLA value according to the service description.

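The probing step described above can be illustrated with a minimal
Python sketch. It is not the EPMon implementation: the names
``ProbeResult``, ``probe``, and ``send_get`` are hypothetical, the
default timeout is an assumed value, and the HTTP call is injected as a
function so that the example runs without network access.

.. code:: python

   import time
   from typing import Callable, NamedTuple, Optional


   class ProbeResult(NamedTuple):
       """What a probe keeps: status code and round-trip time only."""
       url: str
       status: Optional[int]  # None if the request timed out
       rtt_ms: float


   def probe(
       url: str,
       send_get: Callable[[str, float], int],
       timeout_s: float = 5.0,  # assumed value; the real timeout is configured
       clock: Callable[[], float] = time.monotonic,
   ) -> ProbeResult:
       """Send one GET request and record status code and round-trip time."""
       start = clock()
       try:
           # send_get is a hypothetical helper that returns the HTTP status
           # code; the response body is never inspected.
           status: Optional[int] = send_get(url, timeout_s)
       except TimeoutError:
           status = None
       rtt_ms = (clock() - start) * 1000.0
       return ProbeResult(url, status, rtt_ms)

A series of such results per service and environment is what the later
aggregation into flags would operate on.
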
The configuration of the backend is based on configuration files in Git
repositories hosted on GitHub (for the subject-based configuration) and
on Gitea (for OTC-related non-public data). Changes are requested and
tracked by GitOps methods.

The components are distributed (to several regions and a non-OTC
platform, GCP) and designed redundantly to increase resilience against
outages of the platform.

The SD2 frontend is connected to a Keycloak authentication proxy
instance, providing access to users listed in the OTC LDAP directory or
optionally authenticated by GitHub as an external ID provider. The SD2
neither stores nor processes any personal data, except for
authentication when personalized accounts are used.

Accessing the platform and the configuration
--------------------------------------------

The SD2 for the regions eu-de and eu-nl is accessible via the Internet
at:

::

   https://status.cloudmon.eco.tsi-dev.otc-service.com/

The SD2 instance for the Swiss Cloud is available at:

::

   https://status-ch2.cloudmon.eco.tsi-dev.otc-service.com/

SIMs can edit or resolve incidents after logging in. To do so, append
``/login/openid`` to the dashboard URLs mentioned above.

The public configuration repository for the eu-de and eu-nl regions is
at:

::

   https://github.com/opentelekomcloud-infra/stackmon-config

Consult the following sections to configure service metrics, flags, and
semaphores.

Customizing metrics
-------------------

The actual data flow is slightly more complex than described in the
previous sections, but thankfully there are already working defaults in
place, so only little configuration has to be touched. All configuration
is formatted as YAML and can be found in the repository:

::

   https://github.com/opentelekomcloud-infra/stackmon-config

A couple of questions have to be answered to follow the metrics through
the subsystems. All metrics processed by the EPMon plugin are based on
the service catalogue of the OTC (see ``openstack catalog list`` for
reference).

Question 1: What HTTP GET queries should be sent to the service?

https://github.com/opentelekomcloud-infra/stackmon-config/blob/main/epmon/config.yaml

Under the top-level key ``elements``, this file lists the services. The
important attribute here is the list of ``urls`` that get appended to
the service endpoint. With this list, several aspects of the service can
be probed. Only “list-type” queries should be listed here, as the plugin
just sends a GET request and discards the actual response body.

.. code:: yaml

   antiddos: # simple regular antiddos
     service_type: antiddos # service_type in the catalog
     sdk_proxy: anti_ddos # how SDK proxy is named
     urls: # which urls to test
       - /
       - /antiddos
       - /antiddos/query_config_list
       - /antiddos/default/config
       - /antiddos/weekly

If a service catalog entry should, for whatever reason, not be queried,
assign an empty list to the ``urls`` attribute.

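How the ``urls`` entries combine with a catalogue endpoint can be
pictured as follows. This is a minimal sketch, not EPMon code: the
function ``build_probe_urls`` and the endpoint
``https://antiddos.example.com/v1`` are made up for illustration, and
the exact joining rules of the plugin are an assumption.

.. code:: python

   from typing import List


   def build_probe_urls(endpoint: str, urls: List[str]) -> List[str]:
       """Append each configured path to the service endpoint.

       Assumption: exactly one slash separates endpoint and path.
       """
       base = endpoint.rstrip("/")
       return [base + "/" + path.lstrip("/") for path in urls]


   # Hypothetical endpoint as it might appear in a service catalogue.
   targets = build_probe_urls(
       "https://antiddos.example.com/v1",
       ["/", "/antiddos", "/antiddos/weekly"],
   )

   # An empty ``urls`` list yields no probe targets at all.
   no_targets = build_probe_urls("https://antiddos.example.com/v1", [])
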
Question 2: What flags should be defined for a service?

https://github.com/opentelekomcloud-infra/stackmon-config/blob/main/mp-prod/conf.d/flag_metrics.yaml

Under the top-level attribute ``flag_metrics``, a long list of
``services`` is associated with a condition, which is abstracted by a
``template``. By default, three flags are most often defined:
``api_down``, ``api_slow``, and ``api_success_rate_low``. The
“implementation” of the flags’ semantics is externalized in templates
that contain complex Graphite queries. The implementation is not
important in the context of this primer.

.. code:: yaml

   ### Anti-DDoS
   - name: "api_down"
     service: "antiddos"
     template:
       name: "api_down"
     environments:
       - name: "production_eu-de"
       - name: "production_eu-nl"

   - name: "api_slow"
     service: "antiddos"
     template:
       name: "api_slow"
     environments:
       - name: "production_eu-de"
       - name: "production_eu-nl"

   - name: "api_success_rate_low"
     service: "antiddos"
     template:
       name: "api_success_rate_low"
     environments:
       - name: "production_eu-de"
       - name: "production_eu-nl"

The flag ``api_down`` means that all queries of a test series have
failed without exception. The flag ``api_slow`` is raised when the
average RTT of a test series exceeds 300 ms. The flag
``api_success_rate_low`` is similar to ``api_down`` but more relaxed:
it is raised if 90% or less of the queries succeed. The template file
(https://github.com/opentelekomcloud-infra/stackmon-config/blob/main/mp-prod/conf.d/metric_templates.yaml)
lists three additional flag definitions, but they are currently not
widely used. Custom queries could theoretically be added with their own
templates, but this is beyond the scope of this document.

The flags are referenced in the following files as
``<service>.<name>``, for example as ``antiddos.api_slow`` for the
second flag in the example above.

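The informal semantics of the three default flags can be restated as a
small Python sketch. It only mirrors the description in this section
and is not derived from the Graphite templates: ``Sample``,
``evaluate_flags``, and the parameter names are hypothetical, and the
behaviour for an empty test series is an assumption.

.. code:: python

   from typing import Dict, NamedTuple, Sequence


   class Sample(NamedTuple):
       ok: bool       # True if the query succeeded
       rtt_ms: float  # round-trip time in milliseconds


   def evaluate_flags(
       samples: Sequence[Sample],
       slow_ms: float = 300.0,
       min_success_percent: int = 90,
   ) -> Dict[str, bool]:
       """Derive the three default flags from one test series."""
       flags = {
           "api_down": False,
           "api_slow": False,
           "api_success_rate_low": False,
       }
       if not samples:
           # Assumption: without data, no flag is raised.
           return flags

       total = len(samples)
       successes = sum(1 for s in samples if s.ok)
       avg_rtt = sum(s.rtt_ms for s in samples) / total

       # All queries failed without exception.
       flags["api_down"] = successes == 0
       # Average RTT above the threshold.
       flags["api_slow"] = avg_rtt > slow_ms
       # 90% or less of the queries succeeded (integer math avoids
       # floating-point surprises at the boundary).
       flags["api_success_rate_low"] = (
           successes * 100 <= min_success_percent * total
       )
       return flags

In this sketch a completely failing series raises both ``api_down`` and
``api_success_rate_low``; whether the real templates behave the same
way is not covered by this document.
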
Question 3: What is the impact of one or more raised flags?

https://github.com/opentelekomcloud-infra/stackmon-config/blob/main/mp-prod/conf.d/health_metrics.yaml

.. code:: yaml

   ### Anti-DDoS
   antiddos:
     service: antiddos
     component_name: "Anti DDoS"
     category: database
     metrics:
       - antiddos.api_down
       - antiddos.api_slow
       - antiddos.api_success_rate_low
     expressions:
       - expression: "antiddos.api_slow || antiddos.api_success_rate_low"
         weight: 1
       - expression: "antiddos.api_down"
         weight: 2

In this file, the top-level ``health_metrics`` key holds a long list of
semaphores. The value of a semaphore is mapped to a color: 1 means
yellow (incident) and 2 means red (outage). The configuration items
define how this mapping is done: the ``metrics`` key declares the flags
from the previous section, and the ``expressions`` key specifies the
actual mapping. Typically not much needs to be changed here unless new
flags are introduced or the semantics of red and yellow should be
changed.

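The mapping from raised flags to a semaphore color can be pictured with
the following Python sketch. It is an illustration under stated
assumptions, not the metric processor’s code: the helper names are
hypothetical, only flat ``||`` and ``&&`` expressions without
parentheses are handled, and picking the highest weight among all
matching expressions is assumed rather than taken from the source.

.. code:: python

   from typing import Dict, List, Tuple

   COLORS = {0: "green", 1: "yellow", 2: "red"}


   def eval_expression(expression: str, flags: Dict[str, bool]) -> bool:
       """Evaluate a flat boolean expression over flag names.

       ``&&`` binds tighter than ``||``; unknown flags count as not raised.
       """
       return any(
           all(flags.get(name.strip(), False) for name in term.split("&&"))
           for term in expression.split("||")
       )


   def semaphore(
       expressions: List[Tuple[str, int]], flags: Dict[str, bool]
   ) -> str:
       """Return the color for the highest weight whose expression holds."""
       weight = max(
           (w for expr, w in expressions if eval_expression(expr, flags)),
           default=0,  # no expression matched: "no news is good news"
       )
       return COLORS[weight]


   # The expressions from the Anti-DDoS example above.
   ANTIDDOS = [
       ("antiddos.api_slow || antiddos.api_success_rate_low", 1),
       ("antiddos.api_down", 2),
   ]

With these expressions, a raised ``antiddos.api_slow`` alone would lead
to a yellow semaphore, while ``antiddos.api_down`` would lead to red
regardless of the other flags.
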
Should this mapping result in a yellow or red semaphore, an incident for
the corresponding service is created, sent to the SD2 frontend, and
displayed.