Nils Magnus 6e2da0d05c review of training material
Reviewed-by: Hasko, Vladimir <vladimir.hasko@t-systems.com>
Co-authored-by: Nils Magnus <magnus@linuxtag.org>
Co-committed-by: Nils Magnus <magnus@linuxtag.org>
2023-10-12 18:02:41 +00:00


======================================
Introduction to the Status Dashboard 2
======================================
The Open Telekom Cloud (OTC) presents itself to users and customers
through its API endpoints and the various services behind them.
Customers want a reliable way to verify whether those services are
actually available to them via the Internet.

The Status Dashboard 2 (SD2) is a facility that monitors all OTC
services and gives customers a quick overview of service availability.
It comprises a set of **monitoring zones**, each monitoring the services
of a **monitoring environment** (also known as a region, such as eu-de
or eu-nl). The mapping of monitoring zones to monitoring sites is
configured as a mesh matrix to validate internal as well as external
connections to the cloud.

Monitoring can be tricky, as there are many approaches that differ in
how deep, realistic, practical, synthetic, and reliable the measurement
of systems and services is. The SD2 provides a reliable, quick, and
comprehensive view of the OTC and makes some opinionated, deliberate
simplifications. This document guides all OTC staff roles involved in
providing a service through the architecture and the steps necessary to
maintain the monitoring process.

Key features of the SD2 framework:

- Developed to **supervise the 24/7 availability** of the public APIs
  of the OTC platform.
- SD2 **sends GET requests that list resources** to API endpoints. It
  explicitly does not simulate more complex, multi-stage use cases.
- Answers to such requests (status, round-trip time) are grouped by
  **service** and considered as **metrics**. They are sent to the
  **Metric Processor**.
- The Metric Processor maps the metrics to **flags** that are raised
  for certain situations, such as request probes not being answered
  (API down), a majority of probes not answering within a defined
  threshold period (API slow), or other situations.
- Based on a combination of raised flags and their severity, the Metric
  Processor calculates health metrics as **semaphores**. No flags result
  in a green semaphore, minor issues result in a yellow semaphore
  (service degradation), while severe situations lead to red semaphores
  (service unavailable).
- The **SD2 frontend** visualizes the health of each service on a
  website, based on the semaphores.
- Each non-green semaphore automatically raises an **issue** and
  displays it on the website. MODs and/or service squad owners should
  then take over.
- It requires the **manual intervention** of the affected service's
  owners to review, document, resolve, and eventually delete the issue
  condition.

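The chain from metrics to flags to semaphores described above can be illustrated with a minimal Python sketch. It is not the Metric Processor's actual implementation: the ``Probe`` class, the flag names ``api_down`` and ``api_slow``, the severity values, and the 500 ms threshold are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Set


class Semaphore(Enum):
    GREEN = "green"    # no flags raised
    YELLOW = "yellow"  # service degradation
    RED = "red"        # service unavailable


@dataclass
class Probe:
    """Result of one GET request; both fields are None if unanswered."""
    status: Optional[int]
    rtt_ms: Optional[float]


# Hypothetical values, not SD2 defaults.
SLOW_MS = 500.0
SEVERITY = {"api_slow": 1, "api_down": 2}


def raise_flags(probes: List[Probe]) -> Set[str]:
    """Map the probe metrics of one service to a set of flag names."""
    flags: Set[str] = set()
    if not probes:
        return flags
    # API down: not a single probe received a non-5xx answer.
    answered = [p for p in probes if p.status is not None and p.status < 500]
    if not answered:
        flags.add("api_down")
    # API slow: a majority of probes missed the round-trip threshold.
    slow = [p for p in probes if p.rtt_ms is None or p.rtt_ms > SLOW_MS]
    if 2 * len(slow) > len(probes):
        flags.add("api_slow")
    return flags


def semaphore(flags: Set[str]) -> Semaphore:
    """Derive the semaphore from the most severe raised flag."""
    worst = max((SEVERITY[f] for f in flags), default=0)
    return [Semaphore.GREEN, Semaphore.YELLOW, Semaphore.RED][worst]
```

In this sketch, three probes of which two exceed the threshold raise only ``api_slow`` and yield a yellow semaphore, while a set of entirely unanswered probes raises both flags and yields red.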

.. image:: https://stackmon.github.io/assets/images/solution-diagram.svg

SD2 Architecture Summary
------------------------
- The **EpMon** plugin (endpoint monitoring) sends several HTTP query
  requests to service endpoints and generates metrics.
- HTTP request metrics (status code, round-trip time) are generated by
  the OpenStack SDK and collected by StatsD.
- A time series database (Graphite) stores the metrics that StatsD
  forwards to it.
- The Metric Processor (MP) processes the request metrics and flags
  certain circumstances. Based on defined rules and thresholds, the
  MP computes the resulting service health metrics (semaphores).
- The MP raises an issue for any non-green semaphore and stores it in
  the SQL-based incident database that is part of the frontend
  component.
- The Status Dashboard frontend visualizes the incidents on a website.
- Grafana dashboards visualize data from Graphite as well as from the
  Metric Processor for OTC staff members.
- Service levels are computed based on how long incidents last.

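How a service level could follow from incident durations is sketched below. This document only states that service levels depend on how long incidents last, so the formula (share of a reporting window not covered by any incident), the function name ``availability``, and the merging of overlapping incidents are assumptions, not the SD2 implementation.

```python
from datetime import datetime, timedelta


def availability(incidents, window_start, window_end):
    """Return the fraction of the window not covered by any incident.

    incidents: iterable of (start, end) datetime pairs. Overlapping
    incidents are merged so that downtime is not counted twice.
    """
    # Clip each incident to the reporting window, drop those outside it.
    clipped = sorted(
        (max(start, window_start), min(end, window_end))
        for start, end in incidents
        if start < window_end and end > window_start
    )
    downtime = timedelta(0)
    cur_start = cur_end = None
    for start, end in clipped:
        if cur_end is None or start > cur_end:
            # A gap: close the previous merged interval, open a new one.
            if cur_end is not None:
                downtime += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlap: extend the current merged interval.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        downtime += cur_end - cur_start
    return 1.0 - downtime / (window_end - window_start)
```

For a 24-hour window with 2.5 hours of merged incident time, this sketch returns roughly 0.896, that is, about 89.6 % availability.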
SD2 features
------------
SD2 comes with the following features:

- Service health with five service statuses (three generated
  semaphores, one custom semaphore light, one maintenance status).
- HTTP GET requests for endpoint monitoring.
- Custom metrics and custom thresholds.
- Incidents are generated as soon as non-green semaphores are detected.
  Alternatively, incidents can be raised manually as maintenance
  downtimes.
- All OTC environments, including eu-de, eu-nl, and eu-sc2, are covered.
- The monitoring environments are decoupled from the monitoring zones
  obtaining the metrics and include eu-de, eu-nl, eu-sc2, and GCP.
- Linked Grafana dashboards help service squad members and MODs
  understand the root cause of service health changes.
- Each service squad can control and manage its metrics as well as its
  dashboards individually.
- All parameters are configured in a single place (stackmon-config) in
  human-readable form (YAML).