.. Reviewed-by: Hasko, Vladimir <vladimir.hasko@t-systems.com>; Co-authored-by: Nils Magnus <magnus@linuxtag.org>; Co-committed-by: Nils Magnus <magnus@linuxtag.org>

======================================
Introduction to the Status Dashboard 2
======================================

The Open Telekom Cloud (OTC) presents itself to users and customers
through its API endpoints and the various services behind them.
Customers need a reliable way to verify that those services are actually
available to them via the Internet.

The Status Dashboard 2 (SD2) is a monitoring facility for all OTC
services that gives customers a quick overview of service availability.
It comprises a set of **monitoring zones**, each monitoring the services
of a **monitoring environment** (also known as a region, such as eu-de
or eu-nl). The mapping of monitoring zones to monitoring sites is
configured as a mesh matrix to validate internal as well as external
connections to the cloud.

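As a rough illustration of the mesh idea, the following sketch pairs
every monitoring zone with every monitoring environment, so that each
environment is probed from inside as well as from outside. The zone and
environment names, the ``build_mesh`` helper, and the rule that a cell
counts as internal when zone and environment carry the same name are
assumptions made for this example, not the actual SD2 configuration.

```python
from itertools import product

# Hypothetical names, chosen only for illustration.
ZONES = ["eu-de", "eu-nl", "external"]
ENVIRONMENTS = ["eu-de", "eu-nl"]


def build_mesh(zones, environments):
    """Return one (zone, environment, kind) entry per matrix cell.

    Assumption: a cell is "internal" when the zone probes the
    environment of the same name, and "external" otherwise.
    """
    mesh = []
    for zone, env in product(zones, environments):
        kind = "internal" if zone == env else "external"
        mesh.append((zone, env, kind))
    return mesh


for zone, env, kind in build_mesh(ZONES, ENVIRONMENTS):
    print(f"{zone} -> {env}: {kind}")
```

With three zones and two environments the matrix has six cells, two of
which are internal connections.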

Monitoring can be tricky, as there are many approaches that differ in
how deep, realistic, practical, synthetic, and reliable the measurement
of systems and services is. SD2 provides a reliable, quick, and
comprehensive view of the OTC and makes some opinionated, deliberate
simplifications. This document guides all OTC staff roles involved in
providing a service through the architecture and the steps necessary to
maintain the monitoring process.

Key features of the SD2 framework:

- Developed to **supervise the 24/7 availability** of the public APIs
  of the OTC platform.
- SD2 **sends GET requests that list resources** to API endpoints. It
  explicitly does not simulate more complex, multi-stage use cases.
- Responses to such requests (status, round-trip time) are grouped by
  **service** and treated as **metrics**. They are sent to the
  **Metric Processor**.
- The Metric Processor maps the metrics to **flags** that are raised
  for certain situations, such as request probes not being answered (API
  down), a majority of probes not being answered within a defined
  threshold period (API slow), or other situations.
- Based on the combination of raised flags and their severity, the
  Metric Processor calculates health metrics as **semaphores**. No flags
  result in a green semaphore, minor issues result in a yellow semaphore
  (service degradation), and severe situations lead to a red semaphore
  (service unavailable).
- The **SD2 frontend** visualizes the health of each service on a
  website, based on the semaphores.
- Each non-green semaphore automatically raises an **issue**, which is
  displayed on the website. MODs and/or service squad owners should then
  take over.
- **Manual intervention** by the affected service's owners is required
  to review, document, resolve, and eventually delete the issue.

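The path from metrics to flags to semaphores described above can be
sketched roughly as follows. This is a minimal illustration under stated
assumptions, not the Metric Processor's actual code: the ``Probe`` class,
the flag names ``api_down`` and ``api_slow``, the severity table, and the
threshold value are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    """Result of one GET request against an API endpoint."""
    answered: bool        # False if the request got no answer
    rtt_ms: float = 0.0   # round-trip time in milliseconds


def raise_flags(probes, slow_ms):
    """Map the probes of one service to a set of (assumed) flag names."""
    flags = set()
    if not probes:
        return flags
    if not any(p.answered for p in probes):
        # No probe was answered at all.
        flags.add("api_down")
        return flags
    # Probes that were not answered within the threshold period.
    late = [p for p in probes if not p.answered or p.rtt_ms > slow_ms]
    if len(late) > len(probes) / 2:
        flags.add("api_slow")
    return flags


# Assumed severities: the semaphore colour each flag would justify.
SEVERITY = {"api_slow": "yellow", "api_down": "red"}
ORDER = ["green", "yellow", "red"]


def semaphore(flags):
    """Return the most severe colour among the raised flags."""
    worst = "green"
    for flag in flags:
        colour = SEVERITY[flag]
        if ORDER.index(colour) > ORDER.index(worst):
            worst = colour
    return worst


probes = [Probe(True, 120.0), Probe(True, 2500.0), Probe(False)]
print(semaphore(raise_flags(probes, slow_ms=1000.0)))
```

Keeping flag detection separate from the severity table means that a
threshold or a severity can change without touching the other step.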

.. image:: https://stackmon.github.io/assets/images/solution-diagram.svg

SD2 Architecture Summary
------------------------

- The **EpMon** plugin (endpoint monitoring) sends several HTTP query
  requests to service endpoints and generates metrics.
- HTTP request metrics (status code, round-trip time) are generated by
  the OpenStack SDK and collected by StatsD.
- A time series database (Graphite) pulls metrics from StatsD.
- The Metric Processor (MP) processes the request metrics and flags
  certain circumstances. Based on defined rules and thresholds, the
  MP computes the resulting service health metrics (semaphores).
- The MP raises an issue for any non-green semaphore and stores it in
  the SQL-based incident database that is part of the frontend component.
- The Status Dashboard frontend visualizes the incidents on a website.
- Grafana dashboards visualize data from Graphite as well as from the
  Metric Processor for OTC staff members.
- Service levels are computed based on how long incidents last.

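To make the last point concrete, here is one possible way to derive a
service level from incident durations. The formula (availability as the
share of a reporting period not covered by incidents), the function name
``service_level``, and the merging of overlapping incidents are
assumptions for illustration. The source does not state how SD2 itself
computes the value or whether maintenance windows count as downtime.

```python
from datetime import datetime, timedelta


def service_level(incidents, period_start, period_end):
    """Return availability in percent for the given period.

    `incidents` is a list of (start, end) datetime pairs. Overlapping
    incidents are merged so that no downtime is counted twice.
    """
    period = (period_end - period_start).total_seconds()
    if period <= 0:
        raise ValueError("period_end must be after period_start")

    # Clip each incident to the period and drop those outside of it.
    clipped = sorted(
        (max(start, period_start), min(end, period_end))
        for start, end in incidents
        if start < period_end and end > period_start
    )

    downtime = timedelta()
    current_start = current_end = None
    for start, end in clipped:
        if current_end is None or start > current_end:
            # Close the previous merged interval and open a new one.
            if current_end is not None:
                downtime += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlap: extend the interval that is currently open.
            current_end = max(current_end, end)
    if current_end is not None:
        downtime += current_end - current_start

    return 100.0 * (1 - downtime.total_seconds() / period)


april = (datetime(2024, 4, 1), datetime(2024, 5, 1))
outage = [(datetime(2024, 4, 10, 12, 0), datetime(2024, 4, 10, 15, 36))]
print(f"{service_level(outage, *april):.2f} %")
```

In this example a single outage of 3 hours and 36 minutes within a
30-day month corresponds to an availability of about 99.5 percent.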

SD2 Features
------------

SD2 comes with the following features:

- Service health with five service statuses (three generated
  semaphores, one custom semaphore light, one maintenance status).
- HTTP GET requests for endpoint monitoring.
- Custom metrics and custom thresholds.
- Incidents are generated once non-green semaphores are detected.
  Alternatively, incidents can be raised manually as maintenance
  downtimes.
- All OTC environments, including eu-de, eu-nl, and eu-sc2, are covered.
- The monitoring environments are decoupled from the monitoring zones
  obtaining the metrics and include eu-de, eu-nl, eu-sc2, and GCP.
- Linked Grafana dashboards help service squad members and MODs
  understand the root cause of service health changes.
- Each service squad can control and manage its metrics and dashboards
  individually.
- All parameters are configured in a single place (stackmon-config) in
  human-readable form (YAML).
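
The two ways an incident can come into existence, automatically from a
non-green semaphore or manually as a maintenance downtime, might be
modelled as in the sketch below. The ``Incident`` fields, the function
names, and the service name ``ecs`` are hypothetical and do not reflect
the schema of the actual incident database.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Incident:
    service: str
    impact: str            # "yellow", "red", or "maintenance"
    opened_at: datetime
    manual: bool = False   # True for manually raised incidents


def incident_for(service, semaphore, now=None) -> Optional[Incident]:
    """Open an incident for a non-green semaphore, else return None."""
    if semaphore == "green":
        return None
    if semaphore not in ("yellow", "red"):
        raise ValueError(f"unknown semaphore: {semaphore!r}")
    return Incident(service, semaphore, now or datetime.now(timezone.utc))


def maintenance_for(service, starts_at) -> Incident:
    """Record a manually raised maintenance downtime."""
    return Incident(service, "maintenance", starts_at, manual=True)


print(incident_for("ecs", "green"))
print(incident_for("ecs", "red"))
print(maintenance_for("ecs", datetime(2024, 4, 10, 22, 0, tzinfo=timezone.utc)))
```

A green semaphore yields no incident, while the ``manual`` field keeps
planned maintenance distinguishable from detected outages.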