======================================
Introduction to the Status Dashboard 2
======================================

The Open Telekom Cloud is represented to users and customers by the API
endpoints and the various services behind them. Customers are interested in a
reliable way to check and verify whether those services are actually available
to them via the Internet.

The Status Dashboard 2 (SD2) is a facility that monitors all OTC services. It
is intended to give customers a quick overview of service availability. It
comprises a set of **monitoring zones**, each monitoring the services of a
**monitoring environment** (a.k.a. regions like eu-de, eu-nl, etc.). The
mapping of monitoring zones to monitoring sites is configured as a mesh matrix
to validate internal as well as external connections to the cloud.

Monitoring can be a tricky process, as there are many choices about how deep,
realistic, practical, synthetic, and reliable the measurements of systems and
services should be. The SD2 provides a reliable, quick, and comprehensive view
of the OTC, and makes some opinionated, deliberate simplifications.

This document guides all OTC staff roles involved in providing a service
through the architecture and the steps necessary to maintain the monitoring
process.

Key features of the SD2 framework:

- Developed to **supervise the 24/7 availability** of the public APIs of the
  OTC platform.
- SD2 **sends GET requests that list resources** to API endpoints. It
  explicitly does not simulate more complex, multi-stage use cases.
- Answers to such requests (status, round-trip time) are grouped by
  **service** and considered as **metrics**. They are sent to the
  **Metric Processor**.
- The Metric Processor maps the metrics to **flags** that are raised for
  certain situations, like request probes not being answered (API down), a
  majority not answering within a defined threshold period (API slow), or
  other situations.
- Based on a combination of raised flags and their severity, the Metric
  Processor calculates health metrics as **semaphores**. No flags result in a
  green semaphore, minor issues result in a yellow semaphore (service
  degradation), while severe situations lead to red semaphores (service
  unavailable).
- The **SD2 frontend** visualizes the health of each service on a website,
  based on the semaphores.
- Each non-green semaphore automatically raises an **issue** and displays it
  on the website. MODs and/or service squad owners should now take over.
- Closing an issue requires the **manual intervention** of the affected
  service's owners, who review, document, resolve, and eventually delete the
  issue condition.

.. image:: https://stackmon.github.io/assets/images/solution-diagram.svg

SD2 Architecture Summary
------------------------

- The **EpMon** plugin (endpoint monitoring) sends several HTTP query requests
  to service endpoints and generates metrics.
- HTTP request metrics (status code, round-trip time) are generated by the
  OpenStack SDK and are collected by Statsd.
- A time series database (Graphite) pulls metrics from Statsd.
- The Metric Processor (MP) processes the request metrics and flags certain
  circumstances. Based on defined rules and thresholds, the MP computes the
  resulting service health metrics (semaphores).
- The MP raises an issue for any non-green semaphore and stores it in the
  SQL-based incident database that is part of the frontend component.
- The Status Dashboard frontend visualizes the incidents on a website.
- Grafana dashboards visualize data from Graphite as well as from the Metric
  Processor for OTC staff members.
- Service levels are computed based on how long incidents last.

SD2 features
------------

SD2 comes with the following features:

- Service health with 5 service statuses (three generated semaphores, one
  custom semaphore light, one maintenance status).
- HTTP GET requests for endpoint monitoring.
- Custom metrics and custom thresholds.
- Incidents are generated once non-green semaphores are detected.
  Alternatively, incidents can be raised manually as maintenance downtimes.
- All OTC environments including eu-de, eu-nl, and eu-sc2 are covered.
- The monitoring environments are decoupled from the monitoring zones that
  obtain the metrics and include eu-de, eu-nl, eu-sc2, and GCP.
- Linked Grafana dashboards help service squad members and MODs understand
  the root cause of service health changes.
- Each service squad can control and manage their metrics as well as
  dashboards individually.
- All parameters are configured from a single place (stackmon-config) in
  human-readable form (YAML).
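
The path from request metrics to flags to semaphores described in the key
features can be illustrated with a small sketch. This is not the Metric
Processor's actual code: the names ``Probe``, ``raise_flags``,
``semaphore_for``, the flag names ``api_down`` and ``api_slow``, the severity
table, and the 500 ms threshold are all hypothetical and chosen only for
illustration. Real rules and thresholds live in stackmon-config.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List, Optional, Set


class Semaphore(IntEnum):
    """Service health, ordered by severity."""
    GREEN = 0
    YELLOW = 1
    RED = 2


@dataclass(frozen=True)
class Probe:
    """Result of one GET request: HTTP status and round-trip time."""
    status: Optional[int]  # None means the request was not answered
    rtt_ms: float


# Hypothetical values, not taken from stackmon-config.
SLOW_RTT_MS = 500.0
FLAG_SEVERITY = {"api_slow": Semaphore.YELLOW, "api_down": Semaphore.RED}


def raise_flags(probes: List[Probe]) -> Set[str]:
    """Map the probes of one service to a set of flag names."""
    flags: Set[str] = set()
    if not probes:
        return flags
    # Assumption: "API down" means that no probe was answered at all.
    if all(p.status is None for p in probes):
        flags.add("api_down")
    # "API slow": a majority did not answer within the threshold.
    late = [p for p in probes if p.status is None or p.rtt_ms > SLOW_RTT_MS]
    if len(late) * 2 > len(probes):
        flags.add("api_slow")
    return flags


def semaphore_for(flags: Set[str]) -> Semaphore:
    """No flags give green; otherwise the most severe flag wins."""
    return max((FLAG_SEVERITY[f] for f in flags), default=Semaphore.GREEN)


if __name__ == "__main__":
    probes = [Probe(200, 120.0), Probe(200, 900.0), Probe(200, 750.0)]
    flags = raise_flags(probes)
    print(sorted(flags), semaphore_for(flags).name)
```

In this sketch any non-green result would be the point at which an issue is
raised for the service owners. Taking the most severe flag is only one
possible way to combine flags; the source states only that the combination of
flags and their severity determines the semaphore.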