adding SD2 training content

Reviewed-by: Gode, Sebastian <sebastian.gode@t-systems.com>
Co-authored-by: Hasko, Vladimir <vladimir.hasko@t-systems.com>
Co-committed-by: Hasko, Vladimir <vladimir.hasko@t-systems.com>
This commit is contained in:
Hasko, Vladimir 2023-10-04 10:07:42 +00:00 committed by zuul
parent d95af94fa3
commit f114248cfb
20 changed files with 970 additions and 0 deletions


@ -0,0 +1,21 @@
Contact - Whom to address for Feedback?
=======================================
If you have feedback, proposals or have found issues regarding the
Status Dashboard, EpMon or CloudMon, you can raise them in the corresponding GitHub
OpenTelekomCloud-Infra or StackMon repositories.
Issues or feedback regarding the **ApiMon, EpMon, Status Dashboard, Metric
Processor** as well as new feature requests can be addressed by filing an issue
on the **GitHub** repository at
https://github.com/opentelekomcloud-infra/stackmon-config
If you have found any problems that affect the **internal dashboard design**,
please open an issue or PR on **GitHub** at
https://github.com/stackmon/apimon-tests
For any other general issue or request, try to locate the appropriate repository in
https://github.com/orgs/stackmon/repositories
For general questions you can write an email to the `Ecosystems Squad
<mailto:dl-pbcotcdeleco@t-systems.com>`_.


@ -0,0 +1,88 @@
=====================
Dashboards management
=====================
https://dashboard.tsi-dev.otc-service.com/dashboards/f/CloudMon/cloudmon
Authentication is centrally managed by OTC LDAP.
The CloudMon dashboards are segregated by type of service:
- The "Squad Flag and Health" dashboard provides a high-level overview of the service health
  and flag metric status for each service of the respective squad.
- The "Cloud Service Statistics" dashboard monitors the health of every endpoint URL listed
  in the EpMon configuration.
- Dashboards can be replicated and customized for individual squad needs.
All Cloud Service Statistics dashboards support the Environment (target monitored platform) and Zone
(monitoring source location) variables at the top of each dashboard, so the
views can be adjusted based on the chosen values.
All Squad Flag and Health dashboards support the Environment (target monitored platform) variable at the top of each dashboard.
Squad Flag and Health Dashboard
===============================
The dashboard provides deeper insight into the metrics generated by the Metric Processor.
Flag panels show whether a service has breached the thresholds
of the predefined flag metric types.
Health panels show the resulting service health status based on the evaluated flag metrics.
The resulting flag values are visualized in state timeline panels with the following values:
- 0 - the flag metric is not breaching the defined threshold
- 1 - the flag metric is breaching the defined threshold
The resulting health values are visualized in state timeline panels with the following values:
- 0 - Service operates normally
- 1 - Service has a minor issue resulting from the defined flag metric(s) being breached
- 2 - Service has an outage resulting from the defined flag metric(s) being breached
Example at https://dashboard.tsi-dev.otc-service.com/d/s75qyOU4z/compute-flags?orgId=1
.. image:: training_images/flag_and_health_dashboard.png
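The flag and health values described in this section can be illustrated with a minimal Python sketch. The function name ``evaluate_flag``, the sample values and the "greater than" comparison are assumptions made for illustration only; the real rules and thresholds are defined in the Metric Processor configuration and may differ.

```python
# Hypothetical sketch: derive a 0/1 flag value from a metric sample.
# The comparison operator and the threshold are assumptions, not the
# actual Metric Processor rules.
def evaluate_flag(value: float, threshold: float) -> int:
    """Return 1 if the sample breaches the threshold, otherwise 0."""
    return 1 if value > threshold else 0


# Meaning of the health values as listed in this section.
HEALTH_LABELS = {
    0: "Service operates normally",
    1: "Service has a minor issue",
    2: "Service has an outage",
}

if __name__ == "__main__":
    flag = evaluate_flag(value=1350.0, threshold=1000.0)
    print("flag:", flag)
    print("health 2 means:", HEALTH_LABELS[2])
```

How several flags are combined into one health value is not shown here, because that mapping is part of the Metric Processor configuration.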
Cloud Service Statistics dashboard
==================================
Cloud Service Statistics dashboards use metrics from GET query requests towards the OTC
platform (:ref:`EpMon Overview <sd2_epmon_overview>`) and visualize them as:
- API calls duration per each URL query
- API calls duration (aggregated)
- API calls response codes
Example at https://dashboard.tsi-dev.otc-service.com/d/b4560ed6-95f0-45c0-904c-6ff9f8a491e8/sfs-service-statistics?orgId=1&refresh=10s
.. image:: training_images/cloud_service_statistics.png
Custom Dashboards
=================
The dashboards described above are predefined and read-only.
Further customization is currently possible via system-config on GitHub:
https://github.com/stackmon/apimon-tests/tree/main/dashboards/grafana
A predefined, simplified dashboard panel in YAML syntax
is defined in the StackMon GitHub repository
(https://github.com/stackmon/apimon-tests/tree/main/dashboards).
Dashboards can also be customized with the copy/save function directly in
Grafana. The whole dashboard can be saved under a new name and then edited
without any restrictions.
This approach is suitable for PoCs, temporary solutions and investigations, but
it should not be used as a permanent solution: customized dashboards that are not
stored in the GitHub repositories might be permanently deleted if the
dashboard service is fully re-installed.


@ -0,0 +1,82 @@
.. _sd2_epmon_overview:
============================
Endpoint Monitoring overview
============================
EpMon is a standalone Python-based process targeting every OTC service. It
finds services in the service catalogs and sends GET requests to the configured
endpoints.
Extensive tests such as provisioning a server give great
coverage, but they usually cannot be performed very often, which
leaves gaps on the monitoring timescale. To cover these gaps, the
EpMon component can send GET requests to given URLs, relying on the
API discovery of the OpenStack cloud (for example, a GET request to /servers on the
compute endpoint). Such requests are cheap and can be performed in a loop, e.g.
every 5 seconds. The latency of those calls, as well as the return codes, are
captured and sent to the metrics storage.
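The probing idea described above can be sketched in a few lines of Python. This is only an illustration using the standard library: the real EpMon relies on OpenStackSDK, and the names ``http_get_status``, ``probe`` and ``probe_loop`` as well as the example URL are hypothetical.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, List, Tuple


def http_get_status(url: str, timeout: float = 5.0) -> int:
    """Send one GET request and return the HTTP status code."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are still valid measurements for monitoring.
        return err.code


def probe(url: str, fetch: Callable[[str], int] = http_get_status) -> Tuple[int, float]:
    """Return (status_code, latency_in_milliseconds) for one GET request."""
    started = time.perf_counter()
    status = fetch(url)
    latency_ms = (time.perf_counter() - started) * 1000.0
    return status, latency_ms


def probe_loop(
    url: str,
    rounds: int,
    interval_s: float = 5.0,
    fetch: Callable[[str], int] = http_get_status,
) -> List[Tuple[int, float]]:
    """Probe the URL `rounds` times, sleeping `interval_s` between requests."""
    results = []
    for i in range(rounds):
        results.append(probe(url, fetch))
        if i < rounds - 1:
            time.sleep(interval_s)
    return results


if __name__ == "__main__":
    # Hypothetical target; replace with a reachable endpoint to try it out.
    for status, latency in probe_loop("https://example.com/", rounds=2, interval_s=1.0):
        print(status, round(latency, 1), "ms")
```

The ``fetch`` parameter is injectable so that the timing logic can be exercised without network access.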
Currently the EpMon configuration is located in stackmon-config:
https://github.com/opentelekomcloud-infra/stackmon-config/blob/main/epmon/config.yaml
It defines the HTTP query targets (URLs) for every single OTC service.
A service entry in the OTC Service Catalog (https://git.tsi-dev.otc-service.com/ecosystem/service_catalog) is a prerequisite for a service to be queried by EpMon.
If there are multiple entries in the service catalog, obsolete entries can be marked to be skipped.
The EpMon config.yaml only defines the service queries; it does not say how and when to use them.
For actual use across different monitoring sources and targets, the configuration matrix is defined in:
https://github.com/opentelekomcloud-infra/stackmon-config/blob/main/config.yaml
The following example shows the Auto Scaling service configuration in EpMon:
.. code:: yaml

   as:
     service_type: as
     sdk_proxy: auto_scaling
     urls:
       - /
       - /scaling_group
       - /scaling_configuration
       - /scaling_policy
   as_swiss:
     service_type: as
     sdk_proxy: auto_scaling
     urls:
       - /
       - /scaling_group
       - /scaling_configuration
   as_skip_v1:
     service_type: asv1
     urls: []

There are three entries for the Auto Scaling service:
- "as" is the default entry and is used for public cloud regions.
- "as_swiss" is specific to Swisscloud.
- "as_skip_v1" is an entry to be skipped by EpMon.
By default, all entries in the service catalog are queried by EpMon.
The mandatory parameter for all entries is "service_type". It must match the service_type entry in the service catalog.
Another important parameter is "sdk_proxy". This attribute identifies which otcextensions module should be used
for executing the HTTP GET queries.
The most important parameter is "urls". It defines the list of URLs that will be queried for the specific service.
Because the service_type is known, the full URL is not required; only the path that follows the predefined URL from the service catalog needs to be given.
If a specific service (or a specific service version) is supposed to be skipped from endpoint monitoring, it must
be defined in the EpMon config with the urls parameter set to an empty list. This ensures that even the default queries from the service catalog are overwritten
by the empty list in this config. In this example, service type asv1 (an entry from the service catalog) is not queried by EpMon at all
because it contains an empty urls list.
Collected response codes and response times are sent to Graphite for further processing by the Metric Processor.
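To make the role of the "urls" parameter more concrete, the following Python sketch joins a base endpoint with the configured paths. It is a simplified illustration, not the EpMon implementation: the function ``build_query_urls`` and the endpoint ``https://as.example.com/autoscaling-api/v1`` are hypothetical, and the real base URL comes from the service catalog.

```python
from typing import Dict, List

# Subset of the configuration shown above, expressed as Python data.
EPMON_CONFIG: Dict[str, Dict] = {
    "as": {
        "service_type": "as",
        "sdk_proxy": "auto_scaling",
        "urls": ["/", "/scaling_group"],
    },
    "as_skip_v1": {"service_type": "asv1", "urls": []},
}


def build_query_urls(endpoint: str, paths: List[str]) -> List[str]:
    """Append each configured path to the base endpoint URL."""
    base = endpoint.rstrip("/")
    return [base + "/" + path.lstrip("/") for path in paths]


if __name__ == "__main__":
    # Hypothetical base endpoint standing in for the service catalog entry.
    endpoint = "https://as.example.com/autoscaling-api/v1"
    for name, entry in EPMON_CONFIG.items():
        urls = build_query_urls(endpoint, entry["urls"])
        # An empty "urls" list yields no queries, so the entry is skipped.
        print(name, urls if urls else "skipped")
```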


@ -0,0 +1,68 @@
.. _sd2_incidents:
=========
Incidents
=========
TODO
Incidents inform customers why a cloud service has changed its status from "green" (normal operation) to any other state.
Incidents are created under the following conditions:
- The Metric Processor evaluates the health metric of a specific cloud service to 1 or 2, and an incident is automatically created on SD.
- The Service Incident Manager (SIM) manually creates an incident on SD for one or more cloud services.
Each cloud service on SD is represented by its name and a semaphore-colored status icon representing its current health status.
The following service states can be shown on SD2:
- Operational - green "check" mark icon
- Maintenance - blue "wrench" mark icon
- Minor Issue - yellow "cross" mark icon
- Major Issue - brown "cross" mark icon
- Service Outage - red "cross" mark icon
These 5 states can be set manually for specific service(s) during incident creation, but only 2 states (Minor Issue and Service Outage) are set automatically from the Metric Processor health metrics.
Incidents are shown in the respective color scheme at the top of the SD page. It is also possible to navigate to the related incident by clicking the service state icon next to the service.
Once the service health status has changed and an incident has been created, there is no automated clean-up of the incident; it must be handled by the respective SIM. Only after the incident is closed does the service change its state back to the "green" Operational state.
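The automatic part of this behaviour can be summarized in a small Python sketch. It assumes that health value 1 corresponds to Minor Issue and 2 to Service Outage, in line with the health values used on the dashboards; the names ``AUTOMATIC_IMPACT`` and ``impact_for_health`` are hypothetical and do not come from the Status Dashboard code.

```python
from typing import Optional

# Assumed mapping of Metric Processor health values to the two states
# that are set automatically. Other states are only set manually.
AUTOMATIC_IMPACT = {
    1: "Minor Issue",
    2: "Service Outage",
}


def impact_for_health(health: int) -> Optional[str]:
    """Return the incident impact for a health value, or None if no incident is needed."""
    return AUTOMATIC_IMPACT.get(health)


if __name__ == "__main__":
    for health in (0, 1, 2):
        print(health, "->", impact_for_health(health) or "no automatic incident")
```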
Incident manual creation process
================================
As mentioned, besides automated incident creation, incidents can also be created manually.
The Service Incident Manager must authenticate before being able to create an incident.
Login is provided by OpenID Connect on the page https://status.cloudmon.eco.tsi-dev.otc-service.com/login/openid
Once logged in, the new option "Open new incident" appears at the top right corner of the page.
.. image:: training_images/sd2_incident.jpg
The incident creation form consists of these mandatory fields:
- Incident Summary - Description of the incident
- Incident Impact - Drop-down menu of 4 service states (Scheduled Maintenance, Minor Issue, Major Issue, Service Outage)
- Affected services - List of all OTC cloud services in conjunction with regions. One or more items can be chosen
- Start - Timestamp when the incident started
Incident update process
=======================
During the incident lifecycle, the SIM can update the incident with relevant information.
The incident update form consists of these optional fields:
- Incident title - Change the title of the incident
- Update Message - Additional details related to the current status of the incident
- Update Status - Drop-down menu of 4 incident statuses (Analyzing incident, Fixing incident, Observing fix, Incident resolved)
- Next Update by - Timestamp when the incident is expected to be updated with further information
Incident manual closure process
===============================
An incident is never closed automatically. The SIM needs to close the incident by changing its status to "Incident resolved" during the incident update process.
After that, the incident disappears from the list of active incidents and the service health status changes back to the "green" Operational state.
Every closed incident is recorded in the Incident History.
Incident notifications
======================
Status Dashboard supports RSS feeds for incident notifications. The details of how to set up an RSS feed are described on the :ref:`notifications <sd2_notifications>` page.


@ -6,3 +6,14 @@ Status Dashboard 2 Training
:maxdepth: 1
onepager
introduction
workflow
status_dashboard_frontend
monitoring_coverage
epmon_checks
dashboards
metrics
databases
incidents
notifications
contact


@ -0,0 +1,68 @@
============
Introduction
============
The Open Telekom Cloud is represented to users and customers by the API
endpoints and the various services behind them. Customers are
interested in a reliable way to check and verify whether the services are actually
available to them via the Internet.
The Status Dashboard 2 (SD2) is a facility for monitoring all OTC
services, intended to give customers an overview of service
availability. It comprises a set of **monitoring zones**, each
monitoring the services of a **monitoring environment** (also known as regions,
such as eu-de, eu-nl, etc.). The mapping of monitoring zones to monitoring
sites is configured in a mesh matrix to validate internal as well as external connections to the cloud.
The SD2 framework:
- Developed with the aim of supervising the public APIs of the OTC platform 24/7.
- GET requests are sent repeatedly to the API.
- Requests grouped into service metrics are sent to the Metric Processor.
- The Metric Processor defines so-called flag metrics, which evaluate whether service metrics reach the defined thresholds.
- Based on the severity of the flag metrics, health metrics are produced.
- The Status Dashboard visualizes the health of a service based on its health metrics.
- Green - service is OK, Yellow - service has a minor issue, Red - service has an outage.
- For yellow and red service health, an incident is created on the Status Dashboard and the MOD / 24/7 squad is notified.
.. image:: https://stackmon.github.io/assets/images/solution-diagram.svg
SD2 Architecture Summary
------------------------
- EpMon executes various HTTP query requests towards service endpoints and
  generates metrics.
- The HTTP request metrics (generated by OpenStackSDK) are collected by
  statsd.
- The time-series database (Graphite) receives the metrics from statsd.
- The Metric Processor processes the request metrics and, based on defined thresholds, evaluates the resulting service health metrics.
- The Status Dashboard visualizes service health based on the health metrics produced by the Metric Processor and stored in an SQL database.
- Grafana dashboards visualize data from Graphite as well as from the Metric Processor.
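For readers unfamiliar with statsd, the sketch below shows what timing and counter samples look like in the plain-text statsd line format (``<bucket>:<value>|<type>``). The bucket names are purely hypothetical; in SD2 the actual metric names are generated by OpenStackSDK and are not reproduced here.

```python
def statsd_timing(bucket: str, millis: float) -> str:
    """Format a timing sample (in milliseconds) as a statsd line."""
    return f"{bucket}:{millis:g}|ms"


def statsd_counter(bucket: str, count: int = 1) -> str:
    """Format a counter increment as a statsd line."""
    return f"{bucket}:{count}|c"


if __name__ == "__main__":
    # Hypothetical bucket names, used only to illustrate the wire format.
    print(statsd_timing("example.as.scaling_group.GET", 12.5))
    print(statsd_counter("example.as.scaling_group.GET.200"))
```

Such lines are typically sent to the statsd daemon over UDP, which aggregates them before they reach the time-series database.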
SD2 features
------------
SD2 comes with the following features:
- Support for service health with 5 service statuses (3 generated semaphore lights, 1 custom semaphore light, 1 maintenance status)
- Support for HTTP requests (GET) for Endpoint Monitoring
- Support for custom metrics and custom thresholds
- Support for automatically generated incidents as well as custom incidents
- Support for all OTC environments:

  - EU-DE
  - EU-NL
  - Swisscloud

- Support for multiple monitoring sources:

  - EU-DE
  - EU-NL
  - Swisscloud

- Internal dashboards to understand the root cause of service health changes
- Each squad can control and manage its own metrics and dashboards
- All parameters are configured in a single place (stackmon-config) in human-readable form (YAML)


@ -0,0 +1,25 @@
.. _sd2_notifications:
=============
Notifications
=============
The Status Dashboard application comes with RSS feeds that provide information about incidents.
The current RSS feeds are based on the "feedgen" library:
https://pypi.org/project/feedgen/
The RSS feeds support queries by region, by service name and by service category.
Example of a region-based query:
https://status.cloudmon.eco.tsi-dev.otc-service.com/rss/?mt=EU-DE
Example of a service-category-based query:
https://status.cloudmon.eco.tsi-dev.otc-service.com/rss/?srvc=Compute
Example of a region and service-name-based query:
https://status.cloudmon.eco.tsi-dev.otc-service.com/rss/?mt=EU-DE&srv=Data%20Warehouse%20Service
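The query URLs above can also be built programmatically. The following Python sketch uses only the standard library and the query parameters visible in the examples (``mt``, ``srv``, ``srvc``); the helper name ``rss_url`` is hypothetical, and whether other parameter combinations are accepted by the service is not confirmed here.

```python
from typing import Optional
from urllib.parse import quote, urlencode

RSS_BASE = "https://status.cloudmon.eco.tsi-dev.otc-service.com/rss/"


def rss_url(
    region: Optional[str] = None,
    service: Optional[str] = None,
    category: Optional[str] = None,
) -> str:
    """Build an RSS feed URL from optional region, service name and category."""
    params = {}
    if region:
        params["mt"] = region
    if service:
        params["srv"] = service
    if category:
        params["srvc"] = category
    if not params:
        return RSS_BASE
    # quote_via=quote encodes spaces as %20 instead of "+".
    return RSS_BASE + "?" + urlencode(params, quote_via=quote)


if __name__ == "__main__":
    print(rss_url(region="EU-DE"))
    print(rss_url(category="Compute"))
    print(rss_url(region="EU-DE", service="Data Warehouse Service"))
```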


@ -0,0 +1,62 @@
=========================
Status Dashboard Frontend
=========================
Status Dashboard provides status information for OTC cloud services across different regions.
The following features are supported on Status Dashboard:
- Service health with 5 service statuses
- Authentication by OpenID Connect
- Service categories - meta grouping of services into groups
- Regions - services exist in different regions
- Incidents - entries about issues affecting certain regions and certain services
- Support for all OTC environments
- Built-in API support
- RSS notifications
- SLA view of all services
- Incident history
Two Status Dashboard portals are available:
- public status dashboard: https://status.cloudmon.eco.tsi-dev.otc-service.com/
- hybrid status dashboard: https://status-ch2.cloudmon.eco.tsi-dev.otc-service.com/
Service Health View
===================
.. image:: training_images/sd2_frontend.jpg
From an architecture point of view, Status Dashboard is a Flask-based web server that serves the API and renders web content, with PostgreSQL as its database.
The source can be found at https://github.com/stackmon/status-dashboard
The configuration of the Status Dashboard frontend is located on GitHub: https://github.com/opentelekomcloud-infra/stackmon-config/blob/main/sdb_prod/catalog.yaml
The catalog YAML file contains the definitions of service name, service type, service categories and regions.
Example of the Auto Scaling service entry in the SD catalog:
.. code:: yaml

   - attributes:
       category: Compute
       region: EU-DE
       type: as
     name: Auto Scaling
   - attributes:
       category: Compute
       region: EU-NL
       type: as
     name: Auto Scaling

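As an illustration of how such catalog entries can be consumed, the Python sketch below groups regions by service name. It assumes that ``category``, ``region`` and ``type`` are nested under ``attributes`` and that ``name`` sits next to it; the entries are written as Python literals instead of being loaded from the YAML file, and the function ``regions_by_service`` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

# The two catalog entries from the example, expressed as Python data.
# The nesting under "attributes" is an assumption about the YAML layout.
CATALOG = [
    {
        "attributes": {"category": "Compute", "region": "EU-DE", "type": "as"},
        "name": "Auto Scaling",
    },
    {
        "attributes": {"category": "Compute", "region": "EU-NL", "type": "as"},
        "name": "Auto Scaling",
    },
]


def regions_by_service(entries: List[dict]) -> Dict[str, List[str]]:
    """Collect the regions in which each service name is listed."""
    result = defaultdict(list)
    for entry in entries:
        result[entry["name"]].append(entry["attributes"]["region"])
    return dict(result)


if __name__ == "__main__":
    print(regions_by_service(CATALOG))
```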
SLA view
========
The SLA view (https://status.cloudmon.eco.tsi-dev.otc-service.com/sla) is calculated only from the "outage" service health status and provides a 6-month SLA history for each service.
.. image:: training_images/sd2_sla.jpg
Details on how to work with incidents can be found on the :ref:`incidents <sd2_incidents>` page.
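Since the SLA view is derived only from the "outage" status, a plausible way to think about the number is as the share of time without outage. The Python sketch below is an assumption about that arithmetic, not the formula used by Status Dashboard; the function ``availability_percent`` and the example figures are hypothetical.

```python
def availability_percent(outage_minutes: float, period_minutes: float) -> float:
    """Return the percentage of the period that was free of outages."""
    if period_minutes <= 0:
        raise ValueError("period_minutes must be positive")
    if not 0 <= outage_minutes <= period_minutes:
        raise ValueError("outage_minutes must be within the period")
    return 100.0 * (1.0 - outage_minutes / period_minutes)


if __name__ == "__main__":
    # Hypothetical example: 90 minutes of outage in a 30-day month.
    month_minutes = 30 * 24 * 60
    print(round(availability_percent(90, month_minutes), 3))
```

Minor Issue, Major Issue and Maintenance periods would not reduce the value under this assumption, because only outages are counted.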



@ -0,0 +1,26 @@
.. _sd2_flow:
SD2 Flow Process
================
.. image:: training_images/sd2_data_flow.svg
   :target: training_images/sd2_data_flow.svg
   :alt: sd2_data_flow

#. The service squad adds new data entries to the GitHub repository for
   EpMon (service URL queries),
   adjusts flag and health metrics if required,
   and adds a service entry to the SD catalog.
#. CloudMon fetches the public configuration from GitHub
   and the internal configuration (credentials, certs, keys, ...) from a local place and generates the final configuration.
#. The EpMon plugin is executed and triggers HTTP requests based on the defined configuration.
#. Metrics from the HTTP requests are collected by statsd.
#. Collected metrics are stored in the time-series database Graphite.
#. The Metric Processor evaluates HTTP metrics from the Graphite TSDB
   and generates new flag and health metrics based on the rules and thresholds defined in the configuration.
#. Status Dashboard changes the service health semaphore light based on the resulting health metrics from the Metric Processor.
#. Grafana uses the metrics and statistics databases as data sources for the
   dashboards. The dashboards with their various panels show the real-time status of
   the platform. Grafana also supports historical views and trends.