adding SD2 training content

Reviewed-by: Gode, Sebastian <sebastian.gode@t-systems.com> Co-authored-by: Hasko, Vladimir <vladimir.hasko@t-systems.com> Co-committed-by: Hasko, Vladimir <vladimir.hasko@t-systems.com>
2023-10-04 10:07:42 +00:00 · 2023-10-04 10:07:42 +00:00 · f114248cfb
commit f114248cfb
parent d95af94fa3
20 changed files with 970 additions and 0 deletions
--- a/doc/source/internal/sd2_training/contact.rst
+++ b/doc/source/internal/sd2_training/contact.rst
@@ -0,0 +1,21 @@
 Contact - Whom to address for Feedback?
 =======================================
 If you have any feedback or proposals, or have found any issues regarding the
 Status Dashboard, EpMon or CloudMon, you can address them in the corresponding GitHub
 OpenTelekomCloud-Infra or StackMon repositories.
 Issues or feedback regarding **ApiMon, EpMon, Status Dashboard or Metric
 Processor**, as well as new feature requests, can be addressed by filing an issue
 in the **GitHub** repository at
 https://github.com/opentelekomcloud-infra/stackmon-config
 If you have found any problems which affect the **internal dashboard design**,
 please open an issue/PR on **GitHub**:
 https://github.com/stackmon/apimon-tests
 For any other general issue, demand or request, try to locate the proper repository in
 https://github.com/orgs/stackmon/repositories
 For general questions you can write an email to the `Ecosystems Squad
 <mailto:dl-pbcotcdeleco@t-systems.com>`_.
--- a/doc/source/internal/sd2_training/dashboards.rst
+++ b/doc/source/internal/sd2_training/dashboards.rst
@@ -0,0 +1,88 @@
 =====================
 Dashboards management
 =====================
 https://dashboard.tsi-dev.otc-service.com/dashboards/f/CloudMon/cloudmon
 Authentication is centrally managed by OTC LDAP.
 The CloudMon dashboards are segregated based on the type of service:
 - The "Squad Flag and Health" dashboards provide a high-level overview of the service health
   and flag metric status of each service of the respective squad.
 - The "Cloud Service Statistics" dashboards monitor the health of every endpoint URL listed
   in the EpMon config entry.
 - Dashboards can be replicated/customized for individual squad needs.
 All Cloud Service Statistics dashboards support the Environment (target monitored platform) and Zone
 (monitoring source location) variables at the top of each dashboard, so the
 views can be adjusted based on the chosen values.
 All Squad Flag and Health dashboards support the Environment (target monitored platform) variable at the top of each dashboard.
 Squad Flag and Health Dashboard
 ===============================
 The dashboard provides deeper insight into the metrics generated by the Metric Processor.
 Flag panels show whether a service has breached the thresholds
 of the predefined flag metric types.
 Health panels show the resulting service health status based on the evaluated flag metrics.
 The resulting flag values are visualized in state timeline panels with the following values:
 -   0 - flag metric is not breaching the defined threshold
 -   1 - flag metric is breaching the defined threshold
 The resulting health values are visualized in state timeline panels with the following values:
 -   0 - Service operates normally
 -   1 - Service has a minor issue resulting from the raised flag metric(s)
 -   2 - Service has an outage resulting from the raised flag metric(s)
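 When panel values are read programmatically (for example from an exported query result), the numbers can be mapped back to the labels listed above. The following is a minimal sketch; the names ``FLAG_LABELS``, ``HEALTH_LABELS`` and ``describe`` are illustrative only and are not part of CloudMon.

```python
# Labels are taken from the value lists above; the helper itself is hypothetical.
FLAG_LABELS = {
    0: "threshold not breached",
    1: "threshold breached",
}
HEALTH_LABELS = {
    0: "service operates normally",
    1: "minor issue",
    2: "outage",
}


def describe(kind: str, value: int) -> str:
    """Return a readable label for a flag or health value."""
    labels = FLAG_LABELS if kind == "flag" else HEALTH_LABELS
    # Unknown values are reported instead of being silently mapped.
    return labels.get(value, f"unknown {kind} value {value}")
```

 For example, ``describe("health", 2)`` yields the outage label, while a flag value outside 0 and 1 is reported as unknown.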
 Example at https://dashboard.tsi-dev.otc-service.com/d/s75qyOU4z/compute-flags?orgId=1
 .. image:: training_images/flag_and_health_dashboard.png
 Cloud Service Statistics dashboard
 ==================================
 Cloud Service Statistics dashboards use metrics from GET query requests towards the OTC
 platform (:ref:`EpMon Overview <sd2_epmon_overview>`) and visualize them as:
 - API calls duration per each URL query
 - API calls duration (aggregated)
 - API calls response codes
 Example at https://dashboard.tsi-dev.otc-service.com/d/b4560ed6-95f0-45c0-904c-6ff9f8a491e8/sfs-service-statistics?orgId=1&refresh=10s
 .. image:: training_images/cloud_service_statistics.png
 Custom Dashboards
 =================
 The previous dashboards are predefined and read-only.
 Further customization is currently possible via system-config in GitHub:
 https://github.com/stackmon/apimon-tests/tree/main/dashboards/grafana
 The predefined simplified dashboard panel in YAML syntax
 is defined in the StackMon GitHub repository
 (https://github.com/stackmon/apimon-tests/tree/main/dashboards).
 Dashboards can also be customized with the copy/save function directly in
 Grafana. The whole dashboard can be saved under a new name and then edited
 without any restrictions.
 This approach is valid for PoCs, temporary solutions and investigations, but
 should not be used as a permanent solution: customized dashboards which are not
 properly stored in GitHub repositories might be permanently deleted in case of
 a full re-installation of the dashboard service.
--- a/doc/source/internal/sd2_training/databases.rst
+++ b/doc/source/internal/sd2_training/databases.rst
--- a/doc/source/internal/sd2_training/epmon_checks.rst
+++ b/doc/source/internal/sd2_training/epmon_checks.rst
@@ -0,0 +1,82 @@
 .. _sd2_epmon_overview:
 ============================
 Endpoint Monitoring overview
 ============================
 EpMon is a standalone Python-based process targeting every OTC service. It
 finds the services in the service catalog and sends GET requests to the configured
 endpoints.
 Performing extensive tests like provisioning a server gives great
 coverage, but is usually not something that can be performed very often, and
 it leaves certain gaps on the monitoring timescale. To cover these gaps, the
 EpMon component can send GET requests to the given URLs, relying on the
 API discovery of the OpenStack cloud (for example, a GET request to /servers on the
 compute endpoint). Such requests are cheap and can be performed in a loop, e.g.
 every 5 seconds. The latency of those calls, as well as the return codes, are
 captured and sent to the metrics storage.
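 The idea of a cheap, timed GET probe can be sketched as follows. This is not EpMon's implementation (EpMon issues its requests through the OpenStack SDK); it is a standard-library illustration, and the names ``http_get_status`` and ``timed_probe`` are assumptions made for this example.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Tuple


def http_get_status(url: str, timeout: float = 5.0) -> int:
    """Send one GET request and return the HTTP status code."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses still carry a status code worth recording.
        return err.code


def timed_probe(
    url: str, fetch: Callable[[str], int] = http_get_status
) -> Tuple[int, float]:
    """Return (status_code, latency_in_ms) for a single probe of url."""
    started = time.perf_counter()
    status = fetch(url)
    latency_ms = (time.perf_counter() - started) * 1000.0
    return status, latency_ms
```

 Passing the fetch function as a parameter keeps the timing logic testable without network access; a scheduler would call ``timed_probe`` for every configured URL at the chosen interval.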
 Currently the EpMon configuration is located in stackmon-config:
 https://github.com/opentelekomcloud-infra/stackmon-config/blob/main/epmon/config.yaml
 It defines the HTTP query targets (URLs) for every single OTC service.
 A service entry in the OTC Service Catalog (https://git.tsi-dev.otc-service.com/ecosystem/service_catalog) is a prerequisite for a service to be queried by EpMon.
 If there are multiple entries in the service catalog, obsolete entries can be marked to be skipped.
 The EpMon config.yaml only defines the service queries; it does not say how and when to use them.
 For actual use across different monitoring sources and targets, the configuration matrix is defined in:
 https://github.com/opentelekomcloud-infra/stackmon-config/blob/main/config.yaml
 The following example shows the autoscaling service configuration in EpMon:
 .. code:: yaml
  as:
    service_type: as
    sdk_proxy: auto_scaling
    urls:
      - /
      - /scaling_group
      - /scaling_configuration
      - /scaling_policy
  as_swiss:
    service_type: as
    sdk_proxy: auto_scaling
    urls:
      - /
      - /scaling_group
      - /scaling_configuration
  as_skip_v1:
    service_type: asv1
    urls: []
 There are 3 entries for the autoscaling service:
 - "as" is the default entry, used for the public cloud regions.
 - "as_swiss" is specific to Swisscloud.
 - "as_skip_v1" is an entry to be skipped by EpMon.
 By default all entries in the service catalog are queried by EpMon.
 The mandatory parameter for all entries is "service_type". It must match the service_type entry in the service catalog.
 Another important parameter is "sdk_proxy". This attribute identifies which otcextensions module should be used
 for the execution of the HTTP GET queries.
 The most important parameter is "urls". It defines the list of URLs which will be queried for the specific service.
 As the service_type is known, the full URL does not need to be defined; only the path that follows the predefined URL from the service catalog is required.
 If a specific service (or a specific service version) is supposed to be skipped from endpoint monitoring, it must be
 defined in the EpMon config with the urls parameter set to an empty list. This ensures that even the default queries from the service catalog are overwritten
 by the empty list in this config. In this example the service type asv1 (an entry from the service catalog) is not queried by EpMon at all,
 as it contains an empty urls list.
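 How the "service_type" and "urls" parameters combine can be sketched in a few lines of Python. This is an illustration only: the endpoint values in ``CATALOG_ENDPOINTS`` are invented placeholders, and ``probe_urls`` is a hypothetical helper, not a function of EpMon.

```python
from typing import Dict, List

# Trimmed copy of the entries shown above, expressed as Python data.
EPMON_CONFIG = {
    "as": {
        "service_type": "as",
        "sdk_proxy": "auto_scaling",
        "urls": ["/", "/scaling_group", "/scaling_configuration", "/scaling_policy"],
    },
    "as_skip_v1": {"service_type": "asv1", "urls": []},
}

# Hypothetical endpoints; the real values come from the service catalog.
CATALOG_ENDPOINTS = {
    "as": "https://as.example.invalid/autoscaling-api/v1",
    "asv1": "https://as.example.invalid/v1",
}


def probe_urls(entry: dict, endpoints: Dict[str, str]) -> List[str]:
    """Join the catalog endpoint of the service type with each configured path."""
    base = endpoints[entry["service_type"]].rstrip("/")
    # An empty "urls" list yields no probes, which is how a service is skipped.
    return [base + path for path in entry["urls"]]
```

 With this data the "as" entry expands to four full URLs, while "as_skip_v1" expands to an empty list and is therefore never queried.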
 The collected response codes and response times are sent to Graphite for further processing by the Metric Processor.
--- a/doc/source/internal/sd2_training/incidents.rst
+++ b/doc/source/internal/sd2_training/incidents.rst
@@ -0,0 +1,68 @@
 .. _sd2_incidents:
 =========
 Incidents
 =========
 TODO
 Incidents inform customers about the reason why a cloud service has changed its status from "green" (normal operation) to any other state.
 Incidents are created under the following conditions:
 - The Metric Processor evaluates the health metric of a specific cloud service to 1 or 2, and an incident is automatically created on SD.
 - The Service Incident Manager (SIM) manually creates an incident on SD for one or more cloud services.
 Each cloud service on SD is represented by its name and a colored semaphore icon representing its current health status.
 The following service states can be shown on SD2:
 - Operational - green "check" mark icon
 - Maintenance - blue "wrench" mark icon
 - Minor Issue - yellow "cross" mark icon
 - Major Issue - brown "cross" mark icon
 - Service Outage - red "cross" mark icon
 These 5 states can be set manually for specific service(s) during incident creation, but only 2 states (Minor Issue and Service Outage) are set automatically from the Metric Processor health metrics.
 Incidents are visualized in the respective color scheme at the top of the SD page. It is also possible to navigate to the related incident by clicking on the state icon next to the service.
 Once the service health status has changed and an incident has been created, there is no automated clean-up of the incident; it must be handled by the respective SIM. Only after the incident is closed does the service change its state back to the "green" Operational state.
 Incident manual creation process
 ================================
 As mentioned, besides the automated incident creation, incidents can also be created manually.
 The Service Incident Manager must authenticate before being able to create an incident.
 Login is provided by OpenID Connect on the page https://status.cloudmon.eco.tsi-dev.otc-service.com/login/openid
 Once logged in, the new option "Open new incident" appears at the top right corner of the page.
 .. image:: training_images/sd2_incident.jpg
 The incident creation process consists of these mandatory fields:
 - Incident Summary - Description of the incident
 - Incident Impact - Drop-down menu of 4 service states (Scheduled Maintenance, Minor Issue, Major Issue, Service Outage)
 - Affected services - List of all OTC cloud services in conjunction with regions. One or more items can be chosen
 - Start - Timestamp when the incident started
 Incident update process
 =======================
 During the incident lifecycle the SIM can update the incident with relevant information.
 The incident update process consists of these optional fields:
 - Incident title - Change the title of the incident
 - Update Message - Additional details related to the current status of the incident
 - Update Status - Drop-down menu of 4 incident statuses (Analyzing incident, Fixing incident, Observing fix, Incident resolved)
 - Next Update by - Timestamp when the incident is expected to be updated with further information
 Incident manual closure process
 ===============================
 An incident is never closed automatically. The SIM needs to close the incident by changing its status to "Incident resolved" during the incident update process.
 After that the incident disappears from the list of active incidents and the service health status changes back to the "green" Operational state.
 Every closed incident is recorded in the Incident History.
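 The lifecycle described in the sections above can be modelled in a few lines. This is a simplified sketch and not the Status Dashboard data model; the class ``Incident`` and the function ``service_state`` are hypothetical names, while the status strings are the ones listed in the update form.

```python
from dataclasses import dataclass, field
from typing import List

# Update statuses as listed in the incident update form.
STATUSES = (
    "Analyzing incident",
    "Fixing incident",
    "Observing fix",
    "Incident resolved",
)


@dataclass
class Incident:
    summary: str
    impact: str  # e.g. "Minor Issue" or "Service Outage"
    updates: List[str] = field(default_factory=list)

    def update(self, status: str) -> None:
        """Record a status update; only known statuses are accepted."""
        if status not in STATUSES:
            raise ValueError(f"unknown status: {status}")
        self.updates.append(status)

    @property
    def is_open(self) -> bool:
        # An incident stays open until someone sets "Incident resolved".
        return not self.updates or self.updates[-1] != "Incident resolved"


def service_state(incident: Incident) -> str:
    """State shown for the service: the impact while open, else Operational."""
    return incident.impact if incident.is_open else "Operational"
```

 The sketch makes the key rule explicit: no number of intermediate updates returns the service to the Operational state; only the "Incident resolved" status does.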
 Incident notifications
 ======================
 Status Dashboard supports RSS feeds for incident notifications. Details on how to set up an RSS feed are described on the :ref:`notifications <sd2_notifications>` page.
--- a/doc/source/internal/sd2_training/index.rst
+++ b/doc/source/internal/sd2_training/index.rst
@@ -6,3 +6,14 @@ Status Dashboard 2 Training
   :maxdepth: 1
   onepager
   introduction
   workflow
   status_dashboard_frontend
   monitoring_coverage
   epmon_checks
   dashboards
   metrics
   databases
   incidents
   notifications
   contact
--- a/doc/source/internal/sd2_training/introduction.rst
+++ b/doc/source/internal/sd2_training/introduction.rst
@@ -0,0 +1,68 @@
 ============
 Introduction
 ============
 The Open Telekom Cloud is represented to users and customers by the API
 endpoints and the various services behind them. Customers are
 interested in a reliable way to check and verify if the services are actually
 available to them via the Internet. 
 The Status Dashboard 2 (SD2) is a facility for monitoring all OTC
 services, intended to give customers an overview of the service
 availability. It comprises a set of **monitoring zones**, each
 monitoring the services of a **monitoring environment** (a.k.a. regions
 like eu-de, eu-nl, etc.). The mapping of monitoring zones to monitoring
 sites is configured in a mesh matrix to validate internal as well as external connections to the cloud.
 The SD2 framework:
 - Developed with the aim of supervising the public APIs of the OTC platform 24/7.
 - GET requests are repeatedly sent to the API.
 - Requests grouped into service metrics are sent to the Metric Processor.
 - The Metric Processor defines so-called flag metrics which evaluate whether service metrics reach the defined thresholds.
 - Based on the severity of the flag metrics, the health metrics are produced.
 - Status Dashboard visualizes the health of the service based on the health metrics.
 - Green - service is OK, Yellow - service has a minor issue, Red - service has an outage.
 - Based on yellow and red service health, an incident is created on Status Dashboard and the MOD / 24/7 squad is notified.
 .. image:: https://stackmon.github.io/assets/images/solution-diagram.svg
 SD2 Architecture Summary
 ------------------------
 - EpMon executes various HTTP query requests towards service endpoints and
   generates metrics.
 - The HTTP request metrics (generated by OpenStackSDK) are collected by
   statsd.
 - The time-series database (Graphite) stores the metrics collected by statsd.
 - The Metric Processor processes the request metrics and, based on the defined thresholds, evaluates the resulting service health metrics.
 - Status Dashboard visualizes the service health based on the health metrics produced by the Metric Processor and stored in an SQL database.
 - Grafana dashboards visualize data from Graphite as well as from the Metric Processor.
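 The step from service metrics to flag metrics to a health metric can be illustrated with a small sketch. Everything specific in it is an assumption made for the example: the flag names, the thresholds, the severity assigned to each flag and the rule that the health value is the highest severity among raised flags. The real rules are defined in stackmon-config and evaluated by the Metric Processor.

```python
from typing import Dict

# Hypothetical flag definitions; real ones live in stackmon-config.
FLAG_RULES = {
    "api_slow": {"metric": "latency_ms", "threshold": 1000.0, "health": 1},
    "api_down": {"metric": "error_rate", "threshold": 0.5, "health": 2},
}


def evaluate_flags(metrics: Dict[str, float]) -> Dict[str, int]:
    """Return 1 for every flag whose metric exceeds its threshold, else 0."""
    return {
        name: int(metrics.get(rule["metric"], 0.0) > rule["threshold"])
        for name, rule in FLAG_RULES.items()
    }


def evaluate_health(flags: Dict[str, int]) -> int:
    """Return the highest health value among raised flags (0 if none)."""
    raised = [FLAG_RULES[name]["health"] for name, value in flags.items() if value]
    return max(raised, default=0)
```

 Under these assumed rules, a slow but reachable API raises only the first flag and results in health 1 (minor issue), while a high error rate results in health 2 (outage).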
 SD2 features
 ------------
 SD2 comes with the following features:
 - Support of service health with 5 service statuses (3 generated semaphore lights, 1 custom semaphore light, 1 maintenance status)
 - Support of HTTP requests (GET) for Endpoint Monitoring
 - Support of custom metrics and custom thresholds
 - Support of automatically generated incidents as well as custom incidents
 - Support of all OTC environments:
   - EU-DE
   - EU-NL
   - Swisscloud
 - Support of multiple monitoring sources:
   - EU-DE
   - EU-NL
   - Swisscloud
 - Internal dashboards to understand the root cause for service health changes
 - Each squad can control and manage their metrics and dashboards
 - All parameters configured from single place (stackmon-config) in human readable form (yaml)
--- a/doc/source/internal/sd2_training/metrics.rst
+++ b/doc/source/internal/sd2_training/metrics.rst
--- a/doc/source/internal/sd2_training/monitoring_coverage.rst
+++ b/doc/source/internal/sd2_training/monitoring_coverage.rst
--- a/doc/source/internal/sd2_training/notifications.rst
+++ b/doc/source/internal/sd2_training/notifications.rst
@@ -0,0 +1,25 @@
 .. _sd2_notifications:
 =============
 Notifications
 =============
 The Status Dashboard application comes with RSS feeds that provide information about incidents.
 The current RSS feeds are based on the "feedgen" library:
 https://pypi.org/project/feedgen/
 RSS feeds support region-based queries as well as service name and service category based queries.
 Example of a region-based query:
 https://status.cloudmon.eco.tsi-dev.otc-service.com/rss/?mt=EU-DE
 Example of a service category based query:
 https://status.cloudmon.eco.tsi-dev.otc-service.com/rss/?srvc=Compute
 Example of a region and service name based query:
 https://status.cloudmon.eco.tsi-dev.otc-service.com/rss/?mt=EU-DE&srv=Data%20Warehouse%20Service
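 Feed URLs like the ones above can also be assembled in code, which takes care of the percent-encoding of service names. A minimal sketch using only the standard library; the query parameters ``mt``, ``srv`` and ``srvc`` are taken from the examples above, while the helper name ``rss_url`` is an assumption.

```python
from typing import Optional
from urllib.parse import quote, urlencode

RSS_BASE = "https://status.cloudmon.eco.tsi-dev.otc-service.com/rss/"


def rss_url(
    region: Optional[str] = None,
    service: Optional[str] = None,
    category: Optional[str] = None,
) -> str:
    """Build an RSS feed URL from the optional region/service/category filters."""
    params = {"mt": region, "srv": service, "srvc": category}
    # quote_via=quote encodes spaces as %20 instead of "+".
    query = urlencode(
        {key: value for key, value in params.items() if value is not None},
        quote_via=quote,
    )
    return f"{RSS_BASE}?{query}" if query else RSS_BASE
```

 Calling ``rss_url(region="EU-DE", service="Data Warehouse Service")`` reproduces the last example URL above.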
--- a/doc/source/internal/sd2_training/status_dashboard_frontend.rst
+++ b/doc/source/internal/sd2_training/status_dashboard_frontend.rst
@@ -0,0 +1,62 @@
 =========================
 Status Dashboard Frontend
 =========================
 Status Dashboard provides the status information of OTC cloud services across different regions.
 The following features are supported on Status Dashboard:
 - Support of service health with 5 service statuses
 - Authentication by OpenID Connect
 - Service categories - meta grouping of services into groups
 - Regions - different services exist in different regions
 - Incidents - entries about issues affecting certain regions and certain services
 - Support of all OTC environments
 - Built-in API support
 - RSS notification
 - SLA view on all services
 - Incident history 
 Two Status Dashboard portals are available:
 - public status dashboard: https://status.cloudmon.eco.tsi-dev.otc-service.com/
 - hybrid status dashboard: https://status-ch2.cloudmon.eco.tsi-dev.otc-service.com/
 Service Health View
 ===================
 .. image:: training_images/sd2_frontend.jpg
 From the architecture point of view, Status Dashboard is a Flask-based web server serving the API and rendering web content, with PostgreSQL as the database.
 The source can be found at https://github.com/stackmon/status-dashboard
 The configuration of the Status Dashboard frontend is located on GitHub: https://github.com/opentelekomcloud-infra/stackmon-config/blob/main/sdb_prod/catalog.yaml
 The catalog YAML file contains the definitions of service names, service types, service categories and regions.
 Example of the Auto Scaling service entry in the SD catalog:
 .. code:: yaml
 - attributes:
     category: Compute
     region: EU-DE
     type: as
   name: Auto Scaling
 - attributes:
     category: Compute
     region: EU-NL
     type: as
   name: Auto Scaling
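 Because every catalog entry pairs one service name with one region, a service that exists in several regions appears several times. The sketch below shows one way to group such entries; it mirrors the YAML above as plain Python data instead of parsing the file, and ``regions_by_service`` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, List

# The two catalog entries shown above, expressed as Python data.
CATALOG = [
    {
        "name": "Auto Scaling",
        "attributes": {"category": "Compute", "region": "EU-DE", "type": "as"},
    },
    {
        "name": "Auto Scaling",
        "attributes": {"category": "Compute", "region": "EU-NL", "type": "as"},
    },
]


def regions_by_service(catalog: List[dict]) -> Dict[str, List[str]]:
    """Collect, per service name, the regions in which it is listed."""
    result = defaultdict(list)
    for entry in catalog:
        result[entry["name"]].append(entry["attributes"]["region"])
    return dict(result)
```

 For the two entries above the result contains a single key, "Auto Scaling", with the regions EU-DE and EU-NL.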
 SLA view
 ========
 The SLA view (https://status.cloudmon.eco.tsi-dev.otc-service.com/sla) is calculated only from the "outage" service health status and provides a 6-month SLA history for each service.
 .. image:: training_images/sd2_sla.jpg
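 An availability figure derived only from outage periods can be sketched as follows. The formula used here (share of the reporting period not spent in outage) is an assumption for illustration, not the documented Status Dashboard calculation; the function ``sla_percent`` is hypothetical and expects non-overlapping outage intervals.

```python
from datetime import datetime
from typing import Iterable, Tuple


def sla_percent(
    period_start: datetime,
    period_end: datetime,
    outages: Iterable[Tuple[datetime, datetime]],
) -> float:
    """Percentage of the period that was not covered by an outage."""
    period = (period_end - period_start).total_seconds()
    down = 0.0
    for start, end in outages:
        # Clip each outage to the reporting period.
        start, end = max(start, period_start), min(end, period_end)
        if end > start:
            down += (end - start).total_seconds()
    return 100.0 * (1.0 - down / period)
```

 Minor and major issues do not enter the calculation in this sketch, in line with the statement that only the "outage" status counts.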
 Details on how to work with incidents can be found on the :ref:`incidents <sd2_incidents>` page.
--- a/doc/source/internal/sd2_training/training_images/cloud_service_statistics.png
+++ b/doc/source/internal/sd2_training/training_images/cloud_service_statistics.png
--- a/doc/source/internal/sd2_training/training_images/flag_and_health_dashboard.png
+++ b/doc/source/internal/sd2_training/training_images/flag_and_health_dashboard.png
--- a/doc/source/internal/sd2_training/training_images/graphite_query.png
+++ b/doc/source/internal/sd2_training/training_images/graphite_query.png
--- a/doc/source/internal/sd2_training/training_images/mp_query.png
+++ b/doc/source/internal/sd2_training/training_images/mp_query.png
--- a/doc/source/internal/sd2_training/training_images/sd2_data_flow.svg
+++ b/doc/source/internal/sd2_training/training_images/sd2_data_flow.svg
--- a/doc/source/internal/sd2_training/training_images/sd2_frontend.jpg
+++ b/doc/source/internal/sd2_training/training_images/sd2_frontend.jpg
--- a/doc/source/internal/sd2_training/training_images/sd2_incident.jpg
+++ b/doc/source/internal/sd2_training/training_images/sd2_incident.jpg
--- a/doc/source/internal/sd2_training/training_images/sd2_sla.jpg
+++ b/doc/source/internal/sd2_training/training_images/sd2_sla.jpg
--- a/doc/source/internal/sd2_training/workflow.rst
+++ b/doc/source/internal/sd2_training/workflow.rst
@@ -0,0 +1,26 @@
 .. _sd2_flow:
 SD2 Flow Process
 ================
 .. image:: training_images/sd2_data_flow.svg
   :target: training_images/sd2_data_flow.svg
   :alt: sd2_data_flow
 #. The service squad adds new data entries in the GitHub repository for
   EpMon (service URL queries),
   adjusts flag and health metrics if required,
   and adds a service entry in the SD catalog.
 #. CloudMon fetches the public configuration from GitHub
   and the internal configuration (credentials, certs, keys, ...) from a local location and generates the final configuration.
 #. The EpMon plugin is executed and triggers HTTP requests based on the defined configuration.
 #. Metrics from the HTTP requests are collected by statsd.
 #. Collected metrics are stored in the time-series database Graphite.
 #. The Metric Processor evaluates the HTTP metrics from the Graphite TSDB
   and generates new flag and health metrics based on the rules and thresholds defined in the configuration.
 #. Status Dashboard changes the service health semaphore light based on the resulting health metrics from the Metric Processor.
 #. Grafana uses the metrics and statistics databases as the data sources for the
   dashboards. The dashboards with various panels show the real-time status of
   the platform. Grafana also supports historical views and trends.
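 The hand-over of request metrics to statsd in the flow above uses the plain-text statsd line format (``name:value|type``) sent over UDP. The sketch below shows that format for a timing and a counter; the metric names are invented for the example, and the host and port are assumptions (8125 is the conventional statsd port).

```python
import socket
from typing import Iterable


def timing_line(name: str, millis: float) -> bytes:
    """Format a statsd timer sample, e.g. b'some.metric:123|ms'."""
    return f"{name}:{millis:.0f}|ms".encode("ascii")


def counter_line(name: str, count: int = 1) -> bytes:
    """Format a statsd counter increment, e.g. b'some.metric:1|c'."""
    return f"{name}:{count}|c".encode("ascii")


def send(lines: Iterable[bytes], host: str = "127.0.0.1", port: int = 8125) -> None:
    """Send each metric line as one UDP datagram (fire and forget)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for line in lines:
            sock.sendto(line, (host, port))
```

 A probe result could then be reported with one timer line for the latency and one counter line for the returned status code.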