diff --git a/doc/source/internal/sd2_training/contact.rst b/doc/source/internal/sd2_training/contact.rst new file mode 100644 index 0000000..8552d03 --- /dev/null +++ b/doc/source/internal/sd2_training/contact.rst @@ -0,0 +1,21 @@ +Contact - Whom to address for Feedback? +======================================= + +In case you have any feedback, proposals or found any issues regarding the +Status Dashboard, EpMon or CloudMon, you can address them in the corresponding GitHub +OpenTelekomCloud-Infra repositories or StackMon repositories. + +Issues or feedback regarding the **ApiMon, EpMon, Status Dashboard, Metric +processor** as well as new feature requests can be addressed by filing an issue +on the **GitHub** repository under +https://github.com/opentelekomcloud-infra/stackmon-config + +If you have found any problems which affect the **internal dashboard design** +please open an issue/PR on **GitHub** +https://github.com/stackmon/apimon-tests + +If there is another general issue/demand/request try to locate the proper repository in +https://github.com/orgs/stackmon/repositories + +For general questions you can write an E-Mail to the `Ecosystems Squad +`_. \ No newline at end of file diff --git a/doc/source/internal/sd2_training/dashboards.rst b/doc/source/internal/sd2_training/dashboards.rst new file mode 100644 index 0000000..d95ba07 --- /dev/null +++ b/doc/source/internal/sd2_training/dashboards.rst @@ -0,0 +1,88 @@ +===================== +Dashboards management +===================== + +https://dashboard.tsi-dev.otc-service.com/dashboards/f/CloudMon/cloudmon + +The authentication is centrally managed by OTC LDAP. + + +The CloudMon Dashboards are segregated based on the type of service: + + - The "Squad Flag and Health" dashboard provides a high level overview of the service health + and flag metric status for each service of the respective squad. + - The "Cloud Service Statistics" dashboard monitors the health of every endpoint URL listed + by the EpMon config entry. 
+ - Dashboards can be replicated/customized for individual Squad needs. + + +All the Cloud Service Statistics dashboards support Environment (target monitored platform) and Zone +(monitoring source location) variables at the top of each dashboard so these +views can be adjusted based on the chosen value. + +All the Squad Flag And Health dashboards support Environment (target monitored platform) variables at the top of each dashboard. + + +Squad Flag and Health Dashboard +=============================== + +The dashboard provides deeper insight into the metrics generated by Metric Processor. +Flag panels show whether a service has breached the thresholds +of predefined flag metric types. +Health panels show the resulting service health status based on the evaluated flag metrics. + +The resulting flag values are visualized in state timeline panels with the following values: + +- 0 - flag metric is not breaching the defined threshold +- 1 - flag metric is breaching the defined threshold + + +The resulting health values are visualized in state timeline panels with the following values: + +- 0 - Service operates normally +- 1 - Service has a minor issue resulting from defined reached flag metric(s) +- 2 - Service has an outage resulting from defined reached flag metric(s) + +Example at https://dashboard.tsi-dev.otc-service.com/d/s75qyOU4z/compute-flags?orgId=1 + +.. image:: training_images/flag_and_health_dashboard.png + + +Cloud Service Statistics dashboard +================================== + +Cloud Service Statistics dashboards use metrics from GET query requests towards the OTC +platform (:ref:`EpMon Overview `) and visualize them in: + + - API calls duration per each URL query + - API calls duration (aggregated) + - API calls response codes + +Example at https://dashboard.tsi-dev.otc-service.com/d/b4560ed6-95f0-45c0-904c-6ff9f8a491e8/sfs-service-statistics?orgId=1&refresh=10s + +.. 
image:: training_images/cloud_service_statistics.png + + +Custom Dashboards +================= + +Previous dashboards are predefined and read-only. +The further customization is currently possible via system-config in github: + +https://github.com/stackmon/apimon-tests/tree/main/dashboards/grafana + +The predefined simplified dashboard panel in yaml syntax +is defined in Stackmon Github repository +(https://github.com/stackmon/apimon-tests/tree/main/dashboards) + +Dashboards can be customized also just by copy/save function directly in +Grafana. The whole dashboard can be saved under new name and then edited +without any restrictions. + +This approach is valid for PoC, temporary solutions and investigations but +should not be used as permanent solution as customized dashboards which are not +properly stored on Github repositories might be permanently deleted in case of +full dashboard service re-installation. + + + diff --git a/doc/source/internal/sd2_training/databases.rst b/doc/source/internal/sd2_training/databases.rst new file mode 100644 index 0000000..42ea16e --- /dev/null +++ b/doc/source/internal/sd2_training/databases.rst @@ -0,0 +1,160 @@ +.. _sd2_metric_databases: + +================ +Metric Databases +================ + +Metrics are stored in Graphite time series database in different databases: + + - cloudmon-metrics + - cloudmon + + +cloudmon database +================= + + +EpMon data are stored in the clustered Graphite TSDB. +Metrics emitted by the processes are gathered in the +row of statsd processes which aggregate metrics to 10s precision. 
+ + ++---------------------+-----------------------------------------------------------------------------------------------+ +| Parameter | Value | ++=====================+===============================================================================================+ +| Grafana Datasource | cloudmon | ++---------------------+-----------------------------------------------------------------------------------------------+ +| Database type | time series | ++---------------------+-----------------------------------------------------------------------------------------------+ +| Main namespace | stats | ++---------------------+-----------------------------------------------------------------------------------------------+ +| Metric type | OpenStack API metrics (including otcextensions) collecting response codes, latencies, methods | ++---------------------+-----------------------------------------------------------------------------------------------+ +| Database attributes | "timers", "counters", "environment name", "monitoring location", "service", "request method", | +| | "resource", "response code", "result", custom metrics, etc | ++---------------------+-----------------------------------------------------------------------------------------------+ +| result of API calls | attempted | +| | passed | +| | failed | ++---------------------+-----------------------------------------------------------------------------------------------+ + + +.. 
image:: training_images/graphite_query.png + + +All metrics are under "stats" namespace: + +Under "stats" there are following important metric types: + +- counters +- timers +- gauges + +Counters and timers have following subbranches: + +- openstack.api → pure API request metrics + +Every section has further following branches: + +- environment name (production_regA, production_regB, etc) + + - monitoring location (production_regA, awx) - specification of the environment from which the metric is gathered + + +openstack.api +------------- + +OpenStack metrics branch is structured as following: + +- service (normally service_type from the service catalog, but sometimes differs slightly) + + - request method (GET/POST/DELETE/PUT) + + - resource (service resource, i.e. server, keypair, volume, etc). Sub-resources are joined with "_" (i.e. cluster_nodes) + + - response code - received response code + + - count/upper/lower/mean/etc - timer specific metrics (available only under stats.timers.openstack.api.$environment.$zone.$service.$request_method.$resource.$status_code.{count,mean,upper,*}) + - count/rate - counter specific metrics (available only under stats.counters.openstack.api.$environment.$zone.$service.$request_method.$resource.$status_code.{count,mean,upper,*}) + + - attempted - counter for the attempted requests (only for counters) + - failed - counter of failed requests (not received response, connection problems, etc) (only for counters) + - passed - counter of requests receiving any response back (only for counters) + + +cloudmon-metrics database +========================= + + +Cloudmon data are stored in the clustered Graphite TSDB. +Metrics are emitted by the Metric Processor. 
+Metric Processor is processing the cloudmon metrics (from EpMon) and based on defined flag metrics (https://github.com/opentelekomcloud-infra/stackmon-config/blob/main/mp-prod/conf.d/flag_metrics.yaml) +and defined thresholds(https://github.com/opentelekomcloud-infra/stackmon-config/blob/main/mp-prod/conf.d/metric_templates.yaml) finally produces the health metrics +(https://github.com/opentelekomcloud-infra/stackmon-config/blob/main/mp-prod/conf.d/health_metrics.yaml) with different impact. +Final health metrics are then sent to Status Dashboard to visualize them as semaphore lights. + + + ++---------------------+-----------------------------------------------------------------------------------------------+ +| Parameter | Value | ++=====================+===============================================================================================+ +| Grafana Datasource | cloudmon-metrics | ++---------------------+-----------------------------------------------------------------------------------------------+ +| Database type | time series | ++---------------------+-----------------------------------------------------------------------------------------------+ +| Main namespace | stats | ++---------------------+-----------------------------------------------------------------------------------------------+ +| Metric type | Metric Processor produces flag metric values (0,1) and health metric values (0,1,2) | ++---------------------+-----------------------------------------------------------------------------------------------+ +| Database attributes | "health", "flag", "environment name", "service", "service type", "flag metric type" | ++---------------------+-----------------------------------------------------------------------------------------------+ +| result | 0 | +| | 1 | +| | 2 | ++---------------------+-----------------------------------------------------------------------------------------------+ + + +.. 
image:: training_images/mp_query.png + + +All metrics are under the "stats" namespace, grouped by the type of metric: + +Under "cloudmon-metrics" there are the following important metric types: + +- flag +- health + +- environment name (production_regA, production_regB, etc) + + +flag metrics +------------ + +The flag metrics branch is structured as follows: + +- environment name (production_regA, production_regB, etc) + + - service type (service type from the service catalog) + + - flag metric type (api_slow, api_down, api_success_rate_low, ...) + +Flag metrics contain the following values: + +- 0 - flag metric is not breaching the defined threshold +- 1 - flag metric is breaching the defined threshold + + +Health metrics +-------------- + +The health metrics branch is structured as follows: + +- environment name (production_regA, production_regB, etc) + + - service (cloud service) + +Health metrics contain the following values: + +- 0 - Service operates normally +- 1 - Service has a minor issue resulting from defined reached flag metric(s) +- 2 - Service has an outage resulting from defined reached flag metric(s) \ No newline at end of file diff --git a/doc/source/internal/sd2_training/epmon_checks.rst b/doc/source/internal/sd2_training/epmon_checks.rst new file mode 100644 index 0000000..2b92024 --- /dev/null +++ b/doc/source/internal/sd2_training/epmon_checks.rst @@ -0,0 +1,82 @@ +.. _sd2_epmon_overview: + +============================ +Endpoint Monitoring overview +============================ + + +EpMon is a standalone Python based process targeting every OTC service. It +finds services in the service catalogs and sends GET requests to the configured +endpoints. + +Performing extensive tests like provisioning a server gives great +coverage, but is usually not something that can be performed very often and +leaves certain gaps on the timescale of monitoring. 
In order to cover this gap +the EpMon component can send GET requests to the given URLs relying on the +API discovery of the OpenStack cloud (perform GET request to /servers or the +compute endpoint). Such requests are cheap and can be performed in a loop, e.g. +every 5 seconds. Latency of those calls, as well as the return codes, are +captured and sent to the metrics storage. + + + +Currently the EpMon configuration is located in stackmon-config: +https://github.com/opentelekomcloud-infra/stackmon-config/blob/main/epmon/config.yaml + +It defines the HTTP query targets (urls) for every single OTC service. + +A service entry in the OTC Service Catalog (https://git.tsi-dev.otc-service.com/ecosystem/service_catalog) is a prerequisite for a service to be queried by EpMon. +If there are multiple entries in the service catalog, such service entries can be marked for skip in case they are obsolete. +EpMon config.yaml only defines the service queries but doesn't say how and when to use them. +For actual use across different monitoring sources and targets the configuration matrix is defined in: +https://github.com/opentelekomcloud-infra/stackmon-config/blob/main/config.yaml + + +The following example shows the autoscaling service configuration in EpMon: + +.. code:: yaml + + as: + service_type: as + sdk_proxy: auto_scaling + urls: + - / + - /scaling_group + - /scaling_configuration + - /scaling_policy + as_swiss: + service_type: as + sdk_proxy: auto_scaling + urls: + - / + - /scaling_group + - /scaling_configuration + as_skip_v1: + service_type: asv1 + urls: [] + + + +There are 3 entries for the autoscaling service. + +- "as" entry is the default one and is used for public cloud regions. +- "as_swiss" entry is specific for Swisscloud +- "as_skip_v1" entry is an entry to be skipped by EpMon + +By default all entries in the service catalog are triggered for EpMon. + +The mandatory parameter for all entries is "service_type". This must match the service_type entry in the service catalog. 
+ +Another important parameter is "sdk_proxy". This attribute identifies which otcextensions module should be used +for execution of HTTP GET queries. + +The most important parameter is "urls". It defines the list of URLs which will be triggered for the specific service. +Because service_type is known, the full URL is not required; only the path that follows the predefined URL from the service catalog needs to be defined. + +If some specific service (or some specific service version) is supposed to be skipped from endpoint monitoring then it must +be defined in the EpMon config with the urls parameter set to an empty list. This ensures that even default queries from the service catalog are overwritten +by the empty list in this config. In this example service type asv1 (entry from service catalog) is not triggered by EpMon at all +as it contains an empty urls list. + + +Collected response codes and response times are sent to Graphite for further processing by Metric Processor. diff --git a/doc/source/internal/sd2_training/incidents.rst b/doc/source/internal/sd2_training/incidents.rst new file mode 100644 index 0000000..a03bb8f --- /dev/null +++ b/doc/source/internal/sd2_training/incidents.rst @@ -0,0 +1,68 @@ +.. _sd2_incidents: + +========= +Incidents +========= + +TODO +Incidents inform customers about the reason why some cloud service has changed its status from "green" (normal operation) to any other state. + +Incidents are created under the following conditions: + +- Metric Processor evaluates value 1 or 2 on the health metric of a specific cloud service and an incident is automatically created on SD. +- Service Incident Manager (SIM) manually creates an incident on SD for one or more cloud services. + +Each cloud service on SD is represented by its name and the status semaphore color icon representing its current health status. 
+The following states of the service can be shown on SD2: + +- Operational - green "check" mark icon +- Maintenance - blue "wrench" mark icon +- Minor Issue - yellow "cross" mark icon +- Major Issue - brown "cross" mark icon +- Service Outage - red "cross" mark icon + +These 5 states can be set manually for specific service(s) during incident creation but only 2 states (Minor Issue and Service Outage) are set automatically by the Metric Processor health metrics. +Incidents are visualized in the respective color scheme on the top of the SD page. It is also possible to navigate to the related incident by clicking on the service state icon next to the service. + +Once the service health status is changed and an incident is created there is no automated clean-up of the incident; the incident must be handled by the respective SIM. Only after the incident is closed does the service change its state back to the "green" Operational state. + +Incident manual creation process +================================ + +As mentioned, besides the automated incident creation, incidents can be created manually as well. +The service incident manager must authenticate before creating an incident. +Login is provided by the OpenID Connect feature on page https://status.cloudmon.eco.tsi-dev.otc-service.com/login/openid + +Once logged in, the new option "Open new incident" appears at the top right corner of the page. + +.. image:: training_images/sd2_incident.jpg + +The incident creation process consists of these mandatory fields: + +- Incident Summary - Description of the incident +- Incident Impact - Drop-down menu of 4 service states (Scheduled Maintenance, Minor Issue, Major Issue, Service Outage) +- Affected services - List of all OTC cloud services in conjunction with regions. One or more items can be chosen +- Start - Timestamp when the incident started + +Incident update process +======================= + +During the incident lifecycle SIM can update the incident with relevant information. 
+The incident update process consists of these optional fields: + +- Incident title - Change the title of the incident +- Update Message - Additional details related to the current status of the incident +- Update Status - Drop-down menu of 4 incident statuses (Analyzing incident, Fixing incident, Observing fix, Incident resolved) +- Next Update by - Timestamp when the incident is expected to be updated with further information + +Incident manual closure process +=============================== + +An incident is never closed automatically. SIM needs to close the incident by changing its status to "Incident resolved" during the incident update process. +After that the incident disappears from the active list of incidents and the service health status is changed back to the "green" operational state. +Every closed incident is recorded in the Incident History. + +Incident notifications +====================== + +Status Dashboard supports RSS feeds for incident notifications. The details of how to set up the RSS feed are described on the :ref:`notifications ` page. \ No newline at end of file diff --git a/doc/source/internal/sd2_training/index.rst b/doc/source/internal/sd2_training/index.rst index 0a51ce3..81eb436 100644 --- a/doc/source/internal/sd2_training/index.rst +++ b/doc/source/internal/sd2_training/index.rst @@ -6,3 +6,14 @@ Status Dashboard 2 Training :maxdepth: 1 onepager + introduction + workflow + status_dashboard_frontend + monitoring_coverage + epmon_checks + dashboards + metrics + databases + incidents + notifications + contact diff --git a/doc/source/internal/sd2_training/introduction.rst b/doc/source/internal/sd2_training/introduction.rst new file mode 100644 index 0000000..7ccbb23 --- /dev/null +++ b/doc/source/internal/sd2_training/introduction.rst @@ -0,0 +1,68 @@ +============ +Introduction +============ + +The Open Telekom Cloud is represented to users and customers by the API +endpoints and the various services behind them. 
Customers are +interested in a reliable way to check and verify if the services are actually +available to them via the Internet. + +The Status Dashboard 2 (SD2) is a monitoring facility for all OTC +services, intended for customers to get an overview of the service +availability. It consists of a set of **monitoring zones**, each +monitoring services of a **monitoring environment** (a.k.a. regions +like eu-de, eu-nl, etc.). The mapping of monitoring zones to monitoring +sites is configured in a mesh matrix to validate internal as well as external connections to the cloud. + +The SD2 framework: + + - Developed with the aim to supervise the public APIs of the OTC platform 24/7. + - GET requests are repeatedly sent to the API. + - Requests grouped in service metrics are sent to Metric Processor + - Metric Processor defines so-called flag metrics which evaluate whether service metrics reach the defined thresholds + - Based on the severity of the flag metrics the health metrics are produced + - Status Dashboard visualizes the health of the service based on health metrics + - Green - service is ok, Yellow - service has a minor issue, Red - service has an outage + - Based on yellow and red service health the incident is created on Status Dashboard and MOD / 24/7 squad is notified + +.. image:: https://stackmon.github.io/assets/images/solution-diagram.svg + +SD2 Architecture Summary +------------------------ + + - EpMon executes various HTTP query requests towards service endpoints and + generates metrics + - The HTTP request metrics (generated by OpenStackSDK) are collected by + statsd. 
+ - Metric Processor processes the requests metrics and based on defined thresholds evaluates the resulting service health metrics + - Status Dashboard visualize service health based on health metrics produced by metric processor and stored in SQL database + - Grafana dashboards visualize data from graphite as well as from metric processor + + + +SD2 features +------------ + +SD2 comes with the following features: + +- Support of service health with 5 service statuses (3 generated semaphore lights, 1 custom semaphore light, 1 maintenance status) +- Support of HTTP requests (GET) for Endpoint Monitoring +- Support of custom metrics and custom thresholds +- Support of automatically generated incidents as well as custom incidents +- Support of all OTC environments + + - EU-DE + - EU-NL + - Swisscloud + +- Support of multiple Monitoring sources: + + - EU-DE + - EU-NL + - Swisscloud + +- Internal dashboards to understand the root cause for service health changes +- Each squad can control and manage their metrics and dashboards +- All parameters configured from single place (stackmon-config) in human readable form (yaml) + diff --git a/doc/source/internal/sd2_training/metrics.rst b/doc/source/internal/sd2_training/metrics.rst new file mode 100644 index 0000000..45e633f --- /dev/null +++ b/doc/source/internal/sd2_training/metrics.rst @@ -0,0 +1,164 @@ +.. _sd2_metrics_definition: + +======= +Metrics +======= + +Status Dashboard distinguish 2 types of metrics: + +- Metrics emitted by EpMon +- Metrics by Metrics Processor + + +- The EpMON plugin internally invokes method calls to **OpenStack SDK + libraries.** They in turn generate metrics about each API call they do. This + requires some special configuration in the clouds.yaml file (currently + exposing metrics into statsd and InfluxDB is supported). For details refer + to the `config + documentation `_ + of the OpenStack SDK. 
The following metrics are captured: + + - response HTTP code + - duration of API call + - name of API call + - method of API call + - service type + +- Based on EpMon metrics the Metric Processor **emits flag and health metrics**. The following + metrics are captured: + + - environment + - service + - service type + - flag metric type + - resulting value (0, 1, 2) + +Custom metrics: + +Besides the default flag and health metrics some services might require a specific approach +to how the HTTP query metrics are aggregated and combined and +whether custom thresholds must be applied. +For such cases, custom metrics can be introduced in the Metric Processor configuration files: +https://github.com/opentelekomcloud-infra/stackmon-config/tree/main/mp-prod/conf.d + + +More details on how to query metrics from the databases are described on the :ref:`Metric +databases ` page. + + +Configuration of Flag metrics +============================= + +Flag metrics are defined by 2 configuration files: + +https://github.com/opentelekomcloud-infra/stackmon-config/blob/main/mp-prod/conf.d/flag_metrics.yaml + +https://github.com/opentelekomcloud-infra/stackmon-config/blob/main/mp-prod/conf.d/metric_templates.yaml + +Example of the Autoscaling service entry in flag_metrics.yaml: + +.. code:: yaml + + ### AutoScaling + - name: "api_down" + service: "as" + template: + name: "api_down" + environments: + - name: "production_eu-de" + - name: "production_eu-nl" + + - name: "api_slow" + service: "as" + template: + name: "api_slow" + environments: + - name: "production_eu-de" + - name: "production_eu-nl" + + - name: "api_success_rate_low" + service: "as" + template: + name: "api_success_rate_low" + environments: + - name: "production_eu-de" + - name: "production_eu-nl" + + +For each service a set of flag metrics is defined. These metrics are used by Metric Processor to define the health metric for the respective service. 
+A flag metric is represented by its name, service attribute (relation to EpMon service definition), +template reference (which exact query with which threshold is defined for this metric) +and environments entry (list of environments where this metric is applicable). + +Example of a template metric definition in metric_templates.yaml: + +.. code:: yaml + + api_success_rate_low: + query: "asPercent(sumSeries(stats.counters.openstack.api.$environment.*.$service.*.*.{2*,3*,404}.count), sumSeries(stats.counters.openstack.api.$environment.*.$service.*.*.attempted.count))" + op: "lt" + threshold: 90 + api_down: + query: "asPercent(sumSeries(stats.counters.openstack.api.$environment.*.$service.*.*.failed.count), sumSeries(stats.counters.openstack.api.$environment.*.$service.*.*.attempted.count))" + op: "eq" + threshold: 100 + api_slow: + query: "consolidateBy(aggregate(stats.timers.openstack.api.$environment.*.$service.*.*.*.mean, 'average'), 'average')" + op: "gt" + threshold: 300 + + +Templates define how the flag metrics are evaluated. + +- "query" parameter defines the query to the Graphite time-series database which stores collected metrics from EpMon. + For details on how the query is structured in Graphite TSDB, refer to the :ref:`Metric databases ` page. + +- "op" parameter defines the operation for comparison with the threshold (lt - less than, gt - greater than, eq - equal to, ...) + +- "threshold" parameter defines the value the query result is compared with. + +For example: + +The api_slow metric template checks whether the averaged, consolidated latency of all GET queries +for a specific service is greater than 300 milliseconds. If yes, the value of the flag metric will be 1. +If not, the value of the flag metric will be 0. + +The metric template configuration introduces pre-defined metric queries but in case some service needs a different approach, +a custom metric can be introduced here as well. 
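The comparison step described by the "op" and "threshold" parameters can be sketched in Python. This is a minimal illustration under stated assumptions, not the Metric Processor's actual code: the function name ``evaluate_flag``, the ``OPERATORS`` mapping and the sample values are hypothetical; only the ``op`` and ``threshold`` keys and the ``lt``/``gt``/``eq`` operator names come from the template example above. Running the Graphite query itself is left out.

```python
import operator

# Operator names as used in metric_templates.yaml ("lt", "gt", "eq").
# The mapping to Python functions is an assumption made for this sketch.
OPERATORS = {
    "lt": operator.lt,
    "gt": operator.gt,
    "eq": operator.eq,
}


def evaluate_flag(value, template):
    """Return 1 if the queried value breaches the template threshold, else 0."""
    compare = OPERATORS[template["op"]]
    return 1 if compare(value, template["threshold"]) else 0


# "op" and "threshold" copied from the template example; "query" is omitted.
api_slow = {"op": "gt", "threshold": 300}
api_success_rate_low = {"op": "lt", "threshold": 90}

# Hypothetical query results: a mean latency of 420 ms and a success rate of 97.5 %.
print(evaluate_flag(420.0, api_slow))
print(evaluate_flag(97.5, api_success_rate_low))
```

With these sample values the latency breaches the ``api_slow`` threshold while the success rate stays above the ``api_success_rate_low`` threshold. How missing datapoints are treated is not covered by this sketch.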
+ + +Configuration of Health metrics +=============================== + +Once the flag metrics are defined, Metric Processor evaluates the health metric based on conditions defined in health_metrics.yaml. +https://github.com/opentelekomcloud-infra/stackmon-config/blob/main/mp-prod/conf.d/health_metrics.yaml + + +Example of DEH health metric definition: + +.. code:: yaml + + ## Compute + ### DEH + deh: + service: deh + component_name: "Dedicated Host" + category: compute + metrics: + - deh.api_down + - deh.api_slow + - deh.api_success_rate_low + expressions: + - expression: "deh.api_slow || deh.api_success_rate_low" + weight: 1 + - expression: "deh.api_down" + weight: 2 + +The configuration consists of the following attributes: + +- service - service name (relation to EpMon) +- component_name - component name (relation to SD catalog) +- category - service category (relation to SD catalog) +- metrics - which metrics apply for health metric evaluation (relation to flag metrics) +- expressions - definition of the resulting health metric value by the defined expression and its weight. Weight 1 means minor issue. Weight 2 means outage. + diff --git a/doc/source/internal/sd2_training/monitoring_coverage.rst b/doc/source/internal/sd2_training/monitoring_coverage.rst new file mode 100644 index 0000000..ae056f3 --- /dev/null +++ b/doc/source/internal/sd2_training/monitoring_coverage.rst @@ -0,0 +1,191 @@ +=================== +Monitoring coverage +=================== + +Multiple factors define the monitoring coverage to simulate common customer use +cases. 
The overall matrix configuration of all combined targets, sources and scopes is located at: +https://github.com/opentelekomcloud-infra/stackmon-config/blob/main/config.yaml + + +Monitored targets +################# + +* EU-DE +* EU-NL +* EU-CH2 (Swisscloud) + + +Monitoring sources +################## + +* Inside OTC (eu-de, eu-ch2) +* Outside OTC (Swisscloud) + + +Scope of monitoring +################### + +* Endpoints and HTTP query requests + + * all services + * multiple GET queries + +* Static Resources + + * not yet in SD2 + * specific services + * availability of the resource or resource functionality + +* Global resources + + * not yet in SD2 + * OTC console + * OTC docs portal + * OTC public site + + +Example of monitoring coverage: + +.. code:: yaml + + # Mapping of environments to test projects + - env: production_eu-de + monitoring_zone: eu-de + db_entry: apimon.apimon + plugins: + - name: apimon + schedulers_inventory_group_name: schedulers + executors_inventory_group_name: executors + tests_project: apimon + tasks: + - scenario1_token.yaml + - name: epmon + epmon_inventory_group_name: epmon_de + cloud_name: production_eu-de # env in zone has few creds. 
We need to pick one + config_elements: + - antiddos + - antiddos_skip_bad_type + - as + - as_skip_v1 + - bms_skip + - cce_skip_unver + - cce + - ces + - ces_skip_v1 + - compute + - css + - cts_skip_unver + - cts + - data_protect_skip + - database_skip + - dcs + - dcs_skip_v1 + - dds + - deh + - dis_skip_unver + - dis + - dms + - dms_skip_v2 + - dns + - dws + - dws_skip_v1 + - identity + - image + - kms_skip_unver + - kms + - mrs + - nat + - network + - object_skip + - object_store + - orchestration + - rds_skip_unver + - rds_skip_v1 + - rds + - sdrs + - sfsturbo + - share + - smn + - smn_skip_v2 + - volume_skip_v2 + - volume + - env: production_eu-nl + monitoring_zone: eu-de + db_entry: apimon.apimon + plugins: + - name: apimon + schedulers_inventory_group_name: schedulers + executors_inventory_group_name: executors + #epmons_inventory_group_name: epmons + tests_project: apimon + tasks: + - scenario1_token.yaml + - name: epmon + epmon_inventory_group_name: epmon_de + cloud_name: production_eu-nl # env in zone has few creds. We need to pick one + config_elements: + - antiddos + - antiddos_skip_bad_type + - as + - as_skip_v1 + - bms_skip + - cce_skip_unver + - cce + - ces + - ces_skip_v1 + - compute + - css + - cts_skip_unver + - cts + - data_protect_skip + - database_skip + - dcs + - dcs_skip_v1 + - dds + - deh + - dis_skip_unver + - dis + - dms + - dms_skip_v2 + - dns + - dws + - dws_skip_v1 + - identity + - image + - kms_skip_unver + - kms + - mrs + - nat + - network + - object_skip + - object_store + - orchestration + - rds_skip_unver + - rds_skip_v1 + - rds + - sdrs + - sfsturbo + - share + - smn + - smn_skip_v2 + - volume_skip_v2 + - volume + + +Parameter "env" defines what is the target for monitoring (which region is to be monitored). + +Parameter "monitoring_zone" defines the source of monitoring (from which region the monitoring will be triggered) + +As Cloudmon is plugin based framework there's possibility to add as many plugins as required. 
+Currently 2 plugins are enabled: + +- apimon +- epmon + +Apimon plugin triggers scenario-based Ansible playbooks which simulate customer use cases, including the creation of resources (POST requests). +Currently only one scenario is enabled, for token authorization (scenario1_token.yaml). As Status Dashboard only evaluates the HTTP GET metrics, +other scenarios are not yet enabled. Playbooks are stored on GitHub (https://github.com/stackmon/apimon-tests/tree/main/playbooks). + +EpMon plugin defines which service entries will be used in which specific environment. +Services which are not present in the respective environment have no entry in this config either. + diff --git a/doc/source/internal/sd2_training/notifications.rst b/doc/source/internal/sd2_training/notifications.rst new file mode 100644 index 0000000..4358a9d --- /dev/null +++ b/doc/source/internal/sd2_training/notifications.rst @@ -0,0 +1,25 @@ +.. _sd2_notifications: + +============= +Notifications +============= + +Status Dashboard application comes with RSS feeds that provide information about incidents. + +The current RSS feeds are based on the "feedgen" library. +https://pypi.org/project/feedgen/ + +RSS feeds support queries by region, by service name, and by service category.
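The query parameters that appear in this page's example URLs (``mt`` for the region, ``srv`` for the service name, ``srvc`` for the service category) can also be combined programmatically. The following is only a minimal sketch and not part of Status Dashboard itself; the helper name ``build_rss_url`` is hypothetical, and the mapping of parameters to their meaning is inferred from the example URLs.

```python
from typing import Optional
from urllib.parse import quote, urlencode

# Base URL taken from the example queries on this page.
RSS_BASE = "https://status.cloudmon.eco.tsi-dev.otc-service.com/rss/"


def build_rss_url(
    region: Optional[str] = None,
    service: Optional[str] = None,
    category: Optional[str] = None,
) -> str:
    """Build an RSS feed URL (hypothetical helper, for illustration only)."""
    params = {}
    if region:
        params["mt"] = region      # region, e.g. "EU-DE"
    if service:
        params["srv"] = service    # service name
    if category:
        params["srvc"] = category  # service category, e.g. "Compute"
    # quote_via=quote encodes spaces as %20, matching the example URLs.
    query = urlencode(params, quote_via=quote)
    return f"{RSS_BASE}?{query}" if query else RSS_BASE
```

Calling ``build_rss_url(region="EU-DE", service="Data Warehouse Service")`` yields the same URL as the region and service name example on this page.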
+ +Example of region based query: + +https://status.cloudmon.eco.tsi-dev.otc-service.com/rss/?mt=EU-DE + +Example of service category based query: + +https://status.cloudmon.eco.tsi-dev.otc-service.com/rss/?srvc=Compute + +Example of region and service name based query: + +https://status.cloudmon.eco.tsi-dev.otc-service.com/rss/?mt=EU-DE&srv=Data%20Warehouse%20Service + diff --git a/doc/source/internal/sd2_training/status_dashboard_frontend.rst b/doc/source/internal/sd2_training/status_dashboard_frontend.rst new file mode 100644 index 0000000..d632194 --- /dev/null +++ b/doc/source/internal/sd2_training/status_dashboard_frontend.rst @@ -0,0 +1,62 @@ +========================= +Status Dashboard Frontend +========================= + +Status Dashboard provides the status information of OTC cloud services across different regions. + +The following features are supported on Status Dashboard: + +- Support of service health with 5 service statuses +- Authentication by OpenID Connect +- Service categories - meta grouping of services into groups +- Regions - different services exist in different regions +- Incidents - entries about issues affecting certain regions and certain services +- Support of all OTC environments +- Built-in API support +- RSS notification +- SLA view on all services +- Incident history + + +Two Status Dashboard portals are available: +- public status dashboard: https://status.cloudmon.eco.tsi-dev.otc-service.com/ +- hybrid status dashboard: https://status-ch2.cloudmon.eco.tsi-dev.otc-service.com/ + +Service Health View +=================== + +.. image:: training_images/sd2_frontend.jpg + + +From an architecture point of view, Status Dashboard is a Flask-based web server that serves the API and renders web content, with PostgreSQL as the database.
+Source can be found at https://github.com/stackmon/status-dashboard + +Configuration of the status dashboard frontend is located on GitHub: https://github.com/opentelekomcloud-infra/stackmon-config/blob/main/sdb_prod/catalog.yaml +The catalog YAML file contains definitions of service names, service types, service categories and regions. + +Example of the Auto Scaling service entry in the SD catalog: + +.. code:: yaml + + - attributes: + category: Compute + region: EU-DE + type: as + name: Auto Scaling + - attributes: + category: Compute + region: EU-NL + type: as + name: Auto Scaling + + +SLA view +======== + +SLA view https://status.cloudmon.eco.tsi-dev.otc-service.com/sla is calculated only from the "outage" service health status and provides a 6-month SLA history of each service. + +.. image:: training_images/sd2_sla.jpg + +Details on how to work with incidents can be found on the :ref:`incidents ` page. + + diff --git a/doc/source/internal/sd2_training/training_images/cloud_service_statistics.png b/doc/source/internal/sd2_training/training_images/cloud_service_statistics.png new file mode 100755 index 0000000..0021218 Binary files /dev/null and b/doc/source/internal/sd2_training/training_images/cloud_service_statistics.png differ diff --git a/doc/source/internal/sd2_training/training_images/flag_and_health_dashboard.png b/doc/source/internal/sd2_training/training_images/flag_and_health_dashboard.png new file mode 100755 index 0000000..dc525ab Binary files /dev/null and b/doc/source/internal/sd2_training/training_images/flag_and_health_dashboard.png differ diff --git a/doc/source/internal/sd2_training/training_images/graphite_query.png b/doc/source/internal/sd2_training/training_images/graphite_query.png new file mode 100755 index 0000000..83f14ea Binary files /dev/null and b/doc/source/internal/sd2_training/training_images/graphite_query.png differ diff --git a/doc/source/internal/sd2_training/training_images/mp_query.png
b/doc/source/internal/sd2_training/training_images/mp_query.png new file mode 100755 index 0000000..75ae814 Binary files /dev/null and b/doc/source/internal/sd2_training/training_images/mp_query.png differ diff --git a/doc/source/internal/sd2_training/training_images/sd2_data_flow.svg b/doc/source/internal/sd2_training/training_images/sd2_data_flow.svg new file mode 100755 index 0000000..93e2487 --- /dev/null +++ b/doc/source/internal/sd2_training/training_images/sd2_data_flow.svg @@ -0,0 +1,4 @@ + + + +

[Figure text of sd2_data_flow.svg: SD2 data flow diagram with numbered steps 1-8. Labels: Service Squad; Data Sources; Github stackmon-config repository; Pull repository; Cloudmon Main process (Generates full config based on public and private part); Cloudmon EpMon plugin (Execute HTTP GET requests); Statsd (Collects the metrics); Send metrics to graphite; Graphite TSDB; Metrics; MP (evaluate the service health based on flags); SD2 (Shows the service health); Grafana Dashboard; Management 24/7 Squad; O/M; Create incidents based on Thresholds; Send notifications to MOD.]
\ No newline at end of file diff --git a/doc/source/internal/sd2_training/training_images/sd2_frontend.jpg b/doc/source/internal/sd2_training/training_images/sd2_frontend.jpg new file mode 100755 index 0000000..1824a03 Binary files /dev/null and b/doc/source/internal/sd2_training/training_images/sd2_frontend.jpg differ diff --git a/doc/source/internal/sd2_training/training_images/sd2_incident.jpg b/doc/source/internal/sd2_training/training_images/sd2_incident.jpg new file mode 100755 index 0000000..f8d18eb Binary files /dev/null and b/doc/source/internal/sd2_training/training_images/sd2_incident.jpg differ diff --git a/doc/source/internal/sd2_training/training_images/sd2_sla.jpg b/doc/source/internal/sd2_training/training_images/sd2_sla.jpg new file mode 100755 index 0000000..0ebf2f1 Binary files /dev/null and b/doc/source/internal/sd2_training/training_images/sd2_sla.jpg differ diff --git a/doc/source/internal/sd2_training/workflow.rst b/doc/source/internal/sd2_training/workflow.rst new file mode 100644 index 0000000..42bc584 --- /dev/null +++ b/doc/source/internal/sd2_training/workflow.rst @@ -0,0 +1,26 @@ +.. _sd2_flow: + +SD2 Flow Process +================ + + +.. image:: training_images/sd2_data_flow.svg + :target: training_images/sd2_data_flow.svg + :alt: sd2_data_flow + + +#. Service squad adds new data entries in the GitHub repository for + EpMon (service URL queries), + adjusts flag and health metrics if required, + and adds a service entry in the SD catalog. +#. Cloudmon fetches the public configuration from GitHub + and the internal configuration (credentials, certs, keys, ...) from a local location and generates the final configuration. +#. EpMon plugin is executed and triggers HTTP requests from the defined configuration. +#. Metrics from HTTP requests are collected by Statsd. +#. Collected metrics are stored in the Graphite time-series database. +#. Metric Processor evaluates HTTP metrics from Graphite TSDB
+ and generates new flag and health metrics based on the rules and thresholds defined in the configuration. +#. Status Dashboard changes the service health semaphore light based on the resulting health metrics from Metric Processor. +#. Grafana uses metrics and statistics databases as the data sources for the + dashboards. The dashboards with various panels show the real-time status of + the platform. Grafana also supports historical views and trends.
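
The evaluation in steps 6 and 7 can be illustrated with a minimal Python sketch. It is not the Metric Processor implementation: the flag names and the rule table below are hypothetical, and the real rules and thresholds are defined in the configuration repository. Only the value conventions are taken from the dashboards page (flags are 0 or 1; health is 0 for normal operation, 1 for a minor issue, 2 for an outage).

```python
from typing import Dict, List

# Hypothetical rule table: for each health level, the flags that trigger it.
# The flag names are illustrative only and not taken from the real config.
HEALTH_RULES: Dict[int, List[str]] = {
    2: ["api_down"],                          # outage
    1: ["api_slow", "api_success_rate_low"],  # minor issue
}


def evaluate_health(flags: Dict[str, int]) -> int:
    """Return the health value (0, 1 or 2) for one service.

    `flags` maps a flag name to 0 (threshold not breached) or 1 (breached).
    The most severe matching level wins; missing flags count as 0.
    """
    for level in sorted(HEALTH_RULES, reverse=True):
        if any(flags.get(name, 0) == 1 for name in HEALTH_RULES[level]):
            return level
    return 0
```

In this sketch a service with both a "slow" and a "down" flag raised is reported as an outage, because levels are checked from most to least severe.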