Hasko, Vladimir f114248cfb adding SD2 training content
Reviewed-by: Gode, Sebastian <sebastian.gode@t-systems.com>
Co-authored-by: Hasko, Vladimir <vladimir.hasko@t-systems.com>
Co-committed-by: Hasko, Vladimir <vladimir.hasko@t-systems.com>
2023-10-04 10:07:42 +00:00

Metric Databases

Metrics are stored in the Graphite time series database (TSDB), split across two databases:

  • cloudmon-metrics
  • cloudmon

cloudmon database

EpMon data are stored in the clustered Graphite TSDB. Metrics emitted by the processes are gathered by a set of statsd processes, which aggregate them to 10-second precision.
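The aggregation step can be pictured with a small sketch. It is not the statsd implementation, only an illustration of how counter events falling into the same 10-second window collapse into a single data point. The function name and the event tuple layout are hypothetical; the "count" and "rate" leaves mirror the counter-specific metrics listed in the openstack.api section.

```python
from collections import defaultdict

FLUSH_INTERVAL_S = 10  # the 10s precision described above


def aggregate_counters(events, flush_interval=FLUSH_INTERVAL_S):
    """Collapse (timestamp, name, value) counter events into flush windows.

    Returns {(window_start, name): {"count": total, "rate": total / interval}}.
    Hypothetical helper, for illustration only.
    """
    totals = defaultdict(int)
    for timestamp, name, value in events:
        # Align every event to the start of its flush window.
        window_start = int(timestamp // flush_interval) * flush_interval
        totals[(window_start, name)] += value
    return {
        key: {"count": total, "rate": total / flush_interval}
        for key, total in totals.items()
    }
```

Events at 0s and 3.5s end up in the same data point, while an event at 10s opens the next window.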

Parameter            Value
Grafana Datasource   cloudmon
Database type        time series
Main namespace       stats
Metric type          OpenStack API metrics (including otcextensions) collecting response codes, latencies, methods
Database attributes  "timers", "counters", "environment name", "monitoring location", "service", "request method", "resource", "response code", "result", custom metrics, etc.
Result of API calls  attempted, passed, failed

All metrics are under the "stats" namespace, which contains the following important metric types:

  • counters
  • timers
  • gauges

Counters and timers have the following sub-branch:

  • openstack.api → pure API request metrics

Each sub-branch is split further into the following branches:

  • environment name (production_regA, production_regB, etc.)

    • monitoring location (production_regA, awx) - the environment from which the metric is gathered

openstack.api

The OpenStack metrics branch is structured as follows:

  • service (normally service_type from the service catalog, but sometimes differs slightly)
    • request method (GET/POST/DELETE/PUT)

      • resource (service resource, e.g. server, keypair, volume, etc.). Sub-resources are joined with "_" (e.g. cluster_nodes)

        • response code - received response code

          • count/upper/lower/mean/etc. - timer-specific metrics (available only under stats.timers.openstack.api.$environment.$zone.$service.$request_method.$resource.$status_code.{count,mean,upper,*})
          • count/rate - counter-specific metrics (available only under stats.counters.openstack.api.$environment.$zone.$service.$request_method.$resource.$status_code.{count,rate})
        • attempted - counter of attempted requests (only for counters)

        • failed - counter of failed requests (no response received, connection problems, etc.) (only for counters)

        • passed - counter of requests that received any response (only for counters)
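The branch structure above maps directly onto a dotted Graphite metric path. The following sketch assembles such a path from its parts. The function name is hypothetical, and it assumes that $zone in the path templates corresponds to the monitoring location and that path segments are used exactly as passed in.

```python
def openstack_api_path(kind, environment, zone, service, method,
                       resource_parts, status_code, attribute):
    """Build a Graphite path for an openstack.api metric.

    Hypothetical helper: `kind` is "timers" or "counters", `zone` is assumed
    to be the monitoring location, and `resource_parts` is a list so that
    sub-resources can be joined with "_" as described above.
    """
    if kind not in ("timers", "counters"):
        raise ValueError(f"unsupported metric kind: {kind!r}")
    resource = "_".join(resource_parts)
    segments = [
        "stats", kind, "openstack", "api",
        environment, zone, service, method,
        resource, str(status_code), attribute,
    ]
    return ".".join(segments)


# Mean latency of successful "GET server" calls measured from awx.
print(openstack_api_path("timers", "production_regA", "awx",
                         "compute", "GET", ["server"], 200, "mean"))
```

Passing `["cluster", "nodes"]` as `resource_parts` yields the `cluster_nodes` resource segment.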

cloudmon-metrics database

Cloudmon data are stored in the clustered Graphite TSDB. The metrics are emitted by the Metric Processor, which processes the cloudmon metrics (from EpMon). Based on the defined flag metrics (https://github.com/opentelekomcloud-infra/stackmon-config/blob/main/mp-prod/conf.d/flag_metrics.yaml) and the defined thresholds (https://github.com/opentelekomcloud-infra/stackmon-config/blob/main/mp-prod/conf.d/metric_templates.yaml), it produces the health metrics (https://github.com/opentelekomcloud-infra/stackmon-config/blob/main/mp-prod/conf.d/health_metrics.yaml), each with a different impact. The final health metrics are then sent to the Status Dashboard, which visualizes them as semaphore lights.

Parameter            Value
Grafana Datasource   cloudmon-metrics
Database type        time series
Main namespace       stats
Metric type          Metric Processor produces flag metric values (0, 1) and health metric values (0, 1, 2)
Database attributes  "health", "flag", "environment name", "service", "service type", "flag metric type"
Result               0, 1, 2

All metrics are under the "stats" namespace. In the "cloudmon-metrics" database it contains the following important metric types:

  • flag
  • health

Both metric types are split further by environment name (production_regA, production_regB, etc.).

Flag metrics

The flag metrics branch is structured as follows:

  • environment name (production_regA, production_regB, etc)
    • service type (service type from the service catalog)

      • flag metric type (api_slow, api_down, api_success_rate_low, ...)

Flag metrics take the following values:

  • 0 - flag metric is not breaching the defined threshold
  • 1 - flag metric is breaching the defined threshold
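A flag value is, in essence, the result of comparing an observed metric against its threshold. The sketch below shows that idea only. The function name, the comparator keywords, and the example numbers are hypothetical and are not taken from the Metric Processor configuration.

```python
import operator

# Hypothetical comparator keywords; the real configuration may use others.
COMPARATORS = {
    "gt": operator.gt,
    "lt": operator.lt,
    "eq": operator.eq,
}


def flag_value(observed, threshold, op="gt"):
    """Return 1 if `observed` breaches `threshold` under `op`, else 0."""
    return 1 if COMPARATORS[op](observed, threshold) else 0


# An "api_slow"-style check: latency above the threshold raises the flag.
slow = flag_value(observed=1500, threshold=1000, op="gt")

# An "api_success_rate_low"-style check: a rate below the threshold raises it.
low_success = flag_value(observed=95.0, threshold=90.0, op="lt")
```

In this example `slow` is 1 because 1500 exceeds 1000, and `low_success` is 0 because 95.0 is not below 90.0.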

Health metrics

The health metrics branch is structured as follows:

  • environment name (production_regA, production_regB, etc)
    • service (cloud service)

Health metrics take the following values:

  • 0 - Service operates normally
  • 1 - Service has a minor issue caused by one or more raised flag metrics
  • 2 - Service has an outage caused by one or more raised flag metrics
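How flag values could roll up into a health value can be sketched as follows. This is a simplified assumption, not the Metric Processor's actual evaluation logic: each flag is given a hypothetical impact (1 for a minor issue, 2 for an outage) and the health value is the highest impact among the raised flags. The colour mapping for the semaphore lights is likewise assumed.

```python
# Assumed colours for the Status Dashboard semaphore; not taken from the source.
SEMAPHORE = {0: "green", 1: "yellow", 2: "red"}


def health_value(flags, impacts):
    """Derive a health value (0, 1 or 2) from flag values.

    `flags` maps flag metric type -> 0/1, `impacts` maps flag metric type ->
    impact (1 or 2). Both mappings are hypothetical inputs for illustration.
    """
    raised = [impacts[name] for name, value in flags.items() if value == 1]
    return max(raised, default=0)


impacts = {"api_slow": 1, "api_success_rate_low": 1, "api_down": 2}
flags = {"api_slow": 1, "api_success_rate_low": 0, "api_down": 0}

health = health_value(flags, impacts)
print(health, SEMAPHORE[health])
```

With only `api_slow` raised, the service is reported with a minor issue. Raising `api_down` as well would lift the value to 2.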