
.. _sd2_metrics_definition:

=======
Metrics
=======

Status Dashboard distinguishes two types of metrics:

- Metrics emitted by EpMon
- Metrics emitted by the Metric Processor


- The EpMon plugin internally calls the **OpenStack SDK libraries**, which in
  turn generate metrics about each API call they make. This requires some
  special configuration in the clouds.yaml file (exposing metrics to statsd
  and InfluxDB is currently supported). For details, refer to the
  `config documentation <https://docs.openstack.org/openstacksdk/latest/user/guides/stats.html>`_
  of the OpenStack SDK. The following metrics are captured:

  - response HTTP code
  - duration of API call
  - name of API call
  - method of API call
  - service type

- Based on the EpMon metrics, the Metric Processor **emits flag and health metrics**. The following
  metrics are captured:

  - environment
  - service
  - service type
  - flag metric type
  - resulting value (0, 1, 2)

Custom metrics:

Besides the default flag and health metrics, some services might require a specific approach
to how the HTTP query metrics are aggregated and combined, and to
whether custom thresholds must be applied.
For such cases, custom metrics can be introduced in the Metric Processor configuration files:
https://github.com/opentelekomcloud-infra/stackmon-config/tree/main/mp-prod/conf.d


More details on how to query metrics from the databases are described on the
:ref:`Metric databases <sd2_metric_databases>` page.


Configuration of Flag metrics
=============================

Flag metrics are defined by two configuration files:

https://github.com/opentelekomcloud-infra/stackmon-config/blob/main/mp-prod/conf.d/flag_metrics.yaml

https://github.com/opentelekomcloud-infra/stackmon-config/blob/main/mp-prod/conf.d/metric_templates.yaml

Example of the AutoScaling service entry in flag_metrics.yaml:

.. code:: yaml

  ### AutoScaling
  - name: "api_down"
    service: "as"
    template:
      name: "api_down"
    environments:
      - name: "production_eu-de"
      - name: "production_eu-nl"

  - name: "api_slow"
    service: "as"
    template:
      name: "api_slow"
    environments:
      - name: "production_eu-de"
      - name: "production_eu-nl"

  - name: "api_success_rate_low"
    service: "as"
    template:
      name: "api_success_rate_low"
    environments:
      - name: "production_eu-de"
      - name: "production_eu-nl"


A set of flag metrics is defined for each service. The Metric Processor uses these metrics to derive the health metric of the respective service.
A flag metric is described by its name, a service attribute (the link to the EpMon service definition),
a template reference (which query and which threshold apply to this metric),
and an environments entry (the list of environments where this metric is applicable).

Example of a metric template definition in metric_templates.yaml:

.. code:: yaml

  api_success_rate_low:
    query: "asPercent(sumSeries(stats.counters.openstack.api.$environment.*.$service.*.*.{2*,3*,404}.count), sumSeries(stats.counters.openstack.api.$environment.*.$service.*.*.attempted.count))"
    op: "lt"
    threshold: 90
  api_down:
    query: "asPercent(sumSeries(stats.counters.openstack.api.$environment.*.$service.*.*.failed.count), sumSeries(stats.counters.openstack.api.$environment.*.$service.*.*.attempted.count))"
    op: "eq"
    threshold: 100
  api_slow:
    query: "consolidateBy(aggregate(stats.timers.openstack.api.$environment.*.$service.*.*.*.mean, 'average'), 'average')"
    op: "gt"
    threshold: 300


Templates define how the flag metrics are evaluated.

- The "query" parameter defines the query sent to the Graphite time-series database, which stores the metrics collected from EpMon.
  For details on how queries are structured in the Graphite TSDB, refer to the :ref:`Metric databases <sd2_metric_databases>` page.

- The "op" parameter defines the operation used for comparison with the threshold (lt - lower than, gt - greater than, eq - equal to, ...).

- The "threshold" parameter defines the value that the query result is compared against.
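
The ``$environment`` and ``$service`` placeholders in a template query have to be filled
in for each flag metric entry before the query can be sent to Graphite. A minimal Python
sketch of that substitution is shown here. It is an illustration only and does not claim
to match the Metric Processor's own implementation; the helper name ``render_query`` is hypothetical.

.. code:: python

  from string import Template

  # Query copied from the api_slow template shown above.
  API_SLOW_QUERY = (
      "consolidateBy(aggregate("
      "stats.timers.openstack.api.$environment.*.$service.*.*.*.mean, "
      "'average'), 'average')"
  )


  def render_query(query: str, environment: str, service: str) -> str:
      """Fill the $environment and $service placeholders (hypothetical helper)."""
      return Template(query).substitute(environment=environment, service=service)


  # Values taken from the AutoScaling entry in flag_metrics.yaml.
  rendered = render_query(API_SLOW_QUERY, "production_eu-de", "as")
  print(rendered)

Each combination of a service and one of its listed environments yields one concrete Graphite query.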

For example:

The api_slow metric template checks whether the averaged and consolidated latency of all GET queries
for a specific service is greater than 300 milliseconds. If so, the value of the flag metric is 1;
otherwise, the value of the flag metric is 0.
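
The comparison of a query result with its threshold can be sketched in Python as follows.
This is an illustration under stated assumptions, not the Metric Processor's actual code:
the function name ``evaluate_flag`` and the sample values are hypothetical, and only the
three operators named in this document are covered.

.. code:: python

  import operator

  # Mapping from the "op" values named in this document to Python comparisons.
  OPERATORS = {
      "lt": operator.lt,
      "gt": operator.gt,
      "eq": operator.eq,
  }


  def evaluate_flag(value: float, op: str, threshold: float) -> int:
      """Return 1 if the flag is raised, 0 otherwise (hypothetical helper)."""
      return 1 if OPERATORS[op](value, threshold) else 0


  # api_slow: raised when the mean latency exceeds 300 ms.
  slow = evaluate_flag(412.0, "gt", 300)
  fast = evaluate_flag(180.0, "gt", 300)
  # api_success_rate_low: raised when fewer than 90 % of the calls succeed.
  rate_ok = evaluate_flag(97.5, "lt", 90)
  # api_down: raised when 100 % of the attempted calls failed.
  down = evaluate_flag(100.0, "eq", 100)

With these sample values the api_slow flag is raised only for the 412 ms measurement.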

The metric template configuration provides predefined metric queries, but if a service needs a different approach,
a custom metric can be introduced here as well.


Configuration of Health metrics
===============================

Once the flag metrics are defined, the Metric Processor evaluates the health metric based on the conditions defined in health_metrics.yaml:
https://github.com/opentelekomcloud-infra/stackmon-config/blob/main/mp-prod/conf.d/health_metrics.yaml


Example of the DEH health metric definition:

.. code:: yaml

  ## Compute
  ### DEH
  deh:
    service: deh
    component_name: "Dedicated Host"
    category: compute
    metrics:
      - deh.api_down
      - deh.api_slow
      - deh.api_success_rate_low
    expressions:
      - expression: "deh.api_slow || deh.api_success_rate_low"
        weight: 1
      - expression: "deh.api_down"
        weight: 2

The configuration consists of the following attributes:

- service - service name (relation to EpMon)
- component_name - component name (relation to SD catalog)
- category - service category (relation to SD catalog)
- metrics - the flag metrics that are used to evaluate the health metric (relation to flag metrics)
- expressions - expressions over the flag metrics, each with a weight, that determine the resulting health metric value. A value of 1 means a minor issue, 2 means an outage.
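
How the expressions and weights could combine into a single health value is sketched in Python here.
The sketch rests on two assumptions that this page does not confirm: that the health value is the highest
weight among the expressions that evaluate to true (0 if none does), and that ``&&`` is available
next to the ``||`` operator shown in the example. The function names are hypothetical.

.. code:: python

  def evaluate_expression(expression: str, flags: dict[str, int]) -> bool:
      """Evaluate flag names joined by || and && (no parentheses in this sketch).

      && binds tighter than ||; support for && is an assumption.
      """
      return any(
          all(flags[name.strip()] == 1 for name in conjunction.split("&&"))
          for conjunction in expression.split("||")
      )


  def health_value(expressions: list[dict], flags: dict[str, int]) -> int:
      """Return the highest weight of all matching expressions, or 0 (assumed rule)."""
      weights = [
          entry["weight"]
          for entry in expressions
          if evaluate_expression(entry["expression"], flags)
      ]
      return max(weights, default=0)


  # Expressions copied from the DEH definition above.
  DEH_EXPRESSIONS = [
      {"expression": "deh.api_slow || deh.api_success_rate_low", "weight": 1},
      {"expression": "deh.api_down", "weight": 2},
  ]

  healthy = health_value(
      DEH_EXPRESSIONS,
      {"deh.api_down": 0, "deh.api_slow": 0, "deh.api_success_rate_low": 0},
  )
  degraded = health_value(
      DEH_EXPRESSIONS,
      {"deh.api_down": 0, "deh.api_slow": 1, "deh.api_success_rate_low": 0},
  )
  outage = health_value(
      DEH_EXPRESSIONS,
      {"deh.api_down": 1, "deh.api_slow": 0, "deh.api_success_rate_low": 1},
  )

Under these assumptions the three sample flag sets map to the values 0, 1 and 2 listed for the resulting health metric.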