Hasko, Vladimir f114248cfb adding SD2 training content
Reviewed-by: Gode, Sebastian <sebastian.gode@t-systems.com>
Co-authored-by: Hasko, Vladimir <vladimir.hasko@t-systems.com>
Co-committed-by: Hasko, Vladimir <vladimir.hasko@t-systems.com>
2023-10-04 10:07:42 +00:00

Metrics

Status Dashboard distinguishes two types of metrics:

  • Metrics emitted by EpMon. The EpMon plugin internally calls methods of the OpenStack SDK libraries, which in turn generate metrics about each API call they make. This requires special configuration in the clouds.yaml file (exposing metrics to statsd and InfluxDB is currently supported). For details, refer to the configuration documentation of the OpenStack SDK. The following metrics are captured:
    • response HTTP code
    • duration of API call
    • name of API call
    • method of API call
    • service type
  • Metrics emitted by the Metric Processor. Based on the EpMon metrics, the Metric Processor emits flag and health metrics. The following metrics are captured:
    • environment
    • service
    • service type
    • flag metric type
    • resulting value (0, 1, 2)

Custom metrics:

Besides the default flag and health metrics, some services might require a specific approach to aggregating and combining the HTTP query metrics, or might need custom thresholds. For such cases, custom metrics can be introduced in the Metric Processor configuration files: https://github.com/opentelekomcloud-infra/stackmon-config/tree/main/mp-prod/conf.d

More details on how to query metrics from the databases are described on the Metric databases <sd2_metric_databases> page.

Configuration of Flag metrics

Flag metrics are defined by two configuration files:

https://github.com/opentelekomcloud-infra/stackmon-config/blob/main/mp-prod/conf.d/flag_metrics.yaml

https://github.com/opentelekomcloud-infra/stackmon-config/blob/main/mp-prod/conf.d/metric_templates.yaml

Example of the AutoScaling service entry in flag_metrics.yaml:

### AutoScaling
- name: "api_down"
  service: "as"
  template:
    name: "api_down"
  environments:
    - name: "production_eu-de"
    - name: "production_eu-nl"

- name: "api_slow"
  service: "as"
  template:
    name: "api_slow"
  environments:
    - name: "production_eu-de"
    - name: "production_eu-nl"

- name: "api_success_rate_low"
  service: "as"
  template:
    name: "api_success_rate_low"
  environments:
    - name: "production_eu-de"
    - name: "production_eu-nl"

A set of flag metrics is defined for each service. The Metric Processor uses these metrics to determine the health metric of the respective service. A flag metric is described by its name, a service attribute (the relation to the EpMon service definition), a template reference (which query and which threshold are used for this metric) and an environments entry (the list of environments where this metric applies).
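The relation between a flag metric entry and its environments can be sketched in Python. This is only an illustration, not the Metric Processor implementation: the `flag_metrics` list is a hand-written in-memory copy of two of the AutoScaling entries above, and `expand_flag_metrics` is a hypothetical helper name.

```python
# Hand-written copy of two flag_metrics.yaml entries (assumed in-memory form).
flag_metrics = [
    {
        "name": "api_down",
        "service": "as",
        "template": {"name": "api_down"},
        "environments": [{"name": "production_eu-de"}, {"name": "production_eu-nl"}],
    },
    {
        "name": "api_slow",
        "service": "as",
        "template": {"name": "api_slow"},
        "environments": [{"name": "production_eu-de"}, {"name": "production_eu-nl"}],
    },
]


def expand_flag_metrics(entries):
    """Return one (environment, service, metric, template) tuple per environment."""
    expanded = []
    for entry in entries:
        for env in entry["environments"]:
            expanded.append(
                (env["name"], entry["service"], entry["name"], entry["template"]["name"])
            )
    return expanded


for row in expand_flag_metrics(flag_metrics):
    print(row)
```

Each entry therefore yields one flag metric per listed environment, all sharing the same template.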

Example of a metric template definition in metric_templates.yaml:

api_success_rate_low:
  query: "asPercent(sumSeries(stats.counters.openstack.api.$environment.*.$service.*.*.{2*,3*,404}.count), sumSeries(stats.counters.openstack.api.$environment.*.$service.*.*.attempted.count))"
  op: "lt"
  threshold: 90
api_down:
  query: "asPercent(sumSeries(stats.counters.openstack.api.$environment.*.$service.*.*.failed.count), sumSeries(stats.counters.openstack.api.$environment.*.$service.*.*.attempted.count))"
  op: "eq"
  threshold: 100
api_slow:
  query: "consolidateBy(aggregate(stats.timers.openstack.api.$environment.*.$service.*.*.*.mean, 'average'), 'average')"
  op: "gt"
  threshold: 300

Templates define how the flag metrics are evaluated.

  • The "query" parameter defines the query sent to the Graphite time-series database, which stores the metrics collected from EpMon. For details on how a query is structured in the Graphite TSDB, refer to the Metric databases <sd2_metric_databases> page.
  • The "op" parameter defines the operation used to compare the query result with the threshold (lt - lower than, gt - greater than, eq - equal to, ...).
  • The "threshold" parameter defines the value that the query result is compared with.
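The `$environment` and `$service` placeholders in a template query have to be filled in for each flag metric. How the Metric Processor does this internally is not described here; the following is a minimal sketch that assumes plain placeholder substitution, using Python's standard `string.Template`. The helper name `render_query` is hypothetical.

```python
from string import Template

# Query text copied from the api_down template shown above.
API_DOWN_QUERY = (
    "asPercent("
    "sumSeries(stats.counters.openstack.api.$environment.*.$service.*.*.failed.count), "
    "sumSeries(stats.counters.openstack.api.$environment.*.$service.*.*.attempted.count))"
)


def render_query(query, environment, service):
    """Fill the $environment and $service placeholders (assumed substitution scheme)."""
    return Template(query).substitute(environment=environment, service=service)


rendered = render_query(API_DOWN_QUERY, "production_eu-de", "as")
print(rendered)
```

The rendered string is what would be sent to Graphite for the AutoScaling service in the production_eu-de environment under this assumption.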

For example:

The api_slow metric template defines a query that checks whether the averaged and consolidated latency of all GET queries for a specific service is greater than 300 milliseconds. If so, the value of the flag metric is 1; otherwise, it is 0.

The metric template configuration provides pre-defined metric queries, but if a service needs a different approach, a custom metric can be introduced here as well.
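The comparison described above can be sketched as follows. This is an illustration only: `evaluate_flag` and the `OPERATORS` table are hypothetical names, only the three operations listed in this document are covered, and the latency values are made up for the example.

```python
import operator

# Map the "op" values named in this document to comparison functions.
OPERATORS = {"lt": operator.lt, "gt": operator.gt, "eq": operator.eq}


def evaluate_flag(value, op, threshold):
    """Return 1 if the query result satisfies the comparison, otherwise 0."""
    return 1 if OPERATORS[op](value, threshold) else 0


# op and threshold copied from the api_slow template.
api_slow = {"op": "gt", "threshold": 300}

print(evaluate_flag(450.0, api_slow["op"], api_slow["threshold"]))  # 1: slower than 300 ms
print(evaluate_flag(120.0, api_slow["op"], api_slow["threshold"]))  # 0: within the threshold
```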

Configuration of Health metrics

Once the flag metrics are defined, the Metric Processor evaluates the health metric based on the conditions defined in health_metrics.yaml: https://github.com/opentelekomcloud-infra/stackmon-config/blob/main/mp-prod/conf.d/health_metrics.yaml

Example of DEH health metric definition:

## Compute
### DEH
deh:
  service: deh
  component_name: "Dedicated Host"
  category: compute
  metrics:
    - deh.api_down
    - deh.api_slow
    - deh.api_success_rate_low
  expressions:
    - expression: "deh.api_slow || deh.api_success_rate_low"
      weight: 1
    - expression: "deh.api_down"
      weight: 2

The configuration consists of the following attributes:

  • service - service name (relation to EpMon)
  • component_name - component name (relation to the SD catalog)
  • category - service category (relation to the SD catalog)
  • metrics - the flag metrics that are used to evaluate the health metric (relation to flag metrics)
  • expressions - expressions over the flag metrics, each with a weight, that define the resulting health metric value: 1 means a minor issue, 2 means an outage.
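How the expressions and weights could combine into a health value is sketched below. This is not the Metric Processor code. It assumes that the health value is the highest weight among the expressions that evaluate to true (0 if none does), and it only understands `||` and `&&` without parentheses. The function names are hypothetical.

```python
# Expressions copied from the DEH health metric definition above.
DEH_EXPRESSIONS = [
    {"expression": "deh.api_slow || deh.api_success_rate_low", "weight": 1},
    {"expression": "deh.api_down", "weight": 2},
]


def evaluate_expression(expression, flags):
    """Evaluate a flat boolean expression; "&&" binds tighter than "||"."""
    for alternative in expression.split("||"):
        operands = [name.strip() for name in alternative.split("&&")]
        # A flag that is missing from the input is treated as 0 (assumption).
        if all(flags.get(name, 0) for name in operands):
            return True
    return False


def health_value(expressions, flags):
    """Return the highest weight among matching expressions (assumed rule)."""
    matched = [e["weight"] for e in expressions if evaluate_expression(e["expression"], flags)]
    return max(matched, default=0)


print(health_value(DEH_EXPRESSIONS, {"deh.api_slow": 1}))  # 1: minor issue
print(health_value(DEH_EXPRESSIONS, {"deh.api_down": 1, "deh.api_success_rate_low": 1}))  # 2: outage
```

With all flag metrics at 0, the sketch returns 0, which corresponds to a healthy service.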