.. _sd2_metrics_definition:

=======
Metrics
=======

The Status Dashboard distinguishes two types of metrics:

- Metrics emitted by EpMon
- Metrics emitted by the Metric Processor

- The EpMon plugin internally calls the **OpenStack SDK libraries**, which in
  turn generate metrics about each API call they make. This requires special
  configuration in the clouds.yaml file (currently, exposing metrics to
  statsd and InfluxDB is supported). For details, refer to the `config
  documentation <https://docs.openstack.org/openstacksdk/latest/user/guides/stats.html>`_
  of the OpenStack SDK. The following metrics are captured:

  - response HTTP code
  - duration of API call
  - name of API call
  - method of API call
  - service type

- Based on the EpMon metrics, the Metric Processor **emits flag and health
  metrics**. The following metrics are captured:

  - environment
  - service
  - service type
  - flag metric type
  - resulting value (0, 1, 2)

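
The clouds.yaml configuration mentioned for EpMon might look roughly like the
following sketch for statsd. It follows the layout described in the linked
OpenStack SDK statistics guide; the host, port, cloud name, and auth URL are
placeholder assumptions, not values taken from the Status Dashboard deployment.

.. code:: yaml

   # Sketch only: verify the key names against the OpenStack SDK statistics guide.
   metrics:
     statsd:
       host: 127.0.0.1        # placeholder statsd host
       port: 8125             # placeholder statsd port
   clouds:
     example-cloud:           # hypothetical cloud name
       auth:
         auth_url: https://iam.example.com/v3   # placeholder endpoint

The InfluxDB variant is configured in the same top-level metrics section; the
exact keys are listed in the SDK guide.
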
Custom metrics:

Besides the default flag and health metrics, some services might require a
specific approach to aggregating and combining the HTTP query metrics, or
custom thresholds. For such cases, custom metrics can be introduced in the
Metric Processor configuration files:
https://github.com/opentelekomcloud-infra/stackmon-config/tree/main/mp-prod/conf.d

More details on how to query metrics from the databases are described on the
:ref:`Metric databases <sd2_metric_databases>` page.

Configuration of Flag metrics
=============================

Flag metrics are defined by two configuration files:

- https://github.com/opentelekomcloud-infra/stackmon-config/blob/main/mp-prod/conf.d/flag_metrics.yaml
- https://github.com/opentelekomcloud-infra/stackmon-config/blob/main/mp-prod/conf.d/metric_templates.yaml

Example of the AutoScaling service entry in flag_metrics.yaml:

.. code:: yaml

   ### AutoScaling
   - name: "api_down"
     service: "as"
     template:
       name: "api_down"
     environments:
       - name: "production_eu-de"
       - name: "production_eu-nl"

   - name: "api_slow"
     service: "as"
     template:
       name: "api_slow"
     environments:
       - name: "production_eu-de"
       - name: "production_eu-nl"

   - name: "api_success_rate_low"
     service: "as"
     template:
       name: "api_success_rate_low"
     environments:
       - name: "production_eu-de"
       - name: "production_eu-nl"

For each service, a set of flag metrics is defined. The Metric Processor uses
these metrics to determine the health metric of the respective service.
A flag metric is represented by its name, a service attribute (relation to the
EpMon service definition), a template reference (which query and threshold are
defined for this metric), and an environments entry (the list of environments
where this metric is applicable).

Example of a template metric definition in metric_templates.yaml:

.. code:: yaml

   api_success_rate_low:
     query: "asPercent(sumSeries(stats.counters.openstack.api.$environment.*.$service.*.*.{2*,3*,404}.count), sumSeries(stats.counters.openstack.api.$environment.*.$service.*.*.attempted.count))"
     op: "lt"
     threshold: 90
   api_down:
     query: "asPercent(sumSeries(stats.counters.openstack.api.$environment.*.$service.*.*.failed.count), sumSeries(stats.counters.openstack.api.$environment.*.$service.*.*.attempted.count))"
     op: "eq"
     threshold: 100
   api_slow:
     query: "consolidateBy(aggregate(stats.timers.openstack.api.$environment.*.$service.*.*.*.mean, 'average'), 'average')"
     op: "gt"
     threshold: 300

Templates define how the flag metrics are evaluated.

- The "query" parameter defines the query to the Graphite time-series database
  (TSDB), which stores the metrics collected from EpMon. For details on how
  the query is structured in the Graphite TSDB, refer to the
  :ref:`Metric databases <sd2_metric_databases>` page.

- The "op" parameter defines the operation used to compare the query result
  with the threshold (lt - lower than, gt - greater than, eq - equal to, ...).

- The "threshold" parameter defines the value that the query result is
  compared against.

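
To illustrate how a template query relates to a flag metric entry, the sketch
below shows the api_down query with the $environment and $service placeholders
filled in for the AutoScaling service. That the Metric Processor substitutes
these placeholders from the flag metric's environments and service attributes
is an assumption based on the naming. The keys template_query and
expanded_query are hypothetical labels used only for this illustration; they
are not part of any configuration file.

.. code:: yaml

   # Illustration only, not a configuration file.
   template_query: "asPercent(sumSeries(stats.counters.openstack.api.$environment.*.$service.*.*.failed.count), sumSeries(stats.counters.openstack.api.$environment.*.$service.*.*.attempted.count))"
   # Assumed result for service "as" in environment "production_eu-de":
   expanded_query: "asPercent(sumSeries(stats.counters.openstack.api.production_eu-de.*.as.*.*.failed.count), sumSeries(stats.counters.openstack.api.production_eu-de.*.as.*.*.attempted.count))"

With op "eq" and threshold 100, this flag would be raised only when all
attempted calls in the queried window are counted as failed.
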
For example:

The api_slow metric template queries whether the averaged and consolidated
latency of all GET queries for a specific service is greater than 300
milliseconds. If yes, the value of the flag metric is 1; if no, the value of
the flag metric is 0.

The metric template configuration introduces pre-defined metric queries, but
if a service needs a different approach, a custom metric can be introduced
here as well.

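
A custom metric of this kind could be sketched as follows. The template name
api_slow_strict and its threshold of 150 are hypothetical, chosen only to
illustrate a service-specific latency limit; the query is copied unchanged
from the api_slow template. The two YAML documents stand for additions to the
two configuration files.

.. code:: yaml

   # metric_templates.yaml (hypothetical addition)
   api_slow_strict:
     query: "consolidateBy(aggregate(stats.timers.openstack.api.$environment.*.$service.*.*.*.mean, 'average'), 'average')"
     op: "gt"
     threshold: 150   # hypothetical, stricter than the default 300
   ---
   # flag_metrics.yaml (hypothetical entry referencing the template above)
   - name: "api_slow_strict"
     service: "as"
     template:
       name: "api_slow_strict"
     environments:
       - name: "production_eu-de"

Keeping the custom threshold in its own template leaves the default api_slow
template untouched for all other services.
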
Configuration of Health metrics
===============================

Once the flag metrics are defined, the Metric Processor evaluates the health
metric based on the conditions defined in health_metrics.yaml:
https://github.com/opentelekomcloud-infra/stackmon-config/blob/main/mp-prod/conf.d/health_metrics.yaml

Example of DEH health metric definition:

.. code:: yaml

   ## Compute
   ### DEH
   deh:
     service: deh
     component_name: "Dedicated Host"
     category: compute
     metrics:
       - deh.api_down
       - deh.api_slow
       - deh.api_success_rate_low
     expressions:
       - expression: "deh.api_slow || deh.api_success_rate_low"
         weight: 1
       - expression: "deh.api_down"
         weight: 2

The configuration consists of the following attributes:

- service - service name (relation to EpMon)
- component_name - component name (relation to SD catalog)
- category - service category (relation to SD catalog)
- metrics - the flag metrics that apply to the health metric evaluation (relation to flag metrics)
- expressions - expressions over the flag metrics that determine the resulting health metric value. The weight of a matching expression gives the value: 1 means a minor issue, 2 means an outage.

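
By analogy with the DEH entry, a health metric built from the AutoScaling flag
metrics might be sketched as below. This entry is hypothetical: the
component_name and category values are assumptions and would have to match the
SD catalog.

.. code:: yaml

   # Hypothetical entry, modeled on the DEH example.
   as:
     service: as
     component_name: "Auto Scaling"   # assumed catalog component name
     category: compute                # assumed catalog category
     metrics:
       - as.api_down
       - as.api_slow
       - as.api_success_rate_low
     expressions:
       - expression: "as.api_slow || as.api_success_rate_low"
         weight: 1
       - expression: "as.api_down"
         weight: 2

Under this sketch, a slow API or a low success rate would be reported as a
minor issue (1), and an API with only failed calls as an outage (2).
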