Nils Magnus 6e2da0d05c review of training material
Reviewed-by: Hasko, Vladimir <vladimir.hasko@t-systems.com>
Co-authored-by: Nils Magnus <magnus@linuxtag.org>
Co-committed-by: Nils Magnus <magnus@linuxtag.org>
2023-10-12 18:02:41 +00:00

Dashboard Management

As explained on the previous pages, the metrics produced by the configured monitoring plugins (mainly EpMon, but possibly other plugins as well) are first stored in a Graphite time series database before they are further processed into flags and semaphores for the actual public dashboard.

However, Service Engineers and Service Managers sometimes benefit from a deeper inspection of this time series data for debugging purposes. A Grafana frontend can therefore be used to visualize the data and drill down into it. The entry point to a set of predefined dashboards is:

https://dashboard.tsi-dev.otc-service.com/dashboards/f/CloudMon/cloudmon

Access to this dashboard is restricted to OTC staff members. Authentication is managed by Keycloak, which in turn uses the OTC LDAP directory.

The dashboards are grouped by the type of service:

  • The Squad Flag and Health dashboard provides a high-level overview of the service health and the flag metric status for each service of a squad.
  • The Cloud Service Statistics dashboard monitors the health of each endpoint URL listed in an EpMon configuration entry.
  • Dashboards can be replicated and customized for individual squad needs.

The Cloud Service Statistics dashboards honor the Environment (target monitored platform) and Zone (monitoring source location) variables at the top of each dashboard, so the views can be adjusted according to the selected values.

All Squad Flag and Health dashboards support the Environment (target monitored platform) variable at the top of each dashboard.
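
Grafana generally accepts dashboard variables as var-<name> query parameters, so a view with preselected values can be shared as a link. The following is a minimal sketch only: the variable names environment and zone and both example values are assumptions, and the names actually used on these dashboards may differ.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_dashboard_variables(dashboard_url, **variables):
    """Return dashboard_url with Grafana-style var-<name>=<value> parameters set."""
    parts = urlsplit(dashboard_url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    # Drop any existing values for the variables that are about to be set.
    names = {f"var-{name}" for name in variables}
    query = [(key, value) for key, value in query if key not in names]
    query += [(f"var-{name}", value) for name, value in variables.items()]
    return urlunsplit(parts._replace(query=urlencode(query)))


url = with_dashboard_variables(
    "https://dashboard.tsi-dev.otc-service.com/d/s75qyOU4z/compute-flags?orgId=1",
    environment="production",  # hypothetical value
    zone="eu-de",              # hypothetical value
)
print(url)
```

The existing orgId parameter is preserved, and the variables are appended after it.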

Squad Flag and Health Dashboard

The dashboard provides deeper insight into the metrics generated by the Metric Processor. Flag panels show whether a service has exceeded the threshold of a predefined flag metric type. Health panels show the resulting service health status, which is derived from the evaluated flag metrics.

The resulting flag values are visualized in state timeline panels with the following values:

  • 0 - flag metric is not breaching the defined threshold.
  • 1 - flag metric is breaching the defined threshold.

The resulting health values are visualized and mapped in state timeline panels with the following values:

  • 0 - Service operates normally.
  • 1 - Service has a minor issue, caused by the defined flag metric(s) breaching their thresholds.
  • 2 - Service has an outage, caused by the defined flag metric(s) breaching their thresholds.
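
When the raw series is inspected outside of Grafana, the numeric values listed above can be translated back into readable states. The sketch below is illustrative only: it assumes the series arrives as Graphite-style [value, timestamp] pairs in which gaps appear as None, and the sample datapoints are made up.

```python
# Value mappings taken from the lists above.
FLAG_STATES = {0: "threshold not breached", 1: "threshold breached"}
HEALTH_STATES = {0: "normal", 1: "minor issue", 2: "outage"}


def describe(value, states):
    """Translate a numeric flag or health value into a readable state."""
    if value is None:
        return "no data"
    return states.get(int(value), f"unknown ({value})")


def worst_health(datapoints):
    """Return the highest health value in the series, ignoring gaps (None)."""
    values = [int(value) for value, _timestamp in datapoints if value is not None]
    return max(values) if values else None


# Hypothetical sample series: [value, unix timestamp]
series = [[0, 1697130000], [None, 1697130060], [1, 1697130120], [0, 1697130180]]
print(describe(worst_health(series), HEALTH_STATES))
```

Taking the maximum treats the most severe state in the window as the summary, which is a design choice of this sketch and not necessarily how the Metric Processor evaluates health.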

Example at https://dashboard.tsi-dev.otc-service.com/d/s75qyOU4z/compute-flags?orgId=1

image

Cloud Service Statistics Dashboard

The Cloud Service Statistics dashboards use metrics from GET query requests towards the OTC platform (see EpMon Overview) and visualize them as:

  • API call duration per URL query.
  • API call duration (aggregated).
  • API call response codes.
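
The three views above can be thought of as three summaries of the same request records. The following sketch shows the idea on made-up data: the URL paths, durations, and status codes are hypothetical, and the mean is used as the aggregation only for illustration, since the aggregation actually applied by the dashboards is not specified here.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical samples: (queried URL path, duration in ms, HTTP status code)
samples = [
    ("/example/shares", 120.0, 200),
    ("/example/shares", 180.0, 200),
    ("/example/networks", 90.0, 200),
    ("/example/networks", 310.0, 503),
]


def summarize(records):
    """Compute per-URL duration, overall duration, and response code counts."""
    per_url = defaultdict(list)
    for url, duration, _code in records:
        per_url[url].append(duration)
    return {
        "duration_per_url": {url: mean(values) for url, values in per_url.items()},
        "duration_aggregated": mean(duration for _url, duration, _code in records),
        "response_codes": Counter(code for _url, _duration, code in records),
    }


summary = summarize(samples)
print(summary)
```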

Example at https://dashboard.tsi-dev.otc-service.com/d/b4560ed6-95f0-45c0-904c-6ff9f8a491e8/sfs-service-statistics?orgId=1&refresh=10s

image

Custom Dashboards

The dashboards described above are predefined and read-only. Further customization is currently possible via system-config in GitHub:

https://github.com/stackmon/apimon-tests/tree/main/dashboards/grafana

A predefined, simplified dashboard panel in YAML syntax is defined in the Stackmon GitHub repository:

https://github.com/stackmon/apimon-tests/tree/main/dashboards

Dashboards can also be customized directly in Grafana by using the copy/save function. The whole dashboard can be saved under a new name and then edited without any restrictions.

This approach is suitable for proofs of concept, temporary solutions, and investigations, but it should not be used as a permanent solution: customized dashboards that are not properly stored in GitHub repositories might be permanently deleted if the dashboard service is fully reinstalled.
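
To avoid losing a dashboard that was customized directly in Grafana, its JSON model can be exported and kept under version control. The sketch below uses Grafana's HTTP API endpoint /api/dashboards/uid/<uid>. It rests on assumptions: that an API token is available on this Keycloak-protected instance, and that a raw JSON export is an acceptable backup. The repositories mentioned above may expect a different format, such as the simplified YAML syntax. The function names and the output path are hypothetical.

```python
import json
import urllib.request


def build_export_request(base_url, uid, token):
    """Prepare a GET request for Grafana's /api/dashboards/uid/<uid> endpoint."""
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/api/dashboards/uid/{uid}",
        headers={"Authorization": f"Bearer {token}", "Accept": "application/json"},
    )


def dashboard_to_text(payload):
    """Extract the dashboard model from the API response and serialize it stably."""
    model = dict(payload["dashboard"])
    # The numeric id is specific to one Grafana instance; the uid identifies the dashboard.
    model.pop("id", None)
    return json.dumps(model, indent=2, sort_keys=True) + "\n"


def export_dashboard(base_url, uid, token, path):
    """Download one dashboard and write its model to path (performs a network call)."""
    with urllib.request.urlopen(build_export_request(base_url, uid, token)) as response:
        payload = json.load(response)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(dashboard_to_text(payload))
```

Sorting the keys keeps the exported file stable between runs, so that version control diffs show only real changes to the dashboard.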