Nils Magnus 6e2da0d05c review of training material
Reviewed-by: Hasko, Vladimir <vladimir.hasko@t-systems.com>
Co-authored-by: Nils Magnus <magnus@linuxtag.org>
Co-committed-by: Nils Magnus <magnus@linuxtag.org>
2023-10-12 18:02:41 +00:00


====================
Dashboard Management
====================
As explained on the previous pages, the metrics produced by the configured
monitor plugins (mainly EpMon, but possibly other plugins as well) are
first stored in a Graphite time series database before they are further
processed into flags and semaphores for the actual public dashboard.
However, Service Engineers and Service Managers sometimes benefit from a
deeper inspection of this time series data for debugging purposes.
For this purpose, a Grafana frontend can be used to visualize the data and
drill down into it. The entry point to a set of predefined dashboards is:

https://dashboard.tsi-dev.otc-service.com/dashboards/f/CloudMon/cloudmon

Authentication to this dashboard is only available to OTC staff members.
It is managed by Keycloak, which in turn uses the OTC LDAP directory.

The dashboards are grouped by the type of service:

- The **Squad Flag and Health** dashboard provides a high-level overview
  of the service health and the flag metric status for each service of a
  squad.
- The **Cloud Service Statistics** dashboard monitors the health of each
  endpoint URL listed in an EpMon configuration entry.
- Dashboards can be replicated and customized for individual squad needs.

The Cloud Service Statistics dashboards honor the ``Environment`` (target
monitored platform) and ``Zone`` (monitoring source location) variables
at the top of each dashboard, so the views can be adjusted based on the
chosen values.

All Squad Flag and Health dashboards support the ``Environment`` (target
monitored platform) variable at the top of each dashboard.

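Grafana also accepts values for dashboard variables as ``var-<name>`` URL
query parameters, which makes it possible to share a link to a preselected
view. The following is a minimal sketch of building such a link. The helper
name and the values ``production`` and ``eu-de`` are placeholders chosen for
illustration; the values actually offered by this installation may differ.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_dashboard_variables(url, **variables):
    """Return url with each variable appended as a var-<name> query parameter."""
    parts = urlsplit(url)
    # Keep the existing query parameters (for example orgId) in place.
    query = parse_qsl(parts.query, keep_blank_values=True)
    query += [(f"var-{name}", value) for name, value in variables.items()]
    return urlunsplit(parts._replace(query=urlencode(query)))


url = with_dashboard_variables(
    "https://dashboard.tsi-dev.otc-service.com/d/"
    "b4560ed6-95f0-45c0-904c-6ff9f8a491e8/sfs-service-statistics?orgId=1",
    Environment="production",  # placeholder value (assumption)
    Zone="eu-de",  # placeholder value (assumption)
)
print(url)
```

Opening the printed link should show the dashboard with both variables
preselected, provided the values exist in the variable drop-downs.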
Squad Flag and Health Dashboard
===============================
The dashboard provides deeper insight into the metrics generated by the
Metric Processor. Flag panels show whether a service has exceeded the
threshold of a predefined flag metric type. Health panels show the
resulting service health status based on the evaluated flag metrics.

The resulting flag values are visualized in state timeline panels with the
following values:

- 0 - flag metric is not breaching the defined threshold.
- 1 - flag metric is breaching the defined threshold.

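The 0/1 encoding above can be illustrated with a small sketch that turns a
series of raw samples into the kind of timeline such a panel displays. This
is not the Metric Processor implementation: the sample data, the threshold of
1000 ms, and the strict greater-than comparison are assumptions made for the
example.

```python
def flag_timeline(samples, threshold):
    """Map (timestamp, value) samples to (timestamp, flag) pairs.

    A flag is 1 while the value breaches the threshold and 0 otherwise.
    Treating "breach" as strictly greater than is an assumption.
    """
    return [(ts, 1 if value > threshold else 0) for ts, value in samples]


# Hypothetical API response times in milliseconds, one sample per minute.
samples = [(0, 120.0), (60, 480.0), (120, 1350.0), (180, 300.0)]
timeline = flag_timeline(samples, threshold=1000.0)
print(timeline)
```

Only the third sample exceeds the assumed threshold, so it is the only one
that yields a flag value of 1.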
The resulting health values are mapped and visualized in state timeline
panels with the following values:

- 0 - Service operates normally.
- 1 - Service has a minor issue, as indicated by the raised flag metric(s).
- 2 - Service has an outage, as indicated by the raised flag metric(s).

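One way to picture how raised flags lead to a health value is a rule table
that assigns a health level to groups of flags and reports the highest level
that applies. The sketch below follows that idea. The flag names, the rule
table, and the "highest level wins" policy are assumptions for illustration
and are not taken from the actual Metric Processor configuration.

```python
# Hypothetical rules: if any flag in the set is raised, the level applies.
HEALTH_RULES = [
    ({"api_down"}, 2),  # outage
    ({"api_slow", "api_success_rate_low"}, 1),  # minor issue
]


def service_health(flags, rules=HEALTH_RULES):
    """Return 0, 1 or 2 for a mapping of flag name to flag value (0 or 1)."""
    raised = {name for name, value in flags.items() if value == 1}
    levels = [level for names, level in rules if names & raised]
    # No matching rule means the service operates normally.
    return max(levels, default=0)


print(service_health({"api_down": 0, "api_slow": 1}))
```

With only the hypothetical ``api_slow`` flag raised, the sketch reports a
minor issue; raising ``api_down`` as well would report an outage.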
Example at https://dashboard.tsi-dev.otc-service.com/d/s75qyOU4z/compute-flags?orgId=1

.. image:: training_images/flag_and_health_dashboard.png

Cloud Service Statistics Dashboard
==================================
The Cloud Service Statistics dashboards use metrics from GET query
requests towards the OTC platform (:ref:`EpMon Overview <sd2_epmon_overview>`)
and visualize them as:

- API call duration per URL query.
- API call duration (aggregated).
- API call response codes.

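The three views listed above correspond to simple aggregations over
individual request samples. The sketch below computes them in plain Python to
show what each panel summarizes. The sample records, the URL paths, and the
use of the arithmetic mean are assumptions; the dashboards themselves derive
these values from the stored time series.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical samples: (queried URL path, duration in ms, HTTP status code)
samples = [
    ("/v2/shares", 120.0, 200),
    ("/v2/shares", 180.0, 200),
    ("/v2/share-networks", 90.0, 200),
    ("/v2/share-networks", 300.0, 503),
]


def summarize(samples):
    """Aggregate request samples into the three views of the dashboard."""
    per_url = defaultdict(list)
    for url, duration_ms, _status in samples:
        per_url[url].append(duration_ms)
    return {
        # Mean duration for each queried URL.
        "duration_per_url": {url: mean(values) for url, values in per_url.items()},
        # Mean duration across all queries.
        "duration_overall": mean(duration for _url, duration, _status in samples),
        # Number of responses per HTTP status code.
        "response_codes": Counter(status for _url, _duration, status in samples),
    }


summary = summarize(samples)
print(summary)
```

A single slow or failing endpoint stands out in the per-URL and response code
views even when the aggregated duration still looks acceptable.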
Example at https://dashboard.tsi-dev.otc-service.com/d/b4560ed6-95f0-45c0-904c-6ff9f8a491e8/sfs-service-statistics?orgId=1&refresh=10s

.. image:: training_images/cloud_service_statistics.png

Custom Dashboards
=================
The dashboards described above are predefined and read-only. Further
customization is currently possible via system-config in GitHub:

https://github.com/stackmon/apimon-tests/tree/main/dashboards/grafana

The predefined simplified dashboard panel in YAML syntax is defined in
the Stackmon GitHub repository:

https://github.com/stackmon/apimon-tests/tree/main/dashboards

Dashboards can also be customized with the copy/save function directly in
Grafana. The whole dashboard can be saved under a new name and then edited
without any restrictions.

This approach is suitable for proofs of concept, temporary solutions, and
investigations, but it should not be used as a permanent solution: customized
dashboards that are not properly stored in the GitHub repositories might be
permanently deleted if the dashboard service is fully reinstalled.
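
To keep a copy of such a temporary dashboard before it can be lost, its JSON
model can be fetched through the Grafana HTTP API endpoint
``/api/dashboards/uid/<uid>`` and saved to a file. The following is a minimal
sketch under several assumptions: that token-based API access is enabled for
this Grafana instance, and that the base URL, UID, token, and file name are
supplied by the user. The function names are invented for this example. The
resulting JSON is only a backup and is not in the simplified YAML format used
by the repository.

```python
import json
import urllib.request


def build_export_request(base_url, uid, token):
    """Build a GET request for the dashboard with the given UID."""
    url = f"{base_url.rstrip('/')}/api/dashboards/uid/{uid}"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def dashboard_to_json(payload):
    """Extract the dashboard model from the API response and serialize it.

    The numeric id is dropped because it is specific to one Grafana instance.
    Sorted keys keep diffs small when the file is tracked in Git.
    """
    dashboard = dict(payload["dashboard"])
    dashboard.pop("id", None)
    return json.dumps(dashboard, indent=2, sort_keys=True) + "\n"


def export_dashboard(base_url, uid, token, path):
    """Download one dashboard and write its JSON model to path."""
    request = build_export_request(base_url, uid, token)
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(dashboard_to_json(payload))
```

A call such as ``export_dashboard(base_url, uid, token, "my-copy.json")``
would then write the file, which can be committed alongside other work until
the dashboard is properly added to the repository.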