=====================
Dashboards management
=====================

https://dashboard.tsi-dev.otc-service.com

Authentication is centrally managed by OTC LDAP.

The ApiMon dashboards are organized by purpose:

- The “OTC KPI” dashboard provides a high-level overview of OTC stability and
  reliability for management.
- The “Endpoint monitoring” dashboard monitors the health of every endpoint URL
  listed in the endpoint services catalogue.
- The “Respective service statistics” dashboards provide a more detailed
  overview of each service.
- The “24/7 Mission Control” dashboard is used by the 24/7 squad for daily
  monitoring and for addressing alerts.
- Dashboards can be replicated or customized for individual squad needs.

All dashboards support the Environment (target monitored platform) and Zone
(monitoring source location) variables at the top of each dashboard, so the
views can be adjusted based on the chosen values.

.. image:: training_images/dashboards.png

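
Grafana also accepts template variable values as ``var-<name>`` query
parameters, so a pre-filtered view can be shared as a link. The following is a
minimal sketch, assuming the two variables are named ``environment`` and
``zone``; these names and the example values are assumptions and should be
checked against the variable names shown on the actual dashboard.

.. code-block:: python

   from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

   BASE_URL = "https://dashboard.tsi-dev.otc-service.com/d/APImonKPI/otc-kpi?orgId=1"


   def dashboard_url(base_url, **variables):
       """Return base_url with each variable appended as a var-<name> parameter."""
       parts = urlsplit(base_url)
       # Keep the existing query parameters (for example orgId) in place.
       query = parse_qsl(parts.query)
       query += [(f"var-{name}", value) for name, value in variables.items()]
       return urlunsplit(parts._replace(query=urlencode(query)))


   # Hypothetical variable names and values, for illustration only.
   print(dashboard_url(BASE_URL, environment="production", zone="eu-de"))
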
OTC KPI Dashboard
=================

The OTC KPI dashboard was requested by management to provide SLA-like views of
services, including:

- Global SLI (Service Level Indicator) views of API availability, latency, and
  API errors
- Global SLO (Service Level Objective) views
- Service-based SLI views of availability, success rate, error counts, and
  latencies
- Customer service views for specific cases such as OS boot time duration,
  server provisioning failures, volume backup duration, etc.

https://dashboard.tsi-dev.otc-service.com/d/APImonKPI/otc-kpi?orgId=1

These views provide the immediate overall status as well as the status of each
specific service.

.. image:: training_images/kpi_dashboard.png

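
To make the SLI and SLO terms concrete, the sketch below computes an
availability SLI as the share of successful API calls and compares it with an
SLO target. The call counts and the 99.9 % target are assumptions chosen for
illustration, not values taken from the dashboard, and the queries behind the
actual panels may be defined differently.

.. code-block:: python

   def availability_sli(successful_calls, total_calls):
       """Availability SLI: percentage of API calls that succeeded."""
       if total_calls == 0:
           raise ValueError("no calls recorded in the evaluation window")
       return 100.0 * successful_calls / total_calls


   def meets_slo(sli_percent, slo_target_percent):
       """An SLO is met when the measured SLI reaches the agreed target."""
       return sli_percent >= slo_target_percent


   # Hypothetical sample: 9985 of 10000 calls succeeded, target is 99.9 %.
   sli = availability_sli(successful_calls=9985, total_calls=10000)
   print(f"SLI = {sli:.2f} %, SLO met: {meets_slo(sli, 99.9)}")
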
24/7 Mission control dashboards
===============================

The 24/7 Mission Control squad uses CloudMon, ApiMon, and EpMon metrics and
presents them on its own customized dashboards, which are tailored to its
requirements.

https://dashboard.tsi-dev.otc-service.com/d/eBQoZU0nk/overview?orgId=1&refresh=1m

.. image:: training_images/24_7_dashboard.jpg

Endpoint Monitoring Dashboard
=============================

The Endpoint Monitoring dashboard uses metrics from GET requests sent to the
OTC platform (:ref:`EpMon Overview <epmon_overview>`) and visualizes them in:

- General endpoint availability dashboard
- Endpoint response times dashboard
- No Response dashboard
- Error count dashboard

https://dashboard.tsi-dev.otc-service.com/d/APImonEPmon/endpoint-monitoring?orgId=1

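
The four views above can be read as simple aggregations over the probe
results. As a rough sketch, assume each probe is recorded as a status code
(``None`` when no response arrived) plus a response time, and that any 5xx
status counts as an error. The ``Probe`` record, the sample URLs, and the error
rule are assumptions of this example, not the EpMon data model.

.. code-block:: python

   from dataclasses import dataclass
   from typing import Optional


   @dataclass
   class Probe:
       url: str
       status: Optional[int]        # None means no response (timeout, refused)
       elapsed_ms: Optional[float]  # None when there was no response


   def summarize(probes):
       """Aggregate probe results into the four dashboard-style figures."""
       answered = [p for p in probes if p.status is not None]
       errors = [p for p in answered if p.status >= 500]
       ok = len(answered) - len(errors)
       return {
           "availability_percent": 100.0 * ok / len(probes) if probes else 0.0,
           "avg_response_ms": (
               sum(p.elapsed_ms for p in answered) / len(answered)
               if answered else None
           ),
           "no_response": len(probes) - len(answered),
           "errors": len(errors),
       }


   # Hypothetical probe results for illustration.
   sample = [
       Probe("https://example.invalid/compute", 200, 120.0),
       Probe("https://example.invalid/network", 200, 80.0),
       Probe("https://example.invalid/volume", 503, 400.0),
       Probe("https://example.invalid/image", None, None),
   ]
   print(summarize(sample))
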
ApiMon Test Results Dashboard
=============================

This dashboard summarizes the overall status of the ApiMon playbook scenarios
for all services. The scenarios are fetched in an endless loop from a GitHub
repository (:ref:`Test Scenarios <test_scenarios>`) and executed, and various
metrics (:ref:`Metric Definitions <metrics_definition>`) are collected.

https://dashboard.tsi-dev.otc-service.com/d/APImonTestRes/apimon-test-results?orgId=1

On this dashboard users can immediately identify:

- the count of API errors
- which scenarios are passing, failing, or being skipped
- how long the test scenarios run
- the list of failed scenarios with links to the Ansible playbook output.log

Based on historical trends and annotations, users can identify whether a sudden
change in scenario behavior was caused by a planned change on the platform
(JIRA annotations) or by a new outage or bug.

.. image:: training_images/apimon_test_results.jpg

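
The figures listed above amount to counting scenario runs by result and
averaging their durations. The sketch below illustrates this on a few made-up
records; the tuple layout, the scenario names, and the result labels
(``passed``, ``failed``, ``skipped``) are assumptions for this example rather
than the actual ApiMon metric format.

.. code-block:: python

   from collections import Counter

   # Hypothetical run records: (scenario name, result, duration in seconds).
   runs = [
       ("scenario_a", "passed", 42.0),
       ("scenario_b", "failed", 61.5),
       ("scenario_c", "skipped", 0.0),
       ("scenario_a", "passed", 38.0),
   ]


   def summarize_runs(records):
       """Return result counts, failed scenario names, and mean run duration."""
       counts = Counter(result for _, result, _ in records)
       failed = sorted({name for name, result, _ in records if result == "failed"})
       # Skipped scenarios did not run, so they are left out of the average.
       executed = [d for _, result, d in records if result != "skipped"]
       average = sum(executed) / len(executed) if executed else None
       return counts, failed, average


   counts, failed, average = summarize_runs(runs)
   print(dict(counts), failed, average)
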
Service Based Dashboard
=======================

This dashboard provides deeper insight into a single service, with tailored
views, graphs, and tables that address the service's major functionalities and
specifics.

https://dashboard.tsi-dev.otc-service.com/d/APImonCompute/compute-service-statistics?orgId=1

For example, the Compute Service Statistics dashboard includes:

- Success rate of ECS deployments across different availability zones
- Instance boot duration for the most common images
- Successful SSH logins
- Metadata server latencies and query failures
- API call durations
- Bad API calls
- Failures in tasks
- Scenario results

This dashboard should be fully customized by the responsible squad, as it
knows best what needs to be monitored and checked for its service.

.. image:: training_images/compute_service_statistics_1.jpg

.. image:: training_images/compute_service_statistics_2.jpg

Custom Dashboards
=================

The dashboards described above are predefined and read-only.
Further customization is currently possible via system-config on GitHub:

https://github.com/opentelekomcloud-infra/system-config/tree/main/playbooks/templates/grafana/apimon

The predefined dashboard Jinja templates are stored there and can be customized
in the standard GitOps way (fork and pull request). In the future, this process
will be replaced by a simplified dashboard panel definition in the Stackmon
GitHub repository (https://github.com/stackmon/apimon-tests/tree/main/dashboards).

Dashboards can also be customized with the copy/save function directly in
Grafana. For example, to customize the Compute Service Statistics dashboard,
the whole dashboard can be saved under a new name and then edited without any
restrictions.

This approach is suitable for PoCs, temporary solutions, and investigations,
but it should not be used as a permanent solution: customized dashboards that
are not stored in GitHub repositories might be permanently deleted if the
dashboard service is fully reinstalled.
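
One way to keep such a copy from being lost is to export its JSON model and
commit it to a repository. The sketch below assumes that the Grafana HTTP API
endpoint ``GET /api/dashboards/uid/<uid>`` is reachable on this instance with
a bearer token; the UID, the token, and the output path are placeholders, and
whether API access is enabled here is not stated in this document.

.. code-block:: python

   import json
   import urllib.request

   GRAFANA_URL = "https://dashboard.tsi-dev.otc-service.com"


   def build_export_request(uid, token):
       """Prepare the API request that returns one dashboard by its UID."""
       return urllib.request.Request(
           f"{GRAFANA_URL}/api/dashboards/uid/{uid}",
           headers={"Authorization": f"Bearer {token}"},
       )


   def to_storable(payload):
       """Extract the dashboard model and drop the instance-specific id."""
       dashboard = dict(payload["dashboard"])
       dashboard["id"] = None
       # Sorted keys keep the diffs small when the file is committed again.
       return json.dumps(dashboard, indent=2, sort_keys=True)


   def export_dashboard(uid, token, path):
       """Download a dashboard and write its JSON model to a local file."""
       with urllib.request.urlopen(build_export_request(uid, token)) as response:
           payload = json.load(response)
       with open(path, "w", encoding="utf-8") as handle:
           handle.write(to_storable(payload))

For example, ``export_dashboard("<dashboard-uid>", "<api-token>",
"my_dashboard.json")`` would write a file that can be reviewed and committed
like any other change.
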