=====================
Dashboards management
=====================

https://dashboard.tsi-dev.otc-service.com

Authentication is centrally managed by OTC LDAP.

The ApiMon dashboards are grouped by type of service:

- The “OTC KPI” dashboard provides a high-level overview of OTC stability and
  reliability for management.
- The “Endpoint monitoring” dashboard monitors the health of every endpoint
  URL listed in the endpoint services catalogue.
- The “Respective service statistics” dashboards provide a more detailed,
  per-service overview.
- The “24/7 Mission Control” dashboard is used by the 24/7 squad for daily
  monitoring and for addressing alerts.
- Dashboards can be replicated and customized for individual squad needs.

All dashboards provide Environment (the monitored target platform) and Zone
(the monitoring source location) variables at the top of the page, so each
view can be adjusted to the chosen values.

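
Because the dashboards run in Grafana, the selected variable values can also
be carried in a link: Grafana reads template variables from ``var-<name>``
query parameters. The following is a minimal sketch only. The variable names
``environment`` and ``zone`` and the example values are assumptions; check the
variable names in the dashboard settings before relying on them.

.. code-block:: python

   from urllib.parse import urlencode

   BASE_URL = "https://dashboard.tsi-dev.otc-service.com"


   def dashboard_url(path, environment, zone, org_id=1):
       """Build a dashboard link with preselected template variables."""
       query = urlencode({
           "orgId": org_id,
           # Grafana template variables are passed as "var-<name>".
           # The names "environment" and "zone" are assumed, not confirmed.
           "var-environment": environment,
           "var-zone": zone,
       })
       return f"{BASE_URL}{path}?{query}"


   # Hypothetical values, for illustration only.
   url = dashboard_url("/d/APImonKPI/otc-kpi", "production_eu-de", "eu-de")
   print(url)

A link built this way opens the dashboard with both drop-downs already set,
which is convenient for sharing a specific view.
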
.. image:: training_images/dashboards.png

OTC KPI Dashboard
=================

The OTC KPI dashboard was requested by management to provide SLA-like views on
services, including:

- Global SLI (Service Level Indicator) views of API availability, latency and
  API errors
- Global SLO (Service Level Objective) views
- Service-based SLI views of availability, success rate, error counts and
  latencies
- Customer service views for specific cases such as OS boot time, server
  provisioning failures, volume backup duration, etc.

https://dashboard.tsi-dev.otc-service.com/d/APImonKPI/otc-kpi?orgId=1

These views provide an immediate overall status as well as the status of each
specific service.

.. image:: training_images/kpi_dashboard.png

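
As a rough illustration of what an availability SLI and its comparison with
an SLO express, the sketch below computes a percentage from plain request
counters. It is not the ApiMon metric definition: the function names, the
counters and the 99.9 % objective are all made up for the example.

.. code-block:: python

   def availability(successful, total):
       """Return the share of successful requests in percent.

       Returns None when no requests were measured, so that "no data"
       is not confused with 0 % availability.
       """
       if total == 0:
           return None
       return 100.0 * successful / total


   def meets_slo(sli_percent, slo_percent):
       """Compare a measured SLI with its objective (both in percent)."""
       return sli_percent is not None and sli_percent >= slo_percent


   # Hypothetical counters and objective, for illustration only.
   sli = availability(successful=9985, total=10000)
   ok = meets_slo(sli, slo_percent=99.9)
   print(sli, ok)

Treating an empty measurement window as "no data" rather than as a failure is
a design choice of this sketch; a real dashboard may handle gaps differently.
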
24/7 Mission Control Dashboards
===============================

The 24/7 Mission Control squad uses CloudMon, ApiMon and EpMon metrics and
presents them on its own customized dashboards, tailored to its requirements.

https://dashboard.tsi-dev.otc-service.com/d/eBQoZU0nk/overview?orgId=1&refresh=1m

.. image:: training_images/24_7_dashboard.jpg

Endpoint Monitoring Dashboard
=============================

The Endpoint Monitoring dashboard uses metrics from GET requests sent to the
OTC platform (:ref:`EpMon Overview <epmon_overview>`) and visualizes them in:

- General endpoint availability dashboard
- Endpoint response times dashboard
- No Response dashboard
- Error count dashboard

https://dashboard.tsi-dev.otc-service.com/d/APImonEPmon/endpoint-monitoring?orgId=1

ApiMon Test Results Dashboard
=============================

This dashboard summarizes the overall status of the ApiMon playbook scenarios
for all services. The scenarios are fetched in an endless loop from the GitHub
repository (:ref:`Test Scenarios <test_scenarios>`) and executed, and various
metrics (:ref:`Metric Definitions <metrics_definition>`) are collected.

https://dashboard.tsi-dev.otc-service.com/d/APImonTestRes/apimon-test-results?orgId=1

On this dashboard users can immediately identify:

- the count of API errors
- which scenarios are passing, failing or being skipped
- how long the test scenarios run
- the list of failed scenarios with links to the Ansible playbook output.log

Based on historical trends and annotations, users can tell whether a sudden
change in scenario behavior was caused by a planned change on the platform
(JIRA annotations) or whether it indicates a new outage or bug.

.. image:: training_images/apimon_test_results.jpg

Service Based Dashboard
=======================

This dashboard provides deeper insight into a single service, with tailored
views, graphs and tables that address the service's major functionalities and
specifics.

https://dashboard.tsi-dev.otc-service.com/d/APImonCompute/compute-service-statistics?orgId=1

For example, the Compute Service Statistics dashboard includes:

- Success rate of ECS deployments across different availability zones
- Instance boot duration for the most common images
- Successful SSH logins
- Metadata server latencies and query failures
- API call durations
- Bad API calls
- Failures in tasks
- Scenario results

Each such dashboard should be fully customized by the responsible squad, which
knows best what needs to be monitored and checked for its service.

.. image:: training_images/compute_service_statistics_1.jpg

.. image:: training_images/compute_service_statistics_2.jpg

Custom Dashboards
=================

The dashboards described above are predefined and read-only. Further
customization is currently possible via the system-config repository on
GitHub:

https://github.com/opentelekomcloud-infra/system-config/tree/main/playbooks/templates/grafana/apimon

The predefined dashboard Jinja templates are stored there and can be customized
in the standard GitOps way (fork and pull request). In the future, this process
will be replaced by a simplified dashboard panel definition in the Stackmon
GitHub repository (https://github.com/stackmon/apimon-tests/tree/main/dashboards).

Dashboards can also be customized directly in Grafana by using its copy/save
function. For example, to customize the Compute Service Statistics dashboard,
save the whole dashboard under a new name and then edit the copy without any
restrictions.

This approach is suitable for PoCs, temporary solutions and investigations, but
it should not be used as a permanent solution: customized dashboards that are
not properly stored in GitHub repositories might be permanently deleted if the
dashboard service is fully reinstalled.

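
One way to reduce the risk of losing a Grafana-side copy is to export its JSON
model and keep it under version control until it is moved into the regular
GitOps process. The sketch below is an illustration under assumptions, not part
of the ApiMon tooling: it assumes the export uses the usual Grafana top-level
fields ``id``, ``uid`` and ``title``, and the helper name and sample data are
hypothetical.

.. code-block:: python

   import json


   def prepare_copy(dashboard, new_title):
       """Return a copy of an exported dashboard that imports as a new one."""
       copy = dict(dashboard)    # shallow copy; only top-level keys change
       copy["id"] = None         # let Grafana assign a fresh internal id
       copy.pop("uid", None)     # avoid colliding with the original uid
       copy["title"] = new_title
       return copy


   # Hypothetical, heavily shortened export for illustration only.
   exported = {
       "id": 42,
       "uid": "abc123",
       "title": "Compute Service Statistics",
       "panels": [],
   }

   squad_copy = prepare_copy(exported, "Compute Service Statistics (squad copy)")

   # Sorted keys and fixed indentation keep diffs small between commits.
   text = json.dumps(squad_copy, indent=2, sort_keys=True)
   print(text)

The resulting text can be committed to a repository of the squad's choice.
Where exactly such files belong is not defined here.
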