tischrei 0618989a8a hc_ops
Reviewed-by: Gode, Sebastian <sebastian.gode@t-systems.com>
Co-authored-by: tischrei <tino.schreiber@t-systems.com>
Co-committed-by: tischrei <tino.schreiber@t-systems.com>
2024-02-22 14:55:55 +00:00

Dashboards management

https://dashboard.tsi-dev.otc-service.com

Authentication is centrally managed by OTC LDAP.

The ApiMon Dashboards are segregated based on the type of service:

  • The “OTC KPI” dashboard provides a high-level overview of OTC stability and reliability for management.
  • The “Endpoint monitoring” dashboard monitors the health of every endpoint URL listed in the endpoint services catalogue.
  • The “Respective service statistics” dashboards provide a more detailed overview of individual services.
  • The 24/7 Mission Control dashboard is used by the 24/7 squad for daily monitoring and for addressing alerts.
  • Dashboards can be replicated or customized for individual squad needs.

All dashboards support the Environment (target monitored platform) and Zone (monitoring source location) variables at the top of each dashboard, so the views can be adjusted to the chosen values.
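Grafana template variables can also be preset through query parameters of the form `var-<name>=<value>`, which is handy for sharing a link to a specific view. The following is a minimal sketch of building such a link. The variable names `environment` and `zone` and the example values are assumptions; the actual names are defined in each dashboard's settings.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def dashboard_url(base_url, environment, zone):
    """Return base_url with Grafana template-variable parameters appended."""
    parts = urlsplit(base_url)
    query = parse_qsl(parts.query)  # keep existing parameters such as orgId
    # Grafana reads template variables from parameters named var-<name>.
    # "environment" and "zone" are assumed variable names.
    query += [("var-environment", environment), ("var-zone", zone)]
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical example values for the two variables.
url = dashboard_url(
    "https://dashboard.tsi-dev.otc-service.com/d/APImonKPI/otc-kpi?orgId=1",
    environment="production",
    zone="eu-de",
)
print(url)
```

If a dashboard uses different variable names, only the two `var-` keys need to change.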

image

OTC KPI Dashboard

The OTC KPI dashboard was requested by management to provide SLA-like views of services, including:

  • Global SLI views (Service Level Indicators) of API availability, latency and API errors
  • Global SLO views (Service Level Objectives)
  • Service-based SLI views of availability, success rate, error counts and latencies
  • Customer service views for specific cases such as OS boot time, server provisioning failures, volume backup duration, etc.
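To make the relationship between an SLI and an SLO concrete, here is a small sketch under the common assumption that availability is the percentage of successful API calls and that the SLO is a minimum target for that percentage. The function names, the sample counts and the 99.5 % target are hypothetical; the formulas actually used by the dashboard are not described here.

```python
def availability_pct(successful, total):
    """Availability SLI: share of successful API calls, in percent."""
    if total == 0:
        return None  # no calls observed, so the indicator is undefined
    return 100.0 * successful / total


def meets_slo(sli_pct, slo_pct):
    """True if the measured indicator reaches the objective."""
    return sli_pct is not None and sli_pct >= slo_pct


# Hypothetical numbers for one service over one time window.
sli = availability_pct(successful=9990, total=10000)
print(sli, meets_slo(sli, slo_pct=99.5))
```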

https://dashboard.tsi-dev.otc-service.com/d/APImonKPI/otc-kpi?orgId=1

These views show the overall status at a glance, as well as the status of each specific service.

image

24/7 Mission Control Dashboards

The 24/7 Mission Control squad uses CloudMon, ApiMon and EpMon metrics and presents them on its own customized dashboards, which are tailored to its requirements.

https://dashboard.tsi-dev.otc-service.com/d/eBQoZU0nk/overview?orgId=1&refresh=1m

image

Endpoint Monitoring Dashboard

Endpoint Monitoring dashboards use metrics from GET query requests towards the OTC platform (EpMon Overview <epmon_overview>) and visualize them in:

  • General endpoint availability dashboard
  • Endpoint response times dashboard
  • No Response dashboard
  • Error count dashboard
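As a rough illustration of how the results of GET requests could feed these four views, the sketch below aggregates a list of probe results into availability, average response time, a no-response count and an error count. The `Probe` record, the rule that a status below 400 counts as healthy, and the sample URLs are all assumptions made for this example, not the EpMon implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Probe:
    url: str
    status: Optional[int]        # HTTP status code, None if no response arrived
    elapsed_ms: Optional[float]  # response time, None if no response arrived


def summarize(probes):
    """Aggregate probe results into the four figures shown on the dashboards."""
    answered = [p for p in probes if p.status is not None]
    ok = [p for p in answered if p.status < 400]  # assumed health criterion
    return {
        "availability_pct": 100.0 * len(ok) / len(probes) if probes else None,
        "avg_response_ms": (
            sum(p.elapsed_ms for p in answered) / len(answered) if answered else None
        ),
        "no_response": len(probes) - len(answered),
        "errors": len(answered) - len(ok),
    }


# Hypothetical probe results.
probes = [
    Probe("https://service-a.example/", 200, 120.0),
    Probe("https://service-b.example/", 503, 80.0),
    Probe("https://service-c.example/", None, None),
    Probe("https://service-d.example/", 200, 100.0),
]
summary = summarize(probes)
print(summary)
```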

https://dashboard.tsi-dev.otc-service.com/d/APImonEPmon/endpoint-monitoring?orgId=1

ApiMon Test Results Dashboard

This dashboard summarizes the overall status of the ApiMon playbook scenarios for all services. The scenarios are fetched in an endless loop from the GitHub repository (Test Scenarios <test_scenarios>) and executed, and various metrics (Metric Definitions <metrics_definition>) are collected.

https://dashboard.tsi-dev.otc-service.com/d/APImonTestRes/apimon-test-results?orgId=1

On this dashboard, users can immediately identify:

  • the count of API errors
  • which scenarios are passing, failing or being skipped
  • how long these test scenarios take to run
  • the list of failed scenarios with links to Ansible playbook output.log

Based on historical trends and annotations, users can identify whether a sudden change in scenario behavior was caused by a planned change on the platform (JIRA annotations) or whether there is a new outage or bug.

image

Service Based Dashboard

The dashboard provides deeper insight into a single service, with tailored views, graphs and tables that address the service's major functionalities and specifics.

https://dashboard.tsi-dev.otc-service.com/d/APImonCompute/compute-service-statistics?orgId=1

For example, the Compute Service Statistics dashboard includes:

  • Success rate of ECS deployments across different availability zones
  • Instance boot duration for most common images
  • SSH successful logins
  • Metadata server latencies and query failures
  • API calls duration
  • Bad API calls
  • Failures in tasks
  • Scenario results

This dashboard should be fully customized by the responsible squad, as they know best what they need to monitor and check for their service.

image

image

Custom Dashboards

The dashboards described above are predefined and read-only. Further customization is currently possible via system-config on GitHub:

https://github.com/opentelekomcloud-infra/system-config/tree/main/playbooks/templates/grafana/apimon

The predefined dashboard Jinja templates are stored there and can be customized in the standard GitOps way (fork and pull request). In the future, this process will be replaced by a simplified dashboard panel definition in the Stackmon GitHub repository (https://github.com/stackmon/apimon-tests/tree/main/dashboards).

Dashboards can also be customized using the copy/save function directly in Grafana. For example, to customize the Compute Service Statistics dashboard, the whole dashboard can be saved under a new name and then edited without any restrictions.

This approach is valid for PoCs, temporary solutions and investigations, but should not be used as a permanent solution: customized dashboards that are not properly stored in GitHub repositories might be permanently deleted if the dashboard service is fully re-installed.
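One way to keep a record of such a copy is to save its JSON model to a file that can be committed. Grafana's HTTP API returns a dashboard as an object with a `dashboard` key (the model) and a `meta` key. The sketch below assumes that response has already been fetched and only prepares the model for version control. The sample response is hypothetical and heavily trimmed, and whether the repositories above accept raw JSON rather than Jinja templates is not stated here.

```python
import json


def to_versionable(api_response):
    """Extract the dashboard model from a Grafana 'get dashboard' response."""
    dashboard = dict(api_response["dashboard"])  # shallow copy, input untouched
    # The numeric id is specific to one Grafana instance; clearing it allows
    # the JSON to be imported into another instance.
    dashboard["id"] = None
    # Sorted keys and fixed indentation keep diffs between versions small.
    return json.dumps(dashboard, indent=2, sort_keys=True) + "\n"


# Hypothetical, heavily trimmed API response.
response = {
    "meta": {"slug": "my-compute-copy"},
    "dashboard": {"id": 42, "uid": "abc123", "title": "My Compute Copy", "panels": []},
}
text = to_versionable(response)
print(text)
```

The resulting text can be written to a file and committed alongside other dashboard definitions.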