diff --git a/doc/source/internal/apimon_training/dashboards.rst b/doc/source/internal/apimon_training/dashboards.rst
index bee7801..f2772e4 100644
--- a/doc/source/internal/apimon_training/dashboards.rst
+++ b/doc/source/internal/apimon_training/dashboards.rst
@@ -7,32 +7,142 @@
 https://dashboard.tsi-dev.otc-service.com

 The authentication is centrally managed by LDAP.

-
-The ApiMon Dashboards are segregated based on the type of service.
+The ApiMon Dashboards are segregated based on the type of service:
+
 - The “OTC KPI” dashboard provides high level overview about OTC stability and
   reliability for management.
 - “Endpoint monitoring” dashboard monitors health of every endpoint url listed
   by endpoint services catalogue.
 - “Respective service statistics” dashboards provide more detailed overview.
+- The 24/7 Mission Control dashboard is used by the 24/7 squad for daily
+  monitoring and for addressing alerts.
 - Dashboards can be replicated/customized for individual Squad needs.
+
+All dashboards support Environment (target monitored platform) and Zone
+(monitoring source location) variables at the top of each dashboard, so the
+views can be adjusted based on the chosen values.
+
 .. image:: training_images/dashboards.png

 OTC KPI Dashboard
 =================

+The OTC KPI dashboard was requested by management to provide SLA-like views on
+services, including:
+
+- Global SLI (Service Level Indicator) views of API availability, latency and
+  API errors
+- Global SLO (Service Level Objective) views
+- Service-based SLI views of availability, success rate, error counts and
+  latencies
+- Custom service views for specific cases such as OS boot duration, server
+  provisioning failures, volume backup duration, etc.
+
+https://dashboard.tsi-dev.otc-service.com/d/APImonKPI/otc-kpi?orgId=1
+
+These views provide an immediate status of the platform as a whole, as well as
+the status of each specific service.
+
 ..
image:: training_images/kpi_dashboard.png

-24/7 dasbhoards
-===============
+
+24/7 Mission Control dashboards
+===============================
+
+The 24/7 Mission Control squad uses CloudMon, ApiMon and EpMon metrics and
+presents them on its own customized dashboards, which fulfil the squad's
+requirements.
+
+https://dashboard.tsi-dev.otc-service.com/d/eBQoZU0nk/overview?orgId=1&refresh=1m
+
+.. image:: training_images/24_7_dashboard.jpg

 Endpoint Monitoring Dashboard
 =============================

-Common Test Results Dashboard
+The Endpoint Monitoring dashboards use metrics from GET query requests towards
+the OTC platform (:ref:`EpMon Overview`) and visualize them in:
+
+- General endpoint availability dashboard
+- Endpoint response times dashboard
+- No Response dashboard
+- Error count dashboard
+
+https://dashboard.tsi-dev.otc-service.com/d/APImonEPmon/endpoint-monitoring?orgId=1
+
+
+ApiMon Test Results Dashboard
 =============================

-Service Based dashboard
+This dashboard summarizes the overall status of the ApiMon playbook scenarios
+for all services. The scenarios are fetched in an endless loop from the GitHub
+repository (:ref:`Test Scenarios`), executed, and various metrics
+(:ref:`Metric Definitions <Metrics>`) are collected.
+
+https://dashboard.tsi-dev.otc-service.com/d/APImonTestRes/apimon-test-results?orgId=1
+
+On this dashboard users can immediately identify:
+
+- the count of API errors
+- which scenarios are passing, failing or being skipped
+- how long the test scenarios run
+- the list of failed scenarios with links to the Ansible playbook output.log
+
+Based on historical trends and annotations, users can identify whether a sudden
+change in scenario behavior was caused by a planned change on the platform
+(JIRA annotations) or by a new outage/bug.
+
+..
image:: training_images/apimon_test_results.jpg
+
+Service Based Dashboard
 =======================

+The dashboard provides deeper insight into a single service, with tailored
+views, graphs and tables addressing the service's major functionalities and
+specifics.
+
+https://dashboard.tsi-dev.otc-service.com/d/APImonCompute/compute-service-statistics?orgId=1
+
+For example, the Compute Service Statistics dashboard includes:
+
+- Success rate of ECS deployments across different availability zones
+- Instance boot duration for the most common images
+- Successful SSH logins
+- Metadata server latencies and query failures
+- API call durations
+- Bad API calls
+- Failures in tasks
+- Scenario results
+
+This dashboard should be fully customized by the respective responsible squad,
+as they know best what they need to monitor and check for their service.
+
+
+.. image:: training_images/compute_service_statistics_1.jpg
+
+.. image:: training_images/compute_service_statistics_2.jpg
+
+Custom Dashboards
+=================
+
+The previous dashboards are predefined and read-only.
+Further customization is currently possible via system-config on GitHub:
+
+https://github.com/opentelekomcloud-infra/system-config/tree/main/playbooks/templates/grafana/apimon
+
+The predefined dashboard Jinja templates are stored there and can be customized
+in the standard GitOps way (fork and pull request). In the future this process
+will be replaced by a simplified dashboard panel definition in the Stackmon
+GitHub repository (https://github.com/stackmon/apimon-tests/tree/main/dashboards).
+
+Dashboards can also be customized simply via the copy/save function directly in
+Grafana. For example, to customize the Compute Service Statistics dashboard,
+the whole dashboard can be saved under a new name and then edited without any
+restrictions.
+
+This approach is valid for PoC, temporary solutions and investigations, but it
+should not be used as a permanent solution: customized dashboards that are not
+properly stored in GitHub repositories may be permanently deleted in case of a
+full dashboard service re-installation.
+
diff --git a/doc/source/internal/apimon_training/epmon_checks.rst b/doc/source/internal/apimon_training/epmon_checks.rst
index 96468c3..4beb953 100644
--- a/doc/source/internal/apimon_training/epmon_checks.rst
+++ b/doc/source/internal/apimon_training/epmon_checks.rst
@@ -1,3 +1,5 @@
+.. _EpMon Overview:
+
 ============================
 Endpoint Monitoring overview
 ============================
diff --git a/doc/source/internal/apimon_training/index.rst b/doc/source/internal/apimon_training/index.rst
index 1cee84d..0b07871 100644
--- a/doc/source/internal/apimon_training/index.rst
+++ b/doc/source/internal/apimon_training/index.rst
@@ -11,6 +11,7 @@ Apimon Training
    test_scenarios
    epmon_checks
    dashboards
+   metrics
    alerts
    notifications
    logs
diff --git a/doc/source/internal/apimon_training/metrics.rst b/doc/source/internal/apimon_training/metrics.rst
new file mode 100644
index 0000000..223ed39
--- /dev/null
+++ b/doc/source/internal/apimon_training/metrics.rst
@@ -0,0 +1,49 @@
+.. _Metrics:
+
+=======
+Metrics
+=======
+
+The Ansible playbook scenarios generate metrics in two ways:
+
+- The Ansible playbook internally invokes method calls to **OpenStack SDK
+  libraries**, which in turn generate metrics about each API call they make.
+  This requires some special configuration in the clouds.yaml file (currently,
+  exposing metrics to statsd and InfluxDB is supported). For details refer to
+  the `config documentation
+  <https://docs.openstack.org/openstacksdk/latest/user/guides/stats.html>`_
+  of the OpenStack SDK.
+  The following metrics are captured:
+
+  - response HTTP code
+  - duration of the API call
+  - name of the API call
+  - method of the API call
+  - service type
+
+- Ansible plugins may **expose additional metrics** (e.g. whether the overall
+  scenario succeeded or not) with the help of a `callback plugin
+  <https://github.com/stackmon/apimon/tree/main/apimon/ansible/callback>`_.
+  Since it is sometimes not sufficient to know only the timings of each API
+  call, Ansible callbacks are utilized to report the overall execution time
+  and result (whether the scenario succeeded and how long it took). The
+  following metrics are captured:
+
+  - test case
+  - playbook name
+  - environment
+  - action name
+  - result code
+  - result string
+  - service type
+  - state type
+  - total number of failed, passed, ignored and skipped tests
+
+Custom metrics:
+
+In some situations more complex metric generation is required, covering the
+execution of multiple tasks in a scenario. For such cases the tags parameter
+is used. Once specific tasks in a playbook are tagged with a metric name, the
+metric is calculated as the sum over all executed tasks carrying the
+respective tag. This is useful when the measured metric comprises multiple
+steps needed to reach the desired state of a service or service resource, for
+example booting a virtual machine from deployment until successful login via
+SSH::
+
+   tags: ["metric=delete_server"]
+   tags: ["az={{ availability_zone }}", "service=compute", "metric=create_server{{ metric_suffix }}"]
\ No newline at end of file
diff --git a/doc/source/internal/apimon_training/test_scenarios.rst b/doc/source/internal/apimon_training/test_scenarios.rst
index aafde84..d77b795 100644
--- a/doc/source/internal/apimon_training/test_scenarios.rst
+++ b/doc/source/internal/apimon_training/test_scenarios.rst
@@ -1,3 +1,5 @@
+..
_Test Scenarios:
+
 ==============
 Test Scenarios
 ==============
diff --git a/doc/source/internal/apimon_training/training_images/24_7_dashboard.jpg b/doc/source/internal/apimon_training/training_images/24_7_dashboard.jpg
new file mode 100644
index 0000000..65f2ac6
Binary files /dev/null and b/doc/source/internal/apimon_training/training_images/24_7_dashboard.jpg differ
diff --git a/doc/source/internal/apimon_training/training_images/apimon_test_results.jpg b/doc/source/internal/apimon_training/training_images/apimon_test_results.jpg
new file mode 100644
index 0000000..f6eb863
Binary files /dev/null and b/doc/source/internal/apimon_training/training_images/apimon_test_results.jpg differ
diff --git a/doc/source/internal/apimon_training/training_images/compute_service_statistics_1.jpg b/doc/source/internal/apimon_training/training_images/compute_service_statistics_1.jpg
new file mode 100644
index 0000000..8275ea5
Binary files /dev/null and b/doc/source/internal/apimon_training/training_images/compute_service_statistics_1.jpg differ
diff --git a/doc/source/internal/apimon_training/training_images/compute_service_statistics_2.jpg b/doc/source/internal/apimon_training/training_images/compute_service_statistics_2.jpg
new file mode 100644
index 0000000..fc96990
Binary files /dev/null and b/doc/source/internal/apimon_training/training_images/compute_service_statistics_2.jpg differ
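Editor's note: the custom-metrics mechanism described in metrics.rst above (summing the durations of all tasks tagged with the same `metric=...` name) can be sketched as a playbook excerpt. This is a hypothetical illustration, not code from the apimon-tests repository: the task names, module arguments and variable names (`image_name`, `flavor_name`, `server.server.public_v4`) are assumptions; only the tags convention comes from the text above.

```yaml
# Hypothetical scenario excerpt. Both tasks carry the same
# "metric=create_server" tag, so the reported create_server metric is the
# summed duration of the boot task and the SSH wait task -- i.e. the time
# from deployment until login via SSH is possible.
- name: Boot test server
  openstack.cloud.server:
    name: "apimon-test-server"          # assumed resource name
    image: "{{ image_name }}"           # assumed variable
    flavor: "{{ flavor_name }}"         # assumed variable
    availability_zone: "{{ availability_zone }}"
    state: present
  register: server
  tags: ["az={{ availability_zone }}", "service=compute", "metric=create_server"]

- name: Wait until SSH login is possible
  ansible.builtin.wait_for:
    host: "{{ server.server.public_v4 }}"   # assumed return field
    port: 22
    timeout: 600
  tags: ["service=compute", "metric=create_server"]
```

With this tagging, a single `create_server` duration covering both steps would be emitted, analogous to the `metric=delete_server` and `metric=create_server{{ metric_suffix }}` examples shown in metrics.rst.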