diff --git a/doc/source/internal/apimon_training/alerts.rst b/doc/source/internal/apimon_training/alerts.rst
new file mode 100644
index 0000000..789e5f9
--- /dev/null
+++ b/doc/source/internal/apimon_training/alerts.rst
@@ -0,0 +1,31 @@
+======
+Alerts
+======
+
+Alerta is the ApiMon component designed to integrate alerts from multiple
+sources. It supports many standard sources like Syslog, SNMP, Prometheus,
+Nagios, Zabbix, etc. Additionally, any other type of source can be integrated
+via URL requests or the command line.
+
+Native functions like correlation and de-duplication help to manage thousands
+of alerts in a transparent way and consolidate alerts into proper categories
+based on environment, service, resource, failure type, etc.
+
+Alerta is hosted on https://alerts.eco.tsi-dev.otc-service.com/ .
+The authentication is centrally managed by OTC LDAP.
+
+The Zulip API is integrated with Alerta to send notifications of errors/alerts
+to the Zulip streams.
+
+Alerts displayed on OTC Alerta are generated either by the Executor, the
+Scheduler, EpMon or Grafana.
+
+ - “Executor alerts” focus on playbook results, i.e. whether a playbook has
+   completed or failed.
+ - “Grafana alerts” focus on breached thresholds, for example an API
+   response time higher than the defined threshold.
+ - "Scheduler alerts" TBD
+ - "EpMon alerts" provide information about failed endpoint queries with details
+   of the request in curl form and the respective error response details.
+
+.. image:: training_images/alerta_dashboard.png
diff --git a/doc/source/internal/apimon_training/contact.rst b/doc/source/internal/apimon_training/contact.rst
new file mode 100644
index 0000000..7a37812
--- /dev/null
+++ b/doc/source/internal/apimon_training/contact.rst
@@ -0,0 +1,29 @@
+Contact - Whom to address for Feedback?
+=======================================
+
+In case you have any feedback or proposals, or have found any issues regarding
+ApiMon, EpMon or CloudMon, you can address them in the corresponding GitHub
+OpenTelekomCloud-Infra repositories or StackMon repositories.
+
+Issues or feedback regarding the **ApiMon, EpMon, Status Dashboard, Metric
+processor**, as well as new feature requests, can be addressed by filing an
+issue on the **GitHub** repository under
+https://github.com/opentelekomcloud-infra/system-config/blob/main/inventory/service/group_vars/apimon.yaml (CMO)
+https://github.com/opentelekomcloud-infra/stackmon-config (FMO)
+
+If you have found any problems which affect the **ApiMon dashboard design**
+please open an issue/PR on **GitHub**
+https://github.com/opentelekomcloud-infra/system-config/tree/main/playbooks/templates/grafana/apimon (CMO)
+https://github.com/stackmon/apimon-tests (FMO)
+
+
+If you have found any problems which affect the **ApiMon playbook scenarios**
+please open an issue/PR on **GitHub**
+https://github.com/opentelekomcloud-infra/apimon-tests (CMO)
+https://github.com/stackmon/apimon-tests (FMO).
+
+If there is another issue/demand/request, try to locate the proper repository in
+https://github.com/orgs/stackmon/repositories
+
+For general questions you can write an e-mail to the `Ecosystems Squad
+`_.
\ No newline at end of file
diff --git a/doc/source/internal/apimon_training/dashboards.rst b/doc/source/internal/apimon_training/dashboards.rst
new file mode 100644
index 0000000..b2e5ae9
--- /dev/null
+++ b/doc/source/internal/apimon_training/dashboards.rst
@@ -0,0 +1,148 @@
+=====================
+Dashboards management
+=====================
+
+https://dashboard.tsi-dev.otc-service.com
+
+The authentication is centrally managed by OTC LDAP.
+
+
+The ApiMon dashboards are segregated based on the type of service:
+
+ - The “OTC KPI” dashboard provides a high-level overview of OTC stability and
+   reliability for management.
+ - The “Endpoint monitoring” dashboard monitors the health of every endpoint URL
+   listed in the endpoint service catalogue.
+ - The “Respective service statistics” dashboards provide a more detailed
+   overview.
+ - The 24/7 Mission Control dashboard is used by the 24/7 squad for daily
+   monitoring and addressing the alerts.
+ - Dashboards can be replicated/customized for individual squad needs.
+
+
+All the dashboards support Environment (target monitored platform) and Zone
+(monitoring source location) variables at the top of each dashboard so these
+views can be adjusted based on the chosen values.
+
+.. image:: training_images/dashboards.png
+
+
+OTC KPI Dashboard
+=================
+
+The OTC KPI dashboard was requested by management to provide SLA-like views on
+services including:
+
+ - Global SLI (Service Level Indicator) views of API availability, latency and API errors
+ - Global SLO (Service Level Objective) views
+ - Service-based SLI views of availability, success rate, error counts and latencies
+ - Customer service views for specific cases like OS boot time duration, server
+   provisioning failures, volume backup duration, etc.
+
+https://dashboard.tsi-dev.otc-service.com/d/APImonKPI/otc-kpi?orgId=1
+
+These views provide the immediate status of the overall platform as well as the
+status of a specific service.
+
+.. image:: training_images/kpi_dashboard.png
+
+
+24/7 Mission control dashboards
+===============================
+
+The 24/7 Mission Control squad uses CloudMon, ApiMon and EpMon metrics and
+presents them on its own customized dashboards which fulfill its
+requirements.
+
+https://dashboard.tsi-dev.otc-service.com/d/eBQoZU0nk/overview?orgId=1&refresh=1m
+
+.. image:: training_images/24_7_dashboard.jpg
+
+Endpoint Monitoring Dashboard
+=============================
+
+The Endpoint Monitoring dashboards use metrics from GET requests towards the OTC
+platform (:ref:`EpMon Overview `) and visualize them in:
+
+ - General endpoint availability dashboard
+ - Endpoint response times dashboard
+ - No Response dashboard
+ - Error count dashboard
+
+https://dashboard.tsi-dev.otc-service.com/d/APImonEPmon/endpoint-monitoring?orgId=1
+
+
+ApiMon Test Results Dashboard
+=============================
+
+This dashboard summarizes the overall status of the ApiMon playbook scenarios
+for all services. The scenarios are fetched in an endless loop from the GitHub
+repository (:ref:`Test Scenarios `), executed, and various metrics (:ref:`Metric
+Definitions `) are collected.
+
+https://dashboard.tsi-dev.otc-service.com/d/APImonTestRes/apimon-test-results?orgId=1
+
+On this dashboard users can immediately identify:
+
+ - the count of API errors
+ - which scenarios are passing, failing or being skipped
+ - how long the test scenarios run
+ - the list of failed scenarios with links to the Ansible playbook output.log
+
+Based on historical trends and annotations, users can identify whether a sudden
+change in scenario behavior was caused by a planned change on the
+platform (JIRA annotations) or whether there is a new outage/bug.
+
+.. image:: training_images/apimon_test_results.jpg
+
+Service Based Dashboard
+=======================
+
+The dashboard provides deeper insight into a single service with tailored views,
+graphs and tables addressing the service's major functionalities and specifics.
+
+https://dashboard.tsi-dev.otc-service.com/d/APImonCompute/compute-service-statistics?orgId=1
+
+For example, the Compute Service Statistics dashboard includes:
+
+ - Success rate of ECS deployments across different availability zones
+ - Instance boot duration for most common images
+ - SSH successful logins
+ - Metadata server latencies and query failures
+ - API calls duration
+ - Bad API calls
+ - Failures in tasks
+ - Scenario results
+
+This dashboard should be fully customized by the respective responsible squad, as
+they know best what they need to monitor and check for their service.
+
+
+.. image:: training_images/compute_service_statistics_1.jpg
+
+.. image:: training_images/compute_service_statistics_2.jpg
+
+Custom Dashboards
+=================
+
+The previous dashboards are predefined and read-only.
+Further customization is currently possible via system-config on GitHub:
+
+https://github.com/opentelekomcloud-infra/system-config/tree/main/playbooks/templates/grafana/apimon
+
+The predefined dashboard Jinja templates are stored there and can be customized
+in the standard GitOps way (fork and pull request). In the future this process
+will be replaced by a simplified dashboard panel definition in the StackMon
+GitHub repository (https://github.com/stackmon/apimon-tests/tree/main/dashboards).
+
+Dashboards can also be customized via the copy/save function directly in
+Grafana. For example, to customize the Compute Service Statistics dashboard, the
+whole dashboard can be saved under a new name and then edited without any
+restrictions.
+
+This approach is valid for PoCs, temporary solutions and investigations, but it
+should not be used as a permanent solution: customized dashboards which are not
+properly stored in GitHub repositories might be permanently deleted in case of a
+full dashboard service re-installation.
+
+
+
diff --git a/doc/source/internal/apimon_training/databases.rst b/doc/source/internal/apimon_training/databases.rst
new file mode 100644
index 0000000..ba11015
--- /dev/null
+++ b/doc/source/internal/apimon_training/databases.rst
@@ -0,0 +1,141 @@
+.. _metric_databases:
+
+================
+Metric Databases
+================
+
+Metrics are stored in two different database types:
+
+ - Graphite time-series database
+ - PostgreSQL relational database
+
+
+Graphite
+========
+
+
+ `Graphite `_ is an open-source, enterprise-ready
+ time-series database. ApiMon, EpMon, and CloudMon data are stored in the
+ clustered Graphite TSDB. Metrics emitted by the processes are gathered by a
+ row of statsd processes which aggregate metrics to 10-second precision.
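+
+ As an illustration of how a single API call lands in Graphite, a GET request
+ to the compute service's /servers endpoint would typically appear under paths
+ like the following (an illustrative sketch assembled from the path templates
+ described below; $environment and $zone are placeholders for the monitored
+ environment and the monitoring location)::
+
+   stats.timers.openstack.api.$environment.$zone.compute.GET.servers.200.mean
+   stats.counters.openstack.api.$environment.$zone.compute.GET.servers.200.count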
+
+
++---------------------+-----------------------------------------------------------------------------------------------+
+| Parameter           | Value                                                                                         |
++=====================+===============================================================================================+
+| Grafana Datasource  | apimon-carbonapi                                                                              |
++---------------------+-----------------------------------------------------------------------------------------------+
+| Database type       | time series                                                                                   |
++---------------------+-----------------------------------------------------------------------------------------------+
+| Main namespace      | stats                                                                                         |
++---------------------+-----------------------------------------------------------------------------------------------+
+| Metric type         | OpenStack API metrics (including otcextensions) collecting response codes, latencies, methods |
+|                     | ApiMon metrics (create_cce_cluster, delete_volume_eu-de-01, etc)                              |
+|                     | Custom metrics which can be created by tags in ansible playbooks                              |
++---------------------+-----------------------------------------------------------------------------------------------+
+| Database attributes | "timers", "counters", "environment name", "monitoring location", "service", "request method", |
+|                     | "resource", "response code", "result", custom metrics, etc                                    |
++---------------------+-----------------------------------------------------------------------------------------------+
+| result of API calls | attempted                                                                                     |
+|                     | passed                                                                                        |
+|                     | failed                                                                                        |
++---------------------+-----------------------------------------------------------------------------------------------+
+
+
+.. image:: training_images/graphite_query.jpg
+
+
+All metrics are under the "stats" namespace.
+
+Under "stats" there are the following important metric types:
+
+- counters
+- timers
+- gauges
+
+Counters and timers have the following subbranches:
+
+- apimon.metric → specific apimon metrics not gathered by the OpenStack API
+  methods
+- openstack.api → pure API request metrics
+
+Each section further has the following branches:
+
+- environment name (production_regA, production_regB, etc)
+
+  - monitoring location (production_regA, awx) - the environment from which the metric is gathered
+
+
+openstack.api
+-------------
+
+The OpenStack metrics branch is structured as follows:
+
+- service (normally service_type from the service catalog, but sometimes differs slightly)
+
+  - request method (GET/POST/DELETE/PUT)
+
+    - resource (service resource, i.e. server, keypair, volume, etc). Sub-resources are joined with "_" (i.e. cluster_nodes)
+
+      - response code - received response code
+
+        - count/upper/lower/mean/etc - timer specific metrics (available only under stats.timers.openstack.api.$environment.$zone.$service.$request_method.$resource.$status_code.{count,mean,upper,*})
+        - count/rate - counter specific metrics (available only under stats.counters.openstack.api.$environment.$zone.$service.$request_method.$resource.$status_code.{count,rate})
+
+      - attempted - counter for the attempted requests (only for counters)
+      - failed - counter of failed requests (no response received, connection problems, etc) (only for counters)
+      - passed - counter of requests receiving any response back (only for counters)
+
+
+apimon.metric
+-------------
+
+- metric name (i.e. create_cce_cluster, delete_volume_eu-de-01, etc) - complex metrics branch
+
+  - attempted/failed/failedignored/passed/skipped - counters for the corresponding operation results (this branch element represents the status of the corresponding Ansible task)
+
+  - $az - some metrics have the availability zone of the operation on that level. Since this info is not always available, this is a varying path element
+
+- curl - subtree for the curl type of metrics
+
+  - $name - short name of the host to be checked
+
+
+- stats.timers.apimon.metric.$environment.$zone.**csm_lb_timings**.{public,private}.{http,https,tcp}.$az.__VALUE__ - timer values for the loadbalancer test
+- stats.counters.apimon.metric.$environment.$zone.**csm_lb_timings**.{public,private}.{http,https,tcp}.$az.{attempted,passed,failed} - counter values for the loadbalancer test
+- stats.timers.apimon.metric.$environment.$zone.**curl**.$host.{passed,failed}.__VALUE__ - timer values for the curl test
+- stats.counters.apimon.metric.$environment.$zone.**curl**.$host.{attempted,passed,failed} - counter values for the curl test
+- stats.timers.apimon.metric.$environment.$zone.**dns**.$ns_name.$host - timer values for the NS lookup test. $ns_name is the DNS server used to query the records
+- stats.counters.apimon.metric.$environment.$zone.**dns**.$ns_name.$host.{attempted,passed,failed} - counter values for the NS lookup test
+
+
+PostgreSQL
+==========
+
+The relational database stores ApiMon playbook scenario results, which provide
+statistics about the most common service functionalities and use cases.
+These queries are used mainly on the Test Results dashboard and the
+service-specific statistics dashboards.
+
+
++-------------------------------+-------------------------------------------------------------------------------------------------------------+
+| Parameter                     | Value                                                                                                       |
++===============================+=============================================================================================================+
+| Grafana Datasource            | apimon-pg                                                                                                   |
++-------------------------------+-------------------------------------------------------------------------------------------------------------+
+| Database Type                 | relational                                                                                                  |
++-------------------------------+-------------------------------------------------------------------------------------------------------------+
+| Database Table                | results_summary                                                                                             |
++-------------------------------+-------------------------------------------------------------------------------------------------------------+
+| Metric type                   | apimon playbook result statistics                                                                           |
++-------------------------------+-------------------------------------------------------------------------------------------------------------+
+| Database Fields               | "timestamp", "name", "job_id", "result", "duration", "result_task"                                         |
++-------------------------------+-------------------------------------------------------------------------------------------------------------+
+| result field values           | 0 - success                                                                                                 |
+|                               | 1 - ?                                                                                                       |
+|                               | 2 - skipped                                                                                                 |
+|                               | 3 - failed                                                                                                  |
++-------------------------------+-------------------------------------------------------------------------------------------------------------+
+| result_task object parameters | "timestamp", "name", "job_id", "result", "duration", "action", "environment", "zone", "anonymized_response" |
++-------------------------------+-------------------------------------------------------------------------------------------------------------+
+
+
+.. image:: training_images/postgresql_query.jpg
diff --git a/doc/source/internal/apimon_training/difference_cmo_fmo.rst b/doc/source/internal/apimon_training/difference_cmo_fmo.rst
new file mode 100644
index 0000000..2d2fd66
--- /dev/null
+++ b/doc/source/internal/apimon_training/difference_cmo_fmo.rst
@@ -0,0 +1,34 @@
+.. _difference_apimon_cmo_fmo:
+
+===================================
+Difference ApiMon(CMO)/ApiMon(FMO)
+===================================
+
+Due to the ongoing transformation of ApiMon and its integration into the more
+robust CloudMon, there are currently two operation modes: the current mode of
+operation (CMO) and the future mode of operation (FMO). It is therefore
+important to understand what is supported in which mode.
+
+This page provides navigation links and helps to understand the changes once
+the transformation is completed and some of the locations change.
+
+The most important differences are described in the table below:
+
++---------------------+------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------+
+| **Differences**     | **ApiMon (CMO)**                                                                                           | **ApiMon (FMO)**                                                         |
++=====================+============================================================================================================+==========================================================================+
+| Playbook scenarios  | https://github.com/opentelekomcloud-infra/apimon-tests                                                     | https://github.com/stackmon/apimon-tests/tree/main/playbooks             |
++---------------------+------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------+
+| Dashboards setup    | https://github.com/opentelekomcloud-infra/system-config/tree/main/playbooks/templates/grafana/apimon       | https://github.com/stackmon/apimon-tests/tree/main/dashboards            |
++---------------------+------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------+
+| Environment setup   | https://github.com/opentelekomcloud-infra/system-config/blob/main/inventory/service/group_vars/apimon.yaml | https://github.com/opentelekomcloud-infra/stackmon-config                |
++---------------------+------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------+
+| Implementation mode | standalone app                                                                                             | plugin based                                                             |
++---------------------+------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------+
+| Organization        | opentelekomcloud-infra                                                                                     | stackmon                                                                 |
++---------------------+------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------+
+| Dashboards          | https://dashboard.tsi-dev.otc-service.com/                                                                 | https://dashboard.tsi-dev.otc-service.com/                               |
+|                     | https://dashboard.tsi-dev.otc-service.com/dashboards/f/UaB8meoZk/apimon                                    | https://dashboard.tsi-dev.otc-service.com/dashboards/f/CloudMon/cloudmon |
++---------------------+------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------+
+| Documentation       | https://confluence.tsi-dev.otc-service.com/display/ES/API-Monitoring                                       | https://stackmon.github.io/                                              |
+|                     |                                                                                                            | https://stackmon-cloudmon.readthedocs.io/en/latest/index.html            |
++---------------------+------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------+
diff --git a/doc/source/internal/apimon_training/epmon_checks.rst b/doc/source/internal/apimon_training/epmon_checks.rst
new file mode 100644
index 0000000..8dbb0d8
--- /dev/null
+++ b/doc/source/internal/apimon_training/epmon_checks.rst
@@ -0,0 +1,40 @@
+.. _epmon_overview:
+
+============================
+Endpoint Monitoring overview
+============================
+
+
+EpMon is a standalone Python-based process targeting every OTC service. It
+finds the services in the service catalog and sends GET requests to the
+configured endpoints.
+
+Performing extensive tests like provisioning a server gives great coverage,
+but usually cannot be performed very often and therefore leaves certain gaps
+on the monitoring timescale. To cover this gap, the EpMon component sends GET
+requests to the given URLs, relying on the API discovery of the OpenStack
+cloud (e.g. a GET request to /servers on the compute endpoint). Such requests
+are cheap and can be performed in a loop, e.g. every 5 seconds. The latency of
+those calls, as well as the return codes, are captured and sent to the metrics
+storage.
+
+
+
+Currently the EpMon configuration is located in system-config:
+https://github.com/opentelekomcloud-infra/system-config/blob/main/inventory/service/group_vars/apimon.yaml
+(this will change in the future once CloudMon takes over).
+
+It defines the HTTP query targets for every single OTC service.
+
+The EpMon dashboard provides the general availability status of every service
+definition from the service catalog:
+
+.. image:: training_images/epmon_status_dashboard.jpg
+
+Additionally, it provides further details for the endpoints, like response
+times, detected error codes or missing responses.
+
+.. image:: training_images/epmon_dashboard_details.jpg
+
+EpMon findings are also reported to Alerta, and notifications are sent to the
+dedicated Zulip topic "apimon_endpoint_monitoring".
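+
+Conceptually, each EpMon check is just a token-authenticated GET request whose
+response code and latency are recorded. A rough manual equivalent, useful when
+reproducing an alert, looks like this (a sketch only; the endpoint URL is a
+hypothetical example and $TOKEN must hold a valid token for the respective
+environment)::
+
+  curl -s -o /dev/null -w "%{http_code} %{time_total}\n" \
+       -H "X-Auth-Token: $TOKEN" \
+       https://ecs.eu-de.otc.t-systems.com/v2.1/servers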
diff --git a/doc/source/internal/apimon_training/faq/faq_images/alerta_alerts_detail.png b/doc/source/internal/apimon_training/faq/faq_images/alerta_alerts_detail.png
new file mode 100644
index 0000000..3f40ac2
Binary files /dev/null and b/doc/source/internal/apimon_training/faq/faq_images/alerta_alerts_detail.png differ
diff --git a/doc/source/internal/apimon_training/faq/faq_images/annotations.jpg b/doc/source/internal/apimon_training/faq/faq_images/annotations.jpg
new file mode 100644
index 0000000..4481db7
Binary files /dev/null and b/doc/source/internal/apimon_training/faq/faq_images/annotations.jpg differ
diff --git a/doc/source/internal/apimon_training/faq/faq_images/dashboard_log_links.jpg b/doc/source/internal/apimon_training/faq/faq_images/dashboard_log_links.jpg
new file mode 100644
index 0000000..65d08c0
Binary files /dev/null and b/doc/source/internal/apimon_training/faq/faq_images/dashboard_log_links.jpg differ
diff --git a/doc/source/internal/apimon_training/faq/faq_images/zulip_notification_links.jpg b/doc/source/internal/apimon_training/faq/faq_images/zulip_notification_links.jpg
new file mode 100644
index 0000000..b724ee4
Binary files /dev/null and b/doc/source/internal/apimon_training/faq/faq_images/zulip_notification_links.jpg differ
diff --git a/doc/source/internal/apimon_training/faq/how_can_i_access_dashboard.rst b/doc/source/internal/apimon_training/faq/how_can_i_access_dashboard.rst
new file mode 100644
index 0000000..0522222
--- /dev/null
+++ b/doc/source/internal/apimon_training/faq/how_can_i_access_dashboard.rst
@@ -0,0 +1,7 @@
+===============================
+How Can I Access the Dashboard?
+===============================
+
+OTC LDAP authentication is supported on
+https://dashboard.tsi-dev.otc-service.com.
+
diff --git a/doc/source/internal/apimon_training/faq/how_to_read_the_logs_and_understand_the_issue.rst b/doc/source/internal/apimon_training/faq/how_to_read_the_logs_and_understand_the_issue.rst
new file mode 100644
index 0000000..95af642
--- /dev/null
+++ b/doc/source/internal/apimon_training/faq/how_to_read_the_logs_and_understand_the_issue.rst
@@ -0,0 +1,80 @@
+.. _working_with_logs:
+
+=============================================
+How To Read The Logs And Understand The Issue
+=============================================
+
+
+Logs are stored on Swift OBS and expire after ~1 week. The logs can be
+accessed from multiple locations:
+
+ - Zulip notifications:
+
+
+   .. image:: faq_images/zulip_notification_links.jpg
+
+
+ - Alerts in Alerta
+
+
+   .. image:: faq_images/alerta_alerts_detail.png
+
+
+ - Tables in dashboards
+
+
+   .. image:: faq_images/dashboard_log_links.jpg
+
+
+The logs contain the whole Ansible playbook output and help to analyze the
+problem in detail.
+For example, the following log detail describes a failed scenario for an ECS
+deployment::
+
+   2023-05-17 21:08:09.038955 | TASK [server_create_delete : Try connecting]
+   2023-05-17 21:08:09.485569 | localhost | ERROR
+   2023-05-17 21:08:09.485862 | localhost | {
+   2023-05-17 21:08:09.485922 | localhost |   "changed": true,
+   2023-05-17 21:08:09.485950 | localhost |   "cmd": [
+   2023-05-17 21:08:09.485984 | localhost |     "ssh",
+   2023-05-17 21:08:09.486016 | localhost |     "-o",
+   2023-05-17 21:08:09.486052 | localhost |     "UserKnownHostsFile=/dev/null",
+   2023-05-17 21:08:09.486076 | localhost |     "-o",
+   2023-05-17 21:08:09.486097 | localhost |     "StrictHostKeyChecking=no",
+   2023-05-17 21:08:09.486118 | localhost |     "linux@80.158.60.117",
+   2023-05-17 21:08:09.486138 | localhost |     "-i",
+   2023-05-17 21:08:09.486160 | localhost |     "~/.ssh/scenario2a-162b6915911748c5809474be69d2a3b3-kp.pem"
+   2023-05-17 21:08:09.486192 | localhost |   ],
+   2023-05-17 21:08:09.486221 | localhost |   "delta": "0:00:00.127394",
+   2023-05-17 21:08:09.486242 | localhost |   "end": "2023-05-17 21:08:09.454247",
+   2023-05-17 21:08:09.486262 | localhost |   "invocation": {
+   2023-05-17 21:08:09.486283 | localhost |     "module_args": {
+   2023-05-17 21:08:09.486314 | localhost |       "_raw_params": "ssh -o 'UserKnownHostsFile=/dev/null' -o 'StrictHostKeyChecking=no' linux@80.158.60.117 -i ~/.ssh/scenario2a-162b6915911748c5809474be69d2a3b3-kp.pem",
+   2023-05-17 21:08:09.486373 | localhost |       "_uses_shell": false,
+   2023-05-17 21:08:09.486397 | localhost |       "argv": null,
+   2023-05-17 21:08:09.486428 | localhost |       "chdir": null,
+   2023-05-17 21:08:09.486455 | localhost |       "creates": null,
+   2023-05-17 21:08:09.486487 | localhost |       "executable": null,
+   2023-05-17 21:08:09.486513 | localhost |       "removes": null,
+   2023-05-17 21:08:09.486533 | localhost |       "stdin": null,
+   2023-05-17 21:08:09.486553 | localhost |       "stdin_add_newline": true,
+   2023-05-17 21:08:09.486573 | localhost |       "strip_empty_ends": true,
+   2023-05-17 21:08:09.486593 | localhost |       "warn": false
+   2023-05-17 21:08:09.486613 | localhost |     }
+   2023-05-17 21:08:09.486633 | localhost |   },
+   2023-05-17 21:08:09.486657 | localhost |   "msg": "non-zero return code",
+   2023-05-17 21:08:09.486689 | localhost |   "rc": 255,
+   2023-05-17 21:08:09.486713 | localhost |   "start": "2023-05-17 21:08:09.326853",
+   2023-05-17 21:08:09.486734 | localhost |   "stderr": "Pseudo-terminal will not be allocated because stdin is not a terminal.\r\nWarning: Permanently added '80.158.60.117' (ED25519) to the list of known hosts.\r\nlinux@80.158.60.117: Permission denied (publickey).",
+   2023-05-17 21:08:09.486755 | localhost |   "stderr_lines": [
+   2023-05-17 21:08:09.486776 | localhost |     "Pseudo-terminal will not be allocated because stdin is not a terminal.",
+   2023-05-17 21:08:09.486808 | localhost |     "Warning: Permanently added '80.158.60.117' (ED25519) to the list of known hosts.",
+   2023-05-17 21:08:09.486834 | localhost |     "linux@80.158.60.117: Permission denied (publickey)."
+   2023-05-17 21:08:09.486855 | localhost |   ]
+   2023-05-17 21:08:09.486875 | localhost | }
+
+In this case it seems that the deployed ECS doesn't contain the injected public
+SSH key, which can point to an issue with cloud-init or the metadata server.
+
+The playbooks can also be run manually on any OTC tenant and used
+for further investigation and analysis.
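+For example, a sketch of a manual run (assuming a clouds.yaml configured for
+the target tenant; the repository layout and playbook path shown here are an
+assumption and may differ)::
+
+   git clone https://github.com/opentelekomcloud-infra/apimon-tests.git
+   cd apimon-tests
+   ansible-playbook playbooks/scenario2_simple_ece.yaml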
+
diff --git a/doc/source/internal/apimon_training/faq/index.rst b/doc/source/internal/apimon_training/faq/index.rst
new file mode 100644
index 0000000..533da05
--- /dev/null
+++ b/doc/source/internal/apimon_training/faq/index.rst
@@ -0,0 +1,10 @@
+==========================
+Frequently Asked Questions
+==========================
+
+.. toctree::
+   :maxdepth: 1
+
+   how_can_i_access_dashboard
+   how_to_read_the_logs_and_understand_the_issue
+   what_are_the_annotations
diff --git a/doc/source/internal/apimon_training/faq/what_are_the_annotations.rst b/doc/source/internal/apimon_training/faq/what_are_the_annotations.rst
new file mode 100644
index 0000000..1d0b539
--- /dev/null
+++ b/doc/source/internal/apimon_training/faq/what_are_the_annotations.rst
@@ -0,0 +1,22 @@
+#########################
+What Are The Annotations?
+#########################
+
+Annotations provide a way to mark points on the graph with rich events. When you
+hover over an annotation you can see the event description and event tags. The
+text field can include links to other systems with more detail.
+
+.. image:: faq_images/annotations.jpg
+
+
+In the ApiMon dashboards, annotations are used to show JIRA change issues when
+they transition from SCHEDULED to IN EXECUTION. This helps to identify in real
+time whether some JIRA change has a negative impact on the platform. The
+annotations contain several fields which help to correlate the platform behavior
+with the respective change directly on the dashboard:
+
+ - JIRA Change issue ID
+ - Impacted Availability Zone
+ - Affected Environment
+ - Main component
+ - Summary
diff --git a/doc/source/internal/apimon_training/index.rst b/doc/source/internal/apimon_training/index.rst
new file mode 100644
index 0000000..7c74ebf
--- /dev/null
+++ b/doc/source/internal/apimon_training/index.rst
@@ -0,0 +1,21 @@
+===================
+ApiMon Training
+===================
+
+.. toctree::
+   :maxdepth: 1
+
+   introduction
+   workflow
+   monitoring_coverage
+   test_scenarios
+   epmon_checks
+   dashboards
+   metrics
+   databases
+   alerts
+   notifications
+   logs
+   difference_cmo_fmo
+   contact
+   faq/index
diff --git a/doc/source/internal/apimon_training/introduction.rst b/doc/source/internal/apimon_training/introduction.rst
new file mode 100644
index 0000000..67e5e36
--- /dev/null
+++ b/doc/source/internal/apimon_training/introduction.rst
@@ -0,0 +1,108 @@
+============
+Introduction
+============
+
+The Open Telekom Cloud is represented to users and customers by the API
+endpoints and the various services behind them. Users and operators are
+interested in a reliable way to check and verify if the services are actually
+available to them via the Internet. While internal monitoring checks on the OTC
+backplane are necessary, they are not sufficient to detect failures that
+manifest in the interface, network connectivity, or the API logic itself. Also
+helpful, but not sufficient, are simple HTTP requests to the REST endpoints
+checking for 200 status codes.
+
+ApiMon is an Open Telekom Cloud product developed by the
+Ecosystem squad.
+
+The ApiMon (a.k.a. API-Monitoring) project:
+
+ - Was developed with the aim of supervising the public APIs of the OTC
+   platform 24/7.
+ - Repeatedly sends requests to the API.
+ - Groups the requests into so-called scenarios, mimicking real-world use
+   cases.
+ - Implements the use cases as Ansible playbooks.
+ - Is easy to extend for other use cases like
+   monitoring the provisioning of extra VMs or deploying extra software.
+
+
+.. image:: https://stackmon.github.io/assets/images/solution-diagram.svg
+
+ApiMon Architecture Summary
+---------------------------
+
+ - Test scenarios are implemented as Ansible playbooks and pushed to
+   `Github `_.
+
+ - EpMon executes various HTTP GET requests towards service endpoints and
+   generates statistics.
+ - The Scheduler fetches the latest playbooks from the repo and puts them in a
+   queue to run in an endless loop.
+ - The Executor runs the playbooks from the queue and captures the metrics.
+ - The Ansible playbook results generate the metrics (duration, result).
+ - Test scenario metrics are sent to the PostgreSQL relational database.
+ - The HTTP request metrics (generated by the OpenStack SDK) are collected by
+   statsd.
+ - The time-series database (Graphite) pulls metrics from statsd.
+ - Grafana dashboards visualize data from PostgreSQL and Graphite.
+ - Alerta is used for raising alarms when an API times out, returns an error,
+   or its response time exceeds a threshold.
+ - Alerta further sends error notifications to the Zulip #Alerts stream.
+ - Log files are maintained on OTC object storage via Swift.
+
+ApiMon features
+---------------
+
+ApiMon comes with the following features:
+
+- Support of Ansible playbooks for testing scenarios
+- Support of HTTP requests (GET) for Endpoint Monitoring
+- Support of TSDB and RDB
+- Support of all OTC environments
+
+  - EU-DE
+  - EU-NL
+  - Swisscloud
+  - PREPROD
+
+- Support of multiple monitoring sources:
+
+  - internal (OTC)
+  - external (vCloud)
+
+- Alerts aggregated in Alerta and notifications sent to Zulip
+- Various dashboards
+
+  - KPI dashboards
+  - 24/7 squad dashboards
+  - General test results dashboards
+  - Specific squad/service based dashboards
+
+- Each squad can control and manage their test scenarios and dashboards
+- Every execution of an Ansible playbook stores its log file for further
+  investigation/analysis on Swift object storage
+
+
+What ApiMon is NOT
+------------------
+
+The following items are out of scope (while some of them are technically
+possible):
+
+- No performance monitoring: The API-Monitoring does not measure performance
+  degradations per se. So measuring the access times or data transfer rates of
+  an SSD disk is out of scope. However, if the performance of a resource drops
+  below some threshold that is considered equivalent to non-availability, this
+  is reported.
+- No application monitoring: The service availability of applications that run
+  on top of IaaS or PaaS of the cloud is out of scope.
+- No view from inside: The API-Monitoring has no internal backplane insights and
+  only uses public APIs of the monitored cloud. It thus requires no
+  administrative permissions on the backend. It can, however, additionally be
+  deployed in the backplane to monitor internal APIs.
+- No synthetic workloads: The service does not simulate any workloads (for
+  example a benchmark suite) on the provisioned resources. Instead it measures
+  and reports only whether APIs are available and return expected results with
+  the expected behavior.
+- No monitoring of every single API: The API-Monitoring focuses on the basic API
+  functionality of selected components. It doesn't cover every single API call
+  available in the OTC API product portfolio.
diff --git a/doc/source/internal/apimon_training/logs.rst b/doc/source/internal/apimon_training/logs.rst
new file mode 100644
index 0000000..68d46f9
--- /dev/null
+++ b/doc/source/internal/apimon_training/logs.rst
@@ -0,0 +1,45 @@
+.. _logs:
+
+====
+Logs
+====
+
+
+- Every single job run log is stored on OpenStack Swift object storage.
+- Each job log file has a unique URL which can be accessed to see the log
+  details.
+- These URLs are available on all ApiMon levels:
+
+  - In Zulip alarm messages
+  - In Alerta events
+  - In Grafana dashboards
+
+- Logs are simple plain-text files of the whole playbook output::
+
+   2020-07-12 05:54:04.661170 | TASK [List Servers]
+   2020-07-12 05:54:09.050491 | localhost | ok
+   2020-07-12 05:54:09.067582 | TASK [Create Server in default AZ]
+   2020-07-12 05:54:46.055650 | localhost | MODULE FAILURE:
+   2020-07-12 05:54:46.055873 | localhost | Traceback (most recent call last):
+   2020-07-12 05:54:46.057441 | localhost |
+   2020-07-12 05:54:46.057499 | localhost | During handling of the above exception, another exception occurred:
+   2020-07-12 05:54:46.057535 | localhost |
+   …
+   2020-07-12 05:54:46.063992 | localhost |   File "/tmp/ansible_os_server_payload_uz1c7_iw/ansible_os_server_payload.zip/ansible/modules/cloud/openstack/os_server.py", line 500, in _create_server
+   2020-07-12 05:54:46.065152 | localhost |     return self._send_request(
+   2020-07-12 05:54:46.065186 | localhost |   File "/root/.local/lib/python3.8/site-packages/keystoneauth1/session.py", line 1020, in _send_request
+   2020-07-12 05:54:46.065334 | localhost |     raise exceptions.ConnectFailure(msg)
+   2020-07-12 05:54:46.065378 | localhost | keystoneauth1.exceptions.connection.ConnectFailure: Unable to establish connection to https://ims.eu-de.otctest.t-systems.com/v2/images: ('Connection aborted.', OSError(107, 'Transport endpoint is not connected'))
+   2020-07-12 05:54:46.295035 |
+   2020-07-12 05:54:46.295241 | TASK [Delete server]
+   2020-07-12 05:54:48.481374 | localhost | ok
+   2020-07-12 05:54:48.505761 |
+   2020-07-12 05:54:48.505906 | TASK [Delete SecurityGroup]
+   2020-07-12 05:54:50.727174 | localhost | changed
+   2020-07-12 05:54:50.745541 |
+
+
+For further details on how to work with logs please refer to the
+:ref:`How To Read The Logs And Understand The Issue ` FAQ
+page.
+
diff --git a/doc/source/internal/apimon_training/metrics.rst b/doc/source/internal/apimon_training/metrics.rst
new file mode 100644
index 0000000..ebf7e7c
--- /dev/null
+++ b/doc/source/internal/apimon_training/metrics.rst
@@ -0,0 +1,57 @@
+.. _metrics_definition:
+
+=======
+Metrics
+=======
+
+The Ansible playbook scenarios generate metrics in two ways:
+
+- The Ansible playbook internally invokes method calls to the **OpenStack SDK
+  libraries**. They in turn generate metrics about each API call they make. This
+  requires some special configuration in the clouds.yaml file (currently,
+  exposing metrics to statsd and InfluxDB is supported). For details refer
+  to the `config
+  documentation `_
+  of the OpenStack SDK. The following metrics are captured:
+
+  - response HTTP code
+  - duration of API call
+  - name of API call
+  - method of API call
+  - service type
+
+- Ansible plugins may **expose additional metrics** (i.e. whether the overall
+  scenario succeeded or not) with the help of the `callback
+  plugin `_.
+  Since sometimes it is not sufficient to know only the timings of each API
+  call, Ansible callbacks are utilized to report overall execution time and
+  result (whether the scenario succeeded and how long it took). The following
+  metrics are captured:
+
+  - test case
+  - playbook name
+  - environment
+  - action name
+  - result code
+  - result string
+  - service type
+  - state type
+  - total amount of (failed, passed, ignored, skipped tests)
+
+Custom metrics:
+
+In some situations, more complex metric generation is required which spans the
+execution of multiple tasks in a scenario. For such cases, the tags parameter is
+used. Once the specific tasks in a playbook are tagged with a specific metric
+name, the metric is calculated as the sum of all executed tasks with the
+respective tag. This is useful in cases where the measured metric covers
+multiple steps to achieve the desired state of a service or service resource,
+for example booting up a virtual machine from deployment until successful login
+via SSH.
+
+.. code-block::
+
+   tags: ["metric=delete_server"]
+   tags: ["az={{ availability_zone }}", "service=compute", "metric=create_server{{ metric_suffix }}"]
+
+More details on how to query metrics from the databases are described on the
+:ref:`Metric databases ` page.
diff --git a/doc/source/internal/apimon_training/monitoring_coverage.rst b/doc/source/internal/apimon_training/monitoring_coverage.rst
new file mode 100644
index 0000000..f7d225b
--- /dev/null
+++ b/doc/source/internal/apimon_training/monitoring_coverage.rst
@@ -0,0 +1,51 @@
+===================
+Monitoring coverage
+===================
+
+Multiple factors define the monitoring coverage to simulate common customer use
+cases.
+
+
+Monitored locations
+###################
+
+* EU-DE
+* EU-NL
+* PREPROD (EU_DE)
+* EU-CH2 (Swisscloud)
+
+
+Monitoring sources
+##################
+
+* Inside OTC (eu-de, eu-ch2)
+* Outside OTC (Swisscloud)
+
+
+Monitored targets
+#################
+
+* Endpoints and HTTP query requests
+
+  * all services
+  * multiple GET queries
+
+* Static resources
+
+  * specific services
+  * availability of the resource or resource functionality
+
+* Dynamic resources
+
+  * ansible playbooks
+  * specific services
+  * monitoring of the most common use cases in cloud services
+
+
+Monitoring dashboards
+#####################
+
+* KPI dashboards
+* 24/7 dashboards
+* Test results dashboards
+* Specific service dashboards
diff --git a/doc/source/internal/apimon_training/notifications.rst b/doc/source/internal/apimon_training/notifications.rst
new file mode 100644
index 0000000..123c0bd
--- /dev/null
+++ b/doc/source/internal/apimon_training/notifications.rst
@@ -0,0 +1,68 @@
+=============
+Notifications
+=============
+
+Zulip, as the official OTC communication channel, supports an API interface for
+pushing notifications from ApiMon to various Zulip streams:
+
+ - #Alerts Stream
+ - #Alerts-Hybrid Stream
+ - #Alerts-Preprod Stream
+
+Every stream contains topics based on the service type (if represented by a
+standalone Ansible playbook) and the general apimon_endpoint_monitor topic which
+contains alerts of GET queries towards all services.
+
+
+.. image:: training_images/zulip_notifications.png
+
+
+If an error has been acknowledged in Alerta, no new notification message for the
+repeating error will be posted to Zulip.
+
+Notifications contain further details which help to identify the root cause
+faster and more effectively.
+
+Notification parameters
+#######################
+
+The ApiMon notification consists of several fields:
+
++---------------------------+------------------------------------------------------------------------+
+| Notification Field        | Description                                                            |
++===========================+========================================================================+
+| **APIMon Alert link**     | Reference to alert in Alerta                                           |
++---------------------------+------------------------------------------------------------------------+
+| **Status**                | Status of the alert in Alerta                                          |
++---------------------------+------------------------------------------------------------------------+
+| **Environment**           | Information about affected environment/region                          |
++---------------------------+------------------------------------------------------------------------+
+| **Severity**              | Severity of the alarm                                                  |
++---------------------------+------------------------------------------------------------------------+
+| **Origin**                | Information about origin location from where the job has been executed |
++---------------------------+------------------------------------------------------------------------+
+| **Service**               | Information about affected service and type of monitoring              |
++---------------------------+------------------------------------------------------------------------+
+| **Resource**              | Further details on the particular resource in which the issue happened |
++---------------------------+------------------------------------------------------------------------+
+| **Error message Summary** | Short description of error result                                      |
++---------------------------+------------------------------------------------------------------------+
+| **Execution Log link**    | Reference to job execution output on Swift object storage              |
++---------------------------+------------------------------------------------------------------------+
+
+The EpMon notification consists of several fields:
+
++----------------------------+------------------------------------------------------------------+
+| Notification Field         | Description                                                      |
++============================+==================================================================+
+| **APIMon Alert link**      | Reference to alert in Alerta                                     |
++----------------------------+------------------------------------------------------------------+
+| **Environment**            | Information about affected environment/region                    |
++----------------------------+------------------------------------------------------------------+
+| **Curl command**           | Interpreted request in curl format so the call can be reproduced |
++----------------------------+------------------------------------------------------------------+
+| **Request error response** | Error result of the requested API call                           |
++----------------------------+------------------------------------------------------------------+
+
+
+
diff --git a/doc/source/internal/apimon_training/test_scenarios.rst b/doc/source/internal/apimon_training/test_scenarios.rst
new file mode 100644
index 0000000..b87c3de
--- /dev/null
+++ b/doc/source/internal/apimon_training/test_scenarios.rst
@@ -0,0 +1,199 @@
+.. _test_scenarios:
+
+==============
+Test Scenarios
+==============
+
+
+The Executor role of each API-Monitoring environment is responsible for
+executing individual jobs (scenarios). Those can be defined as Ansible playbooks
+(which allows them to be pretty much anything) or in any other executable form
+(e.g. a Python script).
+Since Ansible on its own has nearly limitless capabilities and can execute
+almost anything else, ApiMon can do pretty much anything. The only expectation
+is that whatever is being done produces some form of metric for further
+analysis and evaluation; otherwise there is no point in monitoring it. The
+scenarios are collected in a `Github
+`_ repository and updated in
+real time. In general, the mentioned test jobs do not need to take care of
+generating data themselves. Since the API-related tasks in the playbooks rely
+on the Python OpenStack SDK (and its OTC extensions), metric data is generated
+automatically by a logging interface of the SDK ('openstack_api' metrics).
+Those metrics are collected by statsd and stored in the :ref:`Graphite TSDB
+`.
+
+Additionally, metric data is also generated by the Executor service, which
+collects the playbook names, results and durations ('ansible_stats' metrics)
+and stores them in the :ref:`PostgreSQL relational database `.
+
+The playbooks with monitoring scenarios are stored in a separate repository on
+`Github `_ (the location
+will change with the CloudMon replacement in the `future
+`_). The playbooks address the most common use cases
+that end customers conduct with cloud services.
+
+The metrics generated by the Executor are described on the :ref:`Metric
+Definitions ` page.
+
+In addition to metrics generated and captured by a playbook, ApiMon also
+captures the :ref:`stdout of the execution ` and saves this log to
+OpenStack Swift storage for additional analysis; the logs are uploaded there
+with a configurable retention policy.
+
+
+New Test Scenario introduction
+==============================
+
+As already mentioned, the playbook scenarios are stored in a separate repository
+on `Github `_. Due to the
+fact that the various environments differ from each other by location,
+supported services, different flavors, etc., a monitoring configuration matrix
+is required which defines the monitoring standard and scope for each
+environment. Therefore, to enable a playbook in some of the monitored
+environments (PROD EU-DE, EU-NL, PREPROD, Swisscloud), a further update is
+required in the `monitoring matrix
+`_.
+This will also change in the future once `StackMon
+`_ takes over.
+
+
+Rules for Test Scenarios
+========================
+
+Ansible playbooks need to follow some basic regression testing principles to
+ensure the sustainability of the endless execution of such scenarios:
+
+- **OpenTelekomCloud and OpenStack collections**
+
+  - When developing test scenarios, use the available `Opentelekomcloud.Cloud
+    `_ or
+    `Openstack.Cloud
+    `_
+    collections for native interaction with the cloud in Ansible.
+  - In case there are features not supported by the collections, you can still
+    use the script module and directly call a Python SDK script to invoke the
+    required request towards the cloud.
+
+- **Unique names of resources**
+
+  - Make sure that resources don't conflict with each other and are easily
+    trackable by their unique names.
+
+- **Teardown of the resources**
+
+  - Make sure that deletion / cleanup of the resources is triggered even if
+    some of the tasks in the playbook fail.
+  - Make sure that deletion / cleanup is triggered in the right order.
+
+- **Simplicity**
+
+  - Do not over-complicate the test scenario. Use default auto-filled
+    parameters wherever possible.
+
+- **Only basic / core functions in scope of testing**
+
+  - ApiMon is not supposed to validate full service functionality. For such
+    cases there is a dedicated team / framework within the QA responsibility.
+  - Focus only on core functions which are critical for the basic operation /
+    lifecycle of the service.
+  - The fewer functions you use, the lower the potential failure rate of the
+    running scenario.
+
+- **No hardcoding**
+
+  - Every hardcoded parameter in a scenario can later break the scenario's
+    runs when that parameter changes.
+  - Try to obtain all such parameters dynamically from the cloud directly.
+
+- **Special tags for combined metrics**
+
+  - In case you want to combine multiple tasks in a playbook into a single
+    custom metric, you can do so using the tags parameter on the tasks.
+
+
+Custom metrics in Test Scenarios
+================================
+
+
+The OpenStack SDK and otcextensions (otcextensions covers services which are out
+of scope of the OpenStack SDK and extends its functionality with services
+provided by OTC) support metric generation natively for every single API call,
+and the ApiMon executor supports the collection of Ansible playbook statistics,
+so every single scenario and task can store its result, duration and name in
+the metric database.
+
+But in some cases there's a need to measure multiple tasks which together
+represent some important aspect of the customer use case, for example measuring
+the time and overall result from the VM deployment until successful login via
+SSH. Single task results are stored as metrics in the metric database, but it
+would be complicated to move the metric processing logic to Grafana. Therefore
+the tags feature on task level introduces the possibility to define custom
+metrics.
+
+
+In the following example (snippet from `scenario2_simple_ece.yaml
+`_) a
+custom metric stores the result of multiple tasks under the special metric name
+create_server::
+
+   - name: Create Server in default AZ
+     openstack.cloud.server:
+       auto_ip: false
+       name: "{{ test_server_fqdn }}"
+       image: "{{ test_image }}"
+       flavor: "{{ test_flavor }}"
+       key_name: "{{ test_keypair_name }}"
+       network: "{{ test_network_name }}"
+       security_groups: "{{ test_security_group_name }}"
+     tags:
+       - "metric=create_server"
+       - "az=default"
+     register: server
+
+   - name: get server id
+     set_fact:
+       server_id: "{{ server.id }}"
+
+   - name: Attach FIP
+     openstack.cloud.floating_ip:
+       server: "{{ server_id }}"
+     tags:
+       - "metric=create_server"
+       - "az=default"
+
+   - name: get server info
+     openstack.cloud.server_info:
+       server: "{{ server_id }}"
+     register: server
+     tags:
+       - "metric=create_server"
+       - "az=default"
+
+   - set_fact:
+       server_ip: "{{ server['openstack_servers'][0]['public_v4'] }}"
+     tags:
+       - "metric=create_server"
+       - "az=default"
+
+   - name: find servers by name
+     openstack.cloud.server_info:
+       server: "{{ test_server_fqdn }}"
+     register: servers
+     tags:
+       - "metric=create_server"
+       - "az=default"
+
+   - name: Debug server info
+     debug:
+       var: servers
+
+   # Wait for the server to really start and become accessible
+   - name: Wait for SSH port to become active
+     wait_for:
+       port: 22
+       host: "{{ server_ip }}"
+       timeout: 600
+     tags: ["az=default", "service=compute", "metric=create_server"]
+
+   - name: Try connecting
+     retries: 10
+     delay: 1
+     command: "ssh -o 'UserKnownHostsFile=/dev/null' -o 'StrictHostKeyChecking=no' linux@{{ server_ip }} -i ~/.ssh/{{ test_keypair_name }}.pem"
+     tags: ["az=default", "service=compute", "metric=create_server"]
+
diff --git a/doc/source/internal/apimon_training/training_images/24_7_dashboard.jpg b/doc/source/internal/apimon_training/training_images/24_7_dashboard.jpg
new file mode 100644
index 0000000..65f2ac6
Binary files /dev/null and b/doc/source/internal/apimon_training/training_images/24_7_dashboard.jpg differ
diff --git a/doc/source/internal/apimon_training/training_images/alerta_alerts.png b/doc/source/internal/apimon_training/training_images/alerta_alerts.png
new file mode 100644
index 0000000..31516de
Binary files /dev/null and b/doc/source/internal/apimon_training/training_images/alerta_alerts.png differ
diff --git a/doc/source/internal/apimon_training/training_images/alerta_dashboard.png b/doc/source/internal/apimon_training/training_images/alerta_dashboard.png
new file mode 100644
index 0000000..c255812
Binary files /dev/null and b/doc/source/internal/apimon_training/training_images/alerta_dashboard.png differ
diff --git a/doc/source/internal/apimon_training/training_images/apimon_data_flow.svg b/doc/source/internal/apimon_training/training_images/apimon_data_flow.svg
new file mode 100644
index 0000000..a57638b
--- /dev/null
+++ b/doc/source/internal/apimon_training/training_images/apimon_data_flow.svg
@@ -0,0 +1,4 @@

+[draw.io SVG asset: ApiMon data flow diagram. Recoverable labels: Github (apimon tests repository) → pull repository → Scheduler (running 8 parallel threads in an endless loop; fills playbooks into the thread queue, adds the next playbook when a thread is free, removes completed playbooks) → Executor (Ansible; executes the playbooks, stores the job logs to Swift object storage, raises an alert if a playbook failed) → Statsd (collects the metrics) → Graphite TSDB / Postgresql RDB (test results) → Grafana dashboard (data sources; creates alerts based on thresholds) → Alerta dashboard → Zulip (Alerts, Alerts-Hybrid, Alerts-Preprod notifications); actors: Service Squad (management), O/M; steps numbered 1-11.]
\ No newline at end of file
diff --git a/doc/source/internal/apimon_training/training_images/apimon_test_results.jpg b/doc/source/internal/apimon_training/training_images/apimon_test_results.jpg
new file mode 100644
index 0000000..f6eb863
Binary files /dev/null and b/doc/source/internal/apimon_training/training_images/apimon_test_results.jpg differ
diff --git a/doc/source/internal/apimon_training/training_images/compute_service_statistics_1.jpg b/doc/source/internal/apimon_training/training_images/compute_service_statistics_1.jpg
new file mode 100644
index 0000000..8275ea5
Binary files /dev/null and b/doc/source/internal/apimon_training/training_images/compute_service_statistics_1.jpg differ
diff --git a/doc/source/internal/apimon_training/training_images/compute_service_statistics_2.jpg b/doc/source/internal/apimon_training/training_images/compute_service_statistics_2.jpg
new file mode 100644
index 0000000..fc96990
Binary files /dev/null and b/doc/source/internal/apimon_training/training_images/compute_service_statistics_2.jpg differ
diff --git a/doc/source/internal/apimon_training/training_images/dashboards.png b/doc/source/internal/apimon_training/training_images/dashboards.png
new file mode 100644
index 0000000..3237d0a
Binary files /dev/null and b/doc/source/internal/apimon_training/training_images/dashboards.png differ
diff --git a/doc/source/internal/apimon_training/training_images/epmon_dashboard_details.jpg b/doc/source/internal/apimon_training/training_images/epmon_dashboard_details.jpg
new file mode 100644
index 0000000..9b61729
Binary files /dev/null and b/doc/source/internal/apimon_training/training_images/epmon_dashboard_details.jpg differ
diff --git a/doc/source/internal/apimon_training/training_images/epmon_status_dashboard.jpg b/doc/source/internal/apimon_training/training_images/epmon_status_dashboard.jpg
new file mode 100644
index 0000000..414b40a
Binary files /dev/null and b/doc/source/internal/apimon_training/training_images/epmon_status_dashboard.jpg differ
diff --git a/doc/source/internal/apimon_training/training_images/graphite_query.jpg b/doc/source/internal/apimon_training/training_images/graphite_query.jpg
new file mode 100644
index 0000000..1321faa
Binary files /dev/null and b/doc/source/internal/apimon_training/training_images/graphite_query.jpg differ
diff --git a/doc/source/internal/apimon_training/training_images/kpi_dashboard.png b/doc/source/internal/apimon_training/training_images/kpi_dashboard.png
new file mode 100644
index 0000000..e179a98
Binary files /dev/null and b/doc/source/internal/apimon_training/training_images/kpi_dashboard.png differ
diff --git a/doc/source/internal/apimon_training/training_images/postgresql_query.jpg b/doc/source/internal/apimon_training/training_images/postgresql_query.jpg
new file mode 100644
index 0000000..9ecbff9
Binary files /dev/null and b/doc/source/internal/apimon_training/training_images/postgresql_query.jpg differ
diff --git a/doc/source/internal/apimon_training/training_images/zulip_notifications.png b/doc/source/internal/apimon_training/training_images/zulip_notifications.png
new file mode 100644
index 0000000..024f644
Binary files /dev/null and b/doc/source/internal/apimon_training/training_images/zulip_notifications.png differ
diff --git a/doc/source/internal/apimon_training/workflow.rst b/doc/source/internal/apimon_training/workflow.rst
new file mode 100644
index 0000000..3fa1c6d
--- /dev/null
+++ b/doc/source/internal/apimon_training/workflow.rst
@@ -0,0 +1,28 @@
+.. _apimon_flow:
+
+ApiMon Flow Process
+===================
+
+
+.. image:: training_images/apimon_data_flow.svg
+   :target: training_images/apimon_data_flow.svg
+   :alt: apimon_data_flow
+
+
+#. A service squad adds a test scenario to the GitHub repository.
+#. The Scheduler fetches the test scenarios from GitHub and adds them to the queue.
+#. The Executor plays the Ansible test scenario playbooks. Up to 8 parallel
+   threads are enabled.
+#. A test scenario which has finished is removed from its thread, and the next
+   playbook in the queue is added to the free thread. The previous playbook is
+   re-added to the queue at the last position.
+#. Test scenario statistics are stored in the PostgreSQL database.
+#. Metrics from HTTP requests are collected by statsd.
+#. Collected metrics are stored in the Graphite time-series database.
+#. Grafana uses the metrics and statistics databases as the data sources for
+   the dashboards. The dashboards with various panels show the real-time status
+   of the platform. Grafana also supports historical views and trends.
+#. Breached thresholds as well as failed test scenarios result in alerts being
+   generated in Alerta.
+#. Notifications containing the alert details are sent to Zulip.
+#. Every test scenario stores its job output log in Swift object storage for
+   further analysis and investigation.
+
diff --git a/doc/source/internal/index.rst b/doc/source/internal/index.rst
index b4cd379..1a4be59 100644
--- a/doc/source/internal/index.rst
+++ b/doc/source/internal/index.rst
@@ -6,3 +6,4 @@ Internal Documentation
    :maxdepth: 1
 
    helpcenter_training/index
+   apimon_training/index