adding new content for ApiMon training

Reviewed-by: gtema <artem.goncharov@gmail.com>
Co-authored-by: Hasko, Vladimir <vladimir.hasko@t-systems.com>
Co-committed-by: Hasko, Vladimir <vladimir.hasko@t-systems.com>
This commit is contained in:
Hasko, Vladimir 2023-05-25 13:25:47 +00:00 committed by zuul
parent 12d8d43edb
commit 2838ebed03
37 changed files with 1124 additions and 0 deletions

View File

@ -0,0 +1,31 @@
======
Alerts
======
Alerta is the component of the ApiMon that is designed to integrate alerts
from multiple sources. It supports many different standard sources like Syslog,
SNMP, Prometheus, Nagios, Zabbix, etc. Additionally any other type of source
using URL request or command line can be integrated as well.
Native functions like correlation and de-duplication help to manage thousands of
alerts in transparent way and consolidate alerts in proper categories based on
environment, service, resource, failure type, etc.
Alerta is hosted on https://alerts.eco.tsi-dev.otc-service.com/ .
The authentication is centrally managed by OTC LDAP.
The Zulip API was integrated with Alerta, to send notification of errors/alerts
on Zulip stream.
Alerts displayed on OTC Alerta are generated either by Executor, Scheduler,
EpMon or by Grafana.
- “Executor alerts” focus on playbook results, whether playbook has completed
or failed.
- “Grafana alerts” focus on breaching the defined thresholds. For example API
response time is higher than defined threshold.
- "Scheduler alerts" TBD
- "EpMon alerts" provide information about failed endpoint queries with details
of the request in curl form and the respective error response details
.. image:: training_images/alerta_dashboard.png

View File

@ -0,0 +1,29 @@
Contact - Whom to address for Feedback?
=======================================
In case you have any feedback, proposals or found any issues regarding the
ApiMon, EpMon or CloudMon, you can address them in the corresponding GitHub
OpenTelekomCloud-Infra repositories or StackMon repositories.
Issues or feedback regarding the **ApiMon, EpMon, Status Dashboard, Metric
processor** as well as new feature requests can be addressed by filing an issue
on the **Gihub** repository under
https://github.com/opentelekomcloud-infra/system-config/blob/main/inventory/service/group_vars/apimon.yaml (CMO)
https://github.com/opentelekomcloud-infra/stackmon-config (FMO)
If you have found any problems which affects the **ApiMon dashboard design**
please open an issue/PR on **GitHub**
https://github.com/opentelekomcloud-infra/system-config/tree/main/playbooks/templates/grafana/apimon (CMO)
https://github.com/stackmon/apimon-tests (FMO)
If you have found any problems which affects the **ApiMon playbook scenarios**
please open an issue/PR on **GitHub**
https://github.com/opentelekomcloud-infra/apimon-tests (CMO)
https://github.com/stackmon/apimon-tests (FMO).
If there is another issue/demand/request try to locate proper repository in
https://github.com/orgs/stackmon/repositories
For general questions you can write an E-Mail to the `Ecosystems Squad
<mailto:dl-pbcotcdeleco@t-systems.com>`_.

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

View File

@ -0,0 +1,34 @@
.. _difference_apimon_cmo_fmo:
===================================
Difference ApiMon(CMO)/ApiMon(FMO)
===================================
Due to the ongoing transformation of ApiMon and integration to a more robust
CloudMon there are two operation modes right now. Therefore it's important to
understand what is supported in which mode.
This page aims to provide navigation links and understand the changes once the
transformation is completed and some of the locations will change.
The most important differences are described in the table below:
+---------------------+------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------+
| **Differences** | **ApiMon (CMO)** | **ApiMon(FMO)** |
+=====================+============================================================================================================+==========================================================================+
| Playbook scenarios | https://github.com/opentelekomcloud-infra/apimon-test | https://github.com/stackmon/apimon-tests/tree/main/playbooks |
+---------------------+------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------+
| Dashboards setup | https://github.com/opentelekomcloud-infra/system-config/tree/main/playbooks/templates/grafana/apimon | https://github.com/stackmon/apimon-tests/tree/main/dashboards |
+---------------------+------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------+
| Environment setup | https://github.com/opentelekomcloud-infra/system-config/blob/main/inventory/service/group_vars/apimon.yaml | https://github.com/opentelekomcloud-infra/stackmon-config |
+---------------------+------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------+
| Implementation mode | standalone app | plugin based |
+---------------------+------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------+
| Organization | opentelekomcloud-infra | stackmon |
+---------------------+------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------+
| Dashboards | https://dashboard.tsi-dev.otc-service.com/ | https://dashboard.tsi-dev.otc-service.com/ |
| | https://dashboard.tsi-dev.otc-service.com/dashboards/f/UaB8meoZk/apimon | https://dashboard.tsi-dev.otc-service.com/dashboards/f/CloudMon/cloudmon |
+---------------------+------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------+
| Documentation | https://confluence.tsi-dev.otc-service.com/display/ES/API-Monitoring | https://stackmon.github.io/ |
| | | https://stackmon-cloudmon.readthedocs.io/en/latest/index.html |
+---------------------+------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------+

View File

@ -0,0 +1,40 @@
.. _epmon_overview:
============================
Endpoint Monitoring overview
============================
EpMon is a standalone python based process targeting every OTC service. It
finds service in the service catalogs and sends GET requests to the configured
endpoints.
Performing extensive tests like provisioning a server is giving a great
coverage, but is usually not something what can be performed very often and
leaves certain gaps on the timescale of monitoring. In order to cover this gap
EpMon component is capable to send GET requests to the given URLs relying on the
API discovery of the OpenStack cloud (perform GET request to /servers or the
compute endpoint). Such requests are cheap and can be performed in the loop, i.e.
every 5 seconds. Latency of those calls, as well as the return codes, are being
captured and sent to the metrics storage.
Currently EpMon configuration is located in system-config:
https://github.com/opentelekomcloud-infra/system-config/blob/main/inventory/service/group_vars/apimon.yaml
(this will change in future once CloudMon will take place)
And defines the query HTTP targets for every single OTC service.
EpMon dashboard provides general availability status of every service definition
from service catalog:
.. image:: training_images/epmon_status_dashboard.jpg
Additionally it provides further details for the endpoints like response times,
detected error codes or no responses at all.
.. image:: training_images/epmon_dashboard_details.jpg
EpMon findings are also reported to Alerta and notifications are sent to Zulip
dedicated topic "apimon_endpoint_monitoring".

Binary file not shown.

After

Width:  |  Height:  |  Size: 109 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 60 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 68 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 16 KiB

View File

@ -0,0 +1,7 @@
============================
How Can I Access Dashboard ?
============================
OTC LDAP authentication is supported on
https://dashboard.tsi-dev.otc-service.com.

View File

@ -0,0 +1,80 @@
.. _working_with_logs:
=============================================
How To Read The Logs And Understand The Issue
=============================================
Logs are stored on swift OBS and they expire after ~1 week. The logs are can be
accessed from multiple locations:
- Zulip notifications:
.. image:: faq_images/zulip_notification_links.jpg
- Alerts in Alerta
.. image:: faq_images/alerta_alerts_detail.png
- Tables in dashboards
.. image:: faq_images/dashboard_log_links.jpg
The logs contain whole ansible playbook output and help to analyze the problem
in detail.
For example following log detail describes the failed scenario for ECS deployment::
2023-05-17 21:08:09.038955 | TASK [server_create_delete : Try connecting]
2023-05-17 21:08:09.485569 | localhost | ERROR
2023-05-17 21:08:09.485862 | localhost | {
2023-05-17 21:08:09.485922 | localhost | "changed": true,
2023-05-17 21:08:09.485950 | localhost | "cmd": [
2023-05-17 21:08:09.485984 | localhost | "ssh",
2023-05-17 21:08:09.486016 | localhost | "-o",
2023-05-17 21:08:09.486052 | localhost | "UserKnownHostsFile=/dev/null",
2023-05-17 21:08:09.486076 | localhost | "-o",
2023-05-17 21:08:09.486097 | localhost | "StrictHostKeyChecking=no",
2023-05-17 21:08:09.486118 | localhost | "linux@80.158.60.117",
2023-05-17 21:08:09.486138 | localhost | "-i",
2023-05-17 21:08:09.486160 | localhost | "~/.ssh/scenario2a-162b6915911748c5809474be69d2a3b3-kp.pem"
2023-05-17 21:08:09.486192 | localhost | ],
2023-05-17 21:08:09.486221 | localhost | "delta": "0:00:00.127394",
2023-05-17 21:08:09.486242 | localhost | "end": "2023-05-17 21:08:09.454247",
2023-05-17 21:08:09.486262 | localhost | "invocation": {
2023-05-17 21:08:09.486283 | localhost | "module_args": {
2023-05-17 21:08:09.486314 | localhost | "_raw_params": "ssh -o 'UserKnownHostsFile=/dev/null' -o 'StrictHostKeyChecking=no' linux@80.158.60.117 -i ~/.ssh/scenario2a-162b6915911748c5809474be69d2a3b3-kp.pem",
2023-05-17 21:08:09.486373 | localhost | "_uses_shell": false,
2023-05-17 21:08:09.486397 | localhost | "argv": null,
2023-05-17 21:08:09.486428 | localhost | "chdir": null,
2023-05-17 21:08:09.486455 | localhost | "creates": null,
2023-05-17 21:08:09.486487 | localhost | "executable": null,
2023-05-17 21:08:09.486513 | localhost | "removes": null,
2023-05-17 21:08:09.486533 | localhost | "stdin": null,
2023-05-17 21:08:09.486553 | localhost | "stdin_add_newline": true,
2023-05-17 21:08:09.486573 | localhost | "strip_empty_ends": true,
2023-05-17 21:08:09.486593 | localhost | "warn": false
2023-05-17 21:08:09.486613 | localhost | }
2023-05-17 21:08:09.486633 | localhost | },
2023-05-17 21:08:09.486657 | localhost | "msg": "non-zero return code",
2023-05-17 21:08:09.486689 | localhost | "rc": 255,
2023-05-17 21:08:09.486713 | localhost | "start": "2023-05-17 21:08:09.326853",
2023-05-17 21:08:09.486734 | localhost | "stderr": "Pseudo-terminal will not be allocated because stdin is not a terminal.\r\nWarning: Permanently added '80.158.60.117' (ED25519) to the list of known hosts.\r\nlinux@80.158.60.117: Permission denied (publickey).",
2023-05-17 21:08:09.486755 | localhost | "stderr_lines": [
2023-05-17 21:08:09.486776 | localhost | "Pseudo-terminal will not be allocated because stdin is not a terminal.",
2023-05-17 21:08:09.486808 | localhost | "Warning: Permanently added '80.158.60.117' (ED25519) to the list of known hosts.",
2023-05-17 21:08:09.486834 | localhost | "linux@80.158.60.117: Permission denied (publickey)."
2023-05-17 21:08:09.486855 | localhost | ]
2023-05-17 21:08:09.486875 | localhost | }
In this case it seems that deployed ECS doesn't contain injected public SSH key
which can point to issue with cloud init or metadata server.
The playbooks can be run also manually on any OTC tenant and can be used
for further investigation and analysis.

View File

@ -0,0 +1,10 @@
==========================
Frequently Asked Questions
==========================
.. toctree::
:maxdepth: 1
how_can_i_access_dashboard
how_to_read_the_logs_and_understand_the_issue
what_are_the_annotations

View File

@ -0,0 +1,22 @@
#########################
What Are The Annotations?
#########################
Annotations provide a way to mark points on the graph with rich events. When you
hover over an annotation you can get event description and event tags. The text
field can include links to other systems with more detail.
.. image:: faq_images/annotations.jpg
In ApiMon Dashboards annotations are used to show the JIRA change issue types
which change the transition from SCHEDULED to IN EXECUTION. This helps to
identify if some JIRA change has negative impact on platform in real time. The
annotations contain several fields which help to correlate the platform behavior
with the respective change directly on the dashboard:
- JIRA Change issue ID
- Impacted Availability Zone
- Affected Environment
- Main component
- Summary

View File

@ -0,0 +1,21 @@
===================
Apimon Training
===================
.. toctree::
:maxdepth: 1
introduction
workflow
monitoring_coverage
test_scenarios
epmon_checks
dashboards
metrics
databases
alerts
notifications
logs
difference_cmo_fmo
contact
faq/index

File diff suppressed because it is too large Load Diff

View File

@ -0,0 +1,45 @@
.. _logs:
====
Logs
====
- Every single job run log is stored on OpenStack Swift object storage.
- Each single job log file provides unique URL which can be accessed to see log
details
- These URLs are available on all ApiMon levels:
- In Zulip alarm messages
- In Alerta events
- In Grafana Dashboards
- Logs are simple plain text files of the whole playbook output::
2020-07-12 05:54:04.661170 | TASK [List Servers]
2020-07-12 05:54:09.050491 | localhost | ok
2020-07-12 05:54:09.067582 | TASK [Create Server in default AZ]
2020-07-12 05:54:46.055650 | localhost | MODULE FAILURE:
2020-07-12 05:54:46.055873 | localhost | Traceback (most recent call last):
2020-07-12 05:54:46.057441 | localhost |
2020-07-12 05:54:46.057499 | localhost | During handling of the above exception, another exception occurred:
2020-07-12 05:54:46.057535 | localhost |
2020-07-12 05:54:46.063992 | localhost | File "/tmp/ansible_os_server_payload_uz1c7_iw/ansible_os_server_payload.zip/ansible/modules/cloud/openstack/os_server.py", line 500, in _create_server
2020-07-12 05:54:46.065152 | localhost | return self._send_request(
2020-07-12 05:54:46.065186 | localhost | File "/root/.local/lib/python3.8/site-packages/keystoneauth1/session.py", line 1020, in _send_request
2020-07-12 05:54:46.065334 | localhost | raise exceptions.ConnectFailure(msg)
2020-07-12 05:54:46.065378 | localhost | keystoneauth1.exceptions.connection.ConnectFailure: Unable to establish connection to https://ims.eu-de.otctest.t-systems.com/v2/images: ('Connection aborted.', OSError(107, 'Transport endpoint is not connected'))
2020-07-12 05:54:46.295035 |
2020-07-12 05:54:46.295241 | TASK [Delete server]
2020-07-12 05:54:48.481374 | localhost | ok
2020-07-12 05:54:48.505761 |
2020-07-12 05:54:48.505906 | TASK [Delete SecurityGroup]
2020-07-12 05:54:50.727174 | localhost | changed
2020-07-12 05:54:50.745541 |
For further details how to work with logs please refer to
:ref:`How To Read The Logs And Understand The Issue <working_with_logs>` FAQ
page.

View File

@ -0,0 +1,57 @@
.. _metrics_definition:
=======
Metrics
=======
The Ansible playbook scenarios generate metrics in two ways:
- The Ansible playbook internally invokes method calls to **OpenStack SDK
libraries.** They in turn generate metrics about each API call they do. This
requires some special configuration in the clouds.yaml file (currently
exposing metrics into statsd and InfluxDB is supported). For details refer
to the `config
documentation <https://docs.openstack.org/openstacksdk/latest/user/guides/stats.html>`_
of the OpenStack SDK. The following metrics are captured:
- response HTTP code
- duration of API call
- name of API call
- method of API call
- service type
- Ansible plugins may **expose additional metrics** (i.e. whether the overall
scenario succeed or not) with help of `callback
plugin <https://github.com/stackmon/apimon/tree/main/apimon/ansible/callback>`_.
Since sometimes it is not sufficient to know only the timings of each API
call, Ansible callbacks are utilized to report overall execution time and
result (whether the scenario succeeded and how long it took). The following
metrics are captured:
- test case
- playbook name
- environment
- action name
- result code
- result string
- service type
- state type
- total amount of (failed, passed, ignored, skipped tests)
Custom metrics:
In some situations more complex metric generation is required which consists of
execution of multiple tasks in scenario. For such cases, the tags parameter is
used. Once the specific tasks in playbook are tagged with some specific metric
name the metrics are calculated as sum of all executed tasks with respective
tag. It's useful in cases where the measured metric contains multiple steps to
achieve the desired state of service or service resource. For example, boot up of
virtual machine from deployment until successful login via SSH.
.. code-block::
tags: ["metric=delete_server"]
tags: ["az={{ availability_zone }}", "service=compute", "metric=create_server{{ metric_suffix }}"]
More details how to query metrics from databases are described on :ref:`Metric
databases <metric_databases>` page.

View File

@ -0,0 +1,51 @@
===================
Monitoring coverage
===================
Multiple factors define the monitoring coverage to simulate common customer use
cases.
Monitored locations
###################
* EU-DE
* EU-NL
* PREPROD (EU_DE)
* EU-CH2 (Swisscloud)
Monitoring sources
##################
* Inside OTC (eu-de, eu-ch2)
* Outside OTC (Swisscloud)
Monitored targets
#################
* Endpoints and HTTP query requests
* all services
* multiple GET queries
* Static Resources
* specific services
* availability of the resource or resource functionality
* Dynamic resources
* ansible playbooks
* specific services
* monitoring of most common use cases in cloud services
Monitoring dashboards
#####################
* KPI dashboards
* 24/7 dashboards
* Test results dashboards
* Specific service dashboards

View File

@ -0,0 +1,68 @@
=============
Notifications
=============
Zulip as official OTC communication channel supports API interface for pushing
the notifications from ApiMon to various Zulip streams:
- #Alerts Stream
- #Alerts-Hybrid Stream
- #Alerts-Preprod Stream
Every stream contains topics based on the service type (if represented by
standalone Ansible playbook) and general apimon_endpoint_monitor topic which
contains alerts of GET queries towards all services.
.. image:: training_images/zulip_notifications.png
If the error has been acknowledged on Alerta, the new notification message for
repeating error won't get posted again on Zulip.
Notifications contain further details which help to identify root cause faster
and more effectively.
Notification parameters
#######################
The ApiMon notification consists of several fields:
+---------------------------+------------------------------------------------------------------------+
| Notification Field | Description |
+===========================+========================================================================+
| **APIMon Alert link** | Reference to alert in Alerta |
+---------------------------+------------------------------------------------------------------------+
| **Status** | Status of the alert in Alerta |
+---------------------------+------------------------------------------------------------------------+
| **Environment** | Information about affected environment/region |
+---------------------------+------------------------------------------------------------------------+
| **Severity** | Severity of the alarm |
+---------------------------+------------------------------------------------------------------------+
| **Origin** | Information about origin location from where the job has been executed |
+---------------------------+------------------------------------------------------------------------+
| **Service** | Information about affected service and type of monitoring |
+---------------------------+------------------------------------------------------------------------+
| **Resource** | Further details in which particular resource issue has happened |
+---------------------------+------------------------------------------------------------------------+
| **Error message Summary** | Short description of error result |
+---------------------------+------------------------------------------------------------------------+
| **Execution Log link** | Reference to job execution output on Swift object storage |
+---------------------------+------------------------------------------------------------------------+
Th EpMon notification consists of several fields:
+----------------------------+------------------------------------------------------------------+
| Notification Field | Description |
+============================+==================================================================+
| **APIMon Alert link** | Reference to alert in Alerta |
+----------------------------+------------------------------------------------------------------+
| **Environment** | Information about affected environment/region |
+----------------------------+------------------------------------------------------------------+
| **Curl command** | Interpreted request in curl format for reproducible applications |
+----------------------------+------------------------------------------------------------------+
| **Request error response** | Error result of the requested API call |
+----------------------------+------------------------------------------------------------------+

File diff suppressed because it is too large Load Diff

Binary file not shown.

After

Width:  |  Height:  |  Size: 188 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 42 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 56 KiB

File diff suppressed because one or more lines are too long

After

Width:  |  Height:  |  Size: 321 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 142 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 157 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 184 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 33 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 96 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 165 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 75 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 101 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 68 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 66 KiB

View File

@ -0,0 +1,28 @@
.. _apimon_flow:
ApiMon Flow Process
===================
.. image:: training_images/apimon_data_flow.svg
:target: training_images/apimon_data_flow.svg
:alt: apimon_data_flow
#. Service squad adds test scenario to github repository.
#. Scheduler fetches test scenarios from Github and add them to queue.
#. Executor plays Ansible test scenario playbooks. Up to 8 parallel threads are enabled
#. Test scenario which has finished is being removed from the thread and next
playbook in the queue is added to the free thread. The previous playbook is
added to the queue on the last position.
#. Test scenario statistics are stored in the Postgresql database.
#. Metrics from HTTP requests are collected by Statsd.
#. Collected metrics are stored in time-series database Graphite.
#. Grafana uses metrics and statistics databases as the data sources for the
dashboards. The dashboard with various panels show the real-time status of
the platform. Grafana supports also historical views and trends.
#. Breached thresholds as well as failed test scenarios result in generated
alerts on Alerta.
#. Notifications containing alert details are sent to Zulip
#. Every test scenario stores it's job output log into Swift object storage for further analysis and investigation.

View File

@ -6,3 +6,4 @@ Internal Documentation
:maxdepth: 1 :maxdepth: 1
helpcenter_training/index helpcenter_training/index
apimon_training/index