tischrei 0618989a8a hc_ops
Reviewed-by: Gode, Sebastian <sebastian.gode@t-systems.com>
Co-authored-by: tischrei <tino.schreiber@t-systems.com>
Co-committed-by: tischrei <tino.schreiber@t-systems.com>
2024-02-22 14:55:55 +00:00


============
Introduction
============
The Open Telekom Cloud is represented to users and customers by its API
endpoints and the various services behind them. Users and operators are
interested in a reliable way to check and verify whether the services are
actually available to them via the Internet. While internal monitoring checks
on the OTC backplane are necessary, they are not sufficient to detect failures
that manifest in the interface, network connectivity, or the API logic itself.
Simple HTTP requests to the REST endpoints that merely check for 200 status
codes are likewise helpful, but not sufficient.

ApiMon is an Open Telekom Cloud product developed by the Ecosystem squad.
The ApiMon (a.k.a. API-Monitoring) project:

- Was developed with the aim of supervising the public APIs of the OTC
  platform 24/7.
- Sends requests repeatedly to the API.
- Groups requests in so-called scenarios, mimicking real-world use cases.
- Implements use cases as Ansible playbooks.
- Is easy to extend for other use cases, such as monitoring the provisioning
  of extra VMs or deploying extra software.

.. image:: https://stackmon.github.io/assets/images/solution-diagram.svg

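
The scenario playbooks mentioned above are ordinary Ansible playbooks. A
minimal sketch of such a scenario is shown below; the play name, module
parameters, image, and flavor values are illustrative assumptions, not actual
content of the test repository:

```yaml
# Hypothetical ApiMon scenario sketch: create and then delete a test VM.
# All names, image, and flavor values are illustrative assumptions.
- name: Scenario - boot and delete a test VM
  hosts: localhost
  tasks:
    - name: Create the test server
      openstack.cloud.server:
        state: present
        name: apimon-test-vm
        image: Standard_Ubuntu_22.04_latest
        flavor: s3.medium.2

    - name: Delete the test server again
      openstack.cloud.server:
        state: absent
        name: apimon-test-vm
```

The duration and result of each task become the metrics described in the
architecture summary below.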

ApiMon Architecture Summary
---------------------------

- Test scenarios are implemented as Ansible playbooks and pushed to
  `GitHub <https://github.com/opentelekomcloud-infra/apimon-test>`_.
- EpMon executes various HTTP query requests towards service endpoints and
  generates statistics.
- The Scheduler fetches the latest playbooks from the repository and puts them
  into a queue to run in an endless loop.
- The Executor runs the playbooks from the queue and captures the metrics.
- The Ansible playbook results generate the metrics (duration, result).
- Test scenario metrics are sent to a PostgreSQL relational database.
- The HTTP request metrics (generated by the OpenStackSDK) are collected by
  StatsD.
- The time series database (Graphite) pulls metrics from StatsD.
- Grafana dashboards visualize data from PostgreSQL and Graphite.
- Alerta raises alarms when an API times out, returns an error, or its
  response time exceeds a threshold.
- Alerta further sends error notifications to the Zulip #Alerts stream.
- Log files are maintained on OTC Object Storage via Swift.
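
The StatsD leg of this pipeline speaks a simple UDP line protocol. The
following Python sketch shows how a timing metric could be encoded and sent;
the metric name and the default agent address are illustrative assumptions,
not the project's actual configuration:

```python
import socket


def statsd_timer(name, duration_ms):
    """Encode a timing sample in the plain StatsD line protocol."""
    return f"{name}:{int(duration_ms)}|ms".encode("ascii")


def send_metric(payload, host="localhost", port=8125):
    """Fire-and-forget UDP send, as StatsD clients typically do."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


# Example: report a playbook step that took 1.234 seconds.
payload = statsd_timer("apimon.scenario.create_server.duration", 1234)
# send_metric(payload)  # enable where a StatsD agent is actually listening
```

Graphite then aggregates these samples before Grafana renders them.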

ApiMon features
---------------

ApiMon comes with the following features:

- Support of Ansible playbooks for test scenarios
- Support of HTTP requests (GET) for endpoint monitoring
- Support of TSDB and RDB
- Support of all OTC environments:

  - EU-DE
  - EU-NL
  - Swisscloud
  - PREPROD

- Support of multiple monitoring sources:

  - internal (OTC)
  - external (vCloud)

- Alerts aggregated in Alerta and notifications sent to Zulip
- Various dashboards:

  - KPI dashboards
  - 24/7 squad dashboards
  - General test result dashboards
  - Specific squad/service-based dashboards

- Each squad can control and manage its own test scenarios and dashboards
- Every execution of an Ansible playbook stores the log file on Swift object
  storage for further investigation/analysis
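
The GET-based endpoint monitoring listed above boils down to probing a URL
and classifying the outcome as healthy or alert-worthy. A minimal Python
sketch follows, assuming a 2-second response-time threshold; the thresholds
and the classification labels are illustrative assumptions, not the actual
EpMon configuration:

```python
import time
import urllib.error
import urllib.request


def classify(status, elapsed_s, threshold_s=2.0):
    """Map a single probe outcome to an ok/alert verdict."""
    if status is None or status >= 400:   # timeout, network error, or HTTP error
        return "alert"
    if elapsed_s > threshold_s:           # too slow counts as unavailable
        return "alert"
    return "ok"


def probe(url, timeout_s=5.0, threshold_s=2.0):
    """Issue one GET request and classify the result."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            status = resp.status
    except urllib.error.HTTPError as exc:  # 4xx/5xx responses raise here
        status = exc.code
    except Exception:                      # timeout, DNS failure, refused, ...
        status = None
    return classify(status, time.monotonic() - start, threshold_s)
```

In the real system these verdicts feed Alerta, which deduplicates them and
notifies the Zulip stream.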

What ApiMon is NOT
------------------

The following items are out of scope (even though some of them are technically
possible):

- No performance monitoring: The API-Monitoring does not measure performance
  degradations per se. Measuring the access times or data transfer rates of an
  SSD disk, for example, is out of scope. However, if the performance of a
  resource drops below a threshold that is considered equivalent to
  non-availability, this is reported.
- No application monitoring: The service availability of applications that run
  on top of the cloud's IaaS or PaaS is out of scope.
- No view from inside: The API-Monitoring has no internal backplane insights
  and uses only the public APIs of the monitored cloud. It thus requires no
  administrative permissions on the backend. It can, however, additionally be
  deployed in the backplane to monitor internal APIs as well.
- No synthetic workloads: The service does not simulate any workloads (for
  example, a benchmark suite) on the provisioned resources. Instead, it
  measures and reports only whether APIs are available and return expected
  results with the expected behavior.
- No monitoring of every single API: The API-Monitoring focuses on the basic
  API functionality of selected components. It does not cover every single API
  call available in the OTC API product portfolio.