Reviewed-by: Gode, Sebastian <sebastian.gode@t-systems.com> Co-authored-by: tischrei <tino.schreiber@t-systems.com> Co-committed-by: tischrei <tino.schreiber@t-systems.com>
============
Introduction
============

The Open Telekom Cloud (OTC) is represented to users and customers by its API
endpoints and the various services behind them. Users and operators need a
reliable way to verify that the services are actually available to them via
the Internet. Internal monitoring checks on the OTC backplane are necessary,
but they are not sufficient to detect failures that manifest in the interface,
network connectivity, or the API logic itself. Simple HTTP requests to the
REST endpoints that only check for 200 status codes are helpful, but likewise
not sufficient.

ApiMon is an Open Telekom Cloud product developed by the Ecosystem squad.

The ApiMon (a.k.a. API-Monitoring) project:

- Is developed to supervise the public APIs of the OTC platform 24/7.
- Sends requests to the API repeatedly.
- Groups requests into so-called scenarios that mimic real-world use cases.
- Implements use cases as Ansible playbooks.
- Is easy to extend for other use cases, such as monitoring the provisioning
  of extra VMs or deploying extra software.

.. image:: https://stackmon.github.io/assets/images/solution-diagram.svg

ApiMon Architecture Summary
---------------------------

- Test scenarios are implemented as Ansible playbooks and pushed to
  `GitHub <https://github.com/opentelekomcloud-infra/apimon-test>`_.
- EpMon executes various HTTP query requests towards service endpoints and
  generates statistics.
- The Scheduler fetches the latest playbooks from the repository and puts
  them in a queue to run in an endless loop.
- The Executor runs the playbooks from the queue and captures the metrics.
- The Ansible playbook results generate the metrics (duration, result).
- Test scenario metrics are sent to a PostgreSQL relational database.
- The HTTP request metrics (generated by OpenStackSDK) are collected by
  statsd.
- The time series database (Graphite) pulls metrics from statsd.
- Grafana dashboards visualize data from PostgreSQL and Graphite.
- Alerta monitoring is used for raising alarms when an API times out,
  returns an error, or its response time exceeds a threshold.
- Alerta further sends error notifications to the Zulip #Alerts stream.
- Log files are maintained on OTC object storage via Swift.

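The Scheduler and Executor steps above can be pictured with a minimal Python
sketch. It is an illustration only, not the ApiMon implementation: the names
``ScenarioMetric``, ``schedule``, ``execute`` and the ``run_playbook`` callback
are hypothetical, and the ``rounds`` parameter bounds what is an endless loop
in the real service.

```python
import queue
import time
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class ScenarioMetric:
    """One record per playbook run, holding the two metrics named above."""
    scenario: str
    duration: float  # seconds
    result: str      # "passed" or "failed"


def schedule(playbooks: Sequence[str], rounds: int) -> "queue.Queue[str]":
    """Scheduler: enqueue every playbook once per round.

    Assumption: the real scheduler loops endlessly; `rounds` keeps the
    sketch finite.
    """
    work: "queue.Queue[str]" = queue.Queue()
    for _ in range(rounds):
        for name in playbooks:
            work.put(name)
    return work


def execute(work: "queue.Queue[str]",
            run_playbook: Callable[[str], bool]) -> List[ScenarioMetric]:
    """Executor: drain the queue, time each run, and record the outcome."""
    metrics: List[ScenarioMetric] = []
    while not work.empty():
        name = work.get()
        started = time.perf_counter()
        try:
            ok = run_playbook(name)  # hypothetical hook that runs one playbook
        except Exception:
            ok = False  # a crashing scenario is recorded as a failed run
        duration = time.perf_counter() - started
        metrics.append(
            ScenarioMetric(name, duration, "passed" if ok else "failed")
        )
    return metrics
```

Calling ``execute(schedule(["a.yaml", "b.yaml"], rounds=2), runner)`` with any
callable ``runner`` returns four records in queue order, which a real
deployment would forward to the database rather than keep in memory.
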
ApiMon features
---------------

ApiMon comes with the following features:

- Support of Ansible playbooks for testing scenarios
- Support of HTTP requests (GET) for endpoint monitoring
- Support of TSDB and RDB
- Support of all OTC environments

  - EU-DE
  - EU-NL
  - Swisscloud
  - PREPROD

- Support of multiple monitoring sources:

  - internal (OTC)
  - external (vCloud)

- Alerts aggregated in Alerta and notifications sent to Zulip
- Various dashboards

  - KPI dashboards
  - 24/7 squad dashboards
  - General test results dashboards
  - Specific squad/service based dashboards

- Each squad can control and manage their test scenarios and dashboards
- Every execution of Ansible playbooks stores the log file on Swift object
  storage for further investigation and analysis

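The alerting conditions (an API times out, returns an error, or responds too
slowly) can be sketched as a small decision function. This is a hedged
illustration rather than ApiMon or Alerta code: the function name
``classify_response``, the convention that ``status`` is ``None`` on a
timeout, the treatment of status codes from 400 upwards as errors, and the
threshold value are all assumptions.

```python
from typing import Optional


def classify_response(status: Optional[int],
                      elapsed: float,
                      threshold: float) -> Optional[str]:
    """Return an alarm reason, or None when the call looks healthy.

    status    -- HTTP status code, or None if the request timed out
                 (assumed convention for this sketch)
    elapsed   -- measured response time in seconds
    threshold -- maximum acceptable response time in seconds (assumed value)
    """
    if status is None:
        return "timeout"
    if status >= 400:  # assumption: 4xx and 5xx count as "returns an error"
        return "error"
    if elapsed > threshold:
        return "slow"
    return None
```

A caller would raise an alert only when the function returns a reason, for
example ``classify_response(200, 2.5, threshold=1.0)`` for a successful but
slow request.
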
What ApiMon is NOT
------------------

The following items are out of scope (while some of them are technically
possible):

- No performance monitoring: The API-Monitoring does not measure degradations
  of performance per se. Measuring the access times or data transfer rates of
  an SSD disk, for example, is out of scope. However, if the performance of a
  resource drops below a threshold that is considered equivalent to
  non-availability, this is reported.
- No application monitoring: The service availability of applications that
  run on top of the IaaS or PaaS of the cloud is out of scope.
- No view from inside: The API-Monitoring has no internal backplane insights
  and only uses public APIs of the monitored cloud. It therefore requires no
  administrative permissions on the backend. It can, however, also be
  deployed in the backplane to monitor internal APIs in addition.
- No synthetic workloads: The service does not simulate any workloads (for
  example a benchmark suite) on the provisioned resources. Instead, it only
  measures and reports whether APIs are available and return expected results
  with the expected behavior.
- No monitoring of every single API: The API-Monitoring focuses on basic API
  functionality of selected components. It does not cover every single API
  call available in the OTC API product portfolio.