Introduction

The Open Telekom Cloud is represented to users and customers by the API endpoints and the various services behind them. Users and operators are interested in a reliable way to check and verify whether the services are actually available to them via the Internet. While internal monitoring checks on the OTC backplane are necessary, they are not sufficient to detect failures that manifest in the interface, network connectivity, or the API logic itself. Simple HTTP requests to the REST endpoints that check for 200 status codes are also helpful, but not sufficient.

ApiMon is an Open Telekom Cloud product developed by the Ecosystem squad.

The ApiMon (a.k.a. API-Monitoring) project:

  • Is developed to supervise the public APIs of the OTC platform 24/7.
  • Sends requests to the APIs repeatedly.
  • Groups requests into so-called scenarios that mimic real-world use cases.
  • Implements use cases as Ansible playbooks.
  • Is easy to extend to other use cases, such as monitoring the provisioning of extra VMs or deploying extra software.

ApiMon Architecture Summary

  • Test scenarios are implemented as Ansible playbooks and pushed to the GitHub [repository](https://github.com/opentelekomcloud-infra/apimon-tests).
  • EpMon executes various HTTP query requests towards service endpoints and generates statistics.
  • The scheduler fetches the latest playbooks from the repository and puts them in a queue to run in an endless loop.
  • The executor runs the playbooks from the queue and captures the metrics.
  • The Ansible playbook results generate the metrics (duration, result).
  • Test scenario metrics are sent to a PostgreSQL relational database.
  • The HTTP request metrics (generated by OpenStackSDK) are collected by statsd.
  • The time series database (Graphite) pulls metrics from statsd.
  • Grafana dashboards visualize data from PostgreSQL and Graphite.
  • Alerta monitoring is used for raising alarms when an API times out, returns an error, or its response time exceeds a threshold.
  • Alerta further sends error notifications to the Zulip #Alerts stream.
  • Log files are maintained on OTC object storage via Swift.
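
The queue-and-executor part of the summary above can be sketched roughly as follows. This is a minimal Python illustration and not ApiMon's actual implementation: the names `ScenarioRecord`, `run_scenario`, and `drain` are hypothetical, and plain Python callables stand in for Ansible playbooks so that the example stays self-contained.

```python
import queue
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ScenarioRecord:
    """One row of scenario metrics: which scenario ran, how it ended, how long it took."""
    name: str
    result: str         # "passed" or "failed"
    duration_ms: float


def run_scenario(name: str, scenario: Callable[[], None]) -> ScenarioRecord:
    """Run one scenario (a stand-in for a playbook) and capture result and duration."""
    start = time.monotonic()
    try:
        scenario()
        result = "passed"
    except Exception:
        # Any exception raised by the scenario counts as a failed run.
        result = "failed"
    duration_ms = (time.monotonic() - start) * 1000.0
    return ScenarioRecord(name, result, duration_ms)


def drain(work: "queue.Queue") -> List[ScenarioRecord]:
    """Executor pass: take (name, scenario) pairs off the queue until it is empty."""
    records = []
    while True:
        try:
            name, scenario = work.get_nowait()
        except queue.Empty:
            break
        records.append(run_scenario(name, scenario))
    return records
```

In this sketch a scheduler would refill the queue and call `drain` again on every cycle, and each `ScenarioRecord` corresponds to the kind of duration and result metrics that the summary says are stored in the relational database.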

ApiMon features

ApiMon comes with the following features:

  • Support of Ansible playbooks for testing scenarios
  • Support of HTTP requests (GET) for endpoint monitoring
  • Support of TSDB and RDB
  • Support of all OTC environments:
      • EU-DE
      • EU-NL
      • Swisscloud
      • PREPROD
  • Support of multiple monitoring sources:
      • internal (OTC)
      • external (vCloud)
  • Alerts aggregated in Alerta and notifications sent to Zulip
  • Various dashboards:
      • KPI dashboards
      • 24/7 squad dashboards
      • General test results dashboards
      • Specific squad/service based dashboards
  • Each squad can control and manage its own test scenarios and dashboards
  • Every execution of an Ansible playbook stores its log file on Swift for further investigation and analysis
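
To make the endpoint monitoring and alerting features more concrete, the following Python sketch shows how a single GET probe might be classified into the alert conditions named earlier (no response or timeout, error, response time above a threshold). It is an assumption-laden illustration, not EpMon's or Alerta's real code: `ProbeResult`, `classify`, `probe`, the state names, and the default timeout and threshold values are all hypothetical.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProbeResult:
    url: str
    status: Optional[int]   # None when no HTTP response was received
    duration_ms: float
    state: str              # "ok", "slow", "error", or "no_response"


def classify(status: Optional[int], duration_ms: float, threshold_ms: float) -> str:
    """Map a probe outcome to a state (state names are hypothetical)."""
    if status is None:
        return "no_response"        # timeout or connection failure
    if status >= 400:
        return "error"              # the API answered with an error status
    if duration_ms > threshold_ms:
        return "slow"               # answered, but slower than the threshold
    return "ok"


def probe(url: str, timeout_s: float = 5.0, threshold_ms: float = 1000.0,
          opener=urllib.request.urlopen) -> ProbeResult:
    """Send one GET request and classify the outcome.

    `opener` is injectable so the function can be exercised without network access.
    """
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout_s) as resp:
            status = resp.status
    except urllib.error.HTTPError as exc:
        status = exc.code           # 4xx/5xx responses are raised as HTTPError
    except OSError:
        status = None               # URLError and socket timeouts are OSError subclasses
    duration_ms = (time.monotonic() - start) * 1000.0
    return ProbeResult(url, status, duration_ms,
                       classify(status, duration_ms, threshold_ms))
```

A monitoring loop could call `probe` for each endpoint and forward every result whose state is not "ok" to the alerting system.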

What ApiMon is NOT

The following items are out of scope (while some of them are technically possible):

  • No performance monitoring: The API-Monitoring does not measure degradations of performance per se. So measuring the access times or data transfer rates of an SSD disk is out of scope. However, if the performance of a resource drops below a threshold at which the resource is considered equivalent to unavailable, this is reported.
  • No application monitoring: The service availability of applications that run on top of IaaS or PaaS of the cloud is out of scope.
  • No view from inside: The API-Monitoring has no internal backplane insights and only uses public APIs of the monitored cloud. It thus requires no administrative permissions on the backend. It can, however, also be deployed in the backplane to monitor internal APIs as well.
  • No synthetic workloads: The service does not simulate any workloads (for example a benchmark suite) on the provisioned resources. Instead, it only measures and reports whether APIs are available and return expected results with the expected behaviour.
  • No monitoring of every single API: The API-Monitoring focuses on basic API functionality of selected components. It does not cover every single API call available in the OTC API product portfolio.