Alerts

Alerta is the ApiMon component designed to integrate alerts from multiple sources. It supports many standard sources such as Syslog, SNMP, Prometheus, Nagios, and Zabbix. Any other source that can issue a URL request or run a command-line tool can be integrated as well.
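As an illustration of integrating a custom source through a URL request, the following minimal sketch builds a request for the Alerta REST API (`POST /alert`, authenticated with an API key). The base URL, the API key, and the resource, event, and service names are placeholders assumed for the example; they are not taken from the ApiMon setup.

```python
import json
import urllib.request


def build_alert_request(api_url, api_key, resource, event, environment,
                        service, severity="major", text="", timeout=86400):
    """Build (but do not send) a POST /alert request for an Alerta API."""
    payload = {
        "resource": resource,        # what is affected
        "event": event,              # short description of the failure
        "environment": environment,  # e.g. "PRODUCTION EU-DE"
        "service": [service],        # Alerta expects a list of services
        "severity": severity,
        "text": text,
        "timeout": timeout,          # seconds; 86400 s = 24 h
    }
    return urllib.request.Request(
        url=api_url.rstrip("/") + "/alert",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Key " + api_key,
        },
        method="POST",
    )


# Hypothetical values, for illustration only.
request = build_alert_request(
    api_url="https://alerta.example.org/api",
    api_key="demo-key",
    resource="example-resource",
    event="example_failure",
    environment="PRODUCTION EU-DE",
    service="example-service",
)
# Sending would be: urllib.request.urlopen(request)
```

The request is only constructed here, so the sketch can be inspected without network access or valid credentials.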

Native functions such as correlation and de-duplication make it possible to manage thousands of alerts transparently and to consolidate them into categories based on environment, service, resource, failure type, etc.
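The idea behind de-duplication can be sketched as follows. This is a simplified model, not Alerta's implementation: it assumes that an incoming alert counts as a duplicate when environment, resource, event, and severity all match an existing alert, and it treats a severity change as a fresh state. The resource and event names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    environment: str
    resource: str
    event: str
    severity: str
    duplicate_count: int = 0
    repeat: bool = False


def receive(store, incoming):
    """Store an incoming alert, folding exact repeats into one record."""
    key = (incoming.environment, incoming.resource, incoming.event)
    existing = store.get(key)
    if existing is not None and existing.severity == incoming.severity:
        # Same alert again: count it instead of creating a new entry.
        existing.duplicate_count += 1
        existing.repeat = True
        return existing
    # New alert, or severity changed: start with a fresh record.
    store[key] = incoming
    return incoming


store = {}
first = receive(store, Alert("PRODUCTION EU-DE", "example-resource", "example_failure", "major"))
second = receive(store, Alert("PRODUCTION EU-DE", "example-resource", "example_failure", "major"))
```

After the second call the store still holds a single record, with its duplicate count incremented and `repeat` set.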

Alerta is hosted at https://alerts.eco.tsi-dev.otc-service.com/. Authentication is centrally managed by OTC LDAP.

The Zulip API is integrated with Alerta to send notifications about errors and alerts to a Zulip stream.
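For orientation, a notification to a Zulip stream can be posted through the Zulip REST API (`POST /api/v1/messages`, HTTP basic authentication with a bot email and API key). The sketch below only builds such a request; the Zulip site, bot credentials, stream, and topic are assumptions made for the example and do not describe how the ApiMon integration is actually configured.

```python
import base64
import urllib.parse
import urllib.request


def build_zulip_request(site, bot_email, bot_api_key, stream, topic, content):
    """Build (but do not send) a stream message request for the Zulip API."""
    form = urllib.parse.urlencode({
        "type": "stream",
        "to": stream,
        "topic": topic,
        "content": content,
    }).encode("utf-8")
    credentials = base64.b64encode(
        f"{bot_email}:{bot_api_key}".encode("utf-8")
    ).decode("ascii")
    return urllib.request.Request(
        url=site.rstrip("/") + "/api/v1/messages",
        data=form,
        headers={"Authorization": "Basic " + credentials},
        method="POST",
    )


# Hypothetical values, for illustration only.
zulip_request = build_zulip_request(
    site="https://zulip.example.org",
    bot_email="alerta-bot@example.org",
    bot_api_key="demo-key",
    stream="alerts",
    topic="PRODUCTION EU-DE",
    content="example_failure on example-resource (major)",
)
# Sending would be: urllib.request.urlopen(zulip_request)
```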

Alerts displayed in OTC Alerta are generated by the Executor, the Scheduler, EpMon, or Grafana.

  • "Executor alerts" focus on playbook results: whether a playbook has completed or failed.
  • "Grafana alerts" focus on breaches of defined thresholds, for example an API response time higher than the defined threshold.
  • "Scheduler alerts" TBD
  • "EpMon alerts" provide information about failed endpoint queries, with details of the request in curl form and the corresponding error response.

Alerts in Alerta are organized in environment tabs based on OTC regions.

  • PRODUCTION EU-DE
  • PRODUCTION EU-NL
  • HYBRID-SWISS
  • ALL
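The same per-environment view can be approximated programmatically. The sketch below builds a query URL for the Alerta REST API (`GET /alerts`) that filters by environment and status; the base URL is a placeholder, and the assumption that the tab names are used verbatim as environment values is not confirmed by this document.

```python
import urllib.parse


def build_alerts_query(api_url, environment, status="open"):
    """Return a GET /alerts URL filtered by environment and status."""
    query = urllib.parse.urlencode({
        "environment": environment,
        "status": status,
    })
    return api_url.rstrip("/") + "/alerts?" + query


# Hypothetical base URL, for illustration only.
query_url = build_alerts_query("https://alerta.example.org/api", "PRODUCTION EU-DE")
```

The resulting URL can be fetched with any HTTP client, sending the API key in the `Authorization` header.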

Each alert offers three views:

  • Details - all alert parameters shown in a single view
  • History - occurrences of the alert over time (without de-duplication)
  • Data - the error message extracted from the event

An alert object consists of the following fields:

Alert Field: Description
Alert ID: Reference to the alert in Alerta
Create Time: Timestamp of alert creation
Service: Affected service and type of monitoring
Environment: Affected environment/region
Resource: The particular resource in which the issue occurred
Event: Short description of the error result
Correlate: Currently not in use
Group: Further categorization of alerts (currently not in use)
Severity: Critical for EpMon alerts, Major for ApiMon alerts
Status:
  • Open - default status when an alert is received in Alerta
  • Ack - acknowledged status, indicating that a user has taken note of the incident on the service or host
  • Shelve - changes the alert status to shelved, which removes the alert from the active console and prevents any further notifications
  • Close - changes the alert status to closed
Value: Same as the Event field
Text: Currently not in use
Trend Indication: Currently not in use
Timeout: Time after which the alert disappears from Alerta (default is 24h)
Type:
  • Apimon Executor Alert - ApiMon related events
  • Exception Alert - EpMon related events
Duplicate count: De-duplication feature; number of times the same alert has re-occurred
Repeat: False if duplicateCount is 0 or the alert status has changed, otherwise True
Origin: Location from which the job was executed
Tags: Further details about the particular resource in which the issue occurred
Log Url: Reference to the job execution output on Swift object storage (only for ApiMon alerts)
Log Url Web: Reference to the job execution output on Swift object storage (only for ApiMon alerts)
State:
  • Present - if the alert is still occurring
  • Absent - if the alert is no longer occurring
