Alerts
Alerta is the ApiMon component that integrates alerts from multiple sources. It supports many standard sources such as Syslog, SNMP, Prometheus, Nagios, and Zabbix. Any other source that can issue a URL request or run a command-line call can be integrated as well.
Native functions such as correlation and de-duplication help to manage thousands of alerts in a transparent way and consolidate them into categories based on environment, service, resource, failure type, and so on.
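As an illustration of the URL-request integration path, the following minimal sketch builds an HTTP POST request that carries one alert as JSON. The API URL, the API key, the `Key` authorization scheme, and the helper name `build_alert_request` are assumptions made for this example and are not taken from the ApiMon configuration. The request is only constructed, not sent.

```python
import json
import urllib.request

# Hypothetical values: neither the API path nor the key comes from the ApiMon setup.
ALERTA_API_URL = "https://alerts.example.com/api/alert"
API_KEY = "demo-key"


def build_alert_request(resource, event, environment, service, severity="major", value=""):
    """Build (but do not send) an HTTP POST request describing one alert."""
    payload = {
        "resource": resource,
        "event": event,
        "environment": environment,
        "service": [service],
        "severity": severity,
        "value": value,
    }
    return urllib.request.Request(
        ALERTA_API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumed API-key header format; adjust to the deployment's auth setup.
            "Authorization": f"Key {API_KEY}",
        },
        method="POST",
    )


request = build_alert_request(
    resource="compute",
    event="playbook_failed",
    environment="PRODUCTION EU-DE",
    service="apimon",
)
# urllib.request.urlopen(request) would send it; omitted so the sketch runs offline.
```

Keeping request construction separate from sending makes it easy to inspect the payload before anything reaches the alerting system.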
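De-duplication can be pictured as counting repeated alerts under one key. The sketch below is a simplified illustration, not Alerta's actual implementation: it assumes that two alerts are duplicates when they share environment, resource, and event, and the function name `deduplicate` is hypothetical.

```python
def deduplicate(alerts):
    """Collapse repeated alerts into one entry per (environment, resource, event)."""
    seen = {}
    for alert in alerts:
        # Assumed duplicate key; a real system may compare further fields.
        key = (alert["environment"], alert["resource"], alert["event"])
        if key in seen:
            # Same alert seen again: bump the counter instead of adding an entry.
            seen[key]["duplicateCount"] += 1
        else:
            seen[key] = {**alert, "duplicateCount": 0}
    return list(seen.values())


incoming = [
    {"environment": "PRODUCTION EU-DE", "resource": "compute", "event": "timeout"},
    {"environment": "PRODUCTION EU-DE", "resource": "compute", "event": "timeout"},
    {"environment": "PRODUCTION EU-NL", "resource": "dns", "event": "http_500"},
]
consolidated = deduplicate(incoming)
```

Here three incoming alerts collapse into two entries, and the first entry carries a duplicate count of 1.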
Alerta is hosted on https://alerts.eco.tsi-dev.otc-service.com/. Authentication is centrally managed by OTC LDAP.
The Zulip API is integrated with Alerta to send notifications of errors and alerts to a Zulip stream.
Alerts displayed in OTC Alerta are generated by Executor, Scheduler, EpMon, or Grafana.
- "Executor alerts" focus on playbook results: whether a playbook has completed or failed.
- "Grafana alerts" focus on breaches of defined thresholds, for example an API response time that is higher than the defined threshold.
- "Scheduler alerts" TBD
- "EpMon alerts" provide information about failed endpoint queries, including the request in curl form and the details of the error response.
Alerts in Alerta are organized into environment tabs based on OTC regions:
- PRODUCTION EU-DE
- PRODUCTION EU-NL
- HYBRID-SWISS
- ALL
Every alert offers three views:
- Details - all alert parameters shown in a single view
- History - occurrences of the alert over time (without de-duplication)
- Data - error message extracted from the event
An alert object consists of the following fields:
Alert Field | Description |
---|---|
Alert ID | Reference to alert in Alerta |
Create Time | Timestamp of alert creation |
Service | Information about affected service and type of monitoring |
Environment | Information about affected environment/region |
Resource | Further details about the resource in which the issue occurred |
Event | Short description of error result |
Correlate | Currently not in use |
Group | Further categorization of alerts (currently not used) |
Severity | Critical - EpMon, Major - ApiMon |
Status | |
Value | Same as the Event field |
Text | Currently not in use |
Trend Indication | Currently not in use |
Timeout | Time after which the alert disappears from Alerta (default is 24h) |
Type | |
Duplicate count | De-duplication feature: number of times the same alert has recurred |
Repeat | If duplicateCount is 0 or the alert status has changed, Repeat is False; otherwise it is True |
Origin | Location from which the job was executed |
Tags | Further details about the resource in which the issue occurred |
Log Url | Reference to job execution output on Swift object storage (only for ApiMon alerts) |
Log Url Web | Reference to job execution output on Swift object storage (only for ApiMon alerts) |
State | |
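The Repeat and Timeout rows above can be restated as two small predicates. This is a minimal sketch under stated assumptions: the function names are hypothetical, and the table does not say which timestamp the timeout is measured from, so the reference time is passed in explicitly.

```python
from datetime import datetime, timedelta, timezone

# Default taken from the Timeout row of the table (24h).
DEFAULT_TIMEOUT = timedelta(hours=24)


def is_repeat(duplicate_count, status_changed):
    """Repeat is False if duplicateCount is 0 or the status changed, else True."""
    return not (duplicate_count == 0 or status_changed)


def is_expired(reference_time, now, timeout=DEFAULT_TIMEOUT):
    """Return True once the timeout has elapsed since reference_time.

    Which timestamp acts as reference_time is an assumption of this sketch.
    """
    return now - reference_time >= timeout


created = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
first_seen_repeat = is_repeat(duplicate_count=0, status_changed=False)
later_repeat = is_repeat(duplicate_count=3, status_changed=False)
expired_next_day = is_expired(created, created + timedelta(hours=25))
```

With these inputs, the first occurrence is not a repeat, the fourth occurrence with unchanged status is, and the alert counts as expired 25 hours after the reference time.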