Reviewed-by: Gode, Sebastian <sebastian.gode@t-systems.com> Co-authored-by: tischrei <tino.schreiber@t-systems.com> Co-committed-by: tischrei <tino.schreiber@t-systems.com>
======
Alerts
======

Alerta is the ApiMon component that integrates alerts from multiple sources.
It supports many standard sources such as Syslog, SNMP, Prometheus, Nagios and
Zabbix. Any other source that can issue a URL request or run a command-line
tool can be integrated as well.

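As a rough illustration of the URL-request integration, the sketch below
builds an HTTP request that submits a single alert. It assumes Alerta's REST
endpoint ``POST /alert`` with API-key authentication. The base URL, the API key
and all field values are hypothetical placeholders, and the request is only
constructed here, not sent.

.. code-block:: python

   import json
   import urllib.request

   # Hypothetical values for illustration only, not the real ApiMon setup.
   API_URL = "https://alerta.example.org/api"
   API_KEY = "demo-key"


   def build_alert_request(api_url, api_key, alert):
       """Return a POST request that would submit ``alert`` to Alerta."""
       return urllib.request.Request(
           url=f"{api_url}/alert",
           data=json.dumps(alert).encode("utf-8"),
           method="POST",
           headers={
               "Content-Type": "application/json",
               "Authorization": f"Key {api_key}",
           },
       )


   alert = {
       "environment": "Production",
       "service": ["compute"],
       "resource": "example_playbook",
       "event": "playbook_failed",
       "severity": "major",
   }
   request = build_alert_request(API_URL, API_KEY, alert)
   # To actually send it: urllib.request.urlopen(request)
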
Native functions such as correlation and de-duplication make it possible to
manage thousands of alerts transparently and to consolidate them into
categories based on environment, service, resource, failure type, etc.

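The idea behind de-duplication can be sketched with a small in-memory store.
This is not Alerta's implementation. It assumes that two alerts count as
duplicates when environment, resource, event and severity all match, and the
class and attribute names are invented for the example.

.. code-block:: python

   from dataclasses import dataclass


   @dataclass
   class StoredAlert:
       environment: str
       resource: str
       event: str
       severity: str
       duplicate_count: int = 0


   class AlertStore:
       """Toy store that folds identical alerts into a single record."""

       def __init__(self):
           self._alerts = {}

       def receive(self, environment, resource, event, severity):
           # Assumed de-duplication key: all four attributes must match.
           key = (environment, resource, event, severity)
           alert = self._alerts.get(key)
           if alert is None:
               alert = StoredAlert(environment, resource, event, severity)
               self._alerts[key] = alert
           else:
               alert.duplicate_count += 1
           return alert

       def __len__(self):
           return len(self._alerts)


   store = AlertStore()
   for _ in range(3):
       first = store.receive("production", "compute", "api_timeout", "major")
   other = store.receive("production", "dns", "api_timeout", "major")

Three identical submissions end up as one stored alert whose duplicate counter
has been incremented twice, while the alert for a different resource is kept
as a separate record.
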
Alerta is hosted at https://alerts.eco.tsi-dev.otc-service.com/.
Authentication is centrally managed by OTC LDAP.

The Zulip API is integrated with Alerta to send notifications of errors and
alerts to a Zulip stream.

Alerts displayed in OTC Alerta are generated by the Executor, the Scheduler,
EpMon or Grafana.

- "Executor alerts" report playbook results, i.e. whether a playbook has
  completed or failed.
- "Grafana alerts" report breaches of defined thresholds, for example an API
  response time that is higher than the defined threshold.
- "Scheduler alerts": TBD
- "EpMon alerts" provide information about failed endpoint queries, with
  details of the request in curl form and the respective error response.

.. image:: training_images/alerta_dashboard.png

Alerts in Alerta are organized in environment tabs based on OTC regions:

- PRODUCTION EU-DE
- PRODUCTION EU-NL
- HYBRID-SWISS
- ALL

Every alert offers three views:

- **Details** - all alert parameters shown in a single view
- **History** - occurrences of the alert over time (without de-duplication)
- **Data** - error message extracted from the event

An alert object consists of the following fields:

+----------------------+---------------------------------------------------------------------------------------------+
| Alert Field          | Description                                                                                 |
+======================+=============================================================================================+
| **Alert ID**         | Reference to the alert in Alerta                                                            |
+----------------------+---------------------------------------------------------------------------------------------+
| **Create Time**      | Timestamp of alert creation                                                                 |
+----------------------+---------------------------------------------------------------------------------------------+
| **Service**          | Affected service and type of monitoring                                                     |
+----------------------+---------------------------------------------------------------------------------------------+
| **Environment**      | Affected environment/region                                                                 |
+----------------------+---------------------------------------------------------------------------------------------+
| **Resource**         | The particular resource in which the issue occurred                                         |
+----------------------+---------------------------------------------------------------------------------------------+
| **Event**            | Short description of the error result                                                       |
+----------------------+---------------------------------------------------------------------------------------------+
| **Correlate**        | Currently not in use                                                                        |
+----------------------+---------------------------------------------------------------------------------------------+
| **Group**            | Further categorization of alerts (currently not in use)                                     |
+----------------------+---------------------------------------------------------------------------------------------+
| **Severity**         | Critical - EpMon, Major - ApiMon                                                            |
+----------------------+---------------------------------------------------------------------------------------------+
| **Status**           | - **Open** - default status when an alert is received in Alerta                             |
|                      | - **Ack** - acknowledged; a user has taken the incident on the service or host into account |
|                      | - **Shelve** - removes the alert from the active console and prevents further notifications |
|                      | - **Close** - changes the alert status to closed                                            |
+----------------------+---------------------------------------------------------------------------------------------+
| **Value**            | Same as the Event field                                                                     |
+----------------------+---------------------------------------------------------------------------------------------+
| **Text**             | Currently not in use                                                                        |
+----------------------+---------------------------------------------------------------------------------------------+
| **Trend Indication** | Currently not in use                                                                        |
+----------------------+---------------------------------------------------------------------------------------------+
| **Timeout**          | Time after which the alert disappears from Alerta (default is 24h)                          |
+----------------------+---------------------------------------------------------------------------------------------+
| **Type**             | - Apimon Executor Alert - ApiMon related events                                             |
|                      | - Exception Alert - EpMon related events                                                    |
+----------------------+---------------------------------------------------------------------------------------------+
| **Duplicate count**  | De-duplication feature - number of times the same alert has re-occurred                     |
+----------------------+---------------------------------------------------------------------------------------------+
| **Repeat**           | False if duplicateCount is 0 or the alert status has changed, otherwise True                |
+----------------------+---------------------------------------------------------------------------------------------+
| **Origin**           | Location from which the job was executed                                                    |
+----------------------+---------------------------------------------------------------------------------------------+
| **Tags**             | Further details about the particular resource in which the issue occurred                   |
+----------------------+---------------------------------------------------------------------------------------------+
| **Log Url**          | Reference to the job execution output on Swift object storage (only for ApiMon alerts)      |
+----------------------+---------------------------------------------------------------------------------------------+
| **Log Url Web**      | Reference to the job execution output on Swift object storage (only for ApiMon alerts)      |
+----------------------+---------------------------------------------------------------------------------------------+
| **State**            | - Present - if the alert is still current                                                   |
|                      | - Absent - if the alert is no longer occurring                                              |
+----------------------+---------------------------------------------------------------------------------------------+

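The rules given for the **Repeat** and **Timeout** fields can be written down
as two small helper functions. This is only a sketch of the rules as described
in the table. The function names are invented for the example, and measuring
the timeout from the time the alert was last received is an assumption.

.. code-block:: python

   from datetime import datetime, timedelta, timezone

   # Default timeout as stated in the table above.
   DEFAULT_TIMEOUT = timedelta(hours=24)


   def is_repeat(duplicate_count, status_changed):
       """Repeat is False for a first occurrence or after a status change."""
       return not (duplicate_count == 0 or status_changed)


   def is_expired(last_received, now, timeout=DEFAULT_TIMEOUT):
       """True once more than ``timeout`` has passed since the last receipt.

       Assumption: the timeout is counted from the last time the alert
       was received.
       """
       return now - last_received > timeout


   received = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
   one_hour_later = received + timedelta(hours=1)
   next_day_afternoon = received + timedelta(hours=25)

   fresh_alert_expired = is_expired(received, one_hour_later)
   old_alert_expired = is_expired(received, next_day_afternoon)
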
.. image:: training_images/alerta_detail.jpg