tischrei 0618989a8a hc_ops
Reviewed-by: Gode, Sebastian <sebastian.gode@t-systems.com>
Co-authored-by: tischrei <tino.schreiber@t-systems.com>
Co-committed-by: tischrei <tino.schreiber@t-systems.com>
2024-02-22 14:55:55 +00:00


======
Alerts
======
Alerta is the ApiMon component that integrates alerts from multiple sources. It
supports many standard sources such as Syslog, SNMP, Prometheus, Nagios and
Zabbix. Any other type of source can be integrated as well, using a URL request
or the command line. Native functions such as correlation and de-duplication
help to manage thousands of alerts in a transparent way and to consolidate them
into categories based on environment, service, resource, failure type, etc.

Alerta is hosted at https://alerts.eco.tsi-dev.otc-service.com/ .
Authentication is centrally managed by OTC LDAP.

The Zulip API is integrated with Alerta to send notifications of errors and
alerts to a Zulip stream.

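
As a minimal sketch of the URL-request integration mentioned above, the
following Python snippet builds an HTTP request that would create an alert
through the Alerta REST API. The ``/api`` path prefix, the API key and all field
values are assumptions made for illustration; they are not taken from the ApiMon
configuration.

.. code-block:: python

   import json
   import urllib.request

   # Assumption: the API is served under /api on the Alerta host named above.
   ALERTA_API = "https://alerts.eco.tsi-dev.otc-service.com/api"


   def build_alert_request(api_key, resource, event, environment, service,
                           severity="major", value="", timeout=86400):
       """Build (but do not send) a POST request that creates an alert."""
       payload = {
           "resource": resource,
           "event": event,
           "environment": environment,
           "service": [service],  # Alerta expects a list of services
           "severity": severity,
           "value": value,
           "timeout": timeout,    # seconds; 86400 s corresponds to 24 h
       }
       return urllib.request.Request(
           url=f"{ALERTA_API}/alert",
           data=json.dumps(payload).encode("utf-8"),
           headers={
               "Content-Type": "application/json",
               "Authorization": f"Key {api_key}",
           },
           method="POST",
       )


   # All values below are hypothetical placeholders.
   request = build_alert_request(
       api_key="hypothetical-api-key",
       resource="hypothetical-resource",
       event="hypothetical-event",
       environment="hypothetical-environment",
       service="hypothetical-service",
   )

Calling ``urllib.request.urlopen(request)`` would submit the alert. The sketch
stops before that step so that it can be run without access to the server.
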

Alerts displayed in OTC Alerta are generated by Executor, Scheduler, EpMon or
Grafana:

- "Executor alerts" report playbook results, that is, whether a playbook has
  completed or failed.
- "Grafana alerts" report breaches of defined thresholds, for example an API
  response time that is higher than the defined threshold.
- "Scheduler alerts" TBD
- "EpMon alerts" provide information about failed endpoint queries, with details
  of the request in curl form and the respective error response.

.. image:: training_images/alerta_dashboard.png


Alerts in Alerta are organized in environment tabs based on OTC regions:

- PRODUCTION EU-DE
- PRODUCTION EU-NL
- HYBRID-SWISS
- ALL

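
The environment tabs correspond to filtering alerts by their environment. As a
rough sketch, the same filter could be expressed as a query against the Alerta
REST API. The ``/api`` path prefix and the environment value are assumptions;
the environment names actually stored on the server may differ from the tab
labels shown in the web console.

.. code-block:: python

   from urllib.parse import urlencode

   # Assumption: the API is served under /api on the Alerta host.
   ALERTA_API = "https://alerts.eco.tsi-dev.otc-service.com/api"


   def alerts_query_url(environment=None, status="open"):
       """Return a URL that lists alerts, optionally for one environment.

       Passing environment=None applies no environment filter, which mirrors
       the ALL tab.
       """
       params = {"status": status}
       if environment is not None:
           params["environment"] = environment
       return f"{ALERTA_API}/alerts?{urlencode(params)}"


   # "hypothetical-environment" is a placeholder, not a real environment name.
   url_one_environment = alerts_query_url("hypothetical-environment")
   url_all_environments = alerts_query_url()
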

Every alert offers three views:

- **Details** - all alert parameters shown in a single view
- **History** - occurrences of the alert over time (without de-duplication)
- **Data** - the error message extracted from the event


An alert object consists of the following fields:

+----------------------+--------------------------------------------------------------------------------+
| Alert Field          | Description                                                                    |
+======================+================================================================================+
| **Alert ID**         | Reference to the alert in Alerta                                               |
+----------------------+--------------------------------------------------------------------------------+
| **Create Time**      | Timestamp of alert creation                                                    |
+----------------------+--------------------------------------------------------------------------------+
| **Service**          | Information about affected service and type of monitoring                      |
+----------------------+--------------------------------------------------------------------------------+
| **Environment**      | Information about affected environment/region                                  |
+----------------------+--------------------------------------------------------------------------------+
| **Resource**         | Details of the particular resource in which the issue occurred                 |
+----------------------+--------------------------------------------------------------------------------+
| **Event**            | Short description of the error result                                          |
+----------------------+--------------------------------------------------------------------------------+
| **Correlate**        | Currently not in use                                                           |
+----------------------+--------------------------------------------------------------------------------+
| **Group**            | Further categorization of alerts (currently not used)                          |
+----------------------+--------------------------------------------------------------------------------+
| **Severity**         | Critical - EpMon, Major - ApiMon                                               |
+----------------------+--------------------------------------------------------------------------------+
| **Status**           | - **Open** - default status when an alert is received in Alerta                |
|                      | - **Ack** - acknowledged status, indicating that a user has taken the          |
|                      |   incident on the service or host into account                                 |
|                      | - **Shelve** - sets the alert status to shelved, which removes the alert       |
|                      |   from the active console and prevents any further notifications               |
|                      | - **Close** - sets the alert status to closed                                  |
+----------------------+--------------------------------------------------------------------------------+
| **Value**            | Same as the Event field                                                        |
+----------------------+--------------------------------------------------------------------------------+
| **Text**             | Currently not in use                                                           |
+----------------------+--------------------------------------------------------------------------------+
| **Trend Indication** | Currently not in use                                                           |
+----------------------+--------------------------------------------------------------------------------+
| **Timeout**          | Time after which the alert disappears from Alerta (default is 24h)             |
+----------------------+--------------------------------------------------------------------------------+
| **Type**             | - Apimon Executor Alert - ApiMon related events                                |
|                      | - Exception Alert - EpMon related events                                       |
+----------------------+--------------------------------------------------------------------------------+
| **Duplicate count**  | De-duplication feature - number of recurrences of the same alert               |
+----------------------+--------------------------------------------------------------------------------+
| **Repeat**           | If duplicateCount is 0 or the alert status has changed, then repeat is         |
|                      | False; otherwise it is True                                                    |
+----------------------+--------------------------------------------------------------------------------+
| **Origin**           | Location from which the job was executed                                       |
+----------------------+--------------------------------------------------------------------------------+
| **Tags**             | Details of the particular resource in which the issue occurred                 |
+----------------------+--------------------------------------------------------------------------------+
| **Log Url**          | Reference to job execution output on Swift object storage                      |
|                      | (only for ApiMon alerts)                                                       |
+----------------------+--------------------------------------------------------------------------------+
| **Log Url Web**      | Reference to job execution output on Swift object storage                      |
|                      | (only for ApiMon alerts)                                                       |
+----------------------+--------------------------------------------------------------------------------+
| **State**            | - Present - if the alert is still current                                      |
|                      | - Present - if the alert is no longer occurring                                |
+----------------------+--------------------------------------------------------------------------------+

.. image:: training_images/alerta_detail.jpg
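
The relationship between the **Duplicate count**, **Repeat** and **Status**
fields can be illustrated with a small Python sketch. It is not Alerta's
implementation. The ``Alert`` class, the ``receive`` and ``set_status`` helpers
and the choice of de-duplication key (environment, resource, event and
severity) are assumptions made for illustration only.

.. code-block:: python

   from dataclasses import dataclass


   @dataclass
   class Alert:
       """Hypothetical, reduced model of an alert object."""
       environment: str
       resource: str
       event: str
       severity: str
       status: str = "open"
       duplicate_count: int = 0
       repeat: bool = False

       def key(self):
           # Assumed de-duplication key; alerts sharing it count as duplicates.
           return (self.environment, self.resource, self.event, self.severity)


   def receive(store, incoming):
       """Store a new alert or fold it into an existing duplicate."""
       existing = store.get(incoming.key())
       if existing is None:
           # First occurrence: duplicate_count stays 0 and repeat stays False.
           store[incoming.key()] = incoming
           return incoming
       # Re-occurrence of the same alert: count it and mark it as a repeat.
       existing.duplicate_count += 1
       existing.repeat = True
       return existing


   def set_status(alert, status):
       """Change the status; a status change resets repeat to False."""
       if status != alert.status:
           alert.status = status
           alert.repeat = False


   store = {}
   first = receive(store, Alert("env", "res", "evt", "major"))
   second = receive(store, Alert("env", "res", "evt", "major"))

After the second call the store still holds a single alert whose
``duplicate_count`` is 1 and whose ``repeat`` flag is True. Calling
``set_status(second, "ack")`` afterwards sets ``repeat`` back to False.
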