Incidents
TODO Incidents inform customers about the reason why some cloud service has changed its status from "green" (normal operation) to any other state.
Incidents are created under the following conditions:
- The Metric Processor evaluates the health metric of a specific cloud service to value 1 or 2, and an incident is created automatically on SD.
- The Service Incident Manager (SIM) manually creates an incident on SD for one or more cloud services.
Each cloud service on SD is represented by its name and a colored semaphore icon showing its current health status. The following service states can be shown on SD2:
- Operational - green "check" mark icon
- Maintenance - blue "wrench" mark icon
- Minor Issue - yellow "cross" mark icon
- Major Issue - brown "cross" mark icon
- Service Outage - red "cross" mark icon
These 5 states can be set manually for specific service(s) during incident creation, but only 2 states (Minor Issue and Major Issue) are set automatically from the Metric Processor health metrics. Incidents are displayed in the respective color scheme at the top of the SD page. It is also possible to navigate to the related incident by clicking the state icon next to the service.
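The states and the automatic path described above can be sketched as a small lookup. This is only an illustration, not the Status Dashboard code: the names `ServiceState`, `AUTOMATIC_STATES` and `state_for_health_metric` are hypothetical, and the mapping of value 1 to Minor Issue and value 2 to Major Issue is an assumption (the text only says that values 1 or 2 open an incident and that only these two states are set automatically).

```python
from enum import Enum
from typing import Optional


class ServiceState(Enum):
    """The five service states with their (color, icon) as listed above."""
    OPERATIONAL = ("green", "check")
    MAINTENANCE = ("blue", "wrench")
    MINOR_ISSUE = ("yellow", "cross")
    MAJOR_ISSUE = ("brown", "cross")
    SERVICE_OUTAGE = ("red", "cross")


# Assumption: 1 -> Minor Issue, 2 -> Major Issue. The source does not
# state which value maps to which state.
AUTOMATIC_STATES = {
    1: ServiceState.MINOR_ISSUE,
    2: ServiceState.MAJOR_ISSUE,
}


def state_for_health_metric(value: int) -> Optional[ServiceState]:
    """Return the state an automatic incident would carry, or None
    when the value does not trigger an automatic incident."""
    return AUTOMATIC_STATES.get(value)
```

Any other state (Maintenance, Service Outage) would, under this sketch, only ever be reached through a manually created incident.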
Once the service health status has changed and an incident has been created, there is no automated clean-up of the incident; it must be handled by the respective SIM. Only after the incident is closed does the service return to the "green" Operational state.
Incident manual creation process
As mentioned, besides automated creation, incidents can also be created manually. The Service Incident Manager must authenticate before being able to create an incident. Login is provided through OpenID Connect on the page https://status.cloudmon.eco.tsi-dev.otc-service.com/login/openid
Once logged in, the new option "Open new incident" appears in the top right corner of the page.
The incident creation process consists of these mandatory fields:
- Incident Summary - Description of the incident
- Incident Impact - Drop-down menu of 4 service states (Scheduled Maintenance, Minor Issue, Major Issue, Service Outage)
- Affected services - List of all OTC cloud services in conjunction with regions. One or more items can be chosen
- Start - Timestamp when incident has started
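To make the mandatory fields concrete, the following sketch models the form as a data class with a simple check. It is a hypothetical illustration only: the class `NewIncident`, its attribute names, and the `service/region` string format used for affected services are assumptions and do not come from the Status Dashboard itself.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List

# The 4 impact values offered by the drop-down menu.
IMPACTS = ("Scheduled Maintenance", "Minor Issue", "Major Issue", "Service Outage")


@dataclass
class NewIncident:
    summary: str                   # Incident Summary
    impact: str                    # Incident Impact
    affected_services: List[str]   # hypothetical "service/region" strings
    start: datetime                # Start

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means all
        mandatory fields are filled in."""
        errors = []
        if not self.summary.strip():
            errors.append("Incident Summary is required")
        if self.impact not in IMPACTS:
            errors.append("Incident Impact must be one of the 4 listed states")
        if not self.affected_services:
            errors.append("At least one affected service is required")
        return errors


incident = NewIncident(
    summary="Elevated API error rate",
    impact="Minor Issue",
    affected_services=["ecs/eu-de"],  # made-up example value
    start=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
)
```

Here `incident.validate()` returns an empty list because every mandatory field has a value.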
Incident update process
During the incident lifecycle, the SIM can update the incident with relevant information. The incident update process consists of these optional fields:
- Incident title - Change the title of the incident
- Update Message - Additional details related to the current status of the incident
- Update Status - Drop-down menu of 4 incident statuses (Analyzing incident, Fixing incident, Observing fix, Incident resolved)
- Next Update by - Timestamp by which the incident is expected to be updated with further information
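Because every update field is optional, an update can be thought of as a partial change to the incident. The sketch below illustrates that idea under stated assumptions: `Incident` and `apply_update` are hypothetical names, and storing update messages in a list is a design choice made for this example, not a description of the real data model.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional

# The 4 incident statuses offered by the "Update Status" drop-down.
UPDATE_STATUSES = (
    "Analyzing incident",
    "Fixing incident",
    "Observing fix",
    "Incident resolved",
)


@dataclass
class Incident:
    title: str
    status: Optional[str] = None
    messages: List[str] = field(default_factory=list)
    next_update_by: Optional[datetime] = None


def apply_update(
    incident: Incident,
    *,
    title: Optional[str] = None,
    message: Optional[str] = None,
    status: Optional[str] = None,
    next_update_by: Optional[datetime] = None,
) -> Incident:
    """Apply only the fields that were provided; omitted fields
    leave the incident unchanged."""
    if status is not None and status not in UPDATE_STATUSES:
        raise ValueError(f"unknown update status: {status!r}")
    if title is not None:
        incident.title = title
    if message is not None:
        incident.messages.append(message)
    if status is not None:
        incident.status = status
    if next_update_by is not None:
        incident.next_update_by = next_update_by
    return incident
```

For example, `apply_update(inc, message="Root cause found", status="Fixing incident")` changes the status and records the message while keeping the title as it was.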
Incident manual closure process
An incident is never closed automatically. The SIM needs to close the incident by changing its status to "Incident resolved" during the incident update process. After that, the incident disappears from the list of active incidents and the service health status changes back to the "green" Operational state. Every closed incident is recorded in the Incident History.
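The effect of closing an incident can be sketched as three steps: remove it from the active list, record it in the history, and reset the affected services. The function `close_incident` and the dictionary layout are hypothetical. The check that a service stays degraded while another active incident still affects it is an assumption of this sketch; the text does not say how overlapping incidents are handled.

```python
from typing import Dict, List


def close_incident(
    incident_id: int,
    active: Dict[int, dict],
    history: List[dict],
    service_states: Dict[str, str],
) -> None:
    """Resolve an incident: drop it from the active incidents, append
    it to the history, and set its services back to Operational."""
    incident = active.pop(incident_id)
    incident["status"] = "Incident resolved"
    history.append(incident)

    # Assumption: services covered by another active incident keep
    # their current state.
    still_affected = {s for other in active.values() for s in other["services"]}
    for service in incident["services"]:
        if service not in still_affected:
            service_states[service] = "Operational"


# Made-up example data for illustration.
active = {
    1: {"services": ["ecs/eu-de"], "status": "Fixing incident"},
    2: {"services": ["ecs/eu-de", "obs/eu-nl"], "status": "Observing fix"},
}
history: List[dict] = []
states = {"ecs/eu-de": "Major Issue", "obs/eu-nl": "Minor Issue"}
close_incident(2, active, history, states)
```

After this call, incident 2 is in `history`, `obs/eu-nl` is back to "Operational", and `ecs/eu-de` keeps its state because incident 1 is still active.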
Incident notifications
Status Dashboard supports RSS feeds for incident notifications. Details on how to set up an RSS feed are described on the notifications <sd2_notifications> page.