blueprints/doc/source/caf/govern-and-manage/emergency-handling.rst
2023-11-30 12:20:52 +01:00

Emergency Handling

The process of receiving, handling, and escalating incidents must be standardized to ensure that customer issues are handled at the promised service level. The incident handling responsibilities, time requirements, and notification mechanism must be clearly defined. Services must be quickly recovered to ensure the promised quality and availability.

Roles and responsibilities

Role Responsibility
O&M engineer
  • Receive customer incidents reported by hotline or email.
  • Record all information about received incidents, including contact methods of reporters, incident features, details, and time.
  • Diagnose and analyze incidents and provide solutions by referring to relevant documentation. Transfer incidents that cannot be resolved to developers, or seek help from O&M leaders.
  • Take primary responsibility for tracking incidents: record handling progress and keep customers updated as required.
  • Demarcate incidents and provide solutions within the specified period. Transfer unresolvable incidents to developers within the specified time.
  • Close incidents once the reporters confirm that the provided solutions resolved them.
  • Transfer recurring incidents, and those with unknown causes or known defects, to the issue management process.
  • Summarize lessons learned from typical and general incidents.
Developer
  • Locate and analyze the causes of incidents transferred from O&M engineers, and resolve the incidents.
  • Follow up with incident owners to confirm resolution.
  • Locate causes of bugs in the production environment, and provide and implement comprehensive solutions.

Incident manager (O&M leader)
  • Coordinate and monitor the incident handling process.
  • Coordinate resources for major incidents.
  • Review and approve solutions for major incidents.
  • Track the production of major incident reports.
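
The record-keeping and closing duties above can be sketched as a small data structure. This is a minimal illustration, not a prescribed schema: the class names, field names, and status values are all assumptions introduced here, chosen to mirror the information an O&M engineer is asked to record (reporter contact, incident features, details, and time) and the rule that an incident is closed only after the reporter confirms the resolution.

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum


class Status(Enum):
    # Hypothetical lifecycle states; adapt to your own process.
    RECEIVED = "received"
    TRANSFERRED = "transferred"  # handed over to a developer
    CLOSED = "closed"            # reporter confirmed the resolution


@dataclass
class Incident:
    reporter_contact: str   # hotline number or email address of the reporter
    features: str           # short summary of the symptoms
    details: str
    reported_at: datetime
    status: Status = Status.RECEIVED
    progress_log: list = field(default_factory=list)

    def log(self, when: datetime, note: str) -> None:
        """Record handling progress so customers can be kept updated."""
        self.progress_log.append((when, note))

    def transfer(self, when: datetime, reason: str) -> None:
        """Hand an unresolvable incident over to a developer."""
        self.status = Status.TRANSFERRED
        self.log(when, f"transferred to developer: {reason}")

    def close(self, when: datetime, confirmed_by_reporter: bool) -> None:
        """Close only after the reporter confirms the solution worked."""
        if not confirmed_by_reporter:
            raise ValueError("reporter has not confirmed the resolution")
        self.status = Status.CLOSED
        self.log(when, "closed after reporter confirmation")
```

Keeping the progress log on the record itself makes it straightforward for whoever holds the incident to report status to the customer at any point.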

Incident severity and response

Level 1: incidents that have a major impact on services, such as serious damage, data loss, service data or function errors that cause multiple customer complaints, and system faults that recur within a short period.

Level 2: incidents that have a minor impact on services, such as unavailability of a few functions (service degradation), impaired functions for some users, data inconsistency (without financial loss), and common system faults.

Level 3: incidents that have no impact on services, such as data query and consulting requests. Services run normally, but user experience is affected.

Response, recovery, and resolution time requirements vary by incident level. The response clock runs 24/7 and starts as soon as an incident is reported.

Incident Level  Response Time (minutes)  Recovery Time (hours)  Resolution Time (days)
1               10                       2                      7
2               30                       6                      20
3               60                       24                     60

Remarks:

  • The incident levels and time requirements above are only examples. Adjust them as required.
  • If O&M engineers cannot resolve an incident within the specified time by referring to relevant documentation, they can transfer it out.
  • The response time is the maximum delay before an incident handler starts handling an incident after receiving it.
  • The recovery time is the maximum duration needed to recover services after an incident occurs.
  • The resolution time is the maximum duration taken by an O&M engineer to resolve or transfer out an incident.
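
The time requirements above can be turned into concrete deadlines with a few lines of code. The sketch below uses the example values from the table. The names `SLA`, `sla_deadlines`, and `is_overdue` are hypothetical, and as a simplifying assumption all three clocks start at the time the incident is reported.

```python
from datetime import datetime, timedelta

# Example targets copied from the table above; adjust them as required.
SLA = {
    1: {"response": timedelta(minutes=10),
        "recovery": timedelta(hours=2),
        "resolution": timedelta(days=7)},
    2: {"response": timedelta(minutes=30),
        "recovery": timedelta(hours=6),
        "resolution": timedelta(days=20)},
    3: {"response": timedelta(minutes=60),
        "recovery": timedelta(hours=24),
        "resolution": timedelta(days=60)},
}


def sla_deadlines(level: int, reported_at: datetime) -> dict:
    """Return the latest allowed time for each milestone.

    The clock runs 24/7, so no business-hours adjustment is applied.
    Assumption: every clock starts when the incident is reported.
    """
    return {name: reported_at + limit for name, limit in SLA[level].items()}


def is_overdue(level: int, reported_at: datetime,
               milestone: str, now: datetime) -> bool:
    """True if the given milestone deadline has already passed."""
    return now > sla_deadlines(level, reported_at)[milestone]
```

For a Level 1 incident reported at 12:00, this gives a response deadline of 12:10 and a recovery deadline of 14:00 on the same day.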

Incident escalation and notification

The notification mechanism defined in the following table applies to Level 1 and Level 2 incidents.

Notification                       Method         Recipients
Initial notification (30 minutes)  SMS and email  (Send to) Service owner; (CC) O&M team
                                   Phone call     R&D leader
Handling progress (1 hour)         SMS and email  (Send to) Service owner; (CC) O&M team
Fault rectification                SMS and email  (Send to) Service owner; (CC) O&M team
Escalation of overdue incidents    SMS and email  (Send to) Service owner; (CC) O&M team, R&D leader
                                   Phone call     R&D leader

If the impact of a Level 2 incident on services and users worsens, escalate it to Level 1 and handle it according to Level 1 standards.
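
The notification matrix and the escalation rule can be expressed as plain data plus two small functions. The sketch below is one possible encoding, not a reference implementation. The event keys (`initial`, `progress`, `rectified`, `overdue`) and all function and class names are hypothetical, and it assumes that the O&M team is copied (CC) on every SMS and email notification.

```python
from dataclasses import dataclass

SMS_EMAIL = "SMS and email"
PHONE = "phone call"


@dataclass(frozen=True)
class Notice:
    method: str
    to: tuple
    cc: tuple = ()


# Hypothetical event keys, one per row of the notification table.
NOTIFICATION_PLAN = {
    "initial": (    # initial notification, within 30 minutes
        Notice(SMS_EMAIL, to=("service owner",), cc=("O&M team",)),
        Notice(PHONE, to=("R&D leader",)),
    ),
    "progress": (   # handling progress, every 1 hour
        Notice(SMS_EMAIL, to=("service owner",), cc=("O&M team",)),
    ),
    "rectified": (  # fault rectification
        Notice(SMS_EMAIL, to=("service owner",), cc=("O&M team",)),
    ),
    "overdue": (    # escalation of overdue incidents
        Notice(SMS_EMAIL, to=("service owner",),
               cc=("O&M team", "R&D leader")),
        Notice(PHONE, to=("R&D leader",)),
    ),
}


def notices_for(level: int, event: str) -> tuple:
    """Return the notices to send; the table covers Levels 1 and 2 only."""
    if level not in (1, 2):
        return ()
    return NOTIFICATION_PLAN[event]


def escalate_if_worsened(level: int, impact_worsened: bool) -> int:
    """Raise a Level 2 incident to Level 1 when its impact worsens."""
    return 1 if level == 2 and impact_worsened else level
```

Holding the matrix as data rather than branching logic means a change to recipients or methods is a one-line edit that is easy to review.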

Precautions during incident management

  • Record all incidents on the live network in the unified event management system for analysis.
  • Respond to and resolve incidents within the specified time.
  • Analyze the incidents handled each month.
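
A monthly analysis can start from a simple per-month summary of the recorded incidents. The sketch below assumes each incident has been exported from the event management system as a `(reported_at, level, met_sla)` tuple. That tuple layout and the function name `monthly_summary` are illustrative assumptions, not part of any specific tool.

```python
from collections import Counter, defaultdict
from datetime import datetime


def monthly_summary(incidents):
    """Count incidents per level and missed time requirements per month.

    `incidents` is an iterable of (reported_at, level, met_sla) tuples,
    where `met_sla` says whether the time requirements were met.
    """
    summary = defaultdict(lambda: {"by_level": Counter(), "sla_missed": 0})
    for reported_at, level, met_sla in incidents:
        month = reported_at.strftime("%Y-%m")  # for example "2023-11"
        summary[month]["by_level"][level] += 1
        if not met_sla:
            summary[month]["sla_missed"] += 1
    return dict(summary)
```

A rising count of missed time requirements, or of Level 1 incidents, in consecutive months is the kind of signal this review is meant to surface.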