2 Own lifecycle of all Major Incidents
Enable quick restoration of service and minimize impact for all key incidents via centralized communication, collaboration, facilitation, and coordination.

Major Incident Facilitation
When the IMT takes ownership of an incident record, it will provide direction over an AT&T bridge. The bridge is used to bring support groups together to make decisions on service restoration and normalization, so that technicians can collaborate and troubleshoot the incident.
- Research potential causes using the tools available to the team, including change requests, dashboards, and Remedy tickets that may be related to the incident
- Engage Support Groups / Technicians during the course of a Major Incident as needed
- Document actions from the Major Incident in an Incident Ticket
- Regularly communicate to stakeholders and interested parties
- Monitor whether the Major Incident meets any Disaster Recovery Triggers

Own lifecycle of all Major Incidents
- Successful restoration and normalization of Service for Major Incidents
- Remedy Ticket Handling and Documentation
  - Creation of Problem Investigations, Emergency Change Requests, & CI Unavailability Forms
- Send Incident Notifications and Escalations
  - Pages, texts, and emails regarding a Major Incident
- Maintenance of the Incident Management Notifications site
- MyQumas documentation, training
- Direct-line for Vendors, Partners, & Customers

Lifecycle Owner
- Leadership, Direction, Standards & Practices

Incident Task Force
For prolonged incidents where there is no immediate resolution, the IMT will schedule and facilitate meetings to investigate root cause and normalize service.

Platform Owner
- Planning, Road-mapping, & Supporting

Support Groups / Technical Staff
- Investigation & Resolution of Major Incidents

Participate in Problem Reviews
The Problem Management team schedules the review of the Key Incident and invites the IMT. The IMT is needed to collaborate on details and assist in determining the root cause of incidents it handled. The IMT will also give feedback on how well the incident process was executed.

Problem COE
- Communicate & Cooperate

Business
- Inform on Major Incidents

Service Desk
- Coordination during a Major Incident

Emergency Change Management
When a Key Incident requires a change within 24 hours to restore service, the Incident Management Team will be engaged to initiate the Emergency Change process.

Service Management teams
- Consult with other SMO process teams for consistency
3 What is a Major Incident?
A Major Incident is an Incident that meets any of the criteria below.
5 Issues with current Root Cause Investigation process
- Lack of focus
- Unclear Ownership
- Longer Investigation times
- Inconsistent Prioritization
- Finger-pointing
- Hot Potato
- Afraid to get ‘dinged’
- Lack of customer perspective
- Do we have the right skill set performing the right functionality?
8 Future: Major Incident Process
Goal: decrease IMT engagement time

Pilot to begin where we dissect a critical application:
- Understand the current monitoring strategy
- Enhance the monitors
- Have monitors directly notify the IMT
- Build pre-defined communication / escalation scripts per monitor

Some Stats:
- The average time to engage the IMT in 2013 was 16:59:46 (across 700 total Incidents). A few outliers pushed this number to just under 17 hours. With the top 5 removed, the average drops to 7:37:35.
- In terms of engaging the IMT, we use 20 minutes as a high-water mark:
  - In 2012, 82% of Incidents handled by the IMT took longer than 20 minutes to engage the IMT.
  - In 2013, this number dropped to 59%.

One of our issues today is not so much monitoring as easily identifying what is critical. Our current model is for support teams to engage the IMT after they have determined that a Major Incident has occurred. With this model, we have seen a large variance in engagement time depending on the support group and the issue.
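The engagement statistics above (overall average, average with the top outliers removed, and share of Incidents over the 20-minute high-water mark) could be reproduced with a small script along these lines. This is only a sketch: the function names and the sample values are hypothetical and are not the real 2012 or 2013 data, and it assumes engagement times have already been exported from the ticketing tool as durations.

```python
from datetime import timedelta

# The 20-minute high-water mark named on the slide.
HIGH_WATER_MARK = timedelta(minutes=20)


def mean_engagement(times):
    """Arithmetic mean of a list of timedelta engagement times."""
    return sum(times, timedelta()) / len(times)


def mean_without_top(times, k):
    """Mean after dropping the k longest engagement times (the outliers)."""
    if k >= len(times):
        raise ValueError("k must be smaller than the number of incidents")
    kept = sorted(times)[: len(times) - k]
    return mean_engagement(kept)


def share_over(times, threshold=HIGH_WATER_MARK):
    """Fraction of incidents whose engagement time exceeded the threshold."""
    return sum(1 for t in times if t > threshold) / len(times)


# Hypothetical sample in minutes; NOT taken from the real incident records.
sample = [timedelta(minutes=m) for m in (5, 12, 18, 25, 40, 90, 3000)]

if __name__ == "__main__":
    print("Average time to engage:", mean_engagement(sample))
    print("Average without top 1: ", mean_without_top(sample, 1))
    print("Share over 20 minutes: ", f"{share_over(sample):.0%}")
```

Reporting the trimmed average next to the raw one makes the effect of a few very long engagements visible, which is the same comparison the slide draws between 16:59:46 and 7:37:35.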