Service Outages

Production outages do happen. Whether they stem from human error, hardware failure, or an act of God, networks, systems and applications will eventually crash. How an IT manager handles outages is critical to how their competency, professionalism and leadership are perceived. The handling of a critical outage may make or break a manager's career in that organization.

First, the IT manager should establish a culture that treats outages as unacceptable. The team's operating principles or charter is an excellent place to document this. Doing so gives outages a sense of urgency even for those IT employees who are not exposed to the business customers screaming about the lack of service. A written commitment to keeping Production stable also lets business management know they are a priority. Team members should be forewarned that outages may require reprioritization and overtime, and that communications may follow a different protocol.

Second, the IT manager must be able to quickly categorize the impact of the loss of service. The manager must assess the ramifications of continued downtime from both a business and a political perspective. The first decision point is to determine whether the situation is a disaster worthy of activating a disaster recovery (DR) plan or whether it is local and self-contained. The former case branches into a completely different set of actions that is outside the scope of these comments. Ideally, the manager has application profiles and configuration management database (CMDB) documentation that provide an accurate picture of impact and criticality. The manager must consider items such as the following.

  • Is revenue generation directly affected?
  • Is an external customer experience directly affected?
  • Are there readily quantifiable costs to the downtime? For instance, an outage to a development server may result in $100/hour contract developers sitting on their hands.
  • Are any major deadlines or projects at risk? (e.g., year-end close, delivery of new application, etc.)
  • Is there a workaround?
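
The questions above can be captured as a simple triage record. What follows is only a minimal sketch in Python, not a standard method: the class name, field names, severity labels and thresholds are all hypothetical and would need to be adapted to an organization's own service management definitions. The head count of five developers and the four-hour duration are likewise assumptions for illustration; only the $100/hour rate comes from the example above.

```python
from dataclasses import dataclass


@dataclass
class OutageImpact:
    """Answers to the triage questions for a single outage (hypothetical fields)."""
    revenue_directly_affected: bool
    external_customers_affected: bool
    deadline_at_risk: bool
    workaround_available: bool
    idle_staff: int = 0         # people unable to work during the outage
    hourly_rate: float = 0.0    # cost per idle person per hour, in dollars


def idle_cost(impact: OutageImpact, hours_down: float) -> float:
    """Readily quantifiable cost: idle people x hourly rate x hours down."""
    return impact.idle_staff * impact.hourly_rate * hours_down


def severity(impact: OutageImpact) -> str:
    """Map the answers to a coarse label. The thresholds are illustrative only."""
    if impact.revenue_directly_affected or impact.external_customers_affected:
        level = "critical"
    elif impact.deadline_at_risk or impact.idle_staff > 0:
        level = "high"
    else:
        level = "moderate"
    # A usable workaround buys time, so step the label down one notch.
    if impact.workaround_available:
        level = {"critical": "high", "high": "moderate", "moderate": "low"}[level]
    return level


# Assumed scenario: a development server is down and five contract
# developers at $100/hour cannot work for four hours.
dev_server = OutageImpact(
    revenue_directly_affected=False,
    external_customers_affected=False,
    deadline_at_risk=False,
    workaround_available=False,
    idle_staff=5,
    hourly_rate=100.0,
)
print(severity(dev_server))      # high
print(idle_cost(dev_server, 4))  # 2000.0
```

Even a rough figure such as this one gives the manager something concrete to bring to the conversation with business management about how many resources the outage deserves.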

Third, damage control is important. There are politics at play in any organization and they must be managed. The IT manager should strive to be the one who informs their own management and business management about an outage. The manager needs to give management, internal customers, and external customers a sense of comfort that the situation is being addressed. Depending on the nature and criticality of the outage, a targeted email, a phone call, or a trip to someone’s office is wholly appropriate, even expected.

Next, the IT manager must marshal all necessary resources to address the outage, asking such questions as the following.

  • Does the situation warrant pulling resources off critical projects?
  • Do we open up trouble tickets with the relevant vendor support organizations?
  • Should we engage partners and line up emergency assistance?
  • Do we create an all-hands-on-deck scenario in a war room?

Managers must walk a careful line between reacting appropriately and over-reacting, both in fact and in perception. Business management may not be satisfied with anything less than a host of people in a conference room; however, those same people may not tolerate missed project dates caused by the resulting reassignment of staff.

There is indeed value in having everyone present to troubleshoot an outage, as someone might have specialized knowledge or a lucky epiphany that will save the day. However, that comes at a cost: the confusion caused by a cacophony of voices and ideas, and the disruption of other operations and project execution.

Someone must be the focal point, coordinator or point of contact. Very often that is the IT manager or team lead of the affected group. This individual should facilitate the outage response team, document suggestions and activities, and provide unified communications to affected parties. For larger organizations or more critical outages, multiple individuals may take on these roles.

Finally, the IT manager should ensure there is proper ongoing status reporting and communication. The response team must know what fixes were attempted, whether they succeeded, and what new issues have arisen. The response team might gather periodically for a status call or meeting, or an issues list might be circulated or displayed on a collaboration space. Based on the nature of the outage and existing service management procedures, customers and management must be notified accordingly. In fact, depending on the situation, the IT manager may not want to rely solely on an existing service management process, but instead follow up with personal notifications and status updates.

The culmination of these communications and the lasting artifact of the outage is the Root Cause Analysis, or RCA.

Outages are a fact of life for the IT manager. Here we discussed a protocol of establishing the right culture, assessing the outage, performing damage control, marshaling resources, communicating properly and producing an RCA.