Root Cause Analysis (RCA)

The culmination of outage communications, and the lasting artifact of the outage, is the Root Cause Analysis, or RCA. Some organizations call this a Reason for Outage (RFO) or a Post-Mortem. Sometimes multiple versions of these documents exist, depending on formality and audience. This document is critical for closure, for CYA, and for capturing lessons learned. It shows customers that IT is transparent, acknowledges its mistakes, and is committed to not repeating them.

I recommend these sections in an RCA.

Outage identification: Here, assign a phrase that identifies the outage event, such as Internet circuit failure, CRM application outage, or SAP database crash. This section should note any relevant helpdesk/vendor support, change management, or problem management ticket numbers that may be referenced for more detail.

Symptom identification: Document the exact symptoms of the outage, that is, which services were unavailable and for how long. This section is generally written from the perspective of the service consumers.

Apparent root cause: Describe the cause of the service outage from the perspective of the service provider; for instance, flapping circuits, deployment of a corrupted build, or RAM failure. Alternatively, the outage could have been caused by a bungled change resulting from human or process error. Maybe the change management process was not correctly followed. If this is the case, the IT manager must ensure it is treated as transparently as a hardware error.

Resolutions: Provide detail on the technical resolution of the problem.

Timeline: List a detailed timeline of events, compiled from email, system logs, and personal accounts. The level of detail should be commensurate with the criticality of the outage and its political ramifications.

Recommendations for Mitigating Future Occurrence: Describe how IT can prevent this outage from recurring. This is the most critical section, because it addresses the future. The past is past, but this section describes how the organization will learn from what happened. If the outage resulted from human error, discuss an improvement plan. If the outage resulted from a lack of funding, or from a lack of support for procedures on the part of the business or customers, here is an opportunity to make your case with real facts.
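For shops that keep RCAs in a ticketing or documentation system, the sections above can be captured as a simple record. The following is a minimal sketch in Python, not a prescribed format: the class names (RootCauseAnalysis, TimelineEvent), the field names, the ticket number, and the timestamps are all hypothetical and chosen only for illustration. It keeps the timeline in chronological order as events are gathered from different sources, and it reports which sections are still empty before the document goes out.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class TimelineEvent:
    """One entry in the outage timeline (hypothetical structure)."""
    timestamp: datetime
    source: str        # e.g. "email", "system log", "personal account"
    description: str


@dataclass
class RootCauseAnalysis:
    """Illustrative container for the RCA sections recommended above."""
    outage_id: str                                   # Outage identification
    ticket_numbers: List[str] = field(default_factory=list)
    symptoms: str = ""                               # Symptom identification
    apparent_root_cause: str = ""                    # Apparent root cause
    resolution: str = ""                             # Resolutions
    timeline: List[TimelineEvent] = field(default_factory=list)
    recommendations: List[str] = field(default_factory=list)

    def add_event(self, event: TimelineEvent) -> None:
        # Events arrive from several sources in no particular order,
        # so re-sort by timestamp after every addition.
        self.timeline.append(event)
        self.timeline.sort(key=lambda e: e.timestamp)

    def missing_sections(self) -> List[str]:
        # Return the names of sections that have not been filled in yet.
        sections = {
            "Symptom identification": self.symptoms,
            "Apparent root cause": self.apparent_root_cause,
            "Resolutions": self.resolution,
            "Timeline": self.timeline,
            "Recommendations for Mitigating Future Occurrence": self.recommendations,
        }
        return [name for name, value in sections.items() if not value]


# Hypothetical usage: all values below are made up for illustration.
rca = RootCauseAnalysis(
    outage_id="CRM application outage",
    ticket_numbers=["INC-0001"],  # placeholder ticket number
    symptoms="CRM application unavailable to all users.",
)
rca.add_event(TimelineEvent(datetime(2024, 1, 1, 9, 30), "email", "Vendor engaged."))
rca.add_event(TimelineEvent(datetime(2024, 1, 1, 9, 5), "system log", "First errors logged."))
print([e.description for e in rca.timeline])
print(rca.missing_sections())
```

A check like missing_sections is one way to make sure the recommendations section, the most critical one, is never left blank before the RCA is distributed.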

The RCA is an excellent vehicle for communication, transparency, accountability, and improvement. In some organizations and scenarios, the document may be produced in two versions, internal and external. I also recommend mentioning the outage in change management meetings and reporting, as it was a definite change to Production that everyone should know about. For mature shops with Problem Management processes, the RCA is a foundational component of that discipline.

Outages are a fact of life for the IT manager. Here we discussed a protocol for handling them: establishing the right culture, assessing the outage, performing damage control, marshaling resources, communicating properly, and producing RCAs.