Root Cause Analysis for Software Problems

It happens. Despite the best efforts of your developers and test team, bugs sometimes escape to customers. As a quality or test leader, you need to handle these situations in a way that helps the team learn and improve. This template for root cause analysis has worked very well to help the team learn about the escape and make improvements.

To guide the investigation, I’ve developed the following set of questions, which lead to a comprehensive root cause analysis. These questions steer the review in a more productive direction than finger pointing.

Describe Problem:

Describe the symptoms and consequences of the problem. This description should be a summary with enough detail for readers to become familiar with the issue. Include a reference to the trouble ticket or other documentation your organization uses to track such issues. Usually the Quality team or leader prepares the description.

Where was the problem introduced?

Describe the phase in which the problem was introduced, e.g. requirements, design, code, or build.

Describe the root cause of the problem:

For the phase where the problem was introduced, describe what actually happened. For example, the requirements could be missing, incorrect, or unclear. Designs might not have provided for error handling or considered the performance requirements. Coding errors may include a missing table entry, incorrect logic (“and” instead of “or”), etc.
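As an illustration of the “and” instead of “or” kind of coding error, here is a minimal Python sketch. The eligibility rule, the function names, and the sample inputs are all hypothetical and exist only to show how such a defect looks and which inputs expose it.

```python
def is_eligible_buggy(age: int, has_consent: bool) -> bool:
    # Hypothetical intended rule: adults qualify, and so do minors with consent.
    # Bug: "and" was written where "or" was intended.
    return age >= 18 and has_consent


def is_eligible_fixed(age: int, has_consent: bool) -> bool:
    # Corrected logic for the same hypothetical rule.
    return age >= 18 or has_consent


# Compare both versions over every combination of adult/minor and consent.
cases = [(20, True), (20, False), (16, True), (16, False)]
disagreements = [
    case for case in cases
    if is_eligible_buggy(*case) != is_eligible_fixed(*case)
]
print(disagreements)
```

The two versions agree on some inputs, so a test suite that only exercises those inputs would pass with the bug in place. That is worth keeping in mind when asking later how the problem escaped testing.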

The root cause is often difficult to determine. One tool to use is the “5-whys” approach, in which you ask “why” five times, or until you get to the root cause.

In many cases, there will be multiple causes that, combined, contributed to the escape to production. In these more complex scenarios, the “fishbone diagram” is a useful tool.

Determining the root cause is the hardest part of this process. Often, the people involved are defensive or may not be oriented towards digging for root cause (especially if the problem is already fixed).

What reviews were held for the phase?

Now that you understand the root cause, you want to start diagnosing your quality system to understand how the problem escaped. For example, if the root cause is a missing or incorrect requirement, inspect the actual documentation and understand how the requirements were reviewed (and perhaps signed off).

For coding errors, check to see how the code was peer reviewed and what, if any, static analysis was performed.

How and why did the problem escape testing?

Examples include:

  • Test case doesn’t exist
  • Test exists but was not run (not planned, risk-based decision, forgot/miscommunication)
  • Test executed with a different set of data
  • Problem was introduced after test passed, etc.

Also, describe why the decisions were made. Include unit tests, integration/functional tests, and system tests.
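If you record the escape reason for each analysis, patterns become visible over time. The following is a rough Python sketch of that idea. The `EscapeReason` enum mirrors the examples listed above, while the `tally` helper and the sample history are assumptions made up for illustration, not real data.

```python
from collections import Counter
from enum import Enum


class EscapeReason(Enum):
    # Categories taken from the examples listed above.
    NO_TEST_CASE = "test case doesn't exist"
    TEST_NOT_RUN = "test exists but was not run"
    DIFFERENT_DATA = "test executed with a different set of data"
    INTRODUCED_AFTER_PASS = "problem was introduced after test passed"


def tally(reasons):
    """Return (reason, count) pairs, most frequent first."""
    return Counter(reasons).most_common()


# Made-up history of past escapes, for illustration only.
past_escapes = [
    EscapeReason.NO_TEST_CASE,
    EscapeReason.TEST_NOT_RUN,
    EscapeReason.NO_TEST_CASE,
]

for reason, count in tally(past_escapes):
    print(f"{count} x {reason.value}")
```

A tally like this does not replace the written explanation of why each decision was made, but it can point to where the test process most often has gaps.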

What could be done to prevent this type of error in the future?

What process improvement could have prevented this problem from occurring?

What could be done to find this type of problem in the future? 

What test or review should be implemented? Examples are adding a test case, implementing a new automated test, adding new exit criteria to peer review, etc.

Answering these questions does take some effort, but following this methodology for the most impactful escapes helps the whole team learn to be more effective.
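If your team prefers to keep these analyses in a structured form rather than a free-text document, the questions can be captured as a simple record. This is only a sketch: the `RootCauseReport` class, its field names, and the `unanswered` helper are my own assumptions, chosen to match the questions in this template.

```python
from dataclasses import dataclass, fields


@dataclass
class RootCauseReport:
    # One field per question in the template; names are illustrative.
    problem_description: str = ""
    phase_introduced: str = ""
    root_cause: str = ""
    reviews_held: str = ""
    escape_from_testing: str = ""
    prevention: str = ""
    detection: str = ""

    def unanswered(self) -> list[str]:
        """Return the names of questions that are still blank."""
        return [
            f.name for f in fields(self)
            if not getattr(self, f.name).strip()
        ]


# Placeholder values; a real report would carry the actual findings.
report = RootCauseReport(
    problem_description="Summary of symptoms plus the ticket reference",
    phase_introduced="code",
)
print(report.unanswered())
```

Checking `unanswered()` before closing a review is one way to make sure no question in the template was skipped.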
