Category Archives: Quality

Software Root Cause Analysis: 3 Questions to Answer

Image of explosion represents things going wrong on a project.

Sometimes, things don’t go as planned.

Here are three questions that I like to answer when performing a root cause analysis for escaped bugs:

  1. How was the bug introduced in the first place?
  2. How did we not catch it earlier?
  3. What are we doing to prevent this problem in the future?

For the first two questions, I have a handy template for performing root cause analysis.

For the third question (what are we doing to prevent the problem?), we generally have short-term and longer-term solutions.  In the short term, we should add the appropriate test or check, the one that would have caught the problem in the first place.  That answers the question for that particular issue.
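
Here's a minimal sketch of that short-term fix: pin the escaped bug down with a regression test. The function, the bug's behavior, and the bug ID are all hypothetical; the point is that the new test reproduces the exact conditions the original test suite missed.

```python
def apply_discount(price: float, quantity: int) -> float:
    """Hypothetical function where a bug escaped when quantity == 0."""
    if quantity <= 0:
        return 0.0  # the fix: this path used to raise an exception
    return price * quantity * 0.9


def test_zero_quantity_regression():
    """Regression test for hypothetical escaped bug BUG-1234."""
    assert apply_discount(10.0, 0) == 0.0
```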

For the longer term, we collect data about escaped bugs: their causes and the reasons they escaped.  We capture that data in the bug tracking system as two categorized fields.  When we have enough data, we can examine trends.  I usually start with a simple Pareto analysis showing the top few causes and reasons, then work with the team to ask how we can improve our processes and practices.  It's often useful to filter the Pareto analysis to the most painful bugs (those found by customers, high-severity issues, etc.)
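
Here's a minimal sketch of that Pareto analysis, assuming escaped bugs can be exported from the bug tracker with their cause field; the category names and counts below are made up for illustration.

```python
from collections import Counter

# Hypothetical "cause" field values exported from the bug tracker.
escaped_bug_causes = [
    "requirements gap", "missing unit test", "missing unit test",
    "config error", "requirements gap", "missing unit test",
    "race condition", "config error", "missing unit test",
]

counts = Counter(escaped_bug_causes)
total = sum(counts.values())
cumulative = 0

# Print causes in descending order with a cumulative percentage,
# the classic Pareto view: the top few causes dominate.
print(f"{'Cause':<20}{'Count':>6}{'Cum %':>8}")
for cause, count in counts.most_common():
    cumulative += count
    print(f"{cause:<20}{count:>6}{100 * cumulative / total:>7.0f}%")
```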

Please drop a comment below and let me know what you do for root cause analysis.

Photo courtesy of National Nuclear Security Administration / Nevada Site Office [Public domain], via Wikimedia Commons

When is cutting corners the right answer?

Glass shelf with a sharp corner and a label saying "caution sharp corner"

Cutting corners is the right answer when your problem is sharp corners

When is cutting corners the right answer?  When the problem is sharp corners.

A key concept in quality engineering is “fail-safe” design.  I’ve written about fail-safe design in the past, regarding software-controlled rifles.  This one, a sharp corner, is a much simpler and more visual example.

In the US, we have lots of litigation. I’m sure this label was applied to the sharp corner to point out the danger to customers, and perhaps to protect against lawsuits if someone gets injured.  A better solution would be to grind down that corner so it isn’t a hazard in the first place.

Fail-safe design means building your systems in such a way that, if they fail, they fail in a safe manner.  In this case, if someone bumps into this shelf, they shouldn’t get cut.

Coming back to software: suppose you have a cron job that does some cleanup.  What happens if that job fails?  Does it leave data behind that might consume your storage?  Would any of that data be personally identifiable?
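
Here's a minimal sketch of one fail-safe approach to such a job (the directory, file pattern, and retention window are hypothetical). If anything goes wrong, the job logs the failure and exits non-zero so the scheduler's monitoring can alert, rather than silently leaving data, possibly PII, behind.

```python
import logging
import sys
import time
from pathlib import Path

RETENTION_SECONDS = 7 * 24 * 3600           # hypothetical 7-day retention
CLEANUP_DIR = Path("/var/tmp/app-exports")  # hypothetical export directory


def cleanup() -> int:
    """Delete expired files; return the number of failed deletions."""
    failures = 0
    cutoff = time.time() - RETENTION_SECONDS
    for path in CLEANUP_DIR.glob("*.csv"):
        try:
            if path.stat().st_mtime < cutoff:
                path.unlink()
        except OSError:
            logging.exception("Could not remove %s", path)
            failures += 1
    return failures


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    if cleanup():
        # Fail loudly: a non-zero exit lets cron/monitoring raise an alert
        # instead of letting stale data accumulate unnoticed.
        sys.exit(1)
```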

A Failure Mode and Effects Analysis (FMEA) is a good method to identify these potential failures and ask: does the system fail in the safest manner?
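
As a sketch of how an FMEA ranks risks, one common scoring approach rates each failure mode on 1-10 scales for severity, occurrence, and detection, then ranks by Risk Priority Number (RPN = severity x occurrence x detection). The failure modes and scores below are hypothetical examples for the cleanup job above.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    description: str
    severity: int    # 1 = negligible harm, 10 = catastrophic
    occurrence: int  # 1 = very unlikely, 10 = near certain
    detection: int   # 1 = almost always caught, 10 = almost never caught

    @property
    def rpn(self) -> int:
        return self.severity * self.occurrence * self.detection


modes = [
    FailureMode("Cleanup job silently fails; stale data fills storage", 7, 4, 8),
    FailureMode("Job deletes data that is still in use", 9, 2, 5),
    FailureMode("PII left behind past the retention window", 10, 3, 9),
]

# The highest-RPN failure modes deserve mitigation first.
for mode in sorted(modes, key=lambda m: m.rpn, reverse=True):
    print(f"RPN {mode.rpn:4d}: {mode.description}")
```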

The Test Leader

Katrina Clokie has a really good blog post describing the difference between a test leader and a test manager.  The test leader influences testing across the organization without direct positional authority.  I’d suggest one tweak: the test manager role should also include the leadership qualities of the test leader.

Especially in Agile, but really in any life cycle, everyone tests.  Developers should be testing their code, and product managers (or product owners) should be participating in acceptance testing. Everyone has a role in building quality, even if their activities are not directly tests.  The test manager can play an influential role with these other groups, in addition to leading their direct team.

I’ve been advocating the title Quality Leader instead of Test Manager to stress the influential capabilities of test managers:


Leading from the front

Photo Credit: Olivier Carré-Delisle


Testing Efficiency – A Better View

The ISTQB defines Testing Efficiency as the number of defects resolved over the total number of defects reported.  This is meant to measure the test team by the relevance of the bugs they report.  A low efficiency would imply that the test team is reporting many bugs that are not worth fixing.
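
As a worked example (with made-up numbers): if the testers report 200 bugs and 150 of them end up fixed, the testing efficiency is 150 / 200 = 75%; the other 25% were duplicates, not reproducible, or judged not worth fixing.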

This view is pretty limited and simplistic.

A better approach is to measure the “resolution category” of the bugs that are closed. When a bug is resolved, it is marked with a category like “Fixed”, “Cannot Duplicate”, or “Duplicate of Another Bug”.  The categories can be graphed on a pie chart:

Pie Chart showing Resolved Bugs by Category
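
Here's a minimal sketch of how such a chart can be produced, assuming resolved bugs can be exported with their resolution category; the counts below are made up for illustration.

```python
from collections import Counter

import matplotlib.pyplot as plt

# Hypothetical resolution categories exported from the bug tracker.
resolutions = (
    ["Fixed"] * 55
    + ["Duplicate"] * 15
    + ["Cannot Duplicate"] * 18
    + ["Business Decision"] * 12
)

counts = Counter(resolutions)
plt.pie(list(counts.values()), labels=list(counts.keys()), autopct="%1.0f%%")
plt.title("Resolved Bugs by Category")
plt.show()
```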

Now you can have a conversation about the bugs being reported and whether improvements are warranted. We had this exact issue on a team that I led a while back.  We made a few adjustments:

Duplicate – we upgraded the bug tracking system to improve its search function, which let the testers search for duplicates before submitting a new bug.  If they found the bug already reported, they reviewed it to see whether they could add any new information.

Cannot Duplicate – for these, we held bug huddles with the developers, showing them a demo of the bug before writing and submitting the report. This practice really helped get bugs fixed faster by eliminating the back-and-forth that sometimes happens.

Business Decision – many of these were closed by the developers without involving the Product Manager in the decision. We made the PM the person who “verifies” bugs closed with this resolution, to make sure they agreed.

Pie chart after improvements.

Want to learn more about leadership in software testing? Check out the Software Leadership Academy.