Author Archives: John Ruberto

Software Root Cause Analysis: 3 Questions to Answer

Image of explosion represents things going wrong on a project.

Sometimes, things don’t go as planned.

Here are three questions that I like to answer when performing a root cause analysis for escaped bugs:

  1. How was the bug introduced in the first place?
  2. How did we not catch it earlier?
  3. What are we doing to prevent this problem in the future?

For the first two questions, I have a handy template for performing root cause analysis.

For the third question, what are we doing to prevent the problem, we generally have short-term and longer-term solutions.  In the short term, we should add the test or check that would have caught the problem in the first place.  That answers the question for that particular issue.

For the longer term, we collect data about the causes of escaped bugs and the reasons they escaped.  We record that data in the bug tracking system as two categorized fields.  When we have enough data, we can examine trends.  I usually start with a simple Pareto analysis showing the top few causes and reasons, then work with the team to ask how we can improve our processes and practices.  It's often useful to filter the Pareto analysis to the most painful bugs (those found by customers, high severity, etc.).
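A simple Pareto analysis like the one described above can be sketched in a few lines of Python. This is only an illustration: the bug records, cause names, and severity values are made up, and in practice the data would come from an export of the bug tracker's cause field.

```python
from collections import Counter

def pareto(categories):
    """Return (category, count, cumulative percent) rows, most frequent first."""
    counts = Counter(categories).most_common()
    total = sum(n for _, n in counts)
    rows, running = [], 0
    for category, n in counts:
        running += n
        rows.append((category, n, round(100 * running / total, 1)))
    return rows

# Hypothetical escaped bugs as (cause, severity) pairs.
bugs = [
    ("missed requirement", "high"), ("coding error", "low"),
    ("missed requirement", "low"), ("config change", "high"),
    ("coding error", "high"), ("missed requirement", "high"),
    ("third-party library", "low"), ("coding error", "low"),
    ("missed requirement", "low"), ("config change", "high"),
]

all_causes = pareto(cause for cause, _ in bugs)
# Filter to the most painful bugs before counting.
high_severity_causes = pareto(cause for cause, sev in bugs if sev == "high")

for row in all_causes:
    print(row)
```

The same function works for the reason-for-escape field; only the list passed in changes.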

Please drop a comment below and let me know what you do for root cause analysis.

By Photo courtesy of National Nuclear Security Administration / Nevada Site Office [Public domain], via Wikimedia Commons

When is cutting corners the right answer?

Glass shelf with a sharp corner and a label saying "caution sharp corner"

Cutting corners is the right answer when your problem is sharp corners

When is cutting corners the right answer?  When the problem is sharp corners.

A key concept in quality engineering is “fail-safe” design.  I’ve written about fail-safe design in the past, regarding software controlled rifles.  This example, a sharp corner, is a much simpler, and more visual, example.

In the US, we have lots of litigation. I’m sure this label was applied to the sharp corner to point out the danger to customers. Also, maybe to protect from lawsuits if someone gets injured.  A better solution would be to grind down that corner so it isn’t a hazard.

Fail-safe design means building your systems so that, if they fail, they fail in a safe manner.  In this case, if someone bumps into this shelf, they shouldn’t get cut.

Coming back to software, what if you have a cron job that does some cleanup?  What happens if that job fails?  Does it leave data behind that might consume your storage?  Would any of that data be personally identifiable?

An FMEA (Failure Mode and Effects Analysis) is a good method for identifying these potential failures and asking: does the system fail in the safest manner?
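As one possible illustration of the cleanup job question, here is a minimal sketch of a cleanup routine that tries to fail safely: a problem with one file does not stop the rest of the cleanup, and anything left behind is reported rather than silently ignored. The function name, file names, and the seven-day retention period are all assumptions made for the example.

```python
import os
import tempfile
import time
from pathlib import Path

def cleanup_old_files(directory, max_age_seconds, now=None):
    """Delete files older than max_age_seconds.

    Returns (deleted, leftovers). A failure on one file is recorded in
    leftovers and the loop continues, so one bad file cannot block the rest.
    """
    now = time.time() if now is None else now
    deleted, leftovers = [], []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        try:
            if now - path.stat().st_mtime > max_age_seconds:
                path.unlink()
                deleted.append(path.name)
        except OSError as exc:
            leftovers.append((path.name, str(exc)))
    return deleted, leftovers

# Demo in a scratch directory: one stale file, one fresh file.
with tempfile.TemporaryDirectory() as scratch:
    old_file = Path(scratch, "old.tmp")
    new_file = Path(scratch, "new.tmp")
    old_file.write_text("stale")
    new_file.write_text("fresh")
    ten_days_ago = time.time() - 10 * 86400
    os.utime(old_file, (ten_days_ago, ten_days_ago))
    deleted, leftovers = cleanup_old_files(scratch, max_age_seconds=7 * 86400)
    new_file_survived = new_file.exists()

print("deleted:", deleted, "leftovers:", leftovers)
```

A job built this way could exit with a non-zero status whenever leftovers is not empty, so that the failure is visible to whoever monitors the cron output.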

Testing Pi day, which is more accurate?

In the US, today is Pi day.  March 14, or as we write it with numbers 3.14.  Happy Pi day everyone!

However, in Europe, they tend to write out dates starting with the day, then the month.  So, today is 14.3 over there.  Nowhere near Pi.  Instead, the closest to Pi day in Europe would be 22/7 (July 22nd), where 22/7 is a common approximation of Pi.

Which is more accurate?

Testing both approximations is pretty easy with Wolfram Alpha. The error in the approximation is determined by taking the absolute value of the difference between Pi and the approximation. So, the following screen shows the result, asking if the error in the US version is greater than the European version of Pi day:

Comparing the US version of Pi day (3.14) to the European version (22/7) with Wolfram Alpha

Europe wins this time. 22/7 is a better approximation than 3.14.
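The same check can be reproduced without Wolfram Alpha. The short Python sketch below computes the absolute error of each approximation, using the double-precision value of Pi from the standard library; the variable names are just for illustration.

```python
import math

# Absolute error of each approximation of Pi.
us_error = abs(math.pi - 3.14)        # March 14 written as 3.14
europe_error = abs(math.pi - 22 / 7)  # July 22 written as 22/7

print(f"3.14 is off by {us_error:.6f}")
print(f"22/7 is off by {europe_error:.6f}")
print("22/7 is closer:", europe_error < us_error)
```

The errors are roughly 0.0016 for 3.14 and 0.0013 for 22/7.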

Vanity Testing Metrics

This is a preview of a topic that I will cover in the upcoming talk, Testing Metrics – Choose Wisely at STPCon.

Vanity metrics are popular in marketing. These are metrics that allow you to feel good, but aren’t directly actionable, and are not related to your (true) goals.  Vanity metrics are also easily manipulated.  An example would be a hit counter, measuring page views, on a web site.  What would really matter for a business web site would be the conversion rate (how many visitors actually purchase) or revenue per customer.

I’ve seen marketing campaigns that add a lot of page views, but actually cause a decrease in conversion rate. The advertising may find more viewers, but if those people are less interested in your product, it's not really useful to drive up traffic (and who knows if those viewers are really people and not bots).  Measuring the impact of advertising by revenue, or by the number of visitors who become customers, is more powerful.

An example in software testing is measuring the Average Age of bugs.  You might start a campaign to reduce bug backlog or improve the velocity of fixing the bugs, and a measure might be the average age.  However, what you are really looking for is a quicker response to every bug, not the average bug.

The average age of bugs chart from JIRA shows trends in the average age, over time.

This metric is often misleading in these efforts, as really old bugs can be fixed or closed, dramatically reducing the average age.  In the chart above, the dramatic downward swings actually came from closing only a couple of bugs. Those bugs weren’t fixed; they were closed as obsolete.  But they had been open in the backlog for several years, so closing them had a dramatic impact on the average age.  Closing those, however, didn’t tell us anything about the responsiveness to current bugs.

Instead of the average age, try tracking the median age.  The median would be much less affected by really old bugs.  Medians are a way to prevent outliers from having an outsized impact on your metrics.  Even better, a more direct measure of our goal to improve velocity might be to set a target timeframe, say 30 days, and then measure the percentage of bugs that are fixed within that target.
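To see the difference between these measures, here is a small Python sketch. The bug ages and fix times are invented numbers chosen to mimic the situation above, where a couple of very old bugs dominate the average; the helper name percent_within_target is also just for illustration.

```python
from statistics import mean, median

def percent_within_target(days_to_fix, target_days=30):
    """Percentage of bugs fixed within target_days (inclusive)."""
    if not days_to_fix:
        return 0.0
    within = sum(1 for days in days_to_fix if days <= target_days)
    return 100 * within / len(days_to_fix)

# Hypothetical ages (in days) of open bugs, including two very old ones.
open_bug_ages = [3, 5, 8, 12, 20, 1500, 1800]
# The same backlog after closing the two old bugs as obsolete.
after_closing_old = [3, 5, 8, 12, 20]

print("mean:  ", round(mean(open_bug_ages), 1), "->", mean(after_closing_old))
print("median:", median(open_bug_ages), "->", median(after_closing_old))

# Hypothetical days-to-fix for recently resolved bugs.
days_to_fix = [2, 10, 45, 29, 31, 7]
print("fixed within 30 days:", round(percent_within_target(days_to_fix), 1), "%")
```

With these numbers the mean drops from about 478 days to 9.6 days when the two old bugs are closed, while the median only moves from 12 to 8.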

These views will more directly measure your goal (improved velocity) and be less susceptible to manipulation.