The BBST Foundations course has a lesson focused on Measurement. The goal of the lesson is to lay the groundwork for understanding how to evaluate a measurement’s validity, and why it’s difficult to measure software quality. When I was a student in the course, the lesson seemed more focused on curbing my enthusiasm for applying measures than on providing quick tips for how to start measuring my work in my day-to-day job. After a few years of experience, I appreciate much more this approach of carefully questioning measures instead of accepting them at face value and jumping to conclusions. At the time, though, I was hoping for some tips that would help me answer measurement requests and questions with more than “I don’t think that will give us much valuable insight, here’s why” or “I don’t think that’s a good idea”.
Working with software and striving to contribute from a quality perspective, I wanted to be able to show that we’re improving, or that certain decisions hindered our progress. I sometimes found myself explaining the subjective nature of quality, and how numbers are objective but incomplete, only to get a confused or skeptical look in reply, along with a request for metrics that are easy to understand. People who are not involved in our day-to-day work want to get the bigger picture at a glance, and it’s hard to provide this without some data points. Be they numbers, percentages, colors, or smiley faces, they require some sort of synthesis of relevant and meaningful data.
Inspired by a book I read recently, admittedly with a somewhat tacky title, The Success Equation, I’d like to share a few insights that I think apply quite well to measuring software quality. The book is not without plenty of warnings about misusing metrics. It also provides a few tips for how to measure success in a meaningful way.
Connect the quality metrics to bigger goals
In order to set meaningful goals for what to measure and improve in terms of quality, it is useful to start with a high-level view. What are the objectives of the organization you are part of? What is the organization’s strategy? Metrics that measure your successes in terms of quality need to tie into the contribution that quality makes to the broader success of the organization. Compiling statistics that have nothing to do with the end goals doesn’t really help in making better decisions to achieve those goals. Such statistics may help some people feel better, but they don’t guide improvements.
The book provides an example related to setting quality goals, from the Wallace company. The lesson there is that optimizing for the highest possible quality can become counterproductive at some point. The question is not how to get to the highest level of quality possible, but how much quality, and what kind of quality, we need in order to fulfill our goals.
The sample size matters, so set it consciously
Let’s say you decide to look at how issues reported by users via customer support for each release reflect the quality of a software product, to make an assessment of the current situation and to measure the effect of improvement strategies. This can be considered an effective metric, because it reflects, at least partially, the actual outcome of a release in terms of impact to end users. If you make a change in how testing is done, you can then check to see if there is any impact on that metric to decide if the change is worth it or not. For how many subsequent releases would you need to gather this metric to confirm a connection between the changes in testing and the release outcome? Is one enough? Or do you need 2, 3, 10? Well, which are some factors that can influence the number of bugs in a release?
How many changes were there in the release? How risky were the changes? How often are the changed areas actually used by people? How reliable is the deployment mechanism? What changes were there in upstream or downstream dependencies? What changed in the customers’ behavior and expectations compared to previous releases? How were people who use the product familiarized with the changes? How many people actually upgraded to this version compared to other releases? How lucky were you lately (yes, serendipity and luck do play a role in finding bugs)? How unlucky were the customers? And the list of factors could go on. So probably a significant sample size is needed to draw reasonable conclusions and to get meaningful information from the measurements.
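To make the sample-size point concrete, here is a minimal sketch with made-up numbers (the article prescribes none): suppose a change to testing genuinely reduces the average number of user-reported bugs per release, but each release also carries noise from all the factors listed above. Comparing a single release before and after the change can easily mislead, while a larger sample of releases recovers the underlying trend.

```python
import random
import statistics

# Hypothetical numbers for illustration only: suppose a testing change
# cuts the "true" average from 20 to 16 user-reported bugs per release,
# but every release also carries noise from the factors listed above.
random.seed(42)

def simulate_release(true_rate, noise=6):
    """Bug count for one release: the true rate plus release-to-release noise."""
    return max(0, true_rate + random.randint(-noise, noise))

before = [simulate_release(20) for _ in range(50)]
after = [simulate_release(16) for _ in range(50)]

# A single before/after pair of releases can easily point the wrong way...
print("one release before vs. after:", before[0], "vs.", after[0])

# ...while means over many releases recover the underlying trend.
print("mean before:", statistics.mean(before))
print("mean after:", statistics.mean(after))
```

The specific rates and noise range are invented; the point is only that when the noise is comparable to the effect you hope to see, one or two releases cannot confirm the connection between the testing change and the outcome.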
Decide if averages have a meaning
It’s not a given that computing the average of measurements makes sense. When you are dealing with power-law distributions, where there are very few huge values and lots of small values, computing the average may not help you know what to do or change.
If you set up the testing process so that you cut effort on addressing the highest risks in order to release faster, a series of successful releases that bring revenue no longer matters the moment a disastrous release destroys the whole business. So for some metrics, you might be more interested in where spikes and extreme values occur than in compressing the data into averages.
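A tiny numeric sketch of this point, with purely illustrative numbers: nine routine releases each bring a modest positive impact, while one disastrous release wipes out far more than the others earned combined. The average obscures exactly the value you most need to see.

```python
import statistics

# Illustrative numbers only: nine routine releases with modest positive
# impact, one disastrous release that outweighs all of them combined.
release_impact = [10, 12, 9, 11, 10, 13, 9, 12, 10, -500]

mean_impact = statistics.mean(release_impact)   # hides the disaster among routine wins
worst = min(release_impact)                     # the extreme value that matters here

print("average impact per release:", mean_impact)
print("worst single release:", worst)
```

The mean comes out negative but unremarkable-looking, while the minimum makes the catastrophic outlier impossible to miss.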
If your business model relies on prospective customers submitting a form, and an issue prevents them from doing that, the bug behind this is not really comparable to 10 other bugs that don’t interfere at all with acquiring customers. So measuring quality meaningfully may include identifying core workflows and areas first, then assessing the quality and testing process for them on a variety of dimensions. Instead of averaging them out, you can focus on addressing extreme values so that the project is less exposed to a particular set of risks. Some measurement dimensions could be:
- the extent to which the workflows on the list are touched at all by tests – one extreme could be that some of those workflows are not touched by tests in a given release, and at the other end of the spectrum, all of them are exercised by tests on each commit.
- the levels of tests covering the workflows – unit, integration, end-to-end
- the variety of techniques that contribute to coverage – only function testing could be one extreme, boundary testing would already be another level, load and scenario tests could be an extra one, stress testing another one, and so forth.
- hours of use involving those workflows by expert users in beta testing before release
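The idea above could be sketched like this (the workflows, dimension names, and 0–3 scale are all made up for illustration): score each core workflow on a few dimensions, then surface the weakest spots and extreme values instead of averaging them away.

```python
# Hypothetical workflows and dimensions, scored on an invented 0-3 scale.
scores = {
    "submit signup form": {"test coverage": 0, "test levels": 1, "techniques": 1},
    "checkout":           {"test coverage": 3, "test levels": 2, "techniques": 2},
    "search":             {"test coverage": 2, "test levels": 3, "techniques": 1},
}

# For each workflow, report its weakest dimension rather than its average.
for workflow, dims in scores.items():
    weakest_dim = min(dims, key=dims.get)
    print(f"{workflow}: weakest on '{weakest_dim}' ({dims[weakest_dim]})")

# Flag extreme values project-wide instead of compressing them into one number.
at_risk = [(w, d, s) for w, dims in scores.items()
           for d, s in dims.items() if s == 0]
print("urgent gaps:", at_risk)
```

Averaging these scores would rate the signup form workflow as mediocre but acceptable; looking at the extremes shows that a core customer-acquisition path has no test coverage at all, which is the risk the averaging would have hidden.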
Measuring aspects of quality can carry a lot of power: the outcome can drive big, important decisions and provide the visibility that’s crucial for fast feedback loops. You can also fall into the trap of investing a lot of energy and time in measurements that don’t provide good information about the quality of the project, and that may lead to disastrous decisions. So it’s one of those tools we’d better use with care, and not be shy to analyze with critical thinking. Measurements can be part of the success equation, as much as they can be part of the failure equation.
Alexandra Casapu is a BBST instructor with over 11 years of experience in software testing, specialized in exploratory testing. She enjoys coaching and mentoring other testers, shaping teams and company-wide testing approaches, as well as doing hands-on testing and development.