Earlier this week my friends in marketing sent me a tweet-debate they stumbled across, centered on the merits of unit test code coverage as a metric for quality. The original tweet presented the opinion that unit test code coverage – a metric that measures what percentage of your code base is covered by unit tests – is at best useless, and at worst damaging to a team’s ability to develop for quality. I find these types of debates exciting because, in my experience as a scrum master / developer / research engineer, when it comes to technology there’s no one right way to do something. So, in the spirit of appreciating a good debate, let’s look at the arguments made, and then talk about measuring quality with unit tests.
The side against code coverage had the most supporters on Twitter, and made two common arguments. The first was that the metric is useless. Numerous blogs illustrated this point by showing how easy it is to build a passing unit test that gives 100% coverage but does nothing of value – an assertionless test, for example. I think most developers would agree that no one wants to write misleading or ineffective tests, but when you’re being held to a metric in a high-pressure command-and-control environment, that situation is a realistic risk.
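To make the point concrete, here is a minimal sketch (the function is hypothetical) of an assertionless test: it executes every line of the function, so a coverage tool reports those lines as covered, yet it verifies nothing at all.

```python
# A hypothetical function we want "covered."
def apply_discount(price, percent):
    if percent < 0 or percent > 100:
        raise ValueError("percent must be between 0 and 100")
    return price * (1 - percent / 100)

def test_apply_discount_no_assertions():
    apply_discount(100, 10)       # happy path executed, result discarded
    try:
        apply_discount(100, 200)  # guard clause executed
    except ValueError:
        pass                      # exception swallowed, nothing checked
```

This test reaches 100% line coverage of `apply_discount`, but a bug that returned `price * (1 + percent / 100)` would still pass it.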
The second argument went beyond mere uselessness and presented something worse: code coverage metrics are dangerous. Here, developers cited the concept of perverse incentives – rules created to promote certain behaviors that actually incentivize undesired results. For example, a developer can either fully implement a new feature without gaps or take shortcuts. If code coverage is a strictly monitored metric, and if the full implementation requires more unit tests while the shortcut does not, then the developer is incentivized to maintain test coverage by taking the shortcut. There are many ifs, assumptions, and anecdotes here, but I think the concerns are fair.
The primary argument for code coverage was that, while not perfect, the metric gave teams a quantifiable goal, started discussions about what quality meant for their products, and emphasized that quality is no longer optional. The people in favor of code coverage had used the metric as part of a wider transformation to cement quality into their development organization’s culture, and found it beneficial during that time of flux. It’s important to note that they did not use the metric as an absolute measurement of quality, but as a leading indicator used in conjunction with other metrics to help assess their product’s quality.
To me, the debate is interesting because it focuses on the effects without identifying the root cause. Metrics are hard to get right. A useful metric needs two things. The first is to be quantifiable – saying the average speed of traffic is 30mph is more informative than saying “pretty slow.” The second is to be tied to a specific value, in such a way that quantitatively measuring the metric helps you understand the value realized. For example, “increase the average speed of traffic to 60mph” is not as powerful as “reduce commute times between A and B by 50% for the existing volume of traffic.” The second example’s value is clear: we want faster commutes, so maybe we make wider roads to reduce congestion.
Without a value, as in the first scenario, the focus is on achieving the metric for the sake of the metric. Maybe we add an expensive toll – now fewer cars are on the road and speed goes up, but our value to commuters was not realized because no one can afford to drive to work. To me, the debate shows that the efficacy of code coverage depends on tying it to a value. Those who had a bad experience with code coverage were tasked with achieving a goal – say 100% – for the sake of achieving 100% code coverage. Those who had good experiences with code coverage prioritized the number less, and the value of their product more. The question is, how do you tie code coverage to product value? To answer that question, I think we need to shift the discussion to BDD.
BDD stands for Behavior-Driven Development. The concept is that you should start development with the end users’ expectations in mind. The BDD methodology asks us to reflect on questions like: “What is the user really trying to do?”, “How will they try to do it?”, “What do they expect to happen when they do it?”, and “What is the context they’re doing it in?” The answers to these questions are formalized into a scenario (using the Gherkin syntax) that can then be used to develop against, for example:
Given I’ve provided a valid payment option
When I submit an order
Then I should be notified of a successful purchase.
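One way to make such a scenario executable is to map each Given/When/Then step onto test code. Here is a sketch, assuming a hypothetical `OrderService`; real projects often use a BDD framework such as behave or pytest-bdd to do this mapping for them.

```python
# A hypothetical, minimal order service used to illustrate the scenario.
class OrderService:
    def __init__(self):
        self.payment_option = None

    def provide_payment_option(self, option):
        # "Given I've provided a valid payment option"
        self.payment_option = option

    def submit_order(self):
        # "When I submit an order"
        if self.payment_option is None:
            return "payment required"
        # "Then I should be notified of a successful purchase"
        return "purchase successful"

def test_successful_purchase():
    service = OrderService()
    service.provide_payment_option("credit card")  # Given
    notification = service.submit_order()          # When
    assert notification == "purchase successful"   # Then
```

The test only passes once the functionality the scenario describes is actually implemented, which is exactly the feedback loop BDD is after.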
These scenarios are extremely useful for requirements gathering because they create a common language for collaboration between business and technical team members. These scenarios go beyond simply capturing requirements, though. The idea of BDD evolved from Test Driven Development (TDD). The true value of these scenarios becomes clear when the developers transform them into tests, which only pass once the functionality needed to satisfy the use case is fully implemented. The circle is completed when the product team looks at the passing test cases derived from the BDD scenarios to understand exactly which use cases or functionality within the product is available for the end users. We call this living documentation.
Before we bring this back to code coverage, we need to discuss one more benefit of BDD. Look at the example in the last paragraph; none of the steps tell you HOW to do anything, only WHAT. A valid payment option could be several things, like a credit card, gift card, PayPal account, etc. This is on purpose. We want generic BDD scenarios – we call this declarative testing – so that our team is forced to discuss what a valid payment option means in this scenario. At test execution time, we will iterate through the scenario test case for each payment option.
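In a Python test suite, that iteration can be sketched with pytest’s parametrization. The list of payment options and the `checkout` helper below are hypothetical stand-ins for whatever the team agrees a “valid payment option” means.

```python
import pytest

# Hypothetical list of valid payment options the team agreed on.
VALID_PAYMENT_OPTIONS = ["credit card", "gift card", "paypal"]

def checkout(payment_option):
    # Stand-in for the real order flow; succeeds for any valid option.
    if payment_option not in VALID_PAYMENT_OPTIONS:
        raise ValueError(f"unsupported payment option: {payment_option}")
    return "purchase successful"

# One declarative scenario, executed once per concrete payment option.
@pytest.mark.parametrize("payment_option", VALID_PAYMENT_OPTIONS)
def test_submit_order_notifies_success(payment_option):
    assert checkout(payment_option) == "purchase successful"
```

One declarative scenario thus becomes several concrete test executions, one per payment option the team decided to support.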
Now we’re ready to tie this back to code coverage, and hopefully at this point you’re starting to see how we can do that. When we pass specific parameters to the declarative scenario – like paying via credit card – we turn it into a more imperative test and start to invoke concrete connections to our source code that can be unit tested. Perhaps processing a payment via credit card and gift card calls the same payment processing method, but with different parameters. Now we’re able to quantify the available code paths (e.g., five different valid payment options) while associating unit test execution with a real user scenario. This means we learn where our test coverage is weak from how much of our code base is exercised, while making sure the tests that exercise that code are actually testing something tied to our customers’ idea of quality.
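As a sketch (the `process_payment` function and its options are hypothetical), the shared payment-processing method might branch on the option, and each BDD-derived unit test then exercises one concrete code path:

```python
# Hypothetical payment processor with one code path per payment option.
def process_payment(option, amount):
    if option == "credit card":
        return f"charged {amount} to credit card"
    elif option == "gift card":
        return f"deducted {amount} from gift card balance"
    elif option == "paypal":
        return f"charged {amount} via paypal"
    else:
        raise ValueError(f"unsupported payment option: {option}")

# Each test maps back to the "valid payment option" scenario, so
# coverage of these branches is coverage of a real user path.
def test_credit_card_path():
    assert process_payment("credit card", 25).startswith("charged")

def test_gift_card_path():
    assert "gift card balance" in process_payment("gift card", 25)

def test_unsupported_option_is_rejected():
    try:
        process_payment("bitcoin", 25)
    except ValueError:
        return
    assert False, "expected ValueError for an unsupported option"
```

Counting the branches gives you the quantifiable code paths, while the scenario each test derives from ties that count to something a customer actually does.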
The debate on unit test code coverage will undoubtedly rage on, with well-meaning managers looking for ways to enforce good development practices, and conscientious developers asking not to be held accountable to metrics that do more harm than good. Like most debates in life, the path forward will likely involve a bit from both sides.
Code coverage certainly shines light on the parts of your codebase that lack automated testing. However, its effectiveness as a measure of quality depends heavily on the context in which it exists; it can successfully help teams focus on quality when it’s treated as one indicator among many. Unit tests derived from the BDD methodology give us a way to get the best of both worlds. Teams get information to help them determine where their automated test coverage is lacking, while still ensuring that the unit tests are exercising their code in a way that customers care about.