Reporting with Testing Metrics
Learn which testing metrics to report and which to avoid at all costs.
"You can’t control what you can’t measure"
Tom DeMarco, author of Controlling Software Projects
The desire for control
It is natural for humans to want to track progress in personal life, work projects, or business. The logical conclusion is to track metrics - "bad" numbers must stay low, "good" numbers must go up.
On a software project, you may have individual, process, project, and higher-level metrics. There are dozens, if not hundreds, of metrics used in testing. It can be argued that most are poor or useless.
Most metrics related to testing are poor, useless, or vulnerable to abuse (Goodhart's Law: when a measure becomes a target, it ceases to be a good measure).
Poor metrics and their abuses
Some places may keep track of the number of (manual) test scenarios produced per day, or automation scripts written per sprint, to measure a tester's efficiency. This has no value, because intellectual work is very different from manual labor.
No one judges a book by the number of words it contains
No one judges a project manager by the number of emails they send per week
And it is well-known that developers shouldn't be judged based on the number of Lines of Code (LoC) written or Pull Requests created - quantitative metrics that are detached from quality or delivered business value.
Below is a non-exhaustive table of some of the worst testing metrics that you should probably never use. If management expresses a desire to track them, consider explaining why they are not a good idea.
| Metric | Why it's a poor metric | How it can be abused |
| --- | --- | --- |
| Number of test cases written | Quantity says nothing about relevance, depth, or effectiveness | Inflate test cases by splitting trivial variations |
| Automation scripts written | As above; additionally, scripts vary hugely in value and complexity | Write shallow or redundant scripts |
| Test cases executed | Execution does not imply meaningful coverage | Re-run easy or low-value tests to boost numbers |
| Bug count found by testers in the "testing phase" | At odds with the shift-left testing mindset. Isn't it more cost-effective for testers to help developers and the team spot problems as early as possible in the SDLC? | Testers are incentivized to find and report bugs later, rather than catching errors before they are coded and deployed to a testing environment |
| Bug count per tester | Penalizes collaboration | Avoid helping others or focus on easy areas |
| Severe bug count per tester | As above, plus it incentivizes testers to inflate the importance of bugs found | As above, plus it increases arguing over whether "this bug is really that severe" |
| Test coverage (%)* | Coverage ≠ quality or meaningful test data or scenarios | Add superficial tests to raise coverage (see the example below) |
| Test execution speed | Speed alone ignores learning and investigation | Rush testing and miss important issues |
| Bugs found in production | Influenced by many factors outside testers' direct control** | When used as a punishment, leads to toxic blame games |
*Test coverage is a vague term. From a black-box perspective, it could mean "requirements" or "feature" coverage. This again has severe caveats:
"I created enough tests to achieve 100% coverage" actually means "coverage with whatever scenarios I could think of. I can't possibly know if I missed or misunderstood something, or whether my test data will still miss a bug. If I knew, I would have written those tests down".
100% coverage of incomplete, inconsistent, or just wrong requirements still leads to a poor product.
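To make the "coverage ≠ quality" point concrete, here is a minimal, purely illustrative sketch (the apply_discount function and the test are hypothetical): the test executes both branches of the function, so coverage tools report it as fully covered, yet it asserts nothing and can never fail.

```python
# Hypothetical function under test.
def apply_discount(price: float, is_member: bool) -> float:
    if is_member:
        return price * 0.9
    return price

# A coverage-inflating test: it runs both branches, so line and branch
# coverage report apply_discount as fully covered, but there are no
# assertions. It cannot fail, so it will never catch a regression,
# despite the impressive-looking coverage number.
def test_apply_discount_superficial():
    apply_discount(100.0, is_member=True)
    apply_discount(100.0, is_member=False)
```

This is exactly the kind of test that a "raise the coverage number" target tends to produce.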
** Defect counts are driven largely by factors outside the tester's control, such as:
requirements clarity and completeness
quality of the code developers produce
degree of technical debt already present in the system
A QA team may find few bugs because developers delivered high-quality work; penalizing the team for this is conceptually flawed. Conversely, a high defect count often indicates underlying issues in the earlier phases of the SDLC, rather than superior QA performance.
Potentially better metrics
Not tracking anything at all is the other extreme. So what should we track?
First, establish a team-wide mindset:
Prevention rather than detection. The earlier the better.
Strong collaboration. Joint quality ownership. No "throw it over the wall and let testers do their thing".
Fast feedback loops in CI/CD pipelines
White-box coverage
High code coverage is still a valid metric, assuming the automated tests are written intelligently using various test techniques described on this resource and with quality test data.
| Metric | What it measures | Why it's valuable |
| --- | --- | --- |
| Branch coverage (illustrated below) | Ensures both true and false paths of decision points are executed. With quality test data, it can show that every decision outcome is exercised and behaves correctly at the borders of equivalence partitions. | When high, greatly reduces the risk of untested conditional logic and edge cases |
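As a minimal sketch of what this looks like in practice (the shipping_cost function, its 50.00 threshold, and the test names are hypothetical), the two pytest-style tests below exercise both branches of the single decision point with values at the border of the two equivalence partitions:

```python
def shipping_cost(order_total: float) -> float:
    """Hypothetical rule: orders of 50.00 or more ship for free."""
    if order_total >= 50.0:  # the decision point: one true and one false branch
        return 0.0
    return 4.99

def test_free_shipping_at_threshold():
    # True branch, at the border of the "free shipping" partition.
    assert shipping_cost(50.0) == 0.0

def test_paid_shipping_just_below_threshold():
    # False branch, at the border of the "paid shipping" partition.
    assert shipping_cost(49.99) == 4.99
```

With coverage.py, for example, branch coverage can be collected by running the tests via `coverage run --branch -m pytest` and then `coverage report`; other languages have analogous tooling.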
Black-box or Quality Control coverage
All of the metrics below should be treated as a learning signal, not a performance KPI.
| Metric | What it tells you | Why it's useful / notes |
| --- | --- | --- |
| Bugs found in production over time | Focuses on impactful issues as a trend, not a snapshot in time. This metric (without "over time") is also listed in the "Poor Metrics" table above. This is intentional. It depends on how this metric is used - to punish, or to analyze and improve? | Helps gauge real-world risk escape. Important! There must be no blaming, only an honest discussion of why it happened and how it can be prevented in the future. |
| User ratings and reviews/feedback over time | 100% coverage or a million test cases will not matter if the end users end up disliking the product and switching to the competition. | Ratings can be measured. Reviews may be qualitatively assessed. |
| Mean Time to Detect (MTTD)* | Measures how quickly problems are noticed after introduction | Tracks feedback loop efficiency |
| Mean Time to Repair (MTTR) | Shows how fast issues are resolved when found | Indicates team responsiveness and process health |
*MTTD raises serious questions. You usually can't know the exact moment a defect was introduced. Was it when a Business Analyst typed it into a JIRA ticket? The moment a developer typed it into the code? MTTD is almost always an approximation, and sometimes a proxy metric.
To handle this in practice, teams may define a consistent reference point, even if it’s imperfect:
| Defect type | Introduced time (reference point) |
| --- | --- |
| Code-related defects | ≈ commit or merge time |
| Requirements or design defects | ≈ requirement approval, story "Ready", or sprint start (artificial, but good enough) |
MTTD may still be acceptable (with caveats) if used to track feedback speed, not root cause timing.
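As a rough sketch of the arithmetic (the defect records and timestamps below are invented, and "introduced" uses the commit/merge-time approximation described above), MTTD and MTTR are simply averages of the relevant time deltas:

```python
from datetime import datetime
from statistics import mean

# Invented defect records. For code-related defects, "introduced" is
# approximated by the offending commit's merge time; "detected" and
# "fixed" would come from the bug tracker.
defects = [
    {"introduced": datetime(2024, 3, 1, 10, 0),   # commit merged
     "detected":   datetime(2024, 3, 3, 9, 30),   # bug reported
     "fixed":      datetime(2024, 3, 4, 16, 0)},  # fix deployed
    {"introduced": datetime(2024, 3, 5, 14, 0),
     "detected":   datetime(2024, 3, 5, 18, 0),
     "fixed":      datetime(2024, 3, 6, 11, 0)},
]

def hours(delta):
    return delta.total_seconds() / 3600

# Mean Time to Detect: average of (detected - introduced).
mttd = mean(hours(d["detected"] - d["introduced"]) for d in defects)

# Mean Time to Repair: average of (fixed - detected).
mttr = mean(hours(d["fixed"] - d["detected"]) for d in defects)

print(f"MTTD: {mttd:.1f} h, MTTR: {mttr:.1f} h")
```

Whether the resulting numbers mean anything depends entirely on how consistently the "introduced" reference point is recorded, which is exactly the caveat above.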
Admittedly, the biggest drawback of formally tracking MTTD is the amount of meta-work (bureaucracy) it requires. In teams that already communicate well, the overhead outweighs the value.