Reporting with Testing Metrics

Learn which testing metrics to report and which to avoid at all costs.

"You can’t control what you can’t measure"

Tom DeMarco, author of Controlling Software Projects

The desire for control

It is natural for humans to want to track progress in personal life, work projects, or business. The logical conclusion is to track metrics - "bad" numbers must stay low, "good" numbers must go up.

On a software project, you may have individual, process, project, and higher-level metrics. There are dozens, if not hundreds, of metrics used in testing. It can be argued that most are poor or useless.

Poor metrics and their abuses

Some places may keep track of the number of (manual) test scenarios produced per day, or automation scripts written per sprint, to measure a tester's efficiency. This has no value, because intellectual work is very different from manual labor.

  • No one judges a book by the number of words it contains

  • No one judges a project manager by the number of emails they send per week

  • And it is well known that developers shouldn't be judged by the number of Lines of Code (LoC) written or Pull Requests created - quantitative metrics that are detached from quality and delivered business value.

Below is a non-exhaustive list of some of the worst testing metrics that you should probably never use. If management expresses a desire to track them, consider explaining to them why they are not a good idea.

| Metric | Why it is a poor metric | How it can be abused or gamed |
| --- | --- | --- |
| Number of test cases written | Quantity says nothing about relevance, depth, or effectiveness | Inflate test cases by splitting trivial variations |
| Automation scripts written | As above; additionally, scripts vary hugely in value and complexity | Write shallow or redundant scripts |
| Test cases executed | Execution does not imply meaningful coverage | Re-run easy or low-value tests to boost numbers |
| Bug count found by testers in the "testing phase" | At odds with the shift-left testing mindset. Isn't it more cost-effective for testers to help developers and the team spot problems as early as possible in the SDLC? | Testers are incentivized to find and report bugs late, rather than catching errors before they are coded and deployed to a testing environment |
| Bug count per tester | Penalizes collaboration | Avoid helping others, or focus on easy areas |
| Severe bug count per tester | As above, plus it incentivizes testers to inflate the importance of bugs found | As above, plus it fuels arguments over whether "this bug is really that severe" |
| Test coverage (%)* | Coverage ≠ quality, meaningful test data, or scenarios | Add superficial tests to raise coverage |
| Test execution speed | Speed alone ignores learning and investigation | Rush testing and miss important issues |
| Bugs found in production | Influenced by many factors outside testers' direct control** | When used as punishment, leads to toxic blame games |

*Test coverage is a vague term. From a black-box perspective, it could mean "requirements" or "feature" coverage. This again has severe caveats:

  • "I created enough tests to achieve 100% coverage" actually means "coverage with whatever scenarios I could think of. I can't possibly know if I missed or misunderstood something, or whether my test data will miss a bug still. If I knew - I would've written them down".

  • 100% coverage of incomplete, inconsistent, or just wrong requirements still leads to a poor product.

** Defect counts are driven largely by factors outside the tester's control, such as:

  • requirements clarity and completeness

  • quality of the code developers produce

  • degree of technical debt already present in the system

A QA team may find few bugs because developers delivered high-quality work; penalizing the team for this is conceptually flawed. Conversely, a high defect count often indicates underlying issues in the earlier phases of the SDLC, rather than superior QA performance.

Potentially better metrics

Remember, no metric is completely immune to Goodhart's law or other abuse

Not tracking anything is probably the other extreme. So what should we track?

First, establish a team-wide mindset:

  • Prevention rather than detection. The earlier the better.

  • Strong collaboration. Joint quality ownership. No "throw it over the wall and let testers do their thing".

  • Fast feedback loops in CI/CD pipelines

White-box coverage

High code coverage is still a valid metric, provided the automated tests are written intelligently, apply the test design techniques described on this resource, and use quality test data.

| Metric | Why it's useful | How it supports quality |
| --- | --- | --- |
| Branch coverage | Ensures both the true and false paths of decision points are executed. With quality test data, it can demonstrate that every decision outcome behaves correctly at the boundaries of equivalence partitions | When high, greatly reduces the risk of untested conditional logic and edge cases |
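As a minimal sketch of what this looks like in practice - the `discount_rate` function, its age thresholds, and the test names below are invented for illustration - branch coverage with boundary-value test data might be exercised like this in Python, using pytest and coverage.py:

```python
# discounts.py - hypothetical code under test with two decision points
def discount_rate(age: int) -> float:
    """Return the discount rate for a customer of the given age."""
    if age < 0:
        raise ValueError("age cannot be negative")
    if age < 18 or age >= 65:  # boundaries at 18 and 65
        return 0.20            # minors and seniors get 20% off
    return 0.0                 # everyone else pays full price


# test_discounts.py - tests chosen at the borders of the equivalence partitions,
# so every decision outcome (true and false) is executed at least once
import pytest


@pytest.mark.parametrize(
    ("age", "expected"),
    [
        (17, 0.20),  # just below the adult boundary  -> discounted partition
        (18, 0.00),  # exactly on the adult boundary  -> full-price partition
        (64, 0.00),  # just below the senior boundary -> full-price partition
        (65, 0.20),  # exactly on the senior boundary -> discounted partition
    ],
)
def test_discount_rate_at_partition_borders(age, expected):
    assert discount_rate(age) == expected


def test_negative_age_is_rejected():
    with pytest.raises(ValueError):
        discount_rate(-1)
```

Running the suite with branch coverage enabled (for example, `coverage run --branch -m pytest` followed by `coverage report`) shows whether both outcomes of every `if` were taken, rather than only which lines were hit.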

Black-box or Quality Control coverage

All of the metrics below should be treated as learning signals, not performance KPIs.

| Metric | Why it's more useful | How it supports quality |
| --- | --- | --- |
| Bugs found in production over time | Focuses on impactful issues as a trend, not a snapshot in time. This metric (without "over time") is also listed in the "Poor metrics" table above - intentionally. It depends on how the metric is used: to punish, or to analyze and improve? | Helps gauge real-world risk escape. Important: there must be no blaming, only an honest discussion of why it happened and how it can be prevented in the future |
| User ratings and reviews/feedback over time | 100% coverage or a million test cases will not matter if the end users end up disliking the product and switching to the competition | Ratings can be measured; reviews can be qualitatively assessed |
| Mean Time to Detect (MTTD)* | Measures how quickly problems are noticed after introduction | Tracks feedback loop efficiency |
| Mean Time to Repair (MTTR) | Shows how fast issues are resolved when found | Indicates team responsiveness and process health |

*MTTD raises serious questions. You usually can't know the exact moment a defect was introduced. Was it when a Business Analyst typed it into a JIRA ticket? The moment a developer typed it into the code? MTTD is almost always an approximation, and sometimes a proxy metric.

To handle this in practice, teams may define a consistent reference point, even if it's imperfect (a small calculation sketch follows this list):

  1. Code-related defects: introduced time ≈ commit or merge time

  2. Requirements or design defects: introduced time ≈ requirement approval, story “Ready”, or sprint start (artificial, but good enough)
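Under those assumptions, computing MTTD and MTTR is simple arithmetic over timestamps. A minimal sketch follows; the `Defect` record, its field names, and the example dates are hypothetical - in a real team, the timestamps would come from the issue tracker and version control:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Defect:
    introduced_at: datetime  # agreed reference point, e.g. commit/merge time
    detected_at: datetime    # when the bug report was filed
    resolved_at: datetime    # when the fix was deployed and verified


def mttd(defects: list[Defect]) -> timedelta:
    """Mean Time to Detect: average of (detected - introduced)."""
    return timedelta(seconds=mean(
        (d.detected_at - d.introduced_at).total_seconds() for d in defects
    ))


def mttr(defects: list[Defect]) -> timedelta:
    """Mean Time to Repair: average of (resolved - detected)."""
    return timedelta(seconds=mean(
        (d.resolved_at - d.detected_at).total_seconds() for d in defects
    ))


# Made-up example data
defects = [
    Defect(datetime(2024, 5, 1, 10), datetime(2024, 5, 2, 9), datetime(2024, 5, 3, 15)),
    Defect(datetime(2024, 5, 6, 14), datetime(2024, 5, 6, 18), datetime(2024, 5, 8, 11)),
]
print("MTTD:", mttd(defects))  # 13:30:00 - average of 23 h and 4 h to notice
print("MTTR:", mttr(defects))  # 1 day, 11:30:00 - average of 30 h and 41 h to fix
```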

MTTD may still be acceptable (with caveats) if used to track feedback speed, not root cause timing.

Admittedly, the biggest drawback of formally tracking MTTD is the amount of meta-work (bureaucracy) it requires. In teams that already communicate well, the overhead can outweigh the value.
