Bug counts make teams feel busy. They rarely make users feel safe.
A few years ago, I was part of a 14-person team whose deliverable was a multi-tenant B2B SaaS product used by HR and operations teams. We released code to production every two weeks. We tracked defects carefully. Our dashboards looked healthy, but production felt unpredictable. When I asked, 'How is quality?' the answer was usually 'We closed 127 bugs this sprint.' That told me how active we were. It did not tell me how safe the system was.
Bug counts describe what the tracking system sees before release. Quality becomes evident after release, when real users rely on the system under conditions that teams cannot model perfectly.
So I stopped using bug counts as my primary signal and reframed quality around three questions:
- How much damage could we have avoided by catching issues before deployment?
- When did actual usage expose incorrect assumptions and at what cost?
- When something failed, how quickly did we recover?
Everything I track now maps back to one of these queries.
1. How much damage could we have avoided by catching issues before deployment?
The earlier we discover that our assumptions about requirements, real user behaviour, edge cases, or configuration are wrong, the cheaper the correction. But once we have written the code, tested it, marked the story done, or released it on mistaken assumptions, any contradiction that surfaces later returns as rework and instability.
Two signals helped us see that clearly.
Rework rate: making late learning visible
We stopped counting reopened tickets and started tracking time. To do so, we used this definition:
- Any work item that moved from 'Done' or 'Ready For Release' back to 'In Progress' was tagged 'Rework'
- We logged the number of hours spent redoing it
- At the end of each sprint, we calculated: rework rate = rework hours / total sprint engineering hours
Our sprint capacity averaged approximately 480 engineering hours (6 engineers x 2 weeks). For three consecutive sprints, rework consumed:
- Sprint 1: 62 hours (12.9 percent)
- Sprint 2: 71 hours (14.8 percent)
- Sprint 3: 88 hours (18.3 percent)
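The rework-rate calculation above can be sketched in a few lines. This is an illustrative sketch, not our actual tooling: the item fields, the `Rework` tag, and the sample work items are assumptions that mirror the definition in the text.

```python
# Sketch: per-sprint rework rate = rework hours / total sprint engineering hours.
# Field names and the 'Rework' tagging scheme are illustrative assumptions.

def rework_rate(items: list[dict], total_sprint_hours: float) -> float:
    """Fraction of sprint capacity spent redoing items tagged 'Rework'."""
    rework_hours = sum(i["hours"] for i in items if "Rework" in i.get("tags", []))
    return rework_hours / total_sprint_hours

# Example matching the Sprint 1 numbers (62 of 480 hours):
sprint_items = [
    {"id": "HR-101", "hours": 40, "tags": ["Rework"]},   # hypothetical item IDs
    {"id": "HR-102", "hours": 22, "tags": ["Rework"]},
    {"id": "HR-103", "hours": 30, "tags": []},
]
print(round(rework_rate(sprint_items, 480) * 100, 1))  # 12.9
```

The only inputs it needs are hours and tags, which most trackers already export.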
That trend mattered more than any bug count. Then we split rework into two categories: requirement gaps and build / test gaps. That was enlightening too.
In one sprint, we released a configurable approval workflow feature. It passed testing and met its deadline. Two weeks later, a client reported that approval chains broke when roles were dynamically reassigned mid-cycle: a scenario we hadn't considered during requirements refinement. Fixing it required three days of backend rework, regression testing across four flows, and one emergency patch release. That single requirement gap cost 26 hours of rework.
In contrast, a build / test gap that occurred in the same sprint involved a misconfigured validation rule in a bulk import endpoint. The requirement was explicit: reject records that violated defined field constraints. The rule itself was known and documented. The failure came from implementing the constraint incorrectly in code, not from missing or ambiguous behavior. That correction of a build / test issue took six hours.
Over six sprints, approximately 70 percent of rework hours came from requirement gaps, not testing misses. The signal was uncomfortable. We were not testing poorly. We were committing before we fully understood usage variability.
Rework rate evaluation answered the first question directly: how much damage did we fail to avoid?
Development cycle versus delivery cycle: when production feedback lags behind deployment
We ran two-week sprints. On paper, delivery looked predictable. But when we mapped the actual lifecycle of work, a pattern emerged. Here's an example feature timeline:
- Day 1: Moved to 'Ready for Dev'
- Day 6: Dev complete
- Day 9: Testing complete
- Day 10: Moved to 'Done'
- Day 15: Production release
- Day 21: Production issue discovered
- Day 23-25: Fix + retest
From 'Ready for Dev' to stable production, that feature took 25 days, against a delivery cycle of 10 days. Something had to give.
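The gap between the two cycles is easy to compute once the lifecycle events carry timestamps. A minimal sketch, assuming event names that mirror the timeline above (the anchor date is arbitrary; only the day offsets matter):

```python
# Sketch: delivery cycle ('Ready for Dev' -> 'Done') versus lead time
# ('Ready for Dev' -> stable in production), using the day offsets above.
from datetime import date, timedelta

start = date(2023, 1, 1)  # arbitrary anchor; only offsets matter
events = {
    "ready_for_dev": start,                               # Day 1
    "done": start + timedelta(days=9),                    # Day 10
    "stable_in_production": start + timedelta(days=24),   # Day 25
}

# +1 converts a day offset into an inclusive day count.
delivery_cycle = (events["done"] - events["ready_for_dev"]).days + 1
lead_time = (events["stable_in_production"] - events["ready_for_dev"]).days + 1
print(delivery_cycle, lead_time)  # 10 25
```

Plotting both numbers per feature makes the deferred-learning gap visible sprint over sprint.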
So we began tracking:
- Lead time: 'Ready-for-dev' to 'Stable in production'
- 'Interrupt work' hours (time spent on work outside the stories in the sprint) per sprint
- Production stabilisation spillover of the last release into the next sprint
In one quarter, interrupt work averaged 32 percent of sprint capacity. Testing time was reduced to protect scope. Stabilisation of the production environment frequently extended into the next iteration.
That mismatch showed us that learning consistently happened after delivery. Quality was not failing. It was being deferred.
2. When did actual usage expose incorrect assumptions and at what cost?
Even when something is deployed 'on time,' assumptions may fail later. So we started tracking predictability not as 'did we hit the date,' but as 'did reality match our model?'
Predictability: are our assumptions reflecting real-world usage?
We defined predictability as follows:
- Tag all issues discovered:
  - After code freeze
  - During the first seven days in production
- Record whether they were:
  - Functional mismatches
  - Integration surprises
  - Performance deviations
We tracked: late learning rate = (post-freeze issues + 7-day prod issues) / total issues per release
One release contained 24 total issues across its lifecycle. Nine were discovered after code freeze and six were found within seven days of release to production.
So our late learning rate was 15 / 24, or 62.5 percent. That number was hard to ignore.
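The calculation itself is trivial; the discipline is in the tagging. A sketch under assumed field names (the `phase` values are illustrative, not our tracker's actual states):

```python
# Sketch: late learning rate = (post-freeze issues + first-7-days production
# issues) / total issues for a release. Phase labels are assumptions.

def late_learning_rate(issues: list[dict]) -> float:
    late = [i for i in issues if i["phase"] in ("post_freeze", "prod_first_7_days")]
    return len(late) / len(issues)

# The release described above: 24 issues, 9 post-freeze, 6 in the first 7 days.
issues = (
    [{"phase": "pre_freeze"}] * 9
    + [{"phase": "post_freeze"}] * 9
    + [{"phase": "prod_first_7_days"}] * 6
)
print(round(late_learning_rate(issues) * 100, 1))  # 62.5
```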
A concrete case: we released a reporting dashboard that aggregated tenant data. It passed performance testing in staging with simulated traffic.
In production, under real tenant distribution, database locking caused latency spikes during peak hours. There were no evident failures, just slow, inconsistent response times.
In staging, we had tested with evenly distributed simulated traffic. In production, tenant activity was uneven. Several high-volume tenants triggered concurrent aggregation queries against the same tables, leading to locking and latency spikes. Similarly, we had modelled average load, but we had not modelled contention under concentrated load. All told, the fix required query refactoring, index tuning, load re-testing, and patch releases. The plan was correct, but our assumptions were incomplete.
Predictability improved when we began explicitly asking during requirements and design refinement: 'What conditions in production could cause this feature to fail to work as expected?' Asking and answering that question reduced late-learning issues by approximately 30 percent over two quarters.
The 48 hour monitoring period after release
We formalised a 48-hour 'hot period' after major releases. This wasn't a war room. Instead, it was a structured observation window.
For example: for a feature involving bulk document uploads, we monitored:
- Error rate per endpoint
- 95th percentile latency
- Support ticket volume
- Upload abandonment rate
On release day, abandonment jumped from 4 percent baseline to 11 percent. No crashes. No visible errors. Logs showed silent validation failures for specific file encodings. We deployed a fix within 18 hours.
Without the hot period, that issue would have lingered for weeks as 'user friction.' Learning arrived early because we were deliberately watching for 'silent' failures.
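A hot-period check boils down to comparing live signals against their pre-release baselines. A minimal sketch, assuming illustrative signal names and a relative-drift threshold (not our actual alerting rules):

```python
# Sketch: flag hot-period signals that drift more than `tolerance` (relative)
# above their pre-release baseline. Signal names and thresholds are assumptions.

def hot_period_alerts(baseline: dict, current: dict, tolerance: float = 0.5) -> list[str]:
    """Return names of signals exceeding baseline by more than `tolerance`."""
    alerts = []
    for signal, base in baseline.items():
        value = current.get(signal, base)
        if base > 0 and (value - base) / base > tolerance:
            alerts.append(signal)
    return alerts

baseline = {"upload_abandonment_pct": 4.0, "p95_latency_ms": 900, "error_rate_pct": 0.2}
release_day = {"upload_abandonment_pct": 11.0, "p95_latency_ms": 950, "error_rate_pct": 0.2}
print(hot_period_alerts(baseline, release_day))  # ['upload_abandonment_pct']
```

Deciding the signals and tolerances before the release, as the text suggests, is what makes the window structured rather than reactive.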
3. When something failed, how quickly did we recover?
Occasional failure is inevitable. Unrecoverable failure is avoidable.
Change failure rate: making release risk measurable
We decided to examine the characteristics of releases that caused the most rework. We defined:
- A change = any production deployment (feature, patch, config change)
- A failed change = a deployment that triggered:
  - Rollback
  - Hotfix within 24 hours
  - Sev-1 or Sev-2 incident
  - Support spike greater than 30 percent above baseline, tied to the release
We used the formula: change failure rate = failed deployments / total deployments. Over one quarter, we shipped 42 deployments, nine of which met our definition of a failed change. So our change failure rate was 9 / 42, or 21.4 percent.
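In code, the metric is a one-line ratio over tagged deployments. The failure-signal labels below mirror the definition in the text; the records themselves are illustrative:

```python
# Sketch: change failure rate = failed deployments / total deployments.
# A deployment 'fails' if it carries any of the signals defined in the text.
FAILURE_SIGNALS = {"rollback", "hotfix_24h", "sev1", "sev2", "support_spike_30pct"}

def change_failure_rate(deployments: list[dict]) -> float:
    failed = sum(1 for d in deployments if FAILURE_SIGNALS & set(d.get("signals", [])))
    return failed / len(deployments)

# The quarter described above: 42 deployments, 9 failed.
deployments = [{"signals": ["hotfix_24h"]}] * 9 + [{"signals": []}] * 33
print(round(change_failure_rate(deployments) * 100, 1))  # 21.4
```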
Most failures were not dramatic outages. They were small regressions that required same-day hotfixes. We noticed that larger batch releases correlated with a greater failure rate. When we reduced average PR size and deployed smaller increments, the rate dropped to 11 percent over the next quarter. The metric did not shame teams. It revealed batch risk.
Friction events: when the system is working but disappoints end users
We define 'friction events' as moments where the system technically works without outages or logged bugs, but users hesitate, retry, refresh, or contact support anyway because the behavior they witness is unclear, slow, or incomplete.
We tracked four friction signals:
- Uncertainty about progress: end users repeatedly refresh pages after triggering background processes
- Silent failure: API returning 200 with only partial data, logged but not surfaced
- Dead ends after errors: validation errors with no suggested correction path
- Slowness: P95 latency exceeding 2.5 seconds on core workflows
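The slowness signal is the easiest of the four to automate. A sketch using a nearest-rank P95 over latency samples; the 2.5-second threshold comes from the text, while the sample data is invented:

```python
# Sketch: flag the 'slowness' friction signal when nearest-rank P95 latency
# on a core workflow exceeds the threshold. Sample data is illustrative.

def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty sample list."""
    ordered = sorted(samples)
    rank = max(0, round(0.95 * len(ordered)) - 1)
    return ordered[rank]

def slowness_friction(latencies_s: list[float], threshold_s: float = 2.5) -> bool:
    return p95(latencies_s) > threshold_s

latencies = [0.4] * 90 + [3.0] * 10  # 10 percent of requests are slow
print(slowness_friction(latencies))  # True
```

The other three signals need product instrumentation (refresh counts, partial-payload logging, error-path analytics) rather than a formula, which is why they hide so easily.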
Example: users who submitted expense reports saw no confirmation state during async processing. Support tickets described these events as 'random disappearance.' There was no clear error, just silence and absence of feedback.
To address this issue, we added an explicit processing state, a progress indicator, and notification of completion. Support tickets on that workflow dropped 40 percent the following month.
Friction rarely announces itself as a defect. It shows up as hesitation.
Closing: what this changed for us
We still track defects, but they are no longer our sole definition of quality. Quality, for us, became measurable through:
- The cost of learning after commitment
- The percentage of assumptions that survived real usage
- The rate at which production rejected our changes
- The speed at which we detected and corrected friction
The shift was not technical. It was conversational. Sprint review discussions changed from: 'How many bugs did we close?' to: 'What did production teach us this week?'
When the latter question becomes normal, quality stops being solely a QA responsibility and becomes a system property. And that is quality's true nature.
What do YOU think?
Got comments or thoughts? Share them in the comments box below. If you like, use the ideas below as starting points for reflection and discussion.
Questions to discuss
- How often are production issues discovered only after code freeze or release?
- When a deployment causes disruption, how quickly do you detect and stabilise it?
- What signals tell you that real usage conditions differ from your internal assumptions?
Actions to take
- Review your last release. How many issues were discovered after code freeze or within the first week of production?
- Introduce a short, structured monitoring window after your next release. Decide in advance what signals you will observe.
- Identify one workflow where users exert avoidable effort without raising defects. Make that effort visible.
For more information
- The night our 'highly available' system went dark: How testers can drive resiliency, Ravikiran Karanjkar
- Too Many Bugs in Production - What Are We Going to Do?, Melissa Fisher
- Evolving Our Testing: Assessing Quality Throughout The SDLC, Dan Ashby