πŸ—‘οΈ 7 bad metrics to remind yourself what to avoid

11 May 2026

I've been enjoying listening to the Managing Engineers podcast with Neil Younger and Si Jobling. They have an episode on metrics, and since I'm always looking for metric inspiration, I thought I'd capture what they mentioned.

This post is on bad metrics because, well, don't we all just wanna know some! πŸ˜†

I'll follow up with a post on good metrics.

This is not meant to be a complete list; rather, it covers the ones mentioned in the podcast. The quotes are also extracted from the podcast. I hope you find it helpful, even if it's just a reminder that you are hopefully on the right track.

1. Counting Test Cases

What is it? Measuring productivity or progress by the number of test cases created or executed.

"If I wanted those numbers to look bigger, I can make one test case and then break it into two. And now I've done more. Like, it's such a gamified metric."

Why it's bad: The number alone carries no context: it doesn't indicate risk coverage, test quality, or whether the right things are being tested. It's trivially gameable by splitting cases arbitrarily, and without the story behind the number it's "pretty meaningless." It also creates perverse incentives, pushing testers to optimise for quantity over value, and it can give stakeholders false confidence that coverage is adequate when it isn't.

Note: whilst it wasn't mentioned in the podcast, counting bugs as a metric falls into the same trap for much the same reasons.

2. Commit Count

What is it? Tracking how many times an engineer commits code to a repository as a measure of output or effort.

"Most number of commits in the repo's history. Bye. I'm like, dude, you could have completely gamified that."

Why it's bad: Commit frequency reflects working style and habit, not value delivered. A developer doing deep architectural work may commit rarely; someone making trivial changes may commit constantly. It also varies by discipline: a single thoughtful refactor or a complex bug fix may represent far more value than dozens of small commits. Using it as a comparative measure between individuals is particularly harmful, as it ignores entirely different contexts, codebases, and roles.

3. Story Points as Predictability

What is it? Using accumulated story points from previous sprints to forecast future delivery capacity.

"Story point estimation and using that for predictability of products... based on our previous four sprints, we should be able to do this amount of points... it's just a very basic calculation... rarely is it right."

Why it's bad: Story points are a relative sizing tool, not a unit of time or effort. Teams size things differently, points drift in meaning over time, and team composition changes. Using them for capacity planning imports all that subjectivity into what looks like an objective forecast. The speakers advocate for Monte Carlo simulation instead, using actual cycle time data to generate probabilistic forecasts, which is mathematically more honest and surfaces uncertainty rather than hiding it behind a single number.
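
To make the Monte Carlo idea concrete, here's a minimal sketch of a throughput-based forecast. It assumes you have a simple list of items completed per week; the `forecast_weeks` helper and the history numbers are made up for illustration, not the speakers' own tooling.

```python
import random

def forecast_weeks(backlog_size, weekly_throughput, trials=10_000):
    """Simulate how many weeks a backlog takes to clear by resampling
    historical weekly throughput (items finished per week) with replacement."""
    outcomes = []
    for _ in range(trials):
        remaining, weeks = backlog_size, 0
        while remaining > 0:
            remaining -= random.choice(weekly_throughput)  # replay a random past week
            weeks += 1
        outcomes.append(weeks)
    outcomes.sort()
    # Return a distribution, not a single number: that's the honesty gain.
    return {f"p{p}": outcomes[int(trials * p / 100)] for p in (50, 85, 95)}

# Hypothetical history: items completed in each of the last 12 weeks.
# (It needs at least some non-zero weeks, or the simulation never finishes.)
history = [3, 5, 2, 4, 6, 3, 4, 1, 5, 4, 3, 2]
print(forecast_weeks(30, history))
# e.g. {'p50': 9, 'p85': 10, 'p95': 12}
```

Reading the 85th percentile as "in 85% of simulated futures we were done by then" hands stakeholders a date with its uncertainty attached, which is exactly what a single velocity average hides.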

4. Velocity / Burn Down Charts

What is it? Measuring team speed via story points completed per sprint, often used to compare teams.

"When they compare a team or a person... Team A versus Team B... asking questions around why is team B slower... it's never always uncomfortable if it's used as a comparison metric just by itself."

Why it's bad: Velocity is highly context-dependent: team size, domain complexity, technical debt, and even sprint goals all affect it. Comparing velocity across teams is like comparing apples to entirely different fruit. It also incentivises teams to inflate point estimates over time to make velocity look healthy, which is a textbook example of Goodhart's Law: the measure becomes the target, and the underlying behaviour it was meant to reflect gets distorted or gamed entirely.

5. Annual Pulse Surveys

What is it? Yearly (or twice-yearly) sentiment snapshots used to gauge organisational health.

"I think an organization will kid themselves if it's a reflection on the year. I imagine most people cannot remember an entire year's worth of sentiment."

Why it's bad: Human memory is emotionally shaped: recent events, particularly negative ones, tend to dominate recall. A difficult Q4 can unfairly colour the retrospective view of an entire year. The long gap between surveys also means problems fester undetected, and any action taken comes far too late to be meaningful. Different personality types and departments respond very differently, making aggregated scores hard to interpret. The speakers describe it as "theatre": something done to appear to be listening rather than to actually understand what's happening.

6. Daily Mood Check-ins

What is it? Logging individual mood or wellbeing every single day.

"After a time people were just like I can't... it's not giving them any value... Sometimes it's okay to have a not great day. You don't necessarily need to tell everyone about it."

Why it's bad: That much frequency creates survey fatigue, and people disengage, either ignoring the check-in or defaulting to middle scores that reveal nothing. It can also feel intrusive; not every difficult moment needs to be reported upward. Paradoxically, forcing daily emotional disclosure can undermine the psychological safety it's meant to support. The speakers found a fortnightly cadence tied to retrospectives worked far better, giving enough breathing room for honest reflection rather than reactive noise.

7. Attendance at Communities of Practice

What is it? Counting how many engineers turn up to optional learning sessions as a proxy for healthy engineering culture.

"What you have is people who go to a community of practice which may or may not be giving them any value. Your measure suddenly becomes the thing that's driving the wrong behaviour."

Why it's bad: This is a near-perfect illustration of Goodhart's Law. Once attendance becomes visible and tracked, people attend to be seen attending, not because it's valuable. The metric conflates presence with participation, and participation with learning or cultural health. A team could score brilliantly on this measure while the sessions themselves are widely considered a waste of time. It mistakes an input (showing up) for the outcome it was supposed to signal (genuine learning culture).

5 metric questions to help you along your metric way:

  1. Who is this metric actually for? Is it genuinely helping the team understand and improve their own work, or is it primarily giving someone further up the chain a sense of control or visibility? Metrics that exist to reassure leadership rather than inform teams tend to produce the behaviours you least want.
  2. What behaviour does this metric reward, and are you comfortable with that? Every metric creates an incentive, whether you intend it to or not. If you follow the logic of your measure to its natural conclusion and imagine someone optimising purely for that number, do you like what you see? If not, the metric may be quietly pulling your team in the wrong direction already.
  3. When did you last stop measuring something? A growing list of metrics is a warning sign. If you cannot remember the last time you retired a measure because it had served its purpose or stopped being useful, you are probably carrying dead weight. This can show up in dashboards that nobody questions, numbers that nobody acts on, and data that exists because it always has.
  4. Could this metric be used against the people it is measuring? Even well-intentioned measures can become weapons in the wrong hands or the wrong moment. Before you collect data on people or teams, it is worth asking honestly whether it could one day be used for comparison, performance management, or justification in ways that were never the original intent.
  5. What conversation should this metric be starting? And is it? A metric that is not prompting curiosity, questions, or action is just theatre. If your dashboards are being reported but not discussed, if numbers are being shared but nobody is asking why, the measure has stopped doing its job. The most important question is not "what does this number say?" but "what does it make us want to find out?"

What else would you add to this list?

Rosie Sherry
CEO & Founder at Ministry of Testing
She/Her

I've been working in the software testing and quality engineering space since the year 2000 whilst also combining it with my love for education and community. It turns out quality, community and education go nicely hand in hand.
