Ujjwal Kumar Singh
SDET @ Skeps
He/Him
Open to: speaking, writing, podcasting, teaching, and meeting at MoTaCon 2026
Hi, I’m Ujjwal, a software tester and quality advocate. Exploring how quality works beyond tools and into systems, decisions, and trade-offs.
Substack: https://substack.com/@beinghumantester
Achievements
Certificates: awarded for achieving 5 or more Community Star badges
Activity
Earned: Schrödinbug (contributed definitions of Schrödinbug)
Earned: Hiccup Error (contributed definitions of Hiccup Error)
Earned: Bug convergence
Contributions
A Schrödinbug is a bug that works purely by accident until someone examines or modifies the code. The software behaves correctly even though the logic is wrong, usually because of lucky conditions such as memory being zeroed or timing working out just right. The moment someone refactors the code, adds logging, or changes compiler settings, the behavior breaks. The problem was always there, just hidden; nothing new was introduced. These bugs often rely on undefined or implementation-defined behavior, which makes them fragile and unpredictable.

They are dangerous because the code looks stable and builds false confidence. Apparent correctness hides serious flaws that can surface after a compiler update, a platform change, or even a small modification. If something works only because of luck, it is already broken; it just hasn't failed yet.
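A minimal Python sketch of the idea, with a hypothetical `same_status` helper. It leans on a CPython implementation detail (small integers from -5 to 256 are interned), so the buggy identity check happens to pass for every value the code has seen so far:

```python
def same_status(a: int, b: int) -> bool:
    # BUG: identity comparison instead of equality.
    # Works purely by accident while values stay inside CPython's
    # interned small-int range (-5..256).
    return a is b

# Looks correct: every status code used so far happens to pass.
ok = same_status(200, 200)           # True, but only by luck of interning

# The moment values leave the lucky range, the latent bug surfaces.
big_a = int("1000")                  # runtime-created ints are distinct objects
big_b = int("1000")
broken = same_status(big_a, big_b)   # False, even though the values are equal
```

Nothing new was introduced when the large values arrived; the defect was always there, hidden by a lucky runtime condition.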
A hiccup error is a brief, self-resolving failure that appears once and then disappears. It usually stems from momentary instability: a network glitch, a timing issue, a short resource spike, or a system running close to its limits. When the same action is retried, it succeeds, which makes the issue easy to ignore.

For example, a payment API returns a 503 error and then works on the next attempt. The transaction completes, so no one investigates. Or a database query times out during a CPU spike, but the retry finishes in milliseconds. These errors leave little trace and rarely point to a clear code defect; instead, they expose fragile interactions between systems.

Hiccup errors are dangerous because teams treat them as flukes until they become frequent. A weekly hiccup turns daily, then hourly. By the time it is taken seriously, the system is already degraded. What looks like noise at first is often an early warning of a deeper reliability problem.
Bug convergence is the point in testing where the number of new bugs being found starts to drop and flatten out. At first glance, it looks like the product is becoming stable. But this slowdown does not always mean the system is high quality; it often means testers are hitting the same areas repeatedly, or the test approach has stopped uncovering new risks.

For example, a team finds 50 bugs in week one, 20 in week two, and 8 in week three. Leadership celebrates the progress, but testers have only been exercising the login and checkout flows. The entire admin panel, bulk operations, and error-handling paths remain untested. The system still contains serious defects, just not in the places currently being tested.

Bug convergence can happen because coverage is limited, test data is repetitive, or exploration has stalled. If bug reports cluster in the same few modules, or if testers struggle to think of new test scenarios, convergence is likely artificial: a sign that the test strategy has been exhausted, not the bugs.

This is why convergence should be treated as a signal to change strategy, not a reason to stop testing. Shifting to different user personas, testing edge cases, varying test data, or exploring less-traveled system paths can reveal the defects that routine testing missed.
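A small sketch of the artificial-convergence check, using the numbers from the example above. The module names and thresholds are illustrative assumptions, not a standard metric:

```python
# Weekly counts of newly discovered bugs (from the example: 50, 20, 8, ...).
weekly_new_bugs = [50, 20, 8, 3]

# Where those bug reports landed. Zero bugs in a module can mean it is
# clean, or that nobody has exercised it yet.
bugs_by_module = {"login": 45, "checkout": 32, "admin": 0, "bulk_ops": 0}

# Convergence: the discovery rate has dropped below 10% of its peak
# (an arbitrary illustrative threshold).
converging = weekly_new_bugs[-1] < 0.1 * weekly_new_bugs[0]

# Modules with no reported bugs are candidates for untested territory.
untested = [m for m, n in bugs_by_module.items() if n == 0]

# Convergence plus untouched modules suggests the slowdown is artificial:
# the strategy is exhausted, not the bugs.
artificial = converging and bool(untested)
```

A dashboard-style check like this does not prove the product is unstable; it flags when a flattening bug curve deserves a change of test strategy rather than a celebration.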
A cascading failure happens when one small failure triggers a chain reaction across the system. A single component breaks, and the impact spreads quickly through dependent systems. What starts as a minor issue turns into a much larger outage.

For example, a slow database causes request timeouts. Applications retry aggressively, multiplying the load. The increased traffic exhausts connection pools and brings down related services, even though they were functioning normally moments before. The failure cascades beyond the original problem.

Cascading failures happen in tightly coupled systems with poor isolation: no circuit breakers, no rate limits, no bulkheads between components. They are dangerous because the original cause is often hidden by the larger breakdown. By the time teams respond, multiple services are down and root-cause analysis becomes difficult.

Preventing cascading failures requires designing for isolation and graceful degradation. Circuit breakers stop unhealthy dependencies from being called. Rate limits prevent retry storms. Timeouts prevent slow operations from blocking resources. Systems should degrade gracefully rather than assume everything will always work.
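A minimal circuit-breaker sketch, assuming the common pattern rather than any particular library: after a run of consecutive failures the circuit opens and calls fail fast, so callers stop piling load onto an already unhealthy dependency:

```python
import time

class CircuitBreaker:
    """Minimal circuit-breaker sketch (not production-hardened)."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures   # consecutive failures before opening
        self.reset_after = reset_after     # seconds before a trial call is allowed
        self.failures = 0
        self.opened_at = None              # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                # Fail fast instead of hammering the sick dependency.
                raise RuntimeError("circuit open: failing fast")
            # Half-open: let one trial call through to probe recovery.
            self.opened_at = None
            self.failures = 0
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0                  # any success resets the count
        return result
```

The point is the failure mode it changes: instead of every caller retrying into a slow database and exhausting connection pools, the breaker converts the overload into immediate, cheap rejections while the dependency recovers.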
A complexity trap occurs when efforts to improve a system inadvertently make it harder to understand, test, and maintain. Well-intentioned fixes accumulate into layers of rules, safeguards, and special cases that reduce clarity and increase fragility. The system may appear safer or more controlled, but hidden interactions introduce new and unexpected failure modes.

From a testing point of view, complexity traps shift failures from predictable defects to emergent behaviour that is difficult to design tests for. As complexity grows, test coverage becomes less meaningful, confidence becomes harder to calibrate, and passing tests no longer guarantee reliable system behaviour.

Complexity traps are difficult to escape because they create the illusion of control. Avoiding them usually requires simplifying design, removing unnecessary mechanisms, and accepting that not all risk can be engineered away.
A blast radius describes how much of a system is affected when something goes wrong. It captures the scope and impact of a failure: which components break, which users are affected, and how far the damage spreads beyond the original fault. A small blast radius means failures are contained and isolated. A large blast radius means a single issue can cascade across services, teams, or customer journeys.

From a testing perspective, understanding blast radius helps prioritise risk. Testing focuses not just on whether a component can fail, but on how much harm that failure can cause and how well the system contains it. Reducing blast radius relies on isolation, boundaries, and graceful degradation, so that failures remain local instead of becoming system-wide incidents.
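One common isolation mechanism is the bulkhead pattern, sketched here with a hypothetical `Bulkhead` class: cap the number of concurrent calls each dependency may consume, so one slow service cannot exhaust the shared capacity that every other feature depends on:

```python
import threading

class Bulkhead:
    """Sketch of the bulkhead pattern: per-dependency concurrency cap."""

    def __init__(self, max_concurrent: int):
        # Each dependency gets its own bounded pool of call slots.
        self._sem = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn):
        # Reject immediately when the compartment is full, rather than
        # queueing and letting one slow dependency absorb all capacity.
        if not self._sem.acquire(blocking=False):
            raise RuntimeError("bulkhead full: rejecting call to protect the rest")
        try:
            return fn()
        finally:
            self._sem.release()
```

With a bulkhead per dependency, a hung downstream service fills only its own compartment; the rejections it produces are the small, local blast radius, instead of a system-wide stall.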
Shared my thoughts on how test analytics without decision-making is just reporting.