When supposedly "resilient" systems aren't that way
One of the first things you might hear in a new testing role is that your organisation's system is “mission‑critical” and “highly available”. Then, one night, it is very much neither. A regional storage issue turns a perfectly healthy cloud setup into an app that spins, stalls, and fails in confusing ways for real customers. An incident like that proves that availability on paper does not mean resilience in practice.
The bright side: this kind of experience can completely change how you think about testing, failures, and the role you can play as a tester. This article shows you how to move beyond “Is the system up?” to “What happens to the customer when things go wrong?” You will learn practical ways to map failure paths, design simple resiliency experiments, turn real incidents into reusable tests, and collaborate with SREs and developers, even if you never touch production infrastructure.
What resiliency testing really means
Resiliency testing sounds like the exclusive domain of SREs or platform teams. But at its core it answers one question: when something breaks, what does your customer experience? That question sits squarely within the territory testers already investigate. When you do resiliency testing, you shift from chasing perfect uptime to intentionally exploring how your system behaves under partial failures, slowdowns, and bad data.
Resiliency testing means that you:
- Treat outages, latency, and dependency failures as normal test scenarios, not rare edge cases
- Focus less on “Is the service status green?” and more on “Does the system fail in a clear, predictable way?”
- Use your existing environments and tools to simulate degraded behaviour without needing full infrastructure access.
When automatic retries make things worse
Here’s a real-life example: a microservice depended on a shared session-data cache that lived in another subnet. A networking misconfiguration made the cache partially unreachable, but health checks still passed because they checked only the local process, not the cache.
You might have expected the service to detect failures quickly, retry a few times, and then fall back to a slower but safe path such as a database read. Instead, each request retried the same cache call multiple times with long timeouts. Threads got stuck waiting, the fallback path was rarely used, and upstream traffic kept increasing until the system slowly melted down.
From a testing point of view, the gap was clear. The only scenarios tested were “cache fully up” and “cache fully down”, never “cache flaky, slow, or half‑broken”. Adding tests that injected latency and forced intermittent errors, then measured retries, timings, and fallback behaviour, produced the evidence needed to reduce retry counts, tighten timeouts, and improve the circuit breaker settings.
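To make that concrete, here is a minimal sketch of the kind of test that would have exposed the gap. Everything in it is hypothetical: FlakyCache stands in for the real cache client, read_session stands in for the service's retry-and-fallback logic, and the failure rate and delay are values you would tune to match your own environment. The point is to inject latency and intermittent errors, then measure how many retries happen, how long the caller waits, and whether the fallback path is ever taken.

```python
import random
import time


class FlakyCache:
    """Simulates a half-broken dependency: slow on every call, failing intermittently."""

    def __init__(self, failure_rate=0.7, delay_seconds=1.0):
        self.failure_rate = failure_rate
        self.delay_seconds = delay_seconds

    def get(self, key):
        time.sleep(self.delay_seconds)            # inject latency on every call
        if random.random() < self.failure_rate:   # fail intermittently, not always
            raise TimeoutError("cache unreachable")
        return f"cached:{key}"


def read_session(cache, key, max_retries=3, fallback=lambda k: f"db:{k}"):
    """Retry the cache a limited number of times, then fall back to a database read."""
    attempts = 0
    start = time.monotonic()
    for attempt in range(1, max_retries + 1):
        attempts = attempt
        try:
            value = cache.get(key)
            break
        except TimeoutError:
            continue
    else:
        value = fallback(key)                     # the safe-but-slower path
    elapsed = time.monotonic() - start
    return value, attempts, elapsed


if __name__ == "__main__":
    cache = FlakyCache()
    value, attempts, elapsed = read_session(cache, "user-42")
    # These numbers are the evidence you take to developers: how many retries
    # happened, how long the user waited, and whether the fallback was used.
    print(f"value={value} attempts={attempts} waited={elapsed:.1f}s")
```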
Simple resiliency tests you can start this sprint
You do not need a full chaos platform to start testing for resilience. You can add lightweight experiments to your current sprint using tools you already know.
Focus on designing one small change, observing what happens and capturing what it means for your users. You can start with:
- Timeout and latency tests: Delay responses from key APIs in your test environment using browser devtools, proxies, or scripts (a minimal proxy sketch follows this list). Then watch how the UI behaves when calls take 5, 10, or 30 seconds.
- Broken dependency tests: Turn off a non‑critical dependency such as recommendations, analytics, or a third‑party widget. Then check whether core flows still work and how missing pieces are explained to users.
- Bad data tests: Seed test systems with corrupted or unexpected data, such as missing fields, huge values, expired tokens, or mismatched states. Then see whether services fail loudly with helpful errors or fail silently with odd behaviour (a second sketch after this list shows one way to probe this).
- Failover behaviour checks: If your stack claims to be “active‑active” or to fail over automatically, ask for a controlled failover in a non‑production environment. Focus on latency spikes, errors, and the reconnect experience from the customer's point of view.
- Customer‑centric monitoring checks: Pick a few customer‑visible metrics, such as join success rate or checkout completion rate. Track how they change during each experiment to build the habit of tying failures to user impact.
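As one way to do the “scripts” option from the first bullet, here is a minimal delaying reverse proxy. UPSTREAM, the local port, and DELAY_SECONDS are placeholders for your own test environment; point the application under test at the proxy and watch what the UI does while every call gains ten seconds of latency.

```python
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Placeholders: replace with your test-environment API and the delay you want to simulate.
UPSTREAM = "http://localhost:8080"
DELAY_SECONDS = 10


class DelayingProxy(BaseHTTPRequestHandler):
    """Forward GET requests to the upstream API after an artificial delay."""

    def do_GET(self):
        time.sleep(DELAY_SECONDS)  # simulate a slow dependency
        try:
            with urllib.request.urlopen(UPSTREAM + self.path, timeout=30) as upstream:
                body = upstream.read()
                self.send_response(upstream.status)
                self.send_header(
                    "Content-Type",
                    upstream.headers.get("Content-Type", "application/octet-stream"),
                )
                self.end_headers()
                self.wfile.write(body)
        except Exception:
            # Surface upstream failures as a 502 so the UI's error handling is exercised too.
            self.send_response(502)
            self.end_headers()


if __name__ == "__main__":
    # Point the application under test at http://localhost:9999 instead of UPSTREAM.
    HTTPServer(("localhost", 9999), DelayingProxy).serve_forever()
```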
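And for the bad data tests, a small probing script can do a lot. The endpoint and payloads below are invented; swap in a real test-environment URL and the corrupted records your own system might plausibly receive, then judge whether each failure is loud and helpful or silent and odd.

```python
import requests  # assumption: the service under test exposes an HTTP API

# Hypothetical endpoint and payloads for illustration only.
BASE_URL = "http://localhost:8080"

BAD_PAYLOADS = {
    "missing required field": {"order_id": "A-1001"},                     # no customer_id
    "absurdly large value":   {"order_id": "A-1002", "quantity": 10**9},
    "expired token":          {"order_id": "A-1003", "token": "expired-token"},
}


def probe_bad_data():
    """Send each bad payload and record whether the service fails loudly or silently."""
    for name, payload in BAD_PAYLOADS.items():
        response = requests.post(f"{BASE_URL}/orders", json=payload, timeout=10)
        # A loud, helpful failure is a 4xx with a message that names the problem;
        # a silent one is a 200 with odd data, or a bare 500 with no explanation.
        print(f"{name}: status={response.status_code} body={response.text[:200]!r}")


if __name__ == "__main__":
    probe_bad_data()
```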
Mapping your system to look for potential failure points
Traditional system diagrams show data flow and business logic, but they often hide how things fail. A failure‑first map helps you see where your system will hurt most when something slows down, breaks or returns junk.
To create a simple failure map for one critical user journey, you can:
- Pick one high‑value flow, for example, “join a video call”, “place an order,” or “submit a claim”
- List every visible dependency for each step: front ends, APIs, queues, caches, databases, third‑party services, feature flags, and config
- For each dependency, ask: “If this is down, what would the user see?”, “If this is slow, what would the user see?”, and “If this returns bad data, what would the user see?”
- Turn each answer into test ideas that simulate “down”, “slow,” or “bad” in a development or test environment, even if you do it only at the API or UI layer.
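If it helps to keep the map close to your tests, you can capture it as plain data. The flow, dependencies, and expectations below are entirely illustrative; the structure is the point: one entry per dependency, one expectation per failure mode, and each expectation becomes a named test idea.

```python
# A minimal failure-first map for one flow, captured as plain data.
# All names and expectations are illustrative; replace them with your own journey.
FAILURE_MAP = {
    "flow": "join a video call",
    "dependencies": {
        "auth-api": {
            "down": "user sees 'cannot sign in' with a retry option",
            "slow": "sign-in spinner for up to 10s, then a timeout message",
            "bad_data": "expired token is rejected with a clear re-login prompt",
        },
        "media-gateway": {
            "down": "call fails fast with 'service unavailable', not an endless spinner",
            "slow": "audio-only fallback is offered after 15s",
            "bad_data": "malformed session state triggers a clean rejoin, not a crash",
        },
    },
}


def list_test_ideas(failure_map):
    """Turn each 'what would the user see?' answer into a named test idea."""
    for dependency, modes in failure_map["dependencies"].items():
        for mode, expectation in modes.items():
            yield f"{failure_map['flow']}: {dependency} {mode} -> expect: {expectation}"


if __name__ == "__main__":
    for idea in list_test_ideas(FAILURE_MAP):
        print(idea)
```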
Turning incidents of system failure into reusable test ideas
Every serious incident contains reusable test ideas if you are willing to unpack what happened. Instead of treating a post‑incident review as a box‑ticking exercise, you can mine it for concrete scenarios and acceptance criteria.
When you join a review, you can:
- Capture the first things customers noticed, the technical root causes, and any behaviours that made the outage worse, such as retries, timeouts, or missing alerts.
- Turn these into at least one “Can we reproduce something similar in test?” scenario and one “What should happen instead?” scenario you can adapt into acceptance criteria.
- Build a living catalogue of resiliency tests that reflect actual failures your organisation has already seen, not just hypothetical ones.
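One lightweight way to keep that catalogue executable is a parametrised test suite where each entry records a real incident, how to reproduce a scaled-down version of it, and what should happen instead. The scenarios and IDs below are invented examples; the assertions are placeholders you would replace with checks against your own environment.

```python
import pytest

# Hypothetical incident-derived scenarios: what broke, how to reproduce a scaled-down
# version in test, and what the system should do instead.
INCIDENT_SCENARIOS = [
    pytest.param(
        {"failure": "session cache unreachable", "inject": "block the cache port in test"},
        "fall back to a database read within 2 seconds, with no user-facing error",
        id="2024-03-cache-outage",
    ),
    pytest.param(
        {"failure": "payment provider slow", "inject": "add 15s latency to the provider stub"},
        "checkout shows a retry message instead of spinning indefinitely",
        id="2024-07-payment-latency",
    ),
]


@pytest.mark.parametrize("scenario, expected_behaviour", INCIDENT_SCENARIOS)
def test_incident_replay(scenario, expected_behaviour):
    # Placeholder assertions: in a real suite you would apply scenario["inject"] to your
    # test environment and assert on the behaviour described by expected_behaviour
    # (timings, error messages, fallback usage, alerts fired).
    assert scenario["failure"]
    assert expected_behaviour
```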
Collaborating with SREs and developers for shared reliability
Resiliency testing works best when you treat reliability as a shared responsibility, not “someone else’s job.” Tooling and infrastructure may sit with SREs, but your understanding of user journeys and bug patterns is just as valuable.
You can enhance this collaboration by:
- Inviting yourself to reliability discussions and bringing user journeys, test logs, and bug patterns that show real customer impact
- Asking for a regularly scheduled “chaos hour” in a non‑production environment where you kill a pod, pause a queue consumer, or throttle a dependency while you explore how the system behaves (see the sketch after this list)
- Offering to document expected behaviour for each tested failure mode in plain language, including what customers should see and what alerts should fire, so that future experiments and real production incidents are easier to handle
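As a sketch of what one “chaos hour” step can look like, assuming a Kubernetes-based non-production environment and the official kubernetes Python client: delete one pod behind a critical flow, then spend the rest of the hour watching customer-visible metrics, logs, and the UI. The namespace and pod prefix are placeholders for your own setup.

```python
from kubernetes import client, config

# Placeholders: a non-production namespace and the service whose pod you will kill.
NAMESPACE = "staging"
POD_PREFIX = "checkout-service"


def kill_one_pod():
    """Delete one matching pod in a non-production cluster and let the scheduler replace it."""
    config.load_kube_config()  # uses your current kubectl context
    v1 = client.CoreV1Api()
    pods = v1.list_namespaced_pod(NAMESPACE).items
    victim = next((p for p in pods if p.metadata.name.startswith(POD_PREFIX)), None)
    if victim is None:
        raise SystemExit(f"No pod starting with {POD_PREFIX!r} found in {NAMESPACE!r}")
    v1.delete_namespaced_pod(victim.metadata.name, NAMESPACE)
    print(f"Deleted {victim.metadata.name}; now watch the customer-facing metrics and UI.")


if __name__ == "__main__":
    kill_one_pod()
```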
Try this next week: Your three‑step starter path
If you want to start without overwhelming your team, you can use a simple three‑step path in your next sprint. These steps are small enough to fit into existing work but strong enough to start changing how your team thinks about reliability.
You can:
- Pick one critical flow that matters to your users today, map its dependencies, and write down three “What if this breaks?” questions
- Design one failure experiment with a developer or SRE in a dev or test environment, such as a slow API, a missing dependency, or a forced failover, and plan it like any other test session
- Capture what you learn, including what users would see, which metrics change, which logs or alerts appear, and where behaviour does not match expectations, then turn at least one observation into a ticket or new acceptance criterion.
Building confidence before the next outage occurs
High availability is not simply the result of architecture diagrams and green dashboards. It is the set of behaviours your system shows when the cloud does what the cloud always does eventually, which is to fail in messy ways. When you add resiliency tests, turn incidents into assets, and collaborate with SREs and developers, you build both confidence in your system and credibility for your role as a tester.
If you start with a single critical flow and one small failure experiment next week, you will already be more prepared than you were yesterday. The next time things go wrong in production, you are much more likely to recognise the pattern and say that you have seen a version of it before and you know how your system should respond.
To wrap up: Building more resilient teams, one test at a time
Resilient systems do not appear just because you deployed to multiple regions or turned on auto‑scaling. They grow out of many small, deliberate experiments that reveal how your software really behaves when things break. When you map failure paths, run lightweight chaos sessions, turn incidents into test assets, and collaborate closely with SREs and developers, you help your team move from “hoping it holds” to “knowing how it fails and recovers.” You are actually building a resilient TEAM that in turn will create more reliable systems.
You do not need permission to start small. If you pick one critical flow, run one simple failure experiment, and capture one concrete lesson learned in your next sprint, you set a new expectation for what “tested” means in your organisation. Over time, those steps build a culture where outages become less surprising, recovery becomes faster, and testers are recognised as key partners in keeping both systems and customers safe when the unexpected happens.
What do YOU think?
Got comments or thoughts? Share them in the comments box below. If you like, use the ideas below as starting points for reflection and discussion.
- When your team calls a system “highly available,” what concrete behaviours would convince you it is actually resilient during partial failures, not just when everything is green?
- Think about a recent incident or outage due to flakiness that you experienced: what is one specific failure mode from that event you could turn into a recurring test in your current sprint?
- If you mapped a single critical user flow tomorrow, which dependency would you most want to “break” first, and what should the customer see when it is slow, down or returning bad data?
- What is one small “chaos hour”‑style experiment you could safely run in a dev or test environment next week to start changing how your team thinks about reliability and testing?
For more information
- Testing Ask Me Anything - Reliability Engineering, Jordan Brennan
- Moving fast and breaking things? Build reliable software from day one instead, Ishalli Garg
- Scaling change, one iteration at a time, Kat Obring and Rosie Sherry