The Future Of Software Testing Part One

By Seth Eliot

Hey, want to know the future?

OK, here it is:

Testing in Production with real users in real data centres will be a necessity for any high-performing, large-scale software service.
Testers will leverage the Cloud to achieve unprecedented effectiveness and productivity.
Software development organisations will dramatically change the way they test software and how they organise to assure software quality. This will result in dramatic changes to the testing profession.

Is this really the future? Well, maybe. Any attempt to predict the future will almost certainly be wrong. What we can do is look at trends in current changes – whether nascent or well on their way to being established practice – and make some educated guesses.
Here we will cover Testing in Production. The other predictions will be explored in subsequent editions of Testing Planet.

Testing in Production (aka TiP)

Software services such as Gmail, Facebook, and Bing have become an everyday part of the lives of millions of users. They are all considered software services because:

Users do not (or do not have to) install desktop applications to use them
The software provider controls when upgrades are deployed and features are exposed to users
The provider also has visibility into the data centre running the service, granting access to system data, diagnostics, and even user data subject to privacy policies.

Figure 1. Services benefit from a virtuous cycle which enables responsiveness

It is these very features of the service that enable us to TiP. As Figure 1 shows, if software engineers can monitor production, they can detect problems before, or at the same time as, the first effects on users appear. They can then create and test a remedy, and deploy it before the problem has significant impact. When we TiP, we are deploying the new and “dangerous” system under test (SUT) to production. The cycle in Figure 1 helps mitigate the risk of this approach by limiting the time users are potentially exposed to problems found in the system under test.

But why TiP? Because our current approach of Big Up-Front Testing (BUFT) in a test lab can only be an attempt to approximate the true complexities of your operating environment. One of our skills as testers is to anticipate the edge cases and understand the environments, but in the big wide world, users do things even we cannot anticipate and data centres are hugely complex systems unto themselves with interactions between servers, networks, power supplies and cooling systems.

TiP, however, is not about throwing any untested rubbish at users’ feet. We want to control risk while driving improved quality:

The virtuous cycle of Figure 1 limits user impact by enabling fast response to problems.
Up-Front Testing (UFT) is still important – just not “Big” Up-Front Testing (BUFT). Up-front test the right amount – but no more. While there are plenty of scenarios we can test well in a lab, we should not enter the realm of diminishing returns by trying to simulate all of production in the lab (Figure 2).
For some TiP methodologies, we can reduce risk by reducing the exposure of the new code under test. This technique is called “Exposure Control” and limits risk by limiting the user base potentially impacted by the new code.

Figure 2. Value spectrum from No Up-Front Testing (UFT) to Big Up-Front Testing (BUFT)
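
Exposure Control, as described above, can be sketched in a few lines. The example below is a minimal illustration, not any particular company's implementation: it assumes users are identified by a string ID and that a stable hash decides who sees the new code. The function names and the salt value are hypothetical.

```python
import hashlib


def exposure_bucket(user_id: str, salt: str = "new-feature") -> int:
    """Map a user to a stable bucket in the range 0..9999."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 10_000


def is_exposed(user_id: str, exposure_percent: float, salt: str = "new-feature") -> bool:
    """Return True if this user should be served the new code under test."""
    # Buckets are hundredths of a percent, so 1% exposure covers buckets 0..99.
    return exposure_bucket(user_id, salt) < exposure_percent * 100


if __name__ == "__main__":
    users = [f"user-{n}" for n in range(10_000)]
    for percent in (1, 10, 50, 100):
        exposed = sum(is_exposed(u, percent) for u in users)
        print(f"{percent:>3}% target -> {exposed / len(users):.2%} of users exposed")
```

Because each user always lands in the same bucket, raising the percentage only adds users: anyone exposed at 1% is still exposed at 10%. Turning the percentage back down to zero removes the new code from everyone's path.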

TiP Methodologies

As an emerging trend, TiP is still new and the nomenclature and taxonomy are far from finalized. But in working with teams at Microsoft, as well as reviewing the publicly available literature on practices at other companies, 11 TiP methodologies have been identified (Table 1).

Table 1. TiP Methodologies Defined

METHODOLOGY	DESCRIPTION
Ramped Deployment	Launching new software by first exposing it to a subset of users, then steadily increasing user exposure. Purpose is to deploy, may include assessment. Users may be hand-picked or aware they are testing a new system.
Controlled Test Flight	Parallel deployment of new code and old with random unbiased assignment of unaware users to each. Purpose is to assess quality of new code, then may deploy. May be part of ramped deployment.
Experimentation for Design	Parallel deployment of new user experience with old one. Former is usually well tested prior to experiment. Random unbiased assignment of unaware users to each. Purpose is to assess business impact of new experience.
Dogfood/Beta	User-aware participation in using new code. Often by invitation. Feedback may include telemetry, but is often manual/asynchronous.
Synthetic Test in Production	Functional test cases using synthetic data and usually at API level, executing against in-production systems. “Write once, test anywhere” is preferred: same test can run in test environment and production. Synthetic tests in production may make use of production monitors/diagnostics to assess pass/fail.
Load/Capacity Test in Production	Injecting synthetic load onto production systems, usually on top of existing real-user load, to assess system capacity. Requires careful (often automated) monitoring of SUT and back-off mechanisms.
Outside-in Load/Performance Testing	Synthetic load injected at (or close to) same point of origin as user load from distributed sources. End-to-end performance, which will include one or more cycles from user to SUT and back to the user again, is measured.
User Scenario Execution	End-to-end user scenarios executed against live production system from (or close to) same point of origin as user-originated scenarios. Results then assessed for pass/fail. May also include manual testing.
Data Mining	Test cases search through real user data looking for specific scenarios. Those that fail their specified oracle are filed as bugs (sometimes in real-time).
Destructive Testing	Injecting faults into production systems (services, servers, and network) to validate service continuity in the event of a real fault.
Production Validation	Monitors in production check continuously (or on deployment) for file compatibility, connection health, certificate installation and validity, content freshness, etc.
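
The back-off mechanism mentioned for Load/Capacity Test in Production in Table 1 might look something like the sketch below. It is a simulation only: `fake_latency_ms` is a hypothetical stand-in for a real production monitor, and the rates and limits are made-up numbers chosen for illustration.

```python
from typing import Callable


def ramp_load(
    measure_latency_ms: Callable[[int], float],
    latency_limit_ms: float,
    start_rps: int = 10,
    step_rps: int = 10,
    max_rps: int = 1000,
) -> int:
    """Raise synthetic load step by step; back off as soon as the SUT degrades.

    Returns the highest synthetic request rate that stayed within the limit
    (0 if even the starting rate breached it).
    """
    safe_rps = 0
    rps = start_rps
    while rps <= max_rps:
        latency = measure_latency_ms(rps)
        if latency > latency_limit_ms:
            # Back off: stop injecting more load and report the last safe rate.
            break
        safe_rps = rps
        rps += step_rps
    return safe_rps


def fake_latency_ms(synthetic_rps: int) -> float:
    """Stand-in for a production monitor: latency grows with injected load."""
    return 50.0 + 0.5 * synthetic_rps


if __name__ == "__main__":
    print("highest safe synthetic load:", ramp_load(fake_latency_ms, latency_limit_ms=200.0), "rps")
```

The important design choice is that the monitor, not the load generator, decides when to stop. Real users share the system with the synthetic load, so the test has to give way the moment they might be affected.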

Examples of TiP Methodologies in Action

To bring these methodologies to life, let’s delve into some of them with examples.

Experimentation for Design and Controlled Test Flight are both variations of “Controlled Online Experimentation”, sometimes known as “A/B Testing”. Experimentation for Design is the more commonly known of the two: changes to the user experience, such as different messaging, layout, or controls, are launched to a limited number of unsuspecting users, and measurements from both the exposed users and the un-exposed (control) users are collected. These measurements are then analysed to determine whether the new proposed change is good or not. Both Bing and Google make extensive use of this methodology. Eric Schmidt, former Google CEO, reveals, “We do these 1% launches where we float something out and measure that. We can dice and slice in any way you can possibly fathom.”[1]
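
How might those measurements be analysed? One common textbook approach, offered here as an illustration and not as a description of what Bing or Google actually run, is a two-proportion z-test comparing the rate of some desired action (a click, say) between control and exposed users. The function name and the sample numbers are assumptions made for the example.

```python
import math
from typing import Tuple


def two_proportion_z_test(
    control_hits: int,
    control_users: int,
    treatment_hits: int,
    treatment_users: int,
) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for the difference between two rates."""
    p_control = control_hits / control_users
    p_treatment = treatment_hits / treatment_users
    # Pooled rate under the assumption that the change made no difference.
    pooled = (control_hits + treatment_hits) / (control_users + treatment_users)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / control_users + 1 / treatment_users))
    if std_err == 0:
        raise ValueError("rates are all 0% or all 100%; the test is undefined")
    z = (p_treatment - p_control) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Made-up numbers: 10,000 users in each group.
    z, p = two_proportion_z_test(1000, 10_000, 1200, 10_000)
    print(f"z = {z:.2f}, p = {p:.6f}")
```

A small p-value suggests the difference between the groups is unlikely to be chance alone. Whether the change is "good" still depends on picking a measurement that reflects what the business actually cares about.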

Controlled Test Flight is almost the same thing, but instead of testing a new user experience the next “dangerous” version of the service is tested versus the tried and true one already in production. Often both methodologies are executed at the same time, assessing both user impact and quality of the new change. For example, Facebook looks at not only user behaviour (e.g., the percentage of users who engage with a Facebook feature), but also error logs, load and memory when they roll out code in several stages[2]:

internal release
small external release
full external release

Testing a release internally like this can also be considered part of the Dogfood TiP methodology.

Controlled Test Flight can also be enabled via a TiP technique called Shadowing where new code is exposed to users, but users are not exposed to code. An example of this approach was illustrated when Google first tested Google Talk. The presence status indicator presented a challenge for testing as the expected scale was billions of packets per day. Without seeing it or knowing it, users of Orkut (a Google product) triggered presence status changes in back-end servers where engineers could assess the health and quality of that system. This approach also utilized the TiP technique Exposure Control, as initially, only 1% of Orkut page views triggered the presence status changes, which was then slowly ramped up[3].
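
A generic form of Shadowing can be sketched as follows. This is not how Google built the Orkut experiment; it simply assumes a request can be handed to both the current code and the new code, with only the current code's answer going back to the user. The handler type and function name are hypothetical.

```python
from typing import Callable, List, Tuple

Handler = Callable[[str], str]


def serve_with_shadow(
    request: str,
    current: Handler,
    candidate: Handler,
    mismatches: List[Tuple[str, str, str]],
) -> str:
    """Answer the user from current code while quietly exercising new code."""
    response = current(request)
    try:
        shadow_response = candidate(request)
        if shadow_response != response:
            mismatches.append((request, response, shadow_response))
    except Exception as error:
        # A crash in the new code is recorded, never shown to the user.
        mismatches.append((request, response, f"error: {error!r}"))
    return response


if __name__ == "__main__":
    log: List[Tuple[str, str, str]] = []
    print(serve_with_shadow("hello", str.upper, str.title, log))
    print(log)
```

In a real service the shadow call would usually run asynchronously so that it adds no latency for the user, and it could be combined with Exposure Control so that only a fraction of requests are shadowed.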

As described, Destructive Testing, which is the killing of services and servers running your production software, might sound like a recipe for disaster. But the random and unexpected occurrence of such faults is a certainty in any service of substantial scale. In one year, Google expects to see 20 rack failures, three router failures and thousands of server failures[4]. So if these failures are sure to occur, it is the tester’s duty to assure the service can handle them when they do.

A good example of such testing is Netflix’s Simian Army. It started with their “Chaos Monkey”, a script deployed to randomly kill instances and services within their production architecture. “The name comes from the idea of unleashing a wild monkey with a weapon in your data center (or cloud region) to randomly shoot down instances and chew through cables.”[5] Then they took the concept further with other jobs with other destructive goals. Latency Monkey induces artificial delays, Conformity Monkey finds instances that don’t adhere to best practices and shuts them down, Janitor Monkey searches for unused resources and disposes of them[5].
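
The idea behind a Chaos Monkey style test fits in a few lines. The sketch below is an in-memory simulation and not Netflix's tool: the fleet is just a dictionary of instance names, no real cloud API is called, and the function names are invented for illustration.

```python
import random
from typing import Dict, Optional


def unleash_monkey(fleet: Dict[str, bool], rng: random.Random) -> Optional[str]:
    """Mark one randomly chosen healthy instance as down and return its name."""
    healthy = sorted(name for name, is_up in fleet.items() if is_up)
    if not healthy:
        return None  # nothing left to kill
    victim = rng.choice(healthy)
    fleet[victim] = False
    return victim


def service_available(fleet: Dict[str, bool], minimum_healthy: int = 1) -> bool:
    """The oracle: the service survives if enough instances are still up."""
    return sum(fleet.values()) >= minimum_healthy


if __name__ == "__main__":
    fleet = {"web-1": True, "web-2": True, "web-3": True}
    victim = unleash_monkey(fleet, random.Random())
    print(f"killed {victim}; service available: {service_available(fleet, 2)}")
```

The kill is the easy part. The value of the test lies in the oracle that follows it: after every injected fault, something must check that users are still being served.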

Synthetic Tests in Production may be more familiar to the tester new to TiP. It would seem to be just running the tests we’ve always run, but against production systems. But in production we need to be careful to limit the impact on actual users. The freedoms we enjoy in the test lab are more restricted in production. Proper Test Data Handling is essential in TiP. Real user data should not be modified, while synthetic data must be identified and handled in such a way as to not contaminate production data. Also, unlike in the test lab, we cannot depend on “clean” starting points for the systems under test and their environments.

The Microsoft Exchange team faced the challenge of copying their large, complex, business-class enterprise product to the cloud to run as a service, while continuing to support their enterprise “shrink-wrap” product. For the enterprise product they had 70,000 automated test cases running on a 5,000-machine test lab. Their solution was to:

Re-engineer their test automation, adding another level of abstraction to separate the tests from the machine and environment they run on
Create a TiP framework running on Microsoft Azure to run the tests

This way the same test can be run in the lab to test the enterprise edition and run in the cloud to test the hosted service version. By leveraging the elasticity of the cloud, the team is able to:

Run tests continuously, not just at deployment.
Use parallelization to run thousands of tests per run.

Data is collected and displayed in scorecards, providing continuous evaluation of quality and service availability[6].
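
The extra level of abstraction that separates a test from the environment it runs on, together with the rule that synthetic data must be identifiable, can be illustrated with a small sketch. This is not the Exchange team's framework: the mail service is an in-memory stand-in, and every class, function, and prefix name here is hypothetical.

```python
from typing import Dict, List


class InMemoryMailService:
    """Hypothetical stand-in for the system under test."""

    def __init__(self) -> None:
        self.mailboxes: Dict[str, List[str]] = {}

    def deliver(self, recipient: str, body: str) -> None:
        self.mailboxes.setdefault(recipient, []).append(body)


class Environment:
    """Abstraction layer: tests talk to this, never to a machine directly."""

    def __init__(self, name: str, service: InMemoryMailService, synthetic_prefix: str) -> None:
        self.name = name
        self.service = service
        self.synthetic_prefix = synthetic_prefix

    def test_account(self, label: str) -> str:
        # Synthetic accounts carry a prefix so they can be told apart from real users.
        return f"{self.synthetic_prefix}{label}"


def check_mail_is_delivered(env: Environment) -> bool:
    """One test, written once, runnable against any Environment."""
    recipient = env.test_account("recipient")
    env.service.deliver(recipient, "hello")
    return env.service.mailboxes.get(recipient) == ["hello"]


if __name__ == "__main__":
    lab = Environment("lab", InMemoryMailService(), synthetic_prefix="lab-")
    production = Environment("production", InMemoryMailService(), synthetic_prefix="tip-synthetic-")
    for env in (lab, production):
        print(env.name, "pass" if check_mail_is_delivered(env) else "fail")
```

The test itself never mentions a machine name or a data centre. Only the `Environment` object knows where it is running, which is what lets the same check serve both the lab and production.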

Testing Somewhere Dangerous

Production has been traditionally off-limits to testers. The dangers of disruption to actual users should not be under-estimated. But by using sound TiP methodologies, we can limit risk and reap the benefits of testing in production to improve quality, which is ultimately the best way to benefit our customers.

References

[1] How Google Fuels Its Idea Factory; Businessweek; April 29, 2008; http://www.businessweek.com/magazine/content/08_19/b4083054277984.htm
[2] FrameThink; How Facebook Ships Code; http://framethink.wordpress.com/2011/01/17/how-facebook-ships-code/
[3] Google Talk; June 2007 @9:00; http://video.google.com/videoplay?docid=6202268628085731280
[4] Jeff Dean, Google IO Conference 2008, via Stephen Shankland, CNET; http://news.cnet.com/8301-10784_3-9955184-7.html
[5] The Netflix Simian Army; July 2011; http://techblog.netflix.com/2011/07/netflix-simian-army.html
[6] Experiences of Test Automation; Dorothy Graham (soon to be published book: http://www.dorothygraham.co.uk/automationExperiences/index.html); Chapter: “Moving to the Cloud: The Evolution of TiP, Continuous Regression Testing in Production”; Ken Johnston, Felix Deschamps

Author Profile

Seth Eliot is Senior Knowledge Engineer for Microsoft Test Excellence focusing on driving best practices for services and cloud development and testing across the company. He previously was Senior Test Manager, most recently for the team solving exabyte storage and data processing challenges for Bing, and before that enabling developers to innovate by testing new ideas quickly with users “in production” with the Microsoft Experimentation Platform. Testing in Production (TiP), software processes, cloud computing, and other topics are ruminated upon at Seth’s blog at http://bit.ly/seth_qa and on Twitter (@setheliot). Prior to Microsoft, Seth applied his experience at delivering high quality software services at Amazon.com where he led the Digital QA team to release Amazon MP3 download, Amazon Instant Video Streaming, and Kindle Services.