
Scientist, a novel software QA method


This article describes a situation we found ourselves in while refactoring an existing application running in production, the implementation of a novel QA method dubbed “Scientist”, and the advantages and drawbacks of this new approach. A future blog post will detail the architecture of a serverless implementation of this new QA method.

Introduction

One of the nice things about working at Xebia is our regular Innovation Day, held every two to three months. All consultants come together in our Hilversum office to experiment with new technologies, build prototypes, develop new products or brainstorm about new consultancy or training services.

On one of those days, two years back, Gero Vermaas and I built a Slack application for use within Xebia. After a while, it turned out to be quite a useful tool for consultants and staff. But a year later, we were not completely happy with some of the implementation choices we had made earlier – a common assessment of things built over a year ago. We wanted to improve and refactor the code, while obviously keeping the thing running in production as well.

Thinking about how we could test our planned refactoring, we concluded that traditional QA methods were not really applicable or helpful in becoming confident that the improvements were really better than the original implementation. Luckily, we were inspired by a concept and solution promoted by GitHub that we wanted to try out with our application. This resulted in the adoption of a novel software QA method dubbed “Scientist”, which we successfully applied to our case.

Our serverless application

Our Slack app is called /whereis #everybody and keeps track of user-submitted locations. Using the app from within Slack, you register where you are, and you can query for past and current locations of your team members. Since we are a consultancy firm, a “location” is often a customer, but it can also be a city, a department, or a building – anything you consider a “location”. It turned out to be a very convenient app for keeping track of the whereabouts of a group of consultants spread out over many customers, and it has been used quite frequently for over two years now.

/whereis #everybody is a completely serverless application, currently deployed on AWS. It uses API Gateway, about 20 Lambda functions, a DynamoDB data store and some minor things, as described here.
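To give a feel for what one of those Lambda functions looks like, here is a minimal Python sketch of a handler that registers a submitted location. The function name, table name and payload handling are hypothetical; the real handlers are not shown in this article.

```python
# Hypothetical sketch of one of the ~20 Lambda handlers behind API Gateway.
import json
from urllib.parse import parse_qs

import boto3

dynamodb = boto3.resource("dynamodb")
locations = dynamodb.Table("whereis-locations")  # hypothetical table name


def register_location(event, context):
    """Handle a /whereis slash-command payload forwarded by API Gateway.

    Slack sends slash commands as form-encoded POST bodies.
    """
    params = parse_qs(event.get("body") or "")
    user = params.get("user_name", ["unknown"])[0]
    location = params.get("text", [""])[0]

    locations.put_item(Item={"user": user, "location": location})

    return {
        "statusCode": 200,
        "body": json.dumps({"text": f"{user} is at {location}"}),
    }
```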

But as we said, we were not delighted with our initial choice of language for our Lambda functions, which was JavaScript. We noticed that refactoring the code was cumbersome, and we were also reluctant to add new features to the codebase. The code pushed back at us, fearing changes perhaps.

So, we thought of reimplementing our backend Lambda functions in Python, but…

  • the app was already running in production;
  • we didn’t have an extensive test suite – or more honestly: any test suite;
  • we didn’t completely know how the app was being used, as users are always more creative than developers.

So we wondered: if we were to refactor and reimplement our functionality,

  • how could we become confident that the ‘improved’ code was indeed better than the current implementation?
  • which QA method is the most applicable for code changes in software that’s already running in production?

Which QA method is best for testing refactored functions in production?

In short, we had a couple of needs regarding QA of the refactored software:

  • we want to test an improved implementation of something already running in production;
  • we don’t want to specify all unit or integration test cases;
  • we don’t want the hassle of recording production traffic and sending it towards a new implementation;
  • we don’t want to activate a new implementation in production before we’re really confident that it’s better;
  • we don’t want to change our software in production to enable testing.

We also noticed that there’s a big division in software QA methods, especially when looking at “with what do you compare the software?” Do you compare software against some specification or user expectations, e.g., in Unit testing, Integration testing, Performance testing or User acceptance testing? All these methods are typically applied before new or changed software lands in production. Or do you compare a new software version with an earlier or alternative version, e.g., with Feature flags, Blue/green deployments, Canary releases or A/B-testing?

Comparing software QA methods

So, we compared a couple of software QA methods on a number of aspects:

  • what do you test your software against?
  • in what phase of the software development life cycle does it typically occur?
  • where do you get test data from?
| QA method | Test against | Phase / stage | How to get test data |
| --- | --- | --- | --- |
| Unit testing | Test spec | Dev | Manual / test suite |
| Integration testing | Test spec | Dev | Manual / test suite |
| Performance testing | Test/user spec | Tst | Dump production traffic / simulation |
| Acceptance testing | User spec | Acc | Manual |
| Feature flags | User expectations | Prd | Segment of production traffic |
| A/B-testing | Comparing alternatives | Prd | Segment of production traffic |
| Blue/green deployments | User expectations | Prd | All production traffic |
| Canary releases | User expectations | Prd | Early segment of production traffic |

In our case (and acknowledging that we had never invested in a test suite) we wanted to test our software in production… but the QA methods typically used in production (Feature flags, A/B-testing, Blue/green deployments and Canary releases) have important drawbacks:

  • To support feature flags or A/B testing, you need to adapt your production code to make those methods work. However minimal and localized, the feature flag or A/B testing logic is typically placed somewhere inside your software, and that code lives in production until you remove it again (a minimal illustration follows this list).
  • In canary releases and blue/green deployments, you’re actually sending part of your production traffic to the new software, and you can’t compare how those same requests would have been processed without the canary, or by the other deployment color. Production traffic lands in either one of the two versions, and a segment of production requests is handled exclusively by one version.
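As a minimal illustration of that first point, feature-flag logic typically ends up as a branch inside the production code itself. The flag and function names below are hypothetical:

```python
import os


def current_implementation(request):
    return {"handled_by": "current", "request": request}


def new_implementation(request):
    return {"handled_by": "new", "request": request}


def handle_request(request):
    # The feature-flag branch lives inside the production code itself,
    # and stays there until someone remembers to remove it again.
    if os.environ.get("USE_NEW_IMPLEMENTATION") == "true":  # hypothetical flag
        return new_implementation(request)
    return current_implementation(request)
```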

GitHub had a similar problem

Luckily, GitHub had a similar challenge, way earlier than we did, and they describe their approach and solution in two very interesting articles.

GitHub wanted to replace a critical and complex piece of their software for performing merges. The new code promised faster merges, simpler checkout logic, and more. However, they argued that their normal QA (code review, internal testing) did not give enough confidence for “production readiness”, as corner cases in production were often not covered by normal QA.

GitHub’s solution was to direct all production traffic to both the original and the new code, compare the results, and only send the response of the original code back. (By doing so, the original requestors were completely unaware that functionality was executed by two methods in parallel, and that the results were compared. They couldn’t know that a new implementation was being subjected to QA.) Once GitHub was convinced that the new code was indeed “better” than the original code according to their assessment, the alternative implementation was activated in production.
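In code, the core of that idea can be sketched as follows. This is a minimal Python illustration of the pattern, not GitHub’s actual implementation; the control, candidate and publish callables are placeholders for your own implementations and your own way of recording observations.

```python
import time


def run_experiment(request, control, candidate, publish):
    """Run control and candidate on the same production request,
    record a comparison, and always return the control's response."""
    start = time.perf_counter()
    control_result = control(request)
    control_ms = (time.perf_counter() - start) * 1000

    try:
        start = time.perf_counter()
        candidate_result = candidate(request)
        candidate_ms = (time.perf_counter() - start) * 1000
        publish({
            "match": control_result == candidate_result,
            "control_ms": control_ms,
            "candidate_ms": candidate_ms,
        })
    except Exception as exc:
        # A failing candidate must never affect production traffic.
        publish({"match": False, "candidate_error": repr(exc)})

    # The requestor only ever sees the control's response.
    return control_result
```

The requestor never sees the candidate’s result, so a broken candidate only shows up in the published observations, not in production behaviour.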

Proposal: new software QA method, “Scientist”

In hindsight, what GitHub actually did was propose a new software QA method based on the scientific approach of gaining empirical knowledge by performing experiments. This QA method, which we will refer to as “Scientist”, goes like this:

  • you have an existing software component running in production, as control;
  • you have another implementation, hopefully a better one, as candidate.

Then you perform experiments, sending production traffic to both control and candidate and comparing the results, to conclude

  • whether the candidate is behaving correctly;
  • whether the candidate is performing better than the control, with “better” being lower latency, better stability, lower memory use, or any other performance metric.

It’s worthwhile to note that this “Scientist” QA method goes through the same steps as the scientific approach of acquiring knowledge:

  • stating a hypothesis;
  • making a prediction, for example “using production traffic, the candidate will be better than our control”;
  • setting up an experiment;
  • running the experiment, comparing results between control and candidates using production traffic;
  • accepting or rejecting the hypothesis, drawing a conclusion about the quality of the software.
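Continuing the earlier sketch: accepting or rejecting the hypothesis can then be as simple as summarizing the published observations. The field names follow the sketch above, and the thresholds you apply to the summary are entirely up to you.

```python
def evaluate_experiment(observations):
    """Summarize recorded observations to support accepting or rejecting
    the hypothesis 'the candidate behaves like the control and is faster'."""
    total = len(observations)
    mismatches = sum(1 for o in observations if not o.get("match"))
    timed = [o for o in observations if "candidate_ms" in o]
    faster = sum(1 for o in timed if o["candidate_ms"] < o["control_ms"])

    return {
        "observations": total,
        "mismatch_rate": mismatches / total if total else None,
        "faster_fraction": faster / len(timed) if timed else None,
    }
```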

Advantages of Scientist approach

There are a couple of advantages to the Scientist approach:

  • it’s a drop-in QA method: no production code needs to change to enable testing alternative implementations;
  • there’s no need to generate test traffic, or to anonymize a historic traffic sample from production;
  • there’s no separate test suite, and no need to specify expected results beforehand – the existing code in production acts as your reference;
  • there’s the ability to iteratively improve candidates towards “good enough” or “even better than control” without users ever noticing;
  • there’s the ability to slowly increase (and decrease) traffic to candidate implementations, to expose a bigger segment of production traffic to your new implementation (see the sketch after this list);
  • there’s a very quick, practically instant, feedback loop with very limited risk;
  • you’re able to catch corner cases that only happen in production and that you were not aware of.
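The gradual ramp-up mentioned above can be as simple as a per-request sampling decision; the rate below is just an illustrative configuration value.

```python
import random

CANDIDATE_SAMPLE_RATE = 0.10  # hypothetical: expose 10% of requests to the candidate


def should_run_candidate():
    """Decide per request whether the experiment runs the candidate at all.

    Raising or lowering the rate changes candidate exposure without
    touching the control.
    """
    return random.random() < CANDIDATE_SAMPLE_RATE
```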

Drawbacks of Scientist approach

Obviously, there are also some considerations you need to be aware of before employing the Scientist approach:

  • there might be some additional latency in sending the control’s response back to the original requestor. The response time could be reduced by invoking candidate functions asynchronously (see the sketch after this list), but the mechanics of an experiment will likely add some round-trip time to the request;
  • in total, there are more function calls and thus higher resource utilization for as long as the experiment runs with one or more candidates. This could result in a higher bill for the compute infrastructure;
  • syncing persistent changes made via the control to candidates that may have missed the original requests can be a challenge in itself: you need to make sure that the candidates are also looking at accurate data.
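As an example of the asynchronous invocation mentioned in the first point: on AWS, the candidate Lambda could be invoked fire-and-forget so that the control’s response is not held up. The function name below is hypothetical, and the comparison would then happen out of band.

```python
import json

import boto3

lambda_client = boto3.client("lambda")


def invoke_candidate_async(event):
    """Fire-and-forget invocation of the candidate Lambda.

    The control's response is returned without waiting for the candidate;
    comparing results then happens asynchronously elsewhere.
    """
    lambda_client.invoke(
        FunctionName="whereis-candidate-handler",  # hypothetical function name
        InvocationType="Event",  # asynchronous invocation
        Payload=json.dumps(event).encode("utf-8"),
    )
```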

When is a Scientist less applicable for QA?

Lastly, there are a couple of cases in which using the Scientist approach for software QA is less applicable:

  • when the interface of a service changes – the original request cannot be sent to the candidate without ancillary changes;
  • when no real-time, live production traffic is available;
  • when production traffic doesn’t cover the breadth of functionality implemented in the control and candidates, i.e., when some code paths are rarely exercised by production traffic;
  • when a control is not (yet) available, for example when you’re developing a first implementation: the Scientist method requires a control to compare a candidate against.

Curious about this Scientist QA method? Investigating the application of this method in your project? Or do you already have such experiments running in production, and have you gained experience with them?

In any case, feel free to reach out to us. We’d love to hear from you.
