
The "Performance Series" Part 1. Test Driven Performance.

09 Oct, 2012

A number of my colleagues and I recently decided to share our knowledge about “performance” on this medium. You are now reading the first blog in a series, in which I present a test-driven approach to ensuring proper performance when we deliver our project.

Test driven

First of all, note that “test-driven” is (or should be 😉) common in the Java coding world. It is, however, usually applied at the unit-test level only: one writes a unit test that shows a particular feature is not (properly) implemented yet. The test result is “red”. Then one writes the code that “fixes” the test, so now the test succeeds and shows “green”. Finally, one looks at the code and “refactors” it to ensure aspects like maintainability and readability are met. This software development approach is known as “test-driven development” and is sometimes also referred to as “red-green-refactor”.
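To make the cycle concrete, here is a minimal red-green-refactor sketch using JUnit 5. The class and the discount rule are purely illustrative and not taken from any real project.

```java
import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;

// Step 1 ("red"): write a test for behavior that does not exist yet; it fails.
class PriceCalculatorTest {
    @Test
    void tenPercentDiscountIsApplied() {
        PriceCalculator calculator = new PriceCalculator();
        assertEquals(90.0, calculator.applyDiscount(100.0, 0.10), 0.001);
    }
}

// Step 2 ("green"): the simplest implementation that makes the test pass.
// Step 3 ("refactor"): improve names and structure while keeping the test green.
class PriceCalculator {
    double applyDiscount(double price, double discountRate) {
        return price * (1.0 - discountRate);
    }
}
```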

Test driven performance

Now let us see what happens when we try to apply “test-driven” to a non-functional requirement like “performance”. Obviously, we need a test, and the test result needs to be “red” or “green”. There are many aspects to “performance”, so let us take one for the sake of our story here: we assume we are building a web-based application and look at its response times. Now our test can be something like “the mean response time of the system when responding to URL such-and-such must be lower than 0.4 seconds”. I personally find such a requirement highly interesting as it is time-related! These kinds of non-functional requirements are usually given for the final result of the project. But what about during the project?
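Such a requirement can be turned into an automated check. The sketch below assumes a local test deployment and a placeholder URL; it measures the mean over a fixed number of requests with Java's built-in HttpClient and reports “red” or “green” against the 0.4-second threshold. The sample count is an arbitrary choice.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ResponseTimeCheck {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/such-and-such"))  // placeholder URL
                .GET()
                .build();

        int samples = 20;                  // assumption: 20 requests per measurement
        long totalMillis = 0;
        for (int i = 0; i < samples; i++) {
            long start = System.nanoTime();
            client.send(request, HttpResponse.BodyHandlers.discarding());
            totalMillis += (System.nanoTime() - start) / 1_000_000;
        }

        double meanMillis = (double) totalMillis / samples;
        System.out.printf("mean response time: %.0f ms%n", meanMillis);
        System.out.println(meanMillis < 400 ? "GREEN" : "RED");   // 0.4 s criterion
    }
}
```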

Test criteria during a project

My claim is that during a project the criteria for non-functional requirements should change over time. Response times of the system should be extremely good at project start, as there is hardly any system at all! At the end of the project, when almost all development work is done, the response time only has to be “good enough”. Therefore the criteria should be planned, for example by using a picture like this:

Figure 1. Planning a mean response time criterion during a project
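The exact shape of the planned criterion is project-specific; Figure 1 only sketches it. As an illustration only, here is one possible interpretation in which the criterion relaxes linearly from a strict value at project start to the delivery requirement at project end; all dates and numbers are assumptions.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

public class CriterionPlan {
    // All values are illustrative assumptions, not taken from the blog.
    static final LocalDate PROJECT_START = LocalDate.of(2012, 1, 1);
    static final LocalDate PROJECT_END   = LocalDate.of(2012, 12, 31);
    static final double START_CRITERION_MS = 100;   // very strict while the system is small
    static final double END_CRITERION_MS   = 400;   // the delivery requirement of 0.4 s

    /** Returns the planned "green" threshold (in milliseconds) for the given day. */
    static double criterionFor(LocalDate day) {
        long total   = ChronoUnit.DAYS.between(PROJECT_START, PROJECT_END);
        long elapsed = ChronoUnit.DAYS.between(PROJECT_START, day);
        double fraction = Math.min(1.0, Math.max(0.0, (double) elapsed / total));
        return START_CRITERION_MS + fraction * (END_CRITERION_MS - START_CRITERION_MS);
    }
}
```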

What happens when we “break the build”?

During development, we constantly run our test, for instance by using a tool like JMeter. We collect mean response times of critical URLs and check whether we meet the criterion level of the day. One day we “break the build”: we do not meet our criterion and the test is “red”. Now what? For me this is even more intriguing than the flexible criteria we saw above. In test-driven software development one usually stops all development when the “build is broken”: all tests must show green. In our case my strong advice is: don’t act now, plan a performance tuning activity! During such an activity we tune the system until the test is “green” again. So our failing response time test triggers a planning activity rather than immediate action to fix the problem.
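A daily check along these lines could look like the sketch below. It assumes the measured mean comes from something like a JMeter results file (the parsing is left as a stub) and that the criterion of the day is taken from the performance planning; a “red” result only prints a reminder to plan a tuning activity.

```java
public class DailyPerformanceCheck {
    public static void main(String[] args) {
        double measuredMeanMs = readMeanFromJMeterResults();   // stub, see below
        double criterionMs = 200;   // criterion of the day, taken from the planning

        if (measuredMeanMs <= criterionMs) {
            System.out.printf("GREEN: %.0f ms <= %.0f ms%n", measuredMeanMs, criterionMs);
        } else {
            // Do not stop development: schedule a performance tuning activity instead.
            System.out.printf("RED: %.0f ms > %.0f ms, plan a tuning activity%n",
                    measuredMeanMs, criterionMs);
        }
    }

    // Stand-in for parsing a JMeter .jtl results file.
    static double readMeanFromJMeterResults() {
        return 350.0;
    }
}
```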

Preventing waste

Suppose we have planned a performance tuning activity because our test is “red”. How much work do we have to do? How do we minimize the amount of work? In other words, how do we prevent waste? If we tune the system such that the test just shows “green”, there is a good chance it turns “red” again next week and we have to plan another performance tuning activity. This does not make sense. On the other hand, when we optimize far beyond the “green” criterion, we do too much work.
The solution is simple: use a lower limit! So when we do not meet the “green” criterion of, say, 0.2 seconds at a given time, we optimize until we have reached a 0.15-second response time and then stop optimizing. This leads to a performance planning like this:
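In code, this amounts to two thresholds: the criterion of the day decides whether tuning is needed, and the lower limit decides when to stop. A small sketch, using the 0.2 s and 0.15 s values from the example above:

```java
public class TuningTargets {
    static final double GREEN_CRITERION_MS = 200;   // criterion of the day ("green" limit)
    static final double LOWER_LIMIT_MS     = 150;   // target during a tuning activity

    /** The test is "red": a tuning activity should be planned. */
    static boolean tuningNeeded(double measuredMeanMs) {
        return measuredMeanMs > GREEN_CRITERION_MS;
    }

    /** During a tuning activity: keep optimizing until the lower limit is reached. */
    static boolean keepTuning(double measuredMeanMs) {
        return measuredMeanMs > LOWER_LIMIT_MS;
    }
}
```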

Figure 2. Planning a mean response time during a project while preventing waste

Test driven performance in an Agile perspective

Of course the initial performance-planning figure is a very wild guess. There is nothing wrong with such a guess! It is the best we know at that moment. During the project we of course adapt our performance planning. The key thing here is that we constantly attend to system response time as we always have a test at hand showing us “red” or “green”.

Pros and cons

There are two major advantages to the approach sketched above. Obviously, we catch poor design decisions that lead to bad response times at an early stage. Project management therefore stays in control of a major project risk: we are no longer confronted with a badly performing system in the late stages of the project. Secondly, we prevent waste during optimizations by using a lower limit.
As a possible disadvantage, our approach might very well be more expensive than an approach where we only inspect the behavior of a system in production and rely on quick reactions to fix any issues. My colleague Adriaan Thomas will zoom in on this aspect in the next blog of this series.

Dave Collier-Brown
9 years ago

My experience is that doing performance tests as part of the normal
functional testing pays off in the same way writing traditional tests
does. As soon as you’re green, you stop trying to make it faster.
Actually executing the tests is quite easy: you need two, both done with
something like JMeter. One tests for code-path speed, the second for
scalability.
Let's take a simple example, a web service that has to return an answer
in 1/10 of a second for any load up to 10 users on the one-processor
wimpy little machine that we do our nightly build on. The 0.1 second
response time will be your “red” line value for production.
We set a budget of, for example, 0.08 seconds for the middleware and the
database back-end, and initially write a mock-up for the middleware that
waits 0.08 seconds and then returns "success".
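A minimal version of such a mock could look like the sketch below, using the JDK's built-in HTTP server; the class name, port and path are placeholders.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;

public class MiddlewareMock {
    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8081), 0);
        server.createContext("/middleware", exchange -> {
            try {
                Thread.sleep(80);               // spend the full 0.08 s budget
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            byte[] body = "success".getBytes();
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
    }
}
```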
We set up a JMeter script that sends a series of single requests to the
UI from a single user, averaging one per second, and look at all but the
first few samples. That’s the number to plot on your diagram, and
compare with the red and green lines. In this case, the production red
line would be at 0.1 second, and we’d watch out for exceeding it, and
also for trends that suggest we're going to exceed it in the next
sprint. Either is a hint to schedule some profiling and refactoring.
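As a rough stand-in for that JMeter script, the following sketch sends one request per second from a single user, drops the first few warm-up samples and compares the mean of the rest with the 0.1-second red line; the URL and sample counts are placeholders.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;

public class CodePathSpeedTest {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/service"))   // placeholder URL
                .build();

        List<Long> samplesMs = new ArrayList<>();
        for (int i = 0; i < 30; i++) {            // ~30 requests, one per second
            long start = System.nanoTime();
            client.send(request, HttpResponse.BodyHandlers.discarding());
            samplesMs.add((System.nanoTime() - start) / 1_000_000);
            Thread.sleep(1000);
        }

        // Discard the first few samples (warm-up) and average the rest.
        double mean = samplesMs.subList(5, samplesMs.size()).stream()
                .mapToLong(Long::longValue).average().orElse(0);
        System.out.printf("mean after warm-up: %.0f ms (red line: 100 ms)%n", mean);
    }
}
```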
The second test is for scalability. Programs under load start off fast,
stay pretty fast under increasing load, and then suddenly get slower and
slower, as soon as you exceed some particular load.
If you draw a chart of response time versus load of the program we’re
describing, it will start off almost horizontal for one or two users,
creep up a bit more until you get to eight or so, and then start rising
(getting slower) very quickly. If you keep increasing the load in users,
you’ll find it turns into an almost straight line going up at perhaps 45
degrees, forever. It looks like a hockey-stick: a short horizontal
blade, a curve upwards and a long, straight handle. The curve is
actually a hyperbola drawn between a horizontal and a slanted line.
We want to measure the amount it slows down under load as it approaches
and passes our target number of users, so we set up JMeter to run with
increasing numbers of users until we’re well past the target load.
We’ll probably run from 1 to 15, and plot that. If the program isn’t
scaling well under load, the curve will start curling upwards early, and
exceed 1/10 of a second well before we reach 10 users.
If it does, we have a bottleneck, and we need to plan to do two things
in the next sprint: check that our algorithm is supposed to scale, and
find the slowest part of the program. If we have a bad algorithm, like
bubble sort, then we’d better change it. Otherwise we profile the
program and find out what’s slowing us down.
If it doesn’t degrade much, you know you don’t have to do anything, and
can even choose to spend some of your time budget on slowish features.
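A rough stand-in for the scalability run described above: ramp from 1 to 15 concurrent users and record the mean response time at each load level. In practice this would be a JMeter thread-group ramp-up; here each user sends a single request, and the URL is a placeholder.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ScalabilityRamp {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8080/service"))   // placeholder URL
                .build();

        for (int users = 1; users <= 15; users++) {
            ExecutorService pool = Executors.newFixedThreadPool(users);
            List<Callable<Long>> tasks = new ArrayList<>();
            for (int u = 0; u < users; u++) {
                tasks.add(() -> {
                    long start = System.nanoTime();
                    client.send(request, HttpResponse.BodyHandlers.discarding());
                    return (System.nanoTime() - start) / 1_000_000;
                });
            }
            long totalMs = 0;
            for (Future<Long> f : pool.invokeAll(tasks)) {
                totalMs += f.get();
            }
            pool.shutdown();
            System.out.printf("%2d users: mean %d ms%n", users, totalMs / users);
        }
    }
}
```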
At some point in the development, it is wise to do a little bit of
bottleneck-hunting, but not too much. We’re not optimizing, because that
is specializing the program for some particular use case, just removing
performance bugs. Removing bugs is always a good idea, done in moderation.
if you’re too successful at improving performance, you may need to use a
trick that dates back to Multics: put a timer in the UI that keeps it
from returning until at least 0.05 seconds have elapsed. That keeps
users from expecting amazingly fast results all the time, just because
they’ve seen them a few times when everything was quiet.
When the program is approaching shippable, the load/performance curve
we’ve been measuring will be the first part of the capacity planning
effort for the program. Unlike estimates based on CPU/memory/IOPs and
the like, estimates based on time and measured load are usable for
estimating the performance of similar but larger machines.
