Dealing with emergencies in Agile teams

28 Feb, 2011

Every Agile team has to deal with whatever they’ve put out in the wild next to their “regular” work. How to handle the – by definition – unknown load of production emergencies when you’re trying to achieve a stable pace? You can deal with emergencies by performing triage to either reject, defer or accept. You can set up a buffer to absorb some of the uncertainty, and finally you should make sure that you take the time to reduce the number of emergencies by building quality in. If you find you are mostly doing maintenance, you can consider doing Kanban.

The Context

In ye olden days of waterfall projects I never had to deal with that horror of horrors, maintenance. I’d be part of a team building something new, and you could keep going on until the end of the project. It was the maintenance department that would have to deal with the nonsense I had created. Ah, those were the days… all the fun without the hangover afterwards 🙂

But Agile teams, or in fact any team that starts delivering early and often, deliver long before it’s even possible to hand the project over, if that ever happens at all. (In my later waterfall days I’d already started to figure out that maintenance pretty much starts after the first two weeks… 🙂 ) Frequent delivery means that the team itself has to deal with all issues that arise. The first reason is that they are the ones who can do it; the second is that you want to integrate fixes into the team’s work anyway: they still need to deliver new versions of that same software…

In my consultancy work I’ve seen this issue come up with every single Agile team I’ve known, so this is not a unique situation for a small number of teams. All Agile teams have to learn how to deal with this issue!

The Problem

People running in with emergencies

In a Scrum team, the problem will generally surface after one or more Sprints where a number of “production incidents” or similar unplanned mayhem took up so much of the team’s time that they did not achieve their planned Sprint goal. The result is that a team has a hard time planning for the next Sprints. The first problem is that they do not know their “real” Velocity, the second is that they have to somehow factor in the – by definition unpredictable – production incidents.

But watch out, there is a pitfall hidden in the above paragraph. Predictability is not the end goal in Agile! Predictability is important to know when a release is shipped, and to know how to pace the team. But I’ve seen too many cases where teams try to “predict harder” when they should be adapting better. When dealing with the unpredictable, the focus should be on adaptation first, not on more planning beforehand. That would be a return to The Way Of The Waterfall…

The Goal

So there we have it: the goal is to be able to absorb a reasonable amount of uncertainty, striking a balance between robustness and speed.

The Solutions

Before I present some solutions, let me state this right away: if the amount of work from unplanned production incidents is significant compared to the “regular” work, there is no way you can achieve sufficient stability. You’ll need to fix the root causes of all those production issues first. More on that later.

Solution 1: Perform Triage – and Reject

The first thing to check is if you want to fix that production issue at all. This is not as silly as it might seem at first. There are so many cases where a production emergency is not an emergency at all, and should not even have been brought in in the first place! Some examples of “noncidents”:

  • Sales storms in with “the deal of the century”: “If we get feature X in NOW, we can win over customer X!”. In my experience this is always due to an uneducated and undisciplined Sales department. The root cause here is that Sales promised things they shouldn’t have, and they need to save their own skin now. It is ALWAYS possible to wait two weeks for a new feature.
  • Some stakeholders “upgrade” normal requests to production emergencies in an attempt to bypass the negotiations around the backlog. “It’s a blocking issue that I can’t get that feature!”. “Oh? Did the system crash? Is something not working?”. “Well… no, but it’s a real blocker for my work!”. That stakeholder may have a genuine need, but that does not make it a production emergency.

So solution 1 is: a strong Product Owner who performs triage on all production issues. If it’s a real production issue then by all means fix it. But I can guarantee that you’ll find a good number of issues that should not be emergencies at all… BTW, a Product Owner performing triage in this way is what James Coplien calls a Firewall in his organizational patterns book.

Solution 2: Perform Triage – and Defer the fix until at least the next Sprint

“We found this really big problem! We need it fixed right now!”. “Sure, we’ll get right on it. How long has this issue been in the system?”. “Well, for over a year, but we just found out about it!”. “It’s been in there for a year? …And you can’t wait two more weeks for a fix?”

Solution 2 is an extension of Solution 1. An emergency might indeed be important to fix, but there’s an important criterion to an emergency: it’s only an emergency if it must be fixed in the current Sprint. If you can defer the problem to next Sprint, there is no problem! The team can pick it up as part of their regular process, plan it, build it, and deliver at the end of next Sprint. Again this is a Product Owner responsibility: next to the decision to reject, a good Product Owner will make sure that everything that can be deferred will be.
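
The triage in Solutions 1 and 2 boils down to two questions asked in order. Here is a minimal sketch of that decision in Python. It is an illustration, not a prescribed process: the `Issue` fields, the `triage` function and the example issues are all hypothetical names I made up, and in real life the answers come from the Product Owner's judgment, not from two booleans.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    REJECT = "reject"  # not an emergency: back to normal backlog negotiation
    DEFER = "defer"    # worth fixing, but it can wait for the next Sprint
    ACCEPT = "accept"  # must be fixed in the current Sprint


@dataclass
class Issue:
    # Hypothetical fields: the two questions a Product Owner has to answer.
    description: str
    is_production_failure: bool       # did something actually break in production?
    can_wait_until_next_sprint: bool  # would a fix next Sprint be soon enough?


def triage(issue: Issue) -> Decision:
    """Apply Solution 1 (reject) first, then Solution 2 (defer)."""
    if not issue.is_production_failure:
        return Decision.REJECT
    if issue.can_wait_until_next_sprint:
        return Decision.DEFER
    return Decision.ACCEPT


# Made-up examples in the spirit of the stories above.
sales_deal = Issue("Feature X for the deal of the century", False, True)
old_bug = Issue("Bug that has been in the system for a year", True, True)
outage = Issue("System is down for all users", True, False)

for issue in (sales_deal, old_bug, outage):
    print(f"{issue.description}: {triage(issue).value}")
```

Only the last kind of issue, the one that is really broken and really cannot wait, should ever reach the team mid-Sprint.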

Solution 3: Reserve a buffer to deal with unexpected issues

If you’ve done Solutions 1 and 2, whatever you’re left with should be real issues that you have to fix as soon as possible. The best way I know to deal with this is to reserve a buffer of time or story points that is left unplanned. This works especially well if the historical workload of any issues coming up is reasonably stable. You do not know what you’ll be doing, but you know how much effort it will take.

Watch out though, using a buffer can blow up in your face! The first danger is the size of the buffer. If the buffer is a significant percentage of the Sprint, say more than 1/5 of your velocity, then you’ll end up with a big hole in your planning process. So follow Buffer Rule 1: the buffer is not for backlog items. Try to keep the buffer as small as possible.

The second danger with using buffers is what I already discussed in Solution 1: the moment your stakeholders smell a workaround in the regular process, you can be sure they’ll dive onto it. A buffer really, really needs to be protected from unintended use. So perform good triage!

The third danger is buffer overflow. Just like in a computer this leads to blowing up the process. If the buffer is used, you’ll need to track how much of the buffer has been used, otherwise you’ll be in for a surprise at the end of the Sprint.
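
Tracking buffer use does not need more than a running total. The sketch below shows one way to do it in Python, assuming the buffer is expressed in story points. The `SprintBuffer` class and the numbers in the example are hypothetical; the only figure taken from the text is the 1/5-of-velocity warning threshold.

```python
class SprintBuffer:
    """Tracks unplanned work against a reserved buffer, in story points."""

    def __init__(self, velocity: int, buffer_points: int) -> None:
        self.velocity = velocity
        self.buffer_points = buffer_points
        self.used_points = 0

    def is_too_large(self) -> bool:
        # Warning sign from the text: buffer is more than 1/5 of velocity.
        # Written as a multiplication to avoid floating-point rounding.
        return self.buffer_points * 5 > self.velocity

    def record(self, points: int) -> None:
        """Book emergency work that was accepted after triage."""
        self.used_points += points

    @property
    def remaining(self) -> int:
        return self.buffer_points - self.used_points

    @property
    def overflowed(self) -> bool:
        return self.used_points > self.buffer_points


# Made-up numbers: velocity of 30 points, 5 of them reserved as buffer.
buffer = SprintBuffer(velocity=30, buffer_points=5)
print(buffer.is_too_large())  # False: 5 is less than 1/5 of 30
buffer.record(2)
buffer.record(2)
print(buffer.remaining)       # 1
```

Putting the remaining buffer next to the Sprint burndown makes the surprise visible during the Sprint instead of at the end of it.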

Solution 4: Fix root causes, improve quality

This solution is presented as number 4 because the first three are in logical order when you’re trying to control the damage, but in the end you’ll want to do the most important thing of all: fix issues so they stay fixed, and build in quality so that you don’t have emergencies at all! Now this is something we should be doing anyway, and it is not unique to Agile projects: you want to do this in any project! But there is an extra Buffer Rule that is relevant in this respect (credit goes to Jeff Sutherland on this one, I learned this rule when we did CSM trainings). Buffer Rule 2: If you overflow the buffer, abort the Sprint. If you have so many issues that you cannot even keep emergency work limited to a small buffer, you have no business trying to make progress building features. Abort, use the Sprint to fix underlying root causes, and try again next Sprint. Coincidentally, Buffer Rule 2 also works wonders for all those stakeholders trying to “upgrade” their own agenda: “Do you really want that issue fixed now? The team estimates that this is two points of work, and this would overflow the buffer. We would have to abort the Sprint, and you also would not get those other user stories you asked for!” “Oh… um. Well, I guess it isn’t that much of a problem…” (And it wasn’t… Real story!).
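
Buffer Rule 2 is a single comparison, which is exactly why it is so easy to explain to a stakeholder. A minimal Python sketch, in which the function name and the buffer figures are hypothetical (only the two-point estimate comes from the story above):

```python
def consequence_of_accepting(buffer_points: int, used_points: int, estimate: int) -> str:
    """Say what accepting an emergency of `estimate` points would mean.

    Buffer Rule 2: if the work would overflow the buffer, the Sprint is aborted.
    """
    if used_points + estimate > buffer_points:
        return "abort the Sprint"
    return "fix it within the buffer"


# Assumed situation: a 5-point buffer of which 4 points are already used,
# and a stakeholder bringing in an issue the team estimates at 2 points.
print(consequence_of_accepting(buffer_points=5, used_points=4, estimate=2))
# abort the Sprint
```

The point is not the code but the conversation it forces: the cost of the "emergency" is stated before anyone starts working on it.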

Extra: Size the team right

Team size is not a central focus in dealing with emergencies, but it is a factor to be aware of. A small team performs better because it has less overhead, but it is less robust against losing members, whether through illness or through something that pulls a team member away like… production emergencies maybe? On a 10-person team, losing one person “only” means a hit of about 10% in productivity (this is a simplified calculation of course: it assumes all team members are totally replaceable at a moment’s notice); in a three-person team, losing that same person would already mean a whopping 33%! The sweet spot tends to be around 7-9 people: small enough to reduce overhead, large enough to absorb some production loss.
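
The simplified calculation above is just one divided by the team size. As a small Python sketch (the `productivity_hit` name is mine, and it carries the same unrealistic assumption that every team member contributes equally and is fully interchangeable):

```python
def productivity_hit(team_size: int, people_lost: int = 1) -> float:
    """Fraction of capacity lost, assuming all members contribute equally."""
    if team_size <= 0 or not 0 <= people_lost <= team_size:
        raise ValueError("people_lost must be between 0 and team_size")
    return people_lost / team_size


for size in (3, 7, 9, 10):
    print(f"{size}-person team: {productivity_hit(size):.0%}")
# 3-person team: 33%
# 7-person team: 14%
# 9-person team: 11%
# 10-person team: 10%
```

In the 7-9 range, losing one person to an emergency costs roughly 11% to 14% of capacity, which is in the same order of magnitude as a small buffer.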

And finally… consider using Kanban instead of Scrum

If you find that your team is doing more maintenance than “new stuff”, you might consider using Kanban instead. This is because the granularity of Kanban is stories, not Sprints. If there is a production emergency, it will intrinsically wait less time to be picked up: the team can take it on as soon as the current story is finished, not at the next Sprint boundary. Kanban is about flow, while Scrum is about iterations. The two styles are close enough that I’ve seen a Scrum team transition into “flow mode” when they scaled down and only did maintenance, and go back to Scrum when a new release was planned and they scaled up again.

In Conclusion

Every Agile team has to deal with whatever they’ve put out in the wild next to their “regular” work. You can deal with emergencies by performing triage to either reject, defer or accept. You can set up a buffer to absorb some of the uncertainty, and finally you should make sure that you take the time to reduce the number of emergencies by building quality in. If you find you are mostly doing maintenance, you can consider doing Kanban.

Martien van Steenbergen

Wonderful article, Serge. Really love your sketches. They give a personal touch as well as clear things up.
Wish you well.
P.S. My wish, too, is that your article gets retrofitted into the Scrum pattern language…

Michael Sahota
11 years ago

Serge, great post. I love the drawings and thorough explanation.
I would even go a step farther and argue that “no interruptions” in Scrum is an anti-pattern –

Jason Fair
11 years ago

Great article. Agree with your recommendations. I specialize in Agile in ERP, and including “unplanned events” in the sprint is imperative to being able to manage expectations with stakeholders as well as deal with integration and dependencies that are inherent with ERP systems.

Jarl Meijer
11 years ago

I really like this overview, and the cartoons!
In my experience a simple question can help to reduce the number of emergency calls in many organisations: “Does the team need to solve this problem, or can someone else do it as well?”. This certainly holds for non-coding issues like analysis of a problem (“Why did this client receive only 5 transactions yesterday?”), configuration, or other. I often see issues being dropped too easily at the one-who-made-it, the-one-who-is-our-hero or the-team-who-is-doing-many-other-things-for-me. Sometimes a little instruction or an extra authorisation can keep work outside the team, which is a special case of solution number 4.

Jenna Pederson
11 years ago

Great post! This has been something I’ve been struggling with for a while and have been thinking about this week.
Another gotcha to using a buffer, and maybe this is what you were touching on, is that if stakeholders get wind of there being a buffer for “unplanned” work and it’s even occasionally not all used up, they will expect that you can just deliver more stories instead. There can be some expectation that everything is planned and anything unplanned is not adding value or not growing revenue. This definitely comes back to setting expectations and “protecting” the buffer.

Evelijn van Leeuwen
11 years ago

Serge, I do like your post!

Thomas Quaidoo
11 years ago

Serge I love your article and how it addresses this issue of maintenance, the various elements that may be in play, and the variety of solutions offered. All agile teams employ their own unique flavors and so such a comprehensive analysis and recommendation is extremely helpful.

Rob Watson
11 years ago

Great article – your points on the true definition of an emergency and the key role of a strong product owner are particularly well made.
Personally I wouldn’t use a buffer, for the reasons stated, but I always like to get the product owner to use strict MoSCoW prioritisation for the stories they expect to be delivered in the sprint. You plan the sprint to include all of the “must haves”, and some of the “should haves”. You can still deliver a working product as long as you have all of the “musts”. The “should haves” then become your buffer, and it’s entirely up to the product owner to decide whether the “emergency” is more or less important than a “should have.” If it’s more important than a must have, then by definition the sprint is aborted.

Fabrice Aimetti
11 years ago

Hello Serge,
This post is very interesting. I’ve translated it into French:
Faire face aux urgences dans les équipes Agile



7 years ago

Great article! And timely.

7 years ago

Another question this article provokes: How often to perform a triage? If it’s truly an emergency issue, do you call ad-hoc triage meetings the moment you hear about these supposed “emergency issues”? What is the communication workflow like for this?
Where I work, it’s normally product management that comes to our team directly and says “We have a hot issue!”. Then it’s just a matter of finding someone to work on it. It’s pretty direct to development that way. However, we plan buckets for these unpredictable “hot” issues. Normally they come from UAT or production field issues.

6 years ago

Hi Serge,
Great write-up, it was really helpful. Can I ask an extended question: how do you return to normalcy, or even plan to return to normal planning? It can never happen in one go; it has to be gradual, I suppose.

4 years ago
Reply to  Serge Beaumont

Insightful article! Good feedback here on how “often to perform a triage.” I’d like to pose the question of how much to triage per issue. A lot of times in order to determine the priority of an issue we need to understand what’s going on, at least at some level of detail. And that can take non-trivial time away from sprint commitments. Of course, we don’t want to go overboard though and do full root cause analysis on issues that turn out to be low priority. Where do we draw the line so that we state we’ve done “Just enough Triage” to accurately determine an issue’s priority?
