
Help! Too Many Incidents! – Capacity Assignment Policy In Agile Teams

20 Aug, 2014

As an Agile coach, scrum master, product owner, or team member you probably have been in the situation before in which more work is thrown at the team than the team has capacity to resolve.
When the work is already known, this is basically a scheduling problem: determine the order in which the team completes the work so as to maximise business value and outcome. This typically applies to a team that is building or extending a product.
The other interesting case is that of operational teams whose work arrives in an ad hoc way, for example production incidents. The product owner then needs to allocate a certain capacity of the team to certain types of incidents. E.g. should the team work on database related issues, or on front-end related issues?
If the team has more than enough capacity the answer is easy: solve them all! This blog shows how to determine what share of the team’s capacity is best allocated to each type of incident when capacity is limited.

What are we trying to solve?

Before going into details, let’s define what problem we want to solve.
Assume that the team recognises various types of incidents, e.g. database related, GUI related, perhaps some more. Each type of incident has an associated average resolution time. Each type also arrives at the team at a certain rate, the input rate. E.g. database related incidents arrive 3 times per month, whereas GUI related incidents occur 4 times per week. Finally, each incident type has a different operational cost assigned to it. The effect of a database related incident might be that 30 users are unable to work, while a GUI related incident might affect only part of the application and only a few users.
At any time, the team has a backlog of incidents to resolve. An operational cost is associated with this backlog. This operational cost is what we want to minimise.
What makes this problem interesting is that we want to minimise this cost under the constraint of limited resources, or capacity. The product owner may wish to deliberately ignore GUI related incidents and let the team work only on database related incidents. Or should 20% of the available capacity go to GUI related incidents and 80% to database related incidents?

Types of Work

For each type of work we define the input rate, production rate, cost rate, waiting time, and average resolution time:
λ<sub>i</sub> = average input rate for type ‘i’,
C<sub>i</sub> = operational cost rate for type ‘i’,
x<sub>i</sub> = average resolution time for type ‘i’,
w<sub>i</sub> = average waiting time for type ‘i’,
s<sub>i</sub> = average time spent in the system for type ‘i’,
μ<sub>i</sub> = average production rate for type ‘i’
Some items get resolved and spend the time s<sub>i</sub> = x<sub>i</sub> + w<sub>i</sub> in the system. Other items never get resolved and spend time s<sub>i</sub> = w<sub>i</sub> in the system.
In the previous blog Little’s Law in 3D the average total operational cost is expressed as:
Average operational cost for type 'i' = ½ λ<sub>i</sub> C<sub>i</sub> 〈s<sub>i</sub>(s<sub>i</sub>+T)〉
To get the total cost we need to sum this over all work types ‘i’.

System

Work items enter the system (team) as soon as they are found or detected. From that moment they contribute to the total operational cost, and they stop contributing as soon as they are resolved. For some items the product owner decides that the team will start working on them. At the point that the team starts working on an item its waiting time w<sub>i</sub> is known, and on average the team spends a time x<sub>i</sub> before it is resolved.
As the team has limited resources, they cannot work on all the items. Over time the average time spent in the system will increase. As shown in the previous blog Why Little’s Law Works…Always, Little’s Law still applies when we consider a finite time interval.
This process is depicted below:
[Figure: diagram of the system, with a ‘green’ area and an ‘orange’ area]
〈M〉= fixed team capacity,
〈M<sub>i</sub>〉= team capacity allocated to working on problems of type ‘i’,
〈N〉= total number of items in the system
The total number of items allowed in the ‘green’ area is restricted by the team’s capacity. The team may set a WiP limit to enforce this. In contrast the number of items in the ‘orange’ area is not constrained: incidents flow into the system as they are found and leave the system only after they have been resolved.
Without going into the details, the total operational cost can be rewritten in terms of x<sub>i</sub> and w<sub>i</sub>:
(1) Average operational cost for type 'i' = ½ λ<sub>i</sub> C<sub>i</sub> 〈w<sub>i</sub>(w<sub>i</sub>+T)〉 + μ<sub>i</sub> C<sub>i</sub> 〈x<sub>i</sub>〉 〈w<sub>i</sub>〉 + ½ μ<sub>i</sub> C<sub>i</sub> 〈x<sub>i</sub>(x<sub>i</sub>+T)〉

What are we trying to solve? Again.

Now that I have shown the system and defined exactly what I mean by the variables, I will refine what exactly we will be solving.

Find M<sub>i</sub> such that the sum of (1) over all work item types is minimised, under the constraint that the team has a fixed and limited capacity.

Important note
The system we are considering is not stable. Therefore we need to be careful when applying and using Little’s Law. To circumvent necessary conditions for Little’s Law to hold, I will consider the average total operational cost over a finite time interval. This means that we will minimise the average of the cost over the time interval from start to a certain time. As the accumulated cost increases over time the average is not the same as the cost at the end of the time interval.
Note: For our optimisation problem to make sense the system needs to be unstable. For a stable system it follows from Little’s Law that the average input rate for type ‘i’ is equal to the average production rate for type ‘i’. In that case there is nothing to optimise, since we cannot choose the two rates to be different. The ability to choose them differently is the essence of our optimisation problem.

Little’s Law

At this point Little’s Law provides a few relations between the variables M, M<sub>i</sub>, N, w<sub>i</sub>, x<sub>i</sub>, μ<sub>i</sub>, λ<sub>i</sub>. We can use these relations to find the values of M<sub>i</sub> that minimise the average total operational cost.
As described in the previous blog Little’s Law in 3D Little’s Law gives relations for the system as a whole, per work item type and for each subsystem. These relations are:
〈N<sub>i</sub>〉 = λ<sub>i</sub> 〈s<sub>i</sub>〉
〈N<sub>i</sub>〉 - 〈M<sub>i</sub>〉 = λ<sub>i</sub> 〈w<sub>i</sub>〉
〈M<sub>i</sub>〉 = μ<sub>i</sub> 〈x<sub>i</sub>〉
M<sub>1</sub> + M<sub>2</sub> + ... = M
The last relation is not derived from Little’s Law but merely states that the total capacity of the team is fixed.
Note that Little’s Law also has given us relation (1) above.
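As a small illustration of how these relations are used, the sketch below evaluates two of them with exact fractions. The function names and the numbers are hypothetical and not taken from the data in this blog; they only show the arithmetic.

```python
from fractions import Fraction


def production_rate(capacity, resolution_time):
    """Rearranged <M_i> = mu_i <x_i>: production rate mu_i = <M_i> / <x_i>."""
    return Fraction(capacity) / Fraction(resolution_time)


def items_waiting(input_rate, waiting_time):
    """<N_i> - <M_i> = lambda_i <w_i>: average number of items waiting."""
    return Fraction(input_rate) * Fraction(waiting_time)


# Hypothetical numbers: 2 slots allocated, 10 days to resolve an item,
# one new item every 2 days, 6 days of waiting on average.
mu = production_rate(capacity=2, resolution_time=10)
waiting = items_waiting(input_rate=Fraction(1, 2), waiting_time=6)

print(mu)       # 1/5 items resolved per day
print(waiting)  # 3 items waiting on average
```

In this made-up case the input rate (1/2 per day) is larger than the production rate (1/5 per day), which is the unstable situation the optimisation is about.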

Result

Again, without going into the very interesting details of the calculation I will just state the result and show how to use it to calculate the capacities to allocate to certain work item types.
First, for each work item type determine the product of the average input rate (λ<sub>i</sub>) and the average resolution time (x<sub>i</sub>). The interpretation of this is the average number of new incidents arriving while the team works on resolving an item. Put the result in a row vector and name it ‘V’:
(2) V = (λ<sub>1</sub> x<sub>1</sub>, λ<sub>2</sub> x<sub>2</sub>, ...)
Next, add all the components of this vector and denote the sum by ||V||.
Second, multiply the result of the previous step for each item by the quotient of the average resolution time (x<sub>i</sub>) and the cost rate (C<sub>i</sub>). Put the result in a row vector and name it ‘W’:
(3) W = (λ<sub>1</sub> x<sub>1</sub> (x<sub>1</sub>/C<sub>1</sub>), λ<sub>2</sub> x<sub>2</sub> (x<sub>2</sub>/C<sub>2</sub>), ...)
Again, add all components of this row vector and call this ||W||.
Then, the capacity to allocate to items of type ‘k’ is proportional to:
(4) M<sub>k</sub>/M ∼ W<sub>k</sub> - 1/M (W<sub>k</sub> ||V|| - V<sub>k</sub> ||W||)
Here, V<sub>k</sub> denotes the k-th component of the row vector ‘V’. So, V<sub>1</sub> is equal to λ<sub>1</sub> x<sub>1</sub>. Likewise for W<sub>k</sub>.
Finally, because the shares should add up to 1, each value from (4) is divided by the sum of all of them.
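The recipe of (2), (3) and (4), followed by the normalisation, can be sketched in a few lines of Python. This is only an illustration: the function name `allocate_capacity` and the input numbers are hypothetical, and the sketch does not guard against a value from (4) becoming negative.

```python
from fractions import Fraction


def allocate_capacity(input_rates, resolution_times, cost_rates, capacity):
    """Return the share of capacity per work item type, following (2)-(4)."""
    lam = [Fraction(v) for v in input_rates]
    x = [Fraction(v) for v in resolution_times]
    c = [Fraction(v) for v in cost_rates]
    m = Fraction(capacity)

    v = [l * xi for l, xi in zip(lam, x)]               # (2): lambda_i * x_i
    w = [vk * xi / ci for vk, xi, ci in zip(v, x, c)]   # (3): V_i * x_i / C_i
    norm_v, norm_w = sum(v), sum(w)                     # ||V|| and ||W||

    # (4): unnormalised value per type
    raw = [wk - (wk * norm_v - vk * norm_w) / m for vk, wk in zip(v, w)]

    # Divide by the sum so that the shares add up to 1
    total = sum(raw)
    return [r / total for r in raw]


# Hypothetical input: two types, capacity M = 4
shares = allocate_capacity(
    input_rates=[Fraction(1, 2), 1],
    resolution_times=[4, 1],
    cost_rates=[8, 1],
    capacity=4,
)
print(shares)  # [Fraction(5, 8), Fraction(3, 8)]
```

Exact fractions are used instead of floats so that the intermediate values can be compared directly with a calculation done by hand.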

Example

If this seems complicated, let’s do a real calculation and see how the formulas of the previous section are applied.
Two types of incidents
As a first example consider a team that collects data on all incidents and types of work. The data collected over time includes the resolution time, the date the incident occurred, and the date it was resolved. The product owner assigns a business value to each incident, which corresponds to the cost rate of the incident; in this case it is measured in the number of (business) users affected. Any other means of assigning a cost rate will also do.
The team consists of 6 team members and each member is allowed to work on a maximum of 2 incidents, so the team’s capacity M is equal to 12.
From their data they discover that they have 2 main types of incidents. See the so-called Cycle Time Histogram below.
[Figure: Cycle Time Histogram]
The picture above shows two types of incidents, having typical average resolution times of around 2 days and 2 weeks. Analysis shows that these are related to the GUI and database components respectively. From their data the team determines that they have an average input rate of 6 per week and 2 per month respectively. The average cost rate for each type is 10 per day and 200 per day respectively.
That is, counting 5 working days per week and 20 per month, the database related issues have: λ = 2 per month = 2/20 = 1/10 per day, C = 200 per day, and resolution time x = 2 weeks = 10 days. The GUI related issues have: λ = 6 per week = 6/5 per day, C = 10 per day, and resolution time x = 2 days.
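The unit conversion can be written down explicitly. The sketch below assumes 5 working days per week and 20 working days per month, which is what the numbers 6/5 and 2/20 imply; the constant and function names are made up for this illustration.

```python
from fractions import Fraction

# Assumption implied by "6 per week = 6/5 per day" and "2 per month = 2/20 per day"
WORKING_DAYS_PER_WEEK = 5
WORKING_DAYS_PER_MONTH = 20


def per_day(count, period_in_days):
    """Convert 'count per period' into a rate per working day."""
    return Fraction(count, period_in_days)


database_rate = per_day(2, WORKING_DAYS_PER_MONTH)    # 2 per month
gui_rate = per_day(6, WORKING_DAYS_PER_WEEK)          # 6 per week
database_resolution_days = 2 * WORKING_DAYS_PER_WEEK  # 2 weeks

print(database_rate, gui_rate, database_resolution_days)  # 1/10 6/5 10
```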
The row vector ‘V’ becomes (product of λ and x):
V = (1/10 * 10, 6/5 * 2) = (1, 12/5), ||V|| = 1 + 12/5 = 17/5
The row vector ‘W’ becomes:
W = (1/10 * 10 * 10 / 200, 6/5 * 2 * 2 / 10) = (1/20, 12/25), ||W|| = 1/20 + 12/25 = 53/100
Putting this together, the share of the team’s capacity to allocate to database related issues is proportional to:
M<sub>database</sub>/M ∼ 1/20 - 1/12 * (1/20 * 17/5 - 1 * 53/100) = 1/20 + 1/12 * 36/100 = 1/20 + 3/100 = 8/100
and the share to allocate to GUI related items is proportional to:
M<sub>GUI</sub>/M ∼ 12/25 - 1/12 * (12/25 * 17/5 - 12/5 * 53/100) = 12/25 - 1/12 * 36/100 = 12/25 - 3/100 = 45/100
Summing these two gives 53/100. This means that we allocate 8/53 ~ 15% and 45/53 ~ 85% of the team’s capacity to database and GUI work items respectively.
Kanban teams may define a class of service for each of these incident types and put a WiP limit of 2 cards on the database related incident lane and a WiP limit of 10 cards on the GUI related lane.
Scrum teams may allocate part of the team’s velocity to user stories related to database and GUI related items based on the percentages calculated above.
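Turning shares into whole-number WiP limits needs a rounding step. The blog does not say which rounding is used; the sketch below uses largest-remainder rounding as one possible choice, so that the limits add up to the team’s capacity. The function name `wip_limits` is hypothetical, and the shares passed in are the normalised values of (4) for the two incident types of this example (database first, GUI second).

```python
import math
from fractions import Fraction


def wip_limits(shares, capacity):
    """Split an integer capacity over types by largest-remainder rounding.

    Assumes the shares add up to 1.
    """
    exact = [Fraction(s) * capacity for s in shares]
    limits = [math.floor(e) for e in exact]
    leftover = capacity - sum(limits)

    # Hand the remaining slots to the types with the largest fractional part
    order = sorted(range(len(exact)),
                   key=lambda i: exact[i] - limits[i],
                   reverse=True)
    for i in order[:leftover]:
        limits[i] += 1
    return limits


shares = [Fraction(8, 53), Fraction(45, 53)]  # database, GUI
print(wip_limits(shares, 12))  # [2, 10]
```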

Conclusion

Starting with the expression for the average total operational cost I have shown that this leads to an interesting optimisation problem, in which we want to determine the allocation of a team’s capacity to different work item types that minimises the average total operational cost present in the system.
The division of the team’s capacity over the various work item types is determined by the work item types’ average input rate, resolution time, and cost rate and is proportional to
(4) M<sub>k</sub>/M ∼ W<sub>k</sub> - 1/M (W<sub>k</sub> ||V|| - V<sub>k</sub> ||W||)
The data needed to perform this calculation is easily gathered by teams. Teams may use a cycle time histogram to find appropriate work item types. See this article on control charts for more information.
 
