Some habits seem like good ideas at first glance but can derail your DevOps team. In this blog series, we’re unpacking seven of those habits so companies can learn how to do DevOps right. In this post, we’ll dive into why “we focus 100% on uptime for everything” is bad habit #3.
When you first see this habit, you might think, “Yeah, that makes sense!” After all, who doesn’t want a 100% available website?
Achieving 100% uptime is very difficult and expensive. The technical work to make it happen requires huge investments: for example, making sure that you can run on other servers or in other data centers, and setting up backups, fallback, replication, and disaster recovery. It’s all doable, but the effort and cost that go with it are enormous.
For example, if you have a website running on one server, you can also create two fallback servers. If one fails, then the others take over. But the initial investment in this is quite high, and when you need to make changes, you must repeat it for all servers. So building or changing your application is much slower.
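To make the trade-off concrete, here is a minimal sketch of the arithmetic behind fallback servers. It assumes the servers fail independently of each other and that one surviving server is enough to keep the site up. The 99% figure for a single server and the function name are made up for illustration.

```python
def combined_availability(single: float, servers: int) -> float:
    """Availability of `servers` independent replicas when one survivor is enough.

    Assumption: failures are independent, which real setups only approximate.
    """
    # The site is down only when every replica is down at the same time.
    return 1 - (1 - single) ** servers


single_server = 0.99  # hypothetical availability of one server
for count in (1, 2, 3):
    availability = combined_availability(single_server, count)
    print(f"{count} server(s): {availability:.4%}")
```

Each extra server improves the number on paper, but the calculation leaves out shared causes of failure (a bad deployment, a network outage) and the ongoing cost of keeping every server in sync, which is exactly the overhead described above.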
And even if you invest in all of that, there are still things you cannot control, which makes it nearly impossible to actually be up 100% of the time. Making applications highly available is a good thing, but trying to keep every application up and running all the time can be a waste.
There is almost no application that needs 100% uptime.
Not every application is equal. If you have an application that is only needed during business hours, you don’t need to have 100% uptime. You need to look at your SLAs. What’s appropriate for a retail application? The worst that can happen if you’re down is losing money. If it’s a bank application, it’s probably a little bit more important to have it 99.99% up. And if it’s a medical application monitoring a pacemaker, for example, maybe it’s critical that it is up 100% of the time. "Appropriate" is the keyword here. You don’t need to be up all the time for every application. Make a conscious choice.
We had a customer that had a Christmas card application. It sent out Christmas cards. They needed it to be up 99.9% of the time, which is mind-blowing because from January to mid-December, nobody needed that application.
So, the first thing we ask is, "Do you need to be 100% up?" Because if you are experimenting with new features or you want to deliver your software more frequently, you need to have room to wiggle.
It’s a strategic decision.
How much uptime you need is a business decision, and uptime comes with a cost. Every additional nine adds a lot of work and expense; even going from 99.9% to 99.999% could double your implementation cost and complexity. Does the full application need one number, or can different parts of the application have different numbers? Putting one number on everything makes several parts unnecessarily complex.
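It can help to translate these percentages into the downtime they actually allow. The snippet below is a rough sketch of that arithmetic, assuming a 365-day year and an application that is expected to be available around the clock. The function name is ours, chosen for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the minutes of downtime per year that an uptime target permits."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% uptime allows about {minutes:.1f} minutes of downtime per year")
```

A 99.9% target leaves almost nine hours of downtime a year, while 99.999% leaves little more than five minutes. Putting those numbers in front of the business is often more persuasive than the percentages themselves.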
Let’s take the example of a webshop like Amazon. The space where customers search, click, and order products needs to be working all the time; otherwise, you lose money. People who want to buy stuff from your website will go to your competitor. But as soon as they submit and pay, that order is processed somewhere, and you will have time to ship it over the next few days. If that processing system is down for 10 minutes, the order will be processed 10 minutes later, which doesn’t really impact the customer. If they get the email confirmation 10 minutes later rather than instantly, it doesn’t matter. That system could have a lower availability, which makes it less expensive and easier to change.
In other words, you can put one number on it, 100% uptime for the full application, or you can explore alternatives: maybe it’s not 100% but 99.9%, and maybe it doesn’t apply to everything. Parts with user interaction may need to be really highly available, while some background processes may not. Splitting the application up instead of requiring 100% uptime everywhere is a much smarter business decision.
Start talking.
In practice, many engineering teams say they have to make something 99.999% available because they can’t convince the business to lower the number. But that’s because they aren’t having the right conversations. How highly available should we make this application? If the business side just shouts that it needs to be 100% or 99.9% available, that number has a big impact on the work that follows. The main goal should be having that conversation between the engineering team and the business, so the business understands the impact that number has on the engineering team. The higher you make this number, the higher the cost to implement and to operate. So having that discussion is critical.
Disclaimer: Changing is a bigger challenge than reading a blog post!
We want companies that are not quite doing DevOps correctly to understand why. Our team can help identify what your team can do to get the most out of a DevOps way of working. Contact us!
Until then, we hope you will tune in for our next post to find out why "we have a release manager" is the fourth habit of a highly ineffective DevOps team.