Filling the backpack
At the start of your career your backpack with is filled with lots of theory and as your career progresses more and more experience get’s thrown in, perfect. At some point you won’t be learning new things if you keep doing the same role. That’s why people take up different roles and grow in a team. However, the goal of the team’s you’re in often remains similar: develop system X that realizes user stories Y and Z. Many people do lots of roles, but all on the ‘producing’ side of the IT. I personally experienced the value of jumping to (one of) the other side(s) for a period of time. After returning to my original role I became much more effective.
Since I starting in IT I’ve spent 98% of my time on the software producing side of IT. I did development, design, architecture, switched back and forth a bit, but always in the area of building software. I was asked to take a role as technical implementation manager for an IT landscape consisting of 40+ systems and all kinds of integration adapters. Not really my cup of tea, but I took the challenge and hoped to learn a bit (although I was not sure what). The scope I had to deal with went from 1 system and it’s directly related systems, to 40+ systems that together form a whole chain with lots of interdependencies. Due to all these interdependencies it was impossible to upgrade just 1 system. Upgrading 1 system caused a snowball effect that required upgrades across the whole chain. Bad design, true, but a fact we could not change on the spot.
The main target of the implementation manager is to ensure that systems are taking into production in a reliable and predictable way without causing unplanned downtime of the chain. With 10s of systems to be updated in one go, that is not an trivial task. The last go-live of the chain had been a disaster and took weeks to get it “sort of” working. That’s most likely the reason why people offered their condolences when I took the role, there was no way we were going to be successful… The upside of this was that it was easy to do better than last time. To increase the pain of failure, at the time of the failed go-live there were hardly any end customers using the chain, however, when we’d go live there would be lots of end customers using the chain. So, failure was not an option.
In a couple of hours brainstorm we (Toine was my partner in crime) chewed on the goal we got from management and made a outline for how to get there. The goal was to upgrade the whole chain in less than 24 hours (so that we could do it in a weekend and leave room for rollback). To realize this we decided to prepare thoroughly and rehearse the whole exercise multiple times on a production alike environment with the people that would do the migration in the go-live weekend. The only difference being that rehearsals took place during normal office hours and not at night. The plan was summarized in 3 simple timeline diagrams that defined for each timeblock what goal must be achieved. This diagrams were the basis for many talks we had. For the months leading up to the go live, the timeblocks were pretty coarse (e.g. 2 weeks), for the last week and the weekend itself it got more fine grained into hours. Note that we deliberately did not state how the goals had to be achieved, we left that open for the various project teams such that they could define a approach that would work best for them. Requiring the team to prepare did put the bar high for lots of teams. We had to convince people over and over again that preparation was key and that it was worth the investment in hours of their scarce resources. We never drifted away from our vision and plan and the freedom with the how part did take away some resistance. What helped us was that the reputation of the department was at stake (especially after the last failure) and if we’d fail, the reputation of the company since the chain was used by a lot of customers. That helped to convince people, nobody wanted to repeat the last failure and with some successful mini-chain releases we proved our approach and built up some credit.
People that typically only focused on “their” phase in the project, were “invited” to help get the chain live. For example: architects and designers helped to work out implementation playbooks and were on-call or on-site during the go-live weekend. Quality assurance staff had most experience with running the new versions of the systems in the chain, therefore, they were on-site during the go-live to help validate the upgrades and analyze issues. Project team that typically got out of the loop when software was handed over to QA, now wrote the detailed implementation playbooks and had to coach the operational maintenance staff to execute the playbooks correctly. Operational maintenance started to realize that if they worked closely with the project team to rehearse the playbooks, they would have much less aftercare issues.
In the initial rehearsals of the playbooks we focused on sub-chains and later we’d let them all work together in a full chain rehearsal, this helped people to find each other and work together in a more efficient way. One of the biggest risks for meeting the 24 hour deadline were several data migrations that in some cases initially took days to run. A lot of effort was invested to optimize these and also test them on copies of production data. This highlighted numerous issues that would have lead to failure in the go-live weekend, but now we could tackle them upfront.
So what did I learn and what am I using (more) now in my current role as architect?
Communication is key
We had over a 100 people active in the weekend and the rehearsals smoothened the communication between persons. Often people that never really worked together now had to work together to analyze and resolve issues. This now started during the rehearsals when there was much less pressure compare to a go-live weekend. It makes it much easier for people to build relations (compared to a high pressure situation).
Get all stakeholders involved early on
Lots of project teams were involved and it proved really important to get all teams on board as soon as possible. After the initial brainstorm we started to communicate our vision to management, project teams, operation maintenance, the internal customer, etc. We made clear what we expected from them and asked them to start preparing. This was 6 months before go-live and going to production was not on top of their priority list. Still we wanted them to start preparing, make first playbook versions, think about dependencies and timelines, etc. We made sure we checked back with teams on their progress on a regular basis and kept communicating the goal and vision.
Create a vision and share it
We started of with a big hairy goal and a vision on how to get there. This goal and vision directed all our actions and provided clarity to all parties that we had to work with. It took some time to sink in, but by consistently repeating it and being almost religious about it, it became a shared goal of all teams involved. Initially many people had doubts, but as time progressed we could show results of the rehearsals and go-live of mini chains. That slowly got everybody to support the approach. Setting a dot on the horizon and making that a share goal proved to be really important.
A project does not end after the handover to quality assurance
The scope of projects should include go-live and aftercare. One of the hurdles we had to take was to convince PMs that their project team should spent time on preparing for the go-live. This was a hurdle because it was not considered to be part of their project deliverables and therefore, they had no resources to work on it. Luckily we’ve been able to convince all of them, but it did take some time. It would have been much easier if project scope includes go-live and does not stop at a sign off by QA.
Operational maintenance is your friend, involve them, always
Make sure you involve the operational maintenance teams early on and keep them involved. They will have to support the software on a daily basis, so it’s important that their needs are addressed and they must feel confident that the new software won’t make their life harder. Also, these are the guys and gals that have access to machines, they can actually check (e.g. connectivity) or fix things for you. Always bring some coffee when you need a favor from them.
Automate as much as possible
Deployment of software should be automated as much as possible. Main reason we spent a lot of time rehearsing the playbooks was that the upgrades of the systems were done manually. In addition to the long duration it proved to be error prone. Playbooks had to be really detailed and we had to convince people to actually follow the steps in the playbook. There is reason for these steps, so don’t skip them or do them in a different sequence. Unfortunately we’ve not been able to realize automated deployments for all systems, so this is an area for improvement.
Data migrations are a risk, validate them upfront with production data
As part of the go-live a number of large data migrations had to be done. Two risks to address here: the duration and the robustness. Both risks have been addressed by running the data migrations on copies of production data. This gave us insight in the duration, which was too long. So optimization was needed. Also it became clear that production data was not as clean as the test data used by QA to test the data migration. The dirty production data actually broke the migration and extra effort was needed to make the migration more robust and clean up the production data.
Give people freedom, but communicate the goals and explain them
We had to deal with 20+ project teams, departments, external suppliers, etc. There was no way we could align them on a detailed level. So we focused on communicating the goals, vision and plan. This explained what we altogether had to achieve and what the (mostly time) constraints were. Then we gave freedom to the various teams to come up with a solid solution for their system. Of course, during the rehearsals the had to prove that their approach worked and we had to check that it did fit into the overall plan, but giving them freedom and being clear about the goals empowered them to work out efficient solutions for their domain.
Working with internal IT departments can be challenging
Internal IT departments are typically focused on ongoing concern and organized by specialities (network, db, machines, OS). When infrastructure is running, they’ll make sure it keeps running. Supporting projects that require quick response times and change their requirements on short notice is not their cup of tea. This simple requires a different type of people and processes. This was a frustrating area and after several escalations we managed to get a dedicated PM that was integral part of our team and had 1 goal: get our change requests executed on time and cut as much procedural waste as possible. This did not mean that we could bypass all procedures, we now had someone that knew how to speed up things. He also forced us to do bit more upfront thinking and initiate change requests earlier. That was fair and we managed to get a workable situation. The only problem was that is was bound to that one person being, when he got replaced, we started all over again. So, if you want to get things done with these groups, involve them early, try to a 1 point of contact and use his/her network.
I’m glad to be on the ‘software producing’ side of IT again, but also very much appreciate the experience of working on the receiving side. It did teach me some new things and reinforced things I already knew but did not do enough. I’m sure I’ll do such a sabbatical again, not now but let’s see what challenges pop up in the future.