(This was originally posted in Spanish, link here)
When a system goes down, everything around it goes down… Well, maybe it's not quite that bad, but sometimes it is, like in the song (Spanish reference 😬).
Dealing with an emergency is not something we usually learn at university or in bootcamps (but how good would that be, right?), so real experience is what ends up teaching us.
In recent years I have been part of emergencies involving systems of all colors and flavors, sometimes in very unfavorable contexts. What did I learn? Quite a lot! Here are some interesting points to consider…
Nervousness is almost inevitable in a situation like this because, depending on the system in question, the consequences of being down can be very serious. The ideal is to keep as cool a head as possible and think clearly. That is easier said than done: the first few times I was in this situation, on top of the stress it generates, I experienced paralysis at one extreme and too much haste at the other. Both are harmful.
If we can work on the problem with at least one other person, we can encourage each other or take turns. If we are alone, stepping away from the keyboard for a while, resting our eyes, and coming back with more energy always helps. A tip that never hurts: check twice before executing a command that could be risky.
Understand the severity and scope
How many people are affected? How affected are they? How long could they stay in that situation? In short, how serious is the problem? All valid questions to ask ourselves when diving into the resolution.
Sometimes we don't have answers to all of these questions. That is why it is necessary to actively work on the observability of our system: knowing our users (for example, where they are located and what devices they use), inspecting general health indicators (such as error rate, or CPU and memory usage), and tracking more specific business metrics (conversion rate, users succeeding at task X, etc.).
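As a rough illustration of the questions above, here is a minimal Python sketch that turns a handful of request records into a severity snapshot. The `Request` fields, the `impact_summary` function, and the sample data are all made up for this example; in a real system these numbers would come from whatever logging or monitoring tooling is already in place.

```python
from dataclasses import dataclass


@dataclass
class Request:
    user_id: str
    status: int   # HTTP status code returned to the user
    country: str  # hypothetical field: where the user is located


def impact_summary(requests):
    """Summarize how many requests and users are hitting server errors."""
    total = len(requests)
    failed = [r for r in requests if r.status >= 500]
    errors_by_country = {}
    for r in failed:
        errors_by_country[r.country] = errors_by_country.get(r.country, 0) + 1
    return {
        "error_rate": len(failed) / total if total else 0.0,
        "affected_users": len({r.user_id for r in failed}),
        "total_users": len({r.user_id for r in requests}),
        "errors_by_country": errors_by_country,
    }


# Invented sample data, only to show the shape of the answer.
sample = [
    Request("ana", 200, "AR"),
    Request("ana", 500, "AR"),
    Request("luis", 200, "UY"),
    Request("marta", 503, "AR"),
]
summary = impact_summary(sample)
print(summary)
```

Even a crude summary like this helps answer "how many people are affected?" and "where are they?" before diving into the fix.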
If more than one person is working to resolve the incident, it is important to stay in sync. Avoid stepping on each other's toes while making the most of each person's time. Break the incident into parts to conquer it.
In the best-managed emergencies, one thing always happens: someone on the team takes a leadership role and looks at the problem from a broader perspective, while other people actively work on the resolution. This leadership can last throughout the emergency, or it can be more situational and pass to other people (for example, if we work with people in other time zones and at some point there is a handover). The key is to be explicit and assertive in communication.
Another thing that characterizes well-handled emergencies, especially those that last several hours or days, is frequent check-ins with the people involved. Pause, regroup, and think a bit strategically.
Maintain communication and keep a record of what is happening
It is natural that, as we try to solve the problem, we make changes (which can have a positive or negative impact), such as changing an environment variable or restarting a service. It helps to keep a record of each change and to communicate it.
For example, if I change an environment variable, I recommend saving the old value and restoring it if the change doesn't solve the problem. If we restart a service and we know it's going to take x minutes, we can let the people who might be affected know. Also, if more people are developing, it may be necessary to ask them not to make new deployments or changes that could further alter the situation. Remember that the goal is to hold as many variables constant as possible.
In my experience, communication works quite well when a team is formed to resolve the emergency and one person takes on the communication role. The rest of the team can then focus on solving the problem itself, while the designated person serves as the link with the rest of the organization and/or the affected users.
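The earlier advice about saving the old value of an environment variable before changing it can be sketched in Python as follows. This is only an illustration under some assumptions: `EnvChange` and `DB_POOL_SIZE` are hypothetical names, and `os.environ` only affects the current process, whereas in a real incident the variable usually lives in a deployment platform or configuration store with its own tooling.

```python
import os


class EnvChange:
    """Change an environment variable while remembering its previous value."""

    _MISSING = object()  # marker for "the variable did not exist before"

    def __init__(self, name, new_value):
        self.name = name
        self.new_value = new_value
        self.old_value = self._MISSING
        self.applied = False

    def apply(self):
        # Save the old value first, so there is always a way back.
        self.old_value = os.environ.get(self.name, self._MISSING)
        os.environ[self.name] = self.new_value
        self.applied = True

    def rollback(self):
        if not self.applied:
            return  # nothing was changed, so there is nothing to restore
        if self.old_value is self._MISSING:
            os.environ.pop(self.name, None)
        else:
            os.environ[self.name] = self.old_value
        self.applied = False


# Hypothetical usage during an incident.
os.environ["DB_POOL_SIZE"] = "10"
change = EnvChange("DB_POOL_SIZE", "50")
change.apply()
# ... observe the system; if the change did not help, go back:
change.rollback()
print(os.environ["DB_POOL_SIZE"])
```

The same idea applies to any change made under pressure: write down the previous state before touching anything, so undoing it does not depend on memory.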
Don't jump to conclusions
This is one of the most important tips I would like to highlight. Sometimes, because we wish the problem were already solved, we convince ourselves that a solution is the final one. Hypotheses will surely arise (e.g. “it must be the version of the database that we have in production”). It is important to validate these hypotheses before trying a solution.
Also, understand the impact a change could have before making it. Let us remember that we must try, as much as possible, not to make the situation worse by altering more things. Our role is like that of a surgeon: we are operating with a high level of risk, so we must make use of every signal that the body (in our case, the system) gives us and that we can monitor, and we must keep in mind that many of the things we try will have consequences.
Decision: fix, patch, workaround, or rollback
At a very high level, an emergency could be attacked with one of these 4 approaches:
- Fix: try to fix the problem in the correct and final way. For example, if the problem is a bug in the code, write and put into production the code that fixes it, ideally with the corresponding automated tests that allow us to preventively detect if it happens again.
- Patch: try a temporary solution, which can potentially get us out of the emergency faster. This leaves us with technical debt that we will have to resolve once we are back to normal.
- Workaround: not solving the problem directly, but resorting to an alternative that we can live with while the problem is being solved. For example: if an automated process is down, can we keep doing it manually?
- Rollback: restore the system to its previous state, in order to work on the problem with more peace of mind. There are cases where this can be done (for example, when we suspect that a recent change caused the problem) and others where it cannot (for example, if we are facing a DoS attack).
It is not necessary to choose just one. If several people are involved, several solutions can be explored in parallel; generally, we are going to choose the one that gets us out of the emergency fastest.
Once the emergency is resolved and we are back to the normal rhythm of work, it is worth reflecting right away on why it happened, how we could have detected it before it became a problem, and what we could do better in the future.
For this, I recommend post-mortem sessions, a way of structuring this analysis that, on the one hand, leaves a record of what happened (the severity, who was involved, and the times and milestones from when the problem was detected until it was completely solved) and, on the other, serves as a space to reflect without looking for blame and to ask “why” as many times as necessary.
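To make the idea of a post-mortem record more concrete, here is a small Python sketch of what such a record could hold. The `PostMortem` and `Milestone` classes, their field names, and all of the sample values are invented for illustration; a shared document or a template in any tool works just as well.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Milestone:
    at: datetime
    note: str


@dataclass
class PostMortem:
    title: str
    severity: str
    people: list
    detected_at: datetime
    resolved_at: datetime
    milestones: list = field(default_factory=list)
    whys: list = field(default_factory=list)  # the chain of "why?" answers

    def time_to_resolve(self) -> timedelta:
        return self.resolved_at - self.detected_at

    def timeline(self):
        # Milestones are often written down out of order during the incident.
        return sorted(self.milestones, key=lambda m: m.at)


# Invented example incident.
pm = PostMortem(
    title="Checkout errors",
    severity="high",
    people=["on-call engineer", "communications lead"],
    detected_at=datetime(2024, 1, 10, 14, 0),
    resolved_at=datetime(2024, 1, 10, 16, 30),
    milestones=[
        Milestone(datetime(2024, 1, 10, 15, 15), "Rolled back latest deploy"),
        Milestone(datetime(2024, 1, 10, 14, 0), "Alert fired for checkout errors"),
    ],
    whys=["A config change was deployed without review"],
)
print(pm.time_to_resolve())
for m in pm.timeline():
    print(m.at.isoformat(), m.note)
```

Keeping the timeline and the list of "whys" in the same record makes it easier to review the incident later without relying on anyone's memory.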
- Where possible, define a work team with clear and explicit roles
- Maintain communication with the rest of the people involved in the problem (this includes users too!)
- Keep calm and don't rush with solutions
- Think “outside the box” and consider possible short-term solutions
- Try not to aggravate the situation by introducing more changes
- After solving the problem, analyze in detail what happened and think about opportunities for improvement