Dealing with Disaster

Michael Lopp has a nice post about how he has learned to manage “sky is falling” situations. If you work in IT, you have been there (and as a long-time consultant I’ve been there at least twenty times too many), but have you handled the situation effectively?

All too often the tendency in these situations is to “shotgun” it: try anything and everything, because everyone is breathing down your neck and it needs to be fixed RIGHT NOW. In many cases, though, this can make things worse.

As Lopp says:

Action feels like progress, but undirected action is not progress, nor is it a plan. You’re going to barge into the office and start barking orders because that is what everyone expects, but if your orders are not shaped by what you’re really attempting to do, you are just scurrying people around aimlessly. Yes, you get lucky. Yes, everyone breathes a sigh of relief when you show up with your impressive sense of purpose, but in my experience when my direction doesn’t map to intent, I’m usually getting no closer to propping up the sky.

At the root of the situation is the fact that you very likely have an incomplete picture of what the real problem is. The person you are directly interacting with is probably riding you about their specific issue and you get hyperfocused on a symptom or secondary effect, while not clearly seeing the big picture. There are likely other symptoms that aren’t being communicated effectively.

The trick then is to resist the urge to act on each and every impulse and instead take the time to get a complete picture of what’s really going on.

This takes incredible DISCIPLINE, precisely because so many people are anxious to have a resolution RIGHT NOW.

So, as Lopp suggests, take the time to establish a “war room”, a central base for collecting information. It may be as simple as a single whiteboard or a sheet of paper. Create a diagram of the systems impacted. On the side, add a list of reported issues (likely symptoms) and a distribution list [1] for sending updates. If the issue is big enough, of course, you will likely set up a true “war room”: a central point for collecting data and managing the various troubleshooting activities. Be patient in collecting that information until you have a good mental model of the situation.

A sorry attempt at flowcharting the “war room”
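For anyone who would rather keep the whiteboard in a script, here is a minimal sketch of the same idea: impacted systems, reported symptoms, and the two distribution lists from the footnote. The class, its method names, and the sample contacts are all hypothetical, made up purely for illustration; neither Lopp's post nor any real tool defines them.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class WarRoom:
    """Hypothetical stand-in for the whiteboard: systems, symptoms, contacts."""

    systems: set[str] = field(default_factory=set)
    # Each entry is (reporter, system, symptom), kept in the order reported.
    symptoms: list[tuple[str, str, str]] = field(default_factory=list)
    tactical: set[str] = field(default_factory=set)   # regular technical updates
    executive: set[str] = field(default_factory=set)  # summary updates only

    def report(self, reporter: str, system: str, symptom: str) -> None:
        """Record a reported issue and the system it appears to touch."""
        self.systems.add(system)
        self.symptoms.append((reporter, system, symptom))

    def subscribe(self, contact: str, needs_technical_detail: bool) -> None:
        """Put a contact on the tactical or the executive list."""
        target = self.tactical if needs_technical_detail else self.executive
        target.add(contact)

    def symptoms_by_system(self) -> dict[str, list[str]]:
        """Group symptoms by system to help separate symptoms from causes."""
        grouped: dict[str, list[str]] = {}
        for _reporter, system, symptom in self.symptoms:
            grouped.setdefault(system, []).append(symptom)
        return grouped


# Illustrative data only.
room = WarRoom()
room.report("alice", "mail gateway", "outbound mail is queuing")
room.report("bob", "mail gateway", "bounce notices arrive hours late")
room.report("carol", "file server", "home directories are slow")
room.subscribe("dave", needs_technical_detail=True)
room.subscribe("erin", needs_technical_detail=False)
```

Grouping the reports by system is the point of the exercise: when several people's complaints land on the same box in the diagram, that is a hint they are describing symptoms of one problem rather than several.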

Have the patience to act only when a course of action becomes clear and your plan has been vetted by people you trust. Yes, this will be hard. Yes, you will have customers, co-workers, and bosses anxious to do something. But act too quickly on incomplete information and you will likely end up making the situation worse.

In the face of disaster, it’s the wise person who does not act until they know. Unfucking the situation is a bandaid, understanding what you’re truly trying to fix is a cure.

- Klaus

  1. Divide the distribution list into “tactical” contacts – people who are actively involved in troubleshooting, testing, or otherwise need regular technical updates; and “executive” contacts – those who only need summary updates on overall status. Add anyone who shows interest or expresses a “need to know” to the appropriate list.