Running fire drills for outage readiness


No one wants to experience a disaster, but worse than having a disaster is being unprepared for a disaster.

In school, we did regular fire drills. The alarm would go off, and the class would line up. The teacher led the class outside to a common meeting area and then counted off the students to ensure the entire class was present. We practiced this so that in a real emergency, we would be ready and less likely to do the wrong thing. Fire drills are an important way to protect people from disasters.

No one wants to experience a site outage, but worse than having an outage is being unprepared for an outage.

I have extreme anxiety over the idea of getting a notification that a client site is down from a monitoring service or, worse, from the client. Despite the fear associated with an outage, we need to consider how we can best approach it.

We can do the same thing for site outages that we did to prepare for fires in school as children. We do a fire drill to protect our clients should disaster strike.

Planning and running a fire drill for your website

What are the steps to addressing an outage? How will you know what to do when it happens? Running drills makes it more likely that the team will do the right thing when disaster strikes. Here are a few things we do for fire drills:

  1. Create an outage plan if you do not already have one. At Reaktiv, we have documented steps to follow during an outage. This helps ensure the entire team pulls in the same direction and moves us toward rapid resolution. These steps include things like checking error logs, hosting status, and deploy history. The goal is to get the site up as quickly as possible, so the best initial action might be to revert a deploy and figure out what happened once the site is back up.
  2. Plan the outage drill. What is going to happen? Will the site be out because of a plugin update? Will the site be down because of a code change? Will the client push a content change that breaks the home page? In addition to outlining the disaster scenario, we also need to assign team members to the roles of the client and the host. With the disaster scenario planned, we can schedule the drill.
  3. Run the drill. Notify the team in the same ways they might be notified in a real emergency. Post to Slack, send an email, and otherwise bring the team in for the drill.
  4. Have the team work on the problem. This is the most important part. The team should follow the steps in our outage plan to work on the problem and identify a solution. Even if this is more of a thought exercise, let them think it through and follow the plan.
  5. Have an after-action summary. Go over how we did and what might be done better the next time.
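
If it helps to make the outage plan concrete, the checklist can be tracked in a small script during a drill. The following is only a minimal sketch in Python, not our actual tooling: the step names are loosely based on the plan described above, and every class and function name (ChecklistStep, DrillLog, new_drill) is hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class ChecklistStep:
    """One step of a hypothetical outage plan."""
    name: str
    done: bool = False
    note: str = ""
    completed_at: Optional[datetime] = None


@dataclass
class DrillLog:
    """Tracks progress through the plan during a drill (or a real outage)."""
    scenario: str
    steps: List[ChecklistStep] = field(default_factory=list)

    def complete(self, name: str, note: str = "") -> None:
        """Mark a step as done and record when it happened."""
        for step in self.steps:
            if step.name == name:
                step.done = True
                step.note = note
                step.completed_at = datetime.now(timezone.utc)
                return
        raise KeyError(f"No step named {name!r} in the plan")

    def remaining(self) -> List[str]:
        """Names of the steps nobody has completed yet."""
        return [s.name for s in self.steps if not s.done]

    def summary(self) -> str:
        """Plain-text recap that can be pasted into the after-action notes."""
        lines = [f"Scenario: {self.scenario}"]
        for s in self.steps:
            mark = "x" if s.done else " "
            suffix = f" - {s.note}" if s.note else ""
            lines.append(f"[{mark}] {s.name}{suffix}")
        return "\n".join(lines)


def new_drill(scenario: str) -> DrillLog:
    # Illustrative step names only; replace them with your own plan.
    names = [
        "Notify the client",
        "Check error logs",
        "Check hosting status",
        "Review deploy history",
        "Revert last deploy if needed",
    ]
    return DrillLog(scenario, [ChecklistStep(n) for n in names])
```

During a drill, someone could call new_drill("Plugin update breaks the home page"), mark steps with complete() as the team works through them, and paste the output of summary() into the after-action summary. Recording a timestamp per step is a design choice that makes it easier to see where the response slowed down.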

By running fire drills on a regular basis, the team can address real outages more quickly when they occur, and with less fear of what might happen.

Unexpected benefit

When we run a fire drill, a team member plays the part of the client while another plays the part of hosting support. By doing so, we have discovered an unexpected benefit. The team member playing the client is able to communicate with the team responding to the fire drill. But they are not in the channel where the team is working the drill or the channel where the team is communicating with “support.” We found that this level of isolation helps build empathy for the client experience – waiting for updates can be frustrating.

This is why the first priority in our disaster recovery plan is “communicate, communicate, communicate!” It is so important that the client knows what is happening.

By playing the client’s part, the team member experiences how it feels to wonder what is happening and what stage the response is at. Has the team assembled? Do they have a solution? Will this be fixed in the next five minutes, or will it take days? This insight can help our team become better partners with our clients.

Always improving

The other benefit of running a fire drill is finding areas for improvement. We have a great plan, but doing drills lets us objectively review not just the outage, but the way we handle outages. In previous drills, we found that our internal communication tools could run into technical issues, so we need backup communication plans. We also found that the team does a great job of taking on various roles, but it may benefit from swapping roles at times to get fresh eyes. Additionally, having team members who aren’t on a project available to assist during an outage ensures all roles can be covered.

By running drills, our team is able to learn and grow without waiting for a disaster. This is one of the things that makes Reaktiv a great partner for our clients.
