Launching your own application may be the first time you’ve ever been responsible for your company’s mission-critical systems. It may also be your first time interacting with customers. There’s a good chance that eventually something bad will happen–and it won’t be easy to tell your customers. But with a little guidance, you can be ready for it.

We’re only human, and we all make mistakes. Hopefully your mistakes will be small and won’t significantly affect your customers, but let’s talk tactics in case your mistakes do upset your customers at some point. Whether it’s something as fleeting as downtime or as serious as losing customer data, something will go wrong, so it’s best to be prepared. This is one of the few areas where I believe premature optimization is not only acceptable but necessary.

To start with, you’ll need some sort of status website. It should be set up and ready to go before you ever need it, and it should be hosted on a server completely separate from your application–if your main server goes down, you mustn’t lose your status page too. Make sure the people in the know can post to your Twitter account or other social media to help keep your customers in the loop.
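One easy way to sanity-check the “completely separate server” requirement is to confirm that your status page and your application don’t resolve to any of the same addresses. This is a minimal sketch; the hostnames in the usage comment are hypothetical placeholders, and a shared load balancer or CDN edge can complicate the picture, so treat this as a quick smoke test rather than a guarantee.

```python
import socket

def resolve(host, port=443):
    """Resolve a hostname to the set of IP addresses it points at."""
    return {info[4][0] for info in socket.getaddrinfo(host, port)}

def shares_infrastructure(app_ips, status_ips):
    """Return True if the two address sets overlap, i.e. the status
    page would likely go down along with the application."""
    return bool(set(app_ips) & set(status_ips))

# Example usage (hypothetical hostnames–substitute your own):
#   app_ips = resolve("app.example.com")
#   status_ips = resolve("status.example.com")
#   shares_infrastructure(app_ips, status_ips)  # want this to be False
```

If the sets overlap, your status page shares a point of failure with the thing it’s supposed to report on, which defeats the purpose.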

Don’t leave your customers in the dark while you’re fixing the problem. The moment something significant goes wrong, let your customers know–you don’t have to have answers right away, but tell them you’re aware and working on it. Do this through Twitter or your status site, or even make an announcement within your application (if it’s still online). Huge corporations might be able to get by with poor communication, but you’ll be much better off if you can keep your customers informed while you’re putting things right.

Don’t overreact: even the worst mistakes rarely lead to the worst-case scenarios your imagination is flooded with. In all likelihood, you’ll have a handful of angry customers, a batch of frustrated customers, and a large majority of understanding customers, as long as you handle the situation honestly and effectively.

Take a deep breath and fix it. Once you’ve let your customers know about the problem, get to work and don’t worry about anything else. A few customers might contact you during this time because they may not have seen your announcements. If you’re on your own, don’t reply right away unless you have time. You’ll be able to reply after everything is fixed–and with far more useful information. If you can, send out a quick reply pointing them to your status updates while you work on the problem.

Once the storm has passed, be honest, clear, and precise–no matter what. There’s no benefit to even the slightest sugarcoating. If you gloss over what happened, you’ll only make things worse. I’ve read quite a few postmortems, and the only ones that go well are those that are down-to-earth and honest. The bad ones are never transparent and sincere. Explain exactly what went wrong and where you messed up. Include any relevant technical details you think your customers would want.

Don’t forget to clearly outline the steps you’re taking to prevent this kind of problem in the future. It’s OK if you don’t have a perfect answer straightaway, but you should have a plan you can share with your customers within twenty-four to forty-eight hours. Communication makes all the difference.

Anecdote: My Big Mistake with Sifter

We had begun talking about upgrading our infrastructure, and were exploring the idea of moving hosts and improving our backups while continuing to work on some improvements to the application. In the hope it might buy us some time and lessen the urgency of the looming infrastructure upgrades, I decided I’d increase the size of our virtual machine.

I chose poorly.

When I reviewed our performance the next morning, I noticed the upgraded virtual machine hadn’t made much difference, so we decided to revert to the original virtual machine. All I was thinking about was the downtime we were facing–a quick revert would just need a reboot, and we’d be down for less than a minute. Leaving the upgraded virtual machine in place and resizing downward later would mean twenty to thirty minutes of downtime. I chose the former since it would minimize downtime. But within seconds of making that decision, I realized I’d made a mistake.

When we reverted to the old copy of our virtual machine, we overwrote all the customer data that had been created on the new virtual machine overnight–about eleven hours’ worth. We managed to recover about three hours’ worth from our backups, but the remaining eight hours of lost data coincided with Europe’s peak business hours, so some of our customers were seriously affected. I’d never had a sinking feeling like that before. My initial (over)reaction was that our customers would leave in droves, and Sifter wouldn’t survive. Fortunately, that was far from the case.

We immediately went into recovery mode, and we remained transparent as we talked with our customers about our mistake, the consequences, and our plans. We issued a month’s credit to every affected customer, losing a fair amount of revenue as a result. It wasn’t the lost revenue that got to me–I was disappointed with myself. Thousands of people trust us with their data, and I had let them down. But over the next couple of days, I learned just how wonderful and understanding our customers could be. To the best of my knowledge, I don’t think anyone canceled as a direct result of the data loss, and most were very supportive.

As a result, we dramatically improved Sifter’s infrastructure and backups. Losing customer data was a painful lesson, but the bigger lesson was that by responding effectively and being transparent with customers, the fallout was nowhere near as bad as I originally expected.