Set up redundant backup systems, and test those systems regularly. Ideally, that would be the end of this chapter, but it’s not enough. It’s not uncommon for companies, even well-established companies, to drop the ball with backups. In some ways, it’s understandable. It’s the type of abstract problem that humans simply aren’t good at prioritizing because there are no immediate or guaranteed consequences for not dealing with it. As a result, the process doesn’t always get the attention it deserves.
I personally made this mistake with Sifter in the early days. I lost about eight hours of data because I only had nightly backups in place. Even though we were able to successfully restore from backups, those backups were incomplete. The bookmarking service Magnolia also suffered issues with backups that led to its demise. More recently, GitLab experienced multiple failures with its backup process that led to lost data. There are plenty of stories, but the key takeaway is that this isn’t a theoretical problem. Make sure your application isn’t another casualty of faulty backups.
Set up automated backups. However you implement them, ensure they run automatically and don’t rely on manual input. Backups must be fully automatic. Period. You also need to design the process so that if it fails, it fails loudly. Emails. Text messages. Phone calls. Backup failures are as critical as actual downtime.
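As a rough illustration of failing loudly, here is a minimal Python sketch of a wrapper that a scheduler could call. The names `run_backup` and `alert_stderr` are hypothetical, and the alert callback is a stand-in for whatever email, text message, or paging service you actually use.

```python
import subprocess
import sys
from typing import Callable, Sequence


def alert_stderr(message: str) -> None:
    """Placeholder alert channel; swap in email, SMS, or a paging service."""
    print(f"BACKUP ALERT: {message}", file=sys.stderr)


def run_backup(
    command: Sequence[str],
    alert: Callable[[str], None] = alert_stderr,
    timeout_seconds: int = 3600,
) -> bool:
    """Run a backup command and alert on every kind of failure.

    Returns True only when the command exits with status 0.
    """
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_seconds
        )
    except subprocess.TimeoutExpired:
        alert(f"Backup timed out after {timeout_seconds}s: {' '.join(command)}")
        return False
    except OSError as exc:
        # Raised when the backup tool is missing or cannot be executed.
        alert(f"Backup could not start: {exc}")
        return False

    if result.returncode != 0:
        alert(f"Backup exited with status {result.returncode}: {result.stderr.strip()}")
        return False
    return True
```

A wrapper like this only catches backups that run and fail. A backup that never starts is just as dangerous, so it is worth pairing it with a separate check that alerts when no successful backup has been recorded recently.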
Beyond having backups, there are two further key points to remember: backup frequency and backup location. For frequency, maintain a mix of hourly, daily, weekly, and monthly backups, keeping several of each. You’ll also want a replica database.
Think of these as your lines of defense. If something goes wrong with your database server, your replica database is your first line of defense. It’s going to be the most current, and it’s already loaded and connected to your production environment. However, because the replica database just mirrors your production database, some categories of problems, such as accidentally deleted or corrupted data, will propagate from the primary to the replica. In those situations, you’ll have to go to your regular backups.
Once your replica database has issues and you’ve had to turn to your regular backups, there’s a good chance that some of those regular backups have been corrupted as well. If you noticed the problem and caught it in under an hour, you might be all right. But if it’s been a day or two, you may have to go further back than you’d like. In this case, you will have lost some data. Remember, losing some data is a lot better than losing all of the data.
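One way to keep a mix of hourly, daily, weekly, and monthly backups is a retention rule that keeps the newest backup in each recent time bucket. The sketch below is only an illustration: the function name `select_backups_to_keep` and the default counts are assumptions, not recommendations from this chapter.

```python
from datetime import datetime
from typing import Iterable, Set


def select_backups_to_keep(
    timestamps: Iterable[datetime],
    hourly: int = 24,
    daily: int = 7,
    weekly: int = 4,
    monthly: int = 12,
) -> Set[datetime]:
    """Keep the newest backup from each of the most recent N hours, days,
    ISO weeks, and months. Everything else is a candidate for deletion."""
    bucket_rules = [
        (hourly, lambda t: (t.year, t.month, t.day, t.hour)),
        (daily, lambda t: (t.year, t.month, t.day)),
        (weekly, lambda t: tuple(t.isocalendar()[:2])),  # (ISO year, ISO week)
        (monthly, lambda t: (t.year, t.month)),
    ]
    newest_first = sorted(timestamps, reverse=True)
    keep: Set[datetime] = set()
    for limit, bucket_of in bucket_rules:
        seen = set()
        for ts in newest_first:
            bucket = bucket_of(ts)
            if bucket in seen:
                continue  # a newer backup already represents this bucket
            if len(seen) >= limit:
                break  # enough buckets kept for this rule
            seen.add(bucket)
            keep.add(ts)
    return keep
```

The older tiers are what let you reach back past a problem that went unnoticed for a day or two: the hourly backups may all be affected, while a daily or weekly one still predates the damage.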
That covers backup frequency at a high level, but we still need to talk about location. This is a bit simpler: never put all of your backups in one place. Disasters happen. Whether natural or business disasters, it’s entirely possible for your primary data center to go completely offline and be unreachable. If your backups are there, you’re out of luck. If your backups are off site at another location, though, you can rebuild.
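To make the location rule concrete, here is a small Python sketch that refuses to run with fewer than two destinations and verifies each copy by checksum. It is a simplification: local directories stand in for separate sites, and `replicate_backup` is a hypothetical name. In practice the destinations would be different data centers or storage providers.

```python
import hashlib
import shutil
from pathlib import Path
from typing import List, Sequence


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large backups do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def replicate_backup(backup_file: Path, destinations: Sequence[Path]) -> List[Path]:
    """Copy a backup to every destination and verify each copy."""
    if len(destinations) < 2:
        raise ValueError("refusing to store backups in only one location")
    expected = sha256_of(backup_file)
    copies = []
    for destination in destinations:
        destination.mkdir(parents=True, exist_ok=True)
        target = destination / backup_file.name
        shutil.copy2(backup_file, target)
        # Verify the copy so a truncated transfer is caught immediately.
        if sha256_of(target) != expected:
            raise IOError(f"checksum mismatch for {target}")
        copies.append(target)
    return copies
```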
Another critical aspect of those backup snapshots is encryption. Whenever you create snapshots of your database, and especially when you save those snapshots outside of your production environment, ensure they’re encrypted. This will create an extra layer of security, but it also means you’ll need to securely store the secret to decrypt those backups if you ever need them.
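As one possible approach, the sketch below encrypts a snapshot with GnuPG's symmetric mode before it leaves the production environment. It assumes `gpg` (version 2.1 or later) is installed and that the passphrase lives in a file managed by your secrets tooling. The function names and paths are hypothetical.

```python
import subprocess
from pathlib import Path
from typing import List


def gpg_encrypt_command(snapshot: Path, passphrase_file: Path) -> List[str]:
    """Build the gpg command that encrypts `snapshot` to `snapshot`.gpg."""
    return [
        "gpg", "--batch", "--yes",
        "--pinentry-mode", "loopback",  # lets gpg read the passphrase non-interactively
        "--passphrase-file", str(passphrase_file),
        "--symmetric", "--cipher-algo", "AES256",
        "--output", str(snapshot) + ".gpg",
        str(snapshot),
    ]


def encrypt_snapshot(snapshot: Path, passphrase_file: Path) -> Path:
    """Encrypt the snapshot, then remove the plaintext copy.

    check=True raises CalledProcessError if gpg fails, so the plaintext
    file is only deleted after a successful encryption.
    """
    subprocess.run(gpg_encrypt_command(snapshot, passphrase_file), check=True)
    snapshot.unlink()
    return Path(str(snapshot) + ".gpg")
```

Keep the passphrase somewhere that survives the loss of your production environment. An encrypted backup is useless if the only copy of the secret disappeared along with the servers.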
In addition to automatically creating the backups, you need to automatically test the backups and ensure they’re working. There are few things more insidious than backups being silently incorrect. If the process runs without errors but the data is incorrect, it’s a tragedy waiting to happen. The best form of this is automatically loading the backups somewhere and ensuring that the data is accurate and still changing from one backup to the next.
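A restore test along these lines might look like the following Python sketch. It uses SQLite purely as a stand-in because it ships with Python, and the same idea applies to any database. The function name `verify_backup` is hypothetical, and the sketch assumes the checked table stores timestamps as ISO 8601 strings with a UTC offset.

```python
import sqlite3
from datetime import datetime, timedelta, timezone
from pathlib import Path
from typing import List, Optional


def verify_backup(
    backup_path: str,
    table: str,
    timestamp_column: str,
    max_age: timedelta,
    now: Optional[datetime] = None,
) -> List[str]:
    """Load a backup into a scratch database and return any problems found.

    An empty list means the backup loaded, passed the integrity check,
    and contains recent data. `table` and `timestamp_column` must come
    from trusted configuration because they are placed into the SQL text.
    """
    now = now or datetime.now(timezone.utc)
    uri = Path(backup_path).resolve().as_uri() + "?mode=ro"
    try:
        # Read-only mode, so a missing file is an error instead of a new empty database.
        source = sqlite3.connect(uri, uri=True)
    except sqlite3.Error as exc:
        return [f"could not open backup: {exc}"]

    problems: List[str] = []
    scratch = sqlite3.connect(":memory:")  # never restore into production
    try:
        source.backup(scratch)
        status = scratch.execute("PRAGMA integrity_check").fetchone()[0]
        if status != "ok":
            problems.append(f"integrity check failed: {status}")
        count, newest = scratch.execute(
            f'SELECT COUNT(*), MAX("{timestamp_column}") FROM "{table}"'
        ).fetchone()
        if count == 0:
            problems.append(f"{table} is empty")
        elif newest is None or now - datetime.fromisoformat(newest) > max_age:
            problems.append(f"newest row in {table} is {newest}; backup looks stale")
    except sqlite3.Error as exc:
        problems.append(f"could not load backup: {exc}")
    finally:
        source.close()
        scratch.close()
    return problems
```

The freshness check is what catches a backup job that keeps succeeding while silently saving the same old data. Any non-empty result should go through the same loud alerting as a failed backup.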
The final step to setting up a solid backup process is documenting it. If it’s not well documented, it’s not complete. If the time comes when you need your backups, you’re going to have more important things to focus on than remembering how your backup and restore processes work. You need to be confident that detailed instructions are available so you don’t have to waste time figuring out how to get back on track.
Creating backups isn’t enough. You have to diversify in both frequency and location to have full coverage in an emergency. And you must keep those backups secure with a well-documented recovery process. This is some of the most boring work you can do, but it’s absolutely critical.
I advise you to publish a sanitized version of your backup and restore process on a publicly available security page on your website. If you’re not proud enough of the effort you’ve invested in the process to make it public, you’re not doing enough.
Related Reading
Automatically test your database backups: Marco Arment details his process for automatically testing database backups and ensuring that he pays attention to whether or not they’re still working.