Securing Your Vault for When Disaster Strikes Series
Welcome to the first post in a series on securing your Vault for when disaster strikes. This post covers some worst-case scenarios that have occurred over the years.
I’d like to make a few statements that most System Administrators will hopefully already know:
- Backing up to the same drive used for Production …is most definitely not a disaster recovery solution!
- Backing up to the same RAID used for Production …is not a disaster recovery solution!
- RAID …is not a disaster recovery solution!
- Having a single copy …is not a disaster recovery solution!
- If you do not have an offline copy …you do not have a disaster recovery solution!
- If you do not have an offsite copy …you do not have a disaster recovery solution!
- If you have never successfully restored from your backup …you do not have a disaster recovery solution!
These statements are well known throughout System Administrator land, as they do not apply only to Vault. However, Vault is often implemented and maintained at smaller companies that might not have dedicated System Administrators and are unaware of why these rules exist.
Below I will describe some unfortunate scenarios in which customers have lost time and data due to disaster.
The Power Went Out
This case began early one Monday morning as the worst cases often do. The client had a weekly backup that started each Friday night. When they would check the status on Monday mornings the Task Scheduler would typically show a successful backup. Until this particular Monday morning, that is.
Unfortunately, bad weekend storms had blown through, and the computer acting as their Vault server was a repurposed desktop that was not plugged into an uninterruptible power supply (UPS).
As the power went out, the drive heads ended up crashing onto the platters.
Unfortunately, the client in question was backing up onto the same drive as their production Vault. The customer lost the entire contents of their Vault. They got lucky in that a few users had never cleaned out their local workspaces. They ended up losing all history and metadata, but for the most part they were able to recover 80%-90% of the latest versions of their Vault content.
So, not only did they lose content, but they lost a few days’ worth of time, and incurred the cost to merge all the workspaces back into one and reload into a new Vault.
But I Have RAID!
For those not aware, Redundant Array of Independent Disks, or RAID, is a method used, mostly on servers, to allow for fault tolerance (by mirroring/parity checks), speed improvements (by striping), and storage space increase (by spanning) across multiple hard drives.
Unfortunately, we’ve had a few customers who thought that having a fault-tolerant RAID was good enough from a backup standpoint. Let me be clear: it’s not.
In this case the customer had a RAID 5 configuration. This configuration can only recover from one failed hard drive at a time. The customer had failed to realize that something was wrong with their RAID. They did mention after the fact that things had been running slow (the RAID was running in degraded mode due to the failed disk), but they thought it was network related.
Before they determined the slowness was due to one of the drives having failed, a second drive failed.
There is no recovering the dataset in this case.
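To see why a second failure is fatal, here is a minimal sketch of the parity idea behind RAID 5. It is a simplification under stated assumptions: a single stripe with three data blocks and one parity block, whereas a real controller rotates parity across the drives and works on much larger blocks. The function names and sample data are purely illustrative.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(stripe):
    """Rebuild the single missing block (marked None) in a stripe.

    The stripe holds the data blocks plus one parity block. XOR-ing all
    surviving blocks yields the missing one, but only if exactly one is gone.
    """
    missing = [i for i, block in enumerate(stripe) if block is None]
    if len(missing) != 1:
        raise ValueError(f"cannot rebuild: {len(missing)} blocks missing")
    survivors = [block for block in stripe if block is not None]
    return xor_blocks(survivors)


# Hypothetical stripe: three data blocks plus the parity computed from them.
data = [b"VAULT", b"FILES", b"STORE"]
stripe = data + [xor_blocks(data)]
```

With one block set to `None`, `rebuild_missing` returns the lost data. With two blocks set to `None`, it raises an error, because a single parity block does not carry enough information to reconstruct two unknowns.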
Again, the customer was able to recover some of their most recent versions of content from various users’ local workspaces. However, they lost all history and metadata.
What’s Bitcoin?
In the last few years we have seen a significant increase in crypto ransomware spreading around the world. The larger companies are usually safe as they have good practices in place. The smaller companies are usually the ones hit hardest by these wretched things.
We had a customer who had done an okay job of planning for issues: they had implemented RAID 10 and had configured their ADMS Console to back up to a second computer.
Unfortunately, they hadn’t backed up that second computer to an offline copy. So, when their systems got infected, not only did their production system get encrypted, but also their backup.
Bitcoin is not cheap! At the time this post was being written, 1 Bitcoin = $6,738 US.
3-2-1 NOOO!
One of the most widely used disaster recovery strategies is the 3-2-1 strategy. This means having 3 total copies of your data, 2 local copies on different media, and at least 1 copy offsite.
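The rule is simple enough to write down as a checklist. The following is only a sketch: the `BackupCopy` record and the `check_3_2_1` function are hypothetical names made up for illustration, and the medium labels are whatever you choose to call your storage types.

```python
from dataclasses import dataclass


@dataclass
class BackupCopy:
    name: str
    medium: str    # free-form label, e.g. "disk", "tape", "cloud"
    offsite: bool  # True if the copy lives away from the production site


def check_3_2_1(copies):
    """Return a list of 3-2-1 violations; an empty list means the rule is met.

    Production data counts as one of the three copies.
    """
    problems = []
    if len(copies) < 3:
        problems.append(f"only {len(copies)} copies, need at least 3")
    if len({copy.medium for copy in copies}) < 2:
        problems.append("all copies are on the same type of medium")
    if not any(copy.offsite for copy in copies):
        problems.append("no offsite copy")
    return problems
```

Feeding it a production server plus one backup on a second computer in the same office, both labeled as disk, reports all three violations. Note that passing this check only says the copies exist; it says nothing about whether they can actually be restored.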
The only issue with 3-2-1 is that we often grow complacent, assuming the strategy is working because we see the data move. We are not always aware of the subtle problems that can creep in.
We had a case where the initial backup was most likely okay (though that is not guaranteed). It was then copied to a second backup, and that copy was uploaded to a cloud backup.
The only problem, which was found while trying to recover from an actual disaster, was that somewhere in the process data was being lost. It was minor and due to a bad Network Interface Card, but it was enough to cause major pains when the Vault was restored.
Data rot had crept in, causing many files to lose just enough information to become unusable or very problematic to use.
Had the customer attempted to restore from their various backup solutions prior to the need to restore for real, they would have found the issue ahead of time as Vault uses checksums to validate files after a restore.
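A test restore is the real check, but corruption between backup hops can also be caught earlier by comparing checksums of each copy against its source. The sketch below is a generic illustration and is not how Vault itself validates files; the function names are hypothetical, and it assumes both copies are reachable as ordinary directory trees.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(source_dir, copy_dir):
    """List relative paths that are missing from copy_dir or differ in content."""
    source_dir, copy_dir = Path(source_dir), Path(copy_dir)
    problems = []
    for src in sorted(source_dir.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source_dir)
        dst = copy_dir / rel
        # A file is a problem if it never arrived or arrived with different bytes.
        if not dst.is_file() or sha256_of(src) != sha256_of(dst):
            problems.append(rel.as_posix())
    return problems
```

Running `find_mismatches` after each copy step, and treating any non-empty result as a failed backup, would flag a faulty link in the chain long before a restore is needed.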
The Takeaway
Hopefully, after hearing about these unfortunate occurrences, you now fully understand the importance of not only planning for disaster recovery but also testing that plan periodically.
The following posts in this series will continue to cover various aspects of disaster recovery with Vault.