In an effort to expand the capacity of a file server supporting hundreds of users, healthcare IT manager Neil Smith plugged an external 14-drive enclosure into the external port of his RAID controller (the controller had a SCSI channel that was already in use internally as part of a drive array spanning multiple channels).
Unfortunately, when he added the enclosure, the original RAID configuration was lost, and with it more than 400,000 files (about 250 GB of data). The IT team attempted a rebuild, but when it had not completed after running through the night, they discovered that it was overwriting the original array with a new one that included all the drives in the server as well as those in the enclosure.
Fortunately, a data recovery company was able to connect remotely and restore more than 99% of the 400,000+ files from the reconfigured and overwritten RAID set. Still, the incident illustrates a central paradox of data storage: as the complexity and sophistication of storage increase, so too does the rate of hardware, software and operator failures.
In fact, even with all the advancements in storage technology, only about 20% of backup jobs are successful (according to Enterprise Strategy Group).
Each year, hundreds of new data storage products and technologies meant to make the job faster and easier are introduced. With so many categories and options to consider, however, the complexity of storage causes confusion instead, which ultimately leads to lost time and to the loss of the very data those enhancements are meant to protect.
Hence the question for most IT professionals who have invested hundreds of thousands of dollars in state-of-the-art storage technology remains, “How can data loss still happen and what am I supposed to do about it?”
Why Backups Still Fail
In a perfect world, a company would build their storage infrastructure from scratch using any of the new storage solutions and standardize on certain vendors or options. If everything remained unchanged, some incredibly powerful, rock-solid results could be achieved.
However, in the real world, storage is messy. Nothing remains constant: newly created data is added at an unrelenting pace, while new regulations, such as Sarbanes-Oxley, mandate changes in data retention procedures. Since companies can rarely justify starting over from scratch, most add storage in incremental stages, introducing new elements from different vendors at different times. That is where the complexity of storage comes from.
All this complexity can lead to a variety of backup failures that catch companies unprepared to deal with the ramifications of data loss. One cause is bad media. If a company leaves its backup tapes sitting on a shelf for years, the tapes can become damaged and unreadable, which is a common occurrence when tapes are not stored properly. A second cause is losing track of the software with which the backups were created. For a restore to succeed, most software packages require that the exact original environment still be available. A third cause is a breakdown in the backup process itself. Companies often change their data footprint without updating their backup procedure to match, so they are not backing up what they think they are. Without regular testing, any of these problems is likely to go unnoticed until a restore is needed.
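One way to make "regular testing" concrete is to restore a backup to a scratch location and compare it, file by file, against the live data. The following is a minimal sketch of that idea, not a feature of any particular backup product. The function names `build_manifest` and `compare_manifests` are hypothetical, and the sketch assumes that both the source tree and the test restore are reachable as ordinary directories.

```python
import hashlib
from pathlib import Path


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256()
            with path.open("rb") as handle:
                # Read in 1 MiB chunks so large files do not fill memory.
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    digest.update(chunk)
            manifest[path.relative_to(root).as_posix()] = digest.hexdigest()
    return manifest


def compare_manifests(source, restored):
    """Report source files that are absent from, or differ in, the restore."""
    missing = sorted(set(source) - set(restored))
    changed = sorted(
        name for name in source
        if name in restored and source[name] != restored[name]
    )
    return {"missing": missing, "changed": changed}
```

A non-empty "missing" list is the symptom described above: the data footprint has grown, but the backup job has not kept up. Files that legitimately changed after the backup ran will also appear under "changed", so the comparison is most meaningful against a snapshot taken at backup time.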
What to Do When Your Backup Fails
No matter how much a company tries to speed up operations and guard against problems with new products and technology, the threat of data loss remains, and backup and storage techniques do not always provide the necessary recovery. When an hour of downtime can result in millions of dollars lost, including data recovery in your overall disaster plan is critical, and it may be the only way to restore business continuity quickly and efficiently. When data loss occurs, time is the most critical factor. Decisions about the most prudent course of action must be made quickly, which is why administrators must understand when to repair, when to restore and when to recover data.
When to Repair
File system repair tools, such as fsck or CHKDSK, attempt to repair broken links in the file system using very specific knowledge of how that file system is supposed to look. The first step is simply to run such a tool in read-only mode, since running the actual repair on a system with many errors could overwrite data and make the problem worse. Depending on the results of the read-only diagnosis, the administrator can make an informed decision to repair or recover. If the check finds only a limited number of errors, it is probably fine to go ahead and fix them, as the repair tool will yield good results.
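As an illustration of the read-only first pass, the sketch below builds the diagnostic command for each tool without running it. The helper name `read_only_check_command` is hypothetical. The sketch relies on two behaviors: CHKDSK only reports problems when run without /F or /R, and fsck passes -n through to the file-system-specific checker (e2fsck, for example, then opens the file system read-only and answers "no" to every repair prompt). Not every checker supports -n, so confirm against the documentation for the file system in question.

```python
import platform


def read_only_check_command(target, system=None):
    """Return the argument list for a diagnostic-only file system check.

    `target` is a device path (such as /dev/sdb1) or a Windows volume
    (such as D:). `system` defaults to the current platform name.
    """
    system = system or platform.system()
    if system == "Windows":
        # Without /F or /R, CHKDSK reports errors but does not fix them.
        return ["chkdsk", target]
    # -n is handed to the underlying checker; e2fsck treats it as
    # "open read-only and answer no to all questions".
    return ["fsck", "-n", target]
```

The returned list can be passed to `subprocess.run(command, capture_output=True, text=True)` so that the output can be reviewed before anyone decides to run a real repair. On Unix-like systems, the check should be run against an unmounted file system, since results from a mounted one are unreliable.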
Note: if your hard drive makes strange noises at any point, immediately skip to the recovery option.
When to Restore
The first question an administrator should ask is how fresh the last backup is, and whether a restore will get the company to the point where it can effectively continue normal operations. There is a significant difference between data from the last backup and data from the point of failure, so it is important to make that distinction right away. Only a recovery can help if critical data has never been backed up. Another important question is how long the restore will take to complete. If it will take too long, the administrator might need to look at other options. A final consideration is how much data needs to be restored. Restoring several terabytes of data from tape backups, for example, will take a long time.
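The "how long will it take" question can be roughed out with simple arithmetic: data volume divided by sustained restore throughput. The sketch below is only an estimate under stated assumptions. The function names are hypothetical, the throughput figure must come from a measured test restore in your own environment, and decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) are assumed.

```python
def estimate_restore_hours(data_tb, throughput_mb_per_s):
    """Estimate restore duration in hours from volume and throughput."""
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput must be positive")
    total_bytes = data_tb * 1e12
    bytes_per_second = throughput_mb_per_s * 1e6
    return total_bytes / bytes_per_second / 3600


def restore_fits_window(data_tb, throughput_mb_per_s, max_downtime_hours):
    """True if the estimated restore finishes within the allowed downtime."""
    return estimate_restore_hours(data_tb, throughput_mb_per_s) <= max_downtime_hours
```

For example, with an illustrative 5 TB of data and an assumed sustained rate of 100 MB/s, the estimate is about 13.9 hours, which would not fit an 8-hour downtime window. Real restores also spend time on tape mounts, cataloging and verification, so the estimate is best treated as a lower bound.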
When to Recover
The decision to recover comes down to whether or not a company’s data loss situation is critical and how much downtime it can afford. If there is not enough time to run the restore process, it is probably best to move forward with recovery. Recovery is also the best method if backups turn out to be too old or are corrupted in some way. The bottom line is that if other options have been attempted and have failed, it is best to contact a recovery company immediately. Some administrators attempt multiple restores or repairs before turning to recovery, and in doing so actually cause more damage to the data.
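The repair, restore and recover criteria above can be condensed into a small triage routine. This is a sketch of one possible ordering, not a prescribed procedure. The `Incident` fields, the `choose_action` name and the default `error_limit` of 10 are all assumptions made for illustration, and each organization would set its own threshold for what counts as a "limited" number of errors.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    drive_makes_noise: bool      # strange noises suggest physical damage
    errors_found: int            # count reported by a read-only check
    backup_usable: bool          # recent enough, readable, covers the lost data
    restore_hours: float         # estimated time to complete a restore
    max_downtime_hours: float    # downtime the business can afford


def choose_action(incident, error_limit=10):
    """Return 'repair', 'restore' or 'recover' for a data loss incident."""
    if incident.drive_makes_noise:
        # Physical symptoms: skip straight to recovery.
        return "recover"
    if 0 < incident.errors_found <= error_limit:
        # A limited number of file system errors: repair is likely safe.
        return "repair"
    if incident.backup_usable and incident.restore_hours <= incident.max_downtime_hours:
        return "restore"
    # No usable backup, or the restore would take too long.
    return "recover"
```

The routine deliberately returns a single answer rather than a list of things to try in sequence, reflecting the warning above that repeated repair or restore attempts can cause further damage.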
Even with best practices in place, one thing is clear: no matter how much time and money a company spends planning, creating and maintaining its storage environment, the complexity of storage means the threat of data loss remains. In the end, the only answer to the question of how data loss can still happen and what to do about it is to ensure that data recovery is included in your plan.