Data Loss in Australian Virtual Machines is Very Real (1)

Monday, 2 June 2014 by Marcel Mascunan

I don't want to alarm anyone, but last year, research by Ontrack showed that 40 per cent of companies with virtual environments have experienced data loss. I have no doubt this news will come as a surprise to many readers because the same research found that 80 per cent of respondents didn't believe data loss could be a risk when data is stored in a virtual environment. Big mistake.

Despite the robust nature of virtual environments, their many layers, and the use of backup and disaster recovery solutions to protect them, accidents still happen, and the causes may be human or mechanical. High among the problems we frequently deal with are failures caused by file-system corruption, deleted virtual machines, internal virtual disk corruption, RAID or other storage and server hardware failures, as well as deleted or corrupt files contained within virtualised storage systems.

The good news is that recovery is often possible. In Australia, Ontrack is uniquely equipped to perform VMware recoveries remotely. When disaster strikes, a specialist engineer can start work on recovery within the hour, and our global team is available to work 24/7 on emergency cases if the data loss affects business operations or the company is facing great financial, operational or reputational losses.

So how does recovery work? Following are four real-life case studies of VMware data recovery for Australian customers ranging from IT hosting companies to mid-size businesses.

Missing VMDK in VMware on RAID 5

In this case the RAID hardware was restarted while a hardware vendor was troubleshooting an issue on the client’s RAID. This caused an ungraceful shutdown while the VMware hosts were still running. When the RAID was brought back online, the VMDK for a critical Virtual Machine was missing. The client contacted Ontrack to start a Remote Data Recovery session.

Initial inspection showed that some areas of the VMFS file system had become corrupted, including the inode for the required VMDK file. Using proprietary tools, Ontrack engineers were able to recover and sequence the data fragments from the damaged VMDK file. Further investigation inside this file showed that some additional corruption existed inside the internal EXT (Linux) file system. Thanks to targeted repairs at the Guest File System level, the engineer was able to rebuild the internal volume and recover all available data.

Result: All critical data was recovered intact, with only some mild damage to the Directory Tree.

Accidental deletion of critical Virtual Machine

On a Friday afternoon the customer mistakenly deleted a critical Virtual Machine from a VMFS datastore. An Ontrack engineer was able to locate and sequence the fragments of data to rebuild the deleted Virtual Disk. Once rebuilt, the internal file system was inspected and found to contain no errors. The virtual disk was recovered as a bootable VMDK file, and the NTFS data was also extracted to external storage.

Result: A full recovery of data was achieved.

Failed RAID drives containing a VMware Datastore

This case required an in-lab data recovery because the eight-drive RAID 5 contained two mechanically failed drives. The RAID array contained a VMware Datastore hosting approximately 14 Virtual Machines. Prior to contacting Ontrack, the customer had attempted to rebuild the RAID with no success.

Once we received the drives in our Brisbane cleanroom, a total of nine drives were imaged to Ontrack’s servers. During the process, three of these drives were found to contain I/O errors. The data was mapped and the RAID was rebuilt. At this stage some corruption was discovered as a result of an earlier incorrect rebuild. The engineer was able to identify the best configuration, including rebuilding a degraded drive from the parity data on the remaining drives.
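
For readers curious about what "rebuilding a drive from parity" means, here is a minimal Python sketch of the general RAID 5 idea. It is an illustration only and not Ontrack's tooling: the function names and the tiny four-byte blocks are invented for the example, and a real array also rotates parity across drives and works on much larger stripes. The principle is that the parity block is the XOR of the data blocks in a stripe, so any one missing block can be recomputed from the survivors.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(stripe, missing_index):
    """Recompute the block at missing_index from every other block in the stripe."""
    survivors = [block for i, block in enumerate(stripe) if i != missing_index]
    return xor_blocks(survivors)


# Hypothetical stripe: three data blocks plus one parity block.
data = [b"VMFS", b"disk", b"data"]
parity = xor_blocks(data)
stripe = data + [parity]

# Simulate losing the second drive's block, then rebuild it from the rest.
damaged = list(stripe)
damaged[1] = None
rebuilt = rebuild_missing(damaged, 1)
print(rebuilt)  # b'disk'
```

Single parity can only recompute one missing block per stripe, which is why an array with more than one unreadable drive generally needs at least one of those drives imaged before a rebuild like this can be attempted.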

Next, some mild VMFS corruption was repaired to allow access to the Virtual Machines. In the following step, seven critical Virtual Disks were examined, several of which were found to contain light corruption. The engineers continued with File System error repairs where possible and data was extracted from all seven critical Virtual Disks, including several SQL databases. Finally, bootable VMDK files were provided for the machines which were found to contain no structure damage.

Result: Despite the complexity of the case, this was a very successful recovery, with all but a small amount of the customer’s data recovered.

Power failure results in lost VM

This unlucky customer “lost” a critical VM after a power failure. The customer connected the Datastore they believed had held the missing Virtual Machine to Ontrack's Remote Data Recovery server, but initial inspection showed that the wrong Datastore was being examined. Our engineer was able to examine some log files on the incorrect Datastore and direct the customer to the correct LUN.

Once the correct Datastore was presented, it was discovered that the critical Virtual Disk contained several snapshots that had become disassociated after the VM was powered up without the snapshots in place. As a result, several weeks of data were missing from the Virtual Disk. Ontrack quickly located the orphaned snapshots and force-merged them to the base file. Next, some mild File System damage was repaired and all resulting data was extracted, including a SQL database.
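
To show why orphaned snapshots translate into weeks of missing data, here is a simplified Python sketch. It is a toy model under stated assumptions, not VMware's on-disk format or Ontrack's method: the base disk and each snapshot are represented as hypothetical dictionaries mapping a block number to its contents, with each snapshot holding only the blocks written after it was taken. Merging applies the snapshots to the base from oldest to newest.

```python
def merge_snapshots(base, deltas):
    """Return a merged disk image: base blocks overlaid by each delta, oldest first."""
    merged = dict(base)  # copy, so the original base is left untouched
    for delta in deltas:
        merged.update(delta)  # newer writes replace older block contents
    return merged


# Hypothetical disk contents, keyed by block number.
base = {0: "boot", 1: "db-v1", 2: "logs-v1"}
snap1 = {1: "db-v2"}                  # writes made after the first snapshot
snap2 = {1: "db-v3", 2: "logs-v2"}    # writes made after the second snapshot

merged = merge_snapshots(base, [snap1, snap2])
print(merged)  # {0: 'boot', 1: 'db-v3', 2: 'logs-v2'}
```

In this model, a VM started from the base alone sees only "db-v1" and "logs-v1". Everything written since the first snapshot lives in the deltas, so the data reappears only once they are applied again in the right order.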

Result: All critical data was recovered successfully and returned to a very relieved customer.

Next week I'll take this topic further by explaining the steps you can take to reduce the likelihood of data loss in your virtualised environment.