Preventing Data Loss During RAID and Drive Rebuilds – Part 2

Monday, April 22, 2013 by David Logue

This is a follow-up to the original article on RAID vs. Drive Rebuilds. Since the original posting, several questions have come up about RAID and drive rebuilds and about preventing data loss. Part 2 attempts to address those questions.

One of our readers noted:  “In your first example, parity is missing from stripe 4. You didn’t mention how that stripe can get rebuilt if there’s no parity.”

Degraded RAID 5 Data Loss

The parity is missing from stripe 4 because it belongs on the missing or damaged drive. In other words, in a healthy array it would be at the top of stripe 4. As to how it is rebuilt: in this example all of the data is intact, so the parity sector on HDD 1 in stripe 4 would be rebuilt by XORing the data from drives 2-4, that is, P4 = XOR(D9, D8, D7). See below for a picture of the rebuilt drive.

RAID 5 Rebuilt Drive
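As a rough illustration of the XOR step described above, here is a minimal Python sketch. The helper name `xor_blocks` and the 4-byte sample sectors are made up for the example; real sectors are 512 bytes or larger, but the arithmetic is the same.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length byte blocks together, byte by byte."""
    if not blocks:
        raise ValueError("need at least one block")
    length = len(blocks[0])
    if any(len(b) != length for b in blocks):
        raise ValueError("all blocks must be the same length")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Hypothetical 4-byte sectors standing in for D7, D8 and D9 in stripe 4.
d7 = bytes([0x11, 0x22, 0x33, 0x44])
d8 = bytes([0xA0, 0xB0, 0xC0, 0xD0])
d9 = bytes([0x0F, 0x0E, 0x0D, 0x0C])

# Rebuild the parity sector that belongs on the replaced drive.
p4 = xor_blocks(d9, d8, d7)

# The same operation recovers any single missing data sector
# from the parity plus the surviving data sectors.
recovered_d8 = xor_blocks(p4, d7, d9)
```

Because XOR is its own inverse, the one routine serves both purposes: it produces parity from data, and it produces a missing data sector from parity and the remaining data.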

Another question that came up multiple times since the original post relates to other ways drive or RAID data loss or data damage can occur.

One of our readers asked:  “Your second example shows how the data can be lost if the wrong type of rebuild is done, “such as” rebuild parity. Is that the only case? “Such as” kind of implies you could do other rebuilds that would get you in trouble.”

Rebuilds that can cause data loss

There are several types of rebuilds that can happen where data can be lost. Below is a list of some of the types of rebuilds that can cause data loss.

1.  Rebuild parity with zeroed drive (parity overwritten)

2.  Rebuild parity with degraded drive (forced online and parity overwritten)

3.  Rebuild parity with drives out of order (parity and data overwritten)

4.  Rebuild RAID with missing drive (parity and data overwritten)

5.  Rebuild RAID with different stripe size (parity and data overwritten)

6.  Rebuild RAID with different configuration (parity and data overwritten)

As an example, one of the most common RAID data loss cases we see is when parity is updated with a zeroed disk in the RAID configuration (RAID rebuild instead of HDD rebuild). This type of rebuild effectively destroys the original parity and prevents a drive rebuild.  Once the parity is overwritten, the missing user data from the damaged or missing HDD cannot be recreated.
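The difference between the two rebuild types can be sketched in a few lines of Python. This is a simplified single-stripe model with invented 4-byte sectors and a hypothetical `xor_blocks` helper, not a description of how any particular controller behaves.

```python
def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length byte blocks together, byte by byte."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


# One stripe of a healthy array: three data sectors plus parity.
d1, d2, d3 = b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"
parity = xor_blocks(d1, d2, d3)

# The drive holding d2 fails and is replaced with a zeroed disk.
zeroed = bytes(4)

# Drive (HDD) rebuild: recreate the missing data from the original parity.
drive_rebuild = xor_blocks(d1, d3, parity)

# Parity (RAID) rebuild: recompute parity, treating the zeroed disk as valid data.
new_parity = xor_blocks(d1, zeroed, d3)

# Any later attempt to recreate d2 from the new parity only yields zeros.
after_parity_rebuild = xor_blocks(d1, d3, new_parity)
```

In this model `drive_rebuild` equals the original `d2`, while `after_parity_rebuild` is all zeros: the new parity is perfectly consistent with the zeroed disk, and the information needed to recreate the lost data is gone.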

Another scenario where data could be lost is a disordered RAID array, especially during a RAID rebuild. Parity rebuilds on drives that are out of order can end up overwriting good user data.

RAID 5 Disordered Array

In the example above, the data that was originally on HDD 3 in stripe 1 is now overwritten with new parity. The parity on HDD 4 in stripe 1 is now treated as user data instead of parity, causing logical corruption. Furthermore, the data on HDD 2 in stripe 1 is skewed, which also contributes to the logical volume corruption. All of the areas marked in red would be damaged.

Even if a parity rebuild is not done, there would still be logical volume corruption. This logical corruption often triggers volume repair tools to run (CHKDSK, FSCK, etc.). These repair utilities will try to “fix” the logical corruption when the damage is really at the RAID level, causing even more damage such as deleting metadata and making the system unrecoverable.

Another scenario is where a RAID is rebuilt after a two-drive failure using a new drive and a degraded drive that has been forced online. A rebuild with this combination will overwrite the “good” parity with new “bad” parity, often making the system unrecoverable or the data unusable.

The final example is where the RAID configuration changes and the parity and data areas are overwritten under the new configuration.

Let’s assume for this example that we have a RAID 5 array with a stripe size of 64K. The OS will read the data from the stripes starting with HDD1 and the data represented by M1. Then, it will proceed to M2 and then to D1 and so on.

RAID 5 NTFS Volume

If the array controller loses the configuration and the user forces the wrong configuration, data damage will occur. In our example, the user has forced a new configuration with a 32K stripe size, effectively splitting the data in half.

RAID 5 New Configuration

The OS will read the first half of the first section of metadata represented as M1.1. Then, the OS will jump to the next disk in the stripe and read the first half of the next section of metadata represented as M2.1. This will cause logical corruption, making the data unusable. Often this will trigger volume repair tools to run and “repair” the logical damage, which in turn can cause additional damage and even make the volume unrecoverable.
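The read order described above can be approximated with a small Python sketch. It assumes a 4-drive array (three data chunks per stripe row) and deliberately ignores parity rotation, which differs between controllers; the function names `locate` and `logical_at` are hypothetical.

```python
KIB = 1024
DATA_DISKS = 3  # assumption: 4-drive RAID 5, one chunk per row holds parity


def locate(logical_offset: int, chunk_size: int, data_disks: int = DATA_DISKS):
    """Map a logical volume offset to (data slot, byte offset on that member)."""
    chunk_index = logical_offset // chunk_size
    row = chunk_index // data_disks
    slot = chunk_index % data_disks
    member_offset = row * chunk_size + logical_offset % chunk_size
    return slot, member_offset


def logical_at(slot: int, member_offset: int, chunk_size: int,
               data_disks: int = DATA_DISKS) -> int:
    """Inverse mapping: which logical offset was written at this location."""
    row = member_offset // chunk_size
    within = member_offset % chunk_size
    return (row * data_disks + slot) * chunk_size + within


# Read the volume sequentially in 32K steps through the forced 32K
# configuration, and report which original (64K-layout) offset comes back.
seen = []
for offset in range(0, 6 * 32 * KIB, 32 * KIB):
    slot, member_offset = locate(offset, 32 * KIB)
    seen.append(logical_at(slot, member_offset, 64 * KIB) // KIB)

print(seen)  # original offsets, in KiB, in the order the OS now receives them
```

Under these assumptions the OS receives the original data in the order 0K, 64K, 128K, 32K, 96K, 160K: the first half of M1, then the first half of M2, and so on, rather than a contiguous stream.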

How to safely recover from this type of data loss

So how do you protect yourself in the event you run into a situation like this? Here are some tips on preventing this type of drive or RAID data loss:

  1. Image the drives before attempting a rebuild. That way, if the rebuild is unsuccessful, your data is protected. Make sure the imaging program you choose allows for a forensic or sector/block-level image of the disk.
  2. Restore backups to a different volume. This ensures that all important files on the backup are good before possibly overwriting data on the active volume.
  3. If there is a RAID problem, test the backup by restoring it to a different location or image each drive from the RAID before attempting a rebuild. Sometimes a RAID rebuild does not work correctly and can make the problem worse.
  4. Do not create any new files on the disk requiring recovery or continue to run applications until the important data is recovered. New files can overwrite the files that need recovery.
  5. Do not run FSCK or CHKDSK file system repair tools on a virtual disk unless a good backup has been validated by restoring it to a different volume. These repair tools assume that there is a good backup of the data and can overwrite file pointers to make a file system consistent. If desired, these tools can be run in read-only mode to find any major corruption before repairs are made.
  6. Do not delete any additional files prior to a data recovery of deleted data. Deleting files includes moving files from the source to another volume.  A move is simply a copy then delete. If you need a copy of the data from the source, make sure to copy it and not move it. Additional deleted files can complicate the data recovery.
  7. Do not try data recovery software unless you are sure it will not write anything to the disk that needs recovery. Some recovery software will attempt to write to the source disk and could damage later recovery attempts.
  8. Contact a data recovery professional before attempting the recovery on your own. A professional can outline the possible impacts your plan will have on the recoverability of the data and offer suggestions for self-recovery.
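
To make tip 1 more concrete, the following Python sketch shows the basic idea of a block-level image: read the source strictly read-only, copy it block by block, and record a checksum so the image can be verified later. It is only an illustration. The function name `image_source` and the paths in the usage comment are hypothetical, and unlike dedicated imaging tools this sketch does not handle read errors or bad sectors.

```python
import hashlib


def image_source(source_path: str, image_path: str,
                 block_size: int = 1024 * 1024):
    """Copy source_path to image_path block by block; return (bytes, sha256)."""
    digest = hashlib.sha256()
    total = 0
    # "rb" opens the source read-only; "xb" refuses to overwrite an existing image.
    with open(source_path, "rb") as src, open(image_path, "xb") as dst:
        while True:
            block = src.read(block_size)
            if not block:
                break
            dst.write(block)
            digest.update(block)
            total += len(block)
    return total, digest.hexdigest()


# Hypothetical usage (device and destination paths are placeholders;
# reading a raw device usually requires administrator privileges):
#   size, checksum = image_source("/dev/sdX", "/mnt/images/hdd1.img")
```

The image should be written to a different physical disk than any of the array members, and each member drive should be imaged separately before any rebuild is attempted.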