Preventing Data Loss During RAID and Drive Rebuilds – Part 2

RAID data loss

This is a follow-up to the original article on RAID vs. Drive Rebuilds. Since the original posting, several questions have come up about RAID and drive rebuilds and about preventing data loss, and Part 2 attempts to address them.

One of our readers noted:  “In your first example, parity is missing from stripe 4. You didn’t mention how that stripe can get rebuilt if there’s no parity.”

Degraded RAID 5 Data Loss

The parity is missing from stripe 4 because it belongs on the missing or damaged drive. In other words, in a healthy array it would be at the top of stripe 4. As for how it is rebuilt: in this example all of the data is intact, so the parity sector on HDD 1 in stripe 4 is rebuilt by XOR’ing the data from drives 2-4 (P4 = XOR(D9, D8, D7)). See below for a picture of the rebuilt drive.

RAID 5 Rebuilt Drive
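
The XOR rebuild described above can be sketched in a few lines of Python. This is only an illustration: the 4-byte sector contents are made up, and `xor_blocks` is a hypothetical helper, not part of any controller's firmware or API.

```python
from functools import reduce

def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("all blocks must have the same length")
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))

# Hypothetical 4-byte sectors standing in for D7, D8 and D9 of stripe 4.
d7 = b"\x11\x22\x33\x44"
d8 = b"\xa0\xb0\xc0\xd0"
d9 = b"\x0f\x0e\x0d\x0c"

# Rebuild the parity sector that belongs on the replaced HDD 1.
p4 = xor_blocks(d9, d8, d7)

# The same operation recovers any single missing data sector from the rest.
recovered_d9 = xor_blocks(p4, d8, d7)
```

Because XOR is its own inverse, the same function serves both directions: it produces parity from data, and it produces a missing data sector from the parity plus the surviving data.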

Another question that came up multiple times since the original post relates to other ways drive or RAID data loss or data damage can occur.

One of our readers asked:  “Your second example shows how the data can be lost if the wrong type of rebuild is done, “such as” rebuild parity. Is that the only case? “Such as” kind of implies you could do other rebuilds that would get you in trouble.”

Rebuilds that can cause data loss

Several types of rebuilds can lead to data loss. Some of them are listed below.

1.  Rebuild parity with zeroed drive (parity overwritten)

2.  Rebuild parity with degraded drive (forced online and parity overwritten)

3.  Rebuild parity with drives out of order (parity and data overwritten)

4.  Rebuild RAID with missing drive (parity and data overwritten)

5.  Rebuild RAID with different stripe size (parity and data overwritten)

6.  Rebuild RAID with different configuration (parity and data overwritten)

As an example, one of the most common RAID data loss cases we see is when parity is updated with a zeroed disk in the RAID configuration (RAID rebuild instead of HDD rebuild). This type of rebuild effectively destroys the original parity and prevents a drive rebuild.  Once the parity is overwritten, the missing user data from the damaged or missing HDD cannot be recreated.
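
A small Python sketch may help show why this is unrecoverable. The block contents and the `xor_blocks` helper are invented for illustration; real controllers work on much larger stripes, but the arithmetic is the same.

```python
from functools import reduce

def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))

# Hypothetical stripe on a 4-drive RAID 5: three data blocks plus parity.
d1, d2, d3 = b"ACCT", b"DATA", b"2014"
parity = xor_blocks(d1, d2, d3)

# The drive holding d2 fails. A drive (HDD) rebuild recomputes d2
# from the surviving data and the original parity.
drive_rebuild = xor_blocks(d1, d3, parity)

# Instead, a zeroed replacement drive is inserted and parity is recalculated
# across all members (a RAID/parity rebuild), treating the zeros as valid data.
zeroed = bytes(len(d2))
new_parity = xor_blocks(d1, zeroed, d3)

# The new parity is consistent with the zeroed drive, so any later attempt
# to reconstruct d2 from it only reproduces the zeros.
after_parity_rebuild = xor_blocks(d1, d3, new_parity)
```

Here `drive_rebuild` equals the original `d2`, while `after_parity_rebuild` is all zeros: the only copy of the information needed to recreate `d2` was the original parity, and it has been replaced.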

Another scenario where data could be lost is a disordered RAID array, especially during a RAID rebuild. Parity rebuilds on drives that are out of order can end up overwriting good user data.

RAID 5 Disordered Array

In the example above, the data that was originally on HDD 3 in stripe 1 is now overwritten with new parity. The parity that is on HDD 4 in stripe 1 is now treated as user data instead of parity, causing logical corruption. Furthermore, the data that is on HDD 2 in stripe 1 is skewed, which also contributes to the logical volume corruption. All of the areas marked in red would be damaged.

Even if a parity rebuild is not done, there would still be logical volume corruption. This logical corruption often triggers volume repair tools to run (CHKDSK, FSCK, etc.). These repair utilities will try to “fix” the logical corruption when the damage is really at the RAID level, causing even more damage such as deleting metadata and making the system unrecoverable.
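
The logical corruption from drive order alone can be sketched as follows. The 4-drive layout, the parity rotation in `parity_slot`, and the block labels are assumptions chosen for illustration; they are not taken from the figure above, and real controllers use a variety of rotation schemes.

```python
N_DRIVES = 4

def parity_slot(stripe: int) -> int:
    """Assumed rotation: parity starts on the last slot and moves left each stripe."""
    return (N_DRIVES - 1 - stripe) % N_DRIVES

def read_volume(drives, n_stripes):
    """Return the blocks the OS would see, skipping the slot assumed to hold parity."""
    blocks = []
    for stripe in range(n_stripes):
        skip = parity_slot(stripe)
        blocks += [drives[slot][stripe] for slot in range(N_DRIVES) if slot != skip]
    return blocks

# Each drive is a list of block labels indexed by stripe (hypothetical contents).
hdd1 = ["D1", "D4", "D7", "P4"]
hdd2 = ["D2", "D5", "P3", "D10"]
hdd3 = ["D3", "P2", "D8", "D11"]
hdd4 = ["P1", "D6", "D9", "D12"]

in_order = read_volume([hdd1, hdd2, hdd3, hdd4], 4)

# The same drives with the last two inserted in each other's slots.
swapped = read_volume([hdd1, hdd2, hdd4, hdd3], 4)
```

With the drives in order, the volume reads D1 through D12 in sequence. With the last two drives swapped, the first stripe reads D1, D2, P1: a parity block is handed to the file system as if it were user data, and later stripes return data blocks in the wrong order. No write has happened yet, but a repair tool looking at this volume would see corruption everywhere.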

Another scenario is where a RAID is rebuilt after a two-drive failure using a new drive and a degraded drive that has been forced online. A rebuild with this combination will overwrite the “good” parity with new “bad” parity, often making the system unrecoverable or the data unusable.
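
One way a forced-online drive poisons a rebuild can be sketched like this. The scenario is an assumed sequence of events (a drive drops out, the array keeps running degraded, then a second drive fails), and the block contents and `xor_blocks` helper are invented for illustration.

```python
from functools import reduce

def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))

# One stripe: data on drives A, B and C, parity on a fourth drive.
a, b_old, c = b"AAAA", b"old!", b"CCCC"

# Drive B drops out of the array, still holding b_old. The array keeps running
# degraded, and the block that lived on B is later updated; the update is
# captured only in the parity.
b_new = b"new!"
parity = xor_blocks(a, b_new, c)

# Drive C then fails. Drive B is forced back online with its stale contents,
# and a new drive replaces C. The rebuild computes C from what it can see.
rebuilt_c = xor_blocks(a, b_old, parity)
```

In this sketch `rebuilt_c` does not match the original `c`, and drive B still serves `b_old` instead of `b_new`, so two blocks in the stripe are now wrong even though the controller reports a healthy array. Stripes that were not written while drive B was offline would rebuild correctly, which is one reason the damage from this kind of rebuild tends to be scattered rather than total.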

The final example to illustrate is where the RAID configuration changes and parity and data areas are overwritten with the new configuration.

Let’s assume for this example that we have a RAID 5 array with a stripe size of 64K. The OS will read the data from the stripes starting with HDD1 and the data represented by M1. Then, it will proceed to M2 and then to D1 and so on.

RAID 5 NTFS Volume

If the array controller loses the configuration and the user forces the wrong configuration, data damage will occur. In our example, the user has forced a new configuration with a 32K stripe size, effectively splitting the data in half.

RAID 5 New Configuration

The OS will read the first half of the first section of metadata represented as M1.1. Then, the OS will jump to the next disk in the stripe and read the first half of the next section of metadata represented as M2.1. This will cause logical corruption, making the data unusable. Often this will trigger volume repair tools to run and “repair” the logical damage, which in turn can cause additional damage and even make the volume unrecoverable.
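
The effect of a wrong stripe size can be sketched with a toy model. To keep it short, this example ignores parity entirely and stripes data across three members only, and it uses 2-byte and 1-byte chunks as stand-ins for 64K and 32K. The function names and the data are hypothetical. Uppercase letters mark the first half of each original chunk and lowercase letters the second half.

```python
def write_volume(data: bytes, n_disks: int, chunk: int):
    """Distribute data round-robin across disks in chunk-sized pieces."""
    disks = [bytearray() for _ in range(n_disks)]
    for i in range(0, len(data), chunk):
        disks[(i // chunk) % n_disks] += data[i:i + chunk]
    return [bytes(d) for d in disks]

def read_volume(disks, chunk: int) -> bytes:
    """Read the disks back round-robin using the given chunk size."""
    out = bytearray()
    for offset in range(0, len(disks[0]), chunk):
        for disk in disks:
            out += disk[offset:offset + chunk]
    return bytes(out)

# "Aa" plays the role of M1 (A = M1.1, a = M1.2), "Bb" of M2, "Cc" of D1, and so on.
volume = b"AaBbCcDdEeFf"
disks = write_volume(volume, n_disks=3, chunk=2)

correct = read_volume(disks, chunk=2)   # b"AaBbCcDdEeFf"
wrong = read_volume(disks, chunk=1)     # b"ABCabcDEFdef"
```

Every byte is still physically present on the disks, but with the halved chunk size the volume is presented in the wrong order, so file system metadata no longer lines up. As long as nothing writes to the array in this state, forcing the correct configuration back would present the original data again; the damage becomes permanent once a rebuild or a repair tool writes using the wrong geometry.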

How to safely recover from this type of data loss

So how do you protect yourself in the event you run into a situation like this? Here are some tips on preventing this type of drive or RAID data loss:

  1. Image the drives before attempting a rebuild. That way if the rebuild is unsuccessful, your data is protected. Make sure the imaging program you choose allows for a forensic or sector/block-level image of the disk.
  2. Restore backups to a different volume. This ensures that all important files on the backup are good before possibly overwriting data on the active volume.
  3. If there is a RAID problem, test the backup by restoring it to a different location or image each drive from the RAID before attempting a rebuild. Sometimes a RAID rebuild does not work correctly and can make the problem worse.
  4. Do not create any new files on the disk requiring recovery or continue to run applications until the important data is recovered. New files can overwrite the files that need recovery.
  5. Do not run FSCK or CHKDSK file system repair tools on a virtual disk unless a good backup has been validated by restoring it to a different volume. These repair tools assume that there is a good backup of the data and can overwrite file pointers to make a file system consistent. If desired, these tools can be run in read-only mode to find any major corruption before repairs are made.
  6. Do not delete any additional files prior to a data recovery of deleted data. Deleting files includes moving files from the source to another volume.  A move is simply a copy then delete. If you need a copy of the data from the source, make sure to copy it and not move it. Additional deleted files can complicate the data recovery.
  7. Do not try data recovery software unless you are sure it will not write anything to the disk that needs recovery. Some recovery software will attempt to write to the source disk and could damage later recovery attempts.
  8. Contact a data recovery professional before attempting the recovery on your own. A professional can outline the possible impacts your plan will have on the recoverability of the data and offer suggestions for self-recovery.
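
As a rough illustration of what a block-level image involves (tip 1), the Python sketch below copies a source device to an image file block by block and records a SHA-256 checksum so the copy can be verified later. The function names and the example paths are hypothetical, and the sketch assumes a readable source with no bad sectors; dedicated imaging tools such as GNU ddrescue are built to cope with read errors, which this code is not.

```python
import hashlib

def image_device(source_path: str, image_path: str, block_size: int = 1024 * 1024) -> str:
    """Copy source to a new image file block by block; return the SHA-256 of the data."""
    digest = hashlib.sha256()
    # "rb" opens the source read-only; "xb" refuses to overwrite an existing image.
    with open(source_path, "rb") as src, open(image_path, "xb") as dst:
        while True:
            block = src.read(block_size)
            if not block:
                break
            dst.write(block)
            digest.update(block)
    return digest.hexdigest()

def file_sha256(path: str, block_size: int = 1024 * 1024) -> str:
    """Re-read a file and return its SHA-256, to verify an image after writing."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()

# Hypothetical usage (paths are placeholders, not real devices):
# checksum = image_device("/dev/sdX", "/mnt/recovery/sdX.img")
# assert checksum == file_sha256("/mnt/recovery/sdX.img")
```

The image should be written to a separate disk, never to the array being recovered, and each member drive of the RAID would be imaged individually before any rebuild is attempted.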

5 Responses to "Preventing Data Loss During RAID and Drive Rebuilds – Part 2"

  • Geri
    12th April 2014 - 8:43 pm

    very good hints for users of raid. i am using ssd (win7 64bit) and 4 wd green hdds as raid 1+0. i’m very happy and satisfied with speed and performance… i get a message from intel raid software, that 1 of my 4 hdds are wrong. i thought it’s not a big deal, let’s repair it. after succesfully reboot microsofts chkdsk deleted some files… because it started automatically, i stopped it too late 🙁 i reed after that your hint about disable this microsoft function. intel software told me, there is no problem with hdd and raid, but in ms explorer some files and folder are deleted, some are not to access. in the properties of hdd i have the same used and free space of that volume. after that, intels raid manager told me the problem of that once hdd again. i buyed a new one and replaced and repaired the raid… some folders still lost and some unaccaccable
    and there are two points i want to tell You:
    a) under ms win7, i use total commander to copy files. in total commander i can access one folder, i can see and open some files, which can’t be opened by ms explorer…
    b) i reed about linux live (which i used once before at not booting windows to rescue copy my data), now i can boot with ubuntu 13? by dvd, AND i can’t access the folder, which worked in total commander BUT i can show jpg images from another folder, where the images were in ms win7 half green half image! unbelievable
    i must accept, my files are lost, i have a very old back up because of raid 10, which i trusted so much. now, in my next config i would use raid 0 and 2 external backup hdds for security…
    if You Sir or someone having good ideas to my problem, let me know. it’s very interesting how computer works and it will be still a phenomene for me in the future.

  • Jacob Wilson
    22nd October 2014 - 2:34 am

    Very nice comparison and I really like the knowledge the article gave me but external hard drive data recovery is my latest issue and I am willing to read something about that.

  • Branson
    5th March 2015 - 5:01 pm

    Thanks so much for this post! I’ve been doing a lot of research on this, and this was a really great read. Keep up the great work on this blog!
