RAID systems and the reality of their reliability

29 July 2013 by Sam Wiltshire

RAID (Redundant Array of Independent Disks) systems are groups of hard drives that work together to ensure high data reliability. This technology is now used in more than 98% of all servers. However, we must be aware that these systems are not completely fail-safe: they are still subject to everyday problems such as hard disk or controller failures and, of course, human error.

Because the hard disks are in arrays, it is not unusual for them all to come from the same production batch. They could therefore quite easily share the same manufacturing fault and fail simultaneously. For this reason, when a drive in a RAID system does seem to have problems, it should be replaced as quickly as possible, even if the server continues to run smoothly. A problem with one drive can have a domino effect on the other drives and even on the whole system. Another issue to keep in mind is voltage surges, which can easily wipe out several disks simultaneously, including the mirrored copies.
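To give a rough feel for why a shared defect matters, the short Python sketch below estimates the chance that at least one of the remaining drives fails before a failed drive has been replaced and rebuilt. It is only an illustration: the function name, the number of drives and both failure probabilities are hypothetical values chosen for the example, not measured figures, and the formula assumes the remaining drives fail independently of one another.

```python
def prob_second_failure(remaining_drives: int, p_fail: float) -> float:
    """Chance that at least one remaining drive fails during the rebuild.

    Assumes each remaining drive fails independently with probability
    p_fail during the rebuild window (a simplifying assumption).
    """
    return 1.0 - (1.0 - p_fail) ** remaining_drives


# Hypothetical figures, for illustration only.
REMAINING_DRIVES = 7      # drives left in the array after one has failed
P_HEALTHY_DRIVE = 0.001   # assumed per-drive risk for unrelated drives
P_SAME_BATCH = 0.02       # assumed higher risk if the batch shares a defect

independent = prob_second_failure(REMAINING_DRIVES, P_HEALTHY_DRIVE)
same_batch = prob_second_failure(REMAINING_DRIVES, P_SAME_BATCH)

print(f"Unrelated drives:    {independent:.2%}")
print(f"Same-batch drives:   {same_batch:.2%}")
```

Under these assumed numbers the risk of a second failure is many times higher for drives from a faulty batch, which is why a suspect drive should not be left in the array.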

Using replacement parts can also be a risky business. Parts that are said to be identical may still have slight differences. Here at Kroll Ontrack, we performed a test. We swapped the circuit board of one faulty hard disk for another, and the disk would not start. We then fitted a board that was supposedly ‘identical’, again without success. Rebooting the drive still failed to solve the problem. Even after we changed the BIOS settings, the controller still appeared to treat the drive as faulty. After a subsequent reboot of the server, the data stored on the system was no longer readable. For any business this could be a major problem. We strongly advise having a reliable backup system in place, even when using supposedly fail-safe RAID servers.

Rescue attempts by people with insufficient knowledge of the field are likely to cause more problems than they solve. If the operation is not performed correctly, it can lead to disk failure and even loss of data. The average IT department would be well out of its depth, and any unsuccessful attempt by inexperienced technicians to recover the data could make the work much harder (or impossible!) for the external professionals who are eventually called in. Self-recovery attempts should therefore be avoided at all costs. With this sort of problem, you should always turn to experts: professionals with extensive experience and specialised tools are more likely to recover the data successfully, and can sometimes even do so remotely over an internet connection.