HDD MTBF: Hard drive failure prediction

Monday, 25 May 2015 by Jennifer Duits

At Ontrack, we’re well aware that data loss can affect anyone. For many of us, it comes in the form of hard disk drive (HDD) failure: mechanical and electronic defects that render the information stored therein unreadable. There are dozens of possible causes for this type of malfunction, ranging from logical software errors to physical damage. And we can’t forget that all storage devices have a finite lifespan.

Most of us could name some of the tell-tale signs that a hard drive is about to fail. For example, if your HDD shifts from a pleasant whirring noise to grinding, it’s a safe bet that it’s about to quit. In addition, slowing access to data and strange behaviour, such as corrupted data and missing files, are reliable indicators of hard drive failure.

Unfortunately, these aren’t what you’d call scientific metrics to detect an HDD malfunction. While it’s one thing to watch for oddities from your individual laptop or tower, it’s another to apply the same methodology to a redundant array of independent disks (RAID) environment in a remote data centre.

So how can consumer and business users predict when their hard drives are about to fail? Well, their first step might be to check the manufacturers’ estimates of their storage device lifespans. These estimates are usually listed as a mean time between failures (MTBF) rating. This is a common benchmark for hard drives, but what does it really mean and how is it calculated?

What is mean time between failures?

The MTBF rating is just as it sounds: the average period of time between one inherent failure and the next in a single component’s lifespan. In other words, if a machine or part malfunctions and is afterwards repaired, its MTBF figure is the number of hours it can be expected to run as normal before it breaks down again.

With consumer hard drives, it’s not uncommon to see MTBFs of around 300,000 hours. That’s 12,500 days, or a little over 34 years. Meanwhile, enterprise-grade HDDs advertise MTBFs of up to 1.5 million hours, which is about 171 years. Could you imagine if these MTBFs were real-world expectations of hard drive longevity and reliability? It would be an IT manager’s dream come true!
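To check that arithmetic yourself, here is a minimal Python sketch of the hours-to-years conversion. The function names are ours, purely for illustration, and we assume a 365-day year (leap years are ignored).

```python
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365  # assumption: leap years are ignored


def mtbf_hours_to_days(mtbf_hours: float) -> float:
    """Convert an MTBF rating in hours to days of continuous running."""
    return mtbf_hours / HOURS_PER_DAY


def mtbf_hours_to_years(mtbf_hours: float) -> float:
    """Convert an MTBF rating in hours to years of continuous running."""
    return mtbf_hours_to_days(mtbf_hours) / DAYS_PER_YEAR


if __name__ == "__main__":
    # The two ratings quoted in the article.
    for label, hours in (("consumer", 300_000), ("enterprise", 1_500_000)):
        days = mtbf_hours_to_days(hours)
        years = mtbf_hours_to_years(hours)
        print(f"{label}: {hours:,} h = {days:,.0f} days = {years:.1f} years")
```

Both conversions assume the drive is powered on around the clock, which is how the ratings themselves are expressed.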

Unfortunately, there’s a gap between the MTBF metric and real-world lifespans. The MTBF metric has a long and distinguished lineage in military and aerospace engineering. For hard drives, the figures are derived from error rates in a statistically significant number of drives running for weeks or months at a time, rather than from observing individual drives over decades.
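To see how a short test on many drives can produce a figure measured in decades, here is a rough sketch of the textbook calculation: total running hours across the whole population, divided by the number of failures observed. This is a simplification on our part, not any manufacturer’s documented procedure, and the drive count, test length and failure count below are invented for illustration.

```python
def estimate_mtbf(drive_count: int, hours_per_drive: float, failures: int) -> float:
    """Estimate MTBF as total accumulated drive-hours divided by failures."""
    if failures <= 0:
        # With no failures observed, this simple ratio is undefined.
        raise ValueError("at least one failure is needed for this estimate")
    total_drive_hours = drive_count * hours_per_drive
    return total_drive_hours / failures


if __name__ == "__main__":
    # Hypothetical test: 1,000 drives run for 30 days (720 hours) with 2 failures.
    mtbf = estimate_mtbf(drive_count=1_000, hours_per_drive=30 * 24, failures=2)
    print(f"Estimated MTBF: {mtbf:,.0f} hours")
```

In this made-up test no drive ran for more than a month, yet the resulting rating is 360,000 hours. The number describes the failure rate of the population during the test window, not how long any one drive will last.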

Correspondingly, studies have demonstrated that MTBFs typically promise much lower failure rates than actually occur in real-world performance. In 2007, researchers at Carnegie Mellon University investigated a sample of 100,000 HDDs with manufacturer-provided MTBF ranges of one million to 1.5 million hours. This translates to an annual failure rate (AFR) of at most 0.88 percent, but their study found that AFRs in the field “typically exceed one percent, with two to four percent common and up to 13 percent observed in some systems”.
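For readers wondering where the 0.88 percent comes from, here is a small sketch of the usual conversion from MTBF to AFR: divide the 8,760 hours in a year by the MTBF. The function names are ours, and we assume drives that run 24 hours a day with a constant failure rate. We also show the exponential form of the same calculation, which gives nearly identical results for MTBFs this large.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8,760 hours; assumes the drive runs around the clock


def afr_simple(mtbf_hours: float) -> float:
    """Approximate AFR as the fraction of one MTBF that elapses per year."""
    return HOURS_PER_YEAR / mtbf_hours


def afr_exponential(mtbf_hours: float) -> float:
    """AFR under a constant failure rate (exponential lifetime model)."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


if __name__ == "__main__":
    # The two ends of the MTBF range quoted in the study.
    for mtbf in (1_000_000, 1_500_000):
        print(
            f"MTBF {mtbf:,} h: "
            f"simple AFR = {afr_simple(mtbf):.2%}, "
            f"exponential AFR = {afr_exponential(mtbf):.2%}"
        )
```

The 0.88 percent figure corresponds to the one-million-hour end of the range; a 1.5-million-hour rating works out to roughly 0.58 percent.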

Manufacturers aren’t turning a blind eye to this discrepancy. Recently, both Seagate and Western Digital have phased out using the metric for their HDDs.

So with MTBFs proven to be an unreliable indicator of hard drive health, how else can we predict the end of a storage device’s lifespan? In our next blog, we’ll discuss the pros and cons of using SMART tools to detect when an HDD is on the verge of quitting.