Using SMART tools to predict HDD failure

by Sam Wiltshire 26 May 2015


In a previous blog post - MTBF: Can it help predict hard drive failure? - we noted that the best-known methods for predicting hard disk drive (HDD) failure aren't what you'd call scientific. Grinding and thrashing noises are a reliable indicator that a HDD is about to give up the ghost, for example, but that's cold comfort when your drives are sitting in a remote data centre and therefore out of earshot.

Hard drive manufacturers, meanwhile, often seem like they're being deliberately misleading when estimating the longevity of their storage devices. They use a metric called Mean Time Between Failures (MTBF), extrapolated from running large numbers of drives for weeks and months at a time, which can give readings as high as 1.5 million hours - more than 170 years of continuous running - for enterprise-grade HDDs. The methodology is sound, but the outcome has little in common with the average lifespan of a hard drive in the field.
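
To make that arithmetic concrete, here is a minimal Python sketch that converts an MTBF figure into years and into a rough annualised failure rate. The function names are our own, and the failure-rate formula assumes drives fail at a constant rate over time, which is an idealisation rather than a description of real hardware.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def mtbf_to_years(mtbf_hours: float) -> float:
    """Convert an MTBF figure in hours into years of continuous running."""
    return mtbf_hours / HOURS_PER_YEAR


def annualised_failure_rate(mtbf_hours: float) -> float:
    """Rough share of a large fleet expected to fail in one year.

    Assumes a constant failure rate (an idealisation), so treat the
    result as an approximation only.
    """
    return HOURS_PER_YEAR / mtbf_hours


mtbf = 1_500_000  # the enterprise-grade figure quoted above
years = mtbf_to_years(mtbf)
afr = annualised_failure_rate(mtbf)
print(f"{years:.0f} years of running, or roughly {afr:.2%} of drives per year")
```

Read this way, a 1.5 million hour MTBF is better understood as a small expected failure rate across a large population of drives than as a promise about how long any single drive will last.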

Most of those manufacturers do, however, also offer a more sophisticated method for predicting HDD failure. Specifically, their devices come preinstalled with a set of firmware tools called Self-Monitoring, Analysis and Reporting Technology (SMART), which communicates metrics on hard drive performance back to the operating system. This data can then be viewed and analysed via software, providing IT administrators with greater insights into the health of their storage equipment than would otherwise be possible.

The metrics tracked by SMART tools - called attributes - vary from manufacturer to manufacturer, but typical examples include the number of hours the drive has been switched on, the time it takes for the spindle to reach operational speed and the count of reallocated sectors.

Checking your SMART data

Checking your storage devices' SMART data is generally pretty simple. It's possible to buy software expressly designed for the purpose, which might be judicious if you're looking to gain meaningful insights from that data, but it's not a prerequisite: if you're using Windows, you can get a quick and dirty rundown of your HDD's SMART attributes and their readings via the command prompt.
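
If you'd rather handle the readings in a script, one possible approach is to parse the attribute table printed by a command-line tool. The Python sketch below assumes the column layout produced by smartctl -A from the open source smartmontools package, which is our choice for illustration and not something this article depends on. The sample text is made-up data, not output from a real drive.

```python
from typing import Dict, NamedTuple


class SmartAttribute(NamedTuple):
    attr_id: int
    name: str
    value: int      # normalised value reported by the drive
    worst: int      # lowest normalised value seen so far
    threshold: int  # vendor-defined failure threshold
    raw: str        # raw reading, kept as text because formats vary by vendor


def parse_smart_table(text: str) -> Dict[int, SmartAttribute]:
    """Parse an attribute table in the assumed `smartctl -A` layout."""
    attributes = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least ten columns;
        # the header row and any other lines are skipped.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attributes[int(fields[0])] = SmartAttribute(
            attr_id=int(fields[0]),
            name=fields[1],
            value=int(fields[3]),
            worst=int(fields[4]),
            threshold=int(fields[5]),
            raw=" ".join(fields[9:]),
        )
    return attributes


# Illustrative sample only: these readings are invented for the example.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  3 Spin_Up_Time            0x0027   100   100   021    Pre-fail  Always       -       4291
  5 Reallocated_Sector_Ct   0x0033   100   100   140    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       4382
"""

for attr in parse_smart_table(SAMPLE).values():
    print(f"{attr.attr_id:>3} {attr.name:<24} raw={attr.raw}")
```

The three rows in the sample correspond to the spin-up time, reallocated sector count and power-on hours mentioned earlier. In practice you would feed the function the text captured from the tool itself, and you may need to adjust it for drives that report additional or differently formatted columns.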

Of course, if you're after a way to track and analyse SMART data more proactively, there are various tools available on multiple platforms and at different price points. One example is Ontrack EasyRecovery, and if you're serious about using SMART tools to monitor the health of your hard drives and plan replacements, a dedicated tool of this kind is the way to go.

The reliability of SMART tools

You may have noticed that we've yet to discuss whether or not SMART tools are, in fact, a reliable indicator of hard drive health. So, are they? The answer is yes and no. While some SMART attributes are widely felt to be useful in predicting HDD failure, it's also commonly accepted that the system is not without its limitations.

Most notably, SMART can't predict 100 per cent of hard drive crashes, because not all hard drive crashes are predictable in the first place. While errors that arise from regular mechanical wear and tear tend to show up as abnormal SMART readings, sudden electronic malfunctions and component failures do not. To put this in perspective, a 2007 Google study of 100,000 consumer-grade HDDs found that fewer than two-thirds (64 per cent) of failures over a nine-month period were flagged up by SMART beforehand, leaving more than a third that struck without any warning.

Another factor that makes SMART attributes themselves less useful is how they vary from manufacturer to manufacturer, even in terms of the ways that common attributes are measured. So a Seagate device and a Western Digital device of equivalent health may give completely different readings for their seek error rates, for example.

In November 2014, cloud backup provider Backblaze published a fascinating study on the wildly varying usefulness of different SMART attributes. Based on readings from almost 40,000 hard drives storing 100 petabytes of customer data, it concluded that out of 70 available attributes, only five were actually reliable indicators of HDD failure. "We would love to use more - ideally the drive vendors would tell us exactly what the SMART attributes mean," wrote engineer Brian Beach.
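
As a rough illustration of how a shortlist like this could be put to work, the Python sketch below flags a drive when any of five attributes has a raw value above zero. The attribute IDs and the "anything above zero" rule reflect our understanding of Backblaze's post, and the function name and sample readings are hypothetical, so treat this as a starting point and not a tested monitoring policy.

```python
from typing import Dict, List

# The five attributes as we understand them from Backblaze's post.
# Names follow common usage and can differ between vendors and tools.
BACKBLAZE_ATTRIBUTES = {
    5: "Reallocated Sectors Count",
    187: "Reported Uncorrectable Errors",
    188: "Command Timeout",
    197: "Current Pending Sector Count",
    198: "Offline Uncorrectable",
}


def warning_attributes(raw_values: Dict[int, int]) -> List[int]:
    """Return the IDs of shortlisted attributes whose raw value is above zero.

    `raw_values` maps SMART attribute ID to its raw reading. Attributes the
    drive does not report are treated as zero, which is an assumption.
    """
    return [
        attr_id
        for attr_id in sorted(BACKBLAZE_ATTRIBUTES)
        if raw_values.get(attr_id, 0) > 0
    ]


# Hypothetical readings: eight pending sectors, everything else clean.
readings = {5: 0, 9: 4382, 187: 0, 197: 8}
for attr_id in warning_attributes(readings):
    print(f"Check drive: SMART {attr_id} ({BACKBLAZE_ATTRIBUTES[attr_id]}) is non-zero")
```

Note that power-on hours (attribute 9) is ignored here however high it climbs, because it is not on the shortlist. A clean result from a check like this does not mean the drive is safe, only that none of these particular warning signs has appeared.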

In conclusion

In reality, then, SMART tools don't provide a 100 per cent accurate way to know how and when one of your HDDs is about to meet its maker. Sure, they can predict some types of hard drive failures, supposing you know where to look, but others can occur without a single abnormal reading.

As such, no savvy storage user would ever rely on SMART alone - or any other predictive system - to prevent the loss of their data and to plan for business continuity. The nature of electromechanical devices means it's always best to bet on a mix of defences: redundancy, back-up and data recovery, not just SMART.