With Cloud, Virtualization and Software Defined Networks taking center stage, it’s easy to forget that hardware failure can still bring it all down. Backblaze recently analyzed the failure rates of 25,000 disk drives and came up with some interesting insights. First, the findings:
- Failure rate of ~5% per year for the first 18 months, mostly due to manufacturing defects.
- For the next 1.5 years, drives fail LESS, at about 1.4% per year.
- After 3 years, though, failure rates skyrocket to 11.8% per year.
Take a moment and think about your current acquisition process. If you typically buy servers and storage assets in bulk, the period starting three years after each purchase can be quite disruptive. A failure rate of nearly 12% per year makes a compelling case for planning an investment in replacements proactively, by the end of year 2.
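To put rough numbers on that, here is a minimal Python sketch that projects expected failures per year for a single bulk purchase. It rests on a few assumptions of mine: the ~5% early-life figure is treated as an annual rate like the other two, failures are spread evenly within each age band, and failed drives are not replaced. The names `RATE_BANDS`, `annual_rate` and `expected_failures_by_year` are made up for this illustration.

```python
# Annual failure rates by drive age, taken from the findings above.
# Assumption: the ~5% early-life figure is an annual rate, like the other two.
RATE_BANDS = [
    (18, 0.05),     # age 0-17 months
    (36, 0.014),    # age 18-35 months
    (None, 0.118),  # age 36 months and older
]


def annual_rate(age_months):
    """Return the annual failure rate for a drive of the given age in months."""
    for upper_bound, rate in RATE_BANDS:
        if upper_bound is None or age_months < upper_bound:
            return rate


def expected_failures_by_year(fleet_size, years):
    """Expected number of drive failures in each year after a bulk purchase.

    Steps month by month, converting the annual rate into an equivalent
    monthly rate. Failed drives are not replaced, so this tracks only the
    original batch.
    """
    surviving = float(fleet_size)
    failures = []
    for year in range(years):
        failed_this_year = 0.0
        for month in range(12):
            rate = annual_rate(year * 12 + month)
            monthly_rate = 1.0 - (1.0 - rate) ** (1.0 / 12.0)
            failed = surviving * monthly_rate
            surviving -= failed
            failed_this_year += failed
        failures.append(failed_this_year)
    return failures


# Example: a batch of 1,000 drives tracked over 5 years.
for year, count in enumerate(expected_failures_by_year(1000, 5), start=1):
    print(f"Year {year}: about {count:.0f} expected failures")
```

Under these assumptions, year 4 is where the expected failure count jumps, which is the disruption a replacement budget at the end of year 2 is meant to absorb.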
These numbers also make it clear that your monitoring strategy cannot ignore hardware. A new generation of monitoring vendors has recently emerged, promoting the idea that application and user experience monitoring are all that’s needed. Unfortunately, this approach misses the opportunity to get ahead of these types of failures. A drive rarely fails with no warning: potentially weeks earlier, its SMART checks start sounding alarms, and OS-level checks may detect issues as well. But if neither is being monitored, the opportunity to get ahead of the issue is lost.
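As an illustration of what a basic SMART check might look like, the sketch below parses the attribute table printed by `smartctl -A` (from smartmontools) and flags a few counters that are commonly treated as early-warning signs. The choice of watched attributes and the "any non-zero raw value is a warning" rule are my assumptions, not vendor guidance, and the function names are hypothetical. Attribute names can also vary by drive model.

```python
import subprocess

# Assumption: these counters are treated as early-warning signs, and any
# non-zero raw value is worth a look. Tune both for your own fleet.
WATCHED_ATTRIBUTES = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
}


def read_smart_attributes(device):
    """Run `smartctl -A` on a device and return its text output.

    Usually needs root. smartctl uses its exit status as a bitmask, so a
    non-zero status is not treated as an error here.
    """
    result = subprocess.run(
        ["smartctl", "-A", device],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout


def parse_smart_attributes(text):
    """Map attribute name to raw value for each row of the attribute table."""
    attributes = {}
    for line in text.splitlines():
        parts = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the raw value is the tenth column.
        if len(parts) < 10 or not parts[0].isdigit():
            continue
        try:
            attributes[parts[1]] = int(parts[9])
        except ValueError:
            continue  # raw value is not a plain integer; skip it
    return attributes


def find_warnings(attributes):
    """Return the watched attributes whose raw value is above zero."""
    return {
        name: value
        for name, value in attributes.items()
        if name in WATCHED_ATTRIBUTES and value > 0
    }
```

On a host with smartmontools installed, something like `find_warnings(parse_smart_attributes(read_smart_attributes("/dev/sda")))` could run on a schedule and feed whatever alerting you already have in place; the device path here is only an example.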