Monday, February 2, 2009

Failure Trends in a Large Disk Drive Population

1. What is the problem? Is the problem real?
Given the widespread use of magnetic media, mostly hard drives, for storage, understanding the robustness of these components helps in the design, deployment, and maintenance of storage systems. To that end, this paper presents failure statistics collected from Google's disk drives in service. The problem is very real, and the solution takes steps in the right direction: a large deployment and statistical correlation analysis.

2. What is the solution's main idea (nugget)?
Information about temperature, activity levels, and many SMART parameters was collected from the drives every few minutes. This data was mined for correlations with failures (a small sketch of the kind of bucketed analysis involved appears after the list of findings). The key findings are:
Activity Levels: These are weekly averages of read/write bandwidth per drive. Utilization has a confusing relationship with failure rates: very young (under a year) and very old (around five years) drives show higher failure rates at high utilization, while drives of intermediate age do not (sometimes the relation is, surprisingly, inverted!).
Temperature: The study debunks the widely and intuitively held belief that temperature is an important cause of failure. At low and average temperatures, the data shows an inverse relation between temperature and failure rates. Only at very high temperatures (> 45 °C) and for older drives (> 3 years) does the failure rate grow with temperature.
SMART Parameters:
(a) Scan Errors: These are errors encountered when reading sectors on the drive. Drives with scan errors are ten times more likely to fail.
(b) Reallocation Counts: A reallocation happens when a sector has a read/write error and the faulty sector number is remapped to a new physical sector; it is an indication of drive surface wear. Drives with reallocation counts are 14 times more likely to fail within 60 days than drives without them.
(c) Probational Counts: These are sectors put "on probation" as an early warning of possible problems, and they are a good indicator of failures.
Overall, SMART parameters are not a great predictor of failures: over 56% of the failed drives had no count in any of the four strong SMART signals. Better parameters are needed.
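
Here is the promised sketch of the kind of bucketed analysis the utilization and age findings rest on. The paper's own pipeline is not published, so the table schema and the sample rows below are entirely hypothetical; the point is only that an annualized failure rate (AFR) per age/utilization bucket is a simple group-by over drive-year records.

# Illustrative sketch only: AFR per age/utilization bucket from a
# hypothetical drive-year table (one row per drive per observed year).
import pandas as pd

records = pd.DataFrame([
    {"drive_id": "d001", "age_years": 0, "utilization": "high",   "failed": 1},
    {"drive_id": "d002", "age_years": 0, "utilization": "low",    "failed": 0},
    {"drive_id": "d003", "age_years": 3, "utilization": "medium", "failed": 0},
    {"drive_id": "d004", "age_years": 5, "utilization": "high",   "failed": 1},
])

# AFR for a bucket = failed drives / drive-years observed in that bucket,
# which is just the mean of the 0/1 "failed" column per group.
afr = (records
       .groupby(["age_years", "utilization"])["failed"]
       .mean()
       .rename("AFR"))
print(afr)

With real population data, this is the table one would inspect to see the age-dependent utilization effect described above.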
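Similarly, a minimal sketch of checking the two strongest SMART signals on a live Linux drive with smartmontools. Mapping the paper's "reallocation counts" and "probational counts" to the Reallocated_Sector_Ct and Current_Pending_Sector attributes is my assumption, and the device path is just an example.

# Illustrative sketch: read SMART raw values via `smartctl -A` and flag
# drives whose reallocation/pending counts are non-zero, since the paper
# finds such drives are far more likely to fail soon.
import subprocess

def smart_raw_values(device="/dev/sda"):      # hypothetical device path
    """Return {attribute_name: raw_value_string} parsed from smartctl output."""
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True, check=False).stdout
    values = {}
    for line in out.splitlines():
        fields = line.split()
        if fields and fields[0].isdigit():    # attribute rows start with a numeric ID
            values[fields[1]] = fields[-1]    # last column is the raw value
    return values

def raw(attrs, name):
    try:
        return int(attrs.get(name, "0"))
    except ValueError:                        # some drives report composite raw strings
        return 0

attrs = smart_raw_values()
if raw(attrs, "Reallocated_Sector_Ct") > 0 or raw(attrs, "Current_Pending_Sector") > 0:
    print("Non-zero reallocation/pending counts: elevated near-term failure risk.")

Of course, the paper's own caveat applies: a clean SMART report is no guarantee, since more than half of the failed drives showed none of these signals.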

3. Why is the solution different from previous work?
a. The data comes from a deployment whose size is unprecedented in the literature. Moreover, the drives served a live and popular service, which makes the data a very good basis for building models.
b. It debunks popularly held beliefs about the correlation of failure rates with utilization and temperature, and it questions the utility of SMART parameters as effective warning mechanisms.

4. Do you think the work will be influential in 10 years?
Yes, very much. Given the size of the deployment and its questioning of intuitive relations, this work is likely to lead to better models of disk failure.

5. Others: Some questions I had…
a. Does anything happen when you put a large number of disks together in a deployment that is not apparent when they run by themselves? Any magnetic interference, etc.? That might be why some of the correlations do not hold in big deployments even though they hold under "testing conditions." This could be a reason to move towards environment-based models: one model suited to server farms, another to dusty households, another to stand-alone deployments, and so on.
b. The highly influential Google systems papers make me wonder about the larger question of whether data from industry (which is often confidential) is the right (and possibly the only!) way to do data-oriented research…
