Monday, February 2, 2009

Crash: Data Center Horror Stories

This article recounts some of the crashes that have occurred at datacenters. It aims, albeit a little dramatically, to bring out an important point about the complexity and enormity of the task of designing datacenters. The examples include cabling errors, insulation faults, and a lack of clear understanding, among the people who design failure-management strategies, of the possible types and scale of collapses. Many of the faults stem from errors in the physical construction process, which makes it important for experts from different fields to pool their knowledge during the design, inspection, operation, and maintenance of a datacenter. There are general guidelines for good datacenter design and operation, such as adhering to standards and carefully inspecting every step, but the key is to remember that every datacenter is unique, with its own problems and complexity. Custom solutions that are carefully thought out are the best approach.

Questions:
1. Is the datacenter-building industry still a little young, with most data about construction and operational experiences kept confidential? In that sense, will shared collective knowledge, over time, dramatically increase our capability to guard against such mishaps? Are there parallels in other fields?
2. What is the trade-off between failure resilience and the cost of achieving it? Is it preposterous to suggest a strategy of: “I will have some standard failure-proof strategies… and if once in a while (with a very low probability) my datacenter goes down, tough luck until I restore it!”? A rough back-of-the-envelope comparison is sketched after this list.
3. How much can software techniques help in working around some of these problems?
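
To make question 2 a little more concrete, here is a minimal back-of-the-envelope sketch. It is not from the article; every number in it (outage probability, restore time, downtime cost, redundancy cost) is a hypothetical assumption chosen only to illustrate how the expected cost of accepting occasional outages compares with the cost of paying for extra resilience.

    # Back-of-the-envelope comparison of failure resilience vs. cost.
    # All figures below are hypothetical assumptions, for illustration only.

    def expected_annual_cost(outage_prob_per_year, hours_per_outage,
                             cost_per_downtime_hour, resilience_spend_per_year):
        """Expected yearly cost = amortized resilience spend + expected downtime loss."""
        expected_downtime_loss = outage_prob_per_year * hours_per_outage * cost_per_downtime_hour
        return resilience_spend_per_year + expected_downtime_loss

    # Option A: accept occasional outages ("tough luck until I restore it").
    baseline = expected_annual_cost(
        outage_prob_per_year=0.5,       # assume one major outage every two years
        hours_per_outage=8,             # assume 8 hours to restore service
        cost_per_downtime_hour=50_000,  # assumed revenue/penalty loss per hour
        resilience_spend_per_year=0,
    )

    # Option B: pay for extra redundancy (e.g., dual power feeds, failover site).
    hardened = expected_annual_cost(
        outage_prob_per_year=0.05,      # assume redundancy cuts outage odds 10x
        hours_per_outage=8,
        cost_per_downtime_hour=50_000,
        resilience_spend_per_year=150_000,  # assumed yearly cost of the redundancy
    )

    print(f"Option A (accept outages): expected cost ${baseline:,.0f}/year")
    print(f"Option B (add redundancy): expected cost ${hardened:,.0f}/year")

Under these assumed numbers the redundancy pays for itself (about $170,000/year versus $200,000/year), but with a cheaper downtime hour or a more expensive redundancy scheme the “tough luck” strategy can come out ahead, which is exactly the trade-off question 2 raises.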
