Wednesday, February 18, 2009

Dynamo: Amazon’s Highly Available Key-value Store

1. What is the problem? Is the problem real?

The problem is that of providing a highly available, high-performance storage system built from commodity (and hence failure-prone) machines, for a workload with plenty of write requests. The problem is real and is faced by most companies that support services similar to Amazon's.

 

2. What is the solution's main idea (nugget)?

Data is partitioned and replicated using consistent hashing (of Chord fame). Given that writes constitute a significant portion of their workload, guaranteeing availability while maintaining consistency is a big challenge. Dynamo uses a quorum-based consistency protocol in which a minimum number of nodes must be up for a successful read or write, and these parameters are configurable according to an application's availability requirements. Object versioning helps in guaranteeing availability, and divergent versions are reconciled during reads.
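To make the partitioning scheme concrete, here is a minimal sketch of consistent hashing with virtual nodes and an N-node preference list. This is only my reading of the scheme with made-up names, not Dynamo's actual code.

```python
# Minimal sketch of Dynamo-style partitioning: hash keys and nodes onto a
# ring, map each physical node to several "virtual nodes" to spread load,
# and replicate each key on the next N distinct nodes clockwise from it.
import hashlib
from bisect import bisect_right

def _hash(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes_per_node=8):
        self.points = sorted(
            (_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes_per_node)
        )

    def preference_list(self, key, n=3):
        # Walk clockwise from the key's position and collect the first n
        # distinct physical nodes; these hold the key's replicas.
        idx = bisect_right(self.points, (_hash(key), ""))
        replicas = []
        while len(replicas) < n:
            _, node = self.points[idx % len(self.points)]
            if node not in replicas:
                replicas.append(node)
            idx += 1
        return replicas

ring = Ring(["A", "B", "C", "D", "E"])
print(ring.preference_list("cart:12345"))   # e.g. ['C', 'A', 'E']
```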

 

3. Why is the solution different from previous work?

The contribution of this paper is not new protocols so much as the experience of building and managing gigantic-scale systems. That said, there is a key difference: this storage system is actually tuned for write-intensive workloads, which is a significantly harder problem than read-intensive workloads in terms of guaranteeing availability and maintaining consistency at the same time. Unlike the Google papers, this one seems to favor the classical decentralized design as opposed to a simpler centralized system.

 

4. Does the paper (or do you) identify any fundamental/hard trade-offs?

They trade off strong consistency for availability, settling for eventual consistency; this is driven by the write-intensive workload. The other trade-off is in how transparent the system is to applications: Dynamo expects applications to be intelligent and deal with inconsistency in whatever manner they deem fit. This represents a shift towards applications becoming more complex and would have a significant impact on the way they are designed.

 

5. Do you think the work will be influential in 10 years? Why or why not?

Sure! Dynamo is a very good implementation of many principles from distributed systems and database systems, picking the right trade-offs and making them work effectively. With internet services becoming a big deal, its workload assumptions and solutions are all very pertinent.

 

6. Others:

a. While the idea of a knob for availability (R, W, N) seems good at the outset, it seems to go against the general principle of this paper – I will tell you what works! Setting these knobs is not an easy task, and maybe for good reason they haven't let out the secret of how their applications set them. But the settings can be crucial to the performance of the system.
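To make the knob concrete, here is a tiny illustration of how R, W and N interact. The triples below are only examples; if I remember right, the paper mentions (3, 2, 2) as a common configuration.

```python
# Rough illustration of the (R, W, N) knobs: R + W > N forces the read and
# write quorums to overlap in at least one replica, so a read sees the
# latest acknowledged write; R + W <= N gives up that guarantee in exchange
# for lower latency / higher availability.
def quorum_properties(n, r, w):
    return {
        "read/write quorums overlap": r + w > n,
        "replicas a read can tolerate losing": n - r,
        "replicas a write can tolerate losing": n - w,
    }

for n, r, w in [(3, 2, 2), (3, 1, 1), (3, 3, 1)]:
    print((n, r, w), quorum_properties(n, r, w))
```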

 

Wednesday, February 11, 2009

The Chubby lock service for loosely-coupled distributed systems

1. What is the problem? Is the problem real?
The problem is that of providing a lock service for a distributed system. Chubby provides such a centralized lock service for O(10,000) servers. This is obviously a very real (and historical) problem in distributed systems with multiple solutions. And not only is the problem real, implementing it in a real, large system is another beast in itself – the details in this paper go a long way towards understanding a practically deployed and effective lock service for a large distributed system.

2. What is the solution's main idea (nugget)?
Chubby is a centralized lock service (as opposed to a distributed, client-based one) for Google's distributed systems. A Chubby cell consists of a periodically elected master and a small set of replicas. Locks can be exclusive or shared, but are not mandatory (for maintenance reasons, and because mandatory locks add no real value in a typical production environment). Caching helps scalability: cache entries are invalidated on modification rather than updated, as an update scheme might burden clients with an unbounded number of updates. The master bears the overhead of maintaining the list of what every client may have cached. Blocking keep-alives are used to maintain sessions between clients and the master.
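Here is a rough sketch of how I picture the invalidate-on-write scheme: the master tracks which clients may cache each node, and a modification blocks until those cached copies are flushed. All names are made up, and this glosses over sessions, leases and the piggybacking of invalidations on keep-alives.

```python
# Rough sketch of invalidate-on-write caching: the master records which
# clients may cache a node, and a write blocks until every such client has
# acknowledged the invalidation. Illustrative only; not Chubby's interfaces.
from collections import defaultdict

class Client:
    def __init__(self, name):
        self.name, self.cache = name, {}

    def invalidate(self, path):
        self.cache.pop(path, None)

class Master:
    def __init__(self):
        self.data = {}
        self.cachers = defaultdict(set)    # path -> clients that may cache it

    def read(self, client, path):
        self.cachers[path].add(client)     # client is now allowed to cache
        return self.data.get(path)

    def write(self, path, value):
        # Block the modification until all cached copies are invalidated
        # (in Chubby these invalidations ride on keep-alive replies).
        for c in list(self.cachers[path]):
            c.invalidate(path)
        self.cachers[path].clear()
        self.data[path] = value

m, c1 = Master(), Client("c1")
c1.cache["/ls/foo"] = m.read(c1, "/ls/foo")
m.write("/ls/foo", "v2")                   # c1's stale entry is flushed first
```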

3. Why is the solution different from previous work?
This paper, self-admittedly, is about the engineering experience and not new protocols. The details in this paper are highly valuable, but its biggest contribution is to describe the design of a centralized lock service in a large operational distributed system, taking care of clients' implementation complexities and of corner cases in scaling and failures.

4. Does the paper (or do you) identify any fundamental/hard trade-offs?
Being an implementation experience, this paper is about multiple trade-offs that are sort of classic in systems research – ease of deployment and implementation vs. correctness: (1) a centralized lock service vs. a client-only library for distributed consensus; (2) no mandatory locks, for ease of administration and because they wouldn't really fix the problem anyway; (3) blocking modifications to ensure cache invalidation and consistency – caching is important for scalability, and this keeps things simple at the expense of slow modifications.

5. Do you think the work will be influential in 10 years? Why or why not?
Without a doubt! Being used in a production environment as large as Google’s is influential enough. Handling of the practical issues to achieve scalability and clarity means that its adoption is imminent. And its simplicity provides a clean basis to add modifications on top as the requirements of services change.

6. Others:
Some unclear things:
a. For a large number of clients, is this invalidation strategy actually scalable? They provide no numbers to back it, other than mentioning that a Chubby service once had 90,000 clients. More quantitative details would have helped.
b. Why is the keep-alive scheme blocking? What happens if the master returns the call immediately, and puts the onus on the client to send its keep-alive messages before timing out?
c. Multiple Chubby cells: In this case, is there an overlap in the set of servers being served by different Chubby cells? If yes, how do they maintain consistency among the cells themselves?
Thoughts:
a. Are updates sometimes a better idea than invalidation? We could have a time period (and possibly a frequency cap) for updates, so that clients are not bombarded indefinitely and unboundedly. This would make modifications return quicker than in the current blocking model (a rough sketch follows after these thoughts).
b. How about multiple masters for a set of servers, with a multicast protocol like SRM for maintaining consistency among the masters? This would help in scalability.
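For thought (a), here is a purely speculative sketch of a rate-limited update scheme: writes return immediately, and the master pushes batched updates to caching clients at most once per period. Everything here is invented for illustration.

```python
# Speculative sketch of thought (a): modifications return immediately, and
# batched updates are pushed to caching clients at most once per period, so
# clients never face an unbounded stream of updates.
import time

class UpdatePusher:
    def __init__(self, period_s=0.5):
        self.period_s = period_s
        self.pending = {}                     # path -> latest unpushed value
        self.last_push = time.monotonic()

    def on_write(self, path, value):
        self.pending[path] = value            # no blocking on invalidations

    def maybe_push(self, client_caches):
        now = time.monotonic()
        if self.pending and now - self.last_push >= self.period_s:
            for cache in client_caches:
                cache.update(self.pending)    # plain dicts stand in for clients
            self.pending.clear()
            self.last_push = now

pusher, caches = UpdatePusher(), [{}, {}]
pusher.on_write("/ls/foo", "v2")
time.sleep(0.6)
pusher.maybe_push(caches)
print(caches)                                 # both caches got the batched update
```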

Sunday, February 8, 2009

eBay: Scale!

1. What is the problem? Is the problem real?
As a site that serves 2 billion page views a day, resulting in $60 billion in transactions every year, eBay definitely faces the problem of scaling up to demand. Maintaining such large-scale systems is a very real problem and is faced by most companies in the business of web-based services.

2. What is the solution's main idea (nugget)?
The solution is based on five basic principles: partition everything, introduce asynchrony, automatically adjust configurations and learn various characteristics including user behavior, be prepared for failures and handle them gracefully, and treat consistency as a spectrum to be applied appropriately. While these are fairly well-known principles in the distributed systems world by now, their relevance cannot be tested in more demanding circumstances than modern datacenters. Their deployment includes third-party clouds too, making the problem harder and necessitating checks to avoid some common erroneous assumptions about modeling, portability and utilization of resources.
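As a toy illustration of the first two principles – route each request to a partition by key, and push non-critical work onto a queue instead of doing it inline – here is a small sketch. The names and structure are invented; this is not eBay's actual architecture.

```python
# Toy sketch of "partition everything" and "introduce asynchrony": the
# critical write goes to a shard chosen by key, while follow-up work is
# queued and processed asynchronously. Entirely illustrative.
from collections import deque
from hashlib import sha1

NUM_PARTITIONS = 4
partitions = [dict() for _ in range(NUM_PARTITIONS)]   # per-shard item state
event_queue = deque()                                  # deferred follow-up work

def partition_for(key: str) -> int:
    return int(sha1(key.encode()).hexdigest(), 16) % NUM_PARTITIONS

def place_bid(user: str, item: str, amount: float):
    partitions[partition_for(item)][item] = (user, amount)   # critical path
    event_queue.append(("notify_seller", item))              # handled async

def drain_events():
    while event_queue:
        print("processing", event_queue.popleft())

place_bid("alice", "item-123", 42.0)
drain_events()
```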

3. Why is the solution different from previous work?
In addition to these principles, the slides also talk about addressing emerging challenges such as energy efficiency and workload characterization useful for predicting “floods”. Modeling workloads is an especially useful research problem that has implications for resource provisioning and quality of service, energy-efficient designs and statistical multiplexing of resources.

4. Do you think the work will be influential in 10 years? Why or why not?
These are all very definite and real problems to solve. Principles as well as solutions that arise out of this work will definitely be influential, both immediately as well as in future.

5. Others:
The use of third-party clouds is pretty interesting – but for a service dealing with a large number of customers, I wonder what the privacy implications are.

Wednesday, February 4, 2009

DCell: A Scalable and Fault-Tolerant Network Structure for Data Centers

1. What is the problem? Is the problem real?
Low bisection bandwidth in datacenter networks; a real problem, but with the same questions as for the UCSD fat-tree paper.

2. What is the solution's main idea (nugget)?
Using a recursive structure called a DCell, they can set up many paths between servers. These multiple paths can be leveraged for scalability, load balancing and other nice properties.
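A quick back-of-the-envelope on how fast the structure grows, as I understand the construction: DCell_0 is n servers on a mini-switch, and a level-k DCell is built from t_{k-1} + 1 copies of the level-(k-1) DCell, so the server count grows doubly exponentially.

```python
# Back-of-the-envelope for DCell's growth (my reading of the construction):
# t_0 = n servers on one mini-switch, and t_k = t_{k-1} * (t_{k-1} + 1).
def dcell_servers(n: int, k: int) -> int:
    t = n
    for _ in range(k):
        t = t * (t + 1)
    return t

for n in (4, 6, 8):
    print(n, [dcell_servers(n, k) for k in range(4)])
# With n = 6 and k = 3 this already exceeds three million servers, which is
# where the claim of scaling to millions with commodity switches comes from.
```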

3. Why is the solution different from previous work?
Can scale to a million servers easily with high fault-tolerance, using commodity switches.

4. Do you think the work will be influential in 10 years? Why or why not?
Seems unlikely, though I can’t put my finger on exactly why not! Somehow the deployability story in this paper doesn’t seem strong and has an “if we were to do it from scratch” kind of feel. Such designs have traditionally proved hard to deploy and hence haven’t been very influential, at least directly. Maybe ideas from this paper will find their way into other designs…

5. Others:
Of the three, this one was very confusing to read! There is too much pseudo-code that is not immediately clear. Sure, a picture is worth a thousand words!

A Policy-aware Switching Layer for Data Centers

1. What is the problem? Is the problem real?
Datacenter management is hard, and the scale makes configuration errors inevitable. In addition, there are no explicit protocols or mechanisms to deploy middle-boxes in datacenters, resulting in operators overloading existing mechanisms and coming up with ad hoc, error-prone techniques. Unlike in the Internet, where people proposed similar “clean” techniques only after the “clumsy” ones had become familiar and operational, this paper hits the problem at the right time: datacenter management is still evolving and admins do make errors. The problem is very real.

2. What is the solution's main idea (nugget)?
Middle-box traversal should be dictated by explicit policy rather than by where the boxes happen to sit on the physical network path. They introduce a new kind of programmable switch, the pswitch, that takes in policies and ensures correct middle-box traversal.
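A toy sketch of the idea follows; the rule format and names are invented for illustration and are not PLayer’s actual specification. The pswitch classifies a frame and forwards it along the middle-box sequence its policy dictates, regardless of where those boxes sit in the physical topology.

```python
# Toy sketch of policy-driven middle-box traversal: classify each frame and
# forward it along the sequence the policy dictates. A real pswitch infers a
# frame's progress from its previous hop; a simple counter is used here.
POLICIES = [
    {"match": {"dst_port": 80},  "sequence": ["firewall", "load_balancer"]},
    {"match": {"dst_port": 443}, "sequence": ["firewall"]},
]

def classify(frame: dict):
    for rule in POLICIES:
        if all(frame.get(k) == v for k, v in rule["match"].items()):
            return rule["sequence"]
    return []                          # no policy: forward straight to the destination

def next_hop(frame: dict) -> str:
    seq = classify(frame)
    visited = frame.get("visited", 0)  # how many middle-boxes already traversed
    return seq[visited] if visited < len(seq) else frame["dst"]

frame = {"dst": "web-server-7", "dst_port": 80, "visited": 0}
print(next_hop(frame))                 # 'firewall', then 'load_balancer', then the server
```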

3. Why is the solution different from previous work?
Previous papers have not looked at the mess of configuring datacenters, and this paper proposes a neat solution based on the principles of policy and indirection from some well-known previous papers (i3, middle-boxes are not harmful) applied in the context of datacenters. I haven’t read the Ethane paper, so I am not sure what the exact difference between PLayer and Ethane is.

4. Does the paper (or do you) identify any fundamental/hard trade-offs?
There is increased latency in the network, as the pswitch has to look up the policy for every frame. This is probably fine in largely over-provisioned datacenters, which anyway have very small latencies, so a small increase is insignificant. But with the trend moving towards extracting every cent of investment in the datacenter, dealing with this increased latency might become important.

5. Do you think the work will be influential in 10 years? Why or why not?
As I said earlier, this paper addresses the architectural clumsiness problem of datacenters at the right time, not after people have mastered a clumsy but effective way. Principles in this paper are sure to influence datacenter management.

6. Others:
I have discussed this with Dilip before – a user study with network admins would help in evolving the right language for specifying policies.

A Scalable, Commodity Data Center Network Architecture

1. What is the problem? Is the problem real?
The common tree-based architecture of switches/routers in datacenters is inefficient and results in oversubscription and relatively low effective bandwidths. This is definitely a real problem, but it would have been nice if the paper had shown some numbers to back it up, as opposed to just listing some examples.

2. What is the solution's main idea (nugget)?
The paper proposes constructing a fat-tree out of inexpensive, commodity switches that can achieve full bisection bandwidth.
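For a sense of the numbers, here is the arithmetic for a k-ary fat-tree built from k-port switches, as I understand the construction: k pods, each with k/2 edge and k/2 aggregation switches, plus (k/2)^2 core switches, supporting k^3/4 hosts at full bisection bandwidth.

```python
# Quick arithmetic for a k-ary fat-tree built from k-port commodity switches
# (my reading of the construction in the paper).
def fat_tree(k: int) -> dict:
    return {
        "hosts": k**3 // 4,
        "edge + aggregation switches": k * k,   # k pods * (k/2 edge + k/2 agg)
        "core switches": (k // 2) ** 2,
        "total switches": k * k + (k // 2) ** 2,
    }

print(fat_tree(48))   # 48-port switches -> 27,648 hosts with 2,880 switches
```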

3. Why is the solution different from previous work?
The smart thing about this paper is using the fat-tree topology in the datacenter. Such topologies have been used in the past in supercomputers and even telephone networks; identifying their relevance to datacenters enables the authors to achieve high bandwidths using cheap commodity switches. Also, unlike most other solutions that increase performance, this one actually seems to reduce power consumption, which is a big plus in its favour.

4. Does the paper (or do you) identify any fundamental/hard trade-offs?
This topology makes load balancing harder than before, which has well-known consequences at the transport layer. The paper does talk about a centralized scheduler, but I think in practice it might be more complicated than the paper paints. The actual effects of out-of-order delivery etc. need to be checked against realistic workloads.

5. Do you think the work will be influential in 10 years? Why or why not?
Definitely. I think the solution is definitely attractive, has no major complicated requirements and anything that constructs a high performance system out of commodity hardware is bound to be influential. At the very least, the ideas of fat-trees and using commodity hardware for datacenter architectures are here to stay.

6. Others:
As mentioned earlier, I really want to know the bandwidth requirements of the current datacenter applications and whether they are hitting the limit w.r.t. what can be provided. While it does seem like we can always do with more bandwidth, we should ensure we are not solving the wrong problem in datacenters.

Monday, February 2, 2009

Crash: Data center Horror Stories

This article talks about some of the crashes that have occurred at datacenters. It aims to, albeit a little dramatically, bring out the important point about the complexity and enormity of the task of designing datacenters. Some of the examples include cabling errors, insulation faults, and a lack of clear understanding among the people who design failure-management strategies about the possible types and scale of collapses. Many of the faults are due to errors in the (physical) construction process, which makes it important for experts in different fields to put their knowledge together during the design, inspection, operation and maintenance of the datacenter. There are some general guidelines for good datacenter design and operations, including adhering to standards and careful inspection of every step, but the key is to remember that every datacenter is unique, with its own problems and complexity. Carefully thought-out custom solutions are the best approach.

Questions:
1. Is the datacenter-building industry a little young, with most data about building and operational experiences kept confidential? In that sense, with time, will the shared collective knowledge help in dramatically increasing our capability to guard against such mishaps? Are there parallels in other fields?
2. What is the trade-off between failure resilience and the cost of achieving it? Is it preposterous to suggest a strategy of “I will have some standard failure-proof strategies…if once in a while (with a very low probability) my datacenter goes down, tough luck until I restore it!”?
3. How much can software techniques help in working around some of these problems?

Failure Trends in a Large Disk Drive Population

1. What is the problem? Is the problem real?
Given the widespread usage of magnetic media, mostly hard drives, for storage, getting an understanding of the robustness of these components helps in the design, deployment and maintenance of storage systems. To that end, this paper talks about failure statistics of disk drives collected from Google’s disk drives in service. The problem is very real, and the solution takes steps in the right direction – large deployment and statistical correlation.

2. What is the solution's main idea (nugget)?
Information about temperature, activity levels and many SMART parameters was collected from the drives every few minutes and mined to find correlations with failures. The key findings are:
Activity Levels: These are weekly averages of read/write bandwidth per drive. Utilization has a confusing relationship with failure rates: very young (under a year) and very old (around five years) drives roughly show higher failure rates under high utilization, while the rest don’t (sometimes the relation is, surprisingly, inversely proportional!).
Temperature: The study debunks the widely and intuitively held belief that temperature is an important cause of failure. The data shows an inverse relation between failure rates and temperature at low and average temperatures. Only at very high temperatures (> 45°C) for older drives (> 3 years) is the relation directly proportional.
SMART Parameters:
(a) Scan Errors: These are errors in reading sectors on the drive. Drives with scan errors are ten times more likely to fail.
(b) Reallocation Counts: This happens when a sector has a read/write error and the faulty sector number is remapped to a new physical sector; it is an indication of drive surface wear. Drives with reallocation counts are 14 times more likely to fail within 60 days than drives without them.
(c) Probational Counts: This is sort of a warning of possible problems and is a good indicator of failures.
Overall, SMART parameters aren’t a great indicator of failures: over 56% of the failed drives had no count in any of the four strong SMART signals. Better parameters are needed (a rough illustration of using these signals follows below).
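As a rough illustration (field names and thresholds are invented for this sketch), a rule that flags drives reporting any of the strong signals listed above would raise a useful warning, yet by the paper’s own numbers it would still miss the majority of failures:

```python
# Toy failure-warning rule based on the "strong" SMART signals discussed
# above. Field names are invented; per the paper, such a rule still misses
# the ~56% of failed drives that never reported any of these signals.
STRONG_SIGNALS = ("scan_errors", "reallocation_count", "probational_count")

def likely_to_fail(smart: dict) -> bool:
    # Any non-zero count on a strong signal sharply raises failure probability.
    return any(smart.get(sig, 0) > 0 for sig in STRONG_SIGNALS)

drives = [
    {"id": "d1", "scan_errors": 3},
    {"id": "d2"},                    # no warning signs, but could still fail
]
print([d["id"] for d in drives if likely_to_fail(d)])   # ['d1']
```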

3. Why is the solution different from previous work?
a. The data is from a deployment whose size is unprecedented in the literature. Add to that, the drives were serving a live and popular service, which makes the data a very good basis for building models.
b. Debunks popularly held beliefs on the correlations of failure rates with utilization and temperature. Also, questions the utility of SMART parameters as effective warning mechanisms.

4. Do you think the work will be influential in 10 years?
Yes, very much. Given the size of the deployment and its questioning of intuitive relations, this work is likely to lead to building better models for disk failures.

5. Others: Some questions I had…
a. Is there anything that happens when you put a big bunch of disks together in a deployment, that is not apparent when they are by themselves? Any magnetic influences etc.? Maybe that might be the reason why some of the correlations don’t hold in big deployments while they are fine in “testing conditions”. This might be a reason to actually move towards environment-based models, a model suited to server farms, dusty households, stand-alone deployments etc.
b. The highly influential Google systems papers make me wonder about the larger question of whether data from the industry (that is often confidential) is the right (and possibly the only!) way to do data-oriented research…