1. What is the problem? Is the problem real?
The problem is of providing a lock service for a distributed system. Chubby provides such a centralized lock service for O (10000) servers. This is obviously a very real (and historical) problem in distributed systems with multiple solutions. And not only is the problem real, its implementation with a  real large system is another beast in itself – the details in this paper go a long way in understanding a practically deployed and effective lock service for a large distributed system.
2. What is the solution's main idea (nugget)?
Chubby is a centralized lock service (as opposed to a distributed client-based one) for Google’s distributed systems. A Chubby cell consists of a periodically-elected master and a small set of replicas. Locks can be exclusive or shared, but are not mandatory (for maintenance reasons as well no real value-add in a typical production environment). Caching helps in scalability – cache entries are invalidated on a modification as opposed to an update scheme as the latter might burden the clients with an unbounded number of updates. The master has the overhead of maintaining the cache list of every client. Blocking keep-alives are used to maintain sessions between clients and masters.
3. Why is solution different from previous work?
This paper, self-admittedly, is about the engineering experience and not new protocols. The details in this paper are highly valuable, but its biggest contribution is to describe the design of a centralized lock service in a large operational distributed system taking care of client’s implementation complexities, and corner cases for scaling and failures.
4. Does the paper (or do you) identify any fundamental/hard trade-offs?
Being an implementation experience, this paper is about multiple trade-offs that are sort of classic in systems research – ease of deployment and implementation vs. correctness. (1) Centralized lock service vs. client-only library for distributed consensus, (2) No mandatory locks for ease of administration and it anyway doesn’t fix the problem, (3) Blocking modifications to ensure cache invalidation and consistency: caching is important for scalability, and this ensures simplicity at the expense of slow modifications.
5. Do you think the work will be influential in 10 years? Why or why not?
Without a doubt! Being used in a production environment as large as Google’s is influential enough. Handling of the practical issues to achieve scalability and clarity means that its adoption is imminent. And its simplicity provides a clean basis to add modifications on top as the requirements of services change.
6. Others:
Some unclear things:
a. For a large number of clients, is this invalidation strategy actually scalable? They provide no numbers to back it, other than mentioning that a Chubby service once had 90,000 clients. More quantitative details would have helped.
b. Why is the keep-alive scheme blocking? What happens if the master returns the call immediately, and the put the onus on the client to send its keep-alive messages before timing out?
c. Multiple Chubby cells: In this case, is there an overlap in the set of servers being served by different Chubby cells? If yes, how do they maintain consistency among the cells themselves?
Thoughts:
a. Are updates a better idea than invalidation sometimes? We could have a time period (and possibly a frequency) for updates limiting clients to not be indefinitely and unboundedly bombarded. This will make the modifications return quicker, as opposed to the current blocking model.
b. How about multiple masters for a set of servers, with a multicast protocol like SRM for maintaining consistency among the masters? This would help in scalability.
Subscribe to:
Post Comments (Atom)
 
Nice post. Have you heard about CloudSlam 09 which is 1st annual conference. I got a wonderful experience through the conference.
ReplyDelete