Wednesday, March 4, 2009

MapReduce



1.       What is the problem? Is the problem real?

Processing large amounts of large data is crucial for enabling innovation in Internet companies like Google, Microsoft etc. MapReduce is a system for enabling this processing on large distributed systems. This paradigm captures a large number of applications including count of url access frequency, reverse web-link graph and inverted index calculation. With more of users’ activities migrating to the web, logging and the consequent sizes of the logs will obviously increase necessitating the need for such systems. 

2.       What is the solution's main idea (nugget)?

The user-defined map function converts the input records to intermediate key/value pairs which in turn in is converted to the output pairs using the user-defined reduce function. The system in this paper automatically executes the MapReduce operation on large clusters of commodity PCs and is resilient towards machine failures and stragglers. The input data is split into M parts and starts copies of the program on a cluster with one of the machines appointed as the master, the rest being workers. The master picks idle workers to assign map tasks. The intermediate key/value pairs are buffered in R-partitioned local memory of the workers. Reduce workers pick up the intermediate data using RPC calls, sorts it, applies the Reduce function and write their output to a global file. The output is a set of R files. The master keeps track of the state of all the workers and pings them periodically to check if they are alive. The map tasks that are allotted to a failed worker is reset and allotted to another worker. Network bandwidth is conserved by attempting to schedule a map task on a machine that has the data locally or in the same switch.  The system produces very good performance under varying failure conditions and is heavily used in the Google’s daily operations. 

3.       Why is solution different from previous work?

Programming models that enable automatic parallelization is not novel in itself, but this paper’s core contribution is a simplified but powerful abstraction derived from prior work, and automatically parallelizing and executing it over large distributed systems transparently providing fault tolerance. 

4.       Does the paper (or do you) identify any fundamental/hard trade-offs?

One, it is not a general purpose programming model. They trade off generality and restrict the programming model to enable automatic parallelization over a large distributed system and transparently provide fault-tolerance. Second, they stick to a simple design and trade off some efficiency – (i) the master is not replicated (if it fails, bad luck!), and (ii) intermediate state from workers are not collected leading to back-up tasks starting afresh and doing redundant work. 

5.       Do you think the work will be influential in 10 years? Why or why not?

This has already become widely used within Google (I hear the first thing an intern learns in Google is to write a MapReduce task!), and is definitely influential for companies that provide internet services. Of course, innovation will continue to provide more efficient implementations of this paradigm. 

6.       Others:

A couple of suggestions/doubts:

a.       Worker Machines:

                                                               i.      Why isn’t the intermediate values from the map workers stored in a temporary global file, with possibly a status log too (on how much it has read from the input etc.)? It is a complete waste of resources and time when a worker fails, especially when it has almost or fully completed its work.

                                                             ii.      Stragglers: Before allotting the work to backup machines, it can get whatever intermediate output the straggler has produced and allot just the remaining to the backup worker. This seems all the more do-able because the straggler is just slow and not failed.

·         The picking of backup machines could be more sophisticated by keeping track of the workers’ performance history (if available).

b.      Locality for network bandwidth conservation:

                                                               i.      How does the master know if two machines are in the same network switch? Keeping track of it seems a daunting task…is there a smart naming/addressing strategy?

                                                             ii.      In addition to the network proximity, the load on the current worker should also be taken into account.

                                                            iii.      Finally, the proximity of the client and the output files should also be taken into account.

 

No comments:

Post a Comment