Monday, March 30, 2009

DTrace

1.       What is the problem? Is the problem real?

Dynamic instrumentation of production systems to understand performance bugs. The dynamic part is key: it implies near-zero overhead (probes cost nothing when disabled) and no application restart. The problem is real, and DTrace will be a highly useful tool.

2.       What is the solution's main idea (nugget)?

DTrace provides a framework to instrument code using its API. Instrumentation is done via probes. Each probe has a condition (a predicate); when the predicate is satisfied, an associated action is performed. The tracing code is kept “safe” by appropriate mechanisms in the DIF virtual machine.
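
As a rough illustration of the probe/predicate/action model, here is a toy Python sketch (the class and event fields are hypothetical stand-ins of my own; real DTrace probes are written in its D language and run inside the kernel's DIF virtual machine):

```python
# Toy sketch of the probe/predicate/action model (hypothetical names,
# not DTrace's real API): a probe fires, its predicate filters the event,
# and only then does the action run.

class Probe:
    def __init__(self, name, predicate, action):
        self.name = name
        self.predicate = predicate   # filters out "uninteresting" events
        self.action = action         # runs only when the predicate holds

    def fire(self, event):
        if self.predicate(event):
            self.action(event)

# Instrument "syscall read entry" events, but only record slow reads.
slow_reads = Probe(
    name="syscall::read:entry",
    predicate=lambda e: e["latency_us"] > 1000,
    action=lambda e: print(f"pid {e['pid']} slow read: {e['latency_us']}us"),
)

slow_reads.fire({"pid": 42, "latency_us": 2500})   # recorded
slow_reads.fire({"pid": 43, "latency_us": 10})     # filtered out
```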

3.       Why is the solution different from previous work?

It is unclear to me what the key differences w.r.t. prior work are. The section on related work makes me believe that DTrace’s primary difference is the framework as a whole – a neat mix of multiple ideas from similar projects.

4.       Does the paper (or do you) identify any fundamental/hard trade-offs?

This may not be a trade-off per se; it is more a generality vs. simplicity point. Predicates are a mechanism for filtering out “uninteresting” events, but the problem for production engineers might be identifying such events in the first place. Complaints to the effect of “get me something that just works” might lead to follow-on work on top of DTrace. Nonetheless, DTrace’s generality will turn out to be an enabler for such work.

5.       Do you think the work will be influential in 10 years? Why or why not?

Yes, I think so. I suspect corporations might already have their own tracing frameworks that contain key ideas from DTrace.

Wednesday, March 4, 2009

Dryad

Dryad seems to be Microsoft's answer to MapReduce, and is logically an extension of it. MapReduce identified a key paradigm that was super-simple. But as mentioned in the Pig Latin paper, and in Chris's talk on Monday, this paradigm is not quite sufficient. Dryad allows much more general computational DAGs, with vertices implementing arbitrary computations and using any number of inputs and outputs.
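
To make the contrast concrete, here is a minimal Python sketch of the DAG idea (the Vertex class and run function are hypothetical stand-ins of my own, not Dryad's actual API): vertices run arbitrary functions over any number of inputs, something MapReduce's fixed two-stage pipeline cannot express directly.

```python
# Minimal sketch of a Dryad-style computational DAG (hypothetical API):
# each vertex applies an arbitrary function to the outputs of its inputs.

class Vertex:
    def __init__(self, name, fn):
        self.name, self.fn, self.inputs = name, fn, []

def run(vertex):
    # Recursively evaluate upstream vertices, then apply this vertex's function.
    return vertex.fn(*[run(v) for v in vertex.inputs])

# A three-vertex DAG: two independent sources feeding a join vertex.
parse_a = Vertex("parse_a", lambda: [("x", 1), ("y", 2)])
parse_b = Vertex("parse_b", lambda: [("x", 10)])
join = Vertex("join",
              lambda a, b: [(k, v, w) for k, v in a for k2, w in b if k == k2])
join.inputs = [parse_a, parse_b]

print(run(join))   # [('x', 1, 10)]
```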

Dryad trades off simplicity for generality. It is a much more complex programming model, requiring developers to understand the structure of the computation and the properties of the system resources. This system will definitely be influential, but not in its current form; it somehow doesn't quite have the same elegance as MapReduce. I think that to be adopted on a wide scale, the system will have to be refined to produce simpler higher-level "macros" capturing typical queries. But this work will lead towards the identification and building of such programming models.

MapReduce



1.       What is the problem? Is the problem real?

Processing large amounts of data is crucial for enabling innovation in Internet companies like Google, Microsoft, etc. MapReduce is a system for enabling this processing on large distributed systems. The paradigm captures a large number of applications, including counting URL access frequency, computing the reverse web-link graph, and building inverted indices. With more of users’ activities migrating to the web, logging and the consequent sizes of the logs will obviously increase, necessitating such systems.

2.       What is the solution's main idea (nugget)?

The user-defined map function converts input records to intermediate key/value pairs, which in turn are converted to output pairs by the user-defined reduce function. The system in this paper automatically executes the MapReduce operation on large clusters of commodity PCs and is resilient to machine failures and stragglers.

The input data is split into M parts, and copies of the program are started on a cluster, with one machine appointed as the master and the rest as workers. The master picks idle workers to assign map tasks. The intermediate key/value pairs are buffered on the map workers and written to their local disks, partitioned into R regions. Reduce workers pick up the intermediate data using RPC calls, sort it, apply the reduce function, and write their output to a global file system; the final output is a set of R files. The master keeps track of the state of all the workers and pings them periodically to check that they are alive; map tasks allotted to a failed worker are reset and reassigned to another worker. Network bandwidth is conserved by attempting to schedule a map task on a machine that has the data locally or is on the same switch. The system performs very well under varying failure conditions and is heavily used in Google’s daily operations.
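
The paper's word-count example captures the user-visible contract; below is a minimal, sequential Python sketch of it (a real deployment shards the input into M map tasks and R reduce partitions across the cluster; the function names here are mine):

```python
# Sequential sketch of the MapReduce programming model, using the paper's
# word-count example. Only the map/reduce contract is shown; sharding,
# fault tolerance, and the shuffle over RPC are elided.

from collections import defaultdict

def map_fn(doc):
    # Emit an intermediate (word, 1) pair per occurrence.
    return [(word, 1) for word in doc.split()]

def reduce_fn(key, values):
    # Sum all counts emitted for the same word.
    return (key, sum(values))

def mapreduce(docs):
    intermediate = defaultdict(list)
    for doc in docs:                      # "map phase"
        for k, v in map_fn(doc):
            intermediate[k].append(v)     # shuffle: group values by key
    return [reduce_fn(k, vs) for k, vs in sorted(intermediate.items())]

print(mapreduce(["the cat", "the dog"]))  # [('cat', 1), ('dog', 1), ('the', 2)]
```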

3.       Why is the solution different from previous work?

Programming models that enable automatic parallelization are not novel in themselves; this paper’s core contribution is a simplified but powerful abstraction derived from prior work, together with a system that automatically parallelizes and executes it over large distributed systems while transparently providing fault tolerance.

4.       Does the paper (or do you) identify any fundamental/hard trade-offs?

First, it is not a general-purpose programming model: they trade off generality and restrict the programming model to enable automatic parallelization over a large distributed system and to transparently provide fault tolerance. Second, they stick to a simple design and trade off some efficiency – (i) the master is not replicated (if it fails, bad luck!), and (ii) intermediate state from workers is not collected, so backup tasks start afresh and do redundant work.

5.       Do you think the work will be influential in 10 years? Why or why not?

This has already become widely used within Google (I hear the first thing an intern at Google learns is to write a MapReduce task!), and it is definitely influential for companies that provide Internet services. Of course, innovation will continue to provide more efficient implementations of this paradigm.

6.       Others:

A couple of suggestions/doubts:

a.       Worker Machines:

        i.      Why aren’t the intermediate values from the map workers stored in a temporary global file, possibly with a status log too (how much of the input has been read, etc.)? It is a complete waste of resources and time when a worker fails, especially when it has almost or fully completed its work.

        ii.      Stragglers: Before allotting the work to backup machines, the master could fetch whatever intermediate output the straggler has produced and allot just the remaining work to the backup worker. This seems all the more doable because the straggler is merely slow, not failed.

            ·      The picking of backup machines could also be more sophisticated, e.g., by keeping track of the workers’ performance history (if available).

b.      Locality for network bandwidth conservation:

        i.      How does the master know if two machines are on the same network switch? Keeping track of this seems a daunting task… is there a smart naming/addressing strategy? (A toy sketch of locality-preferenced scheduling appears after this list.)

        ii.      In addition to network proximity, the load on the candidate worker should also be taken into account.

        iii.      Finally, the proximity of the client to the output files should also be taken into account.
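
On the locality question above, here is a toy Python sketch of what locality-preferenced scheduling might look like (entirely hypothetical names and data; the paper only says the master uses the location information of input replicas): prefer a worker holding the split, then one on the same switch/rack as a replica, then any idle worker.

```python
# Toy sketch (hypothetical, not from the paper) of locality-preferenced
# scheduling: local replica first, then same switch/rack, then anyone idle.

def pick_worker(task, idle_workers, rack_of, replicas):
    # replicas[task]: machines holding a copy of the task's input split
    local = [w for w in idle_workers if w in replicas[task]]
    if local:
        return local[0]                    # input is already on this machine
    racks = {rack_of[m] for m in replicas[task]}
    near = [w for w in idle_workers if rack_of[w] in racks]
    if near:
        return near[0]                     # same switch/rack as a replica
    return idle_workers[0] if idle_workers else None

rack_of = {"m1": "r1", "m2": "r1", "m3": "r2"}
replicas = {"map-7": ["m2"]}
print(pick_worker("map-7", ["m1", "m3"], rack_of, replicas))  # m1 (same rack)
```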