Monday, April 20, 2009

WheelFS

1.       What is the problem? Is the problem real?

Applications have varying requirements for consistency, replication, availability, etc., and often would like control over these settings. WheelFS is a solution to that end. This is a very practical problem, with different systems wanting to turn these knobs differently.

2.       What is the solution's main idea (nugget)?

The key contribution of this paper is a set of cues with which applications can specify the properties they would prefer from the distributed file system. These cues give applications control over the trade-offs between consistency, availability, and placement.
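
To make this concrete: if I recall correctly, the cues ride along in pathnames, so ordinary POSIX code can use them. Below is a rough Python sketch of what that might look like; the mount point, cue names, and values are my own illustrative assumptions, not quotes from the paper.

```python
# Illustrative sketch only: the mount point, cue names, and values below are
# assumptions for the sake of the example, not copied from the paper.

def cache_thumbnail(data: bytes) -> None:
    # A web cache prefers availability and bounded delay over strict
    # consistency, so it writes under a path that relaxes those defaults.
    relaxed = "/wfs/cache/.EventualConsistency/.MaxTime=1000/thumb_1234.jpg"
    with open(relaxed, "wb") as f:   # plain POSIX I/O; the cues are in the name
        f.write(data)

def read_lock_file() -> bytes:
    # A coordination file keeps the strict default consistency, so its path
    # carries no cues at all.
    with open("/wfs/locks/job42", "rb") as f:
        return f.read()
```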

3.       Why is solution different from previous work?

The solution is different from previous work in that WheelFS provides explicit high-level knobs to control the functioning of the system. But I did find it hard to pinpoint the contrast with prior work.

4.       Do you think the work will be influential in 10 years? Why or why not?

Yes, I definitely think these principles will be influential. But to be fair, many of these “cues” have already been deployed in specialized settings, and it will be interesting to see how people react to this “generic” architecture. In my opinion, the preference would still be for a specialized solution that is tailored to the scenario at hand and works.

 

Scaling Out – Facebook

1.       What is the problem? Is the problem real?

As web operations increase in demand and scale, a single datacenter is no longer sufficient to handle the load. Also, having everything in one physical location makes it a single point of failure – transient (power failures, network failures) or permanent (earthquake). The problem is how to maintain consistency among the datacenters without compromising responsiveness or correctness. Very real problem!

 

2.       What is the solution's main idea (nugget)?

In the standard (web server, memcache, database) architecture, the issue with multiple datacenters is the replication lag when data is updated: the modified value has to be propagated to all the databases, and the stale data has to be removed from the caches. The solution addresses exactly that by adding extra information to the replication stream, so that data is removed from the cache when the corresponding database replica is updated. Also, since only the master database can accept writes, the layer 7 load balancers decide where to send a user (master or slave datacenter) depending on the URI (which indicates the operation).
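
A toy sketch of those two mechanisms, with invented names and APIs (the real system extends MySQL replication and Facebook's own load balancers, not Python): the slave datacenter applies each replicated statement and only then drops the memcache keys the master piggybacked on it, and the layer 7 balancer sends write-implying URIs to the master region.

```python
# Hypothetical sketch: ReplicatedStatement, the key list, and the clients are
# all made up to illustrate piggybacking invalidations on the replication stream.

class ReplicatedStatement:
    def __init__(self, sql, dirty_memcache_keys):
        self.sql = sql
        self.dirty_memcache_keys = dirty_memcache_keys  # attached at the master

def apply_on_slave(stmt, slave_db, memcache):
    slave_db.execute(stmt.sql)            # bring the local replica up to date
    for key in stmt.dirty_memcache_keys:
        memcache.delete(key)              # only now is it safe to drop the stale copy

WRITE_URI_PREFIXES = ("/ajax/updatestatus.php", "/editprofile.php")  # illustrative

def pick_datacenter(uri):
    # Layer 7 routing: URIs that imply a write go to the master region;
    # read-only traffic can stay in the nearest replica datacenter.
    return "master" if uri.startswith(WRITE_URI_PREFIXES) else "local-replica"
```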

 

3.       Why is solution different from previous work?

It is a practical implementation of some distributed-systems concepts for maintaining consistency, and it has been working well for nine months.

 

4.       Does the paper (or do you) identify any fundamental/hard trade-offs?

(i) Simplicity vs. performance: Having just one master database that can accept writes simplifies write operations and update propagation, but it can add latency for users when they want to write, and probably creates scaling issues as well.

(ii) Performance vs. correctness: During the replication lag, users are still directed to the old data. Only the master needs to apply the update before a write returns, which makes the system more responsive. This is in contrast to a scheme that completes writes only after all databases have committed the values. This scheme suits Facebook because the workload is probably not write-intensive.

 

5.       Do you think the work will be influential in 10 years? Why or why not?

Even if not exactly this, principles of CAP will be influential in the vastly growing world of web operations.


Wednesday, April 15, 2009

Portable Cloud Computing, Google AppEngine

I will club the other articles for this class together as they touch upon the same theme. 

Two commercial options for using the cloud are available now – Amazon’s S3/EC2 and Google’s AppEngine. The former essentially provides just the machines and resources and lets users do whatever they want. The latter is a more structured approach: it gives users a set of APIs for the cloud facilities (like the Google Query Language (GQL) to access the datastore) and hosts their applications. While AppEngine is particularly attractive because it automatically gives applications access to all the nice scalability features, there is a warning that it has the potential to tie applications to the Google API for using clouds. For example, you cannot take an EC2 service and run it on AppEngine, while the reverse is possible. While AppDrop does help in porting AppEngine applications to run flawlessly on EC2, it comes at the cost of scalability. True, someone can still hack in and provide all the database and scalability support, but this is an ugly and potentially dangerous way to move forward.

This calls for the community to take stock of the situation and push towards a standard and open cloud API, with open source implementations. If you are looking for an inspirational model, there is always LAMP! :-)

The Open Cloud Manifesto

1.       What is the problem? Is the problem real?

Cloud computing is in its infancy now, and users of the cloud range from big corporations to small users relying on the cloud for “hosting” abilities. This paper aims to start a discussion to understand the benefits and risks of cloud computing – very real problem!

2.       What is the solution's main idea (nugget)?

It is important for the community to come up with a set of open standards that enable innovation below the API, with different organizations deploying different techniques, but without tying applications to any particular interface. Applications should be able to seamlessly “shift” across clouds. Also, if clouds were to become a “service”, it is imperative that there are tight security guarantees as well as proper metering and monitoring systems.

3.       Does the paper (or do you) identify any fundamental/hard trade-offs?

While third-party cloud providers (even proprietary ones) greatly reduce the overhead for startups, they have the long-term effect of possibly tying the application to the specific set of interfaces needed to use the cloud. Likewise, the guarantees third-party clouds provide against data leakage and the like are not strong. This makes the prospect of being tied to a particular cloud provider even shakier.

4.       Do you think the work will be influential in 10 years? Why or why not?

I think this will be influential. The emergence of an open standard for cloud providers seems imperative, more so because the deployment seems to be progressing hand-in-hand with a reasonable revenue model. Also, the fact that this paper pushes towards good monitoring and metering means that it is serious about this being commercially viable.

5.       Others:

a. Third-party clouds being shared by different corporations/users presents a great opportunity to reduce power wastage.

Monday, March 30, 2009

DTrace

1.       What is the problem? Is the problem real?

Dynamic instrumentation of production systems to understand performance bugs. The dynamic part is key, as it implies near-zero overhead when tracing is not in use and no application restart. The problem is real, and DTrace will be a highly useful tool.

2.       What is the solution's main idea (nugget)?

DTrace provides a framework for instrumenting a running system through its API. Instrumentation is done via probes: a probe may carry a condition (predicate), and when the predicate is satisfied an associated action is performed. The tracing code is kept “safe” by appropriate mechanisms in the DIF virtual machine.
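
For intuition, here is a toy Python model of the probe/predicate/action flow; this is not DTrace's actual D language, and the event fields and aggregation are invented for illustration.

```python
# Conceptual sketch of probe -> predicate -> action; the "event" dictionary
# and its fields are made up, not part of any real DTrace interface.

from collections import Counter

read_counts = Counter()

def predicate(event):
    # e.g. "only trace reads issued by a particular process"
    return event["execname"] == "httpd"

def action(event):
    read_counts[event["filename"]] += 1   # aggregate instead of logging raw events

def on_probe_fire(event):
    if predicate(event):                   # cheap filter keeps the overhead low
        action(event)
```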

3.       Why is solution different from previous work?

It is unclear to me what the key differences w.r.t. prior work are. The section on related work makes me believe that DTrace’s primary difference is the framework as a whole – a neat mix of multiple ideas from similar projects.

4.       Does the paper (or do you) identify any fundamental/hard trade-offs?

This may not be a trade-off per se; it is more a generality vs. simplicity point. Predicates are a mechanism for filtering out “uninteresting” events, but the problem for production engineers might be identifying which events those are in the first place. Complaints to the effect of “get me something that just works” might lead to follow-on work on top of DTrace. Nonetheless, DTrace’s generality will turn out to be an enabler for such work.

5.       Do you think the work will be influential in 10 years? Why or why not?

Yes, I think so. I suspect corporations might already have their own tracing frameworks that contain key ideas from DTrace.

Wednesday, March 4, 2009

Dryad

Dryad seems to be Microsoft's answer to MapReduce, and is logically an extension of it. MapReduce identified a key paradigm that was super-simple. But as mentioned in the Pig Latin paper, and in Chris's talk on Monday, this paradigm is not quite sufficient. Dryad allows much more general computational DAGs, with vertices implementing arbitrary computations and using any number of inputs and outputs.
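
As a rough illustration (the vertex and edge notation here is my own, not Dryad's real graph-construction operators), a job is described as an arbitrary DAG rather than a fixed map-then-reduce pipeline:

```python
# Hypothetical DAG description, just to contrast with MapReduce's fixed
# two-stage shape. Each vertex would run an arbitrary program; each edge is
# a data channel between vertices.

dag_edges = {
    "parse_0": ["agg_0", "agg_1"],   # every parser streams to both aggregators
    "parse_1": ["agg_0", "agg_1"],
    "agg_0":   ["join"],
    "agg_1":   ["join"],
    "join":    [],                   # final vertex writes the output
}

def downstream(vertex):
    """Vertices that consume this vertex's output."""
    return dag_edges[vertex]
```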

Dryad trades off simplicity for generality. It is a much more complex programming model, requiring developers to understand the structure of the computation and the properties of the system resources. This system will definitely be influential, but not in its current form; it somehow doesn't quite have the same elegance as MapReduce. I think that to be adopted on a wide scale, the system will have to be refined to provide simpler higher-level "macros" capturing typical queries. But this work will lead towards the identification and building of such programming models.

MapReduce



1.       What is the problem? Is the problem real?

Processing large amounts of data is crucial for enabling innovation in Internet companies like Google, Microsoft, etc. MapReduce is a system for enabling this processing on large distributed systems. The paradigm captures a large number of applications, including counting URL access frequency, computing the reverse web-link graph, and building inverted indices. With more of users’ activities migrating to the web, logging and the consequent size of the logs will only increase, necessitating such systems.
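
For concreteness, a minimal sketch of what the user writes for one of those applications (URL access counts), in Python rather than the paper's C++; the log format assumed here (URL as the first whitespace-separated field) is an assumption for the example.

```python
# User-supplied pieces only; splitting the input, shuffling, and fault
# tolerance are the framework's job. Assumes each log record's first
# whitespace-separated field is the URL (an assumption for this sketch).

def map_fn(log_line):
    url = log_line.split()[0]
    yield url, 1                 # emit (url, 1) for every access

def reduce_fn(url, counts):
    yield url, sum(counts)       # all counts for one URL reach the same reducer
```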

2.       What is the solution's main idea (nugget)?

The user-defined map function converts input records into intermediate key/value pairs, which in turn are converted into output pairs by the user-defined reduce function. The system in this paper automatically executes a MapReduce operation on large clusters of commodity PCs and is resilient to machine failures and stragglers. The input data is split into M parts, and copies of the program are started on the cluster, with one machine appointed as the master and the rest as workers. The master picks idle workers and assigns them map tasks. The intermediate key/value pairs are buffered and written to the workers’ local disks, partitioned into R regions. Reduce workers pick up the intermediate data using RPCs, sort it, apply the reduce function, and write their output to the global file system; the output is a set of R files. The master keeps track of the state of all the workers and pings them periodically to check that they are alive. Map tasks allotted to a failed worker are reset and reassigned to other workers. Network bandwidth is conserved by attempting to schedule a map task on a machine that has the data locally, or on one behind the same switch. The system performs very well under varying failure conditions and is heavily used in Google’s daily operations.
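
To tie those steps together, here is a single-process sketch of the data flow (M input splits, intermediate pairs hash-partitioned into R regions, each region sorted and reduced). The real system distributes these phases across workers and adds the fault tolerance and locality described above; the partitioning-by-hash and R default are standard but the exact details are assumptions here.

```python
from collections import defaultdict

def run_mapreduce(splits, map_fn, reduce_fn, R=4):
    """Toy, single-machine imitation of the M-map / R-reduce data flow."""
    # Map phase: each of the M splits produces intermediate pairs, which are
    # partitioned into R regions by hashing the key.
    regions = [defaultdict(list) for _ in range(R)]
    for split in splits:                       # one map task per input split
        for record in split:
            for k, v in map_fn(record):
                regions[hash(k) % R][k].append(v)

    # Reduce phase: one reduce task per region sorts by key and applies the
    # user's reduce function; the result corresponds to the R output files.
    outputs = []
    for region in regions:
        out = {}
        for k in sorted(region):
            for key, value in reduce_fn(k, region[k]):
                out[key] = value
        outputs.append(out)
    return outputs
```

Feeding this the map_fn/reduce_fn sketched earlier, with two lists of log lines as the splits, would yield per-URL totals spread across the R outputs.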

3.       Why is solution different from previous work?

Programming models that enable automatic parallelization are not novel in themselves; this paper’s core contribution is a simplified but powerful abstraction, derived from prior work, that is automatically parallelized and executed over large distributed systems while transparently providing fault tolerance.

4.       Does the paper (or do you) identify any fundamental/hard trade-offs?

First, it is not a general-purpose programming model: the authors trade off generality and restrict the programming model to enable automatic parallelization over a large distributed system with transparent fault tolerance. Second, they stick to a simple design and trade off some efficiency – (i) the master is not replicated (if it fails, bad luck!), and (ii) intermediate state from workers is not collected, leading to backup tasks starting afresh and doing redundant work.

5.       Do you think the work will be influential in 10 years? Why or why not?

This has already become widely used within Google (I hear the first thing an intern learns in Google is to write a MapReduce task!), and is definitely influential for companies that provide internet services. Of course, innovation will continue to provide more efficient implementations of this paradigm. 

6.       Others:

A couple of suggestions/doubts:

a.       Worker Machines:

    i.      Why aren’t the intermediate values from the map workers stored in a temporary global file, along with possibly a status log (on how much of the input has been read, etc.)? It is a complete waste of resources and time when a worker fails, especially when it has almost or fully completed its work.

    ii.      Stragglers: Before allotting the work to backup machines, the master could fetch whatever intermediate output the straggler has produced and allot just the remaining work to the backup worker. This seems all the more doable because the straggler is merely slow, not failed.

    ·         The picking of backup machines could be more sophisticated by keeping track of the workers’ performance history (if available).

b.      Locality for network bandwidth conservation:

    i.      How does the master know if two machines are behind the same network switch? Keeping track of it seems a daunting task… is there a smart naming/addressing strategy?

    ii.      In addition to network proximity, the load on the candidate worker should also be taken into account.

    iii.      Finally, the proximity of the client and the output files should also be taken into account.