Monday, April 20, 2009

WheelFS

1.       What is the problem? Is the problem real?

Applications have varying requirements for consistency, replication, availability, etc., and would often like control over these settings. WheelFS is a distributed file system built to that end. This is a very practical problem, with different systems wanting to turn these knobs differently.

2.       What is the solution's main idea (nugget)?

The key contribution of this paper is a set of cues with which applications can specify the properties they would prefer from the distributed file system. These cues give applications finer control over the trade-offs between consistency, availability, and placement.
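If I remember the design right, the cues are embedded directly in path names, so unmodified applications can use them through the ordinary file interface. A rough sketch of how such a path might be interpreted (the cue names and defaults below are from memory and partly illustrative, not WheelFS's exact set):

```python
# Rough sketch of parsing path-embedded cues; names/defaults are illustrative.
DEFAULTS = {"consistency": "close-to-open", "rep_level": 3, "max_time_ms": None}

def parse_cues(path):
    """Split a path like /wfs/.EventualConsistency/.RepLevel=2/cache/img
    into (clean_path, settings)."""
    settings = dict(DEFAULTS)
    kept = []
    for part in path.split("/"):
        if part.startswith("."):              # a cue, not a real directory
            name, _, value = part[1:].partition("=")
            if name == "EventualConsistency":
                settings["consistency"] = "eventual"
            elif name == "RepLevel":
                settings["rep_level"] = int(value)
            elif name == "MaxTime":
                settings["max_time_ms"] = int(value)
        else:
            kept.append(part)
    return "/".join(kept), settings

print(parse_cues("/wfs/.EventualConsistency/.MaxTime=250/cache/img.jpg"))
# ('/wfs/cache/img.jpg', {'consistency': 'eventual', 'rep_level': 3, 'max_time_ms': 250})
```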

3.       Why is solution different from previous work?

The solution is different from previous work in that WheelFS provides explicit, high-level knobs to control the functioning of the system. That said, I found it hard to clearly articulate the contrast with prior work.

4.       Do you think the work will be influential in 10 years? Why or why not?

Yes, I definitely think these principles will be influential. But to be fair, many of these “cues” have already been deployed in specialized settings, and it will be interesting to see how people react to this “generic” architecture. In my opinion, the preference would still be for a specialized solution that is tailored to the scenario at hand and works.

 

Scaling Out – Facebook

1.       What is the problem? Is the problem real?

As web operations grow in demand and scale, a single datacenter is no longer sufficient to handle the load. Also, being in one physical location makes it a single point of failure – transient (power failures, network failures) or permanent (earthquake). The problem is how to maintain consistency among datacenters without compromising responsiveness or correctness. Very real problem!

 

2.       What is the solution's main idea (nugget)?

In a standard (web server, memcache, database) architecture, the issue with multiple datacenters is the replication lag when data is updated. The modified value has to be propagated to all the databases, and stale data has to be removed from the caches. The solution addresses exactly that by adding extra information to the replication stream, so that cached data is invalidated when the corresponding database is updated. Also, since only the master database can accept writes, the layer 7 load balancers decide whether to send a user to the master or to a slave database depending on the URI (which indicates the operation).
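A minimal sketch of the two mechanisms as I understand them – the layer 7 routing decision and the cache invalidations piggybacked on the replication stream. The URIs and object interfaces below are hypothetical, not Facebook's actual code:

```python
# Sketch of the two mechanisms described above (URIs/objects are hypothetical).

WRITE_PATHS = ("/ajax/updatestatus.php", "/ajax/addcomment.php")  # hypothetical write URIs

def pick_datacenter(uri, user_region):
    """Layer-7 routing decision: writes must go to the master datacenter,
    reads can be served by the nearest replica."""
    if uri.startswith(WRITE_PATHS):
        return "master"
    return user_region   # nearest slave datacenter

def apply_replicated_statement(sql, dirty_keys, database, memcache):
    """At a replica: apply a statement from the replication stream, then
    invalidate the memcache keys piggybacked on it, so readers never see
    stale cached data once the local database has been updated."""
    database.execute(sql)
    for key in dirty_keys:
        memcache.delete(key)
```

Here `database` and `memcache` stand for whatever client objects the replica tier uses; the point is only the ordering: apply the write, then drop the stale cache entries.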

 

3.       Why is solution different from previous work?

It is a practical implementation of some distributed-systems concepts for maintaining consistency, and it has been working well for nine months.

 

4.       Does the paper (or do you) identify any fundamental/hard trade-offs?

(i) Simplicity vs. performance: Having just one master database that can accept writes simplifies write operations and update propagation, but it may add latency for users who want to write, and probably creates scaling issues as well.

(ii) Performance vs. correctness: During the replication lag, users are still directed to the old data. Only the master needs to commit the data for a write to return, making the site more responsive. This is in contrast to a scheme that completes writes only after all databases have committed the values. This scheme suits Facebook because the workload is probably not write-intensive.

 

5.       Do you think the work will be influential in 10 years? Why or why not?

Even if not exactly this, the principles of CAP will be influential in the vastly growing world of web operations.


Wednesday, April 15, 2009

Portable Cloud Computing, Google AppEngine

I will club the other articles for this class together as they touch upon the same theme. 

Two commercial options for using the cloud are available now – Amazon’s S3/EC2 and Google’s AppEngine. The former essentially provides just the machines and resources and lets users do whatever they want. The latter is a more structured approach: it gives users a set of APIs for the cloud’s facilities (like the Google Query Language (GQL) to access the datastore) and hosts their applications. While AppEngine is particularly attractive because it automatically gives applications access to all the nice scalability features, there is a warning that it has the potential to tie applications to the Google API for using clouds. For example, you cannot take an EC2 service and run it on AppEngine, while the reverse is possible. While AppDrop does help in porting AppEngine applications to run flawlessly on EC2, it comes at the cost of scalability. True, someone could still hack in and provide all the database and scalability support, but this is an ugly and potentially dangerous way to move forward.

This calls for the community to take stock of the situation and push towards a standard and open cloud API, with open source implementations. If you are looking for an inspirational model, there is always LAMP! :-)

The Open Cloud Manifesto

1.       What is the problem? Is the problem real?

Cloud computing is in its infancy now, and its users range from big corporations to small users relying on the cloud for “hosting” abilities. This paper aims to start a discussion to understand the benefits and risks of cloud computing – a very real problem!

 2.       What is the solution's main idea (nugget)?

It is important for the community to come up with a set of open standards that enable innovation below the API, with different organizations deploying different techniques, without tying applications to any particular interface. Applications should be able to seamlessly “shift” across clouds. Also, if clouds are to become a “service”, it is imperative that there are tight security guarantees as well as proper metering and monitoring systems.

 3.       Does the paper (or do you) identify any fundamental/hard trade-offs?

While third-party cloud providers (even if proprietary) greatly reduce the overhead for startups, they have the long-term effect of possibly tying an application to the specific set of interfaces needed to use the cloud. Likewise, the guarantees third-party clouds provide against data leakage and the like are not strong. This makes the prospect of being tied to a particular cloud provider even shakier.

 4.       Do you think the work will be influential in 10 years? Why or why not?

I think this will be influential. The emergence of an open standard for cloud providers seems imperative, more so because the deployment seems to be progressing hand-in-hand with a reasonable revenue model. Also, the fact that this paper pushes towards good monitoring and metering means that it is serious about this being commercially viable.

 5.       Others:

a. Third-party clouds being shared by different corporations/users presents a great opportunity to reduce power wastage.

Monday, March 30, 2009

DTrace

1.       What is the problem? Is the problem real?

Dynamic instrumentation of production systems to understand performance bugs. The dynamic part is key, as it implies near-zero overhead when probes are not in use and no application restarts. The problem is real, and DTrace will be a highly useful tool.

2.       What is the solution's main idea (nugget)?

DTrace provides a framework to instrument code using its API. Instrumentation is done via probes. Each probe can have a condition (predicate), and when it is satisfied an action is performed. The tracing code is kept “safe” using appropriate mechanisms in the DIF virtual machine.
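To make the probe/predicate/action structure concrete, here is a toy model in Python (this is not DTrace's D language; the one-line D script in the comment is only a rough rendering of the same idea):

```python
# Toy model of DTrace's probe/predicate/action structure (not actual D syntax).
from collections import Counter

class Probe:
    def __init__(self, name, predicate, action):
        self.name = name
        self.predicate = predicate   # filter: fire only for interesting events
        self.action = action         # what to record when the predicate holds

    def fire(self, ctx):
        if self.predicate(ctx):
            self.action(ctx)

# Example: count read() calls per process, but only for processes named "httpd" --
# roughly what a one-line D script like
#   syscall::read:entry /execname == "httpd"/ { @[pid] = count(); }
# would express.
reads_by_pid = Counter()
probe = Probe(
    name="syscall::read:entry",
    predicate=lambda ctx: ctx["execname"] == "httpd",
    action=lambda ctx: reads_by_pid.update([ctx["pid"]]),
)

for event in [{"execname": "httpd", "pid": 101},
              {"execname": "bash",  "pid": 202},
              {"execname": "httpd", "pid": 101}]:
    probe.fire(event)

print(reads_by_pid)   # Counter({101: 2})
```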

3.       Why is solution different from previous work?

It is unclear to me what the key differences w.r.t. prior work are. The section on related work makes me believe that DTrace’s primary difference is the framework as a whole – a neat mix of multiple ideas from similar projects.

4.       Does the paper (or do you) identify any fundamental/hard trade-offs?

This may not be a trade-off per se; it is more a generality vs. simplicity point. Predicates are a mechanism for filtering out “uninteresting” events, but the problem for production engineers might be identifying those events in the first place. Complaints to the effect of “get me something that just works” might lead to follow-on work on top of DTrace. Nonetheless, DTrace’s generality will turn out to be an enabler for such work.

5.       Do you think the work will be influential in 10 years? Why or why not?

Yes, I think so. I suspect corporations might already have their own tracing frameworks that contain key ideas from DTrace.

Wednesday, March 4, 2009

Dryad

Dryad seems to be Microsoft's answer to MapReduce, and logically it is an extension. MapReduce identified a key paradigm that was super-simple. But as mentioned in the Pig Latin paper, and in Chris's talk on Monday, this paradigm is not quite sufficient. Dryad allows for much more general computational DAGs, with vertices implementing arbitrary computations and using any number of inputs and outputs.

Dryad trades off simplicity for generality. It is a much more complex programming model, requiring developers to understand the structure of the computation and the properties of the system resources. This system will definitely be influential, but not in its current form. It somehow doesn't quite have the same elegance as MapReduce. I think that to be adopted on a wide scale, the system will have to be refined to provide simpler higher-level "macros" capturing typical queries. But this work will lead toward the identification and building of such programming models.

MapReduce



1.       What is the problem? Is the problem real?

Processing large amounts of data is crucial for enabling innovation in Internet companies like Google, Microsoft, etc. MapReduce is a system for enabling this processing on large distributed systems. The paradigm captures a large number of applications, including counting URL access frequency, computing the reverse web-link graph, and building inverted indices. With more of users’ activities migrating to the web, logging and the consequent sizes of the logs will obviously increase, necessitating such systems.

2.       What is the solution's main idea (nugget)?

The user-defined map function converts the input records to intermediate key/value pairs, which in turn are converted to the output pairs using the user-defined reduce function. The system in this paper automatically executes the MapReduce operation on large clusters of commodity PCs and is resilient to machine failures and stragglers. The input data is split into M parts, and copies of the program are started on a cluster with one machine appointed as the master and the rest as workers. The master picks idle workers to assign map tasks. The intermediate key/value pairs are buffered and partitioned into R regions on the workers’ local storage. Reduce workers pick up the intermediate data using RPC calls, sort it, apply the reduce function, and write their output to a global file system; the output is a set of R files. The master keeps track of the state of all the workers and pings them periodically to check if they are alive. Map tasks allotted to a failed worker are reset and reassigned to another worker. Network bandwidth is conserved by attempting to schedule a map task on a machine that has the data locally or on the same switch. The system performs very well under varying failure conditions and is heavily used in Google’s daily operations.
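To keep the abstraction itself straight in my head, here is a minimal single-machine sketch of the programming model (word count). It ignores everything the real system does about distribution, the M/R splits, fault tolerance, and locality:

```python
# Minimal local sketch of the map/reduce programming model (word count).
from collections import defaultdict

def map_fn(doc_name, contents):
    for word in contents.split():
        yield word, 1                      # intermediate key/value pairs

def reduce_fn(word, counts):
    yield word, sum(counts)                # one output value per key

def run_mapreduce(inputs, map_fn, reduce_fn):
    intermediate = defaultdict(list)
    for name, contents in inputs:          # "map phase"
        for k, v in map_fn(name, contents):
            intermediate[k].append(v)      # stands in for partition + shuffle
    output = {}
    for k in sorted(intermediate):         # "reduce phase" over sorted keys
        for key, value in reduce_fn(k, intermediate[k]):
            output[key] = value
    return output

docs = [("d1", "the quick brown fox"), ("d2", "the lazy dog the end")]
print(run_mapreduce(docs, map_fn, reduce_fn))
# {'brown': 1, 'dog': 1, 'end': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 3}
```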

3.       Why is solution different from previous work?

Programming models that enable automatic parallelization are not novel in themselves, but this paper’s core contribution is a simplified yet powerful abstraction derived from prior work, automatically parallelized and executed over large distributed systems with transparent fault tolerance.

4.       Does the paper (or do you) identify any fundamental/hard trade-offs?

First, it is not a general-purpose programming model. They trade off generality and restrict the programming model to enable automatic parallelization over a large distributed system with transparent fault tolerance. Second, they stick to a simple design and trade off some efficiency: (i) the master is not replicated (if it fails, bad luck!), and (ii) intermediate state from workers is not collected, leading to backup tasks starting afresh and doing redundant work.

5.       Do you think the work will be influential in 10 years? Why or why not?

This has already become widely used within Google (I hear the first thing an intern learns in Google is to write a MapReduce task!), and is definitely influential for companies that provide internet services. Of course, innovation will continue to provide more efficient implementations of this paradigm. 

6.       Others:

A couple of suggestions/doubts:

a.       Worker Machines:

i. Why aren’t the intermediate values from the map workers stored in a temporary global file, with possibly a status log too (on how much of the input has been read, etc.)? It is a complete waste of resources and time when a worker fails, especially when it has almost or fully completed its work.

ii. Stragglers: Before allotting the work to backup machines, the master could collect whatever intermediate output the straggler has produced and allot just the remainder to the backup worker. This seems all the more doable because the straggler is merely slow, not failed.

· The picking of backup machines could be more sophisticated by keeping track of the workers’ performance history (if available).

b.      Locality for network bandwidth conservation:

i. How does the master know whether two machines are on the same network switch? Keeping track of this seems a daunting task…is there a smart naming/addressing strategy?

ii. In addition to network proximity, the load on the candidate worker should also be taken into account.

iii. Finally, the proximity of the output files to the client should also be taken into account.

 

Wednesday, February 18, 2009

Dynamo: Amazon’s Highly Available Key-value Store

1.       What is the problem? Is the problem real?

The problem is providing a storage system built from commodity (and hence failure-prone) machines that meets high availability and performance requirements for a workload with plenty of write requests. The problem is real and is faced by most companies that support services similar to Amazon’s.

 

2.       What is the solution's main idea (nugget)?

Data is partitioned and replicated using consistent hashing (of Chord fame). Given that writes constitute a significant portion of the workload, guaranteeing availability while maintaining consistency is a big challenge. Dynamo uses a quorum-based consistency protocol in which a minimum number of replicas (R for reads, W for writes, out of N) must respond for an operation to succeed. These parameters are configurable according to the application’s availability requirements. Object versioning helps in guaranteeing availability, and divergent versions are reconciled during reads.
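A small sketch of the two ideas, assuming my reading of the paper is right: a consistent-hashing ring that picks the replica set for a key, and the R/W quorum condition. Virtual nodes, sloppy quorums, hinted handoff, and vector clocks are all left out:

```python
# Sketch of consistent-hashing placement plus an R/W quorum check.
import bisect, hashlib

def h(key):
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, n_replicas=3):
        self.n = n_replicas
        self.points = sorted((h(node), node) for node in nodes)

    def preference_list(self, key):
        """First n_replicas nodes clockwise from hash(key)."""
        i = bisect.bisect(self.points, (h(key), ""))
        return [self.points[(i + j) % len(self.points)][1] for j in range(self.n)]

def quorum_ok(acks, required):
    """A read (R) or write (W) succeeds once `required` replicas respond."""
    return acks >= required

ring = Ring(["node-a", "node-b", "node-c", "node-d", "node-e"])
print(ring.preference_list("shopping-cart:alice"))   # the 3 replicas for this key
# With N=3, R=2, W=2 we get R + W > N, so (in a strict quorum) a read
# overlaps with the latest successful write.
```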

 

3.       Why is solution different from previous work?

The contribution of this paper is not radically new protocols, but the experience of building and managing gigantic-scale systems. That said, there is a key difference – this storage system is actually tuned for write-intensive workloads. That is a significantly harder problem than read-intensive workloads in terms of guaranteeing availability and maintaining consistency at the same time. Unlike the Google papers, this one seems to favor the classical decentralized design as opposed to a simpler centralized system.

 

4.       Does the paper (or do you) identify any fundamental/hard trade-offs?

They trade off consistency for availability, while achieving eventual consistency. This is because of the write-intensive workloads. The other trade-off is in how transparent the system is to applications. Dynamo expects applications to be intelligent and deal with inconsistency in the manner they deem fit. This represents a shift towards applications becoming more complex and will have a significant impact on the way they are designed.

 

5.       Do you think the work will be influential in 10 years? Why or why not?

Sure! Dynamo represents a very good implementation of a lot of principles from distributed systems and DB systems, in terms of picking the right trade-offs and working effectively. With internet services becoming a big deal, their assumptions about workloads and their solutions are all very pertinent.

 

6.       Others:

a.       While the idea of a knob for availability (R, W, N) seems good at the outset, it seems to go against the general principle of this paper – “I will tell you what works!” I think setting these knobs is not an easy task, and maybe for good reason they haven’t let out the secret of how their applications set them. But this can be crucial to the performance of the system.

 

Wednesday, February 11, 2009

The Chubby lock service for loosely-coupled distributed systems

1. What is the problem? Is the problem real?
The problem is that of providing a lock service for a distributed system. Chubby provides such a centralized lock service for O(10,000) servers. This is obviously a very real (and historical) problem in distributed systems, with multiple existing solutions. And not only is the problem real, its implementation in a real, large system is another beast in itself – the details in this paper go a long way toward understanding a practically deployed and effective lock service for a large distributed system.

2. What is the solution's main idea (nugget)?
Chubby is a centralized lock service (as opposed to a distributed, client-based one) for Google’s distributed systems. A Chubby cell consists of a periodically elected master and a small set of replicas. Locks can be exclusive or shared, but are not mandatory (for maintenance reasons, and because mandatory locks add no real value in a typical production environment). Caching helps scalability – cache entries are invalidated on a modification rather than updated, as an update scheme might burden clients with an unbounded number of updates. The master has the overhead of maintaining the list of what each client caches. Blocking keep-alives are used to maintain sessions between clients and the master.
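My reading of the invalidate-on-modify scheme, as a rough sketch; the structure and names are illustrative rather than Chubby's actual implementation (in the real system, I believe invalidations ride on keep-alive replies):

```python
# Sketch of invalidate-before-modify; illustrative, not Chubby's real code.

class Master:
    def __init__(self):
        self.data = {}          # file -> contents
        self.cachers = {}       # file -> set of clients caching that file

    def read(self, client, path):
        self.cachers.setdefault(path, set()).add(client)
        return self.data.get(path)

    def write(self, path, value):
        # Block the modification until every caching client has dropped its
        # copy; caches stay consistent at the expense of slow writes.
        for client in self.cachers.pop(path, set()):
            client.invalidate(path)
        self.data[path] = value

class Client:
    def __init__(self, master):
        self.master, self.cache = master, {}

    def read(self, path):
        if path not in self.cache:
            self.cache[path] = self.master.read(self, path)
        return self.cache[path]

    def invalidate(self, path):
        self.cache.pop(path, None)

m = Master()
c = Client(m)
m.data["/ls/foo"] = "v1"
print(c.read("/ls/foo"))   # 'v1', now cached at the client
m.write("/ls/foo", "v2")   # conceptually blocks until c's cache entry is gone
print(c.read("/ls/foo"))   # 'v2', re-fetched from the master
```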

3. Why is solution different from previous work?
This paper, self-admittedly, is about the engineering experience and not new protocols. The details in this paper are highly valuable, but its biggest contribution is describing the design of a centralized lock service in a large operational distributed system, taking care of clients’ implementation complexities and the corner cases for scaling and failures.

4. Does the paper (or do you) identify any fundamental/hard trade-offs?
Being an implementation experience, this paper is about multiple trade-offs that are classics in systems research – ease of deployment and implementation vs. correctness. (1) A centralized lock service vs. a client-only library for distributed consensus; (2) no mandatory locks, for ease of administration and because they don’t really fix the problem anyway; (3) blocking modifications to ensure cache invalidation and consistency: caching is important for scalability, and this ensures simplicity at the expense of slow modifications.

5. Do you think the work will be influential in 10 years? Why or why not?
Without a doubt! Being used in a production environment as large as Google’s is influential enough. Handling of the practical issues to achieve scalability and clarity means that its adoption is imminent. And its simplicity provides a clean basis to add modifications on top as the requirements of services change.

6. Others:
Some unclear things:
a. For a large number of clients, is this invalidation strategy actually scalable? They provide no numbers to back it, other than mentioning that a Chubby service once had 90,000 clients. More quantitative details would have helped.
b. Why is the keep-alive scheme blocking? What happens if the master returns the call immediately and puts the onus on the client to send its keep-alive messages before timing out?
c. Multiple Chubby cells: In this case, is there an overlap in the set of servers being served by different Chubby cells? If yes, how do they maintain consistency among the cells themselves?
Thoughts:
a. Are updates a better idea than invalidation sometimes? We could have a time period (and possibly a frequency cap) for updates so that clients are not bombarded indefinitely and unboundedly. This would make modifications return more quickly, as opposed to the current blocking model.
b. How about multiple masters for a set of servers, with a multicast protocol like SRM for maintaining consistency among the masters? This would help in scalability.

Sunday, February 8, 2009

eBay: Scale!

1. What is the problem? Is the problem real?
As a site that serves 2 billion page views a day, resulting in $60 billion in transactions every year, eBay definitely faces the problem of scaling up to demand. Maintaining such large-scale systems is a very real problem and is faced by most companies in the business of web-based services.

2. What is the solution's main idea (nugget)?
The solution is based on five basic principles: partition everything, introduce asynchrony, automatically adjust configurations and learn various characteristics including user behavior, be prepared for failures and handle them gracefully, and support a spectrum of consistency levels applied appropriately. While these are fairly well-known principles in the distributed systems world by now, their relevance cannot be tested in more demanding circumstances than modern datacenters. Their deployment includes third-party clouds too, making the problem harder and necessitating checks to avoid common erroneous assumptions about modeling, portability, and utilization of resources.
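As a minimal illustration of the first principle (“partition everything”): at its core it is just routing every piece of data to a shard by a stable key. The shard count and key below are made up:

```python
# Minimal illustration of "partition everything": hash a stable key to a shard
# so that no single database holds all the data.
import hashlib

N_SHARDS = 16   # made-up number

def shard_for(user_id: str) -> int:
    digest = hashlib.sha1(user_id.encode()).hexdigest()
    return int(digest, 16) % N_SHARDS

print(shard_for("seller-42"))   # always maps the same user to the same shard
```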

3. Why is solution different from previous work?
In addition to these guarantees, the slides also talk about addressing emerging challenges about energy efficiency as well as workload characterization useful for prediction of “floods”. Modeling workloads is an especially useful research problem that has implications in resource provisioning and quality of service, energy-efficient designs and statistical multiplexing of resources.

4. Do you think the work will be influential in 10 years? Why or why not?
These are all very definite and real problems to solve. Principles as well as solutions that arise out of this work will definitely be influential, both immediately as well as in future.

5. Others:
The use of third-party clouds is pretty interesting – but for a service dealing with a large number of customers, I wonder what the privacy implications are.

Wednesday, February 4, 2009

DCell: A Scalable and Fault-Tolerant Network Structure for Data Centers

1. What is the problem? Is the problem real?
Low bisection bandwidth, a real problem but with the same questions as for the UCSD fat-tree paper.

2. What is the solution's main idea (nugget)?
Using a recursive structure called DCell, they can set up many paths between nodes. These multiple paths can be leveraged to get scalability, load balancing, and other nice properties.
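If I read the construction right, a DCell_k is built from t_{k-1} + 1 copies of DCell_{k-1} (where t_{k-1} is the number of servers in a DCell_{k-1}), with each pair of copies connected by one server-to-server link, giving t_k = t_{k-1} × (t_{k-1} + 1). A quick back-of-the-envelope shows the doubly exponential growth:

```python
# Back-of-the-envelope server counts for the recursive DCell construction,
# assuming t_k = t_{k-1} * (t_{k-1} + 1) with t_0 = n servers per mini-switch.

def dcell_servers(n, k):
    t = n
    for _ in range(k):
        t = t * (t + 1)
    return t

for level in range(4):
    print(level, dcell_servers(6, level))
# 0 6
# 1 42
# 2 1806
# 3 3263442   -> over 3 million servers with 6-server DCell_0 cells and k=3
```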

3. Why is solution different from previous work?
Can scale to a million servers easily with high fault-tolerance, using commodity switches.

4. Do you think the work will be influential in 10 years? Why or why not?
Seems unlikely, though I can’t point my finger at exactly why not! Somehow the deployability story in this paper doesn’t seem strong and has an “if we were to do it from scratch” kind of feel. Such designs have traditionally proved hard to deploy and hence haven’t been very influential, at least directly. Maybe ideas from this paper will find their way into other designs…

5. Others:
Of the three, this one was very confusing to read! There is too much pseudo-code that is not immediately clear. Sure, a picture is worth a thousand words!

A Policy-aware Switching Layer for Data Centers

1. What is the problem? Is the problem real?
Datacenter management is hard, and the scale makes configuration errors inevitable. In addition, there are no explicit protocols or mechanisms to deploy middle-boxes in datacenters, so operators overload existing mechanisms and come up with ad hoc, error-prone techniques. Unlike in the Internet, where people have proposed similar techniques for doing things in a “clean” way only after the “clumsy” techniques had become familiar and operational, this paper hits the problem at the right time. Datacenter management is still evolving and admins do make errors. The problem is very real.

2. What is the solution's main idea (nugget)?
The key idea is that the physical network path should not be the mechanism by which middle-box traversal is enforced. They introduce a new kind of programmable switch, the pswitch, which takes in policies and ensures correct middle-box traversals.
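A toy sketch of what the policy lookup at a pswitch might look like – a traffic class mapped to an ordered middle-box sequence, with the switch picking the next hop. The rule format and matching below are illustrative, not PLayer's actual design:

```python
# Toy policy table: traffic class -> ordered middle-box sequence (illustrative).
POLICIES = [
    {"match": {"dst_port": 80},  "sequence": ["firewall", "load_balancer"]},
    {"match": {"dst_port": 443}, "sequence": ["firewall", "ssl_offloader", "load_balancer"]},
]

def next_hop(frame, already_traversed):
    """Pick the next middle-box for this frame, or None to deliver it."""
    for policy in POLICIES:
        if all(frame.get(k) == v for k, v in policy["match"].items()):
            remaining = policy["sequence"][len(already_traversed):]
            return remaining[0] if remaining else None
    return None   # no policy matched: forward normally

print(next_hop({"dst_port": 80}, []))                              # 'firewall'
print(next_hop({"dst_port": 80}, ["firewall"]))                    # 'load_balancer'
print(next_hop({"dst_port": 80}, ["firewall", "load_balancer"]))   # None
```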

3. Why is solution different from previous work?
Previous papers have not looked at the mess in configuring datacenters, and this paper proposes a neat solution based on the principles of policy and indirection from some well-known earlier papers (i3, “middle-boxes are not harmful”) in the context of datacenters. I haven’t read the Ethane paper, so I am not sure what the exact difference between PLayer and Ethane is.

4. Does the paper (or do you) identify any fundamental/hard trade-offs?
There is increased latency in the network, as the pswitch has to look up the policy for every frame. This is probably fine in largely over-provisioned datacenters that already have very small latencies, where a small increase is insignificant. But with the trend moving towards extracting every cent of investment in the datacenter, dealing with this increased latency might become important.

5. Do you think the work will be influential in 10 years? Why or why not?
As I said earlier, this paper addresses the architectural clumsiness problem of datacenters at the right time, not after people have mastered a clumsy but effective way. Principles in this paper are sure to influence datacenter management.

6. Others:
I have discussed this with Dilip earlier – a user-study with network admins will help in evolving the right language for specifying policies.

A Scalable, Commodity Data Center Network Architecture

1. What is the problem? Is the problem real?
The common tree-based architecture of switches/routers in datacenters is inefficient and results in oversubscription and relatively low effective bandwidth. This is definitely a real problem, but it would have been nice if the paper had shown some numbers to back it up rather than just listing a few examples.

2. What is the solution's main idea (nugget)?
The paper proposes constructing a fat-tree out of inexpensive, commodity switches that can achieve full bisection bandwidth.
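The arithmetic behind the scaling claim is simple: a fat-tree built from identical k-port switches has k pods (each with k/2 edge and k/2 aggregation switches) plus (k/2)^2 core switches, and supports k^3/4 hosts at full bisection bandwidth:

```python
# Host and switch counts for a k-ary fat-tree built from k-port switches (k even).

def fat_tree(k):
    pods = k
    edge_per_pod = agg_per_pod = k // 2
    hosts = pods * edge_per_pod * (k // 2)                            # k^3 / 4
    switches = pods * (edge_per_pod + agg_per_pod) + (k // 2) ** 2    # 5k^2 / 4
    return hosts, switches

print(fat_tree(48))   # (27648, 2880): ~27.6k hosts from 48-port commodity switches
```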

3. Why is solution different from previous work?
The smart thing about this paper is using the fat-tree topology in the datacenter. Such topologies have been used in the past in supercomputers and even telephone networks. Identifying their relevance to datacenters enables the authors to achieve high bandwidth using cheap commodity switches. Also, unlike most other solutions that increase performance, this one actually seems to reduce power consumption, which is a big plus in its favour.

4. Does the paper (or do you) identify any fundamental/hard trade-offs?
This topology makes load balancing harder than before, which has well-known consequences at the transport layer. The paper does talk about a centralized scheduler, but I think in practice it might be more complicated than the paper paints it. The actual effects of out-of-order delivery, etc., need to be checked with realistic workloads.

5. Do you think the work will be influential in 10 years? Why or why not?
Definitely. I think the solution is definitely attractive, has no major complicated requirements and anything that constructs a high performance system out of commodity hardware is bound to be influential. At the very least, the ideas of fat-trees and using commodity hardware for datacenter architectures are here to stay.

6. Others:
As mentioned earlier, I really want to know the bandwidth requirements of the current datacenter applications and whether they are hitting the limit w.r.t. what can be provided. While it does seem like we can always do with more bandwidth, we should ensure we are not solving the wrong problem in datacenters.

Monday, February 2, 2009

Crash: Data center Horror Stories

This article talks about some of the crashes that have occurred at datacenters. It aims to, albeit a little dramatically, bring out an important point about the complexity and enormity of the task of designing datacenters. Some of the examples include cabling errors, insulation faults, and a lack of clear understanding, among the people who design failure-management strategies, of the possible types and scale of collapses. Many of the faults are due to errors in the (physical) construction process, which makes it important for experts in different fields to put their knowledge together during the design, inspection, operation, and maintenance of the datacenter. There are some general guidelines for good datacenter design and operation, including adhering to standards and carefully inspecting every step, but the key is to remember that every datacenter is unique, with its own problems and complexity. Carefully thought-out custom solutions are the best approach.

Questions:
1. Is the datacenter-building industry still a little young, with most data about building and operational experiences kept confidential? In that sense, with time, will the shared collective knowledge help in dramatically increasing our capability to guard against such mishaps? Are there parallels in other fields?
2. What is the trade-off between failure resilience and the cost of achieving it? Is it preposterous to suggest a strategy of “I will have some standard failure-proof strategies…if once in a while (with a very low probability) my datacenter goes down, tough luck until I restore it!”?
3. How much can software techniques help in working around some of these problems?

Failure Trends in a Large Disk Drive Population

1. What is the problem? Is the problem real?
Given the widespread usage of magnetic media, mostly hard drives, for storage, understanding the robustness of these components helps in the design, deployment, and maintenance of storage systems. To that end, this paper presents failure statistics collected from Google’s disk drives in service. The problem is very real, and the work takes steps in the right direction – a large deployment and statistical correlation.

2. What is the solution's main idea (nugget)?
Information about temperature, activity levels, and many SMART parameters was collected from the drives every few minutes and mined to find correlations with failures. The key findings are:
Activity Levels: These are weekly averages of read/write bandwidth per drive. Utilization is confusingly related to failure rates – very young (under a year) and very old (five-year) drives roughly show higher failure rates at high utilization, while the rest don’t (sometimes the relation is, surprisingly, inversely proportional!).
Temperature: The study debunks the widely and intuitively held belief that temperature is an important cause of failure. The data shows an inverse relation between failure rates and temperature at low and average temperatures. Only at very high temperatures (> 45°C) for older drives (> 3 years) is the relation directly proportional.
SMART Parameters:
(a) Scan Errors: These are defined as errors in reading sectors on the drive. Drives with scan errors are ten times more likely to fail.
(b) Reallocation Counts: This happens when a sector has a read/write error and the faulty sector number is remapped to a new physical sector. It is an indication of drive surface wear. Drives with reallocation counts are 14 times more likely to fail within 60 days than drives without them.
(c) Probational Counts: This is sort of a warning of possible problems and is a good indicator of failures.
Overall, SMART parameters aren’t a great indicator of failures. Over 56% of the failed drives have no count in any of the four strong SMART signals. Better parameters are needed.

3. Why is solution different from previous work?
a. The data is from a deployment of a size that is unprecedented in the literature. Add to that, the data comes from a live and popular service, which makes it a very good basis for building models.
b. It debunks popularly held beliefs about the correlations of failure rates with utilization and temperature, and questions the utility of SMART parameters as effective warning mechanisms.

4. Do you think the work will be influential in 10 years?
Yes, very much. Given the size of the deployment and its questioning of intuitive relations, this work is likely to lead to building better models for disk failures.

5. Others: Some questions I had…
a. Is there anything that happens when you put a big bunch of disks together in a deployment that is not apparent when they are by themselves? Any magnetic influences, etc.? Maybe that is why some of the correlations don’t hold in big deployments while they are fine under “testing conditions”. This might be a reason to move towards environment-based models – models suited to server farms, dusty households, stand-alone deployments, etc.
b. The highly influential Google systems papers make me wonder about the larger question of whether data from industry (which is often confidential) is the right (and possibly the only!) way to do data-oriented research…