Monday, November 24, 2008

2008-11-25 Readings - Datacenter Networking


A POLICY-AWARE SWITCHING LAYER FOR DATA CENTERS

Dilip Antony Joseph, Arsalan Tavakoli, Ion Stoica

Summary:

Presents the idea of a policy-aware layer-two switching layer (PLayer) built from policy-aware switches (pswitches). Middleboxes are taken off the physical network path, and policy is separated from reachability: pswitches explicitly direct traffic to off-path middleboxes according to policy. The bulk of the paper covers implementation details. The prototype is implemented in Click, and evaluations are done on small topologies. Includes mechanisms for policy specification and middlebox instance selection.
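
To make the indirection concrete, here is a minimal sketch of how a pswitch might consult a policy table keyed on the previous hop to pick the next off-path middlebox. This is my own toy reconstruction, not the paper's actual data structures: the real PLayer classifies frames by 5-tuple plus previous hop, while a string label stands in for the traffic class here.

```python
# Toy policy table: (previous hop, traffic class) -> remaining middlebox sequence.
POLICY = {
    ("core_router",   "web"): ["firewall", "load_balancer", "web_server"],
    ("firewall",      "web"): ["load_balancer", "web_server"],
    ("load_balancer", "web"): ["web_server"],
}

def forward(prev_hop, traffic_class):
    """Return where a pswitch should send the frame next."""
    sequence = POLICY.get((prev_hop, traffic_class))
    if sequence is None:
        return "normal_l2_forwarding"   # no policy applies: act like a plain switch
    return sequence[0]                  # explicit redirection to an off-path middlebox

print(forward("core_router", "web"))    # -> firewall
print(forward("firewall", "web"))       # -> load_balancer
```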

Discussions & Criticisms:

Policy fluctuates a lot, but router architecture necessarily changes infrequently. And yet middleboxes are needed. Hmm. How to balance the two? Maybe some kind of application-layer overlay: use the overlay to steer traffic to middleboxes and define the traversal order (rough sketch below). The only downside may be inefficient use of network bandwidth. However, adoption should be easier than pushing a whole new router architecture.
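
A purely hypothetical sketch of that overlay idea (nothing from the paper; all names invented): the sender source-routes each payload through overlay nodes fronting the middleboxes, so traversal order is enforced at the application layer instead of by where switches and boxes sit.

```python
# Hypothetical overlay-based middlebox traversal; hop names and encapsulation are made up.
TRAVERSAL_ORDER = ["firewall.overlay.local", "ids.overlay.local", "lb.overlay.local"]

def wrap(payload, hops=TRAVERSAL_ORDER):
    """Source-route the payload through overlay nodes fronting each middlebox."""
    return {"remaining_hops": list(hops), "payload": payload}

def relay(packet):
    """Each overlay node pops the next hop off the list and forwards the packet there."""
    next_hop = packet["remaining_hops"].pop(0)
    # a real deployment would tunnel (e.g. UDP encapsulation) to next_hop
    return next_hop, packet

pkt = wrap(b"GET / HTTP/1.1")
hop, pkt = relay(pkt)
print(hop, pkt["remaining_hops"])   # firewall first, then IDS and load balancer remain
```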

Maybe the answer is something like autoconf for this? Once upon a time it was a pain to (mis)configure compilers and such too, but then autoconf came along ...

Scalability behavior unknown?

Dealing with a complexity problem by adding another abstraction layer and more complexity??? The fundamental problem with the current setup seems to be that middleboxes have to sit on the path of the traffic, placed in the desired order of traversal. Once that is solved, the rest of the problems go away.

Hmmm. Is the performance of Click anywhere near production switch quality? Basically all research work on routers uses Click, and the paper admits that software routers are nowhere near line speed. But if anything requires what the paper calls a "fork-lift upgrade" of all routers, it's going to be a hard sell to Cisco and a potential barrier to adoption :( Results on NetFPGA might be interesting.



IMPROVING MapReduce PERFORMANCE IN HETEROGENEOUS ENVIRONMENTS

Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz, Ion Stoica

Summary:

Outlines a new way to schedule speculative tasks for Hadoop: schedule speculative execution based not on progress rate but on expected remaining running time. Identifies the assumptions implicit in Hadoop's scheduler and shows how they break down in heterogeneous environments. Implements the LATE (Longest Approximate Time to End) scheduler. Critical insights: make speculation decisions early in the job, rank tasks by estimated finishing time instead of progress rate, avoid speculative execution on slow nodes, and cap the number of speculative tasks (see the sketch below). Evaluation done on EC2 (hundreds of nodes/VMs) and a local testbed (R cluster? tens of nodes). Results show visible improvement over the native Hadoop speculative scheduler in all setups, as high as a factor-of-2 decrease in finishing time.
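
A minimal, self-contained sketch of the LATE heuristic as I read it. The class, the cap value, and the slow-node test are placeholders, not the actual Hadoop/LATE code.

```python
from dataclasses import dataclass

SPECULATIVE_CAP = 2   # placeholder cap on concurrently running speculative copies

@dataclass
class Task:
    progress: float      # progress score in [0, 1]
    start_time: float
    speculated: bool = False

def estimated_time_left(task, now):
    # key idea: rank by estimated time to finish, not by raw progress rate
    rate = task.progress / max(now - task.start_time, 1e-9)
    return (1.0 - task.progress) / max(rate, 1e-9)

def pick_speculative_task(tasks, now, running_speculative, node_is_slow):
    """Speculate on the task with the longest approximate time to end,
    but never on behalf of a slow node and never beyond the cap."""
    if running_speculative >= SPECULATIVE_CAP or node_is_slow:
        return None
    candidates = [t for t in tasks if not t.speculated and t.progress < 1.0]
    if not candidates:
        return None
    return max(candidates, key=lambda t: estimated_time_left(t, now))

# toy example: the straggler (20% done after 100s) is chosen over the 90%-done task
tasks = [Task(progress=0.9, start_time=0.0), Task(progress=0.2, start_time=0.0)]
print(pick_speculative_task(tasks, now=100.0, running_speculative=0, node_is_slow=False))
```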

Discussions & Criticisms:

MapReduce goodness :)

This looks adoptable. Has anyone felt the benefits are worth the adoption effort yet?

Have only one question/reservation. MapReduce is a general programming paradigm, and Hadoop is only one implementation of it. Doing the work on Hadoop is fine and relevant, but people might say that maybe Hadoop just sucks, and that a better implementation of MapReduce would get rid of all the Hadoop-specific problems. (I get asked the same thing about my work on the energy efficiency of MapReduce.) So what is the fundamental inefficiency in MapReduce itself, the underlying programming paradigm rather than the Hadoop implementation, when going from homogeneous to heterogeneous environments? Yeah ... it sounds like all the broken assumptions are Hadoop's assumptions, and the insights from the paper would simply lead to a better implementation of MapReduce.

Could it be that a more fundamental solution for MapReduce is to assign not tasks of homogeneous size, but task sizes based on historical work rate? Or assign different-sized tasks based on known system configuration parameters such as disk speed, CPU speed, and network speed (toy sketch below)? It sounds like that would improve not just the finishing time of speculative tasks but the finishing time of all tasks.
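
A toy sketch of what I mean, which is pure speculation on my part and not anything from the paper: split a fixed amount of input into per-node shares proportional to each node's historically observed work rate.

```python
def proportional_task_sizes(total_records, observed_rates):
    """observed_rates: records/sec each node sustained on its previous tasks."""
    total_rate = sum(observed_rates.values())
    return {node: int(total_records * rate / total_rate)
            for node, rate in observed_rates.items()}

# a node that has historically run 3x faster gets a 3x larger share of the input
print(proportional_task_sizes(1000000, {"fast_node": 300.0, "slow_node": 100.0}))
```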

Not sure if tuning Hadoop like crazy is the best way to go in the long term. I've been persuaded to take a broader, more fundamental perspective ...



1 comment:

Matei Zaharia said...

The nice thing about Hadoop is that it's getting better very quickly over time. When we started looking at it last September, there were things that really sucked, and we thought "yeah, I can make a system that's faster/easier to use/more stable than this". Now... I'm not so sure. Performance has gone up very dramatically. HDFS is really reliable, and MapReduce is getting better. Hive and Pig provide higher-level programming on Hadoop. I think we're getting to the point where anyone who wants to build a new MapReduce implementation will need to justify not using Hadoop.