Monday, November 24, 2008

2008-11-25 Readings - Datacenter Networking


Dilip Antony Joseph, Arsalan Tavakoli, Ion Stoica


Presents the idea of a policy-aware layer-two (PLayer), built from policy-aware switches (pswitches). Middleboxes are taken off the physical network path, and policy is separated from reachability: pswitches explicitly direct traffic to off-path middleboxes according to policy. The bulk of the paper covers implementation details. The prototype is implemented in Click, with evaluations done on small topologies. Includes mechanisms for policy specification and middlebox instance selection.
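The policy-directed forwarding idea can be sketched in a few lines. This is a hypothetical toy, not PLayer's actual data structures: the policy table maps a traffic class to an ordered middlebox chain, and the switch picks the next hop based on how far along the chain a frame already is.

```python
# Hypothetical sketch of pswitch-style policy routing: given a traffic class,
# forward frames through the configured middlebox sequence before the final
# destination. Names and the policy format are illustrative, not PLayer's.

def next_hop(policy, traffic_class, visited_count):
    """Return the next middlebox for this frame, or 'dest' when done."""
    chain = policy.get(traffic_class, [])
    if visited_count < len(chain):
        return chain[visited_count]
    return "dest"

policy = {"web": ["firewall", "load_balancer"], "mail": ["firewall"]}
assert next_hop(policy, "web", 0) == "firewall"
assert next_hop(policy, "web", 1) == "load_balancer"
assert next_hop(policy, "web", 2) == "dest"
assert next_hop(policy, "ssh", 0) == "dest"   # no policy: straight to destination
```

The point the toy makes is that traversal order lives in the policy table, not in the physical placement of boxes.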

Discussions & Criticisms:

Policy fluctuates a lot, but router architectures necessarily change infrequently. Yet middleboxes are needed. Hmm. How to balance the two ... Maybe some kind of funky application-layer overlay stuff? Use an overlay to fake routes to middleboxes and define the traversal order? The only downside may be inefficient use of network bandwidth. However, adoption should be easier than pushing a whole new router architecture.

Maybe the answer is autoconf for applications? Once upon a time it was a pain to (mis)configure compilers and such too. But then autoconf came along ...

Scalability behavior unknown?

Dealing with a complexity problem by adding another abstraction layer and more complexity??? Seems like the fundamental problem with the current setup is that middleboxes have to sit on the path of the traffic, placed in the desired order of traversal. Once we solve that, we've solved all the problems.

Hmmm. Is the performance of Click anywhere near production switch quality? Basically all router research uses Click. The paper admits that software routers are nowhere near line speed. But if anything requires what the paper calls a "fork-lift upgrade" of all routers, it's going to be a hard sell to Cisco and a potential barrier to adoption :( Results from NetFPGA might be interesting.


Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz, Ion Stoica


Outlines a new way to schedule speculative tasks for Hadoop. Schedules speculative execution based not on progress rate but on expected remaining running time. Outlines the assumptions implicit in Hadoop scheduling and how they break down in heterogeneous environments. Implements the LATE (Longest Approximate Time to End) scheduler. Critical insights: make decisions early in the job, use finishing times instead of progress rates, avoid speculative execution on slow nodes, and cap the number of speculative tasks. Evaluation done on EC2 (hundreds of nodes/VMs) and a local testbed (R cluster? tens of nodes). Results show visible improvement over the native Hadoop speculative scheduler in all setups, as high as a factor-of-2 decrease in finishing time.
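The core heuristic above can be sketched in a few lines. This is a rough illustration of the idea, not the paper's implementation; the task format and thresholds are made up:

```python
# Rough sketch of the LATE idea: estimate each running task's time left as
# (1 - progress) / progress_rate and speculate on the stragglers with the
# longest estimated time to end, skipping known-slow nodes and capping the
# number of speculative copies.

def pick_speculative(tasks, slow_nodes, cap):
    """tasks: list of (task_id, node, progress, elapsed_secs)."""
    candidates = []
    for task_id, node, progress, elapsed in tasks:
        if node in slow_nodes or progress >= 1.0 or elapsed == 0:
            continue
        rate = progress / elapsed                 # progress per second so far
        time_left = (1.0 - progress) / rate       # estimated time to end
        candidates.append((time_left, task_id))
    candidates.sort(reverse=True)                 # longest time-to-end first
    return [task_id for _, task_id in candidates[:cap]]

tasks = [("t1", "n1", 0.9, 90), ("t2", "n2", 0.2, 80), ("t3", "n3", 0.5, 100)]
assert pick_speculative(tasks, slow_nodes={"n3"}, cap=1) == ["t2"]
```

Note how t2 wins even though t3 has made less absolute progress per its node: t3 runs on a known-slow node, so speculating there would waste the backup slot.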

Discussions & Criticisms:

MapReduce goodness :)

This looks adoptable. Anyone who felt the benefits are worth the effort yet?

Have only one question/reservation. MapReduce is a general programming paradigm and Hadoop is only one implementation of it. Doing stuff with Hadoop is fine and relevant, but people might say maybe Hadoop just sucks, and a better implementation of MapReduce would get rid of all the Hadoop problems. I get asked that about my energy-efficiency-of-MapReduce work. So what is the fundamental inefficiency in MapReduce going from homogeneous to heterogeneous environments? When I say MapReduce I mean the underlying programming paradigm, not just the Hadoop implementation. Yea ... sounds like all the wrong assumptions are Hadoop's wrong assumptions, and the insights from the paper would lead to a better implementation of MapReduce.

Could it be that a more fundamental fix for MapReduce is to assign not tasks of homogeneous size, but task sizes that depend on historical work rate? Or assign different-sized tasks based on known system config params like disk speed, CPU speed, network speed, etc.? Sounds like that would improve not just the speculative tasks' finishing time but the finishing time of all tasks.
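A toy version of that rate-proportional idea, purely for illustration (this is not something Hadoop does; all names are invented):

```python
# Toy version of the idea above: instead of equal-size tasks, split the input
# in proportion to each node's measured throughput, so all nodes finish at
# roughly the same time.

def partition_by_rate(total_records, rates):
    """rates: {node: records_per_sec}. Returns {node: records assigned}."""
    total_rate = sum(rates.values())
    shares = {n: int(total_records * r / total_rate) for n, r in rates.items()}
    # hand any rounding remainder to the fastest node
    fastest = max(rates, key=rates.get)
    shares[fastest] += total_records - sum(shares.values())
    return shares

shares = partition_by_rate(1000, {"fast": 300, "slow": 100})
assert shares == {"fast": 750, "slow": 250}
assert sum(shares.values()) == 1000
```

With these rates both nodes finish in about 2.5 seconds, instead of the slow node straggling at 5 seconds under an equal split.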

Not sure if tuning Hadoop like crazy is the best way to go long term. I got persuaded to take a broader, more fundamental perspective ...

Wednesday, November 19, 2008

2008-11-20 Readings - Multicast


Might be helpful to briefly review network level multicast.


Sally Floyd, Van Jacobson, Steven McCanne, Ching-Gung Liu, Lixia Zhang


Talks about the stuff in the title (well duh, the title is long enough). Includes a REALLY nice description of the problems facing multicast, and why many of the unicast techniques cannot be migrated. Describes one example of a multicast framework in LBL's wb system. Big ideas: receivers are responsible for detecting loss and requesting retransmission, and retransmission requests use not sequence numbers but names/addresses identifying the data.
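The receiver-driven, data-named repair idea can be sketched as follows. The naming scheme and class are invented for illustration; the paper's actual wb framework names data by page and drawing operation:

```python
# Sketch of receiver-driven repair: a receiver tracks which named data units
# it holds and requests the missing names, rather than ACKing sequence
# numbers back to a single sender. Any member holding the data can answer.

class Receiver:
    def __init__(self):
        self.store = {}                       # data name -> payload

    def deliver(self, name, payload):
        self.store[name] = payload

    def missing(self, expected_names):
        """Names to request for retransmission (from anyone who has them)."""
        return [n for n in expected_names if n not in self.store]

r = Receiver()
r.deliver("page1/stroke3", b"...")
r.deliver("page1/stroke5", b"...")
assert r.missing(["page1/stroke3", "page1/stroke4", "page1/stroke5"]) == ["page1/stroke4"]
```

Because requests name the data rather than a sender-specific sequence number, any group member with a copy can repair the loss, which is what makes the scheme scale.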

Discussions & Criticisms:

Yay Sally Floyd and Van Jacobson! But what's with the long title?

I tend to really like the underlying philosophy/world-view/scientific-outlook that permeates many of their papers. Here the money line is "One cannot make a single reliable multicast delivery scheme that simultaneously meets the functionality, scalability and efficiency requirements of all applications." Perhaps a very necessary bucket of cold water that we should pour on ourselves every time we attempt a one-size-fits-all solution. I think that as engineers, we would inevitably be super excited about the stuff we build; but as scientists in search of why systems work well, instead of just what systems we can build, we should constantly question and be skeptical about our own findings/systems/views. Anyhow. I'm actually a big fan of sound scientific method and good experiment design. Probably the residual trauma of having parents who are experimental chemists/chemical engineers. Hence hats off to Floyd and Jacobson again.

I feel this work is much more complete/thorough than the other paper. I actually read them in reverse order. There are discussions of different topologies, an attempt to isolate the bare-skeleton functionality from bonus features, and a generally more convincing presentation of the data and the thinking behind it.

One thing I am confused about - is their method really application layer, or are they kind of faking transport-layer stuff in the application layer? In a sense the answer perhaps matters only to Internet architecture purists.

Also, where do the routers and switches fit into the system picture?

wb. Yea. Far cry from the collaboration stuff we have now. SharePoint anyone? :)

More seriously though, the receiver-responsible-for-retransmission view is actually totally new to me. Very innovative.

The last set of papers and this one have also caused me to think that perhaps the right way to design the network is to organize it not by the operations we want to do on the communication channel, but by the operations we want to do on the data that flows through the channel. The data-addressing scheme they brought up is totally along this data-centric path.

Turning devil's advocate, I would actually question the general need for multicast. Multicast makes sense only for live, streaming data transfers. Otherwise, one-to-many general content distribution can be done in any number of ways, given that "liveness" is not an issue. So really, the killer live-streaming apps that could burn the network with bandwidth are live webcasts, gaming, anything else? Live webcasts we can always deliver through the cable network :). Gaming has intrinsic bandwidth limits based on our reaction time/sensory limits. So yea. Seems multicast is common enough to warrant being done right, but not so common that I won't ask why we need it ...


Suman Banerjee, Bobby Bhattacharjee, Christopher Kommareddy


Presents the NICE application-layer multicast protocol. The protocol arranges the end hosts into a hierarchy with some nice properties, created and maintained in a distributed fashion. Nodes maintain soft state. Nice properties include O(log N) control overhead, an O(log N) "stretch" bound, and O(log N) hierarchy diameter. Includes simulation results, as well as WAN-implementation measurements against the Narada protocol.
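A quick back-of-envelope check on where the O(log N) properties come from. With clusters of size about k, each layer has roughly 1/k as many members as the one below (one leader per cluster moves up), so the number of layers grows logarithmically in N. The parameters here are arbitrary:

```python
# Count hierarchy layers when each cluster of ~k members promotes one leader
# to the layer above. Layers ~ log_k(N), which is where NICE's O(log N)
# bounds come from.

import math

def num_layers(n_hosts, k):
    layers = 0
    members = n_hosts
    while members > 1:
        members = math.ceil(members / k)      # one leader per cluster moves up
        layers += 1
    return layers

assert num_layers(1000, k=4) == 5             # 1000 -> 250 -> 63 -> 16 -> 4 -> 1
assert num_layers(10, k=4) == 2               # 10 -> 3 -> 1
```

So even for a thousand hosts the hierarchy is only a handful of layers deep, and control traffic per node stays bounded by the cluster size at each layer.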

Discussions & Criticisms:

So there's the NICE multicast protocol and the NICE project. What does the NICE project do that demanded the construction of such an ... elaborate protocol?

I actually found this paper to be good and bad at the same time. It's evident that the scheme they proposed is quite elaborate. They put a lot of effort into it and really did a lot of work, both theory wise in proving some qualities about it and experiments wise in simulating and measuring a real implementation.

However, I'm quite disturbed by the seeming lack of any discussion of the thinking behind their architecture, or of how they came up with the design. The paper is presented as: here's the pain point, here's a system, here are some properties of it, and here are our simulation and implementation measurement results. Somewhat feels like "Fiat Lux!/NICE!/whatever!" And I am left to wonder how the hell they conjured up the system, like the big bang.

The unease just mentioned leads to other reservations. What about other topologies? Hypercube? If the application is really low-bandwidth and tolerant of delay variation within certain bounds, what's wrong with faking multicast with unicast and staggering the transmission times? The hierarchy seems very even and symmetrical. Any note on distinguishing the source of the data? Usually each data-streaming application has a small number of data sources but a large number of data sinks. Their system doesn't seem to account for that.

Also, I wonder what Akamai's secret sauce really is. We did talk about Akamai earlier in the semester in the context of DNS. Perhaps now would be a good time to revisit the convo? How do all the stock-ticker systems and ESPN game-ticker systems do it? ESPN could have problems during the Super Bowl or major sporting events. And I imagine some stock-ticker system somewhere comes near collapse during times of high-volume trading.

Methodology-wise, all their results depend on comparison against Narada. What is that???? What if Narada just sucks ....

Tuesday, November 18, 2008

2008-11-18 Readings - Data-oriented networking and DTNs


Niraj Tolia, Michael Kaminsky, David G. Andersen, and Swapnil Patil


Presents the DOT architecture. Basically an abstraction layer between the transport and application layers that allows innovations in data transfer to occur separately from innovations in application functionality. Reasonable API. Implemented as various plugins. Does not impose excessive overhead. They were able to rewrite a mail program to use DOT.

Discussions & Criticisms:

I'm hesitant to say this, but this is actually one of the very few papers where my disagreement goes beyond the methodology, the significance of the results, or the relevance/impact of the work. Here I actually disagree with the big-picture philosophical outlook of the paper.

In my mind the greatest motivation for DOT is to facilitate innovation in data transfer independent of application-specific details, and to allow cross-application services related to data (e.g. caching, middlebox-type processing, etc.). But isn't it the explicit purpose of transport-layer protocols to facilitate data transfer independent of application-layer details?

Most of the experiments in the paper show that DOT does not suck or impose excessive overhead. But there are no results demonstrating that innovation is facilitated where it had been impossible, or made easier where it had been difficult. Neither were the cross-application services realized, nor was enough effort put into demonstrating that such services are possible with the proposed system architecture and API. So in my mind the work did not address the very valid pain points it brought up.

That said, I do subscribe to the view that a data-centric perspective on the network would be enlightening. After all, the network exists because we use it to communicate data. I also buy the benefits of separating "content negotiation" from "data transfer". It's just that the approach in the paper seems a very odd way to approach this set of topics.

Will bring skepticism to class and see what everyone says.


Kevin Fall


Sets the context and defines the problems faced by "challenged" internets. Examples of such networks include terrestrial mobile networks, exotic media like satellite, ad-hoc networks, and sensor networks. The basic argument is that the real-time interactive message exchange assumed by most existing protocols breaks down for challenged internets. Outlines a potential architecture to resolve the problems. Does not contain a realization of the architecture or a performance study.

Discussions & Criticisms:

There has been another series of similar work by one of Eric Brewer's students, recently presented as a dissertation, on the Internet for developing regions. There, the context of developing regions presents a whole host of unexpected stuff that creates a very challenged Internet. It also uses the old "sneaker net" idea of physically delivering data (the DOT paper had a version of this idea re portable storage too).

Does a good job of setting the context and framing the problem.

I think the real value of this series of work may be that the "unchallenged" Internet poses a small subset of the problems of the "challenged" Internet. So techniques for the "challenged" Internet should enlighten efforts to make the regular Internet more robust.

There is no implementation or performance evaluation, so I guess this is more of a philosophical-outlook type paper. I think the measure of success for anything dealing with the "challenged" Internet should just be that things work. They don't even need to work well. We either have a non-working Internet, or we have a "challenged" Internet where things suck but kinda work. Better than nothing.

Again, I do have some reservations: instead of some big and fuzzy new architecture, the right response to the "challenged" Internet may be to overhaul the existing IP/TCP/UDP protocols and make them better able to withstand "challenges". That way the everything-converges-to-IP trend would let all the exotic contexts the paper brings up benefit. Otherwise we'd have to go through the pain of implementing one-size-fits-all solutions that require consensus across the pretty well segregated exotic-networks fields. Or have each field implement its own DTN solution, which is essentially taking the DTN philosophy but continuing on the existing path of devising/improving custom protocols for each type of "challenged" network.

Thursday, November 13, 2008

2008-11-13 Readings - Measurement and Tracing


Vern Paxson


Presents results from a large-scale Internet tracing study. On the order of 20,000 TCP bulk transfers of 100 KB between 35 Internet sites. Measurements at Poisson intervals to extract time averages. Analyzes network "pathologies" including out-of-order delivery, packet replication, and packet corruption that's not due to noise. Also finds the bottleneck bandwidth for each pair of links. Presents distributions of packet loss and packet delays. Methodological refinements over previous studies in various areas.

Discussions & Criticisms:

Does Vern always do his research by himself?

The bottleneck-link findings - I would be more interested in where the bottleneck links are. Knowing where they are would let us figure out ways of relieving/removing the bottlenecks. Also, what are T1 circuits?

The packet loss rates are quite high. And the claim is that TCP works nevertheless?

Figure 9 looks very similar to the graphs we have from DC Metro measurements. Probably worth going back to those graphs again and see if it's indeed the same phenomenon.

Interesting how the data is from 1995 but the publication is from 1999. Does it take 4 years to crunch through the data?

The Internet in 1995 was on the order of 5 million hosts. In 2008 it's on the order of 500 million. How big an effort would it be to repeat a similar measurement study?

Also, 35 hosts out of 5 million is almost nothing. Vern noted several findings that differ from other studies of the same things. A fundamental shortcoming of any attempt to get a snapshot of the Internet?

Packet replication and corruption - hard to imagine what could happen ...

Infinite variance in loss outage duration! I wonder if that's still the case.

The Poisson-sampling thing is really nice for measuring long-term phenomena with a small amount of data. Don't see that often anymore. Did people forget the early-1990s papers?
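The trick is simple to implement: schedule measurements at exponentially distributed intervals, so the measurement points form a Poisson process, and by the PASTA property (Poisson Arrivals See Time Averages) the samples reflect the long-run time average of the observed process. A minimal sketch, with made-up rate and duration:

```python
# Generate Poisson-process sample times: exponential inter-arrival intervals
# at a chosen rate. Sampling at these times yields unbiased time averages
# (the PASTA property), which is what Paxson's methodology leans on.

import random

def poisson_sample_times(rate_per_hour, duration_hours, seed=42):
    random.seed(seed)
    t, times = 0.0, []
    while True:
        t += random.expovariate(rate_per_hour)   # exponential inter-arrival
        if t > duration_hours:
            return times
        times.append(t)

times = poisson_sample_times(rate_per_hour=2.0, duration_hours=100.0)
assert all(a < b for a, b in zip(times, times[1:]))     # strictly increasing
assert 100 < len(times) < 300                           # roughly rate * duration
```

The expected sample count is rate times duration (about 200 here), but the memoryless spacing is what makes the averages unbiased, not the count.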


Rodrigo Fonseca, George Porter, Randy H. Katz, Scott Shenker, Ion Stoica


Presents the architecture and implementation of the X-Trace infrastructure. Allows tracing of causal interactions across multiple network layers. Does so through bookkeeping in metadata that gets propagated through the network stack, using pushDown() and pushNext() operations. The metadata fits easily into the extensible headers of protocols like HTTP/TCP/IP. Offline reconstruction of call trees. Takes care of administrative control - people who request X-Trace don't necessarily see the traced data; the data goes to the admin of each administrative domain. Needs modifications to protocols/clients/nodes.
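A toy model of the metadata propagation, to make the offline tree reconstruction concrete. The field names and id scheme here are invented; X-Trace's actual metadata format differs:

```python
# Toy X-Trace-style propagation: each operation's metadata carries a task id
# plus the operation id that caused it. push_down crosses layers, push_next
# follows the datapath; an offline pass can rebuild the causal tree by
# following parent pointers.

import itertools

_ids = itertools.count(1)

def new_op(task_id, parent_op):
    return {"task": task_id, "op": next(_ids), "parent": parent_op}

def push_down(md):                 # same task, caused-by edge to lower layer
    return new_op(md["task"], md["op"])

def push_next(md):                 # same task, caused-by edge to next hop
    return new_op(md["task"], md["op"])

http = new_op("task7", parent_op=None)       # e.g. an HTTP request
tcp = push_down(http)                        # the TCP segment it caused
tcp_next = push_next(tcp)                    # the next hop's TCP operation
assert tcp["parent"] == http["op"] and tcp_next["parent"] == tcp["op"]
assert {http["task"], tcp["task"], tcp_next["task"]} == {"task7"}
```

Everything hangs off the shared task id, which is exactly why every layer and hop has to be modified to carry it: drop the metadata anywhere and the tree breaks at that point.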

Discussions & Criticisms:

Requires modifying various protocols/clients/nodes. Already noted multiple times in the paper as a shortcoming. Probably a necessary thing if we are to get really comprehensive traces. I think the whole ugly business of tracing is not a shortcoming of the trace tools (that they must modify various things), but a shortcoming of the original network design: accounting/tracing was a very low priority. Sure, the top priority of the DARPA project was blah blah survivability, but tracing/accounting/debugging capability should arguably not even be a priority, but a general system design principle. Without these capabilities, we end up with systems that are hard to understand and hard to improve.

One could give the counter-argument - screw accounting and tracing; what's important is having systems that work, and who cares about understanding why they work. On that view we'll often end up re-inventing the wheel and stumbling from revolutionary change to revolutionary change, with very little common evolutionary theme connecting it all.

Real life is somewhere in the middle .... Hence X-Trace I guess ...

I don't understand what fundamentally prevents us from having a task graph. It's just metadata and task identifiers, right? So the metadata could be designed to be extensible and carry multiple task identifiers? Also, do something like XML namespaces and have each piece of X-Trace metadata conform to a common format, but be extensible beyond that? Could be nice for future evolution as well.

The paper did note that the key measure of X-Trace's success is whether it can help real-life admins and developers debug/maintain their stuff. So, one year since the paper was published, any happy stories in that regard?

Tuesday, November 11, 2008

2008-11-06 Readings - Names, Identifiers, and Network Architecture


Michael Walfish, Jeremy Stribling, Maxwell Krohn, Hari Balakrishnan, Robert Morris, and Scott Shenker


Presents the Delegation-Oriented Architecture (DOA), which not only allows but also facilitates the deployment of middleboxes. The basic idea is very simple: instead of identifying nodes by IP addresses & ports, do it using another layer of abstraction, the EID (endpoint identifier). The implementation requires a DHT for EID resolution, additions to the OS kernel to do the EID lookup, and a DOA header in the network stack between the IP header and the transport header. Their initial implementation has tolerable overhead.
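The delegation mechanic can be sketched simply. In this illustrative toy (a flat dict stands in for the DHT, and the record format is invented), an EID resolves either to an IP address or to another EID, so a host interposes a middlebox just by pointing its record at the box:

```python
# Sketch of DOA-style delegation: follow EID -> EID chains through the
# (here: dict-backed) resolution service until an IP address turns up.

def resolve(eid, records, max_hops=10):
    """Follow EID delegation chains until an IP address is found."""
    for _ in range(max_hops):
        target = records[eid]
        if target.startswith("ip:"):
            return target
        eid = target                      # delegated to another endpoint
    raise RuntimeError("delegation chain too long (loop?)")

records = {
    "eid:alice": "eid:alice-firewall",    # alice delegates to her firewall
    "eid:alice-firewall": "ip:10.0.0.5",
}
assert resolve("eid:alice", records) == "ip:10.0.0.5"
```

The point is that the middlebox is reached by explicit delegation in the resolution step, not by sitting physically on the IP path.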

Discussions & Criticisms:

Minor point: the name DOA isn't all that cute or inspiring or blandly technological. Usually I don't even notice the name, but while I was reading the paper my roommate was watching some disturbing junk on Youtube, and the screams he made pretty much sounded like DOOOOOUUUUUUAAAAAAHHHH!!!!! Had the name been Goa, then at least some people would recognize it as a place in India, albeit one associated with the short-lived Portuguese colonial presence there ...

Second, major point. Continuing the train of thought from the DNS paper, there's no chance this technology gets adopted in real life. It solves a somewhat artificial problem - the complaints from Internet architecture purists (the authors themselves) about the harmfulness of middleboxes and how they violate various Internet design principles. I would have preferred something snappy and blunt, e.g. Scott Shenker coming out and saying "We think middleboxes are no longer full of crap." He made an equivalent statement about his prior work in QoS, and that message stuck more than any paper would. Also, as presented, it requires too many architectural changes in the underlying network implementation. So costs > benefits.

There might be an economy-of-scale business argument for concentrating the management of middleboxes in an expert third party, removing the redundant expertise in that area from across multiple businesses/organizations. But for most organizations, losing control over the potentially critical functionality of the middleboxes far outweighs any gain in efficiency. Nevertheless, the "outsourced" middleboxes idea is worth keeping in mind and has the potential to completely disrupt the on-path middlebox business.

The overhead results in the paper are not yet fully convincing. Latency overhead could get large for large administrative domains with large DHT lookup tables. Packet-header overhead could get large for small, chatty packets. Might need lots of caching in the style of DNS.

Also it's not clear to me on which nodes the DHT would reside. For example, if we use Chord, who would be the "other" Chord nodes?

Potential security exploits in DOA are obvious.


Ion Stoica, Daniel Adkins, Shelley Zhuang, Scott Shenker, Sonesh Surana


Outlines the Internet Indirection Infrastructure (i3). Motivated by the current inefficiencies in multicast and anycast, and the network architecture's sole focus on unicast. The basic idea is to use (id, data) and (id, address) pairs to separate sending from receiving: senders and receivers rendezvous at some point in the network. Implemented on an overlay network (Chord in this particular implementation). Each node stores sufficient state to facilitate the rendezvous, with the state for each id allocated Chord-style. Lots of implementation details.
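The rendezvous can be modeled minimally. This sketch ignores i3's inexact (longest-prefix-style) matching and Chord placement; the class and names are illustrative only:

```python
# Minimal model of the i3 rendezvous: receivers insert triggers (id -> addr)
# and senders send (id, data); the infrastructure matches them and forwards.
# Multiple triggers on one id give multicast for free.

class I3Node:
    def __init__(self):
        self.triggers = {}                # id -> list of receiver addresses

    def insert_trigger(self, ident, addr):
        self.triggers.setdefault(ident, []).append(addr)

    def send(self, ident, data):
        """Return the (addr, data) deliveries this packet produces."""
        return [(addr, data) for addr in self.triggers.get(ident, [])]

node = I3Node()
node.insert_trigger("stream42", "10.0.0.1")
node.insert_trigger("stream42", "10.0.0.2")   # second receiver => multicast
assert node.send("stream42", "frame") == [("10.0.0.1", "frame"), ("10.0.0.2", "frame")]
assert node.send("nobody", "frame") == []
```

Mobility falls out the same way: a receiver that moves just re-inserts its trigger with the new address, and senders keep using the same id.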

Discussions & Criticisms:

Not sure if we're solving a real-life problem here. Most multicast/anycast traffic would be from a large content provider to many individual end consumers, and the multicast-tree/Akamai approach seems to work fine for that. Mobility would be a better motivation. But perhaps that's because the home-agent/foreign-agent business is really ugly. How do cell phones manage mobility? Perhaps IP/voice convergence should learn from cell phones in terms of mobility?

The architecture would inherit all the deficiencies of Chord. In particular, the hop-count routing metric is broken (would ETX or ETT be better?).

Much of the needed implementation ugliness arose from the assumption that multicast and unicast traffic can be treated well in the same infrastructure. I think they are fundamentally different. But I'm not sure whether the best solution is the current one (faking multicast using clever unicast through Akamai and such) or something else. For mobility we should consider borrowing ideas from cell phones.

The security threat model is horrendous. You have Chord nodes storing routing information and acting as points of transit. These are end consumers, with a significantly lower level of trust than the routers etc. managed by ISPs and backbones. They can eavesdrop/re-route/spoof like crazy.

Also, this potentially creates bottlenecks at the network edge - using the low capacity of the edge to do work that should arguably be done in the network core.

Still, a nice idea that more or less solves what it sets out to solve under the constraints it places on itself.

2008-11-04 Readings - DNS and the Web


Paul V. Mockapetris, Kevin J. Dunlap


Talks about the original motivation for and implementation of DNS. Motivated by how HOSTS.TXT failed to scale. Original system requirements: scale, distributed management, no limit on names & the data associated with names, and interoperability across the DARPA Internet and other networks. Additional requirements that helped transition/adoption: independence from network topology, and avoiding forcing a single OS or organizational style. Generally a minimalist design to allow the greatest implementation flexibility. Caching responses is really important, including caching negative responses. Adoption resistance mainly arose from differing assumptions about how systems should work. Also, protocol-compliance tests should assume people do just enough to get things working and no more, and thus must include mechanisms that require demonstration of the required functionality.
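The negative-caching point is worth making concrete: caching "this name does not exist" answers keeps failed-lookup retries off the servers just as positive caching does for hits. A minimal sketch (TTLs, names, and the class are all made up for illustration):

```python
# Illustration of caching both positive and negative DNS answers with a TTL.
# answer=None models a cached NXDOMAIN; a hit on it avoids re-asking upstream.

class DnsCache:
    def __init__(self):
        self.entries = {}                 # name -> (answer_or_None, expiry)

    def put(self, name, answer, ttl, now):
        self.entries[name] = (answer, now + ttl)

    def get(self, name, now):
        if name in self.entries:
            answer, expiry = self.entries[name]
            if now < expiry:
                return True, answer       # hit (possibly a cached negative)
            del self.entries[name]
        return False, None                # miss: must query upstream

cache = DnsCache()
cache.put("a.example", "10.1.2.3", ttl=60, now=0)
cache.put("no.example", None, ttl=60, now=0)
assert cache.get("a.example", now=30) == (True, "10.1.2.3")
assert cache.get("no.example", now=30) == (True, None)    # negative hit
assert cache.get("a.example", now=120) == (False, None)   # expired
```

Without the negative entry, every repeat query for "no.example" within the minute would have gone back upstream.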

Discussions & Criticisms:

Overall a very interesting story of how a core component of the Internet got developed. The motivation of the system was pretty obvious, and I don't think anyone would dispute that DNS is needed or that DNS has been successful. What got me going is the adoption story, perhaps a model for designing future technologies that are likely to be adopted on a large scale.

I think there are several criteria for the DNS success story to be repeated. 1. The technology must tackle a critical problem that makes both research and industry say "Oh crap, we're screwed", and resolve the problem with significant demonstrable benefits. 2. The technology must be minimalist in design, and make no assumptions about system topology, architecture, organizational style, or policy priorities. 3. The initial technology must resolve the problem at hand and not try to force solutions to other problems (the debate on MX records could easily have blown out of control, and had MX been central to DNS, that debate could have scuttled DNS adoption); once adoption is well underway, layered-on functionality can be introduced. 4. The technology must work under the assumption that real-life implementers will achieve minimal functionality and do very little testing/tuning.

Would appreciate comments re the above list.

I also find it interesting to think about the naming resolution/distribution problems for the Internet (DNS) and for the phone network (basically through human contact).


Jaeyeon Jung, Emil Sit, Hari Balakrishnan


Presents DNS trace data and associated analysis. Data comes from around 2000-2001: three continuous chunks on the order of days, traced from the links connecting MIT LCS and KAIST to the WAN. Also has trace-driven simulations to investigate caching strategies and the effect of changing the cache TTL. Key results include a large % of no-answer or error results - the former takes up a large chunk of traffic due to overly persistent retries. Also, long-tailed name-access patterns suggest that a low cache TTL does not impede performance, and that a cache shared among clients does not improve performance.

Discussions & Criticisms:

First, the data collection method. I appreciate that, as a first-of-its-kind study, they had to work with whatever data they had rather than whatever data they would wish for. Still, future studies should consider capturing more data and doing some kind of Poisson sampling with it (e.g. sample a few hours at a time over several months at Poisson intervals). The general intuition is that Poisson sampling captures time averages. I think both tracing and data-analysis capabilities have increased greatly since the work was first published.

Long-tailed name-access patterns mean the useful TTL is about the duration of a user session, since it's reasonably pointless to cache across sessions that go all over the place. The turning point of all their TTL-vs-hit-rate curves occurs at TTL ~ 500 sec, which seems to correspond to the duration of a user session.

Did they propose any fixes for the observed pathologies (no answers, overly persistent retries, etc.)?

The caching and TTL messages are reasonably convincing and look adoptable - it just requires software configuration of the DNS cache, right? Have people been willing to do this?

2008-10-30 Readings - Distributed Hash Tables


Hari Balakrishnan, M. Frans Kaashoek, David Karger, Robert Morris, Ion Stoica


Surveys various DHT (distributed hash table) schemes for lookup in P2P systems. Frames the "lookup problem" and describes the DHT abstraction. Outlines one-dimensional routing (Chord), tree-like routing (Pastry and Tapestry), and multidimensional routing (CAN). Notes open problems in the field, e.g. distance functions, operation costs, fault tolerance, proximity routing, malicious nodes, and indexing and keyword search.

Discussions & Criticisms:

It would be nice for a survey paper to do some comparative study using some figure of merit. It wouldn't need to say one thing sucks and another rocks, but it would give an indication of the pros and cons of each. The community did later arrive at the conclusion that Chord is basically the best of them.

Not much else to say for a survey paper. DHTs are still new material for me. I skipped the undergraduate database class and didn't fully "get" the topic in CS262A, where I also read the Chord and CAN papers. This will hopefully sink in on the second pass.


Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, Hari Balakrishnan


Presents Chord, an architecture and implementation of a P2P lookup service. Key ideas: a circular identifier structure, consistent hashing for load balance, finger tables to provide O(log N) lookup cost, and successor nodes to ensure nodes can join and leave without hurting system performance. Generally agreed to be the best P2P DHT there is. Lots of implementation details. Not clear how it deals with concurrent joins and departures.
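The successor rule at the heart of Chord fits in a few lines. This toy uses a hand-picked 8-slot identifier circle instead of hashed ids, and does a linear scan where the real protocol's finger tables achieve O(log N) hops:

```python
# Tiny Chord-flavored lookup: a key is stored at its successor, the first
# node id clockwise from the key on the identifier circle.

def successor(key, nodes, ring_size=8):
    """First node id at or after key, wrapping around the circle."""
    for step in range(ring_size):
        candidate = (key + step) % ring_size
        if candidate in nodes:
            return candidate
    raise ValueError("no nodes on ring")

nodes = {0, 3, 6}
assert successor(1, nodes) == 3
assert successor(3, nodes) == 3        # key stored at the node itself
assert successor(7, nodes) == 0        # wraps around the circle
```

Consistent hashing's load-balance and low-churn properties both come from this rule: when a node joins or leaves, only the keys between it and its predecessor move.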

Discussions & Criticisms:

Same concerns as on my first read of the paper. The system is obviously proved to be good and accepted as such. Ease of implementation? Is it adopted widely? Is the next random P2P startup going to jump onto the Chord bandwagon?

The modifications to the vanilla protocol to handle concurrent joins/departures look somewhat awkward. Seems like one does need to read their technical report to see all the proofs.

Mentioned this in class: Chord uses hop count now, and people have shown in various contexts that hop count can be a poor routing metric. Any attempts to apply ETX and other such tricks to Chord? Could be a side project once I take care of business in the datacenter track.

DHT for datacenters? Surely there must be techniques that we can recycle for something.