Inhalt des Dokuments
Topics for the Seminar on Internet Routing, WS 2014/15
Topics for the seminar on Internet Routing (WS
Themen für das Seminar über Internet Routing (WS 2014/15).
Achieving High Utilization with Software-Driven WAN
We present SWAN, a system that boosts the utilization of inter-datacenter networks by centrally controlling when and how much traffic each service sends and frequently re-configuring the network’s data plane to match current traffic demand. But done simplistically, these reconfigurations can also cause severe, transient congestion because different switches may apply updates at different times. We develop a novel technique that leverages a small amount of scratch capacity on links to apply updates in a provably congestion-free manner, without making any assumptions about the order and timing of updates at individual switches. Further, to scale to large networks in the face of limited forwarding table capacity, SWAN greedily selects a small set of entries that can best satisfy current demand. It updates this set without disrupting traffic by leveraging a small amount of scratch capacity in forwarding tables. Experiments using a testbed prototype and data-driven simulations of two production networks show that SWAN carries 60% more traffic than the current practice.
FCP: A Flexible Transport Framework for Accommodating Diversity
Transport protocols must accommodate diverse application and network requirements. As a result, TCP has evolved over time with new congestion control algorithms such as support for generalized AIMD, background flows, and multipath. On the other hand, explicit congestion control algorithms have been shown to be more efficient. However, they are inherently more rigid because they rely on in-network components. Therefore, it is not clear whether they can be made flexible enough to support diverse application requirements. This paper presents a flexible framework for network resource allocation, called FCP, that accommodates diversity by exposing a simple abstraction for resource allocation. FCP incorporates novel primitives for end-point flexibility ( aggregation and preloading ) into a single framework and makes economics-based congestion control practical by explicitly handling load variations and by decoupling it from actual billing. We show that FCP allows evolution by accommodating diversity and ensuring coexistence, while being as efficient as existing explicit congestion control algorithms.
Leveraging Endpoint Flexibility in Data-Intensive Clusters
Many applications do not constrain the destinations of their network transfers. New opportunities emerge when such transfers contribute a large amount of network bytes. By choosing the endpoints to avoid congested links, completion times of these transfers as well as that of others without similar flexibility can be improved. In this paper, we focus on leveraging the flexibility in replica placement during writes to cluster file systems (CFSes), which account for almost half of all cross-rack traffic in data-intensive clusters. Thereplicas of a CFS write can be placed in any subset of machines as long as they are in multiple fault domains and ensure a balanced use of storage throughout the cluster. We study CFS interactions with the cluster network, analyze optimizations for replica placement, and propose Sinbad – a system that identifies imbalance and adapts replica destinations to navigate around congested links. Experiments on EC2 and trace-driven simulations show that block writes complete 1.3x (respectively,1.58x) faster as the network becomes more balanced. As a collateral benefit, end-to-end completion times of data-intensive jobs improve as well. Sinbad does so with little impact on the long-term storage balance.
Speeding up Distributed Request-Response Workflows
We found that interactive services at Bing have highly variable datacenter-side processing latencies because their processing consists of many sequential stages, parallelization across 10s-1000s of servers and aggregation of responses across the network. To improve the tail latency of such services, we use a few building blocks: reissuing laggards elsewhere in the cluster, new policies to return incomplete results and speeding up laggards by giving them more resources. Combining these building blocks to reduce the overall latency is non-trivial because for the same amount of resource (e.g., number of reissues), different stages improve their latency by different amounts. We present Kwiken, a framework that takes an end-to-end view of latency improvements and costs. It decomposes the problem of minimizing latency over a general processing DAG into a manageable optimization over individual stages. Through simulations with production traces, we show sizable gains; the 99th percentile of latency improves by over 50% when just 0.1% of the responses are allowed to have partial results and by over 40% for 25% of the services when just 5% extra resources are used for reissues.
Dynamic scheduling of network updates
We present Dionysus, a system for fast, consistent network updates in software-defined networks. Dionysus encodes as a graph the consistency-related dependencies among updates at individual switches, and it then dynamically schedules these updates based on runtime differences in the update speeds of different switches. This dynamic scheduling is the key to its speed; prior update methods are slow because they pre-determine a schedule, which does not adapt to runtime conditions. Testbed experiments and data-driven simulations show that Dionysus improves the median update speed by 53--88% in both wide area and data center networks compared to prior methods.
- Xin Jin, Hongqiang Harry Liu, Rohan Gandhi, Srikanth Kandula, Ratul Mahajan, Ming Zhang, Jennifer Rexford, and Roger Wattenhofer. 2014. Dynamic scheduling of network updates. In Proceedings of the 2014 ACM conference on SIGCOMM (SIGCOMM '14). ACM, New York, NY, USA, 539-550. DOI=10.1145/2619239.2626307 doi.acm.org/10.1145/2619239.2626307 
pFabric: Minimal Near-Optimal Datacenter Transport
In this paper we present pFabric, a minimalistic datacenter transport design that provides near theoretically optimal flow co mpletion times even at the 99th percentile for short flows, while still minimizing average flow completion time for long flows. Moreover, pFabric delivers this performance with a very simple design that is based on a key conceptual insight: datacenter transport should decouple flow scheduling from rate control. For flow scheduling, packets carry a single priority number set independently by each flow; switches have very small buffers and implement a very simple priority-based scheduling/dropping mechanism. Rate control is also correspondingly simpler; flows start at line rate and throttle back only under high and persistent packet loss. We provide theoretical intuition and show via extensive simulations that the combination of these two simple mechanisms is sufficient to provide near-optimal performance.
VALE, a Switched Ethernet for Virtual Machines
The growing popularity of virtual machines is pushing the demand for high performance communication between them. Past solutions have seen the use of hardware assistance, in the form of “PCI passthrough” (dedicating parts of physical NICs to each virtual machine) and even bouncing traffic through physical switches to handle data forwarding and replication. In this paper we show that, with a proper design, very high speed communication between virtual machines can be achieved completely in software. Our architecture, called VALE, implements a Virtual Local Ethernet that can be used by virtual machines, such as QEMU, KVM and others, as well as by regular processes. VALE achieves a throughput of over 17 million packets per second (Mpps) between host processes, and over 2 Mpps between QEMU instances, without any hardware assistance. VALE is available for both FreeBSD and Linux hosts, and is implemented as a kernel module that extends our recently proposed netmap framework, and uses similar techniques to achieve high packet rates.
Towards TCAM-based Scalable Virtual Routers
As the key building block for enabling network virtualization, virtual routers have attracted much attention recently. In a virtual router platform, multiple virtual router instances coexist, each with its own FIB (Forwarding Information Base). The small amount of high-speed memory in a physical router platform severely limits the number of FIBs supported, which leads to a scalability challenge. In this paper, we present a method towards TCAM (Ternary Content Addressable Memory) based scalable virtual routers, through a merged data structure that enables the sharing of prefixes from several FIBs in TCAMs. Based on this data structure, we propose two approaches to merge multiple FIBs in TCAMs, paving the way for scalable virtual routers. Experimental results show that, by using the two approaches for storing 14 full IPv4 FIBs, the TCAM memory requirement can be reduced by about 92% and 82% respectively, compared with the conventional approach of treating FIBs as independent entities.
Modeling Complexity of Enterprise Routing Design
Enterprise networks often have complex routing designs given the need to meet a wide set of resiliency, security and routing policies. In this paper, we take the position that minimizing design complexity must be an explicit objective of routing design. We take a first step to this end by presenting a systematic approach for modeling and reasoning about complexity in enterprise routing design. We make three contributions. First, we present a framework for precisely defining objectives of routing design, and for reasoning about how a combination of routing design primitives (e.g. routing instances, static routes, and route filters etc.) will meet the objectives. Second, we show that it is feasible to quantitatively measure the complexity of a routing design by modeling individual routing design primitives, and leveraging configuration complexity metrics . Our approach helps understand how individual design choices made by operators impact configuration complexity, and can enable quantifying design complexity in the absence of configuration files. Third, we validate our model and demonstrate its utility through a longitudinal analysis of the evolution of the routing design of a large campus network over the last three years. We show how our models can enable comparison of the complexity of multiple routing designs that meet the same objective, guide operators in making design choices that can lower complexity, and enable what-if analysis to assess the potential impact of a configuration change on routing design complexity.
Duet: cloud scale load balancing with hardware and software
Load balancing is a foundational function of datacenter infrastructures and is critical to the performance of online services hosted in datacenters. As the demand for cloud services grows, expensive and hard-to-scale dedicated hardware load balancers are being replaced with software load balancers that scale using a distributed data plane that runs on commodity servers. Software load balancers offer low cost, high availability and high flexibility, but suffer high latency and low capacity per load balancer, making them less than ideal for applications that demand either high throughput, or low latency or both. In this paper, we present Duet, which offers all the benefits of software load balancer, along with low latency and high availability -- at next to no cost. We do this by exploiting a hitherto overlooked resource in the data center networks -- the switches themselves. We show how to embed the load balancing functionality into existing hardware switches, thereby achieving organic scalability at no extra cost. For flexibility and high availability, Duet seamlessly integrates the switch-based load balancer with a small deployment of software load balancer. We enumerate and solve several architectural and algorithmic challenges involved in building such a hybrid load balancer. We evaluate Duet using a prototype implementation, as well as extensive simulations driven by traces from our production data centers. Our evaluation shows that Duet provides 10x more capacity than a software load balancer, at a fraction of a cost, while reducing latency by a factor of 10 or more, and is able to quickly adapt to network dynamics including failures.
- Rohan Gandhi, Hongqiang Harry Liu, Y. Charlie Hu, Guohan Lu, Jitendra Padhye, Lihua Yuan, and Ming Zhang. 2014. Duet: cloud scale load balancing with hardware and software. In Proceedings of the 2014 ACM conference on SIGCOMM (SIGCOMM '14). ACM, New York, NY, USA, 27-38. DOI=10.1145/2619239.2626317 doi.acm.org/10.1145/2619239.2626317 
CONGA: distributed congestion-aware load balancing for datacenters
- Mohammad Alizadeh, Tom Edsall, Sarang Dharmapurikar, Ramanan Vaidyanathan, Kevin Chu, Andy Fingerhut, Vinh The Lam, Francis Matus, Rong Pan, Navindra Yadav, and George Varghese. 2014. CONGA: distributed congestion-aware load balancing for datacenters. In Proceedings of the 2014 ACM conference on SIGCOMM (SIGCOMM '14). ACM, New York, NY, USA, 503-514. DOI=10.1145/2619239.2626316 doi.acm.org/10.1145/2619239.2626316 
Hedera: Dynamic Flow Scheduling for Data Center Networks
Student/Bearbeiter: –; Supervisor/Betreuer: –
Today’s data centers offer tremendous aggregate bandwidth to clusters of tens of thousands of machines. However, because of limited port densities in even the highest-end switches, data center topologies typically consist of multi-rooted trees with many equal-cost paths between any given pair of hosts. Existing IP multipathing protocols usually rely on per-ﬂow static hashing and can cause substantial bandwidth losses due to longterm collisions. In this paper, we present Hedera, a scalable, dynamic ﬂow scheduling system that adaptively schedules a multi-stage switching fabric to efﬁciently utilize aggregate network resources. We describe our implementation using commodity switches and unmodiﬁed hosts, and show that for a simulated 8,192 host data center, Hedera delivers bisection bandwidth that is 96% of optimal and up to 113% better than static load-balancing methods.
- Mohammad Al-Fares, Sivasankar Radhakrishnan, Barath Raghavan, Nelson Huang, and Amin Vahdat. 2010. Hedera: dynamic flow scheduling for data center networks . In Proceedings of the 7th USENIX conference on Networked systems design and implementation (NSDI'10). USENIX Association, Berkeley, CA, USA, 19-19.
RouteBricks: Exploiting Parallelism To Scale Software Routers
Student/Bearbeiter: –; Supervisor/Betreuer: –
We revisit the problem of scaling software routers, motivated by recent advances in server technology that enable high-speed parallel processing—a feature router workloads appear ideally suited to exploit. We propose a software router architecture that parallelizes router functionality both across multiple servers and across multiple cores within a single server. By carefully exploiting parallelism at every opportunity, we demonstrate a 35Gbps parallel router prototype; this router capacity can be linearly scaled through the use of additional servers. Our prototype router is fully programmable using the familiar Click/Linux environment and is built entirely from off-the-shelf, general-purpose server hardware.
- Mihai Dobrescu and Norbert Egi, Katerina Argyraki, Byung-Gon Chun, Kevin Fall,Gianluca Iannaccone, Allan Knies, Maziar Manesh, Sylvia Ratnasamy. RouteBricks: Exploiting Parallelism To Scale Software Routers . 22nd ACM Symposium on Operating Systems Principles (SOSP), October 2009
Compact Routing on Internet-Like Graphs
The Thorup-Zwick (TZ) compact routing scheme is the first generic stretch-3 routing scheme delivering a nearly optimal per-node memory upper bound. Using both direct analysis and simulation, we derive the stretch distribution of this routing scheme on Internet-like inter-domain topologies. By investigating the TZ scheme on random graphs with power-law node degree distributions, Pk ≅ k-γ, we find that the average TZ stretch is quite low and virtually independent of γ. In particular, for the Internet inter-domain graph with γ ≈ 2.1, the average TZ stretch is around 1.1, with up to 70% of all pairwise paths being stretch-1 (shortest possible). As the network grows, the average stretch slowly decreases. We find routing table sizes to be very small (around 50 records for 104-node networks), well below their theoretical upper bounds. Furthermore, we find that both the average shortest path length (i.e., distance) d and width of the distance distribution σ observed in the real Internet inter-AS graph have values that are very close to the minimums of the average stretch in the d- and σ-directions. This leads us to the discovery of a unique critical point of the average TZ stretch as a function of d and σ. The Internet's distance distribution is located in a close neighborhood of this point. This is remarkable given the fact that the Internet inter-domain topology has evolved without any direct attention paid to properties of the stretch distribution. It suggests the average stretch function may be an indirect indicator of the optimization criteria influencing the Internet's inter-domain topology evolution.
- Dmitri Krioukov, Kevin Fall, Xiaowei Yang. Compact Routing on Internet-Like Graphs . Infocom 2004
Link Positions Matter: A Noncommutative Routing Metric for Wireless Mesh Networks
We revisit the problem of computing the path with the minimum cost in terms of the expected number of link layer transmissions (including retransmissions) in wireless mesh networks. Unlike previous efforts, such as the popular ETX, we account for the fact that MAC protocols (including the IEEE 802.11 MAC) incorporate a finite number of transmission attempts per packet. This in turn leads to our key observation: the performance of a path depends not only on the number of the links on the path and the quality of its links, but also, on the relative positions of the links on the path. Based on this observation, we propose ETOP, a path metric that accurately captures the expected number of link layer transmissions required for reliable end-to-end packet delivery. We analytically compute ETOP, which is not trivial, since ETOP is a noncommutative function of the link success probabilities. Although ETOP is a more involved metric, we show that the problem of computing paths with the minimum ETOP cost can be solved by a greedy algorithm. We implement and evaluate a routing approach based on ETOP on a 25-node indoor mesh network. Our experiments show that the path selection with ETOP consistently results in superior TCP goodput (by over 50% in many cases) compared to path selection based on ETX. We also perform an in-depth analysis of the measurements to better understand why the paths selected by ETOP improve the TCP performance.
- Jakllari G., Eidenbenz S., Hengartner N., Krishnamurthy S.V., Faloutsos M. Link Positions Matter: A Noncommutative Routing Metric for Wireless Mesh Networks . INFOCOM 2008. The 27th Conference on Computer Communications. IEEE
Quartz: a new design element for low-latency DCNs
Student/Bearbeiter: –; Supervisor/Betreuer:
Most datacenter network (DCN) designs focus on maximizing bisection bandwidth rather than minimizing server-to-server latency. We explore architectural approaches to building low-latency DCNs and introduce Quartz, a design element consisting of a full mesh of switches. Quartz can be used to replace portions of either a hierarchical network or a random network. Our analysis shows that replacing high port-count core switches with Quartz can significantly reduce switching delays, and replacing groups of top-of-rack and aggregation switches with Quartz can significantly reduce congestion-related delays from cross-traffic. We overcome the complexity of wiring a complete mesh using low-cost optical multiplexers that enable us to efficiently implement a logical mesh as a physical ring. We evaluate our performance using both simulations and a small working prototype. Our evaluation results confirm our analysis, and demonstrate that it is possible to build low-latency DCNs using inexpensive commodity elements without significant concessions to cost, scalability, or wiring complexity.
- Yunpeng James Liu, Peter Xiang Gao, Bernard Wong, and Srinivasan Keshav. 2014. Quartz: a new design element for low-latency DCNs. In Proceedings of the 2014 ACM conference on SIGCOMM (SIGCOMM '14). ACM, New York, NY, USA, 283-294. DOI=10.1145/2619239.2626332 doi.acm.org/10.1145/2619239.2626332 
A Deep Dive Into the LISP Cache and What ISPs Should Know About It
Student/Bearbeiter: –; Supervisor/Betreuer: –
Due to scalability issues that the current Internet is facing, the research community has re-discovered the Locator/ID Split paradigm. As the name suggests, this paradigm is based on the idea of separating the identity from the location of end-systems, in order to increase the scalability of the Internet architecture. One of the most successful proposals, currently under discussion at the IETF, is LISP (Locator/ID Separation Protocol). A critical component of LISP, from a performance and resources consumption perspective, as well as from a security point of view, is the LISP Cache. The LISP Cache is meant to temporarily store mappings, i.e., the bindings between identifiers and locations, in order to provide routers with the knowledge of where to forward packets. This paper presents a thorough analysis of such a component, based on real packet-level traces. Furthermore, the implications of policies to increase the level of security of LISP are also analyzed. Our results prove that even a timeout as short as 60 seconds provides high hit ratio and that the impact of using security policies is small.
- Juhoon Kim, Luigi Iannone, and Anja Feldmann. A Deep Dive into the LISP Cache and What ISPs Should Know about It . IFIP International Conference on Networking, 2011.
FireFly: a reconfigurable wireless data center fabric using free-space optics
Conventional static datacenter (DC) network designs offer extreme cost vs. performance tradeoffs---simple leaf-spine networks are cost-effective but oversubscribed, while "fat tree"-like solutions offer good worst-case performance but are expensive. Recent results make a promising case for augmenting an oversubscribed network with reconfigurable inter-rack wireless or optical links. Inspired by the promise of reconfigurability, this paper presents FireFly, an inter-rack network solution that pushes DC network design to the extreme on three key fronts: (1) all links are reconfigurable; (2) all links are wireless; and (3) non top-of-rack switches are eliminated altogether. This vision, if realized, can offer significant benefits in terms of increased flexibility, reduced equipment cost, and minimal cabling complexity. In order to achieve this vision, we need to look beyond traditional RF wireless solutions due to their interference footprint which limits range and data rates. Thus, we make the case for using free-space optics (FSO). We demonstrate the viability of this architecture by (a) building a proof-of-concept prototype of a steerable small form factor FSO device using commodity components and (b) developing practical heuristics to address algorithmic and system-level challenges in network design and management.
- Navid Hamedazimi, Zafar Qazi, Himanshu Gupta, Vyas Sekar, Samir R. Das, Jon P. Longtin, Himanshu Shah, and Ashish Tanwer. 2014. FireFly: a reconfigurable wireless data center fabric using free-space optics. In Proceedings of the 2014 ACM conference on SIGCOMM (SIGCOMM '14). ACM, New York, NY, USA, 319-330. DOI=10.1145/2619239.2626328 doi.acm.org/10.1145/2619239.2626328 
VPN Gate: A Volunteer-Organized Public VPN Relay System with Blocking Resistance for Bypassing Government Censorship Firewalls
Daiyuu Nobori and Yasushi Shinjo, University of Tsukuba
One tunnel is (often) enough
In this paper, we study whether a simple, easy to implement model is sufficient for addressing the aforementioned Internet vulnerabilities. Our model, called ARROW (Advertised Reliable Routing Over Waypoints), is designed to allow users to configure reliable and secure end to end paths through participating providers. With ARROW, a highly reliable ISP offers tunneled transit through its network, along with packet transformation at the ingress, as a service to remote paying customers. Those customers can stitch together reliable end to end paths through a combination of participating and non-participating ISPs in order to improve the fault-tolerance, robustness, and security of mission critical transmissions. Unlike efforts to redesign the Internet from scratch, we show that ARROW can address a set of well-known Internet vulnerabilities, for most users, with the adoption of only a single transit ISP. To demonstrate ARROW, we have added it to a small-scale wide-area ISP we control. We evaluate its performance and failure recovery properties in both simulation and live settings.
- Simon Peter, Umar Javed, Qiao Zhang, Doug Woos, Thomas Anderson, and Arvind Krishnamurthy. 2014. One tunnel is (often) enough. In Proceedings of the 2014 ACM conference on SIGCOMM (SIGCOMM '14). ACM, New York, NY, USA, 99-110. DOI=10.1145/2619239.2626318 doi.acm.org/10.1145/2619239.2626318 
A Technique for Reducing BGP Update Announcements through Path Exploration Damping
This paper defines and evaluates Path Exploration Damping (PED) a router-level mechanism for reducing the volume of propagation of likely transient update messages within a BGP network and decreasing average time to restore reachability compared to current BGP Update damping practices. PED selectively delays and suppresses the propagation of BGP updates that either lengthen an existing AS Path or vary an existing AS Path without shortening its length. We show how PED impacts on convergence time compared to currently deployed mechanisms like Route Flap Damping (RFD), Minimum Route Advertisement Interval (MRAI) and Withdrawal Rate Limiting (WRATE). We replay Internet BGP update traffic captured at two Autonomous Systems to observe that a PED-enabled BGP speaker can reduce the total number of BGP announcements by up to 32% and reduce Path Exploration by 77% compared to conventional use of MRAI. We also describe how PED can be incrementally deployed in the Internet, as it interacts well with prevailing MRAI deployment, and enables restoration of reachability more quickly than MRAI.
- Geoff Huston, Mattia Rossi, and Grenville Armitage. A Technique for Reducing BGP Update Announcements through Path Exploration Damping . IEEE Journal on Selected Areas in Communications, Vol. 28, No. 8, October 2010.
Traffic engineering with forward fault correction
Faults such as link failures and high switch configuration delays can cause heavy congestion and packet loss. Because it takes time to detect and react to faults, these conditions can last long---even tens of seconds. We propose forward fault correction (FFC), a proactive approach to handling faults. FFC spreads network traffic such that freedom from congestion is guaranteed under arbitrary combinations of up to k faults. We show how FFC can be practically realized by compactly encoding the constraints that arise from this large number of possible faults and solving them efficiently using sorting networks. Experiments with data from real networks show that, with negligible loss in overall network throughput, FFC can reduce data loss by a factor of 7--130 in well-provisioned networks, and reduce the loss of high-priority traffic to almost zero in well-utilized networks.
- Hongqiang Harry Liu, Srikanth Kandula, Ratul Mahajan, Ming Zhang, and David Gelernter. 2014. Traffic engineering with forward fault correction. In Proceedings of the 2014 ACM conference on SIGCOMM (SIGCOMM '14). ACM, New York, NY, USA, 527-538. DOI=10.1145/2619239.2626314 doi.acm.org/10.1145/2619239.2626314 
Private and Veriﬁable Interdomain Routing Decisions
Existing secure interdomain routing protocols can verify validity properties about individual routes, such as whether they correspond to a real network path. It is often useful to verify more complex properties relating to the route decision procedure – for example, whether the chosen route was the best one available, or whether it was consistent with the network’s peering agreements. However, this is difﬁcult to do without knowing a network’s routing policy and full routing state, which are not normally disclosed. In this paper, we show how a network can allow its peers to verify a number of nontrivial properties of its interdomain routing decisions without revealing any additional information. If all the properties hold, the peers learn nothing beyond what the interdomain routing protocol already reveals; if a property does not hold, at least one peer can detect this and prove the violation. We present SPIDeR, a practical system that applies this approach to the Border Gateway Protocol, and we report results from an experimental evaluation to demonstrate that SPIDeR has a reasonable overhead..
- Mingchen Zhao, Wenchao Zhou, Alexander Gurney, Andreas Haeberlen, Micah Sherr, Boon Thau Loo. Private and Veriﬁable Interdomain Routing Decisions . SIGCOMM 2012.
Lehre / Teaching, WiSe 14/15
- Einführung in die Programmierung (VL+UE) 
- Network Protocols and Architectures (VL+UE) 
- Wirelesslab (PR) 
- NA: Internet Routing (SE) 
Dozent: Anja Feldmann
13.10.2014 bis 14.02.2015