direkt zum Inhalt springen

direkt zum Hauptnavigationsmenü

Sie sind hier

TU Berlin

Inhalt des Dokuments

Topics for the Seminar on Internet Routing, WS 2013/14

Topics for the seminar on Internet Routing (WS 2013/14).
Themen für das Seminar über Internet Routing (WS 2013/14).

Achieving High Utilization with Software-Driven WAN

We present SWAN, a system that boosts the utilization of inter-datacenter networks by centrally controlling when and how much traffic each service sends and frequently re-configuring the network’s data plane to match current traffic demand. But done simplistically, these reconfigurations can also cause severe, transient congestion because different switches may apply updates at different times. We develop a novel technique that leverages a small amount of scratch capacity on links to apply updates in a provably congestion-free manner, without making any assumptions about the order and timing of updates at individual switches. Further, to scale to large networks in the face of limited forwarding table capacity, SWAN greedily selects a small set of entries that can best satisfy current demand. It updates this set without disrupting traffic by leveraging a small amount of scratch capacity in forwarding tables. Experiments using a testbed prototype and data-driven simulations of two production networks show that SWAN carries 60% more traffic than the current practice.

FCP: A Flexible Transport Framework for Accommodating Diversity

Transport protocols must accommodate diverse application and network requirements. As a result, TCP has evolved over time with new congestion control algorithms such as support for generalized AIMD, background flows, and multipath. On the other hand, explicit congestion control algorithms have been shown to be more efficient. However, they are inherently more rigid because they rely on in-network components. Therefore, it is not clear whether they can be made flexible enough to support diverse application requirements. This paper presents a flexible framework for network resource allocation, called FCP, that accommodates diversity by exposing a simple abstraction for resource allocation. FCP incorporates novel primitives for end-point flexibility ( aggregation and preloading ) into a single framework and makes economics-based congestion control practical by explicitly handling load variations and by decoupling it from actual billing. We show that FCP allows evolution by accommodating diversity and ensuring coexistence, while being as efficient as existing explicit congestion control algorithms.

Leveraging Endpoint Flexibility in Data-Intensive Clusters

Many applications do not constrain the destinations of their network transfers. New opportunities emerge when such transfers contribute a large amount of network bytes. By choosing the endpoints to avoid congested links, completion times of these transfers as well as that of others without similar flexibility can be improved. In this paper, we focus on leveraging the flexibility in replica placement during writes to cluster file systems (CFSes), which account for almost half of all cross-rack traffic in data-intensive clusters. Thereplicas of a CFS write can be placed in any subset of machines as long as they are in multiple fault domains and ensure a balanced use of storage throughout the cluster. We study CFS interactions with the cluster network, analyze optimizations for replica placement, and propose Sinbad – a system that identifies imbalance and adapts replica destinations to navigate around congested links. Experiments on EC2 and trace-driven simulations show that block writes complete 1.3x (respectively,1.58x) faster as the network becomes more balanced. As a collateral benefit, end-to-end completion times of data-intensive jobs improve as well. Sinbad does so with little impact on the long-term storage balance.

Speeding up Distributed Request-Response Workflows

We found that interactive services at Bing have highly variable datacenter-side processing latencies because their processing consists of many sequential stages, parallelization across 10s-1000s of servers and aggregation of responses across the network. To improve the tail latency of such services, we use a few building blocks: reissuing laggards elsewhere in the cluster, new policies to return incomplete results and speeding up laggards by giving them more resources. Combining these building blocks to reduce the overall latency is non-trivial because for the same amount of resource (e.g., number of reissues), different stages improve their latency by different amounts. We present Kwiken, a framework that takes an end-to-end view of latency improvements and costs. It decomposes the problem of minimizing latency over a general processing DAG into a manageable optimization over individual stages. Through simulations with production traces, we show sizable gains; the 99th percentile of latency improves by over 50% when just 0.1% of the responses are allowed to have partial results and by over 40% for 25% of the services when just 5% extra resources are used for reissues.

Ananta: Cloud Scale Load Balancing

Layer-4 load balancing is fundamental to creating scale-out web services. We designed and implemented Ananta, a scale-out layer-4 load balancer that runs on commodity hardware and meets the performance, reliability and operational requirements of multi-tenant cloud computing environments. Ananta combines existing techniques in routing and distributed systems in a unique way and splits the components of a load balancer into a consensus-based reliable control plane and a decentralized scale-out data plane. A key component of Ananta is an agent in every host that can take over the packet modification function from the load balancer, thereby enabling the load balancer to naturally scale with the size of the data center. Due to its distributed architecture, Ananta provides direct server return (DSR) and network address translation (NAT) capabilities across layer-2 boundaries. Multiple instances of Ananta have been deployed in the Windows Azure public cloud with combined bandwidth capacity exceeding 1Tbps. It is serving traffic needs of a diverse set of tenants, including the blob, table and relational storage services. With its scale-out data plane we can easily achieve more than 100Gbps throughput for a single public IP address. In this paper, we describe the requirements of a cloud-scale load balancer, the design of Ananta and lessons learnt from its implementation and operation in the Windows Azure public cloud.

pFabric: Minimal Near-Optimal Datacenter Transport

In this paper we present pFabric, a minimalistic datacenter transport design that provides near theoretically optimal flow co mpletion times even at the 99th percentile for short flows, while still minimizing average flow completion time for long flows. Moreover, pFabric delivers this performance with a very simple design that is based on a key conceptual insight: datacenter transport should decouple flow scheduling from rate control. For flow scheduling, packets carry a single priority number set independently by each flow; switches have very small buffers and implement a very simple priority-based scheduling/dropping mechanism. Rate control is also correspondingly simpler; flows start at line rate and throttle back only under high and persistent packet loss. We provide theoretical intuition and show via extensive simulations that the combination of these two simple mechanisms is sufficient to provide near-optimal performance.

VALE, a Switched Ethernet for Virtual Machines

The growing popularity of virtual machines is pushing the demand for high performance communication between them. Past solutions have seen the use of hardware assistance, in the form of “PCI passthrough” (dedicating parts of physical NICs to each virtual machine) and even bouncing traffic through physical switches to handle data forwarding and replication. In this paper we show that, with a proper design, very high speed communication between virtual machines can be achieved completely in software. Our architecture, called VALE, implements a Virtual Local Ethernet that can be used by virtual machines, such as QEMU, KVM and others, as well as by regular processes. VALE achieves a throughput of over 17 million packets per second (Mpps) between host processes, and over 2 Mpps between QEMU instances, without any hardware assistance. VALE is available for both FreeBSD and Linux hosts, and is implemented as a kernel module that extends our recently proposed netmap framework, and uses similar techniques to achieve high packet rates.

Towards TCAM-based Scalable Virtual Routers

As the key building block for enabling network virtualization, virtual routers have attracted much attention recently. In a virtual router platform, multiple virtual router instances coexist, each with its own FIB (Forwarding Information Base). The small amount of high-speed memory in a physical router platform severely limits the number of FIBs supported, which leads to a scalability challenge. In this paper, we present a method towards TCAM (Ternary Content Addressable Memory) based scalable virtual routers, through a merged data structure that enables the sharing of prefixes from several FIBs in TCAMs. Based on this data structure, we propose two approaches to merge multiple FIBs in TCAMs, paving the way for scalable virtual routers. Experimental results show that, by using the two approaches for storing 14 full IPv4 FIBs, the TCAM memory requirement can be reduced by about 92% and 82% respectively, compared with the conventional approach of treating FIBs as independent entities.

Modeling Complexity of Enterprise Routing Design

Enterprise networks often have complex routing designs given the need to meet a wide set of resiliency, security and routing policies. In this paper, we take the position that minimizing design complexity must be an explicit objective of routing design. We take a first step to this end by presenting a systematic approach for modeling and reasoning about complexity in enterprise routing design. We make three contributions. First, we present a framework for precisely defining objectives of routing design, and for reasoning about how a combination of routing design primitives (e.g. routing instances, static routes, and route filters etc.) will meet the objectives. Second, we show that it is feasible to quantitatively measure the complexity of a routing design by modeling individual routing design primitives, and leveraging configuration complexity metrics [5]. Our approach helps understand how individual design choices made by operators impact configuration complexity, and can enable quantifying design complexity in the absence of configuration files. Third, we validate our model and demonstrate its utility through a longitudinal analysis of the evolution of the routing design of a large campus network over the last three years. We show how our models can enable comparison of the complexity of multiple routing designs that meet the same objective, guide operators in making design choices that can lower complexity, and enable what-if analysis to assess the potential impact of a configuration change on routing design complexity.

LIFEGUARD: Practical Repair of Persistent Route Failures

Student/Bearbeiter: –; Supervisor/Betreuer: –

The Internet was designed to always find a route if there is a policycompliant path. However, in many cases, connectivity is disrupted despite the existence of an underlying valid path. The research community has focused on short-term outages that occur during route convergence. There has been less progress on addressing avoidable long-lasting outages. Our measurements show that longlasting events contribute significantly to overall unavailability. To address these problems, we develop LIFEGUARD, a system for automatic failure localization and remediation. LIFEGUARD uses active measurements and a historical path atlas to locate faults, even in the presence of asymmetric paths and failures. Given the ability to locate faults, we argue that the Internet protocols should allow edge ISPs to steer traffic to them around failures, without requiring the involvement of the network causing the failure. Although the Internet does not explicitly support this functionality today, we show how to approximate it using carefully crafted BGP messages. LIFEGUARD employs a set of techniques to reroute around failures with low impact on working routes. Deploying LIFEGUARD on the Internet, we find that it can effectively route traffic around an AS without causing widespread disruption.


  • Ethan Katz-Bassett, Colin Scott, David R. Choffnes, Ítalo Cunha, Vytautas Valancius, Nick Feamster, Harsha V. Madhyastha, Thomas Anderson, and Arvind Krishnamurthy. 2012. LIFEGUARD: practical repair of persistent route failures. In Proceedings of the ACM SIGCOMM 2012 conference on Applications, technologies, architectures, and protocols for computer communication (SIGCOMM '12). ACM, New York, NY, USA, 395-406.

Seamless BGP Migration With Router Grafting

Student/Bearbeiter: –; Supervisor/Betreuer: –

Network operators are under tremendous pressure to make their networks highly reliable to avoid service disruptions. Yet, operators often need to change the network to upgrade faulty equipment, deploy new services, and install new routers. Unfortunately, changes cause disruptions, forcing a trade-off between the benefit of the change and the disruption it will cause. In this paper we present router grafting, where parts of a router are seamlessly removed from one router and merged into another. We focus on grafting a BGP session and the underlying link–from one router to another, or between blades in a cluster-based router. Router grafting allows an operator to rehome a customer with no disruption, compared to downtimes today measured in minutes. In addition, grafting a BGP session can help in balancing load between routers or blades, planned maintenance, and even traffic management. We show that grafting a BGP session is practical even with today’s monolithic router software. Our prototype implementation uses and extends Click, the Linux kernel, and Quagga, and introduces a daemon that automates the migration process.


  • Eric Keller, Jennifer Rexford, and Jacobus Van Der Merwe. 2010. Seamless BGP migration with router grafting. In Proceedings of the 7th USENIX conference on Networked systems design and implementation (NSDI'10). USENIX Association, Berkeley, CA, USA, 16-16.

Hedera: Dynamic Flow Scheduling for Data Center Networks

Student/Bearbeiter: –; Supervisor/Betreuer: –

Today’s data centers offer tremendous aggregate bandwidth to clusters of tens of thousands of machines. However, because of limited port densities in even the highest-end switches, data center topologies typically consist of multi-rooted trees with many equal-cost paths between any given pair of hosts. Existing IP multipathing protocols usually rely on per-flow static hashing and can cause substantial bandwidth losses due to longterm collisions. In this paper, we present Hedera, a scalable, dynamic flow scheduling system that adaptively schedules a multi-stage switching fabric to efficiently utilize aggregate network resources. We describe our implementation using commodity switches and unmodified hosts, and show that for a simulated 8,192 host data center, Hedera delivers bisection bandwidth that is 96% of optimal and up to 113% better than static load-balancing methods.


  • Mohammad Al-Fares, Sivasankar Radhakrishnan, Barath Raghavan, Nelson Huang, and Amin Vahdat. 2010. Hedera: dynamic flow scheduling for data center networks. In Proceedings of the 7th USENIX conference on Networked systems design and implementation (NSDI'10). USENIX Association, Berkeley, CA, USA, 19-19.



RouteBricks: Exploiting Parallelism To Scale Software Routers

Student/Bearbeiter: –; Supervisor/Betreuer: –


We revisit the problem of scaling software routers, motivated by recent advances in server technology that enable high-speed parallel processing—a feature router workloads appear ideally suited to exploit. We propose a software router architecture that parallelizes router functionality both across multiple servers and across multiple cores within a single server. By carefully exploiting parallelism at every opportunity, we demonstrate a 35Gbps parallel router prototype; this router capacity can be linearly scaled through the use of additional servers. Our prototype router is fully programmable using the familiar Click/Linux environment and is built entirely from off-the-shelf, general-purpose server hardware.




Compact Routing on Internet-Like Graphs

Student/Bearbeiter: –; Supervisor/Betreuer: –

The Thorup-Zwick (TZ) compact routing scheme is the first generic stretch-3 routing scheme delivering a nearly optimal per-node memory upper bound. Using both direct analysis and simulation, we derive the stretch distribution of this routing scheme on Internet-like inter-domain topologies. By investigating the TZ scheme on random graphs with power-law node degree distributions, Pk ≅ k, we find that the average TZ stretch is quite low and virtually independent of γ. In particular, for the Internet inter-domain graph with γ ≈ 2.1, the average TZ stretch is around 1.1, with up to 70% of all pairwise paths being stretch-1 (shortest possible). As the network grows, the average stretch slowly decreases. We find routing table sizes to be very small (around 50 records for 104-node networks), well below their theoretical upper bounds. Furthermore, we find that both the average shortest path length (i.e., distance) d and width of the distance distribution σ observed in the real Internet inter-AS graph have values that are very close to the minimums of the average stretch in the d- and σ-directions. This leads us to the discovery of a unique critical point of the average TZ stretch as a function of d and σ. The Internet's distance distribution is located in a close neighborhood of this point. This is remarkable given the fact that the Internet inter-domain topology has evolved without any direct attention paid to properties of the stretch distribution. It suggests the average stretch function may be an indirect indicator of the optimization criteria influencing the Internet's inter-domain topology evolution.

Link Positions Matter: A Noncommutative Routing Metric for Wireless Mesh Networks

Student/Bearbeiter: –; Supervisor/Betreuer: –

We revisit the problem of computing the path with the minimum cost in terms of the expected number of link layer transmissions (including retransmissions) in wireless mesh networks. Unlike previous efforts, such as the popular ETX, we account for the fact that MAC protocols (including the IEEE 802.11 MAC) incorporate a finite number of transmission attempts per packet. This in turn leads to our key observation: the performance of a path depends not only on the number of the links on the path and the quality of its links, but also, on the relative positions of the links on the path. Based on this observation, we propose ETOP, a path metric that accurately captures the expected number of link layer transmissions required for reliable end-to-end packet delivery. We analytically compute ETOP, which is not trivial, since ETOP is a noncommutative function of the link success probabilities. Although ETOP is a more involved metric, we show that the problem of computing paths with the minimum ETOP cost can be solved by a greedy algorithm. We implement and evaluate a routing approach based on ETOP on a 25-node indoor mesh network. Our experiments show that the path selection with ETOP consistently results in superior TCP goodput (by over 50% in many cases) compared to path selection based on ETX. We also perform an in-depth analysis of the measurements to better understand why the paths selected by ETOP improve the TCP performance.

Avoiding transient Loops during IGP Convergence in IP Networks

Student/Bearbeiter: –; Supervisor/Betreuer: –

When the topology of an IP network changes due to a link failure or a link metric modification, the routing tables of all the routers must be updated. Each of those updates may cause transient loops. In this paper, we prove that by ordering the updates of the routing tables on the routers, it is possible to avoid all transient loops during the convergence of ISIS or OSPF after a planned link failure, an unplanned failure of a protected link and after a link metric modification. We then propose a protocol that allows the routers to order the update of their routing tables to avoid transient loops without requiring any complex computation.

Pathlet Routing

Student/Bearbeiter: –; Supervisor/Betreuer: –

We present a new routing protocol, pathlet routing, in which networks advertise fragments of paths, called pathlets, that sources concatenate into end-to-end source routes. Intuitively, the pathlet is a highly flexible building block, capturing policy constraints as well as enabling an exponentially large number of path choices. In particular, we show that pathlet routing can emulate the policies of BGP, source routing, and several recent multipath proposals.

This flexibility lets us address two major challenges for Internet routing: scalability and source-controlled routing. When a router's routing policy has only "local" constraints, it can be represented using a small number of pathlets, leading to very small forwarding tables and many choices of routes for senders. Crucially, pathlet routing does not impose a global requirement on what style of policy is used, but rather allows multiple styles to coexist. The protocol thus supports complex routing policies while enabling and incentivizing the adoption of policies that yield small forwarding plane state and a high degree of path choice.

  • P. Brighton Godfrey, Igor Ganichev, Scott Shenker, and Ion Stoica. Pathlet Routing. SIGCOMM 2009, Barcelona, Spain, 2009.

A Deep Dive Into the LISP Cache and What ISPs Should Know About It

Student/Bearbeiter: –; Supervisor/Betreuer: –

Due to scalability issues that the current Internet is facing, the research community has re-discovered the Locator/ID Split paradigm. As the name suggests, this paradigm is based on the idea of separating the identity from the location of end-systems, in order to increase the scalability of the Internet architecture. One of the most successful proposals, currently under discussion at the IETF, is LISP (Locator/ID Separation Protocol). A critical component of LISP, from a performance and resources consumption perspective, as well as from a security point of view, is the LISP Cache. The LISP Cache is meant to temporarily store mappings, i.e., the bindings between identifiers and locations, in order to provide routers with the knowledge of where to forward packets. This paper presents a thorough analysis of such a component, based on real packet-level traces. Furthermore, the implications of policies to increase the level of security of LISP are also analyzed. Our results prove that even a timeout as short as 60 seconds provides high hit ratio and that the impact of using security policies is small.

A Reality Check for Content Centric Networking

Student/Bearbeiter: –; Supervisor/Betreuer: –

Content-Centric Networking (CCN) is a novel networking paradigm centered around content distribution rather than host-to-host connectivity. This change from host-centric to content-centric has several attractive advantages, such as network load reduction, low dissemination latency, and energy efficiency. However, it is unclear whether today's technology is ready for the CCN (r)evolution. The major contribution of this paper is a systematic evaluation of the suitability of existing software and hardware components in today's routers for the support of CCN. Our main conclusion is that a CCN deployment is feasible at a Content Distribution Network (CDN) and ISP scale, whereas today's technology is not yet ready to support an Internet scale deployment.

SPAIN: COTS Data-Center Ethernet for Multipathing over Arbitrary Topologies

Student/Bearbeiter: –; Supervisor/Betreuer: –

Operators of data centers want a scalable network fabric that supports high bisection bandwidth and host mobility, but which costs very little to purchase and administer. Ethernet almost solves the problem – it is cheap and supports high link bandwidths – but traditional Ethernet does not scale, because its spanning-tree topology forces traffic onto a single tree. Many researchers have described “scalable Ethernet” designs to solve the scaling problem, by enabling the use of multiple paths through the network. However, most such designs require specific wiring topologies, which can create deployment problems, or changes to the network switches, which could obviate the commodity pricing of these parts. In this paper, we describe SPAIN (“Smart Path Assignment In Networks”). SPAIN provides multipath forwarding using inexpensive, commodity off-the-shelf (COTS) Ethernet switches, over arbitrary topologies. SPAIN precomputes a set of paths that exploit the redundancy in a given network topology, then merges these paths into a set of trees; each tree is mapped as a separate VLAN onto the physical Ethernet. SPAIN requires only minor end-host software modifications, including a simple algorithm that chooses between pre-installed paths to efficiently spread load over the network. We demonstrate SPAIN’s ability to improve bisection bandwidth over both simulated and experimental data-center networks.

Improving Datacenter Performance and Robustness with Multipath TCP

Student/Bearbeiter: –; Supervisor/Betreuer: –

The latest large-scale data centers offer higher aggregate bandwidth and robustness by creating multiple paths in the core of the network. To utilize this bandwidth requires different flows take different paths, which poses a challenge. In short, a single-path transport seems ill-suited to such networks.

We propose using Multipath TCP as a replacement for TCP in such data centers, as it can effectively and seamlessly use available bandwidth, giving improved throughput and better fairness on many topologies. We investigate what causes these benefits, teasing apart the contribution of each of the mechanisms used by MPTCP.

Using MPTCP lets us rethink data center networks, with a different mindset as to the relationship between transport protocols, routing and topology. MPTCP enables topologies that single path TCP cannot utilize. As a proof-of-concept, we present a dual-homed variant of the FatTree topology. With MPTCP, this outperforms FatTree for a wide range of workloads, but costs the same.

In existing data centers, MPTCP is readily deployable leveraging widely deployed technologies such as ECMP. We have run MPTC on Amazon EC2 and found that it outperforms TCP by a factor of three when there is path diversity. But the biggest benefits will come when data centers are designed for multipath transports.

A Technique for Reducing BGP Update Announcements through Path Exploration Damping

Student/Bearbeiter: –; Supervisor/Betreuer: –

This paper defines and evaluates Path Exploration Damping (PED) ­ a router-level mechanism for reducing the volume of propagation of likely transient update messages within a BGP network and decreasing average time to restore reachability compared to current BGP Update damping practices. PED selectively delays and suppresses the propagation of BGP updates that either lengthen an existing AS Path or vary an existing AS Path without shortening its length. We show how PED impacts on convergence time compared to currently deployed mechanisms like Route Flap Damping (RFD), Minimum Route Advertisement Interval (MRAI) and Withdrawal Rate Limiting (WRATE). We replay Internet BGP update traffic captured at two Autonomous Systems to observe that a PED-enabled BGP speaker can reduce the total number of BGP announcements by up to 32% and reduce Path Exploration by 77% compared to conventional use of MRAI. We also describe how PED can be incrementally deployed in the Internet, as it interacts well with prevailing MRAI deployment, and enables restoration of reachability more quickly than MRAI.

GreenTE: Power-Aware Traffic Engineering

Student/Bearbeiter: –; Supervisor/Betreuer: –

Current network infrastructures exhibit poor power efficiency, running network devices at full capacity all the time regardless of the traffic demand and distribution over the network. Most research on router power management are at component level or link level, treating routers as isolated devices. A complementary approach is to facilitate power management at network level by routing traffic through different paths to adjust the workload on individual routers or links. Given the high path redundancy and low link utilization in today's large networks, this approach can potentially allow more network devices or components to go into power saving mode. This paper proposes an intra-domain traffic engineering mechanism, GreenTE, which maximizes the number of links that can be put into sleep under given performance constraints such as link utilization and packet delay. Using network topologies and traffic data from several wide-area networks, our evaluation shows that GreenTE can reduce line-cards' power consumption by 27% to 42% under constraints that the maximum link utilization is below 50% and the network diameter remains the same as in shortest path routing.

Private and Verifiable Interdomain Routing Decisions

Student/Bearbeiter: –; Supervisor/Betreuer: –

Existing secure interdomain routing protocols can verify validity properties about individual routes, such as whether they correspond to a real network path. It is often useful to verify more complex properties relating to the route decision procedure – for example, whether the chosen route was the best one available, or whether it was consistent with the network’s peering agreements. However, this is difficult to do without knowing a network’s routing policy and full routing state, which are not normally disclosed. In this paper, we show how a network can allow its peers to verify a number of nontrivial properties of its interdomain routing decisions without revealing any additional information. If all the properties hold, the peers learn nothing beyond what the interdomain routing protocol already reveals; if a property does not hold, at least one peer can detect this and prove the violation. We present SPIDeR, a practical system that applies this approach to the Border Gateway Protocol, and we report results from an experimental evaluation to demonstrate that SPIDeR has a reasonable overhead.. 

Zusatzinformationen / Extras


Schnellnavigation zur Seite über Nummerneingabe

Internet Routing
0432 L 822

Dozent: Anja Feldman et al.

ab 19.10.2012

Ort: TEL 1118/19

ab 19.10.2012 16:00 Uhr


19.10.2012: Preparatory meeting. The dates for the seminar itself will be fixed later.