Publications by Type: PhD Theses

Towards Improved Control and Troubleshooting for Operational Networks
Citation key W-TICTON-11
Author Wundsam, Andreas
Year 2011
Month August
School Technische Universität Berlin
Abstract Over the past decade, operational networks have grown tremendously in size, performance and importance. This applies particularly to the Internet, the ultimate 'network of networks.' We expect this trend to continue as more and more services traditionally provided by the local computer, e.g., file storage services and office applications, move to the cloud. In spite of this, our ability to control and manage these networks remains painfully inadequate, and our visibility into the network limited. This has been exemplified by several recent outages that have caused significant disruption of important Internet services. Part of the challenge of controlling and troubleshooting networks stems from the nature of the problem: Networks are intrinsically highly distributed systems with distributed state and configuration. Consequently, a consistent view of the network state is often difficult to attain. They are also highly heterogeneous: Their scale ranges from small home networks to data center networks that transfer enormous amounts of data at high speeds between thousands of hosts. Their geographic spread may be confined to a single rack, or span the globe. The Internet combines all these different kinds of networks, and thus their individual challenges. In addition, the network architecture and the available toolset have evolved little, if at all, over the past decade. In fact, the Internet core and architecture have been diagnosed with ossification. Thus, debugging problems in an operational network still comes down to guesswork, as the architecture provides little support for fault localization and troubleshooting, and available tools like NetFlow, traceroute and tcpdump either provide only coarse-grained statistical insight or are confined to single vantage points and do not provide consistent information across the network. In this thesis, we explore how to improve our control over networks and our abilities to debug and troubleshoot problems.
Due to the extreme diversity of the environments, we do not strive for a one-size-fits-all solution, but propose and evaluate several approaches tailored to specific important scenarios and environments. We emphasize network-centric approaches that can be implemented locally and are transparent to the end hosts. In the spirit of trusting 'running code', we implement all our approaches 'on the metal' and evaluate them in real networks. We first explore the potential of Flow Routing as an approach available to end users to improve their own Internet access. We find Flow Routing to be a viable, cost-efficient approach for communities to share and bundle their access lines for improved reliability and performance. On a wider scale, we explore Network Virtualization as a possible means to overcome the ossification of the Internet core and also enable new troubleshooting primitives. We propose a Control Architecture for Network Virtualization in a multi-player, multi-role scenario. We next turn to troubleshooting. Based on Network Virtualization, we propose Mirror VNets as a primitive that enables safer evolution and improved debugging abilities for complex network services. To this end, a production VNet is paired with a Mirror VNet in identical state and configuration. Finally, we explore how Software Defined Network architectures, e.g., OpenFlow, can be leveraged to enable record and replay troubleshooting for networks. We propose and evaluate OFRewind, the first system that enables practical record and replay in operational networks, even in the presence of black-box devices that cannot be modified or instrumented. We present several case studies that underline its utility. Our evaluation shows that OFRewind scales at least as well as current controller implementations and does not significantly impact the scalability of an OpenFlow controller domain.
In summary, we propose several simple but effective, scenario-specific and network-centric approaches that improve the control and troubleshooting of operational networks, from the residential network and access line to the data center. Our approaches have all been implemented and evaluated on real networks, and can serve as a data point and guidance for how networks may need to evolve to cater to their growing importance.
Bibtex Type of Publication Dissertation
