Almost every computing system
nowadays is distributed, ranging from multi-core laptops to
Internet-scale services; understanding the principles of distributed
computing is hence important for the design and engineering of modern
computing systems. Fundamental issues that arise in reliable and
efficient distributed systems include developing adequate methods for
modeling failures and synchrony assumptions, determining precise
performance bounds on implementations of concurrent data structures,
capturing the trade-off between consistency and efficiency, and
demarcating the frontier of feasibility in distributed computing.
For example, popular Internet services and applications such as CNN.com, YouTube, Facebook, Skype, and BitTorrent attract millions of users every day, and an acceptable Quality-of-Service/Quality-of-Experience can be guaranteed only through effective load balancing and the collaboration of many thousands of machines. While distributed systems promise good scalability and high robustness, they pose challenging research problems, such as: How to design robust and scalable distributed architectures and services? How to coordinate access to a shared resource, e.g., by electing a leader? And how to provide incentives for cooperation in an open, collaborative distributed system?
|Author||Haeberlen, Andreas and Kouznetsov, Petr and Druschel, Peter|
|Title of Book||21st ACM Symposium on Operating Systems Principles (SOSP 2007)|
|Location||Stevenson, Washington, USA|
|Abstract||We describe PeerReview, a system that provides accountability in distributed systems. PeerReview ensures that Byzantine faults whose effects are observed by a correct node are eventually detected and irrefutably linked to a faulty node. At the same time, PeerReview ensures that a correct node can always defend itself against false accusations. These guarantees are particularly important for systems that span multiple administrative domains, which may not trust each other. PeerReview works by maintaining a secure record of the messages sent and received by each node. The record is used to automatically detect when a node's behavior deviates from that of a given reference implementation, thus exposing faulty nodes. PeerReview is widely applicable: it only requires that a correct node's actions are deterministic, that nodes can sign messages, and that each node is periodically checked by a correct node. We demonstrate that PeerReview is practical by applying it to three different types of distributed systems: a network filesystem, a peer-to-peer system, and an overlay multicast system.|
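
The "secure record of the messages sent and received by each node" mentioned in the abstract can be illustrated with a hash-chained, append-only log. The following is only a minimal sketch under stated assumptions, not PeerReview's actual implementation: the names `Entry`, `MessageLog`, `chain_hash`, and `verify`, the all-zero starting value, and the choice of SHA-256 are all hypothetical, and message signatures (which the abstract requires) are omitted for brevity.

```python
import hashlib
from dataclasses import dataclass

# Assumed starting value for the hash chain (hypothetical choice).
GENESIS = b"\x00" * 32


@dataclass(frozen=True)
class Entry:
    seq: int        # position of the entry in the log
    kind: str       # "SEND" or "RECV"
    content: bytes  # message payload
    digest: bytes   # hash covering this entry and, transitively, all earlier ones


def chain_hash(prev: bytes, seq: int, kind: str, content: bytes) -> bytes:
    """Hash the previous digest together with this entry's fields."""
    h = hashlib.sha256()
    h.update(prev)
    h.update(seq.to_bytes(8, "big"))
    h.update(kind.encode("utf-8"))
    h.update(hashlib.sha256(content).digest())
    return h.digest()


class MessageLog:
    """Append-only log in which each entry commits to the whole prefix before it."""

    def __init__(self) -> None:
        self.entries: list[Entry] = []

    def append(self, kind: str, content: bytes) -> Entry:
        prev = self.entries[-1].digest if self.entries else GENESIS
        seq = len(self.entries)
        entry = Entry(seq, kind, content, chain_hash(prev, seq, kind, content))
        self.entries.append(entry)
        return entry


def verify(entries: list[Entry]) -> bool:
    """Recompute the chain; any modified, reordered, or missing entry breaks it."""
    prev = GENESIS
    for i, e in enumerate(entries):
        if e.seq != i or e.digest != chain_hash(prev, e.seq, e.kind, e.content):
            return False
        prev = e.digest
    return True
```

Because every digest depends on all earlier entries, a node that has handed out its latest digest cannot later alter its recorded history without the recomputation in `verify` failing. In a complete system, that digest would additionally be signed by the node so that it can be held to it.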