Scalable Group Communication for the Internet Idit Keidar MIT Lab for Computer Science Theory of Distributed Systems Group The main part of this talk is joint work with Sussman, Marzullo and Dolev

Collaborators Tal Anker Ziv Bar-Joseph Gregory Chockler Danny Dolev Alan Fekete Nabil Huleihel Kyle Ingols Roger Khazan Carl Livadas Nancy Lynch Keith Marzullo Yoav Sasson Jeremy Sussman Alex Shvartsman Igor Tarashchanskiy Roman Vitenberg Esti Yeger-Lotem

Outline Motivation Group communication - background A novel architecture for scalable group communication services in WAN A new scalable group membership algorithm –Specification –Algorithm –Implementation –Performance Conclusions

Modern Distributed Applications (in WANs) Highly available servers –Web –Video-on-Demand Collaborative computing –Shared white-board, shared editor, etc. –Military command and control –On-line strategy games Stock market

Important Issues in Building Distributed Applications Consistency of view –Same picture of game, same shared file Fault tolerance, high availability Performance –Conflicts with consistency? Scalability –Topology - WAN, long unpredictable delays –Number of participants

Generic Primitives - Middleware, “Building Blocks” E.g., total order, group communication Abstract away difficulties, e.g., –Total order - a basis for replication –Mask failures Important issues: –Well-specified, complete semantics –Performance
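
To make the “total order as a basis for replication” point concrete, here is a minimal sketch (not from the talk; the Replica class and the (key, value) operation format are illustrative assumptions) of state-machine replication over a totally ordered broadcast: because every replica applies the identical operation sequence, replica states never diverge.

```python
# Sketch: state-machine replication on top of totally ordered broadcast.
# The Replica class and the (key, value) operation format are illustrative.

class Replica:
    """A deterministic state machine."""
    def __init__(self):
        self.state = {}

    def apply(self, op):
        # op arrives via the total-order primitive, so every replica
        # applies the same operations in the same order.
        key, value = op
        self.state[key] = value

# Two replicas fed the same totally ordered stream stay identical.
stream = [("x", 1), ("y", 2), ("x", 3)]
r1, r2 = Replica(), Replica()
for op in stream:
    r1.apply(op)
    r2.apply(op)
assert r1.state == r2.state
```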

Research Approach Rigorous modeling, specification, proofs, performance analysis Implementation and performance tuning Services → Applications Specific examples → General observations

Group Communication Group abstraction - a group of processes is one logical entity; Send(G) addresses all members of group G Dynamic groups (join, leave, crash) Systems: Ensemble, Horus, ISIS, Newtop, Psync, Sphynx, Relacs, RMP, Totem, Transis

Virtual Synchrony [Birman, Joseph 87] Group members all see events in the same order –Events: messages, process crash/join Powerful abstraction for replication Framework for fault tolerance, high availability Basic component: group membership –Reports changes in the set of group members

Example: Highly Available VoD [Anker, Dolev, Keidar, ICDCS 1999] Dynamic set of servers Clients talk to an “abstract” service Server can crash; client shouldn’t know

VoD Service: Exploiting Group Communication Group abstraction for connection establishment and transparent migration (with simple clients) Membership services detect conditions for migration - fault tolerance and load balancing Reliable group multicast among servers for consistently sharing information Virtual Synchrony allows servers to agree upon migration immediately (no message exchange) Reliable messages for control Server: ~2500 C++ lines –All fault tolerance logic at server

Related Projects (spanning group communication and applications): Moshe: Group Membership (ICDCS 00); Architecture for Group Membership in WAN (DIMACS 98); Specification (Survey 99); Virtual Synchrony (ICDCS 00); Inheritance-based Modeling (ICSE 00); Object Replication (PODC 96); CSCW (NGITS 97); Highly Available VoD (ICDCS 99); Dynamic Voting (PODC 97); QoS Support (TINA 96, OPODIS 00); Optimistic VS (SRDS 00)

A Scalable Architecture for Group Membership in WANs Tal Anker, Gregory Chockler, Danny Dolev, Idit Keidar DIMACS Workshop 1998

Scalable Membership Architecture Dedicated distributed membership servers: “divide and conquer” –Servers involved only in membership changes –Members communicate with each other directly (implementing “virtual synchrony”) Two levels of membership –Notification Service NSView - “who is around” –Agreed membership views

Architecture Two levels: the Notification Service (NS) provides the NSView - “who is around” (failure/join/leave events); the Membership level delivers agreed views - a members set plus identifier, e.g., {A,B,C,D,E},7

The Notification Service (NS) Group members send requests: –join(Group G), –leave(Group G) directly to (local) NS NS detects faults (member / domain) Information propagated to all NS servers NS servers notify membership servers of new NSView
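
A minimal sketch of the NS interface as described on this slide; the class and method names (NotificationServer, on_nsview, etc.) are illustrative assumptions, not the actual API.

```python
# Sketch of the Notification Service interface described above.
# Class and method names are illustrative, not the real CONGRESS/Moshe API.

class NotificationServer:
    def __init__(self, membership_server):
        self.membership_server = membership_server
        self.nsview = {}                  # group -> set of live members

    def join(self, group, member):        # request sent directly by a local member
        self._update(group, added={member}, removed=set())

    def leave(self, group, member):
        self._update(group, added=set(), removed={member})

    def fault_detected(self, group, member):
        # NS-side failure detection (member or whole domain)
        self._update(group, added=set(), removed={member})

    def _update(self, group, added, removed):
        view = self.nsview.setdefault(group, set())
        view |= added
        view -= removed
        # In the real system the change is propagated to all NS servers;
        # each NS server then notifies its membership server of the new NSView.
        self.membership_server.on_nsview(group, frozenset(view))

# Tiny usage example with a stub membership server.
class MembershipStub:
    def on_nsview(self, group, view):
        print(group, sorted(view))

ns = NotificationServer(MembershipStub())
ns.join("G", "A"); ns.join("G", "B"); ns.fault_detected("G", "A")
```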

The NS Communication: Reliable FIFO Links Membership servers can send each other messages using the NS. FIFO order: if S1 sends m1 and later m2, then any server that receives both receives m1 first. Reliable links: if S1 sends m to S2, then eventually either S2 receives m or S1 suspects S2 (and all of its clients).
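
For illustration, a sketch of how the FIFO half of these guarantees is commonly realized with per-link sequence numbers; this is an assumption about mechanism, not a description of the NS internals, and the reliability half (retransmission until acknowledgment, or suspicion of the peer) is only noted in a comment.

```python
# Sketch of a FIFO link: sequence numbers on the sender side, an
# out-of-order buffer on the receiver side. Reliability (retransmit
# until acked, or suspect the peer and its clients) is not shown.

class FifoLink:
    def __init__(self):
        self.next_seq = 0        # sender side
        self.expected = 0        # receiver side
        self.pending = {}        # out-of-order buffer: seq -> message

    def send(self, msg):
        seq, self.next_seq = self.next_seq, self.next_seq + 1
        return (seq, msg)        # handed to the (lossy, reordering) network

    def receive(self, packet):
        """Deliver in FIFO order: m1 before m2 whenever both arrive."""
        seq, msg = packet
        self.pending[seq] = msg
        delivered = []
        while self.expected in self.pending:
            delivered.append(self.pending.pop(self.expected))
            self.expected += 1
        return delivered
```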

Moshe: A Group Membership Algorithm for WANs Idit Keidar, Jeremy Sussman Keith Marzullo, Danny Dolev ICDCS 2000

Membership in WAN: the Challenge Message latency is large and unpredictable; message loss is frequent ⇒ time-out failure detection is inaccurate ⇒ we use a notification service (NS) for WANs The number of communication rounds matters Algorithms may change views frequently ⇒ view changes require communication for state transfer, which is costly in WANs

Moshe’s Novel Concepts Designed for WANs from the ground up –Previous systems emerged from LAN Avoids delivery of “obsolete” views –Views that are known to be changing –Not always terminating (but NS is) Runs in a single round (“typically”)

Member-Server Interaction

Moshe Guarantees View identifier is monotonically increasing Conditional liveness property (agreement on views): if “all” eventually have the same last NSView, then “all” eventually agree on the last view No delivery of obsolete views Composable ⇒ allows reasoning about individual components ⇒ useful for applications
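
The monotonicity guarantee is easy to state as a check over a local delivery trace; a tiny sketch (function name and trace format are illustrative):

```python
# Sketch: checking identifier monotonicity on a trace of delivered views.

def ids_monotonic(delivered_views):
    """delivered_views: list of (members, view_id) in delivery order."""
    ids = [vid for _, vid in delivered_views]
    return all(a < b for a, b in zip(ids, ids[1:]))

trace = [({"A", "B"}, 6), ({"A", "B", "C"}, 7), ({"A", "C"}, 9)]
assert ids_monotonic(trace)
```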

Moshe Operation: Typical Case In response to a new NSView (members): –send a proposal with the NSView to the other servers –send startChange to local members (clients) Once proposals from all servers of NSView members arrive, deliver the view to local members: –members = NSView –identifier higher than all previous
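
A sketch of this one-round flow, assuming callbacks for sending over the NS and for delivering to local members; all names and the identifier handling are illustrative simplifications, not Moshe's actual code.

```python
# Sketch of Moshe's one-round "typical case" as described above.
# Names, callbacks, and identifier handling are illustrative assumptions.

class MosheServer:
    def __init__(self, my_id, servers_of, send, start_change, deliver_view):
        self.my_id = my_id
        self.servers_of = servers_of      # NSView -> set of responsible servers
        self.send = send                  # reliable FIFO send via the NS
        self.start_change = start_change  # warn local members (clients)
        self.deliver_view = deliver_view  # deliver agreed view to local members
        self.nsview = frozenset()
        self.proposals = {}               # server id -> proposed NSView
        self.last_id = 0

    def on_nsview(self, nsview):
        """New NSView from the notification service: propose it."""
        self.nsview = frozenset(nsview)
        self.proposals = {self.my_id: self.nsview}
        self.start_change()
        for s in self.servers_of(self.nsview) - {self.my_id}:
            self.send(s, ("proposal", self.my_id, self.nsview))

    def on_proposal(self, sender, nsview):
        if frozenset(nsview) != self.nsview:
            return                        # out-of-sync; handled separately (below)
        self.proposals[sender] = frozenset(nsview)
        if set(self.proposals) == self.servers_of(self.nsview):
            # Full house: one round suffices. A fresh identifier higher than
            # all previous ones is attached (agreement on identifiers is part
            # of the real protocol and glossed over here).
            self.last_id += 1
            self.deliver_view(self.nsview, self.last_id)
```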

Example: Client B joins a group

Goal: Self-Stabilizing Once the same last NSView is received by all servers: –All send proposals for this NSView –All the proposals reach all the servers –All servers use these proposals to deliver the same view And they live happily ever after!

Out-of-Sync Case: unexpected proposal (diagram: servers A, B, C; some proposals are lost, and a proposal arrives at A when A’s algorithm is not running) ⇒ To avoid deadlock: A must respond

Out-of-Sync Case: unexpected proposal, continued (diagram: servers A, B, C; C leaves and rejoins while A and B deliver a view) ⇒ Extra proposals are redundant; responding with a proposal may cause live-lock

Out-of-Sync Case: missing proposal (diagram: servers A, B, C; C leaves and rejoins while A and B deliver a view) This case was exposed by the correctness proof

Detecting a “Missing Proposal” Proposals are numbered –Numbers are monotonically increasing –PropNum - the latest number I sent Each proposal carries extra information: –Used[] contains the last proposal number used for a view (per server) Detection: proposal p arrives with p.Used[me] = PropNum

Missing Proposal Detection (diagram: servers A, B, C deliver a view using proposals numbered 1,1,1, so Used = [1,1,1]; after C leaves and rejoins, a proposal numbered 2 with Used = [1,1,1] arrives at C, whose PropNum is still 1; Used[C] = 1 = PropNum - detection!)

Detecting Blocking Cases My last proposal was used by me with an earlier proposal of server A –A’s last proposal will arrive when my algorithm is not running (extra proposal) My last proposal was used by A with an earlier proposal of A –In A’s latest proposal, Used[me] = PropNum
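
Putting the last two slides together, a sketch of the two detection rules over numbered proposals carrying a Used[] vector; the data-structure and function names are illustrative assumptions.

```python
# Sketch of the detection rules above, using numbered proposals that
# carry a Used[] vector. Names are illustrative, not Moshe's actual code.

from dataclasses import dataclass, field

@dataclass
class Proposal:
    sender: str
    number: int                                  # strictly increasing per server
    used: dict = field(default_factory=dict)     # server -> proposal number last
                                                 # consumed for a delivered view

def missing_proposal(p: Proposal, my_id: str, my_prop_num: int) -> bool:
    # The sender delivered a view that consumed my latest proposal,
    # so a proposal I should have seen never arrived at my end.
    return p.used.get(my_id) == my_prop_num

def extra_proposal(algorithm_running: bool) -> bool:
    # Any proposal arriving while no membership change is in progress
    # is unexpected and signals the out-of-sync case.
    return not algorithm_running
```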

Handling Out-of-Sync Cases: “Slow Agreement” Also sends proposals, tagged “SA” Invoked upon blocking detection or upon receipt of “SA” proposal Upon receipt of “SA” proposal with bigger number than PropNum, respond with same number Deliver view only with “full house” of same number proposals
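
A sketch of this Slow Agreement mode, assuming the same messaging callbacks as in the earlier sketch; names and message formats are illustrative, not Moshe's actual code.

```python
# Sketch of the Slow Agreement mode described above: SA proposals carry
# numbers, larger numbers are echoed, and a view is delivered only on a
# "full house" of equal-numbered SA proposals. Names are illustrative.

class SlowAgreement:
    def __init__(self, my_id, servers, send, deliver_view):
        self.my_id = my_id
        self.servers = set(servers)
        self.send = send
        self.deliver_view = deliver_view
        self.prop_num = 0
        self.sa = {}                      # server -> last SA number received

    def start(self, nsview):
        """Invoked on blocking detection (or on receipt of an SA proposal)."""
        self.prop_num += 1
        self._broadcast(nsview)

    def on_sa(self, sender, number, nsview):
        self.sa[sender] = number
        if number > self.prop_num:
            # Respond with the same (larger) number; numbers only grow when
            # a new NSView occurs, so this cannot live-lock (next slide).
            self.prop_num = number
            self._broadcast(nsview)
        if all(self.sa.get(s) == self.prop_num for s in self.servers):
            # "Full house" of same-numbered SA proposals: deliver the view.
            self.deliver_view(nsview, self.prop_num)

    def _broadcast(self, nsview):
        self.sa[self.my_id] = self.prop_num
        for s in self.servers - {self.my_id}:
            self.send(s, ("SA", self.my_id, self.prop_num, nsview))
```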

Rationale for Slow Agreement: Algorithm Termination Numbers can only increase if a new NSView occurs –Eventually, they stop increasing (if the system stabilizes) Every recipient responds with a greater or equal number Numbers are strictly increasing –The same number is not sent twice

How Typical is the “Typical” Case? Depends on the notification service (NS) –Classify good NS behaviors: symmetric and transitive perception of failures Transitivity depends on the logical topology and how suspicions propagate The typical case should be very common, but we need to measure

Implementation Use CONGRESS [Anker et al.] –NS for WAN –Always symmetric, can be non-transitive –Logical topology can be configured Moshe servers extend CONGRESS servers Socket interface with processes

The Experiment Run over the Internet –In the US: MIT, Cornell (CU), UCSD –In Taiwan: NTU –In Israel: HUJI Ran for 10 days in one configuration, 2.5 days in another 10 clients at each location continuously join/leave 10 groups

Two Experiment Configurations

Percentage of “Typical” Cases Configuration 1: –MIT: 10,786 views, 10,661 (98.8%) in one round –Other sites: 98.8%, 98.9%, 98.97%, 98.85% Configuration 2: –MIT: 2,559 views, 2,555 (99.8%) in one round –Other sites: 99.82%, 99.79%, 99.81%, 99.84% Overwhelming majority for one round! Depends on topology ⇒ can scale

Performance: Surprise! Histogram of Moshe duration (x-axis: milliseconds, y-axis: number of runs); MIT, configuration 1, runs up to 4 seconds (97%)

Performance: Part II Histogram of Moshe duration (x-axis: milliseconds, y-axis: number of runs); MIT, configuration 2, runs up to 3 seconds (99.7%)

Performance over the Internet: What is Going On? Without message loss, the running time is close to the biggest round-trip time, ~650 ms –As expected Message loss has a big impact Configuration 2 has much less loss ⇒ more cases of good performance

“Slow” versus “Typical” Slow can take 1 or 2 rounds once it runs –Depending on PropNum Slow after NE –One-round runs first, then detection, then slow –Without loss, about 40% more time than usual Slow without NE –Detection by unexpected proposal –Only the slow algorithm runs –Runs less time than one-round

Unstable Periods: No Obsolete Views “Unstable” = –constant changes; or –connected processes differ in failure detection Configuration 1: –379 of the 10,786 views took ≥ 4 seconds (3.5%) –167 took ≥ 20 seconds (1.5%) –Longest running time: 32 minutes Configuration 2: –14 of 2,559 views took ≥ 4 seconds (0.5%) –Longest running time: 31 seconds

Scalability Measurements Controlled experiment at MIT and UCSD –Prototype NS, based on TCP/IP (Sasson) –Inject faults to test the “slow” case Vary the number of members and servers Measure end-to-end latencies at each member, from join/leave/suspicion to the corresponding view Average over 150 runs (50 slow)

End-to-End Latency: Scalable! Member scalability: 4 servers (constant) Server and member scalability: 4-14 servers

Conclusion: Moshe Features Avoiding obsolete views A single round –98% of the time in one configuration –99.8% of the time in another Using a notification service for WANs –Good abstraction –Flexibility to configure multiple ways –Future work: configure more ways Scalable “divide and conquer” architecture

Retrospective: Role of Theory Specification –Possible to implement –Useful for applications (composable) Specification can be met in one round “typically” (unlike Consensus) Correctness proof exposes subtleties –Need to avoid live-lock –Two types of detection mechanisms needed

Future Work: The QoS Challenge Some distributed applications require QoS –Guaranteed available bandwidth –Bounded delay, bounded jitter The membership algorithm terminates in one round under certain circumstances –Can we leverage that to guarantee QoS under certain assumptions? Can other primitives guarantee QoS?