Epidemic Protocols. CS614, March 7th, 2002. Ashish Motivala
Papers
Epidemic algorithms for replicated database maintenance; Alan Demers, Dan Greene, Carl Hauser, Wes Irish and John Larson; Proceedings of the Sixth Annual ACM Symposium on Principles of Distributed Computing, 1987.
Bimodal multicast; Kenneth P. Birman, Mark Hayden, Oznur Ozkasap, Zhen Xiao, Mihai Budiu and Yaron Minsky; ACM Trans. Comput. Syst. 17, 2 (May 1999).
Managing update conflicts in Bayou, a weakly connected replicated storage system; D. B. Terry, M. M. Theimer, Karin Petersen, A. J. Demers, M. J. Spreitzer and C. H. Hauser; SOSP 1995.
Flexible update propagation for weakly consistent replication; Karin Petersen, Mike J. Spreitzer, Douglas B. Terry, Marvin M. Theimer and Alan J. Demers; SOSP 1997.
Fighting fire with fire: using randomized gossip to combat stochastic scalability limits; Indranil Gupta, Kenneth P. Birman, Robbert van Renesse; To appear, March 2002.
Dangers of Replication and a Solution; Jim Gray, Pat Helland, Patrick O’Neil, Dennis Shasha; SIGMOD 1996 (<< Read in CS632)
Simple Epidemic
Assume a fixed population of size n.
For simplicity, assume homogeneous spreading.
–Simple epidemic: anyone can infect anyone with equal probability.
Assume that k members are already infected.
Infection occurs in rounds.
Probability of Infection
What is the probability $P_{\text{infect}}(k,n)$ that a particular uninfected member is infected in a round if k are already infected?
$P_{\text{infect}}(k,n) = 1 - P(\text{nobody infects member}) = 1 - (1 - 1/n)^k$
$E(\#\text{newly infected members}) = (n-k) \times P_{\text{infect}}(k,n)$
Basically it's a binomial distribution.
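The two formulas on this slide can be evaluated directly. The sketch below is a minimal illustration, assuming each of the k infected members contacts one of the n members uniformly at random per round; the function names `p_infect` and `expected_new_infections` are made up for this example.

```python
def p_infect(k: int, n: int) -> float:
    """Probability that one particular uninfected member is infected in a round.

    Assumes each of the k infected members independently picks one of the
    n members uniformly at random, so a given member is missed by one
    infected member with probability (1 - 1/n).
    """
    return 1.0 - (1.0 - 1.0 / n) ** k


def expected_new_infections(k: int, n: int) -> float:
    """Expected number of newly infected members: (n - k) * P_infect(k, n)."""
    return (n - k) * p_infect(k, n)


if __name__ == "__main__":
    n = 1000
    for k in (1, 10, 100, 500, 900):
        print(k, round(p_infect(k, n), 4), round(expected_new_infections(k, n), 1))
```

With k = 1 the expected number of new infections is close to 1, which is where the initial growth factor of about 2 comes from.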
2 Phases
Intuition: 2 phases
Infection
–Initial growth factor is very high, about 2
–Exponential growth
Uninfection
–Slow death of uninfection to start
–Exponential decline
Number of rounds necessary to infect the entire population is O(log n)
First half: 1 -> n/2 (Phase 1)
Second half: n/2 -> n (Phase 2)
For large n, $P_{\text{infect}}(n/2,n) \approx 1 - (1/e)^{0.5} \approx 0.4$
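The O(log n) claim can be checked numerically by iterating the expected-value recurrence k <- k + (n - k) * P_infect(k, n). This is a deterministic sketch of the mean behaviour, not a stochastic simulation; the stopping rule (fewer than one uninfected member expected) and the name `rounds_to_infect` are assumptions made for the illustration.

```python
import math


def p_infect(k: float, n: int) -> float:
    """Per-round infection probability for one uninfected member."""
    return 1.0 - (1.0 - 1.0 / n) ** k


def rounds_to_infect(n: int, max_rounds: int = 10_000) -> int:
    """Count rounds until the expected number of uninfected members drops below 1.

    Uses expected values only (an assumption of this sketch), starting
    from a single infected member.
    """
    k = 1.0
    rounds = 0
    while n - k >= 1.0 and rounds < max_rounds:
        k += (n - k) * p_infect(k, n)
        rounds += 1
    return rounds


if __name__ == "__main__":
    for n in (10**3, 10**4, 10**5, 10**6):
        print(n, rounds_to_infect(n), round(math.log2(n), 1))
```

Each tenfold increase in n adds only a roughly constant number of rounds, which is the logarithmic growth the slide describes.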
Applications for Epidemic Protocols
Reliable Multicast: virtual synchrony, randomized rumour spreading.
Systems (Database Replication): Clearinghouse, Grapevine, Bayou
Membership and Failure Detection: SWIM, SCAMP
Data Aggregation
Other distributed protocols: leader election; Lightweight Prob. broadcast; delta reliability; Li Li's work; Kempe and Kleinberg's work
» Our focus today
Grapevine and Clearinghouse
Weakly consistent replication was used at Xerox PARC: Grapevine and Clearinghouse name services
–Updates are propagated by unreliable multicast (direct mail).
Periodic anti-entropy exchanges among replicas ensure that they eventually converge, even if updates are lost.
–Arbitrary pairs of replicas periodically establish contact and resolve all differences between their databases.
–Various mechanisms (e.g., MD5 digests and update logs) reduce the volume of data exchanged in the common case.
–Deletions are handled as a special case via “death certificates” that record the delete operation as an update.
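An anti-entropy exchange with death certificates can be sketched as follows. This is an illustration under stated assumptions, not the Clearinghouse implementation: each entry carries an integer timestamp assumed to be globally unique, the newer timestamp wins, a value of `None` stands for a death certificate, and SHA-256 is used for the digest shortcut in place of the MD5 digests the slide mentions. All names are hypothetical.

```python
import hashlib
from typing import Dict, Optional, Tuple

# key -> (timestamp, value); value None marks a death certificate (assumption)
Entry = Tuple[int, Optional[str]]
Database = Dict[str, Entry]


def digest(db: Database) -> str:
    """Checksum of the whole database, used to skip the exchange when equal."""
    canonical = repr(sorted(db.items())).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def resolve_differences(a: Database, b: Database) -> None:
    """Make two replicas identical; for each key the newer timestamp wins."""
    if digest(a) == digest(b):
        return  # common case: nothing to exchange
    for key in set(a) | set(b):
        ea, eb = a.get(key), b.get(key)
        if ea is None or (eb is not None and eb[0] > ea[0]):
            a[key] = eb
        elif eb is None or ea[0] > eb[0]:
            b[key] = ea


def lookup(db: Database, key: str) -> Optional[str]:
    """Read a key; deleted or unknown keys both read as None."""
    entry = db.get(key)
    return None if entry is None else entry[1]
```

Because the delete is stored as an ordinary entry with its own timestamp, a replica that still holds the old value cannot resurrect it during a later exchange.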
Epidemic Algorithm: Rumour Mongering
Each replica periodically “touches” a selected “susceptible” peer site and “infects” it with updates.
–In push, the initiating site sends the peer every update the peer lacks; in pull, the initiator fetches every update it lacks from the peer. Rumours are dropped using counter or coin schemes.
–Partner selection is randomized using a variety of heuristics. Distance vs. convergence tradeoff: e.g., if only neighbours are updated, then link traffic is O(1) but convergence time is O(n).
–Sites connect to others at distance d with probability proportional to $d^{-a}$.
Theory shows that the epidemic will eventually infect the entire population (assuming it is connected).
–Heuristics (push vs. pull) affect traffic load and the expected time-to-convergence. Pull converges faster than push.
–Pull: $p_{i+1} = (p_i)^2$
–Push: $p_{i+1} = p_i / e$
where $p_i$ is the probability of a site being susceptible after i rounds (cycles).
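Iterating the two recurrences shows why pull converges faster once few susceptible sites remain. The sketch below simply applies the slide's formulas; the starting value 0.5, the threshold 1e-9, and the helper names are arbitrary choices for illustration. The push recurrence is an approximation that holds when the susceptible fraction is already small.

```python
import math
from typing import Callable


def pull_step(p: float) -> float:
    """Pull: a site stays susceptible only if the site it contacts is also susceptible."""
    return p * p


def push_step(p: float) -> float:
    """Push: approximate recurrence p / e, valid when p is small."""
    return p / math.e


def rounds_until(p0: float, step: Callable[[float], float], threshold: float) -> int:
    """Number of rounds until the susceptible probability is at most threshold."""
    p, rounds = p0, 0
    while p > threshold:
        p = step(p)
        rounds += 1
    return rounds


if __name__ == "__main__":
    print("pull:", rounds_until(0.5, pull_step, 1e-9))
    print("push:", rounds_until(0.5, push_step, 1e-9))
```

Pull squares the residual each round (doubly exponential decay), while push only divides it by a constant, so push needs several times as many rounds to reach the same residual.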
Recap: Two Reliable Multicast Models
–SRM: local repair of problems but no end-to-end guarantees
–Virtual synchrony model (Isis, Horus, Ensemble): all-or-nothing message delivery with ordering; membership managed on behalf of the group; state transfer to a joining member
Great performance for small systems. With large group sizes, under perturbations (heavy load, applications acting a little flaky), performance is very hard to maintain.
Multicast scaling issue (SRM) [Figure: repair requests (per sec) and retransmissions (per sec)]
Multicast scaling issue (Ensemble)
Bimodal Multicast: 2 Sub-protocols
The first sub-protocol is unreliable data distribution (IP multicast).
–Upon arrival, a message enters the receiver’s message buffer.
–Messages are delivered to the application layer in FIFO order, and are garbage collected out of the message buffer after some period of time.
The second sub-protocol is used to repair gaps in the message delivery record.
–Processes maintain a list of a random subset of the full system membership. In practice, we weight this list to contain primarily processes from close by: processes accessible over low-latency links.
Start by using unreliable multicast to rapidly distribute the message. But some messages may not get through, and some processes may be faulty. So initial state involves partial distribution of multicast(s)
Periodically (e.g. every 100ms) each process sends a digest describing its state to some randomly selected group member. The digest identifies messages. It doesn’t include them.
Recipient checks the gossip digest against its own history and solicits a copy of any missing message from the process that sent the gossip
Processes respond to solicitations received during a round of gossip by retransmitting the requested message. The round lasts much longer than a typical RPC time.
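The digest, solicit, and retransmit steps described above can be sketched as a small in-memory model. This is an illustration only: the `Process` class, its method names, and the use of a direct function call in place of network messages are assumptions, and the unreliable multicast phase, timers, and garbage collection are left out.

```python
import random
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Process:
    name: str
    buffer: Dict[int, str] = field(default_factory=dict)  # message id -> payload

    def digest(self) -> Set[int]:
        """Identify buffered messages without including their contents."""
        return set(self.buffer)

    def solicit(self, remote_digest: Set[int]) -> Set[int]:
        """Compare a received digest with local history; return missing ids."""
        return remote_digest - set(self.buffer)

    def retransmit(self, wanted: Set[int]) -> Dict[int, str]:
        """Answer a solicitation with the requested messages still buffered."""
        return {m: self.buffer[m] for m in wanted if m in self.buffer}

    def receive(self, messages: Dict[int, str]) -> None:
        self.buffer.update(messages)


def gossip(sender: Process, recipient: Process) -> Set[int]:
    """One digest exchange: the recipient pulls what it lacks from the sender."""
    wanted = recipient.solicit(sender.digest())
    recipient.receive(sender.retransmit(wanted))
    return wanted


def gossip_round(processes: List[Process], rng: random.Random) -> None:
    """Every process sends its digest to one randomly selected other member."""
    for sender in processes:
        peers = [p for p in processes if p is not sender]
        if peers:
            gossip(sender, rng.choice(peers))
```

Repeating `gossip_round` spreads any message held by at least one process to the rest with high probability after a logarithmic number of rounds, as in the simple epidemic model.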
Optimizations
Request retransmissions most recent multicast first.
–The idea is to “catch up quickly”, leaving at most one gap in the retrieved sequence.
Participants bound the amount of data they will retransmit during any given round of gossip.
–If too much is solicited, they ignore the excess requests.
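Both optimizations on this slide are small policies, sketched below under assumptions: message ids are treated as sequence numbers (so the largest id is the most recent), and the retransmission bound is expressed as a byte budget per round. The function names and the budget parameter are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def order_solicitations(missing: Iterable[int]) -> List[int]:
    """Requester side: ask for the most recent multicast first."""
    return sorted(missing, reverse=True)


def bounded_retransmit(
    requests: List[int], buffer: Dict[int, bytes], byte_budget: int
) -> List[Tuple[int, bytes]]:
    """Responder side: serve requests in order until the round's budget is spent.

    Requests beyond the budget are ignored; the requester can solicit
    them again in a later round.
    """
    sent: List[Tuple[int, bytes]] = []
    used = 0
    for msg_id in requests:
        payload = buffer.get(msg_id)
        if payload is None:
            continue  # already garbage collected
        if used + len(payload) > byte_budget:
            break  # excess requests are dropped for this round
        sent.append((msg_id, payload))
        used += len(payload)
    return sent
```

Serving newest-first means a process that has fallen behind gets back in step with current traffic quickly, and older gaps are filled in later rounds if the messages are still buffered.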
Optimizations
Label each gossip message with the sender's gossip round number.
Ignore solicitations with an expired round number, reasoning that they arrived very late and hence are probably no longer correct.
Don’t retransmit the same message twice in a row to any given destination (the copy may still be in transit, hence the request may be redundant).
Optimizations
Use IP multicast when retransmitting a message if several processes lack a copy.
–For example, if solicited twice
–Also, if a retransmission is received from “far away”
–Tradeoff: excess messages versus low latency
Use regional TTL to restrict multicast scope.
Bimodal Multicast and SRM with system-wide constant noise, tree topology [Figure: repair requests (per sec)]
Two predicates
Predicate I: A faulty outcome is one where more than 10% but less than 90% of the processes get the multicast.
Predicate II: A faulty outcome is one where roughly half get the multicast and failures might “conceal” the true outcome.
Figure 5: Graphs of analytical results Bimodal Multicast is amenable to formal analysis
Unlimited scalability!
Probabilistic gossip “routes around” congestion.
And the probabilistic reliability model lets the system move on if a computer lags behind.
Results in:
–Constant communication costs
–Constant loads on links
–Steady behavior even under stress
Good things?
Overcome Internet limitations using randomized P2P gossip.
–However, Internet routing can “defeat” our clever solutions unless we know the network topology.
Both have great scalability and can survive under stress.
And both are backed by formal models as well as real code and experimental data.
Bayou Basics
The motivation for Bayou comes from observations of mobile computing.
Connections are expensive, infrequent, and often intermittent.
Collaborating agents are unlikely to be guaranteed simultaneous connections.
Bayou accommodates these applications by helping them manage weakly consistent data.
Bayou does not attempt to be transparent.
Bayou Basics (cont.)
Applications should use specific knowledge of their data, along with the knowledge that data may be stale, to detect and resolve conflicts.
Applications detect and resolve conflicts differently.
–Bayou allows for arbitrary dependencies, constraints, and detection of write/write and read/write conflicts.
–Programs resolve conflicts with each write. Resolution may involve cascading back-outs.
–Procedures must be deterministic so that they may be replayed on multiple machines.
A write is considered tentative until committed at the primary server.
–A global ordering is used by the primary server to dictate which of several conflicting writes wins.
–A modification is stable once it reaches the primary server.
–Primary servers have authority, a tradeoff that allows data to become stable without hearing responses from all clients and servers.
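A Bayou-style write that carries its own conflict detection and resolution can be sketched with the meeting room scheduler as the example. This is a simplified illustration, not Bayou's interface: the `Write` class, its `dependency_check` and `merge` methods, and the calendar layout are assumptions chosen to mirror the slide's points about per-write resolution, determinism, and global ordering.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Calendar = Dict[str, str]  # time slot -> owner of the reservation


@dataclass(frozen=True)
class Write:
    slot: str                    # preferred time slot
    owner: str
    alternates: Tuple[str, ...]  # fallback slots, in order of preference

    def dependency_check(self, db: Calendar) -> bool:
        """Application-specific conflict detection: is the preferred slot free?"""
        return self.slot not in db

    def merge(self, db: Calendar) -> Optional[str]:
        """Deterministic resolution: take the first free alternate, if any."""
        for alt in self.alternates:
            if alt not in db:
                return alt
        return None


def apply_write(db: Calendar, w: Write) -> Optional[str]:
    """Apply one write; return the slot actually reserved, or None if rejected."""
    slot = w.slot if w.dependency_check(db) else w.merge(db)
    if slot is not None:
        db[slot] = w.owner
    return slot


def replay(log: List[Write]) -> Calendar:
    """Rebuild the calendar from an ordered log, as any server could."""
    db: Calendar = {}
    for w in log:
        apply_write(db, w)
    return db
```

Replaying the same log always yields the same calendar, while replaying the writes in a different order can hand the contested slot to a different owner. That is why tentative writes may be backed out and reapplied once the primary server fixes the global order.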
Implementation
Two applications are studied: a bibliographic database and a meeting room scheduler.
Anti-entropy: a client may connect to any server for reading and writing data.
–Servers replicate all data, and synchronize using pair-wise communication.
–Anti-entropy ensures eventual consistency of the database (they "gossip").
A primary server is the authoritative source of consistency.
Implementation: each server logs committed and tentative data. Anti-entropy sessions update these logs accordingly.
Access control and security: security is achieved with public key cryptography, access control by allowing users to grant and revoke privileges. Primary servers are responsible for managing revocation lists.