Partition-Tolerant Distributed Publish/Subscribe Systems


1 Partition-Tolerant Distributed Publish/Subscribe Systems
Reza Sherafat Kazemzadeh and Hans-Arno Jacobsen, University of Toronto. IEEE SRDS, October 6, 2011

2 Content-Based Publish/Subscribe
Application scenario: stock quote dissemination. Entities: publishers, subscribers, and brokers. Correct delivery depends on a connected overlay plus correct routing state at brokers. Challenge: how to maintain these despite failures. [Figure: publishers and subscribers in Toronto, NY, and London connected through a pub/sub broker overlay; Trader 1 subscribes with sub = [STOCK=IBM] and Trader 2 with sub = [CHANGE>-8%].] SRDS 2011
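
To make the matching model concrete, here is a minimal sketch of content-based matching, assuming publications are attribute/value maps and subscriptions are conjunctions of simple predicates; the encoding and helper names are illustrative, not taken from the paper.

```python
# Minimal content-based matching sketch (illustrative encoding, not the paper's API).
import operator

OPS = {"=": operator.eq, ">": operator.gt, "<": operator.lt,
       ">=": operator.ge, "<=": operator.le}

def matches(subscription, publication):
    """A publication matches if every (attribute, op, value) predicate holds."""
    return all(attr in publication and OPS[op](publication[attr], value)
               for attr, op, value in subscription)

trader1 = [("STOCK", "=", "IBM")]      # sub = [STOCK=IBM]
trader2 = [("CHANGE", ">", -8.0)]      # sub = [CHANGE>-8%]
quote = {"STOCK": "IBM", "PRICE": 172.5, "CHANGE": -3.2}
print(matches(trader1, quote), matches(trader2, quote))   # True True
```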

3 Goals
Fault-tolerance (against concurrent failures): broker crashes, link failures, recoveries. Reliability: publications match subscriptions; per-source in-order delivery; after some point in time, exactly-once delivery (no loss, no duplicates). Assumptions: clients are light-weight (the broker network is responsible for reliability); there is a time t after which the system provides guaranteed delivery. The goal is to provide the P/S service in the presence of broker and link failures; we also consider recoveries of previously crashed brokers. SRDS 2011

4 System Architecture Tree dissemination networks: one path from source to destination. Pros: simple, loop-free, preserves publication order (difficult for non-tree content-based P/S). Cons: trees are highly susceptible to failures. Primary tree: the initial spanning tree formed as brokers join the system. Brokers maintain neighborhood knowledge, which allows them to reconfigure the overlay on the fly after failures. ∆-neighborhood knowledge: ∆ is a configuration parameter that ensures handling of ∆-1 concurrent failures (worst case); knowledge of other brokers within distance ∆ (join algorithm); knowledge of routing paths within the neighborhood (subscription propagation algorithm). [Figure: nested 1-, 2-, and 3-neighborhoods around a broker.] Conventional P/S systems use tree-based dissemination: pros (simple, loop-free), cons (non-resilient to failures). Our design takes an initial tree, called the "primary tree", but augments it with information about neighborhoods; a sketch of the neighborhood computation follows below. Successful reliable forwarding needs (i) a connected overlay and (ii) correct end-to-end routing state. SRDS 2011
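
As a rough illustration of ∆-neighborhood knowledge, the sketch below derives a broker's ∆-neighborhood from the primary tree with a bounded breadth-first search; the adjacency-list representation and function name are assumptions, and this is not the paper's join algorithm.

```python
# Sketch: a broker's ∆-neighborhood = all brokers within ∆ hops on the primary tree.
from collections import deque

def delta_neighborhood(tree, broker, delta):
    """tree: dict mapping each broker to its list of primary-tree neighbors."""
    seen = {broker: 0}
    queue = deque([broker])
    while queue:
        b = queue.popleft()
        if seen[b] == delta:
            continue                      # do not look past ∆ hops
        for n in tree[b]:
            if n not in seen:
                seen[n] = seen[b] + 1
                queue.append(n)
    seen.pop(broker)
    return seen                           # neighbor -> distance (1..∆)

chain = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"],
         "D": ["C", "E"], "E": ["D", "F"], "F": ["E"]}
print(delta_neighborhood(chain, "C", 2))  # {'B': 1, 'D': 1, 'A': 2, 'E': 2}
```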

5 Overview of the Approach
The approach has four algorithmic components: overlay management, subscription propagation, publication forwarding, and broker/link recovery. This talk focuses on three of them and describes the algorithms over a single chain of brokers. SRDS 2011

6 Overlay Management Alg.
Maintains end-to-end connectivity, i.e., a connected overlay, despite failures. SRDS 2011

7 Overlay Partitions When primary tree is setup, brokers communicate with their immediate neighbors in the primary tree through FIFO links. Overlay partitions: Broker crash or link failures creates “partitions” and some neighbor brokers “on the partition” become unreachable from neighboring brokers Active connections: At each point they try to maintain a connection to its closest neighbor in the primary tree. Only active connections are used by brokers Active connection to E x FIFO links A B C D E F S P D ? pid1=<C, {D}> Partition detector Brokers on the partition Brokers beyond the partition Brokers on the partition SRDS 2011

8 Overlay Partitions – 2 Adjacent Failures
What if there are more failures, particularly adjacent failures? If ∆ is large enough, the same process can be used for larger partitions. [Figure: D and E have both failed; C now holds pid1 = <C, {D}> and pid2 = <C, {D, E}> and establishes an active connection to F.] SRDS 2011

9 Overlay Partitions - ∆ Adjacent Failures
Worst-case scenario: ∆-neighborhood knowledge is not sufficient to reconnect the overlay; brokers "on" and "beyond" the partition are unreachable, and no new active connection can be established. [Figure: D, E, and F have all failed; C accumulates pid1 = <C, {D}>, pid2 = <C, {D, E}>, and pid3 = <C, {D, E, F}> but cannot reach beyond the partition.] SRDS 2011

10 Partition Island and Barriers
Brokers are connected to their closest reachable neighbors and are aware of nearby partition identifiers. How does this affect end-to-end connectivity? For any pair of brokers, a partition on the primary path between them is: an "island" if the end-to-end brokers are reachable through a sequence of active connections; a "barrier" if the end-to-end brokers are unreachable through any sequence of active connections. [Figure: two source-destination chains A-F; in the first, the partition is an island bypassed by active connections; in the second, it is a barrier that disconnects source and destination.] SRDS 2011
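
One way to read the island/barrier distinction is as a reachability test over the graph of active connections, sketched below; the graph encoding is an assumption made for illustration.

```python
# Sketch: a partition between `src` and `dst` is an island if dst is reachable
# from src through active connections, and a barrier otherwise.
from collections import deque

def classify_partition(active, src, dst):
    """active: dict broker -> set of brokers it holds active connections to."""
    seen, queue = {src}, deque([src])
    while queue:
        b = queue.popleft()
        if b == dst:
            return "island"
        for n in active.get(b, ()):
            if n not in seen:
                seen.add(n)
                queue.append(n)
    return "barrier"

# D failed; C bypasses it with an active connection to E -> island.
active = {"A": {"B"}, "B": {"C"}, "C": {"E"}, "E": {"F"}}
print(classify_partition(active, "A", "F"))                    # island
print(classify_partition({"A": {"B"}, "B": {"C"}}, "A", "F"))  # barrier
```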

11 Subscription Propagation Alg.
How are correct routing tables maintained despite overlay partitions? SRDS 2011

12 Subscription Propagation Algorithm
Establishes end-to-end routing state among brokers while taking overlay partitions into account. Subscriptions are dynamically inserted by subscribers and are propagated along the branches of the primary tree over active connections. The primary tree is the "basis" for constructing end-to-end forwarding paths. Each subscription contains: SUB = <Id, Predicates, Anchor>. Predicates specifies the subscriber's interest, e.g., [STOCK="IBM"]; Anchor is a reference to a broker along the propagation path of the subscription. SRDS 2011
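
The subscription tuple SUB = <Id, Predicates, Anchor> could be represented as follows; the field types and the pid_tags field (used on the later slides for partitions the subscription bypassed) are assumptions for illustration.

```python
# Sketch of the subscription message SUB = <Id, Predicates, Anchor>.
from dataclasses import dataclass, field

@dataclass
class Subscription:
    id: str                                      # unique subscription identifier
    predicates: list                             # e.g. [("STOCK", "=", "IBM")]
    anchor: str                                  # broker along the propagation path
    pid_tags: set = field(default_factory=set)   # partitions the sub bypassed

s = Subscription(id="sub-1", predicates=[("STOCK", "=", "IBM")], anchor="E")
print(s)
```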

13 Subscription Propagation in Absence of Overlay Partitions
The subscription's anchor field is updated to a broker up to ∆ hops closer to the subscriber. Accepting a subscription means adding it into the routing tables; a subscription is accepted (i.e., used for matching) only after confirmations are received. Observation: matching publications are delivered to a subscriber once its local broker accepts the subscription. [Figure: the subscription s travels from S's broker towards P's broker, its anchor pointing to a broker up to ∆ hops closer to the subscriber, while confirmations flow back hop by hop.] SRDS 2011
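
A loose broker-side sketch of the accept-after-confirmation rule and the anchor update described above, building on the Subscription sketch from the previous slide; the handler names and the path argument are assumptions, not the paper's interfaces.

```python
# Sketch: a broker re-anchors a subscription before forwarding it and only
# "accepts" it (adds it to its routing table) once a confirmation comes back.
class Broker:
    def __init__(self, name, delta):
        self.name, self.delta = name, delta
        self.pending = {}          # sub.id -> subscription awaiting confirmation
        self.routing_table = {}    # sub.id -> accepted subscription

    def on_subscription(self, sub, path_to_subscriber):
        """path_to_subscriber: non-empty list of brokers ordered towards S."""
        # Point the anchor at a broker up to ∆ hops closer to the subscriber.
        sub.anchor = path_to_subscriber[:self.delta][-1]
        self.pending[sub.id] = sub
        return sub                 # forwarded further along the primary tree

    def on_confirmation(self, sub_id, pid_tags=frozenset()):
        sub = self.pending.pop(sub_id)
        sub.pid_tags |= set(pid_tags)       # tags of partitions the sub bypassed
        self.routing_table[sub_id] = sub    # accepted: now used for matching
        return sub_id                       # confirmation relayed towards S
```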

14 Subscription Propagation in Presence of Overlay Partitions
Broker B sends s via its active link to bypass the partition and awaits receipt of the corresponding confirmation. Once B receives the confirmation and accepts s, it tags the confirmation with the pid of the partition(s) that s bypassed. Brokers relay this tag in their confirmation messages towards the subscriber's local broker, which accepts s and stores it along with the tag in its routing table. [Figure: s bypasses the partition over B's active link; the confirmations flowing back are tagged with the pid (conf*), and the pid tag is stored alongside s (☑*).] SRDS 2011

15 Publication Forwarding Alg.
How are accepted subscriptions and their partition tags used to achieve reliable delivery? SRDS 2011

16 Publication Forwarding in Absence of Overlay Partitions
Forwarding uses only subscriptions accepted by brokers. Steps in forwarding a publication p: (1) identify the anchors of accepted subscriptions that match p; (2) determine the active connections towards the matching subscriptions' anchors; (3) send p on those active connections and wait for confirmations; (4) if there are local matching subscribers, deliver to them; (5) if no downstream matching subscriber exists, issue a confirmation towards P; (6) once confirmations arrive, discard p and send a confirmation towards P. A sketch of these steps follows below. [Figure: p is forwarded hop by hop from P's broker to S's broker, is delivered to local subscribers, and confirmations flow back.] SRDS 2011
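
A condensed sketch of the forwarding steps listed above, under assumed data structures (an accepted-subscription table, a map from anchors to active connections, and the matching helper from the earlier sketch); it is not the paper's implementation and omits the confirmation bookkeeping.

```python
# Sketch of steps 1-4: pick active connections towards the anchors of matching
# accepted subscriptions, and collect local deliveries.
def forward_publication(p, accepted_subs, active_route, local_subs, matches):
    """accepted_subs: iterable of accepted subscriptions (with .predicates and
    .anchor); active_route: anchor -> active connection to use;
    local_subs: list of (subscriber, predicates) attached to this broker."""
    anchors = {s.anchor for s in accepted_subs if matches(s.predicates, p)}
    out = {active_route[a] for a in anchors if a in active_route}
    deliveries = [client for client, preds in local_subs if matches(preds, p)]
    # Steps 5-6 (not shown): if nothing was sent downstream, confirm towards the
    # publisher right away; otherwise wait for confirmations on `out`, then
    # discard p and send a confirmation upstream.
    return out, deliveries
```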

17 Publication Forwarding in Presence of Overlay Partitions
Key forwarding invariant to ensure reliability: no publication is delivered to a subscriber after being forwarded by brokers that have not accepted its subscription. Case 1: subscription s has been accepted with no pid. It is safe to bypass intermediate brokers. [Figure: p is forwarded over active connections that bypass the partition, delivered to local subscribers, and confirmed back.] SRDS 2011

18 Publication Forwarding (cont’d)
Case 2: subscription s has been accepted with some pid. Case 2a: the publisher's local broker has accepted s, and we ensure all intermediate forwarding brokers have done so as well: it is safe to deliver publications from sources beyond the partition. [Figure: p flows past the partition and s is stored with its pid tag (☑*).] SRDS 2011

19 Publication Forwarding (cont’d)
Case 2a (continued): depending on when the bypassing link has been established, either the recovery procedure or subscription propagation ensures that C accepts s prior to receiving p. [Figure: same scenario as the previous slide.] SRDS 2011

20 Publication Forwarding (cont’d)
Case 2: subscription s is accepted with some pid tags. Case 2b: the publisher's broker has not accepted s: it is unsafe to deliver publications from this publisher (by the invariant), so p is tagged with the pid (p*). [Figure: tagged publications p* propagate along the chain; s was accepted at S's broker with the same pid tag (☑*).] A condensed sketch of this case analysis follows below. SRDS 2011
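
Read as a single decision, the case analysis of the last few slides boils down to something like the check below; this is a condensed reading of the slides, not the exact condition from the paper.

```python
# Sketch of the delivery-safety decision behind Cases 1, 2a, and 2b.
def safe_to_deliver(sub_has_pid_tags, publisher_broker_accepted_sub):
    if not sub_has_pid_tags:
        return True    # Case 1: s accepted with no pid; safe to bypass brokers
    if publisher_broker_accepted_sub:
        return True    # Case 2a: publisher-side broker (and intermediates) accepted s
    return False       # Case 2b: unsafe; tag p with the pid instead of delivering

print(safe_to_deliver(False, False))  # Case 1  -> True
print(safe_to_deliver(True, True))    # Case 2a -> True
print(safe_to_deliver(True, False))   # Case 2b -> False
```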

21 Evaluation Using a mix of simulation and experimental deployments on a large-scale testbed. SRDS 2011

22 Size of ∆-neighborhoods
Simulation results: size of brokers' neighborhoods as a function of ∆, for a network of 1000 brokers with fanout 3. [Plot: size of ∆-neighborhoods for ∆ = 1 to 4.] SRDS 2011

23 Impact of Failures on End-to-End Broker Reachability
Using a graph simulation tool. Overlay setup: network size of 1000, brokers with fanout 3. Failure injection: up to 100 broker failures; we randomly marked a given number of nodes as failed. Measurements: we counted the number of end-to-end broker pairs whose intermediate primary-tree path contains a chain of ∆ consecutive failed brokers. [Plot: results for ∆ = 1 to 4.] SRDS 2011

24 Experimental Deployments: Impact of Failures on Pub Delivery
500 brokers deployed on 8-core machines in a cluster. Network setup: overlay fanout of 3. We measured the aggregate publication delivery count in an interval of 120 s. The "Expected" bar is the number of publications that must be delivered despite failures (this excludes traffic to/from failed brokers). [Plot: delivery counts for ∆ = 1 to 4 against the expected count.] SRDS 2011

25 Conclusions We developed a reliable P/S system that tolerates concurrent broker and link failures: the configuration parameter ∆ determines the level of resiliency against failures (in the worst case); dissemination trees are augmented with neighborhood knowledge; neighborhood knowledge allows brokers to maintain network connectivity and make forwarding decisions despite failures. We also studied the performance of the system when the number of failures far exceeds ∆: a small value of ∆ ensures good connectivity. SRDS 2011

26 Questions… Thanks for your attention! SRDS 2011

27 Challenges
Responsibility rests on the P/S messaging system. Why does the "end-to-end" principle not work? Publishers and subscribers are decoupled and unaware of each other; routing paths are established by dynamically inserted subscriptions, and subscription propagation is itself subject to broker/link failures. Selective delivery makes in-order delivery over redundant paths difficult, since subscribers are only interested in a subset of what is published. We use a special form of tree dissemination; from now on we only discuss brokers. SRDS 2011

28 Store-and-Forward A copy is first preserved on disk
Intermediate hops send an ACK to the previous hop after preserving the copy; ACKed copies can be discarded from disk. Upon failures, unacknowledged copies survive and are re-transmitted after recovery. This ensures reliable delivery but may cause delays while the machine is down. [Figure: a publication forwarded hop by hop from source to destination, with ACKs flowing back.] SRDS'09
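
A small sketch of the per-hop store-and-forward pattern described above; an in-memory dict stands in for the on-disk store, and the names are illustrative.

```python
# Sketch of store-and-forward at one hop: persist a copy before forwarding and
# discard it only once the next hop acknowledges; re-send survivors on recovery.
class StoreAndForwardHop:
    def __init__(self, send):
        self.send = send           # function that transmits to the next hop
        self.store = {}            # msg_id -> message persisted ("on disk")

    def forward(self, msg_id, msg):
        self.store[msg_id] = msg   # preserve first, so a crash cannot lose it
        self.send(msg_id, msg)

    def on_ack(self, msg_id):
        self.store.pop(msg_id, None)    # ACKed copies can be discarded

    def on_recovery(self):
        for msg_id, msg in self.store.items():
            self.send(msg_id, msg)      # re-transmit unacknowledged copies
```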

29 Mesh-Based Overlay Networks [Snoeren, et al., SOSP 2001]
Use a mesh network to concurrently forward messages on disjoint paths; upon failures, the message is delivered using alternative routes. Pros: minimal impact on delivery delay. Cons: imposes additional traffic and the possibility of duplicate delivery. [Figure: a message forwarded along multiple disjoint paths from source to destination.] SRDS'09

30 Replica-based Approach [Bhola , et al., DSN 2002]
Replicas are grouped into virtual nodes; replicas have identical routing information. [Figure: a virtual node made up of several physical machines.] SRDS'09

31 Replica-based Approach [Bhola , et al., DSN 2002]
Replicas are grouped into virtual nodes and have identical routing information; we compare against this approach. [Figure: publications routed through replicated virtual nodes.] SRDS'09

32 Publication Forwarding (cont’d)
Case 2: subscription s has been accepted with some pid. Case 2b (partition barrier): the publisher's broker has also not accepted s. [Figure: publication p1 is tagged with the pid (p1*) as it is forwarded; subscription r was accepted at R (☑r), and s was accepted at S's broker with the same pid tag (☑*).] SRDS 2011

33 Subscription Propagation with Partitions
Partition islands: simply confirm (and accept) subscriptions over the available active connections, provided the partition brokers are reachable from the other side of the partition. Intuition: publications from P may only be lost if they arrive at B, but this will not happen since there is no link towards B from F. The correctness proof argues on the precedence of acceptance and the creation of links. [Figure: lead broker A confirms the subscription; brokers on the partition will accept it during recovery.] SRDS 2011

34 Subscription Propagation with Partition Barriers
If a portion of the network that includes publishers is on or beyond a partition barrier, there is no way to communicate the subscription information for the duration of the failures. The lead broker "partially confirms" the subscription and tags the confirmation with the partition information; accepting brokers store the partition information along with the subscription. This ensures liveness. [Figure: lead broker B forwards and partially confirms the subscription, tagged with the partition information (☑*), within ∆ hops.] SRDS 2011

35 Publication Forwarding
Only accepted subscriptions are stored in the SRT (subscription routing table) and used for matching. At each point in time, a broker has a number of connections to its nearest reachable neighbors; this set of active connections may change over time. Publication forwarding steps: store the publication in an internal FIFO message queue; match and compute the set {from} of subscriptions that match; for each partially confirmed subscription, tag the publication with the partition information; send the publication to the closest reachable neighbors towards {from}; once all confirmations arrive, discard the publication and issue a confirmation towards the publisher. [Figure: broker A's internal queue and its (δ+1)-neighborhood.] SRDS 2011

36 Evaluations
Size of brokers’ Neighborhoods as a function of ∆ Network size of 1000 Broker fanout of 3 Network size of 1000 Broker fanout of 7 ∆=4 ∆=3 ∆=2 ∆=1 Size of ∆-neighborhoods ∆=4 ∆=3 ∆=1 ∆=2 Size of ∆-neighborhoods SRDS 2011

37 Overlay Links Management
Sessions: FIFO communication links between brokers. Active sessions: broker A's session to B is active if A has no session to another broker C on the path between A and B. [Figure: a primary tree with ∆ = 2.] SRDS 2011
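
The active-session rule above can be sketched directly as a predicate; the path helper over the primary tree is an assumption for illustration.

```python
# Sketch: A's session to B is active iff A has no session to some other broker C
# that lies strictly between A and B on the primary tree.
def is_active(a, b, sessions, path_between):
    """sessions: brokers A currently has sessions to; path_between(a, b):
    brokers strictly between a and b on the primary tree."""
    return not any(c in sessions for c in path_between(a, b))

chain = ["A", "B", "C", "D", "E"]
def path_between(x, y):
    i, j = sorted((chain.index(x), chain.index(y)))
    return chain[i + 1:j]

print(is_active("A", "B", {"B", "C"}, path_between))  # True
print(is_active("A", "C", {"B", "C"}, path_between))  # False (B lies in between)
```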

38 Agenda Challenges of reliability and fault-tolerance in P/S
Our approach: topology (neighborhood knowledge), subscription propagation, publication forwarding, recovery procedure. Evaluation results. Conclusions. SRDS 2011

