
1 Reza Sherafat Kazemzadeh* and Hans-Arno Jacobsen, University of Toronto. IEEE SRDS, October 6, 2011.

2 Stock quote dissemination
[Figure: publishers (P) in New York publish stock quotes into a content-based pub/sub broker overlay; traders in London and Toronto subscribe with filters such as sub = [STOCK=IBM] and sub = [CHANGE>-8%] and receive matching publications.]

3 Reliable delivery
Fault-tolerance (against concurrent failures):
 Broker crashes
 Link failures
 Recoveries
Reliability:
 Publications match subscriptions
 Per-source in-order delivery
 Exactly-once delivery (no loss, no duplicates) after some point in time
Assumptions:
 Clients are lightweight (the broker network is responsible for reliability)
 There is a time t after which the system provides guaranteed delivery

4 Tree dissemination networks: one path from source to destination
 Pros:
 Simple, loop-free
 Preserves publication order (difficult for non-tree content-based P/S)
 Cons:
 Trees are highly susceptible to failures
Primary tree: the initial spanning tree formed as brokers join the system
 Brokers maintain neighborhood knowledge
 Allows brokers to reconfigure the overlay on the fly after failures
∆-neighborhood knowledge: ∆ is a configuration parameter that ensures handling of ∆-1 concurrent failures (worst case)
 Knowledge of other brokers within distance ∆ (established by the join algorithm)
 Knowledge of routing paths within the neighborhood (established by the subscription propagation algorithm)
[Figure: nested 1-, 2-, and 3-neighborhoods around a broker in the primary tree.]
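A minimal sketch (not from the paper) of how a broker might compute its ∆-neighborhood with a hop-bounded breadth-first traversal of the primary tree; the adjacency-list representation, function name, and example tree are illustrative assumptions.

```python
from collections import deque

def delta_neighborhood(tree, root, delta):
    """Return the brokers within `delta` hops of `root` in the primary tree,
    with the hop distance to each of them.  `tree` maps each broker id to
    the list of its tree neighbors."""
    dist = {root: 0}
    queue = deque([root])
    while queue:
        b = queue.popleft()
        if dist[b] == delta:        # do not expand beyond delta hops
            continue
        for n in tree.get(b, []):
            if n not in dist:       # first visit gives the shortest hop count
                dist[n] = dist[b] + 1
                queue.append(n)
    dist.pop(root)                  # neighborhood excludes the broker itself
    return dist

# Example: a small primary tree; broker 'B' with delta=2 knows A, C, and D.
tree = {'A': ['B'], 'B': ['A', 'C'], 'C': ['B', 'D'],
        'D': ['C', 'E'], 'E': ['D', 'F'], 'F': ['E']}
print(delta_neighborhood(tree, 'B', 2))   # {'A': 1, 'C': 1, 'D': 2}
```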

5 System components: Overlay Management, Subscription Propagation, Publication Forwarding, Broker/Link Recovery.
[Figure: the components arranged as a single chain.]

6 Overlay management: maintains end-to-end connectivity despite failures in the overlay.

7  Once the primary tree is set up, brokers communicate with their immediate neighbors in the primary tree through FIFO links.
 Overlay partitions: a broker crash or link failure creates a "partition", and the brokers on the partition become unreachable from their neighboring brokers.
 Active connections: at each point in time, a broker tries to maintain a connection to its closest reachable neighbor in the primary tree.
 Only active connections are used by brokers.
[Figure: chain of brokers A-F between publisher P and subscriber S; D fails (partition pid1); the partition detector establishes an active connection to E, the closest broker beyond the partition.]
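The active-connection choice above can be sketched as scanning the known primary-tree path for the closest broker that is not failed; this is a toy illustration under the assumption that each broker learned that path from its ∆-neighborhood, and all names are invented for the example.

```python
def pick_active_connection(path_in_direction, failed):
    """Given the ordered list of brokers on the primary-tree path in one
    direction (nearest first, limited to the delta-neighborhood) and the
    set of brokers currently detected as failed, return the closest
    reachable broker to connect to, or None if none is reachable."""
    for broker in path_in_direction:
        if broker not in failed:
            return broker
    return None

# Example: broker C's known path towards the subscriber side is D, E, F.
print(pick_active_connection(['D', 'E', 'F'], {'D'}))            # E
print(pick_active_connection(['D', 'E', 'F'], {'D', 'E'}))       # F
print(pick_active_connection(['D', 'E', 'F'], {'D', 'E', 'F'}))  # None
```

The third call illustrates the worst case on slide 9: every broker within the known path has failed and no new active connection can be established.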

8  What if there are more failures, in particular adjacent failures?
 If ∆ is large enough, the same process is used for larger partitions.
[Figure: both D and E have failed (partitions pid1, pid2); the active connection is re-established to F, the next reachable broker beyond the partition.]

9  Worst-case scenario: ∆-neighborhood knowledge is not sufficient to reconnect the overlay.
 All brokers "on" and "beyond" the partition are unreachable; no new active connection can be established.
[Figure: D, E, and F have all failed (partitions pid1, pid2, pid3); no reachable broker remains within the ∆-neighborhood.]

10 Brokers are connected to their closest reachable neighbors and are aware of nearby partition identifiers.
 How does this affect end-to-end connectivity? For any pair of brokers, a partition on the primary path between them is:
 An "island" if the end-to-end brokers are reachable through a sequence of active connections
 A "barrier" if the end-to-end brokers are unreachable through any sequence of active connections
[Figure: top, an island: source and destination stay connected via active connections that bypass the failed brokers; bottom, a barrier: no sequence of active connections joins source and destination.]
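A small sketch of the island/barrier distinction: classify the failure by checking whether the destination is reachable from the source over the graph of currently active connections. The graph encoding and names are assumptions for illustration only.

```python
from collections import deque

def classify_partition(active, src, dst):
    """Classify the failure between `src` and `dst`: an 'island' if dst is
    reachable from src over active connections, otherwise a 'barrier'.
    `active` maps each live broker to the brokers it currently holds
    active connections to (assumed symmetric here)."""
    seen, queue = {src}, deque([src])
    while queue:
        b = queue.popleft()
        if b == dst:
            return 'island'
        for n in active.get(b, []):
            if n not in seen:
                seen.add(n)
                queue.append(n)
    return 'barrier'

# Example: on the chain A-B-C-D-E-F, broker D has failed.  C bypasses it
# with an active connection to E, so the failure is an island for (A, F).
active = {'A': ['B'], 'B': ['A', 'C'], 'C': ['B', 'E'],
          'E': ['C', 'F'], 'F': ['E']}
print(classify_partition(active, 'A', 'F'))   # island
# If C cannot reach any broker beyond D, the same check returns 'barrier'.
print(classify_partition({'A': ['B'], 'B': ['A', 'C'], 'C': ['B']}, 'A', 'F'))
```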

11 Subscription propagation: how are correct routing tables maintained despite overlay partitions?

12  Establishes end-to-end routing state among brokers while taking overlay partitions into account.
 Subscriptions are dynamically inserted by subscribers and propagated along branches of the primary tree over active connections.
 The primary tree is the "basis" for constructing end-to-end forwarding paths.
 Each subscription contains: SUB = <predicates, anchor>
 Predicates specify the subscriber's interest, e.g., [STOCK="IBM"]
 Anchor is a reference to a broker along the propagation path of the subscription
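An illustrative rendering of the SUB = <predicates, anchor> structure together with a simple equality-only matcher; the field names, the pid set, and the accepted flag are assumptions introduced for the sketches that follow, not the paper's exact data structures.

```python
from dataclasses import dataclass, field

@dataclass
class Subscription:
    """Illustrative subscription record: content predicates, the anchor
    broker the subscription is currently attributed to, the partition ids
    (pids) it bypassed during propagation, and an accepted flag."""
    predicates: dict          # e.g. {'STOCK': 'IBM'}
    anchor: str               # broker up to delta hops closer to the subscriber
    pids: set = field(default_factory=set)
    accepted: bool = False    # set once the confirmation comes back

def matches(sub, publication):
    """A publication matches if it satisfies every predicate (equality-only
    matching here; the real system supports richer operators)."""
    return all(publication.get(k) == v for k, v in sub.predicates.items())

sub = Subscription(predicates={'STOCK': 'IBM'}, anchor='E')
print(matches(sub, {'STOCK': 'IBM', 'CHANGE': -2.5}))   # True
print(matches(sub, {'STOCK': 'MSFT'}))                  # False
```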

13  The subscription's anchor field is updated to a broker up to ∆ hops closer to the subscriber.
 Accepting a subscription means adding it to the routing tables.
 A subscription is accepted (i.e., used for matching) only after the corresponding confirmations are received.
 Observation: matching publications are delivered to a subscriber once its local broker accepts the subscription.
[Figure: subscription s propagates from the subscriber's local broker towards the publisher; confirmations flow back, brokers mark s as accepted (☑) once confirmed, and s.anchor points up to ∆ hops closer to the subscriber.]

14  Broker B sends s via its active connection to bypass the partition and awaits receipt of the corresponding confirmation.
 Once B receives the confirmation and accepts s, it tags the confirmation with the pids of the partitions that s bypassed.
 Brokers relay this tag in their confirmation messages towards the subscriber's local broker, which accepts s and stores it along with the tag in its routing table.
[Figure: a broker on the path has failed; the subscription bypasses it over an active connection, and the returning confirmation is tagged with the partition's pid (☑*), which is stored along with s.]
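A heavily simplified sketch of confirmation-driven acceptance with pid tagging: failed brokers on the path are assumed to be bypassed over active connections, and brokers on the subscriber side of a bypassed partition store its pid together with the accepted subscription. The pid(...) labels and the return format are illustrative, not the protocol's wire format.

```python
def propagate_subscription(path, failed):
    """`path` lists brokers from the subscriber's local broker towards the
    publisher side.  The confirmation is walked back from the publisher-side
    end: each live broker accepts the subscription, and once a failed
    (bypassed) broker is crossed, its pid is carried on the confirmation
    and stored by every broker closer to the subscriber."""
    accepted = {}
    tags = set()
    for broker in reversed(path):          # confirmation path, hop by hop
        if broker in failed:
            tags = tags | {f'pid({broker})'}   # partition bypassed here
            continue
        accepted[broker] = set(tags)           # store current tags with sub
    return accepted

# Example: subscriber at E, publisher side ends at A, broker C has failed.
# A and B accept the subscription untagged; D and E store C's pid with it.
print(propagate_subscription(['E', 'D', 'C', 'B', 'A'], {'C'}))
# {'A': set(), 'B': set(), 'D': {'pid(C)'}, 'E': {'pid(C)'}}
```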

15 Publication forwarding: how are accepted subscriptions and their partition tags used to achieve reliable delivery?

16  Forwarding only uses subscriptions accepted by brokers.
 Steps in forwarding a publication p:
 Identify the anchors of accepted subscriptions that match p
 Determine the active connections towards the matching subscriptions' anchors
 Send p on those active connections and wait for confirmations
 If there are local matching subscribers, deliver p to them
 If no downstream matching subscriber exists, issue a confirmation towards P
 Once all confirmations arrive, discard p and send a confirmation towards P
[Figure: p flows from publisher P's broker towards subscriber S over active connections; brokers deliver to local subscribers and confirmations flow back towards P.]
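A sketch of a single broker's forwarding decision, restricted (as the slide requires) to subscriptions the broker has already accepted; the dictionary-based publication format and the active_towards map are assumptions for illustration.

```python
def forward_publication(pub, accepted_subs, active_towards, local_subs):
    """One broker's forwarding decision for publication `pub` (a dict of
    attribute/value pairs).  `accepted_subs` holds only subscriptions this
    broker has accepted, each a dict with 'predicates' and 'anchor';
    `active_towards` maps an anchor to the active connection currently
    leading towards it; `local_subs` are local clients' subscriptions."""
    def match(sub):
        return all(pub.get(k) == v for k, v in sub['predicates'].items())

    next_hops = {active_towards[s['anchor']] for s in accepted_subs if match(s)}
    deliver_locally = any(match(s) for s in local_subs)
    return next_hops, deliver_locally

# Example: broker B accepted a subscription anchored at D; with D failed,
# B's active connection towards D currently points at E instead.
accepted = [{'predicates': {'STOCK': 'IBM'}, 'anchor': 'D'}]
print(forward_publication({'STOCK': 'IBM', 'CHANGE': -3.0},
                          accepted, {'D': 'E'}, local_subs=[]))
# ({'E'}, False)
```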

17  Key forwarding invariant to ensure reliability: no publication is delivered to a subscriber after being forwarded by a broker that has not accepted the subscriber's subscription.
 Case 1: subscription s has been accepted with no pid tag.
 It is safe to bypass intermediate brokers.
[Figure: with an intermediate broker failed, publications are forwarded over the active connection and delivered to local subscribers; confirmations flow back towards P.]

18  Case 2: subscription s has been accepted with some pid tag.
 Case 2a: the publisher's local broker has accepted s, and all intermediate forwarding brokers are ensured to have done so as well.
 It is safe to deliver publications from sources beyond the partition.
[Figure: publications from P bypass the partition over active connections and are delivered to S; brokers towards the subscriber hold s with the partition's pid tag (☑*).]

19  Case 2: subscription s has been accepted with some pid tag.
 Case 2a (continued): depending on when the bypassing link was established, either the recovery procedure or subscription propagation ensures that broker C accepts s prior to receiving p.
[Figure: same scenario as the previous slide, highlighting broker C on the forwarding path.]

20  Case 2: subscription s is accepted with some pid tags.
 Case 2b: the publisher's local broker has not accepted s.
 It is unsafe to deliver publications from this publisher (this would violate the invariant).
[Figure: forwarded publications are tagged with the partition's pid (p*); since s was accepted at the subscriber's broker with the same pid tag, the tagged publications are not delivered to s.]

21 Evaluation: using a mix of simulations and experimental deployments on a large-scale testbed.

22 Size of brokers' neighborhoods as a function of ∆
 Network size of 1000
 Broker fanout of 3
[Figure: distribution of ∆-neighborhood sizes for ∆ = 1, 2, 3, 4.]

23  Using a graph simulation tool.
 Overlay setup:
▪ Network size of 1000 brokers with fanout = 3
 Failure injection:
▪ Up to 100 broker failures
▪ A given number of nodes is randomly marked as failed
 Measurements:
▪ We counted the number of end-to-end broker pairs whose intermediate primary-tree path contains ∆ consecutive failed brokers in a chain
[Figure: results for ∆ = 1, 2, 3, 4.]

24  500 brokers deployed on 8-core machines in a cluster
 Network setup: overlay fanout = 3
 We measured the aggregate publication delivery count in an interval of 120 s
 The "Expected" bar is the number of publications that must be delivered despite failures (this excludes traffic to/from failed brokers)
[Figure: delivery counts for ∆ = 1, 2, 3, 4 compared against the expected count.]

25  We developed a reliable P/S system that tolerates concurrent broker and link failures:
 The configuration parameter ∆ determines the level of resiliency against failures (in the worst case).
 Dissemination trees are augmented with neighborhood knowledge.
 Neighborhood knowledge allows brokers to maintain network connectivity and make forwarding decisions despite failures.
 We studied the performance of the system when the number of failures far exceeds ∆:
 A small value of ∆ ensures good connectivity.

26 Thanks for your attention!

27  Why does the "end-to-end" principle not work here?
 Publishers and subscribers are decoupled and unaware of each other; responsibility falls on the P/S messaging system.
 Routing paths are established by dynamically inserted subscriptions, and subscription propagation is itself subject to broker/link failures; this is handled by the subscription propagation algorithm.
 Selective delivery makes in-order delivery over redundant paths difficult, since subscribers are only interested in a subset of what is published; we use a special form of tree dissemination.

28  A copy of each message is first preserved on disk.
 Intermediate hops send an ACK to the previous hop after preserving the message.
 ACKed copies can be discarded from disk.
 Upon failures, unacknowledged copies survive the failure and are re-transmitted after recovery.
 This ensures reliable delivery but may cause delays while the machine is down.
[Figure: a publication travels hop by hop from source to destination; each hop acknowledges to the previous one after persisting.]
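A toy model of this per-hop persistence-and-ACK scheme, with an in-memory dictionary standing in for the on-disk log; names and structure are illustrative, not the mechanism used in the paper.

```python
class HopStore:
    """A hop keeps every message it forwarded until the next hop ACKs it;
    unacknowledged messages are re-sent after the next hop recovers."""
    def __init__(self):
        self.unacked = {}                 # msg_id -> message (stand-in for disk)

    def forward(self, msg_id, msg, send):
        self.unacked[msg_id] = msg        # persist before handing downstream
        send(msg_id, msg)

    def on_ack(self, msg_id):
        self.unacked.pop(msg_id, None)    # safe to discard the stored copy

    def retransmit(self, send):
        for msg_id, msg in self.unacked.items():
            send(msg_id, msg)             # replay survivors after recovery

# Example: the first send is lost because the next hop is down; after the
# hop recovers, retransmit() replays the stored copy and the ACK clears it.
store, delivered = HopStore(), []
store.forward(1, 'quote', send=lambda i, m: None)        # next hop down
store.retransmit(send=lambda i, m: delivered.append(m))  # after recovery
store.on_ack(1)
print(delivered, store.unacked)   # ['quote'] {}
```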

29  Use a mesh network to concurrently forward messages on disjoint paths.
 Upon failures, the message is delivered using alternative routes.
 Pros: minimal impact on delivery delay.
 Cons: imposes additional traffic and the possibility of duplicate delivery.
[Figure: a publication forwarded from source to destination along multiple disjoint paths.]

30  Replicas are grouped into virtual nodes.
 Replicas have identical routing information.
[Figure: physical machines grouped into a virtual node.]

31  Replicas are grouped into virtual nodes.
 Replicas have identical routing information.
 We compare against this approach.
[Figure: broker overlay in which replicated brokers form virtual nodes.]

32  Case 2: subscription s has been accepted with some pid tags.
 Case 2b (partition barrier): the publisher's local broker has also not accepted s.
[Figure: a second, fully accepted subscription r (☑r) shares the path with s; publication p1 is tagged with the partition's pid (p1*), and since s was accepted at the subscriber's broker with the same pid tag, p1 is withheld from s.]

33  Partition islands:
 Simply confirm (and accept) subscriptions over the available active connections, provided the partition's brokers are reachable from the other side of the partition.
 Intuition: publications from P may only be lost if they arrive at B, but this will not happen since there is no link towards B from F.
 The correctness proof argues on the precedence of acceptance and the creation of links.
[Figure: lead broker A confirms the subscription; broker B, on the partition, will accept it during recovery.]

34  If a portion of the network that includes publishers is on or beyond a partition barrier, there is no way to communicate the subscription information for the duration of the failures.
 The lead broker "partially confirms" the subscription and tags the confirmation with the partition information.
 Accepting brokers store the partition information along with the subscription.
 This ensures liveness.
[Figure: lead broker B forwards the subscription and sends a partial confirmation (☑*) back towards the subscriber, up to ∆ hops.]

35  Only accepted subscriptions are stored in the SRT and used for matching.
 At each point in time, a broker has a number of connections to its nearest reachable neighbors; this set of active connections may change over time.
 Publication forwarding steps:
1. Store the publication in a FIFO internal message queue
2. Match and compute the set {from} for subscriptions that match
3. For each partially confirmed subscription, tag the publication with the partition information
4. Send the publication to the closest reachable neighbors towards {from}
5. Once all confirmations arrive, discard the publication and issue a confirmation towards the publisher
[Figure: a broker's internal message queue and its (∆+1)-neighborhood.]
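A sketch of the five forwarding steps listed above, including the confirmation bookkeeping that lets a broker discard a publication only after every active connection it was sent on has confirmed; the class layout, message format, and the `_pids` tag are illustrative assumptions.

```python
from collections import deque

class ForwardingBroker:
    """Sketch of the five forwarding steps: queue, match, tag, send, confirm.
    A publication is discarded (and confirmed upstream) only once every hop
    it was sent to has confirmed it."""
    def __init__(self, send, confirm_upstream):
        self.queue = deque()      # step 1: FIFO internal message queue
        self.pending = {}         # pub_id -> hops still expected to confirm
        self.send = send
        self.confirm_upstream = confirm_upstream

    def on_publication(self, pub_id, pub, matching_subs, active_towards):
        self.queue.append((pub_id, pub))
        outgoing = {}
        for sub in matching_subs:                 # step 2: matching subscriptions
            hop = active_towards[sub['anchor']]   # step 4: closest reachable neighbor
            msg = dict(pub)
            if sub.get('pids'):                   # step 3: partially confirmed sub
                msg['_pids'] = sorted(sub['pids'])
            outgoing[hop] = msg
        self.pending[pub_id] = set(outgoing)
        for hop, msg in outgoing.items():
            self.send(hop, pub_id, msg)

    def on_confirmation(self, pub_id, hop):
        self.pending[pub_id].discard(hop)
        if not self.pending[pub_id]:              # step 5: all hops confirmed
            del self.pending[pub_id]
            self.queue = deque(q for q in self.queue if q[0] != pub_id)
            self.confirm_upstream(pub_id)
```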

36 Size of brokers' neighborhoods as a function of ∆
 Network size of 1000, broker fanout of 3
 Network size of 1000, broker fanout of 7
[Figure: distributions of ∆-neighborhood sizes for ∆ = 1, 2, 3, 4 under both fanouts.]

37  Sessions: FIFO communication links between brokers.
 Active sessions: broker A's session to B is active if A has no session to another broker C on the path between A and B.
[Figure: primary tree with ∆ = 2.]

38  Challenges of reliability and fault-tolerance in P/S
 Our approach:
 Topology neighborhood knowledge
 Subscription propagation
 Publication forwarding
 Recovery procedure
 Evaluation results
 Conclusions

