Partition-Tolerant Distributed Publish/Subscribe Systems

Partition-Tolerant Distributed Publish/Subscribe Systems
Reza Sherafat Kazemzadeh*, Hans-Arno Jacobsen
University of Toronto
IEEE SRDS, October 6, 2011

Content-Based Publish/Subscribe
Application scenario: stock quote dissemination.
Entities: publishers, subscribers, and brokers.
Correct delivery depends on a connected overlay and correct routing state at the brokers.
Challenge: how to maintain both despite failures.
[Figure: stock quote dissemination through a pub/sub broker overlay spanning Toronto, NY, and London. Trader 1 and Trader 2 hold the subscriptions sub = [STOCK=IBM] and sub = [CHANGE>-8%].]

Goals
Fault-tolerance (against concurrent failures): broker crashes, link failures, and recoveries.
Reliability (after some point in time): publications match subscriptions; per-source in-order delivery; exactly-once delivery (no loss, no duplicates).
Assumptions: clients are light-weight (the broker network is responsible for reliability); there is a time t after which the system provides guaranteed delivery.
Notes: The goal is to provide P/S service in the presence of broker and link failures. We also consider recoveries of previously crashed brokers.

System Architecture
Tree dissemination networks: one path from source to destination.
Pros: simple and loop-free; preserves publication order (difficult for non-tree content-based P/S).
Cons: trees are highly susceptible to failures.
Primary tree: the initial spanning tree that is formed as brokers join the system.
Brokers maintain neighborhood knowledge, which allows them to reconfigure the overlay on the fly after failures.
∆-neighborhood knowledge: ∆ is a configuration parameter that ensures handling of ∆-1 concurrent failures (worst case).
Knowledge of other brokers within distance ∆ → join algorithm.
Knowledge of routing paths within the neighborhood → subscription propagation algorithm.
[Figure: a broker's 1-neighborhood, 2-neighborhood, and 3-neighborhood.]
Notes: Conventional P/S systems use tree-based dissemination, which is simple and loop-free but not resilient to failures. Our design takes an initial tree, called the “primary tree”, and augments it with information about neighborhoods. Successful reliable forwarding needs (i) a connected overlay and (ii) correct end-to-end routing state.
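
As a rough illustration of ∆-neighborhood knowledge, the sketch below computes the brokers within ∆ hops of a given broker with a breadth-first search over the primary tree. The adjacency-list representation, the function name `delta_neighborhood`, and the example chain are assumptions made for this sketch; the slides do not describe how the join algorithm builds this knowledge.

```python
from collections import deque

def delta_neighborhood(tree, broker, delta):
    """Map each broker within `delta` hops of `broker` to its hop distance."""
    dist = {broker: 0}
    queue = deque([broker])
    while queue:
        node = queue.popleft()
        if dist[node] == delta:
            continue  # do not look past the neighborhood boundary
        for neighbor in tree[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    del dist[broker]  # a broker is not part of its own neighborhood
    return dist

# Hypothetical primary tree: the chain A-B-C-D-E-F as an adjacency list.
chain = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"],
         "D": ["C", "E"], "E": ["D", "F"], "F": ["E"]}
print(delta_neighborhood(chain, "C", 2))  # {'B': 1, 'D': 1, 'A': 2, 'E': 2}
```

Raising `delta` widens what each broker knows, which is the knob the slides describe for trading state against resilience.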

Overview of the Approach
Four algorithmic components: overlay management, subscription propagation, publication forwarding, and broker/link recovery.
This talk focuses on three of them and describes the algorithms over a single chain of brokers.

Overlay Management Algorithm
Maintains a connected overlay, and with it end-to-end connectivity, despite failures.

Overlay Partitions
When the primary tree is set up, brokers communicate with their immediate neighbors in the primary tree through FIFO links.
Overlay partitions: a broker crash or link failure creates a “partition”, and the brokers “on the partition” become unreachable from their neighboring brokers.
Active connections: at each point in time, a broker tries to maintain a connection to its closest reachable neighbor in the primary tree. Brokers use only active connections.
[Figure: chain of brokers A to F with S and P. D fails; the partition detector C records pid1 = <C, {D}> and opens an active connection to E. Labels mark the brokers on the partition and the brokers beyond the partition.]

Overlay Partitions – 2 Adjacent Failures
What if there are more failures, particularly adjacent failures? If ∆ is large enough, the same process can be used for larger partitions.
[Figure: chain of brokers A to F with S and P. D and E fail; in addition to pid1 = <C, {D}>, C records pid2 = <C, {D, E}> and opens an active connection to F.]

Overlay Partitions – ∆ Adjacent Failures
Worst-case scenario: ∆-neighborhood knowledge is not sufficient to reconnect the overlay. Brokers “on” and “beyond” the partition are unreachable, and no new active connection is created.
[Figure: chain of brokers A to F with S and P. D, E, and F fail; C records pid1 = <C, {D}>, pid2 = <C, {D, E}>, and pid3 = <C, {D, E, F}>.]
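
The three partition scenarios can be sketched as a single scan over the brokers a detector knows along one branch of the primary tree. This is a minimal sketch under stated assumptions: the function name, the list-based branch, and the tuple encoding of a pid are hypothetical, the figures are read as using ∆ = 3, and only the most recent pid is returned rather than the cumulative pid1, pid2, pid3 shown on the slides.

```python
def active_connection(detector, branch, failed, delta):
    """Pick the detector's active connection along one primary-tree branch.

    branch: brokers ordered by increasing distance from `detector`.
    Returns (target, pid). target is the closest reachable broker the
    detector knows about, or None if all known brokers have failed.
    pid is (detector, brokers on the partition), or None if nothing failed.
    """
    on_partition = []
    for broker in branch[:delta]:  # only the delta-neighborhood is known
        if broker not in failed:
            pid = (detector, tuple(on_partition)) if on_partition else None
            return broker, pid
        on_partition.append(broker)
    return None, (detector, tuple(on_partition))

branch_of_c = ["D", "E", "F"]  # C's view of the chain with delta = 3
print(active_connection("C", branch_of_c, {"D"}, 3))
print(active_connection("C", branch_of_c, {"D", "E"}, 3))
print(active_connection("C", branch_of_c, {"D", "E", "F"}, 3))
```

With ∆ adjacent failures the scan runs out of known brokers, which corresponds to the worst case in which no new active connection can be opened.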

Partition Islands and Barriers
Brokers are connected to their closest reachable neighbors and are aware of nearby partition identifiers. How does this affect end-to-end connectivity?
For any pair of brokers, a partition on the primary path between them is:
An “island” if the end-to-end brokers are reachable through a sequence of active connections.
A “barrier” if the end-to-end brokers are unreachable through any sequence of active connections.
[Figure: two chains of brokers A to F with S and P between a source and a destination, one illustrating an island and one a barrier.]
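
One way to make the island/barrier distinction concrete is to label every run of adjacent failed brokers on a primary path. The threshold used here is an assumption inferred from the earlier slides (fewer than ∆ adjacent failures can be bypassed, ∆ or more cannot); the function name and data layout are hypothetical.

```python
from itertools import groupby

def classify_partitions(path, failed, delta):
    """Label each run of adjacent failed brokers on a primary path."""
    labels = []
    for is_failed, run in groupby(path, key=lambda b: b in failed):
        if is_failed:
            brokers = list(run)
            # Assumption: a run shorter than delta can be bypassed.
            kind = "island" if len(brokers) < delta else "barrier"
            labels.append((brokers, kind))
    return labels

path = list("ABCDEF")
print(classify_partitions(path, {"C", "D"}, 3))       # one island
print(classify_partitions(path, {"B", "C", "D"}, 3))  # one barrier
```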

Subscription Propagation Algorithm
How are correct routing tables maintained despite overlay partitions?

Subscription Propagation Algorithm
Establishes end-to-end routing state among brokers while taking overlay partitions into account.
Subscriptions are dynamically inserted by subscribers and are propagated along the branches of the primary tree over active connections.
The primary tree is the “basis” for constructing end-to-end forwarding paths.
Each subscription has the form SUB = <Id, Predicates, Anchor>:
Predicates specifies the subscriber’s interest, e.g., [STOCK=“IBM”].
Anchor is a reference to brokers along the propagation path of the subscription.

Subscription Propagation in Absence of Overlay Partitions
As a subscription propagates, its anchor field is updated to point to a broker up to ∆ hops closer to the subscriber.
Accepting a subscription means adding it to the routing tables.
A subscription is accepted (i.e., used for matching) only after confirmations are received.
Observation: matching publications are delivered to a subscriber once its local broker accepts the subscription.
[Figure: chain of brokers A to E with P and S. Subscription s propagates hop by hop and confirmations (conf) flow back; the figure marks s.anchor and ∆-hop spans for both subscriptions and confirmations.]
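
The anchor update can be sketched as follows. This is one possible reading of the slide, not the authors' implementation: it assumes the copy of a subscription held by the broker i hops from the subscriber's local broker carries an anchor min(i, ∆) hops back along the propagation path. The `Sub` class, the `propagate` function, and the example path are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Sub:
    sub_id: str
    predicates: str
    anchor: str  # broker up to delta hops closer to the subscriber

def propagate(sub_id, predicates, path, delta):
    """Return the copy of the subscription seen at each broker on `path`.

    path[0] is the subscriber's local broker; later entries are further
    from the subscriber along the primary tree.
    """
    copies = {}
    for i, broker in enumerate(path):
        anchor = path[max(0, i - delta)]  # assumed anchor rule
        copies[broker] = Sub(sub_id, predicates, anchor)
    return copies

copies = propagate("s1", '[STOCK="IBM"]', ["E", "D", "C", "B", "A"], 2)
for broker, sub in copies.items():
    print(broker, "->", sub.anchor)
```

Under this reading every broker holds a reference that stays inside its own ∆-neighborhood, which is what lets it pick an active connection towards the anchor later on.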

Subscription Propagation in Presence of Overlay Partitions
Broker B sends s via its active link to bypass the partition and awaits the corresponding confirmation.
Once B receives the confirmation and accepts s, it tags its confirmation with the pid of the partitions that s bypassed.
Brokers relay this tag in their confirmation messages towards the subscriber’s local broker, which accepts s and stores it along with the tag in its routing table.
[Figure: chain of brokers A to E with P and S. Subscription s bypasses the partition over B's active link; conf* denotes a confirmation tagged with a pid, and the pid tag is also stored along with s.]

Publication Forwarding Algorithm
How are accepted subscriptions and their partition tags used to achieve reliable delivery?

Publication Forwarding in Absence of Overlay Partitions
Forwarding uses only subscriptions that brokers have accepted.
Steps in forwarding a publication p:
1. Identify the anchors of accepted subscriptions that match p.
2. Determine the active connections towards the matching subscriptions’ anchors.
3. Send p on those active connections and wait for confirmations.
4. If there are local matching subscribers, deliver p to them.
5. If no downstream matching subscriber exists, issue a confirmation towards P.
6. Once confirmations arrive, discard p and send a conf towards P.
[Figure: chain of brokers A to E with P and S. Publication p is forwarded towards the subscriber and delivered to local subscribers; confirmations (conf) flow back towards P.]
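
The forwarding steps can be sketched as a synchronous recursion in which returning from a call stands in for receiving a confirmation. This is a simplified illustration, not the system's code: the `accepted` and `next_hop` tables, the `is_ibm` predicate, and the convention that an anchor equal to the current broker means a local subscriber are all assumptions made for the example.

```python
def forward(broker, pub, accepted, next_hop, log):
    """Forward `pub` from `broker`; returns once this broker has confirmed."""
    # Step 1: anchors of accepted subscriptions that match the publication.
    anchors = {anchor for matches, anchor in accepted.get(broker, [])
               if matches(pub)}
    # Step 2: active connections towards those anchors.
    hops = {next_hop[broker][a] for a in anchors if a != broker}
    # Step 3: send on each connection; the recursive call returns after
    # the downstream broker has confirmed, which stands in for waiting.
    for hop in sorted(hops):
        forward(hop, pub, accepted, next_hop, log)
    # Step 4: deliver to local matching subscribers.
    if broker in anchors:
        log.append(("deliver", broker))
    # Steps 5 and 6: nothing is pending downstream, so discard and confirm.
    log.append(("conf", broker))

def is_ibm(pub):
    return pub.get("STOCK") == "IBM"

# Hypothetical chain A-B-C: publisher at A, subscriber at C.
accepted = {b: [(is_ibm, "C")] for b in "ABC"}
next_hop = {"A": {"C": "B"}, "B": {"C": "C"}}
log = []
forward("A", {"STOCK": "IBM"}, accepted, next_hop, log)
print(log)
```

A real broker would send asynchronously and keep p queued until every confirmation arrives; the recursion only mirrors the order of events on a failure-free chain.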

Publication Forwarding in Presence of Overlay Partitions
Key forwarding invariant to ensure reliability: no stream of publications is delivered to a subscriber after being forwarded by brokers that have not accepted its subscription.
Case 1: Subscription s has been accepted with no pid. → It is safe to bypass intermediate brokers.
[Figure: chain of brokers A to E with P and S. Publication p bypasses intermediate brokers and is delivered to local subscribers; confirmations (conf) flow back.]

Publication Forwarding (cont’d)
Case 2: Subscription s has been accepted with some pid.
Case 2a: The publisher’s local broker has accepted s, and we ensure that all intermediate forwarding brokers have also done so. → It is safe to deliver publications from sources beyond the partition.
Depending on when the link shown in the figure was established, either recovery or subscription propagation ensures that C accepts s prior to receiving p.
[Figure: chain of brokers A to E with P and S. Publication p is forwarded across the partition and confirmations (conf) flow back.]

Publication Forwarding (cont’d)
Case 2: Subscription s has been accepted with some pid tags.
Case 2b: The publisher’s broker has not accepted s. → It is unsafe to deliver publications from this publisher (invariant).
[Figure: chain of brokers A to E with P and S. Publication p is tagged with a pid (p*) while being forwarded; s was accepted at S with the same pid tag.]
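
The delivery decision across the three cases can be condensed into a small check at the subscriber's local broker. This sketch assumes that a publication forwarded by a broker that has not accepted s carries the pid of the partition involved, and that delivery is withheld exactly when that tag matches a pid stored with s; the function name and the tuple encoding of a pid are hypothetical.

```python
def safe_to_deliver(sub_pids, pub_pids):
    """Decide whether a matching publication may be delivered.

    sub_pids: partition ids stored with the accepted subscription.
    pub_pids: partition ids the publication was tagged with on its way.
    """
    # A shared pid means the publication came through brokers that had
    # not accepted the subscription, so delivering would break the invariant.
    return not (set(sub_pids) & set(pub_pids))

pid1 = ("C", ("D",))
print(safe_to_deliver([], []))          # Case 1: s accepted with no pid
print(safe_to_deliver([pid1], []))      # Case 2a: publication is untagged
print(safe_to_deliver([pid1], [pid1]))  # Case 2b: same pid tag
```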

Evaluation
Using a mix of simulation and experimental deployments on a large-scale testbed.

Simulation Results: Size of ∆-neighborhoods
Size of brokers’ neighborhoods as a function of ∆, for a network of 1000 brokers with a broker fanout of 3.
[Figure: size of ∆-neighborhoods, with curves for ∆ = 1, 2, 3, and 4.]

Impact of Failures on End-to-End Broker Reachability
Using a graph simulation tool.
Overlay setup: a network of 1000 brokers with fanout 3.
Failure injection: up to 100 failed brokers; we randomly marked a given number of nodes as failed.
Measurements: we counted the number of end-to-end broker pairs whose intermediate primary tree path contains ∆ consecutive failed brokers in a chain.
[Figure: results with curves for ∆ = 1, 2, 3, and 4.]
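
The measurement can be sketched on a small tree as below. This is not the authors' simulation tool: the helper names are hypothetical, and the sketch assumes that only pairs of non-failed brokers are counted and that only the brokers strictly between them are inspected.

```python
from collections import deque
from itertools import combinations

def tree_path(tree, src, dst):
    """Brokers on the unique tree path from src to dst, inclusive."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            break
        for nxt in tree[node]:
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    path, node = [], dst
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]

def longest_failed_run(brokers, failed):
    """Length of the longest run of adjacent failed brokers."""
    best = run = 0
    for b in brokers:
        run = run + 1 if b in failed else 0
        best = max(best, run)
    return best

def count_blocked_pairs(tree, failed, delta):
    """Count non-failed pairs whose path has >= delta adjacent failures."""
    alive = [b for b in tree if b not in failed]
    return sum(
        longest_failed_run(tree_path(tree, a, b)[1:-1], failed) >= delta
        for a, b in combinations(alive, 2))

chain = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"],
         "D": ["C", "E"], "E": ["D", "F"], "F": ["E"]}
print(count_blocked_pairs(chain, {"C", "D"}, 2))  # 4
print(count_blocked_pairs(chain, {"C", "D"}, 3))  # 0
```

For a larger experiment the failed set could be drawn with `random.sample` over the broker names, repeating the count for each value of ∆.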

Experimental Deployments: Impact of Failures on Pub Delivery
500 brokers deployed on 8-core machines in a cluster.
Network setup: overlay fanout of 3.
We measured the aggregate publication delivery count in an interval of 120 s.
The “Expected” bar is the number of publications that must be delivered despite failures (this excludes traffic to/from failed brokers).
[Figure: publication delivery counts for ∆ = 1, 2, 3, and 4 alongside the Expected bar.]

Conclusions
We developed a reliable P/S system that tolerates concurrent broker and link failures:
The configuration parameter ∆ determines the level of resiliency against failures (in the worst case).
Dissemination trees are augmented with neighborhood knowledge.
Neighborhood knowledge allows brokers to maintain network connectivity and make forwarding decisions despite failures.
We studied the performance of the system when the number of failures far exceeds ∆:
A small value of ∆ ensures good connectivity.

Questions…
Thanks for your attention!

Challenges
Why does the “end-to-end” principle not work? Responsibility falls on the P/S messaging system:
Publishers and subscribers are decoupled and unaware of each other.
Routing paths are established by dynamically inserted subscriptions, and subscription propagation is also subject to broker/link failures.
Selective delivery makes in-order delivery over redundant paths difficult: subscribers are only interested in a subset of what is published.
Notes: Subscription propagation algorithm. We use a special form of tree dissemination. From now on we only discuss brokers.

Store-and-Forward
A copy is first preserved on disk.
Intermediate hops send an ACK to the previous hop after preserving the copy.
ACKed copies can be dismissed from disk.
Upon failures, unacknowledged copies survive the failure and are retransmitted after recovery.
This ensures reliable delivery but may cause delays while the machine is down.
[Figure: publication P relayed hop by hop “from here” “to here”, with acks flowing back.]
SRDS'09
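
The store-and-forward hand-off can be sketched with a dictionary standing in for each hop's disk. The `Hop` class and `send` function are hypothetical names for this illustration; real brokers would write to stable storage and exchange ACKs over the network.

```python
class Hop:
    """A broker on the path; `disk` stands in for stable storage."""

    def __init__(self, name):
        self.name = name
        self.disk = {}
        self.up = True

    def receive(self, msg_id, payload):
        if not self.up:
            return False          # no ACK, so the sender keeps its copy
        self.disk[msg_id] = payload
        return True               # ACK only after the copy is preserved

def send(sender, receiver, msg_id):
    """Try one hand-off; dismiss the sender's copy only when ACKed."""
    if receiver.receive(msg_id, sender.disk[msg_id]):
        del sender.disk[msg_id]
        return True
    return False

a, b = Hop("A"), Hop("B")
a.disk["p1"] = {"STOCK": "IBM"}
b.up = False
print(send(a, b, "p1"), list(a.disk))  # False ['p1']
b.up = True                            # recovery, then retransmission
print(send(a, b, "p1"), list(a.disk), list(b.disk))  # True [] ['p1']
```

The retransmission after recovery is where the delay mentioned on the slide comes from: the message is safe on A's disk but makes no progress while B is down.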

Mesh-Based Overlay Networks [Snoeren et al., SOSP 2001]
Use a mesh network to concurrently forward messages on disjoint paths.
Upon failures, the message is delivered using alternative routes.
Pros: minimal impact on delivery delay.
Cons: imposes additional traffic and the possibility of duplicate delivery.
[Figure: publication P forwarded “from here” “to here” over disjoint paths.]

Replica-Based Approach [Bhola et al., DSN 2002]
Replicas are grouped into virtual nodes.
Replicas have identical routing information.
We compare against this approach.
[Figure: physical machines grouped into virtual nodes along the path of a publication P.]

Publication Forwarding (cont’d)
Case 2: Subscription s has been accepted with some pid.
Case 2b (partition barrier): The publisher’s broker has also not accepted s.
[Figure: chain of brokers A to E with P, S, and a second subscriber R holding subscription r. Publication p1 is tagged with a pid (p1*) while being forwarded; s was accepted at S with the same pid tag. Figure annotation: “p2 & p1 matches r & s matches s”.]

Subscription Propagation with Partitions
Partition islands:
Simply confirm (and accept) subscriptions over available
If partition brokers are reachable from the other side of the partition
Intuition: publications from P may only be lost if they arrive at B. But this will not happen, since there is no link towards B from F.
The correctness proof argues on the precedence of acceptance and creation of links.
[Figure: chain of brokers A to E with P and S and lead broker A. Subscriptions and confirmations bypass the partition; a label notes “Will accept during recovery”.]

Subscription Propagation with Partition Barriers
If a portion of the network that includes publishers is on or beyond a partition barrier, there is no way to communicate the subscription information for the duration of the failures.
The lead broker “partially confirms” the subscription and tags the confirmation with the partition information.
Accepting brokers store the partition information along with the subscription.
This ensures liveness.
[Figure: chain of brokers A to G with P and S. Lead broker B issues a partial confirmation (conf*) within ∆ hops.]

Publication Forwarding
Only accepted subscriptions are stored in the SRT and used for matching.
At each point in time, a broker has a number of connections to its nearest reachable neighbors. This set of active connections may change over time.
Publication forwarding steps:
1. Store the publication in a FIFO internal message queue.
2. Match, and compute the set of {from} for subscriptions that match.
3. For each partially confirmed subscription, tag the publication with the partition information.
4. Send the publication to the closest reachable neighbors towards {from}.
5. Once all confirmations arrive, discard the publication and issue a confirmation towards the publisher.
[Figure: broker A with its message queue and (δ+1)-neighborhood, between P and S.]

Evaluations
Size of brokers’ neighborhoods as a function of ∆.
[Figure: two plots of the size of ∆-neighborhoods for ∆ = 1, 2, 3, and 4: one for a network of 1000 brokers with fanout 3, and one for a network of 1000 brokers with fanout 7.]

Overlay Links Management
Sessions: FIFO communication links between brokers.
Active sessions: broker A’s session to B is active if A has no session to another broker C on the path between A and B.
[Figure: primary tree with ∆ = 2.]
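
The definition of an active session translates directly into a filter over a broker's sessions. In this sketch the function name, the `path_between` callback, and the example chain are assumptions made for illustration.

```python
def active_sessions(broker, sessions, path_between):
    """Return the subset of `sessions` that are active for `broker`.

    sessions: brokers that `broker` currently has a session to.
    path_between(a, b): primary-tree path from a to b, inclusive.
    """
    active = set()
    for target in sessions:
        interior = path_between(broker, target)[1:-1]
        # Active only if no other session ends on the path to the target.
        if not any(c in sessions for c in interior):
            active.add(target)
    return active

CHAIN = ["A", "B", "C", "D", "E"]  # hypothetical primary tree (a chain)

def chain_path(a, b):
    i, j = CHAIN.index(a), CHAIN.index(b)
    return CHAIN[i:j + 1] if i <= j else CHAIN[j:i + 1][::-1]

print(active_sessions("A", {"B", "C"}, chain_path))  # {'B'}
print(active_sessions("A", {"C"}, chain_path))       # {'C'}
```

In the first call A's session to C is inactive because B lies on the path and A also has a session to B; once the session to B is gone, the session to C becomes the active one.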

Agenda
Challenges of reliability and fault-tolerance in P/S
Our approach
Topology neighborhood knowledge
Subscription propagation
Publication forwarding
Recovery procedure
Evaluation results
Conclusions