Scribe: A Large-Scale and Decentralized Application-Level Multicast Infrastructure. Miguel Castro, Peter Druschel, Anne-Marie Kermarrec, and Antony I. T. Rowstron.

Presentation transcript:

Scribe: A Large-Scale and Decentralized Application-Level Multicast Infrastructure. Miguel Castro, Peter Druschel, Anne-Marie Kermarrec, and Antony I. T. Rowstron. Presented by Yu Feng and Elizabeth Lynch.

Introduction: application-level multicast. Goals: scalability, failure tolerance, low delay, and effective use of network resources.

Pastry: a P2P location and routing substrate. It provides scalability (large numbers of groups, large numbers of multicast sources, and large numbers of members per group), self-organization, peer-to-peer location and routing, and good locality properties.

Scribe: an application-level multicast infrastructure built on top of Pastry. It takes advantage of Pastry's properties: robustness, self-organization, locality, and reliability.

nodeId: Each node is assigned a 128-bit nodeId; nodeIds are uniformly distributed. Each node maintains tables that map nodeIds to IP addresses, with (2^b - 1) * ceil(log_{2^b} N) + l entries; O(log_{2^b} N) messages are required to update the tables after a node joins or fails.
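As a rough sketch (not the paper's code), the per-node routing state implied by this formula can be computed directly; the defaults b = 4 and l = 16 are the parameter values used in the paper:

```python
import math

def pastry_state_size(n_nodes: int, b: int = 4, leaf_set: int = 16) -> int:
    """Approximate number of routing entries a Pastry node keeps:
    (2^b - 1) entries in each of ceil(log_{2^b} N) routing-table rows,
    plus the l-entry leaf set."""
    rows = math.ceil(math.log(n_nodes, 2 ** b))
    return (2 ** b - 1) * rows + leaf_set

# A million-node network needs only about 91 entries per node.
print(pastry_state_size(10**6))
```

The logarithmic growth is the point: multiplying the network size by 16 adds just one more 15-entry row.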

Routing Guarantees: A message with a key will be routed to the live node whose nodeId is numerically closest to the key. In a network of N nodes, the average number of steps in a route to any node is less than log_{2^b} N. Delivery is guaranteed unless l/2 or more nodes with adjacent nodeIds fail simultaneously.

Routing Tables: nodeIds and keys are treated as sequences of digits in base 2^b. Each node's routing table has ceil(log_{2^b} N) rows and 2^b - 1 entries per row. Each entry in row n refers to a node whose nodeId matches the present node's nodeId in the first n digits but whose (n+1)-th digit has one of the 2^b - 1 other possible values. Among the candidates for an entry, the node closest to the present node according to a distance metric is chosen.

Leaf Sets: the l/2 numerically closest larger and l/2 closest smaller nodeIds relative to the present nodeId. Each node maintains the IP addresses of its leaf set.

Routing algorithm: The current node forwards to a node whose nodeId shares a prefix with the key at least one digit (b bits) longer than the current node's shared prefix. If no such node is known, it forwards to a node whose shared prefix has the same length but whose nodeId is numerically closer to the key.
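A minimal sketch of this forwarding rule (hex-digit IDs; the function names and the flat candidate list are illustrative; real Pastry consults the routing table and leaf set instead):

```python
def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading hex digits two IDs have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def next_hop(local_id: str, key: str, known: list[str]) -> str:
    """One routing step: prefer a known node sharing a strictly longer
    prefix with the key; otherwise fall back to a node (possibly the
    local one) with the same prefix length but numerically closer."""
    p = shared_prefix_len(local_id, key)
    longer = [n for n in known if shared_prefix_len(n, key) > p]
    candidates = longer or [n for n in known + [local_id]
                            if shared_prefix_len(n, key) >= p]
    return min(candidates, key=lambda n: abs(int(n, 16) - int(key, 16)))
```

For example, a node 65a1 routing key d467 prefers a known node d13d (one shared digit) over any zero-prefix node; when no better hop exists, the message stays put, which is how it terminates at the numerically closest node.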

Locality: Pastry routing is guided by a proximity metric. Two locality properties are relevant to Scribe. Short routes: according to simulations, the total route distance is 1.59 to 2.2 times the direct distance between source and destination. Route convergence: according to simulations, the average distance traveled by two messages sent to the same key is approximately equal to the distance between the two source nodes.

Node Addition: The new node X picks a nodeId and contacts a nearby node A. A routes a special message with X as the key, which is routed to the node Z whose nodeId is numerically closest to X (if X == Z, X must choose a new nodeId). X obtains its leaf set from Z, obtains the i-th row of its routing table from the i-th node traversed on the route from A to Z, and finally notifies the appropriate nodes that it is now alive.

Node Failure: Neighboring nodes in nodeId space periodically exchange keep-alive messages. If a node is silent for a period of time T, it is presumed failed; all members of the failed node's leaf set are notified, remove the failed node from their leaf sets, and repair them.

Node Recovery: A recovering node contacts all the nodes in its last known leaf set, obtains their leaf sets, updates its own leaf set, and notifies the members of the new leaf set.

Pastry API: nodeId = pastryInit(Credentials) causes the local node to join an existing Pastry network or start a new one. route(msg, key) routes msg to the node with nodeId numerically closest to key. send(msg, IP-addr) sends msg directly to the node at IP-addr.

Required Pastry Functions: deliver(msg, key) is called when a msg is received and the local node's nodeId is closest to key among all live nodes, or when a msg arrives that was transmitted via send() to the local node's IP address. forward(msg, key, nextId) is called just before msg is forwarded to the node with nodeId = nextId; the application can change the msg content or the nextId value, and if nextId = NULL the msg terminates at the local node. newLeafs(leafSet) is called whenever there is a change in the leaf set.

Scribe Overview: a multicast application framework built on top of Pastry. Any Scribe node may create a group; other nodes can then join the group and multicast to all members of that group. Scribe provides best-effort delivery and does not guarantee ordered delivery.

How? A group is formed by building a multicast tree: the Pastry routes from each group member to a rendezvous point (the root of the tree) are joined together. Multicast messages are sent to the rendezvous point for distribution. Pastry and Scribe are fully decentralized; all decisions are based on local information, which provides reliability and scalability.

Multicast Tree: Scribe creates a multicast tree rooted at the rendezvous point. Scribe nodes that are part of a multicast tree are called forwarders; they may or may not be members of the group. Each forwarder maintains a children table, with an entry (IP address and nodeId) for each of its children in the multicast tree.

Scribe API: create(credentials, groupId) creates a new group, using the credentials to control future access. join(credentials, groupId, messageHandler) joins the group with the specified groupId. leave(credentials, groupId) leaves the group with the specified groupId. multicast(credentials, groupId, message) multicasts the specified message to the group with the specified groupId.

Scribe Implementation, Creating a Group: 1. A Scribe node asks Pastry to route a CREATE message using the groupId as the key, e.g., route(CREATE, groupId). 2. Pastry delivers the CREATE message to the node whose nodeId is numerically closest to the groupId. 3. Scribe's deliver method is invoked; it checks the credentials to ensure the group may be created and adds the new groupId to the list of groups it knows. 4. This node becomes the rendezvous point for the newly created group.

Scribe Implementation, Joining a Group: 1. The joining node asks Pastry to route a JOIN message with the groupId as the key, e.g., route(JOIN, groupId); the message is routed towards the rendezvous point. 2. At each node along the route, Pastry invokes Scribe's forward method, which: a. checks whether the node is a forwarder for the group; b. if it is a current forwarder, adds the joining node as a child; c. if it is not a current forwarder, creates a children table for the group, adds the joining node as a child, and then routes its own JOIN message with groupId as the key, e.g., route(JOIN, groupId); d. finally, terminates the JOIN message it received from the source.
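The forwarder logic in step 2 can be sketched as a toy in-memory model (the class and method names are illustrative, not the paper's API):

```python
class ScribeNode:
    """Minimal model of a Scribe forwarder's per-group children table."""
    def __init__(self, node_id: str):
        self.node_id = node_id
        self.children = {}  # groupId -> set of child nodeIds

    def on_join(self, group_id: str, child_id: str) -> bool:
        """Handle a JOIN passing through this node: add the sender as a
        child. Returns True if this node just became a forwarder and must
        send its own JOIN toward the rendezvous point; False if it was
        already a forwarder, so the JOIN terminates here."""
        was_forwarder = group_id in self.children
        self.children.setdefault(group_id, set()).add(child_id)
        return not was_forwarder
```

This is why joins are cheap: a JOIN stops at the first node already in the tree, so most joins never reach the rendezvous point.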

Scribe Implementation, Leaving a Group: 1. The node records locally that it has left the group. 2. If there are no more children in its children table, it sends a LEAVE message to its parent node. 3. The parent removes the leaving node from its children table and repeats step 2; this continues up the tree until a node with a non-empty children table is reached.

Multicast a Message: The source locates the rendezvous point for the group, e.g., route(MULTICAST, groupId), and asks it to return its IP address. The source caches the IP address and uses it for future multicasts; if the rendezvous point changes or fails, the source uses Pastry again to find the new rendezvous point. All multicast messages are disseminated from the rendezvous point.
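The caching behavior can be sketched as follows (hypothetical names; `locate` stands in for the route(MULTICAST, groupId) lookup):

```python
class MulticastSource:
    """Caches the rendezvous point's IP so that only the first multicast
    to a group pays for a full Pastry route."""
    def __init__(self, locate):
        self._locate = locate   # callable: groupId -> rendezvous IP
        self._cache = {}

    def rendezvous(self, group_id: str, refresh: bool = False) -> str:
        # Re-run the Pastry lookup only on a miss, or when the caller
        # detected that the cached rendezvous point failed or changed.
        if refresh or group_id not in self._cache:
            self._cache[group_id] = self._locate(group_id)
        return self._cache[group_id]
```

After the first lookup, multicasts go over a single direct hop to the rendezvous point rather than a multi-hop Pastry route.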

Scribe Implementation

Reliability of Scribe, Repairing the Tree: Each non-leaf node periodically sends a heartbeat message to all of its children. When a node does not receive a heartbeat from its parent within a certain period of time, it sends a JOIN message with the group's identifier; Pastry routes the message to a new parent, thus repairing the multicast tree.
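The child's side of this failure detector can be sketched as below (the timeout value is an assumption for illustration, not from the paper):

```python
class TreeChild:
    """A child's view of tree repair: suspect the parent when no heartbeat
    arrives within `timeout` seconds, then re-join via Pastry."""
    def __init__(self, timeout: float = 30.0):
        self.timeout = timeout
        self.last_heartbeat = 0.0

    def on_heartbeat(self, now: float) -> None:
        self.last_heartbeat = now

    def should_rejoin(self, now: float) -> bool:
        # True means: send JOIN with the group's identifier so Pastry
        # routes it to a live replacement parent.
        return now - self.last_heartbeat > self.timeout
```

Repair is purely local: only the suspecting child and the nodes on its new JOIN route are involved.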

Reliability of Scribe, Failure of the Rendezvous Point: The state of the rendezvous point is replicated across the k nodes closest to the root (a typical value of k is 5); these k nodes are all children of the root. When the root fails, its immediate children detect the failure and re-join through Pastry, which routes their JOIN messages to a new root (the live node whose nodeId is numerically closest to the groupId). That node takes over the role of the rendezvous point.

Reliability of Scribe: Children table entries are discarded unless the child node sends an explicit message stating that it wants to remain in the table. The tree repair mechanism scales well: fault detection sends messages to only a small number of nodes, and recovery from faults is local, involving only a small number of nodes (O(log_{2^b} N)).

Scribe - Providing Additional Guarantees: Scribe provides reliable, ordered delivery of multicast messages only if the TCP connections along the tree do not fail. It offers a simple mechanism that allows applications to implement stronger reliability guarantees: forwardHandler(msg), invoked by Scribe before the node forwards a multicast message to its children; joinHandler(msg), invoked by Scribe after a new child is added to one of the node's children tables; faultHandler(msg), invoked by Scribe when a node suspects that its parent is faulty.

Additional Reliability Example: In the forwardHandler, the root assigns a sequence number to each message, and multicast messages are buffered by the root and by each node in the multicast tree; messages are retransmitted after the tree is repaired. The faultHandler adds the last sequence number n delivered by the node to the JOIN message that is sent out to repair the tree. The joinHandler then retransmits the buffered messages with sequence numbers above n to the new child.
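These handlers compose into a simple buffered-retransmission scheme; the sketch below illustrates that design (illustrative class and method names, not the paper's code):

```python
class ReliableRoot:
    """Root-side sketch: stamp each multicast with a sequence number,
    buffer it, and replay whatever a re-joining child has not delivered."""
    def __init__(self):
        self.seq = 0
        self.buffer = []    # list of (seq, msg) pairs

    def forward_handler(self, msg):
        """Assign the next sequence number and buffer the message."""
        self.seq += 1
        self.buffer.append((self.seq, msg))
        return (self.seq, msg)

    def join_handler(self, last_delivered: int):
        """Messages to retransmit to a child whose repair JOIN carried
        `last_delivered` as its highest delivered sequence number."""
        return [m for m in self.buffer if m[0] > last_delivered]
```

In a full implementation the buffer would also be trimmed once all children acknowledge delivery; that bookkeeping is omitted here.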

Experimental Setup: A randomly generated network topology with 5050 routers; Scribe was run on 100,000 end nodes randomly assigned to routers with uniform distribution. Ten different topologies were generated using different random seeds, and results are averaged over all ten. The experiments cover a wide range of group sizes and a large number of groups: the size of the group with rank r is gsize(r) = floor(N * r^(-1.25) + 0.5), and group membership is selected randomly with uniform distribution.
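The Zipf-like group-size distribution is easy to reproduce from the formula above:

```python
import math

def gsize(rank: int, n: int = 100_000) -> int:
    """Members in the group of popularity rank r: floor(N * r^-1.25 + 0.5),
    with N = 100,000 end nodes as in the experiments."""
    return math.floor(n * rank ** -1.25 + 0.5)

# Rank 1 spans the whole network; the least popular of the 1500 groups
# used in the delay/stress experiments has only 11 members.
print(gsize(1), gsize(1500))
```

The heavy tail means most groups are small while a few are very large, which is what makes the per-node state results below meaningful.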

Delay Penalty: Compares delay between Scribe multicast and IP multicast by measuring the distribution of delay to deliver a message to each member of a group, using two metrics. RMD (ratio of the maximum delay using Scribe to the maximum delay using IP multicast): 50% of groups have RMD below 1.69; the maximum is 4.26. RAD (ratio of the average delays): 50% of groups have RAD below 1.68; the maximum is 2.

Node Stress: the stress imposed on end nodes, rather than routers, by maintaining group state and by forwarding and duplicating packets. Measured as the number of groups with non-empty children tables and the number of children table entries per node. In the simulation with 1500 groups: non-empty children tables per node averaged 2.4 (max 40); children table entries per node averaged 6.2 (max 1059).

Link Stress Experiment: Link stress was computed by counting the number of packets sent over each link when a message is sent to each of the 1500 groups. The total number of links is 1,035,295; Scribe sends 2,489,824 messages in total versus 758,853 for IP multicast. Mean number of messages per link: 2.4 for Scribe, 0.7 for IP multicast. Maximum link stress: 4031 for Scribe, 950 for IP multicast.

Bottleneck Remover: When a node detects that it is overloaded, it selects the group that consumes the most resources, then chooses the child in this group that is farthest away. The parent drops that child by sending it a message containing the children table for the group along with the delay between each child and the parent. When the child receives the message, it does the following: 1. It measures the delay between itself and each other child in the children table it received. 2. It computes the total delay between itself and the parent via each of those nodes. 3. Finally, it sends a JOIN message to the node that provides the smallest combined delay.
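The new-parent choice in steps 2-3 can be sketched as a one-liner (hypothetical function name; delays are passed as plain dicts keyed by sibling ID):

```python
def pick_new_parent(delay_sibling_to_parent: dict, delay_me_to_sibling: dict) -> str:
    """The dropped child picks the former sibling that minimizes its total
    path delay back to the old parent:
    delay(me -> sibling) + delay(sibling -> parent)."""
    return min(delay_me_to_sibling,
               key=lambda s: delay_me_to_sibling[s] + delay_sibling_to_parent[s])
```

For example, a sibling that is close to the child but far from the parent can lose to one that balances both legs of the path.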

Bottleneck Remover Results: The mechanism introduces the potential for routing loops; when a loop is detected, the node sends another JOIN message to generate a new random route. The bottleneck remover bounds the number of entries in a node's children tables at the cost of increased link stress during joins: average link stress rises from 2.4 to 2.7, and the maximum rises from 4031 to 4728.

Scalability with Many Small Groups: 50,000 Scribe nodes and 30,000 Scribe groups with 11 nodes per group. The average number of children table entries per node is 21.2, compared with a plain (naive) multicast average of only 6.6. Average link stress: 6.1 for Scribe, 1.6 for IP multicast, 2.9 for naive multicast. Scribe's entry count is higher because it creates trees with long paths and no branching.

Conclusion: Scribe is a fully decentralized, large-scale application-level multicast infrastructure built on top of Pastry. It is designed to scale to large numbers of groups and large group sizes, and it supports multiple multicast sources per group. Scribe and Pastry's randomized placement of nodes, groups, and multicast roots balances both the load and the multicast trees. Scribe uses a best-effort delivery scheme but can be extended to meet stricter multicast requirements. Experimental results show that Scribe can efficiently support large numbers of nodes and groups, and a wide range of group sizes, at reasonable cost compared to IP multicast.