
1 Scribe: A Large-Scale and Decentralized Application-Level Multicast Infrastructure. Miguel Castro, Peter Druschel, Anne-Marie Kermarrec, and Antony I. T. Rowstron. Presented by Yu Feng and Elizabeth Lynch.

2 Introduction: Application-level multicast. Goals: scalability, failure tolerance, low delay, effective use of network resources.

3 Pastry: P2P location and routing substrate. Provides: scalability (large numbers of groups, large numbers of multicast sources, large numbers of members per group); self-organization; peer-to-peer location and routing; good locality properties.

4 Scribe: Application-level multicast infrastructure built on top of Pastry. Takes advantage of Pastry properties: robustness, self-organization, locality, reliability.

5 nodeId: Each node is assigned a 128-bit nodeId; nodeIds are uniformly distributed. Each node maintains tables that map nodeIds to IP addresses, with (2^b - 1) * ceil(log_{2^b} N) + l entries in total. O(log_{2^b} N) messages are required to restore the tables after a node joins or fails.
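As a rough check on the table-size expression above, the following Python sketch evaluates it for a given network size. The function name `pastry_table_entries` and the default values b = 4 and l = 16 are assumptions chosen for illustration; the slides do not fix them.

```python
def pastry_table_entries(n_nodes: int, b: int = 4, leaf_set_size: int = 16) -> int:
    """Evaluate (2^b - 1) * ceil(log_{2^b} N) + l.

    b and leaf_set_size defaults are illustrative assumptions.
    """
    base = 2 ** b
    # Compute ceil(log_base(n_nodes)) with integers to avoid float rounding.
    rows = 0
    capacity = 1
    while capacity < n_nodes:
        capacity *= base
        rows += 1
    return (base - 1) * rows + leaf_set_size


if __name__ == "__main__":
    # 100,000 nodes with base-16 digits need 5 routing-table rows.
    print(pastry_table_entries(100_000))
```

The integer loop is used instead of `math.log` so that exact powers of 2^b are not pushed into the next row by floating-point error.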

6 Routing Guarantees: Given a message and a key, Pastry routes the message to the live node whose nodeId is numerically closest to the key. In a network of N nodes, the expected number of routing steps is less than ceil(log_{2^b} N). Delivery is guaranteed unless l/2 or more nodes with adjacent nodeIds fail simultaneously.

7 Routing Tables: nodeIds and keys are treated as sequences of digits in base 2^b. Each node's routing table has ceil(log_{2^b} N) rows with 2^b - 1 entries per row. Each entry in row n refers to a node whose nodeId matches the present node's nodeId in the first n digits but whose (n+1)th digit has one of the 2^b - 1 other possible values. When several nodes qualify for an entry, the one closest to the present node according to a proximity metric is chosen.

8 Leaf Sets: The leaf set holds the l/2 numerically closest larger nodeIds and the l/2 numerically closest smaller nodeIds relative to the present node's nodeId. Each node maintains IP addresses for the nodes in its leaf set.
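The leaf-set definition can be illustrated with a small Python sketch. It assumes integer nodeIds and a circular nodeId space, so the neighbors of the smallest id wrap around to the largest; the function name `leaf_set` and the tiny example ring are hypothetical.

```python
from typing import Iterable, List, Tuple


def leaf_set(node_id: int, all_ids: Iterable[int], l: int = 4) -> Tuple[List[int], List[int]]:
    """Return (smaller, larger): the l/2 nearest ids on each side of node_id.

    Assumes a circular id space; ids wrap around at the ends of the ring.
    """
    ring = sorted(set(all_ids) | {node_id})
    i = ring.index(node_id)
    # Never take more neighbors per side than the ring can supply.
    half = min(l // 2, (len(ring) - 1) // 2)
    smaller = [ring[(i - k) % len(ring)] for k in range(1, half + 1)]
    larger = [ring[(i + k) % len(ring)] for k in range(1, half + 1)]
    return smaller, larger


if __name__ == "__main__":
    print(leaf_set(30, [10, 20, 30, 40, 50, 60], l=4))
```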

9 Routing Algorithm: The current node forwards the message to a node whose nodeId shares a prefix with the key that is at least one digit (b bits) longer than the prefix the current node shares with the key. If no such node is available, it forwards to a node that shares the same prefix length but whose nodeId is numerically closer to the key.
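The two forwarding rules on this slide can be sketched in Python as follows. This is a simplified illustration, not Pastry's implementation: nodeIds are assumed to be equal-length hexadecimal strings (b = 4), the set of known nodes is passed in as a flat list rather than a routing table plus leaf set, and wraparound in the id space is ignored. The names `shared_prefix_len` and `next_hop` are hypothetical.

```python
from typing import List, Optional


def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading digits that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def next_hop(current: str, key: str, known: List[str]) -> Optional[str]:
    """Pick the next node for a message addressed to key, or None to stop here."""
    here = shared_prefix_len(current, key)

    # Rule 1: prefer a node sharing a strictly longer prefix with the key.
    longer = [n for n in known if shared_prefix_len(n, key) > here]
    if longer:
        return max(longer, key=lambda n: shared_prefix_len(n, key))

    # Rule 2: otherwise, same prefix length but numerically closer to the key.
    def dist(n: str) -> int:
        return abs(int(n, 16) - int(key, 16))

    closer = [n for n in known
              if shared_prefix_len(n, key) == here and dist(n) < dist(current)]
    if closer:
        return min(closer, key=dist)

    # No better candidate is known: the current node handles the message.
    return None


if __name__ == "__main__":
    print(next_hop("65a1fc", "d46a1c", ["d13da3", "65b000", "d4213f"]))
```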

10 Locality: Pastry uses a proximity metric. Locality properties relevant to Scribe: Short routes: according to simulations, the distance traveled is 1.59 to 2.2 times the direct distance between the source and destination. Route convergence: according to simulations, the average distance traveled by each of two messages sent to the same key, before their routes converge, is approximately equal to the distance between the two source nodes.

11 Node Addition: New node X picks a nodeId. X contacts a nearby node A. A routes a special message with X as the key. The message is routed to the node Z whose nodeId is numerically closest to X. If X's nodeId equals Z's, X must choose a new nodeId. X obtains its leaf set from Z. X obtains the ith row of its routing table from the ith node traversed from A to Z. X notifies the appropriate nodes that it is now alive.

12 Node Failure: Neighboring nodes in nodeId space periodically exchange keep-alive messages. If a node is silent for a period of time T, it is presumed failed. All members of the failed node's leaf set are notified; they remove the failed node from their leaf sets and update them.

13 Node Recovery: The recovering node contacts all the nodes in its last known leaf set, obtains their leaf sets, updates its own leaf set, and notifies the members of its new leaf set.

14 Pastry API: nodeId = pastryInit(Credentials): causes the local node to join an existing Pastry network or start a new one. route(msg, key): routes msg to the node with nodeId numerically closest to key. send(msg, IP-addr): sends msg to the node at IP-addr.

15 Required Pastry Functions (implemented by the application): deliver(msg, key): called when msg is received and the local node's nodeId is numerically closest to key among all live nodes, or when a msg is received that was transmitted via send() to the IP address of the local node. forward(msg, key, nextId): called just before msg is forwarded to the node with nodeId = nextId; the application can change the msg contents or the nextId value; if nextId is set to NULL, msg terminates at the local node. newLeafs(leafSet): called whenever there is a change in the leaf set.
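One way to picture the three callbacks is as an abstract interface that an application such as Scribe implements. The Python sketch below is an assumption about how that could look, not Pastry's actual code: the class names `PastryApplication` and `LoggingApp` are hypothetical, and `forward` returns the possibly modified (msg, nextId) pair because Python cannot update arguments in place.

```python
from abc import ABC, abstractmethod


class PastryApplication(ABC):
    """Callbacks an application exposes to the routing layer (illustrative)."""

    @abstractmethod
    def deliver(self, msg, key):
        """Called when the local node is the destination for msg."""

    @abstractmethod
    def forward(self, msg, key, next_id):
        """Called before msg is forwarded; returns (msg, next_id).

        Returning next_id=None terminates the message at the local node.
        """

    @abstractmethod
    def new_leafs(self, leaf_set):
        """Called whenever the local leaf set changes."""


class LoggingApp(PastryApplication):
    """Trivial application that records every callback it receives."""

    def __init__(self):
        self.log = []

    def deliver(self, msg, key):
        self.log.append(("deliver", msg, key))

    def forward(self, msg, key, next_id):
        self.log.append(("forward", msg, key, next_id))
        return msg, next_id  # pass the message through unchanged

    def new_leafs(self, leaf_set):
        self.log.append(("new_leafs", tuple(leaf_set)))


if __name__ == "__main__":
    app = LoggingApp()
    app.forward("hello", 42, 7)
    app.deliver("hello", 42)
    print(app.log)
```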

16 Scribe Overview: Multicast application framework built on top of Pastry. Any Scribe node may create a group. Other nodes can join the group and multicast to all members of that group. Delivery is best effort, and ordered delivery is not guaranteed.

17 How? A group is formed by building a multicast tree from the union of the Pastry routes from each group member to a rendezvous point (the root of the tree). Multicast messages are sent to the rendezvous point for distribution. Pastry and Scribe are fully decentralized: decisions are based on local information, which provides reliability and scalability.

18 Multicast Tree: Scribe creates a multicast tree rooted at the rendezvous point. Scribe nodes that are part of a multicast tree are called forwarders; they may or may not be members of the group. Each forwarder maintains a children table with an entry (IP address and nodeId) for each of its children in the multicast tree.

19 Scribe API: create(credentials, groupId): creates a new group, using the credentials to control future access. join(credentials, groupId, messageHandler): joins the group with the specified groupId. leave(credentials, groupId): leaves the group with the specified groupId. multicast(credentials, groupId, message): multicasts the specified message to the group with the specified groupId.

20 Scribe Implementation: Creating a Group. 1. A Scribe node asks Pastry to route a CREATE message using the groupId as the key [e.g., route(CREATE, groupId)]. 2. Pastry delivers the CREATE message to the node whose nodeId is numerically closest to the groupId. 3. Scribe’s deliver method is invoked; it checks the credentials to ensure the group can be created and adds the new groupId to the list of groups it already knows. 4. This node becomes the rendezvous point for the newly created group.

21 Scribe Implementation: Joining a Group. 1. The joining node asks Pastry to route a JOIN message with the groupId as the key [e.g., route(JOIN, groupId)]. The message is routed towards the rendezvous point. 2. At each node along the route, Pastry invokes Scribe’s forward method, which does the following: a. It checks whether the local node is a forwarder for the group. b. If the node is already a forwarder for the group, it adds the source node as a child. c. If the node is not yet a forwarder for the group, it creates a children table for the group, adds the source node as a child, and then routes its own JOIN message with the groupId as key [e.g., route(JOIN, groupId)]. d. Finally, it terminates the JOIN message it received from the source.
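The join steps above can be sketched with a toy simulation. In this Python sketch the Pastry route is simply given as a list of nodes ending at the rendezvous point, and the class `ScribeNode`, its `handle_join` method, and the `join` helper are hypothetical names; credentials and real message passing are omitted.

```python
from typing import Dict, List, Set


class ScribeNode:
    """Minimal node holding one children table per group (illustrative)."""

    def __init__(self, node_id: str):
        self.node_id = node_id
        self.children: Dict[str, Set[str]] = {}  # groupId -> child nodeIds

    def handle_join(self, group_id: str, child_id: str) -> bool:
        """Add child_id as a child; return True if this node was not yet a
        forwarder and therefore has to send its own JOIN toward the root."""
        was_forwarder = group_id in self.children
        self.children.setdefault(group_id, set()).add(child_id)
        return not was_forwarder


def join(route: List[ScribeNode], group_id: str) -> None:
    """Walk a route (joining node first, rendezvous point last)."""
    for child, parent in zip(route, route[1:]):
        must_continue = parent.handle_join(group_id, child.node_id)
        if not must_continue:
            break  # reached an existing forwarder; the JOIN stops here


if __name__ == "__main__":
    a, b, c, root = (ScribeNode(n) for n in ("a", "b", "c", "root"))
    join([a, b, root], "g")   # builds a -> b -> root
    join([c, b, root], "g")   # stops at b, which is already a forwarder
    print(b.children, root.children)
```

In the second call the JOIN never reaches the root, which is the property that keeps join traffic away from the rendezvous point as groups grow.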

22 Scribe Implementation: Leaving a Group. 1. The leaving node records locally that it left the group. 2. If there are no more children in its children table, it sends a LEAVE message to its parent node. 3. The parent removes the sender from its children table and repeats step 2; the process continues up the tree until it reaches a node whose children table is still non-empty after removing the departing child.
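The recursive pruning in the leave steps can be illustrated with a small tree model. This Python sketch makes one assumption beyond the slide: a forwarder that is itself a group member stays in the tree even when its children table becomes empty. The names `TreeNode` and `leave` are hypothetical.

```python
class TreeNode:
    """Node in a multicast tree for a single group (illustrative)."""

    def __init__(self, name, parent=None, is_member=False):
        self.name = name
        self.parent = parent
        self.is_member = is_member
        self.children = set()
        if parent is not None:
            parent.children.add(self)


def leave(node: TreeNode) -> None:
    """Remove node from the group and prune forwarders that become useless."""
    node.is_member = False  # step 1: record locally that we left
    current = node
    # Steps 2-3: walk up while the current node has no children left.
    # Assumption: a node that is still a member is not pruned.
    while (current.parent is not None
           and not current.children
           and not current.is_member):
        parent = current.parent
        parent.children.discard(current)  # effect of the LEAVE message
        current.parent = None
        current = parent


if __name__ == "__main__":
    root = TreeNode("root")
    f = TreeNode("f", parent=root)
    m1 = TreeNode("m1", parent=f, is_member=True)
    m2 = TreeNode("m2", parent=f, is_member=True)
    leave(m1)
    print(sorted(n.name for n in f.children))
```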

23 Multicast a Message: The source locates the rendezvous point for the group [e.g., route(MULTICAST, groupId)] and asks it to return its IP address. The source caches the IP address and uses it for future multicasts. If the rendezvous point changes or fails, the source uses Pastry again to find the new rendezvous point. All multicast messages are disseminated from the rendezvous point.

24 Scribe Implementation

25 Reliability of Scribe: Repairing the Tree. Periodically, each non-leaf node sends a heartbeat message to all of its children. When a child does not receive a heartbeat for a certain period of time, it suspects that its parent has failed and sends a JOIN message with the group’s identifier. Pastry routes the message to a new parent, thus repairing the multicast tree.

26 Reliability of Scribe: Failure of the Rendezvous Point. The state of the rendezvous point is replicated across the k nodes closest to the root node in nodeId space (a typical value of k is 5). These k nodes are all children of the root node. When the root fails, its immediate children detect the failure and join again through Pastry. Pastry routes the new JOIN messages to a new root (the live node whose nodeId is numerically closest to the groupId), which takes over the role of the rendezvous point.
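The way a new root emerges after a failure follows directly from the "numerically closest live nodeId" rule. The Python sketch below illustrates this with small integer ids; the function name `rendezvous_point` is hypothetical, and wraparound in the circular id space is ignored for brevity.

```python
from typing import Iterable


def rendezvous_point(group_id: int, live_node_ids: Iterable[int]) -> int:
    """Return the live nodeId numerically closest to group_id.

    Simplification: distance is a plain absolute difference, with no
    wraparound at the ends of the id space.
    """
    return min(live_node_ids, key=lambda n: abs(n - group_id))


if __name__ == "__main__":
    nodes = {10, 90, 130, 200}
    root = rendezvous_point(100, nodes)
    print("root:", root)
    nodes.discard(root)  # the root fails
    print("new root:", rendezvous_point(100, nodes))
```

No election is needed: once the old root is gone, routing the JOIN with the same groupId lands on the next-closest node.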

27 Reliability of Scribe: Children table entries are discarded unless the child node periodically sends an explicit message stating that it wants to remain in the table. The tree repair mechanism scales well: fault detection is done by sending messages to a small number of nodes, and recovery from faults is local, involving only a small number of nodes (O(log_{2^b} N)).

28 Scribe: Providing Additional Guarantees. Scribe provides reliable, ordered delivery of multicast messages only if the TCP connections between nodes in the multicast tree do not fail. Scribe provides a simple mechanism that allows applications to implement stronger reliability guarantees. forwardHandler(msg): invoked by Scribe before the node forwards a multicast message to its children. joinHandler(msg): invoked by Scribe after a new child is added to one of the node’s children tables. faultHandler(msg): invoked by Scribe when a node suspects that its parent is faulty.

29 Additional Reliability Example: forwardHandler: the root assigns a sequence number to each message; multicast messages are buffered by the root and by each node in the multicast tree, and are retransmitted after the multicast tree is repaired. faultHandler: adds the last sequence number n delivered by the node to the JOIN message that is sent out to repair the tree. joinHandler: retransmits buffered messages with sequence numbers above n to the new child.
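The interplay of the three handlers in this example can be sketched as follows. This is a minimal Python illustration under stated assumptions: the class name `ReliableNode`, the dictionary-based JOIN message with a `last_seq` field, and the bounded buffer size are all hypothetical, and sequence numbers are assumed to have been assigned by the root already.

```python
from collections import deque


class ReliableNode:
    """Buffers forwarded messages so a re-joining child can catch up."""

    def __init__(self, buffer_size: int = 128):
        # Bounded buffer of (sequence number, payload) pairs.
        self.buffer = deque(maxlen=buffer_size)
        self.last_delivered = 0

    def forward_handler(self, seq: int, payload) -> None:
        """Invoked before forwarding: remember the message."""
        self.buffer.append((seq, payload))
        self.last_delivered = seq

    def fault_handler(self, join_msg: dict) -> dict:
        """Invoked when the parent is suspected faulty: tag the JOIN."""
        join_msg["last_seq"] = self.last_delivered
        return join_msg

    def join_handler(self, join_msg: dict) -> list:
        """Invoked after a child is added: return messages it missed."""
        n = join_msg.get("last_seq", 0)
        return [(s, p) for s, p in self.buffer if s > n]


if __name__ == "__main__":
    new_parent = ReliableNode()
    for seq, payload in enumerate(["a", "b", "c"], start=1):
        new_parent.forward_handler(seq, payload)

    child = ReliableNode()
    child.forward_handler(1, "a")          # child saw only message 1
    join = child.fault_handler({"type": "JOIN"})
    print(new_parent.join_handler(join))   # messages to retransmit
```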

30 Experimental Setup: Randomly generated network topology with 5050 routers. Scribe was run on 100,000 end nodes randomly assigned to routers with uniform distribution. Ten different topologies were generated using different random seeds; results are averaged over all ten topologies. Experiments used a wide range of group sizes and a large number of groups. Size of the group with rank r: gsize(r) = floor(N * r^(-1.25) + 0.5), where N is the total number of nodes. Group membership was selected randomly with uniform distribution.
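To get a feel for the group-size distribution, the formula can be evaluated directly. The Python sketch below uses N = 100,000 from this slide; the choice of 1500 as the largest rank is taken from the group count mentioned on the later stress slides, and the function name simply mirrors the slide's notation.

```python
import math


def gsize(rank: int, n_nodes: int = 100_000) -> int:
    """Size of the group with the given rank: floor(N * r^(-1.25) + 0.5)."""
    return math.floor(n_nodes * rank ** -1.25 + 0.5)


if __name__ == "__main__":
    for r in (1, 2, 10, 100, 1500):
        print(r, gsize(r))
```

Under this formula the rank-1 group contains every node, while the rank-1500 group has only 11 members, so a single experiment covers several orders of magnitude in group size.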

31 Delay Penalty: Compares the delay of Scribe multicast with that of IP multicast by measuring the distribution of delays to deliver a message to each member of a group. Two metrics, each a ratio of Scribe delay to IP multicast delay: RMD (ratio of maximum delays): 50% of groups less than 1.69, max = 4.26. RAD (ratio of average delays): 50% of groups less than 1.68, max = 2.

32 Node Stress: Stress is imposed by maintaining groups and by handling packet forwarding and duplication at the end nodes instead of on the routers. It is measured by the number of groups with non-empty children tables and the number of entries in children tables per node. In the simulation with 1500 groups: non-empty children tables per node: avg = 2.4, max = 40; children table entries per node: avg = 6.2, max = 1059.

33 Link Stress Experiment: Link stress was computed by counting the number of packets sent over each link when a message is sent to each of the 1500 groups. Total number of links: 1,035,295. Total number of messages for Scribe: 2,489,824. Total number of messages for IP multicast: 758,853. Mean number of messages per link: 2.4 for Scribe, 0.7 for IP multicast. Maximum link stress: 4031 for Scribe, 950 for IP multicast.
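The mean values on this slide follow from dividing the message totals by the number of links. The short Python sketch below reproduces that arithmetic; the function name `mean_link_stress` is just an illustrative label.

```python
def mean_link_stress(total_messages: int, total_links: int) -> float:
    """Average number of messages carried per link."""
    return total_messages / total_links


TOTAL_LINKS = 1_035_295
SCRIBE_MESSAGES = 2_489_824
IP_MULTICAST_MESSAGES = 758_853

if __name__ == "__main__":
    scribe = mean_link_stress(SCRIBE_MESSAGES, TOTAL_LINKS)
    ip = mean_link_stress(IP_MULTICAST_MESSAGES, TOTAL_LINKS)
    print(f"Scribe: {scribe:.1f}, IP multicast: {ip:.1f}")
    print(f"Scribe sends {SCRIBE_MESSAGES / IP_MULTICAST_MESSAGES:.1f}x as many messages")
```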

34 Bottleneck Remover: When a node detects that it is overloaded, it selects the group that consumes the most resources, then chooses the child in this group that is farthest away. The parent drops this child by sending it a message containing the children table for the group, along with the delays between each child and the parent. When the child receives the message, it does the following: 1. It measures the delay between itself and each other child in the children table it received. 2. It computes the total delay between itself and the parent via each of those nodes. 3. It sends a JOIN message to the node that provides the smallest combined delay.
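Both decisions in the bottleneck remover reduce to simple selections over delay measurements. The Python sketch below illustrates them with made-up delays in milliseconds; the function names `pick_child_to_drop` and `pick_new_parent` and the dictionary layout are assumptions for illustration only.

```python
from typing import Dict


def pick_child_to_drop(child_delays: Dict[str, float]) -> str:
    """Parent side: choose the child that is farthest away."""
    return max(child_delays, key=child_delays.get)


def pick_new_parent(delay_to_sibling: Dict[str, float],
                    sibling_to_parent: Dict[str, float]) -> str:
    """Dropped child side: choose the sibling minimizing the combined delay
    (child -> sibling) + (sibling -> old parent)."""
    return min(delay_to_sibling,
               key=lambda s: delay_to_sibling[s] + sibling_to_parent[s])


if __name__ == "__main__":
    # Delays from the overloaded parent to each child (made-up values).
    parent_delays = {"A": 5.0, "B": 20.0, "C": 4.0, "D": 30.0}
    dropped = pick_child_to_drop(parent_delays)

    # The dropped child measures its own delay to each remaining sibling.
    measured = {"A": 10.0, "B": 3.0, "C": 6.0}
    siblings = {s: d for s, d in parent_delays.items() if s != dropped}
    print(dropped, "re-joins under", pick_new_parent(measured, siblings))
```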

35 Bottleneck Remover Results: The bottleneck remover introduces the potential for routing loops. When a loop is detected, the node sends another JOIN message to generate a new random route. The bottleneck remover limits the number of entries in a node's children tables at the cost of increased link stress during joins. Average link stress increases from 2.4 to 2.7, and the maximum increases from 4031 to 4728.

36 Scalability with Many Small Groups: 50,000 Scribe nodes; 30,000 Scribe groups with 11 nodes per group. The average number of children table entries per node is 21.2, compared to only 6.6 for plain (naïve) multicast. Average link stress: 6.1 for Scribe, 1.6 for IP multicast, 2.9 for naïve multicast. Scribe's values are higher because it creates trees with long paths and no branching.

37 Conclusion: Scribe is a fully decentralized, large-scale application-level multicast infrastructure built on top of Pastry. It is designed to scale to large numbers of groups and large group sizes, and it supports multiple multicast sources per group. Scribe and Pastry’s randomized placement of nodes, groups, and multicast roots balances the load and the multicast trees. Scribe uses a best-effort delivery scheme but can be extended to provide stricter delivery guarantees. Experimental results, measured against IP multicast, show that Scribe can efficiently support large numbers of nodes and groups and a wide range of group sizes.

