Tapestry: Scalable and Fault-tolerant Routing and Location


Tapestry: Scalable and Fault-tolerant Routing and Location
Stanford Networking Seminar, October 2001
Ben Y. Zhao (ravenben@eecs.berkeley.edu)

Challenges in the Wide-area
Trends:
- Exponential growth in CPU and storage
- Networks expanding in reach and bandwidth
Can applications leverage these new resources?
- Scalability: increasing users, requests, and traffic
- Resilience: more components → lower aggregate MTBF
- Management: intermittent resource availability → complex management schemes
Proposal: an infrastructure that solves these issues and passes the benefits on to applications.

Driving Applications
Leverage cheap and plentiful resources: CPU cycles, storage, network bandwidth.
Global applications share distributed resources:
- Shared computation: SETI, Entropia
- Shared storage: OceanStore, Gnutella, Scale-8
- Shared bandwidth: application-level multicast, content distribution networks

Key: Location and Routing
Hard problem: locating and messaging to resources and data.
Goals for a wide-area overlay infrastructure:
- Easy to deploy
- Scalable to millions of nodes and billions of objects
- Available in the presence of routine faults
- Self-configuring, adaptive to network changes
- Localizes the effects of operations and failures

Talk Outline
- Motivation
- Tapestry overview
- Fault-tolerant operation
- Deployment / evaluation
- Related / ongoing work

What is Tapestry?
A prototype of a decentralized, scalable, fault-tolerant, adaptive location and routing infrastructure (Zhao, Kubiatowicz, Joseph et al., U.C. Berkeley); the network layer of OceanStore.
Routing: suffix-based hypercube, similar to Plaxton, Rajaraman, Richa (SPAA '97).
Decentralized location: a virtual hierarchy per object, with cached location references.
Core API:
- publishObject(ObjectID, [serverID])
- routeMsgToObject(ObjectID)
- routeMsgToNode(NodeID)
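The core API above could be rendered as a small Java interface, sketched below. The nested ID types and the method signatures beyond the three calls named on the slide are illustrative assumptions, not Tapestry's actual classes.

    // Hypothetical Java rendering of the slide's three-call API.
    public interface TapestryNode {
        // Stand-in 160-bit identifiers (assumed representation).
        final class ObjectID {
            public final byte[] bits;
            public ObjectID(byte[] bits) { this.bits = bits; }
        }
        final class NodeID {
            public final byte[] bits;
            public NodeID(byte[] bits) { this.bits = bits; }
        }

        // Make a locally stored object findable: routes a publish
        // message toward the object's root, leaving location pointers.
        void publishObject(ObjectID id, NodeID server);

        // Deliver a message to a (nearby) copy of the object.
        void routeMsgToObject(ObjectID id, byte[] msg);

        // Deliver a message to the live node whose ID best matches.
        void routeMsgToNode(NodeID id, byte[] msg);
    }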

Routing and Location
Namespace (nodes and objects): 160 bits → ~2^80 names before a name collision is expected.
Each object has its own hierarchy rooted at a root node: f(ObjectID) = RootID, via a dynamic mapping function.
Suffix routing from A to B: at the hth hop, arrive at the nearest node hop(h) that shares a suffix of h digits with B.
Example: 5324 routes to 0629 via 5324 → 2349 → 1429 → 7629 → 0629.
Object location: the root is responsible for storing the object's location; publish and search both route incrementally toward the root.
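A toy Java helper makes the suffix rule concrete: each hop must extend the trailing digits shared with the destination by one. The class and method names are invented for illustration.

    // Toy helper for the suffix rule above (digit strings assumed).
    class SuffixRouting {
        // Number of trailing digits shared by a and b.
        static int sharedSuffixDigits(String a, String b) {
            int n = 0;
            while (n < a.length() && n < b.length()
                   && a.charAt(a.length() - 1 - n) == b.charAt(b.length() - 1 - n)) n++;
            return n;
        }
        // From the slide's example (destination 0629):
        //   sharedSuffixDigits("2349", "0629") == 1   (hop 1)
        //   sharedSuffixDigits("1429", "0629") == 2   (hop 2)
        //   sharedSuffixDigits("7629", "0629") == 3   (hop 3)
    }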

Publish / Lookup
Publish object with ObjectID:
    // route toward the "virtual root", ID = ObjectID
    For (i = 0; i < Log2(N); i += j) {    // defines the hierarchy
        // j = # of bits per digit, e.g. j = 4 for hex digits
        Insert entry into nearest node that matches on last i bits
        If no match found, deterministically choose an alternative
    }
    // the real root node is reached when no external routes remain
Lookup object:
    // traverse the same path toward the root as publish,
    // except search for an entry at each node
    For (i = 0; i < Log2(N); i += j) {
        Search for cached object location
    }
    Once found, route via IP or Tapestry to the object

Tapestry Mesh
Incremental suffix-based routing.
[Figure: a mesh of NodeIDs (0x43FE, 0x79FE, 0x23FE, 0x993E, ...) connected by suffix-routing links at levels 1-4]

Routing in Detail
Example: octal digits, 2^12 namespace, routing 5712 → 7510:
    5712 → 0880 → 3210 → 4510 → 7510
(each hop fixes one more low-order digit of the destination)
Neighbor map for node 5712 (octal); "x" is a wildcard, and the slot matching 5712's own digit holds the local node:
    Level 1:  xxx0  xxx1  (5712)  xxx3  xxx4  xxx5  xxx6  xxx7
    Level 2:  xx02  (5712)  xx22  xx32  xx42  xx52  xx62  xx72
    Level 3:  x012  x112  x212  x312  x412  x512  x612  (5712)
    Level 4:  0712  1712  2712  3712  4712  (5712)  6712  7712
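The neighbor map can be pictured as a small two-dimensional table. The sketch below (invented names, digit strings, no network-distance logic) shows how a next hop would be looked up; table[h] corresponds to the slide's routing level h+1.

    // Illustrative neighbor map: table[h][digit] holds the nearest
    // node sharing the last h digits and having the given next digit.
    class NeighborMap {
        final String self;        // e.g. "5712"
        final String[][] table;   // [digits already shared][base]

        NeighborMap(String self, int base) {
            this.self = self;
            this.table = new String[self.length()][base];
        }

        // Next hop toward dest: let h = trailing digits we already
        // share with dest; pick the entry extending that suffix by one.
        String nextHop(String dest) {
            int h = 0;
            while (h < self.length()
                   && self.charAt(self.length() - 1 - h) == dest.charAt(dest.length() - 1 - h)) h++;
            if (h == self.length()) return null;   // we are the destination
            int digit = dest.charAt(dest.length() - 1 - h) - '0';
            return table[h][digit];                // a backup may be used in practice
        }
    }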

Object Location: Randomization and Locality
[Figure: object location paths, illustrating randomization and locality]

Talk Outline
- Motivation
- Tapestry overview
- Fault-tolerant operation
- Deployment / evaluation
- Related / ongoing work

Fault-tolerant Location
Minimized soft-state vs. explicit fault-recovery.
Redundant roots (sketched below):
- Object names hashed with small salts → multiple names/roots per object
- Queries and publishing use all roots in parallel
- P(finding a reference under partition) = 1 − (1/2)^n, where n = # of roots
Soft-state periodic republish:
- 50 million files/node, daily republish, b = 16, N = 2^160, 40 B/msg
- Worst-case update traffic: 156 kb/s; expected traffic with 2^40 real nodes: 39 kb/s
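The salting scheme lends itself to a short sketch: hash the object name with n small salts to get n independent names (hence roots), and the slide's availability bound follows. SHA-1 is an assumption here; the slide only fixes a 160-bit namespace.

    // Sketch: derive n root names by hashing name + salt, and compute
    // the slide's bound P(find) = 1 - (1/2)^n.
    import java.security.MessageDigest;
    import java.nio.charset.StandardCharsets;

    class SaltedRoots {
        static byte[][] rootNames(String name, int n) throws Exception {
            MessageDigest sha = MessageDigest.getInstance("SHA-1");
            byte[][] ids = new byte[n][];
            for (int salt = 0; salt < n; salt++) {
                sha.reset();
                ids[salt] = sha.digest((name + ":" + salt).getBytes(StandardCharsets.UTF_8));
            }
            return ids;   // n independent 160-bit names -> n roots
        }

        static double pFind(int n) { return 1.0 - Math.pow(0.5, n); }
        // e.g. n = 5 roots -> P ~ 0.97 under the slide's partition model
    }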

Fault-tolerant Routing
Strategy:
- Detect failures via soft-state probe packets
- Route around a problematic hop via backup pointers
Handling:
- 3 forward pointers per outgoing route (a primary plus 2 backups)
- "Second chance" algorithm for intermittent failures
- Upgrade backup pointers and replace failed ones
Protocols:
- First Reachable Link Selection (FRLS)
- Proactive duplicate packet routing
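A minimal sketch of the three-pointer scheme, with an illustrative reading of the second-chance policy (demote a hop on its first missed probe, replace it on the second); none of the names below come from the Tapestry code.

    // Sketch: route entry with primary + 2 backups (3 pointers total).
    // Liveness flags would be refreshed by the soft-state probes above.
    class RouteEntry {
        String[] hops = new String[3];     // hops[0] = primary, 1-2 = backups
        boolean[] alive = new boolean[3];
        int[] strikes = new int[3];        // for the second-chance policy

        // FRLS-style choice: forward along the first reachable pointer.
        String firstReachable() {
            for (int i = 0; i < hops.length; i++)
                if (hops[i] != null && alive[i]) return hops[i];
            return null;                   // no usable pointer at this hop
        }

        // Second chance: one missed probe only demotes a hop;
        // a second consecutive miss marks it down for replacement.
        void onProbeMiss(int i)  { if (++strikes[i] >= 2) alive[i] = false; }
        void onProbeReply(int i) { strikes[i] = 0; alive[i] = true; }
    }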

Summary
- Decentralized location and routing infrastructure; core routing similar to PRR97
- Distributed algorithms for object-root mapping and node insertion/deletion
- Fault handling with redundancy, soft-state beacons, and self-repair
- Decentralized and scalable, with locality
Analytical properties (N = size of namespace, n = # of physical nodes):
- Per-node routing table size: b·log_b(N)
- Finds an object in log_b(n) overlay hops
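As a worked instance of these bounds, assuming hex digits (b = 16), the 160-bit namespace from earlier (N = 2^160), and, say, n = 10^6 physical nodes:

    table size  = b · log_b(N) = 16 · (160/4) = 16 · 40 = 640 entries per node
    lookup cost = log_b(n) = log_16(10^6) ≈ 5 overlay hops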

Talk Outline
- Motivation
- Tapestry overview
- Fault-tolerant operation
- Deployment / evaluation
- Related / ongoing work

Deployment Status
Java implementation in OceanStore:
- Static Tapestry currently running
- Dynamic Tapestry with fault-tolerant routing being deployed
Packet-level simulator:
- Delay measured in network hops; no cross traffic or queuing delays
- Topologies: AS, MBone, GT-ITM, TIERS
ns2 simulations

Evaluation Results
Cached object pointers:
- Efficient lookup for nearby objects
- Reasonable storage overhead
Multiple object roots:
- Improve availability under attack
- Improve performance and performance stability
Reliable packet delivery:
- Redundant pointers approximate optimal reachability
- FRLS, a simple fault-tolerant UDP protocol

First Reachable Link Selection
- Use periodic UDP packets to gauge link condition (see the probe sketch below)
- Route packets along the shortest "good" link
- Assumes IP cannot correct its routing tables in time for packet delivery
[Figure: nodes A-E; when no IP path exists to the destination, Tapestry routes around the failed link]
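A bare-bones sketch of such a periodic UDP probe; the timeout and payload are made-up values, and the slide does not specify FRLS's actual probing details.

    // Sketch: send a probe datagram and mark the link bad on timeout.
    import java.net.*;

    class LinkProber {
        static boolean probe(InetAddress peer, int port) {
            try (DatagramSocket sock = new DatagramSocket()) {
                sock.setSoTimeout(300);                  // assumed 300 ms budget
                byte[] ping = {0x1};
                sock.send(new DatagramPacket(ping, ping.length, peer, port));
                byte[] buf = new byte[16];
                sock.receive(new DatagramPacket(buf, buf.length)); // echo expected
                return true;                             // link currently "good"
            } catch (Exception e) {
                return false;                            // timeout or error -> "bad"
            }
        }
    }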

Talk Outline
- Motivation
- Tapestry overview
- Fault-tolerant operation
- Deployment / evaluation
- Related / ongoing work

Bayeux
Global-scale application-level multicast (NOSSDAV 2001).
Scalability:
- Scales to > 10^5 nodes
- Self-forming member group partitions
Fault tolerance:
- Multicast root replication
- FRLS for resilient packet delivery
More optimizations:
- Group ID clustering for better bandwidth utilization

Bayeux: Multicast
- Fanout in the tree is bounded by the node ID base
- Height of the tree is bounded by the # of digits in the node IDs
[Figure: a multicast tree rooted at 79FE fanning out through NodeIDs (23FE, 993E, F9FE, 43FE, ...) down to receivers]
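These bounds can be made concrete for the 4-digit hexadecimal IDs shown in the figure (an assumption about the configuration, not a statement on the slide):

    fanout ≤ base        = 16 children per tree node
    height ≤ # of digits = 4 levels from root to any receiver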

Bayeux: Tree Partitioning
Tree partitioning algorithm:
1. Integrate the Bayeux multicast root nodes into the Tapestry network
2. Name an object O with the hash of the multicast session name, and place O on each multicast root
3. Each multicast root advertises O in Tapestry
4. A new member M uses the Tapestry location service to route a JOIN message to the nearest multicast root node R
5. R sends a TREE message to M, now a member of R's receiver set
(A sketch of steps 4-5 follows below.)
[Figure: a new receiver's JOIN message routed to the nearest of several multicast roots]
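Steps 4 and 5 could look roughly like the sketch below, which leans on the hypothetical TapestryNode interface from earlier; the message formats and receiver-set type are invented for illustration.

    // Schematic Bayeux join flow (names and payloads invented).
    class BayeuxJoin {
        // Step 4: new member M routes a JOIN toward object O (the hash
        // of the session name); Tapestry delivers it to the nearest root.
        static void join(TapestryNode t, TapestryNode.ObjectID sessionO) {
            t.routeMsgToObject(sessionO, "JOIN".getBytes());
        }

        // Step 5: root R adds M to its receiver set and replies with a
        // TREE message, grafting M onto R's subtree.
        static void onJoin(TapestryNode t,
                           java.util.Set<TapestryNode.NodeID> receivers,
                           TapestryNode.NodeID m) {
            receivers.add(m);
            t.routeMsgToNode(m, "TREE".getBytes());
        }
    }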

Overlay Routing Networks
CAN: Ratnasamy et al. (ACIRI / UCB)
- Uses a d-dimensional coordinate space to implement a distributed hash table
- Routes to the neighbor closest to the destination coordinate
- Fast insertion / deletion; constant-sized routing state
- Unconstrained # of hops; overlay distance not proportional to physical distance
Chord: Stoica, Morris, Karger, et al. (MIT / UCB)
- Linear namespace modeled as a circular address space
- "Finger table" points to a logarithmic # of increasingly remote hosts
- Simplicity in algorithms; fast fault-recovery
- log2(N) hops and routing state; overlay distance not proportional to physical distance
Pastry: Rowstron and Druschel (Microsoft / Rice)
- Hypercube routing similar to PRR97
- Objects replicated to servers by name
- log(N) hops and routing state; data replication required for fault-tolerance

Ongoing Research
Fault-tolerant routing:
- Reliable Overlay Networks (MIT)
- Fault-tolerant Overlay Routing (UCB)
Application-level multicast:
- Bayeux (UCB), CAN (AT&T), Scribe and Herald (Microsoft)
File systems:
- OceanStore (UCB)
- PAST (Microsoft / Rice)
- Cooperative File System (MIT)

For More Information
Tapestry: http://www.cs.berkeley.edu/~ravenben/tapestry
OceanStore: http://oceanstore.cs.berkeley.edu
Related papers:
- http://oceanstore.cs.berkeley.edu/publications
- http://www.cs.berkeley.edu/~ravenben/publications
Contact: ravenben@cs.berkeley.edu