Tapestry: Decentralized Routing and Location
System Seminar S'01
Ben Y. Zhao, CS Division, U.C. Berkeley

Challenges in the Wide-area
• Trends:
  – Exponential growth in CPU, bandwidth, storage
  – Network expanding in reach and bandwidth
• Can applications leverage the new resources?
  – Scalability: increasing users, requests, traffic
  – Resilience: more components → lower aggregate MTBF
  – Management: intermittent resource availability → complex management schemes
• Proposal: an infrastructure that solves these issues and passes the benefits on to applications

Cluster-based Applications

Advantages:
• Ease of fault monitoring
• Communication on LANs:
  – Low latency
  – High bandwidth
  – Abstracts away communication
• Shared state
• Simple load balancing

Limitations:
• Centralization as a liability:
  – Centralized network link
  – Centralized power source
  – Geographic locality
• Scalability limitations:
  – Outgoing bandwidth
  – Power consumption
  – Physical resources (space, cooling)
• Non-trivial deployment

Global Computation Model
• A wish list for global-scale application services
• A global self-adaptive system:
  – Utilize all available resources
  – Decentralize all functionality → no bottlenecks, no single points of vulnerability
  – Exploit locality whenever possible → localize the impact of failures
  – Peer-based monitoring of failures and resources

Driving Applications
• Leverage the proliferation of cheap and plentiful resources: CPUs, storage, network bandwidth
• Global applications share distributed resources:
  – Shared computation: SETI, Entropia
  – Shared storage: OceanStore, Napster, Scale-8
  – Shared bandwidth: application-level multicast, content distribution

Key: Location and Routing
• Hard problem:
  – Locating and messaging to resources and data
• Approach: a wide-area overlay infrastructure:
  – Easier to deploy than lower-level solutions
  – Scalable: millions of nodes, billions of objects
  – Available: detects and survives routine faults
  – Dynamic: self-configuring, adaptive to the network
  – Exploits locality: localizes the effects of operations and failures
  – Load balancing

Talk Outline
• Problems facing wide-area applications
• Previous work: location services & PRR97
• Tapestry: mechanisms and protocols
• Preliminary evaluation
• Sample application: Bayeux
• Related and future work

Previous Work: Location
• Goal: given an ID or description, locate the nearest object
• Location services (scalability via hierarchy):
  – DNS
  – Globe
  – Berkeley SDS
• Issues:
  – Consistency for dynamic data
  – Scalability at the root
  – Centralized approach: bottleneck and vulnerability

Decentralizing Hierarchies
• Centralized hierarchies:
  – Each higher-level node is responsible for locating objects in a greater domain
• Decentralize: create a tree per object O (really!)
  – Object O has its own root and subtree
  – A server on each level keeps a pointer to the nearest object in its domain
  – Queries search upward in the hierarchy
[Figure: tree for object O with root ID = O; directory servers track 2 replicas.]

What is Tapestry?
• A prototype of a decentralized, scalable, fault-tolerant, adaptive location and routing infrastructure
• Network layer of OceanStore (Zhao, Kubiatowicz, Joseph et al., U.C. Berkeley)
• Suffix-based hypercube routing:
  – Core system inspired by Plaxton, Rajaraman, Richa (SPAA 97)
• Core API:
  – publishObject(ObjectID, [serverID])
  – sendmsgToObject(ObjectID)
  – sendmsgToNode(NodeID)
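A minimal sketch of how this three-call API might look as a programming interface. The method names come straight from the slide; the parameter types, the payload argument, and the docstring behavior are assumptions for illustration, not the actual Tapestry implementation:

```python
from abc import ABC, abstractmethod
from typing import Optional

class Tapestry(ABC):
    """Core Tapestry API as listed on the slide (signatures assumed)."""

    @abstractmethod
    def publishObject(self, objectID: bytes, serverID: Optional[bytes] = None) -> None:
        """Advertise an object: route toward the object's root, leaving
        (objectID -> serverID) location pointers at each hop."""

    @abstractmethod
    def sendmsgToObject(self, objectID: bytes, payload: bytes = b"") -> None:
        """Route toward the object's root until a location pointer is
        found, then deliver the message to the nearest replica."""

    @abstractmethod
    def sendmsgToNode(self, nodeID: bytes, payload: bytes = b"") -> None:
        """Deliver to the live node whose ID matches nodeID (or to its
        surrogate if no exact match exists)."""
```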

PRR (SPAA 97)
• Namespace (nodes and objects):
  – Large enough to avoid collisions (~2^160?); IDs are Log_2(N) bits for a namespace of size N
• Insert object:
  – Hash the object into the namespace to get its ObjectID
  – For (i = 0; i < Log_2(N); i += j) {   // define hierarchy
    - j is the digit size in bits (j = 4 → hex digits)
    - Insert an entry into the nearest node matching the ObjectID on its last i bits
    - When no match is found, pick the node matching on (i − j) bits with the highest ID value, and terminate

PRR97 Object Lookup
• Lookup object:
  – Traverse the same relative nodes as insert, but search for the object's entry at each node
  – For (i = 0; i < Log_2(N); i += j) {
    - Search for an entry in the nearest node matching on the last i bits
• Each object maps to a hierarchy defined by a single root:
  – f(ObjectID) = RootID
• Publish and search both route incrementally toward the root
• The root node f(O) is responsible for "knowing" the object's location
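A minimal sketch (not the authors' code) of this publish/lookup symmetry: both operations walk the same path toward the object's root, matching one more trailing digit per step. The `network.nearest_matching(suffix)` helper and the per-node `pointers` dictionary are assumptions for illustration:

```python
BASE_BITS = 4                      # j = 4 -> hex digits
NUM_DIGITS = 160 // BASE_BITS      # 160-bit namespace

def suffix_digits(obj_id: int, count: int) -> list[int]:
    """The last `count` hex digits of obj_id, least significant first."""
    return [(obj_id >> (BASE_BITS * k)) & 0xF for k in range(count)]

def publish(obj_id: int, server_id: int, network) -> None:
    """Leave (obj_id -> server_id) pointers at every hop toward the root."""
    for i in range(1, NUM_DIGITS + 1):
        hop = network.nearest_matching(suffix_digits(obj_id, i))
        if hop is None:                    # no node matches i digits:
            return                         # the previous hop acts as the root
        hop.pointers[obj_id] = server_id

def lookup(obj_id: int, network):
    """Walk the same path; stop at the first location pointer found."""
    for i in range(1, NUM_DIGITS + 1):
        hop = network.nearest_matching(suffix_digits(obj_id, i))
        if hop is None:
            return None                    # walked past the root: not found
        if obj_id in hop.pointers:
            return hop.pointers[obj_id]
    return None
```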

Basic PRR Mesh
Incremental suffix-based routing.
[Figure: mesh of nodes with hexadecimal IDs (0x43FE, 0x13FE, 0xABFE, 0x1290, ...); routes toward node 0x43FE resolve one more suffix digit per hop.]

PRR97 Routing to Nodes
Example: octal digits, 2^18 namespace.

Neighbor map for node "5712" (octal), one column per routing level:

  Level 1   Level 2   Level 3   Level 4
  xxx0      xx02      x012      0712
  xxx1      xx12      x112      1712
  xxx2      xx22      x212      2712
  xxx3      xx32      x312      3712
  xxx4      xx42      x412      4712
  xxx5      xx52      x512      5712
  xxx6      xx62      x612      6712
  xxx7      xx72      x712      7712
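Read against the table above, next-hop selection is simply "extend the shared suffix by one digit". A sketch under assumed data structures (IDs as octal strings; `table[level][digit]` holding the neighbor whose ID ends in that digit followed by the digits already shared):

```python
def shared_suffix_len(a: str, b: str) -> int:
    """Number of trailing digits a and b have in common."""
    n = 0
    while n < min(len(a), len(b)) and a[-(n + 1)] == b[-(n + 1)]:
        n += 1
    return n

def next_hop(my_id: str, dest: str, table):
    """Forward via the entry that matches one more trailing digit of dest."""
    level = shared_suffix_len(my_id, dest)
    if level == len(dest):
        return None                        # we are the destination
    wanted = dest[-(level + 1)]            # next digit of dest's suffix
    return table[level].get(wanted)        # None -> surrogate routing needed

# e.g. at node 5712 with dest 7510: level 0, wanted '0', route via xxx0.
# Each hop fixes one more digit, so a route takes at most
# len(dest) = log_8(namespace) hops.
```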

Use of Plaxton Mesh: Randomization and Locality
[Figure-only slide.]

PRR97 Limitations
• Setting up the routing tables:
  – Uses global knowledge
  – Supports only static networks
• Finding the way up to the root:
  – Sparse networks: find the node with the highest ID value
  – What happens as the network changes? → Need a deterministic way to find the same node over time
• Result: good analytical properties, but fragile in practice, and limited to small, static networks

Talk Outline
• Problems facing wide-area applications
• Previous work: location services & PRR97
• Tapestry: mechanisms and protocols
• Preliminary evaluation
• Sample application: Bayeux
• Related and future work

Tapestry Contributions

PRR97:
• Benefits inherited by Tapestry:
  – Scalable: state bLog_b(N), hops Log_b(N) (b = digit base, N = |namespace|)
  – Exploits locality
  – Proportional route distance
• Limitations:
  – Global-knowledge algorithms
  – Root-node vulnerability
  – Lack of adaptability

Tapestry:
• A real system!
  – Distributed algorithms:
    - Dynamic root mapping
    - Dynamic node insertion
  – Redundancy in location and routing
  – Fault-tolerance protocols
  – Self-configuring / adaptive
  – Support for mobile objects
• Application infrastructure

Fault-tolerant Location
• Minimized soft-state vs. explicit fault-recovery
• Multiple roots:
  – Objects hashed with small salts → multiple names/roots
  – Queries and publishing use all roots in parallel
  – P(finding a reference under partition) = 1 − (1/2)^n, where n = # of roots
• Soft-state periodic republish:
  – 50 million files/node, daily republish, b = 16, N = 2^160, 40 B/msg → worst-case update traffic: 156 kb/s
  – Expected traffic with 2^40 real nodes: 39 kb/s
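A small runnable sketch of the multiple-roots idea: hash the object name with several small salts to get independent root IDs. SHA-1 and the concrete salt values are assumptions; the slide only says objects are hashed with small salts:

```python
import hashlib

def root_ids(object_name: bytes, num_roots: int = 5) -> list[int]:
    """One 160-bit ID per salt; each ID maps to its own root node."""
    return [
        int.from_bytes(hashlib.sha1(object_name + bytes([salt])).digest(), "big")
        for salt in range(num_roots)
    ]

# With n independent roots, if a partition hides each root with
# probability 1/2, then P(at least one root reachable) = 1 - (1/2)**n:
n = 5
print(1 - 0.5 ** n)   # 0.96875
```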

Fault-tolerant Routing
• Detection:
  – Periodic probe packets between neighbors
  – Selective NACKs
• Handling:
  – Each entry in the routing map has 2 alternate nodes
  – A second-chance algorithm for intermittent failures
  – Long-term failures → alternates found via routing tables
• Protocols:
  – Reactive adaptive routing
  – Proactive duplicate-packet routing
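A sketch of what a routing entry with two alternates and a "second chance" policy might look like. The field names and the probe-count threshold are assumptions; the slide specifies only the two alternates and the second-chance idea:

```python
from dataclasses import dataclass

@dataclass
class RouteEntry:
    primary: str                      # neighbor node ID
    alternates: list[str]             # the 2 backup next hops
    missed_probes: int = 0            # consecutive failed probes
    SECOND_CHANCE_LIMIT: int = 2      # tolerate brief outages

    def next_hop(self) -> str:
        """Keep using the primary until it misses too many probes in a
        row (second chance for intermittent failures), then fail over."""
        if self.missed_probes <= self.SECOND_CHANCE_LIMIT:
            return self.primary
        return self.alternates[0]     # long-term failure: switch over

    def on_probe(self, ok: bool) -> None:
        """Called by the periodic probing described above."""
        self.missed_probes = 0 if ok else self.missed_probes + 1
```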

Dynamic Insertion
Operations necessary for a new node N to become fully integrated:
• Step 1: Build up N's routing maps
  – Send messages to each hop along the path from the gateway to the current node N' that best approximates N
  – The i-th hop along the path sends its i-th level route table to N
  – N optimizes those tables where necessary
• Step 2: Move the appropriate data from N' to N
• Step 3: Use back pointers from N' to find nodes with null entries for N's ID, and tell them to add an entry for N
• Step 4: Notify local neighbors to modify their paths to route through N where appropriate
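A high-level sketch of the four steps, not the deployed algorithm; every helper on `network` and the node objects (`route()`, `route_table_level()`, `back_pointers()`, ...) is an assumed interface for illustration:

```python
def insert_node(new, gateway, network):
    # Step 1: build new's routing maps. The i-th hop on the route from
    # the gateway toward new's own ID shares i suffix digits with new,
    # so its i-th level table is a valid start for new's i-th level.
    surrogate = gateway
    for i, hop in enumerate(network.route(gateway, new.node_id)):
        new.table[i] = dict(hop.route_table_level(i))
        surrogate = hop                    # ends as N', closest match to N
    new.optimize_tables(network)           # swap in closer neighbors

    # Step 2: move object pointers that now map to new rather than N'.
    for obj_id, loc in surrogate.pointers_closer_to(new.node_id):
        new.pointers[obj_id] = loc

    # Step 3: follow N''s back pointers to nodes whose tables have null
    # entries that new's ID can fill, and install new there.
    for node in surrogate.back_pointers():
        node.fill_null_entries_with(new)

    # Step 4: let nearby nodes reroute through new where it is closer.
    for node in network.local_neighbors(new):
        node.consider_route_via(new)
```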

Dynamic Insertion Example
[Figure: new node 0x143FE inserts via gateway 0xD73FF into a mesh of nodes (0x243FE, 0x913FE, 0x0ABFE, 0x71290, ...), building routes that match its ID one suffix digit at a time.]

Summary
• Decentralized location and routing infrastructure:
  – Core design from PRR97
  – Distributed algorithms for object-root mapping and node insertion
  – Fault handling with redundancy, soft-state beacons, and self-repair
• Analytical properties:
  – Per-node routing table size: bLog_b(N), where N = size of namespace
  – Find an object in Log_b(n) overlay hops, where n = # of physical nodes
• Key system properties:
  – Decentralized and scalable via random naming, yet exploits locality
  – Adaptive approach to failures and environmental changes

Talk Outline
• Problems facing wide-area applications
• Previous work: location services & PRR97
• Tapestry: mechanisms and protocols
• Preliminary evaluation
• Sample application: Bayeux
• Related and future work

Evaluation Issues
• Routing distance overhead (RDP)
• Routing redundancy → fault tolerance:
  – Availability of objects and references
  – Message delivery under link/router failures
  – Overhead of fault handling
• Optimality of dynamic insertion
• Locality vs. storage overhead
• Performance stability via redundancy

Results: Location Locality
Measuring the effectiveness of locality pointers (TIERS 5000 topology). [Graph omitted.]

Results: Stability via Redundancy
Parallel queries on multiple roots. Aggregate bandwidth measures the bandwidth used by soft-state republishing once per day plus the bandwidth used by requests arriving at a rate of 1 per second. [Graph omitted.]

Talk Outline
• Problems facing wide-area applications
• Previous work: location services & PRR97
• Tapestry: mechanisms and protocols
• Preliminary evaluation
• Sample application: Bayeux
• Related and future work

Example Application: Bayeux
• Application-level multicast
• Leverages Tapestry:
  – Scalability
  – Fault-tolerant data delivery
• Novel optimizations:
  – Self-forming member group partitions
  – Group ID clustering for better bandwidth utilization
[Figure: multicast tree rooted at the session root, branching on node-ID suffix patterns (***0, **00, **10, *010, *110, *100, ...).]

Related Work
• Content-Addressable Networks (CAN): Ratnasamy et al. (ACIRI / UCB)
• Chord: Stoica, Morris, Karger, Kaashoek, Balakrishnan (MIT / UCB)
• Pastry: Druschel and Rowstron (Rice / Microsoft Research)

Future Work
• Explore the effects of parameters on system performance via simulations
• Explore stability via statistics
• Show the effectiveness of the application infrastructure: build novel applications, scale existing apps to the wide-area
  – Silverback / OceanStore: global archival systems
  – Fault-tolerant adaptive routing
  – Network-embedded directory services
• Deployment:
  – Large-scale time-delayed event-driven simulation
  – Real wide-area network of universities / research centers

For More Information
• Tapestry:
• OceanStore:
• Related papers:

Backup Nodes Follow…

Dynamic Root Mapping
• Problem: choosing a root node for every object
  – Deterministic over network changes
  – Globally consistent
• Assumptions:
  – All nodes with the same matching suffix contain the same null/non-null pattern in the next level of the routing map
  – Requires consistent knowledge of nodes across the network

PRR Solution
• Given a desired ID N:
  – Find the set S of existing nodes matching the largest number of suffix digits with N
  – Choose S_i = the node in S with the highest-valued ID
• Issues:
  – The mapping must be generated statically, using global knowledge
  – It must be kept as hard state in order to operate in a changing environment
  – The mapping is not well distributed; many nodes get no mappings

Tapestry Solution
• Globally consistent distributed algorithm:
  – Attempt to route to the desired ID N
  – Whenever a null entry is encountered, choose the next "higher" non-null pointer entry
  – If the current node S holds the only non-null pointer in the rest of its route map, terminate the route: f(N) = S
• Assumes:
  – Routing maps across the network are up to date
  – Null/non-null properties are identical at all nodes sharing the same suffix
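A sketch of this surrogate-routing rule, simplified from the slide's description; the table layout (`node.table[level][digit]`) and the wraparound scan for the next "higher" entry are assumptions for illustration:

```python
def surrogate_route(start, target_digits: list[int], base: int = 16):
    """Return the surrogate root for an ID given as suffix-order digits."""
    node, level = start, 0
    while level < len(target_digits):
        wanted = target_digits[level]
        # Scan digits wanted, wanted+1, ... (wrapping) for a non-null entry.
        hop = None
        for k in range(base):
            cand = node.table[level].get((wanted + k) % base)
            if cand is not None:
                hop = cand
                break
        if hop is None:
            return node          # nothing left at this level: f(N) = node
        if hop is node:
            level += 1           # we already match this digit; go deeper
            continue
        node, level = hop, level + 1
    return node
```

Because every node applies the same tie-breaking rule over identical null/non-null patterns, all routes for ID N converge on the same surrogate without any global state.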

Analysis
Globally consistent deterministic mapping:
• Null entry → no node in the network with that suffix
• Consistent map → identical null entries across the route maps of nodes with the same suffix

Additional hops compared to the PRR solution reduce to the coupon-collector problem (assuming random ID distribution):
• With n·ln(n) + c·n entries, P(all coupons) = 1 − e^(−c)
• For n = b and c = b − ln(b), b^2 entries give P = 1 − b/e^b ≈ 1 − 1.8×10^(−6) for b = 16
• → # of additional hops ≤ Log_b(b^2) = 2

A distributed algorithm with minimal additional hops.
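A sketch of the standard coupon-collector bound behind these numbers (not the talk's full analysis; the b = 16 figure matches 16/e^16 ≈ 1.8×10^(−6)):

```latex
% After m random draws from b equally likely digits ("coupons"):
\[
  P(\text{digit } i \text{ unseen}) = \Bigl(1 - \tfrac{1}{b}\Bigr)^{m} \le e^{-m/b},
  \qquad
  P(\text{some digit unseen}) \le b\, e^{-m/b}.
\]
% Setting m = b ln b + c b gives P(all digits seen) >= 1 - e^{-c}.
% With c = b - ln b, m = b ln b + (b - ln b) b = b^2 entries suffice:
\[
  P(\text{all } b \text{ digits seen}) \ge 1 - b\,e^{-b}
  \approx 1 - 1.8\times 10^{-6} \quad (b = 16),
\]
% so the walk "up" costs at most \log_b(b^2) = 2 extra hops w.h.p.
```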

Dynamic Mapping Border Cases
• Two cases:
  – A. A node disappeared and some node has not yet detected it:
    - Routing proceeds on the invalid link and fails
    - With no backup router, proceed to surrogate routing
  – B. A node has entered but has not yet been detected, so traffic goes to the surrogate node instead of the new node:
    - The new node checks with its surrogate after all such nodes have been notified
    - Route info at the surrogate is moved to the new node

Content-Addressable Networks
• Distributed hash table addressed in a d-dimensional coordinate space
• Routing table size: O(d)
• Hops: expected O(d·N^(1/d))
  – N = size of namespace in d dimensions
• Efficiency via redundancy:
  – Multiple dimensions
  – Multiple realities
  – Reverse push of "breadcrumb" caches
  – Assumes immutable objects
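To make the asymptotics concrete, a tiny worked example; the d/4 constant is the average path length reported in the CAN paper, while the slide gives only the O-form:

```python
# Expected CAN route length ~ (d/4) * N**(1/d) hops for N nodes, dimension d.
for d in (2, 4, 8):
    hops = (d / 4) * (2**20) ** (1 / d)
    print(f"d={d}: ~{hops:.0f} hops, neighbor state 2d = {2*d}")
# d=2: ~512 hops; d=4: ~32 hops; d=8: ~11 hops -- more dimensions cost
# more neighbor state (2d) but buy much shorter routes.
```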

Chord
• Associates each node and object with a unique ID in a one-dimensional space
• Object O is stored by its successor: the first node with ID ≥ O
• Finger table:
  – Pointer to the next node 2^i away in the namespace
  – Table size: Log_2(n)
  – n = total # of nodes
• Find an object in Log_2(n) hops
• Optimization via heuristics
[Figure: Chord ring starting at Node 0.]
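A runnable sketch of a finger table on a deliberately tiny ring (not the authors' code; the 8-bit namespace and node IDs are made up for illustration):

```python
M = 8                                    # IDs live on a ring mod 2**M

def successor(node_ids: list[int], k: int) -> int:
    """First node clockwise from k on the ring (including k itself)."""
    ring = sorted(node_ids)
    return next((n for n in ring if n >= k % 2**M), ring[0])

def finger_table(node_ids: list[int], n: int) -> list[int]:
    """finger[i] = successor(n + 2**i); table size is M = Log_2(namespace)."""
    return [successor(node_ids, n + 2**i) for i in range(M)]

nodes = [0, 35, 70, 140, 200]
print(finger_table(nodes, 0))   # [35, 35, 35, 35, 35, 35, 70, 140]
```

Each lookup hop at least halves the remaining ID-space distance to the target, which is where the Log_2(n) hop bound comes from.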

Pastry
• Incremental routing like Plaxton / Tapestry
• Objects replicated at the x nodes closest to the object's ID
• Routing table size: b·Log_b(N) + O(b)
• Finds objects in O(Log_b N) hops
• Issues:
  – Does not exploit locality
  – The infrastructure controls replication and placement
  – Consistency / security

Key Properties
• Logical hops through the overlay per route
• Routing state per overlay node
• Overlay routing distance vs. the underlying network:
  – Relative Delay Penalty (RDP)
• Messages for insertion
• Load balancing

Comparing Key Metrics

  Property                Tapestry       Chord          CAN            Pastry
  Parameter               Base b         None           Dimension d    Base b
  Logical path length     Log_b N        Log_2 N        O(d·N^(1/d))   Log_b N
  Neighbor state          bLog_b N       Log_2 N        O(d)           bLog_b N + O(b)
  Routing overhead (RDP)  O(1)           O(1)?          O(1)?          O(1)?
  Messages to insert      O(Log_b^2 N)   O(Log_2^2 N)   O(d·N^(1/d))   O(Log_b N)
  Mutability              App-dep.       App-dep.       Immutable      ???
  Load balancing          Good           Good           Good           Good

(Chord, CAN, and Pastry were designed as P2P indices.)