1 Pastry and PAST
Alex Shraer
Based on slides by Peter Druschel and Gabi Kliot (CS Department, Technion)
2 Sources
“Storage management and caching in PAST, a large-scale persistent peer-to-peer storage utility”, Antony Rowstron (Microsoft Research) and Peter Druschel (Rice University)
“Pastry: scalable, decentralized object location and routing for large-scale peer-to-peer systems”, Antony Rowstron (Microsoft Research) and Peter Druschel (Rice University)
3 PASTRY: scalable, decentralized object location and routing for large-scale peer-to-peer systems
4 Pastry
- Generic p2p location and routing substrate (DHT)
- Self-organizing overlay network (join, departure, locality repair)
- Consistent hashing
- Lookup/insert of an object in < log_{2^b} N routing steps (expected)
- O(log N) per-node state
- Network locality heuristics
- Scalable, fault resilient, self-organizing, locality aware
5 Pastry: Object distribution
- Consistent hashing over a 128-bit circular id space, ranging from 0 to 2^128 - 1
- nodeIds (uniform random)
- objIds/keys (uniform random)
- Invariant: the node with the numerically closest nodeId maintains the object
[Figure: nodeIds and objIds/keys placed around the circular id space]
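A minimal sketch of this invariant, assuming SHA-1-derived ids and plain circular distance; the helper names are illustrative, since Pastry only requires that ids be uniformly distributed over the 128-bit space:

```python
import hashlib

RING = 1 << 128  # size of the 128-bit circular id space

def make_id(seed: str) -> int:
    # Illustrative id assignment: any uniformly distributed 128-bit
    # value works as a nodeId or objId/key.
    return int.from_bytes(hashlib.sha1(seed.encode()).digest()[:16], "big")

def ring_distance(a: int, b: int) -> int:
    # Numeric distance on the circular id space (wrapping around 0).
    d = abs(a - b)
    return min(d, RING - d)

def responsible_node(key: int, node_ids: list[int]) -> int:
    # Invariant: the node whose nodeId is numerically closest to the
    # key maintains the object stored under that key.
    return min(node_ids, key=lambda n: ring_distance(n, key))
```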
6 Pastry: Object insertion/lookup
- A message with key X is routed to the live node whose nodeId is closest to X
- Problem: a complete routing table is not feasible
[Figure: Route(X) traveling the circular id space toward the node closest to X]
7 Pastry: Routing table (of node 10233102)
- L nodes in the leaf set
- log_{2^b} N rows in practice (the full table has log_{2^b} 2^128 = 128/b rows)
- 2^b columns per row
- L neighbors in the neighborhood set
[Figure: routing table of node 10233102; rows 0, 1, 2, ..., 7, one column per digit value]
8 Pastry: Leaf sets
- Each node maintains the IP addresses of the L/2 nodes with the numerically closest larger nodeIds and the L/2 nodes with the numerically closest smaller nodeIds
- Used for:
  - routing efficiency/robustness
  - fault detection (keep-alive)
  - application-specific local coordination
9 Pastry: Routing procedure
if (destination D is within range of our leaf set)
    forward to the numerically closest leaf set member
else
    let l = length of the prefix shared by D and the local node
    let d = value of the l-th digit of D
    if (the routing table entry R[l][d] exists)
        forward to R[l][d]
    else
        forward to a known node (from the leaf set, routing table, or neighborhood set) that
        (a) shares at least as long a prefix with D, and
        (b) is numerically closer to D than this node
Unless |L|/2 adjacent nodes in the leaf set fail simultaneously, at least one such node must be alive.
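A runnable sketch of this procedure, assuming fixed-length lowercase hex ids (b = 4) and ignoring wrap-around of the id space; table[(row, digit)] stands in for the routing table entry R[l][d]:

```python
def shared_prefix_len(a: str, b: str) -> int:
    # Number of leading digits the two ids have in common.
    n = 0
    while n < len(a) and a[n] == b[n]:
        n += 1
    return n

def route_step(local: str, dest: str, leaf_set: list[str],
               table: dict[tuple[int, int], str]) -> str:
    # One Pastry routing hop. Fixed-length lowercase hex ids compare
    # lexicographically in the same order as numerically.
    dist = lambda n: abs(int(n, 16) - int(dest, 16))
    nodes = leaf_set + [local]
    if min(nodes) <= dest <= max(nodes):
        # Destination lies within the leaf set range: deliver to the
        # numerically closest member.
        return min(nodes, key=dist)
    l = shared_prefix_len(local, dest)
    d = int(dest[l], 16)  # value of the l-th digit of dest
    if (l, d) in table:
        return table[(l, d)]  # usual case: match one more digit
    # Rare case: forward to any known node that shares at least as long
    # a prefix and is numerically closer to dest than the local node.
    for n in leaf_set + list(table.values()):
        if shared_prefix_len(n, dest) >= l and dist(n) < dist(local):
            return n
    return local  # no better node known: this node is the key's root
```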
10 Pastry: Routing properties
- log_{2^b} N steps
- O(log N) state
[Figure: Route(d46a1c) from node 65a1fc via d13da3, d4213f and d462ba to d467c4, matching one more digit of the key at each hop]
11 Pastry: Routing
- Integrity of the overlay: guaranteed unless |L|/2 nodes with adjacent nodeIds fail simultaneously
- Number of routing hops:
  - no failures: < log_{2^b} N expected, 128/b + 1 max
  - during failure recovery: O(N) worst case, average case much better
12 Pastry: Locality properties
- Assumption: a scalar proximity metric, e.g. ping/RTT delay, # IP hops, or geographic distance, and a node can probe its distance to any other node
- Proximity invariant: each routing table entry refers to a node close to the local node (in the proximity space), among all nodes with the appropriate nodeId prefix
13 Pastry: Geometric routing in proximity space
- The proximity distance traveled by the message in each routing step is exponentially increasing (the entry in row l is chosen from a set of nodes of size N/2^(bl))
- The distance traveled by the message from its source increases monotonically at each step (the message takes exponentially larger strides each step)
[Figure: the route of Route(d46a1c) shown both in the proximity space and in the nodeId space]
14 Pastry: Locality properties
Simulations show:
- the expected distance traveled by a message in the proximity space is within a small constant of the minimum
- among the k nodes with nodeIds closest to the key, the message is likely to reach the node closest to the source node first:
  - the nearest copy in 76% of lookups
  - one of the two nearest copies in 92% of lookups
15 Pastry: Self-organization
- Initializing and maintaining routing tables and leaf sets
- Node addition
- Node departure (failure)
- Goal: maintain every routing table entry so that it refers to a nearby node, among all live nodes with the appropriate prefix
16 Pastry: Node addition
- New node X contacts a nearby node A
- A routes a "join" message with key X, which arrives at Z, the live node numerically closest to X
- X obtains its leaf set from Z, and the i-th row of its routing table from the i-th node on the path from A to Z (see the sketch below)
- X informs any nodes that need to be aware of its arrival
- X also improves the locality of its table by requesting neighborhood sets from all nodes it knows
- In practice: an optimistic approach
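A sketch of how X could assemble its initial routing table from the join path; the node objects and their routing_row accessor are hypothetical:

```python
def initial_routing_table(x_id: str, join_path: list) -> dict:
    # The i-th node on the path from A to Z shares at least i digits
    # with X (A is position 0 and need not share any), so its row i is
    # a reasonable first approximation of X's row i.
    table = {}
    for i, node in enumerate(join_path):
        for digit, entry in node.routing_row(i).items():  # hypothetical accessor
            if digit != int(x_id[i], 16):  # X's own digit needs no entry
                table.setdefault((i, digit), entry)
    return table
```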
17 Pastry: Node addition
[Figure: new node X=d46a1c contacts nearby node A=65a1fc; the join message Route(d46a1c) passes d13da3, d4213f and d462ba, and arrives at Z=d467c4]
18 Pastry: Node addition (locality)
- X is close to A, and B's row-1 entries (B1) are close to B. Why is X also close to B1?
- Because the expected distance from B to its row-1 entries (B1) is much larger than the expected distance from A to B: row-i entries are chosen from a set whose size decreases exponentially with i, so the short X-A-B leg is negligible by comparison
[Figure: the join route shown in both the proximity space and the nodeId space]
19 Node departure (failure)
- Leaf set repair (eager, performed all the time):
  - leaf set members exchange keep-alive messages
  - if a node in the leaf set fails, request the leaf set of the furthest live node in the set, update the leaf set, and notify the nodes that were added to it
- Routing table repair (lazy, upon failure): request the entry from peers in the same row; if none is found, from nodes in higher rows
- Neighborhood set repair (eager): periodically contact neighbors; if a neighbor has failed, obtain the neighbor lists of other neighbors, probe the distances, and update the set with the closest nodes found
20 Randomized routing
- So far routing is deterministic: if a node on the routing path has failed or refuses to pass the message on, retransmitting will not help
- At each step, the message must be forwarded to a node whose id shares at least as long a prefix with the key, but is numerically closer to it than the current node
- If there are several such candidate nodes, choose one at random, heavily biased towards the closest
- If routing fails, the client retransmits; the randomization gives the retry a chance to avoid the faulty node
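One way to implement that bias, shown as a sketch; the inverse-distance weighting is an assumption, as the slide does not fix a particular distribution:

```python
import random

def pick_next_hop(candidates: list[str], dest: str) -> str:
    # Weight each admissible next hop by the inverse of its numeric
    # distance to the key: the closest node is chosen most of the time,
    # but a retransmitted message has a chance to avoid a faulty node.
    dist = lambda n: abs(int(n, 16) - int(dest, 16)) + 1
    weights = [1.0 / dist(c) for c in candidates]
    return random.choices(candidates, weights=weights, k=1)[0]
```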
21 Pastry: Distance traveled
- |L| = 16, 100k random queries, proximity measured in an emulated network, nodes placed randomly
- Routes are 30%-40% longer than the optimum
- Not bad, considering that a Pastry node stores only about 75 routing table entries, instead of the 99,999 entries of a complete routing table
22 Pastry: Summary
- Generic p2p overlay network
- Scalable, fault resilient, self-organizing, secure
- O(log N) routing steps (expected)
- O(log N) routing table size
- Network locality properties
23 Storage management and caching in PAST, a large-scale, persistent peer-to-peer storage utility
24 INTRODUCTION
PAST: an Internet-based, peer-to-peer global storage utility
Characteristics:
- strong persistence and high availability (k replicas)
- scalability (efficient Pastry routing)
- short insert and query paths
- query load balancing and latency reduction (wide dispersion, Pastry locality, caching)
- security
Composed of nodes connected to the Internet, each with a 128-bit nodeId
Uses Pastry as an efficient routing scheme
No support for mutable files, searching, or directory lookup
25 INTRODUCTION
Function of nodes:
- store replicas of files
- initiate and route client requests to insert or retrieve files in PAST
File-related properties:
- inserted files have a quasi-unique fileId, and each file is replicated across multiple nodes
- to retrieve a file, a client must know its fileId and decryption key (if necessary)
- fileId: 160 bits, computed as the SHA-1 hash of the file name, the owner's public key, and a random salt
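A sketch of the fileId computation; the slide specifies the three inputs, but the exact encoding below is an assumption:

```python
import hashlib
import os

def make_file_id(name: str, owner_public_key: bytes) -> bytes:
    # 160-bit fileId: SHA-1 over the file name, the owner's public key,
    # and a random salt. The salt makes the id quasi-unique and lets a
    # client re-salt the same file on a later insert attempt.
    salt = os.urandom(20)
    h = hashlib.sha1()
    h.update(name.encode())
    h.update(owner_public_key)
    h.update(salt)
    return h.digest()  # 20 bytes = 160 bits
```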
26 PAST operations
Insert: fileId = Insert(name, owner-credentials, k, file)
1. the fileId is computed (hash of the file name, owner's public key, and salt)
2. the request message reaches one of the k nodes closest to fileId
3. that node accepts a replica of the file and forwards the message to the other k-1 closest nodes in its leaf set
4. once k nodes accept, an "ack" message with store receipts is passed to the client; clients can be "charged" for the storage
Lookup: file = Lookup(fileId)
- retrieves a copy of the file if it was inserted earlier and one of the k nodes that store it is connected to the network; the "closest" node will usually provide the copy
Reclaim: Reclaim(fileId, owner-credentials)
- afterwards, retrieval of the file is no longer guaranteed; unlike a "delete" operation, Reclaim does not guarantee that the file becomes inaccessible. These weaker semantics simplify the algorithm
27 STORAGE MANAGEMENT: why?
- Ensure availability of files
- Balance the storage load
- Provide graceful degradation in performance as the system (globally) runs out of storage
28 STORAGE MANAGEMENT: responsibility
- Replicas of a file are maintained by the k nodes with nodeIds closest to the fileId, which is good for availability and for locality-aware retrieval
- This creates a conflict: what if these k nodes have insufficient storage space, while other nodes have enough?
- Challenge: balance the free storage space among nodes
- Causes of such load imbalance:
  - the number of files assigned to each node differs
  - the size of each inserted file differs
  - the storage capacity of each node differs
- Solution: replica diversion + file diversion
29 STORAGE MANAGEMENT: replica diversion
Purpose: balance the remaining free storage space among the nodes in a leaf set
Diversion steps of a node A that received an insertion request but has insufficient space:
1. choose a node B from A's leaf set such that
   - B does not already hold a diverted replica of this file
   - B is not one of the k closest nodes (where the file will be stored anyway)
2. ask B to store a copy
3. enter the file into A's table with a pointer to B
4. send a store receipt as usual
30 Replica diversion, continued
1. if B fails, the replica should be stored elsewhere (described later)
2. if A fails, the replica at B should remain available; otherwise the probability that all k replicas are inaccessible doubles with each replica diversion
For (2): ask C, the (k+1)-th closest node, to keep a pointer to B
- if A fails, the k closest nodes still hold (or point to) replicas
- if C fails, A asks the new (k+1)-th closest node to keep a pointer to B
Cost: A and C each store an additional entry in their file tables (the pointer to B), plus a few additional RPCs during insert and lookup
31 Replica diversion, continued
A node rejects a file if file_size / remaining_storage > t, i.e. if the file would consume more than a fraction t of the remaining storage on the node
- primary replica stores (among the k closest) use t = t_pri
- diverted replica stores (not among the k closest) use t = t_div
- t_pri > t_div
Properties of this policy (sketched below):
- avoids unnecessary diversion when a node still has space
- prefers diverting large files, minimizing the number of diversions
- prefers accepting primary replicas over diverted replicas
A primary store A that rejects the file diverts it to B, where:
- B is a node in the leaf set of A
- B is not already a primary or diverted store for this replica
- B has the most free space among all such candidate nodes
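The policy as a sketch; the node objects with a free_space attribute are illustrative, and already_storing is assumed to contain the k primary nodes as well as any existing diverted stores:

```python
T_PRI, T_DIV = 0.1, 0.05  # thresholds used in the paper's experiments

def accepts(file_size: int, free_space: int, primary: bool) -> bool:
    # Reject a file that would consume more than a fraction t of the
    # node's remaining storage; primary stores use the laxer t_pri.
    t = T_PRI if primary else T_DIV
    return free_space > 0 and file_size / free_space <= t

def pick_divert_target(leaf_set: list, file_size: int, already_storing: set):
    # Choose B: a leaf-set node not already storing this replica, with
    # the most free space, that accepts the file under t_div.
    candidates = [n for n in leaf_set if n not in already_storing]
    for n in sorted(candidates, key=lambda n: n.free_space, reverse=True):
        if accepts(file_size, n.free_space, primary=False):
            return n
    return None  # no candidate accepts: triggers a file diversion
```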
32 Replica diversion, continued
If the chosen node B also rejects the replica:
- nodes that already stored a replica discard it
- a negative ack message is returned to the client, causing a file diversion
33 STORAGE MANAGEMENT: file diversion
Purpose: balance the remaining free storage space among different portions of the nodeId space in the network
- The client node generates a new fileId using a different salt value and reissues the file insert
- The insert is attempted at most 4 times; if the fourth attempt fails:
  - reduce the file size by fragmenting it
  - reduce k (the number of replicas)
34 STORAGE MANAGEMENT: node strategy to maintain k replicas
- In Pastry, neighboring nodes in the nodeId space exchange keep-alive messages; on a timeout:
  - remove the failed node from the leaf set
  - include the live node with the next closest nodeId
- A change in the leaf set affects the replicas: if the failed node stored a file (as a primary or diverted replica holder), the primary store(s) assign another node to keep the replica
- There might not be space in the leaf set for another replica; in that case the number of replicas may temporarily drop below k
- To cope with the failure of a primary that diverted a replica, the diversion pointers are replicated
- Optimization: a joining node may, instead of requesting and copying a replica, install a pointer to the previous replica holder (a node that is no longer in the leaf set) in its file table, as in replica diversion, followed by gradual migration of the replica
35 STORAGE MANAGEMENT: fragmenting and file encoding
- Instead of replication it is possible to use erasure coding, for example Reed-Solomon
- Suppose the file has n blocks; to tolerate m failures, replication needs m extra copies of the file, i.e. m*n extra blocks
- Instead, we can add m checksum blocks, such that any n blocks out of the n+m can restore the file; this approach fragments the file
- Although erasure coding may seem strictly better than replication, it has its disadvantages (next slide)
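The storage arithmetic, as a worked comparison:

```python
def replication_extra_blocks(n: int, m: int) -> int:
    # Tolerating m failures with whole-file replication costs m extra
    # copies of all n blocks.
    return m * n

def erasure_extra_blocks(n: int, m: int) -> int:
    # An (n+m, n) erasure code (e.g. Reed-Solomon) adds only m checksum
    # blocks; any n of the n+m blocks reconstruct the file.
    return m

# Example: a 16-block file that must survive 4 failures.
print(replication_extra_blocks(16, 4))  # 64 extra blocks
print(erasure_extra_blocks(16, 4))      # 4 extra blocks
```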
36 Erasure coding vs. replication
Pros of erasure coding:
- improves the balance of disk utilization in the system
- same availability for much less storage (or much more availability for the same storage)
- should probably be preferred when failures are frequent
Cons of erasure coding:
- with replication, the object can be downloaded from the replica closest to the client, whereas with coding the download latency is bounded by the distance to the n-th closest fragment
- encoding and decoding add complexity to the system design
- the whole object needs to be downloaded and reconstructed (with replication, a single block can be downloaded on its own)
- higher network load (several nodes must be contacted to retrieve a file)
37 CACHING
Goal: minimize client access latency, maximize query throughput, and balance the query load
- The k replicas are kept mainly for availability, although they also help balance the access load, and proximity-aware routing minimizes access latency. Sometimes that is not enough:
  - a popular object may require far more than k replicas to sustain its load while keeping access time and network traffic low
  - if a file is popular among a cluster of clients, it is better to keep a copy near that cluster
38 CACHING, continued
- Caching: create and maintain additional copies of highly popular files in the "unused" disk space of nodes; evict cached files when the storage is needed
- Cache performance therefore decreases as system utilization increases
- During a successful insertion or lookup, the file is inserted into the cache of all nodes along the route (unless it is larger than some fraction c of the node's free storage)
- GreedyDual-Size (GD-S) replacement policy:
  - a weight w_f = cost(f)/size(f) is assigned to each cached file
  - the file with the lowest w_f is replaced
  - this w_f is subtracted from the weights of all remaining cached files
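A minimal sketch of the GD-S policy as described above; cost(f) is left abstract (e.g. a constant 1 if the goal is pure hit ratio):

```python
class GDSCache:
    # GreedyDual-Size replacement, following the slide's formulation:
    # evict the file with the lowest w_f = cost(f)/size(f), then subtract
    # that weight from all remaining cached files (an aging step).
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.used = 0
        self.weights: dict[str, float] = {}  # fileId -> w_f
        self.sizes: dict[str, int] = {}      # fileId -> size

    def insert(self, fid: str, size: int, cost: float) -> None:
        if size > self.capacity:
            return  # too large to cache at all
        while self.used + size > self.capacity:
            victim = min(self.weights, key=self.weights.get)
            w_min = self.weights.pop(victim)
            self.used -= self.sizes.pop(victim)
            for f in self.weights:
                self.weights[f] -= w_min
        self.weights[fid] = cost / size
        self.sizes[fid] = size
        self.used += size
```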
39 EXPERIMENTAL RESULTS: effects of storage management
Policy: accept a file if file_size / free_space < t
- No diversion (t_pri = 1, t_div = 0): max utilization 60.8%, 51.1% of inserts failed
- Replica/file diversion (t_pri = 0.1, t_div = 0.05): max utilization > 98%, < 1% of inserts failed
- The leaf set size shows the effect of local load balancing
40 EXPERIMENTAL RESULTS: determining threshold values
Policy: accept a file if file_size / free_space < t
- Insertion statistics and utilization as t_pri is varied (with t_div = 0.05), and as t_div is varied (with t_pri = 0.1)
- The lower t_pri, the less likely it is that a large file can be stored, so many small files are stored instead: the number of stored files increases, but utilization drops, since large files are rejected at low utilization levels
- Similarly, as t_div increases, storage utilization improves but fewer files are successfully inserted, for the same reasons
41 EXPERIMENTAL RESULTS: impact of file and replica diversion
- The number of replica diversions is small even at high utilization: at 80% utilization fewer than 10% of replicas are diverted, and fewer than 16% of all replicas are diverted at 95% utilization
- As long as utilization is below 95%, a file is rarely diverted more than once, and file diversions are very rare
42 EXPERIMENTAL RESULTS: caching
Global cache hit ratio and average number of message hops:
- as storage utilization and the number of files increase, cached files are replaced by replicas, so the cache hit ratio decreases
- a lower hit ratio means more routing hops (however, no caching is still worse, even at 99% utilization)
- with 2,250 nodes, the expected route length is log_16 2250 ≈ 3 hops
- without caching, the number of hops stays constant until diversion begins, at which point one more hop is required
43 CONCLUSION
- Design and evaluation of PAST: storage management and caching
- Nodes and files are assigned uniformly distributed ids; replicas of a file are stored at the k nodes closest to its fileId
- Experimental results:
  - storage utilization of 98% achieved
  - below 5% file insertion failures at 95% utilization, with mostly large files rejected
  - caching achieves load balancing and reduces fetch distance and network traffic