Presentation transcript: "Storage management and caching in PAST, a large-scale, persistent peer-to-peer storage utility"

1 Storage management and caching in PAST, a large-scale, persistent peer-to-peer storage utility. Gabi Kliot, Computer Science Department, Technion. Topics in Reliable Distributed Computing, 21/11/2004. Partially borrowed from Peter Druschel's presentation.

2 Outline
 Introduction
 Pastry overview
 PAST overview
 Storage management
 Caching
 Experimental results
 Conclusion

3 Sources
"Storage management and caching in PAST, a large-scale persistent peer-to-peer storage utility", Antony Rowstron (Microsoft Research), Peter Druschel (Rice University)
"Pastry: scalable, decentralized object location and routing for large-scale peer-to-peer systems", Antony Rowstron (Microsoft Research), Peter Druschel (Rice University)

4 PASTRY

5 Pastry
Generic p2p location and routing substrate (DHT)
Self-organizing overlay network (joins, departures, locality repair)
Consistent hashing
Lookup/insert of an object in < log_{2^b} N routing steps (expected)
O(log N) per-node state
Network locality heuristics
Scalable, fault resilient, self-organizing, locality aware, secure

6 Pastry: API
nodeId = pastryInit(Credentials, Application): join the local node to the Pastry network
route(M, X): route message M to the node with nodeId numerically closest to X
Application callbacks:
  deliver(M): deliver message M to the application
  forwarding(M, X): message M is being forwarded towards key X
  newLeaf(L): report a change in leaf set L to the application
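A minimal sketch of how an application might use this API, assuming a hypothetical Python binding; the class and method names below are illustrative, not the actual Pastry implementation.

    import random

    class PastryApplication:
        # Application-side callbacks invoked by the Pastry substrate.

        def deliver(self, msg):
            # Called when msg arrives at the node whose nodeId is
            # numerically closest to the message key.
            print("delivered:", msg)

        def forwarding(self, msg, key):
            # Called on each intermediate node while msg is routed towards key.
            pass

        def new_leaf(self, leaf_set):
            # Called whenever the local leaf set changes.
            pass

    class PastryNode:
        def __init__(self, credentials, application):
            # pastryInit: join the local node to the Pastry network and obtain
            # a nodeId (here just a placeholder 128-bit random id).
            self.node_id = random.getrandbits(128)
            self.app = application

        def route(self, msg, key):
            # route(M, X): forward M towards the node with nodeId numerically
            # closest to key X (routing itself is omitted in this sketch).
            self.app.forwarding(msg, key)
            self.app.deliver(msg)

    node = PastryNode(credentials=None, application=PastryApplication())
    node.route({"type": "ping"}, key=random.getrandbits(128))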

7 Pastry: Object distribution
Consistent hashing over a 128-bit circular id space (0 to 2^128 - 1)
nodeIds (uniform random)
objIds/keys (uniform random)
Invariant: the node with the numerically closest nodeId maintains the object
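A small sketch of the closest-nodeId invariant on the 128-bit circular id space; the hash choice and helper names are assumptions made purely for illustration.

    import hashlib

    ID_BITS = 128
    ID_SPACE = 1 << ID_BITS

    def make_id(data: bytes) -> int:
        # Derive a uniformly distributed 128-bit id (truncated SHA-256 here,
        # only for illustration).
        return int.from_bytes(hashlib.sha256(data).digest()[:16], "big")

    def ring_distance(a: int, b: int) -> int:
        # Distance on the circular id space 0 .. 2^128 - 1.
        d = abs(a - b)
        return min(d, ID_SPACE - d)

    def responsible_node(obj_id: int, node_ids):
        # Invariant: the node with nodeId numerically closest to objId
        # maintains the object.
        return min(node_ids, key=lambda n: ring_distance(n, obj_id))

    nodes = [make_id(f"node-{i}".encode()) for i in range(100)]
    print(hex(responsible_node(make_id(b"some object"), nodes)))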

8 Pastry: Object insertion/lookup
A message with key X is routed to the live node with nodeId closest to X
Problem: a complete routing table per node is not feasible

9 Pastry: Routing tradeoff
O(log N) routing table size: about 2^b * log_{2^b} N + 2L entries
O(log N) message forwarding steps

10 Pastry: Routing table (example nodeId 10233102)
log_{2^b} N rows (at most log_{2^b} 2^128 = 128/b), 2^b columns
L nodes in the leaf set
L neighbors (neighborhood set)
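A sketch of this table layout in Python, with b = 4 (hexadecimal digits) chosen arbitrarily; the class and helper names are assumptions, not Pastry's actual data structures.

    import random

    B = 4                      # digits are base 2^b; b = 4 gives hexadecimal digits
    BASE = 1 << B
    DIGITS = 128 // B          # rows needed to cover a 128-bit id (128/b)

    def digit(node_id: int, l: int) -> int:
        # l-th digit of node_id, counting from the most significant digit.
        return (node_id >> ((DIGITS - 1 - l) * B)) & (BASE - 1)

    def shared_prefix_len(a: int, b: int) -> int:
        l = 0
        while l < DIGITS and digit(a, l) == digit(b, l):
            l += 1
        return l

    class RoutingTable:
        # Row l holds nodes whose id shares the first l digits with the local
        # node; the column is the value of digit l in the entry's id.
        def __init__(self, local_id: int):
            self.local_id = local_id
            self.rows = [[None] * BASE for _ in range(DIGITS)]

        def add(self, node_id: int):
            l = shared_prefix_len(self.local_id, node_id)
            if l < DIGITS:
                self.rows[l][digit(node_id, l)] = node_id

    rt = RoutingTable(random.getrandbits(128))
    for _ in range(50):
        rt.add(random.getrandbits(128))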

11 Pastry: Leaf sets
Each node maintains the IP addresses of the L/2 numerically closest larger and the L/2 numerically closest smaller nodeIds.
 routing efficiency/robustness
 fault detection (keep-alive)
 application-specific local coordination

12 Pastry: Routing procedure
if (destination D is within range of our leaf set)
    forward to the numerically closest leaf set member
else
    let l = length of the prefix shared with D
    let d = value of the l-th digit of D
    if (routing table entry R[l][d] exists)
        forward to R[l][d]
    else
        forward to a known node (from the leaf set, routing table or neighborhood set) that
        (a) shares at least as long a prefix with D, and
        (b) is numerically closer to D than this node
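A self-contained sketch of this routing decision (again with b = 4); the flat data layout and helper names are simplifying assumptions, not the paper's pseudocode.

    B = 4
    DIGITS = 128 // B

    def digit(x: int, l: int) -> int:
        return (x >> ((DIGITS - 1 - l) * B)) & ((1 << B) - 1)

    def shared_prefix_len(a: int, b: int) -> int:
        l = 0
        while l < DIGITS and digit(a, l) == digit(b, l):
            l += 1
        return l

    def next_hop(local_id, key, leaf_set, routing_table, known_nodes):
        # 1. If the key falls within the leaf set range, go straight to the
        #    numerically closest node (possibly the local node itself).
        if leaf_set and min(leaf_set) <= key <= max(leaf_set):
            return min(leaf_set + [local_id], key=lambda n: abs(n - key))
        # 2. Otherwise use routing table entry R[l][d]: l = shared prefix
        #    length, d = the key's digit at position l.
        l = shared_prefix_len(local_id, key)
        entry = routing_table.get((l, digit(key, l)))
        if entry is not None:
            return entry
        # 3. Rare case: fall back to any known node (leaf set, routing table
        #    or neighborhood set) that shares at least as long a prefix with
        #    the key and is numerically closer to it than the local node.
        better = [n for n in known_nodes
                  if shared_prefix_len(n, key) >= l and abs(n - key) < abs(local_id - key)]
        return min(better, key=lambda n: abs(n - key)) if better else local_id

    # Example: with the key inside the leaf set range, routing goes directly there.
    print(hex(next_hop(0x10, 0x25, [0x20, 0x30], {}, [0x20, 0x30])))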

13 Pastry: Routing properties
log_{2^b} N steps, O(log N) state
[Figure: Route(d46a1c) from node 65a1fc reaches d467c4 via d13da3, d4213f and d462ba]

14 Pastry: Routing
Integrity of the overlay: guaranteed unless L/2 nodes with adjacent nodeIds fail simultaneously
Number of routing hops:
  No failures: < log_{2^b} N expected, 128/b + 1 maximum
  During failure recovery: O(N) worst case, average case much better

15 Pastry: Locality properties
Assumption: a scalar proximity metric, e.g. ping/RTT delay, # IP hops (traceroute), subnet masks; a node can probe its distance to any other node
Proximity invariant: each routing table entry refers to a node close to the local node (in the proximity space), among all nodes with the appropriate nodeId prefix.

16 Pastry: Geometric routing in proximity space
[Figure: the route Route(d46a1c) shown both in nodeId space and in the proximity space]
 The proximity distance traveled by the message in each routing step increases exponentially (the entry in row l is chosen from a set of nodes of size N/2^{bl})
 The distance traveled by the message from its source increases monotonically at each step (the message takes larger and larger strides)

17 Pastry: Locality properties
Each routing step is local, but there is no guarantee of a globally shortest path.
Nevertheless, simulations show:
  The expected distance traveled by a message in the proximity space is within a small constant of the minimum
  Among the k nodes with nodeIds closest to the key, the message is likely to reach the node closest to the source node first

18 Pastry: Self-organization
Initializing and maintaining routing tables and leaf sets
Node addition
Node departure (failure)
Goal: keep every routing table entry referring to a nearby node, among all live nodes with the appropriate prefix

19 Pastry: Node addition
A new node X contacts a nearby node A
A routes a "join" message with key X, which arrives at Z, the node with nodeId closest to X
X obtains its leaf set from Z and the i-th row of its routing table from the i-th node encountered along the route from A to Z
X informs any nodes that need to be aware of its arrival
X also improves the locality of its table by requesting neighborhood sets from all nodes it knows
In practice: an optimistic approach is used

20 Pastry: Node addition
[Figure: new node X = d46a1c joins via nearby node A = 65a1fc; the join message with key d46a1c is routed to Z = d467c4, the node closest to X]

21 Pastry: Node addition (locality)
[Figure: the join route for new node d46a1c shown in both nodeId space and proximity space]
X is close to A, and B is close to B's row-one entries (B1). Why is X also close to B1? Because the expected distance from B to its row-one entries is much larger than the expected distance from A to B (entries are chosen from sets whose size decreases exponentially with the row number).

22 Pastry: Node departure (failure)
Leaf set repair (eager, done all the time):
  leaf set members exchange keep-alive messages
  request the leaf set from the furthest live node in the set
Routing table repair (lazy, upon failure): request an entry from peers in the same row; if none is found, from nodes in higher rows
Neighborhood set repair (eager)

23 Pastry: Security
Secure nodeId assignment
Randomized routing: pick a random node among all potential candidates
Byzantine fault-tolerant leaf set membership protocol

24 Pastry: Distance traveled
[Figure: distance traveled; setup: |L| = 16, 100k random queries, proximity measured in an emulated network, nodes placed randomly]

25 Pastry: Summary
Generic p2p overlay network
Scalable, fault resilient, self-organizing, secure
O(log N) routing steps (expected)
O(log N) routing table size
Network locality properties

26 PAST

27 INTRODUCTION
 The PAST system: an Internet-based, peer-to-peer global storage utility
 Characteristics:
   strong persistence and high availability (by storing k replicas)
   scalability (due to efficient Pastry routing)
   short insert and query paths
   query load balancing and latency reduction (due to wide dispersion, Pastry locality and caching)
   security
 Composed of nodes connected to the Internet; each node has a 128-bit nodeId
 Uses Pastry as an efficient routing scheme
 No support for mutable files, searching, or directory lookup

28 INTRODUCTION
 Functions of a node:
   store replicas of files
   initiate and route client requests to insert or retrieve files in PAST
 File-related properties:
   inserted files have a quasi-unique fileId
   a file is replicated across multiple nodes
   to retrieve a file, a client must know its fileId and decryption key (if necessary)
 fileId: 160 bits, computed as the SHA-1 hash of the file name, the owner's public key, and a random salt
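A short sketch of the fileId computation described above; the exact encoding of the fields and the salt length are assumptions.

    import hashlib, os

    def make_file_id(name: str, owner_public_key: bytes):
        salt = os.urandom(8)                 # random salt; a new salt is chosen on file diversion
        h = hashlib.sha1()
        h.update(name.encode())              # file name
        h.update(owner_public_key)           # owner's public key
        h.update(salt)                       # random salt
        return int.from_bytes(h.digest(), "big"), salt   # 160-bit fileId

    file_id, salt = make_file_id("report.pdf", b"---owner public key---")
    print(f"{file_id:040x}")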

29 PAST operations
 Insert: fileId = Insert(name, owner-credentials, k, file)
   1. fileId is computed (hash of the file name, the owner's public key, etc.)
   2. the request message reaches one of the k nodes closest to fileId
   3. that node accepts a replica of the file and forwards the message to the k-1 other closest nodes in its leaf set
   4. once all k nodes have accepted, an 'ack' message with a store receipt is returned to the client
 Lookup: file = Lookup(fileId)
 Reclaim: Reclaim(fileId, owner-credentials)
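A toy, in-memory sketch of the Insert flow: the request reaches the node closest to fileId, which together with the k-1 next closest nodes (its leaf-set neighbours) stores a replica and acknowledges. The flat node dictionary and names are illustrative assumptions, not PAST's actual protocol.

    def insert(nodes, file_id, file_bytes, k):
        # nodes: dict nodeId -> local store (dict fileId -> bytes)
        closest = sorted(nodes, key=lambda n: abs(n - file_id))[:k]
        receipts = []
        for node_id in closest:
            nodes[node_id][file_id] = file_bytes   # each of the k nodes stores a replica
            receipts.append(node_id)               # stand-in for a signed store receipt
        return receipts                            # client's 'ack' once k nodes accepted

    nodes = {i * 1000: {} for i in range(10)}
    print(insert(nodes, file_id=4321, file_bytes=b"hello", k=3))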

30 STORAGE MANAGEMENT: why?
 Responsibilities:
   replicas of a file must be maintained by the k nodes with nodeIds closest to its fileId
   free storage space must be balanced among the nodes in PAST
 Conflict: the k closest nodes may have insufficient storage while neighboring nodes have plenty
 Causes of load imbalance (3 sources of variation):
   number of files assigned to each node
   size of each inserted file
   storage capacity of each node
 Resolution: replica diversion and file diversion

31 STORAGE MANAGEMENT: Replica diversion
 GOAL: balance the remaining free storage space among the nodes in a leaf set
 Diversion steps taken by node A (which received an insertion request but has insufficient space):
   1. choose a node B from A's leaf set that is not among the k closest nodes and does not already hold a diverted replica
   2. ask B to store a copy
   3. enter a file entry in A's table with a pointer to B
   4. send the store receipt as usual

32 STORAGE MANAGEMENT: Replica diversion
 Policy for accepting a replica at a node:
   a node rejects a file if file_size / remaining_free_space > t
   the threshold t is t_pri for primary replicas and t_div for diverted replicas (t_pri > t_div)
 This policy:
   avoids unnecessary diversion while a node still has space
   prefers diverting large files, minimizing the number of diversions
   prefers accepting primary replicas over diverted replicas
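A minimal sketch of this acceptance test, using the threshold values from the paper's experiments (t_pri = 0.1, t_div = 0.05); the function shape is an illustrative assumption.

    T_PRI = 0.1    # threshold for primary replicas
    T_DIV = 0.05   # stricter threshold for diverted replicas

    def accept_replica(file_size: int, free_space: int, primary: bool) -> bool:
        if free_space <= 0:
            return False
        t = T_PRI if primary else T_DIV
        # Files that are large relative to the node's remaining space are
        # rejected (and hence diverted); small files are accepted as long as
        # the node has room, avoiding unnecessary diversions.
        return file_size / free_space <= t

    print(accept_replica(file_size=50, free_space=1000, primary=True))    # True: accept
    print(accept_replica(file_size=200, free_space=1000, primary=True))   # False: divert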

33 STORAGE MANAGEMENT: File diversion
 GOAL: balance the remaining free storage space among the nodes of the whole PAST network
 Triggered when all k closest nodes and their leaf sets have insufficient space
 The client node generates a new fileId using a different salt value and retries
 Retry limit: 3 times; after a fourth failure, the client falls back to fragmenting the file into smaller pieces

34 STORAGE MANAGEMENT: maintaining k replicas
 In Pastry, neighboring nodes exchange keep-alive messages
 If a node has been silent for a period T, the leaf set nodes:
   remove the failed node from their leaf sets
   include the live node with the next closest nodeId
 File handling when nodes join or drop out of a leaf set:
   if the failed node was one of the k replica holders for some file (primary or diverted), the replicas it held are re-created
   to cope with the failure of a diverting node, the diversion pointers themselves are replicated
 Optimization: instead of immediately requesting all the replicas it is now responsible for, a joining node may install pointers to the previous replica holders in its file table (as in replica diversion) and then migrate the files gradually

35 STORAGE MANAGEMENT: Fragmenting and file encoding
 Reed-Solomon encoding can be used to increase availability
 Fragmentation:
   improves the evenness of disk utilization
   improves bandwidth through parallel downloads
   but incurs higher latency, since several nodes must be contacted for retrieval

36 CACHING
 GOALS: minimize client access latency, maximize query throughput, balance the query load
 Create and maintain additional copies of highly popular files in the "unused" disk space of nodes
 A file is cached on the nodes along the route of a successful insertion or lookup
 GreedyDual-Size (GD-S) replacement policy:
   each cached file f is assigned a weight H_f = cost(f) / size(f)
   the file with the lowest H_f is replaced
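A minimal sketch of GreedyDual-Size replacement as described above: evict the cached file with the lowest H and age the survivors by that amount (the usual inflation formulation). Cost values and sizes are illustrative assumptions.

    class GDSCache:
        def __init__(self, capacity: int):
            self.capacity = capacity
            self.used = 0
            self.entries = {}                # fileId -> (H value, size)

        def insert(self, file_id, size, cost=1.0):
            # Evict lowest-H files until the new file fits.
            while self.used + size > self.capacity and self.entries:
                victim = min(self.entries, key=lambda f: self.entries[f][0])
                h_min, v_size = self.entries.pop(victim)
                self.used -= v_size
                # Age the survivors so recently useful files are not starved.
                self.entries = {f: (h - h_min, s) for f, (h, s) in self.entries.items()}
            if size <= self.capacity:
                self.entries[file_id] = (cost / size, size)   # H_f = cost(f) / size(f)
                self.used += size

    cache = GDSCache(capacity=100)
    cache.insert("a", size=40)
    cache.insert("b", size=40)
    cache.insert("c", size=40)               # forces eviction of the lowest-H file
    print(sorted(cache.entries))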

37 Security in PAST
 Smartcards with a private/public key scheme ensure the integrity of nodeId and fileId assignment
 Defenses against a malicious node:
   store receipts prevent storing fewer than k replicas
   file certificates allow verifying the authenticity of file content
   file privacy is provided by client-side encryption
   routing table entries are signed
   the routing scheme is randomized, to limit denial-of-service attacks
 A malicious node cannot be completely prevented from suppressing valid entries

38 EXPERIMENTAL RESULTS: Effects of storage management
 Policy: accept a file if file_size / free_space < t
 No diversion (t_pri = 1, t_div = 0):
   maximum utilization 60.8%
   51.1% of inserts failed
   leaf set size: shows the effect of local load balancing
 Replica/file diversion (t_pri = 0.1, t_div = 0.05):
   maximum utilization > 98%
   < 1% of inserts failed

39 EXPERIMENTAL RESULTS: Determining threshold values
 Policy: accept a file if file_size / free_space < t
 Insertion statistics and utilization as t_pri is varied (t_div = 0.05):
   as t_pri increases, fewer files are successfully inserted, but higher storage utilization is achieved
   the lower t_pri, the less likely it is that a large file can be stored, so many small files are stored instead; utilization drops because large files are rejected at low utilization levels
 Insertion statistics and utilization as t_div is varied (t_pri = 0.1):
   as t_div increases, storage utilization improves, but fewer files are successfully inserted

40 EXPERIMENTAL RESULTS: Impact of file and replica diversion
 File diversions are negligible at storage utilization below 83%
 The number of replica diversions is small even at high utilization: at 80% utilization, fewer than 10% of replicas are diverted
 => the overhead imposed by replica and file diversion is small as long as utilization stays below 95%

41 EXPERIMENTAL RESULTS: File insertion failures
 File insertion failures vs. storage utilization
 The failure ratio starts rising at about 90% utilization
 Failed insertions are heavily biased towards large files; smaller files continue to be inserted successfully

42 EXPERIMENTAL RESULTS: Caching
 Global cache hit ratio and average number of message hops
 The hit ratio drops as storage utilization and the number of files increase, since cached copies are evicted to make room
 As the hit ratio falls, the number of routing hops rises towards the Pastry bound of log_16 2250 ≈ 3
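For reference (assuming b = 4, so routing digits are hexadecimal, and the 2250 nodes implied by the slide), the expected hop count is log_16 2250 = ln 2250 / ln 16 ≈ 2.8, i.e. about 3 hops.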

43 CONCLUSION
 Design and evaluation of PAST: storage management and caching
 Nodes and files are assigned uniformly distributed IDs
 Replicas of a file are stored at the k nodes closest to its fileId
 Experimental results:
   storage utilization of 98% is achieved
   low file insertion failure ratio even at high storage utilization
   effective caching achieves load balancing

44 Weaknesses
 Does not support mutable files (read only)
 No searching or directory lookup
 A local fault in a network segment may leave a functioning node unable to contact the outside world, since its routing table is mostly local
 No direct support for anonymity or confidentiality
 Breaking a large node apart: is it good or bad?
 The simulation is too sterile
 No experimental comparison of PAST to other systems

45 Comparison to other systems

46 Comparison
 Pastry compared to Freenet and Gnutella:
   guaranteed answer in a bounded number of steps, while retaining the scalability of Freenet and the self-organization of Freenet and Gnutella
 Pastry compared to Chord:
   Chord makes no explicit effort to achieve good network locality
 PAST compared to OceanStore:
   PAST has no support for mutable files, searching, or directory lookup
   more sophisticated storage semantics could be built on top of PAST
 Pastry (and Tapestry) are similar to Plaxton:
   routing based on prefixes, a generalization of hypercube routing
   Plaxton is not self-organizing; one node is associated with each file, hence a single point of failure

47 Comparison
 PAST compared to FarSite:
   FarSite has traditional file system semantics and a distributed directory service to locate content
   every node maintains a partial list of live nodes, from which it chooses nodes to store replicas
   the LAN assumptions of FarSite may not hold in a wide-area environment
 PAST compared to CFS:
   CFS is built on top of Chord
   a file sharing medium; block oriented, read only
   each block is stored on multiple nodes with adjacent Chord nodeIds; popular blocks are cached
   increased file retrieval overhead, but parallel block retrieval is good for large files
   CFS assumes an abundance of free disk space
   it relies on hosting multiple logical nodes (with separate ids) on one physical Chord node in order to accommodate nodes with large storage capacity => increased query overhead

48 Comparison
 PAST compared to LAND:
   an expected constant number of outgoing links at each node
   a constant number of pointers to each object
   a constant bound on distortion (stretch): the accumulated route cost divided by the distance cost
   the choice of links enforces a distance upper bound at each stage of the route
   LAND uses a two-tier architecture with super-nodes

49 The END

