
1 Wide-Area Cooperative Storage with CFS
Presented by Hakim Weatherspoon, CS294-4: Peer-to-Peer Systems
CFS is by Frank Dabek, M. Frans Kaashoek, David Karger, Robert Morris (MIT), and Ion Stoica (Berkeley)
Slides liberally borrowed from the SOSP 2001 CFS presentation and from "High Availability, Scalable Storage, Dynamic Peer Networks: Pick Two" by Rodrigo Rodrigues, Charles Blake, and Barbara Liskov, presented at HotOS 2003
[Figure: participating nodes connected to each other across the Internet]

2 Design Goals
- Spread the storage burden evenly (avoid hot spots)
- Tolerate unreliable participants
- Fetch speed comparable to whole-file TCP
- Avoid O(#participants) algorithms
  – e.g., centralized mechanisms [Napster], broadcasts [Gnutella]
- Simplicity
  – Does simplicity imply provable correctness? More precisely, could you build CFS correctly?
  – What about performance?
- CFS attempts to solve these challenges
  – Does it?

3 CFS Summary
- CFS provides peer-to-peer read-only storage
- Structure: DHash and Chord
- Claims to be efficient, robust, and load-balanced
  – Does CFS achieve any of these qualities?
- It uses block-level distribution
- The prototype is as fast as whole-file TCP
- Storage promise ⇒ redundancy promise
  – ⇒ data must move as members leave!
  – ⇒ lower bound on bandwidth usage

4 Client-server interface
- Files have unique names
- Files are read-only (single writer, many readers)
- Publishers split files into blocks and place the blocks into a hash table
- Clients check files for authenticity [SFSRO]
[Figure: a client's FS layer issues "insert file f" / "lookup file f" to a CFS server, which translates them into "insert block" / "lookup block" operations on other nodes]
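
A minimal sketch of the publisher and client sides of this interface, assuming a plain Python dict stands in for the distributed hash table and using the 8 KByte block size mentioned in the experiments later:

    import hashlib

    BLOCK_SIZE = 8 * 1024  # 8 KByte blocks, as in the CFS prototype

    def insert_file(data: bytes, dht: dict) -> list[str]:
        """Publisher side: split the file into blocks and insert each
        block keyed by the SHA-1 hash of its contents ("insert block")."""
        block_ids = []
        for off in range(0, len(data), BLOCK_SIZE):
            block = data[off:off + BLOCK_SIZE]
            block_id = hashlib.sha1(block).hexdigest()
            dht[block_id] = block
            block_ids.append(block_id)
        return block_ids

    def lookup_file(block_ids: list[str], dht: dict) -> bytes:
        """Client side: fetch each block ("lookup block"), check it against
        its content hash, and reassemble the file."""
        parts = []
        for block_id in block_ids:
            block = dht[block_id]
            assert hashlib.sha1(block).hexdigest() == block_id  # authenticity check
            parts.append(block)
        return b"".join(parts)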

5 Server Structure
- DHash stores, balances, replicates, and caches blocks
- DHash uses Chord [SIGCOMM 2001] to locate blocks
- Why blocks instead of files? Easier load balance (remember the complexity of PAST)
[Figure: every node runs a DHash layer stacked on top of a Chord layer]

6 CFS file system structure
- The root block is identified by a public key
  – and signed by the corresponding private key
- Other blocks are identified by the hash of their contents
- What is wrong with this organization?
  – The path of blocks from a data block up to the root block is modified on every update.
  – This is okay because the system is read-only.
[Figure: the root block (public key + signature) points to directory block D via H(D); D points to inode block F via H(F); F points to data blocks B1 and B2 via H(B1) and H(B2)]
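
A small sketch of how this tree might be built and checked, assuming an in-memory dict stands in for DHash; the root block's public-key signature is only indicated in a comment, since real signing is outside this sketch:

    import hashlib

    store: dict[str, bytes] = {}          # stand-in for DHash: block ID -> contents

    def put(block: bytes) -> str:
        """Content-hash blocks: a block's name is the SHA-1 of its contents."""
        block_id = hashlib.sha1(block).hexdigest()
        store[block_id] = block
        return block_id

    def get_checked(block_id: str) -> bytes:
        """Any non-root block is self-certifying: re-hash it and compare
        against the name it was fetched under."""
        block = store[block_id]
        if hashlib.sha1(block).hexdigest() != block_id:
            raise ValueError("block contents do not match their content hash")
        return block

    # Build the tree from the figure, bottom-up.
    h_b1 = put(b"data block B1")
    h_b2 = put(b"data block B2")
    h_f = put(f"inode F: {h_b1} {h_b2}".encode())   # inode block holds H(B1), H(B2)
    h_d = put(f"dir D: {h_f}".encode())             # directory block holds H(F)
    # The root block would contain h_d plus a signature over it made with the
    # publisher's private key; clients verify the signature with the public
    # key that names the root block (crypto omitted in this sketch).
    print(get_checked(h_d))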

7 DHash/Chord Interface
- lookup() returns a list of node IDs closer in ID space to the block ID
  – Sorted, closest first
[Figure: on each server, DHash issues Lookup(blockID) to Chord and receives a list of node IDs; Chord maintains a finger table of node IDs]
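
Expressed as a hypothetical interface (the names here are assumptions, not taken from the paper), the contract between the two layers is roughly:

    from typing import Protocol

    class ChordLayer(Protocol):
        def lookup(self, block_id: int) -> list[int]:
            """Return node IDs closer to block_id in the circular ID space,
            sorted closest-first."""
            ...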

8 DHash Uses Other Nodes to Locate Blocks
[Figure: a Lookup(BlockID=45) issued at one node is forwarded in three numbered hops across the Chord ring (nodes N5 through N110) until it reaches the node responsible for block 45]

9 Storing Blocks
- Long-term blocks are stored for a fixed time
  – Publishers need to refresh them periodically
- The cache uses LRU replacement
[Figure: a node's disk is divided into an LRU cache and long-term block storage]
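
A minimal sketch of the cache side, assuming a byte-budgeted LRU built on an OrderedDict; the long-term store (fixed lifetime, refreshed by publishers) is only noted in a comment:

    from collections import OrderedDict

    class LRUBlockCache:
        """Least-recently-used blocks are evicted once the byte budget is
        exceeded. (Long-term blocks would live in a separate store with an
        expiry time that publishers refresh; not modeled here.)"""

        def __init__(self, capacity_bytes: int):
            self.capacity = capacity_bytes
            self.used = 0
            self.blocks: OrderedDict[str, bytes] = OrderedDict()

        def get(self, block_id: str) -> bytes | None:
            block = self.blocks.get(block_id)
            if block is not None:
                self.blocks.move_to_end(block_id)      # mark as recently used
            return block

        def put(self, block_id: str, block: bytes) -> None:
            if block_id in self.blocks:
                self.used -= len(self.blocks.pop(block_id))
            self.blocks[block_id] = block
            self.used += len(block)
            while self.used > self.capacity:           # evict LRU victims
                _, evicted = self.blocks.popitem(last=False)
                self.used -= len(evicted)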

10 Replicate blocks at r successors
- r = 2 log N
- Node IDs are the SHA-1 of the node's IP address
  – Ensures independent replica failure
[Figure: Block 17 is held by its successor on the ring and replicated at the successors that follow it]
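
A sketch of replica placement under these rules; the exact serialization hashed into a node ID and whether the primary counts toward r are assumptions of this sketch:

    import hashlib
    from bisect import bisect_left
    from math import ceil, log2

    def sha1_id(text: str, bits: int = 160) -> int:
        return int.from_bytes(hashlib.sha1(text.encode()).digest(), "big") % (2 ** bits)

    def replica_nodes(block_id: int, ring: list[int], r: int) -> list[int]:
        """The block's successor plus the nodes that follow it clockwise,
        r nodes in total (`ring` is the sorted list of node IDs)."""
        start = bisect_left(ring, block_id) % len(ring)
        return [ring[(start + i) % len(ring)] for i in range(min(r, len(ring)))]

    ips = [f"10.0.0.{i}" for i in range(1, 13)]        # hypothetical 12-node system
    ring = sorted(sha1_id(ip) for ip in ips)           # node ID = SHA-1(IP address)
    r = ceil(2 * log2(len(ring)))                      # r = 2 log N
    print(replica_nodes(sha1_id("some block"), ring, r))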

11 Lookups find replicas
- RPCs in the figure:
  1. Lookup step
  2. Get successor list
  3. Failed block fetch
  4. Block fetch
[Figure: a Lookup(BlockID=17) is routed around the ring; when the fetch from one replica of Block 17 fails, the client retries at the next node on the successor list]

12 First Live Successor Manages Replicas
- A node can locally determine that it is the first live successor
[Figure: Block 17's first live successor stores the block and maintains a copy of 17 at the nodes that follow it on the ring]
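
A sketch of that local check, assuming the node knows only its own ID and its current (live) predecessor's ID:

    def in_interval(x: int, a: int, b: int) -> bool:
        """True if x lies in the circular interval (a, b]."""
        return (a < x <= b) if a < b else (x > a or x <= b)

    def is_first_live_successor(block_id: int, my_id: int, predecessor_id: int) -> bool:
        # A node is the first live successor of block_id exactly when the block
        # falls between its live predecessor and itself on the ring, so the
        # decision needs no communication with other nodes.
        return in_interval(block_id, predecessor_id, my_id)

    print(is_first_live_successor(17, my_id=20, predecessor_id=10))   # True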

13 DHash Copies to Caches Along Lookup Path
- RPCs in the figure:
  1. Chord lookup
  2. Chord lookup
  3. Block fetch
  4. Send to cache
[Figure: after fetching Block 45, the requesting node sends a copy back to a node it traversed during the lookup, which caches it]

14 Virtual Nodes Allow Heterogeneity
- Hosts may differ in disk/net capacity
- Hosts may advertise multiple IDs
  – Chosen as SHA-1(IP address, index)
  – Each ID represents a "virtual node"
- Host load is proportional to its number of virtual nodes
- Manually controlled Sybil attack!
[Figure: Node A runs virtual nodes N60, N10, and N101, while Node B runs only N5]
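
A sketch of deriving virtual-node IDs; how the (IP address, index) pair is serialized before hashing is an assumption here:

    import hashlib

    def virtual_node_ids(ip: str, count: int, bits: int = 160) -> list[int]:
        """One ID per virtual node, each derived as SHA-1(IP address, index)."""
        ids = []
        for index in range(count):
            digest = hashlib.sha1(f"{ip},{index}".encode()).digest()
            ids.append(int.from_bytes(digest, "big") % (2 ** bits))
        return ids

    # A better-provisioned host advertises more virtual nodes and therefore
    # takes a proportionally larger share of the load.
    print(virtual_node_ids("10.0.0.1", count=4))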

15 Experiment (12 nodes)! (pre-PlanetLab)
- One virtual node per host
- 8 KByte blocks
- RPCs use UDP
- Hosts spread across sites including vu.nl, lulea.se, ucl.uk, kaist.kr, and .ve
- Caching turned off
- Proximity routing turned off

16 CFS Fetch Time for 1MB File
- Average over the 12 hosts
- No replication, no caching; 8 KByte blocks
[Plot: fetch time (seconds) vs. prefetch window (KBytes)]

17 Distribution of Fetch Times for 1MB
[Plot: fraction of fetches vs. time (seconds) for 8 KByte, 24 KByte, and 40 KByte prefetch windows]

18 CFS Fetch Time vs. Whole File TCP
[Plot: fraction of fetches vs. time (seconds), comparing a 40 KByte prefetch window against whole-file TCP]

19 Robustness vs. Failures
- Six replicas per block; (1/2)^6 ≈ 0.016
[Plot: fraction of failed lookups vs. fraction of failed nodes]

20 Revisit Assumptions
- P2P purist ideals
  – Cooperation, symmetry, decentralization
- How realistic are these assumptions? In what domains are they valid?
[Figure: a fault-tolerant DHT is supposed to turn 10s to 100s of GB per node of idle, cheap disk, on hosts ranging from slow and flaky to fast and stable, into a distributed data store with all the *ilities: high availability, good scalability, high reliability, maintainability, flexibility]

21 BW for Redundancy Maintenance
- Assume the average system size N is stable
  – P(leave)/time = leaves/time/N = 1/lifetime
  – Join rate = leave-forever rate = 1/lifetime
  – Leaves induce redundancy replacement: replacement size × replacement rate
  – Joins cost the same
- ⇒ Maintenance BW > 2 × space / lifetime
- ⇒ Space/node < ½ × BW/node × lifetime
- Quality WAN storage scales with WAN BW and member quality
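
A small worked sketch of the bound; the bandwidth budget and session lifetime below are hypothetical, purely for illustration:

    def max_servable_space(bw_bytes_per_sec: float, lifetime_sec: float) -> float:
        """Space/node < 1/2 * BW/node * lifetime: each departure forces its
        data to be re-replicated, and joins cost about as much again."""
        return 0.5 * bw_bytes_per_sec * lifetime_sec

    bw = 1_000_000 / 8          # a node spending 1 Mbps on maintenance (assumed)
    lifetime = 7 * 24 * 3600    # a one-week membership lifetime (assumed)
    space = max_servable_space(bw, lifetime)
    print(f"at most ~{space / 1e9:.0f} GB of unique data servable per node")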

22 BW for Redundancy Maintenance II
- Maintenance BW ≈ 200 Kbps
- Lifetime = median 2001 Gnutella session = 1 hour
- Served space = 90 MB/node << donatable storage!

23 Peer Dynamics
- The peer-to-peer "dream"
  – Reliable storage from many unreliable components
- Robust lookup is perceived as critical
- Bandwidth to maintain redundancy is the hard problem [Blake and Rodrigues 03]

24 Need Too Much BW to Maintain Redundancy
- 10M users; 25% availability; 1-week membership; 100 GB donation => 50 kbps
- Wait! It gets worse… HW trends
- High availability, scalable storage, dynamic membership: must pick two

25 Proposal #1: Server-to-Server DHTs
- Reduce to a solved problem? Not really…
  – Self-configuration
  – Symmetry
  – Scalability
  – Dynamic load balance

26 Proposal #2: Complete routing information
- Possible complications:
  – Memory requirements
  – Bandwidth [Gupta, Liskov, Rodrigues 03]
  – Load balance
- Multi-hop optimization makes sense only when many very dynamic members serve a little data
  – (multi-hop not required if N < memory per host / 40 bytes)
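
A back-of-the-envelope check of that last point, using the slide's 40 bytes per routing entry and an assumed 1 GB of memory set aside for routing state:

    bytes_per_entry = 40                 # per-member routing state, from the slide
    routing_memory = 1 * 1024 ** 3       # 1 GB budget (assumed for illustration)
    max_nodes_with_full_tables = routing_memory // bytes_per_entry
    # Below this system size a node can simply remember every other member,
    # so multi-hop routing buys nothing.
    print(f"complete routing tables fit up to N ~= {max_nodes_with_full_tables:,} nodes")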

27 Proposal #3: Decouple the networking layer from the data layer
- Layer of indirection
  – a.k.a. a distributed directory, location pointers, etc.
- Combines a little of proposals #1 and #2
  – The DHT no longer decides who, what, when, where, why, and how for storage maintenance.
  – Separate policy from mechanism.
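
A minimal sketch of the indirection idea, with a plain dict standing in for the lookup layer; the names here are illustrative, not taken from any of the proposals:

    # The lookup layer stores only location pointers; the blocks themselves
    # live wherever a separate storage-maintenance policy decides.
    directory: dict[str, list[str]] = {}        # block ID -> addresses of holders

    def publish_pointer(block_id: str, holder: str) -> None:
        directory.setdefault(block_id, []).append(holder)

    def locate(block_id: str) -> list[str]:
        # Answers "where is it?"; fetching, replication, and repair are the
        # storage policy's job, not the DHT's.
        return directory.get(block_id, [])

    publish_pointer("block-17", "node-A:4000")
    publish_pointer("block-17", "node-B:4000")
    print(locate("block-17"))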

28 Appendix: Chord Hashes a Block ID to its Successor
- Circular ID space; nodes and blocks have randomly distributed IDs
- Successor: the node with the next-highest ID
[Figure: on a ring with nodes N10, N32, N60, N80, N100, each node holds the block IDs up to its own ID: N32 holds B11, B30; N60 holds B33, B40, B52; N80 holds B65, B70; N100 holds B100; N10 holds B112, B120, …, B10]
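
A sketch of the successor mapping, reproducing the assignment in the figure:

    from bisect import bisect_left

    def successor(block_id: int, ring: list[int]) -> int:
        """First node ID at or after block_id, wrapping around the ring
        (`ring` is the sorted list of node IDs)."""
        i = bisect_left(ring, block_id)
        return ring[i % len(ring)]

    ring = sorted([10, 32, 60, 80, 100])            # node IDs from the figure
    for b in (11, 30, 33, 40, 52, 65, 70, 100, 112, 120):
        print(f"B{b} -> N{successor(b, ring)}")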

29 Appendix: Basic Lookup
- Lookups walk the ring to find the ID's predecessor; its successor pointer gives the answer
- Correct if the successor pointers are correct
[Figure: "Where is block 70?" is forwarded around the ring of nodes N5 through N110 until the answer "N80" comes back]
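
A sketch of that walk over the ring in the figure (the data structures here are stand-ins, not the real RPC protocol):

    def in_interval(x: int, a: int, b: int) -> bool:
        """True if x lies in the circular interval (a, b]."""
        return (a < x <= b) if a < b else (x > a or x <= b)

    def basic_lookup(start: int, block_id: int, succ: dict[int, int]) -> int:
        """Follow successor pointers until the current node precedes the
        block ID; that node's successor is responsible for the block."""
        n = start
        while not in_interval(block_id, n, succ[n]):
            n = succ[n]
        return succ[n]

    nodes = [5, 10, 20, 32, 40, 60, 80, 99, 110]    # the ring from the figure
    succ = {n: nodes[(i + 1) % len(nodes)] for i, n in enumerate(nodes)}
    print(basic_lookup(5, 70, succ))                 # -> 80, as in the figure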

30 Appendix: Successor Lists Ensure Robust Lookup
- Each node stores r successors, r = 2 log N
- A lookup can skip over dead nodes to find blocks
[Figure: each node on the ring keeps its next three successors, e.g. N5 stores (10, 20, 32), N10 stores (20, 32, 40), N20 stores (32, 40, 60), …, N99 stores (110, 5, 10), N110 stores (5, 10, 20)]
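
A sketch of the skip-over-dead-nodes step, using the successor lists shown in the figure:

    def next_live_hop(node: int, successors: dict[int, list[int]], alive: set[int]) -> int:
        """Pick the first live entry on the node's successor list; the hop
        only fails if all r successors die at the same time."""
        for s in successors[node]:
            if s in alive:
                return s
        raise RuntimeError("all successors failed simultaneously")

    successors = {5: [10, 20, 32], 10: [20, 32, 40], 20: [32, 40, 60],
                  32: [40, 60, 80], 40: [60, 80, 99], 60: [80, 99, 110],
                  80: [99, 110, 5], 99: [110, 5, 10], 110: [5, 10, 20]}
    alive = set(successors) - {10, 20}               # suppose N10 and N20 died
    print(next_live_hop(5, successors, alive))       # -> 32, skipping both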

31 Appendix: Chord Finger Table Allows O(log N) Lookups
- See Chord [SIGCOMM 2001] for table maintenance
[Figure: node N80's fingers point to the nodes roughly ½, ¼, 1/8, 1/16, 1/32, 1/64, and 1/128 of the ID space ahead of it]
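
A sketch of how a finger table is derived; a tiny 7-bit ID space is assumed so the ½, ¼, … distances from the figure stay readable:

    from bisect import bisect_left

    ID_BITS = 7                        # toy ID space 0..127 for readability
    RING_SIZE = 2 ** ID_BITS

    def successor(ident: int, nodes: list[int]) -> int:
        i = bisect_left(nodes, ident % RING_SIZE)
        return nodes[i % len(nodes)]

    def finger_table(n: int, nodes: list[int]) -> list[int]:
        """finger[i] = successor(n + 2^i): targets 1, 2, 4, ... up to half
        the ring ahead of n, which is what yields O(log N) lookup hops."""
        return [successor(n + 2 ** i, nodes) for i in range(ID_BITS)]

    nodes = sorted([5, 10, 20, 32, 40, 60, 80, 99, 110])
    print(finger_table(80, nodes))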

