1 peer-to-peer file systems
Presented by: Serge Kreiker

2 “P2P” in the Internet
Napster: a peer-to-peer file-sharing application that let Internet users exchange files directly
A simple idea, hugely successful: the fastest-growing Web application, with 50 million+ users by January 2001; shut down in February 2001
Similar systems and startups followed in rapid succession: Gnutella, Freenet, and others

3 Napster: a peer registers (xyz.mp3, its address) with the central Napster server (diagram)

4 Napster: a peer sends the query "xyz.mp3 ?" to the central Napster server (diagram)

5 Napster: the central Napster server resolves the query "xyz.mp3 ?" and the peers transfer the file directly (diagram)

6 Gnutella: no central server; peers form an overlay (diagram)

7 Gnutella: the query "xyz.mp3 ?" is flooded to neighboring peers (diagram)

8 Gnutella: the query propagates hop by hop through the overlay (diagram)

9 Gnutella: a peer holding xyz.mp3 responds and the file is transferred directly (diagram)

10 So far
Centralized (Napster): table size O(N), number of hops O(1)
Flooded queries (Gnutella): table size O(1), number of hops O(N)

11 Storage management systems: challenges
Fully distributed: nodes have identical capabilities and responsibilities
Anonymity
Storage management: spread the storage burden evenly; tolerate unreliable participants
Robustness: survive massive failures; resist DoS attacks, censorship, and other node failures
Cache management: cache additional copies of popular files

12 Routing challenges
Efficiency: O(log N) messages per lookup, where N is the total number of servers
Scalability: O(log N) state per node
Robustness: surviving massive failures

13 We are going to look at
PAST (Rice and Microsoft Research; routing substrate: Pastry)
CFS (MIT; routing substrate: Chord)

14 What is PAST?
An archival storage and content distribution utility, not a general-purpose file system
Stores multiple replicas of files
Caches additional copies of popular files in the local file system

15 How it works
Built over a self-organizing, Internet-based overlay network
Based on the Pastry routing scheme
Offers persistent storage for replicated, read-only files
Owners can insert and reclaim files; clients only look up

16 PAST nodes
The collection of PAST nodes forms an overlay network
Minimally, a PAST node is an access point
Optionally, it contributes to storage and participates in routing

17 PAST operations
fileId = Insert(name, owner-credentials, k, file);
file = Lookup(fileId);
Reclaim(fileId, owner-credentials);

18 Insertion
fileId is computed as a secure hash of the file name, the owner's public key, and a salt
The file is stored on the k nodes whose nodeIds are numerically closest to the 128 most significant bits of the fileId
How do we map key IDs to node IDs? Use Pastry
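As a rough illustration (not from the slides), the fileId derivation might look like the sketch below; SHA-1 is assumed as the "secure hash", and all names are illustrative only.

import hashlib
import os

def make_file_id(name: str, owner_public_key: bytes, salt: bytes) -> int:
    digest = hashlib.sha1(name.encode() + owner_public_key + salt).digest()
    # Keep the 128 most significant bits of the 160-bit SHA-1 digest.
    return int.from_bytes(digest, "big") >> (160 - 128)

file_id = make_file_id("xyz.mp3", owner_public_key=b"<owner public key bytes>", salt=os.urandom(20))
# The file is then stored on the k nodes whose nodeIds are numerically closest to file_id.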

19 Insert, continued
The required storage is debited against the owner's storage quota
A file certificate is returned, signed with the owner's private key; it contains the fileId, a hash of the content, the replication factor, and other fields
The file and certificate are routed via Pastry
Each of the k replica-storing nodes attaches a store receipt
An ack is sent back after all k nodes have accepted the file

20 Example: insert a file with fileId = 117, k = 4 (diagram)
Node 200 (the source) inserts file 117; the insert is routed via Pastry and a store receipt is returned to the source
Node 122 is one of the 4 nodes closest to 117; node 125 is reached first because it is the node nearest to 200 in the network
Other nodes shown near the destination: 115, 120, 124

21 Lookup & Reclaim
Lookup: Pastry locates a "near" node that has a copy and retrieves it
Reclaim: weak consistency; after a reclaim, a lookup is no longer guaranteed to retrieve the file, but it is not guaranteed that the file is no longer available

22 Pastry: peer-to-peer routing
Provides generic, scalable indexing, data location, and routing
Inspired by Plaxton's algorithm (used in web content distribution, e.g. Akamai) and landmark hierarchy routing
Goals: efficiency, scalability, fault resilience, self-organization (completely decentralized)

23 Pastry: how does it work?
Each node has a unique nodeId; each message has a key
Both are uniformly distributed and lie in the same namespace
A Pastry node routes a message to the node whose nodeId is closest to the key
The number of routing steps is O(log N), and Pastry takes network locality into account
PAST uses the fileId as the key and stores the file on the k closest nodes

24 Pastry: nodeId space
Each node is assigned a 128-bit node identifier (nodeId)
The nodeId is assigned randomly when the node joins the system (e.g. as the SHA-1 hash of its IP address or its public key)
As a result, nodes with adjacent nodeIds are diverse in geography, ownership, network attachment, etc.
nodeIds and keys are interpreted as digits in base 2^b; b is a configuration parameter with a typical value of 4

25 Pastry: nodeId space (diagram)
128 bits => at most 2^128 nodes, arranged in a circular namespace (2^128 wraps around to 0)
A nodeId is a sequence of L base-2^b (b-bit) digits, i.e. L levels with b = 128/L bits per level

26 Pastry: node state (1)
Each node maintains a routing table R, a neighborhood set M, and a leaf set L
The routing table is organized into log_{2^b} N rows with 2^b - 1 entries each
The entry in row n contains the IP address of a nearby node whose nodeId matches the present node's nodeId in the first n digits and differs in digit n+1
The choice of b is a tradeoff between the size of the routing table and the length of routes
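A minimal sketch of how a routing-table cell could be located for a candidate node, assuming b = 4 and 128-bit nodeIds; the helper names are illustrative, not Pastry's actual code.

B = 4                       # each digit is base 2**B (hexadecimal when b = 4)
DIGITS = 128 // B           # number of digits in a 128-bit nodeId

def to_digits(node_id: int) -> list:
    # nodeId as a sequence of base-2^b digits, most significant digit first.
    return [(node_id >> (B * (DIGITS - 1 - i))) & (2**B - 1) for i in range(DIGITS)]

def shared_prefix_len(a: int, b: int) -> int:
    da, db = to_digits(a), to_digits(b)
    n = 0
    while n < DIGITS and da[n] == db[n]:
        n += 1
    return n

def routing_cell(my_id: int, other_id: int):
    # Row = length of the shared nodeId prefix; column = the other node's next digit.
    row = shared_prefix_len(my_id, other_id)
    if row == DIGITS:
        raise ValueError("identical nodeIds: no routing entry needed")
    return row, to_digits(other_id)[row]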

27 Pastry: node state (2)
Neighborhood set M: the nodeIds and IP addresses of M nearby nodes, chosen by network proximity
Leaf set L: the set of L nodes with nodeIds closest to the current node, divided in two: the L/2 numerically closest larger and the L/2 numerically closest smaller nodeIds
Typical values for |L| and |M| are 2^b

28 Example: nodeId = 10233102, b = 2, so nodeIds are 16 bits long; all numbers are in base 4 (routing-table figure)

29 Pastry: routing requests
Route(my-id, key-id, message):
if key-id is in the range of my leaf set, forward to the numerically closest node in the leaf set;
else forward to a node in the routing table whose node-id shares a longer prefix with key-id than my-id does;
if no such entry exists, forward to a known node whose node-id shares a prefix with key-id at least as long as my-id's and is numerically closer to key-id
Routing takes O(log N) messages
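A simplified sketch of this forwarding decision, reusing to_digits and shared_prefix_len from the routing-table sketch above; leaf_set, routing_table (a list of row dicts) and known_nodes stand in for the node's real state, and wrap-around of the circular namespace is ignored for brevity.

def route(my_id, key_id, leaf_set, routing_table, known_nodes):
    if my_id == key_id:
        return my_id
    # Case 1: key falls within the span of the leaf set -> deliver to the
    # numerically closest node among the leaves and ourselves.
    if leaf_set and min(leaf_set) <= key_id <= max(leaf_set):
        return min(leaf_set | {my_id}, key=lambda n: abs(n - key_id))
    # Case 2: use the routing-table entry that shares one more digit with the key.
    row = shared_prefix_len(my_id, key_id)
    col = to_digits(key_id)[row]
    entry = routing_table[row].get(col)
    if entry is not None:
        return entry
    # Case 3 (rare): any known node with an equally long (or longer) shared prefix
    # that is numerically closer to the key than we are.
    candidates = [n for n in known_nodes
                  if shared_prefix_len(n, key_id) >= row
                  and abs(n - key_id) < abs(my_id - key_id)]
    return min(candidates, key=lambda n: abs(n - key_id)) if candidates else my_id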

30 Routing example (diagram): b = 2, l = 4, key = 1230
The message is routed from source node 2331 via intermediate nodes (1331, 1211, ...) to the destination 1233, matching one more digit of the key at each hop; the routing-table rows (X0, X1, X2) and the leaf set (L) consulted at each step are shown in the figure

31 Pastry: node addition
X is the joining node; A is a node near X in terms of network proximity; Z is the node with the nodeId numerically closest to X
State of X:
leaf-set(X) = leaf-set(Z)
neighborhood-set(X) = neighborhood-set(A)
routing table of X, row i = routing table of Ni, row i, where Ni is the ith node encountered along the route from A to Z
X notifies all nodes in leaf-set(X), which update their state
(The diagram shows an example with X joining via A and a lookup routed toward Z.)
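Purely as an illustration of how X's initial state is assembled (all attribute shapes here are assumptions, not Pastry's API):

def initial_state_for(A, Z, route_nodes):
    # A, Z and route_nodes are node objects with leaf_set, neighborhood_set and
    # routing_table attributes (assumed shapes); route_nodes[i] is the i-th node
    # encountered on the route from A to Z.
    return {
        "leaf_set": set(Z.leaf_set),                  # Z is numerically closest to X
        "neighborhood_set": set(A.neighborhood_set),  # A is physically near X
        "routing_table": [dict(route_nodes[i].routing_table[i])
                          for i in range(len(route_nodes))],
    }
# X then sends its arrival to every node in its new leaf set so they can update their state.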

32 X joins the system, first stage (diagram)
A join message with key X is routed from A through B and C to Z
X takes its neighborhood set (M-set) from A, row 0 of its routing table from A (A0), row 1 from B (B1), row 2 from C (C2), and its leaf set from Z

33 Pastry: node failures and recovery
Pastry relies on a soft-state protocol to deal with node failures
Nodes that are neighbors in the nodeId space periodically exchange keepalive messages; a node that is unresponsive for a period T is removed from leaf sets
A recovering node contacts its last known leaf set, updates its own leaf set, and notifies the members of its presence
Randomized routing deals with malicious nodes that cause repeated query failures

34 Security
Each PAST node and each user of the system holds a smartcard
A private/public key pair is associated with each card
Smartcards generate and verify certificates and maintain storage quotas

35 More on security
Smartcards ensure the integrity of nodeId and fileId assignments
Store receipts prevent a malicious node from creating fewer than k copies
File certificates allow storage nodes and clients to verify the integrity and authenticity of stored content and to enforce the storage quota

36 Storage management
Based on local coordination among nodes with nearby nodeIds
Responsibilities:
Balance the free storage among nodes
Maintain the invariant that the replicas of each file are stored on the k nodes closest to its fileId

37 Causes of storage imbalance & solutions
The number of files assigned to each node may vary
The sizes of the inserted files may vary
The storage capacities of PAST nodes differ
Solutions: replica diversion and file diversion

38 Replica diversion
Recall: each node maintains a leaf set, the l nodes with nodeIds numerically closest to the given node
If a node A cannot accommodate a copy locally, it considers replica diversion:
A chooses a node B in its leaf set and asks it to store the replica
A then enters a pointer to B's copy in its table and issues a store receipt

39 Policies for accepting a replica
If (file size / remaining free storage) > t, reject
t is a fixed threshold; it has different values for primary replicas (on nodes among the k numerically closest) and diverted replicas (on nodes in the same leaf set, but not among the k closest)
t(primary) > t(diverted)
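A minimal sketch of this acceptance test; the threshold values below are assumptions for illustration, not the values used by PAST.

T_PRIMARY = 0.1     # assumed threshold for primary replicas
T_DIVERTED = 0.05   # assumed threshold for diverted replicas; t(primary) > t(diverted)

def accept_replica(file_size: int, free_storage: int, is_primary: bool) -> bool:
    # Reject if the file would consume too large a fraction of the remaining free space.
    if free_storage <= 0:
        return False
    threshold = T_PRIMARY if is_primary else T_DIVERTED
    return (file_size / free_storage) <= threshold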

40 File diversion
When one of the k nodes declines to store a replica, replica diversion is tried first
If the node chosen for the diverted replica also declines, the entire file is diverted
A negative ack is sent; the client generates another fileId and starts again
After three rejections, the insert is aborted and the user is notified

41 Maintaining replicas
Pastry uses keepalive messages and adjusts the leaf set after failures; the same adjustment takes place on a join
What happens to the copies stored by a failed node?
What about the copies stored by a node that leaves or enters a leaf set?

42 Maintaining replicas, continued
To maintain the invariant (k copies), replicas have to be re-created in the previous cases, which is a big overhead
Proposed solution for joins: lazy re-creation
First insert a pointer to the node that holds the copies, then migrate them gradually

43 Caching
The k replicas are maintained in PAST for availability
The fetch distance is measured in overlay network hops (which need not reflect distance in the underlying network)
Caching is used to improve performance

44 Caching, continued
PAST nodes use the "unused" portion of their advertised disk space to cache files
When storing a new primary or diverted replica, a node evicts one or more cached copies as needed
How it works: a file that is routed through a node by Pastry (insert or lookup) is inserted into the local cache if its size is less than c, where c is a fraction of the node's current cache size

45 Evaluation
PAST is implemented in Java; the network is emulated, with all nodes running in a single JavaVM
Two workloads (based on NLANR traces) for file sizes
Four normal distributions of node storage sizes

46 Key results
Storage: replica and file diversion improve global storage utilization from 60.8% to 98%, and insertion failures drop from 51% to below 5%. Caveat: the storage capacities used in the experiment are about 1000 times below what might be expected in practice.
Caching: the number of routing hops with caching is lower than without caching, even at 99% storage utilization. Caveat: the median file size is very low; caching performance would likely degrade with larger files.

47 CFS: introduction
A peer-to-peer read-only storage system
Decentralized architecture, focusing mainly on efficiency of data access, robustness, load balance, and scalability
Provides a distributed hash table for block storage and uses Chord to map keys to nodes
Does not provide anonymity or strong protection against malicious participants
The focus is on providing an efficient and robust lookup and storage layer with simple algorithms

48 CFS software structure (diagram)
The CFS client stacks FS over DHash over Chord; each CFS server stacks DHash over Chord
The layers communicate through a local API within a node and an RPC API between nodes

49 CFS: layer functionalities
The client file system uses the DHash layer to retrieve blocks
The server and client DHash layers use the Chord layer to locate the servers that hold desired blocks
The server DHash layer is responsible for storing keyed blocks, maintaining proper levels of replication as servers come and go, and caching popular blocks
The DHash and Chord layers interact in order to integrate looking up a block identifier with checking for cached copies of the block

50 The client identifies the root block using a public key generated by the publisher: it uses the public key as the root block identifier to fetch the root block, and checks the validity of the block using its signature
The file's inode key is obtained by the usual search through directory blocks; these contain the keys of the file inode blocks, which are used to fetch the inode blocks
The inode block contains the block numbers and their corresponding keys, which are used to fetch the data blocks
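An illustrative sketch of that fetch path. DHash is modeled as a plain dict from keys to blocks and signature checking is stubbed out; none of these names are the real CFS API.

def verify_signature(block, public_key) -> bool:
    return True  # placeholder: a real client checks the publisher's signature here

def fetch_file(dhash: dict, public_key, path: list) -> bytes:
    # 1. The publisher's public key identifies the root block.
    root = dhash[public_key]
    assert verify_signature(root, public_key)
    # 2. Walk directory blocks to find the file's inode key, then fetch the inode block.
    block = root
    for name in path:
        block = dhash[block["entries"][name]]   # the last lookup yields the inode block
    # 3. The inode block lists the keys of the data blocks.
    return b"".join(dhash[key] for key in block["block_keys"])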

51 CFS: properties
Decentralized control: no administrative relationship between servers and publishers
Scalability: a lookup uses space and messages at most logarithmic in the number of servers
Availability: a client can retrieve data as long as at least one replica is reachable using the underlying network
Load balance: for large files, achieved by spreading blocks over many servers; for small files, blocks are cached at the servers involved in the lookup
Persistence: once data is inserted, it is available for the agreed-upon interval
Quotas: implemented by limiting the amount of data inserted by any particular IP address
Efficiency: the delay of file fetches is comparable with FTP, thanks to efficient lookup, pre-fetching, caching, and server selection

52 Chord
Consistent hashing maps a node's IP address + virtual host number to an m-bit node identifier, and maps block keys into the same m-bit identifier space
The node responsible for a key is the successor of the key's ID, with wrap-around in the m-bit identifier space
Consistent hashing balances the keys so that all nodes share the load roughly equally with high probability, and keys move minimally as nodes enter and leave the network
For scalability, Chord uses a distributed version of consistent hashing in which nodes maintain only O(log N) state and use O(log N) messages per lookup with high probability
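A small sketch of the consistent-hashing mapping from a global point of view (SHA-1 is assumed, so m = 160; the distributed lookup comes on the next slide).

import hashlib

def node_id(ip: str, virtual_host: int) -> int:
    # Node IDs and block IDs share the same 160-bit space.
    return int.from_bytes(hashlib.sha1(f"{ip}:{virtual_host}".encode()).digest(), "big")

def block_id(block: bytes) -> int:
    return int.from_bytes(hashlib.sha1(block).digest(), "big")

def successor(key: int, ring: list) -> int:
    # The node responsible for a key is the first node ID >= key, wrapping around.
    nodes = sorted(ring)
    for nid in nodes:
        if nid >= key:
            return nid
    return nodes[0]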

53 Chord details
Two data structures are used for performing lookups:
Successor list: maintains the next r successors of the node; it can be used to traverse the nodes and find the node responsible for the data in O(N) time
Finger table: the ith entry contains the identity of the first node that succeeds n by at least 2^(i-1) on the ID circle
Lookup pseudocode: find the id's predecessor; its successor is the node responsible for the key. To find the predecessor, check whether the key lies between the node's ID and its successor; otherwise, use the finger table and successor list to find the closest preceding node of the id and repeat this step. Since finger-table entries point to nodes at power-of-two intervals around the ID ring, each iteration roughly halves the remaining distance.
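As a concrete illustration of this lookup, here is a simplified sketch in the spirit of the Chord pseudocode (not CFS's actual code); fingers are 0-based here and are assumed to be populated by join/stabilization.

M = 8   # tiny identifier space, just for illustration

def in_half_open(x, a, b):
    # True if x lies in the ring interval (a, b].
    return (a < x <= b) if a < b else (x > a or x <= b)

def in_open(x, a, b):
    # True if x lies in the ring interval (a, b).
    return (a < x < b) if a < b else (x > a or x < b)

class Node:
    def __init__(self, ident):
        self.id = ident
        self.successor = self
        self.predecessor = None
        self.fingers = [self] * M   # fingers[i]: first node succeeding id + 2**i (mod 2**M)

    def closest_preceding_node(self, key):
        for f in reversed(self.fingers):
            if in_open(f.id, self.id, key):
                return f
        return self

    def find_successor(self, key):
        # If the key falls between this node and its successor, the successor owns it.
        if in_half_open(key, self.id, self.successor.id):
            return self.successor
        # Otherwise jump to the closest preceding finger and repeat; each step roughly
        # halves the remaining ID-space distance, giving O(log N) hops.
        return self.closest_preceding_node(key).find_successor(key)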

54 Finger i points to the successor of n + 2^i (diagram)
Small tables, but multi-hop lookup; each table entry holds an IP address and a Chord ID
Navigate in ID space, routing queries ever closer to the key's successor: log(N)-sized tables, log(N) hops
(The figure shows node N80 with fingers covering 1/2, 1/4, 1/8, ..., 1/128 of the ring.)

55 Chord: node join/failure
Chord tries to preserve two invariants: each node's successor is correctly maintained, and for every key k, node successor(k) is responsible for k
To preserve these invariants, when a node n joins the network:
Initialize the predecessor, successors, and finger table of node n
Update the existing finger tables of other nodes to reflect the addition of n
Notify higher-layer software so that state can be transferred
To handle concurrent operations and failures, each Chord node periodically runs a stabilization algorithm that updates finger tables and successor lists to reflect node additions and failures
If a lookup fails during stabilization, the higher layer can simply retry; Chord guarantees that the stabilization algorithm converges to a consistent ring
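A sketch of the periodic stabilization step, continuing the Node sketch above and modeled loosely on the Chord paper's pseudocode (simplified; successor-list and finger maintenance omitted).

def stabilize(node):
    # Run periodically on each node: adopt a closer successor if one has joined.
    x = node.successor.predecessor
    if x is not None and in_open(x.id, node.id, node.successor.id):
        node.successor = x
    notify(node.successor, node)            # tell the successor about ourselves

def notify(node, candidate):
    # candidate believes it may be node's predecessor.
    if node.predecessor is None or in_open(candidate.id, node.predecessor.id, node.id):
        node.predecessor = candidate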

56 Chord: server selection
Added to Chord as part of the CFS implementation
Basic idea: reduce lookup latency by preferentially contacting nodes likely to be nearby in the underlying network
Latencies are measured during finger-table creation, so no extra measurements are necessary
This works well only if latency is roughly transitive: low latency from a to b and from b to c implies low latency between a and c; measurements suggest this is largely true [A case study of server selection, Master's thesis]

57 CFS: node ID authentication
An attacker could destroy chosen data by selecting a node ID that is the successor of the data's key and then denying the existence of the data
To prevent this, when a new node joins the system, existing nodes check that the hash of (node IP + virtual host number) matches the professed node ID, and send a random nonce to the claimed IP address to check for IP spoofing
To succeed, an attacker would have to control a large number of machines in order to target the blocks of a single file, which are randomly distributed over multiple servers
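A sketch of those two checks; send_nonce and read_reply are placeholders for real network I/O, and SHA-1 is assumed as the hash, so this is illustrative rather than the CFS implementation.

import hashlib
import os

def verify_professed_id(claimed_ip: str, virtual_host: int, professed_id: bytes,
                        send_nonce, read_reply) -> bool:
    # Check 1: the node ID must equal the hash of (IP address, virtual host number).
    expected = hashlib.sha1(f"{claimed_ip}:{virtual_host}".encode()).digest()
    if expected != professed_id:
        return False
    # Check 2: challenge the claimed IP with a random nonce to detect IP spoofing.
    nonce = os.urandom(16)
    send_nonce(claimed_ip, nonce)
    return read_reply(claimed_ip) == nonce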

58 CFS: DHash layer
Provides a distributed hash table for block storage
Reflects a key CFS design decision: split each file into blocks and distribute the blocks randomly over many servers
This provides good load distribution for large files; the disadvantage is that the lookup cost increases, since a lookup is executed for each block (though the lookup cost is small compared to the much higher cost of block fetches)
Also supports pre-fetching of blocks to reduce user-perceived latency
Supports replication, caching, quotas, and updates of blocks
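A sketch of the block-splitting idea, using the 8 KB block size from the experiments; keying each block by its content hash is an assumption here (consistent with the content hashes mentioned on the Future Research slide), not a detail taken from this slide.

import hashlib

BLOCK_SIZE = 8 * 1024

def split_into_blocks(data: bytes) -> dict:
    # Return {block_key: block}; the key is the SHA-1 hash of the block's content.
    blocks = {}
    for off in range(0, len(data), BLOCK_SIZE):
        block = data[off:off + BLOCK_SIZE]
        blocks[hashlib.sha1(block).digest()] = block
    return blocks
# Each block is then stored on the server whose Chord ID succeeds its key,
# so the blocks of a large file end up spread over many servers.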

59 CFS: replication
Blocks are replicated on k servers to increase availability: the replicas are placed on the k servers that are the immediate successors of the node responsible for the key, which are easily found from the successor list (r >= k)
This provides fault tolerance: when the successor fails, the next server can serve the block
Because a node's ID is a hash of its IP address + virtual host number, successor nodes are unlikely to be physically close to each other, which gives robustness against the failure of multiple servers on the same network
The client can fetch the block from any of the k servers, using latency as the deciding factor; this also has the side effect of spreading load across multiple servers (it assumes that proximity in the underlying network is transitive)
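A sketch of how a client might pick a replica to fetch from; find_successor, successor_list, measure_latency and fetch are placeholders passed in as functions, not the real CFS API.

def fetch_replicated_block(key, find_successor, successor_list, k, measure_latency, fetch):
    owner = find_successor(key)                         # node responsible for the key
    replicas = [owner] + successor_list(owner)[:k - 1]  # its k-1 immediate successors hold copies
    # Prefer the replica with the lowest measured latency; any of the k will do,
    # which also spreads the fetch load across servers.
    for server in sorted(replicas, key=measure_latency):
        block = fetch(server, key)
        if block is not None:
            return block
    return None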

60 CFS: caching
DHash implements caching to avoid overloading servers that hold popular data
Caching is based on the observation that as a lookup proceeds toward the desired key, the distance traveled across the key space with each hop decreases; with high probability, the nodes just before the key are involved in many lookups for the same block
So when the client fetches a block from the successor node, it also caches it at the servers that were involved in the lookup
The cache replacement policy is LRU: blocks cached on servers far from the key are evicted quickly, since few lookups touch those servers, while blocks cached on closer servers stay in the cache as long as they are referenced

61 CFS: implementation
Implemented in 7,000 lines of C++ code, including 3,000 lines of Chord
User-level programs communicate over UDP using RPC primitives provided by the SFS toolkit
The Chord library maintains the successor lists and finger tables; when multiple virtual servers run on the same physical server, the routing tables are shared for efficiency
Each DHash instance is associated with a Chord virtual server and has its own implementation of the Chord lookup protocol to increase efficiency
The client FS implementation exports an ordinary Unix-like file system; the client runs on the same machine as a server, uses Unix domain sockets to communicate with the local server, and uses that server as a proxy to send queries to non-local CFS servers

62 CFS: experimental results
Two sets of tests
The first test explores real-world client-perceived performance on a subset of 12 machines of the RON testbed: a 1-megabyte file is split into 8 KB blocks, all machines download the file one at a time, and the download speed is measured with and without server selection
The second test is a controlled test in which a number of servers run on the same physical machine and communicate over the local loopback interface; it studies the robustness, scalability, and load balancing of CFS

63-72 (no transcript text; result graphs only)

73 Future research
Support keyword search, either by adopting an existing centralized search engine (as Napster did) or by using a distributed set of index files stored on CFS
Improve security against malicious participants: colluding nodes can form a consistent internal ring, route all lookups to nodes inside that ring, and then deny the existence of the data
Content hashes help guard against block substitution; future versions will add periodic routing-table consistency checks by randomly selected nodes to try to detect malicious participants
Lazy replica copying, to reduce the overhead for hosts that join the network only for a short period of time

74 Conclusions
PAST (Pastry) and CFS (Chord) represent peer-to-peer routing and location schemes for storage
The underlying ideas are largely the same in both; CFS's load management is less complex
Questions raised at SOSP about them: Is there any real application for these systems? Who will trust these infrastructures to store their files?

