
1 Coda is a descendant of AFS, developed by Mahadev Satyanarayanan and coworkers at Carnegie Mellon University since 1987. It is open source and provides advanced caching schemes that allow a client to continue operation despite being disconnected from a server. An important goal was to achieve a high degree of naming and location transparency, so that the system appears to its users very similar to a pure local file system.

2 Coda is a descendant of version 2 of the Andrew File System (AFS) AFS nodes are partitioned into two groups. One group consists of a relatively small number of dedicated Vice file servers, which are centrally administered. The other group consists of a very much larger collection of Virtue workstations that give users and processes access to the file system, as shown in Fig. 10-1.

3 [Figure slide: see Fig. 10-1]

4 Communication. RPC2 offers reliable RPCs on top of the (unreliable) UDP protocol. Each time a remote procedure is called, the RPC2 client code starts a new thread that sends an invocation request to the server and subsequently blocks until it receives an answer. Because request processing may take an arbitrarily long time to complete, the server regularly sends back messages to the client to let it know it is still working on the request. If the server dies, this thread will sooner or later notice that the messages have ceased and report a failure back to the calling application.
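
As a rough illustration of this keep-alive behavior, here is a minimal sketch of a client that sends one request over UDP and treats periodic "busy" messages as proof that the server is still alive. The socket layout, the BUSY message format, and the timeout values are assumptions made for the sketch, not part of RPC2's actual interface.

```python
import socket

BUSY_INTERVAL = 2.0   # assumed: the server reports "still busy" this often
PATIENCE = 3          # assumed: give up after this many missed keep-alives

def rpc_call(server_addr, request: bytes) -> bytes:
    """Send one request over UDP and wait for the reply, treating periodic
    BUSY messages from the server as proof that it is still working."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(BUSY_INTERVAL * PATIENCE)
    sock.sendto(request, server_addr)
    try:
        while True:
            data, _ = sock.recvfrom(64 * 1024)
            if data == b"BUSY":
                continue            # server is still working; keep waiting
            return data             # the actual reply
    except socket.timeout:
        # No reply and no keep-alives: assume the server died.
        raise ConnectionError("server stopped responding") from None
    finally:
        sock.close()
```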

5 An interesting aspect of RPC2 is its support for side effects. A side effect is a mechanism by which the client and server can communicate using an application-specific protocol. Consider, for example, a client that opens a video file at a server: RPC2 allows the client and the server to set up a separate connection for transferring the video data to the client on time. Connection setup is done as a side effect of an RPC call to the server.

6 [Figure slide]

7 Another feature of RPC2 that makes it different from other RPC systems is its support for multicasting. An important design issue in Coda is that servers keep track of which clients have a local copy of a file. When a file is modified, a server invalidates local copies by notifying the appropriate clients through an RPC.

8 [Figure slide]

9 Parallel RPCs are implemented by means of the MultiRPC system, which is part of the RPC2 package. An important aspect of MultiRPC is that the parallel invocation of RPCs is fully transparent to the callee. MultiRPC is implemented by essentially executing multiple RPCs in parallel.
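
The idea can be sketched as follows: the caller fires the same invalidation RPC at many clients in parallel, while each callee simply handles what looks like an ordinary one-to-one call. The `call`/"Invalidate" stub interface below is hypothetical, not MultiRPC's real API.

```python
from concurrent.futures import ThreadPoolExecutor

def invalidate_copy(client, filename):
    """Hypothetical one-to-one RPC: ask a single client to drop its cached copy."""
    return client.call("Invalidate", filename)   # assumed stub interface

def multi_rpc_invalidate(clients, filename):
    """Issue the same RPC to every client in parallel and collect the replies.
    Each callee sees nothing but an ordinary RPC."""
    with ThreadPoolExecutor(max_workers=max(1, len(clients))) as pool:
        futures = {c: pool.submit(invalidate_copy, c, filename) for c in clients}
        return {c: f.result() for c, f in futures.items()}
```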

10 Synchronization: Sharing Files in Coda. To accommodate file sharing, Coda uses a special allocation scheme that bears some similarities to share reservations in NFS. To understand how the scheme works, the following is important. When a client successfully opens a file f, an entire copy of f is transferred to the client's machine. The server records that the client has a copy of f. This approach is similar to open delegation in NFS.

11 Now suppose client A has opened file f for writing. When another client B wants to open f as well, it will fail. This failure is caused by the fact that the server has recorded that client A might have already modified f. On the other hand, had client A opened f for reading, an attempt by client B to get a copy from the server for reading would succeed. An attempt by B to open f for writing would succeed as well.
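
A minimal sketch of the server-side bookkeeping implied by this rule follows; the class and method names are invented for illustration and are not Coda's actual data structures.

```python
class FileRecord:
    """Server-side record for one file: which clients hold a copy, and how."""
    def __init__(self):
        self.holders = {}          # client_id -> "read" or "write"

    def try_open(self, client_id, mode):
        """An open fails only when another client may already have modified
        the file, i.e. holds it open for writing (the rule from the text)."""
        if any(m == "write" and c != client_id for c, m in self.holders.items()):
            return False           # someone else may have modified the file
        self.holders[client_id] = mode
        return True                # server now transfers a whole copy to the client
```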

12 [Figure slide]

13 Caching and Replication: Client Caching. Client-side caching is crucial to the operation of Coda for two reasons. First, caching is done to achieve scalability. Second, caching provides a higher degree of fault tolerance, as the client becomes less dependent on the availability of the server. For these two reasons, clients in Coda always cache entire files: when a file is opened for either reading or writing, an entire copy of the file is transferred to the client, where it is subsequently cached.

14 A server is said to record a callback promise for a client. When a client updates its local copy of the file for the first time, it notifies the server, which, in turn, sends an invalidation message to the other clients. Such an invalidation message is called a callback break, because the server then discards the callback promise it held for the client to which it just sent the invalidation.

15 The interesting aspect of this scheme is that as long as a client knows it has an outstanding callback promise at the server, it can safely access the file locally. In particular, suppose a client opens a file and finds it is still in its cache. It can then use that file provided the server still has a callback promise on the file for that client. The client will have to check with the server if that promise still holds. If so, there is no need to transfer the file from the server to the client again.
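
The client-side logic can be sketched roughly as follows; the `fetch` interface and the data structures are assumptions for illustration, not Coda's actual client implementation.

```python
class CodaClientCache:
    """Sketch of client-side whole-file caching guarded by callback promises."""
    def __init__(self, server):
        self.server = server
        self.cache = {}            # filename -> file contents
        self.promise = set()       # files for which we hold a callback promise

    def open(self, filename):
        if filename in self.cache and filename in self.promise:
            return self.cache[filename]          # safe to use the local copy
        data = self.server.fetch(filename)       # whole-file transfer
        self.cache[filename] = data
        self.promise.add(filename)               # server records a callback promise
        return data

    def callback_break(self, filename):
        """Invoked by the server after another client updated the file."""
        self.promise.discard(filename)           # local copy may now be stale
```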

16 [Figure slide]

17 Server-Side Replication. Coda allows file servers to be replicated. As mentioned, the unit of replication is a collection of files called a volume. The collection of Coda servers that have a copy of a volume is known as that volume's Volume Storage Group, or simply VSG. In the presence of failures, a client may not have access to all servers in a volume's VSG. A client's Accessible Volume Storage Group (AVSG) for a volume consists of those servers in that volume's VSG that the client can contact at the moment. If the AVSG is empty, the client is said to be disconnected.

18 Coda uses a replicated-write protocol to maintain consistency of a replicated volume, in particular a variant of Read-One, Write-All (ROWA). When a client needs to read a file, it contacts one of the members in its AVSG for the volume to which that file belongs. However, when closing a session on an updated file, the client transfers it in parallel to each member of the AVSG.
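
A sketch of this ROWA variant, assuming hypothetical replica objects that offer `fetch` and `store` operations:

```python
import random
from concurrent.futures import ThreadPoolExecutor

def read_file(avsg, filename):
    """Read-One: fetch the whole file from any single reachable replica."""
    return random.choice(avsg).fetch(filename)

def close_session(avsg, filename, data):
    """Write-All, restricted to the AVSG: on session close, push the updated
    file to every currently reachable replica in parallel."""
    with ThreadPoolExecutor(max_workers=max(1, len(avsg))) as pool:
        list(pool.map(lambda server: server.store(filename, data), avsg))
```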

19 This scheme works fine as long as there are no failures, that is, as long as each client's AVSG for a volume is the same as that volume's VSG. However, in the presence of failures, things may go wrong. Consider a volume that is replicated across three servers S1, S2, and S3. For client A, assume its AVSG covers servers S1 and S2, whereas client B has access only to server S3, as shown in Fig. 11-24.

20 [Figure slide: see Fig. 11-24]

21 Coda uses an optimistic strategy for file replication. In particular, both A and B will be allowed to open a file f for writing, update their respective copies, and transfer their copy back to the members in their AVSG. Obviously, there will then be different versions of f stored in the VSG. The question is how this inconsistency can be detected and resolved.

22 The solution adopted by Coda is to deploy a versioning scheme. In particular, each server Si in a VSG maintains a Coda version vector CVVi(f) for each file f contained in that VSG. If CVVi(f)[j] = k, then server Si knows that server Sj has seen at least version k of file f. CVVi(f)[i] is the number of the current version of f stored at server Si, and an update of f at server Si leads to an increment of CVVi(f)[i].

23 Returning to our three-server example, CVVi(f) is initially equal to [1,1,1] for each server Si. When client A reads f from one of the servers in its AVSG, say S1, it also receives CVV1(f). After updating f, client A multicasts f to each server in its AVSG, that is, S1 and S2. Both servers will then record that their respective copy has been updated, but not that of S3. In other words, CVV1(f) = CVV2(f) = [2,2,1].

24 Meanwhile, client B will be allowed to open a session in which it receives a copy of f from server S3, and subsequently update f as well. When it closes its session and transfers the update to S3, server S3 updates its version vector to CVV3(f) = [1,1,2]. When the partition is healed, the three servers will need to reintegrate their copies of f. By comparing their version vectors, they will notice that a conflict has occurred that needs to be repaired.
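
Conflict detection reduces to comparing version vectors: there is a conflict exactly when neither vector dominates the other. A small sketch using the vectors from the example above:

```python
def dominates(v1, v2):
    """True if vector v1 records at least every update that v2 records."""
    return all(a >= b for a, b in zip(v1, v2))

def compare(v1, v2):
    if dominates(v1, v2) and dominates(v2, v1):
        return "identical"
    if dominates(v1, v2):
        return "v1 newer"
    if dominates(v2, v1):
        return "v2 newer"
    return "conflict"                      # concurrent updates: neither dominates

# The partitioned example from the text:
print(compare([2, 2, 1], [1, 1, 2]))       # -> "conflict"
# A non-conflicting case: one replica simply lags behind the other.
print(compare([2, 2, 1], [1, 1, 1]))       # -> "v1 newer"
```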

25 XFS. Design goals for this file system (from Silicon Graphics) centered on supporting intense I/O performance demands, large (media) files, and file systems with many files and many large files: terabytes of disk space (and thus many files and directories), huge files, and hundreds of MB/s of I/O bandwidth.

26 Every machine involved in XFS can become a server for some files and also a client for some other files. XFS uses the storage technology called RAID (redundant array of independent disks) to spread data over multiple disks.

27 RAID. The basic idea of RAID is file striping: each file is partitioned into multiple pieces, and each piece is stored on a different disk. The main advantage of file striping is parallelism: when accessing a file, all the pieces of that single file on different disks can be accessed in parallel, which can result in linear speedup. In addition, automatic load balancing comes for free, because popular files are distributed across multiple disks and therefore no single disk will be overloaded. However, file striping also has a disadvantage: when a disk fails, every file with a piece on that disk is gone. Assuming independent failures, the mean time to failure drops significantly as the number of disks increases. Therefore, redundancy must be added to make the system usable in practice.
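
As a small illustration of striping, the sketch below maps a byte offset within a striped file to a disk and an offset on that disk, using round-robin placement. The stripe-unit size is an arbitrary choice for the example, and parity/redundancy is ignored.

```python
def block_location(file_offset, stripe_unit, num_disks):
    """Map a byte offset in a striped file to (disk index, offset on that disk)."""
    stripe_no = file_offset // stripe_unit                  # which stripe unit overall
    disk = stripe_no % num_disks                            # disks are used round-robin
    offset_on_disk = (stripe_no // num_disks) * stripe_unit \
                     + file_offset % stripe_unit
    return disk, offset_on_disk

# Example: a 64 KB stripe unit across 4 disks.
print(block_location(200 * 1024, 64 * 1024, 4))             # -> (3, 8192)
```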

28 Overview of xFS. The xFS file system is based on a serverless model. The entire file system is distributed across machines including clients. Each machine can run a storage server, a metadata server and a client process.

29 A typical distribution of xFS processes across multiple machines.

30 Communication in xFS. RPC was substituted with active messages in xFS, because RPC performance was not the best and full decentralization is hard to manage with RPC. With active messages, when a message arrives, a handler is automatically invoked to process it.
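
The dispatch idea behind active messages can be sketched as follows; the handler registry and the message format are assumptions made for illustration, not xFS's actual messaging layer.

```python
# Every message carries the name of the handler to run on arrival,
# so no thread sits blocked waiting for a reply (unlike RPC).
HANDLERS = {}

def handler(name):
    """Register a function as the handler for messages tagged with `name`."""
    def register(fn):
        HANDLERS[name] = fn
        return fn
    return register

@handler("read_block")
def read_block(args):
    print("reading block", args["block"])     # placeholder action

def deliver(message):
    """Called by the network layer for each arriving message."""
    HANDLERS[message["handler"]](message["args"])

deliver({"handler": "read_block", "args": {"block": 42}})
```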

31 DISTRIBUTED COORDINATION-BASED SYSTEMS: COORDINATION MODELS. The coordination part of a distributed system handles the communication and cooperation between processes. It forms the glue that binds the activities performed by processes into a whole. We make a distinction between coordination models along two different dimensions, temporal and referential, as shown in Fig. 13-1.

32 [Figure slide: see Fig. 13-1]

33 When processes are temporally and referentially coupled, coordination takes place in a direct way, referred to as direct coordination. The referential coupling generally appears in the form of explicit referencing in communication: for example, a process can communicate only if it knows the name or identifier of the other processes it wants to exchange information with. Temporal coupling means that the communicating processes must both be up and running. This coupling is analogous to transient message-oriented communication.

34 A different type of coordination occurs when processes are temporally decoupled but referentially coupled, which we refer to as mailbox coordination. In this case, there is no need for two communicating processes to execute at the same time in order to let communication take place. Instead, communication takes place by putting messages in a (possibly shared) mailbox.

35 The combination of referentially decoupled and temporally coupled systems forms the group of models for meeting-oriented coordination. In referentially decoupled systems, processes do not know each other explicitly. In other words, when a process wants to coordinate its activities with other processes, it cannot directly refer to another process. Instead, there is a concept of a meeting in which processes temporarily group together to coordinate their activities. The model prescribes that the meeting processes are executing at the same time.

36 Meeting-based systems are often implemented by means of events, like the ones supported by object-based distributed systems. A well-known mechanism for implementing meetings is the publish/subscribe system. In these systems, processes can subscribe to messages containing information on specific subjects, while other processes produce (i.e., publish) such messages. Most publish/subscribe systems require that communicating processes be active at the same time; hence there is a temporal coupling.

37 The most widely known coordination model is the combination of referentially and temporally decoupled processes, exemplified by generative communication as introduced in the Linda programming system. The key idea in generative communication is that a collection of independent processes makes use of a shared persistent dataspace of tuples. Tuples are tagged data records consisting of a number (possibly zero) of typed fields. Processes can put any type of record into the shared dataspace (i.e., they generate communication records).

38 An interesting feature of these shared dataspaces is that they implement an associative search mechanism for tuples. In other words, when a process wants to extract a tuple from the dataspace, it essentially specifies (some of) the values of the fields it is interested in. Any tuple that matches that specification is then removed from the dataspace and passed to the process. If no match could be found, the process can choose to block until there is a matching tuple.
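
A minimal sketch of such a tuple space follows, assuming an in-memory, thread-based implementation (real generative communication would make the dataspace persistent and distributed). Here `out` publishes a tuple and `take` plays the role of Linda's blocking `in`, with `None` acting as a wildcard field.

```python
import threading

class TupleSpace:
    """Minimal sketch of a Linda-style shared dataspace."""
    def __init__(self):
        self._tuples = []
        self._cond = threading.Condition()

    def out(self, *tup):
        """Generate a tuple: put it into the shared dataspace."""
        with self._cond:
            self._tuples.append(tup)
            self._cond.notify_all()

    def take(self, *template):
        """Associatively match a tuple (None = wildcard), remove it, and
        return it; block until a matching tuple exists."""
        with self._cond:
            while True:
                for tup in self._tuples:
                    if len(tup) == len(template) and all(
                            t is None or t == v for t, v in zip(template, tup)):
                        self._tuples.remove(tup)
                        return tup
                self._cond.wait()

ts = TupleSpace()
ts.out("temperature", "room-12", 21.5)
print(ts.take("temperature", "room-12", None))   # -> ('temperature', 'room-12', 21.5)
```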

39 ARCHITECTURES: Overall Approach. Let us first assume that data items are described by a series of attributes. A data item is said to be published when it is made available for other processes to read. To that end, a subscription needs to be passed to the middleware, containing a description of the data items that the subscriber is interested in. Such a description typically consists of some (attribute, value) pairs, possibly combined with (attribute, range) pairs. In the latter case, the specified attribute is expected to take on values within a specified range.

40 We are now confronted with a situation in which subscriptions need to be matched against data items, as shown in Fig. 13-2. When matching succeeds, there are two possible scenarios. In the first case, the middleware may decide to forward the published data to its current set of subscribers, that is, processes with a matching subscription. As an alternative, the middleware can also forward a notification at which point subscribers can execute a read operation to retrieve the published data item.
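
The matching step itself is straightforward. The sketch below assumes subscriptions are expressed as dictionaries of exact (attribute, value) constraints and (attribute, (low, high)) range constraints; this representation is an assumption for illustration.

```python
def matches(subscription, item):
    """Check a data item (attribute -> value) against one subscription."""
    for attr, constraint in subscription.items():
        if attr not in item:
            return False
        if isinstance(constraint, tuple):              # (low, high) range constraint
            low, high = constraint
            if not (low <= item[attr] <= high):
                return False
        elif item[attr] != constraint:                  # exact-value constraint
            return False
    return True

def publish(item, subscriptions):
    """Return the subscribers whose subscription matches the published item."""
    return [sid for sid, sub in subscriptions.items() if matches(sub, item)]

subs = {"alice": {"topic": "weather", "temp": (20, 30)},
        "bob":   {"topic": "stocks"}}
print(publish({"topic": "weather", "temp": 25}, subs))   # -> ['alice']
```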

41 In those cases in which data items are immediately forwarded to subscribers, the middleware will generally not offer storage of data. Storage is either explicitly handled by a separate service or is the responsibility of subscribers. In other words, we have a referentially decoupled but temporally coupled system. The situation is different when notifications are sent so that subscribers need to explicitly read the published data: the middleware will then necessarily have to store data items, and there are additional operations for data management. It is also possible to attach a lease to a data item such that when the lease expires, the data item is automatically deleted.

42 Traditional Architectures The simplest solution for matching data items against subscriptions is to have a centralized client-server architecture. This is a typical solution currently adopted by many publish/subscribe systems, including IBM's WebSphere.

43 JINI. Jini is a distributed system architecture developed by Sun Microsystems, Inc. Its main goal is “network plug and play”. A Jini system is a distributed system based on the idea of joining together groups of users and the resources required by those users. The overall goal is to turn the network into a flexible, easily administered tool with which resources can be found by both human and computational clients.

44 JINI Goals
– Enabling users to share services and resources over a network
– Providing users easy access to resources anywhere on the network, while allowing the network location of the user to change
– Simplifying the task of building, maintaining, and altering a network of devices, software, and users

45 Jini and JavaSpaces. Jini is a distributed system that consists of a mixture of different but related elements. It is strongly related to the Java programming language, although many of its principles can be implemented equally well in other languages. An important part of the system is formed by a coordination model for generative communication. Jini provides temporal and referential decoupling of processes through a coordination system called JavaSpaces. A JavaSpace is a shared dataspace that stores tuples representing a typed set of references to Java objects. Multiple JavaSpaces may coexist in a single Jini system.

46 When a tuple contains two different fields that refer to the same object, the tuple as stored in a JavaSpace implementation will hold two marshaled copies of that object (marshaling means representing an object in a data format suitable for storage or transmission). A tuple is put into a JavaSpace by means of a write operation, which first marshals the tuple before storing it. Each time the write operation is called on a tuple, another marshaled copy of that tuple is stored in the JavaSpace, as shown in Fig. 13-3. We will refer to each marshaled copy as a tuple instance.

47 [Figure slide: see Fig. 13-3]

48 To read a tuple instance, a process provides another tuple that it uses as a template for matching tuple instances as stored in a JavaSpace. Like any other tuple, a template tuple is a typed set of object references. Only tuple instances of the same type as the template can be read from a JavaSpace. A field in the template tuple either contains a reference to an actual object or contains the value NULL.

49 When a tuple instance is found that matches the template tuple provided as part of a read operation, that tuple instance is unmarshaled and returned to the reading process. There is also a take operation that additionally removes the tuple instance from the JavaSpace. Both operations block the caller until a matching tuple is found, and it is possible to specify a maximum blocking time. In addition, there are variants that simply return immediately if no matching tuple exists.
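
The read/take distinction and the bounded blocking can be sketched as follows. This is a Python sketch of the semantics only, not the real JavaSpaces Java API; `None` plays the role of NULL in templates, and type matching is approximated by field-by-field comparison.

```python
import threading, time

class Space:
    """Sketch of JavaSpaces-style read/take with template matching."""
    def __init__(self):
        self._tuples, self._cond = [], threading.Condition()

    def write(self, *tup):
        with self._cond:
            self._tuples.append(tup)                   # store another tuple instance
            self._cond.notify_all()

    def _find(self, template):
        for tup in self._tuples:
            if len(tup) == len(template) and all(
                    t is None or t == v for t, v in zip(template, tup)):
                return tup
        return None

    def _wait_for(self, template, timeout, remove):
        deadline = None if timeout is None else time.monotonic() + timeout
        with self._cond:
            while True:
                tup = self._find(template)
                if tup is not None:
                    if remove:
                        self._tuples.remove(tup)       # take: remove the instance
                    return tup
                remaining = None if deadline is None else deadline - time.monotonic()
                if remaining is not None and remaining <= 0:
                    return None                        # maximum blocking time exceeded
                self._cond.wait(remaining)

    def read(self, *template, timeout=None):
        return self._wait_for(template, timeout, remove=False)

    def take(self, *template, timeout=None):
        return self._wait_for(template, timeout, remove=True)

space = Space()
space.write("printer", "bldg-2", 600)
print(space.read("printer", None, None))                 # matches, tuple stays
print(space.take("printer", "bldg-2", None))              # matches and is removed
print(space.take("printer", None, None, timeout=0.1))     # -> None (nothing left)
```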

50 TIB/Rendezvous. An alternative to using central servers is to immediately disseminate published data items to the appropriate subscribers using multicasting. This principle is used in TIB/Rendezvous, of which the basic architecture is shown in Fig. 13-4. In this approach, a data item is a message tagged with a compound keyword describing its content, such as news.comp.os.books. A subscriber provides (part of) a keyword indicating the messages it wants to receive, such as news.comp.*.books. These keywords are said to indicate the subject of a message.

51 [Figure slide: see Fig. 13-4]

52 If it is known exactly where a subscriber resides, point-to-point messages will generally be used. Each host on such a network runs a rendezvous daemon, which takes care that messages are sent and delivered according to their subject. Whenever a message is published, it is multicast to each host on the network running a rendezvous daemon. Typically, multicasting is implemented using the facilities offered by the underlying network, such as IP multicasting or hardware broadcasting.

53 Processes that subscribe to a subject pass their subscription to their local daemon. The daemon constructs a table of (process, subject) entries, and whenever a message on subject S arrives, the daemon simply checks its table for local subscribers and forwards the message to each one. If there are no subscribers for S, the message is discarded immediately.
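
A sketch of the daemon's dispatch table follows. The segment-by-segment '*' matching is a simplification of TIB/Rendezvous's actual subject syntax, and the class and method names are invented for illustration.

```python
import fnmatch
from collections import defaultdict

class RendezvousDaemon:
    """Sketch of a per-host daemon holding (process, subject) entries and
    forwarding arriving messages whose subject matches a subscription."""
    def __init__(self):
        self.table = defaultdict(list)            # subject pattern -> delivery callbacks

    def subscribe(self, pattern, deliver):
        self.table[pattern].append(deliver)

    def on_message(self, subject, payload):
        targets = [cb for pattern, cbs in self.table.items()
                   if self._matches(pattern, subject) for cb in cbs]
        if not targets:
            return                                # no local subscribers: discard
        for deliver in targets:
            deliver(subject, payload)

    @staticmethod
    def _matches(pattern, subject):
        # Compare dot-separated subjects segment by segment; '*' matches one segment.
        pats, segs = pattern.split("."), subject.split(".")
        return len(pats) == len(segs) and all(
            fnmatch.fnmatch(seg, pat) for seg, pat in zip(segs, pats))

daemon = RendezvousDaemon()
daemon.subscribe("news.comp.*.books", lambda subj, msg: print("deliver:", subj, msg))
daemon.on_message("news.comp.os.books", "new title announced")   # forwarded locally
daemon.on_message("news.sports.results", "no local subscriber")  # discarded
```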

