Chapter 17: Distributed-File Systems Part 1


1 Chapter 17: Distributed-File Systems Part 1

2 Chapter 17 Distributed-File Systems
Background
Naming and Transparency
Remote File Access
(Continued in Chapter 17.2: Stateful versus Stateless Service, File Replication, An Example: AFS)

3 Chapter Objectives
To explain the naming mechanisms that provide location transparency and location independence (that is, naming and transparency)
To describe the various methods for accessing distributed files (that is, remote file access)

4 Background Definition: Distributed file system (DFS) – a distributed implementation of the classical time-sharing model of a file system, where multiple users share files and storage resources. Note the word 'implementation.' A DFS manages a set of dispersed storage devices. The purpose of a distributed file system is to support sharing of files when the files are physically dispersed among the sites of a distributed system. We will discuss how a distributed file system can be both designed and implemented in the face of a number of design parameters.

5 Distributed File System Structure
We can define a distributed system as a collection of loosely coupled computers interconnected via a communications network. This is a good definition… These computers can share physically dispersed files by using a Distributed File System (DFS). Let's start by defining a few critically important terms: service, server, and client.

6 Terms – book definitions
Service – software entity running on one or more machines and providing a particular type of function to unknown clients
Server – service software running on a single machine
Client – process that can invoke a service using a set of operations that forms its client interface
A client interface for a file service is formed by a set of primitive file operations (create, delete, read, write). The client interface of a DFS should be transparent, i.e., it should not distinguish between local and remote files. This is a critical component of any viable distributed system!! Everything should appear 'local.' To put things into perspective, the primary hardware component that a file server controls is a set of local secondary-storage devices (usually disks) on which files are stored and from which they are retrieved according to the clients' requests.
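Purely as an illustration (not from the text), here is a minimal Python sketch of what such a transparent client interface might look like; the class name FileService and the method signatures are assumptions. The point is simply that a caller cannot tell whether the implementation behind the interface is local or remote.

    import abc

    class FileService(abc.ABC):
        """Illustrative client interface: the same primitive operations are
        used whether the implementation behind them is local or remote."""

        @abc.abstractmethod
        def create(self, name: str) -> None: ...

        @abc.abstractmethod
        def delete(self, name: str) -> None: ...

        @abc.abstractmethod
        def read(self, name: str, offset: int, length: int) -> bytes: ...

        @abc.abstractmethod
        def write(self, name: str, offset: int, data: bytes) -> None: ...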

7 Distributed File System Structure – more
Rather than a single centralized data repository, the system may well have multiple and independent storage devices with widely varying configurations. It is this multiplicity and autonomy of clients and servers that distinguish a DFS. The client interface to the DFS should not distinguish between local and remote files. A DFS may be implemented as part of a distributed operating system or, alternatively, by a software layer whose task is to manage the communication between conventional operating systems and file systems. These are quite different approaches and have different implementations!

8 Naming and Transparency
These are two huge issues in the DFS context. Naming – this refers to the mapping between logical and physical objects; i.e., a mapping between familiar file names and the physical blocks of data storage, likely on a disk. This multilevel mapping must provide an abstraction of a file that hides the details of how and where on the disk the file is actually stored. A transparent DFS goes further and hides where in the network the file is stored. In conventional systems, the range of this mapping is an address on a disk. In a DFS, the range is expanded to include the specific machine hosting the file. Moreover, a file name may map to a number of locations where multiple copies of the file exist. Recall mirror sites for downloading very popular software. For a file replicated at several sites, the mapping returns the set of locations of this file's replicas; both the existence of multiple copies and their locations may be hidden. Clearly there are issues here as to consistency and more…
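As a rough sketch of this expanded mapping (the table contents and helper below are illustrative assumptions, not a real DFS data structure), a logical file name can map to a set of replica locations, each naming a host plus that host's low-level identifier for the copy:

    # Hypothetical name mapping: one logical name -> several physical replicas.
    replica_map = {
        "/pub/linux.iso": [
            ("serverA.example.edu", "disk0:block-9120"),
            ("mirror1.example.edu", "disk2:block-4417"),
        ],
    }

    def locate(name):
        """Return the replica locations for a logical file name.
        A transparent DFS hides this set from the client entirely."""
        return replica_map.get(name, [])

    print(locate("/pub/linux.iso"))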

9 Naming Structures Location transparency – file name does not reveal the file's physical storage location. The file name still refers to a specific set of physical disk blocks. A convenient way to share data. Location independence – file name does not need to be changed when the file's physical storage location changes. A better file abstraction. Promotes sharing of the storage space itself. Separates the naming hierarchy from the storage-devices hierarchy, so a file keeps its name even if it is moved to a different location.

10 Location Transparency vs Location Independence
Independence: Location independence is considered a stronger property than location transparency. Location independence supports mapping the same file name to different locations at different times; it is thus dynamic, since the mapping can change. Transparency: Most common, however, is what we call static, location-transparent mapping for user-level names. Such an approach does not support file migration: files are permanently associated with a specific machine and a specific location on that machine. These are important concepts to understand. Let's look more deeply into them.

11 Location Transparency vs Location Independence
Location independence: separates data from location, giving a better abstraction. We usually only care about the file contents and not where they came from! Location independence denotes some logical file somewhere – we don't care where, and the location stays hidden. A critical point of understanding, and stronger than transparency. Location independence also separates the naming hierarchy from the storage hierarchy, because the resource may be in several locations. Static location transparency: in more common use and convenient. Promotes sharing by file name, as though files were local. The downside is that the logical name is still mapped to a fixed physical location.

12 Diskless Clients Diskless clients offer some real advantages:
Accessing files on remote servers may enable clients to be diskless. If so, servers must provide all files and all OS support. A diskless workstation has no local copy of the kernel. Here, a special boot protocol stored in ROM is invoked that enables the retrieval of one special file (the kernel or boot code) from a fixed location. Once the kernel is copied over the network and loaded, the diskless client's DFS makes all other OS files available. Advantages include lower cost (no disks in clients) and greater convenience when an OS upgrade occurs (only the server copies need updating). Disadvantages are the added complexity of the boot protocol and the performance loss of going over the network rather than to a local disk.

13 Current Trends Clients use both local disks and remote file servers. OS and networking software are stored locally. File systems (user data) are stored on remote file servers. Some clients store common software, such as word processors, in their local file systems; other services are pushed from the remote file server to the client on demand. The advantage of a local file system on clients, versus purely diskless systems, is this: disk drives are rapidly increasing in capacity and decreasing in cost, and networks cannot match this growth. Systems are growing more quickly than networks, so we need to limit network access to improve system throughput. Let's change gears and look at naming…

14 Naming Schemes There are three main approaches. Files can be named by a combination of their host name and local name, which guarantees a unique system-wide name. In Sun's Network File System (NFS), remote directories are attached (mounted) onto local directories, giving the appearance of a coherent directory tree. What we are ultimately after is total integration of the component file systems: a single global name structure spans all files in the system, and the combined file system appears identical in structure to a conventional file system.
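To make the NFS-style idea (the second scheme above) concrete, here is a minimal sketch under assumed names: a mount table maps a local directory onto a remote (host, exported directory) pair, and path resolution consults the table so remote files appear inside the local directory tree. This is only an illustration of the idea, not the actual NFS mount mechanism.

    # Hypothetical mount table: local mount point -> (remote host, exported dir).
    mount_table = {
        "/home": ("fileserver.example.edu", "/export/home"),
    }

    def resolve(path):
        """Translate a local path into (host, remote path) if it crosses a
        mount point; otherwise the local file system serves it."""
        for mount_point, (host, export) in mount_table.items():
            if path.startswith(mount_point):
                return host, export + path[len(mount_point):]
        return "localhost", path

    print(resolve("/home/alice/notes.txt"))   # served by fileserver.example.edu
    print(resolve("/tmp/scratch.txt"))        # served locally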

15 Implementation Techniques
So we map a transparent file name to an associated location. To manage all of this, we aggregate sets of files into what are called 'component units,' and the mapping can be aided by replication, local caching, or both.

16 Remote File Access The remote-service mechanism is one transfer approach
The most fundamental approach: a client requests an access; the server carries out the access and sends the results back to the client. This is commonly done via a remote procedure call (RPC). Essentially we liken this to a traditional disk access on a local machine. This is very simple but fails to realize many benefits of a DFS implemented with caching and more. Issues: in a traditional file system, we cache to reduce disk I/O and hence improve performance. This is clear. But in a DFS, our goal is not only to reduce disk I/O but also to reduce network traffic, which can be very significant!
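A minimal sketch of the pure remote-service model (the class names and the in-memory 'server' are illustrative stand-ins for a real RPC layer, not an actual protocol): every read is shipped to the server and the result shipped back, with no caching on the client.

    class RemoteFileServer:
        """Holds the only copy of the data; every access goes through it."""
        def __init__(self):
            self.files = {"/etc/motd": b"welcome to the DFS\n"}

        def read(self, name, offset, length):
            return self.files[name][offset:offset + length]

    class RemoteServiceClient:
        """Pure remote service: each read() costs one round trip, no cache."""
        def __init__(self, server):
            self.server = server          # stands in for an RPC stub

        def read(self, name, offset, length):
            return self.server.read(name, offset, length)

    client = RemoteServiceClient(RemoteFileServer())
    print(client.read("/etc/motd", 0, 7))     # b'welcome'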

17 Basic Caching Scheme In a basic caching scheme, if the desired data is already cached, we're in good shape. Otherwise, a copy of the data is transmitted over the network to the client's cache store, where operations are performed on the cached copy. Caching clearly reduces network traffic by retaining recently accessed disk blocks in a cache, so that repeated accesses to the same information can be handled locally. Cache stores may be expensive and are certainly bounded, so any implementation of caching for improved performance and reduced network traffic must implement a replacement policy over the cache, such as LRU. Note that there is still a master copy of the file residing at the server site, but copies of (parts of) the file may be scattered in different caches.
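Here is a minimal sketch of the basic caching scheme with LRU replacement (the block size, cache capacity, and class name are assumptions; server is any object with a read(name, offset, length) method, such as the RemoteFileServer stub in the earlier sketch): a miss fetches the block over the network into a bounded client cache, and a hit is served locally.

    from collections import OrderedDict

    BLOCK_SIZE = 4096    # assumed block size
    CACHE_BLOCKS = 128   # assumed cache capacity, in blocks

    class CachingClient:
        def __init__(self, server):
            self.server = server
            self.cache = OrderedDict()   # (name, block#) -> bytes, in LRU order

        def read_block(self, name, block_no):
            key = (name, block_no)
            if key in self.cache:                 # hit: handled locally
                self.cache.move_to_end(key)
                return self.cache[key]
            data = self.server.read(name, block_no * BLOCK_SIZE, BLOCK_SIZE)
            self.cache[key] = data                # miss: fetched over the network
            if len(self.cache) > CACHE_BLOCKS:    # evict least recently used block
                self.cache.popitem(last=False)
            return data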

18 Basic Caching Scheme – Cache Consistency
Modification of cache contents: of course, we now have a problem when cached copies are modified and these changes need to be written back to the server. Cache-consistency problem – this refers to the problem of keeping the cached copies consistent with the master file; we will consider this problem ahead. Side note: it is not unreasonable to refer to DFS caching as network virtual memory, except that the backing store is not the local disk but rather a remote server somewhere.

19 Basic Caching Scheme: Block Size Transferred?
Always an issue. One approach is to put more data in the cache than a single request needs – exploiting the principle of locality. We want the hit ratio to be as high as possible so that network performance is acceptable and network traffic is controlled; thus, oftentimes large chunks of data are transferred to address these concerns. Alternatively, we can transfer individual blocks on client demand. The hit ratio will go down, network traffic will go up to bring in additional data from the server, and overall performance suffers. But individual data transfers are quicker, less cache space is needed (although more chunks need to be managed), and more. When larger blocks are used, the cache must be larger. Unix uses 4KB and 8KB blocks; some large caches may use over 1MB! Smaller caches should avoid larger block sizes, which may well result in a lower hit ratio and more maintenance (replacement) on the cache in response to additional client requests.
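A small worked example of why the hit ratio dominates (all timing numbers below are assumptions chosen for illustration, not measurements): with a 0.2 ms cache access and a 15 ms network fetch, a 90% hit ratio gives an average access time of about 1.68 ms, while a 50% hit ratio gives about 7.6 ms.

    def effective_access_ms(hit_ratio, cache_ms=0.2, network_ms=15.0):
        """Average cost of one block access for a given cache hit ratio.
        The timing constants are illustrative assumptions."""
        return hit_ratio * cache_ms + (1.0 - hit_ratio) * network_ms

    print(effective_access_ms(0.90))   # 1.68
    print(effective_access_ms(0.50))   # 7.6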

20 Cache Location The cache can be located in primary memory, on disk, or both! Clearly there are advantages and disadvantages to each. Advantages of a disk cache: more reliable if there is a crash (data don't need to be re-fetched), non-volatile, larger, etc. Advantages of main-memory caches: they permit workstations to be diskless, access is quicker from main memory than from a disk, and performance speeds up as memories get larger. Memory is becoming less and less expensive, and this performance speedup may outweigh the benefits of a disk cache! Of course, server caches (used to speed up disk I/O) will be in main memory regardless of where user caches are located.

21 A Combination?? As it turns out, many implementations are combinations of caching and remote service. Some implementations are based on remote service but they are supplemented with client- and server-side memory caching for performance considerations. (quote!)

22 Cache Consistency Issue
Keeping files consistent with the server is critically important in a DFS. Write-through – write data through to the server's disk as soon as they are placed in any cache. Reliable, but very poor write performance: essentially, we are only getting caching for read accesses. Not enough. Delayed-write (a.k.a. write-back caching) – modifications are initially written only to the cache; later the data are written through to the server. Good: write accesses complete quickly, and some data may be overwritten before they are written back, and so need never be written at all. Bad: poor reliability; unwritten data will be lost whenever a user machine crashes. Variation 1 on delayed write – scan the cache at regular intervals and flush blocks that have been modified since the last scan. Still some performance loss, since the write must complete before the client can continue. Variation 2 on delayed write – write-on-close: write data back to the server when the file is closed (used in the Andrew File System (AFS), implemented at CMU). Best for files that are open for long periods and frequently modified. Very poor for files opened for short periods and modified infrequently, as it does not appreciably decrease network traffic and incurs a performance delay while the file is written back to the server on close.
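The two update policies can be contrasted with a short sketch (class names are illustrative, and server is assumed to be any object with a store(name, data) method): write-through pushes every write to the server immediately, while delayed-write (write-back) only marks blocks dirty and flushes them later, on a periodic scan or on close.

    class WriteThroughCache:
        """Every write goes to the server at once: reliable, but slow writes."""
        def __init__(self, server):
            self.server, self.cache = server, {}

        def write(self, name, data):
            self.cache[name] = data
            self.server.store(name, data)    # synchronous write-through

    class DelayedWriteCache:
        """Writes stay in the cache and are flushed later (write-back)."""
        def __init__(self, server):
            self.server, self.cache, self.dirty = server, {}, set()

        def write(self, name, data):
            self.cache[name] = data
            self.dirty.add(name)             # fast: nothing sent to the server yet

        def flush(self):
            """Run periodically, or from close() in a write-on-close scheme."""
            for name in self.dirty:
                self.server.store(name, self.cache[name])
            self.dirty.clear()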

23 Cachefs and its Use of Caching
Notes on the graphic: the server caches in primary memory and writes through to its disks; the client uses a local disk cache (cachefs), and 'write back' here means 'delayed write.' The graphic shows cachefs (the cache file system) used with Sun's NFS, where modified data are held in the local disk cache and later written back to the server. This clearly benefits performance via reads with cache hits, and reliability if a transmission goes bad. It decreases performance for both reads and writes on a cache miss. It is vital in any such mechanism to obtain the highest cache hit rate possible for the best overall performance.

24 Performance and Consistency
Continuing with the major problems in DFSs: we know that performance is critically important. Now the questions are: Is the locally cached copy of the data consistent with the master copy? And how do we know if the data is not consistent? If the cached copy is not current, access to it should not be permitted locally, and a fresh copy is needed from the server. What if two clients open a file simultaneously in conflicting modes? How do we ensure consistency? These are serious issues.

25 Approaches to Consistency
Client-initiated approach: the client initiates a validity check with the server; the frequency of the validity check is the key point here. The server then checks whether the local data are consistent with the master copy. But this delays processing if every access requires a validity check. Alternatively, consistency checks can occur at fixed intervals; clearly, if this is done frequently, it will load down the network and delay local processing. Server-initiated approach: the server records, for each client, the (parts of) files it caches – it knows who has what and what was sent over the network. When the server detects a potential inconsistency – two clients holding different 'versions' of a file – it must react. Implementing this requires the server to know whether the intended mode is read or write (declared via the open). Further, if the server detects files opened simultaneously in conflicting modes (say both clients indicate a desire to write), the server can disable caching for this file and switch to a remote-service mode of operation only.
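A sketch of the client-initiated approach (the per-file version counter and all names below are assumptions, not a real protocol): before serving a cached copy, the client asks the server whether that copy is still current and refetches it if not.

    class VersionedServer:
        """Keeps a version counter per file, bumped on every update."""
        def __init__(self):
            self.data, self.version = {}, {}

        def store(self, name, data):
            self.data[name] = data
            self.version[name] = self.version.get(name, 0) + 1

        def fetch(self, name):
            return self.data[name], self.version[name]

        def is_current(self, name, version):
            return self.version.get(name) == version

    class ValidatingClient:
        """Client-initiated consistency: validate the cached copy before use."""
        def __init__(self, server):
            self.server, self.cache = server, {}    # name -> (data, version)

        def read(self, name):
            if name in self.cache:
                data, version = self.cache[name]
                if self.server.is_current(name, version):   # validity check
                    return data                             # cache still good
            data, version = self.server.fetch(name)          # stale or missing
            self.cache[name] = (data, version)
            return data

    server = VersionedServer()
    server.store("/doc/readme", b"v1")
    client = ValidatingClient(server)
    print(client.read("/doc/readme"))    # b'v1', fetched from the server
    server.store("/doc/readme", b"v2")   # master copy changes
    print(client.read("/doc/readme"))    # validity check fails; b'v2' refetched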

26 Comparing Caching and Remote Service
We have two choices, caching and remote service, and each has significant tradeoffs. Caching: here, many remote accesses are handled efficiently by the local cache; most remote accesses will be served as fast as local ones. Servers are contacted only occasionally (rather than for each access), which reduces server load and network traffic and enhances the potential for scalability. The remote-service method handles every remote access across the network, with a penalty in network traffic, server load, and performance. The total network overhead of transmitting big chunks of data (caching) is lower than that of a series of responses to specific requests (the remote-service approach). Caching is superior for access patterns with infrequent writes.

27 Caching and Remote Service (Cont.)
Caching: the real problem with caching occurs in addressing the cache-consistency problem. A plus: with infrequent writes, caching is the way to go. A minus: with frequent writes, substantial overhead is incurred in network traffic, performance, and server load (it approaches remote-service behavior). A major benefit is realized when execution is carried out on machines with either local disks or large main memories – this is a must for caching to realize its potential benefits. Remote access is best done on diskless, small-memory-capacity machines; in this paradigm, the inter-machine interface simply mirrors the local user interface.

28 End of Chapter 17

