
Sogang University: Advanced Operating Systems (UNIX & Linux File System). Sang Gue Oh, Ph.D.


1 Sogang University Advanced Operating Systems (UNIX & Linux File System)
Sang Gue Oh, Ph.D.
Email: sgoh@macroimpact.com

2 Sogang University UNIX File System: File System Framework
[Figure: layered file system framework. Labels at user level: Directory API, File Operations API. Boundary: System Call Interface. Labels at kernel level: File System Interface (Directory Service, text name to file id), File System Implementation (File Storage Service), Device Driver. Devices: hard disk, floppy disk, CD-ROM.]

3 Two UNIX File Systems (Local File System)
- System V File System
  - Original file system for UNIX.
  - Supported by all versions of System V as well as several commercial UNIX systems.
- FFS (Fast File System)
  - Introduced by Berkeley UNIX in release 4.2BSD.
  - Provides better performance, robustness, and functionality.
  - Gained wide commercial acceptance; SVR4 includes this file system.
  - After integration with VFS (Virtual File System), this file system became known as the UNIX file system (ufs).

4 Inode (Index Node)
- Contains a description of the disk layout of the file data and other administrative information such as the file owner, access permissions, file size, block size, and access times.
- Every file has exactly one inode associated with it (the internal representation of the file), and the inode # is unique across the file system.
- When a process refers to a file by name, the kernel maps the name to an inode for further operations.
- Two different structures:
  - On-disk inode: stored on the disk.
  - In-core inode: maintained in the kernel after being read from the disk. Has more fields in addition to the fields of the disk inode.

5 Block Addressing Using Inode

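6 Example - Accessing the Block
- Block size = 1024 bytes; the inode has 10 direct pointers and one each of single, double, and triple indirect pointers.
- Case 1: Access byte offset 9000 of a file.
  - 9000 falls in logical block 8, which is a direct block; in the figure, direct entry 8 holds block number 367.
  - The byte is at offset 808 within that data block (9000 - 1024 * 8).
- Case 2: Access byte offset 350,000 of a file.
  - The double indirect range starts at byte 10K + 256K = 272,384, so the offset within it is 350,000 - 272,384 = 77,616.
  - In the figure, the double indirect pointer holds 9156; entry 0 of that block holds 331, and entry 75 of block 331 holds data block 3333.
  - The byte is at offset 816 within that data block.
[Figure: inode pointer table (direct entries including 4096, 228, and 367; single indirect 428; double indirect 9156; triple indirect 824) with the two lookups traced to their data blocks.]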
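
The two cases above can be reproduced with a short sketch. It assumes the slide's parameters (1024-byte blocks, 10 direct pointers) plus 4-byte block pointers, which gives 256 pointers per indirect block; the names `block_path` and `map_offset` are hypothetical and not taken from any UNIX source.

```c
#include <stdint.h>

enum { BLOCK_SIZE = 1024, NDIRECT = 10, PTRS_PER_BLOCK = 256 };

/* Result of mapping a byte offset onto the inode's pointer tree. */
struct block_path {
    int level;              /* 0 = direct, 1..3 = indirection depth, -1 = out of range */
    uint32_t index[3];      /* level 0: index[0] is the direct slot;
                               otherwise one index per indirect block, outermost first */
    uint32_t byte_in_block; /* offset of the byte inside the data block */
};

struct block_path map_offset(uint64_t offset)
{
    struct block_path p = { -1, {0, 0, 0}, (uint32_t)(offset % BLOCK_SIZE) };
    uint64_t blk = offset / BLOCK_SIZE;   /* logical block number in the file */

    if (blk < NDIRECT) {
        p.level = 0;
        p.index[0] = (uint32_t)blk;
        return p;
    }
    blk -= NDIRECT;

    uint64_t span = PTRS_PER_BLOCK;       /* data blocks reachable at this level */
    for (int level = 1; level <= 3; level++) {
        if (blk < span) {
            p.level = level;
            /* Peel off one index per indirect block, outermost first. */
            for (int i = 0; i < level; i++) {
                span /= PTRS_PER_BLOCK;
                p.index[i] = (uint32_t)(blk / span);
                blk %= span;
            }
            return p;
        }
        blk -= span;
        span *= PTRS_PER_BLOCK;
    }
    return p;                             /* beyond the triple indirect range */
}
```

For offset 9000 this yields level 0, slot 8, byte 808; for offset 350,000 it yields level 2 with indices 0 and 75 and byte 816, matching the slide.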

7 Example - Maximum File Size
- Total size = (# of direct block pointers * block size) + (# of block pointers in one block * block size) + ((# of block pointers in one block)**2 * block size) + ((# of block pointers in one block)**3 * block size)
- Case: block size = 1024 bytes (1 Kbyte), pointer size = 4 bytes, 10 direct, 1 indirect, 1 double indirect, and 1 triple indirect pointers.
  - Maximum file size = (10 * 1024) + (1024/4 * 1024) + ((1024/4)**2 * 1024) + ((1024/4)**3 * 1024) = 1024 * (10 + 256 + 256**2 + 256**3) bytes = 17,247,250,432 bytes (about 16 Gbytes).
- Increasing the block size vs. adding one more level of indirection (quadruple)?
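
The formula above generalizes to any block size and any number of indirection levels. The following is a minimal sketch; the function name `max_file_size` and its parameters are hypothetical, chosen only to mirror the terms of the formula.

```c
#include <stdint.h>

/* Largest file, in bytes, addressable by `ndirect` direct pointers plus
 * one pointer at each indirection level 1..levels. */
uint64_t max_file_size(uint64_t block_size, uint64_t ptr_size,
                       unsigned ndirect, unsigned levels)
{
    uint64_t ptrs = block_size / ptr_size;  /* pointers per indirect block */
    uint64_t blocks = ndirect;              /* data blocks reachable so far */
    uint64_t reach = 1;

    for (unsigned l = 1; l <= levels; l++) {
        reach *= ptrs;                      /* ptrs**l data blocks at level l */
        blocks += reach;
    }
    return blocks * block_size;
}
```

With the slide's parameters, `max_file_size(1024, 4, 10, 3)` evaluates to 17,247,250,432 bytes. Calling it with a larger block size, or with `levels` set to 4, is one way to explore the closing question on the slide.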

8 Directory Structure (System V Case)

  Inode Number (2 bytes)   File Name (14 bytes)
  73                       .
  38                       ..
  9                        file1
  0                        deleted file
  110                      subdirectory1
  65                       archana

- The 2-byte inode # (16 bits) restricts the maximum # of files in the partition to 65535.
- The maximum file name size is 14 bytes.
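
The fixed 16-byte entry format lends itself to a simple linear lookup. The sketch below is illustrative only: `struct sysv_dirent` and `sysv_lookup` are hypothetical names, and it assumes that an inode number of 0 marks a deleted slot, as the "deleted file" row in the table suggests.

```c
#include <stdint.h>
#include <string.h>

#define SYSV_NAME_LEN 14

/* One 16-byte directory entry: 2-byte inode number + 14-byte name.
 * The name is NUL-padded and has no terminator when it is exactly 14 bytes. */
struct sysv_dirent {
    uint16_t ino;                 /* 0 marks a deleted (reusable) slot */
    char name[SYSV_NAME_LEN];
};

/* Return the inode number for `name`, or 0 if no live entry matches. */
uint16_t sysv_lookup(const struct sysv_dirent *dir, size_t nentries,
                     const char *name)
{
    if (strlen(name) > SYSV_NAME_LEN)
        return 0;                 /* cannot exist in this format */
    for (size_t i = 0; i < nentries; i++) {
        if (dir[i].ino == 0)
            continue;             /* skip deleted entries */
        if (strncmp(dir[i].name, name, SYSV_NAME_LEN) == 0)
            return dir[i].ino;
    }
    return 0;
}
```

Run against the table above, a lookup of "file1" returns 9, while "deleted file" returns 0 because its slot has inode number 0 even though the name is still present.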

9 File System Layout (System V Case)
- A file system resides on a single logical disk or partition, and each logical disk may hold at most one file system.
- A partition is viewed as a linear array of blocks. The size of a block is a multiple of 512 bytes (e.g., 512/1024/2048). This represents the granularity of space allocation for a file.
- The physical block # is an index into this array, which the device driver translates into cylinder, track, and sector #.
- B (Boot Area): only one partition needs to contain one.
- S (Super Block): contains metadata about the file system.
- Inode List: has a fixed size, which limits the maximum number of files. The size of an inode is 64 bytes in System V UNIX.
[Figure: partition layout, in order: B | S | Inode List | Data Blocks.]

10 Super-Block
- The superblock contains the following administrative information:
  - Size in blocks of the file system
  - Size in blocks of the inode list
  - Number of free blocks and inodes
  - Free inode list
  - Index of the next free inode in the free inode list
  - Free block list
  - Index of the next free block in the free block list
- It is impractical to keep either free list completely in the superblock. The superblock therefore holds only part of each list, and the kernel must manage the free inode list and the free block list.

11 Free Inode List Management (1)
- Assigning a free inode.
  - Assign from the superblock free inode list.
  - When the list becomes empty, the kernel scans the disk from the remembered inode to replenish the list (an inode with di_mode = 0 is free).
[Figure: the superblock free inode list (48, 83, 99, then empty slots) with the next-free-inode pointer; once the list is empty, the kernel scans from the remembered inode (470) and refills the list with free inodes 471, 475, 476, 535.]

12 Free Inode List Management (2)
- Freeing an allocated inode.
  - If the superblock free inode list has room, place the inode # in the list and increase the index.
  - Otherwise, compare the inode # with that of the remembered position. If the freed inode # is less than the remembered inode #, it replaces the remembered inode.
[Figure: two cases. With room in the superblock list (48, 83, 99, then empty slots), the freed inode is placed in the list. With a full list (471, 475, 476, 535), freeing inode 500 updates the remembered position, giving 471, 475, 476, 500.]

13 Free Block List Management (1)
- Allocating a free block.
  - Allocate the next available block in the free block list in the superblock.
  - If the block being allocated is the last block in the list, the kernel treats it as a pointer to a block that contains a list of free blocks. It reads that block and populates the superblock array with the new list of block numbers.
[Figure: the superblock list ends with a link to block a; block a holds a list ending with a link to block b, and so on through blocks c and d. Blocks are assigned from the superblock list, and a link block's contents are copied into the superblock when it is reached.]

14 Free Block List Management (2)
- Freeing an allocated block.
  - If the superblock list is not full, place the block on the superblock free list.
  - Otherwise, the newly freed block becomes a link block: the kernel writes the superblock list into the block and writes the block to disk. It then places the block number of the newly freed block in the superblock list.
[Figure: case 1 (not full), the freed block number is added to the superblock list; case 2 (full), the superblock list is copied into the freed block, which then becomes the first entry of a new superblock list linking to blocks a, b, c.]
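
The allocation and freeing rules from the two slides above can be sketched in a few lines. This is a simplified user-space model, not System V source: the disk is an in-memory array, the list holds only 4 entries, block number 0 is assumed to serve as the end-of-list marker, and all names (`free_list`, `fs_init`, `block_alloc`, `block_free`) are hypothetical.

```c
#include <stdint.h>
#include <string.h>

enum { NFREE = 4, NBLOCKS = 64 };   /* tiny sizes chosen for illustration */

/* A counted list of free block numbers; slot 0 links to the next list. */
struct free_list {
    int nfree;
    uint32_t free[NFREE];
};

static struct free_list superblock;      /* in-core part of the free list */
static struct free_list disk[NBLOCKS];   /* stand-in for link blocks on disk */

/* Start with an empty list: slot 0 holds 0, meaning "no more free blocks". */
void fs_init(void)
{
    memset(&superblock, 0, sizeof superblock);
    superblock.nfree = 1;
}

/* Allocate a block; returns 0 when no free block is left. */
uint32_t block_alloc(void)
{
    uint32_t b = superblock.free[--superblock.nfree];

    if (b == 0) {                 /* reached the end-of-list marker */
        superblock.nfree++;       /* keep the marker in place */
        return 0;
    }
    if (superblock.nfree == 0)    /* b was the last entry: it is a link block */
        superblock = disk[b];     /* refill the superblock list from it */
    return b;
}

/* Free a block; the caller must pass 1 <= b < NBLOCKS. */
void block_free(uint32_t b)
{
    if (superblock.nfree == NFREE) {  /* list full: b becomes a link block */
        disk[b] = superblock;         /* write the current list into b */
        superblock.nfree = 0;
    }
    superblock.free[superblock.nfree++] = b;
}
```

Freeing more than `NFREE - 1` blocks after `fs_init()` forces the "full" case, and subsequent allocations hand the blocks back in last-in, first-out order, with the link block itself allocated once its list has been copied into the superblock.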

15 Analysis of System V File System (1)
- Distinguished by its simple design.
- This simplicity creates problems in the areas of reliability, performance, and functionality.
- Reliability problem.
  - One copy of the superblock: superblock corruption problem.
- Performance problem.
  - Long seeks between two areas (superblock and data blocks) increase I/O times.
  - Inodes are allocated randomly with no attempt to group related inodes (files in the same directory): random disk access.
  - As files are created and deleted, the order of blocks in the free list becomes completely random -> disk block allocation problem. This slows down sequential access operations.

16 Analysis of System V File System (2)
  - Disk block size problem.
    - SVR2 (512 bytes), SVR3 (1024 bytes).
    - Small blocks require many indirect block accesses. Increasing the block size allows more data to be read in a single disk access -> improves performance. On the other hand, this also wastes more disk space.
    - Need for a more flexible approach to allocating space to files.
- Functionality problem.
  - File name size: 14 bytes.
  - Number of inodes: 65535 -> restricts the maximum # of files in the file system.

17 The Fast File System (FFS)
- Retained the old file system abstraction.
- Changed the underlying implementation.
  - Bigger blocks (4096 bytes or larger).
  - Groups related information using cylinder groups (reduces long seeks).
  - Uses a bit map for maintaining free blocks.
  - Variable-length directory entries and long file names (max 255 bytes).
- Up to 10 times faster than the old file system.
- Functional enhancements
  - Long file names -> 255 bytes.
  - Symbolic links (which may refer to a file on a different file system).
  - Rename (previously required link and unlink commands - hard link).
  - Quotas.

18 Hard Disk Structure
- UNIX views the disk as a linear array of blocks (each a multiple of sectors).
- Block addresses increase first by sector #, then by head #, then by cylinder #.

19 File System Layout (FFS)
- FFS further divides the partition into one or more cylinder groups, each containing a small set of consecutive cylinders.
- This allows UNIX to store related data in the same cylinder group (e.g., an inode and its data blocks: avoids long seeks).
- The fields of a cylinder group:
  - A redundant copy of the superblock, at a varying offset.
  - Space for a static number of inodes (default: one inode per 2048 bytes).
  - The bit map of available blocks (cf. the free list).
  - Summary information describing the usage of data blocks.
[Figure: partition layout: B, then Cylinder Group 0 through Cylinder Group n, each containing a copy of S.]

20 Blocks and Fragments (1)
- Different file systems on the same machine can have different block sizes.
- The block size is a power of two, with a minimum of 4096 bytes. Most implementations add an upper limit of 8192 bytes (compare with System V: 512 or 1024 bytes).
- With blocks this large, 2**32 bytes (4 gigabytes) can be addressed with only two levels of indirection, so FFS does not use the triple indirect block, although some variants use it to support file sizes greater than 4 gigabytes.
- Typical UNIX systems have numerous small files that need to be stored efficiently.
- FFS solves this problem by allowing each block to be divided into one or more fragments (1, 2, 4, or 8 fragments, allowing a lower bound of 512 bytes each).

21 Blocks and Fragments (2)
- The last block of a file need not be a complete disk block.
- Write system call
  - If enough space is left in an already allocated fragment, write into the available space.
  - If no fragmented blocks are available, a full block is written, and the remaining new data is written to a block with the necessary fragments or to a full block.
- As the file grows, this scheme generates frequent copying. FFS allows only direct blocks to contain fragments.

  Bits in map:       XXXX   XXOO   OOXX   OOOO
  Fragment numbers:  0-3    4-7    8-11   12-15
  Block numbers:     0      1      2      3
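
The fragment map above can be searched for a run of free fragments that fits inside one block. The sketch below assumes, following the usual reading of this figure, that 'X' marks a fragment in use and 'O' a free one; the map is passed as a string for readability, and `find_fragment_run` is a hypothetical name rather than an FFS routine.

```c
#include <string.h>

/* Find the first run of `need` free fragments ('O') that lies inside a
 * single block of `frags_per_block` fragments. Returns the number of the
 * first fragment in the run, or -1 if no block has such a run. */
int find_fragment_run(const char *map, int frags_per_block, int need)
{
    int nfrags = (int)strlen(map);

    if (need < 1 || need > frags_per_block)
        return -1;
    for (int base = 0; base + frags_per_block <= nfrags; base += frags_per_block) {
        int run = 0;                         /* current run length in this block */
        for (int i = 0; i < frags_per_block; i++) {
            run = (map[base + i] == 'O') ? run + 1 : 0;
            if (run == need)
                return base + i - need + 1;
        }
    }
    return -1;
}
```

On the map "XXXXXXOOOOXXOOOO" with 4 fragments per block, a request for 2 fragments returns fragment 6, while a request for 3 or 4 returns fragment 12. Fragments 6-9 are all free, but they straddle blocks 1 and 2, so they cannot be handed out as one block.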

22 Allocation Policies
- FFS aims to colocate related information on the disk and optimize sequential access.
- Allocation policies
  - Use the next available block rotationally closest to the requested block on the same cylinder.
  - If there are no blocks available on the same cylinder, use a block within the same cylinder group.
  - If that cylinder group is entirely full, hash the cylinder group number to choose another cylinder group in which to look for a free block.
  - Finally, if the hash fails, apply an exhaustive search to all cylinder groups.

23 History of Linux File Systems
- First Linux file system: Minix File System.
  - 14-character file names
  - Maximal file size: 64 MB
  - Lacking in performance
- April 1992: Extended File System (Extfs).
  - Variable-length file names (up to 255 characters)
  - Maximal file size: 2 GB
  - Most successful in the Linux community
- January 1993: New Extended File System (renamed Second Extended File System - Ext2fs)
- January 1993: XIA File System

24 The Virtual File System (VFS) (1)
- Software layer in the kernel that provides the file system interface to user space programs.
- Manages kernel-level file abstractions in one format for all file systems (implements file system independent operations).
- Receives file-oriented system calls from user level.
  - write, open, stat, link
- VFS switch: translates them into the internal calls that interface with a specific file system module (the file system dependent part).
- A file system provides methods that the VFS switch can call; many are optional.
- Also receives requests from other parts of the kernel, mostly from memory management.

25 The Virtual File System (VFS) (2)
- Before the VFS: Minix File System -> Buffer Cache -> Device Driver.
- After the VFS: VFS -> Minix / Ext2fs -> Buffer Cache -> Device Driver.

26 The Virtual File System (VFS) (3)
- VFS maintains its own superblock/inode (i.e., in-core inode) structures, and each file system dependent translator converts the contents of the external superblock/file descriptor (i.e., disk inode) from/to the VFS superblock/inode.
- VFS assumes the following about the file system stored on the disk:
  - The first sector of the disk is a boot block, although VFS does not use it.
  - A superblock contains disk-specific information, such as the number of bytes in a disk block.
  - External file descriptors (disk inodes) on the disk describe the characteristics of each file.
  - Data blocks linked into each file contain the data.
- Before VFS can manage a particular file system type, the type has to be registered with the register_filesystem() call (fs/super.c); this usually happens when the machine is booted.

27 Mounting a File System (1)
- Before a file system can be used, it has to be mounted.
- Mounting attaches a new file system to an existing directory hierarchy, which allows heterogeneous file systems to be combined in the system's directory hierarchy.
- When a file system is mounted, VFS creates an instance of the super_block data structure (defined in include/linux/fs.h) to hold information for the file system.
- VFS then calls the new file system's read_super() function (defined in fs/super.c) to retrieve the information contained in the new file system's superblock and save it into the super_block structure.
- After the mount, the VFS can use the super_operations (defined in include/linux/fs.h) functions to handle the on-disk superblock and inodes.

28 Mounting a File System (2)
[Figure: super_block holds FS independent data, a pointer to the super operations, and FS dependent data. The super_operations table lists read_inode, notify_change, write_inode, put_inode, put_super, write_super, statfs, and remount_fs.]

29 Mounting a File System (3)

    struct super_block {
        ...
        unsigned long s_blocksize;
        ...
        struct file_system_type *s_type;
        struct super_operations *s_op;
        ...
        union {    /* File system specific information */
            struct minix_sb_info minix_sb;
            struct ext2_sb_info ext2_sb;
            struct hpfs_sb_info hpfs_sb;
            struct ntfs_sb_info ntfs_sb;
            struct msdos_sb_info msdos_sb;
        } u;
        ...
    };

    /* Disk dependent functions such as handling the on-disk superblock, inodes, etc. */
    struct super_operations {
        void (*read_inode) (struct inode *);
        void (*write_inode) (struct inode *);
        void (*put_inode) (struct inode *);
        void (*delete_inode) (struct inode *);
        int (*notify_change) (struct dentry *, ...);
        void (*put_super) (struct super_block *);
        void (*write_super) (struct super_block *);
        int (*statfs) (struct super_block *, ...);
        int (*remount_fs) (struct super_block *, ...);
        void (*clear_inode) (struct inode *);
        void (*umount_begin) (struct super_block *);
    };
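
The way a file system independent layer calls through such a method table can be illustrated in user space. The sketch below is an analogy, not kernel code: every `toy_` and `toyfs_` name is hypothetical, the structures are heavily cut down, and only the idea of dispatching through `s_op` while tolerating optional (NULL) methods is taken from the slides.

```c
#include <stddef.h>

struct toy_super_block;

/* Per-file-system method table; a method may be NULL if not implemented. */
struct toy_super_operations {
    int (*statfs)(struct toy_super_block *sb, unsigned long *free_blocks);
    void (*write_super)(struct toy_super_block *sb);
};

struct toy_super_block {
    unsigned long s_blocksize;
    const struct toy_super_operations *s_op;
    unsigned long fs_private;   /* stands in for the FS-specific union */
};

/* One hypothetical file system's implementation of statfs. */
static int toyfs_statfs(struct toy_super_block *sb, unsigned long *free_blocks)
{
    *free_blocks = sb->fs_private;
    return 0;
}

static const struct toy_super_operations toyfs_sops = {
    .statfs = toyfs_statfs,
    .write_super = NULL,        /* optional method left unimplemented */
};

/* FS-independent layer: dispatch through s_op. */
int vfs_statfs(struct toy_super_block *sb, unsigned long *free_blocks)
{
    if (sb->s_op == NULL || sb->s_op->statfs == NULL)
        return -1;
    return sb->s_op->statfs(sb, free_blocks);
}

/* Returns -1 when the file system does not provide write_super. */
int vfs_write_super(struct toy_super_block *sb)
{
    if (sb->s_op == NULL || sb->s_op->write_super == NULL)
        return -1;
    sb->s_op->write_super(sb);
    return 0;
}
```

The caller of `vfs_statfs()` never names the concrete file system; swapping `s_op` for another table is all it takes to route the same call to a different implementation.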

30 VFS Inode (1)
- Every file operation is made on an inode.
- The kernel translates file pathnames into inode numbers.
- The VFS maintains a table of inodes in use.
- Inodes are represented by the structure inode (defined in include/linux/fs.h):
  - FS independent data
  - Pointer to FS dependent operations (inode_operations, defined in include/linux/fs.h)
  - FS dependent data
- Usually one operation table per inode type (regular file, directory, symbolic link, ...).

31 VFS Inode (2)
[Figure: inode holds FS independent data, a pointer to the inode operations, a pointer to its super_block, and FS dependent data. The inode_operations table lists create, lookup, link, unlink, symlink, mkdir, rmdir, mknod, rename, readlink, follow_link, bmap, truncate, and permission.]

32 VFS Inode (3)

    struct inode_operations {
        struct file_operations *default_file_ops;
        int (*create) (struct inode *, ...);
        struct dentry * (*lookup) (struct inode *, ...);
        int (*link) (struct dentry *, ...);
        int (*unlink) (struct inode *, struct dentry *);
        int (*symlink) (struct inode *, ...);
        int (*mkdir) (struct inode *, ...);
        int (*rmdir) (struct inode *, struct dentry *);
        int (*mknod) (struct inode *, ...);
        int (*rename) (struct inode *, ...);
        int (*readlink) (struct dentry *, char *, int);
        ...
    };

    struct inode {
        ...
        uid_t i_uid;
        gid_t i_gid;
        ...
        time_t i_atime;
        ...
        unsigned long i_blksize;
        unsigned long i_blocks;
        ...
        struct inode_operations *i_op;
        struct super_block *i_sb;
        ...
        union {    /* File system specific information */
            struct pipe_inode_info pipe_i;
            struct minix_inode_info minix_i;
            struct ext2_inode_info ext2_i;
            ...
        } u;
        ...
    };

33 Opening a File - VFS to Disk Translation
[Figure: path from the POSIX API (system calls) through the fd table (struct file_struct, reached from task_struct), the file structure table (struct file), and the Linux inode (struct inode) to the disk inode.]

34 Inode Cache (1)
- The inode cache is implemented as a hash table (fs/inode.c).
  - Hash value: inode number, device identifier
- Inodes are connected by two separate lists
  - Hash list
  - Type list
    - in_use: valid inode, hashed if i_nlink > 0.
    - dirty: valid inode, hashed if i_nlink > 0, dirty.
    - unused: ready to be re-used. Not hashed.
- Inode state transitions
  - unused -> hash: allocate a new inode. iget() calls ext2_read_inode().
  - dirty -> in_use: synchronize inodes. Calls ext2_write_inode().
  - hash -> unused: clear inodes. If i_nlink = 0, iput() calls ext2_delete_inode() when i_count falls to 0.
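
A hash table keyed on device identifier and inode number can be sketched as follows. This is a toy model, not the code in fs/inode.c: the hash function, the table size, and every `toy_`/`cache_` name are assumptions made for illustration, and only the hash list (not the type lists) is modeled.

```c
#include <stddef.h>

enum { HASH_SIZE = 64 };  /* illustrative; not the kernel's value */

struct toy_inode {
    unsigned int dev;             /* device identifier */
    unsigned long ino;            /* inode number */
    int count;                    /* reference count, in the spirit of i_count */
    struct toy_inode *hash_next;  /* next inode on the same hash chain */
};

static struct toy_inode *hashtable[HASH_SIZE];

static unsigned int toy_hash(unsigned int dev, unsigned long ino)
{
    return (unsigned int)((dev ^ ino) % HASH_SIZE);
}

/* Put an inode at the head of its hash chain. */
void cache_insert(struct toy_inode *inode)
{
    unsigned int h = toy_hash(inode->dev, inode->ino);

    inode->hash_next = hashtable[h];
    hashtable[h] = inode;
}

/* Cache hit path: find the inode and take a reference on it. */
struct toy_inode *cache_lookup(unsigned int dev, unsigned long ino)
{
    struct toy_inode *p;

    for (p = hashtable[toy_hash(dev, ino)]; p != NULL; p = p->hash_next) {
        if (p->dev == dev && p->ino == ino) {
            p->count++;
            return p;
        }
    }
    return NULL;  /* miss: a real kernel would read the inode from disk here */
}
```

Both fields must be compared on lookup, because different (device, inode number) pairs can land on the same chain.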

35 Inode Cache (2)
[Figure: inode_hashtable[] (HASH_SIZE entries) chains inodes through i_hash, while the inode_in_use, inode_unused, and inode_dirty lists chain them through i_list. Transitions shown: iget() with read_inode, iput() with delete_inode and clear_inode, mark_inode_dirty (used -> dirty), and write_inode on sync (dirty -> used).]

36 Directory Cache
- Speeds up access to commonly used directories (fs/dcache.c).
- The directory cache consists of a hash table.
  - Hash key: device number, directory's name
- Two-level LRU list.
  - When an entry is first looked up, it is added to the end of the first-level list.
  - If the entry is accessed again, it is promoted to the end of the second-level list.

37 Buffer Cache Management (1)
- Hashed by device number and block number.
- Two functional parts
  - Free block list (free_list).
    - One list per buffer size (512, 1K, ..., 8K).
  - Hash table (lru_list: LRU list for each buffer type).
    - Three LRU lists: CLEAN, LOCKED, DIRTY.
    - If found in the hash list, put_last_lru().
    - If not in the hash list, allocate a buffer from the free list for its size: remove it from the free list and insert it into the LRU list.

38 Buffer Cache Management (2)

39 Buffer Cache Management (3)
[Figure: buffer_head structures linked two ways: through b_next/b_pprev on the hash_table chains selected by the hash function, and through b_next_free/b_prev_free on lru_list[BUF_DIRTY].]

40 Ext2 File System
- Influenced by BSD FFS (also called UFS - UNIX File System).
- Divides the partition into block groups (cf. cylinder groups).
- Keeps a) data blocks close to their inodes and b) file inodes close to their directory inode -> reduces seek time and speeds up access to data.

41 Ext2fs Super Block
- Contains a description of the basic size and shape of the file system. Usually the superblock in Block Group 0 is read (size: 1024 bytes).
- Each block group contains a duplicate copy in case of file system corruption.
- The superblock contains (struct ext2_super_block: include/linux/ext2_fs.h):
  - number of blocks (total and reserved)
  - number of inodes
  - number of free blocks and inodes
  - block and fragment size
  - number of blocks and inodes per group
  - last mount and write times
  - FS state
  - current and maximal mount count
  - last check time and check interval
  - mount option default values

42 Group Descriptors
- Provide information on the block groups. All descriptors are duplicated in each group (size: 32 bytes).
- Each descriptor describes a block group (struct ext2_group_desc: include/linux/ext2_fs.h):
  - block bitmap location
  - inode bitmap location
  - inode table location
  - number of free blocks
  - number of free inodes
  - number of allocated directories (used by the allocation routines)

43 Bitmaps and Inode Table
- The size of each bitmap is one block.
- This restricts the size of a block group to 8192 blocks for blocks of 1024 bytes (1024 bytes * 8 bits, one bit per block).
- The inode table is a vector of inodes, each 128 bytes in size.
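
The group size limit follows directly from the one-block bitmap. A small sketch of the arithmetic, with hypothetical helper names that do not come from the ext2 sources:

```c
#include <stdint.h>

/* A one-block bitmap has block_size * 8 bits, one bit per block in the group. */
uint32_t ext2_max_blocks_per_group(uint32_t block_size)
{
    return block_size * 8;
}

/* Number of 128-byte inodes that fit in one block of the inode table. */
uint32_t ext2_inodes_per_block(uint32_t block_size)
{
    return block_size / 128;
}
```

For 1024-byte blocks this gives 8192 blocks per group, as on the slide, and 8 inodes per inode table block; a 4096-byte block size raises the group limit to 32768 blocks.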

44 Directory Structure in Ext2fs
- Directory: linked list of variable-length entries.
- Each entry contains (struct ext2_dir_entry: include/linux/ext2_fs.h):
  - the inode number
  - the entry length (rounded up to a multiple of 4)
  - the file name length
  - the file name (maximum of 255)
- Example

  Offset   Inode   Entry length   Name length   Name
  0        i1      16             5             file1
  16       i2      40             14            long_file_name
  56       i3      12             2             f2
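
The entry lengths in the example can be checked with a little arithmetic. The sketch below assumes the early ext2 entry header of 8 bytes (a 4-byte inode number, a 2-byte entry length, and a 2-byte name length); the helper names are hypothetical.

```c
#include <stdint.h>

/* Fixed header before the name: 4-byte inode, 2-byte rec_len, 2-byte name_len. */
enum { EXT2_DIRENT_HEADER = 8 };

/* Smallest legal entry length for a name: header + name, rounded up to 4. */
uint16_t ext2_min_rec_len(uint16_t name_len)
{
    return (uint16_t)((EXT2_DIRENT_HEADER + name_len + 3) & ~3);
}

/* Offset of the entry that follows the one at `offset` with length `rec_len`. */
uint32_t ext2_next_entry(uint32_t offset, uint16_t rec_len)
{
    return offset + rec_len;
}
```

For "file1" (5 bytes) the minimum is 16 and for "f2" it is 12, matching the example. For "long_file_name" the minimum is 24, so the recorded length of 40 includes 16 bytes of slack; an entry length larger than the minimum is how the list skips over unused space, for instance space left by a removed entry. Following the lengths from offset 0 gives 16, then 56.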

45 Inode Structure in Ext2fs
- Block size: 1K ~ 4K
- Data block: i_data[15]
  - direct blocks: 12
  - indirect blocks: 3
  - cf. Sys V: i_addr[13]; UFS: i_db[12], i_ib[3]
- struct ext2_inode: include/linux/ext2_fs.h

46 Data Allocation in Ext2fs (1)
- Block groups are used to cluster together related inodes and data.
- Uses a "next-32-bits" search (within 32 blocks) for allocating a block's successor (target-oriented allocation).
- If that fails, searches forward:
  - within the group for an entire free byte in the bitmap (8 free blocks).
  - within the group for any free bit in the bitmap, if that fails.
  - in subsequent groups in a similar manner, if that fails.
- Pre-allocates up to 8 adjacent blocks when allocating a new block:
  - de-allocates extra blocks on file close.
  - pre-allocation achieves good performance.
  - pre-allocation hit rates are around 75% even on very full file systems.
  - #define EXT2_PREALLOCATE 8 -> include/linux/ext2_fs.h.
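
The three-step search within one block group can be sketched as below. This is a simplified model rather than the ext2 allocator: it assumes a set bit means "in use", that bit 0 of each byte is the lowest-numbered block, that `goal` lies inside the group, and that the caller moves on to the next group when -1 is returned. The names `bit_is_free` and `find_free_block` are hypothetical.

```c
#include <stdint.h>

static int bit_is_free(const uint8_t *bitmap, int bit)
{
    return (bitmap[bit / 8] & (1u << (bit % 8))) == 0;
}

/* Pick a free block in a group of `nbits` blocks, preferring blocks at or
 * just after `goal` (0 <= goal < nbits). Returns the block index within the
 * group, or -1 if nothing is free at or after the goal. */
int find_free_block(const uint8_t *bitmap, int nbits, int goal)
{
    /* 1. Look at the goal and the blocks right after it (32-block window). */
    int end = goal + 32 < nbits ? goal + 32 : nbits;
    for (int b = goal; b < end; b++)
        if (bit_is_free(bitmap, b))
            return b;

    /* 2. Search forward for a whole free byte, i.e. 8 adjacent free blocks. */
    for (int byte = goal / 8 + 1; byte < nbits / 8; byte++)
        if (bitmap[byte] == 0)
            return byte * 8;

    /* 3. Settle for any free bit further on in the group. */
    for (int b = goal; b < nbits; b++)
        if (bit_is_free(bitmap, b))
            return b;

    return -1;
}
```

Preferring a whole free byte in step 2 leaves room for the 8-block pre-allocation described above, which is one plausible reason for ordering the steps this way.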

47 Data Allocation in Ext2fs (2)
- Achieves good locality:
  - of related files through block groups.
  - of related blocks through 8-bit clustering of block allocations.
- Reduces CPU overhead:
  - The size of the bitmaps that must be searched is limited.
  - Block pre-allocation reduces allocation overhead.
  - The physical and logical location of the last block is recorded in each inode.
  - Rapidly detects and deals with sequential allocations.

48 Extension of Ext2fs
- File undeletion
- Access Control List
  - File protection per user and/or per group
- Automatic file compression
  - Files stored in gzipped format
  - Decompression on the fly during reads

49 Need for New File System (1): Technologies
- Three important components in file system design: processors, disks, and main memory.
- CPU speed is increasing at an exponential rate, while the improvement in disk speed is slower. Disk improvements have focused mainly on transfer bandwidth, with no major improvement in access time (e.g., seek time). -> No major speed-up in applications.
- Main memory is increasing in size, which makes a large file cache possible.
  - Absorbs a greater fraction of the read requests (80 ~ 90% hit ratio).
  - Disk traffic will become more and more dominated by writes.
  - The file cache can be used as a write buffer that allows blocks to be collected before being written in a single transfer.
  - A large buffer may result in large data loss. -> Solution: periodic update or synchronous write.

50 Need for New File System (2): Workloads
- Among different file system workloads, one of the most difficult for file system design is found in office and engineering environments.
- Office and engineering applications:
  - Tend to be dominated by accesses to small files (only a few kilobytes).
  - Small files usually result in small random disk I/Os.
  - The creation and deletion times for such files are often dominated by updates to file system metadata.
- Workloads dominated by sequential accesses to large files: supercomputing applications or multimedia applications.
  - A number of techniques exist to ensure that files are laid out sequentially on disk (e.g., FFS, Ext2, etc.).
  - Use large block sizes.
  - I/O performance tends to be limited by the disk I/O bandwidth.

51 Problems with Traditional File Systems (1)
- Performance problems
  - Kernel algorithms force a large number of synchronous I/O operations, resulting in extremely long completion times. FFS writes data blocks asynchronously, while metadata is written synchronously. -> Simplified crash recovery but reduced write performance.
  - The disk layout in FFS restricts it to using only a fraction of the total disk bandwidth.
    - FFS is designed to read or write a single block in each I/O request.
    - If two blocks are on consecutive sectors on the disk, the disk would rotate past the next block before the kernel could issue the request (due to kernel processing time).
    - FFS therefore introduces the concept of rotational delay (rotdelay): block interleaving.
    - Typically a complete rotation takes around 15 ms and the kernel needs 4 ms.
    - If the block size is 4 Kbytes and each track has 8 blocks, rotdelay = 2.
    - This restricts the disk bandwidth to about 1/3 of the total bandwidth.

52 Problems with Traditional File Systems (2)
    - Solution: 1) read/write the entire track in each operation; 2) use an on-disk cache (disk reads store the entire track in the cache).
    - Write operations still suffer from the rotational delay problem.

53 Problems with Traditional File Systems (3)
  - Predominance of disk writes. A large buffer cache absorbs many of the disk read requests.
    - For consistency, the update daemon periodically flushes dirty blocks to disk.
    - Some disk operations require synchronous disk updates.
  - Many of the synchronous writes turn out to be quite unnecessary.
    - Due to strong locality, the same block is very likely to be modified again.
    - Many files have a very short lifetime.
  - Disk head seek problem.
  - Since writes account for most of the disk activity, the operating system needs to find other ways to solve these problems.

54 Problems with Traditional File Systems (4)
- Metadata updates
  - In order to prevent file system corruption, metadata updates need to be written in a precise order.
    - For example, if a file is deleted, the kernel must remove the directory entry, free the inode, and free the disk blocks used by the file.
  - In traditional file systems, such ordering is achieved through synchronous writes.
  - Since the attributes (e.g., the inode) for a file are separate from the file's contents, it takes several disk I/Os, each preceded by a seek, to do disk operations (e.g., 5 I/Os to create a new file in FFS).
  - When writing small files, less than 5% of the disk's potential bandwidth is used for new data; the rest of the time is spent seeking.

55 Problems with Traditional File Systems (5)
- Crash recovery
  - Ordering metadata writes helps control the damage caused by a system crash but does not eliminate it.
  - Sequence of operations fsck uses to rebuild the file system:
    - Read and check all inodes and build a bitmap of used data blocks.
    - Record inode numbers and block addresses of all directories.
    - Validate the structure of the directory tree, making sure that all links are accounted for. Validate directory contents to account for all the files.
    - If any directories could not be attached to the tree in phase 2, put them in the lost+found directory.
    - If any file could not be attached to a directory, put it in the lost+found directory.
    - Check the bitmaps and summary counts for each cylinder group.
  - Systems may experience a long delay, while fsck runs, before they can restart after a crash.

56 Problems with Traditional File Systems (6)
- Security
  - The UNIX access control mechanism (permission bits) is not enough in large computing environments.
  - Need a finer-granularity control scheme.
    - ACL (Access Control List): allows the file owner to explicitly allow or restrict different types of access for specific users and groups.
  - UNIX inodes are not designed to hold such a list, so the file system must find other ways of implementing ACLs.
- Size
  - Unnecessary restrictions on the size of the file system and of individual files.

57 Journaling Approach (1)
- Record all file system changes in an append-only log file.
  - Uses a database logging technique: keep track of changes to make sure that all updates on the disk are done safely.
- The log is written sequentially, in large chunks at a time, which results in efficient disk utilization and high performance.
- After a crash, only the log needs to be examined, which means quicker recovery and higher reliability.
- Write performance can be improved because no seek operations are needed.
- Basic characteristics:
  - What to log?
    - All modifications including data blocks (Logging File System) vs. only metadata changes (Journaling File System).
  - Log operations or values?

58 Journaling Approach (2)
  - Log-enhanced file systems vs. log-structured file systems.
    - Log-enhanced file system: retains the traditional on-disk structures, such as inodes and superblocks, and uses the log as a supplementary record.
    - Log-structured file system: the log is the only representation of the file system on disk -> requires full logging (data as well as metadata).
  - Garbage collection: finite-sized log (logically circular file).
  - Group commit: need to write the log in large chunks. There is a tradeoff between performance and reliability.
  - Retrieval: need an efficient way (indexing technique) of retrieving data from the log in case of a cache miss.

59 Static File System vs. Logging File System
[Figure: a file header and file data blocks 1-3, shown before and after a write that updates blocks 2 and 3, for a static file system and for a logging file system.]

