Jeff's Filesystem Papers Review Part II. Review of "The Design and Implementation of a Log-Structured File System"
The Design and Implementation of a Log-Structured File System, by Mendel Rosenblum and John K. Ousterhout (UC Berkeley). Ousterhout introduced the idea in an earlier paper with a few different configurations; this paper describes the concept as it exists after it was implemented as Sprite LFS, a new file system for the Sprite operating system. Some empirical research was done after the implementation, and the measurements support the claim that LFS is a good idea. This presentation is an academic review; the ideas presented are either quotes or paraphrases of the reviewed document.
Intro: Why? (problem statement)
- CPUs are getting faster. Memory is getting faster. Disks are not.
- Amdahl's Law: bottlenecks move around. As the CPU gets faster, the bottleneck moves to memory or disk, etc.
- We need to find a way to use disks more efficiently.
- Assumption: caching files in RAM improves the READ performance of a filesystem significantly more than its WRITE performance.
  - Therefore disk activity will become more write-centric in the future.
2.1 Disk Technology
- Disk improvement is in the area of price/capacity and physical size, not seek time. Even if I/O bandwidth improves, seek time will still be the killer.
- Memory is getting cheaper and faster, so use a memory cache to ease the disk bottleneck.
- Caches: what is the difference between a cache and a buffer?
  - A buffer sits between two devices of different speeds; a cache speeds up subsequent similar or proximate accesses.
  - Caches can also reorder writes so they reach the disk more efficiently.
2.2 Workloads
Three classes of file access patterns (from a different paper):
- scientific processing: read and write large files sequentially
- transaction processing: many simultaneous requests for small chunks of data
- engineering/office applications: access a large number of small files in sequence
Engineering/office is the killer, and that is what LFS is designed for.
2.3 Problems with Existing FSs
UNIX FFS (the Fast File System, also Berkeley-developed):
- tries to place file data sequentially on disk
- inode data lives at a fixed location on disk
- directory data lives at yet another location
- a total of at least five seeks to create a new file (bad)
- file data is written asynchronously so the program can continue without waiting for the FS, BUT metadata is written synchronously, so the program blocks when manipulating things like inode data.
3 LFS
Buffer a sequence of FS changes in the file cache, then write them sequentially to disk in one chunk to avoid seeks. Essentially all data is merely a log entry. This creates two problems:
- how to read from the log
- how to keep free space on the disk (i.e., you keep writing forward forever, and eventually you wrap at the end of the disk).
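The core write path above can be sketched as follows. This is a toy model, not Sprite LFS code: the names (`LogBuffer`, `SEGMENT_SIZE`, `flush`) and the in-memory representation are illustrative assumptions.

```python
# Sketch: buffer FS changes in memory, flush as one sequential segment write.
SEGMENT_SIZE = 512 * 1024  # 512 KB; a segment size in the range the paper discusses

class LogBuffer:
    def __init__(self):
        self.pending = []       # (kind, payload): data blocks, inodes, directory blocks
        self.pending_bytes = 0

    def append(self, kind, payload):
        """Buffer a change instead of seeking to a fixed disk location."""
        self.pending.append((kind, payload))
        self.pending_bytes += len(payload)
        if self.pending_bytes >= SEGMENT_SIZE:
            return self.flush()
        return None

    def flush(self):
        """Emit everything accumulated so far as one contiguous segment."""
        segment = list(self.pending)
        self.pending, self.pending_bytes = [], 0
        return segment  # in a real FS this is a single sequential disk write

log = LogBuffer()
log.append("inode", b"\x00" * 128)
log.append("data", b"hello world")
seg = log.flush()
assert [k for k, _ in seg] == ["inode", "data"]  # one write, no intervening seeks
```

The point of the sketch: every kind of change (file data, inodes, directory blocks) lands in the same append-only stream, which is exactly what makes the two problems on this slide (reading, and reclaiming space) appear.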
3.1 How to Read
- Reads run at the same speed as FFS once the inode is located. Locating the inode is what slows down FFS and where LFS is better.
- FFS keeps inodes in a static portion of the disk, unrelated to the physical location of the data. LFS writes each inode near its data, at the head (end) of the log.
- Because inodes move, another (but much smaller) map of the inodes is needed: the inode map.
  - It is so small that it is kept in cache all the time, so it causes no extra seeks.
  - A fixed area on disk, the checkpoint region, records where the inode map blocks live.
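The indirection described above can be sketched in a few lines. The flat-dict "disk" and the function names are illustrative assumptions; the real inode map is itself written to the log in blocks.

```python
# Sketch: the inode map translates inode number -> disk address of the
# newest copy of that inode, so a read costs the same as FFS once found.
inode_map = {}   # inode number -> disk address of current inode
disk = {}        # disk address -> block contents (toy model of the log)

def write_inode(inum, inode_block, next_free_addr):
    """Appending an inode at the log head updates its map entry."""
    disk[next_free_addr] = inode_block
    inode_map[inum] = next_free_addr   # the old copy becomes dead (garbage)

def read_inode(inum):
    """One in-memory map lookup, then one disk read; no fixed inode area."""
    return disk[inode_map[inum]]

write_inode(7, {"size": 11, "blocks": [100]}, next_free_addr=4096)
write_inode(7, {"size": 22, "blocks": [100, 101]}, next_free_addr=8192)  # rewrite
assert read_inode(7)["size"] == 22   # the map always points at the newest copy
```

Note how rewriting inode 7 leaves its old copy at address 4096 stranded in the log; that dead data is what segment cleaning (next slides) reclaims.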
3.2 Free Space Management
The log eventually wraps around. Choices:
- threading: don't defragment, just write to the next free block
- GC-style: stop everything and copy
- incremental: copy continuously
Solution: segments.
- Divide the disk into segments, with the segment size chosen so that the cost of a seek is amortized over a large sequential transfer.
- Each segment is written contiguously, and the disk is compacted segment by segment to avoid fragmentation. This defragmentation is known as segment cleaning.
3.3 Segment Cleaning
It should be pretty obvious how to do it. Three steps:
1. read a number of non-clean segments into memory
2. keep only the live data (the in-use portion of those segments)
3. write the live data back to disk in clean segments
Other logistical considerations in segment cleaning:
- update inodes (and the inode map) to point at the new locations
- update fixed structures such as the checkpoint region
  - remember these are in cache; as we will see later, they are dumped to disk at predetermined intervals.
There is some other bookkeeping as well (each segment has a summary header, etc.); read the paper for details.
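The three steps above can be sketched as a single function. The `is_live` callback stands in for the liveness check the real cleaner performs using segment summary blocks and the inode map; all names here are illustrative assumptions.

```python
# Sketch of segment cleaning: read dirty segments, keep live blocks,
# rewrite them compacted into fresh segments.
def clean(segments, is_live, write_segment, segment_capacity):
    """segments: list of lists of blocks; is_live(block) -> bool;
    write_segment(blocks) writes one clean segment back to disk."""
    # Steps 1 + 2: read candidate segments, extract only live data
    live = [b for seg in segments for b in seg if is_live(b)]
    # Step 3: write the live data back contiguously
    for i in range(0, len(live), segment_capacity):
        write_segment(live[i:i + segment_capacity])
    # After this the input segments contain no live data and are free.

written = []
clean(
    segments=[["a", "dead1"], ["dead2", "b", "c"]],
    is_live=lambda b: not b.startswith("dead"),
    write_segment=written.append,
    segment_capacity=2,
)
assert written == [["a", "b"], ["c"]]  # 2 dirty segments compacted, space reclaimed
```

The interesting design decisions are all hidden in the parameters: which segments to pass in, how many at once, and in what order to write the live data back — exactly the policy questions the next slides take up.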
3.4 Segment Cleaning: How to Configure It
- When to do it? At low priority, or when disk space is needed.
  - The authors chose "when disk space is needed," controlled by watermarks.
- How many segments to clean at one time? The more segments cleaned at once, the more intelligently the cleaner can rearrange the data and the better organized the disk.
  - Also controlled by the watermarks chosen above.
- Which segments to clean? ...coming.
- Since the live data can be written back to disk in any order, it should be written back in the most efficient arrangement for its predicted use. ...coming.
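The watermark scheme can be sketched as a simple trigger. The threshold values and names here are made-up examples, not the paper's numbers; the idea is only that cleaning starts below one threshold and runs until the other is restored.

```python
# Sketch: watermark-driven cleaning. Start when clean segments drop
# below the low watermark; keep cleaning until the high watermark.
LOW_WATERMARK = 20    # start cleaning below this many clean segments (example value)
HIGH_WATERMARK = 50   # stop once this many clean segments exist (example value)

def maybe_clean(clean_segments, clean_one):
    """clean_one() cleans some dirty segments and returns how many it freed."""
    if clean_segments >= LOW_WATERMARK:
        return clean_segments          # plenty of free space: do nothing
    while clean_segments < HIGH_WATERMARK:
        clean_segments += clean_one()  # batch work until the high mark is reached
    return clean_segments

assert maybe_clean(25, lambda: 1) == 25   # above low watermark: untouched
assert maybe_clean(10, lambda: 5) == 50   # cleans in batches up to the high mark
```

Cleaning up to the high watermark in one burst is what lets the cleaner process many segments at a time, which the slide notes gives it more freedom to reorganize data well.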
3.5-3.6 Determining the Segment-Cleaning Policy
Here the authors turned to empirical study: they wrote a simulator and experimented with segment-cleaning configurations to find a good policy. Results/conclusions:
- Differentiate between hot and cold segments based on past history.
  - A hot segment is one that is likely to be written again soon.
  - A cold segment is one that is unlikely to be written again.
- They arrived at a policy called cost-benefit:
  - cold segments are cleaned while still about 75% full
  - hot segments are left alone until they drop to about 15% utilization
- The utilization and the "temperature" (age) of each segment are maintained in an in-memory table.
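The cost-benefit policy ranks segments by the paper's ratio benefit/cost = (1 - u) * age / (1 + u), where u is the segment's utilization (fraction of live data) and age is the time since its youngest block was modified. A minimal sketch, with table layout and function names as illustrative assumptions:

```python
# Sketch of the cost-benefit cleaning policy from the paper.
def cost_benefit(u, age):
    """Higher is better to clean: free space gained (1-u), weighted by
    stability (age), divided by the cost to read and rewrite (1+u)."""
    return (1.0 - u) * age / (1.0 + u)

def pick_segments(table, n):
    """table: list of (segment_id, utilization, age); pick the n best to clean."""
    return sorted(table, key=lambda s: cost_benefit(s[1], s[2]), reverse=True)[:n]

# A cold, fairly full segment outranks a hot, half-empty one, which is
# exactly the 75%-vs-15% behavior the slide describes:
table = [("hot", 0.50, 1.0), ("cold", 0.75, 100.0)]
assert pick_segments(table, 1)[0][0] == "cold"
```

Cold data wins despite its high utilization because its free space, once reclaimed, stays free for a long time; hot segments are worth waiting on, since their live data will die on its own.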
4 Crash Recovery
- FFS: the major problem is that the entire disk must be scanned, because the most-recently-written data could be anywhere on disk.
- LFS: the most-recently-written data is all at one location, the end of the log. LFS uses checkpoints and roll-forward to maintain consistency.
  - These ideas were borrowed from database technology.
4.1 Crash Recovery: the Checkpoint Region
- Two copies are maintained at fixed locations on disk, written to alternately, in case of a crash while updating checkpoint data.
- At points in time:
  - I/O is blocked
  - all cached data is written to the end of the log
  - checkpoint data from cache is written to disk
  - I/O is then re-enabled
  - (checkpoints could instead be triggered by the amount of data written since the last one)
- Note the similarity to GC techniques.
- Skipping the roll-forward techniques: they are quite complex and depend on segment-header info; read the paper for more. Roll-forward just enhances checkpointing by recovering data written after the last checkpoint.
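The alternating two-copy scheme can be sketched as follows. The dict layout, field names, and the `complete` flag are illustrative assumptions; the real checkpoint region holds the addresses of the inode map and segment usage table blocks plus the log tail.

```python
# Sketch: two fixed checkpoint slots, written alternately and timestamped,
# so a crash mid-write can never corrupt both copies.
checkpoint_slots = [None, None]   # models two fixed disk locations
next_slot = 0

def write_checkpoint(inode_map_addrs, log_tail, timestamp):
    global next_slot
    region = {"imap": inode_map_addrs, "tail": log_tail,
              "time": timestamp, "complete": True}
    checkpoint_slots[next_slot] = region  # a crash here leaves the other slot intact
    next_slot = 1 - next_slot             # alternate on every checkpoint

def recover():
    """Pick the newest complete checkpoint; roll-forward would then replay
    the log from its tail to recover anything written afterwards."""
    valid = [c for c in checkpoint_slots if c and c.get("complete")]
    return max(valid, key=lambda c: c["time"])

write_checkpoint({1: 4096}, log_tail=8192, timestamp=10)
write_checkpoint({1: 12288}, log_tail=16384, timestamp=20)
assert recover()["time"] == 20   # recovery starts from the newest valid copy
```

Because the older copy is only overwritten after the newer one is safely on disk, recovery always finds at least one consistent checkpoint, which bounds the scan to the log written since then.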
5 Empirical Test Results
Comparison to FFS/SunOS: basically, LFS is significantly better for small files, and better than or as good as FFS for large files in all cases except large files that were originally written randomly and are later read sequentially. Crash recovery was not rigorously tested empirically against FFS/SunOS.