1 G22.3250-001 Robert Grimm New York University SGI’s XFS or Cool Pet Tricks with B+ Trees

2 Altogether Now: The Three Questions  What is the problem?  What is new or different?  What are the contributions and limitations?

3 Some Background  Inode  On-disk data structure containing a file’s metadata and pointers to its data  Defines FS-internal namespace  Inode numbers  Originated in Unix FFS  Vnode  Kernel data structure for representing files  Provides standard API for accessing different FS’s  Originated in Sun OS
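To make the inode/vnode split concrete, here is a minimal Python sketch (the class and field names are illustrative, not actual kernel structures): the inode is per-file-system, on-disk metadata identified by inode number, while the vnode is the kernel's file-system-neutral handle that dispatches standard operations to whichever file system owns the file.

from dataclasses import dataclass, field

@dataclass
class Inode:
    """On-disk metadata for one file, identified by inode number within its FS."""
    number: int
    size: int = 0
    block_pointers: list = field(default_factory=list)

class Vnode:
    """In-kernel representation of a file: one standard API over many FS types."""
    def __init__(self, fs_ops, inode):
        self.fs_ops = fs_ops        # operation table supplied by the concrete FS
        self.inode = inode
    def read(self, offset, length):
        return self.fs_ops["read"](self.inode, offset, length)

# A toy FS supplies its own read operation behind the common vnode interface.
toy_fs_ops = {"read": lambda inode, off, length: b"\0" * min(length, inode.size - off)}
v = Vnode(toy_fs_ops, Inode(number=42, size=4096))
print(len(v.read(0, 100)))   # -> 100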

4 Motivation  On one side: I/O bottleneck  It’s becoming harder to utilize increasing I/O capacity and bandwidth  112 9 GB disk drives provide 1 TB of storage  High end drives provide 500 MB/sec sustained disk bandwidth  On the other side: I/O intensive applications  Editing of uncompressed video  30 MB/sec per stream, 108 GB for one hour  Streaming compressed video on demand  2.7 TB for 1,000 movies, 200 movies require 100 MB/sec

5 Scalability Problems of Existing File Systems  Slow crash recovery  fsck needs to scan entire disk  No support for large file systems  32 bit block pointers address only 4 billion blocks  At 8 KB per block, 32 TB  No support for large, sparse files  64 bit block pointers require more levels of indirection  Are also quite inefficient  Fixed-size extents are still too limiting
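As a back-of-the-envelope check on the 32-bit limit (a sketch; 8 KB is the block size quoted above):

BLOCK_SIZE = 8 * 1024            # 8 KB blocks, as on the slide
MAX_BLOCKS = 2 ** 32             # a 32-bit block pointer names ~4.3 billion blocks

max_bytes = MAX_BLOCKS * BLOCK_SIZE
print(f"{MAX_BLOCKS:,} blocks * {BLOCK_SIZE:,} B = {max_bytes / 2**40:.0f} TB")
# -> 4,294,967,296 blocks * 8,192 B = 32 TB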

6 Scalability Problems of Existing File Systems (cont.)  No support for large, contiguous files  Bitmap structures for tracking free and allocated blocks do not scale  Hard to find large regions of contiguous space  But, we need contiguous allocation for good utilization of bandwidth  No support for large directories  Linear layout (inode number, name entries) does not scale  In-memory hashing imposes high memory overheads  No support for large numbers of files  Inodes preallocated during file system creation

7 XFS in a Nutshell  Use 64 bit block addresses  Support for larger file systems  Use B+ trees and extents  Support for larger numbers of files, larger files (which may be sparse or contiguous), larger directories  Better utilization of I/O bandwidth  Log metadata updates  Faster crash recovery

8 XFS Architecture  I/O manager  I/O requests  Directory manager  File system name space  Space manager  Free space, inode & file allocation  Transaction manager  Atomic metadata updates  Unified buffer cache  Volume manager  Striping, concatenation, mirroring
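A rough sketch of how these modules hand work to one another (the wiring and names are illustrative, not actual XFS interfaces): allocation requests flow through the space manager, which records its metadata changes through the transaction manager before anything reaches the volume manager.

class VolumeManager:                     # striping, concatenation, mirroring
    def write(self, target, data):
        print(f"volume: write {target}")

class TransactionManager:                # atomic metadata updates via the log
    def __init__(self, vol):
        self.vol, self.log = vol, []
    def commit(self, changes):
        self.log.append(changes)         # log the changes first ...
        for target, data in changes:     # ... then write the metadata itself
            self.vol.write(target, data)

class SpaceManager:                      # free space, inode & file allocation
    def __init__(self, txn):
        self.txn, self.next_free = txn, 1000
    def allocate(self, nblocks):
        start = self.next_free
        self.next_free += nblocks
        self.txn.commit([("free-space record", b"...")])
        return start

class DirectoryManager:                  # the file system name space
    def __init__(self, space):
        self.space = space

class IOManager:                         # satisfies file I/O requests
    def __init__(self, space):
        self.space = space

space = SpaceManager(TransactionManager(VolumeManager()))
dirs, io = DirectoryManager(space), IOManager(space)
print(space.allocate(8))                 # -> 1000, after logging the update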

9 Storage Scalability  Allocation groups  Are regions with their own free space maps and inodes  Support AG-relative block and inode pointers  Reduce size of data structures  Improve (thread) parallelism of metadata management  Allow concurrent accesses to different allocation groups  Unlike FFS, are motivated (mostly) by scalability and parallelism and not by locality  Free space  Two B+ trees describing extents (what’s a B+ tree?)  One indexed by starting block (used when?)  One indexed by length of extent (used when?)
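A minimal sketch of the two free-space indexes, using sorted Python lists (via bisect) in place of the on-disk B+ trees; the class and method names are illustrative. The by-start index answers "is there free space at or near this block?" (useful for placing data close to related data), while the by-length index answers "where is an extent at least this big?" (useful when only the size of the request matters).

import bisect

class FreeSpace:
    """Free extents indexed two ways, mimicking the two per-AG B+ trees."""
    def __init__(self):
        self.by_start = []    # (start_block, length), sorted by starting block
        self.by_length = []   # (length, start_block), sorted by extent length

    def insert(self, start, length):
        bisect.insort(self.by_start, (start, length))
        bisect.insort(self.by_length, (length, start))

    def alloc_near(self, want_start, length):
        """By-start index: find a free extent at or after a target block."""
        i = bisect.bisect_left(self.by_start, (want_start, 0))
        for start, free_len in self.by_start[i:]:
            if free_len >= length:
                return start
        return None

    def alloc_by_size(self, length):
        """By-length index: find the smallest free extent that fits."""
        i = bisect.bisect_left(self.by_length, (length, 0))
        if i < len(self.by_length):
            free_len, start = self.by_length[i]
            return start
        return None

fs = FreeSpace()
fs.insert(100, 50)
fs.insert(400, 8)
print(fs.alloc_near(300, 4))   # -> 400 (nearest free extent at or after block 300)
print(fs.alloc_by_size(30))    # -> 100 (smallest extent with at least 30 free blocks)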

10 Storage Scalability (cont.)  Large files  File storage tracked by extent map  Each entry: block offset in file, length in blocks, starting block on disk  Small extent map organized as list in inode  Large extent map organized as B+ tree rooted in inode  Indexed by block offset in file  Large number of files  Inodes allocated dynamically  In chunks of 64  Inode locations tracked by B+ tree  Only points to inode chunks
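A sketch of an extent-map lookup, with each entry holding (block offset in file, length in blocks, starting block on disk) and the list kept sorted by file offset; in XFS the large case is a B+ tree rooted in the inode, but a binary search over a sorted list shows the same idea (the numbers are made up):

import bisect

# Extent records: (block offset in file, length in blocks, starting block on disk),
# kept sorted by file offset. The file is sparse: there is a hole from block 24 to 1023.
extents = [(0, 16, 9000), (16, 8, 12000), (1024, 4, 20000)]

def lookup(file_block):
    """Map a file block offset to a disk block, or None if it falls in a hole."""
    i = bisect.bisect_right(extents, (file_block, float('inf'), float('inf'))) - 1
    if i < 0:
        return None
    offset, length, disk_start = extents[i]
    if file_block < offset + length:
        return disk_start + (file_block - offset)
    return None   # inside a hole of the sparse file

print(lookup(20))    # -> 12004 (second extent)
print(lookup(500))   # -> None  (hole: no blocks allocated)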

11 Storage Scalability (cont.)  Large directories  Directories implemented as (surprisingly) B+ trees  Map 4 byte hashes to directory entries (name, inode number)  Fast crash recovery  Enabled by write ahead log  For all structural updates to metadata  E.g., creating a file  directory block, new inode, inode allocation tree block, allocation group header block, superblock  Independent of actual data structures (just binary data)  However, still need disk scavengers for catastrophic failures  The customers spoke…
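A sketch of the hash-keyed directory idea: entries are looked up by a fixed-size hash of the name, with collisions resolved by comparing the stored name. A Python dict stands in for the on-disk B+ tree, and zlib's crc32 stands in for whatever 4-byte hash XFS actually uses:

from zlib import crc32

# hash -> list of (name, inode number); collisions resolved by comparing names
directory = {}

def dir_add(name, inum):
    directory.setdefault(crc32(name.encode()) & 0xFFFFFFFF, []).append((name, inum))

def dir_lookup(name):
    for entry_name, inum in directory.get(crc32(name.encode()) & 0xFFFFFFFF, []):
        if entry_name == name:
            return inum
    return None

dir_add("paper.tex", 4711)
dir_add("data.bin", 4712)
print(dir_lookup("paper.tex"))   # -> 4711
print(dir_lookup("missing"))     # -> None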

12 Performance Scalability  Allocating files contiguously  On-disk allocation is delayed until flush  Uses (cheap and plentiful) memory to improve I/O performance  Typically enables allocation in one extent  Even for random writes (think memory-mapped files)  Avoids allocation for short-lived files  Extents have large range: 21 bit length field  Two million file system blocks  Block size can vary by file system  Small blocks for file systems with many small files  Large blocks for file systems with mostly large files  What prevents long-term fragmentation?
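A sketch of delayed allocation (names are illustrative): writes only dirty in-memory blocks; the on-disk extent is chosen once, at flush time, when the whole dirty range is known. Even out-of-order writes then usually land in a single extent, and a file deleted before the flush never allocates anything.

class DelallocFile:
    """Delayed allocation: writes dirty memory only; disk blocks are picked at flush."""
    BLOCK = 8192

    def __init__(self):
        self.dirty = {}                 # file block offset -> data held in memory

    def write(self, file_block, data):
        self.dirty[file_block] = data   # no on-disk blocks chosen yet

    def flush(self, alloc):
        """At flush time the whole dirty range is known, so one contiguous
        extent usually suffices, even if the writes arrived out of order."""
        if not self.dirty:
            return None                 # short-lived file: nothing was ever allocated
        blocks = sorted(self.dirty)
        disk_start = alloc(len(blocks)) # ask the allocator for a single extent
        return [(b, disk_start + i) for i, b in enumerate(blocks)]

f = DelallocFile()
for b in (3, 0, 1, 2):                  # writes arrive in "random" order
    f.write(b, b"x" * DelallocFile.BLOCK)
print(f.flush(alloc=lambda n: 5000))    # blocks 0..3 map to one extent at disk 5000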

13 Performance Scalability (cont.)  Performing file I/O  Read requests issued for large I/O buffers  Followed by multiple read ahead requests for sequential reads  Writes are clustered to form larger, asynch. I/O requests  Delayed allocation helps with buffering writes  Direct I/O lets applications bypass cache and use DMA  Applications have control over I/O, while still accessing file system  But also need to align data on block boundaries and issue requests on multiples of block size  Reader/writer locking supports more concurrency  Several processes on different CPUs can access the same file  Direct I/O leaves serialization entirely to applications
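A sketch of the alignment rule that direct I/O imposes (the block size and function are illustrative, not an XFS interface): because the request bypasses the buffer cache and transfers straight between the device and the application's buffer, both the file offset and the length must be multiples of the block size.

BLOCK_SIZE = 4096   # illustrative; the real value is per file system

def check_direct_io(offset, length):
    """Direct I/O bypasses the buffer cache, so requests must be block-aligned."""
    if offset % BLOCK_SIZE or length % BLOCK_SIZE:
        raise ValueError("direct I/O requires block-aligned offset and length")
    return True

print(check_direct_io(8192, 65536))   # OK: both multiples of 4096
try:
    check_direct_io(100, 512)         # rejected: not block-aligned
except ValueError as e:
    print(e)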

14 Performance Scalability (cont.)  Accessing and updating metadata  Updates performed in asynchronous write-ahead log  Modified data still only flushed after log update has completed  But metadata not locked, multiple updates can be batched  Log may be placed on different device from file system  Including non-volatile memory (NV-RAM)  Log operation is simple, but log is centralized (think SMP)  Provide buffer space, copy, write out, notify  Copying can be done by processors performing transaction
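A minimal sketch of the write-ahead rule for metadata, with illustrative names: transactions append binary descriptions of their changes to a central log, and a modified metadata block may only be written back to its home location after its log record is on stable storage.

log = []          # the central, sequential log (could live on NV-RAM or another device)
log_flushed = 0   # number of log records known to be on stable storage

def log_transaction(changes):
    """Append binary descriptions of the metadata changes; the copy into the
    log buffer is done by the processor performing the transaction."""
    log.append(list(changes))
    return len(log)            # log sequence number of this transaction

def flush_log():
    global log_flushed
    log_flushed = len(log)     # pretend the log device acknowledged the write

def can_write_back(lsn):
    """A modified metadata block may reach disk only after its log record has."""
    return lsn <= log_flushed

lsn = log_transaction([("inode 42", b"...new inode..."), ("dir block 7", b"...entry...")])
print(can_write_back(lsn))   # False: log record not yet on disk
flush_log()
print(can_write_back(lsn))   # True: safe to write the metadata blocks back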

15 Experiences

16 I/O Throughput  What can we conclude?  Read speed, difference between creates and writes, parallelism

17 Benchmark Results (The Marketing Dept. Speaketh)  Datamation sort  3.52 seconds (7 seconds previous record)  Indy MinuteSort  1.6 GB sorted in 56 seconds (1.08 GB previously)  SPEC SFS  8806 SPECnfs instead of 7023 SPECnfs with EFS  ~25% increase with mostly small & synchronous writes on similar hardware

18 Directory Lookups  Why this noticeable break?

19 Are You Pondering What I Am Pondering?  Are these systems/approaches contradictory?  Recoverable virtual memory  Simpler is better and better performing  XFS  Way more complex is better and better performing

20 What Do You Think?

