
1 File Systems: Designs Kamen Yotov CS 614 Lecture, 04/26/2001

2 Overview The Design and Implementation of a Log-Structured File System Sequential Structure Speeds up Writes & Crash Recovery The Zebra Striped Network File System Striping across multiple servers RAID-equivalent data recovery

3 Log-structured FS: Intro Order of magnitude faster!?! The future is dominated by writes Main memory increases Reads are handled by the cache Logging is old, but used differently here (cf. NTFS, Linux kernel 2.4) Challenge – finding free space Bandwidth utilization 65-75% vs. 5-10%

4 Design issues of the 1990’s Importance factors CPU – exponential growth Main memory – caches, buffers Disks – bandwidth, access time Workloads Small files – single random I/Os Large files – bandwidth vs. FS policies

5 Problems with current FS Scattered information 5 I/O operations to access a file under BSD Synchronous writes May be only the meta-data, but it’s enough Hard to benefit from the faster CPUs Network file systems More synchrony in the wrong place

6 Log-structured FS (LFS) Fundamentals Buffering many small write operations Writing them at once as a single, contiguous sequential disk write Simple or not? How to retrieve information? How to manage the free space?
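A minimal sketch of the buffering idea (illustrative Python, not Sprite LFS code; the class and all names are assumptions made here): small writes accumulate in memory and are flushed to disk as one large sequential segment write.

    import os

    SEGMENT_SIZE = 1 << 20          # 1 MB segments, as in the paper

    class LogWriter:
        """Toy log writer: batch small writes, issue one big sequential write."""
        def __init__(self, path):
            self.fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
            self.buffer = bytearray()
            self.log_offset = 0     # next free byte position in the on-disk log

        def append(self, data: bytes) -> int:
            """Buffer one small write; return its future address in the log."""
            addr = self.log_offset + len(self.buffer)
            self.buffer += data
            if len(self.buffer) >= SEGMENT_SIZE:
                self.flush()
            return addr

        def flush(self):
            """One large sequential write instead of many scattered small ones."""
            if self.buffer:
                os.pwrite(self.fd, bytes(self.buffer), self.log_offset)
                self.log_offset += len(self.buffer)
                self.buffer.clear()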

7 LFS: File Location and Reading Index structures to permit random-access retrieval Again inodes, …but at floating positions in the log! inode map: indexed, memory resident Writes are better, while reads are at least as good!
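A minimal sketch of the read path under these structures (illustrative Python; the dictionaries stand in for disk blocks and are not Sprite's on-disk format): the inode map is a memory-resident table mapping an inode number to the log address of the newest copy of that inode, and the inode itself holds ordinary block pointers, so a read costs no more disk accesses than in a conventional file system.

    inode_map = {7: 4096}                             # inode number -> log address
    log = {4096: {"block_pointers": {0: 8192}},       # newest inode of file 7
           8192: b"hello, log-structured world"}      # data block 0 of file 7

    def read_block(inum, block_no):
        inode_addr = inode_map[inum]                  # memory lookup, no I/O
        inode = log[inode_addr]                       # one disk read for the inode
        return log[inode["block_pointers"][block_no]] # one disk read for the data

    print(read_block(7, 0))                           # b'hello, log-structured world'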

8 Example: Creating 2 Files (diagram: inode, directory, data, and inode map blocks as they are laid out in the log)

9 LFS: Free Space Management Need large free chunks of space Threading Excessive fragmentation of free space Not better than other file systems Copying Can be to different places… Big costs Combination

10 Threading & copying

11 Solution: Segmentation Large fixed-size blocks (e.g. 1MB) Threading through segments Copying inside segments Transfer longer than seeking Segment cleaning Which are the live chunks To which file they belong and at what position (inode update) Segment summary block(s) File version stamps
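A sketch of the liveness check described above (illustrative Python structures, not Sprite's): the segment summary records, for each block, which file and block offset it was written for plus the file's version stamp at that time; a block is dead if the file's version has since changed (delete or truncate), and otherwise live only if the file's inode still points at this exact copy.

    from collections import namedtuple

    SummaryEntry = namedtuple("SummaryEntry", "inode_no block_no addr version")

    def live_blocks(summary, inode_map, read_inode):
        """inode_map: inode_no -> {"addr", "version"}; read_inode: addr -> inode dict."""
        live = []
        for e in summary:
            owner = inode_map.get(e.inode_no)
            if owner is None or owner["version"] != e.version:
                continue                              # file deleted/truncated: dead
            inode = read_inode(owner["addr"])         # fetch the current inode
            if inode["block_pointers"].get(e.block_no) == e.addr:
                live.append(e)                        # inode still points here: live
        return live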

12 Segment Cleaning Policies When should the cleaner execute? Continuously, At night, When exhausted How many segments to clean at once? Which segments are to be cleaned? Most fragmented ones or… How should the live blocks be grouped when written back? Locality for future reads…

13 Measuring & Analysing Write cost Average amount of time the disk is busy per byte of new data written, including cleaning overhead 1.0 is perfect – full bandwidth, no overhead Bigger is worse LFS: seek and rotational latency negligible, so it's just total bytes moved ÷ new data written! Performance trade-off: utilization vs. speed The key: bimodal segment distribution!
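Spelled out, the write cost works out as below (u is the utilization of the segments being cleaned: cleaning N bytes of segments reads them whole, rewrites the Nu bytes that are live, and frees N(1−u) bytes for new data):

    \[
      \text{write cost}
        = \frac{\text{total bytes read and written}}{\text{new data written}}
        = \frac{N + N u + N(1-u)}{N(1-u)}
        = \frac{2}{1-u}
    \]

At u = 0.8 the write cost is 10, while at u = 0.2 it drops to 2.5 – which is why the bimodal segment distribution is the key.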

14 Simulation & Results Purpose: Analyze different cleaning policies Harsh model File system is modeled as a set of 4K files At each step a file is chosen and rewritten Uniform: Each file equally likely to be chosen Hot-and-cold: the 10-90 rule (10% of the files receive 90% of the writes) Runs until the write cost stabilizes
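A toy version of the access-pattern half of this simulation (Python; 4 KB files and the 10-90 split come from the slide, while the file count and names are arbitrary choices made here); the cleaning policies themselves are not shown:

    import random

    NUM_FILES = 50_000
    HOT = int(0.1 * NUM_FILES)                 # the "hot" 10% of the files

    def next_file_uniform():
        return random.randrange(NUM_FILES)     # every file equally likely

    def next_file_hot_and_cold():
        if random.random() < 0.9:              # 90% of the writes...
            return random.randrange(HOT)       # ...go to the hot 10% of files
        return random.randrange(HOT, NUM_FILES)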

15 Write Cost vs. Disk Utilization (plot: write cost as a function of disk utilization; curves for FFS today, FFS improved, LFS uniform, LFS hot-and-cold, and no variance)

16 Hot & Cold Segments Why is locality worse than no locality? Free space valuable in cold segments Value based on data stability Approximate stability with age Cost-benefit policy Benefit: amount of space cleaned (inverse of segment utilization) × time it stays free (estimated by the age of the youngest block) Cost: read the segment + write back its live data
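In formula form, the cost-benefit policy cleans the segments with the highest ratio below, where u is the segment's utilization and age is the age of its youngest block (reading the candidate segment costs 1 and writing its live data back costs u, hence 1+u; the benefit is the 1−u of space reclaimed, weighted by how long it is expected to stay free):

    \[
      \frac{\text{benefit}}{\text{cost}}
        = \frac{(1-u)\cdot \text{age}}{1+u}
    \]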

17 Segment Utilization Distributions (plot: fraction of segments (×0.001) vs. segment utilization; curves for uniform, hot-and-cold (greedy), and hot-and-cold (cost-benefit))

18 Write Cost vs. Disk Utilization (revisited) (plot: write cost vs. disk utilization; curves for FFS today, FFS improved, LFS uniform, LFS cost-benefit, and no variance)

19 Crash Recovery Conventional file systems require a full disk scan Log-based systems are definitely better Check-pointing (two-phase, trade-offs) (Meta-)information – log Checkpoint region – fixed position inode map blocks segment usage table time & last segment written Roll-forward

20 Crash Recovery (cont.) Naïve method: On a crash, just use the latest checkpoint and go from there! Roll-forward recovery Scan segment summary blocks for new inodes If just data, but no inode, assume incomplete and ignore Adjust utilization of segments Restore consistency between directory entries and inodes (special records in the log for the purpose)
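A sketch of this roll-forward pass on plain data structures (illustrative Python; field names are assumptions, and the directory/inode consistency step is left out): start from the state in the latest checkpoint, then scan the summary entries of every segment written after it.

    def roll_forward(checkpoint, segments_after_checkpoint):
        inode_map = dict(checkpoint["inode_map"])
        seg_usage = dict(checkpoint["segment_usage"])
        for seg in segments_after_checkpoint:
            live = 0
            for entry in seg["summary"]:
                if entry["kind"] == "inode":
                    inode_map[entry["inode_no"]] = entry["addr"]   # new/updated file
                    live += entry["size"]
                elif entry["inode_no"] in inode_map:
                    live += entry["size"]      # data reachable through a logged inode
                # data with no inode in the log: incomplete write, ignore it
            seg_usage[seg["id"]] = live        # adjust the segment's utilization
        return inode_map, seg_usage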

21 Experience with Sprite LFS Part of the Sprite Network Operating System All implemented, roll-forward disabled Short 30-second check-pointing interval Not more complicated to implement than a normal “allocation” file system NTFS and Ext2 even more… Not a great improvement to the user, as few applications are disk-bound!

22 So, let’s go micro! Micro-benchmarks were produced 10 times faster when creating small files Faster in reading of order preserved Only case slower in Sprite is Write file randomly Read it sequentially Produced locality differs a lot!

23 Sprite LFS vs. SunOS 4.0.3 Sprite LFS: 4KB block, 1MB segment; x10 speed-up in writes/deletes; temporal locality; saturates the CPU; random write. SunOS 4.0.3: 8KB block; slow on writes/deletes; logical locality; keeps the disk busy; sequential read.

24 Related Work WORM media – always been logged Maintain indexes No deletion necessary Garbage collection Scavenging = Segment cleaning Generational = Cost-benefit scheme Difference: random vs. sequential Logging similar to database systems Use of the log differs (like NTFS, Linux 2.4) Recovery is like “redo log”-ging

25 Zebra Networked FS: Intro Multi-server networked file system Clients stripe their data across the servers Redundancy ensures fault-tolerance & recoverability Suitable for multimedia & parallel tasks Borrows from RAID and LFS principles Achieves speed-ups from 20% to 5x

26 Zebra: Background RAID Definitions Stripes Fragments Problems Bandwidth bottleneck Small files Differences with Distributed File Systems (diagram: a stripe made of data fragments plus a parity fragment)

27 Per-file vs. per-client striping RAID standard 4 I/Os for small files 2 reads 2 writes LFS Data distribution Parity distribution Storage efficient (diagram: a large file and two small files striped per-file vs. many files batched into a per-client log, LFS-style; the 4-I/O small-write case is illustrated in the sketch below)
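The 4-I/O small-write penalty comes straight from the parity arithmetic: parity is the XOR of the stripe's fragments, so changing one fragment means reading the old data and old parity before writing the new data and new parity. A toy in-memory model (Python, not Zebra code) makes this concrete; Zebra's per-client log sidesteps it by always writing whole stripes, so parity is computed with no reads at all.

    def xor(a: bytes, b: bytes) -> bytes:
        return bytes(x ^ y for x, y in zip(a, b))

    def small_write(stripe, idx, new_data):
        old_data   = stripe[idx]          # I/O 1: read old data
        old_parity = stripe["parity"]     # I/O 2: read old parity
        stripe[idx] = new_data            # I/O 3: write new data
        stripe["parity"] = xor(xor(old_parity, old_data), new_data)  # I/O 4

    stripe = {0: b"\x01\x01", 1: b"\x02\x02", "parity": b"\x03\x03"}
    small_write(stripe, 0, b"\x07\x07")
    assert stripe["parity"] == xor(stripe[0], stripe[1])   # parity still consistent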

28 Zebra: Network LFS Logging between clients and servers (as opposed to file server and disks) Per-client striping More efficient storage space usage Parity mechanism is simplified No overhead for small files Never needs to be modified Typical distributed computing problems

29 Zebra: Components File Manager Stripe Cleaner Storage Servers Clients File Manager and Stripe Cleaner may reside on a Storage Server as separate processes – useful for fault tolerance! (diagram: clients, storage servers, file manager, and stripe cleaner connected by a fast network)

30 Zebra: Component Dynamics Clients Location, fetching & delivery of fragments Striping, parity computation, writing Storage servers Bulk data repositories Fragment operations Store, Append, Retrieve, Delete, Identify Synchronous, non-overwrite semantics File Manager Meta-data repository Just pointers to blocks RPC bottleneck for many small files Can run as a separate process on a Storage Server Stripe Cleaner Similar to the Sprite LFS cleaner we discussed Runs as a separate, user-mode process
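A sketch of the storage-server interface implied by the operation list above (method names follow the slide; everything else, including the in-memory storage, is an assumption made here). The non-overwrite semantics show up as: a fragment may be created and appended to, but never rewritten in place.

    class StorageServer:
        def __init__(self):
            self.fragments = {}                     # fragment id -> bytearray

        def store(self, frag_id, data):
            assert frag_id not in self.fragments    # non-overwrite: create only
            self.fragments[frag_id] = bytearray(data)

        def append(self, frag_id, data):
            self.fragments[frag_id] += data         # growth is the only mutation

        def retrieve(self, frag_id, offset=0, length=None):
            frag = self.fragments[frag_id]
            end = len(frag) if length is None else offset + length
            return bytes(frag[offset:end])

        def delete(self, frag_id):
            del self.fragments[frag_id]             # used after stripe cleaning

        def identify(self):
            return sorted(self.fragments)           # which fragments live here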

31 Zebra: System Operation - Deltas Communication via Deltas Fields File ID, File version, Block # Old & New block pointers Types Update, Cleaner, Reject Reliable, because stored in the log Replay after crashes
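A sketch of what a delta record carries, using the fields and types listed on this slide (Python dataclass for illustration; field names and types are not Zebra's actual wire format):

    from dataclasses import dataclass
    from enum import Enum
    from typing import Optional

    class DeltaType(Enum):
        UPDATE = 1
        CLEANER = 2
        REJECT = 3

    @dataclass
    class Delta:
        kind: DeltaType
        file_id: int
        file_version: int
        block_no: int
        old_block_ptr: Optional[int]   # None for a newly created block
        new_block_ptr: Optional[int]   # None when the block is being freed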

32 Zebra: System Operation (cont.) Writing files Flushes on Threshold age (30 s) Cache full & dirty Application fsync File manager request Striping Deltas update Concurrent transfers Reading files Nearly identical to conventional FS Good client caching Consistency Stripe cleaning Choosing which to… Space utilization through deltas Stripe Status File

33 Zebra: Advanced System Operations Adding Storage Servers Scalable Restoring from crashes Consistency & Availability Specifics due to distributed system state Internally inconsistent stripes Stripe information inconsistent with File Manager Stripe cleaner state consistency with Storage Servers Logging and check-pointing Fast recoveries after failures

34 Prototyping Most of the interesting parts only on paper Included All UNIX file commands, file system semantics Functional cleaner Clients construct fragments and write parities File Manager and Storage Servers checkpoint Some advanced crash recovery methods omitted Metadata not yet stored on Storage Servers Clients do not automatically reconstruct fragments upon a Storage Server crash Storage Servers do not reconstruct fragments on recovery File Manager and Cleaner not automatically restarted

35 Measurements: Platform Cluster of DECstation-5000 Model 200 100 Mb/s FDDI local network ring 20 SPECint 32 MB RAM 12 MB/s memory to memory copy 8 MB/s memory to controller copy RZ57 1GB disks, 15ms seek 2 MB/s native transfer bandwidth 1.6 MB/s real transfer bandwidth (due to controller) Caching disk controllers (1MB)

36 Measurements: Results (1)

37 Measurements: Results (2)

38 Measurements: Results (3)

39 Measurements: Results (4)

40 Zebra: Conclusions Pros Applies parity and log structure to network file systems Performance Scalability Cost-effective servers Availability Simplicity Cons Lacks name caching, causing severe performance degradation Not well suited for transaction processing Metadata problems Small reads are problematic again

