
File System Reliability


1 File System Reliability

2 Today’s Lecture Review of Disk Scheduling Storage reliability

3 Disk Scheduling The operating system is responsible for using hardware efficiently — for the disk drives, this means having fast access time and high disk bandwidth. Access time has two major components: Seek time is the time for the disk arm to move the heads to the cylinder containing the desired sector. Rotational latency is the additional time waiting for the disk to rotate the desired sector to the disk head. Minimize seek time: seek time ≈ seek distance. Disk bandwidth is the total number of bytes transferred, divided by the total time between the first request for service and the completion of the last transfer.

4 Disk Performance Parameters
The actual details of disk I/O operation depend on the computer system, the operating system, and the nature of the I/O channel and disk controller hardware. A general timing diagram of disk I/O transfer is shown in Figure 11.6. Dave Bremer, Otago Polytechnic, NZ ©2008, Prentice Hall

5 Disk Scheduling (Cont.)
Several algorithms exist to schedule the servicing of disk I/O requests. We illustrate them with a request queue (0-199). 98, 183, 37, 122, 14, 124, 65, 67 Head pointer 53

6 FCFS: First Come First Serve
Illustration shows total head movement of 640 cylinders.

7 SSTF: Shortest Seek Time First
Selects the request with the minimum seek time from the current head position. SSTF scheduling is a form of SJF scheduling; may cause starvation of some requests. Illustration shows total head movement of 236 cylinders.
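The two head-movement totals above can be reproduced with a short simulation. This is a sketch that assumes, as the slides do, that seek cost is simply the absolute distance between cylinder numbers; the function names are ours, not from the slides.

```python
def fcfs(head, requests):
    """Total head movement servicing requests in arrival order (FCFS)."""
    total = 0
    for r in requests:
        total += abs(head - r)
        head = r
    return total

def sstf(head, requests):
    """Total head movement always servicing the closest pending request (SSTF)."""
    pending = list(requests)
    total = 0
    while pending:
        nxt = min(pending, key=lambda r: abs(head - r))  # shortest seek next
        total += abs(head - nxt)
        head = nxt
        pending.remove(nxt)
    return total

queue = [98, 183, 37, 122, 14, 124, 65, 67]   # the request queue from slide 5
print(fcfs(53, queue))  # 640 cylinders, matching the FCFS illustration
print(sstf(53, queue))  # 236 cylinders, matching the SSTF illustration
```

Running both on the slide's queue with the head at cylinder 53 confirms the 640 and 236 cylinder totals quoted in the illustrations.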

8 SSTF (Cont.)

9 SCAN The disk arm starts at one end of the disk, and moves toward the other end, servicing requests until it gets to the other end of the disk, where the head movement is reversed and servicing continues. Sometimes called the elevator algorithm. Illustration shows total head movement of 208 cylinders.

10 SCAN (Cont.)

11 C-SCAN Provides a more uniform wait time than SCAN.
The head moves from one end of the disk to the other, servicing requests as it goes. When it reaches the other end, however, it immediately returns to the beginning of the disk, without servicing any requests on the return trip. Treats the cylinders as a circular list that wraps around from the last cylinder to the first one.

12 C-SCAN (Cont.)

13 C-LOOK
A version of C-SCAN in which the arm goes only as far as the last request in each direction, then reverses direction immediately, without first going all the way to the end of the disk.

14 C-LOOK (Cont.)

15 Selecting a Disk-Scheduling Algorithm
SSTF is common and has a natural appeal. SCAN and C-SCAN perform better for systems that place a heavy load on the disk. Performance depends on the number and types of requests. Requests for disk service can be influenced by the file-allocation method. The disk-scheduling algorithm should be written as a separate module of the operating system, allowing it to be replaced with a different algorithm if necessary. Either SSTF or LOOK is a reasonable choice for the default algorithm.

16 Reliability Main Points
Problems posed by machine/disk failures Transaction concept Three approaches to reliability: Careful sequencing of file system operations Copy-on-write (WAFL, ZFS) Journaling (NTFS, Linux ext4) Approaches to availability: RAID

17 File System Reliability
What can happen if disk loses power or machine software crashes? Some operations in progress may complete Some operations in progress may be lost Overwrite of a block may only partially complete File system wants durability (as a minimum!) Data previously stored can be retrieved (maybe after some recovery step), regardless of failure

18 Storage Reliability Problem
Single logical file operation can involve updates to multiple physical disk blocks inode, indirect block, data block, bitmap, … With remapping, single update to physical disk block can require multiple (even lower level) updates At a physical level, operations complete one at a time Want concurrent operations for performance How do we guarantee consistency regardless of when crash occurs?

19 Types of Storage Media Volatile storage – information stored here does not survive system crashes Example: main memory, cache Nonvolatile storage – Information usually survives crashes Example: disk and tape Stable storage – Information never lost Not actually possible, so approximated via replication or RAID to devices with independent failure modes Goal is to assure transaction atomicity where failures cause loss of information on volatile storage

20 Reliability Approach #1: Careful Ordering
Sequence operations in a specific order Careful design to allow sequence to be interrupted safely Post-crash recovery Read data structures to see if there were any operations in progress Clean up/finish as needed Approach taken in FAT, FFS (fsck), and many app-level recovery schemes (e.g., Word)

21 FAT: Append Data to File
Add data block Add pointer to data block Update file tail to point to new MFT entry Update access time at head of file


23 FAT: Append Data to File
Normal operation: Add data block Add pointer to data block Update file tail to point to new MFT entry Update access time at head of file Recovery: Scan MFT If entry is unlinked, delete data block If access time is incorrect, update

24 FAT: Create New File Normal operation: Allocate data block
Update MFT entry to point to data block Update directory with file name -> file number What if directory spans multiple disk blocks? Update modify time for directory Recovery: Scan MFT If any unlinked files (not in any directory), delete Scan directories for missing update times

25 FFS: Create a File Normal operation: Allocate data block
Write data block Allocate inode Write inode block Update bitmap of free blocks Update directory with file name -> file number Update modify time for directory Recovery: Scan inode table If any unlinked files (not in any directory), delete Compare free block bitmap against inode trees Scan directories for missing update/access times Time proportional to size of disk
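The create sequence above follows one discipline: write a block before writing any pointer to it. A toy model (not real FFS code; block names are invented for illustration) can check that a crash after any prefix of the writes leaves no dangling pointer, only unreachable garbage that fsck-style recovery can reclaim:

```python
# Write order matters: each pointee is written before anything points to it.
CREATE_STEPS = [
    ("data",   "block7"),                     # 1. write data block
    ("inode",  "inode5 -> block7"),           # 2. write inode pointing at data
    ("bitmap", "mark block7, inode5 used"),   # 3. update free-block bitmap
    ("dirent", "name -> inode5"),             # 4. directory entry written last
]

def crash_after(n):
    """Disk state if power fails after the first n writes complete."""
    return {key: val for key, val in CREATE_STEPS[:n]}

def consistent(disk):
    """A pointer may exist only if what it points to is already on disk."""
    if "dirent" in disk and "inode" not in disk:
        return False      # directory names a missing inode: real data loss
    if "inode" in disk and "data" not in disk:
        return False      # inode points at garbage data
    return True

# Every crash point is safe; at worst, recovery finds allocated-but-
# unreachable blocks (data with no dirent) and frees them.
assert all(consistent(crash_after(n)) for n in range(len(CREATE_STEPS) + 1))
print("all crash points safe")
```

Reversing the order (directory entry first) would make the earliest crash points inconsistent, which is exactly why careful ordering is required.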

26 FFS: Move a File Normal operation: Remove filename from old directory
Add filename to new directory Recovery: Scan all directories to determine set of live files Consider files with valid inodes and not in any directory New file being created? File move? File deletion?

27 Application Level Normal operation:
Write name of each open file to app folder Write changes to backup file Rename backup file to be file (atomic operation provided by file system) Delete list in app folder on clean shutdown Recovery: On startup, see if any files were left open If so, look for backup file If so, ask user to compare versions

28 Careful Ordering
Pros:
Works with minimal support in the disk drive
Works for most multi-step operations
Cons:
Can require time-consuming recovery after a failure
Difficult to reduce every operation to a safely interruptible sequence of writes
Difficult to achieve consistency when multiple operations occur concurrently

29 Reliability Approach #2: Copy on Write File Layout
To update file system, write a new version of the file system containing the update Never update in place Reuse existing unchanged disk blocks Seems expensive! But Updates can be batched Almost all disk writes can occur in parallel Approach taken in network file server appliances (WAFL, ZFS)

30 Copy on Write/Write Anywhere

31 Copy on Write/Write Anywhere

32 Copy on Write Batch Update

33 Copy on Write Garbage Collection
For write efficiency, want contiguous sequences of free blocks Spread across all block groups Updates leave dead blocks scattered For read efficiency, want data read together to be in the same block group Write anywhere leaves related data scattered => Background coalescing of live/dead blocks

34 Copy on Write
Pros:
Correct behavior regardless of failures
Fast recovery (root block array)
High throughput (best if updates are batched)
Cons:
Potential for high latency
Small changes require many writes
Garbage collection essential for performance

35 Reliability Approach #3: Log Structured File Systems
Log structured (or journaling) file systems record each update to the file system as a transaction All transactions are written to a log A transaction is considered committed once it is written to the log However, the file system may not yet be updated The transactions in the log are asynchronously written to the file system When the file system is modified, the transaction is removed from the log If the file system crashes, all remaining transactions in the log must still be performed

36 Transaction A transaction assures that operations happen as a single logical unit of work: in their entirety, or not at all. The challenge is assuring atomicity despite computer system failures. Transaction: a collection of instructions or operations that performs a single logical function. Here we are concerned with changes to stable storage (disk). A transaction is a series of read and write operations, terminated by a commit (transaction successful) or abort (transaction failed) operation. An aborted transaction must be rolled back to undo any changes it performed.

37 Transaction Concept Transaction is a group of operations Atomic: operations appear to happen as a group, or not at all (at logical level) At physical level, only single disk/flash write is atomic Consistency: sequential memory model Isolation: other transactions do not see results of earlier transactions until they are committed Durable: operations that complete stay completed Future failures do not corrupt previously stored data

38 Logging File Systems Instead of modifying data structures on disk directly, write changes to a journal/log Intention list: set of changes we intend to make Log/Journal is append-only Once changes are on log, safe to apply changes to data structures on disk Recovery can read log to see what changes were intended Once changes are copied, safe to remove log

39 Log-Based Recovery Record to stable storage information about all modifications by a transaction Most common is write-ahead logging Log on stable storage, each log record describes single transaction write operation, including Transaction name Data item name Old value New value <Ti starts> written to log when transaction Ti starts <Ti commits> written when Ti commits Log entry must reach stable storage before operation on data occurs

40 Log-Based Recovery Algorithm
Using the log, system can handle any volatile memory errors Undo(Ti) restores value of all data updated by Ti Redo(Ti) sets values of all data in transaction Ti to new values Undo(Ti) and redo(Ti) must be idempotent Multiple executions must have the same result as one execution If system fails, restore state of all updated data via log If log contains <Ti starts> without <Ti commits>, undo(Ti) If log contains <Ti starts> and <Ti commits>, redo(Ti)
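The undo/redo rules above can be sketched in a few lines. This is a toy model with an assumed record format (tuples like ("start", T), ("write", T, item, old, new), ("commit", T)); a real log describes disk block writes, not Python values.

```python
def recover(log):
    """Return the post-recovery value of each data item touched by the log.

    Committed transactions are redone (new values win, applied in log
    order); transactions with a start but no commit are undone (old values
    restored, scanning backwards). Both passes are idempotent: running
    recovery twice gives the same result as running it once.
    """
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    data = {}
    # Redo pass: apply new values for committed transactions, in log order.
    for rec in log:
        if rec[0] == "write" and rec[1] in committed:
            _, _, item, old, new = rec
            data[item] = new
    # Undo pass: restore old values for uncommitted transactions, backwards.
    for rec in reversed(log):
        if rec[0] == "write" and rec[1] not in committed:
            _, _, item, old, new = rec
            data[item] = old
    return data

log = [("start", "T0"), ("write", "T0", "A", 100, 50), ("commit", "T0"),
       ("start", "T1"), ("write", "T1", "B", 20, 80)]   # T1 never committed
print(recover(log))   # A redone to 50; B undone back to 20
```

Here T0 has both start and commit records, so its write is redone; T1 crashed before committing, so its write is undone.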

41 Redo Logging
Prepare: write all changes (in transaction) to log
Commit: single disk write to make transaction durable
Redo: copy changes to disk
Garbage collection: reclaim space in log
Recovery: read log, redo any operations for committed transactions, garbage collect log
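The prepare/commit/redo/garbage-collect lifecycle can be traced with a toy model (assumption: "disk" is a dict and the "log" a list; a real file system issues ordered block writes instead):

```python
log, disk = [], {"A": 100, "B": 200}

def prepare(txn_id, changes):
    """Prepare: append all of the transaction's changes to the log."""
    for item, value in changes.items():
        log.append(("write", txn_id, item, value))

def commit(txn_id):
    """Commit: the single append that makes the transaction durable."""
    log.append(("commit", txn_id))

def redo_and_gc():
    """Redo committed changes to their home locations, then reclaim log space."""
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    for rec in list(log):
        if rec[0] == "write" and rec[1] in committed:
            _, _, item, value = rec
            disk[item] = value                 # copy changes to disk
    log[:] = [rec for rec in log if rec[1] not in committed]   # GC

prepare("T1", {"A": 1, "B": 2})
# A crash here is safe: no commit record, so recovery ignores T1 entirely.
commit("T1")
# A crash here is also safe: commit record present, recovery redoes A=1, B=2.
redo_and_gc()
print(disk, log)   # {'A': 1, 'B': 2} []
```

Note the asymmetry the slide relies on: before the commit record exists the transaction is invisible, and after it exists the redo is repeatable, so a crash at any point leaves a recoverable state.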

42 Before Transaction Start

43 After Updates Are Logged

44 After Commit Logged

45 After Copy Back

46 After Garbage Collection


48 Performance Log written sequentially Asynchronous write back
Often kept in flash storage Asynchronous write back Any order as long as all changes are logged before commit, and all write backs occur after commit Can process multiple transactions Transaction ID in each log entry Transaction completed iff its commit record is in log

49 Concurrent Transactions
Must be equivalent to serial execution – serializability Could perform all transactions in critical section Inefficient, too restrictive Concurrency-control algorithms provide serializability

50 Serializability Consider two data items A and B
Consider Transactions T0 and T1 Execute T0, T1 atomically Execution sequence called schedule Atomically executed transaction order called serial schedule For N transactions, there are N! valid serial schedules

51 Schedule 1: T0 then T1

52 Nonserial Schedule
A nonserial schedule allows overlapped execution; the resulting execution is not necessarily incorrect. Consider a schedule S with operations Oi, Oj. They conflict if they access the same data item, with at least one of them a write. If Oi and Oj are consecutive operations of different transactions and Oi and Oj don't conflict, then the schedule S' with their order swapped (Oj then Oi) is equivalent to S. If S can become a serial schedule S' via swapping nonconflicting operations, S is conflict serializable.
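The swap argument above is equivalent to the standard precedence-graph test, sketched here (an assumption of this sketch: a schedule is a list of (transaction, action, item) triples with action 'r' or 'w'). There is an edge Ti -> Tj whenever an operation of Ti conflicts with a later operation of Tj, and the schedule is conflict serializable iff the graph is acyclic.

```python
def conflict_serializable(schedule):
    """schedule: list of (txn, action, item), action in {'r', 'w'}."""
    edges = set()
    for i, (ti, ai, qi) in enumerate(schedule):
        for tj, aj, qj in schedule[i + 1:]:
            # Conflict: different transactions, same item, at least one write.
            if ti != tj and qi == qj and "w" in (ai, aj):
                edges.add((ti, tj))
    # Cycle detection: repeatedly remove nodes with no incoming edge.
    nodes = {t for t, _, _ in schedule}
    while nodes:
        removable = [n for n in nodes
                     if not any(dst == n and src in nodes for src, dst in edges)]
        if not removable:
            return False          # remaining nodes form a cycle
        nodes -= set(removable)
    return True

serial  = [("T0", "r", "A"), ("T0", "w", "A"), ("T1", "r", "A"), ("T1", "w", "A")]
tangled = [("T0", "r", "A"), ("T1", "w", "A"), ("T0", "w", "A")]
print(conflict_serializable(serial))    # True
print(conflict_serializable(tangled))   # False
```

In `tangled`, T0's read precedes T1's write (edge T0 -> T1) but T1's write precedes T0's write (edge T1 -> T0), so the cycle makes it non-serializable.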

53 Schedule 2: Concurrent Serializable Schedule

54 Locking Protocol Ensure serializability by associating lock with each data item Follow locking protocol for access control Locks Shared – Ti has shared-mode lock (S) on item Q, Ti can read Q but not write Q Exclusive – Ti has exclusive-mode lock (X) on Q, Ti can read and write Q Require every transaction on item Q acquire appropriate lock If lock already held, new request may have to wait Similar to readers-writers algorithm

55 Two-phase Locking Protocol
Generally ensures conflict serializability Each transaction issues lock and unlock requests in two phases Growing – obtaining locks Shrinking – releasing locks Does not prevent deadlock
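A minimal sketch of the two phases (exclusive locks only; the lock table, class names, and the use of assertions to stand in for blocking/waiting are all simplifications, and deadlock handling is deliberately absent, since 2PL does not prevent it):

```python
class Transaction:
    def __init__(self, name):
        self.name, self.held, self.shrinking = name, set(), False

    def lock(self, item, locks):
        # Growing phase only: once any lock is released, no new locks.
        assert not self.shrinking, "growing phase over: no new locks"
        owner = locks.get(item)
        # A real lock manager would block here instead of asserting.
        assert owner in (None, self.name), f"{item} held by {owner}: must wait"
        locks[item] = self.name
        self.held.add(item)

    def unlock_all(self, locks):
        self.shrinking = True           # shrinking phase begins
        for item in self.held:
            del locks[item]
        self.held.clear()

locks = {}
t = Transaction("T0")
t.lock("A", locks); t.lock("B", locks)   # growing phase
# ... reads/writes on A and B happen here ...
t.unlock_all(locks)                      # shrinking phase
print(locks)   # {} : all locks released
```

Releasing everything in one step at the end, as here, is the strict variant used with redo logging later in these slides: locks are dropped only after commit.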

56 Timestamp-based Protocols
Select order among transactions in advance – timestamp-ordering Transaction Ti associated with timestamp TS(Ti) before Ti starts TS(Ti) < TS(Tj) if Ti entered system before Tj TS can be generated from system clock or as logical counter incremented at each entry of transaction Timestamps determine serializability order If TS(Ti) < TS(Tj), system must ensure produced schedule equivalent to serial schedule where Ti appears before Tj

57 Timestamp-based Protocol Implementation
Data item Q gets two timestamps W-timestamp(Q) – largest timestamp of any transaction that executed write(Q) successfully R-timestamp(Q) – largest timestamp of successful read(Q) Updated whenever read(Q) or write(Q) executed Timestamp-ordering protocol assures any conflicting read and write executed in timestamp order Suppose Ti executes read(Q) If TS(Ti) < W-timestamp(Q), Ti needs to read value of Q that was already overwritten read operation rejected and Ti rolled back If TS(Ti) ≥ W-timestamp(Q) read executed, R-timestamp(Q) set to max(R-timestamp(Q), TS(Ti))

58 Timestamp-ordering Protocol
Suppose Ti executes write(Q) If TS(Ti) < R-timestamp(Q), the value of Q produced by Ti was needed previously, and the system assumed it would never be produced Write operation rejected, Ti rolled back If TS(Ti) < W-timestamp(Q), Ti is attempting to write an obsolete value of Q Write operation rejected and Ti rolled back Otherwise, the write is executed Any rolled back transaction Ti is assigned a new timestamp and restarted Algorithm ensures conflict serializability and freedom from deadlock
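The read and write checks from the last two slides fit in a few lines. A sketch (returning False stands in for "reject and roll back"; Thomas' write rule is not applied):

```python
class Item:
    """A data item Q with its two timestamps, both initially 0."""
    def __init__(self):
        self.r_ts = self.w_ts = 0

def read(ts, q):
    if ts < q.w_ts:
        return False                  # Q already overwritten: roll back
    q.r_ts = max(q.r_ts, ts)          # record the latest successful reader
    return True

def write(ts, q):
    if ts < q.r_ts or ts < q.w_ts:
        return False                  # a later transaction read/wrote Q: roll back
    q.w_ts = ts
    return True

q = Item()
assert write(2, q)        # T2 writes Q
assert not read(1, q)     # older T1 would read T2's value: rolled back
assert read(3, q)         # T3 reads fine; R-timestamp(Q) becomes 3
assert not write(2, q)    # T2 writes again, but T3 already read Q: rolled back
```

A rolled-back transaction would then restart with a fresh, larger timestamp, as the slide says.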

59 Schedule Possible Under Timestamp Protocol

60 Redo Log Implementation

61 Transaction Isolation
Process A move file from x to y mv x/file y/ Process B grep across x and y grep x/* y/* > log What if grep starts after changes are logged, but before commit?

62 Two Phase Locking Two phase locking: release locks only AFTER transaction commit Prevents a process from seeing results of another transaction that might not commit

63 Transaction Isolation
Process A Lock x, y move file from x to y mv x/file y/ Commit and release x,y Process B Lock x, y, log grep across x and y grep x/* y/* > log Commit and release x, y, log Grep occurs either before or after move

64 Serializability With two phase locking and redo logging, transactions appear to occur in a sequential order (serializability) Either: grep then move or move then grep Other implementations can also provide serializability Optimistic concurrency control: abort any transaction that would conflict with serializability

65 Caveat Most file systems implement a transactional model internally
Copy on write Redo logging Most file systems provide a transactional model for individual system calls File rename, move, … Most file systems do NOT provide a transactional model for user data Historical artifact (imo)

66 Question Do we need the copy back?
What if update in place is very expensive? Ex: flash storage, RAID

67 Log Structure
The log is the data storage; no copy back
Storage split into contiguous fixed-size segments
Flash: size of erasure block
Disk: efficient transfer size (e.g., 1 MB)
Log new blocks into an empty segment
Garbage collect dead blocks to create empty segments
Each segment contains an extra level of indirection: which blocks are stored in that segment
Recovery: find the last successfully written segment

68 Storage Availability Storage reliability: data fetched is what you stored Transactions, redo logging, etc. Storage availability: data is there when you want it More disks => higher probability of some disk failing Data available ~ Prob(disk working)^k If failures are independent and data is spread across k disks For large k, probability system works -> 0

69 Hardware Reliability Solution RAID
RAID (originally redundant array of inexpensive disks; now commonly redundant array of independent disks) Replicate data for availability RAID 0: no replication RAID 1: mirror data across two or more disks Google File System replicated its data on three disks, spread across multiple racks RAID 5: split data across disks, with redundancy to recover from a single disk failure RAID 6: RAID 5, with extra redundancy to recover from two disk failures

70 RAID 1: Mirroring Replicate writes to both disks
Reads can go to either disk

71 Parity Parity block: block1 xor block2 xor block3 …
Can reconstruct any missing block from the others
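XOR parity and reconstruction can be demonstrated directly (the tiny two-byte blocks are made-up sample data):

```python
from functools import reduce

def parity(blocks):
    """Byte-wise XOR of equal-sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

blocks = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
p = parity(blocks)

# Lose block 1, then rebuild it from the parity plus the surviving blocks.
# This works because x xor x = 0: XORing the parity with every block except
# the missing one leaves exactly the missing block.
rebuilt = parity([p, blocks[0], blocks[2]])
print(rebuilt == blocks[1])   # True
```

The same identity is why XORing the parity with all the data blocks yields all zeros.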

72 RAID 5: Rotating Parity RAID 5 consists of block-level striping with distributed parity. Unlike in RAID 4, parity information is distributed among the drives. It requires that all drives but one be present to operate. Upon failure of a single drive, subsequent reads can be calculated from the distributed parity such that no data is lost. RAID 5 requires at least three disks. In comparison to RAID 4, RAID 5's distributed parity evens out the stress of a dedicated parity disk among all RAID members. Additionally, read performance is increased since all RAID members participate in serving read requests.

73 RAID Update
Mirroring: write every mirror
RAID-5, to write one block:
Read old data block
Read old parity block
Write new data block
Write new parity block (old data xor old parity xor new data)
RAID-5, to write an entire stripe: write data blocks and parity
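The small-write identity above (new parity = old data xor old parity xor new data) is worth verifying: it lets RAID-5 update one block with two reads and two writes instead of re-reading the whole stripe. A check with made-up 4-bit values:

```python
old_data = [0b1010, 0b1100, 0b0110]      # one stripe: three data blocks
old_parity = old_data[0] ^ old_data[1] ^ old_data[2]

new_block1 = 0b0001                      # overwrite data block 1

# Small-write shortcut: only the old block and old parity are read back.
new_parity = old_data[1] ^ old_parity ^ new_block1

# It matches the parity recomputed over the whole new stripe, because
# XORing the old block in twice cancels it out of the old parity.
new_data = [old_data[0], new_block1, old_data[2]]
print(new_parity == new_data[0] ^ new_data[1] ^ new_data[2])   # True
```

This is why a single-block RAID-5 write costs four disk I/Os regardless of stripe width, while a full-stripe write needs no reads at all.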

74 Non-Recoverable Read Errors
Disk devices can lose data One sector per 10^15 bits read Causes: Physical wear Repeated writes to nearby tracks What impact does this have on RAID recovery?

75 Read Errors and RAID recovery
Example: 10 1 TB disks, and 1 fails Read remaining disks to reconstruct missing data Probability of recovery = (1 – 10^-15)^(9 disks * 10^12 bytes/disk * 8 bits/byte) ≈ 93%
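The slide's arithmetic can be reproduced directly, under its assumptions: one unrecoverable error per 10^15 bits read, errors independent, and a rebuild that must read all nine surviving 1 TB disks without a single error.

```python
bits_read = 9 * 10**12 * 8        # 9 disks * 10^12 bytes/disk * 8 bits/byte
p_per_bit = 1 - 1e-15             # chance any single bit reads back correctly
p_recovery = p_per_bit ** bits_read

print(round(p_recovery, 2))       # ~0.93: roughly a 7% chance the rebuild fails
```

That 7% rebuild-failure chance for a single-parity array of large disks is the usual motivation for the dual redundancy of RAID 6 on the next slides.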

76 RAID 6 Dual Redundancy Two different parity calculations (P and Q) are carried out and stored in separate blocks on different disks, so the array can recover from two disks failing. A RAID 6 array whose user data require N disks consists of N+2 disks. One of the two checks is the exclusive-OR calculation used in RAID 4 and 5; the other is an independent data check algorithm. This makes it possible to regenerate data even if two disks containing user data fail.
