Today’s Lecture
- Review of disk scheduling
- Storage reliability
Disk Scheduling
- The operating system is responsible for using the hardware efficiently; for disk drives, this means fast access time and high disk bandwidth.
- Access time has two major components:
  - Seek time: the time for the disk arm to move the heads to the cylinder containing the desired sector.
  - Rotational latency: the additional time waiting for the disk to rotate the desired sector under the disk head.
- Minimize seek time: seek time is roughly proportional to seek distance.
- Disk bandwidth: the total number of bytes transferred, divided by the total time between the first request for service and the completion of the last transfer.
Disk Scheduling (Cont.)
- Several algorithms exist to schedule the servicing of disk I/O requests.
- We illustrate them with a request queue of cylinders in the range 0-199:
  98, 183, 37, 122, 14, 124, 65, 67
- The head pointer starts at cylinder 53.
FCFS: First Come, First Served
- Service requests in arrival order.
- The illustration shows total head movement of 640 cylinders.
SSTF: Shortest Seek Time First
- Selects the request with the minimum seek time from the current head position.
- SSTF scheduling is a form of SJF scheduling; it may cause starvation of some requests.
- The illustration shows total head movement of 236 cylinders.
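The head-movement totals above can be checked with a short simulation (a sketch; the queue and start position are the ones from the example):

```python
def fcfs(requests, head):
    """Total head movement when requests are serviced in arrival order."""
    total = 0
    for r in requests:
        total += abs(head - r)
        head = r
    return total

def sstf(requests, head):
    """Total head movement when the closest pending request is served next."""
    pending, total = list(requests), 0
    while pending:
        nxt = min(pending, key=lambda r: abs(head - r))
        total += abs(head - nxt)
        head = nxt
        pending.remove(nxt)
    return total

queue = [98, 183, 37, 122, 14, 124, 65, 67]
print(fcfs(queue, 53))  # 640 cylinders, as in the FCFS illustration
print(sstf(queue, 53))  # 236 cylinders, as in the SSTF illustration
```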
SCAN
- The disk arm starts at one end of the disk and moves toward the other end, servicing requests along the way; at the other end, the head movement reverses and servicing continues.
- Sometimes called the elevator algorithm.
- The illustration shows total head movement of 208 cylinders.
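A sketch of SCAN for the same queue. Note the 208-cylinder figure assumes the arm reverses at the last pending request (cylinder 14) rather than travelling all the way to cylinder 0; that reversal rule is what the illustration appears to use.

```python
def scan(requests, head):
    """Elevator algorithm: sweep toward cylinder 0 first, then reverse and
    sweep up. Reverses at the last pending request, matching the figure."""
    down = sorted((r for r in requests if r <= head), reverse=True)
    up = sorted(r for r in requests if r > head)
    total = 0
    for r in down + up:
        total += abs(head - r)
        head = r
    return total

print(scan([98, 183, 37, 122, 14, 124, 65, 67], 53))  # 208 cylinders
```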
C-SCAN
- Provides a more uniform wait time than SCAN.
- The head moves from one end of the disk to the other, servicing requests as it goes. When it reaches the other end, it immediately returns to the beginning of the disk without servicing any requests on the return trip.
- Treats the cylinders as a circular list that wraps around from the last cylinder to the first one.
Selecting a Disk-Scheduling Algorithm
- SSTF is common and has a natural appeal.
- SCAN and C-SCAN perform better for systems that place a heavy load on the disk.
- Performance depends on the number and types of requests.
- Requests for disk service can be influenced by the file-allocation method.
- The disk-scheduling algorithm should be written as a separate module of the operating system, so it can be replaced with a different algorithm if necessary.
- Either SSTF or LOOK is a reasonable choice for the default algorithm.
Reliability: Main Points
- The problem posed by machine/disk failures
- The transaction concept
- Three approaches to reliability:
  - Careful sequencing of file system operations
  - Copy-on-write (WAFL, ZFS)
  - Journaling (NTFS, Linux ext4)
- Approaches to availability:
  - RAID
File System Reliability
- What can happen if the disk loses power or the machine’s software crashes?
  - Some operations in progress may complete.
  - Some operations in progress may be lost.
  - An overwrite of a block may only partially complete.
- The file system wants durability (as a minimum!): data previously stored can be retrieved (perhaps after some recovery step), regardless of failures.
Storage Reliability Problem
- A single logical file operation can involve updates to multiple physical disk blocks: inode, indirect block, data block, bitmap, …
- With remapping, a single update to a physical disk block can require multiple (even lower-level) updates.
- At the physical level, operations complete one at a time, yet we want concurrent operations for performance.
- How do we guarantee consistency regardless of when a crash occurs?
Types of Storage Media
- Volatile storage: information stored here does not survive system crashes. Examples: main memory, cache.
- Nonvolatile storage: information usually survives crashes. Examples: disk and tape.
- Stable storage: information is never lost. Not actually achievable, so it is approximated via replication or RAID across devices with independent failure modes.
- The goal is to assure transaction atomicity where failures cause loss of information on volatile storage.
Reliability Approach #1: Careful Ordering
- Sequence operations in a specific order; careful design allows the sequence to be interrupted safely.
- Post-crash recovery:
  - Read data structures to see whether any operations were in progress.
  - Clean up/finish as needed.
- Approach taken in FAT, FFS (fsck), and many application-level recovery schemes (e.g., Word).
FAT: Append Data to File
- Normal operation:
  - Add the data block.
  - Add a pointer to the data block.
  - Update the file tail to point to the new FAT entry.
  - Update the access time at the head of the file.
- Recovery:
  - Scan the FAT.
  - If an entry is unlinked, delete the data block.
  - If the access time is incorrect, update it.
FAT: Create New File
- Normal operation:
  - Allocate a data block.
  - Update the FAT entry to point to the data block.
  - Update the directory with file name -> file number. (What if the directory spans multiple disk blocks?)
  - Update the modify time for the directory.
- Recovery:
  - Scan the FAT.
  - If any files are unlinked (not in any directory), delete them.
  - Scan directories for missing update times.
FFS: Create a File
- Normal operation:
  - Allocate a data block.
  - Write the data block.
  - Allocate an inode.
  - Write the inode block.
  - Update the bitmap of free blocks.
  - Update the directory with file name -> file number.
  - Update the modify time for the directory.
- Recovery:
  - Scan the inode table.
  - If any files are unlinked (not in any directory), delete them.
  - Compare the free-block bitmap against the inode trees.
  - Scan directories for missing update/access times.
- Recovery time is proportional to the size of the disk.
FFS: Move a File
- Normal operation:
  - Remove the file name from the old directory.
  - Add the file name to the new directory.
- Recovery:
  - Scan all directories to determine the set of live files.
  - Consider files with valid inodes that are not in any directory:
    - A new file being created?
    - A file move?
    - A file deletion?
Application Level
- Normal operation:
  - Write the name of each open file to the app folder.
  - Write changes to a backup file.
  - Rename the backup file to be the file (an atomic operation provided by the file system).
  - Delete the list in the app folder on clean shutdown.
- Recovery:
  - On startup, see if any files were left open.
  - If so, look for a backup file.
  - If one exists, ask the user to compare versions.
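The write-backup-then-rename trick relies on the file system’s atomic rename. A minimal sketch (function and file names are illustrative):

```python
import os
import tempfile

def safe_save(path, data):
    """Write data to a scratch file, then atomically rename it over the
    original. A crash leaves either the old or the new version on disk,
    never a partially written file."""
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname)  # same dir => same file system
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # force the backup to stable storage first
        os.replace(tmp, path)     # the atomic commit point
    except BaseException:
        os.unlink(tmp)
        raise
```

os.replace is Python’s atomic rename; writing the scratch file in the same directory keeps the rename within one file system, which the atomicity guarantee requires.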
Careful Ordering: Pros and Cons
- Pros:
  - Works with minimal support in the disk drive.
  - Works for most multi-step operations.
- Cons:
  - Can require time-consuming recovery after a failure.
  - Difficult to reduce every operation to a safely interruptible sequence of writes.
  - Difficult to achieve consistency when multiple operations occur concurrently.
Reliability Approach #2: Copy-on-Write File Layout
- To update the file system, write a new version of the file system containing the update.
  - Never update in place.
  - Reuse existing unchanged disk blocks.
- Seems expensive! But:
  - Updates can be batched.
  - Almost all disk writes can occur in parallel.
- Approach taken in network file server appliances (WAFL, ZFS).
Copy-on-Write Garbage Collection
- For write efficiency, we want contiguous sequences of free blocks, spread across all block groups; updates leave dead blocks scattered.
- For read efficiency, we want data read together to be in the same block group; write-anywhere leaves related data scattered.
- => Background coalescing of live/dead blocks.
Copy-on-Write: Pros and Cons
- Pros:
  - Correct behavior regardless of failures.
  - Fast recovery (root block array).
  - High throughput (best if updates are batched).
- Cons:
  - Potential for high latency.
  - Small changes require many writes.
  - Garbage collection is essential for performance.
Reliability Approach #3: Log-Structured File Systems
- Log-structured (or journaling) file systems record each update to the file system as a transaction.
- All transactions are written to a log; a transaction is considered committed once it is written to the log, even though the file system itself may not yet be updated.
- The transactions in the log are asynchronously written to the file system; once the file system is modified, the transaction is removed from the log.
- If the file system crashes, all remaining transactions in the log must still be performed.
Transaction
- Assures that operations happen as a single logical unit of work: in its entirety, or not at all.
- The challenge is assuring atomicity despite computer system failures.
- A transaction is a collection of instructions or operations that performs a single logical function.
  - Here we are concerned with changes to stable storage: disk.
  - A transaction is a series of read and write operations.
  - It is terminated by a commit (transaction successful) or abort (transaction failed) operation.
  - An aborted transaction must be rolled back to undo any changes it performed.
Transaction Concept
- A transaction is a group of operations that is:
  - Atomic: operations appear to happen as a group, or not at all (at the logical level). At the physical level, only a single disk/flash write is atomic.
  - Consistent: sequential memory model.
  - Isolated: other transactions do not see the results of earlier transactions until they are committed.
  - Durable: operations that complete stay completed; future failures do not corrupt previously stored data.
Logging File Systems
- Instead of modifying data structures on disk directly, write the changes to a journal/log.
  - Intention list: the set of changes we intend to make.
  - The log/journal is append-only.
- Once the changes are in the log, it is safe to apply them to the data structures on disk.
  - Recovery can read the log to see what changes were intended.
- Once the changes are copied, it is safe to remove them from the log.
Log-Based Recovery
- Record to stable storage information about all modifications made by a transaction.
- Most common is write-ahead logging:
  - The log resides on stable storage; each log record describes a single transaction write operation, including:
    - Transaction name
    - Data item name
    - Old value
    - New value
  - <Ti starts> is written to the log when transaction Ti starts; <Ti commits> is written when Ti commits.
  - A log entry must reach stable storage before the operation on the data occurs.
Log-Based Recovery Algorithm
- Using the log, the system can handle any volatile memory errors:
  - Undo(Ti) restores the value of all data updated by Ti.
  - Redo(Ti) sets the values of all data in transaction Ti to the new values.
- Undo(Ti) and redo(Ti) must be idempotent: multiple executions must have the same result as one execution.
- If the system fails, restore the state of all updated data via the log:
  - If the log contains <Ti starts> without <Ti commits>, undo(Ti).
  - If the log contains both <Ti starts> and <Ti commits>, redo(Ti).
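The recovery rule can be sketched in a few lines. The log record formats here are assumptions for illustration: ('start', T), ('write', T, item, old, new), ('commit', T).

```python
def recover(log, storage):
    """Write-ahead log recovery: redo committed transactions, undo the rest.
    Both passes are idempotent, so re-running recovery after another crash
    yields the same result."""
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    # Redo pass: replay new values for committed transactions, in log order.
    for rec in log:
        if rec[0] == "write" and rec[1] in committed:
            _, txn, item, old, new = rec
            storage[item] = new
    # Undo pass: restore old values for uncommitted transactions, in reverse.
    for rec in reversed(log):
        if rec[0] == "write" and rec[1] not in committed:
            _, txn, item, old, new = rec
            storage[item] = old
    return storage
```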
Redo Logging
- Prepare: write all changes (in the transaction) to the log.
- Commit: a single disk write makes the transaction durable.
- Redo: copy the changes to disk.
- Garbage collection: reclaim space in the log.
- Recovery:
  - Read the log.
  - Redo any operations for committed transactions.
  - Garbage collect the log.
Performance
- The log is written sequentially; often kept in flash storage.
- Asynchronous write-back:
  - Writes can occur in any order, as long as all changes are logged before commit and all write-backs occur after commit.
- Can process multiple transactions:
  - A transaction ID appears in each log entry.
  - A transaction is completed iff its commit record is in the log.
Concurrent Transactions
- Must be equivalent to a serial execution: serializability.
- We could perform all transactions in a critical section, but that is inefficient and too restrictive.
- Concurrency-control algorithms provide serializability.
Serializability
- Consider two data items A and B, and transactions T0 and T1.
- Execute T0 and T1 atomically; an execution sequence is called a schedule.
- An atomically executed transaction order is called a serial schedule.
- For N transactions, there are N! valid serial schedules.
Nonserial Schedule
- A nonserial schedule allows overlapped execution; the resulting execution is not necessarily incorrect.
- Consider a schedule S with operations Oi, Oj:
  - Oi and Oj conflict if they access the same data item and at least one is a write.
  - If Oi and Oj are consecutive operations of different transactions and do not conflict, then the schedule S' with their order swapped (Oj, Oi) is equivalent to S.
- If S can be transformed into a serial schedule S' by swapping nonconflicting operations, S is conflict serializable.
Locking Protocol
- Ensure serializability by associating a lock with each data item, and follow a locking protocol for access control.
- Locks:
  - Shared: if Ti has a shared-mode lock (S) on item Q, Ti can read Q but not write Q.
  - Exclusive: if Ti has an exclusive-mode lock (X) on Q, Ti can both read and write Q.
- Require every transaction accessing item Q to acquire the appropriate lock first.
- If the lock is already held, the new request may have to wait; similar to the readers-writers algorithm.
Two-Phase Locking Protocol
- Generally ensures conflict serializability.
- Each transaction issues its lock and unlock requests in two phases:
  - Growing: obtaining locks.
  - Shrinking: releasing locks.
- Does not prevent deadlock.
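A minimal sketch of the two-phase discipline (class and method names are illustrative, and this tracks only one transaction’s phases, not waiting or deadlock):

```python
class TwoPhaseTxn:
    """Enforces the two-phase rule: once any lock is released (the shrinking
    phase begins), no new lock may be acquired."""
    def __init__(self):
        self.held = set()
        self.shrinking = False

    def lock(self, item):
        if self.shrinking:
            raise RuntimeError("two-phase violation: growing phase is over")
        self.held.add(item)

    def unlock(self, item):
        self.shrinking = True  # the first release ends the growing phase
        self.held.discard(item)
```

Releasing all locks only after commit (as on the later transaction-isolation slide) additionally prevents other transactions from reading uncommitted results.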
Timestamp-Based Protocols
- Select an order among transactions in advance: timestamp ordering.
- Transaction Ti is associated with timestamp TS(Ti) before Ti starts.
  - TS(Ti) < TS(Tj) if Ti entered the system before Tj.
  - Timestamps can be generated from the system clock, or as a logical counter incremented at each new transaction.
- Timestamps determine the serializability order: if TS(Ti) < TS(Tj), the system must ensure the produced schedule is equivalent to a serial schedule in which Ti appears before Tj.
Timestamp-Based Protocol Implementation
- Each data item Q gets two timestamps, updated whenever read(Q) or write(Q) executes:
  - W-timestamp(Q): largest timestamp of any transaction that executed write(Q) successfully.
  - R-timestamp(Q): largest timestamp of any successful read(Q).
- The timestamp-ordering protocol ensures that any conflicting read and write execute in timestamp order.
- Suppose Ti executes read(Q):
  - If TS(Ti) < W-timestamp(Q), Ti needs to read a value of Q that was already overwritten: the read is rejected and Ti is rolled back.
  - If TS(Ti) >= W-timestamp(Q), the read executes and R-timestamp(Q) is set to max(R-timestamp(Q), TS(Ti)).
Timestamp-Ordering Protocol
- Suppose Ti executes write(Q):
  - If TS(Ti) < R-timestamp(Q), the value of Q that Ti would produce was needed previously, and the system assumed it would never be produced: the write is rejected and Ti is rolled back.
  - If TS(Ti) < W-timestamp(Q), Ti is attempting to write an obsolete value of Q: the write is rejected and Ti is rolled back.
  - Otherwise, the write executes.
- Any rolled-back transaction Ti is assigned a new timestamp and restarted.
- The algorithm ensures conflict serializability and freedom from deadlock.
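The read/write rules on these two slides can be sketched as follows (a toy model of the per-item bookkeeping only; returning False stands in for rolling back Ti):

```python
class Item:
    """A data item Q with its R-timestamp and W-timestamp."""
    def __init__(self, value=None):
        self.value = value
        self.r_ts = 0  # largest TS of any successful read(Q)
        self.w_ts = 0  # largest TS of any successful write(Q)

def ts_read(q, ts):
    """True if the read is allowed; False means roll back Ti."""
    if ts < q.w_ts:                  # Q already overwritten by a younger txn
        return False
    q.r_ts = max(q.r_ts, ts)
    return True

def ts_write(q, ts, value):
    """True if the write is allowed; False means roll back Ti."""
    if ts < q.r_ts or ts < q.w_ts:   # value needed earlier, or obsolete write
        return False
    q.w_ts = ts
    q.value = value
    return True
```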
Transaction Isolation
- Process A moves a file from x to y:
  mv x/file y/
- Process B greps across x and y:
  grep x/* y/* > log
- What if the grep starts after the changes are logged, but before the commit?
Two-Phase Locking
- Two-phase locking: release locks only AFTER the transaction commits.
- This prevents a process from seeing the results of another transaction that might not commit.
Transaction Isolation
- Process A:
  - Lock x, y
  - Move the file from x to y: mv x/file y/
  - Commit and release x, y
- Process B:
  - Lock x, y, log
  - Grep across x and y: grep x/* y/* > log
  - Commit and release x, y, log
- The grep occurs either entirely before or entirely after the move.
Serializability
- With two-phase locking and redo logging, transactions appear to occur in a sequential order (serializability): either grep then move, or move then grep.
- Other implementations can also provide serializability.
  - Optimistic concurrency control: abort any transaction that would conflict with serializability.
Caveat
- Most file systems implement a transactional model internally: copy-on-write or redo logging.
- Most file systems provide a transactional model for individual system calls: file rename, move, …
- Most file systems do NOT provide a transactional model for user data.
  - A historical artifact (imo).
Question
- Do we need the copy back? What if update-in-place is very expensive?
- Examples: flash storage, RAID.
Log Structure
- The log is the data storage; there is no copy back.
- Storage is split into contiguous fixed-size segments.
  - Flash: the size of an erasure block.
  - Disk: an efficient transfer size (e.g., 1 MB).
- Log new blocks into an empty segment; garbage collect dead blocks to create empty segments.
- Each segment contains an extra level of indirection: which blocks are stored in that segment.
- Recovery: find the last successfully written segment.
Storage Availability
- Storage reliability: the data fetched is what you stored (transactions, redo logging, etc.).
- Storage availability: the data is there when you want it.
- More disks => higher probability of some disk failing.
  - If failures are independent and data is spread across k disks, P(data available) ~ P(one disk working)^k.
  - For large k, the probability the system works -> 0.
Hardware Reliability Solution: RAID
- RAID: originally "redundant array of inexpensive disks"; now commonly "redundant array of independent disks".
- Replicate data for availability:
  - RAID 0: no replication.
  - RAID 1: mirror data across two or more disks.
    - The Google File System replicated its data on three disks, spread across multiple racks.
  - RAID 5: split data across disks, with redundancy to recover from a single disk failure.
  - RAID 6: RAID 5, with extra redundancy to recover from two disk failures.
RAID 1: Mirroring
- Replicate writes to both disks.
- Reads can go to either disk.
Parity
- Parity block: block1 xor block2 xor block3 xor …
- Any single missing block can be reconstructed by xoring the remaining blocks together with the parity block.
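XOR parity takes only a couple of lines (a sketch; blocks are assumed equal-sized):

```python
from functools import reduce

def parity(blocks):
    """XOR a list of equal-sized blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def reconstruct(surviving_blocks, parity_block):
    """Recover a single missing block: XOR the survivors with the parity."""
    return parity(surviving_blocks + [parity_block])
```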
RAID 5: Rotating Parity
- RAID 5 consists of block-level striping with distributed parity. Unlike RAID 4, parity information is distributed among the drives. It requires that all drives but one be present to operate; upon failure of a single drive, subsequent reads can be calculated from the distributed parity so that no data is lost. RAID 5 requires at least three disks.
- Compared to RAID 4, RAID 5's distributed parity evens out the load of a dedicated parity disk among all RAID members. Read performance also increases, since all RAID members participate in serving read requests.
RAID Update
- Mirroring: to write one block, write every mirror.
- RAID 5: to write one block:
  - Read the old data block.
  - Read the old parity block.
  - Write the new data block.
  - Write the new parity block: old data xor old parity xor new data.
- RAID 5: to write an entire stripe, write the data blocks and the parity.
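The four-I/O small write can be checked directly: the parity computed from only the old data, old parity, and new data matches the parity of the full new stripe.

```python
def raid5_new_parity(old_data, old_parity, new_data):
    """RAID-5 small-write parity update:
    new parity = old data xor old parity xor new data.
    Only the updated data block and the parity block need to be read;
    the other data blocks in the stripe are untouched."""
    return bytes(d ^ p ^ n for d, p, n in zip(old_data, old_parity, new_data))
```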
Non-Recoverable Read Errors
- Disk devices can lose data: roughly one sector per 10^15 bits read.
- Causes:
  - Physical wear.
  - Repeated writes to nearby tracks.
- What impact does this have on RAID recovery?
Read Errors and RAID Recovery
- Example: 10 disks of 1 TB each, and one fails.
- Read the remaining disks to reconstruct the missing data.
- Probability of recovery = (1 - 10^-15)^(9 disks * 8 bits * 10^12 bytes/disk) ≈ 93%
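The 93% figure follows from treating each bit read during reconstruction as an independent trial:

```python
import math

# One non-recoverable error per 10^15 bits read; reconstruction must read
# 9 surviving disks * 10^12 bytes * 8 bits without hitting any error.
bits_to_read = 9 * 8 * 10**12
# log1p keeps precision: (1 - 1e-15) is too close to 1.0 for a plain float.
p_recover = math.exp(bits_to_read * math.log1p(-1e-15))
print(round(p_recover, 2))  # 0.93
```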
RAID 6: Dual Redundancy
- Two different parity calculations are carried out and stored in separate blocks on different disks, so the array can recover from two disks failing.
- A RAID 6 array whose user data require N disks therefore consists of N+2 disks.
- P and Q are two different data-check algorithms:
  - One of the two is the exclusive-OR calculation used in RAID 4 and 5.
  - The other is an independent data-check algorithm.
- This makes it possible to regenerate data even if two disks containing user data fail.