Transactions and Reliability Andy Wang Operating Systems COP 4610 / CGS 5765

Motivation
File systems have lots of metadata:
• Free blocks, directories, file headers, indirect blocks
Metadata is heavily cached for performance

Problem
System crashes: the OS needs to ensure that the file system does not reach an inconsistent state
Example: move a file between directories
• Remove the file from the old directory
• Add the file to the new directory
What happens when a crash occurs in the middle? (See the sketch below.)
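To make the crash window concrete, here is a minimal sketch (not the actual UNIX implementation); dir_remove_entry(), dir_add_entry(), and the path names are hypothetical stand-ins for the real directory-update code.

```c
/* Sketch only: dir_remove_entry() and dir_add_entry() are hypothetical
 * helpers standing in for the real directory-update routines.          */
#include <stdio.h>

static int dir_remove_entry(const char *dir, const char *name) {
    printf("removing %s from %s\n", name, dir);
    return 0;
}

static int dir_add_entry(const char *dir, const char *name) {
    printf("adding %s to %s\n", name, dir);
    return 0;
}

static void move_file(const char *old_dir, const char *new_dir,
                      const char *name) {
    dir_remove_entry(old_dir, name);  /* after this, the file is in neither directory */
    /* <-- a crash here loses the file entirely */
    dir_add_entry(new_dir, name);     /* only now is the move complete */
}

int main(void) {
    move_file("/home/alice/old", "/home/alice/new", "notes.txt");
    return 0;
}
```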

UNIX File System (Ad Hoc Failure-Recovery)
Metadata handling:
• Uses a synchronous write-through caching policy: a call to update metadata does not return until the changes are propagated to disk
• Updates are ordered
• When crashes occur, run fsck to repair in-progress operations

Some Examples of Metadata Handling
Undo effects not yet visible to users
• If a new file is created but not yet added to the directory, delete the file
Continue effects that are visible to users
• If file blocks are already allocated but not recorded in the bitmap, update the bitmap

UFS User Data Handling
Uses a write-back policy
• Modified blocks are written to disk at 30-second intervals, unless a user issues the sync system call
• Data updates are not ordered
• In many cases, consistent metadata is good enough

Example: Vi
Vi saves changes by doing the following (see the sketch below):
1. Writes the new version to a temp file (now we have old_file and new_temp)
2. Moves the old version to a different temp file (now we have new_temp and old_temp)
3. Moves the new version into the real file (now we have new_file and old_temp)
4. Removes the old version (now we have new_file)
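A minimal sketch of this save sequence using the POSIX file API; the names new_temp and old_temp mirror the slide, and error handling is omitted for brevity.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Sketch of a vi-style save of `path`; error checking omitted for brevity. */
static void save(const char *path, const char *buf, size_t len) {
    /* 1. Write the new version to a temp file (new_temp). */
    int fd = open("new_temp", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    write(fd, buf, len);
    fsync(fd);                      /* force the new bytes to disk */
    close(fd);

    /* 2. Move the old version to a different temp file (old_temp). */
    rename(path, "old_temp");

    /* 3. Move the new version into the real file. */
    rename("new_temp", path);

    /* 4. Remove the old version. */
    unlink("old_temp");
}

int main(void) {
    const char *text = "hello, crash-safe world\n";
    save("my_file", text, strlen(text));
    return 0;
}
```

At every point in this sequence, at least one complete copy of the data exists under some name, which is what makes crash recovery possible.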

Example: Vi
When crashes occur
• Looks for the leftover files
• Moves forward or backward depending on the integrity of the files

Transaction Approach
A transaction groups operations as a unit, with the following characteristics:
• Atomic: all operations either happen or they do not (no partial operations)
• Serializable: transactions appear to happen one after the other
• Durable: once a transaction happens, it is recoverable and can survive crashes

More on Transactions
A transaction is not done until it is committed
Once committed, a transaction is durable
If a transaction fails to complete, it must roll back as if it never happened
Critical sections are atomic and serializable, but not durable

Transaction Implementation (One Thread)
Example: money transfer
Begin transaction
  x = x - 1;
  y = y + 1;
Commit

Transaction Implementation (One Thread)
Common implementations involve the use of a log, a journal that is never erased
A file system uses a write-ahead log to track all transactions

Transaction Implementation (One Thread)
Once the updates to accounts x and y are recorded in the log, the log is committed to disk in a single write
The actual changes to those accounts are applied later

Transaction Illustrated
[Diagram: initial state; both the cached copy and the on-disk copy hold x = 1, y = 1]

Transaction Illustrated
[Diagram: the cached copy has been updated to x = 0, y = 2, while the on-disk copy still holds x = 1, y = 1]

Transaction Illustrated
[Diagram: the cached copy holds x = 0, y = 2; the on-disk copy still holds x = 1, y = 1; the log records: begin transaction, old x: 1, old y: 1, new x: 0, new y: 2, commit]
Commit the log to disk before updating the actual values on disk

Transaction Steps
1. Mark the beginning of the transaction
2. Log the changes in account x
3. Log the changes in account y
4. Commit
5. Modify account x on disk
6. Modify account y on disk
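A minimal sketch of steps 1-6 above; the record format and the file name "journal" are invented for illustration (here the commit marker and the logged values go out in a single write), and a real file system would use its own on-disk layout.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Illustrative log record: not a real file system's on-disk format. */
struct log_record {
    char magic[8];              /* "TXCOMMIT" marks a committed record */
    int  old_x, old_y;
    int  new_x, new_y;
};

/* Log old/new values, commit the log, then update in place. */
static void transfer(int *x, int *y) {
    struct log_record rec;
    memset(&rec, 0, sizeof rec);
    memcpy(rec.magic, "TXCOMMIT", 8);
    rec.old_x = *x;      rec.old_y = *y;       /* steps 2-3: log the changes  */
    rec.new_x = *x - 1;  rec.new_y = *y + 1;

    int log_fd = open("journal", O_WRONLY | O_CREAT | O_APPEND, 0644);
    write(log_fd, &rec, sizeof rec);           /* one write holds the record  */
    fsync(log_fd);                             /* step 4: the commit point    */
    close(log_fd);

    *x = rec.new_x;                            /* steps 5-6: in-place updates */
    *y = rec.new_y;                            /* happen only after commit    */
}

int main(void) {
    int x = 1, y = 1;
    transfer(&x, &y);
    return 0;
}
```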

Scenarios of Crashes
If a crash occurs after the commit
• Replay the log to update the accounts
If a crash occurs before the commit
• Roll back and discard the transaction
A crash cannot occur during the commit
• The commit is built as an atomic operation
• e.g., writing a single sector on disk
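Recovery can then be sketched as scanning the journal for commit markers; the record format matches the hypothetical one above, and anything without a commit marker is simply discarded.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

struct log_record {             /* same hypothetical format as above */
    char magic[8];
    int  old_x, old_y;
    int  new_x, new_y;
};

/* Sketch: redo committed transactions, ignore everything else. */
static void recover(int *x, int *y) {
    struct log_record rec;
    int fd = open("journal", O_RDONLY);
    if (fd < 0)
        return;                                   /* no journal: nothing to do */

    while (read(fd, &rec, sizeof rec) == (ssize_t)sizeof rec) {
        if (memcmp(rec.magic, "TXCOMMIT", 8) == 0) {
            *x = rec.new_x;                       /* crash after commit: replay */
            *y = rec.new_y;
        }
        /* Crash before commit: the record is missing or unmarked, so skip it. */
    }
    close(fd);
}

int main(void) {
    int x = 1, y = 1;
    recover(&x, &y);    /* would run at boot, before normal operation resumes */
    return 0;
}
```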

Two-Phase Locking (Multiple Threads)
Logging alone is not enough to prevent multiple transactions from trashing one another (not serializable)
Solution: two-phase locking (sketched below)
1. Acquire all locks
2. Perform updates and release all locks
Thread A cannot see thread B's changes until thread B commits and releases its locks
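A minimal two-phase-locking sketch with POSIX threads; the accounts x and y and their per-account locks are invented for illustration.

```c
#include <pthread.h>

/* Hypothetical shared accounts, each protected by its own lock. */
static int x = 1, y = 1;
static pthread_mutex_t x_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t y_lock = PTHREAD_MUTEX_INITIALIZER;

static void *transfer_txn(void *arg) {
    (void)arg;

    /* Phase 1: acquire ALL locks before touching any data
     * (always in the same order, which also avoids deadlock). */
    pthread_mutex_lock(&x_lock);
    pthread_mutex_lock(&y_lock);

    /* Perform the updates; no other thread can observe a partial state. */
    x = x - 1;
    y = y + 1;

    /* Phase 2: release all locks only after the transaction is done. */
    pthread_mutex_unlock(&y_lock);
    pthread_mutex_unlock(&x_lock);
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, transfer_txn, NULL);
    pthread_create(&b, NULL, transfer_txn, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}
```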

Transactions in File Systems
Almost all file systems built since 1985 use write-ahead logging
• Windows NT, Solaris, OSF, etc.
+ Eliminates running fsck after a crash
+ Write-ahead logging provides reliability
- All modifications need to be written twice

Log-Structured File System (LFS)
If logging is so great, why don't we treat everything as log entries?
Log-structured file system:
• Everything is a log entry (file headers, directories, data blocks)
• Write the log only once
• Use version stamps to distinguish between old and new entries

More on LFS
New log entries are always appended to the end of the existing log
• All writes are sequential
• Seeks only occur during reads, which is not so bad due to temporal locality and caching
Problem:
• Need to create contiguous free space all the time
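A minimal sketch of the append-only idea: each write becomes a new log entry tagged with a version stamp, and the highest stamp for a block wins on replay; the entry format and the in-memory "log" are invented for illustration.

```c
#include <stdint.h>
#include <string.h>

#define LOG_SIZE   (1 << 20)
#define BLOCK_SIZE 512

/* Illustrative log entry: a data block plus a version stamp. */
struct log_entry {
    uint32_t block_no;                  /* which logical block this updates  */
    uint64_t version;                   /* the newest version wins on replay */
    char     data[BLOCK_SIZE];
};

static char     log_area[LOG_SIZE];    /* stands in for the on-disk log      */
static size_t   log_tail = 0;          /* next free byte: the log only grows */
static uint64_t next_version = 1;

/* Append-only write: never overwrite, just add a newer entry at the end. */
static int lfs_write(uint32_t block_no, const char *block) {
    if (log_tail + sizeof(struct log_entry) > LOG_SIZE)
        return -1;                      /* a real LFS would clean/compact here */

    struct log_entry e;
    e.block_no = block_no;
    e.version  = next_version++;
    memcpy(e.data, block, BLOCK_SIZE);

    memcpy(log_area + log_tail, &e, sizeof e);   /* purely sequential append */
    log_tail += sizeof e;
    return 0;
}

int main(void) {
    char block[BLOCK_SIZE] = "hello";
    lfs_write(7, block);    /* first version of block 7                  */
    lfs_write(7, block);    /* a newer version supersedes the old entry  */
    return 0;
}
```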

RAID and Reliability
So far, we have assumed that we have a single disk
What if we have multiple disks?
• The chance of a single-disk failure increases
RAID: redundant array of independent disks
• A standard way of organizing disks and classifying the reliability of multi-disk systems
• General methods: data duplication, parity, and error-correcting codes (ECC)

RAID 0
No redundancy
Uses block-level striping across disks
• i.e., the 1st block is stored on disk 1, the 2nd block on disk 2, and so on
Failure of any disk causes data loss
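Striping is just arithmetic on the logical block number; a minimal sketch, using one common round-robin convention (disks numbered from 0):

```c
#include <stdio.h>

/* Where does a logical block live under RAID 0 block-level striping? */
struct location {
    int  disk;      /* which disk in the array       */
    long block;     /* block offset within that disk */
};

static struct location raid0_map(long logical_block, int num_disks) {
    struct location loc;
    loc.disk  = (int)(logical_block % num_disks);  /* round-robin across disks */
    loc.block = logical_block / num_disks;         /* offset within that disk  */
    return loc;
}

int main(void) {
    /* With 4 disks, logical blocks 0,1,2,3 land on disks 0,1,2,3;
     * block 4 wraps around to disk 0 at offset 1, and so on.       */
    for (long b = 0; b < 8; b++) {
        struct location loc = raid0_map(b, 4);
        printf("logical %ld -> disk %d, block %ld\n", b, loc.disk, loc.block);
    }
    return 0;
}
```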

Non-Redundant Disk Array Diagram (RAID Level 0)
[Diagram: a file system issuing open(foo), read(bar), and write(zoo) to an array of disks with no redundancy]

Mirrored Disks (RAID Level 1)
Each disk has a second disk that mirrors its contents
• Writes go to both disks
+ Reliability is doubled
+ Read access faster
- Write access slower
- Expensive and inefficient

Mirrored Disk Diagram (RAID Level 1)
[Diagram: a file system issuing open(foo), read(bar), and write(zoo); each write goes to both disks of a mirrored pair]

Memory-Style ECC (RAID Level 2)
Some disks in the array are used to hold ECC
+ More efficient than mirroring
+ Can correct, not just detect, errors
- Still fairly inefficient
• e.g., 4 data disks require 3 ECC disks

Memory-Style ECC Diagram (RAID Level 2)
[Diagram: a file system issuing open(foo), read(bar), and write(zoo) to data disks plus dedicated ECC disks]

Bit-Interleaved Parity (RAID Level 3)
Uses bit-level striping across disks
• i.e., the 1st bit is stored on disk 1, the 2nd bit on disk 2, and so on
One disk in the array stores parity for the other disks
+ More efficient than Levels 1 and 2
- The parity disk doesn't add bandwidth

Parity Method
Disk 1: 1001
Disk 2: 0101
Disk 3: 1000
Parity: 0100 = 1001 xor 0101 xor 1000
To recover disk 2:
• Disk 2: 0101 = 1001 xor 1000 xor 0100
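The XOR arithmetic above, written out in code; the 4-bit values from the slide are stored one per byte for brevity.

```c
#include <stdio.h>

int main(void) {
    unsigned char disk1 = 0x9;                    /* 1001 */
    unsigned char disk2 = 0x5;                    /* 0101 */
    unsigned char disk3 = 0x8;                    /* 1000 */

    /* Parity is the XOR of all the data disks: 0100. */
    unsigned char parity = disk1 ^ disk2 ^ disk3;

    /* If disk 2 fails, XOR the survivors with the parity to rebuild it. */
    unsigned char rebuilt = disk1 ^ disk3 ^ parity;

    printf("parity = %x\n", parity);              /* prints 4 (0100) */
    printf("disk 2 = %x\n", rebuilt);             /* prints 5 (0101) */
    return 0;
}
```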

Bit-Interleaved RAID Diagram (Level 3)
[Diagram: a file system issuing open(foo), read(bar), and write(zoo) to bit-striped data disks plus a single parity disk]

Block-Interleaved Parity (RAID Level 4)
Like bit-interleaved parity, but data is interleaved in blocks
+ More efficient data access than Level 3
- The parity disk can be a bottleneck
- Small writes are costly (see the next slides)

To update just one block
Do we need to read in the entire stripe?

To update just one block
Do we need to read in the entire stripe? No: we only need the old data block and the old parity
• old_parity = old_block1 ⊕ old_block2 ⊕ old_block3
• new_parity = new_block1 ⊕ old_block2 ⊕ old_block3
• old_parity ⊕ new_parity = old_block1 ⊕ new_block1
• new_parity = new_block1 ⊕ old_block1 ⊕ old_parity
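The small-write read-modify-write as a sketch over in-memory byte arrays standing in for one stripe; a real array would issue four I/Os (read old data, read old parity, write new data, write new parity).

```c
#include <stddef.h>

#define BLOCK_SIZE 512

/* Update one data block and the parity using only the OLD data block and
 * the OLD parity: no need to read the rest of the stripe.                */
static void small_write(unsigned char *data_block,      /* old data (read)   */
                        unsigned char *parity_block,    /* old parity (read) */
                        const unsigned char *new_data)  /* new data to write */
{
    for (size_t i = 0; i < BLOCK_SIZE; i++) {
        /* new_parity = new_block XOR old_block XOR old_parity */
        parity_block[i] ^= data_block[i] ^ new_data[i];
        data_block[i]    = new_data[i];
    }
}

int main(void) {
    unsigned char data[BLOCK_SIZE]   = {0};
    unsigned char parity[BLOCK_SIZE] = {0};
    unsigned char fresh[BLOCK_SIZE]  = {1};
    small_write(data, parity, fresh);   /* parity now reflects the new data */
    return 0;
}
```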

Block-Interleaved Parity Diagram (RAID Level 4)
[Diagram: a file system issuing open(foo), read(bar), and write(zoo) to block-striped data disks plus a dedicated parity disk]

Block-Interleaved Distributed-Parity (RAID Level 5)
In some sense, the most general of the standard RAID levels
Spreads the parity out over all disks
+ No parity-disk bottleneck
+ All disks contribute read bandwidth
- Requires 4 I/Os for small writes
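One simple rotating-parity placement, sketched below; real implementations differ in the exact rotation (e.g., left-symmetric layouts), so the convention here is an assumption for illustration.

```c
#include <stdio.h>

/* Which disk holds the parity for a given stripe? Rotate it so no single
 * disk becomes the parity bottleneck (one simple convention).            */
static int raid5_parity_disk(long stripe, int num_disks) {
    return (int)(stripe % num_disks);
}

/* Which disk holds the k-th data block of a stripe? Skip the parity disk. */
static int raid5_data_disk(long stripe, int k, int num_disks) {
    int parity = raid5_parity_disk(stripe, num_disks);
    return (parity + 1 + k) % num_disks;
}

int main(void) {
    int num_disks = 5;
    for (long stripe = 0; stripe < 5; stripe++)
        printf("stripe %ld: parity on disk %d, first data block on disk %d\n",
               stripe,
               raid5_parity_disk(stripe, num_disks),
               raid5_data_disk(stripe, 0, num_disks));
    return 0;
}
```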

Block-Interleaved Distributed-Parity Diagram (RAID Level 5)
[Diagram: a file system issuing open(foo), read(bar), and write(zoo) to an array where data and parity blocks are spread across all disks]