A Software Layer for Disk Fault Injection
Jake Adriaens
Dan Gibson
CS 736, Spring 2005
Instructor: Remzi Arpaci-Dusseau

Outline
1. Introduction, Motivation, & Challenges
2. Related Work
3. Implementation Details & IDE Driver
4. Fault Model
5. Methods & Evaluation
6. Summary

Overview - 1
Software system for modeling IDE disk faults in an x86/Linux-based computer
Modification to the IDE driver for read/write event interception

Overview - 2
Disk faults are described at a high level
Faults are passed to a kernel-level module
On each read/write event:
– The IDE driver calls the kernel module to perform request modification
– Before a write event, the module may modify the data to be written
– After a read event, the module may modify the data read from disk
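A minimal sketch of this flow in C, assuming hypothetical hook names (the transcript preserves none of the project's actual symbols; fi_map_sector, fi_before_write, fi_after_read, and hw_transfer are all illustrative):

    /*
     * Sketch of the interception flow. All identifiers here are assumed
     * names for illustration, not the authors' actual symbols.
     */
    extern unsigned long fi_map_sector(unsigned long sector);
    extern int  fi_before_write(unsigned long sector, char *buf, unsigned int n);
    extern void fi_after_read(unsigned long sector, char *buf, unsigned int n);
    extern void hw_transfer(int is_write, unsigned long sector,
                            char *buf, unsigned int n);

    static void serve_request(int is_write, unsigned long sector,
                              char *buf, unsigned int nbytes)
    {
            sector = fi_map_sector(sector);         /* may redirect the request */

            /* A nonzero return asks the driver to silently drop the write. */
            if (is_write && fi_before_write(sector, buf, nbytes))
                    return;

            hw_transfer(is_write, sector, buf, nbytes);  /* the real disk I/O */

            if (!is_write)
                    fi_after_read(sector, buf, nbytes);  /* may corrupt data read */
    }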

Motivation – Why purposely cause disk failures?
Commodity HW (and SW!) fails, usually at unexpected times
– Causing failures at expected times can help improve fault-tolerance measures
Can be used to determine the fault tolerance of systems
– Various flavors of RAID need fault injection

Motivation
Faults can happen at the worst time
– In the middle of a PowerPoint presentation…

Challenges
Drivers are typically written with reliability in mind
– May have error detection/correction measures: should these be removed? Fooled? Applauded?
Low-level drivers critically affect the performance and stability of the system
– Disk faults need not be “stable,” but shouldn’t have unusual “side effects”

Challenges
Failure models are difficult to justify
– Disk manufacturers don’t offer details on how/why their disks fail
The failstop model is widely used: it models complete, detected disk failure
Other models must be chosen generally, to account for many different disks, controllers, etc.

Outline
1. Introduction, Motivation, & Challenges
2. Related Work
3. Implementation Details & IDE Driver
4. Fault Model
5. Methods & Evaluation
6. Summary

Related Work
Software fault injection
– Huang et al. (and many others) use software fault injection for modifying cached web pages (ACM/Proc. WWW)
– Jarboui et al. inject software faults into the Linux kernel and observe system behavior
– Nagaraja et al. inject faults into cluster-based systems

Related Work
Disk faults, modeling, and detection
– Kaaniche et al. inject disk faults to study RAID behavior
– Kari et al. present fault detection and diagnosis techniques (separate studies)
– Various other RAID and/or FS papers use some form of fault injection to model failures

Related Work
Hardware fault injection

Outline
1. Introduction, Motivation, & Challenges
2. Related Work
3. Implementation Details & IDE Driver
4. Fault Model
5. Methods & Evaluation
6. Summary

Implementation
Core components
– User-level parser
– In-kernel injection module
– In-driver upcalls
– System calls
Added ~20 lines to the IDE driver code
Kernel module is demand-loaded, ~250 lines in size
Two system calls, inject_faults and getsectors, ~120 lines

Implementation – User-level Console
Used for fault definition
– Console interface for defining faults
– Processes batch files
– Checks faults for validity: sector ranges, probability, etc. (more later)
– Passes faults to the kernel module
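The console's batch syntax is not preserved in the transcript; purely as an illustration, a one-fault-per-line format might look like:

    # Hypothetical syntax only; the real format is not preserved here.
    # <fault type>  <drive>  <sector range>   <probability>
    sectorfail      hdc      204800-204807    1.0
    sectorro        hdc      512000-512999    0.25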

Implementation – IDE Driver Modification
Added “upcalls” to the injection module
– Pass I/O requests to the module for modification
– Provide a callback service on I/O completion
Added special-purpose code for certain fault models
– The failstop model requires in-driver actions

Implementation – Kernel Module
Receives fault lists from the user-level console
Called by the IDE driver to perform injection when:
– the LBA sector (SCSI-like) becomes known: the sector may be modified
– a write is initiated: the data to be written may be modified
– a read completes: the data may be modified before control returns to the I/O initiator
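These three call points suggest a module-side interface roughly like the following (names and signatures assumed, matching the driver-side sketch above):

    /* Assumed module-side interface; names and signatures are illustrative. */

    /* Point 1: the LBA sector is known; the request may be redirected
     * (sectorwrong, transaddr). Returns the possibly-remapped sector. */
    unsigned long fi_map_sector(unsigned long sector);

    /* Point 2: a write is initiated; the outgoing data may be modified,
     * or the write suppressed entirely (sectorro). Returns nonzero to
     * suppress the write. */
    int fi_before_write(unsigned long sector, char *buf, unsigned int nbytes);

    /* Point 3: a read has completed; the data may be permuted
     * (sectorfail, transdata) before control returns to the initiator. */
    void fi_after_read(unsigned long sector, char *buf, unsigned int nbytes);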

Implementation – System Calls
Added two system calls
– inject_faults(): used to pass fault definitions to the kernel module from user space
– getsectors(): used to determine the raw sector ranges of IDE devices by name (there are other ways to do this)
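Neither call is in the stock syscall table, so user-level code would presumably reach them through syscall(2). In the sketch below, the syscall numbers and the fault_def layout are placeholders, not the project's real interface:

    /* Hypothetical invocation of the two added system calls. The numbers
     * and the fault_def layout are placeholders for illustration only. */
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #define NR_inject_faults 289        /* placeholder syscall number */
    #define NR_getsectors    290        /* placeholder syscall number */

    struct fault_def {                  /* placeholder layout */
            unsigned long first_sector, last_sector;
            int type;                   /* e.g., sectorfail */
            int probability_pct;
    };

    int main(void)
    {
            unsigned long nsectors = 0;
            struct fault_def f = { 204800, 204807, 0, 100 };

            if (syscall(NR_getsectors, "hdc", &nsectors) < 0)
                    perror("getsectors");
            else
                    printf("hdc: %lu sectors\n", nsectors);

            if (syscall(NR_inject_faults, &f, 1) < 0)
                    perror("inject_faults");
            return 0;
    }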

Implementation
[Slide diagram: Faults Defined -> Faults Injected; Disk Request -> I/O Initiated -> Upcall -> Modified Request -> Bus Traffic -> I/O Returns -> Control Returns]

IDE Driver (Linux Kernel)
Important structures
– struct request: information about an IDE request
  – READ / WRITE
  – Number of sectors
  – Etc.
– struct ide_drive_s (ide_drive_t): information about a drive
  – Drive name (e.g., “hdc”)
  – Sizing/addressing information
  – Etc.

IDE Driver (Linux Kernel)
Functions
– ide_do_rw_disk (3 versions)
  – Common choke point for reads & writes
  – Many other similar functions exist; only this one is in use
  – Two of the versions are swapped by preprocessor directives (one for DMA, one for PIO)
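For orientation, a simplified view of the two structures as they appeared in 2.6-era kernels (the real declarations live in <linux/blkdev.h> and <linux/ide.h>; only the fields the slide mentions are shown):

    /* Simplified; not the full kernel declarations. */
    struct request {
            sector_t      sector;       /* starting LBA sector */
            unsigned long nr_sectors;   /* length of the request, in sectors */
            char         *buffer;       /* data being read or written */
            /* direction (READ/WRITE) is tested with rq_data_dir(rq) */
    };

    typedef struct ide_drive_s {
            char name[4];               /* e.g., "hdc" */
            /* sizing/addressing information, etc. */
    } ide_drive_t;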

Outline
1. Introduction, Motivation, & Challenges
2. Related Work
3. Implementation Details
4. Fault Model
5. Methods & Evaluation
6. Summary

Failure Model
Models selected to represent a “generic IDE” disk
– No modeling of specific failures (e.g., Western Digital’s “classic” servo malfunction)
– Models based on ranges of affected logical sectors (à la SCSI)

Failure Model – Fault Types
sectorfail
– Models the inability of a given sector (block) or sector range to store data reliably
– Excited on reads of the sector: data read is permuted in some way:
  – Randomized
  – Set to a specific value
  – An offset added to each byte
  – Shifted by one or more bytes
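The four permutations are simple byte transformations; a sketch on one 512-byte sector, written as user-level C for clarity (the enum names and the rand() source are assumptions):

    #include <stdlib.h>
    #include <string.h>

    enum permute { P_RANDOM, P_SET, P_OFFSET, P_SHIFT };  /* assumed names */

    /* Permute one 512-byte sector buffer in place. */
    static void permute_sector(unsigned char buf[512], enum permute how,
                               unsigned char value)
    {
            unsigned char tmp[512];
            int i;

            switch (how) {
            case P_RANDOM:                     /* randomized */
                    for (i = 0; i < 512; i++)
                            buf[i] = (unsigned char)rand();
                    break;
            case P_SET:                        /* set to a specific value */
                    memset(buf, value, 512);
                    break;
            case P_OFFSET:                     /* offset added to each byte */
                    for (i = 0; i < 512; i++)
                            buf[i] += value;
                    break;
            case P_SHIFT:                      /* shifted (rotated) by one byte */
                    memcpy(tmp, buf, 512);
                    memcpy(buf + 1, tmp, 511);
                    buf[0] = tmp[511];
                    break;
            }
    }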

Failure Model – Fault Types sectorro – Writes to block have no effect on stored value – Excited on writes to sector: Write requests ignored sectorwrong – Traffic to a given block is directed to a different block – Excited on reads & writes Address permuted, similarly to data

Failure Model – Fault Types transaddr – Sector number wrong for first fault excitation, but right for all others – Excited on reads & writes Sector permuted as in sectorwrong transdata – Data is wrong for first fault excitation Data permuted as in sectorfail

Failure Model – Fault Types failstop – Drive is totally unresponsive—performs no reads or writes – Differs from traditional Failstop in that our failstop is invisible Drive does not report any errors, simply fails to perform reads or writes to any sector

Outline
1. Introduction, Motivation, & Challenges
2. Related Work
3. Implementation Details
4. Fault Model
5. Methods & Evaluation
6. Summary

Verification of Faults (?)
Faults are excited and observed by microbenchmarks tailored to individual fault types
Techniques similar to latent-fault detection (Kari et al., and other studies)
Verification of faults is fault-specific

Verification - sectorfail
Corrupts data when read from disk
1. Write known data to disk; observe its location using a printk statement
2. Inject a sectorfail fault at the location of the file on disk
3. Unmount/remount the FS (flushes the cache)
4. Attempt to read the faulty file (with cat)

Verification - sectorro
Ignores writes to a given location
1. Write known data to disk
2. Inject a sectorro fault
3. Flush the file cache
4. Write different data to the same location
5. Flush the file cache
6. Read the data from (1) back from disk
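A user-level sketch of this check, assuming the filesystem accepts 512-byte-aligned direct I/O; O_DIRECT/O_SYNC stand in for the explicit cache flushes, and the pause is where the fault would be injected from the console:

    /* Sketch of the sectorro check: write 'A', inject the fault while the
     * program waits, overwrite with 'B', read back, expect to still see 'A'. */
    #define _GNU_SOURCE             /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define SECT 512

    static void xfer(const char *path, int wr, char *buf)
    {
            int fd = open(path, (wr ? O_WRONLY | O_CREAT : O_RDONLY)
                                | O_DIRECT | O_SYNC, 0644);
            if (fd < 0) { perror("open"); exit(1); }
            if ((wr ? write(fd, buf, SECT) : read(fd, buf, SECT)) != SECT) {
                    perror("io"); exit(1);
            }
            close(fd);
    }

    int main(void)
    {
            char *buf;
            if (posix_memalign((void **)&buf, SECT, SECT))  /* O_DIRECT alignment */
                    return 1;

            memset(buf, 'A', SECT);
            xfer("testfile", 1, buf);      /* step 1: known data              */

            puts("inject sectorro fault now, then press Enter");
            getchar();                     /* steps 2-3 happen out of band    */

            memset(buf, 'B', SECT);
            xfer("testfile", 1, buf);      /* steps 4-5: overwrite attempt    */

            xfer("testfile", 0, buf);      /* step 6: read back               */
            puts(buf[0] == 'A' ? "fault manifested: old data survived"
                               : "no fault observed");
            free(buf);
            return 0;
    }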

Verification - sectorwrong
Changes the address (sector) to another sector number
1. Write known data to disk
2. Flush the file cache
3. Inject a sectorwrong fault that redirects to a known location
4. Read from the file; observe data from the other sector

Verification - transdata
Data is modified after a read, but only the first time
1. Verify sectorfail functionality
2. Flush the file cache
3. Re-read; expect correct data

Verification - transaddr
Sector number is modified before reads & writes, but only the first time
1. Verify sectorwrong functionality
2. Flush the file cache
3. Repeat the read; expect correct data

Verification - failstop
Easy!
1. Install a failstop fault
2. Attempt to access any portion of the affected drive
3. Expect bad things
– Usually causes a kernel panic

Evaluation
Execution-time overhead of the injection software
– Overhead << standard deviation of runtime for unaffected regions of disk space
– Overhead << standard deviation of runtime for affected regions
– Averaged over 250 accesses
[Slide table: Avg. (ms) and Std. Dev. for three cases: no injection, unaffected region, affected region]

Outline
1. Introduction, Motivation, & Challenges
2. Related Work
3. Implementation Details
4. Fault Model
5. Methods & Evaluation
6. Summary

Summary
Presented five new failure models for disk accesses, and the ability to inject them
Verified fault manifestation
– Did not verify potential side effects
Fault injection has no noticeable effect on access times
– Small SW overhead, much smaller than the access time to the physical device