Recovering Device Drivers Michael M Swift, Muthukaruppan Annamalai, Brian N Bershad and Henry Levy.


1 Recovering Device Drivers Michael M Swift, Muthukaruppan Annamalai, Brian N Bershad and Henry Levy

2 Introduction
- Device drivers fail more than anything else
  - Windows XP: drivers cause 85% of all crashes
  - Linux: drivers have 7x the bug rate of the mainline kernel
- Existing work protects the kernel
  - Applications are left to fend for themselves

3 Principles
- Device driver failures should be concealed from the driver's clients
- Recovery logic should be centralized in a single subsystem
- Driver recovery logic should be generic
- Recovery services should have low overhead when not needed

4 Shadow Drivers
- Conceals driver failure from application
- Logs driver activity
  - Driver state (ioctls)
  - IO requests/calls
- On failure
  - Intercepts IO requests
  - Resets driver state by replaying log
- Model is abstract enough to be implemented for a wide range of drivers

5 Why programs crash
- Most drivers fail due to bugs that result from unexpected inputs or events [34]
  - [34] V. Orgovan, Systems Crash Analyst, Windows Core OS Group, Microsoft Corp., private communication, 2004
- Do we really need a reference for this?
- What sort of reference is that anyway?

6 Driver Faults
- Deterministic
  - Set sequence of repeatable configuration or IO requests
  - Unrecoverable with generic tools
- Transient
  - Infrequent inputs or environment settings
- Fail-stop
  - Kernel is protected from failing drivers
  - Faults are detected before collateral damage occurs
- Shadow drivers require transient and fail-stop behavior

7 Nooks
- Earlier work in kernel protection
- Provides fail-stop facilities
  - Detects memory violations
  - Excessive CPU usage
  - Bad kernel parameters
- 75% success rate
- Simply reboots the driver after a fault

8 Shadow Driver Operation
- Passive Mode
  - Normal operation
  - Monitors all explicit communication
    - Replicated procedure calls
    - Not DMA
  - Logs driver configuration
- Active Mode
  - Recovery operation
  - Reinitializes driver to known state
  - Impersonates driver to the kernel

9 Taps
- Mechanism allowing replication and redirection of communication channels
- Passive mode
  - Calls driver function, then shadow function
- Active mode
  - Redirects all calls to shadow driver
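
The two tap modes on this slide can be sketched in a few lines. This is only an illustration, not the Nooks implementation: the `Tap` class, the `Mode` enum, and the `driver_write`/`shadow_write` functions are hypothetical names, and a real tap sits on kernel call paths rather than Python callables.

```python
from enum import Enum


class Mode(Enum):
    PASSIVE = "passive"
    ACTIVE = "active"


class Tap:
    """Hypothetical tap on one kernel-to-driver call path."""

    def __init__(self, driver_fn, shadow_fn):
        self.driver_fn = driver_fn
        self.shadow_fn = shadow_fn
        self.mode = Mode.PASSIVE

    def __call__(self, *args):
        if self.mode is Mode.PASSIVE:
            # Passive: call the real driver first, then replicate the call
            # to the shadow so it can log what it needs.
            result = self.driver_fn(*args)
            self.shadow_fn(*args)
            return result
        # Active: the failed driver is bypassed and the shadow answers.
        return self.shadow_fn(*args)


calls = []


def driver_write(data):
    calls.append(("driver", data))
    return len(data)


def shadow_write(data):
    calls.append(("shadow", data))
    return len(data)


tap = Tap(driver_write, shadow_write)
passive_result = tap(b"abc")   # reaches driver, then shadow
tap.mode = Mode.ACTIVE         # as the shadow manager would do on failure
active_result = tap(b"de")     # reaches the shadow only
```

The caller uses `tap(...)` the same way in both modes, which is what lets the failure stay hidden from it.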

10 Passive Taps

11 Active Taps

12 Shadow Manager
- Controls all shadow drivers
- Manages recovery operations
- Controls tap insertion
- Monitors device failures
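
A rough sketch of the manager's role, assuming a single synchronous `recover()` call per shadow driver. The class and method names are invented for illustration, and the real manager is notified by the shadow driver when recovery finishes rather than waiting on a function return.

```python
class ShadowManager:
    """Hypothetical manager: one shadow driver and one tap mode per driver."""

    def __init__(self):
        self.shadows = {}    # driver name -> shadow driver object
        self.tap_mode = {}   # driver name -> "passive" or "active"

    def install(self, name, shadow):
        self.shadows[name] = shadow
        self.tap_mode[name] = "passive"

    def on_failure(self, name):
        # Redirect calls to the shadow before recovery starts...
        self.tap_mode[name] = "active"
        self.shadows[name].recover()
        # ...and restore passive monitoring once recovery is done.
        self.tap_mode[name] = "passive"


class RecordingShadow:
    """Test double that records which tap mode it observed during recovery."""

    def __init__(self, manager, name):
        self.manager = manager
        self.name = name
        self.mode_during_recovery = None

    def recover(self):
        self.mode_during_recovery = self.manager.tap_mode[self.name]


manager = ShadowManager()
shadow = RecordingShadow(manager, "sound")
manager.install("sound", shadow)
manager.on_failure("sound")
```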

13 General Infrastructure
- Nooks
  - Isolation service
  - Redirection mechanism
  - Object tracking service
- Shadow Manager
  - Installs shadow drivers

14 Architecture

15 Passive Monitoring
- Tracks IO requests
  - Connection-oriented: offset/positioning
  - Request-oriented: pending request log
- Logs configuration commands
  - Only information stored in a persistent log
  - Does not replicate driver state
- Tracks kernel objects obtained by driver
  - Prevents memory leaks
- Many of the replicated calls are no-ops
  - Read/write to sound device
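
The bookkeeping listed on this slide might look roughly like the following. `ShadowLog`, its method names, and the `SET_RATE` command are hypothetical. The point is how little is kept: a position for connection-oriented devices, the set of unfinished requests for request-oriented ones, and the configuration history.

```python
class ShadowLog:
    """Hypothetical passive-mode log kept by a shadow driver."""

    def __init__(self):
        self.offset = 0      # connection-oriented: current position only
        self.pending = {}    # request-oriented: requests not yet completed
        self.config = []     # configuration commands, in order issued

    def on_io(self, nbytes):
        # For a stream-like device the data itself is not kept, only position.
        self.offset += nbytes

    def on_submit(self, req_id, payload):
        self.pending[req_id] = payload

    def on_complete(self, req_id):
        # Completed requests never need to be resubmitted, so forget them.
        self.pending.pop(req_id, None)

    def on_config(self, command, arg):
        self.config.append((command, arg))


log = ShadowLog()
log.on_config("SET_RATE", 44100)
log.on_io(512)
log.on_submit(1, b"pkt1")
log.on_submit(2, b"pkt2")
log.on_complete(1)
```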

16 Active Mode Recovery
- Impersonates driver to kernel and applications
- Recovers driver
  - Stops failed driver
  - Reinitializes driver
  - Transfers state back into driver

17 Stopping the Failed Driver
- Shadow Manager
  - Signals shadow driver of failure
  - Switches taps to redirection
- Shadow Driver
  - Disables hardware device
  - Garbage collects unnecessary resources

18 Reinitializing the Driver
- Shadow driver uses cached data section
- Initializes driver
- Reattaches driver to kernel resources
- Reenables hardware resources

19 Transferring Driver State
- Shadow Driver resubmits any outstanding IO requests
  - Possible replication of IO
  - If device cannot handle duplicate IO, request is canceled
- Replays logged configuration commands
- Shadow Driver signals Shadow Manager
  - Taps set back to passive mode
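
A minimal sketch of the state transfer, under stated assumptions: `FakeDriver`, `transfer_state`, the `tolerates_duplicates` flag, and the `SET_MTU` command are all hypothetical. The slide does not say whether configuration replay or IO resubmission happens first. The sketch replays configuration first so that resubmitted requests reach a configured driver.

```python
class FakeDriver:
    """Stand-in for a freshly reinitialized driver (hypothetical interface)."""

    def __init__(self):
        self.configured = []
        self.submitted = []

    def ioctl(self, command, arg):
        self.configured.append((command, arg))

    def submit(self, req_id, payload):
        self.submitted.append(req_id)


def transfer_state(driver, config_log, pending, tolerates_duplicates):
    """Replay logged configuration, then resubmit or cancel pending IO.

    Returns the ids of requests that were canceled instead of resubmitted.
    """
    for command, arg in config_log:
        driver.ioctl(command, arg)

    canceled = []
    for req_id, payload in pending.items():
        if tolerates_duplicates:
            # The request may already have reached the device before the
            # failure, so this can replicate IO.
            driver.submit(req_id, payload)
        else:
            # Device cannot handle duplicate IO: cancel rather than resubmit.
            canceled.append(req_id)
    return canceled


net = FakeDriver()
net_canceled = transfer_state(net, [("SET_MTU", 1500)], {7: b"a", 8: b"b"}, True)

strict = FakeDriver()
strict_canceled = transfer_state(strict, [], {7: b"a"}, False)
```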

20 Proxying of Requests
- Depends on driver mechanics and interface
- Possible actions
  - Respond with recorded information
  - Silently drop request
  - Queue request for later
  - Block request
  - Report driver busy
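
One way to picture the per-request choice is a policy table consulted while the shadow is in active mode. The names below (`ProxyAction`, `ActiveShadow`, the request kinds, the `"EBUSY"` string) are made up for the example. Blocking the caller is left out because it needs a real wait primitive, so only four of the five actions appear.

```python
from collections import deque
from enum import Enum, auto


class ProxyAction(Enum):
    REPLY_RECORDED = auto()   # respond with recorded information
    DROP = auto()             # silently drop request
    QUEUE = auto()            # queue request for later
    BUSY = auto()             # report driver busy


class ActiveShadow:
    """Hypothetical shadow driver answering requests during recovery."""

    def __init__(self, policy, recorded):
        self.policy = policy        # request kind -> ProxyAction
        self.recorded = recorded    # request kind -> last known answer
        self.queued = deque()       # requests to hand over after recovery

    def handle(self, kind, payload=None):
        action = self.policy[kind]
        if action is ProxyAction.REPLY_RECORDED:
            return self.recorded[kind]
        if action is ProxyAction.DROP:
            return None
        if action is ProxyAction.QUEUE:
            self.queued.append((kind, payload))
            return "queued"
        return "EBUSY"


shadow = ActiveShadow(
    policy={
        "get_mac": ProxyAction.REPLY_RECORDED,
        "audio_write": ProxyAction.DROP,
        "send_packet": ProxyAction.QUEUE,
        "open": ProxyAction.BUSY,
    },
    recorded={"get_mac": "00:11:22:33:44:55"},
)
mac = shadow.handle("get_mac")
dropped = shadow.handle("audio_write", b"pcm")
queued = shadow.handle("send_packet", b"pkt")
busy = shadow.handle("open")
```

Which action fits which request depends on the driver class, as the slide notes. The mapping above is only an example.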

21 Limitations
- Requires dynamic loading and unloading
- Requires explicit communication channels
  - DMA doesn't work
- Assumes driver failure has no external effects
- Requires effective isolation and protection service
- Cannot make real-time guarantees

22 Evaluation
- Performance
  - Overhead during passive mode
- Fault-Tolerance
  - Does it work?
- Limitations
  - How many failures can be dealt with?
- Code Size
  - Amount of kernel modification needed
- Either the advisor is a jerk or the grad students need a social life

23 Tested Drivers

24 Tested Applications

25 Performance
- Three configurations
  - Linux-Native: Stock kernel
  - Linux-Nooks: Kernel protection
  - Linux-SD: Shadow driver implementation
- No additional penalty vs Linux-Nooks
- Only 1-3% performance hit vs Linux-Native

26 Relative Performance

27 CPU Utilization

28 Fault Tolerance
- Bugs culled from bug-fixes posted to the linux-kernel mailing list
- Bugs were replicated inside each driver
  - Placed bugs in rarely taken paths
  - Unusual hardware conditions
  - Forced driver to take unusual path
- What is the difference between that and adding a faulting ioctl?

29 Fault Tolerance

30 Recovery Behavior
- Not completely seamless
  - Noticeable gap during recovery
  - Possible temporary data loss

31 Limitations
- How do shadow drivers perform with non-fail-stop errors?
- Large-scale fault injection experiments
- Cases
  - Failure detected
    - Recovery hidden from application?
  - Failure not detected

32 What would you do for a PhD?
"In total, we ran 2100 trials across the three drivers and six applications. Between trials, we reset the system and reloaded the driver. For each trial, we injected five random errors into the driver while the application was using it. We ensured the errors were transient by removing them during recovery. After injection, we visually observed the impact on the application and the system to determine whether a failure or recovery had occurred."

33 Undetected Failures
- 3 cases
  - IO requests that never complete
  - Driver-device interaction
  - Certain bad parameters/return codes
- Need better understanding of driver semantics

34 Fault Outcomes

35 Code Size

