IMPROVING THE RELIABILITY OF COMMODITY OPERATING SYSTEMS

1 IMPROVING THE RELIABILITY OF COMMODITY OPERATING SYSTEMS
MICHAEL M. SWIFT, BRIAN N. BERSHAD, HENRY M. LEVY Presenter: Shyam Sunder Santoshi Visamsetty 11/13/2007

2 Outline
Introduction, Motivation, Previous Work, Nooks Architecture, Implementation, Performance, Conclusion

3 Features of a Good Operating System
High Performance, High Scalability, High Reliability

4 Reliability Problems in Operating Systems
Crashes are caused by device drivers and by other extensions such as file systems, virus detectors, and network protocols.

5 Causes of System Crashes in Windows NT
Source: June 2000

6 Crashes in Windows XP Source: Jan 2003

7 “The most notable reality is that the Windows operating system is not responsible for a majority of PC crashes in our data set. Poorly-written device drivers contribute most of the crashes in our data.” -- Windows XP Kernel Crash Analysis by Archana Ganapathi, Viji Ganapathi and David Patterson, University of California, Berkeley, 2006

8 Why Device Drivers? Device drivers access system memory and hardware directly. Device drivers and other extensions account for 70% of the code in a Linux release. Faulty extension code can crash the whole system.

9 Motivation Reliability remains a crucial but unsolved problem.
Failures are increasingly costly, OS extensions are increasingly prevalent, and extensions are the leading cause of OS failure. Extensions are optional components that reside in the kernel address space and typically communicate with the kernel through published interfaces.

10 Previous Approaches to Enhance Reliability
Microkernels, type-safe languages, new hardware (ring and segment architectures), and transaction-based systems.

11 Nooks Approach
Nooks works with a conventional processor architecture, a conventional programming language, a conventional OS architecture, and existing extensions. Nooks virtualizes only the interface between the kernel and the extension. Virtualization techniques typically run several entire operating systems on top of a virtual machine, so a faulty extension in one OS can cause only a few applications to fail. The challenge for reliable extensibility, however, is not in virtualizing the hardware. VMs also bring slow IPC and require intelligent scheduling.

12 Goals Isolation, Recovery, Backward Compatibility

13 Nooks Architecture Two Core Principles:
Design for fault resistance, not fault tolerance. Design for mistakes, not abuse. Following the second principle, Nooks occupies the design space between unprotected and safe.

14 Nooks: Implementation
Implemented on the Linux kernel. Isolated kernel extensions are wrapped by Nooks wrapper stubs. All extensions execute at ring 0. Nooks does not use the Intel x86 protection rings or memory segmentation mechanisms.

15 Nooks Layered Architecture

16 Functions of Nooks

17 Isolation Prevent extension errors from damaging the kernel.
Every extension executes within its own lightweight kernel protection domain. Tasks: protection-domain management and inter-domain control transfer. Protection-domain management involves the creation, manipulation, and maintenance of lightweight protection domains. Isolation services support control flow in both directions between extension domains and the kernel domain.

18 Isolation (Contd…) Extension Procedure Call (XPC)
XPC is a control-transfer mechanism for isolating extensions within the kernel. XPC occurs between asymmetrically trusted domains.

19 Isolation: Implementation
Two parts: memory management and Extension Procedure Call. To provide extensions with read access to the kernel, Nooks’ memory management code maintains a synchronized copy of the kernel page table for each domain. Each lightweight domain has private structures such as a dynamic local heap, a pool of stacks, physical memory mappings, and kernel memory buffers. Nooks currently does not protect the kernel from DMA by a device into the kernel address space.

20 Protection of Kernel Address Space
To provide extensions with read access to the kernel, Nooks’ memory management code maintains a synchronized copy of the kernel page table for each domain.
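
One way to picture the synchronized page-table copy is a per-domain table in which kernel pages are mapped read-only. The following is a user-space Python model rather than Nooks code: `Domain`, `sync_page_table`, and `can_write` are hypothetical names, and a dictionary stands in for the hardware page table.

```python
from dataclasses import dataclass, field

READ, WRITE = "r", "w"


@dataclass
class Domain:
    """Hypothetical model of one lightweight protection domain."""
    name: str
    own_pages: set = field(default_factory=set)     # pages the extension may write
    page_table: dict = field(default_factory=dict)  # page -> permission string


def sync_page_table(kernel_table, domain):
    """Rebuild the domain's copy of the kernel page table.

    Every kernel page is visible, but only the domain's own pages are writable.
    """
    domain.page_table = {
        page: (READ + WRITE if page in domain.own_pages else READ)
        for page in kernel_table
    }


def can_write(domain, page):
    """Return True if the domain's copy of the table allows writing the page."""
    return WRITE in domain.page_table.get(page, "")
```

In this model, keeping the copy "synchronized" simply means calling `sync_page_table` again whenever the kernel table changes.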

21 Isolation (Contd…) Extension Procedure Call (XPC):
XPC transfers control between extension and kernel domains. Two functions: nooks_driver_call and nooks_kernel_call.
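
The two XPC entry points can be sketched as a pair of calls that switch the current protection domain, run the target function, and switch back. This is a simplified Python model under assumptions: only the names `nooks_driver_call` and `nooks_kernel_call` come from the slides, while the `XPC` class, the signatures, and the domain stack are hypothetical.

```python
class XPC:
    """Toy model of control transfer between the kernel and extension domains."""

    def __init__(self):
        self.domain_stack = ["kernel"]  # execution starts in the kernel domain
        self.trace = []                 # order in which domains were entered

    @property
    def current(self):
        return self.domain_stack[-1]

    def _transfer(self, target, func, *args):
        # Enter the target domain, run the function, then always switch back.
        self.domain_stack.append(target)
        self.trace.append(target)
        try:
            return func(*args)
        finally:
            self.domain_stack.pop()

    def nooks_driver_call(self, domain, func, *args):
        """Kernel -> extension: run func inside the extension's domain."""
        return self._transfer(domain, func, *args)

    def nooks_kernel_call(self, func, *args):
        """Extension -> kernel: run func inside the kernel domain."""
        return self._transfer("kernel", func, *args)
```

The `finally` clause mirrors the requirement that control returns to the calling domain even when the callee fails.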

22 Isolation (Contd…) Deferred Call Mechanism
Maintains two queues: the extension-domain queue and the kernel-domain queue. Changes to the Linux kernel: maintain coherency between the kernel and extension page tables, handle exceptions, and handle co-location of the task structure.
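
A deferred call mechanism with two queues might look like the sketch below, which batches calls and runs them on the next domain crossing. The slides name only the two queues; which direction each queue serves, the `DeferredCalls` class, and its method names are assumptions made for illustration.

```python
from collections import deque


class DeferredCalls:
    """Toy model: queue cross-domain calls and run them in one batch."""

    def __init__(self):
        self.extension_queue = deque()  # assumed: calls queued by the extension for the kernel
        self.kernel_queue = deque()     # assumed: calls queued by the kernel for the extension
        self.crossings = 0              # number of domain crossings actually paid for

    def defer_to_kernel(self, func, *args):
        self.extension_queue.append((func, args))

    def defer_to_extension(self, func, *args):
        self.kernel_queue.append((func, args))

    def _flush(self, queue):
        if not queue:
            return 0
        self.crossings += 1  # one crossing covers the whole batch
        executed = 0
        while queue:
            func, args = queue.popleft()
            func(*args)
            executed += 1
        return executed

    def enter_kernel(self):
        """Run every call the extension deferred, in FIFO order."""
        return self._flush(self.extension_queue)

    def enter_extension(self):
        """Run every call the kernel deferred, in FIFO order."""
        return self._flush(self.kernel_queue)
```

Batching trades latency for fewer crossings: three deferred calls cost one crossing instead of three.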

23 Interposition Integrates existing extensions into the Nooks environment. Tasks: All extension-to-kernel and kernel-to-extension control flow passes through the XPC mechanism. All data transfer between the kernel and the extension is viewed and managed by Nooks’ object-tracking mechanism.

24 Interposition (Contd…)
Wrapper stubs: the interface between the extension, the Nooks Isolation Manager (NIM), and the kernel. The kernel views the stub as an extension’s function entry point. Extensions view the stub as the kernel’s extension API.

25 Interposition: Implementation
Nooks interposes wrapper stubs between extensions and the kernel. Wrappers provide transparency and protect control and data transfers in both directions. Changes to the kernel: the standard module loader and the module initialization code. Protection of data transfers: The Linux kernel exports many objects that are only read by extensions. These objects are linked directly into the extension so that they can be read freely. Macros and inline functions that directly modify kernel objects are changed into wrapped function calls. For object modifications that are not performance critical, Nooks converts the object access into an XPC into the kernel. For data structures, a shadow copy of the kernel object is created within the extension’s domain. The contents of the kernel object and the shadow object are synchronized before and after XPCs into the extension.

26 Wrappers Two types of wrappers: kernel wrappers and extension wrappers.
Wrappers perform three tasks. First, they check parameters for validity by verifying with the object tracker and memory manager that pointers are valid. Second, object-tracking code creates a copy of kernel objects on the local heap or stack within the extension’s protection domain. Third, wrappers perform an XPC into the kernel or extension to execute the desired function.
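
The three wrapper tasks can be sketched as one function that validates, copies, and then calls. This is a Python approximation under stated assumptions: `make_extension_wrapper` and `InvalidPointer` are hypothetical names, `id()` stands in for a pointer value, dictionaries stand in for kernel objects, and a plain function call stands in for the XPC.

```python
import copy


class InvalidPointer(Exception):
    """Raised when a parameter is not known to the (modelled) object tracker."""


def make_extension_wrapper(tracked_ids, ext_func):
    """Build a wrapper for a kernel -> extension call.

    tracked_ids is the set of id() values the object tracker considers valid.
    """
    def wrapper(kernel_obj):
        # Task 1: check the parameter against the object tracker.
        if id(kernel_obj) not in tracked_ids:
            raise InvalidPointer("object is not tracked for this extension")
        # Task 2: copy the kernel object into the extension's domain.
        shadow = copy.deepcopy(kernel_obj)
        # Task 3: transfer control (the XPC) and run the extension function.
        result = ext_func(shadow)
        # Copy the extension's changes back to the kernel object afterwards.
        kernel_obj.clear()
        kernel_obj.update(shadow)
        return result
    return wrapper
```

Because the extension only ever touches the shadow copy, a call that fails the validity check leaves the kernel object untouched.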

27 Control Flow of Extension and kernel Wrappers

28 Wrappers (Contd…) Wrapper Code Sharing:
248 wrappers were implemented to isolate 463 imported and exported functions. This implies that wrapper code is shared among multiple drivers.

29 Code Sharing among Wrappers

30 Object-Tracking Tasks:
Maintains a list of kernel data structures that are manipulated by an extension. Controls all modifications to those structures. Provides object information for clean-up when an extension fails. Object-tracking code copies kernel objects into an extension domain so they can be modified and copies them back after the changes have been applied.

31 Object Tracking: Implementation
Manages manipulation of kernel objects by extensions. Records all kernel objects and types in use by extensions. Performs two tasks: records the addresses of all objects in use by an extension, and records an association between the kernel and extension versions of each object. Garbage collection of tracked objects.
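
The two recording tasks suggest a per-extension table keyed by kernel address. The sketch below is one possible shape for it, not the Nooks data structure: `ObjectTracker` and all of its methods are hypothetical, and integers stand in for addresses.

```python
class ObjectTracker:
    """Toy model of the per-extension object table."""

    def __init__(self):
        # extension name -> {kernel address: (object type, extension-side address)}
        self._objects = {}

    def track(self, extension, kernel_addr, obj_type, ext_addr=None):
        """Record that an extension uses a kernel object, and where its copy lives."""
        self._objects.setdefault(extension, {})[kernel_addr] = (obj_type, ext_addr)

    def is_tracked(self, extension, kernel_addr):
        return kernel_addr in self._objects.get(extension, {})

    def extension_copy(self, extension, kernel_addr):
        """Return the extension-side address associated with a kernel object."""
        return self._objects[extension][kernel_addr][1]

    def untrack(self, extension, kernel_addr):
        """Forget an object, for example once it has been freed."""
        self._objects.get(extension, {}).pop(kernel_addr, None)

    def objects_of(self, extension):
        """List the kernel addresses an extension currently holds."""
        return sorted(self._objects.get(extension, {}))
```

The same table serves wrappers (validity checks and kernel-to-extension lookups) and recovery (enumerating what a failed extension held).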

32 Recovery
Software faults: Occur when an extension invokes a kernel service improperly. The recovery policy determines whether Nooks triggers recovery or returns control to the extension with an error code when possible.
Hardware faults: Occur when an extension attempts to read unmapped memory. These trigger recovery.
For software faults, a policy is maintained because there may be a few kernel data structures which may be in use by other extensions and also that other extensions which

33 Recovery: Implementation
Two parts: release of resources by the recovery manager, and coordination of recovery through the user-mode agent. The Nooks recovery manager is tasked with returning the system to a clean state from which it can continue. The user-mode recovery agent facilitates flexible recovery. Nooks disables interrupt processing for the device controlled by the extension, preventing the livelock that could occur if device interrupts are not properly dismissed.

34 Recovery: Implementation (Contd…)
The recovery manager walks the list of objects known to the object tracker and releases, frees, or unregisters all objects that will not be accessed by external devices. It uses a recovery function which releases the objects to the kernel and removes all references from the kernel into the extension.
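
The walk over tracked objects can be sketched as a loop that dispatches to a per-type release function and skips anything a device may still access. This is an illustrative Python model: `recover_extension`, the `device_may_access` flag, and the dictionary layout are assumptions, not the Nooks interfaces.

```python
def recover_extension(tracked, release_fns):
    """Release every tracked object that external devices will not access.

    tracked      list of dicts with "name", "type", and "device_may_access" keys
    release_fns  maps an object type to the function that releases it
    Returns (released names, names left alone because a device may still use them).
    """
    released, kept = [], []
    for obj in tracked:
        if obj["device_may_access"]:
            # Leave memory alone if a device could still read or write it.
            kept.append(obj["name"])
            continue
        release_fns[obj["type"]](obj)  # free, release, or unregister by type
        released.append(obj["name"])
    return released, kept
```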

35 Implementation Limitations
Complete isolation or fault tolerance is not achieved. Nooks runs extensions in kernel mode, so it cannot prevent extensions from deliberately executing privileged instructions. Nooks is limited to extensions that can be killed and restarted safely; this holds for device drivers, which can be dynamically loaded when hardware devices are connected to the system. As a result of these limitations, crashes may still occur.

36 Reliability Test Test methodology: synthetic fault injection
Extensions isolated:

37 Test Environment VMware Virtual Machine
Four programs: Sound drivers: play a short MP3 file. Network drivers: ICMP ping and TCP streaming tests. VFAT: untars and compiles a number of files. kHTTPd: web load generator. 400 trials were run for each extension in both native and Nooks mode.

38 Test Results System crashes: Native mode: 317 crashes, with 400 trials per extension
Nooks: eliminated 313 (99%); the remaining 4 resulted in deadlock. e1000 and pcnet32 are interrupt-oriented. VFAT, sb, and kHTTPd are process-oriented.
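
The 99% figure follows directly from the two crash counts on this slide. A short check, using only those numbers (the variable names are illustrative):

```python
native_crashes = 317   # crashes observed in native mode
eliminated = 313       # crashes that Nooks eliminated

remaining = native_crashes - eliminated         # the runs that ended in deadlock
elimination_rate = eliminated / native_crashes  # fraction of crashes eliminated

print(remaining)                        # 4
print(round(100 * elimination_rate))    # 99
```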

39 Test Results (Contd…)

40 Test Results (Contd…) Non-fatal extension failures:
For e1000 and pcnet32, failures that left the device in a non-functional state were not detected by Nooks. For VFAT and sb, Nooks reduced the number of non-fatal extension failures. For kHTTPd, only a small number of injected faults were caught by Nooks.

41 Test Results (Contd…)

42 Recovery Errors For the network, sb, and kHTTPd extensions, recovery is straightforward. For VFAT, 90% of the cases resulted in on-disk corruption. Reason: fault injection occurs after files and directories have been created, and an abrupt shutdown and restart of the file system leaves it in a corrupted state.

43 Recovery Errors (Contd…)
Solution: synchronize the in-memory disk cache with the disk before releasing resources on a VFAT recovery. Result: the share of cases with on-disk corruption fell from 90% to 10%.
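
The ordering in this fix, sync first and release second, can be illustrated with a small model in which dictionaries stand in for the in-memory cache and the disk. `recover_vfat` and its `sync_first` flag are hypothetical and exist only to contrast the two orderings.

```python
def recover_vfat(cache, disk, sync_first):
    """Model a VFAT recovery: optionally flush the cache, then release it.

    cache  dict of dirty blocks held in memory
    disk   dict of blocks as currently stored on disk
    """
    if sync_first:
        disk.update(cache)  # write dirty cached blocks out before teardown
    cache.clear()           # release the cache, as recovery frees resources
    return disk
```

Without the sync step, any block that existed only in the cache is lost when the cache is released, which is the situation the slide describes as on-disk corruption.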

44 Other Tests For manually injected errors, such as improper initializations and removed null checks, Nooks automatically detected and recovered from all such failures. Latent bugs: Nooks revealed several latent bugs in existing kernel extensions such as kHTTPd and the 3COM 3c90x Ethernet driver.

45 Summary of Reliability Tests
99% of the system crashes were detected and recovered. Nearly 60% of non-fatal extension failures were recovered.

46 Performance: Benchmarks
Benchmark | Extension | XPC Rate (per sec) | Nooks Relative Performance | Native CPU Util. (%) | Nooks CPU Util. (%)
Play-mp3 (128 Kbps) | sb | 150 | 1 | 4.8 | 4.6
Receive-Stream | e1000 | 8,923 | 0.92 | 15.2 | 15.5
Send-Stream | | 60,352 | 0.91 | 21.4 | 39.3
Compile-Local | VFAT | 22,653 | 0.78 | 97.5 | 96.8
Serve-simple-web-page | kHTTPd | 61,183 | 0.44 | 96.6 |
Serve-complex-web-page | | 1,960 | 0.97 | 90.5 | 92.6

47 Comparative Time-chart for Compilation BenchMark

48 Summary of Benchmark Results:
Nooks provides a substantial reliability improvement at a cost that depends on the extension being isolated. Moreover, performance depends on the CPU utilization imposed by the workload.

49 Conclusion Nooks can be implemented with modest engineering effort.
Extensions can be isolated without any change to extension code. Isolation and recovery dramatically improve system reliability. However, when performance matters for extensions with a high XPC frequency, isolation may not be appropriate.

50 QUESTIONS AND COMMENTS

51 THANK YOU

