1 Rx: Treating Bugs as Allergies – A Safe Method to Survive Software Failures Feng Qin Joseph Tucek Jagadeesan Sundaresan Yuanyuan Zhou Presentation by.

Presentation transcript:

1 Rx: Treating Bugs as Allergies – A Safe Method to Survive Software Failures
Feng Qin, Joseph Tucek, Jagadeesan Sundaresan, Yuanyuan Zhou
Presentation by Mark Lawson

2 Motivation
- Applications require high availability
  - Server application downtime leads to lost productivity and lost business
  - Average cost of an hour of downtime can exceed six million dollars
  - Almost every organization in today’s e-commerce world depends on its systems being highly available

3 Motivation
- Software defects make up 40% of all system failures
- Programmers are aware of this and rigorously test applications before release
  - This doesn’t always help; bugs are tricky bastards
- “to achieve higher system availability, mechanisms must be devised to allow systems to survive the effects of uneliminated software bugs to the largest extent possible”

4 Rebooting Techniques
- Idea: Restart program or parts of program (microreboot) after it crashes
- Problems:
  - Designed for hardware failures, not software
  - Deterministic software failures cannot be dealt with as they will occur every time
  - Restarting takes time

5 General checkpointing and recovery
- Idea: Checkpoint -> Rollback upon failure -> Re-execute
- Problems:
  - Similar problems to restarting techniques, such as inability to handle deterministic bugs

6 Application-specific recovery mechanisms
- Idea: Multi-process model; each client connection is a new process, and a process is killed if it fails
- Problems:
  - Still cannot deal with deterministic errors
  - If shared data is the problem, killing and restarting processes will not restore it to a consistent state

7 Other methods
- Failure-oblivious computing
  - Idea: Provide artificial values for out-of-bound reads
- Reactive immune system
  - Idea: Creates emulators to run “faulty” regions of a program
- Problems:
  - Considered “unsafe” by the authors because they mask behaviors and speculate as to what the program wants to achieve
  - Immune system has large overheads

8 Rx real-world metaphor
- Idea: Treat software bugs as real-world allergies
- In real life allergens can be dealt with by changing living environment
  - Removing cat hair from area allows me to breathe better
- Successfully removing allergen from environment allows one to determine cause of allergy
  - No cat hair = no sneezing → allergic to cats

9 Rx metaphor implemented
- Bugs resemble allergies
  - Bugs can be dealt with by changing the execution environment
- When a bug is detected, roll back to a checkpoint and alter the execution environment to deal with the detected issues
  - The least intrusive changes are tried first, followed by progressively more drastic ones until a good execution environment is found
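The escalation described on this slide can be sketched as a small loop. This is only an illustration, not the authors' implementation: the change names, their ordering, and the `reexecute` callback are all hypothetical, and the rollback itself is left as a comment.

```python
from typing import Callable, Optional, Sequence

# Hypothetical environment changes, ordered from least to most intrusive.
# The names are illustrative; only the ideas (padding, zero-filling,
# longer time slices, dropping requests) come from the slides.
ENV_CHANGES = ("pad_buffers", "zero_fill_buffers", "longer_time_slice", "drop_request")


def recover(reexecute: Callable[[str], bool],
            changes: Sequence[str] = ENV_CHANGES) -> Optional[str]:
    """Try each change in order; return the first one under which re-execution succeeds."""
    for change in changes:
        # A real system would roll back to the last checkpoint here
        # before re-executing under the modified environment.
        if reexecute(change):
            return change
    return None  # no change helped: the caller falls back to a full restart


def buggy_server(change: str) -> bool:
    """Toy stand-in for re-execution: this 'bug' is only avoided by padding buffers."""
    return change == "pad_buffers"


if __name__ == "__main__":
    print(recover(buggy_server))
```

Returning `None` models the case where no environment change works and a restart is the only option left.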

10 The Main Idea

11 Rx Architecture

12 Sensors
- Dynamically monitor the application’s execution to detect software failures
  - Send information to the control unit
- Two types of sensors
  - Sensor to monitor software errors (assertion failures, access violations)
  - Sensor to monitor software bugs (buffer overflows, access to freed memory)

13 Checkpoint and Rollback
- The CR component takes a snapshot of the application and stores it in main memory
  - Stores memory and file states
  - During rollback all of these states are restored and the program continues from the previous checkpoint
- Multiple checkpoints can be stored in case Rx needs to roll back to an earlier checkpoint
  - Keeps enough checkpoints to be “2-competitive”
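To make the checkpoint and rollback idea concrete, here is a minimal in-memory sketch. It substitutes a deep copy of a Python dictionary for a real process snapshot, so it shows only the bookkeeping (several retained checkpoints, rollback to the latest or an earlier one), not how Rx actually captures memory and file state. The class name and the `max_checkpoints` parameter are assumptions.

```python
import copy
from collections import deque


class CheckpointStore:
    """Keeps the most recent snapshots of an application state in memory."""

    def __init__(self, max_checkpoints: int = 3):
        # A bounded deque silently discards the oldest snapshot when full.
        self._snapshots = deque(maxlen=max_checkpoints)

    def checkpoint(self, state: dict) -> None:
        # Deep copy so later mutations of `state` do not alter the snapshot.
        self._snapshots.append(copy.deepcopy(state))

    def rollback(self, steps_back: int = 0) -> dict:
        """Return a copy of the snapshot `steps_back` before the newest one."""
        if steps_back >= len(self._snapshots):
            raise LookupError("no checkpoint that far back")
        return copy.deepcopy(self._snapshots[-1 - steps_back])


if __name__ == "__main__":
    store = CheckpointStore(max_checkpoints=2)
    state = {"memory": [1], "files": {"log": "a"}}
    store.checkpoint(state)
    state["memory"].append(2)
    store.checkpoint(state)
    state["memory"].append(3)            # work done after the last checkpoint
    print(store.rollback())              # newest snapshot
    print(store.rollback(steps_back=1))  # an earlier snapshot
```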

14 Execution Environment Changes
- Memory management based
  - Addresses bugs that are memory based such as buffer overflows, dangling pointers etc.
  - Ex: Padding to prevent buffer overflows, zero-filling new buffers
- Timing based
  - Addresses bugs that are related to asynchronous events like data races
  - Ex: Increasing length of scheduling time slot can avoid context switches in buggy critical sections
- User request based
  - Deals with the fact that it is impossible to test every possible user request
  - Ex: Dropping user requests during re-execution to deal with unexpected requests (LAST RESORT!)

15 Environment Wrappers
- Perform environmental changes for the application during re-execution
- Memory wrapper
  - Intercepts memory-related library calls and adjusts them according to what the control unit specifies
- Message wrapper
  - Changes the message delivery environment
- Process scheduling
  - Changes process priority to deal with scheduling issues
- Signal delivery
  - Keeps track of signals in order to control when they are sent
- Dropping user requests
  - Drops requests that may be causing errors
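The memory wrapper's effect can be illustrated with a toy heap. This is a simulation under stated assumptions, not Rx's wrapper: `PaddedHeap`, its bump allocator, and the `0xAA` stale-byte marker are invented for the example. It shows how padding absorbs an overflow and how zero-filling hides stale contents from reads of uninitialized buffers.

```python
class PaddedHeap:
    """Toy heap: a flat bytearray with a bump allocator and no bounds checks."""

    def __init__(self, size: int = 64, padding: int = 0, zero_fill: bool = False):
        self.memory = bytearray(b"\xAA" * size)  # 0xAA stands for stale contents
        self.padding = padding      # extra bytes reserved after each allocation
        self.zero_fill = zero_fill  # clear new buffers before handing them out
        self._next = 0

    def malloc(self, nbytes: int) -> int:
        start, end = self._next, self._next + nbytes
        if end + self.padding > len(self.memory):
            raise MemoryError("toy heap exhausted")
        if self.zero_fill:
            self.memory[start:end] = bytes(nbytes)
        self._next = end + self.padding
        return start

    def write(self, addr: int, data: bytes) -> None:
        # Deliberately unchecked, like a raw memcpy in C.
        self.memory[addr:addr + len(data)] = data


def run_buggy_program(heap: PaddedHeap) -> bytes:
    """Writes 8 bytes into a 4-byte buffer, then returns the neighbouring buffer."""
    a = heap.malloc(4)
    b = heap.malloc(4)
    heap.write(b, b"GOOD")
    heap.write(a, b"OVERFLOW")  # the bug: overruns `a` by 4 bytes
    return bytes(heap.memory[b:b + 4])


if __name__ == "__main__":
    print(run_buggy_program(PaddedHeap()))           # b'FLOW': neighbour corrupted
    print(run_buggy_program(PaddedHeap(padding=4)))  # b'GOOD': padding absorbed it
```

The program itself is unchanged between the two runs; only the allocator's behaviour differs, which is the sense in which the slides call the approach safe.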

16 Proxy
- Handles re-execution of requests, keeping clients oblivious to crashes
- In normal mode the proxy simply relays messages between client and server, keeping track of them
- In recovery mode it handles three tasks:
  - Replays requests from clients since the last checkpoint
  - Implements message-related environmental changes
  - Buffers client requests until the server has come back from the software failure
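The proxy's two modes can be sketched as follows. The class and method names are hypothetical, the server is modelled as a plain function from request to response, and dropping a request stands in for the message-related environmental changes; how the real proxy treats responses produced during replay is not covered by the slides.

```python
class Proxy:
    """Relays requests in normal mode; replays and buffers them in recovery mode."""

    def __init__(self, server):
        self.server = server      # callable: request -> response
        self.log = []             # requests seen since the last checkpoint
        self.pending = []         # requests that arrived during recovery
        self.recovering = False

    def on_checkpoint(self) -> None:
        self.log.clear()          # older requests never need replaying

    def handle(self, request):
        if self.recovering:
            self.pending.append(request)  # hold until the server is back
            return None
        self.log.append(request)
        return self.server(request)

    def begin_recovery(self) -> None:
        self.recovering = True

    def finish_recovery(self, drop=frozenset()):
        """Replay logged requests (minus dropped ones), then flush buffered ones."""
        self.log = [r for r in self.log if r not in drop]
        replayed = [self.server(r) for r in self.log]
        self.recovering = False
        buffered, self.pending = self.pending, []
        flushed = [self.handle(r) for r in buffered]
        return replayed, flushed


if __name__ == "__main__":
    proxy = Proxy(str.upper)      # toy server: upper-cases the request
    proxy.handle("a")
    proxy.handle("b")
    proxy.begin_recovery()
    proxy.handle("c")             # buffered, not delivered yet
    print(proxy.finish_recovery(drop={"b"}))
```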

17 Control Unit
- Controls the whole Rx system
- Performs three functions:
  - Directs CR to roll back at software failures
  - Diagnoses failures based on “symptoms” and previous knowledge of failures
  - Provides information on failures for programmers
- The control unit stores information on failures and what recoveries worked for future reference
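One way to picture how stored failure information could speed up later recoveries is a lookup table keyed by symptom. This is an assumption about the bookkeeping, not a description of Rx's data structures; `FailureTable`, the symptom strings, and the change names are all hypothetical.

```python
class FailureTable:
    """Remembers which environment change cured which symptom."""

    def __init__(self, default_order):
        self.default_order = list(default_order)  # least to most intrusive
        self.known_fixes = {}                     # symptom -> change that worked

    def candidates(self, symptom: str) -> list:
        """Return changes to try, with a previously successful one moved first."""
        order = list(self.default_order)
        fix = self.known_fixes.get(symptom)
        if fix in order:
            order.remove(fix)
            order.insert(0, fix)
        return order

    def record_success(self, symptom: str, change: str) -> None:
        self.known_fixes[symptom] = change


if __name__ == "__main__":
    table = FailureTable(["pad_buffers", "zero_fill_buffers", "drop_request"])
    print(table.candidates("segfault in parser"))  # default order on first sight
    table.record_success("segfault in parser", "zero_fill_buffers")
    print(table.candidates("segfault in parser"))  # known fix is tried first
```

The same table could also feed the report given to programmers, since it pairs each symptom with the change that avoided it.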

18 Design and Implementation Issues
- Inter-server communication
  - Server communication is key so that multiple servers can be rolled back to achieve system stability
- Multi-threaded process checkpointing
  - Force all threads to be at user level so that the checkpoint is accurate even though threads run simultaneously

19 Evaluation
- Tested on 4 server applications (Apache httpd, MySQL, Squid, CVS)

20 Overall Results

21 Throughput and Avg Response Time

22 Recovery Time

23 Rx Advantages
- Comprehensive
  - Can survive many common software defects
- Safe
  - Does not change program, only environment it runs in
- Noninvasive
  - Few to no modifications required in software (no mods in any of the tested systems)
- Efficient
  - No rebooting (mostly) with little overhead
  - Learns from previous solutions
- Informative
  - Bugs are shown and details are given on the nature of the bug

24 Issues
- Unavoidable bugs/failures
  - Accumulative memory leaks cannot be detected by Rx
  - Only solution is a program restart
  - Worst-case scenario: 2x the time of a normal restart
  - Did not happen in any of the tests

25 Questions/Complaints?

26 What do they mean by “execution environment”?
- “almost everything that is external to the target application but can affect the execution of the target application”
- 3 levels:
  - Lowest: Hardware (processor, devices)
  - Middle: OS kernel (scheduling, virtual memory management, device drivers)
  - Highest: libraries (standard, third-party)

27 Throughput and Avg Response Time

28 Avg Space Overhead per Checkpoint

29 Different bug arrival rates