Presented by: Daniel Taylor


1 Rx: Treating Bugs As Allergies – A Safe Method to Survive Software Failures
Presented by: Daniel Taylor

2 Outline
Motivation
Approaches to surviving failures
Rx Approach
Rx Design
Experimental Results
Future Work
Evaluation
Discussion

3 Motivation
System Availability
  Gartner report: 1 hour of downtime = $6 million
  Affected by software failures
Software defects cause up to 40% of system failures
Memory-related and concurrency bugs account for over 60% of system vulnerabilities

4 Motivation
Treat bugs as allergies
Examples of environmental bugs
  Memory management
    Buffer overflows
    Dangling pointers
  Timing
    Races
    Message ordering
  User requests
    Malicious users
    Bad requests

5 Approaches to surviving failures
1) Rebooting/System restart
  Designed for hardware failures
  Fails to fix deterministic bugs
  Unavailability
  Warm-up period
  Micro-rebooting

6 Approaches to surviving failures
2) Checkpointing and recovery
  Checkpoint, roll back on failure, re-execute
  Designed for hardware failures
  Fails to fix deterministic bugs
  Progressive retry – method to re-order messages
    Only works for message-ordering bugs
  N-version programming – run a different implementation on re-execution
    Requires extra software development

7 Approaches to surviving failures
3) Application-specific recovery
  Multi-process model
    Spawn new processes if old ones fail
    Cannot deal with deterministic bugs
    Cannot deal with shared data corruption
  Exception handling
    Programmer must expect failures

8 Approaches to surviving failures
4) Non-conventional methods
  Failure-oblivious computing
    Artificial values for buffer overflows
  Reactive immune systems
    Speculative error code for crashed functions
  Unsafe methods, not appropriate for critical applications
  Hard to debug if the “fix” does something strange

9 Rx Approach
Treat bugs like real-life allergies
  Remove the allergen to see if it helps
Goals:
  Comprehensive – survive software bugs
  Safe – no uncertainty or introduced errors
  Noninvasive – no modifications
  Efficient – good performance, reduce downtime
  Informative – help diagnose bugs

10 Rx Approach
Keep checkpoints
Fail > Rollback > Change Environment > Re-Execute
Disable the environment changes once re-execution succeeds
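The fail, roll back, change environment, re-execute cycle above can be sketched in a few lines. This is only an illustration under stated assumptions: `run_with_recovery`, `fragile_step`, and the change names are hypothetical, the checkpoint is a plain in-process copy rather than Rx's process-level checkpoint, and a Python exception stands in for a crash.

```python
from copy import deepcopy


def run_with_recovery(state, step, env_changes):
    """Run step(state, env); on failure roll back and retry under each change.

    Returns (final_state, change_that_worked). The change is None when the
    unmodified environment succeeded. All names here are hypothetical.
    """
    checkpoint = deepcopy(state)          # keep a checkpoint
    for env in [None, *env_changes]:      # None = unmodified environment
        attempt = deepcopy(checkpoint)    # roll back to the checkpoint
        try:
            step(attempt, env)            # (re-)execute
        except Exception:
            continue                      # failure: try the next change
        return attempt, env               # survived
    raise RuntimeError("no environment change avoided the failure")


def fragile_step(state, env):
    """Simulated handler that only survives when buffers are padded."""
    state["handled"] += 1
    if env != "pad_buffers":
        raise MemoryError("simulated buffer overflow")


state, used = run_with_recovery(
    {"handled": 0}, fragile_step, ["delay_free", "pad_buffers"]
)
```

Because every attempt starts from a fresh copy of the checkpoint, side effects of failed attempts never leak into the state that finally survives.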

11 Rx Approach
Execution Environment
  Anything external to the application that affects it
  Low level – hardware
  Middle level – OS kernel: scheduling, VM system, FS, drivers, etc.
  High level – libraries
Change must be:
  Correctness-preserving – follow APIs, do the same thing
  Able to avoid bugs – potentially fix a software defect

12 Rx Approach Environmental changes and bugs

13 Rx Design
5 parts
  Sensors
  Checkpoint and Rollback (CR)
  Environment Wrapper
  Proxy
  Control Unit

14 Rx Design: Sensors
Detect failures and inform the control unit
Two types of sensors:
  Detect software errors
  Detect bugs before they cause crashes
Only the first type is implemented
Provide information about the type of exception, memory address, and stack signature
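As a loose analogue of the first sensor type, the sketch below catches a failure and reports its type and a stack signature. It is an assumption-laden illustration: `run_with_sensor` and the report fields are hypothetical, a Python exception stands in for an OS-level exception, and no memory address is captured.

```python
import traceback


def run_with_sensor(fn, *args):
    """Call fn(*args); return (result, None) or (None, failure_report)."""
    try:
        return fn(*args), None
    except Exception as exc:
        frames = traceback.extract_tb(exc.__traceback__)
        report = {
            "type": type(exc).__name__,
            # Function names from outermost to innermost frame.
            "stack_signature": tuple(frame.name for frame in frames),
        }
        return None, report
```

A control unit could use the report as a key when looking up which environment change worked for similar failures in the past.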

15 Rx Design: Checkpoint and Rollback
Checkpoints are taken automatically and transparently
Application memory, accessed files, and file pointers are copied using copy-on-write
Kept in memory (no disk accesses); old checkpoints can be written to disk
Rolling back to a checkpoint too far in the past makes recovery take too long
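One way to picture the in-memory checkpoint history is a bounded queue that hands back the newest checkpoints first. This is a minimal sketch, not Rx's mechanism: `CheckpointStore` is a hypothetical name, `deepcopy` replaces copy-on-write, and the oldest checkpoints are simply dropped here instead of being written to disk.

```python
from collections import deque
from copy import deepcopy


class CheckpointStore:
    """Keeps only the most recent checkpoints in memory."""

    def __init__(self, max_in_memory=3):
        # deque(maxlen=...) silently discards the oldest entry when full.
        self._recent = deque(maxlen=max_in_memory)

    def take(self, label, state):
        """Snapshot the state so later mutations do not affect it."""
        self._recent.append((label, deepcopy(state)))

    def rollback_candidates(self):
        """Return (label, state) pairs, newest first."""
        return [(label, deepcopy(s)) for label, s in reversed(self._recent)]
```

Trying the newest checkpoint first keeps re-execution short; older ones are a fallback when the fault was triggered earlier.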

16 Rx Design: Checkpoint and Rollback
Based on previous work, Flashback (2004)
Because Rx doesn’t require determinism, it avoids overhead

17 Rx Design: Environment Wrappers
Carry out the environment changes during re-execution
Memory Wrapper
  Intercepts memory library calls (malloc, free, etc.)
  Supports 4 environmental changes
    Delaying free
    Padding buffers
    Allocation isolation
    Zero-filling
  Safe, no changes to API
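To make three of the memory-wrapper changes concrete, here is a toy simulation. Everything in it is an assumption for illustration: `ToyHeap` is a bump allocator over a `bytearray` standing in for the C heap, `MemoryWrapper` and its flags are hypothetical names, and allocation isolation is left out.

```python
class ToyHeap:
    """Bump allocator over a bytearray; stands in for malloc/free."""

    def __init__(self, size=256):
        self.mem = bytearray([0xAA]) * size   # stale, non-zero contents
        self.top = 0
        self.freed = []

    def malloc(self, n):
        if self.top + n > len(self.mem):
            raise MemoryError("toy heap exhausted")
        addr = self.top
        self.top += n
        return addr

    def free(self, addr):
        self.freed.append(addr)               # address may now be reused


class MemoryWrapper:
    """Same malloc/free interface, with optional environment changes."""

    def __init__(self, heap, pad=0, zero_fill=False, delay_free=False):
        self.heap = heap
        self.pad = pad
        self.zero_fill = zero_fill
        self.delay_free = delay_free
        self.deferred = []                    # frees held back for now

    def malloc(self, n):
        total = n + self.pad                  # padding absorbs small overflows
        addr = self.heap.malloc(total)
        if self.zero_fill:                    # masks uninitialized reads
            self.heap.mem[addr:addr + total] = bytes(total)
        return addr

    def free(self, addr):
        if self.delay_free:                   # masks dangling-pointer use
            self.deferred.append(addr)
        else:
            self.heap.free(addr)
```

The caller sees the same two functions in every configuration, which is the sense in which the changes leave the API untouched.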

18 Rx Design: Environment Wrappers
Message Wrapper
  Implemented in the proxy, controls message ordering
  Changes include
    Shuffling requests
    Randomized packet sizes
  Helps avoid non-deterministic bugs
  No change to execution – server should not expect any ordering or size
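The two message-level changes can be sketched as pure functions over recorded traffic. The names `repacketize` and `shuffle_requests` and the `max_size` parameter are hypothetical, and the sketch assumes the requests being shuffled are independent of one another (for example, they come from different connections).

```python
import random


def repacketize(stream, rng, max_size=8):
    """Split a byte stream into packets of random sizes in [1, max_size]."""
    packets = []
    pos = 0
    while pos < len(stream):
        size = rng.randint(1, max_size)
        packets.append(stream[pos:pos + size])
        pos += size
    return packets


def shuffle_requests(requests, rng):
    """Return the same requests in a different, random order."""
    shuffled = list(requests)     # leave the recorded log untouched
    rng.shuffle(shuffled)
    return shuffled


rng = random.Random(42)           # seeded only to make runs repeatable
packets = repacketize(b"GET /index.html HTTP/1.1\r\n\r\n", rng)
reordered = shuffle_requests(["req-a", "req-b", "req-c"], rng)
```

Concatenating the packets gives back the original bytes, so a server that makes no assumptions about packet boundaries sees the same input.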

19 Rx Design: Environment Wrappers
Process Scheduling
  Change priority
Signal Delivery
  Signals are recorded and can be delivered randomly
Dropping User Requests
  Binary search to narrow down the possible bad user request and drop it
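The binary search for a bad request could look roughly like the sketch below. It assumes exactly one culprit request, that the full list is known to fail, and that a hypothetical oracle `fails(requests)` reports whether re-executing from the checkpoint with those requests still crashes; the slide does not give the actual procedure.

```python
def find_bad_request(requests, fails):
    """Return the index of the single request that triggers the failure.

    Precondition (assumed): fails(requests) is True and one request is
    responsible.
    """
    lo, hi = 0, len(requests)     # invariant: culprit is in requests[lo:hi]
    while hi - lo > 1:
        mid = (lo + hi) // 2
        # Re-execute with the candidates in requests[lo:mid] dropped.
        trial = requests[:lo] + requests[mid:]
        if fails(trial):
            lo = mid              # still fails: culprit is in [mid, hi)
        else:
            hi = mid              # dropping helped: culprit is in [lo, mid)
    return lo


log = ["GET /a", "GET /b", "EVIL", "GET /c", "GET /d"]
bad_index = find_bad_request(log, lambda reqs: "EVIL" in reqs)
```

Each probe costs one re-execution, so about log2(n) rollbacks are needed to isolate one request out of n.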

20 Rx Design: Proxy
Records and replays messages on re-execution
Simply forwards messages during normal execution
On recovery, the proxy
  Replays requests
  Carries out message-related environment changes
  Buffers incoming messages until failure recovery completes
  Keeps track of which requests received responses
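A stripped-down picture of the proxy's bookkeeping is sketched below. `RecordingProxy` and its methods are hypothetical, the server is modeled as a plain callable, and buffering of new requests during recovery is omitted to keep the example short.

```python
class RecordingProxy:
    """Forwards requests, logs them, and replays the log on recovery."""

    def __init__(self, server):
        self.server = server      # callable: request -> response
        self.log = []             # requests seen since the last checkpoint
        self.answered = set()     # log indexes whose response was sent

    def handle(self, request):
        """Normal execution: record the request, then forward it."""
        index = len(self.log)
        self.log.append(request)
        response = self.server(request)   # may raise if the server fails
        self.answered.add(index)
        return response

    def replay(self):
        """Recovery: re-send every logged request after a rollback.

        Returns only the responses clients have not received yet, keyed
        by log index, so nobody gets a duplicate reply.
        """
        fresh = {}
        for index, request in enumerate(self.log):
            response = self.server(request)
            if index not in self.answered:
                fresh[index] = response
                self.answered.add(index)
        return fresh
```

Tracking answered requests is what lets the replay rebuild server state without clients noticing that earlier requests were executed twice.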

21 Rx Design: Control Unit
Coordinates the other components and performs 3 functions:
  Directs checkpointing and requests rollbacks
  Diagnoses failures based on symptoms and experience, and chooses which changes to use
  Gives an information report for programmers
Keeps a failure table to judge how well each environmental change works, for future reference
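A failure table of this kind might be kept as per-symptom success counts that determine which change to try first next time. This is a guess at one workable design, not the one Rx uses: `FailureTable`, the symptom strings, and the smoothed success-rate score are all assumptions.

```python
from collections import defaultdict


class FailureTable:
    """Remembers how often each change cured each failure symptom."""

    def __init__(self, changes):
        self.changes = list(changes)
        # (symptom, change) -> [successes, attempts]
        self.stats = defaultdict(lambda: [0, 0])

    def record(self, symptom, change, survived):
        entry = self.stats[(symptom, change)]
        entry[1] += 1
        if survived:
            entry[0] += 1

    def ranked(self, symptom):
        """Changes ordered by smoothed success rate, best first."""
        def score(change):
            ok, tried = self.stats.get((symptom, change), (0, 0))
            return (ok + 1) / (tried + 2)   # untried changes score 0.5
        # sorted() is stable, so ties keep their configured order.
        return sorted(self.changes, key=score, reverse=True)


table = FailureTable(["delay_free", "pad_buffers", "zero_fill"])
table.record("SIGSEGV", "delay_free", survived=False)
table.record("SIGSEGV", "pad_buffers", survived=True)
order = table.ranked("SIGSEGV")
```

With the smoothing, a change that has never been tried ranks above one that has already failed for this symptom but below one that has worked.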

22 Rx Design
Multi-threaded process checkpointing
  Threads must be at user level before a checkpoint is taken, because of kernel locks and state issues
  A signal makes threads exit blocked calls so the checkpoint can be taken; Rx then retries the calls
  This method causes significant I/O problems, so the checkpoint interval cannot be set too short

23 Experimental Results
4 different sets of tests
  Surviving failures
  Performance overhead
  Malicious requests
  Learning from previous failures
Tested with 4 real applications
  Apache httpd – web server
  Squid – proxy server
  MySQL – database server
  CVS – version control server
6 bugs: data race, buffer overflow, uninitialized read, dangling pointer, stack overflow, double free

24 Experimental Results
Alternatives are a whole-program restart or a plain rollback-and-re-execute method
Rx provides availability and is faster than restart methods, except for very simple programs (CVS)
If the bug is deterministic, restarting will likely cause a crash again

25 Experimental Results

26 Experimental Results

27 Experimental Results

28 Future Work
Inter-server communication
  If Rx is on all systems, it can roll back any that it needs to when a failure occurs
  Coordinated checkpoints
Unavoidable bugs/failures
  Memory leaks – requires whole program restart
  Deadlocks
  Semantic bugs that have nothing to do with the environment
Undetectable bugs – need better sensors
Implement the proxy at the kernel level

29 Evaluation
Safe, fast recovery from certain bugs, but not all bugs
Masks failures from users, provides availability
Rx was only tested on I/O-bound applications; overhead may be larger for computation-bound applications

30 Discussion
Questions?

