
Joseph Tucek, Shan Lu, Chengdu Huang, Spiros Xanthos and Yuanyuan Zhou University of Illinois at Urbana Champaign Triage: Diagnosing Production Run Failures.


1 Joseph Tucek, Shan Lu, Chengdu Huang, Spiros Xanthos and Yuanyuan Zhou University of Illinois at Urbana-Champaign Triage: Diagnosing Production Run Failures at the User's Site

2 Motivation Software failures are a major contributor to system downtime. Security holes. Software has grown in size, complexity and cost. Software testing has become more difficult. Software packages inevitably contain bugs (even production ones).

3 Motivation Result: Software failures during production runs at the user's site. One solution: Offsite software diagnosis. Drawbacks: Difficult to reproduce failure-triggering conditions. Cannot provide timely online recovery (e.g. from fast Internet worms). Programmers cannot be provided to every site. Privacy concerns.

4 Goal: automatically diagnosing software failures occurring at end-user site production runs. Understand a failure that has happened. Find the root causes. Minimize manual debugging.

5 Current state of the art Offsite diagnosis and primitive onsite diagnosis: Interactive debuggers. Program slicing. Core dump analysis (partial execution path construction). Large overhead makes it impractical for production sites. Unprocessed failure information collections. Deterministic replay tools. All require manual analysis. Privacy concerns.

6 Onsite Diagnosis Efficiently reproduce the occurred failure (i.e. fast and automatically). Impose little overhead during normal execution. Require no human involvement. Require no prior knowledge.

7 Triage Capturing the failure point and conducting just-in-time failure diagnosis with checkpoint-reexecution. Delta Generation and Delta Analysis. Automated top-down human-like software failure diagnosis protocol. Reports: Failure nature and type. Failure-triggering conditions. Failure-related code/variable and the fault propagation chain.

8 Triage Architecture 3 groups of components: 1. Runtime Group. 2. Control Group. 3. Analysis Group.

9 Checkpoint & Reexecution Uses Rx (Previous work by authors). Rx checkpointing: Use fork()-like operations. Keeps a copy of accessed files and file pointers. Record messages using a network proxy. Replay may be potentially modified.
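The checkpoint-and-rollback idea can be illustrated with a minimal sketch. It is not Rx: `copy.deepcopy` on an in-memory object stands in for the fork()-like snapshot, files and network messages are ignored, and the class name `CheckpointStore` and the `keep` parameter are hypothetical.

```python
import copy
from collections import deque


class CheckpointStore:
    """Keeps the most recent `keep` snapshots of an in-memory state object."""

    def __init__(self, keep=20):
        # A deque with maxlen silently drops the oldest snapshot when full.
        self._snapshots = deque(maxlen=keep)

    def checkpoint(self, state):
        # deepcopy is an assumption standing in for a fork()-like snapshot.
        self._snapshots.append(copy.deepcopy(state))

    def rollback(self, steps_back=1):
        """Return a fresh copy of the snapshot taken `steps_back` checkpoints ago."""
        if not 1 <= steps_back <= len(self._snapshots):
            raise IndexError("no such checkpoint")
        return copy.deepcopy(self._snapshots[-steps_back])

    def __len__(self):
        return len(self._snapshots)
```

Returning a copy from `rollback` means a reexecution can modify the restored state without corrupting the stored checkpoint, so the same checkpoint can seed several different replays.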

10 Lightweight Monitoring for detecting failures Must not impose high overhead. Cheapest way: catch fault traps: assertions, access violations, divide by zero, and more. Extensions: branch histories, system call traces, etc. Triage only uses exceptions and assertions.

11 Control layer Implements the Triage Diagnosis protocol. Controls reexecutions with different inputs based on past results. Choice of analysis technique. Collects results and sends to off-site programmers.

12 Analysis Layer Techniques:

13 TDP: Triage Diagnosis Protocol Steps and example findings: Simple replay: deterministic bug. Coredump analysis: stack/heap OK; segmentation fault in strlen(). Dynamic bug detection: null-pointer dereference. Delta generation: collection of good and bad inputs. Delta analysis: code paths leading to the fault. Report.
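A minimal sketch of how a control layer might drive such a protocol, with each step seeing the findings of the steps before it. All names (`run_protocol`, the `failure` dictionary and its `replay` and `detector` callables) are hypothetical, and only two of the steps are stubbed in.

```python
def run_protocol(steps, failure):
    """Run diagnosis steps in order; each step sees the earlier findings."""
    report = {}
    for name, step in steps:
        finding = step(failure, report)
        if finding is None:  # step judged itself not applicable
            continue
        report[name] = finding
    return report


def simple_replay(failure, report):
    # Re-execute a few times; a True outcome means the replay failed again.
    outcomes = [failure["replay"]() for _ in range(3)]
    return "deterministic" if all(outcomes) else "non-deterministic"


def dynamic_bug_detection(failure, report):
    # In this sketch, heavier analysis only runs on reproducible failures.
    if report.get("simple_replay") != "deterministic":
        return None
    return failure["detector"]()


STEPS = [("simple_replay", simple_replay),
         ("dynamic_bug_detection", dynamic_bug_detection)]
```

Keeping the steps in a plain list makes it easy to reorder, omit, or add techniques without touching the driver.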

14 TDP: Triage Diagnosis Protocol Example report

15 Protocol extensions and variations Add different debugging techniques. Reorder diagnosis steps. Omit steps (e.g. memory checks for Java programs). Protocol may be custom-designed for specific applications. Try to fix bugs: Filter failure-triggering inputs. Dynamically delete code (risky). Change variable values. Automatic patch generation: future work?

16 Delta Generation Two goals: 1. Generate many similar replays: some that fail and some that don't. 2. Identify the signature of failure-triggering inputs. Signatures may be used for: Failure analysis and reproduction. Input filtering, e.g. Vigilante, Autograph, etc.

17 Delta Generation Changing the input: Replay previously stored client requests via proxy, trying different subsets and combinations. Isolate the bug-triggering part (data fuzzing). Find non-failing inputs with minimum distance from failing ones. Make protocol-aware changes. Use a normal form of the input, if the specific triggering portion is known. Changing the environment: Pad or zero-fill new allocations. Change message order. Drop messages. Manipulate thread scheduling. Modify the system environment. Make use of information from prior steps (e.g. target specific buffers).
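Trying subsets of the recorded requests can be sketched as a greedy, one-request-at-a-time reduction. This is a simplified stand-in, not Triage's actual procedure; `minimize_failing_input` and the `fails` callback (which replays a request list and reports whether the failure recurs) are hypothetical.

```python
def minimize_failing_input(requests, fails):
    """Greedily drop requests while the replay still fails."""
    current = list(requests)
    i = 0
    while i < len(current):
        candidate = current[:i] + current[i + 1:]
        if fails(candidate):
            current = candidate  # request i was irrelevant, drop it
        else:
            i += 1               # request i is needed to trigger the failure
    return current
```

Every rejected `candidate` is also useful: it is a non-failing input that differs from a failing one by a single request, which is the kind of closely matched pair the later analysis wants.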

18 Delta Generation Results passed to the next stage: Break code into basic blocks. For each replay, extract the block trace and a vector of each block's execution count. The granularity can be changed.

19 Example revisited Good run: Trace: AHIKBDEFEF…EG. Block vector: {A:1,B:1,D:1,E:11,F:10,G:1,H:1,I:1,K:1}. Bad run: Trace: AHIJBCDE. Block vector: {A:1,B:1,C:1,D:1,E:1,H:1,I:1,J:1}.
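A block vector is just a per-block count over the trace, as this short sketch shows. The expansion of the elided good trace into ten `EF` pairs is an assumption chosen to match the slide's counts (E:11, F:10); the function name `block_vector` is hypothetical.

```python
from collections import Counter


def block_vector(trace):
    """Count how many times each basic block appears in an execution trace."""
    return dict(Counter(trace))


# Assumed expansion of "AHIKBDEFEF…EG": ten EF pairs, then a final E and G.
good_trace = "AHIKBD" + "EF" * 10 + "EG"
bad_trace = "AHIJBCDE"

good_vector = block_vector(good_trace)
bad_vector = block_vector(bad_trace)
```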

20 Delta Analysis Follows three steps: 1. Basic Block Vector (BBV) Comparison: Find a pair of most similar failing and non-failing replays F and S. 2. Path comparison: Compare the execution path of F and S. 3. Intersection with backward slice: Find the difference that contributes to the failure.

21 Delta Analysis: BBV Comparison The number of times each block is executed is recorded using instrumentation. Calculate the Manhattan distance between every pair of failing and non-failing replays (the minimum requirement can be relaxed to settle for a similar pair). In the example, the difference is {C:-1,E:10,F:10,G:1,J:-1,K:1}, giving a Manhattan distance of 24.
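The distance of 24 can be checked with a short sketch that counts blocks directly from the two example traces. The helper names are hypothetical, and the good trace is again expanded on the assumption that the elided part is ten `EF` pairs.

```python
from collections import Counter


def manhattan(v1, v2):
    """Sum of absolute per-block count differences between two vectors."""
    blocks = set(v1) | set(v2)
    return sum(abs(v1.get(b, 0) - v2.get(b, 0)) for b in blocks)


def most_similar_pair(failing_traces, passing_traces):
    """Return the (failing, passing) trace pair with the smallest distance."""
    pairs = ((f, s) for f in failing_traces for s in passing_traces)
    return min(pairs, key=lambda p: manhattan(Counter(p[0]), Counter(p[1])))


good_trace = "AHIKBD" + "EF" * 10 + "EG"  # assumed expansion of the elided trace
bad_trace = "AHIJBCDE"
distance = manhattan(Counter(good_trace), Counter(bad_trace))
```

Comparing every failing replay against every passing one is quadratic in the number of replays, which is one reason a "similar enough" pair may be acceptable in place of the true minimum.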

22 Delta Analysis: Path Comparison Consider execution order. Find where the failing and non-failing runs diverge. Compute: Minimum Edit Distance i.e. the minimum number of insertion, deletion, and substitution operations needed to transform one to the other. Example:
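A minimal sketch of the two computations, using the classic quadratic dynamic-programming edit distance rather than the O(ND) algorithm mentioned later in the talk. The function names are hypothetical, and traces are modeled as strings of block labels.

```python
def edit_distance(a, b):
    """Minimum insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distance from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # delete x
                            curr[j - 1] + 1,           # insert y
                            prev[j - 1] + (x != y)))   # substitute or match
        prev = curr
    return prev[-1]


def first_divergence(a, b):
    """Index of the first position where two traces differ.

    Returns the shorter length if one trace is a prefix of the other.
    """
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return min(len(a), len(b))
```

For the example traces, `first_divergence("AHIKBD", "AHIJBCDE")` is 3: both runs execute A, H, I and then take different blocks (K versus J).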

23 Delta Analysis: Backward Slicing Want to eliminate differences that have no effect on the failure. Dynamic backward slicing: extracts a program slice consisting of all and only those instructions that lead to a given instruction's execution. The starting point may be supplied by earlier steps of the protocol. Overhead is acceptable in post-hoc analysis. Optimization: Dynamically build dependencies during replays. Experiments show that overhead is acceptably low.
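A simplified sketch of dynamic backward slicing over a recorded trace, followed by the intersection with the blocks that differ between runs. It tracks data dependences only and ignores control dependences; the trace format (statement id, defined variable, used variables) and the function name are assumptions made for illustration.

```python
def backward_slice(trace, start_index):
    """Statements that the statement at start_index transitively depends on.

    trace: list of (stmt, defined_var, used_vars) in execution order.
    Only data dependences are followed in this sketch.
    """
    start_stmt, _, start_uses = trace[start_index]
    relevant = set(start_uses)  # variables whose definitions we still need
    in_slice = {start_stmt}
    for stmt, defined, used in reversed(trace[:start_index]):
        if defined in relevant:
            in_slice.add(stmt)
            relevant.discard(defined)  # this definition satisfies the use
            relevant.update(used)      # now its own inputs matter
    return in_slice


trace = [
    ("s1", "a", []),
    ("s2", "b", []),
    ("s3", "c", ["a"]),
    ("s4", "d", ["b"]),
    ("s5", None, ["c"]),  # failing statement reads c
]
failure_slice = backward_slice(trace, 4)
differing = {"s3", "s4"}  # hypothetical output of path comparison
suspects = differing & failure_slice
```

Here `s4` differs between the runs but does not feed the failing statement, so the intersection leaves only `s3` as a suspect.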

24 Backward Slicing and result Intersection

25 Limitations and Extensions Need to define a privacy policy for the results sent to programmers. Very limited success with patch generation. Does not handle memory leaks well. Failure must occur. Does not handle incorrect operation. Difficult to reproduce bugs that take a long time to manifest. No support for deterministic replay on multi-processor architectures. False positives.

26 Evaluation Methodology Experimented with 10 real software failures in 9 applications. Triage is implemented on Linux (2.4.22). Hardware: 2.4 GHz Pentium 4, 512 KB L2 cache, 1 GB memory and 100 Mbps Ethernet. Triage checkpoints every 200 ms and keeps 20 checkpoints. User study: 15 programmers were given 5 bugs and Triage's report for some of the bugs. Compared the time to locate the bug with and without the report.

27 Bugs used for Evaluation
Name | Program | App description | #LOC | Bug type | Root cause description
Apache1 | apache | A web server | 114K | Stack smash | Long alias match pattern overflows a local array
Apache2 | apache | A web server | 102K | Semantic (NULL ptr) | Missing certain part of URL causes NULL pointer dereference
CVS | cvs | GNU version control server | 115K | Double free | Error-handling code placed in the wrong order leads to double free
MySQL | mysql | A database server | 1028K | Data race | Database logging error in case of data race
Squid | squid-2.3 | A web proxy cache server | 94K | Heap buffer overflow | Buffer length calculation misses special character cases
BC | bc-1.06 | Interactive algebraic language | 17K | Heap buffer overflow | Using wrong variable in for-loop end-condition
Linux | linux-extract | Extracted from linux | K | Semantic (copy-paste error) | Forget-to-change variable identifier due to copy-paste
MAN | man-1.5h1 | Documentation tools | 4.7K | Global buffer overflow | Wrong for-loop end-condition
NCOMP | ncompress-1.2.4 | File (de)compression | 1.9K | Stack smash | Fixed-length array cannot hold long input file name
TAR | tar | GNU tar archive tool | 27K | Semantic (NULL ptr) | Directory property corner case is not well handled

28 Experimental Results No input testing

29 Experimental Results For application bugs, delta generation only worked for BC and TAR. In all cases Triage correctly diagnoses the nature of the bug (deterministic or non-deterministic). In all 6 applicable cases Triage correctly pinpoints the bug type, buggy instruction, and memory location. When delta analysis is applied, it reduces the amount of data to be considered by 63% on average (best: 98%, worst: 12%). For MySQL, it finds an example interleaving pair as a trigger.

30 Case Study 1: Apache Failure at ap_gregsub. Bug detector catches a stack smash in lmatcher. How can lmatcher affect try_alias_list? Stack smash overwrites the stack frame above it, invalidating r. Trace shows how lmatcher is called by try_alias_list. Failure is independent of the headers. Failure is triggered by requests for a specific resource.

31 Case Study 2: Squid Coredump analysis suggests a heap overflow. Happens at strcat of two buffers. Fault propagation shows how buffers were allocated. t has strlen(usr) while the other buffer has strlen(user)*3. Input testing gives failure-triggering input. Gives minimally different non-failing inputs.

32 Efficiency and Overhead Normal execution overhead: Negligible effect caused by checkpointing. In no case over 5%. With 400 ms checkpointing intervals, overhead is 0.1%.

33 Efficiency and Overhead Diagnosis Efficiency: Except for Delta Analysis, all steps are efficient. All (other) diagnostic steps finish within 5 minutes. Delta analysis time is governed by the Edit Distance D in the O(ND) computation (N – number of blocks). Comparison step of Delta Analysis may run in the background.

34 User Study Real bugs: On average, programmers took 44.6% less time debugging using Triage reports. Toy bugs: On average, programmers took 18.4% less time debugging using Triage reports.

35 Questions?
