Triage: Diagnosing Production Run Failures at the User's Site Joseph Tucek, Shan Lu, Chengdu Huang, Spiros Xanthos, and Yuanyuan Zhou Department of Computer.

1 Triage: Diagnosing Production Run Failures at the User's Site. Joseph Tucek, Shan Lu, Chengdu Huang, Spiros Xanthos, and Yuanyuan Zhou. Department of Computer Science, University of Illinois at Urbana-Champaign.

2 Joseph Tucek, CS-UIUC, Page 2. Despite all of our effort, production runs still fail. What do we do about these failures?

3 What is (currently) done about end-user failures? Dumps leave much manual effort to diagnose. We still need to reproduce the bug, and this is hard, if not impossible, to do.

4 Why on-site diagnosis of production run failures? Production run bugs are valuable: they were not caught in testing, they are potentially environment specific, and they are causing real damage to end users. We can't diagnose production failures off-site: reproduction is hard, the programmer doesn't have the end-user environment, and privacy concerns limit even the reports we do get. We must diagnose at the end-user's site.

5 What do we mean by diagnosis? Diagnosis traces back to the underlying fault. Core dumps tell you about the failure. Bug detection tells you about some errors. Existing diagnosis tools are offline.
Diagram: trigger -> fault (root cause, e.g. buggy line of code) -> error (incorrect state, e.g. smashed stack) -> failure (service interruption).

6 What do we need to perform diagnosis? (1) We need information about the failure: what is the fault, the error, the propagation tree? Off-site, we repeatedly inspect the bug (e.g. with a debugger) and run analysis tools targeted at the failure, or at suspected failures. Off-site techniques don't work on-site: reproducing the bug is non-trivial, we don't know what specific failures will occur, and existing analysis tools are too expensive.

7 What do we need to perform diagnosis? (2) We need guidance as to what to do next: what analysis should we perform, what is likely to work well, and what variables are interesting? Off-site, the programmer decides, based on past knowledge. On-site, there is no programmer, so any decisions as to action must be made automatically.

8 What do we need to perform diagnosis? (3) We need to try what-ifs with the execution: if we change this input, what happens? What if we skip this function? Off-site, programmers run many input variations, even with differing code. This is difficult on-site: most replay focuses on minimizing variance, and we can't understand what the results mean.

9 What does Triage contribute? It enables on-site diagnosis: it uses systems techniques to make offline analysis tools feasible on-site, and it addresses the three previous challenges. It allows a new technique, delta analysis. A human study with real programmers and real bugs shows large savings in time-to-fix.

10 Overview: Introduction; Addressing the three challenges; Diagnosis process & design; Experimental results (human study, overhead); Related work; Conclusions.

11 Getting information about the failure. Checkpoint/re-execution can capture the bug: the environment, input, memory state, etc., everything we need to reproduce the bug. Benefits: we can relive the failure over and over, and dynamically plug in analysis tools on demand. This makes the expensive cheap, and normal-run overhead is low too.

12 Guidance about what to do next. A human-like diagnosis protocol can guide the diagnosis process. Repeated replay lets us diagnose incrementally: based on past results, we can pick the next step. E.g. if the bug doesn't always repeat, we should look for races.
Stage  Goal
1      Failure/error type & location
2      Failure-triggering conditions
3      Fault-related code & variables
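One way to picture such a staged protocol is the minimal Python sketch below. It is an illustration only, not Triage's control unit: the `Findings` record, the `next_steps` function, and the assignment of particular analyses to stages are assumptions made for the example. Only the three stage goals and the "look for races if the bug doesn't always repeat" rule come from the slide.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Findings:
    """What earlier replays have established (hypothetical record)."""
    repeats_deterministically: bool = True
    completed: List[str] = field(default_factory=list)


def next_steps(f: Findings) -> List[str]:
    """Pick the analyses to attach to the next replay, one stage at a time."""
    # Stage 1: establish the failure/error type and location.
    if "stage1" not in f.completed:
        steps = ["core analysis", "bounds checking"]
        if not f.repeats_deterministically:
            # The slide's example: a bug that does not always repeat hints at a race.
            steps.append("race detection")
        return steps
    # Stage 2: find the failure-triggering conditions.
    if "stage2" not in f.completed:
        return ["delta generation", "delta analysis"]
    # Stage 3: narrow down the fault-related code and variables.
    if "stage3" not in f.completed:
        return ["dynamic backward slicing"]
    return []
```

Each call looks only at what previous replays found, which mirrors the incremental "based on past results, pick the next step" idea.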

13 Trying what-ifs with the execution. Flexible re-execution lets us play with what-ifs. There are three types of re-execution: plain (deterministic), loose (allow some variance), and wild (introduce potentially large variations). Delta analysis extracts how the executions differ.
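As a rough illustration of a what-if replay, the Python sketch below re-executes a toy program with one logged input dropped at a time and records which drops make the failure disappear. Everything here is hypothetical: `toy_server` stands in for the replayed program, and an exception stands in for the failure. A real system would replay from a checkpoint rather than call a function.

```python
from typing import Callable, List, Sequence


def fails(program: Callable[[Sequence[str]], None], inputs: Sequence[str]) -> bool:
    """Re-execute the program on the given inputs and report whether it fails."""
    try:
        program(inputs)
        return False
    except Exception:
        return True


def drop_one_input(program: Callable[[Sequence[str]], None],
                   logged_inputs: Sequence[str]) -> List[int]:
    """Return the indexes of inputs whose removal makes the failure disappear."""
    # Plain re-execution first: with unchanged inputs the failure should recur.
    if not fails(program, logged_inputs):
        return []
    culprits = []
    for i in range(len(logged_inputs)):
        # Varied re-execution: same run, minus input i.
        varied = list(logged_inputs[:i]) + list(logged_inputs[i + 1:])
        if not fails(program, varied):
            culprits.append(i)
    return culprits


def toy_server(requests: Sequence[str]) -> None:
    """Hypothetical stand-in for the replayed program."""
    for r in requests:
        if len(r) > 8:  # pretend long requests overflow a buffer
            raise RuntimeError("crash")


log = ["GET /", "GET /aaaaaaaaaaaa", "GET /b"]
trigger = drop_one_input(toy_server, log)  # indexes of failure-triggering inputs
```

Here only the second request triggers the toy crash, so `trigger` holds its index.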

14 Main idea of Triage. How to get information about the failure? Capture the bug with checkpoint/re-execution, then relive the bug with various diagnostic techniques. How to decide what to do? Use a human-like protocol to select analysis, and incrementally increase our understanding of the bug. How to try out what-if scenarios? Flexible re-execution allows varied executions, and delta analysis points out what makes them different.

15 Overview: Introduction; Addressing the three challenges; Diagnosis process & design; Experimental results (human study, overhead); Related work; Conclusions.

16 Triage architecture: Checkpointing Subsystem; Analysis Tools (e.g. backward slicing, bug detection); Control Unit (Protocol).

17 Triage vs. Rx. Both are in memory, and both support variations in execution. Triage has no output commit. Triage has no need for safety, so it can even skip code. Triage considers why the failure occurs and tries to analyze the failure.

18 Failure analysis & delta generation (stage 1 and 2).
Failure analysis tools (with the cost factor shown on the slide): bounds checking (1.1x), assertion checking (1x), happens-before (12x), atomicity detection (60x), static core analysis (1x), taint analysis (2x), dynamic slicing (1000x), symbolic execution (1000x), lockset analysis (20x).
Delta generation variations: rearrange allocation, drop inputs, mutate inputs, pad buffers, change file state, drop code, reschedule threads, change libraries, reorder messages.
The differences caused by variations are useful as well.

19 Delta analysis. Compute the basic block vector for each run.
Run 1 executes blocks A B C D E F G: {A:1 B:1 C:1 D:1 X:0 E:1 F:1 G:1 Y:0}
Run 2 executes blocks A B C X E G Y: {A:1 B:1 C:1 D:0 X:1 E:1 F:0 G:1 Y:1}
Blocks on which the two runs differ (over A B C D E F G X Y): {A:0 B:0 C:0 D:1 X:1 E:0 F:1 G:0 Y:1}
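The vectors on this slide can be reproduced with a few lines of Python. This is a sketch of the idea only: it treats a run as a sequence of single-letter block names and records whether each block was executed (1) or not (0), as the slide does. The function names are made up for the example, and a real implementation would obtain the trace through instrumentation.

```python
from typing import Dict, Iterable, List


def block_vector(trace: Iterable[str], blocks: List[str]) -> Dict[str, int]:
    """Map each basic block to 1 if the run executed it, else 0."""
    seen = set(trace)
    return {b: int(b in seen) for b in blocks}


def vector_diff(u: Dict[str, int], v: Dict[str, int]) -> Dict[str, int]:
    """Mark the blocks on which two runs disagree (element-wise XOR)."""
    return {b: u[b] ^ v[b] for b in u}


blocks = list("ABCDXEFGY")                # block ordering used on the slide
run1 = block_vector("ABCDEFG", blocks)    # first run from the slide
run2 = block_vector("ABCXEGY", blocks)    # second run from the slide
delta = vector_diff(run1, run2)           # 1 exactly for D, X, F and Y
```

The number of ones in `delta` gives a simple distance between the two runs, which is one way to decide which runs are "most similar".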

20 Delta analysis. From delta generation's many runs, Triage finds the most similar by comparing the basic block vectors. Triage will diff the two closest runs, computing the minimum edit distance, a.k.a. the shortest edit script.
Example: A B C D E F G versus A B C X E G Y (edit marks on the slide: - ^ V).
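The two steps on this slide can be sketched in Python as follows. The sketch makes two assumptions that the slide does not state: similarity is measured as the Hamming distance between 0/1 block vectors, and the pair being compared consists of one failing and one non-failing run. The edit script is computed with a textbook longest-common-subsequence table and uses only deletions and insertions. It is not Triage's implementation.

```python
from typing import List, Sequence, Tuple


def hamming(u: Sequence[int], v: Sequence[int]) -> int:
    """Number of positions at which two block vectors differ."""
    return sum(a != b for a, b in zip(u, v))


def closest_pair(failing: Sequence[Sequence[int]],
                 passing: Sequence[Sequence[int]]) -> Tuple[int, int]:
    """Indexes (i, j) of the failing/passing vectors that are most similar."""
    pairs = ((i, j) for i in range(len(failing)) for j in range(len(passing)))
    return min(pairs, key=lambda p: hamming(failing[p[0]], passing[p[1]]))


def edit_script(a: Sequence[str], b: Sequence[str]) -> List[Tuple[str, str]]:
    """Shortest delete/insert script that turns trace a into trace b."""
    n, m = len(a), len(b)
    # lcs[i][j] = length of the longest common subsequence of a[i:] and b[j:]
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                lcs[i][j] = lcs[i + 1][j + 1] + 1
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])
    script: List[Tuple[str, str]] = []
    i = j = 0
    while i < n and j < m:
        if a[i] == b[j]:
            i, j = i + 1, j + 1              # block common to both runs
        elif lcs[i + 1][j] >= lcs[i][j + 1]:
            script.append(("-", a[i]))       # block only in the first run
            i += 1
        else:
            script.append(("+", b[j]))       # block only in the second run
            j += 1
    script += [("-", x) for x in a[i:]]
    script += [("+", y) for y in b[j:]]
    return script


script = edit_script("ABCDEFG", "ABCXEGY")
# script == [("-", "D"), ("+", "X"), ("-", "F"), ("+", "Y")]
```

For the slide's example the script says that the second run skips D and F and executes X and Y instead, which is the execution difference a programmer would inspect.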

21 A bug in TAR. The slide shows two code excerpts, both cut off at the slide boundary.

    char *
    get_directory_contents (char *path, dev_t device)
    {
      struct accumulator *accumulator;

      /* Recursively scan the given PATH.  */
      {
        char *dirp = savedir (path);
        char const *entry;
        size_t entrylen;
        char *name_buffer;
        size_t name_buffer_size;
        size_t name_length;
        struct directory *directory;
        enum children children;

        if (! dirp)
          savedir_error (path);
        errno = 0;

        name_buffer_size = strlen (path) + NAME_FIELD_SIZE;
        name_buffer = xmalloc (name_buffer_size + 2);
        strcpy (name_buffer, path);
        if (! ISSLASH (path[strlen (path) - 1]))
          strcat (name_buffer, "/");
        name_length = strlen (name_buffer);

        directory = find_directory (path);
        children = directory ? directory->children : CHANGED_CHILDREN;

        accumulator = new_accumulator ();

        if (children != NO_CHILDREN)
          for (entry = dirp;
               (entrylen = strlen (entry)) != 0;
               entry += entrylen + 1)

    char *
    savedir (const char *dir)
    {
      DIR *dirp;
      struct dirent *dp;
      char *name_space;
      size_t allocated = NAME_SIZE_DEFAULT;
      size_t used = 0;
      int save_errno;

      dirp = opendir (dir);
      if (dirp == NULL)
        return NULL;

      name_space = xmalloc (allocated);

      errno = 0;
      while ((dp = readdir (dirp)) != NULL)
        {
          char const *entry = dp->d_name;
          if (entry[entry[0] != '.' ? 0 : entry[1] != '.' ? 1 : 2] != '\0')
            {
              size_t entry_size = strlen (entry) + 1;
              if (used + entry_size < used)
                xalloc_die ();
              if (allocated <= used + entry_size)
                {
                  do
                    {
                      if (2 * allocated < allocated)
                        xalloc_die ();
                      allocated *= 2;
                    }
                  while (allocated <= used + entry_size);

Annotations on the slide: segmentation fault; null pointer dereference; execution difference.

22 Sample Triage report.
Failure point: segfault in lib strlen; stack & heap OK.
Bug detection: deterministic bug; null pointer at incremen.c:207.
Fault propagation: dirp = opendir (dir); -> if (dirp == NULL) return NULL; -> dirp = savedir (path); -> entry = dirp; -> strlen(entry)

23 Results – Human study. We tested Triage with a human study of 15 programmers drawn from faculty, research programmers, and graduate students (no undergraduates!). We measured time to repair bugs, with and without Triage. Everybody got core dumps, sample inputs, instructions on how to replicate, and access to many debugging tools, including Valgrind. There were 3 simple toy bugs and 2 real bugs: the TAR bug you just saw, and a copy-paste error in BC.

24 Time to fix a bug. We hope that the report is easy to check. We cut out the reproduction step, which is quite unfair to Triage. Also, we put a time limit, and going over time is counted as the max time.
Without Triage: reproduce -> find failure … error … fault -> fix it.
With Triage: check Triage report -> fix it.

25 Results – Human study. For the real bugs, Triage strongly helps (47%). Better than 99.99% confidence that time with Triage < time without.

26 Results – Other bugs. Columns: Δ generation, Δ analysis, dynamic slicing.
Apache: input element, 12%, 8 instructions
Apache: input element, 69%, 3 instructions
CVS: --, 4 functions
MySQL: interleaving, --
Squid: 1 character, 71%, 6 instructions
BC: array padding, 98%, 3 instructions
Linux-ext: --, 6 instructions
MAN: --, 9 functions
NCOMP: --, 5 instructions
TAR: file perms, 68%, 6 instructions

27 Results – Normal run overhead. Identical to checkpoint system (Rx) overhead: under 5%.

28 Results – Diagnosis overhead. CPU bound is the worst case. Still reasonable because we're only redoing 200 ms. Delta analysis is somewhat costly and should be run in the background.

29 Related work.
Checkpointing & re-execution: Zap [Osman, OSDI02], TTVM [King, USENIX05].
Bug detection & diagnosis: Valgrind [Nethercote], CCured [Necula, POPL02], Purify [Hastings, USENIX92], Eraser [Savage, TOCS97], [Netzer, PPoPP91], backward slicing [Weiser, CACM82], and innumerable others.
Execution variation: input variation (delta debugging [Zeller, FSE02], fuzzing [B. So]); environment variation (Rx [Qin, SOSP05], DieHard [Berger, PLDI06]).

30 Conclusions & future work. On-site diagnosis can be made feasible: checkpointing can effectively capture the failure, expensive off-line analysis can be done on-site, and privacy issues are minimized. Triage is also useful for in-house testing, where it reduces the manual portion of analysis. Future work: automatic bug hot fixes and visualization of delta analysis.

31 Thank you. Questions? Special thanks to Hewlett-Packard for student scholarship support. This work was supported by NSF, DoE, and Intel.
