Diagnosing and Fixing Concurrency Bugs Credits to Dr. Guoliang Jin, Computer Science, NC STATE Presented by Tao Wang.


1 Diagnosing and Fixing Concurrency Bugs Credits to Dr. Guoliang Jin, Computer Science, NC STATE Presented by Tao Wang

2 We need reliable software
- People’s daily lives now depend on reliable software
- Software companies spend heavily on debugging
- More than 50% of effort goes to finding and fixing bugs
- Around $300 billion per year

3 Concurrency bugs hurt
- It is an increasingly parallel world
- Concurrency bugs in history

4 Multi-threaded program
- Concurrent programs under the shared-memory model
- Programs execute multiple interacting threads in parallel
- Threads communicate via shared memory
- Shared-memory accesses should be well synchronized
[Figure: multicore chip with four cores (core1 to core4), each with its own cache and running one thread (thread1 to thread4), all connected to shared memory]

5 An example of a concurrency bug
Thread 1: if (ptr != NULL) { ptr->field = 1; }
Thread 2: ptr = NULL;
- The interleaving space is huge; bad interleavings are a small part of it
- Previous research focuses on finding the bad interleavings
- Bad interleaving here: Thread 2 executes ptr = NULL; after Thread 1’s check but before ptr->field = 1;, which causes a segmentation fault

6 Bug fixing
- Software quality does not improve until bugs are fixed
- Manual concurrency-bug fixing is
  - time-consuming: 73 days on average
  - error-prone: 39% of patches are buggy in their first release
- CFix: automated concurrency-bug fixing [PLDI’11*, OSDI’12]
  - A program behaves correctly if bad interleavings do not occur
  - Fix concurrency bugs by disabling bad interleavings
*SIGPLAN: “one of the first papers to attack the problem of automated bug fixing”

7 The interleaving space (again)
[Figure: the huge interleaving space, with labels “Bad interleavings”, “Disabled”, and “lead to production-run failures”]

8 Failure diagnosis
- Failures still happen in production runs
- The reason behind a failure needs to be understood
  - Tools dealing with production runs demand low overhead
  - Diagnostic information needs to be informative
- Production-run concurrency-bug failure diagnosis
  - Design new monitoring schemes and sampling strategies
  - CCI: a pure software solution [OOPSLA’10]
  - PBI, LXR: hardware-assisted solutions [ASPLOS’13 & 14]

9 My work on concurrency bugs
- Bug detection and software testing: ConSeq [ASPLOS’11]
- Production-run failure diagnosis: CCI/PBI/LXR [OOPSLA’10, ASPLOS’13 & 14]
- Automated concurrency-bug fixing: CFix [PLDI’11*, OSDI’12]
*Received a SIGPLAN CACM nomination

10 Outline
- Motivation and Overview
- Automated Concurrency-Bug Fixing
  - The problem and idea
  - Overview
  - Internals of CFix
  - Evaluation and summary

11 Automated fixing is difficult
- What is the correct behavior?
  - Usually requires developers’ knowledge
- How to get the correct behavior?
  - Correct program states under bug-triggering inputs
  - No change to program states under other inputs
[Figure: bug description (symptom, triggering condition, ...) -> ? -> patch (correctness, performance, simplicity)]

12 CFix’s insights
- What is the correct behavior?
  - The program state is correct as long as the buggy interleaving does not occur
- How to get the correct behavior?
  - Only need to disable failure-inducing interleavings
  - Can leverage well-defined synchronization operations
[Figure: bug description (symptom, triggering condition, ...) -> ? -> patch (correctness, performance, simplicity)]

13 How to get a general solution that generates good patches?
- Input description: interleavings that lead to software failure, as reported by
  - atomicity violation detectors: Park ASPLOS’09, Flanagan POPL’04, Lu ASPLOS’06, Chew EuroSys’10
  - order violation detectors: Zhang ASPLOS’10, Lucia MICRO’09, Yu ISCA’09, Gao ASPLOS’11
  - data race detectors: Sen PLDI’08, Savage TOCS’97, Yu SOSP’05, Erickson OSDI’10, Kasikci ASPLOS’10
  - abnormal data flow detectors: Zhang ASPLOS’11, Shi OOPSLA’10
- Output patch: correctness, performance, simplicity
[Figure: report notations p, c, r; A, B; Wb, R, Wg; I1, I2]

14 CFix
[Figure: CFix takes source code and bug reports (a description of the interleavings that lead to software failure) through Fix-Strategy Design, Synchronization Enforcement (mutual exclusion, order), Patch Testing & Selection (patched binaries to a selected binary), Patch Merging (merged binary), and Run-time Support, and produces a final patched binary. Patch goals: correctness, performance, simplicity]

15 Fix-strategy design: what to fix
Challenges:
- Huge variety of bugs
[Figure: CFix pipeline stages]

16 Two types of concurrency bugs
- Atomicity violation
- Order violation
- Why these two?
  - Real-world concurrency bug characteristics study [SHAN ASPLOS’08]: 97% are either atomicity violations or order violations
  - Either can be fixed by mutual exclusion or order enforcement

17 Fix-strategy design: how to fix
Challenges:
- Inaccurate root cause
[Figure: CFix pipeline stages]

18 Atomicity violation
Thread 1: if (ptr != NULL) { ptr->field = 1; }
Thread 2: ptr = NULL;
[Figure labels: P, C, R]

19 Fix strategy for atomicity violation
Thread 1: if (ptr != NULL) { ptr->field = 1; }
Thread 2: ptr = NULL;

20 CFix: fix-strategy design
Challenges:
- Inaccurate root cause
- Huge variety of bugs
Solution:
- A combination of mutual exclusion and order relationship enforcement
[Figure: CFix pipeline stages]

21 Fix-strategies overview
[Figure: detector outputs feeding the fix strategies: AV detector (p, c, r), OV detector (A, B), race detector (I1, I2), DU detector (Wb, R, Wg)]

22 CFix: synchronization enforcement
Challenges:
- Correctness
- Performance
- Simplicity
Solution:
- Mutual exclusion enforcement: AFix [PLDI’11]
- Order relationship enforcement: OFix [OSDI’12]
[Figure: CFix pipeline stages]

23 Atomicity-violation fixing
- Input: three statements (p, c, r) with contexts
- Idea: make the code region from p to c mutually exclusive with r
Thread 1: if (ptr != NULL) /* p */ { ptr->field = 1; /* c */ }
Thread 2: ptr = NULL; /* r */

24 Mutual exclusion enforcement: AFix
- Approach: lock
- Goals:
  - Correctness: paired lock acquisition and release operations
  - Performance: make the critical section as small as possible
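A minimal sketch of what a lock-based patch of this kind could look like for the ptr example, written with POSIX threads. The names afix_lock, writer, resetter, and run_once are hypothetical and chosen for illustration; this is not the code CFix generates.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct node { int field; };

static struct node storage;
static struct node *ptr = &storage;

/* Hypothetical lock that a mutual-exclusion patch would introduce. */
static pthread_mutex_t afix_lock = PTHREAD_MUTEX_INITIALIZER;

/* Thread 1: the check (p) and the dereference (c) share one critical section. */
static void *writer(void *arg) {
    (void)arg;
    pthread_mutex_lock(&afix_lock);
    if (ptr != NULL) {   /* p */
        ptr->field = 1;  /* c */
    }
    pthread_mutex_unlock(&afix_lock);
    return NULL;
}

/* Thread 2: the remote write (r) takes the same lock, so it cannot run between p and c. */
static void *resetter(void *arg) {
    (void)arg;
    pthread_mutex_lock(&afix_lock);
    ptr = NULL;          /* r */
    pthread_mutex_unlock(&afix_lock);
    return NULL;
}

/* Runs both threads once; returns 0 if both were created and joined. */
int run_once(void) {
    pthread_t t1, t2;
    storage.field = 0;
    ptr = &storage;
    if (pthread_create(&t1, NULL, writer, NULL) != 0) return -1;
    if (pthread_create(&t2, NULL, resetter, NULL) != 0) {
        pthread_join(t1, NULL);
        return -1;
    }
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    assert(ptr == NULL);  /* r always executes, in either order */
    return 0;
}
```

Either order of the two critical sections is acceptable: the field is written or it is not, but the dereference can no longer observe a pointer that was cleared after the check.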

25 A naïve solution
- Add lock on edges reaching p
- Add unlock on edges leaving c
- Potential new bugs
  - Could lock without unlock
  - Could unlock without lock
  - etc.
[Figure: control-flow graphs containing p and c]

26 The AFix solution
- Assume p and c are in the same function f
- Step 1: find the protected nodes in the critical section
- Step 2: add lock operations
  - unprotected node -> protected node: add lock
  - protected node -> unprotected node: add unlock
- Avoids the potential bugs mentioned for the naïve solution
[Figure: control-flow graph containing p and c]

27 Subtle details
- Adjusting p and c when they are in different functions
  - Observation: people put lock and unlock in one function
  - Find the longest common prefix of p’s and c’s stack traces
  - Adjust p and c accordingly
- Put r into a critical section
  - Do nothing if r can be reached from the p-c critical section
- Lock type:
  - Lock with timeout: if the critical section has blocking operations
  - Reentrant lock: if recursion is possible within the critical section
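To illustrate the reentrant-lock case, here is a small sketch using a POSIX recursive mutex. The names region_lock, region_lock_init, and sum_to are hypothetical; the recursive function only stands in for a critical section that may re-enter itself.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t region_lock;

/* Initializes region_lock as a reentrant (recursive) mutex; returns 0 on success. */
int region_lock_init(void) {
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc != 0) return rc;
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    if (rc == 0) rc = pthread_mutex_init(&region_lock, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}

/* Hypothetical recursive function whose body lies inside the critical section. */
int sum_to(int n) {
    int total;
    assert(n >= 0);
    pthread_mutex_lock(&region_lock);    /* re-acquired on every recursive call */
    total = (n == 0) ? 0 : n + sum_to(n - 1);
    pthread_mutex_unlock(&region_lock);  /* one unlock per lock */
    return total;
}
```

With a normal (non-recursive) mutex the nested lock call in sum_to would deadlock the calling thread. For the timeout case, POSIX offers pthread_mutex_timedlock, which takes an absolute deadline.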

28 OFix: two order relationships
- allA-B
- firstA-B
[Figure: order relationships among operations labeled initialization, read, use, and destroy; instances A1 ... An shown relative to B for firstA-B and allA-B]

29 OFix allA-B enforcement
- Approach: condition variable and flag
  - Insert signal operations in A-threads
  - Insert a wait operation before B
- Rules
  - An A-thread signals exactly once, when it will not execute any more A
  - An A-thread signals as soon as possible
  - B proceeds when each A-thread has signaled

30 OFix allA-B enforcement: A side
How to identify the last A instance in one thread
...; for (...) ... ; // A
...;
- Each thread that executes A signals exactly once, as soon as it can execute no more A

31 OFix allA-B enforcement: A side
How to identify the last thread that executes A
- Counter for signal threads: initialized to 1, incremented (++) at each thread_create
void main() { for (...) thread_create(thr_main); ...; }
void thr_main() { for (...) ... ; // A
...; }
void ofix_signal() { mutex_lock(L); [counter]--; if ([counter] == 0) cond_broadcast(con); mutex_unlock(L); }

32 OFix allA-B enforcement: B side
- B is safe to execute only when the counter is 0
- Give up if OFix knows that it introduces a new deadlock
- Timed wait operation to mask potential deadlocks
void ofix_wait() { mutex_lock(L); if ([counter] != 0) cond_timedwait(con, L, t); mutex_unlock(L); }
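The A side and B side can be put together in a compilable sketch with POSIX threads. This is an illustration under assumptions, not OFix's generated code: pending is a hypothetical name for the counter, ofix_register, slots, and run_demo are invented for the example, and the wait uses a while loop (to tolerate spurious wakeups) where the slide shows a single if.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <time.h>

#define MAX_THREADS 8

static pthread_mutex_t L = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t con = PTHREAD_COND_INITIALIZER;
static int pending;             /* hypothetical name for the counter */
static int slots[MAX_THREADS];  /* data written by A, read by B */

/* Called before each thread creation: one more thread may still execute A. */
static void ofix_register(void) {
    pthread_mutex_lock(&L);
    ++pending;
    pthread_mutex_unlock(&L);
}

/* Called once per thread when it can execute no more A. */
static void ofix_signal(void) {
    pthread_mutex_lock(&L);
    if (--pending == 0) pthread_cond_broadcast(&con);
    pthread_mutex_unlock(&L);
}

/* Called before B; returns 0 once every thread has signaled, nonzero on timeout or error. */
static int ofix_wait(int timeout_sec) {
    struct timespec deadline;
    int rc = 0;
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_sec;
    pthread_mutex_lock(&L);
    while (pending != 0 && rc == 0)  /* loop guards against spurious wakeups */
        rc = pthread_cond_timedwait(&con, &L, &deadline);
    pthread_mutex_unlock(&L);
    return rc;
}

static void *thr_main(void *arg) {
    *(int *)arg = 1;  /* A */
    ofix_signal();    /* this thread executes no more A */
    return NULL;
}

/* Creates n A-threads, waits for all of them, then runs B. Returns B's result or -1. */
int run_demo(int n) {
    pthread_t tids[MAX_THREADS];
    int created = 0, sum = 0, rc, i;
    assert(n >= 0 && n <= MAX_THREADS);
    pending = 1;                   /* the creating thread counts as one signaler */
    for (i = 0; i < n; i++) {
        slots[i] = 0;
        ofix_register();
        if (pthread_create(&tids[i], NULL, thr_main, &slots[i]) != 0) {
            ofix_signal();         /* undo the registration */
            break;
        }
        created++;
    }
    ofix_signal();                 /* no more A-threads will be created */
    rc = ofix_wait(5);
    if (rc == 0)
        for (i = 0; i < created; i++) sum += slots[i];  /* B */
    for (i = 0; i < created; i++) pthread_join(tids[i], NULL);
    return rc == 0 ? sum : -1;
}
```

Starting the counter at 1 keeps it from reaching 0 while the creating thread may still spawn more A-threads; that thread gives up its own count only after the creation loop.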

33 OFix firstA-B
- Basic enforcement
- When A may not execute
  - Add a safety net of signal with the allA-B algorithm

34 CFix: patch testing & selection
Challenge:
- Multi-threaded software testing
Solution:
- CFix-patch oriented testing
[Figure: CFix pipeline stages]

35 Patch testing principles
- Two ideas:
  - No exhaustive testing, but patch-oriented testing
  - Leverage existing testing techniques, with extra heuristics
- The workflow
  - Step 1: prune incorrect patches (patches causing failures due to wrong fix strategies, etc.)
  - Step 2: prune slow patches
  - Step 3: prune complicated patches

36 Run once without external perturbation
- Reject if there is a time-out or failure
- Patches fixing the wrong root cause
  - Make software fail deterministically
Thread 1: ptr->field = 1;
Thread 2: ptr = NULL;

37 Implicit bad patch
- A failure in patch_b implies a failure in patch_a
  - If patch_a is less restrictive than patch_b
- Helpful to prune patch_a
  - Traditional testing may not find the failure in patch_a
[Figure: patches a, b, c, with labels “Mutual Exclusion” and “Order Relationships”]

38 CFix: patch merging
Challenge:
- One single programming mistake usually leads to multiple bug reports
Solution:
- Heuristics to merge patches
[Figure: CFix pipeline stages]

39 An example with multiple reports
void buf_write() { int tmp = buf_len + str_len; if (tmp > MAX) return; memcpy(&buf[buf_len], str, str_len); buf_len = tmp; }
- Two reports on this function: (p1, c1, r1) and (p2, c2, r2)
- Patching each report separately leads to
  - Too many lock/unlock operations
  - Potential new deadlocks
  - May hurt performance and simplicity

40 Related patch: a case of AFix
- Merge if p, c, or r is in some other patch’s critical section
Before merging:
  lock(L1) p1 lock(L2) p2 c1 unlock(L1) c2 unlock(L2)
  lock(L1) r1 unlock(L1)
  lock(L2) r2 unlock(L2)
After merging:
  lock(L1) p1 p2 c1 c2 unlock(L1)
  lock(L1) r2 unlock(L1)

41 The merged patch for the example
void buf_write() { int tmp = buf_len + str_len; if (tmp > MAX) { return; } memcpy(&buf[buf_len], str, str_len); buf_len = tmp; }
[Figure labels on the statements of buf_write: p1; c1, p2; c2, r1, r2]
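A sketch of how a merged patch for buf_write might look once both reports share one lock. Several details are assumptions made so the example compiles and can be checked: the parameters, the int return value, the value of MAX, and the lock name L1 protecting the whole function body.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

#define MAX 64  /* assumed capacity, for illustration only */

static char buf[MAX];
static int buf_len = 0;
static pthread_mutex_t L1 = PTHREAD_MUTEX_INITIALIZER;  /* the single merged lock */

/* Appends str_len bytes of str to buf; returns 0 on success, -1 if they do not fit. */
int buf_write(const char *str, int str_len) {
    int tmp;
    assert(str != NULL && str_len >= 0);
    pthread_mutex_lock(&L1);
    tmp = buf_len + str_len;
    if (tmp > MAX) {
        pthread_mutex_unlock(&L1);  /* release on the early-return path too */
        return -1;
    }
    memcpy(&buf[buf_len], str, (size_t)str_len);
    buf_len = tmp;
    pthread_mutex_unlock(&L1);
    return 0;
}
```

One lock and two unlock sites cover every read and write of buf_len in the function, and the early return shows why lock and unlock operations must be paired on every path.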

42 CFix: run-time support
- To understand whether there is a deadlock underlying a time-out
- Low overhead, and suitable for production runs
[Figure: CFix pipeline stages]

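43 Evaluation methodology
- Applications: PBZIP2, x264, FFT, HTTrack, Mozilla-1, transmission, ZSNES, Apache, MySQL-1, MySQL-2, Mozilla-2, Cherokee, Mozilla-3
- Detectors: AV detector, OV detector, RA detector, DU detector
[Table: applications by detector; cell contents not preserved in the transcript]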

44 Evaluation result
APP.          # of Ops
PBZIP2        5
x264          7
FFT           5
HTTrack       2
Mozilla-1     2
transmission  2
ZSNES         3
Apache        3
MySQL-1       5
MySQL-2       9
Mozilla-2     3
Cherokee      2
Mozilla-3     5

45 Summary
- Software reliability is critical
- Fixing concurrency bugs is costly and error-prone
- CFix uses some heuristics, with good results in practice
  - A combination of mutual exclusion and order enforcement
  - Use testing to select the best patch
  - Fix the root cause without requiring detectors to report it
  - Small overhead and good simplicity

46 Questions? Thank you

