Diagnosing and Fixing Concurrency Bugs
Credits to Dr. Guoliang Jin, Computer Science, NC State
Presented by Tao Wang

Presentation transcript:

We need reliable software
• People’s daily lives now depend on reliable software
• Software companies spend substantial resources on debugging
  - More than 50% of effort goes to finding and fixing bugs
  - Around $300 billion per year

Concurrency bugs hurt
• It is an increasingly parallel world
• Concurrency bugs in history

Multi-threaded program
• Concurrent programs under the shared-memory model
  - Programs execute multiple interacting threads in parallel
  - Threads communicate via shared memory
  - Shared-memory accesses should be well-synchronized
[Figure: a multicore chip with four cores (core1 to core4), each with its own cache and running one thread (thread1 to thread4), all connected to shared memory]

An example of a concurrency bug
Thread 1:
    if (ptr != NULL) {
        ptr->field = 1;
    }
Thread 2:
    ptr = NULL;
• Bad interleaving: Thread 2 runs ptr = NULL between Thread 1's check and its dereference, which causes a segmentation fault
• Previous research focuses on finding bad interleavings
[Figure: the interleaving space, which is huge, with the bad interleavings marked inside it]

Bug fixing
• Software quality does not improve until bugs are fixed
• Manual concurrency-bug fixing is
  - time-consuming: 73 days on average
  - error-prone: 39% of patches are buggy in their first release
• CFix: automated concurrency-bug fixing [PLDI’11*, OSDI’12]
  - A program behaves correctly if bad interleavings do not occur
  - Fix concurrency bugs by disabling bad interleavings
* SIGPLAN: “one of the first papers to attack the problem of automated bug fixing”

The interleaving space (again)
[Figure: the huge interleaving space; some bad interleavings have been disabled, while others still lead to production-run failures]

Failure diagnosis
• Failures still happen in production runs
  - The reason behind a failure needs to be understood
  - Tools dealing with production runs demand low overhead
  - Diagnostic information needs to be informative
• Production-run concurrency-bug failure diagnosis
  - Design new monitoring schemes and sampling strategies
  - CCI: a pure software solution [OOPSLA’10]
  - PBI, LXR: hardware-assisted solutions [ASPLOS’13 & 14]

My work on concurrency bugs
• Bug detection and software testing: ConSeq [ASPLOS’11]
• Production-run failure diagnosis: CCI/PBI/LXR [OOPSLA’10, ASPLOS’13 & 14]
• Automated concurrency-bug fixing: CFix [PLDI’11*, OSDI’12]
* Received a SIGPLAN CACM nomination

Outline
• Motivation and Overview
• Automated Concurrency-Bug Fixing
  - The problem and idea
  - Overview
  - Internals of CFix
  - Evaluation and summary

 What is the correct behavior?  Usually requires developers’ knowledge  How to get the correct behavior?  Correct program states under bug-triggering inputs  No change to program states under other inputs Automated fixing is difficult 11 Description: Symptom Triggering condition … Description: Symptom Triggering condition … Patch: Correctness Performance Simplicity Patch: Correctness Performance Simplicity ? ?

 What is the correct behavior?  The program state is correct as long as the buggy interleaving does not occur  How to get the correct behavior?  Only need to disable failure-inducing interleavings  Can leverage well-defined synchronization operations CFix’ insights 12 Description: Symptom Triggering condition … Description: Symptom Triggering condition … Patch: Correctness Performance Simplicity Patch: Correctness Performance Simplicity ? ?

From bug descriptions to patches
• Description: interleavings that lead to software failure, as reported by
  - atomicity violation detectors [Park ASPLOS’09, Flanagan POPL’04, Lu ASPLOS’06, Chew EuroSys’10]
  - order violation detectors [Zhang ASPLOS’10, Lucia MICRO’09, Yu ISCA’09, Gao ASPLOS’11]
  - data race detectors [Sen PLDI’08, Savage TOCS’97, Yu SOSP’05, Erickson OSDI’10, Kasikci ASPLOS’10]
  - abnormal data flow detectors [Zhang ASPLOS’11, Shi OOPSLA’10]
• Patch: correctness, performance, simplicity
• How to get a general solution that generates good patches?
[Figure: the statement sets that make up the detectors' reports, labeled p, c, r; A, B; Wb, R, Wg; I1, I2]

CFix overview
• Input: source code and bug reports (descriptions of interleavings that lead to software failure)
• Fix-Strategy Design
• Synchronization Enforcement (mutual exclusion, order), which produces patched binaries
• Patch Testing & Selection, which produces selected binaries
• Patch Merging, which produces a merged binary
• Run-time Support, which produces the final patched binary
• Output: a patch with correctness, performance, and simplicity

Fix-strategy design: what to fix
Challenge:
• Huge variety of bugs

Two types of concurrency bugs
• Atomicity violation
• Order violation
• Why these two?
  - Real-world concurrency bug characteristics study [Shan Lu et al., ASPLOS’08]: 97% are either an atomicity violation or an order violation
  - Either can be fixed by mutual exclusion or order enforcement

Fix-strategy design: how to fix
Challenge:
• Inaccurate root cause

Atomicity violation
Thread 1:
    if (ptr != NULL) {      // p
        ptr->field = 1;     // c
    }
Thread 2:
    ptr = NULL;             // r

Fix strategy for atomicity violation
Thread 1:
    if (ptr != NULL) {
        ptr->field = 1;
    }
Thread 2:
    ptr = NULL;

CFix: fix-strategy design
Challenges:
• Inaccurate root cause
• Huge variety of bugs
Solution:
• A combination of mutual exclusion & order relationship enforcement

Fix-strategies overview
• AV detector: reports (p, c, r)
• OV detector: reports (A, B)
• Race detector: reports (I1, I2)
• DU detector: reports (Wb, R, Wg)

CFix: synchronization enforcement
Challenges:
• Correctness
• Performance
• Simplicity
Solution:
• Mutual exclusion enforcement: AFix [PLDI’11]
• Order relationship enforcement: OFix [OSDI’12]

Fixing atomicity violations
• Input: three statements (p, c, r) with contexts
• Idea: make the code region from p to c mutually exclusive with r
Thread 1:
    if (ptr != NULL) {      // p
        ptr->field = 1;     // c
    }
Thread 2:
    ptr = NULL;             // r

Mutual exclusion enforcement: AFix
• Approach: lock
• Goals:
  - Correctness: paired lock acquisition and release operations
  - Performance: make the critical section as small as possible
[Figure: statements p, c, and r]
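To make the approach concrete, here is a minimal sketch of what a lock-based patch for the running ptr example could look like with POSIX threads. It is written by hand for illustration and is not AFix output; `struct obj`, `the_obj`, `afix_lock`, and `run_patched_example` are assumed names that do not appear in the slides.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct obj { int field; };                 /* hypothetical type for the slide's ptr */
static struct obj the_obj;
static struct obj *ptr = &the_obj;

/* One lock shared by the p-c region and by r (assumed name). */
static pthread_mutex_t afix_lock = PTHREAD_MUTEX_INITIALIZER;

static void *thread1(void *arg) {
    (void)arg;
    pthread_mutex_lock(&afix_lock);        /* acquired before p */
    if (ptr != NULL) {                     /* p */
        ptr->field = 1;                    /* c */
    }
    pthread_mutex_unlock(&afix_lock);      /* released after c, on both branches */
    return NULL;
}

static void *thread2(void *arg) {
    (void)arg;
    pthread_mutex_lock(&afix_lock);
    ptr = NULL;                            /* r */
    pthread_mutex_unlock(&afix_lock);
    return NULL;
}

/* Runs the two threads once and returns 0 if ptr ended up NULL. */
int run_patched_example(void) {
    pthread_t t1, t2;
    ptr = &the_obj;
    int rc1 = pthread_create(&t1, NULL, thread1, NULL);
    int rc2 = pthread_create(&t2, NULL, thread2, NULL);
    assert(rc1 == 0 && rc2 == 0);
    (void)rc1; (void)rc2;
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return ptr == NULL ? 0 : -1;
}
```

Whichever thread takes the lock first, r can no longer run between p and c, so the dereference never sees a NULL pointer. The critical section covers only the p to c region, in line with the performance goal.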

A naïve solution
• Add lock on edges reaching p
• Add unlock on edges leaving c
• Potential new bugs
  - Could lock without unlock
  - Could unlock without lock
  - etc.
[Figure: control-flow graphs containing p and c that illustrate these cases]

The AFix solution
• Assume p and c are in the same function f
• Step 1: find the protected nodes in the critical section
• Step 2: add lock operations
  - lock on each edge from an unprotected node to a protected node
  - unlock on each edge from a protected node to an unprotected node
• This avoids the potential bugs of the naïve solution
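The two steps can be sketched on a toy control-flow graph. The sketch below is a simplification under stated assumptions, not the AFix implementation: it treats a node as protected when it is reachable from p and can reach c, and the `struct cfg` adjacency matrix, `find_protected`, and `edge_action` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_NODES 16

/* Hypothetical CFG: edge[u][v] is true if control can flow from u to v. */
struct cfg {
    int n;
    bool edge[MAX_NODES][MAX_NODES];
};

/* Marks every node reachable from start, following edges forward or backward. */
static void reach(const struct cfg *g, int start, bool backward, bool *seen) {
    int stack[MAX_NODES], top = 0;
    assert(g->n <= MAX_NODES);
    memset(seen, 0, sizeof(bool) * MAX_NODES);
    seen[start] = true;
    stack[top++] = start;
    while (top > 0) {
        int u = stack[--top];
        for (int v = 0; v < g->n; v++) {
            bool connected = backward ? g->edge[v][u] : g->edge[u][v];
            if (connected && !seen[v]) {
                seen[v] = true;          /* each node is pushed at most once */
                stack[top++] = v;
            }
        }
    }
}

/* Step 1 (simplified): a node is protected if it lies on some path from p to c. */
void find_protected(const struct cfg *g, int p, int c, bool *prot) {
    bool from_p[MAX_NODES], to_c[MAX_NODES];
    reach(g, p, false, from_p);
    reach(g, c, true, to_c);
    for (int v = 0; v < g->n; v++)
        prot[v] = from_p[v] && to_c[v];
}

/* Step 2: returns +1 for a lock edge, -1 for an unlock edge, 0 otherwise. */
int edge_action(const struct cfg *g, const bool *prot, int u, int v) {
    if (!g->edge[u][v]) return 0;
    if (!prot[u] && prot[v]) return +1;   /* unprotected -> protected: lock */
    if (prot[u] && !prot[v]) return -1;   /* protected -> unprotected: unlock */
    return 0;
}
```

On a graph with a branch that leaves p without ever reaching c, the branch edge is classified as an unlock edge, which is the "lock without unlock" case that the naïve placement misses.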

Subtle details
• p and c adjustment when they are in different functions
  - Observation: people put lock and unlock in one function
  - Find the longest common prefix of p’s and c’s stack traces
  - Adjust p and c accordingly
• Put r into a critical section
  - Do nothing if we can reach r from the p–c critical section
• Lock type:
  - Lock with timeout: if the critical section has blocking operations
  - Reentrant lock: if recursion is possible within the critical section
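As an illustration of the reentrant-lock case, the following sketch creates a recursive POSIX mutex and takes it inside a recursive function. The names `region_lock`, `init_reentrant_lock`, and `count_down` are assumptions made for this example; the slides do not say which lock API CFix emits. A lock with timeout could be sketched along the same lines with `pthread_mutex_timedlock` on platforms that provide it.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t region_lock;    /* assumed name for the patch's lock */

/* Initializes region_lock as a recursive (reentrant) mutex; returns 0 on success. */
int init_reentrant_lock(void) {
    pthread_mutexattr_t attr;
    if (pthread_mutexattr_init(&attr) != 0) return -1;
    if (pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE) != 0) {
        pthread_mutexattr_destroy(&attr);
        return -1;
    }
    int rc = pthread_mutex_init(&region_lock, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc == 0 ? 0 : -1;
}

/* Hypothetical recursive function whose whole body is the critical section. */
int count_down(int n) {
    assert(n >= 0);
    pthread_mutex_lock(&region_lock);   /* re-acquired on every recursive call */
    int depth = (n == 0) ? 0 : 1 + count_down(n - 1);
    pthread_mutex_unlock(&region_lock); /* one unlock per lock */
    return depth;
}
```

With a default (non-recursive) mutex the nested `pthread_mutex_lock` call would deadlock or be undefined, which is why the lock type matters when recursion can occur inside the critical section.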

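OFix: two order relationships
• allA-B: B must wait until all instances of A (A1 … An) have executed
• firstA-B: B must wait until the first instance of A has executed
[Figure: example operations labeled use, read, initialization, and destroy]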

OFix allA-B enforcement
• Approach: condition variable and flag
  - Insert signal operations in A-threads
  - Insert a wait operation before B
• Rules
  - An A-thread signals exactly once, when it will not execute any more A
  - An A-thread signals as soon as possible
  - B proceeds when each A-thread has signaled

OFix allA-B enforcement: A side
How to identify the last A instance in one thread
    ...;
    for (...)
        ...;  // A
    ...;
• Each thread that executes A signals exactly once, as soon as it can execute no more A

OFix allA-B enforcement: A side
How to identify the last thread that executes A
    void main() {
        for (...)
            thread_create(thr_main);
        ...;
    }
    void thr_main() {
        for (...)
            ...;  // A
        ...;
    }
    void ofix_signal() {
        mutex_lock(L);
        --counter;
        if (counter == 0)
            cond_broadcast(con);
        mutex_unlock(L);
    }
• A counter tracks the threads that still have to signal: it starts at 1 and is incremented at each thread_create
• The slide draws the counter as an icon; it is written as counter here

OFix allA-B enforcement: B side
• B is safe to execute only when the counter is 0
• Give up if OFix knows that it introduces a new deadlock
• Timed wait operation to mask potential deadlocks
    void ofix_wait() {
        mutex_lock(L);
        if (counter != 0)
            cond_timedwait(con, L, t);
        mutex_unlock(L);
    }
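The signal and wait sides can be put together in a small POSIX-threads sketch. It follows the slides' counter scheme (start at 1, increment before each thread creation, signal once per thread), but the worker count, the `a_executed` variable standing in for A, the 10-second time-out, and `run_all_a_before_b` are assumptions made for illustration. Unlike the slide's `if`, the wait loops until the deadline so that spurious wake-ups are harmless.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <time.h>

#define NUM_WORKERS 4
#define A_PER_WORKER 100

static pthread_mutex_t L = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t con = PTHREAD_COND_INITIALIZER;
static int counter = 1;               /* starts at 1 for the creating thread */
static pthread_mutex_t data_lock = PTHREAD_MUTEX_INITIALIZER;
static int a_executed = 0;            /* hypothetical effect of statement A */

static void ofix_signal(void) {
    pthread_mutex_lock(&L);
    if (--counter == 0)
        pthread_cond_broadcast(&con);
    pthread_mutex_unlock(&L);
}

/* Returns 0 once every A-thread has signaled, or -1 after the time-out. */
static int ofix_wait(int timeout_sec) {
    struct timespec deadline;
    int rc = 0;
    assert(timeout_sec > 0);
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_sec;
    pthread_mutex_lock(&L);
    while (counter != 0 && rc == 0)
        rc = pthread_cond_timedwait(&con, &L, &deadline);
    int done = (counter == 0);
    pthread_mutex_unlock(&L);
    return done ? 0 : -1;
}

static void *thr_main(void *arg) {
    (void)arg;
    for (int i = 0; i < A_PER_WORKER; i++) {
        pthread_mutex_lock(&data_lock);
        a_executed++;                 /* A */
        pthread_mutex_unlock(&data_lock);
    }
    ofix_signal();                    /* this thread can execute no more A */
    return NULL;
}

static void *b_main(void *arg) {
    int *seen = arg;
    *seen = -1;
    if (ofix_wait(10) == 0) {
        pthread_mutex_lock(&data_lock);
        *seen = a_executed;           /* B */
        pthread_mutex_unlock(&data_lock);
    }
    return NULL;
}

/* Starts B first, then the A-threads; returns the number of A's that B observed. */
int run_all_a_before_b(void) {
    pthread_t b, workers[NUM_WORKERS];
    int seen = -1, created = 0;
    counter = 1;
    a_executed = 0;
    if (pthread_create(&b, NULL, b_main, &seen) != 0) return -1;
    for (int i = 0; i < NUM_WORKERS; i++) {
        pthread_mutex_lock(&L);
        counter++;                    /* incremented before each thread creation */
        pthread_mutex_unlock(&L);
        if (pthread_create(&workers[created], NULL, thr_main, NULL) == 0)
            created++;
        else
            ofix_signal();            /* undo the increment for a thread that never ran */
    }
    ofix_signal();                    /* the creator will start no more A-threads */
    for (int i = 0; i < created; i++)
        pthread_join(workers[i], NULL);
    pthread_join(b, NULL);
    return seen;
}
```

Because the creating thread holds one count until it has finished spawning workers, the counter cannot reach 0 early even if some workers finish before others are created.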

OFix firstA-B
• Basic enforcement
• When A may not execute
  - Add a safety net of signals using the allA-B algorithm
[Figure: statements A and B]
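A minimal sketch of the basic firstA-B enforcement, assuming it amounts to a flag plus a condition variable: a signal right after A and a wait right before B. The names `a_done`, `shared`, and `run_first_a_before_b` are hypothetical, the safety net for the case where A never executes is omitted, and the wait is untimed for brevity, whereas the slides describe timed waits.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t a_done_cv = PTHREAD_COND_INITIALIZER;
static bool a_done = false;           /* set once the first A has executed */
static int shared = 0;                /* hypothetical data written by A, read by B */

static void signal_a_done(void) {
    pthread_mutex_lock(&m);
    a_done = true;
    pthread_cond_broadcast(&a_done_cv);
    pthread_mutex_unlock(&m);
}

static void *a_thread(void *arg) {
    (void)arg;
    shared = 42;                      /* A */
    signal_a_done();                  /* inserted right after A */
    return NULL;
}

static void *b_thread(void *arg) {
    int *out = arg;
    pthread_mutex_lock(&m);
    while (!a_done)                   /* inserted right before B */
        pthread_cond_wait(&a_done_cv, &m);
    pthread_mutex_unlock(&m);
    *out = shared;                    /* B */
    return NULL;
}

/* Starts B before A on purpose; returns the value that B read. */
int run_first_a_before_b(void) {
    pthread_t ta, tb;
    int out = -1;
    a_done = false;
    shared = 0;
    int rc_b = pthread_create(&tb, NULL, b_thread, &out);
    int rc_a = pthread_create(&ta, NULL, a_thread, NULL);
    assert(rc_a == 0 && rc_b == 0);
    (void)rc_a; (void)rc_b;
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    return out;
}
```

The flag matters as much as the condition variable: if A runs before B reaches its wait, B finds `a_done` already set and proceeds without blocking.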

CFix: patch testing & selection
Challenge:
• Multi-threaded software testing
Solution:
• CFix-patch oriented testing

Patch testing principles
• Two ideas:
  - No exhaustive testing, but patch-oriented testing
  - Leverage existing testing techniques, with extra heuristics
• The workflow
  - Step 1: prune incorrect patches (patches causing failures due to wrong fix strategies, etc.)
  - Step 2: prune slow patches
  - Step 3: prune complicated patches
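The three pruning steps can be read as a filter over candidate patches. The sketch below is purely illustrative: the `struct patch_result` fields, the overhead threshold, and the rule of picking the survivor with the fewest synchronization operations are assumptions, not CFix's actual selection criteria.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-patch test summary; none of these fields come from CFix. */
struct patch_result {
    bool   failed;      /* a failure or time-out was observed during testing */
    double overhead;    /* run-time overhead relative to the original program */
    int    sync_ops;    /* number of inserted synchronization operations */
};

/* Returns the index of the selected patch, or -1 if every patch was pruned. */
int select_patch(const struct patch_result *r, size_t n, double max_overhead) {
    int best = -1;
    assert(r != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (r[i].failed) continue;                   /* step 1: incorrect */
        if (r[i].overhead > max_overhead) continue;  /* step 2: slow */
        /* step 3: among the survivors, prefer the simplest patch */
        if (best < 0 || r[i].sync_ops < r[best].sync_ops)
            best = (int)i;
    }
    return best;
}
```

For example, with four candidates where the first fails, the second exceeds the overhead bound, and the last two pass with 6 and 4 inserted operations, the function returns the index of the last one.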

Run once without external perturbation
• Reject a patch if there is a time-out or failure
• This catches patches that fix the wrong root cause
  - Such patches make the software fail deterministically
Thread 1:
    ptr->field = 1;
Thread 2:
    ptr = NULL;

Implicit bad patch
• A failure in patch_b implies a failure in patch_a
  - if patch_a is less restrictive than patch_b
• Helpful for pruning patch_a
  - Traditional testing may not find the failure in patch_a
[Figure: patch a enforces mutual exclusion; patches b and c enforce order relationships]

CFix: patch merging
Challenge:
• One single programming mistake usually leads to multiple bug reports
Solution:
• Heuristics to merge patches

An example with multiple reports
    void buf_write() {
        int tmp = buf_len + str_len;
        if (tmp > MAX)
            return;
        memcpy(&buf[buf_len], str, str_len);
        buf_len = tmp;
    }
• Two reports, (p1, c1, r1) and (p2, c2, r2), fall within this one function
• Patching each report separately causes problems
  - Too many lock/unlock operations
  - Potential new deadlocks
  - May hurt performance and simplicity

Related patch: a case of AFix
• Merge if p, c, or r is in some other patch’s critical section
Before merging:
    lock(L1); p1; lock(L2); p2; c1; unlock(L1); c2; unlock(L2);
    lock(L1); r1; unlock(L1);
    lock(L2); r2; unlock(L2);
After merging:
    lock(L1); p1; p2; c1; c2; unlock(L1);
    lock(L1); r2; unlock(L1);

The merged patch for the example
    void buf_write() {
        int tmp = buf_len + str_len;           // p1
        if (tmp > MAX) {
            return;
        }
        memcpy(&buf[buf_len], str, str_len);   // c1, p2
        buf_len = tmp;                         // c2, r1, r2
    }
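One way to picture the merged patch is a single lock that covers the whole read-check-copy-update sequence, released on the early-return path as well. This is a hand-written sketch rather than CFix output: the `str` and `str_len` parameters, the buffer size, the lock name `L1`, and the `writer` and `run_writers` helpers are assumptions added to make the example self-contained.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

#define MAX 64

static char buf[MAX];
static int buf_len = 0;
static pthread_mutex_t L1 = PTHREAD_MUTEX_INITIALIZER;   /* the single merged lock */

void buf_write(const char *str, int str_len) {
    assert(str_len >= 0);
    pthread_mutex_lock(&L1);
    int tmp = buf_len + str_len;
    if (tmp > MAX) {
        pthread_mutex_unlock(&L1);    /* release on the early-return path too */
        return;
    }
    memcpy(&buf[buf_len], str, (size_t)str_len);
    buf_len = tmp;
    pthread_mutex_unlock(&L1);
}

/* Hypothetical writer thread: appends "ab" eight times. */
static void *writer(void *arg) {
    (void)arg;
    for (int i = 0; i < 8; i++)
        buf_write("ab", 2);
    return NULL;
}

/* Runs four writers concurrently and returns the final buffer length. */
int run_writers(void) {
    pthread_t t[4];
    int created = 0;
    buf_len = 0;
    for (int i = 0; i < 4; i++)
        if (pthread_create(&t[created], NULL, writer, NULL) == 0)
            created++;
    for (int i = 0; i < created; i++)
        pthread_join(t[i], NULL);
    return buf_len;
}
```

Four writers appending 16 bytes each fill the 64-byte buffer exactly, so a final length of 64 indicates that no update to `buf_len` was lost.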

CFix: run-time support
• To understand whether there is a deadlock underlying a time-out
• Low overhead, and suitable for production runs

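Evaluation methodology
• Applications: PBZIP2, x264, FFT, HTTrack, Mozilla-1, transmission, ZSNES, Apache, MySQL-1, MySQL-2, Mozilla-2, Cherokee, Mozilla-3
• Detectors: AV detector, OV detector, RA detector, DU detector
[Table: applications against detectors; the cell contents did not survive in this transcript]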

Evaluation result
[Table: # of Ops for each application (PBZIP2, x264, FFT, HTTrack, Mozilla-1, transmission, ZSNES, Apache, MySQL-1, MySQL-2, Mozilla-2, Cherokee, Mozilla-3); the values did not survive in this transcript]

Summary
• Software reliability is critical
• Fixing concurrency bugs is costly and error-prone
• CFix uses some heuristics, with good results in practice
  - A combination of mutual exclusion and order enforcement
  - Use testing to select the best patch
  - Fix the root cause without requiring detectors to report it
  - Small overhead and good simplicity

Questions?
Thank you