CHESS Finding and Reproducing Heisenbugs in Concurrent Programs


CHESS: Finding and Reproducing Heisenbugs in Concurrent Programs
Julieta Arakelian
Based on a paper by Madanlal Musuvathi, Shaz Qadeer, Tom Ball, Gerard Basler, Piramanayakam Arumuga Nainar, and Iulian Neamtiu
Speaker notes: explain how this relates to the course.

Introduction
- Heisenbugs
- Systematic exploration of program behavior
- Reproduction to enable easy debugging
Speaker notes: heisenbugs are hard to find, and once one is fixed, how do we know the fix worked if the bug shows up only once a year?

What does CHESS do?
- Takes control over the scheduling of threads and asynchronous events
- Captures all interleaving non-determinism
- Forces every run to have a different interleaving
- Today, CHESS works on three platforms and has been used to find numerous bugs, more than half of them by Microsoft testers
- Reproduced, in smaller configurations, all reported stress-test crashes, which eases debugging
Speaker notes: CHESS finds all the non-determinism in the program. Explain that some tests (stress tests) run for several days on many threads, and that with CHESS the same bugs were found in less time and with fewer threads.

Example
- Reproduction of a heisenbug in CCR, a .NET library for asynchronous concurrent programming
- The failing test hadn't failed for many months
- Changes made to reproduce:
  - Commented out all passing tests
  - Changed the harness to run the test once, since CHESS takes care of running the test repeatedly

Example - reproduction
- Reproduced over 27 seconds
- CHESS reported a deadlock after exploring 6737 different thread interleavings
- Since CHESS recorded the schedule, it was possible to rerun it under a debugger to find and fix the issue
Speaker notes: the bug was that one thread came to perform work queued on a certain port while another thread cancelled all the work on that port. When the first thread then tried to continue with those tasks, an exception was thrown and a deadlock resulted.

Challenges
- Avoid perturbing the system under test, and integrate with the existing test infrastructure
- Capture and explore all interleaving non-determinism; understand the semantics of all synchronization functions
- Explore intelligently: avoid redundant search and prioritize potentially erroneous interleavings

Challenges - wrapper
Goal: avoid perturbing the system under test, and integrate with the existing test infrastructure.
- Thin wrapper layer between the program under test and the concurrency API
- Doesn't change the semantics
- Allows CHESS to control thread scheduling
- Works with Win32, .NET and Singularity
Notes from the paper: for Win32 programs, CHESS uses DLL-shimming techniques to redirect all calls to the synchronization library by overwriting the import address table of the program under test. In addition, binary instrumentation inserts a call to the wrapper before instructions that use hardware synchronization mechanisms, such as interlocked operations. For .NET programs, an extended CLR profiler [11] replaces calls to API functions with calls to the wrappers at JIT time. Finally, for the Singularity API, a static IL rewriter [1] makes the modifications.
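The thin-wrapper idea can be sketched in a few lines of Java. This is only an illustration: `SchedulerHook`, `WrappedLock` and `WrapperDemo` are hypothetical names invented here, not part of CHESS, and the real tool redirects calls at the binary or JIT level rather than through a hand-written class. The wrapper reports each synchronization call and then delegates to the real lock, so the lock's semantics are unchanged.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical callback standing in for the testing tool's scheduler.
interface SchedulerHook {
    void beforeSyncOp(String op);
}

// Thin wrapper: same lock semantics, plus a notification before each call.
class WrappedLock {
    private final ReentrantLock real = new ReentrantLock();
    private final SchedulerHook hook;

    WrappedLock(SchedulerHook hook) {
        this.hook = hook;
    }

    void lock() {
        hook.beforeSyncOp("lock");   // potential scheduling point
        real.lock();
    }

    void unlock() {
        hook.beforeSyncOp("unlock"); // potential scheduling point
        real.unlock();
    }
}

class WrapperDemo {
    // Runs one critical section and returns the operations the hook observed.
    static List<String> run() {
        List<String> seen = new ArrayList<>();
        WrappedLock lock = new WrappedLock(seen::add);
        lock.lock();
        try {
            // critical section of the program under test
        } finally {
            lock.unlock();
        }
        return seen;
    }
}
```

In this sketch the hook only records the calls; a scheduler in the spirit of the slides would use the same hook to decide which thread runs next.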

Scheduler
Two classes of non-determinism:
1. Input: values provided to the program that affect its execution, such as return values from system calls and the state of memory when the program starts
2. Interleaving: threads running concurrently on a multiprocessor, and the timing of events such as preemptions, asynchronous callbacks and timers

Scheduler - goals
- Capture all non-determinism, so that the exact execution can be reproduced by replaying the choices
- Expose the choices to a search engine that systematically enumerates possible executions
Speaker notes: for example, two threads are waiting for a lock; when it is released, either of them may be woken. The log must record which one was woken so that the scenario can be reproduced. In addition, the search engine must be told that there are two options here, so that later runs check both.

Scheduler - challenges
- Capture and expose ALL non-deterministic choices
- Understand the semantics of all concurrency functions
- Not introduce impossible behaviors
- Not slow the execution
- Be easily integrated with existing test frameworks
Speaker notes: if not everything is captured, we may fail to reproduce a run, or never test certain runs at all.

Scheduler: handling input non-determinism
- It is too expensive to support full logging and replay of non-deterministic input
- Instead, clean the state of the memory, the disk and the network between runs, so that all tests start from the same state
- Log and replay input values such as the current time, process IDs, random numbers and error values from system calls
- This doesn't guarantee deterministic replay (the search takes care of that, as we will see soon)
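Logging and replaying a single input value can be sketched as follows. `InputLog` and its methods are hypothetical names for illustration only; how CHESS actually stores these values is not described in the slides. The first pass records values from a live source, and after `rewind()` the same values are returned in order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongSupplier;

// Records input values on the first pass and replays them after rewind().
class InputLog {
    private final List<Long> recorded = new ArrayList<>();
    private int cursor = 0;

    // Returns the logged value if one exists at this position;
    // otherwise reads the live source and appends the value to the log.
    long next(LongSupplier source) {
        if (cursor < recorded.size()) {
            return recorded.get(cursor++);
        }
        long value = source.getAsLong();
        recorded.add(value);
        cursor = recorded.size();
        return value;
    }

    // Starts replaying from the beginning of the log.
    void rewind() {
        cursor = 0;
    }
}
```

A test harness could call `log.next(System::currentTimeMillis)` wherever the program reads the clock, and call `rewind()` before each replayed iteration.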

Scheduler: choosing the right abstraction layer
- Redirects calls to concurrency primitives to alternate implementations
- For more complex concurrency primitives, the scheduler treats them as part of the program being tested
- Creates a "happens-before" graph to log and replay
Speaker notes: treating complex implementations as part of the program makes CHESS much easier to implement, but it can also miss runs. For example, if a lock uses a queue to decide who gets it next, the acquisition order will always be the same, so we never explore other runs that may appear in future versions of the lock once the queue changes.

Scheduler – Logging
Lamport's "happens-before" graph
- Captures the relative execution order of the threads in a concurrent execution
- Nodes: the instructions executed. Each node includes:
  - A task: thread, timer callback, threadpool work item
  - A synchronization variable: lock, semaphore, variable accessed atomically, queue
  - An operation
- Edges: determine the execution order
Speaker notes: why is this good? 1. It looks at all the synchronization primitives in the program. 2. It is abstract with respect to timing: it keeps only the order.

Scheduler – Logging
Lamport's "happens-before" graph
- Two bits for each operation:
  - isWrite: the operation changes the state of the resource
  - isRelease: the operation unblocks tasks waiting for the resource
- In addition, CHESS keeps a set of enabled tasks and a set of tasks waiting for each resource
Speaker notes: 1. Edges are added from all earlier nodes on the same lock and to all later nodes on the same lock. 2. The CHESS search mechanism needs to know at any given time which threads are free to run. A trylock is marked isWrite only when it succeeds.

Lamport's "happens-before" graph example
Thread A and Thread B each run the same method:
    public int inc() {
        lock.lock();
        int newCount = ++count;
        lock.unlock();
        return newCount;
    }
Scheduler state shown in successive steps of the slide animation:
- Enabled tasks: {A, B}, then {A, B}, then {A}
- Waiting for lock: {}, then {B}, then {B}
Graph nodes (task, variable, bits):
- A, lock, isWrite
- A, lock, isWrite + isRelease
- B, lock, isWrite
- B, lock, isWrite + isRelease
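The four nodes of this example can be built with a small Java sketch. It assumes one common edge rule, which the slides do not spell out in full: an earlier node is ordered before a later one if both belong to the same task, or if both touch the same variable and at least one of them has isWrite set. `HbNode`, `HbGraph` and `HbDemo` are hypothetical names, and the sketch keeps transitive edges rather than pruning them.

```java
import java.util.ArrayList;
import java.util.List;

// One node per synchronization operation: task, variable and the two bits.
final class HbNode {
    final String task;
    final String variable;
    final boolean isWrite;
    final boolean isRelease;

    HbNode(String task, String variable, boolean isWrite, boolean isRelease) {
        this.task = task;
        this.variable = variable;
        this.isWrite = isWrite;
        this.isRelease = isRelease;
    }
}

class HbGraph {
    final List<HbNode> nodes = new ArrayList<>();
    final List<int[]> edges = new ArrayList<>(); // {from, to} node indices

    // Appends a node and adds an edge from every earlier node it is ordered after.
    int add(HbNode n) {
        int idx = nodes.size();
        for (int i = 0; i < idx; i++) {
            HbNode m = nodes.get(i);
            boolean sameTask = m.task.equals(n.task);
            boolean conflict = m.variable.equals(n.variable) && (m.isWrite || n.isWrite);
            if (sameTask || conflict) {
                edges.add(new int[] {i, idx});
            }
        }
        nodes.add(n);
        return idx;
    }
}

class HbDemo {
    // Builds the graph for "A runs inc(), then B runs inc()" and counts its edges.
    static int edgeCount() {
        HbGraph g = new HbGraph();
        g.add(new HbNode("A", "lock", true, false)); // A: lock()
        g.add(new HbNode("A", "lock", true, true));  // A: unlock()
        g.add(new HbNode("B", "lock", true, false)); // B: lock()
        g.add(new HbNode("B", "lock", true, true));  // B: unlock()
        return g.edges.size();
    }
}
```

Because all four operations write the same lock, every pair of nodes is ordered, so the graph is a total order in this small case.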

Scheduler – Logging
Capturing the "happens-before" graph
For each call to a synchronization operation, CHESS needs to:
1. Determine whether the task will be disabled, i.e. whether the operation blocks
2. Label the call, i.e. create the node
3. Inform the scheduler when a task is created or terminated
State kept for this purpose:
- Mapping from resource handles to synchronization variables
- Set of threads waiting for each synchronization variable
- Variables for the currently executing task
- Set of enabled tasks

Scheduler – Logging
Capturing the "happens-before" graph
1. Determine whether the task will be disabled, i.e. whether the operation blocks
- The wrapper function calls the "try" version of the locking function; if it fails, the task is moved to the set of tasks waiting for the resource
- When a release operation is performed on the resource, all the waiting tasks are moved back to the set of enabled tasks
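The bookkeeping described here can be modeled without real threads. The Java sketch below is a simplified model under stated assumptions: tasks are plain strings, there is a single lock, and `LockModel` is a hypothetical name rather than anything from CHESS. A failed try-acquire moves the task from the enabled set to the waiting set, and a release moves every waiter back.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Single-lock model of the enabled/waiting sets kept by the scheduler.
class LockModel {
    private String owner = null;                        // task holding the lock, if any
    final Set<String> enabled = new LinkedHashSet<>();  // tasks that may be scheduled
    final Set<String> waiting = new LinkedHashSet<>();  // tasks blocked on the lock

    LockModel(String... tasks) {
        for (String t : tasks) {
            enabled.add(t);
        }
    }

    // "Try" version of acquire: on failure the task is marked as waiting
    // instead of blocking the underlying thread.
    boolean tryAcquire(String task) {
        if (owner == null) {
            owner = task;
            return true;
        }
        enabled.remove(task);
        waiting.add(task);
        return false;
    }

    // Release: every waiting task becomes enabled again and must retry.
    void release(String task) {
        if (!task.equals(owner)) {
            throw new IllegalStateException(task + " does not hold the lock");
        }
        owner = null;
        enabled.addAll(waiting);
        waiting.clear();
    }
}
```

Never blocking inside the real lock is what keeps the scheduler in control: it always knows which tasks it can still choose from.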

Scheduler – Logging
Capturing the "happens-before" graph
2. Label the call, i.e. create the node
- Using the saved state, it is easy to obtain the task and the resource
- Setting the bits is done by understanding the semantics of the API call
- When in doubt, both bits are set to true
Saved state:
- Mapping from resource handles to synchronization variables
- Set of threads waiting for each synchronization variable
- Variables for the currently executing task
- Set of enabled tasks
Speaker notes: when we are not sure what a function does, both bits are turned on. Setting isWrite only adds extra edges, and setting isRelease only adds unnecessary wake-ups, but correctness is preserved.

Scheduler – Logging
Capturing the "happens-before" graph
3. Inform the scheduler when a task is created or terminated
- CHESS identifies the API functions that create tasks
- The wrapper informs CHESS about the creation, and CHESS wraps the input closure in a new closure that calls the CHESS scheduler

Scheduler: capturing data races by single-threaded execution
- Most concurrent programs include data races, so their outcome needs to be reflected in the happens-before graph
- Instead of recording each race, the CHESS scheduler enforces single-threaded execution
- Two issues with this approach:
  - It slows down the execution. This can be mitigated by running multiple instances of CHESS simultaneously
  - CHESS may not be able to explore both possible outcomes of a data race. A data-race detector is run on the first runs to inform the scheduler about all data races
Notes from the paper: most concurrent programs communicate via shared memory. If the program is data-race free, all accesses to shared memory are ordered by synchronization operations, and in that case the happens-before graph is enough. When data races are present, the missing edges may result in an inability to replay a given execution. One possible solution is to use a dynamic data-race detection tool [46, 9] that captures the outcome of each data race at runtime. The main disadvantage of this approach is the performance overhead: current data-race detectors slow the execution of the program by an order of magnitude. This solution was therefore considered too expensive.

Exploring non-determinism
- How does CHESS obtain control at scheduling points?
- How does it systematically drive the test among different schedules?

Exploring non-determinism: basic search operation
Each iteration has three phases:
- REPLAY: plays a sequence of scheduling choices from a trace file. The file is empty on the first iteration; on later iterations it contains a partial schedule found by the search phase of the previous iteration.
- RECORD: behaves as a fair, non-preemptive scheduler. On a yield, it picks the next thread based on its fairness priorities. It also extends the partial schedule in the trace file by recording which thread was chosen from the enabled threads each time, together with the choices that were available but not chosen.
- SEARCH: uses the information gathered to determine the schedule for the next iteration. The algorithm for choosing the schedule will be shown soon.
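The replay/record/search loop can be illustrated on a toy model in Java. This is a sketch under strong assumptions, not the CHESS algorithm: tasks never block, a "program" is just an array giving the number of steps of each task, the record phase always picks the lowest-numbered unfinished task, and the search phase is plain depth-first backtracking. `Explorer` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

class Explorer {
    // One execution: replay the given prefix, then record by always picking
    // the lowest-numbered task that still has steps left.
    static List<Integer> runOnce(int[] steps, List<Integer> prefix) {
        int[] left = steps.clone();
        int remaining = 0;
        for (int s : left) {
            remaining += s;
        }
        List<Integer> schedule = new ArrayList<>();
        while (remaining > 0) {
            int pos = schedule.size();
            int pick;
            if (pos < prefix.size()) {
                pick = prefix.get(pos);        // REPLAY phase
            } else {
                pick = 0;                      // RECORD phase
                while (left[pick] == 0) {
                    pick++;
                }
            }
            left[pick]--;
            remaining--;
            schedule.add(pick);
        }
        return schedule;
    }

    // SEARCH phase: find the last scheduling point that still has an untried
    // alternative and return the prefix taking it; null when none is left.
    static List<Integer> nextPrefix(int[] steps, List<Integer> schedule) {
        for (int pos = schedule.size() - 1; pos >= 0; pos--) {
            int[] left = steps.clone();
            for (int i = 0; i < pos; i++) {
                left[schedule.get(i)]--;
            }
            for (int alt = schedule.get(pos) + 1; alt < steps.length; alt++) {
                if (left[alt] > 0) {
                    List<Integer> prefix = new ArrayList<>(schedule.subList(0, pos));
                    prefix.add(alt);
                    return prefix;
                }
            }
        }
        return null;
    }

    // Repeats replay/record/search until every schedule has been executed once.
    static int countExecutions(int[] steps) {
        int count = 0;
        List<Integer> prefix = new ArrayList<>();
        while (prefix != null) {
            List<Integer> schedule = runOnce(steps, prefix);
            count++;
            prefix = nextPrefix(steps, schedule);
        }
        return count;
    }
}
```

For two tasks of two steps each, `Explorer.countExecutions(new int[] {2, 2})` visits the six possible interleavings exactly once each.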

Exploring non-determinism: dealing with imperfect replay
The scheduler can fail to replay a trace in two cases:
1. The thread to schedule at a scheduling point is disabled. This happens when a particular resource, such as a lock, was available at this scheduling point in the previous iteration but is currently unavailable.
2. A scheduled thread performs a different sequence of synchronization operations than the one present in the trace. This can happen when the program's control flow changes because some program state was not reset at the end of the previous iteration.
When this is detected, the scheduler gives up the replay and switches to record. It then tries to replay the same trace again.

Exploring non-determinism: special handling for sources of non-determinism
1. Lazy initialization
- Problem: if the lazy initialization performs synchronization operations, CHESS would fail to see these operations in subsequent iterations
- Solution: run a few iterations of the test up front in order to initialize all data structures
- The downside, of course, is that CHESS loses the ability to interleave the lazy-initialization operations with other threads, potentially missing some bugs

Exploring non-determinism: special handling for sources of non-determinism
2. Interference from the environment
- Problem: if the system is part of a larger environment, there may be synchronization variables that are shared with other systems
- Solution:
  - Replay: if a thread appears to be disabled unexpectedly, the scheduler tries to reschedule it a few times before switching to record
  - Record: CHESS records the thread as disabled, although it may not be disabled in the next run. In the worst case this creates a false deadlock; to check whether a deadlock is real, the scheduler tries to schedule all the deadlocked threads
Notes from the paper: for instance, when CHESS is run on Dryad, the entire Cosmos system (of which Dryad is a part) is brought up as part of the startup. While the tester is expected to provide sufficient isolation between the system under test and its environment, it is impractical to require complete isolation. As a simple example, Dryad and Cosmos share the same logging module, which uses a lock to protect a shared log buffer. When a Dryad thread calls into the logging module, it could interfere with a Cosmos thread that is currently holding the lock.

Exploring non-determinism: special handling for sources of non-determinism
3. Non-deterministic calls
- Problem: functions such as random() and gettimeofday() return different values in each iteration
- Solution for random numbers: reseed the generator with a predefined constant, so that it returns the same values in all iterations
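The effect of reseeding can be shown with `java.util.Random`, whose sequence is fully determined by its seed. This is only an analogy in Java, since the slides do not say which generator CHESS reseeds; `SeededInput` and the seed value 42 are arbitrary choices for this sketch.

```java
import java.util.Random;

class SeededInput {
    static final long FIXED_SEED = 42L; // arbitrary constant chosen for this sketch

    // Returns the random values one test iteration would observe when the
    // generator is reseeded with the same constant at the start of the iteration.
    static int[] iteration(int draws) {
        Random rng = new Random(FIXED_SEED);
        int[] out = new int[draws];
        for (int i = 0; i < draws; i++) {
            out[i] = rng.nextInt(100);
        }
        return out;
    }
}
```

Every call to `SeededInput.iteration(n)` returns the same array, so control flow that depends on these values is identical from one iteration to the next.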

Exploring non-determinism: ensuring starvation-free schedules
- All fair schedules need to be explored
- Errors found on non-fair schedules are not likely to happen in the field
- The scheduler gives lower priority to threads that yield the processor

Exploring non-determinism: tackling state-space explosion
- The CHESS scheduler is non-preemptive by default, which lets it execute large bodies of code atomically
- A non-preemptive scheduler does not model the fact that a real scheduler may preempt a thread at just about any point in its execution, so CHESS explores thread schedules giving priority to schedules with fewer preemptions
- Preemption bounding: many bugs in multithreaded programs are exposed by a few preemptions occurring at particular places in the execution
- Preemptions are allowed at the following points:
  1. Calls to synchronization primitives in the concurrency API
  2. Accesses to volatile variables that participate in a data race

Exploring non-determinism: tackling state-space explosion
1. Inserting preemptions prudently
- Most concurrency bugs happen with few preemptions in the right places
- Don't preempt system functions or base-library modules that are thread-free
- If an access to a volatile variable lies between other synchronization accesses, don't preempt at this point
Speaker notes: with n threads that each take k steps, there are at least n^k different runs, which is a huge number.
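The speaker notes estimate n^k runs for n threads of k steps each. If every step is treated as a scheduling point and no thread ever blocks (both assumptions made only for this sketch), the exact number of interleavings is the multinomial coefficient (n*k)! / (k!)^n, which grows even faster. The hypothetical helper below computes it with `BigInteger`.

```java
import java.math.BigInteger;

class Interleavings {
    static BigInteger factorial(int m) {
        BigInteger result = BigInteger.ONE;
        for (int i = 2; i <= m; i++) {
            result = result.multiply(BigInteger.valueOf(i));
        }
        return result;
    }

    // Number of ways to interleave n threads of k atomic steps each:
    // (n*k)! / (k!)^n
    static BigInteger count(int n, int k) {
        return factorial(n * k).divide(factorial(k).pow(n));
    }
}
```

For example, `Interleavings.count(2, 10)` is 184756, compared with 2^10 = 1024 from the n^k estimate, which is why pruning preemption points matters.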

Exploring non-determinism: tackling state-space explosion
2. Capturing states
- State caching, using the trace to reach the current state
- For each execution, maintain a partially ordered happens-before graph
- By caching this graph, CHESS avoids exploring the same state redundantly
Speaker notes: two runs with the same partial happens-before graph differ only in the order of independent synchronization operations. So if the program is data-race free, they count as the same run from CHESS's point of view.
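A rough Java sketch of caching by partial order follows. It assumes that every operation is a write to its variable and that each task's own operations are fixed by the program, so the order of accesses per variable is enough to identify the happens-before graph. The `task:variable` step encoding and the `HbCache` class are hypothetical, invented for this illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class HbCache {
    private final Set<Map<String, List<String>>> seen = new HashSet<>();

    // Signature of a schedule: for each variable, the order in which tasks
    // accessed it. Each step is written as "task:variable".
    static Map<String, List<String>> signature(List<String> schedule) {
        Map<String, List<String>> perVariable = new TreeMap<>();
        for (String step : schedule) {
            String[] parts = step.split(":");
            perVariable.computeIfAbsent(parts[1], v -> new ArrayList<>()).add(parts[0]);
        }
        return perVariable;
    }

    // Returns true the first time a partial order is seen, false on repeats.
    boolean isNew(List<String> schedule) {
        return seen.add(signature(schedule));
    }
}
```

Two schedules that only swap operations on different variables, such as `A:x, B:y` and `B:y, A:x`, map to the same signature, so the second one is skipped.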

Exploring non-determinism: monitoring executions
Since CHESS executes the program being tested, it detects all of the following:
- NULL dereferences
- Segmentation faults
- Crashes due to memory corruption
- Deadlock: the set of enabled tasks becomes empty during the run
- Livelock: requires the user to set a timeout for the test
- Data races and non-consistent behavior, at the cost of runtime overhead
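The deadlock criterion (no enabled task while some task is unfinished) can be tried on a toy model in Java. The assumptions are specific to this sketch: each task acquires a fixed list of distinct locks in order and releases them all when it finishes, a step by a blocked task does nothing, and `DeadlockModel` is a hypothetical name.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DeadlockModel {
    // plan: task -> locks it acquires, in order. schedule: which task steps next.
    // Returns true if, after the schedule, some task is unfinished and none is enabled.
    static boolean deadlocks(Map<String, List<String>> plan, List<String> schedule) {
        Map<String, String> owner = new HashMap<>();   // lock -> task holding it
        Map<String, Integer> next = new HashMap<>();   // task -> index of next lock
        for (String task : plan.keySet()) {
            next.put(task, 0);
        }
        for (String task : schedule) {
            List<String> locks = plan.get(task);
            int i = next.get(task);
            if (i >= locks.size()) {
                continue;                              // task already finished
            }
            String lock = locks.get(i);
            if (owner.containsKey(lock)) {
                continue;                              // task is blocked on this lock
            }
            owner.put(lock, task);
            next.put(task, i + 1);
            if (i + 1 == locks.size()) {
                owner.values().removeIf(task::equals); // finished: release all locks
            }
        }
        boolean anyUnfinished = false;
        for (String task : plan.keySet()) {
            List<String> locks = plan.get(task);
            int i = next.get(task);
            if (i < locks.size()) {
                anyUnfinished = true;
                if (!owner.containsKey(locks.get(i))) {
                    return false;                      // this task is still enabled
                }
            }
        }
        return anyUnfinished;
    }
}
```

With task A taking L1 then L2 and task B taking L2 then L1, the schedule A, B, A, B ends with both tasks blocked, while A, A, B, B runs to completion.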

Conclusions CHESS has helped find and reproduce numerous concurrency errors in large applications

Questions?