CHESS Finding and Reproducing Heisenbugs in Concurrent Programs

Name: CHESS Finding and Reproducing Heisenbugs in Concurrent Programs
Uploaded: 2017-10-10T05:53:48+00:00
Duration: PTM17S25
Channel: David Grant
Description: CHESS Finding and Reproducing Heisenbugs in Concurrent Programs

Presented by Julieta Arakelian. Based on a paper by Madanlal Musuvathi, Shaz Qadeer, Tom Ball, Gerard Basler, Piramanayakam Arumuga Nainar, and Iulian Neamtiu.
Speaker note: explain how this relates to the course.

Introduction
- Heisenbugs
- Systematic exploration of program behavior
- Reproduction to enable easy debugging
Speaker note: Heisenbugs are hard to find, and if a bug shows up only once a year, how do we know a fix actually worked?

What does CHESS do?
- Takes control over the scheduling of threads and asynchronous events
- Captures all interleaving non-determinism
- Forces every run to have a different interleaving
- Today, CHESS works on three platforms and has been used to find numerous bugs, more than half of them by Microsoft testers
- Reproduced all reported stress-test crashes in smaller configurations, to ease the debugging process
Speaker notes: CHESS finds all the non-determinism in the program. Explain that some stress tests run for several days on many threads; with CHESS the same bugs were found in less time and with fewer threads.

Example
Reproduction of a heisenbug in CCR, a .NET library for asynchronous concurrent programming.
- The failing test hadn’t failed for many months
- Changes made to reproduce:
  - Commented out all passing tests
  - Changed the harness to run the test once, since CHESS takes care of running the test repeatedly

Example - reproduction
- Reproduced over 27 seconds
- CHESS reported a deadlock after exploring 6737 different thread interleavings
- Since CHESS recorded the scheduling, it was possible to run the failing schedule under a debugger to find the issue and solve it
Speaker note: the bug was that one thread went to perform work queued on a certain port while another thread cancelled all the work on that port. When the first thread tried to continue working on those tasks, an exception was thrown and a deadlock followed.

Challenges
- Avoid perturbing the system under test, and integrate with the existing test infrastructure
- Capture and explore all interleaving non-determinism; understand the semantics of all synchronization functions
- Explore intelligently: avoid redundant search and prioritize potentially erroneous interleavings

Challenges - wrapper
Avoid perturbing the system under test and integrate with the existing test infrastructure:
- Thin wrapper layer between the program under test and the concurrency API
- Doesn’t change the semantics
- Allows CHESS to control thread scheduling
- Works with Win32, .NET and Singularity
From the paper: we use DLL-shimming techniques to redirect all calls to the synchronization library by overwriting the import address table of the program under test. In addition, we use binary instrumentation to insert a call to the wrapper before instructions that use hardware synchronization mechanisms, such as interlocked operations. For .NET programs, we used an extended CLR profiler [11] that replaces calls to API functions with calls to the wrappers at JIT time. Finally, for the Singularity API, we use a static IL rewriter [1] to make the modifications.
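The wrapper idea can be illustrated with a minimal Java sketch. This is not how CHESS hooks Win32 or .NET (it uses the shimming and rewriting techniques quoted above); `SchedulerHook` and `WrappedLock` are hypothetical names, and the sketch only shows a wrapper that reports each call to a scheduler and then delegates unchanged to the real primitive.

```java
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical callback through which the wrapper hands control to a test scheduler. */
interface SchedulerHook {
    void beforeSyncOp(String task, String op);
}

/** Thin wrapper: same semantics as ReentrantLock, plus a scheduling point before each call. */
final class WrappedLock {
    private final ReentrantLock inner = new ReentrantLock();
    private final SchedulerHook hook;

    WrappedLock(SchedulerHook hook) {
        this.hook = hook;
    }

    void lock() {
        hook.beforeSyncOp(Thread.currentThread().getName(), "lock");
        inner.lock(); // semantics unchanged: delegate to the real primitive
    }

    void unlock() {
        hook.beforeSyncOp(Thread.currentThread().getName(), "unlock");
        inner.unlock();
    }

    boolean isHeldByCurrentThread() {
        return inner.isHeldByCurrentThread();
    }
}
```

The program under test keeps calling `lock()` and `unlock()` as before; only the hook sees the extra information.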

Scheduler
Two classes of non-determinism:
1. Input - values provided to the program that affect its execution, such as return values from system calls and the state of memory when the program starts
2. Interleaving - threads running concurrently on a multi-processor, and the timing of events such as preemptions, asynchronous callbacks and timers

Scheduler - goals
- Capture all non-determinism, to be able to reproduce the exact execution by replaying the choices
- Expose the choices to a search engine that systematically enumerates possible executions
Speaker notes: for example, two threads are waiting for a lock; when it is released, either of them can be woken. The log must record which one was woken so the scenario can be reproduced. In addition, the search engine must be told that there were two options here, so that later runs check both of them.
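A minimal sketch of the two goals, assuming a scheduling decision can be represented by task names. `Choice` and `ChoiceLog` are hypothetical names, and picking the first enabled task is an arbitrary stand-in for a real scheduling policy.

```java
import java.util.ArrayList;
import java.util.List;

/** One scheduling decision: which task ran, and which other tasks could have run. */
final class Choice {
    final String chosen;
    final List<String> alternatives;

    Choice(String chosen, List<String> alternatives) {
        this.chosen = chosen;
        this.alternatives = alternatives;
    }
}

final class ChoiceLog {
    private final List<Choice> choices = new ArrayList<>();

    /** Record: pick the first enabled task and log the rest as unexplored alternatives. */
    String record(List<String> enabled) {
        String chosen = enabled.get(0);
        choices.add(new Choice(chosen, new ArrayList<>(enabled.subList(1, enabled.size()))));
        return chosen;
    }

    /** Replay: return the task that was chosen at step i of the recorded run. */
    String replay(int i) {
        return choices.get(i).chosen;
    }

    /** What a search engine would inspect: alternatives not yet taken at step i. */
    List<String> unexplored(int i) {
        return choices.get(i).alternatives;
    }
}
```

For the lock example, `record(List.of("A", "B"))` wakes A and leaves B as an alternative for a later run to try.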

Scheduler - challenges
- Capture and expose ALL non-deterministic choices
- Understand the semantics of all concurrency functions
- Not introduce impossible behaviors
- Not slow down the execution
- Be easily integrated with existing test frameworks
Speaker note: if not every choice is captured, we may fail to reproduce an execution, or never test certain executions at all.

Scheduler Handling input non-determinism
- Full log and replay of non-deterministic input is too expensive to support
- Instead, clean the state of the memory, the disk and the network between runs, so that all tests start from the same state
- Log and replay input values such as the current time, process ids, random numbers and error values from system calls
- This doesn’t guarantee deterministic replay (the search takes care of it, as we will see soon)
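A minimal sketch of logging and replaying a single input source, such as the current time. `InputLog` is a hypothetical name; the real source is passed in as a `LongSupplier` so the sketch stays self-contained. When the log runs out during replay, the sketch falls back to recording.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongSupplier;

final class InputLog {
    private final List<Long> values = new ArrayList<>();
    private int cursor = 0;
    private boolean replaying = false;

    /** Record mode: query the real source and log it. Replay mode: return the logged value. */
    long next(LongSupplier realSource) {
        if (replaying && cursor < values.size()) {
            return values.get(cursor++);
        }
        replaying = false; // record mode, or log exhausted: fall back to recording
        long v = realSource.getAsLong();
        values.add(v);
        return v;
    }

    /** Switch to replay mode, starting from the first logged value. */
    void startReplay() {
        replaying = true;
        cursor = 0;
    }
}
```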

Scheduler Choosing the right abstraction layer
- Redirects calls to concurrency primitives to alternate implementations
- For more complex concurrency primitives, the scheduler treats them as part of the program being tested
- Creates a “happens-before” graph to log and replay
Speaker notes: treating complex implementations as part of the program makes CHESS much easier to implement, but it can also miss executions. For example, a lock that uses a queue to hand out ownership will always be acquired in the same order, so CHESS will not explore other orders that could occur in a future version of the lock where the queue behaves differently.

Scheduler – Logging Lamport’s “happens-before” graph
Captures the relative execution order of the threads in a concurrent execution.
- Nodes - instructions executed. Each node includes:
  - A task: thread, timer callback, threadpool work item
  - A synchronization variable: lock, semaphore, atomically accessed variable, queue
  - An operation
- Edges - determine the execution order
Speaker notes: why is this useful? 1. It covers all the synchronization primitives in the program. 2. It abstracts away timing and keeps only the order.

Scheduler – Logging Lamport’s “happens-before” graph
Two bits for each operation:
- isWrite: the operation changes the state of the resource
- isRelease: the operation unblocks tasks waiting for the resource
In addition, CHESS keeps a set of enabled tasks and a set of tasks waiting for each resource.
Speaker notes: 1. A node gets edges from all earlier nodes on the same lock and edges to all later nodes on the same lock. 2. The CHESS search mechanism needs to know at any given time which threads are free to run. TryLock is marked isWrite only when it succeeds.

Lamport’s “happens-before” graph example
Thread A and Thread B both run:
    public int inc() {
        lock.lock();
        int newCount = ++count;
        lock.unlock();
        return newCount;
    }
Graph nodes shown on the slide:
- Task A, variable lock, isWrite (the call to lock.lock())
- Task A, variable lock, isWrite, isRelease (the call to lock.unlock())
- Task B, variable lock, isWrite
- Task B, variable lock, isWrite, isRelease
Scheduler state over the successive animation steps:
- Enabled tasks: {A, B}, then {A, B}, then {A}
- Waiting for lock: (none), then {B}, then {B}
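A minimal sketch of how such a graph could be built. `HbNode` and `HbGraph` are hypothetical names. The edge rule is an assumption made for this sketch: two nodes are ordered if they belong to the same task, or if they touch the same synchronization variable and at least one of them has isWrite set.

```java
import java.util.ArrayList;
import java.util.List;

final class HbNode {
    final String task, syncVar, op;
    final boolean isWrite, isRelease;

    HbNode(String task, String syncVar, String op, boolean isWrite, boolean isRelease) {
        this.task = task;
        this.syncVar = syncVar;
        this.op = op;
        this.isWrite = isWrite;
        this.isRelease = isRelease;
    }
}

final class HbGraph {
    final List<HbNode> nodes = new ArrayList<>();
    final List<int[]> edges = new ArrayList<>(); // {from, to} indices into nodes

    /** Appends a node and adds an edge from every earlier node it is ordered after. */
    void add(HbNode n) {
        int to = nodes.size();
        for (int from = 0; from < to; from++) {
            HbNode p = nodes.get(from);
            boolean sameTask = p.task.equals(n.task);
            boolean conflict = p.syncVar.equals(n.syncVar) && (p.isWrite || n.isWrite);
            if (sameTask || conflict) {
                edges.add(new int[] {from, to});
            }
        }
        nodes.add(n);
    }

    /** True if there is a direct edge from node `from` to node `to`. */
    boolean ordered(int from, int to) {
        for (int[] e : edges) {
            if (e[0] == from && e[1] == to) {
                return true;
            }
        }
        return false;
    }
}
```

Feeding it the four operations of the `inc()` example (A locks, A unlocks, B locks, B unlocks) orders every pair of nodes, because all four operations write the same lock.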

Scheduler – Logging Capturing the “happens-before” graph
For each call to a synchronization operation, CHESS needs to:
1. Determine if the task will be disabled - if the operation is blocking
2. Label the call - create the node
3. Inform the scheduler when a task is created or terminated
State kept for this purpose:
- Mapping from resource handles to synchronization variables
- Set of threads waiting for each synchronization variable
- Variables for the currently executing task
- Set of enabled tasks

1. Determine if the task will be disabled – if the operation is blocking
- The wrapper function calls the “try” version of the locking function; if it fails, the task is moved to the set of tasks waiting for the resource
- When a release operation is made on the resource, all the waiting tasks are moved back to the set of enabled tasks
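A minimal sketch of this bookkeeping, simulated with task names instead of real threads. `TaskTracker` is a hypothetical name, and lock ownership is modeled with a plain map rather than a real lock.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class TaskTracker {
    final Set<String> enabled = new TreeSet<>();
    final Map<String, Set<String>> waiting = new HashMap<>();
    private final Map<String, String> owner = new HashMap<>();

    void taskCreated(String task) {
        enabled.add(task);
    }

    /** Models the wrapper calling the "try" version: on failure the task becomes a waiter. */
    boolean tryAcquire(String task, String lock) {
        if (!owner.containsKey(lock)) {
            owner.put(lock, task);
            return true;
        }
        enabled.remove(task);
        waiting.computeIfAbsent(lock, k -> new TreeSet<>()).add(task);
        return false;
    }

    /** A release frees the lock and moves every waiter back to the enabled set. */
    void release(String lock) {
        owner.remove(lock);
        Set<String> woken = waiting.remove(lock);
        if (woken != null) {
            enabled.addAll(woken);
        }
    }
}
```

Waking all waiters is conservative: the woken tasks simply retry the "try" call, and all but one fail again.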

2. Label the call – create the node
- Using the saved state, it is easy to obtain the task and the resource
- Setting the bits is done by understanding the semantics of the API call
- When in doubt, both bits are set to true
Speaker note: when it is unclear what a function does, both bits are turned on. Setting isWrite only adds extra edges and setting isRelease only adds unnecessary wake-ups, so correctness is preserved.

3. Inform the scheduler when a task is created or terminated
- CHESS identifies the API functions that create tasks
- The wrapper informs CHESS about the creation, and CHESS creates a closure that wraps the input closure with calls to the CHESS scheduler
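A minimal sketch of closure wrapping in Java, where the "input closure" is a `Runnable`. `TaskEvents` and the event strings are hypothetical; a real scheduler would update its task sets instead of appending to a log.

```java
import java.util.ArrayList;
import java.util.List;

final class TaskEvents {
    final List<String> log = new ArrayList<>();

    /** Wraps the user's closure so the scheduler hears about task creation, start and end. */
    Runnable wrap(String name, Runnable body) {
        log.add("created:" + name);
        return () -> {
            log.add("start:" + name);
            try {
                body.run();
            } finally {
                log.add("end:" + name); // reported even if the body throws
            }
        };
    }
}
```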

Scheduler Capturing data-races by single-threaded execution
Most concurrent programs include data-races, so they would need to be reflected in the happens-before graph. Instead of logging them, the CHESS scheduler enforces single-threaded execution.
Two issues with this approach:
- It slows down the execution. This can be mitigated by running multiple instances of CHESS simultaneously.
- CHESS may not be able to explore both possible outcomes of a data-race. To address this, a data-race detector is run on the first few runs to inform the scheduler about all data races.
Background from the paper: most concurrent programs communicate via shared memory. If the program is data-race free, then all accesses to shared memory are ordered by synchronization operations, and in that case the happens-before graph is enough. Otherwise, missing data-race edges may result in the inability to replay a given execution. One possible solution is to use a dynamic data-race detection tool [46, 9] that captures the outcome of each data-race at runtime. The main disadvantage of this approach is the performance overhead: current data-race detection slows the execution of the program by an order of magnitude. Therefore, we considered this solution too expensive.

Exploring non-determinism
- How does CHESS obtain control at scheduling points?
- How does it systematically drive the test among different schedules?

Exploring non-determinism Basic search operation
Three phases:
1. REPLAY - plays a sequence of scheduling choices from a file. The file is empty on the first iteration; on later iterations it contains a partial schedule found by the search phase of the previous iteration.
2. RECORD - behaves as a fair, non-preemptive scheduler. On a yield, it picks the next thread based on its fairness priorities. It also extends the partial schedule in the trace file by recording which thread was chosen out of all enabled threads each time, together with the set of choices that were available but not chosen.
3. SEARCH - uses the information gathered to determine the schedule for the next iteration. The algorithm for choosing the schedule is shown later.
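The replay, record and search loop can be sketched on a toy model: `tasks` tasks that each perform `steps` steps and never block. `ScheduleExplorer` is a hypothetical name. As a simplification, the record phase here picks the lowest-numbered enabled task rather than using fairness priorities, and the search phase is a plain depth-first backtrack, which is only one possible strategy.

```java
import java.util.ArrayList;
import java.util.List;

final class ScheduleExplorer {
    /** Explores every interleaving of `tasks` tasks with `steps` steps each. */
    static List<List<Integer>> explore(int tasks, int steps) {
        List<List<Integer>> executions = new ArrayList<>();
        List<Integer> prefix = new ArrayList<>(); // empty on the first iteration
        while (prefix != null) {
            int[] done = new int[tasks];
            List<Integer> schedule = new ArrayList<>();
            List<List<Integer>> untried = new ArrayList<>(); // per step: enabled tasks not yet tried
            for (int step = 0; step < tasks * steps; step++) {
                List<Integer> enabled = new ArrayList<>();
                for (int t = 0; t < tasks; t++) {
                    if (done[t] < steps) {
                        enabled.add(t);
                    }
                }
                // REPLAY while the prefix lasts, then RECORD (pick the first enabled task).
                int chosen = step < prefix.size() ? prefix.get(step) : enabled.get(0);
                int at = enabled.indexOf(chosen);
                untried.add(new ArrayList<>(enabled.subList(at + 1, enabled.size())));
                schedule.add(chosen);
                done[chosen]++;
            }
            executions.add(schedule);
            // SEARCH: backtrack to the deepest step that still has an untried alternative.
            prefix = null;
            for (int step = schedule.size() - 1; step >= 0; step--) {
                if (!untried.get(step).isEmpty()) {
                    prefix = new ArrayList<>(schedule.subList(0, step));
                    prefix.add(untried.get(step).get(0));
                    break;
                }
            }
        }
        return executions;
    }
}
```

For two tasks with two steps each, `explore(2, 2)` visits all 6 interleavings, each exactly once.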

Exploring non-determinism Dealing with imperfect replay
The scheduler can fail to replay a trace in two cases:
1. The thread to schedule at a scheduling point is disabled. This happens when a particular resource, such as a lock, was available at this scheduling point in the previous iteration but is currently unavailable.
2. A scheduled thread performs a different sequence of synchronization operations than the one present in the trace. This can happen due to a change in the program control flow resulting from program state that was not reset at the end of the previous iteration.
When either case is detected, the scheduler gives up replay and switches to record. It then tries to replay the same trace again.

Exploring non-determinism Special handling for sources of non-determinism
1. Lazy initialization
- Problem: if the lazy initialization performs synchronization operations, CHESS would fail to see these operations in subsequent iterations
- Solution: run a few iterations of the test in order to initialize all data structures
- Downside: CHESS loses the capability to interleave the lazy-initialization operations with other threads, potentially missing some bugs

2. Interference from environment
- Problem: if the system is part of a larger environment, there could be synchronization variables that are shared with other systems
- Solution:
  - Replay: if a thread appears to be disabled unexpectedly, the scheduler will try to reschedule it a few times before switching to record
  - Record: CHESS will record the thread as disabled although it may not be disabled in the next run. In the worst case this creates a false deadlock; to check whether a deadlock is real, the scheduler tries to schedule all deadlocked threads
Example from the paper: when we run CHESS on Dryad we bring up the entire Cosmos system (of which Dryad is a part) as part of the startup. While we do expect the tester to provide sufficient isolation between the system under test and its environment, it is impractical to require complete isolation. As a simple example, both Dryad and Cosmos share the same logging module, which uses a lock to protect a shared log buffer. When a Dryad thread calls into the logging module, it could potentially interfere with a Cosmos thread that is currently holding the lock.

3. Nondeterministic calls
- Problem: functions such as random() and gettimeofday() return different values in each iteration
- Solution: for random numbers, reseed the generator with a predefined constant so that it returns the same values in all iterations
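A minimal sketch of the reseeding idea using `java.util.Random`. `DeterministicRandom`, `FIXED_SEED` and `runIteration` are hypothetical names, and the seed value is arbitrary.

```java
import java.util.Random;

final class DeterministicRandom {
    static final long FIXED_SEED = 42L; // arbitrary constant, the same for every iteration

    /** Simulates one test iteration that draws `draws` random numbers. */
    static int[] runIteration(int draws) {
        Random rng = new Random(FIXED_SEED); // reseeded at the start of each iteration
        int[] out = new int[draws];
        for (int i = 0; i < draws; i++) {
            out[i] = rng.nextInt(100);
        }
        return out;
    }
}
```

Two calls to `runIteration` with the same argument return the same sequence, so the test sees identical random values on every iteration.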

Exploring non-determinism Ensuring starvation-free schedules
- All fair schedules need to be explored
- Errors found on unfair schedules are not likely to happen in the field
- The scheduler gives lower priority to threads that yield the processor
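A deliberately simplified sketch of lowering the priority of yielding threads. `FairPicker` is a hypothetical name, and counting yields is an assumption made for illustration; the priority scheme CHESS actually uses is not described on this slide.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class FairPicker {
    private final Map<String, Integer> yields = new HashMap<>();

    /** Called when a task yields the processor. */
    void onYield(String task) {
        yields.merge(task, 1, Integer::sum);
    }

    /** Picks the enabled task that has yielded least often; ties go to the earliest in the list. */
    String pick(List<String> enabled) {
        String best = null;
        for (String t : enabled) {
            if (best == null || yields.getOrDefault(t, 0) < yields.getOrDefault(best, 0)) {
                best = t;
            }
        }
        return best;
    }
}
```

With this rule a thread that spins and yields in a loop cannot starve the thread it is waiting for.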

Exploring non-determinism Tackling state-space explosion
The CHESS scheduler is non-preemptive by default, giving it the ability to execute large bodies of code atomically. A real scheduler, however, may preempt a thread at just about any point in its execution, and a non-preemptive scheduler does not model that. CHESS therefore also explores schedules with preemptions, giving priority to schedules with fewer preemptions.
Preemption bounding - many bugs in multithreaded programs are exposed by a few preemptions occurring at particular places in the program execution. Preemptions are allowed at the following points:
1. Calls to synchronization primitives in the concurrency API
2. Accesses to volatile variables that participate in a data race
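To see how much a preemption bound shrinks the search, here is a minimal sketch that counts schedules on a toy model: `tasks` tasks with `steps` steps each, none of which block. `PreemptionBound` is a hypothetical name. A preemption is counted whenever the scheduler switches away from a task that could still run; switching after a task finishes is free.

```java
final class PreemptionBound {
    /** Counts interleavings that use at most `bound` preemptions. */
    static long count(int tasks, int steps, int bound) {
        return count(new int[tasks], steps, -1, bound);
    }

    private static long count(int[] done, int steps, int running, int budget) {
        boolean anyEnabled = false;
        long total = 0;
        for (int t = 0; t < done.length; t++) {
            if (done[t] == steps) {
                continue; // task t has finished
            }
            anyEnabled = true;
            // Switching away from a task that could still run is a preemption.
            boolean preempts = running >= 0 && t != running && done[running] < steps;
            if (preempts && budget == 0) {
                continue; // not allowed under the bound
            }
            done[t]++;
            total += count(done, steps, t, preempts ? budget - 1 : budget);
            done[t]--;
        }
        return anyEnabled ? total : 1; // no task left: one complete schedule
    }
}
```

For two tasks with two steps each, a bound of 0 leaves 2 schedules, a bound of 1 leaves 4, and a bound of 2 covers all 6.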

1. Inserting preemptions prudently: most concurrency bugs happen with few preemptions in the right places.
- Don’t preempt inside system functions or base library modules that are thread-free
- If an access to a volatile variable is between other synchronization accesses, then don’t preempt at this point
Speaker note: with n threads that each perform k steps, there are at least n^k different executions, which is a huge number.

2. Capturing states: state caching, using the trace that reaches the current state.
- Maintain for each execution a partially ordered happens-before graph
- By caching this graph, CHESS avoids exploring the same state redundantly
Speaker note: two executions with the same partial happens-before graph differ only in the order of independent synchronization operations. So if the program is data-race free, CHESS treats them as the same execution.
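A minimal sketch of an order-insensitive fingerprint for a trace. `HbSignature` is a hypothetical name, and the sketch assumes every operation writes its synchronization variable (as lock and unlock do), so the happens-before graph is determined by each operation's position within its task and within the accesses to its variable.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class HbSignature {
    /** Each event is {task, syncVar, op}. Traces with equal signatures have the same graph. */
    static Set<String> of(List<String[]> trace) {
        Map<String, Integer> perTask = new HashMap<>();
        Map<String, Integer> perVar = new HashMap<>();
        Set<String> sig = new TreeSet<>();
        for (String[] e : trace) {
            int taskSeq = perTask.merge(e[0], 1, Integer::sum); // position within the task
            int varSeq = perVar.merge(e[1], 1, Integer::sum);   // position among accesses to the variable
            sig.add(e[0] + "#" + taskSeq + ":" + e[2] + "(" + e[1] + "#" + varSeq + ")");
        }
        return sig;
    }
}
```

A cache can then be a set of signatures: if the signature of the current trace is already in the set, the execution reorders only independent operations and does not need to be explored again.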

Exploring non-determinism Monitoring executions
Since CHESS executes the program being tested, it detects all of the following:
- NULL dereferences
- Segmentation faults
- Crashes due to memory corruption
- Deadlock - the set of enabled tasks becomes empty during the run
- Livelock - requires the user to set a timeout for the test
- Data-races and inconsistent behavior - detecting these adds runtime overhead

Conclusions CHESS has helped find and reproduce numerous concurrency errors in large applications

Questions?
