CHESS : Systematic Testing of Concurrent Programs Madan Musuvathi Shaz Qadeer Microsoft Research.

1 CHESS : Systematic Testing of Concurrent Programs Madan Musuvathi Shaz Qadeer Microsoft Research

2 Testing multithreaded programs is HARD
- Specific thread interleavings expose subtle errors
- Testing often misses these errors
- Even when found, errors are hard to debug
  - No repeatable trace
  - Source of the bug is far away from where it manifests

3 Concurrency is a real problem
- Windows 2000 hot fixes
  - Concurrency errors most common defects among detectable errors
  - Incorrect synchronization and protocol errors most common defects among all coding errors
- Windows Server 2003 late cycle defects
  - Synchronization errors second in the list, next to buffer overruns
- Race conditions can result in security exploits

4 Current practice
- Concurrency testing == Stress testing
- Example: testing a concurrent queue
  - Create 100 threads performing queue operations
  - Run for days/weeks
  - Pepper the code with sleep( random() )
- Stress increases the likelihood of rare interleavings
  - Makes any error found hard to debug

5 CHESS: Unit testing for concurrency
- Example: testing a concurrent queue
  - Create 1 reader thread and 1 writer thread
  - Exhaustively try all thread interleavings
- Run the test repeatedly on a specialized scheduler
  - Explore a different thread interleaving each time
  - Use model checking techniques to avoid redundancy
- Check for assertions and deadlocks in every run
  - The error-trace is repeatable
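To make "exhaustively try all thread interleavings" concrete, here is a minimal sketch that models each thread as a list of atomic steps, enumerates every interleaving, and replays each one from a fresh state. It is not CHESS itself, which drives real Win32 threads; the names `State`, `Step`, `replay`, `all_schedules`, and the racy-increment test are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical model: a shared counter plus one scratch register per thread.
struct State {
    int counter = 0;
    int tmp[2] = {0, 0};
};

using Step = std::function<void(State&)>;
using Thread = std::vector<Step>;
using Schedule = std::vector<int>;  // thread id chosen at each step

// Replays one schedule from a fresh state. Because the schedule fully
// determines the run, any failure it produces is repeatable.
State replay(const std::vector<Thread>& threads, const Schedule& schedule) {
    State s;
    std::vector<std::size_t> pc(threads.size(), 0);
    for (int t : schedule) {
        assert(pc[t] < threads[t].size());
        threads[t][pc[t]++](s);
    }
    return s;
}

// Depth-first enumeration of every interleaving of the threads' steps.
void explore(const std::vector<Thread>& threads, std::vector<std::size_t>& pc,
             Schedule& prefix, std::vector<Schedule>& out) {
    bool done = true;
    for (std::size_t t = 0; t < threads.size(); ++t) {
        if (pc[t] == threads[t].size()) continue;  // thread t has finished
        done = false;
        ++pc[t];
        prefix.push_back(static_cast<int>(t));
        explore(threads, pc, prefix, out);
        prefix.pop_back();
        --pc[t];
    }
    if (done) out.push_back(prefix);
}

std::vector<Schedule> all_schedules(const std::vector<Thread>& threads) {
    std::vector<std::size_t> pc(threads.size(), 0);
    Schedule prefix;
    std::vector<Schedule> out;
    explore(threads, pc, prefix, out);
    return out;
}

// Test scenario: two threads each do a non-atomic increment (load, then store).
std::vector<Thread> racy_increment() {
    auto make = [](int id) {
        return Thread{
            [id](State& s) { s.tmp[id] = s.counter; },
            [id](State& s) { s.counter = s.tmp[id] + 1; }};
    };
    return {make(0), make(1)};
}

// Returns every schedule on which the final check "counter == 2" fails.
std::vector<Schedule> failing_schedules() {
    auto threads = racy_increment();
    std::vector<Schedule> bad;
    for (const Schedule& sch : all_schedules(threads))
        if (replay(threads, sch).counter != 2) bad.push_back(sch);
    return bad;
}
```

Each failing `Schedule` doubles as the error trace: passing it back to `replay` reproduces the same lost update every time.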

6 Systematic Stress Testing Using CHESS
[Diagram: Program (While(not done) { TestScenario() }; TestScenario() { … }) running on CHESS, which sits on the Win32 API above the Kernel: Threads, Scheduler, Synchronization Objects]
- Tester provides a test scenario
- CHESS runs the scenario in a loop
  - Every run takes a different interleaving
  - Every run is repeatable

7 Conditions on Test Scenario
- Test scenario should terminate in all interleavings
- Test scenario should be idempotent
  - Free all resources (handles, memory, …)
  - Clear the hardware state
- Key observation: existing stress tests already have these properties
  - Because they repeatedly run forever

8 Perturb the System as Little as Possible
[Diagram: Program (While(not done){ TestScenario() }; TestScenario(){ … }) on CHESS, on the Win32 API, on the Kernel: Threads, Scheduler, Synchronization Objects]
- Detour Win32 API calls
  - To control and introduce nondeterminism
- Run the system as is
  - On the actual OS, hardware
  - Using system threads, synchronization
- Advantages
  - Avoid reporting false errors
  - Easy to add to existing test frameworks
  - Use existing debuggers

9 Implementation details
- Handle all the Win32 synchronization mechanisms
  - Critical sections, locks, semaphores, events, …
  - Threadpools
  - Asynchronous procedure calls
  - Timers
  - IO completions
- No modification to the kernel scheduler / Win32 library
- CHESS drives the system along a desired interleaving by hijacking the scheduler

10 Controlling the Scheduling Nondeterminism
- Nondeterministic choices for the scheduler
  - Determine when to context switch
  - On context switch, pick the next runnable thread to run
  - On resource release, wake up one of the waiting threads
- Hijack these choices from the scheduler
  - Ensure at most one thread is runnable
  - No thread is waiting on a resource
  - At chosen schedule points, block the current thread while waking the next thread
- Emulate program execution on a uniprocessor with context switches only at synchronization points
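The idea of keeping at most one thread runnable, and blocking the current thread while waking the next, can be sketched with standard C++ threads. This is only an illustration: CHESS does this by detouring Win32 calls, whereas the hypothetical `ScheduleController` below asks each thread to call `acquire` and `release` around every step and follows a schedule fixed in advance.

```cpp
#include <algorithm>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical sketch: lets exactly one thread run at a time, in the order
// given by a fixed schedule of thread ids.
class ScheduleController {
public:
    explicit ScheduleController(std::vector<int> schedule)
        : schedule_(std::move(schedule)) {}

    // Called by thread `id` before each step; blocks until it is scheduled.
    void acquire(int id) {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [&] {
            return pos_ < schedule_.size() && schedule_[pos_] == id;
        });
    }

    // Called after the step; wakes whichever thread is scheduled next.
    void release() {
        {
            std::lock_guard<std::mutex> lock(m_);
            ++pos_;
        }
        cv_.notify_all();
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    std::vector<int> schedule_;
    std::size_t pos_ = 0;
};

// Runs two threads (ids 0 and 1) that each log their id twice.
// Precondition: the schedule contains each id exactly twice; any other
// schedule would leave a thread waiting forever.
std::vector<int> run_under_schedule(const std::vector<int>& schedule) {
    assert(std::count(schedule.begin(), schedule.end(), 0) == 2);
    assert(std::count(schedule.begin(), schedule.end(), 1) == 2);
    ScheduleController ctl(schedule);
    std::vector<int> log;  // only touched by the thread that holds the turn
    auto worker = [&](int id, int steps) {
        for (int i = 0; i < steps; ++i) {
            ctl.acquire(id);
            log.push_back(id);
            ctl.release();
        }
    };
    std::thread t0(worker, 0, 2);
    std::thread t1(worker, 1, 2);
    t0.join();
    t1.join();
    return log;
}
```

Under these assumptions the observed order of steps equals the schedule, whatever the operating system scheduler does, which is what makes a run repeatable.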

11 Partial-order reduction
- Many thread interleavings are equivalent
  - Accesses to separate memory locations by different threads can be reordered
- Avoid exploring equivalent thread interleavings

12 Partial-order reduction in CHESS
- Algorithm:
  - Assume the program is data-race free
  - Context switch only at synchronization points
  - Check for data-races in each execution
- Theorem:
  - If the algorithm terminates without reporting races, then the program has no assertion failures

13 Executions on Multi-cores
- CHESS checks for data-races
- If a Test Scenario manifests a bug on a multi-core machine, then CHESS will
  - Either report a data-race
  - Or the bug
- CHESS systematically enumerates all sequentially consistent executions
- Any data-race free multi-core execution is equivalent to a sequentially consistent execution

14 State space explosion
[Diagram: one thread executes x = 1; y = 1; and another executes x = 2; y = 2;. A tree of (x, y) states rooted at 0,0 branches over every interleaving of the four assignments.]

15 State space explosion
[Diagram: n threads, each executing k steps, e.g. x = 1; … y = 1; and x = 2; … y = 2;]
- n threads, k steps each
- Number of executions = $O(n^{nk})$
- Exponential in both n and k
  - Typically: n < 10, k > 100
- Limits scalability to large programs (large k)
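A small calculation shows how quickly the number of executions grows. The sketch below assumes straight-line threads with no blocking, in which case the exact number of interleavings is the multinomial (nk)! / (k!)^n, which is bounded above by n^(nk). The function names are made up for this example.

```cpp
#include <cassert>
#include <cstdint>

// Binomial coefficient C(n, r), built incrementally so every division is exact.
std::uint64_t binomial(std::uint64_t n, std::uint64_t r) {
    assert(r <= n);
    if (r > n - r) r = n - r;
    std::uint64_t result = 1;
    for (std::uint64_t i = 1; i <= r; ++i)
        result = result * (n - r + i) / i;
    return result;
}

// Number of distinct interleavings of n threads with k steps each:
// (nk)! / (k!)^n, computed as a product of binomials.
// The result overflows 64 bits quickly; use small n and k only.
std::uint64_t count_interleavings(std::uint64_t n, std::uint64_t k) {
    std::uint64_t total = 1;
    for (std::uint64_t i = 1; i <= n; ++i)
        total *= binomial(i * k, k);
    return total;
}
```

For example, two threads of 2 steps each give 6 interleavings, while two threads of 10 steps each already give 184756.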

16 Bounding execution depth
- Works very well for message-passing programs
  - Limit the number of message exchanges
  - Message processing code executed atomically
  - Can go deep in the state space
- Does not work for multithreaded programs
  - Even toy programs can have a large number of steps (shared-variable accesses)

17 Iterative context bounding
- Prioritize executions with small number of preemptions
- Two kinds of context switches:
  - Preemptions: forced by the scheduler, e.g. time-slice expiration
  - Non-preemptions: a thread voluntarily yields, e.g. blocking on an unavailable lock, thread end
[Diagram: one thread runs x = 1; if (p != 0) { x = p->f; } while another runs p = 0;. The schedule shown splits the first thread between the check and the dereference, with the two context switches labeled preemption and non-preemption.]
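The check-then-dereference example on this slide can be modeled without real threads. The sketch below is an assumption-laden illustration: `run` simulates the thread that executes `x = 1; if (p != 0) { x = p->f; }` and lets the other thread's `p = 0;` run either in the middle (a preemption) or after the first thread ends (a non-preempting switch). The crash is reported as a value instead of actually dereferencing a null pointer.

```cpp
#include <cassert>

struct Obj {
    int f = 42;
};

enum class Outcome { Ok, NullDereference };

// Simulates both threads on one core. If `preempt_after_check` is true, the
// second thread's "p = 0;" runs between the null check and the dereference.
Outcome run(bool preempt_after_check) {
    Obj obj;
    Obj* p = &obj;
    int x = 0;

    // First thread.
    x = 1;
    if (p != nullptr) {
        if (preempt_after_check) p = nullptr;  // preemption: second thread runs here
        if (p == nullptr) return Outcome::NullDereference;  // models the crash
        x = p->f;
    }

    // Non-preempting switch: the first thread has ended, so the second runs now.
    if (!preempt_after_check) p = nullptr;

    (void)x;
    return Outcome::Ok;
}
```

The failure needs exactly one preemption; every schedule that only switches when a thread ends is safe.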

18 Iterative context-bounding algorithm
- The scheduler has a budget of c preemptions
  - Nondeterministically choose the preemption points
- Resort to non-preemptive scheduling after c preemptions
- Once all executions explored with c preemptions
  - Try with c+1 preemptions
- Iterative context-bounding has desirable properties
  - Property 0: Easy to implement
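A minimal sketch of the search, under simplifying assumptions: threads are lists of atomic steps that never block, so the only free context switch is at thread end, and each round explores executions with at most c preemptions (re-exploring the cheaper ones). The model of the check-then-dereference example and all names below are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical model of the example "if (p != 0) { x = p->f; }" versus "p = 0;".
struct State {
    bool p_valid = true;
    bool checked = false;
    bool crashed = false;
    int x = 0;
};

using Step = std::function<void(State&)>;
using Thread = std::vector<Step>;

// Depth-first search over schedules that use at most `budget` preemptions.
// `current` is the thread that ran last (-1 before the first step).
bool bug_reachable(const std::vector<Thread>& threads,
                   std::vector<std::size_t>& pc, State s, int current,
                   int budget) {
    if (s.crashed) return true;
    bool current_enabled =
        current >= 0 && pc[current] < threads[current].size();
    for (std::size_t t = 0; t < threads.size(); ++t) {
        if (pc[t] == threads[t].size()) continue;  // thread t has finished
        // Leaving a thread that could still run is a preemption; switching
        // after it has ended is free.
        int cost = (current_enabled && static_cast<int>(t) != current) ? 1 : 0;
        if (cost > budget) continue;
        State next = s;
        threads[t][pc[t]](next);
        ++pc[t];
        bool found = bug_reachable(threads, pc, next, static_cast<int>(t),
                                   budget - cost);
        --pc[t];
        if (found) return true;
    }
    return false;
}

// Tries c = 0, 1, ..., max_bound and returns the first bound that exposes
// the bug, or -1 if none does.
int min_preemptions_to_bug(const std::vector<Thread>& threads, int max_bound) {
    assert(max_bound >= 0);
    for (int c = 0; c <= max_bound; ++c) {
        std::vector<std::size_t> pc(threads.size(), 0);
        if (bug_reachable(threads, pc, State{}, -1, c)) return c;
    }
    return -1;
}

std::vector<Thread> null_check_example() {
    Thread a = {
        [](State& s) { s.x = 1; },
        [](State& s) { s.checked = s.p_valid; },  // if (p != 0)
        [](State& s) {                            // x = p->f;
            if (!s.checked) return;
            if (!s.p_valid) s.crashed = true;
            else s.x = 42;
        },
    };
    Thread b = {[](State& s) { s.p_valid = false; }};  // p = 0;
    return {a, b};
}
```

Because the bound grows one step at a time, the first bound that succeeds is also the smallest number of preemptions needed to reach the bug in this model.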

19 Property 1: Polynomial state space
- Terminating program with fixed inputs and deterministic threads
- n threads, k steps each, c preemptions
- Number of executions <= $\binom{nk}{c} \cdot (n+c)! = O((n^2 k)^c \cdot n!)$
- Exponential in n and c, but not in k
[Diagram: threads x = 1; … y = 1; and x = 2; … y = 2; split at the chosen preemption points into atomic blocks]
- Choose c preemption points
- Permute n+c atomic blocks
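Plugging numbers into the bound from this slide shows the polynomial dependence on k. The helper names below are made up for this sketch, and the arithmetic uses 64-bit integers, so it is only meaningful for small n and c.

```cpp
#include <cassert>
#include <cstdint>

// Binomial coefficient C(n, r), built incrementally so every division is exact.
std::uint64_t binomial(std::uint64_t n, std::uint64_t r) {
    assert(r <= n);
    if (r > n - r) r = n - r;
    std::uint64_t result = 1;
    for (std::uint64_t i = 1; i <= r; ++i)
        result = result * (n - r + i) / i;
    return result;
}

std::uint64_t factorial(std::uint64_t m) {
    std::uint64_t result = 1;
    for (std::uint64_t i = 2; i <= m; ++i) result *= i;
    return result;
}

// Upper bound from the slide: choose c preemption points among the nk steps,
// then permute the resulting n + c atomic blocks.
std::uint64_t context_bounded_limit(std::uint64_t n, std::uint64_t k,
                                    std::uint64_t c) {
    return binomial(n * k, c) * factorial(n + c);
}
```

With n = 2 and c = 2, going from k = 100 to k = 200 raises the bound from 477600 to 1915200, roughly a factor of four, which matches growth on the order of k^c.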

20 Property 2: Deep exploration possible with small bounds
- A context-bounded execution has unbounded depth
  - A thread may execute an unbounded number of steps within each context
- Even a context-bound of zero yields complete terminating executions

21 Property 3: Finds the simplest error trace
- Finds smallest number of preemptions to the error
- Number of preemptions better metric of error complexity than execution length

22 Property 4: Coverage metric
- If search terminates with context-bound of c, then any remaining error must require at least c+1 preemptions
- Intuitive estimate for
  - The complexity of the bugs remaining in the program
  - The chance of their occurrence in practice

23 Property 5: Lots of bugs with small number of preemptions
- A non-blocking implementation of the work-stealing queue algorithm
  - Bounded circular buffer accessed concurrently by readers and stealers
- Developer provided test harness
  - Three buggy variations of the program
- Each bug found with at most 2 preemptions
  - Executions with 35 preemptions are possible!

24 Context-bounding + Partial-order reduction
- Algorithm:
  - Assume the program is data-race free
  - Context switch only at synchronization points
  - Explore executions with c preemptions
  - Check for data-races in each execution
- Theorem:
  - If the algorithm terminates without reporting races, then the program has no assertion failures reachable with c preemptions
  - Requires that a thread can block only at synchronization points
- Proof (Musuvathi-Q, PLDI 2007)

25 Bugs found
[Table: columns Program, KLOC, Max Num Threads, Bugs Reachable with Preemption Count (0, 1, 2, 3), Total; rows Bluetooth, Work-Stealing Queue, Transaction Manager, APE, Dryad Channels]

26
    // Function called by a worker thread
    // of RChannelReaderImpl
    void RChannelReaderImpl::AlertApplication(RChannelItem* item)
    {
        // Notify Application
        // XXX: Preempt here for the bug
        EnterCriticalSection(&m_baseCS);
        // process before exit
        LeaveCriticalSection(&m_baseCS);
    }

    // Function called by the main thread
    void TestChannel(WorkQueue* workQueue, ...)
    {
        // Creating a channel
        // allocates worker threads
        RChannelReader* channel =
            new RChannelReaderImpl(..., workQueue);
        // ... do work here
        channel->Close();
        // wrong assumption that channel->Close()
        // waits for worker threads to be finished
        delete channel;
        // BUG: deleting the channel when
        // worker threads still have a valid
        // reference to the channel
    }

31 Facts about Dryad error trace
- Long error trace but requires only one preemption
  - Depth-bounding cannot find it without a lot of luck
- The error trace has 6 non-preempting context switches
  - It is important to leave unbounded the number of non-preempting context switches
- This (and the other 6 errors) in Dryad remained in spite of careful regression testing and months of production use

33 Coverage vs. Context-bound

34 Dryad (coverage vs. time)

35 Current CHESS applications (work in progress)
- Dryad (library for distributed dataflow programming)
- Singularity/Midori (OS in managed code)
- User-mode drivers
- Cosmos (distributed file system)
- SQL database

36 Conclusion
- Concurrency is important
- Building robust concurrent software is still a challenge
  - Lack of debugging and testing tools
- CHESS: Concurrency unit-testing
  - Exhaustively try all interleavings
  - Attempt to seamlessly integrate with existing test frameworks
  - Provide replay capability
- Iterative context-bounding algorithm key to the design

