Chris Fallin, David Lewis, Zongwei Zhou Date & location of presentation Deterministic Multiprocessing.


1 Chris Fallin, David Lewis, Zongwei Zhou Date & location of presentation Deterministic Multiprocessing

2 What is Deterministic MP?
- A multiprocessor executes multiple threads.
- Threads share resources (i.e., memory).
- Because of bus arbiters, memory controllers, etc., the order of some accesses to shared resources is undefined.
- This is a problem for debugging (reproducibility) and for thorough testing (many possible cases).
- Deterministic: same input → same output.

3 Types of Determinism
- Strong: same input → same output, regardless of race conditions. Must capture all communicating memory access pairs.
- Weak: same input → same output, as long as locking is correct. Takes advantage of locks for low software overhead.

4 Types of Deterministic Execution
- Record/Replay: HW/SW keeps a log of program input.
  - Single-program: system calls, memory interleavings.
  - Full-system: interrupts, I/O, etc.
  - The log allows later replay of a bug.
  - However, separate executions may still differ outside of replay.
- Full-time: the ordering of memory accesses follows a statically defined deterministic order, so for the same program and the same input, the output is always the same.
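The record/replay idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `InputLog`, `program`, and the use of `random.randint` as the nondeterministic source are hypothetical names chosen for the example, not part of any system discussed in these slides.

```python
import random


class InputLog:
    """Hypothetical log of nondeterministic inputs (illustrative only)."""

    def __init__(self):
        self.entries = []
        self.cursor = 0

    def record(self, source):
        # Recording run: take a value from the real source and log it.
        value = source()
        self.entries.append(value)
        return value

    def replay(self):
        # Replay run: return the logged values in the original order.
        value = self.entries[self.cursor]
        self.cursor += 1
        return value


def program(next_input):
    # Toy program whose output depends on three nondeterministic inputs.
    return sum(next_input() for _ in range(3))


log = InputLog()
recorded = program(lambda: log.record(lambda: random.randint(0, 100)))
replayed = program(log.replay)
```

The replayed run consumes the log instead of the live source, so it reproduces the recorded result. A fresh run without the log may still differ, which is the limitation noted on the slide.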

5 DMP: Deterministic Shared Memory Multiprocessing
Devietti, Lucia, Ceze, Oskin

6 Central Idea
To guarantee deterministic behavior:
- The direct way is to preserve the same global interleaving of instructions in every execution of a parallel program.
- This is unnecessary and has a significant performance impact.
Insight: only communicating pairs matter.

7 Improve a bit...
- Not every memory access is communicating, so the communication-free portion of each quantum can run in parallel.
- This requires knowing when communication happens; the MESI cache coherence protocol provides this for free.
- DMP Sharing Table:
  - tracks information about memory ownership
  - two ownership change possibilities:
    - reading data owned by others
    - writing data to shared memory
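A sharing table of this kind might be modeled as a map from address to owner. The sketch below is an assumption-laden illustration, not the DMP hardware design: the names `SHARED`, `is_communication`, and `apply_ownership_change` are hypothetical, untracked addresses are assumed to be shared, and the state transitions (a read makes data shared, a write makes it private to the writer) are one plausible reading of the two ownership changes listed on the slide.

```python
SHARED = "shared"  # hypothetical marker for data not owned by a single thread


def is_communication(table, tid, addr, is_write):
    """Return True if this access may communicate with another thread."""
    owner = table.get(addr, SHARED)  # assumption: untracked data is shared
    if owner == tid:
        return False      # private data owned by this thread: no communication
    if owner == SHARED:
        return is_write   # reading shared data is free; writing it is not
    return True           # data owned by another thread


def apply_ownership_change(table, tid, addr, is_write):
    """Assumed transition, applied once the access is allowed to proceed."""
    table[addr] = tid if is_write else SHARED


table = {0x10: 0, 0x20: SHARED, 0x30: 1}
checks = [
    is_communication(table, 0, 0x10, True),   # own data
    is_communication(table, 0, 0x20, False),  # read of shared data
    is_communication(table, 0, 0x20, True),   # write to shared data
    is_communication(table, 0, 0x30, False),  # read of data owned by thread 1
]
apply_ownership_change(table, 0, 0x30, False)  # the read makes 0x30 shared
```

Only accesses for which `is_communication` returns True would need to be ordered deterministically; everything else could run in parallel within the quantum.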

8 Improve a bit more...
- Transactional Memory + deterministic commit order.
- TM: atomicity and isolation of each quantum.
- Speculation: find quanta not involved in communication.
- If communication happens, squash and re-execute.
- Potential optimization: forwarding uncommitted (speculative) data between quanta could save a large number of squashes.
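The commit rule can be illustrated with a toy model. It assumes that all quanta in a round speculate against the same memory snapshot, that they commit in list order, and that a squashed quantum re-executes at its commit turn with the same write set. The function name `commit_in_order` and the read/write-set representation are hypothetical.

```python
def commit_in_order(quanta):
    """quanta: list of (name, read_set, write_set) in deterministic commit order.

    Returns the names of quanta that must be squashed and re-executed because
    they read data written by a quantum that commits before them.
    """
    written = set()
    squashed = []
    for name, reads, writes in quanta:
        if reads & written:
            squashed.append(name)  # speculated on stale data: squash
        # The quantum commits at its turn (after re-execution if squashed),
        # so the commit order is always the list order.
        written |= writes
    return squashed


quanta = [
    ("T0", {"a"}, {"x"}),
    ("T1", {"x"}, {"y"}),  # reads x, which T0 writes: communication
    ("T2", {"b"}, {"z"}),  # touches nothing written by T0 or T1
]
squashed = commit_in_order(quanta)
```

In this example only T1 is squashed. Forwarding T0's uncommitted value of `x` to T1, the optimization mentioned on the slide, would avoid that squash.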

9 Performance

10 Discussion
- Speculation
  - Similar idea to TLS, but used for the opposite purpose.
  - Requires complex hardware.
  - I/O and parts of the OS cannot execute speculatively.
- Dealing with nondeterminism
  - Threads can use the OS to communicate.
  - Nondeterministic OS API calls, e.g. read.
- Better way of token-passing?

11 Kendo: Efficient Deterministic Multithreading in Software
Olszewski, Ansel, Amarasinghe

12 Definitions
- Strong Determinism
  - Deterministic order of memory accesses to shared data for a particular program input.
  - ALWAYS produces the same output for every run with a particular input.
  - Not easily provided without hardware support.
- Weak Determinism
  - Deterministic order of lock acquisitions for a given program input.
  - Produces the same output for every run if the program is race-free.
  - Can be guaranteed if all accesses to shared data are protected by locks.
- If there are no data races, strong and weak determinism provide the same guarantees!

13 Introducing Kendo
- Software framework to enforce weak determinism of general lock-based C/C++ code on commodity shared-memory multiprocessors.
- No special hardware necessary!
- Deterministic Logical Time
  - Each thread has its own monotonically increasing deterministic logical clock.
  - How to implement? Performance counter events?
- When is it a thread T's turn to use a lock?
  - All threads with tid < T have greater logical clocks.
  - All threads with tid > T have greater or equal logical clocks.
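The turn rule can be written as a small predicate. This is a sketch in Python rather than Kendo's C/C++ code; the name `is_turn` and the representation of clocks as a list indexed by thread ID are assumptions made for the example.

```python
def is_turn(tid, clocks):
    """Return True if thread `tid` holds the turn.

    clocks[i] is the deterministic logical clock of thread i. Ties between
    equal clocks are broken in favor of the lower thread ID.
    """
    mine = clocks[tid]
    for other, clock in enumerate(clocks):
        if other < tid and not clock > mine:
            return False  # a lower-ID thread has not moved strictly past us
        if other > tid and not clock >= mine:
            return False  # a higher-ID thread is still behind us
    return True


clocks = [5, 3, 3]
turns = [is_turn(t, clocks) for t in range(len(clocks))]
```

With clocks `[5, 3, 3]`, threads 1 and 2 tie at the minimum and the tie goes to thread 1, so exactly one thread holds the turn.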

14 Simple Locking Mechanism

function det_mutex_lock(l) {
    pause_logical_clock();
    wait_for_turn();
    lock(l);
    inc_logical_clock();
    resume_logical_clock();
}

function det_mutex_unlock(l) {
    unlock(l);
}

- Simple algorithm for implementing locks.
- Pause the logical clock during acquisition and wait for the turn to access the lock (using the rule on the previous slide).
- Once in the critical section, resume the clock and continue.
- Pros: easy to implement.
- Problems?

15 Improved Lock

function det_mutex_lock(l) {
    pause_logical_clock();
    while (true) {                        // Loop until we have successfully acquired the lock
        wait_for_turn();                  // Wait for our deterministic logical clock to be the unique global minimum
        if (try_lock(l)) {                // Check the state of the lock, acquiring it if it is free
            if (l.released_logical_time >= get_logical_clock()) {
                // Lock is free in physical time, but still acquired in
                // deterministic logical time, so we cannot acquire it yet
                unlock(l);                // Release the lock
            } else {
                // Lock is free in both physical and deterministic logical
                // time, so it is safe to exit the spin loop
                break;
            }
        }
        inc_logical_clock();              // Increment our deterministic logical clock and start over
    }
    inc_logical_clock();                  // Increment our deterministic logical clock before exiting
    resume_logical_clock();
}

function det_mutex_unlock(l) {
    pause_logical_clock();
    l.released_logical_time = get_logical_clock();
    unlock(l);
    inc_logical_clock();
    resume_logical_clock();
}
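The improved lock can be tried out with ordinary Python threads. The sketch below is a toy translation, not Kendo itself: logical clocks are plain integers advanced by an artificial amount of "work" (an assumption standing in for performance counters), pausing and resuming the clock is implicit because nothing else advances it, and `Runtime`, `DetLock`, and `run_once` are hypothetical names. A finished thread sets its clock to infinity so it never blocks the others.

```python
import threading
import time

INF = float("inf")


class DetLock:
    def __init__(self):
        self.mutex = threading.Lock()
        self.released_logical_time = -1  # assumption: "never released yet"


class Runtime:
    def __init__(self, n_threads):
        self.clocks = [0] * n_threads  # one logical clock per thread

    def is_turn(self, tid):
        mine = self.clocks[tid]
        for other, clock in enumerate(self.clocks):
            if other < tid and not clock > mine:
                return False
            if other > tid and not clock >= mine:
                return False
        return True

    def wait_for_turn(self, tid):
        while not self.is_turn(tid):
            time.sleep(0)  # yield so other threads can advance their clocks

    def det_mutex_lock(self, tid, l):
        while True:
            self.wait_for_turn(tid)
            if l.mutex.acquire(blocking=False):  # plays the role of try_lock
                if l.released_logical_time >= self.clocks[tid]:
                    l.mutex.release()  # free physically, not yet logically
                else:
                    break              # free in both senses: keep the lock
            self.clocks[tid] += 1      # failed attempt: advance and retry
        self.clocks[tid] += 1

    def det_mutex_unlock(self, tid, l):
        l.released_logical_time = self.clocks[tid]
        l.mutex.release()
        self.clocks[tid] += 1


def run_once(n_threads=3, rounds=4):
    rt, lock, order = Runtime(n_threads), DetLock(), []

    def worker(tid):
        try:
            for _ in range(rounds):
                rt.clocks[tid] += tid + 1  # deterministic stand-in for work
                rt.det_mutex_lock(tid, lock)
                order.append(tid)          # critical section
                rt.det_mutex_unlock(tid, lock)
        finally:
            rt.clocks[tid] = INF           # finished threads never block others

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return order
```

Calling `run_once()` repeatedly should yield the same acquisition order each time, because that order depends only on the logical clocks and thread IDs, not on how the operating system schedules the threads.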

16 Optimizations
- Queuing: a queue for each lock guarantees first-come, first-served acquisition.
- Fast-forwarding: while waiting for a lock, a thread can set its logical time to lock.released_logical_time (or +1 if queuing).
- Lazy reads: if the application can read out-of-date shared data (e.g. when finding a "best" value), there is no need to lock on read. Provide a read window (in logical time); if all threads are past the earliest allowable logical time, the read can succeed.

17 Results

18 Capo: A Software-Hardware Interface for Practical Deterministic Multiprocessor Replay
Montesinos, Hicks, King, Torrellas

19 Capo: Motivation
- Record/replay system for debugging; not intended to be deployed in the field.
- Builds on DeLorean [1]
  - Chunk-based record/replay system.
  - Terminates chunks at communicating pairs and records chunk commit order only.
  - Only half the story.
- Capo adds the software side as a Linux implementation:
  - Records syscall results.
  - Provides infrastructure to record/replay multiple programs and multiplex the hardware record/replay features.
[1] P. Montesinos, L. Ceze, and J. Torrellas, “DeLorean: Recording and Deterministically Replaying Shared-Memory Multiprocessor Execution Efficiently,” in ISCA, June 2008.

20 Capo's Contributions
- Replay Spheres: distinct realms of record/replay.
- Definition of the hardware-software interface.
- Simulated DeLorean hardware (chunk-based recording).
- Linux kernel modifications.

21 Capo Architecture
- Replay Sphere: a set of R-threads; an isolated environment. An arbitrary set of processes is inside a sphere.
- Replay Sphere Manager: multiplexes HW support over spheres.
- HW: records chunk commit order (DeLorean).
- SW: records system calls.
- The OS is not inside the sphere, except copy_to_user().
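The division of labor between spheres, the manager, and the two logs might be modeled as follows. This is a hypothetical data-structure sketch, not Capo's kernel interface: `ReplaySphere`, `ReplaySphereManager`, and the callback names `on_chunk_commit` and `on_syscall_return` are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ReplaySphere:
    """Hypothetical model of one replay sphere and its two logs."""
    r_threads: set
    chunk_commit_log: list = field(default_factory=list)  # hardware side
    syscall_log: list = field(default_factory=list)       # software side


class ReplaySphereManager:
    """Routes recording events to the sphere that owns the thread."""

    def __init__(self):
        self.sphere_of = {}  # thread ID -> ReplaySphere

    def create_sphere(self, tids):
        sphere = ReplaySphere(set(tids))
        for tid in tids:
            self.sphere_of[tid] = sphere
        return sphere

    def on_chunk_commit(self, tid):
        sphere = self.sphere_of.get(tid)
        if sphere is not None:  # threads outside every sphere are not recorded
            sphere.chunk_commit_log.append(tid)

    def on_syscall_return(self, tid, name, result):
        sphere = self.sphere_of.get(tid)
        if sphere is not None:
            sphere.syscall_log.append((tid, name, result))


mgr = ReplaySphereManager()
a = mgr.create_sphere([1, 2])
b = mgr.create_sphere([3])
for tid in [1, 3, 2, 1, 9]:  # thread 9 belongs to no sphere
    mgr.on_chunk_commit(tid)
mgr.on_syscall_return(2, "read", 42)
```

Each sphere ends up with its own chunk commit order and syscall results, so two programs recorded at the same time could be replayed independently.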

22 Hardware Details

23 Performance Record Replay

24 Log Size

25 Summary (Devietti et al)

26 Discussion
- Which is more useful: record/replay or full-time?
  - Debugging only vs. system design philosophy.
  - Tradeoff: cost (log size, overhead) vs. utility.
- Strong vs. weak determinism
  - Race conditions are an important class of bugs.

