Presentation is loading. Please wait.

Presentation is loading. Please wait.

CoreDet: A Compiler and Runtime System for Deterministic Multithreaded Execution Tom Bergan Owen Anderson, Joe Devietti, Luis Ceze, Dan Grossman To appear.

Similar presentations


Presentation on theme: "CoreDet: A Compiler and Runtime System for Deterministic Multithreaded Execution Tom Bergan Owen Anderson, Joe Devietti, Luis Ceze, Dan Grossman To appear."— Presentation transcript:

1 CoreDet: A Compiler and Runtime System for Deterministic Multithreaded Execution Tom Bergan Owen Anderson, Joe Devietti, Luis Ceze, Dan Grossman To appear at ASPLOS 2010

2 2CoreDet Goal: Execution-level determinism... of arbitrary, unmodified C/C++ pthreads programs without special hardware without sacrificing scalability Implementation: a compiler pass in LLVM, plus a custom runtime library

3 3CoreDet Goal: Execution-level determinism... of arbitrary, unmodified C/C++ pthreads programs without special hardware without sacrificing scalability Implementation: a compiler pass in LLVM, plus a custom runtime library

4 4CoreDet An implementation of DMP in software DMP-Ownership A new protocol DMP-Buffering

5 5CoreDet An implementation of DMP in software DMP-Ownership straightforward to implement poorer scalability, fewer overheads DMP-Buffering better scalability, more overheads no speculation key insight: relaxed memory consistency (specifically, Weak Ordering) A new protocol

6 6CoreDet An implementation of DMP in software DMP-Ownership straightforward to implement poorer scalability, fewer overheads DMP-Buffering better scalability, more overheads no speculation (easier to implement than TM) key insight: relaxed memory consistency (specifically, Weak Ordering) A new protocol

7 7 CoreDet: Implementation A compiler instruments the code with calls to the runtime static optimizations to remove instrumentation A runtime library scheduling threads tracks interthread communication deterministic wrappers for...  pthreads  malloc

8 8Outline DMP-Ownership in Software Why not DMP-TM in Software? DMP-Buffering A Few Implementation Details Performance

9 9DMP-Serial a := x a := x b := y b := y x := a * 2 x := a * 2 y := a + b y := a + b Thread 1 c := x c := x d := y d := y x := a * 3 x := a * 3 y := a - b y := a - b Thread 2

10 10DMP-Serial a := x a := x b := y b := y x := a * 2 x := a * 2 y := a + b y := a + b Thread 1 c := x c := x d := y d := y x := a * 3 x := a * 3 y := a - b y := a - b Thread 2 quantum

11 11DMP-Serial end of round time T1T1 T3T3 T2T2 quantum Execution is completely serialized

12 12DMP-Ownership Parallel Serial Parallel mode: no communication (can write only to private data) Serial mode: arbitrary communication end of round time T1T1 T3T3 T2T2 MOT x owned-by T 1 y shared z owned-by T 2 :::: ::::

13 13 DMP-Ownership in Software T1T1 T3T3 T2T2 x owned-by T 1 y shared z owned-by T 2 :::: :::: Quantum Formation Count instructions: e.g. at the end of each basic block Requirements:

14 14 DMP-Ownership in Software T1T1 T3T3 T2T2 x owned-by T 1 y shared z owned-by T 2 :::: :::: Quantum Formation Count instructions: e.g. at the end of each basic block Ownership Tracking Instrument every load/store Manipulate an MOT (in memory) via a runtime library Requirements:

15 15 DMP-Ownership in Software We have implemented this in CoreDet.... but the scalability is not that great splash mean parsec mean

16 16Outline DMP-Ownership in Software Why not DMP-TM in Software? DMP-Buffering A Few Implementation Details Performance

17 17DMP-Serial end of round time T1T1 T3T3 T2T2 quantum Execution is completely serialized

18 18DMP-TM end of round time T1T1 T3T3 T2T2 Execution is parallel and transactional commit implicit transactions

19 19 DMP-TM in Software Quantum Formation Count instructions Requirements: T1T1 T3T3 T2T2

20 20 DMP-TM in Software Quantum Formation Count instructions Transactional Execution Instrument every load/store Use an off-the-shelf STM with minor modifications Requirements: This is really hard! T1T1 T3T3 T2T2

21 21 DMP-TM: Why not STM? STMs make the wrong assumptions: 1)Most code is not transactional Transactions are short Transactions are scoped void foo() {... begin_transaction() return } An unscoped transaction:

22 22 DMP-TM: Why not STM? STMs make the wrong assumptions. 1)Most code is not transactional DMP-TM Reality: All code is transactional Transactions are short DMP-TM Reality: Transactions should be long to amortize the cost of commit T1T1 T3T3 T2T2

23 23 DMP-TM: Why not STM? STMs make the wrong assumptions. 3)Transactions are scoped DMP-TM Reality: Transactions are not scoped We cannot form balanced quanta with scoped transactions Imbalance leads to poor performance T1T1 T3T3 T2T2

24 24 DMP-TM: Why not STM? STMs make the wrong assumptions. 3)Transactions are scoped DMP-TM Reality: Transactions are not scoped When transactions are not scoped, we must be prepared to rollback the entire callstack! void foo() {... begin_transaction() return }

25 25 ➡ Speculation makes things hard ➡ Good scalability because of versioned memory (versions are modified in parallel) DMP-TM: What Can We Learn? DMP-Buffering: Use versioned memory without speculation

26 26Outline DMP-Ownership in Software Why not DMP-TM in Software? DMP-Buffering A Few Implementation Details Performance

27 27DMP-Buffering Parallel mode: no communication (versioned memory via store buffering) time T1T1 T3T3 T2T2 Parallel Global Memory (read only) Thread x :=.. y.. x :=.. y.. Store Buffer

28 28DMP-Buffering Parallel mode: no communication (versioned memory via store buffering) Commit mode: deterministically (serially) publish local store buffers Global Memory...... Store Buffer Thread time T1T1 T3T3 T2T2 Parallel Commit stores are reordered x=..

29 29DMP-Buffering Parallel mode: no communication (versioned memory via store buffering) Global Memory (read/write) lock(x) lock(x) Store Buffer Thread end of round time T1T1 T3T3 T2T2 Parallel Serial Commit Commit mode: deterministically (serially) publish local store buffers Serial mode: used for synchronization (e.g. atomic ops)

30 30DMP-Buffering T1T1 T3T3 T2T2 Parallel mode: buffer stores locally Commit mode: publish local store buffers Serial mode: used for synchronization (e.g. atomic ops) ends at synchronization (atomic ops and fences), and quantum boundaries happens serially for determinism executes in parallel for performance

31 31DMP-Buffering A = 1 A = 1 if (B == 0) if (B == 0)...... B = 1 B = 1 if (A == 0) if (A == 0)...... Thread 1Thread 2 Dekker’s Algorithm (there is a data race)

32 32DMP-Buffering A = buffer[A] A = buffer[A] B = buffer[B] B = buffer[B] Thread 1Thread 2 buffer[A] = 1 buffer[A] = 1 if (B == 0) if (B == 0)...... buffer[B] = 1 buffer[B] = 1 if (A == 0) if (A == 0)......

33 33DMP-Buffering A = buffer[A] A = buffer[A] B = buffer[B] B = buffer[B] Thread 1Thread 2 parallel commit buffer[A] = 1 buffer[A] = 1 if (B == 0) if (B == 0)...... buffer[B] = 1 buffer[B] = 1 if (A == 0) if (A == 0)...... This is deterministic... reordered

34 34DMP-Buffering A = buffer[A] A = buffer[A] B = buffer[B] B = buffer[B] Thread 1Thread 2 parallel commit buffer[A] = 1 buffer[A] = 1 if (B == 0) if (B == 0)...... buffer[B] = 1 buffer[B] = 1 if (A == 0) if (A == 0)......... but not sequentially consistent (cycle in the happens-before graph)

35 35DMP-Buffering Thread 1Thread 2 A = 1 A = 1 tmp 1 = B tmp 1 = B if (tmp 1 == 0) if (tmp 1 == 0)...... B = 1 B = 1 tmp 2 = A tmp 2 = A if (tmp 2 == 0) if (tmp 2 == 0)...... Dekker’s Algorithm (again) Let’s remove the data race...

36 36DMP-Buffering A = 1 A = 1 tmp 1 = B tmp 1 = B Thread 1Thread 2 lock(L) lock(L) unlock(L) unlock(L) lock(L) lock(L) unlock(L) unlock(L) if (tmp 1 == 0) if (tmp 1 == 0)...... B = 1 B = 1 tmp 2 = A tmp 2 = A if (tmp 2 == 0) if (tmp 2 == 0)...... Dekker’s Algorithm (no data race)

37 37DMP-Buffering A = 1 A = 1 tmp 1 = B tmp 1 = B lock(L) lock(L) unlock(L) unlock(L) lock(L) lock(L) unlock(L) unlock(L) if (tmp 1 == 0) if (tmp 1 == 0)...... B = 1 B = 1 tmp 2 = A tmp 2 = A if (tmp 2 == 0) if (tmp 2 == 0)...... parallel + commit serial parallel + commit serial parallel + commit Synchronization happens sequentially

38 DMP-Buffering A = 1 A = 1 tmp 1 = B tmp 1 = B lock(L) lock(L) unlock(L) unlock(L) lock(L) lock(L) unlock(L) unlock(L) if (tmp 1 == 0) if (tmp 1 == 0)...... B = 1 B = 1 tmp 2 = A tmp 2 = A if (tmp 2 == 0) if (tmp 2 == 0)...... parallel + commit serial parallel + commit serial parallel + commit Synchronization happens sequentially Data race free programs are sequentially consistent (required by C++ and Java memory models)

39 39DMP-Buffering Atomic ops must happen in serial mode CAS(X,a,b) CAS(X,a,b) CAS(X,a,c) CAS(X,a,c)

40 40DMP-Buffering Atomic ops must happen in serial mode tmp = x tmp = x if (tmp == a) if (tmp == a) x = b x = b tmp = x tmp = x if (tmp == a) if (tmp == a) x = c x = c

41 41DMP-Buffering Atomic ops must happen in serial mode tmp = x tmp = x if (tmp == a) if (tmp == a) x = b x = b tmp = x tmp = x if (tmp == a) if (tmp == a) x = c x = c parallel commit not atomic Synchronization, e.g. lock(), must happen in serial mode These are atomic ops There is an implied fence (must flush store buffers)

42 42 DMP-Buffering in Software Quantum Formation Count instructions Store Buffering Redirect every load/store into the store buffer Publish store buffers deterministically Requirements: T1T1 T3T3 T2T2

43 43 DMP-Buffering: Parallel Commit For determinism, the commit order must be deterministic i.e. logically serial For performance, the commit must happen in parallel Basic idea: Publish store buffers in parallel Preserve the commit order on collisions

44 44 0xA0 0xB0 0xC0 addr value 0xD0 0xE0 Global Memory DMP-Buffering: Parallel Commit 0xA1 0xB1 0xC1 0xC2 0xD2 0xE2 addr value Thread 1 Thread 2 Basic idea: Commit store buffers in parallel Preserve the commit order on collisions Collision! Resolve for Thread 2

45 45 GlobalMemory DMP-Buffering: Parallel Commit For determinism, the commit order must be deterministic i.e. logically serial For performance, the commit must happen in parallel Basic idea: Commit store buffer chunks in any order Preserve the serial commit order chunk 0xA chunk 0xB chunk 0xA 0xA0xB Thread 1 Thread 2 committed

46 46Outline DMP-Ownership in Software Why not DMP-TM in Software? DMP-Buffering A Few Implementation Details Performance

47 47 Quantum Formation “Just” instruction counting Tension between: Perfect counting, for maximal balance  e.g. every basic block Minimal counting, for minimal overhead  e.g. only backedges and recursive calls Heuristic compromise: :: : cnt += 5 cnt += 3 cnt += 50cnt += 54 35 50

48 48 Redundant accesses y =... x... z =... x... Remove Instrumention From... Accesses to thread-local (non-escaping) objects don’t need to instrument this

49 49 Remove Instrumention From... Accesses to thread-local (non-escaping) objects don’t need to instrument this  DMP-Buffering: requires unification-based points-to analysis int local; int *p = (...) ? &local : &global;... must access through the store buffer Redundant accesses y =... x... z =... x...

50 50 Remove Instrumention From... Accesses to thread-local (non-escaping) objects don’t need to instrument this Redundant accesses y =... x... z =... x...  DMP-Buffering: requires unification-based points-to analysis int local; int *p = (...) ? &local : &global;... must access through the store buffer

51 51 Remove Instrumention From... Accesses to thread-local (non-escaping) objects don’t need to instrument this Redundant accesses y =... x... z =... x...  DMP-Buffering: requires unification-based points-to analysis int local; int *p = (...) ? &local : &global;... must access through the store buffer  DMP-Buffering: this does not apply

52 52 External Libraries We do not instrument external shared libraries, such as the system libc 1. External calls must be serialized Preventing over-serialization: We check indirect calls at runtime We provide deterministic wrappers for common libc functions, e.g. memcpy and malloc We do not serialize pure libc functions, e.g. sqrt

53 53Outline DMP-Ownership in Software Why not DMP-TM in Software? DMP-Buffering A Few Implementation Details Performance

54 54Scalability splash mean parsec mean

55 55Overheads splash mean parsec mean Since we preserve scalability, we can overcome overheads by adding cores

56 56 Wrap Up CoreDet execution-level determinism in software of arbitrary multithreaded programs DMP-Buffering uses a relaxed memory consistency model scales comparably to nondeterministic execution Thank you! The CoreDet source code will eventually be released at http://sampa.cs.washington.edu


Download ppt "CoreDet: A Compiler and Runtime System for Deterministic Multithreaded Execution Tom Bergan Owen Anderson, Joe Devietti, Luis Ceze, Dan Grossman To appear."

Similar presentations


Ads by Google