1 - Offline Symbolic Analysis for Multi-Processor Execution Replay
Dongyoon Lee †, Mahmoud Said *, Satish Narayanasamy †, Zijiang James Yang *, and Cristiano L. Pereira ‡
† University of Michigan, Ann Arbor
* Western Michigan University
‡ Intel, Inc

2 - Overview
Goal: Deterministic replay for multi-threaded programs
- Debug non-deterministic bugs
Sources of non-determinism
- Program input (interrupt, I/O, DMA, etc.)
- Shared-memory dependencies
Past solutions vs. our solution
- Program input
  Past: Log I/O, signals, DMA, etc.
  Ours: BugNet [ISCA'05], log loads (cache miss data)
- Shared memory dependency
  Past: Monitor memory operations (software is slow, hardware is complex)
  Ours: SAT constraint solver, determine offline before replay

3 - Deterministic Replay Uses
[Diagram: Recorder (remote site or in-house) feeding a Replayer (developer site)]
- Reproduce non-deterministic bugs
- Debugging: step backward in time
- Dynamic program analysis: memory leaks, data races, dangling pointers

4 - Traditional Record-N-Replay Systems
[Diagram: Thread 1, Thread 2, and Thread 3 with write and read operations]
- Checkpoint memory and register state
- Log non-deterministic program input: interrupts, I/O values, DMA, etc.
- Log shared memory dependencies

5 - Recording Shared Memory Dependency
Problem: need to monitor every memory operation
- Software-based replay systems: PinSEL (UCSD/Intel), iDNA (Microsoft)
  [Figure labels: x100, x10]
- Hardware-based replay systems: FDR/ReRun (Wisconsin), Strata (UCSD), DeLorean (UIUC)
  Complex hardware

6 - Hardware Complexity
Hardware-based solution
- Detect shared memory dependencies by monitoring cache coherence messages
- Transitive optimization to reduce log size
Complexity
- Requires changes to coherence sub-system
- Complex to design and verify: 9 design bugs in coherence mechanism of AMD64 [Narayanasamy et al. ICCD'06]
[Diagram: W(a), W(b), R(a)]

7 - New Direction to Hardware-based Solution
Complexity-effective solution
- Do NOT record shared-memory dependencies at all
- Infer dependencies offline before replay using a Satisfiability Modulo Theory (SMT) solver

8 - Our Approach
[Diagram: write and read operations across threads, as in the traditional system]
- Checkpoint registers (instead of checkpointing memory and registers)
- Log non-deterministic program input (interrupts, I/O values, DMA, etc.): BugNet [ISCA'05] load-based hardware recorder
- Log shared memory dependency: not recorded; a Satisfiability-Modulo-Theory (SMT) solver reconstructs the interleaving offline

9 - Roadmap
- Motivation
- BugNet for single-threaded programs [ISCA'05]
  Recording cache miss data is sufficient
- BugNet is sufficient for multi-threaded programs
  Insight: BugNet can replay each thread in isolation
- Offline SMT Analysis
- Evaluation
- Conclusion

10 - BugNet [Narayanasamy et al., ISCA'05]
Insight
- Recording the initial register state and the values of loads is sufficient for deterministic replay
- Implicitly captures the program input from I/O, DMA, interrupts, etc.
- Input and output of other instructions are reproduced during replay
Optimization
- Record a load only if it is the first access to a memory location
Our modification
- Recording the data fetched on a cache miss captures first loads
- Any first access to a location results in a cache miss
- May unnecessarily record data due to store misses, but that is OK

11 - Recording Cache Miss Data (First Loads)
Checkpoint: register values, program counter
Record cache misses as (memory count, data); this implicitly captures first loads

Execution (over time)    Log file
Load A = 0               (cnt1, 0)
Load B = 5               (cnt2, 5)
Store C = 1              (cnt3, 0)

On a store miss
- Record the old value: the data before the store update
- The new value (the data after the store update) can be reproduced deterministically
Deterministic replay
- Input and output (including address) of all instructions are replayed
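
The recording rule on this slide can be sketched as a small simulation. This is only an illustration under stated assumptions, not BugNet's hardware design: the cache is modeled as an unbounded set of addresses, each address holds one value, and the names `record` and `replay` are hypothetical. The replayer is handed operation kinds, addresses, and store values directly, whereas a real replayer recomputes them by re-executing instructions.

```python
def record(trace, memory):
    """Log (memory_count, data) for every access that misses in the cache.

    trace:  list of (kind, addr, store_value); store_value is None for loads.
    memory: dict addr -> value, updated by stores.
    Assumption of this sketch: the cache is a set with unlimited capacity.
    """
    cached = set()
    log = []
    for count, (kind, addr, store_value) in enumerate(trace, start=1):
        if addr not in cached:
            # A first access misses; log the fetched data (the old value).
            log.append((count, memory[addr]))
            cached.add(addr)
        if kind == "store":
            memory[addr] = store_value
    return log


def replay(trace, log):
    """Reproduce every load value from the log alone, without the original memory."""
    logged = dict(log)
    local = {}
    load_values = []
    for count, (kind, addr, store_value) in enumerate(trace, start=1):
        if count in logged:
            local[addr] = logged[count]
        if kind == "store":
            local[addr] = store_value
        else:
            load_values.append(local[addr])
    return load_values


# The slide's three operations, followed by two loads that hit in the cache.
trace = [("load", "A", None), ("load", "B", None), ("store", "C", 1),
         ("load", "C", None), ("load", "A", None)]
log = record(trace, {"A": 0, "B": 5, "C": 0})
loads = replay(trace, log)
```

Only the three misses are logged; the later loads of C and A are reproduced from the replayed store and the earlier log entry.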

12 - BugNet Extension
Self-modifying code
- Consider an instruction read as a load, so instructions are logged
Full-system replay
- Continue logging in kernel mode
- See the paper for details on context switches, page faults, etc.

13 - Roadmap
- Motivation
- BugNet for single-threaded programs [ISCA'05]
  Recording cache miss data is sufficient
- BugNet is sufficient for multi-threaded programs
  Insight: BugNet can replay each thread in isolation
- Offline SMT Analysis
- Evaluation
- Conclusion

14 - BugNet for Multithreaded Programs
Insight
- The BugNet recorder (initial register state + loads) for each thread is sufficient for replaying that thread
  => Recording cache miss data is sufficient for multithreaded programs
  => No additional hardware support is required for recording dependencies
Reason
- A load that depends on a remote write causes a cache miss to ensure coherence
  => BugNet implicitly records load values dependent on remote writes
Effect
- Each thread can be replayed in isolation (independent of other threads) using BugNet logs

15 - Replaying Each Thread Independently
Proc 1: Load A = 0 (cache miss), later Load A = 1 (cache miss, because the cache block was invalidated)
Proc 2: Store A = 1 (cache miss; sends an invalidation to Proc 1)
Proc 1 LOG: (1st, 0), (3rd, 1)
Proc 2 LOG: (1st, 0)
Cache coherence
- Invalidate cache block to gain exclusive permission
Log cache miss data
- Implicitly records loads dependent on remote writes
- No change to coherence mechanism
Replay each thread independently of the others
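
A minimal sketch of why invalidation makes per-thread logs sufficient. It assumes a simplified write-invalidate protocol (every store removes the block from all other caches), unbounded caches, and hypothetical function names `record_all` and `replay_thread`; processors 0 and 1 in the code stand in for the slide's Proc 1 and Proc 2.

```python
def record_all(global_trace, memory, n_procs):
    """Record one cache-miss log per processor under write-invalidate coherence.

    global_trace: list of (proc, kind, addr, store_value) in execution order.
    Returns logs[proc] = list of (per-processor memory count, data).
    """
    caches = [set() for _ in range(n_procs)]
    counts = [0] * n_procs
    logs = [[] for _ in range(n_procs)]
    for proc, kind, addr, store_value in global_trace:
        counts[proc] += 1
        if addr not in caches[proc]:
            # Cache miss: log the data fetched from memory.
            logs[proc].append((counts[proc], memory[addr]))
            caches[proc].add(addr)
        if kind == "store":
            memory[addr] = store_value
            # Gain exclusive permission by invalidating all other copies.
            for other in range(n_procs):
                if other != proc:
                    caches[other].discard(addr)
    return logs


def replay_thread(ops, log):
    """Replay one thread from its own log only; returns its load values."""
    logged = dict(log)
    local = {}
    load_values = []
    for count, (kind, addr, store_value) in enumerate(ops, start=1):
        if count in logged:
            local[addr] = logged[count]
        if kind == "store":
            local[addr] = store_value
        else:
            load_values.append(local[addr])
    return load_values


global_trace = [(0, "load", "A", None), (0, "load", "A", None),
                (1, "store", "A", 1), (0, "load", "A", None)]
logs = record_all(global_trace, {"A": 0}, n_procs=2)
proc0_ops = [(kind, addr, value) for proc, kind, addr, value in global_trace if proc == 0]
proc0_loads = replay_thread(proc0_ops, logs[0])
```

Processor 0's third load misses because processor 1's store invalidated the block, so the remotely written value lands in processor 0's own log and its replay never needs processor 1's log.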

16 - Shared Memory Dependency
[Diagram: Thread 1 and Thread 2 issue loads and stores to locations A, B, and C; each operation is annotated with its old value (?) and new value (x); final state: A, B, C]
- SMT solver resolves shared memory dependency
- Billion instructions: offline analysis would not scale
- We need to bound the search space

17 - Roadmap
- Motivation
- BugNet
- Offline Symbolic Analysis
  - Encoding Ordering Constraints
  - Bounding Search Space
- Evaluation
- Conclusion

18 - Encoding Ordering Constraints
[Diagram: Proc 1 issues x1, x2; Proc 2 issues x3, x4, x5; each operation carries an old value and a new value; x has a final value]
Program order constraints (assume sequential consistency)
  Proc1: X1 < X2 AND
  Proc2: X3 < X4 < X5 AND
Load-store constraints (M→old == M→prev→new)
  X1: X1 < X3 AND
  X2: (X3 < X2 < X4 OR X5 < X2) AND
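
The load-store rule (M→old == M→prev→new) can be sketched by listing, for each operation, which operations could immediately precede it at the same location. The concrete old and new values below are assumptions chosen to be consistent with the slide's constraints (the slide shows them only graphically), and `Op` and `candidate_predecessors` are hypothetical names.

```python
from collections import namedtuple

# thread/index give program order; old/new are the values seen before/after the op.
Op = namedtuple("Op", "name thread index loc old new")


def candidate_predecessors(op, ops, initial):
    """Return the names of operations whose new value matches op's old value.

    "INIT" means op may be the first access and observe the initial memory value.
    Operations that follow op in the same thread are excluded by program order.
    """
    candidates = []
    if initial[op.loc] == op.old:
        candidates.append("INIT")
    for other in ops:
        if other is op or other.loc != op.loc:
            continue
        if other.thread == op.thread and other.index > op.index:
            continue
        if other.new == op.old:
            candidates.append(other.name)
    return candidates


# Assumed values: x starts at 0; X1 and X2 are loads, X3..X5 are stores.
ops = [
    Op("X1", 1, 0, "x", 0, 0),
    Op("X2", 1, 1, "x", 1, 1),
    Op("X3", 2, 0, "x", 0, 1),
    Op("X4", 2, 1, "x", 1, 2),
    Op("X5", 2, 2, "x", 2, 1),
]
initial = {"x": 0}
preds = {op.name: candidate_predecessors(op, ops, initial) for op in ops}
```

With these values, X1 can only observe the initial state (so it must precede X3), while X2 can follow either X3 or X5, which corresponds to the disjunction (X3 < X2 < X4 OR X5 < X2).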

19 - Multiple Memory Locations
[Diagram: Proc 1 issues y1, x1, x2, y2; Proc 2 issues x3, x4, x5, y3; each operation carries an old value and a new value; x and y have final values]
Program order constraints (assume sequential consistency)
  Proc1: Y1 < X1 < X2 < Y2 AND
  Proc2: X3 < X4 < X5 < Y3 AND
Load-store constraints (M→old == M→prev→new)
  X1: X1 < X3 AND
  X2: (X3 < X2 < X4 OR X5 < X2) AND
  :
  Y1: Y1 < Y2 AND
  Y2: Y1 < Y2 < Y3 AND
  :

20 - Satisfiability-Modulo-Theory (SMT) Solver
Input: ordering constraints
  (Program Order) ∧ (Load-Store Order for X) ∧ (Load-Store Order for Y) ∧ :
Output: a total order over x1, x2, x3, x4, x5, y1, y2, y3
SMT solver
- Finds one valid total order from multiple solutions
- All solutions could be produced, if needed
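
To make "one valid total order from multiple solutions" concrete, the sketch below swaps the SMT solver for a brute-force depth-first search over interleavings. This is a stand-in for illustration only, not how Yices works, and it would not scale beyond toy inputs. The old and new values are assumptions chosen to agree with the constraints on the previous slides, and `valid_total_orders` is a hypothetical name.

```python
def valid_total_orders(threads, initial, final):
    """Yield every interleaving that obeys program order and load-store semantics.

    threads: list of per-thread operation lists; each op is (name, loc, old, new).
    An op may be scheduled only when memory[loc] == old (M->old == M->prev->new).
    """
    def search(positions, memory, order):
        if all(p == len(ops) for p, ops in zip(positions, threads)):
            if memory == final:
                yield list(order)
            return
        for i, ops in enumerate(threads):
            if positions[i] == len(ops):
                continue
            name, loc, old, new = ops[positions[i]]
            if memory[loc] != old:
                continue  # load-store constraint violated; prune this branch
            next_positions = positions[:i] + (positions[i] + 1,) + positions[i + 1:]
            order.append(name)
            yield from search(next_positions, {**memory, loc: new}, order)
            order.pop()

    yield from search((0,) * len(threads), dict(initial), [])


# Assumed values for the two-location example (x and y both start at 0).
proc1 = [("Y1", "y", 0, 1), ("X1", "x", 0, 0), ("X2", "x", 1, 1), ("Y2", "y", 1, 2)]
proc2 = [("X3", "x", 0, 1), ("X4", "x", 1, 2), ("X5", "x", 2, 1), ("Y3", "y", 2, 3)]
orders = list(valid_total_orders([proc1, proc2], {"x": 0, "y": 0}, {"x": 1, "y": 3}))
first = orders[0]
```

Under these assumed values several total orders are valid; any one of them reproduces the same per-thread values and the same final state, which is why a single solution is enough for replay.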

21 - Replay Guarantees
- The replayed execution has the same final register and memory states
- Each thread executes exactly the same sequence of instructions, with the same inputs and outputs
- Reconstructed shared memory dependencies obey program order and load-store semantics

22 - Roadmap
- Motivation
- BugNet
- Offline Symbolic Analysis
  - Encoding Ordering Constraints
  - Bounding Search Space
- Evaluation
- Conclusion

23 - Bounding Search Space
[Diagram: every N cycles, Proc 1 and Proc 2 record counts (cnt1, cnt2, cnt3, cnt4), dividing the execution into Strata Regions 1, 2, and 3; each region has an initial state and a final state]
Record "Strata hints"
- Each processor periodically records its memory operation count
- Strata regions have a global order
SMT solver analyzes one region at a time
- Start from the last region
- Final state of a region = initial state of the following region
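
A rough sketch of how Strata hints could partition the per-processor operation streams and how state could be propagated backward between regions. Two things are assumptions of this sketch rather than statements from the slides: hints are stored as cumulative per-processor counts, and a region's initial value for a location is taken to be the old value of the first operation on it in the solved order. `split_into_regions` and `initial_state` are hypothetical names.

```python
def split_into_regions(thread_ops, hints):
    """Cut each processor's operation list at the recorded memory-operation counts.

    thread_ops[p]: operations of processor p in program order.
    hints[k][p]:   cumulative count of processor p at the end of region k (assumed format).
    Returns regions[k][p] = operations of processor p inside region k.
    """
    regions = []
    previous = [0] * len(thread_ops)
    for counts in hints:
        regions.append([ops[start:end]
                        for ops, start, end in zip(thread_ops, previous, counts)])
        previous = counts
    return regions


def initial_state(total_order, final_state):
    """Derive a region's initial state from its solved order and its final state.

    total_order: list of (loc, old, new) in the order found for this region.
    Untouched locations keep their final value; touched locations take the old
    value of their first operation (an assumption of this sketch).
    """
    state = dict(final_state)
    for loc, old, _new in reversed(total_order):
        state[loc] = old
    return state


thread_ops = [["a1", "a2", "a3", "a4", "a5"], ["b1", "b2", "b3"]]
hints = [[2, 1], [3, 3], [5, 3]]
regions = split_into_regions(thread_ops, hints)

# Walk backward: the initial state of region k is the final state of region k-1.
solved_last_region = [("x", 0, 1), ("x", 1, 2), ("y", 5, 5)]
state_before_last = initial_state(solved_last_region, {"x": 2, "y": 5, "z": 9})
```

Because regions are globally ordered, each one can be solved on its own with a search space bounded by the region size.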

24 - Strata Hints
Cycle-bound
- After N cycles, each core records its memory operation count
- No communication is required between cores
Problem
- The size of a Strata region is not based on the number of shared memory dependencies
- Can we bound based on the number of shared memory dependencies?
Downgrade-bound
- Count coherence downgrade requests
- Requires communication between cores, but reduces offline analysis overhead

25 - Filtering Local & Read-only Accesses
[Diagram: a Strata region in which Thread 1 and Thread 2 issue Load A, Store B, Load B, Store B, Store A, Load C]
Filter
- Local accesses: no shared-memory dependency
- Read-only accesses: any total order is valid
Effectiveness
- < 1% of memory accesses remain to be analyzed
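
The two filters can be sketched as a pass over the operations of one Strata region. The assignment of operations to threads below is an assumption (the slide's diagram does not survive in the transcript), and `filter_region` is a hypothetical name.

```python
from collections import defaultdict


def filter_region(ops):
    """Keep only accesses to locations that are shared AND written in this region.

    ops: list of (thread, kind, loc) with kind in {"load", "store"}.
    Local locations (touched by one thread) carry no shared-memory dependency;
    read-only locations admit any total order, so neither needs to be analyzed.
    """
    threads_by_loc = defaultdict(set)
    written = set()
    for thread, kind, loc in ops:
        threads_by_loc[loc].add(thread)
        if kind == "store":
            written.add(loc)
    keep = {loc for loc, threads in threads_by_loc.items()
            if len(threads) > 1 and loc in written}
    return [op for op in ops if op[2] in keep]


region = [
    (1, "load", "A"), (1, "store", "B"), (1, "load", "B"), (1, "store", "B"),
    (2, "store", "A"), (2, "load", "C"), (1, "load", "C"),
]
remaining = filter_region(region)
```

Here B is local to thread 1 and C is read-only, so only the two accesses to A are left for the solver.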

26 - Roadmap
- Motivation
- Record & Replay
- Offline Symbolic Analysis
- Evaluation
  - Strata Hint Size
  - Offline Symbolic Analysis Overhead
- Conclusion

27 - Evaluation
Simics + cycle-accurate simulator
- Simulate multi-processor execution (2, 4, 8, 16 cores)
- Fast-forward up to known synchronization points
- Trace collected for 500 million instructions
Benchmarks
- SPLASH2: barnes, fmm, ocean
- Parsec 2.0: blackscholes, bodytrack, x264
- SPEComp: wupwise, swim
- Apache
- MySQL
Yices SMT constraint solver [Dutertre and de Moura, CAV'06]

28 - Strata Hints Size vs. Offline Analysis Overhead
[Chart: Cycle-bound (10,000), Downgrade-bound (25), Downgrade-bound (10); annotations: 10%, x100]
- Downgrade-bound scheme is effective
- Offline analysis overhead is a one-time cost (not for every replay)

29 - Strata Hints vs. ReRun Log
[Chart: Proposed System vs. ReRun [Hower and Hill, ISCA'08]; annotation: x4]
- Strata hints are 4x smaller than the ReRun log
- Significant reduction in hardware complexity

30 - Recording Performance, etc.
Cache miss data log
- 290 Mbytes per second of program execution
Recording performance
- On average, 0.35% slowdown in IPC
Scalability results can be found in the paper

31 - Conclusion
Deterministic replay for multi-threaded programs is critical
We proposed a complexity-effective solution
- Use BugNet: record cache miss data
- No need to record shared memory dependencies
- Determine shared memory dependencies offline using an SMT constraint solver
Result
- < 1% recording overhead
- Efficient log size (4x smaller than the state-of-the-art scheme ReRun)
- Can analyze one second of an 8-threaded program in less than 1000 seconds
- One-time offline analysis cost (not for every replay)

32 - Thank you

