- 1 - Offline Symbolic Analysis for Multi-Processor Execution Replay. Dongyoon Lee†, Mahmoud Said*, Satish Narayanasamy†, Zijiang James Yang*, and Cristiano L. Pereira‡. †University of Michigan, Ann Arbor; *Western Michigan University; ‡Intel, Inc.

- 2 - Overview. Goal: deterministic replay for multi-threaded programs, in order to debug non-deterministic bugs. Sources of non-determinism: program input (interrupts, I/O, DMA, etc.) and shared-memory dependencies. Program input: past solutions log I/O, signals, DMA, etc.; our solution uses BugNet [ISCA'05] to log loads (cache miss data). Shared memory dependency: past solutions monitor memory operations (software is slow, hardware is complex); our solution uses a SAT constraint solver to determine the dependencies offline, before replay.

- 3 - Deterministic Replay Uses. The recorder runs at a remote site or in-house; the replayer runs at the developer site. Uses: reproducing non-deterministic bugs; debugging by stepping backward in time; dynamic program analysis (memory leaks, data races, dangling pointers).

- 4 - Traditional Record-N-Replay Systems. Checkpoint memory and register state. Log non-deterministic program input (interrupts, I/O values, DMA, etc.). Log shared memory dependencies (a write in one thread followed by a read in another). Figure: Thread 1, Thread 2, and Thread 3 with write-to-read dependencies between them.

- 5 - Recording Shared Memory Dependency. Problem: every memory operation needs to be monitored. Software-based replay systems: PinSEL (UCSD/Intel), iDNA (Microsoft); slide annotations: x100, x10. Hardware-based replay systems: FDR/ReRun (Wisconsin), Strata (UCSD), DeLorean (UIUC); slide annotation: complex hardware.

- 6 - Hardware Complexity. Hardware-based solution: detect shared memory dependencies by monitoring cache coherence messages; apply a transitive optimization to reduce log size. Complexity: requires changes to the coherence sub-system; complex to design and verify; 9 design bugs were found in the coherence mechanism of AMD64 [Narayanasamy et al. ICCD’06]. Figure: W(a), W(b), R(a).

- 7 - New Direction for Hardware-based Solutions. Complexity-effective solution: do NOT record shared-memory dependencies at all. Instead, infer the dependencies offline, before replay, using a Satisfiability Modulo Theories (SMT) solver.

- 8 - Our Approach. Relative to a traditional record-and-replay system: the BugNet [ISCA’05] load-based hardware recorder takes the place of logging non-deterministic program input (interrupts, I/O values, DMA, etc.); a Satisfiability Modulo Theories (SMT) solver that reconstructs the interleaving offline takes the place of logging shared memory dependencies (write to read); and the checkpoint holds registers only, instead of memory and registers.

- 9 - Roadmap. Motivation. BugNet for single-threaded programs [ISCA’05]: recording cache miss data is sufficient. BugNet is sufficient for multi-threaded programs; insight: BugNet can replay each thread in isolation. Offline SMT Analysis. Evaluation. Conclusion.

BugNet [Narayanasamy et al., ISCA’05]. Insight: recording the initial register state and the values of loads is sufficient for deterministic replay. This implicitly captures the program input from I/O, DMA, interrupts, etc.; the inputs and outputs of all other instructions are reproduced during replay. Optimization: record a load only if it is the first access to a memory location. Our modification: recording the data fetched on a cache miss captures first loads, because any first access to a location results in a cache miss. This may unnecessarily record data on store misses, but that is acceptable.

Recording Cache Miss Data (First Loads). At a checkpoint, record the register values and the program counter. After the checkpoint, record each cache miss as a (memory count, data) pair; this implicitly captures first loads. Example log over execution time: Load A = 0 is logged as (cnt1, 0); Load B = 5 is logged as (cnt2, 5); Store C = 1 misses and is logged as (cnt3, 0). On a store miss, record the old value (the data before the store update); the new value (the data after the store update) can be reproduced deterministically. During deterministic replay, the input and output (including the address) of every instruction are replayed.
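The logging rule on this slide can be sketched in a few lines of Python. This is only an illustrative model, not the BugNet hardware: the `Recorder` class, its method names, and the idealized cache (a set of resident addresses with no capacity limit) are assumptions made for the sketch. It reproduces the slide's example log of (memory count, data) pairs.

```python
class Recorder:
    """Toy model of a per-thread recorder that logs data on cache misses.

    Hypothetical names; the cache is idealized as a set of resident addresses.
    """

    def __init__(self, memory):
        self.memory = memory   # dict: address -> current value
        self.cached = set()    # addresses currently resident in the cache
        self.count = 0         # number of memory operations executed so far
        self.log = []          # list of (memory count, data) pairs

    def _access(self, addr):
        self.count += 1
        if addr not in self.cached:
            # Cache miss: log the value fetched, i.e. the value before this access.
            self.log.append((self.count, self.memory[addr]))
            self.cached.add(addr)

    def load(self, addr):
        self._access(addr)
        return self.memory[addr]

    def store(self, addr, value):
        # On a store miss the old value is logged; the new value is not,
        # because replay can recompute it.
        self._access(addr)
        self.memory[addr] = value


memory = {"A": 0, "B": 5, "C": 0}
rec = Recorder(memory)
rec.load("A")       # first access to A: miss, logs (1, 0)
rec.load("B")       # first access to B: miss, logs (2, 5)
rec.store("C", 1)   # store miss: logs the old value, (3, 0)
rec.load("A")       # hit: nothing is logged
print(rec.log)
```

The fourth access hits in the cache, so the log stays at three entries even though four memory operations were executed.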

BugNet Extension. Self-modifying code: treat an instruction read as a load, so that instructions are logged. Full-system replay: continue logging in kernel mode. See the paper for details on context switches, page faults, etc.

Roadmap. Motivation. BugNet for single-threaded programs [ISCA’05]: recording cache miss data is sufficient. BugNet is sufficient for multi-threaded programs; insight: BugNet can replay each thread in isolation. Offline SMT Analysis. Evaluation. Conclusion.

BugNet for Multithreaded Programs. Insight: the BugNet recorder (initial register state plus loads) for each thread is sufficient for replaying that thread. Therefore recording cache miss data is sufficient for multithreaded programs, and no additional hardware support is required for recording dependencies. Reason: a load that depends on a remote write causes a cache miss, which is needed to ensure coherence. Therefore BugNet implicitly records the load values that depend on remote writes. Effect: each thread can be replayed in isolation (independent of other threads) using the BugNet logs.

Replaying Each Thread Independently. Example: Proc 1 executes Load A = 0 (a cache miss). Proc 2 then executes Store A = 1; cache coherence invalidates Proc 1's cache block so that Proc 2 gains exclusive permission. Proc 1's next Load A misses because the block was invalidated, and it returns 1. Proc 1 log: (1st, 0), (3rd, 1). Proc 2 log: (1st, 0). Logging cache miss data implicitly records the loads that depend on remote writes, with no change to the coherence mechanism. Each thread can be replayed independently of the others.
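Replay of a single thread from its own log can be sketched as follows. The function name `replay_thread`, the tuple encoding of operations, and the assumption that Proc 1's unlogged second memory operation is another load of A (a cache hit) are all hypothetical; the slide only gives the log entries. The replayer never looks at the other thread: whenever the memory-operation count matches the next log entry, it installs the logged value into a private memory image.

```python
def replay_thread(ops, log):
    """Replay one thread in isolation from its (memory count, data) log.

    ops: list of ("load", addr) or ("store", addr, value) in program order.
    log: list of (memory count, data) pairs recorded on cache misses.
    Returns the values seen by the loads and the final private memory image.
    """
    private = {}          # this thread's private view of memory
    pending = list(log)   # log entries not yet consumed
    loaded = []
    for count, op in enumerate(ops, start=1):
        kind, addr = op[0], op[1]
        if pending and pending[0][0] == count:
            # This operation missed during recording: install the logged data.
            private[addr] = pending.pop(0)[1]
        if kind == "load":
            loaded.append(private[addr])
        else:
            private[addr] = op[2]
    return loaded, private


# Proc 1: Load A (miss), Load A (hit, assumed), Load A (miss after invalidation).
proc1_ops = [("load", "A"), ("load", "A"), ("load", "A")]
proc1_log = [(1, 0), (3, 1)]
# Proc 2: Store A = 1 (store miss, old value 0 is logged).
proc2_ops = [("store", "A", 1)]
proc2_log = [(1, 0)]

print(replay_thread(proc1_ops, proc1_log))
print(replay_thread(proc2_ops, proc2_log))
```

Proc 1's third load observes the value written by Proc 2 even though Proc 2 is never executed during Proc 1's replay, which is the point of the slide.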

Shared Memory Dependency. Figure: Thread 1 and Thread 2 each perform a sequence of loads and stores on locations A, B, and C; every access is shown with its old value and its new value, and the sequence ends in the final state of A, B, and C. The SMT solver resolves the shared memory dependencies. Problem: for a billion instructions, offline analysis would not scale, so we need to bound the search space.

Roadmap. Motivation. BugNet. Offline Symbolic Analysis: encoding ordering constraints; bounding the search space. Evaluation. Conclusion.

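Encoding Ordering Constraints. Figure: Proc 1 performs accesses x1 and x2, and Proc 2 performs accesses x3, x4, and x5, all on location x; each access is shown with its old value and its new value, followed by the final value of x.
Program order constraints (assuming sequential consistency):
Proc1: X1 < X2 AND
Proc2: X3 < X4 < X5 AND
Load-store constraints (M→old == M→prev→new):
X1: X1 < X3 AND
X2: (X3 < X2 < X4 OR X5 < X2) AND ...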

Multiple Memory Locations. Figure: Proc 1 performs y1, x1, x2, y2 and Proc 2 performs x3, x4, x5, y3 on locations x and y; each access is shown with its old value and its new value, followed by the final values of x and y.
Program order constraints (assuming sequential consistency):
Proc1: Y1 < X1 < X2 < Y2 AND
Proc2: X3 < X4 < X5 < Y3 AND
Load-store constraints (M→old == M→prev→new):
X1: X1 < X3 AND
X2: (X3 < X2 < X4 OR X5 < X2) AND ...
Y1: Y1 < Y2 AND
Y2: Y1 < Y2 < Y3 AND ...

Satisfiability Modulo Theories (SMT) Solver. Input: the ordering constraints (Program Order) ∧ (Load-Store Order for X) ∧ (Load-Store Order for Y) ∧ ... Output: a total order over x1, x2, x3, x4, x5, y1, y2, y3. The SMT solver finds one valid total order out of multiple solutions. All solutions could be produced, if needed.
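To make the constraints concrete, the sketch below enumerates the valid total orders for the two-location example by plain depth-first search. This is a brute-force stand-in, not an SMT encoding and not the Yices-based analysis from the talk. The old and new values attached to each access are hypothetical, chosen so that the resulting load-store constraints match the ones on the slides (for example, X2 has old value 1, which is produced by both X3 and X5 but not by X4).

```python
from typing import NamedTuple


class Op(NamedTuple):
    name: str
    loc: str
    old: int   # value of the location just before this access
    new: int   # value just after it (equal to old for a load)


def valid_total_orders(threads, initial, final):
    """Enumerate interleavings that obey program order and old == prev.new."""
    results = []

    def extend(pos, mem, order):
        if all(p == len(t) for p, t in zip(pos, threads)):
            if mem == final:           # the order must reach the recorded final state
                results.append(order)
            return
        for i, thread in enumerate(threads):
            if pos[i] < len(thread):
                op = thread[pos[i]]    # next access of thread i (program order)
                if mem[op.loc] == op.old:   # load-store constraint
                    next_pos = pos[:i] + (pos[i] + 1,) + pos[i + 1:]
                    extend(next_pos, {**mem, op.loc: op.new}, order + [op.name])

    extend((0,) * len(threads), dict(initial), [])
    return results


# Hypothetical values consistent with the slide's constraints.
proc1 = [Op("Y1", "y", 0, 1), Op("X1", "x", 0, 0),
         Op("X2", "x", 1, 1), Op("Y2", "y", 1, 2)]
proc2 = [Op("X3", "x", 0, 1), Op("X4", "x", 1, 2),
         Op("X5", "x", 2, 1), Op("Y3", "y", 2, 3)]

orders = valid_total_orders([proc1, proc2], {"x": 0, "y": 0}, {"x": 1, "y": 3})
for order in orders:
    print(" < ".join(order))
```

With these values the search finds four total orders. Each one places X1 before X3, places X2 either between X3 and X4 or after X5, and places Y2 before Y3; any single one of them is enough for replay.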

Replay Guarantees. The replayed execution has the same final register and memory states. Each thread executes exactly the same sequence of instructions, with the same inputs and outputs. The reconstructed shared memory dependencies obey program order and load-store semantics.

Roadmap. Motivation. BugNet. Offline Symbolic Analysis: encoding ordering constraints; bounding the search space. Evaluation. Conclusion.

Bounding Search Space. Record “Strata hints”: each processor periodically (every N cycles) records its memory operation count (cnt1 and cnt2 on Proc 1, cnt3 and cnt4 on Proc 2 in the figure). The resulting Strata regions (Strata Region 1, 2, 3) have a global order. The SMT solver analyzes one region at a time, starting from the last region. The final state of a region is the initial state of the following region.
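Splitting per-thread traces at the recorded memory operation counts can be sketched as below. The function name `split_into_regions`, the list-based trace representation, and the example counts are hypothetical; the sketch only shows how the hints cut each thread's trace into globally ordered regions, which are then visited last to first.

```python
def split_into_regions(thread_ops, hints):
    """Cut each thread's trace at its recorded memory operation counts.

    thread_ops: one list of operations per thread, in program order.
    hints: one list of cumulative memory operation counts per thread,
           one count per Strata hint (same number of hints for every thread).
    Returns a list of regions; region r holds one slice per thread.
    """
    n_regions = len(hints[0]) + 1   # the last region runs up to the final state
    regions = []
    for r in range(n_regions):
        region = []
        for ops, counts in zip(thread_ops, hints):
            bounds = [0] + list(counts) + [len(ops)]
            region.append(ops[bounds[r]:bounds[r + 1]])
        regions.append(region)
    return regions


thread_ops = [["a1", "a2", "a3", "a4"], ["b1", "b2", "b3"]]
hints = [[1, 3], [2, 2]]   # counts recorded by thread 0 and thread 1 at each hint
regions = split_into_regions(thread_ops, hints)

# Analyze one region at a time, starting from the last one.
for index in reversed(range(len(regions))):
    print("region", index + 1, regions[index])
```

A thread that performs no memory operations between two hints simply contributes an empty slice to that region, as thread 1 does in the second region here.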

Strata Hints. Cycle-bound: after N cycles, each core records its memory operation count. No communication is required between cores. Problem: the size of a Strata region is not based on the number of shared memory dependencies. Can we bound regions based on the number of shared memory dependencies? Downgrade-bound: count coherence downgrade requests. This requires communication between cores, but reduces the offline analysis overhead.

Filtering Local & Read-only Accesses. Figure: a Strata region in which Thread 1 and Thread 2 perform Load A, Store B, Load B, Store B, Store A, and Load C. Filter out local accesses (no shared-memory dependency) and read-only accesses (any total order is valid). Effectiveness: < 1% of memory accesses remain to be analyzed.
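The filter can be sketched as a single pass over the accesses of one region. The function name `filter_accesses`, the tuple encoding, and the assignment of the slide's accesses to threads (B touched only by Thread 1, C only read, A read by Thread 1 and written by Thread 2) are assumptions for illustration.

```python
from collections import defaultdict


def filter_accesses(accesses):
    """Keep only accesses that can take part in a shared-memory dependency.

    accesses: list of (thread, kind, addr) with kind "load" or "store".
    Drops accesses to locations touched by a single thread (local) and to
    locations that are never written in the region (read-only).
    """
    threads_by_addr = defaultdict(set)
    written = set()
    for thread, kind, addr in accesses:
        threads_by_addr[addr].add(thread)
        if kind == "store":
            written.add(addr)
    return [
        (thread, kind, addr)
        for thread, kind, addr in accesses
        if len(threads_by_addr[addr]) > 1 and addr in written
    ]


region = [
    (1, "load", "A"), (1, "store", "B"), (1, "load", "B"),
    (1, "store", "B"), (1, "load", "C"),
    (2, "store", "A"), (2, "load", "C"),
]
remaining = filter_accesses(region)
print(remaining)
```

Only the two accesses to A survive in this example, so the solver would need ordering variables for two accesses instead of seven.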

Roadmap. Motivation. Record & Replay. Offline Symbolic Analysis. Evaluation: Strata hint size; offline symbolic analysis overhead. Conclusion.

Evaluation. Simics plus a cycle-accurate simulator: simulate multi-processor execution (2, 4, 8, 16 cores); fast-forward up to known synchronization points; collect a trace of 500 million instructions. Benchmarks: SPLASH2 (barnes, fmm, ocean); Parsec 2.0 (blackscholes, bodytrack, x264); SPEComp (wupwise, swim); Apache; MySQL. Solver: Yices SMT constraint solver [Dutertre and de Moura CAV’06].

Strata Hints Size vs. Offline Analysis Overhead. Figure: comparison of Cycle-bound (10,000), Downgrade-bound (25), and Downgrade-bound (10); annotations: 10%, x100. The downgrade-bound scheme is effective. The offline analysis overhead is a one-time cost (not paid for every replay).

Strata hints vs. ReRun log. Figure: proposed system compared with ReRun [Hower and Hill, ISCA’08]; annotation: x4. The Strata hints are 4x smaller than the ReRun log, with a significant reduction in hardware complexity.

Recording Performance, etc. Cache miss data log: 290 Mbytes per second of program execution. Recording performance: 0.35% slowdown in IPC on average. Scalability results can be found in the paper.

Conclusion. Deterministic replay for multi-threaded programs is critical. We proposed a complexity-effective solution: use BugNet to record cache miss data, with no need to record shared memory dependencies, and determine the shared memory dependencies offline using an SMT constraint solver. Results: < 1% recording overhead; efficient log size (4x smaller than the state-of-the-art scheme, ReRun); one second of an 8-threaded program can be analyzed in less than 1000 seconds; the offline analysis is a one-time cost (not paid for every replay).

Thank you