A Regulated Transitive Reduction (RTR) for Longer Memory Race Recording (ASPLOS’06). Min Xu, Rastislav Bodik, Mark D. Hill. Shimin Chen, LBA Reading Group Presentation.

Presentation transcript:


1 Why Do You Need a Recorder?
    % gcc sim.c
    % a.out
    Segmentation fault
    % gdb a.out
    gdb> run
    Program received SIGSEGV.
    In get() at hash.c:45
    45 a = bucket->d;

    % gcc para-sim.c
    % a.out
    Segmentation fault
    % gdb a.out
    gdb> run
    Program exited normally.
    gdb>

    % gcc para-sim.c
    % a.out
    Segmentation fault
    Race recorded in “log”
    % gdb a.out log
    gdb> run
    Program received SIGSEGV.
    In get() at para-hash.c:67
    67 a = bucket->d;

2 Ideally …
    % gcc para-sim.c
    % a.out
    Segmentation fault
    Race recorded in “log”
    % gdb a.out log
    gdb> run
    Program received SIGSEGV.
    In get() at para-hash.c:67
    67 a = bucket->d;
- Long recording: small log
- Low runtime overhead
- Low cost
- Applicability:
  - Programs: with data races
  - Systems: non-SC

3 Flight Data Recorder (ISCA’03)
- Full-system record-replay
- Recording memory races:
  - Assumes Sequential Consistency (SC)
  - Records the order of instruction interleaving
  - Targets cache-coherent multiprocessor servers
  - Piggybacks on the coherence protocol: little extra H/W
- Recording system states: SafetyNet
- Recording I/Os
- Results:
  - Non-trivial recording interval: 1 second
  - Negligible runtime overhead: less than 2%
  - Can be “Always On”

4 RTR
- Better memory race log compression: 1 byte per kilo-instruction
- Dealing with Total Store Ordering (TSO)
In this talk, I will try to give a full picture combining FDR and RTR.

5 Outline
- Introduction
- Recording System State
- Recording Input/Output
- Recording Memory Races
- Dealing with TSO
- Summary

6 Recording System State (based on SafetyNet)
- Purpose: reconstruct the initial state (registers, TLB, main memory) at the beginning of the replay interval
- Policy: FDR’s 1-second replay interval
  - Take a logical checkpoint every 1/3 second
  - Reserve memory space to store logs for 4 checkpoints
- Logical checkpoint:
  - Quiesce the entire system to take a physical checkpoint
    - Register and TLB state (4248 bytes/processor on SPARC V9)
  - Log the old value of a cache line upon its first update
  - Add an “already-updated” bit per cache line
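
The first-update logging described on this slide can be sketched in a few lines of Python. This is only a toy model under stated assumptions, not the SafetyNet or FDR hardware design: the class name `CheckpointLogger`, its methods, and the rule that unwritten lines read as 0 are all hypothetical, and memory is modelled as a plain dictionary keyed by cache-line address.

```python
from dataclasses import dataclass, field


@dataclass
class CheckpointLogger:
    """Toy model of first-update logging between two logical checkpoints."""
    memory: dict                                   # cache-line address -> value
    undo_log: list = field(default_factory=list)   # (address, old value) pairs
    updated: set = field(default_factory=set)      # lines whose "already-updated" bit is set

    def store(self, addr, value):
        # Log the old value only on the first update since the last checkpoint.
        if addr not in self.updated:
            # Assumption: a line that was never written reads as 0.
            self.undo_log.append((addr, self.memory.get(addr, 0)))
            self.updated.add(addr)
        self.memory[addr] = value

    def take_checkpoint(self):
        # A new interval starts: hand back the finished log and clear all bits.
        finished, self.undo_log = self.undo_log, []
        self.updated.clear()
        return finished

    def roll_back(self):
        # Undo the logged updates to recover the state at the last checkpoint.
        for addr, old in reversed(self.undo_log):
            self.memory[addr] = old
        self.undo_log.clear()
        self.updated.clear()


logger = CheckpointLogger(memory={0x40: 1, 0x80: 2})
logger.store(0x40, 10)
logger.store(0x40, 11)   # second update to the same line: nothing is logged
logger.store(0x80, 20)
log_len = len(logger.undo_log)
logger.roll_back()
restored = dict(logger.memory)
```

The "already-updated" bit is what keeps the log small: a line written many times in one interval costs a single log entry, because only its value at the checkpoint is needed to rebuild the initial state.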

7 FDR paper

8 Outline
- Introduction
- Recording System State
- Recording Input/Output
- Recording Memory Races
- Dealing with TSO
- Summary

9 Recording I/O
- I/O loads
- Instruction count + interrupt number
- DMA store values

10 Outline
- Introduction
- Recording System State
- Recording Input/Output
- Recording Memory Races
- Dealing with TSO
- Summary

11 Log All Dependence
[Figure: instruction streams of Thread I and Thread J (loads and stores to A, B, C, D) with their inter-thread dependences, and the replay]
- Dependence log:
  - Log J: 2→3, 1→4, 3→5, 4→6
  - Log I: 2→3
- Each logged dependence: 16 bytes
- Log size: 5*16 = 80 bytes (10 integers)
- But too many dependences

12 Netzer’s Transitive Reduction (TR), approximated by FDR
[Figure: the same instruction streams of Thread I and Thread J, with the dependence removed by TR marked as reduced, and the replay]
- TR-reduced log:
  - Log J: 2→3, 3→5, 4→6
  - Log I: 2→3
- Log size: 64 bytes (8 integers)
- How to further reduce log size?

13 RTR
- Actively creating artificial dependencies
  - Stricter
  - Vectorized

14 The Intuition of the RTR Algorithm
[Figure: dependences from I to J and from J to I after reduction; vectors “regulate” replay]

15 Stricter Dependences to Aid Vectorization
[Figure: instruction streams of Thread I and Thread J, with a stricter dependence taking the place of reduced ones, and the replay]
- New reduced log:
  - Log J: 2→3, 4→5
  - Log I: 2→3
- Log size: 48 bytes (6 integers)
- Fewer dependencies to log

16 Compress Vectorized Dependencies
[Figure: instruction streams of Thread I and Thread J with the vectorized dependences, and the replay]
- Vectorized log:
  - Log J: x=3,5, ∆=1
  - Log I: x=3, ∆=1
- Log size: 40 bytes (5 integers)
- TR → RTR: fewer deps + fewer bytes/dep
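
The log sizes quoted on the last few slides can be checked with a short Python calculation. The sketch assumes one plausible reading of the vectorized entries, which the slides do not spell out: each `x` is a receiving instruction count and the sending instruction count is `x − ∆`. The 8 bytes per integer follows from the slides' figure of 16 bytes per two-integer dependence; the function names are hypothetical.

```python
BYTES_PER_INTEGER = 8  # from the slides: 16 bytes per two-integer dependence


def expand(xs, delta):
    """Expand one vectorized entry into explicit (source, destination) pairs.

    Assumption: the source instruction count is x - delta for every listed x.
    """
    return [(x - delta, x) for x in xs]


def log_bytes(num_integers):
    return num_integers * BYTES_PER_INTEGER


# Vectorized log from the slide: Log J: x=3,5, delta=1; Log I: x=3, delta=1.
vector_log = {"J": ([3, 5], 1), "I": ([3], 1)}

# Explicit form: two integers per dependence.
explicit = {t: expand(xs, delta) for t, (xs, delta) in vector_log.items()}
explicit_integers = sum(2 * len(deps) for deps in explicit.values())

# Vectorized form: one integer per x plus one shared delta per entry.
vector_integers = sum(len(xs) + 1 for xs, _ in vector_log.values())

print(explicit)                       # dependences recovered from the vectors
print(log_bytes(explicit_integers))   # size of the stricter, unvectorized log
print(log_bytes(vector_integers))     # size of the vectorized log
```

Under this reading the vectors expand back to the stricter log of the previous slide (2→3 and 4→5 for J, 2→3 for I), so the saving comes purely from sharing one ∆ across several dependences.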

17

18 H/W Considerations
- IC: instruction count per core (easy)
- VIC[p]: records the largest time stamps previously seen from each sender, for transitive reduction
- CTS[b]: time stamp per cache block, i.e. records the IC when a load/store commits
- At commit time:
  - Figure out the memory address: how difficult?
  - Write CTS: decoupled timestamp memory

19 H/W Considerations Cont’d
- Piggyback on cache coherence messages
  - FDR: CTS[b]
  - RTR: CTS[b] & sender’s IC
- Logic to perform the algorithm at the receiver side
  - FDR: integer comparison, update VIC[sender], generate log record
  - RTR: in addition, max/min, integer subtraction
- Augment directory structure
  - Record last owner for evicted blocks
  - Cache must respond to inquiries about evicted blocks: reply with CTS[SET/LRU]
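
The FDR receiver-side logic listed above (integer comparison, update VIC[sender], generate log record) can be sketched as follows. This is a simplified illustration, not the paper's hardware: the class `Receiver`, its method names, and the log record layout `(sender, sender timestamp, receiver IC)` are assumptions, and the extra max/min and subtraction steps that RTR adds are left out.

```python
class Receiver:
    """Toy model of one core applying transitive reduction to incoming dependences."""

    def __init__(self, num_cores):
        self.ic = 0                   # IC: instructions committed by this core
        self.vic = [0] * num_cores    # VIC[p]: largest timestamp already seen from core p
        self.log = []                 # (sender, sender timestamp, receiver IC) records

    def commit(self):
        self.ic += 1
        return self.ic

    def on_coherence_reply(self, sender, cts):
        """Handle a reply that carries the sender's CTS for the requested block."""
        next_ic = self.ic + 1         # the instruction waiting on this reply
        if cts > self.vic[sender]:
            # New information: log the dependence and remember the timestamp.
            self.log.append((sender, cts, next_ic))
            self.vic[sender] = cts
            return True
        # An earlier logged dependence from this sender already implies this one.
        return False


r = Receiver(num_cores=2)
first = r.on_coherence_reply(sender=0, cts=2)    # logged
r.commit()
second = r.on_coherence_reply(sender=0, cts=1)   # implied by the first record
r.commit()
third = r.on_coherence_reply(sender=0, cts=4)    # logged
```

The second reply is dropped because the receiver already waits for the sender's instruction 2 before its own instruction 1, and program order on both cores carries that ordering to the later access.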

20 Outline
- Introduction
- Recording System State
- Recording Input/Output
- Recording Memory Races
- Dealing with TSO
- Summary

21 Total Store Ordering
- FIFO write buffer
- A store commits by placing its value into the write buffer
- A store is ordered when it exits the write buffer and updates memory
- Stores are ordered in commit order (FIFO)
- A load can obtain its value from the write buffer or from the memory system
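
The write-buffer rules on this slide can be modelled with a small Python sketch. It is a minimal illustration with hypothetical names (`TsoCore`, `drain_one`), and it assumes that unwritten locations read as 0. It shows how a load can be satisfied before the core's own earlier store is ordered, which is the behaviour that separates TSO from SC.

```python
from collections import deque


class TsoCore:
    """Toy model of one core with a FIFO write buffer in front of shared memory."""

    def __init__(self, memory):
        self.memory = memory          # shared dict: address -> value
        self.write_buffer = deque()   # committed but not yet ordered stores

    def store(self, addr, value):
        # A store commits by placing its value into the write buffer.
        self.write_buffer.append((addr, value))

    def drain_one(self):
        # The oldest store is ordered when it leaves the buffer and updates memory.
        addr, value = self.write_buffer.popleft()
        self.memory[addr] = value

    def load(self, addr):
        # The youngest matching buffered store wins; otherwise read memory.
        for buffered_addr, value in reversed(self.write_buffer):
            if buffered_addr == addr:
                return value
        return self.memory.get(addr, 0)   # assumption: unwritten locations read as 0


memory = {}
core0, core1 = TsoCore(memory), TsoCore(memory)
core0.store("x", 1)
core1.store("y", 1)
r0 = core0.load("y")     # the store to y is still buffered on core1
r1 = core1.load("x")     # the store to x is still buffered on core0
own = core0.load("x")    # forwarded from core0's own write buffer
core0.drain_one()
core1.drain_one()
```

Both `r0` and `r1` read the old value here, an outcome that no sequentially consistent interleaving of the four accesses allows.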

22 Problems with TSO
[Figure: two example interleavings; /* XXX */ is memory order]
- The two examples create cycles that will result in replay deadlocks

23 Solution
- Identify problematic load instructions
  - Monitor invalidations in [t1, t2]
  - t1: the load (or the previous store that feeds the load) is ordered at memory
  - t2: all preceding instructions are ordered
- Log load values and replay these load instructions by value
- HW: similar to the misspeculation detection circuitry in SC systems (e.g. MIPS R10000)
- Insufficient for supporting Processor Consistency and other more relaxed models
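
The monitoring window described above can be sketched in Python. This is a hedged illustration of the idea only: the class `LoadMonitor`, its event methods, and the log layout are hypothetical, and real hardware would track the window in the load queue rather than in a dictionary.

```python
class LoadMonitor:
    """Toy model: watch each load's block for invalidations between t1 and t2."""

    def __init__(self):
        self.pending = {}     # load id -> state for loads currently inside [t1, t2]
        self.value_log = []   # (load id, value) for loads to be replayed by value

    def load_ordered(self, load_id, block, value):
        # t1: the load (or the store feeding it) is ordered at memory.
        self.pending[load_id] = {"block": block, "value": value, "hit": False}

    def invalidation(self, block):
        # Another core wrote the block while some load on it is still in its window.
        for state in self.pending.values():
            if state["block"] == block:
                state["hit"] = True

    def preceding_ordered(self, load_id):
        # t2: every instruction before the load is now ordered.
        state = self.pending.pop(load_id)
        if state["hit"]:
            # Problematic load: record its value so replay need not order it.
            self.value_log.append((load_id, state["value"]))


monitor = LoadMonitor()
monitor.load_ordered(load_id=1, block="A", value=7)
monitor.load_ordered(load_id=2, block="B", value=9)
monitor.invalidation("A")          # falls inside load 1's window
monitor.preceding_ordered(1)
monitor.preceding_ordered(2)
monitor.invalidation("B")          # after t2 for load 2: ignored
```

Only load 1 ends up in the value log, so the common case of a load whose block is left alone during its window adds nothing to the log.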

24 Conclusion
- RTR → 1 byte/kilo-instruction
- Based on Netzer’s transitive reduction
- Creates stricter dependencies
- Vectorizes dependencies to compress the log
- Avoids overly strict dependencies, hence no deadlock