A “Flight Data Recorder” for Enabling Full-system Multiprocessor Deterministic Replay Min Xu, Rastislav Bodik, Mark D. Hill

Slides:

Advertisements

Similar presentations

Debugging operating systems with time-traveling virtual machines Sam King George Dunlap Peter Chen CoVirt Project, University of Michigan.

Advertisements

Full-System Timing-First Simulation Carl J. Mauer Mark D. Hill and David A. Wood Computer Sciences Department University of Wisconsin—Madison.

RTR: 1 Byte/Kilo-Instruction Race Recording Min Xu Rastislav BodikMark D. Hill.

Software & Services Group PinPlay: A Framework for Deterministic Replay and Reproducible Analysis of Parallel Programs Harish Patil, Cristiano Pereira,

UW-Madison Computer Sciences Multifacet Group© 2011 Karma: Scalable Deterministic Record-Replay Arkaprava Basu Jayaram Bobba Mark D. Hill Work done at.

Coherence Ordering for Ring-based Chip Multiprocessors Mike Marty and Mark D. Hill University of Wisconsin-Madison.

Exploring Memory Consistency for Massively Threaded Throughput- Oriented Processors Blake Hechtman Daniel J. Sorin 0.

Gwendolyn Voskuilen, Faraz Ahmad, and T. N. Vijaykumar Electrical & Computer Engineering ISCA 2010.

Thread Criticality Predictors for Dynamic Performance, Power, and Resource Management in Chip Multiprocessors Abhishek Bhattacharjee Margaret Martonosi.

(C) 2002 Daniel SorinWisconsin Multifacet Project SafetyNet: Improving the Availability of Shared Memory Multiprocessors with Global Checkpoint/Recovery.

Recording Inter-Thread Data Dependencies for Deterministic Replay Tarun GoyalKevin WaughArvind Gopalakrishnan.

Helper Threads via Virtual Multithreading on an experimental Itanium 2 processor platform. Perry H Wang et. Al.

CS 258 Parallel Computer Architecture Lecture 15.1 DASH: Directory Architecture for Shared memory Implementation, cost, performance Daniel Lenoski, et.

1 Thread 1Thread 2 X++T=Y Z=2T=X What is a Data Race? Two concurrent accesses to a shared location, at least one of them for writing. Indicative of a bug.

Variability in Architectural Simulations of Multi-threaded Workloads Alaa R. Alameldeen and David A. Wood University of Wisconsin-Madison

Continuously Recording Program Execution for Deterministic Replay Debugging.

G Robert Grimm New York University Disco.

Deterministic Logging/Replaying of Applications. Motivation Run-time framework goals –Collect a complete trace of a program’s user-mode execution –Keep.

BugNet Continuously Recording Program Execution for Deterministic Replay Debugging Satish Narayanasamy Gilles Pokam Brad Calder.

(C) 2002 Milo MartinHPCA, Feb Bandwidth Adaptive Snooping Milo M.K. Martin, Daniel J. Sorin Mark D. Hill, and David A. Wood Wisconsin Multifacet.

(C) 2003 Milo Martin Using Destination-Set Prediction to Improve the Latency/Bandwidth Tradeoff in Shared-Memory Multiprocessors Milo Martin, Pacia Harper,

Evaluating Non-deterministic Multi-threaded Commercial Workloads Computer Sciences Department University of Wisconsin—Madison

A “Flight Data Recorder” for Enabling Full-system Multiprocessor Deterministic Replay Min Xu, Rastislav Bodik, Mark D. Hill

(C) 2004 Daniel SorinDuke Architecture Using Speculation to Simplify Multiprocessor Design Daniel J. Sorin 1, Milo M. K. Martin 2, Mark D. Hill 3, David.

CPE 731 Advanced Computer Architecture Snooping Cache Multiprocessors Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University of.

February 11, 2003Ninth International Symposium on High Performance Computer Architecture Memory System Behavior of Java-Based Middleware Martin Karlsson,

15-740/ Oct. 17, 2012 Stefan Muller.  Problem: Software is buggy!  More specific problem: Want to make sure software doesn’t have bad property.

- 1 - Dongyoon Lee †, Mahmoud Said*, Satish Narayanasamy †, Zijiang James Yang*, and Cristiano L. Pereira ‡ University of Michigan, Ann Arbor † Western.

SSGRR A Taxonomy of Execution Replay Systems Frank Cornelis Andy Georges Mark Christiaens Michiel Ronsse Tom Ghesquiere Koen De Bosschere Dept. ELIS.

The Memory Hierarchy 21/05/2009Lecture 32_CA&O_Engr Umbreen Sabir.

SafetyNet: improving the availability of shared memory multiprocessors with global checkpoint/recovery Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill,

SafetyNet Improving the Availability of Shared Memory Multiprocessors with Global Checkpoint/Recovery Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill,

Thread-Level Speculation Karan Singh CS

Compactly Representing Parallel Program Executions Ankit Goel Abhik Roychoudhury Tulika Mitra National University of Singapore.

Simulating a $2M Commercial Server on a $2K PC Alaa R. Alameldeen, Milo M.K. Martin, Carl J. Mauer, Kevin E. Moore, Min Xu, Daniel J. Sorin, Mark D. Hill.

(C) 2003 Daniel SorinDuke Architecture Dynamic Verification of End-to-End Multiprocessor Invariants Daniel J. Sorin 1, Mark D. Hill 2, David A. Wood 2.

Effects of wrong path mem. ref. in CC MP Systems Gökay Burak AKKUŞ Cmpe 511 – Computer Architecture.

Rerun: Exploiting Episodes for Lightweight Memory Race Recording Derek R. Hower and Mark D. Hill Computer systems complex – more so with multicore What.

Virtual Hierarchies to Support Server Consolidation Mike Marty Mark Hill University of Wisconsin-Madison ISCA 2007.

A Regulated Transitive Reduction (RTR) for Longer Memory Race Recording (ASLPOS’06) Min Xu Rastislav BodikMark D. Hill Shimin Chen LBA Reading Group Presentation.

Simics: A Full System Simulation Platform Synopsis by Jen Miller 19 March 2004.

Kevin E. Moore, Jayaram Bobba, Michelle J. Moravan, Mark D. Hill & David A. Wood Presented by: Eduardo Cuervo.

Safetynet: Improving The Availability Of Shared Memory Multiprocessors With Global Checkpoint/Recovery D. Sorin M. Martin M. Hill D. Wood Presented by.

Availability in CMPs By Eric Hill Pranay Koka. Motivation RAS is an important feature for commercial servers –Server downtime is equivalent to lost money.

HARD: Hardware-Assisted lockset- based Race Detection P.Zhou, R.Teodorescu, Y.Zhou. HPCA’07 Shimin Chen LBA Reading Group Presentation.

Processor Memory Processor-memory bus I/O Device Bus Adapter I/O Device I/O Device Bus Adapter I/O Device I/O Device Expansion bus I/O Bus.

(1) SIMICS Overview. (2) SIMICS – A Full System Simulator Models disks, runs unaltered OSs etc. Accuracy is high (e.g., pollution effects factored in)

Demand-Driven Software Race Detection using Hardware Performance Counters Joseph L. Greathouse †, Zhiqiang Ma ‡, Matthew I. Frank ‡ Ramesh Peri ‡, Todd.

CS267 Lecture 61 Shared Memory Hardware and Memory Consistency Modified from J. Demmel and K. Yelick

Dynamic Verification of Sequential Consistency Albert Meixner Daniel J. Sorin Dept. of Computer Dept. of Electrical and Science Computer Engineering Duke.

Transactional Memory Coherence and Consistency Lance Hammond, Vicky Wong, Mike Chen, Brian D. Carlstrom, John D. Davis, Ben Hertzberg, Manohar K. Prabhu,

CMSC 611: Advanced Computer Architecture Shared Memory Most slides adapted from David Patterson. Some from Mohomed Younis.

Running Commodity Operating Systems on Scalable Multiprocessors Edouard Bugnion, Scott Devine and Mendel Rosenblum Presentation by Mark Smith.

Kendo: Efficient Deterministic Multithreading in Software M. Olszewski, J. Ansel, S. Amarasinghe MIT to be presented in ASPLOS 2009 slides by Evangelos.

Maurice Herlihy and J. Eliot B. Moss, ISCA '93

Rerun: Exploiting Episodes for Lightweight Memory Race Recording

The University of Adelaide, School of Computer Science

Using Destination-Set Prediction to Improve the Latency/Bandwidth Tradeoff in Shared-Memory Multiprocessors Milo Martin, Pacia Harper, Dan Sorin§, Mark.

CMSC 611: Advanced Computer Architecture

The University of Adelaide, School of Computer Science

Improving Multiple-CMP Systems with Token Coherence

CS 213 Lecture 11: Multiprocessor 3: Directory Organization

The University of Adelaide, School of Computer Science

Lecture 17 Multiprocessors and Thread-Level Parallelism

Dynamic Verification of Sequential Consistency

Lecture 17 Multiprocessors and Thread-Level Parallelism

Lecture 19: Coherence and Synchronization

The University of Adelaide, School of Computer Science

University of Wisconsin-Madison Presented by: Nick Kirchem

Lecture 17 Multiprocessors and Thread-Level Parallelism

Presentation transcript:

A “Flight Data Recorder” for Enabling Full-system Multiprocessor Deterministic Replay Min Xu, Rastislav Bodik, Mark D. Hill June 9th, 2003  Software bugs cost time & money  Hardware is getting cheaper  Use hardware to aid software debugging?

Xu et al.ISCA'03: Flight Data Recorder2 Brief Overview Approach: Full-system Record-Replay – Add H/W “Flight Data Recorder” – Target cache-coherence multiprocessor server – Enables S/W deterministic replay Full-system Evaluation: Low Overhead – Piggyback on coherence protocol: little extra H/W – Non-trivial recording interval: 1 second – Negligible runtime overhead: less than 2% – Can be “Always On”

Xu et al.ISCA'03: Flight Data Recorder3 Outline Overview – Why Deterministic Replay? – The Debugging Scenario – The Solution Recording Multithreading Recording System State & I/O Evaluation Conclusions Efficient RecordingWith full-system commercial workloads

Xu et al.ISCA'03: Flight Data Recorder4 Why Deterministic Replay? Software Bugs Happens In the Field – Differences between development & deployment – Data races (Web server, Database) – I/O interactions (OS, Device Driver) Debugging Usually happens In the Lab – Need to replay the buggy execution Use Core Dump? – Captures the final application state – Not enough for “race” bugs Need Better “Core Dump” – Enable faithfully replaying prior to the failure

Xu et al.ISCA'03: Flight Data Recorder5 The Debugging Scenario Recorder Crash Dump “Core” P1 P2 P3 P4 Checkpoint BCheckpoint C Store log AStore log BStore log C Checkpoint A Crash Read Checkpoint B Replaying from log B, C Replayer

Xu et al.ISCA'03: Flight Data Recorder6 The Solution Online Recorder – Like airplane flight data recorder – “Always on” even on deployed system – H/W based (no change to S/W) Transparent to S/W Minimal performance impact Offline Replayer – Post-mortem replay of pre-crash execution – Possibly on a different machine off-site – Based on existing technology i.e. Simics full-system simulator Focus of this workNot emphasized in this work

Xu et al.ISCA'03: Flight Data Recorder7 Outline Overview Recording Multithreading – What to record? – An example – Practical recorder hardware Recording System State & I/O Evaluation Conclusions Efficient Recording

Xu et al.ISCA'03: Flight Data Recorder8 What to Record? Multithreading Problem – Record order of instruction interleaving Assume Sequential Consistency (SC) – Accesses (appear to have) total order

Xu et al.ISCA'03: Flight Data Recorder9 Previous Record-Replay Approaches InstantReplay ’87 – Record order or memory accesses – overhead may affect program behavior Netzer ’93 – Record optimal trace – too expensive to keep track of all memory locations Bacon & Goldstein ’91 – Record memory bus transactions with hardware – high logging bandwidth RecPlay ’00 – Record only synchronizations – Not deterministic if have data races

Xu et al.ISCA'03: Flight Data Recorder10 Our Approach Uses existing cache coherence hardware – Low overhead, not affect program behavior – Works for program with races – Adapts Netzer’s algorithm in hardware – only record sync. if data race free An Example – Progressively refine the recording algorithm

Xu et al.ISCA'03: Flight Data Recorder11 Example: Record SC Order 4 Flag=1 5 X1:=515 $r1:=Flag 6 X2:=6 16 Bneq $r1,$r0,-1 7 Flag:=017 Nop 18 $r1:=Flag 19 Bneq $r1,$r0,-1 20 Nop 21 Y:=X1 22 Z:=X2 ij

Xu et al.ISCA'03: Flight Data Recorder12 Example: Record SC Order 4 Flag=1 5 X1:=515 $r1:=Flag 6 X2:=6 16 Bneq $r1,$r0,-1 7 Flag:=017 Nop 18 $r1:=Flag 19 Bneq $r1,$r0,-1 20 Nop 21 Y:=X1 22 Z:=X2 ij i:4  j:15 j:15  i:5 i:5  j:16 i:6  j:17 i:7  j:18 j:16  i:6 j:17  i:7 Need to add processor instruction count (IC) The very same interleaving is recorded, but …

Xu et al.ISCA'03: Flight Data Recorder13 Example: Record Word Conflict Order i:4  j:15 j:15  i:7 i:7  j:18 i:5  j:21 i:6  j:22 4 Flag=1 5 X1:=515 $r1:=Flag 6 X2:=6 16 Bneq $r1,$r0,-1 7 Flag:=017 Nop 18 $r1:=Flag 19 Bneq $r1,$r0,-1 20 Nop 21 Y:=X1 22 Z:=X2 ij Recording just word conflict can enable deterministic replay Hard to remember word accesses and too many arcs …

Xu et al.ISCA'03: Flight Data Recorder14 Example: Record Block Conflict Order i:4  j:15 j:15  i:7 i:7  j:18 i:5  j:21 i:6  j:22 4 Flag=1 5 X1:=515 $r1:=Flag 6 X2:=6 16 Bneq $r1,$r0,-1 7 Flag:=017 Nop 18 $r1:=Flag 19 Bneq $r1,$r0,-1 20 Nop 21 Y:=X1 22 Z:=X2 ij

Xu et al.ISCA'03: Flight Data Recorder15 Example: Record Block Conflict Order i:4  j:15 j:15  i:7 i:7  j:18 i:5  j:21 i:6  j:22 4 Flag=1 5 X1:=515 $r1:=Flag 6 X2:=6 16 Bneq $r1,$r0,-1 7 Flag:=017 Nop 18 $r1:=Flag 19 Bneq $r1,$r0,-1 20 Nop 21 Y:=X1 22 Z:=X2 ij Need to remember last accessing IC in the cache i:6  j:21 But, can we do better?

Xu et al.ISCA'03: Flight Data Recorder16 Example: Apply Transitive Reduction i:4  j:15 j:15  i:7 i:7  j:18 i:5  j:21 i:6  j:22 4 Flag=1 5 X1:=515 $r1:=Flag 6 X2:=6 16 Bneq $r1,$r0,-1 7 Flag:=017 Nop 18 $r1:=Flag 19 Bneq $r1,$r0,-1 20 Nop 21 Y:=X1 22 Z:=X2 ij i:6  j:21

Xu et al.ISCA'03: Flight Data Recorder17 Three arcs! No need to know syncs Automatic sync only for race free program Example: Apply Transitive Reduction i:4  j:15 j:15  i:7 i:7  j:18 i:5  j:21 i:6  j:22 4 Flag=1 5 X1:=515 $r1:=Flag 6 X2:=6 16 Bneq $r1,$r0,-1 7 Flag:=017 Nop 18 $r1:=Flag 19 Bneq $r1,$r0,-1 20 Nop 21 Y:=X1 22 Z:=X2 ij i:3  j:21

Xu et al.ISCA'03: Flight Data Recorder18 Practical Recorder Hardware Processor – instruction count 4 bytes per processor Cache – last access instruction count 6.25% space overhead Coherence Controller – vector of instruction counters 3×4 bytes per processor for 4-way multiprocessor Finite Cache, Out-of-Order, Prefetch, etc. – Recorder still applicable – Details in the paper

Xu et al.ISCA'03: Flight Data Recorder19 Outline

Xu et al.ISCA'03: Flight Data Recorder20 SafetyNet Checkpoint Hardware Problem – To beginning of “replay” interval – Logically take a snapshot of the system Solution – Adapt SafetyNet [Sorin et al. ISCA ‘02] Processor Checkpointing Memory Incremental logging Slightly modified for longer interval

Xu et al.ISCA'03: Flight Data Recorder21 Recording I/O Interrupts – Not exceptions – Record Interrupt type & IC Instruction I/O – Load: record values – Store: ignored DMA – Record input values – Record ordering: as pseudo thread

Xu et al.ISCA'03: Flight Data Recorder22 Outline Overview Recording Memory Races Recording System State & I/O Evaluation – An example system – Simulation methods – Runtime, log size Conclusions With full-system commercial workloads

Xu et al.ISCA'03: Flight Data Recorder23 Target System Commercial Server H/W – Sequential Consistent CC-NUMA – Full I/O: Interrupt, DMA, etc. – Simulation system (Simics + Memory Simulator) 4 way in-order issue, 1 GHz, 4 processors 128KB I/D L1, 4MB L2, MOSI directory protocol Commercial Server S/W – Unmodified commercial server benchmarks Apache, Slash, SPEC JBB, OLTP

Xu et al.ISCA'03: Flight Data Recorder24 CC-NUMA MP An Example System Memory Banks DMA Interface Core Cache(s) Cache Controller Directory Data Compressor (LZ77) Recorder Memory DMA Content & Order Interrupts, I/O Cache Checkpoint Memory Races Memory Checkpoint

Xu et al.ISCA'03: Flight Data Recorder25 Runtime Overhead Slowdown – Less than 2% – statistically insignificant for 2 workloads – No problem “always on” Slowdown causes – Extra traffic – Stall by buffer overflow – More blocking – Extra coherence message on some get- shared’s Runtime per Transaction (Normalized to base system) OLTPJBBAPACHESLASH

Xu et al.ISCA'03: Flight Data Recorder26 UncompressedCompressed Log Size 1 – 1.33 Second Recording – Buffer: 35 MB (7%); Bandwidth: 25 MB/Second/Processor Efficient Race Log – Longer recording is possible with better checkpoint scheme Longer Recording – Using disk can get longer replay: 320 GB disk = ~3 hours recording Interrupt, Input, DMA Log Races log Checkpoint Log Log Size (MB/Second/Processor) OLTPJBBAPACHESLASH OLTPJBBAPACHESLASH

Xu et al.ISCA'03: Flight Data Recorder27 Conclusion Low Overhead Deterministic Replay – Piggyback MP cache coherence hardware – Modest extra hardware – Modest overhead (less than 2% slowdown) Minimal race recording with transitive reduction Full-system Deterministic Replay – Evaluated with commercial workloads – Full-system recording (including OS, I/O)

Xu et al.ISCA'03: Flight Data Recorder28 Thank You Questions?

Xu et al.ISCA'03: Flight Data Recorder29 Flight Data Recorder vs. ReEnact Flight Data RecorderReEnact Target SystemCC-NUMATLS Deterministic Replay?YesYes* Race-detection?No**Yes Effective Interval (instructions)>100,000,000<100,000 Slowdown<2%Avg 5.8% OS, I/OYesNo (extendable?) Active during OS & I/O?YesNo * Need to disable TLS? ** Not in the recorder, but in the replayer

Xu et al.ISCA'03: Flight Data Recorder30 Scalability More processors, more races log – Not a quadratic increase – e.g. 4p to 16p for 2x more log Real systems have more I/O – But, also more memory available for log

Xu et al.ISCA'03: Flight Data Recorder31 Protocol Changes Get IC count from source processor – W  R: Piggyback IC count to DataResponse msg – W  W: Piggyback IC count to DataResponse msg – R  W: Piggyback IC count to InvalidateAck msg Cache block Writeback – Snooping protocol Eager IC update Extra messages on interconnect Not on critical path – Directory based protocol Lazy IC update Extra latency for cache misses

Xu et al.ISCA'03: Flight Data Recorder32 Replayer (Full-system Simulator) Input data to the replayer – Checkpoint – Execution log – DMA log – I/O log – Exception log Replay the execution – Load system checkpoint: registers, TLB, etc – Replay the MP execution order in partial order – Replay the I/O and exceptions – Proper device model needed to interrupt system output – Memory inspection support – Step forward/backward (enhanced debugger features)

Xu et al.ISCA'03: Flight Data Recorder33 Example: False Sharing 32 X1:=515 $r1:=Flag 33 X2:=6 16 Bneq $r1,$r0,-1 34 Flag:=017 Nop 15(P1,31) 34(P2,15) 18(P1,34) 21(P1,32) 22(P1,33) 31 Flag=1 14 Private2:=2 18 $r1:=Flag35 Private1:=3 19 Bneq $r1,$r0,-1 20 Nop 21 Y:=X1 22 Z:=X2 P1P2

Xu et al.ISCA'03: Flight Data Recorder34 19 Bneq $r1,$r0,-1 20 Nop 21 Y:=X1 22 Z:=X2 32 X1:=515 $r1:=Flag 33 X2:=6 16 Bneq $r1,$r0,-1 34 Flag:=017 Nop 15(P1,31) 34(P2,15) 18(P1,34) 21(P1,32) 22(P1,33) 31 Flag=1 14 Private2:=2 18 $r1:=Flag35 Private1:=3 P1P2 Example: False Sharing

Xu et al.ISCA'03: Flight Data Recorder35 19 Bneq $r1,$r0,-1 20 Nop 21 Y:=X1 22 Z:=X2 32 X1:=515 $r1:=Flag 33 X2:=6 16 Bneq $r1,$r0,-1 34 Flag:=017 Nop 15(P1,31) 34(P2,15) 18(P1,34) 21(P1,32) 22(P1,33) 31 Flag=1 14 Private2:=2 18 $r1:=Flag35 Private1:=3 21(P1,33) 35(P2,14) P1P2 Example: False Sharing

Xu et al.ISCA'03: Flight Data Recorder36 19 Bneq $r1,$r0,-1 20 Nop 21 Y:=X1 22 Z:=X2 32 X1:=515 $r1:=Flag 33 X2:=6 16 Bneq $r1,$r0,-1 P1P2 34 Flag:=017 Nop 15(P1,31) 34(P2,15) 18(P1,34) 21(P1,32) 22(P1,33) 31 Flag=1 14 Private2:=2 18 $r1:=Flag35 Private1:=3 21(P1,33) 35(P2,14) Example: False Sharing

Xu et al.ISCA'03: Flight Data Recorder37 19 Bneq $r1,$r0,-1 20 Nop 21 Y:=X1 22 Z:=X2 32 X1:=515 $r1:=Flag 33 X2:=6 16 Bneq $r1,$r0,-1 34 Flag:=017 Nop 15(P1,31) 34(P2,15) 18(P1,34) 21(P1,32) 22(P1,33) 31 Flag=1 14 Private2:=2 18 $r1:=Flag35 Private1:=3 21(P1,33) 35(P2,14) P1P2 Example: False Sharing

Xu et al.ISCA'03: Flight Data Recorder38 19 Bneq $r1,$r0,-1 20 Nop 21 Y:=X1 22 Z:=X2 32 X1:=515 $r1:=Flag 33 X2:=6 16 Bneq $r1,$r0,-1 34 Flag:=017 Nop 15(P1,31) 34(P2,15) 18(P1,34) 21(P1,32) 22(P1,33) 31 Flag=1 14 Private2:=2 18 $r1:=Flag35 Private1:=3 21(P1,33) 35(P2,14) P1P2 Example: False Sharing