A “Flight Data Recorder” for Enabling Full-system Multiprocessor Deterministic Replay Min Xu, Rastislav Bodik, Mark D. Hill

1 A “Flight Data Recorder” for Enabling Full-system Multiprocessor Deterministic Replay
Min Xu, Rastislav Bodik, Mark D. Hill
http://www.cs.wisc.edu/multifacet
June 9th, 2003
– Software bugs cost time & money
– Hardware is getting cheaper
– Use hardware to aid software debugging?

2 Brief Overview
Approach: Full-system Record-Replay
– Add H/W “Flight Data Recorder”
– Target cache-coherent multiprocessor server
– Enables S/W deterministic replay
Full-system Evaluation: Low Overhead
– Piggyback on coherence protocol: little extra H/W
– Non-trivial recording interval: 1 second
– Negligible runtime overhead: less than 2%
– Can be “Always On”
Xu et al. ISCA'03: Flight Data Recorder

3 Outline
Overview
– Why Deterministic Replay?
– The Debugging Scenario
– The Solution
Recording Multithreading (efficient recording)
Recording System State & I/O
Evaluation (with full-system commercial workloads)
Conclusions

4 Why Deterministic Replay?
Software Bugs Happen in the Field
– Differences between development & deployment
– Data races (Web server, Database)
– I/O interactions (OS, Device Driver)
Debugging Usually Happens in the Lab
– Need to replay the buggy execution
Use Core Dump?
– Captures the final application state
– Not enough for “race” bugs
Need a Better “Core Dump”
– Enable faithful replay of the execution prior to the failure

5 The Debugging Scenario
[Figure: execution timeline for processors P1–P4]
Recorder: Checkpoint A, Store log A; Checkpoint B, Store log B; Checkpoint C, Store log C; Crash; Dump “Core”
Replayer: Read Checkpoint B; Replaying from log B, C
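The scenario above can be sketched in a few lines of Python. This is only an illustration: the slide does not say how many intervals the recorder retains, so the capacity of two (which reproduces "read Checkpoint B, replay from log B, C") is an assumption, and `IntervalBuffer` and its methods are hypothetical names.

```python
from collections import deque


class IntervalBuffer:
    """Keeps the most recent checkpoint/log intervals (capacity is an assumption)."""

    def __init__(self, capacity=2):
        # A bounded deque silently drops the oldest interval when a new one starts.
        self.intervals = deque(maxlen=capacity)

    def start_interval(self, name):
        # Taking a checkpoint opens a new interval with an empty log.
        self.intervals.append({"checkpoint": name, "log": []})

    def record(self, entry):
        # Log entries always go to the interval that is currently open.
        self.intervals[-1]["log"].append(entry)

    def crash_dump(self):
        # Replay starts at the oldest retained checkpoint and uses every retained log.
        start = self.intervals[0]["checkpoint"]
        logs = [interval["checkpoint"] for interval in self.intervals]
        return start, logs


buf = IntervalBuffer()
for name in "ABC":
    buf.start_interval(name)
    buf.record(f"entry recorded during interval {name}")

print(buf.crash_dump())  # start checkpoint and the logs to replay
```

With three intervals recorded and room for two, interval A has been overwritten by the time of the crash, so replay begins at checkpoint B.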

6 The Solution
Online Recorder (focus of this work)
– Like airplane flight data recorder
– “Always on” even on deployed system
– H/W based (no change to S/W): transparent to S/W, minimal performance impact
Offline Replayer (not emphasized in this work)
– Post-mortem replay of pre-crash execution
– Possibly on a different machine off-site
– Based on existing technology, i.e. Simics full-system simulator

7 Outline
Overview
Recording Multithreading (efficient recording)
– What to record?
– An example
– Practical recorder hardware
Recording System State & I/O
Evaluation
Conclusions

8 What to Record?
Multithreading Problem
– Record order of instruction interleaving
Assume Sequential Consistency (SC)
– Accesses (appear to have) total order

9 Previous Record-Replay Approaches
InstantReplay ’87
– Record order of memory accesses
– Overhead may affect program behavior
Netzer ’93
– Record optimal trace
– Too expensive to keep track of all memory locations
Bacon & Goldstein ’91
– Record memory bus transactions with hardware
– High logging bandwidth
RecPlay ’00
– Record only synchronizations
– Not deterministic if the program has data races

10 Our Approach
Uses existing cache coherence hardware
– Low overhead, does not affect program behavior
– Works for programs with races
– Adapts Netzer’s algorithm in hardware
– Records only synchronization if the program is data-race free
An Example
– Progressively refine the recording algorithm

11 Example: Record SC Order
Processor i       Processor j
4 Flag=1
5 X1:=5           15 $r1:=Flag
6 X2:=6           16 Bneq $r1,$r0,-1
7 Flag:=0         17 Nop
                  18 $r1:=Flag
                  19 Bneq $r1,$r0,-1
                  20 Nop
                  21 Y:=X1
                  22 Z:=X2

12 Example: Record SC Order
(code as on slide 11)
Arcs: i:4 → j:15, j:15 → i:5, i:5 → j:16, j:16 → i:6, i:6 → j:17, j:17 → i:7, i:7 → j:18
– Need to add processor instruction count (IC)
– The very same interleaving is recorded, but …

13 Example: Record Word Conflict Order
(code as on slide 11)
Arcs: i:4 → j:15, j:15 → i:7, i:7 → j:18, i:5 → j:21, i:6 → j:22
– Recording just word conflicts can enable deterministic replay
– Hard to remember word accesses, and too many arcs …
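As a rough illustration of word-conflict recording, the sketch below derives the arcs from one interleaved trace of the example program. The trace encoding, the function name `conflict_arcs`, and the software bookkeeping are assumptions made for illustration; the slides describe a hardware recorder, not this code.

```python
def conflict_arcs(trace):
    """Return inter-processor conflict arcs for a sequentially consistent trace.

    trace: list of (proc, ic, op, addr) in execution order, with op "R" or "W".
    An arc ((p, a), (q, b)) means instruction a on p must precede b on q.
    """
    last_write = {}  # addr -> (proc, ic) of the most recent write
    readers = {}     # addr -> {proc: ic} of reads since that write
    arcs = []
    for proc, ic, op, addr in trace:
        writer = last_write.get(addr)
        if op == "R":
            # W -> R conflict with the last writer on another processor.
            if writer and writer[0] != proc:
                arcs.append((writer, (proc, ic)))
            readers.setdefault(addr, {})[proc] = ic
        else:
            # R -> W conflicts with readers on other processors.
            for rproc, ric in readers.get(addr, {}).items():
                if rproc != proc:
                    arcs.append(((rproc, ric), (proc, ic)))
            # W -> W conflict with the last writer on another processor.
            if writer and writer[0] != proc:
                arcs.append((writer, (proc, ic)))
            last_write[addr] = (proc, ic)
            readers[addr] = {}
    return arcs


# Memory accesses of the example, in the interleaving recorded on slide 12.
TRACE = [
    ("i", 4, "W", "Flag"), ("j", 15, "R", "Flag"),
    ("i", 5, "W", "X1"), ("i", 6, "W", "X2"),
    ("i", 7, "W", "Flag"), ("j", 18, "R", "Flag"),
    ("j", 21, "R", "X1"), ("j", 21, "W", "Y"),
    ("j", 22, "R", "X2"), ("j", 22, "W", "Z"),
]

for src, dst in conflict_arcs(TRACE):
    print(f"{src[0]}:{src[1]} -> {dst[0]}:{dst[1]}")
```

The sketch needs a last-writer entry and a reader set for every word ever touched, which is the bookkeeping cost the slide calls "hard to remember".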

14 Example: Record Block Conflict Order
(code as on slide 11)
Arcs: i:4 → j:15, j:15 → i:7, i:7 → j:18, i:5 → j:21, i:6 → j:22

15 Example: Record Block Conflict Order
(code as on slide 11)
Arcs: i:4 → j:15, j:15 → i:7, i:7 → j:18, i:5 → j:21, i:6 → j:22
Block-level arc: i:6 → j:21
– Need to remember last accessing IC in the cache
– But, can we do better?

16 Example: Apply Transitive Reduction
(code as on slide 11)
Arcs: i:4 → j:15, j:15 → i:7, i:7 → j:18, i:5 → j:21, i:6 → j:22
Block-level arc: i:6 → j:21

17 Example: Apply Transitive Reduction
(code as on slide 11)
Arcs: i:4 → j:15, j:15 → i:7, i:7 → j:18, i:5 → j:21, i:6 → j:22, i:3 → j:21
– Three arcs!
– No need to know syncs
– Automatically records only syncs for race-free programs
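The reduction can be illustrated with a small filter over the block-level arcs. An arc i:a → j:b is redundant when an already logged arc i:a' → j:b' has a ≤ a' and b' ≤ b, because program order on each processor supplies the remaining steps. The function name and the list-based input are assumptions for this sketch, and it only checks arcs between the same pair of processors, not chains through a third processor.

```python
def reduce_arcs(arcs):
    """Drop arcs implied by an earlier logged arc between the same two processors.

    Each arc is ((src_proc, src_ic), (dst_proc, dst_ic)). Arcs must be listed in
    the order in which their destination instructions executed.
    """
    latest = {}  # (dst_proc, src_proc) -> largest source IC already logged
    kept = []
    for (src, src_ic), (dst, dst_ic) in arcs:
        if src_ic > latest.get((dst, src), -1):
            kept.append(((src, src_ic), (dst, dst_ic)))
            latest[(dst, src)] = src_ic
        # Otherwise a logged arc with a later source and an earlier
        # destination already enforces this ordering.
    return kept


BLOCK_ARCS = [
    (("i", 4), ("j", 15)),
    (("j", 15), ("i", 7)),
    (("i", 7), ("j", 18)),
    (("i", 6), ("j", 21)),  # X1 and X2 share a block; its last access on i is IC 6
]

print(reduce_arcs(BLOCK_ARCS))
```

The arc i:6 → j:21 is dropped because i:6 precedes i:7 in program order, i:7 → j:18 is logged, and j:18 precedes j:21, which leaves the three arcs the slide refers to.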

18 Practical Recorder Hardware
Processor
– Instruction count: 4 bytes per processor
Cache
– Last access instruction count: 6.25% space overhead
Coherence Controller
– Vector of instruction counters: 3×4 bytes per processor for 4-way multiprocessor
Finite Cache, Out-of-Order, Prefetch, etc.
– Recorder still applicable
– Details in the paper
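The overhead figures can be checked with simple arithmetic. The sketch assumes one 4-byte instruction count per cache block and a 64-byte block; the slide states neither, but 64 bytes is the block size that reproduces the quoted 6.25%. The function names are illustrative only.

```python
IC_BYTES = 4  # width of one instruction count, from the slide


def cache_overhead(block_bytes):
    """Fraction of extra cache storage for one IC per block (assumed layout)."""
    return IC_BYTES / block_bytes


def vic_bytes(processors):
    """Size of the IC vector: one entry for every other processor."""
    return (processors - 1) * IC_BYTES


print(f"{cache_overhead(64):.2%}")  # 64-byte block is an assumption
print(vic_bytes(4))                 # 4-way multiprocessor
```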

19 Further Details
At each processor j:
– IC = inst count of last committed inst by j
– VIC[P] = latest IC received from each proc
– CIC[M] = IC of last load/store of each block b in j’s L1
On commit, IC++; if the instruction loads or stores block b then CIC[b] = IC
i sends arc start (i, CIC_i[b]) on coherence reply
On coherence reply receive, arc end is (j, IC+1)
– If CIC_i[b] > VIC[i] then log arc; VIC[i] = CIC_i[b]
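The per-processor rules above can be modelled in software as follows. This is a sketch under stated assumptions: the class `Recorder`, its method names, the initial VIC value of 0, and the explicit `reply`/`receive` calls standing in for coherence messages are all hypothetical, and both loads and stores update CIC.

```python
class Recorder:
    """Recorder state for one processor, using the slide's IC / VIC / CIC names."""

    def __init__(self, pid, peers, start_ic=0):
        self.pid = pid
        self.ic = start_ic                # IC of the last committed instruction
        self.vic = {p: 0 for p in peers}  # VIC[p]: latest IC received from p
        self.cic = {}                     # CIC[b]: IC of the last access to block b
        self.log = []

    def commit(self, *blocks):
        """Commit one instruction that touches the given blocks (possibly none)."""
        self.ic += 1
        for block in blocks:
            self.cic[block] = self.ic

    def reply(self, block):
        """Arc start piggybacked on a coherence reply for `block`."""
        return self.pid, self.cic[block]

    def receive(self, arc_start):
        """Handle a coherence reply; log the arc unless a logged arc implies it."""
        src, src_ic = arc_start
        if src_ic > self.vic[src]:
            self.log.append((arc_start, (self.pid, self.ic + 1)))
            self.vic[src] = src_ic


# Replay of the two-processor example; X1 and X2 are assumed to share block "X".
i = Recorder("i", ["j"], start_ic=3)
j = Recorder("j", ["i"], start_ic=14)

i.commit("Flag")              # i:4 writes Flag
j.receive(i.reply("Flag"))    # j misses on Flag
j.commit("Flag")              # j:15 reads Flag
i.commit("X")                 # i:5 writes X1
i.commit("X")                 # i:6 writes X2
j.commit()                    # j:16
j.commit()                    # j:17
i.receive(j.reply("Flag"))    # invalidation ack from j
i.commit("Flag")              # i:7 writes Flag
j.receive(i.reply("Flag"))
j.commit("Flag")              # j:18 reads Flag
j.commit()                    # j:19
j.commit()                    # j:20
j.receive(i.reply("X"))       # carries (i, 6); 6 <= VIC[i] = 7, so not logged
j.commit("X", "Y")            # j:21 reads X1, writes Y
j.commit("X", "Z")            # j:22 reads X2 (cache hit), writes Z

print(i.log, j.log)
```

Only three arcs end up in the two logs, matching the reduced set in the example slides.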

20 Example: Transitive Reduction
(code as on slide 11)
Arcs logged: i:4 → j:15, j:15 → i:7, i:7 → j:18
State (IC, LOG, VIC): VIC i: 7; VIC j: 15; CIC {x1,x2}: 6
Coherence reply carries (i,6): 6 < 7, so ignore coherence traffic

21 Relaxation
Sending
– Exact: coherence reply sends (i, CIC[b])
– Safe: send (i, x) for any x: CIC[b] ≤ x ≤ IC
Receiving
– Exact: on receive, add arc (x, IC’+1)
– Safe: add arc (x, y) for any y: IC+1 ≤ y ≤ IC’+1
[Figure: timeline showing CIC[b] ≤ x ≤ IC and y (e.g., IC+1) ≤ IC’+1 (read b)]
Applies to: Finite Caches, Speculative Processors
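The safe ranges can be written as two small predicates. Moving x later or y earlier only makes the recorded ordering stricter than necessary, so replay stays correct. In this sketch the helper names are hypothetical, and the reading of IC as the receiver's count when the reply arrives and IC’ as its count just before the access commits is an interpretation of the slide, not something it states.

```python
def arc_start(cic, block, ic):
    """Value x to send on a coherence reply.

    If the CIC entry for `block` is gone (for example after an eviction),
    fall back to the current IC, the largest value in CIC[b] <= x <= IC.
    """
    return cic.get(block, ic)


def is_safe_start(x, cic_b, ic):
    """Check CIC[b] <= x <= IC on the sending processor."""
    return cic_b <= x <= ic


def is_safe_end(y, ic_now, ic_at_access):
    """Check IC+1 <= y <= IC'+1 on the receiving processor (assumed reading)."""
    return ic_now + 1 <= y <= ic_at_access + 1


print(arc_start({"X": 6}, "X", 9))  # exact value is still available
print(arc_start({}, "X", 9))        # entry lost: send the current IC instead
```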

22 Outline

23 SafetyNet Checkpoint Hardware
Problem
– Need the system state at the beginning of the “replay” interval
– Logically take a snapshot of the system
Solution
– Adapt SafetyNet [Sorin et al. ISCA ’02]
– Processor: checkpointing
– Memory: incremental logging
– Slightly modified for longer interval

24 Recording I/O
Interrupts
– Not exceptions
– Record interrupt type & IC
Instruction I/O
– Load: record values
– Store: ignored
DMA
– Record input values
– Record ordering: as pseudo thread
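One way to picture these rules is as a filter that decides which events produce a log entry. The event kinds, field names, and `record_event` function are hypothetical; the ordering of DMA accesses as a pseudo thread is not modelled here.

```python
from dataclasses import dataclass
from typing import Optional

# Event kinds that are logged, with the payload each one carries.
RECORDED = {
    "interrupt": "interrupt type",
    "io_load": "loaded value",
    "dma_input": "input data",
}


@dataclass(frozen=True)
class LogEntry:
    kind: str
    ic: Optional[int]  # instruction count of the event; None when not applicable
    payload: object


def record_event(kind, payload=None, ic=None):
    """Return a log entry for events that must be recorded, else None."""
    if kind in RECORDED:
        return LogEntry(kind, ic, payload)
    # Exceptions and I/O stores are not logged.
    return None


print(record_event("interrupt", payload="timer", ic=1042))
print(record_event("io_store", payload=0xFF, ic=1050))
```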

25 Outline
Overview
Recording Memory Races
Recording System State & I/O
Evaluation (with full-system commercial workloads)
– An example system
– Simulation methods
– Runtime, log size
Conclusions

26 Target System
Commercial Server H/W
– Sequentially consistent CC-NUMA
– Full I/O: Interrupt, DMA, etc.
– Simulation system (Simics + Memory Simulator): 4-way in-order issue, 1 GHz, 4 processors; 128KB I/D L1, 4MB L2, MOSI directory protocol
Commercial Server S/W
– Unmodified commercial server benchmarks: Apache, Slash, SPEC JBB, OLTP

27 An Example System
[Figure: CC-NUMA MP node with Core, Cache(s), Cache Controller, Directory, Memory Banks, DMA Interface, Data Compressor (LZ77), and Recorder Memory]
Recorded data: DMA Content & Order; Interrupts, I/O; Cache Checkpoint; Memory Races; Memory Checkpoint

28 Runtime Overhead
Slowdown
– Less than 2%
– Statistically insignificant for 2 workloads
– No problem “always on”
Slowdown causes
– Extra traffic
– Stall by buffer overflow
– More blocking
– Extra coherence message on some get-shared’s
[Figure: Runtime per Transaction (normalized to base system) for OLTP, JBB, APACHE, SLASH]
Workloads
– OLTP: database transactions (TPC-C on DB2)
– JBB: server-side Java benchmark that models a 3-tier system
– APACHE: static web server
– SLASH: dynamic web server (slashdot message posting)

29 Log Size
1 – 1.33 Second Recording
– Buffer: 35 MB (7%); Bandwidth: 25 MB/Second/Processor
Efficient Race Log
– Longer recording is possible with better checkpoint scheme
Longer Recording
– Using disk can get longer replay: 320 GB disk = ~3 hours recording
[Figure: Log Size (MB/Second/Processor), uncompressed and compressed, for OLTP, JBB, APACHE, SLASH; components: Interrupt, Input, DMA Log; Races Log; Checkpoint Log]

30 Conclusion
Low Overhead Deterministic Replay
– Piggyback on MP cache coherence hardware
– Modest extra hardware
– Modest overhead (less than 2% slowdown)
Minimal race recording with transitive reduction
Full-system Deterministic Replay
– Evaluated with commercial workloads
– Full-system recording (including OS, I/O)

31 Thank You
Questions?

32 Flight Data Recorder vs. ReEnact
                                    Flight Data Recorder   ReEnact
Target System                       CC-NUMA                TLS
Deterministic Replay?               Yes                    Yes*
Race-detection?                     No**                   Yes
Effective Interval (instructions)   >100,000,000           <100,000
Slowdown                            <2%                    Avg 5.8%
OS, I/O                             Yes                    No (extendable?)
Active during OS & I/O?             Yes                    No
* Need to disable TLS?
** Not in the recorder, but in the replayer

33 Scalability
More processors, more races log
– Not a quadratic increase
– e.g. 4p to 16p for 2x more log
Real systems have more I/O
– But also more memory available for log

34 Protocol Changes
Get IC count from source processor
– W → R: Piggyback IC count on DataResponse msg
– W → W: Piggyback IC count on DataResponse msg
– R → W: Piggyback IC count on InvalidateAck msg
Cache block Writeback
– Snooping protocol: eager IC update; extra messages on interconnect; not on critical path
– Directory-based protocol: lazy IC update; extra latency for cache misses
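The mapping from conflict type to carrier message can be stated as a tiny lookup. The function name is hypothetical; the message names are the ones on the slide.

```python
def piggyback_message(earlier, later):
    """Message that carries the source IC for a conflict `earlier -> later`.

    `earlier` and `later` are "R" or "W": the access on the source processor
    and the conflicting access on the requesting processor.
    """
    if (earlier, later) in (("W", "R"), ("W", "W")):
        return "DataResponse"
    if (earlier, later) == ("R", "W"):
        return "InvalidateAck"
    return None  # R -> R is not a conflict, so no arc is needed


for pair in (("W", "R"), ("W", "W"), ("R", "W"), ("R", "R")):
    print(pair, piggyback_message(*pair))
```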

35 Replayer (Full-system Simulator)
Input data to the replayer
– Checkpoint
– Execution log
– DMA log
– I/O log
– Exception log
Replay the execution
– Load system checkpoint: registers, TLB, etc.
– Replay the MP execution order in partial order
– Replay the I/O and exceptions
– Proper device model needed to interpret system output
– Memory inspection support
– Step forward/backward (enhanced debugger features)
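Replaying "in partial order" means any schedule is acceptable as long as every logged arc is respected. The sketch below is a simplified scheduler under that reading: the function name, the round-robin policy, and the IC-range input are assumptions, and a real replayer would execute instructions in a full-system simulator rather than just list them.

```python
def replay(ranges, arcs):
    """Produce one execution order that respects every logged arc.

    ranges: {proc: (first_ic, last_ic)} instructions to replay on each processor.
    arcs:   [((src_proc, src_ic), (dst_proc, dst_ic))] logged ordering constraints.
    """
    done = {p: first - 1 for p, (first, _) in ranges.items()}
    waits = {}  # (dst_proc, dst_ic) -> list of (src_proc, src_ic) it must wait for
    for src, dst in arcs:
        waits.setdefault(dst, []).append(src)

    order = []
    progressed = True
    while progressed:
        progressed = False
        for p, (_, last) in ranges.items():
            nxt = done[p] + 1
            if nxt > last:
                continue
            # An instruction may commit once every arc source has committed.
            if all(done[sp] >= sic for sp, sic in waits.get((p, nxt), [])):
                done[p] = nxt
                order.append((p, nxt))
                progressed = True

    if any(done[p] < last for p, (_, last) in ranges.items()):
        raise ValueError("log is cyclic or refers to instructions outside the ranges")
    return order


ARCS = [(("i", 4), ("j", 15)), (("j", 15), ("i", 7)), (("i", 7), ("j", 18))]
ORDER = replay({"i": (4, 7), "j": (15, 22)}, ARCS)
print(ORDER)
```

The resulting schedule need not match the original interleaving instruction for instruction; it only has to preserve the order of conflicting accesses.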

36 Example: False Sharing
P1                P2
31 Flag=1         14 Private2:=2
32 X1:=5          15 $r1:=Flag
33 X2:=6          16 Bneq $r1,$r0,-1
34 Flag:=0        17 Nop
35 Private1:=3    18 $r1:=Flag
                  19 Bneq $r1,$r0,-1
                  20 Nop
                  21 Y:=X1
                  22 Z:=X2
Log entries: 15(P1,31), 34(P2,15), 18(P1,34), 21(P1,32), 22(P1,33)

37 Example: False Sharing
(animation step; code and log entries as on slide 36)

38 Example: False Sharing
(code as on slide 36)
Log entries: 15(P1,31), 34(P2,15), 18(P1,34), 21(P1,32), 22(P1,33)
Added entries: 21(P1,33), 35(P2,14)

39 Example: False Sharing
(animation step; code and log entries as on slide 38)

40 Example: False Sharing
(animation step; code and log entries as on slide 38)

41 Example: False Sharing
(animation step; code and log entries as on slide 38)

