Presentation is loading. Please wait.

Presentation is loading. Please wait.

Dongyoon Lee, Benjamin Wester, Kaushik Veeraraghavan, Satish Narayanasamy, Peter M. Chen, and Jason Flinn University of Michigan, Ann Arbor Respec: Efficient.

Similar presentations


Presentation on theme: "Dongyoon Lee, Benjamin Wester, Kaushik Veeraraghavan, Satish Narayanasamy, Peter M. Chen, and Jason Flinn University of Michigan, Ann Arbor Respec: Efficient."— Presentation transcript:

1 Dongyoon Lee, Benjamin Wester, Kaushik Veeraraghavan, Satish Narayanasamy, Peter M. Chen, and Jason Flinn University of Michigan, Ann Arbor Respec: Efficient Online Multiprocessor Replay via Speculation and External Determinism

2 Dongyoon Lee Deterministic Replay Record and reproduce non-deterministic events 1)Offline Uses: replay repeatedly after original run Debugging Forensics 2)Online Uses: record and replay concurrently Fault tolerance Decoupled runtime checks We focus on online replay for multi-processors Deterministic Replay 2

3 Dongyoon Lee Online Deterministic Replay Uses ServerReplica Takeover Fault ToleranceDecoupled Runtime Checks AppReplay + Check A A + Check B B C C Need to record and replay concurrently Both recording and replaying should be efficient Requestlog Response Fault !! replay keep the same state 3 P1 P2 P3 P4 A A B B C C

4 Dongyoon Lee Uniprocessor Replay Program Input (e.g. system calls, signals, etc) Thread scheduling Multiprocessor Replay: + Shared memory dependencies Instrument every memory operation PinSEL [Pereira, IISWC’08], iDNA [Bhansali, VEE’06] Page protection SMP-ReVirt [Dunlap, VEE’08] Offline search ODR [Altekar, SOSP’09], PRES [Park, SOSP’09] Replay-SAT [Lee, MICRO’09] Hardware support FDR [Xu, ISCA’03], Strata [Narayanasamy, ASPLOS’06], ReRun [Hower, ISCA’08], DeLorean [Montesinos, ISCA’08] Past Solutions for Deterministic Replay → 10-100x → 2-9x → Slow replay → Custom HW 4

5 Dongyoon Lee Goal: Efficient online software-only multiprocessor replay Key Idea: Speculation + Check 1)Speculate data race free 2)Detect mis-speculation using a cheap check 3)Rollback and retry on mis-speculation Overview of Our Approach multi-threaded fork Lock(l) Unlock(l) Lock(l) T1T2 Checkpoint A Recorded Process T1’T2’ A’ Replayed Process Lock(l’) Unlock(l’) Lock(l’) Speculate Race free Check B’==B? Checkpoint B 5

6 Dongyoon Lee Motivation/Overview Respec Design 1.Speculate data race free 2.Detect mis-speculation 3.Rollback and Retry on mis-speculation Evaluation Conclusion Roadmap 6

7 Dongyoon Lee Observation Reproducing program input and happens-before order of sync. operations guarantees deterministic replay of data-race-free programs [Ronsse and Bosschere ’99] 1) Program input ( e.g. system calls, signals, etc. ) Record: Log system call effects Replay: Emulate system call 2) Synchronization Operations Record and replay happens-before order Instrument common (not all) synchronization primitives in glibc Deterministic Replay of Data-race-free Programs 7 + total order

8 Dongyoon Lee What if a program is NOT race free? Problem Need to detect mis-speculation Data race detector is too heavy-weight Insight: External Determinism is sufficient Not necessary to replay data races Ensure that the replayed process produces the same visible effects as the recorded process to an external observer Visible effects = System output + Final program state Solution: Divergence checks Detect mis-speculation when the replay is not externally deterministic 8

9 Dongyoon Lee 1) System Output Check For every system call, compare system call argument Ensure that the replay produces the same output as the recorded process Divergence Check #1 – System Output Lock(l) Unlock(l) Lock(l) Lock(l’) Unlock(l’) Lock(l’) T1T2 Start A Recorded Process T1’T2’ Start A’ Replayed Process Check O’==O? SysRead X SysWrite O SysRead X’ SysWrite O’ 9 multi-threaded fork

10 Dongyoon Lee Benign Data Races Not all races cause divergence checks to fail A data race is inconsequential if system output matches x=1 x!=0? x=1 x!=0? T1T2 Start A Recorded ProcessReplayed Process Success SysWrite(x) 10 multi-threaded fork T1’T2’ Start A’

11 Dongyoon Lee 1)Need to rollback to the beginning 2)Need to buffer system output till the end 11 Divergence due to Data Races Start A Recorded Process Replayed Process Start A’ multi-threaded fork T1T2 T1’T2’ Fail SysWrite(x) x=1 x=2 x=1 x=2 SysWrite(x)

12 Dongyoon Lee 2) Program state check Compare register and memory state at semi-regular intervals (epochs) Construct a safe intermediate point –To release buffered output –To rollback to in case of mis-speculation Divergence Check #2 – Program State 12 Replayed Process T1’T2’ Start A’ T1T2 Recorded Process Start A epoch Release Output Success epoch SysWrite(x) x=1 x=2 x=1 x=2 SysWrite(x) Fail Checkpoint B B’ == B ?

13 Dongyoon Lee Recovery from Mis-speculation Rollback Rollback both recorded and replayed processes to the previous checkpoint Re-execute Optimistically re-run the failed epoch On repeated failure, switch to uniprocessor execution model –Record and replay only one thread at a time –Parallel execution resumes after the failed interval T1T2 T1’T2’ Check B’==B? x=1 x=2 Fail A == A’ Checkpoint B x=1 x=2 Checkpoint A x=1 x=2 Checkpoint B Checkpoint C Checkpoint A Check B’==B? x=1 x=2 Check C’==C? A == A’ 13

14 Dongyoon Lee Speculative Execution Speculator [Nightingale et al. SOSP’05] Buffer output during speculation Block execution if speculative execution is not feasible Release buffered output on commit Undo speculative changes and squash buffered output on mis-speculation 14

15 Dongyoon Lee Motivation/Overview Respec Design Evaluation 1.Performance results 2.Breakdown of performance overhead 3.Rollback frequency and overhead Conclusion Roadmap 15

16 Dongyoon Lee Evaluation Setup Test Environment 2 GHz 8 core Xeon processor with 3 GB of RAM Run 1~4 worker threads (excluding control threads) Collect the average of 10 trials (except pbzip2 and aget) Benchmarks PARSEC suite –blackscholes, bodytrack, fluidanimate, swaptions, streamcluster SPLASH-2 suite –ocean, raytrace, volrend, water-nsq, fft, and radix Real applications –pbzip2, pfscan, aget, and Apache 16

17 Dongyoon Lee Record and Replay Performance 18% for 2 threads, 55% for 4 threads Real applications (including Apache) showed <50% for 4 threads 17

18 Dongyoon Lee 1) Redundant Execution Overhead (25%) Cost of running two executions (Lower bound of online replay) Mainly due to sharing limited resources: memory system Contribute 25% of total cost for 4 threads Redundant execution overhead (25%) 18

19 Dongyoon Lee 2) Epoch overhead (17%) Due to checkpoint cost Due to artificial epoch barrier cost Contribute 17% of total cost for 4 threads 19 Epoch overhead (17%) Redundant execution overhead (25%)

20 Dongyoon Lee 3) Memory Comparison Overhead (16%) Optimization 1. compare dirty pages only Optimization 2. parallelize comparison Contribute 16% of total cost for 4 threads 20 Memory comparison overhead (16%) Epoch overhead (17%) Redundant execution overhead (25%)

21 Dongyoon Lee 4) Logging Overhead (42%) Logging synchronization operations and system calls overhead Main cost for applications with fine-grained synchronizations Contribute 42% of total cost for 4 threads 21 Logging and other overhead (42%) Memory comparison overhead (16%) Epoch overhead (17%) Redundant execution overhead (25%)

22 Dongyoon Lee Rollback Frequency and Overhead App.ThreadsRollback FrequencyOverheadAvg. Overhead Pbzip2 (100 runs) 4 84% none 41% 45% 15% once 66% 1% twice 105% Aget (50 runs) 4 80% none 6% 18% once 6% 2% twice 6% Pbzip2(16%) and Aget(20%) invoke one or more rollbacks Pbzip2: Rollbacks contribute <10% of total overhead Aget: Rollback overhead is negligible frequent checkpoints => short epochs => small amount of work to be re-done 22

23 Dongyoon Lee Conclusion Goal: Deterministic replay for multithreaded programs Software-only: no custom hardware Online: record and replay concurrently Contributions to replay Speculation: speculate race-free, and rollback/retry if needed External Determinism: Match system output and program states Results Performance overhead record and replay concurrently 2 threads: 18% 4 threads: 55% Thank you… 23

24 Dongyoon Lee Thank you 24

25 Dongyoon Lee Benign Data Races Benign data races could cause frequent rollbacks Performance (NOT correctness) issue The latest Java and C++ memory model prohibits benign races => There are only harmful races [Manson et al. POPL’05],[Boehm et al. PLDI’08] Programmers should explicitly annotate intentionally racy variables (e.g. handcrafted synchronization) using volatile/atomic keywords Could automatically detect and instrument 25

26 Dongyoon Lee Implementation Modify Linux 2.6.27 kernel Deterministic replay Multithreaded fork Record/replay program input (e.g. system calls, signals, …) Compare program state (memory and register contents) Speculator [Nightingale et al. SOSP’05] Checkpoint and rollback Buffer system output or propagate speculative states Modify glibc 2.5.1 Support recording/replaying low-level synchronization operations e.g. locks, unlock, futex waits, futex wakes 26

27 Dongyoon Lee Replayed process 1)Emulate most system calls Feed logged return value and data copied into the process 2)Re-execute some system calls Create or delete threads : clone, exit, … Modify address space: mmap2, mprotect, … Problem Does NOT recreate most kernel state associated with the replayed process (e.g. the file descriptor table) Process can NOT transition from replaying to live execution Solution Recreate the OS state by re-executing native/virtualized system calls ReVirt [Dunlap et al. OSDI’02], Zap [Osman et al. OSDI’02] Handling System Calls 27

28 Dongyoon Lee Copy-on-write fork Linux’s fork supports fork of only single thread Need new copy-on-write primitive for checkpointing multithreads Should checkpoint a thread at safe point kernel entry/exit (system call) Multi-threaded fork 1)The initiating thread that initiates a multithreaded fork creates a barrier on which it waits until all other threads reach a safe point 2)Once all threads reach the barrier, the original thread creates the checkpoint, then let other threads continue execution. Semi-regular checkpoints Adaptive epoch length To bound the amount of work that must be redone on rollback Output triggered commit To provide acceptable latency for interactive tasks Multi-threaded Fork (Checkpoint) 28

29 Dongyoon Lee 1)Allow Respec to commit epochs and release system output Buffer output during speculation Safe to release output on commit after matching program state 2)Reduce the amount of execution that must be re-done when a check fails 3)Allow broader uses of replay system Tolerating non-fail-stop faults (e.g. transient hardware fault) Need to detect latent faults Parallelizing security and reliability checks Benefits of Program State Check 29

30 Dongyoon Lee Respec Log Kernel’s system call + User-level synchronizations MD5 checksum of address space and register state Problem: Not all races are logged Offline replay is NOT guaranteed to succeed Since the recorded process has been replayed successfully at least once, it is likely that offline replay will eventually succeed Solution Offline replay search tools can be used e.g. ODR [Altekar et al. SOSP’09], PRES [Park et al. SOSP’09], Replay-SAT [Lee et al. MICRO’09] Offline Replay with Respec 30

31 Dongyoon Lee e.g. I/O, DMA, interrupts, signals, RDTSC, context-switch, page- fault Asynchronous interrupts (caused by external sources) eg. I/O, timer, disk read completion Synchronous interrupts (=traps) eg. arithmetic overflow exceptions, invoking system calls, page fault, TLB miss x86 instructions (can return non-deterministic results, but do not normally trap when running in user mode) eg. rdtsc(read timestamp counter), rdpmc(read performance monitoring counter) Non-Deterministic Program Input 31

32 Dongyoon Lee Rollback Frequency and Overhead (Pbzip2) Threads Rollback Frequency Original Time (sec) Type Respec Time (sec) Slowdown 10%4.59Overall4.835% 213% once2.35 w/o rollback2.7015% w/ rollback2.9726% overall2.7316% 3 9% once 2% twice 1.64 w/o rollback2.0022% w/ rollback2.2940% overall1.0324% 4 84% no rollback 15% once 1% twice 1.33 w/o rollback1.8841% w/ rollback2.2468% overall1.9345% Out of 100 runs, 13-16% of executions invoke more than one rollbacks Rollbacks contribute 8% of Respec's total overhead 32

33 Dongyoon Lee Rollback Frequency and Overhead (Aget) Threads Rollback Frequency Original Time (sec) Type Respec Time (sec) Slowdown 1 10% once 2% twice 2.05 w/o rollback 2.197% w/ rollback 2.218% overall 2.197% 2 20% once 2% twice 1.93 w/o rollback2.1713% w/ rollback2.1713% overall2.1713% 324% once1.94 w/o rollback2.08 7% w/ rollback2.09 8% overall2.08 7% 4 18% once 2% twice 1.96 w/o rollback2.076% w/ rollback2.086% overall2.086% Out of 50 runs, 14-24% of executions invoke more than one rollbacks Peformance impact is negligible (due to very frequent checkpoint) 33


Download ppt "Dongyoon Lee, Benjamin Wester, Kaushik Veeraraghavan, Satish Narayanasamy, Peter M. Chen, and Jason Flinn University of Michigan, Ann Arbor Respec: Efficient."

Similar presentations


Ads by Google