Presentation is loading. Please wait.

Presentation is loading. Please wait.

RTR: 1 Byte/Kilo-Instruction Race Recording Min Xu Rastislav BodikMark D. Hill.

Similar presentations


Presentation on theme: "RTR: 1 Byte/Kilo-Instruction Race Recording Min Xu Rastislav BodikMark D. Hill."— Presentation transcript:

1 RTR: 1 Byte/Kilo-Instruction Race Recording Min Xu Rastislav BodikMark D. Hill

2 1 % gcc sim.c % a.out Segmentation fault % % gdb a.out gdb> run Program received SIGSEGV. In get() at hash.c:45 45 a = bucket->d; % gdb a.out gdb> run Program exited normally. gdb> % gcc para-sim.c % a.out Segmentation fault % Why Do You Need a Recorder? % gdb a.out log gdb> run Program received SIGSEGV. In get() at para-hash.c:67 67 a = bucket->d; % gcc para-sim.c % a.out Segmentation fault Race recorded in “log” %

3 2 Ideally … % gdb a.out log gdb> run Program received SIGSEGV. In get() at para-hash.c:67 67 a = bucket->d; % gcc para-sim.c % a.out Segmentation fault Race recorded in “log” % Long recording: small log Low runtime overhead Low cost Applicability: Programs – data race Systems – non-SC

4 3 Better and Better Recorders Low Cost Low Overhead Small Log Size Applicability Data RaceNon-SC InstRply ’87  YY B. & G. ’91  Y Netzer’93  Y Déjà Vu ’98  Y RecPlay ’00 JaRec ’04  FDR ’03 BugNet ’05  Y ReEnact ‘03  Y CORD ‘06  Y Strata ’06  Y RTR ‘06  YY

5 1 Byte/Kilo- Instruction [ASPLOS’06] A New Recorder Hardware Acceleration [ISCA’03] Less Hardware [ASPLOS’06] SC & TSO [ASPLOS’06] Result: One more step toward practical This talk covers only RTR Regulated Transitive Reduction algorithm

6 5 Outline Race Recording RTR Algorithm Results with Commercial Workloads Compress log during recording  replay more “regularly” Conclusion

7 Technically, what’s race recording?

8 7 Race Recording X=6 X = 1 X++ print(X) X = 1 X++ print(X) - X = X*5 - X = X*5 - Thread I Thread J OriginalReplay X=10 Recording X= 6 - X = X*5 - Log Thread I Thread J

9 8 Goal: Reproduce same conflicts with minimum log data Terminologies and Assumptions ld A Thread I Thread J Recording st B st C sub ld B add st C ld B st A st C Thread I Thread J Replay Log ld D st D ld A st B st C sub ld B add st C ld B st A st C ld D st D Conflicts (red) Dependence (black)

10 Regulated Transitive Reduction (RTR)

11 10 Log All Conflicts ld A Thread I Thread J Replay st B st C sub ld B add st C ld B st A st C ld D st D Log J : 2  3 1  4 3  5 4  6 Log I : 2  3 Log Size: 5*16=80 bytes (10 integers) Dependence Log 16 bytes But too many conflicts

12 11 Netzer’s Transitive Reduction (TR) ld A Thread I Thread J Replay st B st C sub ld B add st C ld B st A st C ld D st D TR reduced Log J : 2  3 3  5 4  6 Log I : 2  3 Log Size: 64 bytes (8 integers) TR Reduced Log How to further reduce log size?

13 12 The Intuition of the RTR Algorithm After Reduction From I to J From J to I Vectors “Regulate” Replay

14 13 Stricter Dependences to Aid Vectorization ld A Thread I Thread J Replay st B st C add st C ld B st A ld D 55 subst C 66 ld B st D Log J : 2  3 4  5 Log I : 2  3 Log Size: 48 bytes (6 integers) New Reduced Log stricter Reduced Fewer dependencies to log

15 14 Compress Vectorized Dependencies ld A Thread I Thread J Replay st B st C sub ld B add st C ld B st A st C ld D st D Log J : x=3,5, ∆=1 Log I : x=3, ∆=1 Log Size: 40 bytes (5 integers) Vectorized Log Vector Deps. TR  RTR: fewer deps + fewer byte/dep

16 15 Deadlock Avoidance of RTR ld A Thread I Thread J Recording st B st C sub ld B add st C ld B st A st C ld D st D Limit the strict dependencies (see paper) i:4  j:1  j:2  i:3  i:4 Replay Cycle

17 Results with Commercial Workloads

18 17 Full-system Simulation Method Commercial server hardware GEMS: Full-system (OS + application) executions 4-core CMP (Sequential Consistent) 1-way in-order issue, 2 GHz, 64KB I/D L1, 4MB L2, 64byte lines, MOSI directory Commercial server software Apache – static web serving SpecJBB – middleware OLTP – TPC-C like Zeus – static web serving L2 C0C0 C1C1 C3C3 C2C2

19 18 Log Size: 1 byte/KI byte/core/KI ApacheJBBOLTPZeusAVG KB/core/s ApacheJBBOLTPZeusAVG Less buffer, longer recording, smaller logs

20 19 RTR vs. Netzer’s TR TR RTR ApacheJBBOLTPZeusAVG Log Size 28% smaller log TR was “optimal”

21 20 Why Does RTR Work Well? RTR Instructions execute at similar speed Dependencies are often “vectorizable”

22 A New Recorder Hardware Acceleration [ISCA’03] 1 Byte/Kilo- Instruction [ASPLOS’06] Less Hardware [ASPLOS’06] SC & TSO [ASPLOS’06] Result: One more step toward practical “Less hardware” & “TSO” not covered Equally important More details in the paper

23 22 Conclusion Race recording  Counter nondeterminism RTR  1 byte/kilo-instruction Based on Netzer’s transitive reduction Create stricter dependencies Vectorize dependencies to compress log Avoid overly-strict hence no deadlock Future work Support snooping, SMT, replayer


Download ppt "RTR: 1 Byte/Kilo-Instruction Race Recording Min Xu Rastislav BodikMark D. Hill."

Similar presentations


Ads by Google