RTR: 1 Byte/Kilo-Instruction Race Recording Min Xu Rastislav BodikMark D. Hill
1 % gcc sim.c % a.out Segmentation fault % % gdb a.out gdb> run Program received SIGSEGV. In get() at hash.c:45 45 a = bucket->d; % gdb a.out gdb> run Program exited normally. gdb> % gcc para-sim.c % a.out Segmentation fault % Why Do You Need a Recorder? % gdb a.out log gdb> run Program received SIGSEGV. In get() at para-hash.c:67 67 a = bucket->d; % gcc para-sim.c % a.out Segmentation fault Race recorded in “log” %
2 Ideally … % gdb a.out log gdb> run Program received SIGSEGV. In get() at para-hash.c:67 67 a = bucket->d; % gcc para-sim.c % a.out Segmentation fault Race recorded in “log” % Long recording: small log Low runtime overhead Low cost Applicability: Programs – data race Systems – non-SC
3 Better and Better Recorders Low Cost Low Overhead Small Log Size Applicability Data RaceNon-SC InstRply ’87 YY B. & G. ’91 Y Netzer’93 Y Déjà Vu ’98 Y RecPlay ’00 JaRec ’04 FDR ’03 BugNet ’05 Y ReEnact ‘03 Y CORD ‘06 Y Strata ’06 Y RTR ‘06 YY
1 Byte/Kilo- Instruction [ASPLOS’06] A New Recorder Hardware Acceleration [ISCA’03] Less Hardware [ASPLOS’06] SC & TSO [ASPLOS’06] Result: One more step toward practical This talk covers only RTR Regulated Transitive Reduction algorithm
5 Outline Race Recording RTR Algorithm Results with Commercial Workloads Compress log during recording replay more “regularly” Conclusion
Technically, what’s race recording?
7 Race Recording X=6 X = 1 X++ print(X) X = 1 X++ print(X) - X = X*5 - X = X*5 - Thread I Thread J OriginalReplay X=10 Recording X= 6 - X = X*5 - Log Thread I Thread J
8 Goal: Reproduce same conflicts with minimum log data Terminologies and Assumptions ld A Thread I Thread J Recording st B st C sub ld B add st C ld B st A st C Thread I Thread J Replay Log ld D st D ld A st B st C sub ld B add st C ld B st A st C ld D st D Conflicts (red) Dependence (black)
Regulated Transitive Reduction (RTR)
10 Log All Conflicts ld A Thread I Thread J Replay st B st C sub ld B add st C ld B st A st C ld D st D Log J : 2 3 1 4 3 5 4 6 Log I : 2 3 Log Size: 5*16=80 bytes (10 integers) Dependence Log 16 bytes But too many conflicts
11 Netzer’s Transitive Reduction (TR) ld A Thread I Thread J Replay st B st C sub ld B add st C ld B st A st C ld D st D TR reduced Log J : 2 3 3 5 4 6 Log I : 2 3 Log Size: 64 bytes (8 integers) TR Reduced Log How to further reduce log size?
12 The Intuition of the RTR Algorithm After Reduction From I to J From J to I Vectors “Regulate” Replay
13 Stricter Dependences to Aid Vectorization ld A Thread I Thread J Replay st B st C add st C ld B st A ld D 55 subst C 66 ld B st D Log J : 2 3 4 5 Log I : 2 3 Log Size: 48 bytes (6 integers) New Reduced Log stricter Reduced Fewer dependencies to log
14 Compress Vectorized Dependencies ld A Thread I Thread J Replay st B st C sub ld B add st C ld B st A st C ld D st D Log J : x=3,5, ∆=1 Log I : x=3, ∆=1 Log Size: 40 bytes (5 integers) Vectorized Log Vector Deps. TR RTR: fewer deps + fewer byte/dep
15 Deadlock Avoidance of RTR ld A Thread I Thread J Recording st B st C sub ld B add st C ld B st A st C ld D st D Limit the strict dependencies (see paper) i:4 j:1 j:2 i:3 i:4 Replay Cycle
Results with Commercial Workloads
17 Full-system Simulation Method Commercial server hardware GEMS: Full-system (OS + application) executions 4-core CMP (Sequential Consistent) 1-way in-order issue, 2 GHz, 64KB I/D L1, 4MB L2, 64byte lines, MOSI directory Commercial server software Apache – static web serving SpecJBB – middleware OLTP – TPC-C like Zeus – static web serving L2 C0C0 C1C1 C3C3 C2C2
18 Log Size: 1 byte/KI byte/core/KI ApacheJBBOLTPZeusAVG KB/core/s ApacheJBBOLTPZeusAVG Less buffer, longer recording, smaller logs
19 RTR vs. Netzer’s TR TR RTR ApacheJBBOLTPZeusAVG Log Size 28% smaller log TR was “optimal”
20 Why Does RTR Work Well? RTR Instructions execute at similar speed Dependencies are often “vectorizable”
A New Recorder Hardware Acceleration [ISCA’03] 1 Byte/Kilo- Instruction [ASPLOS’06] Less Hardware [ASPLOS’06] SC & TSO [ASPLOS’06] Result: One more step toward practical “Less hardware” & “TSO” not covered Equally important More details in the paper
22 Conclusion Race recording Counter nondeterminism RTR 1 byte/kilo-instruction Based on Netzer’s transitive reduction Create stricter dependencies Vectorize dependencies to compress log Avoid overly-strict hence no deadlock Future work Support snooping, SMT, replayer