Gwendolyn Voskuilen, Faraz Ahmad, and T. N. Vijaykumar Electrical & Computer Engineering ISCA 2010.

1 Gwendolyn Voskuilen, Faraz Ahmad, and T. N. Vijaykumar Electrical & Computer Engineering ISCA 2010

2
- Shared-memory programs are hard to debug
  - Due to non-deterministic memory races
  - Memory races depend on thread interleaving
    ▪ Read/write by thread A + write by thread B to same location
- Deterministic replay
  - Checkpoint initial program state at recording start
  - Record races in a log
  - Enforce same race ordering at replay
Race recording provides repeatability
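
As a rough illustration of the record/replay idea on this slide, the sketch below runs two tiny "threads" under an explicit interleaving, logs the order of their memory accesses, and then re-executes using only the log. The names `execute` and `programs` are hypothetical, and logging every access is a simplifying assumption; the hardware schemes discussed in this talk log far less.

```python
def execute(programs, order):
    """Run per-thread operation lists in the given thread order.

    programs: dict thread -> list of ("write", loc, value) or ("read", loc)
    order:    list of thread names, one entry per executed operation
    Returns (final memory, values read per thread, access log).
    """
    memory = {}
    observed = {t: [] for t in programs}
    pc = {t: 0 for t in programs}  # next operation index per thread
    log = []
    for t in order:
        op = programs[t][pc[t]]
        pc[t] += 1
        if op[0] == "write":
            _, loc, val = op
            memory[loc] = val
        else:
            _, loc = op
            observed[t].append(memory.get(loc))
        log.append((t, loc))  # naive log: every access, in order
    return memory, observed, log


programs = {
    "A": [("write", "x", 1), ("read", "x")],
    "B": [("write", "x", 2)],
}

# Recording run: the interleaving stands in for non-deterministic timing.
mem1, obs1, log = execute(programs, ["A", "B", "A"])

# Replay: start from the same initial state and enforce the logged order.
mem2, obs2, _ = execute(programs, [t for t, _ in log])
```

A different interleaving such as `["A", "A", "B"]` makes thread A read a different value, which is the non-determinism that the log removes.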

3
- Record predecessor-successor ordering of threads involved in a memory race
  - Races always involve a write, so leverage coherence
    ▪ Global event (e.g., write invalidation) for memory races
  - Captures all races, both synchronization and data
- Two key overheads
  - Log size
  - Hardware to track race ordering

4
- Centralized: Strata [ASPLOS06], DeLorean [ISCA08]
  - Logging/ordering at a central entity
  - DeLorean has shorter log but Strata uses less hardware
  - Both less scalable
- Distributed: FDR [ISCA03], RTR [ASPLOS06], Rerun [ISCA08]
  - Use Lamport clocks with directory coherence
  - All exploit transitivity to reduce logs
    ▪ Avoid recording races made redundant by transitivity
  - Rerun significantly reduces hardware
Our focus: distributed schemes

5
- Goal: further reduce log size with minimal hardware
  - Rerun logs 38 GB/hour on 16 2-GHz cores
- Our key novelty: exploit acyclicity of races
  - Previous schemes record all non-transitive races
  - Timetraveler records only cyclic, non-transitive races

6
- Two novel and elegant mechanisms
  - Post-dating: correctly orders acyclic races and detects cyclic races via L1 & L2
    ▪ No messy cycle detection hardware (just a 32-bit timestamp/core)
  - Time-delay buffers: avoid false cycles through L2
- Reduces log by 8x (commercial) & 123x (scientific) over Rerun
- Minimal hardware: two 32-bit timestamps/core + 696-byte time-delay buffer
- 696 MB/hour on 16 2-GHz cores
Timetraveler significantly reduces log with minimal, elegant hardware
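
To put the two hourly figures quoted in the talk (38 GB/hour for Rerun, 696 MB/hour for Timetraveler, both on 16 2-GHz cores) on a common scale, the short calculation below converts them to per-second rates. It assumes decimal units (1 GB = 10^9 bytes), which the slides do not state; the variable names are only illustrative.

```python
SECONDS_PER_HOUR = 3600

# Assumption: decimal units (1 GB = 1e9 bytes, 1 MB = 1e6 bytes).
rerun_bytes_per_hour = 38e9
timetraveler_bytes_per_hour = 696e6

rerun_rate = rerun_bytes_per_hour / SECONDS_PER_HOUR                # bytes/second
timetraveler_rate = timetraveler_bytes_per_hour / SECONDS_PER_HOUR  # bytes/second
ratio = rerun_bytes_per_hour / timetraveler_bytes_per_hour

print(f"Rerun:        {rerun_rate / 1e6:.2f} MB/s")
print(f"Timetraveler: {timetraveler_rate / 1e3:.1f} KB/s")
print(f"Ratio:        {ratio:.1f}x")
```

Under that assumption the rates come out to roughly 10.56 MB/s and 193.3 KB/s across all 16 cores.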

7
- Introduction
- Timetraveler operations
- Rerun background
- Post-dating
- Time-delay buffer
- Results
- Conclusion

8
- Rerun eliminates per-block timestamps in L1 and L2
  - Needs only one timestamp per core/L2 bank
- Rerun divides thread into atomic sections (episodes)
  - Ends episode at a race; successor's timestamp = predecessor's timestamp + 1 (piggybacked on coherence message)
  - Logs length and timestamp of episode
  - In replay, the serial order of episodes is known
- Races fall in two categories [Strata]:
  - Current: block last accessed in another thread's current episode
  - Past: block last accessed in a past episode
  - Distinguished by R/W bit per block (or Bloom filter)
Past races are implied by transitivity and need not be logged
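
The episode bookkeeping described on this slide can be sketched in software as follows. This is a simplified model under stated assumptions, not Rerun's actual hardware: the `Core` class and `access` function are hypothetical names, per-episode read/write sets stand in for the R/W bits, only current races are modeled, and the rule that the successor takes the maximum of its own timestamp and predecessor + 1 is the usual Lamport-clock reading of the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Core:
    name: str
    ts: int = 0      # timestamp of the current episode
    length: int = 0  # memory operations in the current episode
    reads: set = field(default_factory=set)
    writes: set = field(default_factory=set)

    def end_episode(self, log):
        """Log (core, timestamp, length) and start a fresh episode."""
        log.append((self.name, self.ts, self.length))
        self.ts += 1
        self.length = 0
        self.reads.clear()
        self.writes.clear()


def access(cores, who, block, is_write, log):
    """Perform one access; end the predecessor's episode on a current race."""
    me = cores[who]
    for other in cores.values():
        if other is me:
            continue
        # A race needs at least one write to the same block.
        conflict = block in other.writes or (is_write and block in other.reads)
        if conflict:
            pred_ts = other.ts
            other.end_episode(log)
            # Successor is ordered after the predecessor (assumed Lamport rule).
            me.ts = max(me.ts, pred_ts + 1)
    (me.writes if is_write else me.reads).add(block)
    me.length += 1


cores = {"A": Core("A"), "B": Core("B")}
log = []
access(cores, "A", "x", True, log)   # A writes x
access(cores, "B", "x", False, log)  # B reads x: current race, A's episode ends
access(cores, "A", "x", True, log)   # A writes x: current race, B's episode ends
print(log)
```

Once an episode has ended, its read/write sets are cleared, so a later access to one of its blocks is a past race in this model and triggers no new log entry.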

9
[Figure: Rerun example. Dynamic execution with per-thread timestamps and races on blocks A and B; labels: Timestamp, Dynamic Execution, Episodes; 2 log entries.]

10
- Timetraveler logs only current, cyclic races
  - Rerun logs all current races
- Post-dating
  - Upon a current race, the predecessor gives a post-dated timestamp to the successor and guarantees not to exceed it due to future races
    ▪ Without ending
    ▪ Breathing room for predecessor to avoid ending immediately
    ▪ Correctly orders acyclic successor
    ▪ Detects cycles causing post-dated timestamp to be exceeded
- Minimal hardware over Rerun
Post-dating exploits acyclicity & detects cycles with minimal hardware
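
One plausible reading of the post-dating rules on this slide is sketched below. The exact hardware rules are not given in the transcript, so the details are assumptions: the names `Core`, `current_race`, and `end_chapter` are hypothetical, the offset of 10 is taken from the evaluation setup, and a chapter is assumed to end exactly when honoring a new ordering would push its timestamp past the post-dated value it has already handed out.

```python
from dataclasses import dataclass
from typing import Optional

OFFSET = 10  # post-dating offset; the evaluation setup lists 10


@dataclass
class Core:
    name: str
    ts: int                          # current timestamp of the running chapter
    postdated: Optional[int] = None  # promise: the chapter ends at or before this


def end_chapter(core, new_ts, log):
    """Log the finished chapter and start a new one at new_ts."""
    log.append((core.name, core.ts))
    core.ts = new_ts
    core.postdated = None


def current_race(pred, succ, log):
    """Order succ after pred without ending pred's chapter."""
    if pred.postdated is None:
        pred.postdated = pred.ts + OFFSET
    needed = max(succ.ts, pred.postdated + 1)
    if succ.postdated is not None and needed > succ.postdated:
        # succ already promised a smaller timestamp to its own successors:
        # a cycle (or an exhausted promise), so succ's chapter must end.
        end_chapter(succ, needed, log)
    else:
        succ.ts = needed


a, b = Core("A", 23), Core("B", 20)
log = []
current_race(a, b, log)  # A -> B: A promises 33, B moves to 34, nothing logged
current_race(a, b, log)  # same direction again: acyclic, still nothing logged
current_race(b, a, log)  # B -> A closes a cycle: A's chapter ends and is logged
print(log)
```

In this model the two same-direction races produce no log entries at all, and only the race that closes the cycle forces a chapter to end.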

11
[Figure: the same dynamic execution, with races on blocks A and B, under Rerun and under Timetraveler. Rerun: 2 episodes. Timetraveler: 1 chapter. Labels: Timestamp (Rerun); Current TS and Post-dated TS (Timetraveler); Dynamic Execution.]

12
- Rerun conservatively ends episodes upon replacements/downgrades of current blocks to L2
  - Places timestamp at L2 for successors
  - Orders racing successor after predecessor
- Timetraveler employs post-dating to avoid ending
  - Places post-dated timestamp at L2
Post-dating extends chapters beyond replacements

13
- Problem: only one timestamp per L2 bank
  - All blocks look recent, even if only a single block was recently accessed and the others were accessed long ago
  - Causes false cycles when accessing one of the others
    ▪ L2 timestamp > thread's post-dated timestamp => cycle
- Solution: buffer most-recently arrived timestamps at L2
  - Delays update of L2 timestamp so L2 bank retains old timestamp
  - L2 timestamp < thread's post-dated timestamp => no cycle
  - Requests get data from L2, timestamp from buffer or L2
  - 8 entries per L2 bank suffice
Time-delay buffer avoids false cycles through L2
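
The time-delay buffer idea can be modeled roughly as below. This is a sketch under assumptions rather than the actual design: the `L2Bank` class and its method names are hypothetical, the buffer is modeled as a FIFO keyed by block, and an evicted entry is assumed to be folded into the single bank timestamp with a max.

```python
from collections import OrderedDict


class L2Bank:
    """One L2 bank with a single timestamp plus a small time-delay buffer."""

    def __init__(self, entries=8):
        self.entries = entries
        self.bank_ts = 0             # the one timestamp covering the whole bank
        self.buffer = OrderedDict()  # block -> timestamp, oldest entry first

    def put(self, block, ts):
        """A block arrives at L2 carrying a timestamp."""
        ts = max(ts, self.buffer.pop(block, ts))
        self.buffer[block] = ts
        if len(self.buffer) > self.entries:
            # Only now does the oldest buffered timestamp reach the bank.
            _, old_ts = self.buffer.popitem(last=False)
            self.bank_ts = max(self.bank_ts, old_ts)

    def timestamp_for(self, block):
        """Timestamp returned with a request: from the buffer if present, else the bank."""
        return self.buffer.get(block, self.bank_ts)


bank = L2Bank()
bank.bank_ts = 5     # blocks already in the bank were accessed long ago
bank.put("hot", 40)  # one block arrives with a recent timestamp

thread_postdated = 33
false_cycle = bank.timestamp_for("cold") > thread_postdated  # buffer keeps this False
true_cycle = bank.timestamp_for("hot") > thread_postdated    # the recent block still shows
print(false_cycle, true_cycle)
```

Without the buffer, the arrival of `"hot"` would raise the bank timestamp to 40 immediately, and the request for the long-idle block would look like a cycle.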

14
- Introduction
- Timetraveler operations
- Rerun background
- Post-dating
- Time-delay buffer
- Results
- Conclusion

15
- GEMS + Simics
  - 8 in-order cores, MESI coherence
  - 32 KB split I & D, 8 MB 8-bank L2
- Workloads
  - Commercial: Apache, OLTP, SpecJBB 2005
  - Scientific: SPLASH Ocean, Raytrace, Water-nsquared
- Timetraveler
  - R/W bits per L1 block, 8-entry time-delay buffer per L2 bank, 32-bit timestamps, 16-bit chapter length, post-dating offset = 10
- Rerun
  - R/W Bloom filters, 32-bit timestamps, 16-bit episode length

16
[Figure: log size results, annotated 8x and 123x.]
Large reduction in log growth due to post-dating
Post-dating & time-delay buffer effectively capture true cycles

17
Benchmark    Current   Current-block replacements   Total current-races
             races     Current-races   Non-races    per chapter
Specjbb      0.6       1.1             21.0         1.7
Apache       1.5       8.0             26.1         9.5
OLTP         3.4       5.8             12.2         9.3
Water-n^2    2.3       6.4             228.2        8.7
Ocean        1.8       2.4             5.1          4.1
Raytrace     2.4       3.9             197.8        6.3
Mean-com     1.8       4.9             19.8         6.8
Mean-sci     2.1       4.2             143.7        6.4

Multiple races per chapter
Ending on current-block replacements would significantly shorten chapters

18
- Timetraveler exploits acyclicity of races to reduce log size
  - 8x (commercial) & 123x (scientific) reduction over Rerun
- Two novel techniques elegantly exploit acyclicity and detect cycles
  - Post-dating
  - Time-delay buffer
- Introduces minimal hardware
  - Two timestamps per core
  - 696-byte time-delay buffer
CMPs on the rise + debugging important => Timetraveler valuable

19 Gwendolyn Voskuilen, Faraz Ahmad, and T. N. Vijaykumar Electrical & Computer Engineering ISCA 2010

20
- Two requirements for replay
  - All original races must occur in replay
  - No new races (not seen originally) may occur
- Replay need not be terribly fast but cannot be terribly slow
  - Thus simplest scheme is sequential replay of chapters
  - Can leverage speculation for faster replay
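
A minimal sketch of sequential chapter replay, assuming each log entry holds a thread name, a chapter timestamp, and a chapter length: chapters are sorted by timestamp and executed one at a time. The function name `replay` and the log layout are illustrative assumptions, as is the premise that chapters with equal timestamps do not race and so may run in either order.

```python
def replay(log, programs):
    """Replay chapters one at a time in timestamp order.

    log:      list of (thread, timestamp, length) entries, in any order
    programs: dict thread -> list of that thread's operations
    Returns the serialized trace of (thread, operation) pairs.
    """
    pc = {t: 0 for t in programs}  # next operation index per thread
    trace = []
    # sorted() is stable, so equal timestamps keep their log order.
    for thread, ts, length in sorted(log, key=lambda entry: entry[1]):
        start = pc[thread]
        for op in programs[thread][start:start + length]:
            trace.append((thread, op))
        pc[thread] = start + length
    return trace


programs = {"A": ["w x", "w y", "r y"], "B": ["r x"]}
log = [("B", 34, 1), ("A", 23, 2), ("A", 45, 1)]
trace = replay(log, programs)
print(trace)
```

Because only one chapter runs at a time, no race that was absent from the recording can appear, at the cost of replaying without parallelism.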

