Timetraveler
Gwendolyn Voskuilen, Faraz Ahmad, and T. N. Vijaykumar
Electrical & Computer Engineering
ISCA 2010

- Shared-memory programs are hard to debug
  - Due to non-deterministic memory races
  - Memory races depend on thread interleaving (e.g., a read/write by thread A and a write by thread B to the same location)
- Deterministic replay:
  - Checkpoint the initial program state when recording starts
  - Record races in a log
  - Enforce the same race ordering at replay

Race recording provides repeatability.
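
The non-determinism is easy to see in software. Here is a toy Python sketch (my illustration, not from the talk): two threads race on an unsynchronized counter, and the interleaving of their read-modify-write sequences decides the final value.

    import threading

    counter = 0  # shared location, intentionally unprotected

    def worker(n):
        global counter
        for _ in range(n):
            counter += 1  # load, add, store: another thread can slip in between

    threads = [threading.Thread(target=worker, args=(1_000_000,)) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Lost updates are possible depending on interpreter and timing, so the
    # printed value can change from run to run -- the bug a recorder must replay.
    print("counter =", counter, "(expected 2000000)")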

- Record the predecessor-successor ordering of threads involved in a memory race
- Races always involve a write → leverage cache coherence
  - A global event (e.g., a write invalidation) accompanies every memory race
- Captures all races – synchronization and data
- Two key overheads:
  - Log size
  - Hardware to track race ordering
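
As a software analogy (a minimal sketch of what the coherence hardware would observe, not the paper's mechanism), a recorder can watch a serialized stream of accesses and emit a predecessor → successor edge whenever an access conflicts with the latest accesses to the same address:

    # Sketch: log (pred_thread, succ_thread, addr) for every racing pair, i.e.
    # two threads touching the same address where at least one access is a write.
    def record(accesses):
        last_writer = {}   # addr -> thread that wrote it last
        readers = {}       # addr -> threads that read it since the last write
        log = []
        for tid, op, addr in accesses:
            if op == "w":
                preds = set(readers.get(addr, ()))     # write-after-read races
                if addr in last_writer:
                    preds.add(last_writer[addr])       # write-after-write race
                log += [(p, tid, addr) for p in preds if p != tid]
                last_writer[addr] = tid
                readers[addr] = set()
            else:
                if addr in last_writer and last_writer[addr] != tid:
                    log.append((last_writer[addr], tid, addr))  # read-after-write
                readers.setdefault(addr, set()).add(tid)
        return log

    # Thread 0 writes X, then thread 1 reads and rewrites it: two logged edges.
    print(record([(0, "w", "X"), (1, "r", "X"), (1, "w", "X")]))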

- Centralized: Strata [ASPLOS06], DeLorean [ISCA08]
  - Logging/ordering at a central entity
  - DeLorean has a shorter log, but Strata uses less hardware
  - Both are less scalable
- Distributed: FDR [ISCA03], RTR [ASPLOS06], Rerun [ISCA08]
  - Use Lamport clocks with directory coherence
  - All exploit transitivity to reduce logs (avoid recording races made redundant by transitivity)
  - Rerun significantly reduces hardware

Our focus: distributed schemes.

- Goal: further reduce log size with minimal hardware
  - Rerun logs 38 GB/hour on 16 2-GHz cores
- Our key novelty: exploit the acyclicity of races
  - Previous schemes record all non-transitive races
  - Timetraveler records only cyclic, non-transitive races

- Two novel and elegant mechanisms
  - Post-dating: correctly orders acyclic races and detects cyclic races via L1 & L2
    - No messy cycle-detection hardware (just a 32-bit timestamp per core)
  - Time-delay buffers: avoid false cycles through L2
- Reduces the log by 8x (commercial) & 123x (scientific) over Rerun
- Minimal hardware: two 32-bit timestamps per core + a 696-byte time-delay buffer
- 696 MB/hour on 16 2-GHz cores

Timetraveler significantly reduces the log with minimal, elegant hardware.

- Introduction
- Timetraveler operations
  - Rerun background
  - Post-dating
  - Time-delay buffer
- Results
- Conclusion

- Rerun eliminates per-block timestamps in L1 and L2
  - Needs only one timestamp per core / L2 bank
- Rerun divides each thread into atomic sections (episodes)
  - Ends an episode at a race; successor's timestamp = predecessor's timestamp + 1 (piggybacked on the coherence message)
  - Logs the length and timestamp of each episode
  - In replay, the serial order of episodes is known
- Races fall into two categories [Strata]:
  - Current: block last accessed in another thread's current episode
  - Past: block last accessed in a past episode
  - Distinguished by R/W bits per block (or a Bloom filter)

Past races are implied by transitivity and need not be logged.
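
A compact executable model of this bookkeeping (my simplification: one flat access stream, exact read/write sets instead of Bloom filters, no coherence messages):

    # Rerun-style episode model: each core keeps a Lamport timestamp plus the
    # read/write sets and reference count of its current episode.
    class Core:
        def __init__(self, cid):
            self.cid, self.ts, self.refs = cid, 0, 0
            self.rset, self.wset = set(), set()

    def access(cores, log, tid, op, addr):
        me = cores[tid]
        for other in cores:
            if other is me:
                continue
            # "Current" race: the block was touched in another core's open episode.
            if addr in other.wset or (op == "w" and addr in other.rset):
                log.append((other.cid, other.refs, other.ts))  # (core, length, ts)
                me.ts = max(me.ts, other.ts + 1)  # order successor after predecessor
                other.ts += 1                     # predecessor starts a fresh episode
                other.refs, other.rset, other.wset = 0, set(), set()
        (me.wset if op == "w" else me.rset).add(addr)
        me.refs += 1

    cores, log = [Core(0), Core(1)], []
    for ev in [(0, "w", "X"), (1, "r", "X"), (0, "w", "X")]:
        access(cores, log, *ev)
    print(log)  # every current race ends an episode: [(0, 1, 0), (1, 1, 1)]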

[Figure: Rerun example. A dynamic execution with racing accesses to blocks A and B; each current race ends the predecessor's episode, yielding 2 episodes and 2 log entries of (length, timestamp).]

- Timetraveler logs only current, cyclic races
  - Rerun logs all current races
- Post-dating:
  - Upon a current race, the predecessor gives a post-dated timestamp to the successor and guarantees not to exceed it due to future races
    - Without ending its chapter
    - Breathing room lets the predecessor avoid ending immediately
    - Correctly orders an acyclic successor
    - Detects cycles, which cause the post-dated timestamp to be exceeded
  - Minimal hardware over Rerun

Post-dating exploits acyclicity & detects cycles with minimal hardware.
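
The sketch below condenses post-dating into a few lines (my interpretation of the mechanism; the real design distributes this across L1, L2, and coherence messages). Each core tracks its current timestamp and the post-dated promise it has handed out; a race is logged only when honoring it would force a core past its own promise, i.e., when there is a cycle.

    OFFSET = 10  # post-dating offset (the evaluation uses 10)

    class Core:
        def __init__(self, cid):
            self.cid = cid
            self.ts = 0          # current timestamp
            self.promise = None  # post-dated timestamp handed out this chapter
            self.refs = 0        # chapter length so far

    def current_race(pred, succ, log):
        # Predecessor hands out a post-dated timestamp *without* ending its
        # chapter, promising that the chapter will never exceed that value.
        if pred.promise is None:
            pred.promise = pred.ts + OFFSET
        needed = pred.promise + 1  # successor must be ordered after the promise
        if succ.promise is not None and needed > succ.promise:
            # Cycle: the successor would have to exceed a promise it already
            # made as a predecessor -> end its chapter and write a log entry.
            log.append((succ.cid, succ.refs, succ.promise))
            succ.ts, succ.promise, succ.refs = succ.promise + 1, None, 0
        succ.ts = max(succ.ts, needed)

    log, a, b = [], Core(0), Core(1)
    current_race(a, b, log)  # A -> B: acyclic, ordered by post-dating, no log entry
    current_race(b, a, log)  # B -> A: completes a cycle, forces one log entry
    print(log)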

[Figure: Rerun vs. Timetraveler on the same dynamic execution with races on blocks A and B. Rerun ends 2 episodes and logs both; Timetraveler's post-dated timestamps order the same acyclic races within 1 chapter, logging nothing.]

- Rerun conservatively ends episodes upon replacements/downgrades of current blocks to L2
  - Places a timestamp at L2 for successors
  - Orders a racing successor after the predecessor
- Timetraveler employs post-dating to avoid ending the chapter
  - Places a post-dated timestamp at L2

Post-dating extends chapters beyond replacements.

- Problem: only one timestamp per L2 bank
  - All blocks look recent, even if only a single block was recently accessed and the others were accessed long ago
  - Causes false cycles when accessing one of the others
    - L2 timestamp > thread's post-dated timestamp → cycle
- Solution: buffer the most-recently arrived timestamps at L2
  - Delays the update of the L2 timestamp, so the L2 bank retains its old timestamp
    - L2 timestamp < thread's post-dated timestamp → no cycle
  - Requests get data from L2 and the timestamp from the buffer or L2
  - 8 entries per L2 bank suffice

The time-delay buffer avoids false cycles through L2.
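
A small model of the bank timestamp plus time-delay buffer (my sketch; the 8-entry size is from the talk, the FIFO eviction policy is an assumption):

    from collections import OrderedDict

    class L2Bank:
        def __init__(self, entries=8):
            self.bank_ts = 0            # the single per-bank timestamp
            self.delay = OrderedDict()  # addr -> recently deposited timestamp
            self.entries = entries

        def deposit(self, addr, ts):
            # New timestamps park in the buffer; only when one is evicted does
            # it fold into the bank timestamp, which therefore stays old.
            if addr not in self.delay and len(self.delay) == self.entries:
                _, evicted = self.delay.popitem(last=False)
                self.bank_ts = max(self.bank_ts, evicted)
            self.delay[addr] = ts

        def lookup(self, addr):
            # Recently replaced blocks see their own timestamp; all other
            # blocks see the older bank timestamp and cause no false cycle.
            return self.delay.get(addr, self.bank_ts)

    bank = L2Bank()
    bank.deposit("HOT", ts=40)
    print(bank.lookup("HOT"))   # 40: genuinely recent block, order after it
    print(bank.lookup("COLD"))  # 0: an old block does not look recent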

- Introduction
- Timetraveler operations
  - Rerun background
  - Post-dating
  - Time-delay buffer
- Results
- Conclusion

- GEMS + Simics
  - 8 in-order cores, MESI coherence
  - 32 KB split I & D L1s, 8 MB, 8-bank L2
- Workloads
  - Commercial: Apache, OLTP, SPECjbb2005
  - Scientific: SPLASH Ocean, Raytrace, Water-nsquared
- Timetraveler
  - R/W bits per L1 block, 8-entry time-delay buffer per L2 bank, 32-bit timestamps, 16-bit chapter length, post-dating offset = 10
- Rerun
  - R/W Bloom filters, 32-bit timestamps, 16-bit episode length

[Figure: Log growth rate of Timetraveler vs. Rerun: 8x lower on commercial workloads and 123x lower on scientific workloads.]

Large reduction in log growth due to post-dating. Post-dating & the time-delay buffer effectively capture true cycles.

[Table: Per-benchmark counts of current races, current-block replacements (split into current-races and non-races), and total current races per chapter for SPECjbb, Apache, OLTP, Water-nsquared, Ocean, and Raytrace, plus commercial and scientific means; the numeric values were lost in transcription.]

Multiple races occur per chapter, so ending on current-block replacements would significantly shorten chapters.

- Timetraveler exploits the acyclicity of races to reduce log size
  - 8x (commercial) & 123x (scientific) reduction over Rerun
- Two novel techniques elegantly exploit acyclicity and detect cycles
  - Post-dating
  - Time-delay buffer
- Introduces minimal hardware
  - Two timestamps per core
  - 696-byte time-delay buffer

CMPs on the rise + debugging important → Timetraveler valuable.

Timetraveler
Gwendolyn Voskuilen, Faraz Ahmad, and T. N. Vijaykumar
Electrical & Computer Engineering
ISCA 2010

- Two requirements for replay:
  - All original races must occur in replay
  - No new races (not seen originally) may occur
- Replay need not be terribly fast, but it cannot be terribly slow
  - Thus the simplest scheme is sequential replay of chapters
  - Can leverage speculation for faster replay
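
A sketch of that simplest replayer (my illustration; it assumes each log entry is a (core, length, timestamp) chapter record, as in the models above):

    def replay(log, step):
        # Sequential replay: execute whole chapters one at a time in timestamp
        # order. Original races reappear at chapter boundaries, and since no
        # two chapters overlap, no new races can be introduced.
        for core_id, length, ts in sorted(log, key=lambda entry: entry[2]):
            for _ in range(length):
                step(core_id)  # re-execute one memory reference on that core

    # Example driver:
    # replay(log, lambda cid: print("core", cid, "executes one reference"))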