What is a Data Race? Two concurrent accesses to a shared location, at least one of them for writing. Indicative of a bug.

Presentation transcript:

1 What is a Data Race?
Two concurrent accesses to a shared location, at least one of them for writing.
Indicative of a bug.
  Thread 1: X++; Z=2
  Thread 2: T=Y; T=X

2 How Can Data Races be Prevented?
Explicit synchronization between threads: locks, critical sections, barriers, mutexes, semaphores, monitors, events, etc.
  Thread 1: Lock(m); X++; Unlock(m)
  Thread 2: Lock(m); T=X; Unlock(m)
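The Lock(m)/Unlock(m) pattern on this slide can be sketched in standard C++ as follows. This is a minimal illustration, not the presenter's code: the names `increment_x`, `read_x` and `run_once` are hypothetical, and `std::mutex` stands in for the slide's generic lock `m`.

```cpp
#include <mutex>
#include <thread>

// Hypothetical shared state mirroring the slide's X and T.
static int X = 0;
static int T = 0;
static std::mutex m;

// Thread 1's work: X++ is a read-modify-write, so it runs under the lock.
void increment_x() {
    std::lock_guard<std::mutex> guard(m);  // Lock(m); released at scope exit
    ++X;
}

// Thread 2's work: reading X under the same lock orders it with the increment.
void read_x() {
    std::lock_guard<std::mutex> guard(m);
    T = X;
}

// Runs both accesses concurrently and returns the final value of X.
int run_once() {
    X = 0;
    T = 0;
    std::thread t1(increment_x);
    std::thread t2(read_x);
    t1.join();
    t2.join();
    return X;
}
```

Either thread may win the lock, so T ends up as 0 or 1, but the two accesses to X can no longer overlap.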

3 Is This Sufficient?
Yes! No!
Programmer dependent
  Correctness: the programmer may forget to synchronize
  Need tools to detect data races
Expensive
  Efficiency: to achieve correctness, the programmer may overdo it
  Need tools to remove excessive synchronization

4 Where is Waldo?

    #define N 100
    Type* g_stack = new Type[N];
    int g_counter = 0;
    Lock g_lock;

    void push( Type& obj ) { lock(g_lock); ... unlock(g_lock); }
    void pop( Type& obj )  { lock(g_lock); ... unlock(g_lock); }

    void popAll( ) {
        lock(g_lock);
        delete[] g_stack;
        g_stack = new Type[N];
        g_counter = 0;
        unlock(g_lock);
    }

    int find( Type& obj, int number ) {
        lock(g_lock);
        int i;  // declared outside the loop because it is used after it
        for (i = 0; i < number; i++)
            if (obj == g_stack[i]) break;  // Found!!!
        if (i == number) i = -1;           // Not found: return -1 to caller
        unlock(g_lock);
        return i;
    }

    int find( Type& obj ) {
        return find( obj, g_counter );
    }

5 Can You Find the Race?
A similar problem was found in java.util.Vector.

    #define N 100
    Type* g_stack = new Type[N];
    int g_counter = 0;
    Lock g_lock;

    void push( Type& obj ) { lock(g_lock); ... unlock(g_lock); }
    void pop( Type& obj )  { lock(g_lock); ... unlock(g_lock); }

    void popAll( ) {
        lock(g_lock);
        delete[] g_stack;
        g_stack = new Type[N];
        g_counter = 0;                     // write
        unlock(g_lock);
    }

    int find( Type& obj, int number ) {
        lock(g_lock);
        int i;  // declared outside the loop because it is used after it
        for (i = 0; i < number; i++)
            if (obj == g_stack[i]) break;  // Found!!!
        if (i == number) i = -1;           // Not found: return -1 to caller
        unlock(g_lock);
        return i;
    }

    int find( Type& obj ) {
        return find( obj, g_counter );     // read
    }
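One possible repair, sketched under assumptions: the one-argument `find` reads `g_counter` before any lock is taken, while `popAll` writes it under `g_lock`, so the read is moved inside the critical section. The helper name `find_locked`, the use of `std::vector<int>` in place of the slide's `Type` array, and `std::mutex` in place of `Lock` are all hypothetical simplifications, not the deck's own fix.

```cpp
#include <mutex>
#include <vector>

// Hypothetical stand-ins for the slide's g_stack, g_counter and g_lock.
static std::vector<int> g_stack;
static int g_counter = 0;
static std::mutex g_lock;

void push(int obj) {
    std::lock_guard<std::mutex> guard(g_lock);
    g_stack.push_back(obj);
    ++g_counter;
}

void popAll() {
    std::lock_guard<std::mutex> guard(g_lock);
    g_stack.clear();
    g_counter = 0;  // write, under the lock
}

// Caller must already hold g_lock.
static int find_locked(int obj, int number) {
    for (int i = 0; i < number; ++i)
        if (g_stack[i] == obj) return i;  // found
    return -1;                            // not found
}

int find(int obj) {
    std::lock_guard<std::mutex> guard(g_lock);
    return find_locked(obj, g_counter);   // read, now under the same lock
}
```

Splitting the search into a helper that expects the lock to be held avoids re-acquiring a non-recursive mutex.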

6 Detecting Data Races?
NP-hard [Netzer & Miller 1990]
  Input size = # instructions performed
  Even for 3 threads only
  Even with no loops/recursion
Execution orders/scheduling: (# threads)^(thread_length)
# inputs
Detection code's side effects
Weak memory, instruction reordering, atomicity

7 Motivation
Run-time framework goals
  Collect a complete trace of a program's user-mode execution
  Keep the tracing overhead low in both space and time
  Re-simulate the traced execution deterministically from the collected trace, with full fidelity down to the instruction level
    Full fidelity: user mode only, no tracing of the kernel, only user-mode I/O callbacks
Advantages
  Complete program trace that can be analyzed from multiple perspectives (replay analyzers: debuggers, locality, etc.)
  Trace can be collected on one machine and replayed on other machines (or analyzed live by streaming)
Challenges: trace size and performance

8 Original Record-Replay Approaches
InstantReplay '87
  Records the order of memory accesses
  Overhead may affect program behavior
RecPlay '00
  Records only synchronizations
  Not deterministic if the program has data races
Netzer '93
  Records an optimal trace
  Too expensive to keep track of all memory locations
Bacon & Goldstein '91
  Records memory bus transactions with hardware
  High logging bandwidth

9 Motivation
Increasing use of, and development for, multi-core processors
Multithreaded (MT) program behavior is non-deterministic
  Shared memory updates can happen in a different order
To effectively debug software, developers must be able to replay executions that exhibit concurrency bugs

10 Related Concepts
Runtime interpretation/translation of binary instructions
  Requires no static instrumentation or special symbol information
  Handles dynamically generated code and self-modifying code
  Recording/Logging: ~ x
More recent logging
  Proposed hardware support (for the MT domain)
    FDR (Flight Data Recorder)
    BugNet (cache bits set on first load)
    RTR (Regulated Transitive Reduction)
    DeLorean (ISCA chunks of instructions)
  Strata (time layer across all the logs for the running threads)
  iDNA (diagnostic infrastructure using Nirvana, Microsoft)

11 Deterministic Replay
Re-execute the exact same sequence of instructions as recorded in a previous run
Single-threaded programs
  Record load values needed to reproduce the behavior of a run (Load Log)
  Registers updated by system calls and signal handlers (Reg Log)
  Output of special instructions: RDTSC, CPUID (Reg Log)
  System calls (virtualization: cloning arguments, updates)
  Checkpointing (log summary ~10 million)
Multi-threaded programs
  Log the interleaving among threads (shared memory update ordering: SMO Log)

12 PinSEL – System Effect Log (SEL)
Logging program load values needed for deterministic replay:
– First access from a memory location
– Values modified by the system (system effect) and read by the program
– Machine- and time-sensitive instructions (cpuid, rdtsc)
Figure (program execution, each load marked Logged or Not Logged): Store A; (A ← 111), Store B; (B ← 55), Load C; (C = 9), Load D; (D = 10), system call, Load A; (A = 111), Load B; (B = 0), Load C; (C = 99), Load D; (D = 10). The syscall modifies locations B (B -> 0) and C (C -> 99).
Trace size is ~4-5 bytes per instruction

13 Optimization: Trace select reads
Observation: hardware caches eliminate most off-chip reads
Optimize logging:
  Logger and replayer simulate identical cache memories
  A simple cache (the memory copy structure) decides which values to log; there are no tags or valid bits to check
  If the values mismatch, they are logged
Average trace size is <1 bit per instruction

    i = 1;
    for (j = 0; j < 10; j++) {
        i = i + j;
    }
    k = i;          // value read is 46
    System_call();
    k = i;          // value read is 0 (not predicted)

The only read that is not predicted, and is therefore logged, follows the system call
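The predict-and-compare idea on this slide can be sketched as a pair of classes that keep identical tag-less value caches. This is only an illustration under assumptions: the class names `Logger` and `Replayer`, the `LogEntry` layout, the slot count `kSlots` and the modulo indexing are hypothetical and are not taken from PinSEL/PinPLAY.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

constexpr std::size_t kSlots = 1024;  // hypothetical cache size

struct LogEntry {
    std::size_t load_index;  // which dynamic load was mispredicted
    std::uint64_t value;     // the value that load actually returned
};

// Recording side: logs a load only when the simulated cache predicts it wrongly.
class Logger {
    std::array<std::uint64_t, kSlots> cache_{};  // no tags, no valid bits
    std::size_t loads_ = 0;
    std::vector<LogEntry> log_;
public:
    void on_store(std::uint64_t addr, std::uint64_t value) { cache_[addr % kSlots] = value; }
    void on_load(std::uint64_t addr, std::uint64_t actual) {
        std::uint64_t& slot = cache_[addr % kSlots];
        if (slot != actual) {             // mismatch: value came from outside the program
            log_.push_back({loads_, actual});
            slot = actual;
        }
        ++loads_;
    }
    const std::vector<LogEntry>& log() const { return log_; }
};

// Replay side: same cache, but mispredicted loads are served from the log.
class Replayer {
    std::array<std::uint64_t, kSlots> cache_{};
    std::size_t loads_ = 0;
    std::size_t next_ = 0;
    std::vector<LogEntry> log_;
public:
    explicit Replayer(std::vector<LogEntry> log) : log_(std::move(log)) {}
    void on_store(std::uint64_t addr, std::uint64_t value) { cache_[addr % kSlots] = value; }
    std::uint64_t on_load(std::uint64_t addr) {
        std::uint64_t& slot = cache_[addr % kSlots];
        if (next_ < log_.size() && log_[next_].load_index == loads_) {
            slot = log_[next_].value;     // take the logged value instead of the prediction
            ++next_;
        }
        ++loads_;
        return slot;
    }
};
```

Because both sides apply the same updates to the same structure, aliasing between addresses only costs extra log entries and does not affect correctness.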

14 Example Overhead
PinSEL and PinPLAY
Initial work (2006) with single-threaded programs:
  SPEC2000 ref runs: 130x slowdown for PinSEL and ~80x for PinPLAY (without inlining)
Working with a subset of SPLASH2 benchmarks: 230x slowdown for PinSEL
Now: geo-mean SPEC2006
  Pin: 1.4x
  Logger: 83.6x
  Replayer: 1.4x

15 Example: Microsoft iDNA Trace Writer Performance
Columns: Application | Simulated instructions (millions) | Trace file size | Trace file bits/instruction | Native execution time | Execution time while tracing | Execution overhead
  Gzip | 24,… | … MB | … | … s | 187 s | 15.98
  Excel | 1,781 | 99 MB | … | … s | 105 s | 5.76
  PowerPoint | 7,… | … MB | … | … s | 247 s | 5.66
  IE | 1165 MB (instruction count and file size fused in the source) | … | … s | 6.94 s | 13.90
  Vulcan | 2,… | … MB | … | … s | 46.6 s | 17.01
  Satsolver | 9,… | … MB | … | … s | 127 s | 12.98
Memchecker and valgrind are in the 30-40x range on CPU 2006
iDNA: ~11x (does not log shared-memory dependences explicitly)
  Uses a sequential number for every lock-prefixed memory operation: offline data race analysis

16 Logging Shared Memory Ordering (Cristiano's PinSEL/PLAY Overview)
Emulation of directory-based cache coherence
Identifies RAW, WAR, WAW dependences
Directory is indexed by hashing the effective address
Each entry represents an address range
Figure: program execution (Store A, Load B) -> hash -> Dir Entry in the Directory

17 Directory Entries
Every DirEntry maintains:
  Thread id of the last_writer
  Vector of timestamps of the last access by each thread to that entry
    A timestamp is the # of memory references the thread has executed
On loads: update the timestamp for the thread in the entry
On stores: update the timestamp and the last_writer field
Figure: program execution of Thread T1 (1: Store A, 2: Load A, 3: Load F) and Thread T2 (1: Load F, 2: Store A, 3: Store F) updating the Directory entries DirEntry [A:D] and DirEntry [E:H]; each entry shows its last writer id and its vector of per-thread timestamps (T1, T2).

18 Detecting Dependences
A RAW dependency between threads T and T' is established if:
  T executes a load that maps to the directory entry A
  T' is the last_writer for the same entry
A WAW dependency between T and T' is established if:
  T executes a store that maps to the directory entry A
  T' is the last_writer for the same entry
A WAR dependency between T and T' is established if:
  T executes a store that maps to the directory entry A
  T' has accessed the same entry in the past and T is not the last_writer
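The directory bookkeeping and the three rules above can be sketched as follows. This is an illustrative reading of the slides, not the PinPLAY implementation: the names `Directory`, `DirEntry`, `Edge` and `Dep`, the fixed thread count `kThreads`, and the mapping of four consecutive addresses to one entry (as in the [A:D] and [E:H] ranges) are assumptions. The store rule follows the slide literally, so nothing is reported when the storing thread is already the last writer.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int kThreads = 2;    // assumption: fixed, small thread count
constexpr int kNoWriter = -1;  // sentinel: entry has not been written yet

enum class Dep { RAW, WAW, WAR };

// "to" must wait until "from" has executed memory reference from_ts.
struct Edge {
    Dep kind;
    int from_thread;
    std::uint64_t from_ts;
    int to_thread;
    std::uint64_t to_ts;
};

struct DirEntry {
    int last_writer = kNoWriter;
    std::array<std::uint64_t, kThreads> last_access{};  // 0 means "never accessed"
};

class Directory {
    std::vector<DirEntry> entries_;
    std::array<std::uint64_t, kThreads> clock_{};  // memory references executed per thread
    std::vector<Edge> edges_;

    // Each entry covers a range of 4 addresses; entries are picked by a trivial hash.
    DirEntry& entry_for(std::uint64_t addr) { return entries_[(addr / 4) % entries_.size()]; }

public:
    explicit Directory(std::size_t n) : entries_(n) {}

    void load(int t, std::uint64_t addr) {
        const std::uint64_t ts = ++clock_[t];
        DirEntry& e = entry_for(addr);
        if (e.last_writer != kNoWriter && e.last_writer != t)
            edges_.push_back({Dep::RAW, e.last_writer, e.last_access[e.last_writer], t, ts});
        e.last_access[t] = ts;
    }

    void store(int t, std::uint64_t addr) {
        const std::uint64_t ts = ++clock_[t];
        DirEntry& e = entry_for(addr);
        if (e.last_writer != t) {
            for (int other = 0; other < kThreads; ++other) {
                if (other == t || e.last_access[other] == 0) continue;
                const Dep kind = (other == e.last_writer) ? Dep::WAW : Dep::WAR;
                edges_.push_back({kind, other, e.last_access[other], t, ts});
            }
        }
        e.last_access[t] = ts;
        e.last_writer = t;
    }

    const std::vector<Edge>& edges() const { return edges_; }
};
```

With thread 0 as T1 and thread 1 as T2, feeding in the six references of the running example in the order T1 Store A, T2 Load F, T2 Store A, T1 Load A, T1 Load F, T2 Store F yields one WAW, one RAW and one WAR edge.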

19 Example
Figure: program execution of Thread T1 (1: Store A, 2: Load A, 3: Load F) and Thread T2 (1: Load F, 2: Store A, 3: Store F) against DirEntry [A:D] and DirEntry [E:H], with the last_writer and the last access to each DirEntry highlighted. Dependences marked: WAW, RAW, WAR, between the reference pairs (T1 2, T2 2), (T1 3, T2 3) and (T2 2, T1 1).
SMO logs:
  Thread T1 cannot execute memory reference 2 until T2 executes its memory reference 2
  Thread T2 cannot execute memory reference 2 until T1 executes its memory reference 1

20 Ordering Memory Accesses (Reducing log size)
Preserving order will reproduce the execution
  a → b: "a happens-before b"
  Ordering is transitive: a → b, b → c means a → c
Two instructions must be ordered if:
  they both access the same memory, and
  one of them is a write
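The ordering condition on this slide reduces to a small predicate. The `Access` struct and the function name `must_be_ordered` are hypothetical; the extra check that the two accesses come from different threads is an added assumption, since accesses within one thread are already ordered by program order.

```cpp
#include <cstdint>

// One dynamic memory access (hypothetical record layout).
struct Access {
    int thread;
    std::uint64_t address;
    bool is_write;
};

// True when the two accesses conflict and so must be replayed in their recorded order.
bool must_be_ordered(const Access& a, const Access& b) {
    return a.thread != b.thread          // assumption: same-thread order comes for free
        && a.address == b.address        // they both access the same memory
        && (a.is_write || b.is_write);   // and at least one of them is a write
}
```

Two reads of the same location never need an ordering constraint, which is why read-mostly data contributes little to the log.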

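21 Constraints: Enforcing Order
Figure: P1 executes a, then b; P2 executes c, then d
To guarantee a → d, any one of these constraints suffices (all but the first are overconstrained):
  a → d
  b → d
  a → c
  b → c
Suppose we need b → c
  b → c is necessary
  a → d is redundant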

22 Problem Formulation
Reproduce the exact same conflicts: no more, no less
Figure: Recording and Replay panels, each showing the instruction streams of Thread I and Thread J (ld A, st B, st C, sub, ld B, add, st C, ld B, st A, st C, ld D, st D), connected through the Log. Conflicts are drawn in red, dependences in black.

23 Log All Conflicts
Detect conflicts
  Assign IC (logical timestamps)
Write log
Figure: Replay of Thread I and Thread J (ld A, st B, st C, sub, ld B, add, st C, ld B, st A, st C, ld D, st D) with the dependence log
  Log J: 2→3, 1→4, 3→5, 4→6
  Log I: 2→3
  Each dependence log entry takes 16 bytes
  Log size: 5*16 = 80 bytes (10 integers)
But too many conflicts

24 Netzer's Transitive Reduction
Figure: Replay of Thread I and Thread J (ld A, st B, st C, sub, ld B, add, st C, ld B, st A, st C, ld D, st D) with the TR reduced log
  Log J: 2→3, 3→5, 4→6
  Log I: 2→3
  Log size: 64 bytes (8 integers)
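The reduction from five logged dependences to four can be reproduced with a small vector-clock filter. This is a sketch of the general idea of dropping dependences already implied by logged ones, not Netzer's algorithm as published: the names `ReducedLog`, `Dependence` and `Clock` are hypothetical, and the sketch assumes conflicts are reported in execution order.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <map>
#include <vector>

// clock[t] = latest instruction count (IC) of thread t known to happen-before this point.
using Clock = std::vector<std::uint64_t>;

// src_thread's instruction src_ic must precede dst_thread's instruction dst_ic.
struct Dependence {
    std::size_t src_thread;
    std::uint64_t src_ic;
    std::size_t dst_thread;
    std::uint64_t dst_ic;
};

class ReducedLog {
    std::size_t n_;
    // history_[t] maps an IC of thread t to the clock that holds from that IC onward.
    std::vector<std::map<std::uint64_t, Clock>> history_;
    std::vector<Dependence> log_;

    Clock clock_at(std::size_t t, std::uint64_t ic) const {
        auto it = history_[t].upper_bound(ic);       // first clock change after ic
        if (it == history_[t].begin()) return Clock(n_);
        return std::prev(it)->second;
    }

public:
    explicit ReducedLog(std::size_t threads) : n_(threads), history_(threads) {}

    // Returns true if the dependence had to be logged, false if it was already implied.
    bool add(const Dependence& d) {
        Clock dst = clock_at(d.dst_thread, d.dst_ic);
        if (dst[d.src_thread] >= d.src_ic) return false;  // implied by earlier entries
        const Clock src = clock_at(d.src_thread, d.src_ic);
        for (std::size_t i = 0; i < n_; ++i) dst[i] = std::max(dst[i], src[i]);
        dst[d.src_thread] = d.src_ic;
        history_[d.dst_thread][d.dst_ic] = dst;
        log_.push_back(d);
        return true;
    }

    const std::vector<Dependence>& log() const { return log_; }
};
```

Taking Thread I as thread 0 and Thread J as thread 1 and feeding in the five dependences from the previous slide, 1→4 is rejected because 2→3 already implies it, which leaves the four entries (64 bytes at 16 bytes each) shown here.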

25 RTR (Regulated Transitive Reduction): Stricter Dependences to Aid Vectorization
Figure: Replay of Thread I and Thread J (ld A, st B, st C, sub, ld B, add, st C, ld B, st A, st C, ld D, st D) with the new reduced log; dependences are labeled "stricter" and "reduced"
  Log J: 2→3, 4→5
  Log I: 2→3
  Log size: 48 bytes (6 integers)
4% overhead for RTR+FDR (simulated on GEMS)
.2 MB/core/second logging (Apache)