HARD: Hardware-Assisted Lockset-based Race Detection. P. Zhou, R. Teodorescu, Y. Zhou. HPCA'07. Shimin Chen, LBA Reading Group Presentation.


Motivation Data race detection is important. S/W solutions are slow (not good for production runs). Previous H/W solutions focus on the happens-before relation, so they cannot detect potential races.

Motivating Example

Solution: HARD (h/w lockset) Challenges: – How to efficiently store and maintain the lockset for each variable in hardware? – How to efficiently perform the set operations in the lockset algorithm? Main ideas (detailed later): – H/W Bloom filter – Piggybacking on cache coherence protocols – Resetting all Bloom filters after exiting a barrier

Outline LockSet (refresh our memory) HARD Evaluation Conclusion

Main Lockset Algorithm Idea: accesses to every shared variable should be protected by some common lock. Data structures: – Thread t's current lock set: L(t) – Candidate set for a variable v: C(v) Algorithm: – Update L(t) upon lock acquire and release – Initialize C(v) to the set of all locks – When t accesses v, C(v) = C(v) ∩ L(t) – If C(v) == ∅, report a violation on variable v
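As a refresher, the check itself is tiny. Below is a minimal software sketch of this lockset refinement, assuming each lock is identified by a small integer so a lockset fits in a 64-bit mask (the type and function names are illustrative, not Eraser's or HARD's):

#include <stdint.h>
#include <stdio.h>

typedef uint64_t lockset_t;               /* bit i set => lock i is in the set */
#define ALL_LOCKS ((lockset_t)~0ULL)      /* "set of all locks" initial value  */

typedef struct { lockset_t candidate; int initialized; } var_state_t;

/* Called on every access by a thread (holding held_locks) to variable v. */
static void on_access(var_state_t *v, lockset_t held_locks, const char *name) {
    if (!v->initialized) { v->candidate = ALL_LOCKS; v->initialized = 1; }
    v->candidate &= held_locks;           /* C(v) = C(v) ∩ L(t) */
    if (v->candidate == 0)                /* C(v) == ∅          */
        printf("potential race on %s\n", name);
}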

Reducing False Positives

Outline LockSet (refresh our memory) HARD Evaluation Conclusion

HARD Overview Per-cache-line metadata: LState (2 bits): exclusive, shared, etc.; BFVector (16 bits): candidate lock set for the cache line. Per-thread state: Lock Register: the thread's lockset; Counter Register (32 bits): used for resolving hash collisions (more detail later).
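For concreteness, the per-line metadata can be pictured as a small C bit-field struct (a sketch only; the field and type names are mine, not the paper's):

#include <stdint.h>

typedef struct {
    unsigned lstate   : 2;   /* line state: exclusive, shared, ...                 */
    unsigned bfvector : 16;  /* candidate lockset, stored as a small Bloom filter  */
} hard_line_meta_t;          /* hypothetical name for the per-cache-line metadata  */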

HARD Overview: Operations A lock is represented by '1' bits in the Bloom filter. Fetching a line from memory: set the BFVector to all 1s and LState to exclusive. Update the BFVector and LState on accesses. Communicate them through the coherence protocol. Lock register: the thread's lock set.

Bloom Filter A Bloom filter is a bit vector that represents a set of keys. – A key is hashed d (e.g. d = 3) times and represented by d bits: Bit0 = H0(key), Bit1 = H1(key), Bit2 = H2(key). Construct: for every key in the set, set its d bits in the vector. Membership test: given a key, check whether all its d bits are 1. – Definitely not in the set if some bit is 0. – May have false positives.
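A minimal software analogue of this construct/test pair, with d = 3 toy hash functions over a 64-bit vector (the hash functions and names below are illustrative only):

#include <stdint.h>
#include <stdbool.h>

/* Three toy hash functions mapping a key to a bit position in [0, 64). */
static unsigned h0(uint32_t k) { return (k * 2654435761u) >> 26; }
static unsigned h1(uint32_t k) { return (k * 40503u + 7) >> 26; }
static unsigned h2(uint32_t k) { return ((k ^ (k >> 13)) * 97u) >> 26; }

static void bf_insert(uint64_t *filter, uint32_t key) {
    *filter |= (1ULL << h0(key)) | (1ULL << h1(key)) | (1ULL << h2(key));
}

static bool bf_maybe_member(uint64_t filter, uint32_t key) {
    uint64_t bits = (1ULL << h0(key)) | (1ULL << h1(key)) | (1ULL << h2(key));
    return (filter & bits) == bits;  /* all set => maybe in; any 0 => definitely out */
}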

Representing the Lockset as a Bloom Filter 4 hash functions, one per 4-bit part. Lockset intersection: Bloom filter intersection (bitwise AND). Lockset empty: the set is treated as empty if any of the 4-bit parts is all 0s.
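A sketch of these operations on a 16-bit BFVector organized as four 4-bit parts, the layout the surrounding slides imply (the lock-to-bits hash here is a toy stand-in, not the paper's hash functions):

#include <stdint.h>
#include <stdbool.h>

typedef uint16_t bfvec_t;                    /* 4 parts x 4 bits = 16 bits */

/* Map a lock address to one bit in each part p = 0..3 (toy hash). */
static bfvec_t lock_to_bits(uintptr_t lock_addr) {
    bfvec_t v = 0;
    for (int p = 0; p < 4; p++) {
        unsigned bit = (unsigned)((lock_addr >> (4 * p + 3)) & 0x3);  /* 0..3 */
        v |= (bfvec_t)(1u << (4 * p + bit));
    }
    return v;
}

static bfvec_t bf_on_line_fetch(void)               { return 0xFFFF; }   /* all 1s: "all locks" */
static bfvec_t bf_add_lock(bfvec_t ls, uintptr_t l) { return ls | lock_to_bits(l); }
static bfvec_t bf_intersect(bfvec_t a, bfvec_t b)   { return a & b; }    /* lockset intersection */
static bool    bf_is_empty(bfvec_t v) {
    for (int p = 0; p < 4; p++)                      /* empty if any 4-bit part is all zeros */
        if (((v >> (4 * p)) & 0xF) == 0) return true;
    return false;
}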

False Negative Caused by Bloom Filter

Prob of False Negatives Suppose the candidate set contains m locks. For a lock that is not in the set, the probability of falsely recognizing it as a member is prob_whole = prob_part^k, where prob_part = 1 − (1 − 1/n)^m, for a filter with k parts of n bits each. When k = 4 and n = 4, the probabilities for m = 1, 2, 3 are small (computed in the sketch below). – Paper says: "experiments show that no races were missed" But what if the thread currently holds multiple locks?
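Plugging k = 4, n = 4 into the formula gives concrete values; the snippet below is my own calculation, not numbers copied from the paper:

#include <stdio.h>
#include <math.h>

int main(void) {
    int k = 4, n = 4;                        /* 4 parts of 4 bits each          */
    for (int m = 1; m <= 3; m++) {           /* m locks in the candidate set    */
        double prob_part  = 1.0 - pow(1.0 - 1.0 / n, m);
        double prob_whole = pow(prob_part, k);
        printf("m=%d: prob_whole = %.4f\n", m, prob_whole);
    }
    return 0;                                /* prints ~0.0039, ~0.0366, ~0.1117 */
}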

If threads hold 1 to 8 locks (not in the paper) [table: false-negative probabilities for t = 1..8 held locks and m = 1..4 candidate-set locks, with n = 4 bits per part, k = 4 parts; values not preserved in this transcript]

Try another design [table: the same sweep with n = 8 bits per part; values not preserved in this transcript]

Unlock operation: how to remove a bit from the Bloom filter? Use the 32-bit counter register: each Bloom filter bit has a 2-bit counter. Lock: when setting a Bloom filter bit, increment its 2-bit counter. Unlock: decrement the 2-bit counter; if it reaches 0, clear the Bloom filter bit.
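A sketch of this counting-Bloom-filter bookkeeping for a thread's 16-bit lock register (names are mine; saturating the 2-bit counter at 3 is an assumption about overflow handling, which the paper may treat differently):

#include <stdint.h>

typedef struct {
    uint16_t bits;       /* 16-bit lockset Bloom filter (thread's lock register) */
    uint8_t  cnt[16];    /* one 2-bit counter per filter bit (values 0..3)       */
} counting_bf_t;

/* On lock acquire: set each of the lock's bits and bump its counter. */
static void cbf_lock(counting_bf_t *f, uint16_t lock_bits) {
    for (int i = 0; i < 16; i++)
        if (lock_bits & (1u << i)) {
            f->bits |= (uint16_t)(1u << i);
            if (f->cnt[i] < 3) f->cnt[i]++;          /* saturate at the 2-bit max */
        }
}

/* On lock release: decrement counters; clear a bit only when its count hits 0. */
static void cbf_unlock(counting_bf_t *f, uint16_t lock_bits) {
    for (int i = 0; i < 16; i++)
        if (lock_bits & (1u << i))
            if (f->cnt[i] > 0 && --f->cnt[i] == 0)
                f->bits &= (uint16_t)~(1u << i);
}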

Candidate Set and LState Communication Must broadcast changes to C(v) if the cache line is in the shared state.

Handling Barriers Set BFVectors to all 1s after exiting a barrier (what if t2 does not hold any lock?)

Three Approximations Bloom filter to represent the lockset. Lockset info kept only in the cache – can only detect races within a short window of execution. Cache-line granularity – false sharing – could the compiler place shared variables on different lines? – removing false sharing is generally good anyway.

Outline LockSet (refresh our memory) HARD Evaluation Conclusion

Methodology SESC: a cycle-accurate, execution-driven simulator (MIPS instruction set). Six SPLASH-2 benchmarks. Randomly inject a data race: remove a randomly chosen dynamic instance of a lock and its corresponding unlock. Compare with happens-before and an ideal lockset.

Bugs detected and false alarms Ideal: word granularity, state kept in memory, perfect lockset. The number of false alarms is counted as the number of source code locations; the number of dynamic errors is much higher.

Overhead: mainly a bus traffic increase. Note that HARD requires a Bloom filter operation per memory access in the processor pipeline.

Conclusion Main idea: Bloom filter to represent the lockset. Three approximations: – Bloom filter to represent the lockset – Lockset info only in the cache – Cache-line granularity Problems: – Lockset: false positives – Seems hard to add these operations into the processor pipeline – Are these the right approximations for monitoring production runs?