Detecting and surviving data races using complementary schedules

Slides:



Advertisements
Similar presentations
On-the-fly Healing of Race Conditions in ARINC-653 Flight Software
Advertisements

PEREGRINE: Efficient Deterministic Multithreading through Schedule Relaxation Heming Cui, Jingyue Wu, John Gallagher, Huayang Guo, Junfeng Yang Software.
Debugging operating systems with time-traveling virtual machines Sam King George Dunlap Peter Chen CoVirt Project, University of Michigan.
An Case for an Interleaving Constrained Shared-Memory Multi-Processor Jie Yu and Satish Narayanasamy University of Michigan.
Stanford University CS243 Winter 2006 Wei Li 1 Register Allocation.
OSDI ’10 Research Visions 3 October Epoch parallelism: One execution is not enough Jessica Ouyang, Kaushik Veeraraghavan, Dongyoon Lee, Peter Chen,
OSDI ’10 Research Visions 3 October Epoch parallelism: One execution is not enough Jessica Ouyang, Kaushik Veeraraghavan, Dongyoon Lee, Peter Chen,
Race Detection for Event-driven Mobile Applications
Iterative Context Bounding for Systematic Testing of Multithreaded Programs Madan Musuvathi Shaz Qadeer Microsoft Research.
Race Conditions. Isolated & Non-Isolated Processes Isolated: Do not share state with other processes –The output of process is unaffected by run of other.
An Case for an Interleaving Constrained Shared-Memory Multi- Processor CS6260 Biao xiong, Srikanth Bala.
EEC 688/788 Secure and Dependable Computing Lecture 12 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University
S. Narayanasamy, Z. Wang, J. Tigani, A. Edwards, B. Calder UCSD and Microsoft PLDI 2007.
Continuously Recording Program Execution for Deterministic Replay Debugging.
Execution Replay for Multiprocessor Virtual Machines George W. Dunlap Dominic Lucchetti Michael A. Fetterman Peter M. Chen.
EEC 693/793 Special Topics in Electrical Engineering Secure and Dependable Computing Lecture 12 Wenbing Zhao Department of Electrical and Computer Engineering.
Dongyoon Lee, Benjamin Wester, Kaushik Veeraraghavan, Satish Narayanasamy, Peter M. Chen, and Jason Flinn University of Michigan, Ann Arbor Respec: Efficient.
7. Fault Tolerance Through Dynamic or Standby Redundancy 7.5 Forward Recovery Systems Upon the detection of a failure, the system discards the current.
Parallelizing Data Race Detection Benjamin Wester Facebook David Devecsery, Peter Chen, Jason Flinn, Satish Narayanasamy University of Michigan.
DoublePlay: Parallelizing Sequential Logging and Replay Kaushik Veeraraghavan Dongyoon Lee, Benjamin Wester, Jessica Ouyang, Peter M. Chen, Jason Flinn,
1 ExtraVirt: Detecting and recovering from transient processor faults Dominic Lucchetti, Steve Reinhardt, Peter Chen University of Michigan.
Rex: Replication at the Speed of Multi-core Zhenyu Guo, Chuntao Hong, Dong Zhou*, Mao Yang, Lidong Zhou, Li Zhuang Microsoft ResearchCMU* 1.
MSWAT: Low-Cost Hardware Fault Detection and Diagnosis for Multicore Systems Siva Kumar Sastry Hari, Man-Lap (Alex) Li, Pradeep Ramachandran, Byn Choi,
Light64: Lightweight Hardware Support for Data Race Detection during Systematic Testing of Parallel Programs A. Nistor, D. Marinov and J. Torellas to appear.
0 Deterministic Replay for Real- time Software Systems Alice Lee Safety, Reliability & Quality Assurance Office JSC, NASA Yann-Hang.
Orchestra: Intrusion Detection Using Parallel Execution and Monitoring of Program Variants in User-Space Babak Salamat, Todd Jackson, Andreas Gal, Michael.
15-740/ Oct. 17, 2012 Stefan Muller.  Problem: Software is buggy!  More specific problem: Want to make sure software doesn’t have bad property.
Accelerating Precise Race Detection Using Commercially-Available Hardware Transactional Memory Support Serdar Tasiran Koc University, Istanbul, Turkey.
CS 153 Design of Operating Systems Spring 2015 Lecture 11: Scheduling & Deadlock.
- 1 - Dongyoon Lee, Peter Chen, Jason Flinn, Satish Narayanasamy University of Michigan, Ann Arbor Chimera: Hybrid Program Analysis for Determinism * Chimera.
- 1 - Dongyoon Lee †, Mahmoud Said*, Satish Narayanasamy †, Zijiang James Yang*, and Cristiano L. Pereira ‡ University of Michigan, Ann Arbor † Western.
SSGRR A Taxonomy of Execution Replay Systems Frank Cornelis Andy Georges Mark Christiaens Michiel Ronsse Tom Ghesquiere Koen De Bosschere Dept. ELIS.
Eraser: A Dynamic Data Race Detector for Multithreaded Programs STEFAN SAVAGE, MICHAEL BURROWS, GREG NELSON, PATRICK SOBALVARRO, and THOMAS ANDERSON Ethan.
Parallelizing Security Checks on Commodity Hardware Ed Nightingale Dan Peek, Peter Chen Jason Flinn Microsoft Research University of Michigan.
COMP 111 Threads and concurrency Sept 28, Tufts University Computer Science2 Who is this guy? I am not Prof. Couch Obvious? Sam Guyer New assistant.
ECE 259 / CPS 221 Advanced Computer Architecture II (Parallel Computer Architecture) Availability Copyright 2004 Daniel J. Sorin Duke University.
Cooperative Concurrency Bug Isolation Guoliang Jin, Aditya Thakur, Ben Liblit, Shan Lu University of Wisconsin–Madison Instrumentation and Sampling Strategies.
On-Demand Dynamic Software Analysis Joseph L. Greathouse Ph.D. Candidate Advanced Computer Architecture Laboratory University of Michigan December 12,
Accelerating Dynamic Software Analyses Joseph L. Greathouse Ph.D. Candidate Advanced Computer Architecture Laboratory University of Michigan December 1,
Speculative Execution in a Distributed File System Ed Nightingale Peter Chen Jason Flinn University of Michigan.
Seminar of “Virtual Machines” Course Mohammad Mahdizadeh SM. University of Science and Technology Mazandaran-Babol January 2010.
Drinking from Both Glasses: Adaptively Combining Pessimistic and Optimistic Synchronization for Efficient Parallel Runtime Support Man Cao Minjia Zhang.
…and region serializability for all JESSICA OUYANG, PETER CHEN, JASON FLINN & SATISH NARAYANASAMY UNIVERSITY OF MICHIGAN.
Ali Kheradmand, Baris Kasikci, George Candea Lockout: Efficient Testing for Deadlock Bugs 1.
Dongyoon Lee, Benjamin Wester, Kaushik Veeraraghavan, Satish Narayanasamy, Peter M. Chen, and Jason Flinn University of Michigan, Ann Arbor Respec: Efficient.
Detecting Atomicity Violations via Access Interleaving Invariants
Speculation Supriya Vadlamani CS 6410 Advanced Systems.
A Binary Agent Technology for COTS Software Integrity Anant Agarwal Richard Schooler InCert Software.
On-Demand Dynamic Software Analysis Joseph L. Greathouse Ph.D. Candidate Advanced Computer Architecture Laboratory University of Michigan November 29,
Network-based and Attack-resilient Length Signature Generation for Zero-day Polymorphic Worms Zhichun Li 1, Lanjia Wang 2, Yan Chen 1 and Judy Fu 3 1 Lab.
Flashback : A Lightweight Extension for Rollback and Deterministic Replay for Software Debugging Sudarshan M. Srinivasan, Srikanth Kandula, Christopher.
Speculative Execution in a Distributed File System Ed Nightingale Peter Chen Jason Flinn University of Michigan Best Paper at SOSP 2005 Modified for CS739.
Soyeon Park, Shan Lu, Yuanyuan Zhou UIUC Reading Group by Theo.
CHESS Finding and Reproducing Heisenbugs in Concurrent Programs
Reachability Testing of Concurrent Programs1 Reachability Testing of Concurrent Programs Richard Carver, GMU Yu Lei, UTA.
Where Testing Fails …. Problem Areas Stack Overflow Race Conditions Deadlock Timing Reentrancy.
PRES: Probabilistic Replay with Execution Sketching on Multiprocessors Soyeon Park and Yuanyuan Zhou (UCSD) Weiwei Xiong, Zuoning Yin, Rini Kaushik, Kyu.
Kendo: Efficient Deterministic Multithreading in Software M. Olszewski, J. Ansel, S. Amarasinghe MIT to be presented in ASPLOS 2009 slides by Evangelos.
Remote Backup Systems.
Presented by: Daniel Taylor
MSWAT: Hardware Fault Detection and Diagnosis for Multicore Systems
Effective Data-Race Detection for the Kernel
Lazy Diagnosis of In-Production Concurrency Bugs
Heming Cui, Jingyue Wu, John Gallagher, Huayang Guo, Junfeng Yang
A test technique is a recipe these tasks that will reveal something
Replication-based Fault-tolerance for Large-scale Graph Processing
EECE.4810/EECE.5730 Operating Systems
Remote Backup Systems.
CSE 542: Operating Systems
EECE.4810/EECE.5730 Operating Systems
Presentation transcript:

Detecting and surviving data races using complementary schedules Kaushik Veeraraghavan Peter Chen, Jason Flinn, Satish Narayanasamy University of Michigan

Multicores/multiprocessors are ubiquitous Most desktops, laptops & cellphones use multiprocessors Multithreading is a common way to exploit hardware parallelism Problem: it is hard to write correct multithreaded programs! Kaushik Veeraraghavan

Data races are a serious problem Data race: Two instructions (at least one of which is a write) that access the same shared data without being ordered by synchronization Data races can cause catastrophic failures Therac-25 radiation overdose 2003 Northeast US power blackout proc_info = 0; MySQL bug #3596 crash If (proc_info) { fputs (proc_info, f); } Kaushik Veeraraghavan

First goal: efficient data race detection High coverage (find harmful data races) Accurate (no false positives) Low overhead High coverage Sampling Native (C/C++) ThreadSanitizer (30X) Frost (3X) DataCollider (1.1x with 4 watchpoints) Frost (1.18x @ 3.5% coverage) Managed (Java/C#) FastTrack (8.5X) PACER (1.6-2.1x @ 3% coverage) Kaushik Veeraraghavan

Second goal: data race survival Unknown data race might manifest at runtime Mask harmful effect so system stays running Kaushik Veeraraghavan

Kaushik Veeraraghavan Outline Motivation Design Outcome-based race detection Complementary schedules Implementation: Frost New, fast method to detect the effect of a data race Masks effect of harmful data race bug Evaluation Kaushik Veeraraghavan

Kaushik Veeraraghavan State is what matters All prior data race detectors analyze events Shared memory accesses are very frequent New idea: run multiple replicas and analyze state Goal: replicas diverge if and only if harmful data race proc_info = 0; crash If (proc_info) { fputs (proc_info, f); } proc_info = 0; If (proc_info) { fputs (proc_info, f); } ✔ Kaushik Veeraraghavan

Kaushik Veeraraghavan No false positives Divergence  data race Race-free replicas will never diverge Identical inputs Obey same happens-before ordering Outcome-based race detection Divergence in program or output state indicates race Kaushik Veeraraghavan

Minimize false negatives Harmful data race  divergence Complementary schedules Make replica schedules as dissimilar as possible If instructions A & B are unordered, one replica executes A before B and the other executes B before A Kaushik Veeraraghavan

Complementary schedules in action unlock (*fifo); fifo = NULL; unlock (*fifo); fifo = NULL; ✔ crash We do not know a priori that a race exists Replicas schedule unordered instructions in opposite orders Race detection: replicas diverge in output Race survival: use surviving replica to continue program Kaushik Veeraraghavan

How to construct complementary schedules? Problem: we don’t know which instructions race Try and flip all pairs of unordered instructions Record total ordering of instructions in one replica Only one thread runs at a time Each thread runs non-preemptively until it blocks Other replica executes instructions in reverse order T1 T3 T2 T2 T3 T1 Kaushik Veeraraghavan

Kaushik Veeraraghavan Type I data race bug crash unlock (*fifo); fifo = NULL; Replica 1 ✔ unlock (*fifo); fifo = NULL; Replica 2 fifo = NULL; unlock (*fifo); crash Failure requirement: order of instructions that leads to failure E.g.: if “fifo = NULL;” is ordered first, program crashes Type I bug: all failure requirements point in same direction Guarantee race detection for synchronization-free region as replicas diverge Survival if we can identify correct replica Kaushik Veeraraghavan

Kaushik Veeraraghavan Type II data race bug proc_info = 0; crash If (proc_info) { fputs (proc_info, f); } proc_info = 0; If(proc_info) { fputs(proc_info, f); } Replica 1 ✔ proc_info = 0; If(proc_info) { fputs(proc_info, f); } Replica 2 ✔ Type II bug: failure requirements point in opposite directions Guarantee data race survival for synchronization-free region Both replicas avoid the failure Kaushik Veeraraghavan

Leverage uniparallelism to scale performance CPU 0 CPU 1 CPU 2 CPU 3 CPU 4 CPU 5 TIME Each epoch has three replicas T1 T2 T2 T1 T2 T1 ckpt T2 T1 T2 T1 Frost executes three replicas of each epoch Leading replica provides checkpoint and non-deterministic event log Trailing replicas run complementary schedules Upto 3X overhead, but still cheaper than traditional race detectors Kaushik Veeraraghavan

Analyzing epoch outcomes for race detection CPU 0 CPU 1 CPU 2 CPU 3 CPU 4 CPU 5 TIME T1 T2 T2 T1 T2 T1 T2 T1 T2 T1 Each epoch has three replicas Do replica states match? Race detected if replicas diverge Self-evident failure? Output or memory difference? Frost guarantees replay for offline debugging Kaushik Veeraraghavan

Analyzing epoch outcomes for survival Likely bug Survival strategy A-AA None Commit A F-FF Non-race bug Rollback A-AB/A-BA Type I A-AF/A-FA F-FA/F-AF A-BB Type II Commit B A-BC Commit B or C F-AA F-AB Commit A or B A-BF/A-FB Multiple A-FF Kaushik Veeraraghavan

Analyzing epoch outcomes for survival Likely bug Survival strategy A-AA None Commit A F-FF Non-race bug Rollback A-AB/A-BA Type I A-AF/A-FA F-FA/F-AF A-BB Type II Commit B A-BC Commit B or C F-AA F-AB Commit A or B A-BF/A-FB Multiple A-FF All replicas agree Kaushik Veeraraghavan

Analyzing epoch outcomes for survival Likely bug Survival strategy A-AA None Commit A F-FF Non-race bug Rollback A-AB/A-BA Type I A-AF/A-FA F-FA/F-AF A-BB Type II Commit B A-BC Commit B or C F-AA F-AB Commit A or B A-BF/A-FB Multiple A-FF Two outcomes/trailing replicas differ Kaushik Veeraraghavan

Analyzing epoch outcomes for survival Likely bug Survival strategy A-AA None Commit A F-FF Non-race bug Rollback A-AB/A-BA Type I A-AF/A-FA F-FA/F-AF A-BB Type II Commit B A-BC Commit B or C F-AA F-AB Commit A or B A-BF/A-FB Multiple A-FF Trailing replicas do not fail Kaushik Veeraraghavan

Analyzing epoch outcomes for survival Likely bug Survival strategy A-AA None Commit A F-FF Non-race bug Rollback A-AB/A-BA Type I A-AF/A-FA F-FA/F-AF A-BB Type II Commit B A-BC Commit B or C F-AA F-AB Commit A or B A-BF/A-FB Multiple A-FF Kaushik Veeraraghavan

Kaushik Veeraraghavan Limitations Multiple type I bugs in an epoch Rollback and reduce epoch length to separate bugs Priority-inversion If >2 threads involved in race, 2 replicas insufficient to flip races Heuristic: threads with frequent constraints are adjacent in order Epoch boundaries Insert epochs only on system calls. Detection of Type II bugs Usually some difference in program state or output Kaushik Veeraraghavan

Frost detects and survives all harmful races Application Bug manifestation Outcome % survived % detected Recovery time (sec) pbzip2 crash F-AA 100% 0.01 Apache #21287 double free A-BB/A-AB 0.00 Apache #25520 corrupted out. A-BC Apache #45605 assertion A-AB MySQL #644 0.02 MySQL #791 missing output MySQL #2011 0.22 MySQL #3596 F-BC MySQL #12848 F-FA 0.29 pfscan infinite loop Glibc #12486 Kaushik Veeraraghavan

Frost detects all harmful races as traditional detector Application Harmful race detected Benign races Traditional Frost pbzip2 5 3 1 Apache: #21287 55 2 Apache: #25520 61 Apache: #45605 65 MySQL: #644 4 2899 MySQL: #791 808 MySQL: #2011 1414 MySQL: #3596 658 MySQL: #12848 1449 pfscan Glibc: #12486 6 9 Kaushik Veeraraghavan

Frost: performance given spare cores 12% 8% 3% 11% Overhead 3% to 12% given spare cores Kaushik Veeraraghavan

Frost: performance without spare cores 194% 127% Overhead ≈200% for cpu-bound apps without spare cores Kaushik Veeraraghavan

Kaushik Veeraraghavan Frost summary Two new ideas Outcome-based race detection Complementary schedules Fast data race detection with high coverage 3%—12% overhead, given spare cores ≈200% overhead, without spare cores Survives all harmful data race bugs in our tests Kaushik Veeraraghavan

Kaushik Veeraraghavan Backup Kaushik Veeraraghavan

Performance: scalability on a 32-core pfscan: Frost scales upto 7 cores Kaushik Veeraraghavan