Heming Cui, Jingyue Wu, John Gallagher, Huayang Guo, Junfeng Yang


PEREGRINE: Efficient Deterministic Multithreading through Schedule Relaxation Heming Cui, Jingyue Wu, John Gallagher, Huayang Guo, Junfeng Yang Software Systems Lab Columbia University

Nondeterminism in Multithreading
Different runs → different behaviors, depending on thread schedules
Complicates a lot of things: understanding, testing, debugging, …

Nondeterministic Synchronization (Apache Bug #21287)
Thread 0: mutex_lock(M); *obj = …; mutex_unlock(M)
Thread 1: mutex_lock(M); free(obj); mutex_unlock(M)
Depending on which thread acquires M first, *obj = … may write to freed memory.

Data Race (FFT in SPLASH2)
Thread 0: …; barrier_wait(B); print(result)
Thread 1: …; barrier_wait(B); result += …
Depending on the schedule, print(result) may run before or after the update to result.
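A minimal C sketch of this kind of schedule-dependent behavior (illustrative code of my own, not from Apache or SPLASH2): two threads contend for the same mutex, and the program's outcome depends on which one the OS schedules into the critical section last.

```c
#include <pthread.h>

static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;
static int last_writer = -1;

/* Each thread writes its id under the lock; the final value of
 * last_writer depends entirely on the thread schedule. */
static void *writer(void *arg) {
    pthread_mutex_lock(&mu);
    last_writer = (int)(long)arg;
    pthread_mutex_unlock(&mu);
    return NULL;
}

/* Returns 0 or 1 depending on which thread ran last; different runs
 * of the same program, on the same input, can return different values. */
int run_once(void) {
    pthread_t t0, t1;
    pthread_create(&t0, NULL, writer, (void *)0L);
    pthread_create(&t1, NULL, writer, (void *)1L);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    return last_writer;
}
```

The only guarantee a plain Pthreads program gives here is that the result is one of the two interleavings, which is exactly the nondeterminism the talk is about.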

Deterministic Multithreading (DMT)
Same input → same schedule
Addresses many problems due to nondeterminism
Existing DMT systems enforce one of two kinds of schedule:
Sync-schedule: a deterministic total order of synchronization operations (e.g., lock()/unlock())
Mem-schedule: a deterministic order of shared memory accesses (e.g., load/store)
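To make the sync-schedule idea concrete, here is a hedged sketch of my own (not Kendo's or TERN's actual mechanism): a deterministic total order of synchronization operations can be enforced by making each thread block until a fixed schedule says the next sync op is its turn.

```c
#include <pthread.h>

#define SCHED_LEN 4
/* A fixed (hypothetical) sync-schedule: which thread may perform
 * the next synchronization operation. */
static const int schedule[SCHED_LEN] = {0, 1, 0, 1};
static int step = 0;
static pthread_mutex_t mu = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
static char order[SCHED_LEN + 1];   /* records the order that actually ran */

/* A "deterministic" sync op: wait until the schedule says it is
 * this thread's turn, then advance the schedule. */
static void det_sync(int tid) {
    pthread_mutex_lock(&mu);
    while (step < SCHED_LEN && schedule[step] != tid)
        pthread_cond_wait(&cv, &mu);
    order[step] = (char)('0' + tid);
    step++;
    pthread_cond_broadcast(&cv);
    pthread_mutex_unlock(&mu);
}

static void *worker(void *arg) {
    int tid = (int)(long)arg;
    det_sync(tid);   /* stands in for a lock()/unlock() pair */
    det_sync(tid);
    return NULL;
}

/* Always produces "0101", no matter how the OS schedules the threads. */
const char *run_deterministic(void) {
    pthread_t t0, t1;
    step = 0;
    pthread_create(&t0, NULL, worker, (void *)0L);
    pthread_create(&t1, NULL, worker, (void *)1L);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    order[SCHED_LEN] = '\0';
    return order;
}
```

The catch, as the next slides explain, is that this only pins down synchronization order; racy loads and stores that bypass the lock remain unordered.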

Sync-schedule [TERN OSDI '10], [Kendo ASPLOS '09], etc.
Pros: efficient (16% overhead in Kendo)
Cons: deterministic only when there are no races, and many programs contain races [Lu ASPLOS '08]
Apache Bug #21287: Thread 1 may free(obj) before Thread 0's *obj = …
FFT in SPLASH2: print(result) may run before Thread 1's result += …

Mem-schedule [COREDET ASPLOS '10], [dOS OSDI '10], etc.
Pros: deterministic despite data races (e.g., the FFT race in SPLASH2)
Cons: high overhead (e.g., 1.2~10.1X slowdown in dOS)

Open Challenge [WODET '11]
Either determinism or efficiency, but not both:

Type of Schedule   Determinism   Efficiency
Sync               ✗             ✓
Mem                ✓             ✗

Can we get both?

Yes, we can!

Type of Schedule   Determinism   Efficiency
Sync               ✗             ✓
Mem                ✓             ✗
PEREGRINE          ✓             ✓

PEREGRINE Insight: races rarely occur
Intuitively: many races have already been detected
Empirically: across six real apps, up to 10 races occurred
Hybrid schedule:
Sync-schedule in the race-free portion (major)
Mem-schedule in the racy portion (minor)

PEREGRINE: Efficient DMT via Schedule Relaxation
Record an execution trace for a new input
Relax the trace into a hybrid schedule
Reuse it on many inputs: deterministic + efficient
Reuse rates are high (e.g., 90.3% for Apache [TERN OSDI '10])
Automatic, using new program analysis techniques
Runs on Linux, in user space
Handles Pthreads synchronization operations
Works with server programs [TERN OSDI '10]

Summary of Results
Evaluated on a diverse set of 18 programs:
4 real applications: Apache, PBZip2, aget, pfscan
13 scientific programs (10 from SPLASH2, 3 from PARSEC)
Racey (a popular stress-testing tool for DMT)
Deterministically resolves all races
Efficient: 54% faster to 49% slower
Stable: frequently reuses schedules for 9 programs
Many benefits: e.g., reuse good schedules [TERN OSDI '10]

Outline PEREGRINE overview An example Evaluation Conclusion

PEREGRINE Overview
[Architecture diagram] The Instrumentor (built on LLVM) instruments the program source. The Recorder records execution traces; the Analyzer relaxes each trace into a <Ci, Si> (precondition, schedule) pair stored in a Schedule Cache. On a new INPUT, a match against the cached preconditions either hits, in which case the Replayer reuses the cached <C, S>, or misses, in which case the Recorder records a new trace. Both Recorder and Replayer sit between the program and the OS.

Outline PEREGRINE overview An example Evaluation Conclusion

An Example

  main(argc, char *argv[]) {
    nthread = atoi(argv[1]);              // Read input.
    size = atoi(argv[2]);
    for (i = 1; i < nthread; ++i)         // Create children threads.
      pthread_create(worker);
    worker();                             // Work.  Missing pthread_join()!
    if ((flag = atoi(argv[3])) == 1)      // If "flag" is 1, update "result".
      result += …;
    printf("%d\n", result);               // Read from "result".
  }

  worker() {
    char *data;
    data = malloc(size/nthread);          // Allocate data with "size/nthread".
    for (i = 0; i < size/nthread; ++i)    // Read data from disk and compute.
      data[i] = myRead(i);
    pthread_mutex_lock(&mutex);           // Grab mutex.
    result += …;                          // Write to "result".
    pthread_mutex_unlock(&mutex);
  }

Instrumentor
(Same example code.) Instrument the command-line arguments and the read() function within myRead().

Instrumentor
(Same example code.) Also instrument each synchronization operation: pthread_create(), pthread_mutex_lock(), pthread_mutex_unlock().

Recorder: $ ./a.out 2 2 0
(Same example code.) Recorded trace:
Thread 0: main() → nthread=atoi() → size=atoi() → (1<nthread)==1 → pthread_create() → (2<nthread)==0 → worker() → data=malloc() → (0<size/nthread)==1 → data[i]=myRead() → (1<size/nthread)==0 → lock() → result+=… → unlock() → (flag==1)==0 → printf(…,result)
Thread 1: worker() → data=malloc() → (0<size/nthread)==1 → data[i]=myRead() → (1<size/nthread)==0 → lock() → result+=… → unlock()

Analyzer: Hybrid Schedule
From the recorded trace, extract the sync-schedule:
Thread 0: pthread_create() → Thread 1: lock(), unlock() → Thread 0: lock(), unlock()
The racy pair (Thread 1's result += … vs. Thread 0's printf(…, result), unordered because of the missing pthread_join()) is covered by an execution-order constraint: Thread 1's result += … must happen before Thread 0's printf(…, result).
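One way to picture the analyzer's output is as plain data. This is a sketch under a hypothetical representation of my own, not PEREGRINE's actual format: the sync-schedule is a total order of sync operations, and each racy pair gets an extra execution-order edge that a replayer must also enforce.

```c
#include <string.h>

/* Hypothetical representation of a hybrid schedule. */
typedef struct { int tid; const char *op; } sync_ev;
typedef struct { const char *before, *after; } race_edge;

/* Sync-schedule from the example: create, then Thread 1's critical
 * section, then Thread 0's. */
static const sync_ev sync_sched[] = {
    {0, "pthread_create"},
    {1, "lock"}, {1, "unlock"},
    {0, "lock"}, {0, "unlock"},
};

/* One execution-order edge for the racy pair: Thread 1's update of
 * result must happen before Thread 0 prints it. */
static const race_edge edges[] = {
    {"T1: result += ...", "T0: printf(result)"},
};

int sync_len(void) {
    return (int)(sizeof(sync_sched) / sizeof(sync_sched[0]));
}

/* Check that an observed event sequence respects every edge: each
 * "before" event appears, and appears earlier than its "after" event. */
int obeys_edges(const char **events, int n) {
    for (unsigned e = 0; e < sizeof(edges) / sizeof(edges[0]); e++) {
        int ib = -1, ia = -1;
        for (int i = 0; i < n; i++) {
            if (strcmp(events[i], edges[e].before) == 0) ib = i;
            if (strcmp(events[i], edges[e].after) == 0) ia = i;
        }
        if (ib < 0 || ia < 0 || ib > ia) return 0;
    }
    return 1;
}
```

A replayer that enforces both the sync order and these edges resolves the race deterministically while leaving everything else (e.g., the data[] loops) free to run in parallel.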

Analyzer: Precondition (./a.out 2 2 0)
Challenges: ensure the schedule is feasible; ensure no new races.
Inputs to this step: the example code, the recorded trace, and the hybrid schedule above.

Naïve Approach to Computing Preconditions
Turn every branch outcome in the trace into a constraint on the input:
(1<nthread)==1 and (2<nthread)==0 → nthread==2
(0<size/nthread)==1 and (1<size/nthread)==0 (in both threads) → size==2
(flag==1)==0 → flag!=1

Analyzer: Preconditions (a Naïve Way)
Problem: over-constraining! size must be 2 to reuse the schedule.
This absorbed most of our brain power in this paper!
Solution: two new program analysis techniques; see paper.

Analyzer: Preconditions
The refined preconditions keep only the constraints that matter for the schedule: nthread==2 (from (1<nthread)==1 and (2<nthread)==0) and flag!=1 (from (flag==1)==0); size is left unconstrained.

Replayer: ./a.out 2 1000 3
The new input satisfies the preconditions (nthread==2, flag!=1), so the cached hybrid schedule is reused even though size (1000) differs from the recorded run (2).
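The "Match?" step from the overview can be pictured as evaluating the recorded preconditions against a new input. This is a simplified sketch: real PEREGRINE preconditions are symbolic constraints computed from the trace, and the struct and function names here are hypothetical.

```c
#include <stdlib.h>

typedef struct { int nthread, size, flag; } input_t;

/* Preconditions the analyzer derived for the example: nthread == 2
 * and flag != 1. size is unconstrained, so a schedule recorded with
 * size == 2 can be reused for size == 1000. */
int preconditions_hold(input_t in) {
    return in.nthread == 2 && in.flag != 1;
}

/* Parse "./a.out nthread size flag" style arguments. */
input_t parse_input(const char *a1, const char *a2, const char *a3) {
    input_t in = { atoi(a1), atoi(a2), atoi(a3) };
    return in;
}
```

On a hit the Replayer enforces the cached hybrid schedule; on a miss the Recorder records a fresh trace for the new input.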

Benefits of PEREGRINE
Deterministic: resolves the race on result; no new data races
Efficient: the loops on data[] run in parallel
Stable [TERN OSDI '10]: can reuse the schedule on any data size or contents
Other applications possible; talk to us!

Outline PEREGRINE overview An example Evaluation Conclusion

General Experiment Setup
Program-workload pairs:
Apache: download a 100KB HTML page using ApacheBench
PBZip2: compress a 10MB file
Aget: download linux-3.0.1.tar.bz2 (77MB)
Pfscan: scan for the keyword "return" in 100 files from the gcc project
13 scientific benchmarks (10 from SPLASH2, 3 from PARSEC): run for 1-100 ms
Racey: default workload
Machine: 2.67GHz dual-socket quad-core Intel Xeon (eight cores) with 24GB memory
Concurrency: eight threads for all experiments

Determinism

Program         # Races   Sync-schedule   Hybrid schedule
Apache          ?         ✗               ✓
PBZip2          4         ✗               ✓
barnes          5         ✗               ✓
fft             10        ✗               ✓
lu-non-contig   ?         ✗               ✓
streamcluster   ?         ✗               ✓
racey           167974    ✗               ✓

Overhead in Reusing Schedules

% of Instructions Left in the Trace

Conclusion
Hybrid schedule: combines the best of sync-schedules and mem-schedules
PEREGRINE: schedule relaxation to compute hybrid schedules
Deterministic (makes all 7 racy programs deterministic)
Efficient (54% faster to 49% slower)
Stable (frequently reuses schedules for 9 out of 17 programs)
Has broad applications

Thank you! Questions?