Sound and Precise Analysis of Parallel Programs through Schedule Specialization Jingyue Wu, Yang Tang, Gang Hu, Heming Cui, Junfeng Yang Columbia University.

Motivation
Analyzing parallel programs is difficult.
[Figure: axes are soundness (# of analyzed schedules / # of total schedules) and precision; labels: Total Schedules, Analyzed Schedules, Static Analysis, Dynamic Analysis, and an open region marked "?".]

Schedule Specialization
- Precision: Analyze the program over a small set of schedules.
- Soundness: Enforce these schedules at runtime.
[Figure: axes are soundness (# of analyzed schedules / # of total schedules) and precision; labels: Total Schedules, Analyzed Schedules, Enforced Schedules, Static Analysis, Dynamic Analysis, Schedule Specialization.]

Enforcing Schedules Using Peregrine
Deterministic multithreading
- e.g. DMP (ASPLOS ’09), Kendo (ASPLOS ’09), CoreDet (ASPLOS ’10), Tern (OSDI ’10), Peregrine (SOSP ’11), DTHREADS (SOSP ’11)
- Performance overhead, e.g. Kendo: 16%, Tern & Peregrine: 39.1%
Peregrine
- Record schedules, and reuse them on a wide range of inputs.
- Represent schedules explicitly.
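To make "enforcing a schedule" concrete, here is a minimal sketch of turn-based enforcement of a total order of synchronization points. It is not Peregrine's mechanism: `schedule`, `sync_point`, `two_syncs`, and `run_enforced` are hypothetical names, and the four-entry schedule is made up for illustration. Each thread blocks at a synchronization point until the schedule names it as next.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical names throughout; an illustration, not Peregrine's implementation. */
#define SCHED_LEN 4

static const int schedule[SCHED_LEN] = {1, 0, 0, 1}; /* thread ids in enforced order */
static int next_slot = 0;             /* index of the next schedule entry to run */
static int observed[SCHED_LEN];       /* thread id recorded as each slot is taken */
static pthread_mutex_t turn_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t turn_changed = PTHREAD_COND_INITIALIZER;

/* Block until the schedule says it is thread `tid`'s turn, then consume the slot. */
static void sync_point(int tid) {
    assert(tid == 0 || tid == 1);
    pthread_mutex_lock(&turn_lock);
    while (next_slot < SCHED_LEN && schedule[next_slot] != tid)
        pthread_cond_wait(&turn_changed, &turn_lock);
    if (next_slot < SCHED_LEN)
        observed[next_slot++] = tid;
    pthread_cond_broadcast(&turn_changed);  /* wake whoever owns the next slot */
    pthread_mutex_unlock(&turn_lock);
}

/* Each thread passes through two synchronization points. */
static void *two_syncs(void *arg) {
    int tid = *(int *)arg;
    sync_point(tid);
    sync_point(tid);
    return 0;
}

/* Runs two threads; returns 1 if every slot was consumed in schedule order. */
int run_enforced(void) {
    pthread_t t[2];
    int ids[2] = {0, 1};
    next_slot = 0;
    for (int i = 0; i < 2; ++i)
        pthread_create(&t[i], 0, two_syncs, &ids[i]);
    for (int i = 0; i < 2; ++i)
        pthread_join(t[i], 0);
    for (int i = 0; i < SCHED_LEN; ++i)
        if (observed[i] != schedule[i])
            return 0;
    return next_slot == SCHED_LEN;
}
```

Because a slot is only consumed by the thread the schedule names, thread 0 cannot pass its first synchronization point until thread 1 has passed one, however the OS happens to interleave them.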

Framework
Extract control flow and data flow enforced by a set of schedules.
[Figure: Schedule Specialization takes a program (C/C++ program with Pthread) and a schedule (total order of synchronizations), and produces a specialized program and extra def-use chains.]

Outline
- Example
- Control-Flow Specialization
- Data-Flow Specialization
- Results
- Conclusion

Running Example

    int results[p_max];
    int global_id = 0;

    int main(int argc, char *argv[]) {
        int i;
        int p = atoi(argv[1]);
        for (i = 0; i < p; ++i)
            pthread_create(&child[i], 0, worker, 0);
        for (i = 0; i < p; ++i)
            pthread_join(child[i], 0);
        return 0;
    }

    void *worker(void *arg) {
        pthread_mutex_lock(&global_id_lock);
        int my_id = global_id++;
        pthread_mutex_unlock(&global_id_lock);
        results[my_id] = compute(my_id);
        return 0;
    }

[Figure: Thread 0 creates and joins Thread 1 and Thread 2; each of those threads runs lock, then unlock.]
Race-free?
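The slide elides several declarations. The following sketch fills them in so the example compiles, under stated assumptions: the bound `P_MAX`, the `child` array, the `global_id_lock` initializer, and the body of `compute` are all guesses made for illustration, and `run_workers` is a hypothetical wrapper that replaces `main` so the thread count can be passed in directly.

```c
#include <assert.h>
#include <pthread.h>

#define P_MAX 8                       /* assumed bound; the slide only names p_max */

static int results[P_MAX];
static int global_id = 0;
static pthread_t child[P_MAX];        /* assumed declaration, elided on the slide */
static pthread_mutex_t global_id_lock = PTHREAD_MUTEX_INITIALIZER; /* assumed */

/* Hypothetical stand-in for the slide's compute(). */
static int compute(int id) { return 10 * (id + 1); }

static void *worker(void *arg) {
    (void)arg;
    pthread_mutex_lock(&global_id_lock);
    int my_id = global_id++;          /* read and bump under the lock: unique per thread */
    pthread_mutex_unlock(&global_id_lock);
    results[my_id] = compute(my_id);  /* unsynchronized, but each thread writes its own slot */
    return 0;
}

/* Same create/join loops as the slide's main(), with p passed in directly. */
int run_workers(int p) {
    assert(p >= 0 && p <= P_MAX);
    global_id = 0;
    for (int i = 0; i < p; ++i)
        pthread_create(&child[i], 0, worker, 0);
    for (int i = 0; i < p; ++i)
        pthread_join(child[i], 0);
    return global_id;                 /* number of ids handed out */
}
```

The writes to `results` are race-free only because every `my_id` is distinct, which is a fact about integer values that a static race detector has to work out for itself.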

Control-Flow Specialization
The slides step through main() from the running example.
[Figure: control-flow graph of main() with nodes atoi, i = 0, i < p, create, ++i for the first loop and i = 0, i < p, join, ++i, return for the second. Following the schedule unrolls the graph into a single path: atoi, i = 0, i < p, create, ++i, i < p, create, ++i, i < p, i = 0, i < p, join, ++i, i < p, join, ++i, i < p, return.]

Control-Flow Specialized Program

    int main(int argc, char *argv[]) {
        int i;
        int p = atoi(argv[1]);
        i = 0;  // i < p == true
        pthread_create(&child[i], 0, worker.clone1, 0);
        ++i;    // i < p == true
        pthread_create(&child[i], 0, worker.clone2, 0);
        ++i;    // i < p == false
        i = 0;  // i < p == true
        pthread_join(child[i], 0);
        ++i;    // i < p == true
        pthread_join(child[i], 0);
        ++i;    // i < p == false
        return 0;
    }

[Figure: the unrolled path atoi, i = 0, i < p, create, ++i, i < p, create, ++i, i < p, i = 0, i < p, join, ++i, i < p, join, ++i, i < p, return.]
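The branch outcomes in the comments above follow directly from the synchronizations that main() performs in the schedule. As a toy illustration for this one program shape only (not the framework's algorithm), the hypothetical helper `branch_outcomes` below derives the outcome of each `i < p` test from thread 0's sequence of create and join operations.

```c
#include <assert.h>
#include <stddef.h>

enum sync_op { OP_CREATE, OP_JOIN };  /* hypothetical encoding of thread 0's sync ops */

/*
 * Writes the outcome of every `i < p` test ('T' or 'F') for the running
 * example's two loops into out as a NUL-terminated string.
 * Returns the number of outcomes, or -1 if the ops do not fit main()'s
 * shape (creates first, then an equal number of joins) or out is too small.
 */
int branch_outcomes(const enum sync_op *ops, size_t n, char *out, size_t cap) {
    assert(ops != NULL || n == 0);
    assert(out != NULL);
    size_t creates = 0, joins = 0;
    for (size_t j = 0; j < n; ++j) {
        if (ops[j] == OP_CREATE) {
            if (joins > 0)
                return -1;            /* main() never creates after a join */
            ++creates;
        } else {
            ++joins;
        }
    }
    if (creates != joins)
        return -1;                    /* both loops iterate p times */
    size_t need = 2 * creates + 2;    /* p true tests plus one false test, per loop */
    if (cap < need + 1)
        return -1;
    size_t k = 0;
    for (int loop = 0; loop < 2; ++loop) {
        for (size_t j = 0; j < creates; ++j)
            out[k++] = 'T';           /* the loop body ran, so the test was true */
        out[k++] = 'F';               /* the test that exits the loop */
    }
    out[k] = '\0';
    return (int)need;
}
```

For a schedule with two creates followed by two joins, this yields true, true, false, true, true, false, matching the comments in the specialized main().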

More Challenges on Control-Flow Specialization
- Ambiguity
- A schedule has too many synchronizations
[Figure: a caller and a callee connected by call and ret edges, with synchronizations S1 and S2.]

Data-Flow Specialization

    int global_id = 0;

    void *worker.clone1(void *arg) {
        pthread_mutex_lock(&global_id_lock);
        int my_id = global_id++;
        pthread_mutex_unlock(&global_id_lock);
        results[my_id] = compute(my_id);
        return 0;
    }

    void *worker.clone2(void *arg) {
        pthread_mutex_lock(&global_id_lock);
        int my_id = global_id++;
        pthread_mutex_unlock(&global_id_lock);
        results[my_id] = compute(my_id);
        return 0;
    }

[Figure: Thread 0 creates and joins Thread 1 and Thread 2; each of those threads runs lock, then unlock. Starting from global_id = 0, the slides resolve the annotations one step at a time along the schedule: the first critical section's "my_id = global_id; global_id++" becomes "my_id = 0; global_id = 1", and the second's becomes "my_id = 1; global_id = 2".]

Data-Flow Specialization

    int global_id = 0;

    void *worker.clone1(void *arg) {
        pthread_mutex_lock(&global_id_lock);
        global_id = 1;
        pthread_mutex_unlock(&global_id_lock);
        results[0] = compute(0);
        return 0;
    }

    void *worker.clone2(void *arg) {
        pthread_mutex_lock(&global_id_lock);
        global_id = 2;
        pthread_mutex_unlock(&global_id_lock);
        results[1] = compute(1);
        return 0;
    }

[Figure: the same thread diagram, annotated global_id = 0; then my_id = 0, global_id = 1; then my_id = 1, global_id = 2.]
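The constants in the specialized workers come from walking the critical sections in the order the schedule fixes. A minimal sketch of that walk, assuming the schedule is given as the order in which threads acquire the lock; `propagate_ids` and `MAX_THREADS` are hypothetical names, and a real analysis would operate on def-use chains rather than on this hard-coded update.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_THREADS 8                 /* hypothetical bound on thread ids */

/*
 * lock_order[k] is the id of the thread that enters the critical section
 * k-th under the enforced schedule. Every critical section executes
 * `my_id = global_id++`, so visiting them in order, starting from
 * global_id = 0, gives each thread a constant my_id.
 * Returns the final value of global_id, or -1 on an out-of-range thread id.
 */
int propagate_ids(const int *lock_order, int n, int my_id[MAX_THREADS]) {
    assert(lock_order != NULL && my_id != NULL);
    int global_id = 0;                /* initial value from the declaration */
    for (int k = 0; k < n; ++k) {
        int tid = lock_order[k];
        if (tid < 0 || tid >= MAX_THREADS)
            return -1;
        my_id[tid] = global_id;       /* the value this thread reads */
        global_id = global_id + 1;    /* the value it writes back */
    }
    return global_id;
}
```

With the lock acquired by thread 1 and then thread 2, thread 1 gets `my_id = 0`, thread 2 gets `my_id = 1`, and `global_id` ends at 2, as on the slide. Without a fixed order, neither value is a constant.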

More Challenges on Data-Flow Specialization
- Must/May alias analysis
  - global_id
- Reasoning about integers
  - results[0] = compute(0)
  - results[1] = compute(1)
- Many def-use chains
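The "reasoning about integers" item is what lets a race detector separate `results[0]` from `results[1]`. The sketch below shows only that index reasoning: it ignores locks and happens-before ordering entirely, and `struct access`, `UNKNOWN_INDEX`, and `may_race` are hypothetical names rather than part of the tool described in the talk.

```c
#include <assert.h>

#define UNKNOWN_INDEX (-1)            /* index the analysis could not resolve */

struct access {
    int thread;                       /* id of the accessing thread */
    int index;                        /* element of results[], or UNKNOWN_INDEX */
    int is_write;                     /* nonzero for a store */
};

/* Conservative check: report a possible race unless the two accesses
 * provably cannot conflict. */
int may_race(struct access a, struct access b) {
    assert(a.index >= UNKNOWN_INDEX && b.index >= UNKNOWN_INDEX);
    if (a.thread == b.thread)
        return 0;                     /* same thread: ordered by program order */
    if (!a.is_write && !b.is_write)
        return 0;                     /* two reads never race */
    if (a.index == UNKNOWN_INDEX || b.index == UNKNOWN_INDEX)
        return 1;                     /* must assume the indices may be equal */
    return a.index == b.index;        /* known constants: compare them */
}
```

Before specialization both stores are `results[my_id]` with an unknown `my_id`, so the check reports a possible race. Once the indices are the constants 0 and 1, the same check rules it out.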

Evaluation
Applications
- Static race detector
- Alias analyzer
- Path slicer
Programs
- PBZip2
- aget
- 8 programs in SPLASH2
- 7 programs in PARSEC

Static Race Detector: # of False Positives

Program          Original  Specialized
aget             72        0
PBZip2           125       0
fft              96        0
blackscholes     3         0
swaptions        165       0
streamcluster    4         0
canneal          21        0
bodytrack        4         0
ferret           6         0
raytrace         215       0
cholesky         31        7
radix            53        14
water-spatial
lu-contig        18
barnes
water-nsquared
ocean

Static Race Detector: Harmful Races Detected
- 4 in aget
- 2 in radix
- 1 in fft

Precision of Schedule-Aware Alias Analysis

Conclusion and Future Work
Designed and implemented schedule specialization framework
- Analyzes the program over a small set of schedules
- Enforces these schedules at runtime
Built and evaluated three applications
- Easy to use
- Precise
Future work
- More applications
- Similar specialization ideas on sequential programs

Related Work
Program analysis for parallel programs
- Chord (PLDI ’06), RADAR (PLDI ’08), FastTrack (PLDI ’09)
Slicing
- Horgon (PLDI ’90), Bouncer (SOSP ’07), Jhala (PLDI ’05), Weiser (PhD thesis), Zhang (PLDI ’04)
Deterministic multithreading
- DMP (ASPLOS ’09), Kendo (ASPLOS ’09), CoreDet (ASPLOS ’10), Tern (OSDI ’10), Peregrine (SOSP ’11), DTHREADS (SOSP ’11)
Program specialization
- Consel (POPL ’93), Gluck (ISPL ’95), Jørgensen (POPL ’92), Nirkhe (POPL ’92), Reps (PDSPE ’96)

Backup Slides

Specialization Time

Handling Races
- We do not assume data-race freedom.
- We could if our only goal is optimization.

Input Coverage
- Use runtime verification for the inputs not covered.
- A small set of schedules can cover a wide range of inputs.
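One way to picture the fallback is a guard that decides, per input, whether a recorded schedule applies. The sketch assumes a deliberately simple precondition for the running example (the thread count `p` must equal the number of creates in the recorded schedule). Real preconditions are richer, and `choose_mode` and both enum values are hypothetical names.

```c
#include <assert.h>

enum exec_mode {
    MODE_ENFORCED_SCHEDULE,           /* input covered: run under the analyzed schedule */
    MODE_RUNTIME_VERIFICATION         /* input not covered: fall back to runtime checks */
};

/* Assumed precondition: the recorded schedule contains `recorded_creates`
 * pthread_create calls, so it only fits runs where p has that value. */
enum exec_mode choose_mode(int p, int recorded_creates) {
    assert(recorded_creates >= 0);
    return p == recorded_creates ? MODE_ENFORCED_SCHEDULE
                                 : MODE_RUNTIME_VERIFICATION;
}
```

Every other part of the input (for example, the data the workers process) is unconstrained by this guard, which is how one schedule can serve many inputs.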
