Sound and Precise Analysis of Parallel Programs through Schedule Specialization Jingyue Wu, Yang Tang, Gang Hu, Heming Cui, Junfeng Yang Columbia University 1
Motivation 2 soundness (# of analyzed schedules / # of total schedules) precision Total Schedules Analyzed Schedules Static Analysis Static Analysis Dynamic Analysis Dynamic Analysis Analyzed Schedules ? ? Analyzing parallel programs is difficult.
Precision: Analyze the program over a small set of schedules. Soundness: Enforce these schedules at runtime. Schedule Specialization 3 soundness (# of analyzed schedules / # of total schedules) precision Total Schedules Static Analysis Static Analysis Dynamic Analysis Dynamic Analysis Analyzed Schedules Enforced Schedules Schedule Specializatio n Schedule Specializatio n
Enforcing Schedules Using Peregrine Deterministic multithreading – e.g. DMP (ASPLOS ’09), Kendo (ASPLOS ’09), CoreDet (ASPLOS ’10), Tern (OSDI ’10), Peregrine (SOSP ’11), DTHREADS (SOSP ’11) – Performance overhead e.g. Kendo: 16%, Tern & Peregrine: 39.1% Peregrine – Record schedules, and reuse them on a wide range of inputs. – Represent schedules explicitly. 4
Precision: Analyze the program over a small set of schedules. Soundness: Enforce these schedules at runtime. Schedule Specialization 5 soundness (# of analyzed schedules / # of total schedules) precision Static Analysis Static Analysis Dynamic Analysis Dynamic Analysis Analyzed Schedules Enforced Schedules Schedule Specializatio n Schedule Specializatio n
Framework Extract control flow and data flow enforced by a set of schedules 6 Schedule Specialization Schedule Specialization Program C/C++ program with Pthread Total order of synchronizations Specialized Program Specialized Program Extra def-use chains Extra def-use chains
Outline Example Control-Flow Specialization Data-Flow Specialization Results Conclusion 7
Running Example int results[p_max]; int global_id = 0; int main(int argc, char *argv[]) { int i; int p = atoi(argv[1]); for (i = 0; i < p; ++i) pthread_create(&child[i], 0, worker, 0); for (i = 0; i < p; ++i) pthread_join(child[i], 0); return 0; } void *worker(void *arg) { pthread_mutex_lock(&global_id_lock); int my_id = global_id++; pthread_mutex_unlock(&global_id_lock); results[my_id] = compute(my_id); return 0; } 8 Thread 0 Thread 1 Thread 2 create join lock unlock lock unlock Race-free?
Control-Flow Specialization int main(int argc, char *argv[]) { int i; int p = atoi(argv[1]); for (i = 0; i < p; ++i) pthread_create(&child[i], 0, worker, 0); for (i = 0; i < p; ++i) pthread_join(child[i], 0); return 0; } 9 create join atoi ++i create return i = 0 i < p ++i join i < p i = 0
Control-Flow Specialization int main(int argc, char *argv[]) { int i; int p = atoi(argv[1]); for (i = 0; i < p; ++i) pthread_create(&child[i], 0, worker, 0); for (i = 0; i < p; ++i) pthread_join(child[i], 0); return 0; } 10 create join atoi ++i create return i = 0 i < p ++i join i < p i = 0 atoi create i = 0 i < p
Control-Flow Specialization int main(int argc, char *argv[]) { int i; int p = atoi(argv[1]); for (i = 0; i < p; ++i) pthread_create(&child[i], 0, worker, 0); for (i = 0; i < p; ++i) pthread_join(child[i], 0); return 0; } 11 create join atoi ++i create return i = 0 i < p ++i join i < p i = 0 create atoi i = 0 i < p create ++i create i < p
Control-Flow Specialization int main(int argc, char *argv[]) { int i; int p = atoi(argv[1]); for (i = 0; i < p; ++i) pthread_create(&child[i], 0, worker, 0); for (i = 0; i < p; ++i) pthread_join(child[i], 0); return 0; } 12 create join atoi ++i create return i = 0 i < p ++i join i < p i = 0 atoi create i = 0 i < p ++i create i < p ++i i < p join i < p i = 0 ++i join i < p ++i i < p return
Control-Flow Specialized Program 13 int main(int argc, char *argv[]) { int i; int p = atoi(argv[1]); i = 0; // i < p == true pthread_create(&child[i], 0, worker.clone1, 0); ++i; // i < p == true pthread_create(&child[i], 0, worker.clone2, 0); ++i; // i < p == false i = 0; // i < p == true pthread_join(child[i], 0); ++i; // i < p == true pthread_join(child[i], 0); ++i; // i < p == false return 0; } atoi create i = 0 i < p ++i create i < p ++i i < p join i < p i = 0 ++i join i < p ++i i < p return
More Challenges on Control-Flow Specialization Ambiguity 14 call CallerCallee call S1 A schedule has too many synchronizations ret S2
Data-Flow Specialization int global_id = 0; void *worker.clone1(void *arg) { pthread_mutex_lock(&global_id_lock); int my_id = global_id++; pthread_mutex_unlock(&global_id_lock); results[my_id] = compute(my_id); return 0; } void *worker.clone2(void *arg) { pthread_mutex_lock(&global_id_lock); int my_id = global_id++; pthread_mutex_unlock(&global_id_lock); results[my_id] = compute(my_id); return 0; } 15 Thread 0 Thread 1 Thread 2 create join lock unlock lock unlock global_id = 0 my_id = global_id global_id++ my_id = global_id global_id++ my_id = global_id global_id++ my_id = global_id global_id++
Data-Flow Specialization int global_id = 0; void *worker.clone1(void *arg) { pthread_mutex_lock(&global_id_lock); int my_id = global_id++; pthread_mutex_unlock(&global_id_lock); results[my_id] = compute(my_id); return 0; } void *worker.clone2(void *arg) { pthread_mutex_lock(&global_id_lock); int my_id = global_id++; pthread_mutex_unlock(&global_id_lock); results[my_id] = compute(my_id); return 0; } 16 Thread 0 Thread 1 Thread 2 create join lock unlock lock unlock global_id = 0 my_id = global_id global_id++ my_id = global_id global_id++ my_id = global_id global_id++ my_id = global_id global_id++
Data-Flow Specialization int global_id = 0; void *worker.clone1(void *arg) { pthread_mutex_lock(&global_id_lock); int my_id = global_id++; pthread_mutex_unlock(&global_id_lock); results[my_id] = compute(my_id); return 0; } void *worker.clone2(void *arg) { pthread_mutex_lock(&global_id_lock); int my_id = global_id++; pthread_mutex_unlock(&global_id_lock); results[my_id] = compute(my_id); return 0; } 17 Thread 0 Thread 1 Thread 2 create join lock unlock lock unlock global_id = 0 my_id = 0 global_id = 1 my_id = 0 global_id = 1 my_id = global_id global_id++ my_id = global_id global_id++
Data-Flow Specialization int global_id = 0; void *worker.clone1(void *arg) { pthread_mutex_lock(&global_id_lock); int my_id = global_id++; pthread_mutex_unlock(&global_id_lock); results[my_id] = compute(my_id); return 0; } void *worker.clone2(void *arg) { pthread_mutex_lock(&global_id_lock); int my_id = global_id++; pthread_mutex_unlock(&global_id_lock); results[my_id] = compute(my_id); return 0; } 18 Thread 0 Thread 1 Thread 2 create join lock unlock lock unlock global_id = 0 my_id = 0 global_id = 1 my_id = 0 global_id = 1 my_id = 1 global_id = 2 my_id = 1 global_id = 2
Data-Flow Specialization int global_id = 0; void *worker.clone1(void *arg) { pthread_mutex_lock(&global_id_lock); global_id = 1; pthread_mutex_unlock(&global_id_lock); results[0] = compute(0); return 0; } void *worker.clone2(void *arg) { pthread_mutex_lock(&global_id_lock); global_id = 2; pthread_mutex_unlock(&global_id_lock); results[1] = compute(1); return 0; } 19 Thread 0 Thread 1 Thread 2 create join lock unlock lock unlock global_id = 0 my_id = 0 global_id = 1 my_id = 0 global_id = 1 my_id = 1 global_id = 2 my_id = 1 global_id = 2
More Challenges on Data-Flow Specialization Must/May alias analysis –global_id Reasoning about integers –results[0] = compute(0) –results[1] = compute(1) Many def-use chains 20
Evaluation Applications – Static race detector – Alias analyzer – Path slicer Programs – PBZip – aget – 8 programs in SPLASH2 – 7 programs in PARSEC 21
22 ProgramOriginalSpecialized aget720 PBZip21250 fft960 blackscholes30 swaptions1650 streamcluster40 canneal210 bodytrack40 ferret60 raytrace2150 cholesky317 radix5314 water-spatial lu-contig18 barnes water-nsquared ocean Static Race Detector # of False Positives
23 ProgramOriginalSpecialized aget720 PBZip21250 fft960 blackscholes30 swaptions1650 streamcluster40 canneal210 bodytrack40 ferret60 raytrace2150 cholesky317 radix5314 water-spatial lu-contig18 barnes water-nsquared ocean Static Race Detector # of False Positives
24 ProgramOriginalSpecialized aget720 PBZip21250 fft960 blackscholes30 swaptions1650 streamcluster40 canneal210 bodytrack40 ferret60 raytrace2150 cholesky317 radix5314 water-spatial lu-contig18 barnes water-nsquared ocean Static Race Detector # of False Positives
25 ProgramOriginalSpecialized aget720 PBZip21250 fft960 blackscholes30 swaptions1650 streamcluster40 canneal210 bodytrack40 ferret60 raytrace2150 cholesky317 radix5314 water-spatial lu-contig18 barnes water-nsquared ocean Static Race Detector # of False Positives
Static Race Detector: Harmful Races Detected 4 in aget 2 in radix 1 in fft 26
Precision of Schedule-Aware Alias Analysis 27
Precision of Schedule-Aware Alias Analysis 28
Precision of Schedule-Aware Alias Analysis 29
Conclusion and Future Work Designed and implemented schedule specialization framework – Analyzes the program over a small set of schedules – Enforces these schedules at runtime Built and evaluated three applications – Easy to use – Precise Future work – More applications – Similar specialization ideas on sequential programs 30
Related Work Program analysis for parallel programs – Chord (PLDI ’06), RADAR (PLDI ’08), FastTrack (PLDI ’09) Slicing – Horgon (PLDI ’90), Bouncer (SOSP ’07), Jhala (PLDI ’05), Weiser (PhD thesis), Zhang (PLDI ’04) Deterministic multithreading – DMP (ASPLOS ’09), Kendo (ASPLOS ’09), CoreDet (ASPLOS ’10), Tern (OSDI ’10), Peregrine (SOSP ’11), DTHREADS (SOSP ’11) Program specialization – Consel (POPL ’93), Gluck (ISPL ’95), Jørgensen (POPL ’92), Nirkhe (POPL ’92), Reps (PDSPE ’96) 31
Backup Slides 32
Specialization Time 33
Handling Races We do not assume data-race freedom. We could if our only goal is optimization. 34
Input Coverage Use runtime verification for the inputs not covered A small set of schedules can cover a wide range of inputs 35
36