AUTOMATICALLY TUNING PARALLEL AND PARALLELIZED PROGRAMS Chirag Dave and Rudolf Eigenmann Purdue University.

1 AUTOMATICALLY TUNING PARALLEL AND PARALLELIZED PROGRAMS Chirag Dave and Rudolf Eigenmann Purdue University

2 GOALS
Automatic parallelization without loss of performance
– Use automatic detection of parallelism
– Parallelization is overzealous
– Remove overhead-inducing parallelism
– Ensure no performance loss over original program
Generic tuning framework
– Empirical approach
– Use program execution to measure benefits
– Offline tuning

3 AUTO VS. MANUAL PARALLELIZATION
Source Program -> Hand parallelized -> Parallel Program: significant development time
Source Program -> Parallelizing Compiler -> Parallel Program: state-of-the-art auto-parallelization on the order of minutes
User tunes the program for performance

4 AUTO-PARALLELISM OVERHEAD
int foo() {
  #pragma omp private(i,j,t)
  for (i=0; i<10; i++) {
    a[i] = c;
    #pragma omp private(j,t)
    #pragma omp parallel
    for (j=0; j<10; j++) {
      t = a[i-1];
      b[j] = (t*b[j])/2.0;
    }
  }
}
Loop level parallelism: fork -> parallel section -> join
– Fork/Join overheads
– Load balancing
– Work in parallel section
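To make the overhead on this slide concrete, here is a minimal compilable sketch of the same loop nest. It is not the authors' code: the function names, the array size `N`, and the start of the outer loop at `i = 1` (to avoid reading `a[-1]`) are assumptions made for illustration, and the two inner pragmas are replaced by a single standard OpenMP `parallel for`. The outer loop stays serial because each iteration reads `a[i-1]` written by the previous one, so the inner loop forks and joins a thread team on every outer iteration while doing only ten iterations of work.

```c
#include <assert.h>
#include <stddef.h>

#define N 10 /* assumed array size, matching the loop bounds on the slide */

/* Serial reference version of the loop nest. */
void scale_serial(double *a, double *b, double c)
{
    assert(a != NULL && b != NULL);
    for (int i = 1; i < N; i++) {        /* starts at 1: assumption, avoids a[-1] */
        a[i] = c;
        for (int j = 0; j < N; j++) {
            double t = a[i - 1];
            b[j] = (t * b[j]) / 2.0;
        }
    }
}

/* Same computation with only the inner loop parallelized.
   With OpenMP enabled, every outer iteration pays one fork and one join. */
void scale_inner_parallel(double *a, double *b, double c)
{
    assert(a != NULL && b != NULL);
    for (int i = 1; i < N; i++) {
        a[i] = c;
#ifdef _OPENMP
        #pragma omp parallel for
#endif
        for (int j = 0; j < N; j++) {
            double t = a[i - 1];         /* declared in the loop body, so private */
            b[j] = (t * b[j]) / 2.0;
        }
    }
}
```

Both functions compute the same result; whether the second one is faster depends on how the fork/join cost compares with ten iterations of work, which is the question the tuner answers by measurement.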

5 NEED FOR AUTOMATIC TUNING
Identify, at compile time, the optimization strategy for maximum performance
Beneficial parallelism
– Which loops to parallelize
– Parallel loop coverage

6 OUR APPROACH
Best combination of loops to parallelize
Offline tuning
Decisions based on actual execution time

7 CETUS: VERSION GENERATION
Cetus Version Generator
– Symbolic Data Dependence Analysis
– Induction Variable Substitution
– Scalar and Array Privatization
– Reduction Recognition

8 SEARCH SPACE NAVIGATION
Search Space -> The set of parallelizable loops
Generic Tuning Algorithm
– Capture Interaction
– Use program execution time as decision metric
COMBINED ELIMINATION
– Each loop is an on/off optimization
– Selective parallelization
Pan, Z., Eigenmann, R.: Fast and effective orchestration of compiler optimizations for automatic performance tuning. In: The 4th Annual International Symposium on Code Generation and Optimization (CGO). (March 2006) 319–330

9 TUNING ALGORITHM
BATCH ELIMINATION
– Considers separately the effects of each optimization
– Instant elimination
ITERATIVE ELIMINATION
– Considers interactions
– More tuning time
New Base Case
COMBINED ELIMINATION
– Considers interactions amongst a subset
– Iterates over the smaller subset and performs batch elimination
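The combined strategy can be sketched as follows. This is an illustrative reading of Combined Elimination as described in the cited Pan and Eigenmann paper, not the tuner's actual source. The relative improvement measure is taken as (T_off - T_base) / T_base x 100, and all names (`measure_fn`, `combined_elimination`, `toy_model`, `toy_measure`) are hypothetical. Bit k of a configuration means "loop k is parallelized". The toy cost model exists only so the sketch can run without a real compiler and benchmark.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback: run time of the version described by `config`. */
typedef double (*measure_fn)(uint32_t config, void *ctx);

/* Relative improvement in percent; negative means switching off helped. */
static double rip(double t_base, double t_off)
{
    return (t_off - t_base) / t_base * 100.0;
}

uint32_t combined_elimination(size_t n_loops, measure_fn measure, void *ctx)
{
    assert(n_loops > 0 && n_loops <= 32 && measure != NULL);
    uint32_t base = (n_loops == 32) ? UINT32_MAX
                                    : ((UINT32_C(1) << n_loops) - 1);
    double t_base = measure(base, ctx);

    for (;;) {
        double rips[32];
        size_t order[32], n_neg = 0;

        /* Measure each remaining loop switched off on its own. */
        for (size_t k = 0; k < n_loops; k++) {
            uint32_t bit = UINT32_C(1) << k;
            if (!(base & bit)) continue;
            double r = rip(t_base, measure(base & ~bit, ctx));
            if (r < 0.0) { rips[k] = r; order[n_neg++] = k; }
        }
        if (n_neg == 0) return base;      /* nothing left that hurts */

        /* Insertion sort: most harmful loop (most negative RIP) first. */
        for (size_t i = 1; i < n_neg; i++) {
            size_t k = order[i], j = i;
            while (j > 0 && rips[order[j - 1]] > rips[k]) {
                order[j] = order[j - 1];
                j--;
            }
            order[j] = k;
        }

        /* Switch off the worst loop; this gives the new base case. */
        base &= ~(UINT32_C(1) << order[0]);
        t_base = measure(base, ctx);

        /* Re-check the other candidates against the updated base case. */
        for (size_t i = 1; i < n_neg; i++) {
            uint32_t trial = base & ~(UINT32_C(1) << order[i]);
            double t_trial = measure(trial, ctx);
            if (rip(t_base, t_trial) < 0.0) { base = trial; t_base = t_trial; }
        }
    }
}

/* Toy additive cost model (assumption, for demonstration only). */
typedef struct {
    double base_time;     /* time with every loop serialized */
    const double *delta;  /* delta[k]: time added by parallelizing loop k */
} toy_model;

double toy_measure(uint32_t config, void *ctx)
{
    const toy_model *m = ctx;
    double t = m->base_time;
    for (size_t k = 0; k < 32; k++)
        if (config & (UINT32_C(1) << k)) t += m->delta[k];
    return t;
}
```

With four loops whose deltas are -20, +5, -10 and +3, the search switches off loops 1 and 3 and returns the configuration with only loops 0 and 2 parallelized. In a real setup `measure` would compile and run the program version, which is why the number of measurements matters.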

10 CETUNE INTERFACE
int foo() {
  #pragma cetus parallel…
  for (i=0; i<50; i++) {
    t = a[i];
    a[i+50] = t + (a[i+50] + b[i])/2.0;
  }
  for (i=0; i<10; i++) {
    a[i] = c;
    #pragma cetus parallel…
    for (j=0; j<10; j++) {
      t = a[i-1];
      b[j] = (t*b[j])/2.0;
    }
  }
}
cetus -ompGen -tune-ompGen=1,1   Parallelize both loops
cetus -ompGen -tune-ompGen=1,0   Parallelize one and serialize the other
cetus -ompGen -tune-ompGen=0,1   Parallelize one and serialize the other
cetus -ompGen -tune-ompGen=0,0   Serialize both loops
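A tuner driving this interface has to turn each point in the search space into the option string shown on the slide. The helper below is a hypothetical sketch of that step. The name `format_tune_flag` is invented here, and it assumes that the dashes on the slide are plain ASCII hyphens and that the comma-separated values follow the order of the annotated loops in the source.

```c
#include <assert.h>
#include <string.h>

/* Writes e.g. "-tune-ompGen=1,0" into `out` for the on/off vector `on`.
   Returns 0 on success, -1 if `out` is too small. */
int format_tune_flag(const int *on, size_t n, char *out, size_t out_size)
{
    static const char prefix[] = "-tune-ompGen=";
    assert(on != NULL && out != NULL);

    size_t pos = strlen(prefix);
    if (pos >= out_size) return -1;
    memcpy(out, prefix, pos);

    for (size_t k = 0; k < n; k++) {
        /* One digit per loop, plus a comma before every loop but the first;
           always leave room for the terminating NUL. */
        size_t need = (k > 0) ? 2 : 1;
        if (pos + need >= out_size) return -1;
        if (k > 0) out[pos++] = ',';
        out[pos++] = on[k] ? '1' : '0';
    }
    out[pos] = '\0';
    return 0;
}
```

For the two-loop example, the vector `{1, 0}` produces `-tune-ompGen=1,0`, which would parallelize the first annotated loop and serialize the second under the ordering assumption above.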

11 EMPIRICAL MEASUREMENT
Input source code (train data set)
-> Automatic parallelization using Cetus
-> Start configuration
-> Version generation using tuner input
-> Back end code generation (ICC)
-> Runtime performance measurement (Intel Xeon Dual Quad-core, train data set)
-> Decision based on RIP
-> Next point in the search space, or Final configuration
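The runtime measurement step could look roughly like the sketch below. It is an assumption about how such a harness might be written, not a description of the tuner used in the talk. All names are hypothetical, and taking the best of several runs is one common way to reduce noise. Wall-clock time is used rather than `clock()`, because CPU time adds up across threads and would hide the benefit of a parallel loop.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical signature for "run one program version once". */
typedef void (*workload_fn)(void *arg);

/* Wall-clock time in seconds, using C11 timespec_get. */
static double wall_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Runs the workload `repeats` times and returns the shortest wall time. */
double best_wall_time(workload_fn run, void *arg, int repeats)
{
    assert(run != NULL && repeats > 0);
    double best = -1.0;
    for (int r = 0; r < repeats; r++) {
        double start = wall_seconds();
        run(arg);
        double elapsed = wall_seconds() - start;
        if (elapsed < 0.0) elapsed = 0.0;  /* system clock was adjusted */
        if (best < 0.0 || elapsed < best) best = elapsed;
    }
    return best;
}

/* Trivial stand-in workload that only counts how often it was called. */
void count_call(void *arg)
{
    ++*(int *)arg;
}
```

In an offline tuner, `run` would execute the compiled version on the train data set, and the returned time would feed the decision about the next point in the search space.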

12 RESULTS

15 CONTRIBUTIONS
– Described a compiler + empirical system that detects parallel loops in serial and parallel programs and selects the combination of parallel loops that gives the highest performance
– Finding profitable parallelism can be done using a generic tuning method
– The method can be applied on a section-by-section basis, allowing fine-grained tuning of program sections
– Using a set of NAS and SPEC OMP2001 benchmarks, we show that the auto-parallelized and tuned version comes close to or improves on the performance of the original serial or parallel program

16 THANK YOU!

