# Programming the Origin2000 with OpenMP: Part II William Magro Kuck & Associates, Inc.


ALLIANCE ’98

## Outline

- A Simple OpenMP Example
- Analysis and Adaptation
- Debugging
- Performance Tuning
- Advanced Topics

## A Simple Example (dotprod.f)

```fortran
      real*8 function ddot(n,x,y)
      integer n
      real*8 x(n), y(n)
      ddot = 0.0
!$omp parallel do private(i)
!$omp& reduction(+:ddot)
      do i=1,n
         ddot = ddot + x(i)*y(i)
      enddo
      return
      end
```

## A Less Simple Example (dotprod2.f)

```fortran
      real*8 function ddot(n,x,y)
      integer n
      real*8 x(n), y(n), ddot1
      ddot = 0.0
!$omp parallel private(ddot1)
      ddot1 = 0.0
!$omp do private(i)
      do i=1,n
         ddot1 = ddot1 + x(i)*y(i)
      enddo
!$omp end do nowait
!$omp atomic
      ddot = ddot + ddot1
!$omp end parallel
      return
      end
```

Each thread accumulates a private partial sum (ddot1), then merges it into the shared result with a single atomic update.
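The same pattern translates directly to C OpenMP pragmas. This is an illustrative sketch, not code from the course: each thread keeps a private partial sum and merges it atomically, and `nowait` drops the barrier after the worksharing loop because each thread proceeds straight to its own merge. Without an OpenMP compiler the pragmas are ignored and the loop simply runs serially.

```c
/* Hand-coded dot-product reduction, mirroring dotprod2.f. */
double ddot(int n, const double *x, const double *y)
{
    double sum = 0.0;
    #pragma omp parallel
    {
        double sum1 = 0.0;              /* private partial sum */
        #pragma omp for nowait
        for (int i = 0; i < n; i++)
            sum1 += x[i] * y[i];
        #pragma omp atomic              /* merge partials, one add at a time */
        sum += sum1;
    }
    return sum;
}
```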

## Analysis and Adaptation

- Thread-safety
- Automatic Parallelization
- Finding Parallel Opportunities
- Classifying Data
- A Different Approach

## Thread-safety

Confirm the code works with `-automatic` in serial:

```
f77 -automatic -DEBUG:trap_uninitialized=ON
a.out
```

Synchronize access to static data:

```fortran
      logical function overflows
      integer count
      save count
      data count /0/
      overflows = .false.
!$omp critical
      count = count + 1
      if (count .gt. 10) overflows = .true.
!$omp end critical
```
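The overflows() idiom can be sketched in C as well (a hypothetical translation, not from the slides): the counter is file-static, so every update must be guarded by the critical section.

```c
/* Thread-safe counter guarded by a critical section: the shared
 * counter is updated by only one thread at a time. */
static int count = 0;

int overflows(void)
{
    int result = 0;
    #pragma omp critical
    {
        count++;
        if (count > 10)
            result = 1;
    }
    return result;
}
```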

A LLIANCE ’98  Power Fortran Accelerator yDetects parallelism yImplements parallelism zUsing PFA module swap MIPSpro MIPSpro.beta721 f77 -pfa zPFA options to try y-IPAenables interprocedural analysis y-OPT:roundoff=3enables reductions Automatic Parallelization

## Basic Compiler Transformations

Work variable privatization:

```fortran
      DO I=1,N
         x = ...
         y(I) = x
      ENDDO
```

becomes

```fortran
!$omp parallel do
!$omp& private(x)
      DO I=1,N
         x = ...
         y(I) = x
      ENDDO
```
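A C sketch of the same transformation (the function name and arithmetic are invented for illustration): x is written before it is read in every iteration, so privatizing it removes the race.

```c
/* Work-variable privatization: x is written first in each iteration,
 * so each thread can safely keep its own copy. */
void square_shifted(int n, const double *a, double *y)
{
    double x;
    #pragma omp parallel for private(x)
    for (int i = 0; i < n; i++) {
        x = 2.0 * a[i] + 1.0;   /* written before read: privatizable */
        y[i] = x * x;
    }
}
```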

## Basic Compiler Transformations

Parallel reduction:

```fortran
      DO I=1,N
         x = ...
         sum = sum + x
      ENDDO
```

can be hand-coded with private partial sums:

```fortran
!$omp parallel
!$omp& private(x, sum1)
      sum1 = 0.0
!$omp do
      DO I=1,N
         x = ...
         sum1 = sum1 + x
      ENDDO
!$omp atomic
      sum = sum + sum1
!$omp end parallel
```

or written with a REDUCTION clause:

```fortran
!$omp parallel do
!$omp& private(x)
!$omp& reduction(+:sum)
      DO I=1,N
         x = ...
         sum = sum + x
      ENDDO
```

## Basic Compiler Transformations

Induction variable substitution:

```fortran
      i1 = 0
      i2 = 0
      DO I=1,N
         i1 = i1 + 1
         B(i1) = ...
         i2 = i2 + I
         A(i2) = ...
      ENDDO
```

becomes

```fortran
!$omp parallel do
!$omp& private(I)
      DO I=1,N
         B(I) = ...
         A((I**2 + I)/2) = ...
      ENDDO
```
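In C the same substitution looks like this (an illustrative sketch; the caller must size the array to hold index n*(n+1)/2): replacing the running sum i2 with its closed form removes the loop-carried dependence, so the loop can run in parallel.

```c
/* Induction-variable substitution: i2 = 1 + 2 + ... + i has the
 * closed form i*(i+1)/2, so the serial update of i2 disappears. */
void fill_triangular(int n, double *a)
{
    #pragma omp parallel for
    for (int i = 1; i <= n; i++)
        a[i * (i + 1) / 2 - 1] = (double)i;   /* 0-based index of i2 */
}
```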

## Automatic Limitations

- IPA is slow for large codes
- Without IPA, only small loops go parallel
- Analysis must be repeated with each compile
- Can’t parallelize data-dependent algorithms
- Results usually don’t scale

## Compiler Listing

Generate a listing with `-pfa keep`:

```
f77 -pfa keep
```

The listing gives many useful clues:

- Loop optimization tables
- Data dependencies
- Explanations of applied transformations
- Optimization summary
- Transformed OpenMP source code

Use the listing to help write the OpenMP version. Workshop MPF presents the listing graphically.

## Picking Parallel Loops

- Avoid inherently serial loops:
  - Time-stepping loops
  - Iterative convergence loops
- Parallelize at the highest level possible
- Choose loops with large trip counts
- Always parallelize in the same dimension, if possible
- Workshop MPF’s static analysis can help

## Profiling

Use SpeedShop to profile your program:

- Compile normally in serial
- Select a typical data set
- Profile with `ssrun`:

```
ssrun -ideal a.out
ssrun -pcsamp a.out
```

- Examine the profile with `prof`:

```
prof -gprof a.out.ideal.<pid>
```

Look for routines with:

- Large combined ‘self’ and ‘child’ time
- Small invocation count

## Example Profile (apsi.profile)

```
        self          kids   called/total   parents
index   cycles(%)     self(%)   kids(%)     called+self   name [index]
        self          kids   called/total   children
[...]
     20511398  453309309775        1/1      PSET [4]
[5] 453329821173(100.00%)  20511398(0.00%)  453309309775(100.00%)  1  RUN [5]
  18305495901  149319136904  267589/268116  DCTDX [6]
  19503577587   22818946546     527/527     DKZMH [13]
  13835415346   24761094596     526/526     DUDTZ [14]
  12919215922   24761094596     526/526     DVDTZ [15]
  11953815047   25150873141     527/527     DTDTZ [16]
   4541238123   24964028293   66920/66920   DPDX [18]
   3883200260   24920009235   66802/66803   DFTDX [19]
   5749986857   17489462744     527/527     DCDTZ [21]
   8874949202   11380650840     526/526     WCONT [24]
  10830140377             0     527/527     HYD [30]
   3873808360    1583161052     527/527     ADVU [36]
   3592836688    1580156951     526/526     ADVV [37]
   1852017128    1583161052     527/527     ADVC [39]
   1680678888    1583161052     527/527     ADVT [40]
[...]
```

## Multiple Parallel Loops

- Nested parallel loops:
  - Prefer the outermost loop
  - Preserve locality: choose the same index as in other parallel loops
  - If the relative sizes of the trip counts are not known:
    - Use the NEST() clause
    - Use the IF clause to select the best loop based on the data set
- Non-nested parallel loops:
  - Consider fusing the loops
  - Execute the code between the loops in parallel
    - Privatize data in redundant calculations

## Nested Parallel Loops (copy.f)

```fortran
      subroutine copy(imx,jmx,kmx,imp2,jmp2,kmp2,w,ws)
      do nv=1,5
!$omp do
         do k = 1,kmx
            do j = 1,jmx
               do i = 1,imx
                  ws(i,j,k,nv) = w(i,j,k,nv)
               end do
            end do
         end do
!$omp end do nowait
      end do
!$omp barrier
      return
      end
```

## Variable Classification

- In OpenMP, data is shared by default
- OpenMP provides several privatization mechanisms
- A correct OpenMP program must have its variables properly classified

```fortran
!$omp parallel
!$omp& PRIVATE(x,y,z)
!$omp& FIRSTPRIVATE(q)
!$omp& LASTPRIVATE(I)
```

```fortran
      common /blk/ l,m,n
!$omp THREADPRIVATE(/blk/)
```

## Shared Variables

Shared is the OpenMP default, and most things are shared:

- The major arrays
- Variables whose indices match the loop index:

```fortran
!$omp parallel do
      do I = 1,N
         do J = 1,M
            x(I) = x(I) + y(J)
```

- Variables only read in the parallel region
- Variables read, then written, requiring synchronization:

```fortran
      maxval = max(maxval, currval)
```

## Private Variables

Local variables in called routines are automatically private:

```fortran
      program main
!$omp parallel
      call compute
!$omp end parallel
      end

      subroutine compute
      integer i,j,k
      [...]
      return
      end
```

Common access patterns:

- Work variables written then read (PRIVATE)
- Variables read on the first iteration, then written (FIRSTPRIVATE)
- Variables read after the last iteration (LASTPRIVATE)

## Classifying Variables in Common

- Private variables in a common block:
  - Privatize with THREADPRIVATE(/…/)
- Mixed shared and private variables in a common block:
  - Split the common into a shared common and a private common, if necessary
  - Promote private variables to formal arguments, if possible, and remove them from the common

## Classifying Variables

Examine the references to each variable:

- Only read in the parallel region → Shared
- Modified in the parallel region:
  - References contain the parallel loop index (different iterations reference different parts) → Shared
  - References do not contain the parallel loop index → continue with the next chart

## Classifying Variables (continued)

For variables that must be privatized, the treatment depends on how they are declared and used:

- Local to the subroutine → Automatic
- Only referenced in the parallel region → Private
- Pointee → Private (allocate at the top of the parallel region)
- Static → Private
- Formal parameter or common member referenced in called routines → move to a common block and declare it Threadprivate

## Variable Typing (wcont.f, wcont_omp.f, dwdz.f)

```fortran
      DIMENSION HELP(NZ),HELPA(NZ),AN(NZ),BN(NZ),CN(NZ)
      [...]
!$omp parallel
!$omp& default(shared)
!$omp& private(help,helpa,i,j,k,dv,topow,nztop,an,bn,cn)
!$omp& reduction(+: wwind, wsq)
      HELP(1)=0.0
      HELP(NZ)=0.0
      NZTOP=NZ-1
!$omp pdo
      DO 40 I=1,NX
         DO 30 J=1,NY
            DO 10 K=2,NZTOP
      [...]
   40 CONTINUE
!$omp end pdo
!$omp end parallel
```

## Synchronization

Reductions:

- Max and min values
- Global sums, products, etc.
- Use the REDUCTION() clause for scalars (maxpy.f):

```fortran
!$omp do reduction(max: ymax)
      do i=1,n
         y(i) = a*x(i) + y(i)
         ymax = max(ymax,y(i))
      enddo
```

- Code array reductions by hand
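maxpy.f has a direct C counterpart (a sketch, not from the course; `reduction(max:...)` in C requires OpenMP 3.1 or later, and without an OpenMP compiler the pragma is ignored and the loop runs serially with the same result):

```c
/* y = a*x + y while tracking the running maximum with a max
 * reduction clause. */
double saxpy_max(int n, double a, const double *x, double *y)
{
    double ymax = -1.0e308;   /* effectively -infinity */
    #pragma omp parallel for reduction(max:ymax)
    for (int i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
        if (y[i] > ymax)
            ymax = y[i];
    }
    return ymax;
}
```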

## Array Reductions (histogram.f, histogram.omp.f)

```fortran
!$omp parallel private(hist1,i,j,ibin)
      do i=1,nbins
         hist1(i) = 0
      enddo
!$omp do
      do i=1,m
         do j=1,m
            ibin = 1 + data(j,i)*rscale*nbins
            hist1(ibin) = hist1(ibin) + 1
         enddo
      enddo
!$omp critical
      do i=1,nbins
         hist(i) = hist(i) + hist1(i)
      enddo
!$omp end critical
!$omp end parallel
```
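A hand-coded array reduction in C follows the same shape (an illustrative sketch; the binning rule assumes data in [0,1) and is not from the original): each thread fills a private histogram, then merges it into the shared one inside a critical section.

```c
#include <stdlib.h>

/* Array reduction by hand: private per-thread histogram, merged
 * under a critical section. */
void histogram(int n, const double *data, int nbins, int *hist)
{
    #pragma omp parallel
    {
        int *hist1 = calloc(nbins, sizeof *hist1);  /* private copy */
        #pragma omp for
        for (int i = 0; i < n; i++) {
            int ibin = (int)(data[i] * nbins);      /* data in [0,1) */
            if (ibin >= nbins) ibin = nbins - 1;
            hist1[ibin]++;
        }
        #pragma omp critical
        for (int b = 0; b < nbins; b++)
            hist[b] += hist1[b];
        free(hist1);
    }
}
```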

## Building the Parallel Program

Analyze, insert directives, and compile:

```
module swap MIPSpro MIPSpro.beta721
f77 -mp -n32
```

or

```
source /usr/local/apps/KAI/setup.csh
guidef77 -n32
```

Run multiple times and compare the output to the serial run:

```
setenv OMP_NUM_THREADS 3
setenv OMP_DYNAMIC false
a.out
```

Then debug.

## Correctness and Debugging

- OpenMP is easier than MPI, but bugs are still possible
- Common parallel bugs
- Debugging approaches

## Debugging Tips

Check parallel P=1 results:

```
setenv OMP_NUM_THREADS 1
setenv OMP_DYNAMIC false
a.out
```

If the results differ from serial, check for:

- Uninitialized private data
- Missing LASTPRIVATE clauses

If the results are the same as serial, check for:

- Unsynchronized access to shared variables
- Shared variables that should be private
- Variable-size THREADPRIVATE common declarations

## Parallel Debugging Is Hard (parbugs.f)

What can go wrong?

- Incorrectly classified variables
- Unsynchronized writes
- Data read before written
- Uninitialized private data
- Failure to update global data
- Other race conditions
- Timing-dependent bugs

## Parallel Debugging Is Hard

What else can go wrong?

- Unsynchronized I/O
- Thread stack collisions
  - Increase the stack with the mp_set_slave_stacksize() function or the KMP_STACKSIZE variable
- Privatization of improperly declared arrays
- Inconsistently declared private common blocks

## Debugging Options

- Print statements
- Multithreaded debuggers
- Automatic parallel debugger

## Print Statements

Advantages:

- WYSIWYG
- Can be useful
- Can monitor the scheduling of iterations on threads

Disadvantages:

- Slow, human-time-intensive bug hunting

Tips:

- Include the thread ID
- Checksum shared memory regions
- Protect I/O with a CRITICAL section

## Multithreaded Debugger

Advantages:

- Can find causes of deadlock, such as threads waiting at different barriers

Disadvantages:

- Locates the symptom, not the cause
- Errors are hard to reproduce, especially those which are timing-dependent
- Difficult to relate parallel (MP) library calls back to the original source
- Human intensive

## WorkShop Debugger

Graphical user interface. Using the debugger:

- Add debug symbols with `-g` on compile and link:

```
f77 -g -mp
```

or

```
guidef77 -g
```

- Run the debugger:

```
setenv OMP_NUM_THREADS 3
setenv OMP_DYNAMIC false
cvd a.out
```

- Follow the threads and try to reproduce the bug

## Automatic OpenMP Debugger

Advantages:

- Systematically finds parallel bugs:
  - Deadlocks and race conditions
  - Uninitialized data
  - Reuse of PRIVATE data outside parallel regions
  - Measures thread stack usage
- Uses computer time rather than human time

Disadvantages:

- Data-set dependent
- Requires a sequentially consistent program
- Increased memory usage and CPU time

## KAI’s Assure

- Looks like an OpenMP compiler
- Simulates an ideal parallel computer
- Itemizes parallel bugs
- Locates the exact position of each bug in the source
- Includes a GUI to browse error reports

## Serial Consistency

The parallel program must have a serial counterpart:

- The algorithm can’t depend on the number of threads
- The code can’t manually assign domains to threads
- It can’t call omp_get_thread_num()
- It can’t use the OpenMP lock API

The serial code defines correct behavior:

- The serial code should be well debugged
- Assure sometimes finds serial bugs as well

## Using Assure

Pick a project database file name, e.g. “buggy.prj”, and compile all source files with `assuref77`:

```
source /usr/local/apps/KAI/setup.csh
assuref77 -WA,-pname=./buggy.prj -c buggy.f
assuref77 -WA,-pname=./buggy.prj buggy.o
```

Source files in multiple directories must specify the same project file. Run with a small but representative workload, then view the results:

```
a.out
setenv DISPLAY your_machine:0
assureview buggy.prj
```

## AssureView: Errors

## AssureView: Source

## AssureView: Call Graph

## AssureView: Common Mismatch

## Assure Tips

- Select small but representative data sets
- Increase test coverage with multiple data sets
- No need to run the job to completion (control-C)
- Get intermediate reports (e.g., every 2 minutes):

```
setenv KDD_INTERVAL 2m
a.out &
assureview buggy.prj
[ wait a few minutes ]
assureview buggy.prj
```

- Quickly learn about stack usage and the call graph:

```
setenv KDD_DELAY 48h
```

## A Different Approach to Parallelization (md.f, md.omp.f)

1. Locate candidate parallel loop(s)
2. Identify the obvious shared and private variables
3. Insert OpenMP directives
4. Compile with the Assure parallel debugger
5. Run the program
6. View parallel errors with AssureView
7. Update the directives

## Parallel Performance

- Limiters of parallel performance
- Detecting performance problems
- Fixing performance problems

## Parallel Performance

Limiters of performance, roughly ordered from easy and obvious to hard and subtle:

- Amdahl’s law
- Load imbalance
- Synchronization
- Overheads
- False sharing

## Amdahl’s Law

- The fraction of the code that runs in parallel limits the maximum efficiency and scalability
- Key: parallelize everything significant
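The bound behind the slide can be written out: with parallel fraction $f$ and $P$ processors, the speedup is

```latex
S(P) = \frac{1}{(1 - f) + \dfrac{f}{P}},
\qquad
\lim_{P \to \infty} S(P) = \frac{1}{1 - f}
```

so even $f = 0.95$ caps the speedup at 20 no matter how many processors are added, which is why everything significant must be parallelized.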

## Load Imbalance

Unequal work loads lead to idle threads and wasted time.

(Timeline figure: between `!$omp parallel do` and `!$omp end parallel do`, threads that finish early sit idle until the slowest thread completes.)

## Synchronization

Lost time waiting for locks.

(Timeline figure: inside a parallel region, threads serialize as each waits its turn to enter the `!$omp critical` section.)

## Parallel Loop Size

Successful loop parallelization requires large loops. `!$OMP PARALLEL DO SCHEDULE(STATIC)` startup time:

- ~3,500 cycles, or 20 microseconds, on 4 processors
- ~200,000 cycles, or 1 millisecond, on 128 processors

Loop time should be large compared to these parallel overheads, and the data size must grow faster than the number of threads to maintain parallel efficiency:

    max loop speedup = (serial loop execution time) /
                       (parallel loop startup + serial loop execution time / number of processors)
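Rearranging that bound gives the break-even loop size; this helper is a back-of-envelope illustration, not from the original slides.

```c
/* Smallest serial loop time (in cycles) for which parallel execution
 * breaks even: setting T / (startup + T/P) = 1 and solving for T
 * gives T = startup * P / (P - 1). */
double breakeven_serial_cycles(double startup_cycles, int nproc)
{
    return startup_cycles * nproc / (nproc - 1.0);
}
```

With the slide’s 4-processor numbers, a loop shorter than roughly 4,700 cycles cannot speed up at all, and the loop must be far longer than that to approach ideal speedup.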

## False Sharing (false.f)

False sharing occurs when multiple threads repeatedly write to the same cache line. Use `perfex` to detect whether cache invalidation is a problem:

```
perfex -a -y -mp a.out
```

Use SpeedShop to find the location of the problem:

```
ssrun -dc_hwc a.out
ssrun -dsc_hwc a.out
```
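One standard fix is to pad per-thread data out to a full cache line so that neighboring threads’ counters never share one. The sketch below is illustrative (the struct name is invented) and assumes a 128-byte secondary-cache line, as on the Origin2000’s R10000.

```c
/* Pad each per-thread counter to a full cache line so an update by
 * one thread never invalidates the line holding another thread's
 * counter. */
#define CACHE_LINE 128   /* assumed L2 line size */

struct padded_counter {
    long value;
    char pad[CACHE_LINE - sizeof(long)];
};
```

An array of `struct padded_counter`, indexed by thread, then behaves like truly private storage at the cache level.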

## Measuring Parallel Performance

- Measure wall-clock time with `timex`:

```
setenv OMP_DYNAMIC false
setenv OMP_NUM_THREADS 1
timex a.out
setenv OMP_NUM_THREADS 16
timex a.out
```

- Profilers (SpeedShop, perfex):
  - Find remaining serial time
  - Identify false sharing
- Guide’s instrumented parallel library

## Using GuideView

Compile with the Guide OpenMP compiler and normal compile options:

```
source /usr/local/apps/KAI/setup.csh
guidef77 -c -Ofast=IP27 -n32 -mips4 source.f ...
```

Link with the instrumented library:

```
guidef77 -WGstats source.o ...
```

Run with a real parallel workload:

```
setenv KMP_STACKSIZE 32M
a.out
```

View the performance report:

```
guideview guide_stats
```

## GuideView

- Compare achieved to ideal performance
- Identify parallel bottlenecks such as barriers, locks, and sequential time
- Compare multiple runs

- Analyze each thread’s performance
- See how performance bottlenecks change as processors are added

## Performance Data by Region

- Analyze each parallel region
- Find serial regions that are hurt by parallelism
- Sort or filter regions to navigate to hotspots

## Dynamic Scheduling

Dynamic scheduling relieves load imbalance.

Static even scheduling:

- Equal-size iteration chunks
- Based on runtime loop limits
- Totally parallel scheduling
- The OpenMP default

```fortran
!$omp parallel do
!$omp& schedule(static)
```

Dynamic and guided scheduling:

- Threads do some work, then get the next chunk

```fortran
!$omp parallel do
!$omp& schedule(dynamic,8)

!$omp parallel do
!$omp& schedule(guided,8)
```
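A C sketch of dynamic scheduling on an imbalanced loop (the workload is artificial, invented for illustration): iteration i costs O(i), so static chunks would give the last thread most of the work, while dynamic chunks of 8 rebalance at runtime.

```c
/* Iteration i does i units of work, so a static split is imbalanced;
 * schedule(dynamic,8) hands out chunks of 8 iterations on demand. */
double weighted_sum(int n, const double *w)
{
    double total = 0.0;
    #pragma omp parallel for schedule(dynamic, 8) reduction(+:total)
    for (int i = 0; i < n; i++) {
        double t = 0.0;
        for (int k = 0; k < i; k++)   /* cost grows with i */
            t += w[i];
        total += t;
    }
    return total;
}
```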

## Limiting Parallel Overheads

- Merge adjacent parallel regions
- When safe, avoid the barrier at the end of `!$omp do`
- Eliminate small parallel loops
- Use the IF clause to limit parallelism
- Increase the problem size

```fortran
!$omp parallel
!$omp& if(imax .gt. 1000)
!$omp do
      do I=1,100
      [...]
      enddo
!$omp end do nowait
!$omp do
      do I=1,100
      [...]
      enddo
!$omp end parallel
```
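Several of these tips combine in one C sketch (a hypothetical function, not from the course): one parallel region spans both loops, the first worksharing loop drops its barrier with `nowait` because the two loops touch disjoint arrays, and the `if` clause keeps small problems serial.

```c
/* One merged parallel region; nowait is safe because the loops write
 * different arrays; if() avoids parallel overhead on small problems. */
void update_pair(int n, double *a, double *b)
{
    #pragma omp parallel if(n > 1000)
    {
        #pragma omp for nowait
        for (int i = 0; i < n; i++)
            a[i] = 2.0 * a[i];
        #pragma omp for
        for (int i = 0; i < n; i++)
            b[i] = b[i] + 1.0;
    }
}
```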

## Advanced Topics

- OpenMP can be used with MPI to achieve two-level parallelism:

```
setenv OMP_NUM_THREADS 4
mpirun -np 4 a.out
```

- Data distribution and affinity directives (`man mp`)
- Explicit domain decomposition with OpenMP

## Reference

Speaker contact info:

- Faisal Saied, fsaied@ncsa.uiuc.edu, 217-244-9481
- Fady Najjar, najjer@urbana.sgi.com, 217-244-4934
- Bill Magro, magro@kai.com, 217-398-3284

Tools: ssrun, timex, perfex, cvd, cvpav, cvperf, f77, f90 — see the man pages or the “insight” documents.

- Guide documentation (on modi4): /usr/local/apps/KAI/guide35/docs
- Assure documentation (on modi4): /usr/local/apps/KAI/assure35/docs
