Programming the Origin2000 with OpenMP: Part II
William Magro, Kuck & Associates, Inc.


1 Programming the Origin2000 with OpenMP: Part II
  William Magro, Kuck & Associates, Inc.

2 Outline (ALLIANCE ’98)
- A Simple OpenMP Example
- Analysis and Adaptation
- Debugging
- Performance Tuning
- Advanced Topics

3 A Simple Example (dotprod.f)

      real*8 function ddot(n,x,y)
      integer n
      real*8 x(n), y(n)
      ddot = 0.0
!$omp parallel do private(i)
!$omp& reduction(+:ddot)
      do i=1,n
         ddot = ddot + x(i)*y(i)
      enddo
      return
      end

[Figure: vectors x and y, elements 1 through n.]

4 A Less Simple Example (dotprod2.f)

      real*8 function ddot(n,x,y)
      integer n
      real*8 x(n), y(n), ddot1
      ddot = 0.0
!$omp parallel private(ddot1)
      ddot1 = 0.0
!$omp do private(i)
      do i=1,n
         ddot1 = ddot1 + x(i)*y(i)
      enddo
!$omp end do nowait
!$omp atomic
      ddot = ddot + ddot1
!$omp end parallel
      return
      end

[Figure: each thread accumulates a private ddot1 over its portion of x and y, then adds it into the shared ddot.]
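The same two-phase pattern translates directly to C OpenMP. A minimal sketch (the function name and signature are illustrative, not from the slides); the pragmas are ignored when compiled without OpenMP, so the serial result is identical:

```c
/* Per-thread partial sums, as in dotprod2.f: each thread accumulates
   into a private 'partial', then adds it to the shared result under
   an atomic update, so there is one synchronization per thread
   rather than one per iteration. */
double ddot(int n, const double *x, const double *y)
{
    double sum = 0.0;
    #pragma omp parallel
    {
        double partial = 0.0;      /* private: declared inside the region */
        #pragma omp for nowait
        for (int i = 0; i < n; i++)
            partial += x[i] * y[i];
        #pragma omp atomic
        sum += partial;            /* merge once per thread */
    }
    return sum;
}
```

Declaring `partial` inside the parallel region makes it private automatically, the C analogue of the `private(ddot1)` clause.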

5 Analysis and Adaptation
- Thread-safety
- Automatic Parallelization
- Finding Parallel Opportunities
- Classifying Data
- A Different Approach

6 Thread-safety
- Confirm the code works with -automatic in serial:

    f77 -automatic -DEBUG:trap_uninitialized=ON
    a.out

- Synchronize access to static data:

      logical function overflows
      integer count
      save count
      data count /0/
      overflows = .false.
!$omp critical
      count = count + 1
      if (count .gt. 10) overflows = .true.
!$omp end critical
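The same critical-section pattern in C, carrying over the counter and threshold from the Fortran fragment (a sketch; the function name is illustrative):

```c
/* Shared static counter updated inside a critical section, mirroring
   the Fortran 'overflows' example: the increment and the test are
   executed by one thread at a time. */
static int count = 0;

int overflows(void)
{
    int over = 0;                  /* private result flag */
    #pragma omp critical
    {
        count = count + 1;
        if (count > 10)            /* threshold from the slide */
            over = 1;
    }
    return over;
}
```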

7 Automatic Parallelization
- Power Fortran Accelerator
  - Detects parallelism
  - Implements parallelism
- Using PFA:

    module swap MIPSpro MIPSpro.beta721
    f77 -pfa

- PFA options to try:
  - -IPA             enables interprocedural analysis
  - -OPT:roundoff=3  enables reductions

8 Basic Compiler Transformations
Work variable privatization:

Serial:
      DO I=1,N
         x = ...
         y(I) = x
      ENDDO

Parallel:
!$omp parallel do
!$omp& private(x)
      DO I=1,N
         x = ...
         y(I) = x
      ENDDO

9 Basic Compiler Transformations
Parallel reduction:

Serial:
      DO I=1,N
         x = ...
         sum = sum + x
      ENDDO

Hand-coded parallel version:
!$omp parallel
!$omp& private(x, sum1)
      sum1 = 0.0
!$omp do
      DO I=1,N
         x = ...
         sum1 = sum1 + x
      ENDDO
!$omp end do nowait
!$omp atomic
      sum = sum + sum1
!$omp end parallel

With the reduction clause:
!$omp parallel do
!$omp& private(x)
!$omp& reduction(+:sum)
      DO I=1,N
         x = ...
         sum = sum + x
      ENDDO

10 Basic Compiler Transformations
Induction variable substitution:

Serial:
      i1 = 0
      i2 = 0
      DO I=1,N
         i1 = i1 + 1
         B(i1) = ...
         i2 = i2 + I
         A(i2) = ...
      ENDDO

Parallel:
!$omp parallel do
!$omp& private(I)
      DO I=1,N
         B(I) = ...
         A((I**2 + I)/2) = ...
      ENDDO
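A C sketch of the same substitution: the running counters i1 and i2 are replaced by closed-form functions of the loop index, so iterations no longer depend on each other (array names and sizes are illustrative):

```c
/* Induction variable substitution: i1 == i and i2 == i*(i+1)/2, so
   each iteration computes its indices independently and the loop can
   run in parallel. The caller must size b for n elements and a for
   n*(n+1)/2 elements. */
void fill(int n, double *a, double *b)
{
    #pragma omp parallel for
    for (int i = 1; i <= n; i++) {
        b[i - 1] = (double)i;               /* was: i1 = i1 + 1; B(i1) */
        a[(i * i + i) / 2 - 1] = (double)i; /* was: i2 = i2 + i; A(i2) */
    }
}
```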

11 Automatic Limitations
- IPA is slow for large codes
- Without IPA, only small loops go parallel
- Analysis must be repeated with each compile
- Can’t parallelize data-dependent algorithms
- Results usually don’t scale

12 Compiler Listing
- Generate a listing with ‘-pfa keep’:

    f77 -pfa keep

- The listing gives many useful clues:
  - Loop optimization tables
  - Data dependencies
  - Explanations of applied transformations
  - Optimization summary
  - Transformed OpenMP source code
- Use the listing to help write the OpenMP version
- Workshop MPF presents the listing graphically

13 Picking Parallel Loops
- Avoid inherently serial loops
  - Time-stepping loops
  - Iterative convergence loops
- Parallelize at the highest level possible
- Choose loops with large trip counts
- Always parallelize in the same dimension, if possible
- Workshop MPF’s static analysis can help

14 Profiling
- Use SpeedShop to profile your program
  - Compile normally in serial
  - Select a typical data set
  - Profile with ‘ssrun’:

    ssrun -ideal
    ssrun -pcsamp

  - Examine the profile with ‘prof’:

    prof -gprof.ideal.

- Look for routines with:
  - Large combined ‘self’ and ‘child’ time
  - Small invocation count

15 Example Profile (apsi.profile)
[gprof-style ‘prof -gprof’ output; the cycle counts were lost in this transcript. The root routine RUN accounts for essentially all cycles; its children include DCTDX, DKZMH, DUDTZ, DVDTZ, DTDTZ, DPDX, DFTDX, DCDTZ, WCONT, HYD, ADVU, ADVV, ADVC, and ADVT.]

16 Multiple Parallel Loops
- Nested parallel loops
  - Prefer the outermost loop
  - Preserve locality: choose the same index as in other parallel loops
  - If the relative trip counts are not known:
    - Use the NEST() clause
    - Use the IF clause to select the best loop based on the data set
- Non-nested parallel loops
  - Consider fusing loops
  - Execute code between loops in parallel
    - Privatize data in redundant calculations

17 Nested Parallel Loops (copy.f)

      subroutine copy(imx,jmx,kmx,imp2,jmp2,kmp2,w,ws)
      do nv=1,5
!$omp do
         do k = 1,kmx
            do j = 1,jmx
               do i = 1,imx
                  ws(i,j,k,nv) = w(i,j,k,nv)
               end do
            end do
         end do
!$omp end do nowait
      end do
!$omp barrier
      return
      end

18 Variable Classification
- In OpenMP, data is shared by default
- OpenMP provides several privatization mechanisms
- A correct OpenMP program must have its variables properly classified

!$omp parallel
!$omp& PRIVATE(x,y,z)
!$omp& FIRSTPRIVATE(q)
!$omp& LASTPRIVATE(I)

      common /blk/ l,m,n
!$omp THREADPRIVATE(/blk/)

19 Shared Variables
- Shared is the OpenMP default
- Most things are shared
  - The major arrays
  - Variables whose indices match the loop index:

!$omp parallel do
      do I = 1,N
         do J = 1, M
            x(I) = x(I) + y(J)

  - Variables only read in the parallel region
  - Variables read, then written, requiring synchronization:

      maxval = max(maxval, currval)

20 Private Variables
- Local variables in called routines are automatically private:

      program main
!$omp parallel
      call compute
!$omp end parallel
      end

      subroutine compute
      integer i,j,k
      [...]
      return
      end

- Common access patterns
  - Work variables written then read (PRIVATE)
  - Variables read on the first iteration, then written (FIRSTPRIVATE)
  - Variables read after the last iteration (LASTPRIVATE)
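The three access patterns can be sketched in one C OpenMP loop (names are illustrative). Serial semantics are unchanged, so the pragma is a no-op when OpenMP is disabled:

```c
/* PRIVATE:      'x' is written then read within each iteration.
   FIRSTPRIVATE: 'bias' is only read, so each thread can safely get a
                 copy initialized from the master's value.
   LASTPRIVATE:  'i' keeps the value from the sequentially last
                 iteration (here n) after the loop ends. */
int scale(int n, const double *in, double *out)
{
    double x;
    double bias = 1.0;
    int i;
    #pragma omp parallel for private(x) firstprivate(bias) lastprivate(i)
    for (i = 0; i < n; i++) {
        x = in[i] + bias;     /* work variable: written then read */
        out[i] = 2.0 * x;
    }
    return i;                 /* == n, thanks to lastprivate */
}
```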

21 Classifying Variables in Common
- Private variables in a common block
  - Privatize with THREADPRIVATE(/…/)
- Mixed shared and private variables in a common block
  - Split the common into a shared common and a private common, if necessary
  - Promote private variables to formal arguments, if possible, and remove them from common

22 Classifying Variables: Examine References
- Only read in the parallel region -> shared
- Modified in the parallel region, and the reference contains the parallel loop index (different iterations reference different parts) -> shared
- Modified in the parallel region, and the reference does not contain the parallel loop index -> continue with the next chart

23 Classifying Variables: By Variable Type
- Formal parameter or common member referenced in called routines -> move to common; THREADPRIVATE
- Static -> private
- Pointee, only referenced in the parallel region -> private (allocate at the top of the parallel region)
- Local to the subroutine (automatic) -> private

24 Variable Typing (wcont.f, wcont_omp.f, dwdz.f)

      DIMENSION HELP(NZ),HELPA(NZ),AN(NZ),BN(NZ),CN(NZ)
      [...]
!$omp parallel
!$omp& default(shared)
!$omp& private(help,helpa,i,j,k,dv,topow,nztop,an,bn,cn)
!$omp& reduction(+: wwind, wsq)
      HELP(1)=0.0
      HELP(NZ)=0.0
      NZTOP=NZ-1
!$omp pdo
      DO 40 I=1,NX
         DO 30 J=1,NY
            DO 10 K=2,NZTOP
               [...]
   40 CONTINUE
!$omp end pdo
!$omp end parallel

25 Synchronization (maxpy.f)
- Reductions
  - Max, min values
  - Global sums, products, etc.
  - Use the REDUCTION() clause for scalars:

!$omp do reduction(max: ymax)
      do i=1,n
         y(i) = a*x(i) + y(i)
         ymax = max(ymax,y(i))
      enddo

  - Code array reductions by hand
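The maxpy loop can also be written in C with a max reduction (supported for C only from OpenMP 3.1 onward; the compilers of this era offered it in Fortran, as shown on the slide). A sketch with illustrative names:

```c
/* y = a*x + y while tracking the maximum of the updated y values.
   reduction(max:ymax) gives each thread a private ymax, then
   combines the per-thread maxima at the end of the loop. */
double maxpy(int n, double a, const double *x, double *y)
{
    double ymax = -1.0e300;   /* far below any realistic value */
    #pragma omp parallel for reduction(max:ymax)
    for (int i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
        if (y[i] > ymax)
            ymax = y[i];
    }
    return ymax;
}
```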

26 Array Reductions (histogram.f, histogram.omp.f)

!$omp parallel private(hist1,i,j,ibin)
      do i=1,nbins
         hist1(i) = 0
      enddo
!$omp do
      do i=1,m
         do j=1,m
            ibin = 1 + data(j,i)*rscale*nbins
            hist1(ibin) = hist1(ibin) + 1
         enddo
      enddo
!$omp critical
      do i=1,nbins
         hist(i) = hist(i) + hist1(i)
      enddo
!$omp end critical
!$omp end parallel
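The same hand-coded array reduction in C; the bin count and the assumption that data lies in [0, 1) are illustrative:

```c
#include <string.h>

#define NBINS 8

/* Each thread fills a private histogram, then merges it into the
   shared one inside a critical section: one lock acquisition per
   thread instead of one per data point. */
void histogram(int n, const double *data, int *hist)
{
    memset(hist, 0, NBINS * sizeof(int));
    #pragma omp parallel
    {
        int priv[NBINS] = {0};         /* private per-thread bins */
        #pragma omp for nowait
        for (int i = 0; i < n; i++) {
            int bin = (int)(data[i] * NBINS);  /* data assumed in [0,1) */
            priv[bin]++;
        }
        #pragma omp critical
        for (int b = 0; b < NBINS; b++)
            hist[b] += priv[b];        /* merge under the lock */
    }
}
```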

27 Building the Parallel Program
- Analyze, insert directives, and compile:

    module swap MIPSpro MIPSpro.beta721
    f77 -mp -n32

  or:

    source /usr/local/apps/KAI/setup.csh
    guidef77 -n32

- Run multiple times; compare output to serial:

    setenv OMP_NUM_THREADS 3
    setenv OMP_DYNAMIC false
    a.out

- Debug

28 Correctness and Debugging
- OpenMP is easier than MPI, but bugs are still possible
- Common parallel bugs
- Debugging approaches

29 Debugging Tips
- Check parallel P=1 results:

    setenv OMP_NUM_THREADS 1
    setenv OMP_DYNAMIC false
    a.out

- If results differ from serial, check for:
  - Uninitialized private data
  - Missing LASTPRIVATE clause
- If results are the same as serial, check for:
  - Unsynchronized access to shared variables
  - Shared variables that should be private
  - Variable-size THREADPRIVATE common declarations

30 Parallel Debugging Is Hard (parbugs.f)
- What can go wrong?
  - Incorrectly classified variables
  - Unsynchronized writes
  - Data read before written
  - Uninitialized private data
  - Failure to update global data
  - Other race conditions
  - Timing-dependent bugs

31 Parallel Debugging Is Hard
- What else can go wrong?
  - Unsynchronized I/O
  - Thread stack collisions
    - Increase the stack with the mp_set_slave_stacksize() function or the KMP_STACKSIZE variable
  - Privatization of improperly declared arrays
  - Inconsistently declared private common blocks

32 Debugging Options
- Print statements
- Multithreaded debuggers
- Automatic parallel debugger

33 Print Statements
- Advantages
  - WYSIWYG
  - Can be useful
  - Can monitor scheduling of iterations on threads
- Disadvantages
  - Slow, human-time-intensive bug hunting
- Tips
  - Include the thread ID
  - Checksum shared memory regions
  - Protect I/O with a CRITICAL section

34 Multithreaded Debugger
- Advantages
  - Can find causes of deadlock, such as threads waiting at different barriers
- Disadvantages
  - Locates the symptom, not the cause
  - Hard to reproduce errors, especially those that are timing-dependent
  - Difficult to relate parallel (MP) library calls back to the original source
  - Human intensive

35 WorkShop Debugger
- Graphical user interface
- Using the debugger
  - Add debug symbols with ‘-g’ on compile and link:

    f77 -g -mp

  or:

    guidef77 -g

  - Run the debugger:

    setenv OMP_NUM_THREADS 3
    setenv OMP_DYNAMIC false
    cvd a.out

  - Follow threads and try to reproduce the bug

36 Automatic OpenMP Debugger
- Advantages
  - Systematically finds parallel bugs
    - Deadlocks and race conditions
    - Uninitialized data
    - Reuse of PRIVATE data outside parallel regions
    - Measures thread stack usage
  - Uses computer time rather than human time
- Disadvantages
  - Data-set dependent
  - Requires a sequentially consistent program
  - Increased memory usage and CPU time

37 KAI’s Assure
- Looks like an OpenMP compiler
- Generates an ideal parallel computer simulation
- Itemizes parallel bugs
- Locates the exact location of each bug in the source
- Includes a GUI to browse error reports

38 Serial Consistency
- The parallel program must have a serial counterpart
  - The algorithm can’t depend on the number of threads
  - The code can’t manually assign domains to threads
  - Can’t call omp_get_thread_num()
  - Can’t use the OpenMP lock API
- The serial code defines correct behavior
  - The serial code should be well debugged
  - Assure sometimes finds serial bugs as well

39 Using Assure
- Pick a project database file name, e.g. “buggy.prj”
- Compile all source files with “assuref77”:

    source /usr/local/apps/KAI/setup.csh
    assuref77 -WA,-pname=./buggy.prj -c buggy.f
    assuref77 -WA,-pname=./buggy.prj buggy.o

  - Source files in multiple directories must specify the same project file
- Run with a small but representative workload:

    a.out
    setenv DISPLAY your_machine:0
    assureview buggy.prj

40 AssureView: Errors

41 AssureView: Source

42 AssureView: Call Graph

43 AssureView: Common Mismatch

44 Assure Tips
- Select small but representative data sets
- Increase test coverage with multiple data sets
- No need to run the job to completion (control-C)
- Get intermediate reports (e.g., every 2 minutes):

    setenv KDD_INTERVAL 2m
    a.out &
    assureview buggy.prj
    [ wait a few minutes ]
    assureview buggy.prj

- Quickly learn about stack usage and the call graph:

    setenv KDD_DELAY 48h

45 A Different Approach to Parallelization (md.f, md.omp.f)
- Locate candidate parallel loop(s)
- Identify obvious shared and private variables
- Insert OpenMP directives
- Compile with the Assure parallel debugger
- Run the program
- View parallel errors with AssureView
- Update the directives

46 Parallel Performance
- Limiters of parallel performance
- Detecting performance problems
- Fixing performance problems

47 Parallel Performance
- Limiters of performance, ranging from easy and obvious to hard and subtle:
  - Amdahl’s law
  - Load imbalance
  - Synchronization
  - Overheads
  - False sharing

48 Amdahl’s Law
- Maximum efficiency: with parallel fraction p on N processors, speedup is at most 1 / ((1 - p) + p/N)
- The fraction parallel limits scalability
- Key: parallelize everything significant
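Amdahl's bound is easy to compute directly; a small helper (the function name is illustrative) assuming the standard formulation:

```c
/* Amdahl's law: with parallel fraction p (0 <= p <= 1) on n
   processors, the best possible speedup is 1 / ((1 - p) + p/n).
   Even p = 0.95 caps speedup at 20, no matter how many processors
   are added, which is why everything significant must go parallel. */
double amdahl_speedup(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / n);
}
```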

49 Load Imbalance
- Unequal work loads lead to idle threads and wasted time
[Figure: a timeline of a !$omp parallel do ... !$omp end parallel do region; threads with less work sit idle until the slowest thread finishes.]

50 Synchronization
- Lost time waiting for locks
[Figure: a timeline of a !$omp parallel ... !$omp end parallel region in which threads serialize at a !$omp critical ... !$omp end critical section.]

51 Parallel Loop Size
- Successful loop parallelization requires large loops
- !$OMP PARALLEL DO SCHEDULE(STATIC) startup time:
  - ~3,500 cycles, or 20 microseconds, on 4 processors
  - ~200,000 cycles, or 1 millisecond, on 128 processors
- Loop time should be large compared to parallel overheads:

    max loop speedup = serial loop time / (serial loop time / number of processors + parallel loop startup)

- Data size must grow faster than the number of threads to maintain parallel efficiency
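The speedup bound can be evaluated to find the break-even loop size; a sketch using the slide's 4-processor startup cost as an input:

```c
/* Max loop speedup when a loop with t_serial cycles of work is split
   over p processors at a cost of t_startup cycles of fork/join
   overhead: t_serial / (t_serial/p + t_startup). With the ~3500-cycle
   startup on 4 processors, a 14000-cycle loop only reaches 2x. */
double loop_speedup(double t_serial, int p, double t_startup)
{
    return t_serial / (t_serial / p + t_startup);
}
```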

52 False Sharing (false.f)
- False sharing occurs when multiple threads repeatedly write to the same cache line
- Use perfex to detect whether cache invalidation is a problem:

    perfex -a -y -mp

- Use SpeedShop to find the location of the problem:

    ssrun -dc_hwc
    ssrun -dsc_hwc
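One standard fix is to pad per-thread data out to separate cache lines. A sketch assuming a 128-byte line (the Origin2000's secondary cache line size); the struct name and thread count are illustrative:

```c
/* Padding each thread's counter to a full cache line ensures that
   updates by different threads land on different lines, so no line
   ping-pongs between processors. Each thread would index its own
   slot, e.g. counts[omp_get_thread_num()].value++. */
#define CACHE_LINE 128

struct padded_counter {
    long value;
    char pad[CACHE_LINE - sizeof(long)];  /* fill out the line */
};

/* one counter per thread, each on its own cache line */
struct padded_counter counts[4];
```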

53 Measuring Parallel Performance
- Measure wall-clock time with ‘timex’:

    setenv OMP_DYNAMIC false
    setenv OMP_NUM_THREADS 1
    timex a.out
    setenv OMP_NUM_THREADS 16
    timex a.out

- Profilers (SpeedShop, perfex)
  - Find remaining serial time
  - Identify false sharing
- Guide’s instrumented parallel library

54 Using GuideView
- Compile with the Guide OpenMP compiler and normal compile options:

    source /usr/local/apps/KAI/setup.csh
    guidef77 -c -Ofast=IP27 -n32 -mips4 source.f ...

- Link with the instrumented library:

    guidef77 -WGstats source.o ...

- Run with a real parallel workload:

    setenv KMP_STACKSIZE 32M
    a.out

- View the performance report:

    guideview guide_stats

55 GuideView
- Compare achieved to ideal performance
- Identify parallel bottlenecks such as barriers, locks, and sequential time
- Compare multiple runs

56 GuideView (continued)
- Analyze each thread’s performance
- See how performance bottlenecks change as processors are added

57 Performance Data by Region
- Analyze each parallel region
- Find serial regions that are hurt by parallelism
- Sort or filter regions to navigate to hotspots

58 Dynamic Scheduling
- Relieves load imbalance
- Static even scheduling
  - Equal-size iteration chunks
  - Based on runtime loop limits
  - Totally parallel scheduling
  - OpenMP default

!$omp parallel do
!$omp& schedule(static)

- Dynamic and guided scheduling
  - Threads do some work, then get the next chunk

!$omp parallel do
!$omp& schedule(dynamic,8)

!$omp parallel do
!$omp& schedule(guided,8)
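The even static split can be modeled in a few lines of C. This is a sketch of the common partition (the exact static partition is implementation-defined, so treat it as a model rather than a guarantee):

```c
/* Compute the half-open iteration range [lo, hi) that thread t of
   nthr threads receives under an even static schedule of n
   iterations: every thread gets n/nthr iterations, and the first
   n % nthr threads get one extra. */
void static_chunk(int n, int nthr, int t, int *lo, int *hi)
{
    int base = n / nthr;
    int rem  = n % nthr;
    *lo = t * base + (t < rem ? t : rem);
    *hi = *lo + base + (t < rem ? 1 : 0);
}
```

Dynamic and guided schedules instead hand out chunks (here of 8 iterations) on demand, which costs more synchronization but absorbs uneven per-iteration work.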

59 Limiting Parallel Overheads
- Merge adjacent parallel regions
- When safe, avoid the barrier at the end of !$omp do
- Eliminate small parallel loops
- Use the IF clause to limit parallelism
- Increase the problem size

!$omp parallel
!$omp& if(n .gt. 1000)
!$omp do
      do I=1,100
         [...]
      enddo
!$omp end do nowait
!$omp do
      do I=1,100
         [...]
      enddo
!$omp end parallel
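The IF-clause pattern in C, with an illustrative threshold: below it the region runs serially and the fork/join cost is avoided entirely.

```c
/* Sum a vector, going parallel only when the trip count is large
   enough to amortize the parallel-region startup. The threshold
   1000 is an assumption, not a measured break-even point. */
double vsum(int n, const double *v)
{
    double s = 0.0;
    #pragma omp parallel for reduction(+:s) if(n > 1000)
    for (int i = 0; i < n; i++)
        s += v[i];
    return s;
}
```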

60 Advanced Topics
- OpenMP can be used with MPI to achieve two-level parallelism:

    setenv OMP_NUM_THREADS 4
    mpirun -np 4 a.out

- Data distribution and affinity directives:

    man mp

- Explicit domain decomposition with OpenMP

61 Reference
- Speaker contact info
  - Faisal Saied
  - Fady Najjar
  - Bill Magro
- ssrun, timex, perfex, cvd, cvpav, cvperf, f77, f90
  - See man pages or “insight” documents
- Guide documentation
  - On modi4: /usr/local/apps/KAI/guide35/docs
- Assure documentation
  - On modi4: /usr/local/apps/KAI/assure35/docs
