
Performance Optimization: Getting your programs to run faster


1 Performance Optimization: Getting your programs to run faster

2 Why optimize?
- Better turn-around on jobs
- Run more programs/scenarios
- Release resources to other applications
- You want the job to finish before you retire

3 Ways to get more performance
- Run on bigger, faster hardware: clock speed, more memory, …
- Tweak your algorithm
- Optimize your code

4 Loop Unrolling
- Converting passes of a loop into in-line streams of code
- Useful when loops do calculations on data in arrays
- Unrolling can take advantage of pipelined processing units in processors
- The compiler may preload operands into CPU registers

5 Loop Unrolling: disadvantages
- The unrolling depth may be limited by the number of floating-point registers:
  - Pentium III: 8
  - Pentium 4: 8
  - Itanium: 128

6 Loop Unrolling: simple example

Loop:

    do i=1,n
      a(i) = b(i) + x*c(i)
    enddo

Unrolled loop (assumes n is a multiple of 4):

    do i=1,n,4
      a(i)   = b(i)   + x*c(i)
      a(i+1) = b(i+1) + x*c(i+1)
      a(i+2) = b(i+2) + x*c(i+2)
      a(i+3) = b(i+3) + x*c(i+3)
    enddo

7 Loop Unrolling: simple example

Performance, rolled:
- P3 550 MHz: 13 MFLOPS
- Itanium: 30 MFLOPS

Performance, unrolled:
- P3 550 MHz: 30 MFLOPS
- Itanium: 107 MFLOPS

*from: LCI and NCSA

8 Loop Unrolling

    int a[100];
    for (i = 0; i < 100; i++) {
        a[i] = a[i] * 2;
    }

Unrolled by 5:

    int a[100];
    for (i = 0; i < 100; i += 5) {
        a[i]   = a[i]   * 2;
        a[i+1] = a[i+1] * 2;
        a[i+2] = a[i+2] * 2;
        a[i+3] = a[i+3] * 2;
        a[i+4] = a[i+4] * 2;
    }

9 Loop unrolling

    int a[10][10];
    for (i = 0; i < 10; i++) {
        for (j = 0; j < 10; j++) {
            a[i][j] = a[i][j] * 2;
        }
    }

Inner loop fully unrolled:

    int a[10][10];
    for (i = 0; i < 10; i++) {
        a[i][0] = a[i][0] * 2;
        a[i][1] = a[i][1] * 2;
        a[i][2] = a[i][2] * 2;
        a[i][3] = a[i][3] * 2;
        a[i][4] = a[i][4] * 2;
        a[i][5] = a[i][5] * 2;
        a[i][6] = a[i][6] * 2;
        a[i][7] = a[i][7] * 2;
        a[i][8] = a[i][8] * 2;
        a[i][9] = a[i][9] * 2;
    }

10 Loop unrolling: dot product

    float a[100];
    float b[100];
    float z = 0;
    for (i = 0; i < 100; i++) {
        z = z + a[i] * b[i];
    }

Unrolled by 2:

    float a[100];
    float b[100];
    float z = 0;
    for (i = 0; i < 100; i += 2) {
        z = z + a[i]   * b[i];
        z = z + a[i+1] * b[i+1];
    }

11 Unrolling Loops: you can do it automatically

12 Unrolling Loops: compiler options

GNU compilers:
- -funroll-loops
- -funroll-all-loops (not recommended)

PGI compilers:
- -Munroll
- -Munroll=c:N
- -Munroll=n:M

13 Unrolling Loops: compiler options

Intel compilers:
- -unrollM (unroll up to M times)
- -unroll
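Taken together, the flags from the two slides above are passed on the compile line; a sketch of typical invocations (compiler names and the unroll limit of 4 are assumptions for illustration):

```shell
# GNU: unroll loops whose trip count is known at compile time
gcc -O2 -funroll-loops -o myprog myprog.c

# PGI: completely unroll loops with a trip count of at most 4
pgcc -O2 -Munroll=c:4 -o myprog myprog.c

# Intel (classic icc): unroll loops up to 4 times
icc -O2 -unroll4 -o myprog myprog.c
```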

14 Design your program to minimize cache misses: align data arrays with cache boundaries

15 If your algorithm makes repeated passes across specific rows or columns, try making your array dimensions match the cache buffer size of your computer. E.g., if your array is 1000 x 1000 (single-byte integers) and you have a 1024-byte cache... allocate the array as 1024 x 1024.

16 Align data arrays with cache boundaries: or if your array is 500 x 500, allocate it as 512 x 512.

17 Align data arrays with cache boundaries

18 (Diagram: the elements (1,1), (1,2), … (3,3) of a 2-D array laid out in linear memory across cache-line boundaries.)

19 Taking Memory in Order

Optimizing the use of cache: row-major order vs. column-major order
- row major: a(1,1), a(1,2), a(1,3), a(2,1), a(2,2), …
- column major: a(1,1), a(2,1), a(3,1), a(1,2), a(2,2), …

20 Taking Memory in Order

Remember, C and Fortran store arrays in the opposite manner:
- C: row major
- Fortran: column major

21 Taking Memory in Order (Diagram: C stores each row contiguously; Fortran stores each column contiguously.)

22 Taking Memory in Order

Wrong order for Fortran (inner loop strides across rows):

    do i=1,m
      do j=1,n
        a(i,j) = b(i,j) + c(i)
      end do
    end do

    loop time: 23.42; loop runs at 4.48 MFLOPS

Right order (inner loop walks down a column, matching Fortran's column-major storage):

    do j=1,n
      do i=1,m
        a(i,j) = b(i,j) + c(i)
      end do
    end do

    loop time: 2.80; loop runs at 37.48 MFLOPS

23 Floating Point Division
- FP division is very expensive in terms of processor time: 20-60 clock cycles to compute
- Usually not pipelined
- FP division required by IEEE "rules"

24 Floating point division: use the reciprocal

    float a[100];
    for (i = 0; i < 100; i++) {
        a[i] = a[i] / 2;
    }

becomes:

    float a[100];
    float denom;
    denom = 1.0f / 2.0f;   /* note: 1/2 would be integer division, giving 0 */
    for (i = 0; i < 100; i++) {
        a[i] = a[i] * denom;
    }

25 Compiler options for IEEE compatibility
- PGI compilers: -Knoieee
- Intel compilers: -mp
- GNU compilers: no comparable option for floating-point division

26 Compilers can't optimize the division if the divisor is not scalar
- Breaks IEEE "rules"
- May impact portability

27 Function Inlining
- Build functions/subroutines in as inline parts of the program's code, rather than as separate functions/subroutines
- Minimizes function calls (and the call-management overhead that goes with them)

28 Function Inlining

Compile with:
- -Minline : the compiler tries to inline whatever meets its criteria
- -Minline=except:func : excludes func from inlining
- -Minline=func : inlines only func

29 Function Inlining

...compile with:
- -Minline=myfile.lib : inlines functions from an inline library file
- -Minline=levels:n : inlines functions up to n levels of calls (default is usually 1)

30 MPI Tuning
- Minimize messages
- Pointers/counts
- MPI derived datatypes
- MPI_Pack / MPI_Unpack
- Use shared memory for message passing:

    #PBS -l nodes=6:ppn=1

  works, but

    #PBS -l nodes=3:ppn=2

  is better.

31 Compiler optimizations
- -O0 : no optimization
- -O1 : local optimization, register allocation
- -O2 : local/limited global optimization
- -O3 : aggressive global optimization
- -Munroll : loop unrolling
- -Mvect : vectorization
- -Minline : function inlining

32 gcc Compiler Optimizations

See: http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html


