# Www.cineca.it Optimization techniques Carlo Cavazzoni, HPC department, CINECA.

## Presentation on theme: "Www.cineca.it Optimization techniques Carlo Cavazzoni, HPC department, CINECA."— Presentation transcript:

www.cineca.it Optimization techniques Carlo Cavazzoni, HPC department, CINECA

www.cineca.it Modern node architecture CPU RAM Disk cache I, D Small & fast

www.cineca.it Cache Hierarchy register L1 L2 L3 RAM L1: Instruction and data Size: L1 … Ln Speed: L1 … Ln CPU looks for data in L1, if it is there (L1 cache hit), if not (L1 cache miss) and looks in L2 … cache miss penaly in terms of clock cycle

www.cineca.it CACHE Direct Mapped 32 Kbyte 0 32 K 64 K 128 K cache

www.cineca.it Cache set associative 32 Kbyte 16 Kbyte 32 Kbyte 0 32 K 64 K 128 K 16 Kbyte 16 K Es. 2-ways 48 K cache LastRecentlyUsed Round Robin Random

www.cineca.it Loop optimization

www.cineca.it Loop fusion Locality in time do i=1, n a(i) = b(i) + 1.0 enddo do i=2, n c(i) = sqrt(a(i-1)) enddo do i=2, n a(i) = b(i) + 1.0 c(i) = sqrt(a(i-1)) enddo a(1) = b(1) + 1.0 if n is big enough, a is loaded, offloaded and loaded again into cache Reuse the a(i) loaded into cache

www.cineca.it Loop interchange Locality in space do i=1, n do j=1, n a(i,j) = b(i,j) + 1.0 enddo Load elements into cache lines and use only one before replacing them with new elements Load elements into cache and use all of them before replacing them with new elements do j=1, n do i=1, n a(i,j) = b(i,j) + 1.0 enddo 0x00 0x01 0x02 0x03 ab jj ii

www.cineca.it Cache thrashing real, dimension (1024) :: a,b COMMON /my_com/ a, b do i=1, 1024 a(i) = b(i) + 1.0 enddo offset shift matrixes w.r.t. cache no more problems Avoid power of 2 for matrix dimensions integer offset = (linea_cache)/SIZE(REAL) real, dimension (1024+offset) :: a,b COMMON /my_com/ a, b do i=1, 1024 a(i) = b(i) + 1.0 enddo Padding size cache = 4*1024, direct mapped, a, b contiguous cache thrashing array size = multiple of cache size possible source of cache thrashing Set Associative help reducing thrashing problems

www.cineca.it Loop unrolling do j=1, n do i=1, (n-1) a(i,j)= b(i,j)+b(i+1,j)+1.0 enddo Equivalent Loops. Fewer jump. Fewer dependencies. Fill pipelines and vector units. do j=1, n do i=1, (n-1), 2 a(i,j) = b(i,j) +b(i+1,j)+1.0 a(i+1,j) = b(i+1,j)+b(i+2,j)+1.0 enddo

www.cineca.it Optimize with numerical libraries Less coding Tested and (almost) bug free Standard Efficient implementation Optimized

www.cineca.it BLAS Basic Linear Algebra Subprogram, Parallel BLAS and Basic Linear Algebra Communication Subsystem (www.netlib.org) Level 1 BLAS: Vector-Vector operations (scalar only). Level 2 BLAS, PBLAS: Vector-Matrix operations (scalar and parallel). Level 3 BLAS, PBLAS: Matrix-Matrix operation (scalar and parallel). Level 1 and 2 BLACS: vector reduction, vector and matrics communications.

www.cineca.it Lapack and Scalapack Linear Algebra Package and Scalable Lapack (www.netlib.org) Matrix decomposition. Solution of Linear Systems. Eigenvalues and Eigenvetors Linear Least Square solutions

www.cineca.it MKL ESSL ACML CUBLAS MAGMA PLASMA

www.cineca.it MASS (IBM) Accelerated version of SQRT, SIN, COS, EXP, LOG, ecc… Scalar and vector

www.cineca.it VML Equivalent to MASS (vector version only) For Intel processors Accelerated version of: sqrt, rsqrt, exp, log, sin, cos, tan, atan, atan2, sinh, cosh, tanh, dnint, x**y

www.cineca.it VML do i = 1, n r = r + sin( a(i) ) end do call vdsin( n, a, y ) do i = 1, n r = r + y( i ) end do CALL vml_subroutine( n, a, y )

www.cineca.it BLAS Matrix multiplication DGEMM (transa, transb, l, n, m, alpha, a, lda, b, ldb, beta, c, ldc) c = alpha op( a ) * op( b ) + beta c C lm = n A ln B nm + C lm C lm = n A T ln B nm + C lm C lm = n A ln B T nm + C lm C lm = n A T ln B T nm + C lm real*8 a(lda,*), b(ldb,*), c(ldc,*)

www.cineca.it Profileing with gprof Compiler flag -pg or -p (depend on the compiler) gcc -pg –c mio.c./a.out gmon.outgprof

www.cineca.it gcc -pg -funroll-loops –O2 dotprod.c -static [cineca@rfxoff1 Carlo]\$./a.out d = 1000000.000000 % cumulative self self total time seconds seconds calls us/call us/call name 68.57 0.05 0.05 2 23437.50 23437.50 set_vector 31.43 0.07 0.02 1 21484.38 21484.38 dot_product 0.00 0.07 0.00 1 0.00 68359.38 main gprof

www.cineca.it Profileing by hand Find hot spot in your application Use temporization functions CALL SYSTEM_CLOCK(iclk1, count_rate=nclk) CALL critical_subroutine( …… ) CALL SYSTEM_CLOCK(iclk2) PRINT *,REAL(iclk2-iclk1)/nclk t1 = cclock() CALL critical_subroutine( …… ) t2 = cclock() PRINT *, (t2-t1) CALL CPU_TIME( t3 ) CALL critical_subroutine( …… ) CALL CPU_TIME( t4 ) PRINT *, (t4-t3)

www.cineca.it Mesure performances #include double cclock_() { /* Restituisce il valore del CLOCK di sistema in secondi */ struct timeval tmp; double sec; gettimeofday( &tmp, (struct timezone *)0 ); sec = tmp.tv_sec + ((double)tmp.tv_usec)/1000000.0; return sec; }

www.cineca.it PROGRAM test_dgemm IMPLICIT NONE INTEGER, PARAMETER :: dim = 1000 REAL*8, ALLOCATABLE :: x(:,:), y(:,:), z(:,:) INTEGER :: i,j,k REAL*8 :: t1, t2 REAL*8 :: cclock EXTERNAL :: cclock ALLOCATE( x( dim, dim ), y( dim, dim ) ) ALLOCATE( z( dim, dim ) ) y = 1.0d0 z = 1.0d0 / DBLE( dim ) x = 0.0d0 t1 = cclock( ) do j = 1, dim do i = 1, dim do k = 1, dim x(i,j) = x(i,j) + y(i,k) * z(k,j) end do t2 = cclock() write(*,*) ' Matrix sum = ', sum(x) write(*,*) ' tempo (secondi) ', t2-t1 DEALLOCATE( x, y, z ) END PROGRAM PROGRAM test_dgemm IMPLICIT NONE INTEGER, PARAMETER :: dim = 1000 REAL*8, ALLOCATABLE :: x(:,:), y(:,:), z(:,:) INTEGER :: i,j,k REAL*8 :: t1, t2 REAL*8 :: cclock EXTERNAL :: cclock ALLOCATE( x( dim, dim ), y( dim, dim ), z( dim, dim ) ) y = 1.0d0 z = 1.0d0 / DBLE( dim ) x = 0.0d0 t1 = cclock() ! x = matmul( y, z ) call dgemm('N', 'N', dim, dim, dim, 1.0d0, y, c dim, z, dim,0.0d0, x, dim) t2 = cclock() write(*,*) ' Matrix sum = ', sum(x) write(*,*) ' tempo (secondi) ', t2-t1 DEALLOCATE( x, y, z ) END PROGRAM

www.cineca.it ATLAS BLAS compatible Automatically Tuned Linear Algebra Software http://sourceforge.net/ http://math-atlas.sourceforge.net/devel/

www.cineca.it http://www.fftw.org Fast Fourier Trasform FFT complex to complex FFT complex to real Parallel FFT Moulti-thread FFT

www.cineca.it do i=1,n do j=1,m y(j,i) = x(i,j) enddo Case Study: matrix transposition y x Think Fortran: Consecutive elements in memory

www.cineca.it Suppose 2-way set associative y x For each value of x I need to load into cache a whole line. 1) X allocate the 2nd way. 2) Risk of thrashing 3) When the cache is full, the proc. Start to overwrite cache lines data mapped in cache y allocate the 1st way What happens with the cache

www.cineca.it Suppose 2-way set associative y x As before for each value of x I need to load into cache a whole cache line. data mapped in cache y allocate the 1st way What happens with the cache, cont. We can see that a lot of data are loaded into the cache but they are not used!

www.cineca.it Suppose 2-way set associative y x Block Algorithm Load a block of data into cache Swap data in cache Write data back to memory

www.cineca.it 31 do i=1,n do j=1,m y(j,i) = x(i,j) enddo do ib = 1, nb ioff = (ib-1) * bsiz do jb = 1, mb joff = (jb-1) * bsiz do j = 1, bsiz do i = 1, bsiz buf(i,j) = x(i+ioff, j+joff) enddo do j = 1, bsiz do i = 1, j-1 bswp = buf(i,j) buf(i,j) = buf(j,i) buf(j,i) = bswp enddo do i=1,bsiz do j=1,bsiz y(j+joff, i+ioff) = buf(j,i) enddo bsiz = block size nb = n / bsiz mb = m / bsiz You need to handle: MOD(n / bsiz) /= 0 OR MOD(m / bsiz) /= 0 Solution: Block algorithm

www.cineca.it Whole block transpose do ib = 1, nb ioff = (ib-1) * bsiz do jb = 1, mb joff = (jb-1) * bsiz do j = 1, bsiz do i = 1, bsiz buf(i,j) = x(i+ioff,j+joff) enddo do j = 1, bsiz do i = 1, j-1 bswp = buf(i,j) buf(i,j) = buf(j,i) buf(j,i) = bswp enddo do i=1,bsiz do j=1,bsiz y(j+joff,i+ioff) = buf(j,i) enddo IF( min( 1, MOD(n,bsiz) ).GT. 0 ) THEN ioff = nb * bsiz do jb = 1, mb joff = (jb-1) * bsiz do j = 1, bsiz do i = 1, MIN(bsiz, n-ioff) buf(i,j) = x(i+ioff, j+joff) enddo do i = 1, MIN(bsiz, n-ioff) do j = 1, bsiz y(j+joff,i+ioff) = buf(i,j) enddo END IF IF( MIN(1, MOD(m, bsiz)).GT. 0 ) THEN joff = mb * bsiz do ib = 1, nb ioff = (ib-1) * bsiz do j = 1, MIN(bsiz, m-joff) do i = 1, bsiz buf(i,j) = x(i+ioff, j+joff) enddo do i = 1, bsiz do j = 1, MIN(bsiz, m-joff) y(j+joff,i+ioff) = buf(i,j) enddo END IF IF( MIN(1,MOD(n,bsiz)).GT.0.AND. & & MIN(1,MOD(m,bsiz)).GT.0 ) THEN joff = mb * bsiz ioff = nb * bsiz do j = 1, MIN(bsiz, m-joff) do i = 1, MIN(bsiz, n-ioff) buf(i,j) = x(i+ioff, j+joff) enddo do i = 1, MIN(bsiz, n-ioff) do j = 1, MIN(bsiz, m-joff) y(j+joff,i+ioff) = buf(i,j) enddo END IF 1 2 3

www.cineca.it 33 Performance tuning and analysis: user codes Straightforward implementation Block implementation

www.cineca.it DO l=1,nphase IF(au1(l,l) /= 0.D0) THEN lp1=l+1 div=1.D0/au1(l,l) DO lj=lp1,nphase au1(l,lj)=au1(l,lj)*div END DO bu1(l)=bu1(l)*div au1(l,l)=0.D0 DO li=1,nphase amul=au1(li,l) DO lj=lp1,nphase au1(li,lj)=au1(li,lj)-amul*au1(l,lj) END DO bu1(li)=bu1(li)-amul*bu1(l) END DO END IF END DO IF( a(1,1) /= 0.D0 ) THEN div = 1.D0 / a(1,1) a(1,2) = a(1,2) * div a(1,3) = a(1,3) * div b(1) = b(1) * div a(1,1) = 0.D0 !li=2 amul = a(2,1) a(2,2) = a(2,2) - amul * a(1,2) a(2,3) = a(2,3) - amul * a(1,3) b(2) = b(2) - amul * b(1) !li=3 amul = a(3,1) a(3,2) = a(3,2) - amul * a(1,2) a(3,3) = a(3,3) - amul * a(1,3) b(3) = b(3) - amul * b(1) END IF IF( a(2,2) /= 0.D0 ) THEN div=1.D0/a(2,2) a(2,3)=a(2,3)*div b(2)=b(2)*div a(2,2)=0.D0 !li=1 amul=a(1,2) a(1,3)=a(1,3)-amul*a(2,3) b(1)=b(1)-amul*b(2) !li=3 amul=a(3,2) a(3,3)=a(3,3)-amul*a(2,3) b(3)=b(3)-amul*b(2) END IF IF( a(3,3) /= 0.D0 ) THEN div=1.D0/a(3,3) b(3)=b(3)*div a(3,3)=0.D0 !li=1 amul=a(1,3) b(1)=b(1)-amul*b(3) !li=2 amul=a(2,3) b(2)=b(2)-amul*b(3) END IF Parameter Dependent Code & Unrolling per un dato set di parametri (di input), riesco ad eliminare ogni loop, ottimizzando cache e pipe di esecuzione

www.cineca.it Debugging (post mortem) program hello_bug real(kind=8) :: a( 10 ) call clearv( a, 10000 ) print *, SUM( a ) end program subroutine clearv( a, n) real(kind=8) :: a( * ) integer :: n integer :: i do i = 1, n a( n ) = 0.0 end do end subroutine gfortran –g hello_bug.f90 Remove core size limit ulimit –c unlimited./a.out Segmentation fault (core dumped) gdb./a.out core

www.cineca.it Debugging (in vivo) gfortran -g hello_bug.f90 gdb./a.out GNU gdb (GDB) Red Hat Enterprise Linux (7.0.1-32.el5) Copyright (C) 2009 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-redhat-linux-gnu". For bug reporting instructions, please see:... Reading symbols from /plx/userinternal/acv0/a.out...done. (gdb) run Starting program: /plx/userinternal/acv0/a.out warning: no loadable sections found in added symbol-file system-supplied DSO at 0x2aaaaaaab000 Program received signal SIGSEGV, Segmentation fault. 0x0000000000400833 in clearv (a=0x7fffffffe2f0, n=@0x400968) at hello_bug.f90:12 12 a( n ) = 0.0

www.cineca.it #include double cclock_() { /* Restituisce il valore del CLOCK di sistema in secondi */ struct timeval tmp; double sec; gettimeofday( &tmp, (struct timezone *)0 ); sec = tmp.tv_sec + ((double)tmp.tv_usec)/1000000.0; return sec; } program mat_mul integer, parameter :: n = 100 real*8 :: a(n,n), b(n,n), c(n,n) real*8 :: t1, t2 real*8 :: cclock external cclock a = 1.0d0 b = 1.0d0 t1 = cclock() call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n ) t2 = cclock() write(*,*) SUM(c), t2-t1 end program 1) gcc –c cclock.c 2) f95 matmul_prof.f90 cclock.o -L. -lblas Link Fortran and C

www.cineca.it Link Fortran and C program rand real(kind=8) :: a external crand call crand( a ) print *,'this is random ', a end program #include void crand_( double * x ) { (*x) = ( (double)random()/(double)RAND_MAX ); } Link a C subroutine with a Fortran program rand.f90 crand.c Fortran passes arguments by reference, C passes them by value

www.cineca.it Link a Fortran subroutine with a C program #include int main() { int n; double a[10], d; n = 10; d = 1.0; setv_( a, &d, &n ); printf("%lf\n", a[0]); } subroutine setv( a, d, n ) real(kind=8) :: a( * ) real(kind=8) :: d integer :: n integer :: i do i = 1, n a( i ) = d end do end subroutine cvec.c vset.f90 Link Fortran and C

www.cineca.it Make Command If a code is large and/or it shares subroutines with other codes, it is useful to split the source in many files that could be placed in different directories. In F90 there are dependencies among program units, i.e. modules must be compiled before than any other program units. Therefore there is a well defined order for compiling source files To avoid compiling by hands the sources in the proper order, the make command could be used

www.cineca.it Make Command The make command can be programmed to do the job for you using a file containing instruction and dircetive. By default the make command looks in the present directory for a file colled Makefile or makefile

www.cineca.it A simple makefile # this is a comment within the makefile myprog.x : modules.o main.o f90 –o myprog.x modules.o main.o modules.o : modules.f90 f90 –c modules.f90 main.o : modules.o main.f90 f90 –c main.f90 this tell to the make command that myprog.x depend from modules.o and main.o make execute the command only when modules.o and main.o have been built to compile the code, from the console the programmer issue the command: > make

www.cineca.it A less simple makefile # this is a comment within the makefile myprog.x : modules.o main.o f90 –o myprog.x modules.o main.o main.o : modules.o.f90.o f90 –c \$< this is an implicit dependency, it state that all files.o depend and should be generated from the corresponding.f90 files this is a make macro, and it is expandend with the proper.f90 filename In the above example, make try to built myprog.x but it realizes that main.o and modules.o should be generated first. Then it starts looking for a rule to make the.o, and it finds that main.o depend on modules.o, and thern make build an internal hierarchy for compilation in which modules.o come before main.o. Finally make finds the implicit rule and starts compiling the sources.

Similar presentations