Presentation is loading. Please wait.

Presentation is loading. Please wait.

1 Compiler and Developing Tools Yao-Yuan Chuang. 2 Fortran and C Compilers  Freeware The GNU Fortran and C compilers, g77, g95, gcc, and gfortran are.

Similar presentations

Presentation on theme: "1 Compiler and Developing Tools Yao-Yuan Chuang. 2 Fortran and C Compilers  Freeware The GNU Fortran and C compilers, g77, g95, gcc, and gfortran are."— Presentation transcript:

1 1 Compiler and Developing Tools Yao-Yuan Chuang

2 2 Fortran and C Compilers  Freeware The GNU Fortran and C compilers, g77, g95, gcc, and gfortran are popular. The port for window is either MinGW or Cygwin.  Proprietary Portland Group ( Intel Compiler ( Absoft Lehay Comparison (http://fortran- Fortran under Linux (http://www/

3 3 Compiler  Compilers act on Fortran or C source code (*.f or *.c) and generate assembler source file. (*.s)  An assembler converts the assembler source into an object file (*.a)  A linker then combines the object files (*.o) and the library files (*.a) to a executable file.  Without specifying the executable filename, by default it is called a.out.

4 4 C vs. Fortran  Fortran Introduction to Scientific Computing (  C/C++ Computational Physics (http://www.physics.ohio-

5 5 Code Optimization Yao-Yuan Chuang

6 6 Code Optimization  A compiler is a program reads the source program written in high-level language and translate it into machine language.  “ An optimized compiler ” generated “ optimize ” machine language code which takes less time to run, occupied less memory or both.  Assembly Language generated from GNU C compiler. (

7 7 Code Optimization  Make your code run faster  Make your code use less memory  Make your code use less disk space  Optimization always comes before Parallelization

8 8 Optimization Strategy Unoptimized Code Set up reference case Compiler optimization (-O1 to –O3, other options) Include Numerical Lib Profiling Identify bottleneck Optimization Techniques Check result Optimization loop Optimized code

9 9 CPU Insights Instruction cacheRegister (L1 cache) Main Memory L2/L3 cache pipeline Load/store unit Load/store unit FP unit FX unit FMA unit Vector unit Specialized unit

10 10 Memory Access CPU L1 cache L2/L3 cache RAM Disk 1 ~ 4 cycles 8 ~ 20 cycles 8000 – 120000 cycles TLB Translation Look-aside Buffer List of most recently accessed Memory pages TLB miss: 30 ~ 60 cycles PFT Page Frame Table List of memory pages location memory pages

11 11 Measuring Performance  Measurement Guidelines  Time Measurement  Profiling tools  Hardware Counters

12 12 Measuring Guidelines  Make sure you have full access to the processor  Always check the correctness of the results  Use all the tools available  Watch out for overhead  Compare to theoretical peak performance

13 13 Time Measurement  On Linux % time hello  In Fortran 90 Call SYSTEM_CLOCK(count1,count_rate,count_max) … calculation CALL SYSTEM_CLOCK(count2,count_rate,count_max) Or using etime() function with PGI compiler

14 14 Components of computing time  User time The user time corresponds to the amount of time the instructions in your program taken on the CPU  System time Most scientific program require to use the OS kernel to carry out certain tasks, such as I/O. While carrying out these tasks, your program is not occupying the CPU. The system time is a measure of the time your program spends waiting for kernel services.  Elapsed time The elapsed time corresponds to the wall-clock time, or real-world time taken by the program.

15 15 Profiling Tools % gcc –pg newtest.c –g –o newtest % gprof newtest > newtest.out % less newtest.out From profiling information, we know number of the function calls and how much time spend in each function, hence, we can improve the most ‘critical’ step within the program for optimized performance. Tells you the portion of time the program spends in each of the subroutines and/or functions Mostly useful when your program has a lot of subroutines and/or functions Use profiling at the beginning of optimization process PGI profiling tool is calld pgprof which –Mprof=func

16 16 Hardware Counters  All modern processors has built-in event counters  Processors may have several registers reserved for counters  It is possible to start, stop and reset counters  Software API can be used to access counters  Using Hardware Counters is a must in Optimization

17 17 Software API-PAPI  Performance Application Programming Interface  A standardized API to access hardware counters  Available on most systems: Linux, Windows NT, Solaris, …  Motivation To provide solid foundation for cross platform performance analysis tools To present a set of standard definitions for performance metrics To provide a standardized API To be easy to use, well documented, and freely available  Web site:

18 18 Optimization Techniques  Compiler Options  Use Existing Libraries  Numerical Instabilities  FMA units  Vector units  Array Considerations  Tips and Tricks

19 19 Compiler Options  Substantial gain can be easily obtained by playing with compiler options  Optimization options are “ a must ”. The first and second level of optimization will rarely give no benefits.  Optimization options can range from – O1 to – O5 with some compilers. -O3 to – O5 might lead to slower code, so try them independently on each subroutine.  Always check your results when trying optimization options.  Compiler options might include hardware specifics such as accessing vector units.

20 20 Compiler Options  GNU C compiler gcc -O0 –O1 –O2 –O3 –finline-functions …  PGI Workstation Compiler pgcc, pgf90, and pg77 -O0 –O1-O2-O3 …  Intel Fortran and C compiler ifc and icc -O0 –O1 –O2 –O3 –ip –xW –tpp7 …

21 21 Existing Libraries  Existing libraries are usually highly optimized  Try several libraries and compare if possible  Recompile libraries on the platform you are running if you have source  Vendors libraries are usually well optimized for their platform  Popular mathematical libraries: BLAS, LAPACK, ESSL, FFTW, MKL, ACML, ATLAS, GSL …  Watch out for cross language (calling Fortran in C or calling C in Fortran) usage

22 22 Numerical Instabilities  Specific to each problem  Could lead to much longer time  Could lead to wrong result  Examine the mathematics of the solver  Look for operations involving very large and very small numbers  Be careful when using higher compiler optimization options

23 23 FMA units Y=A*X + B 00011000000010001101000011010101 * + Y X B In 1 cycle

24 24 Vector units 00011000000010001101000011010101 Op (+, -,*) = 32 bit precision 00011000000010001101000011010101 Op (+, -,*) = 64 bit precision x1 x2 x3 x4 x1 x2 128 bit long vector unit for P4 and Opteron 4 single precision FLOPs/cycle 2 double precision FLOPs/cycle

25 25 Array Considerations do i=1,5 do j = 1,5 a(i,j)= … enddo In Fortran do j=1,5 do i = 1,5 a(i,j)= … enddo for(j=1;j<=5;j++){ for(i=1;i<=5;i++){ a[i][j]=… } In C/C++ for(i=1;i<=5;i++){ for(j=1;j<=5;j++){ a[i][j]=… } Corresponding memory representation Outer 1 1 1 1 1 Inner 1 2 3 4 5 Outer 1 1 1 1 1 Inner 1 2 3 4 5

26 26 Tips and Tricks  Sparse Arrays Hard to optimize because needs to jumps when accessing memory Minimize the memory jumps Carefully analyze the construction of the sparse array, using pointer technique but it can be confusing Lower your expectation

27 27 Minimize number of Operations  During optimization, first thing needed to do is reducing the number of unnecessary operations performed by the CPU. do k=1,10 do j=1,5000 do i=1,5000 a(i,j,k)=3.0*m*d(k)+c(j)*23.1-b(i) enddo do k=1,10 dtmp(k)=3.0*m*d(k) do j=1,5000 ctmp(j)=c(j)*23.1 do i=1,5000 a(i,j,k)=dtmp(k)+ctmp(j)-b(i) enddo 1250 millions of operations 500 millions of operations

28 28 Complex Numbers  Watch for operations on complex numbers that have imaginary or real part equals to zero. ! Real part = 0 complex *16 a(1000,1000),b complex *16 c(1000,1000) do j=1,1000 do i=1,1000 c(i,j) = a(i,j)*b enddo real *8 aI(1000,1000) complex *16 b,c(1000,1000) do j=1,1000 do i=1,1000 c(i,j) = (-IMAG(b)*aI(i,j), aI(I,j)*REAL(b)); enddo 6 millions of operations2 millions of operations

29 29 Loop Overhead and Object do j = 1,1000000 do i = 1,1000000 do k = 1,2 a(i,j,k)=b(i,j)*c(k) enddo do j = 1,1000000 do i = 1,1000000 a(i,j,1)=b(i,j)*c(1) a(i,j,2)=b(I,j)*c(2) enddo Object declarations In Object-Oriented Language AVOID objects Declarations within the most inner loops

30 30 Function call Overhead do k = 1,1000000 do j = 1,1000000 do i = 1,5000 a(i,j,k)=fl(c(i),b(j),k enddo function fl(x,y,m) real*8 x,y,tmp integer m tmp=x*m-y return tmp end do k = 1,1000000 do j = 1,1000000 do i = 1,5000 a(i,j,k)=c(i)*k-b(j) enddo This can also be achieved with compilers inlining options. The compiler will then replace all function calls by a copy of the function code, sometimes leading to very large binary executable. % ifc –ip % icc –ip % gcc –finline-functions

31 31 Blocking  Blocking is used to reduce cache and TLB misses in nested matrix operations. The idea is to process as much data brought in the cache as possible do i = 1,n do j = 1,n do k = 1,n C(I,j)=C(I,j)+A(I,k)*B(k,j) enddo do ib = 1,n,bsize do jb = 1,n,bsize do kb = 1,n,bsize do i = ib,min(n,ib+bsize-1) do j = jb,min(n,jb+bsize-1) do k = kb,min(n,kb+bsize-1) C(I,j)=C(I,j)+ A(I,k)*B(k,j) enddo

32 32 Loop Fusion  The main advantage of loop fusion is the reduction of cache misses when the same array is used in both loops. It also reduces loop overhead and allow a better control of multiple instructions in a single cycle, when hardware allows it. do i = 1,100000 a = a + x(i) + 2.0 *z(i) enddo do j = 1,100000 v = 3.0*x(j) – 3.314159267 enddo do i = 1,100000 a = a + x(i) + 2.0 *z(i) v = 3.0*x(i) – 3.314159267 enddo

33 33 Loop Unrolling  The main advantage of loop unrolling is to reduce or eliminate data dependencies in loops. This is particularly useful when using a superscalar architecture. do i = 1,1000 a = a + x(i) * y(i) enddo do i = 1,1000,4 a = a + x(i) * y(i) + x(i+1)* y(i+1) + x(i+2)* y(i+2) + x(i+3)* y(i+3) enddo 2000 cycles 1250 cycles 2 FMAs or vector units (length of 2)

34 34 Sum Reduction  Sum reduction is another way of reducing or eliminating data dependencies in loops. It is more explicit than the loop unroll. do i = 1,1000 a = a + x(i) * y(i) enddo do i = 1,1000,4 a1 = a1 + x(i) * y(i) + x(i+1)* y(i+1) a2 = a2 + x(i+2)* y(i+2) + x(i+3)* y(i+3) enddo a = a1 + a2 2000 cycles 751 cycles 2 FMAs or vector units (length of 2)

35 35 Better Performance in Math  Replace division by multiplications Contrary to floating point multiplications, additions, or subtractions, divisions are very costly in terms of clock cycles. 1 multiplication = 1 cycle, 1 division = 14 ~ 20 cycles.  Repeated multiplications for exponentials Exponential is a functional call, if the exponent is small, multiplication should be done manually.

36 36 Portland Group Compiler  A comprehensive discussion of the Portland Group compiler optimization options is given in the PGI User ’ s Guide, available at  Information on how to use Portland Group compiler options can be obtained on the command line with % pgf77 –fastsse –help  Detailed information on the optimization and transformation (i.e. loop unrolling) carried out by the compiler is given by the – Minfo option. This is often useful when your code produces unexpected results.

37 37 Portland Group Compiler  Important compiler optimization options for the Portland Group compiler include: -fastincludes “ -O2 – Munroll – Mnoframe – Mlre ” -fastsse includes “ -fast – Mvec=sse – Mcache_align ” -Mipa=fastenables inter-procedural analysis (IPA) and optimization -Mipa=fast,inline enables IPA-based optimization and function inlining -Mpft … -Mpfo enables profile and data feedback based optimizations -Minlineinline functions and subroutines -Mconcurtry to autoparallelize loops for SMP/dual core systems -mcmodel=medium enable data > 2GB on opterons running 64-bit linux A good start for your compilation needs is: -fastsse – Mipa=fast

38 38 Optimization Levels  With the Portland Group compiler the different optimization levels correspond to: -O0the level-zero flag specifies no optimization. The intermediate code is generated and used for the machine code. -O1the level-one specifies local optimizations, i.e. local to a basic block -O2the level-two specifies global optimizations. These optimizations occur over all the basic blocks, and the control-flow structure. -O3level-three specifies an aggressive global optimization. All level-one and level-two optimizations are also carried out.

39 39 -Munroll option  The – Munroll compiler option unrolls loops. This has the effect of reducing the number of iterations in the loop by executing multople instances of the loop statements in each iteration. For example:  Loop unrolling reduces the overhead of maintaining the loop index, and permits better instruction scheduling (control of sending the instructions to the CPU). do i = 1, 100 z = z + a(i) * b(i) enddo do i = 1, 100, 2 z = z + a(i) * b(i) z = z + a(i+1) * b(i+1) enddo

40 40 -Mvect=sse option  The Portland Group compiler can be used to vecoroze code. Vectorization transforms loops to improve memory access performance (i.e. maximize the usage of the various memory components, such as registers, cache).  SSE is an acronym for Streaming SIMD Extensions, and is a set of CPU instructions, first introduced with the Intel Pentium III and AMD Athlon, which allows for the same operation acting on multiple data items concurrently.  The use of this compiler option can double the execution speed of a code.

41 41 Intermediate Language  The intermediate language used by the compiler is a language somewhere between the high-level language used by the programmer (i.e. Fortran, C), and the assembly language used by the machine.  The intermediate language is easier for the compiler to manipulate than source code. It contains not only the algorithm specified in the source code, but expressions for calculating the memory addresses (which can also be subject to optimization).  The intermediate language makes it much easier for the compiler to optimize source code.

42 42 Intermediate Language - quadruples  Calculations in intermediate languages are simplified into quadruples. Arithmetic expressions are broken down into calculations involving only two operands and one operator. This makes sense when considering how CPU carry out a calculation. This is simplification is illustrated with the following expression which can be simplified into quadruples by using temporary variables: A = -B + C * D / E T1 = D / E T2 = C * T1 T3 = -B A = T3 + T2

43 43 Basic Blocks  A more realistic example of intermediate language is given by using the example code  This code can be broken down into three basic blocks of code. A basic block is a collection of statements used to define local variables in compiler optimization. A basic block begins with a statement that either follows a branch (e.g. an IF), or is itself the target of a branch. A basic block has only one entrance (the top), and one exit (the bottom). do while ( n) k = k + j * 2 m = j * 2 j = j + 1 enddo

44 44 Basic Block Flow Graph A:: t1:=j t2:=n t2 jump (B) t3 jump(C) TRUE B::t4:=k t5:=j t6:=t5*2 t7:=t4+t6 k:=t7 t8:=j t9:=t8*2 m:=t9 t10:=j t11:=t10+1 j:=t11 jump(A) TRUE

45 45 Write Efficient C and Code Optimization  Use unsigned integer instead of Integer  Combining division and remainder  Division and remainder by powers of two  Use switch instead of if … else …  Loop unrolling  Use lookup tables  ( asp)

Download ppt "1 Compiler and Developing Tools Yao-Yuan Chuang. 2 Fortran and C Compilers  Freeware The GNU Fortran and C compilers, g77, g95, gcc, and gfortran are."

Similar presentations

Ads by Google