Compiler and Developing Tools


1 Compiler and Developing Tools
Yao-Yuan Chuang

2 Fortran and C Compilers
Freeware: the GNU Fortran and C compilers g77, g95, gfortran, and gcc are popular. Windows ports are available through either MinGW or Cygwin.
Proprietary: Portland Group (http://www.pgroup.com), Intel Compiler (http://www.intel.com), Absoft, Lahey
Comparison: http://fortran-2000.com/ArnaudRecipes/CompilerTricks.html
Fortran under Linux: http://www.nikhef.nl/~templon/fortran.html

3 Compiler
Compilers act on Fortran or C source code (*.f or *.c) and generate an assembler source file (*.s). An assembler converts the assembler source into an object file (*.o). A linker then combines the object files (*.o) and the library files (*.a) into an executable file. Without specifying the executable filename, by default it is called a.out.
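These stages can be made visible with gcc by stopping after each one; a minimal sketch (hello.c is an assumed source file):
% gcc -S hello.c        # compile only: produces the assembler source hello.s
% as hello.s -o hello.o # assemble: produces the object file hello.o
% gcc hello.o -o hello  # link; without -o the executable is named a.out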

4 C vs. Fortran
Fortran: Introduction to Scientific Computing (http://pro3.chem.pitt.edu/richard/chem3400/)
C/C++: Computational Physics (http://www.physics.ohio-state.edu/~ntg/780/computational_physics_resources.php)

5 Code Optimization Yao-Yuan Chuang

6 Code Optimization
A compiler is a program that reads a source program written in a high-level language and translates it into machine language. An "optimizing compiler" generates "optimized" machine language code that takes less time to run, occupies less memory, or both.
Assembly Language generated from the GNU C compiler (http://linuxgazette.net/issue71/joshi.html)
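The effect of optimization on the generated assembly can be inspected directly; a minimal example (loop.c is an assumed file name):
% gcc -O0 -S loop.c
% gcc -O2 -S loop.c
Comparing the two versions of loop.s shows what the optimizer changed.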

7 Code Optimization
Make your code run faster
Make your code use less memory
Make your code use less disk space
Optimization always comes before parallelization

8 Optimization Strategy
Start from the unoptimized code and set up a reference case
Apply compiler optimization (-O1 to -O3, other options)
Include numerical libraries
Profile and identify the bottleneck
Optimization loop: apply optimization techniques, check the results
Result: optimized code

9 CPU Insights
L2/L3 cache, instruction cache, registers (L1 cache)
Load/store units with pipelines
FP unit, FX unit, FMA unit, vector unit, specialized units
Main memory

10 Memory Access
CPU registers / L1 cache: 1 ~ 4 cycles
L2/L3 cache: 8 ~ 20 cycles
TLB (Translation Look-aside Buffer): a list of the most recently accessed memory pages; a TLB miss costs 30 ~ 60 cycles
RAM memory pages, located via the PFT (Page Frame Table): 8000 cycles or more, and far more if the page must come from disk

11 Measuring Performance
Measurement guidelines
Time measurement
Profiling tools
Hardware counters

12 Measurement Guidelines
Make sure you have full access to the processor
Always check the correctness of the results
Use all the tools available
Watch out for overhead
Compare to the theoretical peak performance

13 Time Measurement
On Linux:
% time hello
In Fortran 90:
CALL SYSTEM_CLOCK(count1, count_rate, count_max)
... calculation ...
CALL SYSTEM_CLOCK(count2, count_rate, count_max)
The elapsed time in seconds is then (count2 - count1) / REAL(count_rate).
Or use the etime() function with the PGI compiler.
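For C code, the same wall-clock measurement can be made with the POSIX clock_gettime() call; a minimal self-contained sketch (the summation loop merely stands in for the real calculation):

#include <stdio.h>
#include <time.h>

int main(void) {
    struct timespec t1, t2;
    double sum = 0.0;

    clock_gettime(CLOCK_MONOTONIC, &t1);      /* start */
    for (long i = 1; i <= 100000000L; i++)    /* ... calculation ... */
        sum += 1.0 / (double)i;
    clock_gettime(CLOCK_MONOTONIC, &t2);      /* stop */

    double elapsed = (t2.tv_sec - t1.tv_sec)
                   + (t2.tv_nsec - t1.tv_nsec) / 1e9;
    printf("sum = %f  elapsed = %f s\n", sum, elapsed);
    return 0;
}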

14 Components of computing time
User time: the amount of time the instructions in your program take on the CPU.
System time: most scientific programs require the OS kernel to carry out certain tasks, such as I/O. While the kernel carries out these tasks, your program is not occupying the CPU; the system time is a measure of the time your program spends waiting for kernel services.
Elapsed time: the wall-clock time, or real-world time, taken by the program.

15 Profiling Tools
% gcc -pg newtest.c -g -o newtest
% gprof newtest > newtest.out
% less newtest.out
The profiling information tells you how many times each function is called and how much time is spent in each, so you can improve the most "critical" steps of the program.
It reports the portion of time the program spends in each of the subroutines and/or functions, and is most useful when your program has many subroutines and/or functions.
Use profiling at the beginning of the optimization process.
The PGI profiling tool is called pgprof and works with code compiled with -Mprof=func.

16 Hardware Counters
All modern processors have built-in event counters
Processors may have several registers reserved for counters
Counters can be started, stopped, and reset
A software API can be used to access the counters
Using hardware counters is a must in optimization

17 Software API - PAPI
Performance Application Programming Interface: a standardized API to access hardware counters
Available on most systems: Linux, Windows NT, Solaris, ...
Motivation:
To provide a solid foundation for cross-platform performance analysis tools
To present a set of standard definitions for performance metrics
To provide a standardized API
To be easy to use, well documented, and freely available
Web site:
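As an illustration, a minimal sketch of counting total cycles with PAPI's low-level C API (the PAPI_TOT_CYC preset may be unavailable on some processors, so treat the event choice as an assumption):

#include <stdio.h>
#include <papi.h>

int main(void) {
    int eventset = PAPI_NULL;
    long long cycles;

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
        return 1;                              /* header/library mismatch */
    PAPI_create_eventset(&eventset);
    PAPI_add_event(eventset, PAPI_TOT_CYC);    /* total-cycles preset event */

    PAPI_start(eventset);
    /* ... calculation to be measured ... */
    PAPI_stop(eventset, &cycles);              /* stop and read the counter */

    printf("total cycles: %lld\n", cycles);
    return 0;
}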

18 Optimization Techniques
Compiler options
Use existing libraries
Numerical instabilities
FMA units
Vector units
Array considerations
Tips and tricks

19 Compiler Options
Substantial gains can easily be obtained by playing with compiler options.
Optimization options are "a must"; the first and second levels of optimization almost always give some benefit.
Optimization options can range from -O1 to -O5 with some compilers. -O3 to -O5 might lead to slower code, so try them independently on each subroutine.
Always check your results when trying optimization options.
Compiler options might include hardware specifics such as accessing vector units.

20 Compiler Options
GNU C compiler gcc: -O0 -O1 -O2 -O3 -finline-functions ...
PGI Workstation compilers pgcc, pgf90, and pgf77: -O0 -O1 -O2 -O3 ...
Intel Fortran and C compilers ifc and icc: -O0 -O1 -O2 -O3 -ip -xW -tpp7 ...

21 Existing Libraries
Existing libraries are usually highly optimized
Try several libraries and compare them if possible
Recompile libraries on the platform you are running on if you have the source
Vendor libraries are usually well optimized for their platform
Popular mathematical libraries: BLAS, LAPACK, ESSL, FFTW, MKL, ACML, ATLAS, GSL, ...
Watch out for cross-language usage (calling Fortran from C or calling C from Fortran)
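For instance, a matrix multiply is better delegated to an optimized BLAS than coded by hand. A minimal sketch using the C interface (CBLAS), assuming an implementation such as ATLAS or MKL is installed and linked:

#include <stdio.h>
#include <cblas.h>

int main(void) {
    /* C = alpha*A*B + beta*C for 2x2 row-major matrices */
    double A[4] = {1, 2, 3, 4};
    double B[4] = {5, 6, 7, 8};
    double C[4] = {0, 0, 0, 0};

    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 2,        /* M, N, K       */
                1.0, A, 2,      /* alpha, A, lda */
                B, 2,           /* B, ldb        */
                0.0, C, 2);     /* beta, C, ldc  */

    printf("%g %g\n%g %g\n", C[0], C[1], C[2], C[3]);
    return 0;
}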

22 Numerical Instabilities
Specific to each problem
Can lead to much longer run times
Can lead to wrong results
Examine the mathematics of the solver
Look for operations involving very large and very small numbers
Be careful when using higher compiler optimization options
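A classic instance of the large-and-small-number problem is catastrophic cancellation; a minimal single-precision illustration:

#include <stdio.h>

int main(void) {
    float big = 1.0e8f, small = 1.0f;
    /* (big + small) - big should equal 1.0, but 1.0e8 already
       exhausts float's 24-bit significand, so small is lost */
    float r = (big + small) - big;
    printf("%f\n", r);   /* prints 0.000000 */
    return 0;
}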

23 FMA units
An FMA (fused multiply-add) unit computes Y = A*X + B, carrying out the multiplication and the addition in 1 cycle.

24 Vector units
A vector unit applies the same operation (+, -, *) to several operands at once. The 128-bit long vector units of the P4 and Opteron process four 32-bit precision values (x1, x2, x3, x4) or two 64-bit precision values (x1, x2) per instruction:
4 single precision FLOPs/cycle
2 double precision FLOPs/cycle
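As an illustration, a minimal sketch of a four-wide single-precision add written with SSE intrinsics (compilers can also generate such instructions automatically, e.g. via the vectorization options discussed later):

#include <stdio.h>
#include <xmmintrin.h>   /* SSE intrinsics */

int main(void) {
    float x[4] = {1, 2, 3, 4}, y[4] = {10, 20, 30, 40}, r[4];

    __m128 vx = _mm_loadu_ps(x);       /* load 4 floats              */
    __m128 vy = _mm_loadu_ps(y);
    __m128 vr = _mm_add_ps(vx, vy);    /* 4 additions, 1 instruction */
    _mm_storeu_ps(r, vr);

    printf("%g %g %g %g\n", r[0], r[1], r[2], r[3]);
    return 0;
}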

25 Array Considerations
Fortran stores arrays in column-major order, so the first index should vary in the innermost loop. In Fortran, instead of
do i = 1, 5
  do j = 1, 5
    a(i,j) = ...
  enddo
enddo
write
do j = 1, 5
  do i = 1, 5
    a(i,j) = ...
  enddo
enddo
C/C++ stores arrays in row-major order, so the last index should vary in the innermost loop. In C/C++, instead of
for (j = 1; j <= 5; j++) {
  for (i = 1; i <= 5; i++) {
    a[i][j] = ...;
  }
}
write
for (i = 1; i <= 5; i++) {
  for (j = 1; j <= 5; j++) {
    a[i][j] = ...;
  }
}
In each case the inner loop then walks contiguously through the corresponding memory representation.

26 Tips and Tricks: Sparse Arrays
Hard to optimize because accessing them requires jumps through memory
Minimize the memory jumps
Carefully analyze the construction of the sparse array; a pointer technique can help, but it can be confusing
Lower your expectations
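One common way to minimize those jumps is a compressed storage format such as compressed sparse row (CSR). A minimal sketch of a CSR matrix-vector product (the 3x3 example matrix is hypothetical):

#include <stdio.h>

int main(void) {
    /* 3x3 matrix with 4 nonzeros:
         [5 0 0]
         [0 8 0]
         [3 0 6]                                              */
    double val[]    = {5, 8, 3, 6};    /* nonzero values           */
    int    col[]    = {0, 1, 0, 2};    /* their column indices     */
    int    rowptr[] = {0, 1, 2, 4};    /* start of each row in val */
    double x[] = {1, 2, 3}, y[3];

    for (int i = 0; i < 3; i++) {      /* y = A*x over nonzeros only */
        y[i] = 0.0;
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
            y[i] += val[k] * x[col[k]];
    }
    printf("%g %g %g\n", y[0], y[1], y[2]);
    return 0;
}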

27 Minimize the Number of Operations
During optimization, the first thing to do is reduce the number of unnecessary operations performed by the CPU.
1250 million operations:
do k = 1, 10
  do j = 1, 5000
    do i = 1, 5000
      a(i,j,k) = 3.0*m*d(k) + c(j)*23.1 - b(i)
    enddo
  enddo
enddo
500 million operations:
do k = 1, 10
  dtmp(k) = 3.0*m*d(k)
  do j = 1, 5000
    ctmp(j) = c(j)*23.1
    do i = 1, 5000
      a(i,j,k) = dtmp(k) + ctmp(j) - b(i)
    enddo
  enddo
enddo

28 Complex Numbers
Watch for operations on complex numbers whose imaginary or real part equals zero.
6 million operations:
! Real part of a = 0
complex*16 a(1000,1000), b
complex*16 c(1000,1000)
do j = 1, 1000
  do i = 1, 1000
    c(i,j) = a(i,j)*b
  enddo
enddo
2 million operations:
real*8 aI(1000,1000)
complex*16 b, c(1000,1000)
do j = 1, 1000
  do i = 1, 1000
    c(i,j) = CMPLX(-IMAG(b)*aI(i,j), aI(i,j)*REAL(b))
  enddo
enddo

29 Loop Overhead and Objects
A very short innermost loop pays mostly loop overhead. Instead of
do j = 1, ...
  do i = 1, ...
    do k = 1, 2
      a(i,j,k) = b(i,j)*c(k)
    enddo
  enddo
enddo
unroll the short loop by hand:
do j = 1, ...
  do i = 1, ...
    a(i,j,1) = b(i,j)*c(1)
    a(i,j,2) = b(i,j)*c(2)
  enddo
enddo
Object declarations: in an object-oriented language, AVOID object declarations within the innermost loops.

30 Function Call Overhead
Calling a function inside a hot loop adds call overhead on every iteration. Instead of
do k = 1, ...
  do j = 1, ...
    do i = 1, 5000
      a(i,j,k) = fl(c(i), b(j), k)
    enddo
  enddo
enddo

function fl(x, y, m)
real*8 x, y, fl
integer m
fl = x*m - y
return
end
inline the body by hand:
do k = 1, ...
  do j = 1, ...
    do i = 1, 5000
      a(i,j,k) = c(i)*k - b(j)
    enddo
  enddo
enddo
This can also be achieved with the compiler's inlining options. The compiler will then replace all function calls by a copy of the function code, sometimes leading to a very large binary executable:
% ifc -ip
% icc -ip
% gcc -finline-functions

31 Blocking
Blocking is used to reduce cache and TLB misses in nested matrix operations. The idea is to process as much of the data brought into the cache as possible before it is evicted. Instead of
do i = 1, n
  do j = 1, n
    do k = 1, n
      C(i,j) = C(i,j) + A(i,k)*B(k,j)
    enddo
  enddo
enddo
work block by block:
do ib = 1, n, bsize
  do jb = 1, n, bsize
    do kb = 1, n, bsize
      do i = ib, min(n, ib+bsize-1)
        do j = jb, min(n, jb+bsize-1)
          do k = kb, min(n, kb+bsize-1)
            C(i,j) = C(i,j) + A(i,k)*B(k,j)
          enddo
        enddo
      enddo
    enddo
  enddo
enddo

32 Loop Fusion
The main advantage of loop fusion is the reduction of cache misses when the same array is used in both loops. It also reduces loop overhead and allows better control of multiple instructions in a single cycle, when the hardware allows it. Instead of
do i = 1, 100000
  a = a + x(i)*z(i)
enddo
do j = 1, 100000
  v = 3.0*x(j) - ...
enddo
fuse the two loops:
do i = 1, 100000
  a = a + x(i)*z(i)
  v = 3.0*x(i) - ...
enddo

33 Loop Unrolling
The main advantage of loop unrolling is to reduce or eliminate data dependencies in loops. This is particularly useful on a superscalar architecture.
2000 cycles:
do i = 1, 1000
  a = a + x(i)*y(i)
enddo
1250 cycles with 2 FMAs or vector units (length of 2):
do i = 1, 1000, 4
  a = a + x(i)*y(i) + x(i+1)*y(i+1) + x(i+2)*y(i+2) + x(i+3)*y(i+3)
enddo

34 Sum Reduction
Sum reduction is another way of reducing or eliminating data dependencies in loops. It is more explicit than loop unrolling.
2000 cycles:
do i = 1, 1000
  a = a + x(i)*y(i)
enddo
751 cycles with 2 FMAs or vector units (length of 2):
do i = 1, 1000, 4
  a1 = a1 + x(i)*y(i) + x(i+1)*y(i+1)
  a2 = a2 + x(i+2)*y(i+2) + x(i+3)*y(i+3)
enddo
a = a1 + a2

35 Better Performance in Math
Replace divisions by multiplications. Contrary to floating-point multiplications, additions, or subtractions, divisions are very costly in terms of clock cycles: 1 multiplication = 1 cycle, 1 division = 14 ~ 20 cycles.
Use repeated multiplications for exponentials. Exponentiation is a function call; if the exponent is a small integer, the multiplication should be done manually.
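A minimal illustration of both ideas in C (rounding can differ in the last bits when a division is replaced by a reciprocal multiplication):

#include <stdio.h>

int main(void) {
    double x[1000], s = 2.5, sum = 0.0;
    for (int i = 0; i < 1000; i++) x[i] = i + 1.0;

    /* replace x[i]/s in the loop by one division done up front */
    double rs = 1.0 / s;                 /* single division      */
    for (int i = 0; i < 1000; i++)
        sum += x[i] * rs;                /* multiplications only */

    /* small integer exponent: x*x*x instead of calling pow(x, 3.0) */
    double cube = x[9] * x[9] * x[9];

    printf("sum = %f  cube = %f\n", sum, cube);
    return 0;
}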

36 Portland Group Compiler
A comprehensive discussion of the Portland Group compiler optimization options is given in the PGI User's Guide, available at
Information on how to use Portland Group compiler options can be obtained on the command line with
% pgf77 -fastsse -help
Detailed information on the optimizations and transformations (i.e. loop unrolling) carried out by the compiler is given by the -Minfo option. This is often useful when your code produces unexpected results.

37 Portland Group Compiler
Important compiler optimization options for the Portland Group compiler include:
-fast includes "-O2 -Munroll -Mnoframe -Mlre"
-fastsse includes "-fast -Mvect=sse -Mcache_align"
-Mipa=fast enables inter-procedural analysis (IPA) and optimization
-Mipa=fast,inline enables IPA-based optimization and function inlining
-Mpft ... -Mpfo enables profile and data feedback based optimizations
-Minline inlines functions and subroutines
-Mconcur tries to autoparallelize loops for SMP/dual-core systems
-mcmodel=medium enables data > 2GB on Opterons running 64-bit Linux
A good start for your compilation needs is: -fastsse -Mipa=fast

38 Optimization Levels
With the Portland Group compiler the different optimization levels correspond to:
-O0: level zero specifies no optimization. The intermediate code is generated and used directly for the machine code.
-O1: level one specifies local optimizations, i.e. local to a basic block.
-O2: level two specifies global optimizations. These optimizations occur over all the basic blocks and the control-flow structure.
-O3: level three specifies aggressive global optimization. All level-one and level-two optimizations are also carried out.

39 -Munroll option
The -Munroll compiler option unrolls loops. This has the effect of reducing the number of iterations in the loop by executing multiple instances of the loop statements in each iteration. For example:
do i = 1, 100
  z = z + a(i) * b(i)
enddo
becomes
do i = 1, 100, 2
  z = z + a(i) * b(i)
  z = z + a(i+1) * b(i+1)
enddo
Loop unrolling reduces the overhead of maintaining the loop index, and permits better instruction scheduling (control of sending the instructions to the CPU).

40 -Mvect=sse option
The Portland Group compiler can be used to vectorize code. Vectorization transforms loops to improve memory access performance (i.e. to maximize the usage of the various memory components, such as registers and cache). SSE is an acronym for Streaming SIMD Extensions, a set of CPU instructions, first introduced with the Intel Pentium III and AMD Athlon, that allows the same operation to act on multiple data items concurrently. The use of this compiler option can double the execution speed of a code.

41 Intermediate Language
The intermediate language used by the compiler is a language somewhere between the high-level language used by the programmer (i.e. Fortran, C) and the assembly language used by the machine. The intermediate language is easier for the compiler to manipulate than source code: it contains not only the algorithm specified in the source code, but also explicit expressions for calculating the memory addresses (which can themselves be subject to optimization). This makes it much easier for the compiler to optimize source code.

42 Intermediate Language - quadruples
Calculations in intermediate languages are simplified into quadruples: arithmetic expressions are broken down into calculations involving only two operands and one operator. This makes sense when considering how a CPU carries out a calculation. The simplification is illustrated with the following expression, which is broken into quadruples by using temporary variables:
A = -B + C * D / E
T1 = D / E
T2 = C * T1
T3 = -B
A = T3 + T2

43 Basic Blocks
A more realistic example of intermediate language is given by the code below, which can be broken down into three basic blocks. A basic block is the unit of "local" compiler optimization: a collection of statements that begins with a statement that either follows a branch (e.g. an IF) or is itself the target of a branch, and that has only one entrance (the top) and one exit (the bottom).
do while (j .lt. n)
  k = k + j * 2
  m = j * 2
  j = j + 1
enddo

44 Basic Block Flow Graph
A:: t1 := j
    t2 := n
    t3 := t1 .lt. t2
    jump (B) t3
    jump (C) TRUE
B:: t4 := k
    t5 := j
    t6 := t5 * 2
    t7 := t4 + t6
    k := t7
    t8 := j
    t9 := t8 * 2
    m := t9
    t10 := j
    t11 := t10 + 1
    j := t11
    jump (A) TRUE
C:: (the code following the loop)

45 Write Efficient C and Code Optimization
Use unsigned int instead of int
Combine division and remainder
Division and remainder by powers of two
Use switch instead of if ... else ...
Loop unrolling
Use lookup tables
(http://www.codeproject.com/cpp/C___Code_Optimization.asp)
Two of these tricks are sketched below.
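A minimal sketch of the power-of-two and lookup-table tricks (the 4-bit popcount table is a hypothetical example; with unsigned operands a good compiler applies the shift/mask rewrite itself):

#include <stdio.h>

int main(void) {
    unsigned int n = 1234;

    /* division and remainder by a power of two (here 8 = 2^3):
       a shift and a mask replace two costly divide instructions */
    unsigned int q = n >> 3;     /* n / 8 */
    unsigned int r = n & 7u;     /* n % 8 */

    /* lookup table: number of set bits in each 4-bit value */
    static const unsigned char bits4[16] =
        {0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4};
    unsigned int ones = 0;
    for (unsigned int v = n; v != 0; v >>= 4)
        ones += bits4[v & 15u];  /* one lookup per nibble */

    printf("q=%u r=%u ones=%u\n", q, r, ones);
    return 0;
}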

