The Centre for Australian Weather and Climate Research
A partnership between CSIRO and the Bureau of Meteorology

Basic Intel software requirements for Bureau applications specialists for development and optimisation on SUN HPC platform

Ilia Bermous, Senior ITO, CAWCR
22 January 2010
Stages in Optimisation Process
- Compilation and building an executable
- Execution
- Optimisation or code restructuring for better performance
Compilation and Building Executable (1)
- Are any static analysis tools similar in flavour to Cray cflint available?
- The user should be able to identify easily what the compiler has optimised, vectorised and parallelised, and how
- An example is the transformation listing and the formatted listing provided by the NEC SX compilers and cross-compilers
Example

      subroutine vec_par(a,b,c,n,m)
      real, dimension (n,n,n) :: a,b,c
!cdir concur
      do i=1,n
        do k=1,m
          do j=1,n
            a(k,j,i)=b(k,j,i) + c(k,j,i)
          end do
        end do
      end do
      end

Note: !cdir concur is the parallelisation directive.
Transformation Listing

      do i = 1, n
        if (n .gt. 0) then
          J1 = and(n,3)
          do j = 1, J1
!CDIR NODEP
            do k = 1, m
              a(k,j,i) = b(k,j,i) + c(k,j,i)
            end do
          end do
          do j = 1, n/4
!CDIR NODEP
            do k = 1, m
              a(k,(j-1)*4+J1+1,i) = b(k,(j-1)*4+J1+1,i) + c(k,(j-1)*4+J1+1,i)
              a(k,(j-1)*4+2+J1,i) = b(k,(j-1)*4+2+J1,i) + c(k,(j-1)*4+2+J1,i)
              a(k,(j-1)*4+3+J1,i) = b(k,(j-1)*4+3+J1,i) + c(k,(j-1)*4+3+J1,i)
              a(k,j*4+J1,i) = b(k,j*4+J1,i) + c(k,j*4+J1,i)
            end do
          end do
        endif
      end do

Annotations on the slide:
- loop interchange: the k loop has become the innermost loop
- loop unrolling: the j loop is unrolled four times, with a remainder loop of and(n,3) iterations
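The index arithmetic in the transformation listing can be checked with a short sketch. The Python below is not compiler output, and the function name `unrolled_j_indices` is hypothetical. It replays only the j-loop bounds from the listing (a remainder loop of `and(n,3)` iterations followed by `n/4` iterations unrolled four times) and lets the reader confirm that every j from 1 to n is visited exactly once.

```python
def unrolled_j_indices(n):
    """Return the j indices touched by the transformed loops, in visiting order."""
    visited = []
    if n > 0:
        j1 = n & 3                      # J1 = and(n,3): leftover iterations (n mod 4)
        for j in range(1, j1 + 1):      # remainder loop: do j = 1, J1
            visited.append(j)
        for j in range(1, n // 4 + 1):  # unrolled loop: do j = 1, n/4
            visited.append((j - 1) * 4 + j1 + 1)
            visited.append((j - 1) * 4 + 2 + j1)
            visited.append((j - 1) * 4 + 3 + j1)
            visited.append(j * 4 + j1)
    return visited


if __name__ == "__main__":
    for n in (1, 4, 7, 10):
        print(n, unrolled_j_indices(n))
```

Running the remainder iterations first means the unrolled loop always handles a multiple of four iterations, so no bounds test is needed inside the unrolled body.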
Formatted Listing

...
 4: P------>  do i=1,n
 5: |X----->    do k=1,m
 6: ||+---->      do j=1,n
 7: |||             a(k,j,i)=b(k,j,i) + c(k,j,i)
 8: ||+----       end do
 9: |X-----     end do
10: P------   end do
...

- loops 5-9 are interchanged and vectorised
- loop 4-10 is parallelised
Compilation and Building Executable (2)
- Requirement for more robust Fortran/C compilers
- We have been affected by compiler bugs: as soon as we started to use the Fortran compiler for our applications an optimisation bug was detected, and a number of other compiler problems have since been reported to Intel
- Current Intel compiler versions still have too many bugs
- According to a recent report (*), 169 bugs were fixed in the latest version (11) of the Fortran compiler. From my point of view, some of them are very dangerous.
- (*) Release notes for each compiler revision should include a section stating what has actually been implemented for this particular revision
Execution
- At the end of execution, important performance characteristics should be readily available so that the user can tell whether the application has run efficiently or not
- On NEC SX, for any Fortran application, the following characteristics are printed out by setting an environment variable, without any impact on the application performance:
    - MFLOPS
    - Vector Operation Ratio and Average Vector Length
    - Instruction/Operand Cache miss time
    - Bank Conflict time
- The NEC ftrace tool provides similar performance characteristics for any program unit compiled with the special "-ftrace" option
- With code directives, this information can be obtained for any code section, starting and ending anywhere in any program unit
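The idea of collecting figures for an arbitrary code section, then printing a summary at the end of the run, can be sketched as follows. This is not NEC ftrace and does not reproduce its directives or output; `RegionTimer` and `region` are hypothetical names, and the sketch records only call counts and wall-clock time.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class RegionTimer:
    """Accumulate call counts and elapsed wall-clock time per named region."""

    def __init__(self):
        self.calls = defaultdict(int)
        self.seconds = defaultdict(float)

    @contextmanager
    def region(self, name):
        # A region may start and end anywhere; the timer wraps whatever runs inside.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.seconds[name] += time.perf_counter() - start
            self.calls[name] += 1

    def report(self):
        """Return one summary line per region, slowest region first."""
        names = sorted(self.seconds, key=self.seconds.get, reverse=True)
        return [
            f"{n:<12} calls={self.calls[n]:<6} time={self.seconds[n]:.6f}s"
            for n in names
        ]


if __name__ == "__main__":
    timer = RegionTimer()
    for _ in range(3):
        with timer.region("add_arrays"):
            a = [x + y for x, y in zip(range(100000), range(100000))]
    with timer.region("checksum"):
        total = sum(a)
    print("\n".join(timer.report()))
```

Unlike the hardware counters described on the slide, a software timer of this kind adds some overhead of its own, so it is best kept around coarse regions rather than inner loops.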
Program Information Output

Global Data of 3 processes   :  Min [U,R]   Max [U,R]   Average
==========================
Real Time (sec)              :  [0,1]  [0,2]
User Time (sec)              :  [0,0]  [0,2]
System Time (sec)            :  [0,1]  [0,2]
Vector Time (sec)            :  [0,1]  [0,0]
Instruction Count            :  [0,1]  [0,0]
Vector Instruction Count     :  [0,1]  [0,2]
Vector Element Count         :  [0,1]  [0,2]
FLOP Count                   :  [0,0]  [0,2]
MOPS                         :  [0,1]  [0,0]
MFLOPS                       :  [0,2]  [0,0]
Average Vector Length        :  [0,0]  [0,2]
Vector Operation Ratio (%)   :  [0,0]  [0,2]
Memory size used (MB)        :  [0,0]  [0,1]
MIPS                         :  [0,1]  [0,0]
Instruction Cache miss (sec) :  [0,1]  [0,2]
Operand Cache miss (sec)     :  [0,0]  [0,2]
Bank Conflict Time (sec)     :  [0,0]  [0,2]
Max. Concurrent Processes    :  8 [0,0]  8 [0,0]  8
MOPS (concurrent)            :  [0,0]  [0,2]
MFLOPS (concurrent)          :  [0,0]  [0,1]
MIPS (concurrent)            :  [0,2]  [0,0]
Event Busy Count             :  0 [0,0]  0 [0,0]  0
Event Wait (sec)             :  [0,0]  [0,0]
Lock Busy Count              :  [0,2]  [0,0]
Lock Wait (sec)              :  [0,0]  [0,1]
Barrier Busy Count           :  0 [0,0]  0 [0,0]  0
Barrier Wait (sec)           :  [0,0]  [0,0]  0.000
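Several rows of this output are derived from the raw counters in the other rows. The sketch below shows one way those relationships are commonly stated for vector machines. The formulas are assumptions rather than something confirmed by the slides, the function name `derived_metrics` is hypothetical, and the counter values in the usage example are invented for illustration (they are not taken from the run shown above).

```python
def derived_metrics(user_time_sec, instruction_count,
                    vector_instruction_count, vector_element_count, flop_count):
    """Compute derived figures from raw counters (assumed definitions)."""
    # Each vector instruction processes many elements, so total operations are
    # taken as the scalar instructions plus the vector elements.
    scalar_instructions = instruction_count - vector_instruction_count
    total_operations = scalar_instructions + vector_element_count
    return {
        "mops": total_operations / (user_time_sec * 1.0e6),
        "mflops": flop_count / (user_time_sec * 1.0e6),
        "average_vector_length": vector_element_count / vector_instruction_count,
        "vector_operation_ratio_pct": 100.0 * vector_element_count / total_operations,
    }


if __name__ == "__main__":
    # Hypothetical counter values, for illustration only.
    metrics = derived_metrics(
        user_time_sec=2.0,
        instruction_count=1_000_000,
        vector_instruction_count=400_000,
        vector_element_count=80_000_000,
        flop_count=50_000_000,
    )
    for name, value in metrics.items():
        print(f"{name:<28} {value:10.3f}")
```

A high vector operation ratio together with a long average vector length is the usual sign that a code is making good use of a vector processor, which is why both figures are worth having at the end of every run.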
I/O Information

****** File Information ******
Unit No.                         : 20
File Name                        : BX
Named                            : YES
Current Directory                : /bm/flush3/iliab/gasp/test/
I/O Exec. Count                  : READ WRITE OPEN CLOSE INQUIRE FIND DEFINE FILE 0 0
Format                           : UNFORMATTED
Blank                            : ----
Access                           : DIRECT
Recl (Byte)                      :
Max Record No.                   : 3911
File Size (Byte)                 :
File Descriptor                  : 3
File System Type                 : NFS
Open Mode                        : READWRITE
Terminal Assignment              : NO
I/O Buffer Size (KByte,F_SETBUF) : 1024

                                   Total(In/Out)  Input  Output
Total Data Size (Byte)           : , , 0
Max Data Size (Byte)             : 45056, 0
Min Data Size (Byte)             : 35640, 0
Ave Data Size (Byte)             : 45003, 45003, 0
Transfer Rate (KByte/sec)        : , ,

                                   Total(In/Out/Aux)  Input  Output
RTP-call Count                   : 535, 534, 0
System-call Count (read/write)
  Exec. Count                    : 68, 0
  Ave Data Size (Byte)           : , 0
Real Time (sec)                  : , ,
User Time (sec)                  : , ,

F_INPUT Option                   : NO
F_OUTPUT Option                  : NO
F_NORCW Option                   : NO
F_PARTRCW Option                 : NO
F_EXPRCW Option                  : NO
F_UFMTFLOAT1 Option              : NO
F_UFMTFLOAT2 Option              : NO
F_UFMTIEEE Option                : NO
F_UFMTENDIAN Option              : NO
F_UFMTADJUST Option              : NO
F_HSDIR Option                   : NO
F_VSPACING Option                : NO
F_PROMOTE Option                 : NO
Optimisation (1)
Need to know:
- Which performance characteristics matter most for improving an application's performance?
- How should these primary performance characteristics be measured?
- How should these primary performance characteristics be addressed?
Optimisation (2)
Optimisation manuals and documentation:
- Manuals need to include a description of each technique, illustrated by simple examples.
- Significant improvement is required in the existing manuals, for example "Intel(R) Fortran Compiler Optimizing Applications Document Number: US"
- The document "Consistency of Floating-Point Results using the Intel Compiler" was very useful for understanding how to get reproducible results, but its content should be included in the main manuals.
- Are there Intel websites available where further related information can be found?
- Manuals and release notes should be in one place with good indexing and searching.
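Why reproducible results need attention at all can be shown with a small sketch. It is not taken from the Intel document; the function names `sequential_sum` and `lane_sum` are hypothetical and the data values are chosen purely to expose rounding. Floating-point addition is not associative, so a sum evaluated strictly left to right and the same sum evaluated as per-lane partial sums (the order a vectorised reduction might use) can differ.

```python
import math


def sequential_sum(values):
    """Add the values strictly left to right, as a scalar loop would."""
    total = 0.0
    for v in values:
        total += v
    return total


def lane_sum(values, lanes):
    """Add the values as a vectorised reduction might: keep one partial sum
    per lane, then combine the partial sums."""
    partial = [0.0] * lanes
    for i, v in enumerate(values):
        partial[i % lanes] += v
    return sequential_sum(partial)


if __name__ == "__main__":
    data = [1e16, 1.0, -1e16, 1.0]  # illustrative values only
    print("sequential:", sequential_sum(data))
    print("2 lanes   :", lane_sum(data, 2))
    print("math.fsum :", math.fsum(data))
```

Both orderings are legitimate ways to evaluate the same expression, which is why optimisation level and vectorisation settings can change the last digits of a result even when the compiler is working correctly.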
Summary
We need a user-friendly software environment at each stage of the performance tuning procedure