Code Tuning and Optimization Doug Sondak Boston University Scientific Computing and Visualization.

Code Tuning and Optimization Doug Sondak sondak@bu.edu Boston University Scientific Computing and Visualization

Outline
- Introduction
- Example code
- Timing
- Profiling
- Cache
- Tuning

Information Services & Technology 10/6/2015

Introduction
- Timing
  - Where is most time being used?
- Tuning
  - How to speed it up
  - Often as much art as science
- Parallel Performance
  - How to assess how well parallelization is working

Example Code

Example Code
- Simulation of response of eye to stimuli
- Response is affected by adjacent inputs
  - A dark area next to a bright area makes the bright area look brighter
- Based on Grossberg & Todorovic paper
  - Appendix in paper contains all equations
  - errors in eqns (A4) and (A5) – cross out “log2”
- Paper contains 6 levels of response
  - Our code only contains levels 1 through 5
  - Level 6 takes a long time to compute, and would skew our timings!

Example Code (cont’d)
- All calculations done on a square array
- Array size and other constants are defined in gt.h (C) or in the “mods” module at the top of the code (Fortran)
- Due to nature of algorithm, array is padded on all sides
  - npad is size of padding

Example Code – Level 1
- Luminance (input) distribution
- Paper (and code) use “yin-yang square”
- Array I
  - magnitude of “bright” is ihigh
  - magnitude of “dark” is ilow
- Figure: Fig. 4 in paper (bright and dark regions labeled)

Example Code – Level 2
- Level 2 – Circular Concentric On and Off Units
- Excitation and inhibition vary with distance
- Figure: Fig. 5 in paper

Level 2 Equations
- I_pq = initial input (yin-yang)

Example Code – Level 3
- Oriented Direction-of-Contrast-Sensitive Units
- Respond to angle
  - 12 discrete angles
- Respond to direction of contrast, i.e., light-to-dark or dark-to-light
- Figure: Fig. 6(d) in paper

Level 3 Equations

Example Code – Level 4
- Oriented Direction-of-Contrast-Insensitive Units
- Respond to angle
- Do not respond to direction of contrast, i.e., light-to-dark or dark-to-light
- Figure: Fig. 8(a) in paper

Level 4 Equations

Example Code – Level 5
- Level 5 – Boundary Contour Units
- Pool nearby excitations
- Figure: Fig. 8(d) in paper

Level 5 Equation

Timing

Timing
- When tuning/parallelizing a code, need to assess effectiveness of your efforts
- Can time whole code and/or specific sections
- Some types of timers
  - unix time command
  - function/subroutine calls
  - profiler

CPU Time or Wall-Clock Time?
- CPU time
  - How much time the CPU is actually crunching away
  - User CPU time
    - Time spent executing your source code
  - System CPU time
    - Time spent in system calls such as i/o
- Wall-clock time
  - What you would measure with a stopwatch

CPU Time or Wall-Clock Time? (cont’d)
- Both are useful
- For serial runs without interaction from keyboard, CPU and wall-clock times are usually close
  - If you prompt for keyboard input, wall-clock time will accumulate if you get a cup of coffee, but CPU time will not

CPU Time or Wall-Clock Time? (3)
- Parallel runs
  - Want wall-clock time, since CPU time will be about the same or even increase as number of procs. is increased
  - Wall-clock time may not be accurate if sharing processors
  - Wall-clock timings should always be performed in batch mode
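
The gap between the two clocks can be seen in a few lines of C. This is a minimal sketch, assuming a POSIX system; the function name time_idle_wait is hypothetical, and nanosleep stands in for a program waiting on keyboard input. Wall-clock time keeps running during the wait, while CPU time barely moves.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* for clock_gettime and nanosleep */
#endif
#include <time.h>

/* Convert a timespec to seconds as a double. */
static double ts_to_sec(struct timespec ts)
{
    return (double)ts.tv_sec + 1.0e-9 * (double)ts.tv_nsec;
}

/* Sleep for `seconds` and report the CPU and wall-clock time consumed.
   Sleeping stands in for waiting on keyboard input: the process is idle.
   (Hypothetical helper, not part of the example code.) */
void time_idle_wait(double seconds, double *cpu_sec, double *wall_sec)
{
    struct timespec w1, w2, req;
    clock_t c1, c2;

    req.tv_sec  = (time_t)seconds;
    req.tv_nsec = (long)((seconds - (double)req.tv_sec) * 1.0e9);

    clock_gettime(CLOCK_MONOTONIC, &w1);   /* wall-clock start */
    c1 = clock();                          /* CPU-time start   */

    nanosleep(&req, NULL);                 /* idle: no CPU work */

    c2 = clock();                          /* CPU-time end     */
    clock_gettime(CLOCK_MONOTONIC, &w2);   /* wall-clock end   */

    *cpu_sec  = (double)(c2 - c1) / CLOCKS_PER_SEC;
    *wall_sec = ts_to_sec(w2) - ts_to_sec(w1);
}
```

Calling time_idle_wait(0.2, &cpu, &wall) from a main program should report a wall-clock time of roughly 0.2 s and a CPU time close to zero.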

Unix Time Command
- easiest way to time code
- simply type time before your run command
- output differs between c-type shells (cshell, tcshell) and Bourne-type shells (bsh, bash, ksh)

Unix Time Command (cont’d)

    twister:~ % time mycode
    1.570u 0.010s 0:01.77 89.2% 75+1450k 0+0io 64pf+0w

- 1.570u: user CPU time (s)
- 0.010s: system CPU time (s)
- 0:01.77: wall-clock time (s)
- 89.2%: (u+s)/wc
- 75+1450k: avg. shared + unshared text space
- 0+0io: input + output operations
- 64pf+0w: page faults + no. times proc. was swapped

Unix Time Command (3)
- Bourne shell results

    $ time mycode
    Real 1.62
    User 1.57
    System 0.03

- Real: wall-clock time (s)
- User: user CPU time (s)
- System: system CPU time (s)

Exercise 1
- Copy files from /scratch/sondak/gt

    cp /scratch/sondak/gt/* .

- Choose C (gt.c) or Fortran (gt.f90)
- Compile with no optimization:

    pgcc -O0 -o gt gt.c
    pgf90 -O0 -o gt gt.f90

  - In -O0, the first character is a capital letter O and the second is a zero; -o is a lowercase letter o
- Submit rungt script to batch queue

    qsub rungt

Exercise 1 (cont’d)
- Check status

    qstat -u username

- After run has completed a file will appear named rungt.o??????, where ?????? represents the process number
- File contains result of time command
- Write down wall-clock time
- Re-compile using -O3
- Re-run and check time

Function/Subroutine Calls
- often need to time part of code
- timers can be inserted in source code
- language-dependent

cpu_time
- intrinsic subroutine in Fortran
- returns user CPU time (in seconds)
  - no system time is included
- 0.01 sec. resolution on p-series

    real :: t1, t2
    call cpu_time(t1)
    ... do stuff to be timed ...
    call cpu_time(t2)
    print*, 'CPU time = ', t2-t1, ' sec.'

system_clock
- intrinsic subroutine in Fortran
- good for measuring wall-clock time
- on p-series:
  - resolution is 0.01 sec.
  - max. time is 24 hr.

system_clock (cont’d)
- t1 and t2 are tic counts
- count_rate is optional argument containing tics/sec.

    integer :: t1, t2, count_rate
    call system_clock(t1, count_rate)
    ... do stuff to be timed ...
    call system_clock(t2)
    print*,'wall-clock time = ', &
      real(t2-t1)/real(count_rate), ' sec'

times
- can be called from C to obtain CPU time
- 0.01 sec. resolution on p-series
- can also get system time with tms_stime

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/times.h>

    int main(void){
      long tics_per_sec;
      clock_t tic1, tic2;
      struct tms timedat;

      tics_per_sec = sysconf(_SC_CLK_TCK);   /* clock tics per second */
      times(&timedat);
      tic1 = timedat.tms_utime;              /* user CPU tics at start */
      /* ... do stuff to be timed ... */
      times(&timedat);
      tic2 = timedat.tms_utime;              /* user CPU tics at end */
      printf("CPU time = %5.2f\n",
             (double)(tic2-tic1)/(double)tics_per_sec);
      return 0;
    }

gettimeofday
- can be called from C to obtain wall-clock time
- μsec resolution on p-series

    #include <stdio.h>
    #include <sys/time.h>

    int main(void){
      struct timeval t;
      double t1, t2;

      gettimeofday(&t, NULL);
      t1 = t.tv_sec + 1.0e-6*t.tv_usec;      /* seconds + microseconds */
      /* ... do stuff to be timed ... */
      gettimeofday(&t, NULL);
      t2 = t.tv_sec + 1.0e-6*t.tv_usec;
      printf("wall-clock time = %5.3f\n", t2-t1);
      return 0;
    }

MPI_Wtime
- convenient wall-clock timer for MPI codes
- μsec resolution on p-series

MPI_Wtime (cont’d)
- Fortran

    double precision t1, t2
    t1 = mpi_wtime()
    ... do stuff to be timed ...
    t2 = mpi_wtime()
    print*,'wall-clock time = ', t2-t1

- C

    double t1, t2;
    t1 = MPI_Wtime();
    ... do stuff to be timed ...
    t2 = MPI_Wtime();
    printf("wall-clock time = %5.3f\n",t2-t1);

omp_get_wtime
- convenient wall-clock timer for OpenMP codes
- resolution available by calling omp_get_wtick()
- 0.01 sec. resolution on p-series

omp_get_wtime (cont’d)
- Fortran

    double precision t1, t2, omp_get_wtime
    t1 = omp_get_wtime()
    ... do stuff to be timed ...
    t2 = omp_get_wtime()
    print*,'wall-clock time = ', t2-t1

- C

    double t1, t2;
    t1 = omp_get_wtime();
    ... do stuff to be timed ...
    t2 = omp_get_wtime();
    printf("wall-clock time = %5.3f\n",t2-t1);

Timer Summary

                CPU         Wall
    Fortran     cpu_time    system_clock
    C           times       gettimeofday
    MPI                     MPI_Wtime
    OpenMP                  omp_get_wtime

Exercise 2
- Put wall-clock timer around each “level” in the example code
- Print time for each level
- Compile and run
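
One possible shape for the per-level timers in C is sketched below. It assumes each level can be wrapped in a function taking no arguments; wall_seconds, time_stage, and dummy_level are hypothetical names that do not appear in the example code, and dummy_level is only a placeholder workload.

```c
#include <stdio.h>
#include <sys/time.h>

/* Current wall-clock time in seconds, built on gettimeofday. */
double wall_seconds(void)
{
    struct timeval t;
    gettimeofday(&t, NULL);
    return (double)t.tv_sec + 1.0e-6 * (double)t.tv_usec;
}

/* Run one stage, print its wall-clock time, and return the elapsed seconds.
   `stage` is a hypothetical stand-in for a routine that computes one level. */
double time_stage(const char *label, void (*stage)(void))
{
    double t1 = wall_seconds();
    stage();
    double t2 = wall_seconds();
    printf("%s: wall-clock time = %8.3f s\n", label, t2 - t1);
    return t2 - t1;
}

/* Hypothetical placeholder workload, not part of the example code. */
void dummy_level(void)
{
    volatile double s = 0.0;
    for (long i = 0; i < 1000000L; i++)
        s += (double)i;
}
```

A call such as time_stage("level 1", dummy_level) prints one line per level. If the levels are not separate functions, the same two wall_seconds() calls can be placed directly around each block of code.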

PROFILING

Profilers
- profile tells you how much time is spent in each routine
- gives a level of granularity not available with previous timers
  - e.g., function may be called from many places
- various profilers available, e.g.
  - gprof (GNU)
  - pgprof (Portland Group)
  - Xprofiler (AIX)

gprof
- compile with -pg
- file gmon.out will be created when you run
- gprof executable > myprof
- for multiple procs. (MPI), copy or link gmon.out.n to gmon.out, then run gprof

gprof (cont’d)

    granularity: Each sample hit covers 4 bytes. Time: 435.04 seconds

      %   cumulative    self                 self     total
     time   seconds   seconds      calls   ms/call   ms/call  name
     20.5     89.17     89.17         10   8917.00  10918.00  .conduct [5]
      7.6    122.34     33.17        323    102.69    102.69  .getxyz [8]
      7.5    154.77     32.43                                 .__mcount [9]
      7.2    186.16     31.39     189880      0.17      0.17  .btri [10]
      7.2    217.33     31.17                                 .kickpipes [12]
      5.1    239.58     22.25  309895200      0.00      0.00  .rmnmod [16]
      2.3    249.67     10.09        269     37.51     37.51  .getq [24]

gprof (3)

    granularity: Each sample hit covers 4 bytes. Time: 435.04 seconds

                                      called/total      parents
    index  %time    self descendents  called+self   name          index
                                      called/total      children

                    0.00      340.50       1/1      .__start [2]
    [1]     78.3    0.00      340.50       1        .main [1]
                    2.12      319.50      10/10     .contrl [3]
                    0.04        7.30      10/10     .force [34]
                    0.00        5.27       1/1      .initia [40]
                    0.56        3.43       1/1      .plot3da [49]
                    0.00        1.27       1/1      .data [73]

pgprof
- compile with Portland Group compiler
  - pgf90 (pgf95, etc.)
  - pgcc
  - -Mprof=func
    - similar to -pg
- run code
- pgprof -exe executable
- pops up window with flat profile

pgprof (cont’d)

pgprof (3)
- To save profile data to a file:
  - re-run pgprof using -text flag
  - at command prompt type p > filename
    - filename is the name you want to give the profile file
  - type quit to get out of profiler

Exercise 3
- Use pgprof to profile code
  - compile using -Mprof=func
  - run code
  - create profile using pgprof -exe gt
- Note which routines use most time
- Please close pgprof when you’re through
  - Leaving window open ties up a license

Line-Level Profiling
- Times individual lines
- For pgprof, compile with the flag -Mprof=line
- Optimizer will re-order lines
  - profiler will lump lines in some loops or other constructs
  - may want to compile without optimization, may not
- In flat profile, double-click on function to get line-level data

Line-Level Profiling (cont’d)

Exercise 4
- Compile code with -Mprof=line and -O0 and run
  - will take about 5 minutes to run due to overhead from line-level profiling and lack of optimization
- Examine line-level profile for most time-consuming routine
  - Note lines with longest time consumption
- Save your profile data to a file (we will need it later)
  - re-run pgprof using -text flag
  - at command prompt type p > prof

CACHE

Cache
- Cache is a small chunk of fast memory between the main memory and the registers
- Diagram, fastest to slowest: registers, primary cache, secondary cache, main memory

Cache (cont’d)
- If variables are used repeatedly, code will run faster since cache memory is much faster than main memory
- Variables are moved from main memory to cache in lines
  - L1 cache line sizes on our machines
    - Opteron (katana cluster) 64 bytes
    - Xeon (katana cluster) 64 bytes
    - Power4 (p-series) 128 bytes
    - PPC440 (Blue Gene) 32 bytes
    - Pentium III (linux cluster) 32 bytes

Cache (3)
- Why not just make the main memory out of the same stuff as cache?
  - Expensive
  - Runs hot
  - This was actually done in Cray computers
    - Liquid cooling system

Cache (4)
- Cache hit
  - Required variable is in cache
- Cache miss
  - Required variable not in cache
  - If cache is full, something else must be thrown out (sent back to main memory) to make room
  - Want to minimize number of cache misses

Cache (5)
- Main memory: … x[0] x[1] x[2] x[3] x[4] x[5] x[6] x[7] x[8] x[9] a b …
- “mini” cache holds 2 lines, 4 words each

    for(i=0; i<10; i++) x[i] = i;

Cache (6)
- will ignore i for simplicity
- need x[0], not in cache: cache miss
- load line from memory into cache
- next 3 loop indices result in cache hits
- Cache now holds: x[0] x[1] x[2] x[3]

Cache (7)
- need x[4], not in cache: cache miss
- load line from memory into cache
- next 3 loop indices result in cache hits
- Cache now holds: x[0] x[1] x[2] x[3] | x[4] x[5] x[6] x[7]

Cache (8)
- need x[8], not in cache: cache miss
- load line from memory into cache
- no room in cache!
- replace old line
- Cache now holds: x[4] x[5] x[6] x[7] | x[8] x[9] a b
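
The walk-through above can be replayed with a small simulation of the “mini” cache. This is a sketch under stated assumptions: x[0] sits at the start of a cache line, the oldest line is replaced first, and count_misses is a hypothetical name. Real caches use more elaborate placement and replacement policies.

```c
#include <stddef.h>

#define CACHE_LINES 2   /* the "mini" cache holds 2 lines ... */
#define LINE_WORDS  4   /* ... of 4 words each                */

/* Count cache misses for the sequential accesses x[0], x[1], ..., x[n-1].
   Assumes x[0] sits at the start of a cache line and that the oldest
   line is evicted first (FIFO); both are simplifying assumptions. */
size_t count_misses(size_t n)
{
    long   resident[CACHE_LINES];   /* line numbers currently cached */
    size_t next_victim = 0;         /* slot to replace on next miss  */
    size_t misses = 0;

    for (size_t s = 0; s < CACHE_LINES; s++)
        resident[s] = -1;           /* -1 marks an empty slot */

    for (size_t i = 0; i < n; i++) {
        long line = (long)(i / LINE_WORDS);   /* line holding x[i] */
        int  hit  = 0;
        for (size_t s = 0; s < CACHE_LINES; s++)
            if (resident[s] == line)
                hit = 1;
        if (!hit) {                 /* miss: load line, evict oldest */
            resident[next_victim] = line;
            next_victim = (next_victim + 1) % CACHE_LINES;
            misses++;
        }
    }
    return misses;
}
```

For the ten-element loop, count_misses(10) returns 3: the misses at x[0], x[4], and x[8], with the other seven accesses being hits.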

Cache (9)
- Contiguous access is important
- In C, multidimensional array is stored in memory as

    a[0][0] a[0][1] a[0][2] …

Cache (10)
- In Fortran and Matlab, multidimensional array is stored the opposite way:

    a(1,1) a(2,1) a(3,1) …

Cache (11)
- Rule: Always order your loops appropriately
  - will usually be taken care of by optimizer
  - suggestion: don’t rely on optimizer
- C

    for(i=0; i<N; i++){
      for(j=0; j<N; j++){
        a[i][j] = 1.0;
      }
    }

- Fortran

    do j = 1, n
      do i = 1, n
        a(i,j) = 1.0
      enddo
    enddo
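
To try the loop-ordering rule in C, the two traversals can be placed side by side and timed with any of the wall-clock timers from the Timing section. The sketch below is illustrative only; the array size N = 512 and the function names are assumptions. Both functions fill the same array, so only the memory access pattern differs.

```c
#include <stddef.h>

#define N 512

static double a[N][N];

/* Inner loop runs over the rightmost index: consecutive addresses in C. */
void fill_contiguous(double value)
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            a[i][j] = value;
}

/* Inner loop runs over the leftmost index: each access jumps N doubles. */
void fill_strided(double value)
{
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            a[i][j] = value;
}

/* Sum all elements, used only to confirm both fills give the same array. */
double sum_all(void)
{
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s += a[i][j];
    return s;
}
```

How large the timing difference is depends on the array size, the cache, and the optimization level; an optimizer may interchange the loops in fill_strided on its own.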

TUNING TIPS

Tuning Tips
- Some of these tips will be taken care of by compiler optimization
  - It’s best to do them yourself, since compilers vary
- Two important rules
  - minimize number of operations
  - access cache contiguously

Tuning Tips (cont’d)
- Access arrays in contiguous order
  - For multi-dimensional arrays, rightmost index varies fastest for C and C++, leftmost for Fortran and Matlab
- Bad (in C, the inner loop strides through memory):

    for(j=0; j<N; j++){
      for(i=0; i<N; i++){
        a[i][j] = 1.0;
      }
    }

- Good (in C, the inner loop walks contiguous memory):

    for(i=0; i<N; i++){
      for(j=0; j<N; j++){
        a[i][j] = 1.0;
      }
    }

Tuning Tips (3)
- Eliminate redundant operations in loops
- Bad:

    for(i=0; i<N; i++){
      x = 10;
      …
    }

- Good:

    x = 10;
    for(i=0; i<N; i++){
      …
    }

Tuning Tips (4)
- Minimize if statements within loops
- They may inhibit pipelining

    for(i=0; i<N; i++){
      if(i==0)
        perform i=0 calculations
      else
        perform i>0 calculations
    }

Tuning Tips (5)
- Better Way:

    perform i=0 calculations
    for(i=1; i<N; i++){
      perform i>0 calculations
    }
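
A concrete instance of moving the i=0 case out of the loop is sketched below. The calculation (a first difference of an array) and the function names are made up for illustration and are not taken from the example code.

```c
#include <stddef.h>

/* Branch inside the loop: the test i == 0 is evaluated on every pass. */
void diff_with_branch(const double *x, double *d, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (i == 0)
            d[i] = x[i];             /* i = 0 calculation */
        else
            d[i] = x[i] - x[i - 1];  /* i > 0 calculation */
    }
}

/* First iteration peeled off: the loop body has no branch. */
void diff_peeled(const double *x, double *d, size_t n)
{
    if (n == 0)
        return;                      /* nothing to do for an empty array */
    d[0] = x[0];                     /* i = 0 calculation, done once */
    for (size_t i = 1; i < n; i++)
        d[i] = x[i] - x[i - 1];      /* i > 0 calculation */
}
```

Both functions produce identical output. The peeled version needs an explicit guard for n == 0, which the original loop handled implicitly.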

Tuning Tips (6)
- Divides are expensive
  - Intel x86 clock cycles per operation

        add         3-6
        multiply    4-8
        divide      32-45

- Bad:

    for(i=0; i<N; i++)
      x[i] = y[i]/scalarval;

- Good:

    qs = 1.0/scalarval;
    for(i=0; i<N; i++)
      x[i] = y[i]*qs;
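
One caveat with replacing a divide by a reciprocal multiply is that the results can differ in the last bits, because 1.0/scalarval is itself rounded. The sketch below puts both versions in functions with hypothetical names so the outputs can be compared.

```c
#include <stddef.h>

/* One divide per element. */
void scale_by_divide(const double *y, double *x, size_t n, double scalarval)
{
    for (size_t i = 0; i < n; i++)
        x[i] = y[i] / scalarval;
}

/* One divide in total, then one multiply per element. */
void scale_by_reciprocal(const double *y, double *x, size_t n, double scalarval)
{
    const double qs = 1.0 / scalarval;   /* computed once, outside the loop */
    for (size_t i = 0; i < n; i++)
        x[i] = y[i] * qs;
}
```

When comparing the two, use a small relative tolerance and not exact equality. This rounding difference is also one reason compilers generally do not make the substitution unless relaxed floating-point options are enabled.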

Tuning Tips (7)
- There is overhead associated with a function call
- Bad:

    for(i=0; i<N; i++)
      myfunc(i);

- Good:

    myfunc();

    void myfunc(){
      for(int i=0; i<N; i++){
        do stuff
      }
    }

Tuning Tips (8)
- Minimize calls to math functions
- Bad:

    for(i=0; i<N; i++)
      z[i] = log(x[i]) + log(y[i]);

- Good:

    for(i=0; i<N; i++){
      z[i] = log(x[i] * y[i]);
    }
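
Halving the number of log calls relies on the product rule for logarithms, stated here for reference. It holds for positive arguments only, and in floating point the product x·y can overflow or underflow where the two separate logarithms would not, so the rewrite is a judgment call that depends on the data.

```latex
\log x + \log y = \log(x\,y), \qquad x > 0,\; y > 0
```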

Tuning Tips (9)
- recasting may be costlier than you think
- Bad:

    sum = 0.0;
    for(i=0; i<N; i++)
      sum += (float) i;

- Good:

    isum = 0;
    for(i=0; i<N; i++)
      isum += i;
    sum = (float) isum;
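
The two summation variants are written out below as complete functions, with hypothetical names. The integer version assumes the total fits in an int. For large n the two can also give different answers, since a float accumulator starts rounding once the sum passes 2^24.

```c
/* Converts i to float on every pass and accumulates in floating point. */
float sum_cast_each_pass(int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++)
        sum += (float)i;
    return sum;
}

/* Accumulates in integer arithmetic and converts once at the end.
   Assumes the total fits in an int; use a wider type for large n. */
float sum_cast_once(int n)
{
    int isum = 0;
    for (int i = 0; i < n; i++)
        isum += i;
    return (float)isum;
}
```

For small n, such as n = 100, both functions return the same value, 4950.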

Exercise 5
- The example code that has been provided is written in a clear, readable style that also happens to violate lots of the tuning tips that we have just reviewed.
- Examine the line-level profile. What lines are using the most time? Is there anything we might be able to do to make it run faster?
- We will discuss options as a group
  - come up with a strategy
  - modify code
  - re-compile and run
  - compare timings
- Re-examine line-level profile, come up with another strategy, repeat procedure, etc.

Survey
- Please fill out the survey for this tutorial at http://scv.bu.edu/survey/tutorial_evaluation.html
