1 Friday, September 15, 2006 The three most important factors in selling optimization are location, location, location. - Realtor’s creed

2 Optimizations to source code §There are several optimizations that compilers and users can make to source code, which generally result in fewer assembly instructions.

3 x=3; if (x<3) my_function();

4 Dead code elimination x=3; if (x<3) my_function();
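A sketch of what the optimizer does with the snippet above: once the constant 3 is propagated into the test, the branch is provably false and the call is removed (my_function is just the placeholder name from the slide).

static void my_function(void) { }     /* placeholder from the slide */

int main(void)
{
    int x = 3;
    if (x < 3)                        /* provably false once x = 3 is propagated   */
        my_function();                /* dead code: the test and call are removed  */
    return x;                         /* the optimizer effectively keeps only x = 3 */
}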

5 x=4+3; y=x-2;

6 Constant folding and propagation Multiple constants are folded together and evaluated at compile time. x=4+3; y=x-2;
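A sketch of the effect on the snippet above: 4+3 is folded at compile time, and propagating x = 7 lets y be folded as well.

int folded(void)
{
    int x = 4 + 3;    /* constant folding: becomes x = 7 at compile time     */
    int y = x - 2;    /* constant propagation then folds this to y = 5       */
    return y;         /* the generated code can simply return the constant 5 */
}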

7 Common sub-expression elimination x = y + (a*d+c) z = u + (a*d+c)
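A sketch of the rewrite: the shared sub-expression a*d+c is evaluated once into a temporary and reused (variable names taken from the slide).

void cse(double a, double c, double d, double u, double y,
         double *x, double *z)
{
    double t = a*d + c;   /* common sub-expression, computed once */
    *x = y + t;
    *z = u + t;
}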

8 C/C++ Register data type §If a variable is used many times, it should be kept in a register rather than loaded from memory each time. §Hint to compilers

9 C/C++ Register data type §If a variable is used many times, it should be kept in a register rather than loaded from memory each time. §Hint to compilers l asm (problem?)
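A minimal C sketch of the hint. Modern compilers are free to ignore register; its main enforced effect is that the variable's address cannot be taken.

long sum_array(const int *a, long n)
{
    register long s = 0;               /* hint: keep the accumulator in a register */
    for (register long i = 0; i < n; i++)
        s += a[i];
    return s;                          /* note: &s would be rejected by the compiler */
}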

10 Strength reductions Replace expensive operations with cheaper ones. §Replace integer multiplication and division by constants with shift operations l Most processors have integer shift functional units. l Multiplication of a number by 9?
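Since 9*x = 8*x + x, the multiplication can be replaced by a shift and an add; a small sketch:

unsigned times9(unsigned x)
{
    return (x << 3) + x;   /* 8*x + x: one shift and one add instead of a multiply */
}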

11 Strength reductions Replace expensive operations with cheaper ones. §Replace 32-bit integer division by 64-bit floating point division
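A hedged sketch of the idea: convert the 32-bit operands to double, divide, and truncate back. For 32-bit operands the result matches integer division because every 32-bit integer is exactly representable in a 64-bit double, but whether this is actually faster depends on the processor's divide units, so it should be profiled.

int int_div_via_fp(int i, int j)
{
    /* 64-bit floating point divide standing in for a 32-bit integer divide */
    return (int)((double)i / (double)j);
}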

12 Strength reductions Replace expensive operations with cheaper ones. Replace floating point multiplication by small constants with floating point additions
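For example, a multiplication by 2.0 becomes a single addition:

double twice(double x)
{
    return x + x;   /* same value as 2.0*x, using an add instead of a multiply */
}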

13 Strength reductions Replace expensive operations with cheaper ones. Replace floating point divisions by one division and multiplications. Division is one of the most expensive operations; the cost of a multiplication is negligible in comparison. a=y/x b=z/x
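A sketch of the rewrite for the two divisions above. The results may differ in the last bit from the original divisions, so this is only appropriate when that change in rounding is acceptable.

void two_divs(double x, double y, double z, double *a, double *b)
{
    double r = 1.0 / x;   /* one expensive division...            */
    *a = y * r;           /* ...replaced by cheap multiplications */
    *b = z * r;
}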

14 Strength reductions Replace floating point divisions by one division and multiplications. Especially inside loops:
a(i) = b(i) / c(i)
x(i) = y(i) / z(i)

15 Strength reductions Replace floating point divisions by one division and multiplications. Especially inside loops:
a(i) = b(i)/c(i)
x(i) = y(i)/z(i)
becomes
temp = 1.0/(c(i)*z(i))
a(i) = b(i) * z(i) * temp
x(i) = y(i) * c(i) * temp

16 Strength reductions Replace expensive operations with cheaper ones. Replace power functions by floating point multiplications Power calculations can take 50 times longer than multiplication
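For small integer exponents the library power function can be replaced by explicit multiplications, e.g. x to the third power:

double cube(double x)
{
    return x * x * x;   /* replaces pow(x, 3.0) with two multiplications */
}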

17 Fused multiply add (fma) Many processors have compound floating point multiply-add instructions that are more efficient than separate multiply and add instructions. fma: (a*b) + c fnma: -(a*b) + c
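C99 exposes the operation as fma() in <math.h>; on hardware with a fused multiply-add unit the compiler can map it to a single instruction. A minimal sketch:

#include <math.h>

double madd(double a, double b, double c)
{
    return fma(a, b, c);   /* a*b + c computed with a single rounding */
}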

18 Fused multiply add (fma) a, b, c are complex numbers c = c + a*b (cr, ci) + (ar, ai) * (br, bi) = (cr, ci) + ((ar*br - ai*bi), (ar*bi + ai*br))

19 Fused multiply add (fma) (cr, ci) + (ar, ai) * (br, bi) = (cr, ci) + ((ar*br - ai*bi), (ar*bi + ai*br))
Multiply f1 = ar*br
Multiply f2 = ar*bi
fnma     f3 = -ai*bi + f1
fma      f4 = ai*br + f2
Add      f5 = cr + f3
Add      f6 = ci + f4

20 Fused multiply add (fma) Alter the order of instructions to: (cr, ci) + (ar, ai) * (br, bi) = ((cr+ar*br) – (ai*bi), (ci+ar*bi) + (ai*br))

21 Fused multiply add (fma) Alter the order of instructions to: (cr, ci) + (ar, ai) * (br, bi) = ((cr + ar*br) - (ai*bi), (ci + ar*bi) + (ai*br))
fma  f1 = ar*br + cr
fma  f2 = ar*bi + ci
fnma f3 = -ai*bi + f1
fma  f4 = ai*br + f2

22 Loop optimizations §Loops account for most of the runtime in computational programs §Source code modifications can lead to significant improvements in runtime §Magnification factor

23 Single loop optimization for (i=0; i<n; i+=2) a[i] = i*k + m; §Multiple of induction variable added to a constant. §Replace multiplication with addition.

24 Single loop optimization for (i=0; i<n; i+=2) a[i] = i*k + m; §Multiple of induction variable added to a constant. §Replace multiplication with addition.
counter = m;
for (i=0; i<n; i+=2) {
  a[i] = counter;
  counter = counter + k + k;
}

25 Condition statements Branches in code reduce performance, especially on pipelined systems.
do I = 1,N
  if (A > 0) then
    x(I) = x(I) + 1
  else
    x(I) = 0.0
  endif
enddo

26 Condition statements Branches are a significant overhead on pipelined systems. Branch mis-prediction penalty is high. §Pipelines get stalled §In-flight instructions may have to be discarded §Try to move if statements out of the loop body l Optimization has to be applied on a case-by-case basis

27 Test promotion in loops
do I = 1,N
  if (A > 0) then
    x(I) = x(I) + 1
  else
    x(I) = 0.0
  endif
enddo

if (A > 0) then
  do I = 1,N
    x(I) = x(I) + 1
  enddo
else
  do I = 1,N
    x(I) = 0.0
  enddo
endif
Most compilers don't do this optimization.

28 Condition statements
do i=1,len2
  do j=1,len2
    if (j < i) then
      a2d(j,i) = a2d(j,i) + b2d(j,i)*con1
    else
      a2d(j,i) = 1.0
    endif
  enddo
enddo

29 Condition statements
do i=1,len2
  do j=1,len2
    if (j < i) then
      a2d(j,i) = a2d(j,i) + b2d(j,i)*con1
    else
      a2d(j,i) = 1.0
    endif
  enddo
enddo

do i=1,len2
  do j=1,i-1
    a2d(j,i) = a2d(j,i) + b2d(j,i)*con1
  enddo
  do j=i,len2
    a2d(j,i) = 1.0
  enddo
enddo

30 Loop Peeling Boundary conditions
do I = 1,N
  if (I == 1) then
    x(I) = 0
  elseif (I == N) then
    x(I) = N
  else
    x(I) = x(I) + y(I)
  endif
enddo

31 Loop Peeling
do I = 1,N
  if (I == 1) then
    x(I) = 0
  elseif (I == N) then
    x(I) = N
  else
    x(I) = x(I) + y(I)
  endif
enddo

x(1) = 0
do I = 2,N-1
  x(I) = x(I) + y(I)
enddo
x(N) = N
Most compilers don't do this optimization.

32 Loop Peeling
do i=1,n
  y(i,n) = (1.0 - x(i,1))*y(1,n) + x(i,1)*y(n,n)
enddo

33 Loop Peeling
do i=1,n
  y(i,n) = (1.0 - x(i,1))*y(1,n) + x(i,1)*y(n,n)
enddo

t2 = y(n,n)
y(1,n) = (1.0 - x(1,1))*y(1,n) + x(1,1)*t2
t1 = y(1,n)
do i=2,n-1
  y(i,n) = (1.0 - x(i,1))*t1 + x(i,1)*t2
enddo
y(n,n) = (1.0 - x(n,1))*t1 + x(n,1)*t2
Complex for compilers to perform.

34 Loop Unrolling
do i=1,n
  A(i) = B(i)
enddo
§Loop index is incremented and checked at the beginning of each iteration §Branches interfere with pipelining §Register blocking: l Temporary variables that are used repeatedly.

35 Loop Unrolling
do i=1,n
  A(i) = B(i)
enddo

do i=1,n,4
  A(i)   = B(i)
  A(i+1) = B(i+1)
  A(i+2) = B(i+2)
  A(i+3) = B(i+3)
enddo
Unrolled by 4. Some compilers allow users to specify the unrolling depth. Avoid excessive unrolling: register pressure / spills can hurt performance. Unrolling helps pipelining hide instruction latencies and reduces the overhead of the index increment and conditional check. Assumes n is divisible by 4.
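When n is not a multiple of 4, a cleanup loop handles the leftover iterations. A C sketch of the same copy loop, unrolled by 4 with a remainder loop:

void copy_unrolled(double *A, const double *B, int n)
{
    int i;
    for (i = 0; i + 3 < n; i += 4) {   /* main loop, unrolled by 4 */
        A[i]   = B[i];
        A[i+1] = B[i+1];
        A[i+2] = B[i+2];
        A[i+3] = B[i+3];
    }
    for (; i < n; i++)                 /* cleanup loop for the remaining 0-3 elements */
        A[i] = B[i];
}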

36 Loop Unrolling
do j=1,m
  do i=1,n
    do k=1,p
      c(i,j) = c(i,j) + a(k,i)*b(k,j)
    enddo
  enddo
enddo

37 Loop Unrolling
do j=1,m
  do i=1,n
    do k=1,p
      c(i,j) = c(i,j) + a(k,i)*b(k,j)
    enddo
  enddo
enddo

do j=1,m,2
  do i=1,n,2
    t1 = c(i,j)
    t2 = c(i+1,j)
    t3 = c(i,j+1)
    t4 = c(i+1,j+1)
    do k=1,p
      t1 = t1 + a(k,i)*b(k,j)
      t2 = t2 + a(k,i+1)*b(k,j)
      t3 = t3 + a(k,i)*b(k,j+1)
      t4 = t4 + a(k,i+1)*b(k,j+1)
    enddo
    c(i,j)     = t1
    c(i+1,j)   = t2
    c(i,j+1)   = t3
    c(i+1,j+1) = t4
  enddo
enddo

38 Loop Interchange §To improve spatial locality. §Align the access pattern with the order in which data is stored in memory. §2-D arrays in Fortran are stored column-wise. §2-D arrays in C are stored row-wise.
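A C sketch of an interchange: because C arrays are stored row-wise, the column index should vary fastest in the inner loop, turning strided accesses into unit-stride ones (the array name and size here are illustrative).

#define N 1024

void scale(double a[N][N], double s)
{
    /* poor order for C (column index in the outer loop makes consecutive
       accesses N elements apart):
         for (j = 0; j < N; j++)
           for (i = 0; i < N; i++)
             a[i][j] *= s;                                                */
    for (int i = 0; i < N; i++)        /* interchanged: row index outer    */
        for (int j = 0; j < N; j++)    /* column index inner, unit stride  */
            a[i][j] *= s;
}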

39 Loop Fusion §Beneficial in loop-intensive programs. §Decreases index calculation overhead. §Can also help instruction level parallelism. §Beneficial if the same data structures are used in different loops.

40 Loop Fusion
for (i=0; i<nodes; i++) {
  a[i] = a[i]*small;
  c[i] = (a[i] + b[i])*relaxn;
}
for (i=1; i<nodes-1; i++) {
  d[i] = c[i] - a[i];
}

41 Loop Fusion
for (i=0; i<nodes; i++) {
  a[i] = a[i]*small;
  c[i] = (a[i] + b[i])*relaxn;
}
for (i=1; i<nodes-1; i++) {
  d[i] = c[i] - a[i];
}

a[0] = a[0]*small;
c[0] = (a[0] + b[0])*relaxn;
a[nodes-1] = a[nodes-1]*small;
c[nodes-1] = (a[nodes-1] + b[nodes-1])*relaxn;
for (i=1; i<nodes-1; i++) {
  a[i] = a[i]*small;
  c[i] = (a[i] + b[i])*relaxn;
  d[i] = c[i] - a[i];
}

