Friday, September 15, 2006 The three most important factors in selling optimization are location, location, location. - Realtor’s creed

Optimizations to source code
§There are several optimizations that compilers and users can make to source code that generally result in fewer assembly instructions.

Dead code elimination
x = 3;
if (x < 3) my_function();
Since x is known to be 3, the test x < 3 is always false and the call to my_function() can never execute; the compiler removes both the test and the call.
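A minimal sketch of the net effect (the code around it is illustrative):

x = 3;
if (x < 3)        /* provably false: branch and call are eliminated */
    my_function();
/* compiles as if written: */
x = 3;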

Constant folding and propagation
Multiple constants are folded together and evaluated at compile time.
x = 4 + 3;
y = x - 2;
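What the compiler effectively emits after folding and propagating, as a sketch:

x = 7;    /* 4+3 folded at compile time */
y = 5;    /* x=7 propagated into x-2, then folded */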

Common sub-expression elimination
x = y + (a*d + c);
z = u + (a*d + c);
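A sketch of the transformed code; t is a compiler-introduced temporary (the name is illustrative):

t = a*d + c;    /* common sub-expression computed once */
x = y + t;
z = u + t;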

C/C++ register data type
§If a variable is used many times, it should be kept in a register rather than reloaded from memory each time.
§register is only a hint to compilers.
- Inline asm gives direct control over registers, but at the cost of portability and of the compiler's own optimization (problem?).
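A minimal sketch of the hint in use (a and n assumed declared elsewhere); note that modern compilers do their own register allocation and are free to ignore the keyword:

register int sum = 0;   /* hint: keep sum in a register */
int i;
for (i = 0; i < n; i++)
    sum += a[i];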

Strength reductions
Replace expensive operations with cheaper ones.
§Replace integer multiplication and division by constants with shift operations.
- Most processors have integer shift functional units.
- Multiplication of a number by 9? (see the sketch below)
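One answer to the question above, as a sketch: 9 = 8 + 1, so

y = (x << 3) + x;    /* x*9 = x*8 + x; the shift computes x*8 */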

Strength reductions
Replace expensive operations with cheaper ones.
§Replace 32-bit integer division by 64-bit floating point division.
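A sketch of the idea, assuming a machine where floating point division is faster than integer division: every 32-bit integer is exactly representable in a 64-bit double, and the rounding of the quotient never crosses an integer boundary, so truncating back gives the same result as integer division.

int q = (int)((double)a / (double)b);    /* same result as a/b for 32-bit a, b */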

Strength reductions
Replace expensive operations with cheaper ones.
§Replace floating point multiplication by small constants with floating point additions.
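A minimal sketch for a multiply by 2:

y = 2.0 * x;    /* before */
y = x + x;      /* after: an addition instead of a multiplication */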

Strength reductions
Replace expensive operations with cheaper ones.
§Replace several floating point divisions by one division and several multiplications. Division is one of the most expensive floating point operations; the cost of a multiplication is negligible in comparison.
a = y/x
b = z/x
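As sketched below, one division and two multiplications replace the two divisions. (The rounding can differ in the last bit, which is why compilers only do this under relaxed floating point settings.)

t = 1.0/x;
a = y*t;
b = z*t;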

Strength reductions
Replace floating point divisions by one division and multiplications, especially inside loops:
do i = 1, n
   a(i) = b(i) / c(i)
   x(i) = y(i) / z(i)
enddo
becomes
do i = 1, n
   temp = 1.0 / (c(i)*z(i))
   a(i) = b(i) * z(i) * temp
   x(i) = y(i) * c(i) * temp
enddo
One division per iteration instead of two: b(i)*z(i)*temp = b(i)/c(i) and y(i)*c(i)*temp = y(i)/z(i).

Strength reductions
Replace expensive operations with cheaper ones.
§Replace power functions by floating point multiplications (see the sketch below).
- Power calculations can take 50 times longer than a multiplication.
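A sketch replacing a small constant power with multiplies:

/* pow(x, 4.0) as two multiplications */
x2 = x * x;
x4 = x2 * x2;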

Fused multiply add (fma)
Many processors have compound floating point multiply-add instructions that are more efficient than separate multiply and add instructions.
fma:  a*b + c
fnma: -(a*b) + c

Fused multiply add (fma)
a, b, c are complex numbers; compute c = c + a*b:
(cr, ci) + (ar, ai) * (br, bi) = (cr, ci) + ((ar*br - ai*bi), (ar*bi + ai*br))
Multiply f1 = ar*br
Multiply f2 = ar*bi
fnma     f3 = -ai*bi + f1
fma      f4 = ai*br + f2
Add      f5 = cr + f3
Add      f6 = ci + f4
Six instructions, and the two final adds must wait for the multiply results.

Fused multiply add (fma)
Alter the order of operations so that the additions of cr and ci fold into the fmas:
(cr, ci) + (ar, ai) * (br, bi) = ((cr + ar*br) - (ai*bi), (ci + ar*bi) + (ai*br))
fma  f1 = ar*br + cr
fma  f2 = ar*bi + ci
fnma f3 = -ai*bi + f1
fma  f4 = ai*br + f2
Four instructions instead of six.

Loop optimizations
§Loops account for most of the runtime of computational programs.
§Source code modifications can lead to significant improvements in runtime.
§Magnification factor: a small saving inside a loop body is multiplied by the number of iterations.

Single loop optimization
for (i=0; i<n; i+=2)
    a[i] = i*k + m;
§A multiple of the induction variable is added to a constant.
§Replace the multiplication with an addition:
counter = m;
for (i=0; i<n; i+=2) {
    a[i] = counter;
    counter = counter + k + k;
}
Because i advances by 2 each iteration, counter must advance by 2k, hence k + k.

Condition statements
Branches in code reduce performance, especially on pipelined systems.
do I = 1, N
   if (A > 0) then
      x(I) = x(I) + 1
   else
      x(I) = 0.0
   endif
enddo
The test on A does not depend on I, yet it is evaluated on every iteration.

Condition statements
Branches are a significant overhead on pipelined systems, and the branch mis-prediction penalty is high.
§Pipelines get stalled.
§In-flight instructions may have to be discarded.
§Try to move if statements out of the loop body.
- This optimization has to be applied on a case-by-case basis.

Test promotion in loops
The loop-invariant test on A is promoted out of the loop:
if (A > 0) then
   do I = 1, N
      x(I) = x(I) + 1
   enddo
else
   do I = 1, N
      x(I) = 0.0
   enddo
endif
Most compilers don't do this optimization.

Condition statements
do i = 1, len2
   do j = 1, len2
      if (j < i) then
         a2d(j,i) = a2d(j,i) + b2d(j,i)*con1
      else
         a2d(j,i) = 1.0
      endif
   enddo
enddo
The branch depends only on the loop indices, so the j loop can be split at the point where the test changes:
do i = 1, len2
   do j = 1, i-1
      a2d(j,i) = a2d(j,i) + b2d(j,i)*con1
   enddo
   do j = i, len2
      a2d(j,i) = 1.0
   enddo
enddo

Loop Peeling
Boundary conditions handled inside the loop:
do I = 1, N
   if (I == 1) then
      x(I) = 0
   elseif (I == N) then
      x(I) = N
   else
      x(I) = x(I) + y(I)
   endif
enddo
Peel the first and last iterations out of the loop:
x(1) = 0
do I = 2, N-1
   x(I) = x(I) + y(I)
enddo
x(N) = N
Most compilers don't do this optimization.

Loop Peeling
do i = 1, n
   y(i,n) = (1.0 - x(i,1))*y(1,n) + x(i,1)*y(n,n)
enddo
y(1,n) and y(n,n) are read on every iteration but also written when i=1 and i=n. Peeling those two iterations makes the remaining loop use the loop-invariant values t1 and t2:
t2 = y(n,n)
y(1,n) = (1.0 - x(1,1))*y(1,n) + x(1,1)*t2
t1 = y(1,n)
do i = 2, n-1
   y(i,n) = (1.0 - x(i,1))*t1 + x(i,1)*t2
enddo
y(n,n) = (1.0 - x(n,1))*t1 + x(n,1)*t2
Complex for compilers to perform.

Loop Unrolling
do i = 1, n
   A(i) = B(i)
enddo
§The loop index is incremented and checked at the beginning of each iteration.
§Branches interfere with pipelining.
§Register blocking: temporary variables that are used repeatedly.
Unrolled by 4 (assuming n is divisible by 4; a remainder-loop sketch follows below):
do i = 1, n, 4
   A(i)   = B(i)
   A(i+1) = B(i+1)
   A(i+2) = B(i+2)
   A(i+3) = B(i+3)
enddo
§Reduces the overhead of the index increment and conditional check.
§Gives the pipeline independent instructions to hide latencies.
§Some compilers allow users to specify the unrolling depth.
§Avoid excessive unrolling: register pressure and spills can hurt performance.

Loop Unrolling
do j = 1, m
   do i = 1, n
      do k = 1, p
         c(i,j) = c(i,j) + a(k,i)*b(k,j)
      enddo
   enddo
enddo
Unroll the two outer loops by 2 and jam the copies together, keeping the four running sums in temporaries (assumes m and n are even):
do j = 1, m, 2
   do i = 1, n, 2
      t1 = c(i,j)
      t2 = c(i+1,j)
      t3 = c(i,j+1)
      t4 = c(i+1,j+1)
      do k = 1, p
         t1 = t1 + a(k,i)*b(k,j)
         t2 = t2 + a(k,i+1)*b(k,j)
         t3 = t3 + a(k,i)*b(k,j+1)
         t4 = t4 + a(k,i+1)*b(k,j+1)
      enddo
      c(i,j)     = t1
      c(i+1,j)   = t2
      c(i,j+1)   = t3
      c(i+1,j+1) = t4
   enddo
enddo
Each loaded element of a and b now feeds two multiplies.

Loop Interchange
§To improve spatial locality.
§Align the access pattern with the order in which data is stored in memory.
§2-D arrays in Fortran are stored column-wise.
§2-D arrays in C are stored row-wise (see the sketch below).
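A minimal C sketch, assuming double a[N][N] with N a compile-time constant: C stores rows contiguously, so the inner loop should vary the last subscript.

int i, j;
/* poor locality: consecutive inner iterations are N doubles apart */
for (j = 0; j < N; j++)
    for (i = 0; i < N; i++)
        a[i][j] = 0.0;
/* after interchange: the inner loop walks along a row */
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        a[i][j] = 0.0;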

Loop Fusion
§Beneficial in loop-intensive programs.
§Decreases index calculation overhead.
§Can also increase instruction-level parallelism.
§Beneficial when the same data structures are used in different loops.

Loop Fusion
for (i=0; i<nodes; i++) {
    a[i] = a[i]*small;
    c[i] = (a[i] + b[i])*relaxn;
}
for (i=1; i<nodes-1; i++) {
    d[i] = c[i] - a[i];
}
Peel i=0 and i=nodes-1 so the loop bounds match, then fuse the two bodies:
a[0] = a[0]*small;
c[0] = (a[0] + b[0])*relaxn;
a[nodes-1] = a[nodes-1]*small;
c[nodes-1] = (a[nodes-1] + b[nodes-1])*relaxn;
for (i=1; i<nodes-1; i++) {
    a[i] = a[i]*small;
    c[i] = (a[i] + b[i])*relaxn;
    d[i] = c[i] - a[i];
}