Optimizing single thread performance: dependence and loop transformations

Optimizing single thread performance

Assuming that all instructions are doing useful work, how can you make the code run faster?
- Some sequences of code run faster than others:
  - Optimize for the memory hierarchy
  - Optimize for specific architecture features such as pipelining
- Both optimizations require changing the execution order of the instructions.

  A[0][0] = 0.0; A[0][1] = 0.0; ... A[1000][1000] = 0.0;

  A[0][0] = 0.0; A[1][0] = 0.0; ... A[1000][1000] = 0.0;

Both fragments initialize A; is one better than the other?

Changing the order of instructions without changing the semantics of the program

The semantics of a program are defined by its sequential execution.
- Optimization should not change what the program does.
Parallel execution also changes the order of instructions.
- When is it safe to change the execution order (e.g. run instructions in parallel)?

  A=1; B=2; C=3; D=4        run as  A=1 || B=2, then C=3 || D=4   =>  A=1, B=2, C=3, D=4
  A=1; B=A+1; C=B+1; D=C+1  run as  A=1 || B=A+1, then C=B+1 || D=C+1  =>  A=1, B=?, C=?, D=?

When is it safe to change order?

When can you change the order of two instructions without changing the semantics?
- When they do not operate (read or write) on the same variables.
- When they only read the same variables. One read and one write is bad (the read may not get the right value), and two writes are also bad (the end result differs).
This is formally captured in the concept of data dependence:
- True dependence: Write X - Read X (RAW)
- Output dependence: Write X - Write X (WAW)
- Anti dependence: Read X - Write X (WAR)
- What about RAR? Two reads can be freely reordered, so it is not a dependence.

Data dependence examples

  A=1; B=2; C=3; D=4        no dependences: any order (or parallel execution) gives A=1, B=2, C=3, D=4
  A=1; B=A+1; C=B+1; D=C+1  a chain of true dependences: the order cannot change

When two instructions have no dependence, their execution order can be changed, or the two instructions can be executed in parallel.

Data dependence in loops

  for (i=1; i<500; i++) a[i] = 0;           /* no loop-carried dependence */

  for (i=1; i<500; i++) a[i] = a[i-1] + 1;  /* loop-carried dependence */

When there is no loop-carried dependence, the order of executing the loop body instances does not matter: the loop can be parallelized (executed in parallel).

Loop-carried dependence

A loop-carried dependence is a dependence between statements in different iterations of a loop; a dependence within a single iteration is called a loop-independent dependence. Loop-carried dependence is what prevents loops from being parallelized.
- This matters because loops contain most of the parallelism in a program.
A loop-carried dependence can sometimes be represented by a dependence distance (or direction) vector that tells which iteration depends on which iteration.
- When one tries to change the loop execution order, the loop-carried dependences need to be honored.

Dependence and parallelization

For a set of instructions without dependences:
- Execution in any order produces the same results.
- The instructions can be executed in parallel.
For two instructions with a dependence:
- They must be executed in the original sequence.
- They cannot be executed in parallel.
Loops with no loop-carried dependence can be parallelized (iterations executed in parallel); loops with loop-carried dependence cannot be parallelized (they must be executed in the original order).

Optimizing single thread performance through loop transformations

90% of execution time is spent in 10% of the code, mostly in loops, and loops are relatively easy to analyze.
Loop optimizations are different ways to transform loops while keeping the same semantics. The objective:
- Single-thread systems: mostly optimizing for the memory hierarchy.
- Multi-thread systems: loop parallelization. A parallelizing compiler automatically finds the loops that can be executed in parallel.

Loop optimization: scalar replacement of array elements

  for (i=0; i<N; i++)
    for (j=0; j<N; j++)
      for (k=0; k<N; k++)
        c[i][j] = c[i][j] + a[i][k] * b[k][j];

  for (i=0; i<N; i++)
    for (j=0; j<N; j++) {
      ct = c[i][j];
      for (k=0; k<N; k++)
        ct = ct + a[i][k] * b[k][j];
      c[i][j] = ct;
    }

Registers are almost never allocated to array elements: the compiler usually cannot prove the element is not aliased. Scalar replacement allows a register to be allocated to the scalar, which reduces memory references. Also known as register pipelining.

Loop normalization

  for (i=a; i<=b; i+=c) {
    ......
  }

  for (ii=1; ii<=(b-a)/c+1; ii++) {
    i = a + (ii-1) * c;
    ......
  }

Loop normalization does not do much by itself, but it makes the iteration space much easier to manipulate, which enables other optimizations.

Loop transformations

Change the shape of loop iterations:
- Change the access pattern
- Increase data reuse (locality)
- Reduce overheads
Valid transformations need to maintain the dependences: if iteration (i1, i2, ..., in) depends on (j1, j2, ..., jn), then the transformed iteration (j1', j2', ..., jn') needs to happen before (i1', i2', ..., in') in a valid transformation.

Loop transformations

- Unimodular transformations: loop interchange, loop permutation, loop reversal, loop skewing, and many others
- Loop fusion and distribution
- Loop tiling
- Loop unrolling

Unimodular transformations

A unimodular matrix is a square matrix with all integral components and a determinant of 1 or -1. Let the unimodular matrix be U; it transforms iteration I = (i1, i2, ..., in) to iteration UI.
- Applicability (proven by Michael Wolf): a unimodular transformation represented by matrix U is legal when applied to a loop nest with a set of distance vectors D if and only if for each d in D, Ud >= 0 (lexicographically nonnegative: the first nonzero component of Ud is positive).
- A distance vector describes the dependences in the loop nest.

Unimodular transformations example: loop interchange

  for (i=1; i<n; i++)
    for (j=0; j<n; j++)
      a[i][j] = a[i-1][j] + 1;

  for (j=0; j<n; j++)
    for (i=1; i<n; i++)
      a[i][j] = a[i-1][j] + 1;

Why is this transformation valid? The calculation of a[i-1][j] still happens before that of a[i][j].

Unimodular transformations example: loop permutation

  for (i=0; i<n; i++)
    for (j=0; j<n; j++)
      for (k=0; k<n; k++)
        for (l=0; l<n; l++)
          ......

Loop permutation generalizes interchange: any reordering of the loop headers is a permutation, and it is legal when every transformed distance vector stays lexicographically nonnegative.

Unimodular transformations example: loop reversal

  for (i=1; i<n; i++)
    for (j=0; j<n; j++)
      a[i][j] = a[i-1][j] + 1.0;

  for (i=1; i<n; i++)
    for (j=n-1; j>=0; j--)
      a[i][j] = a[i-1][j] + 1.0;

Reversing the j loop is legal here because the dependence distance in the j dimension is 0.

Unimodular transformations example: loop skewing

  for (i=0; i<n; i++)
    for (j=0; j<n; j++)
      a[i] = a[i+j] + 1.0;

  for (i=0; i<n; i++)
    for (j=i; j<i+n; j++)
      a[i] = a[j] + 1.0;

Skewing replaces the inner index j by j' = i + j, shifting the inner iteration space by the outer index.

Loop fusion

Takes two adjacent loops that have the same iteration space and combines their bodies.
- Legal when fusing does not reverse any dependence between the two loop bodies (no flow, anti- or output dependence is violated in the fused loop).
- Why:
  - Increases the loop body and reduces loop overheads
  - Increases the chance of instruction scheduling
  - May improve locality

  for (i=0; i<n; i++) a[i] = 1.0;
  for (j=0; j<n; j++) b[j] = 1.0;

  for (i=0; i<n; i++) {
    a[i] = 1.0;
    b[i] = 1.0;
  }

Loop distribution

Takes one loop and partitions it into two loops.
- Legal when no dependence is broken (statements in a dependence cycle must stay in the same loop).
- Why:
  - Reduces the memory trace
  - Improves locality
  - Increases the chance of instruction scheduling

  for (i=0; i<n; i++) {
    a[i] = 1.0;
    b[i] = a[i];
  }

  for (i=0; i<n; i++) a[i] = 1.0;
  for (j=0; j<n; j++) b[j] = a[j];

Loop tiling

Replaces a single loop with two nested loops:

  for (i=0; i<n; i++) ...

becomes

  for (i=0; i<n; i+=t)
    for (ii=i; ii<min(i+t,n); ii++) ...

t is called the tile size. An n-deep nest can be changed into anywhere from an (n+1)-deep to a 2n-deep nest.

  for (i=0; i<n; i++)
    for (j=0; j<n; j++)
      for (k=0; k<n; k++)
        ...

  for (i=0; i<n; i+=t)
    for (ii=i; ii<min(i+t,n); ii++)
      for (j=0; j<n; j+=t)
        for (jj=j; jj<min(j+t,n); jj++)
          for (k=0; k<n; k+=t)
            for (kk=k; kk<min(k+t,n); kk++)
              ...

Loop tiling

- When combined with loop interchange, loop tiling creates inner loops with a smaller memory trace, which is great for locality.
- Loop tiling is one of the most important techniques for optimizing locality: it reduces the size of the working set and changes the memory reference pattern.

  for (i=0; i<n; i+=t)
    for (ii=i; ii<min(i+t,n); ii++)
      for (j=0; j<n; j+=t)
        for (jj=j; jj<min(j+t,n); jj++)
          for (k=0; k<n; k+=t)
            for (kk=k; kk<min(k+t,n); kk++)
              ...

  for (i=0; i<n; i+=t)
    for (j=0; j<n; j+=t)
      for (k=0; k<n; k+=t)
        for (ii=i; ii<min(i+t,n); ii++)
          for (jj=j; jj<min(j+t,n); jj++)
            for (kk=k; kk<min(k+t,n); kk++)
              ...

After interchange, the three inner loops cover a single t x t x t tile: a much smaller memory footprint.

Loop unrolling

  for (i=0; i<100; i++)
    a[i] = 1.0;

  for (i=0; i<100; i+=4) {
    a[i]   = 1.0;
    a[i+1] = 1.0;
    a[i+2] = 1.0;
    a[i+3] = 1.0;
  }

- Reduces control overheads.
- Increases the chance for instruction scheduling.
- A larger body may require more resources (registers).
This can be very effective!

Loop optimization in action

Optimizing matrix multiply:

  for (i=1; i<=N; i++)
    for (j=1; j<=N; j++)
      for (k=1; k<=N; k++)
        c(i,j) = c(i,j) + A(i,k) * B(k,j)

Where should we focus the optimization? The innermost loop, whose memory references are c(i,j), A(i,1..N), and B(1..N,j).
- Spatial locality: a memory reference stride of 1 is best.
- Temporal locality: hard to reuse cached data since the memory trace is too large.

Loop optimization in action

Initial improvement: increase spatial locality in the inner loop so that references to both A and B have stride 1.
- Transpose A before entering this operation (assuming column-major storage).
- Demonstrated in my_mm.c, method 1.

  Transpose A  /* for all i, j: A'(i,j) = A(j,i) */
  for (i=1; i<=N; i++)
    for (j=1; j<=N; j++)
      for (k=1; k<=N; k++)
        c(i,j) = c(i,j) + A'(k,i) * B(k,j)

Loop optimization in action

c(i,j) is repeatedly referenced in the inner loop: scalar replacement (method 2).

  Transpose A
  for (i=1; i<=N; i++)
    for (j=1; j<=N; j++)
      for (k=1; k<=N; k++)
        c(i,j) = c(i,j) + A(k,i) * B(k,j)

  Transpose A
  for (i=1; i<=N; i++)
    for (j=1; j<=N; j++) {
      s = c(i,j);
      for (k=1; k<=N; k++)
        s = s + A(k,i) * B(k,j);
      c(i,j) = s;
    }

Loop optimization in action

The inner loop's memory footprint is too large: A(1..N,i) and B(1..N,j).
- Loop tiling + loop interchange shrinks the inner loop's footprint to A(k..k+t-1,i) and B(k..k+t-1,j).
- Using blocking, one can tune the performance for the memory hierarchy: the innermost loop fits in registers, the second innermost loop fits in L2 cache, and so on.

Method 4:

  for (j=1; j<=N; j+=t)
    for (k=1; k<=N; k+=t)
      for (i=1; i<=N; i+=t)
        for (ii=i; ii<=min(i+t-1,N); ii++)
          for (jj=j; jj<=min(j+t-1,N); jj++) {
            s = c(ii,jj);
            for (kk=k; kk<=min(k+t-1,N); kk++)
              s = s + A(kk,ii) * B(kk,jj);
            c(ii,jj) = s;
          }

Loop optimization in action

Loop unrolling (method 5): fully unroll the kk loop, shown here for a tile size of 16.

  for (j=1; j<=N; j+=t)
    for (k=1; k<=N; k+=t)
      for (i=1; i<=N; i+=t)
        for (ii=i; ii<=min(i+t-1,N); ii++)
          for (jj=j; jj<=min(j+t-1,N); jj++) {
            s = c(ii,jj);
            s = s + A(k,ii)   * B(k,jj);
            s = s + A(k+1,ii) * B(k+1,jj);
            ......
            s = s + A(k+15,ii) * B(k+15,jj);
            c(ii,jj) = s;
          }

This assumes the loop can be unrolled exactly; you need to take care of the boundary conditions.

Loop optimization in action

Instruction scheduling (method 6): '+' would have to wait on the result of '*' in a typical processor, and '*' is often deeply pipelined, so feed the pipeline with many independent '*' operations.

  for (j=1; j<=N; j+=t)
    for (k=1; k<=N; k+=t)
      for (i=1; i<=N; i+=t)
        for (ii=i; ii<=min(i+t-1,N); ii++)
          for (jj=j; jj<=min(j+t-1,N); jj++) {
            s0 = A(k,ii)   * B(k,jj);
            s1 = A(k+1,ii) * B(k+1,jj);
            ......
            s15 = A(k+15,ii) * B(k+15,jj);
            c(ii,jj) = c(ii,jj) + s0 + s1 + ... + s15;
          }

Loop optimization in action

Further locality improvement: store A, B, and C in block order (method 7), so that each tile is contiguous in memory. The loop structure is the same as in method 6; only the storage layout of the arrays changes.

Loop optimization in action

See the ATLAS paper for the complete story: R. C. Whaley, A. Petitet, and J. J. Dongarra, "Automated Empirical Optimization of Software and the ATLAS Project," Parallel Computing, 27(1-2):3-35, 2001.

Summary

- Dependence and parallelization: when can a loop be parallelized?
- Loop transformations: what do they do? When is a loop transformation valid? Examples of loop transformations.