Introduction to Optimization


Introduction to Optimization. John Cavazos, University of Delaware.

Lecture Overview: Motivation and Loop Transformations.

Why study compiler optimizations? Moore's Law: chip density doubles every 18 months, which historically was reflected in CPU performance doubling every 18 months. Proebsting's Law: compilers double CPU performance every 18 years, i.e., about a 4% improvement per year because of optimizations!
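A quick check of where the 4% figure comes from: an annual improvement $r$ that compounds to a doubling over 18 years satisfies $(1 + r)^{18} = 2$, so $r = 2^{1/18} - 1 \approx 0.039$, i.e., roughly 4% per year.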

Why study compiler optimizations? Corollary: 1 year of code optimization research = 1 month of hardware improvements. No need for compiler research... just wait a few months!

The Free Lunch is over. Moore's Law: chip density doubles every 18 months. PAST: density doubling was reflected in CPU performance doubling every 18 months. CURRENT: density doubling is reflected in more cores on the chip! Corollary: cores will become simpler; just wait a few months and your code might get slower! Many optimizations are now being done by hand (autotuning).

Optimizations: The Big Picture. What are our goals? Simple goal: make execution time as small as possible. Which leads to: achieve execution of many (all, in the best case) instructions in parallel, which requires finding independent instructions.

Dependences. We will concentrate on data dependences. Simple example of a data dependence:

S1   PI = 3.14
S2   R = 5.0
S3   AREA = PI * R ** 2

Statement S3 cannot be moved before either S1 or S2 without compromising correct results.

Dependences, formally: there is a data dependence from S1 to S2 (S2 depends on S1) if: 1. both statements access the same memory location and at least one of them stores to it, and 2. there is a feasible execution path from S1 to S2.

Load-Store Classification. Dependences are classified in terms of load-store order: 1. true dependence (RAW hazard), 2. antidependence (WAR hazard), 3. output dependence (WAW hazard).
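A minimal C sketch (not from the slides; all names are illustrative) showing the three cases on a scalar x:

/* Three kinds of data dependences, illustrated on the scalar x. */
void dependences(void) {
    int x, y, z;
    x = 1;      /* S1: write x                                   */
    y = x + 1;  /* S2: read x  -> true dependence (RAW) on S1    */
    x = 2;      /* S3: write x -> antidependence (WAR) on S2 and */
                /*                output dependence (WAW) on S1  */
    z = x;      /* S4: read x  -> true dependence (RAW) on S3    */
    (void)y; (void)z;  /* silence unused-variable warnings       */
}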

Dependence in Loops. Let us look at two different loops:

DO I = 1, N
  S1   A(I+1) = A(I) + B(I)
ENDDO

DO I = 1, N
  S1   A(I+2) = A(I) + B(I)
ENDDO

In both cases, statement S1 depends on itself.
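In C terms, a minimal sketch of the same two self-dependences (array names and bounds are assumptions): in the first loop the dependence distance is 1, so the iterations must run in order; in the second it is 2, so the even and odd iterations form two independent chains.

/* Loop-carried self-dependences with distances 1 and 2. */
void dist1(double *a, const double *b, int n) {
    /* a needs at least n+1 elements */
    for (int i = 0; i < n; i++)
        a[i + 1] = a[i] + b[i];   /* iteration i writes what iteration i+1 reads */
}

void dist2(double *a, const double *b, int n) {
    /* a needs at least n+2 elements */
    for (int i = 0; i < n; i++)
        a[i + 2] = a[i] + b[i];   /* iteration i writes what iteration i+2 reads */
}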

Transformations. We call a transformation safe if the transformed program has the same "meaning" as the original program. But what is the "meaning" of a program? For our purposes, two programs are equivalent if, on the same inputs, they produce the same outputs in the same order.

Loop Transformations. Compilers have always focused on loops: higher execution counts, repeated and related operations, and much of the real work takes place in loops.

Several effects to attack. Overhead: decrease the control-structure cost per iteration. Locality: spatial locality (use of co-resident data) and temporal locality (reuse of the same data). Parallelism: execute independent iterations of the loop in parallel.
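To make the two kinds of locality concrete, a small C sketch (hypothetical array and names):

/* Spatial locality: the stride-1 sweep over a touches consecutive elements
   that share cache lines.  Temporal locality: s is reused on every iteration. */
double sum(const double *a, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}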

Eliminating Overhead. Loop unrolling (the oldest trick in the book): to reduce overhead, replicate the loop body. Sources of improvement: less overhead per useful operation, and longer basic blocks for local optimization.

do i = 1 to 100 by 1
  a(i) = a(i) + b(i)
end

becomes (unroll by 4)

do i = 1 to 100 by 4
  a(i) = a(i) + b(i)
  a(i+1) = a(i+1) + b(i+1)
  a(i+2) = a(i+2) + b(i+2)
  a(i+3) = a(i+3) + b(i+3)
end
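The slide's trip count of 100 divides evenly by 4; for an arbitrary trip count n (an assumption beyond the slide), a C sketch of unroll-by-4 needs a cleanup loop for the leftover iterations:

/* Unroll-by-4 with a cleanup loop; illustrative sketch. */
void add_unrolled(double *a, const double *b, int n) {
    int i;
    for (i = 0; i + 3 < n; i += 4) {   /* main unrolled body */
        a[i]     = a[i]     + b[i];
        a[i + 1] = a[i + 1] + b[i + 1];
        a[i + 2] = a[i + 2] + b[i + 2];
        a[i + 3] = a[i + 3] + b[i + 3];
    }
    for (; i < n; i++)                 /* cleanup: at most 3 remaining iterations */
        a[i] = a[i] + b[i];
}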

Loop Fusion. Two loops over the same iteration space become one loop. Safe if it does not change the values used or defined by any statement in either loop (i.e., does not violate dependences).

do i = 1 to n
  c(i) = a(i) + b(i)
end
do j = 1 to n
  d(j) = a(j) * e(j)
end

becomes (fuse)

do i = 1 to n
  c(i) = a(i) + b(i)
  d(i) = a(i) * e(i)
end

In the original, for big arrays a(j) may no longer be in the cache when the second loop runs; after fusion, a(i) will be found in the cache.
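The same transformation as a small C sketch (array names as in the slide, signatures assumed):

/* Before fusion: two separate passes over the same index range. */
void unfused(double *c, double *d, const double *a,
             const double *b, const double *e, int n) {
    for (int i = 0; i < n; i++) c[i] = a[i] + b[i];
    for (int j = 0; j < n; j++) d[j] = a[j] * e[j];
}

/* After fusion: one pass; a[i] is reused while it is likely still in cache. */
void fused(double *c, double *d, const double *a,
           const double *b, const double *e, int n) {
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
        d[i] = a[i] * e[i];
    }
}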

Loop Fusion Advantages: enhances temporal locality, reduces control overhead, creates longer blocks for local optimization and scheduling, and can convert inter-loop reuse to intra-loop reuse.

Loop Fusion of Parallel Loops. Fusing parallel loops is legal if the dependences are loop independent, i.e., the source and target of each flow dependence map to the same loop iteration; then each iteration of the fused loop can still execute in parallel.

Loop Distribution (Fission). A single loop with independent statements becomes multiple loops. The transformation starts by constructing a statement-level dependence graph. It is safe to perform distribution if no cycle of the dependence graph is split across loops: statements that form a cycle in the dependence graph are put in the same loop.

Loop Distribution (Fission). The loop below has the following statement-level dependence graph (drawn as a figure on the slide); note that statements (3) and (4) form a cycle, since (3) writes B[I], which (4) reads, and (4) writes C[I], which (3) reads as C[I-1] in the next iteration:

(1) for I = 1 to N do
(2)   A[I] = A[I] + B[I-1]
(3)   B[I] = C[I-1]*X + C
(4)   C[I] = 1/B[I]
(5)   D[I] = sqrt(C[I])
(6) endfor

Loop Distribution (Fission).

(1) for I = 1 to N do
(2)   A[I] = A[I] + B[I-1]
(3)   B[I] = C[I-1]*X + C
(4)   C[I] = 1/B[I]
(5)   D[I] = sqrt(C[I])
(6) endfor

becomes (fission)

for I = 1 to N do
  A[I] = A[I] + B[I-1]
endfor
for I = 1 to N do
  B[I] = C[I-1]*X + C
  C[I] = 1/B[I]
endfor
for I = 1 to N do
  D[I] = sqrt(C[I])
endfor
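As a C sketch, one legal ordering of the distributed loops (array names, the scalar x, and the constant c0 are assumptions standing in for the slide's X and C): the B/C statements form a cycle in the dependence graph, so they stay in one loop, and that loop is placed before the loop that reads b[i-1] so the loop-carried flow dependence on b is preserved.

#include <math.h>

/* Distributed version of the loop from the slide; arrays need at least n+1
   elements (1-based indexing is kept, and b[0]/c[0] are read as initial values). */
void distributed(double *a, double *b, double *c, double *d,
                 double x, double c0, int n) {
    for (int i = 1; i <= n; i++) {    /* cycle between b[i] and c[i] stays in one loop */
        b[i] = c[i - 1] * x + c0;
        c[i] = 1.0 / b[i];
    }
    for (int i = 1; i <= n; i++)      /* reads b[i-1] produced by the loop above */
        a[i] = a[i] + b[i - 1];
    for (int i = 1; i <= n; i++)      /* reads c[i] produced by the first loop   */
        d[i] = sqrt(c[i]);
}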

Loop Fission Advantages: enables other transformations (e.g., vectorization), and the resulting loops have smaller cache footprints, so more reuses hit in the cache.

Loop Tiling (blocking). We want to exploit temporal locality in a loop nest.


Loop Tiling Effects: reduces the volume of data touched between reuses; works on one "tile" at a time (the tile size is B by B); the choice of tile size is crucial.
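The original slides illustrate tiling with figures; as a stand-in, here is a minimal C sketch of the idea on matrix multiplication (a common tiling example, not necessarily the one in the figures), with tile size B:

#define B 32   /* tile size; the best value depends on the cache */

/* Tiled (blocked) n x n matrix multiply, row-major; assumes c is zero-initialized.
   The inner three loops work on B-by-B tiles, so the data they touch fits in cache
   and each element is reused many times before being evicted. */
void matmul_tiled(int n, const double *a, const double *b, double *c) {
    for (int ii = 0; ii < n; ii += B)
        for (int kk = 0; kk < n; kk += B)
            for (int jj = 0; jj < n; jj += B)
                for (int i = ii; i < ii + B && i < n; i++)
                    for (int k = kk; k < kk + B && k < n; k++)
                        for (int j = jj; j < jj + B && j < n; j++)
                            c[i*n + j] += a[i*n + k] * b[k*n + j];
}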

Scalar Replacement. Register allocators never keep a subscripted array reference, such as c(i), in a register, but we can trick the allocator by rewriting the references. The plan: locate patterns of consistent reuse, make the loads and stores use a temporary scalar variable, and replace the array references with the temporary's name.

Scalar Replacement.

do i = 1 to n
  do j = 1 to n
    a(i) = a(i) + b(j)
  end
end

becomes (scalar replacement)

do i = 1 to n
  t = a(i)
  do j = 1 to n
    t = t + b(j)
  end
  a(i) = t
end

Almost any register allocator can get t into a register.

Scalar Replacement Effects: decreases the number of loads and stores, and keeps reused values in names that can be allocated to registers. In essence, this exposes the reuse of a(i) to subsequent passes.
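A C sketch of the same rewrite (function and array names assumed):

/* Before: the compiler may reload and store a[i] on every j iteration,
   since in general it cannot prove that a and b do not alias.        */
void sum_rows(double *a, const double *b, int n) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            a[i] = a[i] + b[j];
}

/* After scalar replacement (valid when a and b do not overlap): the running
   value lives in the scalar t, which almost any register allocator will
   keep in a register.                                                  */
void sum_rows_sr(double *a, const double *b, int n) {
    for (int i = 0; i < n; i++) {
        double t = a[i];
        for (int j = 0; j < n; j++)
            t = t + b[j];
        a[i] = t;
    }
}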