Register Pressure Guided Unroll-and-Jam

Presentation transcript:

Register Pressure Guided Unroll-and-Jam
Authors: Yin Ma, Steven Carr

Motivation
- In a processor, registers sit at the fastest level of the memory hierarchy, but the number of physical registers is very limited.
- Unroll-and-jam in Open64's loop model not only increases register pressure by itself but also creates opportunities for other loop optimizations to increase register pressure indirectly.
- If a transformed loop demands too many registers, overall performance may degrade.
- Given a loop nest, a better register pressure prediction and a well-chosen unroll factor can eliminate this degradation and achieve better overall performance.

Research Topics
- A register pressure prediction algorithm for unroll-and-jam
- A register pressure guided loop model for unroll-and-jam

Background: Data Dependence Analysis
- The data dependence graph (DDG) is a directed graph that represents the data dependence relationships among instructions.
- A true dependence exists when S1 stores into a memory location that S2 reads later.
- An anti-dependence exists when S1 reads from a memory location that S2 writes later.
- An output dependence exists when S1 and S2 store into the same memory location.
- An input dependence exists when the same memory location is read by both S1 and S2.

  True dependence:    S1: X = ...    S2: ... = X
  Anti-dependence:    S1: ... = X    S2: X = ...
  Output dependence:  S1: X = ...    S2: X = ...
  Input dependence:   S1: ... = X    S2: ... = X
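As a concrete illustration (a hypothetical C fragment, not from the slides), the four dependence kinds can be seen in pairs of statements that execute in program order:

  /* Hypothetical illustration of the four dependence kinds. */
  void dependence_examples(int x[], int y[], int i) {
      int z, w;

      x[i] = y[i] + 1;   /* S1 writes x[i]                            */
      z    = x[i] * 2;   /* S2 reads it  -> true (flow) dependence    */

      z    = x[i] * 2;   /* S1 reads x[i]                             */
      x[i] = z + 1;      /* S2 writes it -> anti-dependence           */

      x[i] = y[i] + 1;   /* S1 writes x[i]                            */
      x[i] = z * 2;      /* S2 writes it -> output dependence         */

      z    = x[i] + 1;   /* S1 reads x[i]                             */
      w    = x[i] * 2;   /* S2 reads it  -> input dependence          */
      (void)z; (void)w;
  }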

Background: Scalar Replacement
- Replaces array references with scalars, which are later allocated to registers, in order to decrease the number of memory references in loops.
- This directly increases register pressure.

Original:
  for ( i = 2; i < n; i++ )
    a[i] = a[i-1] + b[i];

Scalar replaced:
  T = a[1];
  for ( i = 2; i < n; i++ ) {
    T = T + b[i];
    a[i] = T;
  }

Background: Unroll-and-Jam
- Creates larger loop bodies by unrolling an outer loop and fusing (jamming) the resulting copies of the inner loop.
- Larger loop bodies let other optimizations create more register pressure.

Original:
  for ( I = 1 ; I < 10 ; I++ ) {
    for ( J = 1; J < 5 ; J++ ) {
      A[I][J] = B[J] + C[J];
      D[I][J] = E[I][J] + F[I][J];
    }
  }

Unroll-and-jammed (outer loop unrolled by 2) and then scalar replaced:
  for ( I = 1 ; I < 10 ; I = I+2 ) {
    for ( J = 1; J < 5 ; J++ ) {
      b = B[J];
      c = C[J];
      A[I][J] = b + c;
      D[I][J] = E[I][J] + F[I][J];
      A[I+1][J] = b + c;
      D[I+1][J] = E[I+1][J] + F[I+1][J];
    }
  }
  /* Register pressure increases because b and c occupy two registers
     that could otherwise be reused for E and F. */

Background: Software Pipelining
- Software pipelining is an advanced scheduling technique. Usually, more heavily overlapped instructions demand additional registers.
- The initiation interval (II) of a loop is the number of cycles used to finish one iteration.
- The resource II (ResII) gives the minimum number of cycles needed to execute the loop based upon machine resources, such as the number of functional units.
- The recurrence II (RecII) gives the minimum number of cycles needed for a single iteration based upon the length of the cycles in the data dependence graph.

Example: a loop executed N times, each iteration containing operations A, B, C, and D, software pipelined because of dependences among the operations:

  [Prelude]    D1  B1  D2
  [Loop Body]  executed N-2 times (with index i):  Ai  Ci  B(i+1)  D(i+2)
  [Postlude]   A(N-1)  C(N-1)  BN  AN  CN
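A minimal hand-pipelined C sketch (illustrative only, not from the slides) shows why overlapping iterations raises register pressure: the temporary that crosses the kernel boundary needs its own register.

  /* Original loop: each iteration loads b[i], computes, and stores a[i]. */
  void scale(int *a, const int *b, int n) {
      for (int i = 0; i < n; i++)
          a[i] = b[i] * 3;
  }

  /* Hand software-pipelined version: the load/multiply of iteration i
     overlaps with the store of iteration i-1.  The temporary t stays
     live across the kernel, so one extra register is needed. */
  void scale_pipelined(int *a, const int *b, int n) {
      if (n <= 0) return;
      int t = b[0] * 3;              /* prelude: first stage of iteration 0     */
      for (int i = 1; i < n; i++) {
          a[i - 1] = t;              /* kernel: second stage of iteration i-1   */
          t = b[i] * 3;              /*         first stage of iteration i      */
      }
      a[n - 1] = t;                  /* postlude: second stage of iteration n-1 */
  }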

Typical Approaches to Preventing Degradation from Register Pressure
- Predictive approach (our approach): predict the effects before applying optimizations and decide the best set of parameters. Fastest, and fits all situations.
- Iterative approach (e.g., feedback-based): apply optimizations with one set of parameters, then redo with adjusted parameters until performance improves.
- Genetic approach: prepare many sets of parameters and apply the optimizations with each set; use genetic programming to pick the best.

Problems in Previous Work
- All existing predictive register pressure methods are designed for software pipelining; they do not support source-code-level loop optimizations at all.
- There is no systematic research on how to predict register pressure for loop optimizations.
- There is no register pressure guided loop model.

Key Design Details
- The prediction algorithm works at the source-code level.
- The prediction algorithm handles the effects on register pressure from unroll-and-jam, scalar replacement, software pipelining, and general scalar optimizations.
- The register pressure guided loop model uses the predicted register information to pick the unroll vector that gives the best performance.

Register Prediction for Unroll-and-Jam (Overview)
- Compute RecII with our heuristic method.
- Create the list of arrays that will be replaced by scalars by checking the original DDG.
- Construct the new DDG D1, using the list above, for the original loop only.
- All unrolled copies reuse DDG D1 as their base DDG.
- Adjust each copy of the DDG to reflect the future changes.
- Re-compute the ResII to get MinII.
- Perform a pseudo schedule to get the register pressure.

Construct the Base DDG
Traverse the innermost loop and construct the base DDG:

  DO J = 1, N
    DO I = 1, N
      U(I,J) = V(I) + P(J,I)
    ENDDO
  ENDDO
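As a data-structure sketch (an assumed layout, not Open64's actual representation), a base DDG can be kept as a set of nodes for the loop-body operations plus edges carrying the dependence kind and iteration distance:

  enum dep_kind { DEP_TRUE, DEP_ANTI, DEP_OUTPUT, DEP_INPUT };

  struct ddg_edge {
      int src, dst;          /* node indices                            */
      enum dep_kind kind;    /* dependence classification               */
      int distance;          /* iteration distance of the dependence    */
  };

  struct ddg {
      int n_nodes;           /* one node per operation in the loop body */
      int n_edges;
      struct ddg_edge *edges;
  };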

Prepare the DDG After Unroll-and-Jam
Duplicate the base DDG according to the given unroll factors. With an unroll factor of 2 on the J loop:

  DO J = 1, N, 2
    DO I = 1, N
      U(I,J)   = V(I) + P(J,I)
      U(I,J+1) = V(I) + P(J+1,I)
    ENDDO
  ENDDO
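A minimal sketch of the duplication step, building on the hypothetical struct above (assumed code, not Open64's implementation): each copy gets its own node range, and intra-copy edges are replicated with shifted indices. Cross-copy reuse edges are added later, in the finalize step.

  #include <stdlib.h>

  struct ddg *duplicate_ddg(const struct ddg *base, int unroll) {
      struct ddg *d = malloc(sizeof *d);
      if (!d) return NULL;
      d->n_nodes = base->n_nodes * unroll;
      d->n_edges = base->n_edges * unroll;
      d->edges = malloc(d->n_edges * sizeof *d->edges);
      if (!d->edges) { free(d); return NULL; }
      for (int copy = 0; copy < unroll; copy++) {
          int shift = copy * base->n_nodes;          /* node range of this copy */
          for (int e = 0; e < base->n_edges; e++) {
              struct ddg_edge edge = base->edges[e];
              edge.src += shift;
              edge.dst += shift;
              d->edges[copy * base->n_edges + e] = edge;
          }
      }
      return d;
  }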

Finalize the DDG
- Remove unnecessary nodes/edges and add new edges based on the updated dependences.
- Reflect the effect of further optimizations.
- Consider array-indexing reuse by analyzing array subscripts.

Register Prediction
- Schedule the final DDG with a depth-first scan starting from the first node of the first iteration copy.
- The RecII is the RecII of the original innermost loop.
- The ResII is computed on the final DDG using the target architecture information.
- MinII = MAX(RecII, ResII)
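A small sketch of the last two steps (hypothetical helpers, not the paper's exact implementation): MinII is the larger of RecII and ResII, and register pressure can be estimated from a schedule as the maximum number of values simultaneously live (MaxLive).

  #include <stddef.h>

  /* MinII is bounded below by both the recurrence and resource constraints. */
  static int min_ii(int rec_ii, int res_ii) {
      return rec_ii > res_ii ? rec_ii : res_ii;
  }

  /* Estimate register pressure from a (pseudo) schedule: given the cycle in
     which each value is defined and the cycle of its last use, the pressure
     is the maximum number of values live in any single cycle. */
  static int max_live(const int *def_cycle, const int *last_use,
                      size_t n_values, int n_cycles) {
      int max = 0;
      for (int c = 0; c < n_cycles; c++) {
          int live = 0;
          for (size_t v = 0; v < n_values; v++)
              if (def_cycle[v] <= c && c <= last_use[v])
                  live++;
          if (live > max)
              max = live;
      }
      return max;
  }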

Register Pressure Guided Unroll-and-Jam
UnitII is used as the performance indicator of an unroll-and-jammed loop, where:
- R is the number of registers predicted
- P is the number of registers available
- D is the total outgoing degree in the final DDG
- E is the total number of cross-iteration edges
- A is the average memory access penalty
- N is the number of nodes in the final DDG
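The transcript omits the UnitII formula itself (it was a figure on the slide), so the following C sketch only illustrates the selection idea under assumed names: evaluate each candidate unroll factor, penalize candidates whose predicted register pressure R exceeds the available registers P, and keep the candidate with the smallest per-original-iteration cost. The cost function unit_ii below is hypothetical, not the paper's formula.

  /* Hypothetical per-original-iteration cost: MinII of the unrolled body
     divided by the unroll factor, plus a spill penalty when the predicted
     pressure R exceeds the available registers P.  NOT the paper's UnitII. */
  static double unit_ii(int min_ii, int unroll, int R, int P, double spill_penalty) {
      double cost = (double)min_ii / (double)unroll;
      if (R > P)
          cost += (R - P) * spill_penalty;   /* each extra register roughly costs a spill */
      return cost;
  }

  /* predict() stands in for the prediction algorithm described above and is
     assumed to return the MinII and register pressure of the unrolled loop. */
  struct prediction { int min_ii; int registers; };

  int best_unroll(int max_unroll, int available_regs, double spill_penalty,
                  struct prediction (*predict)(int unroll)) {
      int best = 1;
      double best_cost = 1e30;
      for (int u = 1; u <= max_unroll; u++) {
          struct prediction p = predict(u);
          double cost = unit_ii(p.min_ii, u, p.registers, available_regs, spill_penalty);
          if (cost < best_cost) {
              best_cost = cost;
              best = u;
          }
      }
      return best;
  }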

Open64 Implementation & Experimental Results
- For register prediction, a retargetable compiler with an unlimited number of physical registers is used. Loop nests are extracted from SPEC2000.
- For register pressure guided unroll-and-jam, our model directly replaces the unroll-and-jam analysis used by the Open64 backend. A minor term computed from Open64's cache model is added to UnitII.
- Register prediction for unroll-and-jam predicts the floating-point register pressure of a loop to within 3 registers and the integer register pressure to within 4 registers.
- Our register pressure guided unroll-and-jam improves overall performance by about 2% over the existing model in the Open64 backend, on both x86 and x86-64 architectures, on the Polyhedron benchmark.

The End. Any questions?