Optimizing Loop Performance for Clustered VLIW Architectures
Yi Qian (Texas Instruments); co-authors: Steve Carr (Michigan Technological University), Phil Sweany (Texas Instruments)


Optimizing Loop Performance for Clustered VLIW Architectures
Yi Qian (Texas Instruments)
Co-authors: Steve Carr (Michigan Technological University), Phil Sweany (Texas Instruments)

Clustered VLIW Architecture

Motivation
Clustered VLIW architectures have been adopted to improve ILP while keeping the port requirements of the register files low. The compiler must:
- expose maximal parallelism, and
- maintain minimal communication overhead.
High-level optimizations can improve loop performance on clustered VLIW machines.

Background
Software pipelining (modulo scheduling) achieves ILP by overlapping the execution of different loop iterations.
Initiation Interval (II):
- ResII -- constraints from the machine resources.
- RecII -- constraints from the dependence recurrences.
- MinII = max(ResII, RecII)

Loop Transformations: Scalar Replacement
Replaces array references with scalar variables to improve register usage.

Original loop:

  for (i=0; i<n; ++i)
    for (j=0; j<n; ++j)
      a[i] = a[i] + b[j] * x[i][j];

After scalar replacement:

  for (i=0; i<n; ++i) {
    t = a[i];
    for (j=0; j<n; ++j)
      t = t + b[j] * x[i][j];
    a[i] = t;
  }

Loop Transformations: Unrolling and Unroll-and-jam
Unrolling reduces inter-iteration overhead but enlarges the loop body.
Unroll-and-jam balances the computation and memory-access requirements and improves uMinII (MinII / unrollAmount).

Original loop (uMinII = 4):

  for (i=1; i<=2*n; ++i)
    for (j=1; j<=n; ++j)
      a[i][j] = a[i][j] + b[j] * c[j];

Unroll-and-jammed loop (uMinII = 3, with 1 computational unit and 1 memory unit):

  for (i=1; i<=2*n; i+=2)
    for (j=1; j<=n; ++j) {
      a[i][j] = a[i][j] + b[j] * c[j];
      a[i+1][j] = a[i+1][j] + b[j] * c[j];
    }

Loop Transformations: Intercluster Parallelism
Unroll-and-jam/unrolling can generate intercluster parallelism.

Independent iterations split cleanly across clusters:

  for (i=0; i<2*n; ++i)
    a[i] = a[i] + 1;

  for (i=0; i<2*n; i+=2) {
    a[i] = a[i] + 1;        /* cluster 0 */
    a[i+1] = a[i+1] + 1;    /* cluster 1 */
  }

A loop-carried dependence becomes an intercluster dependence:

  for (i=0; i<2*n; ++i)
    a[i] = a[i-1] + 1;

  for (i=0; i<2*n; i+=2) {
    a[i] = a[i-1] + 1;      /* cluster 0 */
    a[i+1] = a[i] + 1;      /* cluster 1 */
  }

Loop Transformations: Loop Alignment
- Removes loop-carried dependences.
- Used to determine intercluster communication cost.

Original loop:

  for (i=1; i<n; ++i) {
    a[i] = b[i] + c[i];
    x[i] = a[i-1] * q;
  }

Aligned loop:

  x[1] = a[0] * q;
  for (i=1; i<n-1; ++i) {
    a[i] = b[i] + c[i];
    x[i+1] = a[i] * q;
  }
  a[n-1] = b[n-1] + c[n-1];

Loops with alignment conflicts, which alignment alone cannot remove:

  for (i=1; i<n; ++i) {
    a[i] = b[i] + q;
    c[i] = a[i-1] + a[i-2];
  }

  for (i=1; i<n; ++i)
    a[i] = a[i-1] + b[i];

Related Work: Partitioning Problem
- Ellis -- BUG
- Capitanio et al. -- LC-VLIW
- Nystrom et al. -- cluster assignment & software pipelining
- Ozer et al. -- UAS
- Sanchez et al. -- unified method
- Hiser et al. -- RCG
- Aleta et al. -- pseudo-scheduler

Related Work: Loop Transformations
Scalar replacement:
- Callahan et al. -- pipelined architectures
- Carr, Kennedy -- general algorithm
- Duesterwald -- data-flow framework
Loop alignment:
- Allen et al. -- shared-memory machines
Unrolling/unroll-and-jam:
- Callahan et al. -- pipelined architectures
- Carr, Kennedy -- ILP
- Carr, Guan -- linear algebra
- Carr -- cache, software pipelining
- Sarkar -- ILP, IC
- Sanchez et al. -- clustered machines
- Huang et al. -- clustered machines
- Shin et al. -- superword register files

Optimization Strategy
Source Code
  -> Unroll-and-jam/Unrolling
  -> Scalar Replacement
  -> Intermediate Code Generator
  -> Data-flow Optimization
  -> Value Cloning
  -> Register Partitioning
  -> Software Pipelining
  -> Assembly Code Generator
  -> Target Code

Our Method
- Picking loops to unroll
- Computing uMinII
- Computing register pressure (see paper)
- Determining unroll amounts

Picking Loops to Unroll
Pick the loop that:
- carries the most dependences amenable to scalar replacement;
- contains the fewest alignment conflicts.

Computing uMinII
- uRecII does not increase with unrolling.
- uResII is computed from the resource requirements of the unrolled loop body.

Computing Communication Cost for Unrolled Loops
Intercluster copies:
- multiple loops (see paper)
- single loop:
  - innermost loop is unrolled: invariant dep., variant dep.
  - innermost loop is not unrolled: invariant dep., variant dep.

Unrolling a Single Loop: Variant Dependences
(figure: dependence edges across clusters before and after unrolling, with formulas for the number of copies per cluster, the sinks of the new dependences, and the total cost)

Unrolling a Single Loop: Variant Dependences, Special Cases
When the unroll amount is a multiple of the dependence distance, the dependences line up with the cluster assignment.

4 clusters (distance 4, unrolled by 4):

  for (i=0; i<4*n; i+=4) {
    a[i] = a[i-4];
    a[i+1] = a[i-3];
    a[i+2] = a[i-2];
    a[i+3] = a[i-1];
  }

2 clusters (distance 2, unrolled by 6):

  for (i=0; i<6*n; i+=6) {
    a[i] = a[i-2];
    a[i+1] = a[i-1];
    a[i+2] = a[i];
    a[i+3] = a[i+1];
    a[i+4] = a[i+2];
    a[i+5] = a[i+3];
  }

Unrolling a Single Loop: Invariant Dependences
- References can be eliminated by scalar replacement.
- Clusters need a copy operation.

  for (j=1; j<=4*n; ++j)
    for (i=1; i<=m; ++i)
      a[j][i] = a[j][i-1] + b[i];

Unrolled by 4, with b[i] scalar-replaced:

  for (j=1; j<=4*n; j+=4)
    for (i=1; i<=m; ++i) {
      t = b[i];
      a[j][i] = a[j][i-1] + t;
      a[j+1][i] = a[j+1][i-1] + t;
      a[j+2][i] = a[j+2][i-1] + t;
      a[j+3][i] = a[j+3][i-1] + t;
    }

Determining Unroll Amounts
- Formulated as an integer optimization problem.
- Solved by exhaustive search or by a heuristic method.

Experimental Results
Benchmarks:
- 119 DSP loops from TI's benchmark suite.
- DSP applications: FIR filter, correlation, Reed-Solomon decoding, lattice filter, LMS filter, etc.
Architectures:
- URM, a simulated architecture:
  - 8 functional units -- 2 clusters, 4 clusters (1 copy unit)
  - 16 functional units -- 2 clusters, 4 clusters (2 copy units)
- TMS320C64x

URM Results
Unroll-and-jam/unrolling is applicable to 71 of the loops.
(table: URM speedups of the transformed loops over the originals -- harmonic mean, median, and number of loops improved -- for machine widths 8 and 16 with 2 and 4 clusters)

Our Algorithm vs. Fixed Unroll Amounts
Using a fixed unroll amount may cause performance degradation when communication costs are dominant.
(table: harmonic-mean speedups of our algorithm vs. fixed unroll amounts for machine widths 8 and 16 with 2 and 4 clusters; number of loops affected per configuration: 9, 4, 9, 21)

C64x Results
TMS320C64x speedups, unrolled vs. original:
- Harmonic mean: 1.7
- Median: 2
- Improved: 55 loops

Accuracy of the Communication Cost Model
We compare the number of predicted data transfers against the actual number of intercluster dependences found in the transformed loops:
- 2-cluster: 66 exact predictions
- 4-cluster: 64 exact predictions

Conclusion
- Proposed a communication cost model and an integer-optimization problem for predicting the performance of unrolled loops.
- 70%-90% of the 71 loops can be improved.
- High-level transformations should be an integral part of compilation for clustered VLIW machines.