Applying Data Copy To Improve Memory Performance of General Array Computations
Qing Yi, University of Texas at San Antonio

Improving Memory Performance
[Diagram: each processor has its own registers, cache, and memory; multiple processors share memory over a network connection. Annotations: locality optimizations within a single memory hierarchy, parallelization across processors, communication optimizations across the network.]

Compiler Optimizations for Locality
- Computation optimizations
  - Loop blocking, fusion, unroll-and-jam, interchange, unrolling
  - Rearrange computations for better spatial and temporal locality
- Data-layout optimizations (see the sketch below)
  - Static layout transformation: a single layout for the entire application; no additional overhead, but a tradeoff between different layout choices
  - Dynamic layout transformation: a different layout for each computation phase; flexible but can be expensive
- Combining computation and data transformations
  - Static layout transformation: transform the layout first, then the computation
  - Dynamic layout transformation: transform the computation first, then dynamically rearrange the layout
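To make the two families concrete, here is a minimal C sketch (my own illustration, not from the slides; the helper names, the column-major m x n layout, and the specific loops are assumptions). The first function shows a computation transformation: the loops are interchanged so that the inner loop walks the array with unit stride instead of stride m. The second shows a dynamic layout transformation: a strided row of a column-major matrix is copied into a contiguous local buffer before it is used.

    #include <stdlib.h>

    /* Computation transformation (loop interchange): before the
     * interchange the i loop was outermost and the inner j loop read
     * a[i + j*m] with stride m; with j outermost and i innermost, the
     * inner loop accesses a[] with unit stride. */
    void sum_rows_interchanged(double *restrict s, const double *restrict a,
                               int m, int n)   /* a is m x n, column-major */
    {
        for (int j = 0; j < n; ++j)
            for (int i = 0; i < m; ++i)
                s[i] += a[i + j * m];
    }

    /* Dynamic layout transformation (array copy): row i of a column-major
     * matrix is strided, so it is first copied into a contiguous buffer
     * and the computation then runs on the buffer. */
    double dot_row_copied(const double *a, const double *x,
                          int i, int m, int n)
    {
        double *row = malloc(sizeof *row * (size_t)n);  /* local buffer */
        for (int j = 0; j < n; ++j)
            row[j] = a[i + j * m];                      /* copy-in (strided read) */

        double d = 0.0;
        for (int j = 0; j < n; ++j)
            d += row[j] * x[j];                         /* compute on contiguous data */

        free(row);   /* the row was not modified, so no copy-out is needed */
        return d;
    }

In isolation the copy in dot_row_copied does not pay for itself; it becomes profitable only when the copied elements are reused enough to amortize the copy cost, which is exactly the profitability question addressed on the later slides.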

Array Copying: Related Work
- Dynamic layout transformation for arrays
  - Copy arrays into local buffers before the computation
  - Copy modified local buffers back to the arrays afterwards
- Previous work
  - Lam, Rothberg, and Wolf; Temam, Granston, and Jalby: copy arrays after loop blocking; arrays are assumed to be accessed through affine expressions, so array copy is always safe
  - Static data-layout transformations: a single layout throughout the application, always safe
  - Optimizing irregular applications: data access patterns are not known until runtime; dynamic layout transformation is done through libraries
  - Scalar replacement: equivalent to copying a single array element into a scalar; Carr and Kennedy applied it to inner loops
- Question: how expensive is array copy? How difficult is it?
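For comparison, here is a minimal sketch (mine, not from the slides) of scalar replacement applied to the matrix-multiply kernel used later, assuming the k loop has been moved innermost so that the C element is invariant in the inner loop and can live in a register-resident scalar:

    /* Scalar replacement: C[i + j*m] is loop-invariant in the k loop,
     * so it is loaded once into a scalar, updated in registers, and
     * stored back once, instead of being loaded and stored every
     * iteration of k. */
    void dgemm_scalar_replaced(double *C, const double *A, const double *B,
                               double alpha, int m, int n, int l)
    {
        for (int j = 0; j < n; ++j)
            for (int i = 0; i < m; ++i) {
                double c = C[i + j * m];          /* copy one element "in" */
                for (int k = 0; k < l; ++k)
                    c += alpha * A[i + k * m] * B[k + j * l];
                C[i + j * m] = c;                 /* copy it back "out" */
            }
    }

Array copy generalizes this idea from a single element held in a scalar to whole array regions held in local buffers.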

What Is New
- Array copy for arbitrary loop computations
  - A stand-alone optimization, independent of loop blocking
  - Works for computations with non-affine array accesses
- A general array copy algorithm
  - Works on arbitrarily shaped loops
  - Automatically inserts copy operations to ensure safety
  - Uses heuristics to reduce buffer size and copy cost
  - Performs scalar replacement as a special case (when the buffer reduces to a single element)
- Applications
  - Improve cache and register locality, both with and without loop blocking
- Future work
  - Communication and parallelization
  - Empirical tuning of optimizations: an interface that allows different application heuristics

Apply Array Copy: Matrix Multiplication
Step 1: build the dependence graph
- True, output, anti, and input dependences between array accesses
- Is each dependence precise?

    for (j=0; j<n; ++j)
      for (k=0; k<l; ++k)
        for (i=0; i<m; ++i)
          C[i+j*m] = C[i+j*m] + alpha * A[i+k*m]*B[k+j*l];

[Graph: dependence edges among the references C[i+j*m], A[i+k*m], and B[k+j*l]]
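As an aid to reading the graph, the same loop nest is shown below with each reference labeled (my annotation of the transcribed loop, not part of the slide). For a fixed (i, j) and varying k, the write and the read of C[i+j*m] touch the same location, so the k loop carries true, anti, and output dependences on C; the reads of A and B only contribute input dependences (reuse). Every subscript here is affine in the loop indices, so all of these dependences are precise.

    void dgemm_refs(double *C, const double *A, const double *B,
                    double alpha, int m, int n, int l)
    {
        for (int j = 0; j < n; ++j)
            for (int k = 0; k < l; ++k)
                for (int i = 0; i < m; ++i)
                    C[i + j*m] =               /* write of C */
                        C[i + j*m]             /* read of C  */
                      + alpha * A[i + k*m]     /* read of A  */
                              * B[k + j*l];    /* read of B  */
    }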

Apply Array Copy: Matrix Multiplication
Steps 2-3: separate array references connected by imprecise dependences
- Impose an order on all array references: C[i+j*m] (R) -> A[i+k*m] -> B[k+j*l] -> C[i+j*m] (W)
- Remove all back edges
- Apply typed fusion

    for (j=0; j<n; ++j)
      for (k=0; k<l; ++k)
        for (i=0; i<m; ++i)
          C[i+j*m] = C[i+j*m] + alpha * A[i+k*m]*B[k+j*l];

[Graph: the ordered dependence graph over C[i+j*m], A[i+k*m], and B[k+j*l]]

Array Copy with Imprecise Dependences
- Array references connected by imprecise dependences
  - Cannot precisely determine a mapping between the subscripts
  - May refer to the same location sometimes, and to different locations at other times
  - Not safe to copy into a single buffer
  - Apply the typed-fusion algorithm to separate them

    for (j=0; j<n; ++j)
      for (k=0; k<l; ++k)
        for (i=0; i<m; ++i)
          C[f(i,j,m)] = C[i+j*m] + alpha * A[i+k*m]*B[k+j*l];

[Graph: C[f(i,j,m)] and C[i+j*m] connected by an imprecise dependence edge, alongside A[i+k*m] and B[k+j*l]]
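A concrete instance of such an f (a hypothetical example of mine; the slide only writes C[f(i,j,m)]) is a subscript that goes through an indirection array. The compiler cannot tell at compile time when C[idx[i] + j*m] and C[i + j*m] refer to the same element, so the dependence between the two references is imprecise: there is no single, statically known mapping from subscripts to buffer slots that would keep one shared buffer consistent. Typed fusion therefore places the two references in separate groups rather than copying them into a single buffer.

    void dgemm_indirect(double *C, const double *A, const double *B,
                        const int *idx, double alpha, int m, int n, int l)
    {
        for (int j = 0; j < n; ++j)
            for (int k = 0; k < l; ++k)
                for (int i = 0; i < m; ++i)
                    C[idx[i] + j*m] = C[i + j*m]            /* may or may not alias */
                                    + alpha * A[i + k*m] * B[k + j*l];
    }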

Profitability of Applying Array Copy
- Each buffer should be copied at most twice (splitting groups if necessary)
- Determine the outermost location to perform the copy
  - No interference with other groups
- Enforce a size limit on the buffer
  - Constant size => scalar replacement
- Ensure reuse of the copied elements
  - Lower the copy position if no extra reuse is gained

    for (j=0; j<n; ++j)
      for (k=0; k<l; ++k)
        for (i=0; i<m; ++i)
          C[i+j*m] = C[i+j*m] + alpha * A[i+k*m]*B[k+j*l];

[Diagram: chosen copy locations for A, C, and B marked on the j, k, i loop nest]
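A rough cost count for the dgemm example (my own back-of-the-envelope estimate, not from the slides, using the copy points shown on the next slide): A is copied once (m*l element moves), the current column of C is copied in and out once per j iteration (2*m*n moves in total), and one element of B is copied per (j, k) pair (n*l moves), against m*n*l multiply-adds in the inner loop. The copy overhead is therefore roughly 1/n + 2/l + 1/m of the computation, a lower-order term whenever the matrix dimensions are reasonably large, which is what leaves room for the transformation to be profitable despite the extra data movement.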

Array Copy Result: Matrix Multiplication
- Dimensionality of the buffers is enforced by command-line options
- Can be applied to arbitrarily shaped loop structures
- Can be applied independently or after blocking

    Copy A[0:m,0:l] to A_buf;
    for (j=0; j<n; ++j) {
      Copy C[0:m, j*m] to C_buf;
      for (k=0; k<l; ++k) {
        Copy B[k+j*l] to B_buf;
        for (i=0; i<m; ++i)
          C_buf[i] = C_buf[i] + alpha * A_buf[i+k*m]*B_buf;
      }
      Copy C_buf[0:m] back to C[0:m,j*m];
    }
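The pseudocode above can be read as the following plain-C sketch (my rendering, assuming column-major storage and that A_buf keeps the same element order as A, which is what the A_buf[i+k*m] subscript suggests):

    #include <stdlib.h>
    #include <string.h>

    void dgemm_copied(double *C, const double *A, const double *B,
                      double alpha, int m, int n, int l)
    {
        double *A_buf = malloc(sizeof(double) * (size_t)m * l);
        double *C_buf = malloc(sizeof(double) * (size_t)m);

        memcpy(A_buf, A, sizeof(double) * (size_t)m * l);   /* copy A[0:m,0:l] once */
        for (int j = 0; j < n; ++j) {
            memcpy(C_buf, &C[j * m], sizeof(double) * (size_t)m);  /* copy column j of C in */
            for (int k = 0; k < l; ++k) {
                double B_buf = B[k + j * l];                /* single-element buffer for B */
                for (int i = 0; i < m; ++i)
                    C_buf[i] += alpha * A_buf[i + k * m] * B_buf;
            }
            memcpy(&C[j * m], C_buf, sizeof(double) * (size_t)m);  /* copy column j back out */
        }
        free(A_buf);
        free(C_buf);
    }

Each buffer respects the profitability rule from the previous slide: A_buf is copied once for the whole nest, C_buf is copied in and out once per j iteration, and B_buf, which has shrunk to a single element, is effectively a scalar replacement of B.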

Experiments
- Implementation
  - Loop transformation framework by Yi, Kennedy, and Adve
  - ROSE compiler infrastructure at LLNL (Quinlan et al.)
- Benchmarks
  - dgemm (matrix multiplication, LAPACK)
  - dgetrf (LU factorization with partial pivoting, LAPACK)
  - tomcatv (mesh generation with Thompson solver, SPEC95)
- Machines
  - A Dell PC with two 2.2 GHz Intel Xeon processors; 512 KB cache per processor and 2 GB memory
  - An SGI workstation with a 195 MHz R10000 processor; 32 KB 2-way L1, 1 MB 4-way L2, and 256 MB memory
  - A single 8-way P655+ node of an IBM terascale machine; 32 KB 2-way L1, 0.75 MB 4-way L2, 16 MB 8-way L3, and 16 GB memory

DGEMM on Dell PC

DGEMM on SGI Workstation

DGEMM on IBM

Summary
- When should we apply scalar replacement?
  - Profitable unless register pressure is too high
  - 3-12% improvement observed
  - No overhead
- When should we apply array copy?
  - When regular conflict misses occur
  - When prefetching of array elements is needed
  - 10-40% improvement observed
  - Overhead is 0.5-8% when not beneficial
- The optimizations were not beneficial on the IBM node
  - Neither blocking nor 2-dimensional copying was profitable
  - Integer operation overhead is too high
  - To be investigated further