A Data Locality Optimizing Algorithm (based on the paper "A Data Locality Optimizing Algorithm" by Michael E. Wolf and Monica S. Lam)

Outline
- Introduction
- The Problem
- Loop Transformation Theory
  - Iteration Space
  - Matrix Form of Loop Transformations
- The Localized Vector Space
  - Tiling
  - Reuse and Locality
- The SRP Algorithm

Introduction
As processor speed continues to increase faster than memory speed, optimizations that use the memory hierarchy efficiently become ever more important. Tiling is a well-known technique that improves the data locality of numerical algorithms.

Introduction (cont.)
Let's consider the example of matrix multiplication:

  for I1 := 1 to n
    for I2 := 1 to n
      for I3 := 1 to n
        C[I1, I3] += A[I1, I2] * B[I2, I3]

Tiling it with tile size s gives:

  for II2 := 1 to n by s
    for II3 := 1 to n by s
      for I1 := 1 to n
        for I2 := II2 to min(II2 + s - 1, n)
          for I3 := II3 to min(II3 + s - 1, n)
            C[I1, I3] += A[I1, I2] * B[I2, I3]

Now, for each tile (II2, II3), the s-by-s block of B can be reused from the cache across all n iterations of I1.
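For concreteness, here is a C rendering of the tiled nest (a sketch, not the paper's code; it uses 0-based indexing, and the tile size s is a tuning parameter, often chosen so the data touched by one tile fits in the cache):

  #include <stddef.h>

  #define MIN(a, b) ((a) < (b) ? (a) : (b))

  /* Tiled matrix multiplication, C += A * B, following the loop order
   * of the slide but with 0-based indices. */
  void matmul_tiled(size_t n, size_t s,
                    double C[n][n], const double A[n][n], const double B[n][n])
  {
      for (size_t II2 = 0; II2 < n; II2 += s)
          for (size_t II3 = 0; II3 < n; II3 += s)
              for (size_t I1 = 0; I1 < n; I1++)
                  for (size_t I2 = II2; I2 < MIN(II2 + s, n); I2++)
                      for (size_t I3 = II3; I3 < MIN(II3 + s, n); I3++)
                          C[I1][I3] += A[I1][I2] * B[I2][I3];
  }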

The Problem
Matrix multiplication is a particularly simple example, because it is both legal and advantageous to tile the entire nest. In general, it is not always possible to tile the entire loop nest. Let's consider the example of an abstraction of hyperbolic PDE code:

  for I1 := 0 to 5
    for I2 := 0 to 6
      A[I2 + 1] := 1/3 * (A[I2] + A[I2 + 1] + A[I2 + 2])
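For reference, the stencil runs directly in C (a sketch; the initial values of A are arbitrary, and A needs 9 elements because the stencil reads up to index 8). The reads of A[I2] and A[I2 + 2] pick up values written in earlier iterations, which is what constrains tiling here; the dependence vectors are derived on the next slides.

  #include <stdio.h>

  int main(void)
  {
      double A[9] = {1, 2, 3, 4, 5, 6, 7, 8, 9};

      for (int I1 = 0; I1 <= 5; I1++)
          for (int I2 = 0; I2 <= 6; I2++)
              A[I2 + 1] = (A[I2] + A[I2 + 1] + A[I2 + 2]) / 3.0;

      for (int i = 0; i < 9; i++)
          printf("%g ", A[i]);
      printf("\n");
      return 0;
  }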

Loop Transformation Theory
In cases where direct tiling is not applicable, we must first apply loop transformations such as interchange, skewing and reversal, and for this we need some theory.

Iteration Space
In our model, a loop nest of depth n corresponds to a finite convex polyhedron in the iteration space Z^n, bounded by the loop bounds. Each iteration in the loop corresponds to a node in the polyhedron and is identified by its index vector p = (p1, p2, ..., pn), where pi is the loop index of the ith loop in the nest, counting from outermost to innermost.

Iteration Space (cont.)
A dependence vector in an n-nested loop is denoted by a vector d = (d1, d2, ..., dn). Each component di is a possibly infinite range of integers, represented by [di^min, di^max], where di^min ∈ Z ∪ {-∞}, di^max ∈ Z ∪ {+∞} and di^min ≤ di^max. A single dependence vector therefore represents a set of distance vectors, called its distance vector set: E(d) = {(e1, ..., en) | ei ∈ Z and di^min ≤ ei ≤ di^max}.

Iteration Space (cont.)

  for I1 := 0 to 5
    for I2 := 0 to 6
      A[I2 + 1] := 1/3 * (A[I2] + A[I2 + 1] + A[I2 + 2])

[Figure: the iteration space of the loop nest, with axes I1 and I2 and arrows marking the dependences between iterations.]

So, the dependences for our last example are D = {(0, 1), (1, 0), (1, -1)}.
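A dependence is respected by sequential execution exactly when its distance vector is lexicographically positive; this is the legality condition the transformations below must preserve. A minimal C check for the 2-deep case, applied to D (a sketch; the names are illustrative):

  #include <stdbool.h>
  #include <stdio.h>

  /* A distance vector is lexicographically positive iff its first
   * non-zero component is positive. */
  static bool lex_positive(const int d[2])
  {
      if (d[0] != 0)
          return d[0] > 0;
      return d[1] > 0;
  }

  int main(void)
  {
      int D[3][2] = { {0, 1}, {1, 0}, {1, -1} };  /* dependences of the stencil */
      for (int k = 0; k < 3; k++)
          printf("(%d,%d): %s\n", D[k][0], D[k][1],
                 lex_positive(D[k]) ? "positive" : "not positive");
      return 0;
  }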

Matrix Form of Loop Transformations
With dependences represented as vectors in the iteration space, loop transformations such as interchange, skewing and reversal can themselves be represented as matrix transformations. For example, the matrix form of the loop interchange transformation that maps iteration (p1, p2) to iteration (p2, p1) is

  T = [ 0 1 ]
      [ 1 0 ]

Matrix Form of Loop Transformations (cont.)
Since a matrix transformation T is a linear transformation on the iteration space, T(q - p) = Tq - Tp for any two iterations p and q. Therefore, if d is a distance vector in the original iteration space, then Td is a distance vector in the transformed iteration space.

Interchange (Permutation)
A permutation σ on a loop nest transforms iteration (p1, ..., pn) to (pσ(1), ..., pσ(n)). This transformation can be expressed in matrix form as Iσ, the n × n identity matrix I with rows permuted by σ. The loop interchange above is an n = 2 example of the general permutation transformation.

Reversal
Reversal of the ith loop is represented by the identity matrix, but with the ith diagonal element equal to -1 rather than 1. For example, loop reversal of the outermost loop of a two-deep loop nest is represented as

  T = [ -1 0 ]
      [  0 1 ]

Skewing
Skewing loop Ij by an integer factor f with respect to loop Ii maps iteration (p1, ..., pi, ..., pj, ..., pn) to (p1, ..., pi, ..., pj + f*pi, ..., pn). For example, skewing of the innermost loop of a two-deep loop nest with respect to the outermost is represented as

  T = [ 1 0 ]
      [ f 1 ]
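Since these transformations are plain 2 × 2 integer matrices, their effect on a dependence can be checked directly. A minimal C sketch (the matrices are the ones from the last three slides, with skew factor f = 1; applying them to the stencil's (1, -1) dependence previews the SRP walkthrough below):

  #include <stdio.h>

  /* d' = T d for a 2x2 transformation matrix T. */
  static void apply(const int T[2][2], const int d[2], int out[2])
  {
      out[0] = T[0][0] * d[0] + T[0][1] * d[1];
      out[1] = T[1][0] * d[0] + T[1][1] * d[1];
  }

  int main(void)
  {
      const int interchange[2][2] = { { 0, 1 }, { 1, 0 } };  /* swap I1 and I2     */
      const int reversal[2][2]    = { { -1, 0 }, { 0, 1 } }; /* reverse outer loop */
      const int skew[2][2]        = { { 1, 0 }, { 1, 1 } };  /* skew I2 by I1, f=1 */

      const int d[2] = { 1, -1 };  /* the troublesome stencil dependence */
      int out[2];

      apply(interchange, d, out);  /* (-1, 1): not lexicographically positive  */
      printf("interchange: (%d, %d)\n", out[0], out[1]);
      apply(reversal, d, out);     /* (-1, -1): not lexicographically positive */
      printf("reversal:    (%d, %d)\n", out[0], out[1]);
      apply(skew, d, out);         /* (1, 0): positive, so skewing is legal    */
      printf("skew:        (%d, %d)\n", out[0], out[1]);
      return 0;
  }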

The Localized Vector Space
It is important to distinguish between reuse and locality. We say that a data item is reused if the same data is used in multiple iterations of a loop nest. Reuse is thus inherent in the computation and does not depend on the particular way the loops are written. Reuse saves a memory access only if no intervening iterations flush the data out of the cache between uses.

The Localized Vector Space (cont.)
Let's consider the following example:

  for I1 := 0 to n
    for I2 := 0 to n
      f(A[I1], A[I2])

Here, reference A[I2] touches different data within the innermost loop, but reuses the same elements across the outer loop. That reuse becomes locality only if all of A still resides in the cache when the next I1 iteration begins.

Tiling
In general, tiling transforms an n-deep loop nest into a 2n-deep loop nest where the inner n loops execute a compiler-determined number of iterations. For example, the result of applying tiling to our example of an abstraction of hyperbolic PDE code (after skewing, as shown later) will look as follows:

  for II1 := 0 to 5 by 2
    for II2 := 0 to 11 by 2
      for I1 := II1 to min(II1 + 1, 5)
        for I2 := max(I1, II2) to min(6 + I1, II2 + 1)
          A[I2 - I1 + 1] := 1/3 * (A[I2 - I1] + A[I2 - I1 + 1] + A[I2 - I1 + 2])

Tiling (cont.)
As the outer loops of the tiled code (II1 and II2 above) control the execution of the tiles, we refer to them as the controlling loops. When we say tiling, we refer to the partitioning of the iteration space into rectangular blocks; non-rectangular blocks are obtained by first applying unimodular transformations to the iteration space and then applying tiling.

Tiling (cont.)
[Figure: the tiled iteration space of the nest above, with axes II1 and II2 marking the 2-by-2 tiles and the order in which they execute.]
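A direct C rendering of the 2n-deep tiled nest (a sketch; A gets 9 elements because the stencil reads up to index 8, and the initial values are arbitrary):

  #include <stdio.h>

  #define MIN(a, b) ((a) < (b) ? (a) : (b))
  #define MAX(a, b) ((a) > (b) ? (a) : (b))

  int main(void)
  {
      double A[9] = {1, 2, 3, 4, 5, 6, 7, 8, 9};

      for (int II1 = 0; II1 <= 5; II1 += 2)        /* controlling loops */
          for (int II2 = 0; II2 <= 11; II2 += 2)
              for (int I1 = II1; I1 <= MIN(II1 + 1, 5); I1++)
                  for (int I2 = MAX(I1, II2); I2 <= MIN(6 + I1, II2 + 1); I2++)
                      A[I2 - I1 + 1] =
                          (A[I2 - I1] + A[I2 - I1 + 1] + A[I2 - I1 + 2]) / 3.0;

      for (int i = 0; i < 9; i++)
          printf("%g ", A[i]);
      printf("\n");
      return 0;
  }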

Reuse and Locality Since unimodular transformations and tiling can modify the localized vector space, knowing where there is reuse in the iteration space can help guide the search for the transformation that delivers the best locality. Also, to choose between alternate transformations that exploit different reuses in a loop nest, we need a metric to quantify locality for a specific localized iteration space.

Types of Reuse Reuse occurs when a reference within a loop accesses the same data location in different iterations. We call this self-temporal reuse. Likewise, if a reference accesses data on the same cache line in different iterations, it is said to possess self-spatial reuse. Furthermore, different references may access the same locations. We say that there is group-temporal reuse if the references refer to the same location, and group-spatial reuse if they refer to the same cache line.
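As a concrete illustration of the four categories (a sketch; the arrays and the consumer f are hypothetical):

  /* Hypothetical consumer of the values; stands in for arbitrary work. */
  static void f(double a, double b, double c, double d)
  {
      (void)a; (void)b; (void)c; (void)d;
  }

  void reuse_demo(int n, double A[n], double B[n], double C[n][n])
  {
      for (int i = 0; i < n; i++)
          for (int j = 0; j + 1 < n; j++)
              f(A[i],        /* self-temporal: same element on every j iteration   */
                B[j],        /* self-spatial: consecutive j share a cache line     */
                C[i][j],     /* group-temporal: equals C[i][j+1] of iteration j-1  */
                C[i][j+1]);  /* group-spatial: usually on the same line as C[i][j] */
  }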

The SRP Algorithm
Putting it all together, we get the algorithm known as SRP, because the unimodular transformations it performs can be expressed as the product of a skew transformation (S), a reversal transformation (R) and a permutation transformation (P).

The SRP Algorithm
Let us illustrate the SRP algorithm using our example of an abstraction of hyperbolic PDE code:

  for I1 := 0 to 5
    for I2 := 0 to 6
      A[I2 + 1] := 1/3 * (A[I2] + A[I2 + 1] + A[I2 + 2])

First an outer loop must be chosen. I1 can be the outer loop, because its dependence components (the first components of (0, 1), (1, 0) and (1, -1)) are all non-negative. Loop I2 has a dependence component that is negative, but it can be made non-negative by skewing with respect to I1, as the sketch below computes.
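The required skew factor can be computed mechanically: to make the inner component of every dependence non-negative, take f as the maximum of ceil(-d2 / d1) over the dependences with d2 < 0 (which requires d1 > 0 for those dependences). A minimal C sketch for the 2-deep case, assuming the dependence set above:

  #include <stdio.h>

  /* Ceiling division, assuming b > 0. */
  static int ceil_div(int a, int b) { return (a + b - 1) / b; }

  int main(void)
  {
      int D[3][2] = { {0, 1}, {1, 0}, {1, -1} };  /* (d1, d2) pairs */
      int f = 0;

      for (int k = 0; k < 3; k++)
          if (D[k][1] < 0) {                 /* negative inner component */
              int need = ceil_div(-D[k][1], D[k][0]);
              if (need > f)
                  f = need;
          }

      printf("skew I2 with respect to I1 by f = %d\n", f);  /* prints f = 1 */
      return 0;
  }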

The SRP Algorithm (cont.)

  for I1 := 0 to 5
    for I2 := I1 to 6 + I1
      A[I2 - I1 + 1] := 1/3 * (A[I2 - I1] + A[I2 - I1 + 1] + A[I2 - I1 + 2])

[Figure: the skewed iteration space, with axes I1 and I2; all dependences now have non-negative components.]

Loop I2 is now placed in the same fully permutable nest as I1; the loop nest is tilable.

The SRP Algorithm (cont.)

  for II2 := 0 to 11 by 2
    for I1 := 0 to 5
      for I2 := max(I1, II2) to min(6 + I1, II2 + 1)
        A[I2 - I1 + 1] := 1/3 * (A[I2 - I1] + A[I2 - I1 + 1] + A[I2 - I1 + 2])

[Figure: the tiled iteration space; each tile is a pair of I2 columns swept over all values of I1.]

Finally, the fully permutable nest is tiled: II2 becomes the controlling loop, and each tile executes two values of I2 for every value of I1.
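Because every distance vector remains lexicographically positive throughout, the transformed nest performs exactly the same computations as the original. A quick C sanity check (a sketch; the initial values are arbitrary):

  #include <stdio.h>
  #include <string.h>

  #define MIN(a, b) ((a) < (b) ? (a) : (b))
  #define MAX(a, b) ((a) > (b) ? (a) : (b))

  int main(void)
  {
      double orig[9], srp[9];
      for (int i = 0; i < 9; i++)
          orig[i] = srp[i] = i + 1.0;

      for (int I1 = 0; I1 <= 5; I1++)             /* original nest */
          for (int I2 = 0; I2 <= 6; I2++)
              orig[I2 + 1] = (orig[I2] + orig[I2 + 1] + orig[I2 + 2]) / 3.0;

      for (int II2 = 0; II2 <= 11; II2 += 2)      /* skewed and tiled nest */
          for (int I1 = 0; I1 <= 5; I1++)
              for (int I2 = MAX(I1, II2); I2 <= MIN(6 + I1, II2 + 1); I2++)
                  srp[I2 - I1 + 1] =
                      (srp[I2 - I1] + srp[I2 - I1 + 1] + srp[I2 - I1 + 2]) / 3.0;

      printf("%s\n", memcmp(orig, srp, sizeof orig) == 0 ? "match" : "MISMATCH");
      return 0;
  }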

The End.