DATA LOCALITY & ITS OPTIMIZATION TECHNIQUES Presented by Preethi Rajaram CSS 548 Introduction to Compilers Professor Carol Zander Fall 2012

Why?
- Processor speed is increasing at a faster rate than memory speed.
- Computer architectures respond with more levels of cache memory.
- The cache takes advantage of data locality.
- Good data locality means good application performance.
- Poor data locality reduces the effectiveness of the cache.

Data Locality
Data locality is the property that references to the same memory location, or to adjacent locations, are reused within a short period of time. It comes in two forms:
- Temporal locality: the same location is referenced again soon.
- Spatial locality: a nearby location (typically on the same cache line) is referenced soon.
Fig: Program to find the squares of the differences (a) without loop fusion (b) with loop fusion [Image from: The Dragon book, 2nd edition]
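A minimal sketch of the figure's idea (the function and array names are illustrative, not the book's exact code):

```c
#include <stddef.h>

/* Without fusion: x[] is written by the first loop and has usually
 * been evicted from the cache by the time the second loop reads it. */
void squares_unfused(const double *a, const double *b,
                     double *x, double *y, size_t n) {
    for (size_t i = 0; i < n; i++)
        x[i] = a[i] - b[i];
    for (size_t i = 0; i < n; i++)
        y[i] = x[i] * x[i];
}

/* With fusion: each difference is squared immediately, while it is
 * still in a register -- the reuse distance shrinks to zero. */
void squares_fused(const double *a, const double *b,
                   double *x, double *y, size_t n) {
    for (size_t i = 0; i < n; i++) {
        x[i] = a[i] - b[i];
        y[i] = x[i] * x[i];
    }
}
```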

Matrix Multiplication - Example
Fig: Basic matrix multiplication algorithm [Image from: The Dragon book, 2nd edition]
The basic algorithm has poor data locality:
- N^2 multiply-add operations separate consecutive reuses of the same data element of matrix Y.
- N operations separate consecutive reuses of the same cache line of Y.
Solutions:
- Changing the layout of the data structures
- Blocking
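A sketch of the basic i-j-k algorithm the figure shows (the dimension N and zero-initialized output are assumptions of this sketch):

```c
#define N 1024

/* Naive matrix multiply, Z = X * Y.  In row-major C, the column walk
 * Y[k][j] touches a different cache line on every k iteration, and a
 * given element of Y is not reused until roughly N^2 multiply-adds
 * later, when the next i iteration reaches the same j. */
void matmul_naive(const double X[N][N], const double Y[N][N],
                  double Z[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += X[i][k] * Y[k][j];
            Z[i][j] = sum;
        }
}
```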

Matrix Multiplication - Example Contd.
Changing the data structure layout:
- Store Y in column-major order, which improves reuse of the cache lines of Y.
- Limited applicability: every other use of Y in the program must agree on the layout.
Blocking:
- Changes the execution order of the instructions instead of the data layout.
- Divide each matrix into submatrices, or blocks, of size B x B.
- Order the operations so that an entire block is used over a short period of time.
- Choose B such that one block from each of the matrices fits into the cache.
Fig: Blocked matrix multiplication [Image from: The Dragon book, 2nd edition]
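A sketch of blocking (the block size B and the min helper are choices of this sketch, not prescribed by the book):

```c
#define N 1024
#define B 32   /* pick B so three B x B blocks of doubles fit in cache */

static int imin(int a, int b) { return a < b ? a : b; }

/* Blocked matrix multiply, Z += X * Y (Z assumed zero-initialized).
 * The three outer loops step over blocks; the inner loops multiply
 * one B x B block of X by one of Y, so the working blocks stay cached. */
void matmul_blocked(const double X[N][N], const double Y[N][N],
                    double Z[N][N]) {
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            for (int kk = 0; kk < N; kk += B)
                for (int i = ii; i < imin(ii + B, N); i++)
                    for (int j = jj; j < imin(jj + B, N); j++) {
                        double sum = Z[i][j];
                        for (int k = kk; k < imin(kk + B, N); k++)
                            sum += X[i][k] * Y[k][j];
                        Z[i][j] = sum;
                    }
}
```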

Data Reuse
To optimize locality, identify the sets of iterations that access the same data or the same cache line.
- Static access: an access instruction in the program, e.g. x = z[i,j].
- Dynamic access: one execution of that instruction; a loop nest executes a static access many times.
Types of reuse (annotated in the sketch below):
- Self: iterations using the same data come from the same static access.
- Group: iterations using the same data come from different static accesses.
- Temporal: the exact same location is referenced.
- Spatial: the same cache line is referenced.
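A small illustrative nest (the arrays and bounds are invented for this sketch) with one example of each kind of reuse:

```c
enum { M = 100 };

void reuse_kinds(double Z[M][M], const double W[M][M]) {
    for (int i = 1; i < M; i++)
        for (int j = 0; j < M; j++)
            /* W[0][j]:  self-temporal -- the same element is reread
             *           on every iteration of i.
             * W[i][j]:  self-spatial  -- consecutive j touch the same
             *           row-major cache line.
             * Z[i][j] and Z[i-1][j]: group reuse -- two different
             *           static accesses hit the same element one
             *           outer iteration apart.                      */
            Z[i][j] = Z[i-1][j] + W[i][j] * W[0][j];
}
```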

Self Temporal Reuse
Exploiting self reuse can save a substantial number of memory accesses: data of dimension k accessed in a loop nest of depth d is reused n^(d-k) times. E.g., if a 3-deep loop nest accesses only one column of an array, its n elements are touched in n^3 iterations, a potential saving of a factor of n^2 accesses.
- Dimensionality of the access = rank of the coefficient matrix of the access.
- Iterations referring to the same location = null space of that matrix.
- Rank of a matrix: the number of rows (or columns) that are linearly independent.
- A reference in a d-deep loop nest whose access matrix has rank r touches O(n^r) data elements in O(n^d) iterations, so on average O(n^(d-r)) iterations refer to the same array element.
Example from the slide: a depth-3 nest whose access matrix has its 2nd row = 1st + 3rd and its 4th row = 3rd - 2 * 1st, so rank = dimensionality = 2 and nullity = 3 - 2 = 1.
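A sketch of the column-access case mentioned above (the accumulation body is invented for illustration):

```c
enum { n = 100 };

void column_reuse(const double A[n][n], double acc[n][n]) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                /* A[k][0] depends on k alone: its access matrix has
                 * rank 1 in a depth-3 nest, so each of the n column
                 * elements is reused n^(3-1) = n^2 times. */
                acc[i][j] += A[k][0];
}
```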

Self Spatial Reuse
Depends on the data layout of the matrix, e.g. row-major order.
- Simplifying assumption: elements of a d-dimensional array share a cache line only if they differ in the last dimension alone; e.g., two elements of a 2-D array share a cache line only if they lie in the same row.
- The truncated matrix is obtained by dropping the last row of the access's coefficient matrix.
- If the truncated matrix has a rank r that is less than the loop depth d, the access has self-spatial reuse.
Example from the slide: a truncated matrix with r = 1 and d = 2; r < d assures spatial reuse.
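As a sketch, assuming row-major C arrays, compare a column-order walk with the interchanged row-order walk:

```c
enum { n = 1024 };

/* Column-order walk: consecutive iterations of the inner loop are a
 * whole row (n doubles) apart, so no cache line is reused. */
double sum_column_order(const double A[n][n]) {
    double s = 0.0;
    for (int j = 0; j < n; j++)
        for (int i = 0; i < n; i++)
            s += A[i][j];
    return s;
}

/* Row-order walk: consecutive j hit adjacent elements, so every
 * element of a fetched cache line is used -- self-spatial reuse. */
double sum_row_order(const double A[n][n]) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            s += A[i][j];
    return s;
}
```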

Group Reuse
We compute group reuse only among accesses in a loop that share the same coefficient matrix.
Fig: 2-deep loop nest [Image from: The Dragon book, 2nd edition]
- z[i,j] and z[i-1,j] access almost the same set of array elements.
- The data read by the access z[i-1,j] is the same as the data written by z[i,j], except when i = 1 (group reuse).
- Each access alone has rank 2 in this depth-2 nest, so there is no self-temporal reuse; its truncated matrix has rank 1, so there is self-spatial reuse.
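A minimal sketch of such a nest (the statement body is an assumption; the figure's exact code is not reproduced here):

```c
enum { n = 100 };

/* The value written to Z[i][j] on one iteration of i is read back as
 * Z[i-1][j] on the next: group-temporal reuse between two static
 * accesses that share the same coefficient matrix. */
void group_reuse(double Z[n][n]) {
    for (int i = 1; i < n; i++)
        for (int j = 0; j < n; j++)
            Z[i][j] = Z[i-1][j] + 1.0;
}
```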

Locality Optimization
For temporal locality of data, use results as soon as they are generated.
Fig: Code excerpt for a multigrid algorithm (a) before partition (b) after partition [Image from: The Dragon book, 2nd edition]

Locality Optimization Contd.
Array contraction: reduce the dimensionality of an array, and with it the number of memory locations accessed. Once the producer and the consumer of a temporary array execute in the same iteration, the array can often shrink to a scalar, as sketched below.
Fig: Code excerpt for a multigrid algorithm after partition and after array contraction [Image from: The Dragon book, 2nd edition]
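A generic before/after sketch of contraction (not the multigrid excerpt itself; all names are invented):

```c
enum { n = 1000 };

/* Before: T is a full temporary array, produced by one loop and
 * consumed by the next. */
void before_contraction(const double A[n], const double B[n], double C[n]) {
    static double T[n];
    for (int i = 0; i < n; i++)
        T[i] = A[i] + B[i];
    for (int i = 0; i < n; i++)
        C[i] = T[i] * T[i];
}

/* After fusing the loops, each T[i] dies as soon as it is read, so
 * the n-element array contracts to a single scalar. */
void after_contraction(const double A[n], const double B[n], double C[n]) {
    for (int i = 0; i < n; i++) {
        double t = A[i] + B[i];
        C[i] = t * t;
    }
}
```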

Locality Optimization Contd.
Instead of executing the partitions one after the other, we interleave a number of them so that reuses among partitions occur close together. Two forms (the first is sketched below):
- Interleaving inner loops in a parallel loop
- Interleaving statements in a parallel loop
Fig: Interleaving four instances of the inner loop [Image from: The Dragon book, 2nd edition]
Fig: The statement interleaving transformation [Image from: The Dragon book, 2nd edition]
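A hedged sketch of interleaving four instances of the inner loop (the loop body and arrays are invented, not the figure's code):

```c
enum { n = 1024 };

static int imin2(int a, int b) { return a < b ? a : b; }

/* Before: each instance of the inner loop streams through all of A,
 * so A[j] is evicted before the next outer iteration rereads it. */
void scale_rows(double Z[n][n], const double A[n]) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            Z[i][j] = A[j] * Z[i][j];
}

/* After: four instances of the inner loop run interleaved, so each
 * A[j] is reused four times while still in cache (or a register)
 * before j advances. */
void scale_rows_interleaved(double Z[n][n], const double A[n]) {
    for (int i = 0; i < n; i += 4)
        for (int j = 0; j < n; j++)
            for (int ii = i; ii < imin2(i + 4, n); ii++)
                Z[ii][j] = A[j] * Z[ii][j];
}
```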

References
- Wolf, Michael E., and Monica S. Lam. "A data locality optimizing algorithm." ACM SIGPLAN Notices 26.6 (1991).
- McKinley, Kathryn S., Steve Carr, and Chau-Wen Tseng. "Improving data locality with loop transformations." ACM Transactions on Programming Languages and Systems (TOPLAS) 18.4 (1996).
- Bodin, François, et al. "A quantitative algorithm for data locality optimization." Code Generation: Concepts, Tools, Techniques (1992).
- Kennedy, Ken, and Kathryn S. McKinley. "Optimizing for parallelism and data locality." Proceedings of the 6th International Conference on Supercomputing. ACM.
- Aho, A., M. Lam, R. Sethi, and J. Ullman. Compilers: Principles, Techniques, and Tools, 2nd edition. Addison-Wesley.

Thank You! Questions??