Data Locality
CS 524 – High-Performance Computing

Locality of Reference

Application data space is often too large to fit entirely in cache; the key is locality of reference.

Types of locality:
- Register locality
- Cache-temporal locality
- Cache-spatial locality

Attempt (in order of preference) to:
- Keep all data in registers
- Order the algorithm to do all operations on a data element before it is evicted from cache
- Order the algorithm so that successive operations touch physically contiguous data
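
As a small illustration of the first preference (a sketch, not from the original slides): in a dot product, accumulating into a scalar lets the compiler keep the running sum in a register for the whole loop, while the unit-stride sweeps over the arrays give spatial locality.

    /* Sketch: "sum" exhibits register locality; a[] and x[] are read
       with unit stride, so each cache line is missed at most once. */
    double dot(int n, const double *a, const double *x)
    {
        double sum = 0.0;              /* lives in a register */
        for (int i = 0; i < n; i++)
            sum += a[i] * x[i];        /* unit-stride accesses */
        return sum;
    }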

Single Processor Performance Enhancement

Two fundamental issues:
- Adequate operation-level parallelism (to keep superscalar, pipelined processor units busy)
- Minimizing memory access costs (locality in cache)

Three useful techniques (loop transformations) to improve locality of reference:
- Loop permutation
- Loop blocking (tiling)
- Loop unrolling

Access Stride and Spatial Locality

Access stride: the separation between successively accessed memory locations.
- Unit access stride maximizes spatial locality (only one miss per cache line).

2-D arrays have different linearized representations in C and FORTRAN:
- FORTRAN's column-major representation favors column-wise access of data.
- C's row-major representation favors row-wise access of data.

[Figure: a 3x3 matrix with rows (a b c / d e f / g h i) is linearized as a b c d e f g h i in row-major order (C) and as a d g b e h c f i in column-major order (FORTRAN).]
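
The layout difference shows up directly in loop nests. A minimal C sketch (the array size N is an assumption):

    #define N 1024
    double a[N][N];

    /* Row-wise sweep: in C's row-major layout, a[i][j] and a[i][j+1]
       are adjacent in memory, so the inner loop has unit stride. */
    void row_wise(void) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] += 1.0;        /* stride 1 */
    }

    /* Column-wise sweep: consecutive inner iterations are N elements
       apart, touching a new cache line on almost every access. */
    void col_wise(void) {
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                a[i][j] += 1.0;        /* stride N */
    }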

Matrix-Vector Multiply (MVM)

    do i = 1, N
      do j = 1, N
        y(i) = y(i) + A(i,j)*x(j)
      end do
    end do

Total number of memory references = 4N^2 (each of the N^2 inner iterations reads y(i), A(i,j), and x(j), then writes y(i)).

Two questions:
- Can we predict the miss ratio of different variations of this program for different cache models?
- What transformations can we do to improve performance?

[Figure: y = Ax, with i indexing rows of A and j indexing columns.]

Cache Model

- Fully associative cache (so no conflict misses)
- LRU replacement policy (ensures temporal locality)
- Cache size:
  - Large cache model: no capacity misses
  - Small cache model: misses depend on problem size
- Cache block size:
  - 1 floating-point number, or
  - b floating-point numbers

MVM: Dot-Product Form (Scenario 1)

Loop order: i-j

Cache model:
- Small cache: assume the cache holds fewer than (2N+1) numbers
- Cache block size: 1 floating-point number

Cache misses:
- Matrix A: N^2 misses (cold misses)
- Vector x: N cold misses + N(N-1) capacity misses
- Vector y: N cold misses

Miss ratio = (2N^2 + N)/4N^2 → 0.5
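
To see where the limit comes from, combine the three counts (x alone accounts for N + N(N-1) = N^2 misses):

    \[
    \text{miss ratio} \;=\; \frac{N^2 + N^2 + N}{4N^2}
                      \;=\; \frac{1}{2} + \frac{1}{4N}
                      \;\longrightarrow\; \frac{1}{2}
                      \quad (N \to \infty)
    \]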

MVM: Dot-Product Form (Scenario 2)

Cache model:
- Large cache: the cache holds more than (2N+1) numbers
- Cache block size: 1 floating-point number

Cache misses:
- Matrix A: N^2 cold misses
- Vector x: N cold misses
- Vector y: N cold misses

Miss ratio = (N^2 + 2N)/4N^2 → 0.25

MVM: Dot-Product Form (Scenario 3)

Cache block size: b floating-point numbers
Matrix access order: row-major access (C)

Cache misses (small cache):
- Matrix A: N^2/b cold misses
- Vector x: N/b cold misses + N(N-1)/b capacity misses
- Vector y: N/b cold misses
- Miss ratio = (1/2 + 1/(4N))·(1/b) → 1/(2b)

Cache misses (large cache):
- Matrix A: N^2/b cold misses
- Vector x: N/b cold misses
- Vector y: N/b cold misses
- Miss ratio = (1/4 + 1/(2N))·(1/b) → 1/(4b)

MVM: SAXPY Form

    do j = 1, N
      do i = 1, N
        y(i) = y(i) + A(i,j)*x(j)
      end do
    end do

Total number of memory references = 4N^2

[Figure: y = Ax; the outer loop now walks the columns of A.]

MVM: SAXPY Form (Scenario 4)

Cache block size: b floating-point numbers
Matrix access order: row-major access (C)

Cache misses (small cache):
- Matrix A: N^2 cold misses (the column-wise sweep has stride N, so every access touches a new cache line)
- Vector x: N/b cold misses
- Vector y: N/b cold misses + N(N-1)/b capacity misses
- Miss ratio = 0.25(1 + 1/b) + 1/(4Nb) → 0.25(1 + 1/b)

Cache misses (large cache):
- Matrix A: N^2 cold misses
- Vector x: N/b cold misses
- Vector y: N/b cold misses
- Miss ratio = 1/4 + 1/(2Nb) → 1/4
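
The small-cache ratio follows the same pattern as before; y now plays the role x played in the dot-product form, contributing N^2/b misses in total:

    \[
    \text{miss ratio} \;=\; \frac{N^2 + \frac{N}{b} + \frac{N^2}{b}}{4N^2}
                      \;=\; \frac{1}{4}\left(1 + \frac{1}{b}\right) + \frac{1}{4Nb}
                      \;\longrightarrow\; \frac{1}{4}\left(1 + \frac{1}{b}\right)
    \]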

Loop Blocking (Tiling)

Loop blocking enhances temporal locality by ensuring that data remain in cache until all operations on them have been performed.

Concept:
- Divide (tile) the loop computation into chunks or blocks that fit into cache
- Perform all operations on a block before moving to the next block

Procedure:
- Add outer loop(s) to tile the inner loop computation
- Select the block size so that you have a large cache model (no capacity misses, temporal locality maximized)

MVM: Blocked

    ! assumes N is a multiple of the block size B
    do bi = 1, N/B
      do bj = 1, N/B
        do i = (bi-1)*B+1, bi*B
          do j = (bj-1)*B+1, bj*B
            y(i) = y(i) + A(i,j)*x(j)
          end do
        end do
      end do
    end do

[Figure: A tiled into BxB blocks; x and y tiled into length-B blocks.]
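
For readers following along in C, an equivalent sketch of the blocked nest (the function name and sizes are assumptions):

    #define N 1024
    #define B 64    /* block size: chosen so 2*B + 1 numbers fit in cache */

    /* Blocked MVM, assuming N is a multiple of B. Each (bi, bj) block
       reuses B elements of y and B elements of x while they are cached. */
    void mvm_blocked(const double A[N][N], const double x[N], double y[N])
    {
        for (int bi = 0; bi < N; bi += B)
            for (int bj = 0; bj < N; bj += B)
                for (int i = bi; i < bi + B; i++)
                    for (int j = bj; j < bj + B; j++)
                        y[i] += A[i][j] * x[j];
    }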

MVM: Blocked (Scenario 5)

Cache size: greater than 2B + 1 floating-point numbers
Cache block size: b floating-point numbers
Matrix access order: row-major access (C)

Cache misses per block:
- Matrix A: B^2/b cold misses
- Vector x: B/b
- Vector y: B/b
- Miss ratio = (1/4 + 1/(2B))·(1/b) → 1/(4b)

Lesson: proper loop blocking eliminates capacity misses. Performance becomes essentially independent of cache size (as long as the cache holds more than 2B + 1 floating-point numbers).
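
As an illustrative calculation (not from the slides): a 32 KB data cache holds 4096 8-byte floating-point numbers, so any block size with 2B + 1 ≤ 4096 avoids capacity misses here; B = 1024, or a smaller power of two to leave room for other data, satisfies the condition comfortably.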


Access Stride Analysis of MVM

    c Dot-product form
    do i = 1, N
      do j = 1, N
        y(i) = y(i) + A(i,j)*x(j)
      end do
    end do

    c SAXPY form
    do j = 1, N
      do i = 1, N
        y(i) = y(i) + A(i,j)*x(j)
      end do
    end do

Access stride for each array (determined by the inner-loop index):

                        C    Fortran
    Dot-product form
      A(i,j)            1    N
      x(j)              1    1
      y(i)              0    0
    SAXPY form
      A(i,j)            N    1
      x(j)              0    0
      y(i)              1    1

Loop Permutation

Changing the order of nested loops can improve spatial locality:
- Choose the loop ordering that minimizes the miss ratio, or
- Choose the loop ordering that minimizes array access strides

Loop permutation issues:
- Validity: reordering must not change the meaning of the code
- Performance: does the reordering actually improve performance?
- Array bounds: what should the new array bounds be?
- Procedure: is there a mechanical way to perform loop permutation?

These are of primary concern in compiler design. We will discuss some of these issues, from an application software developer's perspective, when we cover dependences. A small before/after example follows.
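
As a concrete illustration (not from the original slides): the permutation below is valid because each iteration writes a distinct a[i][j], so no data dependence is reversed, and in C the second version gives the inner loop unit stride.

    /* Before: i innermost, so a[i][j] and b[i][j] are walked with stride N */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            a[i][j] = b[i][j] + 1.0;

    /* After permutation: j innermost, unit stride in row-major C */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = b[i][j] + 1.0;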

Matrix-Matrix Multiply (MMM)

    for (i = 0; i < N; i++) {
      for (j = 0; j < N; j++) {
        for (k = 0; k < N; k++)
          C[i][j] = C[i][j] + A[k][i]*B[k][j];
      }
    }

Access strides (C compiler: row-major), by loop ordering:

                ijk  jik  ikj  kij  jki  kji
    C(i,j)       0    0    1    1    N    N
    A(k,i)       N    N    0    0    1    1
    B(k,j)       N    N    1    1    0    0
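
The table singles out ikj and kij as the orderings with no stride-N stream. A sketch of the kij version (the same computation with the loops reordered; a reconstruction, not code from the slide):

    /* kij ordering: in the inner j loop, A[k][i] is invariant (stride 0)
       and can be held in a register, while C[i][j] and B[k][j] are both
       unit-stride streams. */
    for (k = 0; k < N; k++)
      for (i = 0; i < N; i++) {
        double aki = A[k][i];              /* register locality */
        for (j = 0; j < N; j++)
          C[i][j] = C[i][j] + aki * B[k][j];
      }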

Loop Unrolling

- Reduce the number of iterations of the loop, adding statement(s) to the loop body to do the work of the missing iterations
- Increases the amount of operation-level parallelism in the loop body
- Outer-loop unrolling changes the order in which memory elements are accessed, and can reduce the number of memory accesses and cache misses

Before unrolling:

    do i = 1, 2n
      do j = 1, m
        loop-body(i,j)
      end do
    end do

After unrolling the outer loop by 2:

    do i = 1, 2n, 2
      do j = 1, m
        loop-body(i,j)
        loop-body(i+1,j)
      end do
    end do

MVM: Dot-Product 4-Outer Unrolled Form

    /* Assume n is a multiple of 4 */
    for (i = 0; i < n; i += 4) {
      for (j = 0; j < n; j++) {
        y[i]   = y[i]   + A[i][j]*x[j];
        y[i+1] = y[i+1] + A[i+1][j]*x[j];
        y[i+2] = y[i+2] + A[i+2][j]*x[j];
        y[i+3] = y[i+3] + A[i+3][j]*x[j];
      }
    }

[Table: measured performance (MFLOPS) for 32x32 and 1024x1024 problem sizes.]
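
A common companion to this transformation (a sketch, not from the slide): when n is not a multiple of 4, unroll as far as possible and finish with a scalar cleanup loop.

    int i, j;
    for (i = 0; i + 3 < n; i += 4)      /* unrolled portion */
      for (j = 0; j < n; j++) {
        y[i]   += A[i][j]   * x[j];
        y[i+1] += A[i+1][j] * x[j];
        y[i+2] += A[i+2][j] * x[j];
        y[i+3] += A[i+3][j] * x[j];
      }
    for (; i < n; i++)                  /* cleanup: at most 3 leftover rows */
      for (j = 0; j < n; j++)
        y[i] += A[i][j] * x[j];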

MVM: SAXPY 4-Outer Unrolled Form

    /* Assume n is a multiple of 4 */
    for (j = 0; j < n; j += 4) {
      for (i = 0; i < n; i++) {
        y[i] = y[i] + A[i][j]*x[j]
                    + A[i][j+1]*x[j+1]
                    + A[i][j+2]*x[j+2]
                    + A[i][j+3]*x[j+3];
      }
    }

[Table: measured performance (MFLOPS) for 32x32 and 1024x1024 problem sizes.]

Summary

Loop permutation:
- Enhances spatial locality
- C and FORTRAN linearize matrices differently, so the same loop ordering yields different access patterns, and different performance, in each language
- Performance is best when the inner loop index has the minimum access stride

Loop blocking:
- Enhances temporal locality
- Ensure that 2B is less than the cache size (in floating-point numbers)

Loop unrolling:
- Enhances superscalar, pipelined operation