ECE 454 Computer Systems Programming
Memory performance (Part II: Optimizing for caches)
Ding Yuan, ECE Dept., University of Toronto

Content
- Cache basics and organization (last lecture)
- Optimizing for caches (this lecture)
  - Tiling/blocking
  - Loop reordering
- Prefetching (next lecture)
- Virtual memory (next lecture)

Optimizing for Caches

Memory Optimizations
- Write code that has locality
  - Spatial: access data contiguously
  - Temporal: make sure accesses to the same data are not too far apart in time
- How to achieve this?
  - Proper choice of algorithm
  - Loop transformations

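Background: Array Allocation
- Basic principle: T A[L];
  - Array of data type T and length L
  - Contiguously allocated region of L * sizeof(T) bytes
- Examples (x is the starting address of each array):
  - char string[12];  occupies x to x + 12
  - int val[5];  elements at x, x + 4, x + 8, x + 12, x + 16; ends at x + 20
  - double a[3];  elements at x, x + 8, x + 16; ends at x + 24
  - char *p[3]; (64 bit)  elements at x, x + 8, x + 16; ends at x + 24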
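As a quick check of the allocation rule above, the following sketch measures an array's size and element offsets with sizeof and pointer subtraction. The helper names are made up for illustration. The concrete numbers on the slide (4-byte int, 8-byte double and pointer) are typical of 64-bit platforms, not something the C language guarantees, so the code compares against sizeof rather than hard-coding them.

```c
#include <assert.h>
#include <stddef.h>

/* Total bytes occupied by `double a[3]`: L * sizeof(T) with L = 3. */
size_t double3_bytes(void) {
    double a[3];
    return sizeof a;
}

/* Byte distance from val[0] to val[i] for `int val[5]` (hypothetical helper).
 * i == 5 is allowed: it names the address one past the last element. */
size_t int_offset(size_t i) {
    int val[5] = {0};
    assert(i <= 5);
    return (size_t)((char *)(val + i) - (char *)val);
}
```

On a platform with 4-byte ints, int_offset(1) through int_offset(5) give 4, 8, 12, 16 and 20, matching the x + 4 ... x + 20 labels on the slide.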

Multidimensional (Nested) Arrays
- Declaration: T A[R][C];
  - 2D array of data type T
  - R rows, C columns
  - Each element of type T requires K bytes
- Array size: R * C * K bytes
- Arrangement: row-major ordering (C code)
  - Logical view: row 0 holds A[0][0] ... A[0][C-1]; row R-1 holds A[R-1][0] ... A[R-1][C-1]
  - Memory layout of int A[R][C]: A[0][0] ... A[0][C-1], A[1][0] ... A[1][C-1], ..., A[R-1][0] ... A[R-1][C-1] (4*R*C bytes in total)
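Row-major ordering means that element A[i][j] sits i*C + j elements after A[0][0]. The sketch below compares that formula with the offset the compiler actually uses. The dimensions R = 3, C = 4 and both function names are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

enum { R = 3, C = 4 };   /* example dimensions, chosen for illustration */

/* Position of A[i][j] in the flattened row-major sequence of elements. */
size_t row_major_index(size_t i, size_t j, size_t ncols) {
    return i * ncols + j;
}

/* Byte offset of A[i][j] from the start of `int A[R][C]`, measured directly. */
size_t measured_offset(size_t i, size_t j) {
    static int A[R][C];
    assert(i < R && j < C);
    return (size_t)((char *)&A[i][j] - (char *)A);
}
```

For every valid (i, j), measured_offset(i, j) equals row_major_index(i, j, C) * sizeof(int), which is why walking along a row touches adjacent memory while walking down a column jumps C elements at a time.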

Assumed Simple Cache
- 2 ints per block
- 2-way set associative
- 2 blocks, 1 set in total (i.e., the same thing as fully associative)
- Replacement policy: Least Recently Used (LRU)
- [Figure: cache with Block 0 and Block 1]

Some Key Questions
- How many elements are there per block?
- Does the data structure fit in the cache?
- Do I re-use blocks over time?
- In what order am I accessing blocks?
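The first two questions reduce to simple arithmetic, sketched below. Both helper names are hypothetical, and the fit test is deliberately simplistic: it only compares total sizes and ignores other data competing for the cache as well as conflict misses.

```c
#include <assert.h>
#include <stddef.h>

/* How many elements share one cache block? */
size_t elems_per_block(size_t block_bytes, size_t elem_bytes) {
    assert(elem_bytes > 0 && block_bytes % elem_bytes == 0);
    return block_bytes / elem_bytes;
}

/* Does a structure of num_elems elements fit in a cache of cache_bytes?
 * Simplifying assumption: nothing else occupies the cache. */
int fits_in_cache(size_t num_elems, size_t elem_bytes, size_t cache_bytes) {
    return num_elems * elem_bytes <= cache_bytes;
}
```

For the assumed simple cache (2 blocks of 2 ints each), an array of 4 ints fits and an array of 8 ints does not. For a 64-byte block holding 8-byte doubles, elems_per_block gives 8.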

Simple Array
    for (i = 0; i < N; i++) {
        … = A[i];
    }
- [Figure: array A holding 1 2 3 4, and the cache]
- Miss rate = #misses / #accesses = (N/2) / N = ½ = 50%

Simple Array with outer loop
    for (k = 0; k < P; k++) {
        for (i = 0; i < N; i++) {
            … = A[i];
        }
    }
- [Figure: array A holding 1 2 3 4, and the cache]
- Assume A[] fits in the cache: Miss rate = #misses / #accesses = (N/2) / (N*P) = 1/(2P)
- Lesson: for sequential accesses with re-use, if the data structure fits in the cache, the first visit suffers all the misses

Simple Array
    for (i = 0; i < N; i++) {
        … = A[i];
    }
- Assume A[] does not fit in the cache: Miss rate = #misses / #accesses = (N/2) / N = ½ = 50%
- Lesson: for sequential accesses without re-use, it doesn't matter whether the data structure fits

Simple Array with outer loop
    for (k = 0; k < P; k++) {
        for (i = 0; i < N; i++) {
            … = A[i];
        }
    }
- Assume A[] does not fit in the cache: Miss rate = #misses / #accesses = (P * N/2) / (N*P) = ½ = 50%
- Lesson: for sequential accesses with re-use, if the data structure doesn't fit, the miss rate is the same as with no re-use
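The sequential cases above can be checked with a small simulation of the assumed cache (2 blocks, 2 ints per block, fully associative, LRU). This is a minimal sketch, not code from the lecture: the type cache_t and the functions cache_reset, cache_touch and scan_miss_rate are hypothetical names, and the array A is assumed to start at int index 0, aligned to a block boundary.

```c
#include <assert.h>

enum { NUM_BLOCKS = 2, INTS_PER_BLOCK = 2 };

typedef struct {
    long block[NUM_BLOCKS];          /* resident block number, -1 if empty */
    unsigned long stamp[NUM_BLOCKS]; /* time of last use, 0 if never used */
    unsigned long now;               /* logical clock, one tick per access */
} cache_t;

static void cache_reset(cache_t *c) {
    for (int s = 0; s < NUM_BLOCKS; s++) {
        c->block[s] = -1;
        c->stamp[s] = 0;
    }
    c->now = 0;
}

/* Touch the int at index idx; return 1 on a miss, 0 on a hit. */
static int cache_touch(cache_t *c, long idx) {
    assert(idx >= 0);
    long want = idx / INTS_PER_BLOCK;
    int victim = 0;
    c->now++;
    for (int s = 0; s < NUM_BLOCKS; s++) {
        if (c->block[s] == want) {       /* hit: refresh the time stamp */
            c->stamp[s] = c->now;
            return 0;
        }
        if (c->stamp[s] < c->stamp[victim])
            victim = s;                  /* remember the least recently used slot */
    }
    c->block[victim] = want;             /* miss: replace the LRU (or empty) slot */
    c->stamp[victim] = c->now;
    return 1;
}

/* Miss rate of `passes` sequential passes over an n-int array. */
double scan_miss_rate(long n, long passes) {
    cache_t c;
    long misses = 0;
    assert(n > 0 && passes > 0);
    cache_reset(&c);
    for (long k = 0; k < passes; k++)
        for (long i = 0; i < n; i++)
            misses += cache_touch(&c, i);
    return (double)misses / (double)(n * passes);
}
```

With N = 4 (fits) and P = 5, the simulation gives 2 misses in 20 accesses, i.e. 1/(2P). With N = 8 (does not fit), every pass misses on each block again, so the rate stays at 50% regardless of P.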

Let’s warm up our cache
- Problem (and opportunity)
  - L1 cache reference: 0.5 ns* (L1 cache size: 32 KB)
  - Main memory reference: 100 ns (memory size: 4 GB)
- Locality
  - Temporal locality
  - Spatial locality
- Target program: matrix multiplication

2D array
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            … = A[i][j];
        }
    }
- Assume A[] fits in the cache: Miss rate = #misses / #accesses = (N*N/2) / (N*N) = ½ = 50%

2D array
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            … = A[i][j];
        }
    }
- Assume A[] does not fit in the cache: Miss rate = #misses / #accesses = (N*N/2) / (N*N) = ½ = 50%
- Lesson: for 2D accesses in row order with no re-use, the hit rate is the same as for sequential access, whether the array fits or not

2D array
    for (j = 0; j < N; j++) {
        for (i = 0; i < N; i++) {
            … = A[i][j];
        }
    }
- Assume A[] fits in the cache: Miss rate = #misses / #accesses = (N*N/2) / (N*N) = ½ = 50%
- Lesson: for 2D accesses in column order with no re-use, the hit rate is the same as for sequential access if the blocks of an entire column fit in the cache

2D array
    for (j = 0; j < N; j++) {
        for (i = 0; i < N; i++) {
            … = A[i][j];
        }
    }
- Assume A[] does not fit in the cache: Miss rate = #misses / #accesses = (N*N) / (N*N) = 100%
- Lesson: for 2D accesses in column order, if the blocks of an entire column don't fit, the miss rate is 100% (in the figure, block (1,2) is gone after the access to element 9)
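The two traversal orders compared above differ only in which loop is innermost. The following sketch writes both for an n x n matrix stored row-major in a flat array (an assumption made so that n can be a run-time parameter; the slides use A[i][j] on a true 2D array). The function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Row order: the inner loop walks along a row, so consecutive
 * accesses are adjacent in memory (stride of 1 element). */
long sum_row_order(const int *A, size_t n) {
    long sum = 0;
    assert(A != NULL);
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            sum += A[i * n + j];
    return sum;
}

/* Column order: the inner loop walks down a column, so consecutive
 * accesses are n elements apart (stride of n elements). */
long sum_col_order(const int *A, size_t n) {
    long sum = 0;
    assert(A != NULL);
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < n; i++)
            sum += A[i * n + j];
    return sum;
}
```

Both functions compute the same sum. Only the memory access pattern changes, which is what determines whether the second element of each cache block is still resident when it is needed.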

Matrix multiplication
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            for (k = 0; k < N; k++) {
                … = A[i][k] * B[k][j];
            }
        }
    }
- [Figure: matrices A and B]

2 2D Arrays
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            for (k = 0; k < N; k++) {
                … = A[i][k] * B[k][j];
            }
        }
    }
- A[] does not fit, B[] does not fit, and a column of B[] does not fit at the same time as a row of A[]
- The innermost loop (i = j = 0): A[0][0] * B[0][0], A[0][1] * B[1][0], A[0][2] * B[2][0], A[0][3] * B[3][0]
- Next time (i = 0, j = 1): A[0][0] * B[0][1], A[0][1] * B[1][1], A[0][2] * B[2][1], A[0][3] * B[3][1]
- [Figure: matrices A and B and the cache contents, annotated with access time stamps]
- A[i][k] misses on every other access and B[k][j] misses on every access, so 3 out of every 4 accesses miss
- Miss rate = #misses / #accesses = 75%
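The 75% figure can be reproduced by replaying the loop nest's accesses against the assumed two-block LRU cache. The sketch below is an illustration, not lecture code. It assumes N = 4, that A starts at int address 0 with B placed immediately after it, and it tracks the two resident blocks as a most-recently-used / least-recently-used pair. The names touch and matmul_miss_rate are hypothetical.

```c
#include <assert.h>

enum { N = 4, INTS_PER_BLOCK = 2 };

/* The two resident block numbers; -1 means the slot is empty. */
static long mru = -1, lru = -1;

/* Access one block; return 1 on a miss, 0 on a hit. */
static int touch(long block) {
    if (block == mru)
        return 0;                 /* hit on the most recently used block */
    if (block == lru) {           /* hit on the other block: swap roles */
        lru = mru;
        mru = block;
        return 0;
    }
    lru = mru;                    /* miss: the old LRU block is evicted */
    mru = block;
    return 1;
}

/* Miss rate of the i, j, k loop nest reading A[i][k] and B[k][j]. */
double matmul_miss_rate(void) {
    long misses = 0, accesses = 0;
    assert(N % INTS_PER_BLOCK == 0);   /* rows start on block boundaries */
    mru = lru = -1;
    for (long i = 0; i < N; i++)
        for (long j = 0; j < N; j++)
            for (long k = 0; k < N; k++) {
                misses += touch((i * N + k) / INTS_PER_BLOCK);          /* A[i][k] */
                misses += touch((N * N + k * N + j) / INTS_PER_BLOCK);  /* B[k][j] */
                accesses += 2;
            }
    return (double)misses / (double)accesses;
}
```

Each pass of the innermost loop makes 8 accesses and takes 6 misses: 2 on the row of A and 4 on the column of B.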

Example: Matrix Multiplication
    c = (double *) calloc(n*n, sizeof(double));

    /* Multiply n x n matrices a and b (row-major, flat arrays), accumulating into c */
    void mmm(double *a, double *b, double *c, int n) {
        int i, j, k;
        for (i = 0; i < n; i++)
            for (j = 0; j < n; j++)
                for (k = 0; k < n; k++)
                    c[i*n + j] += a[i*n + k] * b[k*n + j];
    }
- [Figure: c += a * b, using row i of a and column j of b]

Cache Miss Analysis
- Assume:
  - Matrix elements are doubles
  - Cache block = 64 B = 8 doubles
  - Cache capacity << n (much smaller than n), i.e., the cache can't even hold an entire row
- First iteration: how many misses?
  - Row of a: n/8 misses (each block is 8 doubles wide)
  - Column of b: n misses
  - Total: n/8 + n = 9n/8 misses
- [Figure: blocks in the cache at the end of the first iteration]

Cache Miss Analysis
- Assume:
  - Matrix elements are doubles
  - Cache block = 8 doubles
  - Cache capacity << n (much smaller than n)
- Second iteration: number of misses is again n/8 + n = 9n/8
- Total misses (entire mmm): 9n/8 * n^2 = (9/8) * n^3

Doing Better
- MMM has lots of re-use: try to use all of a cache block once it is loaded
- Challenge: we need both rows and columns to work with
- Compromise: operate on sub-squares of the matrices
  - One sub-square per matrix should fit in the cache simultaneously
  - Heavily re-use the sub-squares before loading new ones
- Called 'tiling' or 'blocking'; a sub-square is a 'tile'

Tiled Matrix Multiplication
    c = (double *) calloc(n*n, sizeof(double));

    /* Multiply n x n matrices a and b using T x T tiles.
       T is a compile-time constant tile size; this version assumes T divides n. */
    void mmm(double *a, double *b, double *c, int n) {
        int i, j, k, i1, j1, k1;
        for (i = 0; i < n; i += T)
            for (j = 0; j < n; j += T)
                for (k = 0; k < n; k += T)
                    /* T x T mini matrix multiplications */
                    for (i1 = i; i1 < i+T; i1++)
                        for (j1 = j; j1 < j+T; j1++)
                            for (k1 = k; k1 < k+T; k1++)
                                c[i1*n + j1] += a[i1*n + k1] * b[k1*n + j1];
    }
- [Figure: c += a * b computed tile by tile; tile size T x T]
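The inner loop bounds of the tiled loop nest (i1 < i+T and so on) only stay inside the matrices when n is a multiple of T. One way to lift that restriction is to clamp each tile to the matrix edge. The sketch below does so, taking T as a run-time parameter. The name mmm_tiled, the helper min_sz and the clamping itself are additions for illustration, not part of the slide.

```c
#include <assert.h>
#include <stddef.h>

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* c += a * b for n x n row-major matrices in flat arrays, in T x T tiles.
 * Tiles on the right and bottom edges are clamped, so T need not divide n.
 * The caller is expected to zero c beforehand (e.g. with calloc). */
void mmm_tiled(const double *a, const double *b, double *c, size_t n, size_t T) {
    assert(T > 0);
    for (size_t i = 0; i < n; i += T)
        for (size_t j = 0; j < n; j += T)
            for (size_t k = 0; k < n; k += T) {
                size_t i_end = min_sz(i + T, n);
                size_t j_end = min_sz(j + T, n);
                size_t k_end = min_sz(k + T, n);
                /* one (at most) T x T mini matrix multiplication */
                for (size_t i1 = i; i1 < i_end; i1++)
                    for (size_t j1 = j; j1 < j_end; j1++)
                        for (size_t k1 = k; k1 < k_end; k1++)
                            c[i1 * n + j1] += a[i1 * n + k1] * b[k1 * n + j1];
            }
}
```

A simple sanity check is to multiply by the identity matrix with a tile size that does not divide n (for example n = 5, T = 2) and confirm that the result equals the input.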

Big picture: first calculate the tile C[0][0] to C[T-1][T-1] [Figure: c += a * b]

Big picture: next calculate the tile C[0][T] to C[T-1][2T-1] [Figure: c += a * b]

Detailed Visualization
- [Figure: c += a * b at tile granularity]
- We still have to access b[] column-wise
- But now b's cache blocks don't get replaced

Cache Miss Analysis
- Assume:
  - Cache block = 8 doubles
  - Cache capacity << n (much smaller than n)
  - Need to fit 3 tiles in the cache, hence ensure 3T^2 < capacity (since there are 3 arrays a, b, c)
- Misses per tile-iteration:
  - T^2/8 misses for each tile
  - 2n/T tiles are touched (n/T from a and n/T from b): 2n/T * T^2/8 = nT/4
- Total misses:
  - Tiled: nT/4 * (n/T)^2 = n^3/(4T)
  - Untiled: (9/8) * n^3
- [Figure: c += a * b with tile size T x T and n/T tiles per row]
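To get a feel for the two totals, the sketch below evaluates both formulas for a concrete size. The function names are hypothetical, and the numbers are predictions of the simplified model above (8 doubles per block, three tiles resident at once), not measurements.

```c
#include <assert.h>

/* Predicted misses for the untiled i, j, k loop nest: (9/8) * n^3. */
double untiled_misses(double n) {
    assert(n > 0);
    return 9.0 / 8.0 * n * n * n;
}

/* Predicted misses for the tiled version: n^3 / (4T).
 * Only meaningful when 3 * T^2 doubles fit in the cache. */
double tiled_misses(double n, double T) {
    assert(n > 0 && T > 0);
    return n * n * n / (4.0 * T);
}
```

For n = 1024 and T = 32, the model predicts 1207959552 misses untiled and 8388608 tiled, a factor of 144 (that is, 4.5 * T). The ratio grows linearly with T, so the tile should be as large as the 3T^2 < capacity constraint allows.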