
1 Cache Memory

2 Outline The memory mountain Matrix multiplication Suggested Reading: 6.6, 6.7

3 6.6 Putting it Together: The Impact of Caches on Program Performance The Memory Mountain

4 The Memory Mountain P512 Read throughput (read bandwidth) –The rate at which a program reads data from the memory system Memory mountain –A two-dimensional function of read bandwidth versus temporal and spatial locality –Characterizes the capabilities of the memory system for each computer

5 Memory mountain main routine Figure 6.41 P513
/* mountain.c - Generate the memory mountain. */
#define MINBYTES (1 << 10)   /* Working set size ranges from 1 KB */
#define MAXBYTES (1 << 23)   /* ... up to 8 MB */
#define MAXSTRIDE 16         /* Strides range from 1 to 16 */
#define MAXELEMS MAXBYTES/sizeof(int)

int data[MAXELEMS];          /* The array we'll be traversing */

6 Memory mountain main routine
int main()
{
    int size;       /* Working set size (in bytes) */
    int stride;     /* Stride (in array elements) */
    double Mhz;     /* Clock frequency */

    init_data(data, MAXELEMS); /* Initialize each element in data to 1 */
    Mhz = mhz(0);              /* Estimate the clock frequency */

7 Memory mountain main routine
    for (size = MAXBYTES; size >= MINBYTES; size >>= 1) {
        for (stride = 1; stride <= MAXSTRIDE; stride++)
            printf("%.1f\t", run(size, stride, Mhz));
        printf("\n");
    }
    exit(0);
}

8 Memory mountain test function Figure 6.40 P512
/* The test function */
void test(int elems, int stride)
{
    int i, result = 0;
    volatile int sink;

    for (i = 0; i < elems; i += stride)
        result += data[i];
    sink = result; /* So compiler doesn't optimize away the loop */
}

9 Memory mountain test function
/* Run test(elems, stride) and return read throughput (MB/s) */
double run(int size, int stride, double Mhz)
{
    double cycles;
    int elems = size / sizeof(int);

    test(elems, stride);                      /* Warm up the cache */
    cycles = fcyc2(test, elems, stride, 0);   /* Call test(elems, stride) */
    return (size / stride) / (cycles / Mhz);  /* Convert cycles to MB/s */
}
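The unit conversion in the return statement is easy to gloss over. The test loop touches size/stride bytes in total (elems/stride elements of 4 bytes each), and cycles/Mhz is the elapsed time in microseconds, since Mhz is the clock rate in cycles per microsecond; bytes per microsecond is the same as MB/s. With made-up numbers for illustration: size = 524,288 bytes, stride = 1, Mhz = 1000 and cycles = 524,288 give 524288 / (524288 / 1000) = 1000 MB/s.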

10 The Memory Mountain Data –Size: MAXBYTES (8 MB) bytes, i.e. MAXELEMS (2 M) 4-byte words –Only partially accessed on each run: working set ranges from 8 MB down to 1 KB, stride from 1 to 16

11 The Memory Mountain Figure 6.42 P514

12 Ridges of temporal locality Slice through the memory mountain with stride=1 –illuminates the read throughputs of the different caches and of main memory

13 Ridges of temporal locality Figure 6.43 P515

14 A slope of spatial locality Slice through the memory mountain with size=256KB –shows the effect of the cache block size on read throughput
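The shape of this slice follows from the block size. Assuming 32-byte cache blocks (the same block size assumed later for the matrix multiply analysis) and 4-byte ints, a stride-1 scan misses on only one access in eight, while a stride of 8 or more misses on every access, so read throughput falls as the stride grows. A minimal sketch of that estimate (the block size is an assumption, not something read off the figure):

#include <stdio.h>

/* Estimated miss rate of a strided scan over an int array,
 * assuming a cold cache and a fixed block size in bytes. */
static double est_miss_rate(int stride, int block_bytes)
{
    int ints_per_block = block_bytes / (int)sizeof(int); /* 8 for 32-byte blocks */
    if (stride >= ints_per_block)
        return 1.0;                          /* every access touches a new block */
    return (double)stride / ints_per_block;  /* about one miss per block of 8 ints */
}

int main(void)
{
    int stride;
    for (stride = 1; stride <= 16; stride++)
        printf("stride %2d: est. miss rate %.3f\n", stride, est_miss_rate(stride, 32));
    return 0;
}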

15 A slope of spatial locality Figure 6.44 P516

16 Putting it Together: The Impact of Caches on Program Performance Rearranging Loops to Increase Spatial Locality

17 Matrix Multiplication P517

18 Matrix Multiplication Implementation Figure 6.45 (a) P518
/* ijk */
for (i=0; i<n; i++) {
    for (j=0; j<n; j++) {
        c[i][j] = 0.0;
        for (k=0; k<n; k++)
            c[i][j] += a[i][k] * b[k][j];
    }
}
O(n³) adds and multiplies
Each of the n² elements of A and B is read n times

19 Matrix Multiplication P517 Assumptions: –Each array is an n × n array of double, so each element takes 8 bytes –There is a single cache with a 32-byte block size (B=32) –The array size n is so large that a single matrix row does not fit in the L1 cache –The compiler stores local variables in registers, so references to local variables inside loops do not require any load or store instructions
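These assumptions determine the per-iteration miss rates quoted on the following slides. A 32-byte block holds four 8-byte doubles, so a row-wise (stride-1) scan of a matrix misses once every four iterations (rate 0.25); a column-wise scan misses on every iteration (rate 1.0, because rows are too long for a fetched block to survive until the next pass down the column); and a reference that stays fixed in the inner loop costs no misses at all. The totals of 1.25, 0.5 and 2.0 misses per iteration on the summary slide below are simply the sums of these rates over the matrices each version touches in its inner loop.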

20 Matrix Multiplication Figure 6.45 (a) P518 Variable sum held in register
/* ijk */
for (i=0; i<n; i++) {
    for (j=0; j<n; j++) {
        sum = 0.0;
        for (k=0; k<n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

21 Matrix multiplication (ijk) Figure 6.46 P519 Class 1) (AB)
/* ijk */
for (i=0; i<n; i++) {
    for (j=0; j<n; j++) {
        sum = 0.0;
        for (k=0; k<n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}
Inner loop access pattern: A (i,*) row-wise, B (*,j) column-wise, C (i,j) fixed
Misses per inner loop iteration: A = 0.25, B = 1.00, C = 0.00

22 Matrix multiplication (jik) Figure 6.45 (b) P518, Figure 6.46 P519 Class 1) (AB)
/* jik */
for (j=0; j<n; j++) {
    for (i=0; i<n; i++) {
        sum = 0.0;
        for (k=0; k<n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}
Inner loop access pattern: A (i,*) row-wise, B (*,j) column-wise, C (i,j) fixed
Misses per inner loop iteration: A = 0.25, B = 1.00, C = 0.00

23 Matrix multiplication (kij) Figure 6.45 (e) P518, Figure 6.46 P519 Class 3) (BC)
/* kij */
for (k=0; k<n; k++) {
    for (i=0; i<n; i++) {
        r = a[i][k];
        for (j=0; j<n; j++)
            c[i][j] += r * b[k][j];
    }
}
Inner loop access pattern: A (i,k) fixed, B (k,*) row-wise, C (i,*) row-wise
Misses per inner loop iteration: A = 0.00, B = 0.25, C = 0.25

24 Matrix multiplication (ikj) Figure 6.45 (f) P518, Figure 6.46 P519 Class 3) (BC)
/* ikj */
for (i=0; i<n; i++) {
    for (k=0; k<n; k++) {
        r = a[i][k];
        for (j=0; j<n; j++)
            c[i][j] += r * b[k][j];
    }
}
Inner loop access pattern: A (i,k) fixed, B (k,*) row-wise, C (i,*) row-wise
Misses per inner loop iteration: A = 0.00, B = 0.25, C = 0.25

25 Matrix multiplication (jki) Figure 6.45 (c) P518, Figure 6.46 P519 Class 2) (AC)
/* jki */
for (j=0; j<n; j++) {
    for (k=0; k<n; k++) {
        r = b[k][j];
        for (i=0; i<n; i++)
            c[i][j] += a[i][k] * r;
    }
}
Inner loop access pattern: A (*,k) column-wise, B (k,j) fixed, C (*,j) column-wise
Misses per inner loop iteration: A = 1.00, B = 0.00, C = 1.00

26 Matrix multiplication (kji) Figure 6.45 (d) P518, Figure 6.46 P519 Class 2) (AC)
/* kji */
for (k=0; k<n; k++) {
    for (j=0; j<n; j++) {
        r = b[k][j];
        for (i=0; i<n; i++)
            c[i][j] += a[i][k] * r;
    }
}
Inner loop access pattern: A (*,k) column-wise, B (k,j) fixed, C (*,j) column-wise
Misses per inner loop iteration: A = 1.00, B = 0.00, C = 1.00

27 Pentium matrix multiply performance Figure 6.47 P519 (curve labels mark the three classes: 1) (AB), 2) (AC), 3) (BC))

28 Pentium matrix multiply performance Notice that miss rates are helpful but not perfect predictors. –Code scheduling matters, too.

29 Summary of matrix multiplication
ijk (& jik): 2 loads, 0 stores; misses/iter = 1.25  Class 1) (AB)
for (i=0; i<n; i++) {
    for (j=0; j<n; j++) {
        sum = 0.0;
        for (k=0; k<n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}
kij (& ikj): 2 loads, 1 store; misses/iter = 0.5  Class 3) (BC)
for (k=0; k<n; k++) {
    for (i=0; i<n; i++) {
        r = a[i][k];
        for (j=0; j<n; j++)
            c[i][j] += r * b[k][j];
    }
}
jki (& kji): 2 loads, 1 store; misses/iter = 2.0  Class 2) (AC)
for (j=0; j<n; j++) {
    for (k=0; k<n; k++) {
        r = b[k][j];
        for (i=0; i<n; i++)
            c[i][j] += a[i][k] * r;
    }
}
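To see the gap between the fastest and slowest classes on a particular machine, a small self-contained benchmark along the following lines can help. It is not part of the lecture; the matrix size N, the use of clock() for timing, and the flat heap-allocated arrays are illustrative choices:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 512  /* illustrative size; large enough that one row exceeds a small L1 */

static double *A, *B, *C;

/* Class 1) (AB): inner loop scans a row of A and a column of B. */
static void mm_ijk(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += A[i*N + k] * B[k*N + j];
            C[i*N + j] = sum;
        }
}

/* Class 3) (BC): inner loop scans rows of B and C only. */
static void mm_kij(void)
{
    for (int i = 0; i < N*N; i++) C[i] = 0.0;
    for (int k = 0; k < N; k++)
        for (int i = 0; i < N; i++) {
            double r = A[i*N + k];
            for (int j = 0; j < N; j++)
                C[i*N + j] += r * B[k*N + j];
        }
}

static double time_it(void (*mm)(void))
{
    clock_t start = clock();
    mm();
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

int main(void)
{
    A = malloc(N*N*sizeof(double));
    B = malloc(N*N*sizeof(double));
    C = malloc(N*N*sizeof(double));
    for (int i = 0; i < N*N; i++) { A[i] = 1.0; B[i] = 2.0; }

    printf("ijk: %.3f s\n", time_it(mm_ijk));
    printf("kij: %.3f s\n", time_it(mm_kij));

    free(A); free(B); free(C);
    return 0;
}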

30 Putting it Together: The Impact of Caches on Program Performance Using Blocking to Increase Temporal Locality

31 Improving temporal locality by blocking P520 Example: Blocked matrix multiplication –“block” (in this context) does not mean “cache block” –Instead, it means a sub-block within the matrix –Example: N = 8; sub-block size = 4

32 Improving temporal locality by blocking
[A11 A12; A21 A22] x [B11 B12; B21 B22] = [C11 C12; C21 C22]
C11 = A11*B11 + A12*B21    C12 = A11*B12 + A12*B22
C21 = A21*B11 + A22*B21    C22 = A21*B12 + A22*B22
Key idea: sub-blocks (i.e., Axy) can be treated just like scalars.

33 Blocked matrix multiply (bijk) Figure 6.48 P521
for (jj=0; jj<n; jj+=bsize) {
    for (i=0; i<n; i++)
        for (j=jj; j < min(jj+bsize,n); j++)
            c[i][j] = 0.0;
    for (kk=0; kk<n; kk+=bsize) {
        for (i=0; i<n; i++) {
            for (j=jj; j < min(jj+bsize,n); j++) {
                sum = 0.0;
                for (k=kk; k < min(kk+bsize,n); k++)
                    sum += a[i][k] * b[k][j];
                c[i][j] += sum;
            }
        }
    }
}
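A natural follow-up question is how to choose bsize. The working-set picture on the next slides (one bsize x bsize block of b plus 1 x bsize slivers of a and c) suggests a rule of thumb: pick the largest bsize whose working set fits in the cache. The sketch below encodes only that rule of thumb, not a prescription from the lecture, and the cache sizes are example values:

#include <stdio.h>

/* Largest bsize whose working set (one bsize x bsize block of b plus
 * one 1 x bsize sliver each of a and c, in 8-byte doubles) fits in a
 * cache of the given size. Rule of thumb only: real choices also depend
 * on associativity and on what else contends for the cache. */
static int pick_bsize(long cache_bytes)
{
    int bsize = 1;
    while ((long)(bsize + 1) * (bsize + 1) * 8 + 2L * (bsize + 1) * 8 <= cache_bytes)
        bsize++;
    return bsize;
}

int main(void)
{
    printf("16 KB cache: bsize ~ %d\n", pick_bsize(16 * 1024));  /* about 44 */
    printf("32 KB cache: bsize ~ %d\n", pick_bsize(32 * 1024));  /* about 63 */
    return 0;
}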

34 Blocked matrix multiply analysis Innermost loop pair multiplies a 1 x bsize sliver of A by a bsize x bsize block of B and accumulates into a 1 x bsize sliver of C –Loop over i steps through n row slivers of A & C, using the same block of B (a “sliver” here is a long, thin strip of a matrix)

35 Blocked matrix multiply analysis Figure 6.49 P522 Innermost loop pair:
for (i=0; i<n; i++) {
    for (j=jj; j < min(jj+bsize,n); j++) {
        sum = 0.0;
        for (k=kk; k < min(kk+bsize,n); k++)
            sum += a[i][k] * b[k][j];
        c[i][j] += sum;
    }
}
A: row sliver accessed bsize times; B: block reused n times in succession; C: successive elements of the sliver updated
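A rough miss count for this loop pair, under the same assumptions as before (32-byte blocks, i.e. four doubles per block, and a block of B small enough to stay resident in the cache): bringing in one bsize x bsize block of B costs about bsize²/4 misses, after which it is reused for all n rows, and each 1 x bsize row sliver of A costs about bsize/4 misses. Each element of B is therefore missed roughly once per block instead of once per use, which is exactly the temporal-locality gain blocking is after. This back-of-the-envelope count is a sketch, not taken verbatim from the slides.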

36 Pentium blocked matrix multiply performance Figure 6.50 P523 (labels 1), 2), 3) mark the three loop-order classes)

37 Putting it Together: Exploring Locality in Your Programs

38 Techniques P523 Focus your attention on the inner loops Try to maximize the spatial locality in your programs by reading data objects sequentially, in the order they are stored in memory (a small example follows this slide) Try to maximize the temporal locality in your programs by using a data object as often as possible once it has been read from memory Remember that both miss rates and the total number of memory accesses affect performance
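As a small illustration of the spatial-locality advice (this example is not from the slides; the array size and function names are illustrative), summing a matrix row by row touches memory sequentially, while summing the same matrix column by column strides through memory and, for large n, misses on nearly every access:

#include <stdio.h>

#define M 1024
#define N 1024

static int a[M][N];

/* Row-major traversal: stride-1 accesses, good spatial locality
 * (about one miss per cache block of ints). */
static long sumarrayrows(void)
{
    long sum = 0;
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

/* Column-major traversal of the same data: stride-N accesses,
 * poor spatial locality (roughly one miss per access when N is large). */
static long sumarraycols(void)
{
    long sum = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}

int main(void)
{
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = 1;
    printf("rows: %ld, cols: %ld\n", sumarrayrows(), sumarraycols());
    return 0;
}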