Section 7: Memory and Caches

Slides:



Advertisements
Similar presentations
OpenMP Optimization National Supercomputing Service Swiss National Supercomputing Center.
Advertisements

CS492B Analysis of Concurrent Programs Memory Hierarchy Jaehyuk Huh Computer Science, KAIST Part of slides are based on CS:App from CMU.
1 Optimizing compilers Managing Cache Bercovici Sivan.
Example How are these parameters decided?. Row-Order storage main() { int i, j, a[3][4]={1,2,3,4,5,6,7,8,9,10,11,12}; for (i=0; i
Today Memory hierarchy, caches, locality Cache organization
Carnegie Mellon 1 Cache Memories : Introduction to Computer Systems 10 th Lecture, Sep. 23, Instructors: Randy Bryant and Dave O’Hallaron.
Data Locality CS 524 – High-Performance Computing.
Cache Memories May 5, 2008 Topics Generic cache memory organization Direct mapped caches Set associative caches Impact of caches on performance EECS213.
Data Locality CS 524 – High-Performance Computing.
DATA LOCALITY & ITS OPTIMIZATION TECHNIQUES Presented by Preethi Rajaram CSS 548 Introduction to Compilers Professor Carol Zander Fall 2012.
ECE Dept., University of Toronto
Cache Lab Implementation and Blocking
Cache Lab Implementation and Blocking
1 Cache Memories Andrew Case Slides adapted from Jinyang Li, Randy Bryant and Dave O’Hallaron.
Lecture 13: Caching EEN 312: Processors: Hardware, Software, and Interfacing Department of Electrical and Computer Engineering Spring 2014, Dr. Rozier.
Lecture 20: Locality and Caching CS 2011 Fall 2014, Dr. Rozier.
Investigating Adaptive Compilation using the MIPSpro Compiler Keith D. Cooper Todd Waterman Department of Computer Science Rice University Houston, TX.
ECE 454 Computer Systems Programming Memory performance (Part II: Optimizing for caches) Ding Yuan ECE Dept., University of Toronto
Code and Caches 1 Computer Organization II © CS:APP & McQuain Cache Memory and Performance Many of the following slides are taken with permission.
1 Seoul National University Cache Memories. 2 Seoul National University Cache Memories Cache memory organization and operation Performance impact of caches.
ICC Module 3 Lesson 2 – Memory Hierarchies 1 / 13 © 2015 Ph. Janson Information, Computing & Communication Memory Hierarchies – Clip 9 – Locality School.
1 Cache Memory. 2 Outline Cache mountain Matrix multiplication Suggested Reading: 6.6, 6.7.
Carnegie Mellon Introduction to Computer Systems /18-243, spring th Lecture, Feb. 19 th Instructors: Gregory Kesden and Markus Püschel.
1 Cache Memories. 2 Today Cache memory organization and operation Performance impact of caches  The memory mountain  Rearranging loops to improve spatial.
Lecture 5: Memory Performance. Types of Memory Registers L1 cache L2 cache L3 cache Main Memory Local Secondary Storage (local disks) Remote Secondary.
Cache Memories Topics Generic cache-memory organization Direct-mapped caches Set-associative caches Impact of caches on performance CS 105 Tour of the.
Optimizing for the Memory Hierarchy Topics Impact of caches on performance Memory hierarchy considerations Systems I.
Notes Over 4.2 Finding the Product of Two Matrices Find the product. If it is not defined, state the reason. To multiply matrices, the number of columns.
1 Writing Cache Friendly Code Make the common case go fast  Focus on the inner loops of the core functions Minimize the misses in the inner loops  Repeated.
Vassar College 1 Jason Waterman, CMPU 224: Computer Organization, Fall 2015 Cache Memories CMPU 224: Computer Organization Nov 19 th Fall 2015.
Carnegie Mellon 1 Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition Cache Memories CENG331 - Computer Organization Instructors:
Cache Memories CENG331 - Computer Organization Instructor: Murat Manguoglu(Section 1) Adapted from: and
Programming for Cache Performance Topics Impact of caches on performance Blocking Loop reordering.
Carnegie Mellon 1 Cache Memories Authors: Adapted from slides by Randy Bryant and Dave O’Hallaron.
May 4: Caches.
A few words on locality and arrays
Cache Memories.
CSE 351 Section 9 3/1/12.
Introduction To Computer Systems
Optimization III: Cache Memories
Cache Memories CSE 238/2038/2138: Systems Programming
The Hardware/Software Interface CSE351 Winter 2013
Dynamic Array Multidimensional Array Matric Operation with Array
Morgan Kaufmann Publishers
Today How’s Lab 3 going? HW 3 will be out today
CS 105 Tour of the Black Holes of Computing
Cache Miss Rate Computations
The Memory Hierarchy : Memory Hierarchy - Cache
Authors: Adapted from slides by Randy Bryant and Dave O’Hallaron
Caches III CSE 351 Winter 2017.
Caches III CSE 351 Spring 2017 Instructor: Ruth Anderson
Memory Hierarchies.
Cache Memories Topics Cache memory organization Direct mapped caches
“The course that gives CMU its Zip!”
Memory Hierarchy II.
Optimizing MMM & ATLAS Library Generator
Lecture 22: Cache Hierarchies, Memory
Siddhartha Chatterjee
Feb 11 Announcements Memory hierarchies! How’s Lab 3 going?
Memory Hierarchy and Cache Memories CENG331 - Computer Organization
Cache Memories Professor Hugh C. Lauer CS-2011, Machine Organization and Assembly Language (Slides include copyright materials from Computer Systems:
Memory System Performance Chapter 3
Cache Performance October 3, 2007
Cache Memories Lecture, Oct. 30, 2018
Cache Memories.
Cache Memory and Performance
Optimizing single thread performance
ENERGY 211 / CME 211 Lecture 11 October 15, 2008.
Writing Cache Friendly Code

Presentation transcript:

Section 7: Memory and Caches Cache basics Principle of locality Memory hierarchies Cache organization Program optimizations that consider caches Caches and Program Optimizations

Optimizations for the Memory Hierarchy Write code that has locality Spatial: access data contiguously Temporal: make sure access to the same data is not too far apart in time How to achieve? Proper choice of algorithm Loop transformations Caches and Program Optimizations

Example: Matrix Multiplication c = (double *) calloc(sizeof(double), n*n); /* Multiply n x n matrices a and b */ void mmm(double *a, double *b, double *c, int n) { int i, j, k; for (i = 0; i < n; i++) for (j = 0; j < n; j++) for (k = 0; k < n; k++) c[i*n + j] += a[i*n + k]*b[k*n + j]; } i is row of c, j is column of c: [i, j] k is used to simultaneously traverse row i of a and column j of b j c a b = * i Caches and Program Optimizations

Caches and Program Optimizations Cache Miss Analysis Assume: Matrix elements are doubles Cache block = 64 bytes = 8 doubles Cache size C << n (much smaller than n) First iteration: n/8 + n = 9n/8 misses (omitting matrix c) Afterwards in cache: (schematic) n = * = * 8 wide Caches and Program Optimizations

Caches and Program Optimizations Cache Miss Analysis Assume: Matrix elements are doubles Cache block = 64 bytes = 8 doubles Cache size C << n (much smaller than n) Other iterations: Again: n/8 + n = 9n/8 misses (omitting matrix c) Total misses: 9n/8 * n2 = (9/8) * n3 n = * 8 wide Caches and Program Optimizations

Blocked Matrix Multiplication c = (double *) calloc(sizeof(double), n*n); /* Multiply n x n matrices a and b */ void mmm(double *a, double *b, double *c, int n) { int i, j, k; for (i = 0; i < n; i+=B) for (j = 0; j < n; j+=B) for (k = 0; k < n; k+=B) /* B x B mini matrix multiplications */ for (i1 = i; i1 < i+B; i1++) for (j1 = j; j1 < j+B; j1++) for (k1 = k; k1 < k+B; k1++) c[i1*n + j1] += a[i1*n + k1]*b[k1*n + j1]; } j1 c a b = * i1 Block size B x B Caches and Program Optimizations

Caches and Program Optimizations Cache Miss Analysis Assume: Cache block = 64 bytes = 8 doubles Cache size C << n (much smaller than n) Three blocks fit into cache: 3B2 < C First (block) iteration: B2/8 misses for each block 2n/B * B2/8 = nB/4 (omitting matrix c) Afterwards in cache (schematic) n/B blocks 3B2 < C: need to be able to fit three full blocks in the cache (one for c, one for a, one for b) while executing the algorithm = * Block size B x B = * Caches and Program Optimizations

Caches and Program Optimizations Cache Miss Analysis Assume: Cache block = 64 bytes = 8 doubles Cache size C << n (much smaller than n) Three blocks fit into cache: 3B2 < C Other (block) iterations: Same as first iteration 2n/B * B2/8 = nB/4 Total misses: nB/4 * (n/B)2 = n3/(4B) n/B blocks = * Block size B x B Caches and Program Optimizations

Caches and Program Optimizations Summary No blocking: (9/8) * n3 Blocking: 1/(4B) * n3 If B = 8 difference is 4 * 8 * 9 / 8 = 36x If B = 16 difference is 4 * 16 * 9 / 8 = 72x Suggests largest possible block size B, but limit 3B2 < C! Reason for dramatic difference: Matrix multiplication has inherent temporal locality: Input data: 3n2, computation 2n3 Every array element used O(n) times! But program has to be written properly 3B2 < C: need to be able to fit three full blocks in the cache (one for c, one for a, one for b) while executing the algorithm Caches and Program Optimizations

Caches and Program Optimizations Cache-Friendly Code Programmer can optimize for cache performance How data structures are organized How data are accessed Nested loop structure Blocking is a general technique All systems favor “cache-friendly code” Getting absolute optimum performance is very platform specific Cache sizes, line sizes, associativities, etc. Can get most of the advantage with generic code Keep working set reasonably small (temporal locality) Use small strides (spatial locality) Focus on inner loop code Caches and Program Optimizations

Caches and Program Optimizations Intel Core i7 32 KB L1 i-cache 32 KB L1 d-cache 256 KB unified L2 cache 8M unified L3 cache All caches on-chip The Memory Mountain Caches and Program Optimizations