The Hardware/Software Interface CSE351 Winter 2013

Slides:



Advertisements
Similar presentations
CS492B Analysis of Concurrent Programs Memory Hierarchy Jaehyuk Huh Computer Science, KAIST Part of slides are based on CS:App from CMU.
Advertisements

Lecture 19: Cache Basics Today’s topics: Out-of-order execution
CS2100 Computer Organisation Cache II (AY2014/2015) Semester 2.
Today Memory hierarchy, caches, locality Cache organization
1 Copyright © 2012, Elsevier Inc. All rights reserved. Chapter 2 (and Appendix B) Memory Hierarchy Design Computer Architecture A Quantitative Approach,
Carnegie Mellon 1 Cache Memories : Introduction to Computer Systems 10 th Lecture, Sep. 23, Instructors: Randy Bryant and Dave O’Hallaron.
Cache Memories May 5, 2008 Topics Generic cache memory organization Direct mapped caches Set associative caches Impact of caches on performance EECS213.
Virtual Memory Topics Virtual Memory Access Page Table, TLB Programming for locality Memory Mountain Revisited.
Systems I Locality and Caching
ECE Dept., University of Toronto
Cache Lab Implementation and Blocking
1 Cache Memories Andrew Case Slides adapted from Jinyang Li, Randy Bryant and Dave O’Hallaron.
Lecture 13: Caching EEN 312: Processors: Hardware, Software, and Interfacing Department of Electrical and Computer Engineering Spring 2014, Dr. Rozier.
Lecture 20: Locality and Caching CS 2011 Fall 2014, Dr. Rozier.
ECE 454 Computer Systems Programming Memory performance (Part II: Optimizing for caches) Ding Yuan ECE Dept., University of Toronto
1 Seoul National University Cache Memories. 2 Seoul National University Cache Memories Cache memory organization and operation Performance impact of caches.
Multilevel Caches Microprocessors are getting faster and including a small high speed cache on the same chip.
Carnegie Mellon Introduction to Computer Systems /18-243, spring th Lecture, Feb. 19 th Instructors: Gregory Kesden and Markus Püschel.
1 Cache Memories. 2 Today Cache memory organization and operation Performance impact of caches  The memory mountain  Rearranging loops to improve spatial.
Cache Memories Topics Generic cache-memory organization Direct-mapped caches Set-associative caches Impact of caches on performance CS 105 Tour of the.
LECTURE 12 Virtual Memory. VIRTUAL MEMORY Just as a cache can provide fast, easy access to recently-used code and data, main memory acts as a “cache”
1 Writing Cache Friendly Code Make the common case go fast  Focus on the inner loops of the core functions Minimize the misses in the inner loops  Repeated.
Vassar College 1 Jason Waterman, CMPU 224: Computer Organization, Fall 2015 Cache Memories CMPU 224: Computer Organization Nov 19 th Fall 2015.
Carnegie Mellon 1 Bryant and O’Hallaron, Computer Systems: A Programmer’s Perspective, Third Edition Cache Memories CENG331 - Computer Organization Instructors:
Carnegie Mellon 1 Cache Memories Authors: Adapted from slides by Randy Bryant and Dave O’Hallaron.
May 4: Caches.
CSCI206 - Computer Organization & Programming
Cache Memories.
Memory Hierarchy Ideal memory is fast, large, and inexpensive
CSE 351 Section 9 3/1/12.
Cache Memory and Performance
CS161 – Design and Architecture of Computer
Optimization III: Cache Memories
Lecture 12 Virtual Memory.
Associativity in Caches Lecture 25
Cache Memories CSE 238/2038/2138: Systems Programming
Multilevel Memories (Improving performance using alittle “cash”)
Basic Performance Parameters in Computer Architecture:
Virtual Memory Use main memory as a “cache” for secondary (disk) storage Managed jointly by CPU hardware and the operating system (OS) Programs share main.
Section 7: Memory and Caches
Caches III CSE 351 Autumn 2017 Instructor: Justin Hsia
Today How’s Lab 3 going? HW 3 will be out today
Cache Memory Presentation I
CS 105 Tour of the Black Holes of Computing
The Memory Hierarchy : Memory Hierarchy - Cache
Authors: Adapted from slides by Randy Bryant and Dave O’Hallaron
Caches III CSE 351 Winter 2017.
ReCap Random-Access Memory (RAM) Nonvolatile Memory
Lecture 21: Memory Hierarchy
Caches III CSE 351 Spring 2017 Instructor: Ruth Anderson
Lecture 21: Memory Hierarchy
Cache Memories September 30, 2008
Cache Memories Topics Cache memory organization Direct mapped caches
Memory Hierarchy II.
Adapted from slides by Sally McKee Cornell University
Feb 11 Announcements Memory hierarchies! How’s Lab 3 going?
Caches III CSE 351 Autumn 2018 Instructor: Justin Hsia
Cache Memories Professor Hugh C. Lauer CS-2011, Machine Organization and Assembly Language (Slides include copyright materials from Computer Systems:
Lecture 21: Memory Hierarchy
Caches III CSE 351 Winter 2019 Instructors: Max Willsey Luis Ceze
Computer Organization and Assembly Languages Yung-Yu Chuang 2006/01/05
Cache - Optimization.
Cache Memories.
Cache Memory and Performance
Principle of Locality: Memory Hierarchies
Lecture 13: Cache Basics Topics: terminology, cache organization (Sections )
10/18: Lecture Topics Using spatial locality
Caches III CSE 351 Spring 2019 Instructor: Ruth Anderson
Overview Problem Solution CPU vs Memory performance imbalance

Presentation transcript:

The Hardware/Software Interface CSE351 Winter 2013 Memory and Caches II

Types of Cache Misses Cold (compulsory) miss Conflict miss Occurs on very first access to a block Conflict miss Occurs when some block is evicted out of the cache, but then that block is referenced again later Conflict misses occur when the cache is large enough, but multiple data blocks all map to the same slot e.g., if blocks 0 and 8 map to the same cache slot, then referencing 0, 8, 0, 8, ... would miss every time Conflict misses may be reduced by increasing the associativity of the cache Capacity miss Occurs when the set of active cache blocks (the working set) is larger than the cache (just won’t fit) Winter 2013 Memory and Caches II

General Cache Organization (S, E, B) E = 2e lines per set set line S = 2s sets cache size: S x E x B data bytes v tag 1 2 B-1 valid bit B = 2b bytes of data per cache line (the data block) Winter 2013 Memory and Caches II

Cache Read Locate set Check if any line in set has matching tag Yes + line valid: hit Locate data starting at offset E = 2e lines per set Address of byte in memory: t bits s bits b bits S = 2s sets tag set index block offset data begins at this offset v tag 1 2 B-1 valid bit B = 2b bytes of data per cache line (the data block) Winter 2013 Memory and Caches II

Example: Direct-Mapped Cache (E = 1) Direct-mapped: One line per set Assume: cache block size 8 bytes Address of int: 1 2 7 tag v 3 6 5 4 t bits 0…01 100 1 2 7 tag v 3 6 5 4 find set S = 2s sets 1 2 7 tag v 3 6 5 4 1 2 7 tag v 3 6 5 4 Winter 2013 Memory and Caches II

Example: Direct-Mapped Cache (E = 1) Direct-mapped: One line per set Assume: cache block size 8 bytes Address of int: valid? + match?: yes = hit t bits 0…01 100 1 2 7 tag v 3 6 5 4 tag block offset Winter 2013 Memory and Caches II

Example: Direct-Mapped Cache (E = 1) Direct-mapped: One line per set Assume: cache block size 8 bytes Address of int: valid? + match?: yes = hit t bits 0…01 100 1 2 7 tag v 3 6 5 4 block offset int (4 Bytes) is here No match: old line is evicted and replaced Winter 2013 Memory and Caches II

Example (for E = 1) Assume sum, i, j in registers Address of an aligned element of a: aa...ayyyyxxxx000 Example (for E = 1) Assume: cold (empty) cache 3 bits for set, 5 bits for offset aa...ayyy yxx xx000 int sum_array_rows(double a[16][16]) { int i, j; double sum = 0; for (i = 0; i < 16; i++) for (j = 0; j < 16; j++) sum += a[i][j]; return sum; } 2,0: aa...a001 000 00000 0,4: aa...a000 001 00000 1,0: aa...a000 100 00000 0,0: aa...a000 000 00000 0,0 0,1 0,2 0,3 0,0 0,1 0,2 0,3 4,0 4,1 4,2 4,3 2,0 2,1 2,2 2,3 0,4 0,5 0,6 0,7 0,8 0,9 0,a 0,b 0,c 0,d 0,e 0,f 1,0 1,1 1,2 1,3 1,4 1,5 1,6 1,7 1,8 1,9 1,a 1,b 1,c 1,d 1,e 1,f 3,0 3,1 3,2 3,3 1,0 1,1 1,2 1,3 int sum_array_cols(double a[16][16]) { int i, j; double sum = 0; for (j = 0; j < 16; j++) for (i = 0; i < 16; i++) sum += a[i][j]; return sum; } 32 B = 4 doubles 32 B = 4 doubles 4 misses per row of array 4*16 = 64 misses every access a miss 16*16 = 256 misses Winter 2013 Memory and Caches II

Example (for E = 1) In this example, cache blocks are 16 bytes; 8 sets in cache How many block offset bits? How many set index bits? Address bits: ttt....t sss bbbb B = 16 = 2b: b=4 offset bits S = 8 = 2s: s=3 index bits 0: 000....0 000 0000 128: 000....1 000 0000 160: 000....1 010 0000 float dotprod(float x[8], float y[8]) { float sum = 0; int i; for (i = 0; i < 8; i++) sum += x[i]*y[i]; return sum; } x[0] x[1] x[2] x[3] y[0] y[1] y[2] y[3] y[0] y[1] y[2] y[3] x[0] x[1] x[2] x[3] x[0] x[1] x[2] x[3] x[0] x[1] x[2] x[3] x[4] x[5] x[6] x[7] if x and y have aligned starting addresses, e.g., &x[0] = 0, &y[0] = 128 if x and y have unaligned starting addresses, e.g., &x[0] = 0, &y[0] = 160 y[0] y[1] y[2] y[3] y[4] y[5] y[6] y[7] Winter 2013 Memory and Caches II

E-way Set-Associative Cache (Here: E = 2) E = 2: Two lines per set Assume: cache block size 8 bytes Address of short int: t bits 0…01 100 1 2 7 tag v 3 6 5 4 1 2 7 tag v 3 6 5 4 find set 1 2 7 tag v 3 6 5 4 1 2 7 tag v 3 6 5 4 Winter 2013 Memory and Caches II

E-way Set-Associative Cache (Here: E = 2) E = 2: Two lines per set Assume: cache block size 8 bytes Address of short int: t bits 0…01 100 compare both valid? + match: yes = hit 1 2 7 tag v 3 6 5 4 tag block offset Winter 2013 Memory and Caches II

E-way Set-Associative Cache (Here: E = 2) E = 2: Two lines per set Assume: cache block size 8 bytes Address of short int: t bits 0…01 100 compare both valid? + match: yes = hit 1 2 7 tag v 3 6 5 4 block offset short int (2 Bytes) is here No match: One line in set is selected for eviction and replacement Replacement policies: random, least recently used (LRU), … Winter 2013 Memory and Caches II

Example (for E = 2) float dotprod(float x[8], float y[8]) { float sum = 0; int i; for (i = 0; i < 8; i++) sum += x[i]*y[i]; return sum; } If x and y have aligned starting addresses, e.g. &x[0] = 0, &y[0] = 128, can still fit both because two lines in each set x[0] x[1] x[2] x[3] y[0] y[1] y[2] y[3] x[4] x[5] x[6] x[7] y[4] y[5] y[6] y[7] Winter 2013 Memory and Caches II

Fully Set-Associative Caches (S = 1) Fully-associative caches have all lines in one single set, S = 1 E = C / B, where C is total cache size Since, S = ( C / B ) / E, therefore, S = 1 Direct-mapped caches have E = 1 S = ( C / B ) / E = C / B Tag matching is more expensive in associative caches Fully-associative cache needs C / B tag comparators: one for every line! Direct-mapped cache needs just 1 tag comparator In general, an E-way set-associative cache needs E tag comparators Tag size, assuming m address bits (m = 32 for IA32): m – log2S – log2B Winter 2013 Memory and Caches II

Intel Core i7 Cache Hierarchy Processor package Core 0 Core 3 L1 i-cache and d-cache: 32 KB, 8-way, Access: 4 cycles L2 unified cache: 256 KB, 8-way, Access: 11 cycles L3 unified cache: 8 MB, 16-way, Access: 30-40 cycles Block size: 64 bytes for all caches. Regs Regs L1 d-cache L1 i-cache L1 d-cache L1 i-cache … cat /sys/devices/system/cpu/cpu0/cache/index0/ways_of_associativity L2 unified cache L2 unified cache L3 unified cache (shared by all cores) Main memory Winter 2013 Memory and Caches II

What about writes? Multiple copies of data exist: L1, L2, possibly L3, main memory What to do on a write-hit? Write-through (write immediately to memory) Write-back (defer write to memory until line is evicted) Need a dirty bit to indicate if line is different from memory or not What to do on a write-miss? Write-allocate (load into cache, update line in cache) Good if more writes to the location follow No-write-allocate (just write immediately to memory) Typical caches: Write-back + Write-allocate, usually Write-through + No-write-allocate, occasionally Winter 2013 Memory and Caches II

Software Caches are More Flexible Examples File system buffer caches, web browser caches, etc. Some design differences Almost always fully-associative so, no placement restrictions index structures like hash tables are common (for placement) Often use complex replacement policies misses are very expensive when disk or network involved worth thousands of cycles to avoid them Not necessarily constrained to single “block” transfers may fetch or write-back in larger units, opportunistically Winter 2013 Memory and Caches II

Optimizations for the Memory Hierarchy Write code that has locality Spatial: access data contiguously Temporal: make sure access to the same data is not too far apart in time How to achieve? Proper choice of algorithm Loop transformations Winter 2013 Memory and Caches II

Example: Matrix Multiplication c = (double *) calloc(sizeof(double), n*n); /* Multiply n x n matrices a and b */ void mmm(double *a, double *b, double *c, int n) { int i, j, k; for (i = 0; i < n; i++) for (j = 0; j < n; j++) for (k = 0; k < n; k++) c[i*n + j] += a[i*n + k]*b[k*n + j]; } i is row of c, j is column of c: [i, j] k is used to simultaneously traverse row i of a and column j of b j c a b = * i Winter 2013 Memory and Caches II

Cache Miss Analysis = * = * Assume: First iteration: Matrix elements are doubles Cache block = 64 bytes = 8 doubles Cache size C << n (much smaller than n) First iteration: n/8 + n = 9n/8 misses (omitting matrix c) Afterwards in cache: (schematic) n = * = * 8 wide Winter 2013 Memory and Caches II

Cache Miss Analysis = * Assume: Other iterations: Total misses: Matrix elements are doubles Cache block = 64 bytes = 8 doubles Cache size C << n (much smaller than n) Other iterations: Again: n/8 + n = 9n/8 misses (omitting matrix c) Total misses: 9n/8 * n2 = (9/8) * n3 n = * 8 wide Winter 2013 Memory and Caches II

Blocked Matrix Multiplication c = (double *) calloc(sizeof(double), n*n); /* Multiply n x n matrices a and b */ void mmm(double *a, double *b, double *c, int n) { int i, j, k; for (i = 0; i < n; i+=B) for (j = 0; j < n; j+=B) for (k = 0; k < n; k+=B) /* B x B mini matrix multiplications */ for (i1 = i; i1 < i+B; i1++) for (j1 = j; j1 < j+B; j1++) for (k1 = k; k1 < k+B; k1++) c[i1*n + j1] += a[i1*n + k1]*b[k1*n + j1]; } j1 c a b = * i1 Block size B x B Winter 2013 Memory and Caches II

Cache Miss Analysis = * = * Assume: First (block) iteration: Cache block = 64 bytes = 8 doubles Cache size C << n (much smaller than n) Three blocks fit into cache: 3B2 < C First (block) iteration: B2/8 misses for each block 2n/B * B2/8 = nB/4 (omitting matrix c) Afterwards in cache (schematic) n/B blocks 3B2 < C: need to be able to fit three full blocks in the cache (one for c, one for a, one for b) while executing the algorithm = * Block size B x B = * Winter 2013 Memory and Caches II

Cache Miss Analysis = * Assume: Other (block) iterations: Cache block = 64 bytes = 8 doubles Cache size C << n (much smaller than n) Three blocks fit into cache: 3B2 < C Other (block) iterations: Same as first iteration 2n/B * B2/8 = nB/4 Total misses: nB/4 * (n/B)2 = n3/(4B) n/B blocks = * Block size B x B Winter 2013 Memory and Caches II

Summary No blocking: (9/8) * n3 Blocking: 1/(4B) * n3 If B = 8 difference is 4 * 8 * 9 / 8 = 36x If B = 16 difference is 4 * 16 * 9 / 8 = 72x Suggests largest possible block size B, but limit 3B2 < C! Reason for dramatic difference: Matrix multiplication has inherent temporal locality: Input data: 3n2, computation 2n3 Every array element used O(n) times! But program has to be written properly 3B2 < C: need to be able to fit three full blocks in the cache (one for c, one for a, one for b) while executing the algorithm Winter 2013 Memory and Caches II

Cache-Friendly Code Programmer can optimize for cache performance How data structures are organized How data are accessed Nested loop structure Blocking is a general technique All systems favor “cache-friendly code” Getting absolute optimum performance is very platform specific Cache sizes, line sizes, associativities, etc. Can get most of the advantage with generic code Keep working set reasonably small (temporal locality) Use small strides (spatial locality) Focus on inner loop code Winter 2013 Memory and Caches II

The Memory Mountain Intel Core i7 32 KB L1 i-cache 32 KB L1 d-cache 256 KB unified L2 cache 8M unified L3 cache All caches on-chip The Memory Mountain Winter 2013 Memory and Caches II

Winter 2013 Memory and Caches II