ECE 463/563, Microprocessor Architecture, Prof. Eric Rotenberg


ECE 463/563, Fall 2018
Reducing miss rate: cache dimensions, prefetching, loop transformations
Prof. Eric Rotenberg

Reduce Miss Rate
- Cache size, associativity, block size
- Prefetching: hardware, software
- Transform the program to increase locality

Cache size
- Increasing cache size decreases miss rate but increases hit time.
- Miss rate asymptotically approaches just the compulsory miss rate: at some point, the cache becomes large enough to eliminate capacity and conflict misses (and this point tends to be reached sooner with higher associativity).
- In the "diminishing returns" region of the curve, a small decrease in miss rate may not justify (1) a big increase in hit time, and (2) taking chip area away from other units.

[Figure: miss rate vs. log(cache size), flattening toward the compulsory miss rate with diminishing returns.]
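To quantify this trade-off, average access time (AAT) combines the two effects: AAT = hit time + miss rate x miss penalty. Below is a minimal sketch comparing two hypothetical configurations; all cycle counts and miss rates are illustrative assumptions, not measurements:

    #include <stdio.h>

    /* AAT = hit_time + miss_rate * miss_penalty (all values assumed). */
    static double aat(double hit_time, double miss_rate, double miss_penalty) {
        return hit_time + miss_rate * miss_penalty;
    }

    int main(void) {
        double penalty = 100.0;                   /* cycles to service a miss */
        double small = aat(2.0, 0.020, penalty);  /* small cache: 2-cycle hit, 2.0% miss rate */
        double large = aat(3.0, 0.018, penalty);  /* 4x cache: 3-cycle hit, 1.8% miss rate */
        printf("small: %.2f cycles, large: %.2f cycles\n", small, large);
        /* Prints 4.00 vs. 4.80: here the larger cache is a net loss. */
        return 0;
    }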

Associativity
- Increasing associativity (for a fixed cache size) tends to decrease miss rate, by decreasing conflict misses.
- It may increase hit time, for the same cache size. Energy per access must also be considered.
- Diminishing returns: 4-way or 8-way set-associative is almost equivalent to fully-associative in many cases.

[Figure: miss rate vs. log(associativity), showing diminishing returns.]
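As a side note on how the three cache dimensions interrelate: for capacity C, block size B, and associativity A, the number of sets is C / (B x A). A small worked example with assumed parameters:

    #include <stdio.h>

    int main(void) {
        /* Illustrative dimensions: 32 KB cache, 64 B blocks, 4-way. */
        unsigned size_bytes = 32 * 1024;
        unsigned block_size = 64;
        unsigned assoc      = 4;
        unsigned blocks = size_bytes / block_size;  /* 512 blocks */
        unsigned sets   = blocks / assoc;           /* 128 sets   */
        printf("%u blocks, %u sets\n", blocks, sets);
        return 0;
    }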

Block size
- Increasing block size (for a fixed cache size) may decrease miss rate up to a point, due to exploiting more spatial locality.
- Past that point, miss rate may increase due to cache pollution:
  - For a fixed cache size, a side effect of larger blocks is having fewer total blocks in the cache (e.g., a 32 KB cache holds 512 blocks of 64 B, but only 128 blocks of 256 B).
  - It's a trade-off between hits on consecutive bytes (fewer, large blocks) and hits on non-consecutive bytes (more, small blocks).
  - At some point, you exhaust all the spatial locality, and increasing block size further only takes cache space away from useful bytes in other blocks.
- A secondary drawback of a larger block is that it increases miss penalty (more bytes to fetch on a miss).

[Figure: miss rate vs. block size, a U-shaped curve whose rising side is labeled "cache pollution".]

Prefetching
- Idea: get it before you need it.
- Prefetching can be implemented in hardware, in software (e.g., by the compiler), or both.

Hardware prefetching
- General idea:
  - An autonomous hardware prefetcher sits alongside the cache.
  - It predicts which blocks may be accessed in the future, and prefetches those predicted blocks.
- Simplest hardware prefetchers: stride prefetchers (a sketch of the detection logic follows this slide).
  - +1 prefetch (stride = 1): fetch the missing block and the next sequential block. Works well for streams with high sequential locality, e.g., instruction caches.
  - +n prefetch (stride = n): observe that memory is being accessed every n blocks, so prefetch block +n. Example of code with this behavior (with four elements of b per block, the stride of 8 elements touches every second block, i.e., n = 2):

        for (i = 1; i < MAX; i += 8)
            a[i] = b[i];

    Memory layout of b, four elements per block:

        block X:   b[0]  b[1]  b[2]  b[3]
        block X+1: b[4]  b[5]  b[6]  b[7]
        block X+2: b[8]  b[9]  b[10] b[11]
        block X+3: b[12] b[13] b[14] b[15]
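As a rough illustration of stride detection (not any particular machine's design), here is a minimal per-PC stride-prefetcher sketch as it might appear in a cache-simulator model; the table size, field names, and the prefetch_block hook are all assumptions:

    #include <stdint.h>

    /* Minimal per-PC stride prefetcher sketch (illustrative only).
     * When the same load PC produces the same address stride twice in
     * a row, predict the pattern will continue and prefetch ahead. */
    #define TABLE_SIZE 64

    struct stride_entry {
        uint64_t last_addr;  /* last address seen from this PC */
        int64_t  stride;     /* last observed stride */
    };

    static struct stride_entry table[TABLE_SIZE];

    extern void prefetch_block(uint64_t addr);  /* hook into the cache model */

    void on_access(uint64_t pc, uint64_t addr) {
        struct stride_entry *e = &table[pc % TABLE_SIZE];
        int64_t stride = (int64_t)(addr - e->last_addr);

        if (stride != 0 && stride == e->stride)
            prefetch_block(addr + stride);  /* confirmed stride: fetch ahead */

        e->stride = stride;
        e->last_addr = addr;
    }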

Software prefetching
- Needs a "prefetch" instruction whose sole purpose is to calculate an address and access the cache with that address. If it hits, nothing happens. If it misses, the processor does not stall; the only thing that happens is that the cache fetches the memory block.
- Like a load instruction, except:
  - It does not delay the processor on a miss.
  - It does not change the processor's architectural state in any way: it has no destination register and does not cause exceptions (we'll learn about exceptions later).
- In other words, the sole purpose of a "prefetch" instruction is to tell the cache to fetch the specified block if it doesn't already have it.

Software prefetching (cont.)
- The compiler predicts which accesses are likely to cause misses, and inserts prefetch instructions far enough ahead to prevent those accesses from missing.
- The misses still occur, but they occur in advance: the prefetches miss, but the accesses targeted by the prefetches do not (if everything works as planned).

Original loop:

    for (i = 0; i < 100; i++)
        x[i] = c * x[i];

With prefetching:

    for (i = 0; i < 100; i++) {
        prefetch(x[i+k]);
        x[i] = c * x[i];
    }

where k depends on (1) the miss penalty and (2) the time it takes to execute one iteration assuming hits.
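On GCC and Clang, this pattern can be written with the __builtin_prefetch intrinsic, which compiles to the target's prefetch instruction (or to nothing if the target has none). A minimal runnable sketch; the prefetch distance K is an illustrative assumption to be tuned per machine:

    #include <stdio.h>

    #define N 100
    #define K 11  /* prefetch distance: illustrative, tune per machine */

    int main(void) {
        static double x[N + K];  /* padded so &x[i + K] stays in bounds */
        double c = 2.0;

        for (int i = 0; i < N; i++)
            x[i] = (double)i;

        for (int i = 0; i < N; i++) {
            __builtin_prefetch(&x[i + K]);  /* hint: bring x[i+K] into the cache */
            x[i] = c * x[i];
        }

        printf("x[99] = %g\n", x[99]);  /* prints 198 */
        return 0;
    }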

Software prefetching (cont.)

    for (i = 0; i < 100; i++) {
        prefetch(x[i+k]);
        x[i] = c * x[i];
    }

where k depends on (1) the miss penalty and (2) the time it takes to execute one iteration assuming hits.

[Figure: timeline with the CPU currently in iteration i, issuing prefetch x[i+k]; the miss penalty (the time to service a miss) spans the execution time of k iterations of the inner loop, assuming cache hits. In the example shown, k = 11.]

In other words, k should be at least the miss penalty divided by the per-iteration (all-hit) execution time, so that the block arrives by the time iteration i+k references it.
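A quick worked example, with latencies assumed purely for illustration (chosen so the result matches the figure's k = 11):

    #include <stdio.h>

    int main(void) {
        /* Assumed numbers: a miss takes ~110 cycles to service,
         * and one all-hit loop iteration takes ~10 cycles. */
        int miss_penalty = 110;
        int iter_cycles  = 10;
        int k = (miss_penalty + iter_cycles - 1) / iter_cycles;  /* ceiling division */
        printf("prefetch distance k = %d\n", k);  /* prints 11 */
        return 0;
    }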

Potential issues with prefetching
- Cache pollution:
  - Inaccurate prefetches bring in useless blocks, displacing useful ones. Must be careful not to increase miss rate.
  - Solution: prefetch blocks into a "stream buffer" or "candidate cache", and transfer a block to the main cache only when the block is actually referenced by the program (see the sketch after this list).
- Bandwidth hog:
  - Inaccurate prefetches waste bandwidth throughout the memory hierarchy. Must be careful that prefetch misses (prefetch traffic) do not delay demand misses (legitimate traffic).
  - Solutions:
    - Be selective: balance removing as many misses as possible with minimizing useless prefetches.
    - Request queues throughout the memory hierarchy should prioritize demand misses over prefetch misses.
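The stream-buffer idea can be sketched as a small FIFO of prefetched blocks checked alongside the main cache; everything below (entry count, function names, hooks) is an illustrative assumption, not a specific design:

    #include <stdint.h>
    #include <stdbool.h>

    /* Prefetched blocks wait here; a block is promoted into the main
     * cache only on a real program reference, so inaccurate prefetches
     * never pollute the cache itself. */
    #define SB_ENTRIES 4

    struct stream_buffer {
        uint64_t tag[SB_ENTRIES];
        bool     valid[SB_ENTRIES];
    };

    extern bool cache_lookup(uint64_t block);    /* hooks into the cache model */
    extern void cache_install(uint64_t block);
    extern void fetch_from_memory(uint64_t block);

    bool access_block(struct stream_buffer *sb, uint64_t block) {
        if (cache_lookup(block))
            return true;                         /* ordinary cache hit */

        for (int i = 0; i < SB_ENTRIES; i++) {
            if (sb->valid[i] && sb->tag[i] == block) {
                cache_install(block);            /* promote on real reference */
                sb->valid[i] = false;
                return true;                     /* stream-buffer hit */
            }
        }
        fetch_from_memory(block);                /* demand miss */
        cache_install(block);
        return false;
    }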

Transform program to increase locality
- Increase spatial locality: explicitly place items that are accessed close together in time close to each other in memory.
- Increase temporal locality: transform the computation to increase the number of times items are reused before being replaced in the cache.
- Examples:
  - Loop interchange (feel free to explore on your own; a small sketch follows this list)
  - Loop fusion (feel free to explore on your own)
  - Loop tiling, also called loop blocking (we'll only cover this one, since it is quite relevant)
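For reference, a hedged sketch of loop interchange, which is not covered further in lecture: C arrays are row-major, so swapping the two loops turns a stride-N column walk into a stride-1 row walk with much better spatial locality. The array name and the scaling operation are illustrative:

    #define N 1024
    static double y[N][N];

    /* Poor spatial locality: the inner loop walks down a column,
     * touching a different cache block on (nearly) every access. */
    void scale_column_order(double c) {
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                y[i][j] *= c;
    }

    /* After loop interchange: the inner loop walks along a row,
     * so consecutive accesses fall in the same cache block. */
    void scale_row_order(double c) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                y[i][j] *= c;
    }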

Tiling
- Idea: access "regions" of arrays instead of the whole array.

Original code. Each iteration of k scans the entire x[ ][ ]; the problem is that all of x[i][j] can't fit in the cache:

    for (k = 0; k < N; k++)
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                ... reference x[i][j] ...

Tiled code. The tiling factor T is selected so that one T-by-T tile of x[i][j] fits in the cache:

    for (ii = 0; ii < N; ii += T)
        for (jj = 0; jj < N; jj += T)
            for (k = 0; k < N; k++)
                for (i = ii; i < min(ii+T, N); i++)
                    for (j = jj; j < min(jj+T, N); j++)
                        ... reference x[i][j] ...

[Figure: memory layout of x[ ][ ] shown in 2D, divided into T-by-T tiles numbered 1 through 4, with a box representing the cache holding the current tile (ii = 0, jj = 0, i = 0…T, j = 0…T).]

Tiling (cont.)

[The next two slides animate the tiled traversal: first k advances (k = 1) while the tile at ii = 0, jj = 0 (i = 0…T, j = 0…T) remains resident in the cache for all N iterations of k; then the loops move on to the next tile at jj = T (j = T…2T). A runnable version of the tiled loop nest follows.]
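To make the transformation concrete, here is a hedged, runnable instance of the tiled loop nest; the "reference x[i][j]" is instantiated as a simple accumulation, and N and T are small illustrative values:

    #include <stdio.h>

    #define N 8
    #define T 4
    #define MIN(a, b) ((a) < (b) ? (a) : (b))

    int main(void) {
        static double x[N][N];
        double sum = 0.0;

        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                x[i][j] = 1.0;

        /* Tiled traversal: each T-by-T tile of x stays cache-resident
         * across all N iterations of k before the next tile is touched. */
        for (int ii = 0; ii < N; ii += T)
            for (int jj = 0; jj < N; jj += T)
                for (int k = 0; k < N; k++)
                    for (int i = ii; i < MIN(ii + T, N); i++)
                        for (int j = jj; j < MIN(jj + T, N); j++)
                            sum += x[i][j];  /* "reference x[i][j]" */

        printf("sum = %g\n", sum);  /* N*N*N references: prints 512 */
        return 0;
    }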