ECE 463/563, Microprocessor Architecture, Prof. Eric Rotenberg


ECE 463/563, Fall 2018
Reducing miss rate: cache dimensions, prefetching, loop transformations
Prof. Eric Rotenberg

Reduce Miss Rate
- Cache size, associativity, block size
- Prefetching: hardware, software
- Transform the program to increase locality

Cache size
- Increasing cache size decreases miss rate but increases hit time.
- Miss rate asymptotically approaches just the compulsory miss rate: at some point, the cache becomes large enough to eliminate capacity and conflict misses (and this point tends to be reached sooner with higher associativity).
- In the "diminishing returns" region of the curve, a small decrease in miss rate may not justify (1) a big increase in hit time, and (2) taking chip area away from other units.

[Figure: miss rate vs. log(cache size), flattening toward the compulsory miss rate with diminishing returns.]
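To quantify this trade-off, average access time (AAT) combines the two effects: AAT = hit time + miss rate x miss penalty. Below is a minimal sketch comparing two hypothetical configurations; all cycle counts and miss rates are illustrative assumptions, not measurements:

    #include <stdio.h>

    /* AAT = hit_time + miss_rate * miss_penalty (all values assumed). */
    static double aat(double hit_time, double miss_rate, double miss_penalty) {
        return hit_time + miss_rate * miss_penalty;
    }

    int main(void) {
        double penalty = 100.0;                   /* cycles to service a miss */
        double small = aat(2.0, 0.020, penalty);  /* small cache: 2-cycle hit, 2.0% miss rate */
        double large = aat(3.0, 0.018, penalty);  /* 4x cache: 3-cycle hit, 1.8% miss rate */
        printf("small: %.2f cycles, large: %.2f cycles\n", small, large);
        /* Prints 4.00 vs. 4.80: here the larger cache is a net loss. */
        return 0;
    }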

Associativity
- Increasing associativity (for a fixed cache size) tends to decrease miss rate, by decreasing conflict misses.
- It may increase hit time, for the same cache size. Energy per access must also be considered.
- Diminishing returns: 4-way or 8-way set-associative is almost equivalent to fully-associative in many cases.

[Figure: miss rate vs. log(associativity), showing diminishing returns.]
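As a side note on how the three cache dimensions interrelate: for capacity C, block size B, and associativity A, the number of sets is C / (B x A). A small worked example with assumed parameters:

    #include <stdio.h>

    int main(void) {
        /* Illustrative dimensions: 32 KB cache, 64 B blocks, 4-way. */
        unsigned size_bytes = 32 * 1024;
        unsigned block_size = 64;
        unsigned assoc      = 4;
        unsigned blocks = size_bytes / block_size;  /* 512 blocks */
        unsigned sets   = blocks / assoc;           /* 128 sets   */
        printf("%u blocks, %u sets\n", blocks, sets);
        return 0;
    }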

Block size
- Increasing block size (for a fixed cache size) may decrease miss rate up to a point, due to exploiting more spatial locality.
- Past that point, miss rate may increase due to cache pollution:
  - For a fixed cache size, a side effect of larger blocks is having fewer total blocks in the cache (e.g., a 32 KB cache holds 512 blocks of 64 B, but only 128 blocks of 256 B).
  - It's a trade-off between hits on consecutive bytes (fewer, large blocks) and hits on non-consecutive bytes (more, small blocks).
  - At some point, you exhaust all the spatial locality, and increasing block size further only takes cache space away from useful bytes in other blocks.
- A secondary drawback of a larger block is that it increases miss penalty (more bytes to fetch on a miss).

[Figure: miss rate vs. block size, a U-shaped curve whose rising side is labeled "cache pollution".]

Prefetching
- Idea: get it before you need it.
- Prefetching can be implemented in hardware, in software (e.g., by the compiler), or both.

Hardware prefetching
- General idea:
  - An autonomous hardware prefetcher sits alongside the cache.
  - It predicts which blocks may be accessed in the future, and prefetches those predicted blocks.
- Simplest hardware prefetchers: stride prefetchers (a sketch of the detection logic follows this slide).
  - +1 prefetch (stride = 1): fetch the missing block and the next sequential block. Works well for streams with high sequential locality, e.g., instruction caches.
  - +n prefetch (stride = n): observe that memory is being accessed every n blocks, so prefetch block +n. Example of code with this behavior (with four elements of b per block, the stride of 8 elements touches every second block, i.e., n = 2):

        for (i = 1; i < MAX; i += 8)
            a[i] = b[i];

    Memory layout of b, four elements per block:

        block X:   b[0]  b[1]  b[2]  b[3]
        block X+1: b[4]  b[5]  b[6]  b[7]
        block X+2: b[8]  b[9]  b[10] b[11]
        block X+3: b[12] b[13] b[14] b[15]
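As a rough illustration of stride detection (not any particular machine's design), here is a minimal per-PC stride-prefetcher sketch as it might appear in a cache-simulator model; the table size, field names, and the prefetch_block hook are all assumptions:

    #include <stdint.h>

    /* Minimal per-PC stride prefetcher sketch (illustrative only).
     * When the same load PC produces the same address stride twice in
     * a row, predict the pattern will continue and prefetch ahead. */
    #define TABLE_SIZE 64

    struct stride_entry {
        uint64_t last_addr;  /* last address seen from this PC */
        int64_t  stride;     /* last observed stride */
    };

    static struct stride_entry table[TABLE_SIZE];

    extern void prefetch_block(uint64_t addr);  /* hook into the cache model */

    void on_access(uint64_t pc, uint64_t addr) {
        struct stride_entry *e = &table[pc % TABLE_SIZE];
        int64_t stride = (int64_t)(addr - e->last_addr);

        if (stride != 0 && stride == e->stride)
            prefetch_block(addr + stride);  /* confirmed stride: fetch ahead */

        e->stride = stride;
        e->last_addr = addr;
    }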

Software prefetching
- Needs a "prefetch" instruction whose sole purpose is to calculate an address and access the cache with that address. If it hits, nothing happens. If it misses, the processor does not stall; the only thing that happens is that the cache fetches the memory block.
- Like a load instruction, except:
  - It does not delay the processor on a miss.
  - It does not change the processor's architectural state in any way: it has no destination register and does not cause exceptions (we'll learn about exceptions later).
- In other words, the sole purpose of a "prefetch" instruction is to tell the cache to fetch the specified block if it doesn't already have it.

Software prefetching (cont.)
- The compiler predicts which accesses are likely to cause misses, and inserts prefetch instructions far enough ahead to prevent those accesses from missing.
- The misses still occur, but they occur in advance: the prefetches miss, but the accesses targeted by the prefetches do not (if everything works as planned).

Original loop:

    for (i = 0; i < 100; i++)
        x[i] = c * x[i];

With prefetching:

    for (i = 0; i < 100; i++) {
        prefetch(x[i+k]);
        x[i] = c * x[i];
    }

where k depends on (1) the miss penalty and (2) the time it takes to execute one iteration assuming hits.
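On GCC and Clang, this pattern can be written with the __builtin_prefetch intrinsic, which compiles to the target's prefetch instruction (or to nothing if the target has none). A minimal runnable sketch; the prefetch distance K is an illustrative assumption to be tuned per machine:

    #include <stdio.h>

    #define N 100
    #define K 11  /* prefetch distance: illustrative, tune per machine */

    int main(void) {
        static double x[N + K];  /* padded so &x[i + K] stays in bounds */
        double c = 2.0;

        for (int i = 0; i < N; i++)
            x[i] = (double)i;

        for (int i = 0; i < N; i++) {
            __builtin_prefetch(&x[i + K]);  /* hint: bring x[i+K] into the cache */
            x[i] = c * x[i];
        }

        printf("x[99] = %g\n", x[99]);  /* prints 198 */
        return 0;
    }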

Software prefetching (cont.)

    for (i = 0; i < 100; i++) {
        prefetch(x[i+k]);
        x[i] = c * x[i];
    }

where k depends on (1) the miss penalty and (2) the time it takes to execute one iteration assuming hits.

[Figure: timeline with the CPU currently in iteration i, issuing prefetch x[i+k]; the miss penalty (the time to service a miss) spans the execution time of k iterations of the inner loop, assuming cache hits. In the example shown, k = 11.]

In other words, k should be at least the miss penalty divided by the per-iteration (all-hit) execution time, so that the block arrives by the time iteration i+k references it.
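A quick worked example, with latencies assumed purely for illustration (chosen so the result matches the figure's k = 11):

    #include <stdio.h>

    int main(void) {
        /* Assumed numbers: a miss takes ~110 cycles to service,
         * and one all-hit loop iteration takes ~10 cycles. */
        int miss_penalty = 110;
        int iter_cycles  = 10;
        int k = (miss_penalty + iter_cycles - 1) / iter_cycles;  /* ceiling division */
        printf("prefetch distance k = %d\n", k);  /* prints 11 */
        return 0;
    }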

Potential issues with prefetching
- Cache pollution:
  - Inaccurate prefetches bring in useless blocks, displacing useful ones. Must be careful not to increase miss rate.
  - Solution: prefetch blocks into a "stream buffer" or "candidate cache", and transfer a block to the main cache only when the block is actually referenced by the program (see the sketch after this list).
- Bandwidth hog:
  - Inaccurate prefetches waste bandwidth throughout the memory hierarchy. Must be careful that prefetch misses (prefetch traffic) do not delay demand misses (legitimate traffic).
  - Solutions:
    - Be selective: balance removing as many misses as possible with minimizing useless prefetches.
    - Request queues throughout the memory hierarchy should prioritize demand misses over prefetch misses.
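The stream-buffer idea can be sketched as a small FIFO of prefetched blocks checked alongside the main cache; everything below (entry count, function names, hooks) is an illustrative assumption, not a specific design:

    #include <stdint.h>
    #include <stdbool.h>

    /* Prefetched blocks wait here; a block is promoted into the main
     * cache only on a real program reference, so inaccurate prefetches
     * never pollute the cache itself. */
    #define SB_ENTRIES 4

    struct stream_buffer {
        uint64_t tag[SB_ENTRIES];
        bool     valid[SB_ENTRIES];
    };

    extern bool cache_lookup(uint64_t block);    /* hooks into the cache model */
    extern void cache_install(uint64_t block);
    extern void fetch_from_memory(uint64_t block);

    bool access_block(struct stream_buffer *sb, uint64_t block) {
        if (cache_lookup(block))
            return true;                         /* ordinary cache hit */

        for (int i = 0; i < SB_ENTRIES; i++) {
            if (sb->valid[i] && sb->tag[i] == block) {
                cache_install(block);            /* promote on real reference */
                sb->valid[i] = false;
                return true;                     /* stream-buffer hit */
            }
        }
        fetch_from_memory(block);                /* demand miss */
        cache_install(block);
        return false;
    }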

Transform program to increase locality
- Increase spatial locality: explicitly place items that are accessed close together in time close to each other in memory.
- Increase temporal locality: transform the computation to increase the number of times items are reused before being replaced in the cache.
- Examples:
  - Loop interchange (feel free to explore on your own; a small sketch follows this list)
  - Loop fusion (feel free to explore on your own)
  - Loop tiling, also called loop blocking (we'll only cover this one, since it is quite relevant)
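For reference, a hedged sketch of loop interchange, which is not covered further in lecture: C arrays are row-major, so swapping the two loops turns a stride-N column walk into a stride-1 row walk with much better spatial locality. The array name and the scaling operation are illustrative:

    #define N 1024
    static double y[N][N];

    /* Poor spatial locality: the inner loop walks down a column,
     * touching a different cache block on (nearly) every access. */
    void scale_column_order(double c) {
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                y[i][j] *= c;
    }

    /* After loop interchange: the inner loop walks along a row,
     * so consecutive accesses fall in the same cache block. */
    void scale_row_order(double c) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                y[i][j] *= c;
    }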

Tiling
- Idea: access "regions" of arrays instead of the whole array.

Original code. Each iteration of k scans the entire x[ ][ ]; the problem is that all of x[i][j] can't fit in the cache:

    for (k = 0; k < N; k++)
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                ... reference x[i][j] ...

Tiled code. The tiling factor T is selected so that one T-by-T tile of x[i][j] fits in the cache:

    for (ii = 0; ii < N; ii += T)
        for (jj = 0; jj < N; jj += T)
            for (k = 0; k < N; k++)
                for (i = ii; i < min(ii+T, N); i++)
                    for (j = jj; j < min(jj+T, N); j++)
                        ... reference x[i][j] ...

[Figure: memory layout of x[ ][ ] shown in 2D, divided into T-by-T tiles numbered 1 through 4, with a box representing the cache holding the current tile (ii = 0, jj = 0, i = 0…T, j = 0…T).]

Tiling (cont.)

[The next two slides animate the tiled traversal: first k advances (k = 1) while the tile at ii = 0, jj = 0 (i = 0…T, j = 0…T) remains resident in the cache for all N iterations of k; then the loops move on to the next tile at jj = T (j = T…2T). A runnable version of the tiled loop nest follows.]
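To make the transformation concrete, here is a hedged, runnable instance of the tiled loop nest; the "reference x[i][j]" is instantiated as a simple accumulation, and N and T are small illustrative values:

    #include <stdio.h>

    #define N 8
    #define T 4
    #define MIN(a, b) ((a) < (b) ? (a) : (b))

    int main(void) {
        static double x[N][N];
        double sum = 0.0;

        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                x[i][j] = 1.0;

        /* Tiled traversal: each T-by-T tile of x stays cache-resident
         * across all N iterations of k before the next tile is touched. */
        for (int ii = 0; ii < N; ii += T)
            for (int jj = 0; jj < N; jj += T)
                for (int k = 0; k < N; k++)
                    for (int i = ii; i < MIN(ii + T, N); i++)
                        for (int j = jj; j < MIN(jj + T, N); j++)
                            sum += x[i][j];  /* "reference x[i][j]" */

        printf("sum = %g\n", sum);  /* N*N*N references: prints 512 */
        return 0;
    }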