
CS 152 Computer Architecture and Engineering
Lecture 8 - Memory Hierarchy-III
Krste Asanovic
Electrical Engineering and Computer Sciences, University of California at Berkeley
http://www.eecs.berkeley.edu/~krste
http://inst.eecs.berkeley.edu/~cs152

Last time in Lecture 7
- 3 C's of cache misses: compulsory, capacity, conflict
- Average memory access time (AMAT) = hit time + miss rate x miss penalty
- To improve performance, reduce: hit time, miss rate, and/or miss penalty
- Primary cache parameters: total cache capacity, cache line size, associativity
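A quick numeric check of the formula helps; below is a minimal C sketch (the function name and the numbers in the comment are illustrative, not from the lecture):

    #include <stdio.h>

    /* Average memory access time; all quantities in cycles. */
    static double amat(double hit_time, double miss_rate, double miss_penalty) {
        return hit_time + miss_rate * miss_penalty;
    }

    int main(void) {
        /* e.g., 1-cycle hit, 5% miss rate, 20-cycle penalty -> 2.00 cycles */
        printf("AMAT = %.2f cycles\n", amat(1.0, 0.05, 20.0));
        return 0;
    }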

Multilevel Caches
A memory cannot be large and fast, so use increasing sizes of cache at each level: CPU -> L1$ -> L2$ -> DRAM.
- Local miss rate = misses in cache / accesses to cache
- Global miss rate = misses in cache / CPU memory accesses
- Misses per instruction (MPI) = misses in cache / number of instructions
- MPI makes it easier to compute overall performance
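As a worked example (the values are illustrative, not from the lecture): if L1 misses on 5% of its accesses and L2 misses on 40% of the accesses that reach it, L2's local miss rate is 40%, but its global miss rate is 0.05 x 0.40 = 2% of all CPU memory accesses.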

A Typical Memory Hierarchy c. 2008
- Multiported register file (part of CPU)
- Split instruction & data primary caches (on-chip SRAM)
- Large unified secondary cache (on-chip SRAM)
- Multiple interleaved memory banks (off-chip DRAM)
The implementation close to the CPU looks like a Harvard machine.
[Diagram: CPU and register file connect to split L1 instruction and data caches; both feed a unified L2 cache, which is backed by multiple interleaved memory banks]

Presence of L2 influences L1 design
Use a smaller L1 if there is also an L2:
- Trades increased L1 miss rate for reduced L1 hit time and reduced L1 miss penalty
- Reduces average access energy
Use a simpler write-through L1 cache with an on-chip L2:
- The write-back L2 cache absorbs write traffic, so writes don't go off-chip
- At most one L1 miss request per L1 access (no dirty victim write-back) simplifies pipeline control
- Simplifies coherence issues
- Simplifies error recovery in L1 (can use just parity bits in L1 and reload from L2 when a parity error is detected on an L1 read)

Inclusion Policy
Inclusive multilevel cache:
- Inner cache holds copies of data in outer cache
- An external access need only check the outer cache
- The most common case
Exclusive multilevel caches:
- Inner cache may hold data not present in outer cache
- Swap lines between inner and outer caches on a miss
- Used in the AMD Athlon, with a 64KB primary and 256KB secondary cache
Why choose one type or the other?

Itanium-2 On-Chip Caches (Intel/HP, 2002)
- Level 1: 16KB, 4-way set-associative, 64B lines, quad-port (2 load + 2 store), single-cycle latency
- Level 2: 256KB, 4-way set-associative, 128B lines, quad-port (4 load or 4 store), five-cycle latency
- Level 3: 3MB, 12-way set-associative, 128B lines, single 32B port, twelve-cycle latency
If two levels are good, then three must be better.

Reducing the Penalty of Associativity
Associativity reduces conflict misses, but requires an expensive (area, energy, delay) multi-way tag search.
Two optimizations reduce the cost of associativity:
- Victim caches
- Way prediction

Victim Caches (Jouppi 1990)
A victim cache is a small fully associative backup cache, added to a direct-mapped cache (e.g., in the HP 7200), that holds recently evicted lines.
- First look up in the direct-mapped cache
- On a miss, look in the victim cache
- On a hit in the victim cache, swap the hit line with the line just evicted from L1
- On a miss in the victim cache: the L1 victim goes to the victim cache, and the victim cache's own victim goes... where?
Result: the fast hit time of a direct-mapped cache with reduced conflict misses.
[Diagram: CPU with register file, direct-mapped L1 data cache, and a 4-block fully associative victim cache in front of the unified L2 cache; data evicted from L1 enters the victim cache, and victim-cache hits are swapped back into L1]
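To make the lookup-and-swap flow concrete, here is a toy C model; the sizes, the FIFO replacement, and the fact that only tags are tracked are all assumptions for illustration, not the HP 7200 design:

    #include <stdbool.h>
    #define L1_SETS    64   /* direct-mapped L1: one line per set */
    #define VC_ENTRIES  4   /* small fully associative victim cache */

    static unsigned l1_tag[L1_SETS];    static bool l1_valid[L1_SETS];
    static unsigned vc_blk[VC_ENTRIES]; static bool vc_valid[VC_ENTRIES];

    /* Returns true on a hit in either the L1 or the victim cache. */
    bool access_block(unsigned blk) {
        unsigned set = blk % L1_SETS, tag = blk / L1_SETS;
        if (l1_valid[set] && l1_tag[set] == tag)
            return true;                              /* fast direct-mapped hit */
        for (int i = 0; i < VC_ENTRIES; i++)
            if (vc_valid[i] && vc_blk[i] == blk) {    /* hit in victim cache */
                if (l1_valid[set])                    /* swap: VC line into L1, */
                    vc_blk[i] = l1_tag[set] * L1_SETS + set;  /* L1 line into VC */
                else
                    vc_valid[i] = false;
                l1_tag[set] = tag; l1_valid[set] = true;
                return true;
            }
        if (l1_valid[set]) {                          /* miss in both: L1 victim -> VC */
            static int fifo;
            vc_blk[fifo] = l1_tag[set] * L1_SETS + set;
            vc_valid[fifo] = true;                    /* whatever was here is the VC victim */
            fifo = (fifo + 1) % VC_ENTRIES;
        }
        l1_tag[set] = tag; l1_valid[set] = true;      /* fill L1 from the next level */
        return false;
    }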

Way Predicting Caches (MIPS R10000 off-chip L2 cache)
Use the processor address to index into a way-prediction table, then look in the predicted way at the given index:
- HIT: return a copy of the data from the cache
- MISS in the predicted way: look in the other way
  - SLOW HIT: return the data and change the entry in the prediction table
  - MISS: read the block of data from the next level of cache
Prediction for data is hard, but what about the instruction cache?
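The hit / slow-hit / miss flow above can be sketched as a toy C model of a 2-way cache with a per-set prediction table (all sizes and names are assumptions for illustration):

    #include <stdbool.h>
    #define SETS 64

    static unsigned wp_tag[SETS][2];
    static bool     wp_valid[SETS][2];
    static int      pred_way[SETS];    /* way-prediction table */

    typedef enum { FAST_HIT, SLOW_HIT, MISS } outcome_t;

    outcome_t wp_access(unsigned blk) {
        unsigned set = blk % SETS, tag = blk / SETS;
        int p = pred_way[set];
        if (wp_valid[set][p] && wp_tag[set][p] == tag)
            return FAST_HIT;               /* only the predicted way was searched */
        if (wp_valid[set][1 - p] && wp_tag[set][1 - p] == tag) {
            pred_way[set] = 1 - p;         /* retrain: change prediction table entry */
            return SLOW_HIT;               /* extra access to check the other way */
        }
        wp_tag[set][p] = tag;              /* fill from the next level into the */
        wp_valid[set][p] = true;           /* predicted way */
        return MISS;
    }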

Way Predicting Instruction Cache (Alpha 21264-like)
The fetch path keeps two way predictions, a sequential way and a branch-target way; jump control selects between the incremented PC (PC + 0x4) and the jump target, and the corresponding predicted way indexes the primary instruction cache directly.
If the way prediction is wrong, the fetch must retrain the predictor and re-access the cache.
[Diagram: PC, a +0x4 adder, and a jump-target mux under jump control supply the address and predicted way (sequential way vs. branch-target way) to the primary instruction cache, which returns the instruction]

Reduce Miss Penalty of Long Blocks: Early Restart and Critical Word First
Don't wait for the full block before restarting the CPU.
- Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution. Spatial locality means the CPU tends to want the next sequential word anyway, so it is not clear how large the benefit of early restart alone is.
- Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; the CPU continues execution while the rest of the words in the block are filled. With the long blocks popular today, critical word first is widely used.

Increasing Cache Bandwidth with Non-Blocking Caches
A non-blocking (lockup-free) cache allows the data cache to continue to supply cache hits during a miss; this requires full/empty bits on registers or out-of-order execution.
- "Hit under miss" reduces the effective miss penalty by doing useful work during a miss instead of ignoring CPU requests.
- "Hit under multiple miss" or "miss under miss" may further lower the effective miss penalty by overlapping multiple misses.
- This significantly increases the complexity of the cache controller: there can be multiple outstanding memory accesses, and an access can miss to a line that already has an outstanding miss (a secondary miss).
- Requires a pipelined or banked memory system (otherwise multiple misses cannot be supported).
The Pentium Pro allows 4 outstanding memory misses. (The Cray X1E vector supercomputer allows 2,048 outstanding memory misses.)
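One way to picture the controller bookkeeping is a file of miss status holding registers (MSHRs); the following is a hedged C sketch with assumed field names and sizes, not any particular machine's design:

    #include <stdbool.h>
    #define NUM_MSHRS 4   /* e.g., matching the Pentium Pro's 4 outstanding misses */

    typedef struct {
        bool     valid;           /* an outstanding miss to this block exists */
        unsigned block_addr;
        int      pending_targets; /* loads/stores waiting on this block */
    } mshr_t;

    static mshr_t mshr[NUM_MSHRS];

    /* Returns 0 for a new (primary) miss, 1 for a merged secondary miss,
       and -1 when all MSHRs are busy, i.e., the cache must stall/block. */
    int handle_miss(unsigned block_addr) {
        for (int i = 0; i < NUM_MSHRS; i++)
            if (mshr[i].valid && mshr[i].block_addr == block_addr) {
                mshr[i].pending_targets++;     /* secondary miss: no new request */
                return 1;
            }
        for (int i = 0; i < NUM_MSHRS; i++)
            if (!mshr[i].valid) {              /* primary miss: issue to memory */
                mshr[i] = (mshr_t){ true, block_addr, 1 };
                return 0;
            }
        return -1;                             /* structural hazard: out of MSHRs */
    }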

Value of Hit Under Miss for SPEC (old data)
[Chart: AMAT for the base configuration and "hit under n misses" as n grows 0->1, 1->2, and 2->64, for SPEC92 integer and floating-point programs]
- FP programs on average: AMAT = 0.68 -> 0.52 -> 0.34 -> 0.26
- Integer programs on average: AMAT = 0.24 -> 0.20 -> 0.19 -> 0.19
Configuration: 8KB data cache, direct-mapped, 32B blocks, 16-cycle miss penalty, SPEC92.

CS152 Administrivia
- Textbooks should appear in the Cal book store soon (already there, or in the next couple of days).
- Quiz 1 will be handed back next Tuesday in class.
- The first set of open-ended labs seemed fine. Opinions? We encourage you to find your own ideas instead of following the suggestions.

Prefetching
Speculate on future instruction and data accesses and fetch them into the cache(s); instruction accesses are easier to predict than data accesses.
Varieties of prefetching:
- Hardware prefetching
- Software prefetching
- Mixed schemes
What types of misses does prefetching affect? It reduces compulsory misses, but can increase conflict and capacity misses.

Issues in Prefetching
- Usefulness: prefetches should produce hits
- Timeliness: not late and not too early
- Cache and bandwidth pollution
[Diagram: prefetched data flows from the unified L2 cache into the L1 instruction and data caches alongside the CPU's demand traffic]

Hardware Instruction Prefetching
Instruction prefetch in the Alpha AXP 21064:
- Fetch two blocks on a miss: the requested block (i) and the next consecutive block (i+1)
- The requested block is placed in the cache; the next block goes into an instruction stream buffer
- If an access misses in the cache but hits in the stream buffer, move the stream buffer block into the cache and prefetch the next block (i+2)
The stream buffer must be checked whenever a block is requested; it never holds more than one 32-byte block.
[Diagram: on a miss, the requested block flows from the unified L2 cache into the L1 instruction cache while the prefetched block is held in the stream buffer]
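A hedged C sketch of that fetch path (the helpers icache_hit, icache_fill, and start_memory_fetch are hypothetical stand-ins for hardware actions; the one-block buffer mirrors the 21064 description above):

    #include <stdbool.h>

    extern bool icache_hit(unsigned blk);          /* hypothetical helper */
    extern void icache_fill(unsigned blk);         /* hypothetical helper */
    extern void start_memory_fetch(unsigned blk);  /* hypothetical helper */

    static unsigned sb_blk;   /* the single block the stream buffer may hold */
    static bool sb_valid;

    void fetch_block(unsigned blk) {
        if (icache_hit(blk))
            return;
        if (sb_valid && sb_blk == blk) {   /* miss in cache, hit in stream buffer: */
            icache_fill(blk);              /* move buffered block into the cache */
        } else {                           /* miss in both: fetch requested block */
            start_memory_fetch(blk);
            icache_fill(blk);
        }
        sb_blk = blk + 1;                  /* prefetch the next consecutive block */
        sb_valid = true;
        start_memory_fetch(sb_blk);
    }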

Hardware Data Prefetching
- Prefetch-on-miss: prefetch block b+1 upon a miss on block b.
- One Block Lookahead (OBL) scheme: initiate a prefetch for block b+1 when block b is accessed. Why is this different from doubling the block size? Can be extended to N-block lookahead.
- Strided prefetch: if a sequence of accesses to blocks b, b+N, b+2N is observed, prefetch b+3N, and so on.
Example: the IBM POWER5 [2003] supports eight independent streams of strided prefetch per processor, prefetching 12 lines ahead of the current access. The HP PA 7200 uses OBL prefetching. Tagged prefetching is twice as effective as prefetch-on-miss in reducing miss rates.
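A minimal C sketch of the stride-detection idea (a single stream; real designs such as the POWER5 track many independent streams, and all names here are illustrative):

    /* Observe one block address; return the block to prefetch,
       or -1 while no stable stride has been established. */
    long observe_access(long blk) {
        static long last_blk, last_stride;
        static bool confident;

        long stride = blk - last_blk;
        if (stride != 0 && stride == last_stride) {
            confident = true;              /* same stride seen twice in a row */
        } else {
            confident = false;             /* stride changed: retrain */
            last_stride = stride;
        }
        last_blk = blk;
        return confident ? blk + stride : -1;  /* e.g., b, b+N, b+2N -> prefetch b+3N */
    }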

Software Prefetching

    for (i = 0; i < N; i++) {
        prefetch(&a[i + 1]);
        prefetch(&b[i + 1]);
        SUM = SUM + a[i] * b[i];
    }

What property do we require of the cache for prefetching to work? The cache should be non-blocking (lockup-free): the processor can proceed while the prefetched data is being fetched, and the cache continues to supply instructions and data while waiting for the prefetched data to return.
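The slide's prefetch() is a generic placeholder. On GCC or Clang one concrete spelling is the __builtin_prefetch intrinsic; the sketch below is a hedged, compilable version (the array types and function name are illustrative):

    #include <stddef.h>

    double dot_product_prefetch(const double *a, const double *b, size_t n) {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++) {
            /* Hint the next elements into the cache; prefetch hints do not fault. */
            __builtin_prefetch(&a[i + 1]);
            __builtin_prefetch(&b[i + 1]);
            sum += a[i] * b[i];
        }
        return sum;
    }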

Software Prefetching Issues
Timing is the biggest issue, not predictability:
- Prefetch very close to when the data is required and you might be too late.
- Prefetch too early and you cause pollution.
- Estimate how long it will take for the data to come into L1, so that the prefetch distance P can be set appropriately. Why is this hard to do?

    for (i = 0; i < N; i++) {
        prefetch(&a[i + P]);
        prefetch(&b[i + P]);
        SUM = SUM + a[i] * b[i];
    }

Must also consider the cost of the prefetch instructions themselves.
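A hedged back-of-the-envelope for choosing P (numbers illustrative, not from the lecture): if a miss takes about 100 cycles to fill from memory and one loop iteration takes about 10 cycles, the prefetch must be issued at least 100 / 10 = 10 iterations ahead, so P is roughly 10. Much smaller and the data arrives late; much larger and the line may be evicted (or evict something useful) before it is used.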

Compiler Optimizations
Restructuring code affects the data block access sequence:
- Group data accesses together to improve spatial locality
- Re-order data accesses to improve temporal locality
Prevent data from entering the cache:
- Useful for variables that will only be accessed once before being replaced
- Needs a mechanism for software to tell hardware not to cache data (instruction hints or page-table bits)
Kill data that will never be used again:
- Streaming data exploits spatial locality but not temporal locality
- Replace into dead cache locations
Miss-rate reduction without any hardware changes: the hardware designer's favorite solution.

Loop Interchange

    for (j = 0; j < N; j++) {
        for (i = 0; i < M; i++) {
            x[i][j] = 2 * x[i][j];
        }
    }

becomes

    for (i = 0; i < M; i++) {
        for (j = 0; j < N; j++) {
            x[i][j] = 2 * x[i][j];
        }
    }

What type of locality does this improve?

Loop Fusion

    for (i = 0; i < N; i++)
        a[i] = b[i] * c[i];
    for (i = 0; i < N; i++)
        d[i] = a[i] * c[i];

becomes

    for (i = 0; i < N; i++) {
        a[i] = b[i] * c[i];
        d[i] = a[i] * c[i];
    }

What type of locality does this improve?

Blocking

    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++) {
            r = 0;
            for (k = 0; k < N; k++)
                r = r + y[i][k] * z[k][j];
            x[i][j] = r;
        }

[Diagram: access patterns for x (rows i, columns j), y (rows i, columns k), and z (rows k, columns j), marking untouched elements, old accesses, and new accesses]

Blocking

    for (jj = 0; jj < N; jj = jj + B)
        for (kk = 0; kk < N; kk = kk + B)
            for (i = 0; i < N; i++)
                for (j = jj; j < min(jj + B, N); j++) {
                    r = 0;
                    for (k = kk; k < min(kk + B, N); k++)
                        r = r + y[i][k] * z[k][j];
                    x[i][j] = x[i][j] + r;
                }

y benefits from spatial locality; z benefits from temporal locality.
What type of locality does this improve?
[Diagram: blocked access patterns for x, y, and z with blocking factor B]
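A hedged rule of thumb for choosing B (not stated on the slide): pick B so that the B x B block of z being reused, together with the strips of x and y currently in flight, fits comfortably in the cache; a common approximation for blocked matrix multiply is to keep roughly 3 * B * B elements below the cache capacity.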

Workstation Memory System (Apple Power Mac G5, 2003)
Dual 2GHz processors, each with:
- 64KB I-cache, direct-mapped
- 32KB D-cache, 2-way
- 512KB unified L2 cache, 8-way
- All 128B lines
- 1GHz, 2x32-bit front-side bus, 16GB/s
North Bridge chip connects:
- Up to 8GB DDR SDRAM, 400MHz, 128-bit bus, 6.4GB/s
- AGP graphics card, 533MHz, 32-bit bus, 2.1GB/s
- PCI-X expansion, 133MHz, 64-bit bus, 1GB/s
Block diagram from the Apple white paper on G5 technology.

Acknowledgements
These slides contain material developed and copyright by:
- Arvind (MIT)
- Krste Asanovic (MIT/UCB)
- Joel Emer (Intel/MIT)
- James Hoe (CMU)
- John Kubiatowicz (UCB)
- David Patterson (UCB)
MIT material derived from course 6.823; UCB material derived from course CS252.