Chapter 5 Memory III CSE 820

Miss Rate Reduction (cont'd)

Larger Block Size
Reduces compulsory misses through spatial locality.
But:
–miss penalty increases: higher memory bandwidth helps
–miss rate can increase: a fixed cache size with larger blocks means fewer blocks in the cache, so conflict and capacity misses can rise

[Figure: miss rate vs. block size]
Notice the "U" shape: some increase in block size is good, too much is bad.

Larger Caches
Reduces capacity misses.
But:
–increased hit time
–increased cost ($)
Over time, L2 and higher-level cache sizes have kept increasing.

Higher Associativity
Reduces miss rate by reducing conflict misses.
But:
–increased hit time (the tag check takes longer)
Note:
–an 8-way set-associative cache has close to the same miss rate as a fully associative one

Way Prediction
Predict which way of a set-associative L1 cache will be accessed next, and check that way first.
–Alpha 21264: a correct prediction takes 1 cycle, an incorrect prediction 3 cycles
–On SPEC95, the prediction is 85% correct
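To make the mechanism concrete, here is a toy software model in C. Way prediction is really a hardware structure, so the names, table sizes, and the 1-cycle/3-cycle costs below are illustrative stand-ins for the slide's numbers, not a real implementation.

  #include <stdbool.h>
  #include <stdint.h>

  #define NUM_SETS 256
  #define NUM_WAYS 2

  typedef struct { bool valid; uint64_t tag; } Line;

  static Line    cache[NUM_SETS][NUM_WAYS];
  static uint8_t predicted_way[NUM_SETS];      /* one predicted way per set */

  /* Returns the hit latency in cycles: 1 if the predicted way is right,
     3 if the line is found in another way (penalty from the slide),
     -1 on a miss (handled elsewhere). */
  int access_latency(uint64_t tag, unsigned set) {
      unsigned guess = predicted_way[set];
      if (cache[set][guess].valid && cache[set][guess].tag == tag)
          return 1;                              /* predicted-way hit */
      for (unsigned w = 0; w < NUM_WAYS; w++)
          if (w != guess && cache[set][w].valid && cache[set][w].tag == tag) {
              predicted_way[set] = (uint8_t)w;   /* retrain the predictor */
              return 3;                          /* mispredicted-way hit */
          }
      return -1;                                 /* miss */
  }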

Compiler Techniques
Reduce conflicts in the I-cache by reordering code: a 1989 study showed misses reduced by 50% for a 2 KB cache and by 75% for an 8 KB cache.
The D-cache behaves differently, so data needs its own optimizations (next slides).

Compiler data optimizations
Loop Interchange
Before (in C; the loop bounds M and N are filled in for illustration):
  for (j = 0; j < N; j++)
    for (i = 0; i < M; i++)
      x[i][j] = 2 * x[i][j];
After:
  for (i = 0; i < M; i++)
    for (j = 0; j < N; j++)
      x[i][j] = 2 * x[i][j];
Improved spatial locality: C stores x row-major, so the interchanged inner loop walks consecutive memory addresses.

Blocking: Improve Spatial Locality
[Before/after figures lost in transcription: they contrast the access pattern of an unblocked computation with one that works on cache-sized sub-blocks.]
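A minimal sketch in C of the classic blocked example, matrix multiply. The tile size B and dimension N are illustrative and should be tuned so that roughly three B×B tiles fit in the cache:

  #define N 512   /* matrix dimension (illustrative; assumes N % B == 0) */
  #define B 32    /* tile size */

  /* x must be zero-initialized; each (jj, kk) pass adds one tile's
     partial products, reusing the y and z tiles while they are cached. */
  void matmul_blocked(double x[N][N], double y[N][N], double z[N][N]) {
      for (int jj = 0; jj < N; jj += B)
          for (int kk = 0; kk < N; kk += B)
              for (int i = 0; i < N; i++)
                  for (int j = jj; j < jj + B; j++) {
                      double r = 0.0;
                      for (int k = kk; k < kk + B; k++)
                          r += y[i][k] * z[k][j];
                      x[i][j] += r;
                  }
  }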

Miss Rate and Miss Penalty Reduction via Parallelism

Nonblocking Caches
Reduce stalls on a cache miss.
–A blocking cache refuses all requests while waiting for miss data.
–A nonblocking cache continues to handle other requests ("hit under miss", "miss under miss") while waiting for data on an outstanding miss.
Cost: increased cache-controller complexity, since outstanding misses must be tracked.
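That tracking is conventionally done with miss status holding registers (MSHRs). A toy C sketch of the bookkeeping; the struct layout, entry count, and function are illustrative, not from the slides:

  #include <stdbool.h>
  #include <stdint.h>

  #define NUM_MSHRS 8

  typedef struct {
      bool     in_use;       /* entry tracks an outstanding miss */
      uint64_t block_addr;   /* block being fetched from the next level */
      int      num_waiters;  /* requests merged onto this miss */
  } MSHR;

  static MSHR mshr[NUM_MSHRS];

  /* On a miss: merge with an outstanding miss to the same block if one
     exists, else allocate a free MSHR; if all are busy, the cache must
     stall (a structural hazard). Returns false on that stall. */
  bool handle_miss(uint64_t block_addr) {
      for (int i = 0; i < NUM_MSHRS; i++)
          if (mshr[i].in_use && mshr[i].block_addr == block_addr) {
              mshr[i].num_waiters++;                   /* secondary miss */
              return true;
          }
      for (int i = 0; i < NUM_MSHRS; i++)
          if (!mshr[i].in_use) {
              mshr[i] = (MSHR){ true, block_addr, 1 }; /* primary miss */
              return true;
          }
      return false;
  }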

[Figure: performance of a nonblocking cache (8 KB direct-mapped L1, 32-byte blocks)]

Hardware Prefetch
On an instruction miss, fetch two blocks: the desired block plus the next sequential one.
The "next" block goes into a stream buffer; on a fetch, check the stream buffer first.
Performance:
–a single-block instruction stream buffer caught 15% to 25% of L1 misses
–a 4-block stream buffer caught 50%
–a 16-block stream buffer caught 72%

Hardware Prefetch
Data prefetch:
–a single data stream buffer caught 25% of L1 misses
–4 data stream buffers caught 43%
–8 data stream buffers caught 50% to 70%
Prefetching from multiple addresses: the UltraSPARC III handles 8 simultaneous prefetches and calculates the "stride" between misses to predict the next address.
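A toy sketch in C of the stride calculation the slide alludes to (purely illustrative; the real mechanism is a hardware table of predictors, not this single global one):

  #include <stdint.h>

  static uint64_t last_miss_addr = 0;
  static int64_t  last_stride    = 0;

  /* Record a miss address; if the stride between the last two misses
     repeats, return a predicted prefetch address, else return 0. */
  uint64_t predict_next_prefetch(uint64_t miss_addr) {
      int64_t stride = (int64_t)(miss_addr - last_miss_addr);
      uint64_t predicted =
          (stride != 0 && stride == last_stride) ? miss_addr + stride : 0;
      last_stride    = stride;
      last_miss_addr = miss_addr;
      return predicted;
  }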

Software Prefetch
Many processors, such as the Itanium, have prefetch instructions.
Remember that they are nonfaulting: a prefetch of an illegal address is simply dropped instead of raising an exception, so the compiler can prefetch speculatively.
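Compilers expose such instructions as intrinsics; for example, GCC and Clang provide __builtin_prefetch(addr, rw, locality). A sketch, where the prefetch distance PF_DIST is an illustrative tuning parameter:

  #include <stddef.h>

  #define PF_DIST 16   /* elements to prefetch ahead; tune per machine */

  double sum(const double *a, size_t n) {
      double s = 0.0;
      for (size_t i = 0; i < n; i++) {
          /* rw = 0 (read), locality = 1 (low temporal reuse); because the
             underlying instruction is nonfaulting, prefetching past the
             end of the array near the loop's finish is harmless in practice */
          __builtin_prefetch(&a[i + PF_DIST], 0, 1);
          s += a[i];
      }
      return s;
  }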

Hit Time Reduction

Small, Simple Caches
Hit time has two components: indexing and comparing the tag.
Small → indexing is fast.
Simple → direct-mapped allows the tag comparison to run in parallel with the data load → for L2, keep the tags on chip with the data off chip.

[Figure: access time vs. cache size and organization]

Perspective on the previous graph
Equivalences:
–a 1 ns clock is 10^-9 sec per clock cycle
–1 GHz is 10^9 clock cycles per sec
Therefore:
–a 2 ns clock is 500 MHz
–a 4 ns clock is 250 MHz
Conclusion: a small difference in ns represents a large difference in MHz.

Virtual vs Physical Address in L1
Translating a virtual address to a physical address as part of a cache access adds time on the critical path, and translation is needed for both the index and the tag.
Making the common case fast suggests avoiding translation on hits (misses must be translated anyway).

Why are (almost all) L1 caches physical?
–Security (protection): page-level protection must be checked on every access (though the protection data can be copied into the cache)
–A process switch changes the virtual mapping, requiring a cache flush (or process IDs in the tags) [see next slide]
–Synonyms: two virtual addresses for the same (shared) physical address could put the same data in two places in a virtual cache

[Figure: context-switch cost of a virtually addressed cache]

Hybrid: virtually indexed, physically tagged
Index with the part of the page offset that is identical in the virtual and physical addresses, i.e., the index bits are a subset of the page-offset bits.
In parallel with indexing, translate the virtual address to check the physical tag.
Limitation: a direct-mapped cache can be no larger than the page size (the index and block-offset bits must fit in the page offset); set-associative caches can be bigger, since fewer bits are needed for the index.

Example
–Pentium III: 8 KB pages with a 16 KB 2-way set-associative cache
–IBM 3033: 4 KB pages with a 64 KB 16-way set-associative cache (8-way would be sufficient for the miss rate, but 16-way is needed to keep the index bits sufficiently small)
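The constraint behind both examples, checked with their numbers:

  cache size ≤ page size × associativity
  Pentium III: 8 KB × 2  = 16 KB  (exactly its cache size)
  IBM 3033:    4 KB × 16 = 64 KB  (exactly its cache size)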

Trace Cache (Pentium 4 NetBurst architecture)
I-cache blocks are organized to contain instruction traces, including predicted-taken branches, instead of being organized around sequential memory addresses.
Advantage over conventional large cache blocks, which contain branches and hence many instructions that never execute (e.g., in the AMD Athlon, 64-byte blocks contain x86 instructions with about 1 in 5 being a branch).
Disadvantage: complex addressing.

Trace Cache
The P4 trace cache (its I-cache) is placed after decode and branch prediction, so it contains:
–μops, not raw x86 instructions
–only the instructions on the predicted path
The trace cache holds 12K μops.
The branch-prediction BTB has 4K entries (a 33% improvement over the PIII).

Summary (so far)
Figure 5.26 in the text summarizes all of these techniques.

Main Memory
Main-memory modifications can reduce the cache miss penalty by bringing words from memory faster:
–a wider path to memory brings in more words at a time, e.g., one address request returns 4 words (amortizing the overhead)
–interleaved memory banks can overlap accesses so memory responds faster
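A worked comparison under assumed timings (illustrative numbers in the style of the textbook: 1 cycle to send the address, 15 cycles per DRAM access, 1 cycle to transfer one word, 4-word cache blocks):

  one-word-wide memory:   1 + 4 × 15 + 4 × 1 = 65 cycles per miss
  four-word-wide memory:  1 + 15 + 1         = 17 cycles per miss
  4-way interleaved:      1 + 15 + 4 × 1     = 20 cycles per miss
                          (bank accesses overlap; words still transfer one at a time)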