EECS 470 Cache Systems Lecture 13 Coverage: Chapter 5

Memory pyramid (Cache Design)
–Registers (100s of bytes): accessed early in the pipeline
–L1 cache (several KB): 1-3 cycle access
–L2 cache (½-32 MB): 6-15 cycle access
–Memory (128 MB – a few GB)
–Disk (many GB): millions of cycles per access!
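
The standard figure of merit for this hierarchy is: Average memory access time = Hit time + Miss rate × Miss penalty. With illustrative (made-up) numbers — a 2-cycle L1 hit time, a 5% miss rate, and a 10-cycle penalty to L2 — the average is 2 + 0.05 × 10 = 2.5 cycles.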

Direct-mapped cache
Each cache line holds a valid (V) bit, a dirty (d) bit, a tag, and data; the memory address splits into a Block Offset (1 bit), a Line Index (2 bits), and a Tag (2 bits).
–Compulsory miss: first reference to a memory block
–Capacity miss: working set doesn’t fit in the cache
–Conflict miss: working set maps to the same cache line
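
A minimal sketch of that address split in C, using the slide's toy 5-bit layout (the example address is arbitrary):

    #include <stdio.h>
    #include <stdint.h>

    /* Field widths from the slide: 1-bit block offset, 2-bit line index,
       2-bit tag -- a toy 5-bit address space. */
    #define OFFSET_BITS 1
    #define INDEX_BITS  2

    int main(void) {
        uint32_t addr   = 0x1B;  /* example address 0b11011 (hypothetical) */
        uint32_t offset =  addr & ((1u << OFFSET_BITS) - 1);
        uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
        uint32_t tag    =  addr >> (OFFSET_BITS + INDEX_BITS);
        printf("tag=%u index=%u offset=%u\n", tag, index, offset);  /* tag=3 index=1 offset=1 */
        return 0;
    }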

2-way set associative cache
Per-line state is unchanged (V, d, tag, data); the memory address now splits into the (unchanged) Block Offset, a 1-bit Set Index, and a larger (3-bit) Tag.
Rule of thumb: increasing associativity decreases conflict misses. A 2-way set associative cache has about the same hit rate as a direct-mapped cache twice the size.

Effects of Varying Cache Parameters
Total cache size = block size × # sets × associativity
–Positives: should decrease miss rate
–Negatives: may increase hit time; increased area requirements
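
For example (sizes chosen purely for illustration): 32-byte blocks × 128 sets × 2 ways = 8 KB of data storage; tag and status bits come on top of that.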

Effects of Varying Cache Parameters
Bigger block size
–Positives:
 Exploits spatial locality; reduces compulsory misses
 Reduces tag overhead (bits)
 Reduces transfer overhead (address, burst data mode)
–Negatives:
 Fewer blocks for a given size; increases conflict misses
 Increases miss transfer time (multi-cycle transfers)
 Wasted bandwidth for non-spatial data

Effects of Varying Cache Parameters
Increasing associativity
–Positives:
 Reduces conflict misses
 Low-associativity caches can have pathological behavior (very high miss rates)
–Negatives:
 Increased hit time
 More hardware requirements (comparators, muxes, bigger tags)
 Diminishing returns past 4- or 8-way

Effects of Varying Cache Parameters
Replacement strategy (for associative caches):
–LRU: intuitive; difficult to implement at high associativity; worst-case performance can occur (e.g., an (N+1)-element array cycled through an N-way set)
–Random: pseudo-random is easy to implement; performance close to LRU at high associativity
–Optimal: replace the block whose next reference is farthest in the future; hard to implement
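
To see why true LRU gets expensive at high associativity, here is a minimal age-counter sketch in C for one set (a common textbook scheme, not the only hardware realization; initialize each set's ages to the permutation 0..WAYS-1 so they stay distinct):

    #include <stdint.h>

    #define WAYS 4

    /* One set of a WAYS-way cache with true LRU tracked by age counters:
       age 0 = most recently used, WAYS-1 = least recently used. */
    struct cache_set {
        uint32_t tag[WAYS];
        int valid[WAYS];
        int age[WAYS];
    };

    /* On a hit (or fill), promote `way` to MRU and age everything newer. */
    static void lru_touch(struct cache_set *s, int way) {
        for (int w = 0; w < WAYS; w++)
            if (s->age[w] < s->age[way])
                s->age[w]++;
        s->age[way] = 0;
    }

    /* Pick a victim: any invalid way first, otherwise the oldest. */
    static int lru_victim(const struct cache_set *s) {
        int oldest = 0;
        for (int w = 0; w < WAYS; w++) {
            if (!s->valid[w])
                return w;
            if (s->age[w] > s->age[oldest])
                oldest = w;
        }
        return oldest;
    }

Every access touches a counter in every way of the set, which is why real high-associativity designs fall back on pseudo-LRU or random.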

Other Cache Design Decisions
Write Policy: how to deal with write misses?
–Write-through / no-allocate
 Total traffic? read misses × block size + writes
 Common for L1 caches backed by L2 (esp. on-chip)
–Write-back / write-allocate
 Needs a dirty bit to determine whether the cached data differs from memory
 Total traffic? (read misses + write misses) × block size + dirty-block evictions × block size
 Common for L2 caches (memory bandwidth limited)
–Variation: write-validate
 Write-allocate without fetch-on-write
 Needs a sub-blocked cache with valid bits for each word/byte
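
A quick worked comparison with made-up numbers: assume 32-byte blocks, 100 read misses, 200 one-word (4-byte) writes of which 50 miss, and 30 dirty-block evictions. Write-through/no-allocate moves about 100 × 32 + 200 × 4 = 4,000 bytes; write-back/write-allocate moves (100 + 50) × 32 + 30 × 32 = 5,760 bytes. Which policy wins depends entirely on how often writes hit and how many dirty lines get evicted.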

Other Cache Design Decisions
Write Buffering
–Delay writes until bandwidth is available
 Put them in a FIFO buffer
 Only stall on a write if the buffer is full
 Use bandwidth for reads first (since they have latency problems)
–Important for write-through caches, since write traffic is frequent
Write-back buffer
–Holds evicted (dirty) lines for write-back caches
–Also allows reads to have priority on the L2 or memory bus
–Usually only needs a small buffer
Ref: Eager Writeback Caches

Adding a Victim Cache
A small (4-line) fully associative victim cache sits next to the direct-mapped L1; both hold V/d bits, tags, and data.
–A small victim cache adds associativity to “hot” lines
–Blocks evicted from the direct-mapped cache go to the victim cache
–Tag compares are made to both the direct-mapped cache and the victim cache
–Victim hits cause lines to swap between L1 and the victim cache
–Not very useful for associative L1 caches
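
A minimal tags-only sketch of that lookup/swap flow in C (sizes and names invented for illustration; no data payloads):

    #include <stdint.h>

    #define L1_LINES    64  /* toy direct-mapped L1 */
    #define VICTIM_WAYS 4   /* small fully associative victim cache */

    /* Fully associative structures compare full block addresses,
       so we store the whole block address as the "tag". */
    typedef struct { int valid; uint32_t addr; } line_t;

    static line_t l1[L1_LINES];
    static line_t vc[VICTIM_WAYS];
    static int vc_next;  /* FIFO replacement pointer for the victim cache */

    /* Returns 1 on a hit in L1 or the victim cache. */
    int lookup(uint32_t blk) {
        uint32_t idx = blk % L1_LINES;

        if (l1[idx].valid && l1[idx].addr == blk)
            return 1;                              /* primary (L1) hit */

        for (int w = 0; w < VICTIM_WAYS; w++) {
            if (vc[w].valid && vc[w].addr == blk) {
                line_t tmp = l1[idx];              /* victim hit: swap lines */
                l1[idx] = vc[w];
                vc[w]   = tmp;
                return 1;
            }
        }

        /* Miss everywhere: the displaced L1 line moves to the victim cache. */
        if (l1[idx].valid) {
            vc[vc_next] = l1[idx];
            vc_next = (vc_next + 1) % VICTIM_WAYS;
        }
        l1[idx].valid = 1;
        l1[idx].addr  = blk;
        return 0;
    }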

Hash-Rehash Cache
(The original slides animate a direct-mapped cache probed twice: on a primary miss, a second, rehashed index is checked; a rehash miss allocates the line, while a rehash hit swaps the block toward its primary slot.)

Hash-Rehash Cache
Calculating performance:
–Primary hit time (normal direct-mapped)
–Rehash hit time (sequential tag lookups)
–Block swap time?
–Hit rate comparable to 2-way set associative
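
In the spirit of the AMAT formula above, with the slide's costs left symbolic: Average access time ≈ (primary-hit fraction × primary hit time) + (rehash-hit fraction × (rehash hit time + swap time)) + (miss fraction × miss penalty).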

Compiler support for caching
–Array merging (array of structs vs. 2 arrays)
–Loop interchange (row vs. column access)
–Structure padding and alignment (malloc)
–Cache-conscious data placement
 Pack the working set into the same line
 Map to non-conflicting addresses if packing is impossible
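
A sketch of two of these transformations in C (N and the names are arbitrary):

    #define N 512

    double a[N][N];

    /* Column-order traversal: stride of N doubles per access, so each
       fetched cache line contributes only one useful element. */
    double sum_columns(void) {
        double s = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += a[i][j];
        return s;
    }

    /* Loop interchange: unit stride, every word of a fetched line is used. */
    double sum_rows(void) {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += a[i][j];
        return s;
    }

    /* Array merging: if key[i] and val[i] are always used together, one
       array of structs keeps each pair in the same cache line. */
    struct pair { int key; int val; };
    struct pair merged[N];  /* instead of: int key[N]; int val[N]; */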

Prefetching
Already done: bring in an entire line, assuming spatial locality. Extend this…
–Next-line prefetch: bring in the next block in memory as well as the missed line (very good for the I-cache)
–Software prefetch: loads to R0 have no data dependency
–Aggressive/speculative prefetch is useful for L2
–Speculative prefetch is problematic for L1
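
A software-prefetch sketch in C using the GCC/Clang __builtin_prefetch intrinsic, the compiler-level analogue of the slide's "load to R0" trick (the 16-element lookahead distance is illustrative, not tuned):

    /* Prefetch hints never fault and never stall the pipeline on their
       own; they just warm the cache ahead of the real loads. */
    void scale(const float *x, float *y, float k, int n) {
        for (int i = 0; i < n; i++) {
            if (i + 16 < n)
                __builtin_prefetch(&x[i + 16]);  /* hint only */
            y[i] = k * x[i];
        }
    }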

Calculating the Effects of Latency Does a cache miss reduce performance? –It depends on whether there are critical instructions waiting for the result

Calculating the Effects of Latency
–It depends on whether critical resources are held up
–Blocking: when a miss occurs, all later references to the cache must wait. This is a resource conflict.
–Non-blocking: allows later references to access the cache while a miss is being processed. Generally there is some limit on how many misses can be outstanding at once.
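
The textbook model makes this concrete: memory-stall cycles per instruction = memory accesses per instruction × miss rate × miss penalty. With illustrative numbers — 1.3 accesses per instruction, a 3% miss rate, and a 100-cycle penalty — a blocking cache adds 1.3 × 0.03 × 100 ≈ 3.9 cycles to the CPI; a non-blocking cache can hide part of that by overlapping misses with later, independent references.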

P4 Overview (Todd’s slides)
Latest iA32 processor from Intel
–Equipped with the full set of iA32 SIMD operations
–First flagship architecture since the P6 microarchitecture
–Pentium 4 ISA = Pentium III ISA + SSE2
–SSE2 (Streaming SIMD Extensions 2) provides 128-bit SIMD integer and floating-point operations + prefetch

Comparison Between Pentium III and Pentium 4

Trace Cache
Primary instruction cache in the P4 architecture
–Stores 12K decoded μops
–On a miss, instructions are fetched from L2
–A trace predictor connects traces
Trace cache removes:
–Decode latency after mispredictions
–Decode power for all pre-decoded instructions

Execution Pipeline

Store and Load Scheduling
–Out-of-order store and load operations
–Stores are always in program order
–48 loads and 24 stores can be in flight
–Store/load buffers are allocated at the allocation stage (24 store buffers and 48 load buffers in total)

Data Stream of Pentium 4 Processor

On-chip Caches
–L1 instruction cache (trace cache)
–L1 data cache
–L2 unified cache
All caches use a pseudo-LRU replacement algorithm.

L1 Data Cache
–Non-blocking: supports up to 4 outstanding load misses
–Load latency: 2 clocks for integer, 6 clocks for floating-point
–1 load and 1 store per clock
–Load speculation: assume the access will hit the cache, and “replay” the dependent instructions when a miss is detected

L2 Cache
–Non-blocking
–Load latency: net load access latency of 7 cycles
–Bandwidth:
 1 load and 1 store in one cycle
 New cache operations may begin every 2 cycles
 256-bit-wide bus between L1 and L2
 48 GB/s at 1.5 GHz
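
That bandwidth figure is easy to sanity-check: a 256-bit bus carries 32 bytes per transfer, and 32 bytes × 1.5 GHz = 48 GB/s (the sustainable rate is halved if a new operation can start only every other cycle).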

L2 Cache Data Prefetcher
–A hardware prefetcher monitors reference patterns and brings in cache lines automatically
–Attempts to fetch 256 bytes ahead of the current access
–Prefetches for up to 8 simultaneous independent streams