Lecture 14: Caching, cont. EEN 312: Processors: Hardware, Software, and Interfacing Department of Electrical and Computer Engineering Spring 2014, Dr. Rozier (UM)


EXAM

Exam is one week from Thursday: Thursday, April 3rd Covers: – Chapter 4 – Chapter 5 – Buffer Overflows – Exam 1 material fair game

LAB QUESTIONS?

CACHING

Caching Review What is temporal locality? What is spatial locality?

Caching Review What is a cache set? What is a cache line? What is a cache block?

Caching Review What is a direct mapped cache? What is a set associative cache?
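To make the direct-mapped case concrete, here is a small sketch (not from the slides; the 64-set, 16-byte-block geometry is an assumption for illustration) of how an address splits into tag, index, and offset:

```python
# Split an address for a direct-mapped cache with 64 sets and
# 16-byte blocks (assumed sizes, chosen only for illustration).
def split_address(addr, num_sets=64, block_size=16):
    offset = addr % block_size                 # byte within the block
    index = (addr // block_size) % num_sets    # which set/line
    tag = addr // (block_size * num_sets)      # remaining high bits
    return tag, index, offset

tag, index, offset = split_address(0x12345678)
print(tag, index, offset)
```

A set-associative cache uses the same split, but the index selects a set of several lines rather than a single line.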

Caching Review What is a cache miss? What happens on a miss?

ASSOCIATIVITY AND MISS RATES

How Much Associativity Increased associativity decreases miss rate – But with diminishing returns Simulation of a system with 64KB D-cache, 16-word blocks, SPEC2000 – 1-way: 10.3% – 2-way: 8.6% – 4-way: 8.3% – 8-way: 8.1%
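The way associativity removes conflict misses can be shown with a toy simulator (a sketch with made-up parameters, not the 64KB SPEC2000 configuration from the slide):

```python
# Toy set-associative cache with LRU replacement; addresses are
# block numbers. All names and traces here are illustrative.
def count_misses(trace, num_blocks=4, ways=1):
    num_sets = num_blocks // ways
    sets = [[] for _ in range(num_sets)]   # each set holds tags in LRU order
    misses = 0
    for block in trace:
        s = sets[block % num_sets]
        tag = block // num_sets
        if tag in s:
            s.remove(tag)                  # hit: move to MRU position below
        else:
            misses += 1
            if len(s) == ways:
                s.pop(0)                   # evict least recently used tag
        s.append(tag)
    return misses

# Blocks 0 and 4 collide in a 4-set direct-mapped cache,
# but coexist in a 2-way cache with 2 sets.
trace = [0, 4, 0, 4, 0, 4]
print(count_misses(trace, ways=1))   # direct mapped: every access misses
print(count_misses(trace, ways=2))   # 2-way: only the first two accesses miss
```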

Set Associative Cache Organization

Multilevel Caches Primary cache attached to CPU – Small, but fast Level-2 cache services misses from primary cache – Larger, slower, but still faster than main memory Main memory services L-2 cache misses Some high-end systems include L-3 cache

Multilevel Cache Example Given – CPU base CPI = 1, clock rate = 4GHz – Miss rate/instruction = 2% – Main memory access time = 100ns With just primary cache – Miss penalty = 100ns/0.25ns = 400 cycles – Effective CPI = 1 + 0.02 × 400 = 9

Example (cont.) Now add L-2 cache – Access time = 5ns – Global miss rate to main memory = 0.5% Primary miss with L-2 hit – Penalty = 5ns/0.25ns = 20 cycles Primary miss with L-2 miss – Extra penalty = 400 cycles CPI = 1 + 0.02 × 20 + 0.005 × 400 = 3.4 Performance ratio = 9/3.4 = 2.6
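The example's arithmetic, checked in a few lines (same numbers as the slides):

```python
# Multilevel cache example: 4 GHz clock (0.25 ns cycle), base CPI 1.
cycle_time_ns = 0.25
miss_penalty = 100 / cycle_time_ns        # main memory: 400 cycles
cpi_l1_only = 1 + 0.02 * miss_penalty     # 2% misses pay the full penalty

l2_hit_penalty = 5 / cycle_time_ns        # L-2 hit: 20 cycles
# 2% of instructions miss L-1 (pay 20 cycles); 0.5% also miss L-2
# and pay the extra 400 cycles to main memory.
cpi_with_l2 = 1 + 0.02 * l2_hit_penalty + 0.005 * miss_penalty

print(cpi_l1_only, cpi_with_l2, cpi_l1_only / cpi_with_l2)
```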

Multilevel Cache Considerations Primary cache – Focus on minimal hit time L-2 cache – Focus on low miss rate to avoid main memory access – Hit time has less overall impact Results – L-1 cache usually smaller than a single cache – L-1 block size smaller than L-2 block size

Interactions with Advanced CPUs Out-of-order CPUs can execute instructions during cache miss – Pending store stays in load/store unit – Dependent instructions wait in reservation stations Independent instructions continue Effect of miss depends on program data flow – Much harder to analyse – Use system simulation

Interactions with Software Misses depend on memory access patterns – Algorithm behavior – Compiler optimization for memory access

Replacement Policy Direct mapped: no choice Set associative – Prefer non-valid entry, if there is one – Otherwise, choose among entries in the set Least-recently used (LRU) – Choose the one unused for the longest time Simple for 2-way, manageable for 4-way, too hard beyond that Random – Gives approximately the same performance as LRU for high associativity

Replacement Policy Optimal algorithm: – The clairvoyant algorithm – Not a real possibility – Why is it useful to consider?

Replacement Policy Optimal algorithm: – The clairvoyant algorithm – Not a real possibility – Why is it useful to consider? – We can never beat the Oracle – Good baseline to compare to Get as close as possible
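As a baseline, the clairvoyant (Belady/MIN) policy can be sketched for a single fully associative set: evict the block whose next use is farthest in the future. This is a software sketch with illustrative names, not a hardware design.

```python
# Belady's optimal (clairvoyant) replacement for one cache set.
# Requires the full future trace, which is why it is not realizable.
def belady_misses(trace, capacity):
    cache, misses = set(), 0
    for i, block in enumerate(trace):
        if block in cache:
            continue
        misses += 1
        if len(cache) == capacity:
            future = trace[i + 1:]
            # Victim: the cached block reused latest (or never again).
            victim = max(
                cache,
                key=lambda b: future.index(b) if b in future else len(future),
            )
            cache.remove(victim)
        cache.add(block)
    return misses

print(belady_misses([1, 2, 3, 1, 2, 4, 1, 2], capacity=3))
```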

LRU Least Recently Used – Discard the data which was used least recently. – Need to maintain information on the age of each cache line. How do we do so efficiently? Break into groups Propose an algorithm
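One possible answer to the group exercise, as a software sketch: keep blocks in recency order so the LRU victim is always at the front. Hardware would track age bits (or a pseudo-LRU tree, below) instead; the function and trace here are illustrative, not from the slides.

```python
from collections import OrderedDict

# LRU replacement for one fully associative set, using insertion
# order as the recency order (front = least recently used).
def lru_misses(trace, capacity):
    cache, misses = OrderedDict(), 0
    for block in trace:
        if block in cache:
            cache.move_to_end(block)       # touched: now most recent
        else:
            misses += 1
            if len(cache) == capacity:
                cache.popitem(last=False)  # evict least recently used
            cache[block] = True
    return misses

print(lru_misses([1, 2, 3, 1, 2, 4, 1, 2], capacity=3))
```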

Using a BDD

LRU is Expensive Age bits take a lot of space to store and update What about pseudo-LRU?

Using a BDD BDDs have decisions at their leaf nodes BDDs have variables at their non-leaf nodes Each path from a non-leaf to a leaf node is an evaluation of the variable at the node Standard convention: – Left node on 0 – Right node on 1

BDD for LRU: 4-way associative cache [Diagram: decision tree over bits b0 and b1; an invalid line, if any, is chosen first; when all lines are valid, the bit values (e.g., b0 is 0, b1 is 0) select a leaf among line0, line1, line2, line3]

BDD for LRU 4-way associative cache Pseudo-LRU – Uses only 3 bits
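The 3-bit tree pseudo-LRU for a 4-way set can be sketched as follows. The bit names b0, b1, b2 follow the slide's BDD (b0 picks the pair lines 0-1 vs. 2-3; b1 and b2 pick within each pair); the update convention, flipping the bits on the access path to point away from the used line, is the standard tree-PLRU scheme and is an assumption here, not taken from the slides.

```python
# Tree pseudo-LRU for a 4-way set: 3 bits instead of full age tracking.
class TreePLRU4:
    def __init__(self):
        self.b0 = self.b1 = self.b2 = 0

    def victim(self):
        """Follow the bits down the tree to the pseudo-LRU line."""
        if self.b0 == 0:
            return 0 if self.b1 == 0 else 1
        return 2 if self.b2 == 0 else 3

    def touch(self, line):
        """Record an access: make the tree point away from `line`."""
        if line < 2:
            self.b0 = 1          # next victim search goes to the right pair
            self.b1 = 1 - line   # within the left pair, point at the other line
        else:
            self.b0 = 0          # next victim search goes to the left pair
            self.b2 = 3 - line   # within the right pair, point at the other line

plru = TreePLRU4()
for line in (0, 1, 2, 3):
    plru.touch(line)
print(plru.victim())   # after touching 0, 1, 2, 3 in order, line 0 is the victim
```

In this case pseudo-LRU agrees with true LRU; in general it only approximates it, which is the trade-off for storing 3 bits instead of per-line ages.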

WRAP UP

For next time Read Chapter 5, section 5.4