
Memory Hierarchy Design Chapter 5 Karin Strauss

Background
- 1980: no caches
- 1995: two levels of caches
- 2004: even three levels of caches
- Why? The processor-memory gap

Processor-Memory Gap
[Figure: processor performance grows roughly 60% per year while DRAM performance grows roughly 7% per year, so the gap widens steadily from the early 1980s onward. Source: lecture handouts, Prof. John Kubiatowicz, CS252, U.C. Berkeley]

Because…
- Memory speed is a limiting factor in performance
- Caches are small and fast
- Caches leverage the principle of locality
  - Temporal locality: data that has been referenced recently tends to be re-referenced soon
  - Spatial locality: data close (in the address space) to recently referenced data tends to be referenced soon

Review
- Cache block: the minimum unit of information that can be present in the cache (several contiguous memory positions)
- Cache hit: the requested data can be found in the cache
- Cache miss: the requested data cannot be found in the cache
- The four design questions:
  - Where can a block be placed?
  - How can a block be found?
  - Which block should be replaced?
  - What happens on a write?

Where can a block be placed?
Suppose we need to place block 10 in an 8-block cache:
- Direct mapped (1-way): set 10 mod 8 = 2
- 2-way set associative: set 10 mod 4 = 2
- 4-way set associative: set 10 mod 2 = 0
- Fully associative (8-way, in this case): anywhere
Placement: set = (block address) mod (# sets), where (# sets) = (# blocks in cache) / (# ways)
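To make the mapping concrete, here is a minimal C sketch (the function name and the main() driver are illustrative, not from the slides) that computes the set index for the block-10 example under each associativity:

```c
#include <stdio.h>

/* num_sets = (# blocks in cache) / (# ways); the block maps to set (block mod num_sets) */
static unsigned set_index(unsigned block_address, unsigned num_blocks, unsigned ways) {
    unsigned num_sets = num_blocks / ways;
    return block_address % num_sets;   /* fully associative: num_sets == 1, always set 0 */
}

int main(void) {
    unsigned block = 10, num_blocks = 8;
    printf("direct mapped (1-way): set %u\n", set_index(block, num_blocks, 1)); /* 10 mod 8 = 2 */
    printf("2-way set associative: set %u\n", set_index(block, num_blocks, 2)); /* 10 mod 4 = 2 */
    printf("4-way set associative: set %u\n", set_index(block, num_blocks, 4)); /* 10 mod 2 = 0 */
    printf("fully associative (8-way): set %u\n", set_index(block, num_blocks, 8)); /* always 0 */
    return 0;
}
```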

How can a block be found? Look at the address!
The block address is split into fields: [ Tag | Index | Block Offset ]
- Tag: the block's unique id (its "primary key")
- Index: determines the set (no index in fully associative caches)
- Block Offset: determines the offset within the block
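As an illustration, the field extraction can be written directly in C; the block size (32 bytes) and number of sets (128) below are assumptions chosen only to make the bit widths concrete:

```c
#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 5u   /* 32-byte blocks (assumed) */
#define INDEX_BITS  7u   /* 128 sets (assumed) */

int main(void) {
    uint32_t addr   = 0x12345678u;
    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);                  /* offset within the block */
    uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);  /* selects the set */
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);                /* block's unique id */
    printf("tag=0x%x index=%u offset=%u\n", (unsigned)tag, (unsigned)index, (unsigned)offset);
    return 0;
}
```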

Which block should be replaced?
- Random
- Least Recently Used (LRU)
  - True LRU may be too costly to implement in hardware (requires a stack)
  - Simplified LRU
- First In, First Out (FIFO)
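For intuition, here is a small sketch of true LRU for a single 4-way set using age counters; this is one possible software model (real hardware more often uses cheaper pseudo-LRU schemes), and all names are illustrative:

```c
#include <stdio.h>

#define WAYS 4
static unsigned age[WAYS];           /* higher age = less recently used */

static void touch(unsigned way) {    /* called on every hit or fill of `way` */
    for (unsigned w = 0; w < WAYS; w++)
        if (age[w] < age[way]) age[w]++;   /* ways younger than `way` age by one */
    age[way] = 0;                          /* `way` becomes most recently used */
}

static unsigned victim(void) {       /* way to replace on a miss: the oldest one */
    unsigned v = 0;
    for (unsigned w = 1; w < WAYS; w++)
        if (age[w] > age[v]) v = w;
    return v;
}

int main(void) {
    for (unsigned w = 0; w < WAYS; w++) age[w] = w;  /* arbitrary initial order */
    touch(2); touch(0);
    printf("LRU victim: way %u\n", victim());        /* way 3 was touched least recently */
    return 0;
}
```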

What happens on a write?
- Write through: every time a block is written, the new value is propagated to the next memory level
  - Easier to implement
  - Makes displacement (eviction) simple and fast: reads never have to wait for a displacement to finish
  - Writes may have to wait → use a write buffer
- Write back: the new value is propagated to the next memory level only when the block is displaced
  - Makes writes fast
  - Uses less memory bandwidth; a dirty bit may save additional bandwidth (no need to write back clean blocks)
  - Saves power
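A rough sketch of the two write-hit policies, with an illustrative cache-line structure that is not taken from the slides:

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct { uint32_t tag; bool valid; bool dirty; uint8_t data[32]; } line_t;

/* Write through: update the line and immediately propagate the value to the next
 * memory level (in practice via a write buffer so the CPU does not stall). */
void write_through_hit(line_t *l, unsigned offset, uint8_t value) {
    l->data[offset] = value;
    /* write_buffer_enqueue(...)  -- the next level is updated right away */
}

/* Write back: update only the line and mark it dirty; the next level is updated
 * later, when this line is evicted (and only if it is dirty). */
void write_back_hit(line_t *l, unsigned offset, uint8_t value) {
    l->data[offset] = value;
    l->dirty = true;
}
```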

What happens on a write? (cont.)
- Write allocate (fetch on write): the entire block is brought into the cache
- No write allocate (write around): the written word is sent to the next memory level
- Write policy and write-miss policy are independent, but usually:
  - Write back → write allocate
  - Write through → no write allocate

Cache Performance
AMAT = (hit time) + (miss rate) * (miss penalty)

              Unified cache   Split I-cache   Split D-cache
Size          32KB            16KB            16KB
Miss rate     1.99%           0.64%           6.47%
Hit time      I:1 / D:2       1               1
Miss penalty  50              50              50

75% of accesses are instruction references. Which system is faster?

Solution:
AMAT(split) = 0.75*(1 + 0.64%*50) + 0.25*(1 + 6.47%*50) = 2.05
AMAT(unified) = 0.75*(1 + 1.99%*50) + 0.25*(2 + 1.99%*50) = 2.24
Miss rate(split) = 0.75*0.64% + 0.25*6.47% = 2.10%
Miss rate(unified) = 1.99%
Although the split configuration has a higher miss rate, it is faster on average!
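The same numbers can be reproduced by plugging the slide's values into the AMAT formula; the snippet below does exactly that:

```c
#include <stdio.h>

int main(void) {
    const double f_inst = 0.75, f_data = 0.25;   /* fraction of accesses */
    const double penalty = 50.0;                 /* miss penalty in cycles */
    /* split: 16KB I-cache (0.64% misses), 16KB D-cache (6.47% misses), hit time 1 */
    double amat_split = f_inst * (1 + 0.0064 * penalty)
                      + f_data * (1 + 0.0647 * penalty);
    /* unified 32KB cache: 1.99% misses; data accesses pay an extra hit cycle (2) */
    double amat_unified = f_inst * (1 + 0.0199 * penalty)
                        + f_data * (2 + 0.0199 * penalty);
    printf("AMAT split   = %.3f cycles\n", amat_split);    /* ~2.05 */
    printf("AMAT unified = %.3f cycles\n", amat_unified);   /* ~2.24 */
    return 0;
}
```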

Processor Performance
CPU time = (proc cycles + mem stall cycles) * (clock cycle time)
- proc cycles = IC * CPI
- mem stall cycles = (mem accesses) * (miss rate) * (miss penalty)

CPI (processor): 2.0   Miss penalty: 50 cycles   Miss rate: 2%   Mem refs/inst: 1.33

What is the total CPU time, including the caches, as a function of IC and clock cycle time?
CPU time = (IC*2.0 + IC*1.33*0.02*50) * (clock cycle time) = IC * 3.33 * (clock cycle time)
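The same arithmetic, spelled out as a tiny program using the slide's values:

```c
#include <stdio.h>

int main(void) {
    const double cpi_proc = 2.0, refs_per_inst = 1.33;
    const double miss_rate = 0.02, miss_penalty = 50.0;
    /* total cycles per instruction = processor CPI + memory stall cycles per instruction */
    double stall_cpi = refs_per_inst * miss_rate * miss_penalty;   /* 1.33 */
    double total_cpi = cpi_proc + stall_cpi;                       /* 3.33 */
    printf("CPU time = IC * %.2f * (clock cycle time)\n", total_cpi);
    return 0;
}
```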

Processor Performance (cont.)
- AMAT has a large impact on performance
- If CPI decreases, memory stall cycles represent a larger fraction of total cycles
- If the clock cycle time decreases, memory stall cycles represent more cycles
- Note: in out-of-order execution processors, part of the memory access latency is overlapped with computation

Improving Cache Performance
AMAT = (hit time) + (miss rate) * (miss penalty)
Reducing hit time:
- Small and simple caches
- No address translation
- Pipelined cache access
- Trace caches

Improving Cache Performance
AMAT = (hit time) + (miss rate) * (miss penalty)
Reducing miss rate:
- Larger block size
- Larger cache size
- Higher associativity
- Way prediction or pseudo-associative caches
- Compiler optimizations (code/data layout), as in the sketch below
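As an example of the last point, a classic code-layout optimization is loop interchange: walking a row-major array in row order turns a strided access pattern into a unit-stride one and improves spatial locality. The sketch below is illustrative (the array size and function names are made up):

```c
#define N 1024

/* Column-major walk of a row-major matrix: stride of N doubles, poor spatial locality. */
void scale_strided(double m[N][N], double k) {
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            m[i][j] *= k;
}

/* Loops interchanged: unit-stride walk, each fetched cache block is fully used. */
void scale_unit_stride(double m[N][N], double k) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            m[i][j] *= k;
}
```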

Improving Cache Performance
AMAT = (hit time) + (miss rate) * (miss penalty)
Reducing miss penalty:
- Multilevel caches
- Critical word first
- Read miss before write miss
- Merging write buffers
- Victim caches

Improving Cache Performance
AMAT = (hit time) + (miss rate) * (miss penalty)
Reducing miss rate and miss penalty by increasing parallelism:
- Non-blocking caches
- Prefetching
  - Hardware
  - Software (see the sketch below)
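For the software case, one common mechanism is a compiler prefetch hint; the sketch below uses GCC/Clang's __builtin_prefetch with an assumed prefetch distance of 16 elements (a tuning parameter, not from the slides):

```c
#include <stddef.h>

/* Sums an array while prefetching elements 16 iterations ahead, so the data
 * is (ideally) already in the cache by the time the loop reaches it. */
double sum_with_prefetch(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], /*rw=*/0, /*locality=*/1);  /* read hint */
        s += a[i];
    }
    return s;
}
```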