DECStation 3100 miss rates:

Program   Block Size   Instruction Miss Rate   Data Miss Rate   Effective Miss Rate
gcc       1            6.1%                    2.1%             5.4%
gcc       4            2.0%                    1.7%             1.9%
spice     1            1.2%                    1.3%             1.2%
spice     4            0.3%                    0.6%             0.4%

Write misses are included for the 4-word block, but not for the 1-word block. Remember: the Miss Penalty goes UP!

Average Memory Access Time = Hit Time + Miss Rate * Miss Penalty

[Figure: for a constant-size cache, Miss Penalty grows with Block Size (access time plus transfer time), while Miss Rate first falls and then rises with Block Size, since larger blocks mean fewer blocks in the cache.]
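The formula above is easy to sketch in code. A minimal illustration (the numbers below are hypothetical, not from the slides):

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average Memory Access Time = Hit Time + Miss Rate * Miss Penalty."""
    return hit_time + miss_rate * miss_penalty

# Hypothetical example: 1-cycle hit, 2% miss rate, 45-cycle miss penalty
print(amat(1, 0.02, 45))  # 1.9 cycles on average
```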

Reducing the Miss Penalty

Reduce the time to read the multiple words from main memory into the cache block. Don't wait for the complete block to be transferred: "Early Restart". Access and transfer each word sequentially; as soon as the requested word is in the cache, restart the processor to access the cache, and finish the block transfer while the cache is available.

Variation: "Requested Word First" (fetch the requested word before the rest of the block).

Disadvantage: complex control, and the processor will likely access the cache block again before the transfer is complete.

Reducing the Miss Penalty

Reduce the time to read the multiple words from main memory into the cache block. Assume these memory access times:
1 clock cycle to send the address
10 clock cycles to access the DRAM
1 clock cycle to send a word of data

For sequential transfer of 4 data words: Miss Penalty = 1 + 4 * (10 + 1) = 45 clock cycles

What if we could read a block of words simultaneously from main memory?

[Figure: a cache entry (Valid, Tag, Word3 .. Word0) filled in a single block-wide transfer from main memory.]

Miss Penalty = 1 + 10 + 1 = 12 clock cycles
Miss Penalty for Sequential = 45 clock cycles

What about 4 banks of memory? "Interleaved Memory"

[Figure: the cache connected to Bank 0 .. Bank 3; one address is broadcast, the banks are accessed in parallel, and the words are transferred serially.]

Miss Penalty = 1 + 10 + 4 * 1 = 15 clock cycles
Miss Penalty for Parallel = 12 clock cycles
Miss Penalty for Sequential = 45 clock cycles
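Under the stated timing (1 cycle to send the address, 10 cycles to access the DRAM, 1 cycle to transfer a word), the three organizations can be compared with a quick sketch of the arithmetic:

```python
ADDR, DRAM, XFER = 1, 10, 1   # cycles: send address, DRAM access, transfer one word
WORDS = 4                     # words per cache block

sequential  = ADDR + WORDS * (DRAM + XFER)  # access and transfer each word in turn
wide        = ADDR + DRAM + XFER            # read and send the whole block at once
interleaved = ADDR + DRAM + WORDS * XFER    # banks accessed in parallel, serial transfer

print(sequential, wide, interleaved)  # 45 12 15
```

Interleaving gets most of the benefit of a block-wide memory without the cost of a block-wide bus.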

Average Memory Access Time = Hit Time + Miss Rate * Miss Penalty

[Figure: average access time versus block size, under increasing cache size, increasing block size, and different main memory organizations.]

CPU Performance with Cache Memory

For a program: CPU time = CPU execution time + CPU hold time
CPU hold time = Memory Stall Clock Cycles * Clock Cycle Time
Memory Stall Clock Cycles = Read Stall Cycles + Write Stall Cycles
Read Stall Cycles = (Reads / Program) * Read Miss Rate * Read Miss Penalty

Assuming no penalty for a hit.

CPU Performance with Cache Memory

Write Stall Cycles = (Writes / Program) * Write Miss Rate * Write Miss Penalty + Write Buffer Stalls

Write Buffer Stalls should be << Write Miss Stalls, so approximately:

Write Stall Cycles = (Writes / Program) * Write Miss Rate * Write Miss Penalty

CPU Performance with Cache Memory

Memory Stall Clock Cycles = Read Stall Cycles + Write Stall Cycles
= (Reads / Program) * Read Miss Rate * Read Miss Penalty + (Writes / Program) * Write Miss Rate * Write Miss Penalty

The miss penalties are approximately the same (fetch the block), so combining the reads and writes into a weighted miss rate:

Memory Stall Cycles = (Memory Accesses / Program) * Miss Rate * Miss Penalty

CPU Performance with Cache Memory

For a program: CPU time = CPU execution time + CPU hold time
CPU hold time = Memory Stall Clock Cycles * Clock Cycle Time
CPU time = CPU execution time + (Memory Accesses / Program) * Miss Rate * Miss Penalty * Clock Cycle Time

Dividing both sides by (Instructions / Program) and Clock Cycle Time:

Effective CPI = Execution CPI + (Memory Accesses / Instruction) * Miss Rate * Miss Penalty

Assuming no penalty for a hit.
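The derivation above reduces to a one-line function. A sketch (the parameter names are mine):

```python
def effective_cpi(execution_cpi, accesses_per_instr, miss_rate, miss_penalty):
    """Effective CPI = Execution CPI
    + (Memory Accesses / Instruction) * Miss Rate * Miss Penalty."""
    return execution_cpi + accesses_per_instr * miss_rate * miss_penalty

# Hypothetical example: base CPI 1.2, one access/instruction,
# 0.3% miss rate, 65-cycle miss penalty
print(effective_cpi(1.2, 1.0, 0.003, 65))
```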

CPU Performance with Cache Memory

Effective CPI = Execution CPI + (Memory Accesses / Instruction) * Miss Rate * Miss Penalty

Consider the DECStation 3100 with 4-word blocks running spice:
CPI = 1.2 without misses
Instruction Miss Rate = 0.3%
Data Miss Rate = 0.6%; for spice, frequency of loads and stores = 9%
1.) Sequential memory: Miss Penalty = 65 clock cycles
2.) 4-bank interleaved: Miss Penalty = 20 clock cycles

Eff CPI = 1.2 + (1 * 0.003 + 0.09 * 0.006) * Miss Penalty = 1.2 + 0.00354 * Miss Penalty
1.) Eff CPI = 1.2 + 0.00354 * 65 = 1.2 + 0.23 = 1.43
2.) Eff CPI = 1.2 + 0.00354 * 20 = 1.2 + 0.071 = 1.271
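The spice numbers check out. A quick verification sketch:

```python
base_cpi = 1.2
i_miss, d_miss = 0.003, 0.006   # instruction and data miss rates
mem_refs = 0.09                 # loads + stores per instruction

# Extra cycles per instruction, per cycle of miss penalty:
# every instruction is one instruction fetch, plus 9% data accesses
stalls_per_penalty = 1 * i_miss + mem_refs * d_miss   # 0.00354

for name, penalty in [("sequential", 65), ("interleaved", 20)]:
    print(name, round(base_cpi + stalls_per_penalty * penalty, 3))
# sequential 1.43
# interleaved 1.271
```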

CPU Performance with Cache Memory

Consider the DECStation 3100 with 4-word blocks running spice:
CPI = 1.2 without misses; Instruction Miss Rate = 0.3%; Data Miss Rate = 0.6%; frequency of loads and stores = 9%
4-bank interleaved: Miss Penalty = 20 clock cycles, so Eff CPI = 1.271

What if we get a new processor and cache that run at twice the clock frequency, but keep the same main memory speed?
Miss Penalty = 40 clock cycles (of the faster clock)
Eff CPI = 1.2 + 0.00354 * 40 = 1.2 + 0.142 = 1.342

Performance (fast clock) / Performance (slow clock)
= (1.271 * Clock Cycle Time) / (1.342 * Clock Cycle Time / 2) = 1.89
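The speedup ratio above can be checked directly. A sketch of the arithmetic:

```python
# Doubling the clock frequency halves the cycle time but doubles the
# miss penalty measured in (faster) clock cycles: 20 -> 40 cycles.
cpi_slow, cpi_fast = 1.271, 1.342    # effective CPIs from the example

relative_time_slow = cpi_slow * 1.0  # CPI * relative clock cycle time
relative_time_fast = cpi_fast * 0.5  # half the cycle time

speedup = relative_time_slow / relative_time_fast
print(round(speedup, 2))  # 1.89
```

Doubling the clock yields only a 1.89x speedup because the fixed-speed memory eats part of the gain.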

[Figure: direct-mapped cache with 4-word blocks and 4K entries. The address divides into Tag, Index, Block Offset, and Byte Offset; the Index selects an entry (Valid, Tag, Word3 .. Word0), the stored tag is compared with the address tag to generate Hit, and a multiplexer uses the Block Offset to select the data word.]

Block X in memory contains the words 4X, 4X+1, 4X+2, and 4X+3.

Block Address = Word Address / 4
Cache Address = X modulo 8
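The word-to-cache-address mapping above can be written directly (a small sketch using the slide's parameters: 4-word blocks, 8 cache blocks):

```python
def block_address(word_addr, words_per_block=4):
    """Block X holds words 4X, 4X+1, 4X+2, 4X+3."""
    return word_addr // words_per_block

def cache_address(word_addr, words_per_block=4, num_blocks=8):
    """Cache Address = (Word Address / 4) modulo 8."""
    return block_address(word_addr, words_per_block) % num_blocks

print(block_address(80), cache_address(80))  # 20 4
```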

Consider a direct-mapped cache with 4-word blocks and a size of 8 blocks, or 32 words.
Cache Address = (Word Address / 4) modulo 8

Reference Sequence:
Word Address   Block Address   Cache Address   Hit or Miss
 6              1               1              Miss
 7              1               1              Hit
 8              2               2              Miss
 9              2               2              Hit
80             20               4              Miss
 6              1               1              Hit
 7              1               1              Hit
 8              2               2              Hit
 9              2               2              Hit
81             20               4              Hit

Consider a direct-mapped cache with 4-word blocks and a size of 8 blocks, or 32 words.
Cache Address = (Word Address / 4) modulo 8

Reference Sequence:
Word Address   Block Address   Cache Address   Hit or Miss
 6              1               1              Miss
 7              1               1              Hit
 8              2               2              Miss
 9              2               2              Hit
68             17               1              Miss
 6              1               1              Miss
 7              1               1              Hit
 8              2               2              Hit
 9              2               2              Hit
69             17               1              Miss

Blocks 1 and 17 both map to cache address 1, so they keep evicting each other.
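The reference trace above can be reproduced with a small direct-mapped cache simulator, a sketch under the slide's parameters (4-word blocks, 8 cache blocks):

```python
def simulate_direct_mapped(refs, words_per_block=4, num_blocks=8):
    """Return 'Hit'/'Miss' for each word address in a direct-mapped cache."""
    cache = [None] * num_blocks        # block address stored in each cache block
    outcomes = []
    for word in refs:
        block = word // words_per_block
        index = block % num_blocks
        if cache[index] == block:
            outcomes.append("Hit")
        else:
            outcomes.append("Miss")
            cache[index] = block       # replace whatever was there
    return outcomes

# Blocks 1 (words 6, 7) and 17 (words 68, 69) collide in cache block 1
print(simulate_direct_mapped([6, 7, 8, 9, 68, 6, 7, 8, 9, 69]))
# ['Miss', 'Hit', 'Miss', 'Hit', 'Miss', 'Miss', 'Hit', 'Hit', 'Hit', 'Miss']
```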

How about putting a block in any unused block of the eight blocks?

How can you find it? Expand the Tag to the full block address and compare.

How about putting a block in any unused block of the eight blocks?

Fully Associative Memory - addressed by its contents. The address consists of the Block Address (28 bits) plus the Block Offset and Byte Offset.

Fully Associative Memory - addressed by its contents. The address consists of the Block Address (28 bits) plus the Block Offset and Byte Offset.

For a practical hit time, the cache must compare the stored tags against the block address in parallel. This is only feasible for a small number of blocks.

Fully Associative Memory - addressed by its contents.

[Figure: every entry's tag is compared with the block address simultaneously, one comparator per entry; the Hit signal is the OR of the matches, and a multiplexer uses the Block Offset to select the word. Valid bits not shown.]

This hardware is not feasible for a large cache.

Make sets of blocks associative: two-way set associative.

[Figure: 2^k sets indexed 0 to 2^k - 1, each holding two entries (Tag0/Data0 and Tag1/Data1). The address divides into Tag, Index, Block Offset, and Byte Offset. Valid bits not shown.]

Address by Index, then compare the two tags in parallel for a hit.

Block replacement strategies

For each index there are 2, 4, ... n options for replacement.

Strategies:
1. LRU - Least Recently Used: replace the block that has been unused for the longest time.
2. Random: select the block to be replaced randomly.
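LRU bookkeeping for one set can be sketched with an ordered list (an illustration, not the hardware implementation, which typically uses use bits or counters):

```python
# One set of a set-associative cache, kept in LRU order:
# position 0 = least recently used, position -1 = most recently used.
def touch(ways, block):
    """On a reference, move the block to the most-recently-used position."""
    ways.remove(block)
    ways.append(block)

def lru_victim(ways):
    """On a miss with a full set, evict the least recently used block."""
    return ways[0]

ways = [1, 17]       # block 17 was referenced more recently than block 1
touch(ways, 1)       # block 1 is referenced again
print(lru_victim(ways))  # now block 17 would be evicted next
```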

Consider a two-way set-associative cache with 4-word blocks and a size of 8 blocks, or 32 words.
Cache Address (Set) = (Word Address / 4) modulo 4

Reference Sequence:
Word Address   Block Address   Set   Hit or Miss
 6              1              1     Miss
 7              1              1     Hit
 8              2              2     Miss
 9              2              2     Hit
68             17              1     Miss
 6              1              1     Hit
 7              1              1     Hit
 8              2              2     Hit
 9              2              2     Hit
69             17              1     Hit

Blocks 1 and 17 now share set 1 (entry 0 and entry 1), so the second pass hits everywhere.
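The same trace that thrashed in the direct-mapped cache can be replayed in a small two-way set-associative simulator with LRU replacement, a sketch under the slide's parameters (4-word blocks, 4 sets of 2):

```python
def simulate_two_way(refs, words_per_block=4, num_sets=4):
    """2-way set-associative cache with LRU replacement; returns 'Hit'/'Miss' list."""
    sets = [[] for _ in range(num_sets)]   # each set holds up to two block
    outcomes = []                          # addresses, in LRU order (oldest first)
    for word in refs:
        block = word // words_per_block
        ways = sets[block % num_sets]
        if block in ways:
            outcomes.append("Hit")
            ways.remove(block)
            ways.append(block)             # mark most recently used
        else:
            outcomes.append("Miss")
            if len(ways) == 2:
                ways.pop(0)                # evict the least recently used block
            ways.append(block)
    return outcomes

# Blocks 1 and 17 now coexist in set 1, so the second pass hits
print(simulate_two_way([6, 7, 8, 9, 68, 6, 7, 8, 9, 69]))
# ['Miss', 'Hit', 'Miss', 'Hit', 'Miss', 'Hit', 'Hit', 'Hit', 'Hit', 'Hit']
```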