Improving Memory Access 1/3: The Cache and Virtual Memory


EECS 322 Computer Architecture

Principle of Locality

• The Principle of Locality states that programs access a relatively small portion of their address space at any instant of time.
• Two types of locality:
– Temporal locality (locality in time): if an item is referenced, the same item will tend to be referenced again soon ("the tendency to reuse recently accessed data items").
– Spatial locality (locality in space): if an item is referenced, nearby items will tend to be referenced soon ("the tendency to reference nearby data items").
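To make the two kinds of locality concrete, here is a minimal C sketch (not from the slides; the array size is arbitrary): the accumulator and loop variable are reused on every iteration (temporal locality), while the sequential sweep over the array touches neighboring words that share cache blocks (spatial locality).

```c
#include <stdio.h>

int main(void) {
    int a[1024];
    for (int i = 0; i < 1024; i++)
        a[i] = i;

    int sum = 0;                     /* sum and i are reused each iteration: temporal locality */
    for (int i = 0; i < 1024; i++)
        sum += a[i];                 /* consecutive a[i] share cache blocks: spatial locality  */

    printf("sum = %d\n", sum);
    return 0;
}
```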

Cache Terminology

A hit occurs if the data requested by the CPU is found in the upper level.
Hit rate (or hit ratio) is the fraction of accesses found in the upper level.
Hit time is the time required to access data in the upper level:
hit time = <detection time for hit or miss> + <hit access time>
A miss occurs if the data is not found in the upper level.
Miss rate (= 1 − hit rate) is the fraction of accesses not found in the upper level.
Miss penalty is the time required to access data in the lower level:
miss penalty = <lower-level access time> + <processor reload time>
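These definitions combine into the standard average memory access time formula, AMAT = hit time + miss rate × miss penalty (the formula is standard, though not stated on the slide; the numbers below are made up for illustration):

```c
#include <stdio.h>

int main(void) {
    double hit_time     = 1.0;   /* cycles: detection + hit access time (assumed) */
    double miss_rate    = 0.05;  /* 1 - hit rate (assumed)                        */
    double miss_penalty = 17.0;  /* cycles: lower access + reload time (assumed)  */

    double amat = hit_time + miss_rate * miss_penalty;
    printf("AMAT = %.2f cycles\n", amat);   /* 1 + 0.05 * 17 = 1.85 cycles */
    return 0;
}
```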

Direct Mapped Cache

• Direct mapped: assign the cache location based on the address of the word in memory:
cache_address = memory_address % cache_size;
• Observe that there is a many-to-1 memory-to-cache relationship.
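A sketch of this mapping in C. When the cache size is a power of two, the modulo is just the low-order bits of the word address, which is how the hardware implements it (the 16K-entry size and sample address are illustrative):

```c
#include <stdint.h>
#include <stdio.h>

#define CACHE_WORDS 16384u                 /* 16K one-word entries (2^14) */

int main(void) {
    uint32_t byte_address = 0x0040ABCCu;
    uint32_t word_address = byte_address >> 2;      /* drop the 2-bit byte offset */

    uint32_t index = word_address % CACHE_WORDS;    /* many-to-1 cache location   */
    uint32_t tag   = word_address / CACHE_WORDS;    /* identifies which word      */

    /* Power-of-two sizes let hardware split bits instead of dividing: */
    uint32_t index_bits = word_address & (CACHE_WORDS - 1u);
    uint32_t tag_bits   = word_address >> 14;       /* log2(16384) = 14 index bits */

    printf("index = %u (%u), tag = %u (%u)\n", index, index_bits, tag, tag_bits);
    return 0;
}
```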

Direct Mapped Cache: MIPS Architecture (Figure 7.7)

[Figure: the address is split into a tag and an index; the index selects a cache entry, the stored tag is compared with the address tag, and a match on a valid entry signals a hit and returns the data.]

Bits in a Direct Mapped Cache

How many total bits are required for a direct mapped cache with 64 KB (= 2^16 bytes) of data and one-word (= 32-bit) blocks, assuming a 32-bit byte memory address?

Cache index width = log2(words) = log2(2^16 / 4) = log2(2^14) = 14 bits
Block address width = <byte address width> − log2(bytes per word) = 32 − 2 = 30 bits
Tag size = <block address width> − <cache index width> = 30 − 14 = 16 bits
Cache block size = <valid size> + <tag size> + <block data size> = 1 bit + 16 bits + 32 bits = 49 bits
Total size = <number of cache blocks> × <cache block size> = 2^14 × 49 bits = 784 × 2^10 bits = 784 Kbits = 98 KB
Overhead = 98 KB / 64 KB ≈ 1.5 times the data size
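The same calculation as a small C program, so the parameters can be varied (a sketch; only the 64 KB, one-word-block, 32-bit-address configuration comes from the slide):

```c
#include <stdio.h>

int main(void) {
    int  addr_bits  = 32;            /* byte address width       */
    long data_bytes = 64L * 1024;    /* 64 KB of cached data     */
    int  word_bytes = 4;             /* one-word (32-bit) blocks */

    long entries = data_bytes / word_bytes;            /* 2^14 = 16384 blocks */
    int  index_bits = 0;
    for (long e = entries; e > 1; e >>= 1)
        index_bits++;                                  /* log2(entries) = 14  */

    int  block_addr_bits = addr_bits - 2;              /* 32 - log2(4)  = 30  */
    int  tag_bits   = block_addr_bits - index_bits;    /* 30 - 14       = 16  */
    long block_bits = 1 + tag_bits + 8 * word_bytes;   /* valid+tag+data = 49 */
    long total_bits = entries * block_bits;            /* 784 Kbits = 98 KB   */

    printf("tag = %d bits, block = %ld bits, total = %ld Kbits (%.2fx the data)\n",
           tag_bits, block_bits, total_bits / 1024,
           (double)total_bits / (8.0 * data_bytes));
    return 0;
}
```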

The DECStation 3100 Cache

Write-through cache: always write the data into both the cache and memory, then wait for memory.

The DECStation 3100 uses a write-through cache:
• 128 KB total cache size (= 32K words)
• = 64 KB instruction cache (= 16K words) + 64 KB data cache (= 16K words)
• 10 processor clock cycles to write to memory

In a gcc benchmark, 13% of the instructions are stores.
• Thus, a CPI of 1.2 becomes 1.2 + 13% × 10 = 2.5
• This reduces performance by more than a factor of 2!
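The slowdown can be checked with the usual stall-cycle arithmetic, CPI = base CPI + store fraction × write stall cycles (the figures are the slide's gcc numbers):

```c
#include <stdio.h>

int main(void) {
    double base_cpi    = 1.2;
    double store_frac  = 0.13;   /* 13% of gcc's instructions are stores */
    double write_stall = 10.0;   /* cycles to write through to memory    */

    double cpi = base_cpi + store_frac * write_stall;   /* 1.2 + 1.3 = 2.5 */
    printf("effective CPI = %.1f (%.2fx slower)\n", cpi, cpi / base_cpi);
    return 0;
}
```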

Cache Write Schemes

Write-through cache: always write the data into both the cache and memory, then wait for memory.

Write buffer: write the data into the cache and the write buffer, letting the processor continue while the buffer drains to memory. If the write buffer is full, the processor must stall. No amount of buffering can help if writes are being generated faster than the memory system can accept them.

Write-back cache: write the data only into the cache block, and write the block to memory only when a modified block is replaced. Faster, but more complex to implement in hardware.
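A minimal sketch of the write-buffer idea: a small FIFO between the cache and memory, with the stall condition made explicit (all names and the 4-entry size are illustrative, not from the slides).

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define WB_ENTRIES 4                       /* small FIFO of pending stores */

struct write_entry  { uint32_t addr, data; };
struct write_buffer {
    struct write_entry q[WB_ENTRIES];
    int head, tail, count;
};

/* Processor side: returns false when the buffer is full and the CPU must stall. */
bool wb_push(struct write_buffer *wb, uint32_t addr, uint32_t data) {
    if (wb->count == WB_ENTRIES)
        return false;                      /* writes arriving faster than memory accepts */
    wb->q[wb->tail] = (struct write_entry){ addr, data };
    wb->tail = (wb->tail + 1) % WB_ENTRIES;
    wb->count++;
    return true;
}

/* Memory side: drains one pending store per completed memory write. */
bool wb_drain(struct write_buffer *wb, struct write_entry *out) {
    if (wb->count == 0)
        return false;
    *out = wb->q[wb->head];
    wb->head = (wb->head + 1) % WB_ENTRIES;
    wb->count--;
    return true;
}

int main(void) {
    struct write_buffer wb = {0};
    for (uint32_t i = 0; i < 6; i++)
        if (!wb_push(&wb, 0x1000u + 4u * i, i))
            printf("store %u stalls: write buffer full\n", i);
    return 0;
}
```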

Hits vs. Misses

• Read hits: this is what we want!
• Read misses: stall the CPU, fetch the block from memory, deliver it to the cache, and restart.
• Write hits:
– write-through: replace the data in the cache and in memory.
– write buffer: write the data into the cache and the buffer.
– write-back: write the data only into the cache.
• Write misses: read the entire block into the cache, then write the word.
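Putting the read and write cases together, a toy direct-mapped, one-word-block, write-through cache might look like the sketch below. The backing array stands in for main memory, and with one-word blocks a write miss can install the word directly instead of first reading a block (everything here is illustrative).

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NENTRIES  16384u                     /* 16K one-word cache entries */
#define MEM_WORDS (1u << 20)                 /* 4 MB toy main memory       */

struct line { bool valid; uint32_t tag, data; };
static struct line cache_[NENTRIES];
static uint32_t memory_[MEM_WORDS];

uint32_t cache_read(uint32_t addr) {
    uint32_t word = addr >> 2;
    struct line *l = &cache_[word % NENTRIES];
    uint32_t tag = word / NENTRIES;
    if (!(l->valid && l->tag == tag)) {      /* read miss: stall, fetch block, */
        l->valid = true;                     /* deliver to cache, restart      */
        l->tag   = tag;
        l->data  = memory_[word % MEM_WORDS];
    }
    return l->data;                          /* read hit: this is what we want */
}

void cache_write(uint32_t addr, uint32_t data) {
    uint32_t word = addr >> 2;
    struct line *l = &cache_[word % NENTRIES];
    l->valid = true;                         /* hit or miss: update the cache  */
    l->tag   = word / NENTRIES;
    l->data  = data;
    memory_[word % MEM_WORDS] = data;        /* write-through: update memory   */
}

int main(void) {
    cache_write(0x100u, 42u);
    printf("read back %u\n", cache_read(0x100u));
    return 0;
}
```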

The DECStation 3100 Miss Rates (Figure 7.9)

• A split instruction and data cache increases the bandwidth.

Program   Instruction miss rate   Data miss rate   Effective split miss rate   Combined miss rate
gcc       6.1%                    2.1%             5.4%                        4.8%
spice     1.2%                    1.3%             1.2%                        1.2%

• Why does spice have a lower miss rate? Numerical programs tend to consist of a lot of small program loops.
• The split cache has a slightly worse miss rate than the combined cache, but the extra bandwidth makes it worthwhile.

Spatial Locality

• Temporal-only cache: the cache block contains only one word (no spatial locality).
• Spatial locality: the cache block contains multiple words; when a miss occurs, multiple words are fetched.
• Advantage: the hit ratio increases because there is a high probability that adjacent words will be needed shortly.
• Disadvantage: the miss penalty increases with block size.

Spatial Locality: 64 KB Cache, 4-Word Blocks (Figure 7.10)

• 64 KB cache using four-word (16-byte) blocks.
• 16-bit tag, 12-bit index, 2-bit block offset, 2-bit byte offset.
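The field widths can be checked by decomposing an address in C (the sample address is arbitrary):

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint32_t addr = 0x12345678u;                   /* arbitrary 32-bit byte address    */

    uint32_t byte_offset  =  addr        & 0x3u;   /*  2 bits: byte within a word      */
    uint32_t block_offset = (addr >> 2)  & 0x3u;   /*  2 bits: word within the block   */
    uint32_t index        = (addr >> 4)  & 0xFFFu; /* 12 bits: one of 4096 blocks      */
    uint32_t tag          =  addr >> 16;           /* 16 bits: the rest of the address */

    printf("tag=0x%04x index=0x%03x block=%u byte=%u\n",
           tag, index, block_offset, byte_offset);
    return 0;
}
```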

Effective Split Miss Rate Performance (Figure 7.11)

Use split caches because there is more spatial locality in code:

Program   Block size   Instruction miss rate   Data miss rate   Effective split miss rate
gcc       1            6.1%                    2.1%             5.4%
gcc       4            2.0%                    1.7%             1.9%
spice     1            1.2%                    1.3%             1.2%
spice     4            0.3%                    0.6%             0.4%

(For reference, the combined-cache miss rate for gcc with one-word blocks was 4.8%.)

• Spatial split cache (block size 4): lower miss rate.
• Temporal-only split cache (block size 1): slightly worse miss rate.

Cache Block Size Performance (Figure 7.12)

Increasing the block size tends to decrease the miss rate.

Designing the Memory System (Figure 7.13)

Make reading multiple words easier by using banks of memory. It can get a lot more complicated...

One-Word-Wide Memory Organization (Figure 7.13)

[Figure: CPU, cache, bus, and memory in a one-word-wide organization.]

Suppose we have a system with:
• 1 cycle to send the address
• 15 cycles to access DRAM
• 1 cycle to send a word of data

If we have a cache block of 4 words, then the miss penalty is
= 1 (address send) + 4 × 15 (DRAM reads) + 4 × 1 (data sends)
= 65 clocks per block read.
Thus the number of bytes transferred per clock cycle
= 4 bytes/word × 4 words / 65 clocks ≈ 0.25 bytes/clock.

Interleaved Memory Organization (Figure 7.13)

[Figure: CPU, cache, bus, and memory banks 0–3 in a 4-bank interleaved organization.]

Suppose we have the same system with 4-bank memory interleaving:
• 1 cycle to send the address
• 15 cycles to access DRAM
• 1 cycle to send a word of data

If we have a cache block of 4 words, then the miss penalty is
= 1 (address send) + 15 (overlapped DRAM reads across the 4 banks) + 4 × 1 (data sends)
= 20 clocks per block read.
Thus the number of bytes transferred per clock cycle
= 4 bytes/word × 4 words / 20 clocks = 0.80 bytes/clock.
We improved from 0.25 to 0.80 bytes/clock!

Wide Bus: 4-Word-Wide Memory Organization (Figure 7.13)

[Figure: CPU, multiplexor, cache, wide bus, and 4-word-wide memory.]

Suppose we have the same system with a 4-word-wide memory and bus:
• 1 cycle to send the address
• 15 cycles to access DRAM
• 1 cycle to send four words of data

If we have a cache block of 4 words, then the miss penalty is
= 1 (address send) + 15 (one wide DRAM read) + 1 (one wide data send)
= 17 clocks per block read.
Thus the number of bytes transferred per clock cycle
= 4 bytes/word × 4 words / 17 clocks ≈ 0.94 bytes/clock.
We improved from 0.25 to 0.80 to 0.94 bytes/clock!

Memory Organizations (Figure 7.13)

One-word-wide memory organization
• Advantage: easy to implement, low hardware overhead.
• Disadvantage: slow, 0.25 bytes/clock transfer rate.

Interleaved memory organization
• Advantage: better, 0.80 bytes/clock transfer rate. Banks are also valuable on writes, since each bank can write independently.
• Disadvantage: more complex bus hardware.

Wide memory organization
• Advantage: fastest, 0.94 bytes/clock transfer rate.
• Disadvantage: wider bus and an increase in cache access time.
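All three transfer rates follow from the same miss-penalty breakdown; here is a sketch that reproduces the numbers above (the 1/15/1 cycle timings and the 4-word block are from the slides):

```c
#include <stdio.h>

int main(void) {
    int addr = 1, dram = 15, xfer = 1;   /* cycles: address send, DRAM access, bus transfer */
    int words = 4, word_bytes = 4;       /* 4-word cache block                              */

    int narrow      = addr + words * (dram + xfer); /* 65: each word pays DRAM + transfer   */
    int interleaved = addr + dram + words * xfer;   /* 20: DRAM reads overlap across banks  */
    int wide        = addr + dram + xfer;           /* 17: one wide read, one wide transfer */

    double bytes = (double)(words * word_bytes);
    printf("one-word-wide: %2d clocks, %.2f bytes/clock\n", narrow,      bytes / narrow);
    printf("interleaved:   %2d clocks, %.2f bytes/clock\n", interleaved, bytes / interleaved);
    printf("wide:          %2d clocks, %.2f bytes/clock\n", wide,        bytes / wide);
    return 0;
}
```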