
1 Memory & Cache

2 Memories: Review
Memory is required for storing:
- Data
- Instructions
Different memory types:
- Dynamic RAM (DRAM)
- Static RAM (SRAM)
- Read-only memory (ROM)
Characteristics:
- Access time
- Price
- Volatility

3 Principle of Locality
Users want:
- an indefinitely large memory
- fast access to data items in the memory
Principle of locality:
- Temporal locality: if an item is referenced, it will tend to be referenced again soon.
- Spatial locality: if an item is referenced, items whose addresses are close by will tend to be referenced soon.
To take advantage of the principle of locality, the memory of a computer is implemented as a memory hierarchy. A small illustration follows.
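To make the two kinds of locality concrete, here is a minimal Python sketch (illustrative, not from the slides). The accumulator exhibits temporal locality, and the in-order scan of the list approximates spatial locality; Python lists only approximate contiguous storage, so treat this as conceptual.

```python
# 'counter' shows temporal locality: the same item is referenced on every
# iteration. Scanning 'data' in order shows spatial locality: neighboring
# addresses are referenced one after another.
data = list(range(1_000_000))

counter = 0                 # touched repeatedly -> temporal locality
for x in data:              # sequential traversal -> spatial locality
    counter += x

print(counter)
```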

4 Comparing Memories

Memory technology | Typical access time | $ per GB in 2004
SRAM              | 0.5-5 ns            | $4000-$10000
DRAM              | 50-70 ns            | $100-$200
Magnetic disk     | 5-20 million ns     | $0.50-$2

5 Memory Hierarchy
[Figure: the memory hierarchy. The level closest to the CPU is the fastest, smallest, and highest cost per bit; each level further away is slower, larger, and lower cost per bit.]

6 Organization of the Hierarchy
Data in a memory level closer to the processor is a subset of the data in any level further away. All the data is stored in the lowest level.

7 Access to the Data
Data transfer takes place between two adjacent levels. The minimum unit of information transferred is called a block. If the data requested by the processor appears in some block in the upper level, this is called a hit; otherwise a miss occurs.
- Hit rate (hit ratio) is the fraction of memory accesses found in the upper level; it is used to measure the performance of the memory hierarchy.
- Miss rate is the fraction of memory accesses not found in the upper level (= 1 - hit rate).

8 Hit & Miss
Hit time is the time to access the upper level of the memory hierarchy, including the time needed to determine whether the access is a hit or a miss. Miss penalty is the time to replace a block in the upper level with the corresponding block from the lower level, plus the time to deliver this block to the processor. Hit time is much smaller than the miss penalty:
- Read from a register: one cycle
- Read from 1st-level cache: one to two cycles
- Read from 2nd-level cache: four to five cycles
- Read from main memory: on the order of tens to hundreds of cycles

9 Memory Pyramid
[Figure: levels 1 through n of the memory hierarchy drawn as a pyramid below the CPU. Distance from the CPU in terms of access time increases downward, and the size of the memory at each level grows correspondingly.]

10 Taking Advantage of Locality
- Temporal locality: keep the most recently accessed items closer to the processor, usually in a fast memory called a cache.
- Spatial locality: move blocks consisting of multiple contiguous words in memory to the upper levels of the hierarchy.

11 The Basics of Cache
Cache is a term used for any storage that takes advantage of locality of access. In general, it is the fast memory between the CPU and main memory (CPU - Cache - Main memory). Caches first appeared in machines in the early 1960s; virtually every general-purpose machine built today, from the fastest to the slowest, includes one.

12 Cache Example
The cache holds blocks X1, X2, ..., Xn-1 when the processor requests word Xn. The access is a miss, so Xn is brought from memory into the cache.
Before the reference to Xn: X1, X2, X3, X4, ..., Xn-2, Xn-1
After the reference to Xn: X1, X2, X3, X4, ..., Xn-2, Xn-1, Xn

13 Direct-Mapped Cache
Two issues are involved:
- How do we know if a data item is in the cache?
- If it is, how do we find it?
Direct-mapped cache:
- Each memory location maps to exactly one location in the cache.
- Many items at the lower level share locations in the cache.
- The mapping is simple, as the sketch below shows: (block address) mod (number of blocks in the cache)
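A minimal sketch of the direct-mapped placement rule; the function name cache_index is illustrative, not from the slides.

```python
def cache_index(block_address: int, num_blocks: int) -> int:
    """Direct-mapped placement: (block address) mod (number of blocks)."""
    return block_address % num_blocks

# With an 8-block cache, the index is simply the lower three address bits:
assert cache_index(0b10110, 8) == 0b110
```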

14 Direct-Mapped Cache
[Figure: an 8-block direct-mapped cache with indices 000 through 111 and a larger main memory. Memory addresses 00001, 00101, 01001, 01101, 10001, 10101, 11001, and 11101 all map to cache index 001, i.e. their lower three bits.]

15 Fields in the Cache
If the number of blocks in the cache is a power of two, the lower log2(cache size in blocks) bits of the address are used as the cache index. The remaining upper bits are used as a tag to identify whether the requested block is in the cache:
memory address = tag || cache index
A valid bit indicates whether a location in the cache contains a valid entry (e.g., at startup nothing is valid). A sketch of the field split follows.
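The field split in a few lines of Python, assuming a block-address input and a power-of-two number of blocks; split_address is an illustrative name.

```python
def split_address(addr: int, index_bits: int):
    """Split a block address into (tag, cache index).
    The lower log2(#blocks) bits index the cache; the rest form the tag."""
    index = addr & ((1 << index_bits) - 1)   # keep the low index_bits bits
    tag = addr >> index_bits                 # everything above is the tag
    return tag, index

tag, index = split_address(0b10110, 3)       # 8-block cache -> 3 index bits
assert (tag, index) == (0b10, 0b110)
```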

16 Ex: 8-word Direct-Mapped Cache
The following reference sequence is replayed state by state on the next slides; a simulator sketch follows the table.

Address | Hit or miss | Tag | Assigned cache block
10110   | Miss        | 10  | 110
11010   | Miss        | 11  | 010
10110   | Hit         | 10  | 110
11010   | Hit         | 11  | 010
10000   | Miss        | 10  | 000
00011   | Miss        | 00  | 011
10000   | Hit         | 10  | 000
10010   | Miss        | 10  | 010
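The whole trace can be checked with a small simulator. This is an illustrative sketch, not code from the slides; it replays block addresses through an 8-entry direct-mapped cache and reproduces the hit/miss column above.

```python
def simulate_direct_mapped(trace, num_blocks=8):
    """Replay a trace of block addresses through a direct-mapped cache.
    Each entry holds (valid, tag); returns a list of 'hit'/'miss' results."""
    cache = [(False, None)] * num_blocks
    results = []
    for addr in trace:
        index = addr % num_blocks
        tag = addr // num_blocks
        valid, stored_tag = cache[index]
        if valid and stored_tag == tag:
            results.append("hit")
        else:
            results.append("miss")
            cache[index] = (True, tag)   # fetch the block from memory on a miss
    return results

trace = [0b10110, 0b11010, 0b10110, 0b11010, 0b10000, 0b00011, 0b10000, 0b10010]
print(simulate_direct_mapped(trace))
# ['miss', 'miss', 'hit', 'hit', 'miss', 'miss', 'hit', 'miss']
```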

17 Ex: 8-word Direct-Mapped Cache
The initial state of the cache: all eight entries (indices 000 through 111) are invalid (Valid = N).
Address of the memory reference: 10110 => MISS (tag 10, index 110).
After handling the miss:

Index | Valid | Tag | Data
000   | N     |     |
001   | N     |     |
010   | N     |     |
011   | N     |     |
100   | N     |     |
101   | N     |     |
110   | Y     | 10  | Mem(10110)
111   | N     |     |

18 Ex: 8-word Direct-Mapped Cache
Starting from the state above (only index 110 valid):
Address of the memory reference: 11010 => MISS (tag 11, index 010).
After handling the miss, the valid entries are:

Index | Valid | Tag | Data
010   | Y     | 11  | Mem(11010)
110   | Y     | 10  | Mem(10110)

19 Ex: 8-word Direct-Mapped Cache
Address of the memory reference: 10110 => HIT (tag 10 matches at index 110; no state change).
Address of the memory reference: 11010 => HIT (tag 11 matches at index 010; no state change).
Address of the memory reference: 10000 => MISS (tag 10, index 000).
After handling the miss, the valid entries are:

Index | Valid | Tag | Data
000   | Y     | 10  | Mem(10000)
010   | Y     | 11  | Mem(11010)
110   | Y     | 10  | Mem(10110)

20 Ex: 8-word Direct-Mapped Cache
Address of the memory reference: 00011 => MISS (tag 00, index 011).
After handling the miss, the valid entries are:

Index | Valid | Tag | Data
000   | Y     | 10  | Mem(10000)
010   | Y     | 11  | Mem(11010)
011   | Y     | 00  | Mem(00011)
110   | Y     | 10  | Mem(10110)

21 Ex: 8-word Direct-Mapped Cache
Address of the memory reference: 10000 => HIT (tag 10 matches at index 000; no state change).
Address of the memory reference: 10010 => MISS (tag 10, index 010). The stored tag at index 010 is 11, not 10, so the old block Mem(11010) is replaced. The valid entries become:

Index | Valid | Tag | Data
000   | Y     | 10  | Mem(10000)
010   | Y     | 10  | Mem(10010)
011   | Y     | 00  | Mem(00011)
110   | Y     | 10  | Mem(10110)

22 A More Realistic Cache
32-bit data, 32-bit address
- Cache size is 1 K (= 1024) words
- Block size is 1 word (1 word is 32 bits)
- 2-bit byte offset
- A valid bit per entry
- Cache index size = ? Tag size = ? (answered on the next slide)

23 A More Realistic Cache
[Figure: a 1024-entry direct-mapped cache. The 32-bit address (bits 31..0) splits into a 20-bit tag (bits 31-12), a 10-bit cache index (bits 11-2), and a 2-bit byte offset (bits 1-0). Each of the 1024 entries (0 to 1023) holds a valid bit, a 20-bit tag, and 32 bits of data. A comparator checks the stored 20-bit tag against the address tag; hit = valid AND tags equal.]

24 Cache Size
A formula for computing cache size:
2^n × (block size + tag size + 1)
where 2^n is the number of blocks in the cache and the 1 is the valid bit.
Example: what is the size of a direct-mapped cache with 64 KB of data and one-word blocks, assuming a 32-bit address?
- 64 KB = 2^14 words = 2^14 blocks
- Tag size is 32 - 14 - 2 = 16 bits
- Valid bit: 1 bit
- Total bits in the cache: 2^14 × (32 + 16 + 1) = 2^14 × 49 = 784 Kbits (about 98 KB for 64 KB of data), as the sketch below confirms.
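The arithmetic generalizes to a small helper; a sketch assuming one-word blocks and a 32-bit byte address (cache_bits is an illustrative name, not from the slides).

```python
import math

def cache_bits(num_blocks, word_bits=32, address_bits=32):
    """Total storage for a direct-mapped cache with one-word blocks:
    2^n blocks x (data bits + tag bits + 1 valid bit)."""
    index_bits = int(math.log2(num_blocks))
    byte_offset_bits = 2                       # 32-bit words are 4 bytes
    tag_bits = address_bits - index_bits - byte_offset_bits
    return num_blocks * (word_bits + tag_bits + 1)

total = cache_bits(num_blocks=2**14)
print(total, total // 2**10)   # 802816 bits = 784 Kbits for 64 KB of data
```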

25 Handling Cache Misses
When the requested data is found in the cache, the processor continues its normal execution. Cache misses are handled by the CPU control unit together with a separate cache controller. When a cache miss occurs:
- Stall the processor
- Activate the memory controller
- Fetch the requested block from memory
- Load it into the cache
- Continue as if it were a hit

26 Read & Write
Read misses: stall the CPU, fetch the requested block from memory, deliver it to the cache, and restart.
Write hits & misses: writes create a potential inconsistency between cache and memory. Two policies:
- Write-through: update the data in both the cache and memory on every write.
- Write-back: write the data only into the cache, and write the modified block back to memory later.

27 Write-Through Scheme
Suppose a memory write takes an additional 100 cycles. In the SPECint2000 benchmark, 10% of all instructions are stores, and the CPI without cache misses is about 1.17. With every write going to memory:
CPI = 1.17 + 10% × 100 = 11.17
A write buffer can store the data while it waits to be written to memory; meanwhile, the processor can continue execution. However, if the rate at which the processor generates writes exceeds the rate at which the memory system can accept them, no amount of buffering helps.
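The CPI figure is a one-line computation; a sketch using the slide's numbers.

```python
# Write-through with no write buffer: every store stalls for the full
# memory write latency.
base_cpi = 1.17
store_fraction = 0.10      # 10% of instructions are stores
write_penalty = 100        # extra cycles per memory write

cpi_no_buffer = base_cpi + store_fraction * write_penalty
print(cpi_no_buffer)       # 11.17 -- nearly a 10x slowdown without buffering
```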

28 Write-Back Scheme
When a write occurs, the new value is written only to the block in the cache. The modified block is written to memory when it is replaced. A write-back scheme is especially useful when the processor generates writes faster than main memory can handle them, but write-back schemes are more complicated to implement.

29 Unified vs. Split Cache
For instructions and data, there are two approaches:
Split caches:
- Higher miss rate, due to their smaller individual sizes
- Higher bandwidth, due to separate datapaths
- No conflict when accessing instructions and data at the same time
Unified cache:
- Lower miss rate, thanks to its larger size
- Lower bandwidth, due to a single datapath
- Possible stalls due to simultaneous access to data and instructions

30 Taking Advantage of Spatial Locality
The cache described so far takes advantage of temporal locality but not spatial locality. Basic idea: whenever there is a miss, load a group of adjacent memory words into the cache, i.e., use blocks longer than one word and transfer the entire block from memory to the cache on a miss. Block mapping stays the same:
cache index = (block address) mod (# of blocks in cache)

31 An Example Cache
The Intrinsity FastMATH processor:
- Embedded processor using the MIPS architecture
- 12-stage pipeline
- Separate instruction and data caches
- Each cache is 16 KB (4 K words) with 16-word blocks
- Tag size = ? (see the next slide)

32 Intrinsity FastMATH Processor
[Figure: the FastMATH cache. The 32-bit address (bits 31..0) splits into an 18-bit tag (bits 31-14), an 8-bit cache index (bits 13-6, selecting one of 256 entries), a 4-bit block offset (bits 5-2), and a 2-bit byte offset (bits 1-0). Each entry holds a valid bit, an 18-bit tag, and 16 words (16 × 32 bits) of data; the block offset drives a multiplexor that selects one 32-bit word, and a comparator on the 18-bit tag produces the hit signal.]

33 16-Word Cache Blocks
Field layout: Tag = bits [31-14], Index = bits [13-6], Block offset = bits [5-2], Byte offset = bits [1-0].
Example: what is the block address that byte address 1800 corresponds to?
Block address = (byte address) / (bytes per block) = 1800 / 64 = 28 (see the sketch below)
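The same address arithmetic in Python, assuming the FastMATH parameters above (16-word = 64-byte blocks, 256 entries); locate is an illustrative name.

```python
def locate(byte_address, bytes_per_block=64, num_blocks=256):
    """Map a byte address to (block address, cache index) for 64-byte blocks."""
    block_address = byte_address // bytes_per_block   # drop block + byte offsets
    index = block_address % num_blocks                # direct-mapped placement
    return block_address, index

print(locate(1800))   # (28, 28): byte address 1800 lies in block 1800 // 64 = 28
```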

34 Read & Writes in Multi-Word Caches
Read misses: always bring in the entire block.
Write hits & misses: more complicated.
- Compare the tag in the cache with the upper address bits.
- If they match, it is a hit; continue with write-back or write-through.
- If the tags are not identical, it is a miss: read the entire block from memory into the cache, then rewrite the cache with the word that caused the write miss.
Unlike the one-word-block case, write misses with multi-word blocks require reading from memory.

35 Performance of the Caches
Intrinsity FastMATH running SPEC2000 (16 KB instruction cache, 16 KB data cache):

Instruction miss rate | Data miss rate | Effective combined miss rate
0.4%                  | 11.4%          | 3.2%

Effective combined miss rate for a unified cache of the same total size: 3.18%.

36 Block Size
Small block size:
- High miss rate (does not take full advantage of spatial locality)
- Short block loading time
Large block size:
- Low miss rate
- Long time to load the entire block, hence a higher miss penalty
Two techniques to reduce the penalty:
- Early restart: resume execution as soon as the requested word arrives in the cache.
- Critical word first: the requested word is returned first; the rest of the block is transferred later.

37 Miss Rate vs. Block Size

38 Memory System to Support Cache
DRAM (Dynamic Random Access Memory) access time: the time between when a read is requested and when the desired word arrives at the CPU. A hypothetical memory system used in what follows:
- 1 clock cycle to send the address
- 15 clock cycles to initiate each DRAM access (per word)
- 1 clock cycle to send a word of data

39 One-Word-Wide Memory
Given a cache block of four words, the miss penalty for a one-word-wide memory organization is:
1 + 4 × 15 + 4 × 1 = 65 cycles
Bandwidth (# of bytes transferred per clock cycle): (4 × 4) / 65 ≈ 0.25
[Organization: CPU - Cache - Bus - Memory, all one word wide.]

40 Wide Memory Organization
With a main memory (and bus) four words wide, the miss penalty for a 4-word block is:
1 + 15 + 1 = 17 cycles
Bandwidth: (4 × 4) / 17 ≈ 0.94
[Organization: CPU - Multiplexor - Cache - Bus - Memory, with a wide bus and memory.]

41 Interleaved Memory Organization
With a main memory of 4 one-word-wide banks, the miss penalty for a 4-word block is:
1 + 15 + 4 × 1 = 20 cycles
Bandwidth: (4 × 4) / 20 = 0.80
[Organization: CPU - Cache - Bus - Memory banks 0 through 3; the banks are accessed in parallel, so the 15-cycle access latencies overlap. A sketch comparing all three organizations follows.]
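The three organizations of slides 39 to 41 compared in one short sketch, using the hypothetical timing from slide 38 (1 cycle for the address, 15 cycles per DRAM access, 1 cycle per bus transfer); the variable names are illustrative.

```python
# Miss penalty for a 4-word block under the three memory organizations.
words = 4
one_word_wide = 1 + words * 15 + words * 1   # 65 cycles: serial access + transfer
four_word_wide = 1 + 15 + 1                  # 17 cycles: one wide access + transfer
interleaved = 1 + 15 + words * 1             # 20 cycles: bank accesses overlap

for name, penalty in [("one-word-wide", one_word_wide),
                      ("four-word-wide", four_word_wide),
                      ("interleaved (4 banks)", interleaved)]:
    bandwidth = (words * 4) / penalty        # bytes transferred per clock cycle
    print(f"{name}: penalty={penalty} cycles, bandwidth={bandwidth:.2f} B/cycle")
```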

42 Example 1/2
Block size: 1 word; memory bus width: 1 word; miss rate: 3%; memory accesses per instruction: 1.2; CPI = 2.
- With a block size of 2 words, the miss rate is 2%.
- With a block size of 4 words, the miss rate is 1%.
What is the improvement in performance of interleaving two ways and four ways versus doubling the width of the memory and the bus, assuming access times of 1, 15, and 1 clock cycles?

43 Example 2/2
Two-word block (miss rate 2%):
- one-word bus & memory, no interleaving: CPI = 2 + 1.2 × 2% × (1 + 2 × 15 + 2 × 1) = 2.792
- one-word bus & memory, interleaving: CPI = 2 + 1.2 × 2% × (1 + 15 + 2 × 1) = 2.432
- two-word bus & memory, no interleaving: CPI = 2 + 1.2 × 2% × (1 + 15 + 1) = 2.408
Four-word block (miss rate 1%):
- one-word bus & memory, no interleaving: CPI = 2 + 1.2 × 1% × (1 + 4 × 15 + 4 × 1) = 2.780
- one-word bus & memory, interleaving: CPI = 2 + 1.2 × 1% × (1 + 15 + 4 × 1) = 2.240
- two-word bus & memory, no interleaving: CPI = 2 + 1.2 × 1% × (1 + 2 × 15 + 2 × 1) = 2.396
A sketch that reproduces these numbers follows.
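A sketch that recomputes the six CPI values above; cpi_with_misses is an illustrative name, and the miss penalties are rebuilt from the timing assumptions of slide 38.

```python
def cpi_with_misses(base_cpi, accesses_per_instr, miss_rate, miss_penalty):
    """CPI = base CPI + (memory accesses/instr) x miss rate x miss penalty."""
    return base_cpi + accesses_per_instr * miss_rate * miss_penalty

base, apc = 2.0, 1.2
# (miss rate, miss penalty in cycles) for each organization
cases = {
    "2-word block, narrow bus":  (0.02, 1 + 2 * 15 + 2 * 1),   # 33 cycles
    "2-word block, interleaved": (0.02, 1 + 15 + 2 * 1),       # 18 cycles
    "2-word block, 2-word bus":  (0.02, 1 + 15 + 1),           # 17 cycles
    "4-word block, narrow bus":  (0.01, 1 + 4 * 15 + 4 * 1),   # 65 cycles
    "4-word block, interleaved": (0.01, 1 + 15 + 4 * 1),       # 20 cycles
    "4-word block, 2-word bus":  (0.01, 1 + 2 * 15 + 2 * 1),   # 33 cycles
}
for name, (rate, penalty) in cases.items():
    print(f"{name}: CPI = {cpi_with_misses(base, apc, rate, penalty):.3f}")
# 2.792, 2.432, 2.408, 2.780, 2.240, 2.396 -- matching the slide
```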

44 Improving Cache Performance
- Reduce the miss rate by reducing the probability of contention for cache locations (more flexible block placement, next slides).
- Use multilevel caching: second- and third-level caches are good for reducing the miss penalty as well.

45 Flexible Placement of Cache Blocks
Direct-mapped cache: a memory block goes to exactly one location in the cache.
- Easy to find: (block no.) mod (# of blocks in the cache), then compare the tags.
- Many blocks contend for the same location.
Fully associative cache: a memory block can go into any cache line.
- Difficult to find: search all the tags to see whether the requested block is in the cache.

46 Flexible Placement of Cache Blocks
Set-associative cache: there is a fixed number of cache locations (at least two) where each memory block can be placed.
- A set-associative cache with n locations per set is called an n-way set-associative cache; the minimum set size is 2.
- Finding a block is easier than in a fully associative cache: (block no.) mod (# of sets in the cache), then compare the tags only within that set.

47 Locating Memory Blocks in the Cache
[Figure: where the block with address 12 can go in an 8-block cache (blocks 0-7). Direct mapped: only block 12 mod 8 = 4; the search checks one tag. 2-way set-associative: either way of set 12 mod 4 = 0; the search checks the two tags of that set. Fully associative: any of the eight blocks; the search checks all tags.]

48 Example
Consider the following successive memory accesses for direct-mapped, two-way set-associative, and fully associative caches of four one-word blocks. Access pattern: 0, 8, 0, 6, 8.
Direct-mapped cache (0 and 8 both map to block 0; 6 maps to block 2):

Address | Hit or miss | Block 0   | Block 1 | Block 2   | Block 3
0       | Miss        | Memory[0] |         |           |
8       | Miss        | Memory[8] |         |           |
0       | Miss        | Memory[0] |         |           |
6       | Miss        | Memory[0] |         | Memory[6] |
8       | Miss        | Memory[8] |         | Memory[6] |

Five misses out of five accesses.

49 Example
Memory accesses: 0, 8, 0, 6, 8. Two-way set-associative cache (two sets of two blocks; 0, 8, and 6 all map to set 0), with LRU replacement:

Address | Hit or miss | Set 0     | Set 0     | Set 1 | Set 1
0       | Miss        | Memory[0] |           |       |
8       | Miss        | Memory[0] | Memory[8] |       |
0       | Hit         | Memory[0] | Memory[8] |       |
6       | Miss        | Memory[0] | Memory[6] |       |
8       | Miss        | Memory[8] | Memory[6] |       |

Four misses.

50 Example
Memory accesses: 0, 8, 0, 6, 8. Fully associative cache (any block can go anywhere):

Address | Hit or miss | Block 0   | Block 1   | Block 2   | Block 3
0       | Miss        | Memory[0] |           |           |
8       | Miss        | Memory[0] | Memory[8] |           |
0       | Hit         | Memory[0] | Memory[8] |           |
6       | Miss        | Memory[0] | Memory[8] | Memory[6] |
8       | Hit         | Memory[0] | Memory[8] | Memory[6] |

Three misses. A simulator reproducing all three cases follows.
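All three caches can be replayed with one parametric simulator. This illustrative sketch (not from the slides) models an n-way set-associative cache with LRU replacement; ways=1 gives the direct-mapped case and ways=4 the fully associative case.

```python
from collections import OrderedDict

def simulate_set_associative(trace, num_blocks=4, ways=1):
    """Replay a block-address trace through an n-way set-associative cache
    with LRU replacement. ways=1 is direct-mapped; ways=num_blocks is
    fully associative."""
    num_sets = num_blocks // ways
    sets = [OrderedDict() for _ in range(num_sets)]   # order tracks recency
    results = []
    for addr in trace:
        s = sets[addr % num_sets]
        if addr in s:
            s.move_to_end(addr)        # mark as most recently used
            results.append("hit")
        else:
            if len(s) == ways:
                s.popitem(last=False)  # evict the least recently used block
            s[addr] = True
            results.append("miss")
    return results

trace = [0, 8, 0, 6, 8]
for ways in (1, 2, 4):
    r = simulate_set_associative(trace, num_blocks=4, ways=ways)
    print(f"{ways}-way: {r.count('miss')} misses  {r}")
# 1-way: 5 misses, 2-way: 4 misses, 4-way (fully associative): 3 misses
```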

51 Performance Improvement
Data miss rates for a 64 KB cache with 16-word blocks on SPEC2000:

Associativity | Data miss rate
1             | 10.3%
2             | 8.6%
4             | 8.3%
8             | 8.1%

52 Locating a Block in Cache
[Figure: a four-way set-associative cache lookup. The address index selects one set; the four stored tags of that set, each with a valid bit and data, are compared in parallel against the address tag. Any comparator match asserts the hit signal, and a 4-to-1 multiplexor driven by the comparator outputs selects the data of the matching way.]

53 Replacement Strategy
Direct mapping: no choice.
Set-associative and fully associative: several locations are possible for a block; which one should be replaced? The most commonly used schemes:
- LRU (Least Recently Used): keep track of which cache line was accessed most recently, and replace the one that has been unused for the longest time.
- Random: easy to implement, and only slightly worse than LRU.

54 Performance Equations
Formula 1: CPU time = (CPU execution clock cycles + memory-stall clock cycles) × clock cycle time
Memory-stall clock cycles come primarily from cache misses.
Formula 2: Memory-stall clock cycles = (memory accesses / program) × miss rate × miss penalty

55 Performance Equations
Formula 6: Memory-stall clock cycles = (instructions / program) × (memory accesses / instruction) × miss rate × miss penalty
Example: CPI = 2; instruction miss rate: 2%; data miss rate: 4%; miss penalty: 100 clock cycles for all misses; frequency of loads and stores: 25% + 11% = 36%. What is the CPI with memory stalls?
CPI = 2 + (2% × 100) + 36% × (4% × 100) = 2 + 2 + 1.44 = 5.44 (worked out below)
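The example as a sketch with the slide's numbers; every instruction causes one instruction fetch, and 36% of instructions also access data.

```python
base_cpi = 2.0
instr_stalls = 0.02 * 100          # instruction fetches: 2% miss x 100 cycles
data_stalls = 0.36 * 0.04 * 100    # 36% of instructions: 4% miss x 100 cycles

cpi = base_cpi + instr_stalls + data_stalls
print(cpi)                         # 2 + 2.0 + 1.44 = 5.44
```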

56 Continuing the Same Example
What if the processor is made faster, but the memory system stays the same?
- Assume the CPI is reduced from 2 to 1: the fraction of execution time spent on memory stalls rises from 3.44/5.44 = 0.63 to 3.44/4.44 = 0.77.
- Assume the processor clock rate doubles, so the miss penalty for all misses becomes 200 clock cycles:
  Total miss cycles per instruction = (2% × 200) + 36% × (4% × 200) = 4 + 2.88 = 6.88
  CPI = 2 + 6.88 = 8.88 (compare to 5.44)
  Speedup = 5.44 / (8.88 × 0.5) = 1.23

57 Multilevel Caches
First-level caches are often implemented on-chip in contemporary processors. Second-level caches, which can be on-chip or off-chip in a separate set of SRAMs, are accessed whenever a miss occurs in the primary cache. Compared with the first level, a second-level cache has:
- Larger size
- Larger block size
- Higher hit time, but still much faster than main memory

58 Example: Multilevel Caches
CPI = 1.0; clock rate = 5 GHz (0.2 ns per cycle); main memory access time = 100 ns; miss rate per instruction at the primary cache = 2%. How much faster is the machine if we add a secondary cache with a 5 ns access time that reduces the overall miss rate to main memory to 0.5%?
- Miss penalty to main memory = 100 / 0.2 = 500 cycles
- Without the secondary cache: total CPI = 1 + 2% × 500 = 11.0
- Miss penalty to the secondary cache = 5 / 0.2 = 25 cycles
- With the secondary cache: total CPI = 1 + 2% × 25 + 0.5% × 500 = 4.0
- Speedup = 11 / 4 = 2.75 (see the sketch below)
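The same computation as a sketch, with the slide's numbers.

```python
# 5 GHz clock -> 0.2 ns per cycle; penalties converted from ns to cycles.
cycle_time_ns = 0.2
main_penalty = 100 / cycle_time_ns   # 500 cycles to main memory
l2_penalty = 5 / cycle_time_ns       # 25 cycles to the secondary cache

cpi_l1_only = 1.0 + 0.02 * main_penalty                       # 1 + 10 = 11.0
cpi_with_l2 = 1.0 + 0.02 * l2_penalty + 0.005 * main_penalty  # 1 + 0.5 + 2.5 = 4.0
print(cpi_l1_only, cpi_with_l2, cpi_l1_only / cpi_with_l2)    # speedup = 2.75
```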

59 Design Considerations
First-level cache:
- The focus is to minimize hit time.
- The miss rate may be slightly higher.
- Smaller block size; the cache itself tends to be smaller.
Second-level cache:
- The focus is to reduce the overall miss rate.
- Access time is less important; its local miss rate can be large.
- Larger, and uses a larger block size.

60 Global vs. Local Miss Rate
Level 1 cache: 2% local miss rate
Level 2 cache: 50% local miss rate
What is the overall (global) miss rate? (Worked out below.)
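Assuming the usual definition (the global miss rate is the product of the local miss rates, since a reference reaches main memory only after missing in both levels), the answer works out to 1%.

```python
# Global miss rate = product of local miss rates along the hierarchy.
l1_local, l2_local = 0.02, 0.50
global_miss_rate = l1_local * l2_local
print(global_miss_rate)   # 0.01 -> a 1% overall miss rate
```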

