Memory & Cache.


Memories: Review Memory is required for storing data and instructions. Different memory types: dynamic RAM, static RAM, read-only memory (ROM). Characteristics: access time, price, volatility.

Principle of Locality Users want an indefinitely large memory and fast access to the data items in it. Principle of locality: Temporal locality: if an item is referenced, it will tend to be referenced again soon. Spatial locality: if an item is referenced, items whose addresses are close by will tend to be referenced soon. To take advantage of the principle of locality, the memory of a computer is implemented as a memory hierarchy.
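To make the two kinds of locality concrete, here is a minimal C sketch (my own, not from the slides): summing an array walks consecutive addresses (spatial locality) while reusing the loop counter and accumulator on every iteration (temporal locality).

#include <stdio.h>

#define N 1024

int main(void) {
    static int a[N];
    long sum = 0;

    for (int i = 0; i < N; i++)
        a[i] = i;
    /* sequential accesses to a[] -> spatial locality;
     * i and sum are touched every iteration -> temporal locality */
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %ld\n", sum);
    return 0;
}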

Comparing Memories
Memory technology | Typical access time | $ per GB in 2004
SRAM | 0.5-5 ns | $4000-$10,000
DRAM | 50-70 ns | $100-$200
Magnetic disk | 5-20 million ns | $0.50-$2

Memory Hierarchy (figure): the level closest to the CPU is the fastest, the smallest, and has the highest cost per bit; the level furthest away is the slowest, the largest, and the cheapest.

Organization of the Hierarchy Data in a memory level closer to the processor is a subset of data in any level further away. All the data is stored in the lowest level.

Access to the Data Data transfer takes place between two adjacent levels. The minimum unit of information transferred is called a block. If the data requested by the processor appears in some block in the upper level, this is called a hit; otherwise a miss occurs. Hit rate (or hit ratio) is the fraction of memory accesses found in the upper level and is used to measure the performance of the memory hierarchy. Miss rate is the fraction of memory accesses not found in the upper level (= 1 - hit rate).

Hit & Miss Hit time is the time to access the upper level of the memory hierarchy, which includes the time needed to determine whether the access is a hit or a miss. Miss penalty is the time to replace a block in the upper level with the corresponding block from the lower level, plus the time to deliver this block to the processor. The hit time is much smaller than the miss penalty. Read from a register: one cycle. Read from the 1st-level cache: one to two cycles. Read from the 2nd-level cache: four to five cycles. Read from main memory: 20-100 cycles.

Memory Pyramid (figure): the CPU sits above Level 1, Level 2, …, Level n of the hierarchy. The distance from the CPU in terms of access time increases, and the size of the memory at each level grows, as you move down the pyramid.

Taking Advantage of Locality Temporal locality: keep recently accessed items closer to the processor, usually in a fast memory called a cache. Spatial locality: move blocks consisting of multiple contiguous words from memory to the upper levels of the hierarchy.

The Basics of Cache Cache is a term used to refer to any storage that takes advantage of locality of access. In general, it is the fast memory between the CPU and main memory (CPU - Cache - Main memory). Caches first appeared in machines in the early 1960s, and virtually every general-purpose machine built today, from the fastest to the slowest, includes one.

Cache Example Before the reference to Xn, the cache holds the words X1, X2, …, Xn-1. An access to word Xn is a miss, so Xn is brought from memory into the cache; after the reference, the cache also contains Xn.

Direct-Mapped Cache Two issues are involved: How do we know if a data item is in the cache? If it is, how do we find it? Direct-mapped cache: each memory location is mapped to exactly one location in the cache, so many items at the lower level share locations in the cache. The mapping is simple: (block address) mod (number of blocks in the cache).

Direct-Mapped Cache (figure): an eight-block cache with indices 000-111. The memory words at addresses ending in 001 (00001, 01001, 10001, 11001) all map to cache index 001, and those at addresses ending in 101 (00101, 01101, 10101, 11101) all map to cache index 101.

Fields in the Cache If the number of blocks in the cache is a power of two, the lower log2(number of blocks in the cache) bits of the address are used as the cache index. The remaining upper bits are used as a tag to identify whether the requested block is in the cache: memory address = tag || cache index. A valid bit indicates whether a location in the cache contains a valid entry (e.g., at startup all entries are invalid).
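As an illustration (my own sketch, not from the slides), the index/tag split for a direct-mapped cache with one-word blocks is just a shift and a mask; the helper name split_address and the example address are assumptions.

#include <stdio.h>
#include <stdint.h>

/* Split a word address into cache index and tag for a direct-mapped
 * cache with one-word blocks; index_bits = log2(number of blocks). */
static void split_address(uint32_t word_addr, int index_bits,
                          uint32_t *index, uint32_t *tag) {
    *index = word_addr & ((1u << index_bits) - 1);  /* low-order bits */
    *tag   = word_addr >> index_bits;               /* remaining bits */
}

int main(void) {
    /* 8-block cache (3 index bits), address 10110b = 22 as in the next example */
    uint32_t index, tag;
    split_address(0x16, 3, &index, &tag);
    printf("index = %u (110b), tag = %u (10b)\n", index, tag);
    return 0;
}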

Ex: 8-word Direct-Mapped Cache
Address | Hit or miss | Tag | Assigned cache block
10110 | Miss | 10 | 110
11010 | Miss | 11 | 010
10110 | Hit | 10 | 110
10000 | Miss | 10 | 000
00011 | Miss | 00 | 011
10010 | Miss | 10 | 010

Ex: 8-word Direct-Mapped Cache The initial state of the cache: all eight entries (indices 000-111) are invalid. Address of the memory reference: 10110 => MISS. After handling the miss, entry 110 is valid with tag 10 and data Mem(10110).

Ex: 8-word Direct-Mapped Cache Address of the memory reference: 11010 => MISS. After handling it, entry 010 is valid with tag 11 and data Mem(11010); entry 110 still holds tag 10 and Mem(10110).

Ex: 8-word Direct-Mapped Cache Address of the memory reference: 10110 => HIT. Address of the memory reference: 11010 => HIT. Address of the memory reference: 10000 => MISS. After handling the miss, entry 000 is valid with tag 10 and data Mem(10000); entries 010 and 110 are unchanged.

Ex: 8-word Direct-Mapped Cache Address of the memory reference: 00011 => MISS. After handling it, entry 011 is valid with tag 00 and data Mem(00011).

Ex: 8-word Direct-Mapped Cache Address of the memory reference: 10000 => HIT. Address of the memory reference: 10010 => MISS. Entry 010 is replaced: its tag becomes 10 and its data Mem(10010), evicting Mem(11010).

A More Realistic Cache 32-bit data, 32-bit address. Cache size is 1 K (= 1024) words. Block size is 1 word; 1 word is 32 bits. Cache index size = ? Tag size = ? 2-bit byte offset. A valid bit.

A More Realistic Cache (figure): the 32-bit address is split into a 20-bit tag (bits 31-12), a 10-bit cache index (bits 11-2), and a 2-bit byte offset (bits 1-0). The index selects one of the 1024 entries (0-1023); each entry stores a valid bit, a 20-bit tag, and 32 bits of data. A comparator checks the stored tag against the tag field of the address to produce the hit signal.

Cache Size A formula for computing cache size: 2^n × (block size + tag size + 1), where 2^n is the number of blocks in the cache. Example: what is the size of a direct-mapped cache with 64 KB of data and one-word blocks, assuming a 32-bit address? 64 KB = 2^14 blocks. Tag size is 32 - 14 - 2 = 16 bits. Valid bit: 1 bit. Total bits in the cache: 2^14 × (32 + 16 + 1) = 802,816 bits.
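A small C sketch of the same formula, using the slide's parameters (the variable names are mine):

#include <stdio.h>

int main(void) {
    const int address_bits = 32;
    const int block_bits   = 14;   /* 64 KB of data / 4-byte blocks = 2^14 blocks */
    const int byte_offset  = 2;    /* one-word (4-byte) blocks */
    const int data_bits    = 32;

    int tag_bits = address_bits - block_bits - byte_offset;            /* 16 */
    long total   = (1L << block_bits) * (data_bits + tag_bits + 1);    /* + valid bit */

    printf("tag = %d bits, total = %ld bits\n", tag_bits, total);      /* 802816 */
    return 0;
}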

Handling Cache Misses When the requested data is found in the cache, the processor continues its normal execution. Cache misses are handled by the CPU control unit together with a separate cache controller. When a cache miss occurs: stall the processor, activate the memory controller, fetch the requested data from memory, load it into the cache, and continue as if it were a hit.

Read & Write Read misses: stall the CPU, fetch the requested block from memory, deliver it to the cache, and restart. Write hits & misses: writes can create an inconsistency between the cache and main memory. Two policies: write the data into both the cache and memory (write-through), or write the data only into the cache and write the block back to memory later (write-back).

Write-Through Scheme A memory write takes an additional 100 cycles. In the SPEC2000Int benchmark, 10% of all instructions are stores and the CPI without cache misses is about 1.17. With the write stalls included, CPI = 1.17 + 0.1 × 100 = 11.17. A write buffer can store the data while it is waiting to be written to memory; meanwhile, the processor can continue execution. If the rate at which the processor generates writes is higher than the rate at which the memory system can accept them, buffering is not a solution.

Write-Back Scheme When a write occurs, the new value is written only to the block in the cache. The modified block is written to memory when it is replaced. The write-back scheme is especially useful when the processor generates writes faster than the main memory can handle them. Write-back schemes are more complicated to implement.

Unified vs. Split Cache For instruction and data caches there are two approaches. Split caches: higher miss rate due to their smaller individual sizes, higher bandwidth due to separate datapaths, and no conflict when instructions and data are accessed at the same time. Unified cache: lower miss rate thanks to the larger size, lower bandwidth due to a single datapath, and possible stalls due to simultaneous accesses to data and instructions.

Taking Advantage of Spatial Locality The cache we have described so far takes advantage of temporal locality but not spatial locality. Basic idea: whenever we have a miss, load a group of adjacent memory words into the cache (i.e., use blocks longer than one word and transfer the entire block from memory to the cache on a miss). Block mapping: cache index = (block address) % (# of blocks in the cache).

An Example Cache The Intrinsity FastMATH processor: an embedded processor that uses the MIPS architecture and a 12-stage pipeline. Separate instruction and data caches; each cache is 16 KB (4 K words) with 16-word blocks. Tag size = ?

Intrinsity FastMATH processor (figure): the address is split into an 18-bit tag (bits 31-14), an 8-bit cache index (bits 13-6), a 4-bit block offset (bits 5-2), and a 2-bit byte offset (bits 1-0). The index selects one of 256 entries, each holding a valid bit, an 18-bit tag, and a 16-word data block; a comparator checks the tag for a hit and a multiplexor selects the requested word within the block.

16-Word Cache Blocks Tag: bits [31-14]. Index: bits [13-6]. Block offset: bits [5-2]. Byte offset: bits [1-0]. Example: what is the block address that byte address 1800 corresponds to? Block address = (byte address) / (bytes per block) = 1800 / 64 = 28.
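A hedged C sketch of the same decomposition for a FastMATH-style cache (field boundaries as listed above; the variable names are mine):

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t addr = 1800;                        /* byte address from the slide */

    uint32_t byte_offset  =  addr        & 0x3;  /* bits [1:0]   */
    uint32_t block_offset = (addr >> 2)  & 0xF;  /* bits [5:2]   */
    uint32_t index        = (addr >> 6)  & 0xFF; /* bits [13:6]  */
    uint32_t tag          =  addr >> 14;         /* bits [31:14] */
    uint32_t block_addr   =  addr / 64;          /* 64 bytes per 16-word block */

    printf("block address = %u, index = %u, block offset = %u, "
           "byte offset = %u, tag = %u\n",
           block_addr, index, block_offset, byte_offset, tag);  /* block address 28 */
    return 0;
}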

Reads & Writes in Multi-Word Caches Read misses: always bring in the entire block. Write hits & misses are more complicated: compare the tag in the cache with the upper address bits. If they match, it is a hit; continue with write-back or write-through. If the tags are not identical, it is a miss: read the entire block from memory into the cache, then rewrite the cached block with the word that caused the write miss. Unlike the one-word-block case, write misses with multi-word blocks require reading from memory.

Performance of the Caches Intrinsity FastMATH on SPEC2000 (16 KB instruction cache, 16 KB data cache): instruction miss rate 0.4%, data miss rate 11.4%, effective combined miss rate 3.2%. Effective combined miss rate for a unified cache: 3.18%.

Block Size Small block size: higher miss rate (does not take full advantage of spatial locality), but a short block loading time. Large block size: lower miss rate, but a long time for loading the entire block and therefore a higher miss penalty. Two ways to reduce that penalty: early restart: resume execution as soon as the requested word arrives in the cache; critical word first: the requested word is returned first and the rest of the block is transferred later.

Miss Rate vs. Block Size

Memory System to Support Cache DRAM (Dynamic Random Access Memory) access time: the time between when a read is requested and when the desired word arrives at the CPU. A hypothetical memory system: 1 clock cycle to send the address, 15 clock cycles to initiate a DRAM access (for each word), 1 clock cycle to send a word of data.

One-Word-Wide Memory Given a cache block of four words, the miss penalty for the one-word-wide memory organization is 1 + 4 × 15 + 4 × 1 = 1 + 60 + 4 = 65 cycles. Bandwidth (# of bytes transferred per clock cycle) = (4 × 4) / 65 ≈ 0.25. (Figure: CPU - Cache - Bus - Memory.)

Wide Memory Organization With a main memory (and bus) four words wide, the miss penalty for a 4-word block is 1 + 15 + 1 × 1 = 17 cycles. The bandwidth is (4 × 4) / 17 ≈ 0.94. (Figure: CPU - Multiplexor - Cache - Bus - Memory.)

Interleaved Memory Organization With main memory organized as four banks, the miss penalty for a 4-word block is 1 + 15 + 4 × 1 = 20 cycles. The bandwidth is (4 × 4) / 20 = 0.80. (Figure: CPU - Cache - Bus - Memory bank 0, Memory bank 1, Memory bank 2, Memory bank 3.)
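The three miss penalties and bandwidths above can be reproduced with a few lines of C; this is only an illustrative sketch using the timing assumptions from the previous slides (1, 15, and 1 cycles).

#include <stdio.h>

int main(void) {
    const int words = 4, addr = 1, dram = 15, bus = 1;

    int one_word_wide  = addr + words * dram + words * bus;  /* 65 cycles */
    int four_word_wide = addr + dram + bus;                  /* 17 cycles */
    int interleaved    = addr + dram + words * bus;          /* 20 cycles */

    printf("one-word-wide : %d cycles, %.2f bytes/cycle\n",
           one_word_wide,  4.0 * words / one_word_wide);
    printf("four-word-wide: %d cycles, %.2f bytes/cycle\n",
           four_word_wide, 4.0 * words / four_word_wide);
    printf("interleaved   : %d cycles, %.2f bytes/cycle\n",
           interleaved,    4.0 * words / interleaved);
    return 0;
}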

Example 1/2 Block size: 1 word, memory bus width: 1 word, miss rate: 3%, memory accesses per instruction: 1.2, and CPI = 2. With block size = 2 words the miss rate is 2%; with block size = 4 words the miss rate is 1%. What is the improvement in performance of interleaving two ways and four ways versus doubling the width of the memory and the bus, assuming the access times are 1, 15, and 1 clock cycles?

Example 2/2 Starting from the one-word-wide machine, the CPI for each alternative is: Two-word block: one-word bus & memory, no interleaving: CPI = 2 + (1.2 × 2% × (1 + 15 × 2 + 1 × 2)) = 2.792; one-word bus & memory, interleaving: CPI = 2 + (1.2 × 2% × (1 + 15 + 2 × 1)) = 2.432; two-word bus & memory, no interleaving: CPI = 2 + (1.2 × 2% × (1 + 15 + 1)) = 2.408. Four-word block: one-word bus & memory, no interleaving: CPI = 2 + (1.2 × 1% × (1 + 15 × 4 + 1 × 4)) = 2.780; one-word bus & memory, interleaving: CPI = 2 + (1.2 × 1% × (1 + 15 + 4 × 1)) = 2.24; two-word bus & memory, no interleaving: CPI = 2 + (1.2 × 1% × (1 + 15 × 2 + 2 × 1)) = 2.396.
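A short C sketch that reproduces these CPI values (the helper name cpi is my own, not from the slides):

#include <stdio.h>

/* base CPI 2, 1.2 memory accesses per instruction */
static double cpi(double miss_rate, int miss_penalty) {
    return 2.0 + 1.2 * miss_rate * miss_penalty;
}

int main(void) {
    /* two-word blocks, 2% miss rate */
    printf("%.3f\n", cpi(0.02, 1 + 15 * 2 + 1 * 2)); /* 2.792  no interleaving    */
    printf("%.3f\n", cpi(0.02, 1 + 15 + 2 * 1));     /* 2.432  interleaved        */
    printf("%.3f\n", cpi(0.02, 1 + 15 + 1));         /* 2.408  two-word-wide bus  */
    /* four-word blocks, 1% miss rate */
    printf("%.3f\n", cpi(0.01, 1 + 15 * 4 + 1 * 4)); /* 2.780  no interleaving    */
    printf("%.3f\n", cpi(0.01, 1 + 15 + 4 * 1));     /* 2.240  interleaved        */
    printf("%.3f\n", cpi(0.01, 1 + 15 * 2 + 2 * 1)); /* 2.396  two-word-wide bus  */
    return 0;
}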

Improving Cache Performance Reduce the miss rate by reducing the probability that different memory blocks contend for the same cache location. Use multilevel caching: second- and third-level caches are good for reducing the miss penalty as well.

Flexible Placement of Cache Blocks Direct-mapped cache: a memory block goes to exactly one location in the cache. Easy to find: (block no.) % (# of blocks in the cache), then compare the tags; but many blocks contend for the same location. Fully associative cache: a memory block can go into any cache line. Harder to find: all tags must be searched to see whether the requested block is in the cache.

Flexible Placement of Cache Blocks Set-associative cache: there is a fixed number of cache locations (at least two) where each memory block can be placed. A set-associative cache with n locations for a block is called an n-way set-associative cache; the minimum set size is 2. Finding a block is easier than in a fully associative cache: (block no.) % (# of sets in the cache), then tags are compared only within that set.

Locating Memory Blocks in the Cache (figure): placing a block with address 12 in an eight-block cache. Direct mapped: it can go only into block 12 mod 8 = 4, so one tag is searched. Two-way set-associative: it can go into either block of set 12 mod 4 = 0, so two tags are searched. Fully associative: it can go into any block, so all tags are searched.

Example Consider the following successive memory accesses for direct-mapped, two-way, and four-way caches with four one-word blocks. The access pattern (block addresses) is 0, 8, 0, 6, 8. Direct-mapped cache:
Block address | Hit or miss | Block 0 | Block 1 | Block 2 | Block 3
0 | Miss | Memory[0] | | |
8 | Miss | Memory[8] | | |
0 | Miss | Memory[0] | | |
6 | Miss | Memory[0] | | Memory[6] |
8 | Miss | Memory[8] | | Memory[6] |

Example Memory access pattern: 0, 8, 0, 6, 8. Two-way set-associative cache (LRU replacement):
Block address | Hit or miss | Set 0 | Set 0 | Set 1 | Set 1
0 | Miss | Memory[0] | | |
8 | Miss | Memory[0] | Memory[8] | |
0 | Hit | Memory[0] | Memory[8] | |
6 | Miss | Memory[0] | Memory[6] | |
8 | Miss | Memory[8] | Memory[6] | |

Example Memory access pattern: 0, 8, 0, 6, 8. Fully associative cache:
Block address | Hit or miss | Block 0 | Block 1 | Block 2 | Block 3
0 | Miss | Memory[0] | | |
8 | Miss | Memory[0] | Memory[8] | |
0 | Hit | Memory[0] | Memory[8] | |
6 | Miss | Memory[0] | Memory[8] | Memory[6] |
8 | Hit | Memory[0] | Memory[8] | Memory[6] |
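For reference, here is a small C sketch (my own, not from the slides) that simulates a 4-block cache with LRU replacement and reproduces the miss counts above: 5 misses direct mapped, 4 misses two-way, 3 misses fully associative.

#include <stdio.h>

#define BLOCKS 4

/* `ways` blocks per set: ways = 1 is direct mapped, ways = 4 is fully
 * associative.  LRU replacement inside each set. */
static int count_misses(int ways, const int *pattern, int n) {
    int tag[BLOCKS] = {0}, last_use[BLOCKS] = {0}, valid[BLOCKS] = {0};
    int sets = BLOCKS / ways, misses = 0;

    for (int t = 0; t < n; t++) {
        int addr = pattern[t];
        int base = (addr % sets) * ways;          /* first block of the set */
        int hit = -1, victim = base;

        for (int w = base; w < base + ways; w++) {
            if (valid[w] && tag[w] == addr) hit = w;
            /* prefer an invalid block, otherwise the least recently used one */
            if (!valid[victim]) continue;
            if (!valid[w] || last_use[w] < last_use[victim]) victim = w;
        }
        if (hit < 0) {                            /* miss: fill the victim */
            misses++;
            hit = victim;
            valid[hit] = 1;
            tag[hit] = addr;
        }
        last_use[hit] = t + 1;                    /* update LRU state */
    }
    return misses;
}

int main(void) {
    int pattern[] = {0, 8, 0, 6, 8};
    printf("direct mapped    : %d misses\n", count_misses(1, pattern, 5)); /* 5 */
    printf("2-way set assoc. : %d misses\n", count_misses(2, pattern, 5)); /* 4 */
    printf("fully associative: %d misses\n", count_misses(4, pattern, 5)); /* 3 */
    return 0;
}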

Performance Improvement Data miss rate vs. associativity for a 64 KB cache with 16-word blocks on SPEC2000:
Associativity | Data miss rate
1 | 10.3%
2 | 8.6%
4 | 8.3%
8 | 8.1%

Locating a Block in the Cache (figure): a four-way set-associative cache. The index field of the address selects a set; the four entries of that set (each with a valid bit, tag, and data) are read in parallel, comparators check the stored tags against the tag field of the address, and a multiplexor delivers the data from the way that hits.

Replacement Strategy Direct mapping: no choice. Set-associative and fully associative: several locations are possible for a block, so which one should be replaced? The most commonly used schemes: LRU (Least Recently Used): keep track of which cache line has been accessed most recently and replace the one that has been unused for the longest time. Random: easy to implement and only slightly worse than LRU.

Performance Equations Formula 1: CPU time = (CPU execution clock cycles + memory-stall clock cycles) × clock cycle time. Memory-stall clock cycles come primarily from cache misses. Formula 2: memory-stall clock cycles = (memory accesses / program) × miss rate × miss penalty.

Performance Equations Formula 6: memory-stall clock cycles = (instructions / program) × (memory accesses / instruction) × miss rate × miss penalty. Example: CPI = 2, instruction miss rate = 2%, data miss rate = 4%, miss penalty = 100 clock cycles for all misses, frequency of loads and stores = 25% + 11% = 36%. What is the CPI with memory stalls? CPI = 2 + (2% × 100) + 36% × (4% × 100) = 2 + 2 + 1.44 = 5.44.
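A quick check of the arithmetic in C (the variable names are mine):

#include <stdio.h>

int main(void) {
    double base_cpi = 2.0;
    double i_miss = 0.02, d_miss = 0.04, penalty = 100.0, mem_frac = 0.36;

    /* instruction-fetch stalls + load/store stalls per instruction */
    double stalls = i_miss * penalty + mem_frac * d_miss * penalty;  /* 2 + 1.44 */
    printf("CPI = %.2f\n", base_cpi + stalls);                       /* 5.44 */
    return 0;
}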

Continuing the Same Example What if the processor is made faster but the memory system stays the same? Assume the CPI is reduced from 2 to 1. Then the fraction of execution time spent on memory stalls rises from 3.44/5.44 = 0.63 to 3.44/4.44 = 0.77. Now assume the processor clock rate doubles, so the miss penalty for all misses is 200 clock cycles. Total miss cycles per instruction = (2% × 200) + 36% × (4% × 200) = 6.88, so CPI = 8.88 (compare it to 5.44). Speedup = 5.44 / (8.88 × 0.5) = 1.23.
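And the doubled-clock scenario, again as a hedged sketch:

#include <stdio.h>

int main(void) {
    /* doubling the clock rate doubles the miss penalty in cycles (200) */
    double stalls   = 0.02 * 200 + 0.36 * (0.04 * 200);  /* 6.88 */
    double cpi_fast = 2.0 + stalls;                       /* 8.88 */
    double cpi_slow = 5.44;                               /* from the previous slide */

    /* execution-time ratio: (cpi_slow * T) / (cpi_fast * T/2) */
    printf("CPI = %.2f, speedup = %.2f\n", cpi_fast, cpi_slow / (cpi_fast * 0.5));
    return 0;
}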

Multilevel Caches First-level caches are implemented on-chip in contemporary processors. Second-level caches, which can be on-chip or off-chip in a separate set of SRAMs, are accessed whenever a miss occurs in the primary cache. They are larger, use larger block sizes, and are faster than main memory, but their hit time is higher than that of the first-level cache.

Example: Multilevel Caches CPI = 1.0, clock rate = 5 GHz, main memory access time = 100 ns, miss rate per instruction at the primary cache = 2%. How much faster is the processor if we add a secondary cache with a 5 ns access time that reduces the overall miss rate to 0.5%? The miss penalty to main memory = 100 / 0.2 = 500 cycles. Total CPI = 1.0 + memory-stall cycles per instruction = 1.0 + 2% × 500 = 11.0 (without the secondary cache). The miss penalty of the secondary cache = 5 / 0.2 = 25 cycles. Total CPI = 1.0 + 2% × 25 + 0.5% × 500 = 4.0.
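The same calculation as a small C sketch (assuming, as the slide does, a 0.2 ns clock cycle at 5 GHz):

#include <stdio.h>

int main(void) {
    double cycle_ns     = 1.0 / 5.0;             /* 0.2 ns per cycle */
    double main_penalty = 100.0 / cycle_ns;      /* 500 cycles       */
    double l2_penalty   = 5.0 / cycle_ns;        /* 25 cycles        */

    double cpi_l1_only = 1.0 + 0.02 * main_penalty;                      /* 11.0 */
    double cpi_with_l2 = 1.0 + 0.02 * l2_penalty + 0.005 * main_penalty; /* 4.0  */

    printf("L1 only: CPI = %.1f, with L2: CPI = %.1f, speedup = %.2f\n",
           cpi_l1_only, cpi_with_l2, cpi_l1_only / cpi_with_l2);         /* 2.75 */
    return 0;
}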

Design Considerations First-level cache: the focus is to minimize the hit time; the miss rate can be slightly higher; it uses a smaller block size and tends to be smaller itself. Second-level cache: the focus is to reduce the overall miss rate; access time is less important; its local miss rate can be large; it is larger and uses a larger block size.

Global vs. Local Miss Rate Level 1 cache: 2% local miss rate Level 2 cache: 50% local miss rate What is the overall (global) miss rate?
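Assuming every first-level miss is looked up in the second level, the answer the slide is asking for is L1 miss rate × L2 local miss rate; a one-line sketch:

#include <stdio.h>

int main(void) {
    double l1_local = 0.02, l2_local = 0.50;
    /* fraction of all accesses that miss in both levels */
    printf("global miss rate = %.1f%%\n", 100.0 * l1_local * l2_local); /* 1.0% */
    return 0;
}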