
1 Appendix C Memory Hierarchy

2 Why care about memory hierarchies? The processor-memory performance gap keeps growing, and memory accesses are a major source of stall cycles.

3 Levels of the Memory Hierarchy (capacity / access time / cost per bit, and the staging unit moved between levels)
- Registers: 100s of bytes, <0.5 ns; instruction operands (1-8 bytes), managed by the program/compiler
- Cache: KBytes, 1 ns, 1-0.1 cents/bit; blocks (8-128 bytes), managed by the cache controller
- Main memory: MBytes, 100 ns, 0.0001-0.00001 cents/bit; pages (512 B-4 KB), managed by the OS
- Disk: GBytes, 10 ms (10,000,000 ns), 10^-5 - 10^-6 cents/bit; files (MBytes), managed by the user/operator
- Tape: effectively infinite capacity, sec-min access, 10^-8 cents/bit
Upper levels are faster; lower levels are larger.

4 Motivating memory hierarchies
Two structures hold data: registers (a small array of storage) and memory (a large array of storage). What characteristics would we like memory to have?
- High capacity
- Low latency
- Low cost
We can't satisfy all three requirements with a single memory technology.

5 Memory hierarchy
Solution: use a little bit of everything!
- Small SRAM array (cache): small means fast and cheap
- Larger DRAM array (main memory): hope you rarely have to use it
- Extremely large disk: costs are decreasing at a faster rate than we fill them

6 Terminology
- Hit: the data you want is found at a given level
- Miss: the data is not present at that level; check the next lower level
- Hit rate: fraction of accesses that hit at a given level; miss rate = 1 - hit rate
- Another performance measure, average memory access time: AMAT = hit time + miss rate x miss penalty

7 Memory hierarchy operation
We'd like most accesses to use the cache, the fastest level of the hierarchy, but the cache is much smaller than the address space. Most caches nevertheless have a hit rate above 80%. How is that possible? The cache holds the data most likely to be accessed.

8 Principle of locality
Programs don't access data randomly; they display locality in two forms.
- Temporal locality: if you access a memory location (e.g., 1000), you are more likely to re-access that location than some random location
- Spatial locality: if you access a memory location (e.g., 1000), you are more likely to access a location near it (e.g., 1001) than some random location

9 Cache Basics
A cache is a fast (but small) memory close to the processor. When data is referenced: if it is in the cache, use the cache instead of memory; if not, bring it into the cache (actually, bring in the entire enclosing block), possibly kicking something else out to make room. Important decisions:
- Placement: where in the cache can a block go?
- Identification: how do we find a block in the cache?
- Replacement: what do we kick out to make room?
- Write policy: what do we do about stores?

10 4 Questions for Memory Hierarchy
- Q1: Where can a block be placed in the upper level? (Block placement)
- Q2: How is a block found if it is in the upper level? (Block identification)
- Q3: Which block should be replaced on a miss? (Block replacement)
- Q4: What happens on a write? (Write strategy)

11 Q1: Cache Placement
Placement determines which memory blocks are allowed into which cache lines. Placement policies:
- Direct mapped: a block can go to only one line
- Fully associative: a block can go to any line
- Set-associative: a block can go to one of N lines (e.g., if N = 4, the cache is 4-way set associative)
The other two policies are extremes of set-associativity (e.g., N = 1 gives a direct-mapped cache). A small sketch of the mapping follows.
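The following C sketch (not from the slides; the cache geometry is illustrative) shows how a block number maps to a set under each policy: direct mapped is the 1-way case, and fully associative is the case where all lines form one set.

#include <stdio.h>

/* Which set does a block map to, for a cache of num_lines lines
 * organized as `ways`-way set associative? */
static unsigned set_index(unsigned block_number, unsigned num_lines, unsigned ways)
{
    unsigned num_sets = num_lines / ways;
    return block_number % num_sets;
}

int main(void)
{
    /* Block 12 in an 8-line cache, as on the next slide. */
    printf("direct mapped:     set %u\n", set_index(12, 8, 1)); /* 12 mod 8 = 4 */
    printf("2-way set assoc:   set %u\n", set_index(12, 8, 2)); /* 12 mod 4 = 0 */
    printf("fully associative: set %u\n", set_index(12, 8, 8)); /* only set 0   */
    return 0;
}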

12 Q1: Block placement
Where can block 12 be placed in an 8-block cache (memory has 32 blocks, 0-31)? Set-associative mapping = block number modulo number of sets.
- Fully associative: anywhere
- Direct mapped: (12 mod 8) = block 4
- 2-way set associative: (12 mod 4) = set 0

13 Q2: Cache Identification
When an address is referenced, we need to find whether its data is in the cache and, if so, where; this is called a cache lookup. Each cache line must have:
- A valid bit (1 if the line has data, 0 if the line is empty); we also say the cache line is valid or invalid
- A tag to identify which block is in the line (if the line is valid)

14 Q2: Block identification
The block address is divided into a tag and an index, followed by the block offset. The tag on each block is compared on lookup; there is no need to check the index or block offset. Increasing associativity shrinks the index and expands the tag.

15 Address breakdown
The address is split into tag | index | block offset.
- Block offset: byte address within the block; # block offset bits = log2(block size)
- Index: line (or set) number within the cache; # index bits = log2(# of cache lines)
- Tag: the remaining bits

16 Address breakdown example
Given a 32-bit address, a 32 KB direct-mapped cache, and 64-byte blocks, what are the sizes of the tag, index, and block offset fields?
- Index = 9 bits, since there are 32 KB / 64 B = 2^9 blocks
- Block offset = 6 bits, since each block has 64 B = 2^6 bytes
- Tag = 32 - 9 - 6 = 17 bits
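A short C sketch (not from the slides; the example address is made up) that extracts the three fields for this 32 KB, 64-byte-block configuration:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t addr   = 0x12345678;            /* arbitrary example address   */
    uint32_t offset = addr & 0x3F;           /* low 6 bits (64 B = 2^6)     */
    uint32_t index  = (addr >> 6) & 0x1FF;   /* next 9 bits (512 lines)     */
    uint32_t tag    = addr >> 15;            /* remaining 32 - 15 = 17 bits */
    printf("tag=0x%05x index=%u offset=%u\n", tag, index, offset);
    return 0;
}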

17 Q3: Block replacement
When we need to evict a line, what do we choose? The choice is trivial for direct-mapped caches; for set-associative or fully associative caches we want to evict the data least likely to be used next. Temporal locality suggests that is the line accessed farthest in the past.
- Least recently used (LRU): hard to implement exactly in hardware, so it is often approximated
- Random: a randomly selected line
- FIFO: the line that has been in the cache the longest
A minimal sketch of exact LRU within one set follows.
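A minimal C sketch (not from the slides) of exact LRU victim selection within one set, using a per-line "last used" counter; real hardware usually approximates this with cheaper state such as pseudo-LRU bits.

#include <stdio.h>
#include <stdint.h>

#define WAYS 4

struct way {
    int      valid;
    uint32_t tag;
    uint64_t last_used;   /* access counter value at the most recent use */
};

/* Pick a victim within one set: an invalid way if there is one,
 * otherwise the way with the oldest last_used value (exact LRU). */
static int choose_victim(const struct way set[WAYS])
{
    int victim = 0;
    for (int i = 0; i < WAYS; i++) {
        if (!set[i].valid)
            return i;                   /* free way: no eviction needed */
        if (set[i].last_used < set[victim].last_used)
            victim = i;
    }
    return victim;
}

int main(void)
{
    struct way set[WAYS] = {
        {1, 0x10, 7}, {1, 0x22, 3}, {1, 0x31, 9}, {1, 0x44, 5},
    };
    printf("evict way %d\n", choose_victim(set));   /* way 1: oldest use */
    return 0;
}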

18 Q4: What happens on a write? Write-through vs. write-back:
- Policy: write-through writes data to the cache block and also to lower-level memory; write-back writes data only to the cache and updates the lower level when the block falls out of the cache
- Debugging: write-through is easy; write-back is hard
- Do read misses produce writes? Write-through: no; write-back: yes
- Do repeated writes make it to the lower level? Write-through: yes; write-back: no

19 Write Policy
Do we allocate cache lines on a write?
- Write-allocate: a write miss brings the block into the cache
- No-write-allocate: a write miss leaves the cache as it was
Do we update memory on writes?
- Write-through: memory is immediately updated on each write
- Write-back: memory is updated when the line is replaced

20 Write Buffers for Write-Through Caches
A write buffer sits between the cache and lower-level memory and holds data awaiting write-through to the lower level.
Q: Why a write buffer? A: So the CPU doesn't stall on writes.
Q: Why a buffer, why not just one register? A: Bursts of writes are common.

21 Write-Back Caches
Each line needs a dirty bit: a dirty line has more recent data than memory. A line starts clean (not dirty) and becomes dirty on the first write to it; memory is not updated yet, so the cache holds the only up-to-date copy of the data for a dirty line. When a dirty line is replaced, its data must be written back to memory (write-back).

22 Basic cache design
Cache memory can copy data from any part of main memory. Each entry stores a tag (from the memory address) and a block (the actual data). On each access, compare the address with the tag: if they match, it's a hit and the data comes from the cache block; if they don't, it's a miss and the data comes from main memory.

23 Cache organization
A cache consists of multiple tag/block pairs, called cache lines (or blocks); lines can be searched in parallel (within reason). Each line also has a valid bit, and write-back caches have a dirty bit. Block sizes can vary; most systems use between 32 and 128 bytes. Larger blocks exploit spatial locality, and a larger block size means a smaller tag size.

24 Direct-mapped cache example
Assume the following simple setup:
- Only 2 levels to the hierarchy
- 16-byte memory, 4-bit addresses
- Cache organization: direct-mapped, 8 total bytes, 2 bytes per block (4 lines), write-back
This leads to the following address breakdown: offset 1 bit, index 2 bits, tag 1 bit.

25 Direct-mapped cache example: initial state
Instruction sequence: lb $t0, 1($zero); lb $t1, 8($zero); sb $t1, 4($zero); sb $t0, 13($zero); lb $t1, 9($zero).
Memory contents (address: value): 0:78, 1:29, 2:120, 3:123, 4:71, 5:150, 6:162, 7:173, 8:18, 9:21, 10:33, 11:28, 12:19, 13:200, 14:210, 15:225; two bytes per block, block numbers 0-7.
Cache: all four lines have V=0, D=0, Tag=0, Data=0 0.
Registers: $t0 = ?, $t1 = ?

26 Direct-mapped cache example: access #1 (lookup)
lb $t0, 1($zero): address = 1 = 0001 (binary), so Tag = 0, Index = 00, Offset = 1.
Cache: all lines still invalid. Hits: 0, Misses: 0. Registers: $t0 = ?, $t1 = ?

27 Direct-mapped cache example: access #1 (result)
lb $t0, 1($zero) misses; block 0 (bytes 0-1) is brought into line 00.
Cache line 00: V=1, D=0, Tag=0, Data=78 29; other lines invalid.
Hits: 0, Misses: 1. Registers: $t0 = 29, $t1 = ?

28 Direct-mapped cache example: access #2 (lookup)
lb $t1, 8($zero): address = 8 = 1000 (binary), so Tag = 1, Index = 00, Offset = 0.
Cache line 00: V=1, D=0, Tag=0, Data=78 29; other lines invalid.
Hits: 0, Misses: 1. Registers: $t0 = 29, $t1 = ?

29 Direct-mapped cache example: access #2 (result)
The tag does not match, so this is a miss; block 4 (bytes 8-9) replaces the contents of line 00.
Cache line 00: V=1, D=0, Tag=1, Data=18 21; other lines invalid.
Hits: 0, Misses: 2. Registers: $t0 = 29, $t1 = 18

30 Direct-mapped cache example: access #3 (lookup)
sb $t1, 4($zero): address = 4 = 0100 (binary), so Tag = 0, Index = 10, Offset = 0.
Cache line 00: V=1, D=0, Tag=1, Data=18 21; line 10 invalid.
Hits: 0, Misses: 2. Registers: $t0 = 29, $t1 = 18

31 Direct-mapped cache example: access #3 (result)
The store misses; block 2 (bytes 4-5) is brought into line 10 and the byte at offset 0 is written with 18. Because the cache is write-back, memory is not updated and the line is marked dirty.
Cache line 00: V=1, D=0, Tag=1, Data=18 21; line 10: V=1, D=1, Tag=0, Data=18 150.
Hits: 0, Misses: 3. Registers: $t0 = 29, $t1 = 18

32 Direct-mapped cache example: access #4 (lookup)
sb $t0, 13($zero): address = 13 = 1101 (binary), so Tag = 1, Index = 10, Offset = 1.
Cache line 00: V=1, D=0, Tag=1, Data=18 21; line 10: V=1, D=1, Tag=0, Data=18 150.
Hits: 0, Misses: 3. Registers: $t0 = 29, $t1 = 18

33 Direct-mapped cache example: access #4 (write-back)
The store misses and line 10 is dirty, so its block must be written back first: memory address 4 now holds 18.
Cache (before refill) line 00: V=1, D=0, Tag=1, Data=18 21; line 10: V=1, D=1, Tag=0, Data=18 150.
Hits: 0, Misses: 4. Registers: $t0 = 29, $t1 = 18

34 Direct-mapped cache example: access #4 (result)
Block 6 (bytes 12-13) is brought into line 10 and the byte at offset 1 is written with 29; the line is marked dirty.
Cache line 00: V=1, D=0, Tag=1, Data=18 21; line 10: V=1, D=1, Tag=1, Data=19 29.
Hits: 0, Misses: 4. Registers: $t0 = 29, $t1 = 18

35 Direct-mapped cache example: access #5 (lookup)
lb $t1, 9($zero): address = 9 = 1001 (binary), so Tag = 1, Index = 00, Offset = 1.
Cache line 00: V=1, D=0, Tag=1, Data=18 21; line 10: V=1, D=1, Tag=1, Data=19 29.
Hits: 0, Misses: 4. Registers: $t0 = 29, $t1 = 18

36 Direct-mapped cache example: access #5 (result)
Line 00 is valid with a matching tag, so this access hits; the byte at offset 1 is 21.
Cache line 00: V=1, D=0, Tag=1, Data=18 21; line 10: V=1, D=1, Tag=1, Data=19 29.
Hits: 1, Misses: 4. Registers: $t0 = 29, $t1 = 21
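To make the walkthrough above concrete, here is a toy C simulator (not from the slides) for the same configuration: 16-byte memory and a direct-mapped, write-back, write-allocate cache with four 2-byte lines. Running the five accesses reproduces the final counts (1 hit, 4 misses) and register values.

#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Address layout (4 bits): tag(1) | index(2) | offset(1). */
static uint8_t memory[16] = {78, 29, 120, 123, 71, 150, 162, 173,
                             18, 21, 33, 28, 19, 200, 210, 225};

struct line { int valid, dirty; uint8_t tag; uint8_t data[2]; };
static struct line cache[4];
static int hits, misses;

/* Look up an address; on a miss, write back a dirty victim and refill. */
static struct line *lookup(uint8_t addr)
{
    uint8_t index = (addr >> 1) & 0x3;
    uint8_t tag   = addr >> 3;
    struct line *l = &cache[index];

    if (l->valid && l->tag == tag) { hits++; return l; }

    misses++;
    if (l->valid && l->dirty)                    /* write back the old block */
        memcpy(&memory[(l->tag << 3) | (index << 1)], l->data, 2);
    memcpy(l->data, &memory[addr & ~1u], 2);     /* fetch the new block */
    l->valid = 1; l->dirty = 0; l->tag = tag;
    return l;
}

static uint8_t read_byte(uint8_t addr)            /* models lb */
{
    return lookup(addr)->data[addr & 1];
}

static void write_byte(uint8_t addr, uint8_t v)   /* models sb */
{
    struct line *l = lookup(addr);
    l->data[addr & 1] = v;
    l->dirty = 1;
}

int main(void)
{
    uint8_t t0 = read_byte(1);   /* lb $t0, 1($zero)  -> 29, miss */
    uint8_t t1 = read_byte(8);   /* lb $t1, 8($zero)  -> 18, miss */
    write_byte(4, t1);           /* sb $t1, 4($zero)  -> miss     */
    write_byte(13, t0);          /* sb $t0, 13($zero) -> miss, writes back block 2 */
    t1 = read_byte(9);           /* lb $t1, 9($zero)  -> 21, hit  */
    printf("$t0=%u $t1=%u hits=%d misses=%d\n", t0, t1, hits, misses);
    return 0;
}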

37 Cache performance
Simplified model: CPU time = (CPU clock cycles + memory stall cycles) x cycle time.
Memory stall cycles = number of misses x miss penalty = IC x misses/instruction x miss penalty = IC x memory accesses/instruction x miss rate x miss penalty.
Average CPI = CPI(without stalls) + memory accesses/instruction x miss rate x miss penalty.
AMAT = hit time + miss rate x miss penalty.

38 Example
A computer has CPI = 1 when all accesses hit. Loads and stores are 50% of instructions. If the miss penalty is 25 cycles and the miss rate is 2%, how much faster would the computer be if all instructions were cache hits?
- All hits: CPU time = (IC x CPI + 0) x CCT = IC x 1.0 x CCT
- Real cache with stalls: memory stall cycles = IC x (1 + 0.5) x 0.02 x 25 = IC x 0.75, so CPU time = (IC x 1.0 + IC x 0.75) x CCT = 1.75 x IC x CCT
- Speedup = (1.75 x IC x CCT) / (IC x CCT) = 1.75
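A small C sketch (not from the slides; the variable names are my own) that reproduces the arithmetic of this example:

#include <stdio.h>

int main(void)
{
    double cpi_base = 1.0;            /* CPI when every access hits      */
    double mem_refs_per_instr = 1.5;  /* 1 instruction fetch + 0.5 data  */
    double miss_rate = 0.02;
    double miss_penalty = 25.0;       /* cycles */

    double stall_cpi = mem_refs_per_instr * miss_rate * miss_penalty; /* 0.75 */
    double speedup   = (cpi_base + stall_cpi) / cpi_base;             /* 1.75 */
    printf("stall CPI = %.2f, speedup with a perfect cache = %.2f\n",
           stall_cpi, speedup);
    return 0;
}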

39 Average memory access time
- Unified cache: AMAT = hit time + miss rate x miss penalty
- Split cache: AMAT = %instructions x (hit time + instruction miss rate x miss penalty) + %data x (hit time + data miss rate x miss penalty)
- Multi-level cache: AMAT = hit time(L1) + miss rate(L1) x miss penalty(L1) = hit time(L1) + miss rate(L1) x (hit time(L2) + miss rate(L2) x miss penalty(L2))
The L2 miss rate is measured on the leftovers from the L1 cache, i.e., the accesses that miss in L1.

40 Example (split cache vs. unified cache)
Which has the lower miss rate: a 16 KB instruction cache plus a 16 KB data cache, or a 32 KB unified cache? The misses per 1000 instructions for the instruction, data, and unified caches are 3.82, 40.9, and 43.3, respectively; 36% of instructions are data transfer instructions. Assume a hit takes 1 clock cycle and the miss penalty is 200 cycles, and that a load or store hit takes 1 extra cycle on a unified cache. What is the AMAT for each?
Miss rate = (misses/instruction) / (memory accesses/instruction):
- Miss rate(I) = (3.82/1000) / 1 = 0.004
- Miss rate(D) = (40.9/1000) / 0.36 = 0.114
- Miss rate(U) = (43.3/1000) / (1 + 0.36) = 0.0318
- Miss rate(split) = 74% x 0.004 + 26% x 0.114 = 0.0326
The 32 KB unified cache has a slightly lower miss rate.

41 Example (cont.)
AMAT = %instructions x (hit time + instruction miss rate x miss penalty) + %data x (hit time + data miss rate x miss penalty)
- AMAT(split) = 74% x (1 + 0.004 x 200) + 26% x (1 + 0.114 x 200) = 7.52
- AMAT(unified) = 74% x (1 + 0.0318 x 200) + 26% x (1 + 1 + 0.0318 x 200) = 7.62

42 Another Example (multilevel cache)
Suppose that in 1000 memory references there are 40 misses in the L1 cache and 20 misses in the L2 cache. What are the miss rates? Assume the miss penalty from the L2 cache to memory is 200 cycles, the hit time of the L2 cache is 10 cycles, and the hit time of L1 is 1 cycle. What is the AMAT?
- Miss rate(L1) = 40/1000 = 4%
- Miss rate(L2) = 20/40 = 50% (measured on L1 misses)
- AMAT = hit time(L1) + miss rate(L1) x (hit time(L2) + miss rate(L2) x miss penalty(L2)) = 1 + 4% x (10 + 50% x 200) = 5.4 cycles
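A short C sketch (not from the slides; the helper name is my own) of the two-level AMAT formula used here:

#include <stdio.h>

/* AMAT for two cache levels: the L2 miss rate is local (per L1 miss). */
static double amat_two_level(double hit_l1, double mr_l1,
                             double hit_l2, double mr_l2, double penalty_l2)
{
    return hit_l1 + mr_l1 * (hit_l2 + mr_l2 * penalty_l2);
}

int main(void)
{
    double mr_l1 = 40.0 / 1000.0;   /* 4%: 40 misses per 1000 references   */
    double mr_l2 = 20.0 / 40.0;     /* 50%: 20 misses out of 40 L1 misses  */
    printf("AMAT = %.1f cycles\n",
           amat_two_level(1.0, mr_l1, 10.0, mr_l2, 200.0));  /* 5.4 */
    return 0;
}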

43 Reasons for cache misses
AMAT = hit time + miss rate x miss penalty, so reducing misses improves performance. The three C's:
- Compulsory miss: first reference to an address; reduced by increasing the block size
- Capacity miss: the cache is too small to hold the data; reduced by increasing the cache size
- Conflict miss: a block is replaced from a busy line or set (it would have hit in a fully associative cache); reduced by increasing associativity

44 Six Basic Cache Optimizations
Reducing miss rate:
1. Larger block size (compulsory misses)
2. Larger cache size (capacity misses)
3. Higher associativity (conflict misses)
Reducing miss penalty:
4. Multilevel caches
5. Giving read misses priority over writes (e.g., the read completes before earlier writes still sitting in the write buffer)
Reducing hit time:
6. Avoiding address translation during cache indexing

45 Problems with memory
DRAM is too expensive to buy many gigabytes. We need our programs to work even if they require more memory than we have: a program that works on a machine with 512 MB should still work on a machine with 256 MB. In addition, most systems run multiple programs.

46 Solutions
- Leave the problem to the programmer: assume the programmer knows the exact configuration
- Overlays: the compiler identifies mutually exclusive regions
- Virtual memory: use hardware and software to automatically translate references from virtual addresses (what the programmer sees) to physical addresses (an index into DRAM or disk)

47 Benefits of virtual memory
User programs run in a standardized virtual address space. Address translation hardware, managed by the operating system (OS), maps each virtual address to a physical address. The hardware supports "modern" OS features: protection, translation, and sharing.

48 Managing virtual memory
Effectively treat main memory as a cache: blocks are called pages, and misses are called page faults. A virtual address consists of a virtual page number and a page offset; in the 32-bit layout shown, bits 11:0 are the page offset and bits 31:12 are the virtual page number.

49 Page tables encode virtual address spaces
A virtual address space is divided into blocks of memory called pages; a machine usually supports pages of a few sizes (e.g., the MIPS R4000). A valid page table entry encodes the physical memory "frame" address for the page.

50 Page tables encode virtual address spaces (cont.)
A page table is indexed by the virtual address; each valid entry gives the physical memory frame that holds the page.

51 Details of the Page Table
The page table maps virtual page numbers to physical frames ("PTE" = page table entry); virtual memory effectively treats main memory as a cache for disk. The virtual address is split into a virtual page number and a 12-bit page offset. The page number indexes into the page table, which is located in physical memory and found via the page table base register; each entry holds a valid bit, access rights, and the physical frame number. The physical address is the frame number combined with the page offset.

52 Paging the page table
A table for 4 KB pages over a 32-bit address space has 1M entries, and each process needs its own address space. Two-level page tables split the 32-bit virtual address into a P1 index (bits 31:22), a P2 index (bits 21:12), and a page offset (bits 11:0). The top-level table is wired in main memory; only a subset of the 1024 second-level tables is in main memory, and the rest are on disk or unallocated.
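A C sketch (not from the slides; the structure layout and names are hypothetical) of a two-level lookup using the index split above. A real MMU/OS would also track valid bits, permissions, and dirty/used bits per entry.

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint32_t *tables[1024];   /* top level: pointers to second-level tables */
} page_dir_t;

/* Returns the physical address, or 0 to signal a page fault. */
static uint32_t translate(const page_dir_t *dir, uint32_t va)
{
    uint32_t p1     = va >> 22;            /* bits 31:22 */
    uint32_t p2     = (va >> 12) & 0x3FF;  /* bits 21:12 */
    uint32_t offset = va & 0xFFF;          /* bits 11:0  */

    uint32_t *second = dir->tables[p1];
    if (second == NULL)
        return 0;                          /* second-level table not present */
    uint32_t frame = second[p2];
    if (frame == 0)
        return 0;                          /* page not mapped */
    return (frame << 12) | offset;
}

int main(void)
{
    static uint32_t second_level[1024];
    static page_dir_t dir;
    dir.tables[1] = second_level;
    second_level[2] = 0x123;               /* map VA 0x00402000 to frame 0x123 */
    printf("pa = 0x%08x\n", translate(&dir, 0x00402ABC));  /* 0x00123ABC */
    return 0;
}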

53 VM and Disk: page replacement policy
Each page has a used bit (set to 1 on any reference) and a dirty bit (set when the page is written); the architect's role is to support setting the dirty and used bits. A clock-style sweep runs over the set of all pages in memory: the tail pointer clears the used bit in the page table, and the head pointer places pages on the free list if the used bit is still clear, scheduling pages with the dirty bit set to be written to disk. The free list supplies free pages.
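A simplified one-handed C sketch (not from the slides, and a simplification of the two-pointer scheme described above): the hand clears used bits as it passes and reclaims a page whose used bit is already clear, scheduling dirty pages for write-back first.

#include <stdio.h>
#include <stdbool.h>

#define NPAGES 8

struct page { bool used, dirty; };

static struct page pages[NPAGES];
static int hand;

/* Sweep until a reclaimable page is found; return its index. */
static int reclaim_one(void)
{
    for (;;) {
        struct page *p = &pages[hand];
        int victim = hand;
        hand = (hand + 1) % NPAGES;
        if (p->used) {
            p->used = false;           /* second chance: clear and move on */
        } else {
            if (p->dirty)
                printf("schedule page %d for write-back\n", victim);
            return victim;             /* place this page on the free list */
        }
    }
}

int main(void)
{
    pages[0].used = true;
    pages[1].used = true;
    pages[1].dirty = true;
    printf("reclaimed page %d\n", reclaim_one());   /* page 2 */
    return 0;
}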

54 Virtual memory performance
Address translation requires a physical memory access to read the page table, and then physical memory must be accessed again to actually get the data. As a result, each load performs at least 2 memory reads, and each store performs at least 1 memory read followed by a write.

55 Improving virtual memory performance
Use a cache for common translations: the translation lookaside buffer (TLB). Each TLB entry holds a valid bit, a tag (the virtual page number), and the corresponding physical page number; the page offset bypasses translation.
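A small C sketch (not from the slides; the geometry is hypothetical) of a fully associative TLB lookup for 4 KB pages; a miss would fall back to a page-table walk and then refill an entry.

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define TLB_ENTRIES 16

struct tlb_entry { bool valid; uint32_t vpn, pfn; };

static struct tlb_entry tlb[TLB_ENTRIES];

/* Returns true and fills *pa on a TLB hit. */
static bool tlb_translate(uint32_t va, uint32_t *pa)
{
    uint32_t vpn = va >> 12;                 /* page offset bypasses translation */
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *pa = (tlb[i].pfn << 12) | (va & 0xFFF);
            return true;
        }
    }
    return false;                            /* TLB miss: walk the page table */
}

int main(void)
{
    tlb[0] = (struct tlb_entry){ true, 0x00400, 0x12345 };
    uint32_t pa;
    if (tlb_translate(0x0040073C, &pa))
        printf("pa = 0x%08x\n", pa);         /* 0x1234573C */
    return 0;
}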

56 Caches and virtual memory
We have two different addresses, virtual and physical. Which should we use to access the cache?
- Physical address: simpler to manage, but slower access
- Virtual address: faster access, but aliasing problems and difficult management
- Use both: virtually indexed, physically tagged

57 Three Advantages of Virtual Memory
Translation:
- Programs can be given a consistent view of memory even though physical memory is scrambled
- Makes multithreading reasonable (now used a lot!)
- Only the most important part of a program (the "working set") must be in physical memory
- Contiguous structures (like stacks) use only as much physical memory as necessary yet can still grow later
Protection:
- Different threads (or processes) are protected from each other
- Different pages can be given special behavior (read only, invisible to user programs, etc.)
- Kernel data is protected from user programs; very important for protection from malicious programs
Sharing:
- The same physical page can be mapped to multiple users ("shared memory"), letting programs share physical memory without knowing what else is there
- Makes memory appear larger than it actually is

58 Average memory access time
AMAT = hit time + miss rate x miss penalty. Given a cache with a 1-cycle access time, memory with a 100-cycle access time, and a disk with a 10,000-cycle access time, what is the average memory access time if the cache hit rate is 90% and the memory hit rate is 80%?
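The slide leaves the question open. A short C sketch (not from the slides) of one way to evaluate it, treating the 90% and 80% figures as local hit rates at the cache and memory levels and applying the two-level AMAT form from slide 39:

#include <stdio.h>

int main(void)
{
    double cache_time = 1.0, mem_time = 100.0, disk_time = 10000.0;
    double cache_miss = 1.0 - 0.90;   /* local miss rate at the cache */
    double mem_miss   = 1.0 - 0.80;   /* local miss rate at memory    */

    double amat = cache_time + cache_miss * (mem_time + mem_miss * disk_time);
    printf("AMAT = %.0f cycles\n", amat);   /* 1 + 0.1 * (100 + 0.2 * 10000) = 211 */
    return 0;
}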

