Computer Architecture Memory Hierarchy and Caches


1 Computer Architecture Memory Hierarchy and Caches
Muhammad Abid DCIS, PIEAS

2 Memory Subsystem The performance of a computer system depends not only on the microprocessor but also on its memory subsystem. Why? Instructions and data are stored in memory. A good architect tries to improve the performance of both the microprocessor and the memory subsystem.

3 Why Memory Hierarchy? We want memory that is both fast and large
[Diagram: CPU with register file (RF) and cache, backed by main memory (DRAM) and a hard disk]

4 Why Memory Hierarchy? We want both fast and large
But we cannot achieve both with a single level of memory Solution: Have multiple levels of storage (progressively bigger and slower as the levels are farther from the processor) Exploit spatial and temporal locality in programs and ensure most of the data the processor needs is kept in the faster levels

5 The Memory Hierarchy
Move what you use to the fast but small levels near the processor; back everything up in the big but slow levels below. Levels get faster and more expensive per byte toward the processor, and cheaper per byte away from it. With good locality of reference, memory appears as fast as the fastest level and as large as the largest level.

6 Locality One’s recent past is a very good predictor of his/her near future. Temporal Locality: If you just did something, it is very likely that you will do the same thing again soon (since you are here today, there is a good chance you will be here again and again, regularly). Spatial Locality: If you did something, it is very likely you will do something similar/related (in space) (every time I find you in this room, you are probably sitting close to the same people).

7 Memory Locality A “typical” program has a lot of locality in memory references; typical programs are composed of “loops”. Temporal: A program tends to reference the same memory location many times, all within a small window of time. Spatial: A program tends to reference a cluster of memory locations at a time. Most notable examples: 1. instruction memory references 2. array/data structure references
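A minimal C sketch (not from the slides) of where both kinds of locality come from in a typical loop:

#include <stddef.h>

/* Summing an array. Spatial locality: a[i] walks through consecutive
 * addresses, so one fetched cache block serves the next few elements too.
 * Temporal locality: sum, i, n and the loop instructions themselves are
 * touched on every iteration. */
double sum_array(const double *a, size_t n)
{
    double sum = 0.0;               /* reused every iteration (temporal) */
    for (size_t i = 0; i < n; i++)
        sum += a[i];                /* sequential accesses (spatial) */
    return sum;
}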

8 Caching Basics: Exploit Temporal Locality
Idea: Store recently accessed data in automatically managed fast memory (called cache) Anticipation: the data will be accessed again soon Temporal locality principle: Recently accessed data will be accessed again in the near future

9 Caching Basics: Exploit Spatial Locality
Idea: Store addresses adjacent to the recently accessed one in automatically managed fast memory Logically divide memory into equal-size blocks Fetch the accessed block into the cache in its entirety Anticipation: nearby data will be accessed soon Spatial locality principle: Nearby data in memory will be accessed in the near future E.g., sequential instruction access, array traversal This is what the IBM 360/85 implemented: a 16 Kbyte cache with 64-byte blocks

10 The Bookshelf Analogy Levels: book in your hand, desk, bookshelf, book boxes at home
Recently-used books tend to stay on the desk (Comp Arch books, books for classes you are currently taking) until the desk gets full Adjacent books on the shelf are needed around the same time (if I have organized/categorized my books well on the shelf)

11 A Note on Manual vs. Automatic Management
Manual: Programmer manages data movement across levels -- too painful for programmers on substantial programs still done in some embedded processors (on-chip scratch pad SRAM in lieu of a cache) Automatic: Hardware manages data movement across levels, transparently to the programmer ++ programmer’s life is easier simple heuristic: keep most recently used items in cache the average programmer doesn’t need to know about it You don’t need to know how big the cache is and how it works to write a “correct” program! (What if you want a “fast” program?)

12 A Modern Memory Hierarchy
Register file: 32 words, sub-nsec (managed by manual/compiler register spilling) L1 cache: ~32 KB, ~nsec L2 cache: 512 KB ~ 1 MB, many nsec L3 cache, ... (automatic HW cache management) Main memory (DRAM): GBs, ~100 nsec Swap disk: 100 GB, ~10 msec (automatic demand paging)

13 Review of previous lecture
We need a memory hierarchy to make the memory subsystem appear fast and large. Program features: spatial and temporal locality. Caches exploit spatial and temporal locality by storing blocks of data. Without caches, spatial and temporal locality still exist, but program execution is not fast since the processor needs to read data from memory. The processor moves blocks of data between caches and memory to make execution fast, assuming spatial and temporal locality exist.

14 Cache Basics and Operation

15 Cache Most commonly, in the on-die context: an automatically-managed memory hierarchy based on SRAM. It stores in SRAM the most frequently accessed DRAM memory locations, to avoid repeatedly paying for the DRAM access latency.

16 Caching Basics Block (line): Unit of storage in the cache
Memory is logically divided into cache blocks that map to locations in the cache When data referenced HIT: If in cache, use cached data instead of accessing memory MISS: If not in cache, bring block into cache Maybe have to kick something else out to do it Some important cache design decisions Placement: where and how to place/find a block in cache? Replacement: what data to remove to make room in cache? Granularity of management: large, small, uniform blocks? Write policy: what do we do about writes? Instructions/data: Do we treat them separately?

17 Cache Abstraction and Metrics
Cache hit rate = (# hits) / (# accesses) = (# hits) / (# hits + # misses) Cache miss rate = (# misses ) / (# accesses) = (# misses ) / (# hits + # misses) cache hit rate + cache miss rate = 1 Address Tag Store (is the address in the cache? + bookkeeping) Data Store Hit/miss? Data

18 Blocks and Addressing the Cache
Memory is logically divided into cache blocks Each block maps to a location in the cache, determined by the index bits in the address: 2^(index bits) = cache size / (block size × associativity) The index bits are used to index into the tag and data stores Cache access: index into the tag and data stores with the index bits of the address, check the valid bit in the tag store, and compare the tag bits of the address with the stored tag in the tag store If a block is in the cache (cache hit), the tag store holds the tag of the block at that block's index Example 8-bit address: tag (2 bits) | index (3 bits) | byte in block (3 bits)
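As a small illustration (hypothetical C, matching the 8-bit example above with 3 index bits and 3 byte-in-block bits), the fields can be extracted with shifts and masks:

#include <stdio.h>

/* Hypothetical helper matching the 8-bit example above:
 * 3 byte-in-block bits, 3 index bits, remaining bits are the tag. */
#define OFFSET_BITS 3u
#define INDEX_BITS  3u

int main(void)
{
    unsigned addr   = 192; /* example address used on the next slide */
    unsigned offset = addr & ((1u << OFFSET_BITS) - 1);
    unsigned index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    unsigned tag    = addr >> (OFFSET_BITS + INDEX_BITS);
    printf("addr=%u tag=%u index=%u offset=%u\n", addr, tag, index, offset);
    return 0;
}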

19 Direct-Mapped Cache: Placement
Assume byte-addressable memory: 256 bytes, 8-byte blocks → 32 blocks Assume cache: 64 bytes, 8 blocks Direct-mapped: A block can go to only one location Addresses with the same index contend for the same location, causing conflict misses [Diagram: tag store (valid bit, tag, =? comparator) and data store (MUX) produce Hit? and Data]

20 Direct-Mapped Cache: Access
Example byte addresses: 31, 192 (tag 2 bits | index 3 bits | byte in block 3 bits) [Diagram: the index selects one tag-store entry (valid bit, tag) and one data-store block; the =? comparison produces Hit? and a MUX selects the byte in the block]
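A minimal sketch (illustrative names, not the lecture's code) of a direct-mapped lookup for the 64-byte, 8-block cache assumed above:

#include <stdbool.h>
#include <stdint.h>

#define NUM_BLOCKS 8
#define BLOCK_SIZE 8

typedef struct {
    bool    valid;
    uint8_t tag;
    uint8_t data[BLOCK_SIZE];
} Line;

static Line cache[NUM_BLOCKS];

/* Returns true on a hit; on a miss the caller would fetch the block from
 * memory and install it at 'index', possibly evicting the block there. */
bool lookup(uint8_t addr)
{
    uint8_t index = (addr / BLOCK_SIZE) % NUM_BLOCKS;
    uint8_t tag   = addr / (BLOCK_SIZE * NUM_BLOCKS);
    return cache[index].valid && cache[index].tag == tag;
}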

21 Direct-Mapped Caches Problem
Direct-mapped cache: Two blocks in memory that map to the same index in the cache cannot be present in the cache at the same time One index → one entry Can lead to a 0% hit rate if blocks that map to the same index are accessed in an interleaved manner Assume addresses A and B have the same index bits but different tag bits A, B, A, B, A, B, A, B, … → conflict in the cache index All accesses are conflict misses

22 Set Associativity Block addresses 0 and 8 always conflict in a direct-mapped cache Instead of having one column of 8 blocks, have 2 columns of 4 blocks [Diagram: each set has two (valid, tag) entries and two data blocks; two =? comparators plus logic select the hitting way, and MUXes pick the block and the byte in the block] Address: tag (3 bits) | index (2 bits) | byte in block (3 bits) Associative memory within the set -- More complex, slower access, larger tag store + Accommodates conflicts better (fewer conflict misses)

23 Higher Associativity 4-way
-- More tag comparators and wider data mux; larger tags + Likelihood of conflict misses even lower [Diagram: four =? tag comparators feed the hit logic and the data-store MUXes]

24 Full Associativity Fully associative cache
A block can be placed in any cache location [Diagram: =? tag comparators for all eight blocks feed the hit logic and the data-store MUXes]

25 Associativity (and Tradeoffs)
How many blocks can map to the same index (or set)? Higher associativity ++ Higher hit rate -- Slower cache access time (hit latency and data access latency) -- More expensive hardware (more comparators) Diminishing returns from higher associativity [Plot: hit rate vs. associativity, flattening out at higher associativity]

26 Which block to replace on a cache miss?
Which block in the set to replace on a cache miss? Any invalid block first If all are valid, consult the replacement policy FIFO Least recently used (how to implement?) Random

27 Implementing LRU Idea: Evict the least recently accessed block
Problem: Need to keep track of access ordering of blocks, i.e. timestamps Question: 2-way set associative cache: What do you need to implement LRU? Question: 4-way set associative cache: How many different access orderings possible for the 4 blocks in the set? How many bits needed to encode the LRU order of a block? What is the logic needed to determine the LRU victim?
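One possible answer, sketched in C under the assumption of a simple global access counter used as a timestamp (real hardware usually encodes the ordering more compactly):

#include <stdbool.h>
#include <stdint.h>

#define WAYS 4

typedef struct {
    bool     valid;
    uint32_t tag;
    uint64_t last_used; /* global access counter value at last access */
} Way;

static uint64_t access_counter;

/* Pick the victim: any invalid way first, otherwise the oldest way. */
int lru_victim(Way set[WAYS])
{
    int victim = 0;
    for (int w = 0; w < WAYS; w++) {
        if (!set[w].valid)
            return w;
        if (set[w].last_used < set[victim].last_used)
            victim = w;
    }
    return victim;
}

void touch(Way *way) { way->last_used = ++access_counter; }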

28 Approximations of LRU Most modern processors do not implement “true LRU” in highly-associative caches Why? True LRU is complex Pseudo-LRU: associate a bit with each block; set it when the block is accessed If all bits for a set become set, reset the bits for that set with the exception of the most recently set bit Victim: a block whose bit is off; choose randomly if more than one bit is off
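A small C sketch of exactly this one-bit-per-block approximation (structure and names are illustrative):

#include <stdbool.h>

#define WAYS 4

typedef struct {
    bool used[WAYS]; /* "recently used" bit per block in the set */
} PLRUSet;

void plru_touch(PLRUSet *s, int way)
{
    s->used[way] = true;
    /* If every bit is now set, clear all except the most recently set one. */
    bool all = true;
    for (int w = 0; w < WAYS; w++)
        all = all && s->used[w];
    if (all)
        for (int w = 0; w < WAYS; w++)
            s->used[w] = (w == way);
}

/* Victim: any block whose bit is off (a real design might pick randomly
 * among them; here we simply take the first). */
int plru_victim(const PLRUSet *s)
{
    for (int w = 0; w < WAYS; w++)
        if (!s->used[w])
            return w;
    return 0; /* not reached if plru_touch keeps at least one bit clear */
}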

29 What’s In A Tag Store Entry?
Valid bit Tag Replacement policy bits Dirty bit? Write back vs. write through caches

30 Handling Writes (Stores)
When do we write the modified data in a cache to the next level? Write back: When the block is evicted Write through: At the time the write happens Write-back + Can consolidate multiple writes to the same block before eviction Potentially saves bandwidth between cache levels + saves energy -- Need a bit in the tag store indicating the block is “modified” Write-through + Simpler + All levels are up to date. Consistency: Simpler cache coherence -- More bandwidth intensive; no coalescing of writes

31 Handling Writes (Stores)
Do we allocate a cache block on a write miss? Allocate on write miss: Yes No-allocate on write miss: No Allocate on write miss + Can consolidate writes instead of writing each of them individually to next level + Simpler because write misses can be treated the same way as read misses -- Requires transfer of the whole cache block No-allocate + Conserves cache space if locality of writes is low (potentially better cache hit rate)

32 Handling Writes (Stores)
Write-through/write-back caches and allocate/no-allocate on a write miss: typically, write-back caches allocate a block on a write miss, while write-through caches don't allocate a block on a write miss Multicore processors can implement a write-through policy for the L1 caches to keep the L2 cache consistent; however, the L2 cache can implement a write-back policy to reduce bandwidth utilization

33 Write Buffer & Write-through Caches
In write-through caches the processor waits for data to be written to the lower level, e.g. memory Write buffer: with a write buffer, the processor just writes the data into the buffer and can continue as soon as that write is done Writes to the same block are merged The memory update is overlapped with processor execution

34 Victim Buffer & Write-back Caches
Dirty replaced blocks are written to a victim buffer rather than to the lower level, e.g. memory

35 Miss Causes: 3 C’s
Compulsory or cold miss: the first-ever reference to a given block of memory; occurs in all types of caches Measure: number of misses in an infinite-cache model Capacity miss: the working set exceeds the cache capacity; useful blocks (with future references) are replaced and later retrieved from the lower level, e.g. memory Measure: additional misses in a fully-associative cache

36 Miss Causes: 3 C’s Conflict miss
occurs in a direct-mapped or set-associative cache: too many blocks map to an index, so a block is replaced and later retrieved Measure: misses that occur because of conflicts

37 Reducing 3 C’s Compulsory misses: Capacity misses: Conflict misses:
Compulsory misses: increase the block size Capacity misses: increase the cache size Conflict misses: increase associativity; spread out references to more indices

38 Hierarchical Latency Analysis
For a given memory hierarchy level i, it has a technology-intrinsic access time of ti The perceived access time Ti is longer than ti Except for the outer-most hierarchy level, when looking for a given address there is a chance (hit-rate hi) you “hit” and the access time is ti, and a chance (miss-rate mi) you “miss” and the access time is ti + Ti+1, with hi + mi = 1 Thus Ti = hi·ti + mi·(ti + Ti+1) = ti + mi·Ti+1

39 Hierarchy Design Considerations
Recursive latency equation Ti = ti + mi·Ti+1 The goal: achieve the desired T1 within the allowed cost Ti ≈ ti is desirable Keep mi low: increasing capacity Ci lowers mi, but beware of increasing ti; lower mi by smarter management (replacement: anticipate what you don’t need; prefetching: anticipate what you will need) Keep Ti+1 low: faster lower levels, but beware of increasing cost; introduce intermediate hierarchy levels as a compromise

40 Intel Pentium 4 Example 90nm P4, 3.6 GHz L1 cache L2 cache Main memory
L1 cache: C1 = 16 KB, t1 = 4 cyc L2 cache: C2 = 1024 KB, t2 = 18 cyc Main memory: t3 = ~50 ns or 180 cyc If m1 = 0.1, m2 = 0.1: T1 = 7.6, T2 = 36 If m1 = 0.01, m2 = 0.01: T1 = 4.2, T2 = 19.8 If m1 = 0.05, m2 = 0.01: T1 = 5.00, T2 = 19.8 If m1 = 0.01, m2 = 0.50: T1 = 5.08, T2 = 108
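A tiny sketch of that recursion applied to the first row of numbers (variable names are illustrative; main memory is treated as the outermost level, so its perceived time equals its intrinsic 180 cycles):

#include <stdio.h>

/* Evaluate Ti = ti + mi * Ti+1 from the outermost level inward. */
int main(void)
{
    double t1 = 4.0, t2 = 18.0, t3 = 180.0; /* intrinsic times in cycles */
    double m1 = 0.1, m2 = 0.1;              /* per-level miss rates */

    double T3 = t3;             /* outermost level: perceived == intrinsic */
    double T2 = t2 + m2 * T3;   /* 18 + 0.1*180 = 36   */
    double T1 = t1 + m1 * T2;   /* 4  + 0.1*36  = 7.6  */

    printf("T1 = %.1f cycles, T2 = %.1f cycles\n", T1, T2);
    return 0;
}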

41 Reducing Memory Access Time
AMAT = access time + Miss rate * Miss penalty Reducing miss rate: larger block size larger cache size higher associativity smarter management (replacement::anticipate what you don’t need, prefetching::anticipate what you will need)

42 Reducing Memory Access Time
AMAT = access time + Miss rate * Miss penalty Reducing miss penalty: multilevel caches prioritizing reads early restart / critical word first for reads Reducing access time: avoiding VA to PA translation using a simple/small cache, e.g. direct-mapped accessing the tag and data store in parallel for reads/writes

43 Virtual Memory

44 Physical Addressing CPU generates PA to access memory Where used?
(no virtual addresses) Early PCs used it; digital signal processors, embedded microcontrollers, and Cray supercomputers still use it

45 Virtual addressing CPU generates VA Where used?
Most laptops, servers, and modern PCs

46 Memory: Programmers’ View

47 Virtual Memory Virtual memory is imaginary memory Why virtual memory?
It gives you the illusion of memory that’s not physically there: the illusion of a large address space, provided separately for each program Why virtual memory? Using physical memory efficiently Using physical memory simply Using physical memory safely

48 Using Physical Memory Efficiently
Key idea: only the active portion of the program is loaded into memory, achieved using demand paging: the operating system copies a disk page into physical memory only if an attempt is made to access it → the rest of the virtual address space is stored on disk Virtual memory gets the most out of physical memory Keep only active areas of the virtual address space in fast memory Transfer data back and forth as needed

49 Using Physical Memory Efficiently
Prob: A page is on disk and not in memory, i.e. a page fault Sol: An OS routine is called to load the data from disk to memory; the current process suspends execution; the OS has full control over placement

50 Using Physical Memory Efficiently
Demand paging:

51 Using Physical Memory Simply
Key idea: Each process has its own virtual address space defined by its page table Simplifies memory allocation: a virtual page can be mapped to any physical page Simplifies sharing code and data among processes: the OS can map virtual pages to the same shared physical page Simplifies loading the program anywhere in memory: just change the mapping in the page table; physical memory need not be contiguous Simplifies allocating memory to applications: malloc() in C

52 Using Physical Memory Simply

53 Using Physical Memory Safely
Key idea: protection Processes cannot interfere with each other because they operate in different address spaces User processes can read but cannot modify privileged information, e.g. I/O Different sections of the address space have different permissions, e.g. read-only, read/write, execute; permission bits are kept in the page table entries

54 Using Physical Memory Safely

55 Virtual Memory Implementation

56 HW/SW Support for Virtual Memory
Processor must support virtual addresses & physical addresses support two modes of execution: user & supervisor protect privileged info: a user process can only read (not modify) the user/supervisor mode bit, page table pointer, and TLB support switching between user mode and supervisor mode and vice versa (e.g. a user process wants to perform an I/O operation, or requests to allocate memory, e.g. malloc()) check protection bits while making a reference to memory

57 HW/SW Support for Virtual Memory
Operating System creates separate page tables for processes so they don’t interfere all page tables reside in the OS address space and so user process cannot update/modify them assists processes to share data by making changes in the page tables assigns protection bits in the page tables, e.g. Read-only

58 Paging
One of the techniques to implement virtual memory [Diagram: the 4 GB virtual address spaces of Program 1 and Program 2 are mapped onto a 16 MB physical memory]

59 Paging Assigns virtual address space to each process
Each “program” gets its own separate virtual address space Divides this address space into fixed-sized chunks called virtual pages Divides the physical address space into fixed-sized chunks called physical pages (also called frames) Size of virtual page == size of physical page

60 Paging Maps virtual pages to physical pages
For virtual pages to be used, they must be mapped to physical pages This mapping is stored in page tables

61 Paging-4Qs Where can a page be placed in main memory?
The page miss penalty is very high → we want the lowest miss rate → fully-associative placement Trade-off between placement-algorithm complexity and miss rate How is a page found if it’s in main memory? Using a page table indexed by the virtual page number

62 Paging-4Qs which page should be replaced if main memory is full?
The OS uses an LRU replacement policy: it replaces the page that was referenced/touched longest ago (a reference or use bit for each page in the page table) What happens when a write is performed on a page? A write-back policy, because the access time of a magnetic disk is very high, i.e. millions of cycles (a dirty bit for each page in the page table)

63 Paging in Intel 80386 Intel 80386 (Mid 80s) 32-bit processor
4KB virtual/physical pages Q: What is the size of a virtual address space? A: 2^32 = 4GB Q: How many virtual pages per virtual address space? A: 4GB/4KB = 1M Q: What is the size of the physical address space? A: assume 2^32 = 4GB Q: How many physical pages in the physical address space? A: 4GB/4KB = 1M

64 Intel 80386: Virtual Pages
[Diagram: the 32-bit virtual address space (0 KB to 4 GB) is divided into 4 KB virtual pages, Virtual Page 0 through Virtual Page 1M-1; in the 32-bit virtual address, bits [31:12] select the virtual page and bits [11:0] select the byte within it]

67 Intel 80386: Translation
Assume: Virtual Page 7 is mapped to Physical Page 32 For an access to Virtual Page 7: [Diagram: virtual address = VPN (bits [31:12]) | offset (bits [11:0]); the VPN is translated to the PPN of Physical Page 32 while the 12-bit offset is unchanged, giving the physical address]

68 Intel 80386: Flat Page Table VA = 32 bits; page size = 4 KB (12-bit offset); PTE = 4 bytes; Page table size = 2^20 * 4B = 4MB

69 Intel 80386: Flat Page Table
uint32 PAGE_TABLE[1<<20]; PAGE_TABLE[7]=2; [Diagram: the VPN (bits [31:12] of the virtual address) indexes the flat PAGE_TABLE (entries PTE0 … PTE(1<<20)-1, mostly NULL); the selected PTE supplies the PPN, which is combined with the 12-bit offset to form the physical address]
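A hedged sketch of what the flat-table lookup amounts to (uint32 is typedef'd here; the valid/protection-bit checks a real MMU performs are omitted):

#include <stdint.h>

typedef uint32_t uint32;

#define PAGE_SHIFT 12              /* 4 KB pages */
#define NUM_VPAGES (1u << 20)

uint32 PAGE_TABLE[NUM_VPAGES];     /* PTE: PPN, with 0 meaning "not mapped" */

uint32 translate(uint32 va)
{
    uint32 vpn    = va >> PAGE_SHIFT;
    uint32 offset = va & ((1u << PAGE_SHIFT) - 1);
    uint32 ppn    = PAGE_TABLE[vpn];   /* a real MMU also checks valid/protection bits */
    return (ppn << PAGE_SHIFT) | offset;
}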

70 Intel 80386: Page Table Problem #1: Page table is too large
Two problems with page tables Problem #1: Page table is too large Page table has 1M entries Each entry is 4B (because 4B = 20-bit PPN + protection bits) Page table = 4MB (!!) very expensive in the 80s Problem #2: Page table is stored in memory Before every memory access, always fetch the PTE from the slow memory? → Large performance penalty

71 Problem #1: Page table is too large
Page Table: A “lookup table” for the mappings Can be thought of as an array Each element in the array is called a page table entry (PTE) uint32 PAGE_TABLE[1<<20]; PAGE_TABLE[65]=981; PAGE_TABLE[3161]=1629; PAGE_TABLE[9327]=524; ...

72 Problem #1: Page table is too large
Typically, the vast majority of PTEs are empty PAGE_TABLE[0]=141; ... PAGE_TABLE[532]=1190; PAGE_TABLE[534]=NULL; PAGE_TABLE[ ]=NULL; PAGE_TABLE[ ]=845; PAGE_TABLE[ ]=742; // =(1<<20)-1; Q: Why? − A: Virtual address space is extremely large empty

73 Problem #1: Page table is too large
Solution 01: “Unallocate” the empty PTEs to save space PAGE_TABLE[0]=141; ... PAGE_TABLE[532]=1190; PAGE_TABLE[534]=NULL; PAGE_TABLE[ ]=NULL; PAGE_TABLE[ ]=845; PAGE_TABLE[ ]=742; // =(1<<20)-1; Unallocating every single empty PTE is tedious Instead, unallocate only long stretches of empty PTEs

74 Problem #1: Page table is too large
To allow PTEs to be “unallocated” … the page table must be restructured Before restructuring: flat uint32 PAGE_TABLE[1024*1024]; PAGE_TABLE[7]=2; PAGE_TABLE[1023]=381; Sol 02: After restructuring: hierarchical uint32 *PAGE_DIRECTORY[1024]; PAGE_DIRECTORY[0]=malloc(sizeof(uint32)*1024); PAGE_DIRECTORY[0][7]=2; PAGE_DIRECTORY[0][1023]=381; PAGE_DIRECTORY[1]=NULL; // 1024 PTEs unallocated PAGE_DIRECTORY[2]=NULL; // 1024 PTEs unallocated

75 Problem #1: Page table is too large
PAGE_DIRECTORY[0][7]=2; [Diagram: the 20-bit VPN (VPN[19:0]) is split into a 10-bit directory index and a 10-bit table index; PAGE_DIR entry PDE0 holds &PT0, a pointer to PAGE_TABLE0, whose entry PTE7 holds the mapping; unused PDEs and PTEs are NULL]
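A corresponding two-level lookup sketch (illustrative; a real x86 page walk reads PDEs/PTEs from memory and checks present and protection bits):

#include <stdint.h>
#include <stddef.h>

typedef uint32_t uint32;

extern uint32 *PAGE_DIRECTORY[1024];  /* defined as on the previous slide */

/* Returns the physical address, or 0 if the mapping is unallocated. */
uint32 translate2(uint32 va)
{
    uint32 dir_idx = (va >> 22) & 0x3FF;  /* top 10 bits of the VPN  */
    uint32 tbl_idx = (va >> 12) & 0x3FF;  /* low 10 bits of the VPN  */
    uint32 offset  =  va        & 0xFFF;  /* 12-bit page offset      */

    uint32 *page_table = PAGE_DIRECTORY[dir_idx];
    if (page_table == NULL)
        return 0;                         /* whole 4 MB region unallocated */

    uint32 ppn = page_table[tbl_idx];
    return (ppn << 12) | offset;
}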

76 Intel 80386: Page Table Problem #1: Page table is too large
Two problems with page tables Problem #1: Page table is too large Page table has 1M entries Each entry is 4B (because 4B = 20-bit PPN + protection bits) Page table = 4MB (!!) very expensive in the 80s Problem #2: Page table is stored in memory Before every memory access, always fetch the PTE from the slow memory? → Large performance penalty

77 Problem #2: Page table is stored in memory
Every time the processor accesses memory, it needs to access the process’ page table to get the VA translated into a PA, so memory is accessed two times → large performance penalty Sol: cache the PTEs

78 Problem #2: Page table is stored in memory
Solution: Translation Lookaside Buffer (TLB) Caches Page Table Entries (PTEs) that are read from the process’ page tables stored in memory It makes successive translations from VA to PA fast Typically a small fully-associative cache with few entries, e.g. the AMD Opteron processor has a 40-entry data TLB A lookup results in a TLB hit or miss
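A minimal sketch of such a small fully-associative TLB (entry count and field names are illustrative, not a specific processor's design):

#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES 40

typedef struct {
    bool     valid;
    uint32_t vpn;   /* tag: virtual page number */
    uint32_t ppn;   /* cached translation       */
} TLBEntry;

static TLBEntry tlb[TLB_ENTRIES];

/* On a hit, writes the PPN and returns true; on a miss the page table
 * must be walked and the entry inserted (replacing some victim). */
bool tlb_lookup(uint32_t vpn, uint32_t *ppn)
{
    for (int i = 0; i < TLB_ENTRIES; i++) { /* compared "in parallel" in hardware */
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *ppn = tlb[i].ppn;
            return true;
        }
    }
    return false;
}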

79 Intel 80386: Page Table Two problems with page tables
Problem #1: Page table is too large Page table has 1M entries Each entry is 4B Page table = 4MB (!!) very expensive in the 80s Solution: Hierarchical page table Problem #2: Page table is in memory Before every memory access, always fetch the PTE from the slow memory? → Large performance penalty Solution: Translation Lookaside Buffer

80 Translation Lookaside Buffer (TLB)
Problem: Context Switch Assume that Process X is running Process X’s VPN 5 is mapped to PPN 100 The TLB caches this mapping VPN 5  PPN 100 Now assume a context switch to Process Y Process Y’s VPN 5 is mapped to PPN 200 When Process Y tries to access VPN 5, it searches the TLB Process Y finds an entry whose tag is 5 Hurray! It’s a TLB hit! The PPN must be 100! … Are you sure?

81 Translation Lookaside Buffer (TLB)
Problem: Context Switch Sol #1. Flush the TLB Whenever there is a context switch, flush the TLB All TLB entries are invalidated Example: 80386 Sol #2. Associate TLB entries with PIDs All TLB entries have an extra field in the tag that identifies the process to which the entry belongs; the PID is also used while comparing tags Example: Modern x86, MIPS

82 Handling TLB Misses The TLB is small; it cannot hold all PTEs
Some translations will inevitably miss in the TLB Must access memory to find the appropriate PTE Called walking the page directory/table Large performance penalty Who handles TLB misses? Hardware-Managed TLB Software-Managed TLB

83 Handling TLB Misses Approach #1. Hardware-Managed (e.g., x86)
The hardware does the page walk The hardware fetches the PTE and inserts it into the TLB If the TLB is full, the entry replaces another entry All of this is done transparently Approach #2. Software-Managed (e.g., MIPS) The hardware raises an exception The operating system does the page walk The operating system fetches the PTE The operating system inserts/evicts entries in the TLB

84 Handling TLB Misses Hardware-Managed TLB Software-Managed TLB
Pro: No exceptions. Instruction just stalls Pro: Independent instructions may continue Pro: Small footprint (no extra instructions/data) Con: Page directory/table organization is etched in stone Software-Managed TLB Pro: The OS can design the page directory/table Pro: More advanced TLB replacement policy Con: Flushes pipeline Con: Performance overhead

85 Page Fault If a virtual page is not mapped to a physical page …
The virtual page does not have a valid PTE What would happen if you accessed that page? A hardware exception: page fault The operating system needs to handle it Page fault handler: Reads the data from disk into a physical page in memory Maps the virtual page to the physical page Creates the appropriate PDE/PTE Resumes the program that caused the page fault

86 Servicing a Page Fault Processor communicates with controller
Read a block of length P starting at disk address X and store it starting at memory address Y The read occurs via Direct Memory Access (DMA), done by the I/O controller The controller signals completion using an interrupt The OS resumes the suspended process

87 Why Mapped to Disk? Demand Paging Two possible reasons
Why would a virtual page ever be mapped to disk? Two possible reasons Demand Paging When a large file on disk needs to be read, not all of it is loaded into memory at once Instead, page-sized chunks are loaded on demand If most of the file is never actually read … Saves time (remember, disk is extremely slow) Saves memory space Q: When can demand paging be bad?

88 Why Mapped to Disk? (cont’d)
Swapping Assume that physical memory is exhausted You are running many programs that require lots of memory What happens if you try to run another program? Some physical pages are “swapped out” to disk I.e., the data in some physical pages are migrated to disk This frees up those physical pages As a result, their PTEs become invalid When you access a physical page that has been swapped out, only then is it brought back into physical memory This may cause another physical page to be swapped out If this “ping-ponging” occurs frequently, it is called thrashing Extreme performance degradation

89 Virtual Memory and Cache Interaction

90 Address Translation and Caching
When do we do the address translation? Before or after accessing the L1 cache? In other words, is the cache virtually addressed or physically addressed? Virtual versus physical cache What are the issues with a virtually addressed cache? Synonym problem: Two different virtual addresses can map to the same physical address → the same physical address can be present in multiple locations in the cache → can lead to inconsistency in data

91 Homonyms and Synonyms Homonym: Same VA can map to two different PAs
Why? The same VA is used by different processes Synonym: Different VAs can map to the same PA Different pages can share the same physical frame within or across processes Reasons: shared libraries, shared data, copy-on-write pages within the same process, … Do homonyms and synonyms create problems when we have a cache? Is the cache virtually or physically addressed?

92 Cache-VM Interaction
[Diagram: three options. Physical cache: CPU → TLB → cache (accessed with the PA) → lower hierarchy. Virtual (L1) cache: CPU → cache (accessed with the VA) → TLB → lower hierarchy. Virtual-physical cache: CPU → cache and TLB accessed in parallel (VA-indexed, PA-tagged) → lower hierarchy]

93 Physical Cache No homonym or synonym problem

94 Virtual Cache Problem: both the homonym and the synonym problem can occur

95 Virtual Cache Why don’t we use a virtual cache?
Problems: homonyms, synonyms, page-level protection, and I/O devices (which use physical addresses)

96 Virtual-Physical Cache
- No homonym problem - Synonyms can occur, depending on the cache index size

97 Virtually-Indexed Physically-Tagged
If C ≤ (page_size × associativity), the cache index bits come only from the page offset (which is the same in the VA and PA) If both the cache and the TLB are on chip: index both arrays concurrently using VA bits, and check the cache tag (physical) against the TLB output (PPN) at the end [Diagram: VPN | page offset (index, byte-in-block); the TLB translates the VPN to the PPN while the physically-tagged cache is indexed; PPN = tag gives cache hit?, and TLB hit? is checked as well]
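A tiny check of that constraint (hypothetical helper; sizes in bytes):

#include <stdbool.h>
#include <stdio.h>

/* The index (plus block offset) must fit inside the page offset,
 * i.e. cache_size <= page_size * associativity. */
bool vipt_safe(unsigned cache_size, unsigned page_size, unsigned associativity)
{
    return cache_size <= page_size * associativity;
}

int main(void)
{
    /* e.g. a 32 KB, 8-way L1 with 4 KB pages: 32K <= 4K*8, so the index
     * comes entirely from the page offset and no aliasing arises. */
    printf("%d\n", vipt_safe(32 * 1024, 4 * 1024, 8));  /* prints 1 */
    printf("%d\n", vipt_safe(64 * 1024, 4 * 1024, 8));  /* prints 0 */
    return 0;
}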

98 Virtually-Indexed Physically-Tagged
If C > (page_size × associativity), the cache index bits include VPN bits → synonyms can cause problems: the same physical address can exist in two cache locations Solutions? [Diagram: as before, but some index bits (a) come from the VPN]

99 Some Solutions to the Synonym Problem
Limit cache size to (page size times associativity) get index from page offset On a write to a block, search all possible indices that can contain the same physical block, and update/invalidate Used in Alpha 21264, MIPS R10K Restrict page placement in OS make sure index(VA) = index(PA) Called page coloring Used in many SPARC processors

100 Instruction vs. Data Caches
Unified: + Dynamic sharing of cache space: no overprovisioning that might happen with static partitioning (i.e., split I and D caches) -- Instructions and data can thrash each other (i.e., no guaranteed space for either) -- I and D are accessed in different places in the pipeline. Where do we place the unified cache for fast access? First level caches are almost always split Mainly to avoid structural hazard Second and higher levels are almost always unified

101 Multi-level Caching in a Pipelined Design
First-level caches (instruction and data) Decisions very much affected by cycle time Small, lower associativity Tag store and data store accessed in parallel Second-level caches Decisions need to balance hit rate and access latency Usually large and highly associative; latency not as important Tag store and data store accessed serially Serial vs. Parallel access of levels Serial: Second level cache accessed only if first-level misses Second level does not see the same accesses as the first First level acts as a filter

102 Performance

103 Program Execution Time
Prog Execution Time = (CPU cycles + Memory stall cycles) × cycle time CPU cycles = Instruction Count (IC) × Average Cycles Per Instruction (CPI) CPI = Total Prog Execution Cycles / IC Memory stall cycles = Total Misses × Miss Penalty = IC × Misses/Inst × Miss Penalty = IC × Memory Accesses Per Inst × Miss Rate × Miss Penalty Misses/Inst = (Miss Rate × Total Memory Accesses) / IC = Miss Rate × Memory Accesses Per Inst Prog Execution Time = IC × (CPI + Misses/Inst × Miss Penalty) × cycle time = IC × (CPI + Memory Accesses Per Inst × Miss Rate × Miss Penalty) × cycle time
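A short numeric sketch of the formula (variable names are illustrative; the values are the ones used in the example a couple of slides below):

#include <stdio.h>

int main(void)
{
    double IC           = 1.0;   /* normalize to one instruction    */
    double CPI          = 1.0;   /* base CPI with a perfect cache   */
    double acc_per_inst = 1.5;   /* memory accesses per instruction */
    double miss_rate    = 0.02;
    double miss_penalty = 200.0; /* cycles                          */
    double cycle_time   = 1.0;   /* arbitrary time units            */

    double cycles_per_inst = CPI + acc_per_inst * miss_rate * miss_penalty;
    double exec_time = IC * cycles_per_inst * cycle_time;
    printf("cycles per instruction = %.1f\n", cycles_per_inst); /* 7.0 */
    printf("execution time = %.1f\n", exec_time);
    return 0;
}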

104 Miss Rate Two ways to specify miss rate
Miss Rate = misses / total accesses Misses/Inst = Miss Rate × Memory Accesses Per Inst (dependent on the ISA, e.g. x86 vs. MIPS)

105 Effect of Cache on Program Execution
Given: Miss Penalty = 200 cycles, CPI = 1 cycle, Miss Rate = 2%, Memory Accesses Per Inst = 1.5, Misses per 1000 instructions = 30 Prog Execution Time (perfect cache) = ? Prog Execution Time (with cache) = ? Prog Execution Time (no cache) = ?

106 Effect of Cache on Program Execution
Prog Execution Time (perfect cache) = IC × (CPI + Memory Accesses Per Inst × Miss Rate × Miss Penalty) × cycle time; with Miss Rate = 0%: = IC × (1 + 0) = IC × cycle time Prog Execution Time (with cache) = IC × (1 + 1.5 × 2% × 200) = IC × 7 × cycle time Prog Execution Time (no cache): with Miss Rate = 100%: = IC × (1 + 1.5 × 100% × 200) = IC × 301 × cycle time

107 Choosing the Right Cache
Given: CPI = 1.6, cycle time = 0.35 ns (direct-mapped), cycle time = 0.35 × 1.35 ns for 2-way, memory accesses per inst = 1.4, miss penalty = 65 ns, hit time = 1 cycle, miss rate = 2.1% (direct-mapped), miss rate = 1.9% (2-way) Average access time = ? Performance = ?

108 Choosing the Right Cache
Direct-mapped: access time = hit time + miss rate × miss penalty = 0.35 + 2.1% × 65 = 1.72 ns Prog. execution time = IC × [(CPI × cycle time) + (miss rate × memory accesses per inst × miss penalty)] = IC × [(1.6 × 0.35) + (2.1% × 1.4 × 65)] = 2.47 × IC ns 2-way: access time = 0.35 × 1.35 + 1.9% × 65 = 1.71 ns Prog. execution time = IC × [(1.6 × 0.35 × 1.35) + (1.9% × 1.4 × 65)] = 2.49 × IC ns Relative performance = 2.49 / 2.47 = 1.01
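The same comparison as a small C sketch (numbers from the slide; the miss penalty is already in ns, so it is not scaled by the cycle time):

#include <stdio.h>

int main(void)
{
    double CPI = 1.6, acc_per_inst = 1.4, miss_penalty_ns = 65.0;

    /* direct-mapped */
    double ct_dm = 0.35, mr_dm = 0.021;
    double t_dm = CPI * ct_dm + mr_dm * acc_per_inst * miss_penalty_ns;

    /* 2-way set associative (longer cycle time, lower miss rate) */
    double ct_2w = 0.35 * 1.35, mr_2w = 0.019;
    double t_2w = CPI * ct_2w + mr_2w * acc_per_inst * miss_penalty_ns;

    printf("direct-mapped: %.2f ns/inst\n", t_dm);   /* ~2.47 */
    printf("2-way:         %.2f ns/inst\n", t_2w);   /* ~2.49 */
    printf("relative: %.2f\n", t_2w / t_dm);         /* ~1.01 */
    return 0;
}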

