1 COMP 206: Computer Architecture and Implementation
Montek Singh
Mon., Nov. 4, 2002
Topic: 1. Caches (contd.); 2. Virtual Memory

2 Review: Improving Cache Performance
 1. Reduce the miss rate,
 2. Reduce the miss penalty, or
 3. Reduce the time to hit in the cache.

3 1. Fast Hit Times via Small, Simple Caches
 Simple caches can be faster: cache hit time is increasingly a bottleneck to CPU performance.
– Set associativity requires complex tag matching, hence is slower.
– Direct-mapped caches are simpler, hence faster, allowing shorter CPU cycle times; the tag check can be overlapped with transmission of the data.
 Smaller caches can be faster: they can fit on the same chip as the CPU, avoiding the penalty of going off-chip.
– Compromise for L2 caches: keep the tags on chip and the data off chip, giving a fast tag check yet greater cache capacity.
– Example: the L1 data cache was reduced from 16KB in the Pentium III to 8KB in the Pentium IV.

4 2. Fast Hits by Avoiding Addr. Translation
 Send the virtual address to the cache? Called a Virtually Addressed Cache (or just Virtual Cache), vs. a Physical Cache.
– Every time the process is switched, the cache logically must be flushed; otherwise it can return false hits. The cost is the time to flush plus the "compulsory" misses from the now-empty cache.
– Must deal with aliases (sometimes called synonyms): two different virtual addresses that map to the same physical address.
– I/O must interact with the cache, so a mapping to virtual addresses is needed.
 Solutions to aliases:
– HW guarantee: each cache frame holds a unique physical address.
– SW guarantee ("page coloring"): aliases must agree in their lower n bits; as long as those bits cover the index field and the cache is direct-mapped, aliases map to the same cache frame and so remain unique.
 Solution to the cache flush (see the sketch below):
– Add a process-identifier tag, so a line is identified by the process as well as the address within the process: the wrong process can't get a hit.
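A minimal C sketch of the process-ID idea; all type and field names here are hypothetical, not from the slides. The hit test compares the stored process ID as well as the virtual tag, so lines left behind by another process can never hit and no flush is needed on a context switch:

    /* Virtual-cache line extended with a process-ID (ASID) tag. */
    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        bool     valid;
        uint32_t vtag;   /* virtual-address tag                  */
        uint8_t  asid;   /* ID of the process that owns the line */
    } CacheLine;

    /* Hit only if BOTH the tag and the current process ID match. */
    bool is_hit(const CacheLine *line, uint32_t vtag, uint8_t cur_asid)
    {
        return line->valid && line->vtag == vtag && line->asid == cur_asid;
    }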

5 Virtually Addressed Caches (figure: three organizations)
– Conventional organization: CPU → TLB → Cache → MEM; the VA is translated to a PA before every cache access.
– Virtually addressed cache: CPU → Cache → TLB → MEM; translate only on a miss; suffers the synonym problem.
– Overlapped organization: the cache (with PA tags, backed by an L2 cache) is accessed in parallel with the VA translation; requires the cache index to remain invariant across translation.

6 2. Avoiding Translation: Process ID impact (Fig 5.25, HP3)
 Black: uniprocess.
 Light gray: multiprocess when the cache is flushed on a switch.
 Dark gray: multiprocess when a process-ID tag is used.
 Y axis: miss rate up to 20%; X axis: cache size from 2 KB to 1024 KB.

7 2. Avoiding Translation: Index with Physical Portion of Address
 Use the page offset as the cache index: it is the part that is identical in both virtual and physical addresses.
– Read the cache AND translate the virtual part in parallel; the tag match uses the result of the translation (the physical address).
–  "virtually indexed, physically tagged"
 This limits the cache to the page size; for a bigger cache, use higher associativity (see the sketch below).
 (Figure: the address splits into Page Address | Page Offset; the cache views it as Tag | Index | Block Offset.)
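A small C sketch of the constraint this slide implies; the cache parameters (4KB pages, 32B blocks, 64 sets, 2-way) are illustrative, not from the slides. Because the index and block-offset bits must fit inside the page offset, the usable cache size is bounded by associativity × page size, which is why bigger caches need higher associativity:

    #include <assert.h>
    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_SIZE  4096u   /* 12-bit page offset */
    #define BLOCK_SIZE 32u     /* 5-bit block offset */
    #define NUM_SETS   64u     /* 6-bit index        */
    #define ASSOC      2u

    int main(void)
    {
        /* Index + block offset must lie within the page offset, so the
           index is identical in VA and PA and can be used before the
           TLB finishes translating the upper bits. */
        assert(NUM_SETS * BLOCK_SIZE <= PAGE_SIZE);

        unsigned cache_size = NUM_SETS * BLOCK_SIZE * ASSOC;
        printf("cache size = %u bytes (limit = assoc x page = %u)\n",
               cache_size, ASSOC * PAGE_SIZE);

        uint32_t va    = 0x12345678u;
        uint32_t index = (va / BLOCK_SIZE) % NUM_SETS; /* page-offset bits only */
        printf("set index for VA 0x%08X = %u\n", va, index);
        return 0;
    }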

8 Cache Optimization Summary
(MR = miss rate, MP = miss penalty, HT = hit time; + helps that factor, – hurts it)

Technique                         | MR | MP | HT | Complexity
Larger Block Size                 | +  | –  |    | 0
Higher Associativity              | +  |    | –  | 1
Victim Caches                     | +  |    |    | 2
Pseudo-Associative Caches         | +  |    |    | 2
HW Prefetching of Instr/Data      | +  |    |    | 2
Compiler Controlled Prefetching   | +  |    |    | 3
Compiler Reduce Misses            | +  |    |    | 0
Priority to Read Misses           |    | +  |    | 1
Subblock Placement                |    | +  | +  | 1
Early Restart & Critical Word 1st |    | +  |    | 2
Non-Blocking Caches               |    | +  |    | 3
Second Level Caches               |    | +  |    | 2
Small & Simple Caches             | –  |    | +  | 0
Avoiding Address Translation      |    |    | +  | 2

9 Impact of Caches
 1960-1985: Speed = ƒ(no. of operations)
 1997: pipelined execution & fast clock rate, out-of-order completion, superscalar instruction issue
 1999: Speed = ƒ(non-cached memory accesses)
 What does this mean for compilers, architects, algorithms, data structures?

10 Virtual Memory
 HP3 Section 5.10
 For background, refer to PH (Comp. Org.)

11 Characteristics (slide table/figure not transcribed)

12 Addressing
 Always a congruence mapping.
 Assume:
– 4GB VM composed of 2^20 4KB pages
– 64MB DRAM main memory composed of 16384 page frames (of the same size)
 Only those pages (of the 2^20) that are not empty actually exist.
– Each is either in main memory or on disk.
– Each can be located with two mappings (implemented with tables).

Virtual address  = (virtual page number, page offset);  VA = (VPN, offset);  32 bits = 20 bits + 12 bits
Physical address = (real page number, page offset);     PA = (RPN, offset);  26 bits = 14 bits + 12 bits
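The slide's numbers worked out as a short C sketch; the example VA and RPN values are made up for illustration:

    #include <stdint.h>
    #include <stdio.h>

    #define OFFSET_BITS 12u

    int main(void)
    {
        uint32_t va     = 0x00ABCDEFu;                    /* made-up VA    */
        uint32_t vpn    = va >> OFFSET_BITS;              /* 20-bit VPN    */
        uint32_t offset = va & ((1u << OFFSET_BITS) - 1); /* 12-bit offset */

        uint32_t rpn = 0x1234u;                  /* 14-bit RPN, from the mapping */
        uint32_t pa  = (rpn << OFFSET_BITS) | offset;     /* 26-bit PA     */

        printf("VPN=0x%05X offset=0x%03X PA=0x%07X\n", vpn, offset, pa);
        return 0;
    }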

13 Address Translation
 RPN = f_M(VPN)
– In reality, the VPN is mapped to a page table entry (PTE), which contains the RPN as well as miscellaneous control information (e.g., valid bit, dirty bit, replacement information, access control).
 VA → PA: (VPN, offset within page) → (RPN, offset within page)
 Or VA → disk address (when the page is not in main memory)
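A C sketch of what such a PTE might look like; the exact bit layout is hypothetical, since the slide only lists the kinds of fields it contains:

    #include <stdint.h>

    typedef struct {
        uint32_t rpn      : 14;  /* real page number (16384 frames)     */
        uint32_t valid    : 1;   /* page resident in main memory?       */
        uint32_t dirty    : 1;   /* page modified since being loaded?   */
        uint32_t ref      : 1;   /* referenced bit (replacement info)   */
        uint32_t prot     : 3;   /* access control (read/write/execute) */
        uint32_t reserved : 12;
    } PTE;                       /* fits the 4-byte PTE size assumed below */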

14 Single-Level, Direct Page Table in MM
 Fully associative mapping: when a VM page is brought in from disk to MM, it may go into any of the real page frames.
 Simplest addressing scheme: one-level, direct page table.
– (page table base address + VPN) → PTE, or page fault
– Assume that the PTE size is 4 bytes; the whole table then requires 4 × 2^20 bytes = 4MB of main memory.
 Disadvantage: 4MB of main memory must be reserved for page tables, even when the VM space is almost empty.
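A minimal C sketch of this one-level lookup, reusing the hypothetical PTE struct above; page_table and page_fault are assumed to exist elsewhere:

    #include <stdint.h>

    #define OFFSET_BITS 12u
    #define NUM_VPAGES  (1u << 20)

    extern PTE  page_table[NUM_VPAGES];   /* 4B x 2^20 = 4MB, all resident */
    extern void page_fault(uint32_t vpn); /* fetch page from disk, retry   */

    uint32_t translate(uint32_t va)
    {
        uint32_t vpn    = va >> OFFSET_BITS;
        uint32_t offset = va & ((1u << OFFSET_BITS) - 1);

        PTE pte = page_table[vpn];        /* base + VPN: one MM access */
        if (!pte.valid)
            page_fault(vpn);
        return ((uint32_t)pte.rpn << OFFSET_BITS) | offset;
    }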

15 Single-Level Direct Page Table in VM
 To avoid tying down 4MB of physical memory:
– Put the page tables in VM, and bring into MM only those that are actually needed: "paging the page tables".
 Needs only 1K PTEs in main memory (the 4MB table occupies 1K VM pages, so its own page table needs 1K PTEs = 4KB), rather than 4MB.
 Slows down access to VM pages by possibly needing disk accesses for the PTEs.

16 Multi-Level Direct Page Table in MM
 Another solution to the storage problem.
 Break the 20-bit VPN into two 10-bit parts: VPN = (VPN1, VPN2).
 This turns the original one-level page table into a tree structure (see the sketch below):
– (1st-level base address + VPN1) → 2nd-level base address
– (2nd-level base address + VPN2) → PTE, or page fault
 Storage situation much improved:
– Always need the root node (1K 4-byte entries = 1 VM page).
– Need only a few of the second-level nodes, allocated on demand; they can be anywhere in main memory.
 But the access time to a PTE has doubled.
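The same lookup as a two-level walk, again a hedged sketch on top of the hypothetical PTE struct; note the two memory accesses where the one-level table needed one:

    #include <stddef.h>
    #include <stdint.h>

    extern PTE *root[1024];               /* root node: always resident   */
    extern void page_fault(uint32_t vpn); /* assume the access is retried */

    uint32_t translate2(uint32_t va)
    {
        uint32_t vpn1   = (va >> 22) & 0x3FFu;  /* upper 10 bits of VPN */
        uint32_t vpn2   = (va >> 12) & 0x3FFu;  /* lower 10 bits of VPN */
        uint32_t offset =  va        & 0xFFFu;

        PTE *node = root[vpn1];                 /* 1st MM access         */
        if (node == NULL)                       /* 2nd-level node absent */
            page_fault(va >> 12);

        PTE pte = node[vpn2];                   /* 2nd MM access         */
        if (!pte.valid)
            page_fault(va >> 12);
        return ((uint32_t)pte.rpn << 12) | offset;
    }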

17 Inverted Page Tables
 Virtual address spaces may be vastly larger (and more sparsely populated) than real address spaces, so the less-than-full utilization of tree nodes in a multi-level direct page table becomes more significant.
 The ideal (i.e., smallest possible) page table would have one entry for every VM page actually in main memory:
– Needs 4 bytes × 16K frames = 64KB of main memory to store.
– Storage overhead = 0.1% (64KB of 64MB).
 Inverted page table implementations approximate this ideal (see the sketch below):
– Associative inverted page table in special hardware (ATLAS)
– Hashed inverted page table in MM (IBM, HP PA-RISC)
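A hedged C sketch of a hashed inverted page table: one entry per physical frame (16K entries, matching the 64KB figure), probed through a hash of the VPN with chaining for collisions. The hash function and all names are made up for illustration:

    #include <stdint.h>

    #define NUM_FRAMES 16384u

    typedef struct {
        uint32_t vpn;    /* which virtual page occupies this frame */
        int32_t  next;   /* next frame on the hash chain, -1 = end */
        uint8_t  valid;
    } IPTEntry;

    extern IPTEntry ipt[NUM_FRAMES];
    extern int32_t  anchor[NUM_FRAMES];   /* hash bucket -> first frame */

    /* Returns the frame number (= RPN) holding vpn, or -1 = page fault. */
    int32_t ipt_lookup(uint32_t vpn)
    {
        uint32_t h = (vpn * 2654435761u) % NUM_FRAMES;  /* illustrative hash */
        for (int32_t f = anchor[h]; f != -1; f = ipt[f].next)
            if (ipt[f].valid && ipt[f].vpn == vpn)
                return f;
        return -1;
    }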

18 Translation Lookaside Buffer (TLB)
 To avoid two or more MM accesses for each VM access, use a small cache to store (VPN, PTE) pairs; the PTE contains the RPN, from which the RA can be constructed.
 This cache is the TLB, and it exploits locality:
– DEC Alpha: 32 entries, fully associative
– Amdahl V/8: 512 entries, 2-way set-associative
 Processor issues a VA (see the sketch below):
– TLB hit → send the RA to main memory.
– TLB miss → make two or more MM accesses to the page tables to retrieve the RA, then send the RA to MM.
– (Any of these steps may cause a page fault.)
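A C sketch of this hit/miss flow with a small fully associative TLB (32 entries, as in the Alpha example); walk_page_tables and pick_victim are hypothetical stand-ins for the page-table walk and the replacement policy:

    #include <stdint.h>

    #define TLB_ENTRIES 32

    typedef struct {
        uint32_t vpn;
        PTE      pte;     /* hypothetical PTE struct from earlier */
        uint8_t  valid;
    } TLBEntry;

    extern TLBEntry tlb[TLB_ENTRIES];
    extern PTE  walk_page_tables(uint32_t vpn);  /* 2+ MM accesses */
    extern int  pick_victim(void);               /* replacement    */

    uint32_t tlb_translate(uint32_t va)
    {
        uint32_t vpn = va >> 12, offset = va & 0xFFFu;

        for (int i = 0; i < TLB_ENTRIES; i++)    /* HW checks all in parallel */
            if (tlb[i].valid && tlb[i].vpn == vpn)
                return ((uint32_t)tlb[i].pte.rpn << 12) | offset;  /* hit */

        PTE pte = walk_page_tables(vpn);         /* miss: may page-fault */
        int v = pick_victim();
        tlb[v].vpn = vpn; tlb[v].pte = pte; tlb[v].valid = 1;
        return ((uint32_t)pte.rpn << 12) | offset;
    }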

19 TLB Misses
 Causes of a TLB miss:
– The VM page is not in main memory.
– The VM page is in main memory, but its entry has not yet been entered into the TLB.
– The VM page is in main memory, but its TLB entry has been removed for some reason (evicted as LRU, invalidated because the page table was updated, etc.).
 Miss rates are remarkably low (~0.1%); the miss rate depends on the size of the TLB and on the VM page size (coverage).
 The miss penalty varies from a single cache access to several page faults.

20 Dirty Bits and TLB: Two Solutions
 Solution 1: the TLB is a read-only cache.
– The dirty bit is contained only in the page table in MM; the TLB contains only a write-access bit, initially set to zero (denying writes to the page).
– On the first attempt to write the VM page, an exception is caused; the handler sets the dirty bit in the page table in MM and sets the write-access bit to 1 in the TLB (see the sketch below).
 Solution 2: the TLB is a read-write cache.
– The dirty bit is present in both the TLB and the page table in MM.
– On the first write to the VM page, only the dirty bit in the TLB is set.
– The dirty bit in the page table is brought up to date when the TLB entry is evicted, or when the VM page and its PTE are evicted.
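A short C sketch of solution 1's exception path (handler and field names are illustrative): the trap handler updates the authoritative dirty bit in MM, then grants write access in the TLB so later writes proceed without trapping:

    #include <stdint.h>

    extern PTE page_table[];   /* authoritative dirty bit lives here */

    typedef struct { uint32_t vpn; uint8_t write_ok; } TLBPerm;

    void on_write_fault(TLBPerm *e)   /* first store to the page traps here */
    {
        page_table[e->vpn].dirty = 1; /* set dirty bit in the MM page table */
        e->write_ok = 1;              /* set write-access bit in the TLB    */
        /* ...the faulting store is then retried and succeeds */
    }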

21 Virtual Memory Access Time
 Assume the existence of a TLB, a physical cache, MM, and disk.
 Processor issues a VA:
– TLB hit → send the RA to the cache.
– TLB miss → exception: access the page tables, update the TLB, retry.
 A memory reference may thus involve accesses to: the TLB, the page table in MM, the cache, and the page in MM.
 Each of these can be a hit or a miss: 16 possible combinations.

22 Virtual Memory Access Time (2)
 Constraints among these accesses:
– Hit in TLB ⇒ hit in page table in MM
– Hit in cache ⇒ hit in page in MM
– Hit in page in MM ⇒ hit in page table in MM
 These constraints eliminate eleven combinations.

23 Virtual Memory Access Time (3)
 The number of MM accesses depends on the page table organization:
– MIPS R2000/R4000 accomplishes table walking with CPU instructions (eight instructions per page table level).
– Several CISC machines implement it in microcode, with the MC88200 having dedicated hardware for it.
– The RS/6000 implements it completely in hardware.
 The TLB miss penalty is dominated by having to go to main memory:
– The page tables may not be in the cache.
– The miss penalty increases further if the page table organization is complex.
– TLB misses can have a very damaging effect on physical caches.

24 Page Size
 Choices:
– Fixed at design time (most early VM systems).
– Statically configurable: at any moment, only pages of the same size exist in the system; the MC68030 allowed page sizes between 256B and 32KB this way.
– Dynamically configurable: pages of different sizes coexist in the system.
   Alpha 21164, UltraSPARC: 8KB, 64KB, 512KB, 4MB
   MIPS R10000, PA-8000: 4KB, 16KB, 64KB, 256KB, 1MB, 4MB, 16MB
 All pages are aligned.
 Dynamic configuration is a sophisticated way to decrease the TLB miss rate (see the sketch below):
– Increasing the number of TLB entries increases the processor cycle time.
– Increasing the VM page size increases internal memory fragmentation.
– It needs fully associative TLBs.
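A quick worked computation of TLB reach (coverage = entries × page size), with an illustrative 32-entry TLB; it shows why growing the page size raises coverage without adding entries (which would stretch the cycle time):

    #include <stdio.h>

    int main(void)
    {
        unsigned entries   = 32;                 /* illustrative TLB size */
        unsigned page_kb[] = { 4, 8, 64, 4096 }; /* 4KB up to 4MB pages   */

        for (int i = 0; i < 4; i++)
            printf("%u entries x %4u KB pages -> reach %6u KB\n",
                   entries, page_kb[i], entries * page_kb[i]);
        return 0;
    }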

25 Segmentation and Paging
 Paged segments: segments are made up of pages.
 A paging system has a flat, linear address space:
– 32-bit VA = (10-bit VPN1, 10-bit VPN2, 12-bit offset)
– If, for a given VPN1, we reach the max value of VPN2 and add 1, we reach the next page, at address (VPN1+1, 0).
 The segmented version has a two-dimensional address space:
– 32-bit VA = (10-bit segment #, 10-bit page number, 12-bit offset)
– If, for a given segment #, we reach the max page number and add 1, we get an undefined value: segments are not contiguous.
 Segments do not need to have the same size; the size can even vary dynamically.
– Implemented by storing an upper bound for each segment and checking every reference against it (see the sketch below).
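A minimal C sketch of that bound check, using the slide's (segment #, page #, offset) split; the table name is hypothetical:

    #include <stdbool.h>
    #include <stdint.h>

    extern uint32_t seg_limit[1024];   /* per-segment max valid page number */

    bool seg_check(uint32_t va)
    {
        uint32_t seg  = (va >> 22) & 0x3FFu;   /* 10-bit segment # */
        uint32_t page = (va >> 12) & 0x3FFu;   /* 10-bit page #    */
        return page <= seg_limit[seg];         /* false -> fault   */
    }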

26 Example 1: Alpha 21264 TLB (Figure 5.36, HP3)

27 Example 2: Hypothetical Virtual Memory (Figure 5.37, HP3)

