1 COMP 206: Computer Architecture and Implementation
Montek Singh
Mon., Nov. 4, 2002
Topic: 1. Caches (contd.); 2. Virtual Memory

2 Review: Improving Cache Performance
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.

3 1. Fast Hit Times via Small, Simple Caches
• Simple caches can be faster
– cache hit time is increasingly a bottleneck to CPU performance
– set associativity requires complex tag matching → slower
– direct-mapped caches are simpler → faster, shorter CPU cycle times; the tag check can be overlapped with transmission of the data
• Smaller caches can be faster
– can fit on the same chip as the CPU → avoids the penalty of going off-chip
– for L2 caches, a compromise: keep tags on chip, and data off chip → fast tag check, yet greater cache capacity
– the L1 data cache was reduced from 16KB in the Pentium III to 8KB in the Pentium 4

4 2. Fast Hits by Avoiding Address Translation
• Send the virtual address to the cache? Called a Virtually Addressed Cache or just Virtual Cache, vs. a Physical Cache
– every time the process is switched, the cache must logically be flushed, otherwise we get false hits; the cost is the time to flush plus the "compulsory" misses from an empty cache
– must deal with aliases (sometimes called synonyms): two different virtual addresses that map to the same physical address
– I/O must interact with the cache, so it needs a mapping to virtual addresses
• Solution to aliases
– HW guarantee: each cache frame holds a unique physical address
– SW guarantee: all aliases must agree in their lower n bits; as long as those bits cover the index field and the cache is direct mapped, aliases map to the same frame and so must be unique; called page coloring (see the sketch below)
• Solution to cache flush
– add a process identifier tag that identifies the process as well as the address within the process: we can't get a hit on the wrong process's data
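To make the page-coloring guarantee concrete, here is a minimal C sketch. The cache and page parameters, the sample addresses, and the function names are illustrative assumptions, not anything from the lecture:

#include <stdio.h>

/* Illustrative parameters: a 64KB direct-mapped cache with 4KB pages
 * has 64KB / 4KB = 16 page colors. */
#define PAGE_SIZE   4096u
#define CACHE_SIZE  65536u
#define NUM_COLORS  (CACHE_SIZE / PAGE_SIZE)

/* The "color" of a page is the part of its page number that falls
 * within the cache index. Page coloring means the OS maps a virtual
 * page only to a physical frame of the same color, so virtual and
 * physical addresses agree in every bit used to index the cache. */
unsigned page_color(unsigned addr)
{
    return (addr / PAGE_SIZE) % NUM_COLORS;
}

int main(void)
{
    unsigned va = 0x0001C123u, pa = 0x7FFFC123u;
    /* With matching colors, virtual and physical indexing compute the
     * same cache set, so aliases cannot occupy two different frames. */
    printf("VA color %u, PA color %u\n", page_color(va), page_color(pa));
    return 0;
}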

5 Virtually Addressed Caches
[Figure: three cache organizations]
• Conventional organization: CPU → TLB → Cache → MEM; the VA is translated to a PA before every cache access
• Virtually addressed cache: CPU → Cache → TLB → MEM; translate only on a miss, at the price of the synonym problem
• Overlapped organization: the cache access (with physical tags) proceeds in parallel with VA translation, backed by an L2 cache; this requires the cache index to remain invariant across translation

6 2. Avoiding Translation: Process ID Impact
[Figure 5.25 (HP3): miss rate vs. cache size]
• Black: uniprocess
• Light gray: multiprocess, flushing the cache on each process switch
• Dark gray: multiprocess, using a process ID tag
• Y axis: miss rate, up to 20%
• X axis: cache size, from 2KB to 1024KB

7 2. Avoiding Translation: Index with Physical Portion of Address
• Use the page offset as the cache index
– the part that is identical in the virtual and physical addresses
– read the cache AND translate the virtual part in parallel
– the tag match uses the result of the translation (the physical address)
→ virtually indexed, physically tagged
• Limits the cache (per way) to the page size: what if we want bigger caches?
– use higher associativity (see the sketch below)
[Address breakdown: Page Address | Page Offset = Address Tag | Index | Block Offset]
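A small sketch of the resulting size constraint for a virtually indexed, physically tagged cache; the helper name and the example parameters are assumed for illustration:

#include <stdbool.h>
#include <stdio.h>

/* Virtually indexed, physically tagged: the cache index must come
 * entirely from the page offset, which is identical in VA and PA.
 * This bounds the cache: bytes per way must fit within one page,
 * i.e., cache_size <= page_size * associativity. */
bool vipt_ok(unsigned cache_size, unsigned assoc, unsigned page_size)
{
    return cache_size / assoc <= page_size;
}

int main(void)
{
    /* 16KB 4-way with 4KB pages fits; 32KB 4-way does not, so a bigger
     * cache needs still higher associativity (or a larger page). */
    printf("16KB 4-way, 4KB pages: %s\n",
           vipt_ok(16384, 4, 4096) ? "ok" : "index exceeds page offset");
    printf("32KB 4-way, 4KB pages: %s\n",
           vipt_ok(32768, 4, 4096) ? "ok" : "index exceeds page offset");
    return 0;
}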

8 Cache Optimization Summary
(MR = miss rate, MP = miss penalty, HT = hit time; + improves, – hurts)

Technique                           MR  MP  HT  Complexity
Larger Block Size                   +   –       0
Higher Associativity                +       –   1
Victim Caches                       +           2
Pseudo-Associative Caches           +           2
HW Prefetching of Instr/Data        +           2
Compiler-Controlled Prefetching     +           3
Compiler Reduce Misses              +           0
Priority to Read Misses                 +       1
Subblock Placement                      +   +   1
Early Restart & Critical Word 1st       +       2
Non-Blocking Caches                     +       3
Second-Level Caches                     +       2
Small & Simple Caches               –       +   0
Avoiding Address Translation                +   2

9 Impact of Caches
• Formerly: Speed = ƒ(no. of operations)
• 1997: pipelined execution & fast clock rate; out-of-order completion; superscalar instruction issue
• 1999: Speed = ƒ(non-cached memory accesses)
• What does this mean for compilers, architects, algorithms, data structures?

10 Virtual Memory
• HP3 Section 5.10
• For background, refer to PH (Comp. Org.)

11 Characteristics

12 Addressing
• Always a congruence mapping
• Assume:
– a 4GB VM composed of 2^20 4KB pages
– a 64MB DRAM main memory composed of 2^14 page frames of the same size
• Only those pages (of the 2^20) that are not empty actually exist
– each is either in main memory or on disk
– each can be located with two mappings (implemented with tables); see the sketch below

Virtual address  = (virtual page number, page offset)
VA               = (VPN, offset)
32 bits          = 20 bits + 12 bits

Physical address = (real page number, page offset)
PA               = (RPN, offset)
26 bits          = 14 bits + 12 bits
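Using the field widths on this slide (20-bit VPN, 12-bit offset), a VA decomposes as in the following sketch; the sample address is arbitrary:

#include <stdint.h>
#include <stdio.h>

/* Field widths from the slide: 32-bit VA = 20-bit VPN + 12-bit offset. */
#define OFFSET_BITS 12u
#define PAGE_SIZE   (1u << OFFSET_BITS)   /* 4KB pages  */
#define VPN_MASK    ((1u << 20) - 1u)     /* 2^20 pages */

int main(void)
{
    uint32_t va  = 0x12345678u;
    uint32_t vpn = (va >> OFFSET_BITS) & VPN_MASK;  /* virtual page number */
    uint32_t off = va & (PAGE_SIZE - 1u);           /* offset within page  */
    printf("VA 0x%08x -> VPN 0x%05x, offset 0x%03x\n", va, vpn, off);
    return 0;
}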

13 Address Translation
• RPN = f_M(VPN)
– in reality, the VPN is mapped to a page table entry (PTE)
– which contains the RPN …
– … as well as miscellaneous control information (e.g., valid bit, dirty bit, replacement information, access control)
• VA → PA: (VPN, offset within page) → (RPN, offset within page)
• VA → disk address (when the page is not resident in main memory)

14 Single-Level, Direct Page Table in MM
• Fully associative mapping: when a VM page is brought in from disk to MM, it may go into any of the real page frames
• Simplest addressing scheme: a one-level, direct page table
– (page table base address + VPN) → PTE or page fault
– assume that the PTE size is 4 bytes
– then the whole table requires 4 × 2^20 bytes = 4MB of main memory
• Disadvantage: 4MB of main memory must be reserved for page tables, even when the VM space is almost empty (see the sketch below)
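A minimal sketch of the one-level lookup, assuming 4-byte PTEs with a 14-bit RPN as on the earlier slide; the field layout and the fault-handler name are illustrative, not a real machine's format:

#include <stdint.h>

/* One 4-byte PTE per virtual page; 2^20 entries = 4MB, always resident. */
typedef struct {
    uint32_t rpn   : 14;   /* real page number (14 bits per the slides) */
    uint32_t valid : 1;    /* page resident in main memory?             */
    uint32_t dirty : 1;
    uint32_t ctrl  : 16;   /* access control, replacement info, ...     */
} pte_t;

extern pte_t page_table[1u << 20];
extern uint32_t handle_page_fault(uint32_t vpn);

uint32_t translate(uint32_t vpn, uint32_t offset)
{
    pte_t pte = page_table[vpn];        /* one memory access */
    if (!pte.valid)
        return handle_page_fault(vpn);  /* bring page in from disk */
    return ((uint32_t)pte.rpn << 12) | offset;
}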

15 Single-Level Direct Page Table in VM
• To avoid tying down 4MB of physical memory:
– put the page tables themselves in VM
– bring into MM only those parts that are actually needed
– "paging the page tables"
• Needs only 1K PTEs (one 4KB page of the table) resident in main memory, rather than 4MB
• Slows down access to VM pages by possibly needing disk accesses for the PTEs

16 Multi-Level Direct Page Table in MM
• Another solution to the storage problem
• Break the 20-bit VPN into two 10-bit parts: VPN = (VPN1, VPN2)
• This turns the original one-level page table into a tree structure
– (1st-level base address + VPN1) → 2nd-level base address
– (2nd-level base address + VPN2) → PTE or page fault
• Storage situation much improved
– always need the root node (1K 4-byte entries = 1 VM page)
– need only a few of the second-level nodes
– allocated on demand; can be anywhere in main memory
• But the access time to a PTE has doubled (see the sketch below)
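A corresponding sketch of the two-level walk; note the two memory accesses, which is why the PTE access time doubles. The names and the NULL-means-empty convention are assumptions:

#include <stdint.h>

/* VPN = (VPN1, VPN2), 10 bits each. Second-level tables are allocated
 * on demand; a null pointer marks an empty region of the address space. */
typedef struct { uint32_t rpn : 14, valid : 1, rest : 17; } pte_t;

extern pte_t *root[1u << 10];               /* 1K entries = one VM page */
extern uint32_t handle_page_fault(uint32_t vpn);

uint32_t translate2(uint32_t vpn, uint32_t offset)
{
    uint32_t vpn1 = vpn >> 10, vpn2 = vpn & 0x3FFu;
    pte_t *second = root[vpn1];             /* first memory access  */
    if (second == 0 || !second[vpn2].valid) /* second memory access */
        return handle_page_fault(vpn);
    return ((uint32_t)second[vpn2].rpn << 12) | offset;
}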

17 Inverted Page Tables
• Virtual address spaces may be vastly larger (and more sparsely populated) than real address spaces
– the less-than-full utilization of tree nodes in the multi-level direct page table then becomes more significant
• The ideal (i.e., smallest possible) page table would have one entry for every VM page actually in main memory
– needs 4 × 16K = 64KB of main memory to store this ideal page table
– storage overhead = 64KB / 64MB ≈ 0.1%
• Inverted page table implementations are approximations to this ideal
– associative inverted page table in special hardware (ATLAS)
– hashed inverted page table in MM (IBM, HP PA-RISC); see the sketch below
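A rough sketch of a hashed inverted page table with one entry per frame (2^14 frames per the slides); the hash function and entry layout are invented for illustration and do not follow the IBM or PA-RISC formats:

#include <stdint.h>

/* One entry per physical frame, chained on hash collisions. */
typedef struct {
    uint32_t vpn;    /* which virtual page occupies this frame        */
    int32_t  next;   /* next frame index in the hash chain; -1 ends   */
} ipte_t;

extern ipte_t  frame_table[1u << 14];
extern int32_t hash_anchor[1u << 14];   /* hash bucket -> first frame */
extern uint32_t handle_page_fault(uint32_t vpn);

uint32_t translate_inverted(uint32_t vpn, uint32_t offset)
{
    /* Trivial hash for illustration: low 14 bits of the VPN. */
    for (int32_t f = hash_anchor[vpn & 0x3FFFu]; f >= 0; f = frame_table[f].next)
        if (frame_table[f].vpn == vpn)
            return ((uint32_t)f << 12) | offset;  /* frame index is the RPN */
    return handle_page_fault(vpn);                /* page not in main memory */
}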

18 Translation Lookaside Buffer (TLB)
• To avoid two or more MM accesses for each VM access, use a small cache to store (VPN, PTE) pairs
– the PTE contains the RPN, from which the RA can be constructed
• This cache is the TLB, and it exploits locality
– DEC Alpha: 32 entries, fully associative
– Amdahl V/8: 512 entries, 2-way set-associative
• Processor issues a VA
– TLB hit → send the RA to main memory
– TLB miss → make two or more MM accesses to the page tables to retrieve the RA, then send the RA to MM
– (any of these accesses may cause a page fault)
A sketch of the lookup follows.
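This is a software model of a fully associative 32-entry TLB (the DEC Alpha configuration from the slide); in hardware all 32 comparisons happen in parallel, and the refill policy here is deliberately naive:

#include <stdint.h>
#include <stdbool.h>

typedef struct { uint32_t vpn, rpn; bool valid; } tlb_entry_t;

static tlb_entry_t tlb[32];

extern uint32_t walk_page_tables(uint32_t vpn);   /* slow path: 1+ MM accesses */

uint32_t tlb_translate(uint32_t va)
{
    uint32_t vpn = va >> 12, offset = va & 0xFFFu;
    for (int i = 0; i < 32; i++)                  /* parallel in hardware */
        if (tlb[i].valid && tlb[i].vpn == vpn)
            return (tlb[i].rpn << 12) | offset;   /* TLB hit: no MM access */
    uint32_t rpn = walk_page_tables(vpn);         /* TLB miss: walk tables */
    tlb[0] = (tlb_entry_t){ vpn, rpn, true };     /* naive refill policy   */
    return (rpn << 12) | offset;
}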

19 TLB Misses
• Causes of a TLB miss:
– the VM page is not in main memory
– the VM page is in main memory, but its entry has not yet been entered into the TLB
– the VM page is in main memory, but its TLB entry has been removed for some reason (evicted as LRU, invalidated because the page table was updated, etc.)
• Miss rates are remarkably low (~0.1%)
– the miss rate depends on the size of the TLB and on the VM page size (coverage)
• The miss penalty varies from a single cache access to several page faults
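As a back-of-the-envelope illustration of why such low miss rates matter (all numbers here are assumed, not from the lecture): with TLB miss rate $m$ and average miss penalty $p$ cycles, the mean translation overhead per access is

$$t_{\text{overhead}} = m \cdot p,$$

so, e.g., $m = 10^{-3}$ and $p = 40$ cycles add only $0.04$ cycles per access on average. The coverage mentioned above is

$$\text{coverage} = N_{\text{entries}} \times \text{page size},$$

e.g., a 128-entry TLB with 4KB pages covers $128 \times 4\,\text{KB} = 512\,\text{KB}$; workloads whose working set exceeds this footprint are what drive the miss rate up.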

20 Dirty Bits and the TLB: Two Solutions
• Solution 1: the TLB is a read-only cache
– the dirty bit is contained only in the page table in MM
– the TLB contains only a write-access bit, initially set to zero (denying writes to the page)
– on the first attempt to write the VM page: an exception is raised, the handler sets the dirty bit in the page table in MM, and sets the write-access bit to 1 in the TLB (see the sketch below)
• Solution 2: the TLB is a read-write cache
– the dirty bit is present in both the TLB and the page table in MM
– on the first write to a VM page, only the dirty bit in the TLB is set
– the dirty bit in the page table is brought up to date when the TLB entry is evicted, or when the VM page and its PTE are evicted
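A sketch of the exception handler for the first scheme; the structures are simplified and the names are assumptions:

#include <stdint.h>
#include <stdbool.h>

/* Scheme 1: the TLB carries only a write-access bit, initially 0;
 * the dirty bit lives solely in the MM page table. */
typedef struct { uint32_t vpn, rpn; bool valid, write_ok; } tlb_entry_t;
typedef struct { uint32_t rpn; bool valid, dirty; } pte_t;

extern pte_t page_table[1u << 20];

/* Called on the exception raised by the first store to a page. */
void write_fault(tlb_entry_t *e)
{
    page_table[e->vpn].dirty = true;  /* record the write where it matters  */
    e->write_ok = true;               /* allow subsequent stores to proceed */
}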

21 Virtual Memory Access Time
• Assume the existence of a TLB, a physical cache, MM, and disk
• Processor issues a VA
– TLB hit → send the RA to the cache
– TLB miss → exception: access the page tables, update the TLB, retry
• A memory reference may involve accesses to
– the TLB
– the page table in MM
– the cache
– the page in MM
• Each of these can be a hit or a miss → 16 possible combinations

22 Virtual Memory Access Time (2)
• Constraints among these accesses:
– hit in TLB → hit in page table in MM
– hit in cache → hit in page in MM
– hit in page in MM → hit in page table in MM
• These constraints eliminate eleven combinations, leaving five (counting a page-table hit as finding a PTE that maps the page to main memory, so page-table hits and page hits coincide)

23 Virtual Memory Access Time (3)
• The number of MM accesses depends on the page table organization
– the MIPS R2000/R4000 accomplishes table walking with CPU instructions (eight instructions per page table level)
– several CISC machines implement this in microcode; the MC88200 has dedicated hardware for it
– the RS/6000 implements this completely in hardware
• The TLB miss penalty is dominated by having to go to main memory
– the page tables may not be in the cache
– the miss penalty increases further if the page table organization is complex
– TLB misses can have a very damaging effect on physical caches

24 Page Size
• Choices:
– fixed at design time (most early VM systems)
– statically configurable: at any moment, only pages of one size exist in the system; the MC68030 allowed page sizes between 256B and 32KB this way
– dynamically configurable: pages of different sizes coexist in the system
• Examples of dynamically configurable sizes (all pages aligned):
– Alpha 21164, UltraSPARC: 8KB, 64KB, 512KB, 4MB
– MIPS R10000, PA-8000: 4KB, 16KB, 64KB, 256KB, 1MB, 4MB, 16MB
• Dynamic configuration is a sophisticated way to decrease the TLB miss rate
– increasing the number of TLB entries increases the processor cycle time
– increasing the VM page size increases internal memory fragmentation
– it requires fully associative TLBs

25 Segmentation and Paging
• Paged segments: segments are made up of pages
• A paging system has a flat, linear address space
– 32-bit VA = (10-bit VPN1, 10-bit VPN2, 12-bit offset)
– if, for a given VPN1, we reach the max value of VPN2 and add 1, we reach the next page, at address (VPN1+1, 0)
• The segmented version has a two-dimensional address space
– 32-bit VA = (10-bit segment #, 10-bit page number, 12-bit offset)
– if, for a given segment #, we reach the max page number and add 1, we get an undefined value
• Segments are not contiguous
• Segments do not need to have the same size
– the size can even vary dynamically
– implemented by storing an upper bound for each segment and checking every reference against it (see the sketch below)
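A sketch of a paged-segment translation with the per-segment bounds check; the table layout and names are illustrative:

#include <stdint.h>
#include <stdbool.h>

/* Each segment has its own page table and an upper bound (its current
 * length in pages) that every reference is checked against. */
typedef struct {
    uint32_t *page_table;   /* maps page-within-segment -> RPN            */
    uint32_t  limit_pages;  /* segment length; may vary dynamically       */
    bool      valid;
} segment_t;

extern segment_t seg_table[1u << 10];   /* 10-bit segment number */
extern void segmentation_fault(void);

uint32_t seg_translate(uint32_t va)
{
    uint32_t seg  = va >> 22;            /* 10-bit segment number */
    uint32_t page = (va >> 12) & 0x3FFu; /* 10-bit page number    */
    uint32_t off  = va & 0xFFFu;         /* 12-bit offset         */
    segment_t s = seg_table[seg];
    if (!s.valid || page >= s.limit_pages)
        segmentation_fault();            /* past the end: undefined, trap */
    return (s.page_table[page] << 12) | off;
}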

26 Example 1: Alpha TLB (Figure 5.36)

27 Example 2: Hypothetical Virtual Memory (Figure 5.37)