Appendix B. Review of Memory Hierarchy


Appendix B. Review of Memory Hierarchy. Outline: Introduction, Cache ABCs, Cache Performance, Write Policy, Virtual Memory and TLB. CDA5155 Fall 2013, Peir / University of Florida.

Cache Basics A cache is a (hardware-managed) storage, intermediate in size, speed, and cost-per-bit between the programmer-visible registers (usually SRAM) and main physical memory (usually DRAM), used to hide memory latency. The cache itself may be SRAM or fast DRAM, and there may be more than one level of cache. The basis for a cache to work is the Principle of Locality: when a location is accessed, it and "nearby" locations are likely to be accessed again soon. "Temporal" locality - the same location is likely to be accessed again soon. "Spatial" locality - nearby locations are likely to be accessed soon.

Cache Performance Formulas Memory stall cycles per program (blocking cache): Memory stall cycles = IC × (Memory accesses / Instruction) × Miss rate × Miss penalty. CPU time formula: CPU time = (CPU execution clock cycles + Memory stall cycles) × Clock cycle time. More on cache performance will be given later!
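As a quick sanity check on these formulas, here is a small worked example in Python; all parameter values (instruction count, base CPI, accesses per instruction, miss rate, miss penalty, clock) are assumptions chosen for illustration, not numbers from the slides.

```python
# Worked example of the cache performance formulas above.
# Every parameter value here is an illustrative assumption.
instruction_count   = 1_000_000   # IC
base_cpi            = 1.0         # CPI assuming a perfect cache
accesses_per_instr  = 1.36        # 1 instruction fetch + 0.36 data references
miss_rate           = 0.03        # 3% of memory accesses miss
miss_penalty        = 100         # cycles to service a miss
clock_cycle_time_ns = 0.5         # 2 GHz clock

# Memory stall cycles = IC x (accesses/instruction) x miss rate x miss penalty
memory_stall_cycles = instruction_count * accesses_per_instr * miss_rate * miss_penalty

# CPU time = (CPU execution cycles + memory stall cycles) x clock cycle time
cpu_cycles  = instruction_count * base_cpi
cpu_time_ns = (cpu_cycles + memory_stall_cycles) * clock_cycle_time_ns

print(f"memory stall cycles: {memory_stall_cycles:,.0f}")    # 4,080,000
print(f"CPU time: {cpu_time_ns / 1e6:.2f} ms")                # 2.54 ms
```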

Four Basic Questions Consider accesses to the levels of a memory hierarchy. Memory is byte addressable. Caches use a block (also called a line) as the unit of data transfer, which exploits the Principle of Locality; blocks must be located (caches are not byte addressable) and are transferred between cache levels and the memory. Cache design is described by four behaviors: Block Placement: Where can a new block be placed in the level? Block Identification: How is a block found if it is in the level? Block Replacement: Which existing block should be replaced if necessary? Write Strategy: How are writes to the block handled?

Block Placement Schemes

Direct-Mapped Placement A block can only go into one frame in the cache Determined by block’s address (in memory space) Frame number usually given by some low-order bits of block address. This can also be expressed as: (Frame number) = (Block address) mod (Number of frames (sets) in cache) Note that in a direct-mapped cache, block placement & replacement are both completely determined by the address of the new block that is to be accessed.
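A minimal sketch of that placement rule, assuming byte addresses, 64-byte blocks, and a 256-frame direct-mapped cache (both sizes are illustrative):

```python
# Direct-mapped placement: the frame is completely determined by the address.
BLOCK_SIZE = 64    # bytes per block (illustrative assumption)
NUM_FRAMES = 256   # frames (sets) in the cache (illustrative assumption)

def direct_mapped_frame(byte_address: int) -> int:
    block_address = byte_address // BLOCK_SIZE
    # (Frame number) = (Block address) mod (Number of frames in cache)
    return block_address % NUM_FRAMES

print(direct_mapped_frame(0x12345))   # -> 141
```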

Direct-Mapped Identification (figure: the memory address is split into Tag | Frame# | Offset fields; the frame number decodes and selects one block frame, whose stored tag is compared against the address tag; on a match (hit), the offset drives the mux select to pick out the data word).

Fully-Associative Placement One alternative to direct-mapped: allow a block to fill any empty frame in the cache. How do we then locate the block later? We can associate each stored block with a tag that identifies the block's location in the cache. When the block is needed, we use the cache as an associative memory, matching the tag against all frames in parallel to pull out the appropriate block. Another alternative to direct-mapped is placement under full program control. A register file can be viewed as a small programmer-controlled cache (with 1-word blocks).

Fully-Associative Identification (figure: the address is split into Block address | Offset; the block address is compared in parallel against the addresses stored with all block frames, and on a hit the offset drives the mux select to pick out the data word). Note that, compared to direct-mapped: more address bits have to be stored with each block frame, and a comparator is needed for each frame to do the parallel associative lookup.

Set-Associative Placement The block address determines not a single frame, but a frame set (several frames, grouped together). (Frame set #) = (Block address) mod (# of frame sets) The block can be placed associatively anywhere within that frame set. If there are n frames in each frame set, the scheme is called “n-way set-associative”. Direct mapped = 1-way set-associative. Fully associative: There is only 1 frame set.

Set-Associative Identification (figure: the address is split into Tag | Set# | Offset; the set number selects one of the frame sets, 4 separate sets in the example, and the tag is compared in parallel against the tags within that set; on a hit the offset drives the mux select to pick out the data word). Set-associative is intermediate between direct-mapped and fully-associative in the number of tag bits that need to be associated with cache frames. It still needs a comparator for each frame (but only those in one set need be activated).

Cache Size Equation Simple equation for the size of a cache: (Cache size) = (Block size) × (Number of sets) × (Set associativity). These relate to the sizes of the address fields: (Block size) = 2^(# of offset bits); (Number of sets) = 2^(# of index bits); (# of tag bits) = (# of memory address bits) − (# of index bits) − (# of offset bits). (figure: the memory address is split into Tag | Index | Offset; the index field determines the set.)
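A small sketch that applies these relations; the cache geometry (32KB, 4-way, 64-byte blocks, 48-bit addresses) is assumed purely for illustration:

```python
# Derive the offset / index / tag field widths from the cache geometry.
# All parameters below are illustrative assumptions.
cache_size    = 32 * 1024   # bytes
block_size    = 64          # bytes
associativity = 4           # ways
address_bits  = 48

num_sets    = cache_size // (block_size * associativity)   # 128 sets
offset_bits = block_size.bit_length() - 1                  # log2(block size) = 6
index_bits  = num_sets.bit_length() - 1                    # log2(# of sets)  = 7
tag_bits    = address_bits - index_bits - offset_bits      # 48 - 7 - 6       = 35

print(num_sets, offset_bits, index_bits, tag_bits)          # 128 6 7 35
```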

Replacement Strategies Which block do we replace when a new block comes in (on a cache miss)? Direct-mapped: there's only one choice! Associative (fully- or set-): if any frame in the set is empty, pick one of those. Otherwise, there are many possible strategies: (Pseudo-) Random: simple, fast, and fairly effective. Least-Recently Used (LRU), and approximations thereof: requires bits to record replacement information, e.g. 4-way has 4! = 24 possible orderings, so 5 bits are needed to encode the MRU-to-LRU positions. Least-Frequently Used (LFU), using counters. Optimal: replace the block whose next use is farthest in the future (is this possible?).
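As a reference point for these strategies, here is a minimal software sketch of exact LRU for one set, tracked as a recency-ordered list of ways (illustrative only; the next slide shows how hardware encodes or approximates this ordering in a few bits):

```python
# Exact LRU for a single cache set, tracked as a recency-ordered list of ways.
# Purely an illustrative software model of the policy, not a hardware design.
class LRUSet:
    def __init__(self, num_ways: int):
        self.order = list(range(num_ways))   # front = LRU, back = MRU

    def access(self, way: int) -> None:
        self.order.remove(way)
        self.order.append(way)               # the touched way becomes MRU

    def victim(self) -> int:
        return self.order[0]                 # evict the least-recently-used way

s = LRUSet(4)
for way in (0, 1, 2, 3, 1):
    s.access(way)
print(s.victim())   # -> 0
```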

Implement LRU Replacement Pure LRU, 4-way: uses 6 bits, one per pair of ways (1-2, 1-3, 1-4, 2-3, 2-4, 3-4); the theoretical minimum is 5 bits. Partitioned LRU (pseudo-LRU): instead of recording the full ordering, use a binary tree that maintains only (n−1) bits for n-way set associativity. (figure: 4-way example with a 3-bit tree; each node bit points toward the less-recently-used half, and following the bits from the root identifies the replacement way.)
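A minimal sketch of the tree-based pseudo-LRU scheme for one 4-way set, using 3 bits (one per internal tree node). The bit convention used here, each bit pointing toward the less-recently-used half, is one common variant and is assumed for illustration:

```python
# Tree pseudo-LRU for a 4-way set: 3 state bits, one per internal tree node.
# Convention (assumed): each bit points toward the LESS recently used half.
class TreePLRU4:
    def __init__(self):
        self.root  = 0   # 0 -> pair (ways 0,1) is the LRU side, 1 -> pair (2,3)
        self.left  = 0   # index of the LRU way within ways 0,1
        self.right = 0   # index (minus 2) of the LRU way within ways 2,3

    def access(self, way: int) -> None:
        # Flip every bit on the path so it points away from the accessed way.
        if way < 2:
            self.root = 1            # the right pair is now the LRU side
            self.left = 1 - way      # the sibling of 'way' is LRU within the pair
        else:
            self.root = 0
            self.right = 3 - way     # sibling within ways 2,3

    def victim(self) -> int:
        # Follow the bits from the root toward the pseudo-LRU way.
        return self.left if self.root == 0 else 2 + self.right

p = TreePLRU4()
for way in (0, 1, 2, 3):
    p.access(way)
print(p.victim())   # -> 0 (accessed in order, so way 0 is the pseudo-LRU victim)
```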

Write Strategies Most accesses are reads, not writes, especially if instruction reads are included. Optimize for reads - they are what matter most for performance. A direct-mapped cache can return the value before the valid/tag check completes. Writes are more difficult: we can't write to the cache until we know we have the right block, and the object written may have various sizes (1-8 bytes). When to synchronize the cache with memory? Write-through - write to the cache and to memory; prone to stalls due to high bandwidth requirements. Write-back - write to memory only upon replacement; memory may be out of date.

Write Miss Strategies What do we do on a write to a block that's not in the cache? Two main strategies (neither stops the processor): Write-allocate (fetch on write) - cache the block. No-write-allocate (write around) - just write to memory. Write-back caches tend to use write-allocate; write-through tends to use no-write-allocate. In the write-back strategy, a dirty bit is used to indicate that a writeback is needed. Remember, the write won't occur until commit (in an out-of-order model).
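To tie these policies together, here is a minimal sketch of the write path of a write-back, write-allocate cache with a dirty bit per block; the structure and names are illustrative, not any specific processor's design:

```python
# Simplified write path of a write-back, write-allocate cache (one set shown).
# Illustrative sketch only: byte offsets, data values, and timing are not modeled.
class Block:
    def __init__(self):
        self.valid = False
        self.dirty = False
        self.tag = None

writebacks = []   # stand-in for traffic to the next memory level

def write(cache_set, tag):
    # Hit: update the cached copy and mark it dirty; memory becomes stale.
    for blk in cache_set:
        if blk.valid and blk.tag == tag:
            blk.dirty = True
            return "hit"
    # Miss: write-allocate -> pick a victim, writing it back first if dirty.
    victim = next((b for b in cache_set if not b.valid), cache_set[0])
    if victim.valid and victim.dirty:
        writebacks.append(victim.tag)        # memory is updated only on replacement
    victim.valid, victim.dirty, victim.tag = True, True, tag
    return "miss (allocated)"

ways = [Block(), Block()]
print([write(ways, t) for t in (0x1A, 0x2B, 0x1A, 0x3C)])
print("writebacks:", [hex(t) for t in writebacks])   # -> ['0x1a']
```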

Write Buffers A mechanism to help reduce write stalls. On a write to memory, the block address and the data to be written are placed in a write buffer; the CPU can then continue immediately, unless the write buffer is full. Write merging: if the same block is written again before it has been flushed to memory, the old contents are replaced with the new contents. Care must be taken not to violate memory consistency and proper write ordering.
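A minimal sketch of a write buffer with write merging, keyed by block address; the buffer depth and word granularity are assumptions made for illustration:

```python
# Write buffer with write merging: pending writes are buffered per block address,
# and a later write to an already-buffered block updates that entry in place.
# A depth of 4 entries and word-granularity data are illustrative assumptions.
from collections import OrderedDict

class WriteBuffer:
    def __init__(self, depth: int = 4):
        self.depth = depth
        self.entries = OrderedDict()   # block_address -> {word_offset: data}

    def write(self, block_address, word_offset, data):
        if block_address in self.entries:              # write merging
            self.entries[block_address][word_offset] = data
            return "merged"
        if len(self.entries) == self.depth:            # buffer full: CPU would stall
            self.drain_one()
        self.entries[block_address] = {word_offset: data}
        return "new entry"

    def drain_one(self):
        # Flush the oldest entry (FIFO order preserves write ordering to memory).
        block_address, words = self.entries.popitem(last=False)
        print(f"flush block {block_address:#x}: {words}")

wb = WriteBuffer()
print(wb.write(0x100, 0, 1))   # new entry
print(wb.write(0x100, 1, 2))   # merged into the same buffered block
```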

Write Merging Example

Instruction vs. Data Caches Instructions and data have different patterns of temporal and spatial locality Also instructions are generally read-only Can have separate instruction & data caches Advantages Doubles bandwidth between CPU & memory hierarchy Each cache can be optimized for its pattern of locality Disadvantages Slightly more complex design Can’t dynamically adjust cache space taken up by instructions vs. data; combined I/D cache has slightly higher hit ratios

I/D Split and Unified Caches

Misses per 1000 instructions, from 5 SPECint2000 and 5 SPECfp2000 benchmarks:

Size     I-Cache   D-Cache   Unified Cache
8KB      8.16      44.0      63.0
16KB     3.82      40.9      51.0
32KB     1.36      38.4      43.3
64KB     0.61      36.9      39.4
128KB    0.30      35.3      36.2
256KB    0.02      32.6      32.9

The instruction miss rate is much lower than the data miss rate. Miss rate comparison (36% of instructions are data transfers, so about 74% of accesses are instruction references and 26% are data references):
Unified 32KB cache: (43.3/1000) / (1 + 0.36) = 0.0318
Split 16KB + 16KB caches:
Miss rate, 16KB instruction: (3.82/1000) / 1.00 = 0.0038
Miss rate, 16KB data: (40.9/1000) / 0.36 = 0.1136
Combined: (74% × 0.0038) + (26% × 0.1136) ≈ 0.0324
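The comparison above is easy to reproduce; here is a short check in Python (the 74%/26% access split and the 36% data-transfer-instruction fraction come from the slide):

```python
# Reproduce the split-vs-unified miss rate comparison from the table above.
misses_per_1k = {"I_16KB": 3.82, "D_16KB": 40.9, "U_32KB": 43.3}
data_refs_per_instr = 0.36            # 36% of instructions transfer data
instr_frac, data_frac = 0.74, 0.26    # fraction of all memory accesses

unified = (misses_per_1k["U_32KB"] / 1000) / (1 + data_refs_per_instr)
i_rate  = (misses_per_1k["I_16KB"] / 1000) / 1.0
d_rate  = (misses_per_1k["D_16KB"] / 1000) / data_refs_per_instr
split   = instr_frac * i_rate + data_frac * d_rate

print(f"unified 32KB miss rate:    {unified:.4f}")   # ~0.0318
print(f"split 16KB+16KB miss rate: {split:.4f}")     # ~0.0324
```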

Other Hierarchy Levels We can treat components that aren't normally considered caches as part of the memory hierarchy anyway, and look at their placement / identification / replacement / write strategies. Levels under consideration: pipeline registers, register file, caches (multiple levels), physical memory (new memory technologies...), virtual memory, local files. The delay and capacity of each level are significantly different.

Example: Alpha AXP 21064 Direct-mapped

Another Example – Alpha 21264: 64KB, 2-way, 64-byte blocks, 512 sets, 44 physical address bits. See also Figure B.5.

Virtual Memory The addition of the virtual memory mechanism complicates cache access.

Addressing Virtual Memories

TLB Example: Alpha 21264 For fast address translation, a "cache" of the page table, called the TLB (translation lookaside buffer), is implemented.
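A minimal sketch of the translation step the TLB accelerates, assuming 8KB pages (matching the example on the next slide) and a TLB modeled as a simple dictionary; the page-table contents are illustrative:

```python
# Virtual-to-physical address translation with a TLB in front of the page table.
# 8KB pages -> 13 offset bits; the TLB here is just a dictionary (fully associative).
PAGE_OFFSET_BITS = 13
PAGE_SIZE = 1 << PAGE_OFFSET_BITS

tlb = {}                                   # virtual page number -> physical page number
page_table = {0x12: 0x7A, 0x13: 0x7B}      # illustrative page-table contents

def translate(virtual_address: int) -> int:
    vpn = virtual_address >> PAGE_OFFSET_BITS
    offset = virtual_address & (PAGE_SIZE - 1)
    if vpn in tlb:                         # TLB hit: no page-table access needed
        ppn = tlb[vpn]
    else:                                  # TLB miss: consult the page table
        ppn = page_table[vpn]              # a real miss might instead page-fault
        tlb[vpn] = ppn                     # fill the TLB for next time
    return (ppn << PAGE_OFFSET_BITS) | offset

print(hex(translate(0x25ABC)))             # VPN 0x12 maps to PPN 0x7A
```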

A Memory Hierarchy Example Virtual address: 64 bits; physical address: 41 bits. Page = 8KB, block = 64B. TLB: direct-mapped, 256 entries. L1: direct-mapped, 8KB. L2: direct-mapped, 4MB.

Multi-level Virtual Addressing The full page table is sparse. Ways to reduce page table size: - Multi-level page table - Inverted page table (a two-level walk is sketched below)
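A minimal sketch of a two-level page-table walk; the field widths (10 + 10 index bits, 12-bit page offset) and the table contents are illustrative assumptions, not taken from the Alpha example above:

```python
# Two-level page table: only the second-level tables that are actually used need
# to exist, which keeps the structure small for sparse virtual address spaces.
# Field widths (10 + 10 index bits, 12-bit offset) are illustrative assumptions.
L1_BITS, L2_BITS, OFFSET_BITS = 10, 10, 12

second_level_0 = {5: 0x1234}         # L2 index -> physical page number
top_level = {3: second_level_0}      # L1 index -> second-level table (or absent)

def walk(virtual_address: int):
    offset   = virtual_address & ((1 << OFFSET_BITS) - 1)
    l2_index = (virtual_address >> OFFSET_BITS) & ((1 << L2_BITS) - 1)
    l1_index = (virtual_address >> (OFFSET_BITS + L2_BITS)) & ((1 << L1_BITS) - 1)
    second = top_level.get(l1_index)
    if second is None or l2_index not in second:
        return None                  # page fault: translation not present
    return (second[l2_index] << OFFSET_BITS) | offset

va = (3 << 22) | (5 << 12) | 0xABC   # l1 index = 3, l2 index = 5, offset = 0xABC
print(hex(walk(va)))                 # -> 0x1234abc
```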