Lecture 5: Cache Operation
ECE 463/521, Fall 2002
Edward F. Gehringer
Based on notes by Drs. Eric Rotenberg & Tom Conte of NCSU.

Outline
– Review of cache parameters
– Example of operation of a direct-mapped cache
– Example of operation of a set-associative cache
– Simulating a generic cache
– Write-through vs. write-back caches
– Write-allocate vs. no-write-allocate
– Victim caches

Cache Parameters
SIZE = total amount of cache data storage, in bytes
BLOCKSIZE = total number of bytes in a single block
ASSOC = associativity, i.e., the # of blocks in a set

Cache Parameters (cont.)
Equation for the # of cache blocks in the cache: # blocks = SIZE / BLOCKSIZE
Equation for the # of sets in the cache: # sets = SIZE / (BLOCKSIZE × ASSOC)
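For example, the 256-byte caches with 32-byte blocks used later in this lecture have 256 / 32 = 8 blocks: direct-mapped (ASSOC = 1) that means 8 sets, while 2-way set-associative (ASSOC = 2) gives 256 / (32 × 2) = 4 sets.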

Address Fields
A 32-bit address (bit 31 down to bit 0) is divided into three fields: tag | index | block offset.
– Tag: compared to the tag(s) of the indexed cache line(s). If it matches, the block is there (hit); if it doesn't match, the block is not there (miss).
– Index: used to look up a "set," whose lines contain one or more memory blocks. (The # of blocks per set is the "associativity.")
– Block offset: once a block is found, the offset selects a particular byte or word of data in the block.

Address Fields (cont.)
Widths of the address fields (# bits), assuming 32-bit addresses:
# index bits = log2(# sets)
# block offset bits = log2(block size)
# tag bits = 32 – # index bits – # block offset bits
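To make this concrete, here is a minimal sketch in C (not from the original slides; the helper log2u and the variable names are illustrative). It decomposes an address for the direct-mapped cache of the upcoming example (256 bytes, 32-byte blocks, hence 8 sets):

    #include <stdint.h>
    #include <stdio.h>

    /* log2 of a power of two, e.g., log2u(32) = 5. */
    static int log2u(uint32_t x) {
        int n = 0;
        while (x >>= 1) n++;
        return n;
    }

    int main(void) {
        uint32_t size = 256, blocksize = 32, assoc = 1;   /* direct-mapped */
        uint32_t num_sets = size / (blocksize * assoc);   /* 8 sets */
        int offset_bits = log2u(blocksize);               /* 5 */
        int index_bits  = log2u(num_sets);                /* 3 */

        uint32_t addr   = 0xFF0040E0;
        uint32_t offset = addr & (blocksize - 1);                  /* low 5 bits  */
        uint32_t index  = (addr >> offset_bits) & (num_sets - 1);  /* next 3 bits */
        uint32_t tag    = addr >> (offset_bits + index_bits);      /* top 24 bits */

        printf("tag=0x%06X index=%u offset=%u\n",
               (unsigned)tag, (unsigned)index, (unsigned)offset);
        /* Prints: tag=0xFF0040 index=7 offset=0, matching the table below. */
        return 0;
    }

Masking with blocksize - 1 and num_sets - 1 works because both are powers of two.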

[Diagram: cache lookup. The address (from the processor) is split into tag, index, and block offset. The index selects a row in the tag and data arrays; the stored tag is compared with the address's tag field ("Match?"), producing hit/miss; the block offset then selects the requested byte or word, which is returned to the processor.]

Example of cache operation
Example: the processor accesses a 256-byte direct-mapped cache, which has 32-byte lines, with the following sequence of addresses.
– Show the contents of the cache after each access.
– Count the # of hits and # of replacements.

Example address sequence
Trace: 0xFF0040E0, 0xBEEF005C, 0xFF0040E2, 0xFF0040E8, 0x…, 0x002183E0, 0x…, 0x…C, 0x…
For each address we fill in: Tag (hex), Index & offset bits (binary), Index (decimal), and a Comment (hit/miss).

Example (cont.)
# index bits = log2(# sets) = log2(8) = 3
# block offset bits = log2(block size) = log2(32 bytes) = 5
# tag bits = total # address bits – # index bits – # block offset bits = 32 – 3 – 5 = 24
Thus, the top 6 nibbles (24 bits) of the address form the tag, and the lower 2 nibbles (8 bits) form the index and block-offset fields.

Address (hex) | Tag (hex) | Index & offset bits (binary) | Index (decimal) | Comment
0xFF0040E0    | 0xFF0040  | 111 00000                    | 7               | Miss
0xBEEF005C    | 0xBEEF00  | 010 11100                    | 2               | Miss
0xFF0040E2    | 0xFF0040  | 111 00010                    | 7               | Hit
0xFF0040E8    | 0xFF0040  | 111 01000                    | 7               | Hit
0x…           | 0x…       | …                            | …               | Miss
0x002183E0    | 0x002183  | 111 00000                    | 7               | Miss/Replace
0x…           | 0x…       | …                            | …               | Hit
0x…C          | 0x…       | …                            | …               | Miss/Replace
0x…           | 0x…       | …                            | …               | Hit
Totals: 4 hits, 5 misses (2 of the misses cause replacements).

[Diagrams: direct-mapped cache contents after each access]
Access 1 (0xFF0040E0): miss; the block is fetched from memory and tag FF0040 is placed in set 7.
Access 2 (0xBEEF005C): miss; tag BEEF00 is placed in set 2.
Access 3 (0xFF0040E2): hit; tag FF0040 matches in set 7.
Access 4 (0xFF0040E8): hit in set 7.
Access 5: miss; the block is fetched from memory.
Access 6 (0x002183E0): miss; tag 002183 replaces tag FF0040 in set 7.
Access 7: hit.
Access 8: miss; a resident block is replaced.
Access 9: hit.

Set-Associative Example
Example: the processor accesses a 256-byte 2-way set-associative cache with a block size of 32 bytes, using the following sequence of addresses.
– Show the contents of the cache after each access.
– Count the # of hits and # of replacements.

[Diagram: 2-way set-associative lookup. The index selects a set; the tags of both ways are compared in parallel (two "Match?" comparators). A hit in either way selects that way's 32-byte block, and the block offset then selects the requested bytes.]

Example (cont.)
# index bits = log2(# sets) = log2(4) = 2
# block offset bits = log2(block size) = log2(32 bytes) = 5
# tag bits = total # address bits – # index bits – # block offset bits = 32 – 2 – 5 = 25
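As a worked check of these widths: for 0xFF0040E0, the low 5 bits (00000) are the block offset, the next 2 bits (11) give index 3, and the remaining 25 bits, 0xFF0040E0 >> 7 = 0x1FE0081, form the tag, matching the first row of the table below.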

Address (hex) | Tag (hex)  | Index & offset bits (binary) | Index (decimal) | Comment
0xFF0040E0    | 0x1FE0081  | 11 00000                     | 3               | Miss
0xBEEF005C    | 0x17DDE00  | 10 11100                     | 2               | Miss
0x…           | 0x…        | …                            | …               | Miss
0xFF0040E2    | 0x1FE0081  | 11 00010                     | 3               | Hit
0x…           | 0x…        | …                            | …               | Hit
0x002183E0    | 0x004307   | 11 00000                     | 3               | Miss/Replace
0x…           | 0x…        | …                            | …               | Hit
Totals: 3 hits, 4 misses (1 of the misses causes a replacement).

[Diagrams: 2-way set-associative cache contents after each access; data arrays not shown for convenience]
Access 1 (0xFF0040E0): miss; tag 1FE0081 is placed in set 3.
Access 2 (0xBEEF005C): miss; tag 17DDE00 is placed in set 2.
Access 3: miss; the block is fetched from memory.
Access 4 (0xFF0040E2): hit; tag 1FE0081 matches in set 3.
Access 5: hit.
Access 6 (0x002183E0): miss; tag 004307 replaces the LRU block in set 3.
Access 7: hit.

Generic Cache
Every cache is an n-way set-associative cache.
1. Direct-mapped: 1-way set-associative.
2. n-way set-associative: n-way set-associative.
3. Fully-associative: n-way set-associative, where there is only 1 set containing n blocks.
   # index bits = log2(1) = 0 (the equation still works!)

Generic Cache (cont.)
The same equations hold for any cache type:
# cache blocks = SIZE / BLOCKSIZE
# sets = SIZE / (BLOCKSIZE × ASSOC)
Fully-associative: ASSOC = # cache blocks.

Generic Cache (cont.)
What this means:
– You don't have to treat the three cache types differently in your simulator!
– Support a generic n-way set-associative cache.
– You don't have to specifically worry about the two extremes (direct-mapped & fully-associative).
– E.g., even a DM cache has LRU information (in the simulator), even though it is not used.
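A sketch of what such a generic structure might look like in C (illustrative only, not the course's required design; the Block and Cache names are assumptions):

    #include <stdint.h>
    #include <stdlib.h>

    typedef struct {
        uint32_t tag;
        int      valid;
        int      dirty;    /* used by the write-back policy (later slides) */
        uint64_t lru_seq;  /* LRU sequence number (next slides) */
    } Block;

    typedef struct {
        uint32_t num_sets, assoc, blocksize;
        Block   *blocks;   /* num_sets * assoc entries, stored set by set */
    } Cache;

    /* One constructor covers all three cache types:
       direct-mapped (assoc = 1), n-way set-associative, and
       fully associative (assoc = #blocks, so num_sets = 1). */
    Cache *cache_new(uint32_t size, uint32_t blocksize, uint32_t assoc) {
        Cache *c = malloc(sizeof *c);
        c->assoc     = assoc;
        c->blocksize = blocksize;
        c->num_sets  = size / (blocksize * assoc);
        c->blocks    = calloc((size_t)c->num_sets * assoc, sizeof(Block));
        return c;
    }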

Replacement Policy
Which block in a set should be replaced when a new block has to be allocated?
– LRU (least recently used) is the most common policy.
– Implementation:
  Real hardware maintains LRU bits with each block in the set.
  Simulator: cheat by using "sequence numbers" as timestamps. This makes simulation easier and gives the same effect as the hardware.

LRU Example
Assume that blocks A, B, C, D, and E all map to the same set (a 4-way set, so the fifth distinct block forces a replacement).
Trace: A B C D D D B D E
Assign sequence numbers, incrementing by 1 for each new access: A(0) B(1) C(2) D(3) D(4) D(5) B(6) D(7) E(8).
Set contents after each access (the block with the smallest sequence number is the LRU block):
A(0)
A(0) B(1)
A(0) C(2) B(1)
A(0) C(2) B(1) D(3)
A(0) C(2) B(1) D(4)
A(0) C(2) B(1) D(5)
A(0) C(2) B(6) D(5)
A(0) C(2) B(6) D(7)
E(8) C(2) B(6) D(7)   (E replaces A, the block with the smallest sequence number)
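In the simulator, the sequence-number trick might look like this in C (a sketch building on the hypothetical Cache/Block structures above; global_seq is an assumed global counter):

    /* Timestamp a block on every access (hit or new allocation). */
    static uint64_t global_seq = 0;

    void touch(Block *b) {
        b->lru_seq = ++global_seq;
    }

    /* Choose the victim in a set: an invalid block if one exists,
       otherwise the valid block with the smallest sequence number,
       i.e., the least recently used. */
    uint32_t lru_victim(Cache *c, uint32_t set) {
        Block *s = &c->blocks[set * c->assoc];
        uint32_t victim = 0;
        for (uint32_t w = 0; w < c->assoc; w++) {
            if (!s[w].valid) return w;   /* free way: no replacement needed */
            if (s[w].lru_seq < s[victim].lru_seq)
                victim = w;
        }
        return victim;
    }

In the trace above, this picks A (sequence number 0) when E arrives.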

Handling Writes
What happens when a write occurs? Two issues:
1. Is just the cache updated with the new data, or are lower levels of the memory hierarchy updated at the same time?
2. Do we allocate a new block in the cache if the write misses?

The Write-Update Question
Write-through (WT) policy: writes that go to the cache are also "written through" to the next level in the memory hierarchy.
[Diagram: a write updates the cache and is simultaneously passed through to the next level in the memory hierarchy.]

The Write-Update Question, cont.
Write-back (WB) policy: writes go only to the cache, and are not (immediately) written through to the next level of the hierarchy.
[Diagram: a write updates only the cache; the next level in the memory hierarchy is not updated.]

The Write-Update Question, cont.
Write-back, cont. What happens when a line previously written to needs to be replaced?
1. We need a "dirty bit" (D) with each line in the cache, and we set it when the line is written to.
2. When a dirty block is replaced, we need to write the entire block back to the next level of memory (a "write-back").

The Write-Update Question, cont.
With the write-back policy, replacement of a dirty block triggers a writeback of the entire line.
[Diagram: a dirty line (D) in the cache is written back to the next level in the memory hierarchy when it is replaced.]

The Write-Allocation Question
Write-Allocate (WA):
– Bring the block into the cache if the write misses (handled just like a read miss).
– Typically used with the write-back policy: WBWA.
Write-No-Allocate (NA):
– Do not bring the block into the cache if the write misses.
– Must be used in conjunction with write-through: WTNA.

The Write-Allocation Question, cont.
WTNA (scenario: the write misses):
[Diagram: the write bypasses the cache and goes directly to the next level in the memory hierarchy.]

The Write-Allocation Question, cont.
WBWA (scenario: the write misses):
[Diagram: the block is first allocated in the cache, then written and marked dirty (D); the next level is not updated.]
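A sketch of how a simulator might dispatch a write under the two supported combinations, WBWA and WTNA (again building on the hypothetical structures above; the lookup, allocate, and write_next_level helpers are assumptions, not shown):

    typedef enum { WBWA, WTNA } WritePolicy;

    /* Hypothetical helpers, not shown: */
    Block *lookup(Cache *c, uint32_t addr);      /* returns NULL on miss */
    Block *allocate(Cache *c, uint32_t addr);    /* may write back a dirty victim */
    void   write_next_level(uint32_t addr);

    void handle_write(Cache *c, uint32_t addr, WritePolicy pol) {
        Block *b = lookup(c, addr);
        if (b == NULL && pol == WBWA)
            b = allocate(c, addr);       /* write-allocate: treat like a read miss */

        if (b != NULL) {
            /* ... update the data in the cached block ... */
            if (pol == WBWA)
                b->dirty = 1;            /* write-back: defer updating the next level */
            else
                write_next_level(addr);  /* write-through: update the next level too */
        } else {
            write_next_level(addr);      /* WTNA write miss: bypass the cache */
        }
    }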

Victim Caches
A victim cache is a small fully-associative cache that sits alongside the primary cache.
– E.g., it holds 2–16 cache blocks.
– Its management is slightly "weird" compared to conventional caches:
  When the main cache evicts (replaces) a block, the victim cache takes the evicted (replaced) block. The evicted block is called the "victim."
  When the main cache misses, the victim cache is searched for recently discarded blocks. A victim-cache hit means the main cache doesn't have to go to memory for the block.

Victim-Cache Example
Suppose we have a 2-entry victim cache.
– It initially holds blocks X and Y; Y is the LRU block in the victim cache.
The main cache is direct-mapped.
– Blocks A and B map to the same set in the main cache.
– Trace: A B A B A B …

Victim-Cache Example, cont.
[Diagram: initial state. Main cache holds A; victim cache holds X and Y (Y is LRU).]

Victim-Cache Example, cont.
[Diagram: main cache holds B; victim cache holds X (LRU) and A.]
1. B misses in the main cache and evicts A.
2. A goes to the victim cache and replaces Y (the previous LRU).
3. X becomes LRU.

Victim-Cache Example, cont.
[Diagram: L1 cache holds A; victim cache holds X (LRU) and B.]
1. A misses in L1 but hits in the victim cache, so A and B swap positions:
2. A is moved from the victim cache to L1, and B (the victim) goes to the victim-cache slot where A was located.
(Note: we don't replace the LRU block, X, in the case of a victim-cache hit.)
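The swap just described might be coded like this (a sketch; every helper here is hypothetical):

    /* Hypothetical helpers, not shown: */
    Block *lookup(Cache *c, uint32_t addr);      /* returns NULL on miss */
    Block  evict_for(Cache *c, uint32_t addr);   /* block displaced to make room */
    Block  fetch_from_memory(uint32_t addr);
    void   install(Cache *c, uint32_t addr, Block b);
    void   replace_lru(Cache *c, Block b);       /* overwrite the victim cache's LRU entry */

    /* Miss in the main (L1) cache when a victim cache is present. */
    void l1_miss(Cache *l1, Cache *victim, uint32_t addr) {
        Block evicted = evict_for(l1, addr);     /* the "victim" */
        Block *v = lookup(victim, addr);
        if (v != NULL) {
            /* Victim-cache hit: swap. No memory access, and the victim
               cache's LRU entry is NOT replaced. */
            install(l1, addr, *v);
            *v = evicted;                        /* takes the slot the hit vacated */
        } else {
            install(l1, addr, fetch_from_memory(addr));
            replace_lru(victim, evicted);        /* discarded block becomes a victim */
        }
    }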

Victim Cache – Why?
Direct-mapped caches suffer badly from repeated conflicts.
– The victim cache provides the illusion of set associativity; a poor man's version of set associativity.
– A victim cache does not have to be large to be effective; even a 4–8 entry victim cache will frequently remove > 50% of conflict misses.