Lecture 5 Cache Operation


Lecture 5 Cache Operation ECE 463/521 Spring 2005 Edward F. Gehringer Based on notes by Drs. Eric Rotenberg & Tom Conte of NCSU

Outline Review of cache parameters Example of operation of a direct-mapped cache. Example of operation of a set-associative cache. Simulating a generic cache. Write-through vs. write-back caches. Write-allocate vs. no write-allocate. Victim caches.

Cache Parameters SIZE = total amount of cache data storage, in bytes BLOCKSIZE = total number of bytes in a single block ASSOC = associativity, i.e., # of lines in a set

Cache Parameters (cont.) Equation for # of cache lines in cache: # lines = SIZE / BLOCKSIZE. Equation for # of sets in cache: # sets = SIZE / (BLOCKSIZE × ASSOC) = # lines / ASSOC. E.g., the 256-byte cache with 32-byte lines used below has 256/32 = 8 lines; organized direct-mapped (ASSOC = 1), that is 8 sets of one line each.

Address Fields A 32-bit address is divided into three fields: tag, index, and block offset (bit 31 at the left, bit 0 at the right). The index is used to look up a “set,” whose lines contain one or more memory blocks. (The # of lines per set is the “associativity”.) The tag field is compared to the tag(s) of the indexed cache line(s). – If it matches, the block is there (hit). – If it doesn’t match, the block is not there (miss). Once a block is found, the offset selects a particular byte or word of data in the block.

Address Fields (cont.) Widths of address fields (# bits), assuming 32-bit addresses: # index bits = log2(# sets); # block offset bits = log2(block size); # tag bits = 32 – # index bits – # block offset bits.

[Figure: lookup datapath. The 32-bit address from the processor is split into tag, index, and block offset. The index selects a set; each stored tag in that set is compared with the address tag (“Match?”) to produce hit/miss (to processor), and the block offset performs a byte or word select on the data to return the requested byte or word.]
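To make the field extraction concrete, here is a minimal C sketch (an illustration, not from the lecture; the helper names are invented):

#include <stdio.h>
#include <stdint.h>

static uint32_t get_offset(uint32_t addr, int offset_bits) {
    return addr & ((1u << offset_bits) - 1);
}
static uint32_t get_index(uint32_t addr, int offset_bits, int index_bits) {
    return (addr >> offset_bits) & ((1u << index_bits) - 1);
}
static uint32_t get_tag(uint32_t addr, int offset_bits, int index_bits) {
    return addr >> (offset_bits + index_bits);
}

int main(void) {
    /* Parameters of the direct-mapped example that follows:
     * 8 sets -> 3 index bits, 32-byte blocks -> 5 offset bits. */
    uint32_t addr = 0xFF0040E0;
    printf("tag=0x%06X index=%u offset=%u\n",
           get_tag(addr, 5, 3), get_index(addr, 5, 3), get_offset(addr, 5));
    /* Prints: tag=0xFF0040 index=7 offset=0 */
    return 0;
}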

Outline Review of cache parameters Example of operation of a direct-mapped cache. Example of operation of a set-associative cache. Simulating a generic cache. Write-through vs. write-back caches. Write-allocate vs. no write-allocate. Victim caches.

Example of cache operation Example: Processor accesses a 256-byte direct-mapped cache, which has 32-byte lines, with the following sequence of addresses. Show contents of the cache after each access. Count # of hits and # of replacements.

Example address sequence (fill in the tag, index & offset bits, index, and comment for each access):

Address (hex)   Tag (hex)   Index & offset bits (binary)   Index (decimal)   Comment
0xFF0040E0
0xBEEF005C
0xFF0040E2
0xFF0040E8
0x00101078
0x002183E0
0x00101064
0x0012255C
0x00122544

Example (cont.) # index bits = log2(# sets) = log2(8) = 3. # block offset bits = log2(block size) = log2(32 bytes) = 5. # tag bits = total # address bits – # index bits – # block offset bits = 32 bits – 3 bits – 5 bits = 24. Thus the top 6 nibbles (24 bits) of the address form the tag, and the lower 2 nibbles (8 bits) form the index and block-offset fields.

Address (hex)   Tag (hex)   Index & offset bits (binary)   Index (decimal)   Comment
0xFF0040E0      0xFF0040    1110 0000                      7                 Miss
0xBEEF005C      0xBEEF00    0101 1100                      2                 Miss
0xFF0040E2      0xFF0040    1110 0010                      7                 Hit
0xFF0040E8      0xFF0040    1110 1000                      7                 Hit
0x00101078      0x001010    0111 1000                      3                 Miss
0x002183E0      0x002183    1110 0000                      7                 Miss + Replace
0x00101064      0x001010    0110 0100                      3                 Hit
0x0012255C      0x001225    0101 1100                      2                 Miss + Replace
0x00122544      0x001225    0100 0100                      2                 Hit
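The table can be cross-checked mechanically. Below is a small C sketch (an illustration, not the course simulator) that replays the trace through the 256-byte direct-mapped cache, keeping one tag per set:

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t trace[] = { 0xFF0040E0, 0xBEEF005C, 0xFF0040E2, 0xFF0040E8,
                         0x00101078, 0x002183E0, 0x00101064, 0x0012255C,
                         0x00122544 };
    uint32_t tags[8];                             /* one tag per set */
    int      valid[8] = {0};

    for (int i = 0; i < 9; i++) {
        uint32_t tag   = trace[i] >> 8;           /* 24-bit tag */
        uint32_t index = (trace[i] >> 5) & 0x7;   /* 3 index bits */
        if (valid[index] && tags[index] == tag) {
            printf("0x%08X  set %u  Hit\n", trace[i], index);
        } else {
            printf("0x%08X  set %u  Miss%s\n", trace[i], index,
                   valid[index] ? " + Replace" : "");
            tags[index]  = tag;
            valid[index] = 1;
        }
    }
    return 0;   /* output: 4 hits, 5 misses, 2 replacements, matching the table */
}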

MISS: 0xFF0040E0 → tag FF0040, index 7. Set 7 is empty, so there is no match. Get the block from memory (slow) and install it in set 7 with tag FF0040.

MISS: 0xBEEF005C → tag BEEF00, index 2. No match in set 2; the block is fetched from memory and installed. The cache now holds BEEF00 (set 2) and FF0040 (set 7).

HIT: 0xFF0040E2 → tag FF0040, index 7. The stored tag matches.

HIT: 0xFF0040E8 → tag FF0040, index 7. Same block as the previous access, different offset; the stored tag matches again.

MISS: 0x00101078 → tag 001010, index 3. No match in set 3; the block is fetched from memory and installed. The cache now holds BEEF00 (set 2), 001010 (set 3), and FF0040 (set 7).

MISS & REPLACE: 0x002183E0 → tag 002183, index 7. Set 7 holds FF0040, which does not match, so the new block is fetched from memory (slow) and replaces it; set 7 now holds 002183.

HIT: 0x00101064 → tag 001010, index 3. The stored tag matches.

MISS & REPLACE: 0x0012255C → tag 001225, index 2. Set 2 holds BEEF00, which does not match; the new block replaces it, and set 2 now holds 001225.

HIT: 0x00122544 → tag 001225, index 2. The stored tag matches. Final tally: 4 hits, 5 misses, 2 replacements.

Outline Review of cache parameters Example of operation of a direct-mapped cache. Example of operation of a set-associative cache. Simulating a generic cache. Write-through vs. write-back caches. Write-allocate vs. no write-allocate. Victim caches.

Set-Associative Example Example: Processor accesses a 256-byte 2-way set-associative cache with 32-byte lines, with the following sequence of addresses. Show the contents of the cache after each access. Count # of hits and # of replacements.

[Figure: 2-way set-associative lookup. The address from the processor is split into tag, index, and block offset. The index selects a set; the tags of both lines in the set are compared in parallel (“Match?”), the matching way’s 32-byte block is selected, the block offset selects the requested bytes, and the two match signals combine to produce hit or miss.]

Example (cont.) # index bits = log2(# sets) = log2(4) = 2. # block offset bits = log2(block size) = log2(32 bytes) = 5. # tag bits = total # address bits – # index bits – # block offset bits = 32 bits – 2 bits – 5 bits = 25.

Address (hex)   Tag (hex)    Index & offset bits (binary)   Index (decimal)   Comment
0xFF0040E0      0x1FE0081    110 0000                       3                 Miss
0xBEEF005C      0x17DDE00    101 1100                       2                 Miss
0x00101078      0x0002020    111 1000                       3                 Miss
0xFF0040E2      0x1FE0081    110 0010                       3                 Hit
0x002183E0      0x0004307    110 0000                       3                 Miss + Replace
0x00101064      0x0002020    110 0100                       3                 Hit

MISS: 0xFF0040E0 → tag 1FE0081, index 3. Both ways of set 3 are empty; the block is fetched from memory and installed in way 1 with tag 1FE0081. (Data is not shown in these figures, for convenience.)

MISS: 0xBEEF005C → tag 17DDE00, index 2. No match in set 2; the block is installed in way 1 of set 2.

MISS: 0x00101078 → tag 0002020, index 3. Neither way of set 3 matches; the block is installed in way 2. Set 3 now holds 1FE0081 and 0002020.

HIT: 0xFF0040E2 → tag 1FE0081, index 3. Way 1 of set 3 matches.

HIT: a further access to the 0002020 block (tag 0002020, index 3) matches way 2 of set 3, which makes 1FE0081 the least recently used line in the set.

MISS & REPLACE LRU BLOCK: 0x002183E0 → tag 0004307, index 3. Neither way matches and the set is full, so the LRU block (1FE0081) is replaced. Set 3 now holds 0004307 and 0002020.

HIT: 0x00101064 → tag 0002020, index 3. Way 2 of set 3 matches.

Outline Review of cache parameters Example of operation of a direct-mapped cache. Example of operation of a set-associative cache. Simulating a generic cache. Write-through vs. write-back caches. Write-allocate vs. no write-allocate. Victim caches.

Generic Cache Every cache is an n-way set-associative cache. Direct-mapped: 1-way set-associative. n-way set-associative: n-way set-associative. Fully associative: n-way set-associative where there is only 1 set, containing n blocks; # index bits = log2(1) = 0 (the equation still works!).

Generic Cache (cont.) The same equations hold for any cache type. Equation for # of cache lines in cache: # lines = SIZE / BLOCKSIZE. Equation for # of sets in cache: # sets = SIZE / (BLOCKSIZE × ASSOC). Fully-associative: ASSOC = # cache lines, so there is just 1 set.

Generic Cache (cont.) What this means: you don’t have to treat the three cache types differently in your simulator! Support a generic n-way set-associative cache, and you don’t have to specifically worry about the two extremes (direct-mapped & fully-associative). E.g., even a DM cache has LRU information (in the simulator), even though it is not used. A sketch of such a generic cache appears after the LRU example below.

Replacement Policy Which block in a set should be replaced when a new block has to be allocated? LRU (least recently used) is the most common policy. Implementation: Real hardware maintains LRU bits with each line in the set. Simulator: Cheat using “sequence numbers” as timestamps. Makes simulation easier – gives the same effect as hardware.

LRU Example Assume that blocks A, B, C, D, and E all map to the same (4-line) set. Trace: A B C D D D B D E. Assign sequence numbers (increment by 1 for each new access). This yields A(0) B(1) C(2) D(3) D(4) D(5) B(6) D(7) E(8), and the set’s contents evolve as follows:

After A: A(0)
After B: A(0) B(1)
After C: A(0) B(1) C(2)
After D: A(0) B(1) C(2) D(3)
After D: A(0) B(1) C(2) D(4)
After D: A(0) B(1) C(2) D(5)
After B: A(0) B(6) C(2) D(5)
After D: A(0) B(6) C(2) D(7)
After E: E(8) B(6) C(2) D(7). A, with the smallest sequence number, was the LRU block, so E replaces it.
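Putting the last few slides together, here is one possible C sketch of a generic n-way set-associative cache with sequence-number LRU. All names (Cache, Line, cache_access, …) are invented for illustration; this is not the course’s required interface:

#include <stdint.h>
#include <stdlib.h>

typedef struct {
    int      valid;
    uint32_t tag;
    uint64_t last_used;    /* sequence number of the most recent access */
} Line;

typedef struct {
    int      num_sets, assoc, offset_bits, index_bits;
    uint64_t seq;          /* global access counter ("timestamp") */
    Line    *lines;        /* num_sets * assoc lines, one row per set */
} Cache;

static int log2u(uint32_t x) { int n = 0; while (x >>= 1) n++; return n; }

/* SIZE, BLOCKSIZE, ASSOC as defined earlier; both sizes powers of 2.
 * Direct-mapped: assoc = 1. Fully associative: assoc = size/blocksize. */
Cache *cache_new(uint32_t size, uint32_t blocksize, int assoc) {
    Cache *c = malloc(sizeof *c);
    c->num_sets    = size / (blocksize * assoc);
    c->assoc       = assoc;
    c->offset_bits = log2u(blocksize);
    c->index_bits  = log2u((uint32_t)c->num_sets);  /* 0 if fully associative */
    c->seq         = 0;
    c->lines       = calloc((size_t)c->num_sets * assoc, sizeof(Line));
    return c;
}

/* Returns 1 on a hit, 0 on a miss (which allocates, replacing the LRU line). */
int cache_access(Cache *c, uint32_t addr) {
    uint32_t index = (addr >> c->offset_bits) & (uint32_t)(c->num_sets - 1);
    uint32_t tag   = addr >> (c->offset_bits + c->index_bits);
    Line *set = &c->lines[(size_t)index * c->assoc];
    c->seq++;
    for (int i = 0; i < c->assoc; i++)            /* search all ways for a hit */
        if (set[i].valid && set[i].tag == tag) {
            set[i].last_used = c->seq;            /* update LRU information */
            return 1;
        }
    Line *victim = &set[0];                       /* pick an invalid way, else LRU */
    for (int i = 1; i < c->assoc && victim->valid; i++)
        if (!set[i].valid || set[i].last_used < victim->last_used)
            victim = &set[i];
    victim->valid     = 1;
    victim->tag       = tag;
    victim->last_used = c->seq;
    return 0;
}

Note that with assoc = 1 this degenerates to a direct-mapped cache (the LRU fields exist but never matter), and with num_sets = 1 (index_bits = 0) it is fully associative, exactly as the slides describe.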

Outline Review of cache parameters Example of operation of a direct-mapped cache. Example of operation of a set-associative cache. Simulating a generic cache. Write-through vs. write-back caches. Write-allocate vs. no write-allocate. Victim caches.

Handling Writes What happens when a write occurs? Two issues: (1) Is just the cache updated with the new data, or are lower levels of the memory hierarchy updated at the same time? (2) Do we allocate a new line in the cache if the write misses?

The Write-Update Question Write-through (WT) policy: Writes that go to the cache are also “written through” to the next level in the memory hierarchy. [Figure: a write updates both the cache and the next level in the memory hierarchy.]

The Write-Update Question, cont. Write-back (WB) policy: Writes go only to the cache, and are not (immediately) written through to the next level of the hierarchy. [Figure: a write updates the cache only; the next level in the memory hierarchy is not updated at that time.]

The Write-Update Question, cont. Write-back, cont. What happens when a line previously written to needs to be replaced? We need to have a “dirty bit” (D) with each line in the cache, and set it when the line is written to. When a dirty line is replaced, we need to write the entire line back to the next level of memory (a “write-back”).

The Write-Update Question, cont. With the write-back policy, replacement of a dirty line triggers a writeback of the entire line. [Figure: a dirty line (D) in the cache is written back to the next level in the memory hierarchy when it is replaced.]

Outline Review of cache parameters Example of operation of a direct-mapped cache. Example of operation of a set-associative cache. Simulating a generic cache. Write-through vs. write-back caches. Write-allocate vs. no write-allocate. Victim caches.

The Write-Allocation Question Write-Allocate (WA): Bring the block into the cache if the write misses (handled just like a read miss). Typically used with the write-back policy: WBWA. Write-No-Allocate (NA): Do not bring the block into the cache if the write misses. Must be used in conjunction with write-through: WTNA.

The Write-Allocation Question, cont. WTNA (scenario: the write misses). [Figure: the write miss bypasses the cache entirely; the data goes directly to the next level in the memory hierarchy.]

The Write-Allocation Question, cont. WBWA (scenario: the write misses). [Figure: the missing block is first brought into the cache from the next level, like a read miss; the write then updates the cached copy and sets its dirty bit (D).]
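As the slides note, the common pairings are WTNA and WBWA. Here is a sketch of the write path only, extending the generic cache sketch above with a per-line dirty bit and two policy flags; cache_lookup (returns the matching line or NULL), cache_fill (allocates exactly like a read miss), and memory_write are hypothetical helper names, not from the lecture:

/* Assumes Line gains an `int dirty` field and Cache gains
 * `int write_through` and `int write_allocate` policy flags. */
void cache_write(Cache *c, uint32_t addr) {
    Line *line = cache_lookup(c, addr);             /* NULL on a write miss */
    if (!line && c->write_allocate)
        line = cache_fill(c, addr);                 /* WA: treat like a read miss */
    if (line) {
        if (c->write_through)
            memory_write(addr);                     /* WT: update next level too */
        else
            line->dirty = 1;                        /* WB: defer to replacement */
    } else {
        memory_write(addr);                         /* NA: bypass the cache */
    }
}

Under WB, the replacement code must also write a dirty victim’s entire line back to the next level, as the writeback slide above describes.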

Outline Review of cache parameters Example of operation of a direct-mapped cache. Example of operation of a set-associative cache. Simulating a generic cache. Write-through vs. write-back caches. Write-allocate vs. no write-allocate. Victim caches.

Victim Caches A victim cache is a small fully-associative cache that sits alongside the primary cache, e.g., holding 2–16 cache lines. Its management is slightly “weird” compared to conventional caches: when the main cache evicts (replaces) a block, the victim cache takes the evicted (replaced) block. The evicted block is called the “victim.” When the main cache misses, the victim cache is searched for recently discarded blocks. A victim-cache hit means the main cache doesn’t have to go to memory for the block.

Victim-Cache Example Suppose we have a 2-entry victim cache. It initially holds blocks X and Y, with Y the LRU block in the victim cache. The main cache is direct-mapped, and blocks A and B map to the same set in the main cache. Trace: A B A B A B …

Victim-Cache Example, cont. [Figure: initial state. The main cache holds A; the victim cache holds X and Y (Y is LRU).]

Victim-Cache Example, cont. Access B: 1. B misses in the main cache and evicts A. 2. A goes to the victim cache and replaces Y (the previous LRU). 3. X becomes LRU. [Figure: the main cache now holds B; the victim cache holds A and X (X is LRU).]

Victim-Cache Example, cont. Access A: A misses in L1 but hits in the victim cache, so A and B swap positions: A is moved from the victim cache to L1, and B (the victim) goes to the victim-cache slot where A was located. (Note: we don’t replace the LRU block, X, in the case of a victim-cache hit.) [Figure: L1 now holds A; the victim cache holds B and X (X is still LRU).]
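The swap and replacement rules can be made precise with a short sketch (illustrative names; VC_ENTRIES, VCEntry, and the calling convention are invented, not from the lecture):

#include <stdint.h>

#define VC_ENTRIES 2        /* the 2-entry victim cache of this example */

typedef struct {
    int      valid;
    uint32_t block_addr;    /* block address = address without offset bits */
    uint64_t last_used;     /* for LRU within the victim cache */
} VCEntry;

static VCEntry  vc[VC_ENTRIES];
static uint64_t vc_seq;

/* Called on a main-cache miss. `want` is the block the processor needs;
 * `evicted` is the block the main cache just replaced (if any).
 * Returns 1 if `want` was found in the victim cache. */
int victim_cache_access(uint32_t want, uint32_t evicted, int have_evicted) {
    vc_seq++;
    for (int i = 0; i < VC_ENTRIES; i++)
        if (vc[i].valid && vc[i].block_addr == want) {
            /* Victim-cache hit: swap. The requested block moves to the
             * main cache; the main cache's victim takes over this slot
             * (not the LRU slot, so X stays put in the example above). */
            vc[i].block_addr = evicted;
            vc[i].valid      = have_evicted;
            vc[i].last_used  = vc_seq;
            return 1;
        }
    /* Victim-cache miss: the evicted block replaces the VC's LRU entry. */
    if (have_evicted) {
        VCEntry *lru = &vc[0];
        for (int i = 1; i < VC_ENTRIES && lru->valid; i++)
            if (!vc[i].valid || vc[i].last_used < lru->last_used)
                lru = &vc[i];
        lru->valid      = 1;
        lru->block_addr = evicted;
        lru->last_used  = vc_seq;
    }
    return 0;
}

In the example above, victim_cache_access(B, A, 1) is the first step (a victim-cache miss, so A replaces Y), and victim_cache_access(A, B, 1) is the second (a victim-cache hit, so A and B swap while X stays put).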

Victim Cache – Why? Direct-mapped caches suffer badly from repeated conflicts. A victim cache provides the illusion of set associativity: a poor man’s version of it. A victim cache does not have to be large to be effective; even a 4–8 entry victim cache will frequently remove > 50% of conflict misses.