Computer Architecture 2014 – Caches 1 Computer Architecture Cache Memory By Yoav Etsion and Dan Tsafrir Presentation based on slides by David Patterson, Avi Mendelson, Lihu Rappoport, and Adi Yoaz

Computer Architecture 2014 – Caches 2 In the old days…
• EDVAC (Electronic Discrete Variable Automatic Computer)
• The successor of ENIAC (the first general-purpose electronic computer)
• Designed & built by Eckert & Mauchly (who also invented ENIAC), with John von Neumann
• Unlike ENIAC, binary rather than decimal, and a “stored program” machine
• Operational until 1961

Computer Architecture 2014 – Caches 3 In the olden days…
• In 1945, von Neumann wrote: “…This result deserves to be noted. It shows in a most striking way where the real difficulty, the main bottleneck, of an automatic very high speed computing device lies: at the memory.”
[Photo: von Neumann & EDVAC]

Computer Architecture 2014 – Caches 4 In the olden days…
• Later, in 1946, he wrote: “…Ideally one would desire an indefinitely large memory capacity such that any particular … word would be immediately available… …We are forced to recognize the possibility of constructing a hierarchy of memories, each of which has greater capacity than the preceding but which is less quickly accessible”
[Photo: von Neumann & EDVAC]

Computer Architecture 2014 – Caches 5 Not so long ago…
• In 1994, in their paper “Hitting the Memory Wall: Implications of the Obvious”, William Wulf and Sally McKee said: “We all know that the rate of improvement in microprocessor speed exceeds the rate of improvement in DRAM memory speed – each is improving exponentially, but the exponent for microprocessors is substantially larger than that for DRAMs. The difference between diverging exponentials also grows exponentially; so, although the disparity between processor and memory speed is already an issue, downstream someplace it will be a much bigger one.”

Computer Architecture 2014 – Caches 6 Not so long ago…
[Chart: processor vs. DRAM performance over time]
• CPU: 60% per year (2× in 1.5 years)
• DRAM: 9% per year (2× in 10 years)
• The gap grew 50% per year

Computer Architecture 2014 – Caches 7 More recently (2008)…
[Chart: “The memory wall in the multicore era” – performance (seconds) vs. number of processor cores for a conventional architecture; lower = slower]

Computer Architecture 2014 – Caches 8 Memory Trade-Offs
• Large (dense) memories are slow
• Fast memories are small, expensive, and consume high power
• Goal: give the processor the feeling that its memory is large (dense), fast, low-power, and cheap
• Solution: a hierarchy of memories
[Diagram: CPU → L1 Cache → L2 Cache → L3 Cache → Memory (DRAM); speed: fastest → slowest; size: smallest → biggest; cost: highest → lowest; power: highest → lowest]

Computer Architecture 2014 – Caches 9 Typical levels in mem hierarchy

Memory level             Size           Response time
CPU registers            ≈ 100 bytes    ≈ 0.5 ns
L1 cache                 ≈ 64 KB        ≈ 1 ns
Last-level cache (LLC)   ≈ 8–32 MB      ≈ 20 ns
Main memory (DRAM)       ≈ 4–100s GB    ≈ 150 ns
SSD                      ≈ 128 GB       W? R?
Hard disk (SATA)         ≈ 1–4 TB       ≈ 5 ms

Computer Architecture 2014 – Caches 10 Why Hierarchy Works: Locality
• Temporal Locality (Locality in Time):
  – If an item is referenced, it will tend to be referenced again soon
  – Example: code and variables in loops
  – Keep recently accessed data closer to the processor
• Spatial Locality (Locality in Space):
  – If an item is referenced, nearby items tend to be referenced soon
  – Example: scanning an array
  – Move contiguous blocks closer to the processor
• Due to locality, a memory hierarchy is a good idea
  – We’re going to use what we’ve just recently used
  – And we’re going to use its immediate neighborhood
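The two kinds of locality on this slide can be seen in an ordinary loop. A minimal sketch (the function name and data are made up for illustration): scanning an array in memory order touches consecutive addresses (spatial locality), while the accumulator and loop machinery are reused every iteration (temporal locality).

```python
# Illustrative example of locality in a loop; names and data are hypothetical.

def sum_row_major(matrix):
    """Sum all elements, visiting them in memory order."""
    total = 0
    for row in matrix:        # 'total' is reused every iteration -> temporal locality
        for x in row:         # neighbors of 'x' share a cache line -> spatial locality
            total += x
    return total

matrix = [[1, 2], [3, 4]]
```

A cache exploits exactly this pattern: the line holding `matrix[0][0]` already contains `matrix[0][1]` when the loop asks for it.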

Computer Architecture 2014 – Caches 11 Programs with locality cache well…
[Figure: memory address (one dot per access) vs. time, showing regions of spatial locality, temporal locality, and bad locality behavior]
Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3), 1971

Computer Architecture 2014 – Caches 12 Memory Hierarchy: Terminology
• For each memory level, define the following:
  – Hit: the data appears in the memory level
  – Hit Rate: the fraction of accesses found in that level
  – Hit Latency: time to access the memory level; includes the time to determine hit/miss
  – Miss: need to retrieve the data from the next level
  – Miss Rate: 1 – (Hit Rate)
  – Miss Penalty: time to bring in the missing info (replace a block) + time to deliver the info to the accessor
• Average memory access time = t_effective = (Hit Latency × Hit Rate) + (Miss Penalty × Miss Rate) = (Hit Latency × Hit Rate) + (Miss Penalty × (1 – Hit Rate))
  – If the hit rate is close to 1, t_effective is close to the hit latency, which is generally what we want

Computer Architecture 2014 – Caches 13 Effective Memory Access Time
• Cache – holds a subset of the memory
  – Hopefully, the subset that is being used now
  – Known as “the working set”
• Effective memory access time: t_effective = (t_cache × Hit Rate) + (t_mem × (1 – Hit Rate)); t_mem includes the time it takes to detect a cache miss
• Example
  – Assume: t_cache = 10 ns, t_mem = 100 ns
  [Table: t_effective for various hit rates]
• As t_mem/t_cache goes up, it becomes more important that the hit rate be close to 1
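The formula above can be sketched directly in code. This is a minimal illustration of the slide's equation, using the slide's own example numbers (t_cache = 10 ns, t_mem = 100 ns):

```python
# Sketch of the slide's effective-access-time formula:
# t_eff = t_cache * hit_rate + t_mem * (1 - hit_rate)

def t_effective(t_cache, t_mem, hit_rate):
    """Average access time; t_mem already includes miss-detection time."""
    return t_cache * hit_rate + t_mem * (1.0 - hit_rate)

# With a 90% hit rate the average access costs about 19 ns, not 10 ns,
# which is why the hit rate must be very close to 1 when t_mem >> t_cache.
```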

Computer Architecture 2014 – Caches 14 Cache – main idea
• The cache holds a small part of the entire memory
  – Need to map parts of the memory into the cache
• Main memory is (logically) partitioned into “blocks” or “lines” or, when the info is cached, “cachelines”
  – Typical block size is 32 or 64 bytes
  – Blocks are “aligned” in memory
• The cache is partitioned into cache lines
  – Each cache line holds a block
  – Only a subset of the blocks is mapped to the cache at a given time
  – The cache views an address as: block # | offset
• Why use lines/blocks rather than words?
[Diagram: memory blocks mapped into tagged cache lines]

Computer Architecture 2014 – Caches 15 Cache Lookup
• Cache hit
  – The block is mapped to the cache – return data according to the block’s offset
• Cache miss
  – The block is not mapped to the cache → do a cacheline fill
    Fetch the block into the fill buffer (may require a few cycles), then write the fill buffer into the cache
  – May need to evict another block from the cache, to make room for the new block

Computer Architecture 2014 – Caches 16 Checking valid bit & tag
• Initially the cache is empty
  – Need a “line valid” indication – a valid bit per line
• A line may also be invalidated
[Diagram: address = tag | offset; each tag-array entry has a valid bit; a tag match on a valid line signals a hit and selects the data from the data array]

Computer Architecture 2014 – Caches 17 Cache organization
• Basic questions:
  – Associativity: where can we place a memory block in the cache?
  – Eviction policy: which cache line should be evicted on a miss?
• Associativity:
  – Ideally, every memory block can go to each cache line
    Called a fully-associative cache: most flexible, but most expensive
  – Compromise: simpler designs, where blocks can only reside in a subset of cache lines
    Direct-mapped cache; 2-way set associative cache; N-way set associative cache

Computer Architecture 2014 – Caches 18 Fully Associative Cache
• An address is partitioned into
  – offset within block
  – block number
• Each block may be mapped to each of the cache lines
  – Look up the block in all lines
• Each cache line has a tag
  – All tags are compared to the block # in parallel
  – Need a comparator per line
  – If one of the tags matches the block #, we have a hit; supply data according to the offset
• Best hit rate, but most wasteful
  – Must be relatively small
[Diagram: address = block # | offset; every line’s tag compared in parallel]

Computer Architecture 2014 – Caches 19 Fully Associative Cache
• Is said to be a “CAM”
  – Content Addressable Memory

Computer Architecture 2014 – Caches 20 Direct Map Cache
• Each memory block can only be mapped to a single cache line
• Offset
  – Byte within the cache line
• Set
  – The index into the “cache array” and the “tag array”
  – For a given set (an index), only one of the memory blocks that has this set can reside in the cache
• Tag
  – The remaining block bits are used as the tag
  – The tag uniquely identifies the memory block
  – Must compare the tag stored in the tag array to the tag of the address
[Diagram: address = tag | set | offset; 2^9 = 512 sets]

Computer Architecture 2014 – Caches 21 Direct Map Cache (cont)
• Partition memory into slices
  – slice size = cache size
• Partition each slice into blocks
  – Block size = cache line size
  – Distance of a block from the slice start indicates its position in the cache (set)
• Advantages
  – Easy & fast hit/miss resolution
  – Easy & fast replacement algorithm
  – Lowest power
• Disadvantage
  – A line has only “one chance”
  – Lines are replaced due to “conflict misses”
  – The organization with the highest miss rate
[Diagram: blocks at the same offset within each slice all map to the same set]

Computer Architecture 2014 – Caches 22 Line Size: 32 bytes  5 Offset bits Cache Size: 16KB = 2 14 Bytes #lines = cache size / line size = 2 14 /2 5 =2 9 =512 #sets = #lines = 512 #set bits = 9 bits (=5…13) #Tag bits = 32 – (#set bits + #offset bits) = 32 – (9+5) = 18 bits (=14…31) Lookup Address: 0x Direct Map Cache – Example offset= 0x18 set= 0x0B3 tag= 0x048B1 Tag SetOffset Address Fields Tag Array = Hit/Miss 514

Computer Architecture 2014 – Caches 23 Direct map (tiny example)
• Assume
  – Memory size is 2^5 = 32 bytes
  – For this, need a 5-bit address
  – A block is comprised of 4 bytes
  – Thus, there are exactly 8 blocks
• Note
  – Need only 3 bits to identify a block
  – The offset is exclusively used within the cache lines
  – The offset is not used to locate the cache line
[Diagram: 5-bit address = block index (3 bits) | offset within a block (2 bits)]

Computer Architecture 2014 – Caches 24 Direct map (tiny example)
• Further assume
  – The size of our cache is 2 cache lines (⇒ need 5 − 2 − 1 = 2 tag bits)
• The address divides like so
  – b4 b3 | b2 | b1 b0
  – tag | set | offset
[Diagram: tag array (2 bits per line) and data array (4 bytes per line); even-indexed blocks map to one cache line, odd-indexed blocks to the other]

Computer Architecture 2014 – Caches 25 Direct map (tiny example)
• Accessing address 00010 (= the byte marked “C”)
• The address divides like so
  – b4 b3 | b2 | b1 b0
  – tag (00) | set (0) | offset (10)
[Diagram: block A B C D cached in set 0 with tag 00]

Computer Architecture 2014 – Caches 26 Direct map (tiny example)
• Accessing address 01010 (= Y)
• The address divides like so
  – b4 b3 | b2 | b1 b0
  – tag (01) | set (0) | offset (10)
[Diagram: block W X Y Z cached in set 0 with tag 01]

Computer Architecture 2014 – Caches 27 Direct map (tiny example)
• Accessing address 10010 (= Q)
• The address divides like so
  – b4 b3 | b2 | b1 b0
  – tag (10) | set (0) | offset (10)
[Diagram: block T R Q P cached in set 0 with tag 10]

Computer Architecture 2014 – Caches 28 Direct map (tiny example)
• Accessing address 11010 (= J)
• The address divides like so
  – b4 b3 | b2 | b1 b0
  – tag (11) | set (0) | offset (10)
[Diagram: block L K J I cached in set 0 with tag 11]

Computer Architecture 2014 – Caches 29 Direct map (tiny example)
• Accessing address 00110 (= B)
• The address divides like so
  – b4 b3 | b2 | b1 b0
  – tag (00) | set (1) | offset (10)
[Diagram: block D C B A cached in set 1 with tag 00]

Computer Architecture 2014 – Caches 30 Direct map (tiny example)
• Accessing address 01110 (= Y)
• The address divides like so
  – b4 b3 | b2 | b1 b0
  – tag (01) | set (1) | offset (10)
[Diagram: block W Z Y X cached in set 1 with tag 01]

Computer Architecture 2014 – Caches 31 Direct map (tiny example)
• Now assume
  – The size of our cache is 4 cache lines
• The address divides like so
  – b4 | b3 b2 | b1 b0
  – tag | set | offset
[Diagram: tag array (1 bit per line) and data array with 4 sets; block D C B A shown cached]

Computer Architecture 2014 – Caches 32 Direct map (tiny example)
• Now assume
  – The size of our cache is 4 cache lines
• The address divides like so
  – b4 | b3 b2 | b1 b0
  – tag | set | offset
[Diagram: tag array (1 bit per line) and data array with 4 sets; block W Z Y X shown cached]

Computer Architecture 2014 – Caches 33 2-Way Set Associative Cache
• Each set holds two lines (way 0 and way 1)
  – Each block can be mapped into one of two lines in the appropriate set (HW checks both ways in parallel)
• The cache is effectively partitioned into two
• Example:
  – Line size: 32 bytes; cache size: 16 KB
  – # of lines: 512; # of sets: 256
  – Offset bits: 5; set bits: 8; tag bits: 19
  – Address example: offset = 0x18 = 24; set = 0x0B3 = 179; tag = 0x091A2
[Diagram: address = tag | set | offset; two tag arrays and two data arrays, one per way]

Computer Architecture 2014 – Caches 34 2-Way Cache – Hit Decision
[Diagram: the set bits index both ways; each way’s stored tag is compared against the address tag; a hit in either way drives the MUX that selects the data out]
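The hit decision in the diagram can be sketched in software. This is an illustrative model, not the lecture's code: the class layout, names, and the example tag/set values (taken from the previous slide, way 1 chosen arbitrarily) are assumptions.

```python
# Hypothetical model of a 2-way set-associative lookup: both ways of the
# selected set are checked (in hardware, in parallel); a match in either
# way is a hit and steers the output MUX.

class TwoWayCache:
    def __init__(self, num_sets):
        # tags[way][set]; None marks an invalid line
        self.tags = [[None] * num_sets for _ in range(2)]

    def lookup(self, tag, set_idx):
        """Return the hitting way (0 or 1), or None on a miss."""
        for way in range(2):          # hardware compares both ways at once
            if self.tags[way][set_idx] == tag:
                return way
        return None

cache = TwoWayCache(num_sets=256)
cache.tags[1][179] = 0x091A2          # install the slide's example line in way 1
```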

Computer Architecture 2014 – Caches 35 2-Way Set Associative Cache (cont)
• Partition memory into “slices” or “ways”
  – slice size = way size = ½ cache size
• Partition each slice into blocks
  – Block size = cache line size
  – Distance of a block from the slice start indicates its position in the cache (set)
• Compared to a direct map cache
  – Half-size slices → 2× #slices → 2× #blocks mapped to each cache set
  – Each set can hold 2 blocks at a given time
  – (+) Fewer collisions/evictions
  – (–) More logic, more power consuming
[Diagram: blocks at the same offset within each way-sized slice map to the same set]

Computer Architecture 2014 – Caches 36 N-way set associative cache
• Similar to 2-way
• At the extreme, when every cache line is a way, the cache is fully associative

Computer Architecture 2014 – Caches 37 Cache organization summary
• Increasing set associativity
  – Improves the hit rate
  – Increases power consumption
  – Increases access time
• Strike a balance

Computer Architecture 2014 – Caches 38 Cache Read Miss
• On a read miss – perform a cache line fill
  – Fetch the entire block that contains the missing data from memory
• The block is fetched into the cache line fill buffer
  – May take a few bus cycles to complete the fetch
    e.g., 64-bit (8-byte) data bus, 32-byte cache line → 4 bus cycles
  – Can stream (forward) the critical chunk into the core before the line fill ends
• Once the entire block is fetched into the fill buffer
  – It is moved into the cache


Computer Architecture 2014 – Caches 40 Cache Replacement Policy
• Direct map cache – easy
  – A new block is mapped to a single line in the cache
  – The old line is evicted (re-written to memory if needed)
• N-way set associative cache – harder
  – Choose a victim from all the ways in the appropriate set
  – But which? To determine, use a replacement algorithm
• Example replacement policies
  – Optimum (theoretical, postmortem, called “Belady”)
  – FIFO (First In First Out)
  – Random
  – LRU (Least Recently Used) – a decent approximation of Belady
• More on this next week…

Computer Architecture 2014 – Caches 41 LRU Implementation
• 2 ways
  – 1 bit per set to mark the latest way accessed in the set
  – Evict the way not pointed to by the bit
• k-way set associative LRU
  – Requires a full ordering of way accesses
  – Algorithm: when way i is accessed
    x = counter[i]
    counter[i] = k-1
    for (j = 0 to k-1)
      if ((j != i) && (counter[j] > x)) counter[j]--;
  – When replacement is needed, evict the way with counter = 0
  – Expensive even for small k’s
    Because it is invoked for every load/store
  – Need a log2(k)-bit counter per line
[Example: per-way counters after an initial state, an access to way 2, then an access to way 0]
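The counter algorithm above can be transcribed directly. This sketch follows the slide's pseudocode; the function names are mine:

```python
# The slide's k-way LRU counter scheme: on an access to way i, every way
# whose counter exceeds i's old counter is decremented, and way i's counter
# becomes k-1 (most recent). The way holding counter 0 is the LRU victim.

def access(counter, i):
    """Update the per-way LRU counters (in place) after an access to way i."""
    k = len(counter)
    x = counter[i]
    counter[i] = k - 1
    for j in range(k):
        if j != i and counter[j] > x:
            counter[j] -= 1

def victim(counter):
    """The way to evict is the one whose counter reached 0."""
    return counter.index(0)
```

After each access the counters remain a permutation of 0..k-1, which is what makes them a full ordering of the ways.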

Computer Architecture 2014 – Caches 42 Pseudo LRU (PLRU)
• In practice, it’s sufficient to efficiently approximate LRU
  – Maintain k−1 bits, instead of k∙log2(k) bits
• Assume k=4, and let’s enumerate the way’s cache lines
  – We need 2 bits: cache line 00, cl-01, cl-10, and cl-11
• Use a binary search tree to represent the 4 cache lines
  – Set each of the 3 (= k−1) internal nodes to hold a bit variable: B0, B1, and B2
• Whenever accessing a cache line b1b0
  – Set each bit variable Bj along the path to the corresponding cache line bit
  – Can think of the bit value as: Bj = 1 ⇔ “the right side was referenced more recently”
• Need to evict? Walk the tree as follows:
  – Go left if Bj = 1; go right if Bj = 0
  – Evict the leaf you’ve reached (= the opposite direction relative to previous insertions)
[Diagram: binary tree with root B0, internal nodes B1 and B2, and the 4 cache lines as leaves]

Computer Architecture 2014 – Caches 43 Pseudo LRU (PLRU) – Example
• Access 3 (11), 0 (00), 2 (10), 1 (01) ⇒ next victim is 3 (11), as expected
[Diagram: tree state after the access sequence; walking the tree (left on 1, right on 0) reaches leaf 11]
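The example above can be reproduced with a tiny model of the 3-bit tree. This is a sketch under the slide's conventions (bit = 1 means "the right side was referenced more recently"; eviction walks left on 1, right on 0); the class name and field names are mine:

```python
# Tree PLRU for k=4 ways, using the slide's 3 bits B0 (root), B1 (left
# subtree over ways 0,1) and B2 (right subtree over ways 2,3).

class TreePLRU4:
    def __init__(self):
        self.b0 = 0   # root: 1 = right half {2,3} more recent
        self.b1 = 0   # node over ways {0,1}: 1 = way 1 more recent
        self.b2 = 0   # node over ways {2,3}: 1 = way 3 more recent

    def access(self, way):
        hi, lo = (way >> 1) & 1, way & 1
        self.b0 = hi              # record which half was touched
        if hi == 0:
            self.b1 = lo          # record which leaf within the left half
        else:
            self.b2 = lo          # ...or within the right half

    def victim(self):
        """Walk the tree: go left on 1, right on 0; the leaf is the victim."""
        if self.b0 == 1:                      # right half recent -> go left
            return 0 if self.b1 == 1 else 1
        return 2 if self.b2 == 1 else 3       # left half recent -> go right

plru = TreePLRU4()
for way in (3, 0, 2, 1):                      # the slide's access sequence
    plru.access(way)
```

Running the slide's sequence 3, 0, 2, 1 indeed leaves way 3 as the next victim.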

Computer Architecture 2014 – Caches 44 LRU vs. Random vs. FIFO
• LRU: hardest
• FIFO: easier, approximates LRU (evicts oldest rather than least recently used)
• Random: easiest
• Results:
  – Misses per 1000 instructions in L1-d, on average
  – Average across ten SPECint2000 / SPECfp2000 benchmarks
  – PLRU turns out rather similar to LRU
[Table: misses per 1000 instructions for 16K and larger cache sizes, at 2-, 4-, and 8-way associativity, under LRU, Random, and FIFO]

Computer Architecture 2014 – Caches 45 Effect of Cache on Performance
• MPKI (misses per kilo-instruction)
  – Average number of misses for every 1000 instructions
  – MPKI = memory accesses per kilo-instruction × miss rate
• Memory stall cycles = |memory accesses| × miss rate × miss-penalty cycles = IC/1000 × MPKI × miss-penalty cycles
• CPU time = (CPU execution cycles + memory stall cycles) × cycle time = IC/1000 × (1000 × CPI_execution + MPKI × miss-penalty cycles) × cycle time
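The slide's formulas can be sketched as two small functions. The numbers in the comments below are made-up example inputs, not measurements from the lecture:

```python
# Sketch of the slide's performance model.

def mpki(mem_accesses_per_kilo_instr, miss_rate):
    """Misses per kilo-instruction."""
    return mem_accesses_per_kilo_instr * miss_rate

def cpu_time(instr_count, cpi_exec, mpki_val, miss_penalty_cycles, cycle_time):
    """CPU time = IC/1000 * (1000*CPI_exec + MPKI*penalty) * cycle time."""
    return (instr_count / 1000
            * (1000 * cpi_exec + mpki_val * miss_penalty_cycles)
            * cycle_time)

# Hypothetical example: 300 accesses per kilo-instruction at a 2% miss rate
# gives MPKI = 6; with a 100-cycle penalty, 1M instructions at CPI 1.0 and a
# 1 ns cycle take about 1.6 ms instead of 1.0 ms.
```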

Computer Architecture 2014 – Caches 46 Memory Update Policy on Writes
• Write back: lazy writes to the next cache level; prefer the cache
• Write through: immediately update the next cache level

Computer Architecture 2014 – Caches 47 Write Back: Cheaper writes
• Store operations that hit the cache
  – Write only to the cache; the next cache level (or memory) is not accessed
• The line is marked as “modified” or “dirty”
  – When evicted, the line is written to the next level only if dirty
• Pros:
  – Saves memory accesses when a line is updated more than once
  – Attractive for multicore/multiprocessor
• Cons:
  – On eviction, the entire line must be written to memory (there’s no indication which bytes within the line were modified)
  – A read miss might require writing to memory (if the evicted line is dirty)

Computer Architecture 2014 – Caches 48 Write Through: Cheaper evictions
• Stores that hit the cache
  – Write to the cache, and
  – Write to the next cache level (or memory)
• Need to write only the bytes that were changed
  – Not the entire line
  – Less work
• When evicting, no need to write to the next cache level
  – Lines are never dirty, so they don’t need to be written
  – Still need to throw stuff out, though
• Use write buffers
  – To mask waiting for lower-level memory

Computer Architecture 2014 – Caches 49 Write through: need write-buffer
• A write buffer between the cache & memory
  – The processor core writes data into the cache & the write buffer
  – The write buffer allows the processor to avoid stalling on writes
• Works OK if the store frequency (in cycles) << the DRAM write cycle
  – Otherwise the store buffer overflows, no matter how big it is
• Write combining
  – Combine adjacent writes to the same location in the write buffer
• Note: on a cache miss, need to look up the write buffer (or drain it)
[Diagram: Processor → Cache → Write Buffer → DRAM]

Computer Architecture 2014 – Caches 50 Cache Write Miss
• The processor is not waiting for data → it continues to work
• Option 1: Write allocate – fetch the line into the cache
  – Goes with the write back policy, because with write back, write ops are quicker if the line is in the cache
  – Assumes more writes/reads to the cache line will be performed soon
  – Hopes that subsequent accesses to the line hit the cache
• Option 2: Write no-allocate – do not fetch the line into the cache
  – Goes with the write through policy
  – Subsequent writes would update memory anyhow
  – (If read ops occur, the first read will bring the line to the cache)

Computer Architecture 2014 – Caches 51 WT vs. WB – Summary

                                   Write-Through              Write-Back
Policy                             Data written to the        Write data only to the
                                   cache block (if present)   cache; update the lower
                                   is also written to         level when a block falls
                                   lower-level memory         out of the cache
Complexity                         Less                       More
Can read misses produce writes?    No                         Yes
Do repeated writes make it
to the lower level?                Yes                        No
Upon write miss                    Write no-allocate          Write allocate

Computer Architecture 2014 – Caches 52 Write Buffers for WT – Summary
• Q: Why a write buffer?
  – A: So the CPU doesn’t stall; the buffer holds data awaiting write-through to lower-level memory
• Q: Why a buffer, why not just one register?
  – A: Bursts of writes are common
• Q: Are Read After Write (RAW) hazards an issue for the write buffer?
  – A: Yes! Drain the buffer before the next read, or check in the buffer
[Diagram: Processor → Cache → Write Buffer → Lower Level Memory]

Computer Architecture 2014 – Caches 53 Write-back vs. Write-through
• Commercial processors favor write-back
  – Write bursts to the same line are common
  – Simplifies management of multi-cores
    Data in two consecutive cache levels is inconsistent while a write is in flight; with write-through, this happens on every write

Computer Architecture 2014 – Caches 54 Optimizing the Hierarchy

Computer Architecture 2014 – Caches 55 Cache Line Size
• A larger line size takes advantage of spatial locality
  – Too-big blocks may fetch unused data
  – While possibly evicting useful data → the miss rate goes up
• A larger line size means a larger miss penalty
  – Longer time to fill the line (critical chunk first reduces the problem)
  – Longer time to evict
• avgAccessTime = missPenalty × missRate + hitTime × (1 – missRate)

Computer Architecture 2014 – Caches 56 Classifying Misses: 3 Cs
• Compulsory
  – First access to a block which is not in the cache
  – The block must be brought into the cache
  – Cache size does not matter
  – Solution: prefetching
• Capacity
  – The cache cannot contain all the blocks needed during program execution
  – Blocks are evicted and later retrieved
  – Solution: increase cache size, stream buffers
• Conflict
  – Occurs in set associative or direct mapped caches when too many blocks are mapped to the same set
  – Solution: increase associativity, victim cache

Computer Architecture 2014 – Caches 57 Conflict 3Cs in SPEC92 Compulsory Capacity Miss rate (fraction)

Computer Architecture 2014 – Caches 58 Multi-ported Cache
• An n-ported cache enables n accesses in parallel
  – Parallelize cache access in different pipeline stages
  – Parallelize cache access in a super-scalar processor
• For n=2, more than doubles the cache area
  – Wire complexity also degrades access times
• Can help: a “banked cache”
  – Each line is divided into n banks
  – Can fetch data from k ≤ n different banks, in possibly different lines

Computer Architecture 2014 – Caches 59 Separate Code / Data Caches
• Parallelize data access and instruction fetch
• The code cache is a read-only cache
  – No need to write the line back to memory when evicted
  – Simpler to manage
• What about self-modifying code?
  – The I-cache “snoops” (= monitors) all write ops
    Requires a dedicated snoop port: read tag array + match tag (otherwise snoops would stall fetch)
  – If the code cache contains the written address
    Invalidate the corresponding cache line
    Flush the pipeline – it may contain stale code

Computer Architecture 2014 – Caches 60 [image slide, May-2013]

Computer Architecture 2014 – Caches 61 Last-level cache (LLC)
• Either L2 or L3
• The LLC is bigger, but with higher latency
  – Reduces the L1 miss penalty – saves an access to memory
  – On modern processors, the LLC is located on-chip
• Since the LLC contains L1 it needs to be significantly larger
  – Data is replicated across the cache levels
    Fetching from the LLC to L1 replicates data
  – E.g., if the LLC is only 2× L1, half of the LLC is duplicated in L1
• The LLC is typically unified (code / data)

Computer Architecture 2014 – Caches 62 Core 2 Duo Die Photo
[Die photo: L2 cache (Core 2 Duo L2 size is up to 6 MB; it is shared by the cores)]

Computer Architecture 2014 – Caches 63 Ivy Bridge (L3, “last level” cache)
[Die photo: 64 KB data + 64 KB instruction L1 cache per core; 512 KB L2 data cache per core; and up to 32 MB L3 cache shared by all cores]

Computer Architecture 2014 – Caches 64 AMD Phenom II Six Core

Computer Architecture 2014 – Caches 65 LLC: Inclusiveness
• Data replication across cache levels presents a tradeoff: inclusive vs. non-inclusive caches
• Inclusive: the LLC contains all data in higher cache levels
  – Evicting a line from the LLC also evicts it from the higher levels
  – Pro: makes it easy to manage the cache hierarchy; the LLC serves as a coordination point
  – Con: wasted cache space
• Non-inclusive: L1 may contain data not present in the LLC
  – Pro: better use of cache resources
  – Con: how do we know what data is in the caches?
• Critical issue in multicore design
  – Data coherency and consistency across individual L1 caches

Computer Architecture 2014 – Caches 66 LLC: Inclusiveness
• Practicality wins – the LLC is typically inclusive
  – All addresses in L1 are also contained in the LLC
• LLC eviction process
  – Address evicted from the LLC → snoop-invalidate it in L1
  – But the data in L1 might be newer than in L2
    When evicting a dirty line from L1 → write it to L2
  – Thus, when evicting a line from L2 which is dirty in L1:
    the snoop-invalidate to L1 generates a write from L1 to L2;
    the line is marked as modified in L2 → the line is written to memory

Computer Architecture 2014 – Caches 67 Victim Cache
• The load on a cache set may be non-uniform
  – Some sets may have more conflict misses than others
  – Solution: allocate ways to sets dynamically
• A victim buffer adds some associativity to direct-mapped caches
  – A line evicted from the L1 cache is placed in the victim cache
  – If the victim cache is full → evict its LRU line
  – On an L1 cache lookup, in parallel, also search the victim cache
[Diagram: direct-mapped cache alongside a small fully-associative victim buffer]

Slide 68 – Victim Cache (cont.)
- On a victim cache hit
  - The line is moved back into the cache
  - The line it displaces is moved to the victim cache
  - Same access time as a cache hit
(Figure: direct-mapped cache alongside a fully-associative victim buffer)
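The lookup/swap flow on these two slides can be sketched as a toy simulation. This is a minimal model under assumed parameters (8-set direct-mapped L1, 4-entry victim buffer, FIFO standing in for LRU), not any real design:

```c
#include <stdbool.h>
#include <stdint.h>

#define L1_SETS 8   /* direct-mapped: one line per set */
#define VB_WAYS 4   /* tiny fully-associative victim buffer */

typedef struct { bool valid; uint32_t tag; } line_t;

static line_t l1[L1_SETS];
static line_t vb[VB_WAYS];  /* .tag holds the full line address */
static int    vb_next;      /* FIFO replacement stands in for LRU */

/* Returns true on a hit (in L1 or in the victim buffer). */
bool access_line(uint32_t line_addr)
{
    uint32_t set = line_addr % L1_SETS;
    uint32_t tag = line_addr / L1_SETS;

    if (l1[set].valid && l1[set].tag == tag)
        return true;                          /* L1 hit */

    /* Search the victim buffer (in hardware: in parallel with L1). */
    for (int w = 0; w < VB_WAYS; w++) {
        if (vb[w].valid && vb[w].tag == line_addr) {
            /* Victim hit: swap the line back into L1; the line it
             * displaces takes its slot in the victim buffer. */
            line_t displaced = l1[set];
            l1[set].valid = true;
            l1[set].tag = tag;
            if (displaced.valid)
                vb[w].tag = displaced.tag * L1_SETS + set;
            else
                vb[w].valid = false;
            return true;
        }
    }

    /* Miss everywhere: fill L1, push the displaced line to the buffer. */
    if (l1[set].valid) {
        vb[vb_next].valid = true;
        vb[vb_next].tag = l1[set].tag * L1_SETS + set;
        vb_next = (vb_next + 1) % VB_WAYS;
    }
    l1[set].valid = true;
    l1[set].tag = tag;
    return false;
}
```

With this model, two line addresses that conflict in the direct-mapped array (e.g., 0 and 8 with 8 sets) ping-pong through the victim buffer and keep hitting, instead of missing on every alternation.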

Slide 69 – Stream Buffers
- Before inserting a new line into the cache, put it in a stream buffer
- Move the line from the stream buffer into the cache only if it is expected to be accessed again
  - E.g., if the line hits again in the stream buffer
- Example: scanning a very large array (much larger than the cache)
  - Each item in the array is accessed just once
  - If the array elements were inserted into the cache, the entire cache would be thrashed
  - If we detect that this is just a scan-once operation (e.g., using a hint from the software), we can avoid putting the array's lines into the cache
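The staging policy above can be sketched as follows. This is a toy model with made-up sizes (4-line cache, 2-entry stream buffer, FIFO replacement), only meant to show that a scan-once pass never promotes anything into the cache:

```c
#include <stdbool.h>
#include <stdint.h>

#define CACHE_LINES 4
#define SBUF_LINES  2

static uint32_t cache[CACHE_LINES]; static bool cache_v[CACHE_LINES];
static uint32_t sbuf[SBUF_LINES];   static bool sbuf_v[SBUF_LINES];
static int cache_next, sbuf_next;   /* FIFO replacement for both */

/* Returns true on a hit (cache or stream buffer). */
bool touch_line(uint32_t line)
{
    for (int i = 0; i < CACHE_LINES; i++)
        if (cache_v[i] && cache[i] == line)
            return true;                     /* cache hit */

    for (int i = 0; i < SBUF_LINES; i++) {
        if (sbuf_v[i] && sbuf[i] == line) {  /* second touch: promote */
            sbuf_v[i] = false;
            cache[cache_next] = line;
            cache_v[cache_next] = true;
            cache_next = (cache_next + 1) % CACHE_LINES;
            return true;
        }
    }

    /* First touch: stage in the stream buffer, not in the cache. */
    sbuf[sbuf_next] = line;
    sbuf_v[sbuf_next] = true;
    sbuf_next = (sbuf_next + 1) % SBUF_LINES;
    return false;
}
```

Lines touched once flow through the small buffer and fall out; only lines touched a second time earn a cache slot.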

Slide 70 – Prefetching
- Predict future memory accesses and fetch them from memory ahead of time
- Instruction prefetching
  - On a cache miss, prefetch sequential lines into stream buffers
  - Branch-predictor-directed prefetching: let the branch predictor run ahead
- Data prefetching – predict future data accesses
  - Next-sequential (block prefetcher)
  - Stride
  - General pattern
- Software prefetching
  - The compiler injects special prefetch instructions
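Software prefetching can be illustrated with GCC/Clang's `__builtin_prefetch` intrinsic (a real builtin; the prefetch distance of 16 elements here is an illustrative tuning parameter, not a universal constant):

```c
#include <stddef.h>

#define PREFETCH_DIST 16  /* assumed distance: tune per machine */

/* Sum an array while prefetching a fixed distance ahead. The
 * prefetch is a hint: on a miss it starts the memory access early,
 * and it never changes the result. */
long sum_with_prefetch(const long *a, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DIST < n)
            __builtin_prefetch(&a[i + PREFETCH_DIST],
                               0 /* read */, 0 /* low temporal locality */);
        sum += a[i];
    }
    return sum;
}
```

The third argument (locality 0) asks for a non-temporal fill, matching the scan-once case from the stream-buffer slide: the data is not expected to be reused.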

Slide 71 – Prefetching (cont.)
- Prefetching can greatly improve performance… but incurs high overheads!
- Predictions are not 100% accurate – closer to 50–60% in practice
  - Need to predict the correct address and make sure the data arrives on time
    - Too early: the line may be evicted before use
    - Too late: the processor has to stall
- Prefetching can waste memory bandwidth and power
  - In some commodity processors, roughly 50% of the data brought from memory is never used, due to aggressive prefetching

Slide 72 – Critical Word First
- Goal: reduce the miss penalty
- Don't wait for the full block to load before restarting the CPU
  - Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
  - Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while the rest of the line's words are filled in
    - Also called wrapped fetch or requested word first
- Example: Pentium
  - 8-byte bus, 32-byte cache line → 4 bus cycles to fill a line
  - Fetching data at address 95H: the chunk 90H–97H (which contains 95H) is requested first, then, wrapping around, 98H–9FH, 80H–87H, and finally 88H–8FH
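The wrap-around order in the Pentium example can be computed directly. A minimal sketch, assuming the slide's parameters (32-byte line, 8-byte bus chunks; the function name is invented):

```c
#include <stdint.h>

#define LINE_BYTES  32
#define CHUNK_BYTES 8
#define CHUNKS      (LINE_BYTES / CHUNK_BYTES)

/* Fills order[] with chunk indices (0..3 within the line) in the
 * sequence they are fetched: the chunk containing the missed address
 * first, then the rest in wrap-around order. */
void wrapped_fetch_order(uint32_t addr, int order[CHUNKS])
{
    int first = (addr % LINE_BYTES) / CHUNK_BYTES;
    for (int i = 0; i < CHUNKS; i++)
        order[i] = (first + i) % CHUNKS;
}
```

For address 95H this yields chunks 2, 3, 0, 1, i.e., 90H–97H, 98H–9FH, 80H–87H, 88H–8FH.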

Slide 73 – Non-Blocking Cache
- Very important in out-of-order (OoO) processors
- Hit under miss
  - Allow cache hits while one miss is in progress
  - Another miss has to wait
- Miss under miss, hit under multiple misses
  - Allow hits and misses while other misses are in progress
  - The memory system must support multiple pending requests
  - Manage a list of outstanding cache misses; when a miss is served and its data returns, update the list
- Pending operations are managed by MSHRs
  - Also known as “Miss-Status Holding Registers”
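The outstanding-miss list can be sketched as a toy MSHR file. The structure and names here are illustrative, not a real design: a primary miss allocates an entry and issues a memory request; later misses to the same line are merged into the existing entry instead of issuing another request.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_MSHRS 4  /* assumed size of the MSHR file */

typedef struct {
    bool     valid;
    uint32_t line_addr;
    int      pending;   /* merged requests waiting on this line */
} mshr_t;

static mshr_t mshrs[NUM_MSHRS];

/* Returns 1 for a primary miss (a new memory request is issued),
 * 0 for a secondary miss merged into an existing entry, and
 * -1 when the MSHR file is full (the cache must stall). */
int record_miss(uint32_t line_addr)
{
    for (int i = 0; i < NUM_MSHRS; i++)
        if (mshrs[i].valid && mshrs[i].line_addr == line_addr) {
            mshrs[i].pending++;
            return 0;                       /* merged: no new request */
        }
    for (int i = 0; i < NUM_MSHRS; i++)
        if (!mshrs[i].valid) {
            mshrs[i] = (mshr_t){ true, line_addr, 1 };
            return 1;                       /* primary miss */
        }
    return -1;                              /* structural stall */
}

/* Called when the fill returns from memory: free the entry (a real
 * MSHR would also replay the merged requests recorded in it). */
void fill_done(uint32_t line_addr)
{
    for (int i = 0; i < NUM_MSHRS; i++)
        if (mshrs[i].valid && mshrs[i].line_addr == line_addr)
            mshrs[i].valid = false;
}
```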

Slide 74 – Compiler/Programmer Optimizations: Merging Arrays
- Merge two arrays into a single array of compound elements

  /* BEFORE: two sequential arrays */
  int val[SIZE];
  int key[SIZE];

  /* AFTER: one array of structures */
  struct merge {
      int val;
      int key;
  } merged_array[SIZE];

- Reduces conflicts between val and key
- Improves spatial locality

Slide 75 – Compiler Optimizations: Loop Fusion
- Combine two independent loops that have the same loop bounds and overlapping variables
- Assume each element of a is 4 bytes, a 32KB cache, and 32-byte lines

  for (i = 0; i < 10000; i++)
      a[i] = 1 / a[i];
  for (i = 0; i < 10000; i++)
      sum = sum + a[i];

- First loop: hits on 7/8 of the iterations
- Second loop: the array is larger than the cache, so it gets the same hit rate as the first loop
- Fuse the loops:

  for (i = 0; i < 10000; i++) {
      a[i] = 1 / a[i];
      sum = sum + a[i];
  }

- First line: hits on 7/8 of the iterations
- Second line: hits on all iterations

Slide 76 – Compiler Optimizations: Loop Interchange
- Change the loop nesting to access data in the order it is stored in memory
- A two-dimensional array is laid out in memory as: x[0][0] x[0][1] … x[0][99] x[1][0] x[1][1] …

  /* Before */
  for (j = 0; j < 100; j++)
      for (i = 0; i < 5000; i++)
          x[i][j] = 2 * x[i][j];

  /* After */
  for (i = 0; i < 5000; i++)
      for (j = 0; j < 100; j++)
          x[i][j] = 2 * x[i][j];

- Sequential accesses instead of striding through memory 100 words at a time
  - Improved spatial locality

Slide 77 – Case Study
- Direct-mapped caches are used mostly in embedded processors, due to their poorer performance
- Can we make a direct-mapped cache outperform the alternatives?
  - Etsion & Feitelson, “L1 Cache Filtering Through Random Selection of Memory References”, International Conference on Parallel Architectures & Compilation Techniques (PACT)
- See the dedicated presentation

Slide 78 – Summary: Cache and Performance
- Reduce the cache miss rate
  - Larger cache
  - Reduce compulsory misses:
    - Larger block size
    - HW prefetching (instructions, data)
    - SW prefetching (data)
  - Reduce conflict misses:
    - Higher associativity
    - Victim cache
  - Stream buffers: reduce cache thrashing
  - Compiler optimizations
- Reduce the miss penalty
  - Early restart and critical word first on a miss
  - Non-blocking caches (hit under miss, miss under miss)
  - 2nd/3rd-level caches
- Reduce the cache hit time
  - On-chip caches
  - Smaller cache (hit time increases with cache size)
  - Direct-mapped cache (hit time increases with associativity)
- Bring frequently accessed data closer to the processor