Computer Structure 2015 – Caches 1 Lihu Rappoport and Adi Yoaz Computer Structure Cache Memory

Computer Structure 2015 – Caches 2 Processor – Memory Gap (figure: Processor-Memory Performance Gap, grows ~50% / year)

Computer Structure 2015 – Caches 3 Memory Trade-Offs u Large (dense) memories are slow u Fast memories are small, expensive, and consume high power u Goal: give the processor the illusion of a memory that is large (dense), fast, low-power, and cheap u Solution: a hierarchy of memories: CPU – L1 Cache – L2 Cache – L3 Cache – Memory (DRAM), going from fastest / smallest / most expensive / highest power to slowest / biggest / cheapest / lowest power

Computer Structure 2015 – Caches 4 Why Hierarchy Works u Temporal Locality (Locality in Time):  If an item is referenced, it will tend to be referenced again soon  Example: code and variables in loops  Keep recently accessed data closer to the processor u Spatial Locality (Locality in Space):  If an item is referenced, nearby items tend to be referenced soon  Example: scanning an array  Move contiguous blocks closer to the processor u Locality + smaller HW is faster + Amdahl’s law  memory hierarchy

Computer Structure 2015 – Caches 5 Memory Hierarchy: Terminology u For each memory level define the following  Hit: data available in the memory level  Hit Rate: the fraction of accesses found in that level  Hit Time: time from data request till data received when hitting in the memory level; also includes the time to determine hit/miss  Miss: data not present in the memory level  Miss Rate = 1 – (Hit Rate)  Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor u Average memory-access time = t_effective = (Hit time × Hit rate) + (Miss time × Miss rate) = (Hit time × Hit rate) + (Miss time × (1 – Hit rate))  If the hit rate is close to 1, t_effective is close to the Hit time

Computer Structure 2015 – Caches 6 Effective Memory Access Time u Cache – holds a subset of the memory  Hopefully – the subset being used now u Effective memory access time t_effective = (t_cache × Hit rate) + (t_mem × (1 – Hit rate)); t_mem includes the time it takes to detect a cache miss u Example  Assume t_cache = 10 nsec and t_mem = 100 nsec (table of hit rate vs. t_eff omitted)  As t_mem/t_cache goes up, it becomes more important that the hit rate be close to 1
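The omitted table can be regenerated from this formula. A minimal sketch in C, using the slide's t_cache = 10 nsec and t_mem = 100 nsec; the sample hit rates below are illustrative, not the original table's values:

#include <stdio.h>

/* Effective access time: hits cost t_cache, misses cost t_mem
   (t_mem already includes the time needed to detect the miss). */
static double t_effective(double t_cache, double t_mem, double hit_rate)
{
    return t_cache * hit_rate + t_mem * (1.0 - hit_rate);
}

int main(void)
{
    const double hit_rates[] = { 0.5, 0.9, 0.99, 0.999 };
    for (unsigned i = 0; i < sizeof hit_rates / sizeof hit_rates[0]; i++)
        printf("hit rate %.3f -> t_eff = %5.1f nsec\n",
               hit_rates[i], t_effective(10.0, 100.0, hit_rates[i]));
    return 0;
}

For a hit rate of 0.9, for example, this gives 10×0.9 + 100×0.1 = 19 nsec.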

Computer Structure 2015 – Caches 7 Cache – Main Idea u The cache holds a small part of the entire memory  Need to map parts of the memory into the cache u Main memory is (logically) partitioned into blocks  Typical block size is 32 to 64 bytes  Blocks are aligned u Cache is partitioned into cache lines  Each cache line holds a block  Only a subset of the blocks is mapped to the cache at a given time  The cache views an address as [Block # | offset]

Computer Structure 2015 – Caches 8 Cache Lookup u Cache hit  Block is mapped to the cache – return data according to the block’s offset u Cache miss  Block is not mapped to the cache  do a cache line fill: fetch the block into the fill buffer (may require a few bus cycles), then write the fill buffer into the cache  May need to evict another block from the cache to make room for the new block

Computer Structure 2015 – Caches 9 Fully Associative Cache u An address is partitioned into  offset within block  block number u Each block may be mapped to any of the cache lines  Look the block up in all lines u Each cache line has a tag  All tags are compared to the block# in parallel  Need a comparator per line  If one of the tags matches the block#, we have a hit; supply data according to the offset (figure: requested address split into Block# / Offset; tag array and data array, one comparator per line)

Computer Structure 2015 – Caches 10 Valid Bit u Initially the cache is empty  Need to have a “line valid” indication – a valid bit per line u A line may also be invalidated (figure: tag array extended with a valid bit v per line)

Computer Structure 2015 – Caches 11 Direct Map Cache u The least-significant bits of the block number map the block to a specific cache line  A given block is mapped to a specific cache line, also called a set  If two blocks are mapped to the same line, only one can reside in the cache at a time u The rest of the block-number bits are used as the tag  Compared to the tag stored in the cache for the appropriate set (figure: address split into Tag / Set / Offset; tag array with 2^8 = 256 sets, compared against the address tag to produce Hit/Miss)

Computer Structure 2015 – Caches 12 Direct Map Cache – Example
Line Size: 32 bytes  5 offset bits
Cache Size: 16KB = 2^14 bytes
#lines = cache size / line size = 2^14 / 2^5 = 2^9 = 512
#sets = #lines  #set bits = 9 bits
#Tag bits = 32 – (#set bits + #offset bits) = 32 – (9+5) = 18 bits
Lookup Address: 0x…  offset = 0x18, set = 0x0B3, tag = 0x048B1
(figure: address fields Tag / Set / Offset; tag array compare  Hit/Miss)
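A small sketch of the address split for this configuration (32-bit address, 5 offset bits, 9 set bits, 18 tag bits, as computed above); the example address in main is hypothetical, not the slide's:

#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 5          /* 32-byte line          */
#define SET_BITS    9          /* 512 sets (direct map) */

static void split_address(uint32_t addr)
{
    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint32_t set    = (addr >> OFFSET_BITS) & ((1u << SET_BITS) - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + SET_BITS);
    printf("addr=0x%08X -> tag=0x%05X set=0x%03X offset=0x%02X\n",
           addr, tag, set, offset);
}

int main(void)
{
    split_address(0x00ABC123);   /* hypothetical address, for illustration only */
    return 0;
}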

Computer Structure 2015 – Caches 13 Direct Map Cache (cont) u Partition memory into slices  slice size = cache size u Partition each slice into blocks  Block size = cache line size  Distance of a block from the slice start indicates its position in the cache (set) u Advantages  Easy hit/miss resolution  Easy replacement algorithm  Lowest power u Disadvantage  Line replacements due to “set conflict misses” (figure: blocks at the same distance from the start of each cache-sized slice all map to set X)

Computer Structure 2015 – Caches 14 2-Way Set Associative Cache u Each set holds two lines (way 0 and way 1)  Each block can be mapped into one of the two lines in the appropriate set
Example: Line Size: 32 bytes, Cache Size: 16KB
# of lines: 512, #sets: 256
Offset bits: 5, Set bits: 8, Tag bits: 19
Address 0x…: Offset = 0x18 = 24, Set = 0x0B3 = 179, Tag = 0x091A2
(figure: address fields Tag / Set / Offset; per-way tag arrays and cache storage for way 0 and way 1)

Computer Structure 2015 – Caches 15 2-Way Cache – Hit Decision (figure: the set index selects a set; the tags of way 0 and way 1 are compared with the address tag in parallel; the compare results form the Hit/Miss indication and drive a MUX that selects the data of the hitting way)
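A simplified C sketch of the lookup path for such a 2-way set-associative cache (16KB, 32-byte lines, 256 sets, as in the example above); the data structures and names are illustrative, and a real design compares both tags in parallel in hardware:

#include <stdbool.h>
#include <stdint.h>

#define LINE_BYTES 32
#define NUM_SETS   256                 /* 16 KB / 32 B per line / 2 ways */

struct line { bool valid; uint32_t tag; uint8_t data[LINE_BYTES]; };
static struct line cache[NUM_SETS][2];     /* [set][way] */

/* Returns true on a hit and places the requested byte in *out. */
static bool lookup(uint32_t addr, uint8_t *out)
{
    uint32_t offset = addr % LINE_BYTES;
    uint32_t set    = (addr / LINE_BYTES) % NUM_SETS;
    uint32_t tag    = addr / LINE_BYTES / NUM_SETS;

    for (int way = 0; way < 2; way++) {          /* both tags checked; HW does this in parallel */
        struct line *l = &cache[set][way];
        if (l->valid && l->tag == tag) {         /* hit: the MUX selects this way's data */
            *out = l->data[offset];
            return true;
        }
    }
    return false;                                /* miss: a line fill would be triggered */
}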

Computer Structure 2015 – Caches 16 2-Way Set Associative Cache (cont) u Partition memory into slices  slice size = way size = ½ cache size u Partition each slice into blocks  Block size = cache line size  Distance of a block from the slice start indicates its position in the cache (set) u Compared to a direct map cache  Half-size slices  2× the number of slices  2× the number of blocks mapped to each set in the cache  But each set can hold 2 blocks at a given time (figure: blocks at the same distance from the start of each way-sized slice all map to set X)

Computer Structure 2015 – Caches 17 Cache Read Miss u On a read miss – perform a cache line fill  Fetch the entire block that contains the missing data from memory u Block is fetched into the cache line fill buffer  May take a few bus cycles to complete the fetch: e.g., with a 64-bit (8-byte) data bus and a 32-byte cache line  4 bus cycles  Can stream the critical chunk into the core before the line fill ends u Once the entire block is fetched into the fill buffer  it is moved from the fill buffer into the cache

Computer Structure 2015 – Caches 18 Cache Replacement u Direct map  A new block is mapped to a single line in the cache  The old line is evicted (written back to memory if needed) u N-way set associative cache  Choose a victim from all the ways in the appropriate set u Replacement algorithms  Optimal replacement algorithm  FIFO – First In First Out  Random  LRU – Least Recently Used u LRU is the best  but not much better than even random

Computer Structure 2015 – Caches 19 LRU Implementation u 2 ways  1 bit per set to mark the latest way accessed in the set  Evict the way not pointed to by the bit u k-way set associative LRU  Requires full ordering of way accesses  Hold a log2(k)-bit counter per line  When way i is accessed:
X = Counter[i]; Counter[i] = k-1;
for (j = 0 to k-1) if ((j ≠ i) AND (Counter[j] > X)) Counter[j]--;
 When replacement is needed, evict the way with counter = 0  Expensive even for small k’s (figure: example counter values in the initial state, after accessing way 2, and after accessing way 0)
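The counter-based LRU above, written out as a small C sketch (K and the per-set counter array are illustrative; counters are assumed to be initialized to the distinct values 0..K-1):

#define K 8                             /* associativity; illustrative value */

static unsigned counter[K];             /* one log2(K)-bit counter per line of one set */

/* Called when an access hits way i of this set. */
static void lru_touch(unsigned i)
{
    unsigned x = counter[i];
    counter[i] = K - 1;                         /* the accessed way becomes the MRU way */
    for (unsigned j = 0; j < K; j++)
        if (j != i && counter[j] > x)           /* ways more recent than i age by one   */
            counter[j]--;
}

/* On a replacement, the victim is the way whose counter is 0 (the LRU way). */
static unsigned lru_victim(void)
{
    for (unsigned j = 0; j < K; j++)
        if (counter[j] == 0)
            return j;
    return 0;   /* not reached if the counters stay a permutation of 0..K-1 */
}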

Computer Structure 2015 – Caches 20 Pseudo LRU (PLRU) u PLRU records a partial order using a tree structure  The ways of a set are the leaves; the PLRU bits are the internal nodes  For n leaves (ways) there are n-1 internal nodes (PLRU bits) u Example: 4 ways, using 3 bits per set:  Bit 0 specifies whether the LRU way is in {0,1} or in {2,3}  Bit 1 specifies the LRU way within {0,1}  Bit 2 specifies the LRU way within {2,3} u On each access to a way  Update the PLRU bits on the path from the root to the hit way to point to that way (0-right, 1-left) u Example: way access order 3,0,2,1  bit0 = 0 points to the pair containing way 1  bit1 = 1 points to way 1  bit2 = 0 since way 2 was accessed after way 3 u Victim selection  Follow the opposite of the directions pointed to by the PLRU bits (figure: PLRU tree with bit0 at the root, bit1 and bit2 at the second level)
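A C sketch of the 4-way tree PLRU described above. One possible bit polarity is used here (each bit points toward the less recently used half, so the victim is found by simply following the bits); the slide's figure may use the opposite encoding:

/* 3 PLRU bits for one 4-way set:
 *   bit0 – which pair ({0,1} or {2,3}) holds the (pseudo-)LRU way
 *   bit1 – LRU way within {0,1}
 *   bit2 – LRU way within {2,3}                                    */
static unsigned char plru_bits;

static void plru_touch(unsigned way)            /* way 0..3 was just accessed */
{
    if (way < 2) {
        plru_bits |=  1u;                       /* LRU pair is now {2,3}     */
        if (way == 0) plru_bits |=  2u;         /* LRU of {0,1} is way 1     */
        else          plru_bits &= ~2u;         /* LRU of {0,1} is way 0     */
    } else {
        plru_bits &= ~1u;                       /* LRU pair is now {0,1}     */
        if (way == 2) plru_bits |=  4u;         /* LRU of {2,3} is way 3     */
        else          plru_bits &= ~4u;         /* LRU of {2,3} is way 2     */
    }
}

static unsigned plru_victim(void)               /* follow the bits to the victim */
{
    if (plru_bits & 1u) return (plru_bits & 4u) ? 3u : 2u;
    else                return (plru_bits & 2u) ? 1u : 0u;
}

With the access order 3,0,2,1 from the slide, this convention selects way 3 as the victim, which matches true LRU for that sequence.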

Computer Structure 2015 – Caches 21 Effect of Cache on Performance u MPI – misses per instruction  MPI = #cache misses / #instructions = (#cache misses / #mem accesses) × (#mem accesses / #instructions)  Correlates better with performance than cache miss rate, since it also takes the frequency of memory accesses into account u Memory stall cycles = #memory accesses × miss rate × miss penalty = IC × MPI × miss penalty u CPU time = (CPU execution cycles + Memory stall cycles) × cycle time = IC × (CPI_execution + MPI × Miss penalty) × cycle time
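A tiny sketch of the last formula; the numbers in main are purely illustrative, not taken from the slides:

#include <stdio.h>

/* CPU time = IC * (CPI_execution + MPI * miss_penalty) * cycle_time */
static double cpu_time(double ic, double cpi_exec, double mpi,
                       double miss_penalty, double cycle_time)
{
    return ic * (cpi_exec + mpi * miss_penalty) * cycle_time;
}

int main(void)
{
    /* e.g., 10^9 instructions, CPI_execution = 1.0, 0.02 misses/instruction,
       100-cycle miss penalty, 1 nsec cycle time -> 3 seconds             */
    printf("CPU time = %.2f sec\n", cpu_time(1e9, 1.0, 0.02, 100.0, 1e-9));
    return 0;
}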

Computer Structure 2015 – Caches 22 Memory Update Policy on Writes u Write back – cheaper writes  “Store” operations that hit the cache write only to the cache  Main memory is not accessed on a write hit  The line is marked as modified or dirty  A modified cache line is written to memory only when it is evicted  Saves memory accesses when a line is updated many times  On eviction, the entire line must be written to memory, since there is no indication which bytes within the line were modified u Write through – cheaper evictions  Stores that hit the cache write both to the cache and to memory  Only the bytes that were changed need to be written (not the entire line)  The ratio between reads and writes is ~4:1  While writing to memory, the processor goes on reading from the cache  No need to evict blocks from the cache (never dirty)  Use write buffers so the processor does not wait for the lower-level memory

Computer Structure 2015 – Caches 23 Write Buffer for Write Through u A write buffer sits between the cache and memory  Processor: writes data into the cache and the write buffer  Memory controller: writes the contents of the buffer to memory u Works fine if the store frequency << 1 / DRAM write cycle time  Otherwise the write buffer overflows no matter how big it is u Write combining  combine writes in the write buffer u On a cache miss, the write buffer must also be looked up (figure: Processor  Cache / Write Buffer  DRAM)

Computer Structure 2015 – Caches 24 Cache Write Miss u The processor is not waiting for data  continues its work u Option 1: Write allocate: fetch the line into the cache  Goes with write back policy, assuming more writes are needed  hoping that subsequent writes to the line hit the cache u Option 2: Write no allocate: do not fetch line into cache  Goes with write through policy  subsequent writes would update memory anyhow

Computer Structure 2015 – Caches 25 Cache Line Size u A larger line size takes advantage of spatial locality  Blocks that are too big may fetch unused data  Possibly evicting useful data  miss rate goes up u A larger line size means a larger miss penalty  Takes longer to perform a cache line fill  Using the critical chunk first reduces the issue  Takes longer to evict (when using a write back update policy) Average access time = hit time × (1 – miss rate) + miss penalty × miss rate

Computer Structure 2015 – Caches 26 Classifying Misses: 3 Cs u Compulsory  First access to a block that is not in the cache; the block must be brought into the cache  Cache size does not matter  Solution: prefetching u Capacity  The cache cannot contain all the blocks needed during program execution; blocks are evicted and later retrieved  Solution: increase cache size, stream buffers u Conflict  Occurs in set associative or direct mapped caches when too many blocks are mapped to the same set  Solution: increase associativity, victim cache

Computer Structure 2015 – Caches 27 3Cs in SPEC92 (figure: miss rate broken down into compulsory, capacity, and conflict misses)

Computer Structure 2015 – Caches 28 Victim Cache u Per-set pressure may be non-uniform  Some sets may have more conflict misses than others  Solution: allocate ways to sets dynamically u A victim cache gives a 2nd chance to evicted lines  A line evicted from the L1 cache is placed in the victim cache  If the victim cache is full  evict its LRU line  On an L1 cache lookup, the victim cache is looked up in parallel u On a victim cache hit  The line is moved back into the cache  The evicted line is moved to the victim cache  Same access time as a cache hit u Especially effective for a direct mapped cache  Combines the fast hit time of a direct mapped cache with reduced conflict misses (see the sketch below)
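A compact C sketch of the lookup/swap behavior described above, for a direct-mapped L1 backed by a small fully-associative victim cache; sizes, structure names, and the simplistic victim replacement are illustrative:

#include <stdbool.h>
#include <stdint.h>

#define L1_SETS     512                    /* direct-mapped L1                     */
#define VICTIM_WAYS 4                      /* small fully-associative victim cache */

struct vline { bool valid; uint32_t blk; };          /* blk = full block number */
static struct vline l1[L1_SETS];
static struct vline victim[VICTIM_WAYS];

/* Returns true if the block hits in L1 or in the victim cache. */
static bool access_block(uint32_t blk)
{
    uint32_t set = blk % L1_SETS;

    if (l1[set].valid && l1[set].blk == blk)           /* L1 hit */
        return true;

    for (int v = 0; v < VICTIM_WAYS; v++)              /* victim cache probed (in HW: in parallel with L1) */
        if (victim[v].valid && victim[v].blk == blk) {
            struct vline tmp = l1[set];                /* swap: hit line goes back to L1, */
            l1[set] = victim[v];                       /* the displaced L1 line takes its */
            victim[v] = tmp;                           /* place in the victim cache       */
            return true;
        }

    victim[0] = l1[set];                               /* miss: evicted L1 line goes to the victim cache */
    l1[set].valid = true;                              /* (LRU choice of the victim-cache entry omitted) */
    l1[set].blk   = blk;                               /* fill L1 from the next level                    */
    return false;
}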

Computer Structure 2015 – Caches 29 Stream Buffers u Before inserting a new line into cache  Put new line in a stream buffer u If the line is expected to be accessed again  Move the line from the stream buffer into cache  E.g., if the line hits in the stream buffer u Example  Scanning a very large array (much larger than the cache)  Each item in the array is accessed just once  If the array elements are inserted into the cache The entire cache will be thrashed  If we detect that this is just a scan-once operation E.g., using a hint from the software Can avoid putting the array lines into the cache

Computer Structure 2015 – Caches 30 Prefetching u Hardware Data Prefetching – predict future data accesses  Next sequential / Streaming Triggered by an ascending access to very recently loaded data Fetches the next line, assuming a streaming load  Stride prefetcher Tracks individual load instructions, detecting a regular stride Prefetch address = current address + stride Detects strides of up to 2K bytes, both forward and backward  General pattern prefetcher Identifies patterns of Load address distances u Data has to be prefetched before it is needed  The prefetcher aggressiveness has to be tuned  How fast and how far ahead to issue the prefetch request, such that data arrives on time, before it is needed
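A rough C sketch of the stride prefetcher idea above: a small table indexed by the load's PC remembers the last address and stride, and once the same stride repeats, a prefetch one stride ahead is issued. Table size, indexing, and the prefetch hook are assumptions, not the slide's design:

#include <stdint.h>
#include <stdio.h>

#define ENTRIES 64

struct stride_entry { uint32_t last_addr; int32_t stride; int confident; };
static struct stride_entry table[ENTRIES];

static void prefetch(uint32_t addr) { printf("prefetch 0x%08X\n", addr); }   /* placeholder hook */

/* Called for every executed load with its PC and data address. */
static void on_load(uint32_t pc, uint32_t addr)
{
    struct stride_entry *e = &table[(pc >> 2) % ENTRIES];
    int32_t stride = (int32_t)(addr - e->last_addr);

    if (stride != 0 && stride == e->stride)
        e->confident = 1;                     /* same stride seen twice: pattern confirmed      */
    else {
        e->confident = 0;                     /* pattern broken: start learning the new stride  */
        e->stride = stride;
    }
    e->last_addr = addr;

    if (e->confident)
        prefetch(addr + (uint32_t)e->stride); /* prefetch one stride ahead of the demand access */
}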

Computer Structure 2015 – Caches 31 Prefetching (cont.) u Prefetching relies on extra memory bandwidth  Too aggressive / inaccurate prefetching slows down demand fetches Hurts performance u Software Prefetching  Special prefetching instructions that cannot cause faults u Instruction Prefetching  On a cache miss, prefetch sequential cache lines into stream buffers  Branch predictor directed prefetching Let branch predictor run ahead

Computer Structure 2015 – Caches 32 Critical Word First u Reduces the miss penalty u Don’t wait for the full block to be loaded before restarting the CPU  Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution  Critical Word First: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block; also called wrapped fetch or requested word first u Example: Pentium  64-bit = 8-byte bus, 32-byte cache line  4 bus cycles to fill the line  Fetch data from 95H: the chunk 90H-97H is delivered first, then the remaining chunks 98H-9FH, 80H-87H, 88H-8FH in wrap-around order
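A small C sketch of the wrapped-fetch order in the Pentium example above (32-byte line, 8-byte bus chunks); running it for address 95H prints the chunks in the order 90H-97H, 98H-9FH, 80H-87H, 88H-8FH:

#include <stdint.h>
#include <stdio.h>

#define LINE_BYTES  32
#define CHUNK_BYTES 8

static void print_fill_order(uint32_t addr)
{
    uint32_t line_base   = addr & ~(uint32_t)(LINE_BYTES - 1);
    uint32_t first_chunk = (addr % LINE_BYTES) / CHUNK_BYTES;
    uint32_t n_chunks    = LINE_BYTES / CHUNK_BYTES;

    for (uint32_t i = 0; i < n_chunks; i++) {          /* critical chunk first, then wrap around */
        uint32_t chunk = (first_chunk + i) % n_chunks;
        printf("bus cycle %u: %02XH-%02XH\n", i + 1,
               line_base + chunk * CHUNK_BYTES,
               line_base + chunk * CHUNK_BYTES + CHUNK_BYTES - 1);
    }
}

int main(void) { print_fill_order(0x95); return 0; }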

Computer Structure 2015 – Caches 33 Code Optimizations: Merging Arrays u Merge 2 arrays into a single array of compound elements
/* Before: two sequential arrays */
int val[SIZE];
int key[SIZE];
/* After: one array of structures */
struct merge { int val; int key; } merged_array[SIZE];
 Reduces conflicts between val and key u Improves spatial locality

Computer Structure 2015 – Caches 34 Code Optimizations: Loop Fusion u Combine 2 independent loops that have the same looping and some overlapping variables  Assume each element in a is 4 bytes, 32KB cache, 32 B / line
for (i = 0; i < 10000; i++) a[i] = 1 / a[i];
for (i = 0; i < 10000; i++) sum = sum + a[i];
 First loop: hit on 7/8 of iterations  Second loop: array > cache  same hit rate as in the 1st loop u Fuse the loops
for (i = 0; i < 10000; i++) { a[i] = 1 / a[i]; sum = sum + a[i]; }
 First line: hit on 7/8 of iterations  Second line: hit on all (in the example, only load accesses are taken into account for the hit-rate calculation)

Computer Structure 2015 – Caches 35 Code Optimizations: Loop Interchange u 2-dimensional array in memory: x[0][0] x[0][1] … x[0][99] x[1][0] x[1][1] …
/* Original program: accesses are 100 elements apart */
/* x[0][0] x[1][0] … x[4999][0] x[0][1] x[1][1] … */
for (j = 0; j < 100; j++)
    for (i = 0; i < 5000; i++)
        x[i][j] = 2 * x[i][j];
/* Reversing the loop order: accesses are adjacent */
/* Improved spatial locality */
for (i = 0; i < 5000; i++)
    for (j = 0; j < 100; j++)
        x[i][j] = 2 * x[i][j];

Computer Structure 2015 – Caches 36 Improving Cache Performance u Reduce cache miss rate  Larger cache  Reduce compulsory misses Larger Block Size HW Prefetching (Instr, Data) SW Prefetching (Data)  Reduce conflict misses Higher Associativity Victim Cache  Stream buffers Reduce cache thrashing  Compiler Optimizations u Reduce the miss penalty  Early Restart and Critical Word First on miss  Non-blocking Caches (Hit under Miss, Miss under Miss)  Second Level Cache u Reduce cache hit time  On-chip caches  Smaller size cache (hit time increases with cache size)  Direct map cache (hit time increases with associativity)

Computer Structure 2015 – Caches 37 Separate Code / Data Caches u Parallelize data access and instruction fetch u Code cache is a read-only cache  No need to write back a line to memory when it is evicted  Simpler to manage u Self modifying code  Whenever executing a memory write  snoop the code cache  Requires a dedicated snoop port: tag array read + tag match; otherwise snoops would stall fetch  If the code cache contains the written address  Invalidate the line in which the address is contained  Flush the pipeline – it may contain stale code

Computer Structure 2015 – Caches 38 Non-Blocking Cache u Hit Under Miss  Allow cache hits while one miss is in progress  Another miss has to wait  Relevant for an out-of-order execution CPU u Miss Under Miss, Hit Under Multiple Misses  Allow hits and misses when other misses in progress  Memory system must allow multiple pending requests  Manage a list of outstanding cache misses When miss is served and data gets back, update list

Computer Structure 2015 – Caches 39 Multi-ported Cache u An n-ported cache enables n accesses in parallel  Parallelize cache accesses in different pipeline stages  Parallelize cache accesses in a super-scalar processor u About doubles the cache die size u Possible solution: banked cache  Each line is divided into n banks  Can fetch data from k ≤ n different banks, possibly in different lines

Computer Structure 2015 – Caches 40 L2 Cache u L2 cache is bigger, but with higher latency  Reduces L1 miss penalty – saves access to memory  L2 cache is located on-chip, but may be shared by a few cores  L2 is unified (code / data) u L2 inclusive of L1  All addresses in L1 are also contained in L2 Address evicted from L2  snoop invalidate it in L1  Data in L1 may be more updated than in L2 When evicting a modified line from L1  write to L2 When evicting a line from L2 which is modified in L1 Snoop invalidate to L1 generates a write from L1 to L2 Line marked as modified in L2  Line written to memory  Since L2 contains L1 it needs to be significantly larger e.g., if L2 is only 2× L1, half of L2 is duplicated in L1

Computer Structure 2015 – Caches 41 Core™ 2 Duo Die Photo (figure: die photo with the L2 Cache highlighted)

Computer Structure 2015 – Caches 42 Multi-processor System u A memory system is coherent if 1. If P1 writes to address X, and later on P2 reads X, and there are no other writes to X in between  P2’s read returns the value written by P1’s write 2. Writes to the same location are serialized: two writes to location X are seen in the same order by all processors  All processors see the same order of values; e.g., if P1 sees 0 then 1, P2 cannot see 1 then 0 (figure: P1 and P2 with private L1 caches above a shared L2 cache and memory; P1 writes X, P2 later reads X and gets the same value)

Computer Structure 2015 – Caches 43 MESI Protocol u Each cache line can be in one of 4 states  Invalid – Line’s data is not valid  Shared – Line is valid and not dirty, copies may exist in other processors  Exclusive – Line is valid and not dirty, other processors do not have the line in their local caches  Modified – Line is valid and dirty, other processors do not have the line in their local caches
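The four states, written as a C enum together with the properties from the state table later in the deck; a descriptive sketch, not an implementation of any particular coherence controller:

#include <stdbool.h>

enum mesi_state { INVALID, SHARED, EXCLUSIVE, MODIFIED };

struct mesi_props { bool valid; bool dirty; bool copies_may_exist; };

static const struct mesi_props props[] = {
    [INVALID]   = { false, false, false },   /* not valid: the other fields are N.A. */
    [SHARED]    = { true,  false, true  },   /* clean, possibly shared               */
    [EXCLUSIVE] = { true,  false, false },   /* clean, this cache only               */
    [MODIFIED]  = { true,  true,  false },   /* dirty, this cache only               */
};

/* A line may be written only when held exclusively (E or M); a write to a
 * SHARED line first issues a request-for-ownership that invalidates other copies. */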

Computer Structure 2015 – Caches 44 Multi-processor System: Example u P1 reads 1000 u P1 writes 1000 (figure: P1 and P2 with L1 caches, shared L2, memory [1000] = 5; P1’s read misses and the line is filled with the value 5 in E state; P1’s write changes the value to 6 and the state to M)

Computer Structure 2015 – Caches 45 Multi-processor System: Example u P1 reads 1000 u P1 writes 1000 u P2 reads 1000 u L2 snoops 1000 u P1 writes back 1000 u P2 gets 1000 (figure: P2’s read of [1000] misses; L2 snoops P1, which writes back the modified value 6 and moves from M to S; P2 receives the line with value 6 in S state)

Computer Structure 2015 – Caches 46 Multi-processor System: Example u P1 reads 1000 u P1 writes 1000 u P2 reads 1000 u L2 snoops 1000 u P1 writes back 1000 u P2 gets 1000 u P2 requests ownership with write intent (figure: on the ownership request, P1’s copy of [1000] is invalidated (I) and P2 holds the line exclusively (E))

Computer Structure 2015 – Caches 47 Multi-processor System: Inclusion u L2 needs to keep track of the presence of each line in each of the processors  Determine if it needs to send a snoop to a processor  Determine in what state to provide a requested line (S, E)  Need to guarantee that L2 is inclusive of the L1 caches in each processor u When L2 evicts a line  L2 sends a snoop invalidate to all processors that have it  If the line is modified in the L1 cache of one of the processors (in which case it exists only in that processor)  The processor responds by sending the updated value to L2  When the line is evicted from L2, the updated value gets written to memory

Computer Structure 2015 – Caches 48 MESI Protocol States u A modified line must be exclusive  Otherwise, another processor which has the line will be using stale data  Therefore, before modifying a line, a processor must request ownership of the line
State      Valid  Modified  Copies may exist in other processors
Invalid    No     N.A.      N.A.
Shared     Yes    No        Yes
Exclusive  Yes    No        No
Modified   Yes    Yes       No

Computer Structure 2015 – Caches 49 Backup

Computer Structure 2015 – Caches 50 Cache Optimization Summary
Technique                            Miss Rate  Miss Penalty  Hit Time
Larger Block Size                    +          –
Higher Associativity                 +                        –
Victim Caches                        +
Pseudo-Associative Caches            +
HW Prefetching of Instr/Data         +
Compiler Controlled Prefetching      +
Compiler Reduce Misses               +
Priority to Read Misses                         +
Sub-block Placement                             +             +
Early Restart & Critical Word 1st               +
Non-Blocking Caches                             +
Second Level Caches                             +
(the first group of techniques targets the miss rate, the second group the miss penalty)