1 IBM 360 Model 85 (1968) had a cache, which helped it outperform the more complex Model 91 (Tomasulo's algorithm). Maurice Wilkes published the first paper on cache memory in 1965. The first computer to actually include one was probably built at Cambridge (a direct-mapped cache).

2 COMP 740: Computer Architecture and Implementation Montek Singh Tue, Apr 7, 2009 Topic: Introduction to Caches (Will cover Caches, Main Memory and Virtual Memory)

3 Outline
 Cache Organization
 Cache Read/Write Policies
   Block replacement policies
   Write-back vs. write-through caches
   Write buffers
 Cache Performance
   Means of improving performance
Read Appendix C.1 through C.3

4 The Big Picture: Where are We Now?
 The Five Classic Components of a Computer: Processor (Control and Datapath), Memory, Input, Output
 This lecture (and next few): Memory System

5 The Motivation for Caches
 Motivation
   Large (cheap) memories (DRAM) are slow
   Small (costly) memories (SRAM) are fast
 Make the average access time small
   service most accesses from a small, fast memory
   reduce the bandwidth required of the large memory
 Exploit: Locality of Reference
(Figure: Processor connected to a memory system consisting of a Cache backed by DRAM)

6 The Principle of Locality
 The Principle of Locality
   Program accesses a relatively small portion of the address space at any instant of time
   Example: 90% of time in 10% of the code
 Two different types of locality
   Temporal Locality (locality in time): if an item is referenced, it will tend to be referenced again soon
   Spatial Locality (locality in space): if an item is referenced, items close by tend to be referenced soon
(Figure: frequency of reference plotted over the address space from 0 to 2^n)
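As an added illustration (not from the slide), a minimal C loop shows both kinds of locality at once:

/* sum and i are reused on every iteration: temporal locality.
   a[] is walked sequentially, element after element: spatial locality. */
int sum_array(const int *a, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}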

7 Levels of the Memory Hierarchy
Capacity / access time / cost per bit (faster and costlier toward the top, larger toward the bottom):
 CPU Registers: 500 bytes, 0.25 ns, ~$.01
 Cache: 16K-1M bytes, 1 ns, ~$.0001
 Main Memory: 64M-2G bytes, 100 ns
 Disk: 100 G bytes, 5 ms
 Tape/Network: "infinite", seconds
Staging transfer units (and who manages the transfer):
 Registers <-> Cache: words, 1-8 bytes (programmer/compiler)
 Cache <-> Memory: blocks (cache controller)
 Memory <-> Disk: pages, 4-64K bytes (OS)
 Disk <-> Tape/Network: files, Mbytes (user/operator)

8 Memory Hierarchy: Principles of Operation
 At any given time, data is copied between only 2 adjacent levels
   Upper Level (Cache): the one closer to the processor; smaller, faster, and uses more expensive technology
   Lower Level (Memory): the one further away from the processor; bigger, slower, and uses less expensive technology
 Block: the smallest unit of information that can either be present or not present in the two-level hierarchy
(Figure: blocks Blk X and Blk Y moving between the upper level (cache) and the lower level (memory), to and from the processor)

9 Memory Hierarchy: Terminology
 Hit: data appears in some block in the upper level (e.g., Block X in previous slide)
   Hit Rate = fraction of memory accesses found in the upper level
   Hit Time = time to access the upper level = memory access time + time to determine hit/miss
 Miss: data needs to be retrieved from a block in the lower level (e.g., Block Y in previous slide)
   Miss Rate = 1 - (Hit Rate)
   Miss Penalty: time to replace a block in the upper level from the lower level + time to deliver the block to the processor
 Hit Time: significantly less than Miss Penalty

10 Cache Addressing
 Block/line is the unit of allocation
 Sector/sub-block is the unit of transfer and coherence
 Cache parameters j, k, m, n are integers, and generally powers of 2
(Figure: j sets (Set 0 .. Set j-1), each with k blocks (Block 0 .. Block k-1) plus replacement info; each block holds a tag, Valid/Dirty/Shared state, and m sectors (Sector 0 .. Sector m-1) of n bytes (Byte 0 .. Byte n-1))

11 Cache Shapes

12 Cache Shapes
 Direct-mapped (A = 1, S = 16)
 2-way set-associative (A = 2, S = 8)
 4-way set-associative (A = 4, S = 4)
 8-way set-associative (A = 8, S = 2)
 Fully associative (A = 16, S = 1)

13 Cache Organization
 Direct Mapped Cache
   Each memory location can be mapped to only 1 cache location
   No need to make any decision :-) the current item replaces the previous item in that cache location
 N-way Set Associative Cache
   Each memory location has a choice of N cache locations
 Fully Associative Cache
   Each memory location can be placed in ANY cache location
 Cache miss in an N-way Set Associative or Fully Associative Cache
   Bring in the new block from memory
   Throw out a cache block to make room for the new block
   Need to decide which block to throw out!

14 4 Questions for Memory Hierarchy
 Where can a block be placed in the upper level? (Block placement)
 How is a block found if it is in the upper level? (Block identification)
 Which block should be replaced on a miss? (Block replacement)
 What happens on a write? (Write strategy)

15 Example 1: 1KB, Direct-Mapped, 32B Blocks
 For a 1024 (2^10) byte cache with 32-byte blocks:
   The uppermost 22 (= 32 - 10) address bits are the tag
   The lowest 5 address bits are the Byte Select (Block Size = 2^5)
   The next 5 address bits (bit5 - bit9) are the Cache Index
(Figure: cache array with a Valid bit, Cache Tag, and Cache Data for each index; the tag is stored as part of the cache "state". Example address: tag 0x50, cache index 0x01, byte select 0x00)
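This address breakdown can be written down directly; a minimal sketch in C, assuming 32-bit addresses and the 1KB/32-byte-block parameters of this example (names are illustrative):

#include <stdint.h>

#define OFFSET_BITS 5   /* 32-byte blocks  -> 5 byte-select bits */
#define INDEX_BITS  5   /* 32 blocks total -> 5 cache-index bits */

static inline uint32_t byte_select(uint32_t addr) {
    return addr & ((1u << OFFSET_BITS) - 1);
}
static inline uint32_t cache_index(uint32_t addr) {
    return (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
}
static inline uint32_t cache_tag(uint32_t addr) {
    return addr >> (OFFSET_BITS + INDEX_BITS);   /* remaining 22 bits */
}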

16 Example 1a: Cache Miss; Empty Block
(Figure: the access (tag 0x0002fe, index 0x00, byte select 0x00) selects a block whose valid bit is clear, so the lookup is a cache miss on an empty block)

17 Example 1b: ... Read in Data
(Figure: the missing block is fetched from memory; the new block of data is written into the cache data array, the tag field for index 0x00 is set to 0x0002fe, and the valid bit is set)

18 Example 1c: Cache Hit
(Figure: a later access (index 0x01, byte select 0x08) finds the valid bit set and the stored tag equal to the address tag, so the lookup is a cache hit)

19 Example 1d: Cache Miss; Incorrect Block
(Figure: an access (index 0x02, byte select 0x04) finds a valid block, but the stored tag does not match the address tag, so the lookup is a cache miss)

20 Example 1e: ... Replace Block
(Figure: the old block is replaced: a new block of data is read in, and the tag for that index is updated to 0x002450)

21 Cache Block Replacement Policies
 Random Replacement
   Hardware randomly selects a cache block and throws it out
 Least Recently Used (LRU)
   Hardware keeps track of the access history
   Replace the entry that has not been used for the longest time
   For a 2-way set-associative cache, one bit suffices for LRU replacement
 Example of a Simple "Pseudo" LRU Implementation (see the sketch below)
   Assume 64 Fully Associative entries
   A hardware replacement pointer points to one cache entry
   Whenever an access is made to the entry the pointer points to, move the pointer to the next entry; otherwise, do not move the pointer
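A sketch of that replacement-pointer scheme in C (illustrative, assuming the 64-entry fully associative cache above):

#define ENTRIES 64
static unsigned repl_ptr = 0;          /* current eviction candidate */

/* On every access: if the accessed entry is the current candidate,
   advance the pointer so a just-used entry is not the next victim. */
void on_access(unsigned entry) {
    if (entry == repl_ptr)
        repl_ptr = (repl_ptr + 1) % ENTRIES;
}

/* On a miss: evict whatever the pointer points to. */
unsigned choose_victim(void) {
    return repl_ptr;
}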

22 Replacement Policy
 Random: easy to implement
 LRU: hard to implement; often approximated
 FIFO: used as an approximation to LRU
 The choice has little effect on miss rate; the effect is most pronounced in small, low-associativity caches

23 Cache Write Policy
 Cache reads are much easier to handle than cache writes
   An instruction cache is much easier to design than a data cache
 Cache write: how do we keep data in the cache and memory consistent?
 Two options (decision time again :-)
   Write Back: write to the cache only; write the cache block to memory when that cache block is being replaced on a cache miss
     Needs a "dirty bit" for each cache block
     Greatly reduces the memory bandwidth requirement
     Control can be complex
   Write Through: write to the cache and memory at the same time
     What!!! How can this be? Isn't memory too slow for this?
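A minimal sketch of the write-back bookkeeping in C (illustrative; write_block_to_memory is a hypothetical helper, not from the slides):

#include <stdbool.h>
#include <stdint.h>

struct cache_block {
    uint32_t tag;
    bool     valid;
    bool     dirty;        /* the "dirty bit" mentioned above */
    uint8_t  data[32];
};

void write_block_to_memory(const struct cache_block *b);  /* hypothetical */

/* Write back: a write hit updates only the cache and marks the block dirty. */
void write_hit(struct cache_block *b, uint32_t offset, uint8_t byte) {
    b->data[offset] = byte;
    b->dirty = true;
}

/* Memory is updated only when a dirty block is replaced. */
void evict(struct cache_block *b) {
    if (b->valid && b->dirty)
        write_block_to_memory(b);
    b->valid = false;
    b->dirty = false;
}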

24 Write Buffer for Write Through
 Write Buffer: needed between the cache and main memory
   Processor: writes data into the cache and the write buffer
   Memory controller: writes the contents of the buffer to memory
 The write buffer is just a FIFO
   Typical number of entries: 4
   Works fine if store frequency (w.r.t. time) << 1 / DRAM write cycle
 Memory system designer's nightmare
   Store frequency (w.r.t. time) > 1 / DRAM write cycle
   Write buffer saturation
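A sketch of such a FIFO write buffer in C (illustrative types and names; the 4-entry size comes from the slide):

#include <stdint.h>

#define WB_ENTRIES 4
struct wb_entry { uint32_t addr, data; };

static struct wb_entry wb[WB_ENTRIES];
static unsigned head, tail, count;

/* Processor side: returns 0 if the buffer is full (the saturation case). */
int wb_enqueue(uint32_t addr, uint32_t data) {
    if (count == WB_ENTRIES)
        return 0;
    wb[tail].addr = addr;
    wb[tail].data = data;
    tail = (tail + 1) % WB_ENTRIES;
    count++;
    return 1;
}

/* Memory-controller side: drains one entry per DRAM write cycle. */
int wb_dequeue(struct wb_entry *out) {
    if (count == 0)
        return 0;
    *out = wb[head];
    head = (head + 1) % WB_ENTRIES;
    count--;
    return 1;
}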

25 Write Buffer Saturation
 Store frequency (w.r.t. time) > 1 / DRAM write cycle
   If this condition exists for a long period of time (CPU cycle time too quick and/or too many store instructions in a row), the store buffer will overflow no matter how big you make it
   The underlying problem: CPU cycle time << DRAM write cycle time
 Solutions for write buffer saturation
   Use a write-back cache
   Install a second-level (L2) cache

26 On a Write Miss
 Write allocate: the block is allocated in the cache
 No-write allocate: no cache block is allocated; the write goes only to main memory (or the next level of the hierarchy)

27 Opteron Cache
 64K bytes in 64-byte blocks
 40-bit physical address (1)
 2-way set associative, LRU replacement
 Write back; write allocate on miss; dirty bit
 Victim buffer for replaced blocks: 8 blocks
 Tags indexed (2) and compared (3); note the valid bit
 2-clock read on a hit; on a miss, 7 clocks for the 1st 8 bytes, then 2 clocks per 8 bytes

28 Separate I & D Caches
 Commonly done
 Increases bandwidth to the processor
 Allows for the different access patterns of instructions and data

29 Cache Performance

30 Block Size Tradeoff
 In general, larger block sizes take advantage of spatial locality, BUT:
   Larger block size means larger miss penalty: it takes longer to fill up the block
   If the block size is too big relative to the cache size, the miss rate will go up: too few cache blocks compromises temporal locality
 Average Access Time = Hit Time + Miss Penalty x Miss Rate
(Figure: as block size grows, miss penalty rises steadily, miss rate first falls (exploiting spatial locality) and then rises (too few blocks), and average access time has a minimum in between)
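As an illustrative plug-in of numbers (not from the slide): with a 1-cycle hit time, a 5% miss rate, and a 40-cycle miss penalty, Average Access Time = 1 + 40 x 0.05 = 3 cycles.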

31 Sources of Cache Misses
 Compulsory (cold start or process migration, first reference): first access to a block
   "Cold" fact of life: not a whole lot you can do about it
 Conflict/Collision/Interference
   Multiple memory locations mapped to the same cache location
   Solution 1: increase cache size; Solution 2: increase associativity
 Capacity
   Cache cannot contain all blocks accessed by the program
   Solution 1: increase cache size; Solution 2: restructure the program
 Coherence/Invalidation
   Other process (e.g., I/O) updates memory

32 The 3C Model of Cache Misses
 Based on comparison with another cache
   Compulsory: the first access to a block is not in the cache, so the block must be brought into the cache. Also called cold start misses or first reference misses. (Misses even in an infinite cache)
   Capacity: if the cache cannot contain all the blocks needed during execution of a program (its working set), capacity misses will occur due to blocks being discarded and later retrieved. (Misses in a fully associative cache of size X)
   Conflict: if the block-placement strategy is set-associative or direct-mapped, conflict misses (in addition to compulsory and capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. Also called collision misses or interference misses. (Misses in an A-way associative cache of size X but not in a fully associative cache of size X)
 Also: Coherence/Invalidation
   Other process (e.g., I/O) updates memory

33 Possible Solutions
 Compulsory (cold start or process migration, first reference): first access to a block
   "Cold" fact of life: not a whole lot you can do about it
 Conflict/Collision/Interference
   Multiple memory locations mapped to the same cache location
   Solution 1: increase cache size; Solution 2: increase associativity
 Capacity
   Cache cannot contain all blocks accessed by the program
   Solution 1: increase cache size; Solution 2: restructure the program

34 Sources of Cache Misses (comparison)
                     Direct Mapped   N-way Set Associative   Fully Associative
 Cache Size          Big             Medium                  Small
 Compulsory Miss     Same            Same                    Same
 Conflict Miss       High            Medium                  Zero
 Capacity Miss       Low(er)         Medium                  High
 Invalidation Miss   Same            Same                    Same
If you are going to run "billions" of instructions, compulsory misses are insignificant.

35 3Cs Absolute Miss Rate
(Figure: absolute miss rate broken into the 3Cs; the labeled component is conflict misses)

36 3Cs Relative Miss Rate
(Figure: the same breakdown normalized to total miss rate; the labeled component is conflict misses)

37 How to Improve Cache Performance
 Latency
   Reduce miss rate
   Reduce miss penalty
   Reduce hit time
 Bandwidth
   Increase hit bandwidth
   Increase miss bandwidth

38 1. Reduce Misses via Larger Block Size

39 2. Reduce Misses via Higher Associativity
 2:1 Cache Rule
   Miss Rate of a direct-mapped cache of size N ≈ Miss Rate of a fully associative cache of size N/2
   Not merely empirical: theoretical justification in Sleator and Tarjan, "Amortized efficiency of list update and paging rules", CACM 28(2):202-208, 1985
 Beware: execution time is the only final measure!
   Will the clock cycle time increase?
   Hill [1988] suggested hit time is ~10% higher for 2-way vs. 1-way

40 Example: Average Memory Access Time vs. Miss Rate
 Example: assume the clock cycle time is 1.10 for 2-way, 1.12 for 4-way, and 1.14 for 8-way, relative to the clock cycle time of a direct-mapped cache
(Table not reproduced; red entries mean A.M.A.T. is not improved by more associativity)

41 3. Reduce Conflict Misses via Victim Cache
 How to combine the fast hit time of direct mapped yet avoid conflict misses
 Add a small, highly associative buffer to hold data discarded from the cache
 Jouppi [1990]: a 4-entry victim cache removed 20% to 95% of conflicts for a 4 KB direct-mapped data cache
(Figure: the CPU checks the cache's tags and the victim cache's tags in parallel on the way to memory)

42 4. Reduce Conflict Misses via Pseudo-Associativity
 How to combine the fast hit time of direct mapped and the lower conflict misses of a 2-way SA cache
 Divide the cache: on a miss, check the other half of the cache to see if the data is there; if so, we have a pseudo-hit (slow hit)
 Drawback: CPU pipeline design is hard if a hit takes 1 or 2 cycles
   Better for caches not tied directly to the processor
(Timeline: Hit Time < Pseudo Hit Time < Miss Penalty)

43 5. Reduce Misses by Hardware Prefetching
 Instruction prefetching
   Alpha fetches 2 blocks on a miss
   The extra block is placed in a stream buffer
   On a miss, check the stream buffer
 Works with data blocks too
   Jouppi [1990]: 1 data stream buffer caught 25% of misses from a 4KB cache; 4 stream buffers caught 43%
   Palacharla & Kessler [1994]: for scientific programs, 8 streams caught 50% to 70% of the misses from two 64KB, 4-way set-associative caches
 Prefetching relies on extra memory bandwidth that can be used without penalty
   e.g., up to 8 prefetch stream buffers in the UltraSPARC III

44 6. Reducing Misses by Software Prefetching
 Data prefetch
   Compiler inserts special "prefetch" instructions into the program
     Load data into register (HP PA-RISC loads)
     Cache Prefetch: load into cache (MIPS IV, PowerPC, SPARC v9)
   A form of speculative execution: we don't really know if the data is needed or if it is already in the cache
   The most effective prefetches are "semantically invisible" to the program: they do not change registers or memory and cannot cause a fault/exception (if they would fault, they are simply turned into NOPs)
 Issuing prefetch instructions takes time
   Is the cost of prefetch issues < the savings in reduced misses?
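As a concrete sketch of compiler-visible prefetching (added here, not from the slide), GCC and Clang expose it as the __builtin_prefetch intrinsic; the distance of 16 elements is an illustrative guess, not a tuned value:

/* Prefetch a[i+16] while working on a[i].  The builtin is "semantically
   invisible" in the sense above: it changes no registers or memory, and
   on targets without a prefetch instruction it compiles to nothing. */
void scale(double *a, int n, double k) {
    for (int i = 0; i < n; i++) {
        if (i + 16 < n)                            /* stay within the array */
            __builtin_prefetch(&a[i + 16], /* rw = */ 0, /* locality = */ 1);
        a[i] *= k;
    }
}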

45 7. Reduce Misses by Compiler Optimizations
 Instructions
   Reorder procedures in memory so as to reduce misses
   Profiling to look at conflicts
   McFarling [1989] reduced cache misses by 75% on an 8KB direct-mapped cache with 4-byte blocks
 Data
   Merging Arrays: improve spatial locality by a single array of compound elements vs. 2 arrays
   Loop Interchange: change the nesting of loops to access data in the order stored in memory
   Loop Fusion: combine two independent loops that have the same looping and some variables overlap
   Blocking: improve temporal locality by accessing "blocks" of data repeatedly vs. going down whole columns or rows

46 Merging Arrays Example
 Reduces conflicts between val and key
 Addressing expressions are different

/* Before */
int val[SIZE];
int key[SIZE];

/* After */
struct merge {
  int val;
  int key;
};
struct merge merged_array[SIZE];

47 Loop Interchange Example
 Sequential accesses instead of striding through memory every 100 words

/* Before */
for (k = 0; k < 100; k++)
  for (j = 0; j < 100; j++)
    for (i = 0; i < 5000; i++)
      x[i][j] = 2 * x[i][j];

/* After */
for (k = 0; k < 100; k++)
  for (i = 0; i < 5000; i++)
    for (j = 0; j < 100; j++)
      x[i][j] = 2 * x[i][j];

48 Loop Fusion Example
 Before: 2 misses per access to a and c
 After: 1 miss per access to a and c

/* Before */
for (i = 0; i < N; i++)
  for (j = 0; j < N; j++)
    a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i++)
  for (j = 0; j < N; j++)
    d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i++)
  for (j = 0; j < N; j++) {
    a[i][j] = 1/b[i][j] * c[i][j];
    d[i][j] = a[i][j] + c[i][j];
  }

49 Blocking Example
 Two Inner Loops:
   Read all NxN elements of z[]
   Read N elements of 1 row of y[] repeatedly
   Write N elements of 1 row of x[]
 Capacity misses are a function of N and cache size
   If the cache can hold all 3 NxN matrices, there are no capacity misses; otherwise...
 Idea: compute on a BxB submatrix that fits in the cache

/* Before */
for (i = 0; i < N; i++)
  for (j = 0; j < N; j++) {
    r = 0;
    for (k = 0; k < N; k++)
      r = r + y[i][k]*z[k][j];
    x[i][j] = r;
  }

50 Blocking Example (contd.)
 Age of accesses (in the figure)
   White means not touched yet
   Light gray means touched a while ago
   Dark gray means newer accesses

51 Blocking Example (contd.)
 Work with BxB submatrices
   The smaller working set can fit within the cache: fewer capacity misses

/* After */
for (jj = 0; jj < N; jj = jj+B)
  for (kk = 0; kk < N; kk = kk+B)
    for (i = 0; i < N; i++)
      for (j = jj; j < min(jj+B-1,N); j++) {
        r = 0;
        for (k = kk; k < min(kk+B-1,N); k++)
          r = r + y[i][k]*z[k][j];
        x[i][j] = x[i][j] + r;
      }

52 Blocking Example (contd.)
 Capacity required goes from (2N^3 + N^2) to (2N^3/B + N^2)
 B = "blocking factor"

53 Summary: Compiler Optimizations to Reduce Cache Misses

54 Reducing Miss Penalty
1. Read Priority over Write on Miss
 Write through
   Using write buffers: RAW conflicts with reads on cache misses
   If we simply wait for the write buffer to empty, we might increase the read miss penalty by 50% (old MIPS 1000)
   Check the write buffer contents before a read; if there are no conflicts, let the memory access continue
 Write back
   A read miss replacing a dirty block
   Normal: write the dirty block to memory, and then do the read
   Instead: copy the dirty block to a write buffer, then do the read, and then do the write
   The CPU stalls less since it restarts as soon as the read completes

55 2. Fetching Subblocks to Reduce Miss Penalty
 Don't have to load the full block on a miss
 Have valid bits per subblock to indicate validity

56 3. Early Restart and Critical Word First
 Don't wait for the full block to be loaded before restarting the CPU
   Early Restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution
   Critical Word First: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue while filling the rest of the words in the block. Also called "wrapped fetch" and "requested word first"
 Generally useful only in large blocks
 Spatial locality is a problem: the CPU tends to want the next sequential word, so it is not clear there is a benefit from early restart

57 4. Non-blocking Caches
 A non-blocking cache (or lockup-free cache) allows the data cache to continue to supply cache hits during a miss
   "Hit under miss" reduces the effective miss penalty by being helpful during a miss instead of ignoring the requests of the CPU
   "Hit under multiple miss" or "miss under miss" may further lower the effective miss penalty by overlapping multiple misses
   Significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses

58 Value of Hit Under Miss for SPEC
 FP programs on average: AMAT improves steadily with each additional outstanding miss allowed, down to 0.26
 Int programs on average: AMAT improves only slightly, down to 0.19
 8 KB data cache, direct mapped, 32B blocks, 16-cycle miss penalty
(Figure: "hit under i misses" for the integer and floating-point SPEC benchmarks)

59 5. Miss Penalty Reduction: L2 Cache
 L2 Equations:
   AMAT = Hit Time(L1) + Miss Rate(L1) x Miss Penalty(L1)
   Miss Penalty(L1) = Hit Time(L2) + Miss Rate(L2) x Miss Penalty(L2)
   AMAT = Hit Time(L1) + Miss Rate(L1) x (Hit Time(L2) + Miss Rate(L2) x Miss Penalty(L2))
 Definitions:
   Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (Miss Rate(L2))
   Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate(L1) x Miss Rate(L2))
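An illustrative plug-in of numbers (not from the slide): with a 1-cycle L1 hit, a 4% L1 miss rate, a 10-cycle L2 hit, a 50% local L2 miss rate, and a 100-cycle L2 miss penalty, AMAT = 1 + 0.04 x (10 + 0.5 x 100) = 3.4 cycles, and the global L2 miss rate is 0.04 x 0.5 = 2%.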

60 Reducing Misses: Which Apply to L2 Cache?
 Reducing Miss Rate
   1. Reduce misses via larger block size
   2. Reduce conflict misses via higher associativity
   3. Reduce conflict misses via victim cache
   4. Reduce conflict misses via pseudo-associativity
   5. Reduce misses by HW prefetching of instructions and data
   6. Reduce misses by SW prefetching of data
   7. Reduce capacity/conflict misses by compiler optimizations

61 L2 cache block size & A.M.A.T.
 32KB L1; 8-byte path to memory
(Figure: average memory access time vs. L2 block size)

62 Reducing Miss Penalty Summary
 Five techniques
   Read priority over write on miss
   Subblock placement
   Early restart and critical word first on miss
   Non-blocking caches (hit under miss)
   Second-level cache
     Reducing the L2 miss rate effectively reduces the L1 miss penalty!
 Can be applied recursively to multilevel caches
   The danger is that the time to DRAM will grow with multiple levels in between

63 Review: Improving Cache Performance 1. Reduce the miss rate, 2. Reduce the miss penalty, or 3. Reduce the time to hit in the cache.

64 1. Fast Hit Times via Small, Simple Caches
 Simple caches can be faster
   Cache hit time is increasingly a bottleneck to CPU performance
     Set associativity requires complex tag matching, hence slower
     Direct-mapped caches are simpler, hence faster, allowing shorter CPU cycle times; the tag check can be overlapped with transmission of the data
 Smaller caches can be faster
   Can fit on the same chip as the CPU, avoiding the penalty of going off-chip
   For L2 caches, a compromise: keep the tags on chip and the data off chip, giving a fast tag check yet greater cache capacity
   The L1 data cache was reduced from 16KB in the Pentium III to 8KB in the Pentium 4

65 Cache Optimization Summary
 MR = miss rate, MP = miss penalty, HT = hit time; + improves, - hurts

 Technique                          MR  MP  HT  Complexity
 Larger Block Size                  +   -       0
 Higher Associativity               +       -   1
 Victim Caches                      +           2
 Pseudo-Associative Caches          +           2
 HW Prefetching of Instr/Data       +           2
 Compiler Controlled Prefetching    +           3
 Compiler Reduce Misses             +           0
 Priority to Read Misses                +       1
 Subblock Placement                     +   +   1
 Early Restart & Critical Word 1st      +       2
 Non-Blocking Caches                    +       3
 Second Level Caches                    +       2
 Small & Simple Caches              -       +   0
 Avoiding Address Translation               +   2

66 Impact of Caches
 1960 - 1985: Speed = ƒ(no. operations)
 1997
   Pipelined execution and fast clock rate
   Out-of-order completion
   Superscalar instruction issue
 1999: Speed = ƒ(non-cached memory accesses)
 Has an impact on: compilers, architects, algorithms, data structures?

67 In Conclusion
 Have looked at the basic types of caches
 Problems
 How to improve performance
 Next: methods to ensure cache consistency in SMPs