CIS429/529 Cache Basics: Caches

Presentation transcript:

CIS429/529 Cache Basics 1
Caches
°Why is caching needed? Technological development and Moore’s Law
°Why are caches successful? Principle of locality
°Three basic models and how they work (in detail): direct mapped, fully associative, set associative
°How they interact with the pipeline
°Performance analysis of caches: Average Memory Access Time (AMAT)
°Enhancements to caches

CIS429/529 Cache Basics 2
DRAM was actually faster than microprocessors!
°Apple ][ (1977), Steve Wozniak and Steve Jobs
-CPU: 1000 ns
-DRAM: 400 ns

CIS429/529 Cache Basics 3
Since 1980, CPU has outpaced DRAM...
[Figure: Performance (1/latency) vs. Year, with one curve for CPU and one for DRAM]
°CPU: 60% per yr (2X in 1.5 yrs)
°DRAM: 9% per yr (2X in 10 yrs)
°Gap grew 50% per year

CIS429/529 Cache Basics 4
How do computer architectures address the gap?
[Figure: Performance (1/latency) vs. Year; CPU 60% per yr (2X in 1.5 yrs), DRAM 9% per yr (2X in 10 yrs); gap grew 50% per year]
°Put smaller, faster “cache” memories between CPU and DRAM.
°Create a “memory hierarchy”.

CIS429/529 Cache Basics 5
The Goal: illusion of large, fast, cheap memory
°Fact: Large memories are slow, fast memories are small
°How do we create a memory that is large, cheap and fast (most of the time)?
-Hierarchy
-Parallelism
°Let programs address a memory space that scales to the disk size, at a speed that is usually as fast as register access

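CIS429/529 Cache Basics 6
An Expanded View of the Memory System
[Figure: Processor (Control, Datapath) followed by a chain of memories that forms the Memory Hierarchy]
From the memory nearest the processor to the farthest:
Speed: Fastest to Slowest
Size: Smallest to Biggest
Cost: Highest to Lowest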

CIS429/529 Cache Basics 7
Levels of the Memory Hierarchy
(Some values were lost in the transcript and are left blank.)

Level | Capacity | Access Time | Cost
CPU Registers | 100s Bytes | <10s ns |
Cache | K Bytes | ns | cents/bit
Main Memory | M Bytes | 200ns-500ns | $ cents/bit
Disk | G Bytes | 10 ms (10,000,000 ns) | cents/bit
Tape | infinite | sec-min |

Between levels | Staging Xfer Unit | Managed by | Size
Registers and Cache | Instr. Operands | prog./compiler | 1-8 bytes
Cache and Memory | Blocks | cache cntl | bytes
Memory and Disk | Pages | OS | 512-4K bytes
Disk and Tape | Files | user/operator | Mbytes

Upper Level: faster. Lower Level: larger.

CIS429/529 Cache Basics 8
Memory Hierarchy: Apple iMac G5
iMac G5, 1.6 GHz

Level | Reg | L1 Inst | L1 Data | L2 | DRAM | Disk
Size | 1K | 64K | 32K | 512K | 256M | 80G
Latency (cycles) | 1 | 3 | 3 | 11 | 88 | 10^7
Latency (time) | 0.6 ns | 1.9 ns | 1.9 ns | 6.9 ns | 55 ns | 12 ms

Managed by compiler: registers. Managed by hardware: caches. Managed by OS, hardware, application: DRAM and disk.
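The cycle counts and times on this slide are tied together by the 1.6 GHz clock. The following is a minimal sketch of that conversion; the function and variable names are illustrative and not from the slides.

```python
# Convert the slide's latencies from clock cycles to nanoseconds.
CLOCK_GHZ = 1.6  # iMac G5 clock rate given on the slide


def cycles_to_ns(cycles, clock_ghz=CLOCK_GHZ):
    """One cycle lasts 1 / clock_ghz nanoseconds."""
    return cycles / clock_ghz


# Cycle counts taken from the slide (disk omitted: its entry is an order of magnitude).
latency_cycles = {"Reg": 1, "L1": 3, "L2": 11, "DRAM": 88}
latency_ns = {level: cycles_to_ns(c) for level, c in latency_cycles.items()}

for level, ns in latency_ns.items():
    print(f"{level}: {latency_cycles[level]} cycles = {ns:.3f} ns")
```

Rounded to one decimal place, the computed values agree with the times listed on the slide.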

CIS429/529 Cache Basics 9
iMac’s PowerPC 970: All caches on-chip
[Die photo labels: Registers (1K), L1 (64K Instruction), L1 (32K Data), 512K L2]

CIS429/529 Cache Basics 10
The Principle of Locality
°The Principle of Locality: Programs access a relatively small portion of the address space at any instant of time.
°Two Different Types of Locality:
-Temporal Locality (Locality in Time): If an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
-Spatial Locality (Locality in Space): If an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)
°For the last 15 years, HW has relied heavily on locality for speed
Locality is a universal property of programs which is exploited in many aspects of HW and SW design.
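Spatial locality can be made concrete by comparing two traversal orders of the same array. This is a rough sketch under assumed parameters (a hypothetical 64 x 64 array of 4-byte words in row-major layout, and 32-byte blocks); none of these numbers come from the slide.

```python
def same_block_fraction(addresses, block_size=32):
    """Fraction of accesses that fall in the same block as the previous access."""
    pairs = list(zip(addresses, addresses[1:]))
    same = sum(1 for a, b in pairs if a // block_size == b // block_size)
    return same / len(pairs)


# Hypothetical array: 64 x 64 words of 4 bytes, stored row by row.
ROWS, COLS, WORD = 64, 64, 4

# Row-by-row traversal: consecutive accesses are 4 bytes apart.
row_major = [(r * COLS + c) * WORD for r in range(ROWS) for c in range(COLS)]
# Column-by-column traversal: consecutive accesses are a full row (256 bytes) apart.
col_major = [(r * COLS + c) * WORD for c in range(COLS) for r in range(ROWS)]

print(same_block_fraction(row_major))  # high: neighbors share a block
print(same_block_fraction(col_major))  # zero: every access lands in a new block
```

The row-order loop reuses each fetched block for several accesses, which is the behavior that moving whole blocks into the upper level is meant to exploit.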

CIS429/529 Cache Basics 11
Programs with locality cache well...
[Figure: Memory Address (one dot per access) vs. Time, with regions labeled Spatial Locality, Temporal Locality, and Bad locality behavior]
Source: Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): (1971)

CIS429/529 Cache Basics 12
Memory Hierarchy Use of Locality
°Temporal Locality (Locality in Time): => Keep most recently accessed data items closer to the processor
°Spatial Locality (Locality in Space): => Move blocks of contiguous words to the upper levels
[Figure: Upper Level Memory (Blk X) and Lower Level Memory (Blk Y); data moves To Processor and From Processor through the upper level]

CIS429/529 Cache Basics 13
Memory Hierarchy: Terminology
°Hit: data appears in some block in the upper level
-Hit Rate: the fraction of memory accesses found in the upper level
-Hit Time: time to access the upper level = time to determine hit/miss + time to deliver the block to the processor
°Miss: data needs to be retrieved from a block in the lower level
-Miss Rate = 1 - (Hit Rate)
-Miss Time: time to determine hit/miss + time to replace a block in the upper level + time to deliver the block to the processor
-Miss Penalty: extra time incurred for a miss = time to replace a block in the upper level
°Hit Time << Miss Penalty and Miss Time
[Figure: Upper Level Memory (Blk X) and Lower Level Memory (Blk Y); data moves To Processor and From Processor through the upper level]

CIS429/529 Cache Basics 14
Four Organizing Principles for Caches and Memory Hierarchy
°Q1: Where can a block be placed in the upper level? (Block placement)
°Q2: How is a block found if it is in the upper level? (Block identification)
°Q3: Which block should be replaced on a miss? (Block replacement)
°Q4: What happens on a write? (Write strategy)

CIS429/529 Cache Basics 15
Q1: Where can a block be placed in the upper level?
°Block 12 placed in 8 block cache: fully associative, direct mapped, 2-way set associative
-Fully associative: block 12 can go anywhere
-Direct mapped: block 12 can go only into block 4 (12 mod 8)
-Set associative: block 12 can go anywhere in set 0 (12 mod 4)
[Figure: three 8-block caches labeled by Block no., the set associative one grouped into Set 0 to Set 3, drawn above memory labeled by Block-frame address]
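The three placement rules are one rule with different associativity. A small sketch, assuming the frames of a set are numbered contiguously as in the slide's figure; the function name is illustrative.

```python
def candidate_frames(block_addr, num_frames, ways):
    """Return (set index, list of cache frames) where a memory block may be placed.

    ways = 1 is direct mapped; ways = num_frames is fully associative.
    Assumes the frames of each set are numbered contiguously.
    """
    num_sets = num_frames // ways
    set_index = block_addr % num_sets
    first = set_index * ways
    return set_index, list(range(first, first + ways))


# Block 12 in an 8-block cache, as on the slide.
print(candidate_frames(12, 8, 1))  # direct mapped
print(candidate_frames(12, 8, 2))  # 2-way set associative
print(candidate_frames(12, 8, 8))  # fully associative
```

With one way per set the set index is 12 mod 8, and with two ways it is 12 mod 4, matching the slide.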

CIS429/529 Cache Basics 16
Example: 1 KB Direct Mapped Cache with 32 B Blocks
°For a 2 ** N byte cache:
-The uppermost (32 - N) bits are always the Cache Tag
-The lowest M bits are the Byte Select (Block Size = 2 ** M)
[Figure: the 32-bit address splits into Cache Tag (upper bits, example: 0x50), Cache Index (ex: 0x01), and Byte Select (lowest bits, ex: 0x00). Each cache entry holds a Valid Bit and a Cache Tag (0x50 in the example), stored as part of the cache “state”, plus 32 bytes of Cache Data: Byte 0 to Byte 31 in the first entry, Byte 32 to Byte 63 in the second, and so on up to Byte 992 to Byte 1023 in the last.]
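For this 1 KB cache, N = 10 and M = 5, so the index is the N - M = 5 bits between the tag and the byte select. A sketch of the address split, assuming cache size and block size are powers of two; the helper name is illustrative.

```python
CACHE_BYTES = 1024  # 2 ** N with N = 10
BLOCK_BYTES = 32    # 2 ** M with M = 5


def split_address(addr, cache_bytes=CACHE_BYTES, block_bytes=BLOCK_BYTES):
    """Split an address into (tag, index, byte_select) for a direct mapped cache.

    Assumes cache_bytes and block_bytes are powers of two.
    """
    n = cache_bytes.bit_length() - 1   # log2 of the cache size
    m = block_bytes.bit_length() - 1   # log2 of the block size
    byte_select = addr & (block_bytes - 1)
    index = (addr >> m) & ((1 << (n - m)) - 1)
    tag = addr >> n
    return tag, index, byte_select


# Rebuild the slide's example address: tag 0x50, index 0x01, byte select 0x00.
addr = (0x50 << 10) | (0x01 << 5) | 0x00
tag, index, byte_select = split_address(addr)
print(hex(addr), hex(tag), hex(index), hex(byte_select))
```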

CIS429/529 Cache Basics 17
Example: Fully Associative
°Fully Associative Cache
-Forget about the Cache Index
-Compare the Cache Tags of all cache entries in parallel
-Example: with 32 B blocks, a cache of N entries needs N 27-bit comparators
[Figure: the address splits into Cache Tag (27 bits long) and Byte Select (ex: 0x01); every entry holds a Valid Bit, a Cache Tag, and 32 bytes of Cache Data, and every entry’s tag feeds its own comparator (X)]
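A fully associative lookup can be sketched as a search over every valid tag. This is a software model only: the loop stands in for the parallel comparators, the data bytes are omitted, and the class name is illustrative.

```python
OFFSET_BITS = 5  # 32 B blocks, as on the slide


class FullyAssociativeCache:
    """Tag store only; cache data is left out to keep the sketch short."""

    def __init__(self, num_entries):
        self.valid = [False] * num_entries
        self.tags = [0] * num_entries

    def fill(self, entry, addr):
        """Place the block containing addr into the given entry."""
        self.valid[entry] = True
        self.tags[entry] = addr >> OFFSET_BITS

    def lookup(self, addr):
        """Return the matching entry number, or None on a miss."""
        tag = addr >> OFFSET_BITS  # 27-bit tag for a 32-bit address
        # Hardware does these comparisons at once, one comparator per entry.
        for entry, (valid, stored) in enumerate(zip(self.valid, self.tags)):
            if valid and stored == tag:
                return entry
        return None


cache = FullyAssociativeCache(num_entries=4)
cache.fill(2, 0x14020)
print(cache.lookup(0x1403F))  # same 32 B block as 0x14020
print(cache.lookup(0x14040))  # next block, not present
```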

CIS429/529 Cache Basics 18
Example: Set Associative Cache
°N-way set associative: N entries for each Cache Index
-N direct mapped caches operate in parallel
°Example: Two-way set associative cache
-Cache Index selects a “set” from the cache
-The two tags in the set are compared to the input in parallel
-Data is selected based on the tag result
[Figure: two banks, each with Valid, Cache Tag, and Cache Data (Cache Block 0, ...), both indexed by Cache Index; each bank’s tag is compared with Adr Tag; the compare outputs drive Sel1 and Sel0 of a mux that selects the Cache Block, and are ORed to form Hit]
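The two-way lookup described above can be modeled in a few lines. This is a sketch with illustrative names; it stores tags only, and the list comprehension stands in for the parallel tag compares.

```python
class SetAssociativeCache:
    """Tag-only model of an N-way set associative cache (default: two-way)."""

    def __init__(self, num_sets, ways=2, block_bytes=32):
        self.num_sets = num_sets
        self.block_bytes = block_bytes
        # sets[index][way] holds the tag stored in that way, or None if invalid.
        self.sets = [[None] * ways for _ in range(num_sets)]

    def _split(self, addr):
        block = addr // self.block_bytes
        return block // self.num_sets, block % self.num_sets  # (tag, index)

    def fill(self, addr, way):
        tag, index = self._split(addr)
        self.sets[index][way] = tag

    def lookup(self, addr):
        """Return (hit, way); way plays the role of the mux select lines."""
        tag, index = self._split(addr)                            # index selects a set
        matches = [stored == tag for stored in self.sets[index]]  # tag compares
        hit = any(matches)                                        # the OR gate
        return hit, (matches.index(True) if hit else None)


cache = SetAssociativeCache(num_sets=16)
cache.fill(0x0000, way=0)  # block 0: index 0, tag 0
cache.fill(0x0200, way=1)  # block 16: index 0, tag 1
print(cache.lookup(0x0204))
print(cache.lookup(0x0400))  # block 32 also maps to set 0 but is not resident
```

Two blocks that would conflict in a direct mapped cache of the same number of sets can both stay resident here, one per way.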

CIS429/529 Cache Basics 19
Q2: How is a block found if it is in the upper level?
°Direct indexing (using index and block offset), tag compares, or combination
°Increasing associativity shrinks index, expands tag
[Address layout: Block Address (Tag, then Index) followed by Block offset]
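How associativity trades index bits for tag bits can be checked with a short calculation. A sketch assuming 32-bit addresses and power-of-two sizes; the function name is illustrative.

```python
def field_widths(cache_bytes, block_bytes, ways, addr_bits=32):
    """Return (tag_bits, index_bits, offset_bits) for a cache organization.

    Assumes all sizes are powers of two.
    """
    offset_bits = block_bytes.bit_length() - 1
    num_sets = cache_bytes // (block_bytes * ways)
    index_bits = num_sets.bit_length() - 1
    tag_bits = addr_bits - index_bits - offset_bits
    return tag_bits, index_bits, offset_bits


# 1 KB cache with 32 B blocks at increasing associativity.
for ways in (1, 2, 4, 32):
    print(ways, field_widths(1024, 32, ways))
```

Each doubling of associativity halves the number of sets, so one bit moves from the index to the tag. With 32 ways this 1 KB cache is fully associative and the tag reaches 27 bits.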

CIS429/529 Cache Basics 20
Q3: Which block should be replaced on a miss?
°Easy for Direct Mapped
°Set Associative or Fully Associative:
-Random
-LRU (Least Recently Used)

Miss rates by cache size, associativity, and replacement policy:
Size | 2-way LRU | 2-way Random | 4-way LRU | 4-way Random | 8-way LRU | 8-way Random
16 KB | 5.2% | 5.7% | 4.7% | 5.3% | 4.4% | 5.0%
64 KB | 1.9% | 2.0% | 1.5% | 1.7% | 1.4% | 1.5%
256 KB | 1.15% | 1.17% | 1.13% | 1.13% | 1.12% | 1.12%
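LRU replacement for a single set can be sketched with an ordered dictionary. The trace below is made up for illustration and is unrelated to the measurements in the table.

```python
from collections import OrderedDict


def lru_misses(block_trace, ways):
    """Count misses for one cache set with `ways` frames under LRU replacement."""
    resident = OrderedDict()  # least recently used block first
    misses = 0
    for block in block_trace:
        if block in resident:
            resident.move_to_end(block)       # hit: mark as most recently used
        else:
            misses += 1
            if len(resident) == ways:
                resident.popitem(last=False)  # evict the least recently used block
            resident[block] = True
    return misses


trace = [1, 2, 1, 3, 1, 2]  # hypothetical block numbers that map to the same set
print(lru_misses(trace, ways=2))
print(lru_misses(trace, ways=3))
```

With two ways, block 2 is evicted to make room for block 3 and misses again later; with three ways only the first touch of each block misses.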

CIS429/529 Cache Basics 21
Q4: What happens on a write?
°Writes occur less frequently than reads. Under MIPS:
-7% of all memory traffic is writes
-25% of all data traffic is writes
°Thus, Amdahl’s Law implies that caches should be optimized for reads. However, we cannot ignore writes.
°Problems with writes:
-Must check tag BEFORE writing into the cache
-Only a portion of the cache block is modified
-Write stalls: CPU must wait until the write completes

CIS429/529 Cache Basics 22
Q4: What happens on a write: Design Options
°Write through: The information is written to both the block in the cache and to the block in the lower-level memory.
°Write back: The information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced.
-Is the block clean or dirty?
°Pros and Cons of each?
-WT: read misses don’t cause writes; easier to implement; an up-to-date copy of the data always exists in memory
-WB: write at the speed of the cache; multiple writes to cache before write to memory; less memory BW consumed
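The bandwidth difference between the two policies shows up when stores repeat to the same block. A simplified sketch: it assumes a direct mapped cache, a trace made only of stores, write allocate on a miss, and no final flush of dirty blocks. The names are illustrative.

```python
def memory_writes(store_blocks, policy, num_frames=4):
    """Count writes that reach lower-level memory for a trace of stores.

    policy is "write-through" or "write-back". Assumes a direct mapped cache
    with write allocate; dirty blocks left at the end are not flushed.
    """
    frames = {}  # frame index -> (resident block, dirty bit)
    writes = 0
    for block in store_blocks:
        idx = block % num_frames
        resident = frames.get(idx)
        if resident is not None and resident[0] != block and resident[1]:
            writes += 1  # write back: dirty victim goes to memory on replacement
        if policy == "write-through":
            writes += 1  # every store also goes to memory
            frames[idx] = (block, False)
        else:
            frames[idx] = (block, True)  # store stays in the cache, block is dirty
    return writes


trace = [0, 0, 0, 4, 0]  # blocks 0 and 4 conflict in a 4-frame cache
print(memory_writes(trace, "write-through"))
print(memory_writes(trace, "write-back"))
```

Write through sends all five stores to memory, while write back only writes a block when a dirty copy is replaced.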

CIS429/529 Cache Basics 23
Write Buffer for Write Through
[Figure: Processor writes to the Cache and to the Write Buffer; the Write Buffer drains to DRAM]
°A Write Buffer is needed between the Cache and Memory
-Processor: writes data into the cache and the write buffer
-Memory controller: writes contents of the buffer to memory
°Write buffer is just a FIFO:
-Typical number of entries: 4
-Works fine if: Store frequency (w.r.t. time) << 1 / DRAM write cycle
°Memory system designer’s nightmare:
-Store frequency (w.r.t. time) > 1 / DRAM write cycle
-Write buffer saturation
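The saturation condition can be explored with a toy FIFO model. This is a sketch under simplifying assumptions: DRAM services one write at a time, each write takes a fixed number of cycles, and the processor stalls only when the buffer is full. All names and numbers are hypothetical.

```python
from collections import deque


def write_buffer_stalls(store_times, dram_write_cycle, entries=4):
    """Return total processor stall cycles caused by a full write buffer.

    store_times: cycles at which stores would issue if nothing ever stalled.
    """
    pending = deque()  # completion times of buffered writes, oldest first
    last_done = 0      # when DRAM finishes the most recently queued write
    stall = 0
    for t in store_times:
        t += stall                           # earlier stalls delay this store
        while pending and pending[0] <= t:
            pending.popleft()                # these writes have reached DRAM
        if len(pending) == entries:
            wait = pending.popleft() - t     # buffer full: wait for the oldest write
            stall += wait
            t += wait
        last_done = max(t, last_done) + dram_write_cycle
        pending.append(last_done)
    return stall


# One store every 10 cycles against a 5-cycle DRAM write: the buffer keeps up.
print(write_buffer_stalls(range(0, 100, 10), dram_write_cycle=5))
# One store every cycle: the 4-entry buffer saturates.
print(write_buffer_stalls(range(20), dram_write_cycle=5))
# A 64-entry buffer only postpones the problem.
print(write_buffer_stalls(range(200), dram_write_cycle=5, entries=64))
```

When stores arrive faster than DRAM can retire them, a larger buffer delays the first stall but does not remove it.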

CIS429/529 Cache Basics 24
Write Buffer Saturation
°Store frequency (w.r.t. time) > 1 / DRAM write cycle
°If this condition exists for a long period of time (CPU cycle time too quick and/or too many store instructions in a row):
-Store buffer will overflow no matter how big you make it
-The CPU Cycle Time <= DRAM Write Cycle Time
°Solution for write buffer saturation:
-Use a write back cache
-Install a second level (L2) cache (does this always work?)
[Figure: Processor, Cache, Write Buffer, DRAM; in the second version an L2 Cache sits between the Write Buffer and DRAM]

CIS429/529 Cache Basics 25
Write Miss Design Options
°Write allocate (“fetch on write”): block is loaded on a write miss, followed by the write.
°No-write allocate (“write around”): block is modified in the lower level, not loaded into the cache.
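The two write-miss options differ in what a later access to the same block sees. A minimal sketch, assuming a direct mapped cache that tracks only which block each frame holds; the function name and trace format are illustrative.

```python
def run_trace(trace, write_allocate, num_frames=32):
    """Return a list of "hit"/"miss" results for a trace of (op, block) pairs.

    op is "R" or "W". Read misses always load the block; write misses load
    it only under write allocate.
    """
    frames = {}  # frame index -> resident block number
    results = []
    for op, block in trace:
        idx = block % num_frames
        hit = frames.get(idx) == block
        results.append("hit" if hit else "miss")
        if not hit and (op == "R" or write_allocate):
            frames[idx] = block  # load the block into the cache
    return results


trace = [("W", 0), ("R", 0)]  # write miss to block 0, then a read of the same block
print(run_trace(trace, write_allocate=True))
print(run_trace(trace, write_allocate=False))
```

Under write allocate the read that follows the write miss hits; under no-write allocate the block was never loaded, so the read misses too.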

CIS429/529 Cache Basics 26
Write-miss Policy: Write Allocate versus Not Allocate
°Assume a 16-bit write to memory location 0x0 causes a miss. Do we read in the block?
-Yes: Write Allocate
-No: Write Not Allocate
[Figure: direct mapped cache with 32 B blocks, Byte 0 to Byte 1023; the address splits into Cache Tag (example: 0x00), Cache Index (ex: 0x00), and Byte Select (ex: 0x00); the indexed entry currently holds a Valid Bit and Cache Tag 0x50]

CIS429/529 Cache Basics 27
Impact on Cycle Time
[Figure: pipeline datapath with PC, I-Cache, IR, D-Cache, and pipeline registers IRex, IRm, IRwb; a miss marks the instruction invalid]
°Cache Hit Time:
-directly tied to clock rate
-increases with cache size
-increases with associativity
°Example: direct mapped allows miss signal after data
°Average Memory Access Time (AMAT) = Hit Time + Miss Rate x Miss Penalty
°Compute Time = IC x CT x (ideal CPI + memory stalls)
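The two formulas on this slide translate directly into code. The input numbers below are hypothetical and chosen only to make the arithmetic easy to follow; memory stalls are taken as stall cycles per instruction.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average Memory Access Time = Hit Time + Miss Rate x Miss Penalty."""
    return hit_time + miss_rate * miss_penalty


def compute_time(ic, ct, ideal_cpi, memory_stalls):
    """Compute Time = IC x CT x (ideal CPI + memory stall cycles per instruction)."""
    return ic * ct * (ideal_cpi + memory_stalls)


# Hypothetical cache: 1-cycle hit, 5% miss rate, 50-cycle miss penalty.
print(amat(hit_time=1, miss_rate=0.05, miss_penalty=50))  # in cycles

# Hypothetical program: 1e9 instructions, 1 ns cycle, ideal CPI 1.0, 0.5 stall cycles each.
print(compute_time(ic=1e9, ct=1e-9, ideal_cpi=1.0, memory_stalls=0.5))  # in seconds
```

Because Hit Time is added on every access while Miss Penalty is weighted by the miss rate, a larger or more associative cache helps only if the drop in miss rate outweighs the longer hit time.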

CIS429/529 Cache Basics 28
What happens on a Cache miss?
°For in-order pipeline, 2 options:
-Freeze pipeline in Mem stage (popular early on: Sparc, R4000)
    IF ID EX Mem stall stall stall … stall Mem Wr
       IF ID EX stall stall stall … stall stall Ex Wr
-Use Full/Empty bits in registers + MSHR queue
  -MSHR = “Miss Status/Handler Registers” (Kroft). Each entry in this queue keeps track of status of outstanding memory requests to one complete memory line.
    –Per cache-line: keep info about memory address.
    –For each word: register (if any) that is waiting for result.
    –Used to “merge” multiple requests to one memory line
  -New load creates MSHR entry and sets destination register to “Empty”. Load is “released” from pipeline.
  -Attempt to use register before result returns causes instruction to block in decode stage.
  -Limited “out-of-order” execution with respect to loads. Popular with in-order superscalar architectures.
°Out-of-order pipelines already have this functionality built in… (load queues, etc).
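The MSHR bookkeeping described above can be sketched as a small data structure. This is a simplified software model with illustrative names, not a description of any particular processor: it tracks which registers are “Empty” and merges loads to a line that is already outstanding.

```python
class MSHRQueue:
    """Toy model of an MSHR queue plus Full/Empty bits on registers."""

    def __init__(self, line_bytes=32):
        self.line_bytes = line_bytes
        self.entries = {}        # line number -> list of (byte offset, destination register)
        self.empty_regs = set()  # registers whose Full/Empty bit is "Empty"

    def load_miss(self, addr, dest_reg):
        """Record a missing load. Returns True if merged into an existing entry."""
        line, offset = divmod(addr, self.line_bytes)
        merged = line in self.entries
        self.entries.setdefault(line, []).append((offset, dest_reg))
        self.empty_regs.add(dest_reg)  # destination is "Empty" until data returns
        return merged

    def can_issue(self, source_regs):
        """An instruction blocks in decode if any source register is "Empty"."""
        return self.empty_regs.isdisjoint(source_regs)

    def line_returned(self, line):
        """Memory delivered a line: fill every waiting register."""
        for _offset, reg in self.entries.pop(line, []):
            self.empty_regs.discard(reg)


mshr = MSHRQueue()
print(mshr.load_miss(0x100, "r1"))  # new entry for the line holding 0x100
print(mshr.load_miss(0x104, "r2"))  # same line: merged, no second memory request
print(mshr.can_issue(["r1", "r3"])) # blocked: r1 is still "Empty"
mshr.line_returned(0x100 // 32)
print(mshr.can_issue(["r1", "r2"]))
```

Instructions that do not read an “Empty” register keep flowing while the miss is outstanding, which gives the limited out-of-order behavior with respect to loads.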