CPE 631 Lecture 05: Cache Design

Electrical and Computer Engineering, University of Alabama in Huntsville

Outline
- Review: the ABC of caches
- Cache performance

Processor-DRAM Latency Gap
[Figure: performance vs. time. Processor performance doubles every 1.5 years; memory performance doubles only every 10 years; the processor-memory performance gap grows about 50% per year.]
- 1980: no cache in a microprocessor; 1989: first Intel microprocessor with an on-chip cache; 1995: two-level cache on chip

Generations of Microprocessors
Time of a full cache miss, measured in instructions executed:
- 1st Alpha: 340 ns / 5.0 ns = 68 clocks x 2 instructions/clock = 136 instructions
- 2nd Alpha: 266 ns / 3.3 ns = 80 clocks x 4 = 320 instructions
- 3rd Alpha: 180 ns / 1.7 ns = 108 clocks x 6 = 648 instructions
- Net effect: 1/2X the latency, but 3X the clock rate and 3X the instructions per clock, so a miss costs roughly 5X as many instructions

What Is a Cache?
- Small, fast storage used to improve the average access time to slow memory; exploits spatial and temporal locality
- In computer architecture, almost everything is a cache!
  - Registers: a cache on variables, managed by software
  - First-level cache: a cache on the second-level cache
  - Second-level cache: a cache on memory
  - Memory: a cache on disk (virtual memory)
  - TLB: a cache on the page table
  - Branch prediction: a cache on prediction information
[Figure: memory hierarchy pyramid: processor/registers, L1 cache, L2 cache, memory, disk/tape; levels get bigger going down and faster going up.]

Review: 4 Questions for Memory Hierarchy Designers
- Q1: Where can a block be placed in the upper level? (block placement)
  - Direct mapped, fully associative, set associative
- Q2: How is a block found if it is in the upper level? (block identification)
- Q3: Which block should be replaced on a miss? (block replacement)
  - Random, LRU (least recently used)
- Q4: What happens on a write? (write strategy)
  - Write through vs. write back
  - Write allocate vs. no-write allocate

Q1: Where Can a Block Be Placed in the Upper Level?
Example: memory block 12 placed in an 8-block cache. For set-associative caches, set = block number modulo number of sets.
- Direct mapped: slot (12 mod 8) = 4, the only choice
- 2-way set associative: set (12 mod 4) = 0, in either way of that set
- Fully associative: any of the 8 cache blocks
[Figure: the 8-block cache under each policy, above a 32-block memory.]
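These placements are just modular arithmetic; a minimal sketch (not from the slides) reproducing the example above:

```c
#include <stdio.h>

int main(void) {
    int block = 12;                 /* memory block number (slide example) */
    int cache_blocks = 8;

    /* Direct mapped: exactly one slot, block mod #blocks. */
    printf("direct mapped -> block %d\n", block % cache_blocks);

    /* 2-way set associative: one of 2 ways in set (block mod #sets). */
    int sets = cache_blocks / 2;
    printf("2-way assoc   -> set %d\n", block % sets);

    /* Fully associative: any of the 8 blocks will do. */
    printf("fully assoc   -> any of %d blocks\n", cache_blocks);
    return 0;
}
```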

Q2: How Is a Block Found If It Is in the Upper Level?
- A tag on each block identifies it; there is no need to check the index or block offset bits
- The address splits into the block address (tag + index) and the block offset
- Increasing associativity shrinks the index and expands the tag

Fully Associative Cache
- Example: 8 KB cache with 4-word blocks (W = 32 bits), i.e., 16-byte blocks => 512 blocks
- Address split: 28-bit cache tag (bits 31-4) and 4-bit byte offset (bits 3-0); there is no index field
- Each block has a valid bit and a tag; on an access, all 512 tags are compared in parallel
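In hardware all 512 comparisons happen at once; a software model of the same lookup has to loop. A hedged sketch, with the tag/offset split taken from the slide and the addresses in main hypothetical:

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define NUM_BLOCKS 512              /* 8 KB / 16-byte blocks */

struct fa_line { bool valid; uint32_t tag; uint8_t data[16]; };
static struct fa_line cache[NUM_BLOCKS];

/* Hardware compares all 512 tags at once; software models the
   "parallel" compare with a loop over every line. Tag = bits 31..4. */
static bool fa_lookup(uint32_t addr) {
    uint32_t tag = addr >> 4;       /* strip the 4-bit byte offset */
    for (int i = 0; i < NUM_BLOCKS; i++)
        if (cache[i].valid && cache[i].tag == tag)
            return true;            /* hit */
    return false;                   /* miss: any line may be the victim */
}

int main(void) {
    cache[7] = (struct fa_line){ .valid = true, .tag = 0x1234567 };
    printf("%d %d\n", fa_lookup(0x12345670u), fa_lookup(0xdeadbeefu)); /* 1 0 */
    return 0;
}
```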

1 KB Direct Mapped Cache, 32 B Blocks
- For a 2^N-byte cache with 2^M-byte blocks:
  - The uppermost (32 - N) bits are the cache tag, stored as part of the cache "state" alongside the valid bit
  - The lowest M bits are the byte select
  - The (N - M) bits in between are the cache index
- Example from the slide: cache tag 0x50, cache index 0x01, byte select 0x00
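A small sketch of the field extraction, assuming the slide's geometry (N = 10, M = 5); the address in main is hypothetical, chosen so the fields come out to the slide's example values:

```c
#include <stdio.h>
#include <stdint.h>

#define OFFSET_BITS 5                 /* 32-byte block (M = 5)       */
#define INDEX_BITS  5                 /* 1 KB / 32 B = 32 lines      */

int main(void) {
    uint32_t addr   = 0x00014020u;    /* hypothetical address        */
    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);
    /* Prints tag=0x50 index=0x1 offset=0x0, matching the slide. */
    printf("tag=0x%x index=0x%x offset=0x%x\n", tag, index, offset);
    return 0;
}
```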

Two-Way Set Associative Cache
- N-way set associative: N entries for each cache index, i.e., N direct-mapped caches operating in parallel (N is typically 2 to 4)
- Example: two-way set associative cache
  - The cache index selects a set
  - The two tags in the set are compared in parallel
  - Data is selected by a mux based on the tag comparison result (the OR of the per-way hits is the overall hit signal)
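A hedged sketch of the lookup; the geometry (16 sets of 2 ways, 32-byte blocks) is an assumption for illustration, not taken from the slide:

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define WAYS        2
#define SETS        16              /* hypothetical geometry */
#define OFFSET_BITS 5               /* 32-byte blocks        */
#define INDEX_BITS  4               /* log2(SETS)            */

struct line { bool valid; uint32_t tag; };
static struct line cache[SETS][WAYS];

/* The index picks a set; both ways' tags are compared (in parallel in
   hardware) and the hit signal selects the matching block via a mux. */
static bool sa_lookup(uint32_t addr, int *way) {
    uint32_t set = (addr >> OFFSET_BITS) & (SETS - 1);
    uint32_t tag = addr >> (OFFSET_BITS + INDEX_BITS);
    for (int w = 0; w < WAYS; w++)
        if (cache[set][w].valid && cache[set][w].tag == tag) {
            *way = w;
            return true;
        }
    return false;
}

int main(void) {
    uint32_t addr = 0x00014020u;            /* maps to set 1 here */
    cache[1][1] = (struct line){ true, addr >> (OFFSET_BITS + INDEX_BITS) };
    int way = -1;
    bool hit = sa_lookup(addr, &way);
    printf("hit=%d way=%d\n", hit, way);    /* hit=1 way=1 */
    return 0;
}
```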

Disadvantage of Set Associative Caches
- N-way set associative vs. direct mapped:
  - N comparators vs. 1
  - Extra mux delay for the data
  - Data arrives AFTER the hit/miss decision
- In a direct-mapped cache, the cache block is available BEFORE hit/miss is known: the processor can assume a hit and continue, recovering later on a miss

Q3: Which Block Should Be Replaced on a Miss?
- Easy for direct mapped: there is only one candidate
- Set associative or fully associative:
  - Random
  - LRU (least recently used), pseudo-LRU
  - FIFO (round-robin)
- Miss rates, LRU vs. random replacement:

  Assoc:    2-way           4-way           8-way
  Size      LRU     Random  LRU     Random  LRU     Random
  16 KB     5.2%    5.7%    4.7%    5.3%    4.4%    5.0%
  64 KB     1.9%    2.0%    1.5%    1.7%    1.4%    1.5%
  256 KB    1.15%   1.17%   1.13%   1.13%   1.12%   1.12%
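One straightforward way to realize LRU for a single set is a per-way timestamp, as in this sketch; real designs often use the cheaper pseudo-LRU bits mentioned above:

```c
#include <stdio.h>
#include <stdint.h>

#define WAYS 4

/* On each access the touched way gets the newest timestamp; on a miss
   the way with the smallest timestamp is the LRU victim. */
static uint64_t last_used[WAYS];
static uint64_t now;

static void touch(int way) { last_used[way] = ++now; }

static int lru_victim(void) {
    int victim = 0;
    for (int w = 1; w < WAYS; w++)
        if (last_used[w] < last_used[victim])
            victim = w;
    return victim;
}

int main(void) {
    touch(0); touch(1); touch(2); touch(3);
    touch(1);                               /* way 1 becomes most recent */
    printf("victim = %d\n", lru_victim());  /* way 0 is now LRU          */
    return 0;
}
```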

Q4: What Happens on a Write?
- Write through: the information is written both to the block in the cache and to the block in the lower-level memory
- Write back: the information is written only to the block in the cache; the modified cache block is written to main memory only when it is replaced (requires a bit that says whether the block is clean or dirty)
- Pros and cons of each:
  - WT: read misses cannot result in writes
  - WB: no repeated writes to the same location
- WT is always combined with write buffers so that the processor doesn't wait for the lower-level memory

Write Stalls in Write-Through Caches
- When the CPU must wait for writes to complete during write through, the CPU is said to write stall
- Common optimization: a write buffer, which lets the processor continue as soon as the data is written to the buffer, overlapping processor execution with memory updating
- However, write stalls can still occur even with a write buffer (when the buffer is full)

Write Buffer for Write Through
[Figure: Processor <-> Cache, with a Write Buffer between them and DRAM.]
- A write buffer is needed between the cache and memory
  - Processor: writes data into the cache and the write buffer
  - Memory controller: writes the contents of the buffer to memory
- The write buffer is just a FIFO; a typical number of entries is 4
- Works fine if: store frequency (w.r.t. time) << 1 / DRAM write cycle
- Memory system designer's nightmare: store frequency (w.r.t. time) approaches 1 / DRAM write cycle, and the write buffer saturates
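A minimal sketch of such a FIFO; the 4-entry depth follows the slide, while the store addresses in main are hypothetical:

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define WB_ENTRIES 4                 /* typical depth, per the slide */

struct wb_entry { uint32_t addr; uint32_t data; };
static struct wb_entry buf[WB_ENTRIES];
static int head, tail, count;

/* Processor side: returns false (a write stall) only when the FIFO is full. */
static bool wb_enqueue(uint32_t addr, uint32_t data) {
    if (count == WB_ENTRIES) return false;       /* buffer saturated */
    buf[tail] = (struct wb_entry){ addr, data };
    tail = (tail + 1) % WB_ENTRIES;
    count++;
    return true;                                 /* CPU continues */
}

/* Memory-controller side: drains one entry per DRAM write cycle. */
static bool wb_dequeue(struct wb_entry *e) {
    if (count == 0) return false;
    *e = buf[head];
    head = (head + 1) % WB_ENTRIES;
    count--;
    return true;
}

int main(void) {
    for (uint32_t i = 0; i < 5; i++)             /* the 5th store stalls */
        if (!wb_enqueue(0x1000 + 4 * i, i))
            printf("store %u stalls: buffer full\n", i);
    struct wb_entry e;
    while (wb_dequeue(&e))
        printf("drain addr=0x%x data=%u\n", e.addr, e.data);
    return 0;
}
```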

What to Do on a Write Miss?
- Write allocate (or fetch on write): the block is loaded into the cache on a write miss, followed by the write-hit actions
- No-write allocate (or write around): the block is modified in memory and not loaded into the cache
- Although either write-miss policy can be used with write through or write back, write-back caches generally use write allocate, and write-through caches often use no-write allocate
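The two common policy pairings can be sketched as follows; the helper functions are toy stand-ins (assumptions, not any real simulator's API):

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

/* Toy stand-ins so the sketch compiles; a real simulator would
   implement these against actual cache state. */
static bool lookup(uint32_t a)                  { (void)a; return false; }
static void fill(uint32_t a)                    { printf("fill 0x%x\n", a); }
static void cache_write(uint32_t a, uint32_t v) { (void)a; (void)v; }
static void mem_write(uint32_t a, uint32_t v)   { (void)a; (void)v; }
static void mark_dirty(uint32_t a)              { (void)a; }

/* Write back + write allocate: the usual pairing. */
static void wb_store(uint32_t addr, uint32_t v) {
    if (!lookup(addr)) fill(addr);  /* allocate the block on a miss    */
    cache_write(addr, v);
    mark_dirty(addr);               /* memory updated only at eviction */
}

/* Write through + no-write allocate: the other usual pairing. */
static void wt_store(uint32_t addr, uint32_t v) {
    if (lookup(addr)) cache_write(addr, v);
    mem_write(addr, v);             /* lower level is always written   */
}

int main(void) {
    wb_store(0x100, 1);  /* miss: allocate, write cache, mark dirty */
    wt_store(0x200, 2);  /* miss: write around, straight to memory  */
    return 0;
}
```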

An Example: The Alpha 21264 Data Cache
- 64 KB, 64-byte blocks, 2-way set associative
- 44-bit physical address, divided into a 29-bit tag, a 9-bit index (512 sets), and a 6-bit offset
- Each line holds valid<1>, tag<29>, and data<512> fields
[Figure: the index selects a set; the two ways' tag comparators (=?) feed a 2:1 mux, an 8:1 mux selects the word within the block, and a write buffer sits between the cache and lower-level memory.]

Cache Performance
- Hit time: time to find and retrieve data from the current-level cache
- Miss penalty: average time to retrieve data on a current-level miss (includes the possibility of misses at successive levels of the memory hierarchy)
- Hit rate: percentage of requests that are found in the current-level cache
- Miss rate = 1 - hit rate

Cache Performance (cont'd)
Average memory access time:
AMAT = Hit Time + Miss Rate x Miss Penalty
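As a quick sanity check, the formula as a one-line helper; the parameter values are hypothetical:

```c
#include <stdio.h>

/* AMAT = hit time + miss rate * miss penalty (all in clock cycles). */
static double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}

int main(void) {
    /* Hypothetical numbers: 1-cycle hit, 5% miss rate, 100-cycle penalty. */
    printf("AMAT = %.2f cycles\n", amat(1.0, 0.05, 100.0));   /* 6.00 */
    return 0;
}
```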

An Example: Unified vs. Separate I&D
Compare two design alternatives (ignore L2 caches):
- 16 KB split I&D: instruction misses = 3.82 per 1K instructions, data misses = 40.9 per 1K instructions
- 32 KB unified: 43.3 misses per 1K instructions
Assumptions:
- Load/store frequency is 36% => 74% of accesses come from instruction fetches (1.0/1.36)
- Hit time = 1 clock cycle, miss penalty = 100 clock cycles
- A data hit incurs 1 stall cycle in the unified cache (it has only one port)

Unified vs. Separate I&D (cont'd)
- Miss rate (L1 I) = (# L1 I misses) / IC, where # L1 I misses = (L1 I misses per 1K) x (IC / 1000)
  Miss rate (L1 I) = 3.82 / 1000 = 0.0038
- Miss rate (L1 D) = (# L1 D misses) / (# memory references), where # L1 D misses = (L1 D misses per 1K) x (IC / 1000)
  Miss rate (L1 D) = 40.9 x (IC / 1000) / (0.36 x IC) = 0.1136
- Miss rate (L1 U) = (# L1 U misses) / (IC + # memory references), where # L1 U misses = (L1 U misses per 1K) x (IC / 1000)
  Miss rate (L1 U) = 43.3 x (IC / 1000) / (1.36 x IC) = 0.0318

Unified vs. Separate I&D (cont'd)
AMAT (split) = %instructions x (hit time + L1 I miss rate x miss penalty)
             + %data x (hit time + L1 D miss rate x miss penalty)
             = 0.74 x (1 + 0.0038 x 100) + 0.26 x (1 + 0.1136 x 100)
             = 4.2348 clock cycles
AMAT (unified) = %instructions x (hit time + L1 U miss rate x miss penalty)
               + %data x (1 + hit time + L1 U miss rate x miss penalty)
               = 0.74 x (1 + 0.0318 x 100) + 0.26 x (1 + 1 + 0.0318 x 100)
               = 4.44 clock cycles
Despite its smaller total capacity, the split organization wins: the unified cache's single port adds a stall cycle to every data access.
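The arithmetic above is easy to mistype; this small check program, reusing the slide's numbers, reproduces both results:

```c
#include <stdio.h>

int main(void) {
    double instr = 0.74, data = 0.26;   /* 1.0/1.36 and 0.36/1.36 */
    double penalty = 100.0;

    /* Split caches: 1-cycle hit on both sides. */
    double split = instr * (1 + 0.0038 * penalty)
                 + data  * (1 + 0.1136 * penalty);

    /* Unified cache: data accesses stall 1 extra cycle (one port). */
    double unified = instr * (1 + 0.0318 * penalty)
                   + data  * (1 + 1 + 0.0318 * penalty);

    printf("AMAT split   = %.4f cycles\n", split);    /* 4.2348 */
    printf("AMAT unified = %.4f cycles\n", unified);  /* 4.4400 */
    return 0;
}
```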

AMAT and Processor Performance
Miss-oriented approach to memory access:
CPU time = IC x (CPI_Execution + (Memory accesses / Instruction) x Miss rate x Miss penalty) x Clock cycle time
CPI_Execution includes ALU and memory instructions

AMAT and Processor Performance (cont'd)
Separating out the memory component entirely:
CPU time = IC x ((ALU ops / Instruction) x CPI_ALUOps + (Memory accesses / Instruction) x AMAT) x Clock cycle time
AMAT = average memory access time; CPI_ALUOps does not include memory instructions
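Plugging numbers into the miss-oriented form; the 0.0318 miss rate is reused from the unified-cache example, and every other value is an assumption for illustration:

```c
#include <stdio.h>

int main(void) {
    /* Hypothetical workload: all values are assumptions. */
    double ic           = 1e9;     /* instruction count                */
    double cpi_exec     = 1.1;     /* CPI with a perfect memory system */
    double mem_per_inst = 1.36;    /* 1 fetch + 0.36 loads/stores      */
    double miss_rate    = 0.0318;  /* unified L1, from the example     */
    double miss_penalty = 100.0;   /* clock cycles                     */
    double cycle_time   = 1e-9;    /* 1 GHz clock                      */

    double stall_cpi = mem_per_inst * miss_rate * miss_penalty;
    double cpu_time  = ic * (cpi_exec + stall_cpi) * cycle_time;
    printf("memory stall CPI = %.2f\n", stall_cpi);   /* 4.32    */
    printf("CPU time = %.3f s\n", cpu_time);          /* 5.425 s */
    return 0;
}
```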

Summary: Caches
- The principle of locality: programs access a relatively small portion of the address space at any instant of time
  - Temporal locality: locality in time
  - Spatial locality: locality in space
- Three major categories of cache misses:
  - Compulsory misses: sad facts of life (e.g., cold-start misses)
  - Capacity misses: reduced by increasing cache size
  - Conflict misses: reduced by increasing cache size and/or associativity
- Write policy:
  - Write through: needs a write buffer
  - Write back: control can be complex
- Today CPU time is a function of (ops, cache misses), not just f(ops): what does this mean for compilers, data structures, and algorithms?

Summary: The Cache Design Space
- Several interacting dimensions:
  - Cache size
  - Block size
  - Associativity
  - Replacement policy
  - Write through vs. write back
- The optimal choice is a compromise:
  - It depends on access characteristics: workload, and use (I-cache, D-cache, TLB)
  - It depends on technology / cost
- Simplicity often wins
[Figure: generic design-space trade-off curve, Factor A vs. Factor B, with axes Less -> More and Bad -> Good.]