COSC3330 Computer Architecture
Lecture 18: Cache
Instructor: Weidong Shi (Larry), PhD
Computer Science Department, University of Houston

Topics: Cache

CPU vs Memory Performance
- Processor performance improves ~55%/year (2x every 1.5 years), following Moore's Law.
- DRAM performance improves only ~7%/year (2x every 10 years).
- The processor-memory performance gap therefore grows ~50%/year.

Memory Wall
- The CPU vs DRAM speed disparity continues to grow, so processor performance is increasingly limited by memory (the "memory wall").
- Good memory hierarchy design is important to the overall performance of a computer system.
- [Slide figure: clocks per instruction vs clocks per DRAM access, diverging over time.]

Memory technology   Typical access time          $ per GB in 2008
SRAM                0.5 - 2.5 ns                 $2,000 - $5,000
DRAM                50 - 70 ns                   $20 - $70
Magnetic disk       5,000,000 - 20,000,000 ns    $0.20 - $2

Performance
- Consider a basic 5-stage pipeline design.
- The CPU runs at 1 GHz (1 ns cycle time); main memory (DRAM) takes 100 ns to access.
- [Slide figure: each instruction fetch stalls waiting on DRAM before it can decode and execute.]
- The CPU effectively executes 1 instruction per 100 clock cycles.
- The problem is getting worse as CPU speed grows much faster than DRAM speed.
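
As a quick sanity check of the 1-instruction-per-100-cycles claim, here is a minimal C sketch (the 1 GHz clock and 100 ns DRAM latency come from the slide; the program and its names are only illustrative):

#include <stdio.h>

int main(void) {
    double cycle_time_ns = 1.0;    /* 1 GHz CPU -> 1 ns per cycle */
    double mem_access_ns = 100.0;  /* main-memory (DRAM) latency  */

    /* Every instruction must at least be fetched from memory, so
       without a cache each instruction costs ~100 clock cycles. */
    double cycles_per_access = mem_access_ns / cycle_time_ns;
    printf("Effective CPI without a cache: ~%.0f cycles/instruction\n",
           cycles_per_access);
    return 0;
}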

Typical Solution
- A cache is used to reduce the large speed gap between the CPU and DRAM.
- The cache is an ultra-fast, small, SRAM-based memory inside the processor.
- Frequently used instructions and data from main memory are placed in the cache by hardware.
- The L1 cache is typically accessed in 1 CPU cycle, so in the ideal case the CPU can execute 1 instruction per clock cycle.

SRAM vs DRAM
SRAM:
- A bit is stored on a pair of inverting gates.
- Very fast, but takes up more space than DRAM (4 to 6 transistors per bit).
- Used to build caches.
DRAM:
- A bit is stored as charge on a capacitor, which must be periodically refreshed.
- Very small, but slower than SRAM (by a factor of 5 to 10).
- Used for main memory (DDR SDRAM).
[Slide figures: SRAM and DRAM cell diagrams; the access-time/cost table repeats the one on the Memory Wall slide.]

A Computer System
- Caches are located inside the processor.
- The processor connects over the FSB (Front-Side Bus) to the North Bridge, which links main memory (DDR2) and the graphics card.
- The South Bridge, attached via DMI (Direct Media Interface), connects the hard disk, USB, and PCIe cards.

Core i7 (Intel)
- 4 cores on one chip
- Three levels of caches (L1, L2, L3) on chip: L1: 32KB, 8-way; L2: 256KB, 8-way; L3: 8MB, 16-way
- 731 million transistors in 263 mm^2 with 45nm technology

Opteron (AMD) - Barcelona
- 4 cores on one chip
- Three levels of caches (L1, L2, L3) on chip: L1: 64KB; L2: 512KB; L3: 2MB
- Integrated North Bridge

Memory Hierarchy

A Typical Memory Hierarchy
- Take advantage of the principle of locality to present the user with as much memory as is available in the cheapest technology, at the speed offered by the fastest technology.
- From the highest level (closest to the CPU core) to the lowest: register file; on-chip L1 instruction and data caches (with ITLB/DTLB); on-chip L2 cache; main memory (DRAM); secondary storage (disk).

Level                Speed (cycles)   Size (bytes)
Register file        1/2's            100's
L1 caches            1's              10K's
L2 cache             10's             M's
Main memory (DRAM)   100's            G's
Disk                 10,000's         T's

Cost per byte is highest at the top of the hierarchy and lowest at the bottom.

Characteristics of Memory Hierarchy
Access time increases, and the (relative) size of each level grows, with distance from the processor. Typical transfer units between levels:
- Processor <-> L1$: 4-8 bytes (a word)
- L1$ <-> L2$: 8-32 bytes (a block)
- L2$ <-> Main memory: 1 to 4 blocks
- Main memory <-> Secondary memory (HDD): 1,024+ bytes (disk sector = page)

Why Do Caches Work?
- The cache is tiny compared to main memory. How can we make sure the data the CPU is going to access is in the cache?
- Caches take advantage of the principle of locality in your program:
- Temporal locality (locality in time): if a memory location is referenced, it will tend to be referenced again soon. So, keep the most recently accessed data items closer to the processor.
- Spatial locality (locality in space): if a memory location is referenced, locations with nearby addresses will tend to be referenced soon. So, move blocks consisting of contiguous words closer to the processor.

Example of Locality

int A[100], B[100], C[100], D;
int i;
for (i = 0; i < 100; i++) {
    C[i] = A[i] * B[i] + D;
}

- A, B, and C are accessed sequentially, so neighboring elements share a cache line (block): spatial locality.
- D is reused on every iteration: temporal locality.
- [Slide diagram: cache lines each holding four contiguous elements, e.g., A[0]-A[3] in one cache line.]

How is the Hierarchy Managed?
Who manages data transfer between levels?
- Main memory <-> Disks: by the operating system (virtual memory) and by the programmer (files)
- Registers <-> Main memory: by the compiler (and the programmer)
- Cache <-> Main memory: by hardware (the cache controller)

Cache Terminology
- Block (or (cache) line): the minimum unit of information present in a cache. For example, 64B per block in the Core 2 Duo.
- Hit: the requested data is in the cache.
- Hit rate: the fraction of memory accesses found in the cache. For example, if the cache was able to supply data for 90% of the CPU's memory requests, the hit rate is 90%.
- Hit time: the time required to access data found in the cache.
- Miss: the requested data is not in the cache.
- Miss rate: the fraction of memory accesses not found in the cache (= 1 - hit rate).
- Miss penalty: the time required to fetch a block into a level of the memory hierarchy from the lower level.
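
These terms combine into the standard average memory access time (AMAT) formula: AMAT = hit time + miss rate x miss penalty. A minimal sketch using the 90% hit rate from the example above, and assuming, for illustration only, a 1-cycle hit time and a 100-cycle miss penalty:

#include <stdio.h>

int main(void) {
    double hit_time     = 1.0;    /* cycles for an L1 hit (assumed)      */
    double miss_rate    = 0.10;   /* 90% hit rate from the example above */
    double miss_penalty = 100.0;  /* cycles to fetch from DRAM (assumed) */

    /* Average memory access time = hit time + miss rate * miss penalty */
    double amat = hit_time + miss_rate * miss_penalty;
    printf("AMAT = %.1f cycles\n", amat);  /* 1 + 0.1 * 100 = 11 cycles */
    return 0;
}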

Cache Basics
Two issues in cache design:
- Q1: How do we know if the data is in the cache?
- Q2: If it is, how do we find it?
Our first cache structure: the direct-mapped cache.
- Each memory block is mapped to exactly one block in the cache, so lots of lower-level blocks must share a block in the cache.
- Address mapping to the cache (answers Q2): (block address) modulo (# of blocks in cache).
- Each cache block has an associated tag containing the address information (the upper portion of the address) required to identify the block (answers Q1).

Direct Mapped Cache
- Mapping to cache: (block address) modulo (# of blocks in the cache)
- Cache structure, per entry:
  - Data: the actual data
  - Tag: which memory block is mapped to this cache entry?
  - Valid: is the block in this entry valid?
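
The modulo mapping and the tag can be computed directly from an address. A minimal C sketch, assuming a byte-addressed machine and the 8-entry, 4-byte-block geometry used in the example that follows (function and variable names are mine, not the slide's):

#include <stdio.h>

#define BLOCK_SIZE 4   /* bytes per block     */
#define NUM_BLOCKS 8   /* blocks in the cache */

/* Split a byte address into (tag, index, offset) for a direct-mapped cache. */
static void decompose(unsigned addr,
                      unsigned *tag, unsigned *index, unsigned *offset) {
    unsigned block_addr = addr / BLOCK_SIZE;   /* drop the byte offset        */
    *offset = addr % BLOCK_SIZE;
    *index  = block_addr % NUM_BLOCKS;  /* (block address) mod (# of blocks)  */
    *tag    = block_addr / NUM_BLOCKS;  /* remaining upper address bits       */
}

int main(void) {
    unsigned tag, index, offset;
    decompose(24, &tag, &index, &offset);   /* address 24 from the example below */
    printf("addr 24 -> tag %u, index %u, offset %u\n", tag, index, offset);
    return 0;
}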

Example: 4KB direct-mapped cache with one-word (32-bit) blocks
- How many blocks are there in the cache? 4KB / 4B per block = 1024 blocks.
- The 32-bit address from the CPU splits into a 20-bit tag (bits 31-12), a 10-bit index (bits 11-2), and a 2-bit byte offset (bits 1-0).
- The index selects one of the 1024 entries; the access hits if the entry's valid bit is set and its stored tag matches the address tag, and the 32-bit data word is returned.
- Is this cache structure taking advantage of what kind of locality? (With one-word blocks, only temporal locality.)
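
The field widths follow mechanically from the cache parameters; a small sketch of that arithmetic (variable names are assumptions of mine):

#include <stdio.h>

int main(void) {
    int cache_bytes = 4 * 1024;  /* 4KB cache                 */
    int block_bytes = 4;         /* one 32-bit word per block */
    int addr_bits   = 32;        /* 32-bit byte address       */

    int num_blocks = cache_bytes / block_bytes;               /* 1024          */

    int offset_bits = 0, index_bits = 0;
    for (int b = block_bytes; b > 1; b >>= 1) offset_bits++;  /* log2(4)  = 2  */
    for (int n = num_blocks;  n > 1; n >>= 1) index_bits++;   /* log2(1024)=10 */
    int tag_bits = addr_bits - index_bits - offset_bits;      /* 32-10-2 = 20  */

    printf("blocks=%d offset=%d index=%d tag=%d\n",
           num_blocks, offset_bits, index_bits, tag_bits);
    return 0;
}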

Example: DM$, 8-Entry, 4B Blocks
- Assume the address bus from the CPU is 8 bits wide (256 bytes of memory).
- Access sequence:
  lw $1, 24($0)
  lw $2, 28($0)
  sw $3, 60($0)
  sw $4, 188($0)
- Main memory holds a at address 24, b at 28, c at 60, and d at 188.
- The cache (32B) has 8 entries (index 0-7), each with Valid, Tag, and Data fields; all entries start invalid.

Example: DM$, 8-Entry, 4B Blocks
- lw $1, 24($0): address 24 is 0001 1000.
- With a 4-byte block, drop the low 2 bits as the byte offset (this only matters for byte-addressable systems).
- The index takes log2(8) = 3 bits: the index is 110 (entry 6) and the tag is 000.
- Entry 6 is invalid: cache miss!

Example: DM$, 8-Entry, 4B Blocks
- The block containing address 24 is fetched from memory: entry 6 is set to valid=1, tag=000, data=a, and a is supplied to the CPU.

Example: DM$, 8-Entry, 4B Blocks
- lw $2, 28($0): address 28 is 0001 1100, so the index is 111 (entry 7) and the tag is 000.
- Entry 7 is invalid: cache miss! The block is fetched (entry 7: valid=1, tag=000, data=b) and b is supplied to the CPU.

Example: DM$, 8-Entry, 4B Blocks
- sw $3, 60($0): address 60 is 0011 1100, so the index is 111 (entry 7) and the tag is 001.
- Entry 7 is valid. Is it a hit or a miss?

Example: DM$, 8-Entry, 4B Blocks
- The tags don't match (001 vs the stored 000): this is not the block we want to access. Cache miss!

Example: DM$, 8-Entry, 4B Blocks
- Do we have to bring the block into the cache and write to the cache? Let's assume we do (this is called the write-allocate policy).
- The block containing address 60 is fetched: entry 7 becomes valid=1, tag=001, data=c.

Example: DM$, 8-Entry, 4B Blocks
- Now we can write the new value to the location: entry 7's data becomes the new value ($3).
- Do we update memory now, or later? Assume later (this is called a write-back cache).

Example: DM$, 8-Entry, 4B Blocks
- How do we know which blocks in the cache need to be written back to main memory? We need extra state: the "dirty" bit!
- Entry 7's dirty bit is set; main memory still holds the old c at address 60.

Example: DM$, 8-Entry, 4B Blocks
- sw $4, 188($0): address 188 is 1011 1100, so the index is 111 (entry 7) and the tag is 101.
- The stored tag is 001: cache miss!

Example: DM$, 8-Entry, 4B Blocks
- Entry 7's dirty bit is set, so we first need to write its block (the new value at address 60) back to memory.

Example: DM$, 8-Entry, 4B Blocks
- Now we can bring the block containing address 188 into the cache: entry 7 becomes valid=1, tag=101, data=d.

Example: DM$, 8-Entry, 4B Blocks
- Now we can write the new value to the location: entry 7's data becomes the new value ($4) and the dirty bit is set again; memory still holds the old d at address 188.
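
The whole walkthrough can be replayed with a tiny simulator. Below is a minimal sketch of an 8-entry, 4-byte-block, write-allocate, write-back direct-mapped cache; it tracks only hit/miss/write-back events rather than actual data values, and all names are mine:

#include <stdio.h>
#include <string.h>

#define NUM_BLOCKS 8
#define BLOCK_SIZE 4

struct line { int valid, dirty; unsigned tag; };
static struct line cache[NUM_BLOCKS];

/* One access to a write-allocate, write-back direct-mapped cache. */
static void cache_access(unsigned addr, int is_write) {
    unsigned block = addr / BLOCK_SIZE;
    unsigned index = block % NUM_BLOCKS;
    unsigned tag   = block / NUM_BLOCKS;
    struct line *l = &cache[index];

    if (l->valid && l->tag == tag) {
        printf("addr %3u: hit  (entry %u)\n", addr, index);
    } else {
        if (l->valid && l->dirty)      /* write the dirty victim back first */
            printf("addr %3u: miss (entry %u), write back old block\n",
                   addr, index);
        else
            printf("addr %3u: miss (entry %u)\n", addr, index);
        l->valid = 1; l->dirty = 0; l->tag = tag;  /* allocate the new block */
    }
    if (is_write) l->dirty = 1;  /* write-back: mark dirty, defer the memory update */
}

int main(void) {
    memset(cache, 0, sizeof cache);
    cache_access(24, 0);   /* lw $1, 24($0)  -> miss, entry 6              */
    cache_access(28, 0);   /* lw $2, 28($0)  -> miss, entry 7              */
    cache_access(60, 1);   /* sw $3, 60($0)  -> miss, allocate, mark dirty */
    cache_access(188, 1);  /* sw $4, 188($0) -> miss, dirty victim written back */
    return 0;
}

Running it prints a miss for each of the four accesses, with a write-back before the last allocation, matching the slide-by-slide trace above.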

Handling Writes
- On a read miss, bring the block into the cache and supply the data to the CPU.
- On a write miss, there are options you can choose from:
  - Write-allocate: bring the block into the cache and write.
  - Write no-allocate: write directly to main memory without bringing the block into the cache.
- When writing:
  - Write-back: update only the blocks in the cache, and write the modified blocks to the lower level of the hierarchy when they are replaced.
  - Write-through: update both the cache and the lower level of the memory hierarchy.
- Write-allocate is usually associated with the write-back policy; write no-allocate is usually associated with the write-through policy.
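
For contrast with the write-back simulator above, a write-through, write no-allocate cache needs no dirty bit at all: every write updates memory immediately, and write misses do not allocate a line. A hedged sketch under the same hypothetical 8-entry, 4-byte-block geometry (all names are mine):

#include <stdio.h>

#define NUM_BLOCKS 8
#define BLOCK_SIZE 4

struct wt_line { int valid; unsigned tag; };
static struct wt_line wt_cache[NUM_BLOCKS];

/* One access to a write-through, write no-allocate direct-mapped cache. */
static void wt_access(unsigned addr, int is_write) {
    unsigned block = addr / BLOCK_SIZE;
    unsigned index = block % NUM_BLOCKS;
    unsigned tag   = block / NUM_BLOCKS;
    struct wt_line *l = &wt_cache[index];
    int hit = l->valid && l->tag == tag;

    if (is_write) {                /* memory is updated on every write */
        printf("addr %3u: write %s, memory updated\n",
               addr, hit ? "hit (cache updated too)" : "miss (no allocate)");
        return;
    }
    if (hit) {
        printf("addr %3u: read hit (entry %u)\n", addr, index);
    } else {                       /* reads still allocate */
        printf("addr %3u: read miss (entry %u), allocate\n", addr, index);
        l->valid = 1; l->tag = tag;
    }
}

int main(void) {
    wt_access(24, 0);   /* read miss: allocate                 */
    wt_access(60, 1);   /* write miss: goes straight to memory */
    return 0;
}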

Hits vs. Misses
- Read hits: this is what we want!
- Read misses: stall the CPU, fetch the corresponding block from memory, deliver it to the cache and the CPU, and continue running the CPU.
- Write hits:
  - Write-through cache: update the data in both the cache and memory.
  - Write-back cache: write the data only into the cache and set the dirty bit (the block is written back to memory later, when it is replaced).
- Write misses:
  - Write-allocate with write-back: read the entire block into the cache, then write the word only into the cache.
  - Write no-allocate with write-through: write the word only to main memory.

Direct Mapped Cache with 4-Word Blocks
- Four words per block, cache size = 1K words (4KB), so there are 256 entries.
- The 32-bit address splits into a 20-bit tag (bits 31-12), an 8-bit index (bits 11-4), a 2-bit block offset (bits 3-2) that selects the word within the block, and a 2-bit byte offset (bits 1-0).
- The index selects one of the 256 entries; on a tag match with the valid bit set, the block offset picks one of the four 32-bit words to return.
- Is this cache structure taking advantage of what kind of locality? (Multiword blocks exploit spatial locality as well as temporal locality.)
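
Extending the earlier address decomposition to this 4-word-block geometry adds the block-offset field. A sketch (the example address and the names are mine):

#include <stdio.h>

int main(void) {
    unsigned addr = 0x12345678;               /* arbitrary example address     */

    unsigned byte_off  =  addr        & 0x3;  /* bits 1:0                      */
    unsigned block_off = (addr >> 2)  & 0x3;  /* bits 3:2, word within block   */
    unsigned index     = (addr >> 4)  & 0xFF; /* bits 11:4, 256 entries        */
    unsigned tag       =  addr >> 12;         /* bits 31:12, 20-bit tag        */

    printf("tag=0x%05X index=%u word=%u byte=%u\n",
           tag, index, block_off, byte_off);
    return 0;
}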

Miss Rate vs Block Size vs Cache Size
- The miss rate goes up when the block size becomes a significant fraction of the cache size, because the number of blocks that can be held in a cache of the same size becomes smaller.
- Stated alternatively, spatial locality among the words in a block decreases with a very large block; consequently, the miss-rate benefit of larger blocks becomes smaller.