ELEC2041 Microprocessors and Interfacing, Lectures 32: Cache Memory - II
http://webct.edtec.unsw.edu.au/
May 2006, Saeid Nooshabadi (saeid@unsw.edu.au)
Some of the slides are adapted from David Patterson (UCB)

Outline
Direct-Mapped Cache
Types of Cache Misses
A (long) detailed example
Peer-to-peer education example
Block Size Tradeoff

Review: Memory Hierarchy
[Figure: the memory hierarchy. The processor (control + datapath) with its registers sits at the top, followed by the L1 cache (SRAM), L2 cache (SRAM), main memory (DRAM, RAM/ROM, EEPROM), and hard disk. Moving down the hierarchy, speed goes from fastest to slowest, size from smallest to biggest, and cost per byte from highest to lowest.]

Review: Direct-Mapped Cache
In a direct-mapped cache, each memory address is associated with one possible block within the cache. Therefore, we only need to look in a single location in the cache for the data, if it exists in the cache. The block is the unit of transfer between cache and memory.

Review: Direct-Mapped Cache
[Figure: a 16-byte memory (addresses 0-F) mapped onto an 8-byte direct-mapped cache with 1-word blocks, i.e. two cache locations.]
Block size = 4 bytes.
Cache Location 0 can be occupied by data from memory locations 0-3, 8-B, ...; in general, any 4-byte block starting at address 8*n (n = 0, 1, 2, ...).
Cache Location 1 can be occupied by data from memory locations 4-7, C-F, ...; in general, any 4-byte block starting at address 8*n + 4 (n = 0, 1, 2, ...).
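A one-line C sketch of this mapping (an illustration, not code from the slides; the constant names BLOCK_SIZE and CACHE_BLOCKS are introduced here for the example): the cache location is the address's block number modulo the number of blocks in the cache.

```c
#include <stdint.h>

/* Toy cache from this slide: 8-byte direct-mapped cache with 4-byte blocks,
   so there are 2 cache locations. */
#define BLOCK_SIZE   4
#define CACHE_BLOCKS 2

/* Addresses 0-3, 8-B, ... (blocks starting at 8*n)     map to location 0;
   addresses 4-7, C-F, ... (blocks starting at 8*n + 4) map to location 1. */
uint32_t cache_location(uint32_t addr) {
    return (addr / BLOCK_SIZE) % CACHE_BLOCKS;
}
```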

Direct-Mapped Cache with 1-Word Blocks: Example
[Figure: a memory address split into three fields: tag, block index, and byte offset.]

Review: Direct-Mapped Cache Terminology
All fields are read as unsigned integers.
Index: specifies the cache index (which "row" or "line" of the cache we should look in).
Offset: once we've found the correct block, specifies which byte within the block we want.
Tag: the remaining bits after offset and index are determined; these are used to distinguish between all the memory addresses that map to the same location.

Reading Material
Steve Furber: ARM System-on-Chip Architecture; 2nd Ed, Addison-Wesley, 2000, ISBN: 0-201-67519-6. Chapter 10.

Accessing Data in a Direct Mapped Cache (#1/3)
Ex.: 16KB of data, direct-mapped, 4 word blocks.
Read 4 addresses: 0x00000014, 0x0000001C, 0x00000034, 0x00008014.
Only cache/memory level of hierarchy.

Memory:
Address (hex)   Value of Word
00000010        a
00000014        b
00000018        c
0000001C        d
...
00000030        e
00000034        f
00000038        g
0000003C        h
...
00008010        i
00008014        j
00008018        k
0000801C        l

Accessing Data in a Direct Mapped Cache (#2/3)
4 Addresses: 0x00000014, 0x0000001C, 0x00000034, 0x00008014.
4 Addresses divided (for convenience) into Tag, Index, Byte Offset fields:

tttttttttttttttttt iiiiiiiiii oooo
tag (check if we have the correct block) | index (select the block) | offset (byte within the block)

0x00000014 = 000000000000000000 0000000001 0100
0x0000001C = 000000000000000000 0000000001 1100
0x00000034 = 000000000000000000 0000000011 0100
0x00008014 = 000000000000000010 0000000001 0100
             Tag                Index      Offset
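As a sanity check, the field breakdown above can be reproduced with a few lines of C. This is a minimal sketch, not part of the original slides, assuming the cache parameters stated earlier: 16-byte blocks (4 offset bits) and 1024 cache blocks (10 index bits), leaving 18 tag bits of a 32-bit address; the names OFFSET_BITS and INDEX_BITS are introduced here for the example.

```c
#include <stdio.h>
#include <stdint.h>

/* Assumed field widths for a 16 KB direct-mapped cache with 16-byte blocks:
   4 offset bits, 10 index bits, 18 tag bits. */
#define OFFSET_BITS 4
#define INDEX_BITS  10

int main(void) {
    uint32_t addrs[] = {0x00000014, 0x0000001C, 0x00000034, 0x00008014};
    for (int i = 0; i < 4; i++) {
        uint32_t a      = addrs[i];
        uint32_t offset = a & ((1u << OFFSET_BITS) - 1);                  /* low 4 bits   */
        uint32_t index  = (a >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);  /* next 10 bits */
        uint32_t tag    = a >> (OFFSET_BITS + INDEX_BITS);                /* top 18 bits  */
        printf("0x%08X -> tag=%u index=%u offset=%u\n",
               (unsigned)a, (unsigned)tag, (unsigned)index, (unsigned)offset);
    }
    return 0;
}
```

Running this would print tag 0, index 1 for the first two addresses, tag 0, index 3 for the third, and tag 2, index 1 for the last, matching the bit patterns above.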

Accessing Data in a Direct Mapped Cache (#3/3)
So let's go through accessing some data in this cache: 16KB data, direct-mapped, 4 word blocks.
We will see 3 types of events (sketched in code below):
cache miss: nothing in cache in appropriate block, so fetch from memory
cache hit: cache block is valid and contains proper address, so read desired word
cache miss, block replacement: wrong data is in cache at appropriate block, so discard it and fetch desired data from memory
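The three cases can be made concrete with a small direct-mapped cache model. The following is a hedged sketch, not code from the lecture: it models only the lookup decision (hit, cold miss, or miss with replacement) for the 1024-block, 16-byte-block cache assumed above, ignores the actual data transfer, and the names CacheLine, cache_access, and AccessResult are invented for this example.

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_BLOCKS 1024   /* 16 KB of data / 16-byte blocks (assumed from the slides) */

typedef struct {
    bool     valid;
    uint32_t tag;
} CacheLine;

static CacheLine cache[NUM_BLOCKS];   /* valid bits start out false: all entries invalid */

typedef enum { HIT, MISS_COLD, MISS_REPLACE } AccessResult;

/* Look up a 32-bit address: 4 offset bits, 10 index bits, 18 tag bits. */
AccessResult cache_access(uint32_t addr) {
    uint32_t index = (addr >> 4) & (NUM_BLOCKS - 1);
    uint32_t tag   = addr >> 14;
    CacheLine *line = &cache[index];

    if (line->valid && line->tag == tag)
        return HIT;                               /* valid and tag matches: cache hit     */

    AccessResult r = line->valid ? MISS_REPLACE   /* wrong block is there: replace it     */
                                 : MISS_COLD;     /* nothing there yet: cold miss         */
    line->valid = true;                           /* fetch block from memory, update tag  */
    line->tag   = tag;
    return r;
}
```

Feeding the four reads from these slides through this model would give a cold miss, a hit, another cold miss, and a miss with replacement, matching the walkthrough that follows.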

16 KB Direct Mapped Cache, 16B blocks
Valid bit: determines whether anything is stored in that row (when the computer is initially turned on, all entries are invalid).
[Figure: the example cache as a table of 1024 rows (Index 0 ... 1023), each row holding a Valid bit, a Tag, and four data words at byte offsets 0x0-3, 0x4-7, 0x8-b, 0xc-f. All entries start out invalid.]

Read 0x00000014 = 0…00 0..001 0100
Tag field = 000000000000000000, Index field = 0000000001, Offset = 0100.
[Cache figure: all entries still invalid.]

So we read block 1 (0000000001)
[Cache figure: index 1 selected; all entries still invalid.]

No valid data
[Cache figure: index 1 has Valid = 0, so this is a cache miss.]

So load that data into cache, setting tag, valid
[Cache figure: index 1 now has Valid = 1, Tag = 0, data words a b c d.]

Read from cache at offset, return word b
[Cache figure: index 1 holds Valid = 1, Tag = 0, data a b c d; offset 0x4 selects word b.]

Read 0x0000001C = 0…00 0..001 1100
Tag field = 000000000000000000, Index field = 0000000001, Offset = 1100.
[Cache figure: index 1 holds Valid = 1, Tag = 0, data a b c d.]

Data valid, tag OK, so read offset, return word d
[Cache figure: index 1 holds a b c d; offset 0xC selects word d.]

Read 0x00000034 = 0…00 0..011 0100
Tag field = 000000000000000000, Index field = 0000000011, Offset = 0100.
[Cache figure: index 1 holds a b c d; index 3 still invalid.]

So read block 3
[Cache figure: index 3 selected; index 1 still holds a b c d.]

No valid data
[Cache figure: index 3 has Valid = 0, so this is a cache miss.]

Load that cache block, return word f
[Cache figure: index 3 now has Valid = 1, Tag = 0, data e f g h; offset 0x4 selects word f.]

Read 0x00008014 = 0…10 0..001 0100
Tag field = 000000000000000010, Index field = 0000000001, Offset = 0100.
[Cache figure: index 1 holds Tag = 0, data a b c d; index 3 holds Tag = 0, data e f g h.]

So read Cache Block 1, Data is Valid
[Cache figure: index 1 has Valid = 1, Tag = 0, data a b c d.]

Cache Block 1 Tag does not match (0 != 2)
[Cache figure: the stored Tag at index 1 is 0, but the address's Tag field is 2, so this is a miss with replacement.]

Miss, so replace block 1 with new data & tag
[Cache figure: index 1 now has Valid = 1, Tag = 2, data i j k l; index 3 unchanged with e f g h.]

And return word j
[Cache figure: index 1 holds i j k l; offset 0x4 selects word j.]

Do an example yourself. What happens?
Choose from: Cache: Hit, Miss, Miss w. replace. Values returned: a, b, c, d, e, ..., k, l.
Read address 0x00000030 ?  (000000000000000000 0000000011 0000)
Read address 0x0000001c ?  (000000000000000000 0000000001 1100)
[Cache figure: index 1 holds Valid = 1, Tag = 2, data i j k l; index 3 holds Valid = 1, Tag = 0, data e f g h; all other entries invalid.]

Answers
0x00000030: a hit. Index = 3, Tag matches, Offset = 0, value = e.
0x0000001c: a miss with replacement. Index = 1, Tag mismatch, so replace the block from memory; Offset = 0xc, value = d.
The values read from the cache must equal the memory values whether or not cached: 0x00000030 = e, 0x0000001c = d.

Memory:
Address (hex)   Value of Word
00000010        a
00000014        b
00000018        c
0000001c        d
...
00000030        e
00000034        f
00000038        g
0000003c        h
...
00008010        i
00008014        j
00008018        k
0000801c        l

Block Size Tradeoff (#1/3)
Benefits of Larger Block Size:
Spatial Locality: if we access a given word, we're likely to access other nearby words soon (Another Big Idea).
Very applicable with the Stored-Program Concept: if we execute a given instruction, it's likely that we'll execute the next few as well.
Works nicely in sequential array accesses too (see the sketch below).

[Speaker notes] As I said earlier, block size is a tradeoff. In general, a larger block size will reduce the miss rate because it takes advantage of spatial locality. But remember, miss rate is NOT the only cache performance metric; you also have to worry about miss penalty. As you increase the block size, your miss penalty will go up because, as the block gets larger, it will take you longer to fill up the block. Even if you look at miss rate by itself, which you should NOT, a bigger block size does not always win. As you increase the block size, keeping the cache size constant, your miss rate will drop off rapidly at the beginning due to spatial locality. However, once you pass a certain point, your miss rate actually goes up. As a result of these two curves, the Average Access Time (see the equation on the next slide), which is really the more important performance metric than the miss rate, will go down initially because the miss rate is dropping much faster than the miss penalty is increasing. But eventually, as you keep on increasing the block size, the average access time can go up rapidly because not only is the miss penalty increasing, the miss rate is increasing as well. Let me show you why your miss rate may go up as you increase the block size with another extreme example.
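To make the spatial-locality point concrete, here is a small illustrative C sketch, not from the original slides: summing an array sequentially touches consecutive words, so after one word of a block misses, the remaining words of that block hit. The array size N, the element type, and the function names sum_sequential and sum_strided are arbitrary choices for the example, and the miss counts assume the array is much larger than the cache.

```c
#include <stdint.h>

#define N (1 << 16)   /* 256 KB of int32_t: much larger than a 16 KB cache */

/* Sequential access: with 16-byte blocks and 4-byte ints, each block holds
   4 consecutive elements, so roughly 1 miss is followed by 3 hits. */
int32_t sum_sequential(const int32_t a[N]) {
    int32_t sum = 0;
    for (int i = 0; i < N; i++)
        sum += a[i];          /* neighbouring elements share a cache block */
    return sum;
}

/* Strided access: jumping by 4 elements (16 bytes) touches a new block on
   every access, so the spatial locality within a block is wasted. */
int32_t sum_strided(const int32_t a[N]) {
    int32_t sum = 0;
    for (int s = 0; s < 4; s++)
        for (int i = s; i < N; i += 4)
            sum += a[i];      /* each inner-loop access lands in a different block */
    return sum;
}
```

Both functions do the same arithmetic, but the sequential version benefits from larger blocks while the strided version does not.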

Block Size Tradeoff (#2/3)
Drawbacks of Larger Block Size:
Larger block size means larger miss penalty: on a miss, it takes a longer time to load a new block from the next level.
If block size is too big relative to cache size, then there are too few blocks. Result: miss rate goes up.
In general, minimize Average Access Time = Hit Time x Hit Rate + Miss Penalty x Miss Rate

Block Size Tradeoff (#3/3)
Hit Time = time to find and retrieve data from the current level cache.
Miss Penalty = average time to retrieve data on a current level miss (includes the possibility of misses on successive levels of the memory hierarchy).
Hit Rate = % of requests that are found in the current level cache.
Miss Rate = 1 - Hit Rate
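As a purely illustrative worked example (the numbers are assumptions, not from the slides): with a Hit Time of 1 cycle, a Hit Rate of 95%, and a Miss Penalty of 50 cycles (so a Miss Rate of 5%), the formula on the previous slide gives

Average Access Time = 1 x 0.95 + 50 x 0.05 = 0.95 + 2.5 = 3.45 cycles.

If a larger block size raised the Miss Penalty to 80 cycles while only lowering the Miss Rate to 4%, the result would be 1 x 0.96 + 80 x 0.04 = 4.16 cycles, so the bigger block would be a net loss; this is exactly the tradeoff these slides describe.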

Extreme Example: One Big Block
[Figure: a cache with a single entry: one Valid bit, one Tag, and one 4-byte block holding bytes B0 B1 B2 B3.]
Cache Size = 4 bytes, Block Size = 4 bytes: only ONE entry in the cache!
If an item is accessed, it is likely to be accessed again soon, but it is unlikely to be accessed again immediately! The next access will likely be a miss again: we continually load data into the cache but discard it (force it out) before we use it again. Nightmare for the cache designer: the Ping Pong Effect (illustrated in the sketch below).

[Speaker notes] Let's go back to our 4-byte direct mapped cache and increase its block size to 4 bytes. Now we end up with one cache entry instead of 4 entries. What do you think this will do to the miss rate? Well, the miss rate probably will go to hell. It is true that if an item is accessed, it is likely that it will be accessed again soon, but probably NOT as soon as the very next access, so the next access will cause a miss again. So what we end up doing is loading data into the cache, but the data will be forced out by another cache miss before we have a chance to use it again. This is called the ping pong effect: the data is acting like a ping pong ball, bouncing in and out of the cache. It is one of the nightmare scenarios a cache designer hopes never happens. We also define a term for this type of cache miss, a cache miss caused by different memory locations mapped to the same cache index: it is called a Conflict Miss.
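A hedged illustration of the ping-pong effect, not taken from the slides: two locations whose addresses map to the same index of a direct-mapped cache evict each other on every access, so a loop that alternates between them misses every time. The cache geometry, the buffer layout, and the function name ping_pong are assumptions chosen so that the two accesses collide.

```c
#include <stdint.h>

/* Assume the 16 KB direct-mapped cache from the slides, with 16-byte blocks:
   addresses that are a multiple of 16 KB apart map to the same cache index. */
#define CACHE_SIZE (16 * 1024)

/* buf must point to at least CACHE_SIZE + sizeof(int32_t) bytes. */
int32_t ping_pong(const int32_t *buf) {
    const int32_t *a = buf;                                 /* some block            */
    const int32_t *b = buf + CACHE_SIZE / sizeof(int32_t);  /* same index, other tag */

    int32_t sum = 0;
    for (int i = 0; i < 1000; i++) {
        sum += *a;   /* loads a's block, evicting b's block: conflict miss */
        sum += *b;   /* loads b's block, evicting a's block: conflict miss */
    }
    return sum;      /* every access after the very first pair is a miss   */
}
```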

Block Size Tradeoff Conclusions
[Figure: three sketched curves against Block Size.
Miss Rate vs. Block Size: drops at first (exploits spatial locality), then rises again (fewer blocks compromises temporal locality).
Miss Penalty vs. Block Size: grows steadily as blocks get larger.
Average Access Time vs. Block Size: falls initially, then rises due to the increased Miss Penalty and Miss Rate.]

Things to Remember
Cache access involves 3 types of events:
cache miss: nothing in cache in appropriate block, so fetch from memory
cache hit: cache block is valid and contains proper address, so read desired word
cache miss, block replacement: wrong data is in cache at appropriate block, so discard it and fetch desired data from memory