CS-447 Computer Architecture, Lecture 20: Cache Memories


CS-447 Computer Architecture
Lecture 20: Cache Memories
October 29th, 2008
Majd F. Sakr
msakr@qatar.cmu.edu
www.qatar.cmu.edu/~msakr/15447-f08/

Locality
A principle that makes having a memory hierarchy a good idea. If an item is referenced:
- temporal locality: it will tend to be referenced again soon
- spatial locality: nearby items will tend to be referenced soon
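As a quick illustration (a minimal C sketch, not from the original slides), the loop below touches memory in exactly these two ways:

#include <stdio.h>

#define SIZE 1024

int main(void) {
    int array[SIZE];
    int sum = 0;                    /* 'sum' is re-referenced every iteration: temporal locality */
    for (int i = 0; i < SIZE; i++)
        array[i] = i;
    for (int i = 0; i < SIZE; i++)
        sum += array[i];            /* consecutive addresses: spatial locality */
    printf("sum = %d\n", sum);
    return 0;
}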

A View of the Memory Hierarchy
From the upper (faster) levels to the lower (larger) levels, with the unit of transfer between adjacent levels:
Regs -(instr. operands)- Cache -(blocks)- L2 Cache -(blocks)- Memory -(pages)- Disk -(files)- Tape

Cache
Our initial focus: two levels (upper, lower)
- block: minimum unit of data
- hit: data requested is in the upper level
- miss: data requested is not in the upper level

Cache Design
How do we organize the cache?
- Where does each memory address map to? (Remember that the cache is a subset of memory, so multiple memory addresses map to the same cache location.)
- How do we know which elements are in the cache?
- How do we quickly locate them?

Direct Mapped Cache
Mapping: cache index = block address modulo the number of blocks in the cache
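A minimal sketch of this mapping (the 4-block cache size is a made-up parameter for illustration):

#include <stdint.h>
#include <stdio.h>

#define NUM_BLOCKS 4                     /* hypothetical cache with 4 blocks */

/* Direct mapping: a block's cache index is its block address modulo the
   number of blocks; for a power-of-two block count this is just the
   low-order bits of the block address. */
static uint32_t cache_index(uint32_t block_address) {
    return block_address % NUM_BLOCKS;   /* same as: block_address & (NUM_BLOCKS - 1) */
}

int main(void) {
    for (uint32_t b = 0; b < 12; b++)
        printf("block %2u -> cache index %u\n", b, cache_index(b));
    return 0;
}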

Direct-Mapped Cache (1/2)
- In a direct-mapped cache, each memory address is associated with one possible block within the cache
- Therefore, we only need to look in a single location in the cache for the data, if it exists in the cache
- Block is the unit of transfer between cache and memory

Direct-Mapped Cache (2/2)
Consider the simplest cache one can build: a 4-byte direct-mapped cache (cache indexes 0 to 3) in front of a 16-location memory (addresses 0x0 to 0xF). Cache location 0 can be occupied by data from memory locations 0, 4, 8, C, ...; with 4 blocks, any memory location that is a multiple of 4 maps to cache location 0. Likewise, cache location 1 can be occupied by data from memory locations 1, 5, 9, D, and so on.
In general, the cache location a memory address maps to is uniquely determined by the 2 least significant bits of the address (the Cache Index): any address whose two least significant bits are 0 goes to cache location 0. With so many memory locations to choose from, which one should we place in the cache? The one we read or wrote most recently, because by the principle of temporal locality it is the one most likely to be needed again soon. And of all the memory locations that can map to cache location 0, how can we tell which one is actually in the cache?

Issues with Direct-Mapped
Since multiple memory addresses map to the same cache index, how do we tell which one is in there? And what if we have a block size > 1 byte?
Answer: divide the memory address into three fields:

ttttttttttttttttt iiiiiiiiii oooo
- tag: to check if we have the correct block
- index: to select the block
- offset: byte offset within the block

Direct-Mapped Cache Terminology
All fields are read as unsigned integers.
- Index: specifies the cache index (which "row" of the cache we should look in)
- Offset: once we've found the correct block, specifies which byte within the block we want (i.e., which "column")
- Tag: the remaining bits after offset and index are determined; these are used to distinguish between all the memory addresses that map to the same location

Direct Mapped Cache (for MIPS)
[Figure not reproduced in this transcript]

Direct-Mapped Cache Example (1/3)
Suppose we have 16 KB of data in a direct-mapped cache with 4-word blocks. Determine the size of the tag, index and offset fields if we're using a 32-bit architecture.
Offset
- need to specify the correct byte within a block
- a block contains 4 words = 16 bytes = 2^4 bytes
- need 4 bits to specify the correct byte

Direct-Mapped Cache Example (2/3)
Index (~index into an "array of blocks")
- need to specify the correct row in the cache
- cache contains 16 KB = 2^14 bytes
- block contains 2^4 bytes (4 words)
- # blocks/cache = (bytes/cache) / (bytes/block) = (2^14 bytes/cache) / (2^4 bytes/block) = 2^10 blocks/cache
- need 10 bits to specify this many rows

Direct-Mapped Cache Example (3/3)
Tag: use the remaining bits as the tag
- tag length = addr length - offset - index = 32 - 4 - 10 = 18 bits
- so the tag is the leftmost 18 bits of the memory address
Why not use the full 32-bit address as the tag?
- All bytes within a block share the same block address, so the 4 offset bits aren't needed to identify the block
- The index is the same for every address within a cache row, so it's redundant in the tag check and can be left off to save memory (10 bits in this example)
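To make the field split concrete, here is a small C sketch that extracts tag, index and offset for this 16 KB, 16-byte-block geometry (the helper name split_address is ours, not from the slides):

#include <stdint.h>
#include <stdio.h>

/* Field widths from the example: 4 offset bits, 10 index bits,
   18 tag bits (32 - 10 - 4). */
#define OFFSET_BITS 4
#define INDEX_BITS  10

static void split_address(uint32_t addr) {
    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);
    printf("0x%08x -> tag 0x%05x, index %4u, offset %2u\n",
           addr, tag, index, offset);
}

int main(void) {
    split_address(0x00000014);      /* tag 0, index 1, offset 4 */
    split_address(0x00008014);      /* tag 2, index 1, offset 4 */
    return 0;
}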

TIO cache mnemonic
AREA (cache size, B) = HEIGHT (# of blocks) × WIDTH (size of one block, B/block)
With H index bits and W offset bits: 2^(H+W) = 2^H × 2^W
Tag | Index | Offset: the Index bits set the HEIGHT, the Offset bits set the WIDTH.

Caching Terminology
When we try to read memory, 3 things can happen:
- cache hit: the cache block is valid and contains the proper address, so read the desired word
- cache miss: nothing in the cache at the appropriate block, so fetch from memory
- cache miss, block replacement: wrong data is in the cache at the appropriate block, so discard it and fetch the desired data from memory (the cache always holds a copy)
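These three outcomes map directly onto a valid-bit check plus a tag compare. A minimal sketch, assuming the 16 KB / 16-byte-block cache from the running example (struct and function names are illustrative, not from the slides):

#include <stdbool.h>
#include <stdint.h>

#define NUM_ROWS   1024
#define BLOCK_SIZE 16

/* One cache row: valid bit, tag, and a 16-byte data block. */
struct cache_row {
    bool     valid;
    uint32_t tag;
    uint8_t  data[BLOCK_SIZE];
};

static struct cache_row cache[NUM_ROWS];   /* zero-initialized: every row starts invalid */

/* Returns true on a hit; on a miss (row invalid, or valid but wrong tag,
   i.e. the block-replacement case) the caller would fetch the block from
   memory, refill the row, and set its valid bit and tag. */
static bool cache_lookup(uint32_t addr, uint8_t *byte_out) {
    uint32_t offset = addr & 0xF;           /* 4 offset bits */
    uint32_t index  = (addr >> 4) & 0x3FF;  /* 10 index bits */
    uint32_t tag    = addr >> 14;           /* 18 tag bits   */

    struct cache_row *row = &cache[index];
    if (row->valid && row->tag == tag) {    /* hit: valid and tag matches */
        *byte_out = row->data[offset];
        return true;
    }
    return false;                           /* miss, possibly with replacement */
}

int main(void) {
    uint8_t b;
    return cache_lookup(0x00000014, &b) ? 0 : 1;   /* cold cache: a miss */
}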

Accessing data in a direct-mapped cache
Ex.: 16 KB of data, direct-mapped, 4-word blocks. Read 4 addresses:
0x00000014, 0x0000001C, 0x00000034, 0x00008014
Memory values below (only the cache/memory level of the hierarchy is shown):

Address (hex)  Value of Word
00000010       a
00000014       b
00000018       c
0000001C       d
...
00000030       e
00000034       f
00000038       g
0000003C       h
...
00008010       i
00008014       j
00008018       k
0000801C       l

Accessing data in a direct-mapped cache
The 4 addresses divided (for convenience) into Tag, Index, Byte Offset fields:

Tag                 Index       Offset
000000000000000000  0000000001  0100
000000000000000000  0000000001  1100
000000000000000000  0000000011  0100
000000000000000010  0000000001  0100

16 KB Direct Mapped Cache, 16 B blocks
Valid bit: determines whether anything is stored in that row (when the computer is initially turned on, all entries are invalid)
[Cache diagram: rows indexed 0-1023, each with a Valid bit, a Tag, and data bytes 0x0-3, 0x4-7, 0x8-b, 0xc-f; all rows start invalid]

1. Read 0x00000014
Tag field: 000000000000000000, Index field: 0000000001, Offset: 0100

So we read block 1 (0000000001)

No valid data (row 1's valid bit is 0)

So load that data into the cache, setting tag and valid
[Row 1 now: Valid = 1, Tag = 0, data = a b c d]

Read from the cache at the offset, return word b

2. Read 0x0000001C = 0…00 0..001 1100
Tag field: 000000000000000000, Index field: 0000000001, Offset: 1100

Index is Valid

Index valid, Tag matches

Index valid, Tag matches, return d

3. Read 0x00000034 = 0…00 0..011 0100
Tag field: 000000000000000000, Index field: 0000000011, Offset: 0100

So we read block 3

No valid data (row 3's valid bit is 0)

Load that cache block, return word f
[Row 3 now: Valid = 1, Tag = 0, data = e f g h]

4. Read 0x00008014 = 0…10 0..001 0100
Tag field: 000000000000000010, Index field: 0000000001, Offset: 0100

So read cache block 1; the data is valid

Cache block 1's Tag does not match (0 != 2)

Miss, so replace block 1 with the new data & tag
[Row 1 now: Valid = 1, Tag = 2, data = i j k l]

And return word j

Do an example yourself. What happens?
Choose from: Cache: Hit, Miss, Miss w. replace. Values returned: a, b, c, d, e, ..., k, l
- Read address 0x00000030? (000000000000000000 0000000011 0000)
- Read address 0x0000001c? (000000000000000000 0000000001 1100)
[Cache state: row 1: Valid = 1, Tag = 2, data = i j k l; row 3: Valid = 1, Tag = 0, data = e f g h; all other rows invalid]

Answers
- 0x00000030: a hit. Index = 3, Tag matches, Offset = 0, value = e
- 0x0000001c: a miss. Index = 1, Tag mismatch, so replace from memory; Offset = 0xc, value = d
Since these are reads, the values must equal the memory values whether or not they were cached:
0x00000030 = e
0x0000001c = d
(Memory values as in the table shown earlier.)

Hits vs. Misses
Read hits: this is what we want!
Read misses: stall the CPU, fetch the block from memory, deliver it to the cache, restart

Hits vs. Misses
Write hits:
- can replace the data in cache and memory (write-through)
- or write the data only into the cache and write it back to memory later (write-back)
Write misses:
- read the entire block into the cache, then write the word
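A sketch contrasting the two write-hit policies (a single made-up cache row; write-back adds a dirty bit so the modified block can be copied out at eviction):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 16

struct row { bool valid, dirty; uint32_t tag; uint8_t data[BLOCK_SIZE]; };

static uint8_t memory[1 << 16];            /* toy backing memory */

/* Write-through: update the cached block AND memory on every write,
   so memory is always up to date. */
static void write_through(struct row *r, uint32_t addr, uint8_t v) {
    r->data[addr % BLOCK_SIZE] = v;
    memory[addr] = v;
}

/* Write-back: update only the cache and mark the block dirty;
   memory is brought up to date later, when the block is evicted. */
static void write_back(struct row *r, uint32_t addr, uint8_t v) {
    r->data[addr % BLOCK_SIZE] = v;
    r->dirty = true;
}

static void evict(struct row *r, uint32_t block_base) {
    if (r->dirty)                          /* copy the whole block out */
        memcpy(&memory[block_base], r->data, BLOCK_SIZE);
    r->valid = r->dirty = false;
}

int main(void) {
    struct row r = { .valid = true };
    write_through(&r, 0x20, 7);            /* memory[0x20] updated immediately */
    write_back(&r, 0x24, 9);               /* memory[0x24] still stale */
    evict(&r, 0x20);                       /* flushes the dirty block */
    return memory[0x24] == 9 ? 0 : 1;
}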

Block Size Tradeoff (1/3)
Benefits of Larger Block Size
- Spatial Locality: if we access a given word, we're likely to access other nearby words soon
- Very applicable with the Stored-Program Concept: if we execute a given instruction, it's likely that we'll execute the next few as well
- Works nicely for sequential array accesses too
Block size is a tradeoff, though. A larger block size generally reduces the miss rate by exploiting spatial locality, but miss rate is not the only performance metric: the miss penalty also grows, because a larger block takes longer to fill. Even miss rate alone does not always improve; with the cache size held constant, the miss rate drops rapidly at first but goes back up once blocks get too large. Consequently the average access time, the metric that really matters, falls initially (the miss rate drops faster than the miss penalty rises) but eventually climbs as both the miss penalty and the miss rate increase.

Block Size Tradeoff (2/3)
Drawbacks of Larger Block Size
- A larger block size means a larger miss penalty: on a miss, it takes longer to load a new block from the next level
- If the block size is too big relative to the cache size, then there are too few blocks; result: the miss rate goes up
In general, minimize Average Access Time = Hit Time × Hit Rate + Miss Penalty × Miss Rate

Block Size Tradeoff (3/3)
- Hit Time = time to find and retrieve data from the current level cache
- Miss Penalty = average time to retrieve data on a current-level miss (includes the possibility of misses at successive levels of the memory hierarchy)
- Hit Rate = % of requests that are found in the current level cache
- Miss Rate = 1 - Hit Rate
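Plugging hypothetical numbers into the slide's formula (the 1-cycle hit time, 20-cycle miss penalty, and 5% miss rate are made up for illustration):

#include <stdio.h>

int main(void) {
    double hit_time     = 1.0;    /* cycles for a hit at this level   */
    double miss_penalty = 20.0;   /* average cycles to service a miss */
    double miss_rate    = 0.05;
    double hit_rate     = 1.0 - miss_rate;

    /* Average Access Time = Hit Time × Hit Rate + Miss Penalty × Miss Rate */
    double avg = hit_time * hit_rate + miss_penalty * miss_rate;
    printf("average access time = %.2f cycles\n", avg);   /* prints 1.95 */
    return 0;
}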

Block Size Tradeoff Conclusions
[Three sketches against Block Size: Miss Rate first falls as larger blocks exploit spatial locality, then rises once fewer blocks compromise temporal locality; Miss Penalty rises steadily; Average Access Time falls, then rises from the increased miss penalty & miss rate]

Performance
Increasing the block size tends to decrease the miss rate:
[Figure not reproduced in this transcript]

Simplified model:
execution time = (execution cycles + stall cycles) × cycle time
stall cycles = # of instructions × miss ratio × miss penalty
Two ways of improving performance:
- decreasing the miss ratio
- decreasing the miss penalty
What happens if we increase block size?
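Exercising the simplified model with made-up numbers (1 billion instructions, base CPI of 1.2, 2% miss ratio, 50-cycle miss penalty, 2 GHz clock; all hypothetical):

#include <stdio.h>

int main(void) {
    double instructions     = 1e9;
    double execution_cycles = 1.2e9;     /* base CPI of 1.2          */
    double miss_ratio       = 0.02;
    double miss_penalty     = 50.0;      /* cycles per miss          */
    double cycle_time       = 0.5e-9;    /* seconds (2 GHz clock)    */

    /* stall cycles = # of instructions × miss ratio × miss penalty   */
    double stall_cycles = instructions * miss_ratio * miss_penalty;
    /* execution time = (execution cycles + stall cycles) × cycle time */
    double exec_time = (execution_cycles + stall_cycles) * cycle_time;

    printf("stall cycles = %.3g, execution time = %.2f s\n",
           stall_cycles, exec_time);     /* 1e+09 stall cycles, 1.10 s */
    return 0;
}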