55:035 Computer Architecture and Organization

Outline
- Cache Memory Introduction
- Memory Hierarchy
- Direct-Mapped Cache
- Set-Associative Cache
- Cache Sizes
- Cache Performance

Introduction
Memory access time is important to performance! Users want large memories with fast access times; ideally, unlimited fast memory. To use an analogy, think of a bookshelf containing many books. Suppose you are writing a paper on birds. You go to the bookshelf, pull out some of the books on birds, and place them on the desk. As you start to look through them, you realize that you need more references, so you go back to the bookshelf, get more books on birds, and put them on the desk. Now, as you begin to write your paper, you have many of the references you need on the desk in front of you. This is an example of the principle of locality: programs access a relatively small portion of their address space at any instant of time.

Levels of the Memory Hierarchy
(Figure: the hierarchy from the CPU outward.)
- Registers: part of the on-chip CPU datapath; the ISA exposes 16-128 registers.
- Cache level(s), static RAM: Level 1 on-chip, 16-64 KB; Level 2 on-chip, 256 KB-2 MB; Level 3 on- or off-chip, 1-16 MB.
- Main memory, dynamic RAM (DRAM): 256 MB-16 GB.
- Magnetic disk (interfaces: SCSI, RAID, IDE, 1394): 80-300 GB.
- Optical disk or magnetic tape.
Farther away from the CPU: lower cost per bit, higher capacity, increased access time/latency, lower throughput/bandwidth.

Memory Hierarchy Comparisons

Level         Capacity     Access Time             Cost
Registers     100s bytes   <10s ns                 --
Cache         KBytes       10-100 ns               1-0.1 cents/bit
Main Memory   MBytes       200-500 ns              10^-4 - 10^-5 cents/bit
Disk          GBytes       10 ms (10,000,000 ns)   10^-5 - 10^-6 cents/bit
Tape          infinite     sec-min                 10^-8 cents/bit

Staging transfer units: registers ↔ cache: instruction operands, 1-8 bytes (prog./compiler); cache ↔ main memory: blocks, 8-128 bytes (cache controller); main memory ↔ disk: pages, 4K-16K bytes (OS); disk ↔ tape: files, MBytes (user/operator). Moving up the hierarchy: faster; moving down: larger.

Memory Hierarchy
We can exploit the natural locality in programs by implementing the memory of a computer as a memory hierarchy: multiple levels of memory with different speeds and sizes. The fastest memories are more expensive and usually much smaller (see figure). The user has the illusion of a memory that is both large and fast, accomplished by using efficient methods for memory structure and organization.

Inventor of Cache
M. V. Wilkes, “Slave Memories and Dynamic Storage Allocation,” IEEE Transactions on Electronic Computers, vol. EC-14, no. 2, pp. 270-271, April 1965.

Cache
The processor does all memory operations through the cache.
- Miss: if the requested word is not in the cache, a block of words containing the requested word is brought into the cache, and then the processor request is completed.
- Hit: if the requested word is in the cache, the read or write operation is performed directly in the cache, without accessing main memory.
- Block: the minimum amount of data transferred between the cache and main memory.
(Figure: Processor ↔ cache (small, fast memory), transferring words; cache ↔ main memory (large, inexpensive, slow), transferring blocks.)

The Locality Principle
A program tends to access data that form a physical cluster in memory; multiple accesses may be made within the same block. Physical localities are temporal and may shift over longer periods of time: data not used for some time is less likely to be used in the future. Upon a miss, the least recently used (LRU) block can be overwritten by a new block. P. J. Denning, “The Locality Principle,” Communications of the ACM, vol. 48, no. 7, pp. 19-24, July 2005.

Temporal & Spatial Locality
There are two types of locality:
- Temporal locality (locality in time): if an item is referenced, it will likely be referenced again soon. Data is reused.
- Spatial locality (locality in space): if an item is referenced, items in neighboring addresses will likely be referenced soon.
Most programs contain natural locality in their structure. For example, most programs contain loops in which the instructions and data are accessed repeatedly; this is an example of temporal locality. Instructions are usually accessed sequentially, so they exhibit a high degree of spatial locality. Access to the elements of an array is another example of spatial locality.

Data Locality, Cache, Blocks
Increase the block size to match the size of a locality; size the cache to include most blocks. (Figure: the data needed by a program spans Block 1 and Block 2, which are copied from memory into the cache.)

Basic Caching Concepts
The memory system is organized as a hierarchy, with the level closest to the processor being a subset of any level further away, and all of the data stored at the lowest level (see figure). Data is copied between only two adjacent levels at any given time. We call the minimum unit of information transferred in a two-level hierarchy a block or line (the highlighted square shown in the figure). If data requested by the user appears in some block in the upper level, it is known as a hit; if it is not found in the upper level, it is known as a miss.

Basic Cache Organization
(Figure: the full byte address splits into tag, index, and offset fields. The index decodes and selects a row of the tag and data arrays; the stored tag is compared with the address tag, a match signals a hit, and the offset drives a mux select that delivers the requested data word.)

Direct-Mapped Cache
(Figure: a block needed by a program is swapped in from memory to its single fixed cache location; the block already occupying that location is swapped out.)

Set-Associative Cache
(Figure: a needed block is swapped in to any way of its selected set; the LRU block within the set is swapped out.)

Three Major Placement Schemes
(Figure comparing direct-mapped, set-associative, and fully-associative placement.)

Direct-Mapped Placement
A block can go into only one place in the cache, determined by the block’s address (in the memory space). The index number for block placement is given by some low-order bits of the block’s address. This can also be expressed as:
(Index) = (Block address) mod (Number of blocks in cache)
Note that in a direct-mapped cache, block placement and replacement choices are both completely determined by the address of the new block that is to be accessed. A minimal sketch of this address breakdown appears below.
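
As an illustration (not from the slides), here is a minimal Python sketch that splits a word address into tag and index for the 8-block, one-word-per-block cache used in the following figures:

```python
def split_address(word_addr, num_blocks=8):
    """Split a word address into (tag, index) for a direct-mapped
    cache with one-word blocks. Index = address mod num_blocks;
    the remaining high-order bits form the tag."""
    index = word_addr % num_blocks          # low-order bits
    tag = word_addr // num_blocks           # high-order bits
    return tag, index

# Word address 29 = 0b11101: tag = 0b11, index = 0b101, as in the next figure
print(split_address(0b11101))               # -> (3, 5)
```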

Direct-Mapped Cache
(Figure: a 32-word word-addressable main memory mapped onto a cache of 8 blocks, block size = 1 word. A 5-bit memory address such as 11101 forms the cache address as a 2-bit tag (11) and a 3-bit index (101); the index is the block’s local address in the cache.)

Direct-Mapped Cache
(Figure: the same 32-word word-addressable memory with a cache of 4 blocks and block size = 2 words. A memory address such as 11101 now splits into a 2-bit tag (11), a 2-bit index (10), and a 1-bit block offset (1).)

Direct-Mapped Cache (Byte Address)
(Figure: a 32-word byte-addressable memory with a cache of 8 one-word blocks. A byte address such as 1110100 splits into a 2-bit tag (11), a 3-bit index (101), and a 2-bit byte offset (00).)

Finding a Word in Cache
(Figure: 32-word byte-addressable memory, cache size 8 words, block size = 1 word. The memory address b6 b5 b4 b3 b2 b1 b0 splits into a 2-bit tag (b6 b5), a 3-bit index (b4 b3 b2), and a byte offset (b1 b0). Each of the 8 cache entries (indexes 000-111) holds a valid bit, a 2-bit tag, and a data word. The index selects an entry, the stored tag is compared with the address tag, and the comparator output signals 1 = hit, 0 = miss.)

Miss Rate of Direct-Mapped Cache
(Figure: the same 32-word memory and 8-block cache. A needed block maps onto a cache location that is already occupied, here by the least recently used (LRU) block, which is overwritten.)

Miss Rate of Direct-Mapped Cache
(Figure: 32-word memory, cache of 8 one-word blocks.) Memory references to word addresses 0, 8, 0, 6, 8, 16: 1. miss, 2. miss, 3. miss, 4. miss, 5. miss, 6. miss. Addresses 0, 8, and 16 all map to index 000, so they keep evicting one another, and every reference misses. A simulation of this trace appears below.
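
A small Python simulation (illustrative, not from the slides) reproduces this all-miss behavior for the trace 0, 8, 0, 6, 8, 16 on a direct-mapped cache of 8 one-word blocks:

```python
def simulate_direct_mapped(trace, num_blocks=8):
    """Return per-reference hit/miss for a direct-mapped cache with
    one-word blocks; addresses are word addresses."""
    blocks = [None] * num_blocks            # stored tag per index; None = invalid
    results = []
    for addr in trace:
        index, tag = addr % num_blocks, addr // num_blocks
        if blocks[index] == tag:
            results.append("hit")
        else:
            results.append("miss")
            blocks[index] = tag             # evict whatever was there
    return results

print(simulate_direct_mapped([0, 8, 0, 6, 8, 16]))
# -> ['miss', 'miss', 'miss', 'miss', 'miss', 'miss']
```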

Fully-Associative Cache (8-Way Set-Associative)
(Figure: the same 32-word word-addressable memory with a cache of 8 one-word blocks, now fully associative, i.e., one set of 8 ways. There is no index: a byte address such as 1110100 splits into a 5-bit tag (11101) and a 2-bit byte offset (00), and a needed block may replace the LRU block anywhere in the cache.)

Miss Rate: Fully-Associative Cache
(Figure: cache of 8 one-word blocks, fully associative.) Memory references to word addresses 0, 8, 0, 6, 8, 16: 1. miss, 2. miss, 3. hit, 4. miss, 5. hit, 6. miss. Since any block can go anywhere, words 0 and 8 no longer evict each other, and the repeated references hit.

Finding a Word in Associative Cache
(Figure: 32-word byte-addressable memory, cache size 8 words, block size = 1 word. The memory address b6 b5 b4 b3 b2 b1 b0 splits into a 5-bit tag (b6-b2) and a byte offset (b1 b0); there is no index. Each entry holds a valid bit, a 5-bit tag, and a data word, and the address tag must be compared with all tags in the cache in parallel; a match signals 1 = hit, 0 = miss.)

Eight-Way Set-Associative Cache
(Figure: cache size 8 words, block size = 1 word, 32-word byte-addressable memory. The address b31 b30 b29 b28 b27 ... b1 b0 provides a 5-bit tag and a byte offset. Eight entries, each with a valid bit, tag, and data, are checked simultaneously by eight comparators; the matching way drives an 8-to-1 multiplexer that delivers the data, with 1 = hit, 0 = miss.)

Two-Way Set-Associative Cache
(Figure: the 32-word word-addressable memory with a cache of 8 one-word blocks organized as 4 sets of 2 ways. A byte address such as 1110100 splits into a 3-bit tag (111), a 2-bit index (01), and a 2-bit byte offset (00). The needed block may replace the LRU block within its set.)

Miss Rate: Two-Way Set-Associative Cache
(Figure: 4 sets of 2 ways, one-word blocks.) Memory references to word addresses 0, 8, 0, 6, 8, 16: 1. miss, 2. miss, 3. hit, 4. miss, 5. hit, 6. miss. Words 0 and 8 both map to set 00 but can share its two ways, so the repeated references hit. The parametric simulation below reproduces the direct-mapped, two-way, and fully-associative results.
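
Generalizing the earlier sketch (again illustrative, not from the slides), a set-associative simulator with LRU replacement covers all three organizations: 1 way gives the direct-mapped cache, and 8 ways (a single set) gives the fully-associative cache.

```python
def simulate_set_associative(trace, num_blocks=8, ways=1):
    """LRU set-associative cache with one-word blocks.
    ways=1 -> direct-mapped; ways=num_blocks -> fully associative."""
    num_sets = num_blocks // ways
    sets = [[] for _ in range(num_sets)]    # per set: tags in LRU order (front = LRU)
    results = []
    for addr in trace:
        s, tag = addr % num_sets, addr // num_sets
        if tag in sets[s]:
            results.append("hit")
            sets[s].remove(tag)             # refresh to most-recently-used
        else:
            results.append("miss")
            if len(sets[s]) == ways:
                sets[s].pop(0)              # evict the LRU tag
        sets[s].append(tag)
    return results

trace = [0, 8, 0, 6, 8, 16]
print(simulate_set_associative(trace, ways=1))  # direct-mapped: 6 misses
print(simulate_set_associative(trace, ways=2))  # two-way: miss, miss, hit, miss, hit, miss
print(simulate_set_associative(trace, ways=8))  # fully associative: the same 4 misses
```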

Two-Way Set-Associative Cache
(Figure: the lookup hardware for a cache of 8 words in 4 sets of 2 ways, 32-word byte-addressable memory, block size = 1 word. The address b6 b5 b4 b3 b2 b1 b0 provides a 3-bit tag, a 2-bit index (sets 00-11), and a 2-bit byte offset. The index selects a set; each way’s valid bit and tag are checked by its own comparator, and a 2-to-1 multiplexer selects the data from the matching way, with 1 = hit, 0 = miss.)

Using a Larger Cache Block (4 Words)
(Figure: 4 GB = 1 G words of byte-addressable memory; cache size 16K words organized as 4K indexes with block size = 4 words. The 32-bit address b31 ... b0 splits into a 16-bit tag, a 12-bit index (0000 0000 0000 through 1111 1111 1111), a 2-bit block offset, and a 2-bit byte offset. Each entry holds a valid bit, a 16-bit tag, and 4 words (128 bits) of data; after a tag match, the block offset drives a multiplexer that selects one of the 4 words, with 1 = hit, 0 = miss.)

Number of Tag and Index Bits
Consider a main memory of W words and a cache of w words. Each word in the cache has a unique index (its local address), so the number of index bits is log2(w). (Index bits are shared with the block offset when a block contains more than one word.) Think of the main memory as divided into W/w partitions of w words each, with each partition identified by a tag: the number of tag bits is log2(W/w). A quick check of these formulas appears below.
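
To make the bit-width bookkeeping concrete, here is a small Python helper (the function name is mine, not the slides’) applying these formulas; sizes are assumed to be powers of two:

```python
from math import log2

def cache_address_bits(mem_words, cache_words, block_words=1):
    """Return (tag_bits, index_bits, block_offset_bits) for a
    direct-mapped cache, per the log2 formulas above."""
    block_offset_bits = int(log2(block_words))
    index_bits = int(log2(cache_words)) - block_offset_bits
    tag_bits = int(log2(mem_words // cache_words))
    return tag_bits, index_bits, block_offset_bits

# 32-word memory, 8-word cache, 1-word blocks: 2-bit tag, 3-bit index
print(cache_address_bits(32, 8))            # -> (2, 3, 0)
# 32-word memory, 8-word cache, 2-word blocks: 2-bit tag, 2-bit index, 1-bit offset
print(cache_address_bits(32, 8, 2))         # -> (2, 2, 1)
```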

How Many Bits Does the Cache Have?
Consider a main memory of 32 words; the byte address is 7 bits wide: b6 b5 b4 b3 b2 b1 b0. Each word is 32 bits wide. Assume the cache block size is 1 word (32 bits of data) and the cache contains 8 blocks. The cache then requires, for each word, a 2-bit tag and one valid bit.
Total storage needed in cache = #blocks in cache × (data bits/block + tag bits + valid bit) = 8 × (32 + 2 + 1) = 280 bits
Physical storage / data storage = 280/256 = 1.094

A More Realistic Cache
Consider a 4 GB, byte-addressable main memory: 1 G words; the byte address is 32 bits wide: b31 ... b2 b1 b0. Each word is 32 bits wide. Assume the cache block size is 1 word (32 bits of data) and the cache holds 64 KB of data, or 16K words, i.e., 16K blocks. The number of cache index bits is 14, because 16K = 2^14. Tag size = 32 – byte offset – #index bits = 32 – 2 – 14 = 16 bits. The cache requires, for each word, a 16-bit tag and one valid bit.
Total storage needed in cache = #blocks in cache × (data bits/block + tag size + valid bit) = 2^14 × (32 + 16 + 1) = 16 × 2^10 × 49 = 784 × 2^10 bits = 784 Kb = 98 KB
Physical storage / data storage = 98/64 = 1.53
But we need to increase the block size to match the size of a locality.

Cache Bits for a 4-Word Block
Consider the same 4 GB, byte-addressable main memory: 1 G words, 32-bit byte address b31 ... b2 b1 b0, 32-bit words. Assume the cache block size is 4 words (128 bits of data) and the cache holds 64 KB of data, or 16K words, i.e., 4K blocks. The number of cache index bits is 12, because 4K = 2^12. Tag size = 32 – byte offset – #block offset bits – #index bits = 32 – 2 – 2 – 12 = 16 bits. The cache requires, for each block, a 16-bit tag and one valid bit.
Total storage needed in cache = #blocks in cache × (data bits/block + tag size + valid bit) = 2^12 × (4 × 32 + 16 + 1) = 4 × 2^10 × 145 = 580 × 2^10 bits = 580 Kb = 72.5 KB
Physical storage / data storage = 72.5/64 = 1.13
The calculation below checks both of these designs.
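
A short Python check (illustrative; the function name is mine) reproduces both storage overheads:

```python
def cache_storage_bits(num_blocks, block_words, addr_bits=32, word_bits=32):
    """Total cache bits = blocks * (data + tag + valid), direct-mapped.
    Byte offset is 2 bits for 32-bit words; sizes are powers of two."""
    index_bits = num_blocks.bit_length() - 1          # log2(num_blocks)
    offset_bits = 2 + (block_words.bit_length() - 1)  # byte + block offset
    tag_bits = addr_bits - index_bits - offset_bits
    total = num_blocks * (block_words * word_bits + tag_bits + 1)
    data = num_blocks * block_words * word_bits
    return total, total / data

print(cache_storage_bits(16 * 1024, 1))   # -> (802816, 1.53125):  98 KB total
print(cache_storage_bits(4 * 1024, 4))    # -> (593920, 1.1328125): 72.5 KB total
```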

Cache Size Equation
A simple equation for the size of a cache:
(Cache size) = (Block size) × (Number of sets) × (Set associativity)
These quantities relate to the sizes of the address fields:
(Block size) = 2^(# of offset bits)
(Number of sets) = 2^(# of index bits)
(# of tag bits) = (# of memory address bits) – (# of index bits) – (# of offset bits)
(Figure: memory address divided into tag, index, and offset fields.)

Interleaved Memory
Interleaving reduces the miss penalty: the memory is designed to read the words of a block simultaneously, in one read operation. Example: cache block size = 4 words, an interleaved memory with 4 banks, and a memory access time of ~15 cycles. Miss penalty = 1 cycle to send the address + 15 cycles to read a block + 4 cycles to send the data to the cache = 20 cycles. Without interleaving, the four words are read one after another, and the miss penalty = 65 cycles. The arithmetic is spelled out below. (Figure: processor ↔ cache (small, fast memory) in words; cache ↔ main memory, organized as memory banks 0 through 3, in blocks.)
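
The penalty arithmetic, as a hedged Python sketch of the comparison (assuming one cycle per transferred word, as the slide does):

```python
def miss_penalty(block_words, access_cycles=15, banks=1):
    """Cycles to service a miss: 1 to send the address, then memory
    reads (overlapped across banks), then 1 cycle per word to the cache."""
    reads = -(-block_words // banks)        # ceil(block_words / banks)
    return 1 + reads * access_cycles + block_words

print(miss_penalty(4, banks=4))             # interleaved: 1 + 15 + 4 = 20 cycles
print(miss_penalty(4, banks=1))             # not interleaved: 1 + 60 + 4 = 65 cycles
```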

Cache Design
A cache level’s design is described by four behaviors:
- Block placement: where can a new block be placed in the given level?
- Block identification: how is an existing block found, if it is in the level?
- Block replacement: which existing block should be replaced, if necessary?
- Write strategy: how are writes to the block handled?

Handling a Miss
A miss occurs when data at the required memory address is not found in the cache. Controller actions:
- Stall the pipeline and freeze the contents of all registers.
- Activate a separate cache controller.
- If the cache is full, select the least recently used (LRU) block in the cache for overwriting; if the selected block has inconsistent data, take proper action.
- Copy the block containing the requested address from memory.
- Restart the instruction.

Miss During Instruction Fetch
- Send the original PC value (PC – 4) to the memory.
- Instruct main memory to perform a read and wait for the memory to complete the access.
- Write the cache entry.
- Restart the instruction whose fetch failed.

Writing to Memory
The cache and memory become inconsistent when data is written into the cache but not to memory: the cache coherence problem. Strategies to handle inconsistent data:
- Write-through: always write to memory and cache simultaneously. A write to memory is ~100 times slower than a write to the (L1) cache.
- Write-back: write to the cache and mark the block as “dirty”; the write to memory occurs later, when the dirty block is cast out from the cache to make room for another block.

Writing to Memory: Write-Back
Write-back (or copy-back) writes only to the cache, but sets a “dirty bit” in the block where the write is performed. When a block with its dirty bit on is to be overwritten in the cache, it is first written back to memory. “Unnecessary” writes may occur for both write-through and write-back:
- Write-through has extra writes because each store instruction causes a transaction to memory (e.g., eight 32-bit transactions versus one 32-byte burst transaction for a cache line).
- Write-back has extra writes because unmodified words in a cache line get written back even though they haven’t been changed.
The penalty for write-through is much greater, so write-back is far more popular. A write-back sketch with dirty bits follows.
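
As an illustrative sketch (the class interface is mine, not the slides’), a direct-mapped write-back cache with per-block dirty bits might look like this in Python; misses allocate on writes as well as reads:

```python
class WriteBackCache:
    """Direct-mapped write-back cache sketch: one-word blocks,
    a dirty bit per block, writes reach memory only on eviction."""
    def __init__(self, num_blocks, memory):
        self.num_blocks = num_blocks
        self.memory = memory                          # backing store: list of words
        self.lines = {}                               # index -> [tag, data, dirty]

    def _lookup(self, addr):
        index, tag = addr % self.num_blocks, addr // self.num_blocks
        line = self.lines.get(index)
        if line is None or line[0] != tag:            # miss: evict, then fill
            if line is not None and line[2]:          # dirty? write back first
                old_addr = line[0] * self.num_blocks + index
                self.memory[old_addr] = line[1]
            line = [tag, self.memory[addr], False]    # write-allocate fill
            self.lines[index] = line
        return line

    def read(self, addr):
        return self._lookup(addr)[1]

    def write(self, addr, value):
        line = self._lookup(addr)
        line[1], line[2] = value, True                # update cache, set dirty bit

mem = [0] * 32
c = WriteBackCache(8, mem)
c.write(5, 99)                 # memory[5] is still 0: the write stays in the cache
c.read(13)                     # word 13 maps to index 5, evicting the dirty block
print(mem[5])                  # -> 99: the dirty block was written back on eviction
```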

Cache Hierarchy
Average access time = T1 + (1 – h1) [ T2 + (1 – h2)Tm ]
where
- T1 = L1 cache access time (smallest)
- T2 = L2 cache access time (small)
- Tm = memory access time (large)
- h1, h2 = hit rates (0 ≤ h1, h2 ≤ 1)
The average access time is reduced by adding a cache. (Figure: Processor → L1 cache (SRAM), access time T1 → L2 cache (DRAM), access time T2 → main memory, large and inexpensive but slow, access time Tm.)

Average Access Time
(Plot: access time T1 + (1 – h1) [ T2 + (1 – h2)Tm ] versus L1 miss rate 1 – h1, with T1 < T2 < Tm. At h1 = 1 the access time is T1; at h1 = 0 it rises toward T1 + T2 + Tm, with curves shown for h2 = 0 (T1 + T2 + Tm), h2 = 0.5 (T1 + T2 + Tm/2), and h2 = 1 (T1 + T2).)
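
The formula is easy to explore numerically; here is a one-line Python helper (illustrative), evaluated with the parameter values used in the performance slides below (T1 = 1, T2 = 25, Tm = 500 cycles, h1 = 0.95, h2 = 0.90):

```python
def avg_access_time(t1, t2, tm, h1, h2):
    """Two-level hierarchy: T1 + (1 - h1) * (T2 + (1 - h2) * Tm)."""
    return t1 + (1 - h1) * (t2 + (1 - h2) * tm)

print(avg_access_time(1, 25, 500, 0.95, 0.90))   # -> 4.75 cycles
```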

Processor Performance Without Cache
A 5 GHz processor has a cycle time of 0.2 ns. Memory access time = 100 ns = 500 cycles. Ignoring memory access, clocks per instruction (CPI) = 1. Assuming no memory data accesses (instruction fetch only):
CPI = 1 + # stall cycles = 1 + 500 = 501

Performance with a Level 1 Cache
Assume hit rate h1 = 0.95 and L1 access time = 0.2 ns = 1 cycle.
CPI = 1 + # stall cycles = 1 + 0.05 × 500 = 26
Processor speedup due to the cache = 501/26 = 19.3

Performance with L1 and L2 Caches
Assume: L1 hit rate h1 = 0.95; L2 hit rate h2 = 0.90 (this is very optimistic!); L2 access time = 5 ns = 25 cycles.
CPI = 1 + # stall cycles = 1 + 0.05 × (25 + 0.10 × 500) = 1 + 3.75 = 4.75
Processor speedup due to both caches = 501/4.75 = 105.5
Speedup due to the L2 cache alone = 26/4.75 = 5.47
These three CPI figures are recomputed in the sketch below.
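
A short Python recap of the three CPI calculations (illustrative; the helper name is mine):

```python
def cpi_with_caches(miss_rate_l1, l2_penalty, miss_rate_l2, mem_penalty):
    """Base CPI of 1 plus memory stalls per instruction (fetch only),
    per the slides' model: 1 + m1 * (T2 + m2 * Tm)."""
    return 1 + miss_rate_l1 * (l2_penalty + miss_rate_l2 * mem_penalty)

no_cache = 1 + 500                                   # every fetch goes to memory
l1_only  = 1 + 0.05 * 500                            # = 26
l1_l2    = cpi_with_caches(0.05, 25, 0.10, 500)      # = 4.75
print(no_cache, l1_only, l1_l2)                      # -> 501 26.0 4.75
print(no_cache / l1_only, no_cache / l1_l2)          # speedups: ~19.3 and ~105.5
```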

Cache Miss Behavior
If the tag bits do not match, a miss occurs. Upon a cache miss: the CPU is stalled, the desired block of data is fetched from memory and placed in the cache, and execution is restarted at the cycle that caused the cache miss. Recall that there are two different types of memory accesses: reads (loads) and writes (stores). Thus, overall we can have four kinds of cache events: read hits, read misses, write hits, and write misses.

Fully-Associative Placement
One alternative to direct-mapped placement is to allow a block to fill any empty place in the cache. How do we then locate the block later? We can associate each stored block with a tag that identifies the block’s home address in main memory. When the block is needed, we use the cache as an associative memory: the tag is matched against all locations in parallel to pull out the appropriate block.

Set-Associative Placement
The block address determines not a single location but a set: several locations grouped together.
(set #) = (Block address) mod (# of sets)
The block can be placed associatively anywhere within that set. Where exactly? That is part of the placement strategy. If there are n locations in each set, the scheme is called “n-way set-associative”. Direct-mapped = 1-way set-associative; fully associative = there is only 1 set.

Replacement Strategies
Which existing block do we replace when a new block comes in?
- With a direct-mapped cache: there is only one choice (the same as placement).
- With a (fully- or set-) associative cache: if any way in the set is empty, pick one of those. Otherwise, there are many possible strategies: (pseudo-)random is simple, fast, and fairly effective; (pseudo-)least-recently-used (LRU) is common. The choice makes little difference in L2 (and higher) caches.

Write Strategies
Most accesses are reads, not writes, especially if instruction fetches are included, so optimize for reads! A direct-mapped cache can return the value before the valid check completes. Writes are more difficult because:
- We can’t write to the cache until we know the right block.
- The object written may have various sizes (1-8 bytes).
When do we synchronize the cache with memory?
- Write-through: write to the cache and to memory. Prone to stalls due to high memory bandwidth requirements.
- Write-back: write to memory upon replacement. Memory may be left out of date for a long time.

Action on Cache Hits vs. Misses
- Read hits: desirable.
- Read misses: stall the CPU, fetch the block from memory, deliver it to the cache, restart.
- Write hits: with write-through, replace the data in the cache and memory at the same time; with write-back, write the data only into the cache, writing it to main memory only when it is replaced.
- Write misses: with no-write-allocate, write the data to memory only; with write-allocate, read the entire block into the cache, then write the word.

Cache Hits vs. Cache Misses
Consider the write-through strategy: every block written to cache is automatically written to memory.
- Pro: simple; memory is always up to date with the cache, so no write-back is required on block replacement.
- Con: creates lots of extra traffic on the memory bus, and the write hit time may be increased if the CPU must wait for the bus.
One solution to the write-time problem is a write buffer that holds the data while it is waiting to be written to memory; after storing the data in the cache and the write buffer, the processor can continue execution.
Alternatively, a write-back strategy writes data to main memory only when a block is replaced.
- Pro: reduces the memory bandwidth used by writes.
- Con: complicates multiprocessor systems.

Hit/Miss Rate, Hit Time, Miss Penalty
- The hit rate (or hit ratio) is the fraction of memory accesses found in the upper level.
- The miss rate (= 1 – hit rate) is the fraction of memory accesses not found in the upper level.
- The hit time is the time to access the upper level of the memory hierarchy, including the time needed to determine whether the access is a hit or a miss.
- The miss penalty is the time needed to replace a block in the upper level with the corresponding block from the lower level; it may include the time to write back an evicted block.

Cache Performance Analysis
Performance is always a key issue for caches. We consider improving cache performance by (1) reducing the miss rate and (2) reducing the miss penalty. For (1), we can reduce the probability that different memory blocks will contend for the same cache location. For (2), we can add additional levels to the hierarchy, which is called multilevel caching. We can determine the CPU time as
CPU time = (CPU execution clock cycles + Memory-stall clock cycles) × Clock cycle time

Cache Performance
The memory-stall clock cycles come from cache misses and can be defined as the sum of the stall cycles coming from writes plus those coming from reads:
Memory-stall clock cycles = Read-stall cycles + Write-stall cycles, where
Read-stall cycles = Reads × Read miss rate × Read miss penalty
Write-stall cycles = Writes × Write miss rate × Write miss penalty (plus any write buffer stalls)

Cache Performance Formulas
Useful formulas for analyzing ISA/cache interactions:
(CPU time) = [(CPU cycles) + (Memory stall cycles)] × (Clock cycle time)
(Memory stall cycles) = (Instruction count) × (Accesses per instruction) × (Miss rate) × (Miss penalty)
But these are not the best measure for cache design by themselves:
- They focus on time per program, not per access; since accesses-per-program isn’t up to the cache design, we can limit our attention to individual accesses.
- They neglect the hit penalty: cache design may affect the number of cycles taken even by a cache hit.
- They neglect cycle length, which may be impacted by a poor cache design.

More Cache Performance Metrics
We can split access time into instructions and data:
Avg. memory access time = (% instruction accesses) × (instruction memory access time) + (% data accesses) × (data memory access time)
Another simple formula, useful for exploring ISA changes:
CPU time = (CPU execution clock cycles + Memory stall clock cycles) × cycle time
We can also break stalls into reads and writes:
Memory stall cycles = (Reads × read miss rate × read miss penalty) + (Writes × write miss rate × write miss penalty)

Factoring out Instruction Count
Lumping together reads and writes gives:
Memory stall cycles = (Instruction count) × (Memory accesses per instruction) × (Miss rate) × (Miss penalty)
We may replace (Memory accesses per instruction) × (Miss rate) with (Misses per instruction), so that miss rates aren’t affected by redundant accesses to the same location within an instruction:
Memory stall cycles = (Instruction count) × (Misses per instruction) × (Miss penalty)
The sketch below evaluates these expressions.
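
For concreteness, a hedged Python evaluation of the stall-cycle and CPU-time formulas (all parameter values below are made-up sample inputs, not from the slides):

```python
def memory_stall_cycles(instr_count, accesses_per_instr, miss_rate, miss_penalty):
    """Memory stall cycles = IC * (accesses/instr) * miss rate * miss penalty."""
    return instr_count * accesses_per_instr * miss_rate * miss_penalty

def cpu_time(instr_count, base_cpi, stall_cycles, cycle_time_ns):
    """CPU time = (CPU execution cycles + memory stall cycles) * cycle time."""
    return (instr_count * base_cpi + stall_cycles) * cycle_time_ns

# Sample inputs: 1M instructions, 1.2 accesses/instruction, 5% miss rate,
# 100-cycle miss penalty, base CPI of 1, 0.2 ns cycle time.
stalls = memory_stall_cycles(1_000_000, 1.2, 0.05, 100)   # -> 6,000,000 cycles
print(stalls, cpu_time(1_000_000, 1, stalls, 0.2), "ns")  # -> 6000000.0 1400000.0 ns
```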

Improving Cache Performance
Consider the cache performance equation:
(Average memory access time) = (Hit time) + (Miss rate) × (Miss penalty)
The product (Miss rate) × (Miss penalty) is the amortized miss penalty. It obviously follows that there are three basic ways to improve cache performance:
A. Reducing the miss rate (reduces the amortized miss penalty)
B. Reducing the miss penalty (reduces the amortized miss penalty)
C. Reducing the hit time
Note that by Amdahl’s Law, there will be diminishing returns from reducing only the hit time or the amortized miss penalty by itself, instead of both together.

AMD Opteron Microprocessor
(Figure: cache hierarchy of the AMD Opteron. Split L1 caches, 64 KB each, 64-byte blocks, write-back; L2 cache, 1 MB, 64-byte blocks, write-back.)