Memory Hierarchy
If anyone has additional comments, please speak up.

Processor-Memory Gap Memory is an active area of computer architecture research because of the expanding gap between processor speed and memory speed. [Figure: processor vs. DRAM performance on a log scale, 1980-2000; processor performance improves about 60% per year while DRAM improves about 7% per year, so the processor-memory performance gap grows roughly 50% per year.]

Technology Trends
            Capacity           Speed (latency)
Logic:      2x in 3 years      2x in 3 years
DRAM:       4x in 3 years      2x in 10 years
Disk:       4x in 3 years      2x in 10 years

DRAM Generations
Year    Size       Cycle Time
1980    64 Kb      250 ns
1983    256 Kb     220 ns
1986    1 Mb       190 ns
1989    4 Mb       165 ns
1992    16 Mb      120 ns
1996    64 Mb      110 ns
1998    128 Mb     100 ns
2000    256 Mb     90 ns
2002    512 Mb     80 ns
2006    1024 Mb    60 ns

From 1980 to 2006: roughly 16000:1 improvement in capacity, but only about 4:1 in latency.

Processor-DRAM Performance Gap Impact: Example To illustrate the performance impact, assume a single-issue pipelined CPU with CPI = 1 using non-ideal memory. The minimum cost of a full memory access, in wasted CPU cycles (or instructions), is:

Year    CPU speed (MHz)    CPU cycle (ns)    Memory access (ns)    Minimum CPU cycles (or instructions) wasted
1986          8                125                190              190/125  - 1 =   0.5
1989         33                 30                165              165/30   - 1 =   4.5
1992         60               16.6                120              120/16.6 - 1 =   6.2
1996        200                  5                110              110/5    - 1 =  21
1998        300               3.33                100              100/3.33 - 1 =  29
2000       1000                  1                 90              90/1     - 1 =  89
2003       2000                0.5                 80              80/0.5   - 1 = 159
2006       3700               0.27                 60              60/0.27  - 1 = 221
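
As a quick check on the table, the wasted-cycles column can be recomputed from just the clock rate and the memory access time. This is a minimal sketch, not part of the original slides; the numbers are taken from the table above.

    #include <stdio.h>

    int main(void) {
        /* {year, CPU clock in MHz, memory access time in ns}, from the table above */
        struct { int year; double mhz; double access_ns; } rows[] = {
            {1986, 8, 190}, {1989, 33, 165}, {1992, 60, 120}, {1996, 200, 110},
            {1998, 300, 100}, {2000, 1000, 90}, {2003, 2000, 80}, {2006, 3700, 60},
        };
        for (int i = 0; i < (int)(sizeof rows / sizeof rows[0]); i++) {
            double cycle_ns = 1000.0 / rows[i].mhz;               /* CPU cycle time in ns */
            double wasted   = rows[i].access_ns / cycle_ns - 1.0; /* cycles lost per full access */
            printf("%d: %.1f wasted CPU cycles per memory access\n", rows[i].year, wasted);
        }
        return 0;
    }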

Pentium 4 Cache Hierarchy Moving outward from the processor: L1 I (12Ki) and L1 D (8 KiB) at 2 cycles; L2 cache (512 KiB) at 19 cycles; L3 cache (2 MiB) at 43 cycles; memory at 206 cycles.

Main Memory Main memory generally uses Dynamic RAM (DRAM), which stores each bit with a single transistor but requires a periodic data refresh (roughly every 8 ms). Cache uses SRAM (Static Random Access Memory): no refresh, but 6 transistors per bit versus 1 for DRAM. Typical ratios: size DRAM/SRAM about 4-8; cost and cycle time SRAM/DRAM about 8-16. Main memory performance is characterized by: Access time: the time between a memory access request and the requested data becoming available to the cache/CPU. Memory bandwidth: the maximum sustained data transfer rate between main memory and the cache/CPU.

Conventional DRAM Organization A d x w DRAM stores d·w total bits, organized as d supercells of w bits each. [Figure: a 16 x 8 DRAM chip arranged as 4 rows x 4 columns of supercells; the memory controller drives a 2-bit address bus and an 8-bit data bus to/from the CPU, supercell (2,1) is highlighted, and the chip contains an internal row buffer.]

Reading DRAM Supercell (2,1): Step 1 Step 1(a): the row access strobe (RAS) selects row 2. Step 1(b): row 2 is copied from the DRAM array into the internal row buffer. [Figure: the memory controller drives RAS = 2 on the 2-bit address lines of the 16 x 8 DRAM chip.]

Reading DRAM Supercell (2,1): Step 2 Step 2(a): the column access strobe (CAS) selects column 1. Step 2(b): supercell (2,1) is copied from the row buffer onto the 8-bit data lines and eventually travels back to the CPU. [Figure: the memory controller drives CAS = 1 on the address lines, and supercell (2,1) leaves the internal row buffer on the data bus.]
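
The row/column decomposition can be sketched in a few lines of C. This is a hypothetical illustration only, assuming the 16 supercells of the chip above are numbered in row-major order; the constants and names are made up and do not describe a real memory-controller interface.

    #include <stdio.h>

    #define ROWS 4
    #define COLS 4

    int main(void) {
        int supercell = 9;             /* supercell (2,1): row 2, column 1        */
        int row = supercell / COLS;    /* driven on the address pins with RAS     */
        int col = supercell % COLS;    /* driven on the same pins later with CAS  */
        printf("supercell %d -> RAS row %d, CAS col %d\n", supercell, row, col);
        return 0;
    }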

Memory Modules [Figure: a 64 MB memory module built from eight 8M x 8 DRAM chips (DRAM 0 through DRAM 7). For an address addr (row = i, col = j), each chip supplies its supercell (i,j), one byte of a 64-bit doubleword: DRAM 0 provides bits 0-7, DRAM 1 bits 8-15, and so on up to DRAM 7 providing bits 56-63. The memory controller assembles the eight bytes into the 64-bit doubleword at main memory address A.]

Impact on Performance Suppose a processor executes at a clock rate of 200 MHz (5 ns per cycle) with an ideal CPI of 1.1 and an instruction mix of 50% arithmetic/logic, 30% load/store, and 20% control. Suppose also that 10% of memory operations incur a 50-cycle miss penalty. Then: CPI = ideal CPI + average stalls per instruction = 1.1 + (0.30 data memory ops per instruction x 0.10 misses per op x 50 cycles per miss) = 1.1 + 1.5 = 2.6 cycles. So 58% of the time (1.5/2.6) the processor is stalled waiting for memory! A 1% instruction miss rate would add a further 0.5 cycles to the CPI.
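
The arithmetic on this slide is easy to reproduce. The sketch below (not from the original slides) simply evaluates effective CPI = ideal CPI + memory stall cycles per instruction with the numbers given above.

    #include <stdio.h>

    int main(void) {
        double ideal_cpi    = 1.1;
        double ld_st_frac   = 0.30;   /* fraction of instructions that are loads/stores */
        double miss_rate    = 0.10;   /* 10% of memory operations miss                  */
        double miss_penalty = 50.0;   /* cycles per miss                                */

        double data_stalls = ld_st_frac * miss_rate * miss_penalty;   /* = 1.5 cycles */
        double cpi         = ideal_cpi + data_stalls;                 /* = 2.6 cycles */
        printf("effective CPI = %.1f, stalled %.0f%% of the time\n",
               cpi, 100.0 * data_stalls / cpi);
        return 0;
    }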

Memory Hierarchy The idea is to build a memory subsystem that consists of: very small, very fast, very expensive memory "close" to the processor; and larger, slower, but more affordable memory "further away" from the processor. This provides the appearance of virtually unlimited memory while minimizing delays seen by the processor. The memory hierarchy is organized into levels of memory, with the smaller, more expensive, and faster memory levels closer to the CPU: registers, then the primary cache level (L1), then additional secondary cache levels (L2, L3, ...), then main memory, then mass storage (virtual memory).

Levels of The Memory Hierarchy Moving farther away from the CPU, each level has lower cost per bit, higher capacity, longer access time/latency, and lower throughput: Registers: part of the on-chip CPU datapath, 16-256 registers. Cache (static RAM), one or more levels: Level 1 on-chip, 16-64 KB; Level 2 on- or off-chip, 128-512 KB; Level 3 off-chip, 1-64 MB. Main memory: DRAM or RDRAM, 16 MB-16 GB. Magnetic disk (interfaces: SCSI, RAID, IDE, 1394), 4-100 GB. Optical disk or magnetic tape.

A Typical Memory Hierarchy (With Two Levels of Cache) Moving outward from the processor, each level is slower but has larger capacity: processor registers (about 1 ns, 100s of bytes); on-chip level-one cache L1 and second-level cache L2 (SRAM) (about 10s of ns, KBs); main memory (DRAM) (about 100s of ns, MBs); virtual memory / secondary storage (disk) (about 10,000,000 ns, i.e. 10s of ms, GBs); tertiary storage (tape) (about 10,000,000,000 ns, i.e. 10s of seconds, TBs).

Memory Hierarchy: Apple iMac G5
Level      Reg       L1 Inst    L1 Data    L2         DRAM       Disk
Size       1K        64K        32K        512K       256M       80G
Latency    1 cycle   3 cycles   3 cycles   11 cycles  88 cycles  ~10^7 cycles
           0.6 ns    1.9 ns     1.9 ns     6.9 ns     55 ns      12 ms

Registers are managed by the compiler, the caches by hardware, and DRAM/disk by the OS, hardware, and the application. Goal: the illusion of large, fast, cheap memory; let programs address a memory space that scales to the disk size, at a speed that is usually as fast as register access.

iMac’s PowerPC 970: All caches on-chip [Die photo: the (1K) registers, the 64K L1 instruction cache, the 32K L1 data cache, and the 512K L2 cache.]

Case study: Intel Core2 Duo Two cores (Core0, Core1), each with its own L1 caches, sharing a single L2 cache. L1: 32 KB, 8-way, 64 bytes/line, LRU, write-back (WB), 3-cycle latency. L2: 4.0 MB, 16-way, 64 bytes/line, LRU, write-back (WB), 14-cycle latency. Source: http://www.sandpile.org

Memory Hierarchy: Motivation The Principle Of Locality Programs usually access a relatively small portion of their address space (instructions/data) at any instant of time (loops, data arrays). Two types of locality: Temporal locality: if an item is referenced, it will tend to be referenced again soon. Spatial locality: if an item is referenced, items whose addresses are close by will tend to be referenced soon. The presence of locality in program behavior (e.g., loops, data arrays) makes it possible to satisfy a large percentage of program access needs (both instructions and operands) using memory levels with much less capacity than the program address space.

Locality Example

    int sum_array(int a[], int n) {
        int sum = 0;
        for (int i = 0; i < n; i++)
            sum += a[i];    /* stride-1 references to a[]: spatial locality   */
        return sum;         /* sum referenced every iteration: temporal locality */
    }

Data references: array elements are accessed in succession (a stride-1 reference pattern): spatial locality. The variable sum is referenced on each iteration: temporal locality. Instruction references: instructions are executed in sequence: spatial locality; the loop body is cycled through repeatedly: temporal locality.

Memory Hierarchy Operation If an instruction or operand is required by the CPU, the levels of the memory hierarchy are searched for the item starting with the level closest to the CPU (the Level 1 cache): If the item is found, it is delivered to the CPU, resulting in a cache hit. If the item is missing from an upper level, resulting in a miss, the level just below is searched. For systems with several levels of cache, the search continues with cache levels 2, 3, etc. If all levels of cache report a miss, then main memory is accessed. CPU <-> cache <-> memory: managed by hardware. If the item is not found in main memory, resulting in a page fault, then disk (virtual memory) is accessed for the item. Memory <-> disk: managed by hardware and the operating system.
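
The search order can be summarized with a small sketch. The probe_level() helper below is a made-up placeholder standing in for a real cache, memory, or disk lookup; it only illustrates the top-down search described above.

    #include <stdio.h>
    #include <stdbool.h>

    enum level { L1, L2, L3, MAIN_MEMORY, DISK, NUM_LEVELS };
    static const char *names[] = {"L1", "L2", "L3", "main memory", "disk"};

    /* Placeholder lookup: pretend main memory (and below) always hits. */
    static bool probe_level(enum level l, unsigned addr) { (void)addr; return l >= MAIN_MEMORY; }

    int main(void) {
        unsigned addr = 0x1000;                        /* made-up address */
        for (int l = L1; l < NUM_LEVELS; l++) {
            if (probe_level((enum level)l, addr)) {    /* hit: deliver to the CPU */
                printf("found 0x%X in %s\n", addr, names[l]);
                break;
            }                                          /* miss: search the next level down */
        }
        return 0;
    }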

Memory Hierarchy: Terminology Block: the smallest unit of information transferred between two levels. Hit: the item is found in some block in the upper level (example: block X). Hit rate: the fraction of memory accesses found in the upper level. Hit time: the time to access the upper level, which consists of the RAM access time plus the time to determine hit/miss. Miss: the item must be retrieved from a block in the lower level (block Y). Miss rate = 1 - (hit rate). Miss penalty: the time to replace a block in the upper level plus the time to deliver the block to the processor. Hit time << miss penalty. [Figure: the processor exchanges block X with the upper-level memory and block Y with the lower-level memory.]
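
These terms combine into the standard average-memory-access-time formula, AMAT = hit time + miss rate x miss penalty. The slide does not state the formula explicitly, so treat this as a minimal sketch with made-up numbers.

    #include <stdio.h>

    int main(void) {
        double hit_time     = 1.0;    /* cycles to access the upper level (assumed)        */
        double miss_rate    = 0.05;   /* 1 - hit rate (assumed)                            */
        double miss_penalty = 50.0;   /* cycles to bring the block from below (assumed)    */

        double amat = hit_time + miss_rate * miss_penalty;
        printf("average memory access time = %.2f cycles\n", amat);
        return 0;
    }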

Caching in a Memory Hierarchy The smaller, faster, more expensive device at level k caches a subset of the blocks from level k+1; data is copied between the levels in block-sized transfer units. The larger, slower, cheaper storage device at level k+1 is partitioned into blocks. [Figure: level k holds a handful of level k+1's blocks 0-15 (e.g., blocks 8, 9, 14, and 3), with blocks 4 and 10 shown being transferred between the levels.]

General Caching Concepts The program needs object d, which is stored in some block b. Cache hit: the program finds b in the cache at level k (e.g., block 14). Cache miss: b is not at level k, so the level k cache must fetch it from level k+1 (e.g., block 12). If the level k cache is full, then some current block must be replaced (evicted): which one is the "victim"? Placement policy: where can the new block go? (e.g., b mod 4). Replacement policy: which block should be evicted? (e.g., LRU). [Figure: requests for blocks 14 and 12 against a four-block level k cache backed by level k+1's blocks 0-15.]

Cache Design & Operation Issues Q1: Where can a block be placed in cache? (Block placement strategy & Cache organization) Fully Associative, Set Associative, Direct Mapped. Q2: How is a block found if it is in cache? (Block identification) Tag/Block. Q3: Which block should be replaced on a miss? (Block replacement) Random, LRU. Q4: What happens on a write? (Cache write policy) Write through, write back.

Types of Caches: Organization Direct mapped (DM): a memory value can be placed at a single corresponding location in the cache; easy search mechanism. Set-associative (SA): a memory value can be placed in any of a set of locations in the cache; slightly more involved search mechanism. Fully-associative (FA): a memory value can be placed in any location in the cache; extensive hardware resources are required to search (CAM). DM and FA can be thought of as special cases of SA: DM is 1-way SA, and FA is all-way SA.

Cache Organization & Placement Strategies Placement strategies, or mappings of a main memory data block onto cache block frame addresses, divide caches into three organizations: Direct mapped cache: a block can be placed in one location only, given by: (Block address) MOD (Number of blocks in cache). Advantage: it is easy to locate blocks in the cache (there is only one possibility). Disadvantage: certain blocks cannot be simultaneously present in the cache, since blocks that map to the same location can never be cached at the same time.

Cache Organization: Direct Mapped Cache A block can be placed in one location only, given by: (Block address) MOD (Number of blocks in cache). In this case: (Block address) MOD (8). [Figure: a direct-mapped cache with 8 cache block frames in front of 32 cacheable memory blocks; for example, memory block 11101 maps to cache frame (11101) MOD (1000) = 101 in binary.]

Direct Mapping Direct mapping: a memory value can only be placed at the single corresponding location in the cache. [Figure: memory words with tag bits 00000 ... 11111 and a 1-bit index, holding values such as 0x55, 0x0F, 0xAA, and 0xF0; each word maps to the one cache entry selected by its index, with the tag stored alongside the data.]

Cache Organization & Placement Strategies Fully associative cache: A block can be placed anywhere in cache. Advantage: No restriction on the placement of blocks. Any combination of blocks can be simultaneously present in the cache. Disadvantage: Costly (hardware and time) to search for a block in the cache Set associative cache: A block can be placed in a restricted set of places, or cache block frames. A set is a group of block frames in the cache. A block is first mapped onto the set and then it can be placed anywhere within the set. The set in this case is chosen by: (Block address) MOD (Number of sets in cache) If there are n blocks in a set the cache placement is called n-way set-associative. A good compromise between direct mapped and fully associative caches (most processors use this method).
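
A brief sketch of the placement formula above, with made-up values for the block address and cache size: an n-way set-associative cache has (number of frames)/n sets, with direct mapped as the n = 1 case and fully associative as the single-set case.

    #include <stdio.h>

    int main(void) {
        unsigned block_addr = 29;     /* example block address (made up) */
        unsigned num_frames = 8;      /* cache block frames (made up)    */

        unsigned assoc_dm = 1, assoc_sa = 2, assoc_fa = num_frames;
        printf("direct mapped    : set %u of %u\n",
               block_addr % (num_frames / assoc_dm), num_frames / assoc_dm);
        printf("2-way set assoc. : set %u of %u\n",
               block_addr % (num_frames / assoc_sa), num_frames / assoc_sa);
        printf("fully associative: set %u of %u\n",
               block_addr % (num_frames / assoc_fa), num_frames / assoc_fa);
        return 0;
    }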

Cache Organization Example [Figure-only slide.]

Set Associative Mapping (2-Way) Set-associative mapping: a memory value can be placed in any of a set of corresponding locations in the cache. [Figure: memory words with tag bits 0000 ... 1111 and a 2-bit index, holding values such as 0x55, 0x0F, 0xAA, and 0xF0; the index selects a set, and the value may be placed in either way of that set.]

Fully Associative Mapping Fully-associative mapping: a memory value can be placed anywhere in the cache. [Figure: memory words with addresses 000000 ... 111111 holding values such as 0x55, 0x0F, 0xAA, and 0xF0; the full address is stored as the tag, and a value may occupy any cache entry.]

Cache Organization Tradeoff For a given cache size, we can trade off between hit rate and complexity. If L = number of lines (blocks) in the cache, then L = cache size / block size.

Places a block can go    Cache type                Number of sets
1                        Direct mapped             L
n                        n-way set associative     L/n
L                        Fully associative         1

The number of places a block can go is also the number of comparators needed to compare tags.

An Example Assume a direct mapped cache with 4-word blocks and a total size of 16 words. Consider the following string of address references given as word addresses: 1, 4, 8, 5, 20, 17, 19, 56, 9, 11, 4, 43, 5, 6, 9, 17 Show the hits and misses and final cache contents.

[A sequence of figure-only slides steps through the 16 references one at a time, showing which main memory block occupies each of the four cache block frames after every access; for example, address 5 hits in cache block 1. The results are summarized on the next slide.]

Summary Number of hits = 6, number of misses = 10. Hit ratio: 6/16 = 37.5%, which is unacceptable; typical hit ratios are > 90%.
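
A small simulation sketch (not part of the original slides) replays the word-address trace from the example above on a direct-mapped cache with 4-word blocks and 4 block frames, and reproduces the 6 hits and 10 misses.

    #include <stdio.h>

    #define WORDS_PER_BLOCK 4
    #define NUM_FRAMES      4

    int main(void) {
        int trace[] = {1, 4, 8, 5, 20, 17, 19, 56, 9, 11, 4, 43, 5, 6, 9, 17};
        int frame_block[NUM_FRAMES];          /* memory block held by each frame */
        int valid[NUM_FRAMES] = {0};
        int hits = 0, misses = 0;

        for (int i = 0; i < (int)(sizeof trace / sizeof trace[0]); i++) {
            int block = trace[i] / WORDS_PER_BLOCK;     /* memory block number  */
            int frame = block % NUM_FRAMES;             /* direct-mapped frame  */
            if (valid[frame] && frame_block[frame] == block) {
                hits++;
                printf("word %2d: hit  (frame %d, block %d)\n", trace[i], frame, block);
            } else {
                misses++;
                valid[frame] = 1;
                frame_block[frame] = block;             /* replace the old block */
                printf("word %2d: miss (frame %d now holds block %d)\n", trace[i], frame, block);
            }
        }
        printf("hits = %d, misses = %d, hit ratio = %.1f%%\n",
               hits, misses, 100.0 * hits / (hits + misses));
        return 0;
    }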

Locating A Data Block in Cache Each block in the cache has an address tag. The tags of every cache block that might contain the required data are checked in parallel. A valid bit is added to the tag to indicate whether the cache entry holds valid data. The address from the CPU to the cache is divided into: a block address, further divided into an index field to choose a block set in the cache (there is no index field when fully associative) and a tag field used to search for and match addresses in the selected set; and a block offset to select the data within the block. [Address layout: block address (tag, index), then block offset.]

Address Field Sizes The physical address generated by the CPU is divided into a block address (tag + index) and a block offset. Block offset size = log2(block size). Index size = log2(total number of blocks / associativity). Tag size = address size - index size - offset size. Mapping function: cache set or block frame number = index = (Block Address) MOD (Number of Sets).

Locating A Data Block in Cache The block address is split into a tag of r bits and an index of k bits, followed by a block offset of m bits: the index selects one of the 2^k addressable blocks (or sets) in the cache, the tag identifies a unique block, and the offset selects one of the 2^m bytes in a block. Increasing associativity shrinks the index and expands the tag; the block index is not needed for a fully associative cache.

Direct-Mapped Cache Example Suppose we have 16 KB of data in a direct-mapped cache with 4-word blocks. Determine the size of the tag, index, and offset fields, assuming a 32-bit architecture. Offset: we need to specify the correct byte within a block; a block contains 4 words = 16 bytes = 2^4 bytes, so we need 4 bits to specify the correct byte.

Direct-Mapped Cache Example Index: (roughly, an index into an "array of blocks") we need to specify the correct row in the cache. The cache contains 16 KB = 2^14 bytes, and a block contains 2^4 bytes (4 words). Since there is one block per row, # rows/cache = # blocks/cache = (bytes/cache) / (bytes/row) = 2^14 bytes/cache / 2^4 bytes/row = 2^10 rows/cache, so we need 10 bits to specify this many rows.

Direct-Mapped Cache Example Tag: use the remaining bits as the tag. Tag length = memory address length - offset - index = 32 - 4 - 10 = 18 bits, so the tag is the leftmost 18 bits of the memory address.
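
The three field widths follow mechanically from the cache parameters. Here is a minimal sketch that recomputes them for the 16 KB, 4-word-block, direct-mapped example above.

    #include <stdio.h>

    /* Smallest b such that 2^b >= x (exact for the power-of-two sizes used here). */
    static int log2_int(int x) { int b = 0; while ((1 << b) < x) b++; return b; }

    int main(void) {
        int addr_bits   = 32;
        int cache_bytes = 16 * 1024;   /* 16 KB of data         */
        int block_bytes = 4 * 4;       /* 4 words x 4 bytes     */
        int assoc       = 1;           /* direct mapped         */

        int offset_bits = log2_int(block_bytes);                        /* 4  */
        int index_bits  = log2_int(cache_bytes / block_bytes / assoc);  /* 10 */
        int tag_bits    = addr_bits - index_bits - offset_bits;         /* 18 */
        printf("offset = %d bits, index = %d bits, tag = %d bits\n",
               offset_bits, index_bits, tag_bits);
        return 0;
    }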

4KB Direct Mapped Cache Example [Figure: cache datapath showing the 32-bit address split into tag, index, and byte offset, the valid bit, tag, and data arrays, and the hit comparator.] 1K = 1024 blocks, each block = one word. Can cache up to 2^32 bytes = 4 GB of memory. Block address = 30 bits (tag = 20 bits, index = 10 bits), block offset = 2 bits. Mapping function: cache block frame number = (Block address) MOD (1024).

64KB Direct Mapped Cache Example [Figure: cache datapath with a multiplexor (word select) choosing the requested word from the block.] 4K = 4096 blocks, each block = four words = 16 bytes. Can cache up to 2^32 bytes = 4 GB of memory. Block address = 28 bits (tag = 16 bits, index = 12 bits), block offset = 4 bits. Mapping function: cache block frame number = (Block address) MOD (4096). Larger blocks take better advantage of spatial locality.

Cache Organization: Set Associative Cache [Figure: an eight-block cache organized four ways: one-way set associative (direct mapped), two-way set associative, four-way set associative, and eight-way set associative (fully associative), showing how block frames 0-7 are grouped into sets in each case.]

4K Four-Way Set Associative Cache: MIPS Implementation Example [Figure: four-way set-associative cache datapath with a tag comparator per way and a 4-to-1 multiplexor selecting the hit data.] 1024 block frames, each block = one word, 4-way set associative, 256 sets. Can cache up to 2^32 bytes = 4 GB of memory. Block address = 30 bits (tag = 22 bits, index = 8 bits), block offset = 2 bits. Mapping function: cache set number = (Block address) MOD (256).

Another Extreme Example: Fully Associative A fully associative cache forgets about the cache index and compares the cache tags of all cache entries in parallel. Example: with 32-byte blocks, the cache tag is 27 bits (address bits 31-5) and the byte select is 5 bits (e.g., 0x01), so we need N 27-bit comparators for N entries. [Figure: each entry holds a valid bit, a 27-bit cache tag, and 32 bytes of cache data (byte 0 ... byte 31); every stored tag is compared against the address tag simultaneously.]

Cache Replacement Policy When a cache miss occurs, the cache controller may have to select a block of cache data to be removed from a cache block frame and replaced with the requested data. Such a block is selected by one of two methods (for a direct mapped cache, there is only one choice): Random: any block is randomly selected for replacement, providing uniform allocation; simple to build in hardware; the most widely used cache replacement strategy. Least-recently used (LRU): accesses to blocks are recorded, and the block replaced is the one that was not used for the longest period of time. LRU is expensive to implement as the number of blocks to be tracked increases, and is usually approximated.
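
For illustration, true LRU bookkeeping for a single 4-way set can be sketched with per-way timestamps, as below; real hardware usually approximates this, and the access pattern here is made up.

    #include <stdio.h>

    #define WAYS 4

    static unsigned last_used[WAYS];   /* "timestamp" of the last access to each way */
    static unsigned now = 0;

    static void touch(int way)  { last_used[way] = ++now; }
    static int   victim(void) {                 /* way with the oldest timestamp */
        int v = 0;
        for (int w = 1; w < WAYS; w++)
            if (last_used[w] < last_used[v]) v = w;
        return v;
    }

    int main(void) {
        touch(0); touch(1); touch(2); touch(3);   /* fill the set          */
        touch(1); touch(0);                       /* re-use ways 1 and 0   */
        printf("LRU victim is way %d\n", victim());  /* way 2: least recently used */
        return 0;
    }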

Miss Rates for Caches with Different Size, Associativity & Replacement Algorithm Sample data:

Associativity:    2-way              4-way              8-way
Size              LRU      Random    LRU      Random    LRU      Random
16 KB             5.18%    5.69%     4.67%    5.29%     4.39%    4.96%
64 KB             1.88%    2.01%     1.54%    1.66%     1.39%    1.53%
256 KB            1.15%    1.17%     1.13%    1.13%     1.12%    1.12%

Cache Write Strategies We need to keep the cache consistent with main memory. Reads are easy, since they require no modification; for writes, the question is when the update to main memory occurs. Write through: data is written to both the cache block and to a block of main memory. The lower level always has the most up-to-date data, an important feature for I/O and multiprocessing. Easier to implement than write back. A write buffer is often used to reduce CPU write stalls while data is written to memory. Write back: data is written or updated only to the cache block. The modified (dirty) cache block is written to main memory when it is being replaced from the cache. Writes occur at the speed of the cache. A status bit, called the dirty bit, indicates whether the block was modified while in the cache; if not, the block is not written back to main memory. Uses less memory bandwidth than write through.

Write-through Policy [Figure: the processor writes 0x1234; both the cache block and the memory block are updated to 0x1234.]

Write-back Policy [Figure: the cache and memory initially hold 0x1234; the processor writes 0x9ABC, which updates only the cache block, and memory keeps the old value until the block is written back.]
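
A simplified sketch (not the slides' figures) contrasting the two policies for a single cached word, reusing the values 0x1234 and 0x9ABC from the figures above: with write-through, every store also updates memory; with write-back, the store only marks the block dirty, and memory is updated when the block is evicted.

    #include <stdio.h>
    #include <stdbool.h>

    static int  memory_word = 0;
    static int  cache_word  = 0;
    static bool dirty       = false;

    static void store_write_through(int v) { cache_word = v; memory_word = v; }
    static void store_write_back(int v)    { cache_word = v; dirty = true;    }
    static void evict_write_back(void)     { if (dirty) { memory_word = cache_word; dirty = false; } }

    int main(void) {
        store_write_through(0x1234);
        printf("write-through              : cache=0x%X memory=0x%X\n",
               (unsigned)cache_word, (unsigned)memory_word);

        store_write_back(0x9ABC);
        printf("write-back (before evict)  : cache=0x%X memory=0x%X\n",
               (unsigned)cache_word, (unsigned)memory_word);
        evict_write_back();
        printf("write-back (after evict)   : cache=0x%X memory=0x%X\n",
               (unsigned)cache_word, (unsigned)memory_word);
        return 0;
    }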

Write Buffer for Write Through A write buffer is needed between the cache and memory: the processor writes data into the cache and the write buffer, and the memory controller writes the contents of the buffer to memory. The write buffer is just a FIFO, with a typical number of entries of 4. It works fine as long as the store frequency (with respect to time) << 1 / DRAM write cycle. The memory system designer's nightmare is a store frequency (with respect to time) > 1 / DRAM write cycle, which causes write buffer saturation. (Speaker notes: we really don't write to memory directly; we write to the write buffer. Once the data is in the write buffer and assuming a cache hit, the CPU is done with the write, and the memory controller moves the buffer's contents to memory behind the scenes. The DRAM cycle time sets an upper limit on how frequently you can write to main memory; if stores come too close together, or the CPU is much faster than the DRAM cycle time, the write buffer overflows and the CPU must stall.)

Unified vs. Separate Level 1 Cache Unified level 1 cache (Princeton memory architecture): a single level 1 cache is used for both instructions and data. Separate instruction/data level 1 caches (Harvard memory architecture): the level 1 (L1) cache is split into two caches, one for instructions (the L1 I-cache) and one for data (the L1 D-cache). [Figure: a processor (control, datapath, registers) backed by a unified L1 cache, versus a processor backed by separate L1 I-cache and D-cache.]