CS 161 Ch 7: Memory Hierarchy, LECTURE 15. Instructor: L.N. Bhuyan (www.cs.ucr.edu/~bhuyan). © 1999 UCB
Direct-mapped Cache Contd.
°The direct-mapped cache is simple to design, and its access time is fast (Why?)
°Good for L1 (on-chip) cache
°Problem: conflict misses, hence a low hit ratio
 - Conflict misses are misses caused by accessing different memory locations that map to the same cache index
 - In a direct-mapped cache there is no flexibility in where a memory block can be placed, which contributes to conflict misses
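
To make the conflict-miss mechanism concrete, here is a minimal sketch (not from the slides; the 1 KB cache geometry is invented for illustration) of how a direct-mapped cache computes its index, and how two addresses that differ by a multiple of the cache size collide:

    # Direct-mapped lookup: index = (address / block_size) mod num_blocks
    # Hypothetical geometry: 64 blocks of 16 bytes = 1 KB cache.
    BLOCK_SIZE = 16
    NUM_BLOCKS = 64

    def dm_index(address):
        return (address // BLOCK_SIZE) % NUM_BLOCKS

    a, b = 0x0040, 0x0440            # differ by exactly 1 KB (the cache size)
    print(dm_index(a), dm_index(b))  # both print 4: alternating accesses to a
                                     # and b evict each other -> conflict misses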

Another Extreme: Fully Associative
°Fully Associative Cache (8-word block)
 - Omit cache index; place the item in any block!
 - Compare all cache tags in parallel
°By definition: conflict misses = 0 for a fully associative cache
[Figure: fully associative cache with a byte-offset field, 27-bit cache tags, valid bits, and one comparator per tag for the parallel match]

Fully Associative Cache
°Must search all tags in the cache, as an item can be in any cache block
°The tag search must be done by hardware in parallel (other searches would be too slow)
°But the necessary parallel comparator hardware is very expensive
°Therefore, fully associative placement is practical only for a very small cache

Compromise: N-way Set Associative Cache
°N-way set associative: N cache blocks for each cache index
 - Like having N direct-mapped caches operating in parallel
 - Select the one that gets a hit
°Example: 2-way set associative cache
 - The cache index selects a “set” of 2 blocks from the cache
 - The 2 tags in the set are compared in parallel
 - Data is selected based on which tag matched the address

Example: 2-way Set Associative Cache
[Figure: the address splits into tag / index / offset; the index selects one set, the two tags in that set are compared against the address tag in parallel, and a mux selects the data from the block that hits]

Set Associative Cache Contd.
°Direct mapped and fully associative can be seen as just variations of the set-associative block placement strategy
°Direct mapped = 1-way set associative cache
°Fully associative = n-way set associative for a cache with exactly n blocks
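
A small sketch (parameters invented for illustration, not from the slides) makes the “one strategy, three variations” point concrete: the same tag/index/offset split covers all three organizations, with associativity 1 giving direct mapped and associativity = number of blocks giving fully associative (the index field disappears):

    def split_address(address, block_size, num_blocks, ways):
        # Decompose an address for an N-way set-associative cache.
        num_sets = num_blocks // ways            # fully associative: 1 set
        offset = address % block_size
        set_index = (address // block_size) % num_sets
        tag = address // (block_size * num_sets)
        return tag, set_index, offset

    # 32 blocks of 16 bytes; direct mapped (1-way), 2-way, fully associative (32-way)
    for ways in (1, 2, 32):
        print(ways, split_address(0x1234, 16, 32, ways))
    # 1-way:  tag 9,   index 3, offset 4
    # 2-way:  tag 18,  index 3, offset 4
    # 32-way: tag 291, index 0, offset 4  (no index bits left)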


Block Replacement Policy
°N-way set associative and fully associative caches have a choice of where to place a block (and of which block to replace)
 - Of course, if there is an invalid block, use it
°Whenever there is a cache hit, record which cache block was touched
°When a cache block must be evicted, choose one that hasn’t been touched recently: “Least Recently Used” (LRU)
 - Past is prologue: history suggests the LRU block is the least likely of the choices to be used soon
 - The flip side of temporal locality
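
As a minimal software sketch of the bookkeeping (real caches approximate LRU in hardware; this is only illustrative): keep each set’s tags in recency order, touch on every hit, and evict the stale end on a miss.

    from collections import OrderedDict

    class LRUSet:
        # Tags of one cache set, ordered least -> most recently used.
        def __init__(self, ways):
            self.ways = ways
            self.tags = OrderedDict()

        def access(self, tag):
            if tag in self.tags:             # hit: record the touch
                self.tags.move_to_end(tag)
                return True
            if len(self.tags) == self.ways:  # full set: evict the LRU tag
                self.tags.popitem(last=False)
            self.tags[tag] = True            # fill (invalid entries are used first)
            return False

    s = LRUSet(ways=2)
    print([s.access(t) for t in (7, 9, 7, 3, 9)])
    # [False, False, True, False, False]: the last access to 9 misses,
    # because 9 was least recently used when 3 arrived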

Block Replacement Policy: Random
°It is sometimes hard to keep track of the LRU block if there are many choices
 - How hard is it for 2-way associativity?
°A second choice of policy: pick a block at random and replace it
°Advantages
 - Very simple to implement
 - Predictable behavior
 - No worst-case behavior

What about Writes to Memory?
°Suppose we write data only to the cache?
 - Main memory and the cache would then be inconsistent - this cannot be allowed
°Simplest policy: the information is written both to the block in the cache and to the block in the lower-level memory (write-through)
°Problem: writes operate at the speed of the lower-level memory!

Improving Cache Performance: Write Buffer
°A write buffer is added between the cache and memory
 - Processor: writes data into the cache and the write buffer
 - Memory controller: writes buffer contents to memory
°The write buffer is just a first-in first-out (FIFO) queue
 - Typical number of entries: 4 to 10
 - Works fine if: store frequency (w.r.t. time) << 1 / DRAM write cycle time
[Figure: Processor -> Cache, with a Write Buffer between them and DRAM]
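
A sketch of the two halves of the mechanism (the depth, DRAM model, and names cpu_store/drain_one are hypothetical, chosen for illustration):

    from collections import deque

    WRITE_BUFFER_DEPTH = 4           # typical: 4 to 10 entries
    write_buffer = deque()           # the FIFO between cache and DRAM
    dram = {}

    def dram_write(address, value):  # stand-in for a slow DRAM write cycle
        dram[address] = value

    def drain_one():                 # memory-controller side: retire the oldest write
        address, value = write_buffer.popleft()
        dram_write(address, value)

    def cpu_store(address, value, cache):
        # Write-through store: update the cache, then queue the memory write.
        cache[address] = value
        if len(write_buffer) >= WRITE_BUFFER_DEPTH:
            drain_one()              # the processor waits only when the FIFO is full
        write_buffer.append((address, value))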

Improving Cache Performance: Write Back
°Option 2: data is written only to the cache block
°A modified cache block is written to main memory only when it is replaced
 - A block is either unmodified (clean) or modified (dirty)
°This scheme is called “write back”
 - Advantage? Repeated writes to the same block stay in the cache
 - Disadvantage? More complex to implement
°Write back is standard for the Pentium Pro, optional for the PowerPC 604
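
In sketch form (illustrative, not a hardware description), a dirty bit per block is all that distinguishes write back from write-through: stores only mark the block, and memory is touched only on eviction.

    class Block:
        def __init__(self, tag, data):
            self.tag, self.data, self.dirty = tag, data, False

    def store(block, value):
        block.data = value
        block.dirty = True     # repeated writes to this block stay in the cache

    def evict(block, memory):
        if block.dirty:        # write back only modified (dirty) blocks
            memory[block.tag] = block.data
        # clean blocks are simply dropped - memory already has their data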

Improving Caches
°In general, we want to minimize the Average Access Time:
 Average Access Time = Hit Time x (1 - Miss Rate) + Miss Penalty x Miss Rate
 (recall Hit Time << Miss Penalty)
°So far, to reduce the miss rate, we have looked at
 - Larger block size
 - Larger cache
 - Higher associativity
°What else can reduce the miss penalty? Add a second-level (L2) cache.
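
The formula as a one-liner, handy for checking the examples that follow (the sample numbers in the call are made up):

    def avg_access_time(hit_time, miss_rate, miss_penalty):
        # Average Access Time = Hit Time x (1 - Miss Rate) + Miss Penalty x Miss Rate
        return hit_time * (1 - miss_rate) + miss_penalty * miss_rate

    print(avg_access_time(1, 0.05, 20))   # 1-cycle hits, 5% misses at 20 cycles -> 1.95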

Current Memory Hierarchy
[Figure: processor (control + datapath + registers) connected to an L1 cache, an L2 cache, main memory, and secondary memory]

              Regs   L1 cache  L2 cache  Main memory  Secondary memory
 Speed (ns):  0.5    2         6         100          10,000,000
 Cost ($/MB): --     $100      $30       $1           $0.05
 Technology:  Regs   SRAM      SRAM      DRAM         Disk

How do we calculate the miss penalty?
°Access time = L1 hit time * L1 hit rate + L1 miss penalty * L1 miss rate
°We simply take the L1 miss penalty to be the access time of the L2 cache
°Access time = L1 hit time * L1 hit rate + (L2 hit time * L2 hit rate + L2 miss penalty * L2 miss rate) * L1 miss rate

Do the numbers for the L2 Cache
°Assumptions:
 - L1 hit time = 1 cycle, L1 hit rate = 90%
 - L2 hit time (also the L1 miss penalty) = 4 cycles, L2 miss penalty = 100 cycles, L2 hit rate = 90%
°Access time = L1 hit time * L1 hit rate + (L2 hit time * L2 hit rate + L2 miss penalty * (1 - L2 hit rate)) * L1 miss rate
 = 1 * 0.9 + (4 * 0.9 + 100 * 0.1) * (1 - 0.9)
 = 0.9 + (13.6) * 0.1 = 2.26 clock cycles

What would it be without the L2 cache?
°Assume that the L1 miss penalty would be 100 clock cycles
°1 * 0.9 + 100 * 0.1
°10.9 clock cycles vs. 2.26 with the L2
°So we gain a real benefit from having the second, larger cache before main memory
°Today’s L1 cache sizes: 16 KB-64 KB; L2 caches may be 512 KB to 4096 KB
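
A quick script to verify both results above (all values taken from the two slides):

    l1_hit, l1_rate = 1, 0.9
    l2_hit, l2_rate = 4, 0.9       # L2 hit time = L1 miss penalty
    l2_miss_penalty = 100

    with_l2 = l1_hit * l1_rate + \
              (l2_hit * l2_rate + l2_miss_penalty * (1 - l2_rate)) * (1 - l1_rate)
    without_l2 = l1_hit * l1_rate + 100 * (1 - l1_rate)

    print(with_l2, without_l2)     # ~2.26 vs. ~10.9 clock cycles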

An Example
Q: Suppose we have a processor with a base CPI of 1.0, assuming all references hit in the primary cache, and a clock rate of 500 MHz. The main memory access time is 200 ns. Suppose the miss rate per instruction is 5%. What is the revised CPI? How much faster will the machine run if we add a secondary cache (with 20-ns access time) that reduces the miss rate to memory to 2%? Assume the same access time for a hit or a miss.
A: Miss penalty to main memory = 200 ns = 100 cycles. Total CPI = Base CPI + memory-stall cycles per instruction. Hence, revised CPI = 1.0 + 5% x 100 = 6.0
When an L2 with 20-ns (10-cycle) access time is added, the miss rate to memory is reduced to 2%. So, of the 5% L1 misses, 3% hit in the L2 and 2% miss to memory. The CPI is reduced to 1.0 + 5% x 10 + 2% x 100 = 3.5. Thus, the machine with the secondary cache is faster by 6.0/3.5 = 1.7
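
The same arithmetic as a script (numbers straight from the example):

    base_cpi = 1.0
    mem_penalty = 100        # 200 ns main memory at 500 MHz = 100 cycles
    l2_penalty = 10          # 20 ns L2 = 10 cycles

    cpi_no_l2 = base_cpi + 0.05 * mem_penalty                        # 6.0
    cpi_with_l2 = base_cpi + 0.05 * l2_penalty + 0.02 * mem_penalty  # 3.5
    print(cpi_no_l2 / cpi_with_l2)                                   # speedup ~1.7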

The Three Cs in the Memory Hierarchy
°Cache misses fall into three classes:
 - Compulsory misses: caused by the first access to a block - small, and fixed independent of cache size
 - Capacity misses: occur because the cache cannot contain all the blocks, due to its limited size - reduced by increasing the cache size
 - Conflict misses: occur because multiple blocks compete for the same block frame or set in the cache; also called collision misses - reduced by increasing associativity
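
The three classes can actually be measured: a miss is compulsory if the block was never referenced before, capacity if even a fully associative LRU cache of the same size would miss on it, and conflict otherwise. A small simulator sketch (illustrative only; block-granularity trace, LRU everywhere):

    from collections import OrderedDict

    def classify_misses(trace, num_blocks, ways):
        # Label each block-address reference hit / compulsory / capacity / conflict.
        seen = set()                                 # blocks ever referenced
        fa = OrderedDict()                           # fully associative LRU, same capacity
        sets = [OrderedDict() for _ in range(num_blocks // ways)]
        labels = []
        for blk in trace:
            s = sets[blk % len(sets)]
            hit, fa_hit = blk in s, blk in fa
            if hit:                                  # update the real cache (set-local LRU)
                s.move_to_end(blk)
            else:
                if len(s) == ways:
                    s.popitem(last=False)
                s[blk] = True
            if fa_hit:                               # update the fully associative reference
                fa.move_to_end(blk)
            else:
                if len(fa) == num_blocks:
                    fa.popitem(last=False)
                fa[blk] = True
            if hit:
                labels.append("hit")
            elif blk not in seen:
                labels.append("compulsory")          # first touch ever
            elif not fa_hit:
                labels.append("capacity")            # full associativity misses too
            else:
                labels.append("conflict")            # only the mapping caused it
            seen.add(blk)
        return labels

    # 4-block direct-mapped cache: blocks 0 and 4 collide in set 0
    print(classify_misses([0, 4, 0, 4], num_blocks=4, ways=1))
    # ['compulsory', 'compulsory', 'conflict', 'conflict']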

3Cs Absolute Miss Rate (SPEC92)
[Figure: absolute miss rate vs. cache size for SPEC92, decomposed into compulsory, capacity, and conflict components]


Unified vs. Split Caches
°Example: 16 KB I&D: instruction miss rate = 0.64%, data miss rate = 6.47%; 32 KB unified: aggregate miss rate = 1.99%
°Which is better (ignoring the L2 cache)?
 - Assume 33% data ops => 75% of accesses are instruction fetches (1.0/1.33)
 - hit time = 1, miss time = 50
 - Note that a data hit incurs 1 extra stall in the unified cache (it has only one port)
 AMAT_Harvard = 75% x (1 + 0.64% x 50) + 25% x (1 + 6.47% x 50) = 2.05
 AMAT_Unified = 75% x (1 + 1.99% x 50) + 25% x (1 + 1 + 1.99% x 50) = 2.24
[Figure: a split (Harvard) organization with separate L1 I- and D-caches vs. a unified L1 cache, each backed by a unified L2]
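
Checking both AMATs in code (numbers from the example; the extra +1 on unified data accesses is the single-port stall):

    instr_frac, data_frac = 0.75, 0.25
    miss_time = 50

    amat_split = instr_frac * (1 + 0.0064 * miss_time) \
               + data_frac * (1 + 0.0647 * miss_time)        # ~2.05
    amat_unified = instr_frac * (1 + 0.0199 * miss_time) \
                 + data_frac * (1 + 1 + 0.0199 * miss_time)  # ~2.24
    print(amat_split, amat_unified)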

Static RAM (SRAM)
°Six transistors in a cross-connected fashion
 - Provides regular AND inverted outputs
 - Implemented in a CMOS process
[Figure: single-port 6-T SRAM cell]

Dynamic Random Access Memory - DRAM
°DRAM organization is similar to SRAM, except that each bit of DRAM is constructed using a pass transistor and a capacitor, shown on the next slide
°Fewer transistors per bit gives higher density, but the slow discharge through the capacitor makes access slower.
°The capacitor needs to be recharged, or refreshed, giving rise to a high cycle time.
°Uses a two-level decoder, as shown later. Note that 2048 bits are accessed per row, but only one bit is used.

Dynamic RAM
°SRAM cells exhibit high speed but poor density
°DRAM: simple transistor/capacitor pairs in high-density form
[Figure: DRAM cell - word line, pass transistor, storage capacitor C, bit line, and sense amp]

DRAM logical organization (4 Mbit)
°Square root of the bits per RAS/CAS
[Figure: 2,048 x 2,048 memory array with a row decoder driving word lines, sense amps & I/O, and a column decoder; address bits A0…A10 are multiplexed between row and column]
°Access time of DRAM = row access time + column access time + refreshing

Main Memory Organizations (Fig. 7.13)
[Figure: (a) one-word-wide memory organization - CPU, cache, bus, memory; (b) wide memory organization - a multiplexor between the cache and a wide memory; (c) interleaved memory organization - four memory banks on a one-word bus]
°DRAM access time >> bus transfer time

Memory Access Time Example
°Assume that it takes 1 cycle to send the address, 15 cycles for each DRAM access, and 1 cycle to send a word of data.
°Assuming a cache block of 4 words and a one-word-wide DRAM (Fig. 7.13a), miss penalty = 1 + 4x15 + 4x1 = 65 cycles
°With a main memory and bus width of 2 words (Fig. 7.13b), miss penalty = 1 + 2x15 + 2x1 = 33 cycles. For a 4-word-wide memory, the miss penalty is 17 cycles. This is expensive due to the wide bus and control circuits.
°With interleaved memory of 4 memory banks and the same one-word bus width (Fig. 7.13c), miss penalty = 1 + 1x15 + 4x1 = 20 cycles. The memory controller must supply consecutive addresses to the different memory banks. Interleaving is universally adopted in high-performance computers.
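
All four miss penalties fall out of one timing model (a sketch that encodes just this slide's assumptions: 1 cycle to send the address, 15 cycles per DRAM access, 1 cycle per bus transfer):

    ADDR, DRAM, BUS = 1, 15, 1   # cycles
    BLOCK_WORDS = 4              # cache block size in words

    def miss_penalty(width_words, banks=1):
        if banks > 1:            # interleaved: DRAM accesses overlap,
                                 # words return one bus cycle apart
            return ADDR + DRAM + BLOCK_WORDS * BUS
        n = BLOCK_WORDS // width_words   # sequential DRAM accesses = bus transfers
        return ADDR + n * DRAM + n * BUS

    print(miss_penalty(1))           # one-word wide:     1 + 4x15 + 4x1 = 65
    print(miss_penalty(2))           # two-word wide:     1 + 2x15 + 2x1 = 33
    print(miss_penalty(4))           # four-word wide:    1 + 1x15 + 1x1 = 17
    print(miss_penalty(1, banks=4))  # 4-way interleaved: 1 + 15 + 4x1 = 20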