Memory Hierarchy: Performance
CSCE430/830 Computer Architecture
Lecturer: Prof. Hong Jiang
Courtesy of Yifeng Zhu (U. Maine), Fall 2006
Portions of these slides are derived from: Dave Patterson © UCB

Cache Operation
Insert a cache between the CPU and main memory, implemented with fast static RAM. It holds some of a program's
–data
–instructions
Operation:
–Hit: data found in the cache (no penalty)
–Miss: data not in the cache (miss penalty; the block is fetched from DRAM main memory)
[Diagram: Processor <-> Cache <-> Main memory (DRAM), connected by address and data lines.]

Cache Performance Measures
Hit rate: fraction of accesses found in the cache
–so high that we usually talk about Miss rate = 1 - Hit rate
Hit time: time to access the cache
Miss penalty: time to replace a block from the lower level, including the time to deliver it to the CPU
–access time: time to access the lower level
–transfer time: time to transfer the block
Average memory-access time (AMAT) = Hit time + Miss rate x Miss penalty (ns or clocks)
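
As a quick check of the formula, here is a minimal C sketch (the function name and the sample numbers below are our own illustration, not from the deck):

#include <stdio.h>

/* AMAT = hit time + miss rate * miss penalty */
static double amat(double hit_time, double miss_rate, double miss_penalty) {
    return hit_time + miss_rate * miss_penalty;
}

int main(void) {
    /* Illustrative numbers: 1-cycle hit, 2% miss rate,
       25-cycle miss penalty. */
    printf("AMAT = %.2f cycles\n", amat(1.0, 0.02, 25.0)); /* 1.50 */
    return 0;
}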

Memory Hierarchy Motivation: The Principle of Locality
Programs usually access a relatively small portion of their address space (instructions/data) at any instant of time (the program working set), as a result of access locality.
Two types of access locality:
–Temporal locality: if an item is referenced, it will tend to be referenced again soon.
»e.g., the instructions in the body of a loop
–Spatial locality: if an item is referenced, items whose addresses are close by will tend to be referenced soon.
»e.g., sequential instruction execution, sequential access to the elements of an array (see the sketch below)
The presence of locality in program behavior makes it possible to satisfy a large percentage of program memory accesses (both instructions and operands) from faster memory levels with much less capacity than the program address space.
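
A small C illustration of the two kinds of locality (our own sketch, not from the deck): the row-major loop walks adjacent addresses, while the column-major loop strides across cache blocks.

#include <stdio.h>

#define N 512

static int a[N][N];

int main(void) {
    long sum = 0;

    /* Row-major traversal: consecutive iterations touch adjacent
       addresses (spatial locality), and sum/i/j are reused on every
       iteration (temporal locality). */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];

    /* Column-major traversal touches addresses N*sizeof(int) apart,
       so it gets far less benefit from each cached block. */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];

    printf("%ld\n", sum);
    return 0;
}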

Fundamental Questions
Q1: Where can a block be placed in the upper level? (block placement)
Q2: How is a block found if it is in the upper level? (block identification)
Q3: Which block should be replaced on a miss? (block replacement)
Q4: What happens on a write? (write strategy)

Basic Cache Design
Organized into blocks or lines.
Block contents:
–tag: extra bits that identify the block (part of the block address)
–data: data or instruction words from contiguous memory locations
Our example:
–one-word (4-byte) block size
–30-bit tag
–two blocks in the cache (b0 and b1)
[Diagram: CPU connected to a two-entry cache, each entry holding a tag and a data word, backed by main memory at addresses 0x00, 0x04, 0x08, 0x0C, 0x10, ...]
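
To make the address split concrete, here is a hedged C sketch of how a direct-mapped cache would decompose this example's addresses (the parameter names are ours; note that with 2 offset bits and 1 index bit the tag works out to 29 bits, so read the slide's 30-bit figure as approximate):

#include <stdio.h>

/* Direct-mapped address split for the example cache: 4-byte blocks
   (2 offset bits) and 2 blocks (1 index bit). */
enum { OFFSET_BITS = 2, INDEX_BITS = 1, NUM_BLOCKS = 1 << INDEX_BITS };

int main(void) {
    unsigned addrs[] = {0x00, 0x04, 0x08, 0x0C};
    for (unsigned k = 0; k < sizeof addrs / sizeof addrs[0]; k++) {
        unsigned a = addrs[k];
        unsigned offset = a & ((1u << OFFSET_BITS) - 1);
        unsigned index  = (a >> OFFSET_BITS) & (NUM_BLOCKS - 1);
        unsigned tag    = a >> (OFFSET_BITS + INDEX_BITS);
        printf("addr 0x%02X -> tag 0x%X, block b%u, offset %u\n",
               a, tag, index, offset);
    }
    return 0;
}

Running it shows 0x00 and 0x08 both map to block b0, and 0x04 and 0x0C both map to b1, which is exactly the replacement behavior seen in the walkthrough that follows.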

Cache Example (2)
Assume:
–r1 == 0, r2 == 1, r4 == 2
–1 cycle for a cache access
–5 cycles for a main-memory access
–1 cycle for instruction execution
Program in main memory (the cache starts empty):
0x00  L: add r1,r1,r2
0x04     bne r4,r1,L
0x08     sub r1,r1,r1
0x0C     j L
At cycle 1, PC = 0x00: fetch the instruction from memory
»look in the cache
»MISS: fetch from main memory (5-cycle penalty)

Cache Example (3)
At cycle 6: execute add r1,r1,r2 (r1 = 1).
Trace so far:
Cycle   Address  Op/Instr.      r1
1-5     0x00     FETCH (miss)   0
6       0x00     add r1,r1,r2   1

Cache Example (4)
At cycle 6, PC = 0x04: fetch the instruction from memory
»look in the cache
»MISS: fetch from main memory (5-cycle penalty); cycles 6-10 FETCH 0x04, overlapping the execution of add in cycle 6

Cache Example (5)
At cycle 11: execute bne r4,r1,L. Since r4 = 2 and r1 = 1, the branch is taken, back to L at 0x00.

Cache Example (6)
At cycle 11, PC = 0x00: fetch the instruction from memory. HIT: add is already in cache block b0, so the fetch takes a single cycle.

Cache Example (7)
At cycle 12: execute add r1,r1,r2 (r1 = 2).

Cache Example (8)
At cycle 12, PC = 0x04: fetch the instruction from memory. HIT: bne is already in cache block b1.

Cache Example (9)
At cycle 13: execute bne r4,r1,L. Now r1 = 2 = r4, so the branch is not taken and execution falls through to 0x08.

Cache Example (10)
At cycle 13, PC = 0x08: fetch the instruction from memory. MISS: not in the cache, so fetch from main memory (cycles 13-17).

Cache Example (11)
At cycle 17, PC = 0x08: put the fetched sub r1,r1,r1 into the cache, replacing the existing add in block b0.

Cache Example (12)
At cycle 18: execute sub r1,r1,r1 (r1 = 0).

Cache Example (13)
At cycle 18, PC = 0x0C: fetch the instruction from memory. MISS: not in the cache, so fetch from main memory (cycles 18-22).

Cache Example (14)
At cycle 22: put the fetched j L into the cache, replacing the existing bne in block b1.

Cache Example (15)
At cycle 23: execute j L. Complete trace:
Cycle   Address  Op/Instr.       r1
1-5     0x00     FETCH (miss)    0
6       0x00     add r1,r1,r2    1
6-10    0x04     FETCH (miss)    1
11      0x04     bne r4,r1,L     1
11      0x00     FETCH (hit)     1
12      0x00     add r1,r1,r2    2
12      0x04     FETCH (hit)     2
13      0x04     bne r4,r1,L     2
13-17   0x08     FETCH (miss)    2
18      0x08     sub r1,r1,r1    0
18-22   0x0C     FETCH (miss)    0
23      0x0C     j L             0

Compare No-cache vs. Cache
NO CACHE (every fetch takes 5 cycles):
Cycle   Address  Op/Instr.
1-5     0x00     FETCH
6       0x00     add r1,r1,r2
6-10    0x04     FETCH
11      0x04     bne r4,r1,L
11-15   0x00     FETCH
16      0x00     add r1,r1,r2
16-20   0x04     FETCH
21      0x04     bne r4,r1,L
21-25   0x08     FETCH
26      0x08     sub r1,r1,r1
26-30   0x0C     FETCH
31      0x0C     j L
CACHE (M = miss, H = hit):
Cycle   Address  Op/Instr.
1-5     0x00     FETCH (M)
6       0x00     add r1,r1,r2
6-10    0x04     FETCH (M)
11      0x04     bne r4,r1,L
11      0x00     FETCH (H)
12      0x00     add r1,r1,r2
12      0x04     FETCH (H)
13      0x04     bne r4,r1,L
13-17   0x08     FETCH (M)
18      0x08     sub r1,r1,r1
18-22   0x0C     FETCH (M)
23      0x0C     j L
With the cache, the program finishes in 23 cycles instead of 31.
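
For readers who want to replay the example, here is a minimal C simulation of this two-block direct-mapped instruction cache (our own sketch; the cycle accounting assumes, as in the trace above, that each execute overlaps the next fetch):

#include <stdio.h>

#define NUM_BLOCKS 2

/* One cache block: a valid bit and a tag (the data word is omitted;
   we only track hits and misses). */
struct block { int valid; unsigned tag; };
static struct block cache[NUM_BLOCKS];

/* Look up an instruction address; returns 1 on hit, 0 on miss
   (installing the block on a miss, replacing whatever was there). */
static int lookup(unsigned addr) {
    unsigned index = (addr >> 2) & (NUM_BLOCKS - 1); /* one index bit */
    unsigned tag   = addr >> 3;
    if (cache[index].valid && cache[index].tag == tag)
        return 1;
    cache[index].valid = 1;
    cache[index].tag   = tag;
    return 0;
}

int main(void) {
    /* The fetch sequence from the trace above. */
    unsigned pcs[] = {0x00, 0x04, 0x00, 0x04, 0x08, 0x0C};
    int cycles = 0;
    for (unsigned i = 0; i < sizeof pcs / sizeof pcs[0]; i++) {
        int hit = lookup(pcs[i]);
        cycles += hit ? 1 : 5;          /* 1-cycle hit, 5-cycle miss */
        printf("fetch 0x%02X: %s\n", pcs[i], hit ? "HIT" : "MISS");
    }
    /* Each execute overlaps the next fetch, so only the final execute
       adds a cycle: 22 fetch cycles + 1 = 23, matching the table. */
    printf("total: %d cycles\n", cycles + 1);
    return 0;
}

It prints the same MISS, MISS, HIT, HIT, MISS, MISS pattern as the CACHE column and a 23-cycle total.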

Cache Miss and the MIPS Pipeline: Instruction Fetch
[Pipeline diagram: the tag compare happens in cycle 1; an instruction-fetch miss is detected in cycle 2; the pipeline stalls for N cycles while the block is fetched, and the fetch completes (the pipeline restarts) in cycle 2+N, with the remaining stages in cycles 3+N through 6+N.]

Cache Miss and the MIPS Pipeline: Load Instruction
[Pipeline diagram: for a load, the tag compare happens in cycle 4; a data miss is detected in cycle 5; the pipeline stalls for N cycles, and the load completes (the pipeline restarts) in cycles 5+N and 6+N.]

Cache Performance Measures (recap)
Average memory-access time (AMAT) = Hit time + Miss rate x Miss penalty (ns or clocks)

Cache Performance
Miss-oriented approach to memory access:
CPU time = IC x (CPI_Execution + Memory accesses per instruction x Miss rate x Miss penalty) x Clock cycle time
–CPI_Execution includes ALU and memory instructions
Separating out the memory component entirely:
CPU time = IC x (AluOps per instruction x CPI_AluOps + Memory accesses per instruction x AMAT) x Clock cycle time
–AMAT = Average Memory Access Time = Hit time + Miss rate x Miss penalty
–CPI_AluOps does not include memory instructions

Cache Performance Example
Assume a computer where the cycles per instruction (CPI) is 1.0 when all memory accesses hit in the cache. The only data accesses are loads and stores, and these total 50% of the instructions. If the miss penalty is 25 clock cycles and the miss rate is 2% (unified instruction and data cache), how much faster would the computer be if all accesses were hits?
When all accesses hit:
CPU time_perfect = IC x 1.0 x Clock cycle time
In reality, there are 1.5 memory accesses per instruction (1 instruction fetch + 0.5 data accesses), so:
CPI_stall = 1.0 + 1.5 x 2% x 25 = 1.0 + 0.75 = 1.75
Speedup = CPI_stall / CPI_perfect = 1.75 / 1.0 = 1.75, i.e., the computer with no misses would be 1.75 times faster.
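
The same arithmetic as a small C sketch (the function and variable names are ours):

#include <stdio.h>

/* Miss-oriented formulation: stall cycles folded into CPI. */
static double cpi_with_stalls(double cpi_exec, double mem_per_instr,
                              double miss_rate, double miss_penalty) {
    return cpi_exec + mem_per_instr * miss_rate * miss_penalty;
}

int main(void) {
    double cpi_perfect   = 1.0;
    double mem_per_instr = 1.5;  /* 1 fetch + 0.5 loads/stores */
    double cpi_real = cpi_with_stalls(cpi_perfect, mem_per_instr,
                                      0.02, 25.0);
    printf("CPI with stalls = %.2f\n", cpi_real);       /* 1.75 */
    printf("perfect-cache speedup = %.2f\n",
           cpi_real / cpi_perfect);                     /* 1.75 */
    return 0;
}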

Performance Example Problem
Assume:
–for gcc, the frequency of all loads and stores is 36%
–instruction-cache miss rate for gcc = 2%
–data-cache miss rate for gcc = 4%
–the machine has a CPI of 2 without memory stalls
–the miss penalty is 40 cycles for all misses
How much faster is a machine with a perfect cache?
Instruction miss cycles = IC x 2% x 40 = 0.80 x IC
Data miss cycles = IC x 36% x 4% x 40 = 0.576 x IC
CPI_stall = 2 + (0.80 + 0.576) = 3.376
Speedup = (IC x CPI_stall x Clock period) / (IC x CPI_perfect x Clock period) = 3.376 / 2 = 1.69

Performance Example Problem
Assume we increase the performance of the previous machine by doubling its clock rate. Since main-memory speed is unlikely to change, assume the absolute time to handle a cache miss does not change; the miss penalty therefore doubles to 80 cycles. How much faster will the machine be with the faster clock?
For gcc, the frequency of all loads and stores is still 36%:
Instruction miss cycles = IC x 2% x 80 = 1.60 x IC
Data miss cycles = IC x 36% x 4% x 80 = 1.152 x IC
CPI_fastClk = 2 + (1.60 + 1.152) = 4.752
Speedup = (IC x CPI_slowClk x Clock period) / (IC x CPI_fastClk x 0.5 x Clock period) = 3.376 / (4.752 x 0.5) = 1.42 (not 2)
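
Both gcc examples fall out of one helper; a minimal C sketch with the slides' numbers (the helper name and structure are ours):

#include <stdio.h>

/* Stall-aware CPI for split instruction/data caches. */
static double cpi_stall(double cpi_base, double ld_st_freq,
                        double i_miss, double d_miss, double penalty) {
    double instr_stalls = i_miss * penalty;              /* per instruction */
    double data_stalls  = ld_st_freq * d_miss * penalty; /* per instruction */
    return cpi_base + instr_stalls + data_stalls;
}

int main(void) {
    double slow = cpi_stall(2.0, 0.36, 0.02, 0.04, 40.0); /* 3.376 */
    double fast = cpi_stall(2.0, 0.36, 0.02, 0.04, 80.0); /* 4.752 */
    printf("perfect-cache speedup: %.2f\n", slow / 2.0);  /* 1.69 */
    /* The fast machine's cycle is half as long, so compare times: */
    printf("doubled-clock speedup: %.2f\n", slow / (fast * 0.5)); /* 1.42 */
    return 0;
}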

Four Key Cache Questions:
1. Where can a block be placed in the cache? (block placement)
2. How can a block be found in the cache? ...using a tag (block identification)
3. Which block should be replaced on a miss? (block replacement)
4. What happens on a write? (write strategy)