M E M O R Y

Computer Performance It depends in large measure on the interface between the processor and memory; CPI (or IPC) is directly affected. CPI = cycles per instruction; IPC = instructions per cycle.
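
To make that dependence concrete, here is a minimal sketch of how memory stalls inflate CPI; every number in it is an illustrative placeholder, not a value from these slides.

    #include <stdio.h>

    /* Effective CPI = base CPI + memory stalls per instruction, where
       memory stalls per instruction = accesses per instruction
       * miss rate * miss penalty. All values are placeholders. */
    int main(void) {
        double base_cpi           = 1.0;   /* CPI with a perfect memory system */
        double accesses_per_instr = 1.3;   /* ~1 fetch + 0.3 data accesses     */
        double miss_rate          = 0.02;
        double miss_penalty       = 100.0; /* cycles */

        double effective_cpi = base_cpi +
            accesses_per_instr * miss_rate * miss_penalty;
        printf("effective CPI = %.2f, IPC = %.2f\n",
               effective_cpi, 1.0 / effective_cpi);
        return 0;
    }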

Program locality Temporal locality (data): recently referenced information is likely to be referenced again soon. Spatial locality: addresses near a recently referenced address are likely to be referenced. Sequential locality (instructions): a special case of spatial locality in which the next sequential address will be used.

Caches Instruction cache Data cache Unified cache Split cache: I-cache & D-cache

Tradeoffs Usually the cache (L1 and L2) is within the CPU chip, so chip area is a concern. Design alternatives: a large I-cache; equal areas for I- and D-cache; a large unified cache; a multi-level split cache.

Some terms Read: any read access (by the processor) to the cache, either an instruction fetch or a data read. Write: any write access (by the processor) into the cache. Fetch: a read to main memory to load a block. Access: any reference to memory.

Miss Ratio Miss ratio = (number of references to the cache not found in the cache) / (total references to the cache). A per-instruction alternative: (number of I-cache references not found in the cache) / (instructions executed).

Placement Problem (Figure: blocks of main memory being mapped into cache memory.)

Placement Policies WHERE to put a block in cache Mapping between main and cache memories. Main memory has a much larger capacity than cache memory.

Fully Associative Cache A block can be placed in any location in the cache. (Figure: memory block numbers mapping to cache locations.)

Direct Mapped Cache Cache location = (block address) MOD (number of blocks in cache); e.g., 12 MOD 8 = 4. A block can be placed ONLY in a single location in the cache. (Figure: memory block 12 mapping to cache block 4.)

Set Associative Cache Cache set = (block address) MOD (number of sets in cache); e.g., 12 MOD 4 = 0, so block 12 goes into set 0. A block can be placed in any one of the n locations of its set in an n-way set-associative cache. (Figure: memory block 12 mapping into one set of the cache.)
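
A small sketch of the two mapping rules, using the block-12 example from the slides (8 cache blocks for the direct-mapped case, 4 sets for the 2-way set-associative case):

    #include <stdio.h>

    int main(void) {
        unsigned block_addr   = 12;  /* main-memory block number from the slide  */
        unsigned cache_blocks = 8;   /* direct-mapped: 8 one-block frames        */
        unsigned cache_sets   = 4;   /* 2-way set-associative: 8 blocks, 4 sets  */

        /* Fully associative: block 12 may go into any frame.            */
        /* Direct mapped: exactly one frame, (block addr) MOD blocks.    */
        printf("direct-mapped frame = %u\n", block_addr % cache_blocks); /* 4 */
        /* Set associative: one set, any way within that set.            */
        printf("set-associative set = %u\n", block_addr % cache_sets);   /* 0 */
        return 0;
    }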

(Figure: the CPU's memory address goes through a name translation to produce a cache address.)

Cache organization (Figure: the CPU address is split into tag, index, and block-offset fields; the index selects a cache entry, whose valid bit and stored tag are checked against the address tag, and a MUX selects the requested data on a match.)
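
A minimal sketch of the lookup the figure describes, for a direct-mapped cache; the block size and cache size below are arbitrary assumptions.

    #include <stdint.h>
    #include <stdbool.h>

    #define BLOCK_BYTES 64            /* assumption: 64-byte blocks         */
    #define NUM_BLOCKS  256           /* assumption: 256 blocks = 16 KB     */
    #define OFFSET_BITS 6             /* log2(BLOCK_BYTES)                  */
    #define INDEX_BITS  8             /* log2(NUM_BLOCKS)                   */

    struct line {
        bool     valid;
        uint32_t tag;
        uint8_t  data[BLOCK_BYTES];
    };

    static struct line cache[NUM_BLOCKS];

    /* Split the CPU address into tag | index | block offset, then compare
       the stored tag if the valid bit is set (the "=" and MUX in the
       figure). Returns true on a hit and copies out one byte.            */
    bool cache_read_byte(uint32_t addr, uint8_t *out)
    {
        uint32_t offset = addr & (BLOCK_BYTES - 1);
        uint32_t index  = (addr >> OFFSET_BITS) & (NUM_BLOCKS - 1);
        uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

        struct line *l = &cache[index];
        if (l->valid && l->tag == tag) {
            *out = l->data[offset];
            return true;              /* hit */
        }
        return false;                 /* miss: fetch the block from memory */
    }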

TAG Contains the “sector name” (the high-order address bits) of the block. (Figure: a main-memory address split into a tag field and the remaining block fields.)

Valid bit(s) Indicates whether the block contains valid data. During the initial program load into main memory there is no valid data in the cache. Also needed for cache coherency in multiprocessors: data in main memory could have been changed by another processor (or process).

Dirty bit(s) Indicates whether the block has been written to. Not needed in I-caches, nor in a write-through D-cache; a write-back D-cache needs it.

Write back (Figure: CPU, cache with dirty bit D, main memory.) The CPU writes only into the cache; a dirty block is written to main memory when it is replaced.

Write through (Figure: CPU, cache, main memory.) Every CPU write updates both the cache and main memory.
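
A toy sketch contrasting the two write policies; the one-word cache line and the small backing array are simplifications, not part of the slides.

    #include <stdint.h>
    #include <stdbool.h>

    struct line { bool valid, dirty; uint32_t tag; uint32_t data; };
    static uint32_t main_memory[1024];   /* toy word-addressed backing store */

    /* Write-through: every store updates the cache AND main memory;
       no dirty bit is needed.                                          */
    void store_write_through(struct line *l, uint32_t word_index, uint32_t v)
    {
        l->data = v;
        main_memory[word_index] = v;
    }

    /* Write-back: the store only updates the cache and marks the line
       dirty; main memory is updated later, when the line is evicted.   */
    void store_write_back(struct line *l, uint32_t v)
    {
        l->data = v;
        l->dirty = true;
    }

    void evict(struct line *l, uint32_t word_index)
    {
        if (l->dirty)
            main_memory[word_index] = l->data;  /* one write, at replacement */
        l->valid = false;
        l->dirty = false;
    }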

Cache Performance Effective access time (in cycles): T_ea = Hit time + Miss rate × Miss penalty

Associativity (Table: D-cache misses per 1000 instructions, by cache size in KB, for 2-way, 4-way, and 8-way caches under LRU, random, and FIFO replacement.)

(Figure: CPU – cache memory – main memory.)

Replacement Policies FIFO (first-in first-out); random; least-recently used (LRU): replace the block that has been unused for the longest time.
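
A minimal sketch of LRU for one set of a 4-way cache, using simple age counters (a real design would track recency with far fewer bits).

    #include <stdint.h>

    #define WAYS 4

    struct way { int valid; uint32_t tag; uint32_t age; };

    /* Return the way to replace in this set under LRU:
       prefer an invalid way, otherwise the oldest one. */
    int lru_victim(struct way set[WAYS])
    {
        int victim = 0;
        for (int i = 0; i < WAYS; i++) {
            if (!set[i].valid) return i;
            if (set[i].age > set[victim].age) victim = i;
        }
        return victim;
    }

    /* On every access to hit_way, reset its age and age the others. */
    void lru_touch(struct way set[WAYS], int hit_way)
    {
        for (int i = 0; i < WAYS; i++) set[i].age++;
        set[hit_way].age = 0;
    }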

Reducing Cache Miss Penalties Multilevel cache Critical word first Priority to Read over Write Victim cache

Multilevel caches (Figure: CPU – L1 – L2 – L3 – main memory.)

Multilevel caches Average access time = Hit time_L1 + Miss rate_L1 × Miss penalty_L1, where Miss penalty_L1 = Hit time_L2 + Miss rate_L2 × Miss penalty_L2
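
The same formula as a small helper function; all parameter values would come from measurement, none are assumed here.

    /* Average access time with two cache levels:
       AMAT = hit_L1 + miss_rate_L1 * (hit_L2 + miss_rate_L2 * penalty_L2) */
    double amat_two_level(double hit_l1, double miss_rate_l1,
                          double hit_l2, double miss_rate_l2,
                          double miss_penalty_l2)
    {
        double miss_penalty_l1 = hit_l2 + miss_rate_l2 * miss_penalty_l2;
        return hit_l1 + miss_rate_l1 * miss_penalty_l1;
    }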

Critical word first The missed word is requested first from the next level, so the CPU can restart as soon as it arrives. Priority to read over write misses: writes are not as critical as reads, since the CPU cannot continue if data or an instruction has not been read.

Victim cache A small fully associative cache that holds blocks that have been discarded (“victims”). A four-entry victim cache removed 20–90% of the conflict misses of a 4 KB direct-mapped cache.

Reducing Cache Miss Rate Larger block size Larger caches Higher Associativity Pseudo-associative cache Prefetching

Larger Block Size Larger blocks take advantage of spatial locality. On the other hand, fewer blocks fit in the cache.

Miss rate versus block size (columns are cache sizes)
Block size     4K        16K       64K       256K
16 bytes       8.57%     3.94%     2.04%     1.09%
32 bytes       –         2.87%     1.35%     0.70%
64 bytes       –         2.64%     1.06%     0.51%
128 bytes      –         2.77%     1.02%     0.49%
256 bytes      –         3.29%     1.15%     0.49%

Example Memory takes 80 clock cycles (cc) of overhead and then delivers 16 bytes every 2 clock cycles (1 cycle for the miss request + 1 cycle for the read). The memory system therefore supplies 16 bytes in 82 cc, 32 bytes in 84 cc, 64 bytes in 88 cc, and so on.

Example (cont’d) Average access time = Hit time_L1 + Miss rate_L1 × Miss penalty_L1. For a 16-byte block in a 4 KB cache (miss rate 8.57%, miss penalty 82 cc): average access time = 1 + (8.57% × 82) = 8.0274 cycles.

Example (cont’d) Average access time = Hit time_L1 + Miss rate_L1 × Miss penalty_L1. For a 64-byte block in a 256 KB cache (miss rate 0.51%, miss penalty 88 cc): average access time = 1 + (0.51% × 88) = 1.449 cycles.
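
The two results can be checked directly; the miss rates and miss penalties are the ones quoted above.

    #include <stdio.h>

    int main(void) {
        /* 16-byte block, 4 KB cache: miss rate 8.57%, miss penalty 82 cc  */
        double t_small = 1.0 + 0.0857 * 82.0;   /* = 8.0274 cycles          */
        /* 64-byte block, 256 KB cache: miss rate 0.51%, miss penalty 88 cc */
        double t_large = 1.0 + 0.0051 * 88.0;   /* = 1.4488 ~ 1.449 cycles  */
        printf("%.4f %.4f\n", t_small, t_large);
        return 0;
    }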

Average memory access time versus block size (Table: average access time by block size and miss penalty for cache sizes of 4K, 16K, 64K, and 256K.)

Two-way cache (Alpha)

Higher Associativity Higher associativity is NOT free: a slower clock may be required. Thus there is a trade-off between higher associativity (higher hit rate) and the faster clock of a direct-mapped cache.

Associativity example If clock cycle time (cct) increases as follows: cct_2-way = 1.10 × cct_1-way; cct_4-way = 1.12 × cct_1-way; cct_8-way = 1.14 × cct_1-way. Miss penalty is 50 cycles; hit time is 1 cycle.
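
A sketch of how the trade-off could be evaluated. The clock-cycle factors, hit time, and miss penalty are from the slide; the miss rates are placeholders, since the slide does not give them.

    #include <stdio.h>

    int main(void) {
        /* From the slide: hit time 1 cycle, miss penalty 50 cycles,
           clock-cycle-time factors relative to direct mapped.          */
        double cct_factor[4] = { 1.00, 1.10, 1.12, 1.14 };
        /* Placeholder miss rates for 1-, 2-, 4-, 8-way; NOT from the slide. */
        double miss_rate[4]  = { 0.050, 0.045, 0.044, 0.043 };
        const char *name[4]  = { "1-way", "2-way", "4-way", "8-way" };

        for (int i = 0; i < 4; i++) {
            /* AMAT in 1-way clock cycles: hit time scales with the cct
               factor; the miss penalty is kept fixed in 1-way cycles.  */
            double amat = cct_factor[i] * 1.0 + miss_rate[i] * 50.0;
            printf("%s: AMAT = %.3f cycles\n", name[i], amat);
        }
        return 0;
    }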

Pseudo Associative Cache Hit time is that of a direct-mapped cache. On a miss, check one other entry in the cache (the one whose index differs in the most significant bit). If the block is found there, the long miss penalty is avoided.

Pseudo-associative access time (Figure: hit time < pseudo-hit time < miss penalty.)
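
A sketch of the pseudo-associative probe described above: check the direct-mapped entry first and, on a miss, the entry whose index differs in the most significant bit. The sizes are assumptions, and a real design must also track whether a block sits in its primary or alternate slot.

    #include <stdint.h>
    #include <stdbool.h>

    #define INDEX_BITS 8
    #define NUM_LINES  (1u << INDEX_BITS)

    struct pline { bool valid; uint32_t tag; };
    static struct pline cache[NUM_LINES];

    /* Returns 0 on a fast hit, 1 on a slower pseudo-hit, 2 on a miss. */
    int pseudo_assoc_probe(uint32_t tag, uint32_t index)
    {
        if (cache[index].valid && cache[index].tag == tag)
            return 0;                             /* normal direct-mapped hit  */

        uint32_t alt = index ^ (1u << (INDEX_BITS - 1));  /* flip MSB of index */
        if (cache[alt].valid && cache[alt].tag == tag)
            return 1;                             /* pseudo hit: extra cycle(s) */

        return 2;                                 /* true miss: go to next level */
    }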

Hardware prefetching What about fetching two blocks on a miss: the requested block and the next one (as in the Alpha AXP)? Problem: real misses may be delayed by the prefetches.

Fetch Policies Memory references: about 90% reads, 10% writes. Thus reads get the higher priority.

Fetch Fetch on miss (demand fetching); prefetching (especially instructions, in the I-cache).

Compiler Controlled Prefetching Register prefetching: load values into registers before they are needed. Cache prefetching: load the data into the cache only. This is only possible with nonblocking (lockup-free) caches.
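
With GCC or Clang, cache prefetching can be expressed with the __builtin_prefetch intrinsic; the prefetch distance of 16 elements below is an arbitrary assumption.

    /* Compile with GCC or Clang. __builtin_prefetch(addr, rw, locality)
       issues a non-binding cache prefetch; it only helps if the cache is
       nonblocking, so the prefetch can overlap with computation.        */
    double sum_with_prefetch(const double *a, int n)
    {
        double s = 0.0;
        for (int i = 0; i < n; i++) {
            if (i + 16 < n)
                __builtin_prefetch(&a[i + 16], 0, 1); /* read, low temporal reuse */
            s += a[i];
        }
        return s;
    }

The prefetch is only a hint: it does not fault and may be dropped, which is why the nonblocking-cache requirement above matters.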

Order of transfer from Main Memory to Cache, when the needed AU is in the middle of the block. Block load: the complete block is loaded in order. Load forward: the needed AU is forwarded; the AUs behind the miss are not loaded. Fetch bypass: start with the needed AU; the AUs behind it are loaded later. (AU = addressable unit.)

Compiler optimization: Merging Arrays The goal is to improve locality: rather than having two independent arrays, we could have only one.
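
A sketch of the transformation: two parallel arrays indexed together become a single array of structures, so fields used together share a cache block (all names are illustrative).

    #define N 1024

    /* Before: val[i] and key[i] live in two separate arrays, so one loop
       iteration touches two distant cache blocks.                        */
    int val[N];
    int key[N];

    /* After: merging the arrays puts val and key for the same i into the
       same (or an adjacent) cache block.                                 */
    struct merged { int val; int key; };
    struct merged merged_array[N];

    long sum_merged(void)
    {
        long s = 0;
        for (int i = 0; i < N; i++)
            s += merged_array[i].val + merged_array[i].key;
        return s;
    }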

Compiler optimization: Loop interchange Nested loops may access memory non-sequentially. (Figure: a 3×3 array A11…A33 and the order in which its elements are touched.)

(Figure: the 3×3 array A11…A33 laid out linearly in memory, one row after another.)
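
A sketch of loop interchange in C, where arrays are stored row-major: swapping the loop order turns large-stride column accesses into sequential row accesses.

    #define ROWS 1000
    #define COLS 1000
    static double x[ROWS][COLS];

    /* Before: the inner loop walks DOWN a column, striding COLS*8 bytes
       between accesses -- poor spatial locality in row-major C.          */
    void scale_column_major(void)
    {
        for (int j = 0; j < COLS; j++)
            for (int i = 0; i < ROWS; i++)
                x[i][j] *= 2.0;
    }

    /* After interchange: the inner loop walks ALONG a row, so consecutive
       accesses fall in the same cache block.                             */
    void scale_row_major(void)
    {
        for (int i = 0; i < ROWS; i++)
            for (int j = 0; j < COLS; j++)
                x[i][j] *= 2.0;
    }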

Priority to Read over Write If two (or more) misses occur, give priority to the read. Example: M[512] ← R3; R1 ← M[1024]; R1 ← M[512]. The write to M[512] may still be sitting in the write buffer, so the later read of M[512] must check the write buffer (or wait for it to drain) to avoid returning stale data.

Non-blocking caches On a miss the cache should not block other accesses. This is important for an out-of-order execution machine (Tomasulo approach).

Access time Mem_access_time = Hit_time + Miss_rate × Miss_penalty. For two levels: Mem_access_time_L1 = Hit_time_L1 + Miss_rate_L1 × Miss_penalty_L1, where Miss_penalty_L1 = Hit_time_L2 + Miss_rate_L2 × Miss_penalty_L2.

Reducing Hit Time Hit time is critical because it limits the processor clock frequency. “Smaller” hardware is faster and needs fewer sequential operations, and direct mapping is simpler. Hence: small and simple caches.

Avoid address translation (virtual address → physical address) during cache indexing.

Example
Cache size    I-cache    D-cache    Unified cache
8 KB          0.82%      12.22%     4.63%
16 KB         0.38%      11.36%     3.75%
32 KB         0.16%      11.73%     3.18%
64 KB         0.065%     10.25%     2.89%
128 KB        0.03%      9.81%      2.66%
256 KB        0.002%     9.06%      2.42%