SE-292 High Performance Computing: Memory Hierarchy (R. Govindarajan)

Presentation transcript:

Overview
- Problem: CPU vs. memory performance imbalance
- Solution: memory hierarchies, driven by temporal and spatial locality
  - Fast L1, L2, L3 caches
  - Larger but slower main memories
  - Even larger but even slower secondary storage
- Keep most of the action in the higher levels

Locality of Reference
- Temporal and spatial locality
- Sequential access to memory

Unit-stride loop (cache lines = 256 bits):

    for (i = 1; i < 100000; i++)
        sum = sum + a[i];

Non-unit-stride loop (cache lines = 256 bits):

    for (i = 0; i <= 100000; i = i + 8)
        sum = sum + a[i];

With 256-bit (32-byte) lines and 4-byte elements, the unit-stride loop reuses each fetched line for eight consecutive elements, while the stride-8 loop touches only one element per line, so every access brings in a new line; the sketch below makes this concrete.
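
A minimal, illustrative sketch of that difference (assuming 4-byte ints and the slide's 32-byte lines; the function `walk` and its output format are invented for the example). It counts how often an access lands on a new cache line, a stand-in for the miss rate when there is no reuse:

    #include <stdint.h>
    #include <stdio.h>

    #define N 100000
    #define LINE_SIZE 32   /* 256-bit cache lines = 32 bytes */

    static int a[N + 1];

    /* Walk `a` with the given stride and report how often an access
       lands on a new cache line. */
    static void walk(int stride) {
        long accesses = 0, new_lines = 0;
        uintptr_t prev = (uintptr_t)-1;
        for (int i = 0; i < N; i += stride) {
            uintptr_t line = (uintptr_t)&a[i] / LINE_SIZE;
            accesses++;
            if (line != prev) { new_lines++; prev = line; }
        }
        printf("stride %d: %ld accesses, %ld new lines (%.1f%%)\n",
               stride, accesses, new_lines, 100.0 * new_lines / accesses);
    }

    int main(void) {
        walk(1);   /* ~12.5%: eight 4-byte ints share each 32-byte line */
        walk(8);   /* 100%: every access advances by a full line */
        return 0;
    }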

Cache Systems
[Figure: a 400 MHz CPU and its cache exchange individual data objects; the cache and a 10 MHz main memory exchange whole blocks over a 66 MHz bus]

Example: Two-level Hierarchy
[Figure: average access time vs. hit ratio, falling from T1 + T2 toward T1 as the hit ratio approaches 1]
With hit time T1, miss penalty T2, and hit ratio H, the average access time is T1 + (1 - H) x T2.
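
For instance, with assumed values T1 = 1 ns (cache) and T2 = 50 ns (main memory), a hit ratio of 0.95 gives an average access time of 1 + 0.05 x 50 = 3.5 ns, and raising the hit ratio to 0.99 drops it to 1.5 ns. Small improvements in hit ratio dominate the design.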

Basic Cache Read Operation
- CPU requests the contents of a memory location
- Check the cache for this data
- If present, get it from the cache (fast)
- If not present, read the required block from main memory into the cache, then deliver it from the cache to the CPU
- The cache includes tags to identify which block of main memory is in each cache slot (the sketch below shows this read path)
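
A minimal sketch of that read path in C, for an illustrative direct-mapped cache; `CacheLine`, `cache_read`, and the geometry (256 lines of 32 bytes, a 1 MB stand-in memory) are invented for the example, not any particular hardware:

    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    #define NUM_LINES  256
    #define BLOCK_SIZE 32            /* bytes per block */
    #define MEM_SIZE   (1u << 20)    /* 1 MB stand-in main memory */

    typedef struct {
        int      valid;
        uint32_t tag;
        uint8_t  data[BLOCK_SIZE];
    } CacheLine;

    static CacheLine cache[NUM_LINES];
    static uint8_t   main_memory[MEM_SIZE];

    uint8_t cache_read(uint32_t addr) {
        uint32_t offset = addr % BLOCK_SIZE;
        uint32_t block  = addr / BLOCK_SIZE;   /* block address */
        uint32_t index  = block % NUM_LINES;   /* which cache slot */
        uint32_t tag    = block / NUM_LINES;   /* identifies the block in that slot */

        CacheLine *line = &cache[index];
        if (!line->valid || line->tag != tag) {
            /* Miss: read the required block from main memory into the cache. */
            memcpy(line->data, &main_memory[block * BLOCK_SIZE], BLOCK_SIZE);
            line->valid = 1;
            line->tag   = tag;
        }
        /* Hit (or freshly filled line): deliver from the cache to the CPU. */
        return line->data[offset];
    }

    int main(void) {
        main_memory[12345] = 42;
        printf("%d\n", cache_read(12345));   /* first access misses... */
        printf("%d\n", cache_read(12345));   /* ...the second one hits */
        return 0;
    }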

Elements of Cache Design
- Cache size
- Line (block) size
- Number of caches
- Mapping function
  - Block placement
  - Block identification
- Replacement algorithm
- Write policy

Cache Size
- Cache size << main memory size
- Small enough to:
  - Minimize cost
  - Speed up access (fewer gates to address the cache)
  - Keep the cache on chip
- Large enough to minimize average access time
- Optimum size depends on the workload
- Practical size?

Line Size
- Optimum size depends on the workload
- Small blocks do not exploit the locality-of-reference principle
- Larger blocks reduce the number of blocks in the cache, raising replacement overhead
- Practical sizes?
[Figure: tags and lines in the cache mapped to blocks of main memory]

Number of Caches
- Increased logic density => on-chip cache
  - Internal cache: level 1 (L1)
  - External cache: level 2 (L2)
- Unified cache
  - Balances the load between instruction and data fetches
  - Only one cache needs to be designed and implemented
- Split caches (data and instruction)
  - Better suited to pipelined, parallel architectures

Mapping Function
- Cache lines << main memory blocks
- Direct mapping: maps each block into only one possible line
  - line = (block address) MOD (number of lines)
- Fully associative: a block can be placed anywhere in the cache
- Set associative: a block can be placed in a restricted set of lines (see the worked example below)
  - set = (block address) MOD (number of sets in cache)
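
As a worked example with assumed numbers: take block address 12 and a cache of 8 lines. Direct-mapped, the block must go to line 12 MOD 8 = 4. Two-way set associative (4 sets), it may go in either way of set 12 MOD 4 = 0. Fully associative, it may go in any of the 8 lines.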

Cache Addressing
The address splits into a block address (tag + index) and a block offset:

  | Tag | Index | Block offset |

- Block offset: selects the data object from the block
- Index: selects the block set
- Tag: used to detect a hit
(The sketch below decomposes an address this way.)
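
A sketch of that decomposition, assuming an illustrative geometry of 32-byte blocks (5 offset bits) and 128 sets (7 index bits); the constants and the example address are made up:

    #include <stdint.h>
    #include <stdio.h>

    #define OFFSET_BITS 5   /* 32-byte blocks -> 5 offset bits */
    #define INDEX_BITS  7   /* 128 sets       -> 7 index bits  */

    int main(void) {
        uint32_t addr = 0x12345678;   /* arbitrary example address */

        uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);                 /* data object within the block */
        uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1); /* selects the set */
        uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);               /* compared to detect a hit */

        printf("tag = 0x%x, index = %u, offset = %u\n", tag, index, offset);
        return 0;
    }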

Direct Mapping

Associative Mapping

K-Way Set Associative Mapping

Replacement Algorithm
- Simple for direct-mapped: no choice
- Random: simple to build in hardware
- LRU (a bookkeeping sketch follows the table)

Miss rates for LRU vs. random replacement:

            Two-way          Four-way         Eight-way
  Size      LRU    Random    LRU    Random    LRU    Random
  16KB      5.18%  5.69%     4.67%  5.29%     4.39%  4.96%
  64KB      1.88%  2.01%     1.54%  1.66%     1.39%  1.53%
  256KB     1.15%  1.17%     1.13%  1.13%     1.12%  1.12%
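
A minimal sketch of LRU bookkeeping for one set of a set-associative cache (illustrative only; `CacheSet`, `set_access`, and the timestamp scheme are invented, and real hardware usually approximates LRU with fewer bits):

    #include <stdint.h>
    #include <stdio.h>

    #define WAYS 4

    typedef struct {
        int      valid[WAYS];
        uint32_t tag[WAYS];
        uint32_t last_used[WAYS];   /* logical timestamp per way */
        uint32_t clock;             /* per-set access counter */
    } CacheSet;

    /* Return the way that holds `tag`, using LRU replacement on a miss. */
    int set_access(CacheSet *s, uint32_t tag) {
        /* Hit check: compare the tags of all valid ways. */
        for (int w = 0; w < WAYS; w++) {
            if (s->valid[w] && s->tag[w] == tag) {
                s->last_used[w] = ++s->clock;   /* refresh recency on a hit */
                return w;
            }
        }
        /* Miss: pick an invalid way if any, else the least recently used. */
        int victim = -1;
        for (int w = 0; w < WAYS; w++)
            if (!s->valid[w]) { victim = w; break; }
        if (victim < 0) {
            victim = 0;
            for (int w = 1; w < WAYS; w++)
                if (s->last_used[w] < s->last_used[victim]) victim = w;
        }
        s->valid[victim]     = 1;
        s->tag[victim]       = tag;
        s->last_used[victim] = ++s->clock;
        return victim;
    }

    int main(void) {
        CacheSet s = {0};
        /* Two misses fill ways 0 and 1; the third access hits way 0. */
        printf("%d %d %d\n", set_access(&s, 1), set_access(&s, 2), set_access(&s, 1));
        return 0;
    }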

Write Policy
- Writes are more complex than reads
  - Write and tag comparison cannot proceed simultaneously
  - Only a portion of the line has to be updated
- Write policies
  - Write through: write to both the cache and memory
  - Write back: write only to the cache and mark the line dirty (dirty bit); see the sketch below
- Write miss
  - Write allocate: load the block into the cache on a write miss
  - No-write allocate: update directly in memory
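
A minimal sketch of the write-back, write-allocate combination, reusing the same illustrative direct-mapped geometry as the read sketch above (all names and sizes are invented; addresses are assumed to fit the stand-in memory):

    #include <stdint.h>
    #include <string.h>

    #define NUM_LINES  256
    #define BLOCK_SIZE 32
    #define MEM_SIZE   (1u << 20)   /* 1 MB stand-in main memory */

    typedef struct {
        int      valid, dirty;
        uint32_t tag;
        uint8_t  data[BLOCK_SIZE];
    } WbLine;

    static WbLine  wb_cache[NUM_LINES];
    static uint8_t memory[MEM_SIZE];

    void cache_write(uint32_t addr, uint8_t value) {
        uint32_t offset = addr % BLOCK_SIZE;
        uint32_t block  = addr / BLOCK_SIZE;
        uint32_t index  = block % NUM_LINES;
        uint32_t tag    = block / NUM_LINES;
        WbLine *line = &wb_cache[index];

        if (!line->valid || line->tag != tag) {   /* write miss */
            if (line->valid && line->dirty) {
                /* Write back the evicted dirty line to memory first. */
                uint32_t old_block = line->tag * NUM_LINES + index;
                memcpy(&memory[old_block * BLOCK_SIZE], line->data, BLOCK_SIZE);
            }
            /* Write allocate: load the missing block into the cache. */
            memcpy(line->data, &memory[block * BLOCK_SIZE], BLOCK_SIZE);
            line->valid = 1;
            line->tag   = tag;
            line->dirty = 0;
        }
        line->data[offset] = value;   /* write only to the cache... */
        line->dirty = 1;              /* ...and remember it via the dirty bit */
    }

    int main(void) {
        cache_write(12345, 42);   /* stays in the cache until the line is evicted */
        return 0;
    }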

Alpha AXP 21064 Cache
[Figure: address split into a 21-bit tag, an 8-bit index, and a 5-bit block offset; each cache line holds a valid bit, a tag, and 256 bits of data; a comparator (=?) checks the tag to detect a hit, and a write buffer sits between the cache and lower-level memory]

Write Merging
[Figure: write buffer with per-word valid bits, before and after merging; four sequential writes to addresses 100, 104, 108, and 112 occupy four buffer entries without merging, but collapse into the valid slots of a single entry with merging]

DECstation 5000 Miss Rates
- Direct-mapped cache with 32-byte blocks
- 75% of memory references are instruction references

Cache Performance Measures
- Hit rate: fraction of accesses found in that level
  - Usually so high that we quote the miss rate instead
  - Miss-rate fallacy: miss rate is as misleading a guide to memory performance as MIPS is to CPU performance
- Average memory access time = Hit time + Miss rate x Miss penalty (ns)
- Miss penalty: time to replace a block from the lower level, including the time to deliver it to the CPU
  - Access time to the lower level = f(latency to lower level)
  - Transfer time: time to transfer the block = f(bandwidth)

Cache Performance Improvements
Average memory access time = Hit time + Miss rate x Miss penalty
Cache optimizations:
- Reducing the miss rate
- Reducing the miss penalty
- Reducing the hit time

Example
Which has the lower average memory access time:
- a 16-KB instruction cache with a 16-KB data cache, or
- a 32-KB unified cache?

Given: hit time = 1 cycle; miss penalty = 50 cycles; a load/store hit takes 2 cycles on the unified cache (its single port makes a data access contend with instruction fetch); 75% of memory accesses are instruction references.

Miss rates:
- Split: 0.75 * 0.64% + 0.25 * 6.47% = 2.10% overall
- Unified: 1.99%

Average memory access times:
- Split = 0.75 * (1 + 0.0064 * 50) + 0.25 * (1 + 0.0647 * 50) = 2.05 cycles
- Unified = 0.75 * (1 + 0.0199 * 50) + 0.25 * (2 + 0.0199 * 50) = 2.24 cycles

Despite its higher overall miss rate, the split configuration wins: 2.05 < 2.24.
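
A quick check of the arithmetic above in C (the helper name `amat` is invented; the rates and penalties are the ones given in the example):

    #include <stdio.h>

    /* Average memory access time = hit time + miss rate * miss penalty */
    static double amat(double hit, double miss_rate, double penalty) {
        return hit + miss_rate * penalty;
    }

    int main(void) {
        double penalty = 50.0;

        /* Split: 75% instruction refs (0.64% misses), 25% data refs (6.47% misses). */
        double split = 0.75 * amat(1, 0.0064, penalty)
                     + 0.25 * amat(1, 0.0647, penalty);

        /* Unified: 1.99% miss rate; load/store hits take 2 cycles. */
        double unified = 0.75 * amat(1, 0.0199, penalty)
                       + 0.25 * amat(2, 0.0199, penalty);

        printf("split = %.2f cycles, unified = %.2f cycles\n", split, unified);
        return 0;
    }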