Cache Memory Presentation I

CSE 675.02: Introduction to Computer Architecture
Cache Memory Presentation I
Gojko Babić, 08/12/2005

The Levels in Memory Hierarchy
The higher the level, the smaller and faster the memory. Try to keep most of the action in the higher levels.

Principle of Locality
The principle of locality is the most important program property exploited in many parts of the memory hierarchy. It states that programs tend to reuse instructions and data they have used recently. There are two different types of locality:
– temporal locality: recently accessed locations in main memory are likely to be accessed again in the near future,
– spatial locality: locations in main memory near one another tend to be referenced close together in time.
The principle of locality applies more strongly to code accesses than to data accesses. An implication of the principle is that we can predict with reasonable accuracy which instructions and data a program will use in the near future, based on its accesses in the recent past.
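As a concrete illustration (not from the original slides), the loop below touches consecutive array elements, which is spatial locality, while the repeatedly reused accumulator and the loop's instructions exhibit temporal locality; the function name and sizes are made up for the example.

    #include <stddef.h>

    /* Sums an array.  Consecutive elements share cache blocks (spatial
       locality); "sum" and the loop instructions are reused on every
       iteration (temporal locality). */
    double sum_array(const double a[], size_t n)
    {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++)
            sum += a[i];        /* sequential accesses -> spatial locality */
        return sum;             /* sum stays in a register / hot cache line */
    }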

Memory and Cache Systems
Main memory: DRAM technology. Cache memory: SRAM technology. The cache is fast, but because of that it has to be small: locality of reference and the limit on the size of very fast memory have led to the concept of cache memory.
(Figure: a 400 MHz CPU, a cache, a 66 MHz bus, and 10 MHz main memory; data objects transfer between the CPU and the cache, and blocks transfer between the cache and main memory.)

Basics of Cache Operation
Since it is the more important case, we first consider a cache read operation. The CPU requests the content of a given memory location, and the cache is checked for that content; the cache includes tags to identify which block from main memory is in each cache entry.
– If present, this is a hit, and the content is delivered to the CPU (fast).
– If not present, this is a miss: a block containing the required content is read from main memory into the cache, and the content is then delivered from the cache to the CPU. Miss penalty: the time to replace a block from the lower level.
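A minimal sketch of the read path just described, assuming a direct-mapped cache with one-word blocks; the structure and function names are invented for illustration and are not from the slides.

    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_ENTRIES 16384                 /* 16K entries, as on a later slide */

    struct cache_entry {
        bool     valid;                       /* does this entry hold a block?    */
        uint32_t tag;                         /* which memory block it holds      */
        uint32_t data;                        /* one 32-bit word                  */
    };

    static struct cache_entry cache[NUM_ENTRIES];
    static uint32_t main_memory[1 << 20];     /* simulated word-addressed DRAM    */

    static uint32_t read_word_from_memory(uint32_t addr)   /* slow path (miss)   */
    {
        return main_memory[(addr >> 2) % (1 << 20)];
    }

    uint32_t cache_read(uint32_t addr)
    {
        uint32_t index = (addr >> 2) % NUM_ENTRIES;   /* drop 2-bit byte offset   */
        uint32_t tag   = (addr >> 2) / NUM_ENTRIES;   /* remaining address bits   */

        if (cache[index].valid && cache[index].tag == tag)
            return cache[index].data;                 /* hit: deliver fast        */

        /* Miss: read the word from main memory, install it, then deliver it.     */
        uint32_t word = read_word_from_memory(addr);
        cache[index].valid = true;
        cache[index].tag   = tag;
        cache[index].data  = word;
        return word;
    }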

2-level Hierarchy: Performance View
T1: cache access time. T2: main memory access time. The hit ratio is the ratio of the number of hits to the total number of memory accesses; miss ratio = 1 – hit ratio. Hit rates are normally well over 90%.
(Figure: average access time plotted against hit ratio; it falls from T1 + T2 at a hit ratio of 0 toward T1 at a hit ratio of 1.)
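As a worked example with numbers chosen here for illustration (they are not from the slides): using average access time = hit ratio × T1 + (1 – hit ratio) × (T1 + T2), a cache with T1 = 1 cycle, T2 = 20 cycles, and a hit ratio of 0.95 gives 0.95 × 1 + 0.05 × 21 = 2.0 cycles per access on average.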

One-Word-Wide Memory Bus
Let us assume the performance of the main memory to be:
– 1 clock cycle to send the address,
– 14 clock cycles of access time per word,
– 1 clock cycle to send a word of data.
(Figure: CPU, cache, bus, and memory in a one-word-wide organization.)
Given a cache block of 1 word (1 word = 4 bytes):
Miss penalty = 1 + 14 + 1 = 16 cycles. Throughput = 4 bytes / 16 cycles = 0.25 bytes per cycle.
Although caches benefit from low-latency main memory, it is generally easier to improve memory bandwidth with a new organization than it is to reduce latency.

Wider Main Memory Bus
Given a 4-word-wide memory bus and a cache block of four words:
Miss penalty (4-word memory bus) = 1 + 14 + 1 = 16 cycles (as before). Throughput = 16 bytes / 16 cycles = 1 byte per cycle.
The CPU will still access the cache a word at a time, so there is a need for a multiplexer between the cache and the CPU.
(Figure 7.11 a. & b.: one-word-wide memory organization versus wide memory organization with a multiplexer.)

Wider Main Memory Bus & Level 2 Cache
(Figure: CPU, multiplexer, caches, bus, and memory.)
– But the multiplexer may be on the critical timing path.
– Here, a second-level cache can help, since the multiplexing can be placed between the first- and second-level caches rather than on the critical timing path.

Interleaved Memory Organization
(Figure: CPU, cache, one-word-wide bus, and four memory banks.)
The memory bus width is 1 word; this is four-way interleaved memory. Assuming a cache block of four words:
Miss penalty = 1 + 14 + 4×1 = 19 cycles. Throughput = 16 bytes / 19 cycles = 0.84 bytes per cycle.
One address is sent to all banks, and each bank sends its data in its own clock cycle. The example assumes word addressing; with byte addressing and 4 bytes per word, each of the addresses would be a multiple of 4.
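The miss-penalty arithmetic of the last three slides, collected into one small C sketch; the constants are the cycle counts assumed above, and the one-word-wide case is generalized here to a four-word block (the slide itself used a one-word block, giving 16 cycles and the same 0.25 bytes per cycle).

    #include <stdio.h>

    #define ADDR_CYCLES     1     /* send the address              */
    #define ACCESS_CYCLES  14     /* DRAM access time per word     */
    #define XFER_CYCLES     1     /* send one word of data         */
    #define BLOCK_WORDS     4     /* 4-word cache block            */
    #define BYTES_PER_WORD  4

    int main(void)
    {
        /* One-word-wide bus: every word pays address + access + transfer. */
        int one_wide = BLOCK_WORDS * (ADDR_CYCLES + ACCESS_CYCLES + XFER_CYCLES);

        /* 4-word-wide bus: the whole block moves in one access.           */
        int four_wide = ADDR_CYCLES + ACCESS_CYCLES + XFER_CYCLES;

        /* Four-way interleaved banks: accesses overlap, transfers serial. */
        int interleaved = ADDR_CYCLES + ACCESS_CYCLES + BLOCK_WORDS * XFER_CYCLES;

        int block_bytes = BLOCK_WORDS * BYTES_PER_WORD;
        printf("one-word-wide:  %2d cycles, %.2f bytes/cycle\n",
               one_wide, (double)block_bytes / one_wide);
        printf("four-word-wide: %2d cycles, %.2f bytes/cycle\n",
               four_wide, (double)block_bytes / four_wide);
        printf("interleaved:    %2d cycles, %.2f bytes/cycle\n",
               interleaved, (double)block_bytes / interleaved);
        return 0;
    }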

Basics of Cache Design
Elements of cache design:
– cache size, i.e. the number of entries,
– block (line) size, i.e. the number of data elements per entry,
– the number of caches,
– mapping functions: block placement and block identification,
– replacement algorithm,
– write policy.

Cache Size and Block Size
Cache size << main memory size. The cache should be small enough to minimize cost, speed up access (fewer gates to address the cache), and keep the cache on chip; it should be large enough to minimize the average access time:
Average access time = hit time + (1 – hit rate) × miss penalty
Smaller blocks do not take advantage of spatial locality; larger blocks reduce the number of blocks in the cache and increase replacement overhead.

Number of Caches
Increased logic density => on-chip cache: the level 1 (L1) cache is internal, while the level 2 (L2) cache may be external or internal.
A unified cache balances the load between instruction and data fetches, and only one cache needs to be designed and implemented. Split caches (a data cache and an instruction cache) suit pipelined, parallel architectures.

Mapping Functions
The mapping function determines the basic cache organization.
– Direct mapped cache: maps each block into only one possible entry; entry number = (block address) modulo (number of entries).
– Fully associative cache: a block can be placed anywhere in the cache.
– Set associative cache: a block can be placed in a restricted set of entries; set number = (block address) modulo (number of sets in cache).
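A tiny sketch of the two modulo computations named above; the block size and cache geometry constants are placeholder values chosen for illustration.

    #include <stdint.h>

    #define BLOCK_BYTES    16     /* 4-word blocks (example value)           */
    #define NUM_ENTRIES  4096     /* direct-mapped entries (example value)   */
    #define NUM_SETS     1024     /* sets in a set-associative cache         */

    /* Direct mapped: each block has exactly one possible entry. */
    uint32_t direct_mapped_entry(uint32_t byte_addr)
    {
        uint32_t block_addr = byte_addr / BLOCK_BYTES;
        return block_addr % NUM_ENTRIES;
    }

    /* Set associative: the block may go in any way of exactly one set. */
    uint32_t set_associative_set(uint32_t byte_addr)
    {
        uint32_t block_addr = byte_addr / BLOCK_BYTES;
        return block_addr % NUM_SETS;
    }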

Cache Organizations
(Figure comparing cache organizations.)

Direct Mapped Cache
Mapping: the cache address is the memory address modulo the number of blocks in the cache.
(Figure 7.5: main memory blocks mapped onto the cache.)

Direct Mapping Cache: 1 × 32-bit Data
(Figure, similar to Figure 7.7: the 32-bit address is split into tag, index, and byte offset fields; each cache entry holds a valid bit, a tag, and one 32-bit data word, and the tag comparison produces the hit signal.)
Byte offset = 2 bits, since 1 word = 2^2 = 4 bytes. Index = 14 bits, since 2^14 = 16K is the number of cache entries. The remaining 16 bits of the address form the tag.
What kind of locality are we taking advantage of?

Direct Mapping Cache: 4 × 32-bit Data
(Figure, similar to Figure 7.9: the address is split into tag, index, block offset, and byte offset fields; each entry holds a valid bit, a tag, and four 32-bit data words, with a multiplexer selecting the requested word on a hit.)
Byte offset = 2 bits, block offset = 2 bits (4 words per block), index = 12 bits for 4K entries, and the remaining 16 bits form the tag.
Taking advantage of spatial locality.
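A small sketch of the address-field extraction for the two 64 KB configurations on the preceding slides; the helper names are invented, and the bit widths follow the breakdowns given above.

    #include <stdint.h>

    /* 64 KB direct-mapped cache, one-word (4-byte) blocks:
       [ tag:16 | index:14 | byte offset:2 ]                                 */
    void split_1word(uint32_t addr, uint32_t *tag, uint32_t *index)
    {
        *index = (addr >> 2) & 0x3FFF;   /* 14 bits -> 16K entries           */
        *tag   =  addr >> 16;            /* remaining upper 16 bits          */
    }

    /* 64 KB direct-mapped cache, four-word (16-byte) blocks:
       [ tag:16 | index:12 | block offset:2 | byte offset:2 ]                */
    void split_4word(uint32_t addr, uint32_t *tag, uint32_t *index,
                     uint32_t *word_in_block)
    {
        *word_in_block = (addr >> 2) & 0x3;    /* selects 1 of 4 words (mux) */
        *index         = (addr >> 4) & 0xFFF;  /* 12 bits -> 4K entries      */
        *tag           =  addr >> 16;          /* upper 16 bits              */
    }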

Performance Results
In the previous two slides:
– the cache with a 1-word block size has 16K entries, i.e. a total of 64KB,
– the cache with a 4-word block size has 4K entries, i.e. a total of 64KB.
The cache that takes advantage of spatial locality has much better performance.

4-Way Set-Associative Cache
(Figure 7.17: the address is divided into tag, index, and byte offset fields; the index selects a set, the four tags in that set are compared in parallel, and a 4-to-1 multiplexer delivers the data word on a hit.)
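A minimal sketch of a 4-way set-associative lookup; the geometry (256 sets, one-word blocks) and all names are chosen here for illustration and are not taken from Figure 7.17.

    #include <stdint.h>
    #include <stdbool.h>

    #define WAYS      4
    #define NUM_SETS  256                        /* example: 8-bit set index  */

    struct way { bool valid; uint32_t tag; uint32_t data; };
    static struct way sets[NUM_SETS][WAYS];

    /* Returns true on a hit and stores the word in *out. */
    bool set_assoc_lookup(uint32_t addr, uint32_t *out)
    {
        uint32_t index = (addr >> 2) & (NUM_SETS - 1);   /* 8-bit set index   */
        uint32_t tag   =  addr >> 10;                    /* remaining bits    */

        for (int w = 0; w < WAYS; w++) {         /* hardware compares all four
                                                    tags in parallel          */
            if (sets[index][w].valid && sets[index][w].tag == tag) {
                *out = sets[index][w].data;      /* 4-to-1 mux selects data   */
                return true;
            }
        }
        return false;                            /* miss                      */
    }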

Replacement Algorithms
– Simple for direct-mapped caches: there is no choice.
– Random: simple to build in hardware.
– Least Recently Used (LRU): since true LRU cannot be implemented efficiently, approximations of LRU are normally used.
– First In First Out (FIFO).
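One way to realize LRU bookkeeping, sketched with per-way age counters for a 4-way set; real hardware normally uses cheaper approximations (for example pseudo-LRU bits), and these names are illustrative.

    #include <stdint.h>

    #define WAYS 4

    /* One set's recency state: age[w] == 0 means way w was used most recently. */
    struct lru_state { uint8_t age[WAYS]; };

    /* Update ages after an access to way "used". */
    void lru_touch(struct lru_state *s, int used)
    {
        for (int w = 0; w < WAYS; w++)
            if (s->age[w] < s->age[used])
                s->age[w]++;            /* ways younger than "used" get older   */
        s->age[used] = 0;               /* the accessed way becomes the youngest */
    }

    /* Pick the victim on a miss: the way with the largest age (least recent). */
    int lru_victim(const struct lru_state *s)
    {
        int victim = 0;
        for (int w = 1; w < WAYS; w++)
            if (s->age[w] > s->age[victim])
                victim = w;
        return victim;
    }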

Write Policy
A write is more complex than a read: the write and the tag comparison cannot proceed simultaneously, and only a portion of the block has to be updated.
Write policies:
– write through: write to both the cache and memory,
– write back: write only to the cache, and write the block to memory when it is replaced (dirty bit).
Write through is usually found today only in first-level data caches backed by an L2 cache that uses write back.
Write hits:
– write-through: update the data in both the cache and memory,
– write-back: write the data only into the cache, and write it back to memory later.
Write misses: read the entire block into the cache, then write the word.
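A rough sketch of the write-through and write-back behavior described above, for one cache entry with one-word blocks; the structure and function names are invented, and write-miss handling (read the block first, then write the word) is omitted.

    #include <stdint.h>
    #include <stdbool.h>

    struct entry { bool valid; bool dirty; uint32_t tag; uint32_t data; };

    static uint32_t memory[1024];
    static void write_word_to_memory(uint32_t addr, uint32_t word)
    {
        memory[(addr >> 2) % 1024] = word;       /* simulated main memory       */
    }

    /* Write through: the cache and memory are always kept consistent.          */
    void write_through_hit(struct entry *e, uint32_t addr, uint32_t word)
    {
        e->data = word;
        write_word_to_memory(addr, word);        /* memory updated immediately  */
    }

    /* Write back: only the cache is updated; memory is updated when the dirty
       block is eventually evicted.                                             */
    void write_back_hit(struct entry *e, uint32_t word)
    {
        e->data  = word;
        e->dirty = true;                         /* block is now stale in memory */
    }

    void evict(struct entry *e, uint32_t block_addr)
    {
        if (e->dirty)
            write_word_to_memory(block_addr, e->data);   /* write back on evict */
        e->valid = false;
        e->dirty = false;
    }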

Cache Coherency Problem
This problem occurs with a write-back policy.
(Figures: "I/O gets old value"; "I/O value will be lost".)

Solutions to Cache Coherency Problem
If write through were used, then memory would have an up-to-date copy of the information, and there would be no stale-data issue for output. When write back is used, the stale data for output is not critical.
For input, the software solution is to guarantee that no blocks of the I/O buffer designated for input are in the cache; thus, the operating system always inputs to memory pages marked as non-cacheable. The hardware solution is to check the I/O addresses on input to see if they are in the cache, and if there is a match, the cache entries are invalidated to avoid stale data.
Note that the cache coherency problem applies to multiprocessors as well as to I/O.

Summer Quarter 2005, CSE 675.02 Introduction to Computer Architecture
The End