
1 Cache Memory Presentation I
CSE 675.02: Introduction to Computer Architecture Cache Memory Presentation I Gojko Babić 08/12/2005

2 The Levels in Memory Hierarchy
The higher the level, the smaller and faster the memory. Try to keep most of the action in the higher levels.

3 Principle of Locality
The principle of locality is the most important program property exploited in many parts of the memory hierarchy. It states that programs tend to reuse instructions and data they have used recently. There are two different types of locality:
– temporal locality: recently accessed locations in the main memory are likely to be accessed again in the near future,
– spatial locality: locations in the main memory near one another tend to be referenced close together in time.
The principle of locality applies more strongly to code accesses than to data accesses. An implication of the principle of locality is that we can predict with reasonable accuracy which instructions and data a program will use in the near future, based on its accesses in the recent past.
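A minimal C sketch of both kinds of locality (the array and loop are illustrative, not from the slides): the running total sum is reused on every iteration (temporal locality), while the array elements are read from consecutive addresses (spatial locality).

    /* Illustrative only: temporal locality on `sum`, spatial locality
       on the sequential reads of `data`. */
    #include <stdio.h>

    int main(void) {
        static int data[1024];
        long sum = 0;

        for (int i = 0; i < 1024; i++)
            data[i] = i;

        for (int i = 0; i < 1024; i++)
            sum += data[i];        /* consecutive addresses: spatial locality */

        printf("sum = %ld\n", sum);  /* prints sum = 523776 */
        return 0;
    }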

4 Memory and Cache Systems
Main memory: DRAM technology. Cache memory: SRAM technology. Cache is fast but, because of that, it has to be small. Why? Locality of reference and the limit on the size of very fast memory have led to the concept of cache memory.
[Figure: a 400 MHz CPU connected to a cache, a 66 MHz bus, and a 10 MHz main memory; data objects transfer between CPU and cache, while blocks transfer between cache and main memory.]

5 Basics of Cache Operation
Since it is more important, we first consider a cache read operation:
CPU requests the contents of a given memory location; check the cache for those contents. The cache includes tags to identify which block from main memory is in each cache entry.
– if present, this is a hit: get the contents (fast),
– if not present, this is a miss: read the block with the required contents from main memory into the cache, then deliver the contents from the cache to the CPU.
Miss penalty: the time to replace a block from a lower level.
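A minimal sketch of this read flow for a direct-mapped cache; the sizes, structure, and the main_memory_read stand-in are illustrative assumptions, not from the slides.

    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_ENTRIES 16384                  /* 16K entries, as on slide 17 */

    struct cache_entry {
        bool     valid;
        uint32_t tag;                          /* identifies the cached block */
        uint32_t data;                         /* one 32-bit word per entry */
    };

    static struct cache_entry cache[NUM_ENTRIES];

    /* stand-in for the slow path to main memory */
    static uint32_t main_memory_read(uint32_t addr) { return addr; }

    uint32_t cache_read(uint32_t addr) {
        uint32_t index = (addr >> 2) % NUM_ENTRIES;  /* drop 2-bit byte offset */
        uint32_t tag   = addr >> 16;                 /* remaining high bits */

        if (cache[index].valid && cache[index].tag == tag)
            return cache[index].data;                /* hit: fast */

        /* miss: pay the miss penalty, fill the entry, then deliver */
        cache[index].data  = main_memory_read(addr);
        cache[index].tag   = tag;
        cache[index].valid = true;
        return cache[index].data;
    }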

6 2-level Hierarchy: Performance View
T2: main memory access time. T1: cache access time. The hit ratio is the ratio of the number of hits to the total number of memory accesses. Miss ratio = 1 – hit ratio. Hit rates are normally well over 90%.
[Figure: access time as a function of hit ratio, falling from T1 + T2 at hit ratio 0 to T1 at hit ratio 1.]
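The curve in the figure corresponds to average access time = H × T1 + (1 – H) × (T1 + T2), where H is the hit ratio; a short sketch with illustrative numbers:

    #include <stdio.h>

    /* Two-level model: T1 on a hit, T1 + T2 on a miss. */
    double avg_access_time(double hit_ratio, double t1, double t2) {
        return hit_ratio * t1 + (1.0 - hit_ratio) * (t1 + t2);
    }

    int main(void) {
        /* e.g. T1 = 1 ns cache, T2 = 50 ns main memory, 95% hit ratio */
        printf("%.2f ns\n", avg_access_time(0.95, 1.0, 50.0));  /* 3.50 ns */
        return 0;
    }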

7 One-Word-Wide Memory Bus
Let us assume the performance of the main memory to be:
– 1 clock cycle to send the address,
– 14 clock cycles for the access time per word,
– 1 clock cycle to send a word of data.
[Figure: CPU, cache, one-word-wide bus, and main memory.]
Given a cache block of 1 word (1 word = 4 bytes):
Miss penalty = 1 + 14 + 1 = 16 cycles
Throughput = 4 bytes / 16 cycles = 0.25 bytes per cycle
Although caches benefit from low-latency main memory, it is generally easier to improve memory bandwidth with a new organization than it is to reduce latency.

8 Wider Main Memory Bus
Given a 4-word-wide memory bus and a cache block of four words:
Miss penalty (4-word memory bus) = 1 + 14 + 1 = 16 cycles (as before)
Throughput = 16 bytes / 16 cycles = 1 byte per cycle
[Figure 7.11 a. & b.: a. one-word-wide memory organization; b. wide memory organization, with a multiplexer between the cache and the CPU.]
The CPU will still access the cache a word at a time, so there is a need for a multiplexer.

9 Wider Main Memory Bus & Level 2 Cache
[Figure: CPU, bus, and wide main memory, with a second-level cache and the multiplexer between the first- and second-level caches.]
– But the multiplexer may be on the critical timing path.
– Here, the second-level cache can help, since the multiplexing can be done between the first- and second-level caches, not on the critical timing path.

10 Interleaved Memory Organization
[Figure: CPU, cache, one-word-wide bus, and four memory banks (0, 1, 2, 3).]
The memory bus width is 1 word. This is four-way interleaved memory. Assuming a cache block of four words:
Miss penalty = 1 + 14 + 4 × 1 = 19 cycles
Throughput = 16 bytes / 19 cycles = 0.84 bytes per cycle
The example assumes word addressing. With byte addressing and 4 bytes per word, each of the addresses would be a multiple of 4. One address is sent to all banks, and each bank sends its data in its own clock cycle.
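A short sketch reproducing the miss-penalty arithmetic from slides 7, 8, and 10; the macro and variable names are illustrative.

    #include <stdio.h>

    #define ADDR_CYCLES   1    /* send the address      */
    #define ACCESS_CYCLES 14   /* access time per word  */
    #define XFER_CYCLES   1    /* send one word of data */

    int main(void) {
        /* slide 7: 1-word bus, 1-word block */
        int one_word    = ADDR_CYCLES + ACCESS_CYCLES + XFER_CYCLES;
        /* slide 8: 4-word bus, 4-word block fetched in one access */
        int wide_bus    = ADDR_CYCLES + ACCESS_CYCLES + XFER_CYCLES;
        /* slide 10: four banks accessed in parallel, words sent one per cycle */
        int interleaved = ADDR_CYCLES + ACCESS_CYCLES + 4 * XFER_CYCLES;

        printf("1-word bus : %2d cycles, %.2f bytes/cycle\n", one_word,     4.0 / one_word);
        printf("wide bus   : %2d cycles, %.2f bytes/cycle\n", wide_bus,    16.0 / wide_bus);
        printf("interleaved: %2d cycles, %.2f bytes/cycle\n", interleaved, 16.0 / interleaved);
        return 0;
    }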

11 Basics of Cache Design
Elements of cache design:
– cache size, i.e. the number of entries,
– block (line) size, i.e. the number of data elements per entry,
– the number of caches,
– mapping functions: block placement and block identification,
– replacement algorithm,
– write policy.

12 Cache Size and Block Size
Cache size << main memory size.
Cache size small enough to:
– minimize cost,
– speed up access (fewer gates to address the cache), and
– keep the cache on chip.
Cache size large enough to minimize the average access time:
Average access time = hit time + (1 – hit rate) × miss penalty
Smaller blocks do not take advantage of spatial locality; larger blocks reduce the number of blocks in the cache and increase replacement overhead.

13 Number of Caches
Increased logic density => on-chip cache:
– internal cache: level 1 (L1),
– external or internal cache: level 2 (L2).
A unified cache balances the load between instruction and data fetches, and only one cache needs to be designed and implemented. Split caches (a data cache and an instruction cache) suit pipelined, parallel architectures.

14 Mapping Functions
The mapping function determines the basic cache organization.
Direct-mapped cache: maps each block into only one possible entry:
entry number = (block address) modulo (number of entries)
Fully associative cache: a block can be placed anywhere in the cache.
Set-associative cache: a block can be placed in a restricted set of entries:
set number = (block address) modulo (number of sets in cache)
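The two modulo mappings above, as a small C sketch; the entry and set counts are illustrative assumptions.

    #include <stdio.h>
    #include <stdint.h>

    #define NUM_ENTRIES 4096    /* direct-mapped cache entries */
    #define NUM_SETS    1024    /* sets in a 4-way set-associative cache */

    int main(void) {
        uint32_t block_addr = 75000;                 /* arbitrary block address */

        uint32_t entry = block_addr % NUM_ENTRIES;   /* direct mapped */
        uint32_t set   = block_addr % NUM_SETS;      /* set associative */

        printf("block %u -> entry %u (direct), set %u (4-way)\n",
               block_addr, entry, set);              /* entry 1272, set 248 */
        return 0;
    }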

15 Cache Organizations
[Figure: the direct-mapped, set-associative, and fully associative organizations described on the previous slide.]

16 Direct Mapped Cache
Mapping: cache address is memory address modulo the number of blocks in the cache.
[Figure 7.5: a direct-mapped cache, showing which main-memory addresses map to each cache entry.]

17 Direct Mapping Cache: 1 × 32-bit Data
[Similar to Figure 7.7: the 32-bit address (showing bit positions) is split into a 16-bit tag, a 14-bit index, and a 2-bit byte offset; each of the 16K entries holds a valid bit, a tag, and a 32-bit data word, and a hit is signaled when the stored tag matches.]
Byte offset = 2 bits, since 1 word = 2² = 4 bytes.
Index = 14 bits, since 2¹⁴ = 16K cache entries.
What kind of locality are we taking advantage of?
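Splitting an address into these fields, as a small sketch; the sample address is arbitrary.

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint32_t addr = 0x12345678;                   /* arbitrary example */

        uint32_t byte_offset = addr & 0x3;            /* bits  1..0          */
        uint32_t index       = (addr >> 2) & 0x3FFF;  /* bits 15..2, 14 bits */
        uint32_t tag         = addr >> 16;            /* bits 31..16, 16 bits */

        printf("tag=0x%04X index=0x%04X offset=%u\n", tag, index, byte_offset);
        return 0;
    }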

18 Direct Mapping Cache: 4 × 32-bit Data
[Similar to Figure 7.9: the 32-bit address (showing bit positions) is split into a 16-bit tag, a 12-bit index, a 2-bit block offset, and a 2-bit byte offset; each of the 4K entries holds a valid bit, a tag, and four 32-bit words, with a multiplexer selecting the requested word of the block.]
Taking advantage of spatial locality.

19 Performance Results
In the previous two slides:
– the cache with a 1-word block size has 16K entries, i.e. a total of 64KB,
– the cache with a 4-word block size has 4K entries, i.e. a total of 64KB.
The cache that takes advantage of spatial locality has much better performance.

20 4-Way Set Associative Cache
[Figure 7.17: the address is split into a 22-bit tag, an 8-bit index, and a 2-bit byte offset; the index selects a set, the tags of the four ways are compared in parallel against the address tag, and a 4-to-1 multiplexer driven by the hit signals selects the data.]

21 Replacement Algorithms
Simple for direct-mapped caches: there is no choice.
Random: simple to build in hardware.
Least Recently Used (LRU): since true LRU cannot be implemented efficiently, approximations of LRU are normally used.
First In First Out (FIFO).
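A minimal sketch of true LRU for one 4-way set, using age counters; real hardware normally uses cheaper approximations such as pseudo-LRU bits, and the names and sizes here are illustrative.

    #include <stdint.h>
    #include <stdbool.h>

    #define WAYS 4

    struct way {
        bool     valid;
        uint32_t tag;
        uint32_t age;    /* 0 = most recently used, WAYS-1 = least */
    };

    static struct way set[WAYS];

    /* Called on every access to way w: w becomes youngest, and
       every valid way that was younger than w grows one step older. */
    void touch(int w) {
        for (int i = 0; i < WAYS; i++)
            if (set[i].valid && set[i].age < set[w].age)
                set[i].age++;
        set[w].age = 0;
    }

    /* Pick the victim on a miss: an empty way if any, else the oldest. */
    int lru_victim(void) {
        int victim = 0;
        for (int i = 0; i < WAYS; i++) {
            if (!set[i].valid)
                return i;
            if (set[i].age > set[victim].age)
                victim = i;
        }
        return victim;
    }

    /* usage: on a hit, touch(w); on a miss, w = lru_victim(),
       fill way w, then touch(w). */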

22 Write Policy
A write is more complex than a read:
– the write and the tag comparison cannot proceed simultaneously,
– only a portion of the block has to be updated.
Write policies:
– write through: write to both the cache and memory,
– write back: write only to the cache, and write to memory when the block is replaced (dirty bit).
Write through is usually found today only in first-level data caches backed by a level-2 cache that uses write back.
Write hits:
– write through: replace the data in both the cache and memory,
– write back: write the data only into the cache, and write it back to memory later.
Write misses: read the entire block into the cache, then write the word.
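A sketch contrasting the two policies on a write hit and at eviction; the line structure, function names, and the main_memory_write stand-in are illustrative assumptions.

    #include <stdint.h>
    #include <stdbool.h>

    struct line {
        bool     valid;
        bool     dirty;    /* used only by write back */
        uint32_t tag;
        uint32_t data;
    };

    /* stand-in for the path to main memory */
    static void main_memory_write(uint32_t addr, uint32_t word) {
        (void)addr; (void)word;
    }

    /* write through: cache and memory are updated together */
    void write_through_hit(struct line *l, uint32_t addr, uint32_t word) {
        l->data = word;
        main_memory_write(addr, word);
    }

    /* write back: only the cache is updated now; the dirty bit defers
       the memory write until the line is replaced */
    void write_back_hit(struct line *l, uint32_t word) {
        l->data  = word;
        l->dirty = true;
    }

    /* on replacement, a dirty line must be flushed to memory first */
    void evict(struct line *l, uint32_t addr) {
        if (l->valid && l->dirty)
            main_memory_write(addr, l->data);
        l->valid = false;
        l->dirty = false;
    }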

23 Cache Coherency Problem
This problem occurs with a write-back policy.
[Figure: on output, the I/O device gets the old value from memory while the cache holds the new one; on input, the value the I/O device writes to memory will be lost on a CPU that still reads the stale cached copy.]

24 Solutions to Cache Coherency Problem
If write through were used, then memory would have an up-to-date copy of the information, and there would be no stale-data issue for output. When write back is used, however, stale data for output becomes a real issue. For input, the software solution is to guarantee that no blocks of the I/O buffer designated for input are in the cache; thus, the operating system always inputs to memory pages marked as non-cacheable. The hardware solution is to check the I/O addresses on input to see if they are in the cache and, if there is a match, invalidate the cache entries to avoid stale data. Note that the cache coherency problem applies to multiprocessors as well as to I/O.
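A minimal sketch of the hardware solution: on an input transfer, probe the cache for each address written to memory and invalidate a match. It reuses the direct-mapped layout assumed in the earlier sketches and is illustrative, not the slide's design.

    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_ENTRIES 16384

    struct entry {
        bool     valid;
        uint32_t tag;
        uint32_t data;
    };

    static struct entry cache[NUM_ENTRIES];

    /* Called for each address an input transfer writes to memory:
       if the cache holds that address, invalidate the entry so the
       CPU cannot read stale data. */
    void io_input_snoop(uint32_t addr) {
        uint32_t index = (addr >> 2) % NUM_ENTRIES;
        uint32_t tag   = addr >> 16;

        if (cache[index].valid && cache[index].tag == tag)
            cache[index].valid = false;
    }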

25 The End
Summer Quarter 2005, CSE 675.02: Introduction to Computer Architecture

