CMPE 421 Parallel Computer Architecture, PART 4: Caching with Associativity

1 CMPE 421 Parallel Computer Architecture, PART 4: Caching with Associativity

2 Fully Associative Cache: Reducing Cache Misses by More Flexible Block Placement
- Instead of direct mapping, we allow any memory block to be placed in any cache slot.
  - Many different memory addresses can map to each cache entry.
  - Any available entry can be used to store a memory element.
  - Remember: direct-mapped caches are more rigid; cache data goes exactly where the index says, even if the rest of the cache is empty.
  - In a fully associative cache, nothing gets "thrown out" until the cache is completely full.
- It is harder to check for a hit (hit time will increase).
- Requires much more hardware (a comparator for each cache slot).
- Each tag is a complete block address (no index bits are used).

3 Fully Associative Cache
- Must compare the tags of all entries in parallel to find the desired one (if there is a hit).
  - A direct-mapped cache only needs to look in one place.
- No conflict misses, only capacity misses.
- Practical only for caches with a small number of blocks, since parallel searching increases the hardware cost.

4 Fully Associative Cache

5 Direct Mapped vs Fully Associative
[Diagram: a direct-mapped cache with 16 indexed entries, each holding V | Tag | Data; Address = Tag | Index | Block offset; each address has only one possible location. Beside it, a fully associative cache with no index; Address = Tag | Block offset; a block may be placed in any entry.]

6 Trade-off
- Fully associative is much more flexible, so the miss rate will be lower.
- Direct mapped requires less hardware (cheaper) and will also be faster!
- Trade-off of miss rate vs. hit time.
- We might therefore compromise on a design between a direct-mapped cache and a fully associative cache.
- We can also provide more flexibility without going to a fully associative placement policy.
  - For each memory location, provide a small number of cache slots that can hold the memory element.
  - This is much more flexible than direct mapped, but requires less hardware than fully associative.
- SOLUTION: Set Associative

7 Set Associative Cache
- A fixed number of locations where each block can be placed.
- N-way set associative means there are N places (slots) where each block can be placed.
- Divide the cache into a number of sets, each set of size N "ways" (N-way set associative).
- A memory block therefore maps to a unique set (specified by the index field) and can be placed in any way of that set.
  - So there are N choices.
  - In a set-associative cache, a memory block is mapped to set
    (Block address) modulo (Number of sets in the cache)
  - Remember that in a direct-mapped cache, the position of a memory block is given by
    (Block address) modulo (Number of cache blocks)
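The two modulo formulas above can be sketched directly. This is a minimal illustration with hypothetical cache sizes (8 blocks, 2 ways), not parameters taken from the slides:

```python
# Assumed example parameters (not from the slides).
NUM_BLOCKS = 8                  # total blocks in the cache
WAYS = 2                        # 2-way set associative
NUM_SETS = NUM_BLOCKS // WAYS   # 4 sets

def set_index(block_address):
    # Set-associative: (Block address) modulo (Number of sets in the cache)
    return block_address % NUM_SETS

def direct_mapped_index(block_address):
    # Direct mapped: (Block address) modulo (Number of cache blocks)
    return block_address % NUM_BLOCKS

print(set_index(13))            # 13 mod 4 = 1
print(direct_mapped_index(13))  # 13 mod 8 = 5
```

The same block address lands in different places under the two schemes; in the set-associative case, index 1 names a set of 2 ways, either of which may hold the block.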

8 A Compromise
- 2-way set associative: Address = Tag | Index | Block offset. Each address has two possible locations with the same index; one fewer index bit, so 1/2 the indexes.
- 4-way set associative: Address = Tag | Index | Block offset. Each address has four possible locations with the same index; two fewer index bits, so 1/4 the indexes.
[Diagram: a 2-way cache with 8 sets and a 4-way cache with 4 sets, each entry holding V | Tag | Data.]

9 Range of Set Associative Caches
Address fields: Tag | Index | Block offset | Byte offset
- Tag: used for the tag compare. Index: selects the set. Block offset: selects the word in the block.
- Increasing associativity means fewer index bits. Fully associative (only one set) has no index at all; the tag is all the bits except the block and byte offsets.
- Decreasing associativity means more index bits and smaller tags; direct mapped is the extreme (only one way).
- The index is the set number and determines which set the block can be placed in.

10 Range of Set Associative Caches
Address fields: Tag | Index | Block offset | Byte offset
- For a fixed-size cache, each increase by a factor of two in associativity:
  - doubles the number of blocks per set (i.e., the number of ways),
  - halves the number of sets,
  - decreases the size of the index by 1 bit,
  - and increases the size of the tag by 1 bit.
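The bit-width bookkeeping above can be checked with a short sketch. The parameters are assumed for illustration (32-bit addresses, 64 blocks, 4-word blocks, 32-bit words), not taken from the slides:

```python
import math

# Assumed example parameters (not from the slides).
ADDRESS_BITS = 32
TOTAL_BLOCKS = 64
BLOCK_OFFSET_BITS = 2   # selects the word in a 4-word block
BYTE_OFFSET_BITS = 2    # selects the byte in a 32-bit word

def field_widths(ways):
    # Fixed total size: more ways -> fewer sets -> fewer index bits.
    num_sets = TOTAL_BLOCKS // ways
    index_bits = int(math.log2(num_sets))
    tag_bits = ADDRESS_BITS - index_bits - BLOCK_OFFSET_BITS - BYTE_OFFSET_BITS
    return index_bits, tag_bits

for ways in (1, 2, 4, 8):
    print(ways, field_widths(ways))
# Each doubling of associativity: one fewer index bit, one more tag bit.
```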

11 Set Associative Cache
[Diagram: a small set-associative cache with one-word blocks mapped against a 16-word main memory; entries hold V | Tag | Data, and main memory addresses 0000xx through 1111xx are shown.]
- Q1: How do we find it? Use the next low-order memory address bit to determine the cache set, i.e., (block address) modulo (number of sets in the cache).
- Q2: Is it there? Compare all the cache tags in the set against the high-order 3 memory address bits to tell whether the memory block is in the cache.
- The two low-order bits define the byte in the word (32-bit words).
- The valid bit indicates whether an entry contains valid information; if the bit is not set, there cannot be a match for this block.

12 Set Associative Cache Organization
FIGURE 7.17 The implementation of a four-way set-associative cache requires four comparators and a 4-to-1 multiplexor. The comparators determine which element of the selected set (if any) matches the tag. The output of the comparators is used to select the data from one of the four blocks of the indexed set, using a multiplexor with a decoded select signal. In some implementations, the Output enable signals on the data portions of the cache RAMs can be used to select the entry in the set that drives the output. The Output enable signal comes from the comparators, causing the element that matches to drive the data outputs.

13 Set Associative Cache Organization
- This is called a 4-way set-associative cache because there are four cache entries for each cache index. Essentially, you have four direct-mapped caches working in parallel.
- This is how it works: the cache index selects a set from the cache. The four tags in the set are compared in parallel with the upper bits of the memory address.
- If no tag matches the incoming address tag, we have a cache miss.
- Otherwise, we have a cache hit, and we select the data from the way where the tag match occurs.
- This is simple enough. What are its disadvantages?
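The hit/miss flow just described can be sketched in software. This is a behavioral model only (the hardware compares all four tags in parallel; a loop stands in for that here), and the set/way counts and field names are assumptions for illustration:

```python
# Hypothetical 4-way set-associative cache: 4 sets of 4 ways each.
NUM_SETS = 4
WAYS = 4

# cache[set_index] is a list of ways, each a dict of valid/tag/data.
cache = [[{"valid": False, "tag": None, "data": None} for _ in range(WAYS)]
         for _ in range(NUM_SETS)]

def lookup(set_index, tag):
    # The index selects a set; every way's tag is then compared
    # (in hardware, all comparisons happen at once).
    for way in cache[set_index]:
        if way["valid"] and way["tag"] == tag:
            return ("hit", way["data"])   # mux selects the matching way
    return ("miss", None)                 # no comparator fired

# Fill one entry and probe the set.
cache[2][1] = {"valid": True, "tag": 0b10110, "data": "block-X"}
print(lookup(2, 0b10110))  # matching tag in way 1
print(lookup(2, 0b00001))  # no matching tag
```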

14 N-way Set Associative Cache versus Direct Mapped Cache
- An N-way set-associative cache will also be slower than a direct-mapped cache because:
  - N comparators vs. 1,
  - extra MUX delay for the data,
  - data comes AFTER the hit/miss decision and set selection.
- In a direct-mapped cache, the cache block is available BEFORE the hit/miss decision:
  - possible to assume a hit and continue; recover later if it was a miss.

15 Remember the Example for Direct Mapping (ping-pong effect)
- Consider the main memory word reference string 0 4 0 4 0 4 0 4 (start with an empty cache; all blocks initially marked as not valid). Every reference misses: bringing in Mem(4) evicts Mem(0), and bringing in Mem(0) evicts Mem(4), over and over.
- Ping-pong effect due to conflict misses: two memory locations that map into the same cache block.
  - 8 requests, 8 misses

16 Solution: Use a Set Associative Cache
- Consider the same main memory word reference string 0 4 0 4 0 4 0 4 (start with an empty cache; all blocks initially marked as not valid). After the first two misses bring Mem(0) and Mem(4) into the same set, every following reference hits.
- Solves the ping-pong effect of a direct-mapped cache due to conflict misses, since two memory locations that map into the same cache set can now co-exist!
  - 8 requests, 2 misses

17 Set Associative Example
[Worked example: the address reference string 0100111000, 1100110100, 0100111100, 0110110000, 1100111000 (byte offset 2 bits, block offset 2 bits; 1 to 3 index bits and 3 to 5 tag bits depending on the organization) is run against a direct-mapped cache (8 entries), a 2-way set-associative cache (4 sets), and a 4-way set-associative cache (2 sets), all starting empty. Higher associativity turns some of the direct-mapped misses into hits.]

18 New Performance Numbers
Miss rates for the DEC 3100 (a MIPS machine), with separate 64KB instruction/data caches:

Benchmark | Associativity | Instruction miss rate | Data miss rate | Combined miss rate
gcc       | Direct        | 2.0% | 1.7% | 1.9%
gcc       | 2-way         | 1.6% | 1.4% | 1.5%
gcc       | 4-way         | 1.6% | 1.4% | 1.5%
spice     | Direct        | 0.3% | 0.6% | 0.4%
spice     | 2-way         | 0.3% | 0.6% | 0.4%
spice     | 4-way         | 0.3% | 0.6% | 0.4%

19 Benefits of Set Associative Caches
- The choice of direct mapped or set associative depends on the cost of a miss versus the cost of implementation.
- The largest gains are in going from direct mapped to 2-way (20%+ reduction in miss rate).
(Data from Hennessy & Patterson, Computer Architecture, 2003)

20 Benefits of Set Associative Caches
- As the cache size grows, the relative improvement from associativity increases only slightly.
- Since the overall miss rate of a larger cache is lower, the opportunity for improving the miss rate decreases,
- and the absolute improvement in miss rate from associativity shrinks significantly.

21 Cache Block Replacement Policy
For deciding which block to replace when a new entry comes in:
- Random replacement:
  - hardware randomly selects a cache entry and throws it out.
- First In First Out (FIFO):
  - equally fair / equally unfair to all frames.
- Least Recently Used (LRU) strategy:
  - uses the idea of temporal locality to select the entry that has not been accessed recently.
  - Additional bit(s) are required in each cache entry to track access order; they must be updated on each access, and all must be scanned on a replacement.
  - For a two-way set-associative cache, one bit per set suffices for LRU replacement.
- A common approach is to use a pseudo-LRU strategy. Example of a simple "pseudo" least-recently-used implementation:
  - Assume 64 fully associative entries.
  - A hardware replacement pointer points to one cache entry; that entry is the next replacement candidate.
  - Whenever an access is made to the entry the pointer points to, move the pointer to the next entry; otherwise, do not move the pointer.
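The replacement-pointer scheme can be sketched as follows. This is a behavioral model under stated assumptions: the entry count is reduced from 64 to 8 for brevity, and a miss fill is treated as an access to the pointed-to entry (so the pointer advances past a just-filled block), which is one plausible reading of the scheme:

```python
# Hypothetical 8-entry fully associative cache with a replacement pointer.
NUM_ENTRIES = 8
entries = [None] * NUM_ENTRIES   # stored block addresses
pointer = 0                      # hardware replacement pointer

def access(block_address):
    global pointer
    if block_address in entries:
        if entries.index(block_address) == pointer:
            # Accessed the pointed-to entry: advance the pointer so a
            # recently used block is never the next victim.
            pointer = (pointer + 1) % NUM_ENTRIES
        return "hit"
    # Miss: replace the entry the pointer points to, then advance
    # (assumption: the fill counts as an access to that entry).
    entries[pointer] = block_address
    pointer = (pointer + 1) % NUM_ENTRIES
    return "miss"

print(access(10))  # first touch of block 10
print(access(10))  # block 10 is now resident
```

The appeal of the scheme is cost: one pointer for the whole cache instead of per-entry age bits, at the price of only approximating true LRU order.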

22 Sources of Cache Misses

                | Direct Mapped | N-way Set Associative | Fully Associative
Cache size      | Big           | Medium                | Small
Compulsory miss | Same          | Same                  | Same
Conflict miss   | High          | Medium                | Zero
Capacity miss   | Low(er)       | Medium                | High

23 Designing a Cache

Design change          | Effect on miss rate              | Negative performance effect
Increase size          | Decreases capacity misses        | May increase access time
Increase associativity | Decreases conflict misses        | May increase access time
Increase block size    | May decrease compulsory misses   | May increase miss penalty; may increase capacity misses

Note: if you are running "billions" of instructions, compulsory misses are insignificant.

24 Key Cache Design Parameters

Parameter                  | L1 typical   | L2 typical
Total size (blocks)        | 250 to 2000  | 4000 to 250,000
Total size (KB)            | 16 to 64     | 500 to 8000
Block size (B)             | 32 to 64     | 32 to 128
Miss penalty (clocks)      | 10 to 25     | 100 to 1000
Miss rates (global for L2) | 2% to 5%     | 0.1% to 2%

25 Two Machines' Cache Parameters

Parameter         | Intel P4                                  | AMD Opteron
L1 organization   | Split I$ and D$                           | Split I$ and D$
L1 cache size     | 8KB for D$, 96KB for trace cache (~I$)    | 64KB for each of I$ and D$
L1 block size     | 64 bytes                                  | 64 bytes
L1 associativity  | 4-way set assoc.                          | 2-way set assoc.
L1 replacement    | ~LRU                                      | LRU
L1 write policy   | write-through                             | write-back
L2 organization   | Unified                                   | Unified
L2 cache size     | 512KB                                     | 1024KB (1MB)
L2 block size     | 128 bytes                                 | 64 bytes
L2 associativity  | 8-way set assoc.                          | 16-way set assoc.
L2 replacement    | ~LRU                                      | ~LRU
L2 write policy   | write-back                                | write-back

26 Where Can a Block Be Placed/Found?

Scheme            | # of sets                              | Blocks per set
Direct mapped     | # of blocks in cache                   | 1
Set associative   | (# of blocks in cache) / associativity | Associativity (typically 2 to 16)
Fully associative | 1                                      | # of blocks in cache

Scheme            | Location method                        | # of comparisons
Direct mapped     | Index                                  | 1
Set associative   | Index the set; compare the set's tags  | Degree of associativity
Fully associative | Compare all blocks' tags               | # of blocks
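The set counts and comparison counts on this slide follow from one formula, with direct mapped and fully associative as the two extremes of associativity. A minimal sketch, assuming an example cache of 64 blocks:

```python
def organization(num_blocks, associativity):
    # associativity = 1            -> direct mapped
    # associativity = num_blocks   -> fully associative
    num_sets = num_blocks // associativity
    comparisons = associativity       # tags compared per lookup
    return num_sets, comparisons

print(organization(64, 1))    # direct mapped: 64 sets, 1 comparison
print(organization(64, 4))    # 4-way set associative: 16 sets, 4 comparisons
print(organization(64, 64))   # fully associative: 1 set, 64 comparisons
```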

27 Multilevel Caches
- A two-level cache structure allows the primary cache (L1) to focus on reducing hit time, to yield a shorter clock cycle.
- The second-level cache (L2) focuses on reducing the penalty of long memory access times.
- Compared to the cache of a single-cache machine, L1 on a multilevel-cache machine is usually smaller, has a smaller block size, and has a higher miss rate.
- Compared to the cache of a single-cache machine, L2 on a multilevel-cache machine is often larger, with a larger block size.
- The access time of L2 is less critical than that of the cache of a single-cache machine.
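The division of labor between L1 (hit time) and L2 (miss penalty) shows up directly in average memory access time (AMAT). A worked sketch with assumed illustrative numbers (not figures from the slides), using a global L2 miss rate as in the parameter table above:

```python
# Assumed illustrative parameters.
l1_hit_time = 1          # clocks
l1_miss_rate = 0.05      # fraction of accesses missing in L1
l2_hit_time = 10         # clocks paid on an L1 miss that hits in L2
l2_global_miss_rate = 0.02   # fraction of ALL accesses missing in L2
memory_penalty = 100     # clocks to main memory

# AMAT = L1 hit time
#      + (L1 miss rate) * (L2 hit time)
#      + (global L2 miss rate) * (memory penalty)
amat = (l1_hit_time
        + l1_miss_rate * l2_hit_time
        + l2_global_miss_rate * memory_penalty)
print(amat)  # 1 + 0.5 + 2.0 = 3.5 clocks
```

With these numbers, a 100-clock memory penalty is felt as only 2 clocks on average, which is why L2 can afford to be big and slow while L1 stays small and fast.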

