
1 Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers By Sreemukha Kandlakunta Phani Shashank Nagari

2 Outline
- Base Line Design
- Reducing Conflict Misses
  - Miss Caching
  - Victim Caching
- Reducing Capacity and Compulsory Misses
  - Stream Buffers
  - Multi-way Stream Buffers

3 Base Line Design

4 Base Line Design Contd.
- The size of on-chip caches varies with the implementation
- High-speed technologies result in smaller on-chip caches
- L1 caches are assumed to be direct-mapped
- L1 cache line sizes: 16-32 B
- L2 cache line sizes: 128-256 B

5 Parameters Assumed
- Processor speed: 1000 MIPS
- L1 Inst and Data cache size: 4 KB, line size: 16 B
- L2 Inst and Data cache size: 1 MB, line size: 128 B

6 Parameters Assumed Contd.
Miss penalty:
- L1 miss: 24 instruction times
- L2 miss: 320 instruction times

7 Test Program Characteristics

8 Base Line System L1 Cache Miss Rates

9 Base Line Design Performance

10 Inferences
- The potential performance loss lies in the memory hierarchy
- Focus on improving the performance of the memory hierarchy rather than CPU performance
- H/W techniques are used for improving the performance of the baseline memory hierarchy

11 How a Direct-Mapped Cache Works
[Figure: main memory mapped onto a direct-mapped cache with 8 blocks]
- Where to search? Each address maps to exactly one block: 00101, 01101, 10101, and 11101 all map to block 101
- How to identify? Match the tag: tag 01 in block 001 means address 01001 is there (see the lookup sketch below)
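To make the mapping concrete, here is a minimal Python sketch of the lookup from the slide's example: 5-bit block addresses and an 8-block cache, so the low 3 bits select the block and the high 2 bits are the tag. The names (`split`, `access`) are illustrative, not from the paper.

```python
NUM_BLOCKS = 8          # direct-mapped cache with 8 blocks
INDEX_BITS = 3          # log2(8) address bits select the block

cache = [None] * NUM_BLOCKS   # each entry holds just a tag (data omitted)

def split(addr):
    """Split a block address into (tag, index)."""
    index = addr & (NUM_BLOCKS - 1)   # low 3 bits select the block
    tag = addr >> INDEX_BITS          # remaining high bits are the tag
    return tag, index

def access(addr):
    tag, index = split(addr)
    if cache[index] == tag:           # only one place to look: block `index`
        return "hit"
    cache[index] = tag                # miss: the new line evicts the old one
    return "miss"

print(split(0b01001))                 # (1, 1): tag 01 in block 001
# 00101, 01101, 10101, 11101 all contend for block 101:
for a in (0b00101, 0b01101, 0b10101, 0b11101):
    access(a)                         # first pass: compulsory misses
for a in (0b00101, 0b01101, 0b10101, 0b11101):
    print(bin(a), access(a))          # second pass: still all miss (conflicts)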

12 How a Fully-Associative Cache Works
[Figure: main memory mapped onto a fully-associative cache with 8 blocks]
- Where to search? Every block in the cache must be checked: very expensive in hardware (sketched below)
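A matching sketch of a fully-associative lookup with LRU replacement; an `OrderedDict` stands in for the parallel tag comparison across all 8 blocks. Again the names are illustrative.

```python
from collections import OrderedDict

NUM_BLOCKS = 8

# An OrderedDict keyed by the full block address stands in for the
# parallel comparison of every block's tag, kept in LRU order.
fa_cache = OrderedDict()

def fa_access(addr):
    if addr in fa_cache:               # hardware checks ALL 8 blocks at once
        fa_cache.move_to_end(addr)     # refresh this line's LRU position
        return "hit"
    if len(fa_cache) == NUM_BLOCKS:
        fa_cache.popitem(last=False)   # evict the least recently used line
    fa_cache[addr] = True              # data payload omitted
    return "miss"

# The four addresses that conflicted in the direct-mapped cache now coexist:
for a in (0b00101, 0b01101, 0b10101, 0b11101):
    fa_access(a)                       # first pass: compulsory misses only
print([fa_access(a) for a in (0b00101, 0b01101)])   # ['hit', 'hit']
```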

13 Cache Misses
Three kinds:
- Instruction read miss: causes the most delay; the CPU has to wait until the instruction is fetched from DRAM
- Data read miss: causes less delay; instructions not dependent on the missing data can continue executing until the data is returned from DRAM
- Data write miss: causes the least delay; the write can be queued and the CPU can continue until the queue is full

14 Types of Misses
- Conflict misses: reduced by caching (miss caches and victim caches)
- Compulsory and capacity misses: both reduced by prefetching (stream buffers and multi-way stream buffers)

15 Conflict Miss
- Conflict misses are the misses that would not occur if the cache were fully associative with LRU replacement
- If an item is evicted from the cache and a later miss references that same item, that miss is a conflict miss

16 Conflict Misses Contd.
Conflict misses account for:
- 20-40% of overall D-M cache misses
- 39% of L1 D$ misses
- 29% of L1 I$ misses

17 Conflict Misses, 4 KB I&D

18 Outline
- Base Line Design
- Reducing Conflict Misses
  - Miss Caching
  - Victim Caching
- Reducing Capacity and Compulsory Misses
  - Stream Buffers
  - Multi-way Stream Buffers

19 Miss Caching
- Small, fully-associative on-chip cache
- On a miss, data is returned to both the direct-mapped cache and the small miss cache (where it replaces the LRU item)
- The processor probes both the D-M cache and the miss cache in parallel (sketched below)
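A minimal sketch of the miss-cache behavior just described, assuming the baseline 4 KB / 16 B direct-mapped data cache and a 4-entry LRU miss cache; the class and method names are invented for illustration.

```python
from collections import OrderedDict

class MissCache:
    """Sketch of miss caching: a tiny fully-associative LRU cache that
    receives a copy of every line that misses in the D-M cache."""

    def __init__(self, dm_blocks=256, entries=4):   # 4 KB / 16 B = 256 blocks
        self.dm = [None] * dm_blocks                # direct-mapped tags
        self.dm_blocks = dm_blocks
        self.mc = OrderedDict()                     # miss cache, LRU order
        self.entries = entries

    def access(self, addr):
        index = addr % self.dm_blocks
        tag = addr // self.dm_blocks
        if self.dm[index] == tag:                   # probed in parallel
            return "dm hit"
        if addr in self.mc:                         # hit in the miss cache:
            self.mc.move_to_end(addr)
            self.dm[index] = tag                    # refill the D-M cache
            return "miss-cache hit"                 # no off-chip penalty
        # Miss in both: the line returned from L2 goes to BOTH caches,
        # replacing the LRU entry of the miss cache.
        self.dm[index] = tag
        if len(self.mc) == self.entries:
            self.mc.popitem(last=False)
        self.mc[addr] = True
        return "full miss"

mc = MissCache()
a, b = 0x100, 0x200       # two block addresses sharing D-M index 0
print(mc.access(a))       # 'full miss'
print(mc.access(b))       # 'full miss' (evicts a from the D-M cache)
print(mc.access(a))       # 'miss-cache hit': conflict resolved on chip
```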

20 Miss Cache Organization

21 Observations
- Eliminates the long off-chip miss penalty
- More data conflict misses are removed than instruction conflict misses
  - Instructions within a procedure do not conflict as long as the procedure size is smaller than the cache size
  - If an instruction within a procedure calls another procedure that may be mapped elsewhere, a conflict arises: an instruction conflict

22 Miss Cache Performance
For a 4 KB D$:
- A miss cache of 2 entries can remove 25% of D$ conflict misses, i.e. 13% of overall D$ misses
- A miss cache of 4 entries can remove 36% of D$ conflict misses, i.e. 18% of overall D$ misses
- Beyond 4 entries the improvement is minor

23 Conflict Misses Removed by Miss Caching

24 Overall Cache Misses Removed by Miss Caching

25 Outline
- Base Line Design
- Reducing Conflict Misses
  - Miss Caching
  - Victim Caching
- Reducing Capacity and Compulsory Misses
  - Stream Buffers
  - Multi-way Stream Buffers

26 Victim Caching
- Duplicating data wastes storage space in the miss cache
- Instead, load the fully-associative cache with the victim line evicted from the D-M cache
- When data misses in the D-M cache but hits in the victim cache, the contents of the two lines are swapped (sketched below)
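The same sketch adapted to victim caching: the only change is that the fully-associative buffer is filled with the line evicted from the direct-mapped cache, and a victim-cache hit swaps the two lines, so no entry ever duplicates D-M contents. Names are again illustrative.

```python
from collections import OrderedDict

class VictimCache:
    """Sketch of victim caching: the fully-associative buffer holds lines
    EVICTED from the D-M cache instead of copies of incoming lines."""

    def __init__(self, dm_blocks=256, entries=4):
        self.dm = [None] * dm_blocks        # holds full block addresses
        self.dm_blocks = dm_blocks
        self.vc = OrderedDict()             # victim cache, LRU order
        self.entries = entries

    def access(self, addr):
        index = addr % self.dm_blocks
        if self.dm[index] == addr:
            return "dm hit"
        victim = self.dm[index]             # line about to be displaced
        if addr in self.vc:                 # hit in victim cache: swap lines
            del self.vc[addr]
            self.dm[index] = addr
            if victim is not None:
                self._insert(victim)        # old D-M line becomes the victim
            return "victim-cache hit"
        self.dm[index] = addr               # full miss: refill from L2 ...
        if victim is not None:
            self._insert(victim)            # ... and save the victim, not a copy
        return "full miss"

    def _insert(self, line):
        if len(self.vc) == self.entries:
            self.vc.popitem(last=False)     # evict the LRU victim
        self.vc[line] = True

vc = VictimCache()
a, b = 0x100, 0x200           # conflicting addresses (same D-M index)
vc.access(a); vc.access(b)    # b evicts a; a lands in the victim cache
print(vc.access(a))           # 'victim-cache hit': contents swapped
```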

27 Victim Cache Organization

28 Victim Cache Performance
- A victim cache of just one line performs better than a miss cache of 2 lines
- Significant improvement in the performance of all the benchmark programs

29 Conflict Misses Removed by Victim Caching

30 Overall Cache Misses Removed by Victim Caching

31 Comparison of Miss Cache and Victim Cache Performance

32 Effect of D-M Cache Size on Victim Cache Performance
- Smaller D-M caches benefit most from the addition of a victim cache
- As the D-M cache size increases, conflict misses become less likely
- As the percentage of conflict misses decreases, the percentage of misses removed by the victim cache also decreases

33 Victim Cache: Vary Direct-Mapped Cache Size

34 Effect of Line Size on Victim Cache Performance
- As the line size increases, the number of conflict misses increases
- As a result, the percentage of misses removed by the victim cache increases

35 Victim Cache: Vary Data Cache Line Size

36 Victim Caches and L2 Caches
- Victim caches are also useful for L2 caches due to their large line sizes
- Using an L1 victim cache can also reduce the number of L2 conflict misses

37 Outline
- Base Line Design
- Reducing Conflict Misses
  - Miss Caching
  - Victim Caching
- Reducing Capacity and Compulsory Misses
  - Stream Buffers
  - Multi-way Stream Buffers

38 Reducing Capacity and Compulsory Misses
- Compulsory misses: first reference to a piece of data
- Capacity misses: due to insufficient cache size

39 Prefetching Algorithms
- Prefetch always: access to line "i" triggers a prefetch of line "i+1"
- Prefetch on miss: reference to line "i" triggers a prefetch of line "i+1" if and only if line "i" missed
- Tagged prefetch: a tag bit is set to 0 when a line is prefetched and to 1 when the line is first used; that first use triggers a prefetch of the next line (see the sketch below)
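A compact sketch contrasting the three prefetch triggers; the helper name and its parameters are invented for illustration.

```python
def should_prefetch(policy, was_miss, first_use_of_prefetched):
    """Does a reference to line i trigger a prefetch of line i+1?"""
    if policy == "always":
        return True                    # every access prefetches i+1
    if policy == "on_miss":
        return was_miss                # only a miss prefetches i+1
    if policy == "tagged":
        # The per-line tag bit is 0 when a line arrives by prefetch and is
        # set to 1 on first use; that 0 -> 1 transition (like a miss)
        # triggers the prefetch of i+1, keeping a sequential stream going.
        return was_miss or first_use_of_prefetched
    raise ValueError(policy)

# A hit on an already-used line prefetches only under "always":
print(should_prefetch("always", False, False))    # True
print(should_prefetch("on_miss", False, False))   # False
print(should_prefetch("tagged", False, True))     # True: first use of a prefetched line
```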

40 Limited Time for Prefetch

41 Outline
- Base Line Design
- Reducing Conflict Misses
  - Miss Caching
  - Victim Caching
- Reducing Capacity and Compulsory Misses
  - Stream Buffers
  - Multi-way Stream Buffers

42 Stream Buffers
- Prefetched lines are placed in a buffer rather than the cache, to avoid polluting it
- Each entry consists of a tag, an available bit, and a data line
- If a reference misses in the cache but hits in the buffer, the cache can be reloaded from the buffer
- When a line is moved from the stream buffer into the cache, the remaining entries shift up and the next successive line is fetched

43 Stream Buffer Mechanism

44 Stream Buffer Mechanism Contd.
On a miss:
- Prefetch successive lines
- Enter the tag for each address into the stream buffer
- Set the available bit to false
On return of the prefetched data:
- Place the data in the entry with its tag
- Set the available bit to true (the mechanism is sketched below)
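A minimal sketch of the FIFO mechanism just described, assuming a 4-entry buffer and modeling the prefetch latency as instantaneous (the available bit is set true immediately); in real hardware the bit stays false until the data returns from the next level.

```python
from collections import deque

class StreamBuffer:
    """Sketch of one sequential stream buffer: a FIFO of prefetched lines
    with a tag comparator at the head entry only."""

    def __init__(self, depth=4):
        self.depth = depth
        self.fifo = deque()                 # entries: (line_address, available)

    def allocate(self, miss_addr):
        """On a cache miss: start prefetching the successive lines."""
        self.fifo.clear()
        for i in range(1, self.depth + 1):
            # The tag is entered immediately; the available bit would stay
            # False until the data returns (modeled as instantly True here).
            self.fifo.append((miss_addr + i, True))

    def lookup(self, addr):
        """Probe on a cache miss; only the head entry is compared."""
        if self.fifo and self.fifo[0] == (addr, True):
            self.fifo.popleft()             # line moves into the cache
            nxt = self.fifo[-1][0] + 1 if self.fifo else addr + 1
            self.fifo.append((nxt, True))   # entries shift up; fetch next line
            return True
        return False

sb = StreamBuffer()
sb.allocate(100)            # miss on line 100: lines 101..104 are prefetched
print(sb.lookup(101))       # True: the sequential reference hits the head
print(sb.lookup(103))       # False: not at the head, so the comparator misses it
```

The failing second lookup previews the limitation on the next slide: with a comparator only at the head, any non-sequential reference misses the buffer entirely.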

45 Stream Buffer Performance
- Most instruction references break the purely sequential access pattern by the time the 6th successive line is fetched
- Data references break it even sooner
- As a result, stream buffers are better at removing I$ misses than D$ misses

46 Sequential SB Performance

47 Limitations of Stream Buffers
- The stream buffers considered are FIFO queues
- Only the head of the queue has a tag comparator, so elements must be removed strictly in sequence
- This works only for sequential line misses and fails for a non-sequential line miss

48 Outline
- Base Line Design
- Reducing Conflict Misses
  - Miss Caching
  - Victim Caching
- Reducing Capacity and Compulsory Misses
  - Stream Buffers
  - Multi-way Stream Buffers

49 Multi-Way Stream Buffers
- Single stream buffers could remove 72% of I$ misses but only 25% of D$ misses
- A multi-way stream buffer was simulated to improve the performance of stream buffers for data references
- It consists of 4 stream buffers in parallel
- On a miss in all buffers, the least recently hit stream buffer is cleared and fetching starts from the miss address (sketched below)
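A sketch of the 4-way arrangement, reusing the StreamBuffer class from the earlier sketch; the least-recently-hit replacement is modeled with pseudo-timestamps, and all names are illustrative.

```python
class MultiWayStreamBuffer:
    """Sketch of a 4-way stream buffer: four StreamBuffers (from the sketch
    above) probed in parallel; a miss in all of them reallocates the
    least recently hit buffer to a new stream."""

    def __init__(self, ways=4, depth=4):
        self.bufs = [StreamBuffer(depth) for _ in range(ways)]
        self.last_hit = [0] * ways          # pseudo-timestamps for LRU choice
        self.clock = 0

    def lookup(self, addr):
        self.clock += 1
        for i, sb in enumerate(self.bufs):  # all four heads compared in parallel
            if sb.lookup(addr):
                self.last_hit[i] = self.clock
                return True
        # Missed everywhere: clear the least recently HIT buffer and start
        # a fresh stream at the miss address.
        lru = min(range(len(self.bufs)), key=lambda i: self.last_hit[i])
        self.bufs[lru].allocate(addr)
        self.last_hit[lru] = self.clock
        return False

mwsb = MultiWayStreamBuffer()
mwsb.lookup(100); mwsb.lookup(500)          # two interleaved streams allocated
print(mwsb.lookup(101), mwsb.lookup(501))   # True True: both streams tracked
```

Because each buffer tracks its own stream, interleaved accesses to several data structures no longer thrash a single FIFO, which is why the data-stream results on the next slides improve.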

50 Multi-Way Stream Buffer Design

51 Observations
- Performance on the instruction stream remains virtually unchanged
- Significant improvement in the performance of the data stream
- Removes 43% of the data misses for the test programs, almost twice the performance of a single stream buffer

52 Four-Way SB Performance

53 SB Performance vs. Cache Size

54 SB Performance vs. Line Size

55 Performance Evaluation
- Over the set of 6 benchmarks, on average only 2.5% of 4 KB D-M D$ misses that hit in a 4-entry victim cache also hit in a 4-way stream buffer, so the two techniques are largely complementary
- The combination of stream buffers and victim caches reduces the L1 miss rate to less than half that of the baseline system
- This results in an average 143% improvement in system performance for the 6 benchmarks

56 Improved System Performance

57 Future Enhancements
- This study concentrated on applying these H/W techniques to L1 caches
- Applying these techniques to L2 caches is an interesting area of future work
- The performance of victim caching and stream buffers could also be investigated for OS design and multiprogramming workloads

58 Conclusions
- Miss caches remove tight conflicts where several addresses map to the same cache line
- Victim caches improve on miss caching by saving the victim of the cache miss instead of a duplicate of the incoming line
- Stream buffers prefetch the cache lines following a missed cache line
- Multi-way stream buffers are a set of stream buffers that can perform concurrent prefetches
