
1 Exploiting Load Latency Tolerance for Relaxing Cache Design Constraints Ramu Pyreddy, Gary Tyson Advanced Computer Architecture Laboratory University of Michigan

2 Motivation
- Increasing Memory – Processor frequency Gap
- Large Data Caches to hide Long Latencies
- Larger caches – Longer Access Latencies [McFarland 98]
  - Processor Cycle determines Cache Size
  - Intel Pentium III – 16K DL1 Cache, 3 cycle access
  - Intel Pentium 4 – 8K DL1 Cache, 2 cycle access
- Need Large AND Fast Caches!

3 Related Work
- Load Latency Tolerance [Srinivasan & Lebeck, MICRO 98]
  - All Loads are NOT equal
  - Determining Criticality – Very Complex
  - Sophisticated Simulator with Rollback
- Non-Critical Buffer [Fisk & Bahar, ICCD 99]
  - Determining Criticality – Performance Degradation/Dependency Chains
  - Non-Critical Buffer – Victim Cache for non-critical loads
  - Small Performance Improvements (up to 4%)

4 Related Work (contd.)
- Locality vs. Criticality [Srinivasan et al., ISCA 01]
  - Determining Criticality – Practical Heuristics
  - Potential for Improvement – 40%
  - Locality is better than Criticality
- Non-Vital Loads [Rakvic et al., HPCA 02]
  - Determining Criticality – Run-time Heuristics
  - Small and fast Vital cache for Vital Loads
  - 17% Performance Improvement

5 Load Latency Tolerance

6 Criticality
- Criticality – Effect of Load Latency on Performance
- Two thresholds – Performance and Latency
- A Very Direct Estimation of Criticality
- Computation Intensive!
- Static

7 Determining Criticality - A Closer Look
- IPC Threshold = 99.6%
- Latency Threshold = 8 cycles
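
One possible reading of the two-threshold test on slides 6 and 7, sketched in Python. This is an assumed interpretation, not the authors' tool: a load is treated as critical if forcing it to the latency threshold (8 cycles) drops IPC below the performance threshold (99.6% of baseline IPC). The `simulate_ipc` callback is hypothetical and stands in for a full simulator run; `toy_sim` and its numbers are made up for illustration.

```python
from typing import Callable, Dict, Iterable, Optional

IPC_THRESHOLD = 0.996   # slide 7: fraction of baseline IPC that must be retained
LATENCY_THRESHOLD = 8   # slide 7: cycles imposed on the load under test


def classify_loads(
    load_pcs: Iterable[int],
    simulate_ipc: Callable[[Optional[int], int], float],
) -> Dict[int, bool]:
    """Return {load_pc: is_critical} under the assumed two-threshold rule.

    simulate_ipc(pc, latency) is a hypothetical hook: it should return the IPC
    of a run in which the load at `pc` always takes `latency` cycles.
    Passing pc=None requests the unmodified baseline run.
    """
    baseline = simulate_ipc(None, 0)
    result = {}
    for pc in load_pcs:
        slowed = simulate_ipc(pc, LATENCY_THRESHOLD)
        # Critical: slowing this one load costs more than the allowed IPC loss.
        result[pc] = slowed < IPC_THRESHOLD * baseline
    return result


def toy_sim(pc: Optional[int], latency: int) -> float:
    """Made-up stand-in for a simulator: fixed fractional IPC loss per load."""
    loss = {0x400: 0.02, 0x404: 0.001}
    return 2.0 if pc is None else 2.0 * (1.0 - loss[pc])


verdicts = classify_loads([0x400, 0x404], toy_sim)
```

Under this reading each candidate load needs its own simulation run, which is consistent with the "Computation Intensive!" remark on slide 6.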

8 Most Frequently Executed Loads

9 Criticality (contd.)

Benchmark (SPECINT 2000)    # of Load Insns accounting for 80% of Load references
BZIP2                       130
CRAFTY                      905
EON                         550
GAP                         100
GCC                         4650!
GZIP                        74
MCF                         115
PARSER                      305
TWOLF                       185
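
A minimal Python sketch of how a count like those in the table above could be derived from a per-instruction load profile. The profile format (a mapping from load PC to dynamic reference count) and the function name `loads_for_coverage` are assumptions for illustration; the slides do not describe the profiling tool.

```python
from collections import Counter
from typing import Mapping


def loads_for_coverage(ref_counts: Mapping[int, int], coverage: float = 0.80) -> int:
    """Smallest number of static loads, taken most-frequent first, whose
    dynamic references reach `coverage` of all load references."""
    total = sum(ref_counts.values())
    if total == 0:
        return 0
    covered = 0
    # most_common() yields (pc, count) pairs in descending count order.
    for n, (_, count) in enumerate(Counter(ref_counts).most_common(), start=1):
        covered += count
        if covered >= coverage * total:
            return n
    return len(ref_counts)


# Made-up profile: three static loads with 60, 30 and 10 dynamic references.
example = loads_for_coverage({0x400: 60, 0x404: 30, 0x408: 10})
```

In the made-up profile the hottest load covers 60% and the top two cover 90%, so two loads are needed to reach the 80% mark.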

10 Critical Cache Configuration

11 Effectiveness?
- Load Reference Distribution
  - What percentage of Loads are Identified as Critical
  - Miss Rate for Critical Load References
- Critical Cache Configuration compared with
  - Faster Conventional Cache Configuration
  - DL1/DL2 Latencies – 3/10, 6/20, 9/30 cycles
- Critical Cache Configuration compared with
  - Larger Conventional Cache Configuration
  - DL1 Sizes – 8KB, 16KB, 32KB, 64KB

12 Processor Configuration
Similar to Alpha 21264 using SimpleScalar-3.0 [Austin, Burger 97]

Fetch Width             8 instructions per cycle
Fetch Queue Size        64
Branch Predictor        2 Level, 4K entry level 2
Branch Target Buffer    2K entries, 8 way associative
Issue Width             4 instructions per cycle
Decode Width            4 instructions per cycle
RUU Size                128
Load/Store Queue Size   32
Instruction Cache       64KB, 2-way, 64 byte lines
L2 Cache                1MB, 2-way, 128 byte lines
Memory Latency          64 cycles

13 Results

Benchmark   # of Critical Load Insns.   Critical Load Refs (% of total Load Refs)   Miss rate of Critical Loads for 1K critical cache
BZIP2       23                          18.84                                       10.6
CRAFTY      107                         15.87                                       28.2
EON         52                          16.74                                       12.7
GAP         26                          13.1                                        7.1
GZIP        17                          16.2                                        12.7
MCF         32                          23.86                                       13.2
PARSER      33                          18.44                                       12.8
TWOLF       42                          14.88                                       8.6

14 Results
- Comparison with a faster conventional Cache Configuration
- IPCs normalized to 16K-1cycle Configuration
- 25-66% of the Penalty due to a slower cache is eliminated

15 Results
- Comparison with a faster Conventional Cache Configuration
- IPCs normalized to 32K-1cycle Configuration
- 25-70% of the Penalty due to a slower cache is eliminated
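
The "penalty eliminated" figures on slides 14 and 15 can be read as the share of the IPC gap between the fast and slow conventional caches that the critical cache configuration recovers. The Python sketch below uses that assumed definition; the slides do not spell out the formula, and the IPC values in the example are made up, not taken from the presentation.

```python
def penalty_eliminated(ipc_fast: float, ipc_slow: float, ipc_critical: float) -> float:
    """Fraction of the slow-cache IPC penalty recovered by the critical cache.

    Assumed definition: (ipc_critical - ipc_slow) / (ipc_fast - ipc_slow).
    All three IPCs must share one normalization, e.g. the fast configuration = 1.0.
    """
    penalty = ipc_fast - ipc_slow
    if penalty <= 0:
        raise ValueError("slow configuration shows no penalty relative to the fast one")
    return (ipc_critical - ipc_slow) / penalty


# Made-up normalized IPCs: fast cache 1.00, slow cache 0.90, critical cache 0.95.
recovered = penalty_eliminated(1.00, 0.90, 0.95)
```

With these made-up numbers the critical cache recovers about half of the penalty, which would fall inside the ranges quoted on the slides.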

16 Results
- Comparison with a larger Conventional cache Configuration
- IPCs normalized to 16K-3cycle Configuration

17 Results
- Comparison with a larger Conventional cache Configuration
- IPCs normalized to 32K-6cycle Configuration
- Critical cache Configuration outperforms a larger conventional cache

18 Conclusions & Future Work
- Conclusions
  - Compares well with a faster conventional cache
  - Outperforms a larger conventional cache in most cases
- Future Work
  - More heuristics to refine “criticality”
  - Why are “critical loads” critical?
  - Criticality of a memory address vs. criticality of a load instruction
  - Criticality for low-power Caches

