Slide 1: Exploiting Load Latency Tolerance for Relaxing Cache Design Constraints
Ramu Pyreddy, Gary Tyson
Advanced Computer Architecture Laboratory, University of Michigan
Slide 2: Motivation
- Increasing memory-processor frequency gap
- Large data caches are used to hide long latencies
- Larger caches mean longer access latencies [McFarland 98]
- Processor cycle time determines cache size:
  - Intel Pentium III: 16K DL1 cache, 3-cycle access
  - Intel Pentium 4: 8K DL1 cache, 2-cycle access
- We need caches that are both large AND fast!
Slide 3: Related Work
- Load latency tolerance [Srinivasan & Lebeck, MICRO 98]
  - All loads are NOT equal
  - Determining criticality is very complex: requires a sophisticated simulator with rollback
- Non-critical buffer [Fisk & Bahar, ICCD 99]
  - Determining criticality: performance degradation / dependency chains
  - Non-critical buffer acts as a victim cache for non-critical loads
  - Small performance improvements (up to 4%)
Slide 4: Related Work (contd.)
- Locality vs. criticality [Srinivasan et al., ISCA 01]
  - Determining criticality: practical heuristics
  - Potential for improvement: 40%
  - Locality is better than criticality
- Non-vital loads [Rakvic et al., HPCA 02]
  - Determining criticality: run-time heuristics
  - Small, fast vital cache for vital loads
  - 17% performance improvement
Slide 5: Load Latency Tolerance
(figure slide)
Slide 6: Criticality
- Criticality: the effect of load latency on performance
- Two thresholds: performance and latency
- A very direct estimation of criticality
- Computation intensive!
- Static
Slide 7: Determining Criticality: A Closer Look
- IPC threshold = 99.6%
- Latency threshold = 8 cycles
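Taking the two thresholds on this slide at face value, the test can be pictured as follows. This is a minimal sketch under the assumption that criticality is measured by per-load re-simulation; the `simulate_ipc` helper and its keyword arguments are hypothetical stand-ins, not part of the original tool chain.

```python
# Minimal sketch (assumption): classify a static load by re-simulating with
# only that load's latency stretched to the latency threshold, then checking
# whether overall IPC stays above the IPC threshold from the slide.
# `simulate_ipc` is a hypothetical stand-in for a full timing-simulator run.

IPC_THRESHOLD = 0.996    # keep at least 99.6% of baseline IPC (slide value)
LATENCY_THRESHOLD = 8    # cycles the load's latency is stretched to (slide value)

def classify_load(load_pc, simulate_ipc, baseline_ipc):
    """Return 'tolerant' if this load can absorb the extra latency, else 'critical'."""
    stretched_ipc = simulate_ipc(slow_pc=load_pc, latency=LATENCY_THRESHOLD)
    if stretched_ipc >= IPC_THRESHOLD * baseline_ipc:
        return "tolerant"    # IPC barely moves: the load is latency-tolerant
    return "critical"        # IPC drops past the threshold: the load is critical

def classify_all(load_pcs, simulate_ipc):
    """Classify every static load; one extra simulation per load, hence computation intensive."""
    baseline_ipc = simulate_ipc(slow_pc=None, latency=0)   # unmodified run
    return {pc: classify_load(pc, simulate_ipc, baseline_ipc) for pc in load_pcs}
```

The per-load re-simulation is what makes this estimation "very direct" but computation intensive, and the resulting classification is static.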
Slide 8: Most Frequently Executed Loads
(figure slide)
Slide 9: Criticality (contd.)

  Benchmark (SPECINT 2000)   # of load insns accounting for 80% of load references
  BZIP2                       130
  CRAFTY                      905
  EON                         550
  GAP                         100
  GCC                        4650 (!)
  GZIP                         74
  MCF                         115
  PARSER                      305
  TWOLF                       185
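A breakdown like the table above is typically gathered from an execution profile. Here is a small sketch, assuming a profile that maps each static load PC to its dynamic reference count; the example counts are made up, not data from the study.

```python
# Hedged sketch: given a profile mapping each static load PC to its dynamic
# reference count, find how many static loads cover 80% of all references.
# The profile below is hypothetical example data, not taken from the paper.

def loads_covering(profile, fraction=0.80):
    """Return the static load PCs that together account for `fraction`
    of all dynamic load references, most-referenced first."""
    total = sum(profile.values())
    covered, chosen = 0, []
    for pc, count in sorted(profile.items(), key=lambda kv: kv[1], reverse=True):
        chosen.append(pc)
        covered += count
        if covered >= fraction * total:
            break
    return chosen

# Made-up example:
profile = {0x400100: 5_000_000, 0x400230: 3_000_000, 0x4010a8: 1_500_000,
           0x4020f0: 400_000, 0x403abc: 100_000}
print(len(loads_covering(profile)))   # number of loads covering 80% of references
```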
Slide 10: Critical Cache Configuration
(figure slide)
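The configuration figure itself is not reproduced in this text. Based on the surrounding slides (criticality attached to static load instructions, and a small 1K critical cache next to the conventional DL1), one possible reading of the lookup path is sketched below; the class and its `access` interface are assumptions for illustration only.

```python
# Hedged sketch (assumption; the original figure is not available): steer each
# load by its PC. Loads classified as critical go to a small, fast critical
# cache; latency-tolerant loads go to the larger, slower conventional DL1.

class SplitL1:
    def __init__(self, critical_pcs, critical_cache, dl1):
        self.critical_pcs = critical_pcs      # static load PCs marked critical
        self.critical_cache = critical_cache  # e.g. a 1KB, 1-cycle structure
        self.dl1 = dl1                        # e.g. a 16KB, 3-cycle structure

    def load(self, pc, addr):
        """Return (data, latency_in_cycles) for a load at `pc` to address `addr`."""
        if pc in self.critical_pcs:
            return self.critical_cache.access(addr)   # fast path for critical loads
        return self.dl1.access(addr)                  # tolerant loads can wait
```

Both cache objects are assumed to expose an `access(addr) -> (data, latency)` method; that interface is hypothetical.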
Slide 11: Effectiveness?
- Load reference distribution
  - What percentage of loads is identified as critical?
  - Miss rate for critical load references
- Critical cache configuration compared with a faster conventional cache configuration
  - DL1/DL2 latencies: 3/10, 6/20, 9/30 cycles
- Critical cache configuration compared with a larger conventional cache configuration
  - DL1 sizes: 8KB, 16KB, 32KB, 64KB
Slide 12: Processor Configuration
Similar to the Alpha 21264, modeled with SimpleScalar-3.0 [Austin, Burger 97]

  Fetch width: 8 instructions per cycle
  Fetch queue size: 64
  Branch predictor: 2-level, 4K-entry level-2 table
  Branch target buffer: 2K entries, 8-way associative
  Issue width: 4 instructions per cycle
  Decode width: 4 instructions per cycle
  RUU size: 128
  Load/store queue size: 32
  Instruction cache: 64KB, 2-way, 64-byte lines
  L2 cache: 1MB, 2-way, 128-byte lines
  Memory latency: 64 cycles
Slide 13: Results

  Benchmark   # of critical load insns.   Critical load refs (% of total)   Miss rate of critical loads, 1K critical cache (%)
  BZIP2        23                         18.84                             10.6
  CRAFTY      107                         15.87                             28.2
  EON          52                         16.74                             12.7
  GAP          26                         13.1                               7.1
  GZIP         17                         16.2                              12.7
  MCF          32                         23.86                             13.2
  PARSER       33                         18.44                             12.8
  TWOLF        42                         14.88                              8.6
Slide 14: Results: Comparison with a Faster Conventional Cache Configuration
- IPCs normalized to the 16K, 1-cycle configuration
- 25-66% of the penalty due to a slower cache is eliminated
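One plausible way to arrive at a "fraction of the penalty eliminated" number from normalized IPCs is sketched below; the formula and the example values are assumptions for illustration, not taken from the slide.

```python
# Hedged sketch (assumed formula): fraction of the slow-cache IPC penalty that
# the critical-cache configuration recovers, relative to the fast baseline.

def penalty_eliminated(ipc_fast, ipc_slow, ipc_critical):
    """All IPCs normalized to the same baseline (e.g. the 16K, 1-cycle config)."""
    return (ipc_critical - ipc_slow) / (ipc_fast - ipc_slow)

# Made-up illustration: fast config 1.00, slow config 0.85, critical-cache config 0.94
print(penalty_eliminated(1.00, 0.85, 0.94))   # ~0.60, i.e. ~60% of the penalty removed
```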
Slide 15: Results: Comparison with a Faster Conventional Cache Configuration (contd.)
- IPCs normalized to the 32K, 1-cycle configuration
- 25-70% of the penalty due to a slower cache is eliminated
Slide 16: Results: Comparison with a Larger Conventional Cache Configuration
- IPCs normalized to the 16K, 3-cycle configuration
Slide 17: Results: Comparison with a Larger Conventional Cache Configuration (contd.)
- IPCs normalized to the 32K, 6-cycle configuration
- The critical cache configuration outperforms a larger conventional cache
Slide 18: Conclusions & Future Work
- Conclusions
  - Compares well with a faster conventional cache
  - Outperforms a larger conventional cache in most cases
- Future work
  - More heuristics to refine "criticality"
  - Why are "critical loads" critical?
  - Criticality of a memory address vs. criticality of a load instruction
  - Criticality for low-power caches