Mini-Project Presentation: Prefetching TDT4260 Computer Architecture

Stefano Nichele, Angelo Spalluto
Department of Computer and Information Science
April 15th, 2011

Agenda
  Moore’s law – Memory wall
  Related work
  Fixed Sequential Prefetching
  Sequential Aggressive Prefetching (M-Adaptive, DM-Adaptive)
  DCPT, DCPT-P
  WA-DCPT and SA-DCPT
  Results
  Conclusion
  References

Moore vs. Memory Wall
  Spatial Locality
  Temporal Locality

Prefetching = Predicting + Fetching
  1 – Predicting: which data will be needed by the next instructions?
  2 – Fetching: deliver it into the cache before it is referenced!
  Prefetchers: Sequential, RPT, PC/DC, DCPT, Adaptive

Fixed Sequential Prefetching
Sequential Algorithm
  The prefetcher issues N requests after a miss occurs.
  The window size is constant for the whole execution of the program.
Sequential benchmarks: Wupwise, Applu, Galgel
Non-sequential benchmarks: Ammp, Art110, Art470
[Figure: speedup per benchmark for fixed-size windows]
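A minimal sketch of the fixed sequential scheme, for illustration only. The function name and the 64-byte block size are assumptions, not taken from the slides; the idea is simply that a miss triggers prefetches for the N blocks that follow it.

```python
BLOCK_SIZE = 64  # bytes per cache block (assumed value, not from the slides)


def fixed_sequential_prefetch(miss_addr, n, block_size=BLOCK_SIZE):
    """Return the addresses of the n blocks that follow the missed block."""
    # Align the missing address to the start of its cache block.
    base = (miss_addr // block_size) * block_size
    # The window size n stays constant for the whole run.
    return [base + i * block_size for i in range(1, n + 1)]


if __name__ == "__main__":
    for addr in fixed_sequential_prefetch(0x1004, 3):
        print(hex(addr))
```

With a window of 3, a miss at 0x1004 yields prefetches for the blocks at 0x1040, 0x1080 and 0x10c0.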

Sequential Aggressive Adaptive Prefetcher
The adaptive prefetcher dynamically adjusts the degree of prefetching (N).
Adaptive window parameters
  Window: the number N of contiguous blocks issued by the prefetcher
  Accuracy: the number of good prefetches within a window
  Threshold: the number of good prefetches needed to increase the window (Accuracy >= Threshold)
  Lock window (L): the number of times the window is kept locked
  Listening state: the prefetcher counts the number of good prefetches
Prefetcher algorithm
  1 – The prefetcher initialises Window, Threshold and Lock window.
  2 – Upon a request issued by the CPU, the prefetcher issues N prefetches.
  3 – It listens for N accesses (listening state).
  4 – At access N it checks whether Accuracy >= Threshold.
  5 – If the condition holds, it uses the same window for another L-1 times. Otherwise it decreases the window and issues N requests (back to step 3).
  6 – If step 4 succeeds L times, the prefetcher increases the window and issues another N requests.
  7 – Back to step 3.
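The window update in steps 4 to 6 can be sketched as below. This is one possible reading, not the authors' implementation: the class name, the default values, the step of one block per adjustment and the upper bound `max_window` are all assumptions.

```python
class AdaptiveWindow:
    """Tracks the prefetch degree N across listening states (illustrative sketch)."""

    def __init__(self, window=4, threshold=2, lock=2, max_window=16):
        self.window = window          # N: blocks prefetched per request
        self.threshold = threshold    # good prefetches needed to count as a success
        self.lock = lock              # L: successes required before growing
        self.max_window = max_window  # assumed cap, not given in the slides
        self.streak = 0               # consecutive successful listening states

    def update(self, good_prefetches):
        """Call at the end of a listening state; returns the next window size."""
        if good_prefetches >= self.threshold:
            # Step 5: keep the window locked until L successes have accumulated.
            self.streak += 1
            if self.streak == self.lock:
                # Step 6: L successes in a row, so grow the window (by 1, assumed).
                self.window = min(self.window + 1, self.max_window)
                self.streak = 0
        else:
            # Step 5, failure branch: shrink the window (by 1, assumed).
            self.window = max(self.window - 1, 1)
            self.streak = 0
        return self.window


if __name__ == "__main__":
    w = AdaptiveWindow(window=4, threshold=2, lock=2)
    print([w.update(g) for g in (3, 2, 0, 0)])
```

In this run the window stays at 4 after the first success, grows to 5 after the second, then shrinks to 4 and 3 after two failed listening states.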

Example Seq. Aggressive Adaptive Prefetcher

Different listening states
Sequential Aggressive
  Prefetching occurs immediately after the last element in the window has been checked (whether it is a miss or a hit).
  Each window is composed of P elements = #hits + #misses.
Miss-Adaptive (M-Adaptive)
  M-Adaptive issues a prefetch (starting a new window) only when the first miss occurs after the whole window has been checked (hits do not trigger prefetching).
Discard Miss-Adaptive (DM-Adaptive)
  DM-Adaptive issues a prefetch immediately after the first miss occurs inside the window.
  Each window is composed of P elements = #hits.
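The three variants differ only in when a new window is started. The sketch below encodes one interpretation of the descriptions above; the function name and the two boolean inputs are hypothetical, and the behaviour of DM-Adaptive when a window completes with hits only is an assumption, since the slides do not state it.

```python
def should_start_new_window(policy, window_done, is_miss):
    """Decide whether the current access triggers a new round of prefetches.

    window_done: True once every element of the current window has been checked.
    is_miss:     True if the current access missed in the cache.
    """
    if policy == "aggressive":
        # Restart as soon as the window has been checked, hit or miss.
        return window_done
    if policy == "m_adaptive":
        # Restart only on a miss that arrives after the window was checked.
        return window_done and is_miss
    if policy == "dm_adaptive":
        # A miss inside the window restarts immediately (the window is discarded).
        # Assumption: a window completed with hits only also restarts.
        return is_miss or window_done
    raise ValueError(f"unknown policy: {policy}")


if __name__ == "__main__":
    for policy in ("aggressive", "m_adaptive", "dm_adaptive"):
        print(policy, should_start_new_window(policy, False, True))
```

For a miss in the middle of a window, only DM-Adaptive triggers a prefetch under this reading.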

DCPT and DCPT-P
  No last prefetched
  Test if in cache before prefetching
  Maybe in the queue
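For readers unfamiliar with delta correlation, here is a minimal sketch of the general idea behind DCPT: keep a per-PC history of address deltas, look for an earlier occurrence of the two most recent deltas, and replay the deltas that followed it. It is not the authors' code. The function names are hypothetical, and `mask_bits` is only one possible way to model the partial matching of DCPT-P (ignoring the low-order bits of each delta when comparing).

```python
def dcpt_candidates(last_addr, deltas, mask_bits=0):
    """Return prefetch candidates for one table entry (illustrative sketch)."""
    if len(deltas) < 3:
        return []

    def key(d):
        # Partial matching: drop the low-order bits before comparing.
        return d >> mask_bits

    d1, d2 = key(deltas[-2]), key(deltas[-1])
    # Search backwards for the most recent earlier pair equal to the last pair.
    for i in range(len(deltas) - 2, 0, -1):
        if key(deltas[i - 1]) == d1 and key(deltas[i]) == d2:
            # Replay the deltas that followed the matched pair.
            addr, out = last_addr, []
            for d in deltas[i + 1:]:
                addr += d
                out.append(addr)
            return out
    return []


def filter_candidates(candidates, in_cache, in_flight):
    """Drop candidates already cached or already queued for prefetch."""
    return [a for a in candidates if a not in in_cache and a not in in_flight]


if __name__ == "__main__":
    # Accesses 10, 11, 20, 21, 30 give the deltas 1, 9, 1, 9.
    cands = dcpt_candidates(30, [1, 9, 1, 9])
    print(cands, filter_candidates(cands, in_cache={31}, in_flight=set()))
```

Here the pair (1, 9) is found at the start of the history, so the deltas 1 and 9 are replayed from address 30, giving candidates 31 and 40; 31 is then dropped because it is already in the cache.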

Aggressive Adaptive - DCPT
Stefano, Aggressive Adaptive works pretty well with sequential benchmarks. What about DCPT?
Great!! DCPT works very well with non-sequential benchmarks.
Let’s try to combine them together!!
Ja ja, we may achieve better results!
[Diagram: Aggressive Adaptive + DCPT → Aggressive Adaptive-DCPT (SA-DCPT, WA-DCPT)]

WA-DCPT and SA-DCPT
WA-DCPT
  WA-DCPT adds the concept of a window to DCPT.
  When DCPT issues a prefetch for a specific PC, it also delivers all subsequent blocks according to its window size.
  WA-DCPT is more memory demanding than DCPT: it uses a larger data structure.
SA-DCPT
  At runtime it adapts, choosing the best algorithm between DCPT and Aggressive Sequential.
  The switch threshold is the major concern.
  The best switch threshold is 4.
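The WA-DCPT idea of extending each DCPT prefetch with a window of following blocks could look roughly like this. The function name, the 64-byte block size and the de-duplication of overlapping windows are assumptions made for the sketch; the slides do not describe the actual data structure.

```python
def expand_with_window(candidates, window, block_size=64):
    """Extend each DCPT candidate with the `window` blocks that follow it."""
    out, seen = [], set()
    for addr in candidates:
        # Offset 0 is the DCPT candidate itself; 1..window are the extra blocks.
        for i in range(window + 1):
            a = addr + i * block_size
            if a not in seen:  # skip blocks already covered by an earlier window
                seen.add(a)
                out.append(a)
    return out


if __name__ == "__main__":
    print([hex(a) for a in expand_with_window([0x1000, 0x1040], 2)])
```

With a window of 2, the candidates 0x1000 and 0x1040 expand to 0x1000, 0x1040, 0x1080 and 0x10c0, with the overlapping blocks issued only once.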

Adaptive results
Aggressive Adaptive
  In some benchmarks (galgel, applu, wupwise) the window even reaches sizes between 13 and 15.
  Using a window greater than 12 does not improve performance.
  Low sequencing for ammp, art110 and art470.
M-Adaptive and DM-Adaptive
  The results of M-Adaptive and DM-Adaptive are not better than those of Aggressive Adaptive.
  As expected, they produce fewer “misses” and “prefetches issued”.

DCPT results
DCPT and DCPT-P
  As expected, DCPT-P is slightly better than DCPT.
  For ammp, DCPT-P performs almost twice as well as the adaptive prefetcher.
  A table of 16 deltas and 97 PCs is the best configuration (smaller than 8 KB).
  DCPT-P uses an 8-bit mask.
  In our tests there is no improvement from using a 12-bit mask.

Adaptive DCPT results
WA-DCPT
  WA-DCPT has a different data structure from DCPT (window data).
  Best results are achieved using 14 deltas.
SA-DCPT
  SA-DCPT has the same data structure as DCPT.
  Tuning of the switching threshold: the best switching factor is 4.
  SA-DCPT behaves as DCPT for switching factors greater than 4.

Developed and Literature prefetchers
Developed Prefetchers
  DCPT obtains the best performance.
  SA-DCPT is a good compromise when we do not know the type of benchmark.
Literature vs. Developed
  Our DCPT-P implementation outperforms the reference DCPT-P.
  Likely because they have different data structures.

Coverage Analysis
Coverage
  Benchmarks with low sequencing (ammp, art110 and art470) have a higher coverage with DCPT-P.
  Benchmarks with high sequencing (except applu) have better coverage with SA-DCPT.
Coverage vs Speedup
  Coverage is not directly proportional to speedup.
  If the algorithm spends too much time discovering the next element to prefetch, it might increase its execution time as a consequence.
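The slides do not define the two metrics, so the sketch below assumes the commonly used definitions: coverage as the fraction of would-be misses removed by prefetching, and speedup as baseline execution time divided by execution time with the prefetcher. The function names and the example numbers are illustrative, not measured results.

```python
def coverage(useful_prefetches, remaining_misses):
    """Fraction of would-be misses that were removed by prefetching."""
    total = useful_prefetches + remaining_misses
    return useful_prefetches / total if total else 0.0


def speedup(baseline_time, prefetch_time):
    """Execution time without prefetching divided by time with prefetching."""
    return baseline_time / prefetch_time


if __name__ == "__main__":
    # Made-up numbers, purely to show how the two metrics are computed.
    print(coverage(25, 75), speedup(200.0, 160.0))
```

Because speedup depends on execution time rather than on miss counts alone, two prefetchers with the same coverage can still end up with different speedups.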

Conclusion
  Importance of the prefetcher: it can really improve performance
  Contribution: 3 new prefetcher variants
    Adaptive window (aggressive technique)
    DCPT-based with bit masking
    Combination: delta correlation with adaptive window
  Importance of parameter tuning
  DCPT-P has the best performance overall
  It is difficult to combine two different (opposite) algorithms to exploit the best properties of each

References
  G. E. Moore, Cramming more Components onto Integrated Circuits, Electronics, 38(8), April 9, 1965.
  W. A. Wulf and S. A. McKee, Hitting the Memory Wall: Implications of the Obvious, Computer Architecture News, vol. 23, no. 1, Mar. 1995, pp. 20–24.
  A. R. Brodtkorb, C. Dyken, T. R. Hagen, J. M. Hjelmervik, O. O. Storaasli, State-of-the-art in heterogeneous computing, Sci. Program., Vol. 18 (January 2010), pp
  M. Jahre, Managing Shared Resources in Chip Multiprocessor Memory Systems.: NTNU 2010 (ISBN ) 238 s. Doktoravhandlinger ved NTNU (159)
  M. Grannaes, Reducing Memory Latency by Improving Resource Utilization.: NTNU 2010 (ISBN ) 242 s. Doktoravhandlinger ved NTNU (106)
  A. J. Smith, Cache memories, ACM Comput. Surv., vol. 14, no. 3, pp. 473–530, 1982.
  F. Dahlgren, M. Dubois, and P. Stenstrom, Fixed and adaptive sequential prefetching in shared memory multiprocessors, in Parallel Processing, ICPP International Conference on, volume 1, pages 56-63, Aug
  M. Grannaes, M. Jahre and L. Natvig, Multi-level Hardware Prefetching Using Low Complexity Delta Correlating Prediction Tables with Partial Matching, High Performance Embedded Architectures and Compilers, LNCS, 2010, Volume 5952/2010,
  M. Grannaes, M. Jahre and L. Natvig, Storage Efficient Hardware Prefetching using Delta Correlating Prediction Tables, in Data Prefetching Championships (2009).

QUESTIONS ?
