
1 Discovery of Locality-Improving Refactorings by Reuse Path Analysis – Kristof Beyls – HPCC06 - 2006-09-13 pag. 1
Discovery of Locality-Improving Refactorings by Reuse Path Analysis
Kristof Beyls, Erik H. D’Hollander

2 Overview
Cache behaviour optimization by reuse analysis: motivation & concepts
Efficient profiling of reuse paths
Experimental results
Conclusion

4 Motivation: why optimize cache behavior manually?
Many programs: cache misses cause more than 2x slowdown.
Compilers can eliminate only a small part of the misses by automatic code optimizations.
Therefore: need for profiling tools that enable effective source code optimization.

5 Illustration: bottlenecks of SPEC2000 on Itanium1

6 Good Cache Behavior = High Data Locality
Caches allow more efficient execution when they retain data between different data reuses.
Cache hits only occur when the distance between reuses (the reuse distance) is small enough.
Long reuse distance => poor locality
Short reuse distance => good locality
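
The slide does not spell out how reuse distance is measured. The sketch below assumes the usual definition: the number of distinct addresses touched between a use and the next use of the same address. Under that assumption, a fully associative LRU cache holding C elements hits exactly when the distance is smaller than C. The function name and the integer encoding of addresses are illustrative, not taken from the talk, and the quadratic scan is for explanation only.

```c
#include <assert.h>
#include <stddef.h>

/* Reuse distance of the access at index pos in trace[0..n):
 * the number of distinct addresses touched since the previous access
 * to the same address. Returns -1 if this is the first use.
 * Illustrative quadratic-time version, not a profiler. */
long reuse_distance(const int *trace, size_t n, size_t pos)
{
    assert(trace != NULL && pos < n);

    /* Scan backwards for the previous access to the same address. */
    size_t prev = pos;
    int found = 0;
    while (prev > 0 && !found) {
        prev--;
        found = (trace[prev] == trace[pos]);
    }
    if (!found)
        return -1;

    /* Count distinct addresses strictly between prev and pos. */
    long distinct = 0;
    for (size_t i = prev + 1; i < pos; i++) {
        int seen_before = 0;
        for (size_t j = prev + 1; j < i; j++) {
            if (trace[j] == trace[i]) {
                seen_before = 1;
                break;
            }
        }
        if (!seen_before)
            distinct++;
    }
    return distinct;
}
```

For the trace A B C B A (encoded as 0 1 2 1 0), the second access to A has B and C in between, so its reuse distance is 2. The second access to B has only C in between, so its distance is 1.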

7 Optimize temporal locality: How-to?
Optimize temporal locality = shorten reuse distance
General optimization methodology:
–profile to find hot spots
–refactor the hot spots to diminish the bottleneck
However: the hot spot where the misses occur is often not the place where a refactoring is needed to optimize temporal locality! (An example is given in a few slides.)

8 Methodology to optimize temporal locality
Goal: find the refactorings that improve temporal locality, so that cache misses are eliminated.
Step 1: locate cache misses in source code (e.g. using VTune, Cachegrind, etc.) (1990’s)
Step 2: find previous uses (cache miss = long-distance reuse), using StatCache (Berg 2005) or RDVIS (Beyls 2005)
Step 3: find a refactoring to shorten the reuse distance by analyzing reuse paths, using SLO (Suggestions for Locality Optimizations).
Reuse path = code executed between use and reuse.

9 Methodology to optimize locality: example: step 1
    double sum( … ) {
      …
      for(int i=0; i<len; i++)
        result += X[i];   /* all cache misses occur here */
      …
    }
(VTune, Cachegrind, CProf, HPCView, Memspy, …)

10 Methodology to optimize locality: example: step 2
    double sum( … ) {
      …
      for(int i=0; i<len; i++)
        result += X[i];        /* all cache misses occur here */
      …
    }
    double inproduct(…) {
      …
      for(int i=0; i<len; i++)
        result += X[i]*Y[i];   /* previous uses occur here */
      …
    }
(StatCache, RDVIS)

11 Methodology to optimize locality: example: step 3
    double inproduct( … ) {
      …
      for(int i=0; i<len; i++)
        result += X[i]*Y[i];
      …
    }
    double sum( … ) {
      …
      for(int i=0; i<len; i++)
        result += X[i];
      …
    }
(SLO – Suggestions For Locality Optimizations)

12 Methodology to optimize cache behavior: example: conclusions
Conclusions from example:
Cache misses occur in function sum.
Refactoring to optimize locality is needed in function prodsum.
=> the refactoring is needed in a different location than where the bottleneck occurs!
The place where the previous use occurs (in inproduct) is not enough information to find the refactoring (in prodsum).
How to automatically pinpoint the location where a refactoring is needed? => topic of this paper.
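
The transcript does not show the body of prodsum. The sketch below assumes it is a caller that runs inproduct and then sum over the same array X, and that it adds the two results; both are assumptions made only for illustration. Under them, the refactoring in prodsum amounts to fusing the two loops so that each X[i] is reused immediately instead of after a full pass over X and Y.

```c
#include <assert.h>
#include <stddef.h>

static double inproduct(const double *X, const double *Y, size_t len)
{
    double result = 0.0;
    for (size_t i = 0; i < len; i++)
        result += X[i] * Y[i];
    return result;
}

static double sum(const double *X, size_t len)
{
    double result = 0.0;
    for (size_t i = 0; i < len; i++)
        result += X[i];
    return result;
}

/* Hypothetical caller before refactoring: X is traversed twice, so for a
 * large len each X[i] has left the cache before sum() reads it again. */
double prodsum_unfused(const double *X, const double *Y, size_t len)
{
    assert(X != NULL && Y != NULL);
    return inproduct(X, Y, len) + sum(X, len);
}

/* After fusing the two loops: both uses of X[i] happen in the same
 * iteration, so the reuse distance no longer grows with len. */
double prodsum_fused(const double *X, const double *Y, size_t len)
{
    assert(X != NULL && Y != NULL);
    double ip = 0.0;
    double s = 0.0;
    for (size_t i = 0; i < len; i++) {
        ip += X[i] * Y[i];
        s += X[i];
    }
    return ip + s;
}
```

Neither sum nor inproduct is wrong in isolation. The long reuse distance only becomes visible, and fixable, at the level of the code that calls both.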

13 Overview
Cache behaviour optimization by reuse analysis: motivation & concepts
Efficient profiling of reuse paths
Experimental results
Conclusion

14 Reuse paths are complex for many long-distance reuses.
Long-distance reuses need to be shortened.
All code executed between a use and its long-distance reuse is responsible for the long distance. This code is the reuse path.
Reuse paths often span large sections of source code, hence they are difficult to analyze manually.
How to refactor complex reuse paths so that the distance is drastically reduced?
The complexity of reuse paths is reduced by giving them a hierarchical structure in the function call and loop hierarchy.

15 Function Call and Loop Hierarchy
The Function Call and Loop Hierarchy is a tree containing three kinds of internal nodes:
–Function call nodes
–Loop nodes
–Iteration nodes
The tree reflects function calls, loops and iterations as executed at run-time.

16 Function Call and Loop Hierarchy
For each memory reuse (indicated by an arrow), the least common ancestor (LCA) of use and reuse is the level where an overview over both the use and the reuse is possible.

17 Function Call and Loop Hierarchy
To shorten the reuse distance, the nodes just below the LCA must be fused/interleaved at run-time.
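
The LCA step can be sketched with parent pointers. Everything below is illustrative: the node layout, the ACCESS leaf kind for individual memory accesses, and the function names are assumptions, not SLO's data structures. lca() finds the lowest node that covers both the use and the reuse, and child_below() returns the node just below the LCA on each side. Those two nodes are the ones the slide says must be fused or interleaved.

```c
#include <assert.h>
#include <stddef.h>

/* ACCESS is a hypothetical leaf kind for a single memory access. */
enum node_kind { FUNCTION_CALL, LOOP, ITERATION, ACCESS };

struct node {
    enum node_kind kind;
    const char *label;
    struct node *parent; /* NULL for the root */
    int depth;           /* root has depth 0 */
};

/* Least common ancestor of a and b, found by walking parent pointers.
 * Both nodes are assumed to belong to the same tree. */
struct node *lca(struct node *a, struct node *b)
{
    assert(a != NULL && b != NULL);
    while (a->depth > b->depth)
        a = a->parent;
    while (b->depth > a->depth)
        b = b->parent;
    while (a != b) {
        a = a->parent;
        b = b->parent;
    }
    return a;
}

/* Child of ancestor that lies on the path down to n.
 * Returns NULL if n is the ancestor itself or is not below it. */
struct node *child_below(struct node *ancestor, struct node *n)
{
    if (n == ancestor)
        return NULL;
    while (n != NULL && n->parent != ancestor)
        n = n->parent;
    return n;
}
```

In a tree for the earlier example, with a use of X[0] inside inproduct and its reuse inside sum, the LCA is the call that contains both, and the two children below it are function call nodes.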

18 Function Call and Loop Hierarchy
Mapping back to source code: for each transformation indicated in the source code, its importance can be read from the reuse distance histogram.

19 Function Call and Loop Hierarchy
What transformation to apply?
Function-nodes: fuse functions
Loop-nodes: fuse loops
Iteration-nodes: fuse iterations? = loop interchange or loop tiling…
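
For the iteration-node case, here is one possible instance of loop interchange. The kernel and its names are invented for illustration. In the first version, acc[i] and x[i] are reused only once per outer iteration, so the nodes below the LCA are iterations of the outer loop. Interchanging the loops brings those reuses into consecutive inner iterations.

```c
#include <assert.h>
#include <stddef.h>

/* Before: r is the outer loop. Between two uses of x[i] (and acc[i]),
 * all other elements of x and acc are touched, so the reuse distance
 * grows with n. */
void accumulate_before(double *acc, const double *x, const double *w,
                       size_t n, size_t reps)
{
    assert(acc != NULL && x != NULL && w != NULL);
    for (size_t r = 0; r < reps; r++)
        for (size_t i = 0; i < n; i++)
            acc[i] += x[i] * w[r];
}

/* After interchange: i is the outer loop. x[i] and acc[i] are reused in
 * consecutive inner iterations; w[r] is now reused once per outer
 * iteration, at a distance that grows with reps instead of n. */
void accumulate_interchanged(double *acc, const double *x, const double *w,
                             size_t n, size_t reps)
{
    assert(acc != NULL && x != NULL && w != NULL);
    for (size_t i = 0; i < n; i++)
        for (size_t r = 0; r < reps; r++)
            acc[i] += x[i] * w[r];
}
```

The interchange pays off when reps is much smaller than n. When both are large, tiling would be the likelier choice. Each acc[i] receives its additions in the same order in both versions, so the results are identical.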

20 Overview
Cache behaviour optimization by reuse analysis: motivation & concepts
Efficient profiling of reuse paths
Experimental results
Conclusion

21 Problems with naively measuring the loop and function hierarchy.
Typically, billions of memory accesses and loop iterations in a program run. => the loop and function hierarchy contains billions of nodes.
Typically billions of reuses.

22 Solutions:
1. Only track “open reuse pairs”
2. Use reservoir sampling to track only a limited number of reuses, while guaranteeing accuracy.
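
The slides do not say which reservoir sampling variant is used. The sketch below shows the textbook scheme (often called Algorithm R), which keeps a fixed-size uniform sample from a stream of unknown length. The capacity, the struct, and the use of long identifiers for reuses are assumptions for illustration. rand() is used only to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define RESERVOIR_CAPACITY 4

struct reservoir {
    long items[RESERVOIR_CAPACITY]; /* sampled reuse identifiers (hypothetical) */
    size_t seen;                    /* number of items offered so far */
};

void reservoir_init(struct reservoir *r)
{
    assert(r != NULL);
    r->seen = 0;
}

/* Number of valid entries currently held in the reservoir. */
size_t reservoir_size(const struct reservoir *r)
{
    assert(r != NULL);
    return r->seen < RESERVOIR_CAPACITY ? r->seen : RESERVOIR_CAPACITY;
}

/* Offer one stream item. The first RESERVOIR_CAPACITY items are always
 * kept; item number k after that is kept with probability CAPACITY / k
 * and then overwrites a uniformly chosen slot. */
void reservoir_offer(struct reservoir *r, long item)
{
    assert(r != NULL);
    r->seen++;
    if (r->seen <= RESERVOIR_CAPACITY) {
        r->items[r->seen - 1] = item;
        return;
    }
    /* rand() is only adequate for short streams: the modulo is slightly
     * biased and breaks down once seen exceeds RAND_MAX. */
    size_t slot = (size_t)rand() % r->seen;
    if (slot < RESERVOIR_CAPACITY)
        r->items[slot] = item;
}
```

Memory use stays fixed at RESERVOIR_CAPACITY entries however many reuses are offered, which is the property that matters when a run produces billions of them.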

23 Tracking open reuse pairs.
Details are in the paper.
[Figure: sample open reuses]

24 Overview
Cache behaviour optimization by reuse analysis: motivation & concepts
Efficient profiling of reuse paths
Experimental results
Conclusion

25 Experimental Evaluation
Profiling implemented by extending GCC: GCC-SLO. Results are interactively visualized by SLO.
Part 1: Overhead of profiling SPEC2000.
–Memory overhead
–Execution time overhead
Part 2: Program speedups attainable:
–Optimized 5 programs from SPEC2000 using SLO.
–Evaluated speedup on 5 different platforms

26 Overhead reduction of reuse path profiling by reservoir sampling
From 1000-fold slow-down to 5-fold slow-down
All SPEC2000 programs can be processed in less than 250 MiB extra memory.

27 Locality profiles: examples from SPEC2000: 173.APPLU
Few refactorings to optimize most long-distance reuses.
Blue refactoring: fuse function jacld with blts.
At the algorithmic level: combine
–jacld: form lower triangular part of jacobian matrix
with
–blts: solve lower triangular part

28 Locality profiles: other examples from SPEC2000
Equake, VPR, Art, Crafty, GCC, Galgel

29 Using SLO to optimize 5 SPEC2000 programs: Itanium.

30 Using SLO to optimize 5 SPEC2000 programs: 5 platforms.

31 Overview
Cache behaviour optimization by reuse analysis: concepts
Efficient measurement of reuse pairs and hierarchy of functions and loops
Experimental results
Conclusion

32 Conclusions
To improve temporal locality, a refactoring is often needed in a function other than the one that generates the cache misses!
Reuse Path Analysis in the Function Call and Loop Hierarchy discovers the required refactorings.
Sampling reduces the profiling slowdown from 1000x to 5x, with almost constant memory overhead.
Typically fewer than 10 refactorings are required.
The indicated refactorings are performance-portable over a wide range of architectures.
Implemented in SLO - Suggestions for Locality Optimizations: http://slo.sourceforge.net

33 Backup slides

34 SLO: example (continued)

35 SLO: example (continued)

36 SLO: example (continued)
Resulting speedup on Pentium4: 5.72

37 SLO: example VPR

