Performance Visualizations using XML Representations. Presented by Kristof Beyls, Yijun Yu, and Erik H. D’Hollander.

2 Overview
1. Background: program optimization research
2. XML representations
3. Visualizations
4. Conclusion

3 Program optimization research
What slows down a program's execution? We need to pinpoint the performance bottlenecks by analyzing the program.
How to improve the performance? By program transformations, based on the pinpointed bottlenecks.
How to transform the program?
1. Compiler. Advantage: automatic optimization. Disadvantage: sometimes hard to understand what the program does.
2. Programmer. Advantage: has a good understanding of the program's functionality. Disadvantage: requires human effort.
How to present performance bottlenecks best?
How to construct a research infrastructure that supports all of the above in a common framework? (→ XML)

4 Two main performance factors
Parallelism: performing computation in parallel reduces execution time.
Data locality: fetching data from fast CPU caches reduces execution time.

6 Why XML representations?
- Extensible and versatile
- Standard and interoperable
- Language independent
XML namespace (tool): what it represents
1. ast (yaxx): abstract syntax tree
2. par (oc): identified parallel or sequential loops
3. trace (isv, cv): execution trace of memory instructions
4. hotspot (isv, cv): performance bottleneck locations
5. isdg (isv): iteration space dependence graph
6. rdv (distv): a reuse distance vector
Tools:
- yaxx: YACC extension to XML
- oc: Omega calculator
- isv: iteration space visualizer
- cv: cache (trace) visualizer
- distv: (cache reuse) distance visualizer

7 1. AST (Abstract Syntax Tree) (ast)
XML is a good representation for an AST because of its hierarchical nature. The ast namespace captures the syntactic information of a program. We can construct the AST from source code through YAXX and regenerate the source code through XSLT.
… DO I=1,10,1 …… ENDDO

8 Program optimization research
What slows down a program's execution? We need to pinpoint the performance bottlenecks by analyzing the program.
How to improve the performance? By program transformations, based on the pinpointed bottlenecks.
Who transforms the program?
1. Compiler. Advantage: automatic optimization. Disadvantage: sometimes hard to understand what the program does.
2. Programmer. Advantage: has a good understanding of the program's functionality. Disadvantage: requires human effort.
How to present performance bottlenecks best?
How to construct a research infrastructure that supports all of the above in a common framework? (→ XML)

9 2. Parallel loops (par)
Identified parallel loops are annotated with an element in the “par” namespace. …
In this way, semantic and syntactic information are kept in orthogonal namespaces. Syntax-based tools (e.g. an unparser) can still ignore the annotation, or translate it into directive comments, e.g. Fortran C$DOALL.
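The idea of orthogonal namespaces can be sketched in a few lines of Python. The XML listing on the slide is not preserved, so the namespace URIs, the element names (`do`, `stmt`, `parallel`), the attribute names, and the loop body below are all hypothetical and chosen only for illustration. The sketch shows a syntax-based unparser that either skips everything outside the ast namespace or turns the par annotation into a C$DOALL directive comment.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs and element names (not taken from the slides).
AST_NS = "http://example.org/ast"
PAR_NS = "http://example.org/par"

DOC = f"""
<ast:do xmlns:ast="{AST_NS}" xmlns:par="{PAR_NS}"
        var="I" lower="1" upper="10" step="1">
  <par:parallel/>
  <ast:stmt text="A(I) = B(I) + 1"/>
</ast:do>
"""


def tag(ns, name):
    """Build the '{namespace}name' form that ElementTree uses for tags."""
    return f"{{{ns}}}{name}"


def unparse(node, depth=0, directives=False):
    """Regenerate Fortran-like source lines from ast elements.

    Elements outside the ast namespace produce no output. With
    directives=True, a par:parallel child of a loop becomes a
    C$DOALL directive comment in front of the loop.
    """
    pad = "  " * depth
    lines = []
    if node.tag == tag(AST_NS, "do"):
        if directives and node.find(tag(PAR_NS, "parallel")) is not None:
            lines.append("C$DOALL")
        lines.append(
            f"{pad}DO {node.get('var')}={node.get('lower')},"
            f"{node.get('upper')},{node.get('step')}"
        )
        for child in node:
            lines.extend(unparse(child, depth + 1, directives))
        lines.append(f"{pad}ENDDO")
    elif node.tag == tag(AST_NS, "stmt"):
        lines.append(pad + node.get("text"))
    return lines


root = ET.fromstring(DOC.strip())
print("\n".join(unparse(root)))
print("\n".join(unparse(root, directives=True)))
```

Because the annotation lives in its own namespace, the unparser needs no special case to ignore it; it simply never matches a par element unless asked to.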

10 XFPT: an extended optimizing compiler

12 3. Traces (trace)
A trace records a sequence of memory address accesses. ……
A trace alone can be used to identify run-time data dependences and, through a cache simulator, to identify cache misses. When each address is associated with the array reference number or the loop iteration index on the program's AST, the trace can also be used for advanced loop dependence analysis and cache reuse distance analysis. ……
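As a rough illustration of how a trace tagged with reference identifiers can feed a cache simulator, the sketch below counts misses per reference in a direct-mapped cache. The trace format (pairs of reference label and byte address), the default cache parameters, the 4-byte element size, and the base address 0 are assumptions made for this example, not the format used by the tools in the talk.

```python
from collections import Counter


def simulate_direct_mapped(trace, cache_size=1024, line_size=32):
    """Count cache misses per reference for a direct-mapped cache.

    trace: iterable of (reference_label, byte_address) pairs
           (an assumed format for illustration).
    Returns a Counter mapping reference_label -> number of misses.
    """
    num_lines = cache_size // line_size
    resident = [None] * num_lines  # memory block currently held by each line
    misses = Counter()
    for ref, addr in trace:
        block = addr // line_size
        index = block % num_lines
        if resident[index] != block:
            misses[ref] += 1
            resident[index] = block
    return misses


# Trace of the X(I) reads in a 10x10 loop nest, assuming 4-byte elements
# stored contiguously from address 0.
trace = [("X(I)", 4 * (i - 1)) for i in range(1, 11) for j in range(1, 11)]
misses = simulate_direct_mapped(trace)
print(misses)
```

Grouping misses by reference label is what lets the result be mapped back to a position in the source program.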

13 4. Hotspots (hotspot)
Hot spots are the identified bottlenecks of the program. Two types are used:
- Bottleneck loops: tell which loops are performance bottlenecks
- Bottleneck references: tell which references are performance bottlenecks
…… 1 10 …… ……
1 DIM T(3), X(10)
2 REAL S, X
3 DO I = 1, 10
4   DO J = 1, 10
5     S = S + X(I)*J
6   ENDDO
7 ENDDO
8 …

16 Performance Visualizations
XML plays an important role in gluing the visualizers to an optimizing compiler:
1. Loop dependence visualization
2. Reuse distance visualization
3. Cache behavior visualization

17 Visualization 1: ISDG (iteration space dependence graph)
An iteration is an instance of the loop body statements. An iteration space is the set of integer vector values of the DO loop index variables for the traversed iterations.
A loop-carried dependence is a dependence caused by two references R1 and R2 that access the same memory address, while:
1. One of R1, R2 is a write
2. R1 belongs to loop iteration (i1, j1) and R2 belongs to loop iteration (i2, j2) ≠ (i1, j1)
An ISDG is a graph with nodes representing the iteration space and edges representing loop-carried dependences.
DO i=1,5
  DO j=1,5
    A(i,j) = A(i,j+1)
  ENDDO
ENDDO
[Figure: iteration space plotted on axes i and j, each running from 1 to 5]
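The definition above can be checked on the example loop with a short sketch. It walks the iterations in execution order, records which iteration last wrote or read each array element, and emits an edge whenever two different iterations touch the same element and at least one access is a write. The function name `isdg_edges` and the edge labels are illustrative; this is not how ISV itself is implemented.

```python
def isdg_edges(n=5):
    """Loop-carried dependence edges of
         DO i=1,n / DO j=1,n / A(i,j) = A(i,j+1)
    Returns a set of (source_iteration, sink_iteration, kind) tuples.
    """
    last_write = {}  # array element -> iteration that last wrote it
    readers = {}     # array element -> iterations that read it since that write
    edges = set()
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            it = (i, j)
            # The right-hand side reads A(i, j+1) first.
            r = (i, j + 1)
            if r in last_write and last_write[r] != it:
                edges.add((last_write[r], it, "flow"))
            readers.setdefault(r, []).append(it)
            # Then the left-hand side writes A(i, j).
            w = (i, j)
            for src in readers.get(w, []):
                if src != it:
                    edges.add((src, it, "anti"))
            if w in last_write and last_write[w] != it:
                edges.add((last_write[w], it, "output"))
            last_write[w] = it
            readers[w] = []
    return edges


edges = isdg_edges()
distances = {(dst[0] - src[0], dst[1] - src[1]) for src, dst, _ in edges}
print(len(edges), distances)
```

For this loop every edge has the same distance vector in (i, j), which means no dependence crosses iterations of the outer i loop.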

18 The WTCM CFD application
WTCM has a Computational Fluid Dynamics simulator that solves partial differential equations (PDEs) with a Gauss-Seidel solver.
[Figure labels: temperature; 3D geometry + 1D time]

19 The visualized dependences

20 The loop transformation
A 3-D unimodular transformation is found after visualizing the 4-D loop nest, which has 177 array references at run time in each iteration. Here we use a regular shape. The transformation makes it possible to speed up the program by a factor of about N²/6, where N is the diameter of the geometry.

21 Visualization 2: Reuse distances
Reuse distance is the amount of data accessed before a memory address is reused.
reuse distance > cache size ⇒ cache miss
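A minimal sketch of the measurement, assuming the reuse distance of an access is counted as the number of distinct other addresses touched since the previous access to the same address. With that counting convention, a fully associative LRU cache holding `capacity` addresses misses when the distance is at least `capacity`; other conventions shift the threshold by one. The function names and the tiny trace are made up for illustration.

```python
def reuse_distances(trace):
    """Return one reuse distance per access; None marks a first access."""
    stack = []  # LRU stack of addresses, most recently used last
    result = []
    for addr in trace:
        if addr in stack:
            pos = stack.index(addr)
            # Everything above addr on the stack was touched since its last use.
            result.append(len(stack) - 1 - pos)
            stack.pop(pos)
        else:
            result.append(None)
        stack.append(addr)
    return result


def lru_misses(trace, capacity):
    """Misses of a fully associative LRU cache, derived from reuse distances."""
    return sum(d is None or d >= capacity for d in reuse_distances(trace))


trace = ["a", "b", "c", "a", "b", "b"]
print(reuse_distances(trace))
print(lru_misses(trace, capacity=2))
```

The list-based stack is quadratic in the worst case, which is fine for a sketch; a production tool would use a tree or similar structure.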

23 Execution time reduction on an Itanium processor (Spec2000 programs).

24 Visualization 3: Cache miss traces (Tomcatv/Spec95)
Color legend: white = hit, blue = compulsory miss, green = capacity miss, red = conflict miss
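The four categories in the legend can be derived from an address trace with the usual three-C scheme, sketched below. The first access to a memory block is a compulsory miss; a miss that a fully associative LRU cache of the same size would also suffer is a capacity miss; any remaining miss is a conflict miss. The direct-mapped organization and the default parameters are assumptions for this example, and the sketch does not claim to reproduce how CacheVis classifies misses.

```python
from collections import OrderedDict


def classify(trace, cache_size=1024, line_size=32):
    """Label each byte address in trace as hit/compulsory/capacity/conflict."""
    num_lines = cache_size // line_size
    direct = [None] * num_lines  # the simulated direct-mapped cache
    fully = OrderedDict()        # fully associative LRU cache of equal size
    seen = set()                 # blocks that have been referenced before
    labels = []
    for addr in trace:
        block = addr // line_size
        # Update the fully associative reference cache.
        fa_hit = block in fully
        if fa_hit:
            fully.move_to_end(block)
        else:
            fully[block] = True
            if len(fully) > num_lines:
                fully.popitem(last=False)  # evict the least recently used block
        # Update the direct-mapped cache.
        index = block % num_lines
        dm_hit = direct[index] == block
        direct[index] = block
        if dm_hit:
            labels.append("hit")
        elif block not in seen:
            labels.append("compulsory")
        elif not fa_hit:
            labels.append("capacity")
        else:
            labels.append("conflict")
        seen.add(block)
    return labels


# Two cache lines of 32 bytes: addresses 0 and 64 map to the same line.
print(classify([0, 64, 0], cache_size=64, line_size=32))
print(classify([0, 32, 64, 0], cache_size=64, line_size=32))
```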

25 4.2 Visualizing hotspots of conflict cache misses
X(I,J+1) and X(I,J) conflict in the cache if X has dimensions (512,512). The conflict is resolved by changing the dimensions to (524,524), a technique known as array padding.
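The arithmetic behind the conflict can be sketched as follows. Fortran stores arrays in column-major order, so X(I,J+1) lies one full column (the leading dimension times the element size) after X(I,J). The slide does not give the element size or the cache geometry; the 8-byte elements, the 4096-byte direct-mapped cache, the 32-byte lines, and the base address 0 used here are assumptions picked so that an unpadded column is exactly one cache size long.

```python
def cache_set(i, j, dim, elem_size=8, cache_size=4096, line_size=32, base=0):
    """Cache line index of X(i,j) in a direct-mapped cache.

    Uses Fortran column-major layout with 1-based indices; dim is the
    leading dimension of X. All default parameters are assumed values.
    """
    addr = base + ((j - 1) * dim + (i - 1)) * elem_size
    return (addr // line_size) % (cache_size // line_size)


def conflicts(dim, n=512):
    """Count rows i for which X(i,1) and X(i,2) map to the same cache line."""
    return sum(cache_set(i, 1, dim) == cache_set(i, 2, dim) for i in range(1, n + 1))


print(conflicts(512))  # unpadded leading dimension
print(conflicts(524))  # padded leading dimension
```

Under these assumptions the unpadded stride is 512 × 8 = 4096 bytes, a multiple of the cache size, so both references always land on the same line; the padded stride of 524 × 8 = 4192 bytes shifts the second reference by three lines.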

26 4.2 Cache miss trace after array padding
Most spatial locality is exploited and the conflict misses are resolved. On an Intel 550 MHz Pentium III (single CPU), the speedup measured with VTune is > 50%.

28 Conclusion
An existing optimizing compiler, FPT, was extended with an extensible XML interface. The performance factors, in particular loop parallelism and data locality, were exported from FPT. These factors were visualized through:
- the loop dependence visualizer ISV
- the execution trace visualizer CacheVis
- the reuse distance visualizer ReuseVis
The programmer can use the visualized feedback to improve the performance.

29 The End. Any questions?

30 Program semantics (Software) vs. Architecture capabilities (Hardware)
Research area: Parallel computing
  Program: parallelism at task, loop, and instruction levels through data dependence analysis
  Architecture: multi-processors (MIMD), pipeline (SIMD), multi-threads, network of workstations (NOW, Grid computing)
Research area: Memory hierarchy
  Program: temporal and spatial data locality, data layout, stack reuse distances
  Architecture: cache at level 1, 2, 3, TLB, set associativity, data replacement policy

31 2. Major performance factors
Parallelism
- Loop dependences
- Loop-level parallelism
- Instruction-level parallelism
- Partition load balance
Data locality
- Temporal locality
- Spatial locality
- CCC (Compulsory, Capacity, Conflict) cache misses
- Reuse distances

32 3.6 Cache parameters
To tune for different architectural cache configurations, we represent the cache parameters (cache size, cache line size, and set associativity) in an XML configuration file. For example, a 2-level cache is specified as follows:
1024 32 65536 32 1

33 4.2 Visualizing data locality histogram distributed over reuse distances
