An Open64-based Compiler Approach to Performance Prediction and Performance Sensitivity Analysis for Scientific Codes


An Open64-based Compiler Approach to Performance Prediction and Performance Sensitivity Analysis for Scientific Codes
Jeremy Abramson and Pedro C. Diniz
University of Southern California / Information Sciences Institute
4676 Admiralty Way, Suite 1001, Marina del Rey, California 90292

Motivation
Performance analysis is conceptually easy
– Just run the program! The "what" of performance. Is this interesting?
– Is that realistic? Huge programs with large data sets
– "Uncertainty principle" and intractability of profiling/instrumenting
Performance prediction and analysis is in practice very hard
– Not just interested in wall-clock time; the "why" of performance is a big concern
– How to accurately characterize program behavior? What about architecture effects?
– Can't reuse wall-clock time, but can reuse program characteristics

Motivation (2)
What about the future?
– Different architecture = better results?
– Compiler transformations (loop unrolling)
Need a fast, scalable, automated way of determining program characteristics
– Determine what causes poor performance
What does profiling tell us? How can the programmer use profiling (low-level) information?

Overview
Approach
– High level / low level synergy
– Not architecture-bound
Experimental results
– CG core
Caveats and future work
Conclusion

Low versus High level information

    la   $r0, a
    lw   $r1, i
    mult $offset, $r1, 4
    add  $offset, $offset, $r0
    lw   $r2, $offset
    add  $r3, $r2, 1
    la   $r4, b
    sw   $r4, $r3

or

    b = a[i] + 1

Which can provide meaningful performance information to a programmer? How do we capture the information at a low level while maintaining the structure of the high-level source?

Low versus High level information (2)
Drawbacks of looking at the low level
– Too much data!
– You found a "problem" spot. What now? How do programmers relate information back to the source level?
Drawbacks of looking at the source level
– What about the compiler? Code may look very different
– Architecture impacts?
Solution: look at high-level structure, and try to anticipate the compiler

Experimental Approach
Goal: derive performance expectations from source code for different architectures
– What should the performance be, and why?
– What is limiting the performance? Data dependences? Architecture limitations?
Use high-level information
– WHIRL intermediate representation in Open64 (arrays not lowered)
Construct a DFG
– Decorate the graph with latency information
Schedule the DFG (a scheduling sketch follows below)
– Compute an as-soon-as-possible schedule
– Variable number of functional units (ALU, load/store, registers), with pipelining of operations
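The scheduling step above can be pictured as a small resource-constrained ASAP scheduler walking the latency-annotated DFG. The following is a minimal sketch of that idea only, not SLOPE's actual code: the asap_schedule name, the node encoding, the unit classes and the latencies are all assumptions.

# Minimal sketch (hypothetical names and latencies, not the SLOPE implementation).
# A DFG node is (name, unit, latency, [predecessor names]), listed in
# topological order; unit_count says how many units of each class exist.

def asap_schedule(dfg, unit_count, pipelined=True):
    """Issue every node as soon as its inputs are ready and a unit is free."""
    issue, finish = {}, {}
    in_use = {u: {} for u in unit_count}          # unit class -> {cycle: ops issued}

    def span(cycle, latency):
        # a pipelined unit is only occupied in its issue cycle
        return [cycle] if pipelined else range(cycle, cycle + latency)

    def unit_free(unit, cycle, latency):
        return all(in_use[unit].get(c, 0) < unit_count[unit]
                   for c in span(cycle, latency))

    for name, unit, lat, preds in dfg:
        t = max((finish[p] for p in preds), default=0)   # earliest data-ready cycle
        while not unit_free(unit, t, lat):               # delay until a unit is free
            t += 1
        for c in span(t, lat):
            in_use[unit][c] = in_use[unit].get(c, 0) + 1
        issue[name], finish[name] = t, t + lat
    return issue

# Toy DFG for the CG statement y(rowidx(k)) = y(rowidx(k)) + a(k)*xj,
# assuming 2-cycle loads/stores and 1-cycle ALU operations.
cg_dfg = [("ld_rowidx", "MEM", 2, []),
          ("ld_a",      "MEM", 2, []),
          ("ld_y",      "MEM", 2, ["ld_rowidx"]),
          ("mul",       "ALU", 1, ["ld_a"]),
          ("add",       "ALU", 1, ["ld_y", "mul"]),
          ("st_y",      "MEM", 2, ["add", "ld_rowidx"])]
print(asap_schedule(cg_dfg, {"ALU": 1, "MEM": 1}, pipelined=True))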

Compilation process
1. Source (C/Fortran):
   for (i; i < 0; …
      …
      B = A[i] + 1
      …
2. Open64 WHIRL (high level):
   OPR_STID: B
     OPR_ADD
       OPR_ARRAY
         OPR_LDA: A
         OPR_LDID: i
       OPR_CONST: 1
3. Annotated DFG (a small lowering sketch follows below)
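To make step 2 -> 3 concrete, the fragment below is a hypothetical sketch of walking a WHIRL expression tree and emitting one DFG node per operator; the tuple encoding and the build_dfg helper are illustrative assumptions, not Open64 API calls, and the latency annotation itself is covered on the next slide.

def build_dfg(whirl, dfg):
    """whirl is an (operator, children...) tuple; appends (id, operator,
    predecessor ids) entries to dfg and returns the id producing this value."""
    op, *kids = whirl
    preds = [build_dfg(k, dfg) for k in kids]
    node_id = len(dfg)
    dfg.append((node_id, op, preds))
    return node_id

# WHIRL tree for B = A[i] + 1 as drawn above; note that the OPR_ARRAY node
# keeps the address computation at a high level instead of exposing mult/add.
stmt = ("OPR_STID:B",
        ("OPR_ADD",
         ("OPR_ARRAY", ("OPR_LDA:A",), ("OPR_LDID:i",)),
         ("OPR_CONST:1",)))
dfg = []
build_dfg(stmt, dfg)
for node in dfg:
    print(node)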

Memory modeling approach
An array node represents the address calculation at a high level (a sketch of the policy follows below)
– i is a loop induction variable, so the array expression is affine: assume a cache hit and assign latency accordingly
– Register hit? Assign latency 0
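A minimal sketch of this latency policy, reusing the tuple-encoded WHIRL nodes from the previous sketch; the hit/miss latencies and the affine test below are illustrative assumptions, not the tool's actual values.

HIT, MISS = 2, 40            # assumed cache-hit / cache-miss latencies (cycles)

def is_affine(expr, induction_vars):
    """True if expr has the form c0 + c1*i over loop induction variables."""
    op, *kids = expr
    if op.startswith("OPR_CONST") or op.split(":")[-1] in induction_vars:
        return True
    if op == "OPR_MPY":      # affine only when one factor is a constant
        return (any(k[0].startswith("OPR_CONST") for k in kids)
                and all(is_affine(k, induction_vars) for k in kids))
    if op in ("OPR_ADD", "OPR_SUB"):
        return all(is_affine(k, induction_vars) for k in kids)
    return False

def memory_latency(node, induction_vars, in_register):
    op, *kids = node
    if op.split(":")[-1] in in_register:
        return 0                               # scalar already held in a register
    if op == "OPR_ARRAY":
        base, subscript = kids                 # (OPR_LDA base, subscript expression)
        return HIT if is_affine(subscript, induction_vars) else MISS
    return HIT

# A[i] from the previous slide: i is the induction variable, so the reference
# is affine and is charged a cache-hit latency.
ref = ("OPR_ARRAY", ("OPR_LDA:A",), ("OPR_LDID:i",))
print(memory_latency(ref, {"i"}, in_register={"xj"}))      # -> 2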

Example: CG

      do 200 j = 1, n
         xj = x(j)
         do 100 k = colstr(j), colstr(j+1)-1
            y(rowidx(k)) = y(rowidx(k)) + a(k)*xj
  100    continue
  200 continue

CG Analysis Results
[Figure 4. Validation results of CG on a MIPS R10000 machine]
Prediction results are consistent with the un-optimized version of the code

CG Analysis Results (2)
What's the best way to use processor space?
– Pipelined ALUs?
– Replicate standard ALUs? (a configuration sweep is sketched below)
[Figure 5. Cycle time for an iteration of CG with varying architectural configurations]
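In the spirit of Figure 5, such a sensitivity sweep can be expressed by re-scheduling the same DFG under different unit counts and pipelining assumptions. The fragment below reuses the asap_schedule function and cg_dfg example from the Experimental Approach sketch; the configurations and the resulting cycle counts are purely illustrative, not the paper's measured numbers.

configs = [
    ("1 ALU, 1 LD/ST, pipelined",    {"ALU": 1, "MEM": 1}, True),
    ("2 ALUs, 1 LD/ST, pipelined",   {"ALU": 2, "MEM": 1}, True),
    ("2 ALUs, 2 LD/ST, pipelined",   {"ALU": 2, "MEM": 2}, True),
    ("2 ALUs, 2 LD/ST, unpipelined", {"ALU": 2, "MEM": 2}, False),
]
for label, units, pipelined in configs:
    issue = asap_schedule(cg_dfg, units, pipelined)
    length = max(issue[name] + lat for name, _, lat, _ in cg_dfg)  # schedule length
    print(f"{label}: {length} cycles per iteration")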

Caveats, Future Work
More compiler-like features are needed to improve accuracy
– Control flow: implement trace scheduling; multiple paths can give upper/lower performance bounds
– Simple compiler transformations: common sub-expression elimination, strength reduction, constant folding (a small folding sketch follows below)
– Register allocation: "distance"-based methods? Anticipate cache behavior for spill code
– Software pipelining? Unrolling exploits ILP
Run-time data?
– Array references, loop trip counts, access patterns from performance skeletons
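As one concrete illustration of the planned transformations, a tiny constant-folding pass over the tuple-encoded expression trees used in the earlier sketches might look like the following; it handles integer OPR_ADD only and is a sketch, not the planned implementation.

def fold_constants(expr):
    """Collapse OPR_ADD nodes whose children are all integer constants."""
    op, *kids = expr
    kids = [fold_constants(k) for k in kids]
    if op == "OPR_ADD" and kids and all(k[0].startswith("OPR_CONST") for k in kids):
        total = sum(int(k[0].split(":")[1]) for k in kids)
        return (f"OPR_CONST:{total}",)
    return (op, *kids)

print(fold_constants(("OPR_ADD", ("OPR_CONST:4",), ("OPR_CONST:8",))))
# -> ('OPR_CONST:12',)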

Conclusions
SLOPE provides very fast performance prediction and analysis results
The high-level approach gives more meaningful information
– Still try to anticipate the compiler and the memory hierarchy
More compiler transformations to be added
– Maintain the high-level approach, refine low-level accuracy
