Nvopencc tutorial1 Tutorial on NVIDIA’s Open64 Sources by Mike Murphy 11/06.


nvopencc tutorial1 Tutorial on NVIDIA’s Open64 Sources by Mike Murphy 11/06

nvopencc tutorial2 Outline What it is Where it is How to build it How to use it How to debug it How to change it Future work

nvopencc tutorial3 What it is
nvopencc is a variant of the open-source Open64 compiler that targets NVIDIA’s virtual assembly, PTX.
nvopencc is invoked by nvcc, which does a preprocessing pass with cudafe, then calls nvopencc to produce PTX, which is then fed into OCG to produce SASS.

nvopencc tutorial4 What it is - Definitions
Open64:
  sources: sw/compiler/gpgpu/open64/src
  docs: /doc/howto-debug-compiler
nvcc: sw/compiler/gpgpu/doc/nvcc.doc
PTX: sw/compiler/gpgpu/doc/spec/ptx_isa_beta.doc

nvopencc tutorial5 Subset of Open64
supports C, not C++ or FORTRAN
no Inter-Procedural Analysis
no Loop Nest Optimization
no preprocessing or linking

nvopencc tutorial6 3 sub-executables
Front end (gfec)
  – based on gcc, produces WHIRL IR
Inliner (inline)
  – inlines all calls
Back end (be)
  – optimizes and lowers WHIRL into PTX

nvopencc tutorial7 Back End phases
VHO (Very High Optimizer)
  – switch -> if/else
  – struct copies -> field copies
WOPT (Whirl OPTimizer)
CG (Code Generator)
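The two VHO rewrites named on this slide can be pictured at the source level. The sketch below is only an illustration in plain C, not VHO output: the struct `vec3` and all four function names are hypothetical, and the real transformation happens on WHIRL, not on C text.

```c
/* Hypothetical struct, used only for illustration. */
struct vec3 {
    float x;
    float y;
    float z;
};

/* Source-level form: a single aggregate (struct) copy. */
void copy_whole(struct vec3 *dst, const struct vec3 *src)
{
    *dst = *src;
}

/* The same copy expressed as one copy per field. */
void copy_fields(struct vec3 *dst, const struct vec3 *src)
{
    dst->x = src->x;
    dst->y = src->y;
    dst->z = src->z;
}

/* Source-level form: a switch statement. */
int classify_switch(int op)
{
    switch (op) {
    case 0:
        return 10;
    case 1:
        return 20;
    default:
        return -1;
    }
}

/* The same control flow expressed as an if/else chain. */
int classify_if(int op)
{
    if (op == 0) {
        return 10;
    } else if (op == 1) {
        return 20;
    } else {
        return -1;
    }
}
```

Each pair computes the same result; the second form of each pair is the shape that later phases get to work on.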

nvopencc tutorial8 WOPT
translates WHIRL into SSA (Static Single Assignment) form, then back to WHIRL
PreOpt => MainOpt => RVI
be/opt/opt_main.cxx lists main papers for algorithms
  – constant folding
  – copy propagation
  – dead code elimination
  – full and partial redundancy elimination
  – control flow optimization
  – register variable identification
  – strength reduction
  – induction variable recognition and elimination
  – code motion
  – alias analysis
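To make a few of the listed optimizations concrete, here is a hand-written before/after pair in C. It is a sketch of the general techniques (copy propagation, constant propagation, dead code elimination, strength reduction), not a dump of what WOPT actually produces; both function names are hypothetical.

```c
/* Before: a redundant copy (j), a constant held in a variable (k),
 * and a multiplication inside the loop. */
int sum_scaled_before(const int *a, int n)
{
    int sum = 0;
    int i;
    for (i = 0; i < n; i++) {
        int k = 4;      /* constant */
        int j = i;      /* copy of i */
        sum += a[j] + i * k;
    }
    return sum;
}

/* After, written by hand:
 *  - copy propagation replaces j with i
 *  - constant propagation replaces k with 4; j and k become dead
 *  - strength reduction replaces i * 4 with a running value that
 *    is incremented by 4 on every iteration */
int sum_scaled_after(const int *a, int n)
{
    int sum = 0;
    int scaled = 0;     /* always equal to i * 4 */
    int i;
    for (i = 0; i < n; i++) {
        sum += a[i] + scaled;
        scaled += 4;
    }
    return sum;
}
```

Both functions return the same value for any input; the second simply does less work per iteration.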

nvopencc tutorial9 CG
expand WHIRL into PTX
assign virtual registers
convert 32-bit ops into 16-bit ops
rematerialize GRF loads to reduce live-ranges
combine contiguous load/stores into vectors
emit PTX
no scheduling
no “real” register allocation
relies on OCG

nvopencc tutorial10 Changes from default Open64
ported to new target PTX
host work to build on windows
new intrinsics
memory spaces
optimizing struct copies
tuning WOPT optimizations
CG optimizations: vectors, rematerializing, 16-bit conversion

nvopencc tutorial11 Outline What it is Where it is How to build it How to use it How to debug it How to change it Future work

nvopencc tutorial12 Source Directories
target-specific subdirectories like NVISA or x8664
ifdef TARG_NVISA // NVISA == PTX
sw/compiler/gpgpu/open64/src/*
  be/be            - backend driver
  be/cg            - code generator
  be/com           - common/shared files
  be/lno           - loop nest optimizer
  be/opt           - whirl optimizer
  be/region        - region utilities
  be/vho           - very high whirl optimizer
  common/com       - main common files (WHIRL/symtab)
  common/targ_info - target description
  common/util      - utilities

nvopencc tutorial13 more source directories
  doc                - howto-debug document
  driver             - nvopencc driver
  gccfe              - C front end, takes gnu IR -> WHIRL
  gccfe/gnu          - actual gcc code
  include            - headers used by open64
  ipa                - inter-procedural analysis
  ir_tools           - ir_b2a for dumping whirl files
  libdwarf           - dwarf library
  libdwarf/dwarfdump - utility to dump dwarf info
  libelf             - elf library
  libelfutil         - extra elf utilities
  libiberty          - gnu utilities
  linux/make         - gcommon{defs,rules} included by all makefiles

nvopencc tutorial14 build directories
  targia32*         - where the compiler is built on an ia32 host
  targia32_nvisa    - nvisa target on linux
  targia32_x8664    - x86 target on linux
  targia32gw_nvisa  - nvisa target on mingw
  targia32cyg_nvisa - nvisa target on cygwin
*_rel directories for non-debug release builds
installs in export/*/open64/bin/nvopencc
nvopencc looks in ../lib for gfec/inline/be
export/*/bin/nvcc.profile has path to nvopencc

nvopencc tutorial15 Outline What it is Where it is How to build it How to use it How to debug it How to change it Future work

nvopencc tutorial16 building on linux
cd sw/compiler/gpgpu; make open64_install
cd open64/src/targia32_nvisa; make
cd targia32_nvisa/libcg; make expand.o
build directories != source directories
/Makefile.gbase for each build dir

nvopencc tutorial17 building on windows
same as linux, but need recent cygwin
  – sw/tools/win32/cygnus/2006
uses mingw so resulting executables can run on systems that don’t have cygwin
backend uses static libraries rather than dlls/dsos

nvopencc tutorial18 Outline What it is Where it is How to build it How to use it How to debug it How to change it Future work

nvopencc tutorial19 from nvcc
nvcc -keep .cu
  – produces .cpp3.i
  – input to nvopencc, creating .ptx
--opencc-options
  – passes to nvopencc
  – e.g. --opencc-options -Wfib\\,-ttmsc:0x40
setenv OPENCC_FLAGS

nvopencc tutorial20 nvopencc directly
nvopencc -show -keep x.i
  /lib/gfec -O2 -quiet -m32 -fpreprocessed -fbuiltin x.i -o x.B
  /lib/inline -O2 -INLINE:all -TARG:abi=n32 -fB,x.B -fI,x.I x.i
  /lib/be -PHASE:w:c -O2 -TARG:abi=n32 -LANG:=ansi_c -fB,x.I -s -fs,x.ptx x.i
x.B and x.I (or x.BI) are elf files containing WHIRL
-W{fib}, -Wb, passes option to back end
Group option syntax: -WOPT: = :
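The group options on this slide (-INLINE:all, -TARG:abi=n32, -PHASE:w:c, -LANG:=ansi_c) all look like a group name followed by colon-separated items, each optionally carrying "=value". The sketch below splits such a string under that assumption. It is not Open64's option parser; `parse_group_option`, `struct group_opt`, `struct subopt`, and the size limits are hypothetical names chosen for illustration.

```c
#include <string.h>

#define MAX_SUBOPTS 8
#define MAX_LEN 64

/* One "name" or "name=value" item inside a group option. */
struct subopt {
    char name[MAX_LEN];
    char value[MAX_LEN];    /* empty string when no "=value" is given */
};

/* A parsed option such as -TARG:abi=n32 or -PHASE:w:c. */
struct group_opt {
    char group[MAX_LEN];
    int count;
    struct subopt opts[MAX_SUBOPTS];
};

/* Splits "-GROUP:item:item..." into its parts.
 * Returns 0 on success, -1 if the string does not have that shape. */
int parse_group_option(const char *arg, struct group_opt *out)
{
    const char *colon;
    const char *p;
    size_t len;

    if (arg[0] != '-') {
        return -1;
    }
    colon = strchr(arg, ':');
    if (colon == NULL) {
        return -1;
    }
    len = (size_t)(colon - (arg + 1));
    if (len == 0 || len >= MAX_LEN) {
        return -1;
    }
    memcpy(out->group, arg + 1, len);
    out->group[len] = '\0';
    out->count = 0;

    p = colon + 1;
    while (*p != '\0' && out->count < MAX_SUBOPTS) {
        const char *end = strchr(p, ':');
        size_t item_len = end ? (size_t)(end - p) : strlen(p);
        const char *eq = memchr(p, '=', item_len);
        size_t name_len = eq ? (size_t)(eq - p) : item_len;
        size_t value_len = eq ? item_len - name_len - 1 : 0;
        struct subopt *s = &out->opts[out->count];

        if (name_len >= MAX_LEN || value_len >= MAX_LEN) {
            return -1;
        }
        memcpy(s->name, p, name_len);
        s->name[name_len] = '\0';
        if (eq != NULL) {
            memcpy(s->value, eq + 1, value_len);
        }
        s->value[value_len] = '\0';
        out->count++;

        p += item_len;
        if (*p == ':') {
            p++;    /* skip the separator before the next item */
        }
    }
    return 0;
}
```

Under these assumptions, "-TARG:abi=n32" yields group "TARG" with one item (name "abi", value "n32"), and "-PHASE:w:c" yields two items with empty values.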

nvopencc tutorial21 Outline What it is Where it is How to build it How to use it How to debug it How to change it Future work

nvopencc tutorial22 ir_b2a
targ*/ir_tools/ir_b2a (Binary2Ascii)
ir_b2a x.B will dump the WHIRL
ir_b2a -st x.B will dump WHIRL and symbol table

nvopencc tutorial23 ir_b2a example
int increment (int i) { return ++i; }
ir_b2a produces:
 LOC 0 0 source files: 1 "c:\test/incr.i"
 LOC 1 1 int increment (int i)
 LOC 1 2 {
FUNC_ENTRY
 IDNAME 0
BODY
 BLOCK
 END_BLOCK
 BLOCK
 END_BLOCK
 BLOCK
 PRAGMA (0x0) # PREAMBLE_END
 LOC 1 3 return ++i;
  BLOCK
   I4I4LDID 0 T
   I4INTCONST 1 (0x1)
   I4ADD
   I4STID 0 T
  END_BLOCK
  I4I4LDID 0 T
  I4COMMA
  I4RETURN_VAL
 END_BLOCK

nvopencc tutorial24 ir_b2a example explained
LOC refers to source position (LOCation).
The FUNC_ENTRY has one parameter: IDNAME 0.
The later LDID is a load of this parameter.
The <> gives a reference to the symbol table (level 2, index 1, name %parm_i).
The symbol table usually has two levels: globals at level 1, and locals at level 2.
There is a separate global table of types, which are the T references, which means type #4, named predef_I4, alignment 4.
The I4 in the type and opcodes is a predefined "mtype": signed 4-byte integer.
  – Open64 types are in terms of bytes, whereas in PTX they are in bits.
The I4I4LDID 0 says to load an I4 from offset 0 of.
The first couple of empty BLOCKs are for pragmas; the third BLOCK has the list of statements, which in this case is just a store (STID).
The code is printed in postfix order, so the child of STID is ADD, which has two kids, a LDID of parm_i and the constant 1.
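The postfix ordering described above can be reproduced with a tiny tree walker: visit every kid first, then print the node itself. This is a sketch of the ordering only, assuming a hypothetical `struct node` and `dump_postfix`; it does not use the real WN data structure or ir_b2a code.

```c
#include <stdio.h>
#include <string.h>

#define MAX_KIDS 2

/* Hypothetical stand-in for an IR node: an opcode name plus kids. */
struct node {
    const char *op;
    int kid_count;
    const struct node *kids[MAX_KIDS];
};

/* Appends the tree to buf in postfix order, one opcode per line:
 * all kids are emitted before their parent. buf must already hold
 * a NUL-terminated string and have room for cap bytes in total. */
void dump_postfix(const struct node *n, char *buf, size_t cap)
{
    size_t used;
    int i;

    for (i = 0; i < n->kid_count; i++) {
        dump_postfix(n->kids[i], buf, cap);
    }
    used = strlen(buf);
    snprintf(buf + used, cap - used, "%s\n", n->op);
}
```

Building the store from the example (an STID whose kid is an ADD of an LDID and an INTCONST) and dumping it gives the lines I4I4LDID, I4INTCONST, I4ADD, I4STID in that order, which matches the statement BLOCK in the ir_b2a output.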

nvopencc tutorial25 traces
traces from -t* options are put in .t files
see src/doc/howto-debug-compiler
-tr gives IR dump after phase
-ts gives symbol table after phase
-tt : gives trace within phase
-Wb,-trvho,-trlow
-Wb,-ttopt:0xffffffff,
-Wb,-ttexp:7,-trlra,-trebo

nvopencc tutorial26 adding a trace
if (Get_Trace(TP_CGEXP, 0x800)) {
    fprintf (TFile, "new trace\n");
}
-Wb,-ttexp:0x800
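The pairing of a phase with a hex value such as 0x800 suggests that each phase keeps a bit mask and that a trace fires when its bit is set in that mask. The sketch below models that idea in isolation. It assumes bitwise-AND semantics and uses hypothetical names (`set_trace_mask`, `trace_enabled`, `enum trace_phase`); it is not the Open64 tracing code.

```c
/* Hypothetical phase identifiers, standing in for values like TP_CGEXP. */
enum trace_phase {
    PHASE_OPT,
    PHASE_EXP,
    PHASE_COUNT
};

/* One bit mask per phase; all tracing is off initially. */
static unsigned long trace_masks[PHASE_COUNT];

/* Models what an option like -ttexp:0x800 would do:
 * turn on the given bits for one phase. */
void set_trace_mask(enum trace_phase phase, unsigned long mask)
{
    trace_masks[phase] |= mask;
}

/* Models the check in the slide: nonzero when the requested
 * bit is enabled for that phase. */
int trace_enabled(enum trace_phase phase, unsigned long bit)
{
    return (trace_masks[phase] & bit) != 0;
}
```

Under this model, a mask of 7 turns on bits 0x1, 0x2 and 0x4 together, and a new trace should pick a bit that no existing trace in the same phase already uses.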

nvopencc tutorial27 adding a flag
for -WOPT: add to common/com/config_wopt.cxx
    { OVK_BOOL, OV_VISIBLE, TRUE, "estr_outer_loop", "", 0, 0, 0,
      &WOPT_Enable_Estr_Outer_Loop, NULL },
    if (WOPT_Enable_Estr_Outer_Loop)
for -CG: add to be/cg/cgdriver.cxx
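The table entry on this slide ties an option name to the address of a global variable, which the compiler then tests with an ordinary `if`. A stripped-down model of that table-driven pattern is sketched below. It keeps only the name and the variable pointer; `struct bool_option`, `wopt_options`, `set_bool_option`, and `enable_estr_outer_loop` are hypothetical, and the other fields of the real descriptor are deliberately left out.

```c
#include <string.h>

/* Simplified descriptor: option name plus the variable it controls. */
struct bool_option {
    const char *name;
    int *variable;
};

/* The flag the rest of the code would test with a plain if. */
static int enable_estr_outer_loop = 0;

/* Option table; adding a flag means adding one row here. */
static struct bool_option wopt_options[] = {
    { "estr_outer_loop", &enable_estr_outer_loop },
    { NULL, NULL }      /* end marker */
};

/* Looks up name in the table and stores value into its variable.
 * Returns 0 on success, -1 when no such option exists. */
int set_bool_option(const char *name, int value)
{
    struct bool_option *o;

    for (o = wopt_options; o->name != NULL; o++) {
        if (strcmp(o->name, name) == 0) {
            *o->variable = value;
            return 0;
        }
    }
    return -1;
}
```

The point of the pattern is that command-line handling never needs to know about individual flags: a new row in the table is enough for the option to be recognised and for its variable to be set.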

nvopencc tutorial28 DevWarns and Assertions
DevWarn("why am I here?");
-Wfib,-ttmsc:0x40 to turn on DevWarns
FmtAssert(condition, ("message"));

nvopencc tutorial29 debugging
builds with gcc, so use gdb
can set breakpoint in Fail_FmtAssertion or DevWarn
p dump_tree(WN*)
p dump_st(ST*)
p dump_ty(TY_IDX)
p dump_op (OP*)
p dump_tn (TN*)

nvopencc tutorial30 common data types
WN*             // Whirl Node; common/com/wn*
ST*             // Symbol Table; common/com/symtab*
TY_IDX          // TYpe Index; common/com/symtab*
PREG            // Pseudo-REGister; common/com/symtab*
TYPE_ID | MTYPE // machine types; common/com/mtypes.h
CODEREP*        // SSA expression; be/opt/opt_htable.h
STMTREP*        // SSA statement; be/opt/opt_htable.h
TN*             // Temporary Name; be/cg/tn.h
OP*             // Operation; be/cg/op.h
BB*             // Basic Block; be/cg/bb.h
TOP             // Target OPcode; targ*/targ_info/topcode.h

nvopencc tutorial31 Outline What it is Where it is How to build it How to use it How to debug it How to change it Future work

nvopencc tutorial32 Example: adding an intrinsic
4 kinds of intrinsics
  1. correspond to WHIRL instruction
  2. map to no-side-effect PTX
  3. have side effects
  4. use vectors

nvopencc tutorial33 Intrinsic 1 (WHIRL)
example: f32 max
in gccfe/gnu/builtins.def:
    DEF_LIB_BUILTIN(BUILT_IN_FMAXF, "__builtin_fmaxf",
                    BT_FN_FLOAT_FLOAT_FLOAT, ATTR_NOTHROW_LIST)

nvopencc tutorial34 Intrinsic 1 (WHIRL)
in gccfe/wfe_expr.cxx:
    case BUILT_IN_FMAXF:
        arg1 = TREE_VALUE (arglist);
        arg2 = TREE_VALUE (TREE_CHAIN (arglist));
        wn = WN_CreateExp2 (OPR_MAX, ret_mtype, MTYPE_V,
                            WFE_Expand_Expr (arg1), WFE_Expand_Expr (arg2));
        whirl_generated = TRUE;
in .B file:
    LOC 1 6 f = fmaxf(g,1.0f);
    F4F4LDID 0 T
    F4CONST
    F4MAX
    F4STID 0 T

nvopencc tutorial35 Intrinsic 1 (WHIRL)
be/cg/NVISA/expand.cxx::Expand_Max() produces CG OP:
    [ 6] TN64003 :- max.f32 TN64001 TN64002 ;
assigned registers:
    [ 6] TN64003($f3) :- max.f32 TN64001($f1) TN64002($f2) ;
PTX:
    max.f32 $f3, $f1, $f2;
TN == Temporary Name
  – can hold register, constant, or symbol names

nvopencc tutorial36 Intrinsic 2 (intrinsic_op)
pure with no side effects
example: f32 sin
common/com/wintrinsic.h:
    INTRN_F4SIN
common/com/intrn_info.cxx:
    { /* F4SIN */ BYVAL, PURE, NO_SIDEEFFECTS, DOES_RETURN, NOT_ACTUAL,
      CGINTRINSIC, IRETURN_F4, NULL, "SIN", "sinf"},
gccfe/wfe_expr.cxx:
    case BUILT_IN_SINF:
        iopc = INTRN_F4SIN;
        intrinsic_op = TRUE;

nvopencc tutorial37 Intrinsic 2 (intrinsic_op)
WHIRL:
    LOC 1 5 f = sinf(f);
    F4F4LDID 0 T
    F4PARM 2 T # by_value
    F4INTRINSIC_OP 1 0
    F4STID 0 T
be/cg/NVISA/expand.cxx:
    case INTRN_F4SIN:
        Build_OP (TOP_sin_f32, result, op0, op1, ops);

nvopencc tutorial38 targ_info
common/targ_info/isa/NVISA
C++ files generate accessor files in targ*/targ_info/
isa.cxx – add instruction name
isa_operands.cxx – describe operands
isa_print.cxx – how to print to .ptx file
isa_properties.cxx – e.g. TOP_is_load(t)

nvopencc tutorial39 Intrinsic 3 (intrinsic_call)
has side effects so don’t optimize
example: clock
gccfe/wfe_expr.cxx:
    WN *wn = WN_Create_Intrinsic (OPC_I4INTRINSIC_CALL, INTRN_CLOCK, 0, NULL);
calls are statements
return value in next statement
preg = Pseudo-REGister

nvopencc tutorial40 Intrinsic 3 (intrinsic_call)
WHIRL:
    LOC c2 = clock(); // Read clock register
    I4INTRINSIC_CALL 0 # flags 0x0
    I4I4LDID -1 T
    I4STID 34 T #
    I4I4LDID 34 T #
    I4STID 0 T
be/cg/NVISA/expand.cxx:
    case INTRN_CLOCK:
        call_iresult = PREG_To_TN (Int_Preg, First_Int_Preg_Return_Offset);
        Build_OP (TOP_mov_u32, call_iresult, Clock_TN(), ops);
        return call_iresult;

nvopencc tutorial41 Intrinsic 4 (asm)
intrinsic uses vectors
vectors not basic type in Open64 & GCC
vectors look like structs
builtins won’t work, so use asm
example: texfetch
gccfe/wfe_expr.cxx:
    if (strcmp(name, "__utexfetchi1D") == 0) {
        wn = emit_builtin_texfetch(exp, "tex.1d.v4.u32.s32", MTYPE_U4, MTYPE_I4);
        asm_generated = TRUE;

nvopencc tutorial42 Outline What it is Where it is How to build it How to use it How to debug it How to change it Future work

nvopencc tutorial43 Future Work
new hw features via intrinsics
dwarf generation
integrating with Open64 updates
tune wopt to minimize register pressure
unrolling
using 16-bit instructions
supporting calls
analyze code to generate ideas