Processor Architectures and Program Mapping


Processor Architectures and Program Mapping Exploiting ILP part 2: code generation TU/e 5kk10 Henk Corporaal Jef van Meerbergen Bart Mesman

Overview
  Enhance performance: architecture methods
  Instruction Level Parallelism
  VLIW
  Examples: C6, TM, TTA
  Clustering
  Code generation
  Hands-on
1/15/2019 Processor Architectures and Program Mapping H. Corporaal, J. van Meerbergen, and B. Mesman

Compiler basics
  Overview
  Compiler trajectory / structure / passes
  Control Flow Graph (CFG)
  Mapping and Scheduling
  Basic block list scheduling
  Extended scheduling scope
  Loop scheduling

Compiler basics: trajectory
  Source program -> Preprocessor -> Compiler -> Assembler -> Loader/Linker -> Object program
  The Compiler also produces error messages; the Loader/Linker also takes library code as input.

Compiler basics: structure / passes
  Source code
  Lexical analyzer: token generation
  Parsing: check syntax, check semantics, parse tree generation
  Intermediate code
  Code optimization: data flow analysis, local optimizations, global optimizations
  Code generation: code selection, peephole optimizations
  Register allocation: making interference graph, graph coloring, spill code insertion, caller / callee save and restore code
  Sequential code
  Scheduling and allocation: exploiting ILP
  Object code

Compiler basics: structure
Simple compilation example
  Source:
    position := initial + rate * 60
  Lexical analyzer:
    id1 := id2 + id3 * 60
  Syntax analyzer:
    parse tree for the assignment: id1 := (id2 + (id3 * 60))
  Intermediate code generator:
    temp1 := inttoreal(60)
    temp2 := id3 * temp1
    temp3 := id2 + temp2
    id1 := temp3
  Code optimizer:
    temp1 := id3 * 60.0
    id1 := id2 + temp1
  Code generator:
    movf id3, r2
    mulf #60, r2, r2
    movf id2, r1
    addf r2, r1
    movf r1, id1

Compiler basics: Control flow graph (CFG)
C input code:
  if (a > b) { r = a % b; }
  else       { r = b % a; }
CFG:
  1: sub t1, a, b
     bgz t1, 2, 3
  2: rem r, a, b
     goto 4
  3: rem r, b, a
     goto 4
  4: ...
A program is a collection of functions, each function is a collection of basic blocks, each basic block contains a set of instructions, and each instruction consists of several transports.
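
As a minimal sketch of how the CFG edges follow from the block terminators, the four basic blocks of this example can be written down as data and the successors derived from the last instruction of each block. The tuple encoding and the `successors` helper are illustrative assumptions, not part of the course material.

```python
# Basic blocks of the slide's example, keyed by block number.
# Each instruction is an (opcode, operands) tuple; the last one is the terminator.
blocks = {
    1: [("sub", ("t1", "a", "b")), ("bgz", ("t1", 2, 3))],
    2: [("rem", ("r", "a", "b")), ("goto", (4,))],
    3: [("rem", ("r", "b", "a")), ("goto", (4,))],
    4: [],  # continuation, elided on the slide
}


def successors(block):
    """Return the successor block numbers implied by the block's terminator."""
    if not block:
        return []
    opcode, operands = block[-1]
    if opcode == "bgz":   # conditional branch: two possible targets
        return [operands[1], operands[2]]
    if opcode == "goto":  # unconditional jump: one target
        return [operands[0]]
    return []


cfg = {label: successors(body) for label, body in blocks.items()}
print(cfg)
```

Block 1 ends in a two-way branch, so it gets two outgoing edges; blocks 2 and 3 both jump to the join block 4.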

Mapping / Scheduling: placing operations in space and time
  d = a * b;
  e = a + d;
  f = 2 * b + d;
  r = f - e;
  x = z + y;
Data Dependence Graph (DDG): two multiplications (a * b, producing d, and 2 * b) feed the additions that produce e and f; a subtraction produces r from f and e; an independent addition produces x from z and y.

How to map these operations?
Architecture constraints:
  One Function Unit
  All operations single cycle latency
Schedule:
  cycle 1: *
  cycle 2: *
  cycle 3: +
  cycle 4: +
  cycle 5: -
  cycle 6: +

How to map these operations?
Architecture constraints:
  One Add-sub and one Mul unit
  All operations single cycle latency
Schedule:
  cycle   Mul   Add-sub
  1       *     +
  2       *     +
  3             +
  4             -
  5
  6

There are many mapping solutions
[Figure: Pareto curve (solution space); cost plotted against execution time (T execution), each x marking one mapping solution]
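
A hedged sketch of how a Pareto curve can be extracted from a set of candidate mappings. Each point is (cost, execution time), with cost taken here as the number of function units. The points (1, 6) and (2, 4) correspond to the two schedules shown earlier; the remaining points are made-up values for illustration only.

```python
def pareto_front(points):
    """Keep the points that no other point beats on both cost and time."""
    front = []
    for p in points:
        dominated = any(
            q != p and q[0] <= p[0] and q[1] <= p[1] for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(front)


# (cost, execution time in cycles); the last three entries are hypothetical.
candidates = [(1, 6), (2, 4), (3, 4), (4, 3), (2, 7)]
front = pareto_front(candidates)
print(front)
```

A point stays on the curve only if every cheaper alternative is slower and every faster alternative is more expensive.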

Basic Block Scheduling
  Make a dependence graph
  Determine minimal length
  Determine ASAP, ALAP, and slack of each operation
  Place each operation in the first cycle with sufficient resources
Note:
  Scheduling order is sequential
  Priority is determined by the heuristic used, e.g. slack

Basic Block Scheduling
[Figure: dependence graph of ADD, SUB, NEG, LD and MUL operations on operands A, B, C, X, y and z. Each node is annotated with <ASAP cycle, ALAP cycle>; the difference between the two is the slack. Annotations shown: <1,1>, <2,2>, <3,3>, <1,3>, <2,3>, <4,4>, <2,4>, <1,4>]
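
The node-to-annotation mapping of this figure is not recoverable from the transcript, so the sketch below computes ASAP, ALAP and slack for the smaller DDG of the Mapping / Scheduling slide instead (d, e, f, r, x, plus the unnamed product 2 * b, called `t` here). Unit latency and the names `asap_alap` and `t` are assumptions for illustration.

```python
def asap_alap(ops, edges):
    """ops: operations in topological order; edges: (producer, consumer) pairs.

    Assumes every operation has a latency of one cycle.
    Returns three dicts: ASAP cycle, ALAP cycle, and slack per operation.
    """
    preds = {v: [u for (u, w) in edges if w == v] for v in ops}
    succs = {u: [w for (x, w) in edges if x == u] for u in ops}

    asap = {}
    for v in ops:  # forward pass: one cycle after the latest predecessor
        asap[v] = 1 + max((asap[u] for u in preds[v]), default=0)

    length = max(asap.values())  # minimal schedule length

    alap = {}
    for v in reversed(ops):  # backward pass: one cycle before the earliest successor
        alap[v] = min((alap[w] for w in succs[v]), default=length + 1) - 1

    slack = {v: alap[v] - asap[v] for v in ops}
    return asap, alap, slack


ops = ["d", "t", "x", "e", "f", "r"]
edges = [("d", "e"), ("d", "f"), ("t", "f"), ("e", "r"), ("f", "r")]
asap, alap, slack = asap_alap(ops, edges)
print(asap, alap, slack)
```

Only x has slack: it depends on nothing and nothing depends on it, so it can be placed in any of the three cycles of the critical path.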

Cycle based list scheduling
  proc Schedule(DDG = (V,E))
  beginproc
    ready = { v | ¬∃ (u,v) ∈ E }
    ready' = ready
    sched = ∅
    current_cycle = 0
    while sched ≠ V do
      for each v ∈ ready' do
        if ¬ResourceConfl(v, current_cycle, sched) then
          cycle(v) = current_cycle
          sched = sched ∪ {v}
        endif
      endfor
      current_cycle = current_cycle + 1
      ready = { v | v ∉ sched ∧ ∀ (u,v) ∈ E, u ∈ sched }
      ready' = { v | v ∈ ready ∧ ∀ (u,v) ∈ E, cycle(u) + delay(u,v) ≤ current_cycle }
    endwhile
  endproc
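
A runnable sketch of the same cycle-based idea, applied to the DDG of the Mapping / Scheduling slide. It is a simplification under stated assumptions: priority is just the order in which operations are listed, the resource-conflict test is a per-cycle counter per unit kind, and the names `list_schedule`, `t` (the unnamed product 2 * b), `"mul"`, `"addsub"` and `"fu"` are illustrative. Cycles count from 0, as in the pseudocode, whereas the schedule tables count from 1.

```python
def list_schedule(ops, deps, units):
    """Cycle-based list scheduling.

    ops:   dict operation -> kind of unit it needs (dict order = priority)
    deps:  dict (producer, consumer) -> delay in cycles
    units: dict kind -> number of units of that kind
    Returns a dict operation -> issue cycle (starting at 0).
    """
    cycle = {}
    current = 0
    while len(cycle) < len(ops):
        used = {kind: 0 for kind in units}  # units occupied in this cycle
        for v in ops:
            if v in cycle:
                continue
            # v is ready when every producer has issued and its delay has elapsed
            ready = all(
                u in cycle and cycle[u] + d <= current
                for (u, w), d in deps.items()
                if w == v
            )
            if ready and used[ops[v]] < units[ops[v]]:  # no resource conflict
                cycle[v] = current
                used[ops[v]] += 1
        current += 1
    return cycle


deps = {("d", "e"): 1, ("d", "f"): 1, ("t", "f"): 1, ("e", "r"): 1, ("f", "r"): 1}

# One Mul unit and one Add-sub unit.
two_units = list_schedule(
    {"d": "mul", "t": "mul", "e": "addsub", "f": "addsub", "r": "addsub", "x": "addsub"},
    deps,
    {"mul": 1, "addsub": 1},
)

# A single function unit that executes every operation.
one_unit = list_schedule(
    {op: "fu" for op in ["d", "t", "e", "f", "r", "x"]},
    deps,
    {"fu": 1},
)

print(max(two_units.values()) + 1, max(one_unit.values()) + 1)
```

With these inputs the schedule lengths match the two hand-made schedules: four cycles with separate Mul and Add-sub units, six cycles with a single function unit.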

Extended basic block scheduling: Code Motion
  Block A:  a) add r4, r4, 4
            b) beq . . .
  Block B:  c) add r1, r1, r2
  Block C:  d) sub r1, r1, r2
  Block D:  e) st r1, 8(r4)
Downward code motions? a -> B, a -> C, a -> D, c -> D, d -> D
Upward code motions? c -> A, d -> A, e -> B, e -> C, e -> A

Extended Scheduling scope
Code:
  A;
  If cond Then B Else C;
  D;
  If cond Then E Else F;
  G;
CFG (Control Flow Graph): A branches to B and C, which join in D; D branches to E and F, which join in G.

Scheduling scopes
  Trace
  Superblock
  Decision tree
  Hyperblock/region

Code movement (upwards) within regions
[Figure: an add operation is moved upwards from its source block, past intermediate blocks (I), to the destination block. Legend: copy needed; check for off-liveness; code movement]

Extended basic block scheduling: Code Motion
  A dominates B => A is always executed before B
    Consequently: A does not dominate B => code motion from B to A requires code duplication
  B post-dominates A => B is always executed after A
    B does not post-dominate A => code motion from B to A is speculative
[Figure: CFG with blocks A, B, C, D, E and F]
  Q1: does C dominate E?
  Q2: does C dominate D?
  Q3: does F post-dominate D?
  Q4: does D post-dominate B?
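
The edges of the CFG in this figure are not preserved, so the sketch below uses the A..G graph of the Extended Scheduling scope slide instead. It computes dominator sets with the usual iterative data-flow formulation and obtains post-dominators by running the same routine on the reversed graph. The function name `dominators` and the dictionary encoding are assumptions for illustration.

```python
def dominators(cfg, entry):
    """Iteratively compute, for each node, the set of nodes that dominate it.

    cfg maps every node to the list of its successors. Every node other than
    the entry is assumed to be reachable and to have at least one predecessor.
    """
    nodes = set(cfg)
    preds = {n: {p for p in cfg if n in cfg[p]} for n in nodes}
    dom = {n: set(nodes) for n in nodes}  # start from "everything dominates"
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            new = {n} | set.intersection(*(dom[p] for p in preds[n]))
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom


# CFG of "A; If cond Then B Else C; D; If cond Then E Else F; G;"
cfg = {
    "A": ["B", "C"], "B": ["D"], "C": ["D"],
    "D": ["E", "F"], "E": ["G"], "F": ["G"], "G": [],
}
dom = dominators(cfg, "A")

# Post-dominators: dominators of the reversed graph, starting at the exit G.
reversed_cfg = {n: [p for p in cfg if n in cfg[p]] for n in cfg}
pdom = dominators(reversed_cfg, "G")

print(sorted(dom["D"]), sorted(pdom["A"]))
```

In this graph B does not dominate D, so moving code from D up into B would require a copy in C as well; D does not post-dominate... rather, B does not post-dominate A, so moving code from B up into A would be speculative.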

Scheduling: Loops
Loop Optimizations:
  Loop unrolling
  Loop peeling

Scheduling: Loops
Problems with unrolling:
  Exploits only parallelism within sets of n iterations
  Iteration start-up latency
  Code expansion
[Figure: resource utilization over time for basic block scheduling, basic block scheduling and unrolling, and software pipelining]

Software pipelining
Software pipelining a loop is:
  Scheduling the loop such that iterations start before preceding iterations have finished
  Or: moving operations across the backedge
Example: y = a.x
  No overlap (LD, ML, ST in sequence): 3 cycles/iteration
  Unrolling (three iterations overlapped in five cycles): 5/3 cycles/iteration
  Software pipelining: 1 cycle/iteration

Software pipelining (cont’d)
Basic techniques:
  Modulo scheduling (Rau, Lam)
    list scheduling with modulo resource constraints
  Kernel recognition techniques
    unroll the loop
    schedule the iterations
    identify a repeating pattern
    Examples:
      Perfect pipelining (Aiken and Nicolau)
      URPR (Su, Ding and Xia)
      Petri net pipelining (Allan)
  Enhanced pipeline scheduling (Ebcioğlu)
    fill first cycle of iteration
    copy this instruction over the backedge

Software pipelining: Modulo scheduling
Example: Modulo scheduling a loop
(a) Example loop:
  for (i = 0; i < n; i++)
    a[i+6] = 3 * a[i] - 1;
(b) Code without loop control:
  ld  r1,(r2)
  mul r3,r1,3
  sub r4,r3,1
  st  r4,(r5)
(c) Software pipeline: four overlapped copies of the loop body, each starting one cycle after the previous one
  Prologue   ld
             mul  ld
             sub  mul  ld
  Kernel     st   sub  mul  ld
  Epilogue        st   sub  mul
                       st   sub
                            st
Prologue fills the SW pipeline with iterations
Epilogue drains the SW pipeline
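
A small sketch that lays out such a software pipeline cycle by cycle. It assumes unit latencies, one functional unit per stage, an initiation interval of 1, and at least as many iterations as stages; the helper names `pipeline_rows` and `classify` are illustrative.

```python
def pipeline_rows(stages, n_iters, ii=1):
    """Return one list per cycle with the operations issued in that cycle.

    Stage s of iteration i issues in cycle i * ii + s.
    """
    total_cycles = (n_iters - 1) * ii + len(stages)
    rows = [[] for _ in range(total_cycles)]
    for it in range(n_iters):
        for s, op in enumerate(stages):
            rows[it * ii + s].append(f"{op}[{it}]")
    return rows


def classify(rows, n_stages):
    """Label each cycle as prologue, kernel or epilogue (meaningful for ii=1)."""
    labels = []
    seen_kernel = False
    for row in rows:
        if len(row) == n_stages:  # all stages busy: steady state
            labels.append("kernel")
            seen_kernel = True
        else:
            labels.append("epilogue" if seen_kernel else "prologue")
    return labels


stages = ["ld", "mul", "sub", "st"]
rows = pipeline_rows(stages, n_iters=6)
labels = classify(rows, len(stages))
for label, row in zip(labels, rows):
    print(f"{label:9s}", " ".join(row))
```

In a compiled loop the kernel cycle is emitted once and executed repeatedly; the listing simply shows it once per execution so the filling and draining are visible.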

Software pipelining: determine II, Initiation Interval
Cyclic data dependences
  For (i=0;.....)
    A[i+6] = 3*A[i]-1
  ld  r1, (r2)
  mul r3, r1, 3
  sub r4, r3, 1
  st  r4, (r5)
[Figure: dependence graph over these four operations; each edge is labelled (delay, distance), with labels (0,1), (1,0) and (1,6)]
  cycle(v) ≥ cycle(u) + delay(u,v) - II · distance(u,v)
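
A hedged sketch of checking this constraint for a candidate schedule. Which edge carries which label is not recoverable from the transcript, so the edge list below is an assumption: a chain ld -> mul -> sub -> st with (delay, distance) = (1, 0), closed by a loop-carried edge st -> ld with (1, 6) because a[i+6] is read again six iterations later. The helper name `violations` is illustrative.

```python
def violations(schedule, edges, ii):
    """Return the edges (u, v) for which
    cycle(v) >= cycle(u) + delay(u,v) - II * distance(u,v) does not hold."""
    return [
        (u, v)
        for (u, v, delay, distance) in edges
        if schedule[v] < schedule[u] + delay - ii * distance
    ]


schedule = {"ld": 0, "mul": 1, "sub": 2, "st": 3}  # cycle within one iteration

# Assumed edge labelling for a[i+6] = 3*a[i] - 1.
edges = [
    ("ld", "mul", 1, 0),
    ("mul", "sub", 1, 0),
    ("sub", "st", 1, 0),
    ("st", "ld", 1, 6),
]
print(violations(schedule, edges, ii=1))  # no violated edges

# Hypothetical variant a[i+1] = 3*a[i] - 1: the recurrence has distance 1.
tight = edges[:3] + [("st", "ld", 1, 1)]
print(violations(schedule, tight, ii=1))  # the back edge is violated
print(violations(schedule, tight, ii=4))  # II = 4 satisfies it
```

The larger the dependence distance, the more iterations may overlap before the recurrence becomes the limiting factor.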

Modulo scheduling constraints
MII, the minimum initiation interval, is bounded by cyclic dependences and resources:
  MII = max{ ResMII, RecMII }
[Equations following the labels "Resources:", "Cycles:", "Therefore:" and "Or:" are not preserved in this transcript]
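
The formulas for the two bounds did not survive in this transcript. As a hedged reconstruction, the commonly used textbook definitions are given below; they are consistent with the constraint cycle(v) ≥ cycle(u) + delay(u,v) - II · distance(u,v), but the symbols $N_r$ (operations in the loop body that need resource type $r$), $F_r$ (available units of type $r$) and $c$ (a cycle in the dependence graph) are assumptions, as is the premise that each operation occupies one unit for one cycle.

```latex
% Resource bound: every resource type r must fit its operations into II cycles.
\mathrm{ResMII} = \max_{r} \left\lceil \frac{N_r}{F_r} \right\rceil

% Recurrence bound: sum the scheduling constraint over the edges of a
% dependence cycle c; the cycle(.) terms cancel, leaving
0 \;\ge\; \sum_{(u,v) \in c} \mathrm{delay}(u,v)
      \;-\; II \cdot \sum_{(u,v) \in c} \mathrm{distance}(u,v)

% and therefore
II \;\ge\; \frac{\sum_{(u,v) \in c} \mathrm{delay}(u,v)}
               {\sum_{(u,v) \in c} \mathrm{distance}(u,v)}
\qquad\Longrightarrow\qquad
\mathrm{RecMII} = \max_{c} \left\lceil
    \frac{\sum_{(u,v) \in c} \mathrm{delay}(u,v)}
         {\sum_{(u,v) \in c} \mathrm{distance}(u,v)} \right\rceil
```

For a recurrence with total delay 4 and total distance 6 this gives RecMII = ⌈4/6⌉ = 1, so such a loop would be limited by resources rather than by the recurrence.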

The Role of the Compiler
9 steps required to translate an HLL program:
  1. Front-end compilation
  2. Determine dependencies
  3. Graph partitioning: make multiple threads (or tasks)
  4. Bind partitions to compute nodes
  5. Bind operands to locations
  6. Bind operations to time slots: Scheduling
  7. Bind operations to functional units
  8. Bind transports to buses
  9. Execute operations and perform transports

Division of responsibilities between hardware and compiler
Steps, in order: Application, Frontend, Determine Dependencies, Binding of Operands, Scheduling, Binding of Operations, Binding of Transports, Execute
The compiler is responsible for the steps up to an architecture-dependent boundary; the hardware is responsible for the remaining steps.
  Architecture      Last step done by the compiler
  Superscalar       Frontend
  Dataflow          Determine Dependencies
  Multi-threaded    Binding of Operands
  Indep. Arch       Scheduling
  VLIW              Binding of Operations
  TTA               Binding of Transports

Overview
  Enhance performance: architecture methods
  Instruction Level Parallelism
  VLIW
  Examples: C6, TM, TTA
  Clustering
  Code generation
  Hands-on

Hands-on (not this year)
  Map JPEG to a TTA processor
    see web page: http://www.ics.ele.tue.nl/~heco/courses/pam
    Install TTA tools (compiler and simulator)
    Go through all listed steps
  Perform DSE: design space exploration
  Add SFU
  1 or 2 page report in 2 weeks

Hands-on
  Let’s look at DSE: Design Space Exploration
  We will use the Imagine processor
    http://cva.stanford.edu/projects/imagine/

Mapping applications to processors: MOVE framework
[Figure: under user interaction, an Optimizer selects architecture parameters for a Parametric compiler and a Hardware generator, both of which return feedback to the Optimizer. The compiler produces parallel object code and the hardware generator a chip; together they form a TTA based system. The explored solutions are shown as a Pareto curve (solution space) of cost against exec. time]

Code generation trajectory for TTAs
  Frontend: GCC or SUIF (adapted)
[Figure: Application (C) -> Compiler frontend -> Sequential code -> Compiler backend -> Parallel code. Further elements: Architecture description, Sequential simulation with Input/Output, Profiling data, Parallel simulation with Input/Output]

Exploration: TTA resource reduction

Exploration: TTA connectivity reduction
[Figure: execution time against number of connections removed. Annotations: reducing bus delay; FU stage constrains cycle time; critical connections disappear]

Can we do better? Yes!!
How?
  Transformations
  SFUs: Special Function Units
  Multiple Processors
[Figure: cost against execution time]

Transforming the specification
  Based on associativity of the + operation:
  a + (b + c) = (a + b) + c
[Figure: graph of + operations before and after the transformation]
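
One common use of associativity, shown here as an assumption about what the figure illustrates, is to rebalance a chain of additions into a tree so that the critical path shrinks. The sketch compares the depth of the two shapes for a sum of n operands; both function names are illustrative.

```python
def chain_depth(n_operands):
    """Depth of ((a + b) + c) + ...: every addition waits for the previous one."""
    return max(n_operands - 1, 0)


def balanced_depth(n_operands):
    """Depth when operands are added pairwise, level by level."""
    depth = 0
    while n_operands > 1:
        n_operands = (n_operands + 1) // 2  # one level halves the operand count
        depth += 1
    return depth


for n in (2, 4, 5, 8):
    print(n, chain_depth(n), balanced_depth(n))
```

The number of additions is the same in both shapes; only the dependence depth, and hence the minimal schedule length on a machine with enough adders, changes.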

Transforming the specification
Before:
  d = a * b;
  e = a + d;
  f = 2 * b + d;
  r = f - e;
  x = z + y;
After:
  r = 2*b - a;
  x = z + y;
[Figure: DDG after the transformation: b << 1, followed by the subtraction of a, gives r; z + y gives x]
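
The simplification follows from r = f - e = (2*b + d) - (a + d) = 2*b - a, with 2*b implemented as a left shift by one. A quick sketch that checks the equivalence over a range of integers; the function names are illustrative, and the check covers integer arithmetic only (floating-point results would not be bit-identical in general).

```python
def original(a, b, z, y):
    d = a * b
    e = a + d
    f = 2 * b + d
    r = f - e
    x = z + y
    return r, x


def transformed(a, b, z, y):
    r = (b << 1) - a  # d cancels out; 2*b becomes a shift
    x = z + y
    return r, x


mismatches = [
    (a, b)
    for a in range(-8, 9)
    for b in range(-8, 9)
    if original(a, b, 3, 4) != transformed(a, b, 3, 4)
]
print(mismatches)
```

The transformed code needs one shift, one subtraction and one addition instead of two multiplications, three additions and a subtraction.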

Changing the architecture
Adding SFUs: special function units
[Figure: a group of + operations mapped onto a 4-input adder]
Why is this faster?

Changing the architecture
Adding SFUs: special function units
  In the extreme case put everything into one unit!
  Spatial mapping - no control flow
  However: no flexibility / programmability!!

SFUs: fine grain patterns
Why use fine grain SFUs:
  Code size reduction
  Register file #ports reduction
  Could be cheaper and/or faster
  Transport reduction
  Power reduction (avoid charging non-local wires)
  Supports whole application domain!
Which patterns need support?
  Detection of recurring operation patterns needed

SFUs: covering results

Exploration: resulting architecture
Architecture for image processing:
  9 buses
  4 RFs
  4 Addercmp FUs
  2 Multiplier FUs
  2 Diffadd FUs
  stream input and stream output
Note the reduced connectivity

Conclusions
  Billions of embedded processing systems
    how to design these systems quickly, cheaply, correctly, with low power, ...?
    what will their processing platform look like?
  VLIWs are very powerful and flexible
    can be easily tuned to application domain
  TTAs are even more flexible, scalable, and lower power

Conclusions
  Compilation for ILP architectures is maturing and entering the commercial area.
  However:
    Great discrepancy between available and exploitable parallelism
    Advanced code scheduling techniques needed to exploit ILP

Bottom line: Do not pay for hardware if you can do it in software!!