9. Code Scheduling for ILP-Processors TECH Computer Science {Software! compilers optimizing code for ILP-processors, including VLIW} 9.1 Introduction 9.2 Basic block scheduling 9.3 Loop scheduling 9.4 Global scheduling

Typical layout of compiler: traditional, optimizing, pre-pass parallel, post-pass parallel

Levels of static scheduling

Contrasting the performance of {software} static code schedulers and {hardware} ILP-processors

9.2 Basic Block Scheduling: List scheduler: schedules instructions step by step. Basic block: straight-line code, which can only be entered at the beginning and left at the end.

-Eligible instructions are dependency-free: they have no predecessors in the Data Dependency Graph, or the data produced by their predecessor nodes is already available. Timing: check when the data will be ready and whether the required hardware resources are available.

-Looks for the 'best choice' from the eligible set of instructions {critical path in the Data Dependency Graph}. Choose first: instructions on the critical path. Choose next: the others.

Heuristic rule for choosing the next instruction

Set of criteria: 1. whether there is an immediate successor that depends on the instruction in question; 2. how many successors the instruction has; 3. how long the longest path from the instruction in question to the leaves is.
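Put together, the list-scheduling loop and the path-length criterion can be sketched as follows (a minimal sketch in Python; the DDG encoding and the use of criterion 3 as the only priority are illustrative assumptions, not the actual compiler's data structures):

```python
def list_schedule(nodes, edges, num_units=1):
    """Greedy list scheduler for one basic block.
    nodes: instruction names; edges: (u, v) -> latency, meaning v can use
    u's result `latency` cycles after u issues.  Returns node -> issue cycle."""
    succs = {n: [] for n in nodes}
    preds = {n: [] for n in nodes}
    for (u, v) in edges:
        succs[u].append(v)
        preds[v].append(u)

    memo = {}
    def path_len(n):
        # criterion 3: longest latency-weighted path to a leaf
        if n not in memo:
            memo[n] = max((edges[(n, s)] + path_len(s) for s in succs[n]),
                          default=0)
        return memo[n]

    issue, ready_at = {}, {n: 0 for n in nodes}
    unscheduled, cycle = set(nodes), 0
    while unscheduled:
        # eligible: dependency-free, and the data is available this cycle
        eligible = [n for n in unscheduled
                    if all(p in issue for p in preds[n]) and ready_at[n] <= cycle]
        eligible.sort(key=path_len, reverse=True)      # critical path first
        for n in eligible[:num_units]:
            issue[n] = cycle
            unscheduled.discard(n)
            for s in succs[n]:                         # propagate result timing
                ready_at[s] = max(ready_at[s], cycle + edges[(n, s)])
        cycle += 1
    return issue
```

On a toy DDG with a -> c (latency 2) and b -> c (latency 1), the scheduler issues a first (longer path to the leaf), then b, and c as soon as both results are available.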

Case study: the compiler for the IBM Power1. Basic block approach: a list scheduler using the Data Dependency Graph. Critical path: the longest path (in terms of execution time). Earliest time: when the data for the instruction will be available. Latency value (on each arc of the DDG): how many time units the successor node has to wait before the result of the predecessor node becomes available.
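The earliest-time and critical-path values annotated on the DDG can be computed by a forward pass (over predecessors) and a backward pass (over successors) of the latency-weighted graph; a sketch with an invented three-node DDG, not the book's example:

```python
def ddg_times(nodes, edges):
    """edges: (u, v) -> latency.  Returns (earliest, critical):
    the earliest issue time of each node, and the latency-weighted
    length of the longest path from it to a leaf."""
    preds = {n: [] for n in nodes}
    succs = {n: [] for n in nodes}
    for (u, v) in edges:
        preds[v].append(u)
        succs[u].append(v)

    earliest, critical = {}, {}
    def e(n):   # forward: latest-arriving operand decides the earliest time
        if n not in earliest:
            earliest[n] = max((e(p) + edges[(p, n)] for p in preds[n]), default=0)
        return earliest[n]
    def c(n):   # backward: longest remaining path = critical-path value
        if n not in critical:
            critical[n] = max((edges[(n, s)] + c(s) for s in succs[n]), default=0)
        return critical[n]
    for n in nodes:
        e(n); c(n)
    return earliest, critical
```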

Example program: The intermediate code and the corresponding DDG

The first three steps in scheduling the example DDG using Warren's algorithm

Successive steps in scheduling the example DDG

9.3 Loop Scheduling

-Loop Unrolling: for I = 1 to 3 do { b(I) = 2 * a(I) }. Loop unrolled (basic concept): b(1) = 2 * a(1); b(2) = 2 * a(2); b(3) = 2 * a(3). Saves execution time at the expense of code length: omitting the inter-iteration loop-update code improves performance, and enlarging the basic block can lead to a more effective schedule with considerable speed-up.

Simple unrolling is not practicable when a loop has to be executed a large number of times. Solution: unroll the loop a given number of times (e.g. 3), yielding a larger basic block. Inter-iteration dependencies: unrolling is only feasible if the required data becomes available in due time.
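Partial unrolling of the slide's loop can be written out directly (a sketch; it assumes the iteration count is a multiple of the unroll factor, otherwise a remainder loop would be needed):

```python
def scale_rolled(a):
    b = [0] * len(a)
    for i in range(len(a)):          # one loop update per element
        b[i] = 2 * a[i]
    return b

def scale_unrolled_by_3(a):
    assert len(a) % 3 == 0           # no remainder loop in this sketch
    b = [0] * len(a)
    for i in range(0, len(a), 3):    # one loop update per 3 elements;
        b[i]     = 2 * a[i]          # the three statements form one larger
        b[i + 1] = 2 * a[i + 1]      # basic block for the scheduler
        b[i + 2] = 2 * a[i + 2]
    return b
```

Both versions compute the same result; the unrolled body simply gives the basic block scheduler three independent statements to arrange per loop update.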

Speed-up produced by loop unrolling for the first 14 Lawrence Livermore Loops

-Software Pipelining: software pipelining is the software analogue of hardware pipelining: each pipeline stage uses one EU, and each pipeline cycle executes one instruction. E.g.: for I = 1 to 7 do { b(I) = 2 * a(I) }

Basic methods for software pipelining

-Principle of the unrolling-based URPR software pipelining method

E.g.: a loop of 7 iterations on a 4-EU VLIW, where fmul takes 3 cycles
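The overlap of iterations can be illustrated by building the pipelined schedule explicitly (a sketch: the decomposition of each iteration into load / fmul / store, and starting one new iteration per cycle, are simplifying assumptions):

```python
def software_pipeline(n_iters, mul_latency=3):
    """Schedule for: for I: b(I) = 2 * a(I), decomposed into
    load a(I) -> fmul (mul_latency cycles) -> store b(I).
    A new iteration is initiated every cycle."""
    schedule = {}                        # cycle -> ops issued that cycle
    def put(cycle, op):
        schedule.setdefault(cycle, []).append(op)
    for i in range(1, n_iters + 1):
        start = i                        # initiation interval = 1 cycle
        put(start, f"load a({i})")
        put(start + 1, f"fmul ({i})")    # result ready mul_latency cycles later
        put(start + 1 + mul_latency, f"store b({i})")
    return schedule
```

In the steady state (the kernel) every cycle issues a load, an fmul, and a store, each belonging to a different iteration; the first and last few cycles form the prologue and epilogue, where the pipeline fills and drains.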

-Modulo scheduling: find the repetitive character of the schedule. Guess the minimal required length of the new {partially unrolled} loop, i.e. the period of repetition. Try to schedule for this interval {period}, taking data and resource dependencies into account. If the schedule is not feasible, the length {period} is increased and scheduling is tried again.
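That trial-and-error loop can be sketched as follows (a deliberately simplified modulo scheduler: resource conflicts are handled through a modulo reservation table, inter-iteration dependencies are only checked after placement, and the op/dependence encoding is invented):

```python
from math import ceil

def modulo_schedule(ops, deps, units, max_ii=64):
    """ops:  name -> resource class (listed in a valid intra-iteration order).
    deps: (u, v) -> (latency, distance); distance 0 = same iteration,
          distance d > 0 = v of iteration i+d depends on u of iteration i.
    units: resource class -> number of units.
    Returns (period, issue cycles) for the smallest feasible period found."""
    use = {}
    for r in ops.values():
        use[r] = use.get(r, 0) + 1
    res_min = max(ceil(use[r] / units[r]) for r in use)   # resource lower bound
    for ii in range(res_min, max_ii + 1):                 # guess, then grow
        table, issue = {}, {}
        for v in ops:
            # earliest cycle allowed by already placed same-iteration deps
            t = max((issue[u] + lat for (u, w), (lat, d) in deps.items()
                     if w == v and d == 0 and u in issue), default=0)
            for t2 in range(t, t + ii):                   # some residue is free
                slot = (ops[v], t2 % ii)
                if table.get(slot, 0) < units[ops[v]]:
                    table[slot] = table.get(slot, 0) + 1
                    issue[v] = t2
                    break
        # is this period long enough for the inter-iteration dependencies?
        if all(issue[u] + lat <= issue[v] + d * ii
               for (u, v), (lat, d) in deps.items() if d > 0):
            return ii, issue
    raise ValueError("no feasible period up to max_ii")
```

With two ALU ops where b needs a (latency 1) and the next iteration's a needs b (latency 3, distance 1), the resource bound suggests a period of 2, but the recurrence forces the period up to 4 before a schedule fits.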

9.4 Global Scheduling: Scheduling of individual instructions beyond basic blocks

Moving up an independent instruction beyond a basic block boundary ‘along both paths’ of a conditional branch

Moving up an independent instruction beyond a basic block boundary ‘along a single path’ of a conditional branch {guess T}

Trace Scheduling: flow graph, guess 3

Bookkeeping Code [B]
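The bookkeeping copies themselves can be illustrated with basic blocks represented as instruction lists (a toy sketch; the block names and instructions are invented, not the book's example):

```python
def hoist_above_join(trace_block, join_block, off_trace_block, instr):
    """Move `instr` from a join block up into its on-trace predecessor.
    Because the off-trace path also reaches the join, a bookkeeping copy
    of `instr` must be inserted on that path so its semantics are kept."""
    join_block.remove(instr)
    trace_block.append(instr)       # instruction now runs earlier on the trace
    off_trace_block.append(instr)   # bookkeeping copy [B] for the other path
```

For example, hoisting "z = t + 1" out of the join block shortens the trace but leaves a duplicate on the off-trace path: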

Finite Resource Global Scheduling: addresses long compilation times and code explosion. For VLIW: speed-up over Power1: 3.7; compilation time relative to PL.8: 49%; code explosion relative to RISC: 2.1.