Forrest Brewer UCSB CAD and Test Group ECE/UCSB Santa Barbara CA 93106 NDFA Based Scheduling Forrest Brewer, Steve Haynal University.

Presentation transcript:

Forrest Brewer UCSB CAD and Test Group ECE/UCSB Santa Barbara CA NDFA Based Scheduling Forrest Brewer, Steve Haynal University of California Santa Barbara

Scheduling is Behavioral Synthesis
Exploits fundamental freedom -- ordering and binding of operations, operands
–Subdivided into DFG transformation, resource allocation, time-scheduling, operation binding, memory binding, communication binding, resource modeling, reallocation...
–Complexity of tasks requires top-down flow -- yet evaluations/constraints are bottom-up
Behavioral Synthesis difficult to use!
–Seemingly trivial changes cause vast output changes
–Design tradeoffs tied to a particular point language (~VHDL, ~Verilog, Silage, Esterel...)
–No direct control of implementation
–No direct control of binding, mapping
–No distinction between problem statement and constraints
–No canonical representation of design space
Fundamental problem covers enormous scope
–Universality issues in specification
–How to capture design mapping knowledge?
–How to create verifiable design representation without canonical model?
Our viewpoint -- wrong problem

Simpler Problem
Assume Designer creates the design
–Support incremental refinement of design at all levels of representation
–Support incremental design synthesis when possible
–Provide well defined hierarchy on which to place constraints, trial implementations...
–Provide mechanism for subsystem abstraction, modeling and evaluation at each level
How to do this?
–Drop representation distinction between logic, module, and sub-system levels
–Drop potential for universality in internal representations
–Create mechanism for automatic design abstraction within designer's design decomposition
–Use efficient representation of fundamental model
–Provide feedback to designer for evaluating both the design itself and the representation
Where do we start?
–Interface Protocols are key complexity growth problem
–Designer constructs system model with abstract protocols, required data-flows, possible maps
–Generalize scheduling to provide possible sequencing of sub-systems into systems meeting external protocol constraints (models)

Protocol Constrained Scheduling
Problem: Conventional scheduling algorithms cannot accommodate the typical complex sequencing and timing constraints of modern design.
Three Problems: Specification, Scheduling, and Problem Scale
–Specification: How to specify the required timing in a concise, explicit way?
–Scheduling: How to systematically exploit mapping freedom while meeting the timing requirements?
–Problem Scale: Problems of interest to industry are enormously complex!
Idea: Protocol specification is amenable to NDFA modeling -- so create automata-based model to represent Control/Data-flow freedom
=> All possible implementations exist as sequences of states of the joint automaton

Protocol Specification
Sequencing complexity of digital system interfaces increasing
Specification languages (Verilog?, VHDL) require implicit protocol specification
Alternative specification via NDFA automata (e.g. PBS, Esterel, Custom point language)
–Representation is finite
–Synthesis can be very efficient -- can handle very complex designs
–Provides mechanism for time sequence specification relatively independent of data-flow control semantics
Protocol + CDFG semantics + mapping abstractions make a complete model
–No ad-hoc mapping library (beyond control of designer)
–No convenient dependency binding assumptions (to be worked around by designer)
–No encrypting desired sequential FSM in higher level language!
Designer specifies event sequences he wants
System evaluates/synthesizes ensemble FSM

Design Representation
Model System as hierarchy of design frames
Frames have external protocol specification NDFA, CDFG, and allowed Mappings
Frames contain instances of other frame abstractions (abstracted NDFA/CDFG model)
Resource utilization and sharing restricted to within a design frame
[Figure: a Frame containing an External Protocol, a Control Data Flow Graph, and a Sub-frame Model]

Hierarchy of Refinement
Exact protocol scheduling intractable for practical large problems
Hierarchy of Refinement
–partitions the problem into manageable abstractions
–hides lower level details
–allows systematic high-level pruning of designs before more detailed treatment
–completed sub-frame designs can be abstracted to high level component models
–allows incremental design change/refinement at any level
–provides mechanism for consistency verification

Protocol Scheduling Implementation
Represent CDFG model as Causal (NDFA) Automaton
–Generalization of current scheduling model
–Models all valid data flows
–Models code hoisting, unrolling, transformations...
Represent External Protocol as NDFA automaton
–Very general, efficient model
–Synchronous timing model (can be generalized -- future work)
–Alternative behavior as NDFA alternatives
CDFG maps I/O operations among sub-frames
Sub-frames have interface protocols, abstracted CDFG semantics
Construct ensemble automata model with all valid sequences of events meeting internal and external protocols and causal data-flow constraints
Need only find complete sub-set of all possible states for solution

Scheduling Solution
Every schedule is some subset of states of the ensemble automaton
Must construct causal and complete set of states
Exact solution strategies:
–Construct all states up to resource bounds
–Depth-first search of states
–Heuristic search -- choose good path, complete schedule automatically
–Prune solution space
–Additional constraints or objectives -- technique works best when highly constrained
Heuristic strategies:
–Sub-set BDD representation of reachable states
–Incremental search (this is not verification!)
Possible objectives:
–Communication
–Temporary storage (memory)
–Performance
–Control complexity

DFA Model of Two Stage Pipe
Input = 1 indicates operands are supplied to the pipe
Output = 1 indicates operand is produced by the pipe
[Figure: DFA over states a, b, c, d with input/output edge labels 0/0, 1/1, 1/0, 0/1, 1/0, 0/0, 1/1, and an example state trace]

NDFA Protocol for Two Stage Pipe
Inputs and outputs same as DFA model
Some transitions produce no outputs
[Figure: NDFA over states a, b, c with edge labels -/, 1/, -/, -/1]

Operand Scheduling a CDFG on NDFA Protocol
CDFG to Schedule: Two stage NDFA protocol description for component
Protocol alone is insufficient -- need internal data-flow requirements
Mapping is trivial (in this case)
Protocol + CDFG is sufficient -- but also describes information not needed externally
Solution: Simplify scheduling solution of sub-frame to make abstracted model
[Figure: CDFG with operands A, B, C, D, E and three * operations]

Operand Schedule on NDFA Protocol
Optimal one multiplier schedule (co-execution of protocol and causal automata)
[Figure: CDFG with operands A, B, C, D, E and three * operations, alongside the NDFA over states a, b, c with edge labels -/, 1/, -/, -/1]

Causal Automaton Formulation of Scheduling
Scheduling Problem $(V, E, C, R)$
vertex $v \in V$ is an operation
edge $(u, v) \in E$ is a directed edge representing a data dependency
hyper-edge $\{v_c, V_c^T, V_c^F\} \in C$ groups a control operation and corresponding subsets of operations
hyper-edge $\{\mathrm{bound}, (T_m \subseteq V)\} \in R$ represents a resource bound applied to a subset of (mapped) operations
The edge set is partitioned into a forest of forward edges and a subset of looping edges which point backward
Scheduling solution is a complete, compatible set of deterministic sequences of vertices such that all dependencies are causal and all resource bounds are met at each state, and the set has sequences for each possible future value of the set of controls.
In the following, we will discuss minimum latency and maximal throughput as objective functions.

Single-Cycle Operation Modeling Automata
0 -> 0: Operation unscheduled and remains so
0 -> 1: Operation scheduled next cycle
1 -> 1: Operation scheduled and remains so
1 -> 0: Operation scheduled but result lost
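The four transitions above can be written down as a small table. The following is a minimal sketch, assuming one state bit per operation; the names `MEANING` and `successors` are hypothetical, and leaving out the 1 -> 0 transition by default is an assumption of the sketch, not something the slide states.

```python
from itertools import product

# One state bit per operation: 0 = result not yet known, 1 = result known.
MEANING = {
    (0, 0): "operation unscheduled and remains so",
    (0, 1): "operation scheduled next cycle",
    (1, 1): "operation scheduled and remains so",
    (1, 0): "operation scheduled but result lost",
}

def successors(state, allow_loss=False):
    """Enumerate the next states of a tuple of operation bits.

    No resource or dependency constraints are applied here. Excluding the
    1 -> 0 transition unless allow_loss is set is an assumption of this sketch.
    """
    allowed = {(0, 0), (0, 1), (1, 1)}
    if allow_loss:
        allowed.add((1, 0))
    return [nxt for nxt in product((0, 1), repeat=len(state))
            if all(pair in allowed for pair in zip(state, nxt))]

if __name__ == "__main__":
    for nxt in successors((1, 0)):
        print(nxt)
```

Each operation contributes one independent bit, so the unconstrained automaton is simply the product of the per-operation automata.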

Scheduling Automata
State represents current set of available operands and state of modeling protocol automata
Constraints on transitions
Representation: compact product of mapped modeling automata for each resource protocol
[Figure: modeling automata h, i, k, ...]

Resource Bounds
0 -> 1 indicates resource
Resource bounds constrain simultaneous 0 -> 1 transitions
Iterative constraint on CA
ROBDD representation:
–$2 \times |\mathrm{bound}| \times |\mathrm{operations}|$
[Figure: transition tables for operations h, i, j sharing One Resource]
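A resource bound can be read as a filter on transitions. The sketch below is an explicit-state illustration only (the slides use an ROBDD encoding instead); `resource_uses`, `meets_bounds`, and the example states are hypothetical names chosen for this sketch.

```python
def resource_uses(state, nxt, ops):
    """Count the operations in `ops` that make a 0 -> 1 transition."""
    return sum(1 for op in ops if state[op] == 0 and nxt[op] == 1)

def meets_bounds(state, nxt, bounds):
    """bounds is a list of (limit, operations) pairs, one per resource class."""
    return all(resource_uses(state, nxt, ops) <= limit for limit, ops in bounds)

# Three operations mapped to a single shared resource (illustrative values).
present = {"h": 0, "i": 0, "j": 0}
one_starts = {"h": 0, "i": 1, "j": 0}
two_start = {"h": 1, "i": 1, "j": 0}
bounds = [(1, ("h", "i", "j"))]

if __name__ == "__main__":
    print(meets_bounds(present, one_starts, bounds))
    print(meets_bounds(present, two_start, bounds))
```

Only 0 -> 1 transitions are counted here; operations that stay at 0 or stay at 1 consume nothing in that cycle.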

Dependency Implication
All transitions in which j is active before all of its predecessors are known are removed
BDD Complexity is O(|predecessors| * |operations|)
[Figure: operations h, i, j]
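The dependency rule is likewise a transition filter. This is a minimal explicit-state sketch, assuming a hypothetical operation `w` with predecessors `u` and `v`; these names and the helper `is_causal` are illustrative and are not taken from the slide's figure.

```python
def is_causal(state, nxt, preds):
    """Reject a transition if any operation goes 0 -> 1 while one of its
    predecessors is still unknown (0) in the present state."""
    for op, bit in nxt.items():
        starting = state[op] == 0 and bit == 1
        if starting and any(state[p] == 0 for p in preds.get(op, ())):
            return False
    return True

# Hypothetical dependency graph: w consumes the results of u and v.
preds = {"w": ("u", "v")}

if __name__ == "__main__":
    too_early = is_causal({"u": 1, "v": 0, "w": 0}, {"u": 1, "v": 1, "w": 1}, preds)
    in_order = is_causal({"u": 1, "v": 1, "w": 0}, {"u": 1, "v": 1, "w": 1}, preds)
    print(too_early, in_order)
```

The check reads predecessor bits from the present state, so a result produced in a cycle only becomes usable in the following cycle.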

Example NFA
Assume 1 resource
Transition relation induces graph
Any path from all operations unknown to all known is a valid schedule
Shortest paths are minimum latency schedules
[Figure: operations i, j, h; state graph over 000, i00, ij0, 00h, i0h, ijh]
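The shortest-path reading of scheduling can be sketched with an explicit breadth-first search. The dependency i -> j (with h independent) is an assumption inferred from the state labels 000, i00, ij0, 00h, i0h, ijh; the slides themselves work symbolically rather than enumerating states, and all names below are hypothetical.

```python
from collections import deque
from itertools import product

OPS = ("i", "j", "h")     # bit order follows the slide's state labels
PREDS = {"j": ("i",)}     # ASSUMED dependency: j consumes the result of i
BOUND = 1                 # one resource shared by all operations

def successors(state):
    """Next states allowed by the operation automata, bound, and dependencies."""
    known = dict(zip(OPS, state))
    result = []
    for nxt in product((0, 1), repeat=len(OPS)):
        if any(s == 1 and t == 0 for s, t in zip(state, nxt)):
            continue  # results are never lost in this sketch
        started = [op for op, s, t in zip(OPS, state, nxt) if s == 0 and t == 1]
        if len(started) > BOUND:
            continue  # resource bound on simultaneous 0 -> 1 transitions
        if any(known[p] == 0 for op in started for p in PREDS.get(op, ())):
            continue  # a predecessor is not yet known
        result.append(nxt)
    return result

def min_latency(start, goal):
    """Breadth-first search; returns the cycle count of a shortest path."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state == goal:
            return dist[state]
        for nxt in successors(state):
            if nxt not in dist:
                dist[nxt] = dist[state] + 1
                queue.append(nxt)
    return None

if __name__ == "__main__":
    print(min_latency((0, 0, 0), (1, 1, 1)))
```

With three operations and a single resource, every path from all-unknown to all-known needs at least three cycles, which is what the search reports under these assumptions.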

All Minimum Latency Schedules
Symbolic reachable state analysis
–Newly reached states are saved each cycle
–Backward pruning preserves transitions used in all shortest paths
[Figure: state graph over 000, i00, ij0, 00h, i0h, ijh]
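The two steps, saving the newly reached states of each cycle and then pruning backward from the goal, can be sketched on explicit state sets. The slides perform this symbolically; the version below is an illustration only, and the dependency i -> j with h independent is an assumption read off the state labels. All function names are hypothetical.

```python
from itertools import product

OPS = ("i", "j", "h")     # bit order follows the slide's state labels
PREDS = {"j": ("i",)}     # ASSUMED dependency: j consumes the result of i
BOUND = 1                 # one shared resource

def successors(state):
    known = dict(zip(OPS, state))
    result = []
    for nxt in product((0, 1), repeat=len(OPS)):
        if any(s == 1 and t == 0 for s, t in zip(state, nxt)):
            continue  # results are never lost in this sketch
        started = [op for op, s, t in zip(OPS, state, nxt) if s == 0 and t == 1]
        if len(started) > BOUND:
            continue  # resource bound
        if any(known[p] == 0 for op in started for p in PREDS.get(op, ())):
            continue  # dependency not yet satisfied
        result.append(nxt)
    return result

def label(state):
    """Render a state the way the slide does, e.g. (1, 0, 1) -> 'i0h'."""
    return "".join(op if bit else "0" for op, bit in zip(OPS, state))

def forward_layers(start, goal):
    """Save the newly reached states of each cycle until the goal appears."""
    layers, seen = [{start}], {start}
    while goal not in layers[-1]:
        new = {n for s in layers[-1] for n in successors(s)} - seen
        if not new:
            raise ValueError("goal state is unreachable")
        layers.append(new)
        seen |= new
    return layers

def prune_backward(layers, goal):
    """Keep only the transitions that lie on some shortest path to the goal."""
    keep, edges = {goal}, []
    for layer in reversed(layers[:-1]):
        used = set()
        for s in layer:
            for n in successors(s):
                if n in keep:
                    edges.append((s, n))
                    used.add(s)
        keep = used
    return edges

def count_schedules(edges, start, goal):
    """Count distinct start-to-goal paths through the pruned transitions."""
    if start == goal:
        return 1
    return sum(count_schedules(edges, n, goal) for u, n in edges if u == start)

if __name__ == "__main__":
    start, goal = (0, 0, 0), (1, 1, 1)
    edges = prune_backward(forward_layers(start, goal), goal)
    for s, n in edges:
        print(label(s), "->", label(n))
    print(count_schedules(edges, start, goal))
```

Because each layer holds only states first reached in that cycle, any transition from layer k into the kept part of layer k+1 is guaranteed to lie on a shortest path.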


All Minimum Latency Schedules
The described construction is exact. Suitable heuristics are available and are powerful, since they can use arbitrary subsets of the potential schedules.
[Figure: state graph over 000, i00, ij0, 00h, i0h, ijh]

CDFG Representation
[Figure legend: Operation, Control Dependency, Data Dependency, Fork, Join, Resource Class; example operations h1, i1, i2, j1, j2, k1]

CDFGs: Multiple Control Paths
Guard automata differentiate control paths
–Before control operation scheduled: Control value unknown
–After control operation scheduled: Control value known
Guards are implemented as modified operation automata

CDFGs: Multiple Control Paths
All control paths form ensemble schedule
–Possibly $2^c$ control paths to schedule (non-looping case)
Dummy operation identifies when control path terminates
–Only one termination operation
Ensemble schedule need not be causal!
–Need solution for each control path (Completeness)
–Need compatibility between paths whose control is not resolved (Causality)
–Solution: validation algorithm
–Validation is a path to path property for all control paths in ensemble schedule
–Fixed Point Iteration

CDFG Example
[Figure: CDFG with operations i, j, c, t and one (green) resource; state graph over i00, c000, 00j0, ci00, 0ij0, c0j0, ci0t, cij0, c0jt, cijt, with shortest paths and false termination marked]

Validated CDFG Example
Validation algorithm ensures control paths don’t bifurcate before control value is known
[Figure: state graph over i00, c000, 00j0, ci00, 0ij0, c0j0, cij0, cijt]

Validated CDFG Example
Validation algorithm ensures control paths don’t bifurcate before control value is known
Pruned for all shortest paths as before
[Figure: state graph over i00, 00j0, ci00, c0j0, cij0, cijt]

Validation Algorithm
Validation proceeds on potential traces
Re-traverse automata, dynamically modifying transition relation based on current available states in each time step:
–Allow guard computation only for states with matching histories if the guard is true or false
–Iterate until fixed point on all paths
Apply the following non-linear filter to each transition:
[Equation not recovered from the slide]

Selected CDFG Benchmarks

Large Benchmarks
[Table not recovered from the slide; only the value 957 survives]

Comparison of CPU Times
[Chart not recovered from the slide; surviving label: Heuristic]

Required CPU Seconds

Construction for Looping DFGs
Use trick: 0/1 representation of the MA could be interpreted as 2 mutually exclusive operand productions
Schedule from ~known -> known -> ~known where each 0 -> 1 or 1 -> 0 transition requires a resource
Since dependencies are on operands, add new dependencies in the 1 -> 0 sense as well
Idea is to remove all transitions which do not have a complete set of known or ~known predecessors for the respective sense of operation
So -- get looping DFG automata as nearly the same automata as before
–preserve efficient representation
Selection of “Minimal Latency” solutions is more difficult

Loop construction: resources
Resources: we now count both 0 -> 1 and 1 -> 0 transitions as requiring a resource
Use “Tuple” BDD construction: at most k of n bits
Despite exponential number of product terms, BDD complexity: O(bound * |V|)
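The size claim can be illustrated with a layered decision diagram for "at most k of n bits are set". This is a sketch under stated assumptions: it builds an unreduced diagram keyed by (bit index, ones seen so far) rather than a true ROBDD, and `build_at_most_k` and `evaluate` are hypothetical names. It shows why the node count stays within n * (k + 1) while the number of satisfying assignments grows combinatorially.

```python
from itertools import product
from math import comb

TRUE, FALSE = "T", "F"

def build_at_most_k(n, k):
    """Return (root, nodes); node (i, ones) tests bit i after `ones` set bits."""
    nodes = {}

    def node(i, ones):
        if ones > k:
            return FALSE          # bound already exceeded
        if i == n:
            return TRUE           # all bits tested within the bound
        key = (i, ones)
        if key not in nodes:
            # (low child for bit = 0, high child for bit = 1)
            nodes[key] = (node(i + 1, ones), node(i + 1, ones + 1))
        return key

    return node(0, 0), nodes

def evaluate(root, nodes, bits):
    """Walk the diagram for one assignment of the n bits."""
    cur = root
    while cur not in (TRUE, FALSE):
        low, high = nodes[cur]
        cur = high if bits[cur[0]] else low
    return cur == TRUE

if __name__ == "__main__":
    n, k = 6, 2
    root, nodes = build_at_most_k(n, k)
    satisfying = sum(comb(n, j) for j in range(k + 1))
    print(len(nodes), "nodes for", satisfying, "satisfying assignments")
```

Nodes that have seen the same number of ones at the same level are shared, which is the source of the bound-times-variables size.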

Example CA
State order (v1,v2,v3,v4)
Path 0,9,C,7,2,9,C,7,2,… is a valid schedule
By construction, only 1 instance of any operator can occur in a state
[Figure: DFG with operations v1, v2, v3, v4]

Strategy to Find Maximal Throughput
CA automata construction simple
How to find closed subset of paths guaranteeing optimal throughput?
Could start from known initial state and prune slow paths as before -- but this is not optimal!
Instead:
–Find all reachable states (without resource bounds)
–Use state set to prune unreachable transitions from CA
–Choose operator at random to be pinned (marked)
–Propagate all states with chosen operator until it appears again in same sense
–Verify closure of constructed paths by Fixed Point iteration
–If set is empty -- add one clock to latency and verify again
Result is maximal closed set of paths for which optimal throughput is guaranteed

Maximal Throughput Example
DFG above has closed 3-cycle solution (2 resources)
However, average latency is 2.5 cycles: (a,d) (b,e) (a,c) (b,d) (c,e) (a,d) …
Requires 5 states to implement optimal throughput instance
In general, it is possible that a k-cycle closed solution may exist, even if no k-state solution can be found
Current implementation finds all possible k-cycle solutions
[Figure: DFG with operations a, b, c, d, e]
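The 2.5-cycle figure follows from counting: the repeating sequence has 5 states and each operation appears in exactly two of them, so two loop iterations complete every 5 cycles. A small sketch of that arithmetic, with hypothetical names `schedule` and `average_latency`:

```python
from collections import Counter
from fractions import Fraction

# Repeating state sequence from the slide; each state lists the operations issued.
schedule = [("a", "d"), ("b", "e"), ("a", "c"), ("b", "d"), ("c", "e")]
RESOURCES = 2

def average_latency(states):
    """Cycles per loop iteration for one period of a repeating schedule."""
    counts = Counter(op for state in states for op in state)
    per_op = set(counts.values())
    if len(per_op) != 1:
        raise ValueError("every operation must occur equally often per period")
    iterations = per_op.pop()
    return Fraction(len(states), iterations)

if __name__ == "__main__":
    assert all(len(state) <= RESOURCES for state in schedule)
    print(float(average_latency(schedule)))
```

Five operations on two resources need at least 5/2 cycles per iteration, so this five-state sequence meets the resource lower bound.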

EWF Looping Benchmarks
[Table not recovered from the slide; only the value 268 survives]

Synthetic Benchmarks
Over 100 synthetic benchmarks tested
–Sizes 50 operator, 100 operator, randomly assigned dependency chains, resources
32% had no causal schedule
35% had all maximum throughput schedules found in 15 minute timeout (1 minute Reachable States, 14 minute Fixed Point)
33% Timed Out
–Analysis of timeout cases: most included disconnected independent sub-graphs
–Trial partitioning of the Transition Relation looks very promising on these cases (time/space reduction nearly quadratic!)

Synthetic Loop Benchmarks

Schedule Exploration: Loops
Idea: Use partial symbolic traversal to find states bounding minimal latency paths
–Latency -- Identify all paths completing cycle in given number of steps
–Repeatability -- Fixed Point Algorithm to eliminate all paths which cannot repeat in given latency
–Validation -- Ensure all possible control paths are present for each remaining path
–Optimization -- Selection of Performance Objective

Kernel Execution Sequence Set
Path from Loop cut to first repeating states
Represents candidates for loop kernel
[Figure: Loop Cut and Loop Kernel over operations a, b, c, d, i, j, k, l and a~, b~, c~, d~, i~, j~, k~, l~]

Repeatable Kernel Execution Sequence Set
Fixed-point prunes non-repeating states
–Only repeatable loop kernels remain
–Paths not all same length
–Average latency <= shortest Repeating Kernel
[Figure: Loop Cut and Repeatable Loop Kernel over operations a, b, c, i, j, k, l and a~, b~, c~, i~, j~, k~, l~]

Validation I
Schedule consists of bundle of compatible paths for each possible future
Not feasible to identify all schedules
Instead, eliminate all states which do not belong to some ensemble schedule
Fragile since any further pruning requires re-validation
Double fixed point

Validation II
Path Divergence -- Control Behavior
–Ensure each path is part of some complete set for each control outcome
–Ensure that each set is Causal
[Figure: paths over operations b, c, i, j, k, l and b~, c~, i~, j~, k~, l~]

Loop Cuts and Kernels
Method Covers all Conventional Loop Transformations
–Sequential Loop
–Loop winding
–Loop Pipelining
[Figure: Loop Cut and Loop Kernel positions for each transformation]

Results
Conventional Scheduling x speedup over ILP
Control Scheduling: Complexity typically pseudo polynomial in number of branching variables
Cyclic Scheduling: Reduced preamble complexity
Capacity: operands in exact implementation
General Control Dominated Scheduling: Implicit formulation of all forms of CDFG transformation
–Exact Solutions with Millions of Control paths
Protocol Constrained Scheduling: Exact for small instances -- needs sensible pruning of domain

MIPS Model
SimpleScalar (MIPS IV superset) Model
Trace Probabilities from MediaBench
Hierarchical Model
–Collection of Instruction Tasks in Flight
–Each Instruction Task is Complete Behavioral Model of Instruction Execution, including all instruction types, hazards, controls, and Contention for Physical Resources
–Additional Sequential Protocols for Memory Subsystem, both Fetch and Load/Store

Processor Composition
Ordered Fetch/Commit
3 Simultaneous Instruction Executions
Sequencing of Instructions separated from pipeline
Out of Order Prefetch or Commit can be Modeled
[Figure labels: Bypass, Next ins, Next PPC, Bypass, Instruction, PPC]

PC update: Speculative Fetch
Speculate Joins to allow early prefetch and address computation

MIPS Transaction Dependencies

MIPS Results: Constraints
Scenario A
–1/2 cycle tasks, Single Bypass
–2 cycle Pipelined Double Word Memory Fetch
–2 cycle Pipelined Multiply
–2R/1W Register File, 2 ALUs, 2 port Memory
Scenario B
–2 cycle Memory Read/Write/Fetch
–2R -1R/1W Register File, 1 ALU, 1 port Memory
–Cache 1 cycle hit/3 cycle miss, Deferred Pipeline

MIPS Results: Instruction Mix
Media Bench Tuning:
–88% reg-reg, reg-imm, br taken, load single
–80% branch taken
–35% Single Bypass Hazard
–1% Multiple Bypass (Stall in model)
Two Sets of Priority Mixes
–Mix 1: favors (reg-reg, reg-imm, br-taken)
–Mix 2: favors (load-sw, br-taken)

MIPS Results: Mix 1
Mix 1 favors reg-reg, reg-imm, and br-taken

MIPS Results: Mix 2
Mix 2 favors loads, reg-reg w. branches

Cache and I/O Protocol
For 3 instructions in flight > 542,000 control paths!
Schedules still exact -- every optimal sequence is constructed

Conclusions
NFA protocol modeling shown to be effective representation for generalized scheduling problem
Efficiency of algorithms so far is comparable or superior to any known exact technique
Potential for powerful heuristics based on sub-set representation
First exact solutions for a wide variety of generalized scheduling problems