Instruction Generation and Regularity Extraction for Reconfigurable Processors
Philip Brisk, Adam Kaplan, Ryan Kastner*, Majid Sarrafzadeh
Computer Science Department, UCLA; *ECE Department, UCSB
CASES, October 11, 2002, Grenoble, France

Outline
- What is Instruction Generation?
- Related Work
- Sequential and Parallel Templates
- The Algorithm
- Experimental Setup
- Experimental Results
- Conclusion and Future Work

Instruction Generation
Given a set of applications, what computations should be customized? The main objective is to find complex, commonly occurring computation patterns. We look for computational patterns at the instruction level, where a basic operation is an add, multiply, shift, etc.
[Figure: an application-specific instruction-set processor with customized (hard/soft) macros in a PLD: RAM, VPB, ALU, register bank, and control.]

Customization and Performance
A customized instruction must offer some measurable performance increase. In this work, we categorize two types of customized instructions and quantify the performance each offers.
- Sequential instructions: savings can come from instruction fetch reduction or from datapath optimization (e.g., ADD-ADD converted to a 3-input adder).
- Parallel instructions: given multiple ALUs and datapaths, allow data-independent instructions to be computed simultaneously.
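As a small illustration of the ADD-ADD case above (hypothetical IR, not the paper's toolchain), the fused template computes the same value with one instruction fetch instead of two:

```python
# Two dependent ADDs as separate IR operations.
def add_add(a, b, c):
    t = a + b          # first ADD
    return t + c       # second ADD, data-dependent on the first

# The sequential template: a single fused 3-input adder.
def add3(a, b, c):
    return a + b + c   # one instruction fetch, one datapath traversal

print(add_add(1, 2, 3), add3(1, 2, 3))  # prints: 6 6
```

The fetch savings come purely from replacing two dependent operations with one; the datapath savings depend on the synthesized 3-input adder being faster than two chained 2-input adds.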

Problem Definition
Determining customized functionality transforms into regularity extraction. Regularity extraction means finding common sub-structures (templates) in one graph or a collection of graphs. Each application can be specified by a collection of graphs (CDFGs), and templates are implemented as customized instructions. A related problem is instruction selection.

What Is Instruction Generation? The Instruction Selection Problem
[Figure: an instruction-selection example, an IR expression tree covered by templates such as MOV, MEM, +, and *, producing code like R1 &lt;- M[fp + a]; R2 &lt;- Ti + 4; R1 &lt;- R1 + R2; R2 &lt;- FP + X; M[R1] &lt;- M[R2].]
In instruction selection, templates are given as inputs. How do we determine the templates?

What Is Instruction Generation? The Alternative: Instruction Generation
Reconfigurable architectures allow us to rethink the assumptions underlying our notion of instruction selection. The target machine language can be changed by reconfiguring the FPGA to implement new instructions. This presents new challenges for mapping the IR to machine language; we propose a scheme by which this mapping is obtained at compile time.

What Is Instruction Generation? Applications to CAD and Embedded System Design
Template generation plays a role in the interaction between compilation and high-level synthesis. Each template corresponds to a resource that must be provided by the underlying architecture; a high-level synthesis tool can then allocate resources and schedule the operations on them. This work investigates the latency-area tradeoff created by instruction generation.

Related Work
- Similar techniques have proven beneficial in reducing area and increasing performance for the PipeRench architecture (Goldstein et al., 2000).
- Corazao et al. have shown that well-matched, regular templates can have a significant positive impact on critical path delay and clock speed.
- Kastner et al. (ICCAD 2001) formulated an algorithm for template matching as well as template generation for hybrid reconfigurable systems.

Our Model of Computation: Control Data Flow Graphs
  if (cond1) bb1(); else bb2();
  bb3();
  switch (test1) {
    case c1: bb4(); break;
    case c2: bb5(); break;
    case c3: bb6(); break;
  }
  bb7();
[Figure: the corresponding CDFG. cond1 branches (T/F) to bb1/bb2, which merge at bb3; test1 switches (c1/c2/c3) to bb4/bb5/bb6, which merge at bb7. bb = basic block.]

Instruction Generation
Ideally, we want large templates that occur often. The basic idea is an iterative process: examine the dataflow graphs and cluster combinations of nodes that occur frequently.
- Sequential template generation identifies templates whose IR operations have data dependencies between them.
- Parallel template generation identifies dataflow operations that may be scheduled in parallel.

Sequential Template Generation
Algorithm designed by Kastner et al. [ICCAD 2001]. The basic idea is to examine each edge in the DFG:
- The type of an edge is the ordered pair of its source and sink node types.
- Maintain a count for each edge type.
- Cluster the most frequently occurring edge type by replacing each matching vertex pair (head and tail) with a super-vertex, maintaining the original vertices in an internal DAG.
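The edge-counting step above can be sketched as follows. This is a minimal sketch with hypothetical node ids and type labels; the full algorithm also rewrites the graph after each clustering step and repeats:

```python
from collections import Counter

def most_frequent_edge_type(edges, node_type):
    """Classify each DFG edge by the ordered (source-type, sink-type)
    pair and return the most frequent pair together with its count."""
    counts = Counter((node_type[u], node_type[v]) for u, v in edges)
    return counts.most_common(1)[0]

# Hypothetical DFG: two MUL->ADD edges and one ADD->ADD edge.
node_type = {1: "MUL", 2: "ADD", 3: "MUL", 4: "ADD", 5: "ADD"}
edges = [(1, 2), (3, 4), (4, 5)]
print(most_frequent_edge_type(edges, node_type))  # (('MUL', 'ADD'), 2)
```

Here the MUL-ADD edge type wins, so each MUL feeding an ADD would be collapsed into one super-vertex before the counts are recomputed.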

Sequential Template Generation (Example)
[Figure: a DFG with VAR, IMM, NEG, LDA, and LOD operands feeding MUL and ADD nodes; the frequent MUL-to-ADD edges are clustered into super-vertices.]

Parallel Template Generation
Instead of examining DFG edges, we must determine whether pairs of computations can be scheduled in parallel. We introduce a data structure called the All-Pairs Common Slack Graph (APCSG) to help with this analysis. APCSG edges are placed between nodes that could possibly be scheduled together: two nodes can be scheduled at the same time if they share common slack.

All-Pairs Common Slack Graph (APCSG)
Common slack: the total number of time steps in which two operations x and y could both be scheduled by some scheduling heuristic.
The APCSG is an undirected graph whose nodes correspond to operations and whose edges represent the common slack between each pair of operations.
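One way to derive common slack, assuming a unit-latency scheduling model (the talk does not pin down the exact model), is from each operation's ASAP/ALAP window; the common slack of two operations is then the overlap of their windows:

```python
def schedule_windows(nodes, edges, horizon):
    """ASAP/ALAP windows for a unit-latency DAG. `nodes` must be in
    topological order; `edges` are (u, v) dependence pairs; `horizon`
    is the latest permitted time step."""
    preds = {n: [] for n in nodes}
    succs = {n: [] for n in nodes}
    for u, v in edges:
        preds[v].append(u)
        succs[u].append(v)
    asap, alap = {}, {}
    for n in nodes:                      # earliest start: after all preds
        asap[n] = max((asap[p] + 1 for p in preds[n]), default=0)
    for n in reversed(nodes):            # latest start: before all succs
        alap[n] = min((alap[s] - 1 for s in succs[n]), default=horizon)
    return {n: (asap[n], alap[n]) for n in nodes}

def common_slack(wx, wy):
    """Number of time steps at which both operations could be placed."""
    return max(0, min(wx[1], wy[1]) - max(wx[0], wy[0]) + 1)

# a and b are independent and share one feasible step; c depends on both,
# so it never overlaps them and gets no APCSG edge.
w = schedule_windows(["a", "b", "c"], [("a", "c"), ("b", "c")], horizon=1)
print(common_slack(w["a"], w["b"]), common_slack(w["a"], w["c"]))  # 1 0
```

Dependent pairs get zero common slack automatically, which matches the rule that APCSG edges only join nodes that could be scheduled together.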

All-Pairs Common Slack Graph (Example)
[Figure: a seven-node DFG (A through G) alongside its APCSG, with edges joining the operations that share common slack.]

Parallel Template Generation Algorithm
  Given: a labeled digraph G(V, E)
  T := {}                                  // T is a set of template types
  while not stop_conditions_met(G):
      APCSG := create_apcsg(G)
      T := determine_template_candidates(APCSG)
      cluster_vertices(G, T)

Parallel Template Generation (Example)
[Figure: the same DFG as before, with VAR, IMM, LDA, and LOD operands feeding MUL and ADD nodes; data-independent MUL and ADD pairs are clustered into parallel templates.]

Stopping Conditions
So, when should we stop clustering a graph? Aside from pragmatic arguments, a correct stopping condition is essential if we are to prove that our template generation algorithm is optimal under some criterion.

Stopping Criteria We Have Considered
- Percentage of nodes covered
- Number of nodes left in the graph
- Ratio of the number of nodes in the graph before and after clustering
- Number of unique template types exceeds a given threshold
- Templates exceed a given size
- Percentage of overall slack lost in the graph over an iteration
Stopping Criteria We Have Used
- Template sizes are restricted to at most 5 nodes total.
- The algorithm stops when the total number of nodes is less than half of what it started with.
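The two criteria actually used can be sketched as a single predicate. The signature is hypothetical (the slide only names the routine stop_conditions_met); it assumes the caller tracks the current node count and the largest template formed so far:

```python
MAX_TEMPLATE_SIZE = 5  # templates restricted to <= 5 nodes total

def stop_conditions_met(num_nodes_now, num_nodes_original, largest_template):
    """Stop clustering once a template has reached the size cap, or once
    clustering has removed at least half of the original nodes."""
    if largest_template >= MAX_TEMPLATE_SIZE:
        return True
    return num_nodes_now < num_nodes_original / 2

print(stop_conditions_met(60, 100, 3))   # False: keep clustering
print(stop_conditions_met(49, 100, 3))   # True: fewer than half remain
```

Either condition alone ends the loop, so the cheaper node-count test could equally be checked first; the order here is arbitrary.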

Scheduling Constraints
Clustering imposes constraints on the scheduler: the operations inside a template MUST be performed together. Essentially, we have scheduled our operations at the compiler level. What kind of job did we do?
[Figure: a scheduler assigning a fused + / * template to ALU1 across clock steps 1, 2, ...]

Measuring the "Damage"
Schedule length is the latency of all the operations; ideally, we want it short. We measure the schedule length of three resulting DAGs:
- the original, non-clustered DAG
- sequential templates only
- sequential and parallel templates
[Figure: scheduler timeline illustrating the length of a schedule.]
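On a resource-unconstrained machine, schedule length is just the DAG's critical path, which is enough to compare the original DAG against its clustered versions. This is a sketch; the actual experiments use a high-level synthesis scheduler with resource constraints:

```python
def schedule_length(nodes, preds, duration):
    """Critical-path length of a DAG. `nodes` must be topologically
    ordered; `preds[n]` lists n's predecessors; `duration[n]` is n's
    latency in cycles."""
    finish = {}
    for n in nodes:
        start = max((finish[p] for p in preds[n]), default=0)
        finish[n] = start + duration[n]
    return max(finish.values())

# A chain of four unit-latency ops, versus the same chain after
# clustering adjacent pairs into two single-cycle super-nodes
# (assuming each fused template still fits in one cycle).
chain = schedule_length([1, 2, 3, 4],
                        {1: [], 2: [1], 3: [2], 4: [3]},
                        {n: 1 for n in (1, 2, 3, 4)})
clustered = schedule_length(["t1", "t2"],
                            {"t1": [], "t2": ["t1"]},
                            {"t1": 1, "t2": 1})
print(chain, clustered)  # 4 2
```

The comparison only makes sense if the per-template durations are realistic; a fused template that needs a longer clock period can erase the cycle-count win.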

Experimental Setup
[Figure: the co-compiler toolchain. Compiler IR (SUIF), then dataflow graph and DAG generation from a CDFG pass, then the sequential template generation algorithm, feeding a high-level synthesis tool that uses a locally optimal geometric scheduling algorithm.]

Benchmarks
- CONVOLUTION: image convolution algorithm
- DeCSS: algorithm for breaking DVD encryption
- DES: the cryptographic symmetric encryption standard for over 20 years
- Rijndael AES: the new Advanced Encryption Standard

Experimental Procedure
First, we compiled each program to the SUIF IR using the front end built by The Portland Group and Stanford University. Next, we converted the SUIF IR to CDFG form. Then we performed template generation on each basic block of each program. We selected 4 large dataflow graphs from each program to schedule and evaluate; we scheduled the dataflow graphs after template generation and compared them to the original graphs.

Results

Conclusion and Future Work
The sequential template generation algorithm can be expanded to accommodate parallel templates. Parallel template generation reduces latency at the expense of slack and area. In the future, we plan to repeat these experiments with a more realistic architecture description that has the ability to cross-schedule parallel instructions. We also plan to explore compiler transformations, such as function inlining, to extract even more regularity and to obtain a more global view of the program.
