Reconfigurable Computing (High-level Acceleration Approaches)


1 Reconfigurable Computing (High-level Acceleration Approaches)
Dr. Phillip Jones, Scott Hauck Reconfigurable Computing Laboratory Iowa State University Ames, Iowa, USA

2 Projects Ideas: Relevant conferences
FPL FPT FCCM FPGA DAC ICCAD Reconfig RTSS RTAS ISCA Micro Super Computing HPCA IPDPS

3 Initial Project Proposal Slides (5-10 slides)
Project team list: Name, Responsibility (who is the project leader) Project idea Motivation (why is this interesting, useful) What will be the end result High-level picture of final product High-level Plan Break project into milestones Provide initial schedule: I would initially schedule aggressively to have the project complete by Thanksgiving. Issues will pop up that cause the schedule to slip. System block diagrams High-level algorithms (if any) Concerns Implementation Conceptual Research papers related to your project idea

4 Projects: Target Timeline
Teams Formed and Idea: Mon 10/11 Project idea in Power Point 3-5 slides Motivation (why is this interesting, useful) What will be the end result High-level picture of final product Project team list: Name, Responsibility High-level Plan/Proposal: Wed 10/20 Power Point 5-10 slides System block diagrams High-level algorithms (if any) Concerns Implementation Conceptual Related research papers (if any)

5 Common Questions

6 Overview First 15 minutes of Google FPGA lecture How to run Gprof
Discuss some high-level approaches for accelerating applications.

7 What you should learn Start to get a feel for approaches for accelerating applications.

8 Why use Customized Hardware?
Great talk about the benefits of Heterogeneous Computing

9 Profiling Applications
Finding bottlenecks Profiling tools: gprof, Valgrind
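For reference, a minimal gprof recipe (the program name myapp is made up for the example):

    gcc -pg -O2 myapp.c -o myapp           # build with profiling instrumentation
    ./myapp                                # run normally; writes gmon.out in the current directory
    gprof ./myapp gmon.out > profile.txt   # flat profile and call graph

With Valgrind, `valgrind --tool=callgrind ./myapp` produces a callgrind.out.<pid> file that can be inspected with callgrind_annotate or KCachegrind.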

10 Pipelining How many ns to process 100 input vectors? Assuming each LUT has a 1 ns delay. Input vector <A,B,C,D> output A 4-LUT 4-LUT 4-LUT 4-LUT B C DFF DFF DFF DFF D How many ns to process 100 input vectors? Assume a 1 ns clock 4-LUT B C D A DFF 1 DFF delay per output
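One way to work the numbers (a sketch; the slide's diagrams define the exact circuits): with no pipeline registers, each input vector must propagate through all four LUT levels, 4 x 1 ns = 4 ns, before the next vector can be applied, so 100 vectors take roughly 400 ns. With a DFF after each LUT stage and a 1 ns clock, the pipeline has a 4-cycle fill latency and then delivers one result per cycle, so 100 vectors take roughly 4 + 99 = 103 ns.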

11 Pipelining (Systolic Arrays)
Dynamic Programming Start with base case Lower left corner Formula for computing numbering cells 3. Final result in upper right corner.

12 Pipelining (Systolic Arrays)
Dynamic Programming Start with base case Lower left corner Formula for computing numbering cells 3. Final result in upper right corner. 1

13 Pipelining (Systolic Arrays)
Dynamic Programming Start with base case Lower left corner Formula for computing numbering cells 3. Final result in upper right corner. 1 1 1

14 Pipelining (Systolic Arrays)
Dynamic Programming 1 Start with base case Lower left corner Formula for computing numbering cells 3. Final result in upper right corner. 1 2 1 1 1

15 Pipelining (Systolic Arrays)
Dynamic Programming 1 3 Start with base case Lower left corner Formula for computing numbering cells 3. Final result in upper right corner. 1 2 3 1 1 1

16 Pipelining (Systolic Arrays)
Dynamic Programming 1 3 6 Start with base case Lower left corner Formula for computing numbering cells 3. Final result in upper right corner. 1 2 3 1 1 1

17 Pipelining (Systolic Arrays)
Dynamic Programming 1 3 6 Start with base case Lower left corner Formula for computing numbering cells 3. Final result in upper right corner. 1 2 3 1 1 1 How many ns to process if CPU can process one cell per clock (1 ns clock)?

18 Pipelining (Systolic Arrays)
Dynamic Programming 1 3 6 Start with base case Lower left corner Formula for computing numbering cells 3. Final result in upper right corner. 1 2 3 1 1 1 How many ns to process if FPGA can obtain maximum parallelism each clock? (1 ns clock)

19 Pipelining (Systolic Arrays)
Dynamic Programming 1 3 6 Start with base case Lower left corner Formula for computing numbering cells 3. Final result in upper right corner. 1 2 3 1 1 1 What speedup would an FPGA obtain (assuming maximum parallelism) for a 100x100 matrix? (Hint: find a formula for an NxN matrix)
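A sketch of the hinted formula (not from the slides): a CPU filling one cell per 1 ns clock needs N^2 ns for an NxN matrix. If the FPGA can compute every cell on an anti-diagonal in parallel, it needs only one clock per anti-diagonal, and an NxN matrix has 2N - 1 of them, so about 2N - 1 ns. The speedup is therefore roughly N^2 / (2N - 1), about N/2 for large N; for N = 100 that is 10000 / 199, or about 50x.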

20 Dr. James Moscola (Example)
MATL2 D10 ML9 MATP1 IL7 IR8 END3 E12 IL11 ROOT0 MP3 D6 MR5 ML4 S0 IL1 IR2 c g a 1 2 3 ROOT0 MATP1 MATL2 END3 1 2 3 20

21 Example RNA Model 1 2 3 ROOT0 MATP1 MATL2 END3 1 2 3 21 MATL2 MATP1
ML9 MATP1 IL7 IR8 END3 E12 IL11 ROOT0 MP3 D6 MR5 ML4 S0 IL1 IR2 c g a 1 2 3 ROOT0 MATP1 MATL2 END3 1 2 3 21

22 Baseline Architecture Pipeline
END3 MATL2 MATP1 ROOT0 E12 IL11 D10 ML9 IR8 IL7 D6 MR5 ML4 MP3 IR2 IL1 S0 u g g c g a c a c c c residue pipeline 22

23 Processing Elements IL7,3,2 IR8,3,2 ML9,3,2 D10,3,2 ML4 + = + = + = +
1 2 3 .40 -INF .22 .72 .30 .44 1 j  IL7,3,2 2 + ML4_t(7) = 3 IR8,3,2 + ML4_t(8) = ML9,3,2 + ML4_t(9) = D10,3,2 + + ML4,3,3 = .22 ML4_t(10) ML4_e(A) ML4_e(C) ML4_e(G) ML4_e(U) input residue, xi 23

24 Baseline Results for Example Model
Comparison to Infernal software: Infernal run on an Intel Xeon 2.8 GHz; baseline architecture run on a Xilinx Virtex-II 4000 (occupied 88% of logic resources, ran at 100 MHz) Input database of 100 million residues Bulk of time spent on I/O (41.434 s)

25 Expected Speedup on Larger Models
Name Num PEs Pipeline Width Pipeline Depth Latency (ns) HW Processing Time (seconds) Total Time with measured I/O (seconds) Infernal Time (seconds) Infernal Time (QDB) (seconds) Expected Speedup over Infernal Expected Speedup over Infernal (w/QDB) RF00001 39492 195 19500 349492 128443 8236 3027 RF00016 43256 282 28200 336000 188521 7918 4443 RF00034 38772 187 18700 314836 87520 7419 2062 RF00041 44509 206 20600 388156 118692 9147 2797 Example 81 26 6 600 1039 868 25 20 Speedup estimated ... using 100 MHz clock for processing database of 100 Million residues Speedups range from 500x to over 13,000x larger models with more parallelism exhibit greater speedups

26 Distributed Memory ALU Cache BRAM BRAM PE BRAM BRAM

27 Next Class Models of Computation (Design Patterns)

28 Questions/Comments/Concerns
Write down: the main point of the lecture, and one thing that's still not quite clear; OR, if everything is clear, give an example of how to apply something from the lecture.

29 Instructor: Dr. Phillip Jones
CPRE 583 Reconfigurable Computing Lecture 11: Fri 10/1/2010 (Design Patterns) Instructor: Dr. Phillip Jones Reconfigurable Computing Laboratory Iowa State University Ames, Iowa, USA

30 Initial Project Proposal Slides (5-10 slides)
Project team list: Name, Responsibility (who is the project leader) Team size: 3-4 (5 case-by-case) Project idea Motivation (why is this interesting, useful) What will be the end result High-level picture of final product High-level Plan Break project into milestones Provide initial schedule: I would initially schedule aggressively to have the project complete by Thanksgiving. Issues will pop up that cause the schedule to slip. System block diagrams High-level algorithms (if any) Concerns Implementation Conceptual Research papers related to your project idea

31 Projects Ideas: Relevant conferences
FPL FPT FCCM FPGA DAC ICCAD Reconfig RTSS RTAS ISCA Micro Super Computing HPCA IPDPS

32 Initial Project Proposal Slides (5-10 slides)
Project team list: Name, Responsibility (who is the project leader) Project idea Motivation (why is this interesting, useful) What will be the end result High-level picture of final product High-level Plan Break project into milestones Provide initial schedule: I would initially schedule aggressively to have the project complete by Thanksgiving. Issues will pop up that cause the schedule to slip. System block diagrams High-level algorithms (if any) Concerns Implementation Conceptual Research papers related to your project idea

33 Weekly Project Updates
The current state of your project write-up Even in the early stages of the project you should be able to write a rough draft of the Introduction and Motivation sections The current state of your Final Presentation Your Initial Project proposal presentation (due Wed 10/20) should make a good starting point for your Final presentation What things are working & not working What roadblocks are you running into

34 Overview Class Project (example from 2008) Common Design Patterns

35 What you should learn Introduction to common Design Patterns & Compute Models

36 Outline Design patterns Why are they useful? Examples Compute models

37 Outline Design patterns Why are they useful? Examples Compute models

38 References Reconfigurable Computing (2008) [1] Chapter 5: Compute Models and System Architectures Scott Hauck, Andre DeHon Design Patterns for Reconfigurable Computing [2] Andre DeHon (FCCM 2004) Type Architectures, Shared Memory, and the Corollary of Modest Potential [3] Lawrence Snyder: Annual Review of Computer Science (1986) Design Patterns: Abstraction and Reuse of Object Oriented Design [4] E. Gamma (1992) The Timeless Way of Building [5] C. Alexander (1979)

39 Design Patterns Design patterns are solutions to recurring problems.

40 Reconfigurable Hardware Design
“Building good reconfigurable designs requires an appreciation of the different costs and opportunities inherent in reconfigurable architectures” [2] “How do we teach programmers and designers to design good reconfigurable applications and systems?” [2] Traditional approach: Read lots of papers for different applications Over time figure out ad-hoc tricks Better approach?: Use design patterns to provide a more systematic way of learning how to design It has been shown in other realms that studying patterns is useful Object oriented software [4] Computer Architecture [5]

41 Common Language Provides a means to organize and structure the solution to a problem Provide a common ground from which to discuss a given design problem Enables the ability to share solutions in a consistent manner (reuse)

42 Describing a Design Pattern [2]
10 attributes suggested by Gamma (Design Patterns, 1995) Name: Standard name Intent: What problem is being addressed?, How? Motivation: Why use this pattern Applicability: When can this pattern be used Participants: What components make up this pattern Collaborations: How do components interact Consequences: Trade-offs Implementation: How to implement Known Uses: Real examples of where this pattern has been used. Related Patterns: Similar patterns, patterns that can be used in conjunction with this pattern, when would you choose a similar pattern instead of this pattern.

43 Example Design Pattern
Coarse-grain Time-multiplexing Template Specialization

44 Coarse-grain Time-Multiplexing
B M3 M1 M2 M1 M2 A B M3 Temp M3 Temp Configuration 1 Configuration 2

45 Coarse-grain Time-Multiplexing
Name: Coarse-grained Time-Multiplexing Intent: Enable a design that is too large to fit on a chip all at once to run as multiple subcomponents Motivation: Method to share limited fixed resources to implement a design that is too large as a whole.

46 Coarse-grain Time-Multiplexing
Applicability (Requirements): Configuration can be done on large time scale No feedback loops in computation Feedback loop only spans the current configuration Feedback loop is very slow Participants: Computational graph Control algorithm Collaborations: Control algorithm manages when sub-graphs are loaded onto the device

47 Coarse-grain Time-Multiplexing
Consequences: Often platforms take millions of cycles to reconfigure Need an app that will run for 10’s of millions of cycles before needing to reconfigure May need large buffers to store data during a reconfiguration Known Uses: Video processing pipeline [Villasenor] “Video Communications using Rapidly Reconfigurable Hardware”, Transactions on Circuits and Systems for Video Technology 1995 Automatic Target Recognition [Villasenor] “Configurable Computer Solutions for Automatic Target Recognition”, FCCM 1996

48 Coarse-grain Time-Multiplexing
Implementation: Break design into multiple sub graphs that can be configured onto the platform in sequence Design a controller to orchestrate the configuration sequencing Take steps to minimize configuration time Related patterns: Streaming Data Queues with Back-pressure

49 Coarse-grain Time-Multiplexing
B M3 M1 M2 M1 M2 A B M3 Temp M3 Temp Configuration 1 Configuration 2

50 Coarse-grain Time-Multiplexing
Assume: 1.) reconfiguration takes 10 thousand clocks 2.) 100 MHz clock 3.) We need to process for 100x the time spent in reconfiguration to get the needed speedup. 4.) A and B each produce one byte per clock M2 M1 A B M3 M1 M2 M1 M2 A B M3 Temp M3 Temp Configuration 1 Configuration 2

51 Coarse-grain Time-Multiplexing
Assume: 1.) reconfiguration takes 10 thousand clocks 2.) 100 MHz clock 3.) We need to process for 100x the time spent in reconfiguration to get the needed speedup. 4.) A and B each produce one byte per clock M2 M1 A B M3 M1 M2 M1 M2 What constraint does this place on Temp? A B 1 MB buffer What if the data path is changed from 8-bit to 64-bit? M3 Temp M3 Temp 8 MB buffer Likely need off-chip memory Configuration 1 Configuration 2
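Working the assumptions through (a sketch): 10 thousand clocks at 100 MHz is 100 us of reconfiguration time, so each configuration should run for about 100 x 100 us = 10 ms, i.e. roughly 1,000,000 clocks. At one byte per clock the intermediate data held in Temp between configurations is on the order of 1,000,000 bytes, hence the 1 MB figure on the slide; widening the datapath from 8 bits to 64 bits multiplies that by 8 to roughly 8 MB, which is why off-chip memory becomes likely.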

52 Template Specialization
Empty LUTs A(1) A(0) LUT LUT LUT LUT - - - - C(3) C(2) C(1) C(0) Mult by 3 Mult by 5 A(1) A(1) A(0) A(0) LUT LUT LUT LUT LUT LUT LUT LUT 3 6 9 1 1 1 1 5 10 15 1 1 1 1 C(3) C(2) C(1) C(0) C(3) C(2) C(1) C(0)

53 Template Specialization
Name: Template Specialization Intent: Reduce the size or time needed for a computation. Motivation: Use early-bound data and slowly changing data to reduce circuit size and execution time.

54 Template Specialization
Applicability: When circuit specialization can be adapted quickly Example: Can treat LUTs as small memories that can be written. No interconnect modifications Participants: Template cell: Contains specialization configuration Template filler: Manages what and how a configuration is written to a Template cell Collaborations: Template filler manages Template cell

55 Template Specialization
Consequences: Cannot optimize as much as when a circuit is fully specialized for a given instance. Overhead is needed to allow the template to implement several specializations. Known Uses: Multiply-by-Constant String Matching Implementation: Multiply-by-Constant Use a LUT as memory to store the answer Use a controller to update this memory when a different constant should be used.
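As a software analogy of the multiply-by-constant implementation above (a minimal C sketch, not hardware code; the function names are made up), the "template cell" is a small lookup table standing in for the LUT contents, and the "template filler" rewrites it when the constant changes:

    /* Illustrative sketch of template specialization in software. */
    #include <stdint.h>
    #include <stdio.h>

    #define IN_BITS 3                        /* supports inputs 0..7, as on the later slide */
    #define TABLE_SIZE (1u << IN_BITS)

    static uint8_t mult_table[TABLE_SIZE];   /* stands in for the LUT contents */

    /* Template filler: rewrite the table when a different constant is needed. */
    static void specialize(uint8_t constant)
    {
        for (uint32_t a = 0; a < TABLE_SIZE; a++)
            mult_table[a] = (uint8_t)(a * constant);
    }

    /* Template cell: at run time the multiply is just a table lookup. */
    static uint8_t mult_by_constant(uint8_t a)
    {
        return mult_table[a & (TABLE_SIZE - 1u)];
    }

    int main(void)
    {
        specialize(3);
        printf("%u\n", (unsigned)mult_by_constant(5)); /* 15 */
        specialize(2);                                 /* re-specialize without changing the "circuit" */
        printf("%u\n", (unsigned)mult_by_constant(5)); /* 10 */
        return 0;
    }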

56 Template Specialization
Related patterns: CONSTRUCTOR EXCEPTION TEMPLATE

57 Template Specialization
Empty LUTs A(1) A(0) LUT LUT LUT LUT - - - - C(3) C(2) C(1) C(0) Mult by 3 Mult by 5 A(1) A(1) A(0) A(0) LUT LUT LUT LUT LUT LUT LUT LUT 3 6 9 1 1 1 1 5 10 15 1 1 1 1 C(3) C(2) C(1) C(0) C(3) C(2) C(1) C(0)

58 Template Specialization
Mult by 3 A(1) A(0) LUT LUT LUT LUT 1 1 1 1 3 6 9 C(3) C(2) C(1) C(0) Multiply by a constant of 2: Support inputs of 0 - 7

59 Template Specialization
Mult by 3 A(1) A(0) LUT LUT LUT LUT 1 1 1 1 3 6 9 C(3) C(2) C(1) C(0)

60 Template Specialization
Mult by 3 A(1) A(0) LUT LUT LUT LUT 1 1 1 1 3 6 9 Mult by 2 A(2) A(1) A(0) LUT LUT LUT LUT 2 4 6 8 10 12 14 1 1 1 C(3) C(2) C(1) C(0)

61 Catalog of Patterns (Just a start) [2]
[2] Identifies 89 patterns Area-Time Tradeoff Basic (implementation): Coarse-grain Time-Multiplex Parallel (Expression): Dataflow, Data Parallel Parallel (Implementation): SIMD, Communicating FSM Reducing Area or Time Reuse Hardware (implementation): Pipelining Specialization (Implementation): Template Communications Layout (Expression/Implementation): Systolic Memory Numbers and Functions

62 Catalog of Patterns (Just a start) [2]
[2] Identifies 89 patterns Area-Time Tradeoff Basic (implementation): Coarse-grain Time-Multiplex Parallel (Expression): Dataflow, Data Parallel Parallel (Implementation): SIMD, Communicating FSM Reducing Area or Time Reuse Hardware (implementation): Pipelining Specialization (Implementation): Template Communications Layout (Expression/Implementation): Systolic Memory Numbers and Functions

63 Next Lecture Continue Compute Models

64 Lecture Notes:

65 Instructor: Dr. Phillip Jones
CPRE 583 Reconfigurable Computing Lecture 12: Wed 10/6/2010 (Compute Models) Instructor: Dr. Phillip Jones Reconfigurable Computing Laboratory Iowa State University Ames, Iowa, USA

66 Projects Ideas: Relevant conferences
FPL FPT FCCM FPGA DAC ICCAD Reconfig RTSS RTAS ISCA Micro Super Computing HPCA IPDPS

67 Projects: Target Timeline
Teams Formed and Idea: Mon 10/11 Project idea in Power Point 3-5 slides Motivation (why is this interesting, useful) What will be the end result High-level picture of final product Project team list: Name, Responsibility High-level Plan/Proposal: Fri 10/22 Power Point 5-10 slides System block diagrams High-level algorithms (if any) Concerns Implementation Conceptual Related research papers (if any)

68 Projects: Target Timeline
Work on projects: 10/ /8 Weekly update reports More information on updates will be given Presentations: Last Wed/Fri of class Present / Demo what is done at this point 15-20 minutes (depends on number of projects) Final write up and Software/Hardware turned in: Day of final (TBD)

69 Project Grading Breakdown
50% Final Project Demo 30% Final Project Report 30% of your project report grade will come from your 5-6 project updates (due Fridays at midnight) 20% Final Project Presentation

70 Common Questions

71 Common Questions

72 Common Questions

73 Common Questions

74 Overview Compute Models

75 What you should learn Introduction to Compute Models

76 Outline Design patterns (previous lecture)
Why are they useful? Examples Compute models (Abstraction) System Architectures (Implementation)

77 Outline Design patterns (previous lecture)
Why are they useful? Examples Compute models (Abstraction) System Architectures (Implementation)

78 References Reconfigurable Computing (2008) [1]
Chapter 5: Compute Models and System Architectures Scott Hauck, Andre DeHon Design Patterns for Reconfigurable Computing [2] Andre DeHon (FCCM 2004) Type Architectures, Shared Memory, and the Corollary of Modest Potential [3] Lawrence Snyder: Annual Review of Computer Science (1986)

79 Building Applications
Problem -> Compute Model + Architecture -> Application Questions to answer How to think about composing the application? How will the compute model lead to a naturally efficient architecture? How does the compute model support composition? How to conceptualize parallelism? How to tradeoff area and time? How to reason about correctness? How to adapt to technology trends (e.g. larger/faster chips)? How does compute model provide determinacy? How to avoid deadlocks? What can be computed? How to optimize a design, or validate application properties?

80 Compute Models Compute Models [1]: High-level models of the flow of computation. Useful for: Capturing parallelism Reasoning about correctness Decomposition Guide designs by providing constraints on what is allowed during a computation Communication links How synchronization is performed How data is transferred

81 Two High-level Families
Data Flow: Single-rate Synchronous Data Flow Synchronous Data Flow Dynamic Streaming Dataflow Dynamic Streaming Dataflow with Peeks Streaming Data Flow with Allocation Sequential Control: Finite Automata (i.e. Finite State Machine) Sequential Controller with Allocation Data Centric Data Parallel

82 Data Flow Graph of operators that data (tokens) flows through
Composition of functions X X +

83 Data Flow Graph of operators that data (tokens) flows through
Composition of functions X X +

84 Data Flow Graph of operators that data (tokens) flows through
Composition of functions X X +

85 Data Flow Graph of operators that data (tokens) flows through
Composition of functions X X +

86 Data Flow Graph of operators that data (tokens) flows through
Composition of functions X X +

87 Data Flow Graph of operators that data (tokens) flows through
Composition of functions X X +

88 Data Flow Graph of operators that data (tokens) flows through
Composition of functions X X +

89 Data Flow Graph of operators that data (tokens) flows through
Composition of functions Captures: Parallelism Dependences Communication X X +

90 Single-rate Synchronous Data Flow
One token rate for the entire graph For example all operations take one token on a given link before producing an output token Same power as a Finite State Machine 1 1 1 update - 1 1 1 1 1 1 1 1 F copy

91 Synchronous Data Flow - F
Each link can have a different constant token input and output rate Same power as the single-rate version, but for some applications easier to describe Automated ways to detect/determine: Deadlock Buffer sizes 1 10 1 update - 1 1 1 1 10 10 1 1 F copy
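For reference (standard synchronous dataflow reasoning, not spelled out on the slide): on every link where the producer emits p tokens per firing and the consumer absorbs c tokens per firing, a consistent schedule must satisfy p * q(producer) = c * q(consumer), where q counts firings per graph iteration. Solving these balance equations gives the repetition vector; if no non-trivial solution exists the rates are inconsistent, and attempting to schedule one such iteration detects deadlock and bounds every buffer. For example, a link with rates 10 and 1 forces the rate-1 side to fire 10 times per firing of the rate-10 side, and that link needs room for at least 10 tokens.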

92 Dynamic Streaming Data Flow
Token rates dependent on data Just need to add two structures Switch Select in in0 in1 S S Switch Select out0 out1 out

93 Dynamic Streaming Data Flow
Token rates dependent on data Just need to add two structures Switch, Select More Powerful Difficult to detect Deadlocks Still Deterministic 1 Switch y x x y S F0 F1 x y x y Select

94 Dynamic Streaming Data Flow with Peeks
Allow operator to fire before all inputs have arrived An example where this is useful is the merge operation Now execution can be nondeterministic Answer depends on input arrival times Merge

95 Dynamic Streaming Data Flow with Peeks
Allow operator to fire before all inputs have arrived An example where this is useful is the merge operation Now execution can be nondeterministic Answer depends on input arrival times A Merge

96 Dynamic Streaming Data Flow with Peeks
Allow operator to fire before all inputs have arrived An example where this is useful is the merge operation Now execution can be nondeterministic Answer depends on input arrival times B Merge A

97 Dynamic Streaming Data Flow with Peeks
Allow operator to fire before all inputs have arrived An example where this is useful is the merge operation Now execution can be nondeterministic Answer depends on input arrival times Merge B A

98 Streaming Data Flow with Allocation
Removes the need for static links and operators. That is, the Data Flow graph can change over time More Power: Turing Complete More difficult to analyze Could be useful for some applications Telecom applications. For example if a channel carries voice versus data the resources needed may vary greatly Can take advantage of platforms that allow runtime reconfiguration

99 Sequential Control Sequence of subroutines
Programming languages (C, Java) Hardware control logic (Finite State Machines) Transform global data state

100 Finite Automata (i.e. Finite State Machine)
Can verify state reachability in polynomial time S1 S2 S3

101 Sequential Controller with Allocation
Adds ability to allocate memory. Equivalent to adding new states Model becomes Turing Complete S1 S2 S3

102 Sequential Controller with Allocation
Adds ability to allocate memory. Equivalent to adding new states Model becomes Turing Complete S1 S2 S4 S3 SN

103 Data Parallel Multiple instances of an operation type acting on separate pieces of data. For example: Single Instruction Multiple Data (SIMD) Identical match test on all items in a database Inverting the color of all pixels in an image
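A minimal C sketch of the pixel-inversion example (plain sequential code; on SIMD hardware each iteration maps to a lane, on an FPGA to replicated processing elements):

    /* Illustrative sketch: the identical operation applied independently to every data item. */
    #include <stddef.h>
    #include <stdint.h>

    static void invert_pixels(uint8_t *pixels, size_t count)
    {
        for (size_t i = 0; i < count; i++)
            pixels[i] = (uint8_t)(255u - pixels[i]);
    }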

104 Data Centric Similar to Data Flow, but the state contained in the objects of the graph is the focus, not the tokens flowing through the graph Network flow example Source1 Dest1 Source2 Switch Dest2 Source3 Flow rate Buffer overflow

105 Multi-threaded Multi-threaded: a compute model made up of multiple sequential controllers that have communication channels between them Very general, but often too much power and flexibility. No guidance for: Ensuring determinism Dividing application into threads Avoiding deadlock Synchronizing threads The models discussed can be defined in terms of a Multi-threaded compute model

106 Multi-threaded (Illustration)

107 Streaming Data Flow as Multithreaded
Thread: is an operator that performs transforms on data as it flows through the graph Thread synchronization: Tokens sent between operators

108 Data Parallel as Multithreaded
Thread: is a data item Thread synchronization: data updated with each sequential instruction

109 Caution with Multithreaded Model
Use when a stricter compute model does not give enough expressiveness. Define restrictions to limit the amount of expressive power that can be used Define synchronization policy How to reason about deadlocking

110 Other Models “A Framework for Comparing Models of computation” [1998]
E. Lee, A. Sangiovanni-Vincentelli Transactions on Computer-Aided Design of Integrated Circuits and Systems “Concurrent Models of Computation for Embedded Software”[2005] E. Lee, S. Neuendorffer IEEE Proceedings – Computers and Digital Techniques

111 Next Lecture System Architectures

112 User Defined Instruction
MP3 FPGA Power PC PC Display.c Ethernet (UDP/IP) User Defined Instruction VGA Monitor

113 User Defined Instruction
MP3 FPGA Power PC PC Display.c Ethernet (UDP/IP) User Defined Instruction VGA Monitor

114 User Defined Instruction
MP3 FPGA Power PC PC Display.c Ethernet (UDP/IP) User Defined Instruction VGA Monitor

115 MP3 Notes MUCH less VHDL coding than MP2
But you will be writing most of the VHDL from scratch The focus will be more on learning to read a specification (Power PC coprocessor interface protocol), and designing hardware that follows that protocol. You will be dealing with some pointer intensive C-code. It’s a small amount of C code, but somewhat challenging to get the pointer math right.

116 Questions/Comments/Concerns
Write down: the main point of the lecture, and one thing that's still not quite clear; OR, if everything is clear, give an example of how to apply something from the lecture.

117 Lecture Notes

118 Instructor: Dr. Phillip Jones
CPRE 583 Reconfigurable Computing Lecture 13: Fri 10/8/2010 (System Architectures) Instructor: Dr. Phillip Jones Reconfigurable Computing Laboratory Iowa State University Ames, Iowa, USA

119 Initial Project Proposal Slides (5-10 slides)
Project team list: Name, Responsibility (who is the project leader) Team size: 3-4 (5 case-by-case) Project idea Motivation (why is this interesting, useful) What will be the end result High-level picture of final product High-level Plan Break project into milestones Provide initial schedule: I would initially schedule aggressively to have the project complete by Thanksgiving. Issues will pop up that cause the schedule to slip. System block diagrams High-level algorithms (if any) Concerns Implementation Conceptual Research papers related to your project idea

120 Projects Ideas: Relevant conferences
FPL FPT FCCM FPGA DAC ICCAD Reconfig RTSS RTAS ISCA Micro Super Computing HPCA IPDPS

121 Initial Project Proposal Slides (5-10 slides)
Project team list: Name, Responsibility (who is the project leader) Project idea Motivation (why is this interesting, useful) What will be the end result High-level picture of final product High-level Plan Break project into milestones Provide initial schedule: I would initially schedule aggressively to have the project complete by Thanksgiving. Issues will pop up that cause the schedule to slip. System block diagrams High-level algorithms (if any) Concerns Implementation Conceptual Research papers related to your project idea

122 Projects: Target Timeline
Work on projects: 10/ /8 Weekly update reports More information on updates will be given Presentations: Last Wed/Fri of class Present / Demo what is done at this point 15-20 minutes (depends on number of projects) Final write up and Software/Hardware turned in: Day of final (TBD)

123 Common Questions

124 Common Questions

125 Common Questions

126 Overview Common System Architectures Plus/Delta mid-semester feedback

127 What you should learn Introduction to common System Architectures

128 Outline Design patterns (previous lecture)
Why are they useful? Examples Compute models (Abstraction) System Architectures (Implementation)

129 Outline Design patterns (previous lecture)
Why are they useful? Examples Compute models (Abstraction) System Architectures (Implementation)

130 References Reconfigurable Computing (2008) [1]
Chapter 5: Compute Models and System Architectures Scott Hauck, Andre DeHon

131 System Architectures Compute Models: Help express the parallelism of an application System Architecture: How to organize application implementation

132 Efficient Application Implementation
Compute model and system architecture should work together Both are a function of The nature of the application Required resources Required performance The nature of the target platform Resources available

133 Efficient Application Implementation
(Image Processing) Platform 1 (Vector Processor) Platform 2 (FPGA)

134 Efficient Application Implementation
(Image Processing) Compute Model System Architecture Platform 1 (Vector Processor) Platform 2 (FPGA)

135 Efficient Application Implementation
(Image Processing) Compute Model System Architecture Platform 1 (Vector Processor) Platform 2 (FPGA)

136 Efficient Application Implementation
(Image Processing) Data Flow Compute Model Streaming Data Flow System Architecture Platform 1 (Vector Processor) Platform 2 (FPGA)

137 Efficient Application Implementation
(Image Processing) Data Flow Compute Model Streaming Data Flow System Architecture Platform 1 (Vector Processor) Platform 2 (FPGA)

138 Efficient Application Implementation
(Image Processing) Compute Model System Architecture Platform 1 (Vector Processor) Platform 2 (FPGA)

139 Efficient Application Implementation
(Image Processing) Data Parallel Compute Model Vector System Architecture Platform 1 (Vector Processor) Platform 2 (FPGA)

140 Efficient Application Implementation
(Image Processing) Data Flow Compute Model Streaming Data Flow System Architecture Platform 1 (Vector Processor) Platform 2 (FPGA)

141 Efficient Application Implementation
(Image Processing) Data Flow Compute Model Streaming Data Flow System Architecture Platform 1 (Vector Processor) Platform 2 (FPGA)

142 Efficient Application Implementation
(Image Processing) X X Data Flow Compute Model + Streaming Data Flow System Architecture Platform 1 (Vector Processor) Platform 2 (FPGA)

143 Implementing Streaming Dataflow
Data presence variable length connections between operators data rates vary between operator implementations data rates varying between operators Datapath sharing not enough spatial resources to host entire graph balanced use of resources (e.g. operators) cyclic dependencies impacting efficiency Interconnect sharing Interconnects are becoming difficult to route Links between operators infrequently used High variability in operator data rates Streaming coprocessor Extreme resource constraints

144 Data Presence X X +

145 Data Presence X X data_ready data_ready + data_ready

146 Data Presence X X FIFO FIFO data_ready data_ready + FIFO data_ready

147 Data Presence X X stall stall FIFO FIFO data_ready data_ready + FIFO

148 Data Presence Flow control: Term typically used in networking X X stall
FIFO FIFO data_ready data_ready + FIFO stall data_ready Flow control: Term typically used in networking

149 Data Presence Flow control: Term typically used in networking
Increase flexibility of how the application can be implemented X X stall stall FIFO FIFO data_ready data_ready + FIFO stall data_ready Flow control: Term typically used in networking

150 Implementing Streaming Dataflow
Data presence variable length connections between operators data rates vary between operator implementations data rates varying between operators Datapath sharing not enough spatial resources to host entire graph balanced use of resources (e.g. operators) cyclic dependencies impacting efficiency Interconnect sharing Interconnects are becoming difficult to route Links between operators infrequently used High variability in operator data rates Streaming coprocessor Extreme resource constraints

151 Datapath Sharing X X +

152 Datapath Sharing Platform may only have one multiplier X X +

153 Datapath Sharing Platform may only have one multiplier X +

154 Datapath Sharing Platform may only have one multiplier REG X REG +

155 Datapath Sharing Platform may only have one multiplier REG X FSM REG +

156 Datapath Sharing Platform may only have one multiplier
REG X FSM REG + Important to keep track of where data is coming from!

157 Implementing Streaming Dataflow
Data presence variable length connections between operators data rates vary between operator implementations data rates varying between operators Datapath sharing not enough spatial resources to host entire graph balanced use of resources (e.g. operators) cyclic dependencies impacting efficiency Interconnect sharing Interconnects are becoming difficult to route Links between operators infrequently used High variability in operator data rates Streaming coprocessor Extreme resource constraints

158 Interconnect sharing X X +

159 Interconnect sharing Need more efficient use of interconnect X X +

160 Interconnect sharing Need more efficient use of interconnect X X +

161 Interconnect sharing Need more efficient use of interconnect X X FSM +

162 Implementing Streaming Dataflow
Data presence variable length connections between operators data rates vary between operator implementations data rates varying between operators Datapath sharing not enough spatial resources to host entire graph balanced use of resources (e.g. operators) cyclic dependencies impacting efficiency Interconnect sharing Interconnects are becoming difficult to route Links between operators infrequently used High variability in operator data rates Streaming coprocessor Extreme resource constraints

163 Streaming coprocessor
See SCORE chapter 9 of text for an example.

164 Sequential Control Typically thought of in the context of sequential programming on a processor (e.g. C, Java programming) Key to organizing, synchronizing, and controlling highly parallel operations Time multiplexing resources: when a task is too large for the computing fabric Increasing data path utilization

165 Sequential Control X + A B C

166 Sequential Control X + A B C A*x^2 + B*x + C

167 Sequential Control X + A B C C A B X X + A*x^2 + B*x + C A*x^2 + B*x + C

168 Finite State Machine with Datapath (FSMD)
B X X + A*x^2 + B*x + C

169 Finite State Machine with Datapath (FSMD)
B X FSM X + A*x^2 + B*x + C
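As a rough software model of the FSMD idea (a sketch under the assumption of a single shared multiplier, not the text's circuit; all names are made up), the computation A*x^2 + B*x + C is sequenced over several "cycles" by a small state machine:

    /* Illustrative sketch of an FSMD sharing one multiplier across cycles. */
    #include <stdio.h>

    typedef enum { S_MUL_XX, S_MUL_AXX, S_MUL_BX, S_SUM, S_DONE } state_t;

    static int poly_fsmd(int a, int b, int c, int x)
    {
        state_t state = S_MUL_XX;
        int reg = 0, acc = 0;                 /* registers holding intermediate values */
        while (state != S_DONE) {
            switch (state) {
            case S_MUL_XX:  reg = x * x;         state = S_MUL_AXX; break; /* cycle 1: x*x   */
            case S_MUL_AXX: acc = a * reg;       state = S_MUL_BX;  break; /* cycle 2: A*x^2 */
            case S_MUL_BX:  reg = b * x;         state = S_SUM;     break; /* cycle 3: B*x   */
            case S_SUM:     acc = acc + reg + c; state = S_DONE;    break; /* cycle 4: sum   */
            default:        state = S_DONE;      break;
            }
        }
        return acc;
    }

    int main(void)
    {
        printf("%d\n", poly_fsmd(2, 3, 4, 5)); /* 2*25 + 3*5 + 4 = 69 */
        return 0;
    }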

170 Sequential Control: Types
Finite State Machine with Datapath (FSMD) Very Long Instruction Word (VLIW) data path control Processor Instruction augmentation Phased reconfiguration manager Worker farm

171 Very Long Instruction Word (VLIW) Datapath Control
See 5.2 of text for this architecture

172 Processor

173 Instruction Augmentation

174 Phased Configuration Manager
Will see more detail with SCORE architecture from chapter 9 of text.

175 Worker Farm Chapter 5.2 of text

176 Bulk Synchronous Parallelism
See chapter 5.2 for more detail

177 Data Parallel Single Program Multiple Data
Single Instruction Multiple Data (SIMD) Vector Vector Coprocessor

178 Data Parallel

179 Data Parallel

180 Data Parallel

181 Data Parallel

182 Cellular Automata

183 Multi-threaded

184 Next Lecture

185 Questions/Comments/Concerns
Write down: the main point of the lecture, and one thing that's still not quite clear; OR, if everything is clear, give an example of how to apply something from the lecture.

186 Lecture Notes Add CSP/Multithread as root of a simple tree
15+5 (late start) minutes of time left Think of one to two in-class exercises (10 min) Data Flow graph optimization algorithm? Deadlock detection on a small model? Give some examples of where a given compute model would map to a given application. Systolic array (implement) or Dataflow compute model) String matching (FSM) (MISD) New image for MP3, too dark of a color

187 Instructor: Dr. Phillip Jones
CPRE 583 Reconfigurable Computing Lecture 14: Fri 10/13/2010 (Streaming Applications) Instructor: Dr. Phillip Jones Reconfigurable Computing Laboratory Iowa State University Ames, Iowa, USA

188 Projects Ideas: Relevant conferences
FPL FPT FCCM FPGA DAC ICCAD Reconfig RTSS RTAS ISCA Micro Super Computing HPCA IPDPS

189 Initial Project Proposal Slides (5-10 slides)
Project team list: Name, Responsibility (who is the project leader) Project idea Motivation (why is this interesting, useful) What will be the end result High-level picture of final product High-level Plan Break project into milestones Provide initial schedule: I would initially schedule aggressively to have the project complete by Thanksgiving. Issues will pop up that cause the schedule to slip. System block diagrams High-level algorithms (if any) Concerns Implementation Conceptual Research papers related to your project idea

190 Common Questions

191 Common Questions

192 Common Questions

193 Overview Streaming Applications (Chapters 8 & 9) Simulink SCORE

194 What you should learn Two approaches for implementing streaming applications

195 Data Flow: Quick Review
Graph of operators that data (tokens) flows through Composition of functions X X +

196 Data Flow: Quick Review
Graph of operators that data (tokens) flows through Composition of functions X X +

197 Data Flow: Quick Review
Graph of operators that data (tokens) flows through Composition of functions X X +

198 Data Flow: Quick Review
Graph of operators that data (tokens) flows through Composition of functions X X +

199 Data Flow: Quick Review
Graph of operators that data (tokens) flows through Composition of functions X X +

200 Data Flow: Quick Review
Graph of operators that data (tokens) flows through Composition of functions X X +

201 Data Flow: Quick Review
Graph of operators that data (tokens) flows through Composition of functions X X +

202 Data Flow Graph of operators that data (tokens) flows through
Composition of functions Captures: Parallelism Dependences Communication X X +

203 Streaming Application Examples
Some image processing algorithms Edge Detection Image Recognition Image Compression (JPEG) Network data processing String Matching (your MP2 assignment) Sorting??

204 Sorting Initial list of items Split Split Split Sort Sort Sort Sort
merge merge merge

205 Example Tools for Streaming Application Design
Simulink from MATLAB: Graphical based SCORE (Stream Computation Organized for Reconfigurable Hardware): A programming model

206 Simulink (MatLab) What is it?
MatLab module that allows building and simulating systems through a GUI interface

207 Simulink: Example Model

208 Simulink: Example Model

209 Simulink: Sub-Module

210 Simulink: Example Model

211 Simulink: Example Model

212 Simulink: Example Plot

213 Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection

214 Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 1 2 1 -2 2 -1 1 -1 -2 -1 Sobel X gradient Sobel Y gradient

215 Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection Detect Horizontal Edges Detect Vertical Edges -1 1 1 2 1 -2 2 -1 1 -1 -2 -1 Sobel X gradient Sobel Y gradient

216 Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 -2 2 50 -1 1 50 50 50 50 50 50 50 50

217 Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -2 2 50 -1 1 50 50 50 50 50 50 50 50

218 Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -2 2 50 -1 1 50 50 50 50 50 50 50 50

219 Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 -1 1 50 50 50 50 50 50 50 50

220 Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 -1 1 50 50 50 50 50 50 50 50

221 Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 -1 1 50 50 50 50 50 50 50 50

222 Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -1 1 50 50 50 50 50 50 50 50

223 Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -1 1 50 50 50 50 50 50 50 50

224 Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -150 -1 1 50 50 50 50 50 50 50 50

225 Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -150 -1 1 50 50 50 50 50 50 50 50

226 Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -150 -1 1 50 50 50 50 50 50 50 50

227 Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -150 -1 1 50 50 50 50 50 150 50 50 50

228 Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -150 -1 1 50 50 50 50 50 150 50 50 50

229 Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -150 -1 1 50 50 50 50 50 150 -150 50 50 50

230 Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -150 -1 1 50 50 50 50 50 150 -150 -50 50 50 50

231 Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -150 -1 1 50 50 50 50 50 150 -150 -50 -50 50 50 50

232 Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -150 -1 1 50 50 50 50 50 150 -150 -50 -50 50 100 50 50

233 Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -150 -1 1 50 50 50 50 50 150 -150 -50 -50 50 100 50 50

234 Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -150 -1 1 50 50 50 50 50 150 -150 -50 -50 50 100 -100 50 50

235 Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -150 -1 1 50 50 50 50 50 150 -150 -50 -50 50 100 -100 -100 50 50

236 Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -150 -1 1 50 50 50 50 50 150 -150 -50 -50 50 100 -100 -100 -100 50 50

237 Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -150 -1 1 50 50 50 50 50 150 -150 -50 -50 50 100 -100 -100 -100 50 150 50

238 Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -150 -1 1 50 50 50 50 50 150 -150 -50 -50 50 100 -100 -100 -100 50 150 50

239 Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -150 -1 1 50 50 50 50 50 150 -150 -50 -50 50 100 -100 -100 -100 50 150 -150 50

240 Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -150 -1 1 50 50 50 50 50 150 -150 -50 -50 50 100 -100 -100 -100 50 150 -150 -50 50

241 Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -150 -1 1 50 50 50 50 50 150 -150 -50 -50 50 100 -100 -100 -100 50 150 -150 -50 -50 50

242 Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -150 -1 1 50 50 50 50 50 150 -150 -50 -50 50 100 -100 -100 -100 50 150 -150 -50 -50 50 200

243 Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -150 -1 1 50 50 50 50 50 150 -150 -50 -50 50 100 -100 -100 -100 50 150 -150 -50 -50 50 200

244 Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -150 -1 1 50 50 50 50 50 150 -150 -50 -50 50 100 -100 -100 -100 50 150 -150 -50 -50 50 200 -200

245 Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -150 -1 1 50 50 50 50 50 150 -150 -50 -50 50 100 -100 -100 -100 50 150 -150 -50 -50 50 200 -200

246 Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -150 -1 1 50 50 50 50 50 150 -150 -50 -50 50 100 -100 -100 -100 50 150 -150 -50 -50 50 200 -200

247 Top Level

248 Shifter

249 Multiplier

250 Input Image

251 Output Image

252 SCORE Overview of the SCORE programming approach
SCORE: Stream Computations Organized for Reconfigurable Execution Developed by the University of California, Berkeley and the California Institute of Technology FPL 2000 overview presentation

253 Next Lecture Data Parallel

254 Questions/Comments/Concerns
Write down: the main point of the lecture, and one thing that's still not quite clear; OR, if everything is clear, give an example of how to apply something from the lecture.

255 Lecture Notes

256 Instructor: Dr. Phillip Jones
CPRE 583 Reconfigurable Computing Lecture 15: Fri 10/15/2010 (Reconfiguration Management) Instructor: Dr. Phillip Jones Reconfigurable Computing Laboratory Iowa State University Ames, Iowa, USA

257 Announcements/Reminders
Midterm: Take home portion (40%) given Friday 10/22, due Tue 10/26 (midnight) In class portion (60%) Wed 10/27 Distance students will have in class portion given via a timed WebCT (2 hour) session (take on Wed, Thur or Friday). Start thinking of class projects and forming teams Submit teams and project ideas: Mon 10/11 midnight Project proposal presentations: Fri 10/22 MP3: PowerPC Coprocessor offload (today/tomorrow) Problem 2 of HW 2 (released after MP3 gets released)

258 Projects Ideas: Relevant conferences
FPL FPT FCCM FPGA DAC ICCAD Reconfig RTSS RTAS ISCA Micro Super Computing HPCA IPDPS

259 Common Questions

260 Common Questions

261 Overview Chapter 4: Reconfiguration Management

262 What you should learn Some basic configuration architectures
Key issues when managing the reconfiguration of a system

263 Reconfiguration Management
Goal: Minimize the overhead associated with run-time reconfiguration Why it is important to address: Can take 100’s of milliseconds to reconfigure a device For high performance applications this can be a large overhead (i.e. decreases performance)

264 High Level Configuration Setups
Externally trigger reconfiguration CPU Configuration Request FPGA ROM (bitfile) Config Data FSM Config Control (CC)

265 High Level Configuration Setups
Self trigger reconfiguration FPGA Config Data ROM (bitfile) FSM CC

266 Configuration Architectures
Single-context Multi-context Partially Reconfigurable Relocation & Defragmentation Pipeline Reconfiguration Block Reconfigurable

267 Single-context FPGA Config clk Config I/F Config Data Config enable
OUT IN OUT IN OUT EN EN EN Config enable

268 Multi-context FPGA 1 1 2 2 3 3 Config clk Context switch Config Config
OUT IN OUT IN EN EN Context switch 1 1 Context 1 Enable 2 2 Context 2 Enable 3 3 Context 3 Enable Config Enable Config Enable Config Data Config Data

269 Partially Reconfigurable
Reduce the amount of configuration data to send to the device, thus decreasing reconfiguration overhead Need addressable configuration memory, as opposed to single-context daisy-chain shifting Example: Encryption Change key And logic dependent on key PR devices AT40K Xilinx Virtex series (and Spartan, but not at run time) Need to make sure partial configs do not overlap in space/time (typically a config needs to be placed in a specific location; not as homogeneous as you would think in terms of resources and timing delays)

270 Partially Reconfigurable

271 Partially Reconfigurable
Full Reconfig 10-100’s ms

272 Partially Reconfigurable
Partial Reconfig 100’s us - 1’s ms

273 Partially Reconfigurable
Partial Reconfig 100’s us - 1’s ms

274 Partially Reconfigurable
Partial Reconfig 100’s us - 1’s ms

275 Partially Reconfigurable
Partial Reconfig 100’s us - 1’s ms

276 Partially Reconfigurable
Partial Reconfig 100’s us - 1’s ms Typically partial configuration modules map to a specific physical location

277 Relocation and Defragmentation
Make configuration architectures support relocatable modules Example of defragmentation: the text gives a good example (defrag or swap out; 90% decrease in reconfig time compared to full single context) Best fit, first fit, … Limiting factors: Routing/logic is heterogeneous (timing issues, need modified routes) Special resources needed (e.g. hard multipliers, BRAMs) Easier if there are blocks of homogeneity Connection to external I/O (fixed IP cores, board restrictions) Virtualized I/O (fixed pins with multiple internal I/Fs?) 2D architectures are more difficult to deal with Summary of features a PR architecture should have: Homogeneous logic and routing layout Bus-based communication (e.g. network on chip) 1D organization for relocation

278 Relocation and Defragmentation
B C

279 Relocation and Defragmentation

280 Relocation and Defragmentation

281 Relocation and Defragmentation

282 Relocation and Defragmentation

283 Relocation and Defragmentation
More efficient use of Configuration Space C A

284 Pipeline Reconfigurable
Example: PipeRench Simplifies reconfiguration Limit what can be implemented Cycle Virtual Pipeline stage 1 2 3 4 PE PE PE PE 1 1 1 PE PE PE PE 2 2 2 3 3 3 PE PE PE PE 4 4 Cycle Physical Pipeline stage 1 2 3 3 3 1 1 1 4 4 2 2 2

285 Block Reconfigurable Swappable Logic Units
Abstraction layer over a general PR architecture: SCORE Config Data

286 Managing the Reconfiguration Process
Choosing a configuration When to load Where to load Reduce how often one needs to reconfigure, hiding latency

287 Configuration Grouping
What to pack Pack multiple related in time configs into one Simulated annealing, clustering based on app control flow

288 Configuration Caching
When to load LRU, credit based dealing with variable sized configs

289 Configuration Scheduling
Prefetching Control flow graph Static: compiler-inserted config instructions Dynamic: probabilistic approaches MM (branch prediction) Constraints Resource Real-time Mitigation System status and prediction What are the current requests Predict which config combination will give the best speedup

290 Software-based Relocation Defragmentation
Placing the R/D decision on the CPU host, not the on-chip config controller

291 Context Switching Reach a safe state, then start where it left off.

292 Next Lecture Data Parallel

293 Questions/Comments/Concerns
Write down: the main point of the lecture, and one thing that's still not quite clear; OR, if everything is clear, give an example of how to apply something from the lecture.

294 Lecture Notes

295 Instructor: Dr. Phillip Jones
CPRE 583 Reconfigurable Computing Lecture 16: Fri 10/20/2010 (Data Parallel Architectures) Instructor: Dr. Phillip Jones Reconfigurable Computing Laboratory Iowa State University Ames, Iowa, USA

296 Announcements/Reminders
Midterm: Take home portion (40%) given Friday 10/29, due Tue 11/2 (midnight) In class portion (60%) Wed 11/3 Distance students will have in class portion given via a timed WebCT (2 hour) session (take on Wed, Thur or Friday). Start thinking of class projects and forming teams Submit teams and project ideas: Mon 10/11 midnight Project proposal presentations: Fri 10/22 MP3: PowerPC Coprocessor offload (today): Problem 2 of HW 2 (released after MP3 gets released)

297 Projects Ideas: Relevant conferences
FPL FPT FCCM FPGA DAC ICCAD Reconfig RTSS RTAS ISCA Micro Super Computing HPCA IPDPS

298 Common Questions

299 Common Questions

300 Overview Data Parallel Architectures: MP3 Demo/Overview
Chapters 5.2.4, and chapter 10 MP3 Demo/Overview

301 What you should learn Data Parallel Architecture basics
Flexibility Reconfigurable Hardware Adds

302 Data Parallel Architectures

303 Data Parallel Architectures

304 Data Parallel Architectures

305 Data Parallel Architectures

306 Data Parallel Architectures

307 Data Parallel Architectures

308 Data Parallel Architectures

309 Data Parallel Architectures

310 Data Parallel Architectures

311 Data Parallel Architectures

312 Next Lecture Project initial presentations.

313 Questions/Comments/Concerns
Write down: the main point of the lecture, and one thing that's still not quite clear; OR, if everything is clear, give an example of how to apply something from the lecture.

314 Lecture Notes

315 Instructor: Dr. Phillip Jones
CPRE 583 Reconfigurable Computing Lecture 17: Fri 10/22/2010 (Initial Project Presentations) Instructor: Dr. Phillip Jones Reconfigurable Computing Laboratory Iowa State University Ames, Iowa, USA

316 Projects Ideas: Relevant conferences
FPL FPT FCCM FPGA DAC ICCAD Reconfig RTSS RTAS ISCA Micro Super Computing HPCA IPDPS

317 Initial Project Proposal Slides (5-10 slides)
Project team list: Name, Responsibility (who is the project leader) Project idea Motivation (why is this interesting, useful) What will be the end result High-level picture of final product High-level Plan Break project into milestones Provide initial schedule: I would initially schedule aggressively to have the project complete by Thanksgiving. Issues will pop up that cause the schedule to slip. System block diagrams High-level algorithms (if any) Concerns Implementation Conceptual Research papers related to your project idea

318 Common Questions

319 Common Questions

320 Overview Present Project Ideas

321 Projects

322 Next Lecture Fixed Point Math and Floating Point Math

323 Questions/Comments/Concerns
Write down: the main point of the lecture, and one thing that's still not quite clear; OR, if everything is clear, give an example of how to apply something from the lecture.

324 Lecture Notes

325 Instructor: Dr. Phillip Jones
CPRE 583 Reconfigurable Computing Lecture 18: Fri 10/27/2010 (Floating Point) Instructor: Dr. Phillip Jones Reconfigurable Computing Laboratory Iowa State University Ames, Iowa, USA

326 Announcements/Reminders
Midterm: Take home portion (40%) given Friday 10/29 (released today by 5pm), due Tue 11/2 (midnight) In class portion (60%) Wed 11/3 Distance students will have in class portion given via a timed WebCT (2 hour) session (take on Wed, Thur or Friday). Problem 2 of HW 2 (released soon)

327 Projects Ideas: Relevant conferences
FPL FPT FCCM FPGA DAC ICCAD Reconfig RTSS RTAS ISCA Micro Super Computing HPCA IPDPS

328 Common Questions

329 Common Questions

330 Overview Floating Point on FPGAs (Chapter 21.4 and 31)
Why is it viewed as difficult?? Options for mitigating issues

331 Floating Point Format (IEEE-754)
Single Precision: S, exp, Mantissa (1, 8, 23 bits) Mantissa = 0.b(-1) b(-2) b(-3) … b(-23) = ∑ for i = 1..23 of b(-i) * 2^(-i) Floating point value = (-1)^S * 2^(exp - 127) * (1.Mantissa) Example: 0 x”80” “110” x”00000” = (-1)^0 * 2^(128 - 127) * 1.(1/2 + 1/4) = (-1)^0 * 2^1 * 1.75 = 3.5 Double Precision: S, exp, Mantissa (1, 11, 52 bits) Floating point value = (-1)^S * 2^(exp - 1023) * (1.Mantissa)
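A small C check of the formula (a sketch; it handles normalized numbers only and ignores the special values discussed later):

    /* Illustrative sketch: decode a single-precision bit pattern by the slide's formula. */
    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>

    static double decode_single(uint32_t bits)
    {
        uint32_t sign     = (bits >> 31) & 0x1u;
        uint32_t exponent = (bits >> 23) & 0xFFu;
        uint32_t mantissa = bits & 0x7FFFFFu;
        /* value = (-1)^S * 2^(exp - 127) * (1.Mantissa) */
        double frac = 1.0 + (double)mantissa / (double)(1u << 23);
        return (sign ? -1.0 : 1.0) * ldexp(frac, (int)exponent - 127);
    }

    int main(void)
    {
        /* Slide example: S = 0, exp = x"80", mantissa bits "110" then zeros */
        uint32_t bits = (0x80u << 23) | 0x600000u;
        printf("%f\n", decode_single(bits));   /* prints 3.500000 */
        return 0;
    }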

332 Fixed Point Whole.Fractional = b(W-1) … b1 b0 . b(-1) b(-2) … b(-F)
Example formats (W.F): 5.5, 10.12, 3.7 Example in fixed point 5.5 format: a pattern like 00010.01100 = 2 + 1/4 + 1/8 = 2.375 Compare floating point and fixed point Floating point: 0 x”80” “110” x”00000” = 3.5 10-bit (Format 3.7) Fixed Point for 3.5 = ?
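One way to answer the slide's 3.7 question (a worked sketch): a value v in W.F fixed point is stored as v * 2^F, so 3.5 in 3.7 format is 3.5 * 128 = 448, i.e. the 10-bit pattern 011.1000000 (whole bits 011 = 3, fractional bits .1000000 = 1/2).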

333 Fixed Point (Addition)
Whole Fractional Operand 1 Whole Fractional Operand 2 + Whole Fractional sum

334 Fixed Point (Addition)
11-bit 4.7 format Operand 1 = 3.875 + Operand 2 = 1.625 sum = 5.5 You can use a standard ripple-carry adder!
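A C sketch of the same idea (the helper names are made up): store the 4.7 value in an ordinary integer scaled by 2^7, and the addition is plain integer addition, exactly what a ripple-carry adder does in hardware:

    /* Illustrative sketch of 4.7 fixed-point addition. */
    #include <stdint.h>
    #include <stdio.h>

    #define FRAC_BITS 7   /* 4.7 format: 4 whole bits, 7 fractional bits */

    static int16_t to_fixed(double v)   { return (int16_t)(v * (1 << FRAC_BITS)); }
    static double  to_double(int16_t f) { return (double)f / (1 << FRAC_BITS); }

    int main(void)
    {
        int16_t a = to_fixed(3.875);      /* 0011.1110000 */
        int16_t b = to_fixed(1.625);      /* 0001.1010000 */
        int16_t sum = (int16_t)(a + b);   /* ordinary integer add */
        printf("%f\n", to_double(sum));   /* prints 5.500000 */
        return 0;
    }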

335 Floating Point (Addition)
0 x”80” x”80000” Operand 1 = 3.875 0 x”7F” x”00000” Operand 2 = 1.625 +

336 Floating Point (Addition)
0 x”80” x”80000” Operand 1 = 3.875 0 x”7F” x”00000” Operand 2 = 1.625 + Common exponent (i.e. align binary point) Make x”80” -> x”7F” or vice versa?

337 Floating Point (Addition)
0 x”80” x”80000” Operand 1 = 3.875 0 x”7F” x”00000” Operand 2 = 1.625 + Common exponent (i.e. align binary point) Make x”7F”->x”80”, lose least significant bits of Operand 2 - Add the difference of x”80” – x“7F” = 1 to x”7F” - Shift mantissa of Operand 2 by difference to the right. remember “implicit” 1 of the original mantissa 0 x”80” x”80000” Operand 1 = 3.875 0 x”80” x”80000” Operand 2 = 1.625 +

338 Floating Point (Addition)
0 x”80” x”80000” Operand 1 = 3.875 0 x”7F” x”00000” Operand 2 = 1.625 + Add mantissas 0 x”80” x”80000” Operand 1 = 3.875 0 x”80” x”80000” Operand 2 = 1.625 +

339 Floating Point (Addition)
0 x”80” x”80000” Operand 1 = 3.875 0 x”7F” x”00000” Operand 2 = 1.625 + Add mantissas 0 x”80” x”80000” Operand 1 = 3.875 0 x”80” x”80000” Operand 2 = 1.625 + Overflow! x”00000”

340 Floating Point (Addition)
0 x”80” x”80000” Operand 1 = 3.875 0 x”7F” x”00000” Operand 2 = 1.625 + Add mantissas You can’t just overflow mantissa into exponent field You are actually overflowing the implicit “1” of Operand 1, so you sort of have an implicit “2” (i.e. “10”). 0 x”80” x”80000” Operand 1 = 3.875 0 x”80” x”80000” Operand 2 = 1.625 + Overflow! x”00000”

341 Floating Point (Addition)
0 x”80” x”80000” Operand 1 = 3.875 0 x”7F” x”00000” Operand 2 = 1.625 + Add mantissas Deal with overflow of Mantissa by normalizing. Shift mantissa right by 1 (shift a “0” in because of implicit “2”) Increment exponent by 1 0 x”80” x”80000” Operand 1 = 3.875 0 x”80” x”80000” Operand 2 = 1.625 + 0 x”81” x”00000”

342 Floating Point (Addition)
0 x”80” x”80000” Operand 1 = 3.875 0 x”7F” x”00000” Operand 2 = 1.625 + 0 x”81” x”00000” = 5.5 Add mantissas Deal with overflow of Mantissa by normalizing. Shift mantissa right by 1 (shift a “0” in because of implicit “2”) Increment exponent by 1 0 x”80” x”80000” Operand 1 = 3.875 0 x”80” x”80000” Operand 2 = 1.625 + 0 x”81” x”00000”
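To summarize the walkthrough, a hedged Python sketch (added here) of the align, add, and normalize steps for two positive single-precision operands; it ignores rounding, special values, and subtraction.

def fp_add_same_sign(exp_a, mant_a, exp_b, mant_b):
    """Add two positive IEEE-754 singles given as (biased exponent, 23-bit mantissa field).
    Returns (biased exponent, 23-bit mantissa field). No rounding or special cases."""
    # Restore the implicit leading 1 to get 24-bit significands.
    sig_a = (1 << 23) | mant_a
    sig_b = (1 << 23) | mant_b
    # Align binary points: shift the smaller-exponent operand to the right.
    if exp_a < exp_b:
        exp_a, sig_a, exp_b, sig_b = exp_b, sig_b, exp_a, sig_a
    sig_b >>= (exp_a - exp_b)          # least-significant bits of the smaller operand are lost
    # Add significands; the result may overflow into a 25th bit (the "implicit 2").
    sig = sig_a + sig_b
    exp = exp_a
    if sig >> 24:                      # normalize: shift right by 1, increment exponent
        sig >>= 1
        exp += 1
    return exp, sig & 0x7FFFFF

# 3.875 = exp x"80", mantissa "1111" + zeros;  1.625 = exp x"7F", mantissa "101" + zeros
exp, mant = fp_add_same_sign(0x80, 0b111_1000_0000_0000_0000_0000,
                             0x7F, 0b101_0000_0000_0000_0000_0000)
print(hex(exp), hex(mant))             # 0x81 0x300000  -> 1.375 * 2^2 = 5.5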

343 Floating Point (Addition): Other concerns
Special Value | Sign | Exponent | Mantissa
Zero | 0/1 | 0 | 0
Infinity | 0 | MAX | 0
-Infinity | 1 | MAX | 0
NaN | 0/1 | MAX | Non-zero
Denormal | 0/1 | 0 | nonzero
Single Precision: S (1 bit), exp (8 bits), Mantissa (23 bits)

344 Floating Point (Addition): High-level Hardware
[Block diagram: exponent difference/compare selects the larger operand, the mantissas M0 and M1 are swapped as needed and the smaller is right-shifted by the exponent difference to align, then add/sub; a priority encoder drives a left shift to normalize, followed by rounding and denormal handling, producing the result exponent E and mantissa M.]

345 Floating Point Both Xilinx and Altera supply floating point soft-cores (which I believe are IEEE-754 compliant), so don’t be too afraid if you need floating point in your class projects. Also, there should be floating point open cores that are freely available.

346 Fixed Point vs. Floating Point
Floating Point advantages: Application designer does not have to think “much” about the math. Floating point format supports a wide range of numbers (+/- 3x10^38 to +/- 1x10^-38, single precision). If IEEE-754 compliant, then easier to accelerate existing floating point based applications. Floating Point disadvantages: Ease of use at great hardware expense: a 32-bit fixed point add (~32 DFFs + 32 LUTs) vs. a 32-bit single precision floating point add (~250 DFFs + LUTs), about 10x more resources, thus 1/10 the possible best case parallelism. Floating point typically needs a massive pipeline to achieve high clock rates (i.e. high throughput). No hard resources such as the carry-chain to take advantage of

347 Fixed Point vs. Floating Point
Range example: Floating Point vs. Fixed Point advantages: Some exception with respect to precision

348 Mitigating Floating Point Disadvantages
Only support a subset of the IEEE-754 standard; could use software to off-load special cases. Modify the floating point format to support a smaller data type (e.g. 18-bit instead of 32-bit). Link to Cornell class: Add hardware support in the FPGA for floating point: hard multipliers added by companies in the early 2000s; Altera: hard shared paths for floating point (Stratix-V, 2011). How to get 1-TFLOP throughput on FPGAs article: achieve-1-trillion-floating-point-operations-per-second-in-an-FPGA

349 Mitigating Fixed Point Disadvantages (21.4)
Block Floating Point (mitigating range issue)
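For illustration (a sketch of one simple shared-exponent scheme, not necessarily the exact scheme in chapter 21.4), block floating point stores a single exponent per block of fixed-point mantissas, recovering much of floating point's range at near fixed-point cost.

import math

def block_encode(values, mant_bits=16):
    """Share one exponent across a block: pick the smallest exponent such that the
    largest magnitude fits in a signed mant_bits mantissa, then store integer mantissas.
    (Exponent is clamped at 0 for simplicity.)"""
    peak = max(abs(v) for v in values)
    limit = 2 ** (mant_bits - 1) - 1
    exp = math.ceil(math.log2(peak / limit)) if peak > limit else 0
    mants = [int(round(v / 2 ** exp)) for v in values]
    return exp, mants

def block_decode(exp, mants):
    return [m * 2 ** exp for m in mants]

exp, mants = block_encode([1200.0, -35000.0, 0.5, 800.0])
print(exp, mants)                 # shared exponent 1, mantissas [600, -17500, 0, 400]
print(block_decode(exp, mants))   # [1200, -35000, 0, 800]  (the 0.5 is lost to quantization)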

350 CPU/FPGA/GPU reported FLOPs
Block Floating Point (mitigating range issue)

351 Next Lecture Mid-term Then on Friday: Evolvable Hardware

352 Questions/Comments/Concerns
Write down Main point of lecture One thing that’s still not quite clear If everything is clear, then give an example of how to apply something from lecture OR

353 Lecture Notes Altera App Notes on computing FLOPs for Stratix-III
Altera old app Notes on floating point add/mult

354 Instructor: Dr. Phillip Jones
CPRE 583 Reconfigurable Computing Lecture 19: Fri 11/5/2010 (Evolvable Hardware) Instructor: Dr. Phillip Jones Reconfigurable Computing Laboratory Iowa State University Ames, Iowa, USA

355 Announcements/Reminders
MP3: Extended due date until Monday midnight. Those that finish by Friday (11/5) midnight get a bonus of +1% per day before the new deadline. If after Friday midnight, no bonus but no penalty. 10% deduction after Monday midnight, and an additional -10% each day late. Problem 2 of HW 2 (will now be called HW3): released by Sunday midnight, will be due Monday 11/22 midnight. Turn in weekly project report (tonight midnight)

356 Projects Ideas: Relevant conferences
FPL FPT FCCM FPGA DAC ICCAD Reconfig RTSS RTAS ISCA Micro Super Computing HPCA IPDPS

357 What you should learn Understand Evolvable Hardware basics?
Benefits and Drawbacks Key types/categories

358 Evolvable Hardware One of the first papers to compare reconfigurable HW with biological organisms (1993) “Evolvable Hardware with Genetic Learning: A first step towards building a Darwin Machine”, Higuchi Biological organism => DNA GATACAAAGATACACCAGATA Reconfigurable Hardware => Configuration bitstream

359 Evolvable Hardware One of the first papers to compare reconfigurable HW with biological organisms (1993) “Evolvable Hardware with Genetic Learning: A first step towards building a Darwin Machine”, Higuchi Biological organism => DNA GATACAAAGATACACCAGATA Reconfigurable Hardware => Configuration bitstream GATACA

360 Evolvable Hardware One of the first papers to compare reconfigurable HW with biological organisms (1993) “Evolvable Hardware with Genetic Learning: A first step towards building a Darwin Machine”, Higuchi Biological organism => DNA GATACAAAGATACACCAGATA Reconfigurable Hardware => Configuration bitstream GATACA GATAGA

361 Evolvable Hardware One of the first papers to compare reconfigurable HW with biological organisms (1993) “Evolvable Hardware with Genetic Learning: A first step towards building a Darwin Machine”, Higuchi Biological organism => DNA GATACAAAGATACACCAGATA Reconfigurable Hardware => Configuration bitstream GATACA GATAGA

362 Evolvable Hardware One of the first papers to compare reconfigurable HW with biological organisms (1993) “Evolvable Hardware with Genetic Learning: A first step towards building a Darwin Machine”, Higuchi Biological organism => DNA GATACAAAGATACACCAGATA Reconfigurable Hardware => Configuration bitstream

363 Evolvable Hardware One of the first papers to compare reconfigurable HW with biological organisms (1993) “Evolvable Hardware with Genetic Learning: A first step towards building a Darwin Machine”, Higuchi Biological organism => DNA GATACAAAGATACACCAGATA Reconfigurable Hardware => Configuration bitstream

364 Evolvable Hardware One of the first papers to compare reconfigurable HW with biological organisms (1993) “Evolvable Hardware with Genetic Learning: A first step towards building a Darwin Machine”, Higuchi Biological organism => DNA GATACAAAGATACACCAGATA Reconfigurable Hardware => Configuration bitstream DFF DFF

365 Classifying Adaptation/Evolution
Phylogeny, Ontogeny, Epigenesis (POE). Phylogeny: evolution through recombination and mutations (biological reproduction : Genetic Algorithms). Ontogeny: self replication (multicellular organism's cell division : Cellular Automata). Epigenesis: adaptation triggered by the external environment (immune system development : Artificial Neural Networks)

366 Classifying Adaptation/Evolution
Phylogeny, Ontogeny, Epigenesis (POE). Phylogeny: evolution through recombination and mutations (biological reproduction : Genetic Algorithms). Ontogeny: self replication (multicellular organism's cell division : Cellular Automata). Epigenesis: adaptation triggered by the external environment (immune system development : Artificial Neural Networks). (POE axes: Phylogeny, Epigenesis, Ontogeny)

367 Artificial Evolution 30/40 year old concept. But applying to reconfigurable hardware is newish (1990’s) Evolutionary Algorithms (EAs) Genetic Algorithms Genetic Programming Evolution Strategies Evolutionary programming

368 Artificial Evolution 30/40 year old concept. But applying to reconfigurable hardware is newish (1990’s) Evolutionary Algorithms (EAs) Genetic Algorithms Genetic Programming Evolution Strategies Evolutionary programming

369 Artificial Evolution 30/40 year old concept. But applying to reconfigurable hardware is newish (1990’s) Evolutionary Algorithms (EAs) Genetic Algorithms Genetic Programming Evolution Strategies Evolutionary programming

370 Genetic Algorithms Genome: a finite sting of symbols encoding an individual Phenotype: The decoding of the genome to realize the individual Constant Size population Generic steps Initial population Decode Evaluate (must define a fitness function) Selection Mutation Cross over

371 Initialize Population
Genetic Algorithms Initialize Population Evaluate Decode Next Generation Selection Cross Over Mutation

372 Initialize Population
Genetic Algorithms Initialize Population Evaluate Decode ( ) Next Generation Selection Cross Over Mutation

373 Initialize Population
Genetic Algorithms Initialize Population (.40) (.70) (.20) (.10) (.10) (.60) Evaluate Decode ( ) Next Generation Selection Cross Over Mutation

374 Initialize Population
Genetic Algorithms Initialize Population (.40) (.70) (.20) (.10) (.10) (.60) Evaluate Decode ( ) Next Generation Selection Cross Over (.40) (.70) (.60) Mutation

375 Initialize Population
Genetic Algorithms Initialize Population (.40) (.70) (.20) (.10) (.10) (.60) Evaluate Decode ( ) Next Generation Selection Cross Over (.40) (.70) (.60) Mutation

376 Initialize Population
Genetic Algorithms Initialize Population (.40) (.70) (.20) (.10) (.10) (.60) Evaluate Decode ( ) Next Generation Selection Cross Over (.40) (.70) (.60) Mutation

377 Initialize Population
Genetic Algorithms Initialize Population (.40) (.70) (.20) (.10) (.10) (.60) Evaluate Decode ( ) Next Generation Selection Cross Over (.40) (.70) (.60) Mutation

378 Evolvable Hardware Platform

379 Genetic Algorithms GA are a type of guided search
Why use a guided search? Why not just do an exhaustive search?

380 Genetic Algorithms GA are a type of guided search
Why use a guided search? Why not just do an exhaustive search? Assume 1 billion individuals can be evaluated per second. The genome of an individual is 32 bits in size. How long to do an exhaustive search?

381 Genetic Algorithms GA are a type of guided search
Why use a guided search? Why not just do an exhaustive search? Assume 1 billion individuals can be evaluated per second. Now the genome of an individual is an FPGA configuration 1,000,000 bits in size. How long to do an exhaustive search?
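A quick back-of-the-envelope check (added, assuming the 10^9 evaluations per second rate above): the 32-bit case is tractable, the 1,000,000-bit case is not.

from math import log10

RATE = 1e9                      # evaluations per second

# 32-bit genome: 2^32 candidates
print(2**32 / RATE, "seconds")  # ~4.3 seconds

# 1,000,000-bit genome: 2^1,000,000 candidates
digits = 1_000_000 * log10(2)   # ~301,030 decimal digits
print("about 10^%.0f seconds" % (digits - 9))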

382 Evolvable Hardware Taxonomy
Extrinsic Evolution (furthest from biology) Evolution done in SW, then result realized in HW Intrinsic Evolution HW is used to deploy individuals Results are sent back to SW for fitness calculation Complete Evolution Evolution is completely done on target HW device Open-ended Evolution (closest to biology) Evaluation criteria changes dynamically Phylogeny Epigenesis Ontogeny

383 Evolvable Hardware Applications
Prosthetic hand controller chip. Kajitani, “An Evolvable Hardware Chip for Prosthetic Hand Controller”, 1999

384 Evolvable Hardware Applications
Tone Discrimination and Frequency generation Adrian Thompson “Silicon Evolution”, 1996 Xilinx XC6200

385 Evolvable Hardware Applications
Tone Discrimination and Frequency generation Node Functions Node Genotype

386 Evolvable Hardware Applications
Tone Discrimination and Frequency generation Evolved 4KHz oscillator

387 Evolvable Hardware Issues?

388 Evolvable Hardware Issues?

389 Evolvable Hardware Platforms
Commercial Platforms: Xilinx XC6200 (completely multiplexer based, so random bitstreams could be programmed dynamically without damaging the chip), Xilinx Virtex FPGA. Custom Platforms: POEtic cell, Evolvable LSI chip (Higuchi)

390 Next Lecture Overview the synthesis process

391 Notes Notes

392 Adaptive Thermoregulation for Applications on Reconfigurable Devices
Phillip Jones Applied Research Laboratory Washington University Saint Louis, Missouri, USA Iowa State University Seminar April 2008 Funded by NSF Grant ITR

393 What are FPGAs? FPGA: Field Programmable Gate Array
Sea of general purpose logic gates CLB Configurable Logic Block

394 What are FPGAs? FPGA: Field Programmable Gate Array
Sea of general purpose logic gates CLB CLB CLB Configurable Logic Block CLB CLB CLB CLB CLB CLB CLB CLB CLB CLB CLB CLB

395 What are FPGAs? FPGA: Field Programmable Gate Array
Sea of general purpose logic gates CLB CLB Configurable Logic Block CLB CLB CLB CLB CLB CLB

396 FPGA Usage Models Partial Reconfiguration Fast Prototyping System on
Experimental ISA Experimental Micro Architectures Run-time adaptation Run-time Customization CPU + Specialized HW - Sparc-V8 Leon Partial Reconfiguration Fast Prototyping System on Chip (SoC) Parallel Applications Full Reconfiguration Image Processing Computational Biology Remote Update Fault Tolerance

397 Some FPGA Details CLB CLB CLB CLB

398 Some FPGA Details: the CLB's 4-input Look Up Table (LUT). Inputs A, B, C, D address a 16-entry truth table (rows 0000 through 1111) whose stored bits determine the output Z.

399 Some FPGA Details: the 4-input LUT programmed as a 4-input AND gate. The truth-table entry for ABCD = 1111 is 1 and all other entries are 0, so Z = A AND B AND C AND D.

400 Some FPGA Details: the same 4-input LUT programmed as a 4-input OR gate. The truth-table entry for ABCD = 0000 is 0 and all other entries are 1, so Z = A OR B OR C OR D.

401 Some FPGA Details: the 4-input LUT programmed as a 2:1 mux. One input acts as the select line choosing between two of the other inputs (the truth-table rows marked X are don't-cares).

402 Some FPGA Details CLB CLB CLB Z A LUT B C D

403 Some FPGA Details CLB CLB PIP Programmable Interconnection Point CLB Z
LUT DFF B C D

404 Some FPGA Details CLB CLB PIP Programmable Interconnection Point CLB Z
LUT DFF B C D

405 Outline Why Thermal Management? Measuring Temperature
Thermally Driven Adaptation Experimental Results Temperature-Safe Real-time Systems Future Directions

406 Why Thermal Management?

407 Why Thermal Management?
Location? Hot Cold Regulated

408 Why Thermal Management?
Mobile? Hot Cold Regulated

409 Why Thermal Management?
Reconfigurability FPGA Plasma Physics Microcontroller

410 Why Thermal Management?
Exceptional Events

411 Why Thermal Management?
Exceptional Events

412 Local Experience Thermally aggressive application
Disruption of air flow

413 Damaged Board (bottom view)
Thermally aggressive application Disruption of air flow

414 Damaged Board (side view)
Thermally aggressive application Disruption of air flow

415 Response to catastrophic thermal events
Easy Fix Not Feasible!! Very Inconvenient

416 Solutions Over provision Use thermal feedback
Large heat sinks and fans. Restrict performance: limiting operating frequency, limiting the amount of chip utilization. Use thermal feedback (my approach): dynamic operating frequency, adaptive computation, shutdown device

417 Outline Why Thermal Management? Measuring Temperature
Thermally Driven Adaptation Experimental Results Temperature-Safe Real-time Systems Future Directions

418 Measuring Temperature
FPGA

419 Measuring Temperature
FPGA A/D 60 C

420 Background: Measuring Temperature
FPGA ring-oscillator temperature measurement. S. Lopez-Buedo, J. Garrido, and E. Boemo, “Thermal testing on reconfigurable computers,” IEEE Design and Test of Computers, vol. 17, 2000. [Figure: a ring oscillator inside the FPGA; its oscillation period varies with temperature.]

421 Background: Measuring Temperature
FPGA Temperature 1. .0 1. .1 0. 1. .0 0. 1. Period

422 Background: Measuring Temperature
FPGA Temperature 1. .0 1. .1 0. 1. .0 0. 1. Period

423 Background: Measuring Temperature
FPGA. S. Lopez-Buedo, J. Garrido, and E. Boemo, “Thermal testing on reconfigurable computers,” IEEE Design and Test of Computers, vol. 17, 2000. [Figure: ring-oscillator period as a function of temperature, with supply voltage as an additional factor.]

424 Background: Measuring Temperature
FPGA Temperature 1. .1 .0 Period Voltage

425 Background: Measuring Temperature
FPGA “Adaptive Thermoregulation for Applications on Reconfigurable Devices”, by Phillip H. Jones, James Moscola, Young H. Cho, and John W. Lockwood; Field Programmable Logic and Applications (FPL’07), Amsterdam, Netherlands Temperature 1. .1 .0 Period Voltage

426 Background: Measuring Temperature
FPGA Mode 1 Core 1 Core 2 Temperature Core 3 Core 4 Period Frequency: High

427 Background: Measuring Temperature
FPGA Mode 1 Mode 2 Core 1 Core 2 Temperature Core 3 Core 4 Period Frequency: High

428 Background: Measuring Temperature
FPGA Mode 3 Mode 1 Mode 2 Core 1 Core 2 70C Temperature 40C Core 3 Core 4 Period 8,000 8,300 Frequency: Low Frequency: High

429 Background: Measuring Temperature
FPGA Mode 3 Mode 1 Mode 2 Pause Sample Controller Core 1 Core 2 Temperature Core 3 Core 4 Period Frequency: High

430 Background: Measuring Temperature
FPGA Mode 3 Mode 1 Mode 2 Pause Time out Counter Core 1 Core 2 Temperature Core 3 Core 4 Period Frequency: High

431 Background: Measuring Temperature
FPGA Mode 3 Mode 1 Mode 2 Pause Time out Counter 2 5 3 1 4 5 2 3 1 Core 1 Core 2 Temperature Core 3 Core 4 Period Frequency: Low Frequency: High

432 Background: Measuring Temperature
FPGA Mode 3 Mode 1 Mode 2 Pause Time out Counter 3 2 5 1 4 5 3 1 2 3 Core 1 Core 2 Temperature Core 3 Core 4 Period Frequency: Low Frequency: High

433 Background: Measuring Temperature
FPGA Mode 2 1 3 Sample Mode Pause Time out Counter 2 1 5 4 3 5 2 3 3 1 Core 1 Core 2 Temperature Core 3 Core 4 Period Frequency: High

434 Temperature Benchmark Circuits
Desired Properties: Scalable Work over a wide range of frequencies Can easily increase or decrease circuit size Simple to analyze Regular structure Distributes evenly over chip Help reduce thermal gradients that may cause damage to the chip May serve as standard Further experimentation Repeatability of results “A Thermal Management and Profiling Method for Reconfigurable Hardware Applications”, by Phillip H. Jones, John W. Lockwood, and Young H. Cho; Field Programmable Logic and Applications (FPL’06), Madrid, Spain,

435 Temperature Benchmark Circuits
LUT 00 70 05 75 DFF Core Block (CB): Array of 48 LUTs and 48 DFF

436 Temperature Benchmark Circuits
RLOC: Row, Col 0 , 0 7 , 5 AND 00 70 05 75 DFF Core Block (CB): Array of 48 LUTs and 48 DFF Each LUT configured to be a 4-input AND gate 8 Input Gen Array of 18 core blocks (864 LUTs, 864 DFFs) (1 LUT, 1 DFF) Thermal workload unit: Computation Row CB 0 CB 17 CB 1 CB 16

437 Temperature Benchmark Circuits
RLOC: Row, Col 0 , 0 7 , 5 AND 00 70 05 75 DFF Core Block (CB): Array of 48 LUTs and 48 DFF Each LUT configured to be a 4-input AND gate RLOC_ORIGIN: Row, Col 100% Activation Rate Thermal workload unit: Computation Row 01 Input Gen CB 0 CB 1 CB 16 CB 17 00 1 1 8 8 (1 LUT, 1 DFF) Array of 18 core blocks (864 LUTs, 864 DFFs)

438 Example Circuit Layout (Configuration 1x, 9% LUTs and DFFs)
RLOC_ORIGIN: Row, Col (27,6) Thermal Workload Unit

439 Example Circuit Layout (Configuration 4x, 36% LUTs and DFFs)

440 Observed Temperature vs. Frequency
T ~ P P ~ F*C*V2 Steady-State Temperatures Cfg4x Cfg10x Cfg2x Cfg1x

441 Observed Temperature vs. Active Area
Max rated Tj 85 C T ~ P P ~ F*C*V2 Steady-State Temperatures 200 MHz 100 MHz 50 MHz 25 MHz 10 MHz

442 Projecting Thermal Trajectories
Estimate Steady State Temperature: Tj_ss = Power * θjA + TA, where θjA is the FPGA thermal resistance (ºC/W), about 5.4 ± 0.5. Use measured power at t=0. Specific exponential equation: Temperature(t) = ½*(-41*e^(-t/20) + 71) + ½*(-41*e^(-t/180) + 71)

443 Projecting Thermal Trajectories
Estimate Steady State Temperature: how long until 60 C? Exploit this phase for performance. Tj_ss = Power * θjA + TA, where θjA is the FPGA thermal resistance (ºC/W), about 5.4 ± 0.5. Use measured power at t=0. Specific exponential equation: Temperature(t) = ½*(-41*e^(-t/20) + 71) + ½*(-41*e^(-t/180) + 71)
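A small Python sketch (added) that evaluates the two formulas above and estimates when 60 C is crossed; the power, θjA, and ambient values plugged in are placeholders, not measured numbers from the talk.

import math

def tj_steady_state(power_w, theta_ja, t_ambient):
    """Tj_ss = Power * θjA + TA (θjA in C/W)."""
    return power_w * theta_ja + t_ambient

def temperature(t):
    """Fitted trajectory from the slide (two exponential time constants, t in seconds)."""
    return 0.5 * (-41 * math.exp(-t / 20) + 71) + 0.5 * (-41 * math.exp(-t / 180) + 71)

print(tj_steady_state(power_w=8.5, theta_ja=5.4, t_ambient=25))   # placeholder inputs

# How long until the junction reaches 60 C, per the fitted curve?
t = 0.0
while temperature(t) < 60 and t < 1000:
    t += 1.0
print(round(t), "seconds to reach 60 C")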

444 Thermal Shutdown Max Tj (70C)

445 Outline Why Thermal Management? Measuring Temperature
Thermally Driven Adaptation Experimental Results Temperature-Safe Real-time Systems Future Directions

446 Image Correlation Application
Template

447 Image Correlation Application Virtex-4 100FX Resource Utilization
Heats the FPGA a lot! (> 85 C). Virtex-4 100FX resource utilization: Max Frequency: 200 MHz; Block RAM: 44 (11%); Occupied Slices: 32,868 (77%); D Flip Flops (DFFs): 49,148 (58%); Lookup Tables (LUTs): 57,461 (68%)

448 Application Infrastructure Temperature Sample Controller
Thermoregulation Controller Pause 65 C Application Mode “Adaptive Thermoregulation for Applications on Reconfigurable Devices”, by Phillip H. Jones, James Moscola, Young H. Cho, and John W. Lockwood; Field Programmable Logic and Applications (FPL’07), Amsterdam, Netherlands

449 Application Specific Adaptation
Temperature Sample Controller Thermoregulation Controller Pause 65 C Image Buffer Mode Image Processor Core 1 Mask 1 2 Image Processor Core 3 Image Processor Core 2 Mask 1 2 Image Processor Core 4 Mask 1 2 Mask 2 Mask 1 Score Out

450 Application Specific Adaptation
Temperature Sample Controller Thermoregulation Controller Pause 65 C Frequency Quality Image Buffer 200 Mode MHz 8 Image Processor Core 1 Mask 1 2 Image Processor Core 3 Mask 1 2 Image Processor Core 2 Mask 1 2 Image Processor Core 4 Mask 1 2 Score Out

451 Application Specific Adaptation
Temperature Sample Controller Thermoregulation Controller Pause 65 C Frequency Quality Image Buffer 200 MHz 8 Image Processor Core 1 Mask 1 2 Image Processor Core 2 Mask 1 2 Image Processor Core 3 Image Processor Core 4 Mask 1 2 Mask 2 Mask 1 High Priority Features Low Priority Features Score Out

452 Application Specific Adaptation
Temperature Sample Controller Thermoregulation Controller Pause 65 C Frequency Quality Image Buffer 200 180 150 100 MHz 8 Image Processor Core 1 Mask 1 2 Image Processor Core 2 Mask 1 2 Image Processor Core 3 Image Processor Core 4 Mask 1 2 Mask 2 Mask 1 High Priority Features Low Priority Features Score Out

453 Application Specific Adaptation
Temperature Sample Controller Thermoregulation Controller Pause 65 C Frequency Quality Image Buffer 100 75 50 MHz MHz 8 Image Processor Core 1 Mask 1 2 Image Processor Core 2 Mask 1 2 Image Processor Core 3 Image Processor Core 4 Mask 1 2 Mask 2 Mask 1 High Priority Features Low Priority Features Score Out

454 Application Specific Adaptation
Temperature Sample Controller Thermoregulation Controller Pause 65 C Frequency Quality Image Buffer 50 MHz 6 4 5 7 8 Image Processor Core 1 Mask 1 2 Image Processor Core 2 Mask 1 2 Image Processor Core 3 Image Processor Core 4 Mask 2 Mask 1 Mask 2 Mask 1 Mask 2 Mask 2 High Priority Features Low Priority Features Score Out

455 Application Specific Adaptation
Temperature Sample Controller Thermoregulation Controller Pause 65 C Frequency Quality Image Buffer 75 100 180 150 50 200 MHz MHz 4 7 8 6 5 Image Processor Core 1 Mask 1 2 Image Processor Core 2 Mask 1 2 Image Processor Core 3 Image Processor Core 4 Mask 1 Mask 2 Mask 1 Mask 2 High Priority Features Low Priority Features Score Out

456 Thermally Adaptive Frequency
High Frequency Thermal Budget = 72 C “An Adaptive Frequency Control Method Using Thermal Feedback for Reconfigurable Hardware Applications”, by Phillip H. Jones, Young H. Cho, and John W. Lockwood; Field Programmable Technology (FPT’06), Bangkok, Thailand Junction Temperature, Tj (C) Low Frequency Low Threshold = 67 C Time (s)

457 Thermally Adaptive Frequency
Thermal Budget = 72 C High Frequency Low Frequency Low Threshold = 67 C Junction Temperature, Tj (C) Time (s)

458 Thermally Adaptive Frequency
Thermal Budget = 72 C High Frequency Low Frequency Low Threshold = 67 C Junction Temperature, Tj (C) S. Wang (“Reactive Speed Control”, ECRTS06) Time (s)

459 Outline Why Thermal Management? Measuring Temperature
Thermally Driven Adaptation Experimental Results Temperature-Safe Real-time Systems Future Directions

460 Platform Overview Virtex-4 FPGA Temperature Probe

461 Thermal Budget Efficiency
[Chart: Thermal budget efficiency. Junction temperature under six thermal conditions (40 C, 35 C, 30 C, and 25 C ambient with 0 fans; 25 C with 1 fan; 25 C with 2 fans) for the adaptive design versus a fixed (50 MHz) design processing 4 features. The adaptive design's frequency ranges from 50 MHz up to 200 MHz while staying under its 65 C thermal budget; the fixed design leaves much of the thermal budget unused.]

462 Conclusions Motivated the need for thermal management
Measuring temperature Application dependent voltage variations effects. Temperature benchmark circuits Examined application specific adaptation for improving performance in dynamic thermal environments

463 Outline Why Thermal Management? Measuring Temperature
Thermally Driven Adaptation Experimental Results Temperature-Safe Real-time Systems Future Directions

464 Thermally Constrained Systems
Space Craft Sun Earth

465 Thermally Constrained Systems

466 Temperature-Safe Real-time Systems
Task scheduling is a concern in many embedded systems Goal: Satisfy thermal constraints without violating real-time constraints

467 How to manage temperature?
Static frequency scaling Sleep while idle Time T1 T2 T3 T1 T2 T3 Time

468 How to manage temperature?
Static frequency scaling Sleep while idle Time T1 T2 T3 Too hot? Deadlines could be missed T1 T2 T3 Idle Time

469 How to manage temperature?
Static frequency scaling Sleep while idle Time T1 T2 T3 Deadlines could be missed T1 T2 T3 Idle Idle Idle Time Generalization: Idle task insertion

470 Idle Task Insertion More Powerful
Tasks scheduled at F_max (100 MHz).
a. No idle task inserted:
Period (s): 30, Cost (s): 10.0, Deadline (s): 10.0, Utilization (%): 33.33
Period (s): 120, Cost (s): 30.0, Deadline (s): 120, Utilization (%): 25.00
Period (s): 480, Cost (s): 30.0, Deadline (s): 480, Utilization (%): 6.25
Period (s): 960, Cost (s): 20.0, Deadline (s): 960, Utilization (%): 2.08
Total utilization: 66.66%. Where the deadline equals the cost, frequency cannot be scaled or the task schedule becomes infeasible.
b. 1 idle task inserted (the same tasks plus an idle task with Period 60 s, Cost 20.0 s, Deadline 60 s, Utilization 33.33%): total utilization 99.99%.
Idle task insertion: No impact on tasks’ cost. Higher priority task response times unaffected. Allows control over the distribution of idle time.
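To make the utilization numbers above concrete, a short check (added) that sums cost/period for both task sets.

# (period_s, cost_s) pairs from the tables above
base_tasks = [(30, 10.0), (120, 30.0), (480, 30.0), (960, 20.0)]
idle_task = (60, 20.0)                        # the inserted idle task

def utilization(tasks):
    return sum(cost / period for period, cost in tasks)

print(round(100 * utilization(base_tasks), 2))                # 66.67 (66.66 on the slide)
print(round(100 * utilization(base_tasks + [idle_task]), 2))  # 100.0 (99.99 on the slide)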

471 Sleep when idle is insufficient
Temperature constraint = 65 C Peak Temperature = 70 C

472 Idle-task inserted Temperature constraint = 65 C
Peak Temperature = 61 C

473 Idle-Task Insertion + Deadlines Temperature met? Yes No System
(task set) Idle tasks Scheduler (e.g. RMS) + Deadlines met? Temperature Yes No a. Original schedule does not meet temperature constraints b. Use idle tasks to redistribute device idle time in order to reduce peak device temperature

474 Related Research Power Management Thermal Management
Power Management: EDF with Dynamic Frequency Scaling, Yao (FOCS’95); Worst Case Execution Time, Shin (DAC’99). Thermal Management: EDF, Minimize Temperature, Bansal (FOCS’04); RMS, Reactive Frequency, CIA, Wang (RTSS’06, ECRTS’06)

475 Outline Why Thermal Management? Measuring Temperature
Thermally Driven Adaptation Experimental Results Conclusions Temperature-Safe Real-time Systems Future Directions

476 Research Fronts Near term Longer term
Exploration of adaptation techniques Advanced FPGA reconfiguration capabilities Other frequency adaptation techniques Integration of temperature into real-time systems Longer term Cyber physical systems (NSF initiative)

477 Questions/Comments? Near term Longer term
Exploration of adaptation techniques Advanced FPGA reconfiguration capabilities Other frequency adaptation techniques Integration of temperature into real-time systems Longer term Cyber physical systems (NSF initiative)

478 Temperature per Processing Core
[Chart: junction temperature vs. number of processing cores (1 to 4) for scenarios S1 through S6. Each scenario follows a roughly linear trend; the fitted slopes are smaller for the fan-cooled scenarios S5 and S6 (about 1.43 and 1.22) than for S1 through S4 (about 2.07 to 2.24).]

479 Temperature Sample Mode

480 Ring Oscillator Thermometer Characteristics
Thermometer size: ~100 LUTs. Ring oscillator size: 48 LUTs (47 NOT + 1 OR). Oscillation period: ~40 ns. Incrementer cycle period: ~0.16 ms (40 ns * 4096). Temperature resolution: 0.1 ºC per count, i.e. 0.1 ºC per 20 ns of period change.

481 Application Mode
[Chart: temperature (10 to 90 C) vs. incrementer period (8100 to 8700 counts, 20 ns/count) measured while the application is active, for application modes A, B, and C; counts of 8235, 8425, and 8620 are marked for the three modes.]

482 Virtex-4 100FX Resource Utilization
Application implementation statistics. Virtex-4 100FX resource utilization: Max Frequency: 200 MHz; Block RAM: 44 (11%); Occupied Slices: 32,868 (77%); D Flip Flops (DFFs): 49,148 (58%); Lookup Tables (LUTs): 57,461 (68%). Image correlation characteristics: Image Processing Rate: 40.6 frames per second (at 200 MHz); # of Features: 1-8; Pixel Resolution: 8-bit (grey scale); Image Size: 320x480 pixels

483 VirtexE 2000 Resource Utilization Image Correlation Characteristics
Application implementation statistics 125 MHz 26% (43) 32,868 (15,808) 49,148 (58%) 57,461 (68%) Max Frequency Block RAM Occupied Slices D Flip Flops (DFFs) Lookup Tables (LUTs) VirtexE 2000 Resource Utilization 12.7/second (at 125 MHz) 10 (in parallel) 1 - 4 8-bit (grey scale) 640x480 Image Processing Rate # of Templates # of Mask Patterns Pixel Resolution Image Size (# pixels) Image Correlation Characteristics a.) b.)

484 Scenario Descriptions
Scenarios S1 – S6 (ambient temperature, # of fans): S1: 40 C (104 F), 0 fans; S2: 35 C (95 F), 0 fans; S3: 30 C (86 F), 0 fans; S4: 25 C (77 F), 0 fans; S5: 25 C (77 F), 1 fan; S6: 25 C (77 F), 2 fans

485 High Level Architecture
Application Pause Thermal Manager Frequency & Quality Controller Frequency mode Quality Temperature

486 Periodic Temperature Sampling
Application Pause Thermal Manager 50 ms Event Counter Event Ring Oscillator Based Thermometer ready Sample Mode Controller Temperature Frequency & Quality capture Frequency mode Quality

487 Ring Oscillator Based Thermometer
Reset 12-bit incrementer ring_clk MSB Edge Detect 14-bit Clk DFF reset 14 Temperature sel Ready mux

488 ASIC, GPP, FPGA Comparison
Cost Performance Power Flexibility

489 Frequency Multiplexing Circuit
Frequency Control Clk Multiplier (DLLs) clk clk to global clock tree 2:1 MUX 4xclk BUFG Current Virtex-4 platform uses glitch free BUFGMUX component

490 Thermally Adaptive Frequency
High Frequency Thermal Budget = 72 C Junction Temperature, Tj (C) Low Frequency Low Threshold = 67 C Time (s)

491 Thermally Adaptive Frequency
Thermal Budget = 72 C High Frequency Low Frequency Low Threshold = 67 C Junction Temperature, Tj (C) Time (s)

492 Thermally Adaptive Frequency
Thermal Budget = 72 C High Frequency Low Frequency Low Threshold = 67 C Junction Temperature, Tj (C) Time (s)

493 Worst Case Thermal Condition Thermally Safe Frequency
Thermal Budget = 70 C Thermally Safe Frequency 50 MHz

494 Worst Case Thermal Condition Thermally Safe Frequency
Thermal Budget = 70 C 30/120MHz Adaptive Frequency Thermally Safe Frequency 50 MHz

495 Worst Case Thermal Condition Thermally Safe Frequency
Thermal Budget = 70 C 30/120MHz Adaptive Frequency 48.5 MHz Thermally Safe Frequency 50 MHz

496 Typical Thermal Condition Thermally Safe Frequency
Thermal Budget = 70 C 30/120MHz Adaptive Frequency 48.5 MHz Thermally Safe Frequency 50 MHz

497 Typical Thermal Condition Thermally Safe Frequency
Thermal Budget = 70 C 30/120MHz Adaptive Frequency 95 MHz Adaptive Frequency 48.5 MHz Thermally Safe Frequency 50 MHz

498 Best Case Thermal Condition Thermally Safe Frequency
Thermal Budget = 70 C 30/120MHz Adaptive Frequency 95 MHz Thermally Safe Frequency 50 MHz

499 Best Case Thermal Condition Thermally Safe Frequency
Thermal Budget = 70 C 30/120MHz Adaptive Frequency 95 MHz Adaptive Frequency 119 MHz Thermally Safe Frequency 50 MHz 2.4x Factor Performance Increase

500 Instructor: Dr. Phillip Jones
CPRE 583 Reconfigurable Computing Lecture 21: Fri 11/12/2010 (Synthesis) Instructor: Dr. Phillip Jones Reconfigurable Computing Laboratory Iowa State University Ames, Iowa, USA

501 Announcements/Reminders
HW3: finishing up (hope to release this evening) will be due Fri12/17 midnight. Two lectures left Fri 12/3: Synthesis and Map Wed 12/8: Place and Route Two class sessions for Project Presentations Fri 12/10 Wed 12/15 (??) Take home final given on Wed 12/15 due 12/17 5pm

502 Projects Ideas: Relevant conferences
FPL FPT FCCM FPGA DAC ICCAD Reconfig RTSS RTAS ISCA Micro Super Computing HPCA IPDPS

503 What you should learn Intro to synthesis
Synthesis and Optimization of Digital Circuits De micheli, 1994 (chapter 1)

504 Synthesis (big picture)
Synthesis & Optimization Architectural Logic Boolean Function Min Boolean Relation Min State Min Scheduling Sharing Coloring Covering Satisfiability Graph Theory Boolean Algebra

505 Views of a design Behavioral view Structural view PC = PC +1 Fetch(PC)
Decode(INST) Add Mult Architectural level RAM control S1 S2 Logic level DFF S3

506 Levels of Synthesis Architectural level
Translate the Architectural behavioral view of a design in to a structural (e.g. block level) view Logic Translate the logic behavioral view of a design into a gate level structural view Behavioral view Structural view PC = PC +1 Fetch(PC) Decode(INST) Add Mult Architectural level RAM control S2 S1 Logic level DFF S3

507 Levels of Synthesis Architectural level
Translate the Architectural behavioral view of a design in to a structural (e.g. block level) view Logic Translate the logic behavioral view of a design into a gate level structural view ID Func. Resources Schedule use (control) Inter connect (data path) Behavioral view Structural view PC = PC +1 Fetch(PC) Decode(INST) Add Mult Architectural level RAM control S2 S1 Logic level DFF S3

508 Levels of Synthesis Architectural level
Translate the Architectural behavioral view of a design in to a structural (e.g. block level) view Logic Translate the logic behavioral view of a design into a gate level structural view Behavioral view Structural view PC = PC +1 Fetch(PC) Decode(INST) Add Mult Architectural level RAM control S2 S1 Logic level DFF S3

509 Example: Diffeq Forward Euler method
y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if

510 Example: Diffeq Forward Euler method
y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if * ALU Memory & Steering logic Control Unit

511 Example: Diffeq Forward Euler method
y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if * ALU Memory & Steering logic Control Unit

512 Example: Diffeq Forward Euler method
y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if x <= x1; u <= u1; y <= y1; Control Unit S10 S1 S9 S2 S8 S3 S7 S6 S5 S4 * ALU Control Unit Memory & Steering logic

513 Example: Diffeq Forward Euler method
y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if x <= x1; u <= u1; y <= y1; Control Unit S10 S1 S9 read S2 S8 S3 S7 S6 S5 S4 * ALU Control Unit Memory & Steering logic

514 Example: Diffeq Forward Euler method
y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if x <= x1; u <= u1; y <= y1; Control Unit S10 S1 S9 S2 + S8 S3 S7 S6 S5 S4 * ALU Control Unit Memory & Steering logic

515 Example: Diffeq Forward Euler method
y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if x <= x1; u <= u1; y <= y1; Control Unit S10 S1 S9 S2 S8 * S3 S7 S6 S5 S4 * ALU Control Unit Memory & Steering logic

516 Example: Diffeq Forward Euler method
y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if x <= x1; u <= u1; y <= y1; Control Unit S10 S1 S9 S2 S8 S3 S7 * S6 S5 S4 * ALU Control Unit Memory & Steering logic

517 Example: Diffeq Forward Euler method
y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if x <= x1; u <= u1; y <= y1; Control Unit S10 S1 S9 S2 S8 S3 S7 *, + S6 S5 S4 * ALU Control Unit Memory & Steering logic

518 Example: Diffeq Forward Euler method
y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if x <= x1; u <= u1; y <= y1; Control Unit S10 S1 S9 S2 S8 S3 S7 * S6 S5 S4 * ALU Control Unit Memory & Steering logic

519 Example: Diffeq Forward Euler method
y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if x <= x1; u <= u1; y <= y1; Control Unit S10 S1 S9 S2 S8 * S3 S7 S6 S5 S4 * ALU Control Unit Memory & Steering logic

520 Example: Diffeq Forward Euler method
y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if x <= x1; u <= u1; y <= y1; Control Unit S10 S1 S9 S2 S8 +,* S3 S7 S6 S5 S4 * ALU Control Unit Memory & Steering logic

521 Example: Diffeq Forward Euler method
y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if x <= x1; u <= u1; y <= y1; y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if x <= x1; u <= u1; y <= y1; Control Unit S10 S1 S9 + S2 S8 S3 S7 S6 S5 S4 * ALU Control Unit Memory & Steering logic

522 Example: Diffeq Forward Euler method
y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if x <= x1; u <= u1; y <= y1; y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if x <= x1; u <= u1; y <= y1; Control Unit S10 S1 S9 write S2 S8 S3 S7 S6 S5 S4 * ALU Control Unit Memory & Steering logic

523 Example: Diffeq Forward Euler method
y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if x <= x1; u <= u1; y <= y1; y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if x <= x1; u <= u1; y <= y1; Control Unit DFF DFF DFF DFF * ALU Control Unit Memory & Steering logic

524 Optimization Combinational Metrics: propagation delay, circuit size
Sequential Cycle time Latency Circuit size

525 Optimization Combinational Metrics: propagation delay, circuit size
Sequential Cycle time Latency Circuit size

526 Impact of High-level Synthesis on Optimization
y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if * ALU Memory & Steering logic Control Unit

527 Impact of High-level Synthesis on Optimization
y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if * ALU Memory & Steering logic Control Unit * * * ALU Memory & Steering logic Control Unit

528 Logic-level Synthesis and Optimization
Combinational Two-level optimization Multi-level optimization Sequential State-based models Network models

529 Logic-level Synthesis and Optimization
Combinational Two-level optimization Multi-level optimization Sequential State-based models Network models

530 Logic-level Synthesis and Optimization
Combinational Two-level optimization Multi-level optimization Sequential State-based models Network models Sum of products A’B’C’D’ + A’B’C’D + A’B’CD’ + A’B’CD + A’BCD’

531 Logic-level Synthesis and Optimization
Combinational Two-level optimization Multi-level optimization Sequential State-based models Network models K-map CD Sum of products 00 01 10 11 AB A’B’C’D’ + A’B’C’D + A’B’CD’ + A’B’CD + A’BCD’ 00 1 1 1 1 01 1 10 11

532 Logic-level Synthesis and Optimization
Combinational Two-level optimization Multi-level optimization Sequential State-based models Network models K-map CD Sum of products 00 01 10 11 AB A’B’C’D’ + A’B’C’D + A’B’CD’ + A’B’CD + A’BCD’ 00 1 1 1 1 01 1 10 11

533 Logic-level Synthesis and Optimization
Combinational Two-level optimization Multi-level optimization Sequential State-based models Network models K-map CD Sum of products Sum of products (minimized) 00 01 10 11 AB A’B’C’D’ + A’B’C’D + A’B’CD’ + A’B’CD + A’BCD’ 00 1 1 1 1 A * B + A’*C*D’ 01 1 10 11

534 Logic-level Synthesis and Optimization
Combinational Two-level optimization Multi-level optimization Sequential State-based models Network models Multi-level high-level view A’B’C’D’ + A’B’C’D ’ A = xy + xw B = xw

535 Logic-level Synthesis and Optimization
Combinational Two-level optimization Multi-level optimization Sequential State-based models Network models Multi-level high-level view A’B’C’D’ + A’B’C’D ’ A = xy + xw B = xw (xy + xw)’ (xw)’CD + (xy + xw)’(xw)C’D’

536 Logic-level Synthesis and Optimization
Combinational Two-level optimization Multi-level optimization Sequential State-based models Network models

537 Logic-level Synthesis and Optimization
Combinational Two-level optimization Multi-level optimization Sequential State-based models Network models

538 Logic-level Synthesis and Optimization
Combinational Two-level optimization Multi-level optimization Sequential State-based models Network models

539 Introduction to HW3

540 Introduction to HW3

541 Introduction to HW3

542 Next Lecture MAP

543 Notes Notes

544 Instructor: Dr. Phillip Jones
CPRE 583 Reconfigurable Computing Lecture 22: Fri 11/19/2010 (Coregen Overview) Instructor: Dr. Phillip Jones Reconfigurable Computing Laboratory Iowa State University Ames, Iowa, USA

545 Announcements/Reminders
HW3): released by Saturday midnight, will be due Wed 12/15 midnight. Turn in weekly project report (tonight midnight) Midterms still being graded, sorry for the delay: You can stop by my office after 5pm today to pick up your graded tests 584 Advertisement: Number 1

546 Projects Ideas: Relevant conferences
FPL FPT FCCM FPGA DAC ICCAD Reconfig RTSS RTAS ISCA Micro Super Computing HPCA IPDPS

547 What you should learn Basic of using coregen, in class demo

548 Next Lecture Finish up synthesis process, start MAP

549 Notes Notes

550 Instructor: Dr. Phillip Jones
CPRE 583 Reconfigurable Computing Lecture 22: Fri 12/1/2010 (Class Project Work) Instructor: Dr. Phillip Jones Reconfigurable Computing Laboratory Iowa State University Ames, Iowa, USA

551 Announcements/Reminders
HW3: finishing up (hope to release this evening) will be due Fri12/17 midnight. Two lectures left Fri 12/3: Synthesis and Map Wed 12/8: Place and Route Two class sessions for Project Presentations Fri 12/10 Wed 12/15 (??) Take home final given on Wed 12/15 due 12/17 5pm

552 Projects Ideas: Relevant conferences
FPL FPT FCCM FPGA DAC ICCAD Reconfig RTSS RTAS ISCA Micro Super Computing HPCA IPDPS

553 Next Lecture Finish up synthesis process, MAP

554 Notes Notes

555 Instructor: Dr. Phillip Jones
CPRE 583 Reconfigurable Computing Lecture 24: Wed 12/8/2010 (Map, Place & Route) Instructor: Dr. Phillip Jones Reconfigurable Computing Laboratory Iowa State University Ames, Iowa, USA

556 Announcements/Reminders
HW3: finishing up (hope to release this evening) will be due Fri12/17 midnight. Two lectures left Fri 12/3: Synthesis and Map Wed 12/8: Place and Route Two class sessions for Project Presentations Fri 12/10 Wed 12/15 (9 – 10:30 am) Take home final given on Wed 12/15 due 12/17 5pm

557 Projects Ideas: Relevant conferences
FPL FPT FCCM FPGA DAC ICCAD Reconfig RTSS RTAS ISCA Micro Super Computing HPCA IPDPS

558 Applications on FPGA: Low-level
Implement circuit in VHDL (Verilog) Simulate compiled VHDL Synthesis VHDL into a device independent format Map device independent format to device specific resources Check that device has enough resources for the design Place resources onto physical device locations Route (connect) resources together Completely routed Circuit meets specified performance Download configuration file (bit-steam) to the FPGA

559 Applications on FPGA: Low-level
Implement Simulate Synthesize Map Place Route Download

560 (Technology) Map Translate device independent net list to device specific resources

561 (Technology) Map Translate device independent net list to device specific resources

562 (Technology) Map Translate device independent net list to device specific resources

563 (Technology) Map Translate device independent net list to device specific resources

564 Applications on FPGA: Low-level
Implement Simulate Synthesize Map Place Route Download

565 Place Bind each mapped resource to a physical device location
User Guided Layout (Chapter 16:Reconfigurable Computing) General Purpose (Chapter 14:Reconfigurable Computing) Simulated Annealing Partition-based Structured Guided (Chapter 15:Reconfigurable Computing) Data Path based Heuristics used No efficient means for finding an optimal solution

566 Place (High-level) Netlist from technology mapping in A in B in C RAM
LUT D DFF F DFF G clk out

567 Place (High-level) Netlist from technology mapping
FPGA physical layout I/O I/O I/O I/O in A in B in C I/O LUT BRAM I/O LUT RAM E DFF F I/O I/O LUT D LUT I/O I/O LUT I/O LUT BRAM I/O DFF G LUT I/O I/O clk LUT I/O LUT I/O out I/O I/O I/O I/O

568 Place (High-level) Netlist from technology mapping
FPGA physical layout clk in C out I/O in A in B in C In A LUT G E I/O D F RAM E In B I/O LUT D DFF F LUT I/O I/O LUT I/O LUT BRAM I/O DFF G LUT I/O I/O clk LUT I/O LUT I/O out I/O I/O I/O I/O

569 Place User Guided Layout (Chapter 16:Reconfigurable Computing
General Purpose (Chapter 14:Reconfigurable Computing) Simulated Annealing Partition-based Structured Guided (Chapter 15:Reconfigurable Computing) Data Path based

570 Place (User-Guided) User provide information about applications structure to help guide placement Can help remove critical paths Can greatly reduce amount of time for routing Several methods to guide placement Fixed region Floating region Exact location Relative location

571 Place (User-Guided): Examples
FPGA LUT D DFF F G Part of Map Netlist Fixed region

572 Place (User-Guided): Examples
FPGA LUT D DFF F G Part of Map Netlist Fixed region SDRAM

573 Place (User-Guided): Examples
FPGA Floating region Softcore Processor

574 Place (User-Guided): Examples
FPGA Exact Location LUT D DFF F G Part of Map Netlist LUT BRAM LUT LUT LUT LUT LUT LUT LUT LUT BRAM LUT LUT LUT LUT LUT LUT LUT

575 Place (User-Guided): Examples
FPGA Exact Location LUT D DFF F G Part of Map Netlist LUT BRAM LUT LUT LUT G LUT D F LUT LUT LUT BRAM LUT LUT LUT LUT LUT LUT LUT

576 Place (User-Guided): Examples
FPGA Relative Location LUT D DFF F G Part of Map Netlist LUT BRAM LUT LUT G D F LUT LUT LUT LUT LUT LUT BRAM LUT LUT LUT LUT LUT LUT LUT

577 Place (User-Guided): Examples
FPGA Relative Location LUT D DFF F G Part of Map Netlist LUT BRAM LUT LUT LUT LUT LUT LUT G D F LUT LUT BRAM LUT LUT LUT LUT LUT LUT LUT

578 Place (User-Guided): Examples
FPGA Relative Location LUT D DFF F G Part of Map Netlist LUT BRAM LUT LUT LUT LUT LUT LUT LUT LUT BRAM LUT LUT LUT G D F LUT LUT LUT LUT

579 Place User Guided Layout (Chapter 16:Reconfigurable Computing
General Purpose (Chapter 14:Reconfigurable Computing) Simulated Annealing Partition-based Structured Guided (Chapter 15:Reconfigurable Computing) Data Path based

580 Place (General Purpose)
Characteristics: Places resources without any knowledge of high level structure Guided primarily by local connections between resources Drawback: Does not take explicit advantage of applications structure Advantage: Typically can be used to place any arbitrary circuit

581 Place (General Purpose)
Preprocess the mapped netlist using clustering: group netlist components that have local connectivity into a single logic block. Clustering helps to reduce the number of objects a placement algorithm has to explicitly place.

582 Place (General Purpose)
Placement using simulated annealing Based on the physical process of annealing used to create metal alloys

583 Place (General Purpose)
Simulated annealing basic algorithm:
Placement_cur = Initial_Placement
T = Initial_Temperature
While (not exit criteria 1)
    While (not exit criteria 2)
        Placement_new = Modify_placement(Placement_cur)
        ∆Cost = Cost(Placement_new) – Cost(Placement_cur)
        r = random(0,1)
        If r < e^(-∆Cost / T), Then Placement_cur = Placement_new
    End loop
    T = UpdateTemp(T)
End loop
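A runnable Python sketch (mine, using a made-up wirelength cost and not enforcing one block per placement site) of the loop above: a block is moved to a random site each step, and worse placements are accepted with probability e^(-∆Cost/T).

import math, random

GRID = 8                                          # GRID x GRID placement sites
nets = [(0, 1), (1, 2), (2, 3), (3, 0), (1, 3)]   # made-up block connectivity
blocks = list(range(4))

def cost(place):
    # Total Manhattan wirelength over all nets (a stand-in cost function).
    return sum(abs(place[a][0] - place[b][0]) + abs(place[a][1] - place[b][1])
               for a, b in nets)

place = {b: (random.randrange(GRID), random.randrange(GRID)) for b in blocks}
T = 10.0
while T > 0.01:                              # exit criteria 1: temperature floor
    for _ in range(100):                     # exit criteria 2: moves per temperature
        b = random.choice(blocks)
        new_place = dict(place)
        new_place[b] = (random.randrange(GRID), random.randrange(GRID))
        delta = cost(new_place) - cost(place)
        # Accept improvements always; accept worse moves with probability e^(-delta/T).
        if delta <= 0 or random.random() < math.exp(-delta / T):
            place = new_place
    T *= 0.9                                 # UpdateTemp: geometric cooling schedule
print(cost(place), place)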

584 Place (General Purpose)
Simulated annealing: Illustration FPGA LUT BRAM LUT LUT X A LUT LUT Z B BRAM LUT LUT G D F LUT LUT LUT LUT LUT

585 Place (General Purpose)
Simulated annealing: Illustration FPGA LUT BRAM LUT LUT LUT LUT LUT G LUT Z B BRAM X A LUT LUT F LUT LUT D LUT

586 Place (General Purpose)
Simulated annealing: Illustration FPGA LUT BRAM LUT LUT X A LUT LUT Z B BRAM LUT LUT G D F LUT LUT LUT LUT LUT

587 Place (General Purpose)
Simulated annealing: Illustration FPGA LUT BRAM LUT LUT X LUT LUT A Z B BRAM LUT LUT G D F LUT LUT LUT LUT LUT

588 Place (General Purpose)
Simulated annealing: Illustration FPGA LUT BRAM LUT LUT LUT LUT X A Z B BRAM LUT LUT G D F LUT LUT LUT LUT LUT

589 Place User Guided Layout (Chapter 16:Reconfigurable Computing
General Purpose (Chapter 14:Reconfigurable Computing) Simulated Annealing Partition-based Structured Guided (Chapter 15:Reconfigurable Computing) Data Path based

590 Place (Structured-based)
Leverage the structure of the application. Algorithms may work well for a given structure, but will likely give unacceptable results for a design with little regular structure.

591 Structure high-level example

592 Applications on FPGA: Low-level
Implement Simulate Synthesize Map Place Route Download

593 Route Connect placed resources together Two requirements
Design must be completely routed; routed design meets timing requirements. Widely used algorithm “PathFinder”: McMurchie and Ebeling, PathFinder (FPGA’95); Reconfigurable Computing (Chapter 17), Scott Hauck, André DeHon (2008)

594 Route: Route FPGA Circuit

595 Route (PathFinder) PathFinder: A Negotiation-Based Performance-Driven Router for FPGAs (FPGA’95). Basic PathFinder algorithm: based closely on Dijkstra’s shortest path, but weights are assigned to nodes instead of edges

596 Route (PathFinder): Example
G = (V,E) Vertices V: set of nodes (wires) Edges E: set of switches used to connect wires Cost of using a wire: c_n = (b_n + h_n) * p_n S1 S2 S3 3 2 1 4 1 3 1 A B C 4 1 1 3 1 3 2 D1 D2 D3
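The sketch below (an added illustration on a made-up two-signal graph, not the slides' S/D example) shows the negotiated-congestion idea behind PathFinder: Dijkstra-style search over node costs c_n = (b_n + h_n) * p_n, with the present-sharing term p_n and the history term h_n updated across iterations.

import heapq

def node_cost(n, base, hist, present):
    # PathFinder node cost: c_n = (b_n + h_n) * p_n
    return (base[n] + hist[n]) * present[n]

def dijkstra(graph, src, dst, base, hist, present):
    """Shortest path over node costs (a node's cost is paid when it is entered)."""
    dist, prev, heap = {src: 0.0}, {}, [(0.0, src)]
    while heap:
        d, n = heapq.heappop(heap)
        if n == dst:
            break
        if d > dist.get(n, float('inf')):
            continue
        for m in graph[n]:
            nd = d + node_cost(m, base, hist, present)
            if nd < dist.get(m, float('inf')):
                dist[m], prev[m] = nd, n
                heapq.heappush(heap, (nd, m))
    path, n = [dst], dst
    while n != src:
        n = prev[n]
        path.append(n)
    return path[::-1]

# Toy routing graph: two signals compete for the cheap middle wire 'B'.
graph = {'S1': ['A', 'B'], 'S2': ['B', 'C'], 'A': ['D1'], 'B': ['D1', 'D2'],
         'C': ['D2'], 'D1': [], 'D2': []}
base = {'S1': 0, 'S2': 0, 'A': 3, 'B': 1, 'C': 3, 'D1': 1, 'D2': 1}
signals = [('S1', 'D1'), ('S2', 'D2')]

hist = {n: 0.0 for n in graph}
for iteration in range(5):
    present = {n: 1.0 for n in graph}        # p_n grows as signals share a node
    routes = []
    for src, dst in signals:
        path = dijkstra(graph, src, dst, base, hist, present)
        routes.append(path)
        for n in path[1:]:
            present[n] += 1.0                # later signals see congested nodes as costlier
    shared = [n for n in graph if present[n] > 2.0]
    for n in shared:
        hist[n] += 1.0                       # history cost discourages repeated congestion
    if not shared:
        break                                # routed with no sharing
print(routes)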

597 Route (PathFinder): Example
Simple node cost cn = bn Obstacle avoidance Note order matters S1 S2 S3 3 2 1 4 1 3 1 A B C 4 1 1 3 1 3 2 D1 D2 D3

598 Route (PathFinder): Example
cn = b * p p: sharing cost (function of number of signals sharing a resource) Congestion avoidance S1 S2 S3 3 2 1 4 1 3 1 A B C 4 1 1 3 1 3 2 D1 D2 D3

599 Route (PathFinder): Example
cn = (b + h) * p h: history of previous iteration sharing cost Congestion avoidance S1 S2 S3 2 1 2 1 1 A B C 1 2 2 1 1 D1 D2 D3

600 Route (PathFinder): Example
cn = (b + h) * p h: history of previous iteration sharing cost Congestion avoidance S1 S2 S3 2 1 2 1 1 A B C 1 2 2 1 1 D1 D2 D3

601 Route (PathFinder): Example
cn = (b + h) * p h: history of previous iteration sharing cost Congestion avoidance S1 S2 S3 2 1 2 1 1 A B C 1 2 2 1 1 D1 D2 D3

602 Route (PathFinder): Example
cn = (b + h) * p h: history of previous iteration sharing cost Congestion avoidance S1 S2 S3 2 1 2 1 1 A B C 1 2 2 1 1 D1 D2 D3

603 Route (PathFinder): Example
cn = (b + h) * p h: history of previous iteration sharing cost Congestion avoidance S1 S2 S3 2 1 2 1 1 A B C 1 2 2 1 1 D1 D2 D3

604 Route (PathFinder): Example
cn = (b + h) * p h: history of previous iteration sharing cost Congestion avoidance S1 S2 S3 2 1 2 1 1 A B C 1 2 2 1 1 D1 D2 D3

605 Route (PathFinder): Example
cn = (b + h) * p h: history of previous iteration sharing cost Congestion avoidance S1 S2 S3 2 1 2 1 1 A B C 1 2 2 1 1 D1 D2 D3

606 Applications on FPGA: Low-level
Implement Simulate Synthesize Map Place Route Download

607 Download Convert routed design into a device configuration file (e.g. bitfile for Xilinx devices)

608 Next Lecture Project presentations

609 Questions/Comments/Concerns
Write down Main point of lecture One thing that’s still not quite clear If everything is clear, then give an example of how to apply something from lecture OR

610 Place (Structured-based)
Leverage the structure of the application. Algorithms may work well for a given structure, but will likely give unacceptable results for a design with little regular structure. GLACE “A Generic Library for Adaptive Computing Environments” (FPL 2001) is an example tool that takes the structure of an application into account. FLAME (Flexible API for Module-based Environments), JHDL (from BYU), Gen (from Lockheed-Martin Advanced Technology Laboratories)

611 GLACE: High-level

612 GLACE: Flow

613 GLACE: Library Modules

614 GLACE: Data Path and Control Path

615 GLACE: FLAME low-level

616 GLACE: Final placement example

