Reconfigurable Computing (High-level Acceleration Approaches)


1 Reconfigurable Computing (High-level Acceleration Approaches)
Dr. Phillip Jones, Scott Hauck Reconfigurable Computing Laboratory Iowa State University Ames, Iowa, USA

2 Projects Ideas: Relevant conferences
FPL FPT FCCM FPGA DAC ICCAD Reconfig RTSS RTAS ISCA Micro Super Computing HPCA IPDPS

3 Initial Project Proposal Slides (5-10 slides)
Project team list: Name, Responsibility (who is the project leader) Project idea Motivation (why is this interesting, useful) What will be the end result High-level picture of final product High-level Plan Break project into milestones Provide initial schedule: I would initially schedule aggressively to have the project complete by Thanksgiving. Issues will pop up that cause the schedule to slip. System block diagrams High-level algorithms (if any) Concerns Implementation Conceptual Research papers related to your project idea

4 Projects: Target Timeline
Teams Formed and Idea: Mon 10/11 Project idea in Power Point 3-5 slides Motivation (why is this interesting, useful) What will be the end result High-level picture of final product Project team list: Name, Responsibility High-level Plan/Proposal: Wed 10/20 Power Point 5-10 slides System block diagrams High-level algorithms (if any) Concerns Implementation Conceptual Related research papers (if any)

5 Common Questions

6 Overview First 15 minutes of Google FPGA lecture How to run Gprof
Discuss some high-level approaches for accelerating applications.

7 What you should learn Start to get a feel for approaches for accelerating applications.

8 Why use Customized Hardware?
Great talk about the benefits of Heterogeneous Computing

9 Profiling Applications
Finding bottlenecks Profiling tools: gprof, Valgrind
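For reference, a minimal gprof recipe (the program name myapp is made up for the example):

    gcc -pg -O2 myapp.c -o myapp           # build with profiling instrumentation
    ./myapp                                # run normally; writes gmon.out in the current directory
    gprof ./myapp gmon.out > profile.txt   # flat profile and call graph

With Valgrind, `valgrind --tool=callgrind ./myapp` produces a callgrind.out.<pid> file that can be inspected with callgrind_annotate or KCachegrind.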

10 Pipelining How many ns to process 100 input vectors? Assuming each LUT has a 1 ns delay. Input vector <A,B,C,D> output A 4-LUT 4-LUT 4-LUT 4-LUT B C DFF DFF DFF DFF D How many ns to process 100 input vectors? Assume a 1 ns clock 4-LUT B C D A DFF 1 DFF delay per output
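One way to work the numbers (a sketch; the slide's diagrams define the exact circuits): with no pipeline registers, each input vector must propagate through all four LUT levels, 4 x 1 ns = 4 ns, before the next vector can be applied, so 100 vectors take roughly 400 ns. With a DFF after each LUT stage and a 1 ns clock, the pipeline has a 4-cycle fill latency and then delivers one result per cycle, so 100 vectors take roughly 4 + 99 = 103 ns.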

11 Pipelining (Systolic Arrays)
Dynamic Programming Start with base case Lower left corner Formula for computing numbering cells 3. Final result in upper right corner.

12 Pipelining (Systolic Arrays)
Dynamic Programming Start with base case Lower left corner Formula for computing numbering cells 3. Final result in upper right corner. 1

13 Pipelining (Systolic Arrays)
Dynamic Programming Start with base case Lower left corner Formula for computing numbering cells 3. Final result in upper right corner. 1 1 1

14 Pipelining (Systolic Arrays)
Dynamic Programming 1 Start with base case Lower left corner Formula for computing numbering cells 3. Final result in upper right corner. 1 2 1 1 1

15 Pipelining (Systolic Arrays)
Dynamic Programming 1 3 Start with base case Lower left corner Formula for computing numbering cells 3. Final result in upper right corner. 1 2 3 1 1 1

16 Pipelining (Systolic Arrays)
Dynamic Programming 1 3 6 Start with base case Lower left corner Formula for computing numbering cells 3. Final result in upper right corner. 1 2 3 1 1 1

17 Pipelining (Systolic Arrays)
Dynamic Programming 1 3 6 Start with base case Lower left corner Formula for computing numbering cells 3. Final result in upper right corner. 1 2 3 1 1 1 How many ns to process if CPU can process one cell per clock (1 ns clock)?

18 Pipelining (Systolic Arrays)
Dynamic Programming 1 3 6 Start with base case Lower left corner Formula for computing numbering cells 3. Final result in upper right corner. 1 2 3 1 1 1 How many ns to process if FPGA can obtain maximum parallelism each clock? (1 ns clock)

19 Pipelining (Systolic Arrays)
Dynamic Programming 1 3 6 Start with base case Lower left corner Formula for computing numbering cells 3. Final result in upper right corner. 1 2 3 1 1 1 What speedup would an FPGA obtain (assuming maximum parallelism) for a 100x100 matrix? (Hint: find a formula for an NxN matrix)
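A sketch of the hinted formula (not from the slides): a CPU filling one cell per 1 ns clock needs N^2 ns for an NxN matrix. If the FPGA can compute every cell on an anti-diagonal in parallel, it needs only one clock per anti-diagonal, and an NxN matrix has 2N - 1 of them, so about 2N - 1 ns. The speedup is therefore roughly N^2 / (2N - 1), about N/2 for large N; for N = 100 that is 10000 / 199, or about 50x.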

20 Dr. James Moscola (Example)
MATL2 D10 ML9 MATP1 IL7 IR8 END3 E12 IL11 ROOT0 MP3 D6 MR5 ML4 S0 IL1 IR2 c g a 1 2 3 ROOT0 MATP1 MATL2 END3 1 2 3 20

21 Example RNA Model 1 2 3 ROOT0 MATP1 MATL2 END3 1 2 3 21 MATL2 MATP1
ML9 MATP1 IL7 IR8 END3 E12 IL11 ROOT0 MP3 D6 MR5 ML4 S0 IL1 IR2 c g a 1 2 3 ROOT0 MATP1 MATL2 END3 1 2 3 21

22 Baseline Architecture Pipeline
END3 MATL2 MATP1 ROOT0 E12 IL11 D10 ML9 IR8 IL7 D6 MR5 ML4 MP3 IR2 IL1 S0 u g g c g a c a c c c residue pipeline 22

23 Processing Elements IL7,3,2 IR8,3,2 ML9,3,2 D10,3,2 ML4 + = + = + = +
1 2 3 .40 -INF .22 .72 .30 .44 1 j  IL7,3,2 2 + ML4_t(7) = 3 IR8,3,2 + ML4_t(8) = ML9,3,2 + ML4_t(9) = D10,3,2 + + ML4,3,3 = .22 ML4_t(10) ML4_e(A) ML4_e(C) ML4_e(G) ML4_e(U) input residue, xi 23

24 Baseline Results for Example Model
Comparison to Infernal software: Infernal run on an Intel Xeon 2.8 GHz; baseline architecture run on a Xilinx Virtex-II 4000 (occupied 88% of logic resources, ran at 100 MHz) Input database of 100 million residues Bulk of time spent on I/O (41.434 s)

25 Expected Speedup on Larger Models
Name Num PEs Pipeline Width Pipeline Depth Latency (ns) HW Processing Time (seconds) Total Time with measured I/O (seconds) Infernal Time (seconds) Infernal Time (QDB) (seconds) Expected Speedup over Infernal Expected Speedup over Infernal (w/QDB) RF00001 39492 195 19500 349492 128443 8236 3027 RF00016 43256 282 28200 336000 188521 7918 4443 RF00034 38772 187 18700 314836 87520 7419 2062 RF00041 44509 206 20600 388156 118692 9147 2797 Example 81 26 6 600 1039 868 25 20 Speedup estimated ... using 100 MHz clock for processing database of 100 Million residues Speedups range from 500x to over 13,000x larger models with more parallelism exhibit greater speedups

26 Distributed Memory ALU Cache BRAM BRAM PE BRAM BRAM

27 Next Class Models of Computation (Design Patterns)

28 Questions/Comments/Concerns
Write down: the main point of the lecture, and one thing that's still not quite clear; OR, if everything is clear, give an example of how to apply something from the lecture.

29 Instructor: Dr. Phillip Jones
CPRE 583 Reconfigurable Computing Lecture 11: Fri 10/1/2010 (Design Patterns) Instructor: Dr. Phillip Jones Reconfigurable Computing Laboratory Iowa State University Ames, Iowa, USA

30 Initial Project Proposal Slides (5-10 slides)
Project team list: Name, Responsibility (who is the project leader) Team size: 3-4 (5 case-by-case) Project idea Motivation (why is this interesting, useful) What will be the end result High-level picture of final product High-level Plan Break project into milestones Provide initial schedule: I would initially schedule aggressively to have the project complete by Thanksgiving. Issues will pop up that cause the schedule to slip. System block diagrams High-level algorithms (if any) Concerns Implementation Conceptual Research papers related to your project idea

31 Projects Ideas: Relevant conferences
FPL FPT FCCM FPGA DAC ICCAD Reconfig RTSS RTAS ISCA Micro Super Computing HPCA IPDPS

32 Initial Project Proposal Slides (5-10 slides)
Project team list: Name, Responsibility (who is the project leader) Project idea Motivation (why is this interesting, useful) What will be the end result High-level picture of final product High-level Plan Break project into milestones Provide initial schedule: I would initially schedule aggressively to have the project complete by Thanksgiving. Issues will pop up that cause the schedule to slip. System block diagrams High-level algorithms (if any) Concerns Implementation Conceptual Research papers related to your project idea

33 Weekly Project Updates
The current state of your project write-up Even in the early stages of the project you should be able to write a rough draft of the Introduction and Motivation sections The current state of your Final Presentation Your Initial Project proposal presentation (due Wed 10/20) should make a good starting point for your Final presentation What things are working & not working What roadblocks are you running into

34 Overview Class Project (example from 2008) Common Design Patterns

35 What you should learn Introduction to common Design Patterns & Compute Models

36 Outline Design patterns Why are they useful? Examples Compute models

37 Outline Design patterns Why are they useful? Examples Compute models

38 References Reconfigurable Computing (2008) [1] Chapter 5: Compute Models and System Architectures Scott Hauck, Andre DeHon Design Patterns for Reconfigurable Computing [2] Andre DeHon (FCCM 2004) Type Architectures, Shared Memory, and the Corollary of Modest Potential [3] Lawrence Snyder: Annual Review of Computer Science (1986) Design Patterns: Abstraction and Reuse of Object Oriented Design [4] E. Gamma (1992) The Timeless Way of Building [5] C. Alexander (1979)

39 Design Patterns Design patterns are solutions to recurring problems.

40 Reconfigurable Hardware Design
“Building good reconfigurable designs requires an appreciation of the different costs and opportunities inherent in reconfigurable architectures” [2] “How do we teach programmers and designers to design good reconfigurable applications and systems?” [2] Traditional approach: Read lots of papers for different applications Over time figure out ad-hoc tricks Better approach?: Use design patterns to provide a more systematic way of learning how to design It has been shown in other realms that studying patterns is useful Object oriented software [4] Computer Architecture [5]

41 Common Language Provides a means to organize and structure the solution to a problem Provide a common ground from which to discuss a given design problem Enables the ability to share solutions in a consistent manner (reuse)

42 Describing a Design Pattern [2]
10 attributes suggested by Gamma (Design Patterns, 1995) Name: Standard name Intent: What problem is being addressed?, How? Motivation: Why use this pattern Applicability: When can this pattern be used Participants: What components make up this pattern Collaborations: How do components interact Consequences: Trade-offs Implementation: How to implement Known Uses: Real examples of where this pattern has been used. Related Patterns: Similar patterns, patterns that can be used in conjunction with this pattern, when would you choose a similar pattern instead of this pattern.

43 Example Design Pattern
Coarse-grain Time-multiplexing Template Specialization

44 Coarse-grain Time-Multiplexing
B M3 M1 M2 M1 M2 A B M3 Temp M3 Temp Configuration 1 Configuration 2

45 Coarse-grain Time-Multiplexing
Name: Coarse-grained Time-Multiplexing Intent: Enable a design that is too large to fit on a chip all at once to run as multiple subcomponents Motivation: Method to share limited fixed resources to implement a design that is too large as a whole.

46 Coarse-grain Time-Multiplexing
Applicability (Requirements): Configuration can be done on large time scale No feedback loops in computation Feedback loop only spans the current configuration Feedback loop is very slow Participants: Computational graph Control algorithm Collaborations: Control algorithm manages when sub-graphs are loaded onto the device

47 Coarse-grain Time-Multiplexing
Consequences: Often platforms take millions of cycles to reconfigure Need an app that will run for 10’s of millions of cycles before needing to reconfigure May need large buffers to store data during a reconfiguration Known Uses: Video processing pipeline [Villasenor] “Video Communications using Rapidly Reconfigurable Hardware”, Transactions on Circuits and Systems for Video Technology 1995 Automatic Target Recognition [Villasenor] “Configurable Computer Solutions for Automatic Target Recognition”, FCCM 1996

48 Coarse-grain Time-Multiplexing
Implementation: Break design into multiple sub graphs that can be configured onto the platform in sequence Design a controller to orchestrate the configuration sequencing Take steps to minimize configuration time Related patterns: Streaming Data Queues with Back-pressure

49 Coarse-grain Time-Multiplexing
B M3 M1 M2 M1 M2 A B M3 Temp M3 Temp Configuration 1 Configuration 2

50 Coarse-grain Time-Multiplexing
Assume: 1.) reconfiguration takes 10 thousand clocks 2.) 100 MHz clock 3.) We need to process for 100x the time spent in reconfiguration to get the needed speedup. 4.) A and B each produce one byte per clock M2 M1 A B M3 M1 M2 M1 M2 A B M3 Temp M3 Temp Configuration 1 Configuration 2

51 Coarse-grain Time-Multiplexing
Assume: 1.) reconfiguration takes 10 thousand clocks 2.) 100 MHz clock 3.) We need to process for 100x the time spent in reconfiguration to get the needed speedup. 4.) A and B each produce one byte per clock M2 M1 A B M3 M1 M2 M1 M2 What constraint does this place on Temp? A B 1 MB buffer What if the data path is changed from 8-bit to 64-bit? M3 Temp M3 Temp 8 MB buffer Likely need off-chip memory Configuration 1 Configuration 2
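Working the assumptions through (a sketch): 10 thousand clocks at 100 MHz is 100 us of reconfiguration time, so each configuration should run for about 100 x 100 us = 10 ms, i.e. roughly 1,000,000 clocks. At one byte per clock the intermediate data held in Temp between configurations is on the order of 1,000,000 bytes, hence the 1 MB figure on the slide; widening the datapath from 8 bits to 64 bits multiplies that by 8 to roughly 8 MB, which is why off-chip memory becomes likely.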

52 Template Specialization
Empty LUTs A(1) A(0) LUT LUT LUT LUT - - - - C(3) C(2) C(1) C(0) Mult by 3 Mult by 5 A(1) A(1) A(0) A(0) LUT LUT LUT LUT LUT LUT LUT LUT 3 6 9 1 1 1 1 5 10 15 1 1 1 1 C(3) C(2) C(1) C(0) C(3) C(2) C(1) C(0)

53 Template Specialization
Name: Template Specialization Intent: Reduce the size or time needed for a computation. Motivation: Use early-bound data and slowly changing data to reduce circuit size and execution time.

54 Template Specialization
Applicability: When circuit specialization can be adapted quickly Example: Can treat LUTs as small memories that can be written. No interconnect modifications Participants: Template cell: Contains specialization configuration Template filler: Manages what and how a configuration is written to a Template cell Collaborations: Template filler manages Template cell

55 Template Specialization
Consequences: Cannot optimize as much as when a circuit is fully specialized for a given instance. Overhead is needed to allow the template to implement several specializations. Known Uses: Multiply-by-Constant String Matching Implementation: Multiply-by-Constant Use a LUT as memory to store the answer Use a controller to update this memory when a different constant should be used.
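As a software analogy of the multiply-by-constant implementation above (a minimal C sketch, not hardware code; the function names are made up), the "template cell" is a small lookup table standing in for the LUT contents, and the "template filler" rewrites it when the constant changes:

    /* Illustrative sketch of template specialization in software. */
    #include <stdint.h>
    #include <stdio.h>

    #define IN_BITS 3                        /* supports inputs 0..7, as on the later slide */
    #define TABLE_SIZE (1u << IN_BITS)

    static uint8_t mult_table[TABLE_SIZE];   /* stands in for the LUT contents */

    /* Template filler: rewrite the table when a different constant is needed. */
    static void specialize(uint8_t constant)
    {
        for (uint32_t a = 0; a < TABLE_SIZE; a++)
            mult_table[a] = (uint8_t)(a * constant);
    }

    /* Template cell: at run time the multiply is just a table lookup. */
    static uint8_t mult_by_constant(uint8_t a)
    {
        return mult_table[a & (TABLE_SIZE - 1u)];
    }

    int main(void)
    {
        specialize(3);
        printf("%u\n", (unsigned)mult_by_constant(5)); /* 15 */
        specialize(2);                                 /* re-specialize without changing the "circuit" */
        printf("%u\n", (unsigned)mult_by_constant(5)); /* 10 */
        return 0;
    }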

56 Template Specialization
Related patterns: CONSTRUCTOR EXCEPTION TEMPLATE

57 Template Specialization
Empty LUTs A(1) A(0) LUT LUT LUT LUT - - - - C(3) C(2) C(1) C(0) Mult by 3 Mult by 5 A(1) A(1) A(0) A(0) LUT LUT LUT LUT LUT LUT LUT LUT 3 6 9 1 1 1 1 5 10 15 1 1 1 1 C(3) C(2) C(1) C(0) C(3) C(2) C(1) C(0)

58 Template Specialization
Mult by 3 A(1) A(0) LUT LUT LUT LUT 1 1 1 1 3 6 9 C(3) C(2) C(1) C(0) Multiply by a constant of 2: Support inputs of 0 - 7

59 Template Specialization
Mult by 3 A(1) A(0) LUT LUT LUT LUT 1 1 1 1 3 6 9 C(3) C(2) C(1) C(0)

60 Template Specialization
Mult by 3 A(1) A(0) LUT LUT LUT LUT 1 1 1 1 3 6 9 Mult by 2 A(2) A(1) A(0) LUT LUT LUT LUT 2 4 6 8 10 12 14 1 1 1 C(3) C(2) C(1) C(0)

61 Catalog of Patterns (Just a start) [2]
[2] Identifies 89 patterns Area-Time Tradeoff Basic (implementation): Coarse-grain Time-Multiplex Parallel (Expression): Dataflow, Data Parallel Parallel (Implementation): SIMD, Communicating FSM Reducing Area or Time Reuse Hardware (implementation): Pipelining Specialization (Implementation): Template Communications Layout (Expression/Implementation): Systolic Memory Numbers and Functions

62 Catalog of Patterns (Just a start) [2]
[2] Identifies 89 patterns Area-Time Tradeoff Basic (implementation): Coarse-grain Time-Multiplex Parallel (Expression): Dataflow, Data Parallel Parallel (Implementation): SIMD, Communicating FSM Reducing Area or Time Reuse Hardware (implementation): Pipelining Specialization (Implementation): Template Communications Layout (Expression/Implementation): Systolic Memory Numbers and Functions

63 Next Lecture Continue Compute Models

64 Lecture Notes:

65 Instructor: Dr. Phillip Jones
CPRE 583 Reconfigurable Computing Lecture 12: Wed 10/6/2010 (Compute Models) Instructor: Dr. Phillip Jones Reconfigurable Computing Laboratory Iowa State University Ames, Iowa, USA

66 Projects Ideas: Relevant conferences
FPL FPT FCCM FPGA DAC ICCAD Reconfig RTSS RTAS ISCA Micro Super Computing HPCA IPDPS

67 Projects: Target Timeline
Teams Formed and Idea: Mon 10/11 Project idea in Power Point 3-5 slides Motivation (why is this interesting, useful) What will be the end result High-level picture of final product Project team list: Name, Responsibility High-level Plan/Proposal: Fri 10/22 Power Point 5-10 slides System block diagrams High-level algorithms (if any) Concerns Implementation Conceptual Related research papers (if any)

68 Projects: Target Timeline
Work on projects: 10/ /8 Weekly update reports More information on updates will be given Presentations: Last Wed/Fri of class Present / Demo what is done at this point 15-20 minutes (depends on number of projects) Final write up and Software/Hardware turned in: Day of final (TBD)

69 Project Grading Breakdown
50% Final Project Demo 30% Final Project Report 30% of your project report grade will come from your 5-6 project updates (due Fridays at midnight) 20% Final Project Presentation

70 Common Questions

71 Common Questions

72 Common Questions

73 Common Questions

74 Overview Compute Models

75 What you should learn Introduction to Compute Models

76 Outline Design patterns (previous lecture)
Why are they useful? Examples Compute models (Abstraction) System Architectures (Implementation)

77 Outline Design patterns (previous lecture)
Why are they useful? Examples Compute models (Abstraction) System Architectures (Implementation)

78 References Reconfigurable Computing (2008) [1]
Chapter 5: Compute Models and System Architectures Scott Hauck, Andre DeHon Design Patterns for Reconfigurable Computing [2] Andre DeHon (FCCM 2004) Type Architectures, Shared Memory, and the Corollary of Modest Potential [3] Lawrence Snyder: Annual Review of Computer Science (1986)

79 Building Applications
Problem -> Compute Model + Architecture -> Application Questions to answer How to think about composing the application? How will the compute model lead to a naturally efficient architecture? How does the compute model support composition? How to conceptualize parallelism? How to tradeoff area and time? How to reason about correctness? How to adapt to technology trends (e.g. larger/faster chips)? How does compute model provide determinacy? How to avoid deadlocks? What can be computed? How to optimize a design, or validate application properties?

80 Compute Models Compute Models [1]: High-level models of the flow of computation. Useful for: Capturing parallelism Reasoning about correctness Decomposition Guide designs by providing constraints on what is allowed during a computation Communication links How synchronization is performed How data is transferred

81 Two High-level Families
Data Flow: Single-rate Synchronous Data Flow Synchronous Data Flow Dynamic Streaming Dataflow Dynamic Streaming Dataflow with Peeks Streaming Data Flow with Allocation Sequential Control: Finite Automata (i.e. Finite State Machine) Sequential Controller with Allocation Data Centric Data Parallel

82 Data Flow Graph of operators that data (tokens) flows through
Composition of functions X X +

83 Data Flow Graph of operators that data (tokens) flows through
Composition of functions X X +

84 Data Flow Graph of operators that data (tokens) flows through
Composition of functions X X +

85 Data Flow Graph of operators that data (tokens) flows through
Composition of functions X X +

86 Data Flow Graph of operators that data (tokens) flows through
Composition of functions X X +

87 Data Flow Graph of operators that data (tokens) flows through
Composition of functions X X +

88 Data Flow Graph of operators that data (tokens) flows through
Composition of functions X X +

89 Data Flow Graph of operators that data (tokens) flows through
Composition of functions Captures: Parallelism Dependences Communication X X +

90 Single-rate Synchronous Data Flow
One token rate for the entire graph For example all operations take one token on a given link before producing an output token Same power as a Finite State Machine 1 1 1 update - 1 1 1 1 1 1 1 1 F copy

91 Synchronous Data Flow - F
Each link can have a different constant token input and output rate Same power as the single-rate version, but for some applications easier to describe Automated ways to detect/determine: Deadlock Buffer sizes 1 10 1 update - 1 1 1 1 10 10 1 1 F copy
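For reference (standard synchronous dataflow reasoning, not spelled out on the slide): on every link where the producer emits p tokens per firing and the consumer absorbs c tokens per firing, a consistent schedule must satisfy p * q(producer) = c * q(consumer), where q counts firings per graph iteration. Solving these balance equations gives the repetition vector; if no non-trivial solution exists the rates are inconsistent, and attempting to schedule one such iteration detects deadlock and bounds every buffer. For example, a link with rates 10 and 1 forces the rate-1 side to fire 10 times per firing of the rate-10 side, and that link needs room for at least 10 tokens.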

92 Dynamic Streaming Data Flow
Token rates dependent on data Just need to add two structures Switch Select in in0 in1 S S Switch Select out0 out1 out

93 Dynamic Streaming Data Flow
Token rates dependent on data Just need to add two structures Switch, Select More Powerful Difficult to detect Deadlocks Still Deterministic 1 Switch y x x y S F0 F1 x y x y Select

94 Dynamic Streaming Data Flow with Peeks
Allow operator to fire before all inputs have arrived An example where this is useful is the merge operation Now execution can be nondeterministic Answer depends on input arrival times Merge

95 Dynamic Streaming Data Flow with Peeks
Allow operator to fire before all inputs have arrived An example where this is useful is the merge operation Now execution can be nondeterministic Answer depends on input arrival times A Merge

96 Dynamic Streaming Data Flow with Peeks
Allow operator to fire before all inputs have arrived An example where this is useful is the merge operation Now execution can be nondeterministic Answer depends on input arrival times B Merge A

97 Dynamic Streaming Data Flow with Peeks
Allow operator to fire before all inputs have arrived An example where this is useful is the merge operation Now execution can be nondeterministic Answer depends on input arrival times Merge B A

98 Streaming Data Flow with Allocation
Removes the need for static links and operators. That is, the Data Flow graph can change over time More Power: Turing Complete More difficult to analyze Could be useful for some applications Telecom applications. For example if a channel carries voice versus data the resources needed may vary greatly Can take advantage of platforms that allow runtime reconfiguration

99 Sequential Control Sequence of subroutines
Programming languages (C, Java) Hardware control logic (Finite State Machines) Transform global data state

100 Finite Automata (i.e. Finite State Machine)
Can verify state reachability in polynomial time S1 S2 S3

101 Sequential Controller with Allocation
Adds ability to allocate memory. Equivalent to adding new states Model becomes Turing Complete S1 S2 S3

102 Sequential Controller with Allocation
Adds ability to allocate memory. Equivalent to adding new states Model becomes Turing Complete S1 S2 S4 S3 SN

103 Data Parallel Multiple instances of an operation type acting on separate pieces of data. For example: Single Instruction Multiple Data (SIMD) Identical match test on all items in a database Inverting the color of all pixels in an image
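A minimal C sketch of the pixel-inversion example (plain sequential code; on SIMD hardware each iteration maps to a lane, on an FPGA to replicated processing elements):

    /* Illustrative sketch: the identical operation applied independently to every data item. */
    #include <stddef.h>
    #include <stdint.h>

    static void invert_pixels(uint8_t *pixels, size_t count)
    {
        for (size_t i = 0; i < count; i++)
            pixels[i] = (uint8_t)(255u - pixels[i]);
    }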

104 Data Centric Similar to Data Flow, but the state contained in the objects of the graph is the focus, not the tokens flowing through the graph Network flow example Source1 Dest1 Source2 Switch Dest2 Source3 Flow rate Buffer overflow

105 Multi-threaded Multi-threaded: a compute model made up of multiple sequential controllers that have communication channels between them Very general, but often too much power and flexibility. No guidance for: Ensuring determinism Dividing application into threads Avoiding deadlock Synchronizing threads The models discussed can be defined in terms of a Multi-threaded compute model

106 Multi-threaded (Illustration)

107 Streaming Data Flow as Multithreaded
Thread: is an operator that performs transforms on data as it flows through the graph Thread synchronization: Tokens sent between operators

108 Data Parallel as Multithreaded
Thread: is a data item Thread synchronization: data updated with each sequential instruction

109 Caution with Multithreaded Model
Use when a stricter compute model does not give enough expressiveness. Define restrictions to limit the amount of expressive power that can be used Define synchronization policy How to reason about deadlocking

110 Other Models “A Framework for Comparing Models of computation” [1998]
E. Lee, A. Sangiovanni-Vincentelli Transactions on Computer-Aided Design of Integrated Circuits and Systems “Concurrent Models of Computation for Embedded Software”[2005] E. Lee, S. Neuendorffer IEEE Proceedings – Computers and Digital Techniques

111 Next Lecture System Architectures

112 User Defined Instruction
MP3 FPGA Power PC PC Display.c Ethernet (UDP/IP) User Defined Instruction VGA Monitor

113 User Defined Instruction
MP3 FPGA Power PC PC Display.c Ethernet (UDP/IP) User Defined Instruction VGA Monitor

114 User Defined Instruction
MP3 FPGA Power PC PC Display.c Ethernet (UDP/IP) User Defined Instruction VGA Monitor

115 MP3 Notes MUCH less VHDL coding than MP2
But you will be writing most of the VHDL from scratch The focus will be more on learning to read a specification (Power PC coprocessor interface protocol), and designing hardware that follows that protocol. You will be dealing with some pointer intensive C-code. It’s a small amount of C code, but somewhat challenging to get the pointer math right.

116 Questions/Comments/Concerns
Write down: the main point of the lecture, and one thing that's still not quite clear; OR, if everything is clear, give an example of how to apply something from the lecture.

117 Lecture Notes

118 Instructor: Dr. Phillip Jones
CPRE 583 Reconfigurable Computing Lecture 13: Fri 10/8/2010 (System Architectures) Instructor: Dr. Phillip Jones Reconfigurable Computing Laboratory Iowa State University Ames, Iowa, USA

119 Initial Project Proposal Slides (5-10 slides)
Project team list: Name, Responsibility (who is the project leader) Team size: 3-4 (5 case-by-case) Project idea Motivation (why is this interesting, useful) What will be the end result High-level picture of final product High-level Plan Break project into milestones Provide initial schedule: I would initially schedule aggressively to have the project complete by Thanksgiving. Issues will pop up that cause the schedule to slip. System block diagrams High-level algorithms (if any) Concerns Implementation Conceptual Research papers related to your project idea

120 Projects Ideas: Relevant conferences
FPL FPT FCCM FPGA DAC ICCAD Reconfig RTSS RTAS ISCA Micro Super Computing HPCA IPDPS

121 Initial Project Proposal Slides (5-10 slides)
Project team list: Name, Responsibility (who is the project leader) Project idea Motivation (why is this interesting, useful) What will be the end result High-level picture of final product High-level Plan Break project into milestones Provide initial schedule: I would initially schedule aggressively to have the project complete by Thanksgiving. Issues will pop up that cause the schedule to slip. System block diagrams High-level algorithms (if any) Concerns Implementation Conceptual Research papers related to your project idea

122 Projects: Target Timeline
Work on projects: 10/ /8 Weekly update reports More information on updates will be given Presentations: Last Wed/Fri of class Present / Demo what is done at this point 15-20 minutes (depends on number of projects) Final write up and Software/Hardware turned in: Day of final (TBD)

123 Common Questions

124 Common Questions

125 Common Questions

126 Overview Common System Architectures Plus/Delta mid-semester feedback

127 What you should learn Introduction to common System Architectures

128 Outline Design patterns (previous lecture)
Why are they useful? Examples Compute models (Abstraction) System Architectures (Implementation)

129 Outline Design patterns (previous lecture)
Why are they useful? Examples Compute models (Abstraction) System Architectures (Implementation)

130 References Reconfigurable Computing (2008) [1]
Chapter 5: Compute Models and System Architectures Scott Hauck, Andre DeHon

131 System Architectures Compute Models: Help express the parallelism of an application System Architecture: How to organize application implementation

132 Efficient Application Implementation
Compute model and system architecture should work together Both are a function of The nature of the application Required resources Required performance The nature of the target platform Resources available

133 Efficient Application Implementation
(Image Processing) Platform 1 (Vector Processor) Platform 2 (FPGA)

134 Efficient Application Implementation
(Image Processing) Compute Model System Architecture Platform 1 (Vector Processor) Platform 2 (FPGA)

135 Efficient Application Implementation
(Image Processing) Compute Model System Architecture Platform 1 (Vector Processor) Platform 2 (FPGA)

136 Efficient Application Implementation
(Image Processing) Data Flow Compute Model Streaming Data Flow System Architecture Platform 1 (Vector Processor) Platform 2 (FPGA)

137 Efficient Application Implementation
(Image Processing) Data Flow Compute Model Streaming Data Flow System Architecture Platform 1 (Vector Processor) Platform 2 (FPGA)

138 Efficient Application Implementation
(Image Processing) Compute Model System Architecture Platform 1 (Vector Processor) Platform 2 (FPGA)

139 Efficient Application Implementation
(Image Processing) Data Parallel Compute Model Vector System Architecture Platform 1 (Vector Processor) Platform 2 (FPGA)

140 Efficient Application Implementation
(Image Processing) Data Flow Compute Model Streaming Data Flow System Architecture Platform 1 (Vector Processor) Platform 2 (FPGA)

141 Efficient Application Implementation
(Image Processing) Data Flow Compute Model Streaming Data Flow System Architecture Platform 1 (Vector Processor) Platform 2 (FPGA)

142 Efficient Application Implementation
(Image Processing) X X Data Flow Compute Model + Streaming Data Flow System Architecture Platform 1 (Vector Processor) Platform 2 (FPGA)

143 Implementing Streaming Dataflow
Data presence variable length connections between operators data rates vary between operator implementations data rates varying between operators Datapath sharing not enough spatial resources to host entire graph balanced use of resources (e.g. operators) cyclic dependencies impacting efficiency Interconnect sharing Interconnects are becoming difficult to route Links between operators infrequently used High variability in operator data rates Streaming coprocessor Extreme resource constraints

144 Data Presence X X +

145 Data Presence X X data_ready data_ready + data_ready

146 Data Presence X X FIFO FIFO data_ready data_ready + FIFO data_ready

147 Data Presence X X stall stall FIFO FIFO data_ready data_ready + FIFO

148 Data Presence Flow control: Term typically used in networking X X stall
FIFO FIFO data_ready data_ready + FIFO stall data_ready Flow control: Term typically used in networking

149 Data Presence Flow control: Term typically used in networking
Increase flexibility of how the application can be implemented X X stall stall FIFO FIFO data_ready data_ready + FIFO stall data_ready Flow control: Term typically used in networking

150 Implementing Streaming Dataflow
Data presence variable length connections between operators data rates vary between operator implementations data rates varying between operators Datapath sharing not enough spatial resources to host entire graph balanced use of resources (e.g. operators) cyclic dependencies impacting efficiency Interconnect sharing Interconnects are becoming difficult to route Links between operators infrequently used High variability in operator data rates Streaming coprocessor Extreme resource constraints

151 Datapath Sharing X X +

152 Datapath Sharing Platform may only have one multiplier X X +

153 Datapath Sharing Platform may only have one multiplier X +

154 Datapath Sharing Platform may only have one multiplier REG X REG +

155 Datapath Sharing Platform may only have one multiplier REG X FSM REG +

156 Datapath Sharing Platform may only have one multiplier
REG X FSM REG + Important to keep track of where data is coming from!

157 Implementing Streaming Dataflow
Data presence variable length connections between operators data rates vary between operator implementations data rates varying between operators Datapath sharing not enough spatial resources to host entire graph balanced use of resources (e.g. operators) cyclic dependencies impacting efficiency Interconnect sharing Interconnects are becoming difficult to route Links between operators infrequently used High variability in operator data rates Streaming coprocessor Extreme resource constraints

158 Interconnect sharing X X +

159 Interconnect sharing Need more efficient use of interconnect X X +

160 Interconnect sharing Need more efficient use of interconnect X X +

161 Interconnect sharing Need more efficient use of interconnect X X FSM +

162 Implementing Streaming Dataflow
Data presence variable length connections between operators data rates vary between operator implementations data rates varying between operators Datapath sharing not enough spatial resources to host entire graph balanced use of resources (e.g. operators) cyclic dependencies impacting efficiency Interconnect sharing Interconnects are becoming difficult to route Links between operators infrequently used High variability in operator data rates Streaming coprocessor Extreme resource constraints

163 Streaming coprocessor
See SCORE chapter 9 of text for an example.

164 Sequential Control Typically thought of in the context of sequential programming on a processor (e.g. C, Java programming) Key to organizing, synchronizing, and controlling highly parallel operations Time multiplexing resources: when a task is too large for the computing fabric Increasing data path utilization

165 Sequential Control X + A B C

166 Sequential Control X + A B C A*x^2 + B*x + C

167 Sequential Control X + A B C C A B X X + A*x^2 + B*x + C A*x^2 + B*x + C

168 Finite State Machine with Datapath (FSMD)
B X X + A*x^2 + B*x + C

169 Finite State Machine with Datapath (FSMD)
B X FSM X + A*x^2 + B*x + C
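As a rough software model of the FSMD idea (a sketch under the assumption of a single shared multiplier, not the text's circuit; all names are made up), the computation A*x^2 + B*x + C is sequenced over several "cycles" by a small state machine:

    /* Illustrative sketch of an FSMD sharing one multiplier across cycles. */
    #include <stdio.h>

    typedef enum { S_MUL_XX, S_MUL_AXX, S_MUL_BX, S_SUM, S_DONE } state_t;

    static int poly_fsmd(int a, int b, int c, int x)
    {
        state_t state = S_MUL_XX;
        int reg = 0, acc = 0;                 /* registers holding intermediate values */
        while (state != S_DONE) {
            switch (state) {
            case S_MUL_XX:  reg = x * x;         state = S_MUL_AXX; break; /* cycle 1: x*x   */
            case S_MUL_AXX: acc = a * reg;       state = S_MUL_BX;  break; /* cycle 2: A*x^2 */
            case S_MUL_BX:  reg = b * x;         state = S_SUM;     break; /* cycle 3: B*x   */
            case S_SUM:     acc = acc + reg + c; state = S_DONE;    break; /* cycle 4: sum   */
            default:        state = S_DONE;      break;
            }
        }
        return acc;
    }

    int main(void)
    {
        printf("%d\n", poly_fsmd(2, 3, 4, 5)); /* 2*25 + 3*5 + 4 = 69 */
        return 0;
    }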

170 Sequential Control: Types
Finite State Machine with Datapath (FSMD) Very Long Instruction Word (VLIW) data path control Processor Instruction augmentation Phased reconfiguration manager Worker farm

171 Very Long Instruction Word (VLIW) Datapath Control
See 5.2 of text for this architecture

172 Processor

173 Instruction Augmentation

174 Phased Configuration Manager
Will see more detail with SCORE architecture from chapter 9 of text.

175 Worker Farm Chapter 5.2 of text

176 Bulk Synchronous Parallelism
See chapter 5.2 for more detail

177 Data Parallel Single Program Multiple Data
Single Instruction Multiple Data (SIMD) Vector Vector Coprocessor

178 Data Parallel

179 Data Parallel

180 Data Parallel

181 Data Parallel

182 Cellular Automata

183 Multi-threaded

184 Next Lecture

185 Questions/Comments/Concerns
Write down: the main point of the lecture, and one thing that's still not quite clear; OR, if everything is clear, give an example of how to apply something from the lecture.

186 Lecture Notes Add CSP/Multithread as root of a simple tree
15+5 (late start) minutes of time left Think of one to two in-class exercises (10 min) Data Flow graph optimization algorithm? Deadlock detection on a small model? Give some examples of where a given compute model would map to a given application. Systolic array (implement) or Dataflow compute model) String matching (FSM) (MISD) New image for MP3, too dark of a color

187 Instructor: Dr. Phillip Jones
CPRE 583 Reconfigurable Computing Lecture 14: Fri 10/13/2010 (Streaming Applications) Instructor: Dr. Phillip Jones Reconfigurable Computing Laboratory Iowa State University Ames, Iowa, USA

188 Projects Ideas: Relevant conferences
FPL FPT FCCM FPGA DAC ICCAD Reconfig RTSS RTAS ISCA Micro Super Computing HPCA IPDPS

189 Initial Project Proposal Slides (5-10 slides)
Project team list: Name, Responsibility (who is the project leader) Project idea Motivation (why is this interesting, useful) What will be the end result High-level picture of final product High-level Plan Break project into milestones Provide initial schedule: I would initially schedule aggressively to have the project complete by Thanksgiving. Issues will pop up that cause the schedule to slip. System block diagrams High-level algorithms (if any) Concerns Implementation Conceptual Research papers related to your project idea

190 Common Questions

191 Common Questions

192 Common Questions

193 Overview Streaming Applications (Chapters 8 & 9) Simulink SCORE

194 What you should learn Two approaches for implementing streaming applications

195 Data Flow: Quick Review
Graph of operators that data (tokens) flows through Composition of functions X X +

196 Data Flow: Quick Review
Graph of operators that data (tokens) flows through Composition of functions X X +

197 Data Flow: Quick Review
Graph of operators that data (tokens) flows through Composition of functions X X +

198 Data Flow: Quick Review
Graph of operators that data (tokens) flows through Composition of functions X X +

199 Data Flow: Quick Review
Graph of operators that data (tokens) flows through Composition of functions X X +

200 Data Flow: Quick Review
Graph of operators that data (tokens) flows through Composition of functions X X +

201 Data Flow: Quick Review
Graph of operators that data (tokens) flows through Composition of functions X X +

202 Data Flow Graph of operators that data (tokens) flows through
Composition of functions Captures: Parallelism Dependences Communication X X +

203 Streaming Application Examples
Some image processing algorithms Edge Detection Image Recognition Image Compression (JPEG) Network data processing String Matching (your MP2 assignment) Sorting??

204 Sorting Initial list of items Split Split Split Sort Sort Sort Sort
merge merge merge

205 Example Tools for Streaming Application Design
Simulink from MATLAB: Graphical based SCORE (Stream Computation Organized for Reconfigurable Hardware): A programming model

206 Simulink (MatLab) What is it?
MatLab module that allows building and simulating systems through a GUI interface

207 Simulink: Example Model

208 Simulink: Example Model

209 Simulink: Sub-Module

210 Simulink: Example Model

211 Simulink: Example Model

212 Simulink: Example Plot

213 Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection

214 Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 1 2 1 -2 2 -1 1 -1 -2 -1 Sobel X gradient Sobel Y gradient

215 Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection Detect Horizontal Edges Detect Vertical Edges -1 1 1 2 1 -2 2 -1 1 -1 -2 -1 Sobel X gradient Sobel Y gradient

216 Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 -2 2 50 -1 1 50 50 50 50 50 50 50 50

217 Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -2 2 50 -1 1 50 50 50 50 50 50 50 50

218 Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -2 2 50 -1 1 50 50 50 50 50 50 50 50

219 Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 -1 1 50 50 50 50 50 50 50 50

220 Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 -1 1 50 50 50 50 50 50 50 50

221 Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 -1 1 50 50 50 50 50 50 50 50

222 Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -1 1 50 50 50 50 50 50 50 50

223 Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -1 1 50 50 50 50 50 50 50 50

224 Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -150 -1 1 50 50 50 50 50 50 50 50

225 Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -150 -1 1 50 50 50 50 50 50 50 50

226 Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -150 -1 1 50 50 50 50 50 50 50 50

227 Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -150 -1 1 50 50 50 50 50 150 50 50 50

228 Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -150 -1 1 50 50 50 50 50 150 50 50 50

229 Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -150 -1 1 50 50 50 50 50 150 -150 50 50 50

230 Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -150 -1 1 50 50 50 50 50 150 -150 -50 50 50 50

231 Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -150 -1 1 50 50 50 50 50 150 -150 -50 -50 50 50 50

232 Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -150 -1 1 50 50 50 50 50 150 -150 -50 -50 50 100 50 50

233 Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -150 -1 1 50 50 50 50 50 150 -150 -50 -50 50 100 50 50

234 Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -150 -1 1 50 50 50 50 50 150 -150 -50 -50 50 100 -100 50 50

235 Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -150 -1 1 50 50 50 50 50 150 -150 -50 -50 50 100 -100 -100 50 50

236 Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -150 -1 1 50 50 50 50 50 150 -150 -50 -50 50 100 -100 -100 -100 50 50

237 Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -150 -1 1 50 50 50 50 50 150 -150 -50 -50 50 100 -100 -100 -100 50 150 50

238 Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -150 -1 1 50 50 50 50 50 150 -150 -50 -50 50 100 -100 -100 -100 50 150 50

239 Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -150 -1 1 50 50 50 50 50 150 -150 -50 -50 50 100 -100 -100 -100 50 150 -150 50

240 Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -150 -1 1 50 50 50 50 50 150 -150 -50 -50 50 100 -100 -100 -100 50 150 -150 -50 50

241 Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -150 -1 1 50 50 50 50 50 150 -150 -50 -50 50 100 -100 -100 -100 50 150 -150 -50 -50 50

242 Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -150 -1 1 50 50 50 50 50 150 -150 -50 -50 50 100 -100 -100 -100 50 150 -150 -50 -50 50 200

243 Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -150 -1 1 50 50 50 50 50 150 -150 -50 -50 50 100 -100 -100 -100 50 150 -150 -50 -50 50 200

244 Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -150 -1 1 50 50 50 50 50 150 -150 -50 -50 50 100 -100 -100 -100 50 150 -150 -50 -50 50 200 -200

245 Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -150 -1 1 50 50 50 50 50 150 -150 -50 -50 50 100 -100 -100 -100 50 150 -150 -50 -50 50 200 -200

246 Example Edge Detection: Sobel
CPRE584 student implementation of Sobel Basic Sobel Algorithm for Edge Detection -1 1 50 50 -50 -2 2 50 150 -150 -1 1 50 50 50 50 50 150 -150 -50 -50 50 100 -100 -100 -100 50 150 -150 -50 -50 50 200 -200

247 Top Level

248 Shifter

249 Multiplier

250 Input Image

251 Output Image

252 SCORE Overview of the SCORE programming approach
SCORE: Stream Computations Organized for Reconfigurable Execution Developed by the University of California, Berkeley and the California Institute of Technology FPL 2000 overview presentation

253 Next Lecture Data Parallel

254 Questions/Comments/Concerns
Write down: the main point of the lecture, and one thing that's still not quite clear; OR, if everything is clear, give an example of how to apply something from the lecture.

255 Lecture Notes

256 Instructor: Dr. Phillip Jones
CPRE 583 Reconfigurable Computing Lecture 15: Fri 10/15/2010 (Reconfiguration Management) Instructor: Dr. Phillip Jones Reconfigurable Computing Laboratory Iowa State University Ames, Iowa, USA

257 Announcements/Reminders
Midterm: Take home portion (40%) given Friday 10/22, due Tue 10/26 (midnight) In class portion (60%) Wed 10/27 Distance students will have in class portion given via a timed WebCT (2 hour) session (take on Wed, Thur or Friday). Start thinking of class projects and forming teams Submit teams and project ideas: Mon 10/11 midnight Project proposal presentations: Fri 10/22 MP3: PowerPC Coprocessor offload (today/tomorrow) Problem 2 of HW 2 (released after MP3 gets released)

258 Projects Ideas: Relevant conferences
FPL FPT FCCM FPGA DAC ICCAD Reconfig RTSS RTAS ISCA Micro Super Computing HPCA IPDPS

259 Common Questions

260 Common Questions

261 Overview Chapter 4: Reconfiguration Management

262 What you should learn Some basic configuration architectures
Key issues when managing the reconfiguration of a system

263 Reconfiguration Management
Goal: Minimize the overhead associated with run-time reconfiguration Why it is important to address: Can take 100’s of milliseconds to reconfigure a device For high performance applications this can be a large overhead (i.e. decreases performance)

264 High Level Configuration Setups
Externally trigger reconfiguration CPU Configuration Request FPGA ROM (bitfile) Config Data FSM Config Control (CC)

265 High Level Configuration Setups
Self trigger reconfiguration FPGA Config Data ROM (bitfile) FSM CC

266 Configuration Architectures
Single-context Multi-context Partially Reconfigurable Relocation & Defragmentation Pipeline Reconfiguration Block Reconfigurable

267 Single-context FPGA Config clk Config I/F Config Data Config enable
OUT IN OUT IN OUT EN EN EN Config enable

268 Multi-context FPGA 1 1 2 2 3 3 Config clk Context switch Config Config
OUT IN OUT IN EN EN Context switch 1 1 Context 1 Enable 2 2 Context 2 Enable 3 3 Context 3 Enable Config Enable Config Enable Config Data Config Data

269 Partially Reconfigurable
Reduce the amount of configuration data to send to the device, thus decreasing reconfiguration overhead Need addressable configuration memory, as opposed to single-context daisy-chain shifting Example: Encryption Change key And logic dependent on key PR devices AT40K Xilinx Virtex series (and Spartan, but not at run time) Need to make sure partial configs do not overlap in space/time (typically a config needs to be placed in a specific location; not as homogeneous as you would think in terms of resources and timing delays)

270 Partially Reconfigurable

271 Partially Reconfigurable
Full Reconfig 10-100’s ms

272 Partially Reconfigurable
Partial Reconfig 100’s us - 1’s ms

273 Partially Reconfigurable
Partial Reconfig 100’s us - 1’s ms

274 Partially Reconfigurable
Partial Reconfig 100’s us - 1’s ms

275 Partially Reconfigurable
Partial Reconfig 100’s us - 1’s ms

276 Partially Reconfigurable
Partial Reconfig 100’s us - 1’s ms Typically partial configuration modules map to a specific physical location

277 Relocation and Defragmentation
Make configuration architectures support relocatable modules Example of defragmentation: the text gives a good example (defrag or swap out; 90% decrease in reconfig time compared to full single context) Best fit, first fit, … Limiting factors: Routing/logic is heterogeneous (timing issues, need modified routes) Special resources needed (e.g. hard multipliers, BRAMs) Easier if there are blocks of homogeneity Connection to external I/O (fixed IP cores, board restrictions) Virtualized I/O (fixed pins with multiple internal I/Fs?) 2D architectures are more difficult to deal with Summary of features a PR architecture should have: Homogeneous logic and routing layout Bus-based communication (e.g. network on chip) 1D organization for relocation

278 Relocation and Defragmentation
B C

279 Relocation and Defragmentation

280 Relocation and Defragmentation

281 Relocation and Defragmentation

282 Relocation and Defragmentation

283 Relocation and Defragmentation
More efficient use of Configuration Space C A

284 Pipeline Reconfigurable
Example: PipeRench Simplifies reconfiguration Limit what can be implemented Cycle Virtual Pipeline stage 1 2 3 4 PE PE PE PE 1 1 1 PE PE PE PE 2 2 2 3 3 3 PE PE PE PE 4 4 Cycle Physical Pipeline stage 1 2 3 3 3 1 1 1 4 4 2 2 2

285 Block Reconfigurable Swappable Logic Units
Abstraction layer over a general PR architecture: SCORE Config Data

286 Managing the Reconfiguration Process
Choosing a configuration When to load Where to load Reduce how often one needs to reconfigure, hiding latency

287 Configuration Grouping
What to pack Pack multiple related in time configs into one Simulated annealing, clustering based on app control flow

288 Configuration Caching
When to load LRU, credit based dealing with variable sized configs

289 Configuration Scheduling
Prefetching Control flow graph Static: compiler-inserted config instructions Dynamic: probabilistic approaches MM (branch prediction) Constraints Resource Real-time Mitigation System status and prediction What are the current requests Predict which config combination will give the best speedup

290 Software-based Relocation Defragmentation
Placing the R/D decision on the CPU host, not the on-chip config controller

291 Context Switching Reach a safe state, then start where it left off.

292 Next Lecture Data Parallel

293 Questions/Comments/Concerns
Write down: the main point of the lecture, and one thing that's still not quite clear; OR, if everything is clear, give an example of how to apply something from the lecture.

294 Lecture Notes

295 Instructor: Dr. Phillip Jones
CPRE 583 Reconfigurable Computing Lecture 16: Fri 10/20/2010 (Data Parallel Architectures) Instructor: Dr. Phillip Jones Reconfigurable Computing Laboratory Iowa State University Ames, Iowa, USA

296 Announcements/Reminders
Midterm: Take home portion (40%) given Friday 10/29, due Tue 11/2 (midnight) In class portion (60%) Wed 11/3 Distance students will have in class portion given via a timed WebCT (2 hour) session (take on Wed, Thur or Friday). Start thinking of class projects and forming teams Submit teams and project ideas: Mon 10/11 midnight Project proposal presentations: Fri 10/22 MP3: PowerPC Coprocessor offload (today): Problem 2 of HW 2 (released after MP3 gets released)

297 Projects Ideas: Relevant conferences
FPL FPT FCCM FPGA DAC ICCAD Reconfig RTSS RTAS ISCA Micro Super Computing HPCA IPDPS

298 Common Questions

299 Common Questions

300 Overview Data Parallel Architectures: MP3 Demo/Overview
Chapters 5.2.4, and chapter 10 MP3 Demo/Overview

301 What you should learn Data Parallel Architecture basics
Flexibility Reconfigurable Hardware Adds

302 Data Parallel Architectures

303 Data Parallel Architectures

304 Data Parallel Architectures

305 Data Parallel Architectures

306 Data Parallel Architectures

307 Data Parallel Architectures

308 Data Parallel Architectures

309 Data Parallel Architectures

310 Data Parallel Architectures

311 Data Parallel Architectures

312 Next Lecture Project initial presentations.

313 Questions/Comments/Concerns
Write down: the main point of the lecture, and one thing that's still not quite clear; OR, if everything is clear, give an example of how to apply something from the lecture.

314 Lecture Notes

315 Instructor: Dr. Phillip Jones
CPRE 583 Reconfigurable Computing Lecture 17: Fri 10/22/2010 (Initial Project Presentations) Instructor: Dr. Phillip Jones Reconfigurable Computing Laboratory Iowa State University Ames, Iowa, USA

316 Projects Ideas: Relevant conferences
FPL FPT FCCM FPGA DAC ICCAD Reconfig RTSS RTAS ISCA Micro Super Computing HPCA IPDPS

317 Initial Project Proposal Slides (5-10 slides)
Project team list: Name, Responsibility (who is the project leader) Project idea Motivation (why is this interesting, useful) What will be the end result High-level picture of final product High-level Plan Break project into milestones Provide initial schedule: I would initially schedule aggressively to have the project complete by Thanksgiving. Issues will pop up that cause the schedule to slip. System block diagrams High-level algorithms (if any) Concerns Implementation Conceptual Research papers related to your project idea

318 Common Questions

319 Common Questions

320 Overview Present Project Ideas

321 Projects

322 Next Lecture Fixed Point Math and Floating Point Math

323 Questions/Comments/Concerns
Write down: the main point of the lecture, and one thing that's still not quite clear; OR, if everything is clear, give an example of how to apply something from the lecture.

324 Lecture Notes

325 Instructor: Dr. Phillip Jones
CPRE 583 Reconfigurable Computing Lecture 18: Fri 10/27/2010 (Floating Point) Instructor: Dr. Phillip Jones Reconfigurable Computing Laboratory Iowa State University Ames, Iowa, USA

326 Announcements/Reminders
Midterm: Take home portion (40%) given Friday 10/29 (released today by 5pm), due Tue 11/2 (midnight) In class portion (60%) Wed 11/3 Distance students will have in class portion given via a timed WebCT (2 hour) session (take on Wed, Thur or Friday). Problem 2 of HW 2 (released soon)

327 Projects Ideas: Relevant conferences
FPL FPT FCCM FPGA DAC ICCAD Reconfig RTSS RTAS ISCA Micro Super Computing HPCA IPDPS

328 Common Questions

329 Common Questions

330 Overview Floating Point on FPGAs (Chapter 21.4 and 31)
Why is it viewed as difficult?? Options for mitigating issues

331 Floating Point Format (IEEE-754)
Single Precision: S, exp, Mantissa (1, 8, 23 bits) Mantissa = 0.b(-1) b(-2) b(-3) … b(-23) = ∑ for i = 1..23 of b(-i) * 2^(-i) Floating point value = (-1)^S * 2^(exp - 127) * (1.Mantissa) Example: 0 x”80” “110” x”00000” = (-1)^0 * 2^(128 - 127) * 1.(1/2 + 1/4) = (-1)^0 * 2^1 * 1.75 = 3.5 Double Precision: S, exp, Mantissa (1, 11, 52 bits) Floating point value = (-1)^S * 2^(exp - 1023) * (1.Mantissa)
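A small C check of the formula (a sketch; it handles normalized numbers only and ignores the special values discussed later):

    /* Illustrative sketch: decode a single-precision bit pattern by the slide's formula. */
    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>

    static double decode_single(uint32_t bits)
    {
        uint32_t sign     = (bits >> 31) & 0x1u;
        uint32_t exponent = (bits >> 23) & 0xFFu;
        uint32_t mantissa = bits & 0x7FFFFFu;
        /* value = (-1)^S * 2^(exp - 127) * (1.Mantissa) */
        double frac = 1.0 + (double)mantissa / (double)(1u << 23);
        return (sign ? -1.0 : 1.0) * ldexp(frac, (int)exponent - 127);
    }

    int main(void)
    {
        /* Slide example: S = 0, exp = x"80", mantissa bits "110" then zeros */
        uint32_t bits = (0x80u << 23) | 0x600000u;
        printf("%f\n", decode_single(bits));   /* prints 3.500000 */
        return 0;
    }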

332 Fixed Point Whole.Fractional = b(W-1) … b1 b0 . b(-1) b(-2) … b(-F)
Example formats (W.F): 5.5, 10.12, 3.7 Example in fixed point 5.5 format: a pattern like 00010.01100 = 2 + 1/4 + 1/8 = 2.375 Compare floating point and fixed point Floating point: 0 x”80” “110” x”00000” = 3.5 10-bit (Format 3.7) Fixed Point for 3.5 = ?
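One way to answer the slide's 3.7 question (a worked sketch): a value v in W.F fixed point is stored as v * 2^F, so 3.5 in 3.7 format is 3.5 * 128 = 448, i.e. the 10-bit pattern 011.1000000 (whole bits 011 = 3, fractional bits .1000000 = 1/2).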

333 Fixed Point (Addition)
Whole Fractional Operand 1 Whole Fractional Operand 2 + Whole Fractional sum

334 Fixed Point (Addition)
11-bit 4.7 format Operand 1 = 3.875 + Operand 2 = 1.625 sum = 5.5 You can use a standard ripple-carry adder!
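A C sketch of the same idea (the helper names are made up): store the 4.7 value in an ordinary integer scaled by 2^7, and the addition is plain integer addition, exactly what a ripple-carry adder does in hardware:

    /* Illustrative sketch of 4.7 fixed-point addition. */
    #include <stdint.h>
    #include <stdio.h>

    #define FRAC_BITS 7   /* 4.7 format: 4 whole bits, 7 fractional bits */

    static int16_t to_fixed(double v)   { return (int16_t)(v * (1 << FRAC_BITS)); }
    static double  to_double(int16_t f) { return (double)f / (1 << FRAC_BITS); }

    int main(void)
    {
        int16_t a = to_fixed(3.875);      /* 0011.1110000 */
        int16_t b = to_fixed(1.625);      /* 0001.1010000 */
        int16_t sum = (int16_t)(a + b);   /* ordinary integer add */
        printf("%f\n", to_double(sum));   /* prints 5.500000 */
        return 0;
    }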

335 Floating Point (Addition)
0 x”80” x”80000” Operand 1 = 3.875 0 x”7F” x”00000” Operand 2 = 1.625 +

336 Floating Point (Addition)
0 x”80” x”80000” Operand 1 = 3.875 0 x”7F” x”00000” Operand 2 = 1.625 + Common exponent (i.e. align binary point) Make x”80” -> x”7F” or vice versa?

337 Floating Point (Addition)
0 x”80” x”80000” Operand 1 = 3.875 0 x”7F” x”00000” Operand 2 = 1.625 + Common exponent (i.e. align binary point) Make x”7F”->x”80”, lose least significant bits of Operand 2 - Add the difference of x”80” – x“7F” = 1 to x”7F” - Shift mantissa of Operand 2 by difference to the right. remember “implicit” 1 of the original mantissa 0 x”80” x”80000” Operand 1 = 3.875 0 x”80” x”80000” Operand 2 = 1.625 +

338 Floating Point (Addition)
0 x”80” x”80000” Operand 1 = 3.875 0 x”7F” x”00000” Operand 2 = 1.625 + Add mantissas 0 x”80” x”80000” Operand 1 = 3.875 0 x”80” x”80000” Operand 2 = 1.625 +

339 Floating Point (Addition)
0 x”80” x”80000” Operand 1 = 3.875 0 x”7F” x”00000” Operand 2 = 1.625 + Add mantissas 0 x”80” x”80000” Operand 1 = 3.875 0 x”80” x”80000” Operand 2 = 1.625 + Overflow! x”00000”

340 Floating Point (Addition)
0 x”80” x”80000” Operand 1 = 3.875 0 x”7F” x”00000” Operand 2 = 1.625 + Add mantissas You can’t just overflow mantissa into exponent field You are actually overflowing the implicit “1” of Operand 1, so you sort of have an implicit “2” (i.e. “10”). 0 x”80” x”80000” Operand 1 = 3.875 0 x”80” x”80000” Operand 2 = 1.625 + Overflow! x”00000”

341 Floating Point (Addition)
0 x”80” x”80000” Operand 1 = 3.875 0 x”7F” x”00000” Operand 2 = 1.625 + Add mantissas Deal with overflow of Mantissa by normalizing. Shift mantissa right by 1 (shift a “0” in because of implicit “2”) Increment exponent by 1 0 x”80” x”80000” Operand 1 = 3.875 0 x”80” x”80000” Operand 2 = 1.625 + 0 x”81” x”00000”

342 Floating Point (Addition)
0 x”80” x”80000” Operand 1 = 3.875 0 x”7F” x”00000” Operand 2 = 1.625 + 0 x”81” x”00000” = 5.5 Add mantissas Deal with overflow of Mantissa by normalizing. Shift mantissa right by 1 (shift a “0” in because of implicit “2”) Increment exponent by 1 0 x”80” x”80000” Operand 1 = 3.875 0 x”80” x”80000” Operand 2 = 1.625 + 0 x”81” x”00000”
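To summarize the walkthrough, a hedged Python sketch (added here) of the align, add, and normalize steps for two positive single-precision operands; it ignores rounding, special values, and subtraction.

def fp_add_same_sign(exp_a, mant_a, exp_b, mant_b):
    """Add two positive IEEE-754 singles given as (biased exponent, 23-bit mantissa field).
    Returns (biased exponent, 23-bit mantissa field). No rounding or special cases."""
    # Restore the implicit leading 1 to get 24-bit significands.
    sig_a = (1 << 23) | mant_a
    sig_b = (1 << 23) | mant_b
    # Align binary points: shift the smaller-exponent operand to the right.
    if exp_a < exp_b:
        exp_a, sig_a, exp_b, sig_b = exp_b, sig_b, exp_a, sig_a
    sig_b >>= (exp_a - exp_b)          # least-significant bits of the smaller operand are lost
    # Add significands; the result may overflow into a 25th bit (the "implicit 2").
    sig = sig_a + sig_b
    exp = exp_a
    if sig >> 24:                      # normalize: shift right by 1, increment exponent
        sig >>= 1
        exp += 1
    return exp, sig & 0x7FFFFF

# 3.875 = exp x"80", mantissa "1111" + zeros;  1.625 = exp x"7F", mantissa "101" + zeros
exp, mant = fp_add_same_sign(0x80, 0b111_1000_0000_0000_0000_0000,
                             0x7F, 0b101_0000_0000_0000_0000_0000)
print(hex(exp), hex(mant))             # 0x81 0x300000  -> 1.375 * 2^2 = 5.5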

343 Floating Point (Addition): Other concerns
Special Value | Sign | Exponent | Mantissa
Zero | 0/1 | 0 | 0
Infinity | 0 | MAX | 0
-Infinity | 1 | MAX | 0
NaN | 0/1 | MAX | Non-zero
Denormal | 0/1 | 0 | nonzero
Single Precision: S (1 bit), exp (8 bits), Mantissa (23 bits)

344 Floating Point (Addition): High-level Hardware
[Block diagram: exponent difference/compare selects the larger operand, the mantissas M0 and M1 are swapped as needed and the smaller is right-shifted by the exponent difference to align, then add/sub; a priority encoder drives a left shift to normalize, followed by rounding and denormal handling, producing the result exponent E and mantissa M.]

345 Floating Point Both Xilinx and Altera supply floating point soft-cores (which I believe are IEEE-754 compliant), so don’t be too afraid if you need floating point in your class projects. Also, there should be floating point open cores that are freely available.

346 Fixed Point vs. Floating Point
Floating Point advantages: Application designer does not have to think “much” about the math. Floating point format supports a wide range of numbers (+/- 3x10^38 to +/- 1x10^-38, single precision). If IEEE-754 compliant, then easier to accelerate existing floating point based applications. Floating Point disadvantages: Ease of use at great hardware expense: a 32-bit fixed point add (~32 DFFs + 32 LUTs) vs. a 32-bit single precision floating point add (~250 DFFs + LUTs), about 10x more resources, thus 1/10 the possible best case parallelism. Floating point typically needs a massive pipeline to achieve high clock rates (i.e. high throughput). No hard resources such as the carry-chain to take advantage of

347 Fixed Point vs. Floating Point
Range example: Floating Point vs. Fixed Point advantages: Some exception with respect to precision

348 Mitigating Floating Point Disadvantages
Only support a subset of the IEEE-754 standard; could use software to off-load special cases. Modify the floating point format to support a smaller data type (e.g. 18-bit instead of 32-bit). Link to Cornell class: Add hardware support in the FPGA for floating point: hard multipliers added by companies in the early 2000s; Altera: hard shared paths for floating point (Stratix-V, 2011). How to get 1-TFLOP throughput on FPGAs article: achieve-1-trillion-floating-point-operations-per-second-in-an-FPGA

349 Mitigating Fixed Point Disadvantages (21.4)
Block Floating Point (mitigating range issue)
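For illustration (a sketch of one simple shared-exponent scheme, not necessarily the exact scheme in chapter 21.4), block floating point stores a single exponent per block of fixed-point mantissas, recovering much of floating point's range at near fixed-point cost.

import math

def block_encode(values, mant_bits=16):
    """Share one exponent across a block: pick the smallest exponent such that the
    largest magnitude fits in a signed mant_bits mantissa, then store integer mantissas.
    (Exponent is clamped at 0 for simplicity.)"""
    peak = max(abs(v) for v in values)
    limit = 2 ** (mant_bits - 1) - 1
    exp = math.ceil(math.log2(peak / limit)) if peak > limit else 0
    mants = [int(round(v / 2 ** exp)) for v in values]
    return exp, mants

def block_decode(exp, mants):
    return [m * 2 ** exp for m in mants]

exp, mants = block_encode([1200.0, -35000.0, 0.5, 800.0])
print(exp, mants)                 # shared exponent 1, mantissas [600, -17500, 0, 400]
print(block_decode(exp, mants))   # [1200, -35000, 0, 800]  (the 0.5 is lost to quantization)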

350 CPU/FPGA/GPU reported FLOPs
Block Floating Point (mitigating range issue)

351 Next Lecture Mid-term Then on Friday: Evolvable Hardware

352 Questions/Comments/Concerns
Write down Main point of lecture One thing that’s still not quite clear If everything is clear, then give an example of how to apply something from lecture OR

353 Lecture Notes Altera App Notes on computing FLOPs for Stratix-III
Altera old app Notes on floating point add/mult

354 Instructor: Dr. Phillip Jones
CPRE 583 Reconfigurable Computing Lecture 19: Fri 11/5/2010 (Evolvable Hardware) Instructor: Dr. Phillip Jones Reconfigurable Computing Laboratory Iowa State University Ames, Iowa, USA

355 Announcements/Reminders
MP3: Extended due date until Monday midnight. Those that finish by Friday (11/5) midnight get a bonus of +1% per day before the new deadline. If after Friday midnight, no bonus but no penalty. 10% deduction after Monday midnight, and an additional -10% each day late. Problem 2 of HW 2 (will now be called HW3): released by Sunday midnight, will be due Monday 11/22 midnight. Turn in weekly project report (tonight midnight)

356 Projects Ideas: Relevant conferences
FPL FPT FCCM FPGA DAC ICCAD Reconfig RTSS RTAS ISCA Micro Super Computing HPCA IPDPS

357 What you should learn Understand Evolvable Hardware basics?
Benefits and Drawbacks Key types/categories

358 Evolvable Hardware One of the first papers to compare reconfigurable HW with biological organisms (1993) “Evolvable Hardware with Genetic Learning: A first step towards building a Darwin Machine”, Higuchi Biological organism => DNA GATACAAAGATACACCAGATA Reconfigurable Hardware => Configuration bitstream

359 Evolvable Hardware One of the first papers to compare reconfigurable HW with biological organisms (1993) “Evolvable Hardware with Genetic Learning: A first step towards building a Darwin Machine”, Higuchi Biological organism => DNA GATACAAAGATACACCAGATA Reconfigurable Hardware => Configuration bitstream GATACA

360 Evolvable Hardware One of the first papers to compare reconfigurable HW with biological organisms (1993) “Evolvable Hardware with Genetic Learning: A first step towards building a Darwin Machine”, Higuchi Biological organism => DNA GATACAAAGATACACCAGATA Reconfigurable Hardware => Configuration bitstream GATACA GATAGA

361 Evolvable Hardware One of the first papers to compare reconfigurable HW with biological organisms (1993) “Evolvable Hardware with Genetic Learning: A first step towards building a Darwin Machine”, Higuchi Biological organism => DNA GATACAAAGATACACCAGATA Reconfigurable Hardware => Configuration bitstream GATACA GATAGA

362 Evolvable Hardware One of the first papers to compare reconfigurable HW with biological organisms (1993) “Evolvable Hardware with Genetic Learning: A first step towards building a Darwin Machine”, Higuchi Biological organism => DNA GATACAAAGATACACCAGATA Reconfigurable Hardware => Configuration bitstream

363 Evolvable Hardware One of the first papers to compare reconfigurable HW with biological organisms (1993) “Evolvable Hardware with Genetic Learning: A first step towards building a Darwin Machine”, Higuchi Biological organism => DNA GATACAAAGATACACCAGATA Reconfigurable Hardware => Configuration bitstream

364 Evolvable Hardware One of the first papers to compare reconfigurable HW with biological organisms (1993) “Evolvable Hardware with Genetic Learning: A first step towards building a Darwin Machine”, Higuchi Biological organism => DNA GATACAAAGATACACCAGATA Reconfigurable Hardware => Configuration bitstream DFF DFF

365 Classifying Adaptation/Evolution
Phylogeny, Ontogeny, Epigenesis (POE). Phylogeny: evolution through recombination and mutations (biological reproduction : Genetic Algorithms). Ontogeny: self replication (multicellular organism's cell division : Cellular Automata). Epigenesis: adaptation triggered by the external environment (immune system development : Artificial Neural Networks)

366 Classifying Adaptation/Evolution
Phylogeny, Ontogeny, Epigenesis (POE). Phylogeny: evolution through recombination and mutations (biological reproduction : Genetic Algorithms). Ontogeny: self replication (multicellular organism's cell division : Cellular Automata). Epigenesis: adaptation triggered by the external environment (immune system development : Artificial Neural Networks). (POE axes: Phylogeny, Epigenesis, Ontogeny)

367 Artificial Evolution 30/40 year old concept. But applying to reconfigurable hardware is newish (1990’s) Evolutionary Algorithms (EAs) Genetic Algorithms Genetic Programming Evolution Strategies Evolutionary programming

368 Artificial Evolution 30/40 year old concept. But applying to reconfigurable hardware is newish (1990’s) Evolutionary Algorithms (EAs) Genetic Algorithms Genetic Programming Evolution Strategies Evolutionary programming

369 Artificial Evolution 30/40 year old concept. But applying to reconfigurable hardware is newish (1990’s) Evolutionary Algorithms (EAs) Genetic Algorithms Genetic Programming Evolution Strategies Evolutionary programming

370 Genetic Algorithms Genome: a finite sting of symbols encoding an individual Phenotype: The decoding of the genome to realize the individual Constant Size population Generic steps Initial population Decode Evaluate (must define a fitness function) Selection Mutation Cross over

371 Initialize Population
Genetic Algorithms Initialize Population Evaluate Decode Next Generation Selection Cross Over Mutation

372 Initialize Population
Genetic Algorithms Initialize Population Evaluate Decode ( ) Next Generation Selection Cross Over Mutation

373 Initialize Population
Genetic Algorithms Initialize Population (.40) (.70) (.20) (.10) (.10) (.60) Evaluate Decode ( ) Next Generation Selection Cross Over Mutation

374 Initialize Population
Genetic Algorithms Initialize Population (.40) (.70) (.20) (.10) (.10) (.60) Evaluate Decode ( ) Next Generation Selection Cross Over (.40) (.70) (.60) Mutation

375 Initialize Population
Genetic Algorithms Initialize Population (.40) (.70) (.20) (.10) (.10) (.60) Evaluate Decode ( ) Next Generation Selection Cross Over (.40) (.70) (.60) Mutation

376 Initialize Population
Genetic Algorithms Initialize Population (.40) (.70) (.20) (.10) (.10) (.60) Evaluate Decode ( ) Next Generation Selection Cross Over (.40) (.70) (.60) Mutation

377 Initialize Population
Genetic Algorithms Initialize Population (.40) (.70) (.20) (.10) (.10) (.60) Evaluate Decode ( ) Next Generation Selection Cross Over (.40) (.70) (.60) Mutation

378 Evolvable Hardware Platform

379 Genetic Algorithms GA are a type of guided search
Why use a guided search? Why not just do an exhaustive search?

380 Genetic Algorithms GA are a type of guided search
Why use a guided search? Why not just do an exhaustive search? Assume 1 billion individuals can be evaluated per second. The genome of an individual is 32 bits in size. How long to do an exhaustive search?

381 Genetic Algorithms GA are a type of guided search
Why use a guided search? Why not just do an exhaustive search? Assume 1 billion individuals can be evaluated per second. Now the genome of an individual is an FPGA configuration 1,000,000 bits in size. How long to do an exhaustive search?
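A quick back-of-the-envelope check (added, assuming the 10^9 evaluations per second rate above): the 32-bit case is tractable, the 1,000,000-bit case is not.

from math import log10

RATE = 1e9                      # evaluations per second

# 32-bit genome: 2^32 candidates
print(2**32 / RATE, "seconds")  # ~4.3 seconds

# 1,000,000-bit genome: 2^1,000,000 candidates
digits = 1_000_000 * log10(2)   # ~301,030 decimal digits
print("about 10^%.0f seconds" % (digits - 9))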

382 Evolvable Hardware Taxonomy
Extrinsic Evolution (furthest from biology) Evolution done in SW, then result realized in HW Intrinsic Evolution HW is used to deploy individuals Results are sent back to SW for fitness calculation Complete Evolution Evolution is completely done on target HW device Open-ended Evolution (closest to biology) Evaluation criteria changes dynamically Phylogeny Epigenesis Ontogeny

383 Evolvable Hardware Applications
Prosthetic hand controller chip. Kajitani, “An Evolvable Hardware Chip for Prosthetic Hand Controller”, 1999

384 Evolvable Hardware Applications
Tone Discrimination and Frequency generation Adrian Thompson “Silicon Evolution”, 1996 Xilinx XC6200

385 Evolvable Hardware Applications
Tone Discrimination and Frequency generation Node Functions Node Genotype

386 Evolvable Hardware Applications
Tone Discrimination and Frequency generation Evolved 4KHz oscillator

387 Evolvable Hardware Issues?

388 Evolvable Hardware Issues?

389 Evolvable Hardware Platforms
Commercial Platforms: Xilinx XC6200 (completely multiplexer based, so random bitstreams could be programmed dynamically without damaging the chip), Xilinx Virtex FPGA. Custom Platforms: POEtic cell, Evolvable LSI chip (Higuchi)

390 Next Lecture Overview the synthesis process

391 Notes Notes

392 Adaptive Thermoregulation for Applications on Reconfigurable Devices
Phillip Jones Applied Research Laboratory Washington University Saint Louis, Missouri, USA Iowa State University Seminar April 2008 Funded by NSF Grant ITR

393 What are FPGAs? FPGA: Field Programmable Gate Array
Sea of general purpose logic gates CLB Configurable Logic Block

394 What are FPGAs? FPGA: Field Programmable Gate Array
Sea of general purpose logic gates CLB CLB CLB Configurable Logic Block CLB CLB CLB CLB CLB CLB CLB CLB CLB CLB CLB CLB

395 What are FPGAs? FPGA: Field Programmable Gate Array
Sea of general purpose logic gates CLB CLB Configurable Logic Block CLB CLB CLB CLB CLB CLB

396 FPGA Usage Models Partial Reconfiguration Fast Prototyping System on
Experimental ISA Experimental Micro Architectures Run-time adaptation Run-time Customization CPU + Specialized HW - Sparc-V8 Leon Partial Reconfiguration Fast Prototyping System on Chip (SoC) Parallel Applications Full Reconfiguration Image Processing Computational Biology Remote Update Fault Tolerance

397 Some FPGA Details CLB CLB CLB CLB

398 Some FPGA Details: the CLB's 4-input Look Up Table (LUT). Inputs A, B, C, D address a 16-entry truth table (rows 0000 through 1111) whose stored bits determine the output Z.

399 Some FPGA Details: the 4-input LUT programmed as a 4-input AND gate. The truth-table entry for ABCD = 1111 is 1 and all other entries are 0, so Z = A AND B AND C AND D.

400 Some FPGA Details: the same 4-input LUT programmed as a 4-input OR gate. The truth-table entry for ABCD = 0000 is 0 and all other entries are 1, so Z = A OR B OR C OR D.

401 Some FPGA Details: the 4-input LUT programmed as a 2:1 mux. One input acts as the select line choosing between two of the other inputs (the truth-table rows marked X are don't-cares).

402 Some FPGA Details CLB CLB CLB Z A LUT B C D

403 Some FPGA Details CLB CLB PIP Programmable Interconnection Point CLB Z
LUT DFF B C D

404 Some FPGA Details CLB CLB PIP Programmable Interconnection Point CLB Z
LUT DFF B C D

405 Outline Why Thermal Management? Measuring Temperature
Thermally Driven Adaptation Experimental Results Temperature-Safe Real-time Systems Future Directions

406 Why Thermal Management?

407 Why Thermal Management?
Location? Hot Cold Regulated

408 Why Thermal Management?
Mobile? Hot Cold Regulated

409 Why Thermal Management?
Reconfigurability FPGA Plasma Physics Microcontroller

410 Why Thermal Management?
Exceptional Events

411 Why Thermal Management?
Exceptional Events

412 Local Experience Thermally aggressive application
Disruption of air flow

413 Damaged Board (bottom view)
Thermally aggressive application Disruption of air flow

414 Damaged Board (side view)
Thermally aggressive application Disruption of air flow

415 Response to catastrophic thermal events
Easy Fix Not Feasible!! Very Inconvenient

416 Solutions Over provision Use thermal feedback
Large heat sinks and fans. Restrict performance: limiting operating frequency, limiting the amount of chip utilization. Use thermal feedback (my approach): dynamic operating frequency, adaptive computation, shutdown device

417 Outline Why Thermal Management? Measuring Temperature
Thermally Driven Adaptation Experimental Results Temperature-Safe Real-time Systems Future Directions

418 Measuring Temperature
FPGA

419 Measuring Temperature
FPGA A/D 60 C

420 Background: Measuring Temperature
FPGA ring-oscillator temperature measurement. S. Lopez-Buedo, J. Garrido, and E. Boemo, “Thermal testing on reconfigurable computers,” IEEE Design and Test of Computers, vol. 17, 2000. [Figure: a ring oscillator inside the FPGA; its oscillation period varies with temperature.]

421 Background: Measuring Temperature
FPGA Temperature 1. .0 1. .1 0. 1. .0 0. 1. Period

422 Background: Measuring Temperature
FPGA Temperature 1. .0 1. .1 0. 1. .0 0. 1. Period

423 Background: Measuring Temperature
FPGA. S. Lopez-Buedo, J. Garrido, and E. Boemo, “Thermal testing on reconfigurable computers,” IEEE Design and Test of Computers, vol. 17, 2000. [Figure: ring-oscillator period as a function of temperature, with supply voltage as an additional factor.]

424 Background: Measuring Temperature
FPGA Temperature 1. .1 .0 Period Voltage

425 Background: Measuring Temperature
FPGA “Adaptive Thermoregulation for Applications on Reconfigurable Devices”, by Phillip H. Jones, James Moscola, Young H. Cho, and John W. Lockwood; Field Programmable Logic and Applications (FPL’07), Amsterdam, Netherlands Temperature 1. .1 .0 Period Voltage

426 Background: Measuring Temperature
FPGA Mode 1 Core 1 Core 2 Temperature Core 3 Core 4 Period Frequency: High

427 Background: Measuring Temperature
FPGA Mode 1 Mode 2 Core 1 Core 2 Temperature Core 3 Core 4 Period Frequency: High

428 Background: Measuring Temperature
FPGA Mode 3 Mode 1 Mode 2 Core 1 Core 2 70C Temperature 40C Core 3 Core 4 Period 8,000 8,300 Frequency: Low Frequency: High

429 Background: Measuring Temperature
FPGA Mode 3 Mode 1 Mode 2 Pause Sample Controller Core 1 Core 2 Temperature Core 3 Core 4 Period Frequency: High

430 Background: Measuring Temperature
FPGA Mode 3 Mode 1 Mode 2 Pause Time out Counter Core 1 Core 2 Temperature Core 3 Core 4 Period Frequency: High

431 Background: Measuring Temperature
FPGA Mode 3 Mode 1 Mode 2 Pause Time out Counter 2 5 3 1 4 5 2 3 1 Core 1 Core 2 Temperature Core 3 Core 4 Period Frequency: Low Frequency: High

432 Background: Measuring Temperature
FPGA Mode 3 Mode 1 Mode 2 Pause Time out Counter 3 2 5 1 4 5 3 1 2 3 Core 1 Core 2 Temperature Core 3 Core 4 Period Frequency: Low Frequency: High

433 Background: Measuring Temperature
FPGA Mode 2 1 3 Sample Mode Pause Time out Counter 2 1 5 4 3 5 2 3 3 1 Core 1 Core 2 Temperature Core 3 Core 4 Period Frequency: High

434 Temperature Benchmark Circuits
Desired Properties: Scalable Work over a wide range of frequencies Can easily increase or decrease circuit size Simple to analyze Regular structure Distributes evenly over chip Help reduce thermal gradients that may cause damage to the chip May serve as standard Further experimentation Repeatability of results “A Thermal Management and Profiling Method for Reconfigurable Hardware Applications”, by Phillip H. Jones, John W. Lockwood, and Young H. Cho; Field Programmable Logic and Applications (FPL’06), Madrid, Spain,

435 Temperature Benchmark Circuits
LUT 00 70 05 75 DFF Core Block (CB): Array of 48 LUTs and 48 DFF

436 Temperature Benchmark Circuits
RLOC: Row, Col 0 , 0 7 , 5 AND 00 70 05 75 DFF Core Block (CB): Array of 48 LUTs and 48 DFF Each LUT configured to be a 4-input AND gate 8 Input Gen Array of 18 core blocks (864 LUTs, 864 DFFs) (1 LUT, 1 DFF) Thermal workload unit: Computation Row CB 0 CB 17 CB 1 CB 16

437 Temperature Benchmark Circuits
RLOC: Row, Col 0 , 0 7 , 5 AND 00 70 05 75 DFF Core Block (CB): Array of 48 LUTs and 48 DFF Each LUT configured to be a 4-input AND gate RLOC_ORIGIN: Row, Col 100% Activation Rate Thermal workload unit: Computation Row 01 Input Gen CB 0 CB 1 CB 16 CB 17 00 1 1 8 8 (1 LUT, 1 DFF) Array of 18 core blocks (864 LUTs, 864 DFFs)

438 Example Circuit Layout (Configuration 1x, 9% LUTs and DFFs)
RLOC_ORIGIN: Row, Col (27,6) Thermal Workload Unit

439 Example Circuit Layout (Configuration 4x, 36% LUTs and DFFs)

440 Observed Temperature vs. Frequency
T ~ P P ~ F*C*V2 Steady-State Temperatures Cfg4x Cfg10x Cfg2x Cfg1x

441 Observed Temperature vs. Active Area
Max rated Tj 85 C T ~ P P ~ F*C*V2 Steady-State Temperatures 200 MHz 100 MHz 50 MHz 25 MHz 10 MHz

442 Projecting Thermal Trajectories
Estimate Steady State Temperature: Tj_ss = Power * θjA + TA, where θjA is the FPGA thermal resistance (ºC/W), about 5.4 ± 0.5. Use measured power at t=0. Specific exponential equation: Temperature(t) = ½*(-41*e^(-t/20) + 71) + ½*(-41*e^(-t/180) + 71)

443 Projecting Thermal Trajectories
Estimate Steady State Temperature: how long until 60 C? Exploit this phase for performance. Tj_ss = Power * θjA + TA, where θjA is the FPGA thermal resistance (ºC/W), about 5.4 ± 0.5. Use measured power at t=0. Specific exponential equation: Temperature(t) = ½*(-41*e^(-t/20) + 71) + ½*(-41*e^(-t/180) + 71)
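A small Python sketch (added) that evaluates the two formulas above and estimates when 60 C is crossed; the power, θjA, and ambient values plugged in are placeholders, not measured numbers from the talk.

import math

def tj_steady_state(power_w, theta_ja, t_ambient):
    """Tj_ss = Power * θjA + TA (θjA in C/W)."""
    return power_w * theta_ja + t_ambient

def temperature(t):
    """Fitted trajectory from the slide (two exponential time constants, t in seconds)."""
    return 0.5 * (-41 * math.exp(-t / 20) + 71) + 0.5 * (-41 * math.exp(-t / 180) + 71)

print(tj_steady_state(power_w=8.5, theta_ja=5.4, t_ambient=25))   # placeholder inputs

# How long until the junction reaches 60 C, per the fitted curve?
t = 0.0
while temperature(t) < 60 and t < 1000:
    t += 1.0
print(round(t), "seconds to reach 60 C")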

444 Thermal Shutdown Max Tj (70C)

445 Outline Why Thermal Management? Measuring Temperature
Thermally Driven Adaptation Experimental Results Temperature-Safe Real-time Systems Future Directions

446 Image Correlation Application
Template

447 Image Correlation Application Virtex-4 100FX Resource Utilization
Heats the FPGA a lot! (> 85 C). Virtex-4 100FX resource utilization: Max Frequency: 200 MHz; Block RAM: 44 (11%); Occupied Slices: 32,868 (77%); D Flip Flops (DFFs): 49,148 (58%); Lookup Tables (LUTs): 57,461 (68%)

448 Application Infrastructure Temperature Sample Controller
Thermoregulation Controller Pause 65 C Application Mode “Adaptive Thermoregulation for Applications on Reconfigurable Devices”, by Phillip H. Jones, James Moscola, Young H. Cho, and John W. Lockwood; Field Programmable Logic and Applications (FPL’07), Amsterdam, Netherlands

449 Application Specific Adaptation
Temperature Sample Controller Thermoregulation Controller Pause 65 C Image Buffer Mode Image Processor Core 1 Mask 1 2 Image Processor Core 3 Image Processor Core 2 Mask 1 2 Image Processor Core 4 Mask 1 2 Mask 2 Mask 1 Score Out

450 Application Specific Adaptation
Temperature Sample Controller Thermoregulation Controller Pause 65 C Frequency Quality Image Buffer 200 Mode MHz 8 Image Processor Core 1 Mask 1 2 Image Processor Core 3 Mask 1 2 Image Processor Core 2 Mask 1 2 Image Processor Core 4 Mask 1 2 Score Out

451 Application Specific Adaptation
Temperature Sample Controller Thermoregulation Controller Pause 65 C Frequency Quality Image Buffer 200 MHz 8 Image Processor Core 1 Mask 1 2 Image Processor Core 2 Mask 1 2 Image Processor Core 3 Image Processor Core 4 Mask 1 2 Mask 2 Mask 1 High Priority Features Low Priority Features Score Out

452 Application Specific Adaptation
Temperature Sample Controller Thermoregulation Controller Pause 65 C Frequency Quality Image Buffer 200 180 150 100 MHz 8 Image Processor Core 1 Mask 1 2 Image Processor Core 2 Mask 1 2 Image Processor Core 3 Image Processor Core 4 Mask 1 2 Mask 2 Mask 1 High Priority Features Low Priority Features Score Out

453 Application Specific Adaptation
Temperature Sample Controller Thermoregulation Controller Pause 65 C Frequency Quality Image Buffer 100 75 50 MHz MHz 8 Image Processor Core 1 Mask 1 2 Image Processor Core 2 Mask 1 2 Image Processor Core 3 Image Processor Core 4 Mask 1 2 Mask 2 Mask 1 High Priority Features Low Priority Features Score Out

454 Application Specific Adaptation
Temperature Sample Controller Thermoregulation Controller Pause 65 C Frequency Quality Image Buffer 50 MHz 6 4 5 7 8 Image Processor Core 1 Mask 1 2 Image Processor Core 2 Mask 1 2 Image Processor Core 3 Image Processor Core 4 Mask 2 Mask 1 Mask 2 Mask 1 Mask 2 Mask 2 High Priority Features Low Priority Features Score Out

455 Application Specific Adaptation
Temperature Sample Controller Thermoregulation Controller Pause 65 C Frequency Quality Image Buffer 75 100 180 150 50 200 MHz MHz 4 7 8 6 5 Image Processor Core 1 Mask 1 2 Image Processor Core 2 Mask 1 2 Image Processor Core 3 Image Processor Core 4 Mask 1 Mask 2 Mask 1 Mask 2 High Priority Features Low Priority Features Score Out

456 Thermally Adaptive Frequency
High Frequency Thermal Budget = 72 C “An Adaptive Frequency Control Method Using Thermal Feedback for Reconfigurable Hardware Applications”, by Phillip H. Jones, Young H. Cho, and John W. Lockwood; Field Programmable Technology (FPT’06), Bangkok, Thailand Junction Temperature, Tj (C) Low Frequency Low Threshold = 67 C Time (s)

457 Thermally Adaptive Frequency
Thermal Budget = 72 C High Frequency Low Frequency Low Threshold = 67 C Junction Temperature, Tj (C) Time (s)

458 Thermally Adaptive Frequency
Thermal Budget = 72 C High Frequency Low Frequency Low Threshold = 67 C Junction Temperature, Tj (C) S. Wang (“Reactive Speed Control”, ECRTS06) Time (s)

459 Outline Why Thermal Management? Measuring Temperature
Thermally Driven Adaptation Experimental Results Temperature-Safe Real-time Systems Future Directions

460 Platform Overview Virtex-4 FPGA Temperature Probe

461 Thermal Budget Efficiency
[Chart: Thermal budget efficiency. Junction temperature under six thermal conditions (40 C, 35 C, 30 C, and 25 C ambient with 0 fans; 25 C with 1 fan; 25 C with 2 fans) for the adaptive design versus a fixed (50 MHz) design processing 4 features. The adaptive design's frequency ranges from 50 MHz up to 200 MHz while staying under its 65 C thermal budget; the fixed design leaves much of the thermal budget unused.]

462 Conclusions Motivated the need for thermal management
Measuring temperature Application dependent voltage variations effects. Temperature benchmark circuits Examined application specific adaptation for improving performance in dynamic thermal environments

463 Outline Why Thermal Management? Measuring Temperature
Thermally Driven Adaptation Experimental Results Temperature-Safe Real-time Systems Future Directions

464 Thermally Constrained Systems
Space Craft Sun Earth

465 Thermally Constrained Systems

466 Temperature-Safe Real-time Systems
Task scheduling is a concern in many embedded systems Goal: Satisfy thermal constraints without violating real-time constraints

467 How to manage temperature?
Static frequency scaling Sleep while idle Time T1 T2 T3 T1 T2 T3 Time

468 How to manage temperature?
Static frequency scaling Sleep while idle Time T1 T2 T3 Too hot? Deadlines could be missed T1 T2 T3 Idle Time

469 How to manage temperature?
Static frequency scaling Sleep while idle Time T1 T2 T3 Deadlines could be missed T1 T2 T3 Idle Idle Idle Time Generalization: Idle task insertion

470 Idle Task Insertion More Powerful
Tasks scheduled at F_max (100 MHz).
a. No idle task inserted:
Period (s): 30, Cost (s): 10.0, Deadline (s): 10.0, Utilization (%): 33.33
Period (s): 120, Cost (s): 30.0, Deadline (s): 120, Utilization (%): 25.00
Period (s): 480, Cost (s): 30.0, Deadline (s): 480, Utilization (%): 6.25
Period (s): 960, Cost (s): 20.0, Deadline (s): 960, Utilization (%): 2.08
Total utilization: 66.66%. Where the deadline equals the cost, frequency cannot be scaled or the task schedule becomes infeasible.
b. 1 idle task inserted (the same tasks plus an idle task with Period 60 s, Cost 20.0 s, Deadline 60 s, Utilization 33.33%): total utilization 99.99%.
Idle task insertion: No impact on tasks’ cost. Higher priority task response times unaffected. Allows control over the distribution of idle time.
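To make the utilization numbers above concrete, a short check (added) that sums cost/period for both task sets.

# (period_s, cost_s) pairs from the tables above
base_tasks = [(30, 10.0), (120, 30.0), (480, 30.0), (960, 20.0)]
idle_task = (60, 20.0)                        # the inserted idle task

def utilization(tasks):
    return sum(cost / period for period, cost in tasks)

print(round(100 * utilization(base_tasks), 2))                # 66.67 (66.66 on the slide)
print(round(100 * utilization(base_tasks + [idle_task]), 2))  # 100.0 (99.99 on the slide)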

471 Sleep when idle is insufficient
Temperature constraint = 65 C Peak Temperature = 70 C

472 Idle-task inserted Temperature constraint = 65 C
Peak Temperature = 61 C

473 Idle-Task Insertion + Deadlines Temperature met? Yes No System
(task set) Idle tasks Scheduler (e.g. RMS) + Deadlines met? Temperature Yes No a. Original schedule does not meet temperature constraints b. Use idle tasks to redistribute device idle time in order to reduce peak device temperature

474 Related Research Power Management Thermal Management
Power Management: EDF with Dynamic Frequency Scaling, Yao (FOCS’95); Worst Case Execution Time, Shin (DAC’99). Thermal Management: EDF, Minimize Temperature, Bansal (FOCS’04); RMS, Reactive Frequency, CIA, Wang (RTSS’06, ECRTS’06)

475 Outline Why Thermal Management? Measuring Temperature
Thermally Driven Adaptation Experimental Results Conclusions Temperature-Safe Real-time Systems Future Directions

476 Research Fronts Near term Longer term
Exploration of adaptation techniques Advanced FPGA reconfiguration capabilities Other frequency adaptation techniques Integration of temperature into real-time systems Longer term Cyber physical systems (NSF initiative)

477 Questions/Comments? Near term Longer term
Exploration of adaptation techniques Advanced FPGA reconfiguration capabilities Other frequency adaptation techniques Integration of temperature into real-time systems Longer term Cyber physical systems (NSF initiative)

478 Temperature per Processing Core
[Chart: junction temperature vs. number of processing cores (1 to 4) for scenarios S1 through S6. Each scenario follows a roughly linear trend; the fitted slopes are smaller for the fan-cooled scenarios S5 and S6 (about 1.43 and 1.22) than for S1 through S4 (about 2.07 to 2.24).]

479 Temperature Sample Mode

480 Ring Oscillator Thermometer Characteristics
Thermometer size: ~100 LUTs. Ring oscillator size: 48 LUTs (47 NOT + 1 OR). Oscillation period: ~40 ns. Incrementer cycle period: ~0.16 ms (40 ns * 4096). Temperature resolution: 0.1 ºC per count, i.e. 0.1 ºC per 20 ns of period change.

481 Application Mode
[Chart: temperature (10 to 90 C) vs. incrementer period (8100 to 8700 counts, 20 ns/count) measured while the application is active, for application modes A, B, and C; counts of 8235, 8425, and 8620 are marked for the three modes.]

482 Virtex-4 100FX Resource Utilization
Application implementation statistics. Virtex-4 100FX resource utilization: Max Frequency: 200 MHz; Block RAM: 44 (11%); Occupied Slices: 32,868 (77%); D Flip Flops (DFFs): 49,148 (58%); Lookup Tables (LUTs): 57,461 (68%). Image correlation characteristics: Image Processing Rate: 40.6 frames per second (at 200 MHz); # of Features: 1-8; Pixel Resolution: 8-bit (grey scale); Image Size: 320x480 pixels

483 VirtexE 2000 Resource Utilization Image Correlation Characteristics
Application implementation statistics 125 MHz 26% (43) 32,868 (15,808) 49,148 (58%) 57,461 (68%) Max Frequency Block RAM Occupied Slices D Flip Flops (DFFs) Lookup Tables (LUTs) VirtexE 2000 Resource Utilization 12.7/second (at 125 MHz) 10 (in parallel) 1 - 4 8-bit (grey scale) 640x480 Image Processing Rate # of Templates # of Mask Patterns Pixel Resolution Image Size (# pixels) Image Correlation Characteristics a.) b.)

484 Scenario Descriptions
Scenarios S1 – S6 (ambient temperature, # of fans): S1: 40 C (104 F), 0 fans; S2: 35 C (95 F), 0 fans; S3: 30 C (86 F), 0 fans; S4: 25 C (77 F), 0 fans; S5: 25 C (77 F), 1 fan; S6: 25 C (77 F), 2 fans

485 High Level Architecture
Application Pause Thermal Manager Frequency & Quality Controller Frequency mode Quality Temperature

486 Periodic Temperature Sampling
Application Pause Thermal Manager 50 ms Event Counter Event Ring Oscillator Based Thermometer ready Sample Mode Controller Temperature Frequency & Quality capture Frequency mode Quality

487 Ring Oscillator Based Thermometer
Reset 12-bit incrementer ring_clk MSB Edge Detect 14-bit Clk DFF reset 14 Temperature sel Ready mux

488 ASIC, GPP, FPGA Comparison
Cost Performance Power Flexibility

489 Frequency Multiplexing Circuit
Frequency Control Clk Multiplier (DLLs) clk clk to global clock tree 2:1 MUX 4xclk BUFG Current Virtex-4 platform uses glitch free BUFGMUX component

490 Thermally Adaptive Frequency
High Frequency Thermal Budget = 72 C Junction Temperature, Tj (C) Low Frequency Low Threshold = 67 C Time (s)

491 Thermally Adaptive Frequency
Thermal Budget = 72 C High Frequency Low Frequency Low Threshold = 67 C Junction Temperature, Tj (C) Time (s)

492 Thermally Adaptive Frequency
Thermal Budget = 72 C High Frequency Low Frequency Low Threshold = 67 C Junction Temperature, Tj (C) Time (s)

493 Worst Case Thermal Condition Thermally Safe Frequency
Thermal Budget = 70 C Thermally Safe Frequency 50 MHz

494 Worst Case Thermal Condition Thermally Safe Frequency
Thermal Budget = 70 C 30/120MHz Adaptive Frequency Thermally Safe Frequency 50 MHz

495 Worst Case Thermal Condition Thermally Safe Frequency
Thermal Budget = 70 C 30/120MHz Adaptive Frequency 48.5 MHz Thermally Safe Frequency 50 MHz

496 Typical Thermal Condition Thermally Safe Frequency
Thermal Budget = 70 C 30/120MHz Adaptive Frequency 48.5 MHz Thermally Safe Frequency 50 MHz

497 Typical Thermal Condition Thermally Safe Frequency
Thermal Budget = 70 C 30/120MHz Adaptive Frequency 95 MHz Adaptive Frequency 48.5 MHz Thermally Safe Frequency 50 MHz

498 Best Case Thermal Condition Thermally Safe Frequency
Thermal Budget = 70 C 30/120MHz Adaptive Frequency 95 MHz Thermally Safe Frequency 50 MHz

499 Best Case Thermal Condition Thermally Safe Frequency
Thermal Budget = 70 C 30/120MHz Adaptive Frequency 95 MHz Adaptive Frequency 119 MHz Thermally Safe Frequency 50 MHz 2.4x Factor Performance Increase

500 Instructor: Dr. Phillip Jones
CPRE 583 Reconfigurable Computing Lecture 21: Fri 11/12/2010 (Synthesis) Instructor: Dr. Phillip Jones Reconfigurable Computing Laboratory Iowa State University Ames, Iowa, USA

501 Announcements/Reminders
HW3: finishing up (hope to release this evening) will be due Fri12/17 midnight. Two lectures left Fri 12/3: Synthesis and Map Wed 12/8: Place and Route Two class sessions for Project Presentations Fri 12/10 Wed 12/15 (??) Take home final given on Wed 12/15 due 12/17 5pm

502 Projects Ideas: Relevant conferences
FPL FPT FCCM FPGA DAC ICCAD Reconfig RTSS RTAS ISCA Micro Super Computing HPCA IPDPS

503 What you should learn Intro to synthesis
Synthesis and Optimization of Digital Circuits De micheli, 1994 (chapter 1)

504 Synthesis (big picture)
Synthesis & Optimization Architectural Logic Boolean Function Min Boolean Relation Min State Min Scheduling Sharing Coloring Covering Satisfiability Graph Theory Boolean Algebra

505 Views of a design Behavioral view Structural view PC = PC +1 Fetch(PC)
Decode(INST) Add Mult Architectural level RAM control S1 S2 Logic level DFF S3

506 Levels of Synthesis Architectural level
Translate the Architectural behavioral view of a design in to a structural (e.g. block level) view Logic Translate the logic behavioral view of a design into a gate level structural view Behavioral view Structural view PC = PC +1 Fetch(PC) Decode(INST) Add Mult Architectural level RAM control S2 S1 Logic level DFF S3

507 Levels of Synthesis Architectural level
Translate the Architectural behavioral view of a design in to a structural (e.g. block level) view Logic Translate the logic behavioral view of a design into a gate level structural view ID Func. Resources Schedule use (control) Inter connect (data path) Behavioral view Structural view PC = PC +1 Fetch(PC) Decode(INST) Add Mult Architectural level RAM control S2 S1 Logic level DFF S3

508 Levels of Synthesis Architectural level
Translate the Architectural behavioral view of a design in to a structural (e.g. block level) view Logic Translate the logic behavioral view of a design into a gate level structural view Behavioral view Structural view PC = PC +1 Fetch(PC) Decode(INST) Add Mult Architectural level RAM control S2 S1 Logic level DFF S3

509 Example: Diffeq Forward Euler method
y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if

510 Example: Diffeq Forward Euler method
y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if * ALU Memory & Steering logic Control Unit

511 Example: Diffeq Forward Euler method
y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if * ALU Memory & Steering logic Control Unit

512 Example: Diffeq Forward Euler method
y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if x <= x1; u <= u1; y <= y1; Control Unit S10 S1 S9 S2 S8 S3 S7 S6 S5 S4 * ALU Control Unit Memory & Steering logic

513 Example: Diffeq Forward Euler method
y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if x <= x1; u <= u1; y <= y1; Control Unit S10 S1 S9 read S2 S8 S3 S7 S6 S5 S4 * ALU Control Unit Memory & Steering logic

514 Example: Diffeq Forward Euler method
y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if x <= x1; u <= u1; y <= y1; Control Unit S10 S1 S9 S2 + S8 S3 S7 S6 S5 S4 * ALU Control Unit Memory & Steering logic

515 Example: Diffeq Forward Euler method
y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if x <= x1; u <= u1; y <= y1; Control Unit S10 S1 S9 S2 S8 * S3 S7 S6 S5 S4 * ALU Control Unit Memory & Steering logic

516 Example: Diffeq Forward Euler method
y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if x <= x1; u <= u1; y <= y1; Control Unit S10 S1 S9 S2 S8 S3 S7 * S6 S5 S4 * ALU Control Unit Memory & Steering logic

517 Example: Diffeq Forward Euler method
y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if x <= x1; u <= u1; y <= y1; Control Unit S10 S1 S9 S2 S8 S3 S7 *, + S6 S5 S4 * ALU Control Unit Memory & Steering logic

518 Example: Diffeq Forward Euler method
y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if x <= x1; u <= u1; y <= y1; Control Unit S10 S1 S9 S2 S8 S3 S7 * S6 S5 S4 * ALU Control Unit Memory & Steering logic

519 Example: Diffeq Forward Euler method
y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if x <= x1; u <= u1; y <= y1; Control Unit S10 S1 S9 S2 S8 * S3 S7 S6 S5 S4 * ALU Control Unit Memory & Steering logic

520 Example: Diffeq Forward Euler method
y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if x <= x1; u <= u1; y <= y1; Control Unit S10 S1 S9 S2 S8 +,* S3 S7 S6 S5 S4 * ALU Control Unit Memory & Steering logic

521 Example: Diffeq Forward Euler method
y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if x <= x1; u <= u1; y <= y1; y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if x <= x1; u <= u1; y <= y1; Control Unit S10 S1 S9 + S2 S8 S3 S7 S6 S5 S4 * ALU Control Unit Memory & Steering logic

522 Example: Diffeq Forward Euler method
y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if x <= x1; u <= u1; y <= y1; y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if x <= x1; u <= u1; y <= y1; Control Unit S10 S1 S9 write S2 S8 S3 S7 S6 S5 S4 * ALU Control Unit Memory & Steering logic

523 Example: Diffeq Forward Euler method
y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if x <= x1; u <= u1; y <= y1; y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if x <= x1; u <= u1; y <= y1; Control Unit DFF DFF DFF DFF * ALU Control Unit Memory & Steering logic

524 Optimization Combinational Metrics: propagation delay, circuit size
Sequential Cycle time Latency Circuit size

525 Optimization Combinational Metrics: propagation delay, circuit size
Sequential Cycle time Latency Circuit size

526 Impact of High-level Synthesis on Optimization
y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if * ALU Memory & Steering logic Control Unit

527 Impact of High-level Synthesis on Optimization
y’’ + 3xy’ + 3y = 0, where x(0) = 0; y(0) =y; y’(0) = u, for x = 0 to a, dx step size clk’rise_edge x1 <= x + dx; u1 <= u – (3 * x * u * dx) – (3 * y * dx); y1 <= y + u * dx; if( x1 < a) then ans_done <= 0; else ans_done <= 1 end if * ALU Memory & Steering logic Control Unit * * * ALU Memory & Steering logic Control Unit

528 Logic-level Synthesis and Optimization
Combinational Two-level optimization Multi-level optimization Sequential State-based models Network models

529 Logic-level Synthesis and Optimization
Combinational Two-level optimization Multi-level optimization Sequential State-based models Network models

530 Logic-level Synthesis and Optimization
Combinational Two-level optimization Multi-level optimization Sequential State-based models Network models Sum of products A’B’C’D’ + A’B’C’D + A’B’CD’ + A’B’CD + A’BCD’

531 Logic-level Synthesis and Optimization
Combinational Two-level optimization Multi-level optimization Sequential State-based models Network models K-map CD Sum of products 00 01 10 11 AB A’B’C’D’ + A’B’C’D + A’B’CD’ + A’B’CD + A’BCD’ 00 1 1 1 1 01 1 10 11

532 Logic-level Synthesis and Optimization
Combinational Two-level optimization Multi-level optimization Sequential State-based models Network models K-map CD Sum of products 00 01 10 11 AB A’B’C’D’ + A’B’C’D + A’B’CD’ + A’B’CD + A’BCD’ 00 1 1 1 1 01 1 10 11

533 Logic-level Synthesis and Optimization
Combinational Two-level optimization Multi-level optimization Sequential State-based models Network models K-map CD Sum of products Sum of products (minimized) 00 01 10 11 AB A’B’C’D’ + A’B’C’D + A’B’CD’ + A’B’CD + A’BCD’ 00 1 1 1 1 A * B + A’*C*D’ 01 1 10 11

534 Logic-level Synthesis and Optimization
Combinational Two-level optimization Multi-level optimization Sequential State-based models Network models Multi-level high-level view A’B’C’D’ + A’B’C’D ’ A = xy + xw B = xw

535 Logic-level Synthesis and Optimization
Combinational Two-level optimization Multi-level optimization Sequential State-based models Network models Multi-level high-level view A’B’C’D’ + A’B’C’D ’ A = xy + xw B = xw (xy + xw)’ (xw)’CD + (xy + xw)’(xw)C’D’

536 Logic-level Synthesis and Optimization
Combinational Two-level optimization Multi-level optimization Sequential State-based models Network models

537 Logic-level Synthesis and Optimization
Combinational Two-level optimization Multi-level optimization Sequential State-based models Network models

538 Logic-level Synthesis and Optimization
Combinational Two-level optimization Multi-level optimization Sequential State-based models Network models

539 Introduction to HW3

540 Introduction to HW3

541 Introduction to HW3

542 Next Lecture MAP

543 Notes Notes

544 Instructor: Dr. Phillip Jones
CPRE 583 Reconfigurable Computing Lecture 22: Fri 11/19/2010 (Coregen Overview) Instructor: Dr. Phillip Jones Reconfigurable Computing Laboratory Iowa State University Ames, Iowa, USA

545 Announcements/Reminders
HW3): released by Saturday midnight, will be due Wed 12/15 midnight. Turn in weekly project report (tonight midnight) Midterms still being graded, sorry for the delay: You can stop by my office after 5pm today to pick up your graded tests 584 Advertisement: Number 1

546 Projects Ideas: Relevant conferences
FPL FPT FCCM FPGA DAC ICCAD Reconfig RTSS RTAS ISCA Micro Super Computing HPCA IPDPS

547 What you should learn Basic of using coregen, in class demo

548 Next Lecture Finish up synthesis process, start MAP

549 Notes Notes

550 Instructor: Dr. Phillip Jones
CPRE 583 Reconfigurable Computing Lecture 22: Fri 12/1/2010 (Class Project Work) Instructor: Dr. Phillip Jones Reconfigurable Computing Laboratory Iowa State University Ames, Iowa, USA

551 Announcements/Reminders
HW3: finishing up (hope to release this evening) will be due Fri12/17 midnight. Two lectures left Fri 12/3: Synthesis and Map Wed 12/8: Place and Route Two class sessions for Project Presentations Fri 12/10 Wed 12/15 (??) Take home final given on Wed 12/15 due 12/17 5pm

552 Projects Ideas: Relevant conferences
FPL FPT FCCM FPGA DAC ICCAD Reconfig RTSS RTAS ISCA Micro Super Computing HPCA IPDPS

553 Next Lecture Finish up synthesis process, MAP

554 Notes Notes

555 Instructor: Dr. Phillip Jones
CPRE 583 Reconfigurable Computing Lecture 24: Wed 12/8/2010 (Map, Place & Route) Instructor: Dr. Phillip Jones Reconfigurable Computing Laboratory Iowa State University Ames, Iowa, USA

556 Announcements/Reminders
HW3: finishing up (hope to release this evening) will be due Fri12/17 midnight. Two lectures left Fri 12/3: Synthesis and Map Wed 12/8: Place and Route Two class sessions for Project Presentations Fri 12/10 Wed 12/15 (9 – 10:30 am) Take home final given on Wed 12/15 due 12/17 5pm

557 Projects Ideas: Relevant conferences
FPL FPT FCCM FPGA DAC ICCAD Reconfig RTSS RTAS ISCA Micro Super Computing HPCA IPDPS

558 Applications on FPGA: Low-level
Implement circuit in VHDL (Verilog) Simulate compiled VHDL Synthesis VHDL into a device independent format Map device independent format to device specific resources Check that device has enough resources for the design Place resources onto physical device locations Route (connect) resources together Completely routed Circuit meets specified performance Download configuration file (bit-steam) to the FPGA

559 Applications on FPGA: Low-level
Implement Simulate Synthesize Map Place Route Download

560 (Technology) Map Translate device independent net list to device specific resources

561 (Technology) Map Translate device independent net list to device specific resources

562 (Technology) Map Translate device independent net list to device specific resources

563 (Technology) Map Translate device independent net list to device specific resources

564 Applications on FPGA: Low-level
Implement Simulate Synthesize Map Place Route Download

565 Place Bind each mapped resource to a physical device location
User Guided Layout (Chapter 16:Reconfigurable Computing) General Purpose (Chapter 14:Reconfigurable Computing) Simulated Annealing Partition-based Structured Guided (Chapter 15:Reconfigurable Computing) Data Path based Heuristics used No efficient means for finding an optimal solution

566 Place (High-level) Netlist from technology mapping in A in B in C RAM
LUT D DFF F DFF G clk out

567 Place (High-level) Netlist from technology mapping
FPGA physical layout I/O I/O I/O I/O in A in B in C I/O LUT BRAM I/O LUT RAM E DFF F I/O I/O LUT D LUT I/O I/O LUT I/O LUT BRAM I/O DFF G LUT I/O I/O clk LUT I/O LUT I/O out I/O I/O I/O I/O

568 Place (High-level) Netlist from technology mapping
FPGA physical layout clk in C out I/O in A in B in C In A LUT G E I/O D F RAM E In B I/O LUT D DFF F LUT I/O I/O LUT I/O LUT BRAM I/O DFF G LUT I/O I/O clk LUT I/O LUT I/O out I/O I/O I/O I/O

569 Place User Guided Layout (Chapter 16:Reconfigurable Computing
General Purpose (Chapter 14:Reconfigurable Computing) Simulated Annealing Partition-based Structured Guided (Chapter 15:Reconfigurable Computing) Data Path based

570 Place (User-Guided) User provide information about applications structure to help guide placement Can help remove critical paths Can greatly reduce amount of time for routing Several methods to guide placement Fixed region Floating region Exact location Relative location

571 Place (User-Guided): Examples
FPGA LUT D DFF F G Part of Map Netlist Fixed region

572 Place (User-Guided): Examples
FPGA LUT D DFF F G Part of Map Netlist Fixed region SDRAM

573 Place (User-Guided): Examples
FPGA Floating region Softcore Processor

574 Place (User-Guided): Examples
FPGA Exact Location LUT D DFF F G Part of Map Netlist LUT BRAM LUT LUT LUT LUT LUT LUT LUT LUT BRAM LUT LUT LUT LUT LUT LUT LUT

575 Place (User-Guided): Examples
FPGA Exact Location LUT D DFF F G Part of Map Netlist LUT BRAM LUT LUT LUT G LUT D F LUT LUT LUT BRAM LUT LUT LUT LUT LUT LUT LUT

576 Place (User-Guided): Examples
FPGA Relative Location LUT D DFF F G Part of Map Netlist LUT BRAM LUT LUT G D F LUT LUT LUT LUT LUT LUT BRAM LUT LUT LUT LUT LUT LUT LUT

577 Place (User-Guided): Examples
FPGA Relative Location LUT D DFF F G Part of Map Netlist LUT BRAM LUT LUT LUT LUT LUT LUT G D F LUT LUT BRAM LUT LUT LUT LUT LUT LUT LUT

578 Place (User-Guided): Examples
FPGA Relative Location LUT D DFF F G Part of Map Netlist LUT BRAM LUT LUT LUT LUT LUT LUT LUT LUT BRAM LUT LUT LUT G D F LUT LUT LUT LUT

579 Place User Guided Layout (Chapter 16:Reconfigurable Computing
General Purpose (Chapter 14:Reconfigurable Computing) Simulated Annealing Partition-based Structured Guided (Chapter 15:Reconfigurable Computing) Data Path based

580 Place (General Purpose)
Characteristics: Places resources without any knowledge of high level structure Guided primarily by local connections between resources Drawback: Does not take explicit advantage of applications structure Advantage: Typically can be used to place any arbitrary circuit

581 Place (General Purpose)
Preprocess the mapped netlist using clustering: group netlist components that have local connectivity into a single logic block. Clustering helps to reduce the number of objects a placement algorithm has to explicitly place.

582 Place (General Purpose)
Placement using simulated annealing Based on the physical process of annealing used to create metal alloys

583 Place (General Purpose)
Simulated annealing basic algorithm:
Placement_cur = Initial_Placement
T = Initial_Temperature
While (not exit criteria 1)
    While (not exit criteria 2)
        Placement_new = Modify_placement(Placement_cur)
        ∆Cost = Cost(Placement_new) – Cost(Placement_cur)
        r = random(0,1)
        If r < e^(-∆Cost / T), Then Placement_cur = Placement_new
    End loop
    T = UpdateTemp(T)
End loop
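A runnable Python sketch (mine, using a made-up wirelength cost and not enforcing one block per placement site) of the loop above: a block is moved to a random site each step, and worse placements are accepted with probability e^(-∆Cost/T).

import math, random

GRID = 8                                          # GRID x GRID placement sites
nets = [(0, 1), (1, 2), (2, 3), (3, 0), (1, 3)]   # made-up block connectivity
blocks = list(range(4))

def cost(place):
    # Total Manhattan wirelength over all nets (a stand-in cost function).
    return sum(abs(place[a][0] - place[b][0]) + abs(place[a][1] - place[b][1])
               for a, b in nets)

place = {b: (random.randrange(GRID), random.randrange(GRID)) for b in blocks}
T = 10.0
while T > 0.01:                              # exit criteria 1: temperature floor
    for _ in range(100):                     # exit criteria 2: moves per temperature
        b = random.choice(blocks)
        new_place = dict(place)
        new_place[b] = (random.randrange(GRID), random.randrange(GRID))
        delta = cost(new_place) - cost(place)
        # Accept improvements always; accept worse moves with probability e^(-delta/T).
        if delta <= 0 or random.random() < math.exp(-delta / T):
            place = new_place
    T *= 0.9                                 # UpdateTemp: geometric cooling schedule
print(cost(place), place)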

584 Place (General Purpose)
Simulated annealing: Illustration FPGA LUT BRAM LUT LUT X A LUT LUT Z B BRAM LUT LUT G D F LUT LUT LUT LUT LUT

585 Place (General Purpose)
Simulated annealing: Illustration FPGA LUT BRAM LUT LUT LUT LUT LUT G LUT Z B BRAM X A LUT LUT F LUT LUT D LUT

586 Place (General Purpose)
Simulated annealing: Illustration FPGA LUT BRAM LUT LUT X A LUT LUT Z B BRAM LUT LUT G D F LUT LUT LUT LUT LUT

587 Place (General Purpose)
Simulated annealing: Illustration FPGA LUT BRAM LUT LUT X LUT LUT A Z B BRAM LUT LUT G D F LUT LUT LUT LUT LUT

588 Place (General Purpose)
Simulated annealing: Illustration FPGA LUT BRAM LUT LUT LUT LUT X A Z B BRAM LUT LUT G D F LUT LUT LUT LUT LUT

589 Place User Guided Layout (Chapter 16:Reconfigurable Computing
General Purpose (Chapter 14:Reconfigurable Computing) Simulated Annealing Partition-based Structured Guided (Chapter 15:Reconfigurable Computing) Data Path based

590 Place (Structured-based)
Leverage the structure of the application. Algorithms may work well for a given structure, but will likely give unacceptable results for a design with little regular structure.

591 Structure high-level example

592 Applications on FPGA: Low-level
Implement Simulate Synthesize Map Place Route Download

593 Route Connect placed resources together Two requirements
Design must be completely routed; routed design meets timing requirements. Widely used algorithm “PathFinder”: McMurchie and Ebeling, PathFinder (FPGA’95); Reconfigurable Computing (Chapter 17), Scott Hauck, André DeHon (2008)

594 Route: Route FPGA Circuit

595 Route (PathFinder) PathFinder: A Negotiation-Based Performance-Driven Router for FPGAs (FPGA’95). Basic PathFinder algorithm: based closely on Dijkstra’s shortest path, but weights are assigned to nodes instead of edges

596 Route (PathFinder): Example
G = (V,E) Vertices V: set of nodes (wires) Edges E: set of switches used to connect wires Cost of using a wire: c_n = (b_n + h_n) * p_n S1 S2 S3 3 2 1 4 1 3 1 A B C 4 1 1 3 1 3 2 D1 D2 D3
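The sketch below (an added illustration on a made-up two-signal graph, not the slides' S/D example) shows the negotiated-congestion idea behind PathFinder: Dijkstra-style search over node costs c_n = (b_n + h_n) * p_n, with the present-sharing term p_n and the history term h_n updated across iterations.

import heapq

def node_cost(n, base, hist, present):
    # PathFinder node cost: c_n = (b_n + h_n) * p_n
    return (base[n] + hist[n]) * present[n]

def dijkstra(graph, src, dst, base, hist, present):
    """Shortest path over node costs (a node's cost is paid when it is entered)."""
    dist, prev, heap = {src: 0.0}, {}, [(0.0, src)]
    while heap:
        d, n = heapq.heappop(heap)
        if n == dst:
            break
        if d > dist.get(n, float('inf')):
            continue
        for m in graph[n]:
            nd = d + node_cost(m, base, hist, present)
            if nd < dist.get(m, float('inf')):
                dist[m], prev[m] = nd, n
                heapq.heappush(heap, (nd, m))
    path, n = [dst], dst
    while n != src:
        n = prev[n]
        path.append(n)
    return path[::-1]

# Toy routing graph: two signals compete for the cheap middle wire 'B'.
graph = {'S1': ['A', 'B'], 'S2': ['B', 'C'], 'A': ['D1'], 'B': ['D1', 'D2'],
         'C': ['D2'], 'D1': [], 'D2': []}
base = {'S1': 0, 'S2': 0, 'A': 3, 'B': 1, 'C': 3, 'D1': 1, 'D2': 1}
signals = [('S1', 'D1'), ('S2', 'D2')]

hist = {n: 0.0 for n in graph}
for iteration in range(5):
    present = {n: 1.0 for n in graph}        # p_n grows as signals share a node
    routes = []
    for src, dst in signals:
        path = dijkstra(graph, src, dst, base, hist, present)
        routes.append(path)
        for n in path[1:]:
            present[n] += 1.0                # later signals see congested nodes as costlier
    shared = [n for n in graph if present[n] > 2.0]
    for n in shared:
        hist[n] += 1.0                       # history cost discourages repeated congestion
    if not shared:
        break                                # routed with no sharing
print(routes)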

597 Route (PathFinder): Example
Simple node cost cn = bn Obstacle avoidance Note order matters S1 S2 S3 3 2 1 4 1 3 1 A B C 4 1 1 3 1 3 2 D1 D2 D3

598 Route (PathFinder): Example
cn = b * p p: sharing cost (function of number of signals sharing a resource) Congestion avoidance S1 S2 S3 3 2 1 4 1 3 1 A B C 4 1 1 3 1 3 2 D1 D2 D3

599 Route (PathFinder): Example
cn = (b + h) * p h: history of previous iteration sharing cost Congestion avoidance S1 S2 S3 2 1 2 1 1 A B C 1 2 2 1 1 D1 D2 D3

600 Route (PathFinder): Example
cn = (b + h) * p h: history of previous iteration sharing cost Congestion avoidance S1 S2 S3 2 1 2 1 1 A B C 1 2 2 1 1 D1 D2 D3

601 Route (PathFinder): Example
cn = (b + h) * p h: history of previous iteration sharing cost Congestion avoidance S1 S2 S3 2 1 2 1 1 A B C 1 2 2 1 1 D1 D2 D3

602 Route (PathFinder): Example
cn = (b + h) * p h: history of previous iteration sharing cost Congestion avoidance S1 S2 S3 2 1 2 1 1 A B C 1 2 2 1 1 D1 D2 D3

603 Route (PathFinder): Example
cn = (b + h) * p h: history of previous iteration sharing cost Congestion avoidance S1 S2 S3 2 1 2 1 1 A B C 1 2 2 1 1 D1 D2 D3

604 Route (PathFinder): Example
cn = (b + h) * p h: history of previous iteration sharing cost Congestion avoidance S1 S2 S3 2 1 2 1 1 A B C 1 2 2 1 1 D1 D2 D3

605 Route (PathFinder): Example
cn = (b + h) * p h: history of previous iteration sharing cost Congestion avoidance S1 S2 S3 2 1 2 1 1 A B C 1 2 2 1 1 D1 D2 D3

606 Applications on FPGA: Low-level
Implement Simulate Synthesize Map Place Route Download

607 Download Convert routed design into a device configuration file (e.g. bitfile for Xilinx devices)

608 Next Lecture Project presentations

609 Questions/Comments/Concerns
Write down Main point of lecture One thing that’s still not quite clear If everything is clear, then give an example of how to apply something from lecture OR

610 Place (Structured-based)
Leverage the structure of the application. Algorithms may work well for a given structure, but will likely give unacceptable results for a design with little regular structure. GLACE “A Generic Library for Adaptive Computing Environments” (FPL 2001) is an example tool that takes the structure of an application into account. FLAME (Flexible API for Module-based Environments), JHDL (from BYU), Gen (from Lockheed-Martin Advanced Technology Laboratories)

611 GLACE: High-level

612 GLACE: Flow

613 GLACE: Library Modules

614 GLACE: Data Path and Control Path

615 GLACE: FLAME low-level

616 GLACE: Final placement example

