Lecture 13: Mid-term 1 Review October 22, 2013 ECE 636 Reconfigurable Computing Lecture 13 Mid-term I Review.

Lecture 13: Mid-term 1 Review October 22, 2013 ECE 636 Reconfigurable Computing Lecture 13 Mid-term I Review

Lecture 13: Mid-term 1 Review October 22, 2013 SRAM-based FPGA SRAM bits can be programmed many times Each programming bit takes up five transistors Larger device area reduces speed versus EPROM and antifuse. Read or Write Data Q Q Programming Bit I1I2 P1 P2 P3 P4 Out 2-Input LUT

Lecture 13: Mid-term 1 Review October 22, 2013 Field Programmable Gate Array

Lecture 13: Mid-term 1 Review October 22, 2013 Connection Box Flexibility F c -> How many tracks does an input pin connect to? If logic cluster is small, F C is large F C = W If logic cluster is large, F c can be less. -Approximately 0.2W for Xilinx XC4000EX, Virtex Logic Cluster IO pin Tracks Out T0T0 T1T1 T2T2 T0T0 T1T1 T2T2 F C = 3 T0T0 T1T1 T2T2

Lecture 13: Mid-term 1 Review October 22, 2013 Switchbox Flexibility Switch box provides optimized interconnection area. Flexibility found to be not as important as F C Six transistors needed for F S = 3 0 1 0 1 01 01

Lecture 13: Mid-term 1 Review October 22, 2013 Switchbox Issues

Lecture 13: Mid-term 1 Review October 22, 2013 Bidirectional vs Directional

Lecture 13: Mid-term 1 Review October 22, 2013 Directional Wiring: Outputs can use switch block muxes Dir Architecture Single-driver Wiring!!! New connectivity constraint

Lecture 13: Mid-term 1 Review October 22, 2013 Fine-grained Approach For 4-input LUTs 16 bits of information available Can be chained together through programmable network. Decoder and multiplexer an issue. Flexibility is a key aspect. Addr AD AD 16X1 LUT1 LUT2

Lecture 13: Mid-term 1 Review October 22, 2013 Hill Climbing Algorithms To avoid getting trapped in local minima, consider “hill- climbing” approach Need to accept worse solutions or make “bad” moves to get global minima. Acceptance is probabalistic. Only accept cost-increasing moves some of the time. Cost Solution space

Lecture 13: Mid-term 1 Review October 22, 2013 Maze Routing Evaluate shortest feasible paths based on a cost function Like row-based device global route allocates channel bandwidth not specific solutions. Formulate cost function as needed to address desired goal. L L C S

Lecture 13: Mid-term 1 Review October 22, 2013 Routing Tradeoffs Bias router to find first, best route. Vary number of node expansions using: pcost i = (1 – a) x pcost i-1 + ncost i + a x dist i

Lecture 13: Mid-term 1 Review October 22, 2013 Architectural Limitation Routing architecture necessitates domain selection. Bigger effect for multi-fanout nets

Lecture 13: Mid-term 1 Review October 22, 2013 Pathfinder Use a non-decreasing history value to represent congestion. Similarities to multi-commodity flow Can be implemented efficiently but does require substantial run time Only update after an interation. c i = (1 + h n * h fac ) * (1 + p n * p fac ) + b n, n-1

Lecture 13: Mid-term 1 Review October 22, 2013 Bipartitioning Perhaps biggest problem in multi-FPGA design is partitioning Partitioner must deal with logic and pin constraints. Could simultaneously attempt partitioning across all devices. Even “simple” algorithms are O(n 3 ) Better to recursively bipartition circuit.

Lecture 13: Mid-term 1 Review October 22, 2013 KLFM Partitioning Identify nodes to swap to reduce overall cut size Lock moved nodes Algorithm continues until no un-locked node can be moved without violating size constraints Bin 1Bin 2

Lecture 13: Mid-term 1 Review October 22, 2013 KLFM Partitioning Key issue is implementing node costs in lists that can be easily accessed and updated. Many extensions to consider to speed up overall optimization Reasonably easy to implement in software

Lecture 13: Mid-term 1 Review October 22, 2013 Partition Preprocessing: Clustering Identify bin size Choose a seed block (node) Identify node with highest connectivity to join cluster Terminate when cluster size met. In practical terms cluster size of 4 works best

Lecture 13: Mid-term 1 Review October 22, 2013 Clustering Technology mapping before partitioning is typically ineffective since frequently area is secondary to interconnect Frequently bipartitioning continues after unclustering as well. Cluster KLFM unclusterKLFM This allows for additional fine-grain moves.

Lecture 13: Mid-term 1 Review October 22, 2013 Logic Replication Attempt to reduce cutset by replicating logic. Every input of original cell must also input the replicated cell. Replication can either be integrated into the partitioning process or used as a post-process technique.

Lecture 13: Mid-term 1 Review October 22, 2013 Logic Emulation Emulation takes a sizable amount of resources Compilation time can be large due to FPGA compiles One application: also direct ties to other FPGA computing applications.

Lecture 13: Mid-term 1 Review October 22, 2013 Are Meshes Realistic? The number of wires leaving a partition grows with Rent’s Rule P = KG B Perimeter grows as G 0.5 but unfortunately most circuits grow at G B where B > 0.5 Effectively devices highly pin limited What does this mean for meshes?

Lecture 13: Mid-term 1 Review October 22, 2013 Multi-FPGA Software Missing high-level synthesis Global placement and routing similar to intra-device CAD

Lecture 13: Mid-term 1 Review October 22, 2013 Virtual Wires Overcome pin limitations by multiplexing pins and signals Schedule when communication will take place.

Lecture 13: Mid-term 1 Review October 22, 2013 Virtual Wires Software Flow Global router enhanced to include scheduling and embedding. Multiplexing logic synthesized from FPGA logic.

Lecture 13: Mid-term 1 Review October 22, 2013 Why Compiling C is Hard °General Language °Not Designed For Describing Hardware °Features that Make Analysis Hard Pointers Subroutines Linear code °C has no direct concept of time °C (and most procedural languages) are inherently sequential Most people think sequentially. Opportunities primarily lie in parallel data

Lecture 13: Mid-term 1 Review October 22, 2013 Variables °Handel-C has one basic type - integer °May be signed or unsigned °Can be any width, not limited to 8, 16, 32 etc. Variables are mapped to hardware registers. void main(void) { unsigned 6 a; a=45; } 101101 = 0x2da = LSBMSB

Lecture 13: Mid-term 1 Review October 22, 2013 DeepC Compiler Consider loop based computation to be memory limited Computation partitioned across small memories to form tiles Inter-tile communication is scheduled RTL synthesis performed on resulting computation and communication hardware

Lecture 13: Mid-term 1 Review October 22, 2013 DeepC Compiler Parallelizes compilation across multiple tiles Orchestrates communication between tiles Some dynamic (data dependent) routing possible.

Lecture 13: Mid-term 1 Review October 22, 2013 Control FSM Result for each tile is a datapath, state machine, and memory block

Lecture 13: Mid-term 1 Review October 22, 2013 Striped Architecture Same basic approach, pipelined communication, incremental modification Functions as a linear pipeline Each stripe is homogeneous to simplify computation Condition codes allow for some control flexibility FPGA Fabric Control Unit Configuration Cache Configuration Control & Next Addr Address Condition Codes Microprocessor Interface

Lecture 13: Mid-term 1 Review October 22, 2013 Piperench Internals Only multi-bit functional units used Very limited resources for interconnect to neighboring programming elements Place and route greatly simplied

Convolutional Encoder Accepts information bits as a continuous stream Operates on the current b-bit input, where b ranges from 1 to 6 and some number of immediately preceding b-bit inputs to produce V output bits, V > b FF + + 1 01 0 0 b =1, V =2

Definitions Constraint Length Number of successive b-bit groups of information bits for each encoding operation Denoted by K Code Rate (or) Rate b/V Typical values K : 7 Rate : 1/2, 1/3

The Viterbi Algorithm Finds a bit-sequence in the set of all possible transmitted bit-sequences that most closely resembles the received data. Maximum likelihood algorithm Each bit received by decoder associated with a measure of correctness. Practical for short constraint length convolutional codes

00 10 11 01 0/00 1/11 1/01 1/10 0/01 0/11 1/00 0/10 State diagram State Encoder memory Branch k/ij, where i and j represent the output bits associated with input bit k

Trellis Diagram 00 01 10 11 00 11 10 01 10 01 00 10 T=0T=1T=2T=3 ENC IN : 0 1 0 ENC OUT : 00 11 10 RECEIVED: 00 11 11 Accumulated metric 2+2,3+0 : 3 0+1,3+1 : 1 2+0,3+1 : 2 0+1,3+1 : 1 00 3 2 2 31 3 02 1 K = 3 Rate ½ Total number of states = 2 K-1

Adaptive Viterbi Algorithm Motivation Extremely large memory and logic for Viterbi Algorithm Fewer number of paths retained Reduced memory and computation Definitions Path – Bit sequence Path metric or cost – Accumulated error metric of a path Survivor – Path which is retained for the subsequent time step

Adaptive Viterbi Algorithm Criterion for path survival 1. A threshold T is introduced such that a path is retained if and only if current path metric is less than d m +T, where d m is the minimum cost among all survivors of the previous time step. 2. The total number of survivors per time step is limited to a critical number called N max selected by user. Only best N max paths have to be retained at any time.

Trellis Diagram for AVA

Architecture (contd.) Add b1 sum1 b2 sum2 d i < d m + T Count paths Count < N max T = T-2 yes no Update memory yes Elimination of sorting

42 Virtual Router  Independent routing policies for each virtual router  Key challenges Isolation Performance Flexibility Scalability

43 Virtualization using FPGAs  A novel network virtualization substrate which Uses FPGA to implement high performance virtual routers Introduces scalability through virtual routers in host software Exploits reconfiguration to customize hardware virtual routers

44 Partial Reconfiguration  Use partial reconfiguration to independently configure virtual routers

45 Full FPGA Reconfiguration  Two virtual routers (A, B) initially in FPGA  During reconfiguration router A migrated to software, the other eliminated  After reconfiguration two virtual routers (A, B’) again in FPGA Reduced Throughput

46 Partial FPGA Reconfiguration  A remains in hardware and operates at full speed  20X speedup in reconfiguration down time due to partial reconfiguration Sustained Throughput

Lecture 13: Mid-term 1 Review October 22, 2013 ECE 636 Reconfigurable Computing Lecture 13 Mid-term I Review.

Similar presentations

Presentation on theme: "Lecture 13: Mid-term 1 Review October 22, 2013 ECE 636 Reconfigurable Computing Lecture 13 Mid-term I Review."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Lecture 13: Mid-term 1 Review October 22, 2013 ECE 636 Reconfigurable Computing Lecture 13 Mid-term I Review.

Similar presentations

Presentation on theme: "Lecture 13: Mid-term 1 Review October 22, 2013 ECE 636 Reconfigurable Computing Lecture 13 Mid-term I Review."— Presentation transcript:

Similar presentations

About project

Feedback