The Future of FPGA Interconnect
Guy Lemieux, The University of British Columbia
FPT 2009 Workshop, Tuesday, December 8, 2009
Getting the LUT-heads to work…

Layman’s viewpoint
How do I explain FPGA interconnect to mom? Imagine planning a city on a grid:
– Maximum of 100,000 people, the “LUT-heads” (a “logic family”)
– For every LUT-head, you are given two things:
    – Home location
    – Work location (often multiple work locations…)
Problem: Getting the LUT-heads to work!
– Design a fixed road network
– Every LUT-head drives in their own lane (no time-sharing or bus)
– Very expensive, lots of infrastructure

Layman’s viewpoint (2)
Problem, Version 2
– After 25 years, every LUT-head changes home & work
    – The LUT-head population may grow or shrink
– The same road network must still be used
    – Can only ‘reconfigure lanes’ by changing the road paint
Problem, Version 3
– Start over, assuming 1,000,000 LUT-heads
– New issues when the problem scales?
    – Average trip length?
    – Average number of lanes per road?
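
Problem, Version 3 asks how the average trip length changes with scale. A minimal sketch, assuming the worst case of no locality at all: home and work are placed independently and uniformly at random on a square grid with one LUT-head per cell (both assumptions are ours; real placement tools exploit locality, so real trips are shorter). Under these assumptions the mean Manhattan trip on an n × n grid is 2(n² − 1)/(3n), which grows with the square root of the population.

```python
from fractions import Fraction
from itertools import product


def avg_trip_bruteforce(n: int) -> Fraction:
    """Mean Manhattan distance between two cells drawn independently and
    uniformly from an n x n grid, by enumerating every pair of coordinates."""
    coords = range(n)
    total_1d = sum(abs(a - b) for a, b in product(coords, coords))
    mean_1d = Fraction(total_1d, n * n)
    # Manhattan distance is |dx| + |dy|; x and y are independent, so the means add.
    return 2 * mean_1d


def avg_trip_formula(n: int) -> Fraction:
    """Closed form of the same quantity: 2 * (n^2 - 1) / (3n)."""
    return Fraction(2 * (n * n - 1), 3 * n)


if __name__ == "__main__":
    for people in (100_000, 1_000_000):
        side = round(people ** 0.5)  # square grid, one LUT-head per cell (assumed)
        print(f"{people:>9} LUT-heads: grid {side} x {side}, "
              f"mean trip ~ {float(avg_trip_formula(side)):.1f} tiles")
```

Going from 100,000 to 1,000,000 LUT-heads makes the mean trip about √10 ≈ 3.2 times longer under these assumptions.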

Overview
– What’s in FPGA interconnect?
    – Review of typical design
– What are the main application areas?
    – Driving the future of interconnect design
– What are the interconnect metrics?
    – Pushing the envelope, then becoming practical
– Open research problems?
    – Driving the future of interconnect design

Input connections
(Figure: Altera Stratix interconnect, showing a CLB (aka LAB), C Block and S Block)

Input connections
(Figure: Altera Stratix interconnect; IIB = input interconnect block)

Input connections, neighbours (1)
(Figure: C Block and S Block)
– Connections in the CLB grow bigger

Input connections, neighbours (2)
(Figure: C Block and S Block)
– Connections in the C Block grow bigger

Output connections, local
(Figure: Altera Stratix interconnect, C Block and S Block)
– Single-driver: LUT outputs must only feed muxes

Output connections, global
(Figure: Altera Stratix interconnect, C Block and S Block)
– Single-driver: LUT outputs must only feed muxes
– Muxes extended to include LUT outputs

Design considerations
Design of C Block / IIB
– Selects LUT inputs
– Overall function: ‘M’ choose ‘kN’
    – M = 100..500 wires (H + V)
    – N = 8..16 LUTs
    – k = 4..6 inputs/LUT
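
To get a feel for the size of the ‘M’ choose ‘kN’ function, the sketch below counts the switches a full crossbar would need and the number of distinct wire subsets the C Block / IIB could deliver. It assumes LUT inputs are interchangeable (so only the set of selected wires matters) and that each pin takes a distinct wire; the parameter values are taken from the ranges above, and the function name is ours.

```python
from math import comb, log2


def cblock_size(M: int, N: int, k: int) -> tuple[int, int, int]:
    """Return (output pins, full-crossbar switches, distinct wire selections)
    for a C Block / IIB feeding N LUTs of k inputs from M routing wires."""
    pins = k * N                    # LUT input pins the block must drive
    full_crossbar = M * pins        # one switch per (wire, pin) pair
    selections = comb(M, pins)      # 'M choose kN': which wires reach the LUTs
    return pins, full_crossbar, selections


if __name__ == "__main__":
    for M, N, k in ((100, 8, 4), (500, 16, 6)):
        pins, switches, sel = cblock_size(M, N, k)
        print(f"M={M}, N={N}, k={k}: {pins} pins, {switches} switches "
              f"for a full crossbar, about 2^{log2(sel):.0f} selections")
```

Sparse patterns try to keep as much of this selection freedom as possible while using far fewer than M × kN switches.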

Design considerations
Design of S Block
– Steers M signals throughout the array (turns)
    – Also accepts N LUT outputs
– Topologically simple
    – Fs = 3: each wire connects to only 3 outgoing wires
    – Exception: LUT outputs connect to > 3 wires
– Strongly influenced by circuit implementation
    – Bidirectional vs directional
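
The Fs = 3 rule can be made concrete with one common textbook pattern, the disjoint (subset) switch block, in which track t on each side connects only to track t on the other three sides. The slides do not say which pattern is used, so this choice is an assumption, and the LUT-output exception is not modelled. A minimal sketch that builds the pattern and checks Fs:

```python
from collections import defaultdict

SIDES = ("N", "E", "S", "W")


def disjoint_switch_block(W: int) -> dict:
    """Map each wire end (side, track) to the set of wire ends it can drive.
    Disjoint pattern (assumed): track t connects only to track t elsewhere."""
    adj = defaultdict(set)
    for t in range(W):
        for a in SIDES:
            for b in SIDES:
                if a != b:
                    adj[(a, t)].add((b, t))
    return dict(adj)


def switch_flexibility(adj: dict) -> set:
    """Set of distinct Fs values seen across all wire ends."""
    return {len(peers) for peers in adj.values()}


if __name__ == "__main__":
    sb = disjoint_switch_block(W=4)
    print(len(sb), "wire ends, Fs values:", switch_flexibility(sb))
```

A known drawback of the disjoint pattern is that a signal entering on track t can never leave that track, which splits the channel into W separate routing domains; patterns such as Wilton's permute tracks at turns to avoid this while keeping Fs = 3.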

Array of CLBs, C and S Blocks

Bidirectional vs. Directional Wiring
– bidir/dir == S Block design
– + single-driver == C Block design

Bidirectional Wires
(Figure: logic, C Block and S Block)

Bidirectional Wires
– Problem: half of the tristate buffers are left unused
– Buffers + input muxes dominate interconnect area

Bidirectional vs Directional

Bidirectional Switch Block

Directional Switch Block

Bidirectional vs Directional
– Switch Element: same quantity and type of circuit elements, twice the wiring
– Switch Block: directional needs half as many Switch Elements

Quantization of Channel Width
– Bidirectional (Q = 1): 4 Switch Elements, channel width = 4 × Q = 4 × 1 = 4
– Directional (Q = 2): 2 Switch Elements, channel width = 2 × Q = 2 × 2 = 4
– No “partial” switch elements with < Q wires

S Blocks with Long Wires
– Long wires span L tiles
    – Example: L = 3
– Changes Q:
    – Q = L for bidirectional
    – Q = 2L for directional
(Figure: long wires spanning tiles 1, 2, 3 of a CLB row)
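
The quantization rule turns into a simple round-up: a channel can only grow in whole switch elements of Q wires each. A minimal sketch using the slides' values Q = L (bidirectional) and Q = 2L (directional); the function names are ours:

```python
def quantum(L: int, directional: bool) -> int:
    """Wires per switch element: Q = L (bidirectional) or Q = 2L (directional)."""
    return 2 * L if directional else L


def legal_channel_width(w_needed: int, L: int, directional: bool) -> tuple[int, int]:
    """Round a required channel width up to whole switch elements, since
    partial switch elements with fewer than Q wires are not allowed.
    Returns (number of switch elements, resulting channel width)."""
    Q = quantum(L, directional)
    n_se = -(-w_needed // Q)        # ceiling division
    return n_se, n_se * Q


if __name__ == "__main__":
    print(legal_channel_width(4, L=1, directional=False))   # (4, 4)
    print(legal_channel_width(4, L=1, directional=True))    # (2, 4)
    print(legal_channel_width(10, L=3, directional=True))   # (2, 12)
```

Because area grows linearly with channel width, the wires added by rounding up (here 12 − 10 = 2 for L = 3, directional) cost area even if no net needs them.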

Building up Long Wires: Start with One Switch Element
– Wire ends for straight connections

Building up Long Wires: Connect MUX Inputs
– Extend MUX inputs

Building up Long Wires: Connect MUX Inputs
– TURN UP from wire-ends to mux

Building up Long Wires: Connect MUX Inputs
– TURN DOWN from wire-ends to mux

Building up Long Wires: Add +2 More Wires (4 total)
– Add LONG WIRES, turning UP and DOWN

Building up Long Wires: Add +2 More Wires (6 total)
– Add LONG WIRES, turning UP and DOWN

Building up Long Wires: Twisting to Next Tile
– Add wire twisting
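
One way to picture the twisting step is as a rotation of track positions: each time a length-L wire crosses into the next tile it moves over by one track, so a wire that started s tiles ago sits on track s. Under that assumed indexing (the slides show the idea graphically, not this scheme), every tile has exactly the same layout, with one wire starting and one wire ending per direction. A minimal sketch for one direction of one switch element; the function names are hypothetical:

```python
def wire_on_track(tile: int, track: int) -> int:
    """Start tile of the wire occupying `track` at `tile`, assuming a wire
    shifts up one track per tile (track 0 = wire starting in this tile)."""
    return tile - track


def wire_path(start: int, L: int) -> list[tuple[int, int]]:
    """(tile, track) positions of a length-L wire that starts at `start`."""
    return [(start + t, t) for t in range(L)]


if __name__ == "__main__":
    L = 3
    for tile in range(4, 7):
        starts = [wire_on_track(tile, t) for t in range(L)]
        # Same pattern in every tile: track 0 starts a wire, track L-1 ends one.
        print(f"tile {tile}: wires by start tile on tracks 0..{L - 1}: {starts}")
```

Because the pattern repeats identically in every tile, only one tile layout needs to be drawn, which is what makes the twisted arrangement attractive for a tiled FPGA.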

Full S Block with Long Wires
– Using one L = 3 Switch Element (Q = 2L = 6)

Scaling Channel Width Using L = 3 Switch Element
– 1 Switch Element: channel width = Q = 6
– 2 Switch Elements: channel width = 2Q = 12
– VERY IMPORTANT: area growth is linear with channel width

Long Wires → Changes Quantum
– Long wires span L tiles
    – Example: L = 3
– Q = L for bidirectional
– Q = 2L for directional
(Figure: long wires spanning tiles 1, 2, 3 of a CLB row)

Multi-driver Wiring
– Logic outputs use tristate buffers (C Block)
– Directional & multi-driver wiring
(Figure: CLB, C Block and S Block)

Single-driver Wiring
– Logic outputs use muxes (S Block)
– Directional & single-driver wiring
– New connectivity constraint
(Figure: CLB and S Block)

Directional, Single-driver Benefits
Average improvements:
– 0% channel width
– 9% delay
– 14% tile length of physical layout
– 25% transistor count
– 32% area-delay product
– 37% wiring capacitance
Any reason to use bidir?
– Important implications for future interconnect!

C Block design
(Figure: C Block)

C Block design
– M inputs (100 … 500)
– Up to kN outputs (4×8 … 8×10)

C Block design

C Block design
Sparse crossbar
– Similar # of switchpoints
    – On inputs
    – On outputs
– Spread-out pattern
    – Two columns have maximum Hamming distance (most # of different switch points)
    – True for all pairs of columns
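
Both goals on the slide can be checked mechanically for a candidate switch pattern. A minimal sketch, where the two 6-wire × 4-pin patterns are made-up examples: both place the same number of switches on every input and every output, but only the spread one keeps every pair of columns far apart.

```python
from itertools import combinations


def from_columns(n_in: int, col_sets: list[set[int]]) -> list[list[int]]:
    """Build a 0/1 switch matrix: rows = input wires, columns = output pins."""
    return [[1 if i in s else 0 for s in col_sets] for i in range(n_in)]


def evaluate(rows: list[list[int]]) -> tuple[list[int], list[int], int]:
    """Switches per input, switches per output, and the minimum Hamming
    distance over all pairs of output columns."""
    cols = [tuple(r[j] for r in rows) for j in range(len(rows[0]))]
    per_input = [sum(r) for r in rows]
    per_output = [sum(c) for c in cols]
    min_dist = min(sum(x != y for x, y in zip(a, b))
                   for a, b in combinations(cols, 2))
    return per_input, per_output, min_dist


if __name__ == "__main__":
    # 6 wires x 4 pins; 3 switches per pin and 2 per wire in both patterns.
    clustered = [{0, 1, 2}, {0, 1, 2}, {3, 4, 5}, {3, 4, 5}]
    spread = [{0, 1, 2}, {0, 3, 4}, {1, 3, 5}, {2, 4, 5}]
    for name, cols in (("clustered", clustered), ("spread", spread)):
        print(name, evaluate(from_columns(6, cols)))
```

In the spread pattern every pair of pins shares exactly one wire, giving a minimum column distance of 4; the clustered pattern has identical columns (distance 0) despite equally balanced switch counts.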

What are the main application areas?
What are FPGAs used for?
– A long, long time ago… small glue logic
Modern…
– Internet routers (table lookups, multiplexing)
– Embedded systems design (NIOS II, MicroBlaze)
– Cell phone basestations (communications DSP)
– HDTV sets / set-top boxes (video/image DSP)
Future?

Application drivers
What we know
– FPGAs increasingly more powerful, at constant cost
– ASIC design costs escalating wildly
    – Most ASICs use older technology (0.18/0.13 µm)
    – Increasingly, ASICs implemented as FPGAs instead
– FPGAs used only in low volume
    – E.g., being designed out of HDTV sets
Extrapolate to find new emerging markets …

Application drivers (2)
Extrapolating…
– Industrial/scientific instruments: low volume, high margin
    – Medical sensing, imaging (ultrasound, PET, …)
    – Electronics test & measurement (router tester, …)
    – Physics (neutrino detection, …)
– Computation: mixed volume, mixed margin
    – Computer system simulation (RAMP, …)
    – Molecular dynamics, financial modeling, seismic / oil & gas
– Portable/handheld: mixed volume, mixed margin
    – Consumer
    – Industrial/Medical

Application drivers (3)
Problems with FPGAs
– Expensive for high-volume markets
    – Need a cost-reduction strategy
– Insufficient capacity
    – Could just wait for Moore’s Law to catch up
    – Capture emerging markets early: ultra-capacity FPGA
– Hard to program
    – Particularly important when used for computation
    – Domain-specific languages help
– Power
– Slow

Interconnect metrics
Typical
– Area
– Delay (latency)
– Power
Obscure, but important!
– Co$t
– Bandwidth
– Programmability/Ease of use
– Reliability/Integrity
– Flexibility/Runtime reconfigurability

Pushing the envelope
Research is about discovery, ideas, exploration
– Also evaluation, limit studies, potential uses
One general research strategy
– Pick a metric
– Push the envelope
    – How far did you get?
– Back off until practical
– Re-integrate with reality

Pushing the envelope (2)
Example: Area
– Cyclone/Spartan are low-cost (low-area) FPGAs
Push area to the limits?
– Reduce every routing buffer to a 1x inverter
– Extensive use of pass-transistor switches
– Reduce connectivity, force sparse logic
– Bit-serial logic + routing for datapath
How small can we get?
– Is this practical? Is there a market? Is it cost-effective?
– Increased parallelism? Prototype future FPGA designs now?

Pushing the envelope (3)
Example: Bandwidth
– Virtex/Stratix are high-performance FPGAs
Push bandwidth to the limits?
– E.g., pipeline every routing wire / switch
– Use registers or wave-pipelining
How much throughput can we get?
– Wave-pipelining: ~5 Gbps in 65 nm [FPGA2009]
– Is this practical? Is there a market?

Pushing the envelope (4)
Example: Flexibility/Runtime reconfigurability
– Limited reconfigurability on Xilinx, none on Altera
Push flexibility/RTR to the limits?
– Note: not a naïve “fully connected” graph
– Every switch is dynamically addressable, reconfigurable
– Every route has an alternative/backup
What can we gain?
– Choose-your-own-adventure routing [FPGA2009]
– Improved NoC-on-FPGA (?)
– Is this practical? Is there a market?

Pushing the envelope (5)
Pushing the envelope for other metrics
– Power [Kaptanoglu, keynote FPT2007]
– Co$t (area?)
– Programmability/Ease of use (a CPU?)
– Reliability/Integrity (built-in TMR & Razor?)

Pushing the envelope (5’)
Pushing the envelope for other metrics
– Power [Kaptanoglu, keynote FPT2007] → Portable/handheld
– Co$t (area?) → Portable/handheld, computation
– Programmability/Ease of use (a CPU?) → Computation
– Reliability/Integrity (built-in TMR & Razor?) → Scientific/industrial instruments
Markets exist for these metrics!

Open research problems
– Defect tolerance
– IIB design
    – Hard core integration
– Memory-footprint / Runtime optimized
– Performance guarantees
– Layout-aware methods
– Efficient datapaths
– Expose the muxes
– Low-latency, area-efficient repeaters/switches

Open research problems (2)
Defect tolerance
– Future semiconductor technologies expected to be less reliable
– Interconnect has built-in redundancy (by design)
Issues
– Defect localization
– Delay-oriented defects
– Abstraction suitable for CAD or bitstream load
– Intentional redundancy: how, where, quantity

Open research problems (3)
IIB (input interconnect block) design
– Function: ‘M’ choose ‘kN’
– Conserve ‘switchpoints’, area (# muxes, mux size), delay (levels)
– Maximize ‘entropy’ == # of unique functional configurations
    – Are some configurations more important than others?
    – How to count # of configurations?
– Generally, a difficult topological design problem
    – Most promising: ‘type 3’ IIB [TRETS2008] ≈ Clos network?
(Figure: IIB with M inputs and kN outputs)
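
One possible answer to ‘how to count # of configurations?’ for a small block is brute force: treat the IIB as a bipartite graph and count the subsets of kN wires that can all be delivered to distinct pins at once, which is a bipartite-matching check. This assumes LUT inputs are interchangeable, so only the delivered set of wires matters; it is our illustration, not the metric used in [TRETS2008]. The example patterns are made up (6 wires, 4 pins, 3 switches per pin):

```python
from itertools import combinations


def can_route(wires, pins):
    """True if every wire in `wires` can reach a distinct output pin.
    pins[j] is the set of wires that output pin j can select.
    Uses augmenting paths (Kuhn's bipartite matching algorithm)."""
    owner = {}  # pin index -> wire currently assigned to it

    def assign(wire, visited):
        for j, allowed in enumerate(pins):
            if wire in allowed and j not in visited:
                visited.add(j)
                # Take a free pin, or evict its owner if the owner can move.
                if j not in owner or assign(owner[j], visited):
                    owner[j] = wire
                    return True
        return False

    return all(assign(w, set()) for w in wires)


def count_configurations(n_wires, pins):
    """Number of distinct wire subsets (one wire per pin) the block can deliver."""
    return sum(can_route(s, pins)
               for s in combinations(range(n_wires), len(pins)))


if __name__ == "__main__":
    full = [set(range(6))] * 4                          # 24 switches
    clustered = [{0, 1, 2}, {0, 1, 2}, {3, 4, 5}, {3, 4, 5}]  # 12 switches
    spread = [{0, 1, 2}, {0, 3, 4}, {1, 3, 5}, {2, 4, 5}]     # 12 switches
    for name, pins in (("full", full), ("clustered", clustered), ("spread", spread)):
        print(name, count_configurations(6, pins), "of 15")
```

Under this count the spread pattern delivers all 15 possible 4-wire subsets with the same 12 switches that leave the clustered pattern at 9, the kind of ‘entropy’ gap the slide asks designers to maximize.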

Open research problems (4)
Hard core integration
– Heterogeneous instance of the IIB design problem
Issues
– Each hard core has a different # of inputs, # of outputs
    – Complicates uniformity
– Some have a large # of inputs, outputs
    – Creates congestion ‘pinch points’
    – Need to design for ‘worst case’ routability
– Would prefer ‘average case’

Open research problems (5)
Memory-footprint / Runtime optimized
– Architecture graph
– Netlist search graph
Issues
– Entire architecture graph is huge, static
– Netlist search graph is dynamic, alloc/dealloc
– Random pointer-chasing
– Cache-unfriendly, cache-DRAM bandwidth
– Can architecture changes make improvements?
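
The pointer-chasing issue is often attacked by flattening the static architecture (routing-resource) graph into arrays. A minimal sketch of a compressed sparse row (CSR) layout, in which all fan-out edges of node i sit contiguously in one array; CSR is a generic technique named here as an illustration, not something the slides prescribe, and the class name is hypothetical:

```python
from array import array


class CSRGraph:
    """Static routing-resource graph in compressed sparse row form:
    the fan-out of node i is targets[offsets[i]:offsets[i + 1]]."""

    def __init__(self, num_nodes: int, edges: list[tuple[int, int]]):
        counts = [0] * (num_nodes + 1)
        for src, _ in edges:
            counts[src + 1] += 1
        for i in range(num_nodes):          # prefix sum -> start offset per node
            counts[i + 1] += counts[i]
        self.offsets = array("i", counts)
        fill = counts[:-1]                  # next free slot per node (a copy)
        targets = [0] * len(edges)
        for src, dst in edges:
            targets[fill[src]] = dst
            fill[src] += 1
        self.targets = array("i", targets)

    def fanout(self, node: int) -> array:
        """Contiguous slice of the node's successors (no per-node objects)."""
        return self.targets[self.offsets[node]:self.offsets[node + 1]]


if __name__ == "__main__":
    g = CSRGraph(3, [(0, 1), (0, 2), (1, 2), (2, 0)])
    print([list(g.fanout(n)) for n in range(3)])
```

Per-net search state (costs, back-pointers) can likewise live in flat arrays indexed by node id and be reset between nets rather than allocated and freed for each search.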

Open research problems (6)
Performance guarantees
– FPGA routers work well, nobody complains
    – Thank you, PathFinder [McMurchie & Ebeling]
Issues
– Not guaranteed to find a solution (no detection!)
    – Want a ‘Just (unoptimally) route it!’ algorithm
– No performance bounds on metrics
    – Within X% tracks, Y% delay from minimum
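
To make the issue concrete, below is a minimal negotiated-congestion router in the spirit of PathFinder [McMurchie & Ebeling], using the familiar cost form (base + history) × present-congestion penalty. It is heavily simplified and all of it is our assumption: two-terminal nets, unit node capacity, no timing term, and the names (negotiate, pres_fac, hist_fac) and parameter values are hypothetical. It also shows the weakness noted above: when the iteration budget runs out it just returns None, with no proof that the problem is unroutable.

```python
import heapq
from collections import defaultdict


def cheapest_path(adj, src, dst, node_cost):
    """Dijkstra search; a path costs the sum of node_cost over the nodes it
    enters. Assumes dst is reachable and node names are mutually comparable."""
    dist, prev, heap = {src: 0.0}, {}, [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist[u]:
            continue  # stale heap entry
        for v in adj[u]:
            nd = d + node_cost(v)
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]


def negotiate(adj, nets, capacity=1, max_iters=50, pres_fac=0.5,
              pres_growth=1.5, hist_fac=1.0):
    """Rip up and reroute every net each iteration until no node is overused."""
    history = defaultdict(float)
    occupancy = defaultdict(int)
    routes = {}
    for _ in range(max_iters):
        for i, (src, dst) in enumerate(nets):
            for n in routes.get(i, []):          # rip up this net's old route
                occupancy[n] -= 1

            def cost(n):
                overuse = max(0, occupancy[n] + 1 - capacity)
                return (1.0 + history[n]) * (1.0 + pres_fac * overuse)

            routes[i] = cheapest_path(adj, src, dst, cost)
            for n in routes[i]:
                occupancy[n] += 1
        overused = [n for n, occ in occupancy.items() if occ > capacity]
        if not overused:
            return routes                         # legal: no node shared
        for n in overused:                        # remember chronic congestion
            history[n] += hist_fac * (occupancy[n] - capacity)
        pres_fac *= pres_growth                   # make sharing costlier over time
    return None  # gave up; this does NOT prove that no legal routing exists


if __name__ == "__main__":
    # Both nets prefer the short path through "M"; one must negotiate a detour.
    adj = {"sA": ["M", "a1"], "a1": ["a2"], "a2": ["tA"],
           "sB": ["M", "b1"], "b1": ["b2"], "b2": ["tB"],
           "M": ["tA", "tB"], "tA": [], "tB": []}
    print(negotiate(adj, [("sA", "tA"), ("sB", "tB")]))
```

In the toy graph both nets first share node M; the growing history and present-congestion terms then push one net onto its longer detour.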

Open research problems (7)
Layout-aware methods
– Altera, Xilinx know how to lay out interconnect
– 10+ levels of metal, metal-over-switches, integration of switches and logic
Issues
– Arbitrary ‘topology’ graphs not practical to build
– “One size fits all” FPGA diminishing
    – “Application-specific” FPGA likely to arrive
– Automated layout, automated circuit design tools
    – Aware of FPGA architecture / structure

Open research problems (8)
Efficient datapaths
– Multi-bit connections; same source, same sink
– Datapath connections coherent, seemingly simple
– Very common in computation designs
Issues
– No successful datapath circuit-switched architecture
    – Dedicated datapath interconnect only 5–10% smaller
    – Abandon circuit switching? → power
– How wide? 4b, 8b, 32b?
– How to build?
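
A small piece of arithmetic shows what bus-level (datapath) routing could save: if all bits of a w-bit bus follow the same route, one set of mux select bits can steer the whole bus instead of one set per bit. A minimal sketch assuming binary-encoded selects (real FPGA muxes often use other encodings) and made-up route parameters (16:1 muxes, 10 muxes along the route); the function names are ours:

```python
from math import ceil, log2


def select_bits(fan_in: int) -> int:
    """Configuration bits for one binary-encoded fan_in:1 mux (assumed encoding)."""
    return ceil(log2(fan_in)) if fan_in > 1 else 0


def route_config_bits(width: int, fan_in: int, muxes_on_route: int) -> tuple[int, int]:
    """(bits when every bus bit is routed independently,
        bits when one shared configuration steers the whole bus)."""
    per_route = muxes_on_route * select_bits(fan_in)
    return width * per_route, per_route


if __name__ == "__main__":
    for width in (4, 8, 32):
        independent, shared = route_config_bits(width, fan_in=16, muxes_on_route=10)
        print(f"{width:>2}b bus: {independent} vs {shared} configuration bits")
```

Only the configuration memory is shared here; each bit still needs its own mux, buffers and wire, which may be one reason dedicated datapath interconnect comes out only modestly smaller.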

Open research problems (9)
Expose the muxes (1)
– LUTs are terrible for implementing multiplexers
    – 2 × 4-LUTs = 1 × 6-LUT = 4:1 mux
    – Imagine a 54b barrel shifter (IEEE double-precision)
    – 1 CLB ≈ 8 × 6-LUTs ≈ 2 × 16:1 muxes
– Interconnect is full of muxes
    – 1 CLB ≈ 60 × 16:1 muxes
Issues
– How to ‘expose’ interconnect muxes to users?
– Put routing mux select bits under user control
– How to guarantee signal ordering?
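
The barrel-shifter example can be quantified. A minimal sketch (our construction, not from the slides) of a 54b logical right shifter, as used for mantissa alignment, built as log2 stages of 2:1 muxes, plus a rough 6-LUT count that follows the slide's rule of one 4:1 mux per 6-LUT (so two 2:1 stages per LUT level):

```python
def barrel_shift_right(bits: list[int], amount: int) -> tuple[list[int], int]:
    """Logical right shift (zero fill) built from log2 stages of 2:1 muxes.
    bits[0] is the LSB. Returns (shifted bits, number of 2:1 muxes used)."""
    n = len(bits)
    stages = max(1, (n - 1).bit_length())        # ceil(log2(n))
    if not 0 <= amount < (1 << stages):
        raise ValueError("shift amount needs more select bits than there are stages")
    cur, muxes = list(bits), 0
    for s in range(stages):
        step = 1 << s
        sel = (amount >> s) & 1                   # select line of this stage
        nxt = []
        for i in range(n):
            shifted_in = cur[i + step] if i + step < n else 0
            nxt.append(shifted_in if sel else cur[i])   # one 2:1 mux per bit
            muxes += 1
        cur = nxt
    return cur, muxes


def lut6_estimate(n: int) -> int:
    """6-LUTs if each LUT is used as one 4:1 mux (two 2:1 stages per LUT level)."""
    stages = max(1, (n - 1).bit_length())
    return n * -(-stages // 2)


if __name__ == "__main__":
    mantissa = [1, 0] * 27                        # 54 bits, arbitrary pattern
    shifted, muxes = barrel_shift_right(mantissa, 5)
    print(muxes, "2:1 muxes,", lut6_estimate(54), "6-LUTs (estimate)")
```

For 54 bits this gives 6 stages, 324 2:1 muxes, or roughly 162 6-LUTs (about 20 CLBs at 8 6-LUTs each), all of it mux logic of the kind the routing fabric already contains in abundance.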

Open research problems (9’)
Expose the muxes (2)
– Many systems use lots of 32b muxes
    – NIOS, MicroBlaze, NoC, compute engines
– Can we use fast run-time reconfiguration instead of building muxes?
Issues
– How to expose programming bits to the user?
– How to enumerate & pre-place-and-route all configurations?

Summary
Interconnect design is fun and challenging
– Many ‘practical’ issues solved
– Lots of ‘academically interesting’ problems remain
– Can still ‘push the envelope’
Promising open problems
Final thoughts…
– Circuit design ↔ Topology ↔ Layout ↔ CAD
– Architectural models (C block, S block) are restrictive

EOF
