1
Clustering of Large Designs for Channel-Width Constrained FPGAs
Marvin Tom, Guy Lemieux
University of British Columbia, Department of Electrical and Computer Engineering
Vancouver, BC, Canada
2
Overview
Introduction, Goals and Motivation
–Reduce channel width, lower cost, make circuits “routable”
Reducing Channel Width By Depopulation
Large Benchmark Circuits
New Clustering Technique
–Selective Depopulation
Conclusions and Future Work
3
Mesh-Based FPGA Architecture
Channel width
–Number of routing tracks per channel
Larger FPGA devices: more tiles
–Channel width is fixed
[Figure: mesh of logic blocks (L) separated by routing channels]
4
Motivation: Area of FPGA Devices
Total Layout AREA = SIZE of Layout Tile * Number of Layout Tiles
[Figure: MCNC circuits mapped onto an FPGA]
5
Motivation: Channel Width Demand
Logic range: user buys bigger device.
Interconnect range: user has no choice!
–Devices built for worst-case channel width (fixed width)
–Interconnect cost dominates (>70%)
[Figure: MCNC circuits mapped onto an FPGA]
6
Goal: Reduce Channel Width
Altera Cyclone
–Channel width constraint of 80 routing tracks
Constrained FPGA
–Channel width constraint of 60 routing tracks
–Smaller area, lower cost for low-channel-width circuits
But { apex4, elliptic, frisc, ex1010, spla, pdc } are unroutable…
–Can we make them routable in a Constrained FPGA?
7
Possible Solution
Trade off logic utilization for channel width
–User can always buy more logic… (not more wires)
Trade-off: CLB count for channel width
But… can we achieve lower Total Area? ( = SIZE * CLB Count)
[Figure: logic-block (L) arrays of FPGA 1 and FPGA 2]
8
Logic Element: BLE and CLB
Basic Logic Element (BLE)
–‘k’-input LUT + FF
Clustered Logic Block (CLB)
–‘N’ BLEs, ‘N’ outputs
–‘I’ shared inputs
Note: I < k*N
[Figure: CLB with ‘I’ inputs, ‘N’ outputs, and BLEs #1 to #5]
9
CLB Depopulation
Normally: CLBs fully packed
–Reduces total # of CLBs needed for circuit
CLB Depopulation: Tessier, DeHon
–Do not use all BLEs
–Increase # CLBs used
–Decrease channel width
–Decrease overall area
Problem
–Increase in # CLBs high for large circuits
–Our work: limits # CLB increase
[Figure: CLB with ‘I’ inputs, ‘N’ outputs, and BLEs #1 to #5]
10
Uniform Depopulation
Previous work
–Depopulate each CLB by equal amount
But… circuit observations
–regions of high routing demand
–regions of low routing demand
Depopulate in low congestion areas ??
–Unnecessary increase in area
11
Non-Uniform Depopulation
Our depopulation method:
–Assume congestion is localized
–Depopulate only congested areas
We show non-uniform depopulation
–Effective method of channel width reduction
–Graceful tradeoff between channel width and area
–Makes unroutable circuits routable
12
Depopulation Methods to Reduce Channel Width
13
CLB Depopulation
General Approach
–Use existing clustering tools
–Do not fill CLB while clustering
1. Input-Limited
–E.g., maximum 67% input utilization per CLB
–Might use all BLEs
2. BLE-Limited
–E.g., maximum 60% BLE utilization per CLB
–Might use all inputs
[Figure: CLB with ‘I’ inputs, ‘N’ outputs, and BLEs #1 to #5]
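The BLE-limited option can be illustrated with a minimal sketch. It assumes ideal packing (every CLB is filled exactly up to the limit), which real clustering tools will not always achieve; the function names are hypothetical and are not part of any clustering tool mentioned in the talk.

```python
import math


def ble_limit_from_percent(cluster_size: int, max_percent: int) -> int:
    """BLEs allowed per CLB when utilization is capped at max_percent.

    Integer arithmetic avoids floating-point rounding; at least one BLE
    is always allowed so that packing can make progress.
    """
    return max(1, cluster_size * max_percent // 100)


def clbs_needed(num_bles: int, ble_limit: int) -> int:
    """CLBs required if every CLB holds at most ble_limit BLEs (ideal packing)."""
    if ble_limit < 1:
        raise ValueError("ble_limit must be at least 1")
    return math.ceil(num_bles / ble_limit)


# The slide's 5-BLE CLB with a 60% BLE-utilization cap
limit = ble_limit_from_percent(5, 60)
full = clbs_needed(100, 5)            # fully packed
depopulated = clbs_needed(100, limit)  # BLE-limited
```

Under these assumptions a lower cap always costs CLBs, which is the area side of the trade-off the following slides try to limit.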
14
Reducing Channel Width
Results (max cluster size 16)
Input-Limited
–No channel width control
BLE-Limited
–(Almost) monotonically increasing
–Good channel width control
15
Benchmark Circuit Creation (We want BIG circuits!) (What do REALLY BIG circuits look like?)
16
Benchmarking Circuits: Some Observations
Altera has bigger benchmarks than academics
–We noted similar characteristics:
  Some LARGE circuits routable with NARROW routing channels
  Some SMALL circuits need WIDE routing channels
What if each circuit is an IP Block in a larger system… ??

                      20 Largest MCNC Benchmarks   Altera Cyclone Benchmarks [CICC 2003]
LUT Range             10:1 (1,000..10,000 LUTs)    10:1 (2,500..25,000 LUTs)
Channel Width Range   4:1 (20..80 tracks)          3:1 (40..120 tracks)
17
Benchmark Creation – IP Blocks
Mimic process of creating large designs
–“IP Blocks”: MCNC Circuits
–SoC: Randomly integrate/stitch together “IP Blocks”
–IP Blocks have varied interconnect needs
Real-life large designs: System-on-Chip Methodology
–IP blocks (own, 3rd party)
  Re-use improves productivity
–Primarily integration and verification effort
18
Benchmark Creation – Large Designs
Considered 3 stitching schemes…
–Independent: IP Blocks are not connected to each other
–Pipeline: Outputs of one IP block connected to inputs of next IP block
–Clique: Outputs of each IP block are uniformly distributed to inputs of all other IP blocks
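The three stitching schemes can be sketched at the block level as follows. This is only an illustration: the function names are hypothetical, the sketch is deterministic (the talk describes the integration as random), and round-robin is just one assumed way to spread outputs uniformly.

```python
from itertools import permutations


def stitch(blocks, scheme):
    """Return directed (source, destination) block pairs for one scheme."""
    if scheme == "independent":
        return []                              # no inter-block connections
    if scheme == "pipeline":
        return list(zip(blocks, blocks[1:]))   # block i drives block i+1
    if scheme == "clique":
        return list(permutations(blocks, 2))   # every block drives every other
    raise ValueError(f"unknown scheme: {scheme}")


def spread_outputs(num_outputs, destinations):
    """Assign output indices to destination blocks round-robin."""
    if not destinations:
        return {}
    assignment = {dest: [] for dest in destinations}
    for out in range(num_outputs):
        assignment[destinations[out % len(destinations)]].append(out)
    return assignment


pairs = stitch(["pdc", "clma", "tseng"], "pipeline")
```

For the clique scheme, `spread_outputs` would be applied once per block, with all other blocks as destinations.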
19
MetaCircuit: Reducing Routed Channel Width?
Observations
–IP blocks are tightly-connected internally
–IP blocks have varied channel width needs
Hypotheses
1. Placement keeps each “IP block” together
2. If an IP block has a large routed channel width, the MetaCircuit has a large routed channel width
20
Hypothesis Testing: MetaCircuit P&R Results
Use VPR FPGA tools from University of Toronto
Hypothesis 1
–VPR placer successfully groups IP blocks from random initial placement
Hypothesis 2
–VPR router confirms channel width of MetaCircuit is dominated by a few IP blocks { pdc, clma, ex1010 }
21
Consequences of Hypothesis 2
Question
–Does shrinking the channel width of a few IP blocks shrink the channel width of the MetaCircuit?
How to shrink channel widths?
–Selective CLB Depopulation !!
–Depopulate hard-to-route IP blocks the most
How much to depopulate?
–Channel width profiling of IP block…
22
Meeting Channel Width Constraints: Selective Depopulation
Step 1: Channel Width Profiling of IP Blocks (Congestion Estimation)
Step 2: Re-cluster Only Congested IP Blocks (Selective Depopulation)
23
IP Block Properties
Cluster IP Blocks into N=16, k=6
VPR: determine minimum channel width for each IP Block
Sort IP Blocks based on channel width
[Figure: minimum channel width per IP block; labels: Hard-to-Route Circuits, Easy-to-Route Circuits]
24
Channel Width Profiling of IP Block
Cluster sizes
–N_A = FPGA Architecture Cluster Size (fixed)
–N_C = BLE-Limit Size (variable)
Sweep N_C for each IP block
25
Analysis with Constraint
Given channel-width constraint of 60 tracks
–tseng routable (easy)
–clma routable for N_C <= 10
–clma not routable for N_C > 10
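The analysis above amounts to picking, per IP block, the largest BLE limit N_C whose profiled channel width still meets the constraint. The sketch below is a minimal illustration with a hypothetical function name; the profile numbers are placeholders chosen only to reproduce the slide's example (clma routable up to N_C = 10 at 60 tracks, tseng always routable) and are not measured results.

```python
def max_routable_nc(profile, constraint):
    """Largest BLE limit N_C whose profiled channel width meets the constraint.

    profile maps N_C -> minimum routable channel width found at that N_C.
    Returns None when no profiled N_C fits the constraint.
    """
    feasible = [nc for nc, width in profile.items() if width <= constraint]
    return max(feasible) if feasible else None


# Placeholder profiles (NOT measured data), shaped to match the slide's example
clma_profile = {4: 44, 6: 50, 8: 55, 10: 60, 12: 66, 14: 72, 16: 78}
tseng_profile = {4: 30, 6: 32, 8: 34, 10: 36, 12: 38, 14: 40, 16: 42}

clma_nc = max_routable_nc(clma_profile, 60)
tseng_nc = max_routable_nc(tseng_profile, 60)
```

Taking the maximum feasible value, instead of stopping at the first failure, tolerates profiles that are only almost monotonic.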
26
Our Technique: Selective Depopulation
Step 1: Channel Width Profiling of IP Blocks (Congestion Estimation)
Step 2: Re-cluster Only Congested IP Blocks (Selective Depopulation)
27
Uniform Depopulation
Minimum N_C Cluster Size
–Depopulate all clusters equally
–E.g., use N_C=10 for both IP Blocks
28
Non-Uniform Depopulation
Maximal N_C Cluster Size
–Depopulate each IP block according to maximal cluster size
–E.g., clma N_C=10, tseng N_C=16
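The difference between the two policies can be seen in a small CLB-count comparison. This is a sketch under stated assumptions: ideal packing, hypothetical function names, and made-up BLE counts per block; only the N_C values (clma 10, tseng 16) come from the slides.

```python
import math


def total_clbs(ble_counts, nc_per_block):
    """Total CLBs when each block is packed up to its own BLE limit N_C."""
    return sum(math.ceil(ble_counts[b] / nc_per_block[b]) for b in ble_counts)


def uniform_policy(max_nc):
    """Uniform depopulation: every block uses the smallest of the maximal N_C."""
    smallest = min(max_nc.values())
    return {block: smallest for block in max_nc}


max_nc = {"clma": 10, "tseng": 16}          # maximal routable N_C from the slides
ble_counts = {"clma": 8000, "tseng": 1000}  # hypothetical sizes, for illustration

uniform_clbs = total_clbs(ble_counts, uniform_policy(max_nc))
non_uniform_clbs = total_clbs(ble_counts, max_nc)
```

With these placeholder sizes the uniform policy needs 900 CLBs and the non-uniform policy 863, because the easy-to-route block stays fully packed.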
29
Uniform vs. Non-Uniform
Non-Uniform depopulation better than Uniform
–Lower CLB count
–Higher LUT utilization
[Figures: LUT utilization and total CLBs needed (x 1,000) vs. channel width constraint, Uniform and Non-Uniform]
30
MetaCircuit Clustering Results
Depopulate the most-congested IP blocks
–BLE-Limit of each IP block shown (max=16)
–Some IP blocks are depopulated more than others
31
MetaCircuit P&R Results
Clique MetaCircuit
–P&R channel width results closely match “constraints”
Shrink Channel Width
–by ~20% (from 95 to 75), NO AREA INCREASE
–by ~50% (from 95 to 50), 1.7x area increase
[Figures: normalized area vs. channel width constraint; routed channel width vs. channel width constraint]
32
Other MetaCircuit Results
Channel Width Decreases

Circuit        Clustering Tool   ( < 1.05 x Area )   ( 1.7 x – 3.5 x Area )
Clique         T-VPack           20%                 50%
               iRAC Rep.         7%                  29%
Independent*   T-VPack           24%                 42%
               iRAC Rep.         27%                 30%
Pipeline*      T-VPack           25%                 55%
               iRAC Rep.         11%                 27%

* These latest results are better than those given in paper
33
Critical Path Delay and Average Wirelength
Expect critical path delay to increase under tighter constraints
–Delay “noise” due to instability of floorplan locations
Average wirelength / net increases under tighter constraints
34
Conclusion
System-level technique to map large System-on-Chip (SoC) designs to channel-width constrained FPGAs using fewer routing resources
Depopulating CLBs effective at reducing channel width
Non-uniform depopulation important to limit area inflation
Channel width reduced
–by 0-20% with < 5% area increase
–by up to 50% with 3.3 X area increase
Effective solution to trade off CLBs for Interconnect !!!
–UNROUTABLE circuits (channel width TOO LARGE) can be made ROUTABLE (reduced channel width) by buying an FPGA with MORE LOGIC!!!
35
End of Talk
36
Future Work
Real-Life SoC Benchmark
–Licensed IP: Bluetooth baseband processor
–325,000 ASIC gates
–Numerous IP blocks of varying complexity
–Needed to authenticate “Synthetic” results
Automated technique to find “hard” IP blocks
–Granularity is based on design hierarchy (?)
–Replaces time-consuming Step 1 of process
37
Motivation: Reduce Cost
Observations
–Interconnect dominates, layout area >= 70%
–Fixed interconnect architecture
  Designed for near-worst-case demand
–Same interconnect architecture across entire family
  E.g., Altera Cyclone: 80 tracks-per-channel for all devices
  Choice for logic capacity (device selection)
  No choice for interconnect capacity
Result
–Overcapacity in interconnect
–Interconnect dominates cost
–User has no way to reduce dominant cost
38
Fixed Channel-Width Constraints
Real FPGA Device: fixed Channel Width
–Some hard-to-route circuits (routing intensive) won’t fit
Problem
–Find way to make circuit fit
Our Approach
–Divide circuit into large-sized chunks, e.g., IP Blocks
–Make “hard-to-route” IP Blocks “easy-to-route” by CLB depopulation
  This increases CLB usage
–Leave “easy” ones alone: limit CLB increase
39
Overview of Clustering Approach
Two methods for choosing N_C
–Uniform Depopulation: use fixed N_C <= N_A
–Non-Uniform Depopulation: use best N_C <= N_A
  As expected, Non-Uniform gives better results
Cluster each IP block separately
–Compare with 2 clustering tools
–T-VPack vs. iRAC replica
Channel Width Prediction
–Largest Channel Width of IP blocks <= Channel Width of MetaCircuit