Congestion-Driven Re-Clustering for Low-cost FPGAs MASc Examination Darius Chiu Supervisor: Dr. Guy Lemieux University of British Columbia Department of.

Presentation transcript:

Congestion-Driven Re-Clustering for Low-cost FPGAs MASc Examination Darius Chiu Supervisor: Dr. Guy Lemieux University of British Columbia Department of Electrical and Computer Engineering Vancouver, BC, Canada

Outline Motivation and background Algorithm Results Conclusion Future Work


Example: Unroutable
Situation: Run circuit through VPR
Circuit is unroutable at specified target channel-width
Only localized area is actually unroutable
Routing congestion happens locally

Motivation
Goal: Must meet hard channel-width constraint
  – Number of routing tracks is fixed at manufacture time
  – Must meet channel-width constraint everywhere on the FPGA
Presented with an unroutable circuit
  – Increase available interconnect (use larger device)
    More interconnect everywhere = more expensive FPGA device
  – Decrease local interconnect demand
    Create more aggregate interconnect for congested regions only

Reclustering Congested Regions
Find congested regions and reduce routing demand
  – Increase CLB usage to spread interconnect usage
  – Controlled tradeoff between CLB and interconnect usage
  – Cost savings: can use the same FPGA, just need to recluster

Un/DoPack CAD Flow
Previous work by Marvin Tom [ICCAD2006]
Target a channel width constraint
Spread regional logic to reduce local routing demand
  – Identify congested local regions
  – Iteratively re-cluster, re-place, re-route
  – Whitespace insertion: recluster with reduced cluster size
Leave uncongested regions alone

Background: Un/DoPack
Flowchart: Cluster -> Place -> Route (VPR) -> Target CW met? If YES, done. If NO, Un/DoPack: Identify -> Re-cluster -> Place -> Route, then check the target CW again.
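The loop in this flowchart can be sketched as follows. This is a minimal illustration, not the Un/DoPack implementation: every stage (`cluster`, `place`, `route`, `identify`, `recluster`) is a hypothetical callable supplied by the caller, and the `max_iters` guard is an assumption added so the sketch always terminates.

```python
def undopack_flow(cluster, place, route, identify, recluster,
                  target_cw, max_iters=50):
    """Iterate re-cluster / place / route until the target channel width is met.

    All stage functions are hypothetical stand-ins for the real CAD tools:
      cluster()                 -> netlist
      place(netlist)            -> placement
      route(placement)          -> (channel_width, congestion_map)
      identify(congestion_map)  -> congested regions
      recluster(netlist, regs)  -> netlist with whitespace inserted
    """
    # Initial pass: Cluster -> Place -> Route.
    netlist = cluster()
    placement = place(netlist)
    channel_width, congestion = route(placement)

    iterations = 0
    # Loop while the routed channel width exceeds the target.
    while channel_width > target_cw and iterations < max_iters:
        regions = identify(congestion)
        netlist = recluster(netlist, regions)
        placement = place(netlist)
        channel_width, congestion = route(placement)
        iterations += 1

    return channel_width, iterations
```

Passing the stages in as callables keeps the sketch independent of any particular tool; in practice each one would wrap a call into the clustering, placement and routing software.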

Contributions
Improve runtime/area tradeoff of Un/DoPack flow
  – Simultaneous area and runtime savings
Use congestion information to perform better reclustering
  – New approach to selecting congested regions
  – Use of interconnect-demand model to determine how much to spread logic
Findings
  – Up to 5x runtime speedup versus Baseline
  – Up to 25% area savings versus Baseline

Contributions
Recently accepted to FPT 2009
  – D. Chiu, G. Lemieux, S. Wilton, “Congestion-Driven Re-Clustering for Low-cost FPGAs”

Previous Depopulation Schemes
Single versus Multiregion
Region Selection:
  – Select all CLBs in area centered on most congested CLB (Single Region)
  – Select all CLBs in area centered on most congested CLB not already chosen (Multiregion)
Whitespace insertion:
  – Baseline: insert CLBs in 1 row and 1 column in FPGA
  – Fine-grained: insert CLBs in 1 row and 1 column in region
    Excellent area tradeoff, but slow
  – Multiregion: insert CLBs proportional to congestion
    Good runtime performance

Algorithm
Region Selection: Try to select regions more intelligently
  – Capture congested regions instead of just CLBs
Whitespace Insertion: Try to estimate appropriate cluster size
  – Use interconnect demand model to predict outcome for depopulation

Un/DoPack
Flowchart: Cluster -> Place -> Route (VPR) -> Target CW met? If YES, done. If NO, Un/DoPack: Identify (region selection) -> Re-cluster (whitespace insertion) -> Place -> Route, then check the target CW again.

Benchmark Circuits
Metacircuits designed to emulate large SoC circuits
  – Cluster size 16
  – Built using benchmark generator GNL
  – Large circuit composed of smaller subcircuits (SoC style)
  – Each subcircuit emulates the interconnect complexity (Rent parameter) of individual MCNC circuits
  – The standard deviation of the Rent parameter is varied to create the benchmark suite

Region Selection
Find congested regions
  – Post-routing congestion information
Center region on most congested CLB

Region Selection
Use congestion values to generate direction to move region

Region Selection
Binary Search
  – Find region with highest average congestion

Region Selection
Mark Next Region
Sort by average congestion and depopulate in sorted order
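The region-selection steps above can be illustrated with a small sketch. It assumes the post-routing congestion map is a 2D list with one value per CLB and that regions are fixed-size squares. The slides name a binary search driven by a congestion-derived direction; this sketch swaps in a simpler one-step hill climb that shifts the window toward the neighbouring position with the higher average congestion. The function names are hypothetical.

```python
def window_avg(grid, top, left, size):
    """Average congestion over the size x size window at (top, left)."""
    total = sum(grid[r][c]
                for r in range(top, top + size)
                for c in range(left, left + size))
    return total / (size * size)


def select_region(grid, size):
    """Return (top, left, average) of a locally best congested window.

    Assumes size <= number of rows and columns in grid.
    """
    rows, cols = len(grid), len(grid[0])

    # Step 1: centre the window on the most congested CLB (clamped to the grid).
    peak_r, peak_c = max(((r, c) for r in range(rows) for c in range(cols)),
                         key=lambda rc: grid[rc[0]][rc[1]])
    top = min(max(peak_r - size // 2, 0), rows - size)
    left = min(max(peak_c - size // 2, 0), cols - size)
    best = window_avg(grid, top, left, size)

    # Step 2: shift one CLB at a time toward higher average congestion.
    while True:
        candidates = []
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            t, l = top + dr, left + dc
            if 0 <= t <= rows - size and 0 <= l <= cols - size:
                candidates.append((window_avg(grid, t, l, size), t, l))
        if not candidates:
            break
        avg, t, l = max(candidates)
        if avg <= best:
            break  # no neighbouring window is more congested
        best, top, left = avg, t, l

    return top, left, best


def order_for_depopulation(regions):
    """Sort (top, left, average) regions by average congestion, highest first."""
    return sorted(regions, key=lambda region: region[2], reverse=True)
```

Marking a selected region so the next search skips it is left out for brevity; one option would be to mask the chosen CLBs before calling `select_region` again.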

Budgeted Multiregion Un/DoPack (BMR)
Multiple region approach
Grow number of CLBs according to budget at each iteration
  – Number of CLBs in a row and column of the FPGA
Each region grows by the equivalent of 1 row and 1 column of the region
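The budget arithmetic can be made concrete with a short sketch. It rests on two assumptions not stated on the slide: that "one row and one column" of a w x h array means the (w + 1)(h + 1) - wh = w + h + 1 CLBs gained by extending it in both dimensions, and that regions are served in their sorted order until the next one no longer fits in the budget. The function names are hypothetical.

```python
def row_and_column(width, height):
    """CLBs gained by adding one row and one column to a width x height array."""
    return (width + 1) * (height + 1) - width * height  # = width + height + 1


def allocate_budget(fpga_width, fpga_height, regions):
    """Grant each region its growth while the per-iteration budget lasts.

    regions is a list of (width, height) tuples, already sorted by
    average congestion (most congested first). Returns the per-region
    grants and the unused budget.
    """
    budget = row_and_column(fpga_width, fpga_height)
    grants = [0] * len(regions)
    for i, (w, h) in enumerate(regions):
        need = row_and_column(w, h)
        if need > budget:
            break  # assumption: stop at the first region that does not fit
        grants[i] = need
        budget -= need
    return grants, budget
```

For a 20 x 20 FPGA the budget under these assumptions is 41 CLBs, so three 5 x 5 regions (11 CLBs each) can grow in one iteration and a fourth has to wait.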

Adding Whitespace
Congestion-Model Driven
  – Use interconnect demand information to estimate how much whitespace to add
  – Interconnect Model
    Estimate post-clustering channel width for region

Regional Interconnect
Adapt demand model for regions of CLBs instead of whole FPGA
  – Most wiring is from inside the region
  – Cannot affect wiring across region directly through depopulation

Regional Interconnect
Assume external interconnect demand stays fixed
Solve for internal interconnect demand
Region interconnect demand = internal demand + external demand
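The decomposition on this slide can be written out as a short derivation. The symbols $W_{\text{region}}$, $W_{\text{int}}$, $W_{\text{ext}}$ and $W_{\text{target}}$ are notation introduced here for illustration, and the numbers in the comment are hypothetical.

```latex
\begin{align*}
W_{\text{region}} &= W_{\text{int}} + W_{\text{ext}} \\
W_{\text{int}} &= W_{\text{region}} - W_{\text{ext}} \\
\intertext{With $W_{\text{ext}}$ held fixed, meeting a target channel width requires}
W_{\text{int}}^{\text{new}} &\le W_{\text{target}} - W_{\text{ext}}
\end{align*}
% Hypothetical numbers: W_region = 60, W_ext = 20, W_target = 48
% give W_int = 40 and a post-depopulation limit of W_int^new <= 28.
```

Because depopulation can only change the internal term, the whole reduction from 60 to 48 in this example has to come out of the internal demand.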

Interconnect-Demand Model
[Equation not preserved in transcript]
W. Fang and J. Rose, “Modeling routing demand for early-stage FPGA architecture development”

Interconnect-Demand Model
Use congestion map to determine equation constants
  – Calibrate equation separately for each region
Solve for lambda that gives desired channel width
  – Re-cluster region with lower cluster size until lambda target is met
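The last step, lowering the cluster size until the lambda target is met, might look like the sketch below. The model equation itself is not reproduced in this transcript, so `measure_lambda` is a hypothetical callable that re-clusters the region at a given cluster size and returns the resulting lambda. The sketch also assumes that lambda falls as the cluster size shrinks and that the search starts from the benchmark cluster size of 16.

```python
def pick_cluster_size(measure_lambda, lambda_target, start_size=16, min_size=1):
    """Return the largest cluster size whose lambda meets the target.

    measure_lambda(size) is a hypothetical stand-in for re-clustering the
    region at the given cluster size and evaluating lambda on the result.
    If no size meets the target, min_size is returned.
    """
    size = start_size
    # Shrink the cluster size one step at a time until the target is met.
    while size > min_size and measure_lambda(size) > lambda_target:
        size -= 1
    return size
```

Stepping down one size at a time keeps the depopulation as small as possible, which matters because every CLB of whitespace costs area.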

Congestion-Model Multiregion Un/DoPack (CMR)
Same region selection method as BMR
No constraint on new CLBs in each iteration
Whitespace insertion using model

Results
Typical results
  – Stdev004
CMR: speedup comparable to Multiregion Un/DoPack
BMR: slightly faster than Baseline

Results
Typical results
  – Stdev004
BMR: area better than Multiregion
CMR: area better than Multiregion

Runtime / Area Tradeoff
Previous Multiregion approach (fast)
Previous Fine-Grained approach (good area)
Speed-area tradeoff

Runtime / Area Tradeoff
BMR: improved runtime, good area performance

Runtime / Area Tradeoff
CMR: better runtime, good area

Critical Path

Congestion-Driven Placement
We can further improve area performance using a congestion-driven placer

Conclusions
Use congestion information to perform better re-clustering
  – Up to 5x runtime speedup versus baseline
  – Up to 25% area savings versus baseline
Improve runtime/area tradeoff of Un/DoPack flow
  – Simultaneous area and runtime savings

Future Work
Consider effect of neighboring regions
Other congestion-driven tools
  – Fast placement

Questions?

Outline Motivation Background Multiregion approach Congestion-Driven Whitespace insertion

Related Work
Un/DoPack [1]
  – Reduce interconnect usage to meet target channel width constraint
Congestion-Driven Clustering
  – iRAC, ISPL
  – Single-pass clustering