Runtime-Quality Tradeoff in Partitioning Based Multithreaded Packing


Dries Vercruyce, Elias Vansteenkiste and Dirk Stroobandt (Dries.Vercruyce@UGent.be)
Faculty of Engineering and Architecture, Ghent University – Computer Systems Lab – FPL 2016 – 30 August 2016

Toolflow
HDL description → Synthesis → Technology mapping → Packing → Placement → Routing → FPGA configuration

Packing
Seed based:
- Bottom-up approach: seed block, affinity metric
- Fast
- Tight packing
- Local minima
- No multithreading
Partitioning based:
- Top-down approach: hierarchical partitioning of the circuit
- Slow
- Constraints
- Quality of results (QoR): wirelength and channel width
- Multithreading: once a circuit is split in half, both subcircuits can be processed independently during partitioning, which creates the opportunity for multithreading.
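As a rough illustration of the seed based (bottom-up) approach, the sketch below grows clusters greedily: it picks a seed block, then repeatedly absorbs the unpacked block with the highest affinity. The affinity metric (number of shared nets), the seed choice (most connected block) and all names are assumptions made for illustration; they are not taken from the slides or from any particular packer.

```python
from typing import Dict, List, Set


def seed_based_pack(block_nets: Dict[str, Set[str]], cluster_size: int) -> List[List[str]]:
    """Greedy bottom-up packing. block_nets maps a block name to its net names."""
    unpacked = set(block_nets)
    clusters = []
    while unpacked:
        # Assumed seed choice: the unpacked block touching the most nets
        # (ties broken by name order so the result is deterministic).
        seed = max(sorted(unpacked), key=lambda b: len(block_nets[b]))
        cluster = [seed]
        cluster_nets = set(block_nets[seed])
        unpacked.remove(seed)
        while len(cluster) < cluster_size and unpacked:
            # Assumed affinity metric: nets shared with the cluster so far.
            best = max(sorted(unpacked), key=lambda b: len(block_nets[b] & cluster_nets))
            if not block_nets[best] & cluster_nets:
                break  # nothing related is left; close this cluster
            cluster.append(best)
            cluster_nets |= block_nets[best]
            unpacked.remove(best)
        clusters.append(cluster)
    return clusters


blocks = {
    "a": {"n1", "n2"}, "b": {"n2", "n3"}, "c": {"n3"},
    "d": {"n4", "n5"}, "e": {"n5"},
}
clusters = seed_based_pack(blocks, cluster_size=3)
print(clusters)
```

Each greedy choice is final, which is one way to see why this style of packing is fast but can get stuck in local minima.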

Constraints
- Fixed # LUT/FF
- Fixed # input pins
- Complete/sparse crossbar
[Figure: logic cluster with a local interconnect feeding BLEs, each containing a LUT and an FF]
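These constraints can be read as a legality test on a candidate cluster. The sketch below checks only the two counting constraints (number of BLEs and number of distinct external input nets) and assumes a complete crossbar; a sparse crossbar would need an actual routing check, which is not attempted here. The limits (10 BLEs, 40 inputs) and all names are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class BLE:
    name: str
    inputs: FrozenSet[str]  # nets read by the LUT/FF pair
    output: str             # net driven by the BLE


def is_legal_cluster(bles: List[BLE], max_bles: int = 10, max_inputs: int = 40) -> bool:
    """Check the two counting constraints, assuming a complete crossbar."""
    if len(bles) > max_bles:
        return False
    produced = {b.output for b in bles}
    # Nets produced inside the cluster are fed back through the local
    # interconnect, so they do not occupy a cluster input pin.
    external_inputs = set().union(*(b.inputs for b in bles)) - produced
    return len(external_inputs) <= max_inputs


b1 = BLE("b1", frozenset({"x", "y"}), "p")
b2 = BLE("b2", frozenset({"p", "z"}), "q")
print(is_legal_cluster([b1, b2], max_bles=2, max_inputs=3))  # external inputs: x, y, z
print(is_legal_cluster([b1, b2], max_bles=2, max_inputs=2))
```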

Related work
- Constraints enforcing step required
- Simplified architectures

Contributions
- No constraints enforcing step required
- Fast multithreaded packing
- Multithreaded seed based packing (MultiPart)
- Realistic heterogeneous architectures (MultiPart)

Outline
- Packing
- Contributions
- Circuit partitioning
- PartSA
- MultiPart
- Experiments
- Conclusions and future work

Circuit partitioning
[Figure: example circuit of FF, LUT and MULT blocks split into two partitions, A and B]

PartSA
- Clustering based on design hierarchy
- Simulated annealing fine-tuning: cost function
[Figure: partitioning tree with node labels N and 1]

Simulated annealing: cost function
[Figure or equation not captured in the transcript; surviving labels: PTH, PMAX]
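The transcript does not preserve the cost function itself, so the following is only a generic sketch of simulated-annealing fine-tuning by block swaps. The cost used here (number of nets spanning more than one cluster), the geometric cooling schedule and every name are assumptions; they stand in for, and should not be read as, the PartSA cost function.

```python
import math
import random
from typing import Dict, Set, Tuple


def cut_nets(assign: Dict[str, int], nets: Dict[str, Set[str]]) -> int:
    """Assumed cost: number of nets whose blocks lie in more than one cluster."""
    return sum(1 for blocks in nets.values() if len({assign[b] for b in blocks}) > 1)


def anneal(assign: Dict[str, int], nets: Dict[str, Set[str]], t_start: float = 2.0,
           t_end: float = 0.01, alpha: float = 0.9, moves_per_t: int = 50,
           seed: int = 0) -> Tuple[Dict[str, int], int]:
    rng = random.Random(seed)
    assign = dict(assign)
    blocks = sorted(assign)
    cost = cut_nets(assign, nets)
    best, best_cost = dict(assign), cost
    t = t_start
    while t > t_end:
        for _ in range(moves_per_t):
            a, b = rng.sample(blocks, 2)
            if assign[a] == assign[b]:
                continue
            # Swapping two blocks keeps every cluster at its original size.
            assign[a], assign[b] = assign[b], assign[a]
            new_cost = cut_nets(assign, nets)
            delta = new_cost - cost
            # Metropolis criterion: always accept improvements, sometimes accept worse.
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                cost = new_cost
                if cost < best_cost:
                    best, best_cost = dict(assign), cost
            else:
                assign[a], assign[b] = assign[b], assign[a]  # rejected: undo the swap
        t *= alpha
    return best, best_cost


nets = {"n1": {"a", "b"}, "n2": {"c", "d"}}
start = {"a": 0, "b": 1, "c": 0, "d": 1}  # both nets are cut initially
best, best_cost = anneal(start, nets)
print(best, best_cost)
```

With a sparse crossbar, every accepted swap would additionally need a legality check of the affected clusters, which is the difficulty raised on the later "Problems with PartSA" slide.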

Problem: cutting critical paths
[Figure not captured in the transcript; surviving label: Wedge]

Problems with PartSA
- Partitioning runtime increases as you go deeper in the hierarchy
- Unused threads on the first hierarchy levels
- Large number of subcircuits
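The "unused threads" point follows from the shape of the bipartitioning tree: level d holds at most 2^d independent subcircuits, so a 4-core machine is fully busy only from level 2 onwards. A small sketch, assuming balanced recursive bipartitioning and one thread per subcircuit (the function name is hypothetical):

```python
from typing import List


def busy_threads_per_level(depth: int, num_threads: int) -> List[int]:
    """Threads that can work in parallel while partitioning each tree level.

    Assumes balanced recursive bipartitioning: level d holds 2**d subcircuits,
    and each subcircuit is bipartitioned by a single thread.
    """
    return [min(2 ** level, num_threads) for level in range(depth)]


print(busy_threads_per_level(depth=5, num_threads=4))  # [1, 2, 4, 4, 4]
```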

Problems with PartSA
- Partitioning runtime increases as you go deeper in the hierarchy
- Hard to target commercial architectures
- Commercial architectures contain sparse local interconnect crossbars
- Is the solution still legal after a block swap?
- Detailed routing would be required in the kernel of simulated annealing
- Infeasible due to the large number of required swaps

MultiPart
- No partitioning required on deep hierarchical levels
- Detailed routing is feasible with seed based packing
- Subcircuits are processed independently
- Multithreaded seed based packing
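One way to picture this flow: partition the circuit down to a shallow depth, then hand every subcircuit to a seed based packer running in its own worker. In the sketch below, `bipartition` and `pack_subcircuit` are hypothetical placeholders (a real flow would call an actual netlist partitioner and a seed based packer), and the depth and thread count are arbitrary illustrative values.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def bipartition(blocks: List[str]) -> Tuple[List[str], List[str]]:
    """Placeholder: split in the middle. A real partitioner would minimise cut nets."""
    mid = len(blocks) // 2
    return blocks[:mid], blocks[mid:]


def partition_to_depth(blocks: List[str], depth: int) -> List[List[str]]:
    """Recursively bipartition until the requested depth is reached."""
    if depth == 0 or len(blocks) < 2:
        return [blocks]
    left, right = bipartition(blocks)
    return partition_to_depth(left, depth - 1) + partition_to_depth(right, depth - 1)


def pack_subcircuit(blocks: List[str], cluster_size: int = 2) -> List[List[str]]:
    """Placeholder for seed based packing: chunk the blocks into clusters."""
    return [blocks[i:i + cluster_size] for i in range(0, len(blocks), cluster_size)]


def multipart_sketch(blocks: List[str], depth: int = 2, num_threads: int = 4) -> List[List[str]]:
    subcircuits = partition_to_depth(blocks, depth)
    # Subcircuits share no state, so each one can be packed independently.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        packed = list(pool.map(pack_subcircuit, subcircuits))
    return [cluster for sub in packed for cluster in sub]


clusters = multipart_sketch([f"b{i}" for i in range(8)])
print(clusters)
```

The thread pool only illustrates the structure. In CPython, pure-Python work does not run in parallel across threads, so a process pool or a native implementation would be needed for an actual speed-up.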

Partition depth

Problem: cutting critical paths
- SDC file
- Even though timing edges are added during partitioning, a critical path can still be cut.

Experimental results
- None of the packers shown before is able to pack the VTR benchmarks, and none is publicly available.
- All results are relative to AAPack.

Total wirelength (relative to AAPack)

Minimum channel width: smaller and cheaper FPGAs

Execution time and scaling behaviour

Name        Area    Runtime speed-up
                    PartSA   MultiPart
LU8PEEng    770K    1.7x     2.6x
LU32PEEng   2.7M    2x       3.3x
LU64PEEng   5.3M    2.3x     4x

Summary

Architecture / packer                           Total wirelength   Critical path delay   Runtime speed-up
K6_N10_40nm (complete crossbar)
  PartSA                                        -26%               -1.5%                 1.8x
  MultiPart                                     -12%               -2.6%                 2.7x
K6_N10_gate_boost_0.2V_22nm (sparse crossbar)   -20%               -3.7%                 2.9x

Conclusion and future work
Partitioning based packing methods:
- Design hierarchy preserved
- Multithreaded parallelism
- Higher quality packing in less runtime: total wirelength, minimum channel width, critical path delay
Future work:
- Extend MultiPart
- Titan benchmark design suite

Extra: Results for Titan

Benchmark suite   Total wirelength   Critical path delay   Runtime speed-up
VTR               -20%               -3.7%                 2.9x
Titan             -28%               -6%                   3.6x

Acknowledgement
Supported by the European Commission H2020-FETHPC EXTRA project. The author is supported by a PhD grant of the Research Foundation Flanders (FWO).

ADDITIONAL SLIDES

Multithreaded partitioning: CPU with 4 cores