ECE 506 Reconfigurable Computing Lecture 6 Clustering Ali Akoglu.

Before Placement: Clustering
°Intra-cluster connections: fast
°Inter-cluster connections: slow
-Need to pack BLEs
°Goals:
-Reduce stress on routing
-Take advantage of local fast interconnect
-Reduce inter-cluster wiring
-Minimize critical path (timing-driven)
°How do we do this?
-Take advantage of cluster architecture
°Tradeoffs

Basic Clustering (Betz)
°How many distinct inputs should be provided to a cluster of N 4-LUTs?
°How many 4-LUTs should be included in a cluster to create the most area-efficient logic block?

VPACK

Basic Clustering (Betz)
°Flow
-Iterate until all BLEs are consumed
-Start a new cluster with a seed BLE: select the currently unclustered BLE with the most used inputs
-Add the BLE that shares the most inputs with the current cluster, to minimize the number of inputs that must be routed to each cluster
-Keep adding until either the cluster is full or its input pins are used up
-Hill climbing, if some BLE slots in the cluster are still unused: add another BLE even if the cluster input count temporarily overflows; if the input count is not eventually reduced, fall back to the best choice from before hill climbing
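The flow above can be sketched in a few lines of Python. This is a minimal illustration, not Betz's implementation: the function name `vpack`, the dictionary-based netlist format, and the alphabetical tie-breaking are assumptions made for the example, and the hill-climbing step is left out.

```python
from typing import Dict, List, Set

def vpack(ble_inputs: Dict[str, Set[str]], ble_output: Dict[str, str],
          n: int, max_inputs: int) -> List[List[str]]:
    """Greedy VPack-style packing sketch (hypothetical helper, no hill climbing).

    ble_inputs: BLE name -> set of nets it reads
    ble_output: BLE name -> net it drives
    n:          BLEs per cluster
    max_inputs: input pins per cluster
    """
    unclustered = set(ble_inputs)
    clusters: List[List[str]] = []

    def external_inputs(members: List[str]) -> Set[str]:
        used = set().union(*(ble_inputs[b] for b in members))
        produced = {ble_output[b] for b in members}
        return used - produced  # nets driven inside the cluster need no input pin

    while unclustered:
        # Seed: the unclustered BLE with the most used inputs (ties: alphabetical).
        seed = max(sorted(unclustered), key=lambda b: len(ble_inputs[b]))
        cluster = [seed]
        unclustered.remove(seed)
        while len(cluster) < n:
            cluster_nets = set().union(
                *(ble_inputs[b] | {ble_output[b]} for b in cluster))
            # Only consider BLEs that keep the cluster within its pin budget.
            feasible = [b for b in sorted(unclustered)
                        if len(external_inputs(cluster + [b])) <= max_inputs]
            if not feasible:
                break  # input pins used up
            # Pick the BLE sharing the most nets with the current cluster.
            best = max(feasible, key=lambda b: len(
                (ble_inputs[b] | {ble_output[b]}) & cluster_nets))
            cluster.append(best)
            unclustered.remove(best)
        clusters.append(cluster)
    return clusters

# Toy netlist: "b" reads the output of "a" and shares three of its inputs.
toy_inputs = {
    "a": {"i1", "i2", "i3", "i4"},
    "b": {"i1", "i2", "i3", "na"},
    "c": {"i5", "i6"},
    "d": {"i5", "i7"},
}
toy_output = {"a": "na", "b": "nb", "c": "nc", "d": "nd"}
toy_clusters = vpack(toy_inputs, toy_output, n=2, max_inputs=4)
print(toy_clusters)
```

With two BLEs per cluster and four input pins, "a" and "b" end up together because the pair needs only four external inputs, while "c" and "d" share net "i5". The sketch assumes every seed BLE fits within the pin budget on its own.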

Logic Utilization

Number of Inputs per Cluster
°Lots of opportunities for input sharing in large clusters (Betz – CICC’99)
°Reducing inputs reduces the size of the device and makes it faster.
°Most FPGA devices (Xilinx, Lucent) have 4 BLEs per cluster with more inputs than actually needed.

TVPACK

Architecture Modeling
°Tri-state buffer and pass transistor distribution
°Cluster size vs. routing resources (tile size)
°Transistor and buffer scaling based on segment length
°Flexibility of switches (is Fc = W a waste for large cluster sizes?)

Logic Cluster Structure

Timing-Driven Clustering – T-VPACK
°Optimization goals of VPack
-Pack each cluster to its capacity: minimize the number of clusters
-Minimize the number of inputs per cluster: reduce the number of external connections

Timing-Driven Clustering – T-VPACK
°Optimization goal of T-VPack
-Minimize the number of external connections on the critical path
°Why?
-External connections have higher delay than internal connections
-Reducing the number of external nets on the critical path reduces delay

Timing-Driven Clustering – T-VPACK
°First stage
-Identify connections that are on the critical path
°Second stage
-Pack BLEs sequentially along the critical path
-Recompute criticality of remaining BLEs

Slack and Criticality Calculation
[Figure: timing graph from primary inputs (PI) to primary outputs (PO1, PO2, PO3), built up over several slides. Each node is annotated with arrival time / required time (e.g. 22/22, 18/22, 15/15, 7/15), and the critical path is highlighted.]
°Slack = required time - arrival time
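The arrival, required, and slack values in these slides come from a forward and a backward sweep over the timing graph. The sketch below shows one way to compute them; the graph, the node delays, and the function name `timing_analysis` are made up for illustration and do not reproduce the numbers on the slides. It also assumes every output is given the same required time, equal to the largest arrival time.

```python
from collections import defaultdict

def timing_analysis(edges, delay):
    """Arrival/required/slack on a DAG (illustrative sketch).

    edges: list of (src, dst) connections
    delay: node -> delay contributed by that node
    """
    succ, pred, nodes = defaultdict(list), defaultdict(list), set()
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)
        nodes |= {u, v}

    # Topological order (Kahn's algorithm).
    indeg = {x: len(pred[x]) for x in nodes}
    ready = sorted(x for x in nodes if indeg[x] == 0)
    order = []
    while ready:
        u = ready.pop()
        order.append(u)
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)

    # Forward sweep: arrival = own delay + latest arriving input.
    arrival = {}
    for u in order:
        arrival[u] = delay[u] + max((arrival[p] for p in pred[u]), default=0)

    # Backward sweep: outputs must settle by t_max (assumption: one shared
    # required time); other nodes must leave room for each successor's delay.
    t_max = max(arrival.values())
    required = {}
    for u in reversed(order):
        required[u] = min((required[s] - delay[s] for s in succ[u]), default=t_max)

    slack = {u: required[u] - arrival[u] for u in nodes}
    return arrival, required, slack

# Hypothetical example circuit: two inputs, three logic nodes, two outputs.
ex_edges = [("PI1", "A"), ("PI2", "A"), ("PI2", "B"),
            ("A", "C"), ("B", "C"), ("C", "PO1"), ("B", "PO2")]
ex_delay = {"PI1": 0, "PI2": 0, "A": 3, "B": 1, "C": 2, "PO1": 0, "PO2": 0}
ex_arrival, ex_required, ex_slack = timing_analysis(ex_edges, ex_delay)
for node in sorted(ex_slack):
    print(node, f"{ex_arrival[node]}/{ex_required[node]}", "slack", ex_slack[node])
```

In this example the zero-slack nodes PI1, PI2, A, C, and PO1 form the critical path, while B has a slack of 2 and PO2 a slack of 4.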

Timing-Driven Clustering – T-VPACK
°Cost metric now considers both connectivity and timing criticality
°Perform a criticality analysis at the beginning, treating all wires as inter-cluster
°Determine “Base” BLE criticality

Base Criticality

How to break ties?
°Initially, many paths may have the same number of BLEs
°Include “tie-breaking” in performance cost function
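The slides do not spell out the cost function, so the following is only a sketch of how a cost that mixes timing criticality, connectivity, and a tie-breaker might look. The normalization of slack into a criticality between 0 and 1, the weight `alpha`, the small factor `eps`, and the `paths_affected` count are all assumptions chosen for illustration.

```python
def criticality(slack: float, max_slack: float) -> float:
    """Map slack to [0, 1]: zero slack -> 1.0, largest slack -> 0.0 (assumed form)."""
    return 1.0 - slack / max_slack if max_slack > 0 else 1.0

def attraction(base_crit: float, paths_affected: int,
               shared_nets: int, max_nets: int,
               alpha: float = 0.75, eps: float = 1e-3) -> float:
    """Hypothetical packing cost: higher means a better candidate BLE.

    base_crit:      base criticality of the candidate BLE
    paths_affected: tie-breaker, number of critical paths through the BLE
    shared_nets:    nets the candidate shares with the current cluster
    max_nets:       normalizing constant for the connectivity term
    """
    # eps keeps the tie-breaker from overriding a real criticality difference.
    crit = base_crit + eps * paths_affected
    return alpha * crit + (1.0 - alpha) * shared_nets / max_nets

# Two equally critical candidates: the one on more critical paths wins the tie.
tie_a = attraction(1.0, paths_affected=3, shared_nets=1, max_nets=5)
tie_b = attraction(1.0, paths_affected=7, shared_nets=1, max_nets=5)
# A critical BLE with little sharing vs. a non-critical BLE with a lot of sharing.
crit_first = attraction(1.0, 0, shared_nets=1, max_nets=5)
share_first = attraction(0.2, 0, shared_nets=4, max_nets=5)
print(tie_b > tie_a, crit_first > share_first)
```

With `alpha` close to 1 the packer behaves mostly timing-driven; with `alpha` close to 0 it falls back toward a purely connectivity-driven choice.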

Results for T-VPACK versus VPACK
°Why does the gap between VPack and T-VPack increase as N increases?

Results for T-VPACK versus VPACK
°T-VPack prefers to cluster a BLE with BLEs that are in its fan-in or fan-out
°VPack favors input sharing
°T-VPack completely absorbs many low-fanout nets
-Fewer nets to route!
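A net is completely absorbed when its driver and all of its sinks land in the same cluster, so it never reaches the inter-cluster routing. The sketch below counts such nets for a given packing; the netlist format and the name `absorbed_nets` are hypothetical, and it assumes every pin on a net belongs to a BLE (no I/O pads).

```python
def absorbed_nets(nets, cluster_of):
    """Return the nets whose pins all sit in one cluster.

    nets:       net name -> list of BLEs on the net (driver and sinks)
    cluster_of: BLE name -> cluster id
    """
    return [name for name, pins in nets.items()
            if len({cluster_of[b] for b in pins}) == 1]

# Toy netlist: n1 and n2 are low-fanout nets, n3 fans out to every BLE.
toy_nets = {"n1": ["a", "b"], "n2": ["c", "d"], "n3": ["a", "b", "c", "d"]}

# Packing that keeps fan-in/fan-out neighbours together.
by_connectivity = {"a": 0, "b": 0, "c": 1, "d": 1}
# Packing that splits each low-fanout net across clusters.
split_pairs = {"a": 0, "c": 0, "b": 1, "d": 1}

print(absorbed_nets(toy_nets, by_connectivity))
print(absorbed_nets(toy_nets, split_pairs))
```

The first packing absorbs both two-pin nets and leaves only n3 to route between clusters; the second absorbs none, so all three nets need inter-cluster routing.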

Results for T-VPACK versus VPACK
°Why does the area-delay product show an increasing trend beyond a cluster size of 10?

Results for T-VPACK versus VPACK
°Increased number of nets that are completely absorbed by T-VPack
°Area-delay product
-Cluster size 7-10 is the best choice (34-36% better than N=1)
°N=7 vs. N=1
-30% less delay, 8% less area

Results for T-VPACK, DELAY !!!
°Why do we see a circuit speedup?

Results for T-VPACK, DELAY !!!
[Plot annotations: 18%, 40%]
°Intra-cluster: fast, inter-cluster: slow!
°As N increases
-Number of internal connections on the critical path increases
-Number of external connections on the critical path decreases

Why are inter-cluster connections becoming faster?
°Reduction in the number of external connections (internal connections are faster)
°External connections on the critical path are becoming faster
°Reduction in routing requirements

Drawback of VPack and T-VPack