NanoMap: An Integrated Design Optimization Flow for a Hybrid Nanotube/CMOS Dynamically Reconfigurable Architecture
Wei Zhang†, Li Shang‡ and Niraj K. Jha†

† Dept. of Electrical Engineering, Princeton University
‡ Dept. of Electrical and Computer Engineering, Queen’s University

Outline
• Temporal Logic Folding
• Background on NRAMs
• Overview of the hybrid NAnoTUbe/CMOS REconfigurable architecture (NATURE) (DAC 2006)
• NanoMap: Design Optimization Flow
• Experimental Results
• Conclusions

Temporal Logic Folding
• Basic idea: use run-time reconfiguration to realize different functions in the same resource every few cycles
[Figure: example functions i = abc', l = (i' + e' + f')h', OUT = d'g' + l, with blocks LUT 1, LUT 2, LUT 3 and MEM]

Overview of NATURE
[Figure labels: NATURE; CMOS fabrication compatible; NRAM-based; run-time reconfiguration; temporal logic folding; design flexibility; logic density]
• Distributed non-volatile nanotube RAMs (NRAMs): main storage for reconfiguration bits
• Fine-grain reconfiguration (even cycle-by-cycle) and logic folding
  - Area-delay trade-off flexibility
  - More than an order of magnitude increase in logic density
  - More than an order of magnitude reduction in area-time product
  - Comparisons assume NRAMs and CMOS logic implemented in the same technology
  - Non-volatility: useful in low-power and secure processing

Overview of NATURE (Contd.)
• Challenges in nano-circuits/architectures
  - Many programmable nanofabrics proposed: nanowire PLA (DeHon, 2004), CMOL (Strukov, 2005), etc.
  - Lack of a mature fabrication process
  - Fabrication defects and run-time failures (between 1% and 10%)
• Regular, reconfigurable architectures, such as an FPGA, favored
  - Facilitates fabrication
  - Fault tolerance through reconfiguration
  - NATURE: fabricatable using a CMOS-compatible fabrication process

• Non-volatile nanotube random-access memory (NRAM)
  - Mechanically bent or not: determines bistable on/off states
  - Same/opposite voltage applied to change the state
  - CMOS-compatible fabrication process
  - 10 Gbit NRAMs already fabricated: ready to be commercialized in the near future
[Figure: NRAM™ by Nantero. Source:]

NRAMs
• Properties of NRAMs
  - Non-volatile
  - Similar speed to SRAM
  - Similar density to DRAM
  - Chemically and mechanically stable
• NATURE not tied to NRAMs
  - Phase change RAM
  - Magnetoresistive RAM
  - Ferroelectric RAM

Architecture of NATURE
• Island-style logic blocks (LBs) connected by various levels of interconnects
• An LB contains a super macroblock (SMB) and a local switch matrix

Architecture of a Super Macroblock (SMB)
• n1 macroblocks (MBs) comprise an SMB: here n1 = 4

Architecture of a Macroblock (MB)
• n2 logic elements (LEs) comprise an MB: here n2 = 4

Logic Element (Basic Configuration)
• An LE implements a computation and contains:
  - An m-input look-up table (LUT)
  - l flip-flops
  - Input to each flip-flop selected between the LUT output and a primary input

Folding Levels
• Logic folding at different levels of granularity, providing flexibility to perform area-delay trade-offs
• Level-p folding: LE reconfiguration after the execution of p LUT computations
  - Reconfiguration time: 160 ps
• Larger folding level: delay typically decreases and area increases
[Figure: (a) level-1 folding, (b) level-2 folding]
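The area side of the folding-level trade-off can be illustrated with a minimal sketch that reuses the three example functions from the Temporal Logic Folding slide. It is not the NanoMap implementation: the `fold`, `run` and `cost` helpers, the dictionary that stands in for flip-flops, and the 300 ps LUT delay are assumptions made here for illustration. Only the 160 ps reconfiguration time comes from the slides.

```python
from math import ceil

# Example LUT functions from the Temporal Logic Folding slide.
# Each entry: signal name -> (logic depth, function of current signal values).
LUTS = {
    "i":   (1, lambda s: s["a"] and s["b"] and not s["c"]),
    "l":   (2, lambda s: (not s["i"] or not s["e"] or not s["f"]) and not s["h"]),
    "OUT": (3, lambda s: (not s["d"] and not s["g"]) or s["l"]),
}

RECONFIG_PS = 160    # reconfiguration time quoted on the slide
LUT_DELAY_PS = 300   # hypothetical LUT + interconnect delay (assumption)


def fold(luts, level):
    """Group LUTs into folding stages, each covering `level` logic depths."""
    depth = max(d for d, _ in luts.values())
    stages = [[] for _ in range(ceil(depth / level))]
    for name, (d, _) in luts.items():
        stages[(d - 1) // level].append(name)
    return stages


def run(luts, level, inputs):
    """Evaluate the network stage by stage, as if the same LEs were reconfigured."""
    signals = dict(inputs)  # stands in for flip-flops holding values between stages
    for stage in fold(luts, level):
        for name in sorted(stage, key=lambda n: luts[n][0]):  # respect depth order
            signals[name] = bool(luts[name][1](signals))
    return signals["OUT"]


def cost(luts, level):
    """Return (LEs needed, delay in ps) under a simplified model:
    every folding cycle lasts `level` LUT delays plus one reconfiguration."""
    stages = fold(luts, level)
    les = max(len(stage) for stage in stages)
    delay = len(stages) * (level * LUT_DELAY_PS + RECONFIG_PS)
    return les, delay


for p in (1, 3):
    print(f"level-{p} folding: LEs, delay(ps) =", cost(LUTS, p))
```

Under these assumptions, level-1 folding needs a single LE reused over three folding stages, while level-3 folding needs three LEs in one stage but finishes sooner. `run` returns the same OUT value at either level, since folding changes when each LUT is evaluated, not what it computes.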

Design Optimization Flow: NanoMap
• Optimize and implement design on NATURE
• Integrate temporal logic folding
  - Choose a proper folding level
  - Use force-directed scheduling (FDS) technique to balance resource usage across folding cycles
• Input design specified in register-transfer level (RTL) and/or gate-level VHDL

Motivational Example
• Different planes should have the same number of folding stages to guarantee global synchronization
• Key issue: how to achieve the optimization objective
  - Choose an appropriate folding level
  - Assign the logic to folding stages
[Figure labels: level-1 register, level-2 register, plane, logic in plane, plane cycle, folding stage, folding cycle]

Motivational Example (Contd.)
• Example optimization objective
  - Minimize circuit delay under an area constraint of 32 LEs
  - Assume each LE contains one LUT and two flip-flops: 32 LEs provide 32 LUTs and 64 flip-flops
[Figure labels: 50 LUTs; 14 flip-flops; 8 LUTs, logic depth: 4; 38 LUTs, logic depth: 7; plane depth: 9]

Iterative Design Flow
• Start with an initial guess for the folding level and iteratively refine it
  - Large folding level -> better circuit delay, but large area cost
  - Initial #folding stages:
  - Initial folding levels:
• Partition RTL modules into a series of connected LUT clusters
  - Logic depth of each cluster at most equal to the folding level
  - Significantly speeds up the mapping procedure

Iterative Design Flow (Contd.)
• Cluster size must not exceed the area constraint
[Figure labels: 34 LUTs > 32 LUTs; level-5 folding; level-4 folding]

Solution for the Example
• Three folding stages using level-4 folding
• 32 LEs required for mapping the RTL circuit; area constraint satisfied
• Circuit delay = 3 × folding cycle delay
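The numbers in this example can be reproduced with a short sketch of the guess-and-refine loop. The transcript does not preserve the formulas for the initial guess, so the two ceiling expressions in `initial_guess` are an assumption, chosen because they are consistent with the figures on the slides (50 LUTs, plane depth 9, 32 LEs). The `peak_les_at_level` callback is hypothetical and stands in for the partitioning and scheduling steps. Reading "34 LUTs > 32 LUTs" as the level-5 result is also an assumption.

```python
from math import ceil


def initial_guess(total_luts, plane_depth, available_les):
    """Assumed starting point: enough stages to fit all LUTs, then the
    folding level that spreads the plane depth over that many stages."""
    stages = ceil(total_luts / available_les)
    level = ceil(plane_depth / stages)
    return stages, level


def refine(level, plane_depth, available_les, peak_les_at_level):
    """Lower the folding level until the largest stage fits the area constraint.

    `peak_les_at_level` is a hypothetical callback returning the number of
    LEs needed when the circuit is mapped at a given folding level.
    """
    while level >= 1:
        if peak_les_at_level(level) <= available_les:
            return level, ceil(plane_depth / level)
        level -= 1
    raise ValueError("area constraint cannot be met even with level-1 folding")


# Figures from the slides: 50 LUTs, plane depth 9, area constraint of 32 LEs.
stages0, level0 = initial_guess(50, 9, 32)

# Peak LE usage per level: 34 at level 5 (assumed reading of the figure),
# 32 at level 4 (from the solution slide).
peak = {5: 34, 4: 32}
level, stages = refine(level0, 9, 32, lambda p: peak[p])

print("initial guess:", stages0, "stages at level", level0)
print("after refinement:", stages, "stages at level", level)
```

With these inputs the initial guess is two stages at level 5. That guess is rejected because 34 exceeds 32, and the loop settles on level-4 folding with three stages, matching the solution slide.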

NanoMap: Flow Diagram
[Flowchart. Inputs: input network, module library, user constraint, optimization objective. Phases: logic mapping, temporal clustering, temporal placement, routing. Steps named in the diagram: circuit parameter search; folding level computation; RTL module partition; schedule each LUT/LUT cluster using FDS; map each LUT/LUT cluster to SMBs; fast placement using modified VPR placer; delay estimation; final placement using modified VPR placer; final routing using VPR router; output reconfiguration bits. Decision points: perform logic folding? placement routable? satisfy area constraints? refine placement? satisfy delay constraints?]

Force-Directed Scheduling
• Perform FDS on RTL modules partitioned into LUTs/LUT clusters
• Iteratively schedule each LUT/LUT cluster to minimize overall resource usage
• Model resource usage as a force: F = Kx
  - K: distribution graphs (DGs) that describe the probability of resource usage
  - Aim of FDS: minimize force, indicating minimum increase in resource usage
• LE usage depends on LUT computations and register storage operations: two DGs needed
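The force model can be sketched in a few lines, following the classic force-directed scheduling formulation rather than NanoMap's exact variant. This sketch makes several simplifying assumptions. It uses a single distribution graph for LUT computations, not the two DGs the slide calls for, and it ignores data dependencies between operations, so only self-forces are computed. It also takes x to be the change in an operation's probability at each step, which the slide does not spell out. All function names are illustrative.

```python
def distribution_graph(frames, num_steps):
    """DG[t] = expected number of operations in folding cycle t.

    `frames` maps each operation to its (earliest, latest) feasible cycle,
    inclusive. An unscheduled operation is assumed equally likely to land
    in any cycle of its frame.
    """
    dg = [0.0] * num_steps
    for lo, hi in frames.values():
        p = 1.0 / (hi - lo + 1)
        for t in range(lo, hi + 1):
            dg[t] += p
    return dg


def self_force(op, step, frames, dg):
    """Force of fixing `op` at `step`: sum over its frame of DG[t] * x[t],
    where x[t] is the change in the op's probability at cycle t."""
    lo, hi = frames[op]
    p = 1.0 / (hi - lo + 1)
    return sum(dg[t] * ((1.0 if t == step else 0.0) - p) for t in range(lo, hi + 1))


def schedule(frames, num_steps):
    """Repeatedly fix the (operation, cycle) pair with the lowest self-force."""
    frames = dict(frames)
    while any(lo != hi for lo, hi in frames.values()):
        dg = distribution_graph(frames, num_steps)
        _, op, step = min(
            (self_force(op, t, frames, dg), op, t)
            for op, (lo, hi) in frames.items() if lo != hi
            for t in range(lo, hi + 1)
        )
        frames[op] = (step, step)
    return {op: lo for op, (lo, _) in frames.items()}


# A and B are fixed in cycle 0; C and D have slack.
frames = {"A": (0, 0), "B": (0, 0), "C": (0, 2), "D": (1, 2)}
print(schedule(frames, num_steps=3))
```

In this small case cycle 0 is already crowded, so the force of placing C there is positive and the scheduler moves C and D into the two emptier cycles, which keeps the peak usage at two operations per cycle.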

Temporal Clustering
• For each folding stage, a constructive algorithm is used to assign LUTs to LEs and pack LEs into MBs and SMBs
  - Unpacked LUT with a maximal number of inputs selected as initial seed
  - New LUTs with high attractions to the seed selected and assigned to the SMB
• Attractions depend on timing criticality and input pin sharing
• Considers attractions across all the folding cycles
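The seed-and-grow idea can be sketched as a greedy packer for a single folding stage. The attraction formula, the `alpha` weight and the function names are assumptions made for illustration, since the slide names only the two ingredients (timing criticality and input pin sharing). The sketch also does not look across folding cycles, which the actual flow does.

```python
def attraction(seed, cand, crit, inputs, alpha=0.5):
    """Hypothetical attraction: weighted mix of the candidate's timing
    criticality and the fraction of its inputs shared with the seed."""
    shared = len(inputs[seed] & inputs[cand])
    return alpha * crit[cand] + (1 - alpha) * shared / max(len(inputs[cand]), 1)


def cluster_stage(luts, crit, inputs, capacity):
    """Greedy seed-and-grow packing of one folding stage's LUTs into SMBs."""
    unpacked = set(luts)
    smbs = []
    while unpacked:
        # Seed: the unpacked LUT with the most inputs (ties broken by name).
        seed = max(sorted(unpacked), key=lambda n: len(inputs[n]))
        smb = [seed]
        unpacked.remove(seed)
        # Grow: keep adding the LUT most attracted to the seed until full.
        while unpacked and len(smb) < capacity:
            best = max(sorted(unpacked), key=lambda n: attraction(seed, n, crit, inputs))
            smb.append(best)
            unpacked.remove(best)
        smbs.append(smb)
    return smbs


# Illustrative data: four LUTs, their input signals, equal criticality.
inputs = {
    "u": {"a", "b", "c", "d"},
    "v": {"a", "b"},
    "w": {"x", "y"},
    "z": {"x", "y", "q"},
}
crit = {name: 0.5 for name in inputs}
print(cluster_stage(list(inputs), crit, inputs, capacity=2))
```

A capacity of 2 is used only to keep the example small. With the architecture instance described in the slides (4 MBs per SMB, 4 LEs per MB), an SMB would hold 16 LEs.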

Placement and Routing
• VPR (U. Toronto) modified to perform placement and support temporal logic folding
  - Simulated annealing approach
  - Cost function computed across the folding stages
• Routing using VPR router performed hierarchically, considering direct link, length-1, length-4 and global interconnects
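A cost function "computed across the folding stages" can be sketched as follows. This is not the modified VPR placer: the half-perimeter wirelength cost, the swap move, the cooling schedule and all names are assumptions chosen to keep the example small. The point it illustrates is that one physical placement is shared by every folding stage, so each stage's nets contribute to a single cost.

```python
import math
import random


def stage_cost(nets, pos):
    """Half-perimeter bounding-box wirelength of one folding stage's nets."""
    total = 0
    for net in nets:
        xs = [pos[block][0] for block in net]
        ys = [pos[block][1] for block in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total


def total_cost(stage_nets, pos):
    """The same placement serves every folding stage, so costs are summed."""
    return sum(stage_cost(nets, pos) for nets in stage_nets)


def anneal(stage_nets, pos, temp=5.0, cooling=0.95, moves=2000, seed=0):
    """Simulated annealing over block swaps; returns the best placement seen."""
    rng = random.Random(seed)
    pos = dict(pos)
    cost = total_cost(stage_nets, pos)
    best_pos, best_cost = dict(pos), cost
    blocks = sorted(pos)
    for _ in range(moves):
        a, b = rng.sample(blocks, 2)
        pos[a], pos[b] = pos[b], pos[a]            # propose a swap
        new = total_cost(stage_nets, pos)
        if new <= cost or rng.random() < math.exp((cost - new) / temp):
            cost = new                              # accept the move
            if cost < best_cost:
                best_pos, best_cost = dict(pos), cost
        else:
            pos[a], pos[b] = pos[b], pos[a]        # reject: undo the swap
        temp = max(temp * cooling, 1e-3)
    return best_pos, best_cost


# Four SMBs on a 2x2 grid; stage 0 uses one net, stage 1 uses two.
stage_nets = [[("S0", "S1")], [("S0", "S1"), ("S2", "S3")]]
pos = {"S0": (0, 0), "S1": (1, 1), "S2": (1, 0), "S3": (0, 1)}
best_pos, best_cost = anneal(stage_nets, pos)
print(total_cost(stage_nets, pos), "->", best_cost)
```

Because the S0-S1 net appears in both stages, it is counted twice in the total cost, so the annealer is pushed harder to keep those two blocks adjacent than it would be by either stage alone.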

Experimental Setup
• Instance of architecture:
  - 4 MBs in an SMB
  - 4 LEs in an MB
  - Each LE contains a 4-input LUT and 2 flip-flops
• Impact of fixing k at 16 vs. allowing a high enough k to show design trade-offs
• Results based on 100 nm technology parameters to implement CMOS logic and NRAMs

Experimental Results (Contd.)

Improvement under AT optimization for RTL Benchmarks

                 Reduction in #LEs   Maximum AT improvement   Average AT improvement   Circuit delay increase
High enough k    14.8X               16.2X                    11.0X                    31.8%
k = 16           9.2X                9.3X                     7.8X                     19.4%

• LE utilization around 100%
• Need for a deep interconnect hierarchy reduced by 50% for level-1 folding vs. no folding, indicating that trading interconnect area for NRAM area is advantageous
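As a rough consistency check on the table, the area-time (AT) improvement implied by the LE reduction and the delay increase can be computed directly. This sketch assumes area is proportional to LE count, which ignores NRAM and interconnect area. It also combines already-averaged columns, so it is not expected to match the reported per-benchmark averages exactly. The function name is illustrative.

```python
def at_improvement(le_reduction, delay_increase_pct):
    """AT improvement implied by an area reduction factor and a delay
    increase, assuming area scales with the number of LEs."""
    return le_reduction / (1.0 + delay_increase_pct / 100.0)


rows = [("high enough k", 14.8, 31.8, 11.0), ("k = 16", 9.2, 19.4, 7.8)]
for label, reduction, delay_pct, reported in rows:
    implied = at_improvement(reduction, delay_pct)
    print(f"{label}: implied {implied:.1f}X, reported average {reported}X")
```

The implied figures land close to the reported average AT improvements in both rows, which suggests the four columns of the table are mutually consistent.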

Experimental Results (Contd.)
• Flexibility in choosing the best folding level and performing area-delay trade-offs
• Mapping results for typical optimizations using Paulin benchmark as an example

Typical optimizations

          Opt. obj.   Area const. (#LEs)   Delay const. (ns)   Folding level
Case 1    AT          No                                       1
Case 2    Delay       No
Case 3    Area        No                   27                  4
Case 4    Delay       210                  No                  3

Conclusions
• NATURE: a new high-performance run-time reconfigurable architecture
• NanoMap: an integrated optimization design flow for NATURE
• Introduction of NRAMs into the architecture enables cycle-by-cycle reconfiguration and logic folding, leading to significant logic density and area-time product advantages
• Can be very useful for cost-conscious embedded systems and improvement of future FPGAs
• Non-volatility: helpful in secure and low-power processing