NanoMap: An Integrated Design Optimization Flow for a Hybrid Nanotube/CMOS Dynamically Reconfigurable Architecture Wei Zhang†, Li Shang‡ and Niraj K. Jha†


1 NanoMap: An Integrated Design Optimization Flow for a Hybrid Nanotube/CMOS Dynamically Reconfigurable Architecture Wei Zhang†, Li Shang‡ and Niraj K. Jha† Dept. of Electrical Engineering Princeton University† Dept. of Electrical and Computer Engineering Queen’s University ‡

2 Outline  Temporal Logic Folding  Background on NRAMs  Overview for hybrid NAnoTUbe/CMOS REconfigurable architecture (NATURE) (DAC 2006)  NanoMap: Design Optimization Flow  Experimental Results  Conclusions

3 Temporal Logic Folding  Basic idea: Use run-time reconfiguration to realize different functions in the same resource every few cycles [Figure labels: LUT 1, LUT 2, LUT 3, MEM; i =abc’, l =(I’+e’+f’)h’, OUT =d’g’+l]
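The folding idea on this slide can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: it assumes the three functions shown in the figure are evaluated one per folding cycle on a single LUT resource, that the "I’" in the second function is the complement of i, and that a, b, c, d, e, f, g and h are primary inputs. The names `CONFIG_MEMORY`, `run_folded` and `run_unfolded` are hypothetical.

```python
from itertools import product

# LUT functions as written on the slide; reading "I'" as the complement of i is an assumption.
def lut1(r):
    return r["a"] and r["b"] and not r["c"]                         # i = abc'

def lut2(r):
    return (not r["i"] or not r["e"] or not r["f"]) and not r["h"]  # l = (i'+e'+f')h'

def lut3(r):
    return (not r["d"] and not r["g"]) or r["l"]                    # OUT = d'g' + l

# Hypothetical configuration store: one (output register, LUT function) pair per folding cycle.
CONFIG_MEMORY = [("i", lut1), ("l", lut2), ("OUT", lut3)]

def run_folded(inputs):
    """One LUT resource, reconfigured every cycle; registers carry intermediate results."""
    regs = dict(inputs)
    for out_reg, config in CONFIG_MEMORY:
        regs[out_reg] = bool(config(regs))
    return regs["OUT"]

def run_unfolded(inputs):
    """Reference: three separate LUTs evaluated as one combinational chain."""
    i = lut1(inputs)
    l = lut2({**inputs, "i": i})
    return bool(lut3({**inputs, "l": l}))

NAMES = "abcdefgh"
ALL_INPUTS = [dict(zip(NAMES, bits)) for bits in product([False, True], repeat=len(NAMES))]
print(all(run_folded(v) == run_unfolded(v) for v in ALL_INPUTS))
```

The folded version uses one LUT for three cycles where the unfolded version uses three LUTs for one pass, which is the area-for-time exchange the slide describes.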

4 Overview of NATURE  NATURE: CMOS fabrication compatible; NRAM-based; run-time reconfiguration; temporal logic folding; design flexibility; logic density  Distributed non-volatile nanotube RAMs (NRAMs): main storage for reconfiguration bits  Fine-grain reconfiguration (even cycle-by-cycle) and logic folding: area-delay trade-off flexibility; more than an order of magnitude increase in logic density; more than an order of magnitude reduction in area-time product (comparisons assume NRAMs and CMOS logic implemented in the same technology)  Non-volatility: useful in low-power and secure processing

5 Overview of NATURE (Contd.)  Challenges in nano-circuits/architectures: many programmable nanofabrics proposed, e.g. nanowire PLA (DeHon, 2004), CMOL (Strukov, 2005); lack of a mature fabrication process; fabrication defects and run-time failures (between 1% and 10%)  Regular, reconfigurable architectures, such as an FPGA, favored: facilitate fabrication; fault tolerance through reconfiguration  NATURE: fabricatable using a CMOS-compatible fabrication process

6 NRAM™ by Nantero  Non-volatile nanotube random-access memory (NRAM): mechanically bent or not, which determines the bistable on/off states; same/opposite voltage applied to change the state; CMOS-compatible fabrication process; 10 Gbit NRAMs already fabricated, ready to be commercialized in the near future  Source: http://www.nantero.com/nram.html

7 NRAMs  Properties of NRAMs Non-volatile Similar speed to SRAM Similar density to DRAM Chemically and mechanically stable  NATURE not tied to NRAMs Phase change RAM Magnetoresistive RAM Ferroelectric RAM

8 Architecture of NATURE  Island-style logic blocks (LBs) connected by various levels of interconnects  An LB contains a super macroblock (SMB) and a local switch matrix

9 Architecture of a Super Macroblock (SMB)  n1 macroblocks (MBs) comprise an SMB: here n1 = 4

10 Architecture of a Macroblock (MB)  n2 logic elements (LEs) comprise an MB: here n2 = 4

11 Logic Element (Basic Configuration)  An LE implements a computation and contains: An m-input look-up table (LUT) l flip-flops Input to flip-flop selected between LUT output and a primary input

12 Folding Levels  Logic folding at different levels of granularity, providing flexibility to perform area-delay trade-offs  Level-p folding: LE reconfiguration after the execution of p LUT computations; reconfiguration time: 160 ps  A larger folding level typically decreases delay and increases area  [Figure: (a) level-1 folding, (b) level-2 folding]
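The trade-off can be made concrete with a rough delay estimate. The sketch below is a simplified model, not the delay estimator used in NanoMap: it assumes a folding cycle costs p LUT delays plus one 160 ps reconfiguration, ignores interconnect and flip-flop overheads, and uses a hypothetical LUT delay of 400 ps. The function names are illustrative only.

```python
import math

RECONFIG_PS = 160  # reconfiguration time quoted on the slide

def folding_stages(logic_depth, p):
    """Number of folding stages needed to cover logic_depth LUT levels at level-p folding."""
    return math.ceil(logic_depth / p)

def estimated_delay_ps(logic_depth, p, lut_delay_ps):
    """Assumed model: each folding cycle = p LUT delays + one reconfiguration."""
    cycle_ps = p * lut_delay_ps + RECONFIG_PS
    return folding_stages(logic_depth, p) * cycle_ps

# Hypothetical circuit: logic depth 8, LUT delay 400 ps.
for p in (1, 2, 4):
    print(p, folding_stages(8, p), estimated_delay_ps(8, p, 400))
```

Under these assumptions a larger p means fewer reconfigurations and a shorter total delay, while more LUTs must be resident at once, so area grows.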

13 Design Optimization Flow: NanoMap  Optimize and implement design on NATURE  Integrate temporal logic folding Choose a proper folding level Use force-directed scheduling (FDS) technique to balance resource usage across folding cycles  Input design specified in register-transfer level (RTL) and/or gate-level VHDL

14 Motivational Example  Different planes should have the same number of folding stages to guarantee global synchronization  Key issue: how to achieve the optimization objective: choose an appropriate folding level; assign the logic to folding stages  [Figure labels: Level 1 register, Level 2 register, Plane, Logic in Plane, Plane cycle, Folding stage, Folding cycle]

15 Motivational Example (Contd.)  Example optimization objective: minimize circuit delay under an area constraint of 32 LEs  Assume each LE contains one LUT and two flip-flops: 32 LEs provide 32 LUTs and 64 flip-flops  [Figure labels: 50 LUTs, 14 flip-flops; 8 LUTs, logic depth: 4; 38 LUTs, logic depth: 7; plane depth: 9]

16 Iterative Design Flow  Start with an initial guess for the folding level and iteratively refine it: a large folding level gives better circuit delay, but at a large area cost Initial #folding stages: Initial folding levels:  Partition RTL modules into a series of connected LUT clusters, each with logic depth at most equal to the folding level; this significantly speeds up the mapping procedure

17 Iterative Design Flow (Contd.)  Cluster size should be smaller than the area constraint  [Figure: level-5 folding vs. level-4 folding; 34 LUTs > 32 LUTs]

18 Solution for the Example  Three folding stages using level-4 folding  32 LEs required for mapping the RTL circuit; area constraint satisfied  Circuit delay = 3 * folding cycle delay
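The search over folding levels in this example can be sketched as follows. The transcript does not preserve the formulas for the initial number of folding stages and the initial folding level, so the two ceiling divisions below are assumptions chosen to reproduce the sequence shown on the slides (level 5 tried first, then level 4). The callback `max_cluster_luts` and the cluster-size table are hypothetical stand-ins for the result of partitioning the RTL modules at a given level; only the value 34 for level 5 comes from the slides.

```python
import math

def choose_folding_level(total_luts, plane_depth, area_les, max_cluster_luts):
    """Return (folding level, number of folding stages), or None if nothing fits.

    max_cluster_luts(level) is a hypothetical callback giving the largest LUT
    cluster produced when modules are partitioned at that folding level.
    """
    stages = math.ceil(total_luts / area_les)   # assumed initial guess
    level = math.ceil(plane_depth / stages)     # assumed initial guess
    while level >= 1:
        if max_cluster_luts(level) <= area_les:  # largest cluster must fit in the area budget
            return level, math.ceil(plane_depth / level)
        level -= 1                               # smaller level: smaller clusters, more stages
    return None

# Numbers from the slides: 50 LUTs, plane depth 9, 32 LEs, a 34-LUT cluster at level 5.
# The remaining cluster sizes are illustrative only.
cluster_size = {5: 34, 4: 32, 3: 24, 2: 16, 1: 8}
print(choose_folding_level(50, 9, 32, cluster_size.__getitem__))
```

With these inputs the search rejects level 5 and settles on level-4 folding with three folding stages, matching the solution slide.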

19 NanoMap: Flow Diagram  [Figure: flow diagram with 16 numbered steps. Phases: logic mapping, temporal clustering, temporal placement, routing. Inputs: input network, module library, user constraint, optimization objective. Steps: circuit parameter search; folding level computation; RTL module partition; schedule each LUT/LUT cluster using FDS; map each LUT/LUT cluster to SMBs; fast placement using modified VPR placer; delay estimation; final placement using modified VPR placer; final routing using VPR router; output reconfiguration bits. Decision points: Perform logic folding? Satisfy area constraints? Placement routable? Refine placement? Satisfy delay constraints?]

20 Force-Directed Scheduling  Perform FDS on RTL modules partitioned into LUTs/LUT clusters  Iteratively schedule LUT/(LUT cluster) to minimize overall resource usage  Model resource usage as a force: F = Kx K: distribution graphs (DGs) that describe the probability of resource usage Aim of FDS: minimize force, indicating minimum increase in resource usage  LE usage depends on LUT computations and register storage operations: two DGs needed
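A minimal sketch of the force computation, following the classic force-directed scheduling formulation rather than NanoMap's exact variant. It assumes each LUT or LUT cluster has a time frame of allowed folding cycles (earliest to latest) and is equally likely to land in any cycle of that frame. Only self-forces on a single distribution graph are modeled; predecessor/successor forces and the second DG for register storage mentioned on the slide are left out. All names and the example frames are hypothetical.

```python
def distribution_graph(frames, n_cycles):
    """Sum, per folding cycle, the probability that each operation is scheduled there."""
    dg = [0.0] * n_cycles
    for earliest, latest in frames.values():
        prob = 1.0 / (latest - earliest + 1)
        for cycle in range(earliest, latest + 1):
            dg[cycle] += prob
    return dg

def self_force(dg, frame, target):
    """F = sum over the frame of K * x: DG value times the change in probability."""
    earliest, latest = frame
    prob = 1.0 / (latest - earliest + 1)
    return sum(dg[c] * ((1.0 if c == target else 0.0) - prob)
               for c in range(earliest, latest + 1))

def best_cycle(dg, frame):
    """Pick the cycle with the smallest force, i.e. the least added resource pressure."""
    earliest, latest = frame
    return min(range(earliest, latest + 1), key=lambda c: self_force(dg, frame, c))

# Hypothetical time frames (inclusive) for three LUT clusters over three folding cycles.
frames = {"A": (0, 0), "B": (0, 1), "C": (1, 2)}
dg = distribution_graph(frames, 3)
print(dg, best_cycle(dg, frames["B"]), best_cycle(dg, frames["C"]))
```

In this small case B is pushed away from the crowded first cycle and C into the emptiest one, which is the balancing effect the slide attributes to FDS.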

21 Temporal Clustering  For each folding stage, a constructive algorithm used to assign LUTs to LEs and pack LEs into MBs and SMBs Unpacked LUT with a maximal number of inputs selected as initial seed New LUTs with high attractions to the seed selected and assigned to the SMB  Attractions depend on timing criticality and input pin sharing  Considers attractions across all the folding cycles
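The constructive packing step can be sketched as a greedy seed-and-grow loop. This is a simplified illustration for a single folding stage, whereas the slide states that attractions are considered across all folding cycles. The attraction formula, the weight `alpha`, the criticality values and the field names are all assumptions made for the example.

```python
def attraction(cluster_inputs, lut, alpha=0.5):
    """Assumed attraction: weighted mix of timing criticality and fraction of shared input pins."""
    shared = len(cluster_inputs & lut["inputs"]) / len(lut["inputs"])
    return alpha * lut["criticality"] + (1 - alpha) * shared

def pack_smb(luts, capacity, alpha=0.5):
    """Greedily fill one SMB: seed with the widest unpacked LUT, then add the most attracted LUTs."""
    unpacked = dict(luts)
    seed = max(unpacked, key=lambda name: len(unpacked[name]["inputs"]))
    cluster = [seed]
    cluster_inputs = set(unpacked.pop(seed)["inputs"])
    while unpacked and len(cluster) < capacity:
        best = max(unpacked, key=lambda name: attraction(cluster_inputs, unpacked[name], alpha))
        cluster.append(best)
        cluster_inputs |= unpacked.pop(best)["inputs"]
    return cluster

# Hypothetical LUTs; a real SMB in the experimental setup holds 4 MBs x 4 LEs.
luts = {
    "u": {"inputs": {"a", "b", "c", "d"}, "criticality": 0.2},
    "v": {"inputs": {"a", "b", "e"}, "criticality": 0.9},
    "w": {"inputs": {"x", "y"}, "criticality": 0.1},
    "z": {"inputs": {"c", "d"}, "criticality": 0.3},
}
print(pack_smb(luts, capacity=3))
```

Here `u` becomes the seed because it has the most inputs, and `w`, which shares no pins with the cluster and is not timing critical, is left for another SMB.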

22 Placement and Routing  VPR (U. Toronto) modified to perform placement and support temporal logic folding Simulated annealing approach Cost function computed across the folding stages  Routing using VPR router performed hierarchically, considering direct link, length-1, length-4 and global interconnects

23 Experimental Setup  Instance of architecture: 4 MBs in an SMB; 4 LEs in an MB; each LE contains a 4-input LUT and 2 flip-flops  Impact of fixing k at 16 vs. allowing a high enough k to show design trade-offs  Results based on 100nm technology parameters to implement CMOS logic and NRAMs

24 Experimental Results (Contd.)  [Figure]

25 Improvement under AT optimization for RTL Benchmarks
            Reduction in #LEs   Maximum AT improvement   Average AT improvement   Circuit delay increase
k enough    14.8X               16.2X                    11.0X                    31.8%
k = 16      9.2X                9.3X                     7.8X                     19.4%
 LE utilization around 100%  50% reduced need for a deep interconnect hierarchy for level-1 vs. no-folding – indicates trading interconnect area for NRAM area advantageous

26 Experimental Results (Contd.)  Flexibility in choosing the best folding level and performing area-delay trade-offs  Mapping results for typical optimizations using Paulin benchmark as an example
Typical optimizations (columns: Opt. obj., Area const. (#LEs), Delay const. (ns), Folding level)
Case 1: AT, No, 1
Case 2: Delay, No
Case 3: Area, No, 274
Case 4: Delay, 210, No, 3

27 Conclusions  NATURE: A new high-performance run-time reconfigurable architecture  NanoMap: an integrated optimization design flow for NATURE  Introduction of NRAMs into the architecture enables cycle-by-cycle reconfiguration and logic folding: leading to significant logic density and area-time product advantages  Can be very useful for cost-conscious embedded systems and improvement of future FPGAs  Non-volatility: helpful in secure and low power processing

