Architectural Improvement for Field Programmable Counter Array: Enabling Efficient Synthesis of Fast Compressor Trees on FPGA

Alessandro Cevrero 1,2, Panagiotis Athanasopoulos 1,2, Hadi Parandeh-Afshar 2, Paolo Ienne 2, Yusuf Leblebici 1, Ajay K. Verma 2, Philip Brisk 2, Frank K. Gurkaynak 1

16th ACM/SIGDA International Symposium on FPGAs, Monterey, California, USA, February 26, 2008

Motivation and Contribution
- Goal: improve FPGA performance for arithmetic circuits
- Field Programmable Counter Array (FPCA) [Brisk et al., DAC 2007]
  - Programmable IP core to accelerate compressor trees
  - Hybrid FPGA/FPCA device
- Contributions: completely new FPCA architecture
  - Reduced routing delay
  - More flexibility and better mapping
  - Simplified integration process

FPGA Commentary
- Logic cells with dedicated addition circuitry and fast carry chains
  - Support for ternary addition [Altera Stratix II/III, Xilinx Virtex-5]
  - Parallel accumulation uses adder trees
  - ASIC designers use compressor trees!
- Compressor tree synthesis on FPGAs via generalized parallel counter (GPC) mapping [Parandeh-Afshar et al., ASP-DAC 2008, DATE 2008]
  - Faster than ternary adder trees
- IP cores
  - DSP48, BlockRAM, etc. [Xilinx, Altera]
  - FP cores [Beauchamp et al., TVLSI 2008]
  - Mismatches in bitwidth limit gains [Kuon and Rose, FPGA 2006, TCAD 2007]
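To make the GPC idea concrete, here is a minimal behavioral sketch in Python. It assumes the usual definition of a generalized parallel counter: a block that takes input bits of several weights and outputs their weighted sum in binary. The function name `gpc` and the list-of-columns input format are illustrative choices, not taken from the slides.

```python
def gpc(columns):
    """Behavioral model of a generalized parallel counter.

    columns[w] is a list of 0/1 input bits of weight 2**w.
    Returns the output bits, least significant first.
    """
    # Weighted sum of all input bits.
    total = sum(sum(bits) << w for w, bits in enumerate(columns))
    # Largest value the inputs can take fixes the number of output bits.
    max_total = sum(len(bits) << w for w, bits in enumerate(columns))
    n_out = max(1, max_total.bit_length())
    return [(total >> k) & 1 for k in range(n_out)]


# A single column of three bits behaves like a full adder (3 inputs, 2 outputs).
full_adder_out = gpc([[1, 1, 1]])

# Three bits of weight 1 and three bits of weight 2 need four output bits.
two_column_out = gpc([[1, 0, 1], [1, 1, 1]])
```

A single-column GPC is an ordinary parallel counter; allowing several input weights is what makes the counter "generalized".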

Methodology and Solution
1. Transform the circuit to merge disparate addition and multiplication operations to expose compressor trees [Verma and Ienne, ICCAD 2004]
2. Synthesize the compressor tree onto the FPCA [Brisk et al., DAC 2007]
3. Map everything else onto the traditional FPGA (standard approach)
4. Integrate FPGA+FPCA onto the same die (ongoing research at EPFL)
FPCA: programmable compressor tree
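Step 1 can be illustrated with a small Python sketch. It is not the transformation of Verma and Ienne; it only shows the underlying idea under simple assumptions: for an expression such as a*b + c, the partial-product bits of the multiplication and the bits of the addend can be placed in one set of weighted columns, so a single compressor tree sums them. The function names and the unsigned fixed-width operands are assumptions made for the example.

```python
def merged_columns(a, b, c, width):
    """Bit columns for a*b + c (unsigned, a and b are `width` bits wide).

    Column w holds every bit of weight 2**w: the AND-gate partial
    products of a*b plus the bits of the addend c.
    """
    cols = [[] for _ in range(2 * width)]
    # Partial products: bit i of a AND bit j of b has weight 2**(i+j).
    for i in range(width):
        for j in range(width):
            cols[i + j].append(((a >> i) & 1) & ((b >> j) & 1))
    # The addend contributes one more bit to each column it reaches.
    for w in range(2 * width):
        cols[w].append((c >> w) & 1)
    return cols


def column_value(cols):
    """Weighted sum of all bits, i.e. the value the compressor tree must produce."""
    return sum(sum(bits) << w for w, bits in enumerate(cols))


cols = merged_columns(13, 11, 7, width=4)
heights = [len(col) for col in cols]  # input height of each column
```

The column heights are what the compressor tree has to reduce; merging the operations yields one taller tree instead of a multiplier followed by a separate adder.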

Previous Work
- Initial FPCA architecture [Brisk et al., DAC 2007]
- Routing network delay
  - Performance bottleneck
- Poor area utilization
  - Many resources unused
  - Large counters implement the functionality of smaller counters
- “Pitch matching” problem
  - FPCA routing channels must align with FPGA routing channels
  - Leads to unnecessarily large counters

Recurring Patterns in Compressor Tree Synthesis
- New FPCA architecture: Counter Slice (CSlice)
  - Compress one column at a time
  - Propagate carry bits to neighboring CSlices
- Eliminates FPGA-style routing network
  - No routing delay between counters
  - Pitch matching problem disappears
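The column-at-a-time pattern can be sketched behaviorally in Python. This is a simplified model, not the CSlice circuit: it assumes each column taller than two bits is replaced by the binary count of its ones, with the higher count bits passed to the neighboring columns, and that the process repeats until every column holds at most two bits (ready for a final carry-propagate adder). All names are illustrative.

```python
def column_value(columns):
    """Weighted sum of all bits; column w has weight 2**w."""
    return sum(sum(bits) << w for w, bits in enumerate(columns))


def compress_level(columns, max_height=2):
    """One level of compression: count each tall column, spread the count upward."""
    out = []

    def put(w, bit):
        while len(out) <= w:
            out.append([])
        out[w].append(bit)

    for w, bits in enumerate(columns):
        if len(bits) <= max_height:
            for bit in bits:          # short columns pass through unchanged
                put(w, bit)
        else:
            count = sum(bits)         # parallel counter on this column
            for k in range(len(bits).bit_length()):
                put(w + k, (count >> k) & 1)  # bit k goes to column w + k
    return out


def compress(columns):
    """Repeat levels until every column has at most two bits."""
    levels = 0
    while max(len(bits) for bits in columns) > 2:
        columns = compress_level(columns)
        levels += 1
    return columns, levels


start = [[1, 1, 1, 0, 1], [1, 0, 1, 1], [1, 1, 1]]
reduced, levels = compress(start)
```

Each level strictly reduces the total number of bits, so the loop terminates, and the weighted sum of the columns is unchanged from start to finish.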

FPCA v2.0
[Figure: CSlice architecture with configurable GPC; area utilization]

FPCA v2.0 Mapping Heuristic
- FPCA synthesis heuristic:
  - Map columns of input bits onto FPCA
  - Minimize the height of the compressor tree
  - Avoid vertical configurations, when possible
- Multi-FPCA configurations: horizontal and vertical
[Figure: horizontal and vertical multi-FPCA configurations; routing delay]
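The choice between horizontal and vertical configurations can be illustrated with a rough Python sketch. It is not the authors' heuristic. It assumes one column per CSlice, 8 CSlices per FPCA, and 16 input bits per CSlice (the figures given on the CSlice synthesis slide), and it only reports how many FPCAs a tree would span side by side and whether any column is too tall for a single CSlice. The function name and the returned dictionary are hypothetical.

```python
from math import ceil

CSLICES_PER_FPCA = 8     # figure quoted on the CSlice synthesis slide
INPUTS_PER_CSLICE = 16   # figure quoted on the CSlice synthesis slide


def plan_mapping(column_heights):
    """Classify a compressor tree given the number of input bits per column.

    Assumes one column per CSlice. Wide trees chain FPCAs horizontally;
    a column with more bits than a CSlice accepts forces a vertical
    configuration, which the heuristic tries to avoid.
    """
    width = len(column_heights)
    tallest = max(column_heights)
    return {
        "fpcas_wide": ceil(width / CSLICES_PER_FPCA),
        "needs_vertical": tallest > INPUTS_PER_CSLICE,
    }


wide_tree = plan_mapping([12] * 20)   # 20 columns, 12 bits each
tall_tree = plan_mapping([20] * 8)    # 8 columns, 20 bits each
```

Under these assumptions the wide tree spans three FPCAs horizontally, while the tall tree fits one FPCA in width but would need a vertical configuration.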

CSlice Synthesis
- CSlice v2.0: rank-3 with 16 input bits per CSlice
- 90nm Artisan standard cell library

CSlice           Rank-1   Rank-2   Rank-3
Area [µm²]       1240     2347     2770
Delay [ns]       0.40     0.71     0.73
CPA delay [ns]   0.04     0.05     0.07

FPCA Synthesis
- Rank-3 CSlices used in experiments
- 8 CSlices per FPCA
  - Similar to dimensions of a DSP block in current FPGAs
  - Simplifies integration process
- DFFs store configuration bitstream
- Semi-custom design
  - Standard cells are predominant

FPCA Delay Extraction
- Methodology:
  - Each FPCA instance is replaced with an F* instance (same I/O)
  - Extract the delay between F* instances
  - Combine these delays with the combinational delay extracted for the FPCA
- F*:
  - Define a pre-placed soft IP core: F*
    - Same dimensions and I/O as the FPCA
  - Replace all sum operations with F*
  - Map onto Stratix II FPGA
  - Extract critical path delay
- FPCA:
  - Map compressor tree onto FPCA
    - Configuration DFF values set to constant values; not optimized
  - Measure critical path delay
- For each compressor tree in the circuit:
  - Subtract delay of F*
  - Add FPCA delay
[Figure: F* block with input pins, output pins, and SUM]
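The final bookkeeping step (subtract the F* delay, add the FPCA delay for each compressor tree) amounts to simple arithmetic, sketched below in Python. The function name is hypothetical and the numbers are made-up placeholders, not measurements from the talk. The sketch also assumes that every listed compressor tree lies on the critical path.

```python
def adjusted_critical_path(measured_ns, trees):
    """Estimate the FPGA+FPCA critical path from the F*-based measurement.

    measured_ns: critical path delay extracted with F* placeholders.
    trees: (fstar_delay_ns, fpca_delay_ns) for each compressor tree
           assumed to lie on that path.
    """
    adjusted = measured_ns
    for fstar_ns, fpca_ns in trees:
        adjusted = adjusted - fstar_ns + fpca_ns  # swap placeholder for FPCA
    return adjusted


# Hypothetical values, for illustration only.
estimate = adjusted_critical_path(10.0, [(2.0, 0.75), (1.5, 0.75)])
```

The placeholder keeps the FPGA place-and-route realistic (same footprint and pins as the FPCA), while the FPCA's own combinational delay comes from the standard-cell synthesis.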

Experimental Results
- Comparison: GPC mapping [Parandeh-Afshar et al., ASP-DAC 2008] vs. FPCA mapping (6 FPCAs per device)
[Figure: comparison chart; labeled values 2.40x and 1.60x]

Conclusion
- New FPCA architecture
  - Hardwired connections between counters
  - Counters of multiple sizes organized into CSlices
  - Carry chains between CSlices
- Avg./max. speedups of 1.60x/2.40x compared to GPC mapping

Future Work
- Add pipeline registers to FPCA
  - Increase latency, increase clock frequency, throughput
- Demonstrator chip taped out in October 2007
  - Returned from the foundry in January 2008; PCBs ready next week
  - Measure power consumption, clock frequency, I/O interface, etc.

Demonstrator Chip
