1 A Reconfigurable Signal Processing IC with embedded FPGA and Multi-Port Flash Memory
M. Borgatti, L. Calì, G. De Sandre, B. Forêt, D. Iezzi, F. Lertora, G. Muzzi, M. Pasotti, M. Poles, P.L. Rolandi
STMicroelectronics - Central R&D - Italy

2 Outline of Presentation
Project motivation and background
System architecture
  – Reconfigurable core
  – Memory subsystem
System performance
  – Application example: embedded face recognition system
Energy efficiency, measurements
SoC integration and design flow
  – System-to-RTL and RTL-to-layout
Summary

3 Project motivation and background
Conflicting industry trends
  – Economics of system integration
      Ever more complex SoCs
      More integration
      Cost effectiveness and performance (per unit)
  – Increasing design complexity and risks
  – Increasing NREs
  – Shorter time-to-market and product life
Strong need for:
  – Faster project turnaround
  – Lower risk
Usage of re-configurable silicon fabrics

4 Project motivation and background
Pragmatic approach proposed:
  – Reconfigurable architecture
  – Joins a statically extensible processor with e-FPGA
  – Tight connection to Flash memory subsystem
  – Open architecture with flexible programmable I/O
Programmable platform approach
  – Simple model for programmers

5 Programmable Platform Approach
[Diagram labels: System Applications Family; System Application; Silicon process + Enabling technologies; Platform Compilation; Config. Proc + e-FPGA; Application Compilation; Programmable platform]

6 System Architecture
[Block diagram labels: Extensible MPU; Inst. Ext I/F; Instr. Ext.; 8KB I$; 8KB D$; bus bridge; e-FPGA; General Purpose I/O Lines; GP I/O; I2C BUS M/S; I2C Master; AHB I/F; INTs; DMA & FPGA Prog. I/F; Buffer I/F; 1kB Buffer; 64 bit AHB BUS; AHB/APB Bridge; 64 bit APB BUS; I/O registers; 48 kB SRAM; Flash Mem with FP, CP and DP ports]

7 e-FPGA Purposes
Processor ISA extensions
  – Simplest programmer's model
  – Specific interface to the MPU datapath
  – Impact on processor performance
  – Impact on processor energy efficiency
  – Efficiency limited by instruction stream decoding
Bus-mapped co-processor
  – Maximum benefits in speed/power
Flexible I/O

8 e-FPGA – Microprocessor interface
[Diagram labels: Microprocessor clock; e-FPGA Clock; Clock Ctrl; Pipe Control; Decode; Register File; Instruction; Result; Instruction extension; Other FPGA Purposes]

9 Flash Memory Architecture
[Block diagram labels: 8-bit µP; µP I/F; PMA; DFT; Power Block; 2Mb modules #0, #1, #2, #3; 128-bit Memory Sub-System Crossbar; FPGA Port (FP); Code Port (CP); Data Port (DP)]

10 Flash Memory Subsystem
Modular approach
  – Customizable array of N independent 2Mb modules
3 content-specific ports (CP, DP, FP)
HW support for filesystem implementation (DP)
  – Defrag
  – Compression
  – Virtual erase
2Mb module features:
  – 128b I/O
  – 40ns access time (400MB/s peak throughput)
  – Power management and arbitration
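The 400MB/s peak figure follows directly from the module word width and access time quoted on this slide. A minimal sketch of that arithmetic, assuming one full 128-bit word is delivered per 40ns access (the function name is hypothetical, not from the presentation):

```python
def peak_throughput_mb_s(width_bits: int, access_ns: int) -> float:
    """Peak read throughput in MB/s when one word is delivered per access."""
    bytes_per_access = width_bits // 8
    # bytes per nanosecond is GB/s, so scale by 1000 to get MB/s
    return bytes_per_access * 1000 / access_ns


# One 2Mb flash module: 128-bit word, 40 ns access time
module_peak = peak_throughput_mb_s(128, 40)
print(f"Peak module throughput: {module_peak:.0f} MB/s")
```

This covers a single module only; how accesses to several modules overlap through the crossbar is not modeled here.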

11 System Memory Hierarchy
[Block diagram labels: 32-bit uP; Register File; 64-bit AHB Bus; AHB Bridge; DMA; 64-bit CP I/F; 64-bit DP I/F; 32-bit FPGA P I/F; 32-bit 512-B Buffer; "2 x x 32-bit" Memory Port I/Fs; 64-bit Port CP; 64-bit Port DP; 32-bit Port FP; 4 x Flash Memory Controller Logic; 6x4 128-bit Crossbar; "4 x x 128-bit" Memory Module]
Peak throughput:
  – AHB: 800MB/s
  – e-FPGA: 400MB/s (50MB/s sustained)
Total aggregate peak
  – 1.2GB/s
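The peak figures on this slide are consistent with simple width-times-clock arithmetic. The sketch below assumes a 100MHz clock on both paths (the clock used for the performance figures elsewhere in the deck; the slide itself does not state it here) and one transfer per cycle. The function and constant names are hypothetical.

```python
def bus_peak_mb_s(width_bits: int, clock_mhz: int) -> int:
    """Peak throughput in MB/s for one transfer of width_bits per clock cycle."""
    return (width_bits // 8) * clock_mhz


CLOCK_MHZ = 100  # assumption: both paths run at the 100 MHz system clock

ahb_peak = bus_peak_mb_s(64, CLOCK_MHZ)    # 64-bit AHB bus
fpga_peak = bus_peak_mb_s(32, CLOCK_MHZ)   # 32-bit e-FPGA port
aggregate_peak = ahb_peak + fpga_peak

print(f"AHB: {ahb_peak} MB/s, e-FPGA: {fpga_peak} MB/s, "
      f"aggregate: {aggregate_peak / 1000:.1f} GB/s")
```

Under these assumptions the sketch gives the 800MB/s, 400MB/s and 1.2GB/s peaks listed above. The 50MB/s sustained figure depends on factors the slide does not detail and is not modeled.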

12 Application Ex.: Face Recognition
Target application:
  – Recognize a face out of twenty
  – Low-resolution images from CMOS cameras
Potential applications:
  – Low cost smart toys
  – Advanced human-machine interfaces
  – Color CMOS camera processors
Image preprocessing: Bayer filter
Face location: based on Hough transform
Face recognition: Line-Based
  – Recognition rates over 90 %
  – Scale-invariant
  – Tolerant to changes in illumination intensity

13 Processor Extension (I)
[Datapath diagram labels: Processor Load Unit; 64-bit register; subtract and multiply stages with '8' and '16' width labels; Result]
4-segm. 8-issue, 8-bit L2 distance
Complexity:
  – 23 8-bit OPS
  – 6 64-bit OPS
1GOPS peak throughput
  – Distance computation
10k equiv. ASIC gates
Mapped to e-FPGA
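A software model can make the 8-issue, 8-bit distance kernel easier to picture. The sketch below is an assumption-laden illustration, not the authors' datapath: it treats each 64-bit operand as eight unsigned 8-bit lanes and returns the sum of squared lane differences (the square root is a separate kernel on the next slide). Counted naively, it uses 8 subtractions, 8 multiplications and 7 additions, which is one plausible reading of the "23 8-bit OPS" figure. All names are hypothetical.

```python
def unpack_lanes(word: int) -> list[int]:
    """Split a 64-bit word into eight unsigned 8-bit lanes (lane 0 = lowest byte)."""
    return [(word >> (8 * i)) & 0xFF for i in range(8)]


def l2_squared_packed(a: int, b: int) -> int:
    """Sum of squared differences between the 8-bit lanes of two 64-bit words."""
    diffs = [x - y for x, y in zip(unpack_lanes(a), unpack_lanes(b))]  # 8 subtractions
    squares = [d * d for d in diffs]                                   # 8 multiplications
    total = squares[0]
    for s in squares[1:]:                                              # 7 additions
        total += s
    return total


# Two feature words packed as eight 8-bit values each
a = int.from_bytes(bytes([1, 2, 3, 4, 5, 6, 7, 8]), "little")
b = int.from_bytes(bytes(8), "little")  # all-zero lanes
print(l2_squared_packed(a, b))
```

In hardware the eight lanes would be processed in parallel in a pipeline; the loop here only reproduces the arithmetic.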

14 Processor Extension (II)
[Datapath diagram labels: Number; Remaind.; root; shifts >>1, <<1, <<2, >>2, >>30; add, subtract, +1 and compare (>) operators; Result]
Fixed-point square root kernel
Complexity:
  – 12 32-bit OPS
2k equiv. ASIC gates
Mapped to e-FPGA
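The Number, Remainder and root registers with shifts by 2 and 30 are typical of the classic digit-by-digit (shift-and-subtract) integer square root. The sketch below shows that standard algorithm for a 32-bit input; it is an assumption that the slide's kernel works this way, and the function name is hypothetical.

```python
def isqrt32(number: int) -> int:
    """Floor of the square root of a 32-bit unsigned integer, two bits per step."""
    remainder = 0
    root = 0
    for _ in range(16):  # 16 bit pairs in a 32-bit operand
        # Bring the next two most significant bits of the operand into the remainder
        remainder = (remainder << 2) | (number >> 30)
        number = (number << 2) & 0xFFFFFFFF
        root <<= 1
        trial = (root << 1) + 1
        if remainder >= trial:  # the next root bit is 1
            remainder -= trial
            root += 1
    return root


print(isqrt32(16), isqrt32(1_000_000))
```

Each iteration uses only shifts, an add, a compare and a subtract, which is why this form maps well onto a small amount of reconfigurable logic.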

15 Performance: Processing 100 MHz
Algorithm Stage                     | RISC w/ basic DSP | RISC w/ basic DSP + uP Ext. | Speed-Up
Bayer Filter                        | 58 msec           | 24.7 msec                   | x 2.3
Edge Detection                      | 4.5 msec          | 2.5 msec                    | x 1.8
Face Detection                      | 1.5 sec           | 382 msec                    | x 4
Face Recognition (20-face database) | 9.15 sec          | 860 msec                    | x 10.6
Totals                              | 10.7 sec          | 1.26 sec                    | x 8.5
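The speed-up column is the ratio of the two execution times in each row. A short sketch that recomputes it from the times on this slide (the dictionary and function names are hypothetical):

```python
# Execution times in seconds, taken from the slide: (baseline RISC, RISC + uP extensions)
STAGE_TIMES = {
    "Bayer Filter": (0.058, 0.0247),
    "Edge Detection": (0.0045, 0.0025),
    "Face Detection": (1.5, 0.382),
    "Face Recognition": (9.15, 0.860),
}


def speedup(baseline_s: float, accelerated_s: float) -> float:
    """Ratio of baseline time to accelerated time."""
    return baseline_s / accelerated_s


for stage, (base, accel) in STAGE_TIMES.items():
    print(f"{stage}: x{speedup(base, accel):.2f}")

total_base = sum(base for base, _ in STAGE_TIMES.values())
total_accel = sum(accel for _, accel in STAGE_TIMES.values())
print(f"Total: x{speedup(total_base, total_accel):.2f}")
```

Summing the unrounded stage times gives an overall ratio of about 8.4; the slide's x 8.5 corresponds to dividing the rounded totals (10.7 sec / 1.26 sec). The overall gain is dominated by the face recognition stage, which accounts for most of the baseline time.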

16 Energy Efficiency vs. Flexibility
[Chart: Energy Efficiency (MOPS/mW) plotted against Flexibility (Coverage). Regions: Embedded Processors; ASIPs, DSPs; Dedicated HW; FPGA-mapped CoProcessors; uP + FPGA Instructions. Annotation: Energy-Flexibility Gap!]
From: Zhang et al., ISSCC 2000

17 Performance: Energy Efficiency
Algorithm Stage                     | Speed-Up | Energy Gain | Energy x Delay Gain
Bayer Filter                        | x 2.3    | x 1.4       | x 3.2
Edge Detection                      | x 1.8    | x 0.95      | x 1.7
Face Detection                      | x 4      | x 2.9       | x 11.6
Face Recognition (20-face database) | x 10.6   | x 9         | x 95.4
Totals                              | x 8.5    | x 6.7       | x 57
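The last column follows from the other two: since the energy-delay product is energy multiplied by execution time, its gain is the energy gain multiplied by the speed-up. A brief check of the table under that definition (names are hypothetical):

```python
# (speed-up, energy gain, energy x delay gain) as listed on the slide
ROWS = {
    "Bayer Filter": (2.3, 1.4, 3.2),
    "Edge Detection": (1.8, 0.95, 1.7),
    "Face Detection": (4.0, 2.9, 11.6),
    "Face Recognition": (10.6, 9.0, 95.4),
    "Totals": (8.5, 6.7, 57.0),
}


def energy_delay_gain(speedup: float, energy_gain: float) -> float:
    """Gain in energy-delay product: (E0 * T0) / (E1 * T1) = energy gain * speed-up."""
    return speedup * energy_gain


for stage, (s, e, listed) in ROWS.items():
    print(f"{stage}: computed x{energy_delay_gain(s, e):.2f}, listed x{listed}")
```

Note that edge detection has an energy gain below 1 (x 0.95): the extension makes that stage faster but slightly more energy-hungry, so its energy-delay gain is smaller than its speed-up.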

18 [Flow diagram labels: SW Apps; C; Functional model (untimed); Partitioning / I/F Synthesis / Refinement; Libraries HW/SW; uP ISS; Cycle Accurate Simulation; Performance Analysis; HW (RTL): uP, AHB/APB Bus, Peripherals; Soft Hardware (eFPGA); VHDL (e-FPGA); Inst. Ext. Verilog; eFPGA mapping; eFPGA HARD MACRO; SoC Integration]

19 [Flow diagram labels: Inst. Ext.; Coproc.; I/O I/F; Synthesis; Mapping (P&R); Bit-stream; FPGA Timing DB; eFPGA core Con. Netlist + Timing Database; CPU core, IPs; Interface RTL code; Flash; RAM; Synthesis; Floorplanning / P&R; Static Timing Analysis, Dynamic Verification; Static Timing Analysis (SoC + eFPGA); Silicon fab]

20 Chip Layout
Process: 0.18um CMOS 2P/6M Embedded Flash
Flash Memory (x4): 256kB x 9 sectors; 128-bit word; 1MB/s write throughput; 400MB/s read throughput
SRAM Memory: Main: 48kB (64-bit); I$: 8kB (64-bit); D$: 8kB (64-bit); Buffers: 4x256B
Chip size: 8.4 x 8.4 mm² (e-FPGA size: 8.2 mm²)
I/O: 24 inputs + 24 outputs (tristate) + 8 bidirs
Supply: V (external), 1.8V (core)
[Die photo labels: 48 KB SRAM; BUFFER; Embedded FPGA; TAGS; 8+8 KB I$ + D$; 32b uP + AHB & APB + 250k GATES; 1MB FLASH Memory; DFT; Flash Ports]

21 Chip Performances and Power Consumption
Processor maximum speed: 125MHz (WCMIL)
Reconfiguration 100MHz clock
Chip average power consumption 100MHz, 1.8V

22 Summary
e-FPGAs allow architectural tradeoffs for reconfigurable embedded systems:
  – Processor ISA extensions
  – Bus-mapped co-processor
  – Flexible I/O
Modular, content-specific, multiport e-Flash
Performance figures:
  – Up to 10x speedup
  – Up to 9x energy reduction
  – Dynamic reconfiguration in 500 us
Specific design-flow for system and RTL

23 Acknowledgements
The authors thank all the colleagues of NVM-DP Dept., A. Maurelli, F. Piazza and L. Fumagalli.

