FPGAs at 28nm: Meeting the Challenge of Modern Systems-on-a-Chip

Slides:

Advertisements

Similar presentations

FPGA (Field Programmable Gate Array)

Advertisements

© 2003 Xilinx, Inc. All Rights Reserved Course Wrap Up DSP Design Flow.

Commercial FPGAs: Altera Stratix Family Dr. Philip Brisk Department of Computer Science and Engineering University of California, Riverside CS 223.

Evolution of Chip Design ECE 111 Spring A Brief History 1958: First integrated circuit – Flip-flop using two transistors – Built by Jack Kilby at.

EELE 367 – Logic Design Module 2 – Modern Digital Design Flow Agenda 1.History of Digital Design Approach 2.HDLs 3.Design Abstraction 4.Modern Design Steps.

Graduate Computer Architecture I Lecture 16: FPGA Design.

Maintaining Data Integrity in Programmable Logic in Atmospheric Environments through Error Detection Joel Seely Technical Marketing Manager Military &

Graduate Computer Architecture I Lecture 15: Intro to Reconfigurable Devices.

FPGA-Based System Design: Chapter 3 Copyright  2004 Prentice Hall PTR SRAM-based FPGA n SRAM-based LE –Registers in logic elements –LUT-based logic element.

Extensible Processors. 2 ASIP Gain performance by:  Specialized hardware for the whole application (ASIC). −  Almost no flexibility. −High cost.  Use.

Lecture 26: Reconfigurable Computing May 11, 2004 ECE 669 Parallel Computer Architecture Reconfigurable Computing.

Configurable System-on-Chip: Xilinx EDK

Programmable logic and FPGA

Ethernet Bomber Ethernet Packet Generator for network analysis Oren Novitzky & Rony Setter Advisor: Mony Orbach Started: Spring 2008 Part A final Presentation.

Network-on-Chip: Communication Synthesis Department of Computer Science Texas A&M University.

Mohamed ABDELFATTAH Vaughn BETZ. 2 Why NoCs on FPGAs? Embedded NoCs Area & Power Analysis Comparison Against P2P/Buses 4 4.

Octavo: An FPGA-Centric Processor Architecture Charles Eric LaForest J. Gregory Steffan ECE, University of Toronto FPGA 2012, February 24.

© 2010 Altera Corporation—Public DSP Innovations in 28-nm FPGAs Danny Biran Senior VP of Marketing.

Programmable Logic- How do they do that? 1/16/2015 Warren Miller Class 5: Software Tools and More 1.

EKT303/4 PRINCIPLES OF PRINCIPLES OF COMPUTER ARCHITECTURE (PoCA)

Networking Virtualization Using FPGAs Russell Tessier, Deepak Unnikrishnan, Dong Yin, and Lixin Gao Reconfigurable Computing Group Department of Electrical.

Lecture 2: Field Programmable Gate Arrays September 13, 2004 ECE 697F Reconfigurable Computing Lecture 2 Field Programmable Gate Arrays.

© 2010 Altera Corporation—Public Introducing 28-nm Stratix V FPGAs and HardCopy V ASICs: Built for Bandwidth 2010 Technology Roadshow.

© 2011 Altera Corporation—Public Introducing Qsys – Next Generation System Integration Platform AP Tech Roadshow.

Efficient Multi-Ported Memories for FPGAs Eric LaForest Greg Steffan University of Toronto Computer Engineering Research Group February 22, 2010.

© 2010 Altera Corporation—Public Quickly Master SDC (Synopsis Design Constraint) Timing Analysis 2010 Technology Roadshow.

Lecture #3 Page 1 ECE 4110– Sequential Logic Design Lecture #3 Agenda 1.FPGA's 2.Lab Setup Announcements 1.No Class Monday, Labor Day Holiday 2.HW#2 assigned.

Power Reduction for FPGA using Multiple Vdd/Vth

Titan: Large and Complex Benchmarks in Academic CAD

1 3-General Purpose Processors: Altera Nios II 2 Altera Nios II processor A 32-bit soft core processor from Altera Comes in three cores: Fast, Standard,

CAD for Physical Design of VLSI Circuits

1 WORLD CLASS – through people, technology and dedication High level modem development for Radio Link INF3430/4431 H2013.

1 Nios II Processor Architecture and Programming CEG 4131 Computer Architecture III Miodrag Bolic.

Electronics in High Energy Physics Introduction to Electronics in HEP Field Programmable Gate Arrays Part 1 based on the lecture of S.Haas.

GBT Interface Card for a Linux Computer Carson Teale 1.

System Arch 2008 (Fire Tom Wada) /10/9 Field Programmable Gate Array.

Automated Design of Custom Architecture Tulika Mitra

© 2011 Altera Corporation - Public Optimizing Power and Performance in 28-nm FPGA Designs Technology Roadshow

1 Moore’s Law in Microprocessors Pentium® proc P Year Transistors.

SPREE RTL Generator RTL Simulator RTL CAD Flow 3. Area 4. Frequency 5. Power Correctness1. 2. Cycle count SPREE Benchmarks Verilog Results 3. Architecture.

J. Christiansen, CERN - EP/MIC

© 2010 Altera Corporation—Public Easily Build Designs Using Altera’s Video and Image Processing Framework 2010 Technology Roadshow.

Power-Aware RAM Processing for FPGAs December 9, 2005 Power-aware RAM Processing for FPGA Embedded Memory Blocks Russell Tessier University of Massachusetts.

NIOS II Ethernet Communication Final Presentation

Introduction to FPGA Created & Presented By Ali Masoudi For Advanced Digital Communication Lab (ADC-Lab) At Isfahan University Of technology (IUT) Department.

Lecture #3 Page 1 ECE 4110–5110 Digital System Design Lecture #3 Agenda 1.FPGA's 2.Lab Setup Announcements 1.HW#2 assigned Due.

EKT303/4 PRINCIPLES OF PRINCIPLES OF COMPUTER ARCHITECTURE (PoCA)

© 2010 Altera Corporation - Public Lutiac – Small Soft Processors for Small Programs David Galloway and David Lewis November 18, 2010.

ESS | FPGA for Dummies | | Maurizio Donna FPGA for Dummies Basic FPGA architecture.

Survey of multicore architectures Marko Bertogna Scuola Superiore S.Anna, ReTiS Lab, Pisa, Italy.

Ethernet Bomber Ethernet Packet Generator for network analysis

Introduction to Field Programmable Gate Arrays Lecture 1/3 CERN Accelerator School on Digital Signal Processing Sigtuna, Sweden, 31 May – 9 June 2007 Javier.

© 2009 Altera Corporation Floating Point Synthesis From Model-Based Design M. Langhammer, M. Jervis, G. Griffiths, M. Santoro.

Application-Specific Customization of Soft Processor Microarchitecture Peter Yiannacouras J. Gregory Steffan Jonathan Rose University of Toronto Electrical.

Altera Technical Solutions Seminar Schedule OpeningIntroduction FLEX ® 10KE Devices APEX ™ 20K & Quartus ™ Overview Design Integration EDA Integration.

Introduction to Field Programmable Gate Arrays (FPGAs) EDL Spring 2016 Johns Hopkins University Electrical and Computer Engineering March 2, 2016.

© 2008 Altera Corporation—Public 40-nm Stratix IV FPGAs Innovation Without Compromise.

Topics SRAM-based FPGA fabrics: Xilinx. Altera..

FPGAs in AWS and First Use Cases, Kees Vissers

Maintaining Data Integrity in Programmable Logic in Atmospheric Environments through Error Detection Joel Seely Technical Marketing Manager Military &

Field Programmable Gate Array

Field Programmable Gate Array

Field Programmable Gate Array

Programmable Logic- How do they do that?

The performance requirements for DSP applications continue to grow and the traditional solutions do not adequately address this new challenge Paradigm.

A High Performance SoC: PkunityTM

HIGH LEVEL SYNTHESIS.

Win with HDL Slide 4 System Level Design

Network-on-Chip Programmable Platform in Versal™ ACAP Architecture

Programmable logic and FPGA

Presentation transcript:

FPGAs at 28nm: Meeting the Challenge of Modern Systems-on-a-Chip Vaughn Betz Senior Director, Software Engineering Altera

Overview Process scaling & FPGAs FPGAs becoming SoCs Stratix V & 28 nm End user demand Technological challenges FPGAs becoming SoCs Stratix V: more hard IP FPGA families targeted at more specific markets Stratix V & 28 nm Challenges & features Partial reconfiguration Designer productivity Challenges Possible software stack solutions

Demand and Scaling Trends

Broad End Market Demand Communications Broadcast Military Consumer/industrial Mobile internet and video driving bandwidth at 50% annualized growth rate Fixed footprints Proliferation of HD/1080p Move to digital cinema and 4k2k Software defined radio More sensors, higher precision Advanced radar Smart cars and appliances Smart Grid Need more processing in same footprint, power and cost 4 4

Driving Factors—Mobility and Video Mobile Bandwidth Video Bandwidth 100 10,000,000 4K2K 1,000,000 100,000 Kb/s 10,000 Streaming Bandwidth (Mbps) 1080p 10 1,000 720p 1080i 100 480p 10 SD 1 1 1G 1983 2G 1991 3G 2001 4G 2009 (LTE) 5G ~2017 1970 1980 1990 2000 2010 2020 Minimum Bandwidth Maximum Bandwidth

Evolution of Video-Conferencing Today Tomorrow High end in 2000: 384 kbps Cisco telepresence: 15 Mbps

Communication Processing Needs More bandwidth: CAGR of 25% to 131% / year [By domain, Cisco] More data through fixed channel  more processing per symbol Security and quality of service needs  deep packet inspection Toronto Internet Exchange (TorIX), 2009-2010 [Courtesy W. Gross, McGill]

Moore’s Law: On-Chip Bandwidth Datapath width * datapath speed 40% / year increase in transistor density 20% / year transistor speed until ~90 nm Total ~60% gain / year 40 nm and beyond: Little intrinsic transistor speed gain once power controlled ~40% gain / year from pure scaling Need to innovate to keep up with demand

Increasing I/O Bandwidth 1000 3D integration? Optical? 26% increase / year per lane Modest growth in # lanes / chip PCIe 3 PCIe 2 100 100 GbE RapidIO 3.0 Interlaken Bandwidth (Gbps) in Log Scale OC 768 PCIe 20G FC RapidIO 2.0 10 10G FC 10 GbE PCI-X RapidIO 1.0 PCI-66 3G SDI OC 48 PCI 1G FC GbE 1 1990 1995 2000 2005 2010

IP FPGA Scaling Economics TSMC Fab 15: $9B $10,000 TSMC Fab 15: $9B 40 & 28 nm 90’s fab cost  fabless industry Chip cost @ 28 nm ~$60M Need big market  go programmable “Chipless” industry emerging $1,000 Foundry Facility Costs ($M) $100 $10 $1 1965 1970 1975 1980 1985 1990 1995 2000 FPGA IP ASSP

FPGAs Becoming SoCs Example: Stratix V

From Glue Logic to SoC 1992 1995 1999 2002 2008 2003 2010 Flex 10k Stratix APEX 20K Stratix GX Stratix IV GX Stratix V GX LUTs FFs Basic I/Os Block RAM PLLs Complex I/Os Hard Processor DSP Blocks Hard PCIe Gen1/2 Serial Transceivers Hard PCIe Gen1/2/3 Hard 40G / 100G Ethernet

Hard Block Evaluation Develop Parameterized Soft IP Hard PCIe? Specific IP in soft fabric Develop Parameterized Soft IP Area, Power, Speed Hard PCIe? Create Configurable Hard IP Estimate Usage & Dev. Cost Gen1 Gen2 Gen3 Area, Power, Speed Include routing ports! Net Win?

Stratix V Transceivers Power Down Power Down Fractional PLLs (fPLL) Clock networks Embedded HardCopy Block) Hard PCS Transceiver PMA I/O Hard PCS Transceiver PMA I/O Hard PCS Transceiver PMA I/O Hard PCS Transceiver PMA I/O FPGA Fabric Hard PCS Transceiver PMA I/O LC Transmit PLLs Hard PCS Transceiver PMA I/O Hard PCS Transceiver PMA I/O Hard PCS Transceiver PMA I/O Hard PCS Transceiver PMA I/O Hard PCS Transceiver PMA I/O 14

Embedded HardCopy Blocks Metal programmed: reduces cost of adding device variants with new hard IP 700K equivalent LEs 14M ASIC gates 5X area reduction vs. soft logic 65% reduction in operating power Very low leakage when unused PCIe Gen3 40G/100G Ethernet Other/Custom 15

Hard IP Example: PCIe & Interlaken Interlaken – PCI Express Switch/Bridge Hard IP LE Savings Interlaken (24 Ch @ 5K LEs) 120K LEs PCIe Gen3 x8 (2 x 160K LEs) 320K LEs Total LE savings 440K LEs 12 Ch @ 5G Interlaken 12 Ch @ 5G Interlaken Stratix V FPGA 5SGXA7 ~630K LEs PCIe Gen3 x8 PCIe Gen3 x8 630K LEs + 440K LEs = 1,070K LEs Lower power Higher effective density Guaranteed timing closure  ease of use 16

Variable-Precision DSP Block Efficiently supports 9x9, 18x18 and 27x27 multiplies 27x27 well suited to floating point Cascade blocks for larger multiplies Can store filter coefficients in register bank inside DSP

Stratix V Maximum Capacities Feature Stratix V Logic Elements 1.1 M RAM bits 52 Mb + 7.3 Mb 18x18 multipliers 3680 High-speed serial links GX: 66 full-duplex @ 12.5 Gb/s GT: 4 @ 28 Gb/s + 32 @ 12.5 Gb/s Hard PCIe blocks 4 Hard 40G / 100G PCS Yes Memory interfaces 7 x 72-bit DDR3 DIMM @ 800 MHz On-chip memory bandwidth ~20,000 GB/s I/O Bandwidth ~300 GB/s 18x18 MACs 1,840 GMAC/s

Altera’s Device Roadmap HardCopy Stratix Arria Cyclone Altera’s ASIC series High-end FPGAs Mid-range FPGAs Low-cost FPGAs Altera’s Device Roadmap HardCopy V ASIC Stratix V FPGA Arria V FPGA HardCopy IV ASIC Stratix IV FPGA HardCopy III ASIC Performance, features, and density Cyclone V FPGA Stratix III FPGA Arria II FPGA Cyclone IV FPGA Arria FPGA Cyclone III LS FPGA Cyclone III FPGA MAX IIZ CPLD 2007 2008 2009 2010 2011 2012 19 19 19 19 19 19

100G Optical System (Stratix II GX) Ten 90 nm FPGAs 1 or 2 @ 28 nm T. Mizuochi, et al, “Experimental demonstration of concatenated LDPC and RS codes by FPGAs emulation,” IEEE Photon Technol. Lett., 2009

Other Challenges & Enhancements @ 28 nm

Stratix V FPGA Power Reduction (New techniques highlighted in yellow) Controlling Power Stratix V FPGA Power Reduction (New techniques highlighted in yellow) Lower Static Power Lower Dynamic Power 28-nm process (high-k, more strain, small C) ü Programmable Power Technology Lower core voltage (0.85 V) Extensive hardening of IP, Embedded HardCopy Blocks Hard power-down of more functional blocks More granular clock gating Selective use of high-speed transistors Dynamic on-chip termination Quartus II software PowerPlay power optimization 22

Fabric Performance Low operating voltage key to reasonable power But costs speed Logic still speeding up, routing more challenging Optimize process for FPGA circuitry (e.g. pass gates) Trend to bigger blocks / more hard IP Wire resistance rapidly increasing  Co-optimize metal stack & FPGA routing architecture Greater mix of wire types and metal layers (H3, H6, H20, V4, V12) Delay to cross chip not scaling Above ~300 MHz, designers pipelining interconnect

Fabric: More Registers MLAB Wr Data Reg Reg ALM1 Full Full Adder Adder ALM2 Reg Reg Adaptive Adaptive Wr Addr Rd Data LUT LUT Reg Reg Full Full Adder Adder ALM10 Reg Reg Memory mode: 5 registers Re-uses 4 ALM registers Adds extra register for write address Easier timing Double the logic registers (4 per ALM) Faster registers Aids deep pipelining & interconnect pipelining

Metastability Robustness data Metastable? clka clkb clk clk data ~Vdd/2

Metastability Robustness Source: Chen, FPGA 2010 Loop gain at Vdd/2 dropping  tmet increasing Solution: register design (e.g. use lower Vt) Solution: CAD system analyzes & optimizes 20,000 to 200,000 increase in MTBF

Pass Transistors Most area-efficient routing mux But Vdd – Vt dropping Bias Temperature Instability (BTI) makes worse Increase / hysteresis in Vt due to Vgs state over time All circuits affected, but pass transistors more sensitive to Vt shift Careful process and circuit design needed Future scaling: Full CMOS? Opening for a new programmable switch?

Soft Errors Block RAM: new M20K block has hard ECC MLAB: can implement ECC in soft logic Configuration RAM: background ECC But could take up to 33 ms to detect Config. RAM circuit design to minimize SEU Trends with SRAM scaling: Smaller target  lower FIT rate / Mb (constant per die) Less charge  higher FIT for alpha, stable for neutron Will this stabilize at an acceptable rate? Known techniques to greatly reduce (at area cost)  does not threaten scaling

Stratix V Partial Reconfiguration Very flexible HW Reconfigure individual LABs, block RAMs, routing muxes, … Without disrupting operation elsewhere

Partial Reconfiguration (PR) Overview Software flow is key Build on existing incremental design & floorplanning tools Enter design intent, automate low-level details Simulation flow for operation, including reconfiguration Partial reconfiguration can be controlled by soft logic, or an external device Load partial programming files while device operating Target: multi-modal applications

Example System: 10*10Gbps→OTN4 Muxponder Client Side Line Side MUXPonder Channel 1 10Gbs 10GbE Channel 2 100Gps OTN4 10Gbs OTN2 10GbE OTN2 Channel 10 10Gbs 31 31

Design Entry & Simulation One set of HDL Tools to simulate during reconfig module reconfig_channel (clk, in, out); input clk, in; output [7:0] out; parameter VER = 2; // 1 to select 10GbE, 2 to select OTN2 generate case (VER) 1: gige m_gige (.clk(clk), .in(in), .out(out)); 2: otn2 m_otn2 (.clk(clk), .in(in), .out(out)); default: gige m_gige(.clk(clk), .in(in), .out(out)); endcase endgenerate endmodule

Incremental Design Flow Background Specify partitions in your design hierarchy Can independently recompile any partition CAD optimizations across partitions prevented Can preserve synthesis, placement and routing of unchanged partitions Top … Channel 2 MUXponder OTN4 Channel 1 33

Partial Reconfig Instances Top … MUXponder OTN4 C1, OTN2 C2, OTN2 C1, 10GbE C2, 10GbE Static partition Partial Reconfig Partition 2 Partial Reconfig Partition 2 34

Partial Reconfiguration: Floorplanning Define partial reconfiguration regions Non-rectangular OK Any number OK Works in conjunction with transceiver dynamic reconfiguration for dynamic protocol support “Double-buffered” partial reconfig Partial Reconfiguration for Core 10GbE OTN2 FPGA Core Transceivers OTN4 Dynamic Reconfiguration for Transceivers 10GbE FPGA Core Transceivers OTN4 35

Physical: I/Os PR region I/Os must stay in same spot Same wire? So rest of design can communicate with any instance Same wire? FPGAs not designed to route to/from specific wires Solution: automatically insert “wire LUT” Automatically lock down in same spot for all instances OTN2 10GbE MUXponder

Physical: Route-Throughs Can partially reconfigure individual routing muxes Enables routing through partial reconfig regions Simplifies / removes many floorplanning restrictions Quartus II records routing reserved for top-level use Prevents PR instances from using it Transceivers 10GbE OTN2 OTN4

Extending & Improving the Software Stack

Design Flow Challenges HDL: low-level parallel programming language RTL ~300 kLOCs, behavioural ~40 kLOCs [NEC, ASPDAC04] Timing closure Fabric speed flattening, but processing needs growing Datapaths widening, device sizes growing exponentially 4x28 Gbps  336 bit datapath @ 333 MHz  need good P & R Need more latency?  may cause major HDL changes Compile, test, debug cycle slower than SW And tools to observe HW state less mature Any timing closure issues exacerbate Firmware development needs working HW

The Competition: Many Core Tilera TILE64 PCIe 1 MAC PHY PCIe 0 Serdes Flexible IO GbE 0 GbE 1 UART, HPI JTAG, I2C, SPI DDR2 Memory Controller 3 DDR2 Memory Controller 0 DDR2 Memory Controller 2 DDR2 Memory Controller 1 XAUI PHY 0 PHY 1 PROCESSOR P2 Reg File P1 P0 CACHE L2 CACHE L1I L1D ITLB DTLB 2D DMA STN MDN TDN UDN IDN SWITCH 40

Competition: ASSP w/HW Accelerators Ex. Cavium – Octeon CN68XX 85 application accelerators 65 nm in Q4 2010 2 process generations behind FPGAs

“Bespoke” ASSPs in FPGAs Processor (Master) PCI Express (Master) Connect IP with SoPC Builder Integrates system & builds software headers Next generation: general Network-on-a-Chip Topology, latency: selectable Scalable enough to form heart-of-the-system Interface Network Interface Network Interface Interconnect Network Internal pipelining, arbitrary topology, customizable arbitration Network Interface Network Interface DDR3 Accel

High-Level Synthesis Good results in some problem domains (e.g. DSP kernels) Often difficult to scale to large programs Debugging and timing closure difficult Unclear how the code relates to the synthesized solution How to change the ‘C’ code to make hardware run faster? Few tools to drive profiling data back to the high-level code Few tools to debug HW in a software-centric environment

OpenCL: Explicitly Parallel C The OpenCL programming model allows us to: Define Kernels Data-parallel computational units  can hardware accelerate Including communication mechanism to kernels Describe parallelism within & between kernels Manage Entire Systems Framework for mix of HW-accelerated and software tasks Still C Multi-target

OpenCL Structure __kernel void sum { … } __kernel void transpose {…} float cross_product { … } Program: kernels and functions Task-level parallelism, overall framework __kernel void sum (__global const float *a, __global const float *b, __global float *answer) { int xid = get_global_id(0); answer[xid] = a[xid] + b[xid]; } Kernels: data-level parallelism Suitable for HW or parallel SW implementation Specify memory hierarchy

The Past (1984): Editing Switches

The Present: HDL Design Flow Timing & Other Constraints Verilog, VHDL Synthesis Timing and Power Analyzer Placement and Routing Timing, Power and Area Optimized Design

Extract Communication RTL becomes assembly language The Future? OpenCL Extract Communication Fast debug Kernel Compilers Kernel Compilers Kernel Compilers SoPC Builder Control SW HW kernels or SW kernels Communication Fabric RTL becomes assembly language 48

Summary

Summary Huge demand for more processing FPGAs becoming SoCs Possibly outstripping Moore’s law & off-chip bandwidth FPGAs becoming SoCs More heterogeous/hard function units FPGAs specializing to markets 28 nm & Stratix V -30 to -50% power, 1.5x I/O bandwidth, 1.5x – 2x more processing Partial reconfiguration FPGA robustness with scaling Innovation overcoming issues  scaling continues Tool innovation needed Higher-level, fast debug cycles, push-button timing closure