1 © 2006, reiner@hartenstein.de http://hartenstein.de Reconfigurable Computing Reiner Hartenstein Computing Meeting EU, ESU, Brussels, May 18, 2006

2 © 2006, reiner@hartenstein.de http://hartenstein.de 2 The Pervasiveness of RC [Chart: number of Google hits for various application areas and for "FPGA and …" combined with those areas, with counts ranging from roughly 113,000 to 1,620,000; covering the ECE-savvy scene (mainstream for many years), the math/SW-savvy scene (more recently: 2-3 years), and many more areas.]

3 © 2006, reiner@hartenstein.de http://hartenstein.de 3 The dominance of Configware: most compute power is coming from Configware. More MIPS have migrated to Configware than are running as Software.

4 © 2006, reiner@hartenstein.de http://hartenstein.de 4 Reconfigurable Supercomputing (vHPC) going commercial: Cray XD1, Silicon Graphics RASC, … and other vendors

5 © 2006, reiner@hartenstein.de http://hartenstein.de 5 >> Outline << Reconfigurable Computing Paradox · The Supercomputing Paradox · We are using the wrong model · Coarse-grained Reconfigurable Devices · Super Pentium for Desktop Supercomputer http://www.uni-kl.de

6 © 2006, reiner@hartenstein.de http://hartenstein.de 6 The Reconfigurable Computing Paradox: poor FPGA technology: area-inefficient, slow, power-hungry, expensive; poor tools: tools and languages unacceptable by most users, even most hardware experts (86% **) hate their tools [**) DeHon '98]; poor education: RC education extremely poor, if at all, ignored by CS curricula, CS taught like for a 50 year old mainframe …

7 © 2006, reiner@hartenstein.de http://hartenstein.de 7 FPGA integration density: the effective integration density of plain FPGAs is behind Moore's law by more than 4 orders of magnitude. However, brilliant results everywhere: what paradox?

8 © 2006, reiner@hartenstein.de http://hartenstein.de 8 FPGA speed-up factors published [Chart, 1980-2010, relative performance 10^0 to 10^9: microprocessor performance (8080 to Pentium 4) grows at 50%/yr, later 7%/yr; memory at x1.25/yr (Moore); published FPGA speed-up factors roughly double per year (x2/yr). Data points cover DSP and wireless (Reed-Solomon decoding, Viterbi decoding, FFT, MAC), crypto, image processing / pattern matching / multimedia (Los Alamos traffic simulation, real-time face detection, video-rate stereo vision, pattern recognition, SPIHT wavelet-based image compression), bioinformatics (Smith-Waterman pattern matching, BLAST, protein identification, molecular dynamics simulation), astrophysics (GRAPE), a 2-D FIR filter and Lee routing [TU-KL], with speed-up factors ranging from tens to over 10,000 (between 1 and 4 orders of magnitude). The grid-based DRC point ("fair comparison", speed-up 15,000) is not an FPGA but the pre-FPGA-era DPLA on the MoM Xputer architecture by TU-KL.] http://xputers.informatik.uni-kl.de/faq-pages/fqa.html

9 © 2006, reiner@hartenstein.de http://hartenstein.de 9 Platform FPGAs: better area efficiency (DeHon's 1st Law (1996) was for plain FPGAs). DSP platform FPGA [courtesy Xilinx Corp.]: 500 MHz flexible soft logic architecture, 200K logic cells; 500 MHz programmable DSP execution units; 0.6-11.1 Gbps serial transceivers; 500 MHz PowerPC™ processors (680 DMIPS) with auxiliary processor unit; 1 Gbps differential I/O; 500 MHz multi-port distributed 10 Mb SRAM; 500 MHz DCM digital clock management.

10 © 2006, reiner@hartenstein.de http://hartenstein.de 10 Pre-FPGA era: why the DPLA* was so good. Highly area-efficient classical PLA layout: large arrays of canonical Boolean expressions, close to Moore's law. In the mid '80s only very tiny FPGAs were available: 1 DPLA replaced 256 of them. Speed-up factor of 20 by the GAG, the Generic Address Generator**, a generalization of the DMA, to avoid address computation overhead, reducing memory cycles, which is the key issue (ASM: Auto-Sequencing Memory). *) fabricated 1984 by the E.I.S. multi-university project **) for a survey by IMEC & TU-KL see: [M. Herz et al.: ICECS 2003, Dubrovnik]

11 © 2006, reiner@hartenstein.de http://hartenstein.de 11 With a taxonomy of algorithms, better tools and better education: even higher speed-up? consolidation? [Same chart of published FPGA speed-up factors vs. microprocessor and memory performance growth as on slide 8.]

12 © 2006, reiner@hartenstein.de http://hartenstein.de 12 New dimensions of low power: application migration [from supercomputer] results not only in massive speed-ups. Electricity bills are reduced by an order of magnitude, and even more you may get for free: up to millions of dollars per year (also a matter of national energy policy). 'Saves more than $10,000 in electricity bills per year (7 ¢ / kWh) per 64-processor 19" rack' [Herb Riley, R. Associates]. (Images: Google, Amsterdam, NY.)

13 © 2006, reiner@hartenstein.de http://hartenstein.de 13 >> Outline << Reconfigurable Computing Paradox · The Supercomputing Paradox · We are using the wrong model · Coarse-grained Reconfigurable Devices · Super Pentium for Desktop Supercomputer http://www.uni-kl.de

14 © 2006, reiner@hartenstein.de http://hartenstein.de 14 The Supercomputing Paradox: growing listed Teraflops; increasing number of processors running in parallel; COTS processor decreasing cost; promising technology. (Background: the ISC2006 BoF session announcement.)
ISC2006 BoF Session Title and Abstract: "Is Reconfigurable Computing the Next Generation Supercomputing?" Advances in reconfigurable computing, particularly FPGA (field-programmable gate array) technology, have reached a performance level where they rival and exceed the performance of general purpose processors for the right applications. FPGAs have gotten cheaper thanks to smaller geometries, multimillion gate counts and volume market leverage from ASIC preproduction and other conventional uses. The potential benefit from the widespread incorporation of FPGA technology into high-performance applications is high, provided present day barriers to their incorporation can be overcome. This session will focus on defining the anticipated market changes, anticipated roles of FPGA technology in high-performance computing (from accelerators to hybrid architectures), characterizing present day barriers to the incorporation of FPGA technology (such as identifying the right applications), and partnering efforts required (tools, benchmarks, standards, etc.) to speed the adoption of reconfigurable technology in high-performance supercomputing. Keywords: Reconfigurable computing, FPGA Accelerators, Supercomputing.
Date and Time: this BoF session is part of the conference program and will take place within a 45-minute slot on Wednesday, 28 June 2006, from 18:00 - 19:30.
BoF Organizers: John Abott, Chief Analyst, The 451 Group, USA; Dr. Joshua Harr, CTO, Linux Networx, USA (as CTO for Linux Networx, Dr. Joshua Harr has the responsibility of laying the technical roadmap for the company and is leading the team developing cluster management tools. Josh's experience with parallel processing, distributed computing, large server farms, and Linux clustering began when he built an eight-node cluster system out of used components while in college. An industry expert, Josh has been called upon to consult with businesses and lecture in college classrooms. He earned a Ph.D. in computational chemistry and a bachelor's degree in molecular biology from BYU.); Dr. Eric Stahlberg, Organizing founder OpenFPGA, Ohio Supercomputer Center (OSC), USA.

15 © 2006, reiner@hartenstein.de http://hartenstein.de 15 HPC by classic supercomputing methodology: poor results. Extreme shortage of affordable capacity; lack of scalability: progress only by innovation; more parallelism absorbs programmer productivity; program ready: hardware obsolete (the law of More); not for high performance embedded computing.

16 © 2006, reiner@hartenstein.de http://hartenstein.de 16 >> Outline << Reconfigurable Computing Paradox · The Supercomputing Paradox · We are using the wrong model · Coarse-grained Reconfigurable Devices · Super Pentium for Desktop Supercomputer http://www.uni-kl.de

17 © 2006, reiner@hartenstein.de http://hartenstein.de 17 Why traditional supercomputing / HPC failed: memory-cycle-hungry; instruction-stream-based: the wrong way how the data are moved around; because of the wrong multi-core interconnect architecture; extremely unbalanced (CPU figure "stolen from Bob Colwell"). (Background: the same ISC2006 BoF announcement as on slide 14.)

18 © 2006, reiner@hartenstein.de http://hartenstein.de 18 Earth Simulator crossbar: weight 220 t, 3000 km of thick cable, just for moving data around inside the machine.

19 © 2006, reiner@hartenstein.de http://hartenstein.de 19 Discarding the wrong road map: with a paradigm shift the same performance is feasible on a single 19" rack.

20 © 2006, reiner@hartenstein.de http://hartenstein.de 20 Bringing together data and processor. Moving data to the processor: moving the grand piano, by Software.

21 © 2006, reiner@hartenstein.de http://hartenstein.de 21 Key issues in very High Performance Computing (vHPC): reducing memory cycles is the key issue; away from the dominance of instruction streams; this needs a paradigm shift.

22 © 2006, reiner@hartenstein.de http://hartenstein.de 22 Here is the common model: a CPU (instruction-stream-based, running software code) symbiotic with an accelerator, either hardwired or reconfigurable (data-stream-based, personalized by configware code), for very high performance & electricity bill issues as well as legacy issues. It's not von Neumann; von Neumann: the tail is wagging the dog. The vN monopoly in our curricula is severely harmful; we need dual paradigm education.

23 © 2006, reiner@hartenstein.de http://hartenstein.de 23 The wrong basic mind set: our IT expert labor force lacks the right basic mind set. We need a dual paradigm approach; this is a severe educational challenge.

24 © 2006, reiner@hartenstein.de http://hartenstein.de 24 For high school and undergraduate education we need an archetype: a simple common model instead of a wide variety of sophisticated architectures; this is a severe educational challenge.

25 © 2006, reiner@hartenstein.de http://hartenstein.de 25 >> Outline << Reconfigurable Computing Paradox · The Supercomputing Paradox · We are using the wrong model · Coarse-grained Reconfigurable Devices · Super Pentium for Desktop Supercomputer http://www.uni-kl.de

26 © 2006, reiner@hartenstein.de http://hartenstein.de 26 Integration density: the effective integration density of plain FPGAs is behind Moore's law by more than 4 orders of magnitude; the effective integration density of rDPAs* may come close to Moore's law. *) reconfigurable DataPath Arrays (coarse-grained reconfigurability)

27 © 2006, reiner@hartenstein.de http://hartenstein.de 27 Coarse grain is about computing, not logic: an SNN filter on the KressArray, mainly a pipe network [Ulrich Nageldinger]. Array size: 10 x 16 = 160 rDPUs; rDPU = reconfigurable DataPath Unit, e.g. 32 bits wide; no CPU. (Legend: rout-thru only / not used / backbus connect.)

28 © 2006, reiner@hartenstein.de http://hartenstein.de 28 SW-to-coarse-grained-CW migration example: an rDPU configured as an adder (+) delivering the result S.

29 © 2006, reiner@hartenstein.de http://hartenstein.de 29 Compare it to the software solution on a CPU: S = R + (if C then A else B endif); simple conservative CPU example, C = 1, clock 200 MHz:
  if C then read A:      read instruction           1 memory cycle    100 ns
                         instruction decoding
                         read operand*              1 memory cycle    100 ns
                         operate & reg. transfers
  if not C then read B:  read instruction           1 memory cycle    100 ns
                         instruction decoding
  add & store:           read instruction           1 memory cycle    100 ns
                         instruction decoding
                         operate & reg. transfers
                         store result               1 memory cycle    100 ns
  total                                             5 memory cycles   500 ns
(The rDPU computes the same S with a single adder.)

30 © 2006, reiner@hartenstein.de http://hartenstein.de 30 Hypothetical branching example to illustrate software-to-configware migration: S = R + (if C then A else B endif). Simple conservative CPU example (C = 1): the same cycle count as in the table on slide 29, i.e. 5 memory cycles, 500 ns. *) if no intermediate storage in register file. Configware version: the rDPU takes the data streams A, B, R and the condition C (= 1) into a multiplexer feeding the adder, output S; clock 200 MHz (5 nanosec); no memory cycles: speed-up factor = 100.
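To make the migration concrete, here is a minimal C sketch of the two views of S = R + (if C then A else B endif). The function names and the cycle-count comments are illustrative assumptions derived from the numbers on slides 29/30; this is not code from the talk.

```c
#include <stdint.h>

/* Software view: every step costs an instruction fetch, and the operand
 * read and result store cost extra memory cycles; 5 memory cycles in the
 * slide's conservative example, i.e. about 500 ns at 100 ns each. */
int32_t s_software(int32_t r, int32_t a, int32_t b, int c)
{
    int32_t t;
    if (c)             /* read instruction, decode, read operand A */
        t = a;
    else               /* read instruction, decode, read operand B */
        t = b;
    return r + t;      /* read instruction, decode, operate, store result */
}

/* Configware view (illustrative only): the same expression collapses into
 * a 2:1 multiplexer feeding an adder inside one rDPU.  A, B, R and C
 * arrive as data streams, so there is no instruction fetch and no data
 * memory cycle at run time; one 5 ns clock tick at 200 MHz, which is
 * where the speed-up factor of about 100 on the slide comes from. */
static inline int32_t s_rdpu(int32_t r, int32_t a, int32_t b, int c)
{
    return r + (c ? a : b);    /* mux + adder as one pipeline stage */
}
```

The decision is not gone; it has been folded into the datapath as a multiplexer, which is exactly the point of the "but you can't implement decisions!" anecdote on slide 36.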

31 © 2006, reiner@hartenstein.de http://hartenstein.de 31 Why the speed-up? What's the difference? Moving the locality of operation into the route of the data stream by placement & routing, instead of moving data around by instruction streams.

32 © 2006, reiner@hartenstein.de http://hartenstein.de 32 Bringing together data and processor. Move the stool, by Configware: place the location of execution into the data pipe.

33 © 2006, reiner@hartenstein.de http://hartenstein.de 33 Data-stream-based instead of instruction-triggered: execution should be transport-triggered; transport should be done within compiled pipelines, not by move engines* *) which are instruction-stream-based!

34 © 2006, reiner@hartenstein.de http://hartenstein.de 34 For high school and undergraduate education we should send CTOs and professors back to school; this is a severe educational challenge.

35 © 2006, reiner@hartenstein.de http://hartenstein.de 35 The wrong model: upon this schematics (the same KressArray SNN filter as on slide 27: array size 10 x 16 = 160 rDPUs, rDPU = reconfigurable DataPath Unit, e.g. 32 bits wide, no CPU) … a question by a Japanese corporate vVIP …

36 © 2006, reiner@hartenstein.de http://hartenstein.de 36 The wrong mind set: "but you can't implement decisions!" (question by a Japanese corporate vVIP [RAW'99], about the branching example of slide 30: inputs A, B, R, condition C = 1, clock 200 MHz / 5 nanosec, output S). Not knowing this solution is a symptom of the hardware / software chasm and the configware / software chasm. We need Reconfigurable Computing education.

37 © 2006, reiner@hartenstein.de http://hartenstein.de 37 >> Outline << Reconfigurable Computing Paradox · The Supercomputing Paradox · We are using the wrong model · Coarse-grained Reconfigurable Devices · Super Pentium for Desktop Supercomputer http://www.uni-kl.de

38 © 2006, reiner@hartenstein.de http://hartenstein.de 38 Some goals: a universal HPC co-architecture for embedded vHPC (nomadic, automotive, ...) and desktop vHPC (scientific computing, ...); an application co-development environment for hardware non-experts; acceptability by software-type users; meeting product lifetime >> embedded system life: FPGA emulation logistics from development down to maintenance and repair stations (examples: automotive, aerospace, industrial, ...).

39 © 2006, reiner@hartenstein.de http://hartenstein.de 39 Architecture: a potential Pentium successor. Discard most caches; have 64* cores at 0.5 - 1 GHz with clever interconnect: for CPU mode, concurrent processes and multithreading; for DPU mode, a Kung-Kress pipe network. The desktop supercomputer! *) with CPU mode / DPU mode capability

40 © 2006, reiner@hartenstein.de http://hartenstein.de 40 "Super Pentium" configuration example: a twin paradigm machine, with some cores configured as CPUs and the others as rDPUs.

41 © 2006, reiner@hartenstein.de http://hartenstein.de 41 e.g. ~ 8 x 8 rDPA: all feasible under 500 MHz. World TV & game console & multi media center: a single device handles all modes (games, music, videos). (Block diagram: SMeXPP, camera, baseband processor, radio interface, audio interface, SD/MMC cards, LCD display, rDPA; http://pactcorp.com.) Requirements handled: variable resolutions and refresh rates; variable scan mode characteristics; noise reduction and artifact removal; high performance requirements; variable file encoding formats; variable content security formats; variable displays; luminance processing; detail enhancement; color processing; sharpness enhancement; shadow enhancement; differentiation; programmable de-interlacing heuristics; frame rate detection and conversion; motion detection & estimation & compensation; different standards (MPEG2/4, H.264).

42 © 2006, reiner@hartenstein.de http://hartenstein.de 42 Feasible under 500 MHz means low electricity cost and allows very high integration density.

43 © 2006, reiner@hartenstein.de http://hartenstein.de 43 Apropos pipeline: compiled pipelines …

44 © 2006, reiner@hartenstein.de http://hartenstein.de 44 Dual Paradigm Application Development Support: a software/configware co-compiler compiles a high level language for both the CPU (instruction-stream-based, software code) and the accelerator, reconfigurable or hardwired (data-stream-based, configware code). Placement & routing in the compiler optimizes interconnect bandwidth by preferring nearest neighbor connect.

45 © 2006, reiner@hartenstein.de http://hartenstein.de 45 Software / Configware Co-Compilation: Juergen Becker's CoDe-X, 1996. C language source → partitioner → SW compiler (for the CPU) and CW compiler (for the rDPU array), the latter with placement & routing (move the locality of operation); resource parameters support different platforms.
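As an illustration of the kind of C input such a co-compiler works on, here is a hypothetical kernel; the claim about how the partitioner would split it is an assumption made for illustration, not output of the actual CoDe-X tool.

```c
/* Hypothetical input to a software/configware co-compiler in the style of
 * CoDe-X: the regular loop nest below is the kind of kernel a partitioner
 * would typically hand to the CW compiler for mapping onto the rDPA as a
 * pipe network, while surrounding setup and I/O code stays as software
 * on the CPU.  Sizes and names are invented for illustration. */
#define N    1024
#define TAPS 8

void fir(const short x[N], const short h[TAPS], int y[N - TAPS])
{
    for (int i = 0; i < N - TAPS; i++) {      /* candidate for configware */
        int acc = 0;
        for (int t = 0; t < TAPS; t++)
            acc += x[i + t] * h[t];           /* maps onto a chain of MAC rDPUs */
        y[i] = acc;
    }
}
```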

46 © 2006, reiner@hartenstein.de http://hartenstein.de 46 Software / Configware very high level synthesis: a term-rewriting-based vhl synthesis system compiles a math formula for both the CPU (instruction-stream-based, software code) and the accelerator, reconfigurable or hardwired (data-stream-based, configware code) [Arvind; Mauricio Ayala].

47 © 2006, reiner@hartenstein.de http://hartenstein.de 47 >> Conclusions << Reconfigurable Computing Paradox · The Supercomputing Paradox · We are using the wrong model · Coarse-grained Reconfigurable Devices · Super Pentium for Desktop Supercomputer · Conclusions http://www.uni-kl.de

48 © 2006, reiner@hartenstein.de http://hartenstein.de 48 Objectives: cheap, compact vHPC for every area which needs: flexibility (for accelerators); avoiding specific silicon; rapid prototyping, field-patching, emulation.

49 © 2006, reiner@hartenstein.de http://hartenstein.de 49 Conclusion (1): Reconfigurable Computing opens many spectacular new horizons: cheap vHPC without needing specific silicon, no mask ...; massive reduction of the electricity bill, locally and nationally; cheap embedded vHPC; cheap desktop supercomputers (a new market); fast and cheap prototyping; replacing expensive hardwired accelerators; supporting fault tolerance, self-repair and self-organization; flexibility for systems with unstable multiple standards by dynamic reconfigurability; emulation logistics for very long term spare part provision and part type count reduction (automotive, aerospace …).

50 © 2006, reiner@hartenstein.de http://hartenstein.de 50 Conclusion (2): Needed: a universal vHPC co-architecture demonstrator; the compilation tool problem to be solved; the language selection problem to be solved; education backlog problems to be solved; use this to develop a very good high school and undergraduate lab course (a motivator: preparing for the Top 500 contest). For widely spreading its use successfully: select killer applications for demo.

51 © 2006, reiner@hartenstein.de http://hartenstein.de 51 thank you

52 © 2006, reiner@hartenstein.de http://hartenstein.de 52 END

53 © 2006, reiner@hartenstein.de http://hartenstein.de 53 backup

54 © 2006, reiner@hartenstein.de http://hartenstein.de 54 Compilation: Software vs. Configware. Software Engineering: source program (C, FORTRAN, MATLAB) → software compiler → software code. Configware Engineering: source "program" → configware compiler: a mapper (placement & routing) producing configware code, and a scheduler producing flowware code (which schedules the data).

55 © 2006, reiner@hartenstein.de http://hartenstein.de 55 Nick Tredennick's paradigm shifts explain the differences. Software Engineering (CPU): resources fixed, algorithm variable; 1 programming source needed: software. Configware Engineering: resources variable, algorithm variable; 2 programming sources needed: configware and flowware.

56 © 2006, reiner@hartenstein.de http://hartenstein.de 56 Co-Compilation: a Software / Configware co-compiler takes C, FORTRAN, MATLAB sources; an automatic SW / CW partitioner (simulated annealing) feeds the software compiler (producing software code) and the configware compiler, i.e. mapper and scheduler (producing configware code and flowware code for the data).

57 © 2006, reiner@hartenstein.de http://hartenstein.de 57 Co-compiler for a hardwired Kress/Kung machine [e.g. Brodersen]: a Software / Flowware co-compiler with an automatic SW / CW partitioner feeds the software compiler (software code) and the flowware compiler / scheduler (flowware code for the data).

58 © 2006, reiner@hartenstein.de http://hartenstein.de 58 The first archetype machine model, the Software Industry's secret of success: a simple basic machine paradigm ("von Neumann": mainframe / CPU); procedural personalization by compiling or assembling; personalization RAM-based; instruction-stream-based mind set.

59 © 2006, reiner@hartenstein.de http://hartenstein.de 59 The 2nd archetype machine model, the Configware Industry's secret of success: a simple basic machine paradigm ("Kress-Kung": reconfigurable accelerator); structural personalization by compilation; personalization RAM-based; data-stream-based mind set.

60 © 2006, reiner@hartenstein.de http://hartenstein.de 60 Co-compiler enabling technology is available from academia; only a small team is needed for commercial re-implementation, on the road map to the Personal Supercomputer.

61 © 2006, reiner@hartenstein.de http://hartenstein.de 61 "Data streams" (pipe network), the H. T. Kung paradigm (systolic array): input and output data streams flow through the DPA; flowware defines which data item appears at which time at which port. The streams are implemented by distributed memory: ASM = Auto-Sequencing Memory, consisting of a data counter / GAG plus RAM; 50 and more on-chip ASMs are feasible.
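A minimal C sketch of what a data counter / GAG inside an ASM does, assuming a plain 2-D block scan as the access pattern; the function and parameter names are invented for illustration and do not come from the talk.

```c
#include <stddef.h>

/* The GAG generates the address sequence itself (here: a 2-D block scan),
 * so the rDPA only consumes the resulting data stream and spends no
 * instructions on address arithmetic.  "emit" stands for handing one
 * address to the memory port of the ASM. */
typedef void (*emit_fn)(size_t addr);

void gag_block_scan(size_t base, size_t row_stride,
                    size_t rows, size_t cols, emit_fn emit)
{
    for (size_t y = 0; y < rows; y++)            /* outer data counter */
        for (size_t x = 0; x < cols; x++)        /* inner data counter */
            emit(base + y * row_stride + x);     /* address for the port */
}
```

Together with the time/port schedule, this is essentially what the flowware specifies: which data item appears at which time at which port.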

62 © 2006, reiner@hartenstein.de http://hartenstein.de 62 The generalization of the systolic array, the Kress-Kung paradigm ("super systolic array") [R. Kress]: classical algebraic synthesis methods work only for applications with regular data dependencies. The remedy? Discard algebraic synthesis methods and use optimization algorithms instead, e.g. simulated annealing. Achievement: also non-linear and non-uniform pipes, and even more wild pipe structures are possible; this is where reconfigurability makes sense.
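To illustrate "optimization algorithms instead of algebraic synthesis", here is a minimal simulated-annealing placement sketch: operator nodes of a dataflow graph are swapped on a small grid to shorten total Manhattan wire length, so that most connections end up nearest-neighbour. Grid size, netlist, cost function and cooling schedule are all illustrative assumptions; this is not the actual KressArray mapper.

```c
#include <stdlib.h>
#include <math.h>

#define GRID_W 10            /* 10 x 16 rDPA as on slide 27 */
#define GRID_H 16
#define NODES  16            /* operator nodes of the dataflow graph */
#define EDGES  15            /* connections between operators (a chain) */

static int px[NODES], py[NODES];   /* current placement of each node */
static int edge[EDGES][2];         /* pairs of connected nodes */

static int wirelength(void)        /* total Manhattan wire length */
{
    int len = 0;
    for (int e = 0; e < EDGES; e++) {
        int a = edge[e][0], b = edge[e][1];
        len += abs(px[a] - px[b]) + abs(py[a] - py[b]);
    }
    return len;
}

void place(void)
{
    for (int i = 0; i < NODES; i++) {       /* distinct starting cells */
        px[i] = i % GRID_W;
        py[i] = i / GRID_W;
    }
    for (int e = 0; e < EDGES; e++) {       /* simple chain netlist */
        edge[e][0] = e;
        edge[e][1] = e + 1;
    }

    int cur = wirelength();
    double temp = 10.0;
    for (long iter = 0; iter < 200000; iter++, temp *= 0.99995) {
        int a = rand() % NODES, b = rand() % NODES;   /* pick two nodes   */
        int tx = px[a], ty = py[a];                   /* and swap them    */
        px[a] = px[b]; py[a] = py[b];                 /* (swaps keep the  */
        px[b] = tx;    py[b] = ty;                    /* grid overlap-free) */

        int nxt = wirelength();
        int delta = nxt - cur;
        if (delta <= 0 || exp(-delta / temp) > (double)rand() / RAND_MAX)
            cur = nxt;                       /* accept shorter (or lucky uphill) */
        else {
            px[b] = px[a]; py[b] = py[a];    /* reject: undo the swap */
            px[a] = tx;    py[a] = ty;
        }
    }
}
```

A production mapper would use an incremental cost update and a routing model, but the principle is the same: accept any swap that shortens the wiring, and occasionally accept a worse one while the "temperature" is still high.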

63 © 2006, reiner@hartenstein.de http://hartenstein.de 63 (Kress-Kung machine paradigm) Drastically reducing memory cycles: a data counter instead of a program counter, a generalization of the DMA. ASM = Auto-Sequencing Memory: data counter / GAG plus RAM. GAG & enabling technology: multiple publications since 1989; storage scheme optimization methodology, etc.*; survey paper: [M. Herz et al.*: IEEE ICECS 2003, Dubrovnik] *) IMEC, Leuven & TU-KL

64 © 2006, reiner@hartenstein.de http://hartenstein.de 64 Fine-grained RC: DeHon's 1st Law [1996: Ph.D., MIT]: immense area inefficiency. [Chart, transistors per microchip vs. year 1980-2010, 10^0 to 10^12: the Gordon Moore curve (microprocessor) vs. FPGA physical, FPGA logical and FPGA routed density; the gap (reconfigurability overhead, routing congestion, wiring overhead) is a factor >> 10,000.]

65 © 2006, reiner@hartenstein.de http://hartenstein.de 65 Coarse-grained RC: Hartenstein's amendment of DeHon's 1st Law [1996: ISIS, Austin, TX]: area efficiency very close to Moore's law, e.g. the KressArray family. [Chart, transistors per microchip vs. year 1980-2010, 10^0 to 10^12: Gordon Moore curve with rDPA physical and rDPA logical close to it, FPGA routed a factor >> 10,000 below.]

66 © 2006, reiner@hartenstein.de http://hartenstein.de 66 More compute power by Configware than Software. 75% of all (micro)processors are embedded (4 : 1). 25% of embedded µProcessors are accelerated by FPGA(s) (1 : 4, a very cautious estimation**) -> 1 : 1 -> every 2nd µProcessor accelerated by FPGA(s). Average acceleration factor >2 -> rMIPS* : MIPS > 2 (the difference is probably an order of magnitude). Conclusion: most compute power comes from Configware. *) rMIPS: MIPS replaced by FPGA compute power **) Dataquest interaction pending

67 © 2006, reiner@hartenstein.de http://hartenstein.de 67 Conclusion (3): For widely spreading its use successfully: self-repair and self-organization methodology; embedded r-emulation logistics methodology; a universal vHPC co-architecture demonstrator; select a killer application for demo.

68 © 2006, reiner@hartenstein.de http://hartenstein.de 68 Dual Paradigm Application Development Support, another example: the software/configware co-compiler fed through a MATLAB adapter, again targeting both the CPU (instruction-stream-based, software code) and the accelerator, reconfigurable or hardwired (data-stream-based, configware code).

