
1 Reconfigurable Supercomputing: Hurdles and Chances. Reiner Hartenstein, TU Kaiserslautern. International Supercomputer Conference, Dresden, Germany, June 28 - 30, 2006. http://hartenstein.de

2 © 2006, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern
>> Outline << http://www.uni-kl.de
Preface | The von Neumann paradigm trap | Supercomputing: the wrong Road Map | The Solution ignored for decades | Fine-grained vs. coarse-grained | The wrong Road Map for CS Curricula | Conclusions

3 Preface. The talk illustrates why a hidden paradigm shift lies behind the success of FPGAs. It does not really cover the performance of bulk storage, discs, etc. It highlights the supercomputing paradigm trap and the early solution that has been fully ignored.

4 The Pervasiveness of Reconfigurable Computing (RC). FPGAs are used everywhere: more than 10 million Google hits for "FPGA" (Nov. 2005), ~150 conferences. [Chart: Google hit counts per application domain, 113,000 to 194,000; for "FPGA and ..." queries, 272,000 to 1,620,000.]

5 An Example: FPGAs in Oil and Gas. "Application migration [from a supercomputer] has resulted in a 17-to-1 increase in performance" [Herb Riley, R. Associates]. It saves more than $10,000 in electricity bills per year (at 7 ¢/kWh) per 64-processor 19" rack. Did you know that 25% of Amsterdam's electric energy consumption goes into server farms? That a quarter square kilometer of office floor space within New York City is occupied by server farms?

6 Oil and Gas as a strategic issue. It should be investigated how far the migration achievements obtained for computationally intensive applications can also be utilized for servers. Do you know the size of Google's electricity bill?

7 15 GigaFLOPS on a single FPGA chip. Last night I met Stamatis Vassiliadis (TU Delft): 15 GigaFLOPS on a single chip for matrix computations. A surprise: much less memory needed than expected.

8 Some published FPGA speed-up factors. The RC paradox: these speed-ups are achieved although the effective integration density of FPGAs is 4 orders of magnitude behind the Moore curve (wiring overhead, reconfigurability overhead, routing congestion), and despite the memory wall. [Chart: microprocessor relative performance 1980-2010, 8080 to Pentium 4; processor performance growing ~50%/yr vs. memory ~7%/yr; FPGA speed-ups doubling roughly 2x per year since the FPGA was invented.]
Image processing, pattern matching, multimedia: real-time face detection 6000, video-rate stereo vision 900, pattern recognition 730, SPIHT wavelet-based image compression 457
Bioinformatics: Smith-Waterman pattern matching 288, BLAST 52, protein identification 40, molecular dynamics simulation 88
DSP and wireless: Reed-Solomon decoding 2400, Viterbi decoding 400, FFT 100, MAC 1000
Astrophysics: GRAPE 20; crypto: 1000

9 Educational Deficits. Educational deficits have stalled Reconfigurable Computing (RC) as well as classical supercomputing. Transdisciplinary fragmentation: each application domain uses its own box of tricks, and there are too many sophisticated, very clever architectures. We need a fundamental model with a methodology which all application domains have in common. Transdisciplinary education & basic research needed.

10 >> Outline << http://www.uni-kl.de Preface | The von Neumann paradigm trap | Supercomputing: the wrong Road Map | The Solution ignored for decades | Fine-grained vs. coarse-grained | The wrong Road Map for CS Curricula | Conclusions

11 The basic model paradigm trap. High-performance computing has been stalled for decades by the von Neumann paradigm trap: most systems are extremely unbalanced. For decades the right roadmap was hidden by another paradigm trap. [CPU cartoon stolen from Bob Colwell]

12 Computer Science not prepared. Transdisciplinary education? For decades: the hardware/software chasm, now turning into the configware/software chasm. Lacking intradisciplinary cohesion between the mind sets of: hardware people, computer architects, embedded systems designers, theoreticians (math background), and software people (application development).

13 Flagship conference series: IEEE ISCA. Parallelism faded away: 98.5% von Neumann (2001: 84%) [David Padua, John Hennessy, Jean-Loup Baer, et al.]. Other topics: cache coherence? speculative scheduling? The migration of the lemmings.

14 The Dead Supercomputer Society, 1985 - 1995 [Gordon Bell, keynote ISCA 2000]: 49 died in just one decade. ACRI, Alliant, American Supercomputer, Ametek, Applied Dynamics, Astronautics, BBN, CDC, Convex, Cray Computer, Cray Research, Culler-Harris, Culler Scientific, Cydrome, Dana/Ardent/Stellar/Stardent, DAPP, Denelcor, Elexsi, ETA Systems, Evans and Sutherland Computer, Floating Point Systems, Galaxy YH-1, Goodyear Aerospace MPP, Gould NPL, Guiltech, ICL, Intel Scientific Computers, International Parallel Machines, Kendall Square Research, Key Computer Laboratories, MasPar, Meiko, Multiflow, Myrias, Numerix, Prisma, Tera, Thinking Machines, Saxpy, Scientific Computer Systems (SCS), Soviet Supercomputers, Supertek, Supercomputer Systems, Suprenum, Vitesse Electronics

15 >> Outline << http://www.uni-kl.de Preface | The von Neumann paradigm trap | Supercomputing: the wrong Road Map | The Solution ignored for decades | Fine-grained vs. coarse-grained | The wrong Road Map for CS Curricula | Conclusions

16 Moving Data Around. The ES (Earth Simulator): 5120 processors, 5000 pins each; crossbar weight 220 t; 3000 km of thick cable. 20 TFLOPS: peak or sustained?

17 The Memory Wall (1): Moving data to the processor.

18 Data meeting the Processing Unit (PU). We have 2 choices: by software, routing the data to the PU by memory-cycle-hungry instruction streams; or by configware, placement of the execution locality: optimize a pipe network and place the PU in the data stream.
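The contrast between the two choices can be caricatured in a tiny model. This is a sketch with invented cycle counts (not measurements): the instruction-stream version pays a simulated memory cycle for every instruction and operand fetch, while the configware version places the adder in the data stream, so data arrival alone triggers the operation.

```python
# Toy contrast: instruction-stream-based vs. data-stream-based execution.
# All cycle counts are illustrative assumptions, not measurements.

def instruction_stream_sum(data):
    memory_cycles = 0
    acc = 0
    for x in data:
        memory_cycles += 1  # fetch the 'load' instruction
        memory_cycles += 1  # fetch the operand
        memory_cycles += 1  # fetch the 'add' instruction
        acc += x
    return acc, memory_cycles

def pipe_network_sum(data):
    acc = 0
    for x in data:  # transport-triggered: no instruction fetches at all
        acc += x
    return acc, 0   # inter-PU transport needs no memory cycles

print(instruction_stream_sum([1, 2, 3, 4]))  # (10, 12)
print(pipe_network_sum([1, 2, 3, 4]))        # (10, 0)
```

The result is identical; only the count of memory cycles spent to obtain it differs, which is the point of the slide.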

19 Illustrating the von Neumann paradigm trap: the watering pot model [Hartenstein]. The instruction-stream-based approach squeezes everything through the von Neumann bottleneck; the data-stream-based approach uses many watering pots.

20 The Memory Wall (2). Tear down this wall! Supercomputing urgently needs a fundamentally different approach toward interconnect efficiency. The key problem is the inefficiency and complexity of moving data, not processor performance. The most important goal is minimizing the number of main-memory cycles.

21 >> Outline << http://www.uni-kl.de Preface | The von Neumann paradigm trap | Supercomputing: the wrong Road Map | The Solution ignored for decades | Fine-grained vs. coarse-grained | The wrong Road Map for CS Curricula | Conclusions

22 The Systolic Array (1980): introducing data streams, no instruction streams needed. A CS mathematicians' hobby of the early 80ies. The DPA (Data Path Array, an array of DPUs) is a pipe network; the DPU (Data Path Unit) has no program counter: it's no CPU! The nice time/space notation defines which data item appears at which time at which port (H. T. Kung paradigm). [Figure: input data stream entering the array and output data streams leaving it, each annotated with time and port #.]
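The systolic idea can be sketched in plain Python: each DPU is a multiply-accumulate cell with a fixed weight and no program counter; an arriving datum triggers the operation and then shifts one cell onward, while the accumulator ripples through the whole pipe in the same tick. Real systolic arrays are clocked hardware; this model only illustrates the time/space behaviour, and the class and function names are mine.

```python
# A linear systolic array computing a FIR filter (convolution).

class MacDPU:
    def __init__(self, weight):
        self.weight = weight
        self.x = 0                       # datum currently held in the cell

    def tick(self, x_in, acc_in):
        # Transport-triggered: data arrival drives the multiply-accumulate.
        acc_out = acc_in + self.weight * self.x
        x_out, self.x = self.x, x_in     # pass the old datum onward
        return x_out, acc_out

def systolic_fir(weights, stream):
    cells = [MacDPU(w) for w in weights]
    out = []
    for x in stream + [0] * len(weights):   # trailing zeros drain the pipe
        acc = 0
        for cell in cells:
            x, acc = cell.tick(x, acc)
        out.append(acc)
    return out[1:]                           # the first tick only fills the pipe

# Convolving the stream [3, 4, 5] with weights [1, 2]:
print(systolic_fir([1, 2], [3, 4, 5]))   # [3, 10, 13, 10]
```

Note which data item is where at which tick falls out of the loop structure alone: that is exactly the time/space notation of the slide.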

23 Terminology

term  | program counter | execution triggered by | paradigm
CPU   | yes             | instruction fetch      | instruction-stream-based
DPU** | no              | data arrival*          | data-stream-based

*) "transport-triggered"  **) does not have a program counter

24 ISC2006 BoF Session. Title and Abstract: Is Reconfigurable Computing the Next Generation Supercomputing? Advances in reconfigurable computing, particularly FPGA (field-programmable gate array) technology, have reached a performance level where they rival and exceed the performance of general-purpose processors for the right applications. FPGAs have gotten cheaper thanks to smaller geometries, multimillion gate counts and volume market leverage from ASIC preproduction and other conventional uses. The potential benefit from the widespread incorporation of FPGA technology into high-performance applications is high, provided present-day barriers to their incorporation can be overcome. This session will focus on defining the anticipated market changes, anticipated roles of FPGA technology in high-performance computing (from accelerators to hybrid architectures), characterizing present-day barriers to the incorporation of FPGA technology (such as identifying the right applications), and partnering efforts required (tools, benchmarks, standards, etc.) to speed the adoption of reconfigurable technology in high-performance supercomputing. Keywords: reconfigurable computing, FPGA accelerators, supercomputing. Date and Time: this BoF session is part of the conference program and will take place within a 45-minute slot on Wednesday, 28 June 2006, from 18:00 - 19:30. BoF Organizers: John Abott, Chief Analyst, The 451 Group, USA; Dr. Joshua Harr, CTO, Linux Networx, USA; Dr. Eric Stahlberg, organizing founder OpenFPGA, Ohio Supercomputer Center (OSC), USA. As CTO for Linux Networx, Dr. Joshua Harr has the responsibility of laying the technical roadmap for the company and is leading the team developing cluster management tools. Josh's experience with parallel processing, distributed computing, large server farms, and Linux clustering began when he built an eight-node cluster system out of used components while in college. An industry expert, Josh has been called upon to consult with businesses and lecture in college classrooms. He earned a Ph.D. in computational chemistry and a bachelor's degree in molecular biology from BYU.
The new paradigm: how the data are traveling. An old hat: transport-triggered (pipeline, or chaining); the super-systolic array. Better not by instruction execution: DPU vs. the instruction-driven von Neumann move processor [Jack Lipovski, EUROMiCRO, Nice, 1975]. P&R: move the locality of operation, not the data!

25 No Memory Wall. In a DPA, DPU operation is transport-triggered: no instruction streams, no message passing, no communication through common memory, massively reducing memory cycles. The right road map to HPC was there, ignored for decades. Where were the supercomputing people? [Figure: input and output data streams of the DPA.]

26 Mathematicians X-ing Systolic Synthesis. Mathematicians like the beauty and elegance of systolic arrays. Due to a lacking intradisciplinary view, their efforts yielded poor synthesis algorithms. [Reiner Hartenstein]

27 Synthesis Method? Of course, algebraic! But algebraic means linear projection, restricted to uniform arrays with only linear pipes: useful only for applications with strictly regular data dependencies. Mathematicians were caught by their own paradigm trap (the specialist trap) for more than a decade. Rainer Kress discarded their algebraic synthesis methods and replaced them by simulated annealing: a generalization* by a transdisciplinary hardware guy, the rDPA, 1995. *) super-systolic

28 Generating the Data Streams. Who generates the input and output data streams of the DPA? (It's not algebraic.) Mathematicians: it's not our job.

29 ASM: Auto-Sequencing Memory, the data stream generators. Use data counters, not a program counter: another example of a non-von-Neumann machine paradigm. Each ASM pairs a GAG (generic address generator, driven by a data counter) with RAM and feeds the rDPA pipe network. ASMs are implemented by distributed on-chip memory; 50 and more on-chip ASMs are feasible; reconfigurable; 32 ports, or n x 32 ports.
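A toy model of such an ASM: a RAM block paired with an address generator driven by a data counter rather than a program counter. The (base, stride, count) scan pattern and all names here are illustrative assumptions, not the published GAG parameter set.

```python
# Toy Auto-Sequencing Memory: address generation without an instruction stream.

def gag_addresses(base, stride, count):
    addr = base
    for _ in range(count):    # the data counter, stepping through memory
        yield addr
        addr += stride

class ASM:
    def __init__(self, ram):
        self.ram = ram        # models a block of distributed on-chip memory

    def stream(self, base, stride, count):
        # Emit a data stream straight out of RAM; no program counter involved.
        return [self.ram[a] for a in gag_addresses(base, stride, count)]

asm = ASM([10, 11, 12, 13, 14, 15])
print(asm.stream(base=0, stride=2, count=3))   # [10, 12, 14]
```

Several such generators running in parallel would correspond to the slide's many on-chip ASMs, each feeding one port of the pipe network.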

30 Compilation: Software vs. Configware. Software Engineering: a source program (C, FORTRAN, MATLAB, ...) goes through the software compiler into software code (instruction streams). Configware Engineering: a source "program" goes through the configware compiler, whose mapper (placement & routing) yields configware code (the configuration) and whose scheduler yields flowware code (the data streams) and the data.

31 >> Outline << http://www.uni-kl.de Preface | The von Neumann paradigm trap | Supercomputing: the wrong Road Map | The Solution ignored for decades | Fine-grained vs. coarse-grained | The wrong Road Map for CS Curricula | Conclusions

32 Coarse-grained vs. fine-grained. For a definition of FPGA see the previous talk by Dr. Thomas Steinke.

device        | granularity                  | path width               | effective density | flexibility
FPGA          | fine-grained                 | ~1 bit                   | very low          | general purpose
DPA           | coarse-grained               | multi-bit, e.g. 32 bits  | very high         | specialized
rDPA          | coarse-grained               | multi-bit, e.g. 32 bits  | very high         | domain-specific
platform FPGA | fine-grained & embedded hdw. | mixed                    | high              |

33 FPGA with island architecture: reconfigurable logic boxes, switch boxes and connect boxes in reconfigurable interconnect fabrics.

34 Example: a wire routed for one net, from A to B.
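One classic way routers find such a wire through an island-style fabric is Lee's maze expansion: a breadth-first search over free routing cells. The tiny grid and blockage pattern below are invented for illustration; real FPGA routers work on far richer resource graphs.

```python
# Lee-style maze routing: BFS over a grid of routing cells (0 = free).
from collections import deque

def maze_route(grid, src, dst):
    """Return a shortest path of grid cells from src to dst, or None."""
    rows, cols = len(grid), len(grid[0])
    prev = {src: None}
    q = deque([src])
    while q:
        cell = q.popleft()
        if cell == dst:
            path = []                      # walk the predecessors back
            while cell is not None:
                path.append(cell)
                cell = prev[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols \
                    and grid[nr][nc] == 0 and (nr, nc) not in prev:
                prev[(nr, nc)] = cell
                q.append((nr, nc))
    return None                            # routing congestion: no free path

fabric = [[0, 0, 0],
          [1, 1, 0],    # 1 = cell already occupied / congested
          [0, 0, 0]]
print(maze_route(fabric, (0, 0), (2, 0)))
# [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0)]
```

When no free path remains the router returns None, which is the routing congestion the earlier speed-up slide lists among the FPGA overheads.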

35 Why coarse grain? Instead of rLBs (~1 bit wide) use rDPUs (e.g. 32 bits wide); instead of an FPGA use an rDPA. The rDPU (reconfigurable Data Path Unit, e.g. an rALU) is much more area-efficient, has much less reconfigurability overhead, delivers much more MOPS/milliwatt, and its mind set is close to the classical computing background.

36 Coarse grain is about computing, not logic. Example: an SNN filter mapped onto a KressArray (mainly a pipe network) by DPSS, which is based on simulated annealing [Ulrich Nageldinger]. Array size: 10 x 16 = 160 rDPUs (reconfigurable function blocks, e.g. 32 bits wide; no CPUs); some cells are route-through only or not used; backbus connect.
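DPSS itself is not spelled out on the slide, so the following is only a generic simulated-annealing placement sketch of the same idea: place communicating operators on array cells so that total Manhattan wire length shrinks. The four-operator netlist, the 2x2 slot set and the cost model are invented for illustration.

```python
# Simulated-annealing placement of operators onto rDPA cells (toy scale).
import math
import random

def wirelength(placement, nets):
    return sum(abs(placement[a][0] - placement[b][0]) +
               abs(placement[a][1] - placement[b][1]) for a, b in nets)

def anneal_place(ops, slots, nets, temp=2.0, cooling=0.95, steps=400, seed=1):
    rng = random.Random(seed)
    placement = dict(zip(ops, slots))        # arbitrary initial placement
    cost = wirelength(placement, nets)
    best = dict(placement), cost
    for _ in range(steps):
        a, b = rng.sample(ops, 2)            # propose swapping two cells
        placement[a], placement[b] = placement[b], placement[a]
        new_cost = wirelength(placement, nets)
        accept = (new_cost <= cost or
                  rng.random() < math.exp((cost - new_cost) / temp))
        if accept:
            cost = new_cost                  # possibly an uphill move
            if cost < best[1]:
                best = dict(placement), cost
        else:                                # undo the rejected swap
            placement[a], placement[b] = placement[b], placement[a]
        temp *= cooling                      # cool down: fewer uphill moves
    return best

ops = ["load", "mul", "add", "store"]
slots = [(0, 0), (1, 1), (0, 1), (1, 0)]     # a 2x2 corner of the array
nets = [("load", "mul"), ("mul", "add"), ("add", "store")]
place, cost = anneal_place(ops, slots, nets)
print(cost)
```

With only 24 possible placements and 400 swap proposals, the walk normally settles on a chain of neighbouring cells with total wire length 3, the optimum for this netlist.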

37 Commercial rDPA example: PACT XPP, the XPU128. XPP128 rDPA evaluation board available, plus the XDS development tool with simulator. rDPU: full 32- or 24-bit design; working silicon; 2 configuration hierarchies (buses not shown). © PACT AG, http://pactcorp.com

38 SMeXPP: e.g. an array with 56 rDPUs running under 500 MHz. A single device handles all modes: world TV & game console & multimedia center (games, music, videos; camera, baseband processor, radio interface, audio interface, SD/MMC cards, LCD display). It covers variable resolutions and refresh rates, variable scan-mode characteristics, noise reduction and artifact removal, variable file-encoding formats, variable content-security formats, variable displays; luminance processing, detail enhancement, color processing, sharpness enhancement, shadow enhancement, differentiation; programmable de-interlacing heuristics, frame-rate detection and conversion, motion detection & estimation & compensation; different standards (MPEG2/4, H.264); high performance requirements. http://pactcorp.com

39 DSP platform FPGA [courtesy Xilinx Corp.]: 500 MHz flexible soft-logic architecture with 200K logic cells; 500 MHz programmable DSP execution units; 0.6-11.1 Gbps serial transceivers; 500 MHz PowerPC™ processors (680 DMIPS) with Auxiliary Processor Unit; 1 Gbps differential I/O; 500 MHz multi-port distributed 10 Mb SRAM; 500 MHz DCM digital clock management.

40 >> Outline << http://www.uni-kl.de Preface | The von Neumann paradigm trap | Supercomputing: the wrong Road Map | The Solution ignored for decades | Fine-grained vs. coarse-grained | The wrong Road Map for CS Curricula | Conclusions

41 Computing Curricula 2004 (Joint Task Force) fully ignores Reconfigurable Computing: "FPGA" & synonyms score 0 hits. Not even here, while Google gives 10 million hits. Curricula?

42 Curriculum Recommendations, v. 2005. Upon my complaints* the only change was to include, at the end of the last paragraph of the survey volume: "programmable hardware (including FPGAs, PGAs, PALs, GALs, etc.)." However, there were no structural changes at all, and v. 2005 is intended to be the final version (?). Torpedoing the transdisciplinary responsibility of CS curricula: this is criminal! [Peter Denning ...] *) no reply

43 Here is the common model: a CPU (instruction-stream-based, running software code) plus accelerators, hardwired or reconfigurable (data-stream-based, configured by configware code). It's not von Neumann: the tail is wagging the dog, since most accumulated MIPS have been migrated to the accelerators, and the CPU mainly just runs legacy code etc.

44 Dual-Paradigm Application Development: the software/configware co-compiler (Juergen Becker's CoDe-X, 1996). A C-language source goes through a partitioner into the SW compiler, yielding instruction-stream-based software code for the CPU, and into the CW compiler, yielding data-stream-based configware code for the reconfigurable accelerator (rDPU); a hardwired accelerator completes the picture. The road map to the Personal Supercomputer. Intel future multi-core? Dual-mode PUs?

45 For transdisciplinary CS education we need a curricular dual-paradigm approach: procedural (instruction-stream-based: Software Engineering) and structural (data-stream-based: Configware Engineering). The procedural-only, von-Neumann-only mind set is obsolete.

46 >> Outline << http://www.uni-kl.de Preface | The von Neumann paradigm trap | Supercomputing: the wrong Road Map | The Solution ignored for decades | Fine-grained vs. coarse-grained | The wrong Road Map for CS Curricula | Conclusions

47 Taxonomy of Algorithm Migration (1). An instruction-stream-based algorithm taxonomy partially exists, but is not really systematic; for algorithms migrated to the time-space domain (for RC), no taxonomy exists at all. Steadily coming and going data streams are the best candidates; computationally intensive applications are the best candidates for migration to FPGAs. Bulk databases might be subject to FPGA usage to avoid memory cycles for address computation. A few algorithms (e.g. Turbo code or Viterbi) require a massive amount of interconnect.

48 Taxonomy of Algorithm Migration (2). Migration efficiency (reducing memory cycles): loop transformations are efficient and deterministic, whereas caches are indeterministic and energy guzzlers; much less local memory is needed; distributed on-chip memory architectures are highly promising as secondary data memory; address computations migrate efficiently. Servers: still to be investigated.
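A minimal illustration of why loop transformations count as efficient and deterministic: fusing two loops removes the intermediate array, and with it a deterministic, countable amount of memory traffic. The traffic counts below are a simple model, not measurements.

```python
# Loop fusion as a memory-cycle-reducing transformation (toy model).

def unfused(xs):
    tmp = [x * 2 for x in xs]           # n writes to an intermediate array
    out = [t + 1 for t in tmp]          # n reads back from it
    extra_memory_traffic = 2 * len(xs)  # writes + reads of `tmp`
    return out, extra_memory_traffic

def fused(xs):
    out = [x * 2 + 1 for x in xs]       # value never leaves the datapath
    return out, 0

print(unfused([1, 2, 3]))   # ([3, 5, 7], 6)
print(fused([1, 2, 3]))     # ([3, 5, 7], 0)
```

The saving is known exactly at compile time, in contrast to a cache, whose benefit depends on runtime access patterns.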

49 Conclusions: excellent results proven for computationally intensive applications; highly promising for servers; improvements likely for bulk data & storage applications; the tool and language scenario needs an urgent transdisciplinary clean-up.

50 Thank you

51 END

52 Backup:

53 SW / CW Co-Compilation. The source (C, FORTRAN, MATLAB, etc.) goes through an automatic SW/CW partitioner; the software compiler emits software code, while the configware compiler's mapper emits configware code and its scheduler emits flowware code and the data.

54 Nick Tredennick's Paradigm Shifts: why are 2 different codes needed? Software Engineering (CPU): resources fixed, algorithm variable, so 1 programming source is needed (software). Configware Engineering: resources variable, algorithm variable, so 2 programming sources are needed (configware and flowware).

55 Configware solution: computing in space. For the demo, a tiny section of the pipe network (an rDPU with a "+" operator producing S); inter-rDPU communication needs no memory cycles.

56 Compare it to the software solution on a CPU: S = R + (if C then A else B endif); executed on a very simple CPU (here C = 1), step by step: if C then read A (read instruction, instruction decoding, read operand, operate & register transfers); if not C then read B (read instruction, instruction decoding); add & store (read instruction, instruction decoding, operate & register transfers, store result). The memory cycles and nanoseconds per step are tallied on the next slide. [Circuit: C selects A or B through a multiplexer into an adder with R; clock 200 MHz.]

57 A hypothetical branching example to illustrate software-to-configware migration: S = R + (if C then A else B endif); simple conservative CPU example, C = 1, clock 200 MHz (5 nanoseconds):

step                                                      memory cycles  nanoseconds
read instruction (if C then read A)                             1            100
read operand* (after instruction decoding)                      1            100
read instruction (if not C then read B)                         1            100
read instruction (add & store)                                  1            100
store result (after decoding, operate & reg. transfers)         1            100
total                                                           5            500

*) if no intermediate storage in the register file. The same statement as a section of a major pipe network on an rDPU needs no memory cycles: speed-up factor = 100.
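The tally can be re-run as a small sketch. The cycle counts and clock are the slide's; the helper names and the split into two functions are mine.

```python
# Re-tallying the slide's branching example: CPU vs. rDPU pipe section.

MEMORY_CYCLE_NS = 100   # per main-memory access (slide's assumption)
CLOCK_NS = 5            # one 200 MHz rDPU clock period

def cpu_version(R, A, B, C):
    # Slide's tally (shown there for C = 1): three instruction fetches,
    # one operand fetch, one result store -> 5 memory cycles.
    memory_cycles = 5
    S = R + (A if C else B)
    return S, memory_cycles * MEMORY_CYCLE_NS

def rdpu_version(R, A, B, C):
    # Multiplexer and adder sit in the pipe: one clock, zero memory cycles.
    S = R + (A if C else B)
    return S, CLOCK_NS

s_cpu, t_cpu = cpu_version(R=10, A=3, B=7, C=1)
s_pipe, t_pipe = rdpu_version(R=10, A=3, B=7, C=1)
print(s_cpu, s_pipe, t_cpu // t_pipe)   # 13 13 100
```

Both versions compute the same S; the 500 ns vs. 5 ns ratio reproduces the slide's speed-up factor of 100.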

58 The wrong mind set: "but you can't implement decisions!" Not knowing this solution, the decision S = R + (if C then A else B endif) as a section of a very large pipe network, is a symptom of the hardware/software chasm and the configware/software chasm.

59 Generalization of the DMA: a data counter instead of a program counter (the anti-von-Neumann machine paradigm). ASM: Auto-Sequencing Memory = GAG + RAM, driven by a data counter. GAG & enabling technology published 1989 [by TU-KL]**, plus storage-scheme optimization methodology, etc.; survey paper: [M. Herz et al.*: IEEE ICECS 2003, Dubrovnik]. *) IMEC & TU-KL **) patented by TI 1995

60 1986: the Xputer Lab at Kaiserslautern, MoM I and II.


