
Reconfigurable Supercomputing means to brave the paradigm chasm Reiner Hartenstein HiPEAC Workshop on Reconfigurable Computing Ghent, Belgium January 28,


1 Reconfigurable Supercomputing means to brave the paradigm chasm Reiner Hartenstein HiPEAC Workshop on Reconfigurable Computing Ghent, Belgium January 28, 2007

2 © 2006, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern. The von Neumann Syndrome. CS people: blind in the right eye, tunnel vision in the left eye. Treatment is urgently needed.

3 Mainstream in Embedded Systems for a Decade. Number of Google hits for “FPGA and ….”: 1,620,000; 915,000; 398,000; 272,000; 647,000; 1,490,000. FPGA: 10,000,000. The embedded systems scene is not imprisoned by the von Neumann paradigm trap. Hardware people: computer architects, embedded system designers. http://hartenstein.de/pervasiveness.html

4 Everywhere in Scientific Computing. Number of Google hits for “FPGA and ….”: 162,000; 127,000; 158,000; 113,000; 171,000; 194,000 (shown next to the embedded-systems counts 1,620,000; 915,000; 398,000; 272,000; 647,000; 1,490,000). Is the math/software-savvy scene unqualified for RC? Educational deficits: help is needed from hardware experts.

5 Reconfigurable Supercomputing Revolution: Cray XT4; Silicon Graphics RASC; Reconfigurable Computing at Microsoft (Chuck Thacker).

6 Outline: The von Neumann Paradigm; Accelerators and FPGAs; The Reconfigurable Computing Paradox; The new Paradigm; Coarse-grained; Bridging the Paradigm Chasm; Conclusions

7 The first archetype machine model: mainframe, CPU. Procedural personalization by compile or assemble. The software industry's secret of success: a simple basic machine paradigm, RAM-based personalization, and an instruction-stream-based mind set (“von Neumann”). But now we live in the Configware Age.

8 The von Neumann Paradigm Trap [Burks, Goldstine, von Neumann; 1946]: program counter (auto-increment, jump, goto, branch); datapath unit with ALU etc.; I/O unit; RAM (memory cells have addresses). CS education got stuck in this paradigm trap, which stems from the technology of the 1940s. We need a dual-paradigm approach: CS education's right eye is blind, and its left eye suffers from tunnel vision.

9 Von Neumann: CPU = DPU + program counter, with RAM memory
term | program counter | execution triggered by | paradigm
CPU | yes | instruction fetch | instruction-stream-based
World of Software Engineering. Program source: software (tunnel vision in the left eye).

10 Nick Tredennick's Paradigm Shifts. Early historic machines: algorithm fixed, resources fixed. Von Neumann (CPU): algorithm variable, resources fixed; 1 programming source needed: software. (Slowly preparing to use both eyes for a dual-paradigm point of view.)

11 Compilation: Software (Software Engineering). The source program goes through the software compiler, which emits software code: an instruction schedule, sequential (von Neumann model).

12 Monstrous Steam Engines of Computing: 5120 processors with 5000 pins each; crossbar weight 220 t; 3000 km of thick cable; larger than a battleship; power measured in tens of megawatts; floor space measured in tens of thousands of square feet; ready 2003.

13 We are in a Computing Crisis
platform example | hardware cost, $ / Gflops | cost factor
MDgrape-3* (domain-specific, 2004) | 15 | 1
Pentium 4 | 400 | 27
Earth Simulator (supercomputer, 2003) | 8000 | 533
*) feasible also with rDPA
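The cost factors on this slide are simply each platform's cost per Gflops divided by the cheapest entry and rounded. A minimal sketch of that arithmetic follows; the dictionary and the function name `cost_factors` are illustrative and not part of the talk.

```python
# Hardware cost in US dollars per Gflops, as quoted on the slide.
COST_PER_GFLOPS = {
    "MDgrape-3": 15,
    "Pentium 4": 400,
    "Earth Simulator": 8000,
}


def cost_factors(costs, baseline="MDgrape-3"):
    """Return each platform's cost relative to the baseline, rounded to an integer."""
    base = costs[baseline]
    return {name: round(value / base) for name, value in costs.items()}


factors = cost_factors(COST_PER_GFLOPS)
for name, factor in factors.items():
    print(f"{name}: {factor}x")
```

Rounding to whole numbers reproduces the slide's column (1, 27, 533); the unrounded ratios are about 26.7 and 533.3.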

14 Outline: The von Neumann Paradigm; Accelerators and FPGAs; The Reconfigurable Computing Paradox; The new Paradigm; Coarse-grained; Bridging the Paradigm Chasm; Conclusions

15 von Neumann is not the common model. Mainframe age: the von Neumann instruction-stream-based machine (CPU = DPU + program counter, RAM memory, von Neumann bottleneck). Microprocessor age: an instruction-stream-based CPU (software) plus data-stream-based accelerators, i.e. co-processors (hardware).

16 The clash of paradigms (microprocessor age). A programmer does not understand function evaluation without machine mechanisms, i.e. without a program counter. The programmer (µprocessor side) is procedural: the basic mind set is instruction-stream-based. The hardware guy (accelerator side) is structural: a kind of data-stream-based mind set. This is the software / hardware chasm. We need a data-stream-based machine paradigm.

17 Here is the contemporary common model. Mainframe age: the von Neumann instruction-stream-based machine (CPU = DPU + program counter, RAM memory, von Neumann bottleneck). Microprocessor age: an instruction-stream-based CPU (software) plus data-stream-based accelerators, i.e. co-processors (hardware). Now we are in the configware age: CPU, hardwired accelerator, and reconfigurable accelerator.

18 FPGAs in Supercomputing. Synergisms: coarse-grained parallelism through conventional parallel processing (CPU = DPU + program counter; DataPath Units of 32 bit or 64 bit), and fine-grained parallelism through direct configware execution on the FPGAs (reconfigurable logic box, rLB: 1 bit; millions of rLBs embedded in a reconfigurable interconnect fabric).

19 FPGA Modes of Operation. Configware code is loaded from external flash memory, e.g. after power-on (~milliseconds). [Figure: timeline showing off, then configuration phase (C ph), then execution phase (E ph).] Legend: C ph = configuration phase, E ph = execution phase. Simple, static reconfigurability (requiring new OS principles).

20 Dynamically reconfigurable FPGAs: an established R&D area. [Figure: timeline per FPGA module (X, Y, Z) with alternating configuration phases (C ph) and execution phases (E ph) for configware macros X, Y and Z; X configures Y.] Partially reconfigurable: swapping and scheduling of relocatable configware code macros is managed by a configware operating system. A configware OS is fundamentally different from a software OS. Reconfigurable Computing at Microsoft: Microsoft ReconVista?

21 Outline: The von Neumann Paradigm; Accelerators and FPGAs; The Reconfigurable Computing Paradox; The new Paradigm; Coarse-grained; Bridging the Paradigm Chasm; Conclusions

22 Deficiencies of reconfigurable fabrics (fine-grained FPGA): the general-purpose “simple” FPGA. [Figure: transistors / microchip (10^0 to 10^9) versus year (1980 to 2010): the Gordon Moore curve (microprocessor) compared with FPGA physical, FPGA routed and FPGA logical density.] Reconfigurability overhead, wiring overhead and routing congestion: overhead >> 10,000, an immense area inefficiency (DeHon's 1st Law [1996: Ph.D. thesis, MIT]). Also a power guzzler with a slow clock. Deficiency factor: >10,000.

23 This extreme area inefficiency holds only for “simple FPGAs”. “Platform FPGAs”, however, are a predefined mixture of powerful, hardwired resources (microprocessors, memory blocks, multipliers, etc.) embedded in FPGA fabric.

24 Software-to-Configware (FPGA) Migration: some published speed-up factors. [Figure: relative performance (10^0 to 10^9) versus year (1980 to 2010), from 8080 to Pentium 4, with microprocessor and memory curves, growth-rate labels 7% / yr, 50% / yr and x2 / yr, and the memory wall.]
DSP and wireless: Reed-Solomon decoding 2400; MAC 1000; Viterbi decoding 400; FFT 100.
Image processing, pattern matching, multimedia: real-time face detection 6000; video-rate stereo vision 900; pattern recognition 730; SPIHT wavelet-based image compression 457.
Bioinformatics: Smith-Waterman pattern matching 288; molecular dynamics simulation 88; BLAST 52; protein identification 40.
Astrophysics: GRAPE 20. Crypto: 1000. Oil and gas: 17.
The RC paradox: deficiency factor >10,000; speed-up factor 6,000; total discrepancy >60,000,000.
Areas of success: from high-end systems on earth to mission-critical systems in space.
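The "total discrepancy" on this slide is the product of the two quoted factors: a fabric that is more than 10,000 times less area-efficient still delivers a published speed-up of 6,000. A small sketch of that multiplication, with a hypothetical function name, taking the slide's lower bounds at face value:

```python
def total_discrepancy(deficiency_factor, speedup_factor):
    """Multiply the area-deficiency factor by the observed speed-up factor."""
    return deficiency_factor * speedup_factor


# Lower bound for the deficiency factor and the largest speed-up quoted on the slide.
discrepancy = total_discrepancy(10_000, 6_000)
print(f"total discrepancy: > {discrepancy:,}")
```

Because the deficiency factor is a lower bound, the product is a lower bound as well, which is why the slide writes ">60,000,000".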

25 Reconfigurable HPC: this area is almost 10 years old.

26 Understanding the RC Paradox? An executive summary doesn't help: we must first understand the nature of the paradigm. von Neumann chickens?

27 What is the reason for the paradox? It results from decades of tunnel vision in CS R&D and education: the basic mind set is completely wrong (the von Neumann Syndrome). Moore's law is not applicable to all aspects of VLSI. “CPU: most flexible platform”? But >1000 CPUs running in parallel are the most inflexible platform, whereas FPGA and rDPA are very flexible. The Law of More: drastically declining programmer productivity. The law of Gates.

28 Rapid Decline of Computational Density [BWRC, UC Berkeley, 2004]. [Figure: SPECfp2000/MHz/billion transistors (0 to 200) versus year (1990 to 2005) for DEC Alpha, SUN, HP and IBM.] Alpha: down by 100 in 6 yrs; IBM: down by 20 in 6 yrs. Stolen from Bob Colwell. CPU (DPU + program counter): memory wall, caches, ... Primary design goal: avoiding a paradigm shift. A dramatic demo of the von Neumann Syndrome.

29 Avoiding the paradigm shift? “It is feared that domain scientists will have to learn how to design hardware. Can we avoid the need for hardware design skills and understanding?” (Tarek El-Ghazawi, panelist at SuperComputing 2006). “A leap too far for the existing HPC community” (panelist Allan J. Cantle). SuperComputing, Nov 11-17, 2006, Tampa, Florida: over 7000 registered attendees and 274 exhibitors. We need a bridge strategy: developing advanced tools for training the software community to think in fine-grained parallelism and pipelining techniques. A shorter leap: coarse-grained platforms, which allow a software-like pipelining perspective.

30 Outline: The von Neumann Paradigm; Accelerators and FPGAs; The Reconfigurable Computing Paradox; The new Paradigm; Coarse-grained; Bridging the Paradigm Chasm; Conclusions

31 We need a new machine paradigm. A programmer does not understand function evaluation without machine mechanisms, i.e. without a program counter. Data-stream-based mind set: we urgently need a data-stream-based machine paradigm. It was prepared almost 30 years ago. [Figure: data streams.]

32 Having introduced data streams (H. T. Kung, ~1980). [Figure: input data streams and output data streams, each plotted as port # over time, around a DPA (pipe network).] Execution is transport-triggered; there is no memory wall. Systolic array research throughout the 80s: a mathematicians' hobby. The road map to HPC: ignored for decades.

33 Who generates the data streams? Mathematicians: it's not our job (it's not algebraic). [Figure: “systolic” data streams.]

34 Without a sequencer, it's not a machine. Mathematicians failed to invent the new machine paradigm. Reductionist approach: (it's not our job).

35 Synthesis method? Of course algebraic (linear projection): only for applications with regular data dependencies. Mathematicians caught in their own paradigm trap (reductionist approach). Rainer Kress discarded their algebraic synthesis methods and replaced them with simulated annealing: rDPA, 1995. The super-systolic array: a generalization of the systolic array.

36 The counterpart of the von Neumann machine: the Kress/Kung Anti Machine (coarse-grained). [Figure: data streams between an (r)DPA and ASMs.] ASM: Auto-Sequencing Memory (RAM plus GAG with data counter). Data counters instead of a program counter; the data counters are located at the memory (not at the data path).

37 Machine models
term | program counter | execution triggered by | paradigm
CPU | yes | instruction fetch | instruction-stream-based
DPU** | no | data arrival* | data-stream-based
von Neumann machine: CPU (DPU + program counter) with RAM memory. Anti machine: DPU or rDPU with RAMs, each with its own data counter.
*) “transport-triggered”
**) does not have a program counter: no instruction fetch at run time
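The contrast between the two machine models can be made concrete with a toy simulation. The sketch below is only an illustration under simplifying assumptions (a three-opcode accumulator machine, and Python `range` objects standing in for hardware data counters); none of the names come from the talk. Both functions add two vectors stored in the same memory layout, but only the first one fetches instructions at run time.

```python
def run_instruction_stream(program, memory):
    """Toy von Neumann machine: a program counter selects each instruction."""
    pc = 0          # program counter
    acc = 0         # accumulator of the datapath
    fetches = 0     # instruction fetches performed at run time
    while pc < len(program):
        op, addr = program[pc]
        fetches += 1
        if op == "load":
            acc = memory[addr]
        elif op == "add":
            acc += memory[addr]
        elif op == "store":
            memory[addr] = acc
        else:
            raise ValueError(f"unknown opcode: {op}")
        pc += 1
    return fetches


def run_data_stream(memory, a_counter, b_counter, out_counter):
    """Toy anti machine: data counters at the memory feed a fixed adder DPU."""
    fetches = 0     # no program counter, so no instruction fetch at run time
    for a, b, out in zip(a_counter, b_counter, out_counter):
        memory[out] = memory[a] + memory[b]   # DPU fires when both operands arrive
    return fetches


def vector_add_program(n, a_base, b_base, out_base):
    """Unrolled instruction stream for an element-wise vector addition."""
    program = []
    for i in range(n):
        program += [("load", a_base + i), ("add", b_base + i), ("store", out_base + i)]
    return program


mem_vn = [1, 2, 3, 10, 20, 30, 0, 0, 0]
mem_ds = list(mem_vn)
vn_fetches = run_instruction_stream(vector_add_program(3, 0, 3, 6), mem_vn)
ds_fetches = run_data_stream(mem_ds, range(0, 3), range(3, 6), range(6, 9))
print(mem_vn[6:], vn_fetches)
print(mem_ds[6:], ds_fetches)
```

Both runs leave the same sums in memory; the instruction-stream version needs three instruction fetches per element, while the data-stream version needs none because the operation was fixed before execution started.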

38 Nick Tredennick's Paradigm Shifts. Early historic machines: algorithm fixed, resources fixed. Von Neumann (CPU): algorithm variable, resources fixed; 1 programming source needed: software. Reconfigurable Computing: algorithm variable, resources variable; 2 programming sources needed: configware (for the resources) and flowware (for the algorithm).

39 Configware Compilation (Configware Engineering). The source “program” goes into the configware compiler, which contains a mapper and a scheduler. The mapper produces configware code (placement & routing) for the rDPA pipe network; the scheduler produces flowware code (programming the data counters) for the ASMs. Configware compilation is fundamentally different from software compilation. [Figure: data streams between the rDPA pipe network and the surrounding ASMs.] ASM: Auto-Sequencing Memory (RAM plus GAG with data counter).

40 Data meeting the Processing Unit (PU): we have 2 choices. By software: routing the data by memory-cycle-hungry instruction streams through shared memory. By configware: placement of the execution locality; the pipe network is generated by configware compilation. This partly explains the RC paradox.

41 How much on-chip embedded BRAM? Fast on-chip block RAMs (BRAMs); DPU: coarse-grained. On-chip, LatticeCS series: 256 – 1704 BGA; 56 – 424; 8 – 32.

42 Generic Address Generator (GAG): a generalization of the DMA. GAG and enabling technology published 1989; survey: [M. Herz et al.: IEEE ICECS 2003, Dubrovnik]; patented by TI 1995. Storage scheme optimization methodology, etc. Acceleration factors by address computation without memory cycles: avoids e.g. 94% address computation overhead*. *) Software to Xputer migration
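The idea of computing addresses outside the datapath can be modeled in a few lines. The sketch below is a software analogy only, not the published GAG design: the function name `gag_scan` and its parameters are hypothetical, and a Python generator stands in for a hardware data counter that steps through a rectangular window of a row-major array.

```python
def gag_scan(base, row_stride, x0, y0, width, height, x_step=1, y_step=1):
    """Yield memory addresses for a rectangular scan window in a row-major array.

    base        address of element (0, 0)
    row_stride  number of memory words per array row
    x0, y0      upper-left corner of the scan window
    width, height  window size in elements
    """
    for y in range(y0, y0 + height, y_step):
        for x in range(x0, x0 + width, x_step):
            yield base + y * row_stride + x


# A 2 x 2 window starting at (1, 1) in an array with 4 words per row at address 100.
addresses = list(gag_scan(base=100, row_stride=4, x0=1, y0=1, width=2, height=2))
print(addresses)
```

In this model the datapath only ever sees the data words at the generated addresses; the multiply-and-add that a CPU would spend instructions on is done by the counter.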

43 The 2nd “archetype” machine model: the reconfigurable accelerator. Structural personalization by compile. The configware industry's secret of success: a simple basic machine paradigm, RAM-based personalization, and a data-stream-based mind set (“Kress-Kung”).

44 Outline: The von Neumann Paradigm; Accelerators and FPGAs; The Reconfigurable Computing Paradox; The new Paradigm; Coarse-grained; Bridging the Paradigm Chasm; Conclusions

45 Coarse-grained Reconfigurable Array: SNN filter on a (supersystolic) KressArray (mainly a pipe network). Array size: 10 x 16 rDPUs; rDPU: reconfigurable DataPath Unit, 32 bits wide; no CPU. Legend: route-through only; not used; backbus connect. Compiled by Nageldinger's KressArray Xplorer with Juergen Becker's CoDe-X inside. Note the software perspective without instruction streams: pipelining. Question after the talk: “but you can't implement decisions!”

46 SNN filter on a (supersystolic) KressArray (mainly a pipe network). Array size: 10 x 16 = 160 rDPUs; rDPU: reconfigurable DataPath Unit, e.g. 32 bits wide; no CPU. Legend: route-through only; not used; backbus connect. Note the software perspective without instruction streams. Question after the talk, from a high-level R&D manager of a large Japanese IT industry group: “but you can't implement decisions!” This is a symptom of the von Neumann Syndrome, yielded by a single-paradigm mind set: the if clause turns into a multiplexer. Executive summary? Forget it! How about a microprocessor giant having >100 vice presidents?
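The remark that an if clause turns into a multiplexer can be illustrated with a small example. This is a sketch of the general if-conversion idea in plain Python, not output of the KressArray tools; `mux`, `clip_branching` and `clip_dataflow` are hypothetical names. In a pipe network both candidate values are available as data, and a comparator result selects between them instead of steering a program counter.

```python
def mux(select, when_true, when_false):
    """Software stand-in for a 2-to-1 multiplexer: both inputs are already computed."""
    return when_true if select else when_false


def clip_branching(x, limit):
    """Instruction-stream view: the decision changes the control flow."""
    if x > limit:
        return limit
    return x


def clip_dataflow(x, limit):
    """Data-stream view: a comparator feeds the select input of a multiplexer."""
    over = x > limit             # comparator unit
    return mux(over, limit, x)   # multiplexer unit


samples = list(range(10))
branching = [clip_branching(x, 5) for x in samples]
dataflow = [clip_dataflow(x, 5) for x in samples]
print(dataflow)
```

Both versions produce the same stream of results, so the decision is implemented without any branch instruction.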

47 Far fewer deficiencies with coarse-grained arrays. [Figure: transistors / microchip (10^0 to 10^9) versus year (1980 to 2010): the Gordon Moore curve compared with rDPA physical and rDPA logical density.] Area efficiency very close to Moore's law (Hartenstein's Law [1996: ISIS, Austin, TX]). Very compact configuration code: very fast reconfiguration.

48 Dual Paradigm Application Development (Juergen Becker's CoDe-X, 1996). A C language source goes into the partitioner, which feeds the SW compiler (for the CPU) and the CW compiler (for the rDPU). Automatic parallelization by loop transformations, then placement and routing, generating a pipe network.

49 Data meeting the Processing Unit by configware: placement of the execution locality; the pipe network is generated by configware compilation.

50 Hybrid Multi Core example: a twin-paradigm machine with 64 cores; each core can run in CPU mode or rDPU mode. Twin paradigm provides the flexibility. How about the microprocessor industry? Customers refuse the paradigm shift? Disabled for the paradigm shift?

51 Compilation for Dual Paradigm Multicore (Juergen Becker's CoDe-X, 1996). A C language source goes into the partitioner, which feeds the SW compiler and the CW compiler; both compile to the hybrid multicore (rDPU and CPU cores). Automatic parallelization by loop transformations, then placement and routing, generating a pipe network.

52 Start-ups for Coarse-grained Platforms. One company has failed. Several companies have succeeded, but their technology disappeared through acquisition. Two companies are still available.

53 Outline: The von Neumann Paradigm; Accelerators and FPGAs; The Reconfigurable Computing Paradox; The new Paradigm; Coarse-grained; Bridging the Paradigm Chasm; Conclusions

54 Energy: an important motivation
platform example | energy, W / Gflops | energy factor
MDgrape-3* (domain-specific, 2004) | 0.2 | 1
Pentium 4 | 14 | 70
Earth Simulator (supercomputer, 2003) | 128 | 640
*) feasible also on reconfigurable platforms

55 An accidentally discovered side effect (Herb Riley, R. Associates). Software to FPGA migration of an oil and gas application: speed-up factor of 17; electricity bill down to <10%; hardware cost down to <10%. Saves > $10,000 in electricity bills per year (7 ¢ / kWh) per 64-processor 19" rack. $70 in 2010? All other publications reporting speed-up did not report energy consumption. This will change. What about higher speed-up factors? More dramatic electricity savings?
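The quoted saving can be converted into an average power figure. The sketch below is a back-of-the-envelope calculation that assumes the rack runs continuously all year and takes the slide's lower bound of $10,000 and the price of 7 ¢ / kWh at face value; the function name is hypothetical and the slide does not say whether cooling is included.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, assuming continuous operation


def implied_power_kw(savings_usd_per_year, price_usd_per_kwh):
    """Average power (kW) whose yearly energy cost equals the given savings."""
    energy_kwh = savings_usd_per_year / price_usd_per_kwh
    return energy_kwh / HOURS_PER_YEAR


saved_kw = implied_power_kw(10_000, 0.07)
print(f"implied average power saved per rack: about {saved_kw:.1f} kW")
```

Under these assumptions the saving corresponds to roughly 143,000 kWh per year, or about 16 kW of average power per rack.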

56 What's Really Going On With Oil Prices? [BusinessWeek, January 29, 2007]. $52: price of delivery in February 2007 [New York Mercantile Exchange: Jan. 17]. $200: minimum oil price in 2010, in a bet by investment banker Matthew Simmons.

57 Energy as a strategic issue. Google's annual electricity bill: $50,000,000. Amsterdam's electricity: 25% goes into server farms. NY city server farms: 1/4 km² building floor area. [Mark P. Mills] Predicted for the USA in 2020: 30-50% of the entire national electricity consumption goes into cyber infrastructure. PetaFlop supercomputer (by 2012?): extreme power consumption.

58 Outline: The von Neumann Paradigm; Accelerators and FPGAs; The Reconfigurable Computing Paradox; The new Paradigm; Coarse-grained; Bridging the Paradigm Chasm; Conclusions

59 Multi Core: Just more CPUs? Complexity and clock frequency of single-core microprocessors come to an end. Multi-core microprocessor chips are emerging: soon 32 cores on an AMD chip, and 80 on an Intel chip. Without a paradigm shift, just more CPUs on chip lead to the dead ends known from supercomputing. Multi-threading is not the silver bullet. We have to re-think the basic assumptions behind computing.

60 Solution not expected from CS officers. Progress of the joint task force on CS curriculum recommendations is extremely disillusioning: it's more like a lobby (“my area is the most important”). We need mutual efforts, like those between EE and CS known from the Mead & Conway revolution. For RC, other motivations are similarly high-grade: growing cost and looming shortage of energy. The personal supercomputer: a far-ranging massive push of innovation in all areas of science and economy, by Reconfigurable Computing. What about

61 Computing Sciences are in a severe Crisis. We urgently need to shape the Reconfigurable Computing Revolution to move toward incredibly promising new horizons of affordable highest-performance computing. This cannot be achieved with the classical software-based mind set: we need a new dual-paradigm approach. Watch out not to get screwed! Disruptive moves may lead to hostile terrain.

62 The Configware Age. Mainframe age and microprocessor(-only) age are history. We are living in the configware age right now! Attempts to avoid the paradigm shift will again create a disaster.

63 Thank you for your patience

64 END

