Presentation is loading. Please wait.

Presentation is loading. Please wait.

How to cope with the Power Wall Reiner Hartenstein TU Kaiserslautern IEEE fellow FPL fellow SDPS fellow PATMOS 2015, the 25 th International.

Similar presentations


Presentation on theme: "How to cope with the Power Wall Reiner Hartenstein TU Kaiserslautern IEEE fellow FPL fellow SDPS fellow PATMOS 2015, the 25 th International."— Presentation transcript:

1 How to cope with the Power Wall Reiner Hartenstein TU Kaiserslautern IEEE fellow FPL fellow SDPS fellow http://hartenstein.de PATMOS 2015, the 25 th International Workshop on Power and Timing Modeling, Optimization and Simulation; Salvador, Bahia, Brazil, Sept 1-5, 2015 downloadable from http://xputer.de

2 © 2015, reiner@hartenstein.de http://xputer.de TU Kaiserslautern 2 >> Outline << http://www.uni-kl.de The Power Wall “Dataflow” Computing Reconfigurable Computing Time to Space Mapping The Xputer Paradigm Conclusions downloadable from

3 © 2015, reiner@hartenstein.de http://xputer.de TU Kaiserslautern Oldest conference series on power efficiency Power efficiency is going to become an industry-wide issue Some incremental improvements are on track, at all abstraction levels however, there is still a lot to be done http://hartenstein.de/PATMOS/ 3 Project. leader: Reiner Hartenstein partner.. leader: Antonio Núñez partner.. leader: Francis Jutand spin-off from the PATMOS project The Workshop Series

4 © 2015, reiner@hartenstein.de http://xputer.de TU Kaiserslautern Power-Efficient Computing Power-efficient Microchip Design 4 Power-efficient Computer Architectures Power-efficient Languages and Compilers Power-efficient Software Implementation (Power-efficient Memory) Power-efficient Machine Paradigm tutorial by Jan Rabaey my presentation

5 © 2015, reiner@hartenstein.de http://xputer.de TU Kaiserslautern 5 5 © New York Times Data Center at Dallas Columbia river G. Fettweis, E. Zimmermann: ICT Energy Consumption - Trends and Challenges; WPMC'08, Lapland, Finland, 8 – 11 Sep 2008 Power consumption by internet: x30 til 2030 if trends continue It‘s more than the entire world‘s total power consumption to-day It‘s more than the entire world‘s total power consumption to-day !!! ICT infrastructures © Hewlett Packard Three tectonic shifts: the energy-constrained world from internet of people to internet of (every)thing the end of scaling as we know it the end of scaling as we know it © Hewlett Packard

6 © 2015, reiner@hartenstein.de http://xputer.de TU Kaiserslautern 6 >> Outline << http://www.uni-kl.de The Power Wall “Dataflow” Computing Reconfigurable Computing Time to Space Mapping The Xputer Paradigm Conclusions downloadable from

7 © 2015, reiner@hartenstein.de http://xputer.de TU Kaiserslautern Terminology Problems Stressing differences to „ Control-Flow “ Computers an area called „Dataflow“ Computers was started mid‘ 70ies at MIT Xputer area people are forced to sidestep by using terms like „data-driven“ or „data streams“… … although the „Dataflow“ scene is „I-Structure“-centered 7 Tagged Token Flow or „ “

8 © 2015, reiner@hartenstein.de http://xputer.de TU Kaiserslautern A Second Opinion D. D. Gajski, D. A. Padua, D. J. Kuck, R. H. Kuhn: A Second Opinion on Data Flow Machines and Languages; IEEE COMPUTER, February 1982 the subtitle: "... data flow techniques attract a great deal of attention. Other alternatives, however, offer more hope for the future." ( Still active workshops …, e. g.: 5th Workshop on Data-Flow Execution Models for Extreme Scale Computing (DFM 2015), Oct 18-21, 2015, San Francisco, CA, USA ) However, the power efficiency break-thru did not happen here 8

9 © 2015, reiner@hartenstein.de http://xputer.de TU Kaiserslautern 9 >> Outline << http://www.uni-kl.de The Power Wall “Dataflow” Computing Reconfigurable Computing Time to Space Mapping The Xputer Paradigm Conclusions downloadable from

10 © 2015, reiner@hartenstein.de http://xputer.de TU Kaiserslautern 10 FFT 100 Reed-Solomon Decoding 2400 Viterbi Decoding 400 1000 MAC DSP and wireless Smith-Waterman pattern matching 288 molecular dynamics simulation 88 BLAST 52 protein identification 40 Bioinformatics GRAPE 20 Astrophysics SPIHT wavelet-based image compression 457 real-time face detection 6000 video-rate stereo vision 900 pattern recognition 730 Image processing, Pattern matching, Multimedia 3000 CT imaging 10 6 Speedup-Factor 199520002010 2005 ~300 DNA & protein sequencing 8723 779 x2/yr 1985 1990 10 0 *) TU Kaiserslautern 10 3 10 0 + Pre-FPGA solutions Xputer 15000 no FPGA: DPLA on MoM by TU-KL* Design Rule Check accelerator PISA („fair comparizon“) 1984: 1 DPLA replaces 256 FPGAs 10 5 Speedup- Factor Speed-ups by vN Software to FPGA Migrations 10 3 ~10% of speed-u p Power saving : http://www.fpl.uni-kl.de/staff/hartenstein/Hartenstein-Speedup-Factors.pdf http://www.fpl.uni-kl.de/staff/hartenstein/eishistory_en.html fabricated by E.I.S. Multi University Project Chip 1984 >15 years earlier 3439 Crypto 1000 28514 DES breaking Saving Power

11 © 2015, reiner@hartenstein.de http://xputer.de 11 The Reconfigurable Computing Paradox although the effective integration density of FPGAs is by 4 orders of magnitude behind the Gordon Moore curve, because of: wiring overhead reconfigurability overhead routing congestion “von Neumann Syndrome” C.V. Ramamoorthy Von Neumann Syndrome von Neumann: an extremely power-inefficient paradigm Reinvent Computing why ?

12 © 2015, reiner@hartenstein.de http://xputer.de 12 Obstacles to widespread FPGA adoption go well beyond the required skill set - Workshop at FPL_2015 http://reconfigurablecomputing4themasses.net/

13 © 2015, reiner@hartenstein.de http://xputer.de TU Kaiserslautern What about Acceleration by Graphics Processors ? Power saving mostly not documented R. Vaduc et al.*: „ … adding a GPU is equivalent to adding one more multicore CPU socket …” *) R. Vuduc, J. Choi, M. Guney, A. Shringarpure: On the Limits of GPU Acceleration ; Proc. HotPar'10, 2 nd USENIX workshop on Hot Topics in Parallelism, June 14 - 15, 2010, Berkeley, CA, USA, USENIX Assoc. Berkeley, CA, USA http://newport.eecs.uci.edu/~amowli/resources/papers/vuduc2010-hotpar.pdf Drastically smaller Speed-ups if at all von Neumann

14 © 2015, reiner@hartenstein.de http://xputer.de TU Kaiserslautern 14 >> Outline << http://www.uni-kl.de The Power Wall “Dataflow” Computing Reconfigurable Computing Time to Space Mapping The Xputer Paradigm Conclusions downloadable from

15 © 2006, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern Dual paradigm mind set: an old hat – but was ignored 15 time to space mapping: procedural to structural: 1967: W. A. Clark: Macromodular Computer Systems; 1967 SJCC, AFIPS Conf. Proc. C. G. Bell et al: The Description and Use of Register- Transfer Modules (RTM's); IEEE Trans-C21/5, May 1972 FF token bit evoke FF „This is so simple …… 1971 loop to pipe mapping T u n n e l V i s i o n ! why did it take 25 years to find out? PDP-16 RTMs: demultiplexer

16 © 2015, reiner@hartenstein.de http://xputer.de 16 x x x x x x x x x | || xx x x x x xx x -- - input data stream xx x x x x xx x -- - - - - - - - - - - x x x x x x x x x | | | | | | | | | | | | | | output data streams „ data streams “ time port # time port # time port # define:... which data item at which time at which port The Systolic Arrays (1) no instruction streams needed 1980 DPA (pipe network) DataPath Array (array of DPUs) M. J. Foster, H. T. Kung: The Design of Special-Purpose VLSI Chips... IEEE 7th ISCA, La Baule, France, May 6-8, 1980 H. T. Kung execution transport- triggered Kung‘s example (algebra)

17 © 2015, reiner@hartenstein.de http://xputer.de TU Kaiserslautern 17. just only. special purpose ? (Tunnel Vision) Karl Steinbuch It is not sufficient to invent something. You need to recognize, that you have invented something. Systolic Arrays (2) http://karl-steinbuch.org 17 M. J. Foster and H. T. Kung: “The Design of Special- Purpose VLSI Chips... “

18 © 2015, reiner@hartenstein.de http://xputer.de 18 H. T. Kung*: „of course algebraic!“ My student Rainer Kress replaced it by simulated annealing : this supports also any irregular & wild shape pipe networks What Synthesis Method? The super-systolic array: a generalization of the systolic array: http://kressarray.de/ supports only very special applications with strictly regular data dependencies only linear pipes *) KressArray [ASP-DAC-1995]KressArray general purpose now a general purpose methodology ! (linear projection) *) a mathematician

19 © 2015, reiner@hartenstein.de http://xputer.de TU Kaiserslautern H. T. Kung: “It’s not our job” 19 *) or receives another Tunnel Vision Symptom without a sequencer: missed to invent a new machine paradigm Who generates* the datastreams? (the Xputer)

20 © 2015, reiner@hartenstein.de http://xputer.de TU Kaiserslautern 20 >> Outline << http://www.uni-kl.de The Power Wall “Dataflow” Computing Reconfigurable Computing Time to Space Mapping The Xputer Paradigm Conclusions downloadable from

21 © 2015, reiner@hartenstein.de http://xputer.de TU Kaiserslautern The Xputer machine Paradigm obtained by adding auto-sequencing memory (ASM) GAG : G eneric A ddress G enerstor With data counters instead of a program counter is the TU-KL ‘ s Symbiosis of Time to Space Mapping and Reconfigurable Computing! (TU-KL) ASM data counter GAG RAM ASM Xputer literature 21

22 © 2006, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern Duality of procedural Languages Flowware Languages read next data item goto (data address) jump to (data address) data loop data loop nesting data loop escape data stream branching yes: internally parallel loops 22 Software Languages read next instruction goto (instruction address) jump to (instruction address) instruction loop instruction loop nesting instruction loop escape instruction stream branching no: no internally parallel loops Asymmetry But there is an Asymmetry program counter: data counter(s): more simple: no ALU tasks MoPL state register(s): Xputer literature Xputer pages easy to learn control-procedural data-procedural

23 © 2015, reiner@hartenstein.de http://xputer.de TU Kaiserslautern 23 Compilation: Software vs. Configware u. Flowware source program software compiler software code Software Engineering configware code mapper configware compiler scheduler flowware code source „ program “ Configware Engineering time to space mapping data C, etc. procedural: time domain) space domain time domain Xputer literature

24 © 2015, reiner@hartenstein.de http://xputer.de TU Kaiserslautern 24 Heterogeneous: Co-Compilation software compiler software code Software / Configware Co-Compiler configware code mapper configware compiler scheduler flowware code data C, or other high level language automatic SW / CW partitioner Xputer pages important: why? ( Jürgen Becker ( Jürgen Becker ‘s Ph.D. thesis) CoDe-X

25 © 2015, reiner@hartenstein.de http://xputer.de TU Kaiserslautern 25 >> Outline << http://www.uni-kl.de The Power Wall “Dataflow” Computing Reconfigurable Computing Time to Space Mapping The Xputer Paradigm Conclusions downloadable from

26 © 2015, reiner@hartenstein.de http://xputer.de TU Kaiserslautern 26 I llustrating the Paradigm Trap The von Neumann Manycore Approach the watering can model [Hartenstein] ( many crippled watering cans ) many von Neumann bottle- necks The von Neumann single core Approach The Memory Wall ( crippled watering can ) (1)

27 © 2015, reiner@hartenstein.de http://xputer.de TU Kaiserslautern 27 I llustrating the Paradigm Trap The “Dataflow“ Computer extremely complicated: no watering can model ! (2) (a power efficiency break-thru did not happen here)

28 © 2015, reiner@hartenstein.de http://xputer.de TU Kaiserslautern 28 Xputer: the only massively power-efficient Paradigm The Xputer Paradigm has no von Neumann bottle- neck the watering can model [Hartenstein ] fully supporting Reconfigurable Computing fully supporting Reconfigurable Computing

29 © 2015, reiner@hartenstein.de http://xputer.de TU Kaiserslautern We need a Seismic Shift … It’s an extremely massive challenge ! … to avoid future unaffordable ICT power consumption cost 29 For many more years we must work under a heterogeneous triple-paradigm mind set: The software from more than half a century sits squarely on top Configware, Flowware, and still even Software that’s why heterogeneous is important

30 © 2015, reiner@hartenstein.de http://xputer.de 30 thank you !

31 © 2015, reiner@hartenstein.de http://xputer.de 31 END downloadable from

32 © 2015, reiner@hartenstein.de http://xputer.de 32 Backup for discussion: downloadable from

33 © 2015, reiner@hartenstein.de http://xputer.de TU Kaiserslautern 33 [Tarek El-Ghazawi et al.: IEEE COMPUTER, Febr. 2008] much less equipment needed much less memory and bandwidth needed massively saving energy Reconfigurable Computing (RC): the intensive Impact SGI Altix 4700 with RC 100 RASC compared to Beowulf cluster Tarek El-Ghazawi Speed-ups by von Neumann to RC Migrations Application Speed-up factor Savings divisor PowerCostSize. DNA and Protein sequencing 872377922253 DES breaking 285143439961116

34 © 2015, reiner@hartenstein.de http://xputer.de TU Kaiserslautern Taxonomy 34 Flynn‘s taxonomy: von Neumann only Diana‘s taxonomy: Reconfigurable Computing Reiner‘s taxonomy: heterogeneous systems Reiner‘s 2 nd taxonomy: Xputers only noI

35 © 2015, reiner@hartenstein.de http://xputer.de TU Kaiserslautern LUT Table LUT Table First FPGA available 1984 from Xilinx

36 © 2015, reiner@hartenstein.de http://xputer.de Transformations since the 70ies time domain: procedure domain space domain: structure domain 36 Pipeline k time steps, n DPUs time algorithm space/time algorithmus Strip Mining Transformation Loop Transformations: rich methodology published: [survey: Diss. Karin Schmidt, 1994, Shaker Verlag]Karin SchmidtShaker Verlag n x k time steps, program loop 1 CPU (time to time/space mapping)

37 © 2015, reiner@hartenstein.de http://xputer.de 37 The Reconfigurable Computing Paradox Enabling software developers Enabling software developers to apply their skills over FPGAs has been a long and, as of yet, unreached research objective in reconfigurable computing. Reinvent Computing although the effective integration density of FPGAs is by 4 orders of magnitude behind the Gordon Moore curve, because of: wiring overhead reconfigurability overhead routing congestion etc.

38 © 2015, reiner@hartenstein.de http://xputer.de 38 data counter GAG RAM ASM : A uto- S equencing M emory use data counters, no program counter rDPA xx x x x x xx x -- - xx x x x x xx x -- - - - - - - - - - - ASM Data stream generator usage (pipe network) x x x x x x x x x | || x x x x x x x x x | | | | | | | | | | | | | | ASM implemented by distributed on-chip- memory Xputer machine paradigm Xputer pages GAG : Generic Address Generators (reconfigurable: to avoid memory-cycle-hungry address computation) general purpose reconfigurable also more irregular data routes supported

39 © 2015, reiner@hartenstein.de http://xputer.de TU Kaiserslautern 39 Paradigm Shift Consequences Software Engineering 1 programming source needed algorithm: variable resources: fixed software CPU Program Counter (PC) configware resources: variable 2 programming sources needed flowware algorithm: variable Configware Engineering Data Counters (DCs) sequencing code (e. g. see MoPL language) Configuration Code (CC) von Neumann: Xputer: Xputer literature Heterogenous Engineering Xputer and vN: all 3 programming sources needed

40 © 2015, reiner@hartenstein.de http://xputer.de 40 no memory wall DPA x x x x x x x x x | || xx x x x x xx x -- - input data streams xx x x x x xx x -- - - - - - - - - - - x x x x x x x x x | | | | | | | | | | | | | | output data streams DPU operation is transport-triggered no instruction streams no message passing nor thru common memory massively avoiding memory cycles Pipelining through DPU Arrays: the TU-KL Xputer principle DPA = DPU array DPU = Data Path Unit

41 © 2015, reiner@hartenstein.de http://xputer.de TU Kaiserslautern Von Neumann Syndrome 41 Lambert M. Surhone, Mariam T. Tennoe, Susan F. Hennessow (ed.): Von Neumann Syndrome; ßetascript publishing 2011 S h i f t i n g t o t h e d o m i n a n c e o f v o n - N e u m a n n - o n l y c a u s e d a n c u m u l a t e d d a m a g e o f a t l e a s t t r i l l i o n s o f D o l l a r s, i f n o t q u a d r i l l i o n s ….

42 © 2015, reiner@hartenstein.de http://xputer.de 42 Computing Paradigms term program counter execution triggered by paradigm CPU yes instruction fetch instruction-stream- based (von Neumann) program counter DPU CPU term no data arrival ** data-driven or data- stream-based data counters r DPU RPU Xputer Reconfigurable Computing no complicated I - structure handling “ Dataflow ” Computer **) “transport-triggered” *) based on tagged token “ I -Structure” I - structure DPUs DFC DFC *

43 © 2015, reiner@hartenstein.de http://xputer.de 43 I-Structure Storage Communication Network PE I-Structure Storage PE MIT Tagged Token Dataflow Architecture* *) source: Jurij SilcJurij Silc „instruction is executed even if some of its operands are not yet available“ no Program Counter no updateable global store ? Tagged Token Flow Architecture I would call it Wait-Match Unit & Waiting Token Store Instruction Fetch Unit Token Queue Program Store & Constant Store Form Token Unit ALU & Form Tag to/from the Communication Network RU SU PE :

44 © 2015, reiner@hartenstein.de http://xputer.de TU Kaiserslautern Solution: I- Structure concept * 44 “can be read but not written” “at least one read request has been deferred” “read request deferrend but write operation is allowed” [ source: Jurij Silc]Jurij Silc very complex data structures ! “each update consumes the structure and the value produces a new data structure” “ awkward or even impossible to implement ” Problems with such “Dataflow” Architectures *) „ I- Structure Flow“ instead of „Dataflow“ ? ? = Tagged Token Structure „Tagged Token Flow“ I would call it

45 © 2015, reiner@hartenstein.de http://xputer.de I -Structures ( I = incremental) - part 1 http://csd.ijs.si/courses/dataflow/ Jurij Silc: Dataflow Architectures 28 27 28 26 45

46 © 2015, reiner@hartenstein.de http://xputer.de MIT Tagged-Token Dataflow Architecture (PE) 31 29 http://csd.ijs.si/courses/dataflow / Jurij Silc: Dataflow Architectures I -structure select I - structure assign 30 20 I -Structures - part 2 46

47 © 2015, reiner@hartenstein.de http://xputer.de Power Efficiency of Programming Languages (an example) 47

48 © 2015, reiner@hartenstein.de http://xputer.de 48 How big is big ? © Hewlett Packard


Download ppt "How to cope with the Power Wall Reiner Hartenstein TU Kaiserslautern IEEE fellow FPL fellow SDPS fellow PATMOS 2015, the 25 th International."

Similar presentations


Ads by Google