Presentation is loading. Please wait.

Presentation is loading. Please wait.

How to cope with the Power Wall Reiner Hartenstein TU Kaiserslautern DRAFT PATMOS 2015, the 25 th International Workshop on Power and Timing Modeling,

Similar presentations


Presentation on theme: "How to cope with the Power Wall Reiner Hartenstein TU Kaiserslautern DRAFT PATMOS 2015, the 25 th International Workshop on Power and Timing Modeling,"— Presentation transcript:

1 How to cope with the Power Wall Reiner Hartenstein TU Kaiserslautern DRAFT PATMOS 2015, the 25 th International Workshop on Power and Timing Modeling, Optimization and Simulation; Salvador, Bahia, Brazil, Sept 1-5, 2015

2 © 2015, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 2 >> Outline << http://www.uni-kl.de The Power Wall “Dataflow” Computing Reconfigurable Computing Time to Space Mapping The Xputer Paradigm Conclusions

3 © 2015, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 3 years elder than ISPLED. Oldest conference series on power efficiency issues spin-off from the EU’s PATMOS project Power efficiency is going to become an industry-wide issue Some incremental improvements are on track, at all abstraction levels - not only in hardware design and software development however, there is still a lot to do http://hartenstein.de/PATMOS/

4 © 2015, reiner@hartenstein.de http://hartenstein.de Power Efficiency of Programming Languages (an example)

5 © 2015, reiner@hartenstein.de http://hartenstein.de 5 source: iStockphoto Programming Language Popularity not yet determined by power efficiency (IEEE)

6 © 2015, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 6 6 © New York Times Data Center at Dallas Columbia river G. Fettweis, E. Zimmermann: ICT Energy Consumption - Trends and Challenges; WPMC'08, Lapland, Finland, 8 – 11 Sep 2008 Power consumption by internet: x30 til 2030 if trends continue Already the internet’s 2007 carbon footprint was higher than that of worldwide air traffic It‘s more than the entire world‘s total power consumption to-day !!! IDC predicts: the total number of data centers around the world will peak at 8.6 million in 2017. ICT infrastructures

7 © 2015, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern Google‘s Electricity Bill „The possibility of computer equipment power consumption spiraling out of control could have serious consequences for the overall affordability of computing” [L. A. Barrosso, Google] Google patent for water-based data centers: the ocean provides cooling & power (motion of surface) Google asked the Federal Energy Regulatory Commission for the authority to sell electricity, Cost of a data center is determined solely by the monthly power bill, not by the cost of hardware or maintenance (for example)

8 © 2015, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern Massive Critique of the von Neumann Model John Backus 1978: Can programming be liberated from the von Neumann style? 8 “von Neumann Syndrome” C.V. Ramamoorthy Nathan’s Law: Software is a gas. It expands to fill all its containers... Nathan Myhrvold Dave Patterson bandwidth gap grows 50% / year has reached >1000x (long ago) Patterson’s Law: instruction streams are very slow and extremely memory-cycle-hungry Software dimension: MLoC (Million Lines of Code): SAP NetWeaver 2007238 Mac OS X 10.4 86 Debian 5.0 2009 324 von Neumann: an extremely power-inefficient paradigm Von Neumann Syndrome

9 © 2015, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 9 >> Outline << http://www.uni-kl.de The Power Wall “Dataflow” Computing Reconfigurable Computing Time to Space Mapping The Xputer Paradigm Conclusions

10 © 2015, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern Terminology Problems Searching a more power-efficient machine paradigm: Stressing differences to „ Control-Flow “ Computers an area called „Dataflow“ Computers was started mid‘ 70ies at MIT Xputer area people are forced to sidestep by using terms like „data-driven“ or „data streams“… although the „Dataflow“ scene is I-Structure-centered.

11 © 2015, reiner@hartenstein.de http://hartenstein.de 11 I-Structure Storage Communication Network PE I-Structure Storage PE MIT Tagged Token Dataflow Architecture* Wait-Match Unit & Waiting Token Store Instruction Fetch Unit Token Queue Program Store & Constant Store Form Token Unit ALU & Form Tag to/from the Communication Network RU SU PE : *) source: Jurij SilcJurij Silc „instruction is executed even if some of its operands are not yet available“ no PC no updateable global store ?

12 © 2015, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern Solution: I- Structure concept * 12 can be read but not written at least one read request has been deferred read request deferrend but write operation is allowed [ source: Jurij Silc]Jurij Silc very complex data structures ! “each update consumes the structure and the value produces a new data structure” “ awkward or even impossible to implement ” Problems with “Dataflow” Architectures *) „ I- Structure Flow“ instead of „Dataflow“ „instruction is executed even if some of its operands are not yet available“ ? ?

13 © 2015, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern A Second Opinion D. D. Gajski, D. A. Padua, D. J. Kuck, R. H. Kuhn: A Second Opinion on Data Flow Machines and Languages; IEEE COMPUTER, February 1982 the subtitle: "... data flow techniques attract a great deal of attention. Other alternatives, however, offer more hope for the future." ( Still active … 5th workshop on Data-Flow Execution Models for Extreme Scale Computing (DFM 2015), Oct 18-21, 2015, San Francisco, CA, USA ) However, the power efficiency break-thru did not happen here

14 © 2015, reiner@hartenstein.de http://hartenstein.de 14 Computing Paradigms term program counter execution triggered by paradigm CPU yes instruction fetch instruction-stream- based (von Neumann) program counter DPU CPU term no data arrival ** data-driven or data- stream-based data counters r DPU RPU Xputer Reconfigurable Computing no complicated I - structure handling “ Dataflow ” Computer **) “transport-triggered” *) based on data dependency graph I - structure DPUs DFC DFC *

15 © 2015, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 15 >> Outline << http://www.uni-kl.de The Power Wall “Dataflow” Computing Reconfigurable Computing Time to Space Mapping The Xputer Paradigm Conclusions

16 © 2015, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern LUT Table LUT Table First FPGA available 1984 from Xilinx

17 © 2015, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 17 [Tarek El-Ghazawi et al.: IEEE COMPUTER, Febr. 2008] much less equipment needed much less memory and bandwidth needed massively saving energy Reconfigurable Computing (RC): the intensive Impact SGI Altix 4700 with RC 100 RASC compared to Beowulf cluster Tarek El-Ghazawi Speed-ups by von Neumann to RC Migrations Application Speed-up factor Savings PowerCostSize DNA and Protein sequencing 872377922253 DES breaking 285143439961116

18 © 2015, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 18 FFT 100 Reed-Solomon Decoding 2400 Viterbi Decoding 400 1000 MAC DSP and wireless Smith-Waterman pattern matching 288 molecular dynamics simulation 88 BLAST 52 protein identification 40 Bioinformatics GRAPE 20 Astrophysics SPIHT wavelet-based image compression 457 real-time face detection 6000 video-rate stereo vision 900 pattern recognition 730 Image processing, Pattern matching, Multimedia 3000 CT imaging Crypto 1000 28514 DES breaking 10 6 Speedup-Factor 199520002010 2005 ~300 3439 DNA & protein sequencing 8723 779 x2/yr 1985 1990 10 0 *) TU Kaiserslautern 10 0 10 3 + Pre-FPGA solutions Xputer 15000 no FPGA: DPLA on MoM by TU-KL* Grid-based DRC* („fair comparizon“) 1984: 1 DPLA replaces 256 FPGAs 10 5 Speedup- Factor Speed-ups by vN Software to FPGA Migrations 10 3 ~10% of speed-u p Power saving : http://www.fpl.uni-kl.de/staff/hartenstein/Hartenstein-Speedup-Factors.pdf http://www.fpl.uni-kl.de/staff/hartenstein/eishistory_en.html fabricated by E.I.S. Multi University Project Chip

19 © 2015, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern Design Rule Check Accelerator 19 Processing 4-by-4 Reference Patterns DPLA: fabricated by the E.I.S. Multi University Project: *) PISA DRC accelerator [ICCAD 1984] 1984: 1 DPLA replaces 256 FPGAs DPLA was more area-efficient than FPGA - by several orders of magnitude http://www.fpl.uni-kl.de/staff/hartenstein/eishistory_en.html

20 © 2015, reiner@hartenstein.de http://hartenstein.de 20 although the effective integration density of FPGAs is by 4 orders of magnitude behind the Gordon Moore curve, because of: wiring overhead reconfigurability overhead routing congestion etc. The Reconfigurable Computing Paradox Enabling software developers to apply their skills over FPGAs has been a long and, as of yet, unreached research objective in reconfigurable computing. Reinvent Computing

21 © 2015, reiner@hartenstein.de http://hartenstein.de 21 Obstacles to widespread FPGA adoption go well beyond the required skill set - Workshop at FPL_2015 http://reconfigurablecomputing4themasses.net/

22 © 2015, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 22 >> Outline << http://www.uni-kl.de The Power Wall “Dataflow” Computing Reconfigurable Computing Time to Space Mapping The Xputer Paradigm Conclusions

23 © 2006, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern Dual paradigm mind set: an old hat – but was ignored 23 time to space mapping: procedural to structural: 1967: W. A. Clark: Macromodular Computer Systems; 1967 SJCC, AFIPS Conf. Proc. C. G. Bell et al: The Description and Use of Register- Transfer Modules (RTM's); IEEE Trans-C21/5, May 1972 FF token bit evoke FF „This is so simple …… 1971 loop to pipe mapping T u n n e l V i s i o n ! why did it take 25 years to find out? PDP-16 RTMs:

24 © 2015, reiner@hartenstein.de http://hartenstein.de Transformations since the 70ies time domain: procedure domain space domain: structure domain 24 Pipeline k time steps, n DPUs time algorithm space/time algorithmus Strip Mining Transformation Loop Transformations: rich methodology published: [survey: Diss. Karin Schmidt, 1994, Shaker Verlag]Karin SchmidtShaker Verlag n x k time steps, program loop 1 CPU (time to time/space mapping)

25 © 2015, reiner@hartenstein.de http://hartenstein.de 25 x x x x x x x x x | || xx x x x x xx x -- - input data stream xx x x x x xx x -- - - - - - - - - - - x x x x x x x x x | | | | | | | | | | | | | | output data streams „ data streams “ time port # time port # time port # define:... which data item at which time at which port The Systolic Arrays (1) no instruction streams needed 1980 DPA (pipe network) DataPath Array (array of DPUs) execution transport- triggered M. J. Foster, H. T. Kung: The Design of Special-Purpose VLSI Chips... IEEE 7th ISCA, La Baule, France, May 6-8, 1980 H. T. Kung

26 © 2015, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern why not general purpose ? 26 just only Special purpose VLSI chips ? M. J. Foster and H. T. Kung: “The Design of Special- Purpose VLSI Chips... “ (Tunnel Vision) Karl Steinbuch It is not sufficient to invent something. You need to recognize, that you have invented something. Systolic Arrays (2) http://karl-steinbuch.org

27 © 2015, reiner@hartenstein.de http://hartenstein.de 27 H. T. Kung: „of course algebraic!“ ( linear projection) My student Rainer Kress replaced it by simulated annealing : supports also any irregular & wild shape pipe networks What Synthesis Method? The super-systolic array: a generalization of the systolic array: http://kressarray.de/ supports only special applications with strictly regular data dependencies only linear pipes *) KressArray [ASP-DAC-1995] now a general purpose methodology !

28 © 2015, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern H. T. Kung: “It’s not our job” 28 *) or receives another Tunnel Vision Symptom without a sequencer: missed to invent a new machine paradigm Who generates* the datastreams? (the Xputer)

29 © 2015, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern The Xputer machine Paradigm obtained by adding auto-sequencing memory ASM : A uto- S equencing M emory GAG : G eneric A ddress G enerstor ASM : A uto- S equencing M emory GAG : G eneric A ddress G enerstor with multiple data counters instead of a program counter is the TU-KL ‘ s Symbiosis of Time to Space Mapping and Reconfigurable Computing. (TU-KL) ASM data counter GAG RAM ASM Xputer literature

30 © 2015, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 30 >> Outline << http://www.uni-kl.de The Power Wall “Dataflow” Computing Reconfigurable Computing Time to Space Mapping The Xputer Paradigm Conclusions

31 © 2015, reiner@hartenstein.de http://hartenstein.de What means Configware? Software Compiler Software Code 31 (instruction-procedural) Software Source space domain time domain Software to Configware Migration traditional Computing Configware Code (structural: space domain) mapper Placement & Routing Configware Source Reconfigurable Computing ( time domain) Xputer pages

32 © 2015, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 32 Compilation: Software vs. Configware u. Flowware source program software compiler software code Software Engineering configware code mapper configware compiler scheduler flowware code source „ program “ Configware Engineering placement & routing data C, FORTRAN MATHLAB procedural: time domain) space domain time domain Xputer literature

33 © 2015, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 33 Heterogeneous: Co-Compilation software compiler software code Software / Configware Co-Compiler configware code mapper configware compiler scheduler flowware code data C, FORTRAN, MATHLAB automatic SW / CW partitioner simulated annealing Xputer pages

34 © 2015, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 34 Paradigm Shift Consequences Software Engineering 1 programming source needed algorithm: variable resources: fixed software CPU Program Counter (PC) configware resources: variable 2 programming sources needed flowware algorithm: variable Configware Engineering Data Counters (DCs) sequencing code (e. g. see MoPL language) Configuration Code (CC) von Neumann: Xputer: Xputer literature

35 © 2015, reiner@hartenstein.de http://hartenstein.de 35 data counter GAG RAM ASM : A uto- S equencing M emory use data counters, no program counter rDPA xx x x x x xx x -- - xx x x x x xx x -- - - - - - - - - - - ASM Data stream generators (pipe network) x x x x x x x x x | || x x x x x x x x x | | | | | | | | | | | | | | ASM implemented by distributed on-chip- memory 100 & more on-chip ASM are feasible 100 & more on-chip ASM are feasible reconfigurable 32 ports, or n x 32 ports other example Xputer machine paradigm Xputer pages

36 © 2006, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern Duality of procedural Languages Flowware Languages read next data item goto (data address) jump to (data address) data loop data loop nesting data loop escape data stream branching yes: internally parallel loops 36 Software Languages read next instruction goto (instruction address) jump to (instruction address) instruction loop instruction loop nesting instruction loop escape instruction stream branching no: no internally parallel loops Asymmetry But there is an Asymmetry program counter: data counter(s): more simple: no ALU tasks MoPL state register(s): Xputer literature Xputer pages

37 © 2015, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 37 CoDe-X Dual Paradigm Application Development CPU SW compiler CW compiler C language source Partitioner rDPU Juergen Becker’s PhD thesis 1996 Road map to the Personal Supercomputer instruction- stream- based software codeconfigware code data- stream- based accelerator reconfigurable accelerator hardwired software/configware co-compiler high level language CPU

38 © 2015, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 38 >> Outline << http://www.uni-kl.de The Power Wall “Dataflow” Computing Reconfigurable Computing Time to Space Mapping The Xputer Paradigm Conclusions

39 © 2015, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 39 I llustrating the Paradigm Trap The von Neumann Manycore Approach the watering can model [Hartenstein] ( many crippled watering cans ) many von Neumann bottle- necks The von Neumann single core Approach The Memory Wall ( crippled watering can ) (1)

40 © 2015, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 40 I llustrating the Paradigm Trap The “Dataflow“ Computer extremely complicated: no watering can model ! (2) (a power efficiency break-thru did not yet happen here)

41 © 2015, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern 41 Xputer: the only massively power-efficient Paradigm The Xputer Paradigm has no von Neumann bottle- neck the watering can model [Hartenstein ]

42 © 2015, reiner@hartenstein.de http://hartenstein.de 42 thank you !

43 © 2015, reiner@hartenstein.de http://hartenstein.de 43 END

44 © 2015, reiner@hartenstein.de http://hartenstein.de 44 Backup for discussion:

45 © 2015, reiner@hartenstein.de http://hartenstein.de 45 no memory wall DPA x x x x x x x x x | || xx x x x x xx x -- - input data streams xx x x x x xx x -- - - - - - - - - - - x x x x x x x x x | | | | | | | | | | | | | | output data streams DPU operation is transport-triggered no instruction streams no message passing nor thru common memory massively avoiding memory cycles Pipelining through DPU Arrays: the TU-KL Xputer principle DPA = DPU array DPU = Data Path Unit

46 © 2015, reiner@hartenstein.de http://hartenstein.de I -Structures ( I = incremental) - part 1 http://csd.ijs.si/courses/dataflow/ Jurij Silc: Dataflow Architectures 28 27 28 26

47 © 2015, reiner@hartenstein.de http://hartenstein.de MIT Tagged-Token Dataflow Architecture (PE) 31 29 http://csd.ijs.si/courses/dataflow / Jurij Silc: Dataflow Architectures I -structure select I - structure assign 30 20 I -Structures - part 2

48 © 2015, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern Taxonomy 48 Flynn‘s taxonomy: von Neumann only Diana‘s taxonomy: Reconfigurable Computing Reiner‘s taxonomy: heterogeneous systems Reiner‘s 2 nd taxonomy: Xputers only noI

49 © 2015, reiner@hartenstein.de http://hartenstein.de TU Kaiserslautern Von Neumann Syndrome 49 Lambert M. Surhone, Mariam T. Tennoe, Susan F. Hennessow (ed.): Von Neumann Syndrome; ßetascript publishing 2011 S h i f t i n g t o t h e d o m i n a n c e o f v o n - N e u m a n n - o n l y c a u s e d a n c u m u l a t e d d a m a g e o f a t l e a s t t r i l l i o n s o f D o l l a r s, i f n o t q u a d r i l l i o n s ….


Download ppt "How to cope with the Power Wall Reiner Hartenstein TU Kaiserslautern DRAFT PATMOS 2015, the 25 th International Workshop on Power and Timing Modeling,"

Similar presentations


Ads by Google