How to cope with the Power Wall Reiner Hartenstein TU Kaiserslautern IEEE fellow FPL fellow SDPS fellow PATMOS 2015, the 25 th International.

Slides:

Advertisements

Similar presentations

Enhanced matrix multiplication algorithm for FPGA Tamás Herendi, S. Roland Major UDT2012.

Advertisements

DSPs Vs General Purpose Microprocessors

The von Neumann Syndrome Reiner Hartenstein TU Kaiserslautern TU Delft, Sept 28, (v.2)

Reconfigurable Supercomputing means to brave the paradigm chasm Reiner Hartenstein HiPEAC Workshop on Reconfigurable Computing Ghent, Belgium January 28,

Stonewalled Progress of Computing Efficiency 1 Reiner Hartenstein (keynote) SA - Sep 1 16: :50

Tuan Tran. What is CISC? CISC stands for Complex Instruction Set Computer. CISC are chips that are easy to program and which make efficient use of memory.

DATAFLOW ARHITEKTURE. Dataflow Processors - Motivation In basic processor pipelining hazards limit performance –Structural hazards –Data hazards due to.

Reconfigurable Supercomputing: Hindernisse und Chancen Reiner Hartenstein TU Kaiserslautern Universität Mannheim, 13. Dez

From Organic Computing to Reconfigurable Computing Reiner Hartenstein TU Kaiserslautern PASA, Frankfurt, March 16, 2006.

Extensible Processors. 2 ASIP Gain performance by:  Specialized hardware for the whole application (ASIC). −  Almost no flexibility. −High cost.  Use.

Processor Technology and Architecture

CSCI 8150 Advanced Computer Architecture Hwang, Chapter 2 Program and Network Properties 2.3 Program Flow Mechanisms.

CS 300 – Lecture 20 Intro to Computer Architecture / Assembly Language Caches.

Chapter 4 Processor Technology and Architecture. Chapter goals Describe CPU instruction and execution cycles Explain how primitive CPU instructions are.

Recap – Our First Computer WR System Bus 8 ALU Carry output A B S C OUT F 8 8 To registers’ input/output and clock inputs Sequence of control signal combinations.

Big Kernel: High Performance CPU-GPU Communication Pipelining for Big Data style Applications Sajitha Naduvil-Vadukootu CSC 8530 (Parallel Algorithms)

CS curricula update proposed: by adding Reconfigurable Computing Reiner Hartenstein TU Kaiserslautern EAB meeting, Philadelphia,1 Nov 2005.

How to cope with the Power Wall Reiner Hartenstein TU Kaiserslautern DRAFT PATMOS 2015, the 25 th International Workshop on Power and Timing Modeling,

Group 5 Alain J. Percial Paula A. Ortiz Francis X. Ruiz.

GallagherP188/MAPLD20041 Accelerating DSP Algorithms Using FPGAs Sean Gallagher DSP Specialist Xilinx Inc.

Presenter MaxAcademy Lecture Series – V1.0, September 2011 Introduction and Motivation.

The Transdisciplinary Responsibility of CS Curricula Reiner Hartenstein TU Kaiserslautern San Diego, CA, USA, June , 2006 THE NINTH WORLD CONFERENCE.

Instruction Sets and Pipelining Cover basics of instruction set types and fundamental ideas of pipelining Later in the course we will go into more depth.

How to cope with the Power Wall Reiner Hartenstein TU Kaiserslautern PATMOS 2015, the 25 th International Workshop on Power and Timing Modeling, Optimization.

High Performance Architectures Dataflow Part 3. 2 Dataflow Processors Recall from Basic Processor Pipelining: Hazards limit performance  Structural hazards.

Making FPGAs a Cost-Effective Computing Architecture Tom VanCourt Yongfeng Gu Martin Herbordt Boston University BOSTON UNIVERSITY.

The PATMOS Workshop Series Reiner Hartenstein TU Kaiserslautern IEEE fellow FPL fellow SDPS fellow PATMOS 2015, the 25 th International.

Speed-ups obtained by Reconfigurable Computing Reiner Hartenstein CAPES/DFG Cooperation on Reconfigurable Computing, inv. talk, Sept 19, 2008, Dept of.

Architectures for mobile and wireless systems Ese 566 Report 1 Hui Zhang Preethi Karthik.

COMPUTER SCIENCE &ENGINEERING Compiled code acceleration on FPGAs W. Najjar, B.Buyukkurt, Z.Guo, J. Villareal, J. Cortes, A. Mitra Computer Science & Engineering.

Chapter 2 Parallel Architecture. Moore’s Law The number of transistors on a chip doubles every years. – Has been valid for over 40 years – Can’t.

Programming Concepts in GPU Computing Dušan Gajić, University of Niš Programming Concepts in GPU Computing Dušan B. Gajić CIITLab, Dept. of Computer Science.

CS1104 – Computer Organization PART 2: Computer Architecture Lecture 12 Overview and Concluding Remarks.

Computer Organization and Design Computer Abstractions and Technology

Lecture 16: Reconfigurable Computing Applications November 3, 2004 ECE 697F Reconfigurable Computing Lecture 16 Reconfigurable Computing Applications.

Execution of an instruction

VLSI-SoC 2001 IFIP - LIRMM Stream-based Arrays: Converging Design Flows for both, Reiner Hartenstein University of Kaiserslautern December 2- 4, 2001,

1 - CPRE 583 (Reconfigurable Computing): Reconfigurable Computing Architectures Iowa State University (Ames) Reconfigurable Architectures Forces that drive.

System-level power analysis and estimation September 20, 2006 Chong-Min Kyung.

M U N - February 17, Phil Bording1 Computer Engineering of Wave Machines for Seismic Modeling and Seismic Migration R. Phillip Bording February.

Computer Architecture 2 nd year (computer and Information Sc.)

Reconfigurable HPC Notes on datastream-based FFT

ALU (Continued) Computer Architecture (Fall 2006).

Reduction of Register File Power Consumption Approach: Value Lifetime Characteristics - Pradnyesh Gudadhe.

Dept. of Computer Science - CS6461 Computer Architecture CS6461 – Computer Architecture Fall 2015 Lecture 1 – Introduction Adopted from Professor Stephen.

Digital Computer Concept and Practice Copyright ©2012 by Jaejin Lee Control Unit.

Von Neumann Computers Article Authors: Rudolf Eigenman & David Lilja

1 November 11, 2015 A Massively Parallel, Hybrid Dataflow/von Neumann Architecture Yoav Etsion November 11, 2015.

Lecture 3: Computer Architectures

Mapping of Regular Nested Loop Programs to Coarse-grained Reconfigurable Arrays – Constraints and Methodology Presented by: Luis Ortiz Department of Computer.

1 - CPRE 583 (Reconfigurable Computing): Reconfigurable Computing Architectures Iowa State University (Ames) CPRE 583 Reconfigurable Computing Lecture.

Chapter 2 Data Manipulation © 2007 Pearson Addison-Wesley. All rights reserved.

Digital Computer Concept and Practice Copyright ©2012 by Jaejin Lee Control Unit.

Autumn 2006CSE P548 - Dataflow Machines1 Von Neumann Execution Model Fetch: send PC to memory transfer instruction from memory to CPU increment PC Decode.

William Stallings Computer Organization and Architecture 6th Edition

Topics to be covered Instruction Execution Characteristics

Computer Organization

ECE354 Embedded Systems Introduction C Andras Moritz.

Control Unit Lecture 6.

Ph.D. in Computer Science

Embedded Systems Design

Architecture & Organization 1

Architecture & Organization 1

Embedded Architectures: Configurable, Re-configurable, or what?

How to cope with the Power Wall

ECE 352 Digital System Fundamentals

Samira Khan University of Virginia Jan 23, 2019

Prof. Onur Mutlu Carnegie Mellon University

Presentation transcript:

How to cope with the Power Wall Reiner Hartenstein TU Kaiserslautern IEEE fellow FPL fellow SDPS fellow PATMOS 2015, the 25 th International Workshop on Power and Timing Modeling, Optimization and Simulation; Salvador, Bahia, Brazil, Sept 1-5, 2015 downloadable from

© 2015, TU Kaiserslautern 2 >> Outline << The Power Wall “Dataflow” Computing Reconfigurable Computing Time to Space Mapping The Xputer Paradigm Conclusions downloadable from

© 2015, TU Kaiserslautern Oldest conference series on power efficiency Power efficiency is going to become an industry-wide issue Some incremental improvements are on track, at all abstraction levels however, there is still a lot to be done 3 Project. leader: Reiner Hartenstein partner.. leader: Antonio Núñez partner.. leader: Francis Jutand spin-off from the PATMOS project The Workshop Series

© 2015, TU Kaiserslautern Power-Efficient Computing Power-efficient Microchip Design 4 Power-efficient Computer Architectures Power-efficient Languages and Compilers Power-efficient Software Implementation (Power-efficient Memory) Power-efficient Machine Paradigm tutorial by Jan Rabaey my presentation

© 2015, TU Kaiserslautern 5 5 © New York Times Data Center at Dallas Columbia river G. Fettweis, E. Zimmermann: ICT Energy Consumption - Trends and Challenges; WPMC'08, Lapland, Finland, 8 – 11 Sep 2008 Power consumption by internet: x30 til 2030 if trends continue It‘s more than the entire world‘s total power consumption to-day It‘s more than the entire world‘s total power consumption to-day !!! ICT infrastructures © Hewlett Packard Three tectonic shifts: the energy-constrained world from internet of people to internet of (every)thing the end of scaling as we know it the end of scaling as we know it © Hewlett Packard

© 2015, TU Kaiserslautern 6 >> Outline << The Power Wall “Dataflow” Computing Reconfigurable Computing Time to Space Mapping The Xputer Paradigm Conclusions downloadable from

© 2015, TU Kaiserslautern Terminology Problems Stressing differences to „ Control-Flow “ Computers an area called „Dataflow“ Computers was started mid‘ 70ies at MIT Xputer area people are forced to sidestep by using terms like „data-driven“ or „data streams“… … although the „Dataflow“ scene is „I-Structure“-centered 7 Tagged Token Flow or „ “

© 2015, TU Kaiserslautern A Second Opinion D. D. Gajski, D. A. Padua, D. J. Kuck, R. H. Kuhn: A Second Opinion on Data Flow Machines and Languages; IEEE COMPUTER, February 1982 the subtitle: "... data flow techniques attract a great deal of attention. Other alternatives, however, offer more hope for the future." ( Still active workshops …, e. g.: 5th Workshop on Data-Flow Execution Models for Extreme Scale Computing (DFM 2015), Oct 18-21, 2015, San Francisco, CA, USA ) However, the power efficiency break-thru did not happen here 8

© 2015, TU Kaiserslautern 9 >> Outline << The Power Wall “Dataflow” Computing Reconfigurable Computing Time to Space Mapping The Xputer Paradigm Conclusions downloadable from

© 2015, TU Kaiserslautern 10 FFT 100 Reed-Solomon Decoding 2400 Viterbi Decoding MAC DSP and wireless Smith-Waterman pattern matching 288 molecular dynamics simulation 88 BLAST 52 protein identification 40 Bioinformatics GRAPE 20 Astrophysics SPIHT wavelet-based image compression 457 real-time face detection 6000 video-rate stereo vision 900 pattern recognition 730 Image processing, Pattern matching, Multimedia 3000 CT imaging 10 6 Speedup-Factor ~300 DNA & protein sequencing x2/yr *) TU Kaiserslautern Pre-FPGA solutions Xputer no FPGA: DPLA on MoM by TU-KL* Design Rule Check accelerator PISA („fair comparizon“) 1984: 1 DPLA replaces 256 FPGAs 10 5 Speedup- Factor Speed-ups by vN Software to FPGA Migrations 10 3 ~10% of speed-u p Power saving : fabricated by E.I.S. Multi University Project Chip 1984 >15 years earlier 3439 Crypto DES breaking Saving Power

© 2015, 11 The Reconfigurable Computing Paradox although the effective integration density of FPGAs is by 4 orders of magnitude behind the Gordon Moore curve, because of: wiring overhead reconfigurability overhead routing congestion “von Neumann Syndrome” C.V. Ramamoorthy Von Neumann Syndrome von Neumann: an extremely power-inefficient paradigm Reinvent Computing why ?

© 2015, 12 Obstacles to widespread FPGA adoption go well beyond the required skill set - Workshop at FPL_2015

© 2015, TU Kaiserslautern What about Acceleration by Graphics Processors ? Power saving mostly not documented R. Vaduc et al.*: „ … adding a GPU is equivalent to adding one more multicore CPU socket …” *) R. Vuduc, J. Choi, M. Guney, A. Shringarpure: On the Limits of GPU Acceleration ; Proc. HotPar'10, 2 nd USENIX workshop on Hot Topics in Parallelism, June , 2010, Berkeley, CA, USA, USENIX Assoc. Berkeley, CA, USA Drastically smaller Speed-ups if at all von Neumann

© 2015, TU Kaiserslautern 14 >> Outline << The Power Wall “Dataflow” Computing Reconfigurable Computing Time to Space Mapping The Xputer Paradigm Conclusions downloadable from

© 2006, TU Kaiserslautern Dual paradigm mind set: an old hat – but was ignored 15 time to space mapping: procedural to structural: 1967: W. A. Clark: Macromodular Computer Systems; 1967 SJCC, AFIPS Conf. Proc. C. G. Bell et al: The Description and Use of Register- Transfer Modules (RTM's); IEEE Trans-C21/5, May 1972 FF token bit evoke FF „This is so simple …… 1971 loop to pipe mapping T u n n e l V i s i o n ! why did it take 25 years to find out? PDP-16 RTMs: demultiplexer

© 2015, 16 x x x x x x x x x | || xx x x x x xx x -- - input data stream xx x x x x xx x x x x x x x x x x | | | | | | | | | | | | | | output data streams „ data streams “ time port # time port # time port # define:... which data item at which time at which port The Systolic Arrays (1) no instruction streams needed 1980 DPA (pipe network) DataPath Array (array of DPUs) M. J. Foster, H. T. Kung: The Design of Special-Purpose VLSI Chips... IEEE 7th ISCA, La Baule, France, May 6-8, 1980 H. T. Kung execution transport- triggered Kung‘s example (algebra)

© 2015, TU Kaiserslautern 17. just only. special purpose ? (Tunnel Vision) Karl Steinbuch It is not sufficient to invent something. You need to recognize, that you have invented something. Systolic Arrays (2) 17 M. J. Foster and H. T. Kung: “The Design of Special- Purpose VLSI Chips... “

© 2015, 18 H. T. Kung*: „of course algebraic!“ My student Rainer Kress replaced it by simulated annealing : this supports also any irregular & wild shape pipe networks What Synthesis Method? The super-systolic array: a generalization of the systolic array: supports only very special applications with strictly regular data dependencies only linear pipes *) KressArray [ASP-DAC-1995]KressArray general purpose now a general purpose methodology ! (linear projection) *) a mathematician

© 2015, TU Kaiserslautern H. T. Kung: “It’s not our job” 19 *) or receives another Tunnel Vision Symptom without a sequencer: missed to invent a new machine paradigm Who generates* the datastreams? (the Xputer)

© 2015, TU Kaiserslautern 20 >> Outline << The Power Wall “Dataflow” Computing Reconfigurable Computing Time to Space Mapping The Xputer Paradigm Conclusions downloadable from

© 2015, TU Kaiserslautern The Xputer machine Paradigm obtained by adding auto-sequencing memory (ASM) GAG : G eneric A ddress G enerstor With data counters instead of a program counter is the TU-KL ‘ s Symbiosis of Time to Space Mapping and Reconfigurable Computing! (TU-KL) ASM data counter GAG RAM ASM Xputer literature 21

© 2006, TU Kaiserslautern Duality of procedural Languages Flowware Languages read next data item goto (data address) jump to (data address) data loop data loop nesting data loop escape data stream branching yes: internally parallel loops 22 Software Languages read next instruction goto (instruction address) jump to (instruction address) instruction loop instruction loop nesting instruction loop escape instruction stream branching no: no internally parallel loops Asymmetry But there is an Asymmetry program counter: data counter(s): more simple: no ALU tasks MoPL state register(s): Xputer literature Xputer pages easy to learn control-procedural data-procedural

© 2015, TU Kaiserslautern 23 Compilation: Software vs. Configware u. Flowware source program software compiler software code Software Engineering configware code mapper configware compiler scheduler flowware code source „ program “ Configware Engineering time to space mapping data C, etc. procedural: time domain) space domain time domain Xputer literature

© 2015, TU Kaiserslautern 24 Heterogeneous: Co-Compilation software compiler software code Software / Configware Co-Compiler configware code mapper configware compiler scheduler flowware code data C, or other high level language automatic SW / CW partitioner Xputer pages important: why? ( Jürgen Becker ( Jürgen Becker ‘s Ph.D. thesis) CoDe-X

© 2015, TU Kaiserslautern 25 >> Outline << The Power Wall “Dataflow” Computing Reconfigurable Computing Time to Space Mapping The Xputer Paradigm Conclusions downloadable from

© 2015, TU Kaiserslautern 26 I llustrating the Paradigm Trap The von Neumann Manycore Approach the watering can model [Hartenstein] ( many crippled watering cans ) many von Neumann bottle- necks The von Neumann single core Approach The Memory Wall ( crippled watering can ) (1)

© 2015, TU Kaiserslautern 27 I llustrating the Paradigm Trap The “Dataflow“ Computer extremely complicated: no watering can model ! (2) (a power efficiency break-thru did not happen here)

© 2015, TU Kaiserslautern 28 Xputer: the only massively power-efficient Paradigm The Xputer Paradigm has no von Neumann bottle- neck the watering can model [Hartenstein ] fully supporting Reconfigurable Computing fully supporting Reconfigurable Computing

© 2015, TU Kaiserslautern We need a Seismic Shift … It’s an extremely massive challenge ! … to avoid future unaffordable ICT power consumption cost 29 For many more years we must work under a heterogeneous triple-paradigm mind set: The software from more than half a century sits squarely on top Configware, Flowware, and still even Software that’s why heterogeneous is important

© 2015, 30 thank you !

© 2015, 31 END downloadable from

© 2015, 32 Backup for discussion: downloadable from

© 2015, TU Kaiserslautern 33 [Tarek El-Ghazawi et al.: IEEE COMPUTER, Febr. 2008] much less equipment needed much less memory and bandwidth needed massively saving energy Reconfigurable Computing (RC): the intensive Impact SGI Altix 4700 with RC 100 RASC compared to Beowulf cluster Tarek El-Ghazawi Speed-ups by von Neumann to RC Migrations Application Speed-up factor Savings divisor PowerCostSize. DNA and Protein sequencing DES breaking

© 2015, TU Kaiserslautern Taxonomy 34 Flynn‘s taxonomy: von Neumann only Diana‘s taxonomy: Reconfigurable Computing Reiner‘s taxonomy: heterogeneous systems Reiner‘s 2 nd taxonomy: Xputers only noI

© 2015, TU Kaiserslautern LUT Table LUT Table First FPGA available 1984 from Xilinx

© 2015, Transformations since the 70ies time domain: procedure domain space domain: structure domain 36 Pipeline k time steps, n DPUs time algorithm space/time algorithmus Strip Mining Transformation Loop Transformations: rich methodology published: [survey: Diss. Karin Schmidt, 1994, Shaker Verlag]Karin SchmidtShaker Verlag n x k time steps, program loop 1 CPU (time to time/space mapping)

© 2015, 37 The Reconfigurable Computing Paradox Enabling software developers Enabling software developers to apply their skills over FPGAs has been a long and, as of yet, unreached research objective in reconfigurable computing. Reinvent Computing although the effective integration density of FPGAs is by 4 orders of magnitude behind the Gordon Moore curve, because of: wiring overhead reconfigurability overhead routing congestion etc.

© 2015, 38 data counter GAG RAM ASM : A uto- S equencing M emory use data counters, no program counter rDPA xx x x x x xx x -- - xx x x x x xx x ASM Data stream generator usage (pipe network) x x x x x x x x x | || x x x x x x x x x | | | | | | | | | | | | | | ASM implemented by distributed on-chip- memory Xputer machine paradigm Xputer pages GAG : Generic Address Generators (reconfigurable: to avoid memory-cycle-hungry address computation) general purpose reconfigurable also more irregular data routes supported

© 2015, TU Kaiserslautern 39 Paradigm Shift Consequences Software Engineering 1 programming source needed algorithm: variable resources: fixed software CPU Program Counter (PC) configware resources: variable 2 programming sources needed flowware algorithm: variable Configware Engineering Data Counters (DCs) sequencing code (e. g. see MoPL language) Configuration Code (CC) von Neumann: Xputer: Xputer literature Heterogenous Engineering Xputer and vN: all 3 programming sources needed

© 2015, 40 no memory wall DPA x x x x x x x x x | || xx x x x x xx x -- - input data streams xx x x x x xx x x x x x x x x x x | | | | | | | | | | | | | | output data streams DPU operation is transport-triggered no instruction streams no message passing nor thru common memory massively avoiding memory cycles Pipelining through DPU Arrays: the TU-KL Xputer principle DPA = DPU array DPU = Data Path Unit

© 2015, TU Kaiserslautern Von Neumann Syndrome 41 Lambert M. Surhone, Mariam T. Tennoe, Susan F. Hennessow (ed.): Von Neumann Syndrome; ßetascript publishing 2011 S h i f t i n g t o t h e d o m i n a n c e o f v o n - N e u m a n n - o n l y c a u s e d a n c u m u l a t e d d a m a g e o f a t l e a s t t r i l l i o n s o f D o l l a r s, i f n o t q u a d r i l l i o n s ….

© 2015, 42 Computing Paradigms term program counter execution triggered by paradigm CPU yes instruction fetch instruction-stream- based (von Neumann) program counter DPU CPU term no data arrival ** data-driven or datastream-based data counters r DPU RPU Xputer Reconfigurable Computing no complicated I - structure handling “ Dataflow ” Computer **) “transport-triggered” *) based on tagged token “ I -Structure” I - structure DPUs DFC DFC *

© 2015, 43 I-Structure Storage Communication Network PE I-Structure Storage PE MIT Tagged Token Dataflow Architecture* *) source: Jurij SilcJurij Silc „instruction is executed even if some of its operands are not yet available“ no Program Counter no updateable global store ? Tagged Token Flow Architecture I would call it Wait-Match Unit & Waiting Token Store Instruction Fetch Unit Token Queue Program Store & Constant Store Form Token Unit ALU & Form Tag to/from the Communication Network RU SU PE :

© 2015, TU Kaiserslautern Solution: I- Structure concept * 44 “can be read but not written” “at least one read request has been deferred” “read request deferrend but write operation is allowed” [ source: Jurij Silc]Jurij Silc very complex data structures ! “each update consumes the structure and the value produces a new data structure” “ awkward or even impossible to implement ” Problems with such “Dataflow” Architectures *) „ I- Structure Flow“ instead of „Dataflow“ ? ? = Tagged Token Structure „Tagged Token Flow“ I would call it

© 2015, I -Structures ( I = incremental) - part 1 Jurij Silc: Dataflow Architectures

© 2015, MIT Tagged-Token Dataflow Architecture (PE) / Jurij Silc: Dataflow Architectures I -structure select I - structure assign I -Structures - part 2 46

© 2015, Power Efficiency of Programming Languages (an example) 47

© 2015, 48 How big is big ? © Hewlett Packard