How to cope with the Power Wall Reiner Hartenstein TU Kaiserslautern DRAFT PATMOS 2015, the 25 th International Workshop on Power and Timing Modeling,

Slides:

Advertisements

Similar presentations

Enhanced matrix multiplication algorithm for FPGA Tamás Herendi, S. Roland Major UDT2012.

Advertisements

DSPs Vs General Purpose Microprocessors

The von Neumann Syndrome Reiner Hartenstein TU Kaiserslautern TU Delft, Sept 28, (v.2)

Reconfigurable Supercomputing means to brave the paradigm chasm Reiner Hartenstein HiPEAC Workshop on Reconfigurable Computing Ghent, Belgium January 28,

Stonewalled Progress of Computing Efficiency 1 Reiner Hartenstein (keynote) SA - Sep 1 16: :50

Tuan Tran. What is CISC? CISC stands for Complex Instruction Set Computer. CISC are chips that are easy to program and which make efficient use of memory.

CSCI 4717/5717 Computer Architecture

Reconfigurable Supercomputing: Hindernisse und Chancen Reiner Hartenstein TU Kaiserslautern Universität Mannheim, 13. Dez

From Organic Computing to Reconfigurable Computing Reiner Hartenstein TU Kaiserslautern PASA, Frankfurt, March 16, 2006.

(keynote) (from HPC to) New Horizons of Very High Performance Computing (VHPC): Hurdles and Chances Reiner Hartenstein TU Kaiserslautern Rhodes Island,

Extensible Processors. 2 ASIP Gain performance by:  Specialized hardware for the whole application (ASIC). −  Almost no flexibility. −High cost.  Use.

Processor Technology and Architecture

Data Partitioning for Reconfigurable Architectures with Distributed Block RAM Wenrui Gong Gang Wang Ryan Kastner Department of Electrical and Computer.

Chapter 4 Processor Technology and Architecture. Chapter goals Describe CPU instruction and execution cycles Explain how primitive CPU instructions are.

(Page 554 – 564) Ping Perez CS 147 Summer 2001 Alternative Parallel Architectures  Dataflow  Systolic arrays  Neural networks.

Big Kernel: High Performance CPU-GPU Communication Pipelining for Big Data style Applications Sajitha Naduvil-Vadukootu CSC 8530 (Parallel Algorithms)

SSS 4/9/99CMU Reconfigurable Computing1 The CMU Reconfigurable Computing Project April 9, 1999 Mihai Budiu

CS curricula update proposed: by adding Reconfigurable Computing Reiner Hartenstein TU Kaiserslautern EAB meeting, Philadelphia,1 Nov 2005.

Group 5 Alain J. Percial Paula A. Ortiz Francis X. Ruiz.

Presenter MaxAcademy Lecture Series – V1.0, September 2011 Introduction and Motivation.

Course Outline DayContents Day 1 Introduction Motivation, definitions, properties of embedded systems, outline of the current course How to specify embedded.

Networking Virtualization Using FPGAs Russell Tessier, Deepak Unnikrishnan, Dong Yin, and Lixin Gao Reconfigurable Computing Group Department of Electrical.

The Transdisciplinary Responsibility of CS Curricula Reiner Hartenstein TU Kaiserslautern San Diego, CA, USA, June , 2006 THE NINTH WORLD CONFERENCE.

Instruction Sets and Pipelining Cover basics of instruction set types and fundamental ideas of pipelining Later in the course we will go into more depth.

How to cope with the Power Wall Reiner Hartenstein TU Kaiserslautern PATMOS 2015, the 25 th International Workshop on Power and Timing Modeling, Optimization.

How to cope with the Power Wall Reiner Hartenstein TU Kaiserslautern IEEE fellow FPL fellow SDPS fellow PATMOS 2015, the 25 th International.

High Performance Architectures Dataflow Part 3. 2 Dataflow Processors Recall from Basic Processor Pipelining: Hazards limit performance  Structural hazards.

Invitation to Computer Science 5th Edition

A Reconfigurable Processor Architecture and Software Development Environment for Embedded Systems Andrea Cappelli F. Campi, R.Guerrieri, A.Lodi, M.Toma,

Making FPGAs a Cost-Effective Computing Architecture Tom VanCourt Yongfeng Gu Martin Herbordt Boston University BOSTON UNIVERSITY.

Speed-ups obtained by Reconfigurable Computing Reiner Hartenstein CAPES/DFG Cooperation on Reconfigurable Computing, inv. talk, Sept 19, 2008, Dept of.

COMPUTER SCIENCE &ENGINEERING Compiled code acceleration on FPGAs W. Najjar, B.Buyukkurt, Z.Guo, J. Villareal, J. Cortes, A. Mitra Computer Science & Engineering.

Automated Design of Custom Architecture Tulika Mitra

Chapter 2 Parallel Architecture. Moore’s Law The number of transistors on a chip doubles every years. – Has been valid for over 40 years – Can’t.

Programming Concepts in GPU Computing Dušan Gajić, University of Niš Programming Concepts in GPU Computing Dušan B. Gajić CIITLab, Dept. of Computer Science.

Array Synthesis in SystemC Hardware Compilation Authors: J. Ditmar and S. McKeever Oxford University Computing Laboratory, UK Conference: Field Programmable.

CS1104 – Computer Organization PART 2: Computer Architecture Lecture 12 Overview and Concluding Remarks.

Computer Organization and Design Computer Abstractions and Technology

Chapter 8 CPU and Memory: Design, Implementation, and Enhancement The Architecture of Computer Hardware and Systems Software: An Information Technology.

Lecture 16: Reconfigurable Computing Applications November 3, 2004 ECE 697F Reconfigurable Computing Lecture 16 Reconfigurable Computing Applications.

M U N -March 10, Phil Bording1 Computer Engineering of Wave Machines for Seismic Modeling and Seismic Migration R. Phillip Bording March 10, 2005.

VLSI-SoC 2001 IFIP - LIRMM Stream-based Arrays: Converging Design Flows for both, Reiner Hartenstein University of Kaiserslautern December 2- 4, 2001,

1 - CPRE 583 (Reconfigurable Computing): Reconfigurable Computing Architectures Iowa State University (Ames) Reconfigurable Architectures Forces that drive.

System-level power analysis and estimation September 20, 2006 Chong-Min Kyung.

M U N - February 17, Phil Bording1 Computer Engineering of Wave Machines for Seismic Modeling and Seismic Migration R. Phillip Bording February.

Reconfigurable HPC Notes on datastream-based FFT

The von Neumann Syndrome calls for a Revolution Reiner Hartenstein TU Kaiserslautern Reno, NV, November 11, HPRCTA'07 - First.

Reduction of Register File Power Consumption Approach: Value Lifetime Characteristics - Pradnyesh Gudadhe.

Dept. of Computer Science - CS6461 Computer Architecture CS6461 – Computer Architecture Fall 2015 Lecture 1 – Introduction Adopted from Professor Stephen.

1 November 11, 2015 A Massively Parallel, Hybrid Dataflow/von Neumann Architecture Yoav Etsion November 11, 2015.

Mapping of Regular Nested Loop Programs to Coarse-grained Reconfigurable Arrays – Constraints and Methodology Presented by: Luis Ortiz Department of Computer.

1 - CPRE 583 (Reconfigurable Computing): Reconfigurable Computing Architectures Iowa State University (Ames) CPRE 583 Reconfigurable Computing Lecture.

Enabling Technologies for Reconfigurable Computing Enabling Technologies for Reconfigurable Computing and Software / Configware Co-Design Part 2: Data-Stream-based.

Autumn 2006CSE P548 - Dataflow Machines1 Von Neumann Execution Model Fetch: send PC to memory transfer instruction from memory to CPU increment PC Decode.

William Stallings Computer Organization and Architecture 6th Edition

Computer Organization and Architecture Lecture 1 : Introduction

Topics to be covered Instruction Execution Characteristics

Ph.D. in Computer Science

Embedded Systems Design

Assembly Language for Intel-Based Computers, 5th Edition

Architecture & Organization 1

Architecture & Organization 1

Embedded Architectures: Configurable, Re-configurable, or what?

The performance requirements for DSP applications continue to grow and the traditional solutions do not adequately address this new challenge Paradigm.

HIGH LEVEL SYNTHESIS.

How to cope with the Power Wall

ECE 352 Digital System Fundamentals

Husky Energy Chair in Oil and Gas Research

Prof. Onur Mutlu Carnegie Mellon University

Presentation transcript:

How to cope with the Power Wall Reiner Hartenstein TU Kaiserslautern DRAFT PATMOS 2015, the 25 th International Workshop on Power and Timing Modeling, Optimization and Simulation; Salvador, Bahia, Brazil, Sept 1-5, 2015

© 2015, TU Kaiserslautern 2 >> Outline << The Power Wall “Dataflow” Computing Reconfigurable Computing Time to Space Mapping The Xputer Paradigm Conclusions

© 2015, TU Kaiserslautern 3 years elder than ISPLED. Oldest conference series on power efficiency issues spin-off from the EU’s PATMOS project Power efficiency is going to become an industry-wide issue Some incremental improvements are on track, at all abstraction levels - not only in hardware design and software development however, there is still a lot to do

© 2015, Power Efficiency of Programming Languages (an example)

© 2015, 5 source: iStockphoto Programming Language Popularity not yet determined by power efficiency (IEEE)

© 2015, TU Kaiserslautern 6 6 © New York Times Data Center at Dallas Columbia river G. Fettweis, E. Zimmermann: ICT Energy Consumption - Trends and Challenges; WPMC'08, Lapland, Finland, 8 – 11 Sep 2008 Power consumption by internet: x30 til 2030 if trends continue Already the internet’s 2007 carbon footprint was higher than that of worldwide air traffic It‘s more than the entire world‘s total power consumption to-day !!! IDC predicts: the total number of data centers around the world will peak at 8.6 million in ICT infrastructures

© 2015, TU Kaiserslautern Google‘s Electricity Bill „The possibility of computer equipment power consumption spiraling out of control could have serious consequences for the overall affordability of computing” [L. A. Barrosso, Google] Google patent for water-based data centers: the ocean provides cooling & power (motion of surface) Google asked the Federal Energy Regulatory Commission for the authority to sell electricity, Cost of a data center is determined solely by the monthly power bill, not by the cost of hardware or maintenance (for example)

© 2015, TU Kaiserslautern Massive Critique of the von Neumann Model John Backus 1978: Can programming be liberated from the von Neumann style? 8 “von Neumann Syndrome” C.V. Ramamoorthy Nathan’s Law: Software is a gas. It expands to fill all its containers... Nathan Myhrvold Dave Patterson bandwidth gap grows 50% / year has reached >1000x (long ago) Patterson’s Law: instruction streams are very slow and extremely memory-cycle-hungry Software dimension: MLoC (Million Lines of Code): SAP NetWeaver Mac OS X Debian von Neumann: an extremely power-inefficient paradigm Von Neumann Syndrome

© 2015, TU Kaiserslautern 9 >> Outline << The Power Wall “Dataflow” Computing Reconfigurable Computing Time to Space Mapping The Xputer Paradigm Conclusions

© 2015, TU Kaiserslautern Terminology Problems Searching a more power-efficient machine paradigm: Stressing differences to „ Control-Flow “ Computers an area called „Dataflow“ Computers was started mid‘ 70ies at MIT Xputer area people are forced to sidestep by using terms like „data-driven“ or „data streams“… although the „Dataflow“ scene is I-Structure-centered.

© 2015, 11 I-Structure Storage Communication Network PE I-Structure Storage PE MIT Tagged Token Dataflow Architecture* Wait-Match Unit & Waiting Token Store Instruction Fetch Unit Token Queue Program Store & Constant Store Form Token Unit ALU & Form Tag to/from the Communication Network RU SU PE : *) source: Jurij SilcJurij Silc „instruction is executed even if some of its operands are not yet available“ no PC no updateable global store ?

© 2015, TU Kaiserslautern Solution: I- Structure concept * 12 can be read but not written at least one read request has been deferred read request deferrend but write operation is allowed [ source: Jurij Silc]Jurij Silc very complex data structures ! “each update consumes the structure and the value produces a new data structure” “ awkward or even impossible to implement ” Problems with “Dataflow” Architectures *) „ I- Structure Flow“ instead of „Dataflow“ „instruction is executed even if some of its operands are not yet available“ ? ?

© 2015, TU Kaiserslautern A Second Opinion D. D. Gajski, D. A. Padua, D. J. Kuck, R. H. Kuhn: A Second Opinion on Data Flow Machines and Languages; IEEE COMPUTER, February 1982 the subtitle: "... data flow techniques attract a great deal of attention. Other alternatives, however, offer more hope for the future." ( Still active … 5th workshop on Data-Flow Execution Models for Extreme Scale Computing (DFM 2015), Oct 18-21, 2015, San Francisco, CA, USA ) However, the power efficiency break-thru did not happen here

© 2015, 14 Computing Paradigms term program counter execution triggered by paradigm CPU yes instruction fetch instruction-stream- based (von Neumann) program counter DPU CPU term no data arrival ** data-driven or datastream-based data counters r DPU RPU Xputer Reconfigurable Computing no complicated I - structure handling “ Dataflow ” Computer **) “transport-triggered” *) based on data dependency graph I - structure DPUs DFC DFC *

© 2015, TU Kaiserslautern 15 >> Outline << The Power Wall “Dataflow” Computing Reconfigurable Computing Time to Space Mapping The Xputer Paradigm Conclusions

© 2015, TU Kaiserslautern LUT Table LUT Table First FPGA available 1984 from Xilinx

© 2015, TU Kaiserslautern 17 [Tarek El-Ghazawi et al.: IEEE COMPUTER, Febr. 2008] much less equipment needed much less memory and bandwidth needed massively saving energy Reconfigurable Computing (RC): the intensive Impact SGI Altix 4700 with RC 100 RASC compared to Beowulf cluster Tarek El-Ghazawi Speed-ups by von Neumann to RC Migrations Application Speed-up factor Savings PowerCostSize DNA and Protein sequencing DES breaking

© 2015, TU Kaiserslautern 18 FFT 100 Reed-Solomon Decoding 2400 Viterbi Decoding MAC DSP and wireless Smith-Waterman pattern matching 288 molecular dynamics simulation 88 BLAST 52 protein identification 40 Bioinformatics GRAPE 20 Astrophysics SPIHT wavelet-based image compression 457 real-time face detection 6000 video-rate stereo vision 900 pattern recognition 730 Image processing, Pattern matching, Multimedia 3000 CT imaging Crypto DES breaking 10 6 Speedup-Factor ~ DNA & protein sequencing x2/yr *) TU Kaiserslautern Pre-FPGA solutions Xputer no FPGA: DPLA on MoM by TU-KL* Grid-based DRC* („fair comparizon“) 1984: 1 DPLA replaces 256 FPGAs 10 5 Speedup- Factor Speed-ups by vN Software to FPGA Migrations 10 3 ~10% of speed-u p Power saving : fabricated by E.I.S. Multi University Project Chip

© 2015, TU Kaiserslautern Design Rule Check Accelerator 19 Processing 4-by-4 Reference Patterns DPLA: fabricated by the E.I.S. Multi University Project: *) PISA DRC accelerator [ICCAD 1984] 1984: 1 DPLA replaces 256 FPGAs DPLA was more area-efficient than FPGA - by several orders of magnitude

© 2015, 20 although the effective integration density of FPGAs is by 4 orders of magnitude behind the Gordon Moore curve, because of: wiring overhead reconfigurability overhead routing congestion etc. The Reconfigurable Computing Paradox Enabling software developers to apply their skills over FPGAs has been a long and, as of yet, unreached research objective in reconfigurable computing. Reinvent Computing

© 2015, 21 Obstacles to widespread FPGA adoption go well beyond the required skill set - Workshop at FPL_2015

© 2015, TU Kaiserslautern 22 >> Outline << The Power Wall “Dataflow” Computing Reconfigurable Computing Time to Space Mapping The Xputer Paradigm Conclusions

© 2006, TU Kaiserslautern Dual paradigm mind set: an old hat – but was ignored 23 time to space mapping: procedural to structural: 1967: W. A. Clark: Macromodular Computer Systems; 1967 SJCC, AFIPS Conf. Proc. C. G. Bell et al: The Description and Use of Register- Transfer Modules (RTM's); IEEE Trans-C21/5, May 1972 FF token bit evoke FF „This is so simple …… 1971 loop to pipe mapping T u n n e l V i s i o n ! why did it take 25 years to find out? PDP-16 RTMs:

© 2015, Transformations since the 70ies time domain: procedure domain space domain: structure domain 24 Pipeline k time steps, n DPUs time algorithm space/time algorithmus Strip Mining Transformation Loop Transformations: rich methodology published: [survey: Diss. Karin Schmidt, 1994, Shaker Verlag]Karin SchmidtShaker Verlag n x k time steps, program loop 1 CPU (time to time/space mapping)

© 2015, 25 x x x x x x x x x | || xx x x x x xx x -- - input data stream xx x x x x xx x x x x x x x x x x | | | | | | | | | | | | | | output data streams „ data streams “ time port # time port # time port # define:... which data item at which time at which port The Systolic Arrays (1) no instruction streams needed 1980 DPA (pipe network) DataPath Array (array of DPUs) execution transport- triggered M. J. Foster, H. T. Kung: The Design of Special-Purpose VLSI Chips... IEEE 7th ISCA, La Baule, France, May 6-8, 1980 H. T. Kung

© 2015, TU Kaiserslautern why not general purpose ? 26 just only Special purpose VLSI chips ? M. J. Foster and H. T. Kung: “The Design of Special- Purpose VLSI Chips... “ (Tunnel Vision) Karl Steinbuch It is not sufficient to invent something. You need to recognize, that you have invented something. Systolic Arrays (2)

© 2015, 27 H. T. Kung: „of course algebraic!“ ( linear projection) My student Rainer Kress replaced it by simulated annealing : supports also any irregular & wild shape pipe networks What Synthesis Method? The super-systolic array: a generalization of the systolic array: supports only special applications with strictly regular data dependencies only linear pipes *) KressArray [ASP-DAC-1995] now a general purpose methodology !

© 2015, TU Kaiserslautern H. T. Kung: “It’s not our job” 28 *) or receives another Tunnel Vision Symptom without a sequencer: missed to invent a new machine paradigm Who generates* the datastreams? (the Xputer)

© 2015, TU Kaiserslautern The Xputer machine Paradigm obtained by adding auto-sequencing memory ASM : A uto- S equencing M emory GAG : G eneric A ddress G enerstor ASM : A uto- S equencing M emory GAG : G eneric A ddress G enerstor with multiple data counters instead of a program counter is the TU-KL ‘ s Symbiosis of Time to Space Mapping and Reconfigurable Computing. (TU-KL) ASM data counter GAG RAM ASM Xputer literature

© 2015, TU Kaiserslautern 30 >> Outline << The Power Wall “Dataflow” Computing Reconfigurable Computing Time to Space Mapping The Xputer Paradigm Conclusions

© 2015, What means Configware? Software Compiler Software Code 31 (instruction-procedural) Software Source space domain time domain Software to Configware Migration traditional Computing Configware Code (structural: space domain) mapper Placement & Routing Configware Source Reconfigurable Computing ( time domain) Xputer pages

© 2015, TU Kaiserslautern 32 Compilation: Software vs. Configware u. Flowware source program software compiler software code Software Engineering configware code mapper configware compiler scheduler flowware code source „ program “ Configware Engineering placement & routing data C, FORTRAN MATHLAB procedural: time domain) space domain time domain Xputer literature

© 2015, TU Kaiserslautern 33 Heterogeneous: Co-Compilation software compiler software code Software / Configware Co-Compiler configware code mapper configware compiler scheduler flowware code data C, FORTRAN, MATHLAB automatic SW / CW partitioner simulated annealing Xputer pages

© 2015, TU Kaiserslautern 34 Paradigm Shift Consequences Software Engineering 1 programming source needed algorithm: variable resources: fixed software CPU Program Counter (PC) configware resources: variable 2 programming sources needed flowware algorithm: variable Configware Engineering Data Counters (DCs) sequencing code (e. g. see MoPL language) Configuration Code (CC) von Neumann: Xputer: Xputer literature

© 2015, 35 data counter GAG RAM ASM : A uto- S equencing M emory use data counters, no program counter rDPA xx x x x x xx x -- - xx x x x x xx x ASM Data stream generators (pipe network) x x x x x x x x x | || x x x x x x x x x | | | | | | | | | | | | | | ASM implemented by distributed on-chip- memory 100 & more on-chip ASM are feasible 100 & more on-chip ASM are feasible reconfigurable 32 ports, or n x 32 ports other example Xputer machine paradigm Xputer pages

© 2006, TU Kaiserslautern Duality of procedural Languages Flowware Languages read next data item goto (data address) jump to (data address) data loop data loop nesting data loop escape data stream branching yes: internally parallel loops 36 Software Languages read next instruction goto (instruction address) jump to (instruction address) instruction loop instruction loop nesting instruction loop escape instruction stream branching no: no internally parallel loops Asymmetry But there is an Asymmetry program counter: data counter(s): more simple: no ALU tasks MoPL state register(s): Xputer literature Xputer pages

© 2015, TU Kaiserslautern 37 CoDe-X Dual Paradigm Application Development CPU SW compiler CW compiler C language source Partitioner rDPU Juergen Becker’s PhD thesis 1996 Road map to the Personal Supercomputer instruction- stream- based software codeconfigware code datastream- based accelerator reconfigurable accelerator hardwired software/configware co-compiler high level language CPU

© 2015, TU Kaiserslautern 38 >> Outline << The Power Wall “Dataflow” Computing Reconfigurable Computing Time to Space Mapping The Xputer Paradigm Conclusions

© 2015, TU Kaiserslautern 39 I llustrating the Paradigm Trap The von Neumann Manycore Approach the watering can model [Hartenstein] ( many crippled watering cans ) many von Neumann bottle- necks The von Neumann single core Approach The Memory Wall ( crippled watering can ) (1)

© 2015, TU Kaiserslautern 40 I llustrating the Paradigm Trap The “Dataflow“ Computer extremely complicated: no watering can model ! (2) (a power efficiency break-thru did not yet happen here)

© 2015, TU Kaiserslautern 41 Xputer: the only massively power-efficient Paradigm The Xputer Paradigm has no von Neumann bottle- neck the watering can model [Hartenstein ]

© 2015, 42 thank you !

© 2015, 43 END

© 2015, 44 Backup for discussion:

© 2015, 45 no memory wall DPA x x x x x x x x x | || xx x x x x xx x -- - input data streams xx x x x x xx x x x x x x x x x x | | | | | | | | | | | | | | output data streams DPU operation is transport-triggered no instruction streams no message passing nor thru common memory massively avoiding memory cycles Pipelining through DPU Arrays: the TU-KL Xputer principle DPA = DPU array DPU = Data Path Unit

© 2015, I -Structures ( I = incremental) - part 1 Jurij Silc: Dataflow Architectures

© 2015, MIT Tagged-Token Dataflow Architecture (PE) / Jurij Silc: Dataflow Architectures I -structure select I - structure assign I -Structures - part 2

© 2015, TU Kaiserslautern Taxonomy 48 Flynn‘s taxonomy: von Neumann only Diana‘s taxonomy: Reconfigurable Computing Reiner‘s taxonomy: heterogeneous systems Reiner‘s 2 nd taxonomy: Xputers only noI

© 2015, TU Kaiserslautern Von Neumann Syndrome 49 Lambert M. Surhone, Mariam T. Tennoe, Susan F. Hennessow (ed.): Von Neumann Syndrome; ßetascript publishing 2011 S h i f t i n g t o t h e d o m i n a n c e o f v o n - N e u m a n n - o n l y c a u s e d a n c u m u l a t e d d a m a g e o f a t l e a s t t r i l l i o n s o f D o l l a r s, i f n o t q u a d r i l l i o n s ….