Reconfigurable Supercomputing: Hindernisse und Chancen Reiner Hartenstein TU Kaiserslautern Universität Mannheim, 13. Dez. 2005.

Slides:



Advertisements
Similar presentations
CASES 2002 Intl Conference on Compilers, Architectures and Synthesis for Embedded Systems Embedded Architectures: Configurable, Re-configurable, or what?
Advertisements

FPGA (Field Programmable Gate Array)
TIE Extensions for Cryptographic Acceleration Charles-Henri Gros Alan Keefer Ankur Singla.
The von Neumann Syndrome Reiner Hartenstein TU Kaiserslautern TU Delft, Sept 28, (v.2)
Reconfigurable Supercomputing means to brave the paradigm chasm Reiner Hartenstein HiPEAC Workshop on Reconfigurable Computing Ghent, Belgium January 28,
EEE226 MICROPROCESSORBY DR. ZAINI ABDUL HALIM School of Electrical & Electronic Engineering USM.
Vector Processing. Vector Processors Combine vector operands (inputs) element by element to produce an output vector. Typical array-oriented operations.
An Introduction to Reconfigurable Computing Mitch Sukalski and Craig Ulmer Dean R&D Seminar 11 December 2003.
Lecture 9: Coarse Grained FPGA Architecture October 6, 2004 ECE 697F Reconfigurable Computing Lecture 9 Coarse Grained FPGA Architecture.
Hardwired networks on chip for FPGAs and their applications
Course-Grained Reconfigurable Devices. 2 Dataflow Machines General Structure:  ALU-computing elements,  Programmable interconnections,  I/O components.
ISICT 2005 Supercomputing going Reconfigurable Reiner Hartenstein TU Kaiserslautern Jan. 4-6, 2005, Capetown, South Africa.
MSE 2005 Reconfigurable Computing (RC) being Mainstream: Torpedoed by Education Reiner Hartenstein TU Kaiserslautern International Conference on Microelectronic.
© 2006, Reconfigurable Computing Reiner Hartenstein Computing Meeting EU, ESU, Brussells, May 18, 2006.
IPDPS 2004 Software or Configware? About the Digital Divide of Parallel Computing Reiner Hartenstein TU Kaiserslautern Santa Fe, NM, April , 2004.
From Organic Computing to Reconfigurable Computing Reiner Hartenstein TU Kaiserslautern PASA, Frankfurt, March 16, 2006.
Reconfigurable HPC Reconfigurable HPC part 1 Introduction Reiner Hartenstein TU Kaiserslautern May 14, 2004, TU Tallinn, Estonia.
Reconfigurable Supercomputing: Hurdles and Chances Reiner Hartenstein TU Kaiserslautern Dresden, Gemany, June , 2006 International Supercomputer.
Extensible Processors. 2 ASIP Gain performance by:  Specialized hardware for the whole application (ASIC). −  Almost no flexibility. −High cost.  Use.
Some Thoughts on Technology and Strategies for Petaflops.
Introduction to Reconfigurable Computing CS61c sp06 Lecture (5/5/06) Hayden So.
Seminar at Kyushu University Reconfigurable Technologies (1) Reiner Hartenstein TU Kaiserslautern July 23, 2004, Fukuoka, Japan.
Configurable System-on-Chip: Xilinx EDK
Programmable logic and FPGA
SSS 4/9/99CMU Reconfigurable Computing1 The CMU Reconfigurable Computing Project April 9, 1999 Mihai Budiu
CS curricula update proposed: by adding Reconfigurable Computing Reiner Hartenstein TU Kaiserslautern EAB meeting, Philadelphia,1 Nov 2005.
Chapter 6 Memory and Programmable Logic Devices
Dynamic Hardware Software Partitioning A First Approach Komal Kasat Nalini Kumar Gaurav Chitroda.
Using Programmable Logic to Accelerate DSP Functions 1 Using Programmable Logic to Accelerate DSP Functions “An Overview“ Greg Goslin Digital Signal Processing.
EKT303/4 PRINCIPLES OF PRINCIPLES OF COMPUTER ARCHITECTURE (PoCA)
C.S. Choy95 COMPUTER ORGANIZATION Logic Design Skill to design digital components JAVA Language Skill to program a computer Computer Organization Skill.
The Transdisciplinary Responsibility of CS Curricula Reiner Hartenstein TU Kaiserslautern San Diego, CA, USA, June , 2006 THE NINTH WORLD CONFERENCE.
February 12, 1998 Aman Sareen DPGA-Coupled Microprocessors Commodity IC’s for the Early 21st Century by Aman Sareen School of Electrical Engineering and.
Lecture 2: Field Programmable Gate Arrays September 13, 2004 ECE 697F Reconfigurable Computing Lecture 2 Field Programmable Gate Arrays.
A Reconfigurable Processor Architecture and Software Development Environment for Embedded Systems Andrea Cappelli F. Campi, R.Guerrieri, A.Lodi, M.Toma,
DLS Digital Controller Tony Dobbing Head of Power Supplies Group.
COMPUTER SCIENCE &ENGINEERING Compiled code acceleration on FPGAs W. Najjar, B.Buyukkurt, Z.Guo, J. Villareal, J. Cortes, A. Mitra Computer Science & Engineering.
High-Level Interconnect Architectures for FPGAs An investigation into network-based interconnect systems for existing and future FPGA architectures Nick.
J. Christiansen, CERN - EP/MIC
FPGA-Based System Design: Chapter 3 Copyright  2004 Prentice Hall PTR Topics n FPGA fabric architecture concepts.
Reminder Lab 0 Xilinx ISE tutorial Research Send me an if interested Looking for those interested in RC with skills in compilers/languages/synthesis,
Computer Organization and Design Computer Abstractions and Technology
Field Programmable Gate Arrays (FPGAs) An Enabling Technology.
VLSI-SoC 2001 IFIP - LIRMM Stream-based Arrays: Converging Design Flows for both, Reiner Hartenstein University of Kaiserslautern December 2- 4, 2001,
© 2004, D. J. Foreman 1 Computer Organization. © 2004, D. J. Foreman 2 Basic Architecture Review  Von Neumann ■ Distinct single-ALU & single-Control.
EE3A1 Computer Hardware and Digital Design
1 - CPRE 583 (Reconfigurable Computing): Reconfigurable Computing Architectures Iowa State University (Ames) Reconfigurable Architectures Forces that drive.
© 2004, D. J. Foreman 1 Computer Organization. © 2004, D. J. Foreman 2 Basic Architecture Review  Von Neumann ■ Distinct single-ALU & single-Control.
EKT303/4 PRINCIPLES OF PRINCIPLES OF COMPUTER ARCHITECTURE (PoCA)
Lecture 12: Reconfigurable Systems II October 20, 2004 ECE 697F Reconfigurable Computing Lecture 12 Reconfigurable Systems II: Exploring Programmable Systems.
COARSE GRAINED RECONFIGURABLE ARCHITECTURES 04/18/2014 Aditi Sharma Dhiraj Chaudhary Pruthvi Gowda Rachana Raj Sunku DAY
Development of Programmable Architecture for Base-Band Processing S. Leung, A. Postula, Univ. of Queensland, Australia A. Hemani, Royal Institute of Tech.,
Reconfigurable HPC Notes on datastream-based FFT
1 chapter 1 Computer Architecture and Design ECE4480/5480 Computer Architecture and Design Department of Electrical and Computer Engineering University.
The von Neumann Syndrome calls for a Revolution Reiner Hartenstein TU Kaiserslautern Reno, NV, November 11, HPRCTA'07 - First.
1 Advanced Digital Design Reconfigurable Logic by A. Steininger and M. Delvai Vienna University of Technology.
1 - CPRE 583 (Reconfigurable Computing): Reconfigurable Computing Architectures Iowa State University (Ames) CPRE 583 Reconfigurable Computing Lecture.
FPGA-Based System Design: Chapter 3 Copyright  2004 Prentice Hall PTR Topics n FPGA fabric architecture concepts.
IPDPS 2004 Software or Configware? About the Digital Divide of Parallel Computing Reiner Hartenstein TU Kaiserslautern Santa Fe, NM, April , 2004.
Reconfigurable Computing1 Reconfigurable Computing Part II.
Introduction to Computers - Hardware
Computer Organization and Architecture Lecture 1 : Introduction
ECE354 Embedded Systems Introduction C Andras Moritz.
Embedded Systems Design
Architecture Background
Lecture 41: Introduction to Reconfigurable Computing
Dynamically Reconfigurable Architectures: An Overview
Embedded Architectures: Configurable, Re-configurable, or what?
Programmable Logic- How do they do that?
Programmable logic and FPGA
Presentation transcript:

Reconfigurable Supercomputing: Hindernisse und Chancen Reiner Hartenstein TU Kaiserslautern Universität Mannheim, 13. Dez. 2005

© 2005, TU Kaiserslautern 2 The Digital Divide The White House, Sept 2000: Bill Clinton condemns the Digital Divide in America: World Economic Forum 2002: The Global Digital Divide, disparity between the "haves" and "have nots“ The Digital Divide of Parallel Computing: Access to Configware (CW) Solutions access to the internet

© 2005, TU Kaiserslautern 3 The new paradigm Reconfigurable Computing: the new machine paradigm no instruction fetch at run time it’s not instruction-stream-based paradigm:von Neumann reconfigurable computing programming modeproceduralstructural programming sourcesoftwareconfigware the 2nd RAM-based computing paradigm

© 2005, TU Kaiserslautern 4 Why use Reconfigurable Computing Exploit spatial parallelism, and.. … high bandwidth and low latency memory access Ride the technology curve avoiding specific silicon Adapt to change: standards, trends, ….. Reduce risk Adapt to application / deployment requirements instead of spec. hardware? instead of software? … and fine-grained parallelism when useful

© 2005, TU Kaiserslautern 5 >> Outline << Reconfigurable Devices Coarse-grained Reconfigurable Devices Data-stream-based Computing The contemporary Common Model Reconfigurable Supercomputing Conclusions

© 2005, TU Kaiserslautern 6 Reconfigurable Computing (RC) and FPGA in the media ##### Design Starts until 2010: from 80,000 to 110,000 [Dataquest] June 2005 fastest growing segment of the semiconductor market: 4 billion US-$ [Dataquest]

© 2005, TU Kaiserslautern 7 going into every application area mainstream established in embedded systems recently observed by supercomputing scenes PLD 2,180,000 CPLD 832,000

© 2005, TU Kaiserslautern 8 Reconfigurable Devices Confusing Terminology  fine-grained reconfigurable devices:  FPGA (Xilinx)  PLD, FPLD (Altera)  CPLD (Philips)  rGA (neutral)  coarse-grained reconfigurable devices:  rDPA (neutral)  FPFA, rFA, XPA, other acronyms.... reconfigurable Gate Array soft hardware Morphware ® reconfigurable DataPath Array

© 2005, TU Kaiserslautern 9 rGA with island architecture reconfigurable logic box switch box connect box reconfigurable interconnect fabrics

© 2005, TU Kaiserslautern 10 A B example: wire routed for 1 net

© 2005, TU Kaiserslautern 11 Modes of Operation configware code loaded from external flash memory, e. g. after power-on (~millisecond) Execution phase E ph time C ph off Configuration phase C phE ph

© 2005, TU Kaiserslautern 12 Dynamic Reconfiguration Swapping and scheduling managed by configware operating system time E phC phE ph module X module Y configware module Y rGA module no. X configures Y

© 2005, TU Kaiserslautern 13 FPGA area Efficiency G. Moore Transistors/chip FPGA logical FPGA routed FPGA physical Sources: Proc ISSCC, ICSPAT, DAC, DSPWorld > wiring overhead reconfigurability overhead routing congestion CPU area standard microprocessor

© 2005, TU Kaiserslautern MHz Flexible Soft Logic Architecture 200KLogic Cells 500MHz Programmable DSP Execution Units Gbps Serial Transceivers 500MHz PowerPC™ Processors (680DMIPS) with Auxiliary Processor Unit 1Gbps Differential I/O 500MHz multi-port Distributed 10 Mb SRAM 500MHz DCM Digital Clock Management DSP platform FPGA [courtesy Xilinx Corp.]

© 2005, TU Kaiserslautern 15 >> Outline << Reconfigurable Devices Coarse-grained Reconfigurable Devices Data-stream-based Computing The contemporary Common Model Reconfigurable Supercomputing Conclusions

© 2005, TU Kaiserslautern 16 Why coarse grain much more MOPS/milliWatt reconfigurable Data Path Unit (e. g. rALU) mind set close to classical computing background instead of rLB (~1 bit wide) use rDPU (e. g. 32 bits wide) instead of FPGA use rDPA rDPU Reconfigurable Computing (RC) much more area-efficient much less reconfigurability overhead

© 2005, TU Kaiserslautern 17 Area Efficiency of coarse-grained Platforms G. Moore Transistors/chip << 10 Sources: Proc ISSCC, ICSPAT, DAC, DSPWorld physical logical standard microprocessor Coarse-grained (e. g. 32 bits wide)

© 2005, TU Kaiserslautern 18 array size: 10 x 16 = 160 rDPUs Coarse grain is about computing, not logic rout thru only not used backbus connect SNN filter on KressArray (mainly a pipe network) [Ulrich Nageldinger] Example: mapping onto rDPA by DPSS: based on simulated annealing reconfigurable function block, e. g. 32 bits wide no CPU

© 2005, TU Kaiserslautern 19 commercial rDPA example: PACT XPP - XPU128 XPP128 rDPA Evaluation Board available, and XDS Development Tool with Simulator buses not shown rDPU Full 32 or 24 Bit Design working silicon 2 Configuration Hierarchies © PACT AG, (r) DPA

© 2005, TU Kaiserslautern 20 speed-up examples from 2004 & earlier platformapplication examplespeed-up factormethod PACT Xtreme 4-by-4 array [2003] 16 tap FIR filterx16 MOPS/mW straight forward MoM anti machine with DPLA* [1983] grid-based DRC** 1-metal 1-poly nMOS *** 256 reference patterns > x1000 (computation time) multiple aspects *) DPLA: MPC fabr. via E.I.S. multi univ. project key issue: algorithmic cleverness **) Design Rule Check CPU 2 FPGA [FPGA 2004] migrate several simple application exampes x7 – x46 (compute time) hi level synthesis ***) for 10-metal 3-poly cMOS expected: >> x10,000 DSP 2 FPGA [Xilinx ] from fastest DSP: 10 gMACs to 1 teraMAC X 100 (compute time) not spec. 2) Wim Roelandts

© 2005, TU Kaiserslautern 21 rDPA (coarse grain) vs. FPGA (fine grain) roughly: area efficiency (transistors/chip, orders of magnitude) hardwired 4 FPGA 2 µProc 0 rDPA 4 roughly: performanc e (MOPS/mW, orders of magnitude) hardwired 3 FPGA 2 µProc 0 rDPA 3 DSP 1 Status: ~1998 commodity in special cases much higher acceleration factors !

© 2005, TU Kaiserslautern 22 >> Outline << Reconfigurable Devices Coarse-grained Reconfigurable Devices Data-stream-based Computing The contemporary Common Model Reconfigurable Supercomputing Conclusions

© 2005, TU Kaiserslautern 23 „data stream“: an ambigouos definition Reconfigurable Computing is not instruction-stream-based it‘s data-stream-based it‘s different from the operation of the (indeterministic) „dataflow machine“ other definition also from multimedia area usable definition from systolic array area

© 2005, TU Kaiserslautern 24 DPA x x x x x x x x x | || xx x x x x xx x -- - input data streams xx x x x x xx x x x x x x x x x x | | | | | | | | | | | | | | output data streams „ data streams “ time port # time port # time port # Flowware defines:... which data item at which time at which port Data streams (flowware) (pipe network)

© 2005, TU Kaiserslautern 25 rDPU S + for demo: a tiny section of the pipe network inter-rDPU-communication: no memory cycles needed configware solution: computing in space

© 2005, TU Kaiserslautern 26 Compare it to software solution on CPU on a very simple CPU C = 1 memory cycles nano seconds if C then read A read instruction instruction decoding read operand* operate & register transfers if not C then read B read instruction instruction decoding add & store read instruction instruction decoding operate & register transfers store result total S = R + (if C then A else B endif); S + A B R C Clock 200 =1 S +

© 2005, TU Kaiserslautern 27 hypothetical branching example to illustrate software-to-configware migration *) if no intermediate storage in register file C = 1 simple conservative CPU example memory cycles nano seconds if C then read A read instruction1100 instruction decoding read operand*1100 operate & reg. transfers if not C then read B read instruction1100 instruction decoding add & store read instruction1100 instruction decoding operate & reg. transfers store result1100 total 5500 S = R + (if C then A else B endif); S + ABR C clock 200 MHz (5 nanosec) =1 section of a major pipe network on rDPU no memory cycles: speed-up factor = 100

© 2005, TU Kaiserslautern 28 The wrong mind set.... S = R + (if C then A else B endif); =1 + A B R C section of a very large pipe network: decision not knowing this solution: symptom of the hardware / software chasm and the configware / software chasm „but you can‘t implement decisions!“

© 2005, TU Kaiserslautern 29 >> Outline << Reconfigurable Devices Coarse-grained Reconfigurable Devices Data-stream-based Computing The contemporary Common Model Reconfigurable Supercomputing Conclusions

© 2005, TU Kaiserslautern 30 von Neumann is not the common model progra m counter DPU CPU RAM memory von Neumann bottleneck von Neumann instruction-stream- based machine co-processors accelerator CPU instruction- stream- based data- stream- based hardware software mainframe age: microprocessor age:

© 2005, TU Kaiserslautern 31 Here is the common model progra m counter DPU CPU RAM memory von Neumann bottleneck von Neumann instruction-stream- based machine co-processors accelerator CPU instruction- stream- based data- stream- based hardware software mainframe age: microprocessor age: configware age: CPU accelerator reconfigurable morphware software/configware co-compiler software configware accelerator reconfigurable accelerator hardwired CPU

© 2005, TU Kaiserslautern 32 Here is the common model progra m counter DPU CPU RAM memory von Neumann bottleneck von Neumann instruction-stream- based machine co-processors accelerator CPU instruction- stream- based data- stream- based hardware software mainframe age: microprocessor age: configware age: CPU accelerator reconfigurable morphware software/configware co-compiler software configware

© 2005, TU Kaiserslautern 33 configware resources: variable Nick Tredennick’s Paradigm Shifts explain the differences 2 programming sources needed flowware algorithm: variable Configware Engineering Software Engineering 1 programming source needed algorithm: variable resources: fixed software CPU

© 2005, TU Kaiserslautern 34 Terminology clean-up Software: for scheduling instruction streams Flowware: for scheduling data streams Configware: for configuring morphware Programming sources: von Neumann primarily non-von Neumann

© 2005, TU Kaiserslautern 35 Compilation: Software vs. Configware source program software compiler software code Software Engineering configware code mapper configware compiler scheduler flowware code source „ program “ Configware Engineering placement & routing data C, FORTRAN MATHLAB

© 2005, TU Kaiserslautern 36 Co-Compilation software compiler software code Software / Configware Co-Compiler configware code mapper configware compiler scheduler flowware code data C, FORTRAN, MATHLAB

© 2005, TU Kaiserslautern 37 Co-Compiler for Hardwired Anti Machine [e. g. Brodersen] software compiler software code Software / Flowware Co-Compiler flowware compiler scheduler flowware code data source

© 2005, TU Kaiserslautern 38 Termnology & taxonomy in the morphware age

© 2005, TU Kaiserslautern 39 >> Outline << Reconfigurable Devices Coarse-grained Reconfigurable Devices Data-stream-based Computing The contemporary Common Model Reconfigurable Supercomputing Conclusions

© 2005, TU Kaiserslautern 40 HPC experts coming... Simulation of Star Clusters: x10 speed-up by supercomputer-to-morphware migration (also molecular biology et al.) Rainer Spurzem, University of Heidelberg Reinhard Maenner, University of Mannheim HPC pioneer since 1976 (Physics Dept Heidelberg) Configware by Astrophysics by ARI, Astrononisches Rechen-Institut, founded 1700 in Berlin, moved 1945 to Heidelberg by August Kopff Gottfried Kirch August Kopff example: N-body problem going configware [FPL 1999] Google: 28,2000 hits

© 2005, TU Kaiserslautern 41 Cray XD1 Ignoring Reconfigurable Computing is the completely wrong roadmap … … has delayed the time to insight by more than a decade. ############# 10

© 2005, TU Kaiserslautern 42 FPGAs in Oil and Gas.... Saves more than $10,000 in electricity bills per year (7 ¢ / kWh) per 64-processor 19" rack „Application migration [from supercomputer] has resulted in a 17-to-1 increase in performance" [Herb Riley, R. Associates]

© 2005, TU Kaiserslautern 43 Why the speed-up although FPGA is clock slower by x 3 or even more (most know-how from „ high level synthesis “ discipline) decisions without memory cycles nor clock cycles most „ data fetch “ without memory cycle

© 2005, TU Kaiserslautern 44 data moved around by software i.e. by memory-cycle-hungry instruction streams which fully hit the memory wall P&R: move locality of operation, not data ! extremely unbalanced stolen from Bob Colwell CPU

© 2005, TU Kaiserslautern 45 Replace Caches by... stolen from Bob Colwell CPU caches … by 16 x 16 reconfigurable data path array (rDPA) which fits on the same chip

© 2005, TU Kaiserslautern 46 Software / Configware Co-Compilation Resource Parameters supporting different platforms Analyzer / Profiler SW code SW compiler paradigm “vN" machine CW Code CW compiler anti machine paradigm Partitioner C language source FW Code Juergen Becker’s CoDe-X, 1996 simulated annealing

© 2005, TU Kaiserslautern 47 Co-Compiler Enabling Technology is available from academia only a small team needed for commercial re-implementation on the road map to the Personal Supercomputer

© 2005, TU Kaiserslautern 48 >> Outline << Reconfigurable Devices Coarse-grained Reconfigurable Devices Data-stream-based Computing The contemporary Common Model Reconfigurable Supercomputing Conclusions

© 2005, TU Kaiserslautern 49 The first archetype machine model main frame CPU compile or assemble procedural personalization Software Industry Software Industry’s Secret of Success simple basic. Machine Paradigm personalization: RAM-based instruction-stream- based mind set “von Neumann”

© 2005, TU Kaiserslautern 50 An Archetype Common Model needed Guidance for organizing efficient solutions Make the project manageable Allow to share lessions between applications and between application areas Useful simple archetype not widely accepted Archetype common model should provide.... Progress stalled by the software/configware chasm Configware Industry from the

© 2005, TU Kaiserslautern 51 The memory wall Tear down this wall !

© 2005, TU Kaiserslautern 52 the Computing Education Wall hardwired accelerators hardwired accelerators programmable accelerators programmable accelerators µ processor Simple models of area, power, and delay can be taught in a week or two. [ Bill Dally, WCAE’04 ] Tear down this wall !

© 2005, TU Kaiserslautern 53 Merging the basic paradigms for drastically reduced cost of equipment and electricity supply … into a dual paradigm common model will open up new horizons of high performance Configware Industry from the Software Industry and from the

© 2005, TU Kaiserslautern 54 thank you

© 2005, TU Kaiserslautern 55 END