© 2006, Reconfigurable Computing. Reiner Hartenstein. Computing Meeting EU, ESU, Brussels, May 18, 2006.

© 2006, 2 The Pervasiveness of RC. [Chart: number of Google hits for “FPGA and ….” by application area.] ECE-savvy scene (mainstream for many years); math/SW-savvy scene (more recently: 2-3 years); and many more areas.

© 2006, 3 The dominance of Configware: most compute power is coming from Configware. More MIPS have migrated to Configware than are running as Software.

© 2006, 4 Reconfigurable Supercomputing (VHPC) going commercial: Cray XD1, Silicon Graphics RASC, … and other vendors.

© 2006, 5 >> Outline << Reconfigurable Computing Paradox; The Supercomputing Paradox; We are using the wrong model; Coarse-grained Reconfigurable Devices; Super Pentium for Desktop Supercomputer.

© 2006, 6 The Reconfigurable Computing Paradox. Poor FPGA technology: area-inefficient, slow, power-hungry, expensive. Poor tools: tools and languages unacceptable to most users; even most hardware experts (86%**) hate their tools. Poor education: RC education extremely poor, if offered at all; ignored by CS curricula; CS taught as if for a 50-year-old mainframe … **) DeHon ’98

© 2006, 7 FPGA integration density: the effective integration density of plain FPGAs is behind Moore’s law by more than 4 orders of magnitude. However, there are brilliant results everywhere. What paradox?

© 2006, 8 Speed-up factors published.
[Chart: published speed-up factors by year, plotted against performance growth curves. Curve labels: FPGA (x2 / yr); Pentium 4; 7% / yr; 50% / yr; microprocessor relative performance; memory; x1.25 / yr (Moore). Bands: pre-FPGA era, >1 OoM, >2 OoM, >3 OoM, <4 OoM. Application areas: DSP and wireless; image processing, pattern matching, multimedia; bioinformatics; astrophysics; crypto.]
Speed-up factors shown: Los Alamos traffic simulation 47; real-time face detection 6000; video-rate stereo vision 900; pattern recognition 730; SPIHT wavelet-based image compression 457; Smith-Waterman pattern matching 288; BLAST 52; protein identification 40; molecular dynamics simulation 88; Reed-Solomon decoding 2400; Viterbi decoding 400; D FIR filter [TU-KL] 39.4; Lee routing (by TU-KL) 160; GRAPE (astrophysics) 20; crypto 1000. Also shown: FFT; MAC; grid-based DRC („fair comparison“; no FPGA: DPLA on MoM Xputer architecture, by TU-KL).

© 2006, 9 Platform FPGAs: better area efficiency [courtesy Xilinx Corp.]. DSP platform FPGA: flexible soft logic architecture (… MHz) with 200K logic cells; 500 MHz programmable DSP execution units; … Gbps serial transceivers; 500 MHz PowerPC™ processors (680 DMIPS) with auxiliary processor unit; 1 Gbps differential I/O; 500 MHz multi-port distributed 10 Mb SRAM; 500 MHz DCM digital clock management. DeHon’s 1st Law (1996) was for plain FPGAs.

© 2006, 10 Pre-FPGA era: why DPLA* was so good. (1) Large arrays of canonical Boolean expressions; classical PLA layout, highly area-efficient, close to Moore’s law. Mid-’80s: at first only very tiny FPGAs were available; 1 DPLA replaced 256 of them. (2) ASM (Auto-Sequencing Memory) with GAG (Generic Address Generator**), a generalization of the DMA**, to avoid address computation overhead: reducing memory cycles, which is the key issue. Speed-up factor of 20 by *) fabricated 1984 by the E.I.S. multi-university project. **) for a survey by IMEC & TU-KL see: [M. Herz et al.: ICECS 2003, Dubrovnik]

© 2006, 11 Taxonomy of algorithms, better tools and better education. [Chart: the published FPGA speed-up factors of slide 8, repeated.] Even higher speed-up? Consolidation?

© 2006, 12 New dimensions of low power: application migration [from supercomputer] results not only in massive speed-ups. Electricity bills are reduced by an order of magnitude, and even more you may get for free …. up to millions of dollars per year (also a matter of national energy policy). Google; Amsterdam; NY. „Saves more than $10,000 in electricity bills per year (7 ¢ / kWh) per 64-processor 19" rack“ [Herb Riley, R. Associates]
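The quoted saving can be turned into energy and average power with a few lines of arithmetic. The sketch below only uses the two figures from the quote ($10,000 per year, 7 ¢ / kWh); the assumption that the rack runs continuously (24 hours, 365 days) is ours, not the slide's, and is needed only for the average-power figure.

```python
# Worked arithmetic for the quoted saving per 64-processor rack.
SAVINGS_USD_PER_YEAR = 10_000      # from the quote
PRICE_USD_PER_KWH = 0.07           # from the quote: 7 cents per kWh
HOURS_PER_YEAR = 24 * 365          # assumption: rack runs around the clock

# Energy that no longer has to be bought each year.
energy_saved_kwh = SAVINGS_USD_PER_YEAR / PRICE_USD_PER_KWH

# Equivalent reduction in average electrical power draw.
avg_power_saved_kw = energy_saved_kwh / HOURS_PER_YEAR

print(f"energy saved per year: {energy_saved_kwh:,.0f} kWh")
print(f"average power saved:   {avg_power_saved_kw:.1f} kW")
```

Under the continuous-operation assumption, the quote corresponds to roughly 143,000 kWh per year, or an average reduction of about 16 kW per rack.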

© 2006, 13 >> Outline << Reconfigurable Computing Paradox; The Supercomputing Paradox; We are using the wrong model; Coarse-grained Reconfigurable Devices; Super Pentium for Desktop Supercomputer.

© 2006, 14 The Supercomputing Paradox: growing listed Teraflops; increasing number of processors running in parallel; COTS processors, decreasing cost; promising technology.
ISC2006 BoF Session. Title and Abstract: Is Reconfigurable Computing the Next Generation Supercomputing? Advances in reconfigurable computing, particularly FPGA (field-programmable gate array) technology, have reached a performance level where they rival and exceed the performance of general purpose processors for the right applications. FPGAs have gotten cheaper thanks to smaller geometries, multimillion gate counts and volume market leverage from ASIC preproduction and other conventional uses. The potential benefit from the widespread incorporation of FPGA technology into high-performance applications is high, provided present day barriers to their incorporation can be overcome. This session will focus on defining the anticipated market changes, anticipated roles of FPGA technology in high-performance computing (from accelerators to hybrid architectures), characterizing present day barriers to the incorporation of FPGA technology (such as identifying the right applications), and partnering efforts required (tools, benchmarks, standards, etc.) to speed the adoption of reconfigurable technology in high-performance supercomputing.
Keywords: Reconfigurable computing, FPGA Accelerators, Supercomputing.
Date and Time: This BoF session is part of the conference program and will take place within a 45-minute slot on Wednesday, 28 June 2006, from 18: :30.
BoF Organizers: John Abott, Chief Analyst, The 451 Group, USA; Dr. Joshua Harr, CTO, Linux Networx, USA; Dr. Eric Stahlberg, organizing founder OpenFPGA, Ohio Supercomputer Center (OSC), USA.
As CTO for Linux Networx, Dr. Joshua Harr has the responsibility of laying the technical roadmap for the company and is leading the team developing cluster management tools. Josh's experience with parallel processing, distributed computing, large server farms, and Linux clustering began when he built an eight-node cluster system out of used components while in college. An industry expert, Josh has been called upon to consult with businesses and lecture in college classrooms. He earned a Ph.D. in computational chemistry and a bachelor's degree in molecular biology from BYU.

© 2006, 15 HPC by classic supercomputing methodology: poor results. Extreme shortage of affordable capacity. Lack of scalability: progress only by innovation. More parallelism absorbs programmer productivity. Program ready: hardware obsolete. The law of More. Not for high performance embedded computing.

© 2006, 16 >> Outline << Reconfigurable Computing Paradox; The Supercomputing Paradox; We are using the wrong model; Coarse-grained Reconfigurable Devices; Super Pentium for Desktop Supercomputer.

© 2006, 17 Why traditional supercomputing / HPC failed: instruction-stream-based and memory-cycle-hungry; the wrong way in which the data are moved around, because of the wrong multi-core interconnect architecture; extremely unbalanced (stolen from Bob Colwell). [Figure label: CPU]

© 2006, 18 Moving data around inside the Earth Simulator. Crossbar weight: 220 t, 3000 km of thick cable.

© 2006, 19 Discarding the wrong road map: with a paradigm shift, the same performance is feasible on a single 19” rack.

© 2006, 20 Bringing together data and processor by Software. Moving data to the processor: moving the grand piano.

© 2006, 21 Key issues in very High Performance Computing (vHPC): reducing memory cycles is the key issue. This needs a paradigm shift away from the dominance of instruction streams.

© 2006, 22 Here is the common model: it’s not von Neumann. [Diagram: CPU (instruction-stream-based, software code) in symbiosis with accelerators (data-stream-based): reconfigurable accelerator (configware code) and hardwired accelerator. Labels: legacy issues; very high performance & electricity bill issues; symbiotic.] Von Neumann: the tail is wagging the dog. The vN monopoly in our curricula is severely harmful; we need dual paradigm education.

© 2006, 23 The wrong basic mind set: our IT expert labor force lacks the right basic mind set. We need a dual paradigm approach. This is a severe educational challenge.

© 2006, 24 For high school and undergraduate education we need an archetype: a simple common model instead of a wide variety of sophisticated architectures. This is a severe educational challenge.

© 2006, 25 >> Outline << Reconfigurable Computing Paradox; The Supercomputing Paradox; We are using the wrong model; Coarse-grained Reconfigurable Devices; Super Pentium for Desktop Supercomputer.

© 2006, 26 Integration density: the effective integration density of plain FPGAs is behind Moore’s law by more than 4 orders of magnitude; the effective integration density of rDPAs* may come close to Moore’s law. *) reconfigurable DataPath Arrays (coarse-grained reconfigurability)

© 2006, 27 Coarse grain is about computing, not logic. [Figure: SNN filter on KressArray (mainly a pipe network) [Ulrich Nageldinger]; array size: 10 x 16 = 160 rDPUs; legend: route-through only, not used, backbus connect.] rDPU: reconfigurable DataPath Unit, e.g. 32 bits wide; no CPU.

© 2006, 28 SW 2 coarse-grained CW migration example (software to coarse-grained configware). [Figure labels: rDPU, S, +]

© 2006, 29 Compare it to the software solution on a CPU: S = R + (if C then A else B endif); C = 1. Simple conservative CPU example, memory cycles (1 memory cycle = 100 nanoseconds each): if C then read A: read instruction, read operand; if not C then read B: read instruction; add & store: read instruction, store result. Instruction decoding and operate & reg. transfers are listed without memory cycles. [Figure labels: rDPU, S, +, clock 200]

© 2006, 30 Hypothetical branching example to illustrate software-to-configware migration: S = R + (if C then A else B endif); C = 1. Simple conservative CPU example:
step                            memory cycles   nanoseconds
if C then read A
  read instruction              1               100
  instruction decoding
  read operand*                 1               100
  operate & reg. transfers
if not C then read B
  read instruction              1               100
  instruction decoding
add & store
  read instruction              1               100
  instruction decoding
  operate & reg. transfers
  store result                  1               100
total                           5               500
*) if no intermediate storage in register file
rDPU solution: inputs A, B, R and C (= 1), output S, adder (+); clock 200 MHz (5 nanosec); no memory cycles: speed-up factor = 100.
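The speed-up factor of 100 follows directly from counting memory cycles. The sketch below redoes that count under the slide's own figures (100 ns per memory cycle, 200 MHz rDPU clock); the list of accesses and the assumption that the rDPU delivers one result per clock period are our reading of the slide, not a measurement.

```python
# Memory-cycle accounting for S = R + (if C then A else B), case C = 1.
# Each entry is one memory access of the "simple conservative CPU";
# instruction decoding and register transfers are counted as free,
# as in the slide's table.
MEMORY_CYCLE_NS = 100    # from the slide: one memory cycle takes 100 ns
RDPU_CLOCK_MHZ = 200     # from the slide: 200 MHz, i.e. 5 ns per clock

cpu_memory_accesses = [
    "read instruction (if C then read A)",
    "read operand A",
    "read instruction (if not C then read B)",
    "read instruction (add & store)",
    "store result S",
]

cpu_time_ns = len(cpu_memory_accesses) * MEMORY_CYCLE_NS

# Assumption: the rDPU needs no memory cycles and produces one result
# per clock period once the pipeline is filled.
rdpu_time_ns = 1000 / RDPU_CLOCK_MHZ

speedup = cpu_time_ns / rdpu_time_ns
print(f"CPU: {cpu_time_ns} ns, rDPU: {rdpu_time_ns} ns, speed-up: {speedup:.0f}")
```

Five memory cycles of 100 ns give 500 ns on the CPU, against one 5 ns clock period on the rDPU, which is the factor of 100 quoted on the slide.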

© 2006, 31 Why the speed-up? What’s the difference? Moving the locality of operation into the route of the data stream by P&R (placement and routing), instead of moving data by instruction streams.

© 2006, 32 Bringing together data and processor by Configware: move the stool. Place the location of execution into the data pipe.

© 2006, 33 Data-stream-based instead of instruction-triggered: execution should be transport-triggered; transport should be done within compiled pipelines, not by move engines*. *) which are instruction-stream-based!

© 2006, 34 For high school and undergraduate education: we should send CTOs and professors back to school. This is a severe educational challenge.

© 2006, 35 The wrong model: upon this schematic … question by a Japanese corporate vVIP. [Figure: SNN filter on KressArray (mainly a pipe network) [Ulrich Nageldinger]; array size: 10 x 16 = 160 rDPUs; legend: route-through only, not used, backbus connect; rDPU: reconfigurable DataPath Unit, e.g. 32 bits wide; no CPU.]

© 2006, 36 The wrong mind set …. „But you can’t implement decisions!“ (question by a Japanese corporate vVIP [RAW’99]). Not knowing this solution is a symptom of the hardware / software chasm and the configware / software chasm. [Figure: rDPU with inputs A, B, R and C (= 1), output S, adder (+); clock 200 MHz (5 nanosec).] We need Reconfigurable Computing Education.
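The answer to the "decisions" question is that a decision becomes a multiplexer inside the datapath. The sketch below models this in software, assuming four synchronized input streams R, C, A and B; the function names `mux` and `rdpu_pipeline` are illustrative and do not come from the slides.

```python
from typing import Iterable, Iterator

def mux(c: int, a: int, b: int) -> int:
    """A decision as a datapath element: select A if C is set, else B."""
    return a if c else b

def rdpu_pipeline(r_stream: Iterable[int], c_stream: Iterable[int],
                  a_stream: Iterable[int], b_stream: Iterable[int]) -> Iterator[int]:
    """Compute S = R + (if C then A else B) for every element of the streams.

    There is no instruction fetch and no branch: each set of operands
    flows through the multiplexer and the adder, one result per step.
    """
    for r, c, a, b in zip(r_stream, c_stream, a_stream, b_stream):
        yield r + mux(c, a, b)

# Three data items per stream, with C alternating between 1 and 0.
result = list(rdpu_pipeline(
    r_stream=[10, 20, 30],
    c_stream=[1, 0, 1],
    a_stream=[1, 2, 3],
    b_stream=[100, 200, 300],
))
print(result)  # [11, 220, 33]
```

In the hardware version both A and B are present at the multiplexer inputs at the same time, so selecting one of them costs no extra memory cycle.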

© 2006, 37 >> Outline << Reconfigurable Computing Paradox; The Supercomputing Paradox; We are using the wrong model; Coarse-grained Reconfigurable Devices; Super Pentium for Desktop Supercomputer.

© 2006, 38 Some goals. Universal HPC co-architecture for: embedded vHPC (nomadic, automotive, ...); desktop vHPC (scientific computing ...). Application co-development environment for hardware non-experts, .... Acceptability by software-type users, ... Meet product lifetime >> embedded system life: FPGA emulation logistics from development down to maintenance and repair stations; examples: automotive, aerospace, industrial, ...

© 2006, 39 Architecture: a potential Pentium successor. The desktop supercomputer! Discard most caches; have 64* cores, GHz, with clever interconnect for: concurrent processes and multithreading (for CPU mode), and a Kung-Kress pipe network (for DPU mode). *) CPU mode / DPU mode capability

© 2006, 40 “Super Pentium” configuration example: twin paradigm machine. [Figure labels: rDPU, CPU]

© 2006, 41 World TV & game console & multimedia center: a single device handles all modes. E.g. ~ 8 x 8 rDPA: all feasible under 500 MHz. [Block diagram: SMeXPP with rDPA; camera; baseband processor; radio interface; audio interface; SD/MMC cards; LCD display; games, music, videos.] Requirements: variable resolutions and refresh rates; variable scan mode characteristics; noise reduction and artifact removal; high performance requirements; variable file encoding formats; variable content security formats; variable displays; luminance processing; detail enhancement; color processing; sharpness enhancement; shadow enhancement; differentiation; programmable de-interlacing heuristics; frame rate detection and conversion; motion detection & estimation & compensation; different standards (MPEG2/4, H.264).

© 2006, 42 Feasible under 500 MHz: this means low electricity cost and allows very high integration density.

© 2006, 43 Apropos compiled pipeline … [Figure: pipeline]

© 2006, 44 Dual Paradigm Application Development Support. [Diagram: high level language -> software/configware co-compiler -> software code for the CPU (instruction-stream-based) and configware code for the reconfigurable accelerator (data-stream-based); hardwired accelerator.] Placement & routing in the compiler optimizes interconnect bandwidth by preferring nearest neighbor connect.

© 2006, 45 Software / Configware Co-Compilation: Juergen Becker’s CoDe-X, 1996. [Diagram: C language source -> Partitioner -> SW compiler (for the CPU) and CW compiler (for the rDPU); the CW compiler performs Placement & Routing (move the locality of operation); Resource Parameters support different platforms.]

© 2006, 46 Software / Configware very high level Synthesis. [Diagram: math formula .... -> term-rewriting-based vhl synthesis system [Arvind, or, Mauricio Ayala] -> software code for the CPU (instruction-stream-based) and configware code for the reconfigurable accelerator (data-stream-based); hardwired accelerator.]

© 2006, 47 >> Conclusions << Reconfigurable Computing Paradox; The Supercomputing Paradox; We are using the wrong model; Coarse-grained Reconfigurable Devices; Super Pentium for Desktop Supercomputer; Conclusions.

© 2006, 48 Objectives, for every area which needs: cheap, compact vHPC; flexibility (for accelerators); avoiding specific silicon; rapid prototyping, field-patching, emulation.

© 2006, 49 Conclusion (1). Reconfigurable Computing opens many spectacular new horizons: cheap vHPC without needing specific silicon, no mask ....; massive reduction of the electricity bill, locally and nationally; cheap embedded vHPC; cheap desktop supercomputer (a new market); fast and cheap prototyping; replacing expensive hardwired accelerators; supporting fault tolerance, self-repair and self-organization; flexibility for systems with unstable multiple standards by dynamic reconfigurability; emulation logistics for very long term spare part provision and part type count reduction (automotive, aerospace …).

© 2006, 50 Conclusion (2). Needed: a universal vHPC co-architecture demonstrator. For widely spreading its use successfully: select killer applications for demo. The compilation tool problem is to be solved; the language selection problem is to be solved; the education backlog problems are to be solved. Use this to develop a very good high school and undergraduate lab course. A motivator: preparing for the top 500 contest.

© 2006, 51 thank you

© 2006, 52 END

© 2006, 53 backup

© 2006, 54 Compilation: Software vs. Configware. Software Engineering: source program (C, FORTRAN, MATLAB) -> software compiler -> software code. Configware Engineering: source „program“ -> configware compiler, consisting of a mapper (placement & routing) -> configware code, and a data scheduler -> flowware code.

© 2006, 55 Nick Tredennick’s Paradigm Shifts explain the differences. Software Engineering (CPU): resources fixed, algorithm variable; 1 programming source needed (software). Configware Engineering: resources variable, algorithm variable; 2 programming sources needed (configware, flowware).

© 2006, 56 Co-Compilation. [Diagram: source (C, FORTRAN, MATLAB) -> Software / Configware Co-Compiler: automatic SW / CW partitioner (simulated annealing) -> software compiler -> software code; configware compiler with mapper -> configware code and data scheduler -> flowware code.]

© 2006, 57 Co-Compiler for Hardwired Kress/Kung Machine [e.g. Brodersen]. [Diagram: source -> Software / Flowware Co-Compiler: automatic SW / CW partitioner -> software compiler -> software code; flowware compiler with data scheduler -> flowware code.]

© 2006, 58 The first archetype machine model: “von Neumann”. Software Industry’s secret of success: a simple basic machine paradigm. Mainframe CPU; compile or assemble: procedural personalization (RAM-based); instruction-stream-based mind set.

© 2006, 59 The 2nd archetype machine model: “Kress-Kung”. Configware Industry’s secret of success: a simple basic machine paradigm. Reconfigurable accelerator; compile: structural personalization (RAM-based); data-stream-based mind set.

© 2006, 60 Co-Compiler Enabling Technology, on the road map to the Personal Supercomputer: it is available from academia; only a small team is needed for commercial re-implementation.

© 2006, 61 Data streams (pipe network): H. T. Kung paradigm (systolic array). „Data streams“ define: ... which data item at which time at which port. [Figure: DPA with input data streams and output data streams, each drawn as time vs. port #.] Implemented by distributed memory: ASM (Auto-Sequencing Memory) = data counter (GAG) + RAM. 50 & more on-chip ASM are feasible.

© 2006, 62 The Generalization of the Systolic Array [R. Kress]: super systolic array (Kress-Kung paradigm). Systolic array: only for applications with regular data dependencies. Remedy? Discard algebraic synthesis methods; use optimization algorithms, e.g. simulated annealing. Achievement: also non-linear and non-uniform pipes, and even more wild pipe structures, are possible; reconfigurability makes sense.
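To make "use optimization algorithms, e.g. simulated annealing" concrete, the sketch below places the operators of a small pipe network onto a grid of array cells so that connected operators end up close together. It is a generic textbook-style annealer, not the KressArray mapper; the cost function (total Manhattan wire length), the cooling schedule and all names are assumptions chosen for illustration.

```python
import math
import random

def wire_length(placement, nets):
    """Total Manhattan distance over all point-to-point connections."""
    return sum(
        abs(placement[a][0] - placement[b][0]) + abs(placement[a][1] - placement[b][1])
        for a, b in nets
    )

def anneal_placement(ops, nets, rows, cols, steps=4000,
                     t_start=5.0, t_end=0.05, seed=0):
    """Place each operator on its own grid cell, minimizing wire length."""
    rng = random.Random(seed)
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    rng.shuffle(cells)
    placement = {op: cells[i] for i, op in enumerate(ops)}  # random start
    free = cells[len(ops):]                                 # unused cells
    cost = wire_length(placement, nets)
    best, best_cost = dict(placement), cost
    for step in range(steps):
        # Geometric cooling from t_start down towards t_end.
        temperature = t_start * (t_end / t_start) ** (step / steps)
        op = rng.choice(ops)
        if free and rng.random() < 0.5:
            # Move: relocate one operator to an unused cell.
            i = rng.randrange(len(free))
            placement[op], free[i] = free[i], placement[op]
            def undo(op=op, i=i):
                placement[op], free[i] = free[i], placement[op]
        else:
            # Move: exchange the cells of two operators.
            other = rng.choice(ops)
            placement[op], placement[other] = placement[other], placement[op]
            def undo(op=op, other=other):
                placement[op], placement[other] = placement[other], placement[op]
        new_cost = wire_length(placement, nets)
        delta = new_cost - cost
        # Always accept improvements; accept worsening moves with a
        # probability that shrinks as the temperature drops.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            cost = new_cost
            if cost < best_cost:
                best, best_cost = dict(placement), cost
        else:
            undo()
    return best, best_cost

# A four-stage pipe (hypothetical operator names) on a 3 x 3 array.
ops = ["mul", "add", "mux", "out"]
nets = [("mul", "add"), ("add", "mux"), ("mux", "out")]
best, best_cost = anneal_placement(ops, nets, rows=3, cols=3)
print(best, best_cost)
```

Because the cost function only measures distance between connected operators, the annealer favors nearest-neighbor connections, and it makes no assumption that the pipe is linear or uniform.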

© 2006, 63 Drastically reducing memory cycles (Kress-Kung machine paradigm): data counter instead of program counter; a generalization of the DMA. ASM (Auto-Sequencing Memory) = data counter (GAG) + RAM. GAG & enabling technology: multiple publications 1989 …; survey paper: [M. Herz et al.*: IEEE ICECS 2003, Dubrovnik]; storage scheme optimization methodology, etc.* *) IMEC, Leuven & TU-KL
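The idea of a data counter that sequences memory without any instruction stream can be sketched in a few lines. The model below is a simplified illustration, not the published GAG design: the parameters (`base`, `x_count`, `y_count`, `x_step`, `y_step`) and the function names are hypothetical, and only a rectangular scan pattern is covered.

```python
from typing import Iterator, List, Sequence

def gag_scan(base: int, x_count: int, y_count: int,
             x_step: int, y_step: int) -> Iterator[int]:
    """Generate the address sequence of a rectangular scan.

    The two nested counters play the role of the data counter: the
    addresses come from counting, not from executing address
    arithmetic instructions fetched from memory.
    """
    for y in range(y_count):
        for x in range(x_count):
            yield base + y * y_step + x * x_step

def asm_read(ram: Sequence[int], addresses: Iterator[int]) -> Iterator[int]:
    """Auto-sequencing memory: stream out the RAM words the counter selects."""
    for address in addresses:
        yield ram[address]

# A 4 x 4 image stored row by row; stream out the 2 x 2 block whose
# top-left element sits in row 1, column 1 (address 5).
ram: List[int] = list(range(100, 116))
block_addresses = list(gag_scan(base=5, x_count=2, y_count=2, x_step=1, y_step=4))
block = list(asm_read(ram, block_addresses))
print(block_addresses)  # [5, 6, 9, 10]
print(block)            # [105, 106, 109, 110]
```

On a CPU the same scan would cost instruction fetches for every address increment and comparison; here the only memory cycles are the data reads themselves.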

© 2006, 64 Fine-grained RC: DeHon’s 1st Law [1996: Ph.D., MIT]. Technology: immense area inefficiency. [Chart: transistors / microchip over time; curves: Gordon Moore curve (microprocessor) >> FPGA physical density > FPGA logical density > FPGA routed density.] Overhead: reconfigurability overhead, wiring overhead, routing congestion.

© 2006, 65 Coarse-grained RC: Hartenstein’s amendment of DeHon’s 1st Law [1996: ISIS, Austin, TX]. Area efficiency very close to Moore’s law, e.g. KressArray family. [Chart: transistors / microchip over time; curves: Gordon Moore curve, rDPA physical, rDPA logical >> FPGA routed.]

© 2006, 66 More compute power by Configware than Software. 75% of all (micro)processors are embedded (4 : 1). 25% of embedded µProc. are accelerated by FPGA(s) (1 : 4; a very cautious estimation**) -> 1 : 1 -> every 2nd µProc accelerated by FPGA(s). Average acceleration factor > 2 -> rMIPS* : MIPS > 2 (difference probably an order of magnitude). Conclusion: most compute power from Configware. *) rMIPS: MIPS replaced by FPGA compute power. **) Dataquest interaction pending

© 2006, 67 Conclusion (3). Self-repair and self-organization methodology. Embedded r-emulation logistics methodology. Universal vHPC co-architecture demonstrator. For widely spreading its use successfully: select a killer application for demo.

© 2006, 68 Dual Paradigm Application Development Support: other example. [Diagram: high level language and MATLAB adapter -> software/configware co-compiler -> software code for the CPU (instruction-stream-based) and configware code for the reconfigurable accelerator (data-stream-based); hardwired accelerator.]