Compiled code acceleration on FPGAs
W. Najjar, B. Buyukkurt, Z. Guo, J. Villarreal, J. Cortes, A. Mitra
Computer Science & Engineering

Presentation transcript:

Compiled code acceleration on FPGAs
W. Najjar, B. Buyukkurt, Z. Guo, J. Villarreal, J. Cortes, A. Mitra
Computer Science & Engineering, University of California Riverside

28 September 2007, Future of Computing, W. Najjar

Why? Are FPGAs a New HPC Platform?

Comparison of a dual-core Opteron (2.5 GHz) to Virtex-4 and Virtex-5 FPGAs on double-precision (dp) floating point
- Balanced allocation of adders, multipliers and registers
- Use both DSP and logic for multipliers, run at lower speed
- Logic & wires for I/O interfaces

[Table: dp Gflop/s for the Opteron, Virtex-4 and Virtex-5 (rows: MAC, Mult, Add) and Watts. The values were lost in extraction; the only surviving entry is ~35, which follows the Watts label.]

Source: David Strenski, "FPGA Floating-Point Performance -- a pencil and paper evaluation", HPCwire.com

ROCCC: Riverside Optimizing Compiler for Configurable Computing

Code acceleration
- By mapping circuits to the FPGA
- Achieve same speed as hand-written VHDL codes
Improved productivity
- Allows design and algorithm space exploration
Keeps the user fully in control
- We automate only what is very well understood

Challenges

FPGA is an amorphous mass of logic
- Structure provided by the code being accelerated
- Repeatedly applied to a large data set: streams
Languages reflect the von Neumann execution model:
- Highly structured and sequential (control driven)
- Vast randomly accessible uniform memory

CPUs (& GPUs)          FPGAs
Temporal computing     Spatial computing
Sequential             Parallel
Centralized storage    Distributed storage
Control flow driven    Data flow driven

ROCCC Overview

Limitations on the code:
- No recursion
- No pointers

[Diagram: compilation flow. Inputs: C/C++, Java. Stages: high-level transformations (procedure, loop and array optimizations), producing Hi-CIRRF; low-level transformations (instruction scheduling, pipelining and storage optimizations), producing Lo-CIRRF; code generation, producing SystemC, VHDL or binary. Targets: FPGA, CPU, GPU, DSP, custom unit.]

CIRRF: Compiler Intermediate Representation for Reconfigurable Fabrics

A Decoupled Execution Model

[Diagram: input memory (on or off chip) -> memory fetch unit -> input buffer -> multiple loop bodies, unrolled and pipelined -> output buffer -> memory store unit -> output memory (on or off chip)]

- Decoupled memory access from datapath
- Parallel loop iterations
- Pipelined datapath
- Smart buffer (input) does data reuse
- Memory fetch and store units, data path configured by compiler
- Off-chip accesses are platform specific
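The data reuse performed by the input smart buffer can be illustrated with a small software model. This is only a sketch under stated assumptions, not ROCCC's generated hardware: the names smart_buffer, fetch, fir_naive and fir_buffered are hypothetical, and a 3-element sliding window is assumed. It counts memory reads for a 3-tap window with and without a buffer that keeps the previously fetched elements.

```c
#include <assert.h>
#include <stddef.h>

#define TAPS 3

/* Hypothetical software model of a sliding-window input buffer. */
typedef struct {
    int window[TAPS];   /* the TAPS most recently fetched inputs */
    size_t fetches;     /* number of memory reads issued so far */
} smart_buffer;

/* Models the memory fetch unit: one counted memory read per call. */
static int fetch(const int *mem, size_t idx, size_t *fetches) {
    ++*fetches;
    return mem[idx];
}

/* No reuse: every iteration re-reads all TAPS inputs. Returns the read count. */
size_t fir_naive(const int *in, int *out, size_t n, const int *coef) {
    assert(n >= TAPS);
    size_t fetches = 0;
    for (size_t i = 0; i + TAPS <= n; ++i) {
        int acc = 0;
        for (size_t k = 0; k < TAPS; ++k)
            acc += coef[k] * fetch(in, i + k, &fetches);
        out[i] = acc;
    }
    return fetches;
}

/* With reuse: the window shifts by one, so each input is read exactly once. */
size_t fir_buffered(const int *in, int *out, size_t n, const int *coef) {
    assert(n >= TAPS);
    smart_buffer sb = { {0}, 0 };
    /* Prime the window with the first TAPS-1 inputs. */
    for (size_t k = 0; k + 1 < TAPS; ++k)
        sb.window[k + 1] = fetch(in, k, &sb.fetches);
    for (size_t i = 0; i + TAPS <= n; ++i) {
        /* Shift out the oldest element and fetch a single new one. */
        for (size_t k = 0; k + 1 < TAPS; ++k)
            sb.window[k] = sb.window[k + 1];
        sb.window[TAPS - 1] = fetch(in, i + TAPS - 1, &sb.fetches);
        int acc = 0;
        for (size_t k = 0; k < TAPS; ++k)
            acc += coef[k] * sb.window[k];
        out[i] = acc;
    }
    return sb.fetches;
}
```

For n inputs, the naive version issues 3*(n-2) reads while the buffered version issues n, about a two-thirds reduction for this 1-D window. Both functions write the same n-2 outputs, so the buffer changes only the memory traffic, not the result.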

So far, working compiler with ...

Extensive optimizations and transformations
- Traditional and FPGA specific
- Systolic array, pipelined unrolling, look-up tables
Compile + hardware support for data reuse
- > 98% reduction in memory fetches on image codes
Efficient code generation and pipelining
- Within 10% of hand-optimized HDL codes
Import of existing IP cores
- Leverages huge wealth, integrated with C source code
Support for dynamic partial reconfiguration

Example: 3-tap FIR

    #define N 516

    /* begin_hw()/end_hw() bracket the loop to be compiled to hardware. */
    void begin_hw();
    void end_hw();

    int main()
    {
        int i;
        const int T[3] = {3, 5, 7};   /* coefficients */
        int A[N], B[N];

        begin_hw();
        L1: for (i = 0; i <= (N - 3); i = i + 1) {
            /* i, i+1, i+2: indices of A[] */
            B[i] = T[0]*A[i] + T[1]*A[i+1] + T[2]*A[i+2];
        }
        end_hw();
    }
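As an illustration of the "multiple loop bodies, unrolled" idea applied to this FIR loop, the sketch below unrolls the loop by a factor of 2 in plain C. It is an assumption about what such a transformation looks like at source level, not the compiler's actual output; the function names fir_rolled and fir_unrolled2 are hypothetical, and the arrays are passed as parameters so the two versions can be compared.

```c
#include <assert.h>

#define N 516

static const int T[3] = {3, 5, 7};   /* coefficients from the slide */

/* Reference: the rolled loop, one output per iteration. */
void fir_rolled(const int A[N], int B[N]) {
    for (int i = 0; i <= N - 3; i = i + 1)
        B[i] = T[0]*A[i] + T[1]*A[i+1] + T[2]*A[i+2];
}

/* Unrolled by 2: two independent loop bodies per iteration. */
void fir_unrolled2(const int A[N], int B[N]) {
    /* There are N-2 outputs; with no remainder loop, this sketch
       requires that count to be even. */
    assert((N - 2) % 2 == 0);
    for (int i = 0; i <= N - 4; i = i + 2) {
        B[i]   = T[0]*A[i]   + T[1]*A[i+1] + T[2]*A[i+2];
        B[i+1] = T[0]*A[i+1] + T[1]*A[i+2] + T[2]*A[i+3];
    }
}
```

The two bodies have no dependence on each other, so they can be evaluated in parallel. They also share A[i+1] and A[i+2], which means two outputs need only four distinct inputs rather than six.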

RC Platform Models

[Diagram: three platform models, numbered 1 to 3. Surviving labels: CPU, FPGA, memory, memory interface, SRAM, fast network. The mapping of model numbers to diagrams is not recoverable from the transcript.]

What we have learned so far

Big speedups are possible
- 10x to 1,000x on application codes, over Xeon and Itanium: molecular dynamics, bio-informatics, etc.
- Works best with streaming data
New paradigms and tools
- For spatio-temporal concurrency
- Algorithms, languages, compilers, run-time systems, etc.

Future?

Very wide use of FPGAs
Why?
- High throughput (> 10x) AND low power (< 25%)
How?
- Mostly in Models 2 and 3, initially
- Model 2: see Intel QuickAssist, XtremeData & DRC
- Model 3: SGI, SRC & Cray
Contingency
- Market brings price of FPGAs down
- Availability of some software stack, for savvy programmers initially
Potential
- Multiple "killer apps" (to be discovered)

Conclusion

We as a research community should be ready
Stamatis was

Thank you