FPGA based Acceleration of Linear Algebra Computations. B.Y. Vinay Kumar, Siddharth Joshi, Sumedh Attarde, Prof. Sachin Patkar, Prof. H. Narayanan

Presentation transcript:

Slide 1: FPGA based Acceleration of Linear Algebra Computations. B.Y. Vinay Kumar, Siddharth Joshi, Sumedh Attarde, Prof. Sachin Patkar, Prof. H. Narayanan

Slide 2: Outline
- Double Precision Dense Matrix-Matrix Multiplication
  - Motivation
  - Related Work
  - Algorithm
  - Design
  - Results
  - Conclusions
- Double Precision Sparse Matrix-Vector Multiplication
  - Introduction
  - Prasanna
  - DeLorimier
  - David Gregg et al.
  - What can we do?

Slide 3: FPGA based Double Precision Dense Matrix-Matrix Multiplication

Slide 4: Motivation
- FPGAs have been making inroads into high-performance computing (HiPC).
- Accelerating BLAS-3 is achieved by accelerating matrix-matrix multiplication.
- Modern FPGAs provide an abundance of resources; we must capitalise on these.

Slide 5: Related Work {1/2}
- The two main prior works are Dou and Prasanna. Both are based on linear arrays, both use memory switching, and both sustain their peak performance.
- Dou:
  - Optimised for a large Virtex-II Pro device (Xilinx).
  - Created his own MAC (not fully compliant).
  - Sub-block dimensions must be powers of 2.
  - Optimised for low I/O bandwidth.

Slide 6: Related Work {2/2}
- Prasanna:
  - Scaling results in a speed degradation of about 35% (from 2 PEs to 20 PEs).
  - 2.1 GFLOPS on a Cray XD1 with Virtex-II Pros (XC2VP50).
  - For the design alone (XC2VP125) they report 15% clock degradation from 2 to 24 PEs.
  - They state that they have not made any platform-specific optimisations for the implemented design.

Slide 7: Algorithm
1. Broadcast 'A'; keep a unique 'B' per PE.
2. Multiply, feeding operands through the multiplier pipeline.
3. The multiplier output is fed directly to an Adder+RAM (accumulator).
4. When the updated C values are ready, read them out.
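For reference, a minimal C sketch of the rank-one update scheme these steps implement (an illustrative software model only: the matrix size N and the loop order are assumptions, not the hardware's actual blocking or PE count):

    #include <stdio.h>

    #define N 4  /* illustrative size; the hardware works on sub-blocks */

    /* One rank-one update per step k: column k of A is broadcast to all
       PEs, while PE j holds its own slice of row k of B and of the
       accumulating C. */
    static void matmul_rank1(const double A[N][N], const double B[N][N],
                             double C[N][N]) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                C[i][j] = 0.0;
        for (int k = 0; k < N; k++)
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++)
                    C[i][j] += A[i][k] * B[k][j];   /* MAC in the PE */
    }

    int main(void) {
        double A[N][N], B[N][N], C[N][N];
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                A[i][j] = i + j;        /* simple test data      */
                B[i][j] = (i == j);     /* identity: C must be A */
            }
        matmul_rank1(A, B, C);
        printf("C[2][3] = %g (expect 5)\n", C[2][3]);
        return 0;
    }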

Slide 8: Design-I

Slide 9: Design-II

Slide 10: FPGA Synthesis/PAR data {1/2}
Table: Resource utilisation for SX95T and SX240T (post PAR). Columns: PE, DSP48Es, FIFO, BRAM, Slice Reg, Slice LUT (SX240).
Table: Clock speed in MHz for the overall design for different numbers of PEs. Rows: Device/PE, for SX95T and SX240T.
[numeric table entries not recoverable from the transcript]

Slide 11: FPGA Synthesis/PAR data {2/2}
Table: Resource utilisation for Virtex-II Pro XC2VP100 (post PAR)

                 15 PE          20 PE
    MULT18x18    240 (54%)      304 (68%)
    RAMB16s      90 (20%)       114 (26%)
    Slices       30218 (68%)    37023 (83%)
    Speed        MHz            MHz

Slide 12: Conclusions
- We propose a variation of the rank-one update algorithm for matrix multiplication.
- We introduce a scalable processing element for this algorithm, targeted at a Virtex-5 SX240T FPGA.
- The two designs clearly show the effect of local storage on I/O bandwidth.
- The design achieved a clock speed of 373 MHz with 40 PEs and a sustained performance of 29.8 GFLOPS on a single FPGA. We also obtain 5.3 GFLOPS on an XC2VP100.
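As a consistency check (our arithmetic, not stated on the slide; it assumes each PE completes one multiply and one add per cycle):

\[ 2\,\tfrac{\text{flops}}{\text{PE}\cdot\text{cycle}} \times 40\ \text{PEs} \times 373\ \text{MHz} = 29.84\ \text{GFLOPS} \approx 29.8\ \text{GFLOPS} \]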

Slide 13: FPGA based Double Precision Sparse Matrix-Vector Multiplication

Slide 14: Introduction
- There are three main papers we will be looking at:
  - Viktor Prasanna: hybrid method using HLL + software + HDL.
  - Michael DeLorimier: maximum performance, but unrealistic assumptions.
  - David Gregg et al.: most realistic assumptions w.r.t. DRAM.

Slide 15: Prasanna
- Uses pre-existing IP cores, specifically for an iterative solver (CG).
- A 4-input reduction circuit performs the dot product, producing partial sums as output.
- An adder loop with an array performs the summation of dot products; created using an HLL.
- A reduction circuit at the end uses a binary tree to produce the final value.
- The IPs are available.
- DRAM is considered, but not realistically.
- The order of the matrices is small.
- DRAM is the bottleneck.
- With their IPs they have a good architecture; one could, however, swap the IP and modify the datapath, e.g. to use Dou's MAC.
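A minimal C model of the final binary-tree reduction stage mentioned above (assumptions: "B-Tree" read as a binary reduction tree, and a power-of-two number of partial sums; this models only the combining order, not the pipelined circuit):

    #include <stdio.h>

    /* Combine n partial sums pairwise over log2(n) levels, as a
       hardware adder tree would; assumes n is a power of two. */
    static double tree_reduce(double *x, int n) {
        for (int stride = 1; stride < n; stride *= 2)
            for (int i = 0; i + stride < n; i += 2 * stride)
                x[i] += x[i + stride];
        return x[0];
    }

    int main(void) {
        double partial[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        printf("%g\n", tree_reduce(partial, 8));  /* prints 36 */
        return 0;
    }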

Slide 16: DeLorimier
- Uses BRAMs for everything.
- Used for an iterative solver, specifically CG.
- The MAC requires interleaving.
- They do load balancing in their partitioner, which requires a communication stage and is very matrix/partitioner dependent.
- Communication is the bottleneck.
- Performance: 750 MFLOPS per processor.
- 16 Virtex-II 6000s; each has 5 PEs + 1 CE.
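A sketch of why the MAC needs interleaving (the adder depth ADD_LAT below is a hypothetical value): a pipelined floating-point adder cannot accumulate into the same register on consecutive cycles, so the sum is split across ADD_LAT independent lanes that are folded together at the end.

    #include <stdio.h>

    #define ADD_LAT 4  /* hypothetical adder pipeline depth */

    /* Each lane is updated only every ADD_LAT iterations, matching the
       adder's latency, so a new add can issue every cycle in hardware. */
    static double interleaved_sum(const double *x, int n) {
        double lane[ADD_LAT] = {0};
        for (int i = 0; i < n; i++)
            lane[i % ADD_LAT] += x[i];
        double s = 0.0;
        for (int l = 0; l < ADD_LAT; l++)
            s += lane[l];             /* final fold, e.g. a small tree */
        return s;
    }

    int main(void) {
        double x[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        printf("%g\n", interleaved_sum(x, 10));  /* prints 55 */
        return 0;
    }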

Slide 17: David Gregg et al. (SPAR)
- They only report the use of the SPAR architecture for FPGAs.
- They use very pessimistic DRAM access times, with an emphasis on cache-miss removal.
- They do not use their Block RAMs well; maybe something interesting can be done here.
- 128 MFLOPS for 3 parallel SPAR units; removing cache misses gives a peak of 570 MFLOPS.

Slide 18: What can we do?
- Both use CSR. This is not required; why not modify the representation?
- Two approaches, and we can try both simultaneously:
  - Prasanna: split across dot products (same row, many PEs).
  - DeLorimier: split across rows (many rows, one PE).
- Use data from SPAR, a viable approach: both do zero multiplies, whereas we can get away with one zero multiply per column.
- Minimise communication or overlap it; we can use interleaving for this, so that while one stage computes, the previous one communicates.
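For reference, the baseline CSR sparse matrix-vector multiply both approaches start from (a minimal sketch; the array names are the conventional CSR fields, not identifiers from either paper):

    #include <stdio.h>

    /* y = A*x with A in compressed sparse row (CSR) form:
       row_ptr[r]..row_ptr[r+1] indexes the nonzeros of row r. */
    static void spmv_csr(int nrows, const int *row_ptr, const int *col_idx,
                         const double *val, const double *x, double *y) {
        for (int r = 0; r < nrows; r++) {
            double dot = 0.0;
            for (int k = row_ptr[r]; k < row_ptr[r + 1]; k++)
                dot += val[k] * x[col_idx[k]];  /* one MAC per nonzero */
            y[r] = dot;
        }
    }

    int main(void) {
        int row_ptr[] = {0, 1, 3};
        int col_idx[] = {0, 0, 1};
        double val[]  = {10, 3, 4};   /* A = [[10, 0], [3, 4]] */
        double x[] = {1, 2}, y[2];
        spmv_csr(2, row_ptr, col_idx, val, x, y);
        printf("y = [%g, %g]\n", y[0], y[1]);  /* [10, 11] */
        return 0;
    }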

Slide 19: Questions?

Slide 20: Thank You