HPEC 2003 Linear Algebra Processor using FPGA Jeremy Johnson, Prawat Nagvajara, Chika Nwankpa Drexel University.

HPEC 2003 Goal
To design an embedded FPGA-based multiprocessor system to perform high-speed Power Flow Analysis.
To provide a single desktop environment that solves the entire Power Flow problem (multiprocessors on the desktop).
To provide a scalable solution to load flow computation.
Deliverables: prototype and feasibility analysis.

HPEC 2003 Approach
Utilize parallel algorithms for the matrix operations needed in load flow.
Utilize the sparsity structure across contingencies.
Use multiple embedded processors with problem-specific instructions and interconnect.
Scalable, parameterized design.
Pipelined solution for contingency analysis.

HPEC 2003 Dense Matrix Multiplier
Distributed memory implementation.
Hardwired control for communication.
Parameterized number of processing elements (multiply-and-accumulate hardware).
Overlap computation and communication.
Use block matrix algorithms with calls to the FPGA for large problems.
Processor i stores A_i, the i-th block of rows of A, and B_j, the j-th block of columns of B.
Compute C_ij = A_i * B_j.
Rotate: send B_j to processor (j+1) mod p, where p is the number of processors (see the sketch below).
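The following Python/NumPy sketch simulates this block-rotation scheme in software. The function name, the square-matrix shapes, and the assumption that p divides n are illustrative choices, not part of the FPGA design.

import numpy as np

def ring_matmul(A, B, p):
    """Multiply A (n x n) by B (n x n) with p simulated processors: each holds a
    row block A_i and a column block B_j that rotates around the ring."""
    n = A.shape[0]
    assert n % p == 0, "for simplicity, assume p divides n"
    blk = n // p
    A_blocks = [A[i*blk:(i+1)*blk, :] for i in range(p)]   # processor i holds row block A_i
    B_blocks = [B[:, j*blk:(j+1)*blk] for j in range(p)]   # and initially column block B_i
    C = np.zeros((n, n))
    for step in range(p):
        for i in range(p):
            j = (i + step) % p                              # column block currently held by processor i
            C[i*blk:(i+1)*blk, j*blk:(j+1)*blk] = A_blocks[i] @ B_blocks[j]
        # Conceptually, B_j now rotates to processor (j+1) mod p; in this serial
        # simulation the rotation is captured by the (i + step) % p indexing.
    return C

A = np.random.rand(8, 8); B = np.random.rand(8, 8)
assert np.allclose(ring_matmul(A, B, 4), A @ B)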

HPEC 2003 Processor Architecture
A host PC connects over PCI to the FPGA board and its off-chip RAM units. On the FPGA, an interconnection network (bus) links a parameterized array of processing elements, each consisting of an embedded SRAM unit paired with a multiply/add unit.

HPEC 2003 Performance
Performance estimate for the Xilinx Virtex-II (XC2V8000):
168 built-in multipliers and on-chip memories (SRAM) provide support for 42 single-precision processing elements.
4.11 ns pipelined 18 × 18 bit multiplier.
2.39 ns memory access time.
7.26 ns per multiply-accumulate.
Time for an n × n matrix multiply with p processors: 7.26 n^3 / p ns.
11 ms for n = 400, p = 42, i.e., about 11,570 MFLOPS.
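The estimate can be reproduced with a few lines of arithmetic; the only assumption added here is the standard 2 n^3 flop count for an n × n matrix multiply.

t_mac = 7.26e-9                    # seconds per multiply-accumulate, from the slide
n, p = 400, 42                     # matrix size and number of processing elements
time_s = t_mac * n**3 / p          # 7.26 n^3 / p ns, expressed in seconds
mflops = 2 * n**3 / time_s / 1e6   # ~2 n^3 floating-point operations per matrix multiply
print(f"{time_s * 1e3:.1f} ms, {mflops:,.0f} MFLOPS")   # about 11.1 ms and 11,570 MFLOPS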

HPEC 2003 APPLICATION, ALGORITHM & SCHEDULE
Application: the linear solver inside the power-flow solution for large systems, called from Newton-Raphson loops that iterate the Jacobian matrix to convergence.
Algorithm: pre-permute the columns, then sparse LU factorization followed by forward and backward substitution.
Schedule: round-robin distribution of the rows of the Jacobian matrix according to the pattern of the column permutation.
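A minimal software sketch of these numerical steps is shown below, using SciPy's dense LU routines in place of the sparse, pre-permuted factorization used in hardware; the mismatch and jacobian callables are placeholders for the power-flow equations.

import numpy as np
from scipy.linalg import lu_factor, lu_solve

def newton_raphson(mismatch, jacobian, x0, tol=1e-8, max_iter=20):
    """Solve mismatch(x) = 0 by Newton-Raphson with an LU-based linear solve."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(max_iter):
        f = mismatch(x)
        if np.linalg.norm(f, np.inf) < tol:   # "mismatch < accuracy" convergence test
            break
        lu, piv = lu_factor(jacobian(x))      # LU factorization of the Jacobian
        x += lu_solve((lu, piv), -f)          # forward and backward substitution
    return x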

HPEC 2003 Breakdown of Power-Flow Solution Implementation in Hardware
Problem formulation on the host (serial): data input, data extraction, and construction of the Ybus and the Jacobian matrix.
Problem solution in hardware (parallel): LU factorization, forward substitution, and backward substitution, with the Jacobian matrix updated and the loop repeated until the mismatch is below the required accuracy.
Post processing on the host (serial).

HPEC 2003 Minimum-Degree Ordering Algorithm (MMD)
Reorder columns to reduce fill-in during LU factorization, reducing floating-point operations and storage.
Compute the column permutation pattern once and apply it throughout the power-flow analysis for that set of input bus data.
Table: operation counts (Div, Mult, Sub) without and with MMD for the IEEE 30-bus, 118-bus, and 300-bus test systems.
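As an illustration of how such an ordering is chosen, the sketch below implements a greedy single-vertex minimum-degree ordering on the symmetric sparsity graph of a matrix; it is a simplification of the multiple minimum-degree (MMD) algorithm the slide refers to, and the dictionary-of-sets graph representation is an assumption for readability.

def minimum_degree_order(adj):
    """adj: dict mapping each node to the set of its neighbours in the sparsity graph."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}   # work on a copy
    order = []
    while adj:
        v = min(adj, key=lambda u: len(adj[u]))       # vertex of smallest current degree
        nbrs = adj.pop(v)
        order.append(v)
        for u in nbrs:                                # eliminating v connects its neighbours,
            adj[u].discard(v)                         # modelling the fill-in created in LU
            adj[u].update(nbrs - {u})
    return order

# Example on a 4-node path graph 0-1-2-3: low-degree vertices are eliminated first.
print(minimum_degree_order({0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}))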

HPEC 2003 RING TOPOLOGY
Nios embedded processors 0 through n-1 on the FPGA are connected in a ring through on-chip memory buffers 0 through n-1, which carry the interprocessor communication; memory controllers attach the ring to off-chip SDRAM and SRAM.

HPEC 2003 Hardware Model using Nios Processor
Communication uses DMA for passing messages: each processor has input and output buffers, implemented as on-chip memories and served by its own DMA engine, with arbitrators (drawn as trapezoids in the slide's diagram) controlling access; the processors also connect to off-chip SDRAM and SRAM.

HPEC 2003 Stratix FPGA

HPEC 2003 Floating Point Unit
FDIV pipeline: pre-normalize, Newton-Raphson ROM lookup, reciprocal iteration, multiply, round and post-normalize.
FADD pipeline (dual path): the far path performs swap-and-shift followed by add, while the near path performs predict-and-add followed by leading-zero detection and shift; the two paths merge in a select-and-round stage.
FMUL pipeline: pre-normalize, multiply, post-normalize, round.
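The sketch below shows the Newton-Raphson reciprocal iteration that the FDIV pipeline is built around: a coarse reciprocal estimate (the ROM lookup in hardware) is refined by x <- x * (2 - d * x), and the quotient is formed as dividend times reciprocal. The textbook linear seed and the assumption that the divisor is pre-normalized into [0.5, 1) are illustrative choices, not the slide's exact parameters.

def reciprocal(d, iterations=3):
    """Reciprocal of d, assumed pre-normalized into [0.5, 1)."""
    x = 48.0 / 17.0 - (32.0 / 17.0) * d   # linear seed, relative error at most 1/17
    for _ in range(iterations):
        x = x * (2.0 - d * x)             # each step roughly squares the relative error
    return x

def fdiv(a, d):
    return a * reciprocal(d)              # quotient = dividend * reciprocal(divisor)

print(fdiv(0.6, 0.75))                    # 0.6 / 0.75 = 0.8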

HPEC 2003 Floating Point Unit IEEE-754 Support
Single-precision format with round-to-nearest rounding, denormalized numbers, and special numbers.
Resource and timing table: LEs, multipliers, Fmax (MHz), latency (clock cycles), and pipeline rate (clock cycles) for the FADD, FMUL, and FDIV units and for the Nios + FPU system.

HPEC 2003 PERFORMANCE ANALYSIS
Why? The prototype is not intended for industrial performance; the analysis shows potential performance as a function of available hardware resources.
Analysis performed for a high-performance multiprocessor system: estimates of clock-cycle counts and timing, covering communication latency and arithmetic latency.
Model in MATLAB, assuming an 80 MHz pipelined floating-point unit.
Variables: number of processors (2, 4, 6, 8) and size of the input data (power flow for the IEEE 1648-bus and 7917-bus systems).
System constraints: memory access, size of the FPGA chip, system speed.
A toy version of such a model is sketched below.
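The slides do not reproduce the MATLAB model itself, so the sketch below is only a toy stand-in showing the shape of such an estimate: arithmetic work divided across p processors plus a serialized communication term, converted to time at an 80 MHz clock. The cycle counts and workload figures are placeholders, not measured values.

CLOCK_HZ = 80e6                     # 80 MHz system clock, as in the slide

def estimate_time(flops, words_communicated, p,
                  cycles_per_flop=4, cycles_per_word=2):
    """Crude timing estimate in seconds for p processors (placeholder constants)."""
    arithmetic_cycles = cycles_per_flop * flops / p               # work split across processors
    communication_cycles = cycles_per_word * words_communicated   # serialized on the shared interconnect
    return (arithmetic_cycles + communication_cycles) / CLOCK_HZ

for p in (2, 4, 6, 8):
    print(p, estimate_time(flops=5e6, words_communicated=2e5, p=p))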

HPEC 2003 TIMING OF PERFORMANCE MODEL
Nios embedded processors on an Altera Stratix FPGA running at 80 MHz, 8 processors: 1648-bus: … ms; 7917-bus: … ms.
400 MHz PowerPC on a Xilinx Virtex-II FPGA with an 80 MHz FPU, 8 processors: 1648-bus: … ms; 7917-bus: … ms.
WSMP on a 1.4 GHz processor with 256 KB cache, 256 MB SDRAM, and Linux: 1648-bus: … ms; 7917-bus: … ms.