The LINPACK Benchmark on a Multi-Core Multi-FPGA System by Emanuel Ramalho Supervisor: Prof. Paul Chow University of Toronto Electrical and Computer Engineering.

Presentation transcript:

The LINPACK Benchmark on a Multi-Core Multi-FPGA System, by Emanuel Ramalho. Supervisor: Prof. Paul Chow. University of Toronto, Electrical and Computer Engineering Department. October 1st, 2008.

Outline: Motivation, LINPACK Algorithm, Parallelizing LINPACK, Results, Conclusions, Future Work.

Motivation: the LINPACK Benchmark is used to rank the Top500 computers in the world. Can FPGAs compete?

FPGA: the objective is to see how well a multi-core multi-FPGA system performs when compared to a processor. Disadvantage: much lower clock rate. Advantage: the entire implementation can be done in hardware.

LINPACK Algorithm: solves a system of linear equations Ax = b by calling two routines, DGEFA and DGESL. DGEFA: LU factorization with partial pivoting (A = LU, with the pivoting recorded in P), so Ax = LUx = b. DGESL: solves the system using the LU factorization: Ly = b, then Ux = y.
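To make the DGESL step concrete, here is a minimal C sketch of the two triangular solves it performs, assuming the factors are stored in a single n x n array (unit-diagonal L below the diagonal, U on and above it) and ignoring the pivot bookkeeping that the real routine handles; the function name lu_solve is only illustrative:

    /* Solve Ly = b (forward substitution), then Ux = y (back substitution). */
    /* b is overwritten with the solution x.                                 */
    void lu_solve(int n, double a[n][n], double b[n])
    {
        for (int i = 0; i < n; i++)          /* forward: Ly = b              */
            for (int j = 0; j < i; j++)
                b[i] -= a[i][j] * b[j];

        for (int i = n - 1; i >= 0; i--) {   /* backward: Ux = y             */
            for (int j = i + 1; j < n; j++)
                b[i] -= a[i][j] * b[j];
            b[i] /= a[i][i];
        }
    }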

LINPACK1 vs. HPL. LINPACK1: single processor, uses Level 1 BLAS, slower, low complexity. HPL: multiple processors, uses Level 3 BLAS, faster, high complexity. FPGA implementation: BLAS3 performs faster on processors (due to locality of reference), but FPGAs do not take advantage of BLAS3, so LINPACK1 is chosen.

LINPACK Pseudo-Code:
1. Randomly generate matrix A and vector b.
2. Execute the DGEFA routine (A = LU); IDAMAX, DSCAL and DAXPY are executed here.
3. Execute the DGESL routine (LUx = b).
4. Verify the result using a residual calculation.
Performance is measured from step 2 to step 3 (inclusive). How is this going to be parallelized?
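As a rough C sketch of that flow: the routine names mirror the LINPACK ones (matgen, dgefa, dgesl) and second() stands for a wall-clock timer, but the exact signatures and array sizes here are illustrative rather than copied from the benchmark source:

    double a[200*200], b[200];
    int    ipvt[200], n = 100, lda = 200, info;
    double t0, elapsed, flops, mflops;

    matgen(a, lda, n, b);                /* 1. random A and b                    */
    t0 = second();
    dgefa(a, lda, n, ipvt, &info);       /* 2. A = LU; IDAMAX, DSCAL, DAXPY here */
    dgesl(a, lda, n, ipvt, b, 0);        /* 3. solve LUx = b                     */
    elapsed = second() - t0;             /* only steps 2 and 3 are timed         */
    /* 4. the residual of Ax - b is then checked (not timed)                     */
    flops  = (2.0/3.0)*n*n*n + 2.0*n*n;  /* operation count used by LINPACK      */
    mflops = flops / (elapsed * 1.0e6);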

Parallelizing LINPACK: find the focus of parallelization, which is DGEFA (profile: DGEFA about 95%, the rest about 5%).

DGEFA Analysis: inside DGEFA are IDAMAX, DSCAL and DAXPY; DAXPY is the main computation (profile: DAXPY about 90%, the others about 5%).

TMD-MPI: TMD-MPI is a lightweight implementation of the MPI protocol (Message Passing Interface). TMD-MPE is a hardware implementation of TMD-MPI's main functionality (SEND and RECV). (Diagram: ranks connected over the MPI network.)
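Since TMD-MPI implements a subset of the standard MPI interface, ranks exchange data with the familiar send/receive calls; the sketch below uses standard MPI C calls, and the tag value, message length and function name exchange_column are purely illustrative:

    #include "mpi.h"

    void exchange_column(double *col, int len, int rank)
    {
        MPI_Status status;

        if (rank == 0)          /* rank 0 sends a column to rank 1 */
            MPI_Send(col, len, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)     /* rank 1 receives it              */
            MPI_Recv(col, len, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
    }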

DGEFA Parallelization:
1. Generate matrix A and vector b (main rank).
2. (MPI) Distribute the matrix.
3. Perform DGEFA (main loop; see the sketch after this list): perform IDAMAX and DSCAL, (MPI) broadcast the scaled column and pivot, then perform the loop that contains DAXPY.
4. (MPI) Gather the matrix back (main rank).
5. Perform DGESL.
6. Calculate the residual.
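A hedged sketch of the parallel main loop described above, written against the standard MPI API. The cyclic column ownership, the packed pivot-plus-column message, the MAX_N bound and the use of MPI_Bcast are illustrative assumptions rather than the exact TMD-MPI implementation; every rank is assumed to hold a full column-major copy of A (leading dimension lda) but to update only the columns it owns, and the zero-pivot guard of the serial code is omitted for brevity:

    #include <math.h>
    #include "mpi.h"

    #define MAX_N 1024                       /* sketch only: bounds the message buffer */

    void parallel_dgefa(double *a, int lda, int n, int *ipvt, int rank, int size)
    {
        double msg[MAX_N + 1];               /* msg[0] = pivot row, msg[1..] = scaled column */

        for (int k = 0; k < n - 1; k++) {
            int owner = k % size;            /* assumed cyclic column distribution */
            int len   = n - k - 1;

            if (rank == owner) {
                /* IDAMAX: find the pivot row in column k */
                int p = k;
                for (int i = k + 1; i < n; i++)
                    if (fabs(a[i + k*lda]) > fabs(a[p + k*lda])) p = i;
                ipvt[k] = p;

                /* swap the pivot into place, then DSCAL the sub-column by -1/pivot */
                double tmp = a[p + k*lda]; a[p + k*lda] = a[k + k*lda]; a[k + k*lda] = tmp;
                double t = -1.0 / a[k + k*lda];
                for (int i = 0; i < len; i++) a[k + 1 + i + k*lda] *= t;

                msg[0] = (double)p;
                for (int i = 0; i < len; i++) msg[1 + i] = a[k + 1 + i + k*lda];
            }

            /* broadcast the pivot index and scaled column to every rank */
            MPI_Bcast(msg, len + 1, MPI_DOUBLE, owner, MPI_COMM_WORLD);
            int p = (int)msg[0];

            /* DAXPY updates, each rank touching only the columns it owns */
            for (int j = k + 1; j < n; j++) {
                if (j % size != rank) continue;
                double t = a[p + j*lda]; a[p + j*lda] = a[k + j*lda]; a[k + j*lda] = t;
                for (int i = 0; i < len; i++)
                    a[k + 1 + i + j*lda] += msg[1 + i] * t;
            }
        }
    }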

LINPACK Engine (block diagram): the engine connects to the on-chip network through the TMD-MPE over command and data FSLs; internally it contains a Main FSM, an MPE Header FSM, a BLAS1 Engine and a data RAM, tied together by control signals.

BLAS1 Engine Performs IDAMAX, DSCAL and DAXPY

IDAMAX: finds max(v1) and returns its index.

DSCAL: performs v2 = α·v1.

DAXPY: calculates v3 = α·v1 + v2.
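For reference, the three kernels are small enough to state directly in C; this is a plain unit-stride sketch (the real BLAS routines also take stride arguments, and IDAMAX compares absolute values):

    #include <math.h>

    /* IDAMAX: index of the element of v with the largest absolute value */
    int idamax(int n, const double *v)
    {
        int imax = 0;
        for (int i = 1; i < n; i++)
            if (fabs(v[i]) > fabs(v[imax])) imax = i;
        return imax;
    }

    /* DSCAL: v = alpha * v */
    void dscal(int n, double alpha, double *v)
    {
        for (int i = 0; i < n; i++) v[i] *= alpha;
    }

    /* DAXPY: y = alpha * x + y */
    void daxpy(int n, double alpha, const double *x, double *y)
    {
        for (int i = 0; i < n; i++) y[i] += alpha * x[i];
    }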

Hardware - BEE2 Board

Device Utilization (XC2VP70): about 34% is dedicated to the network. (Table with columns Cores, 4-Input LUTs, Number of Occurrences, Total 4-Input LUTs and Total (%), listing the LINPACK Engine, TMD-MPE, NetIf, PLB-MPE, FSLs and FSL2IC, grouped into network cores and compute cores.)

Methods of Analysis: Method 1 – simulation (ModelSim waveforms); Method 2 – PPC timer (timing the C code running on the PowerPC); Method 3 – TMD-Profiler (an external profiler used to analyze the engines).

Processor vs. FPGA: the most important portion is DGEFA, so DGEFA alone is benchmarked with n = 100. Processor performance: 315 MFLOPS. FPGA performance (6 engines): 379 MFLOPS. Performance of 1 engine: 123 MFLOPS.

Engines Speedup (chart: speedup versus number of engines, plotted for FPGA 1 and FPGA 2).

Problem: the engines' computation time is being surpassed by communication or idle time. The TMD-Profiler can be used to track down the problem (shown for 8 engines).

TMD-Profiler trace (phases labelled SEND, RECV and COMP for the IDAMAX & DSCAL, broadcast and DAXPY stages).

Scaled Problem Size (chart: results with a scaled problem size, for FPGA 1 and FPGA 2).

Why “super” speedup? As the matrix grows, the columns get longer. Since each engine holds exactly the same amount of data, its number of columns decreases, so the same data is sent in fewer, larger messages and the per-message latency is paid fewer times (communication cost of 2 x Latency + 20 with two messages versus 4 x Latency + 20 with four messages carrying the same total data).

New Speedup: with a matrix size of 195 x 195, the performance of 6 engines (one FPGA) is 628 MFLOPS and the performance of one processor is 324 MFLOPS, so the speedup of the FPGA over the processor is 1.94x.

Newer Technology: the maximum theoretical peak performance of an engine in the Virtex-II Pro is 200 MFLOPS. Newer FPGAs are larger and faster: the estimated peak performance of a 20-engine network on a Virtex 5 LX330 is 4000 MFLOPS. The theoretical speedup compared to a processor is 11.4x; compared to HPL, the estimated speedup is 4.4x.

Scaling to Larger Systems: LINPACK is meant to run on large multiprocessor systems, but computer networks suffer from high latency. The tighter coupling and lighter protocol used in this FPGA system have the potential to scale.

Conclusions: TMD-MPE was used to parallelize LINPACK. The hardware engine's disadvantage is that it is expensive in terms of device utilization; its advantage is higher flexibility. The maximum speedup of the engines over a processor is 1.9x. Newer FPGAs have better chances of outperforming processors (est. 4000 MFLOPS for the Virtex 5 LX330). Multi-FPGA systems have good scalability potential due to their low latencies.

Future Work: include DDR memory; improve the broadcast method (e.g. to a tree approach); optimize the DAXPY flow; replicate the DAXPY flow inside each engine; explore newer technologies and scalability.

Thank You (Questions?)

Additional Slides

DGEFA Code (idamax, dscal and daxpy are the BLAS 1 functions; most of the time is spent in loop j):

    /* dgefa(*A[][], *ipvt[]) */
    for (k = 0; k <= n-2; k++) {                      /* loop k      */
        pivot = idamax(A[k][k]) + k;                  /* loop idamax */
        ipvt[k] = pivot;
        if (A[pivot][k] != 0) {
            t = -1/(A[pivot][k]);
            swap(&A[pivot][k], &A[k][k]);
            dscal(&A[k+1][k], t);                     /* loop dscal  */
            for (j = k+1; j <= n-1; j++) {            /* loop j      */
                t = A[pivot][j];
                swap(&A[pivot][j], &A[k][j]);
                daxpy(&A[k+1][j], A[k+1][k], t);      /* loop daxpy  */
            }
        }
    }

MPE Protocol

LINPACK Report

Opcode TAG

Matrix Distribution: considering an n x n matrix and 3 ranks, the figure shows the columns of A (indexed 0, 1, ..., n-3, n-2, ...) assigned among Rank 0, Rank 1 and Rank 2.
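The exact assignment pattern is not recoverable from the figure; a minimal sketch assuming a cyclic (round-robin) column distribution, which matches the per-column broadcast scheme earlier (the helper name column_owner is hypothetical):

    /* Hypothetical helper: owner rank of column j under a cyclic distribution.  */
    /* With 3 ranks, rank 0 would own columns 0, 3, 6, ..., rank 1 columns 1, 4, */
    /* 7, ..., and rank 2 columns 2, 5, 8, ...                                   */
    static int column_owner(int j, int nranks)
    {
        return j % nranks;
    }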

Processor vs. LINPACK Engine: whole LINPACK benchmark with n = 100. Performance: processor 319 MFLOPS, LINPACK Engine 164 MFLOPS.

IDAMAX

DSCAL

DAXPY

FLOPS

16 Engines