The LINPACK Benchmark on a Multi-Core Multi-FPGA System by Emanuel Ramalho Supervisor: Prof. Paul Chow University of Toronto Electrical and Computer Engineering.

1 The LINPACK Benchmark on a Multi-Core Multi-FPGA System by Emanuel Ramalho Supervisor: Prof. Paul Chow University of Toronto Electrical and Computer Engineering Department October 1st, 2008

2 Outline: Motivation; LINPACK Algorithm; Parallelizing LINPACK; Results; Conclusions; Future Work

3 Motivation: The LINPACK Benchmark is used to rank the Top500 computers in the world. Can FPGAs compete?

4 Objective: To see how well a multi-core multi-FPGA system performs when compared to a processor. FPGA disadvantage: much lower clock rate. FPGA advantage: the total implementation may be done in hardware.

5 LINPACK Algorithm: Solves a system of linear equations, Ax = b, by calling two routines: DGEFA and DGESL. DGEFA performs LU factorization with partial pivoting: A = LUP, so Ax = LUx = b. DGESL solves the system using the LU factorization: first Ly = b, then Ux = y.
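
The two triangular solves on this slide (Ly = b, then Ux = y) can be sketched in C as follows. This is a minimal illustration, not the DGESL routine itself: the function name `lu_solve` and the row-major packed layout (L's multipliers below the diagonal with an implied unit diagonal, U on and above it) are assumptions made for the example, and pivoting is left out for brevity.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: solve L*U*x = b in place.
 * lu is an n x n row-major array; the strict lower triangle holds the
 * multipliers of L (unit diagonal implied), the upper triangle holds U.
 * On return, b holds x. Pivoting is omitted in this sketch.
 */
void lu_solve(size_t n, const double *lu, double *b)
{
    assert(lu != NULL && b != NULL);

    /* Forward substitution: L*y = b (y overwrites b). */
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < i; j++) {
            b[i] -= lu[i * n + j] * b[j];
        }
    }

    /* Back substitution: U*x = y (x overwrites b). */
    for (size_t i = n; i-- > 0; ) {
        for (size_t j = i + 1; j < n; j++) {
            b[i] -= lu[i * n + j] * b[j];
        }
        b[i] /= lu[i * n + i];
    }
}
```

For example, A = [[4, 3], [6, 3]] factors without pivoting into L = [[1, 0], [1.5, 1]] and U = [[4, 3], [0, -1.5]]; passing the packed array {4, 3, 1.5, -1.5} with b = {10, 12} leaves x = {1, 2} in b.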

6 LINPACK1 vs. HPL. LINPACK1: single processor; uses Level 1 BLAS; slower; low complexity. HPL: multiple processors; uses Level 3 BLAS; faster; high complexity. FPGA implementation: BLAS3 performs faster on processors (due to locality of reference); FPGAs do not take advantage of BLAS3, so LINPACK1 is chosen.

7 LINPACK Pseudo-Code: 1. Random generation of matrix A and vector b. 2. Execute DGEFA routine (A = LU); IDAMAX, DSCAL and DAXPY are executed here. 3. Execute DGESL routine (LUx = b). 4. Verify the result using residual calculation. Performance is measured from step 2 to step 3 (inclusive). How is this going to be parallelized?
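
The MFLOPS figures quoted later in the talk come from timing steps 2 and 3. A minimal sketch of that conversion is shown below, assuming the conventional LINPACK operation count of 2n³/3 + 2n² (the first term for the factorization, the second for the solve). The slides do not state which count was used, so both the formula choice and the function names here are assumptions.

```c
#include <assert.h>

/* Conventional LINPACK floating-point operation count for an n x n system. */
double linpack_flop_count(double n)
{
    return (2.0 * n * n * n) / 3.0 + 2.0 * n * n;
}

/* Convert a measured wall-clock time (seconds) for DGEFA + DGESL to MFLOPS. */
double linpack_mflops(double n, double seconds)
{
    assert(seconds > 0.0);
    return linpack_flop_count(n) / seconds / 1.0e6;
}
```

With n = 100 the count is roughly 686,667 operations, so a hypothetical run time of 2 ms would correspond to about 343 MFLOPS.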

8 Parallelizing LINPACK: Find the focus of parallelization: DGEFA. [Chart labels: 5%, 95%]

9 DGEFA Analysis: Inside DGEFA: IDAMAX, DSCAL and DAXPY. DAXPY is the main computation. [Chart labels: 5%, 90%]

10 TMD-MPI: TMD-MPI is a lightweight implementation of the MPI (Message Passing Interface) protocol. TMD-MPE is a hardware implementation of TMD-MPI's main functionality (SEND and RECV). [Diagram label: MPI Network]

11 DGEFA Parallelization: Generate matrix A and vector b (main rank). (MPI) Matrix distribution. Perform DGEFA (main loop): perform IDAMAX and DSCAL; (MPI) broadcast scaled column and pivot; perform loop that contains DAXPY. (MPI) Matrix gather. (Main rank) Perform DGESL. Calculate residual.

12 LINPACK Engine [Block diagram: the LINPACK Engine connects to the on-chip network through the TMD-MPE via command FSLs and data FSLs; blocks shown are Main FSM, MPE Header FSM, BLAS1 Engine and RAM, linked by control signals and data]

13 BLAS1 Engine Performs IDAMAX, DSCAL and DAXPY

14 IDAMAX: Finds max(v_1) and returns its index
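
As a software reference for what the engine computes here, a minimal C sketch follows. In the reference BLAS, IDAMAX selects the element of largest absolute value and returns a 1-based index; the hypothetical `idamax0` below returns a 0-based index instead, which is an assumption made to suit C arrays.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical 0-based variant of IDAMAX: index of the first element
 * with the largest absolute value in v[0..n-1]. */
size_t idamax0(size_t n, const double *v)
{
    assert(n > 0 && v != NULL);

    size_t best = 0;
    double best_abs = v[0] < 0.0 ? -v[0] : v[0];

    for (size_t i = 1; i < n; i++) {
        double a = v[i] < 0.0 ? -v[i] : v[i];
        if (a > best_abs) {   /* strict '>' keeps the first maximum */
            best_abs = a;
            best = i;
        }
    }
    return best;
}
```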

15 DSCAL: Performs v_2 = α·v_1
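
A minimal C sketch of the scaling operation is given below. The slide writes the result as a separate vector; the sketch overwrites the input in place, as the reference BLAS routine does, and the unit-stride signature is a simplifying assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Scale v[0..n-1] by alpha in place (unit stride assumed). */
void dscal(size_t n, double alpha, double *v)
{
    assert(v != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        v[i] *= alpha;
    }
}
```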

16 DAXPY: Calculates v_3 = α·v_1 + v_2
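
The update can be sketched in C as below. As with the reference BLAS routine, this version accumulates into the second vector rather than producing a third one; the unit-stride signature is an assumption for the example.

```c
#include <assert.h>
#include <stddef.h>

/* y[i] = alpha * x[i] + y[i] for i in 0..n-1 (unit stride assumed). */
void daxpy(size_t n, double alpha, const double *x, double *y)
{
    assert((x != NULL && y != NULL) || n == 0);
    for (size_t i = 0; i < n; i++) {
        y[i] += alpha * x[i];
    }
}
```

Each element costs one multiply and one add, which is why this loop dominates the operation count inside DGEFA.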

17 Hardware - BEE2 Board

18 Device Utilization (XC2VP70): About 34% is dedicated to the network. [Table; columns: Cores, 4-Input LUTs, Number of Occurrences, ~Total 4-Input LUTs, Total (%); rows: LINPACK Engine, TMD-MPE, NetIf, PLB-MPE, FSLs, FSL2IC, NETWORK, CORES; cell values are not present in the transcript]

19 Methods of Analysis. Method 1 – Simulation: ModelSim waveform. Method 2 – PPC Timer: measuring the time through the C code on the PPC. Method 3 – TMD-Profiler: using an external profiler to analyze the engines.

20 Processor vs FPGA: Most important portion is DGEFA. DGEFA benchmark with n = 100. Processor performance: 315 MFLOPS. FPGA performance (6 engines): 379 MFLOPS. Performance of 1 engine: 123 MFLOPS.

21 Engines Speedup [Plot; region labels: FPGA 1, FPGA 2]

22 Problem: The engines' computation time is being surpassed by either communication or idle time. TMD-Profiler can be used to track the problem. [Figure label: For 8 Engines]

23 TMD-Profiler [Profiler trace; phases: IDAMAX & DSCAL, Broadcast, DAXPY; legend: SEND, RECV, COMP]

24 Scaled Problem Size [Plot; region labels: FPGA 1, FPGA 2]

25 Why “super” speedup? As the matrix grows, the size of each column also increases. Since each engine has exactly the same amount of data, the number of columns decreases. [Figure annotations, left-hand sides missing from the transcript: "= 2 x Latency + 20" and "= 4 x Latency + 20"]

26 New Speedup: With a matrix size of 195 x 195, performance of 6 engines (one FPGA): 628 MFLOPS; performance of one processor: 324 MFLOPS. Speedup of FPGA over processor is 1.94x.
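
The quoted speedup follows directly from the two measurements on this slide. The per-engine average on the second line is derived here for illustration and is not a figure from the talk.

```latex
\[
S = \frac{P_{\text{FPGA}}}{P_{\text{CPU}}}
  = \frac{628\ \text{MFLOPS}}{324\ \text{MFLOPS}} \approx 1.94
\]
\[
\frac{628\ \text{MFLOPS}}{6\ \text{engines}} \approx 104.7\ \text{MFLOPS per engine}
\]
```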

27 Newer Technology: Max theoretical peak performance of an engine in V2Pro is 200 MFLOPS. Newer FPGAs are larger and faster. Estimated peak performance for an engine network (20 engines) on a Virtex 5 LX330: 4000 MFLOPS. Theoretical speedup, compared to a processor, is 11.4x. Compared to HPL, estimated speedup is 4.4x.
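
One way to read these numbers, offered as a back-calculation rather than the author's stated derivation: the 4000 MFLOPS estimate matches 20 engines each running at the 200 MFLOPS peak quoted for the V2Pro, and the two speedup factors then imply the baselines they were compared against.

```latex
\[
20 \times 200\ \text{MFLOPS} = 4000\ \text{MFLOPS}
\]
\[
\frac{4000\ \text{MFLOPS}}{11.4} \approx 351\ \text{MFLOPS (implied processor baseline)}
\qquad
\frac{4000\ \text{MFLOPS}}{4.4} \approx 909\ \text{MFLOPS (implied HPL baseline)}
\]
```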

28 Scaling to Larger Systems: LINPACK is meant to run on large multi-processor systems. Computer networks suffer from high latency. The tighter coupling and lighter protocol used in this FPGA system have the potential to scale.

29 Conclusions: TMD-MPE was used to parallelize the LINPACK hardware engine. Disadvantage: expensive in terms of device utilization. Advantage: higher flexibility. Max speedup of engines over a processor is 1.9x. Newer FPGAs have better chances of outperforming processors (est. 4000 MFLOPS for Virtex 5 LX330). Multi-FPGA systems have good scalability potential due to low latencies.

30 Future Work: Include DDR memory; improve broadcast method (e.g. to a tree approach); optimize DAXPY flow; replicate DAXPY flow inside each engine; explore newer technologies and scalability.

31 Thank You (Questions?)

32 Additional Slides

33 DGEFA Code
/* dgefa(*A[][], *ipvt[]) */
for (k = 0 : n-2)                                  (loop k)
    pivot = idamax(A[k][k]) + k;                   (loop idamax)
    ipvt[k] = pivot;
    if (A[pivot][k] != 0)
        t = -1/(A[pivot][k]);
        swap(&A[pivot][k], &A[k][k]);
        dscal(&A[k+1][k], t);                      (loop dscal)
        for (j = k+1 : n-1)                        (loop j)
            t = A[pivot][j];
            swap(&A[pivot][j], &A[k][j]);
            daxpy(&A[k+1][j], A[k+1][k], t);       (loop daxpy)
idamax, dscal and daxpy are BLAS 1 functions. Most of the time is spent in the daxpy loop.
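
A compilable C sketch of the DGEFA pseudocode above is given below for reference. It is an illustration under stated assumptions, not the code run in the talk: the name `dgefa_sketch`, the column-major flat array, the `AT` accessor macro and the return-value convention are all introduced here. As in the pseudocode, the multipliers are stored negated so that the inner loop is a plain DAXPY.

```c
#include <assert.h>
#include <stddef.h>

/* Column-major accessor (assumed layout): row r, column c of an n x n matrix. */
#define AT(a, n, r, c) ((a)[(c) * (n) + (r)])

static double absd(double x) { return x < 0.0 ? -x : x; }

/*
 * LU factorization with partial pivoting, in place.
 * ipvt[k] records the pivot row chosen at step k.
 * Returns 0 if every pivot was non-zero, otherwise the 1-based index
 * of the last column where a zero pivot was found.
 */
int dgefa_sketch(size_t n, double *a, size_t *ipvt)
{
    assert(n > 0 && a != NULL && ipvt != NULL);
    int info = 0;

    for (size_t k = 0; k + 1 < n; k++) {
        /* IDAMAX: find the row with the largest |a(i,k)| for i >= k. */
        size_t pivot = k;
        double best = absd(AT(a, n, k, k));
        for (size_t i = k + 1; i < n; i++) {
            if (absd(AT(a, n, i, k)) > best) {
                best = absd(AT(a, n, i, k));
                pivot = i;
            }
        }
        ipvt[k] = pivot;

        if (AT(a, n, pivot, k) == 0.0) {   /* column already eliminated */
            info = (int)k + 1;
            continue;
        }

        /* Swap the pivot into the diagonal position. */
        double t = AT(a, n, pivot, k);
        AT(a, n, pivot, k) = AT(a, n, k, k);
        AT(a, n, k, k) = t;

        /* DSCAL: store negated multipliers below the diagonal. */
        t = -1.0 / AT(a, n, k, k);
        for (size_t i = k + 1; i < n; i++) {
            AT(a, n, i, k) *= t;
        }

        /* Row elimination, one column at a time. */
        for (size_t j = k + 1; j < n; j++) {
            t = AT(a, n, pivot, j);
            AT(a, n, pivot, j) = AT(a, n, k, j);
            AT(a, n, k, j) = t;
            /* DAXPY: column j += t * (multiplier column k). */
            for (size_t i = k + 1; i < n; i++) {
                AT(a, n, i, j) += t * AT(a, n, i, k);
            }
        }
    }

    ipvt[n - 1] = n - 1;
    if (AT(a, n, n - 1, n - 1) == 0.0) {
        info = (int)n;
    }
    return info;
}
```

For A = [[2, 1], [4, 3]] (stored column-major as {2, 4, 1, 3}), the routine picks row 1 as the pivot and leaves {4, -0.5, 3, -0.5}: U = [[4, 3], [0, -0.5]] with the negated multiplier -0.5 below the diagonal.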

34 MPE Protocol

35 LINPACK Report

36 Opcode TAG

37 Matrix Distribution: Considering an n x n matrix and 3 ranks. [Figure: matrix columns labeled "0 1 ... n-3 n-2 n" as extracted, each marked as assigned to Rank 0, Rank 1 or Rank 2]
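
The transcript does not preserve which columns go to which rank. Purely as an illustration, the sketch below assumes a cyclic (round-robin) column distribution, which is one common way to give every rank nearly the same amount of data; the scheme and the function names `owner_rank` and `columns_owned` are assumptions, not taken from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* ASSUMPTION: cyclic distribution, column c belongs to rank c mod nranks. */
size_t owner_rank(size_t col, size_t nranks)
{
    assert(nranks > 0);
    return col % nranks;
}

/* Number of columns of an n-column matrix held by a given rank
 * under the assumed cyclic distribution. */
size_t columns_owned(size_t n, size_t rank, size_t nranks)
{
    assert(nranks > 0 && rank < nranks);
    return n / nranks + (rank < n % nranks ? 1 : 0);
}
```

With n = 100 and 3 ranks, this gives 34 columns to rank 0 and 33 each to ranks 1 and 2.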

38 Processor vs. LINPACK Engine: Whole LINPACK benchmark with n = 100. Performance: processor 319 MFLOPS; LINPACK Engine 164 MFLOPS.

39 IDAMAX

40 DSCAL

41 DAXPY

42 FLOPS

43 16 Engines

