
IEEE Globecom-2006, NXG-02: Broadband Access ©Copyright 2005-2006 All Rights Reserved 1 FPGA based Acceleration of Linear Algebra Computations. B.Y. Vinay.


1 FPGA based Acceleration of Linear Algebra Computations
B.Y. Vinay Kumar, Siddharth Joshi, Sumedh Attarde, Prof. Sachin Patkar, Prof. H. Narayanan
IEEE Globecom-2006, NXG-02: Broadband Access. ©Copyright 2005-2006 All Rights Reserved

2 Outline
• Double Precision Dense Matrix-Matrix Multiplication
  • Motivation
  • Related Work
  • Algorithm
  • Design
  • Results
  • Conclusions
• Double Precision Sparse Matrix-Vector Multiplication
  • Introduction
  • Prasanna
  • DeLorimier
  • David Gregg et al.
  • What can we do?

3 FPGA based Double Precision Dense Matrix-Matrix Multiplication

4 Motivation
• FPGAs have been making inroads into high-performance computing (HiPC).
• BLAS-3 routines can be accelerated by accelerating matrix multiplication.
• Modern FPGAs provide an abundance of resources, and we must capitalise on these.

5 Related Work {1/2}
• The two main works are by Dou and Prasanna. Both are based on linear arrays, both use memory switching, and both sustain their peak performance.
• Dou:
  • Optimised for a large Virtex-II Pro device (Xilinx).
  • Created his own MAC (not fully compliant).
  • Sub-block dimensions must be powers of 2.
  • Optimised for low I/O bandwidth.

6 Related Work {2/2}
• Prasanna:
  • Scaling results in a speed degradation of about 35% (2 PEs to 20 PEs).
  • 2.1 GFLOPS on a Cray XD1 with Virtex-II Pros (XC2VP50).
  • For the design only (XC2VP125), they report 15% clock degradation from 2 to 24 PEs.
    » They state that they have not made any platform-specific optimisations for the implemented design.

7 Algorithm
1. Broadcast 'A'; keep a unique 'B' per PE.
2. Multiply, and put the operands into the multiplier pipeline.
3. The multiplier output is fed directly to the Adder + RAM (accumulator).
4. When the updated C values are ready, read them out.
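The four steps above can be sketched in software as follows. This is only an illustrative model, not the authors' hardware design: the function name `rank_one_matmul`, the parameter `num_pes`, and the mapping of one column of B to each PE are assumptions made for the example.

```python
def rank_one_matmul(A, B, num_pes):
    """Software model of a broadcast-A / unique-B-per-PE multiply.

    Assumption (not stated on the slide): PE j owns column j of B and
    accumulates column j of C in its own local RAM.
    """
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    # Columns of B are processed in groups of num_pes, one column per PE.
    for j0 in range(0, p, num_pes):
        cols = range(j0, min(j0 + num_pes, p))
        acc = {j: [0.0] * n for j in cols}        # per-PE accumulator RAM
        for k in range(m):                        # one rank-one update per k
            b_local = {j: B[k][j] for j in cols}  # step 1: unique B per PE
            for i in range(n):
                a = A[i][k]                       # step 1: A is broadcast
                for j in cols:
                    # steps 2-3: multiply, then accumulate in local RAM
                    acc[j][i] += a * b_local[j]
        for j in cols:                            # step 4: read out C
            for i in range(n):
                C[i][j] = acc[j][i]
    return C


print(rank_one_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]], num_pes=2))
# prints [[19.0, 22.0], [43.0, 50.0]]
```

In this model each PE touches only its own accumulator, which is why adding PEs does not add inter-PE communication.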

8 Design-I

9 Design-II

10 FPGA Synthesis/PAR data {1/2}

Table: Resource Utilisation for SX95T and SX240T (post PAR)

PE           DSP48Es  FIFO  BRAM  Slice Reg  Slice LUT
1                 16     1     2       2511       1374
4                 64     4     8      10377       5451
8                128     8    16      20865      10886
16               256    16    32      41841      21750
20 (SX240)       320    20    40      52329      27176
40 (SX240)       640    40    80     103335      53914

Table: Clock Speed in MHz for the overall design for different numbers of PEs

Device / PE     1     4     8    16    19    20     40
SX95T-3       377   374   373         372   201      -
SX240T-2      374   373   344     -     -   372  371.7

11 FPGA Synthesis/PAR data {2/2}

Table: Resource Utilisation for Virtex-II Pro XC2VP100 (post PAR)

             15 PE          20 PE
MULT18x18    240 (54%)      304 (68%)
RAMB16s      90 (20%)       114 (26%)
Slices       30218 (68%)    37023 (83%)
Speed        133.94 MHz     133.79 MHz

12 Conclusions
• We propose a variation of the rank-one update algorithm for matrix multiplication.
• We introduce a scalable processing element for this algorithm, targeted at a Virtex-5 SX240T FPGA.
• The two designs clearly show the effect of local storage on I/O bandwidth.
• The design achieves a clock speed of 373 MHz with 40 PEs and a sustained performance of 29.8 GFLOPS on a single FPGA. We also achieve 5.3 GFLOPS on an XC2VP100.
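The reported GFLOPS figures can be checked with simple arithmetic. The sketch below assumes that each PE completes one multiply and one add (2 floating-point operations) per clock cycle; that rate is not stated on the slides but is consistent with the numbers they report. The function name `peak_gflops` is hypothetical.

```python
def peak_gflops(num_pes, clock_mhz, flops_per_pe_per_cycle=2):
    """Peak throughput in GFLOPS.

    Assumption: one multiply-accumulate (2 flops) per PE per cycle.
    """
    return num_pes * flops_per_pe_per_cycle * clock_mhz / 1000.0


# Virtex-5 SX240T: 40 PEs at 373 MHz
print(round(peak_gflops(40, 373), 2))     # prints 29.84

# Virtex-II Pro XC2VP100: 20 PEs at 133.79 MHz
print(round(peak_gflops(20, 133.79), 2))  # prints 5.35
```

Both values agree with the 29.8 GFLOPS and 5.3 GFLOPS quoted in the conclusions, which suggests the designs sustain close to their peak rate.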

13 FPGA based Double Precision Sparse Matrix-Vector Multiplication

14 Introduction
• There are three main papers we will be looking at:
  • Viktor Prasanna: hybrid method using HLL + S/W + HDL.
  • Michael DeLorimier: maximum performance, but unrealistic.
  • David Gregg et al.: most realistic assumptions with respect to DRAM.

15 Prasanna
• Uses pre-existing IP cores, specifically for an iterative solver (CG).
• A 4-input reduction circuit does the dot product and produces partial sums as output.
• An adder loop with an array does the summation of the dot product; it is created using HLL.
• The reduction circuit at the end uses a B-Tree to create the final value.
• IPs are available.
• DRAM is looked at, but not realistically.
• The order of the matrices is small.
• DRAM is the bottleneck.
• With their IPs they have a good architecture; however, one could change the IP and modify the datapath, e.g. with Dou's MAC.

16 DeLorimier
• Uses BRAMs for everything.
• Used for an iterative solver, specifically CG.
• The MAC requires interleaving.
• They do load balancing in their partitioner, which requires a communication stage and is very matrix/partitioner dependent.
• Communication is the bottleneck.
• Performance: 750 MFLOPS per processor
  • 16 Virtex II 6000s
  • Each has 5 PEs + 1 CE

17 David Gregg et al. (SPAR)
• They only report the use of the SPAR architecture for FPGAs.
• They use very pessimistic DRAM access times. The emphasis is on cache-miss removal.
• They do not use their Block RAMs well; maybe something interesting can be done here.
• 128 MFLOPS for 3 parallel SPAR units; with cache misses removed, the peak is 570 MFLOPS.

18 What can we do?
• Both use CSR. This is not required, so why not modify the representation?
• Two approaches (we can try both simultaneously):
  • Prasanna: split across dot products (same row, many PEs).
  • DeLorimier: split across rows (many rows, one PE).
• Use data from SPAR, which is a viable approach. Both do zero multiplies; we get away with one zero multiply per column.
• Minimise communication or overlap it. We can use interleaving for this: while one stage computes, the previous one communicates.
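For reference, a minimal sketch of sparse matrix-vector multiplication over the CSR (compressed sparse row) format with a row-wise split across PEs is given below. The function name `csr_spmv_rows`, the parameter `num_pes`, and the cyclic assignment of rows to PEs are assumptions for illustration; the surveyed papers use their own partitioners.

```python
def csr_spmv_rows(values, col_idx, row_ptr, x, num_pes=2):
    """Compute y = A @ x for a CSR matrix, splitting rows across PEs.

    Assumption: rows are dealt out cyclically (row r goes to PE r % num_pes).
    The PE loop runs sequentially here; in hardware the PEs would work in
    parallel, each on its own rows.
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for pe in range(num_pes):
        for row in range(pe, n_rows, num_pes):
            total = 0.0
            # Non-zeros of this row occupy values[row_ptr[row]:row_ptr[row+1]].
            for k in range(row_ptr[row], row_ptr[row + 1]):
                total += values[k] * x[col_idx[k]]
            y[row] = total
    return y


# A = [[1, 0, 2],
#      [0, 3, 0],
#      [4, 0, 5]]
values = [1, 2, 3, 4, 5]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
print(csr_spmv_rows(values, col_idx, row_ptr, [1, 2, 3]))
# prints [7.0, 6.0, 19.0]
```

Splitting by rows keeps every dot product local to one PE, so no reduction across PEs is needed; splitting a single row across PEs would instead require a final reduction of partial sums.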

19 Questions?

20 Thank You

