
1 FPGA vs. GPU for Sparse Matrix Vector Multiply Yan Zhang, Yasser H. Shalabi, Rishabh Jain, Krishna K. Nagar, Jason D. Bakos Dept. of Computer Science and Engineering University of South Carolina Heterogeneous and Reconfigurable Computing Group http://herc.cse.sc.edu This material is based upon work supported by the National Science Foundation under Grant Nos. CCF-0844951 and CCF-0915608

2 Sparse Matrix Vector Multiplication SpMV is used as a kernel in many methods –Iterative Principal Component Analysis (PCA) –Matrix decompositions: LU, SVD, Cholesky, QR, etc. –Iterative linear system solvers: CG, BCG, GMRES, Jacobi, etc. –Other matrix operations

3 Talk Outline GPU –Microarchitecture & Memory Hierarchy Sparse Matrix Vector Multiplication on GPU Sparse Matrix Vector Multiplication on FPGA Analysis of FPGA and GPU Implementations

4 NVIDIA GT200 Microarchitecture Many-core architecture –24 or 30 on-chip Streaming Multiprocessors (SMs) –8 Scalar Processors (SPs) per SM –Each SP can issue up to four threads –Warp: group of 32 threads sharing a common control path
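A tiny CUDA sketch of the warp idea from this slide; the kernel and its names are illustrative and not part of the presentation.

```cuda
// Illustrative kernel (not from the talk): each group of 32 consecutive threads
// forms a warp; a branch whose condition differs within a warp makes the warp
// "diverge", executing both sides one after the other.
__global__ void warp_demo(const int* in, int* out) {
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int warp = tid / 32;                       // threads 0-31 are warp 0, 32-63 warp 1, ...
    out[tid] = (in[tid] > 0) ? warp : -warp;   // potential divergence point
}
```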

5 GPU Memory Hierarchy Off-chip device memory –On board –Host and GPU exchange I/O data –GPU stores state data On-chip memories –A large set of 32-bit registers per processor –Shared memory –Constant cache (read only) –Texture cache (read only) [Figure: multiprocessors 1…n, each with constant and texture caches, backed by off-chip constant, texture, and device memory]
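A minimal CUDA sketch, with illustrative names, of where data can live in this hierarchy: device memory allocated by the host, a __constant__ array served by the constant cache, and a __shared__ tile inside the SM. The texture path is omitted.

```cuda
// Illustrative sketch (names assumed, not from the talk) of the memory hierarchy:
// off-chip device memory allocated by the host, a __constant__ array served by the
// constant cache, and an on-chip __shared__ tile visible to one thread block.
#include <cuda_runtime.h>

__constant__ float coeffs[256];                  // read-only, served by the constant cache

__global__ void scale(const float* in, float* out) {
    __shared__ float tile[128];                  // on-chip, shared by the thread block
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[tid];                 // stage a value from device memory
    __syncthreads();
    out[tid] = tile[threadIdx.x] * coeffs[threadIdx.x % 256];
}

int main() {
    float *d_in, *d_out;                         // off-chip device memory
    cudaMalloc(&d_in,  1024 * sizeof(float));
    cudaMalloc(&d_out, 1024 * sizeof(float));
    cudaMemset(d_in, 0, 1024 * sizeof(float));
    float h_coeffs[256] = {1.0f};
    cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));
    scale<<<8, 128>>>(d_in, d_out);              // 8 blocks of 128 threads = 1024 elements
    cudaDeviceSynchronize();
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```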

6 GPU Utilization and Throughput Metrics CUDA Profiler used to measure –Occupancy Ratio of active warps to the maximum number of active warps per SM Limiting Factors: –Number of registers –Amount of shared memory –Instruction count required by the threads Not an accurate indicator of SM utilization –Instruction Throughput Ratio of achieved instruction rate to peak instruction rate Limiting Factors: –Memory latency –Bank conflicts on shared memory –Inactive threads within a warp caused by thread divergence
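A toy host-side estimate of the occupancy metric defined on this slide. The per-SM limits are assumed GT200-class values (32 resident warps, 16 K registers, 16 KB shared memory), the per-SM block-count limit is ignored, and the function name is illustrative; the talk's numbers come from the CUDA Profiler, not from a formula like this.

```cuda
// Toy occupancy estimate: resident blocks per SM are limited by registers,
// shared memory, and warp slots; occupancy is active warps / max active warps.
#include <algorithm>
#include <cstdio>

double occupancy(int threads_per_block, int regs_per_thread, int smem_per_block) {
    const int max_warps_per_sm = 32;        // 1024 resident threads / 32 threads per warp
    const int regs_per_sm      = 16384;
    const int smem_per_sm      = 16384;     // bytes
    const int warps_per_block  = (threads_per_block + 31) / 32;

    int by_regs  = regs_per_sm / std::max(1, regs_per_thread * threads_per_block);
    int by_smem  = smem_per_sm / std::max(1, smem_per_block);
    int by_warps = max_warps_per_sm / warps_per_block;

    int blocks = std::min({by_regs, by_smem, by_warps});     // resident blocks per SM
    return (double)(blocks * warps_per_block) / max_warps_per_sm;
}

int main() {
    std::printf("occupancy = %.2f\n", occupancy(128, 16, 2048));   // prints 1.00
    return 0;
}
```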

7 Talk Outline GPU –Memory Hierarchy & Microarchitecture Sparse Matrix Vector Multiplication on GPU Sparse Matrix Vector Multiplication on FPGA Analysis of FPGA and GPU Implementations

8 Sparse Matrix Sparse matrices can be very large but contain few non-zero elements SpMV: Ax = b Need a special storage format –Compressed Sparse Row (CSR): val holds the non-zero values, col their column indices, and ptr the offset of each row's first non-zero [Figure: example 5x5 sparse matrix alongside its CSR val, col, and ptr arrays]
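A minimal host-side sketch of SpMV over CSR storage as just described. CsrMatrix and spmv_csr_host are illustrative names, and indices here are 0-based, whereas the slide's example figure uses 1-based ptr/col values.

```cuda
// b = A*x for a CSR matrix: walk each row's slice of val/col and accumulate.
#include <vector>

struct CsrMatrix {
    int n;                     // number of rows
    std::vector<double> val;   // non-zero values, stored row by row
    std::vector<int>    col;   // column index of each non-zero
    std::vector<int>    ptr;   // ptr[r] .. ptr[r+1]-1 index row r's non-zeros (size n+1)
};

void spmv_csr_host(const CsrMatrix& A, const std::vector<double>& x,
                   std::vector<double>& b) {
    for (int r = 0; r < A.n; ++r) {
        double sum = 0.0;
        for (int k = A.ptr[r]; k < A.ptr[r + 1]; ++k)
            sum += A.val[k] * x[A.col[k]];   // dot product of row r with x
        b[r] = sum;
    }
}
```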

9 GPU SpMV Multiplication State of the art –NVIDIA Research (Nathan Bell) –Ohio State University and IBM (Rajesh Bordawekar) Built on top of NVIDIA's SpMV CSR kernel, with memory-management optimizations added In general, performance depends on effective use of the GPU memories

10 OSU/IBM SpMV Matrix stored in device memory –Zero padding: each row is padded so its element count is a multiple of sixteen Input vector read through the SM's texture cache Shared memory stores the output vector Extracting global memory bandwidth –Instruction and variable alignment necessary (fulfilled by built-in types) –Global memory accesses by all threads of a half-warp are coalesced into a single transaction of 32, 64, or 128 bytes
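A hedged CUDA sketch in the spirit of the kernel described here: sixteen threads (a half-warp) cooperate on one row so their val/col loads coalesce, and shared memory holds the partial sums. This is not the authors' code; the texture fetch of the input vector and the zero-padding are omitted, all names are illustrative, and a block size of 128 threads is assumed.

```cuda
// Half-warp-per-row CSR SpMV sketch (assumes blockDim.x == 128, so 8 rows per block).
__global__ void spmv_csr_halfwarp(int n_rows,
                                  const double* __restrict__ val,
                                  const int*    __restrict__ col,
                                  const int*    __restrict__ ptr,
                                  const double* __restrict__ x,
                                  double*       y) {
    __shared__ double partial[128];                           // one slot per thread
    const int tid  = threadIdx.x;
    const int lane = tid & 15;                                // position within the half-warp
    const int row  = (blockIdx.x * blockDim.x + tid) >> 4;    // 16 threads per row

    double sum = 0.0;
    if (row < n_rows) {
        // Adjacent lanes touch adjacent val/col entries, so each half-warp's 16
        // loads coalesce into one 128-byte (val) and one 64-byte (col) transaction.
        for (int k = ptr[row] + lane; k < ptr[row + 1]; k += 16)
            sum += val[k] * x[col[k]];
    }
    partial[tid] = sum;
    __syncthreads();

    // Fold the 16 partial sums of each half-warp down to one value.
    for (int stride = 8; stride > 0; stride >>= 1) {
        if (lane < stride) partial[tid] += partial[tid + stride];
        __syncthreads();
    }
    if (lane == 0 && row < n_rows) y[row] = partial[tid];
}
```

A launch such as spmv_csr_halfwarp<<<(16 * n_rows + 127) / 128, 128>>>(...) would assign one half-warp to each row.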

11 Analysis Each thread reads 1/16th of the non-zero elements in a row Accessing device memory (128-byte interface): –val array: 16 threads read 16 x 8 bytes = 128 bytes –col array: 16 threads read 16 x 4 bytes = 64 bytes Occupancy achieved for all matrices was 1.0 –Each thread uses a sufficiently small number of registers and amount of shared memory –Each SM can execute the maximum possible number of threads Instruction throughput ratio: 0.799 to 0.886

12 Talk Outline GPU –Memory Hierarchy & Microarchitecture Sparse Matrix Vector Multiplication on GPU Sparse Matrix Vector Multiplication on FPGA Analysis of FPGA and GPU Implementations

13 SpMV FPGA Implementation Generally implemented architecture (from the literature) –Multipliers followed by a binary tree of adders followed by an accumulator –Values delivered serially to the accumulator –For a set of n values, n-1 additions are required to reduce it Problem –Accumulation of FP values is an iterative procedure [Figure: multipliers forming M1*V1 and M2*V2 feed the adder tree and then the accumulator]
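A behavioral software model, with illustrative names, of the structure described above: k multipliers feed a binary adder tree whose single output streams into a serial accumulator. It models only the dataflow, not an FPGA implementation.

```cuda
// Behavioral model of the multiplier bank -> adder tree -> accumulator structure.
#include <vector>

double adder_tree(std::vector<double> v) {           // log2(k) levels of pairwise adds
    while (v.size() > 1) {
        std::vector<double> next;
        for (size_t i = 0; i + 1 < v.size(); i += 2) next.push_back(v[i] + v[i + 1]);
        if (v.size() & 1) next.push_back(v.back());  // odd element passes through
        v.swap(next);
    }
    return v.empty() ? 0.0 : v[0];
}

double dot_stream(const std::vector<double>& a, const std::vector<double>& x, int k) {
    double acc = 0.0;                                // serial accumulator: the hard part in
    for (size_t i = 0; i < a.size(); i += k) {       // hardware, since the FP add takes
        std::vector<double> prods;                   // several cycles per accumulation
        for (int j = 0; j < k && i + j < a.size(); ++j)
            prods.push_back(a[i + j] * x[i + j]);    // the k multipliers
        acc += adder_tree(prods);                    // one tree output per "cycle"
    }
    return acc;
}
```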

14 The Reduction Problem [Figure: the basic accumulator architecture (an adder pipeline inside a feedback loop, with memory and control holding partial sums) contrasted with the required design, which adds a reduction circuit]

15 Previous Reduction Circuit Implementations
Group | FPGA | Reduction logic | Reduction BRAMs | D.p. adder speed | Accumulator speed
Prasanna '07 | Virtex2 Pro100 | DSA | 3 | 170 MHz | 142 MHz
Prasanna '07 | Virtex2 Pro100 | SSA | 6 | 170 MHz | 165 MHz
Gerards '08 | Virtex 4 | Rule Based | 9 | 324 MHz | 200 MHz
We need a better architecture: a feedback reduction circuit –Simple and resource efficient –Reduces the performance gap between adder and accumulator by moving logic outside the feedback loop

16 A Close Look at Floating Point Addition IEEE 754 adder pipeline (assume a 4-bit significand), shown on the example 1.1011 x 2^23 + 1.1110 x 2^21:
1. Compare exponents
2. De-normalize the smaller value: 1.1110 x 2^21 -> 0.01111 x 2^23
3. Add mantissas: 1.1011 x 2^23 + 0.01111 x 2^23 = 10.00101 x 2^23
4. Round: 10.0011 x 2^23
5. Re-normalize: 1.00011 x 2^24
6. Round: 1.0010 x 2^24
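A small host-side model of these six steps on a toy 4-bit-fraction significand, reproducing the slide's worked example. The rounding mode (round half away from zero) and all names are assumptions for illustration only, not the FPGA adder.

```cuda
// Toy model of the addition steps: compare exponents, de-normalize, add, round,
// re-normalize, round. Values are (significand, exponent) pairs: value = sig * 2^exp.
#include <cstdio>
#include <cmath>

struct Fp { double sig; int exp; };

static double round_frac4(double s) {        // keep 4 bits after the binary point
    return std::round(s * 16.0) / 16.0;
}

Fp fp_add(Fp a, Fp b) {
    if (a.exp < b.exp) { Fp t = a; a = b; b = t; }   // 1. compare exponents
    b.sig = b.sig / std::pow(2.0, a.exp - b.exp);    // 2. de-normalize smaller value
    double s = a.sig + b.sig;                        // 3. add mantissas
    s = round_frac4(s);                              // 4. round
    int e = a.exp;
    while (s >= 2.0) { s /= 2.0; ++e; }              // 5. re-normalize
    s = round_frac4(s);                              // 6. round again
    return {s, e};
}

int main() {
    // Slide's example: 1.1011b * 2^23 + 1.1110b * 2^21 -> 1.0010b * 2^24
    Fp r = fp_add({1.6875, 23}, {1.875, 21});        // 1.1011b = 1.6875, 1.1110b = 1.875
    std::printf("%.4f * 2^%d\n", r.sig, r.exp);      // prints 1.1250 * 2^24 (= 1.0010b)
    return 0;
}
```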

17 Base Conversion Idea: –Shift both inputs to the left by the amount specified in the low-order bits of their exponents –Reduces the size of the exponent, but requires a wider adder Example –Base-8 conversion: 1.01011101, exp = 10110 (1.36328125 x 2^22 => ~5.7 million) Shift to the left by 6 bits: 1010111.01, exp = 10 (87.25 x 2^(8*2) => ~5.7 million)
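A short sketch of the arithmetic behind this example, under the assumption that the shift amount is the exponent taken modulo 8 and that the remaining (shorter) exponent counts units of 2^8; the variable names are illustrative.

```cuda
// Base-conversion arithmetic from the slide's example: shift the significand left
// by the low-order exponent bits so that only the high-order exponent bits remain.
#include <cstdio>

int main() {
    double sig = 1.36328125;       // 1.01011101 in binary
    int    exp = 22;               // 10110 in binary, so value = sig * 2^22

    int shift     = exp % 8;       // low-order exponent bits: 6
    int exp_new   = exp / 8;       // high-order exponent bits: 2
    double sig_new = sig * (1 << shift);   // 1.36328125 * 64 = 87.25 (1010111.01 in binary)

    // Both print ~5.7 million: 1.36328125 * 2^22 == 87.25 * 2^(8*2)
    std::printf("%f vs %f\n", sig * (1ull << exp), sig_new * (1ull << (8 * exp_new)));
    return 0;
}
```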

18 Accumulator Design [Figure: accumulator datapath with a preprocess stage, an adder feedback loop of depth α = 3, and a post-process stage]

19 Reduction Circuit Designed a novel reduction circuit –Lightweight because it exploits the shallow adder pipeline Requires –One input buffer –One output buffer –An eight-state FSM controller

20-29 Three-Stage Reduction Architecture [Animation sequence: values B1…B8 of one input set stream in through the input buffer and are folded by the three-stage "adder" pipeline into running partial sums (B2+B3, B1+B4, then B2+B3+B6, B1+B4+B7, and B5+B8), while the tail of the previous set drains to the output buffer and the next set (C1, …) begins to arrive]

30 Reduction Circuit Configurations Four "configurations" of the datapath (A-D) Deterministic control sequence, triggered by a set change: –D, A, C, B, A, B, B, C, B/D Minimum set size: α⌈lg α + 1⌉ - 1 –For an adder pipeline depth of 3, the minimum set size is 8
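A quick check of the minimum-set-size formula from this slide; the function name is illustrative.

```cuda
// Minimum set size = alpha * ceil(lg(alpha) + 1) - 1, where alpha is the adder depth.
#include <cstdio>
#include <cmath>

int min_set_size(int alpha) {
    return alpha * (int)std::ceil(std::log2((double)alpha) + 1.0) - 1;
}

int main() {
    std::printf("%d\n", min_set_size(3));   // adder pipeline depth 3 -> prints 8
    return 0;
}
```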

31 New SpMV Architecture Built around the limitations of the reduction circuit –Delete the binary adder tree –Replicate the accumulators –Schedule data to process multiple dot products in parallel

32 Talk Outline GPU –Memory Hierarchy & Microarchitecture Sparse Matrix Vector Multiplication on GPU Sparse Matrix Vector Multiplication on FPGA Analysis of FPGA and GPU Implementations

33 Performance Figures
Matrix | Order/dimensions | nz | Avg. nz/row | GPU Mem. BW (GB/s) | GPU GFLOPs | FPGA GFLOPs (8.5 GB/s)
TSOPF_RS_b162_c3 | 15374 | 610299 | 40 | 58.00 | 10.08 | 1.60
E40r1000 | 17281 | 553562 | 32 | 57.03 | 8.76 | 1.65
Simon/olafu | 16146 | 1015156 | 32 | 52.58 | 8.52 | 1.67
Garon/garon2 | 13535 | 373235 | 29 | 49.16 | 7.18 | 1.64
Mallya/lhr11c | 10964 | 233741 | 21 | 40.23 | 5.10 | 1.49
Hollinger/mark3jac020sc | 9129 | 52883 | 6 | 26.64 | 1.58 | 1.10
Bai/dw8192 | 8192 | 41746 | 5 | 25.68 | 1.28 | 1.08
YCheng/psse1 | 14318 x 11028 | 57376 | 4 | 27.66 | 1.24 | 0.85
GHS_indef/ncvxqp1 | 12111 | 73963 | 3 | 27.08 | 0.98 | 1.13

34 Performance Comparison If the FPGA memory bandwidth is scaled, by adding multipliers/accumulators, to match the GPU memory bandwidth for each matrix separately:
GPU Mem. BW (GB/s) | FPGA Mem. BW (GB/s)
58.00 | 51.0 (x6)
57.03 | 51.0 (x6)
52.58 | 51.0 (x6)
49.16 | 42.5 (x5)
40.23 | 34.0 (x4)
26.64 | 25.5 (x3)
25.68 | 25.5 (x3)
27.66 | 25.5 (x3)
27.08 | 25.5 (x3)

35 Conclusions Presented a state-of-the-art GPU implementation of SpMV Presented a new SpMV architecture for FPGAs –Based on a novel accumulator architecture GPUs at present perform better than FPGAs for SpMV –Due to the available memory bandwidth FPGAs have the potential to outperform GPUs –They need more memory bandwidth

36 Acknowledgement Dr. Jason Bakos Yan Zhang, Tiffany Mintz, Zheming Jin, Yasser Shalabi, Rishabh Jain National Science Foundation Questions?? Thank You!!

37 Performance Analysis Xilinx Virtex-2 Pro 100 –Includes everything related to the accumulator (LUT-based adder)

