1 Hardware Acceleration Using GPUs M Anirudh Guide: Prof. Sachin Patkar VLSI Consortium April 4, 2008

2 Advantages of Using Graphics Processors
Parallel architectures with many ALUs
High memory bandwidth
Cheap, fast and scalable; a new generation roughly every two years
High Gflops/$
Cons
No double precision yet (only single-precision floating-point operations)
Loss of precision (not fully IEEE 754 compliant)

3 NVIDIA GeForce 8 Series Cards
Currently using an 8500 GT to test our algorithms
8500 GT: 16 processors, theoretical peak floating-point performance of 28.8 Gflops, memory bandwidth of 12.8 GB/s
Scalable architecture
8800 GTX: 128 processors, ~350 Gflops, 86.4 GB/s

4 GeForce 8500GT Architecture
[Block diagram: thread scheduler; processors, each with control logic and ALUs; per-thread local memory; per-block shared memory; off-chip global memory]

5 Programming Model
Massively multi-threaded
Hierarchy: threads -> warps -> blocks -> grid
Shared memory and global memory
Coalesced memory access: effective bandwidth improves from ~5 GB/s (uncoalesced) to ~70 GB/s (coalesced)
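As a minimal CUDA sketch of the ideas above (not part of the original slides; the SAXPY kernel and its names are purely illustrative), the thread/block/grid indexing and the coalesced access pattern look like this:

// Illustrative SAXPY kernel: one thread per element.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    // blockIdx/blockDim/threadIdx give each thread a unique global index;
    // consecutive threads in a warp touch consecutive addresses, so the
    // loads and stores to global memory coalesce into wide transactions.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

// Launch with 256 threads per block and enough blocks to cover n elements:
//   saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);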

6 Results
                        CPU      CPU opt.   GeForce 6200   GeForce 8500 GT
Matrix-Matrix (1000)    ~6s      ~0.86s     ~6s            ~0.18s
Matrix-Vector (1000)    ~0.01s   -          ~0.8s          ~0.5s
Matrix-vector operations are slow on the GPU mainly because of the host-to-device data transfer (see the sketch after this slide).
Matrix-matrix multiplication reaches about 10 Gflops on the GPU, compared to 2+ Gflops on the CPU and 6 Gflops reported using BLAS.
An NVIDIA 8800 card has been observed to reach up to 180 Gflops for matrix-matrix multiplication with optimized algorithms.
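The transfer-bound behaviour can be seen in a host-side sketch like the one below (assumed code, not from the slides): for n = 1000 the O(n^2) matrix copy over PCIe costs on the order of the O(n^2) kernel's own runtime, whereas matrix-matrix multiplication does O(n^3) work on the same transferred data.

#include <cuda_runtime.h>

// Simple, unoptimized matrix-vector kernel: one thread per output row.
__global__ void matvec(int n, const float *A, const float *x, float *y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n) {
        float sum = 0.0f;
        for (int j = 0; j < n; ++j)
            sum += A[row * n + j] * x[j];   // A stored row-major
        y[row] = sum;
    }
}

// Host wrapper: the 4*n*n-byte matrix must cross the PCIe bus before any
// computation happens, which dominates the runtime of y = A*x.
void gpu_matvec(int n, const float *A, const float *x, float *y)
{
    float *dA, *dx, *dy;
    cudaMalloc((void **)&dA, n * n * sizeof(float));
    cudaMalloc((void **)&dx, n * sizeof(float));
    cudaMalloc((void **)&dy, n * sizeof(float));

    cudaMemcpy(dA, A, n * n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dx, x, n * sizeof(float), cudaMemcpyHostToDevice);

    matvec<<<(n + 255) / 256, 256>>>(n, dA, dx, dy);

    cudaMemcpy(y, dy, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dx); cudaFree(dy);
}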

7 Conclusion
Most reported GPU performances are ~30-40% of the theoretical peak; these are still 5x-10x faster than the CPU.
Considerable understanding and work are required to fully optimize code.
Matrix-matrix operations are easily an order of magnitude faster than on the CPU.
Future Work
Develop optimized routines for LU decomposition, Cholesky, Conjugate Gradient, etc. (a sketch of one such routine follows below).
Incorporate these routines into the DC Analyzer to improve performance and to handle larger data sizes.
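As a rough illustration of the kind of routine planned above (my own sketch, not the authors' implementation), an unpreconditioned conjugate gradient iteration can be assembled from cuBLAS level-1/2 calls of the legacy API contemporary with CUDA 1.x; the helper name cg_solve and the work-vector layout are hypothetical.

#include <cublas.h>   // legacy cuBLAS API
#include <cmath>

// Solves A*x = b for symmetric positive definite A (n x n, column-major).
// All pointers are device pointers; r, p, Ap are caller-provided work vectors.
// The caller is assumed to have called cublasInit() and copied A, b and the
// initial guess x to the device beforehand.
void cg_solve(int n, const float *A, const float *b, float *x,
              float *r, float *p, float *Ap, int max_iters, float tol)
{
    cublasScopy(n, b, 1, r, 1);                                   // r = b
    cublasSgemv('N', n, n, -1.0f, A, n, x, 1, 1.0f, r, 1);        // r -= A*x
    cublasScopy(n, r, 1, p, 1);                                   // p = r

    float rs_old = cublasSdot(n, r, 1, r, 1);
    for (int k = 0; k < max_iters && sqrtf(rs_old) > tol; ++k) {
        cublasSgemv('N', n, n, 1.0f, A, n, p, 1, 0.0f, Ap, 1);    // Ap = A*p
        float alpha = rs_old / cublasSdot(n, p, 1, Ap, 1);
        cublasSaxpy(n,  alpha, p,  1, x, 1);                      // x += alpha*p
        cublasSaxpy(n, -alpha, Ap, 1, r, 1);                      // r -= alpha*Ap
        float rs_new = cublasSdot(n, r, 1, r, 1);
        cublasSscal(n, rs_new / rs_old, p, 1);                    // p = beta*p
        cublasSaxpy(n, 1.0f, r, 1, p, 1);                         //   + r
        rs_old = rs_new;
    }
}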

