Published by Raven Padfield. Modified over 3 years ago.

1
Timothy Blattner and Shujia Zhou
May 18, 2011

2
This project is sponsored by Lockheed Martin.
We would like to thank Joseph Swartz, Sara Hritz, Michael Bellor, Sarah Sellman, and Kang Edward for sharing their insight on the CSCF code, and Milt Halem for his guidance.

3
Goal
Introduction
Design
Algorithm
Current Implementation
Results
Future Direction

4
Determine whether GPGPU-based co-processors provide enough computational acceleration to reduce overall time-to-solution.
Develop a generic off-load acceleration model.

5
Lower-upper (LU) Decomposition
  Used to solve a system of linear equations
  Solves Ax = LUx = b
Slab-based solution
  The algorithm uses a forward elimination, moving through the slabs from left to right
  This is followed by a backward substitution, going through the slabs from right to left
  This is done using a triple-buffer solution:
    Buffer one is the right-hand-side slab
    Buffer two is the slab being read in
    Buffer three is the slab for the current computation
  Each element is double-precision complex
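
To make the Ax = LUx = b step concrete, here is a minimal pure-Python sketch of a dense LU solve on complex values. It is not the slab-based, out-of-core routine from the slides: the whole matrix is held in memory, pivoting is left out for brevity (so it assumes no zero pivots), and the names `lu_decompose` and `lu_solve` are hypothetical.

```python
def lu_decompose(a):
    """Doolittle LU factorization without pivoting: returns (L, U) with A = L*U.

    Assumes every pivot U[i][i] is nonzero (illustrative simplification).
    """
    n = len(a)
    lower = [[0j] * n for _ in range(n)]
    upper = [[0j] * n for _ in range(n)]
    for i in range(n):
        lower[i][i] = 1 + 0j
        # Row i of U
        for j in range(i, n):
            upper[i][j] = a[i][j] - sum(lower[i][k] * upper[k][j] for k in range(i))
        # Column i of L
        for j in range(i + 1, n):
            partial = sum(lower[j][k] * upper[k][i] for k in range(i))
            lower[j][i] = (a[j][i] - partial) / upper[i][i]
    return lower, upper


def lu_solve(a, b):
    """Solve A x = b by forward elimination (L y = b), then back substitution (U x = y)."""
    n = len(a)
    lower, upper = lu_decompose(a)
    y = [0j] * n
    for i in range(n):  # forward pass, top to bottom
        y[i] = b[i] - sum(lower[i][k] * y[k] for k in range(i))
    x = [0j] * n
    for i in reversed(range(n)):  # backward pass, bottom to top
        tail = sum(upper[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (y[i] - tail) / upper[i][i]
    return x


if __name__ == "__main__":
    a = [[2 + 1j, 1 + 0j], [1 + 0j, 3 - 1j]]
    b = [1 + 0j, 2j]
    print(lu_solve(a, b))
```

The two passes in `lu_solve` mirror the order described above: one sweep forward, then one sweep back.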

6
Computation is solved using a series of Fortran routines
The routines Update_slab and Factor_slab use BLAS:
  ZGEMM
  ZTRSM
ZGEMM and ZTRSM are accelerated on the GPU

7
The size of the GPU buffer is less than the size of one of the CPU buffers
Example: 1 million unknowns
  Contains 1 million rows and 1,250 columns per slab
  Total of 800 slabs
  Each slab is ~18 GB in size
  Largest GPU memory available is 6 GB (Tesla C2070)
Matrices are oblong
  The number of rows is much larger than the number of columns
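
The figures in this example can be checked with a few lines of arithmetic. The sketch below assumes 16 bytes per double-precision complex element and reads "GB" as GiB (2^30 bytes), which is what makes the slab come out near 18; `slab_layout` is a hypothetical helper name.

```python
BYTES_PER_ELEMENT = 16  # double-precision complex: two 8-byte floats
GIB = 2**30


def slab_layout(unknowns, columns_per_slab):
    """Return (number of slabs, bytes per slab) for a square system split by columns."""
    slabs = unknowns // columns_per_slab
    slab_bytes = unknowns * columns_per_slab * BYTES_PER_ELEMENT
    return slabs, slab_bytes


slabs, slab_bytes = slab_layout(1_000_000, 1_250)
gpu_bytes = 6 * GIB  # Tesla C2070 memory, taken as 6 GiB (assumption)

# Ceiling division: a lower bound on how many pieces one slab must be cut into,
# ignoring the space the other GEMM operands also need on the GPU.
min_chunks = -(-slab_bytes // gpu_bytes)

print(f"slabs: {slabs}")
print(f"slab size: {slab_bytes / GIB:.1f} GiB")
print(f"minimum chunks per slab: {min_chunks}")
```

Even under this optimistic bound, a single slab cannot be placed on the GPU in one piece, which is why the GEMM has to be decomposed.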

8
Solve ZTRSM and ZGEMM on the GPU
CUBLAS
  CUDA-optimized version of the BLAS routines
Domain decomposition of ZGEMM into GPU buffers
  A*B = C
  A and C
[Figure: the A_CPU Buffer is copied (COPY) piece by piece into the A_GPU Buffer; each piece has size = size of the A_GPU Buffer; label N]
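
One way to picture the decomposition is as a loop over row panels of A, each small enough for the GPU buffer. The pure-Python sketch below is only an illustration of that idea, not the CUBLAS implementation: it assumes A and C are split by rows while B stays resident, and the plain `matmul` stands in for a single ZGEMM call. Both function names and the `rows_per_chunk` parameter are hypothetical.

```python
def matmul(a, b):
    """Reference dense product; stands in for one ZGEMM call on the device."""
    b_columns = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in b_columns] for row in a]


def chunked_matmul(a, b, rows_per_chunk):
    """Compute C = A*B one row panel of A at a time.

    rows_per_chunk models how many rows of A fit in the GPU buffer.
    """
    c = []
    for start in range(0, len(a), rows_per_chunk):
        a_panel = a[start:start + rows_per_chunk]  # models the host-to-device copy
        c_panel = matmul(a_panel, b)               # models the multiply on the device
        c.extend(c_panel)                          # models the device-to-host copy
    return c


if __name__ == "__main__":
    a = [[complex(i, j) for j in range(2)] for i in range(5)]  # tall, narrow A
    b = [[1 + 1j, 2 + 0j], [0j, 1 - 1j]]
    print(chunked_matmul(a, b, rows_per_chunk=2) == matmul(a, b))
```

Splitting by rows suits oblong matrices because each row panel of C depends only on the matching row panel of A, so the panels can be processed independently.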

9
Four Phases:
Phase 1: Baseline Benchmark
Phase 2: Decompose GEMM
Phase 3: Square Matrix Decomposition
  Demonstrates the effectiveness of the GPU on square matrices, and potentially utilizes Fermi's concurrent kernel execution
Phase 4: Asynchronous Memory Copy
  Provides the possibility of overlapping PCI Express requests, potentially reducing the impact of the bottleneck

10
NVIDIA GTX 460
  336 cores
  1 GB GDDR5
Intel Q6600
  4 cores
  2.4 GHz
4 GB DDR2-800
1 TB 7200 RPM disk

11
Baseline Benchmark (Phase 1) and Phase 2 complete for 10,000 unknowns
Implementation in Fortran
  Uses the Fortran-to-C wrappers provided by NVIDIA for CUBLAS

12
Speedup Analysis (CPU time vs GPU time)
Test Case     Update Time   Total Wall Time
5000x1000     28.2x         6.0x
10000x1000    29.1x         8.6x
10000x500     23.4x         12.4x
10000x250     16.2x         12.0x
10000x100     8.3x          7.6x

13
Slabs = Number of Rows / Number of Columns

16
Execute Phases 1 and 2 for up to 1 million unknowns
  Run on a Tesla C2070 on the UMBC Bluegrit cluster
Implement Phase 3 and benchmark
Implement Phase 4 and benchmark
Investigate the Factor_slab routine for speedup
