Timothy Blattner and Shujia Zhou, May 18, 2011



This project is sponsored by Lockheed Martin. We would like to thank Joseph Swartz, Sara Hritz, Michael Bellor, Sarah Sellman, and Kang Edward for sharing their insight on the CSCF code, and Milt Halem for his guidance.

 Goal  Introduction  Design  Algorithm  Current Implementation  Results  Future Direction 3

- Determine if the use of GPGPU-based co-processors provides enough computational acceleration to reduce the overall time-to-solution
- Develop a generic off-load acceleration model

- Lower-upper (LU) decomposition
  - Used to solve a system of linear equations
  - Solves Ax = LUx = b
- Slab-based solution
  - The algorithm uses a forward elimination moving through the slabs from left to right
  - Followed by a backward substitution going through the slabs from right to left
- This is done using a triple-buffer solution:
  - Buffer one is the right-hand-side slab
  - Buffer two is the slab being read in
  - Buffer three is the slab for the current computation
- Each element is double-precision complex
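The forward-elimination / backward-substitution passes described above follow the standard LU pattern. A minimal pure-Python sketch of that pattern on a small double-precision complex system (illustrative only; it is not the slab-based CSCF Fortran code, and it omits pivoting):

```python
# Minimal sketch of the LU pattern on this slide: factor A into unit-lower L
# and upper U (stored in place), then solve Ax = LUx = b by forward
# elimination followed by backward substitution.

def lu_decompose(A):
    """In-place Doolittle LU without pivoting; L and U overwrite A."""
    n = len(A)
    for k in range(n):
        for i in range(k + 1, n):
            A[i][k] /= A[k][k]              # multiplier becomes L[i][k]
            for j in range(k + 1, n):
                A[i][j] -= A[i][k] * A[k][j]
    return A

def lu_solve(LU, b):
    n = len(LU)
    y = list(b)
    for i in range(n):                      # forward: solve L y = b (unit diagonal)
        for j in range(i):
            y[i] -= LU[i][j] * y[j]
    x = y
    for i in reversed(range(n)):            # backward: solve U x = y
        for j in range(i + 1, n):
            x[i] -= LU[i][j] * x[j]
        x[i] /= LU[i][i]
    return x

# Small double-precision complex example (each element is complex, as on the slide)
A = [[4 + 0j, 2 + 1j],
     [2 - 1j, 3 + 0j]]
b = [1 + 0j, 2 + 0j]
x = lu_solve(lu_decompose([row[:] for row in A]), b)
```

The production code applies the same two passes slab by slab, with the triple buffers hiding the disk I/O for the out-of-core slabs.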

- Computation solved using a series of FORTRAN routines
- Routines Update_slab and Factor_slab use BLAS:
  - ZGEMM
  - ZTRSM
- ZGEMM and ZTRSM accelerated on GPU
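The roles the two BLAS calls play in a blocked LU update can be sketched in plain Python. The slides do not show the internals of Update_slab and Factor_slab, so the panel/trailing-update structure below is the standard blocked-LU pattern, assumed rather than taken from the CSCF source; the real code calls the complex double-precision ZTRSM and ZGEMM from Fortran.

```python
# Plain-Python sketch of what ZTRSM and ZGEMM contribute to a blocked LU
# update (standard pattern, assumed; not the CSCF Fortran internals).

def ztrsm_unit_lower(L, B):
    """ZTRSM role: solve L * X = B with L unit lower triangular."""
    n, m = len(L), len(B[0])
    X = [row[:] for row in B]
    for i in range(n):
        for k in range(i):
            for j in range(m):
                X[i][j] -= L[i][k] * X[k][j]
    return X

def zgemm_update(C, A, B):
    """ZGEMM role: trailing (Schur-complement) update C := C - A*B."""
    for i in range(len(A)):
        for j in range(len(B[0])):
            for k in range(len(B)):
                C[i][j] -= A[i][k] * B[k][j]
    return C

L = [[1 + 0j, 0j], [2 + 0j, 1 + 0j]]
B = [[1 + 0j, 1 + 0j], [4 + 0j, 2 + 0j]]
X = ztrsm_unit_lower(L, B)                      # panel solve: X = L^-1 B
C = zgemm_update([[0j, 0j], [0j, 0j]], L, X)    # trailing update: C = -L*X = -B
```

These two calls dominate the slab update, which is why they are the ones off-loaded to the GPU.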

- Size of the GPU buffer is less than the size of one of the CPU buffers
- Example: 1 million unknowns
  - Contains 1 million rows and 1,250 columns per slab
  - Total of 800 slabs
  - Each slab is ~18 GB in size
  - Largest GPU memory available is 6 GB (Tesla C2070)
- Matrices are oblong
  - Number of rows is much larger than the number of columns
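The sizes quoted above follow directly from the element type: a double-precision complex element occupies 16 bytes (two 8-byte doubles). A quick check of the arithmetic:

```python
# Check of the numbers on this slide: 1 million unknowns, 1,250 columns per
# slab, 16 bytes per double-precision complex element.
rows, cols = 1_000_000, 1_250
bytes_per_element = 16

slabs = rows // cols                                # slabs = rows / columns
slab_gib = rows * cols * bytes_per_element / 2**30  # slab size in GiB

print(slabs)     # 800 slabs
print(slab_gib)  # ~18.6 GiB, i.e. the ~18 GB per slab quoted above
```

Since each ~18 GB slab dwarfs the 6 GB on even the largest Fermi card, the GPU buffer necessarily holds only a fraction of one CPU slab buffer.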

- Solve ZTRSM and ZGEMM on the GPU
- CUBLAS
  - CUDA-optimized version of the BLAS routines
- Domain decomposition of ZGEMM (A*B = C) into GPU buffers

[Figure: the A_CPU buffer is copied piecewise into the A_GPU buffer, each copy the size of the A_GPU buffer; the same scheme applies to A and C]
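The decomposition in the diagram can be sketched as streaming A through a fixed-size buffer, one block of rows at a time. The function names and buffer size below are illustrative assumptions; on the device, each resident chunk would be handed to CUBLAS ZGEMM rather than multiplied in a Python loop.

```python
# Sketch of the ZGEMM domain decomposition: A (and the matching rows of C)
# are streamed through a fixed-size "GPU" buffer a block of rows at a time,
# so A*B = C can be computed even when A exceeds device memory.

def matmul(A, B):
    """Reference full-matrix product, for comparison."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

def tiled_gemm(A, B, buffer_rows):
    """Compute A*B using only buffer_rows rows of A at a time."""
    C = []
    for start in range(0, len(A), buffer_rows):
        chunk = A[start:start + buffer_rows]   # COPY: host slab -> A_GPU buffer
        C.extend(matmul(chunk, B))             # ZGEMM on the resident chunk
    return C                                   # COPY: result rows back to C

A = [[1, 2], [3, 4], [5, 6], [7, 8]]
B = [[1, 0, 1], [0, 1, 1]]
assert tiled_gemm(A, B, buffer_rows=3) == matmul(A, B)
```

Because the matrices are oblong (far more rows than columns), splitting along rows keeps B resident on the device while only the A and C blocks move across PCI Express.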

- Four phases:
  - Phase 1: Baseline benchmark
  - Phase 2: Decompose GEMM
  - Phase 3: Square matrix decomposition
    - Demonstrates the effectiveness of the GPU on square matrices, and potentially utilizes Fermi's concurrent kernel execution
  - Phase 4: Asynchronous memory copy
    - Provides the possibility of overlapping PCI Express requests, potentially reducing the impact of the bottleneck

- NVIDIA GTX 460
  - 336 cores
  - 1 GB GDDR5
- Intel Q6600
  - 4 cores
  - 2.4 GHz
  - 4 GB DDR2-800
- 1 TB 7200 RPM disk

- Baseline benchmark and Phase 2 complete for 10,000 unknowns
- Implementation in Fortran
- Uses the Fortran-to-C wrappers provided by NVIDIA for CUBLAS

Speedup Analysis (CPU time vs. GPU time)

Test Case   | Update Time | Total Wall Time
5000x…      | …x          | 6.0x
10000x…     | …x          | 8.6x
10000x…     | …x          | 12.4x
10000x…     | …x          | 12.0x
10000x100   | 8.3x        | 7.6x

Slabs = Number of Rows / Number of Columns


- Execute Phases 1 and 2 for up to 1 million unknowns
- Run on a Tesla C2070 on the UMBC Bluegrit cluster
- Implement Phase 3 and benchmark
- Implement Phase 4 and benchmark
- Investigate the Factor_slab routine for speedup