A Memory-Efficient Algorithm for Large-Scale Symmetric Tridiagonal Eigenvalue Problem on Multi-GPU Systems Hyunsu Cho and Peter A. Yoon Trinity College,

Slides:



Advertisements
Similar presentations
Dense Linear Algebra (Data Distributions) Sathish Vadhiyar.
Advertisements

Copyright 2011, Data Mining Research Laboratory Fast Sparse Matrix-Vector Multiplication on GPUs: Implications for Graph Mining Xintian Yang, Srinivasan.
List Ranking on GPUs Sathish Vadhiyar. List Ranking on GPUs Linked list prefix computations – computations of prefix sum on the elements contained in.
Ch 7.6: Complex Eigenvalues
Ch 7.7: Fundamental Matrices
Chapter 28 – Part II Matrix Operations. Gaussian elimination Gaussian elimination LU factorization LU factorization Gaussian elimination with partial.
Block LU Factorization Lecture 24 MA471 Fall 2003.
GPU System Architecture Alan Gray EPCC The University of Edinburgh.
Strassen's Matrix Multiplication Sibel KIRMIZIGÜL.
Eigenvalue and eigenvectors  A x = λ x  Quantum mechanics (Schrödinger equation)  Quantum chemistry  Principal component analysis (in data mining)
Solving Linear Systems (Numerical Recipes, Chap 2)
Refresher: Vector and Matrix Algebra Mike Kirkpatrick Department of Chemical Engineering FAMU-FSU College of Engineering.
Nequalities Takagi Factorization on a GPU using CUDA Gagandeep S. Sachdev, Vishay Vanjani & Mary W. Hall School of Computing, University of Utah What is.
Slides by Olga Sorkine, Tel Aviv University. 2 The plan today Singular Value Decomposition  Basic intuition  Formal definition  Applications.
1cs542g-term Notes  In assignment 1, problem 2: smoothness = number of times differentiable.
Advanced Topics in Algorithms and Data Structures Page 1 Parallel merging through partitioning The partitioning strategy consists of: Breaking up the given.
Symmetric Matrices and Quadratic Forms
Motion Analysis Slides are from RPI Registration Class.
3D Geometry for Computer Graphics
Some useful linear algebra. Linearly independent vectors span(V): span of vector space V is all linear combinations of vectors v i, i.e.
Ch 7.8: Repeated Eigenvalues
Parallel Merging Advanced Algorithms & Data Structures Lecture Theme 15 Prof. Dr. Th. Ottmann Summer Semester 2006.
Ch 7.3: Systems of Linear Equations, Linear Independence, Eigenvalues
3D Geometry for Computer Graphics
Boot Camp in Linear Algebra Joel Barajas Karla L Caballero University of California Silicon Valley Center October 8th, 2008.
MATH 685/ CSI 700/ OR 682 Lecture Notes Lecture 6. Eigenvalue problems.
Stats & Linear Models.
Numerical Analysis 1 EE, NCKU Tien-Hao Chang (Darby Chang)
Boyce/DiPrima 9th ed, Ch 7.3: Systems of Linear Equations, Linear Independence, Eigenvalues Elementary Differential Equations and Boundary Value Problems,
Data Partitioning on Heterogeneous Multicore and Multi-GPU Systems Using Functional Performance Models of Data-Parallel Applications Published in: Cluster.
An approach for solving the Helmholtz Equation on heterogeneous platforms An approach for solving the Helmholtz Equation on heterogeneous platforms G.
Linear Algebra Review 1 CS479/679 Pattern Recognition Dr. George Bebis.
1 MAC 2103 Module 12 Eigenvalues and Eigenvectors.
Day 1 Eigenvalues and Eigenvectors
Day 1 Eigenvalues and Eigenvectors
PDCS 2007 November 20, 2007 Accelerating the Complex Hessenberg QR Algorithm with the CSX600 Floating-Point Coprocessor Yusaku Yamamoto 1 Takafumi Miyata.
SuperLU_DIST on GPU Cluster Sherry Li FASTMath Meeting, Oct. 1-2, /2014 “A distributed CPU-GPU sparse direct solver”, P. Sao, R. Vuduc and X.S. Li, Euro-Par.
Chapter 5 Eigenvalues and Eigenvectors 大葉大學 資訊工程系 黃鈴玲 Linear Algebra.
Linear algebra: matrix Eigen-value Problems Eng. Hassan S. Migdadi Part 1.
Scientific Computing Singular Value Decomposition SVD.
Introduction What is GPU? It is a processor optimized for 2D/3D graphics, video, visual computing, and display. It is highly parallel, highly multithreaded.
Eigenvalues The eigenvalue problem is to determine the nontrivial solutions of the equation Ax= x where A is an n-by-n matrix, x is a length n column.
Department of Electronic Engineering, Tsinghua University Nano-scale Integrated Circuit and System Lab. Performance Analysis of Parallel Sparse LU Factorization.
Review of Linear Algebra Optimization 1/16/08 Recitation Joseph Bradley.
1 If you were plowing a field, which would you rather use? Two oxen, or 1024 chickens? (Attributed to S. Cray)
Linear Algebra Operators for GPU Implementation of Numerical Algorithms J. Krüger R. Westermann computer graphics & visualization Technical University.
Advanced Computer Graphics Spring 2014 K. H. Ko School of Mechatronics Gwangju Institute of Science and Technology.
Toward an Automatically Tuned Dense Symmetric Eigensolver for Shared Memory Machines Yusaku Yamamoto Dept. of Computational Science & Engineering Nagoya.
GPU Accelerated Vessel Segmentation Using Laplacian Eigenmaps Lin Cheng, Hyunsu Cho and Peter A. Yoon Trinity College.
1/13 Future computing for particle physics, June 2011, Edinburgh A GPU-based Kalman filter for ATLAS Level 2 Trigger Dmitry Emeliyanov Particle Physics.
Performance of BLAS-3 Based Tridiagonalization Algorithms on Modern SMP Machines Yusaku Yamamoto Dept. of Computational Science & Engineering Nagoya University.
Boot Camp in Linear Algebra TIM 209 Prof. Ram Akella.
An Out-of-core Implementation of Block Cholesky Decomposition on A Multi-GPU System Lin Cheng, Hyunsu Cho, Peter Yoon, Jiajia Zhao Trinity College, Hartford,
Divide and Conquer Algorithms Sathish Vadhiyar. Introduction  One of the important parallel algorithm models  The idea is to decompose the problem into.
Large-scale geophysical electromagnetic imaging and modeling on graphical processing units Michael Commer (LBNL) Filipe R. N. C. Maia (LBNL-NERSC) Gregory.
An Accelerated Procedure for Hypergraph Coarsening on the GPU Lin Cheng, Hyunsu Cho, and Peter Yoon Trinity College Hartford, CT, USA.
ALGEBRAIC EIGEN VALUE PROBLEMS
LLNL-PRES This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.
Review of Eigenvectors and Eigenvalues from CliffsNotes Online mining-the-Eigenvectors-of-a- Matrix.topicArticleId-20807,articleId-
CS246 Linear Algebra Review. A Brief Review of Linear Algebra Vector and a list of numbers Addition Scalar multiplication Dot product Dot product as a.
Review of Eigenvectors and Eigenvalues
Review of Linear Algebra
Linear Equations.
Boyce/DiPrima 10th ed, Ch 7.8: Repeated Eigenvalues Elementary Differential Equations and Boundary Value Problems, 10th edition, by William E. Boyce.
ACCELERATING SPARSE CHOLESKY FACTORIZATION ON GPUs
Eigenvalues and Eigenvectors
Parallelization of Sparse Coding & Dictionary Learning
Symmetric Matrices and Quadratic Forms
Lecture 13: Singular Value Decomposition (SVD)
Eigenvalues and Eigenvectors
Presentation transcript:

A Memory-Efficient Algorithm for Large-Scale Symmetric Tridiagonal Eigenvalue Problem on Multi-GPU Systems Hyunsu Cho and Peter A. Yoon Trinity College, Hartford, CT, USA

Symmetric Eigenvalue Problem Many interesting applications require eigenvectors Ax = λx where A is symmetric

Divide and Conquer Yields full spectrum of eigenvalues and eigenvectors Is numerically stable Gives rise to independent subproblems Often faster than O(n 3 ) due to deflation

Divide and Conquer Apply orthogonal similarity transformation to reduce A to tridiagonal form Existing work on single-node, multi-GPU: MAGMA (UTK) Q T AQ = A’ where A’ is symmetric tridiagonal and Q is orthogonal

Solve subproblems Merge solutions Repair Divide and Conquer A A1A1 A2A2

Merging solutions Suppose where (subproblem #1) (subproblem #2)

Merging solutions Then where and

Merging solutions Then where and Rank-one modifier

Rank-one update where

Rank-one update where Need eigen-decomposition of inner system

Decompose 1. Sort entries in D; permute z likewise 2. Filter some entries in D and z via deflation (next slide)

Decompose 1. Sort entries in D; permute z likewise 2. Filter some entries in D and z via deflation (next slide) 3. Compute all roots of the secular equation [1] giving the m eigenvalues. 4. Compute corresponding eigenvectors stably [2] [1] Li 1994 [2] Gu & Eisenstat 1994,

Decompose 1. Sort entries in D; permute z likewise 2. Filter some entries in D and z via deflation (next slide) 3. Compute all roots of the secular equation [1] giving the m eigenvalues. 4. Compute corresponding eigenvectors stably [2] 5. Multiply each eigenvector by Q Recall: [1] Li 1994 [2] Gu & Eisenstat 1994,

Deflation Recall: Entries of D are eigenvalues of two subproblems If two entries are nearly identical, we throw one away Fewer columns when multiplying eigenvectors by Q Same thing for small entries in z Reduce work complexity to O(n 2.3 )

GPU computing General-purpose computation on GPUs Bulk parallelism w/ many small threads Cost effective; widely available

Mapping work to GPU 1. Sort entries in D; permute z likewise 2. Filter some entries in D and z via deflation 3. Compute all roots of the secular equation, giving the m eigenvalues. 4. Compute corresponding eigenvectors stably 5. Multiply each eigenvector by Q → Done in bulk via DGEMM Parallel but not as work-intense

GPU memory High-bandwidth dedicated memory Separate from main memory Limited in size Main memory CPU GPU memory GPU PCI-E

Memory requirement Eigenvectors are dense → O(n 2 ) storage Intermediate workspace: eigenvectors of inner system Matrix dimension Memory required GB GB GB GB GB

Our contribution Overcome limitation in GPU memory while retaining adequate performance

Strategies 1. Use multiple GPUs GPU workspace

Strategies 2. Keep most of workspace in main memory (out-of-core approach) GPU CPU

Strategies 3. Shape work to fit GPU workspaces GPU work

Block matrix multiplication Use a fine partition to fit submatrices into GPU memory X = Q Eigenvectors of Eigenvectors of A

Hybrid computation Allocate subproblems to both GPUs and CPUs Model performance as a power function Profiler fits parameters using least-squares performance subproblem sizelog(subproblem size) log(performance)

Hybrid computation Solve many subproblems in parallel

Hybrid computation Solve each subproblem by parts

Results Scales to 50k * 50k matrix With 4 GB of GPU memory Main memory: 64 GB GPU memory: 5 GB per GPU

Results 9x CPU: dual Intel® Xeon® E GPU: 4 Nvidia Tesla® K20c

Conclusion Out-of-core approach overcomes memory limitation on the GPU Hybrid computation with profiling delivers reasonable performance

Acknowledgment Trinity College, Student Research Program Nvidia Corporation, CUDA Teaching Center Program

Any questions?