1/24 UT College of Engineering Tutorial Accelerating Linear Algebra on Heterogeneous Architectures of Multicore and GPUs using MAGMA and DPLASMA and StarPU.

Slides:



Advertisements
Similar presentations
Dense Linear Algebra (Data Distributions) Sathish Vadhiyar.
Advertisements

MAGMA – LAPACK for HPC on Heterogeneous Architectures MAGMA – LAPACK for HPC on Heterogeneous Architectures Stan Tomov and Jack Dongarra Research Director.
Yafeng Yin, Lei Zhou, Hong Man 07/21/2010
GPU Programming using BU Shared Computing Cluster
Isaac Lyngaas John Paige Advised by: Srinath Vadlamani & Doug Nychka SIParCS,
Timothy Blattner and Shujia Zhou May 18, This project is sponsored by Lockheed Martin We would like to thank Joseph Swartz, Sara Hritz, Michael.
Solving Challenging Numerical Linear Algebra Algorithms Using GPU Accelerators Hatem Ltaief KAUST Supercomputing Laboratory Stanimire Tomov University.
A Memory-Efficient Algorithm for Large-Scale Symmetric Tridiagonal Eigenvalue Problem on Multi-GPU Systems Hyunsu Cho and Peter A. Yoon Trinity College,
Matrix Algebra on GPU and Multicore Architectures Matrix Algebra on GPU and Multicore Architectures Stan Tomov Research Director Innovative Computing Laboratory.
Appendix A — 1 FIGURE A.2.2 Contemporary PCs with Intel and AMD CPUs. See Chapter 6 for an explanation of the components and interconnects in this figure.
HPCC Mid-Morning Break High Performance Computing on a GPU cluster Dirk Colbry, Ph.D. Research Specialist Institute for Cyber Enabled Discovery.
Sparse LU Factorization for Parallel Circuit Simulation on GPU Ling Ren, Xiaoming Chen, Yu Wang, Chenxi Zhang, Huazhong Yang Department of Electronic Engineering,
Eigenvalue and eigenvectors  A x = λ x  Quantum mechanics (Schrödinger equation)  Quantum chemistry  Principal component analysis (in data mining)
OpenFOAM on a GPU-based Heterogeneous Cluster
Nequalities Takagi Factorization on a GPU using CUDA Gagandeep S. Sachdev, Vishay Vanjani & Mary W. Hall School of Computing, University of Utah What is.
CUDA Programming Lei Zhou, Yafeng Yin, Yanzhi Ren, Hong Man, Yingying Chen.
Linear Algebra on GPUs Vasily Volkov. GPU Architecture Features SIMD architecture – Don’t be confused by scalar ISA which is only a program model We use.
Accelerating SYMV kernel on GPUs Ahmad M Ahmad, AMCS Division, KAUST
Parallelization and CUDA libraries Lei Zhou, Yafeng Yin, Hong Man.
Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU Presented by: Ahmad Lashgar ECE Department, University of Tehran.
HPCC Mid-Morning Break Dirk Colbry, Ph.D. Research Specialist Institute for Cyber Enabled Discovery Introduction to the new GPU (GFX) cluster.
Data Partitioning on Heterogeneous Multicore and Multi-GPU Systems Using Functional Performance Models of Data-Parallel Applications Published in: Cluster.
An approach for solving the Helmholtz Equation on heterogeneous platforms An approach for solving the Helmholtz Equation on heterogeneous platforms G.
National Center for Supercomputing Applications University of Illinois at Urbana-Champaign Direct Self-Consistent Field Computations on GPU Clusters Guochun.
Training Program on GPU Programming with CUDA 31 st July, 7 th Aug, 14 th Aug 2011 CUDA Teaching UoM.
COLLABORATIVE EXECUTION ENVIRONMENT FOR HETEROGENEOUS PARALLEL SYSTEMS Aleksandar Ili´c, Leonel Sousa 2010 IEEE International Symposium on Parallel & Distributed.
Implementation of Parallel Processing Techniques on Graphical Processing Units Brad Baker, Wayne Haney, Dr. Charles Choi.
Targeting Multi-Core systems in Linear Algebra applications Alfredo Buttari, Jack Dongarra, Jakub Kurzak and Julien Langou Petascale Applications Symposium.
Taking the Complexity out of Cluster Computing Vendor Update HPC User Forum Arend Dittmer Director Product Management HPC April,
Thinking Outside of the Tera-Scale Box Piotr Luszczek.
PDCS 2007 November 20, 2007 Accelerating the Complex Hessenberg QR Algorithm with the CSX600 Floating-Point Coprocessor Yusaku Yamamoto 1 Takafumi Miyata.
Introducing collaboration members – Korea University (KU) ALICE TPC online tracking algorithm on a GPU Computing Platforms – GPU Computing Platforms Joohyung.
Accelerating the Singular Value Decomposition of Rectangular Matrices with the CSX600 and the Integrable SVD September 7, 2007 PaCT-2007, Pereslavl-Zalessky.
Experiences Accelerating MATLAB Systems Biology Applications Heart Wall Tracking Lukasz Szafaryn, Kevin Skadron University of Virginia.
Hardware Acceleration Using GPUs M Anirudh Guide: Prof. Sachin Patkar VLSI Consortium April 4, 2008.
Adam Wagner Kevin Forbes. Motivation  Take advantage of GPU architecture for highly parallel data-intensive application  Enhance image segmentation.
1)Leverage raw computational power of GPU  Magnitude performance gains possible.
Linear Algebra Libraries: BLAS, LAPACK, ScaLAPACK, PLASMA, MAGMA
Program Optimizations and Recent Trends in Heterogeneous Parallel Computing Dušan Gajić, University of Niš Program Optimizations and Recent Trends in Heterogeneous.
Debunking the 100X GPU vs. CPU Myth An Evaluation of Throughput Computing on CPU and GPU Present by Chunyi Victor W Lee, Changkyu Kim, Jatin Chhugani,
University of Michigan Electrical Engineering and Computer Science Adaptive Input-aware Compilation for Graphics Engines Mehrzad Samadi 1, Amir Hormati.
Presented by PLASMA (Parallel Linear Algebra for Scalable Multicore Architectures) ‏ The Innovative Computing Laboratory University of Tennessee Knoxville.
Presented by PLASMA (Parallel Linear Algebra for Scalable Multicore Architectures) ‏ The Innovative Computing Laboratory University of Tennessee Knoxville.
TI Information – Selective Disclosure Implementation of Linear Algebra Libraries for Embedded Architectures Using BLIS September 28, 2015 Devangi Parikh.
Sparse Matrix-Vector Multiply on the Keystone II Digital Signal Processor Yang Gao, Fan Zhang and Dr. Jason D. Bakos 2014 IEEE High Performance Extreme.
Author : Cedric Augonnet, Samuel Thibault, and Raymond Namyst INRIA Bordeaux, LaBRI, University of Bordeaux Workshop on Highly Parallel Processing on a.
Linear Algebra Libraries: BLAS, LAPACK, ScaLAPACK, PLASMA, MAGMA Shirley Moore CPS5401 Fall 2013 svmoore.pbworks.com November 12, 2012.
The Library Approach to GPU Computations of Initial Value Problems Dave Yuen University of Minnesota, U.S.A. with Larry Hanyk and Radek Matyska Charles.
An Out-of-core Implementation of Block Cholesky Decomposition on A Multi-GPU System Lin Cheng, Hyunsu Cho, Peter Yoon, Jiajia Zhao Trinity College, Hartford,
TEMPLATE DESIGN © H. Che 2, E. D’Azevedo 1, M. Sekachev 3, K. Wong 3 1 Oak Ridge National Laboratory, 2 Chinese University.
June 13-15, 2010SPAA Managing the Complexity of Lookahead for LU Factorization with Pivoting Ernie Chan.
International Supercomputing Conference 2014 Leipzig, Germany Tutorial on June 22, 2014.
P&H Ap. A GPUs for Graphics and Computing. Appendix A — 2 FIGURE A.2.1 Historical PC. VGA controller drives graphics display from framebuffer memory.
GPU Acceleration of Particle-In-Cell Methods B. M. Cowan, J. R. Cary, S. W. Sides Tech-X Corporation.
Jun Doi IBM Research – Tokyo Early Performance Evaluation of Lattice QCD on POWER+GPU Cluster 17 July 2015.
Exploiting Graphics Processors for High-performance IP Lookup in Software Routers Jin Zhao, Xinya Zhang, Xin Wang, Yangdong Deng, Xiaoming Fu IEEE INFOCOM.
Appendix C Graphics and Computing GPUs
Generalized and Hybrid Fast-ICA Implementation using GPU
TI Information – Selective Disclosure
Analysis of Sparse Convolutional Neural Networks
05/23/11 Evaluation and Benchmarking of Highly Scalable Parallel Numerical Libraries Christos Theodosiou User and Application Support.
A survey of Exascale Linear Algebra Libraries for Data Assimilation
Ioannis E. Venetis Department of Computer Engineering and Informatics
ACCELERATING SPARSE CHOLESKY FACTORIZATION ON GPUs
GPU with CPU OZAN ÇETİNASLAN.
Linchuan Chen, Xin Huo and Gagan Agrawal
Nathan Grabaskas: Batched LA and Parallel Communication Optimization
libflame optimizations with BLIS
Adaptive Strassen and ATLAS’s DGEMM
6- General Purpose GPU Programming
Presentation transcript:

1/24 UT College of Engineering Tutorial Accelerating Linear Algebra on Heterogeneous Architectures of Multicore and GPUs using MAGMA and DPLASMA and StarPU Schedulers Stan Tomov 1, George Bosilca 1, and Cédric Augonnet 2 SAAHPC'10, Knoxville, TN July 15, 2010 Innovative Computing Laboratory 1 The University of Tennessee, Knoxville INRIA 2 Bordeaux Sud Ouest, France

2/24 Outline Linear Algebra for Multicore and GPUs The MAGMA project The Hybridization Methodology of MAGMA – enabling task-based parallelism DPLASMA StarPU

3/24 Challenges Increase in parallelism Tesla C2050 (Fermi): 448 CUDA GHz SP peak is 1075 GFlop/s, DP peak is 515 Gflop/s Increase in communication cost [vs computation] Processor speed improves ~59% / year memory bandwidth by only 23% Heterogeneity

4/24 Matrix Algebra on GPU and Multicore Architectures (MAGMA) MAGMA: a new generation linear algebra (LA) libraries to achieve the fastest possible time to an accurate solution on hybrid/heterogeneous architectures, starting with current multicore+MultiGPU systems Homepage: MAGMA & LAPACK – MAGMA - based on LAPACK and extended for hybrid systems (multi-GPUs + multicore systems); – MAGMA - designed to be similar to LAPACK in functionality, data storage and interface, in order to allow scientists to effortlessly port any of their LAPACK-relying software components to take advantage of the new architectures – MAGMA - to leverage years of experience in developing open source LA software packages and systems like LAPACK, ScaLAPACK, BLAS, ATLAS as well as the newest LA developments (e.g. communication avoiding algorithms) and experiences on homogeneous multicores (e.g. PLASMA) Support - NSF, Microsoft, NVIDIA [ now CUDA Center of Excellence at UTK on the development of Linear Algebra Libraries for CUDA-based Hybrid Architectures ] MAGMA developers – University of Tennessee, Knoxville; University of California, Berkeley; University of Colorado, Denver

5/24 “delayed update” to organize successive Level 2 BLAS as a single Level 3 BLAS Localized (over tiles) elementary transformations A New Generation of Algorithms MAGMA Hybrid Algorithms (heterogeneity friendly) Rely on - hybrid scheduler (of DAGs) - hybrid kernels (for nested parallelism) - existing software infrastructure

6/24 Hybridization methodology MAGMA uses HYBRIDIZATION methodology based on Representing linear algebra algorithms as collections of TASKS and DATA DEPENDANCIES among them Properly SCHEDULING the tasks' execution over the multicore and the GPU hardware components Successfully applied to fundamental linear algebra algorithms One and two-sided factorizations and slvers Iterative linear and eigen-solvers Faster, cheaper, better ? High-level Leveraging prior developments Exceeding in performance (and sometimes accuracy) homogeneous solutions Hybrid CPU+GPU algorithms (small tasks for multicores and large tasks for GPUs)

7/24 MAGMA Status LU, QR, Cholesky (S, C, D, Z) Linear solvers In working precision, based on LU, QR, and Cholesky Mixed-precision iterative refinement CPU and GPU interfaces Two-sided factorizations Reduction to upper Hessenberg form for the general eigenvalue problem MAGMA BLAS Routines critical for MAGMA (GEMM, SYRK, TRSM, GEMV, SYMV, etc.) Bidiagonal two-sided reduction for SVD Tridiagonal two-sided for the symmetric eigenvalue problem Divide & Conquer for the symmetric eigenvalue problem GEMM for FERMI Cholesky and QR for multiGPUs on MAGNUM tiles Hybrid kernels (building blocks) for tile algorithms (e.g., dynamically scheduled) GMRES and PCG MAGMA 0.2 Unreleased

8/24 MAGMA Software Stack

9/24 Statically Scheduled One-Sided Factorizations (LU, QR, and Cholesky) Hybridization Panels (Level 2 BLAS) are factored on CPU using LAPACK Trailing matrix updates (Level 3 BLAS) are done on the GPU using “look-ahead” Note Panels are memory bound but are only O(N 2 ) flops and can be overlapped with the O(N 3 ) flops of the updates In effect, the GPU is used only for the high-performance Level 3 BLAS updates, i.e., no low performance Level 2 BLAS is scheduled on the GPU

10/24 Performance of the one-sided statically scheduled hybrid factorizations GPU : NVIDIA GeForce GTX 280 ( GHz) GPU BLAS : CUBLAS 2.2, sgemm peak: 375 GFlop/s CPU : Intel Xeon dual socket quad-core (8 GHz) CPU BLAS : MKL 10.0, sgemm peak: 128 GFlop/s Time GFlop/s QR factorization in single precision arithmetic, CPU interface Performance of MAGMA vs MKL MAGMA QR time breakdown [ for more performance data, see ] Matrix size x 1000

11/24 A look into MAGMA MAGMA homepage An example using the Cholesky factorization CPU interface: GPU interface:

12/24 Linear Solvers Direct solvers - Factor and do triangular solves in the same, working precision Mixed Precision Iterative Refinement - Factor in single (i.e. the bulk of the computation in fast arithmetic) and use it as preconditioner in simple double precision iteration, e.g. x i+1 = x i + (LU SP ) -1 P (b – A x i ) Direct solvers - Factor and do triangular solves in the same, working precision Mixed Precision Iterative Refinement - Factor in single (i.e. the bulk of the computation in fast arithmetic) and use it as preconditioner in simple double precision iteration, e.g. x i+1 = x i + (LU SP ) -1 P (b – A x i )

13/24 Two-sided matrix factorizations Used in singular-value and eigen-value problems LAPACK-based two-sided factorizations are rich in Level 2 BLAS and therefore can not be properly accelerated on multicore CPUs We developed hybrid algorithms exploring GPUs' high bandwidth GPU: GTX280 ( GHz, 141 GB/s) CPU: 2 x 4 cores Intel 2.33GHz, 10.4 GB/s) GPU: GTX280 ( GHz, 141 GB/s) CPU: 2 x 4 cores Intel 2.33GHz, 10.4 GB/s) High-performance CUDA kernels were developed for various matrix-vector products [ e.g., ssymv reaching up to 102 Gflop/s for the symmetric eigenvalue problem ]

14/24 Two-sided factorizations (performance in single precision arithmetic) GPU : NVIDIA GeForce GTX 280 ( GHz) GPU BLAS : CUBLAS 2.3, dgemm peak: 75 GFlop/s CPU : Intel Xeon dual socket quad-core (8 GHz) CPU BLAS : MKL 10.0, dgemm peak: 65 GFlop/s 26 x 12 x 22 x

15/24 FERMI What has changed regarding MAGMA algorithms? – MAGMA is coded on high-level, extracting its performance from the performance of low-level kernels, i.e., everything works for FERMI and nothing has changed on high-level – We have relied on being able to develop the low-level kernels needed of very high-performance as GPUs become more complex, this has become more difficult – Auto-tuning has become more important

16/24 GEMM for FERMI (Tesla C2050: 448 CUDA 1.15GHz, theoretical SP peak is 1.03 Tflop/s, DP peak 515 GFlop/s) SGEMM DGEMM Kernels have to be redesigned and auto-tuned for Fermi, e.g., inner-most blocking sizes have to be increased; add register blocking, etc. May even need to be written in assembly

17/24 Auto-tuning GEMM on Fermi GPUs A thread block computes a block of matrix C Each thread computes a row of the block submatrix of C Part of matrix B is loaded into shared memory and computations are done in terms of axpy Previous generation GPUs Fermi [ parametrize kernel by V. Volkov, UC Berkeley ] [ search space extended by R. Nath, UTK ] blk_M blk_N B A blk_K M NK lda ldc ldb C blk_K Increase blocking size and add register blocking Each thread computes a sub-block of submatrix of C Parts of both A and B are first loaded into shared memory and each thread loads corresponding values into registers to do register-blocked computation

18/24 A look into MAGMA BLAS MAGMA BLAS DGEMM for Fermi Note: This is just one version, produced by a code generator, exploring the search space described on the previous slide

19/24 LU factorization in double precision FERMI Tesla C2050: 448 CUDA 1.15GHz SP/DP peak is 1030 / 515 GFlop/s ISTANBUL AMD 8 socket 6 core (48 SP/DP peak is 1075 / 538 GFlop/s

20/24 LU, Cholesky, and QR on Fermi in DP FERMI Tesla C2050: 448 CUDA 1.15GHz SP/DP peak is 1030 / 515 GFlop/s

21/24 MultiGPU and Multicore MAGNUM tiles Tasks are hybrid, GPU BLAS-based, or multicore kernels Using PLASMA with customized extensions to reduce communication and w/ hybrid MAGMA kernels ● Demonstrated scalability for one-sided factorizations ● Highly optimized, used as benchmark to compare with dynamic schedulers [e.g., performance on 4 C1060 GPUs for Cholesky is up to 1200 Gflop/s, for QR is up to 830 Gflop/s in SP ] Using StarPU to schedule hybrid, GPU and multicore kernels (from PLASMA and MAGMA) Using the DPLASMA scheduler Rectangular tiles Tiles of variable sizes to be used to account for the heterogeneity of the system To experiment with “communication-avoiding” algorithms PLASMA tile algorithms A single GPU kernel processing multiple tile tasks in parallel [vs only one, but magnum tile, at a time] Scheduling:

22/24 Scheduling with PLASMA Cholesky factorization in SP HOST: 4x AMD Opteron GPUs: 4x C1060 (240 cores

23/24 Sparse Linear Algebra The hybridization approach naturally works [e.g., Richardson iteration in mixed-precision iterative refinement solvers, Krylov space iterative solvers and eigen-solvers ] Observed better accuracy (compared to CPU) Fast sparse matrix-vector product on Fermi does not have to use texture memory Explore ideas to reduce communication [ e.g., mixed precision, reduced storage for integers for the indexing, etc. ] Need high bandwidth

24/24 MAGMA Future Plans Experimentation with various schedulers (StarPU, PLASMA, DPLASMA) and improvements for multicore with multiGPUs “Communication avoiding” algorithms for systems of multicore CPUs and GPUs Kernels for FERMI (GEMM, SYRK, TRSM, SpMV) Auto-tuning framework Port [or facilitate the port] to Windows, Python and Matlab Krylov space iterative linear solvers and block eigensolvers

Collaborators / Support MAGMA Matrix Algebra on GPU and Multicore Architectures [ ICL team: J. Dongarra, S. Tomov, R. Nath, H. Ltaief ] PLASMA Parallel Linear Algebra for Scalable Multicore Architectures Collaborating partners University of Tennessee, Knoxville University of California, Berkeley University of Colorado, Denver University of Coimbra, Portugal INRIA Bordeaux Sud Ouest, France