Linear Algebra Libraries: BLAS, LAPACK, ScaLAPACK, PLASMA, MAGMA


Linear Algebra Libraries: BLAS, LAPACK, ScaLAPACK, PLASMA, MAGMA
CPS5401, Fall 2015, November 19 class

Learning Objectives

After completing this lesson, you should be able to:
- List and describe advantages of using linear algebra libraries
- List types of computations performed by linear algebra libraries
- Describe the functionality of the BLAS
- Locate and use documentation on linear algebra libraries for your platform
- Insert calls to linear algebra library routines into your program, then compile and run the resulting program
- Describe current research on numerical linear algebra for multicore and heterogeneous architectures

Numerical Linear Algebra

Algorithms for performing matrix operations on computers, widely used in scientific, engineering, and financial applications.

Fundamental algorithms:
- Basic matrix and vector operations
- LU decomposition
- QR decomposition
- Singular value decomposition
- Eigenvalue problems

BLAS: Basic Linear Algebra Subprograms

- De facto standard (all implementations use the same calling interface)
- First published in 1979
- http://www.netlib.org/blas/
- BLAS Quick Reference Guide: http://www.netlib.org/lapack/lug/node145.html
- Tuned versions implemented by vendors (Intel MKL, AMD ACML, Cray LibSci, IBM ESSL)
- Routines to perform basic operations such as vector updates and matrix multiplication

BLAS Functionality and Levels

Level 1: vector operations of the form
    y ← αx + y
as well as scalar dot products and vector norms, among other things.

Level 2: matrix-vector operations of the form
    y ← αAx + βy
as well as solving Tx = y for x with T triangular, among other things.

Level 3: matrix-matrix operations of the form
    C ← αAB + βC
as well as solving B ← αT⁻¹B for triangular matrices T, among other things. This level contains the widely used General Matrix Multiply (GEMM) operation. A minimal sketch of Level 1 and Level 2 calls follows.
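To make the levels concrete, here is a minimal sketch of Level 1 and Level 2 calls through the CBLAS C interface. The header name (cblas.h) and the link flags (-lblas, -lopenblas, vendor-specific options) are assumptions that vary by implementation.

```c
/* Level 1 and Level 2 BLAS via the CBLAS C interface.
   Assumes a CBLAS implementation is installed; header name and
   link flags vary by vendor. */
#include <stdio.h>
#include <cblas.h>

int main(void) {
    double x[3] = {1.0, 2.0, 3.0};
    double y[3] = {4.0, 5.0, 6.0};
    /* Row-major 2x3 matrix A; leading dimension = row length = 3 */
    double A[6] = {1.0, 0.0, 2.0,
                   0.0, 1.0, 1.0};
    double z[2] = {0.0, 0.0};

    /* Level 1: y <- 2*x + y (daxpy), then the dot product x . y */
    cblas_daxpy(3, 2.0, x, 1, y, 1);
    double d = cblas_ddot(3, x, 1, y, 1);

    /* Level 2: z <- 1.0*A*x + 0.0*z (dgemv) */
    cblas_dgemv(CblasRowMajor, CblasNoTrans, 2, 3, 1.0, A, 3,
                x, 1, 0.0, z, 1);

    printf("dot = %g, z = [%g %g]\n", d, z[0], z[1]);
    return 0;
}
```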

General Matrix Multiply (GEMM)

xGEMM(TRANSA, TRANSB, M, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC) computes
    C ← α·op(A)·op(B) + β·C
where op(X) is X or its transpose, and:
- TRANSA and TRANSB determine whether the matrices A and B are to be transposed
- M is the number of rows in matrix C and, depending on TRANSA, the number of rows in the original matrix A or its transpose
- N is the number of columns in matrix C and, depending on TRANSB, the number of columns in the matrix B or its transpose
- K is the number of columns in matrix A (or its transpose) and rows in matrix B (or its transpose)
- LDA, LDB, and LDC specify the size of the first dimension of each matrix as laid out in memory, i.e., the memory distance between the start of consecutive rows or columns, depending on the storage order
- The precision prefix x is S for single, D for double, C for complex single, and Z for complex double
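As an illustration, a minimal DGEMM call through the CBLAS interface, multiplying a 2x3 matrix by a 3x2 matrix in row-major storage (header and link details again vary by vendor):

```c
/* C <- alpha*A*B + beta*C with DGEMM through the CBLAS interface.
   A is 2x3, B is 3x2, C is 2x2, all row-major, so the leading
   dimensions are the row lengths: 3, 2, and 2. */
#include <stdio.h>
#include <cblas.h>

int main(void) {
    double A[2*3] = {1, 2, 3,
                     4, 5, 6};
    double B[3*2] = {7,  8,
                     9, 10,
                    11, 12};
    double C[2*2] = {0, 0,
                     0, 0};

    /* M=2, N=2, K=3; no transposes; alpha=1, beta=0 */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 3, 1.0, A, 3, B, 2, 0.0, C, 2);

    /* Expected result: [58 64; 139 154] */
    printf("C = [%g %g; %g %g]\n", C[0], C[1], C[2], C[3]);
    return 0;
}
```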

LAPACK: Linear Algebra PACKage

- www.netlib.org/lapack/
- De facto standard
- Successor to the linear equations and linear least-squares routines of LINPACK and the eigenvalue routines of EISPACK
- Routines for solving systems of linear equations, linear least squares problems, eigenvalue problems, and singular value decomposition
- Routines implementing the associated matrix factorizations such as LU, QR, Cholesky, and Schur decomposition
- Handles real and complex matrices in both single and double precision
- Depends on the BLAS to effectively exploit caches on modern cache-based architectures
- Tuned versions implemented in vendor libraries (e.g., AMD ACML, Intel MKL, Cray LibSci, IBM ESSL)

LAPACK Naming Scheme

A LAPACK subroutine name has the form pmmaaa, where:
- p is a one-letter code denoting the type of numerical constants used: S and D stand for real floating-point arithmetic in single and double precision, while C and Z stand for complex arithmetic in single and double precision.
- mm is a two-letter code denoting the kind of matrix expected by the algorithm. The actual data are stored in a different format depending on the specific kind; e.g., for the code DI the subroutine expects a vector of length n containing the elements on the diagonal, while for the code GE it expects an n×n array containing the entries of the matrix.
- aaa is a one- to three-letter code describing the actual algorithm implemented in the subroutine; e.g., SV denotes a subroutine that solves a linear system.

For example, the subroutine that solves a linear system with a general (non-structured) matrix using real double-precision arithmetic is called DGESV (see the sketch below). For details, see the LAPACK Users' Guide at www.netlib.org/lapack/lug/
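A minimal sketch of a DGESV call, assuming the LAPACKE C interface is available (with the reference distribution the link line is roughly -llapacke -llapack -lblas; vendor libraries differ):

```c
/* Solve A x = b with DGESV via the LAPACKE C interface. */
#include <stdio.h>
#include <lapacke.h>

int main(void) {
    /* 3x3 general ("GE") matrix, row-major */
    double A[9] = { 2, 1, 1,
                    1, 3, 2,
                    1, 0, 0 };
    double b[3] = { 4, 5, 6 };   /* right-hand side; overwritten with x */
    lapack_int ipiv[3];          /* pivot indices from the LU factorization */

    lapack_int info = LAPACKE_dgesv(LAPACK_ROW_MAJOR, 3, 1,
                                    A, 3, ipiv, b, 1);
    if (info != 0) {
        printf("DGESV failed, info = %d\n", (int)info);
        return 1;
    }
    printf("x = [%g %g %g]\n", b[0], b[1], b[2]);
    return 0;
}
```

Note how the name decomposes exactly as the scheme describes: D (double precision) + GE (general matrix) + SV (solve a linear system).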

Intel MKL

- Stands for Math Kernel Library
- https://software.intel.com/en-us/intel-mkl
- Vectorized and threaded linear algebra, FFTs, and statistics functions
- Uses the standard BLAS and LAPACK APIs
- FFTW-compatible C interface (FFTW originated at MIT)
- Direct sparse solver (not standardized)
- Support for the AVX-512 Advanced Vector Extensions

ACML: AMD Core Math Library

http://developer.amd.com/tools/cpu-development/amd-core-math-library-acml/

ACML consists of the following main components:
- A full implementation of the Level 1, 2, and 3 Basic Linear Algebra Subprograms (BLAS), with optimizations for AMD Opteron processors
- A full suite of Linear Algebra (LAPACK) routines
- A comprehensive suite of Fast Fourier Transforms (FFTs) in single, double, single-complex, and double-complex data types
- Fast scalar, vector, and array transcendental math routines
- Random number generators in both single and double precision

ScaLAPACK: Scalable Linear Algebra PACKage

- www.netlib.org/scalapack/
- Library of high-performance linear algebra routines for parallel distributed-memory machines
- Solves dense and banded linear systems, least squares problems, eigenvalue problems, and singular value problems
- Key ideas (illustrated in the sketch below):
  - Block-cyclic data distribution for dense matrices and a block data distribution for banded matrices, parameterizable at runtime
  - Block-partitioned algorithms to ensure high levels of data reuse
- Efficient low-level communication implemented by the BLACS (Basic Linear Algebra Communication Subprograms)
- Will run on any machine with BLAS, LAPACK, and BLACS
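To illustrate the 2D block-cyclic distribution, here is a toy model (not a ScaLAPACK API) that maps a global matrix entry to the coordinates of the process that owns it, assuming 0-based indices and that the first block lands on process (0, 0):

```c
/* Toy model of ScaLAPACK's 2D block-cyclic distribution:
   global entry (i, j) -> coordinates of the owning process. */
#include <stdio.h>

/* Owning process index along one dimension for a block-cyclic
   distribution with block size nb over p processes. */
static int owner(int global_index, int nb, int p) {
    return (global_index / nb) % p;
}

int main(void) {
    int mb = 2, nb = 2;    /* row and column block sizes */
    int pr = 2, pc = 3;    /* 2 x 3 process grid */

    /* Which process owns global entry (5, 7)? Blocks 2 and 3,
       so process (2 mod 2, 3 mod 3) = (0, 0). */
    int i = 5, j = 7;
    printf("entry (%d,%d) lives on process (%d,%d)\n",
           i, j, owner(i, mb, pr), owner(j, nb, pc));
    return 0;
}
```

Small block sizes give better load balance across the grid; larger blocks give better data reuse in the Level 3 BLAS, which is why the distribution is parameterizable at runtime.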

Current Efforts

- Parallel Linear Algebra Software for Multicore Architectures (PLASMA): www.netlib.org/plasma/, http://icl.eecs.utk.edu/plasma/
- Matrix Algebra on GPU and Multicore Architectures (MAGMA): http://icl.eecs.utk.edu/magma/
- OpenBLAS: http://c2.com/cgi/wiki?OpenBlas

MKL on Stampede

- Libraries are in $TACC_MKL_DIR (this environment variable should be defined if you have the Intel compiler module loaded)
- Examples are in $TACC_MKL_DIR/examples
- See the documentation at https://software.intel.com/en-us/articles/intel-math-kernel-library-documentation
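As a quick sanity check that the module and link line are set up, a minimal sketch using MKL's version-query service routine (the file name and the -mkl compile flag are assumptions; exact flags depend on the system and compiler version):

```c
/* mkl_check.c (hypothetical name): verify that MKL links and runs.
   On Stampede-like systems with the Intel module loaded, something
   like "icc mkl_check.c -mkl" is a typical compile line (assumption). */
#include <stdio.h>
#include <mkl.h>

int main(void) {
    MKLVersion v;
    mkl_get_version(&v);   /* MKL service routine */
    printf("MKL %d.%d update %d\n",
           v.MajorVersion, v.MinorVersion, v.UpdateVersion);
    return 0;
}
```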

ScaLAPACK

For more on ScaLAPACK, see www.citutor.org: Introduction to MPI, Chapter 10.