
1 libflame optimizations with BLIS
Kiran Varaganti, 19 September 2016

2 Introduction
AMD provides high-performance computing libraries for various verticals:
- Oil & Gas Exploration
- Computer Vision
- Machine Learning
We will provide BLAS and LAPACK functionality through BLIS and libFLAME respectively, optimizing these open-source libraries for AMD architectures.

3 Benefits of BLIS & libFLAME
The benefits are enormous; to name a few:
- No inconvenient dependency on the Fortran runtime: the source for both BLIS and libFLAME is written in C, or in assembly where performance requires it.
- Object-based abstractions and API (illustrated in the sketch below):
  - Built around opaque structures that hide matrix implementation details (data layout).
  - Exports object-based programming interfaces to operate on these objects.
  - An expanded API that is a strict superset of the traditional BLAS and LAPACK libraries.
- High-performance dense linear algebra library framework:
  - The abstraction facilitates programming without array or loop indices, which lets the user avoid painful index-related programming errors.
  - Provides algorithm families for each operation, so developers can choose the one that best suits their needs.
  - Provides a framework for building complete custom linear algebra codes.
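As a flavor of the object-based API, here is a minimal sketch (assuming a standard libFLAME installation) that creates a matrix object and factors it with Cholesky; the opaque FLA_Obj hides all storage details:

#include "FLAME.h"

int main( void )
{
    FLA_Obj A;

    FLA_Init();

    // Create a 1000 x 1000 double-precision matrix object; the data
    // layout is hidden behind the opaque FLA_Obj structure.
    FLA_Obj_create( FLA_DOUBLE, 1000, 1000, 0, 0, &A );

    // Fill A with a random symmetric positive definite matrix, then
    // factor it in place: no array or loop indices anywhere.
    FLA_Random_spd_matrix( FLA_LOWER_TRIANGULAR, A );
    FLA_Chol( FLA_LOWER_TRIANGULAR, A );

    FLA_Obj_free( &A );
    FLA_Finalize();
    return 0;
}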

4 BLAS-compatible libraries
[Diagram: libFLAME layered on top of an interchangeable BLAS-compatible library: BLIS, MKL, OpenBLAS, …]
- LAPACK functionality is implemented on top of BLAS routines.
- libFLAME can be configured to use any external library that adheres to the BLAS interface.
- libFLAME can likewise be configured to use an external high-performance LAPACK-style library that adheres to the LAPACK interface.
Goal: optimize libFLAME using BLIS as the BLAS library.

5 libFLAME functionalities
- Decompositions/factorizations: LU, Cholesky, QR, eigenvalue (symmetric, Hermitian, …), singular value.
- Matrix inversions: triangular, symmetric positive definite, Hermitian positive definite.
- Solvers: triangular Sylvester equations, triangular Lyapunov equations.
- Reductions: to tridiagonal form, to bidiagonal form, to upper Hessenberg form, to standard form.
Currently focusing on Cholesky, LU, and QR, which are very useful in solving linear equations (a sketch follows).
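To make the "solving linear equations" point concrete, a hedged sketch: libFLAME's lapack2flame layer exports standard LAPACK-compatible symbols, so a symmetric positive definite system Ax = b can be solved by a Cholesky factorization (spotrf_) followed by triangular solves (spotrs_). The prototypes below are the standard LAPACK ones, declared manually so the example is self-contained:

#include <stdio.h>

// Standard LAPACK prototypes (Fortran calling convention); provided by
// libFLAME's lapack2flame layer or any LAPACK-compatible library.
extern void spotrf_( const char *uplo, const int *n,
                     float *a, const int *lda, int *info );
extern void spotrs_( const char *uplo, const int *n, const int *nrhs,
                     const float *a, const int *lda,
                     float *b, const int *ldb, int *info );

int main( void )
{
    // Column-major 2x2 SPD matrix A = [ 4 2; 2 3 ] and right-hand side b.
    float A[4] = { 4.0f, 2.0f, 2.0f, 3.0f };
    float b[2] = { 1.0f, 1.0f };
    int   n = 2, nrhs = 1, info;

    spotrf_( "L", &n, A, &n, &info );               // A = L * L^T
    spotrs_( "L", &n, &nrhs, A, &n, b, &n, &info ); // solve L L^T x = b

    printf( "x = [%f, %f], info = %d\n", b[0], b[1], info );
    return 0;
}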

6 Cholesky, LU, QR benchmarking on AMD reference CPU
- Cholesky and LU performance is comparable with OpenBLAS except for small dimensions.
- QR needs to be improved for all dimensions.
- Setup: single precision, column-major; BLIS at commit a017062fdf763037da9d971a028bb07d47aa1c8a (Date: Fri Jul 22 17:02:…); Ubuntu LTS 64-bit OS; no optimizations.

7 Block Sizes - Cholesky
[Diagram: recursive partitioning of the m x m input matrix with block size b = 128; after each HERK update the (m-128) x (m-128) trailing matrix is re-partitioned, down through 96 x 96 and 64 x 64 to 32 x 32 sub-blocks.]
- m x m = input matrix size; 128 x 128 = block size; 32 x 32 = sub-block size.
- Repeat the sub-block partitioning after the HERK update for the (m-128) x (m-128) matrix.
- BLAS operations (see the sketch after this list):
  - Level 1 & 2: block size < 32 (Cholesky on the smallest blocks uses Level-1 and Level-2 BLAS).
  - Level-3 TRSM: 32 x 32 <= block size <= (m-128) x 128.
  - Level-3 HERK: 32 x 32 <= block size <= (m-128) x (m-128).
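A minimal sketch of the blocked, right-looking Cholesky structure described above, written against CBLAS (the function names here are illustrative, not libFLAME's): the diagonal block is factored with Level-1/2 style code, the panel below it with a Level-3 TRSM, and the trailing matrix with a Level-3 SYRK (HERK in the complex case):

#include <math.h>
#include <cblas.h>

// Unblocked lower Cholesky on an n x n block (Level-1/2 style updates).
static void chol_unblocked( int n, float *A, int lda )
{
    for ( int j = 0; j < n; ++j ) {
        A[j + j*lda] = sqrtf( A[j + j*lda] );
        for ( int i = j + 1; i < n; ++i )          // scale the column (scalv)
            A[i + j*lda] /= A[j + j*lda];
        for ( int c = j + 1; c < n; ++c )          // rank-1 trailing update
            for ( int i = c; i < n; ++i )
                A[i + c*lda] -= A[i + j*lda] * A[c + j*lda];
    }
}

// Blocked right-looking lower Cholesky of an m x m column-major matrix,
// block size b (the slide uses b = 128). Illustrative sketch only.
void chol_blocked( int m, float *A, int lda, int b )
{
    for ( int k = 0; k < m; k += b ) {
        int kb = ( m - k < b ) ? m - k : b;
        int r  = m - k - kb;                       // rows below the block

        chol_unblocked( kb, &A[k + k*lda], lda );  // A11 = L11 * L11^T

        if ( r > 0 ) {
            // A21 := A21 * inv(L11)^T  (Level-3 TRSM)
            cblas_strsm( CblasColMajor, CblasRight, CblasLower,
                         CblasTrans, CblasNonUnit, r, kb,
                         1.0f, &A[k + k*lda], lda, &A[(k+kb) + k*lda], lda );
            // A22 := A22 - A21 * A21^T (Level-3 SYRK; HERK for complex)
            cblas_ssyrk( CblasColMajor, CblasLower, CblasNoTrans, r, kb,
                         -1.0f, &A[(k+kb) + k*lda], lda,
                          1.0f, &A[(k+kb) + (k+kb)*lda], lda );
        }
    }
}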

8 BLAS Dependencies - Cholesky
- Level 1: DOT, SCALV (data size < 32).
- Level 2: GEMV (data size < 32).
- Level 3: TRSM, HERK (data size < input matrix size).
Summary: the impact of the Level-1 and Level-2 routines is small compared to the Level-3 BLAS routines.
Optimization: improve TRSM and HERK for all block sizes, most importantly for block sizes smaller than 128.

9 Block sizes - LU & QR
- LU with pivoting: 128 x 128 blocks with 16 x 16 sub-blocks (Level 1 & Level 2).
- QR: Level-1 and Level-2 routines work on all higher block sizes as well, so they have a high impact on QR performance.
- FLA_QR_UT_opt_var2( A, T ) was commented out to disable the Level-1 and Level-2 routines used in QR factorization.

10 BLAS Dependencies - LU with pivoting & QR factorization
LU with pivoting:
- Level 1: DOT, AMAX, SCALV
- Level 2: GEMV
- Level 3: TRSM, GEMM
QR factorization:
- Level 1: COPYV, SCALV, AXPYV, AXPYT
- Level 2: GER (general rank-1 update), GEMV
- Level 3: TRMM, GEMM, TRSM
The sketch below shows why QR leans on the Level-1 and Level-2 routines.
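A hedged sketch of one unblocked Householder QR step (qr_step is an illustrative name, not libFLAME's): forming the reflector uses Level-1 nrm2/scal, and applying it to the trailing columns is exactly a GEMV followed by a GER rank-1 update, which is why these routines dominate QR at small block sizes:

#include <cblas.h>

// One unblocked Householder QR step on column j of an m x n column-major
// matrix A; w is caller-provided workspace of length at least n - j - 1.
// Illustrative sketch; production code handles scaling more carefully.
void qr_step( int m, int n, int j, float *A, int lda, float *tau, float *w )
{
    int    rows = m - j, cols = n - j - 1;
    float *v    = &A[j + j*lda];

    // Build the Householder vector in place (Level 1: nrm2, scal).
    float norm = cblas_snrm2( rows, v, 1 );
    if ( norm == 0.0f ) { *tau = 0.0f; return; }
    float beta = ( v[0] >= 0.0f ) ? -norm : norm;
    *tau = ( beta - v[0] ) / beta;
    cblas_sscal( rows - 1, 1.0f / ( v[0] - beta ), v + 1, 1 );
    v[0] = 1.0f;

    if ( cols > 0 ) {
        float *A2 = &A[j + (j+1)*lda];
        // w := A2^T * v            (Level 2: GEMV)
        cblas_sgemv( CblasColMajor, CblasTrans, rows, cols,
                     1.0f, A2, lda, v, 1, 0.0f, w, 1 );
        // A2 := A2 - tau * v * w^T (Level 2: GER, rank-1 update)
        cblas_sger( CblasColMajor, rows, cols, -(*tau), v, 1, w, 1, A2, lda );
    }
    v[0] = beta;  // R's diagonal entry; v's tail stays below the diagonal
}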

11 BLIS optimizations - AVX x86 SIMD optimization
- axpyf, which in turn optimizes GEMV (column-major).
- dotxf, which in turn optimizes GEMV (row-major).
- axpyv.
All operate on 256-bit YMM registers (a minimal sketch follows).
[Diagram: vectorized y := y + alpha*x and A*x computations packed into YMM registers.]
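A minimal sketch of what an AVX axpyv kernel looks like (illustrative, not BLIS's actual kernel, which also unrolls, aligns, and handles general strides): eight packed floats per 256-bit YMM register; axpyf fuses several such column updates, which is what accelerates column-major GEMV. Compile with -mavx:

#include <immintrin.h>

// y := y + alpha * x for unit-stride single-precision vectors, eight
// elements per 256-bit YMM register. Illustrative AVX sketch only.
void saxpyv_avx( int n, float alpha, const float *x, float *y )
{
    __m256 valpha = _mm256_set1_ps( alpha );
    int i = 0;

    for ( ; i + 8 <= n; i += 8 ) {
        __m256 vx = _mm256_loadu_ps( x + i );
        __m256 vy = _mm256_loadu_ps( y + i );
        vy = _mm256_add_ps( vy, _mm256_mul_ps( valpha, vx ) );
        _mm256_storeu_ps( y + i, vy );
    }
    for ( ; i < n; ++i )   // scalar cleanup for the tail
        y[i] += alpha * x[i];
}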

12 BLIS optimizations impact - QR Factorization
[Charts, AMD reference CPU; annotations: 1.9x improvement; 1.11x (~11%) improvement; 1.8x (~89.67%); 2.12x improvement.]
- bli_ssumsqv_unb_var1() was replaced with dotv for the norm2 calculation (see the sketch below).
- Ubuntu LTS 64-bit OS.
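For context on that norm2 change, a hedged sketch of the idea: computing ||x||_2 as sqrt(dot(x, x)) is a single fast dot product, whereas the sum-of-squares variant it replaced scales intermediate values to guard against overflow and underflow; the trade-off is speed versus robustness at extreme values:

#include <math.h>

// ||x||_2 via a plain dot product, the faster formulation the slide
// describes. Unlike a scaled sum-of-squares, x[i]*x[i] can overflow or
// underflow for extreme values. Illustrative sketch only.
float snorm2_via_dot( int n, const float *x )
{
    float dot = 0.0f;
    for ( int i = 0; i < n; ++i )
        dot += x[i] * x[i];
    return sqrtf( dot );
}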

13 Profile Data – Cholesky, LU & QR
[Profiling charts: Cholesky factorization, LU factorization, QR factorization.]

14 Profile Data – Cholesky, LU & QR
[Profiling charts, continued: Cholesky factorization, LU factorization, QR factorization.]

15 BLIS optimizations impact - Cholesky & LU Factorization
[Charts, AMD reference CPU; annotations: needs improvement; better than OpenBLAS, ~7% improvement; needs improvement; better than OpenBLAS, ~14% improvement.]
Ubuntu LTS 64-bit OS.

16 Profile Data – Cholesky
[Profiles at square matrix sizes 128, 160, 640, and 1600.]
To improve performance for small matrices: optimize dotxv.

17 Profile Data – LU
[Profiles at square matrix sizes 128 and 320.]
To improve performance for small matrices: optimize the framework code.

18 Profile Data – QR
[Profiles at square matrix sizes 160 and 480.]
For small matrices, the Level-1 routines dominate.

19 Summary
- For larger matrix sizes, Cholesky factorization performance is better than OpenBLAS.
- To improve Cholesky performance for smaller matrices, optimize the dotxv routine and the framework code.
- LU performance is better than OpenBLAS for matrices larger than 320.
- QR factorization performance has improved by more than 2x, but still falls short of OpenBLAS for all matrix sizes.

20 Thank You

21 libflame - Parameters
- 3: number of repeats per experiment
- c: flat matrix storage scheme(s) to test ('c' = col-major; 'r' = row-major; 'g' = general; 'm' = mixed)
- s: datatype(s) to test
- Algorithmic blocksize for blocked algorithms
- Algorithmic blocksize for algorithms-by-blocks
- Storage blocksize for algorithms-by-blocks
- Problem size: first to test
- Problem size: maximum to test
- Problem size: increment between experiments
- Number of SuperMatrix threads (0 = disable SuperMatrix in FLASH front-ends)
- i: reaction to test failure ('i' = ignore; 's' = sleep() and continue; 'a' = abort)

22 Disclaimer & Attribution
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes. AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. ATTRIBUTION © 2016 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). Other names are for informational purposes only and may be trademarks of their respective owners.

