libflame optimizations with BLIS

libflame optimizations with BLIS
Kiran Varaganti
19 September 2016

Introduction

AMD provides high-performance computing libraries for various verticals:
- Oil & Gas Exploration
- Computer Vision
- Machine Learning
- ...

We will provide BLAS & LAPACK functionality through BLIS and libFLAME respectively; in essence, we optimize these open-source libraries for AMD architectures.

http://gpuopen.com/professional-compute/
http://developer.amd.com/community/blog/2015/08/07/open-source-strikes-again-accelerated-math-libraries-at-amd/

Benefits of BLIS & libFLAME

The benefits are numerous; to name a few:
- No inconvenient dependencies on the Fortran runtime: the source for both BLIS and libFLAME is written in C, or in assembly where performance requires it.
- Object-based abstractions and API (see the sketch after this list):
  - built around opaque structures that hide matrix implementation details (data layout);
  - exports object-based programming interfaces to operate on these objects;
  - an expanded API that is a strict superset of the traditional BLAS and LAPACK libraries.
- A high-performance dense linear algebra library framework:
  - the abstraction facilitates programming without array or loop indices, which allows the user to avoid painful index-related programming errors;
  - provides algorithm families for each operation, so developers can choose the one that best suits their needs;
  - provides a framework for building complete custom linear algebra codes.
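
As a rough illustration of the object-based interface, here is a minimal sketch that creates a matrix object, factors it with Cholesky, and frees it. It assumes a standard libFLAME build; filling A with a symmetric positive definite matrix and error handling are omitted for brevity.

    /* Minimal libFLAME object-API sketch. */
    #include "FLAME.h"

    int main( void )
    {
        FLA_Obj A;
        dim_t   n = 1000;

        FLA_Init();

        /* n x n double-precision matrix object; strides of 0 let
           libFLAME pick a default (column-major) layout. */
        FLA_Obj_create( FLA_DOUBLE, n, n, 0, 0, &A );

        /* ... fill A with a symmetric positive definite matrix ... */

        /* No leading dimensions or index arithmetic: the object
           carries its own layout information. */
        FLA_Chol( FLA_LOWER_TRIANGULAR, A );

        FLA_Obj_free( &A );
        FLA_Finalize();
        return 0;
    }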

Software stack

libFLAME sits on top of a BLAS-compatible library (BLIS, MKL, OpenBLAS, ...): its LAPACK functionality is implemented on top of BLAS routines. libFLAME can be configured to use any external BLAS-compatible library, i.e. any library that adheres to the BLAS interface; it can likewise be configured to use an external high-performance library that adheres to the LAPACK interface.

Goal: optimize libFLAME using BLIS as the BLAS library (see the sketch below).
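
Because every BLAS-compatible backend exports the same Fortran-style symbols, the same application code links unchanged against BLIS, OpenBLAS, or MKL. A sketch, assuming the common f2c-style C binding with 32-bit integers (some builds use 64-bit):

    /* C := A * B for n x n column-major matrices via the standard
       Fortran BLAS symbol. Link with -lblis, -lopenblas, or MKL's
       link line; no source change is needed to switch backends. */
    extern void dgemm_( const char* transa, const char* transb,
                        const int* m, const int* n, const int* k,
                        const double* alpha, const double* a, const int* lda,
                        const double* b, const int* ldb,
                        const double* beta, double* c, const int* ldc );

    void gemm_demo( int n, const double* A, const double* B, double* C )
    {
        const double one = 1.0, zero = 0.0;
        dgemm_( "N", "N", &n, &n, &n, &one, A, &n, B, &n, &zero, C, &n );
    }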

libflame functionalities

Decompositions/Factorizations:
- LU
- Cholesky
- QR
- Eigen (symmetric, Hermitian, ...)
- Singular value

Matrix inversions:
- triangular matrix inversion
- symmetric positive definite matrix inversion
- Hermitian positive definite matrix inversion

Solvers:
- triangular Sylvester equations
- triangular Lyapunov equations

Reductions:
- to tridiagonal form
- to bidiagonal form
- to upper Hessenberg form
- to standard form

Currently focusing on Cholesky, LU, and QR; these are very useful in solving linear equations (see the sketch below).
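
To make the "solving linear equations" point concrete, here is a hedged sketch that solves A*x = b for a symmetric positive definite A through the standard LAPACK symbols dpotrf_/dpotrs_, which libFLAME can provide via its LAPACK compatibility layer; 32-bit integers are assumed:

    extern void dpotrf_( const char* uplo, const int* n, double* a,
                         const int* lda, int* info );
    extern void dpotrs_( const char* uplo, const int* n, const int* nrhs,
                         const double* a, const int* lda, double* b,
                         const int* ldb, int* info );

    /* Factor A (n x n, column-major) in place as L*L^T, then overwrite
       b with the solution x. Returns LAPACK's info code (0 = success). */
    int solve_spd( int n, double* A, double* b )
    {
        int info, nrhs = 1;
        dpotrf_( "L", &n, A, &n, &info );
        if( info == 0 )
            dpotrs_( "L", &n, &nrhs, A, &n, b, &n, &info );
        return info;
    }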

Cholesky, LU, QR benchmarking on an AMD reference CPU

- Cholesky & LU performance is comparable with OpenBLAS except for small dimensions.
- QR needs to be improved for all dimensions.

Setup: single precision, column major; BLIS at commit a017062fdf763037da9d971a028bb07d47aa1c8a (Fri Jul 22 17:02:59 2016); Ubuntu 14.04 LTS 64-bit OS; no optimizations.

Block Sizes - Cholesky

(Diagram: recursive partitioning of the m x m input matrix into a 128 x 128 diagonal block, an (m-128) x 128 panel, and an (m-128) x (m-128) trailing matrix; the 128 x 128 block is itself partitioned into 32 x 32, 64 x 64, and 96 x 96 sub-blocks.)

- m x m = input matrix size; 128 x 128 = block size (b = 128); 32 x 32 = sub-block size.
- The sub-block partitioning is repeated after the HERK update on the (m-128) x (m-128) trailing matrix.

BLAS operations:
- Level 1 & 2 (Cholesky using Level-1 & Level-2 BLAS): block size < 32
- Level 3 TRSM: 32 x 32 <= block size <= (m-128) x 128
- Level 3 HERK: 32 x 32 <= block size <= (m-128) x (m-128)

A blocked-Cholesky sketch follows.
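
The partitioning above corresponds to a right-looking blocked Cholesky. Below is a sketch in real double precision (so SYRK plays the role HERK plays in the complex/Hermitian case), using CBLAS calls; chol_unb is a naive stand-in for the Level-1/Level-2 kernel used on the small diagonal blocks:

    #include <math.h>
    #include <cblas.h>

    /* Unblocked lower Cholesky of a small block (Level 1/2 style). */
    static void chol_unb( int n, double* A, int lda )
    {
        for( int j = 0; j < n; ++j )
        {
            double d = sqrt( A[ j + j * (size_t)lda ] );
            A[ j + j * (size_t)lda ] = d;
            for( int i = j + 1; i < n; ++i )
                A[ i + j * (size_t)lda ] /= d;
            for( int k = j + 1; k < n; ++k )   /* trailing update */
                for( int i = k; i < n; ++i )
                    A[ i + k * (size_t)lda ] -=
                        A[ i + j * (size_t)lda ] * A[ k + j * (size_t)lda ];
        }
    }

    /* Blocked lower Cholesky, block size b (e.g. 128). */
    void chol_blocked( int n, double* A, int lda, int b )
    {
        for( int k = 0; k < n; k += b )
        {
            int     kb  = ( n - k < b ) ? n - k : b;
            double* A11 = A + k + k * (size_t)lda;

            chol_unb( kb, A11, lda );          /* diagonal block */

            if( k + kb < n )
            {
                int     m   = n - k - kb;
                double* A21 = A + (k + kb) + k        * (size_t)lda;
                double* A22 = A + (k + kb) + (k + kb) * (size_t)lda;

                /* TRSM: A21 := A21 * inv(L11)^T, the (m-128) x 128 panel. */
                cblas_dtrsm( CblasColMajor, CblasRight, CblasLower,
                             CblasTrans, CblasNonUnit,
                             m, kb, 1.0, A11, lda, A21, lda );

                /* SYRK update of the trailing matrix; the loop then
                   repeats the partitioning on it, as described above. */
                cblas_dsyrk( CblasColMajor, CblasLower, CblasNoTrans,
                             m, kb, -1.0, A21, lda, 1.0, A22, lda );
            }
        }
    }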

BLAS Dependencies - Cholesky

- Level 1: DOT (data size < 32), SCALV (data size < 32)
- Level 2: GEMV (data size < 32)
- Level 3: TRSM (data size < input matrix size), HERK (data size < input matrix size)

Summary: the impact of the Level-1 & Level-2 routines is small compared to that of the Level-3 BLAS routines.

Optimization: improve TRSM & HERK for all block sizes; this matters most for block sizes smaller than 128.

Block sizes - LU & QR

LU with pivoting and QR both use 128 x 128 blocks with 16 x 16 sub-blocks (Level 1 & Level 2). In QR, the Level 1 & Level 2 routines also work on all higher block sizes, so they have a high impact on QR performance. To isolate this, FLA_QR_UT_opt_var2( A, T ) was commented out, disabling the Level 1 & 2 routines used in QR factorization.

BLAS Dependencies - LU with Pivoting & QR Factorization

LU with pivoting:
- Level 1: DOT, AMAX, SCALV
- Level 2: GEMV
- Level 3: TRSM, GEMM

QR factorization:
- Level 1: COPYV, SCALV, AXPYV, AXPYT
- Level 2: GER (general rank-1 update), GEMV
- Level 3: TRMM, GEMM, TRSM

A blocked-LU sketch follows.
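
A sketch of where these routines appear in a right-looking blocked LU with partial pivoting: IAMAX, SCAL, and GER in the unblocked panel, TRSM and GEMM in the blocked update. libFLAME's own variants also use DOT and GEMV; row swapping is elided here for brevity, so this is illustrative rather than complete.

    #include <cblas.h>

    /* Unblocked panel LU (m x n, column-major); ipiv records pivot rows. */
    static void lu_panel( int m, int n, double* A, int lda, int* ipiv )
    {
        for( int j = 0; j < n; ++j )
        {
            /* AMAX: find the pivot in column j. */
            int p = j + (int)cblas_idamax( m - j, &A[ j + j * (size_t)lda ], 1 );
            ipiv[j] = p;
            /* ... swap rows j and p (elided) ... */

            /* SCAL: scale the subdiagonal of column j. */
            cblas_dscal( m - j - 1, 1.0 / A[ j + j * (size_t)lda ],
                         &A[ j + 1 + j * (size_t)lda ], 1 );

            /* GER: rank-1 update of the rest of the panel. */
            cblas_dger( CblasColMajor, m - j - 1, n - j - 1, -1.0,
                        &A[ j + 1 + j       * (size_t)lda ], 1,
                        &A[ j     + (j + 1) * (size_t)lda ], lda,
                        &A[ j + 1 + (j + 1) * (size_t)lda ], lda );
        }
    }

    /* Blocked LU, block size b (e.g. 128). */
    void lu_blocked( int n, double* A, int lda, int* ipiv, int b )
    {
        for( int k = 0; k < n; k += b )
        {
            int kb = ( n - k < b ) ? n - k : b;

            /* Panel factorization: Level 1/2 BLAS dominated. */
            lu_panel( n - k, kb, &A[ k + k * (size_t)lda ], lda, &ipiv[k] );

            if( k + kb < n )
            {
                int m = n - k - kb;
                /* TRSM: U12 := inv(L11) * A12. */
                cblas_dtrsm( CblasColMajor, CblasLeft, CblasLower,
                             CblasNoTrans, CblasUnit, kb, m, 1.0,
                             &A[ k + k * (size_t)lda ], lda,
                             &A[ k + (k + kb) * (size_t)lda ], lda );
                /* GEMM: A22 := A22 - L21 * U12 (the bulk of the flops). */
                cblas_dgemm( CblasColMajor, CblasNoTrans, CblasNoTrans,
                             m, m, kb, -1.0,
                             &A[ k + kb + k        * (size_t)lda ], lda,
                             &A[ k      + (k + kb) * (size_t)lda ], lda, 1.0,
                             &A[ k + kb + (k + kb) * (size_t)lda ], lda );
            }
        }
    }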

BLIS optimizations: AVX x86 SIMD

Kernels vectorized using 256-bit YMM registers:
- axpyf (fused axpy) - which in turn optimizes GEMV (column major)
- dotxf (fused dot) - which in turn optimizes GEMV (row major)
- axpyv (y := y + alpha * x)

(Diagram: worked examples of the fused A*x update and an element-wise axpy computed in YMM registers.)
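
As a minimal sketch of this kind of vectorization (plain AVX, no FMA assumed): saxpyv computing y := y + alpha * x, eight floats per 256-bit YMM register. BLIS's actual axpyf/dotxf kernels fuse several such vector updates to speed up GEMV; this shows only the single-vector case.

    #include <immintrin.h>

    void saxpyv_avx( int n, float alpha, const float* x, float* y )
    {
        __m256 valpha = _mm256_set1_ps( alpha );
        int    i      = 0;

        /* Main loop: 8 elements per iteration in YMM registers. */
        for( ; i + 8 <= n; i += 8 )
        {
            __m256 vx = _mm256_loadu_ps( x + i );
            __m256 vy = _mm256_loadu_ps( y + i );
            vy = _mm256_add_ps( vy, _mm256_mul_ps( valpha, vx ) );
            _mm256_storeu_ps( y + i, vy );
        }
        /* Scalar remainder. */
        for( ; i < n; ++i )
            y[i] += alpha * x[i];
    }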

Blis optimizations Impact – QR Factorization (AMD reference CPU, Ubuntu 14.04 LTS 64-bit OS)

(Charts: QR speedups of roughly 1.11x (~11% improvement), 1.8x-1.9x (~90% improvement), and 2.12x across the measured configurations.)

bli_ssumsqv_unb_var1() was replaced with dotv for the norm2 calculation (see the sketch below).
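
A hedged sketch of that norm2 change: ||x||_2 computed with a single dot product rather than an unoptimized sum-of-squares loop. The trade-off: the scaled sum-of-squares form (what sumsqv computes) guards against overflow/underflow for extreme values, while the plain dot product is faster and vectorizes well.

    #include <math.h>
    #include <cblas.h>

    /* ||x||_2 via a fused dot product: sqrt( x . x ). */
    float snorm2_via_dot( int n, const float* x )
    {
        return sqrtf( cblas_sdot( n, x, 1, x, 1 ) );
    }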

Profile Data – Cholesky, LU & QR

(Profiling charts for Cholesky, LU, and QR factorization.)

Blis optimizations Impact – Cholesky & LU Factorization (AMD reference CPU, Ubuntu 14.04 LTS 64-bit OS)

(Charts: Cholesky ~7% improvement and LU ~14% improvement, both better than OpenBLAS at larger sizes; small sizes still need improvement.)

Profile Data – Cholesky

(Profiles at square matrix sizes 128, 160, 640, and 1600.)

To improve performance for small matrices: optimize dotxv (a reference sketch follows).
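
A reference sketch of the dotxv operation named above (BLIS's extended dot product): rho := beta * rho + alpha * x^T y. For the small Cholesky sizes profiled here, this scalar loop is what a vectorized (e.g. AVX) kernel would replace.

    void ddotxv_ref( int n, double alpha, const double* x, int incx,
                     const double* y, int incy, double beta, double* rho )
    {
        double dot = 0.0;
        for( int i = 0; i < n; ++i )
            dot += x[ i * (size_t)incx ] * y[ i * (size_t)incy ];
        *rho = beta * (*rho) + alpha * dot;
    }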

Profile Data – LU

(Profiles at square matrix sizes 128 and 320.)

To improve performance for small matrices: optimize the framework code.

Profile Data – QR

(Profiles at square matrix sizes 160 and 480.)

For small matrices, the Level-1 routines dominate.

Summary

- For larger matrix sizes, Cholesky factorization performance is better than OpenBLAS.
- To improve Cholesky performance for smaller matrices, optimize the dotxv routine and the framework code.
- LU performance is better than OpenBLAS for matrices larger than 320.
- QR factorization performance improved by more than 2x, but still falls short of OpenBLAS at all matrix sizes.

Thank You

libflame - Parameters

    3      Number of repeats per experiment
    c      Flat matrix storage scheme(s) to test ('c' = col-major; 'r' = row-major; 'g' = general; 'm' = mixed)
    s      Datatype(s) to test
    40     Algorithmic blocksize for blocked algorithms
    10     Algorithmic blocksize for algorithms-by-blocks
    40     Storage blocksize for algorithms-by-blocks
    160    Problem size: first to test
    1600   Problem size: maximum to test
    160    Problem size: increment between experiments
    0      Number of SuperMatrix threads (0 = disable SuperMatrix in FLASH front-ends)
    i      Reaction to test failure ('i' = ignore; 's' = sleep() and continue; 'a' = abort)

Disclaimer & Attribution

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION
© 2016 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). Other names are for informational purposes only and may be trademarks of their respective owners.