High-performance Implementations of Fast Matrix Multiplication with Strassen’s Algorithm Jianyu Huang with Tyler M. Smith, Greg M. Henry, Robert A. van de Geijn

STRASSEN, from 30,000 feet
Volker Strassen (born in 1936, aged 80). Original Strassen paper (1969).
*Overlook of the Bay Area. Photo taken in Mission Peak Regional Preserve, Fremont, CA. Summer.

One-level Strassen's Algorithm (In theory)
Direct computation: 8 multiplications, 8 additions.
Strassen's algorithm: 7 multiplications, 22 additions.
*Strassen, Volker. "Gaussian elimination is not optimal." Numerische Mathematik 13, no. 4 (1969): 354–356.

Multi-level Strassen's Algorithm (In theory)
M0 := (A00+A11)(B00+B11); M1 := (A10+A11)B00; M2 := A00(B01–B11); M3 := A11(B10–B00);
M4 := (A00+A01)B11; M5 := (A10–A00)(B00+B01); M6 := (A01–A11)(B10+B11);
C00 += M0 + M3 – M4 + M6; C01 += M2 + M4; C10 += M1 + M3; C11 += M0 – M1 + M2 + M5
One-level Strassen (1+14.3% speedup): 8 multiplications → 7 multiplications.
Two-level Strassen (1+30.6% speedup): 64 multiplications → 49 multiplications.
d-level Strassen: 8^d multiplications → 7^d multiplications, a factor of (8/7)^d; carried to full recursion this gives the O(n^2.807) operation count, i.e. an asymptotic n^3/n^2.807 speedup over the direct computation.

Multi-level Strassen's Algorithm (In theory)
Numerical stability:
*Grey Ballard, Austin R. Benson, Alex Druinsky, Benjamin Lipshitz, and Oded Schwartz. "Improving the numerical stability of fast matrix multiplication algorithms." arXiv preprint (2015).
*James Demmel, Ioana Dumitriu, Olga Holtz, and Robert Kleinberg. "Fast matrix multiplication is stable." Numerische Mathematik 106, no. 2 (2007).
*Nicholas J. Higham. Accuracy and Stability of Numerical Algorithms. SIAM.
M0 := (A00+A11)(B00+B11); M1 := (A10+A11)B00; M2 := A00(B01–B11); M3 := A11(B10–B00);
M4 := (A00+A01)B11; M5 := (A10–A00)(B00+B01); M6 := (A01–A11)(B10+B11);
C00 += M0 + M3 – M4 + M6; C01 += M2 + M4; C10 += M1 + M3; C11 += M0 – M1 + M2 + M5
One-level Strassen (1+14.3% speedup): 8 multiplications → 7 multiplications.
Two-level Strassen (1+30.6% speedup): 64 multiplications → 49 multiplications.
d-level Strassen: 8^d multiplications → 7^d multiplications.

Strassen's Algorithm (In practice)
M0 := (A00+A11)(B00+B11); M1 := (A10+A11)B00; M2 := A00(B01–B11); M3 := A11(B10–B00);
M4 := (A00+A01)B11; M5 := (A10–A00)(B00+B01); M6 := (A01–A11)(B10+B11);
C00 += M0 + M3 – M4 + M6; C01 += M2 + M4; C10 += M1 + M3; C11 += M0 – M1 + M2 + M5

Strassen's Algorithm (In practice)
M0 := (A00+A11)(B00+B11); M1 := (A10+A11)B00; M2 := A00(B01–B11); M3 := A11(B10–B00);
M4 := (A00+A01)B11; M5 := (A10–A00)(B00+B01); M6 := (A01–A11)(B10+B11);
C00 += M0 + M3 – M4 + M6; C01 += M2 + M4; C10 += M1 + M3; C11 += M0 – M1 + M2 + M5
One-level Strassen (1+14.3% speedup): 7 multiplications + 22 additions.
Two-level Strassen (1+30.6% speedup): 49 multiplications plus many more additions.
d-level Strassen: numerically unstable; the full n^3/n^2.807 speedup is not achievable in practice.
A conventional implementation requires temporary buffers T0–T9 for the intermediate sums and products.

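To make the conventional, buffer-based formulation above concrete, here is a minimal C sketch of one level of Strassen. It is not the paper's implementation: gemm (computing C += A*B), the small add/acc helpers, and the buffer handling are all assumptions for illustration. Only the first product M0 is written out; M1–M6 follow the same add / add / multiply / accumulate pattern from the formulas above.

    #include <stdlib.h>

    /* Assumed building block: C += A*B (column-major, m x n result, k inner dim).
       In practice this would be a DGEMM call. */
    void gemm(int m, int n, int k, const double *A, int lda,
              const double *B, int ldb, double *C, int ldc);

    /* Z := alpha*X + beta*Y (column-major m x n) */
    static void add(int m, int n, double alpha, const double *X, int ldx,
                    double beta, const double *Y, int ldy, double *Z, int ldz) {
        for (int j = 0; j < n; j++)
            for (int i = 0; i < m; i++)
                Z[i + j*ldz] = alpha*X[i + j*ldx] + beta*Y[i + j*ldy];
    }

    /* Y += alpha*X (column-major m x n) */
    static void acc(int m, int n, double alpha, const double *X, int ldx,
                    double *Y, int ldy) {
        for (int j = 0; j < n; j++)
            for (int i = 0; i < m; i++)
                Y[i + j*ldy] += alpha*X[i + j*ldx];
    }

    /* One level of Strassen for C += A*B, square n x n operands, n even. */
    void strassen_one_level(int n, const double *A, int lda,
                            const double *B, int ldb, double *C, int ldc) {
        int h = n/2;
        const double *A00 = A, *A01 = A + h*lda, *A10 = A + h, *A11 = A + h*(lda+1);
        const double *B00 = B, *B01 = B + h*ldb, *B10 = B + h, *B11 = B + h*(ldb+1);
        double *C00 = C, *C01 = C + h*ldc, *C10 = C + h, *C11 = C + h*(ldc+1);

        double *S = calloc((size_t)h*h, sizeof *S);  /* sum of A submatrices */
        double *T = calloc((size_t)h*h, sizeof *T);  /* sum of B submatrices */
        double *M = calloc((size_t)h*h, sizeof *M);  /* one product          */

        /* M0 := (A00+A11)(B00+B11); C00 += M0; C11 += M0; */
        add(h, h, 1.0, A00, lda, 1.0, A11, lda, S, h);
        add(h, h, 1.0, B00, ldb, 1.0, B11, ldb, T, h);
        for (int i = 0; i < h*h; i++) M[i] = 0.0;  /* M must be (re)zeroed per product */
        gemm(h, h, h, S, h, T, h, M, h);
        acc(h, h, 1.0, M, h, C00, ldc);
        acc(h, h, 1.0, M, h, C11, ldc);

        /* M1 ... M6 and their updates of C follow the remaining formulas above,
           each as the same add / add / gemm / acc sequence. */

        free(S); free(T); free(M);
    }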

To achieve practical high performance of Strassen's algorithm...
Conventional implementations: matrices must be large; matrices must be square; extra workspace (padding, temporary buffers) is required; extra memory movement is incurred; parallelism is usually task parallelism.
Our implementations: no additional workspace; reduced memory movement; no requirement that matrices be large or square; parallelism can be data parallelism.

Outline Standard Matrix-matrix multiplication Strassen’s Algorithm Reloaded High-performance Implementations Theoretical Model and Analysis Performance Experiments Conclusion

Level-3 BLAS matrix-matrix multiplication (GEMM)
(General) matrix-matrix multiplication (GEMM) is supported in the level-3 BLAS* interface as
    DGEMM( transa, transb, m, n, k, alpha, A, lda, B, ldb, beta, C, ldc )
Ignoring transa and transb, GEMM computes C := alpha*A*B + beta*C.
We consider the simplified version of GEMM: C += A*B (alpha = beta = 1).
*Dongarra, Jack J., et al. "A set of level 3 basic linear algebra subprograms." ACM Transactions on Mathematical Software (TOMS) 16.1 (1990): 1-17.
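As a small illustration (not from the slides), the call below uses the standard Fortran-style BLAS interface to compute the simplified operation C += A*B for 2x2 column-major matrices; any BLAS library that exports dgemm_, such as BLIS, can be linked.

    #include <stdio.h>

    /* Fortran BLAS prototype (column-major); all arguments are passed by reference. */
    extern void dgemm_(const char *transa, const char *transb,
                       const int *m, const int *n, const int *k,
                       const double *alpha, const double *A, const int *lda,
                       const double *B, const int *ldb,
                       const double *beta, double *C, const int *ldc);

    int main(void) {
        int m = 2, n = 2, k = 2;
        double A[] = {1, 3, 2, 4};   /* column-major: [1 2; 3 4] */
        double B[] = {5, 7, 6, 8};   /* column-major: [5 6; 7 8] */
        double C[] = {0, 0, 0, 0};
        double one = 1.0;
        /* C := 1.0*A*B + 1.0*C, i.e. C += A*B */
        dgemm_("N", "N", &m, &n, &k, &one, A, &m, B, &k, &one, C, &m);
        printf("%g %g\n%g %g\n", C[0], C[2], C[1], C[3]);  /* 19 22 / 43 50 */
        return 0;
    }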

State-of-the-art GEMM in BLIS
BLAS-like Library Instantiation Software (BLIS) is a portable framework for instantiating BLAS-like dense linear algebra libraries.
BLIS provides a refactoring of the GotoBLAS algorithm (the best-known approach) for implementing GEMM.
The GEMM implementation in BLIS consists of six layers of loops. The outer five loops are written in C; the inner-most loop (the micro-kernel) is written in assembly for high performance.
 Matrices are partitioned into smaller blocks that fit into the different levels of the memory hierarchy.
 The ordering of the loops is designed to maximize cache reuse.
BLIS opens up the black box of GEMM, and many applications have been built on top of it:
 Field Van Zee, and Robert van de Geijn. "BLIS: A Framework for Rapidly Instantiating BLAS Functionality." ACM TOMS 41.3 (2015): 14.
 Chenhan D. Yu, *Jianyu Huang, Woody Austin, Bo Xiao, and George Biros. "Performance optimization for the k-nearest neighbors kernel on x86 architectures." In SC'15. ACM, 2015.
 *Jianyu Huang, Tyler Smith, Greg Henry, and Robert van de Geijn. "Strassen's Algorithm Reloaded." In submission to SC'16.
 Devin Matthews, Field Van Zee, and Robert van de Geijn. "High-Performance Tensor Contraction without BLAS." In submission to SC'16.
 Kazushige Goto, and Robert van de Geijn. "High-performance implementation of the level-3 BLAS." ACM TOMS 35.1 (2008): 4.
 Kazushige Goto, and Robert van de Geijn. "Anatomy of high-performance matrix multiplication." ACM TOMS 34.3 (2008): 12.


Memory Hierarchy (smaller and faster toward the registers)
Ivy Bridge (AVX, 256-bit): registers → L1 cache (32 KB) → L2 cache (256 KB) → L3 cache (25.6 MB) → main memory (128 GB)
Xeon Phi (512-bit SIMD): registers → L1 cache (32 KB) → L2 cache (512 KB) → main memory

GotoBLAS algorithm for GEMM in BLIS
5th loop around the micro-kernel: C and B are partitioned into column panels Cj and Bj of width nC; the data start in main memory.
4th loop around the micro-kernel: A and Bj are partitioned along the k dimension into panels Ap and Bp of depth kC; Bp is packed into a contiguous buffer B~p, which is kept in the L3 cache.
3rd loop around the micro-kernel: Cj and Ap are partitioned into blocks Ci and Ai of height mC; Ai is packed into a contiguous buffer A~i, which is kept in the L2 cache.
2nd loop around the micro-kernel: Ci and B~p are partitioned into micro-panels of width nR.
1st loop around the micro-kernel: A~i is partitioned into micro-panels of height mR; the current micro-panel of B~p is kept in the L1 cache, and an mR×nR block Cij of C is updated.
Micro-kernel: the mR×nR block of C is accumulated in registers, one rank-1 update per iteration over the kC dimension.
*Field G. Van Zee, and Tyler M. Smith. "Implementing high-performance complex matrix multiplication." ACM Transactions on Mathematical Software (TOMS), accepted pending modifications.
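The loop structure just described can be sketched in C as follows. This is an illustrative skeleton, not BLIS itself: pack_A, pack_B, and micro_kernel are assumed helper routines, the blocking parameters are placeholder values, and for brevity the fringe cases where mc or nc are not multiples of MR or NR are ignored.

    /* Blocking parameters (illustrative values only). */
    enum { MC = 96, NC = 4096, KC = 256, MR = 8, NR = 4 };

    /* Assumed helpers: pack into micro-panel storage, and the micro-kernel
       computing an MR x NR block of C += (MR x kc panel) * (kc x NR panel). */
    void pack_A(int mc, int kc, const double *A, int lda, double *Atilde);
    void pack_B(int kc, int nc, const double *B, int ldb, double *Btilde);
    void micro_kernel(int kc, const double *a, const double *b, double *C, int ldc);

    /* Five loops around the micro-kernel for C += A*B (column-major). */
    void gemm_5loops(int m, int n, int k,
                     const double *A, int lda, const double *B, int ldb,
                     double *C, int ldc,
                     double *Atilde,  /* MC x KC packed buffer (targets L2) */
                     double *Btilde)  /* KC x NC packed buffer (targets L3) */
    {
        for (int jc = 0; jc < n; jc += NC) {                 /* 5th loop: over n  */
            int nc = n - jc < NC ? n - jc : NC;
            for (int pc = 0; pc < k; pc += KC) {             /* 4th loop: over k  */
                int kc = k - pc < KC ? k - pc : KC;
                pack_B(kc, nc, &B[pc + (size_t)jc*ldb], ldb, Btilde);
                for (int ic = 0; ic < m; ic += MC) {         /* 3rd loop: over m  */
                    int mc = m - ic < MC ? m - ic : MC;
                    pack_A(mc, kc, &A[ic + (size_t)pc*lda], lda, Atilde);
                    for (int jr = 0; jr < nc; jr += NR)      /* 2nd loop          */
                        for (int ir = 0; ir < mc; ir += MR)  /* 1st loop          */
                            micro_kernel(kc,
                                         &Atilde[ir*kc],     /* ir-th micro-panel */
                                         &Btilde[jr*kc],     /* jr-th micro-panel */
                                         &C[(ic+ir) + (size_t)(jc+jr)*ldc], ldc);
                }
            }
        }
    }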

Outline Conventional Matrix-matrix multiplication Strassen’s Algorithm Reloaded High-performance Implementations Theoretical Model and Analysis Performance Experiments Conclusion

One-level Strassen's Algorithm Reloaded
M0 := (A00+A11)(B00+B11); M1 := (A10+A11)B00; M2 := A00(B01–B11); M3 := A11(B10–B00);
M4 := (A00+A01)B11; M5 := (A10–A00)(B00+B01); M6 := (A01–A11)(B10+B11);
C00 += M0 + M3 – M4 + M6; C01 += M2 + M4; C10 += M1 + M3; C11 += M0 – M1 + M2 + M5
Rearranged so that each product is computed and immediately accumulated:
M0 := (A00+A11)(B00+B11); C00 += M0; C11 += M0;
M1 := (A10+A11)B00; C10 += M1; C11 –= M1;
M2 := A00(B01–B11); C01 += M2; C11 += M2;
M3 := A11(B10–B00); C00 += M3; C10 += M3;
M4 := (A00+A01)B11; C01 += M4; C00 –= M4;
M5 := (A10–A00)(B00+B01); C11 += M5;
M6 := (A01–A11)(B10+B11); C00 += M6;
Each line has the form M := (X+Y)(V+W); C += M; D += M. General operation for one-level Strassen: M := (X + δ·Y)(V + ε·W); C += γ0·M; D += γ1·M; with γ0, γ1, δ, ε ∈ {–1, 0, 1}.
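As a semantic reference only (the fused, high-performance implementation is the subject of the following slides), the general operation can be written as a plain triple loop in C; the routine name and argument layout are illustrative assumptions.

    /* Reference semantics of the general one-level operation:
           M := (X + delta*Y) * (V + epsilon*W);  C += gamma0*M;  D += gamma1*M;
       with gamma0, gamma1, delta, epsilon in {-1, 0, 1}.
       Naive triple loop, column-major; M is never formed explicitly. */
    void general_op(int m, int n, int k,
                    double delta,   const double *X, int ldx, const double *Y, int ldy,
                    double epsilon, const double *V, int ldv, const double *W, int ldw,
                    double gamma0, double *C, int ldc,
                    double gamma1, double *D, int ldd)
    {
        for (int j = 0; j < n; j++)
            for (int i = 0; i < m; i++) {
                double mij = 0.0;
                for (int p = 0; p < k; p++)
                    mij += (X[i + (size_t)p*ldx] + delta  *Y[i + (size_t)p*ldy])
                         * (V[p + (size_t)j*ldv] + epsilon*W[p + (size_t)j*ldw]);
                C[i + (size_t)j*ldc] += gamma0*mij;
                D[i + (size_t)j*ldd] += gamma1*mij;
            }
    }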

General operation for one-level Strassen: M := (X + δ·Y)(V + ε·W); C += γ0·M; D += γ1·M; with γ0, γ1, δ, ε ∈ {–1, 0, 1}.
How can this general operation be implemented with high performance?
*Jianyu Huang, Tyler Smith, Greg Henry, and Robert van de Geijn. "Strassen's Algorithm Reloaded." In submission to SC'16.

The general operation M := (X + δ·Y)(V + ε·W); C += γ0·M; D += γ1·M (γ0, γ1, δ, ε ∈ {–1, 0, 1}) is implemented by modifying the GotoBLAS/BLIS loops:
5th loop around the micro-kernel: X and Y (playing the role of A) and V and W (playing the role of B) are partitioned into column panels of width nC, as are C and D; the data start in main memory.
4th loop around the micro-kernel: the additions on the B side are folded into packing: B~p := Vp + ε·Wp is formed while packing and kept in the L3 cache.
3rd loop around the micro-kernel: the additions on the A side are folded into packing: A~i := Xi + δ·Yi is formed while packing and kept in the L2 cache.
2nd loop around the micro-kernel: B~p is partitioned into micro-panels of width nR.
1st loop around the micro-kernel: A~i is partitioned into micro-panels of height mR; the current micro-panel of B~p is kept in the L1 cache.
Micro-kernel: the mR×nR product is accumulated in registers and then used to update the corresponding blocks Cij and Dij of both output matrices.
*Jianyu Huang, Tyler Smith, Greg Henry, and Robert van de Geijn. "Strassen's Algorithm Reloaded." In submission to SC'16.

[Side-by-side figure: the GotoBLAS loop structure for GEMM (left) next to the modified loop structure for the general one-level Strassen operation (right). The differences are the additions fused into the packing of A~i and B~p and the update of both C and D from the registers.]

Two-level Strassen’s Algorithm Reloaded

Two-level Strassen's Algorithm Reloaded (continued)
General operation for two-level Strassen:
M := (X0 + δ1·X1 + δ2·X2 + δ3·X3)(V0 + ε1·V1 + ε2·V2 + ε3·V3); C0 += γ0·M; C1 += γ1·M; C2 += γ2·M; C3 += γ3·M; with γi, δi, εi ∈ {–1, 0, 1}.
Special case with all coefficients equal to 1:
M := (X0+X1+X2+X3)(V0+V1+V2+V3); C0 += M; C1 += M; C2 += M; C3 += M;

Additional Levels of Strassen Reloaded
The general operation for one-level Strassen: M := (X + δ·Y)(V + ε·W); C += γ0·M; D += γ1·M; γ0, γ1, δ, ε ∈ {–1, 0, 1}.
The general operation for two-level Strassen: M := (X0 + δ1·X1 + δ2·X2 + δ3·X3)(V0 + ε1·V1 + ε2·V2 + ε3·V3); Ci += γi·M for i = 0, ..., 3; γi, δi, εi ∈ {–1, 0, 1}.
The general operation needed to integrate k levels of Strassen has the same form: a single product of a δ-weighted sum of submatrices of A and an ε-weighted sum of submatrices of B, whose result is scaled by γi and accumulated into a list of submatrices of C.
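The reference semantics extend to k levels by passing lists of submatrices and coefficients. Again, this is a naive, illustrative sketch (names and argument layout are assumptions); the actual implementation folds the input sums into packing and the output updates into the micro-kernel and never forms M.

    /* Reference semantics of the k-level general operation:
           M := (sum_s delta[s]*X[s]) * (sum_t eps[t]*V[t]);   C[i] += gamma[i]*M
       for lists of submatrices; all coefficients are in {-1, 0, 1}. */
    void general_op_klevel(int m, int n, int k,
                           int nx, const double *X[], const int ldx[], const double delta[],
                           int nv, const double *V[], const int ldv[], const double eps[],
                           int nc, double *C[],       const int ldc[], const double gamma[])
    {
        for (int j = 0; j < n; j++)
            for (int i = 0; i < m; i++) {
                double mij = 0.0;
                for (int p = 0; p < k; p++) {
                    double a = 0.0, b = 0.0;
                    for (int s = 0; s < nx; s++) a += delta[s]*X[s][i + (size_t)p*ldx[s]];
                    for (int t = 0; t < nv; t++) b += eps[t]  *V[t][p + (size_t)j*ldv[t]];
                    mij += a*b;
                }
                for (int c = 0; c < nc; c++) C[c][i + (size_t)j*ldc[c]] += gamma[c]*mij;
            }
    }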

Outline Conventional Matrix-matrix multiplication Strassen’s Algorithm Reloaded High-performance Implementations Theoretical Model and Analysis Performance Experiments Conclusion

Building blocks
From the BLIS framework:
 A routine for packing Bp into B~p (C/Intel intrinsics).
 A routine for packing Ai into A~i (C/Intel intrinsics).
 A micro-kernel for updating an mR×nR submatrix of C (SIMD assembly: AVX, AVX512, etc.).
Adapted to the general operation (a sketch of such an adapted packing routine follows below):
 Integrate the addition of multiple matrices Vt into the packing of B~p.
 Integrate the addition of multiple matrices Xs into the packing of A~i.
 Integrate the update of multiple submatrices of C into the micro-kernel.
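Here is a sketch of the first adapted building block: a packing routine for B~p that folds the extra addition into the packing, storing the kc×nc panel V + ε·W as contiguous micro-panels of width nr (column-major inputs, zero-padded right fringe). It is illustrative only; the production routines are written with Intel intrinsics and handle fringes differently.

    /* Pack the kc x nc panel (V + eps*W) into B~p as contiguous micro-panels
       of width nr, fusing the Strassen addition into the packing. */
    void pack_B_sum(int kc, int nc, int nr, double eps,
                    const double *V, int ldv,
                    const double *W, int ldw,
                    double *Btilde)
    {
        for (int jr = 0; jr < nc; jr += nr)           /* one micro-panel at a time  */
            for (int p = 0; p < kc; p++)              /* row within the micro-panel */
                for (int j = 0; j < nr; j++) {
                    int col = jr + j;
                    *Btilde++ = (col < nc)
                        ? V[p + (size_t)col*ldv] + eps*W[p + (size_t)col*ldw]
                        : 0.0;                        /* zero-pad the fringe */
                }
    }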

Variations on a theme
Naïve Strassen: a traditional implementation with temporary buffers.
AB Strassen: integrate the addition of matrices into the packing of A~i and B~p.
ABC Strassen: integrate the addition of matrices into the packing of A~i and B~p, and integrate the update of multiple submatrices of C into the micro-kernel.

Parallelization
For architectures with independent (per-core) L2 caches, we parallelize the 3rd loop (along the mC direction).
For architectures with a shared L2 cache, we parallelize the 2nd loop (along the nR direction).
For many-core architectures (e.g. Xeon Phi, 240 threads), we parallelize both the 3rd and the 2nd loop.
*Tyler M. Smith, Robert van de Geijn, Mikhail Smelyanskiy, Jeff R. Hammond, and Field G. Van Zee. "Anatomy of high-performance many-threaded matrix multiplication." In 28th IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2014.
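Sticking with the earlier five-loop sketch (same assumed helpers pack_A, pack_B, micro_kernel and the same illustrative blocking constants), parallelizing the 3rd loop with OpenMP might look as follows: B~p is packed once and shared, while each thread packs its own A~i into a private buffer that lives in its own L2 cache. Different iterations update disjoint row blocks of C, so no synchronization is needed beyond the loop itself.

    #include <omp.h>

    void gemm_5loops_mt(int m, int n, int k,
                        const double *A, int lda, const double *B, int ldb,
                        double *C, int ldc,
                        double *Atilde_per_thread[],  /* one MC x KC buffer per thread */
                        double *Btilde)               /* one shared KC x NC buffer     */
    {
        for (int jc = 0; jc < n; jc += NC) {
            int nc = n - jc < NC ? n - jc : NC;
            for (int pc = 0; pc < k; pc += KC) {
                int kc = k - pc < KC ? k - pc : KC;
                pack_B(kc, nc, &B[pc + (size_t)jc*ldb], ldb, Btilde);
                #pragma omp parallel for schedule(dynamic)   /* 3rd loop: mC direction */
                for (int ic = 0; ic < m; ic += MC) {
                    double *Atilde = Atilde_per_thread[omp_get_thread_num()];
                    int mc = m - ic < MC ? m - ic : MC;
                    pack_A(mc, kc, &A[ic + (size_t)pc*lda], lda, Atilde);
                    for (int jr = 0; jr < nc; jr += NR)
                        for (int ir = 0; ir < mc; ir += MR)
                            micro_kernel(kc, &Atilde[ir*kc], &Btilde[jr*kc],
                                         &C[(ic+ir) + (size_t)(jc+jr)*ldc], ldc);
                }
            }
        }
    }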

Outline Conventional Matrix-matrix multiplication Strassen’s Algorithm Reloaded High-performance Implementations Theoretical Model and Analysis Performance Experiments Conclusion

Performance Model
Performance metric
Total time breakdown
Arithmetic operations
Memory operations
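Roughly, and paraphrasing the model rather than reproducing the slide's equations: the performance metric is effective GFLOPS = 2·m·n·k / (time in seconds) / 10^9, charging every implementation for the 2mnk flops of the classical algorithm; the total time is modeled as T = Ta + Tm, where Ta counts the arithmetic operations weighted by the time per floating-point operation and Tm counts the dominant memory operations weighted by the time to move one double to or from main memory.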

Arithmetic Operations
DGEMM: no extra additions.
One-level Strassen (ABC, AB, Naïve): 7 submatrix multiplications; 5 extra additions of submatrices of A and B; 12 extra additions of submatrices of C.
Two-level Strassen (ABC, AB, Naïve): 49 submatrix multiplications; 95 extra additions of submatrices of A and B; 154 extra additions of submatrices of C.
(Figure: the cost terms are the submatrix multiplications and the extra additions with submatrices of A, B, and C, respectively, each weighted by the unit time for one arithmetic operation.)
M0 := (A00+A11)(B00+B11); C00 += M0; C11 += M0;
M1 := (A10+A11)B00; C10 += M1; C11 –= M1;
M2 := A00(B01–B11); C01 += M2; C11 += M2;
M3 := A11(B10–B00); C00 += M3; C10 += M3;
M4 := (A00+A01)B11; C01 += M4; C00 –= M4;
M5 := (A10–A00)(B00+B01); C11 += M5;
M6 := (A01–A11)(B10+B11); C00 += M6;
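As a rough sanity check (these numbers are not from the slides): for square n×n operands, the seven half-sized multiplications cost about 7·2(n/2)^3 = (7/8)·2n^3 flops, an 8/7 ≈ 1.143 reduction, while each extra addition of an (n/2)×(n/2) submatrix costs only (n/2)^2 flops. The additions are therefore lower-order terms, which is why the 14.3% and 30.6% theoretical speedups are approached only when the matrices are large enough for the multiplications to dominate.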

Memory Operations
Assumptions:
 A simplified memory hierarchy model (fast caches + slow main memory).
 Reads and writes to the fast caches are overlapped with computation.
Dominant memory operations:
 Memory packing in the DGEMM / adapted general-operation routine.
 Reading and writing submatrices of C in the (adapted) routine.
 Reading and writing the temporary buffers that are part of Naïve Strassen and AB Strassen.

Memory Operations
Modeled for DGEMM; one-level ABC, AB, and Naïve Strassen; and two-level ABC, AB, and Naïve Strassen.
A prefetching-efficiency factor in [0.5, 1] is applied to the bandwidth (software prefetching efficiency).
The modeled memory-operation counts are step functions with periodic spikes.
[The per-variant formulas appeared as equations on the slide.]

Modeled and Actual Performance on Single Core

Observation (Square Matrices)
[Figures: modeled vs. actual performance, one core, square matrices, for DGEMM and the one- and two-level Strassen variants.]

Observation (Square Matrices)
Theoretical speedup over DGEMM: one-level Strassen (1+14.3% speedup): 8 multiplications → 7 multiplications; two-level Strassen (1+30.6% speedup): 64 multiplications → 49 multiplications.
Observed on the figure: speedups of 13.1% and 26.0% (one-level and two-level, respectively).

Observation (Square Matrices)
For both one-level and two-level: for small square matrices, ABC Strassen outperforms AB Strassen; for larger square matrices, this trend reverses.
Reason: ABC Strassen avoids storing M (M resides in the registers), but it increases the number of times the submatrices of C are updated.


Observation (Rank-k Update)
What is a rank-k update? C += A·B, where C is m×n, A is m×k, and B is k×n, with k small relative to m and n.

Observation (Rank-k Update)
Importance of the rank-k update: the dominant computation in blocked LU with partial pivoting (getrf) is the trailing-matrix update, A22 += –(A21 × A12), which is exactly a rank-k update with k equal to the algorithmic block size.
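For example, using the dgemm_ prototype shown earlier, the trailing-matrix update of blocked LU can be expressed as a single rank-b DGEMM call; the routine name and the assumption that all three blocks live in one column-major array with leading dimension lda are illustrative.

    /* A22 := A22 - A21 * A12, a rank-b update with A21 (m x b) and A12 (b x n). */
    void trailing_update(int m, int n, int b,
                         const double *A21, const double *A12,
                         double *A22, int lda)
    {
        double minus_one = -1.0, one = 1.0;
        dgemm_("N", "N", &m, &n, &b,
               &minus_one, A21, &lda,
                           A12, &lda,
               &one,       A22, &lda);
    }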

Observation (Rank-k Update)
[Figures: modeled vs. actual performance for rank-k updates, one core.]

Observation (Rank-k Update)
For k equal to the appropriate multiple of kC, ABC Strassen performs dramatically better than the other implementations: k = 2kC for one-level Strassen, k = 4kC for two-level Strassen.
Reason: ABC Strassen avoids forming the temporary matrix M explicitly in memory (M resides in the registers), which is especially important when m, n >> k.

Bottom line: A different implementation may have its advantages depending on the problem size

Outline Conventional Matrix-matrix multiplication Strassen’s Algorithm Reloaded High-performance Implementations Theoretical Model and Analysis Performance Experiments Conclusion

Single Node Experiment

[Figures: single-node performance on 1, 5, and 10 cores, for square matrices and for rank-k updates.]

Many-core Experiment

Intel® Xeon Phi™ coprocessor (KNC)

Distributed Memory Experiment

Accelerating the SUMMA algorithm with Strassen
SUMMA casts distributed matrix multiplication as a sequence of rank-k updates: in each iteration, a panel Ap of A is broadcast within process rows and a panel Bp of B is broadcast within process columns, and every process performs a local rank-k update of its block of C. Replacing the local DGEMM by the Strassen-based rank-k update (with k a suitable multiple of kC, e.g. 4kC for two-level Strassen) accelerates the overall algorithm.
*Robert A. van de Geijn, and Jerrell Watts. "SUMMA: Scalable universal matrix multiplication algorithm." Concurrency: Practice and Experience 9, no. 4 (1997).
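A toy sketch of SUMMA on a square p×p process mesh follows, under simplifying assumptions that are not the general algorithm from the papers: the k dimension is split into exactly p panels of width kb; process (i, j) owns the mloc×kb block of A at block-row i, panel j (Aloc), the kb×nloc block of B at panel i, block-column j (Bloc), and the mloc×nloc block Cloc of C. row_comm groups the processes of one mesh row, col_comm those of one mesh column, and local_rankk_update is the assumed local kernel (plain DGEMM or the Strassen-based rank-k update).

    #include <mpi.h>
    #include <string.h>

    /* Assumed local kernel: C += A*B with A (m x k), B (k x n), column-major. */
    void local_rankk_update(int m, int n, int k,
                            const double *A, int lda,
                            const double *B, int ldb,
                            double *C, int ldc);

    void summa_toy(int mloc, int nloc, int kb, int p,
                   const double *Aloc,      /* my mloc x kb panel of A       */
                   const double *Bloc,      /* my kb x nloc panel of B       */
                   double *Cloc, int ldc,   /* my mloc x nloc block of C     */
                   MPI_Comm row_comm, MPI_Comm col_comm,
                   double *Apanel, double *Bpanel)  /* broadcast buffers     */
    {
        int my_col, my_row;
        MPI_Comm_rank(row_comm, &my_col);   /* my position within my mesh row    */
        MPI_Comm_rank(col_comm, &my_row);   /* my position within my mesh column */

        for (int it = 0; it < p; it++) {
            /* Process column `it` broadcasts its panel of A within each mesh row. */
            if (my_col == it)
                memcpy(Apanel, Aloc, (size_t)mloc*kb*sizeof(double));
            MPI_Bcast(Apanel, mloc*kb, MPI_DOUBLE, it, row_comm);

            /* Process row `it` broadcasts its panel of B within each mesh column. */
            if (my_row == it)
                memcpy(Bpanel, Bloc, (size_t)kb*nloc*sizeof(double));
            MPI_Bcast(Bpanel, kb*nloc, MPI_DOUBLE, it, col_comm);

            /* Local rank-kb update of my block of C. */
            local_rankk_update(mloc, nloc, kb, Apanel, mloc, Bpanel, kb, Cloc, ldc);
        }
    }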

Weak Scalability
An algorithm is weakly scalable if it maintains efficiency as the number of processors varies while the problem size per processor is fixed. SUMMA is weakly scalable.
[Figure axes: effective GFLOPS/socket vs. MPI process (socket) count, for a fixed problem size per socket.]

Efficiency: the problem size per socket is fixed at 16000.

Outline Conventional Matrix-matrix multiplication Strassen’s Algorithm Reloaded High-performance Implementations Theoretical Model and Analysis Performance Experiments Conclusion

To achieve practical high performance of Strassen's algorithm...
Conventional implementations: matrices must be large; matrices must be square; extra workspace (padding, temporary buffers) is required; extra memory movement is incurred; parallelism is usually task parallelism.
Our implementations: no additional workspace; reduced memory movement; no requirement that matrices be large or square; parallelism can be data parallelism.

Summary
Presented novel insights into the practical implementation of Strassen's algorithm.
Developed a family of Strassen implementations.
Derived a performance model that predicts the run time of the various implementations.
Demonstrated performance benefits for single-core, multi-core, many-core, and distributed-memory implementations.

Thank you! Jianyu Huang – Tyler Smith – Get BLIS: –