
Targeting Multi-Core systems in Linear Algebra applications. Alfredo Buttari, Jack Dongarra, Jakub Kurzak and Julien Langou. Petascale Applications Symposium, Pittsburgh Supercomputing Center, June 22-23, 2007.

The free lunch is over.
- Problem (hardware): power consumption, heat dissipation, pins.
- Solution: reduce the clock rate and increase the number of execution units, i.e., multicore.
- Consequence (software): non-parallel software won't run any faster. A new approach to programming is required.

What is a multicore processor, BTW? "A processor that combines two or more independent processors into a single package" (Wikipedia). What about:
- the types of core? homogeneous (AMD Opteron, Intel Woodcrest, ...) or heterogeneous (STI Cell, Sun Niagara, ...)?
- the memory? how is it arranged?
- the bus? is it going to be fast enough?
- the cache? shared (Intel/AMD) or not present at all (STI Cell)?
- the communications?

Parallelism in Linear Algebra software so far:
- shared-memory parallelism: LAPACK running on top of a threaded BLAS (PThreads, OpenMP);
- distributed-memory parallelism: ScaLAPACK running on top of the PBLAS and BLACS + MPI.

Parallelism in LAPACK: Cholesky factorization.
- DPOTF2: BLAS-2 non-blocked factorization of the panel.
- DTRSM: BLAS-3 update by applying the transformation computed in DPOTF2.
- DGEMM (DSYRK): BLAS-3 update of the trailing submatrix.
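For concreteness, one blocked right-looking Cholesky step can be sketched in C with CBLAS/LAPACKE calls; this is a minimal sketch, not LAPACK's actual DPOTRF source, with LAPACKE_dpotrf standing in for the unblocked DPOTF2 panel step and DSYRK performing the trailing update:

#include <cblas.h>
#include <lapacke.h>

/* Blocked right-looking Cholesky (lower triangular) of an n x n
   column-major matrix A with leading dimension lda and block size nb. */
void chol_blocked(double *A, int n, int lda, int nb)
{
    for (int k = 0; k < n; k += nb) {
        int kb = (n - k < nb) ? n - k : nb;
        /* DPOTF2-like step: factorize the kb x kb diagonal block */
        LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', kb, &A[k + (size_t)k * lda], lda);
        if (k + kb < n) {
            int m = n - k - kb;                 /* rows below the diagonal block */
            /* DTRSM: compute the panel L21 = A21 * L11^{-T} */
            cblas_dtrsm(CblasColMajor, CblasRight, CblasLower, CblasTrans,
                        CblasNonUnit, m, kb, 1.0,
                        &A[k + (size_t)k * lda], lda,
                        &A[(k + kb) + (size_t)k * lda], lda);
            /* DSYRK: update the trailing submatrix A22 -= L21 * L21^T */
            cblas_dsyrk(CblasColMajor, CblasLower, CblasNoTrans,
                        m, kb, -1.0, &A[(k + kb) + (size_t)k * lda], lda,
                        1.0, &A[(k + kb) + (size_t)(k + kb) * lda], lda);
        }
    }
}

Each iteration is a fork-join step: the BLAS-3 updates can use all cores, but the panel factorization and the synchronization between steps serialize the execution, which is exactly the limitation discussed next.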

Parallelism in LAPACK: Cholesky factorization. BLAS-2 operations cannot be efficiently parallelized because they are bandwidth bound. The result is strict synchronizations, poor parallelism, and poor scalability.

Parallelism in LAPACK: Cholesky factorization. The execution flow is filled with stalls due to synchronizations and sequential operations (execution trace over time).

Parallelism in LAPACK: Cholesky factorization. Tiling the operations:
  do DPOTF2 on the diagonal tile
  for all tiles in the panel below it: do DTRSM; end
  for all tiles in the trailing submatrix: do DGEMM; end

Parallelism in LAPACK: Cholesky factorization. Cholesky can be represented as a Directed Acyclic Graph (DAG) where nodes are subtasks and edges are dependencies among them. As long as dependencies are not violated, tasks can be scheduled in any order. (The figure shows the DAG, with tiles indexed row:column.)
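For illustration (not taken from the slides), the tasks and dependencies for a 3 x 3 grid of tiles, indexed row:column as in the DAG figure, can be listed as follows; POTRF, TRSM, SYRK and GEMM here denote the tile-level counterparts of DPOTF2, DTRSM, DSYRK and DGEMM:

POTRF(1:1)
TRSM(2:1)    after POTRF(1:1)
TRSM(3:1)    after POTRF(1:1)
SYRK(2:2)    after TRSM(2:1)
GEMM(3:2)    after TRSM(2:1) and TRSM(3:1)
SYRK(3:3)    after TRSM(3:1)
POTRF(2:2)   after SYRK(2:2)
TRSM(3:2)    after POTRF(2:2) and GEMM(3:2)
SYRK(3:3)    after TRSM(3:2)      (second update of tile 3:3)
POTRF(3:3)   after both SYRK(3:3) updates

Any schedule that respects these edges is valid, so for example TRSM(3:1) and SYRK(2:2) can run concurrently.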

Parallelism in LAPACK: Cholesky factorization (execution trace over time). Benefits:
- higher flexibility
- some degree of adaptivity
- no idle time
- better scalability
Cost:

Parallelism in LAPACK: block data layout. (The figures contrast column-major storage with block data layout, in which each tile is stored contiguously in memory.)

Parallelism in LAPACK: block data layout. The use of block data layout storage can significantly improve performance.
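For illustration, converting a column-major matrix into block (tile) data layout might look like the sketch below; to_block_layout is a hypothetical helper, not a LAPACK or PLASMA routine, and edge tiles are zero-padded to nb x nb:

#include <stdlib.h>
#include <string.h>

/* Copy an n x n column-major matrix A (leading dimension lda) into block
   data layout: an nt x nt grid of nb x nb tiles, each tile stored
   contiguously in column-major order. */
double *to_block_layout(const double *A, int n, int lda, int nb)
{
    int nt = (n + nb - 1) / nb;                      /* tiles per dimension */
    double *B = calloc((size_t)nt * nt * nb * nb, sizeof *B);
    for (int jt = 0; jt < nt; jt++)
        for (int it = 0; it < nt; it++) {
            double *tile = B + ((size_t)jt * nt + it) * nb * nb;
            int rows = (it == nt - 1) ? n - it * nb : nb;
            int cols = (jt == nt - 1) ? n - jt * nb : nb;
            for (int j = 0; j < cols; j++)           /* copy one tile column */
                memcpy(tile + (size_t)j * nb,
                       A + (size_t)(jt * nb + j) * lda + it * nb,
                       (size_t)rows * sizeof *A);
        }
    return B;
}

With this layout each tile kernel reads and writes a contiguous block, which improves cache and TLB behavior compared with strided column-major access.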

Cholesky: performance

Parallelism in LAPACK: LU/QR factorizations.
- DGETF2: BLAS-2 non-blocked panel factorization.
- DTRSM: BLAS-3 update of U with the transformation computed in DGETF2.
- DGEMM: BLAS-3 update of the trailing submatrix.
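A hedged sketch of one blocked LU step in C with CBLAS/LAPACKE (LAPACKE_dgetrf standing in for the unblocked DGETF2 panel step; this is not LAPACK's actual DGETRF source) also shows why pivoting resists tiling: the row swaps found in the panel must be applied across the entire trailing matrix.

#include <cblas.h>
#include <lapacke.h>

/* Blocked right-looking LU with partial pivoting of an n x n column-major
   matrix; ipiv receives 1-based global pivot indices, as in LAPACK. */
void lu_blocked(double *A, lapack_int n, lapack_int lda, lapack_int nb,
                lapack_int *ipiv)
{
    for (lapack_int k = 0; k < n; k += nb) {
        lapack_int kb = (n - k < nb) ? n - k : nb;
        /* DGETF2-like step: factorize the (n-k) x kb panel with pivoting */
        LAPACKE_dgetrf(LAPACK_COL_MAJOR, n - k, kb,
                       &A[k + (size_t)k * lda], lda, &ipiv[k]);
        for (lapack_int i = k; i < k + kb; i++)
            ipiv[i] += k;                  /* panel-relative -> global (1-based) */
        /* apply the panel's row swaps to the columns on the left ... */
        LAPACKE_dlaswp(LAPACK_COL_MAJOR, k, A, lda, k + 1, k + kb, ipiv, 1);
        if (k + kb < n) {
            lapack_int nr = n - k - kb;    /* columns right of the panel */
            /* ... and to the columns on the right */
            LAPACKE_dlaswp(LAPACK_COL_MAJOR, nr, &A[(size_t)(k + kb) * lda],
                           lda, k + 1, k + kb, ipiv, 1);
            /* DTRSM: U12 = L11^{-1} * A12 */
            cblas_dtrsm(CblasColMajor, CblasLeft, CblasLower, CblasNoTrans,
                        CblasUnit, kb, nr, 1.0, &A[k + (size_t)k * lda], lda,
                        &A[k + (size_t)(k + kb) * lda], lda);
            /* DGEMM: A22 = A22 - L21 * U12 */
            cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                        n - k - kb, nr, kb, -1.0,
                        &A[(k + kb) + (size_t)k * lda], lda,
                        &A[k + (size_t)(k + kb) * lda], lda, 1.0,
                        &A[(k + kb) + (size_t)(k + kb) * lda], lda);
        }
    }
}

Because each DLASWP touches full rows of the trailing matrix, the panel cannot simply be split into independent tiles, which is the point made on the next slide.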

Parallelism in LAPACK: LU/QR factorizations. The LU and QR factorization algorithms in LAPACK don't allow for a 2D distribution and a block storage format. LU: pivoting takes the whole panel into account and cannot be split in a block fashion. QR: the computation of the Householder reflectors acts on the whole panel; the application of the transformation can only be sliced, not blocked.

Parallelism in LAPACK: LU/QR factorizations (execution trace over time).

LU factorization: performance

Multicore-friendly, “delightfully parallel”* algorithms. Computer science can't go any further with the old algorithms; we need some math... (*quote from Prof. S. Kale)

The QR factorization in LAPACK. The QR factorization decomposes a matrix A into the factors Q and R, where Q is unitary and R is upper triangular; it is based on Householder reflections. Assume that the part of the matrix highlighted in the figure has already been factorized and contains the Householder reflectors that determine the matrix Q.
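For reference, the standard Householder construction behind this (general background, not copied from the slides) is

\[
H_k = I - \tau_k\, v_k v_k^{T}, \qquad \tau_k = \frac{2}{v_k^{T} v_k}, \qquad
H_{n-1} \cdots H_2 H_1 A = R, \qquad Q = H_1 H_2 \cdots H_{n-1},
\]

so that A = QR for a square n x n matrix A, with each reflector zeroing the entries below the diagonal of one column.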

The QR factorization in LAPACK. The current panel is factorized with DGEQR2, the unblocked Householder QR kernel (figure).

The QR factorization in LAPACK. The trailing submatrix is then updated with DLARFB, which applies the block of Householder reflectors computed by DGEQR2 (figure).

The QR factorization in LAPACK. How does it compare to LU? It is stable because it uses Householder transformations, which are orthogonal. It is more expensive than LU: its operation count is about 4/3 n^3 flops versus about 2/3 n^3 flops for LU.

Multicore friendly algorithms. A different algorithm can be used in which the operations are broken down into tiles. Step 1 (DGEQR2-like kernel): the QR factorization of the upper-left tile is performed. This operation returns a small R factor and the corresponding Householder reflectors.

Multicore friendly algorithms. Step 2 (DLARFB-like kernel): all the tiles in the first block-row are updated by applying the transformation computed at the previous step.

Multicore friendly algorithms. Step 3 (DGEQR2-like kernel): the R factor computed at the first step is coupled with one tile in the block-column and a QR factorization of the pair is computed. Flops can be saved thanks to the shape of the matrix resulting from the coupling.

Multicore friendly algorithms. Step 4 (DLARFB-like kernel): each couple of tiles along the corresponding block rows is updated by applying the transformations computed in the previous step. Flops can be saved by exploiting the shape of the Householder vectors.

Multicore friendly algorithms. The last two steps are repeated for all the tiles in the first block-column. Overall this performs 25% more flops than the LAPACK version!* (*We are working on a way to remove these extra flops.)
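Putting the four steps together, the tile QR loop over an NT x NT grid of tiles A[i][j] could be sketched as follows; the kernel names tile_geqr2, tile_larfb, tile_qr_couple and tile_apply_couple are hypothetical stand-ins for the DGEQR2/DLARFB-style tile kernels described above, not actual LAPACK routines:

for (k = 0; k < NT; k++) {
    tile_geqr2(A[k][k]);                        /* step 1: QR of the diagonal tile       */
    for (j = k + 1; j < NT; j++)
        tile_larfb(A[k][k], A[k][j]);           /* step 2: update tiles in block row k   */
    for (i = k + 1; i < NT; i++) {
        tile_qr_couple(A[k][k], A[i][k]);       /* step 3: couple R with tile A[i][k]    */
        for (j = k + 1; j < NT; j++)
            tile_apply_couple(A[i][k], A[k][j], A[i][j]);  /* step 4: update the couples */
    }
}

Each call touches only a few tiles, which is what produces the fine granularity and the loose dependencies listed on the next slide.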

Multicore friendly algorithms

- Very fine granularity.
- Few dependencies, i.e., high flexibility for the scheduling of tasks.
- Block data layout is possible.

Multicore friendly algorithms. Execution flow on an 8-way dual-core Opteron (trace over time).

Multicore friendly algorithms

Current work and future plans

- Implement the LU factorization on multicores.
- Is it possible to apply the same approach to two-sided transformations (Hessenberg, bidiagonal, tridiagonal)?
- Explore techniques to avoid the extra flops.
- Implement the new algorithms on distributed memory architectures (J. Langou and J. Demmel).
- Implement the new algorithms on the Cell processor.
- Explore automatic exploitation of parallelism through graph-driven programming environments.

AllReduce algorithms. The QR factorization of a long and skinny matrix with its data partitioned vertically across several processors arises in a wide range of applications. Input: A is block-distributed by rows (A stacked as A1, A2, A3). Output: Q is block-distributed by rows (Q stacked as Q1, Q2, Q3); R is global.

AllReduce algorithms are used in:
- iterative methods with multiple right-hand sides (block iterative methods): Trilinos (Sandia National Lab.) through Belos (R. Lehoucq, H. Thornquist, U. Hetmaniuk): BlockGMRES, BlockGCR, BlockCG, BlockQMR, ...
- iterative methods with a single right-hand side: s-step methods for linear systems of equations (e.g. A. Chronopoulos); LGMRES (Jessup, Baker, Dennis, U. Colorado at Boulder), implemented in PETSc; recent work from M. Hoemmen and J. Demmel (U. California at Berkeley).
- iterative eigenvalue solvers: PETSc (Argonne National Lab.) through BLOPEX (A. Knyazev, UCDHSC); HYPRE (Lawrence Livermore National Lab.) through BLOPEX; Trilinos (Sandia National Lab.) through Anasazi (R. Lehoucq, H. Thornquist, U. Hetmaniuk); PRIMME (A. Stathopoulos, Coll. William & Mary).

AllReduce algorithms (figure sequence over processes and time). Each process i first computes a local QR factorization of its block, A_i = Q_i^(0) R_i^(0), keeping the Householder vectors V_i^(0). Pairs of processes then stack their R factors and factorize them again: the QR of [R_0^(0); R_1^(0)] yields V_0^(1), V_1^(1) and R_0^(1). This combining step is repeated up a binary tree; with four processes, R_0^(1) and R_2^(1) are combined to obtain the final R. To build Q, the reflectors are applied in reverse order: the tree-level reflectors are applied to [I_n; 0_n] to obtain Q_0^(1) and Q_1^(1), and each process applies its local V_i^(0) to its piece to obtain Q_i.
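In equations, for two processes the sequence above amounts to

\[
A_0 = Q_0^{(0)} R_0^{(0)}, \quad A_1 = Q_1^{(0)} R_1^{(0)}, \qquad
\begin{pmatrix} R_0^{(0)} \\ R_1^{(0)} \end{pmatrix} = Q^{(1)} R,
\qquad
\begin{pmatrix} A_0 \\ A_1 \end{pmatrix}
= \begin{pmatrix} Q_0^{(0)} & 0 \\ 0 & Q_1^{(0)} \end{pmatrix} Q^{(1)} R,
\]

so the final R is the R factor of the whole matrix and the global Q is assembled from the local Q factors and the tree factor Q^(1).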

AllReduce algorithms: performance (weak scalability and strong scalability plots).

CellSuperScalar and SMPSuperScalar:
- use source-to-source translation to determine dependencies among tasks;
- scheduling of tasks is performed automatically by means of the features provided by a library;
- different scheduling policies can easily be explored;
- all of this is obtained by annotating the code with pragmas and is therefore transparent to other compilers.

Tiled Cholesky written as calls to tile kernels (no annotations yet):

void sgemm_tile(float *A, float *B, float *C);
void strsm_tile(float *T, float *B);
void ssyrk_tile(float *A, float *C);

for (i = 0; i < DIM; i++) {
    for (j = 0; j < i; j++) {
        for (k = 0; k < j; k++) {
            sgemm_tile( A[i][k], A[j][k], A[i][j] );   /* update tile (i,j)        */
        }
        strsm_tile( A[j][j], A[i][j] );                /* triangular solve         */
    }
    for (j = 0; j < i; j++) {
        ssyrk_tile( A[i][j], A[i][i] );                /* update diagonal tile     */
    }
    spotrf_tile( A[i][i] );                            /* factorize diagonal tile  */
}

The same code with CellSs/SMPSs task annotations; the pragmas declare each kernel as a task and the directionality (input/inout) of its arguments:

#pragma css task input(A[64][64], B[64][64]) inout(C[64][64])
void sgemm_tile(float *A, float *B, float *C);

#pragma css task input(T[64][64]) inout(B[64][64])
void strsm_tile(float *T, float *B);

#pragma css task input(A[64][64]) inout(C[64][64])
void ssyrk_tile(float *A, float *C);

for (i = 0; i < DIM; i++) {
    for (j = 0; j < i; j++) {
        for (k = 0; k < j; k++) {
            sgemm_tile( A[i][k], A[j][k], A[i][j] );
        }
        strsm_tile( A[j][j], A[i][j] );
    }
    for (j = 0; j < i; j++) {
        ssyrk_tile( A[i][j], A[i][i] );
    }
    spotrf_tile( A[i][i] );
}
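The spotrf_tile kernel is called in the loop but its annotation is not shown on the slide; presumably it would carry a similar pragma, for example (an assumption, following the same CellSs/SMPSs syntax):

#pragma css task inout(A[64][64])
void spotrf_tile(float *A);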

Thank you