Automatic Performance Tuning of Sparse Matrix Kernels: Observations and Experience
"Performance tuning is tedious and time-consuming work."
Richard Vuduc, James Demmel, Katherine Yelick, Yozo Hida, Michael deLorimier, Shoaib Kamil, Rajesh Nishtala, Benjamin Lee
BeBOP: Berkeley Benchmarking and OPtimization Group, www.cs.berkeley.edu/~richie/bebop

Context
The performance of many applications is dominated by a few computational kernels.
- Applications need fast kernels:
  - Scientific computing and information retrieval: dense and sparse linear algebra
  - Multimedia, audio, and image processing: fast transforms
  - Databases: sorting
  - Security: "cryptokernels" (e.g., modular exponentiation)
- Hardware complexity is increasing:
  - Microprocessor performance is difficult to model
  - The processor-memory gap is widening, and memory hierarchies are deep
Figure (Needle in a haystack): a planar slice of a large space of mathematically equivalent dense matrix multiply implementations. Each square is an implementation, color-coded by its performance (Mflop/s) on a 333 MHz Sun Ultra-IIi-based workstation. It is not obvious how to model this space.
Figure (Platform variability): the distribution of performance over dense matrix multiply implementations on 8 different platforms (architecture + compiler). Performance tuning for any one platform must be redone for each new platform.

An Approach to Automatic Tuning
For each kernel, identify and generate a space of implementations, and search for the best one. This approach has been applied successfully in dense linear algebra (PHiPAC '97; ATLAS '98) and signal processing (FFTW '98; SPIRAL '00), among others.
- Implementation space:
  - Conceptually, the set of "interesting" implementations
  - Depends on the kernel and its input
  - May vary in instruction mix and order, memory access patterns, data structures and precisions, and mathematical formulation
- Searching:
  - How? Exhaustively, or heuristically, guided by models
  - When? Once per kernel and platform, at compile time, at run time, or through hybrid approaches

Tuning Sparse Matrix-Vector Multiply
The SPARSITY system (Im & Yelick, 1999) applies this methodology to y = Ax, where A is sparse. Preliminary results for symmetric A, A^T A, and triangular solve appear under Extensions to New Kernels below, and Future Work covers integration with applications, new kernels, further automation, and understanding the architectural implications.
- Sparse matrix data structures store only the non-zeros, at the cost of data structure overhead and irregular memory access.
- Implementation space:
  - Register blocking: exploit existing dense blocks
  - Cache blocking: create better locality in the accesses to x and y
  - Multiple vectors: reuse elements of A
- Searching example, selecting a register block size (see the code sketches after this panel):
  - Off-line: a one-time characterization of performance on a dense matrix stored in sparse format, for all block sizes r x c
  - At run time: estimate the r x c fill, and choose r and c to maximize Mflops_dense(r, c) / Fill(r, c)
Figure (Register blocking profile): a one-time characterization of the machine (Mflop/s).
Figure (Cache blocking): performance at various cache block sizes on a latent semantic indexing matrix.
Figure (Exploiting symmetry): when A = A^T, only half the matrix need be stored, and each element is used twice.
Figure (Sparse triangular solve): the implementation design space includes SSE2 instructions and "switch-to-dense."
Figure (A^T A times a vector): the matrix A is brought through the memory hierarchy only once.
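The following is a minimal sketch of the sparse matrix-vector multiply kernel discussed above: a plain compressed sparse row (CSR) loop, followed by a register-blocked (BCSR) variant with a fixed 2x2 block size. It is written in C for illustration only; the array names and the 2x2 block size are assumptions, not the SPARSITY generator's actual output.

```c
/* Sketch only: illustrative CSR and 2x2 BCSR kernels, not SPARSITY's
 * generated code.  Array names are assumptions. */

/* CSR: val[k] is the k-th non-zero, colind[k] its column index, and
 * rowptr[i] .. rowptr[i+1]-1 index the non-zeros of row i. */
void spmv_csr(int m, const int *rowptr, const int *colind,
              const double *val, const double *x, double *y)
{
    for (int i = 0; i < m; i++) {
        double yi = 0.0;
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
            yi += val[k] * x[colind[k]];
        y[i] = yi;
    }
}

/* 2x2 register-blocked (BCSR) variant: mb = m/2 block rows (m assumed even),
 * brptr/bcind index 2x2 blocks, and bval stores each block row-major.
 * Keeping the partial sums and the two x values in scalars exposes the
 * register reuse that the r x c block size is chosen to exploit. */
void spmv_bcsr_2x2(int mb, const int *brptr, const int *bcind,
                   const double *bval, const double *x, double *y)
{
    for (int ib = 0; ib < mb; ib++) {
        double y0 = 0.0, y1 = 0.0;
        for (int k = brptr[ib]; k < brptr[ib + 1]; k++) {
            const double *b = &bval[4 * k];
            double x0 = x[2 * bcind[k]], x1 = x[2 * bcind[k] + 1];
            y0 += b[0] * x0 + b[1] * x1;
            y1 += b[2] * x0 + b[3] * x1;
        }
        y[2 * ib] = y0;
        y[2 * ib + 1] = y1;
    }
}
```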
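Next is a sketch of the run-time block-size selection heuristic described above: choose r and c to maximize Mflops_dense(r, c) / Fill(r, c). It assumes the off-line register-blocking profile and the run-time fill-ratio estimates have already been collected into arrays; the 8x8 upper bound and the names are illustrative assumptions, not SPARSITY's interface.

```c
/* Sketch of register block-size selection.  mflops_dense[r][c] is the
 * off-line profile (Mflop/s on a dense matrix stored in r x c blocked sparse
 * format); fill[r][c] is the run-time fill-ratio estimate (stored entries
 * after padding to full blocks, divided by true non-zeros). */
#define RMAX 8
#define CMAX 8

void choose_block_size(const double mflops_dense[RMAX + 1][CMAX + 1],
                       const double fill[RMAX + 1][CMAX + 1],
                       int *r_best, int *c_best)
{
    double best = 0.0;
    *r_best = 1;
    *c_best = 1;
    for (int r = 1; r <= RMAX; r++) {
        for (int c = 1; c <= CMAX; c++) {
            /* Predicted Mflop/s on this matrix: profile / fill overhead. */
            double score = mflops_dense[r][c] / fill[r][c];
            if (score > best) {
                best = score;
                *r_best = r;
                *c_best = c;
            }
        }
    }
}
```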
Extensions to New Kernels
- Symmetric sparse matrix-vector multiply:
  - Store only half of the non-zeros
  - Reuse each stored element twice
- Sparse triangular solve:
  - Compute T^-1 x, where T is a sparse triangular matrix
  - Exploit the naturally occurring dense structure when T comes from certain applications (e.g., LU factorization)
- Multiplying A^T A by a vector:
  - A can be brought through the memory hierarchy only once
  - Arises in linear programming problems, among others
The SPARSITY system optimizations also apply to these kernels. Minimal code sketches for all three appear after the Future Work list below.
Figure (Exploiting new structures): a symmetric matrix from a fluid flow modeling problem (left) and a triangular matrix from LU factorization (right).
Figure (Why do these four profiles look so different?): we hope to answer this question and understand the implications for current and future architectures and applications.
Figure (SPARSITY): performance improvement after run-time block-size selection.
Figure (Multiple vectors): significant speedups are possible when multiplying by several vectors.
Figure (Register blocking example): a portion of a sparse matrix with a 4x3 register block grid superimposed.

Future Work
- Integration with applications and existing software libraries
- Creating (via reordering) or exploiting other matrix structures
- New sparse kernels (e.g., matrix powers A^k, the triple product A^T M A)
- Further automation: generating implementation generators
- Understanding performance in terms of the underlying architectures
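A minimal sketch of the symmetric kernel listed under Extensions to New Kernels above: each stored non-zero is loaded once and used twice, once for y[i] and once for y[j]. It assumes the diagonal and the strictly upper triangle are stored in CSR format, which is an illustrative convention rather than SPARSITY's actual symmetric data structure.

```c
/* Sketch: y = A*x for symmetric A, storing only the diagonal and strictly
 * upper triangle of each row in CSR format (an assumed convention). */
void spmv_sym_csr(int n, const int *rowptr, const int *colind,
                  const double *val, const double *x, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] = 0.0;
    for (int i = 0; i < n; i++) {
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++) {
            int j = colind[k];
            double a = val[k];
            y[i] += a * x[j];        /* a_ij contributes to y[i] */
            if (j != i)
                y[j] += a * x[i];    /* and, by symmetry, a_ji to y[j] */
        }
    }
}
```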
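A minimal sketch of sparse triangular solve by forward substitution, assuming a lower triangular T stored row-wise in CSR with the diagonal entry last in each row (an assumed convention). A tuned solver would "switch to dense" code for the trailing rows, which become nearly full when T comes from LU factorization, as noted above.

```c
/* Sketch: x = T^-1 * b for sparse lower triangular T in CSR format, with the
 * diagonal entry stored last in each row (illustrative convention). */
void sparse_lower_trisolve(int n, const int *rowptr, const int *colind,
                           const double *val, const double *b, double *x)
{
    for (int i = 0; i < n; i++) {
        double s = b[i];
        int diag = rowptr[i + 1] - 1;          /* index of T[i][i] */
        for (int k = rowptr[i]; k < diag; k++)
            s -= val[k] * x[colind[k]];        /* subtract known terms */
        x[i] = s / val[diag];                  /* divide by the diagonal */
    }
}
```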
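Finally, a minimal sketch of computing y = A^T(Ax) in a single pass over a CSR matrix, illustrating why A needs to be brought through the memory hierarchy only once: each row is used for the dot product with x and immediately reused for the update of y. Function and array names are assumptions for illustration.

```c
/* Sketch: y = A^T * (A * x) in one pass over an m x n CSR matrix A.  Row i is
 * read once, used for t = a_i . x, and reused for y += t * a_i while it is
 * still in cache. */
void ata_times_vector(int m, int n, const int *rowptr, const int *colind,
                      const double *val, const double *x, double *y)
{
    for (int j = 0; j < n; j++)
        y[j] = 0.0;
    for (int i = 0; i < m; i++) {
        double t = 0.0;
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
            t += val[k] * x[colind[k]];        /* t = (A*x)[i] */
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
            y[colind[k]] += t * val[k];        /* y += t * (row i of A)^T */
    }
}
```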

