Benchmarking Sparse Matrix-Vector Multiply In 5 Minutes
Hormozd Gahvari, Mark Hoemmen, James Demmel, and Kathy Yelick
January 21, 2007

Outline
• What is Sparse Matrix-Vector Multiply (SpMV)? Why benchmark it?
• How to benchmark it?
  – Past approaches
  – Our approach
• Results
• Conclusions and directions for future work

SpMV
• Sparse Matrix-(dense)Vector Multiply: multiply a dense vector by a sparse matrix (one whose entries are mostly zeroes)
• Why do we need a benchmark?
  – SpMV is an important kernel in scientific computation
  – Vendors need to know how well their machines perform it
  – Consumers need to know which machines to buy
  – Existing benchmarks do a poor job of approximating SpMV

Existing Benchmarks
• The most widely used method for ranking computers is still the LINPACK benchmark, the sole basis of the Top 500 supercomputer list
• Benchmark suites like the High Performance Computing Challenge (HPCC) suite seek to change this by including other benchmarks
• Even the benchmarks in HPCC, however, do not model SpMV
• This work is proposed for inclusion in the HPCC suite

Benchmarking SpMV is hard!
• Issues to consider:
  – Matrix formats
  – Memory access patterns
  – Performance optimizations and why we need to benchmark them
• Preexisting benchmarks that perform SpMV do not take all of this into account

Matrix Formats
• We store only the nonzero entries of a sparse matrix
• This leads to multiple ways of storing the data, based on how we index it: coordinate (COO), CSR, CSC, ELLPACK, …
• We use Compressed Sparse Row (CSR) as our baseline format because it provides the best overall unoptimized performance across many architectures
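To make the indexing difference concrete, here is a minimal sketch of how the coordinate and CSR formats lay out the same data. The struct names are illustrative, not from the slides:

```c
/* Sketch of two common sparse formats. COO stores one (row, col, value)
   triple per nonzero; CSR compresses the row indices into an array of
   per-row offsets, which is what makes row-wise SpMV a unit-stride scan. */
typedef struct {
    int nnz;
    int *row, *col;   /* coordinate (COO): one index pair per nonzero */
    double *val;
} coo_matrix;

typedef struct {
    int m, nnz;
    int *row_start;   /* length m+1: row i occupies [row_start[i], row_start[i+1]) */
    int *col_idx;     /* length nnz: column index of each stored entry */
    double *val;      /* length nnz: the stored values */
} csr_matrix;
```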

CSR SpMV Example
(M, N) = (4, 5), NNZ = 8
row_start: (0, 2, 4, 6, 8)
col_idx: (0, 1, 0, 2, 1, 3, 2, 4)
values: (1, 2, 3, 4, 5, 6, 7, 8)
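For reference, the textbook CSR kernel computing y = A·x over exactly this example; the kernel loop is the standard algorithm, while the driver in main() is our addition:

```c
#include <stdio.h>

/* Standard CSR SpMV: for each row, scan its slice of values/col_idx. */
void spmv_csr(int m, const int *row_start, const int *col_idx,
              const double *values, const double *x, double *y)
{
    for (int i = 0; i < m; i++) {
        double sum = 0.0;
        for (int k = row_start[i]; k < row_start[i + 1]; k++)
            sum += values[k] * x[col_idx[k]];   /* indirect access into x */
        y[i] = sum;
    }
}

int main(void)
{
    int row_start[] = {0, 2, 4, 6, 8};
    int col_idx[]   = {0, 1, 0, 2, 1, 3, 2, 4};
    double values[] = {1, 2, 3, 4, 5, 6, 7, 8};
    double x[5] = {1, 1, 1, 1, 1};              /* multiply by all-ones vector */
    double y[4];
    spmv_csr(4, row_start, col_idx, values, x, y);
    for (int i = 0; i < 4; i++)
        printf("y[%d] = %g\n", i, y[i]);        /* prints 3, 7, 11, 15 */
    return 0;
}
```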

Memory Access Patterns
• Unlike the dense case, memory access patterns differ for matrix and vector elements
  – Matrix elements: unit stride
  – Vector elements: indirect access for the source vector (the one multiplied by the matrix)
• This leads us to propose three categories for SpMV problems:
  – Small: everything fits in cache
  – Medium: source vector fits in cache, matrix does not
  – Large: source vector does not fit in cache
• These categories will exercise the memory hierarchy differently and so may perform differently
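A rough sketch of that classification. The slides define the categories only in terms of what fits in cache; the byte-count model below (CSR with 8-byte values and 4-byte indices) is our assumption:

```c
/* Classify an SpMV problem as small, medium, or large given the cache size.
   Assumes CSR storage: values (8B each), col_idx (4B each), row_start (4B each),
   plus an n-element source vector of doubles. */
typedef enum { SPMV_SMALL, SPMV_MEDIUM, SPMV_LARGE } spmv_class;

spmv_class classify(long n, long nnz, long cache_bytes)
{
    long vector_bytes = 8 * n;                   /* source vector x */
    long matrix_bytes = 12 * nnz + 4 * (n + 1);  /* values + col_idx + row_start */
    if (matrix_bytes + vector_bytes <= cache_bytes)
        return SPMV_SMALL;                       /* everything fits in cache */
    if (vector_bytes <= cache_bytes)
        return SPMV_MEDIUM;                      /* vector fits, matrix does not */
    return SPMV_LARGE;                           /* not even the vector fits */
}
```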

Examples from Three Platforms
• Intel Pentium 4: 2.4 GHz, 512 KB cache
• Intel Itanium 2: 1 GHz, 3 MB cache
• AMD Opteron: 1.4 GHz, 1 MB cache
• Data collected using a test suite of 275 matrices taken from the University of Florida Sparse Matrix Collection
• Performance is graphed vs. problem size

[Graphs: horizontal axis = matrix dimension (vector length); vertical axis = density in NNZ/row; colored dots represent unoptimized performance of real matrices.]

Performance Optimizations
• Many different optimizations are possible
• One family of optimizations involves blocking the matrix to improve reuse at a particular level of the memory hierarchy
  – Register blocking: very often useful
  – Cache blocking: not as useful
• Which optimizations to use?
  – The HPCC framework allows significant optimization by the user; we don't want to go that far
  – Automatic tuning at runtime permits a reasonable comparison of architectures, by trying the same optimizations on each one
• We use only the register-blocking optimization (BCSR), which is implemented in OSKI, the automatic tuning system for sparse matrix kernels developed at Berkeley
• Prior research has found register blocking applicable to a number of real-world matrices, particularly ones from finite element applications
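To show the idea behind register blocking, here is a minimal 2x2 BCSR SpMV sketch. This illustrates the general technique, not OSKI's actual implementation, and it assumes the matrix dimensions are multiples of the block size:

```c
/* 2x2 register-blocked CSR (BCSR) SpMV sketch. Blocks are stored row-major,
   4 values per block; brow_start/bcol_idx index block rows and block columns.
   The per-block-row accumulators y0, y1 can stay in registers, and each
   source-vector load is reused across the block's two rows. */
void spmv_bcsr_2x2(int mb, const int *brow_start, const int *bcol_idx,
                   const double *bvalues, const double *x, double *y)
{
    for (int ib = 0; ib < mb; ib++) {            /* mb = number of block rows */
        double y0 = 0.0, y1 = 0.0;
        for (int k = brow_start[ib]; k < brow_start[ib + 1]; k++) {
            const double *b = &bvalues[4 * k];   /* this 2x2 block's values */
            double x0 = x[2 * bcol_idx[k]];
            double x1 = x[2 * bcol_idx[k] + 1];
            y0 += b[0] * x0 + b[1] * x1;
            y1 += b[2] * x0 + b[3] * x1;
        }
        y[2 * ib]     = y0;
        y[2 * ib + 1] = y1;
    }
}
```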

Both unoptimized and optimized SpMV matter
• Why we need to measure optimized SpMV:
  – Some platforms benefit more from performance tuning than others
  – Among the tested platforms, Itanium 2 and Opteron gain vs. the Pentium 4 when we tune using OSKI
• Why we need to measure unoptimized SpMV:
  – Some SpMV problems are more resistant to optimization
  – To be effective, register blocking needs a matrix with a dense block structure
  – Not all sparse matrices have one
  – Graphs on the next slide illustrate this

[Graphs: horizontal axis = matrix dimension (vector length); vertical axis = density in NNZ/row; blank dots represent real matrices that OSKI could not tune due to lack of a dense block structure; colored dots represent speedups obtained by OSKI's tuning.]

So what do we do?
• We have a large search space of matrices to examine
• We could just do lots of SpMV on real-world matrices. However:
  – It's not portable: the matrices take several GB to store and transport (our test suite takes up 8.34 GB of space)
  – The appropriate set of matrices is always changing as machines grow larger
• Instead, we can randomly generate sparse matrices that mirror real-world matrices by matching certain properties of those matrices

Matching Real Matrices With Synthetic Ones
• We randomly generated a matrix for each of the 275 matrices taken from the Florida collection
• Synthetic matrices matched the real ones in dimension, density (measured in NNZ/row), blocksize, and distribution of nonzero entries
• Nonzero distribution was measured for each matrix by looking at what fraction of nonzero entries are in bands a certain percentage away from the main diagonal
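As a concrete sketch of that measurement, here is one way to bin a CSR matrix's nonzeros by their distance from the diagonal as a fraction of the dimension. The slides specify ten bands (see the illustration below); the function name and exact binning rule are our assumptions:

```c
#include <stdlib.h>

#define NBANDS 10  /* the slides use 10 bands */

/* Fill dist[] with the fraction of nonzeros whose distance from the main
   diagonal, relative to the dimension n, falls in each band. */
void band_distribution(int n, const int *row_start, const int *col_idx,
                       double dist[NBANDS])
{
    long counts[NBANDS] = {0};
    long nnz = 0;
    for (int i = 0; i < n; i++) {
        for (int k = row_start[i]; k < row_start[i + 1]; k++) {
            double frac = (double)labs((long)col_idx[k] - i) / n;
            int band = (int)(frac * NBANDS);
            if (band >= NBANDS) band = NBANDS - 1;  /* entry at maximal distance */
            counts[band]++;
            nnz++;
        }
    }
    for (int b = 0; b < NBANDS; b++)
        dist[b] = nnz ? (double)counts[b] / nnz : 0.0;
}
```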

Band Distribution Illustration
What proportion of the nonzero entries falls into each of these bands 1-5? We use 10 bands instead of 5, but show 5 here for simplicity.
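Given a measured band distribution, a synthetic matrix can be built by sampling each nonzero's column so that its distance from the diagonal follows the same distribution. A minimal sketch of that sampling step; the slides do not specify the generator's details, so the function name, the uniform sampling within a band, and the reflection at the boundaries are all our assumptions:

```c
#include <stdlib.h>

#define NBANDS 10

/* Sample a column index for a nonzero in the given row so that its distance
   from the diagonal follows the target distribution dist[] (fractions summing
   to 1). Blocking and duplicate-column handling are ignored for brevity. */
int sample_column(int row, int n, const double dist[NBANDS])
{
    double u = (double)rand() / RAND_MAX;
    int band = 0;
    double cum = dist[0];
    while (band < NBANDS - 1 && u > cum)      /* pick a band by its probability */
        cum += dist[++band];
    /* pick a distance uniformly within the chosen band */
    double lo = (double)band / NBANDS, hi = (double)(band + 1) / NBANDS;
    long offset = (long)((lo + (hi - lo) * ((double)rand() / RAND_MAX)) * n);
    if (rand() & 1) offset = -offset;         /* either side of the diagonal */
    long col = (long)row + offset;
    if (col < 0) col = -col;                  /* reflect back into [0, n) */
    if (col >= n) col = 2L * (n - 1) - col;
    return (int)col;
}
```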

In these graphs, real matrices are denoted by a red R and synthetic matrices by a green S. Each real matrix is connected to the synthetic matrix created to approximate it by a line whose color indicates which of the two was faster.

Remaining Issues
• We've found a reasonable way to model real matrices, but benchmark suites want less output: HPCC requires its benchmarks to report only a few numbers, preferably just one
• Challenges in getting there:
  – As we've seen, SpMV performance depends greatly on the matrix, and there is a large range of problem sizes. How do we capture all of this? Stats on the Florida matrices:
    – Dimension ranges from a few hundred to over a million
    – NNZ/row ranges from 1 to a few hundred
  – How do we capture the performance of matrices with small dense blocks that benefit from register blocking?
• What we'll do:
  – Bound the set of synthetic matrices we generate
  – Determine which numbers to report that we feel capture the data best

Bounding the Benchmark Set
• Limit to square matrices
• Look over only a certain range of problem dimensions and NNZ/row
  – Since the dimension range is so huge, restrict dimensions to powers of 2
• Limit blocksizes tested to those in {1,2,3,4,6,8} x {1,2,3,4,6,8}
  – These were the most common ones encountered in prior research with matrices that mostly had dense block structures
• Here are the limits based on the matrix test suite:
  – Dimension <= 2^20 (a little over one million)
  – 24 <= NNZ/row <= 34 (avg. NNZ/row for the real matrix test suite is 29)
• Generate matrices with nonzero entries distributed (band distribution) based on statistics for the test suite as a whole
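The bounded set can then be enumerated mechanically. A sketch of the resulting benchmark loop; the lower dimension bound is not stated on these slides, so the 2^10 starting point is an assumption, and with these exact bounds the count (11 x 11 x 36) differs slightly from the 11 x 12 x 36 reported on the next slide:

```c
#include <stdio.h>

int main(void)
{
    const int block_sizes[] = {1, 2, 3, 4, 6, 8};
    long runs = 0;
    for (int logn = 10; logn <= 20; logn++)              /* dimension 2^logn, up to 2^20 */
        for (int nnz_row = 24; nnz_row <= 34; nnz_row++) /* 24 <= NNZ/row <= 34 */
            for (int r = 0; r < 6; r++)
                for (int c = 0; c < 6; c++) {
                    int br = block_sizes[r], bc = block_sizes[c];
                    /* here: generate a synthetic 2^logn x 2^logn matrix with
                       nnz_row nonzeros per row and br x bc blocks, then time
                       SpMV on it and record the MFLOP rate */
                    (void)br; (void)bc;
                    runs++;
                }
    printf("%ld configurations\n", runs);
    return 0;
}
```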

Condensing the Data
• This is a lot of data: 11 x 12 x 36 = 4752 matrices to run
• Tuned and untuned cases are separated, as they highlight differences between platforms
  – Untuned data comes only from unblocked matrices
  – Tuned data comes from the remaining (blocked) matrices
• In each case (blocked and unblocked), report the maximum and median MFLOP rates to capture small/medium/large behavior
• When forced to report one number, report the blocked median
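Condensing the per-matrix MFLOP rates to the reported max and median is straightforward; a minimal sketch, with helper names that are ours:

```c
#include <stdlib.h>

/* Comparison callback for qsort over doubles. */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Median of n MFLOP rates; sorts the array in place. */
double median_rate(double *rates, int n)
{
    qsort(rates, n, sizeof(double), cmp_double);
    return (n % 2) ? rates[n / 2]
                   : 0.5 * (rates[n / 2 - 1] + rates[n / 2]);
}

/* Maximum of n MFLOP rates. */
double max_rate(const double *rates, int n)
{
    double m = rates[0];
    for (int i = 1; i < n; i++)
        if (rates[i] > m) m = rates[i];
    return m;
}
```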

Output (all numbers MFLOP/s)

            Unblocked         Blocked
            Max     Median    Max     Median
Pentium 4   699     307       1961    530
Itanium 2   443     343       2177    753
Opteron     396     170       1178    273

How well does the benchmark approximate real SpMV performance? These graphs show the benchmark numbers as horizontal lines versus the real matrices, which are denoted by circles.

Output
• Matrices generated by the benchmark fall into the small/medium/large categories as follows:

          Pentium 4   Itanium 2   Opteron
Small     17%         33%         23%
Medium    42%         50%         44%
Large     42%         17%         33%

One More Problem
• The benchmark takes too long to run:
  – Pentium 4: 150 minutes
  – Itanium 2: 128 minutes
  – Opteron: 149 minutes
• How to cut down on this? HPCC would like our benchmark to run in 5 minutes

Cutting Runtime
• Test fewer problem dimensions
  – The largest ones do not give any extra information
• Test fewer NNZ/row
  – Once the dimension gets large enough, small variations in NNZ/row have little effect
• These decisions are all made by a runtime estimation algorithm
• Benchmark SpMV data supports this

Sample graphs of benchmark SpMV for 1x1 and 3x3 blocked matrices

Output Comparison (all numbers MFLOP/s; full-benchmark values in parentheses)

            Unblocked         Blocked
            Max     Median    Max     Median
Pentium 4   (699)   (307)     (1961)  (530)
Itanium 2   (443)   (343)     (2177)  (753)
Opteron     (396)   (170)     (1178)  (273)

Runtime Comparison

            Full      Shortened
Pentium 4   150 min   3 min
Itanium 2   128 min   3 min
Opteron     149 min   3 min

Conclusions and Directions for the Future
• SpMV is hard to benchmark because performance varies greatly depending on the matrix
• Carefully chosen synthetic matrices can be used to approximate SpMV
• A benchmark that reports one number and runs quickly is harder, but we can do reasonably well by looking at the median
• In the future:
  – Tighter maximum numbers
  – Parallel version
• Software available at