Performance Models for Evaluation and Automatic Tuning of Symmetric Sparse Matrix-Vector Multiply
Benjamin C. Lee, Richard W. Vuduc, James W. Demmel, Katherine A. Yelick
University of California, Berkeley
Berkeley Benchmarking and Optimization Group (BeBOP)
http://bebop.cs.berkeley.edu
16 August 2004

Performance Tuning Challenges
Computational kernels: performance in scientific applications is dominated by a few computational kernels.
Sparse Matrix-Vector Multiply (SpMV): y = y + Ax, where A is a sparse, symmetric matrix (i.e., A = A^T) and x, y are dense vectors.
Sparse Matrix-Multiple Vector Multiply (SpMM): Y = Y + AX, where X, Y are dense matrices.
Sparse code characteristics:
High bandwidth requirements (matrix storage overhead): loads of both the integer indices and the floating-point values of the matrix.
Poor locality: indirect column accesses in the row-oriented storage format; irregular memory accesses.
Poor instruction mix: low ratio of flops to memory operations (computational intensity of 2 for SpMV and 4 for symmetric SpMV).
SpMV performance is typically less than 10% of machine peak.
Performance depends on the kernel, the matrix, and the architecture.
Berkeley Benchmarking and Optimization Group
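These characteristics are easiest to see in the reference kernel itself. The following is a minimal sketch of an unblocked, non-symmetric CSR SpMV; the array names (row_ptr, col_idx, val) and the function name are illustrative, not the authors' actual code.

```c
/* Minimal reference SpMV sketch: y = y + A*x with A in compressed sparse
 * row (CSR) format. Each stored non-zero costs one integer load (col_idx),
 * one double load (val), one indirect double load (x[col_idx[k]]), and
 * only 2 flops -- hence the low computational intensity noted above. */
void csr_spmv(int n, const int *row_ptr, const int *col_idx,
              const double *val, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double yi = y[i];
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            yi += val[k] * x[col_idx[k]];   /* indirect, irregular access */
        y[i] = yi;
    }
}
```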

Optimizations: Register Blocking (1/3) Berkeley Benchmarking and Optimization Group

Optimizations: Register Blocking (2/3)
Register blocking technique:
Logically divide the matrix into r x c blocks, stored in block compressed sparse row (BCSR) format with a uniform, aligned grid.
Computation of SpMV proceeds block by block.
Fully unroll each r x c block multiply.
Berkeley Benchmarking and Optimization Group
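As a concrete illustration, here is a sketch of a 2x2 register-blocked SpMV in BCSR format. The data layout (brow_ptr, bcol_idx, row-major 2x2 blocks in val) and the assumption that the dimension is even are illustrative choices, not the authors' implementation.

```c
/* Sketch of 2x2 register-blocked (BCSR) SpMV: y = y + A*x.
 * brow_ptr indexes block rows, bcol_idx gives the starting column of each
 * 2x2 block, and val stores each block contiguously in row-major order.
 * The block multiply is fully unrolled so y0, y1 stay in registers. */
void bcsr_spmv_2x2(int n, const int *brow_ptr, const int *bcol_idx,
                   const double *val, const double *x, double *y)
{
    for (int ib = 0; ib < n / 2; ib++) {            /* block row */
        double y0 = y[2 * ib], y1 = y[2 * ib + 1];
        for (int k = brow_ptr[ib]; k < brow_ptr[ib + 1]; k++) {
            const double *b = &val[4 * k];          /* one 2x2 block */
            int j = bcol_idx[k];
            double x0 = x[j], x1 = x[j + 1];
            y0 += b[0] * x0 + b[1] * x1;            /* unrolled multiply */
            y1 += b[2] * x0 + b[3] * x1;
        }
        y[2 * ib]     = y0;
        y[2 * ib + 1] = y1;
    }
}
```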

Optimizations: Register Blocking (3/3)
Performance implications:
Reduces loop and indexing overhead and irregularity.
Increases temporal locality and register reuse for x.
Explicit zeros are filled in to complete blocks, trading extra flops for more efficient blocked code.
Example: a 50% increase in performance despite a fill ratio of 1.5.
Berkeley Benchmarking and Optimization Group

Optimizations: Matrix Symmetry
Symmetric storage:
Assume compressed sparse row (CSR) storage.
Store only half the matrix entries (e.g., the upper triangle).
SpMV proceeds simultaneously for each stored element and its transpose. For each stored non-zero A(i,j):
  y(i) += A(i,j) * x(j)
  y(j) += A(i,j) * x(i)
Diagonal elements require special consideration (refer to the technical report for the discussion).
Performance implications:
Same flops.
Halves memory accesses to the matrix.
Same irregular, indirect memory accesses: with CSR storage, j indexes the columns, so the x(j) accesses are indirect, and the total number of indirect accesses is the same for the symmetric and non-symmetric kernels.
Berkeley Benchmarking and Optimization Group
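The update pair above can be folded into the CSR loop as in the following sketch, which assumes the upper triangle (including the diagonal) is stored; the diagonal is applied only once. Names are illustrative.

```c
/* Sketch of symmetric SpMV with only the upper triangle (including the
 * diagonal) stored in CSR. Each stored off-diagonal A(i,j) updates both
 * y[i] and y[j], halving accesses to the matrix at the same flop count. */
void symm_csr_spmv(int n, const int *row_ptr, const int *col_idx,
                   const double *val, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double yi = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++) {
            int j = col_idx[k];
            double a = val[k];
            yi += a * x[j];             /* y(i) += A(i,j) * x(j) */
            if (j != i)
                y[j] += a * x[i];       /* y(j) += A(i,j) * x(i) */
        }
        y[i] += yi;
    }
}
```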

Optimizations: Multiple Vectors
Vector blocking technique:
Targets SpMM (Y <- Y + A*X), where X and Y each consist of k dense column vectors of length n.
Apply A to X in groups of v vectors at a time.
For each non-zero of A, unroll by v the multiply against the corresponding elements of the vectors in X.
Performance implications:
Reduces loop overhead.
Amortizes the cost of reading A across v vectors.
Berkeley Benchmarking and Optimization Group
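A sketch of the idea for v = 2, with A in plain CSR and X, Y stored row-major with k columns, is shown below. The layout, names, and the restriction to an even number of vectors are illustrative assumptions.

```c
/* Sketch of vector blocking for SpMM, Y <- Y + A*X, unrolled by v = 2.
 * Each matrix entry val[p] is loaded once and applied to two right-hand
 * sides, amortizing the cost of streaming through A. A complete version
 * would also handle the remainder when k is not divisible by v. */
void csr_spmm_v2(int n, int k, const int *row_ptr, const int *col_idx,
                 const double *val, const double *X, double *Y)
{
    for (int c = 0; c + 1 < k; c += 2) {            /* group of v = 2 vectors */
        for (int i = 0; i < n; i++) {
            double y0 = Y[i * k + c], y1 = Y[i * k + c + 1];
            for (int p = row_ptr[i]; p < row_ptr[i + 1]; p++) {
                double a = val[p];
                int j = col_idx[p];
                y0 += a * X[j * k + c];             /* reuse a for both vectors */
                y1 += a * X[j * k + c + 1];
            }
            Y[i * k + c]     = y0;
            Y[i * k + c + 1] = y1;
        }
    }
}
```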

Optimizations: Register Usage (1/3)
Register blocking:
Assume a column-wise unrolled block multiply.
Destination vector elements are held in registers (r registers).
Berkeley Benchmarking and Optimization Group

Optimizations: Register Usage (2/3)
Symmetric storage:
Doubles register usage (2r registers): destination vector elements for the stored block, plus source vector elements for the transpose block.
Berkeley Benchmarking and Optimization Group

Optimizations: Register Usage (3/3)
Vector blocking:
Scales register usage by the vector width (2rv registers).
Berkeley Benchmarking and Optimization Group

Evaluation: Methodology
Three platforms: Sun Ultra 2i, Intel Itanium 2, IBM Power 4.
Matrix test suite: twelve matrices (dense, finite element, linear programming, assorted).
Reference implementation: no symmetry, no register blocking, single-vector multiplication.
Tuning parameters: the SpMM code is characterized by the parameters (r, c, v), where r x c is the register block size and v is the vector width.
Berkeley Benchmarking and Optimization Group

Evaluation: Exhaustive Search
Performance:
2.1x maximum speedup (1.4x median) from symmetry for SpMV ({Symm BCSR, Single Vector} vs. {Non-Symm BCSR, Single Vector}).
2.6x maximum speedup (1.1x median) from symmetry for SpMM ({Symm BCSR, Multiple Vector} vs. {Non-Symm BCSR, Multiple Vector}).
7.3x maximum speedup (4.2x median) from the combined optimizations ({Symm BCSR, Multiple Vector} vs. {Non-Symm CSR, Single Vector}).
Storage:
64.7% maximum savings (56.5% median) in storage.
Savings greater than 50% (i.e., more than a 2x reduction) are possible when symmetry is combined with register blocking, because blocking also reduces the number of stored integer indices.
9.9% increase in storage in a few cases, where the chosen register block size results in significant fill.
Berkeley Benchmarking and Optimization Group

Performance Results: Sun Ultra 2i Berkeley Benchmarking and Optimization Group

Performance Results: Sun Ultra 2i Berkeley Benchmarking and Optimization Group

Performance Results: Sun Ultra 2i
-- Small but consistent gains from symmetry alone on the FEM matrices.
-- The greatest gains arise from symmetry combined with register blocking.
-- Significant gains from the multiple-vector optimization.
-- No consistent, significant performance gain from symmetry once register and vector blocking are applied; symmetry is still worthwhile, however, because of the storage savings without a significant loss in performance.
-- The effects of the three techniques are cumulative.
Berkeley Benchmarking and Optimization Group

Performance Results: Intel Itanium 2
-- Negligible gains from symmetry alone and from symmetry with register blocking.
-- The greatest gains arise from symmetry combined with register and vector blocking.
-- Note that in several cases 'Symm Ref' is slower than 'Non-Symm Ref'; in these cases the other optimizations must be applied along with symmetry to obtain the storage savings without a loss in performance.
Berkeley Benchmarking and Optimization Group

Performance Results: IBM Power 4
-- Significant gains from symmetry with register blocking.
-- The gains from the three optimizations are cumulative.
Berkeley Benchmarking and Optimization Group

Automated Empirical Tuning
Exhaustive search is infeasible when the matrix is known only at run time, because of the cost of converting the matrix to blocked format. Although the cost of a single conversion is acceptable in many contexts (e.g., iterative solvers, where it is amortized over many SpMVs), trying all or even a subset of register block sizes becomes infeasible for even modest values of b_max >= 4.
Parameter selection procedure:
1. Off-line benchmark (once per machine): measure symmetric SpMM performance for a dense matrix D stored in sparse format, { P_{r,c,v}(D) | 1 ≤ r,c ≤ b_max and 1 ≤ v ≤ v_max }, in Mflop/s.
2. Run-time estimate of fill: the fill ratio is the number of stored values divided by the number of original non-zeros, { f_{r,c}(A) | 1 ≤ r,c ≤ b_max }, and is always at least 1.0. When the matrix is known at run time, the fill can be estimated cheaply over all (r, c); this is the empirical-search part.
3. Heuristic performance model: choose (r, c, v) to maximize the estimate of optimized performance, max_{r,c,v} { P_{r,c,v}(A) = P_{r,c,v}(D) / f_{r,c}(A) | 1 ≤ r,c ≤ b_max and 1 ≤ v ≤ min(v_max, k) }. Here P_{r,c,v}(D) is an empirical estimate of the expected performance, and f_{r,c}(A) reduces this value according to the extra flops per non-zero caused by fill.
The total run-time cost of steps 2 and 3, followed by a single conversion, is not much more than the cost of the conversion alone.
Berkeley Benchmarking and Optimization Group
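A sketch of the selection step is shown below. The table perf_dense (the off-line benchmark P_{r,c,v}(D)) and the fill estimator estimate_fill (returning f_{r,c}(A)) are assumed to be provided elsewhere; the names, limits, and interface are illustrative only.

```c
#define B_MAX 4   /* illustrative maximum register block dimension */
#define V_MAX 4   /* illustrative maximum vector width */

/* Assumed to exist elsewhere: off-line benchmark P_{r,c,v}(D) in Mflop/s,
 * and a cheap run-time estimator of the fill ratio f_{r,c}(A) >= 1.0. */
extern double perf_dense[B_MAX + 1][B_MAX + 1][V_MAX + 1];
extern double estimate_fill(const void *A, int r, int c);

/* Heuristic: pick (r, c, v) maximizing P_{r,c,v}(D) / f_{r,c}(A),
 * with v capped at the number of right-hand-side vectors k. */
void choose_block_size(const void *A, int k, int *r_out, int *c_out, int *v_out)
{
    double best = -1.0;
    int v_cap = (k < V_MAX) ? k : V_MAX;
    for (int r = 1; r <= B_MAX; r++)
        for (int c = 1; c <= B_MAX; c++) {
            double fill = estimate_fill(A, r, c);
            for (int v = 1; v <= v_cap; v++) {
                double est = perf_dense[r][c][v] / fill;
                if (est > best) {
                    best = est;
                    *r_out = r; *c_out = c; *v_out = v;
                }
            }
        }
}
```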

Evaluation: Heuristic Search
Heuristic performance:
Always achieves at least 93% of the best performance from exhaustive search on the Ultra 2i and Itanium 2.
Always achieves at least 85% of the best performance from exhaustive search on the Power 4.
Performance bounds:
Measured performance achieves 82% of the PAPI bound on the Ultra and Itanium.
Measured performance achieves 42% of the PAPI bound on the Itanium 2.
Berkeley Benchmarking and Optimization Group

Performance Results: Sun Ultra 2i Berkeley Benchmarking and Optimization Group

Performance Results: Intel Itanium 2 Berkeley Benchmarking and Optimization Group

Performance Results: IBM Power 4 Berkeley Benchmarking and Optimization Group

Performance Models
Model characteristics and assumptions:
Considers only the cost of memory operations. SpMV and SpMM are memory bound, with most execution time spent streaming through the matrix; write-back caches with sufficient store-buffer capacity are assumed, so the cost of stores is ignored.
Accounts for minimum effective cache and memory latencies. The effective access times α_i are determined by streaming benchmarks (Saavedra-Barrera, MAPS): Saavedra-Barrera measures the average time per load when repeatedly streaming through an array of length N with stride s, varying both N and s; MAPS measures, for arrays of various lengths, the average time per load for (1) a unit-stride access pattern and (2) a random access pattern.
Considers only compulsory misses (i.e., ignores conflict misses), giving a lower bound on cache misses and hence an upper bound on performance.
Ignores TLB misses, since most time is spent streaming through the matrix with unit-stride accesses.
Execution time model:
Loads and cache misses come either from an analytic model (based on the data access patterns) or from hardware counters (via PAPI).
Charging α_i for hits at each level of the memory hierarchy:
  T = (L1 hits) α_1 + (L2 hits) α_2 + (memory hits) α_mem
or, equivalently,
  T = (loads) α_1 + (L1 misses) (α_2 - α_1) + (L2 misses) (α_mem - α_2)
Berkeley Benchmarking and Optimization Group
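The two equivalent forms of the model translate directly into code. The sketch below simply evaluates them from given load/miss counts and effective latencies; the function and variable names are illustrative.

```c
/* Execution time model, hit form:
 * T = (L1 hits)*a1 + (L2 hits)*a2 + (memory hits)*amem,
 * where a1, a2, amem are effective access latencies (e.g., in cycles). */
double model_time_hits(double l1_hits, double l2_hits, double mem_hits,
                       double a1, double a2, double amem)
{
    return l1_hits * a1 + l2_hits * a2 + mem_hits * amem;
}

/* Equivalent miss form, convenient when counters report loads and misses:
 * T = (loads)*a1 + (L1 misses)*(a2 - a1) + (L2 misses)*(amem - a2). */
double model_time_misses(double loads, double l1_misses, double l2_misses,
                         double a1, double a2, double amem)
{
    return loads * a1 + l1_misses * (a2 - a1) + l2_misses * (amem - a2);
}
```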

Evaluation: Performance Bounds
Measured performance vs. the PAPI bound:
Measured performance is 68% of the PAPI bound, on average.
FEM matrices are closer to the bound than non-FEM matrices.
Measured performance achieves 82% of the PAPI bound on the Ultra and Itanium.
Measured performance achieves 42% of the PAPI bound on the Itanium 2.
Berkeley Benchmarking and Optimization Group

Performance Results: Sun Ultra 2i Berkeley Benchmarking and Optimization Group

Performance Results: Intel Itanium 2
Reference:
  FEM Average Speedup: 0.8819
  Overall Average Speedup: 0.9223
  Overall Median Speedup: 0.8138
  Sparse Max Speedup: 1.1719 (10)
Register:
  FEM Average Speedup: 1.1141
  Overall Average Speedup: 1.1771
  Overall Median Speedup: 1.0960
  Sparse Max Speedup: 2.0123 (12)
Register Multiple Vector:
  FEM Average Speedup: 1.7239
  Overall Average Speedup: 1.6555
  Overall Median Speedup: 1.6015
  Sparse Max Speedup: 2.5780 (8)
Performance Upper Bound (Analytic):
  FEM Average Percentage: 42.35
  Overall Average Percentage: 40.17
  Overall Median Percentage: 45.18
  Sparse Max Percentage: 51.28 (4)
Performance Upper Bound (PAPI):
  FEM Average Percentage: 42.67
  Overall Average Percentage: 41.79
Berkeley Benchmarking and Optimization Group

Performance Results: IBM Power 4
Reference:
  FEM Average Speedup: 1.4188
  Overall Average Speedup: 1.3655
  Overall Median Speedup: 1.2980
  Sparse Max Speedup: 2.0078 (8)
Register:
  FEM Average Speedup: 1.6393
  Overall Average Speedup: 1.1771
  Overall Median Speedup: 1.6976
  Sparse Max Speedup: 2.0807 (10)
Register Multiple Vector:
  FEM Average Speedup: 1.1444
  Overall Average Speedup: 1.1941
  Overall Median Speedup: 1.1457
  Sparse Max Speedup: 1.7763 (9)
Performance Upper Bound (Analytic):
  FEM Average Percentage: 52.22
  Overall Average Percentage: 50.92
  Overall Median Percentage: 49.93
  Sparse Max Percentage: 63.80 (3)
Berkeley Benchmarking and Optimization Group

Conclusions
Matrix symmetry optimizations:
  Symmetric performance: 2.6x speedup (1.1x median).
  Overall performance: 7.3x speedup (4.15x median).
  Symmetric storage: 64.7% savings (56.5% median).
  The performance effects are cumulative.
Automated empirical tuning:
  Always achieves at least 85-93% of the best performance from exhaustive search.
Performance modeling:
  Models account for symmetry, register blocking, and multiple vectors.
  Measured performance is 68% of predicted performance (PAPI).
Berkeley Benchmarking and Optimization Group

Current & Future Directions Parallel SMP Kernels Multi-threaded versions of optimizations Extend performance models to SMP architectures Self-Adapting Sparse Kernel Interface Provides low-level BLAS-like primitives Hides complexity of kernel-, matrix-, and machine-specific tuning Provides new locality-aware kernels New locality-aware kernels – Ax, A^Tx, A^Tax, A^kx, triple products Allows user inspection, control of tuning in the spirit of FFTW. Self-profiling – The code may periodically retune an application Berkeley Benchmarking and Optimization Group

Appendices
Berkeley Benchmarking and Optimization Group
http://bebop.cs.berkeley.edu
Conference Paper: "Performance Models for Evaluation and Automatic Tuning of Symmetric Sparse Matrix-Vector Multiply"
http://www.cs.berkeley.edu/~blee20/publications/lee2004-icpp-symm.pdf
Technical Report: "Performance Optimizations and Bounds for Sparse Symmetric Matrix-Multiple Vector Multiply"
http://www.cs.berkeley.edu/~blee20/publications/lee2003-tech-symm.pdf

Appendices
Berkeley Benchmarking and Optimization Group

Performance Results: Intel Itanium 1
Reference:
  FEM Average Speedup: 0.8249
  Overall Average Speedup: 0.8432
  Overall Median Speedup: 0.8287
  Sparse Max Speedup: 0.9508 (9)
Register:
  FEM Average Speedup: 1.1798
  Overall Average Speedup: 1.2384
  Overall Median Speedup: 1.2183
  Sparse Max Speedup: 1.3550 (10)
Register Multiple Vector:
  FEM Average Speedup: 0.9490
  Overall Average Speedup: 1.0040
  Overall Median Speedup: 0.9744
  Sparse Max Speedup: 1.5317 (9)
Performance Upper Bound (Analytic):
  FEM Average Percentage: 63.48
  Overall Average Percentage: 61.58
  Overall Median Percentage: 61.00
  Sparse Max Percentage: 73.74 (8)
Performance Upper Bound (PAPI):
  FEM Average Percentage: 76.44
  Overall Average Percentage: 80.56
  Overall Median Percentage: 77.16
  Sparse Max Percentage: 101.07 (12)
Berkeley Benchmarking and Optimization Group