When Cache Blocking of Sparse Matrix Vector Multiply Works and Why
Rajesh Nishtala, Richard W. Vuduc, James W. Demmel, and Katherine A. Yelick
BeBOP Project, U.C. Berkeley, June 2004

General Approach
Sparse kernels are prevalent in many applications. Automatically tune the kernels to get optimum performance:
– Naïve performance is typically less than 10% of machine peak
– Performance depends on both the matrix and the platform
– Numerous parameters can be adjusted
Create analytic performance models.
Use the performance models as the basis for a system that automatically picks the correct parameters.
Verify the performance models using built-in hardware counters.
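As an illustration of the last two points, here is a minimal C sketch of how such a system might pick a cache block size: loop over candidate block dimensions and keep the pair with the smallest predicted execution time. The candidate sizes and the `time_model` callback are placeholders for illustration, not the actual BeBOP search space or implementation.

```c
#include <math.h>

/* Signature of a performance model: predicted SpMV time (seconds) for a
 * given matrix when cache-blocked into r x c blocks.  The model itself
 * (described later in the talk) is supplied by the caller. */
typedef double (*time_model)(const void *matrix, int r, int c);

/* Pick the (r, c) cache block size with the smallest predicted time.
 * Candidate sizes here are illustrative powers of two. */
static void choose_block_size(const void *matrix, time_model predict,
                              int *best_r, int *best_c)
{
    const int candidates[] = { 1 << 12, 1 << 14, 1 << 16, 1 << 18, 1 << 20 };
    const int n = (int)(sizeof candidates / sizeof candidates[0]);
    double best = INFINITY;

    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double t = predict(matrix, candidates[i], candidates[j]);
            if (t < best) {
                best = t;
                *best_r = candidates[i];
                *best_c = candidates[j];
            }
        }
}
```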

Introduction and Overview
Sparse Matrix Vector Multiply (SpM x V): y ← y + A·x
– x, y are dense vectors; we call x the source vector and y the destination vector
– A is a sparse matrix (<1% of entries are nonzero)
Pitfalls of naïve performance:
– High memory bandwidth requirements
– Poor locality (indirect and irregular memory accesses)
– Poor instruction mix (low ratio of flops to memory operations)
(Figure: sparse matrix A, source vector x, destination vector y)
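To make the access pattern concrete, here is a minimal unblocked CSR SpM x V kernel in C. This is a sketch for illustration (array names are mine, not from the BeBOP code); the indirect load x[colind[k]] is the source of the poor locality noted above.

```c
/* y <- y + A*x, with A stored in Compressed Sparse Row (CSR) format:
 * row_ptr has nrows+1 entries, and colind/val hold the nonzeros of
 * each row contiguously. */
void spmv_csr(int nrows,
              const int *row_ptr, const int *colind, const double *val,
              const double *x, double *y)
{
    for (int i = 0; i < nrows; i++) {
        double sum = y[i];
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            sum += val[k] * x[colind[k]];   /* indirect, irregular access to x */
        y[i] = sum;
    }
}
```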

Cache Blocking
Assume a Compressed Sparse Row (CSR) format. Cache blocking breaks the CSR matrix into multiple smaller CSR matrices:
– Improves temporal locality in accesses to the source vector
– Adds an extra set of row pointers for each block
Goal: given a matrix and machine combination, pick the correct cache block size.
Key tradeoff: does the benefit of the added temporal locality outweigh the cost of accessing the added overhead?
An extension of the work done by Eun-Jin Im on Sparsity.
(Figure: an r × c cache block of the matrix)
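A sketch of what the blocked multiply might look like, assuming the matrix has already been split into r × c cache blocks, each stored as its own small CSR matrix (the splitting step and the structure names are assumptions, not the actual BeBOP data structures). Within one block, accesses to x are confined to a window of at most c entries, which is where the improved temporal locality comes from.

```c
/* One r x c cache block of A, stored as an independent CSR matrix.
 * Field names are illustrative. */
typedef struct {
    int nrows;        /* number of rows in this block                       */
    int row_offset;   /* index of the block's first row within A            */
    int *row_ptr;     /* nrows + 1 entries (the extra row pointers above)   */
    int *colind;      /* global column indices of the nonzeros              */
    double *val;      /* nonzero values                                     */
} csr_block;

/* y <- y + A*x where A is a collection of cache blocks.  Each block
 * touches only a c-wide slice of x, so that slice can stay in cache
 * while the block's rows are processed. */
void spmv_cache_blocked(int nblocks, const csr_block *blocks,
                        const double *x, double *y)
{
    for (int b = 0; b < nblocks; b++) {
        const csr_block *blk = &blocks[b];
        for (int i = 0; i < blk->nrows; i++) {
            double sum = 0.0;
            for (int k = blk->row_ptr[i]; k < blk->row_ptr[i + 1]; k++)
                sum += blk->val[k] * x[blk->colind[k]];
            y[blk->row_offset + i] += sum;
        }
    }
}
```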

New Cache Blocking Optimizations
Row Start/End:
– Band matrices may lead to many rows that are entirely zero within a block
– The new optimization avoids iterating over these rows by recording the starting and ending rows on which nonzero elements exist
Exploiting Cache Block Structure:
– Unlike previous versions of the code, treat each of the sub-matrices as an individual sparse matrix
– Allows easier incorporation of different kernels and multiple levels of cache blocking
– Negligible performance cost because of the small number of cache blocks
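A minimal sketch of how the Row Start/End idea might change the per-block loop, reusing the hypothetical csr_block structure above with two added fields: only the rows between the first and last rows that actually contain nonzeros are visited, so runs of empty rows in a band matrix cost nothing.

```c
/* Assumed additional fields in csr_block:
 *   int first_row;  first row of the block containing a nonzero
 *   int last_row;   last row of the block containing a nonzero
 * The per-block loop then skips the all-zero rows entirely. */
for (int i = blk->first_row; i <= blk->last_row; i++) {
    double sum = 0.0;
    for (int k = blk->row_ptr[i]; k < blk->row_ptr[i + 1]; k++)
        sum += blk->val[k] * x[blk->colind[k]];
    y[blk->row_offset + i] += sum;
}
```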

Related Work
Traditional dense methods for finding tile sizes cannot be used because of the indirect and irregular memory accesses.
Similar work in the sparse case:
– Temam and Jalby, Heras et al., and Fraguela et al. have developed probabilistic models that predict cache misses
– These models distinguish themselves by accounting for self- and cross-interference misses, but assume a uniform distribution of nonzeros
– Our models account for multiple levels of the memory system, including the TLB, and explicitly model the execution time
Other areas:
– Gropp et al. and Heber et al. use similar techniques to tune applications in other domains

Analytic Models of the Memory System
Execution time model:
– Cache misses are modeled and verified with hardware counters
– Charge a cost α_i for hits at each level of the memory hierarchy:
  T = (L1 hits)·α_L1 + (L2 hits)·α_L2 + (memory hits)·α_mem + (TLB misses)·α_TLB
Cache models:
– The lower bound on cache misses assumes only compulsory misses
– The upper bound is the same as the lower bound, except that every access to the source and destination vectors misses
Based on models by Vuduc et al. in Supercomputing 2002.
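A small sketch of evaluating this model: given hit/miss counts (either the lower/upper-bound estimates or, later, measured PAPI values) and per-level access costs α, the predicted time is a single weighted sum. The function name and parameter names are illustrative, not from the BeBOP code.

```c
/* Execution-time model:
 *   T = (L1 hits)*a_l1 + (L2 hits)*a_l2 + (mem hits)*a_mem
 *       + (TLB misses)*a_tlb
 * The counts can come from the lower/upper-bound miss models or from
 * hardware counters; the alphas are per-access costs measured once per
 * machine. */
double model_time(double l1_hits, double l2_hits, double mem_hits,
                  double tlb_misses,
                  double a_l1, double a_l2, double a_mem, double a_tlb)
{
    return l1_hits * a_l1 + l2_hits * a_l2
         + mem_hits * a_mem + tlb_misses * a_tlb;
}
```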

Analytic Models of the Memory System (cont.)
The models are too simple to show the performance advantages of cache blocking:
– According to the models, cache blocking has no benefit, since blocking only adds overhead to the data structure
– One level of the memory system must be modeled more accurately to expose the advantages of locality
– Empirical evidence suggests the largest performance gains from cache blocking come from minimizing TLB misses
Empirical evidence also suggests two categories of block sizes that work well.
(Table not reproduced: for each row and column block size, the cell gives the number of our 14 test matrices whose performance was within 90% of peak when that block size was chosen.)

Analytic Model Verification
14 test matrices in our suite, with varying density and structure:
– Except for a dense matrix stored in sparse format, all matrices are too sparse to benefit from register blocking
– The source vectors for all matrices are much larger than the largest level of cache
– Wide range of applications, including linear programming, web connectivity, and latent semantic indexing
3 platforms:
– Intel Itanium 2
– IBM Power 4
– Intel Pentium 3
Where available, we used the hardware counters through PAPI.

Model Verification
Analytic model:
– Overpredicts performance by more than a factor of 2
– Relative performance is well predicted, implying it can be used as the basis for a heuristic
– Cache blocking improves performance on certain matrices and yields no improvement on others
PAPI model:
– Instead of lower- and upper-bound models for cache and TLB misses, true hardware counter values are used
– Still overpredicts performance, implying there is time we are not accounting for

L3 Data Cache Miss Reduction
Matrices with the largest speedups experience the largest decreases in the number of cache misses: more than one order of magnitude fewer misses in some cases.

TLB Miss Reduction
Matrices with the largest speedups experience the largest decreases in the number of TLB misses: more than two orders of magnitude fewer misses in some cases.

Speedup
For the same matrix, the speedup varies across platforms: platforms with larger processor-memory gaps benefit more from cache blocking.
Matrices 5 and 6 experience the best speedups on all platforms.
Matrices 12-14, which are extremely sparse, show very little speedup.

Sparse Band Matrices
The Row Start/End optimization greatly reduces sensitivity to block size; the best performance is insensitive to the row block size.
Cache blocking is not useful for a banded matrix, because there is little cache thrashing to eliminate.
The RSE optimization forgives the choice of a bad block size.

When and Why Cache Blocking Works
Matrix structure:
– Matrices with a large column dimension are aided by cache blocking: the gains from temporal locality in accesses to the source vector x far outweigh the costs of the added data structure
– If the matrix is too sparse, cache blocking does not help: there is not enough temporal locality to amortize the added overhead
– Band matrices are not aided by cache blocking: the matrix already has an optimal access pattern

Platform Evaluation
The effectiveness of cache blocking varies across platforms for the same matrix:
– As the average number of cycles to access main memory goes up, cache blocking helps more, since it reduces expensive accesses to main memory
TLB misses account for a nontrivial part of the execution time:
– Larger page sizes would mitigate this problem
In SpM x V there are two types of access:
– Streaming access to the matrix (only spatial locality is exploited); keeping the matrix in cache wastes space that the source vector could use
– Random access to the source vector (both spatial and temporal locality are exploited); the more of the source vector that can stay in cache, the better
Future work: machines without caches (SX1) or machines with uncached loads (Cray X1).

Conclusions
Cache blocking of SpM x V works when the benefit of the added temporal locality in the source vector x outweighs the cost of the added data structure overhead.
We use analytic models to select the cache block size:
– The models are good at predicting relative performance
Cache blocking appears to be most effective on matrices in which the column dimension is much greater than the row dimension.
Banded matrices are not aided by cache blocking, since their structure already lends itself to the optimal access pattern:
– The Row Start/End optimization forgives a bad block size choice
Full technical report:
– Performance Modeling and Analysis of Cache Blocking in Sparse Matrix Vector Multiply (UCB/CSD, June 2004). Rajesh Nishtala, Richard W. Vuduc, James W. Demmel, Katherine A. Yelick.

References (1/3)
Full technical report:
– Performance Modeling and Analysis of Cache Blocking in Sparse Matrix Vector Multiply (UCB/CSD, June 2004). Rajesh Nishtala, Richard W. Vuduc, James W. Demmel, Katherine A. Yelick.
Other references:
1. S. Browne, J. Dongarra, N. Garner, K. London, and P. Mucci. A scalable cross-platform infrastructure for application performance tuning using hardware counters. In Proceedings of Supercomputing, November.
2. B. B. Fraguela, R. Doallo, and E. L. Zapata. Memory hierarchy performance prediction for sparse blocked algorithms. Parallel Processing Letters, 9(3).
3. W. D. Gropp, D. K. Kaushik, D. E. Keyes, and B. F. Smith. Towards realistic bounds for implicit CFD codes. In Proceedings of Parallel Computational Fluid Dynamics, pages 241-248, 1999.

References (2/3)
4. G. Heber, A. J. Dolgert, M. Alt, K. A. Mazurkiewicz, and L. Stringer. Fracture mechanics on the Intel Itanium architecture: A case study. In Workshop on EPIC Architectures and Compiler Technology (ACM MICRO 34), Austin, TX.
5. D. B. Heras, V. B. Perez, J. C. C. Dominguez, and F. F. Rivera. Modeling and improving locality for irregular problems: sparse matrix-vector product on cache memories as a case study. In HPCN Europe, pages 201-210.
6. E.-J. Im. Optimizing the performance of sparse matrix-vector multiplication. PhD thesis, University of California, Berkeley, May.
7. Y. Saad. SPARSKIT: A basic toolkit for sparse matrix computations.
8. R. H. Saavedra-Barrera. CPU Performance Evaluation and Execution Time Prediction Using Narrow Spectrum Benchmarking. PhD thesis, University of California, Berkeley, February 1992.

References (3/3)
9. A. Snavely, L. Carrington, and N. Wolter. Modeling application performance by convolving machine signatures with application profiles.
10. O. Temam and W. Jalby. Characterizing the behavior of sparse algorithms on caches. In Proceedings of Supercomputing '92, 1992.
11. R. Vuduc, J. W. Demmel, K. A. Yelick, S. Kamil, R. Nishtala, and B. Lee. Performance optimizations and bounds for sparse matrix-vector multiply. In Proceedings of Supercomputing, Baltimore, MD, USA, November 2002.
12. R. W. Vuduc. Automatic performance tuning of sparse matrix kernels. PhD thesis, University of California, Berkeley.

This work was supported in part by the National Science Foundation under ACI and ACI, and by gifts from Intel Corporation, Hewlett Packard, and Sun Microsystems. The information presented here does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.