" Characterizing the Relationship between ILU-type Preconditioners and the Storage Hierarchy" " Characterizing the Relationship between ILU-type Preconditioners.

Presentation transcript:

" Characterizing the Relationship between ILU-type Preconditioners and the Storage Hierarchy" " Characterizing the Relationship between ILU-type Preconditioners and the Storage Hierarchy" Diego Rivera 1, David Kaeli 1 and Misha Kilmer 2 1 Department of Electrical and Computer Engineering Northeastern University, Boston, MA {drivera, 2 Department of Mathematics Tufts University, Medford, MA ICSS Institute for Complex Scientific Software Approximate inverse preconditioner: SPAI, MR, etc. The PIN tool was used to capture cache events. LRU and random replacement policies were modeled Several matrices were evaluated. Results from four representative matrices are shown below: Plans and future work Developing a benchmark suite for evaluating how best fill-in can be used for a given memory hierarchy and application code Arriving at an algorithmic approach to select the best values of the preconditioner parameters for a given memory hierarchy Proposing a new portable ILU-type preconditioner that does dynamic matrix fill-in:  Reordering technique for improving temporal locality  Adapting the number of non-zero elements to the block’s size of the highest cache level for improving spatial locality Objective To improve the performance of preconditioners targeting sparse matrices To accelerate the memory accesses associated with these codes Motivation Prior work targeted Krylov subspace methods However, little has been done in the case of preconditioners “Nothing will be more central to computational science in the next century than the art of transforming a problem that appears intractable into another whose solution can be approximated rapidly. For Krylov subspace matrix iterations, this is preconditioning” from Numerical Linear Algebra by Trefethen and Bau (1997). Common target applications Computational time is a barrier in these applications Parallel processing can be used to lower this barrier The sparsity of the data reduces the effectiveness of direct parallel computation Preconditioners can be used to accelerate the convergence of Krylov subspace methods A drawback of these approaches is that it is difficult to choose good values for their tuning-parameters Choosing good values depends heavily on the structure of non- zero elements of the coefficient matrix In our work we have found that it depends also on the memory hierarchy machine used to compute the solution What about tuning memory access patterns of preconditioner techniques? Acknowledgement This project is supported by the National Science Foundation’s Computing and Communication Foundations Division, grant number CCF and the Institute of Complex Scientific Software. Preconditioner Ax=b Solution to the linear system M -1 Ax=M -1 b Iterative Method Weather Simulations Turbulence problems in airplanes DNA models A (m,m) x (m) = b (m) Results for ILUD preconditioner and method GMRES, 14 possible values for each parameter (drop tolerance, diagonal compensation parameter). There are 378 possible combinations. drop tolerance, diagonal compensation parameter and tolerance ratio , , permtol ILUDP ……... 
Evaluation environment
The PIN tool was used to capture cache events; both LRU and random replacement policies were modeled. Two machines were evaluated:

Intel Xeon 3.06 GHz: 8 KB 4-way L1 data cache, 512 KB 8-way L2, 1 MB 8-way L3, pseudo-LRU replacement at all cache levels, 2 GB RAM.
UltraSPARC-III 750 MHz: 64 KB 4-way L1 data cache, 8 MB 2-way L2, no L3, pseudo-random replacement at all cache levels, 1 GB RAM.

Matrices
Several matrices were evaluated; results from four representative matrices are shown below. The ratio of numerical symmetry to matrix bandwidth (NS/B) decreases from Raefsky3 to Torso3.

Name      Rows       Non-zero elements   Numerical symmetry (NS)
Raefsky3  21,200     1,488,768           48%
Ldoor     952,203    42,493,817          100%
Cage14    1,505,785  27,130,349          21%
Torso3    259,156    4,429,042           0%

Results
Results are shown for the ILUD preconditioner with GMRES, using 14 possible values for each parameter (drop tolerance and diagonal compensation parameter); 378 parameter combinations were evaluated in total.

Figure: error norm vs. the first 13 parameter pairs, sorted in increasing order of execution time, for ILUT with GMRES.
Figure: correlation of load accesses and execution time at each level of the memory hierarchy (DTLB, DL1, and L2 on the UltraSPARC-III; DTLB, DL1, L2, and L3 on the Intel Xeon).

Comparing the cost of the preconditioner with the cost of the Krylov method:
- Preconditioner cost >> Krylov cost (the Xeon case): the preconditioner dominates execution time, so it is desirable to pay less for the preconditioner.
- Preconditioner cost << Krylov cost (the UltraSPARC case): the Krylov method dominates execution time, so it is desirable to pay more for the preconditioner.
- Preconditioner cost roughly equal to Krylov cost: not interesting; these are usually easy problems.

In the i-th iteration of the outer loop of the factorization, the i-th row is the data that is accessed and modified, the rows already factored are accessed but not modified, and the remaining rows are not accessed (a small illustrative sketch of this loop appears at the end of this transcript).

Plans and future work
- Developing a benchmark suite for evaluating how fill-in can best be used for a given memory hierarchy and application code
- Arriving at an algorithmic approach to select the best values of the preconditioner parameters for a given memory hierarchy
- Proposing a new portable ILU-type preconditioner that performs dynamic matrix fill-in, using a reordering technique to improve temporal locality and adapting the number of non-zero elements to the block size of the highest cache level to improve spatial locality

Acknowledgement
This project is supported by the National Science Foundation's Computing and Communication Foundations Division, grant number CCF, and the Institute of Complex Scientific Software.
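A closing illustration of the access pattern described in the results above: the sketch below is a simplified, dense IKJ-ordered factorization loop written for clarity, not the authors' sparse ILU code. In iteration i of the outer loop, row i is both read and written, the already-factored rows k < i are only read, and the rows below i are untouched, which is the locality behavior that ties ILU-type preconditioners to the storage hierarchy.

```python
import numpy as np

def ikj_lu_in_place(A):
    """In-place LU factorization (no pivoting), IKJ loop order."""
    n = A.shape[0]
    for i in range(1, n):                       # row i: accessed AND modified
        for k in range(i):                      # rows k < i: accessed but NOT modified
            A[i, k] /= A[k, k]                  # multiplier stored in the strict lower part
            A[i, k + 1:] -= A[i, k] * A[k, k + 1:]
        # rows i+1 .. n-1: not accessed in this iteration of the outer loop
    return A

# Tiny stand-in matrix; a sparse ILU variant would additionally drop or limit
# fill-in according to the tuning parameters in the target-preconditioners list.
A = np.array([[ 4.0, -1.0,  0.0],
              [-1.0,  4.0, -1.0],
              [ 0.0, -1.0,  4.0]])
print(ikj_lu_in_place(A.copy()))
```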