
1 CS 290H Administrivia: May 14, 2008

Course project progress reports due next Wed, 21 May.
Reading in Saad (second edition): Sections 13.1-13.5.

2 Conjugate gradient iteration

    x_0 = 0,  r_0 = b,  d_0 = r_0
    for k = 1, 2, 3, ...
        α_k = (r_{k-1}^T r_{k-1}) / (d_{k-1}^T A d_{k-1})    step length
        x_k = x_{k-1} + α_k d_{k-1}                          approx solution
        r_k = r_{k-1} - α_k A d_{k-1}                        residual
        β_k = (r_k^T r_k) / (r_{k-1}^T r_{k-1})              improvement
        d_k = r_k + β_k d_{k-1}                              search direction

One matrix-vector multiplication per iteration
Two vector dot products per iteration
Four n-vectors of working storage
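The loop above transcribes almost line for line into code. Here is a minimal sketch in Python with NumPy (the function name is mine, not from the slides); it assumes A is symmetric positive definite.

    import numpy as np

    def conjugate_gradient(A, b, tol=1e-8, maxiter=1000):
        """Unpreconditioned CG, following the iteration on the slide."""
        x = np.zeros_like(b, dtype=float)   # x_0 = 0
        r = b.astype(float)                 # r_0 = b
        d = r.copy()                        # d_0 = r_0
        rr = r @ r
        for _ in range(maxiter):
            Ad = A @ d                      # the one matvec per iteration
            alpha = rr / (d @ Ad)           # step length
            x = x + alpha * d               # approx solution
            r = r - alpha * Ad              # residual
            rr_new = r @ r                  # second of the two dot products
            if np.sqrt(rr_new) < tol:
                break
            beta = rr_new / rr              # improvement
            d = r + beta * d                # search direction
            rr = rr_new
        return x

The four n-vectors of working storage are x, r, d, and Ad.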

3 Preconditioned conjugate gradient iteration

    x_0 = 0,  r_0 = b,  d_0 = B^{-1} r_0,  y_0 = B^{-1} r_0
    for k = 1, 2, 3, ...
        α_k = (y_{k-1}^T r_{k-1}) / (d_{k-1}^T A d_{k-1})    step length
        x_k = x_{k-1} + α_k d_{k-1}                          approx solution
        r_k = r_{k-1} - α_k A d_{k-1}                        residual
        y_k = B^{-1} r_k                                     preconditioning solve
        β_k = (y_k^T r_k) / (y_{k-1}^T r_{k-1})              improvement
        d_k = y_k + β_k d_{k-1}                              search direction

One matrix-vector multiplication per iteration
One solve with preconditioner per iteration
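The same sketch extends to the preconditioned iteration; solve_B stands for whatever applies B^{-1} (a name I chose for the sketch, not a standard API).

    import numpy as np

    def preconditioned_cg(A, b, solve_B, tol=1e-8, maxiter=1000):
        """PCG as on the slide; solve_B(z) applies the preconditioner B^{-1}."""
        x = np.zeros_like(b, dtype=float)   # x_0 = 0
        r = b.astype(float)                 # r_0 = b
        y = solve_B(r)                      # y_0 = B^{-1} r_0
        d = y.copy()                        # d_0 = B^{-1} r_0
        yr = y @ r
        for _ in range(maxiter):
            Ad = A @ d                      # one matvec per iteration
            alpha = yr / (d @ Ad)           # step length
            x = x + alpha * d               # approx solution
            r = r - alpha * Ad              # residual
            if np.linalg.norm(r) < tol:
                break
            y = solve_B(r)                  # the one preconditioner solve
            yr_new = y @ r
            beta = yr_new / yr              # improvement
            d = y + beta * d                # search direction
            yr = yr_new
        return x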

4 Incomplete Cholesky factorization (IC, ILU)

Compute factors of A by Gaussian elimination, but ignore fill
Preconditioner B = R^T R ≈ A, not formed explicitly
Compute B^{-1} z by triangular solves (in time nnz(A))
Total storage is O(nnz(A)), static data structure
Either symmetric (IC) or nonsymmetric (ILU)

[Figure: nonzero structure of A and of the incomplete factors R^T and R]
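To make "ignore fill" concrete, here is a toy no-fill incomplete Cholesky sketch in Python/NumPy (dense storage for clarity; the helper name is mine). It assumes the pivots stay positive, which plain IC does not guarantee.

    import numpy as np

    def ic0(A):
        """No-fill incomplete Cholesky: run Cholesky, but keep an entry
        of R only where A is structurally nonzero, discarding all fill."""
        n = A.shape[0]
        R = np.triu(np.array(A, dtype=float))
        for k in range(n):
            R[k, k] = np.sqrt(R[k, k])      # assumes pivot > 0 (no breakdown)
            R[k, k + 1:] /= R[k, k]
            for i in range(k + 1, n):
                keep = A[i, i:] != 0        # update only where A allows: no fill
                R[i, i:][keep] -= R[k, i] * R[k, i:][keep]
        return R                            # B = R.T @ R approximates A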

5 Incomplete Cholesky and ILU: Variants

Allow one or more "levels of fill"
  - unpredictable storage requirements
Allow fill whose magnitude exceeds a "drop tolerance" (see the sketch below)
  - may get better approximate factors than levels of fill
  - unpredictable storage requirements
  - choice of tolerance is ad hoc
Partial pivoting (for nonsymmetric A)
"Modified ILU" (MIC): add dropped fill to the diagonal of U or R
  - A and R^T R have the same row sums
  - good in some PDE contexts
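SciPy's spilu (SuperLU's threshold-based incomplete LU) exposes exactly these knobs; the matrix and parameter values below are made up for the demo.

    import scipy.sparse as sp
    from scipy.sparse.linalg import spilu

    # drop_tol discards small fill; fill_factor caps storage growth.
    # More fill: more time per iteration, usually fewer iterations.
    n = 200
    C = sp.random(n, n, density=0.05, format="csc", random_state=0)
    A = (C + C.T + 20 * sp.identity(n)).tocsc()   # diagonally dominant, SPD
    ilu = spilu(A, drop_tol=1e-4, fill_factor=10)
    solve_B = ilu.solve       # applies B^{-1} z via two triangular solves
    # plug into the PCG sketch above: preconditioned_cg(A, b, solve_B)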

6 Matrix permutations for iterative methods

Symmetric matrix permutations don't change the convergence of unpreconditioned CG
Symmetric matrix permutations affect the quality of an incomplete factorization; this is poorly understood and controversial
Often banded (RCM) is best for IC(0) / ILU(0) (see the sketch below)
Often min degree & nested dissection are bad for no-fill incomplete factorization, but good if some fill is allowed
Some experiments with orderings that use matrix values
  - e.g. "minimum discarded fill" order
  - sometimes effective, but expensive to compute
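SciPy ships an RCM implementation; a small sketch of using it to permute a sparse symmetric matrix into banded order (the wrapper name is mine):

    from scipy.sparse import csr_matrix
    from scipy.sparse.csgraph import reverse_cuthill_mckee

    def rcm_reorder(A):
        """Symmetric (banded) RCM reordering of a sparse symmetric matrix."""
        perm = reverse_cuthill_mckee(csr_matrix(A), symmetric_mode=True)
        return A[perm, :][:, perm]      # P A P^T: same permutation on both sides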

7 Nonsymmetric matrix permutations for iterative methods

Dynamic permutation: ILU with row or column pivoting
  - e.g. ILUTP (Saad), Matlab "luinc"
  - more robust but more expensive than ILUT
Static permutation: try to increase diagonal dominance
  - e.g. bipartite weighted matching (Duff & Koster)
  - often effective; no theory for quality of preconditioner
Field is not well understood, to say the least

8 Row permutation for heavy diagonal [Duff, Koster]

Represent A as a weighted, undirected bipartite graph (one node for each row and one node for each column)
Find a matching (set of independent edges) with maximum product of weights
Permute rows to place the matching on the diagonal
The matching algorithm also gives a row and column scaling that makes all diagonal entries = 1 and all off-diagonal entries <= 1

[Figure: 5 x 5 example; rows of A reordered as 4, 2, 5, 3, 1 so PA has the matching on its diagonal]
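A dense sketch of the reduction to an assignment problem: maximizing the product of |a_ij| is the same as minimizing the sum of -log|a_ij|, which scipy's linear_sum_assignment solves. (Duff & Koster's code, MC64, also returns the scalings; this sketch omits them, and the function name is mine.)

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def heavy_diagonal(A):
        """Row permutation placing a max-product matching on the diagonal."""
        with np.errstate(divide="ignore"):
            cost = -np.log(np.abs(A))   # max prod |a_ij| == min sum -log|a_ij|
        cost[np.isinf(cost)] = 1e100    # forbid structural zeros
        rows, cols = linear_sum_assignment(cost)
        perm = np.empty(A.shape[0], dtype=int)
        perm[cols] = rows               # the row matched to each column
        return A[perm, :]               # PA has the matching on its diagonal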

9 Preconditioned conjugate gradient iteration

    x_0 = 0,  r_0 = b,  d_0 = B^{-1} r_0,  y_0 = B^{-1} r_0
    for k = 1, 2, 3, ...
        α_k = (y_{k-1}^T r_{k-1}) / (d_{k-1}^T A d_{k-1})    step length
        x_k = x_{k-1} + α_k d_{k-1}                          approx solution
        r_k = r_{k-1} - α_k A d_{k-1}                        residual
        y_k = B^{-1} r_k                                     preconditioning solve
        β_k = (y_k^T r_k) / (y_{k-1}^T r_{k-1})              improvement
        d_k = y_k + β_k d_{k-1}                              search direction

Several vector inner products per iteration (easy to parallelize)
One matrix-vector multiplication per iteration (medium to parallelize)
One solve with preconditioner per iteration (hard to parallelize)

10 Incomplete Cholesky and ILU: Issues

Choice of parameters
  - good: smooth transition from iterative to direct methods
  - bad: very ad hoc, problem-dependent
  - tradeoff: time per iteration (more fill => more time) vs. number of iterations (more fill => fewer iterations)
Effectiveness
  - condition number usually improves (only) by a constant factor (except MIC for some problems from PDEs)
  - still, often good when tuned for a particular class of problems
Parallelism
  - triangular solves are not very parallel
  - reordering for parallel triangular solve by graph coloring
  - dynamic coloring: ILUM (Saad)

11 Sparse approximate inverses

Compute B^{-1} ≈ A^{-1} explicitly
Minimize ||A B^{-1} - I||_F (in parallel, by columns)
Variants: factored form of B^{-1}, more fill, ...
Good: very parallel, seldom breaks down
Bad: effectiveness varies widely

[Figure: nonzero structure of A and of B^{-1}]
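A minimal sketch of the column-by-column least-squares formulation (dense storage, fixed sparsity pattern; the names are mine). Each column is an independent small problem, which is exactly where the parallelism comes from.

    import numpy as np

    def spai(A, pattern):
        """Sparse approximate inverse: minimize ||A M - I||_F one column
        at a time; pattern[j] lists the rows where column j of M may be
        nonzero (a common choice: the pattern of column j of A)."""
        n = A.shape[0]
        M = np.zeros((n, n))
        for j in range(n):              # columns are independent: parallel
            J = pattern[j]
            e = np.zeros(n)
            e[j] = 1.0
            # small least-squares problem: min || A[:, J] m - e_j ||_2
            m, *_ = np.linalg.lstsq(A[:, J], e, rcond=None)
            M[J, j] = m
        return M                        # M plays the role of B^{-1}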

12 Complexity of linear solvers

Time to solve the model problem (Poisson's equation) on a regular mesh
(2D mesh of side n^{1/2}; 3D mesh of side n^{1/3}):

                              2D                      3D
    Dense Cholesky:           O(n^3)                  O(n^3)
    Sparse Cholesky:          O(n^1.5)                O(n^2)
    CG, exact arithmetic:     O(n^2)                  O(n^2)
    CG, no precond:           O(n^1.5)                O(n^1.33)
    CG, modified IC:          O(n^1.25)               O(n^1.17)
    CG, support trees:        O(n^1.20) -> O(n^1+)    O(n^1.75) -> O(n^1.31)
    Multigrid:                O(n)                    O(n)

13 Matrix-vector product: parallel implementation

Lay out matrix and vectors by rows
Hard part is the matrix-vector product y = A*x
Algorithm: each processor j
  - broadcast x(j)
  - compute y(j) = A(j,:)*x
May send more of x than needed
Partition / reorder the matrix to reduce communication (see the sketch below)

[Figure: x, y, and the rows of A partitioned across processors P0-P3]
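A sketch of the row layout with mpi4py, assuming every rank owns the same number of rows so a plain Allgather can play the role of the broadcast step (the function name is mine):

    from mpi4py import MPI
    import numpy as np

    def parallel_matvec(A_local, x_local, comm=MPI.COMM_WORLD):
        """y = A*x with A, x, y laid out by block rows: A_local holds this
        rank's rows A(j,:) and x_local its slice x(j)."""
        x = np.empty(comm.Get_size() * x_local.size)
        comm.Allgather(x_local, x)      # every rank assembles all of x
        return A_local @ x              # y(j) = A(j,:) * x, computed locally

As the slide notes, each rank may receive more of x than its rows actually need; partitioning and reordering A is what reduces that communication.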

14 Sparse Matrix-Vector Multiplication
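This slide is just a title (the body was presumably a figure). For reference, here is the textbook kernel it names, written out as a plain Python sketch over compressed sparse row (CSR) storage:

    import numpy as np

    def csr_matvec(vals, col_idx, row_ptr, x):
        """y = A*x for A in CSR form: row i's nonzero values sit in
        vals[row_ptr[i]:row_ptr[i+1]], with their columns in col_idx."""
        n = len(row_ptr) - 1
        y = np.zeros(n)
        for i in range(n):
            for k in range(row_ptr[i], row_ptr[i + 1]):
                y[i] += vals[k] * x[col_idx[k]]
        return y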

15 Graph partitioning in theory

If G is a planar graph with n vertices, there exists a set of at most sqrt(6n) vertices whose removal leaves no connected component with more than 2n/3 vertices. ("Planar graphs have sqrt(n)-separators.")
"Well-shaped" finite element meshes in 3 dimensions have n^{2/3}-separators.
Also some other classes of graphs: trees, graphs of bounded genus, chordal graphs, bounded-excluded-minor graphs, ...
Mostly these theorems come with efficient algorithms, but they aren't used much.

16 Graph partitioning in practice

Graph partitioning heuristics have been an active research area for many years, often motivated by partitioning for parallel computation. See CS 240A.
Some techniques:
  - spectral partitioning (uses eigenvectors of the Laplacian matrix of the graph; see the sketch below)
  - geometric partitioning (for meshes with specified vertex coordinates)
  - iterative swapping (Kernighan-Lin, Fiduccia-Mattheyses)
  - breadth-first search (GLN 7.3.3, fast but dated)
Many popular modern codes (e.g. Metis, Chaco) use multilevel iterative swapping.
Matlab graph partitioning toolbox: see course web page.
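A sketch of spectral bisection via the Fiedler vector, the eigenvector for the second-smallest eigenvalue of the graph Laplacian (dense eigh keeps the sketch simple; real partitioners use sparse eigensolvers, and the function name is mine):

    import numpy as np
    from scipy.sparse.csgraph import laplacian

    def spectral_bisection(adjacency):
        """Two-way split: vertices with Fiedler-vector entries above the
        median go in one part, the rest in the other (a balanced cut)."""
        L = laplacian(np.asarray(adjacency, dtype=float))
        _, vecs = np.linalg.eigh(L)        # eigenvalues ascend, so ...
        fiedler = vecs[:, 1]               # ... column 1 is the Fiedler vector
        return fiedler >= np.median(fiedler)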

17 Parallel Incomplete Cholesky and ILU: Issues

Computing the preconditioner
  - parallel direct methods are well developed
  - but IC/ILU is sparser => harder to speed up
  - still, you only have to do it once
Applying the preconditioner
  - triangular solves are not very parallel
  - reordering by graph coloring (see the sketch below)
  - but the orderings are not great for convergence
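The coloring idea: vertices of one color are mutually non-adjacent, so a triangular solve can process each color class in parallel, one class after another. A greedy sketch (pure Python; names mine):

    def greedy_coloring(neighbors):
        """Greedy vertex coloring; neighbors[v] lists v's adjacent vertices."""
        color = {}
        for v in range(len(neighbors)):
            used = {color[u] for u in neighbors[v] if u in color}
            c = 0
            while c in used:
                c += 1                  # smallest color unused by neighbors
            color[v] = c
        return color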

