
1 Parallelizing the conjugate gradient algorithm for multilevel Toeplitz systems
Jie Chen (a) and Tom L. H. Li (b)
(a) Argonne National Laboratory
(b) University of Missouri–St. Louis
ICCS 2013

2 Toeplitz
• What is Toeplitz?
• Where does it come from?
  – One-dimensional regular grid
• Another example
  – The standard Laplacian. But it is often treated as a sparse matrix.
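
To make the definition concrete: a Toeplitz matrix is constant along each diagonal, so entry (i, j) depends only on i - j. The sketch below is not from the talk. It assumes the symmetric case, and the helper name `toeplitz_from_first_column` is hypothetical. It builds the standard 1-D Laplacian (stencil -1, 2, -1) on a 5-point regular grid from its first column.

```python
def toeplitz_from_first_column(t):
    """Dense symmetric Toeplitz matrix: entry (i, j) equals t[|i - j|]."""
    n = len(t)
    return [[t[abs(i - j)] for j in range(n)] for i in range(n)]

# First column of the 1-D Laplacian on a 5-point regular grid.
laplacian_1d = toeplitz_from_first_column([2, -1, 0, 0, 0])

for row in laplacian_1d:
    print(row)
```

Only the first column needs to be stored, which is why the matrix is usually handled through its generator rather than as a dense or sparse array.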

3 Multilevel Toeplitz
• Multilevel Toeplitz is defined w.r.t. the number of levels
• Where does it come from?
  – d-dimensional regular grid
• Think about the 2D Laplacian
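
The 2D Laplacian mentioned here can be written out as a two-level Toeplitz matrix. The following sketch is an illustration, not code from the talk. It assumes symmetry within each level, a row-major ordering of grid points, and a hypothetical helper name `two_level_toeplitz`. The generator `t` is a 2-D array, and the matrix entry for grid points (i1, i2) and (j1, j2) is `t[|i1 - j1|, |i2 - j2|]`.

```python
import numpy as np

def two_level_toeplitz(t):
    """Dense two-level Toeplitz matrix with entry ((i1,i2),(j1,j2)) = t[|i1-j1|, |i2-j2|].

    Assumes symmetry within each level; grid point (i1, i2) maps to row i1*n2 + i2.
    """
    n1, n2 = t.shape
    d1 = np.abs(np.arange(n1)[:, None] - np.arange(n1)[None, :])  # |i1 - j1|
    d2 = np.abs(np.arange(n2)[:, None] - np.arange(n2)[None, :])  # |i2 - j2|
    blocks = t[d1[:, None, :, None], d2[None, :, None, :]]        # shape (n1, n2, n1, n2)
    return blocks.reshape(n1 * n2, n1 * n2)

# Generator of the 5-point 2-D Laplacian on a 3 x 4 grid.
n1, n2 = 3, 4
t = np.zeros((n1, n2))
t[0, 0], t[1, 0], t[0, 1] = 4.0, -1.0, -1.0
T = two_level_toeplitz(t)

def lap1d(n):
    """Dense 1-D Laplacian, used only for the cross-check below."""
    return 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

# Cross-check against the usual Kronecker-sum construction.
T_kron = np.kron(lap1d(n1), np.eye(n2)) + np.kron(np.eye(n1), lap1d(n2))
print(np.array_equal(T, T_kron))
```

The matrix has n1*n2 rows but is fully described by the n1-by-n2 tensor `t`, which is the representation the later slides work with.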

4 Why Solve (Multilevel) Toeplitz Systems?
• Scattered data interpolation [Figure from http://pythonhosted.org/]
• The covariance matrix K is multilevel Toeplitz when the X_i's are on a regular grid
• K^{-1} also appears in many other problems, such as maximum likelihood estimation

5 How to Solve?
• General direct methods via matrix factorizations: O(n^3)
• Fast direct methods for 1-level Toeplitz: O(n^2)
  – Levinson-Durbin, 1947, 1960
  – Bareiss, 1969
• Superfast direct methods for 1-level Toeplitz: O(n log^α n)
  – Pan, 1993
  – Stewart, 2003
  – Chandrasekaran et al., 2007
• Methods for specialized systems: banded, block Toeplitz, Toeplitz block, etc.
• General method for any-level Toeplitz: O(n log n)
  – Chan and Jin, 2007
  – Use an iterative solver (e.g., conjugate gradient)
  – Matrix-vector multiplication through FFT
  – Circulant preconditioner
We parallelize this method.

6 Conjugate Gradient (CG)
[Slide figure not preserved; annotations: Multilevel Toeplitz, Multilevel circulant]
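
For reference, here is a minimal sketch of textbook preconditioned CG. It is not necessarily the exact variant shown on the slide. The callbacks `apply_T` and `apply_M_inv` are hypothetical names: in the setting of this talk the first would be a multilevel Toeplitz multiply and the second a multilevel circulant solve, while the small demo substitutes a dense 1-D Laplacian and a diagonal (Jacobi) preconditioner.

```python
import numpy as np

def pcg(apply_T, apply_M_inv, b, tol=1e-10, max_iter=1000):
    """Textbook preconditioned CG for T x = b with T symmetric positive definite."""
    b = np.asarray(b, dtype=float)
    x = np.zeros_like(b)
    r = b.copy()                      # residual b - T x for the zero initial guess
    z = apply_M_inv(r)
    p = z.copy()
    rz = np.vdot(r, z)
    b_norm = np.linalg.norm(b)
    for iteration in range(max_iter):
        if np.linalg.norm(r) <= tol * b_norm:
            return x, iteration
        Tp = apply_T(p)
        alpha = rz / np.vdot(p, Tp)   # inner product 1
        x = x + alpha * p
        r = r - alpha * Tp
        z = apply_M_inv(r)
        rz_new = np.vdot(r, z)        # inner product 2
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x, max_iter

# Demo with stand-ins: dense 1-D Laplacian and Jacobi preconditioner.
n = 32
T = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
x, iterations = pcg(lambda v: T @ v, lambda r: r / 2.0, b)
print(iterations, np.linalg.norm(T @ x - b))
```

Each iteration needs one matrix-vector product, one preconditioner solve, and a few inner products. In a distributed setting the inner products are the reductions that the later slides aim to eliminate.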

7 Toeplitz-Multiply
• Circulant embedding (1-level case)
• In case of symmetry, both T and C are represented by their first columns, t and c
• Circulant-multiply:
  1. λ = fft(c)
  2. v' = ifft( λ .* fft(y') )
• Simple to generalize to the d-level case
  – t is a d-dimensional tensor
  – Circulant embedding done along all dimensions
  – FFT and IFFT become multidimensional
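
The two circulant-multiply steps can be sketched for the 1-level symmetric case as follows. This is an illustrative sketch under assumptions: the embedding below uses a 2n-by-2n circulant with a zero in the one free position (a common choice, not confirmed by the slide), and the function names are hypothetical. Here y' is y padded with n zeros, and v is the first n entries of v'.

```python
import numpy as np

def embed_circulant(t):
    """First column c of the 2n x 2n circulant C whose leading n x n block is T."""
    t = np.asarray(t, dtype=float)
    return np.concatenate([t, [0.0], t[:0:-1]])

def toeplitz_multiply(t, y):
    """Return v = T y for symmetric Toeplitz T with first column t, in O(n log n)."""
    n = len(t)
    lam = np.fft.fft(embed_circulant(t))           # step 1: eigenvalues of C
    y_pad = np.concatenate([y, np.zeros(n)])       # y' = y padded with n zeros
    v_pad = np.fft.ifft(lam * np.fft.fft(y_pad))   # step 2
    return v_pad[:n].real                          # truncate v' to its first n entries

# Compare against a dense multiply on a small example.
t = np.array([4.0, -1.0, 0.5, 0.25])
y = np.array([1.0, 2.0, 3.0, 4.0])
T = np.array([[t[abs(i - j)] for j in range(4)] for i in range(4)])
print(np.allclose(toeplitz_multiply(t, y), T @ y))
```

Since λ depends only on t, it can be computed once and reused in every CG iteration.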

8 Circulant Preconditioning
Multilevel Toeplitz T → Multilevel circulant preconditioner M
• T: data representation t (d-D tensor)
• M: data representation m (d-D tensor, same size)
A 3-D example to construct m:
  1. Initialize m( :, :, : ) = t( :, :, : )
  2. s( j, :, : ) = [ (n_1 - j) * m( j, :, : ) + j * m( n_1 - j, :, : ) ] / n_1, j = 0 : n_1 - 1; then copy s to m
  3. s( :, j, : ) = [ (n_2 - j) * m( :, j, : ) + j * m( :, n_2 - j, : ) ] / n_2, j = 0 : n_2 - 1; then copy s to m
  4. s( :, :, j ) = [ (n_3 - j) * m( :, :, j ) + j * m( :, :, n_3 - j ) ] / n_3, j = 0 : n_3 - 1; then copy s to m
• In the 1-level case, this preconditioner yields superlinear convergence for CG
• In the higher-level case, superlinear convergence is lost, but performance is still good in practice
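
The construction above can be sketched for any number of levels by applying the same weighted average along each axis in turn, dividing by the size of the axis being processed. This is a sketch under assumptions, not the authors' code: function names are hypothetical, the out-of-range term at j = 0 is treated as irrelevant because its weight is zero, and the inverse is applied through multidimensional FFTs assuming no eigenvalue of M is zero.

```python
import numpy as np

def circulant_preconditioner(t):
    """Build m from t by averaging wrapped diagonals along every axis in turn."""
    m = np.array(t, dtype=float)
    for axis in range(m.ndim):
        n = m.shape[axis]
        m = np.moveaxis(m, axis, 0)                 # bring the current level to the front
        j = np.arange(n).reshape((n,) + (1,) * (m.ndim - 1))
        wrapped = np.roll(m[::-1], 1, axis=0)       # wrapped[j] = m[(n - j) % n]
        m = ((n - j) * m + j * wrapped) / n
        m = np.moveaxis(m, 0, axis)
    return m

def apply_preconditioner_inverse(m, r):
    """Solve M x = r for the multilevel circulant M generated by m."""
    return np.fft.ifftn(np.fft.fftn(r) / np.fft.fftn(m)).real

# 1-level check on the 1-D Laplacian generator.
m1 = circulant_preconditioner([2.0, -1.0, 0.0, 0.0])
print(m1)

# Applying M to the solve result should give back the right-hand side.
M1 = np.array([[m1[(i - k) % 4] for k in range(4)] for i in range(4)])
r = np.array([1.0, 2.0, 3.0, 4.0])
print(np.allclose(M1 @ apply_preconditioner_inverse(m1, r), r))
```

Because m has the same shape as t, the preconditioner solve costs one forward and one inverse FFT on the unpadded grid.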

9 Toeplitz CG

10 Parallelization 1: Toeplitz-Multiply
Naïve approach
  1. y' = Embed y
  2. z = multidimensional-FFT(y')
  3. w = λ .* z
  4. v' = multidimensional-IFFT(w)
  5. v = Truncate v'
Less-communication approach
  1. y'' = Embed y along unpartitioned dims
  2. y' = FFT(y'') along unpartitioned dims
  3. Transpose y'
  4. z' = Embed y' along unpartitioned dims
  5. z = FFT(z') along unpartitioned dims
  6. w = λ .* z
  7. w'' = IFFT(w) along unpartitioned dims
  8. w' = Truncate w'' along unpartitioned dims
  9. Transpose w'
  10. v' = IFFT(w') along unpartitioned dims
  11. v = Truncate v' along unpartitioned dims
Red lines require MPI_Alltoall (the color highlighting is not preserved in this transcript)
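
A serial emulation can show that the two approaches compute the same product. The sketch below makes several assumptions that are not in the talk: a 2-level problem that is symmetric within each level, axis 0 playing the role of the partitioned dimension, and a plain array transpose standing in for the distributed transpose. Function names are hypothetical. The point to notice is the array shape at each transpose: (n1, 2*n2) and (2*n2, n1), rather than the fully embedded (2*n1, 2*n2).

```python
import numpy as np

def embed(t):
    """Circulant embedding of the generator t along every axis."""
    for axis in range(t.ndim):
        n = t.shape[axis]
        zero = np.zeros_like(np.take(t, [0], axis=axis))
        tail = np.flip(np.take(t, np.arange(1, n), axis=axis), axis=axis)
        t = np.concatenate([t, zero, tail], axis=axis)
    return t

def pad_axis1(a):
    """Zero-pad along axis 1 (the unpartitioned dimension) to twice the length."""
    return np.concatenate([a, np.zeros_like(a)], axis=1)

def multiply_naive(lam, y):
    n1, n2 = y.shape
    y_pad = np.zeros(lam.shape)
    y_pad[:n1, :n2] = y                               # step 1: embed in all dims
    v_pad = np.fft.ifftn(lam * np.fft.fftn(y_pad))    # steps 2-4
    return v_pad[:n1, :n2].real                       # step 5

def multiply_less_communication(lam, y):
    n1, n2 = y.shape
    a = np.fft.fft(pad_axis1(y), axis=1)              # steps 1-2: shape (n1, 2*n2)
    a = a.T                                           # step 3: shape (2*n2, n1)
    z = np.fft.fft(pad_axis1(a), axis=1)              # steps 4-5: shape (2*n2, 2*n1)
    w = lam.T * z                                     # step 6 (lam in transposed layout)
    w = np.fft.ifft(w, axis=1)[:, :n1]                # steps 7-8: shape (2*n2, n1)
    w = w.T                                           # step 9: shape (n1, 2*n2)
    return np.fft.ifft(w, axis=1)[:, :n2].real        # steps 10-11

n1, n2 = 3, 4
rng = np.random.default_rng(0)
t = rng.standard_normal((n1, n2))
y = rng.standard_normal((n1, n2))
lam = np.fft.fftn(embed(t))
v_naive = multiply_naive(lam, y)
v_less = multiply_less_communication(lam, y)
print(np.allclose(v_naive, v_less))
```

The reordering works because zero-padding and truncation along one axis commute with FFTs along the other axis, so data can be transposed before it is padded and after it is truncated.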

11 Parallelization 1: Toeplitz-Multiply

12 Parallelization 1: Toeplitz-Multiply
• How to partition a d-dimensional data cube?
  – Use an array of processes
  – Use a 2-dimensional grid of processes
  – Use a d'-dimensional grid of processes (d' = 1, 2, …, d)
• The larger d' is, the more processes one can use
• The larger d' is, the smaller the total size of MPI_Alltoall (p = p_0 · p_1 ··· p_{d'-1})
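
To illustrate why a larger d' admits more processes, here is a sketch of a plain block decomposition. It rests on assumptions that the slide does not spell out: the first d' dimensions are each split into p_k nearly equal contiguous blocks, and every process must own at least one index in each partitioned dimension. All function names are hypothetical, and the sketch does not model the MPI_Alltoall volume.

```python
from math import prod

def block_range(n, p, rank):
    """Half-open index range owned by `rank` when n points are split into p blocks."""
    base, extra = divmod(n, p)
    start = rank * base + min(rank, extra)
    return start, start + base + (1 if rank < extra else 0)

def local_shape(global_shape, grid_shape, coords):
    """Shape of the sub-cube owned by the process at `coords` in the process grid.

    The grid partitions the first len(grid_shape) dimensions; the rest stay whole.
    """
    shape = list(global_shape)
    for k, (p_k, c_k) in enumerate(zip(grid_shape, coords)):
        lo, hi = block_range(global_shape[k], p_k, c_k)
        shape[k] = hi - lo
    return tuple(shape)

def max_processes(global_shape, d_prime):
    """Upper bound on p = p_0 * ... * p_{d'-1} when each p_k is at most n_k."""
    return prod(global_shape[:d_prime])

cube = (8, 8, 8)
print(local_shape(cube, (2, 4), (1, 3)))
print([max_processes(cube, d_prime) for d_prime in (1, 2, 3)])
```

Under these assumptions a 1-dimensional array of processes is capped by the size of a single dimension, while each additional partitioned dimension multiplies the cap.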

13 Parallelization 1: Toeplitz-Multiply

14 Parallelization 2: Eliminate Allreduce

15 Parallelization 2: Eliminate Allreduce
Computing v = Ty and σ = (v, y) simultaneously:
• Use the alltoall in the ifft of w to sum the inner product between z and w
• This eliminates the allreduce
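
The identity behind this trick is Parseval's theorem: with z = fft(y') and w = λ .* z on the embedded length N, the inner product (v, y) equals the sum of conj(z_k) * w_k divided by N, so it can be accumulated in the frequency domain before the inverse FFT. The sketch below checks this numerically for the 1-level symmetric case. It is a serial illustration with an assumed zero-padded embedding of length N = 2n. How the partial sums are attached to the alltoall is a detail of the talk that this transcript does not preserve, so it is not modeled here.

```python
import numpy as np

n = 6
rng = np.random.default_rng(0)
t = rng.standard_normal(n)   # first column of a symmetric Toeplitz T
y = rng.standard_normal(n)

c = np.concatenate([t, [0.0], t[:0:-1]])          # circulant embedding, length 2n
lam = np.fft.fft(c)
z = np.fft.fft(np.concatenate([y, np.zeros(n)]))  # z = fft(y')
w = lam * z

# Inner product accumulated in the frequency domain (np.vdot conjugates z).
sigma_frequency = np.vdot(z, w).real / (2 * n)

# Reference: finish the multiply, then take the inner product in the usual way.
v = np.fft.ifft(w)[:n].real
sigma_direct = v @ y
print(np.isclose(sigma_frequency, sigma_direct))
```

In a distributed run each process would hold only part of z and w, so it can form only a partial sum. The slide's point is that these partial sums can travel with the alltoall already required by the inverse FFT instead of needing a separate allreduce.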

16 Parallelization 2: Eliminate Allreduce

17 Overall Solver Performance
[Plots not preserved: strong scaling; weak scaling (2^20 grid points per core)]

18 Summary
• Multilevel Toeplitz matrices appear in, e.g., statistics
• Iterative methods have so far been the methods of choice for multilevel Toeplitz systems
• We parallelize CG:
  – Use a multidimensional grid of processes to partition multidimensional data
  – Eliminate communication in data embedding
  – Eliminate allreduce communication for computing inner products
• Largest experiment: solve a 1B-by-1B system using 1K processes in 1 minute
• Other iterative methods (e.g., GMRES) can be similarly parallelized

