1
ScicomP 10, Aug 9-13, 2004
Parallel Out-of-Core LU and QR Factorization
Brian Gunter, Center for Space Research, The University of Texas at Austin, Austin, TX
Enrique Quintana-Ortí, Depto. de Ingenieria y Ciencia de Computadores, Universidad Jaume I, Castellón, Spain
Robert van de Geijn, Department of Computer Sciences, The University of Texas at Austin, Austin, TX
Thierry Joffrain, Department of Computer Sciences, The University of Texas at Austin, Austin, TX

2
Motivation
Traditional methods use a slab approach, where entire columns of the out-of-core matrix are brought into memory.
[Figure: m×n matrix with an in-core slab of columns]

5
Motivation
While the slab approach is effective for many applications, it is inherently unscalable.
As m grows relative to n (m >> n), fewer columns fit into memory.
[Figure: tall matrix with m >> n]
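
To make the scaling concern concrete, here is a rough Python sketch that counts how many full columns of length m fit in a fixed amount of memory. The 1 GiB memory size and 8-byte (double precision) elements are assumptions chosen only for illustration; they do not come from the talk.

```python
def slab_columns(m, memory_bytes, bytes_per_element=8):
    """Number of full length-m columns that fit in memory at once."""
    return memory_bytes // (m * bytes_per_element)

MEMORY = 2**30  # assumed 1 GiB of memory, for illustration only

few_rows = slab_columns(10**6, MEMORY)    # slab width when m = 1e6
many_rows = slab_columns(10**8, MEMORY)   # slab width when m = 1e8

print(few_rows, many_rows)
```

As m grows by a factor of 100, the slab shrinks from over a hundred columns to a single column, at which point the slab approach can no longer use blocked matrix-matrix operations effectively.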

6
Out-of-Core QR Factorization
Given the m×n matrix A, we wish to compute the factorization A = QR
- Q is an orthogonal matrix
- R is upper triangular
Compact WY Representation: Q = I + Y T Y^T
- Y is an m×r collection of Householder vectors, normalized to be unit lower triangular (trapezoidal)
- T is r×r upper triangular
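
One way to see why T comes out upper triangular is to accumulate the Householder reflectors one at a time. The derivation below is a sketch under an assumed convention that the slide does not state: each reflector is written H_j = I - τ_j v_j v_j^T, where v_j is the j-th column of Y and τ_j is its scalar factor.

```latex
% Assumed convention (not stated on the slide): H_j = I - \tau_j v_j v_j^T
\begin{align*}
Q_1 &= H_1 = I + Y_1 T_1 Y_1^T, \qquad Y_1 = v_1,\quad T_1 = -\tau_1, \\
Q_{j+1} &= Q_j H_{j+1}
         = \left(I + Y_j T_j Y_j^T\right)\left(I - \tau_{j+1} v_{j+1} v_{j+1}^T\right) \\
        &= I + Y_j T_j Y_j^T - \tau_{j+1} v_{j+1} v_{j+1}^T
             - \tau_{j+1} Y_j T_j Y_j^T v_{j+1} v_{j+1}^T \\
        &= I + Y_{j+1} T_{j+1} Y_{j+1}^T, \\
Y_{j+1} &= \begin{pmatrix} Y_j & v_{j+1} \end{pmatrix}, \qquad
T_{j+1} = \begin{pmatrix} T_j & -\tau_{j+1} T_j Y_j^T v_{j+1} \\ 0 & -\tau_{j+1} \end{pmatrix}.
\end{align*}
```

Each step appends one column to Y and one row and column to T, so T stays upper triangular, in agreement with the slide.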

7
QR Factorization: Out-of-Core Implementation
Step 1: Begin with an unfactored matrix which resides on disk.
[Figure: matrix diagram; legend distinguishes data stored on disk from data in memory]

8
QR Factorization: Out-of-Core Implementation
Step 2: Divide the matrix into a mesh of t×t tiles, where each tile is stored as a separate file.
[Figure: matrix divided into t×t tiles; legend distinguishes data stored on disk from data in memory]
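
The tile-per-file layout can be sketched in a few lines of Python. This is only an illustration: the file naming scheme, the use of `pickle`, and the helper names `write_tiles` and `read_tile` are hypothetical and are not how POOCLAPACK stores its data.

```python
import os
import pickle
import tempfile

def write_tiles(A, t, directory):
    """Split matrix A (a list of rows) into t-by-t tiles, one file per tile."""
    m, n = len(A), len(A[0])
    paths = {}
    for bi, i0 in enumerate(range(0, m, t)):
        for bj, j0 in enumerate(range(0, n, t)):
            tile = [row[j0:j0 + t] for row in A[i0:i0 + t]]
            # Hypothetical naming scheme: tile_<block row>_<block column>.bin
            path = os.path.join(directory, f"tile_{bi}_{bj}.bin")
            with open(path, "wb") as f:
                pickle.dump(tile, f)
            paths[(bi, bj)] = path
    return paths

def read_tile(paths, bi, bj):
    """Bring a single tile into memory."""
    with open(paths[(bi, bj)], "rb") as f:
        return pickle.load(f)

# A 4x4 example matrix split into 2x2 tiles.
A = [[float(4 * i + j) for j in range(4)] for i in range(4)]
with tempfile.TemporaryDirectory() as d:
    paths = write_tiles(A, 2, d)
    tile_count = len(paths)
    lower_right = read_tile(paths, 1, 1)

print(tile_count, lower_right)
```

Only the tiles that an operation needs are read, so the in-memory footprint depends on t rather than on the full matrix dimensions.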

9
QR Factorization: Out-of-Core Implementation
Step 3: Read in first tiles and factor, saving T matrices and overwriting lower tile with Y.
[Figure: tiled matrix with labels T_i and Y_i; legend distinguishes data stored on disk from data in memory]

13
QR Factorization: Out-of-Core Implementation
Step 4: Read in remaining tiles in row and apply Q = I + Y_i T_i Y_i^T, reading Y_i in one panel at a time.
[Figure: tiled matrix with labels T_i and Y_i; legend distinguishes data stored on disk from data in memory]
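
A minimal Python sketch of the panel-at-a-time application follows. It assumes that each panel of Y_i comes with its own small T matrix and that the panels are applied one after another as B ← B + Y (T (Y^T B)). The slide does not spell out the panel layout, the order, or whether Q or its transpose is applied, so treat those details as assumptions. The helper names are hypothetical.

```python
def matmul(A, B):
    """Dense matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def apply_panels(B, panels):
    """Apply (I + Y T Y^T) to tile B for each (Y, T) panel in turn.

    `panels` can be any iterable, e.g. a generator that reads one
    panel from disk at a time, so only one panel is in memory.
    """
    for Y, T in panels:
        W = matmul(transpose(Y), B)          # W = Y^T B
        B = add(B, matmul(Y, matmul(T, W)))  # B = B + Y (T W)
    return B

# One Householder reflector H = I - v v^T with v = (1, 1), written as
# I + Y T Y^T with Y = v and T = -1. H is its own inverse.
Y = [[1.0], [1.0]]
T = [[-1.0]]
B = [[1.0, 2.0], [3.0, 4.0]]

once = apply_panels(B, [(Y, T)])
twice = apply_panels(B, [(Y, T), (Y, T)])
print(once, twice)
```

The full Q is never formed; each panel costs only matrix-matrix products with the tile being updated.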

21
QR Factorization: Out-of-Core Implementation
Step 5: Factor next tile in first column using QR update algorithm.
[Figure: tiled matrix with labels Y_i and T_i; legend distinguishes data stored on disk from data in memory]
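
The talk does not spell out the QR update algorithm. One common formulation, assumed here, refactors the stacked matrix [R; B], where R is the upper triangular factor already computed and B is the next tile in the column. The Python sketch below does this with Householder reflectors that touch only the diagonal entry of R and the rows of B. It returns only the updated R and omits the Y and T bookkeeping; `qr_update` and `gram` are hypothetical helper names.

```python
import math

def qr_update(R, B):
    """Return upper triangular R_new with R_new^T R_new = R^T R + B^T B."""
    n = len(R)
    R = [row[:] for row in R]
    B = [row[:] for row in B]
    for j in range(n):
        x0 = R[j][j]
        col = [row[j] for row in B]          # column j of the tile
        norm = math.sqrt(x0 * x0 + sum(c * c for c in col))
        if norm == 0.0:
            continue
        alpha = -norm if x0 >= 0 else norm   # sign chosen to avoid cancellation
        v0 = x0 - alpha                      # reflector is v = [v0; col]
        vtv = v0 * v0 + sum(c * c for c in col)
        for k in range(j, n):
            dot = v0 * R[j][k] + sum(col[i] * B[i][k] for i in range(len(B)))
            s = 2.0 * dot / vtv
            R[j][k] -= s * v0
            for i in range(len(B)):
                B[i][k] -= s * col[i]
    return R

def gram(M):
    """Compute M^T M."""
    cols = list(zip(*M))
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]

R = [[3.0, 1.0], [0.0, 2.0]]
B = [[4.0, 0.0]]
R_new = qr_update(R, B)
G = gram(R_new)   # should match R^T R + B^T B = [[25, 3], [3, 5]]
print(R_new, G)
```

Because R is already triangular, the reflectors leave the zeros below its diagonal untouched, which is what makes the update cheaper than refactoring from scratch.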

25
QR Factorization: Out-of-Core Implementation
Step 6: Apply transformations to remaining tiles in row.
[Figure: tiled matrix with labels T_i and Y_i; legend distinguishes data stored on disk from data in memory]

30
QR Factorization: Out-of-Core Implementation
Step 7: Repeat Steps 5 and 6 for any remaining rows of tiles.
[Figure: tiled matrix with labels Y_i and T_i; legend distinguishes data stored on disk from data in memory]

33
QR Factorization: Out-of-Core Implementation
Step 8: Repeat Steps 1-7 on lower quadrant. Continue until entire matrix has been factored.
[Figure: tiled matrix; legend distinguishes data stored on disk from data in memory]
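
Read together, the steps describe a loop over the diagonal of the tile mesh. The Python sketch below lists the tile operations in one plausible order for an M×N mesh. It is an interpretation of the steps, not the authors' code, and the operation names are hypothetical labels.

```python
def tile_qr_schedule(M, N):
    """List tile operations for an M-by-N mesh of tiles, in order."""
    ops = []
    for k in range(min(M, N)):
        ops.append(("factor", k, k))                # factor the diagonal tile
        for j in range(k + 1, N):
            ops.append(("apply", k, j))             # update tiles to its right
        for i in range(k + 1, M):
            ops.append(("factor_update", i, k))     # QR update with tile below
            for j in range(k + 1, N):
                ops.append(("apply_update", i, j))  # update that row of tiles
        # The next value of k repeats the process on the lower quadrant.
    return ops

schedule = tile_qr_schedule(2, 2)
for op in schedule:
    print(op)
```

Every operation touches only a few tiles at a time, which is why the memory requirement is set by the tile size and not by the matrix size.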

34
Out-of-Core LU Factorization
Given the m×n matrix A, we wish to compute the factorization PA = LU
- P is a permutation matrix
- L is lower trapezoidal
- U is n×n upper triangular
Implementation analogous to out-of-core QR factorization

35
LU Factorization: Out-of-Core Implementation
Step 1: Factor first tile, saving permutation matrix.
[Figure: tiled matrix with labels P_i, L_i, and U_i; legend distinguishes data stored on disk from data in memory]
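
As a small in-memory illustration of factoring one tile, here is a textbook LU factorization with partial pivoting in Python. It stores the permutation as a list of row indices, which is an assumed representation; the talk says only that the permutation is saved. `lu_partial_pivot` is a hypothetical helper name.

```python
def lu_partial_pivot(A):
    """Factor a square tile so that row i of P*A is row perm[i] of A.

    Returns (LU, perm): L (unit lower triangular, diagonal implied)
    and U packed into one matrix, plus the row permutation.
    """
    n = len(A)
    LU = [row[:] for row in A]
    perm = list(range(n))
    for k in range(n):
        # Pick the largest entry in column k as the pivot.
        p = max(range(k, n), key=lambda i: abs(LU[i][k]))
        if LU[p][k] == 0.0:
            raise ValueError("matrix is singular")
        if p != k:
            LU[k], LU[p] = LU[p], LU[k]
            perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            LU[i][k] /= LU[k][k]                 # multiplier, stored in L
            for j in range(k + 1, n):
                LU[i][j] -= LU[i][k] * LU[k][j]  # update trailing rows
    return LU, perm

A = [[2.0, 1.0], [4.0, 3.0]]
LU, perm = lu_partial_pivot(A)
n = len(A)
L = [[1.0 if i == j else (LU[i][j] if j < i else 0.0) for j in range(n)] for i in range(n)]
U = [[LU[i][j] if j >= i else 0.0 for j in range(n)] for i in range(n)]
product = [[sum(L[i][k] * U[k][j] for k in range(n)) for j in range(n)] for i in range(n)]
permuted = [A[p] for p in perm]   # rows of P*A
print(perm, product, permuted)
```

The product L·U reproduces the row-permuted tile, which is the relation PA = LU restricted to a single tile.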

36
LU Factorization: Out-of-Core Implementation
Step 2: Update remaining tiles in row using panels of L and the saved permutation matrices.
[Figure: tiled matrix with labels P_i, L_i, and U_i; legend distinguishes data stored on disk from data in memory]
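
The row update can be sketched as applying the saved row swaps to a tile B and then solving with the unit lower triangular factor, that is, computing L^{-1} P B. The Python below assumes the permutation is stored as a list of row indices and that L fits in memory as a single piece; the actual implementation reads L in panels, which this sketch does not model. `apply_row_update` is a hypothetical helper name.

```python
def apply_row_update(L, perm, B):
    """Return L^{-1} (P B) for unit lower triangular L.

    Row i of P*B is row perm[i] of B.
    """
    X = [B[p][:] for p in perm]        # apply the saved row swaps
    n = len(L)
    for i in range(n):                 # forward substitution, row by row
        for k in range(i):
            for j in range(len(X[i])):
                X[i][j] -= L[i][k] * X[k][j]
    return X

# Example factors for a 2x2 tile (values chosen for illustration).
L = [[1.0, 0.0], [0.5, 1.0]]
perm = [1, 0]
B = [[1.0], [2.0]]

updated = apply_row_update(L, perm, B)
print(updated)
```

The result overwrites the tile and becomes part of the U factor for that row of tiles.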

37
LU Factorization: Out-of-Core Implementation
Step 3: Factor next tile in first column using LU update algorithm.
[Figure: tiled matrix with labels P_i, L_i, and U_i; legend distinguishes data stored on disk from data in memory]

38
LU Factorization: Out-of-Core Implementation
Step 4: Update remaining tiles in row using panels of L and stored permutation matrices.
[Figure: tiled matrix with labels L_i, U_i, and P_i; legend distinguishes data stored on disk from data in memory]

39
Development Environment
- Parallel Linear Algebra Package (PLAPACK)
  - Optimized parallel routines (FORTRAN and C interfaces)
  - View-based infrastructure
  - Uses standard MPI and BLAS libraries
- Parallel Out-of-Core Linear Algebra Package (POOCLAPACK)
  - Out-of-core extension to PLAPACK
  - Handles the complexity of the I/O operations (i.e., hidden from the user)
  - Uses standard read/write functions for portability

40
Performance of Parallel OOC QR
IBM P690: 32 GB of memory, theoretical peak (T.P.) of 5.2 Gflops, DGEMM of [value missing] Gflops
[Figure: performance plot]

41
Performance for Sequential OOC LU
[Figure: performance plot]

42
Earth Science Application
Gravity Recovery And Climate Experiment (GRACE)
A collaborative effort between
- The University of Texas Center for Space Research (CSR)
- The Jet Propulsion Laboratory (JPL)
- GeoForschungsZentrum (GFZ)
- Deutschen Zentrum für Luft- und Raumfahrt (DLR)
- National Aeronautics and Space Administration (NASA)

43
Earth Science Application
- Goal was to compute a rigorous 360×360 gravity model
  - No approximation techniques
  - Translates to roughly 100 km² resolution
  - Involves the least squares estimation of ~130,000 parameters
  - Requires the combination of hundreds of millions of observations
    - surface gravity data (land) – ½ TB
    - altimetry-based mean sea surface data (ocean)
    - GRACE data (satellite)
- Using new parallel OOC QR algorithm
  - A 360×360 field was generated, complete with full covariance
  - Largest rigorous gravity field model ever created
  - Used a single IBM P690 node
    - OOC QR required only 32 GB
    - To do in-core would require 165 GB of memory
  - Required ~6 days of wall clock time to compute (2326 CPU hours)
    - A single processor machine with sufficient memory would require 3.2 months
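
Two of the figures above can be checked with quick arithmetic. The Python sketch below assumes the parameter count is the number of spherical harmonic coefficients for all degrees and orders from 0 to 360, and that a month is 30 days. Both are assumptions made for this check, not statements from the talk.

```python
max_degree = 360
# Assumed: one coefficient per (degree, order) pair, cosine and sine terms,
# which totals (max_degree + 1)^2 coefficients.
parameters = (max_degree + 1) ** 2

cpu_hours = 2326
days_single_cpu = cpu_hours / 24
months_single_cpu = days_single_cpu / 30   # assumed 30-day month

print(parameters, round(days_single_cpu, 1), round(months_single_cpu, 1))
```

Under these assumptions the count agrees with the quoted ~130,000 parameters, and 2326 CPU hours on one processor comes to about 3.2 months.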

44
Conclusion
- Tile-based out-of-core algorithms provide scalability
  - Size of the tile is based on the memory of the machine (i.e., fixed) and is independent of the problem size
- Algorithms achieve excellent performance
  - The large tile sizes mean the algorithm spends nearly all of its time in large, highly efficient matrix-matrix operations
  - This helps to offset the I/O cost associated with moving the tiles to and from disk
- Use of PLAPACK & POOCLAPACK greatly simplified the implementation
  - Reduces complexity of code
  - Makes code portable
- Has already proven valuable to Earth science applications

45
Conclusion
- Broad spectrum of applications
  - Large scale problems
  - Small clusters
  - Embedded systems
  - Other small memory machines
- Tile-based OOC approach can be extended to other dense linear algebra operations
  - Cholesky, matrix inverse, BLAS-3, etc.
- Goal is to provide a full suite of OOC utilities

46
For More Information
Visit the PLAPACK website: www.cs.utexas.edu/users/plapack
Visit the GRACE website:
