1 PDCS 2007, November 20, 2007
Accelerating the Complex Hessenberg QR Algorithm with the CSX600 Floating-Point Coprocessor
Yusaku Yamamoto¹, Takafumi Miyata¹, Yoshimasa Nakamura²
¹ Nagoya University, ² Kyoto University

2 Introduction
Background
– Advent of many-core floating-point accelerators as a means to speed up scientific computations
Objective of our study
– Apply these accelerators to the eigenvalue problem for nonsymmetric matrices.
– Identify potential problems.
– Modify existing algorithms, or develop new ones, where necessary.

3 Outline of the talk
Introduction
Many-core floating-point accelerators and their performance characteristics
The nonsymmetric eigenvalue problem
Proposed algorithm
– Modification of the small-bulge multishift QR algorithm for floating-point accelerators
Performance evaluation
Conclusion

4 Many-core floating-point accelerators
These chips integrate hundreds of floating-point cores and offer very high peak GFLOPS.
ClearSpeed CSX600
– 1 + 96 processor cores
– 48 GFLOPS (double precision)
Intel Larrabee (under development)
– 80 processor cores
– 1 TFLOPS (single precision)
GRAPE-DR (Tokyo Univ.)
– 512 processor cores
– 512 GFLOPS (single precision)
– 256 GFLOPS (double precision)

5 Architecture of the CSX600 accelerator
The CSX600 chip
– 1 main processor
– 96 floating-point processors (64-bit, 2 flops/cycle, 128-byte register file, 6 KB SRAM each)
– Operates at 250 MHz
– Peak performance: 48 GFLOPS
ClearSpeed Advance board
– Two CSX600 processors
– 1 GB DRAM
– Connected to the PC via the PCI-X bus
– Peak performance: 96 GFLOPS

6 Problem with the data transfer speed
Peak floating-point performance: very high
– 48 GFLOPS / chip
– 96 GFLOPS / board
Data transfer speed: relatively low
– 3.2 GB/s between the chip and on-board memory
– 1.066 GB/s between the board and main memory (PCI-X)
Machine balance (byte/flop)
– 0.066 byte/flop between the chip and on-board memory
– 0.011 byte/flop between the board and main memory
[Diagram: CPU ↔ PCI-X (1.066 GB/s) ↔ ClearSpeed Advance board; on board, CSX600 ↔ 3.2 GB/s ↔ DRAM]
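The balance figures above follow directly from the link speeds and the peak rates; a quick illustrative calculation (not library code):

```python
# Byte/flop available from each link when the cores run at peak speed.
chip_peak_gflops = 48.0      # one CSX600 chip
board_peak_gflops = 96.0     # ClearSpeed Advance board (two chips)
chip_mem_gb_s = 3.2          # chip <-> on-board DRAM
host_link_gb_s = 1.066       # board <-> main memory over PCI-X

chip_balance = chip_mem_gb_s / chip_peak_gflops    # ~0.066 byte/flop
host_balance = host_link_gb_s / board_peak_gflops  # ~0.011 byte/flop
assert abs(chip_balance - 0.066) < 1e-3
assert abs(host_balance - 0.011) < 1e-3
```

Any operation whose byte/flop demand exceeds these ratios is transfer-bound rather than compute-bound.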

7 Byte/flop of typical linear algebraic operations

Function                       Operation         Data transferred   Flop count   Byte/flop
Dot product                    α := xᵀy          2n                 2n           8
AXPY                           x := x + αy       3n                 2n           12
Matrix-vector multiplication   y := Ax           n² + 2n            2n²          4
Rank-1 update                  A := A + xyᵀ      2n² + 2n           2n²          8
Matrix multiplication          C := C + AB       4n²                2n³          16/n

Operations other than matrix multiplication cannot exploit the performance of the CSX600 due to the limited data transfer speed. Matrix multiplication (MatMult) can be executed efficiently, but only if the size is very large (n > 1500).
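The "n > 1500" threshold can be seen with a back-of-envelope roofline model combining the MatMult intensity of 16/n byte/flop with the board's link speed (an illustrative sketch, assuming the figures quoted above):

```python
# Attainable MatMult rate on the ClearSpeed board, capped either by the
# peak compute rate or by the PCI-X link bandwidth.
def attainable_gflops(n, peak=96.0, bw_gb_s=1.066):
    byte_per_flop = 16.0 / n              # C := C + AB, double precision
    return min(peak, bw_gb_s / byte_per_flop)

assert attainable_gflops(500) < 40.0      # transfer-bound: ~33 GFLOPS
assert attainable_gflops(1500) == 96.0    # peak reached once n ~ 1500
```

The crossover sits where 1.066 · n / 16 first exceeds 96, i.e. around n ≈ 1440, which matches the n > 1500 rule of thumb on the slide.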

8 Performance of MatMult on the ClearSpeed board
[Graph: GFLOPS vs. matrix dimensions M, N, K]
The library transfers the input data from the main memory to the board, performs the computation, and returns the result to the main memory.
M, N, and K must be larger than 1500 to get a substantial performance gain.

9 Problems to be solved
– Is it possible to reorganize the algorithm so that most of the computation is done as matrix multiplication?
– What is the overhead of using very large matrix multiplications?
– How can we reduce that overhead?
We consider these questions for the nonsymmetric eigenvalue problem.

10 The nonsymmetric eigenvalue problem
The problem
– Eigenvalue problem Ax = λx, where A is a dense complex nonsymmetric matrix
– Compute all the eigenvalues and eigenvectors
Applications
– Magnetohydrodynamics
– Structural dynamics
– Quantum chemistry
– Fluid dynamics
Cf. Z. Bai and J. Demmel: A test matrix collection for non-Hermitian eigenvalue problems.

11 Algorithm for the nonsymmetric eigenproblem
The standard algorithm: similarity transformation to an upper triangular matrix, whose diagonal elements are the eigenvalues.
– Householder method: dense matrix → Hessenberg matrix; finite number of steps; work: (10/3)n³
– QR algorithm: Hessenberg matrix → upper triangular matrix; iterative; work: 10n³ (empirically)
We focus on speeding up the QR algorithm (the target of speedup).
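The two-stage path above can be sketched with SciPy: a Householder reduction to Hessenberg form followed by plain unshifted QR iteration (the real algorithm adds shifts to accelerate convergence; this is only a minimal sketch):

```python
import numpy as np
from scipy.linalg import hessenberg

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6)) + 1j * rng.standard_normal((6, 6))

# Stage 1 (finite): A = Q H Q^H with H upper Hessenberg
H, Q = hessenberg(A, calc_q=True)
assert np.allclose(Q @ H @ Q.conj().T, A)

# Stage 2 (iterative): each QR step H -> R Q is a unitary similarity,
# so the spectrum is preserved while H drifts toward triangular form
for _ in range(300):
    Qk, Rk = np.linalg.qr(H)
    H = Rk @ Qk

ref = np.sort_complex(np.linalg.eigvals(A))
got = np.sort_complex(np.linalg.eigvals(H))
assert np.allclose(ref, got)
```

The flop counts on the slide reflect exactly this split: the finite Hessenberg reduction costs (10/3)n³, while the iterative QR phase dominates at roughly 10n³.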

12 The small-bulge multishift QR algorithm
(Matrices: A_l Hessenberg, Q_l unitary, R_l upper triangular)
Algorithm
– Shifts s_1, …, s_m: the eigenvalues of the trailing m × m submatrix of A_l
– Perform m steps of the QR algorithm at once.
Computational procedure for one iteration (the figure shows the case m = 4)
– Introduce m/2 bulges.
– Transform the matrix back to Hessenberg form by chasing the m/2 bulges.
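The shift choice described above is easy to state in code (a sketch; the function name is ours, not from the talk):

```python
import numpy as np

def multishift_shifts(H, m):
    """The m shifts: eigenvalues of the trailing m-by-m block of H."""
    return np.linalg.eigvals(H[-m:, -m:])

# On an upper triangular matrix the trailing block is itself triangular,
# so the shifts are simply its last m diagonal entries.
T = np.triu(np.arange(1.0, 26.0).reshape(5, 5))   # diagonal 1, 7, 13, 19, 25
s = np.sort(multishift_shifts(T, 2).real)
assert np.allclose(s, [19.0, 25.0])
```

In the actual iteration these m shifts are consumed two at a time, each pair seeding one of the m/2 bulges that are then chased down the matrix.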

17 Blocking of bulge-chasing operations (use of level-3 BLAS)
Division of the updating operations
– Chase the bulges by only k rows at a time.
– Divide the update operations into two parts:
First, update the diagonal block sequentially, accumulating the Householder transformations used in the update into a unitary matrix.
Next, update the off-diagonal blocks by multiplying them by that unitary matrix (MatMult, level-3 BLAS).
[Figure: 3×3 bulges chased k rows at a time; diagonal update sequential, off-diagonal update by MatMult]
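The level-3 pattern above can be sketched as follows (illustrative sizes, with a random unitary standing in for the accumulated product of Householder reflectors; none of this is the paper's code):

```python
import numpy as np

rng = np.random.default_rng(1)
n, lo, hi = 8, 2, 6                       # k-by-k window = rows/cols lo..hi-1
A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))

# Sequential part: accumulate the window's transformations into a unitary U
U, _ = np.linalg.qr(rng.standard_normal((hi - lo, hi - lo))
                    + 1j * rng.standard_normal((hi - lo, hi - lo)))

# Level-3 part: apply U to the large off-diagonal blocks with plain GEMMs
B = A.copy()
B[lo:hi, :] = U.conj().T @ B[lo:hi, :]    # row block, one GEMM
B[:, lo:hi] = B[:, lo:hi] @ U             # column block, one GEMM

# Same result as the full similarity transform with U embedded in identity
W = np.eye(n, dtype=complex)
W[lo:hi, lo:hi] = U
assert np.allclose(B, W.conj().T @ A @ W)
```

The point is that the sequential work is confined to the small k × k window, while the bulk of the flops lands in the two matrix multiplications, which is exactly what the CSX600 needs.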

18 Performance on the CSX600
Random matrix (n = 6000); compute all the eigenvalues and eigenvectors with the small-bulge multishift QR algorithm.
Computational environment
– Xeon 3.2 GHz, 8 GB memory
– ClearSpeed Advance board (CSX600 × 2)
As the number of shifts increases, the MatMult part decreases but the other parts increase and become the bottleneck.
[Chart: execution time (sec), Xeon vs. Xeon + CSX600, for number of shifts 100, 120, 160, 200, 240 (MatMult sizes 600, 720, 960, 1200, 1440)]
Parts other than MatMult need to be sped up!

19 Modification of the algorithm (1): reformulation as a recursive algorithm
Chase (m/2)/q bulges by k/q rows at a time (example: recursion level d = 1).
– Diagonal update: sequential
– Off-diagonal updates: MatMult
[Figure: the k × k diagonal block is itself subdivided into k/q-row chases]

20 Modification of the algorithm (2): reducing the computational work
Deflation
– The trailing submatrix becomes isolated (its coupling subdiagonal entry is negligible).
Eigensolution of the isolated submatrix
– Apply the double-shift QR algorithm.
– The size of the submatrix increases with m, so this sequential part was a bottleneck.
Division of the update operations
– Update the diagonal block (until convergence), accumulating the Householder transformations used in the update into a unitary matrix (sequential).
– Update the off-diagonal blocks only once, at the end, by multiplying them by the unitary matrix (MatMult).
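Deflation detection of the kind referred to above is standard practice and can be sketched as follows (the function name and tolerance are ours, not from the talk):

```python
import numpy as np

def deflation_points(H, tol=1e-12):
    """Indices i where H[i+1, i] is negligible relative to its neighboring
    diagonal entries, so H splits into independent diagonal blocks (the
    'isolated submatrix' of the slide)."""
    sub = np.abs(np.diag(H, -1))
    scale = np.abs(np.diag(H))[:-1] + np.abs(np.diag(H))[1:]
    return [i for i in range(len(sub)) if sub[i] <= tol * scale[i]]

H = np.triu(np.ones((5, 5)), -1)   # Hessenberg with full subdiagonal
H[3, 2] = 1e-16                    # coupling entry has become negligible
assert deflation_points(H) == [2]  # H splits after row/column 2
```

Once a split point is found, the trailing block can be solved on its own, which is where the slide's double-shift QR step comes in.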

21 Numerical experiments
Test problems
– Random matrices with elements in [0, 1], reduced to Hessenberg form by Householder's method
– Compute all the eigenvalues and eigenvectors
Computational environment
– Xeon 3.2 GHz, 8 GB memory; Fortran 77, double precision
– ClearSpeed Advance board (CSX600 × 2)
– Matrix multiplication: ClearSpeed's library for large MatMult, Intel Math Kernel Library for small MatMult

22 Numerical experiments (continued)
Comparison
– Existing algorithm (small-bulge multishift QR method): MatMult part is the off-diagonal update only.
– Our algorithm (multishift QR + recursion): MatMult parts are the off-diagonal update, the diagonal update, and the eigensolution of the isolated submatrix.
Parameter values
– Number of shifts m: chosen optimally for each case
– Row length of bulge chasing: k = (3/2)m
– Level of recursion: d = 1
– Number of subdivisions: q = m / 40

23 Effect of our modifications
Our algorithm is 1.4 times faster than the original (the CSX600 is used in all cases).
– Diagonal update: 1.5 times faster
– Eigensolution of the isolated submatrix: 10 times faster
[Chart: execution time (sec), original vs. ours, for (n = 3000, m = 160, q = 4) and (n = 6000, m = 200, q = 5)]

24 Effect of using the CSX600
By combining the CSX600 with our algorithm:
– 3.5 times speedup when n = 6000
– 3.8 times speedup when n = 12000
[Chart: execution time (sec), Xeon vs. Xeon + CSX600, for n = 6000 (m = 100, q = 5 and m = 200, q = 5) and n = 12000 (m = 100, q = 5 and m = 240, q = 6)]

25 Conclusion
We proposed an approach to accelerating the solution of the nonsymmetric eigenproblem using a floating-point accelerator.
We used the small-bulge multishift QR algorithm, which can exploit matrix multiplication efficiently, as a basis.
By reformulating part of the algorithm recursively, we reduced the computational time spent in the non-blocked part. This enables us to use a large block size (number of shifts) with small overhead and to exploit the performance of the floating-point accelerator.
When solving an eigenproblem of order 12,000, our algorithm is 1.4 times faster than the original small-bulge multishift QR algorithm, and we obtained a 3.8 times speedup with the CSX600 over the 3.2 GHz Xeon.

