PDCS 2007 November 20, 2007 Accelerating the Complex Hessenberg QR Algorithm with the CSX600 Floating-Point Coprocessor Yusaku Yamamoto, Takafumi Miyata, Yoshimasa Nakamura.


PDCS 2007 November 20, 2007 Accelerating the Complex Hessenberg QR Algorithm with the CSX600 Floating-Point Coprocessor Yusaku Yamamoto 1 Takafumi Miyata 1 Yoshimasa Nakamura 2 1 Nagoya University 2 Kyoto University

2 Introduction Background – Advent of many-core floating-point accelerators as a means of speeding up scientific computations Objective of our study – Apply these accelerators to the eigenvalue problem for nonsymmetric matrices. – Clarify potential problems. – Modify existing algorithms, or develop new ones, where necessary.

3 Outline of the talk Introduction Many-core floating-point accelerators and their performance characteristics The nonsymmetric eigenvalue problem Proposed algorithm – Modification of the small-bulge multishift QR algorithm for floating-point accelerators Performance evaluation Conclusion

4 Many-core floating-point accelerators ClearSpeed CSX600 – 1+96 processor cores – 48GFLOPS (double precision) Intel Larrabee (under development) – 80 processor cores – 1TFLOPS (single precision) GRAPE-DR (Tokyo Univ.) – 512 processor cores – 512GFLOPS (single precision) – 256GFLOPS (double precision) These chips integrate hundreds of floating-point cores and attain very high peak GFLOPS values.

5 Architecture of the CSX600 accelerator The CSX600 chip – 1 main processor – 96 floating-point processors (64-bit, 2 flops/cycle, 128-byte register file, 6KB SRAM each) – Operates at 250MHz – Peak performance: 48GFLOPS ClearSpeed Advance board – Two CSX600 processors – 1GB DRAM – Connected to the host PC via the PCI-X bus – Peak performance: 96GFLOPS

6 Problem with the data transfer speed Peak floating-point performance: very high – 48GFLOPS / chip – 96GFLOPS / board Data transfer speed: relatively low – 3.2GB/s between the chip and on-board memory – 1.066GB/s between the board and main memory Byte/flop – 0.066 Byte/flop between the chip and on-board memory – 0.011 Byte/flop between the board and main memory [Diagram: PC ↔ (PCI-X, 1.066GB/s) ↔ Advance board DRAM ↔ (3.2GB/s) ↔ CSX600 chip]

7 Byte/flop of typical linear algebraic operations

Function | Operation | Amount of data transfer | Flop count | Byte/flop
Dot product | α := xᵀy | 2n | 2n | 8
AXPY | x := x + αy | 3n | 2n | 12
Matrix-vector multiplication | y := Ax | n² + 2n | 2n² | 4
Rank-1 update | A := A + xyᵀ | 2n² + 2n | 2n² | 8
Matrix multiplication (MatMult) | C := C + AB | 4n² | 2n³ | 16/n

Operations other than matrix multiplication cannot exploit the performance of the CSX600 due to the limited data transfer speed. Matrix multiplication (MatMult) can be executed efficiently, but only if the size is very large (n > 1500).
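The consequence of these Byte/flop figures can be checked with a small roofline-style estimate: attainable performance is capped by min(peak, bandwidth / bytes-per-flop). A sketch using the board numbers from the slides above (constant and function names are ours):

```python
PEAK_GFLOPS = 96.0   # two CSX600 chips per Advance board
HOST_BW_GBS = 1.066  # board <-> main memory over PCI-X

def attainable_gflops(bytes_per_flop):
    """Bandwidth-limited performance for an operation with the given Byte/flop."""
    if bytes_per_flop == 0.0:
        return PEAK_GFLOPS
    return min(PEAK_GFLOPS, HOST_BW_GBS / bytes_per_flop)

# Level-1/2 BLAS from the table are hopelessly bandwidth-bound:
print(attainable_gflops(8.0))   # dot product: ~0.13 GFLOPS out of 96
print(attainable_gflops(4.0))   # matrix-vector product: ~0.27 GFLOPS

# MatMult needs 16/n Byte/flop, so it reaches peak once
# 16/n <= HOST_BW_GBS / PEAK_GFLOPS, i.e. n >= 16 * peak / bandwidth:
print(round(16 * PEAK_GFLOPS / HOST_BW_GBS))  # ~1441
```

The computed threshold (~1441) is of the same order as the n > 1500 quoted on the slide; the gap is plausibly due to per-call transfer overheads not captured by a pure bandwidth model.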

8 Performance of MatMult on the ClearSpeed board [Graph: GFLOPS vs. matrix dimensions M, N, K] The library transfers the input data from the main memory to the board, performs the computation and returns the result to the main memory. M, N and K must be larger than 1500 to get a substantial performance gain.

9 Problems to be solved – Is it possible to reorganize the algorithm so that most of the computations are done with matrix multiplications? – What is the overhead of using very large matrix multiplications? – How can we reduce the overhead? We consider these problems for the nonsymmetric eigenvalue problem.

10 The nonsymmetric eigenvalue problem The problem – Eigenvalue problem for A: dense complex nonsymmetric matrix – Compute all the eigenvalues / eigenvectors Applications – Magnetohydrodynamics – Structural dynamics – Quantum chemistry – Fluid dynamics Cf. Z. Bai and J. Demmel: A test matrix collection for non-Hermitian eigenvalue problems.

11 Algorithm for the nonsymmetric eigenproblem The standard algorithm – Similarity transformation to an upper triangular matrix, whose diagonal elements are the eigenvalues: Dense matrix → (Householder method, finite # of steps, work: (10/3)n³) → Hessenberg matrix → (QR algorithm, iterative, work: 10n³ empirically) → Upper triangular matrix – We focus on speeding up the QR algorithm (target of speedup).
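The two-stage pipeline on this slide can be sketched in NumPy: Householder reduction to complex Hessenberg form, followed by shifted QR iteration with deflation. This is a minimal textbook single-shift (Wilkinson) version, not the multishift variant developed in this talk; all names are ours.

```python
import numpy as np

def hessenberg(A):
    """Reduce A to upper Hessenberg form by Householder reflections."""
    H = np.array(A, dtype=complex)
    n = H.shape[0]
    for k in range(n - 2):
        x = H[k+1:, k]
        v = x.copy()
        v[0] += np.exp(1j * np.angle(x[0])) * np.linalg.norm(x)
        nv = np.linalg.norm(v)
        if nv == 0.0:
            continue
        v /= nv
        # Similarity transform with the reflector P = I - 2 v v^H.
        H[k+1:, k:] -= 2.0 * np.outer(v, v.conj() @ H[k+1:, k:])
        H[:, k+1:] -= 2.0 * np.outer(H[:, k+1:] @ v, v.conj())
    return H

def wilkinson_shift(H, n):
    """Eigenvalue of the trailing 2x2 block closest to H[n-1, n-1]."""
    a, b, c, d = H[n-2, n-2], H[n-2, n-1], H[n-1, n-2], H[n-1, n-1]
    tr, det = a + d, a * d - b * c
    disc = np.sqrt(tr * tr - 4 * det + 0j)
    s1, s2 = (tr + disc) / 2, (tr - disc) / 2
    return s1 if abs(s1 - d) <= abs(s2 - d) else s2

def qr_eigvals(A, tol=1e-12, maxit=1000):
    """Single-shift QR iteration with deflation on the Hessenberg form."""
    H = hessenberg(A)
    n = H.shape[0]
    eigs = []
    while n > 1:
        for _ in range(maxit):
            if abs(H[n-1, n-2]) <= tol * (abs(H[n-2, n-2]) + abs(H[n-1, n-1])):
                break
            s = wilkinson_shift(H, n)
            Q, R = np.linalg.qr(H[:n, :n] - s * np.eye(n))
            H[:n, :n] = R @ Q + s * np.eye(n)
        eigs.append(H[n-1, n-1])  # deflate the converged 1x1 trailing block
        n -= 1
    eigs.append(H[0, 0])
    return np.array(eigs)
```

Each QR step here costs O(n²) (on the Hessenberg form) but is rich in level-1/2 operations, which is exactly why it maps poorly onto the CSX600 without the reorganization described in the following slides.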

12 The small-bulge multishift QR algorithm Algorithm – Shifts s1, …, sm: eigenvalues of the trailing m × m submatrix of Al – Perform m steps of the QR algorithm at once. Computational procedure for one iteration – Introduce (m / 2) bulges – Transform the matrix back to Hessenberg form by chasing the (m / 2) bulges. [Figure: the case of m = 4 shifts; A Hessenberg, Q unitary, R upper triangular]
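Mathematically, one multishift iteration is equivalent to an explicit step: form p(H) = (H − s1·I)···(H − sm·I), take its QR factorization p(H) = QR, and apply Q as a similarity transform. The sketch below shows this explicit (numerically naive) version; the bulge-chasing procedure on the slide computes the same transform implicitly, without ever forming p(H). The function name is ours.

```python
import numpy as np

def explicit_multishift_step(H, shifts):
    """One explicit m-shift QR step: H <- Q^H H Q, where p(H) = Q R and
    p(z) = prod_i (z - s_i).  Equivalent to m single-shift QR steps."""
    n = H.shape[0]
    p = np.eye(n, dtype=complex)
    for s in shifts:          # build the shift polynomial applied to H
        p = p @ (H - s * np.eye(n))
    Q, _ = np.linalg.qr(p)    # unitary factor of p(H)
    return Q.conj().T @ H @ Q
```

In practice the shifts are the eigenvalues of the trailing m × m submatrix, as stated on the slide; the explicit form costs m extra matrix multiplies and loses accuracy for large m, which is why production codes chase small bulges instead.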


17 Blocking of bulge-chasing operations: use of the level-3 BLAS Division of the updating operations – Chase the bulges by only k rows at a time. – Divide the update operations into two parts: First, update the diagonal block sequentially, accumulating the Householder transformations used in the update into a unitary matrix. Next, update the off-diagonal blocks by multiplying them by the unitary matrix (level-3 BLAS). [Figure: 3×3 bulges; diagonal update (sequential), off-diagonal update (MatMult), block size k]
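The division described on this slide — sequential updates on the small diagonal block while accumulating the reflectors into one unitary matrix U, then a single matrix multiply for the wide off-diagonal block — can be sketched as follows. This is a simplified model with our own names; the real bulge-chasing kernel interleaves many such reflectors with the chase itself.

```python
import numpy as np

def blocked_update(D, B, reflectors):
    """Apply Householder reflectors P_i = I - 2 v_i v_i^H to the diagonal
    block D one by one (the sequential part), accumulating their product
    into a unitary U; the off-diagonal block B is then updated with a
    single matrix multiplication (the level-3 BLAS part)."""
    b = D.shape[0]
    U = np.eye(b, dtype=complex)
    for v in reflectors:                          # v: unit vector of length b
        D = D - 2.0 * np.outer(v, v.conj() @ D)   # D <- P D
        D = D - 2.0 * np.outer(D @ v, v.conj())   # D <- D P
        U = U - 2.0 * np.outer(U @ v, v.conj())   # U <- U P, so U = P1 P2 ...
    return D, U.conj().T @ B, U                   # B <- U^H B in one GEMM
```

The point of the accumulation is that the off-diagonal update, which dominates the flop count, becomes one large MatMult of exactly the shape the CSX600 library handles well.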

18 Performance on the CSX600 Random matrix (n = 6000) – Compute all the eigenvalues / eigenvectors with the small-bulge multishift QR algorithm Computational environment – Xeon 3.2 GHz, 8 GB memory – ClearSpeed Advance board (CSX600 × 2) As the number of shifts increases, the MatMult part decreases while the other parts increase and become the bottleneck. [Graph: execution time (sec) vs. number of shifts, Xeon vs. Xeon + CSX600; MatMult size grows with the number of shifts] Parts other than MatMult need to be sped up!

19 Modification of the algorithm (1) Reformulation as a recursive algorithm – Chase (m / 2) / q bulges by k / q rows at a time (example: recursion level d = 1). [Figure: diagonal updates (sequential) on k/q-row blocks; off-diagonal updates (MatMult)]

20 Modification of the algorithm (2) Deflation – The trailing submatrix becomes isolated. Eigensolution of the isolated submatrix – Apply the double-shift QR algorithm. – The size of the submatrix increases with m (bottleneck). Division of the update operations, to reduce the computational work: – Update the diagonal block (until convergence), accumulating the Householder transformations used in the update into a unitary matrix (sequential). – Update the off-diagonal blocks (only once, at the end) by multiplying them by the unitary matrix (MatMult).
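The deflation mentioned here (a trailing submatrix becoming isolated) rests on the standard criterion of neglecting a subdiagonal entry that is tiny relative to its diagonal neighbours, after which the eigenproblem splits into independent diagonal blocks. A sketch of the test (the helper name and tolerance are ours):

```python
import numpy as np

def find_deflations(H, tol=1e-12):
    """Zero out negligible subdiagonal entries of the Hessenberg matrix H
    (standard deflation criterion) and return the split points; the
    eigenproblem then decouples into independent diagonal blocks."""
    n = H.shape[0]
    splits = []
    for i in range(n - 1):
        if abs(H[i+1, i]) <= tol * (abs(H[i, i]) + abs(H[i+1, i+1])):
            H[i+1, i] = 0.0
            splits.append(i + 1)
    return splits
```

Once a trailing block is split off this way, it can be finished independently, e.g. with the double-shift QR algorithm as on the slide.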

21 Numerical experiments Test problem – Random matrices with elements in [0, 1] – Reduced to Hessenberg form by Householder's method – Compute all the eigenvalues / eigenvectors Computational environment – Xeon 3.2 GHz, 8 GB memory – Fortran 77, double precision – ClearSpeed Advance board (CSX600 × 2) – Matrix multiplication: ClearSpeed's library (for large MatMult), Intel Math Kernel Library (for small MatMult)

22 Numerical experiments Comparison – Existing algorithm (small-bulge multishift QR method): MatMult part = off-diagonal update – Our algorithm (multishift QR + recursion): MatMult parts = off-diagonal update, diagonal update, and eigensolution of the isolated submatrix (all run on the CSX600) Parameter values – Number of shifts m: chosen optimally for each case – Row length of bulge chasing: k = (3/2)m – Level of recursion: d = 1 – Number of subdivisions: q = m / 40

23 Effect of our modifications Our algorithm is 1.4 times faster than the original – Diagonal update: 1.5 times faster – Eigensolution of the isolated submatrix: 10 times faster [Graph: execution time (sec), original vs. ours, for (n = 3000, m = 160, q = 4) and (n = 6000, m = 200, q = 5); the CSX600 is used in all cases]

24 Effect of using the CSX600 By combining the CSX600 with our algorithm: – 3.5 times speedup when n = 6000 – 3.8 times speedup when n = 12000 [Graph: execution time (sec), Xeon vs. Xeon + CSX600, for n = 6000 (m = 100 / 200, q = 5) and n = 12000 (m = 100, q = 5 / m = 240, q = 6)]

25 Conclusion We proposed an approach to accelerating the solution of the nonsymmetric eigenproblem using a floating-point accelerator. We used the small-bulge multishift QR algorithm, which can exploit matrix multiplications efficiently, as a basis. By reformulating part of the algorithm recursively, we succeeded in reducing the computational time spent in the non-blocked part. This enables us to use a large block size (number of shifts) with small overhead and to exploit the performance of the floating-point accelerator. When solving an eigenproblem of order 12,000, our algorithm is 1.4 times faster than the original small-bulge multishift QR algorithm. We obtained a 3.8 times speedup with the CSX600 over the 3.2GHz Xeon.