An approach for solving the Helmholtz Equation on heterogeneous platforms


An approach for solving the Helmholtz Equation on heterogeneous platforms
G. Ortega¹, I. García² and E. M. Garzón¹
¹ Dpt. Computer Architecture and Electronics, University of Almería
² Dpt. Computer Architecture, University of Málaga

Outline
1. Introduction
2. Algorithm
3. Multi-GPU approach implementation
4. Performance Evaluation
5. Conclusions and Future Work

Introduction: Motivation
The resolution of the 3D Helmholtz equation supports the development of models related to a wide range of scientific and technological applications:
- Mechanical
- Acoustical
- Thermal
- Electromagnetic waves

Introduction: Helmholtz Equation
- The Helmholtz equation is a linear elliptic partial differential equation (PDE).
- Solution approaches: Green's functions, or spatial discretization (based on FEM), which yields a large linear system of equations in which A is sparse, symmetric and has a regular pattern.
- Literature: other authors do not use heterogeneous multi-GPU clusters.
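For reference, a standard form of the equation, which the transcript never writes out (the wavenumber k and source f are the usual symbols; the deck's exact notation is an assumption):

```latex
\nabla^{2} u(\mathbf{x}) + k^{2}\, u(\mathbf{x}) = f(\mathbf{x}),
\qquad \mathbf{x} \in \Omega \subset \mathbb{R}^{3},
\qquad \text{FEM discretization} \;\Rightarrow\; A\,\mathbf{u} = \mathbf{b},
\quad A \in \mathbb{C}^{N \times N} \text{ sparse, complex symmetric.}
```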

Introduction: Goal
Develop a parallel solution for the 3D Helmholtz equation on the heterogeneous architecture of modern multi-GPU clusters.
Our proposal combines:
(1) the BCG method,
(2) multi-GPU clusters,
(3) Regular Format matrices,
with acceleration of the SpMVs and vector operations → reductions in memory requirements and runtime.
This extends the resolution of problems of practical interest to several different fields of Physics.


Algorithm: Biconjugate Gradient Method
[Slide shows the BCG pseudocode using the Regular Format, highlighting its core operations: dot products (dots), saxpy updates and SpMV.]
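Where the transcript only names the operations, the following minimal serial sketch (in C, real arithmetic for brevity; the actual system is complex symmetric) shows how one BCG iteration decomposes into exactly those primitives. The helpers spmv, dot and axpy are illustrative stand-ins for the presentation's GPU kernels, not the authors' code; note the two SpMVs per iteration, which motivates the kernel fusion shown later.

```c
/* bicg_sketch.c: serial reference of the BCG iteration structure. */
#include <stdio.h>
#include <math.h>

#define N 4

static void spmv(const double A[N][N], const double *x, double *y) {
    for (int i = 0; i < N; i++) {           /* y = A x */
        y[i] = 0.0;
        for (int j = 0; j < N; j++) y[i] += A[i][j] * x[j];
    }
}
static double dot(const double *x, const double *y) {
    double s = 0.0;
    for (int i = 0; i < N; i++) s += x[i] * y[i];
    return s;
}
static void axpy(double a, const double *x, double *y) {   /* y += a x */
    for (int i = 0; i < N; i++) y[i] += a * x[i];
}

int main(void) {
    double A[N][N] = {{4,1,0,0},{1,4,1,0},{0,1,4,1},{0,0,1,4}};
    double b[N] = {1,2,3,4}, x[N] = {0};
    double r[N], rt[N], p[N], pt[N], q[N], qt[N];
    for (int i = 0; i < N; i++) { r[i] = b[i]; rt[i] = r[i]; p[i] = r[i]; pt[i] = r[i]; }
    double rho = dot(rt, r);
    for (int it = 0; it < 50 && sqrt(dot(r, r)) > 1e-12; it++) {
        spmv(A, p,  q);                     /* SpMV #1: q  = A p            */
        spmv(A, pt, qt);                    /* SpMV #2: qt = A^T pt = A pt,
                                               since A is symmetric (fused on GPU) */
        double alpha = rho / dot(pt, q);    /* dot                          */
        axpy( alpha, p,  x);                /* saxpy: x  += alpha p         */
        axpy(-alpha, q,  r);                /* saxpy: r  -= alpha q         */
        axpy(-alpha, qt, rt);               /* saxpy: rt -= alpha qt        */
        double rho_new = dot(rt, r);        /* dot (the MPI_Allreduce point) */
        double beta = rho_new / rho;
        for (int i = 0; i < N; i++) { p[i] = r[i] + beta * p[i]; pt[i] = rt[i] + beta * pt[i]; }
        rho = rho_new;
    }
    printf("x[0] = %f\n", x[0]);
    return 0;
}
```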

Algorithm: Regular Format
Regularities of the matrix:
1. Complex symmetric matrix
2. At most seven nonzeros per row
3. Nonzeros are located on seven diagonals
4. Same values along the lateral diagonals (a, b, c)
[Slide shows a table of memory requirements (GB) for storing A under the VolTP, CRS, ELLR-T and Regular Format schemes; the numeric values did not survive the transcript.]
The arithmetic intensity of the SpMV based on the Regular Format is 1.6 times greater than that of the CRS format when a = b = c = 1.
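A sketch of what an SpMV in the Regular Format could look like, under the regularities listed above: a 7-point stencil on an nx × ny × nz grid where the six lateral diagonals carry the constant values a, b, c, so only the main diagonal d is stored explicitly. Real arithmetic and illustrative names; the authors' matrices are complex.

```c
/* y = A x for the 7-diagonal "Regular Format" sketch: only the main
 * diagonal d[] is stored per row; the lateral diagonals (offsets
 * +-1, +-nx, +-nx*ny) are the constants a, b, c. */
void spmv_regular(int nx, int ny, int nz,
                  const double *d, double a, double b, double c,
                  const double *x, double *y)
{
    int n = nx * ny * nz;
    for (int i = 0; i < n; i++) {
        int ix = i % nx, iy = (i / nx) % ny, iz = i / (nx * ny);
        double v = d[i] * x[i];                      /* main diagonal      */
        if (ix > 0)      v += a * x[i - 1];          /* x-neighbours       */
        if (ix < nx - 1) v += a * x[i + 1];
        if (iy > 0)      v += b * x[i - nx];         /* y-neighbours       */
        if (iy < ny - 1) v += b * x[i + nx];
        if (iz > 0)      v += c * x[i - nx * ny];    /* z-neighbours       */
        if (iz < nz - 1) v += c * x[i + nx * ny];
        y[i] = v;
    }
}
```

Storing one value per row plus three scalars, instead of value/index pairs per nonzero as in CRS, is what drives both the memory savings and the higher arithmetic intensity quoted above.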


Implementation on heterogeneous platforms
Exploiting the heterogeneous platforms of a cluster has two main advantages:
(1) Larger problems can be solved, because the code can be distributed among the available nodes;
(2) Runtime is reduced, since more operations are executed at the same time on different nodes and are accelerated by the GPU devices.
To distribute the load between CPU and GPU processes:
- MPI communicates the multicores located in different nodes.
- The GPU implementation uses the CUDA interface.

MPI implementation
- One MPI process is started per CPU core or GPU device.
- The sequential code has been parallelized following the data-parallel concept.
- Sparse matrix → row-wise matrix decomposition.
- Important issue → communications among processes occur twice at every iteration:
(1) Dot operations (MPI_Allreduce; a synchronization point).
(2) Two SpMV operations → the regularity of the matrix allows the swapping of halos.
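A hedged sketch of these two communication points, assuming the row-wise block decomposition described above; dot_global, swap_halos and the buffer layout are illustrative, not the authors' code.

```c
#include <mpi.h>

/* (1) Dot product: local partial sum, then a global MPI_Allreduce
 *     (the synchronization point mentioned on the slide). */
double dot_global(const double *x, const double *y, int n_local, MPI_Comm comm)
{
    double local = 0.0, global = 0.0;
    for (int i = 0; i < n_local; i++) local += x[i] * y[i];
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
    return global;
}

/* (2) Halo swap before the SpMVs: exchange one halo of size `halo`
 *     with the previous and next rank of the row-wise partition.
 *     Layout of x: [lower halo | n_local owned elements | upper halo]. */
void swap_halos(double *x, int n_local, int halo, int rank, int nprocs, MPI_Comm comm)
{
    int prev = (rank > 0) ? rank - 1 : MPI_PROC_NULL;
    int next = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;
    MPI_Sendrecv(x + halo,           halo, MPI_DOUBLE, prev, 0,   /* first owned block -> prev */
                 x + halo + n_local, halo, MPI_DOUBLE, next, 0,   /* upper halo <- next        */
                 comm, MPI_STATUS_IGNORE);
    MPI_Sendrecv(x + n_local,        halo, MPI_DOUBLE, next, 1,   /* last owned block -> next  */
                 x,                  halo, MPI_DOUBLE, prev, 1,   /* lower halo <- prev        */
                 comm, MPI_STATUS_IGNORE);
}
```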

Halo swapping
Swapping halos is advantageous only when the percentage of redundant data with respect to the total data of every process is small, i.e. when P ≪ N/D2, where P is the number of MPI tasks, N is the dimension of A and D2 is half of the halo elements. For example, on a cubic grid of n³ nodes, where each halo plane holds roughly n² elements, the condition reduces to P ≪ n.

GPU implementation
- One GPU device is exploited per processor. All operations are carried out on the GPUs; when a communication among cluster processors is required, the data chunks are copied to the CPU and the exchange among processors is executed there.
- Each GPU device computes all the local vector operations (dot, saxpy) and the local SpMVs involved in the BCG specifically suited to solving the 3D Helmholtz equation.
- Optimization techniques:
  - Reads of the sparse matrix and of the data involved in vector operations use coalesced global-memory accesses, so the bandwidth of global memory is maximized.
  - Shared memory and registers are used to store the intermediate data of the operations that constitute Fast-Helmholtz, despite the low reuse of data in these operations.
  - Fusion of operations into one kernel.

Fusion of kernels
The two SpMVs can be executed at the same time, avoiding reading A twice → the arithmetic intensity is improved by this fusion.
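A minimal CUDA sketch of this fusion, reusing the hypothetical Regular Format layout from the earlier SpMV sketch: one kernel computes both products, so the matrix data (d, a, b, c) is read once for the two input vectors.

```cuda
/* Fused SpMV sketch: q = A*p and qt = A*pt in one kernel, reading the
 * matrix (main diagonal d plus constants a, b, c) once for both products. */
__global__ void spmv2_regular(int nx, int ny, int nz,
                              const double *d, double a, double b, double c,
                              const double *p, const double *pt,
                              double *q, double *qt)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int n = nx * ny * nz;
    if (i >= n) return;
    int ix = i % nx, iy = (i / nx) % ny, iz = i / (nx * ny);

    double s = d[i] * p[i], st = d[i] * pt[i];   /* main diagonal, read once */
    if (ix > 0)      { s += a * p[i - 1];         st += a * pt[i - 1]; }
    if (ix < nx - 1) { s += a * p[i + 1];         st += a * pt[i + 1]; }
    if (iy > 0)      { s += b * p[i - nx];        st += b * pt[i - nx]; }
    if (iy < ny - 1) { s += b * p[i + nx];        st += b * pt[i + nx]; }
    if (iz > 0)      { s += c * p[i - nx * ny];   st += c * pt[i - nx * ny]; }
    if (iz < nz - 1) { s += c * p[i + nx * ny];   st += c * pt[i + nx * ny]; }
    q[i] = s;
    qt[i] = st;
}
```

A launch such as spmv2_regular<<<(n + 255) / 256, 256>>>(...) covers all n rows; each thread loads its matrix entries once and applies them to both input vectors, which is where the improved arithmetic intensity comes from.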


Platforms
- 2 compute nodes (Bullx R424-E3: Intel Xeon E (16 cores) and 64 GB RAM per node).
- 4 GPUs, 2 per node: Tesla M2075 with 5.24 GB of memory per GPU.
- CUDA interface.

Test matrices and approaches
Three strategies for solving the 3D Helmholtz equation have been proposed:
- MPI
- GPU
- Heterogeneous: GPU-MPI

Results (I)
[Table: runtime (s) of 1000 iterations of the BCG based on the Helmholtz equation using 1 CPU core, for the test matrices (m_*); the numeric values did not survive the transcript.]
OPTIMIZED code (fusion, Regular Format, etc.): it takes 1.8 hours.

Results (II)
[Figures: acceleration factors of the 2Ax (fused SpMV), saxpy and dot routines, with 4 MPI processes and with 4 GPUs.]

Results (III)
[Table: resolution time (seconds) of 1000 iterations of the BCG based on Helmholtz, using 2 and 4 MPI processes and 2 and 4 GPU devices.]
Acceleration factor ≈ 9x.

Static distribution of the load
Heterogeneous data partition: factor = t_CPU / t_GPU ≈ 10 → a CPU process receives Load and a GPU process receives 10 × Load.
Static workload-balance scheduling has been considered: the application workload is known at compile time and is fixed during the execution, so the distribution among the different processes can be done at compile time.
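A sketch of how such a compile-time partition could be computed, assuming (as on the slide) a CPU/GPU speed ratio of about 10; the names and the omitted remainder handling are illustrative.

```c
/* Static row partition for n_cpu CPU processes and n_gpu GPU processes,
 * where each GPU process gets RATIO times the rows of a CPU process. */
#define RATIO 10                          /* t_CPU / t_GPU from the slide */

int rows_for_process(int N, int n_cpu, int n_gpu, int is_gpu)
{
    int units = n_cpu + RATIO * n_gpu;    /* total "work units"           */
    int base  = N / units;                /* rows per work unit           */
    return is_gpu ? RATIO * base : base;  /* remainder handling omitted   */
}
```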

Results (IV)
[Table: profiling of the resolution of 1000 iterations of the BCG based on Helmholtz using our heterogeneous approach with three different configurations of MPI and GPU processes.]
The memory of the GPU is the limiting factor.

Results (V)
[Figure: runtime (s) of 1000 iterations of the BCG based on Helmholtz using our heterogeneous approach (4 GPUs + 8 MPI processes) and 4 GPU processes.]


Conclusions
- We presented a parallel solution for the 3D Helmholtz equation that combines the exploitation of the high regularity of the matrices involved in the numerical method with the massive parallelism supplied by the heterogeneous architecture of modern multi-GPU clusters.
- Experimental results have shown that our heterogeneous approach outperforms the MPI and GPU approaches when several CPU cores collaborate with the GPU devices.
- This strategy allows extending the resolution of problems of practical interest to several different fields of Physics.

Future work
(1) Design a model to determine the most suitable factor to keep the workload well balanced;
(2) Integrate this framework in a real application based on Optical Diffraction Tomography (ODT);
(3) Include Pthreads or OpenMP for shared memory.

Thank you for your attention

Results (II)
[Figure: percentage of the runtime spent in each function call using 4 GPUs.]