Integrated Predictive Simulation System for Earthquake and Tsunami Disaster
CREST / Japan Science and Technology Agency (JST), October 2008

Robust & Efficient Parallel Preconditioning Methods in the "Multi-Core Era"

Introduction
In this work, parallel preconditioning methods based on "Hierarchical Interface Decomposition (HID)" and hybrid parallel programming models were applied to finite-element simulations of linear elasticity problems in media with heterogeneous material properties. Reverse Cuthill-McKee reordering with cyclic multicoloring (CM-RCM) was applied to obtain loop-level parallelism for OpenMP (a multicolor sweep sketch is given later in this transcript). The developed code has been tested on the "T2K Open Super Computer (Todai Combined Cluster)" using up to 512 cores. Preconditioners based on HID provide scalable performance and robustness in comparison with conventional localized block Jacobi preconditioners, and the performance of the Hybrid 4x4 parallel programming model is competitive with that of Flat MPI.

Parallel Programming Models on Multi-core Clusters
To keep parallelization overhead small on SMP (symmetric multiprocessor) and multi-core clusters, a multi-level hybrid parallel programming model is often employed. In this model, coarse-grained parallelism is achieved through domain decomposition with message passing among nodes, while fine-grained parallelism is obtained through loop-level parallelism inside each node using compiler-based thread parallelization such as OpenMP; a minimal sketch of this combination follows this section. Another widely used model is the single-level flat MPI model, in which a separate single-threaded MPI process runs on every core. In general, the efficiency of each programming model depends on hardware performance, application characteristics, and problem size.

Fig.1 Hybrid and Flat MPI Parallel Programming Models
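The following minimal sketch illustrates the hybrid model described above: MPI processes carry the coarse-grained domain decomposition among nodes or sockets, while OpenMP threads carry the fine-grained loop-level work inside each process, as in the Hybrid 4x4 configuration (4 processes per node, 4 threads per process). This is a generic illustration, not the poster's GeoFEM code; all sizes and names are hypothetical.

```c
/* Minimal hybrid MPI + OpenMP sketch (hypothetical, not the poster's code):
 * one MPI process per socket, several OpenMP threads per process,
 * e.g. Hybrid 4x4 = 4 processes/node x 4 threads/process. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int provided, rank, nprocs;
    /* Request thread support so OpenMP threads can coexist with MPI */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Coarse grain: each MPI process owns one subdomain (dummy size here) */
    const int n_local = 1000000;
    double *x = malloc(n_local * sizeof(double));
    double *y = malloc(n_local * sizeof(double));
    for (int i = 0; i < n_local; i++) { x[i] = 1.0; y[i] = 2.0; }

    /* Fine grain: loop-level parallelism via OpenMP inside each process */
    double local_dot = 0.0, global_dot = 0.0;
    #pragma omp parallel for reduction(+:local_dot)
    for (int i = 0; i < n_local; i++)
        local_dot += x[i] * y[i];

    /* Message passing among processes completes the global operation */
    MPI_Allreduce(&local_dot, &global_dot, 1, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);

    if (rank == 0)
        printf("dot = %e (%d MPI processes x %d threads)\n",
               global_dot, nprocs, omp_get_max_threads());

    free(x);
    free(y);
    MPI_Finalize();
    return 0;
}
```

In the flat MPI model the same loop simply runs single-threaded in one MPI process per core; switching between Hybrid 4x4 and Hybrid 8x2 amounts to changing the number of processes per node and the OMP_NUM_THREADS setting.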
Target Application
In the present work, linear elasticity problems in simple cube geometries of media with heterogeneous material properties are solved using a parallel finite-element method (FEM). Poisson's ratio is set to 0.25 for all elements, while the heterogeneous distribution of Young's modulus over the elements is generated by a sequential Gauss algorithm, which is widely used in geostatistics. The minimum and maximum values of Young's modulus are and 10^3, respectively, with an average value of 1.0. The GPBi-CG (Generalized Product-type method based on Bi-CG) solver with a Symmetric Gauss-Seidel (SGS) preconditioner (SGS/GPBi-CG) was applied. The code is built on the framework for parallel FEM procedures of GeoFEM, and GeoFEM's local data structure is used.

Fig.2 Heterogeneous Distribution of Material Property

HID (Hierarchical Interface Decomposition)
Localized block Jacobi ILU/IC preconditioners are widely used with parallel iterative solvers. They provide excellent parallel performance for well-defined problems, although the number of iterations required for convergence gradually increases with the number of processors. However, this preconditioning technique is not robust for ill-conditioned problems with many processors, because it ignores the global effect of external nodes in other domains. The Parallel Hierarchical Interface Decomposition Algorithm (PHIDAL) [Henon & Saad 2007] provides robustness and scalability for parallel ILU/IC preconditioners. PHIDAL is based on defining a "hierarchical interface decomposition (HID)".
The HID process starts with a partitioning of the graph with one layer of overlap. The "levels" are defined from this partitioning, with each level consisting of a set of vertex groups. Each vertex group of a given level is a separator for vertex groups of a lower level. If the unknowns are reordered according to their level numbers, from lowest to highest, the block structure of the reordered matrix is as shown in Fig.3. This block structure leads to natural parallelism when ILU/IC decompositions or forward/backward substitution processes are applied.

Fig.3 Domain/block decomposition of the matrix according to the HID reordering: (a) domain decomposition, (b) matrix blocks, (c) forward substitution process of SGS
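The parallelism exposed by this block structure can be sketched as follows: levels are processed from lowest to highest, and within one level the vertex groups are mutually independent, so they can be handled by different threads (or processes). The sketch below shows a shared-memory forward substitution under these assumptions; the array names (lev_ptr, grp_ptr, ...) are hypothetical, inter-process communication between levels is omitted, and this is not the implementation used on T2K/Tokyo.

```c
/* Hypothetical sketch of HID level-ordered forward substitution (L y = b).
 * Unknowns are assumed to be renumbered level by level (Fig.3); vertex
 * groups within one level are independent, so they can run in parallel,
 * while the levels themselves are visited from lowest to highest.
 * The lower-triangular factor L is stored in CRS format. */
void hid_forward_substitution(
    int n_levels,
    const int *lev_ptr,   /* groups of level l: lev_ptr[l] .. lev_ptr[l+1]-1 */
    const int *grp_ptr,   /* rows of group g:   grp_ptr[g] .. grp_ptr[g+1]-1 */
    const int *row_ptr, const int *col_idx, const double *val,
    const double *diag,   /* diagonal entries of L */
    const double *b, double *y)
{
    for (int lev = 0; lev < n_levels; lev++) {        /* levels: sequential */
        #pragma omp parallel for schedule(dynamic)
        for (int g = lev_ptr[lev]; g < lev_ptr[lev + 1]; g++) {   /* groups: parallel */
            for (int i = grp_ptr[g]; i < grp_ptr[g + 1]; i++) {   /* rows: sequential */
                double t = b[i];
                /* off-diagonal entries only reference lower levels or
                 * earlier rows of the same group, already computed */
                for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
                    t -= val[k] * y[col_idx[k]];
                y[i] = t / diag[i];
            }
        }
    }
}
```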

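Inside each subdomain, the CM-RCM reordering mentioned in the Introduction plays a similar role for the SGS preconditioner: after reordering, rows that share a color have no direct mutual couplings, so each color block of a Gauss-Seidel sweep can be threaded with OpenMP while the colors are processed one after another. A minimal sketch under that assumption follows (hypothetical arrays, not the poster's code).

```c
/* Hypothetical sketch of one multicolored (CM-RCM) forward Gauss-Seidel
 * sweep, as used inside an SGS-type preconditioner.  After reordering,
 * rows of the same color have no direct couplings, so the loop over the
 * rows of one color is threaded; colors are processed in sequence. */
void multicolor_gs_sweep(
    int n_colors,
    const int *color_ptr,  /* rows of color c: color_ptr[c] .. color_ptr[c+1]-1 */
    const int *row_ptr, const int *col_idx, const double *val,  /* CRS matrix */
    const double *diag, const double *b, double *x)
{
    for (int c = 0; c < n_colors; c++) {              /* colors: sequential */
        #pragma omp parallel for
        for (int i = color_ptr[c]; i < color_ptr[c + 1]; i++) {  /* rows: parallel */
            double t = b[i];
            for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
                if (col_idx[k] != i)
                    t -= val[k] * x[col_idx[k]];      /* uses latest available x */
            x[i] = t / diag[i];
        }
    }
}
```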
T2K Open Super Computer (Todai Combined Cluster) (T2K/Tokyo)
The developed code has been tested on the "T2K Open Super Computer (Todai Combined Cluster)" (T2K/Tokyo) at the University of Tokyo, which was developed by Hitachi under the "T2K Open Supercomputer Alliance". T2K/Tokyo is an AMD quad-core Opteron-based combined cluster system with 952 nodes, 15,232 cores and 31 TB of memory. Total peak performance is TFLOPS. Each node includes four "sockets" of AMD quad-core Opteron processors (2.3 GHz), as shown in Fig.4, and the nodes are connected via a Myrinet-10G network. In the present work, 32 nodes of the system have been evaluated. Because T2K/Tokyo is based on a cc-NUMA architecture, careful design of software and data structures is required for efficient access to local memory (a first-touch sketch is given at the end of this transcript).

Fig.4 Overview of T2K/Tokyo (Entire System and Each Node)

Results & Future Works
Preliminary tests of the developed code have been conducted on T2K/Tokyo using up to 512 cores. Preconditioners based on HID provide scalable performance and robustness in comparison with conventional localized block Jacobi preconditioners, and the performance of the Hybrid 4x4 parallel programming model is competitive with that of Flat MPI. HID-based preconditioning combined with the hybrid parallel programming model is expected to be a good choice for scalable and robust implementations on more than 10^4 cores of multi-core clusters, if each MPI process is assigned to one socket, as in the Hybrid 4x4 configuration of this work. In the present work, no fill-in has been considered in the HID procedures, so the improvement of HID over conventional localized block Jacobi preconditioning has not been very significant. More robust preconditioning methods based on HID may be developed by considering fill-ins inside and between connectors for realistic applications with ill-conditioned matrices. Moreover, the developed methods should be evaluated further on various types of clusters with larger core counts.

Fig.5 Strong scaling tests up to 512 cores of T2K/Tokyo: relative performance of HID normalized by the results of localized block Jacobi, and relative performance of the three parallel programming models (Flat MPI, Hybrid 4x4 and Hybrid 8x2) normalized by the results of Flat MPI
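As noted in the T2K/Tokyo description above, cc-NUMA nodes reward careful data placement. A common technique, shown in the hedged sketch below, is first-touch initialization: memory pages are physically allocated on the NUMA node of the thread that first writes them, so initializing arrays with the same OpenMP loop distribution that the solver kernels later use keeps most accesses local. This is a general cc-NUMA idiom, not code from the poster.

```c
/* Hypothetical first-touch sketch for cc-NUMA nodes such as those of
 * T2K/Tokyo.  Memory pages are placed on the NUMA node of the thread that
 * first writes them, so arrays are initialized in parallel with the same
 * static schedule that the compute kernels use afterwards. */
#include <stdlib.h>

double *alloc_first_touch(int n)
{
    double *a = malloc(n * sizeof(double));
    /* First touch: each thread writes its own portion, which pins those
     * pages to that thread's local memory */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++)
        a[i] = 0.0;
    return a;
}

/* Compute kernels then traverse the arrays with the identical distribution,
 * so most memory accesses stay local to the socket */
void daxpy(int n, double alpha, const double *x, double *y)
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++)
        y[i] += alpha * x[i];
}
```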