Performance Evaluation of Hybrid MPI/OpenMP Implementation of a Lattice Boltzmann Application on Multicore Systems
Ashraf Bah Rabiou, Dr. Valerie E. Taylor, Dr. Xingfu Wu
Department of Computer Science and Engineering, Texas A&M University, College Station, TX

Motivations
- Computationally intensive applications can be parallelized to run on multicore systems and achieve better performance
- MPI and OpenMP are two popular programming paradigms that can be used for this purpose
- MPI and OpenMP can be combined to exploit multiple levels of parallelism on multicore systems

Goals
- Implement a hybrid MPI/OpenMP Lattice Boltzmann application to explore multiple levels of parallelism on multicore systems
- Evaluate the performance of this hybrid implementation, compare it with the existing MPI-only version on two different multicore systems, and analyze energy consumption

Lattice Boltzmann Method (LBM)
- LBM is based on kinetic theory, which describes the fluid at a more fundamental level than the Navier-Stokes equations
- LBM is used for simulating fluid flows in computational physics, aerodynamics, materials science, chemical engineering and aerospace engineering
- LBM is computationally intensive
- LBM exposes parallelism that is easy to exploit
- The MPI-only LBM code was developed by the Aerospace department at TAMU
- It uses the D3Q19 lattice (a cubic cell with 19 discrete velocities), as shown in the poster figure

Hybrid MPI/OpenMP Lattice Boltzmann Application
- MPI: message-passing model, process-level parallelism, expressed through a communication library
- OpenMP: shared-memory model, thread-level parallelism, expressed through compiler directives
Hybrid implementation:
- Scrutinize the original MPI-only program to detect the computationally intensive loops
- Use OpenMP to parallelize those loops to construct the hybrid LBM
- Avoid data dependencies within the loops to be parallelized
- Determine the right scope (shared or private) for each variable in order to maintain the correctness of the program
- The hybrid LBM uses MPI for inter-node communication and OpenMP for intra-node parallelization to achieve multiple levels of parallelism (a minimal sketch of this pattern is shown below)
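As an illustration of the hybrid pattern described above, the following is a minimal sketch, not the actual TAMU LBM code: MPI ranks own sub-domains of the grid and exchange halo data, while OpenMP threads share the compute-intensive loop inside each rank. The 1-D slab decomposition, grid size, array names and the simple relaxation update are illustrative assumptions rather than the real D3Q19 collision/streaming kernels.

/* Minimal hybrid MPI/OpenMP sketch (illustrative only):
 * each MPI rank owns a slab of a 1-D grid, exchanges one-cell halos with
 * its neighbors, and uses OpenMP threads for the local update loop. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define NLOCAL 1024          /* cells owned by each rank (assumed size) */
#define STEPS  100           /* number of time steps (assumed)          */

int main(int argc, char **argv) {
    int provided, rank, size;
    /* Request FUNNELED: only the master thread makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (provided < MPI_THREAD_FUNNELED && rank == 0)
        fprintf(stderr, "warning: requested thread level not provided\n");

    /* local grid with one halo cell on each side */
    double *f     = calloc(NLOCAL + 2, sizeof(double));
    double *f_new = calloc(NLOCAL + 2, sizeof(double));
    for (int i = 1; i <= NLOCAL; i++) f[i] = (double)rank;

    int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    for (int step = 0; step < STEPS; step++) {
        /* MPI level: exchange halo cells with neighboring ranks */
        MPI_Sendrecv(&f[1],          1, MPI_DOUBLE, left,  0,
                     &f[NLOCAL + 1], 1, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&f[NLOCAL],     1, MPI_DOUBLE, right, 1,
                     &f[0],          1, MPI_DOUBLE, left,  1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* OpenMP level: the compute-intensive loop is split across threads;
         * f and f_new are shared, the loop index is private */
        #pragma omp parallel for schedule(static)
        for (int i = 1; i <= NLOCAL; i++)
            f_new[i] = 0.5 * f[i] + 0.25 * (f[i - 1] + f[i + 1]);

        double *tmp = f; f = f_new; f_new = tmp;   /* swap time levels */
    }

    if (rank == 0)
        printf("ranks = %d, threads per rank = %d\n", size, omp_get_max_threads());

    free(f); free(f_new);
    MPI_Finalize();
    return 0;
}

Built with an MPI compiler wrapper and OpenMP enabled (e.g., mpicc -fopenmp), such a program is launched with one or more MPI processes per node and OMP_NUM_THREADS threads per process, which mirrors the MPI-for-inter-node / OpenMP-for-intra-node scheme used by the hybrid LBM.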
Methodology
- Evaluate the performance of the hybrid LBM and compare it with the MPI-only LBM for an increasing number of cores on two multicore systems
- Three datasets were used: 64x64x64, 128x128x128 and 256x256x256
- Use three performance metrics for the comparison: execution time, speedup (single-core time divided by the time on p cores) and efficiency (speedup divided by p)
- Use PowerPack to collect power profiling data for the energy consumption analysis

Experiment Platforms
Specifications of both clusters:

Configuration     | Dori (CS Department, Virginia Tech)  | Hydra (Supercomputing Facility, Texas A&M)
Number of nodes   | 8                                    | 52
CPUs per node     | 4                                    | 16
Cores per chip    | 2                                    | 2
CPU type          | 1.8 GHz AMD Opteron                  | 1.9 GHz IBM Power5+
Memory per node   | 6 GB                                 | 32 GB/node for 49 nodes, 64 GB/node for 3 nodes

[Figures: Chip Architecture of Dori; Chip Architecture of Hydra (IBM p5-575)]

Results on Hydra
[Figures: MPI vs Hybrid execution times using 64x64x64; MPI vs Hybrid execution times using 128x128x128; MPI vs Hybrid speedups using 64x64x64; MPI vs Hybrid speedups using 128x128x128]
- The results show that the MPI LBM outperforms the hybrid LBM on Hydra
- Because this is a strong-scaling study, the execution time decreases as the number of cores increases, and hence the speedup increases with the number of cores
- For the MPI LBM with the 64x64x64 dataset on more than 32 cores, the execution time starts to increase because of communication overhead
- Some data points are missing for 128x128x128 because of large memory requirements
- Because of large memory requirements, neither the hybrid nor the MPI LBM could be run for the 256x256x256 problem size

Results on Dori
[Figures: MPI vs Hybrid on Dori using 64x64x64 dataset; MPI vs Hybrid on Dori using 256x256x256 dataset; MPI vs Hybrid speedups using 64x64x64; MPI vs Hybrid speedups using 128x128x128]
- The results show that the MPI-only LBM outperforms the hybrid LBM on Dori, except on 32 cores for the 64x64x64 dataset
- For each programming paradigm, the execution time decreases as the number of cores increases, and hence the speedup increases with the number of cores
- For the MPI LBM with the 64x64x64 dataset on 32 cores, the execution time starts to increase
- The energy consumption data shows that the MPI LBM consumes less energy than the hybrid LBM
[Table: Energy consumption data using the 64x64x64 dataset — execution time plus system, CPU, memory, hard disk and motherboard energy (kJ) for MPI vs Hybrid at each core count; numeric values not preserved]

Summary and Conclusion
- A hybrid version of the parallel LBM program was developed
- Our experimental results show that MPI performs better than the hybrid on both multicore systems, Hydra and Dori
- The energy consumption results show that MPI consumes less energy than the hybrid on Dori
- Due to large memory requirements, neither the hybrid nor the MPI LBM could be run for large problem sizes such as 256x256x256 and 512x512x512
- Through this project, we learned parallel programming with OpenMP and MPI as well as performance analysis techniques

Acknowledgment
- I would like to thank Dr. Valerie E. Taylor, Dr. Xingfu Wu and Charles Lively for being awesome mentors and for providing me with a great deal of information and help necessary for the project.
- This research was supported by the Distributed Research Experience for Undergraduates (DREU) program, as well as the Research Experience for Undergraduates (REU) program at the Texas A&M University Computer Science and Engineering department.