Non-uniformly Communicating Non-contiguous Data: A Case Study with PETSc and MPI. P. Balaji, D. Buntinas, S. Balay, B. Smith, R. Thakur and W. Gropp, Mathematics and Computer Science, Argonne National Laboratory.

Non-uniformly Communicating Non-contiguous Data: A Case Study with PETSc and MPI
P. Balaji, D. Buntinas, S. Balay, B. Smith, R. Thakur and W. Gropp
Mathematics and Computer Science, Argonne National Laboratory

Numerical Libraries in HEC
Developing parallel applications is a complex task
– Discretizing physical equations to numerical forms
– Representing the domain of interest as data points
Libraries allow developers to abstract low-level details
– E.g., numerical analysis, communication, I/O
Numerical libraries (e.g., PETSc, ScaLAPACK, PESSL) provide:
– Parallel data layout and processing
– Tools for distributed data layout (matrix, vector)
– Tools for data processing (SLES, SNES)

Overview of PETSc
Portable, Extensible Toolkit for Scientific Computing
Software tools for solving PDEs
– Suite of routines to create vectors, matrices and distributed arrays
– Sequential/parallel data layout
– Linear and nonlinear numerical solvers
Widely used in nanosimulations, molecular dynamics, etc.
Uses MPI for communication
[Diagram: PETSc software stack by level of abstraction: Application Codes; PDE Solvers; TS (Time Stepping), SNES (Nonlinear Equation Solvers), SLES (Linear Equation Solvers) built on KSP (Krylov Subspace Methods) and PC (Preconditioners); Matrices, Vectors, Index Sets, Draw; BLAS, LAPACK, MPI]
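As a rough illustration of the abstraction level described on this slide, the sketch below assembles a small distributed linear system and solves it. It is a minimal sketch against the current PETSc C API (KSP is the successor of the SLES component named above), not code from the presentation; the matrix, sizes, and solver options are illustrative.

```c
/* Minimal PETSc linear-solve sketch (illustrative only): assemble a small
 * tridiagonal system in parallel and solve it with the Krylov solver (KSP). */
#include <petscksp.h>

int main(int argc, char **argv)
{
  Mat A; Vec x, b; KSP ksp;
  PetscInt i, istart, iend, n = 100;

  PetscInitialize(&argc, &argv, NULL, NULL);

  /* Distributed matrix and vectors; PETSc chooses the parallel layout. */
  MatCreate(PETSC_COMM_WORLD, &A);
  MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);
  MatSetFromOptions(A);
  MatSetUp(A);
  MatGetOwnershipRange(A, &istart, &iend);
  for (i = istart; i < iend; i++) {
    if (i > 0)     MatSetValue(A, i, i - 1, -1.0, INSERT_VALUES);
    if (i < n - 1) MatSetValue(A, i, i + 1, -1.0, INSERT_VALUES);
    MatSetValue(A, i, i, 2.0, INSERT_VALUES);
  }
  MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
  MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

  MatCreateVecs(A, &x, &b);
  VecSet(b, 1.0);

  /* Krylov solver plus preconditioner, configurable from the command line. */
  KSPCreate(PETSC_COMM_WORLD, &ksp);
  KSPSetOperators(ksp, A, A);
  KSPSetFromOptions(ksp);
  KSPSolve(ksp, b, x);

  KSPDestroy(&ksp); MatDestroy(&A); VecDestroy(&x); VecDestroy(&b);
  PetscFinalize();
  return 0;
}
```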

Handling Parallel Data Layouts in PETSc
Grid layout exposed to the application
– Structured or unstructured (1D, 2D, 3D)
– Internally managed as a single vector of data elements
– Representation often suited to optimize its operations
Impact on communication:
– Data representation and communication pattern might not be ideal for MPI communication operations
– Non-uniformity and non-contiguity in communication are the primary culprits

Presentation Layout
– Introduction
– Impact of PETSc Data Layout and Processing on MPI
– MPI Enhancements and Optimizations
– Experimental Evaluation
– Concluding Remarks and Future Work

Data Layout and Processing in PETSc
Grid layouts: data is divided among processes
– Ghost data points are shared
Non-contiguous data communication
– 2nd dimension of the grid
Non-uniform communication
– Structure of the grid
– Stencil type used
– Sides larger than corners
[Figure: local data points, ghost data points and process boundaries for two processes, shown for a box-type stencil and a star-type stencil]
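To make the ghost-point exchange concrete, here is a minimal sketch using PETSc's distributed-array interface. It assumes the current DMDA API (the presentation predates these names and used the older DA interface), and the grid size, stencil type, and stencil width are illustrative.

```c
/* Sketch of a ghost-point exchange for a 2-D structured grid using PETSc's
 * distributed arrays (current DMDA API). Grid size, stencil width, and
 * stencil type are illustrative. */
#include <petscdmda.h>

int main(int argc, char **argv)
{
  DM  da;
  Vec global, local;

  PetscInitialize(&argc, &argv, NULL, NULL);

  /* 2-D grid, star-type stencil of width 1: corners are not ghosted,
   * so each process exchanges only with its four side neighbors. */
  DMDACreate2d(PETSC_COMM_WORLD,
               DM_BOUNDARY_NONE, DM_BOUNDARY_NONE,
               DMDA_STENCIL_STAR,          /* DMDA_STENCIL_BOX adds corners */
               128, 128,                   /* global grid points in x, y    */
               PETSC_DECIDE, PETSC_DECIDE, /* process grid chosen by PETSc  */
               1,                          /* one degree of freedom / point */
               1,                          /* stencil (ghost) width         */
               NULL, NULL, &da);
  DMSetUp(da);

  DMCreateGlobalVector(da, &global);  /* owned points only           */
  DMCreateLocalVector(da, &local);    /* owned points + ghost region */
  VecSet(global, 1.0);

  /* Fill the ghost points: internally this triggers the non-uniform,
   * non-contiguous MPI communication discussed in the talk. */
  DMGlobalToLocalBegin(da, global, INSERT_VALUES, local);
  DMGlobalToLocalEnd(da, global, INSERT_VALUES, local);

  VecDestroy(&local); VecDestroy(&global); DMDestroy(&da);
  PetscFinalize();
  return 0;
}
```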

Non-contiguous Communication in MPI
MPI derived datatypes
– Application describes the noncontiguous data layout to MPI
– Data is either packed into contiguous buffers and pipelined (sparse layouts) or sent individually (dense layouts)
Good for simple algorithms, but very restrictive
– Looking up upcoming content to pre-decide which algorithm to use requires multiple parses on the datatype
– Multiple parses on the datatype lose the context!
[Figure: a non-contiguous data layout packed into a buffer and sent, with the datatype context saved between packing steps]
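A minimal sketch of the derived-datatype approach described above, assuming a row-major 2-D grid whose second dimension (a strided column) is sent to a neighbor; the grid dimensions, peer rank, and tag are illustrative.

```c
/* Sketch: describing one column of a row-major 2-D grid (the "2nd dimension"
 * of the previous slide) with an MPI derived datatype. */
#include <mpi.h>

void send_column(double *grid, int nrows, int ncols, int col,
                 int peer, MPI_Comm comm)
{
  MPI_Datatype column;

  /* nrows blocks of 1 double, separated by a stride of ncols doubles. */
  MPI_Type_vector(nrows, 1, ncols, MPI_DOUBLE, &column);
  MPI_Type_commit(&column);

  /* The MPI library either packs this into a contiguous buffer and
   * pipelines it (sparse layout) or sends the segments individually
   * (dense layout). */
  MPI_Send(&grid[col], 1, column, peer, /*tag=*/0, comm);

  MPI_Type_free(&column);
}
```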

Issues with Lost Datatype Context
Rollback of the context is not possible
– Datatypes could be recursive
Duplication of the context is not possible
– The context information might be large
– When datatype elements are small, the context could be larger than the datatype itself
Searching for the context is possible, but very expensive
– Search time increases quadratically with datatype size
– This is the currently used mechanism!

Non-uniform Collective Communication
Algorithms for non-uniform collective communication are optimized for "uniform" communication
Case studies:
– Allgatherv uses a ring algorithm
   Causes idleness if data volumes are very different
– Alltoallw sends data to nodes in round-robin manner
   MPI processing is sequential
[Figure: ring exchange with one large message and several small messages]
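For reference, the sketch below shows the kind of non-uniform Allgatherv call in question: each rank contributes a different element count (exchanged here with an auxiliary Allgather, an illustrative choice; in PETSc the counts come from the data layout itself), and a ring schedule stalls behind the one large contribution.

```c
/* Sketch of a non-uniform Allgatherv call: each rank contributes a
 * different number of elements (my_count is illustrative). */
#include <mpi.h>
#include <stdlib.h>

void gather_all(const double *mine, int my_count, MPI_Comm comm,
                double **out, int *out_len)
{
  int size, i, total = 0;
  MPI_Comm_size(comm, &size);

  int *counts = malloc(size * sizeof(int));
  int *displs = malloc(size * sizeof(int));

  /* Every rank learns how much each peer will contribute. */
  MPI_Allgather(&my_count, 1, MPI_INT, counts, 1, MPI_INT, comm);
  for (i = 0; i < size; i++) { displs[i] = total; total += counts[i]; }

  *out = malloc(total * sizeof(double));
  *out_len = total;

  /* With a ring schedule, one very large contribution stalls the whole
   * ring; the optimizations in the next section address exactly this. */
  MPI_Allgatherv(mine, my_count, MPI_DOUBLE,
                 *out, counts, displs, MPI_DOUBLE, comm);

  free(counts); free(displs);
}
```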

Presentation Layout
– Introduction
– Impact of PETSc Data Layout and Processing on MPI
– MPI Enhancements and Optimizations
– Experimental Evaluation
– Concluding Remarks and Future Work

Dual-context Approach for Non-contiguous Communication
Previous approaches are inefficient in complex designs
– E.g., if a look-ahead is performed to understand the structure of the upcoming data, the saved context is lost
The dual-context approach retains the data context
– Look-aheads are performed using a separate context
– Completely eliminates the search time
[Figure: packing with look-ahead, where the send path and the look-ahead each save their own context]
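The sketch below is only an illustration of the dual-context idea, not the actual MPICH data structures: the state of a partially processed datatype lives in a small context object, and the look-ahead walks a private copy of that state, so the primary packing context never has to be searched for or rebuilt. The context fields, the 64 KB threshold, and the block_size callback are assumptions made for the example.

```c
/* Illustrative dual-context sketch (not MPICH internals). */
#include <stddef.h>

typedef struct {
  size_t block_index;   /* which noncontiguous block we are in       */
  size_t byte_offset;   /* how far into that block packing has gone  */
} dt_context;

/* Look ahead at upcoming blocks to pick an algorithm (e.g., pack-and-
 * pipeline vs. send segments individually) without losing our place. */
static int choose_algorithm(const dt_context *primary,
                            size_t (*block_size)(size_t), size_t nblocks)
{
  dt_context scratch = *primary;     /* second, throw-away context */
  size_t bytes = 0;

  while (scratch.block_index < nblocks && bytes < 64 * 1024) {
    bytes += block_size(scratch.block_index) - scratch.byte_offset;
    scratch.block_index++;
    scratch.byte_offset = 0;
  }
  /* primary is untouched; packing resumes exactly where it stopped. */
  return (bytes >= 64 * 1024);       /* 1: dense enough to send directly */
}
```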

Non-Uniform Communication: AllGatherv
A single point of distribution is the primary bottleneck
Identify whether a small fraction of the messages are very large
– Floyd and Rivest algorithm
– Linear-time detection of outliers
Binomial algorithms
– Recursive doubling or dissemination
– Logarithmic time
[Figure: one large message among many small messages]
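A sketch of the selection step only, with a plain linear scan standing in for the Floyd and Rivest selection used in the actual work; the 10x outlier threshold is an illustrative assumption, and the binomial (recursive doubling or dissemination) exchange itself is not shown.

```c
/* Decide whether a few counts are large outliers and, if so, prefer a
 * logarithmic (recursive-doubling / dissemination) schedule over the
 * default ring. Threshold and heuristic are illustrative. */
#include <stddef.h>

static int prefer_binomial_allgatherv(const int *counts, int nprocs)
{
  long total = 0;
  int  max = 0, i;

  for (i = 0; i < nprocs; i++) {           /* single linear pass */
    total += counts[i];
    if (counts[i] > max) max = counts[i];
  }
  if (nprocs == 0 || total == 0) return 0;

  /* If the largest contribution dwarfs the average, a ring schedule will
   * idle most processes behind that one message. */
  double avg = (double)total / nprocs;
  return max > 10.0 * avg;
}
```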

Non-uniform Communication: Alltoallw
Distributing the messages to be sent out into bins (based on message size) allows differential treatment of nodes
Send out small messages first
– Nodes waiting for small messages wait less
– The relative increase in time for nodes waiting for larger messages is much smaller
– No skew for zero-byte data, with less synchronization
Most helpful for non-contiguous messages
– MPI processing (e.g., packing) is sequential for non-contiguous messages
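The sketch below illustrates the smallest-first ordering at the MPI layer; it is not the library's implementation. It sorts destinations by outgoing size (the design above bins messages by size instead, which is cheaper at scale), and the buffer layout, tag, and use of MPI_BYTE are assumptions for the example.

```c
/* Sketch of the "small messages first" idea behind the optimized Alltoallw:
 * post sends in increasing order of message size so peers waiting only on
 * small messages are released early. */
#include <mpi.h>
#include <stdlib.h>

void send_small_first(const char *sendbuf, const int *sendbytes,
                      const int *sdispls, MPI_Comm comm,
                      MPI_Request *reqs /* at least nprocs entries */)
{
  int nprocs, i, j;
  MPI_Comm_size(comm, &nprocs);

  /* Rank the destinations by outgoing message size (simple insertion sort
   * here; binning by size is cheaper at scale). */
  int *order = malloc(nprocs * sizeof(int));
  for (i = 0; i < nprocs; i++) order[i] = i;
  for (i = 1; i < nprocs; i++)
    for (j = i; j > 0 && sendbytes[order[j]] < sendbytes[order[j - 1]]; j--) {
      int t = order[j]; order[j] = order[j - 1]; order[j - 1] = t;
    }

  for (i = 0; i < nprocs; i++) {
    int peer = order[i];                 /* smallest messages go out first */
    MPI_Isend(sendbuf + sdispls[peer], sendbytes[peer], MPI_BYTE,
              peer, /*tag=*/0, comm, &reqs[peer]);
  }
  free(order);
  /* Matching receives and MPI_Waitall are left to the caller. */
}
```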

Presentation Layout
– Introduction
– Impact of PETSc Data Layout and Processing on MPI
– MPI Enhancements and Optimizations
– Experimental Evaluation
– Concluding Remarks and Future Work

Experimental Testbed
64-node cluster
– 32 nodes with dual Intel EM64T 3.6 GHz processors
   2 MB L2 cache, 2 GB DDR2 400 MHz SDRAM
   Intel E7520 (Lindenhurst) chipset
– 32 nodes with dual Opteron 2.8 GHz processors
   1 MB L2 cache, 4 GB DDR 400 MHz SDRAM
   NVidia 2200/2050 chipset
RedHat AS4 with a kernel.org kernel
InfiniBand DDR (16 Gbps) network:
– MT25208 adapters connected through a 144-port switch
MVAPICH MPI implementation

Non-uniform Communication Evaluation
Search time can dominate performance if the working context is lost!

AllGatherv Evaluation

Alltoallw Evaluation
Our algorithm reduces the skew introduced by the Alltoallw operations by sending out smaller messages first and allowing the corresponding applications to progress

PETSc Vector Scatter

3-D Laplacian Multigrid Solver

Presentation Layout
– Introduction
– Impact of PETSc Data Layout and Processing on MPI
– MPI Enhancements and Optimizations
– Experimental Evaluation
– Concluding Remarks and Future Work

Non-uniform and non-contiguous communication is inherent in several libraries and applications
Current algorithms deal with non-uniform communication in the same way as uniform communication
We demonstrated that more sophisticated algorithms can give close to 10x improvements in performance
The designs are a part of MPICH
– To be picked up by MPICH2 derivatives in later releases
Future work:
– Skew tolerance in non-uniform communication
– Other libraries and applications

Thank You Group Web-page: Home-page:

Backup Slides

Noncontiguous Communication in PETSc
Data might not always be contiguously laid out in memory
– E.g., the second dimension of a structured grid
Communication is performed by packing data
Pipelining copy and communication is important for performance
[Figure: a strided layout described as a vector (count = 8, stride = 8) of contiguous blocks (count = 3 doubles), packed into a contiguous copy buffer]
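A sketch of the pack-and-pipeline idea for a strided layout like the one in this diagram, simplified to one element per block; the chunk size, peer rank, and tag are illustrative, and the matching receive side is omitted.

```c
/* Sketch of pipelining the copy with the communication: pack a few blocks
 * at a time with MPI_Pack and send each chunk while the next one is packed. */
#include <mpi.h>
#include <stdlib.h>

#define BLOCKS_PER_CHUNK 2

void pipelined_send(const double *grid, int count, int stride,
                    int peer, MPI_Comm comm)
{
  int i, packsize, nreq = 0;
  int nchunks = (count + BLOCKS_PER_CHUNK - 1) / BLOCKS_PER_CHUNK;

  MPI_Pack_size(BLOCKS_PER_CHUNK, MPI_DOUBLE, comm, &packsize);
  char *buf = malloc((size_t)nchunks * packsize);
  MPI_Request *reqs = malloc(nchunks * sizeof(MPI_Request));
  char *cur = buf;

  for (i = 0; i < count; i += BLOCKS_PER_CHUNK) {
    int pos = 0, b;
    int nb = (count - i < BLOCKS_PER_CHUNK) ? count - i : BLOCKS_PER_CHUNK;

    /* Copy nb strided elements into a contiguous chunk ... */
    for (b = 0; b < nb; b++)
      MPI_Pack(&grid[(i + b) * stride], 1, MPI_DOUBLE,
               cur, packsize, &pos, comm);

    /* ... and overlap its transfer with the packing of the next chunk. */
    MPI_Isend(cur, pos, MPI_PACKED, peer, /*tag=*/0, comm, &reqs[nreq++]);
    cur += packsize;
  }
  MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);
  free(reqs); free(buf);
}
```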

Hand-tuning vs. Automated Optimization
Non-uniformity and non-contiguity in data communication are inherent in several applications
– Communicating unequal amounts of data to the different peer processes
– Communicating data from noncontiguous memory locations
Previous research has primarily focused on uniform and contiguous data communication
Accordingly, applications and libraries have tried hand-tuning to convert their communication into these formats
– Manually packing noncontiguous data
– Re-implementing collective operations in the application

Non-contiguous Communication in MPI
MPI derived datatypes
– Common approach for non-contiguous communication
– Application describes the noncontiguous data layout to MPI
– Data is either packed into contiguous memory (sparse layouts) or sent as independent segments (dense layouts)
Pipelining of packing and communication improves performance, but requires context information!
[Figure: a non-contiguous data layout packed into a buffer and sent, with the datatype context saved between steps]

Issues with Non-contiguous Communication
The current approach is simple and works as long as there is a single parse on the noncontiguous data
More intelligent algorithms might suffer:
– E.g., looking up upcoming datatype content to pre-decide which algorithm to use
– Multiple parses on the datatype lose the context!
– Searching for the lost context every time requires quadratically increasing time with datatype size
PETSc non-contiguous communication suffers from such high search times

MPI-level Evaluation

Experimental Results
MPI-level micro-benchmarks
– Non-contiguous data communication time
– Non-uniform collective communication
   Allgatherv operation
   Alltoallw operation
PETSc vector scatter benchmark
– Performs communication only
3-D Laplacian multigrid solver application
– Partial differential equation solver
– Utilizes PETSc numerical solver operations