SAN DIEGO SUPERCOMPUTER CENTER at the UNIVERSITY OF CALIFORNIA, SAN DIEGO
Advanced User Support for the MPCUGLES Code at the University of Minnesota
October 09, 2008
Mahidhar Tatineni (SDSC), Lonnie Crosby (NICS), John Cazes (TACC)

Overview of the MPCUGLES Code
MPCUGLES is an unstructured-grid large eddy simulation code, written in Fortran 90/MPI and developed by Prof. Mahesh Krishnan's group at the University of Minnesota, that can handle very complex geometries. The incompressible flow algorithm uses a staggered approach: face-normal velocities are stored at face centroids, while velocity and pressure are stored at cell centroids. The nonlinear terms are discretized so that discrete energy conservation is enforced. The code also uses the HYPRE library (developed at LLNL), a set of high-performance preconditioners, to solve the sparse linear systems that arise in the main algorithm. MPCUGLES has been run at scale on up to 2,048 cores with 50 million control volumes, on Blue Gene (SDSC), DataStar (SDSC), Ranger (TACC) and Kraken (NICS).
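To illustrate how the sparse pressure system maps onto HYPRE, the sketch below solves an already-assembled distributed system with the BoomerAMG solver through HYPRE's IJ interface. This is a minimal sketch in C against the modern HYPRE API; the actual MPCUGLES solver is Fortran 90 code built against HYPRE 1.8.2b, and the function name, tolerance, and iteration limit here are illustrative assumptions, not values taken from the code.

/* Minimal sketch: solve A x = b with HYPRE's BoomerAMG solver. The IJ
 * objects are assumed to be filled and assembled by the caller, with one
 * contiguous block of rows per MPI rank. Error checks omitted for brevity. */
#include "HYPRE.h"
#include "HYPRE_IJ_mv.h"
#include "HYPRE_parcsr_ls.h"

void solve_pressure(HYPRE_IJMatrix A, HYPRE_IJVector b, HYPRE_IJVector x)
{
    HYPRE_ParCSRMatrix parcsr_A;
    HYPRE_ParVector    par_b, par_x;
    HYPRE_Solver       amg;

    /* Pull the underlying ParCSR objects out of the IJ wrappers. */
    HYPRE_IJMatrixGetObject(A, (void **) &parcsr_A);
    HYPRE_IJVectorGetObject(b, (void **) &par_b);
    HYPRE_IJVectorGetObject(x, (void **) &par_x);

    HYPRE_BoomerAMGCreate(&amg);
    HYPRE_BoomerAMGSetTol(amg, 1.0e-7);   /* convergence tolerance (assumed value) */
    HYPRE_BoomerAMGSetMaxIter(amg, 100);  /* iteration cap (assumed value) */

    HYPRE_BoomerAMGSetup(amg, parcsr_A, par_b, par_x);
    HYPRE_BoomerAMGSolve(amg, parcsr_A, par_b, par_x);
    HYPRE_BoomerAMGDestroy(amg);
}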

General Requirements
Grid generation, initial condition generation, and partitioning for the runs are done with the METIS software. For the larger grids the experimental metis-5.0pre1 version is required (a previous ASTA project uncovered a problem with the metis-4.0 version for large-scale cases).
I/O in the code is done with NetCDF: each processor writes its own files in NetCDF format, so there is no MPI-IO or parallel NetCDF requirement (a per-rank output sketch follows this slide).
The HYPRE library (from LLNL) provides high-performance preconditioners, including parallel multigrid methods for both structured and unstructured grid problems. The code is compiled against HYPRE version 1.8.2b and uses the library's algebraic multigrid solver (BoomerAMG); MPCUGLES also has the option of using a conjugate-gradient method as an alternative.
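To make the one-file-per-rank I/O pattern concrete, here is a minimal C sketch of a single MPI rank writing its local field to its own NetCDF file with the standard NetCDF C API. The file-naming scheme, variable name, and array layout are illustrative assumptions and are not taken from MPCUGLES (which is Fortran 90).

/* Minimal sketch: each MPI rank writes its own NetCDF file (no MPI-IO,
 * no parallel NetCDF). File and variable names are assumptions;
 * error checks omitted for brevity. */
#include <stdio.h>
#include <mpi.h>
#include <netcdf.h>

void write_rank_file(const double *pressure, size_t ncv_local)
{
    int rank, ncid, dim_cv, var_p;
    char fname[64];

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    snprintf(fname, sizeof fname, "soln_rank%05d.nc", rank);

    nc_create(fname, NC_CLOBBER, &ncid);               /* one file per rank */
    nc_def_dim(ncid, "control_volumes", ncv_local, &dim_cv);
    nc_def_var(ncid, "pressure", NC_DOUBLE, 1, &dim_cv, &var_p);
    nc_enddef(ncid);                                   /* leave define mode */

    nc_put_var_double(ncid, var_p, pressure);          /* write local data */
    nc_close(ncid);
}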

Porting to Ranger and Kraken
The code was recently ported to both of the available Track 2 systems, Ranger and Kraken. Compiling it on both machines was relatively straightforward: both systems already had the NetCDF libraries installed, and the required versions of the HYPRE library (1.8.2b) and METIS (5.0pre1) were easy to install. The grid and initial condition generation codes are currently serial; for the current scaling studies they were run on Ranger (1 processor per node, 32 GB of memory) or DataStar (1 processor per p690 node, 128 GB). This is a potential bottleneck for larger runs (>50 million CVs), so part of the current AUS project will focus on parallelizing this step so that much larger grid sizes can be considered.
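For context on the partitioning step that the grid-generation tools feed, the sketch below splits the mesh's cell-connectivity (dual) graph across ranks with METIS. It is written against the released METIS 5.x C API; the metis-5.0pre1 pre-release actually used here may differ in detail, and the array names are illustrative assumptions rather than MPCUGLES data structures.

/* Minimal sketch: partition the dual graph of the mesh (one vertex per
 * control volume, edges between face-neighbors) into nparts pieces.
 * Input array names are assumptions. */
#include <metis.h>

int partition_cells(idx_t ncells, idx_t *xadj, idx_t *adjncy,
                    idx_t nparts, idx_t *part /* out: part id per cell */)
{
    idx_t ncon = 1;      /* one balance constraint */
    idx_t objval;        /* edge cut returned by METIS */

    return METIS_PartGraphKway(&ncells, &ncon, xadj, adjncy,
                               NULL, NULL, NULL,   /* no vertex/edge weights */
                               &nparts, NULL, NULL, NULL,
                               &objval, part);
}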

Performance on Ranger
[Tables: strong scaling on a 257^3 grid and weak scaling at 64K CVs/task, comparing 4-way and 8-way (plus one 2-way) runs per node.]

Performance on Kraken
[Tables: strong scaling on a 257^3 grid and weak scaling at 64K CVs/task, comparing 1-way and 2-way runs per node.]

Comments on Performance
Strong scaling for the 16 million control volume case is acceptable up to 256 cores on Ranger and 512 cores on Kraken; the primary factor is the network bandwidth available per core, which is higher on Kraken. Overall the code scales reasonably well when there are ~32-64K CVs per task, which is consistent with previous results on DataStar. Based on the communication pattern seen in older runs (mostly nearest-neighbor), the code should exhibit good weak scaling. The weak-scaling results are acceptable up to 256 cores but show a jump in run times beyond that. One likely cause is that the underlying solver takes longer to converge as the number of CVs increases (this is not an isotropic problem; it is a wall-bounded channel flow). Weak-scaling runs at 65K CVs/task beyond 512 cores are currently restricted by grid size limitations, which needs to be addressed.
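For reference, the scaling statements above can be quantified with the usual efficiency definitions (standard terminology, not formulas taken from the slides). With T(p) the run time on p cores and p_0 the smallest measured core count:

\[
E_{\mathrm{strong}}(p) = \frac{p_0\,T(p_0)}{p\,T(p)}, \qquad
E_{\mathrm{weak}}(p) = \frac{T(p_0)}{T(p)} \quad \text{(fixed work per task, here about 64K CVs)} .
\]

Perfect scaling corresponds to an efficiency of 1; the jump in weak-scaling run times beyond 256 cores corresponds to E_weak dropping noticeably below 1.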

Future Work
Near term: Redo the weak scaling runs with an isotropic case to see whether that avoids the extra work required of the underlying solver. Run at larger processor counts on both Ranger and Kraken with profiling and performance tools to analyze the performance.
Long term: Parallelize the initial condition and grid generation codes to enable scaling to much larger processor counts. Investigate the performance implications of changing the underlying linear solver and see whether improvements can be made; for example, the CG algorithm scales much better (tests on Kraken already show this) but takes longer to converge, so there is a tradeoff.
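As a reference point for that solver comparison, the sketch below shows what a conjugate-gradient solve looks like through HYPRE's ParCSR PCG interface, optionally preconditioned with a single BoomerAMG V-cycle per iteration. This is a minimal C sketch against the modern HYPRE API, not the MPCUGLES implementation; the tolerance and iteration limit are illustrative assumptions.

/* Minimal sketch: CG (optionally AMG-preconditioned) through HYPRE's
 * ParCSR PCG interface. Settings are assumptions, not MPCUGLES values. */
#include <mpi.h>
#include "HYPRE.h"
#include "HYPRE_krylov.h"
#include "HYPRE_parcsr_ls.h"

void solve_with_cg(MPI_Comm comm, HYPRE_ParCSRMatrix A,
                   HYPRE_ParVector b, HYPRE_ParVector x, int use_amg_precond)
{
    HYPRE_Solver pcg, amg = NULL;

    HYPRE_ParCSRPCGCreate(comm, &pcg);
    HYPRE_PCGSetMaxIter(pcg, 1000);   /* CG typically needs many more iterations than AMG */
    HYPRE_PCGSetTol(pcg, 1.0e-7);
    HYPRE_PCGSetTwoNorm(pcg, 1);

    if (use_amg_precond) {
        /* One AMG V-cycle per CG iteration as the preconditioner. */
        HYPRE_BoomerAMGCreate(&amg);
        HYPRE_BoomerAMGSetMaxIter(amg, 1);
        HYPRE_BoomerAMGSetTol(amg, 0.0);
        HYPRE_PCGSetPrecond(pcg,
                            (HYPRE_PtrToSolverFcn) HYPRE_BoomerAMGSolve,
                            (HYPRE_PtrToSolverFcn) HYPRE_BoomerAMGSetup,
                            amg);
    }

    HYPRE_ParCSRPCGSetup(pcg, A, b, x);
    HYPRE_ParCSRPCGSolve(pcg, A, b, x);

    if (amg) HYPRE_BoomerAMGDestroy(amg);
    HYPRE_ParCSRPCGDestroy(pcg);
}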