Adaptive MPI Milind A. Bhandarkar

Motivation
Many CSE applications exhibit dynamic behavior:
- Adaptive mesh refinement (AMR)
- Pressure-driven crack propagation in solid propellants
Also, non-dedicated supercomputing platforms such as clusters affect processor availability.
These factors cause severe load imbalance.
Can the parallel language / runtime system help?

Load Imbalance in Crack-Propagation Application (More and more cohesive elements are activated between iterations 35 and 40.)

Load Imbalance in an AMR Application (Mesh is refined at the 25th time step.)

Multi-partition Decomposition
Basic idea:
- Decompose the problem into a number of partitions, independent of the number of processors
- Keep # partitions > # processors
- The system maps partitions to processors, and re-maps objects as needed
- Re-mapping strategies help adapt to dynamic variations
To make this work, we need a load balancing framework and runtime support.
But isn't there a high overhead to multi-partitioning? (A minimal mapping sketch follows below.)
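As a minimal illustration (not code from the talk; the function name and the block mapping are my own assumptions): the program is written in terms of P partitions, and the initial assignment of partitions to the N available processors is just a function that the runtime is later free to override when it re-maps objects.

    #include <stdio.h>

    /* Initial block mapping of P partitions onto N processors (P > N).
     * The runtime may later re-map individual partitions to correct load
     * imbalance; this only computes the starting assignment. */
    static int initial_processor_of(int partition, int P, int N) {
        return (int)((long long)partition * N / P);
    }

    int main(void) {
        const int P = 16, N = 4;   /* e.g., 16 partitions on 4 processors */
        for (int p = 0; p < P; p++)
            printf("partition %2d -> processor %d\n",
                   p, initial_processor_of(p, P, N));
        return 0;
    }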

"Overhead" of Multi-partition Decomposition (Crack-propagation code with 70k elements)

Charm++
Supports data-driven objects:
- Singleton objects, object arrays, groups, ...
- Many objects per processor, with method execution scheduled on availability of data
- Supports object migration, with automatic forwarding
Excellent results with highly irregular & dynamic applications:
- The molecular dynamics application NAMD achieved a speedup of 1250 on 2000 processors of ASCI Red
- Brunner, Phillips, & Kale, "Scalable molecular dynamics", Gordon Bell finalist, SC2000

Charm++: System Mapped Objects

Load Balancing Framework

However...
Many CSE applications are written in Fortran with MPI, and conversion to a parallel object-based language such as Charm++ is cumbersome:
- The message-driven style requires split-phase transactions (see the sketch below)
- Conversion often results in a complete rewrite
How can existing MPI applications be converted without an extensive rewrite?
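To make the contrast concrete, here is a hedged sketch (the halo exchange and the name jacobi_step are illustrative, not from the talk) of the blocking style that existing MPI code relies on, followed by a note on the split-phase restructuring a message-driven model would demand.

    #include <mpi.h>

    /* Blocking MPI style: the control flow reads straight through the step. */
    void jacobi_step(double *local, double *halo, int n, int left, int right) {
        MPI_Request reqs[2];
        MPI_Irecv(&halo[0], 1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Irecv(&halo[1], 1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);
        MPI_Send(&local[0],     1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD);
        MPI_Send(&local[n - 1], 1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD);
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);   /* block until both halos arrive */
        /* ... compute using local[] and halo[] ... */
    }

    /* In a purely message-driven model, the blocking MPI_Waitall is not
     * available: the step must be split into one method that sends the
     * boundaries and returns, and a second method the runtime invokes when
     * each neighbor's value arrives, with the iteration state carried in the
     * object between the two halves. That is the split-phase transaction
     * the slide refers to. */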

Solution:
- Implement each partition as a user-level thread associated with a message-driven object
- Provide a communication library for these threads that is identical in syntax and semantics to MPI
But what about the overheads associated with threads? (A usage sketch follows below.)
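A usage sketch under stated assumptions: the source below is ordinary MPI, and only the build and launch change. The ampicc wrapper and the +vp virtual-processor option are names I recall from the AMPI documentation and may differ by version.

    /* hello.c: an ordinary MPI program; nothing AMPI-specific in the source. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        /* Under AMPI each rank is a migratable user-level thread, so "size"
         * counts virtual processors, not physical ones. */
        printf("rank %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }

    /* Assumed build/run commands (check the AMPI manual for your version):
     *   ampicc hello.c -o hello
     *   ./charmrun +p4 ./hello +vp16    # 16 virtual ranks on 4 processors
     */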

AMPI Threads Vs MD Objects (1D Decomposition)

AMPI Threads Vs MD Objects (3D Decomposition)

Thread Migration
Thread stacks may contain references to local variables; these references may no longer be valid after migration to a different address space.
Solution: thread stacks should span the same virtual addresses on any processor to which they may migrate ("isomalloc"):
- Split the virtual address space into per-processor allocation pools
- The resulting scalability issues are not important on 64-bit processors
- Otherwise, use constrained load balancing (limit a thread's migratability to fewer processors)
A minimal sketch of the idea follows below.
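A hedged sketch of the isomalloc idea, not the actual Charm++ implementation: each (virtual) processor owns a disjoint slice of the virtual address space and allocates its thread stacks only inside that slice, so a migrated stack can be re-created at exactly the same addresses on the destination node. The base address, slice size, and direct use of MAP_FIXED below are illustrative assumptions.

    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>

    /* Illustrative layout: one reserved region carved into per-processor slices. */
    #define ISO_BASE  ((uintptr_t)0x100000000000ULL)   /* assumed region start */
    #define ISO_SLICE ((size_t)1 << 32)                /* 4 GB of virtual space per processor */

    /* Map 'len' bytes at a fixed virtual address inside processor 'pe's slice.
     * Because the address depends only on (pe, offset), the same stack can be
     * re-mapped at the same address on any node. A real isomalloc must verify
     * the range is unused; MAP_FIXED here is purely for illustration. */
    static void *iso_alloc(int pe, size_t offset, size_t len) {
        void *addr = (void *)(ISO_BASE + (uintptr_t)pe * ISO_SLICE + offset);
        void *got = mmap(addr, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
        return got == MAP_FAILED ? NULL : got;
    }

    int main(void) {
        /* A stack created for virtual processor 3 at offset 0 gets the same
         * address wherever it is created, so pointers into it stay valid
         * after the thread's memory is copied to another node. */
        void *stack = iso_alloc(3, 0, 64 * 1024);
        printf("stack mapped at %p\n", stack);
        return 0;
    }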

AMPI Issues: Thread-safety
Multiple threads are mapped to each processor, so "process data" must be localized:
- Make global variables instance variables of a "class"
- All subroutines become instance methods of this class
AMPIzer, a source-to-source translator based on the Polaris front-end, recognizes all global variables and puts them in a thread-private area. (A sketch of the transformation follows below.)
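The talk describes this transformation for Fortran via the Polaris-based AMPIzer; the C sketch below is only my analogue of the same idea: former globals become fields of a per-thread structure that every routine receives explicitly.

    /* Before (unsafe when several AMPI threads share one process):
     *   int    iteration;
     *   double residual;
     *   void solver_step(void) { iteration++; ... }
     *
     * After: the former globals live in a per-thread structure, and the
     * runtime hands each user-level thread its own copy. */
    typedef struct {
        int    iteration;
        double residual;
    } solver_state;

    void solver_step(solver_state *s) {
        s->iteration++;
        /* ... update s->residual using only data reachable from s ... */
    }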

AMPI Issues: Data Migration
Thread-private data needs to be migrated with the thread, so the developer has to write subroutines for packing and unpacking it; writing separate pack and unpack subroutines is error-prone.
Puppers ("pup" = "pack-unpack"):
- A single subroutine that "shows" the data to the runtime system, which can then size, pack, or unpack it
- Fortran90 generic procedures make writing the pupper easy
A hedged example follows below.
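A sketch of a pupper for a small solver state, written against the C PUP interface as I recall it from Charm++/AMPI (pup_c.h, pup_int, pup_doubles, pup_isUnpacking) and a historical MPI_Register-style registration call; treat these names as assumptions and check them against your AMPI version.

    #include <stdlib.h>
    #include "pup_c.h"    /* Charm++ C PUP interface (header name as recalled) */

    typedef struct {
        int     n;
        double *values;   /* array of length n */
    } solver_state;

    /* One routine "shows" the data; the runtime calls it in sizing, packing,
     * and unpacking modes, so pack and unpack can never drift apart. */
    static void solver_state_pup(pup_er p, void *data) {
        solver_state *s = (solver_state *)data;
        pup_int(p, &s->n);
        if (pup_isUnpacking(p))
            s->values = (double *)malloc(s->n * sizeof(double));  /* rebuild on arrival */
        pup_doubles(p, s->values, s->n);
    }

    /* Assumed registration (historical AMPI extension; name may differ):
     *   MPI_Register(&my_state, solver_state_pup);
     * after which MPI_Migrate() can move the thread and its state together. */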

AMPI: Other Features
Automatic checkpoint and restart, even on a different number of processors:
- The number of chunks remains the same, but they can be mapped to a different number of processors
- No additional work is needed: the same pupper used for migration is also used for checkpointing and restart

Adaptive Multi-MPI
Integration of multiple MPI-based modules, e.g., an integrated rocket simulation (ROCFLO, ROCSOLID, ROCBURN, ROCFACE):
- Each module gets its own MPI_COMM_WORLD
- All COMM_WORLDs together form MPI_COMM_UNIVERSE
- Point-to-point communication between different MPI_COMM_WORLDs uses the same AMPI functions
- Communication across modules is also considered while balancing load

Experimental Results

AMR Application With Load Balancing (Load balancer is activated at time steps 20, 40, 60, and 80.)

AMPI Load Balancing on Heterogeneous Clusters (Experiment carried out on a cluster of Linux™ workstations.)

AMPI Vs MPI (This is a scaled problem.)

AMPI “Overhead”

AMPI: Status
Over 70 commonly used functions from MPI 1.1:
- All point-to-point communication functions
- All collective communication functions
- User-defined MPI datatypes
C, C++, and Fortran (77/90) bindings.
Tested on Origin 2000, IBM SP, and Linux and Solaris clusters; it should run on any platform supported by Charm++ that has mmap.