Adaptive MPI
Milind A. Bhandarkar (milind@cs.uiuc.edu)

Motivation
- Many CSE applications exhibit dynamic behavior:
  - Adaptive mesh refinement (AMR)
  - Pressure-driven crack propagation in solid propellants
- Non-dedicated supercomputing platforms, such as clusters, also affect processor availability.
- These factors cause severe load imbalance.
- Can the parallel language / runtime system help?

Load Imbalance in a Crack-Propagation Application (More and more cohesive elements are activated between iterations 35 and 40.)

Load Imbalance in an AMR Application (The mesh is refined at the 25th time step.)

Multi-partition Decomposition
- Basic idea: decompose the problem into a number of partitions, independent of the number of processors (a sketch of such a decomposition follows below).
  - # partitions > # processors
- The system maps partitions to processors, and re-maps objects as needed.
  - Re-mapping strategies help adapt to dynamic variations.
- To make this work, we need a load balancing framework and runtime support.
- But isn't there a high overhead to multi-partitioning?
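
The following minimal sketch (not from the talk; the names Partition and decompose are illustrative) shows the application side of over-decomposition: the domain is split into num_partitions chunks chosen independently of, and typically larger than, the number of physical processors, and the code never asks how many processors exist. The runtime system, not the application, decides where each partition runs.

```cpp
// Minimal sketch of over-decomposition: the partition count is a tuning
// knob independent of (and usually larger than) the processor count.
// The runtime, not this code, maps and re-maps partitions to processors.
#include <vector>

struct Partition {
    int index;        // partition id in [0, num_partitions)
    int begin, end;   // half-open range of mesh cells owned by this partition
};

std::vector<Partition> decompose(int num_cells, int num_partitions) {
    std::vector<Partition> parts;
    parts.reserve(num_partitions);
    for (int i = 0; i < num_partitions; ++i) {
        // Block decomposition: split cells as evenly as possible.
        int begin = static_cast<int>(static_cast<long long>(num_cells) * i / num_partitions);
        int end   = static_cast<int>(static_cast<long long>(num_cells) * (i + 1) / num_partitions);
        parts.push_back({i, begin, end});
    }
    return parts;
}
```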

“Overhead” of Multi-partition Decomposition (Crack Propagation code, with 70k elements)

Charm++
- Supports data-driven objects: singleton objects, object arrays, groups, ... (a minimal sketch follows below).
- Many objects per processor, with method execution scheduled by the availability of data.
- Supports object migration, with automatic message forwarding.
- Excellent results with highly irregular and dynamic applications:
  - The molecular dynamics application NAMD achieved a speedup of 1250 on 2000 processors of ASCI Red.
  - Brunner, Phillips, and Kale, “Scalable molecular dynamics”, Gordon Bell finalist, SC2000.
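
Below is a minimal, hedged sketch of the data-driven object model described above: a 1D array of chare objects whose entry methods run only when their messages arrive. The module, class, and method names (hello, Main, Worker, sayHi) are illustrative assumptions, not taken from the talk; the .ci interface file is shown as a comment.

```cpp
// hello.ci (Charm++ interface file), shown here as a comment:
//   mainmodule hello {
//     readonly CProxy_Main mainProxy;
//     readonly int numWorkers;
//     mainchare Main {
//       entry Main(CkArgMsg *m);
//       entry void done();
//     };
//     array [1D] Worker {
//       entry Worker();
//       entry void sayHi();
//     };
//   };

// hello.C -- many Worker objects per processor; the runtime schedules
// each entry method when its message becomes available.
#include "hello.decl.h"

/* readonly */ CProxy_Main mainProxy;
/* readonly */ int numWorkers;

class Main : public CBase_Main {
  int replies;
 public:
  Main(CkArgMsg *m) : replies(0) {
    delete m;
    mainProxy  = thisProxy;
    numWorkers = 64;                         // typically > number of processors
    CProxy_Worker workers = CProxy_Worker::ckNew(numWorkers);
    workers.sayHi();                         // broadcast to all array elements
  }
  void done() {
    if (++replies == numWorkers) CkExit();   // quit once every worker has replied
  }
};

class Worker : public CBase_Worker {
 public:
  Worker() {}
  Worker(CkMigrateMessage *m) {}             // required so the object can migrate
  void sayHi() {
    CkPrintf("Worker %d running on PE %d\n", thisIndex, CkMyPe());
    mainProxy.done();
  }
};

#include "hello.def.h"
```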

Charm++: System-Mapped Objects (diagram: objects 1–13 assigned to processors by the runtime system)

Load Balancing Framework

However…
- Many CSE applications are written in Fortran and MPI.
- Conversion to a parallel object-based language such as Charm++ is cumbersome:
  - The message-driven style requires split-phase transactions.
  - It often results in a complete rewrite.
- How can existing MPI applications be converted without an extensive rewrite?

Solution
- Each partition is implemented as a user-level thread associated with a message-driven object.
- The communication library for these threads is the same in syntax and semantics as MPI (see the sketch below).
- But what about the overheads associated with threads?
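
As a concrete (hedged) illustration, the program below is ordinary MPI; under AMPI the same source is simply linked against the AMPI library, and every rank becomes a migratable user-level thread, so the program can be run with many more ranks (virtual processors) than physical processors. The run line in the trailing comment follows the usual Charm++/AMPI convention and is an assumption, not taken from the talk.

```cpp
// An ordinary MPI program: under AMPI the same source is compiled and
// linked against the AMPI library, and every MPI rank becomes a
// migratable user-level thread.
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // this partition's id
    MPI_Comm_size(MPI_COMM_WORLD, &size);   // total number of partitions (virtual processors)

    // Nearest-neighbor exchange on a ring, unchanged from plain MPI.
    double send = rank, recv = -1.0;
    int right = (rank + 1) % size;
    int left  = (rank + size - 1) % size;
    MPI_Status status;
    MPI_Sendrecv(&send, 1, MPI_DOUBLE, right, 0,
                 &recv, 1, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, &status);

    std::printf("rank %d of %d received %g from rank %d\n", rank, size, recv, left);

    MPI_Finalize();
    return 0;
}
// Possible run line (an assumption, following the usual Charm++/AMPI convention):
//   ./charmrun +p8 ./pgm +vp64   -- 64 virtual ranks multiplexed onto 8 processors
```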

AMPI Threads vs. Message-Driven (MD) Objects (1D Decomposition)

AMPI Threads vs. Message-Driven (MD) Objects (3D Decomposition)

Thread Migration
- Thread stacks may contain references to local variables, which may not remain valid after migration to a different address space.
- Solution: thread stacks should span the same virtual address range on any processor to which they may migrate ("isomalloc"; see the sketch below).
  - Split the virtual address space into per-processor allocation pools.
- Scalability issues:
  - Not important on 64-bit processors.
  - Constrained load balancing (limit a thread's migratability to fewer processors).
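
A rough sketch of the isomalloc idea follows: reserve one large region of virtual address space, carve it into per-processor slots, and mmap() each thread stack at a fixed address inside its slot, so that after migration the stack can be re-mapped at the same virtual address on the destination processor. The constants and function names below are illustrative assumptions, not Charm++'s actual implementation.

```cpp
#include <sys/mman.h>
#include <cstdint>
#include <cstddef>

// Illustrative layout (assumption): one reserved region split into equal
// per-processor slots, each holding a fixed number of thread stacks.
static const uintptr_t ISO_BASE    = 0x100000000000ULL;  // start of reserved region
static const size_t    SLOT_PER_PE = 1ULL << 32;         // address space reserved per processor
static const size_t    STACK_SIZE  = 1 << 20;            // 1 MB per thread stack

// Fixed virtual address of the i-th stack owned by processor `pe`.
static void *iso_stack_address(int pe, int i) {
    return reinterpret_cast<void *>(ISO_BASE
           + static_cast<uintptr_t>(pe) * SLOT_PER_PE
           + static_cast<uintptr_t>(i) * STACK_SIZE);
}

// Map a stack at its fixed address; issuing the same call on another node
// reproduces the same virtual address, so pointers into the stack remain
// valid after the thread (and its stack contents) migrate there.
static void *iso_alloc_stack(int pe, int i) {
    return mmap(iso_stack_address(pe, i), STACK_SIZE,
                PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
}
```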

AMPI Issues: Thread Safety
- Multiple threads are mapped to each processor, so "process data" must be localized:
  - Make such data instance variables of a "class"; all subroutines become instance methods of this class (see the sketch below).
- AMPIzer: a source-to-source translator based on the Polaris front end.
  - It recognizes all global variables and puts them in a thread-private area.
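
A tiny, hedged before/after sketch of this privatization follows; the names SolverData and advance are illustrative. The "before" form is shown in comments because a true global would be shared by every AMPI thread in the same process.

```cpp
// Before (unsafe under AMPI): "process data" kept in a global variable is
// shared by all user-level threads mapped to this processor.
//
//   int step = 0;
//   void advance() { ++step; }

// After: the former global becomes an instance variable of a per-thread
// object, so each virtual processor has its own copy, and the data can
// migrate along with the thread that owns it.
struct SolverData {
    int step = 0;             // formerly a global variable
};

void advance(SolverData &d) { // formerly a free routine touching the global
    ++d.step;
}
```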

AMPI Issues: Data Migration
- Thread-private data needs to be migrated with the thread.
- The developer has to write subroutines for packing and unpacking data, but writing separate pack and unpack subroutines is error-prone.
- Puppers ("pup" = "pack-unpack"): a single subroutine that "shows" the data to the runtime system (see the sketch below).
- Fortran 90 generic procedures make writing the pupper easy.
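
Below is a hedged sketch of a pupper in the Charm++ PUP style, where one routine "shows" each field to the runtime and the same routine is reused for sizing, packing, and unpacking. The struct and field names are illustrative, and how the routine is registered with AMPI is omitted here.

```cpp
#include "pup.h"       // Charm++ PUP framework
#include "pup_stl.h"   // PUP support for STL containers
#include <vector>

// Illustrative per-thread data owned by one AMPI virtual processor.
struct Chunk {
    int    nElems = 0;
    double time   = 0.0;
    std::vector<double> field;

    // The single "pupper": depending on the PUP::er passed in, the same
    // code sizes, packs, or unpacks the data for migration (and, per a
    // later slide, for checkpoint/restart).
    void pup(PUP::er &p) {
        p | nElems;
        p | time;
        p | field;
    }
};
```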

AMPI: Other Features
- Automatic checkpoint and restart, even on a different number of processors:
  - The number of chunks remains the same, but they can be mapped to a different number of processors.
  - No additional work is needed: the same pupper used for migration is also used for checkpointing and restart.

Adaptive Multi-MPI
- Integration of multiple MPI-based modules.
  - Example: integrated rocket simulation (ROCFLO, ROCSOLID, ROCBURN, ROCFACE).
- Each module gets its own MPI_COMM_WORLD; all COMM_WORLDs together form MPI_COMM_UNIVERSE.
- Point-to-point communication between different MPI_COMM_WORLDs uses the same AMPI functions.
- Communication across modules is also considered while balancing load.

Experimental Results

AMR Application with Load Balancing (The load balancer is activated at time steps 20, 40, 60, and 80.)

AMPI Load Balancing on Heterogeneous Clusters (Experiment carried out on a cluster of Linux™ workstations.)

AMPI vs. MPI (This is a scaled problem.)

AMPI “Overhead”

AMPI: Status
- Over 70 commonly used functions from MPI 1.1:
  - All point-to-point communication functions
  - All collective communication functions
  - User-defined MPI data types (see the example below)
- C, C++, and Fortran (77/90) bindings
- Tested on Origin 2000, IBM SP, and Linux and Solaris clusters
- Should run on any platform supported by Charm++ that has mmap
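
As a small usage illustration of the user-defined datatype support listed above, the sketch below builds a contiguous derived type and uses it in a broadcast. It uses only standard MPI 1.1 calls (which, per the slide, AMPI implements); it is not code from the talk.

```cpp
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // A derived datatype describing a row of 4 doubles (MPI 1.1 calls).
    MPI_Datatype row_type;
    MPI_Type_contiguous(4, MPI_DOUBLE, &row_type);
    MPI_Type_commit(&row_type);

    double row[4] = {0.0, 0.0, 0.0, 0.0};
    if (rank == 0) { row[0] = 1.0; row[1] = 2.0; row[2] = 3.0; row[3] = 4.0; }

    // Collective communication using the user-defined type.
    MPI_Bcast(row, 1, row_type, 0, MPI_COMM_WORLD);
    std::printf("rank %d: row[3] = %g\n", rank, row[3]);

    MPI_Type_free(&row_type);
    MPI_Finalize();
    return 0;
}
```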