Programming Environment and Performance Modeling for Million-Processor Machines
Laxmikant (Sanjay) Kale
Parallel Programming Laboratory
Department of Computer Science, University of Illinois at Urbana-Champaign
NGS/IBM Workshop, April 2002

Context: Group Mission and Approach
To enhance performance and productivity in programming complex parallel applications
–Performance: scalable to a very large number of processors
–Productivity: of human programmers
–Complex: irregular structure, dynamic variations
Approach: application-oriented yet CS-centered research
–Develop enabling technology for a wide collection of applications
–Develop, use, and test it in the context of real applications
–Develop a standard library of reusable parallel components

Project Objective and Overview
Focus on extremely large parallel machines
–Exemplified by Blue Gene/Cyclops
Issues:
–Programming environment: objects, threads, compiler support
–Runtime performance adaptation
–Performance modeling: coarse-grained models, fine-grained models, hybrid
Applications: unstructured meshes (FEM/crack propagation), ...
Team: David Padua, Sanjay Kale, Sarita Adve, Philippe Geubelle

Multi-partition Decomposition
Idea: divide the computation into a large number of pieces
–Independent of the number of processors
–Typically larger than the number of processors
–Let the system map entities to processors
Optimal division of labor between “system” and programmer: decomposition done by the programmer, everything else automated
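A minimal Charm++ sketch of this over-decomposition idea (module and class names here are illustrative, not from the talk): the program creates many more chares than processors and lets the runtime decide placement.

```cpp
// sketch.ci -- interface file (illustrative):
//   mainmodule sketch {
//     mainchare Main { entry Main(CkArgMsg* m); };
//     array [1D] Piece { entry Piece(); entry void compute(); };
//   };

// sketch.C
#include "sketch.decl.h"

class Main : public CBase_Main {
public:
  Main(CkArgMsg* m) {
    // Over-decompose: e.g., 8 pieces per processor; the runtime,
    // not the programmer, decides where each piece lives.
    int numPieces = 8 * CkNumPes();
    CProxy_Piece pieces = CProxy_Piece::ckNew(numPieces);
    pieces.compute();   // asynchronous broadcast to all pieces
    delete m;
  }
};

class Piece : public CBase_Piece {
public:
  Piece() {}
  Piece(CkMigrateMessage* m) {}
  void compute() {
    CkPrintf("Piece %d on PE %d\n", thisIndex, CkMyPe());
    if (thisIndex == 0) CkExit();   // crude termination, for the sketch only
  }
};

#include "sketch.def.h"
```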

Object-based Parallelization
(Figure: the user’s view of interacting objects vs. the system’s implementation mapping them onto processors.)
The user is only concerned with the interaction between objects.

Charm++
Parallel C++ with data-driven objects
Object arrays / object collections
Object groups:
–A global object with a “representative” on each PE
Asynchronous method invocation
Prioritized scheduling
Information-sharing abstractions: read-only data, tables, ...
Mature, robust, portable
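A small sketch of the object-group abstraction listed above (class and proxy names are illustrative): one representative lives on each PE, reachable locally without communication or remotely by asynchronous invocation.

```cpp
// In the .ci file (illustrative):
//   group Stats { entry Stats(); entry void record(int n); };

class Stats : public CBase_Stats {
  long count = 0;
public:
  Stats() {}
  // Asynchronous entry method: runs on this PE's representative.
  void record(int n) { count += n; }
};

// Usage, where statsProxy is the CProxy_Stats returned by CProxy_Stats::ckNew():
//   statsProxy.ckLocalBranch()->record(5);  // direct call to the local branch
//   statsProxy[somePe].record(5);           // async invocation on one PE
//   statsProxy.record(5);                   // async broadcast to all branches
```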

Data-driven Execution
(Figure: each processor runs a scheduler that picks messages off a message queue and executes the corresponding object methods.)

Load Balancing Framework
Based on object migration
–Partitions implemented as objects (or threads) are mapped to available processors by the LB framework
Measurement-based load balancers:
–Principle of persistence: computational loads and communication patterns tend to persist over time
–The runtime system measures the actual computation time of every partition, as well as communication patterns
A variety of “plug-in” LB strategies is available
–Scalable to a few thousand processors
–Including strategies for situations where the principle of persistence does not apply
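A minimal sketch of how an application opts into this framework, assuming the standard Charm++ AtSync idiom (the class name and work loop are illustrative):

```cpp
class Piece : public CBase_Piece {
public:
  Piece() { usesAtSync = true; }        // opt in to sync-style load balancing
  Piece(CkMigrateMessage* m) {}
  void pup(PUP::er& p) { /* serialize all state so the object can migrate */ }

  void step() {                         // an entry method driving the iteration
    // ... one iteration of work; the RTS records its cost and messages ...
    AtSync();                           // hand control to the LB framework
  }
  void ResumeFromSync() {               // called once any migrations finish
    thisProxy[thisIndex].step();        // resume with the next iteration
  }
};
// A plug-in strategy is chosen at job launch, e.g.: ./prog +balancer GreedyLB
```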

Building on Object-based Parallelism
Application-induced load imbalances
Environment-induced performance issues:
–Dealing with extraneous loads on shared machines
–Vacating workstations
–Heterogeneous clusters
–Shrinking and expanding jobs to the available PEs
Object “migration”: novel uses
–Automatic checkpointing
–Automatic prefetching for out-of-core execution
Reuse: object-based components

Applications
Charm++ was developed in the context of real applications
Current applications we are involved with:
–Molecular dynamics
–Crack propagation
–Rocket simulation: fluid dynamics + structures + ...
–QM/MM: material properties via quantum mechanics
–Cosmology simulations: parallel analysis + visualization
–Cosmology: gravitational, with multiple timestepping

Molecular Dynamics
A collection of [charged] atoms, with bonds
Newtonian mechanics
At each time-step:
–Calculate forces on each atom: bonded terms, and non-bonded electrostatic and van der Waals terms
–Calculate velocities and advance positions
A 1-femtosecond time-step, and millions of steps needed!
Thousands of atoms
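For concreteness, a minimal serial sketch of one such time-step (illustrative only; production codes like NAMD use cutoffs, neighbor lists, PME, and better integrators):

```cpp
#include <vector>
#include <cmath>

struct Atom { double x[3], v[3], f[3], q, mass; };

void timeStep(std::vector<Atom>& atoms, double dt) {
  const double kCoulomb = 332.0636;  // kcal*A/(mol*e^2), common MD units
  for (auto& a : atoms) a.f[0] = a.f[1] = a.f[2] = 0.0;
  // Non-bonded electrostatic forces over all pairs (O(N^2); bonded and
  // van der Waals terms omitted for brevity).
  for (size_t i = 0; i < atoms.size(); ++i)
    for (size_t j = i + 1; j < atoms.size(); ++j) {
      double d[3], r2 = 0.0;
      for (int k = 0; k < 3; ++k) {
        d[k] = atoms[i].x[k] - atoms[j].x[k];
        r2 += d[k] * d[k];
      }
      double r = std::sqrt(r2);
      double s = kCoulomb * atoms[i].q * atoms[j].q / (r2 * r);
      for (int k = 0; k < 3; ++k) {
        atoms[i].f[k] += s * d[k];   // equal and opposite forces
        atoms[j].f[k] -= s * d[k];
      }
    }
  // Advance velocities and positions (simple Euler step; dt ~ 1 fs).
  for (auto& a : atoms)
    for (int k = 0; k < 3; ++k) {
      a.v[k] += dt * a.f[k] / a.mass;
      a.x[k] += dt * a.v[k];
    }
}
```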

Object-based Parallelization for MD

NGS/IBM: April2002PPL-Dept of Computer Science, UIUC Performance Data: SC2000

Charm++ Is a Good Match for M-PIM
Encapsulation: objects
Cost model:
–Object data, read-only data, remote data
Migration and resource management: automatic
One-sided communication: since the beginning
Asynchronous global operations (reductions, ...)
Modularity:
–See our 1996 paper for why data-driven objects enable modularity
Acceptability:
–C++
–Now also: AMPI on top of Charm++

Higher-level Models
Do programmers find Charm++/AMPI easy and good?
–We think so
–It is certainly a good intermediate-level model
Higher-level abstractions can be built on it
But what kinds of abstractions? We think domain-specific ones

Specialization
(Figure: a spectrum of programming models — HPF, Charm++/AMPI, MPI, and domain-specific frameworks — arranged by how much of decomposition, mapping, scheduling, and expression is handled by the system rather than the programmer.)

Further Match With M-PIM
Ability to predict:
–Which data is going to be needed, and
–Which code will execute
–Based on the ready queue of object method invocations
So we can:
–Prefetch data accurately
–Prefetch code if needed
(Figure: scheduler and message-queue pairs.)
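A hedged sketch of how such queue-driven prefetching could look (every type and helper here is a hypothetical illustration, not actual Charm++ or BG/C API):

```cpp
// Hypothetical scheduler loop: because the next pending invocation's target
// object and entry method are visible in the queued message, their data and
// code can be prefetched while the current method runs.
void schedulerLoop(MsgQueue& readyQ) {              // MsgQueue: hypothetical
  while (Message* m = readyQ.dequeue()) {           // Message: hypothetical
    if (Message* next = readyQ.peek()) {
      prefetchObjectData(next->targetObjectId());   // hypothetical helper
      prefetchCode(next->entryMethodId());          // hypothetical helper
    }
    executeEntryMethod(m);   // run the current method to completion
  }
}
```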

So, What Are We Doing About It?
How do you develop a programming environment for a machine that isn’t built yet?
Blue Gene/C emulator using Charm++
–Completed last year
–Implements the low-level BG/C API: packet sends, extracting packets from communication buffers
–Emulation runs on machines with hundreds of “normal” processors
Charm++ on the Blue Gene/C emulator

Structure of the Emulators
(Figure: layered software stacks — the BG/C low-level API is implemented on Charm++/Converse, and a version of Charm++ runs on top of that emulated BG/C API.)

Emulation on a Parallel Machine
(Figure: many emulated BG/C nodes, each with many hardware threads, hosted on each simulating (host) processor.)

Extensions to Charm++ for BG/C
Microtasks:
–Objects may fire microtasks that can be executed by any thread on the same node
–Increases parallelism
–Overhead: sub-microsecond
Issue: object affinity — map objects to a thread or to a node?
–To a thread, currently; microtasks alleviate load imbalance within a node
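A purely illustrative sketch of the microtask idea; the actual BG/C extension API is not shown in the talk, so every call below is hypothetical:

```cpp
// Hypothetical node-level microtasks: the object stays mapped to one thread,
// but pieces of its work may be picked up by any worker thread on the node.
void Cell::computeForces() {                          // Cell: illustrative class
  for (int b = 0; b < numBlocks; ++b) {
    // Fire a microtask; any thread on the SAME node may execute it, smoothing
    // intra-node load imbalance without migrating the object itself.
    fireMicrotask([this, b] { computeBlock(b); });    // hypothetical call
  }
  waitForMicrotasks();                                // hypothetical barrier
}
```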

Emulation Efficiency
How much time does it take to run an emulation?
–8 million processors being emulated on 100 physical processors
–In addition, lower cache performance
–Lots of tiny messages
On a Linux cluster:
–Emulation shows good speedup

Emulation Efficiency
1000 BG/C nodes (10x10x10)
Each with 200 threads (a total of 200,000 user-level threads)
But: the data is preliminary, based on one simulation

Emulator to Simulator
Step 1: coarse-grained simulation
–Simulation adds a performance-prediction capability
–Models contention for the processor/thread
–Also models communication delay based on distance
–Doesn’t model memory access on the chip, or the network
How to do this in spite of out-of-order message delivery?
–Rely on the determinism of Charm++ programs
–Time-stamped messages and threads
–A parallel time-stamp correction algorithm

Timestamp Correction
Basic execution:
–Timestamped messages
Correction is needed when:
–A message arrives with an earlier timestamp than messages already “processed”
Cases handled:
–Messages to handlers or simple objects
–MPI-style threads, without wildcards or irecvs
–Charm++ with dependences expressed via Structured Dagger

Timestamp Correction (example)
(Figure sequence: messages M1..M8 laid out on an execution timeline by receive time; a message arriving out of order, such as M4, triggers correction messages that push later messages such as M5 and M6 to corrected positions on the timeline.)
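The correction pass itself can be sketched as follows — a minimal serial sketch with assumed data structures; the algorithm described in the talk is parallel and propagates adjustments to other processors via correction messages:

```cpp
#include <vector>
#include <algorithm>

struct LogEntry { int msgId; double recvTime, startTime, execTime; };

// When a message arrives "late" (an earlier recvTime than entries already
// processed), re-derive the start times of everything after it.
void correctTimeline(std::vector<LogEntry>& timeline) {
  std::sort(timeline.begin(), timeline.end(),
            [](const LogEntry& a, const LogEntry& b) {
              return a.recvTime < b.recvTime;
            });
  double clock = 0.0;
  for (auto& e : timeline) {
    e.startTime = std::max(clock, e.recvTime);  // can't start before arrival
    clock = e.startTime + e.execTime;           // corrections ripple forward
    // In the real system, any message *sent* by e would now carry a corrected
    // timestamp, forwarded to its destination as a correction message.
  }
}
```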

Applications on the Current System
Using BG Charm++:
LeanMD
–Research-quality molecular dynamics
–Version 0: only electrostatics + van der Waals
A simple AMR kernel
–An adaptive tree generates millions of objects, each holding a 3D array
–Communication with “neighbors”: the tree makes neighbors harder to find, but Charm++ makes it easy

Emulator to Simulator
Step 2: add fine-grained processor simulation
–Sarita Adve: RSIM-based simulation of a node
–SMP node simulation: completed
–Also: simulation of the interconnection network
–But: millions of thread units/caches to simulate in detail?
Step 3: hybrid simulation
–Instead: use detailed simulation to build a model
–Drive the coarse simulation using the model’s behavior
–Further help from the compiler and RTS

Modeling Layers
Applications
Libraries/RTS
Chip architecture
Network model
For each layer we need a detailed simulation, a simpler (e.g., table-driven) “model”, and methods for combining them
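As an illustration of the “simpler model” idea, a hedged sketch of a table-driven network model (the parameters and structure are assumptions, not the project’s actual model): the detailed simulator calibrates a few coefficients once, and the coarse simulator then uses the cheap closed form.

```cpp
// Parameters are measured by the detailed simulator; the coarse simulator
// calls predictDelay() instead of simulating each packet in detail.
struct NetworkModel {
  double perHopLatency;   // seconds per hop, from detailed simulation
  double perByteCost;     // seconds per byte, from detailed simulation
  double predictDelay(int hops, int bytes) const {
    return hops * perHopLatency + bytes * perByteCost;
  }
};
```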

Summary
Charm++ (data-driven migratable objects) is a well-matched candidate programming model for M-PIMs
We have developed an emulator/simulator
–For BG/C
–Runs on parallel machines
We have implemented multi-million-object applications using Charm++
–And tested them on the emulated Blue Gene/C
More info:
–The emulator is available for download, along with Charm++