1 Scaling Molecular Dynamics to 3000 Processors with Projections: A Performance Analysis Case Study. Laxmikant Kale, Gengbin Zheng, Sameer Kumar, Chee Wai Lee. Parallel Programming Laboratory, Dept. of Computer Science, University of Illinois at Urbana-Champaign

2 Motivation. Performance optimization is increasingly challenging: –Modern applications are complex and dynamic –Some may involve small amounts of computation per step –Performance issues and obstacles change as the number of processors changes, so performance tuning on small machines isn’t enough. We need very good performance analysis tools: –Feedback at the level of applications –Analysis capabilities –Scalable views –Automatic instrumentation

3 Projections. Many performance analysis systems exist, and are very useful in their own way: –ParaGraph, Upshot, Pablo, … The system we present is “Projections”: –Designed for message-driven programs –Has some unique benefits compared with others. Outline of the next few slides: –Processor virtualization and message-driven execution –Projections: trace generation –Projections: views –Case study: NAMD, a molecular dynamics program that won a Gordon Bell award at SC’02 by scaling MD for biomolecules to 3,000 processors

4 Virtualization: Object-based Parallelization. [Figure: user view vs. system implementation.] The user is only concerned with the interaction between objects.

5 Data-driven Execution. [Figure: per-processor scheduler and message queue.]
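
To make the picture concrete, here is a minimal sketch of data-driven execution in plain C++ (not the actual Charm++ scheduler; `Scheduler`, `Patch`, and `receiveForces` are invented for illustration): execution is driven entirely by picking messages off a queue and invoking the target object’s method.

```cpp
#include <cstdio>
#include <functional>
#include <queue>
#include <vector>

// Illustrative only: a "message" is a closure that invokes an entry
// method on some object; the scheduler just drains the queue.
struct Scheduler {
    std::queue<std::function<void()>> messageQ;

    void enqueue(std::function<void()> msg) { messageQ.push(std::move(msg)); }

    void run() {                      // data-driven execution loop
        while (!messageQ.empty()) {
            auto msg = std::move(messageQ.front());
            messageQ.pop();
            msg();                    // execution is driven by message arrival
        }
    }
};

struct Patch {                        // stand-in for a parallel object
    int id;
    void receiveForces(double f) { std::printf("patch %d got force %g\n", id, f); }
};

int main() {
    Scheduler sched;
    std::vector<Patch> patches{{0}, {1}};
    for (auto& p : patches)
        sched.enqueue([&p] { p.receiveForces(1.5); });
    sched.run();
}
```

The key property is that nothing blocks waiting for one specific message: whichever message is available next gets executed, which is what enables the adaptive overlap discussed on later slides.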

6 Charm++ and Adaptive MPI: Realizations of the Virtualization Approach. Charm++: –Parallel C++ with asynchronous methods –In development for over a decade –Basis of several parallel applications –Runs on all popular parallel machines and clusters. AMPI: –A migration path for MPI codes –Brings them the dynamic load balancing capabilities of Charm++ –Minimal modifications to convert existing MPI programs –Bindings for C, C++, and Fortran90. Both available from

7 Benefits of Virtualization Software Engineering –Number of virtual processors can be independently controlled –Separate VPs for modules Message Driven Execution –Adaptive overlap –Modularity –Predictability: Automatic Out-of-core Dynamic mapping –Heterogeneous clusters: Vacate, adjust to speed, share –Automatic checkpointing –Change the set of processors Principle of Persistence: –Enables Runtime Optimizations –Automatic Dynamic Load Balancing –Communication Optimizations –Other Runtime Optimizations More info:

8 Measurement-Based Load Balancing. Principle of persistence: –Object communication patterns and computational loads tend to persist over time –In spite of dynamic behavior: abrupt but infrequent changes, or slow and small changes. Runtime instrumentation: –Measures communication volume and computation time. Measurement-based load balancers: –Use the instrumented database periodically to make new decisions
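
A sketch of the instrumentation idea (plain C++; the real Charm++ instrumentation is internal to the runtime, and every name here is hypothetical): time each object’s method executions and accumulate a per-object load database that a balancer can later query.

```cpp
#include <chrono>
#include <cstdio>
#include <functional>
#include <unordered_map>

// Hypothetical load database: accumulated CPU time per object id.
struct LoadDB {
    std::unordered_map<int, double> loadSec;

    // Wrap any piece of object work with a timer and record it.
    void timed(int objId, const std::function<void()>& work) {
        auto t0 = std::chrono::steady_clock::now();
        work();
        std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
        loadSec[objId] += dt.count();   // persists across timesteps
    }
};

int main() {
    LoadDB db;
    for (int step = 0; step < 24; ++step)      // pretend timesteps
        for (int obj = 0; obj < 4; ++obj)
            db.timed(obj, [obj] {              // fake force work
                volatile double s = 0;
                for (int i = 0; i < (obj + 1) * 100000; ++i) s += i;
            });
    for (auto& [id, t] : db.loadSec)
        std::printf("object %d: %.3f s accumulated\n", id, t);
}
```

The principle of persistence is what makes this worthwhile: because each object’s measured load is a good predictor of its near-future load, the database stays useful between balancing steps.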

9 Trace Generation. Automatic instrumentation by the runtime system, in two modes: detailed traces and summary data.

10 Post-mortem Analysis: Views. Utilization graph: –As a function of time interval or processor –Shows processor utilization –As well as time spent on specific parallel methods. Timeline: –Upshot-like, but more detailed –Pop-up views of method execution, message arrows, user-level events. Profile (stacked graphs): –For a given period, a breakdown of the time on each processor –Includes idle time, and message sending/receiving times

11 Projections Views, continued. Animation of utilization. Histogram of method execution times: –How many method-execution instances took 0–1 ms? 1–2 ms? … Performance counters: –Associated with each entry method –Usual counters, interface to PAPI
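
A minimal sketch of the histogram view’s computation (plain C++ with made-up sample data standing in for trace records): bin method-execution durations into 1 ms buckets and count instances per bucket.

```cpp
#include <cstdio>
#include <vector>

int main() {
    // Durations of method-execution instances, in milliseconds
    // (illustrative sample data standing in for trace records).
    std::vector<double> execMs = {0.3, 0.7, 1.2, 1.9, 2.4, 0.1, 1.5, 3.8};

    const double binWidthMs = 1.0;
    std::vector<int> counts(16, 0);          // buckets: 0-1 ms, 1-2 ms, ...
    for (double t : execMs) {
        size_t bin = static_cast<size_t>(t / binWidthMs);
        if (bin >= counts.size()) bin = counts.size() - 1;  // clamp outliers
        ++counts[bin];
    }
    for (size_t b = 0; b < counts.size(); ++b)
        if (counts[b])
            std::printf("%zu-%zu ms: %d instances\n", b, b + 1, counts[b]);
}
```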

12 Case Study: NAMD We illustrate the use of Projections –Using a case study –Illustrate different “views” –Show performance debugging methodology

13 NAMD: A Production MD Program. Fully featured program; NIH-funded development; distributed free of charge (~5000 downloads so far), as binaries and source code; installed at NSF centers; user training and support; large published simulations (e.g., the aquaporin simulation featured in the SC’02 keynote)

14 Aquaporin Simulation. NAMD, CHARMM27, PME. NpT ensemble at 310 or 298 K; 1 ns equilibration, 4 ns production. Protein: ~15,000 atoms; lipids (POPE): ~40,000 atoms; water: ~51,000 atoms; total: ~106,000 atoms. 3.5 days/ns – O2000 CPUs; 11 days/ns – 32 Linux CPUs; 0.35 days/ns – 512 LeMieux CPUs. F. Zhu, E.T., K. Schulten, FEBS Lett. 504, 212 (2001); M. Jensen, E.T., K. Schulten, Structure 9, 1083 (2001)

15 Molecular Dynamics in NAMD. A collection of [charged] atoms, with bonds: –Newtonian mechanics –Thousands of atoms (10,000–100,000). At each time-step: –Calculate forces on each atom: bonded, and non-bonded (electrostatic and van der Waals) –Short-distance: every timestep –Long-distance: using PME (3D FFT) –Multiple time stepping: PME every 4 timesteps –Calculate velocities and advance positions. Challenge: femtosecond time-steps, millions of them needed! Collaboration with K. Schulten, R. Skeel, and coworkers.
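
A schematic of the timestep structure just described (plain C++; the force routines are empty stubs, and only the multiple-time-stepping interval of 4 comes from the slide):

```cpp
#include <cstdio>

// Stubs standing in for the real force/integration kernels.
void computeBondedForces()      { /* bonds, angles, dihedrals */ }
void computeShortRangeForces()  { /* cutoff electrostatics + van der Waals */ }
void computeLongRangePME()      { /* charge spreading + 3D FFT + gather */ }
void integrate()                { /* update velocities, advance positions */ }

int main() {
    const int numSteps = 12;
    const int pmeEvery = 4;    // multiple time stepping: PME every 4 steps
    for (int step = 0; step < numSteps; ++step) {
        computeBondedForces();
        computeShortRangeForces();         // every femtosecond-scale step
        if (step % pmeEvery == 0) {
            computeLongRangePME();         // long-range part, done less often
            std::printf("step %d: PME step\n", step);
        }
        integrate();
    }
}
```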

16 Sizes of Simulations Over Time. BPTI: 3K atoms. Estrogen receptor: 36K atoms (1996). ATP synthase: 327K atoms (2001).

17 F1F0 ATP Synthase (ATPase): The Benchmark. Converts the electrochemical energy of the proton gradient into the mechanical energy of the central-stalk rotation, driving ATP synthesis (ΔG = 7.7 kcal/mol). 327,000 atoms total: 51,000 atoms of protein and nucleotide, 276,000 atoms of water and ions.

18 Traditional Approaches Are Not Isoefficient. Replicated data: –All atom coordinates stored on each processor –Communication-to-computation ratio: O(P log P). Atom decomposition: partition the atom array across processors –Nearby atoms may not be on the same processor –C/C ratio: O(P). Force decomposition: distribute the force matrix to processors –The matrix is sparse and non-uniform –C/C ratio: O(sqrt(P))
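
In symbols (a restatement of the ratios above, with $P$ the number of processors; the point is that each ratio grows with $P$, so none of these schemes is isoefficient):

```latex
\[
\left(\frac{\text{comm}}{\text{comp}}\right)_{\text{replicated}} = O(P\log P),
\qquad
\left(\frac{\text{comm}}{\text{comp}}\right)_{\text{atom decomp.}} = O(P),
\qquad
\left(\frac{\text{comm}}{\text{comp}}\right)_{\text{force decomp.}} = O\!\left(\sqrt{P}\right).
\]
```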

19 Spatial Decomposition via Charm. Atoms are distributed to cubes (cells, or “patches”) based on their location. Size of each cube: just a bit larger than the cutoff radius. Communicate only with neighbors. Work: one compute for each pair of neighboring patches. C/C ratio: O(1). However: load imbalance, and limited parallelism. Charm++ is useful to handle this.
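
A sketch of the cube assignment (plain C++, with illustrative names and constants): each atom’s coordinates are divided by a cell size slightly larger than the cutoff radius to get its patch index, so any two atoms within the cutoff are always in the same patch or in adjacent patches.

```cpp
#include <array>
#include <cmath>
#include <cstdio>

struct Vec3 { double x, y, z; };

// Patch (cube) index for an atom: the cell side is the cutoff radius
// plus a small margin, so all pairs within the cutoff live in the
// same patch or one of its 26 neighbors.
std::array<int, 3> patchIndex(const Vec3& r, double cutoff, double margin) {
    const double cell = cutoff + margin;
    return { static_cast<int>(std::floor(r.x / cell)),
             static_cast<int>(std::floor(r.y / cell)),
             static_cast<int>(std::floor(r.z / cell)) };
}

int main() {
    const double cutoff = 12.0, margin = 1.0;   // Angstroms, illustrative
    Vec3 atom{25.3, 7.9, 40.2};
    auto idx = patchIndex(atom, cutoff, margin);
    std::printf("atom -> patch (%d, %d, %d)\n", idx[0], idx[1], idx[2]);
}
```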

20 Object-Based Parallelization for MD: Force Decomposition + Spatial Decomposition. Now we have many objects to load balance: –Each diamond (compute object) can be assigned to any processor –Number of diamonds (3D): 14 × the number of patches (see the small check below)
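
Where the factor of 14 comes from (a small C++ check, not NAMD code): each patch owns one self-interaction compute plus one compute for each of its 13 “upstream” neighbors, i.e. half of its 26 face/edge/corner neighbors, so that every patch pair is counted exactly once.

```cpp
#include <cstdio>

int main() {
    int computes = 1;                     // the patch's self-interaction
    for (int dx = -1; dx <= 1; ++dx)
        for (int dy = -1; dy <= 1; ++dy)
            for (int dz = -1; dz <= 1; ++dz) {
                if (dx == 0 && dy == 0 && dz == 0) continue;
                // Keep only the lexicographically "upstream" half of the
                // 26 neighbors, so each patch pair is owned by one side.
                if (dx > 0 || (dx == 0 && (dy > 0 || (dy == 0 && dz > 0))))
                    ++computes;
            }
    std::printf("compute objects per patch: %d\n", computes);  // prints 14
}
```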

21 Bond Forces. Multiple types of bonded forces: –Bonds (2 atoms), angles (3), dihedrals (4), … –Luckily, each involves atoms in neighboring patches only. Straightforward implementation: –Send a message to all 26 neighbors and receive forces from them –26×2 messages per patch! Instead, we do: –Send to the 7 upstream neighbors –Each force is calculated at one patch
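
The 7 upstream neighbors can be enumerated the same way (illustrative C++; the characterization as non-negative offsets is an assumption consistent with the slide): they are the offsets in {0,1}³ other than the origin, so a bonded tuple spanning adjacent patches is gathered along one of these seven directions and computed once.

```cpp
#include <cstdio>

int main() {
    int upstream = 0;
    for (int dx = 0; dx <= 1; ++dx)
        for (int dy = 0; dy <= 1; ++dy)
            for (int dz = 0; dz <= 1; ++dz) {
                if (dx == 0 && dy == 0 && dz == 0) continue;  // self
                ++upstream;
                std::printf("upstream neighbor offset (%d, %d, %d)\n", dx, dy, dz);
            }
    std::printf("messages per patch: %d (vs. 26*2 = 52)\n", upstream);  // 7
}
```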

22 NAMD Parallelization using Charm++. [Figure: patch and compute objects; 9,800 VPs.] These 30,000+ virtual processors (VPs) are mapped to real processors by the Charm++ runtime system.

23 Adding PME. PME involves: –A grid of modest size (e.g., 192×144×144) –Distributing charge from patches to the grid –A 3D FFT over the grid. Strategy: –Use a smaller, non-dedicated subset of processors for PME –Overlap PME with the cutoff computation –Use individual processors for both PME and cutoff computations –Multiple timestepping

24 NAMD Parallelization using Charm++: PME. [Figure: patch, compute, and PME objects; 30,000 VPs.] These 30,000+ virtual processors (VPs) are mapped to real processors by the Charm++ runtime system.

25 Some Challenges. Here we focus on the cutoff-only (no PME) simulation, for simplicity of presentation of the performance issues. New parallel machine with faster processors: –PSC Lemieux –One-processor performance improved from 57 seconds (ASCI Red) to 7.08 seconds (Lemieux) –This makes it harder to parallelize: e.g., a larger communication-to-computation ratio. Each timestep is a few milliseconds on 1000s of processors.

26 Overview of Performance Optimizations. [Speaker notes: list the optimizations to be shown; the next few slides include the Projections pictures from the papers and the associated stories; use the slides after the “summary” as needed.]

27 Summary and Conclusion. Processor virtualization: –A useful technique for complex applications –Charm++ and AMPI embody it –Can be downloaded at. Projections: –A performance analysis tool especially suited to processor virtualization. NAMD performance was optimized to scale to 3,000 processors using Projections. Future: –Further automation of analysis –On-demand displays (via a query-and-display language)

28 Grainsize and Amdahl’s Law. A variant of Amdahl’s law, for objects: –The fastest time can be no shorter than the time for the biggest single object! –A lesson from previous efforts. Splitting computation objects: –30,000 non-bonded compute objects –Instead of approximately 10,000
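
The object-level bound can be written out (a sketch; $t_k$ is the execution time of compute object $k$ on $P$ processors):

```latex
\[
T_{\text{step}} \;\ge\; \max_k t_k
\qquad\Longrightarrow\qquad
\text{speedup} \;\le\; \frac{\sum_k t_k}{\max_k t_k},
\]
```

so splitting the largest objects (going from roughly 10,000 to 30,000 compute objects) lowers max_k t_k and raises the scaling ceiling.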

29 [Histogram figure] Distribution of execution times of non-bonded force-computation objects (over 24 steps); mode: 700 µs.

30 Periodic Load Balancing Strategies. Centralized strategy: –The Charm RTS collects data (on one processor) about the computational load and the communication between each pair of objects –It then partitions the graph of objects across processors. Communication is taken into account: –Point-to-point, as well as multicast over a subset –As an object is mapped, its communication adds to the load on both the sending and receiving processors –(In the figure, the red communication is free if it is a multicast.)
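
A toy version of the centralized strategy (plain C++; the real Charm++ balancers are far more sophisticated, and every name and constant here is illustrative): greedily place the heaviest objects on the least-loaded processor, charging a communication penalty to both endpoints when an object’s partner sits elsewhere.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

struct Obj { int id; double load; int partner; };  // one comm partner, toy model

int main() {
    const int P = 3;
    const double commCost = 0.2;                  // penalty per remote partner
    std::vector<Obj> objs = {{0, 5.0, 1}, {1, 3.0, 0}, {2, 4.0, 3},
                             {3, 2.0, 2}, {4, 1.0, 0}};
    std::vector<double> procLoad(P, 0.0);
    std::vector<int> placement(objs.size(), -1);

    // Heaviest-first greedy assignment.
    std::sort(objs.begin(), objs.end(),
              [](const Obj& a, const Obj& b) { return a.load > b.load; });
    for (const Obj& o : objs) {
        int best = static_cast<int>(
            std::min_element(procLoad.begin(), procLoad.end()) - procLoad.begin());
        placement[o.id] = best;
        procLoad[best] += o.load;
        int pp = placement[o.partner];            // partner already placed?
        if (pp >= 0 && pp != best) {              // remote communication:
            procLoad[best] += commCost;           // charge the receiver...
            procLoad[pp]   += commCost;           // ...and the sender
        }
    }
    for (size_t i = 0; i < placement.size(); ++i)
        std::printf("object %zu -> processor %d\n", i, placement[i]);
}
```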

31 Load Balancing Steps. [Figure: regular timesteps; instrumented timesteps; detailed, aggressive load balancing; refinement load balancing.]

32 Another New Challenge. Jitter due to small variations: –Appears on 2K processors or more –Each timestep, ideally, will be about 10–20 msec for ATPase –Within that time, each processor sends and receives messages of 4–6 KB each –The communication layer and/or OS has small “hiccups” –No problem until 512 processors –Small, rare hiccups can have a large performance impact when the timestep is small (10–20 msec) AND a large number of processors are used

33 Benefits of Avoiding Barriers. Problem with barriers: –Not so much the direct cost of the operation itself –But that they prevent the program from adjusting to small variations. E.g., K phases, separated by barriers (or scalar reductions); load is effectively balanced, but: –In each phase there may be slight non-deterministic load imbalance –Let L_{i,j} be the load on the i-th processor in the j-th phase. With barriers, total time is sum_j max_i L_{i,j}; without them, it approaches max_i sum_j L_{i,j}, which is smaller when the imbalance is uncorrelated across phases. In NAMD, using Charm++’s message-driven execution: –The energy reductions were made asynchronous –No other global barriers are used in cut-off simulations
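
A quick numerical illustration of that gap (plain C++, made-up random loads): with barriers, each phase costs as much as its slowest processor; without barriers, the noise averages out per processor.

```cpp
#include <algorithm>
#include <cstdio>
#include <random>
#include <vector>

int main() {
    const int P = 64, K = 100;                 // processors, phases
    std::mt19937 rng(42);
    std::uniform_real_distribution<double> noise(9.0, 11.0);  // ms per phase

    double withBarrier = 0.0;                  // sum over phases of max_i L[i][j]
    std::vector<double> perProc(P, 0.0);       // per-processor sum over phases
    for (int j = 0; j < K; ++j) {
        double phaseMax = 0.0;
        for (int i = 0; i < P; ++i) {
            double L = noise(rng);             // slight nondeterministic imbalance
            phaseMax = std::max(phaseMax, L);
            perProc[i] += L;
        }
        withBarrier += phaseMax;
    }
    double withoutBarrier = *std::max_element(perProc.begin(), perProc.end());
    std::printf("with barriers:    %.1f ms\n", withBarrier);
    std::printf("without barriers: %.1f ms\n", withoutBarrier);
}
```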

34 [Timeline figure; time axis in milliseconds.]

35 Substep Dynamic Load Adjustments. The load balancer tells each processor its expected (predicted) load for each timestep. Each processor monitors its execution time within each timestep, after executing each force-computation object. If it has taken well beyond its allocated time: –It infers that it has encountered a “stretch” –It sends a fraction of its work over the next 2–3 steps to other processors, randomly selected from among the least-loaded ones (i.e., it migrates compute objects away in this step)
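
A sketch of the stretch-detection logic (plain C++; the threshold, the 10% figure, and all names are invented for illustration):

```cpp
#include <cstdio>
#include <vector>

// Illustrative monitor: compare elapsed time against the load
// balancer's prediction and shed a fraction of work on a "stretch".
struct StretchMonitor {
    double predictedMs;        // expected load from the load balancer
    double thresholdFactor;    // how far past prediction counts as a stretch

    // Returns the fraction of work to migrate away (0 if on schedule).
    double checkAfterObject(double elapsedMs) const {
        if (elapsedMs > predictedMs * thresholdFactor)
            return 0.1;        // offload ~10% over the next few steps
        return 0.0;
    }
};

int main() {
    StretchMonitor mon{12.0, 1.5};                 // 12 ms predicted, 1.5x slack
    std::vector<double> elapsedAfterEachObject = {4.0, 9.0, 21.0};
    for (double t : elapsedAfterEachObject) {
        double shed = mon.checkAfterObject(t);
        if (shed > 0.0)
            std::printf("stretch at %.1f ms: migrate %.0f%% of work\n",
                        t, shed * 100);
    }
}
```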

36 Performance: NAMD on Lemieux. ATPase: 320,000+ atoms, including water.

37 [Timeline figure; time axis in milliseconds.]

38 Using all four processors on each node. [Timeline figure; 300 milliseconds.]