Scalable Molecular Dynamics for Large Biomolecular Systems

Scalable Molecular Dynamics for Large Biomolecular Systems
Robert Brunner, James C. Phillips, Laxmikant Kale

Overview
Context: approach and methodology
Molecular dynamics for biomolecules
Our program NAMD
Basic parallelization strategy
NAMD performance optimizations:
  Techniques
  Results
Conclusions: summary, lessons, and future work

The context
Objective: enhance performance and productivity in parallel programming
  For complex, dynamic applications
  Scalable to thousands of processors
Theme: adaptive techniques for handling dynamic behavior
Look for an optimal division of labor between the human programmer and the "system":
  Let the programmer specify what to do in parallel
  Let the system decide when and where to run the subcomputations
Data-driven objects as the substrate

Data-driven execution
[Diagram: per-processor schedulers, each picking work from a message queue]

Charm++
Parallel C++ with data-driven objects
Object arrays and collections
Asynchronous method invocation
Object groups: a global object with a "representative" on each PE
Prioritized scheduling
Mature, robust, portable
http://charm.cs.uiuc.edu
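To make the programming model concrete, here is a minimal sketch in the style of the Charm++ tutorial "hello world" programs: a 1D chare array whose elements are invoked asynchronously and report back to the main chare. The module, class, and method names (hello, Greeter, greet) and the array size are illustrative, and the .ci interface file plus the charmc build step are only summarized in a comment, so treat this as a sketch rather than a drop-in program.

```cpp
// hello.ci (interface file, compiled with charmc to generate hello.decl.h / hello.def.h):
//   mainmodule hello {
//     readonly CProxy_Main mainProxy;
//     mainchare Main { entry Main(CkArgMsg*); entry void done(); };
//     array [1D] Greeter { entry Greeter(); entry void greet(int step); };
//   };
#include "hello.decl.h"

CProxy_Main mainProxy;           // readonly proxy, broadcast to all PEs at startup
static const int kNumElems = 8;  // illustrative array size

class Main : public CBase_Main {
  int finished = 0;
public:
  Main(CkArgMsg* m) {
    delete m;
    mainProxy = thisProxy;
    CProxy_Greeter arr = CProxy_Greeter::ckNew(kNumElems);
    arr.greet(0);                // asynchronous broadcast; returns immediately
  }
  void done() {                  // called back by each array element
    if (++finished == kNumElems) CkExit();
  }
};

class Greeter : public CBase_Greeter {
public:
  Greeter() {}
  Greeter(CkMigrateMessage*) {}  // constructor required for object migration
  void greet(int step) {
    CkPrintf("element %d greeting on PE %d (step %d)\n", thisIndex, CkMyPe(), step);
    mainProxy.done();            // another asynchronous method invocation
  }
};

#include "hello.def.h"
```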

Multi-partition decomposition

Load balancing
Based on migratable objects
Collect timing data for several cycles
Run a heuristic load balancer (several alternative ones)
Re-map and migrate objects accordingly
Registration mechanisms facilitate migration

Measurement-based load balancing
Application-induced imbalances are:
  Abrupt but infrequent, or
  Slow and cumulative; rarely frequent, large changes
Principle of persistence (an extension of the principle of locality):
  The behavior of objects, including computational load and communication patterns, tends to persist over time
We have implemented strategies that exploit this automatically

Molecular Dynamics

Molecular dynamics and NAMD
MD is used to understand the structure and function of biomolecules: proteins, DNA, membranes
NAMD is a production-quality MD program:
  In active use by biophysicists (science publications)
  50,000+ lines of C++ code
  1000+ registered users
  Features and "accessories" such as:
    VMD: visualization
    BioCoRE: collaboratory
    Steered and interactive molecular dynamics

NAMD contributors
PIs: Laxmikant Kale, Klaus Schulten, Robert Skeel
NAMD 1: Robert Brunner, Andrew Dalke, Attila Gursoy, Bill Humphrey, Mark Nelson
NAMD 2: M. Bhandarkar, R. Brunner, A. Gursoy, J. Phillips, N. Krawetz, A. Shinozaki, K. Varadarajan, Gengbin Zheng, ...

Molecular dynamics
A collection of [charged] atoms, with bonds, evolved by Newtonian mechanics
At each time-step:
  Calculate forces on each atom: bonded, and non-bonded (electrostatic and van der Waals)
  Calculate velocities and advance positions
1-femtosecond time-step, millions of steps needed!
Thousands of atoms (1,000 - 100,000)
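As a point of reference, the sequential time-step is conceptually just force evaluation plus integration. The sketch below is a generic velocity-Verlet step over illustrative Atom/force types of my own, not NAMD's actual code; the force kernels are left as placeholder stubs (a cut-off version is sketched after the next slide).

```cpp
#include <vector>

struct Vec3 { double x = 0, y = 0, z = 0; };
struct Atom { Vec3 pos, vel, force; double mass = 1.0, charge = 0.0; };

// Placeholder force kernels (bonded: bonds, angles, dihedrals; non-bonded: electrostatics + vdW).
void computeBondedForces(std::vector<Atom>&) { /* omitted */ }
void computeNonbondedForces(std::vector<Atom>&, double /*cutoff*/) { /* omitted */ }

// One velocity-Verlet time-step with step size dt (about 1 fs in these simulations).
void mdStep(std::vector<Atom>& atoms, double dt, double cutoff) {
  for (Atom& a : atoms) {                          // half-kick with old forces, then drift
    double s = 0.5 * dt / a.mass;
    a.vel.x += s * a.force.x;  a.vel.y += s * a.force.y;  a.vel.z += s * a.force.z;
    a.pos.x += dt * a.vel.x;   a.pos.y += dt * a.vel.y;   a.pos.z += dt * a.vel.z;
    a.force = Vec3{};                              // clear force accumulators for this step
  }
  computeBondedForces(atoms);
  computeNonbondedForces(atoms, cutoff);
  for (Atom& a : atoms) {                          // second half-kick with the new forces
    double s = 0.5 * dt / a.mass;
    a.vel.x += s * a.force.x;  a.vel.y += s * a.force.y;  a.vel.z += s * a.force.z;
  }
}
```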

Cut-off radius
Use of a cut-off radius (8 - 14 Å) to reduce work: faraway charges are ignored!
80-95% of the work is non-bonded force computation
Some simulations need the faraway contributions:
  Periodic systems: Ewald, particle-mesh Ewald (PME)
  Aperiodic systems: FMA
Even so, cut-off based computations are important:
  The near-atom calculations are part of the above
  Multiple time-stepping is used: k cut-off steps, then 1 PME/FMA step
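A minimal sketch of what "cut-off based" non-bonded work and multiple time-stepping look like, self-contained with its own tiny Atom type; the O(N^2) pair scan, the omitted physical constants, and the step interval k are purely illustrative (real codes use pair lists or spatial decomposition rather than an all-pairs scan).

```cpp
#include <cmath>
#include <vector>

struct Vec3 { double x = 0, y = 0, z = 0; };
struct Atom { Vec3 pos, vel, force; double mass = 1.0, charge = 0.0; };

// Pairwise electrostatics within the cut-off; pairs farther apart are ignored.
void computeCutoffForces(std::vector<Atom>& atoms, double cutoff) {
  const double cutoff2 = cutoff * cutoff;
  for (size_t i = 0; i < atoms.size(); ++i) {
    for (size_t j = i + 1; j < atoms.size(); ++j) {
      double dx = atoms[i].pos.x - atoms[j].pos.x;
      double dy = atoms[i].pos.y - atoms[j].pos.y;
      double dz = atoms[i].pos.z - atoms[j].pos.z;
      double r2 = dx * dx + dy * dy + dz * dz;
      if (r2 > cutoff2) continue;               // the cut-off: faraway charges ignored
      // Coulomb force magnitude q_i q_j / r^2, divided by r so that multiplying by
      // the displacement vector gives the force components (constants omitted).
      double f = atoms[i].charge * atoms[j].charge / (r2 * std::sqrt(r2));
      atoms[i].force.x += f * dx; atoms[i].force.y += f * dy; atoms[i].force.z += f * dz;
      atoms[j].force.x -= f * dx; atoms[j].force.y -= f * dy; atoms[j].force.z -= f * dz;
    }
  }
}

// Multiple time-stepping: k cheap cut-off steps for every expensive long-range evaluation.
void runMultipleTimestepping(std::vector<Atom>& atoms, double cutoff, int nSteps, int k) {
  for (int step = 0; step < nSteps; ++step) {
    computeCutoffForces(atoms, cutoff);         // cut-off forces every step
    if (step % k == 0) {
      // Full long-range electrostatics (Ewald/PME for periodic systems, FMA for
      // aperiodic ones) would be evaluated here, once every k steps.
    }
    // ... integrate positions and velocities as in the previous sketch ...
  }
}
```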

Scalability
The program should scale up to use a large number of processors. But what does that mean?
An individual simulation isn't truly scalable
A better definition of scalability: if I double the number of processors, I should be able to retain parallel efficiency by increasing the problem size

Isoefficiency
Quantifies scalability (work of Vipin Kumar, U. Minnesota): how much increase in problem size is needed to retain the same efficiency on a larger machine?
Efficiency: Seq. Time / (P · Parallel Time)
Parallel time = computation + communication + idle
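Restating the slide's definitions in symbols (T_1 is the sequential time, T_P the time on P processors, N the problem size):

```latex
E(N, P) \;=\; \frac{T_1(N)}{P \, T_P(N, P)}, \qquad
T_P \;=\; t_{\text{comp}} + t_{\text{comm}} + t_{\text{idle}}.
```

The isoefficiency function is then the rate at which N must grow with P to hold E constant; the slower the required growth, the more scalable the algorithm.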

Atom decomposition
Partition the atoms array across processors
  Nearby atoms may not be on the same processor
Communication: O(N) per processor
Communication-to-computation ratio: O(N) / (N/P) = O(P)
Again, not scalable by our definition

Force decomposition
Distribute the force matrix across processors
  The matrix is sparse and non-uniform
  Each processor has one block
Communication: O(N/√P) per processor
Communication-to-computation ratio: O(N/√P) / (N/P) = O(√P)
Better scalability in practice (can use 100+ processors):
  Plimpton; Hwang, Saltz, et al.: 6% on 32 PEs, 36% on 128 processors
Yet not scalable in the sense defined here!

Spatial decomposition
Allocate close-by atoms to the same processor
Three variations possible:
  Partitioning into P boxes, 1 per processor: good scalability, but hard to implement
  Partitioning into fixed-size boxes, each a little larger than the cutoff distance
  Partitioning into smaller boxes
Communication: O(N/P), so scalable in principle
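A sketch of the second variation (the patch scheme used in the NAMD discussion that follows): boxes slightly larger than the cutoff, so that non-bonded interactions only involve a patch and its 26 neighbors. The box origin, edge length, and indexing below are illustrative.

```cpp
#include <array>
#include <cmath>

struct PatchIndex { int x, y, z; };

// Map an atom's position to the patch that owns it, for patches of edge length
// patchLen >= cutoff (so interacting atoms always lie in the same or a neighboring patch).
PatchIndex ownerPatch(double px, double py, double pz,
                      const std::array<double, 3>& boxOrigin, double patchLen) {
  return PatchIndex{
    static_cast<int>(std::floor((px - boxOrigin[0]) / patchLen)),
    static_cast<int>(std::floor((py - boxOrigin[1]) / patchLen)),
    static_cast<int>(std::floor((pz - boxOrigin[2]) / patchLen))
  };
}
```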

Spatial decomposition in NAMD
NAMD 1 used spatial decomposition
Good theoretical isoefficiency, but for a fixed-size system, load balancing problems
For mid-size systems, we got good speedups up to 16 processors...
Use the symmetry of Newton's 3rd law to facilitate load balancing

Spatial Decomposition But the load balancing problems are still severe:

FD + SD (force decomposition + spatial decomposition)
Now we have many more objects to load balance:
  Each diamond (pairwise-force object) can be assigned to any processor
  Number of diamonds (3D): 14 · number of patches

Bond forces
Multiple types of forces: bonds (2 atoms), angles (3), dihedrals (4), ...
Luckily, each involves atoms in neighboring patches only
Straightforward implementation: send a message to all neighbors, receive forces from them: 26 * 2 messages per patch!

Bonded forces
Assume one patch per processor: an angle force involving atoms in patches (x1,y1,z1), (x2,y2,z2), (x3,y3,z3) is calculated in patch (max{xi}, max{yi}, max{zi})
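A small sketch of that ownership rule (using a PatchIndex struct of the same shape as in the earlier sketch): the componentwise maximum lets every patch decide locally, and uniquely, which bonded terms it computes, with no negotiation between neighbors.

```cpp
#include <algorithm>
#include <climits>
#include <initializer_list>

struct PatchIndex { int x, y, z; };   // same shape as in the spatial-decomposition sketch

// The patch that computes a bonded term is the componentwise maximum of the patch
// indices of the atoms involved (2 for bonds, 3 for angles, 4 for dihedrals).
PatchIndex computingPatch(std::initializer_list<PatchIndex> patches) {
  PatchIndex owner{INT_MIN, INT_MIN, INT_MIN};
  for (const PatchIndex& p : patches) {
    owner.x = std::max(owner.x, p.x);
    owner.y = std::max(owner.y, p.y);
    owner.z = std::max(owner.z, p.z);
  }
  return owner;
}

// Example: an angle term whose three atoms sit in three neighboring patches.
// PatchIndex where = computingPatch({{2,3,1}, {2,4,1}, {3,3,1}});   // -> {3,4,1}
```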

Implementation
Multiple objects per processor
  Different types: patches, pairwise-force objects, bonded-force objects
  Each may have its data ready at different times
  Need the ability to map and remap them
  Need prioritized scheduling
Charm++ supports all of these

Load balancing
A major challenge for this application, especially on a large number of processors:
  Unpredictable workloads: each diamond (force object) and patch encapsulates a variable amount of work
  Static estimates are inaccurate
Measurement-based load balancing framework (Robert Brunner's recent Ph.D. thesis):
  Exploits the very slow variation of load across timesteps

Bipartite graph balancing
Background (non-migratable) load: patches (integration, ...) and bond-related forces
Migratable load: non-bonded forces
Bipartite communication graph between migratable and non-migratable objects
Challenge: balance load while minimizing communication

Load balancing strategy
Greedy variant (simplified):
  Sort compute objects (diamonds) by decreasing load
  Repeat (until all are assigned):
    S = set of all processors that are not overloaded and generate the least new communication
    P = least loaded processor in S
    Assign the heaviest remaining compute to P
Refinement:
  Repeat:
    Pick a compute from the most overloaded PE
    Assign it to a suitable underloaded PE
  Until no movement
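A compact sketch of the greedy phase under simplifying assumptions of my own (no communication term, a plain overload threshold, compute ids numbered 0..N-1); the actual Charm++ strategies also weigh the communication graph and apply the refinement pass afterwards.

```cpp
#include <algorithm>
#include <vector>

struct Compute { int id; double load; };   // id assumed to be in 0..computes.size()-1

// Greedy assignment: heaviest compute first, onto the least-loaded processor that is
// not already past the overload threshold. Returns a compute-to-processor map.
std::vector<int> greedyAssign(std::vector<Compute> computes, int numProcs,
                              double overloadThreshold) {
  std::sort(computes.begin(), computes.end(),
            [](const Compute& a, const Compute& b) { return a.load > b.load; });
  std::vector<double> procLoad(numProcs, 0.0);
  std::vector<int> assignment(computes.size(), -1);
  for (const Compute& c : computes) {
    int best = -1;
    for (int p = 0; p < numProcs; ++p) {
      if (procLoad[p] >= overloadThreshold) continue;      // skip overloaded processors
      if (best < 0 || procLoad[p] < procLoad[best]) best = p;
    }
    if (best < 0) {                                        // all overloaded: use the global minimum
      best = static_cast<int>(std::min_element(procLoad.begin(), procLoad.end())
                              - procLoad.begin());
    }
    procLoad[best] += c.load;
    assignment[c.id] = best;
  }
  return assignment;
}
```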

Initial Speedup Results: ASCI Red

BC1 complex: 200k atoms

Optimizations
A series of optimizations; examples covered here:
  Grainsize distributions (bimodal)
  Integration: message-sending overheads

Grainsize and Amdahl's law
A variant of Amdahl's law, for objects: the fastest time can be no shorter than the time for the biggest single object!
How did it apply to us?
  Sequential step time was 57 seconds
  To run on 2k processors, no object should take more than about 28 msecs, and ideally even less
  Grainsize analysis via Projections showed that this was not so...
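The bound behind that number, written out (a restatement of the slide's arithmetic):

```latex
T_{\text{parallel}} \;\ge\; \max_i t_i
\qquad\Longrightarrow\qquad
t_i \;\le\; \frac{T_{\text{seq}}}{P} \;=\; \frac{57\,\text{s}}{2048} \;\approx\; 28\,\text{ms}
\quad \text{for every object } i, \text{ if we want full efficiency on } P = 2048.
```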

Grainsize analysis
Problem: some compute objects had far too much work (the bimodal grainsize distribution)
Solution: split compute objects that may have too much work, using a heuristic based on the number of interacting atoms
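A sketch of such a splitting pass under assumptions of my own (a simple atom-count product as the work estimate and a halving split); the slide only says the real heuristic is based on the number of interacting atoms.

```cpp
#include <vector>

// A pairwise compute between two patches; its work is assumed (illustratively) to be
// roughly proportional to the product of the interacting atom counts.
struct PairCompute {
  int patchA, patchB;
  int atomsA, atomsB;
  double estimatedWork() const { return static_cast<double>(atomsA) * atomsB; }
};

// Split any compute whose estimated work exceeds maxWork by halving one patch's
// atom range, until every piece fits under the limit.
std::vector<PairCompute> splitLargeComputes(std::vector<PairCompute> computes, double maxWork) {
  std::vector<PairCompute> result;
  while (!computes.empty()) {
    PairCompute c = computes.back();
    computes.pop_back();
    if (c.estimatedWork() <= maxWork || c.atomsA <= 1) {
      result.push_back(c);
    } else {
      PairCompute lo = c, hi = c;          // split patch A's atoms into two halves
      lo.atomsA = c.atomsA / 2;
      hi.atomsA = c.atomsA - lo.atomsA;
      computes.push_back(lo);
      computes.push_back(hi);
    }
  }
  return result;
}
```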

Grainsize reduced

Performance audit

Performance audit
Through the optimization process, an audit was kept to decide where to look to improve performance (ideal vs. actual time per component):

Component     Ideal    Actual
Total         57.04    86
nonBonded     52.44    49.77
Bonds          3.16     3.9
Integration    1.44     3.05
Overhead       0        7.97
Imbalance      0       10.45
Idle           0        9.25
Receives       0        1.61

Integration time doubled

Integration overhead analysis
Problem: integration time had doubled relative to the sequential run

Integration overhead example
The Projections pictures showed that the overhead was associated with sending messages
Many cells were sending 30-40 messages, but the overhead was still too large compared with the cost of the messages themselves
Code analysis: memory allocations! An identical message was being sent to 30+ processors
Simple multicast support was added to Charm++
  It mainly eliminates the memory allocations (and some copying)
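The effect, in conceptual terms (this is not the Charm++ multicast API, just an illustration with hypothetical Message, sendCopy, and multicast stand-ins): instead of allocating and copying one message per destination, one buffer is built once and handed to a multicast that delivers it to the whole neighbor list.

```cpp
#include <vector>

struct Message { std::vector<char> payload; };

// Hypothetical transport calls standing in for the runtime's messaging layer.
void sendCopy(int /*pe*/, const Message& /*m*/) { /* stand-in for a point-to-point send */ }
void multicast(const std::vector<int>& /*pes*/, const Message& /*m*/) { /* stand-in for a multicast */ }

// Before: one allocation + copy per neighbor (30-40 of them per cell per step).
void sendCoordinatesNaive(const std::vector<char>& coords, const std::vector<int>& neighbors) {
  for (int pe : neighbors) {
    Message m;
    m.payload.assign(coords.begin(), coords.end());   // repeated allocation and copy
    sendCopy(pe, m);
  }
}

// After: build the message once and multicast it; allocations drop from O(neighbors) to O(1).
void sendCoordinatesMulticast(const std::vector<char>& coords, const std::vector<int>& neighbors) {
  Message m;
  m.payload.assign(coords.begin(), coords.end());
  multicast(neighbors, m);
}
```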

Integration overhead: After multicast

Improved Performance Data

Results on Linux Cluster

Performance of Apo-A1 on ASCI Red

Performance of Apo-A1 on O2k and T3E

Lessons learned
Need to downsize objects! Choose the smallest possible grainsize that still amortizes the overhead
One of the biggest challenges was getting time for performance-tuning runs on parallel machines

Future and planned work
Speedup on small molecules! Interactive molecular dynamics
Increased speedups on 2k-10k processors:
  Smaller grainsizes
  New algorithms for reducing communication impact
  New load balancing strategies
Further performance improvements for PME/FMA with multiple time-stepping: needs multi-phase load balancing

Steered MD: example picture
Image and simulation by the Theoretical Biophysics Group, Beckman Institute, UIUC

More information
Charm++ and associated framework: http://charm.cs.uiuc.edu
NAMD and associated biophysics tools: http://www.ks.uiuc.edu
Both include downloadable software

Performance: size of system Performance data on Cray T3E

Performance: various machines