Presentation transcript:

May 8th, 2012 Higher Ed & Research

Molecular Dynamics Applications Overview: AMBER, NAMD, GROMACS, LAMMPS. Sections included below. In fullscreen mode, click on a link to view a particular module; click the NVIDIA logo on each slide to return to this page.

Molecular Dynamics (MD) Applications

AMBER
Features supported: PMEMD explicit solvent & GB implicit solvent
GPU perf: JAC NVE on 16x 2090s (ns/day)
Release status: Released; multi-GPU, multi-node (AMBER 12)
Notes/benchmarks: AMBER benchmark page (…htm#Benchmarks)

NAMD
Features supported: Full electrostatics with PME and most simulation features
GPU perf: 6.44 ns/day (STMV, 585x 2050s)
Release status: Released; 100M-atom capable; multi-GPU, multi-node (NAMD version of April 2012)
Notes/benchmarks: NAMD benchmark page (…md_bench.html)

GROMACS
Features supported: Implicit (5x) and explicit (2x) solvent via OpenMM
GPU perf: 165 ns/day (DHFR, 4x C2075s)
Release status: 4.5 single-GPU released; 4.6 multi-GPU released
Notes/benchmarks: GROMACS GPU page (…gpu.html)

LAMMPS
Features supported: Lennard-Jones, Gay-Berne, Tersoff
Release status: Released; multi-GPU, multi-node
Notes/benchmarks: 1-billion-atom run on Lincoln

GPU perf is compared against a multi-core x86 CPU socket. GPU perf is benchmarked on GPU-supported features and may be a kernel-to-kernel performance comparison.

New/Additional MD Applications Ramping

Abalone
Features supported: TBD
GPU perf: Simulations 4-29x (on a 1060 GPU)
Release status: Released; single GPU
Notes: Agile Molecule, Inc.

ACEMD
Features supported: Written for use on GPUs
GPU perf: 160 ns/day
Release status: Released
Notes: Production biomolecular dynamics (MD) software specially optimized to run on single and multiple GPUs

DL_POLY
Features supported: Two-body forces, link-cell pairs, Ewald SPME forces, Shake VV
GPU perf: 4x
Release status: V 4.0, source only; results published
Notes: Multi-GPU, multi-node supported

HOOMD-Blue
Features supported: Written for use on GPUs
GPU perf: 2x (32 CPU cores vs. 2 10XX GPUs)
Release status: Released
Notes: Single and multi-GPU

GPU perf is compared against a multi-core x86 CPU socket. GPU perf is benchmarked on GPU-supported features and may be a kernel-to-kernel performance comparison.

GPU Value to Molecular Dynamics

What: Study disease & discover drugs; predict drug and protein interactions.
Why: Speed of simulations is critical. GPUs enable the study of longer timeframes, larger systems, and more simulations.
How: GPUs increase throughput & accelerate simulations.

AMBER 11 application example: 4.6x performance increase with 2 GPUs at only a 54% added cost.* (*AMBER 11 Cellulose NPT on 2x E5670 CPUs + 2x Tesla C2090s per node vs. 2x E5670 CPUs per node. Cost of a CPU node is assumed to be $9333; cost of adding two C2090s to a single node is assumed to be $5333.)

GPU Test Drive pre-configured applications: AMBER 11, NAMD 2.8.

GPU-ready applications: Abalone, ACEMD, AMBER, DL_POLY, GAMESS, GROMACS, LAMMPS, NAMD.
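To make the price/performance claim above concrete, here is a small illustrative Python calculation (not part of the original deck) that turns the slide's own figures, a 4.6x speedup for a 54% added node cost, into relative throughput per dollar:

```python
# Worked example using the slide's stated figures (illustrative only).
speedup = 4.6       # AMBER 11 Cellulose NPT: GPU node vs. CPU-only node
added_cost = 0.54   # adding two C2090s raises the node cost by ~54%

cost_factor = 1.0 + added_cost            # GPU node cost relative to CPU node
perf_per_dollar = speedup / cost_factor   # relative throughput per dollar

print(f"The GPU node costs {cost_factor:.2f}x as much as the CPU-only node,")
print(f"delivers {speedup:.1f}x the throughput,")
print(f"and thus roughly {perf_per_dollar:.1f}x the throughput per dollar.")
```

Under those assumptions the GPU-equipped node delivers roughly 3x the simulation throughput per dollar of the CPU-only node.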

All Key MD Codes Are GPU Ready
AMBER, NAMD, GROMACS, and LAMMPS (life and material sciences) all deliver great multi-GPU performance.
Additional MD GPU codes: Abalone, ACEMD, HOOMD-Blue.
Focus: scaling to large numbers of GPUs.

Outstanding AMBER Results with GPUs

Run AMBER Faster: Up to 5x Speed-Up with GPUs
Benchmark: DHFR (NVE), 23,558 atoms, GPU node vs. CPU supercomputer.
"…with two GPUs we can run a single simulation as fast as on 128 CPUs of a Cray XT3 or on 1024 CPUs of an IBM BlueGene/L machine. We can try things that were undoable before. It still blows my mind." (Axel Kohlmeyer, Temple University)

AMBER: Make Research More Productive with GPUs
Adding two M2090 GPUs to a node yields a >4x performance increase: 318% higher performance for 54% additional expense (with-GPU vs. no-GPU node).
Base node configuration: dual Xeon X5670s and dual Tesla M2090 GPUs per node.

Run NAMD Faster: Up to 7x Speed-Up with GPUs
Benchmarks: ApoA-1 (92,224 atoms) and STMV (1,066,628 atoms).
Test platform: 1 node, dual Tesla M2090 GPUs (6 GB), dual Intel 4-core Xeons (2.4 GHz), NAMD 2.8, CUDA 4.0, ECC on. More information on speed-up results, configurations, and test models is available online.
STMV benchmark: NAMD 2.8b1 + unreleased patch; a node is dual-socket, quad-core X5650 with 2 Tesla M2070 GPUs. Performance numbers compare GPU+CPU vs. 8 CPU cores.

Make Research More Productive with GPUs
Get up to a 250% performance increase (STMV, 1,066,628 atoms): 250% higher performance for 54% additional expense (with-GPU vs. no-GPU node).

GROMACS Partnership Overview
Erik Lindahl, David van der Spoel, and Berk Hess are the head authors and project leaders; Szilárd Páll is a key GPU developer.
2010: single-GPU support (via the OpenMM library in GROMACS 4.5). NVIDIA DevTech resources allocated to the GROMACS code.
2012: GROMACS 4.6 will support multi-GPU nodes as well as GPU clusters.

GROMACS 4.6 Release Features
Multi-GPU support: GPU acceleration is one of the main focus areas, and the majority of features will be accelerated in 4.6 in a transparent fashion.
PME simulations get special attention, and most of the effort will go into keeping these algorithms well load-balanced between the CPU and the GPU (see the sketch below).
Reaction-field and cut-off simulations also run accelerated.
The list of features not supported with GPU acceleration will be quite short.
GROMACS multi-GPU support is expected in April 2012.
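As an illustration of the PME load-balancing idea mentioned above, here is a small Python sketch (not GROMACS source; the function names and the crude cost model are assumptions) of how a PP-PME balancer can shift work from the CPU's PME mesh to the GPU's short-range pair kernels by scaling the Coulomb cut-off and the Fourier grid spacing by the same factor, which keeps the Ewald accuracy roughly constant:

```python
def scaled_pme_setup(rcoulomb_nm, fourier_spacing_nm, scale):
    """Scale the short-range cut-off and the PME grid spacing together.

    With the Ewald splitting parameter re-derived from the tolerance,
    enlarging the cut-off by `scale` lets the mesh spacing grow by the
    same factor at roughly constant accuracy: more pair work for the
    GPU, fewer FFT grid points for the CPU.
    """
    return rcoulomb_nm * scale, fourier_spacing_nm * scale

def relative_costs(rcoulomb_nm, fourier_spacing_nm):
    """Crude cost model: pair work ~ rc^3, mesh work ~ (1/spacing)^3."""
    return rcoulomb_nm ** 3, (1.0 / fourier_spacing_nm) ** 3

base_rc, base_spacing = 0.9, 0.12  # nm, typical starting values
base_pair, base_mesh = relative_costs(base_rc, base_spacing)
for scale in (1.0, 1.1, 1.2, 1.3):
    rc, spacing = scaled_pme_setup(base_rc, base_spacing, scale)
    pair, mesh = relative_costs(rc, spacing)
    print(f"scale {scale:.1f}: rc = {rc:.2f} nm, spacing = {spacing:.3f} nm, "
          f"pair work x{pair / base_pair:.2f}, mesh work x{mesh / base_mesh:.2f}")
```

A balancer of this kind would step through such scale factors at run time and keep whichever split minimizes the measured time per MD step.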

GROMACS 4.6 Alpha Release: Absolute Performance
Absolute performance of GROMACS running CUDA- and SSE-accelerated non-bonded kernels with PME on 3-12 CPU cores and 1-4 GPUs. Simulations with cubic and truncated dodecahedron cells, pressure coupling, and virtual interaction sites enabling 5 fs time steps are shown.
Benchmark systems: RNAse in water, in cubic and truncated dodecahedron boxes.
Settings: electrostatics cut-off auto-tuned to >0.9 nm, LJ cut-off 0.9 nm, 2 fs and 5 fs (with vsites) time steps.
Hardware: workstation with 2x Intel Xeon X5650 (6C), 4x NVIDIA Tesla C2075.
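Since these absolute-performance numbers are quoted in ns/day and the time step (2 fs, or 5 fs with virtual sites) scales that figure directly, here is a short illustrative Python helper (not from the deck; the stepping rate below is a made-up example) for converting a measured stepping rate into ns/day:

```python
def ns_per_day(timestep_fs, steps_per_second):
    """Convert an MD time step and a measured stepping rate into ns/day.

    1 fs = 1e-6 ns; 86,400 seconds per day.
    """
    return timestep_fs * 1e-6 * steps_per_second * 86400.0

# Example: the same stepping rate yields 2.5x the ns/day when virtual
# interaction sites allow 5 fs steps instead of 2 fs steps.
rate = 500.0  # hypothetical steps per second sustained by the hardware
print(f"2 fs steps: {ns_per_day(2.0, rate):.1f} ns/day")
print(f"5 fs steps: {ns_per_day(5.0, rate):.1f} ns/day")
```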

GROMACS 4.6 Alpha Release: Strong Scaling
Strong scaling of GPU-accelerated GROMACS with PME and reaction-field on up to 40 cluster nodes with 80 GPUs.
Benchmark system: water box with 1.5M particles.
Settings: electrostatics cut-off auto-tuned to >0.9 nm for PME and 0.9 nm for reaction-field, LJ cut-off 0.9 nm, 2 fs time steps.
Hardware: Bullx cluster nodes with 2x Intel Xeon E5649 (6C), 2x NVIDIA Tesla M2090, 2x QDR InfiniBand (40 Gb/s).
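The strong-scaling results here (fixed problem size, more nodes) and the weak-scaling results on the following slides (problem size grows with the hardware) are commonly summarized as parallel efficiencies. A brief illustrative Python helper, not part of the original slides and using made-up timings, shows how both are computed:

```python
def strong_scaling_efficiency(t_base, t_n, n_base, n):
    """Fixed problem size: ideal runtime drops in proportion to node count."""
    speedup = t_base / t_n
    ideal_speedup = n / n_base
    return speedup / ideal_speedup

def weak_scaling_efficiency(t_base, t_n):
    """Problem size grows with node count: ideal runtime stays constant."""
    return t_base / t_n

# Hypothetical timings in seconds per 1000 MD steps.
t1, t40 = 120.0, 4.0   # same system on 1 node vs. 40 nodes
print(f"Strong-scaling efficiency on 40 nodes: "
      f"{strong_scaling_efficiency(t1, t40, 1, 40):.0%}")
# Weak scaling: 40x the particles on 40x the nodes.
print(f"Weak-scaling efficiency on 40 nodes: "
      f"{weak_scaling_efficiency(120.0, 130.0):.0%}")
```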

GROMACS 4.6 Alpha Release: PME Weak Scaling
Weak scaling of GPU-accelerated GROMACS with reaction-field and PME on 3-12 CPU cores and 1-4 GPUs. The gradient background indicates the range of system sizes that falls beyond the typical single-node production size.
Benchmark systems: water boxes ranging from 1.5k to 3M particles.
Settings: electrostatics cut-off auto-tuned to >0.9 nm for PME and 0.9 nm for reaction-field, LJ cut-off 0.9 nm, 2 fs time steps.
Hardware: workstation with 2x Intel Xeon X5650 (6C), 4x NVIDIA Tesla C2075.

GROMACS 4.6 Alpha Release: Reaction-Field Weak Scaling
Weak scaling of GPU-accelerated GROMACS with reaction-field and PME on 3-12 CPU cores and 1-4 GPUs. The gradient background indicates the range of system sizes that falls beyond the typical single-node production size.
Benchmark systems: water boxes ranging from 1.5k to 3M particles.
Settings: electrostatics cut-off auto-tuned to >0.9 nm for PME and 0.9 nm for reaction-field, LJ cut-off 0.9 nm, 2 fs time steps.
Hardware: workstation with 2x Intel Xeon X5650 (6C), 4x NVIDIA Tesla C2075.

GROMACS 4.6 Alpha Release: Weak Scaling
Weak scaling of the CUDA non-bonded force kernel on GeForce and Tesla GPUs: perfect weak scaling, with challenges remaining for strong scaling.
Benchmark systems: water boxes ranging from 1.5k to 3M particles.
Settings: electrostatics & LJ cut-off 1.0 nm, 2 fs time steps.
Hardware: workstation with 2x Intel Xeon X5650 (6C) CPUs, 4x NVIDIA Tesla C2075.

LAMMPS Released GPU Features and Future Plans*
(*Courtesy of Michael Brown at ORNL and Paul Crozier at Sandia Labs)

LAMMPS, August 2009: first GPU-accelerated support.

LAMMPS, Aug. 22, 2011: selected accelerated non-bonded short-range potentials (SP, MP, DP support); a reference sketch of the simplest of these pair kernels follows this list:
Lennard-Jones (several variants, with & without coulombic interactions)
Morse
Buckingham
CHARMM
Tabulated
Coarse-grain SDK
Anisotropic Gay-Berne
RE-squared
"Hybrid" combinations (GPU-accelerated & non-accelerated)
Also accelerated: Particle-Particle Particle-Mesh (SP or DP) and neighbor-list builds.

Longer term*:
Improve performance on smaller particle counts (the neighbor list is the problem).
Improve long-range performance (the MPI/Poisson solve is the problem).
Additional pair-potential support, including expensive advanced force fields (see the "Tremendous Opportunity for GPUs" slide*).
Performance improvements focused on specific science problems.
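To make concrete what these short-range pair potentials compute, below is a minimal CPU reference sketch in Python/NumPy of the cut-off Lennard-Jones force and energy, the simplest of the accelerated potentials. It is an illustrative brute-force loop, not LAMMPS code; the production GPU kernels add neighbor lists, mixed precision, and per-thread tiling, and the function name and example parameters here are assumptions.

```python
import numpy as np

def lj_forces(positions, box, epsilon=1.0, sigma=1.0, rcut=2.5):
    """Cut-off Lennard-Jones forces and energy in a periodic box.

    Brute-force O(N^2) reference for the short-range pair interaction
    that GPU-accelerated MD codes evaluate with neighbor lists.
    """
    n = len(positions)
    forces = np.zeros_like(positions)
    energy = 0.0
    rcut2 = rcut * rcut
    for i in range(n - 1):
        # Minimum-image displacement vectors to all later particles.
        dr = positions[i] - positions[i + 1:]
        dr -= box * np.round(dr / box)
        r2 = np.sum(dr * dr, axis=1)
        mask = r2 < rcut2
        inv_r2 = 1.0 / r2[mask]
        s6 = (sigma * sigma * inv_r2) ** 3
        s12 = s6 * s6
        energy += np.sum(4.0 * epsilon * (s12 - s6))
        # Pair force on i: 24*eps*(2*s12 - s6)/r^2 * dr; Newton's 3rd law below.
        fpair = (24.0 * epsilon * (2.0 * s12 - s6) * inv_r2)[:, None] * dr[mask]
        forces[i] += fpair.sum(axis=0)
        forces[i + 1:][mask] -= fpair
    return forces, energy

# Tiny example: 64 randomly placed particles in a cubic box (reduced LJ units).
rng = np.random.default_rng(0)
box = np.array([8.0, 8.0, 8.0])
pos = rng.uniform(0.0, box, size=(64, 3))
f, e = lj_forces(pos, box)
print(f"potential energy = {e:.3f}, max |force component| = {np.abs(f).max():.3f}")
```

Replacing the brute-force inner loop with a cell/neighbor-list lookup is exactly the part the slide flags as the bottleneck for smaller particle counts.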

LAMMPS: 8.6x Speed-Up with GPUs
Source: W.M. Brown, "GPU Acceleration in LAMMPS", 2011 LAMMPS Workshop.

LAMMPS: 4x Faster on a Billion Atoms
Billion-atom Lennard-Jones benchmark (103 seconds): 288 GPUs + CPUs vs. 1920 x86 CPUs.
Test platform: NCSA Lincoln cluster with S1070 1U GPU servers attached; CPU-only cluster: Cray XT5.

LAMMPS: 4x-15x Speedups for Gay-Berne and RE-Squared
From the August 2011 LAMMPS Workshop; courtesy of W. Michael Brown, ORNL.

LAMMPS Conclusions
Runs both on individual multi-GPU nodes and on GPU clusters.
Outstanding raw performance: 3x-40x higher than equivalent CPU code.
Impressive linear strong scaling, and good weak scaling up to a billion particles.
Tremendous opportunity to GPU-accelerate other force fields.