
PuReMD Design
Initialization – neighbor list, bond list, hydrogen-bond list, and the coefficients of the QEq matrix
Bonded interactions – bond order, bond energy, lone pair, three-body, four-body, and hydrogen-bond terms
Non-bonded interactions – charge equilibration (QEq), Coulomb, and van der Waals forces

Neighbor-list generation
– A 3D grid structure is built by dividing the simulation domain into small cells
– Atoms are binned into these cells
– The neighborhood cutoff is relatively large, so each atom usually has a couple of hundred neighbors

Neighbor-list generation
– For each atom, its own cell and the neighboring cells are scanned, and atoms in those cells that fall within the cutoff are added to its neighbor list (a sketch follows)
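A minimal CUDA sketch of such a cell-based neighbor-list kernel; the array layout, the fixed 27-cell stencil, and all names (pos, cell_start, nbr_index, ...) are illustrative assumptions, not the actual PuReMD-GPU data structures.

```cuda
// Hypothetical sketch: one thread per atom scans its own cell and the 26
// surrounding cells and records every atom within the neighbor cutoff.
// Positions are assumed to be already wrapped into the box; minimum-image
// corrections are omitted for brevity.
__global__ void generate_neighbor_list(const float3 *pos, const int *cell_start,
                                       const int *cell_count, int3 grid_dim,
                                       float cell_len, float cutoff2,
                                       int *nbr_index, int *nbr_count,
                                       int max_nbrs, int n_atoms)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_atoms) return;

    float3 pi = pos[i];
    int3 c = make_int3(pi.x / cell_len, pi.y / cell_len, pi.z / cell_len);
    int count = 0;

    for (int dz = -1; dz <= 1; dz++)
      for (int dy = -1; dy <= 1; dy++)
        for (int dx = -1; dx <= 1; dx++) {
            // Wrap cell indices around the periodic boundaries.
            int cx = (c.x + dx + grid_dim.x) % grid_dim.x;
            int cy = (c.y + dy + grid_dim.y) % grid_dim.y;
            int cz = (c.z + dz + grid_dim.z) % grid_dim.z;
            int cell = (cz * grid_dim.y + cy) * grid_dim.x + cx;

            // Atoms are assumed to be stored in binned (cell) order.
            for (int k = 0; k < cell_count[cell]; k++) {
                int j = cell_start[cell] + k;
                if (j == i) continue;
                float rx = pi.x - pos[j].x, ry = pi.y - pos[j].y, rz = pi.z - pos[j].z;
                if (rx * rx + ry * ry + rz * rz <= cutoff2 && count < max_nbrs)
                    nbr_index[i * max_nbrs + count++] = j;
            }
        }
    nbr_count[i] = count;
}
```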

Other data structures
– The neighbor list is iterated to generate the bond list, hydrogen-bond list, and QEq matrix coefficients
– All of these lists are stored redundantly on the GPU
– Using three kernels, one per list, means each kernel redundantly traverses the neighbor list
– Instead, one kernel generates all of the list entries for each atom pair (see the sketch below)
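A sketch of the fused single-kernel approach, under hypothetical names and cutoffs: one pass over each atom's neighbor list emits bond-list, hydrogen-bond-list, and QEq entries, so the neighbor list is read only once.

```cuda
// Hypothetical sketch: one kernel traverses each atom's neighbor list once and
// generates entries for all three derived lists, instead of three kernels that
// each re-read the neighbor list.
__global__ void generate_lists(const int *nbr_index, const float *nbr_dist,
                               const int *nbr_count, int max_nbrs, int n_atoms,
                               float r_bond, float r_hbond, float r_qeq,
                               int *bond_idx, int *bond_count,
                               int *hbond_idx, int *hbond_count,
                               int *qeq_col, float *qeq_val, int *qeq_count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_atoms) return;

    int nb = 0, nh = 0, nq = 0;
    for (int k = 0; k < nbr_count[i]; k++) {
        int   j = nbr_index[i * max_nbrs + k];
        float r = nbr_dist[i * max_nbrs + k];

        if (r <= r_bond)  bond_idx[i * max_nbrs + nb++]  = j;   // bond list
        if (r <= r_hbond) hbond_idx[i * max_nbrs + nh++] = j;   // hydrogen-bond list
        if (r <= r_qeq) {                                       // QEq matrix row i
            qeq_col[i * max_nbrs + nq] = j;
            qeq_val[i * max_nbrs + nq] = 1.0f / r;  // placeholder for the shielded Coulomb entry
            nq++;
        }
    }
    bond_count[i] = nb; hbond_count[i] = nh; qeq_count[i] = nq;
}
```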

Bonded and non-bonded interactions
Bonded interactions are computed using one thread per atom (sketch below)
– The number of bonds per atom is typically low (< 10)
– The bond list is iterated to compute the force/energy terms
– Three-body and four-body interactions are expensive because of the number of bonds involved (a separate three-body list is built)
– Hydrogen bonds use multiple threads per atom, since the number of hydrogen bonds per atom is usually in the 100s
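A sketch of the one-thread-per-atom pattern for bonded terms; the energy and force expressions below are placeholders for the actual ReaxFF formulas, and all names are assumptions.

```cuda
// Hypothetical sketch: each thread walks its atom's short bond list and
// accumulates energy and force into that atom's own storage (no atomics).
__global__ void bonded_forces(const int *bond_idx, const float *bond_order,
                              const int *bond_count, int max_bonds, int n_atoms,
                              double *e_bond, double3 *force)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_atoms) return;

    double  e = 0.0;
    double3 f = make_double3(0.0, 0.0, 0.0);

    for (int k = 0; k < bond_count[i]; k++) {     // typically fewer than 10 iterations
        int   j  = bond_idx[i * max_bonds + k];   // bonded partner (unused in this stub)
        float bo = bond_order[i * max_bonds + k];
        e   += -0.5 * bo;                         // placeholder for the ReaxFF bond-energy term
        f.x +=  bo;                               // placeholder force contribution
        (void)j;
    }
    e_bond[i] = e;                                // per-atom partial, reduced later
    force[i].x += f.x; force[i].y += f.y; force[i].z += f.z;
}
```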

Non-bonded interactions
Charge equilibration
– SpMV in CSR format with one warp (32 threads) per row (sketch below)
– GMRES implemented using the cuBLAS library
Van der Waals and Coulomb interactions
– Iterate over the neighbor list
– Multiple-threads-per-atom kernel
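A minimal sketch of the one-warp-per-row CSR SpMV access pattern used inside the QEq solve; the block size, array names, and the warp-synchronous shared-memory reduction (a common Fermi-era idiom, since warp shuffles are unavailable there) are assumptions.

```cuda
// Hypothetical sketch of y = A*x with CSR storage, one warp (32 threads) per row.
__global__ void csr_spmv_warp_per_row(const int *row_ptr, const int *col_idx,
                                      const double *val, const double *x,
                                      double *y, int n_rows)
{
    int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / 32;  // row handled by this warp
    int lane    = threadIdx.x & 31;
    if (warp_id >= n_rows) return;

    // 32 lanes stride across the nonzeros of the row.
    double sum = 0.0;
    for (int k = row_ptr[warp_id] + lane; k < row_ptr[warp_id + 1]; k += 32)
        sum += val[k] * x[col_idx[k]];

    // Warp-synchronous reduction in shared memory (assumes blockDim.x == 256).
    __shared__ volatile double red[256];
    red[threadIdx.x] = sum;
    for (int off = 16; off > 0; off >>= 1)
        if (lane < off) red[threadIdx.x] += red[threadIdx.x + off];

    if (lane == 0) y[warp_id] = red[threadIdx.x];
}
```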

Experimental Results
– Intel Xeon E5606 CPU, 2.13 GHz, 24 GB RAM, Linux, and a Tesla C2075 GPU
– CUDA 5.0 environment
– "-funroll-loops -fstrict-aliasing -O3" compiler options, the same as the serial implementation (example invocation below)
– Double-precision arithmetic, with fused multiply-add (fmad) operations disabled
– 0.1 fs time step, QEq tolerance, NVE ensemble
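As a hypothetical example, the build described above might be invoked roughly as follows (file and binary names are placeholders); --fmad=false is the nvcc option that disables fused multiply-add:

```
nvcc -O3 --fmad=false \
     -Xcompiler "-funroll-loops -fstrict-aliasing -O3" \
     *.cu -o puremd-gpu
```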

Memory footprint
– All numbers are in MB
– All data structures are stored redundantly on the GPU
– The three-body interaction estimates the number of entries in the three-body list before executing the kernel, so memory is allocated as needed rather than excessively over-estimated, as is done in PuReMD
– A C2075 with 6 GB of memory can handle systems as large as 50K atoms

GPU overhead per interaction
– PuReMD-GPU uses additional memory to store temporary per-thread results and performs a final reduction for the energy/force computation
– Additional kernels reduce these lists to compute the effective forces and energies for each interaction
– These overheads would not be incurred if atomic operations were used in these computations

Atomic operations vs. redundant memory
– Kernel performance is severely impacted by the use of atomic operations on GPUs
– 64-bit atomic operations are not supported on the Tesla GPU and require two 32-bit operations, which results in serialization of competing threads
– Redundant memory instead gives each thread its own dedicated storage
– After the interaction kernel completes, additional kernels perform reductions to compute the final result (sketch below)
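A sketch of the second-pass reduction that the redundant-memory scheme requires: after an interaction kernel writes one partial energy per thread, a kernel like the following (hypothetical names, assumes 256-thread blocks) sums the partials block by block.

```cuda
// Hypothetical sketch: block-wise sum of per-thread partial energies.
// A second pass over the per-block results (or a host-side sum) gives the total.
__global__ void reduce_energy(const double *partial, double *block_sums, int n)
{
    __shared__ double buf[256];                  // assumes blockDim.x == 256
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x * 2 + tid;

    // Each thread folds in two elements to halve the number of blocks needed.
    double s = 0.0;
    if (i < n)              s += partial[i];
    if (i + blockDim.x < n) s += partial[i + blockDim.x];
    buf[tid] = s;
    __syncthreads();

    for (int off = blockDim.x / 2; off > 0; off >>= 1) {
        if (tid < off) buf[tid] += buf[tid + off];
        __syncthreads();
    }
    if (tid == 0) block_sums[blockIdx.x] = buf[0];
}
```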

Multiple-threads-per-atom kernel performance
– All timings are in milliseconds
– Allocating multiple threads per atom yields large speedups (sketch below)
– Allocating more threads than needed increases the scheduling overhead (it increases the total number of blocks to be executed)
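A sketch of the multiple-threads-per-atom pattern for long per-atom lists such as hydrogen bonds: a fixed group of K threads strides over one atom's entries and combines its partial sums in shared memory. K, the block size, the array names, and the energy expression are all placeholders.

```cuda
// Hypothetical sketch: K consecutive threads cooperate on one atom's
// hydrogen-bond list (often hundreds of entries per atom).
__global__ void hbond_energy(const float *hb_dist, const int *hb_count,
                             int max_hb, int n_atoms, double *e_hb)
{
    const int K = 16;                             // threads assigned per atom
    int t    = blockIdx.x * blockDim.x + threadIdx.x;
    int atom = t / K;
    int lane = t % K;

    double e = 0.0;
    if (atom < n_atoms)
        for (int k = lane; k < hb_count[atom]; k += K)
            e += 1.0 / hb_dist[atom * max_hb + k];   // placeholder for the ReaxFF hbond term

    __shared__ double buf[256];                   // assumes blockDim.x == 256 (a multiple of K)
    buf[threadIdx.x] = e;
    __syncthreads();

    // Tree reduction of the K partial sums belonging to the same atom.
    for (int off = K / 2; off > 0; off >>= 1) {
        if (lane < off) buf[threadIdx.x] += buf[threadIdx.x + off];
        __syncthreads();
    }
    if (lane == 0 && atom < n_atoms) e_hb[atom] = buf[threadIdx.x];
}
```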

Performance of PuReMD-GPU

Accuracy Results
– Double-precision arithmetic
– No FMAD (fused multiply-add) operations
– Deviation of 4% after 100K time steps
– A million time steps of simulation were executed

Accuracy Results

PG-PuReMD
Multi-node, multi-GPU implementation
Problem decomposition
– 3D domain decomposition with wrap-around links for periodic boundaries (torus)
Outer-shell selection
– Full-shell, half-shell, midpoint-shell, neutral-territory (zonal) methods, etc.
– PG-PuReMD uses the full-shell method
– r_shell = max(3 × r_bond, r_hbond, r_nonb)

Inter-process communication
– With 3D domain decomposition and a full outer shell, processes can use either direct messaging or staged messaging
– Our experiments show that staged messaging is more efficient as the number of processors increases
– Each process sends atoms in the +x/-x directions first, then +y/-y, and finally +z/-z (sketch below)
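A host-side sketch (plain C/MPI, part of the same CUDA program) of staged messaging under assumed names: boundary data is exchanged one dimension at a time, so each process only ever talks to its six face neighbors, and corner/edge atoms are forwarded by intermediate processes.

```c
#include <mpi.h>

// Hypothetical sketch: exchange boundary atoms in +x/-x, then +y/-y, then +z/-z.
// In the real code, atoms received in one stage are appended to the send buffer
// before the next dimension's exchange (omitted here for brevity).
void staged_exchange(double *send_buf, int send_cnt,
                     double *recv_buf, int max_recv, int *recv_cnt,
                     const int nbr_plus[3], const int nbr_minus[3], MPI_Comm comm)
{
    MPI_Status st;
    for (int d = 0; d < 3; d++) {                 /* d = 0: x, 1: y, 2: z */
        /* +d direction: send to the "plus" neighbor, receive from the "minus" one. */
        MPI_Sendrecv(send_buf, send_cnt, MPI_DOUBLE, nbr_plus[d],  d,
                     recv_buf, max_recv, MPI_DOUBLE, nbr_minus[d], d,
                     comm, &st);
        MPI_Get_count(&st, MPI_DOUBLE, recv_cnt);

        /* -d direction: send to the "minus" neighbor, receive from the "plus" one. */
        MPI_Sendrecv(send_buf, send_cnt, MPI_DOUBLE, nbr_minus[d], d + 3,
                     recv_buf, max_recv, MPI_DOUBLE, nbr_plus[d],  d + 3,
                     comm, &st);
    }
}
```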

Experimental Results
– LBNL GPU cluster: 50 GPU nodes interconnected with a QDR InfiniBand switch
– GPU node: two Intel quad-core Nehalem (QPI) processors (8 cores/node), 24 GB RAM, and Tesla C2050 Fermi GPUs with 3 GB of global memory
– Single CPU: Intel Xeon E5606, 2.13 GHz, 24 GB RAM

Experimental Setup
– CUDA 5.0 development environment
– "-arch=sm_20 -funroll-loops -O3" compiler options
– GCC 4.6 compiler; MPICH2 / OpenMPI
– CUDA object files are linked with OpenMPI to produce the final executable (see the build sketch below)
– QEq tolerance, 0.25 fs time step, NVE ensemble
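A hypothetical build sequence matching the description above; the file names, paths, and the exact MPI wrapper (mpicc) are placeholders:

```
# Compile CUDA sources to objects, then link with the MPI compiler wrapper.
nvcc  -arch=sm_20 -O3 -Xcompiler -funroll-loops -c cuda_kernels.cu -o cuda_kernels.o
mpicc -O3 -funroll-loops -c puremd_host.c -o puremd_host.o
mpicc puremd_host.o cuda_kernels.o -o pg-puremd \
      -L/usr/local/cuda/lib64 -lcudart -lstdc++
```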

Strong Scaling Results

Weak Scaling Results