Efficient Parallel Implementation of Molecular Dynamics with Embedded Atom Method on Multi-core Platforms
Reporter: Jilin Zhang
Authors: Changjun Hu, Yali Liu, and Jianjiang Li
Information Engineering School, University of Science and Technology Beijing, Beijing, P.R. China

Outline
1 Motivation
2 Related Works
3 Spatial Decomposition Coloring (SDC) Approach
4 Short-Range Force Calculations of EAM Using the SDC Method
5 Experiments and Discussion
6 Conclusion and Future Directions

1 Motivation
The process of molecular dynamics simulations is a loop: set the initial state, calculate the forces, then calculate the new positions of the atoms.
Fig. 1: The process of molecular dynamics simulations.

1 Motivation
The intensive computation in MD simulations appears in the short-range force calculation procedure.
The neighbor-list method greatly reduces this computation: it makes each atom interact only with the atoms in its neighbor region.
Newton's third law can halve the force computations, but it introduces reduction operations on irregular arrays.
Fig. 2: Code of the force calculations.
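To make the irregular reduction concrete, here is a minimal sketch of such a force loop (our illustration, not the authors' code; pair_force and the one-dimensional positions are simplifying assumptions). Visiting each pair once means two data-dependent writes into f[], which is exactly what blocks naive parallelization:

```c
/* Hypothetical pair-force kernel (1-D positions for brevity). */
extern double pair_force(double xi, double xj);

/* Half neighbor lists: neighbor[i] holds only neighbors j > i, so each
 * pair is visited once and Newton's third law gives the second update. */
void compute_forces(int natoms, const double *x, double *f,
                    const int *nneigh, int *const *neighbor)
{
    for (int i = 0; i < natoms; i++) {
        for (int k = 0; k < nneigh[i]; k++) {
            int j = neighbor[i][k];
            double fij = pair_force(x[i], x[j]);
            f[i] += fij;   /* update atom i ...                  */
            f[j] -= fij;   /* ... and its partner j: an irregular */
        }                  /* reduction on the shared array f[]   */
    }
}
```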

2 Related Works: Parallel Reduction Operations on Irregular Arrays
There are several types of solutions:
enclosing the reduction operation in a critical section
privatizing the reduction array
using a redundant-computations strategy

2 Related Works: Parallel Reduction Operations on Irregular Arrays
Enclosing the reduction operation in a critical section:
creates a critical section in the inner loop
straightforward and easy to implement
high synchronization cost, because a critical region, atomic operation, or lock sits in the inner loop
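A sketch of this variant under the same assumptions as before: the atom loop is parallelized and every scatter to the shared array is guarded, so the result is correct but the guard is paid once per pair:

```c
extern double pair_force(double xi, double xj);

void compute_forces_cs(int natoms, const double *x, double *f,
                       const int *nneigh, int *const *neighbor)
{
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < natoms; i++) {
        for (int k = 0; k < nneigh[i]; k++) {
            int j = neighbor[i][k];
            double fij = pair_force(x[i], x[j]);
            #pragma omp critical   /* entered once per pair: the
                                      synchronization cost dominates */
            {
                f[i] += fij;
                f[j] -= fij;
            }
        }
    }
}
```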

2 Related Works: Parallel Reduction Operations on Irregular Arrays
Privatizing the reduction array:
each thread updates the shared array in a critical region according to the values in its private array
this reduces how often the critical region is entered, and so reduces the synchronization cost
the high memory overhead of the private arrays limits the number of particles allowed in simulations
the private arrays compete for cache space and decrease program speed
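A sketch of the privatized variant (again with illustrative names): each thread accumulates into its own copy of the force array, and the critical region is entered once per thread instead of once per pair, trading synchronization for nthreads × natoms extra memory:

```c
#include <stdlib.h>

extern double pair_force(double xi, double xj);

void compute_forces_sap(int natoms, const double *x, double *f,
                        const int *nneigh, int *const *neighbor)
{
    #pragma omp parallel
    {
        /* One full-size private copy per thread: this is the memory
         * overhead that limits the particle count. */
        double *fpriv = calloc((size_t)natoms, sizeof *fpriv);

        #pragma omp for schedule(dynamic) nowait
        for (int i = 0; i < natoms; i++) {
            for (int k = 0; k < nneigh[i]; k++) {
                int j = neighbor[i][k];
                double fij = pair_force(x[i], x[j]);
                fpriv[i] += fij;   /* race-free: private array */
                fpriv[j] -= fij;
            }
        }

        #pragma omp critical       /* merge: once per thread, not per pair */
        for (int i = 0; i < natoms; i++)
            f[i] += fpriv[i];

        free(fpriv);
    }
}
```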

2 Related Works: Parallel Reduction Operations on Irregular Arrays
The redundant-computations strategy does not use Newton's third law, so each pair interaction has to be calculated twice.
high parallelizability, since the data dependence between loop iterations has been removed
but the computation is doubled, and the full neighbor list requires more memory space
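A sketch of the redundant variant: with full (two-sided) neighbor lists, iteration i writes only to f[i], so no synchronization is needed at all; the price is that every pair force is evaluated twice, once from each side:

```c
extern double pair_force(double xi, double xj);

/* Full neighbor lists: every pair (i, j) appears in both atoms' lists,
 * roughly doubling the list's memory footprint. */
void compute_forces_rc(int natoms, const double *x, double *f,
                       const int *nneigh_full, int *const *neighbor_full)
{
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < natoms; i++) {
        double fi = 0.0;
        for (int k = 0; k < nneigh_full[i]; k++) {
            int j = neighbor_full[i][k];
            fi += pair_force(x[i], x[j]);   /* each pair computed twice */
        }
        f[i] = fi;   /* iteration i owns f[i]: no dependence, no locks */
    }
}
```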

3 Spatial Decomposition Coloring (SDC) Approach
The Spatial Decomposition (SD) method targets distributed-memory multiprocessors involving several hundred processors: it changes all array declarations and all loop bounds, and explicitly codes the periodic transfer of the boundary data between processors.
It is simple to implement SD in OpenMP.

3 Spatial Decomposition Coloring (SDC) Approach
However, the SD method places a restriction on parallelism in OpenMP: synchronization is required to ensure that multiple threads do not attempt to update the same atom simultaneously.
Fig. 3: The SD method.

3 Spatial Decomposition Coloring (SDC) Approach
The SDC method consists of the following steps:
Step 1: Split the domain
Step 2: Color the subdomains
Step 3: Compute in parallel

3 Spatial Decomposition Coloring (SDC) Approach
Step 1: Split the domain
Split the spatial domain into subdomains.
The length of a subdomain must be longer than the cutoff diameter, so that two subdomains of the same color (which are never adjacent) can never update the same atom.
The number of subdomains in each decomposed dimension should be even.

3 Spatial Decomposition Coloring (SDC) Approach
Step 2: Color the subdomains
The number of subdomains of each color must be equal.
Each subdomain is surrounded only by subdomains of different colors.
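As an illustration (our sketch, not the paper's code), a two-dimensional decomposition satisfies both conditions with a four-color checkerboard; the even subdomain counts required by Step 1 make the pattern wrap correctly under periodic boundaries and give every color the same number of subdomains:

```c
/* Four-color checkerboard for a 2-D decomposition: subdomain (ix, iy)
 * gets a color in 0..3. Any two subdomains that touch, even diagonally,
 * differ in the parity of ix or iy, so they get different colors. */
static int color_of(int ix, int iy)
{
    return (ix % 2) + 2 * (iy % 2);
}
```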

3 Spatial Decomposition Coloring (SDC) Approach
Step 3: Compute in parallel
The force calculations on subdomains of one color can run in parallel.
A barrier is needed so that all threads finish the computation on one color before the next color starts.
Calculations on subdomains of different colors must run serially.
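Putting the steps together, a minimal OpenMP skeleton (nsub, subdomains, and compute_subdomain are our illustrative names) might look as follows; the implicit barrier at the end of each parallel loop provides the per-color synchronization:

```c
extern void compute_subdomain(int subdomain_id);   /* hypothetical kernel */

void sdc_sweep(int ncolors, const int *nsub, int *const *subdomains)
{
    for (int c = 0; c < ncolors; c++) {        /* colors: serial */
        #pragma omp parallel for schedule(dynamic)
        for (int s = 0; s < nsub[c]; s++)      /* one color: parallel, */
            compute_subdomain(subdomains[c][s]);   /* lock-free        */
        /* implicit barrier: all threads finish color c before c+1 */
    }
}
```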

3 Spatial Decomposition Coloring (SDC) Approach
Advantages of the SDC method:
the neighbor list usually does not need to be updated in every time step, so the cost of the SDC method is very low
a higher-dimensional decomposition creates more subdomains, so the method is scalable and suitable for multi-core and many-core architectures
Disadvantage (shared with the spatial decomposition method):
load imbalance, unless the simulated system has a uniform density

4 Short-Range Force Calculations of EAM Using the SDC Method
In the EAM method, the short-range forces are the intensive computation.
The calculation consists of three computational phases; the most time-consuming are phases 1 and 3.
Fig. 4: Short-range forces in the EAM method.

4 Short-Range Force Calculations of EAM Using the SDC Method
The parallel procedure of the short-range force calculations using the SDC method:
1) Run the electron-density computations using the SDC method
2) Calculate the embedding-function values and their derivatives in parallel
3) Run the force calculations using the SDC method
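Phase 2 needs no coloring: the embedding energy depends only on each atom's own electron density, so a plain parallel loop suffices. A sketch, where embedding, rho, F, and dF are our illustrative names:

```c
/* Hypothetical per-atom embedding function and its derivative. */
extern void embedding(double rho, double *F, double *dF);

void eam_phase2(int natoms, const double *rho, double *F, double *dF)
{
    /* Independent per-atom work: no irregular reduction, no coloring. */
    #pragma omp parallel for
    for (int i = 0; i < natoms; i++)
        embedding(rho[i], &F[i], &dF[i]);
}
```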

4 Short-Range Force Calculations of EAM Using the SDC Method
Force calculations based on the SDC method (see the sketch below):
L1: loops over the colors (computations on subdomains with different colors run serially)
L2: loops over the subdomains with the same color (run in parallel)
L3: deals with all atoms that constitute a subdomain
L4: deals with the neighbors of an atom
Fig. 5: Force calculations using SDC.
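A hedged sketch of this loop nest (all identifiers are illustrative; the real code in Fig. 5 may differ). Note that the scatter to atom j needs no lock: subdomains are wider than the cutoff diameter, so two threads working on the same color can never reach the same neighbor atom:

```c
extern double pair_force(double xi, double xj);

void sdc_forces(int ncolors, const int *nsub, int *const *subdomains,
                const int *sub_natoms, int *const *sub_atoms,
                const int *nneigh, int *const *neighbor,
                const double *x, double *f)
{
    for (int c = 0; c < ncolors; c++) {                  /* L1: colors */
        #pragma omp parallel for schedule(dynamic)
        for (int s = 0; s < nsub[c]; s++) {              /* L2: subdomains */
            int d = subdomains[c][s];
            for (int a = 0; a < sub_natoms[d]; a++) {    /* L3: atoms */
                int i = sub_atoms[d][a];
                for (int k = 0; k < nneigh[i]; k++) {    /* L4: neighbors */
                    int j = neighbor[i][k];
                    double fij = pair_force(x[i], x[j]);
                    f[i] += fij;
                    f[j] -= fij;   /* lock-free thanks to the coloring */
                }
            }
        }   /* implicit barrier before the next color */
    }
}
```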

5 Experiments and Discussion
Experimental environment:
four Intel Xeon(R) quad-core E7320 processors (4 MB L2 cache), 16 GB memory
OS: Fedora release 9 with kernel 2.6.25; compiler: gcc 4.3.0
Experimental cases:
observe micro-deformation behaviors of pure Fe metal
material model from the XMD program, under periodic boundary conditions
initial state: body-centered cubic (bcc) lattice arrangement
Test cases:
Small-scale case (1): 54,000 atoms
Medium-scale case (2): 265,302 atoms
Large-scale case (3): 1,062,882 atoms
Large-scale case (4): 3,456,000 atoms

Table 1. The speedups of the Spatial Decomposition Coloring (SDC) methods on 2 to 16 cores (— = value not available).

Small-scale case (1):
  Cores              2     3     4     8     12    16
  SDC (one-dim)    1.71  2.46  3.07  4.17   —     —
  SDC (two-dim)    1.70   —     —    4.74  5.90  6.43
  SDC (three-dim)  1.66  2.40  2.99  4.61  5.74  6.30

Medium-scale case (2):
  Cores              2     3     4     8     12    16
  SDC (one-dim)    1.84  2.64  3.37  6.24   —    6.33
  SDC (two-dim)     —    2.65  3.39  6.20  8.89  10.90
  SDC (three-dim)  1.82   —    3.36  6.16  8.76  10.78

Large-scale case (3):
  Cores              2     3     4     8     12    16
  SDC (one-dim)    1.86  2.76  3.67  6.82  9.76  9.59
  SDC (two-dim)    1.87  2.78  3.64  6.74  9.73  12.31
  SDC (three-dim)   —    2.75   —    6.64  9.65  12.29

Large-scale case (4):
  Cores              2     3     4     8     12    16
  SDC (one-dim)    1.88  2.79  3.66   —    9.97  9.82
  SDC (two-dim)     —    2.80  3.65  6.77  9.84  12.42
  SDC (three-dim)   —     —     —     —     —    12.34

5 Experiments and Discussion
Scalability of our SDC method: the performance of the multi-dimensional SDC methods improves both with the number of cores and with the number of atoms.
Performance of the SDC methods: the two-dimensional SDC method achieves the highest efficiency.
The two-dimensional decomposition strives to create subdomains with a small surface area and a large volume, which yields better cache locality than the one-dimensional strategy.
The three-dimensional SDC method slightly degrades performance because of the extra overhead of forking and joining threads and of scheduling.

Fig. 6: The speedup of the two-dimensional Spatial Decomposition Coloring (SDC) method, the Critical Section (CS) method, the Shared Array Privatization (SAP) method, and the Redundant Computations (RC) method.

5 Experiments and Discussion
The SDC method achieves a nearly linear speedup, higher than all the other methods.
The reason for the nearly linear speedup is that the low synchronization cost of the implicit barriers in our method is amortized over a large amount of computation.
The CS method achieves the lowest efficiency, because it encloses the reduction operations on the irregular array in a critical section.
The SAP method's performance degrades as the number of executing cores increases: memory overhead + synchronization overhead.
RC vs. SDC: the RC method performs nearly twice the computation of the SDC method for the short-range force calculations, so its efficiency is lower than that of the SDC method.

6 Conclusion and Future Directions
We presented a scalable spatial decomposition coloring (SDC) method that solves a class of short-range force calculation problems on shared-memory multi-core platforms.
It is scalable both to large simulation systems and to many-core architectures.
Future directions:
study the SDC method on NUMA memory architectures
implement the SDC method with MPI+OpenMP on multi-core clusters

Thank You!