
Parallelization of CPAIMD using Charm++


1 Parallelization of CPAIMD using Charm++
Parallel Programming Lab

2 CPAIMD
Collaboration with Glenn Martyna and Mark Tuckerman
- MPI code – PINY
  - Scalability problems when #procs >= #orbitals
- Charm++ approach
  - Better scalability using virtualization
  - Further divide the orbitals (see the sketch below)
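A minimal sketch, not from the PINY/Charm++ code, of why further dividing the orbitals helps: with one object per state, 128 states cap useful parallelism at 128 processors, while a planar decomposition of each state yields many more objects than processors for the runtime to map and migrate. The plane count used here is an assumed illustrative value.

```cpp
#include <cstdio>

int main() {
    const int states = 128;          // "states" (orbitals) in the simulation
    const int planesPerState = 128;  // assumed planar decomposition of each state
    const int objects = states * planesPerState;

    for (int procs : {128, 256, 512, 1024, 1536}) {
        // Virtualization ratio: work objects per physical processor.
        printf("procs=%4d  objects=%d  objects/proc=%.1f\n",
               procs, objects, double(objects) / procs);
    }
    return 0;
}
```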

3 The Iteration

4 The Iteration (contd.)
- Start with 128 "states"
  - State – spatial representation of an electron
- FFT each of the 128 states
  - In parallel
  - Planar decomposition => transpose
- Compute densities (DFT)
- Compute energies using the density
- Compute forces and move the electrons
- Orthonormalize the states
- Start over (the loop is sketched below)
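A minimal sketch of the per-iteration control flow described on this slide. The function and type names are placeholders standing in for the real kernels; none of them come from the PINY/Charm++ code.

```cpp
#include <vector>
#include <cstdio>

struct State { std::vector<double> grid; };      // one electronic state

// Hypothetical stand-ins for the real kernels.
void forwardFFT(State&) {}                       // planar FFTs + transpose
std::vector<double> computeDensity(const std::vector<State>&) { return {}; }
double computeEnergy(const std::vector<double>&) { return 0.0; }
void computeForcesAndMove(std::vector<State>&, const std::vector<double>&) {}
void orthonormalize(std::vector<State>&) {}

int main() {
    std::vector<State> states(128);              // 128 states, per the slide
    for (int iter = 0; iter < 3; ++iter) {
        for (State& s : states) forwardFFT(s);   // FFT every state (in parallel in the real code)
        auto density = computeDensity(states);   // compute densities (DFT), per the slide
        double e = computeEnergy(density);       // energies from the density
        computeForcesAndMove(states, density);   // forces, then move the electrons
        orthonormalize(states);                  // keep the states orthonormal
        printf("iteration %d done, energy=%g\n", iter, e);
    }
    return 0;
}
```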

5 Parallel View

6 Optimized Parallel 3D FFT
To perform the 3D FFT:
- 1D FFTs followed by 2D FFTs, instead of 2D followed by 1D
  - Less computation
  - Less communication
A sketch of the transpose-based structure follows.
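A minimal serial sketch, assumed rather than taken from the application, of the 1D-then-2D structure: each line is transformed along one axis first, then the data is transposed (the all-to-all step in the parallel version), and the remaining 2D transform is applied plane by plane. A naive O(N^2) DFT stands in for a real FFT kernel.

```cpp
#include <vector>
#include <complex>
#include <cmath>

using cd = std::complex<double>;
const double PI = std::acos(-1.0);

// Naive 1D DFT of a line (placeholder for an FFT kernel).
std::vector<cd> dft1d(const std::vector<cd>& in) {
    size_t n = in.size();
    std::vector<cd> out(n);
    for (size_t k = 0; k < n; ++k)
        for (size_t j = 0; j < n; ++j)
            out[k] += in[j] * std::polar(1.0, -2.0 * PI * k * j / n);
    return out;
}

int main() {
    const int N = 8;                          // small cubic grid for illustration
    std::vector<cd> grid(N * N * N, cd(1, 0));
    auto at = [&](int x, int y, int z) -> cd& { return grid[(x * N + y) * N + z]; };

    // Phase 1: 1D transforms along z for every (x,y) column
    // (in the parallel code, each plane object owns a set of such columns).
    for (int x = 0; x < N; ++x)
        for (int y = 0; y < N; ++y) {
            std::vector<cd> line(N);
            for (int z = 0; z < N; ++z) line[z] = at(x, y, z);
            line = dft1d(line);
            for (int z = 0; z < N; ++z) at(x, y, z) = line[z];
        }

    // Transpose step: in the parallel code this is the all-to-all that
    // regroups the data so each object holds complete (x,y) planes.

    // Phase 2: 2D transforms over (x,y) for every z plane,
    // done as 1D transforms along y and then along x.
    for (int z = 0; z < N; ++z) {
        for (int x = 0; x < N; ++x) {
            std::vector<cd> line(N);
            for (int y = 0; y < N; ++y) line[y] = at(x, y, z);
            line = dft1d(line);
            for (int y = 0; y < N; ++y) at(x, y, z) = line[y];
        }
        for (int y = 0; y < N; ++y) {
            std::vector<cd> line(N);
            for (int x = 0; x < N; ++x) line[x] = at(x, y, z);
            line = dft1d(line);
            for (int x = 0; x < N; ++x) at(x, y, z) = line[x];
        }
    }
    return 0;
}
```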

7 Orthonormalization
- All-pairs operation
  - The data of each state has to meet the data of every other state
- Our approach (picture follows)
  - A virtual processor (VP) acts as a meeting point for several pairs of states
  - Create lots of these VPs
  - The number of pairs meeting at a VP: n
    - Communication decreases with n
    - Computation increases with n
    - Balance required (see the sketch after this list)
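A minimal sketch of an assumed grouping scheme, not the actual implementation: every unordered pair of states must meet, pairs are grouped n at a time onto virtual processors, and the number of distinct states a VP must receive (communication) grows more slowly than the number of pairs it computes (computation), which is the tradeoff the slide describes.

```cpp
#include <vector>
#include <set>
#include <utility>
#include <algorithm>
#include <cstdio>

int main() {
    const int states = 128;                       // states, per the earlier slides
    std::vector<std::pair<int,int>> pairs;
    for (int i = 0; i < states; ++i)
        for (int j = i + 1; j < states; ++j)
            pairs.push_back({i, j});              // all-pairs work units

    for (int n : {1, 4, 16, 64}) {                // pairs assigned per VP
        size_t vps = 0, maxStatesPerVP = 0;
        for (size_t start = 0; start < pairs.size(); start += n) {
            std::set<int> needed;                 // distinct states this VP must receive
            for (size_t k = start; k < pairs.size() && k < start + n; ++k) {
                needed.insert(pairs[k].first);
                needed.insert(pairs[k].second);
            }
            maxStatesPerVP = std::max(maxStatesPerVP, needed.size());
            ++vps;
        }
        printf("n=%2d  VPs=%5zu  pairs/VP=%2d  states received per VP <= %zu\n",
               n, vps, n, maxStatesPerVP);
    }
    return 0;
}
```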

8 VP-based approach

9 Performance
- Existing MPI code – PINY
  - Does not scale beyond 128 processors
  - Best per-iteration time: 1.7s
- Our performance:

  Processors   Time (s)
  128          2.07
  256          1.18
  512          0.65
  1024         0.48
  1536         0.39

A small speedup calculation follows.
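A small helper, not from the talk, that derives speedup and parallel efficiency relative to the 128-processor run from the timings in the table above.

```cpp
#include <cstdio>

int main() {
    const int    procs[] = {128, 256, 512, 1024, 1536};
    const double time[]  = {2.07, 1.18, 0.65, 0.48, 0.39};  // seconds per iteration
    for (int i = 0; i < 5; ++i) {
        double speedup    = time[0] / time[i];               // vs. the 128-proc run
        double efficiency = speedup * procs[0] / procs[i];   // fraction of ideal scaling
        printf("%5d procs: %.2f s  speedup %.2fx  efficiency %.0f%%\n",
               procs[i], time[i], speedup, efficiency * 100);
    }
    return 0;
}
```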

10 Load balancing
- Load imbalance due to the distribution of data in the orbitals
  - Planes are sections of a sphere, hence the imbalance
  - Computation – planes with more points do more work
  - Communication – planes with more points have more data to send
A sketch of the per-plane imbalance follows.
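A minimal, illustrative-only sketch of why the planes are unevenly loaded: the data lies inside a sphere, so a plane through the middle intersects a large disc while a plane near the edge intersects a small one. The cutoff radius used here is an assumption.

```cpp
#include <cstdio>

int main() {
    const int R = 16;                        // assumed cutoff radius in grid units
    for (int plane = -R; plane <= R; plane += 4) {
        int points = 0;
        for (int y = -R; y <= R; ++y)
            for (int z = -R; z <= R; ++z)
                if (plane * plane + y * y + z * z <= R * R)
                    ++points;                // grid points of this plane inside the sphere
        printf("plane x=%+3d  points=%d\n", plane, points);
    }
    return 0;
}
```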

11 Load Imbalance
Iteration time: 900ms on 1024 processors

12 Improvement - I
Pair heavily loaded planes with lightly loaded planes (a sketch of this pairing follows). Iteration time: 590ms
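A minimal sketch of an assumed pairing scheme, not the project's code: sort the planes by load, then match the heaviest remaining plane with the lightest remaining plane so every pair carries roughly the same total load. The per-plane loads are hypothetical.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>
#include <cstdio>

int main() {
    // Hypothetical per-plane loads (e.g., data points per plane).
    std::vector<double> load = {980, 120, 870, 300, 640, 200, 760, 450};

    std::vector<int> order(load.size());
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return load[a] < load[b]; });

    // Walk inward from both ends of the sorted order: lightest + heaviest.
    for (size_t lo = 0, hi = order.size() - 1; lo < hi; ++lo, --hi) {
        int light = order[lo], heavy = order[hi];
        printf("pair planes %d and %d  combined load %.0f\n",
               light, heavy, load[light] + load[heavy]);
    }
    return 0;
}
```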

13 Charm++ Load Balancing
Load balancing provided by the Charm++ runtime system; iteration time: 600ms

14 Improvement - II
Use a load-vector-based scheme to map planes to processors: the number of planes assigned to a processor depends on their weight, with correspondingly fewer "heavy" planes than "light" planes per processor (a mapping sketch follows). Iteration time: 480ms
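A minimal sketch of an assumed load-vector-based mapping: planes are sorted by measured load and each is assigned to the currently least-loaded processor, so processors holding heavy planes naturally end up with fewer planes overall. The loads and processor count are hypothetical.

```cpp
#include <algorithm>
#include <functional>
#include <vector>
#include <cstdio>

int main() {
    // Hypothetical per-plane loads; heavier planes come from the middle of the sphere.
    std::vector<double> load = {980, 870, 760, 640, 450, 300, 200, 120, 90, 60, 40, 20};
    std::sort(load.begin(), load.end(), std::greater<double>());

    const int procs = 4;
    std::vector<double> procLoad(procs, 0.0);
    std::vector<int>    procPlanes(procs, 0);

    for (double l : load) {
        // Pick the processor with the smallest accumulated load so far.
        int target = std::min_element(procLoad.begin(), procLoad.end()) - procLoad.begin();
        procLoad[target] += l;
        ++procPlanes[target];
    }
    for (int p = 0; p < procs; ++p)
        printf("proc %d: %d planes, total load %.0f\n", p, procPlanes[p], procLoad[p]);
    return 0;
}
```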

15 Scope for Improvement
- Load balancing
  - The Charm++ load balancer shows encouraging results on 512 PEs
  - Combination of automated and manual load balancing
- Avoiding copying when sending messages
  - In FFTs
  - Sending large read-only messages
- FFTs can be made more efficient
  - Use double packing (see the sketch after this list)
  - Make assumptions about the data distribution when performing FFTs
- Alternative implementation of orthonormalization
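A minimal, assumed sketch of the "double packing" idea mentioned above: two real sequences are packed into one complex array, a single complex transform is performed, and the two real-data transforms are recovered from the conjugate symmetry of the result. A naive DFT stands in for a real FFT kernel; this is not the application's code.

```cpp
#include <complex>
#include <vector>
#include <cmath>
#include <cstdio>

using cd = std::complex<double>;
const double PI = std::acos(-1.0);

std::vector<cd> dft(const std::vector<cd>& in) {          // placeholder O(N^2) DFT
    size_t n = in.size();
    std::vector<cd> out(n);
    for (size_t k = 0; k < n; ++k)
        for (size_t j = 0; j < n; ++j)
            out[k] += in[j] * std::polar(1.0, -2.0 * PI * k * j / n);
    return out;
}

int main() {
    const size_t N = 8;
    std::vector<double> x(N), y(N);
    for (size_t j = 0; j < N; ++j) { x[j] = std::sin(double(j)); y[j] = std::cos(2.0 * j); }

    // Pack: x into the real part, y into the imaginary part.
    std::vector<cd> z(N);
    for (size_t j = 0; j < N; ++j) z[j] = cd(x[j], y[j]);
    std::vector<cd> Z = dft(z);                            // one transform instead of two

    // Unpack using conjugate symmetry of the transforms of real data.
    for (size_t k = 0; k < N; ++k) {
        cd Zr = std::conj(Z[(N - k) % N]);
        cd X = (Z[k] + Zr) * 0.5;                          // DFT of x
        cd Y = (Z[k] - Zr) / cd(0.0, 2.0);                 // DFT of y
        printf("k=%zu  X=(%.3f,%.3f)  Y=(%.3f,%.3f)\n",
               k, X.real(), X.imag(), Y.real(), Y.imag());
    }
    return 0;
}
```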

