What is N-Body Simulation?
Simulating the interaction of some number N of objects in a system. A physics interpretation is the movement of stars in a galaxy under the influence of gravity.
An introduction to N-Body Simulation that I relied on extensively is available here:

How is N-Body Simulation done?
A simulation starts with the bodies at some initial positions with some initial velocities (for this project, these are randomly generated). Then, for each time step, the acceleration of each body is calculated from the gravitational influence of every other body. Velocity is updated based on acceleration, and position is updated based on velocity.
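
The random starting state described above could be generated along the following lines. This is only a sketch: the function name `init_random_bodies`, the helper `uniform`, the separate-array layout, and the value ranges are all assumptions made for illustration, not details taken from the project.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Uniform random double in [lo, hi], built on the C library's rand(). */
static double uniform(double lo, double hi)
{
    return lo + (hi - lo) * ((double)rand() / (double)RAND_MAX);
}

/* Fill n bodies with random positions, velocities, and masses.
 * The ranges used here are illustrative assumptions only. */
void init_random_bodies(size_t n, unsigned seed,
                        double *px, double *py, double *pz,
                        double *vx, double *vy, double *vz,
                        double *m)
{
    assert(px && py && pz && vx && vy && vz && m);
    srand(seed); /* fixed seed so a run can be repeated */
    for (size_t i = 0; i < n; ++i) {
        px[i] = uniform(-1.0, 1.0);
        py[i] = uniform(-1.0, 1.0);
        pz[i] = uniform(-1.0, 1.0);
        vx[i] = uniform(-0.1, 0.1);
        vy[i] = uniform(-0.1, 0.1);
        vz[i] = uniform(-0.1, 0.1);
        m[i]  = uniform(0.5, 1.5); /* strictly positive masses */
    }
}
```

Seeding with a fixed value makes it possible to start a serial run and a parallel run from the same state, which helps when comparing their outputs.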

Acceleration Calculation
The slow part of N-Body simulation is the acceleration calculation. Each body is under the influence of gravity from every other body in the system.
In the serial version, this is an n(n-1)/2 calculation (since the acceleration from i to j and from j to i can be calculated in the same loop).
    /* vector from body i to body j */
    double rjix, rjiy, rjiz;
    rjix = px[j] - px[i];
    rjiy = py[j] - py[i];
    rjiz = pz[j] - pz[i];
    double r2 = rjix*rjix + rjiy*rjiy + rjiz*rjiz;  /* squared distance */
    double r3 = r2*sqrt(r2);                        /* distance cubed */
    /* pull of j on i */
    ax[i] += m[j] * rjix / r3;
    ay[i] += m[j] * rjiy / r3;
    az[i] += m[j] * rjiz / r3;
    /* equal and opposite pull of i on j */
    ax[j] -= m[i] * rjix / r3;
    ay[j] -= m[i] * rjiy / r3;
    az[j] -= m[i] * rjiz / r3;

N-Body Simulation Position and Velocity Update
The simplest update step is a Forward Euler algorithm:
    r_{i+1} = r_i + v_i dt
    v_{i+1} = v_i + a_i dt
While these equations are simple to implement, they are not very accurate. For the entire time step dt a body moves in the v_i direction, which is only correct at time i.
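
A minimal C sketch of a Forward Euler update, assuming the usual form in which the position is advanced with the velocity at the start of the step and the velocity is advanced with the acceleration at the start of the step. The function name `euler_step` and the flat per-coordinate arrays are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* One Forward Euler step for n scalar coordinates.
 * p, v, a hold position, velocity, and acceleration at time step i;
 * on return p and v hold the values at time step i + 1. */
void euler_step(size_t n, double dt, double *p, double *v, const double *a)
{
    assert(dt > 0.0);
    for (size_t i = 0; i < n; ++i) {
        p[i] += v[i] * dt; /* moves with v_i for the whole step */
        v[i] += a[i] * dt; /* accelerates with a_i for the whole step */
    }
}
```

The function would be called once per axis (x, y, z) per time step. Updating the position before the velocity matters here: swapping the two lines would advance the position with v_{i+1}, which is a different scheme.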

Leap-Frog Algorithm
The problem with Forward Euler is that it is not very accurate. When dt is made 10 times smaller, the error shrinks by only a factor of 10.
Better accuracy can be achieved with better methods. With the Leap-Frog Algorithm we expect the error to shrink by a factor of 100 when dt is made 10 times smaller.
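
The scaling claim can be checked on a problem with a known answer. The sketch below is not taken from the project: it integrates the test equation x'' = -x from x(0) = 1, v(0) = 0 up to t = 1, where the exact answer is cos(1), and returns the error for each method. The function names and the choice of test problem are assumptions.

```c
#include <assert.h>
#include <math.h>

/* Error in x(1) for x'' = -x using Forward Euler with `steps` equal steps. */
double euler_error(int steps)
{
    assert(steps > 0);
    double dt = 1.0 / steps, x = 1.0, v = 0.0;
    for (int k = 0; k < steps; ++k) {
        double a = -x;   /* acceleration at the start of the step */
        x += v * dt;
        v += a * dt;
    }
    return fabs(x - cos(1.0));
}

/* Same problem using a leap-frog step: half kick, drift, half kick. */
double leapfrog_error(int steps)
{
    assert(steps > 0);
    double dt = 1.0 / steps, x = 1.0, v = 0.0, a = -x;
    for (int k = 0; k < steps; ++k) {
        v += 0.5 * dt * a;  /* half kick with the old acceleration */
        x += dt * v;        /* drift with the half-step velocity */
        a = -x;             /* acceleration at the new position */
        v += 0.5 * dt * a;  /* half kick with the new acceleration */
    }
    return fabs(x - cos(1.0));
}
```

Comparing `euler_error(100) / euler_error(1000)` with `leapfrog_error(100) / leapfrog_error(1000)` should give a ratio of roughly 10 for Forward Euler and roughly 100 for leap-frog, which is the first-order versus second-order behaviour described above.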

Leap-Frog Algorithm
Positions are defined at integer time steps and velocities at half-integer time steps (integer + ½). Over a full step, velocity is updated by (a_i + a_{i+1}) / 2, which approximates the value of a halfway between time steps i and i + 1.
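
One way to write such a step in C is as two half updates of the velocity around a position update, so that the net velocity change over the step is dt * (a_i + a_{i+1}) / 2. The sketch below makes several assumptions: the names `leapfrog_step` and `accel_fn` are invented here, and `spring_accel` (a = -p) is a hypothetical stand-in for the gravity calculation so that the example stays small.

```c
#include <assert.h>
#include <stddef.h>

/* Callback that fills a[0..n) with accelerations for positions p[0..n). */
typedef void (*accel_fn)(size_t n, const double *p, double *a);

/* Hypothetical stand-in for gravity: a simple spring, a = -p. */
void spring_accel(size_t n, const double *p, double *a)
{
    for (size_t i = 0; i < n; ++i)
        a[i] = -p[i];
}

/* One leap-frog step. On entry a holds a_i; on return it holds a_{i+1}. */
void leapfrog_step(size_t n, double dt, double *p, double *v, double *a,
                   accel_fn accel)
{
    assert(dt > 0.0 && accel != NULL);
    for (size_t i = 0; i < n; ++i)
        v[i] += 0.5 * dt * a[i];   /* velocity at step i + 1/2 */
    for (size_t i = 0; i < n; ++i)
        p[i] += dt * v[i];         /* position at step i + 1 */
    accel(n, p, a);                /* a_{i+1} from the new positions */
    for (size_t i = 0; i < n; ++i)
        v[i] += 0.5 * dt * a[i];   /* velocity at step i + 1 */
}
```

The acceleration array has to be filled once before the first call, since each step starts from the acceleration left behind by the previous one.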

How to Parallelize N-Body
Position and velocity are given. An initialization step then sets a at time step 0 based on the given positions, masses, and gravity.
Then for each time step:
1. Update each body's velocity based on ½ its acceleration at time t-1.
2. Update each body's position based on its velocity.
3. Update each body's acceleration based on the position and mass of every other body.
4. Update each body's velocity based on ½ its acceleration at time t.
As mentioned before, the acceleration update is an n(n-1)/2 operation. This is the part of the algorithm that needs to be parallelized.
The idea is for each MPI process to hold N/p of the bodies in the N-Body system.
For each time step, each process first updates the accelerations of its own bodies based on its local bodies. The processes then communicate in a ring, each sending the bodies it currently holds to the next process and receiving bodies from the previous process.
This is an n(n-1) algorithm, not n(n-1)/2 like the original, because each pair is evaluated twice, once for each body in the pair. With 2 processes the work per process only equals the serial work, so at least 3 processes are needed to achieve improved performance.

CUDA Parallelization
The CUDA parallelization is straightforward. Assign 1 GPU per MPI process, then have k of the n body updates occur in parallel on k CUDA threads.
Enough blocks are launched so that there is 1 CUDA thread for each of the n bodies.

Correctness?
Serial Version: To verify correctness I implemented the serial algorithm and used examples from to show that I seem to be getting reasonable results. The implementation may not be perfect; however, I believe it is correct enough to give accurate timings.
Parallel Version: I consider the parallel version correct if it matches the output of the serial version for at least a few time steps. Because the floating-point math is done in a different order in the parallel version, numeric errors creep in and the answers diverge. Currently the MPI version matches the serial version to 6 decimal places at 20 time steps for a 9-body system with 3 processes, but differs at 100 time steps.
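
A comparison of this kind could be automated with a small helper. The sketch below assumes that "matches to d decimal places" means the largest absolute difference is below half a unit in the d-th decimal place. That definition, and the names `max_abs_diff` and `matches_to_places`, are assumptions rather than the project's actual check.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Largest absolute element-wise difference between two arrays. */
double max_abs_diff(size_t n, const double *a, const double *b)
{
    double worst = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double d = fabs(a[i] - b[i]);
        if (d > worst)
            worst = d;
    }
    return worst;
}

/* Returns 1 if every pair differs by less than 0.5 * 10^(-places). */
int matches_to_places(size_t n, const double *a, const double *b, int places)
{
    assert(places >= 0);
    double tol = 0.5;
    for (int k = 0; k < places; ++k)
        tol /= 10.0;
    return max_abs_diff(n, a, b) < tol;
}
```

A tolerance is needed rather than exact equality because floating-point addition is not associative: in IEEE 754 double precision, (0.1 + 0.2) + 0.3 and 0.1 + (0.2 + 0.3) already differ in the last bit, and such differences can grow as the simulation advances.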

Note: Timings, speedup, and performance graphs are based on a run of 10 steps of the simulation. For example, it took the serial version 1400 seconds to run 10 steps at size 64,000.

Timings