Port AMSS-NCKU code to GPU Zhoujian Cao Academy of Mathematics and System Science, CAS Cowork with Zhihui Du, Steven Brandt, Frank Loeffler and Quan Yang.

Port AMSS-NCKU code to GPU. Zhoujian Cao, Academy of Mathematics and System Science, CAS. Cowork with Zhihui Du, Steven Brandt, Frank Loeffler and Quan Yang. 2013-8-7, 2013 International School on Numerical Relativity and Gravitational Waves, Pohang, Korea

Outline: Motivations from gravitational wave detection; New parallel mesh refinement numerical scheme; GPU acceleration for NR; Summary

The most stringent test of GR: the "anomalous" precession of the perihelion of Mercury (1915, v≈ ); deflection of starlight (1919, v≈ ); gravitational redshift (1965, v≈ ); gravitational time delay effect (1968, v≈ ); evidence of gravitational waves (1978, v≈ ); frame-dragging effect (2010, v≈ ); direct gravitational wave detection (?, v≈1). GR = Newtonian Gravity + PN(v) + PN(v^2) + ……

Gravitational wave astronomy: look back to the extremely early universe; hear the dark universe

Gravitational wave and its detection

Category of Black Holes: Supermassive black hole: M: 10^5 to 10^9 Msun. Stellar-mass black hole: M: 1 to 10s Msun. Intermediate-mass black hole: M: 10s to 10^5 Msun (mainly in globular clusters) [Farrell, et al, Nature 460 (2009) 73; Feng, et al, New Astronomy Reviews 55 (2011) 166]

Category of Black Holes Binary

IMBH and GW detection: ALIA [Xuefei Gong, et al, CQG 28, 094012 (2011)]; Advanced LIGO [Abadie, et al, PRD 85, 102004 (2012)] (figure labels: mass ratios 1:1000 and 1:1)

Data analysis and template: refer to Sang Hoon Oh's lecture

Template model for BBH ????? Yi Pan’s talk, 2013

Template model for BBH: PN templates: for the early stage of the inspiral. EOBNR (effective one body model together with numerical relativity): for the full inspiral + merger + ringdown stage; works well for mass ratios less than 1:8 and for extreme mass ratio BBH; high spin, precession! But there is no reliable template for mass ratios 1:10 to 1:100

PN estimation: starting from a given separation of the two BHs, the number of orbits increases quickly as the mass ratio increases, so the cost of a numerical simulation with full GR increases accordingly. Compared to 1:1, 1:100 needs 10 times more computational cost.

Computational cost: 1:1, 9 days; 1:100, 20 days (LSSC cluster II, 128 CPUs, for the last 2 orbits); computational cost 1 to 20!!

Challenge of large mass ratio BBH to NR: compared to 1:1, the computational cost of a 1:100 BBH increases roughly 200 times!! A typical simulation of a 1:1 BBH needs 14 days. So with the straightforward method, 1:100 needs roughly 1 year!!

Possible ways out: 1. Physical level: approximation methods, such as the self-force framework (but still only first order so far), …… 2. Numerical algorithm level: implicit scheme [R. Lau et al, PRD 84, 084023 (2011)], combining Cauchy evolution with null evolution, …… 3. Computer level: improve scalability to use more CPUs, use GPU, ……

Mesh refinement scheme: high-resolution mesh grids for the region near the BH, low-resolution mesh grids for the far region
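To illustrate why nested levels pay off, the sketch below counts grid points for a stack of nested boxes and for a single uniform grid at the finest resolution. The factor-of-two refinement ratio, the box sizes and the helper names are assumptions chosen for illustration; they are not taken from an actual AMSS-NCKU setup.

```python
def level_spacings(h0, num_levels, ratio=2):
    """Grid spacing per level; level 0 is the coarsest (assumed ratio of 2)."""
    return [h0 / ratio**lev for lev in range(num_levels)]


def points_per_dim(box_width, h):
    """Number of grid points along one dimension of a box."""
    return int(round(box_width / h)) + 1


# Hypothetical setup: coarsest box of width 512 with spacing 4;
# every finer box is half as wide and has half the spacing.
h0, width0, num_levels = 4.0, 512.0, 8
spacings = level_spacings(h0, num_levels)
widths = [width0 / 2**lev for lev in range(num_levels)]

per_level = [points_per_dim(w, h) for w, h in zip(widths, spacings)]
nested_points = sum(n**3 for n in per_level)
uniform_points = points_per_dim(width0, spacings[-1]) ** 3

print("points per dimension on each level:", per_level)
print("nested boxes, total points:", nested_points)
print("uniform grid at finest spacing:", uniform_points)
```

In this toy setup every level has the same number of points, so the total grows linearly with the number of levels, while the uniform grid grows with the cube of the resolution.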

Mesh refinement in CFD: result based on PARAMESH. Packages: PARAMESH, GrACE, JASMIN, ……

Comparison of NR and CFD: NR (only for BH): computationally expensive on a single grid point, but the functions are quite smooth, so few grid points (hundreds) and high order finite difference. CFD: computation on a single point is cheap, but the fluid dynamics is quite complex (compare the lectures on HD), so the number of grid points is quite large (millions)

Mesh refinement scheme: scheme adopted by PARAMESH (figure: Level 0, Level 1)

Mesh refinement scheme: scheme adopted by PARAMESH (figure: Level 0 and Level 1 in the t-x plane)

Mesh refinement scheme: scheme for NR (figure: Level 0, Level 1). Distribute the data of one level to the available processes

Mesh refinement scheme: scheme for NR, the LS scheme [F. Loeffler et al, CQG 29, 115001 (2012)] (figure: Level 0, Level 1)

Mesh refinement scheme: parallelization limit: a 200x200x200 grid with 6th order finite difference (8 ghost points for two sides) limits the number of usable processes. How about distributing the data on all levels and calculating them in parallel?
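One way to see this limit is to count ghost points per owned point as the 200x200x200 grid is cut into more blocks. The sketch below assumes a cubic decomposition with the same number of processes in each dimension, 4 ghost points per side (from the "8 ghost points for two sides" above), and treats every block face as an interior interface; the function name is hypothetical.

```python
def ghost_overhead(n_global, procs_per_dim, ghost_per_side=4):
    """Ratio of ghost points to owned points for one cubic block."""
    n_local = n_global / procs_per_dim
    padded = n_local + 2 * ghost_per_side
    return (padded**3 - n_local**3) / n_local**3


for p in (1, 2, 5, 10, 25):
    block = 200 // p
    ratio = ghost_overhead(200, p)
    print(f"{p**3:6d} processes: block {block}^3, ghost/owned = {ratio:.2f}")
```

With 25 processes per dimension each block owns only 8 points per side, and under these assumptions it carries seven ghost points for every point it owns, so adding more processes to a single level stops paying off.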

Parallel mesh level algorithm, PX scheme: distribute the data on all levels to all processes; calculate in parallel

Mesh refinement scheme (time runs downward):
Procs for lev0   procs for lev1   procs for lev2
run              run              run
wait             wait             run
wait             run              run
wait             wait             run
run              run              run
…                …                …
Strong scaling property because there is more data to distribute; resources are wasted (L times the procs of LS) due to waiting! Calculation speed: 2 times faster!
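The run/wait pattern of the PX scheme can be reproduced with a few lines. The sketch assumes that each finer level takes two time steps per step of the next coarser level and that one level step costs the same wall time on every process group; the function name is hypothetical.

```python
def px_schedule(num_levels, ticks):
    """Run/wait pattern per level group; one tick = one step of the finest level."""
    rows = []
    for k in range(ticks):
        row = []
        for lev in range(num_levels):
            period = 2 ** (num_levels - 1 - lev)  # finest-level steps per step of lev
            row.append("run" if k % period == 0 else "wait")
        rows.append(row)
    return rows


table = px_schedule(3, 5)
print("lev0  lev1  lev2")
for row in table:
    print("  ".join(f"{cell:4s}" for cell in row))

# Fraction of (group, tick) slots doing work during one coarse step (4 ticks).
cycle = px_schedule(3, 4)
busy = sum(cell == "run" for row in cycle for cell in row)
utilization = busy / (3 * 4)
print(f"busy slots: {busy} of 12, utilization {utilization:.2f}")
```

For three levels only 7 of the 12 slots in one coarse step do work, which is the waiting cost noted for PX.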

Parallel mesh level algorithm, P2 scheme: distribute the data on the finest level to half of the processes, and distribute the data on the other levels, one level at a time, to the other half; the finest level and the other levels are calculated in parallel, while the other levels are calculated sequentially among themselves (figure: lev0, lev1, lev2)

Mesh refinement scheme (time runs downward):
Procs for lower levels   procs for lev2
lev1                     run
lev0                     run
lev1                     run
wait                     run
lev1                     run
…                        …
Scaling property is weaker than PX; less waiting (2 times the procs of LS)! Calculation speed: 2 times faster!
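A minimal cost model makes the "2 times faster" figure for PX and P2 plausible. It assumes that every process group is as large as the whole LS process set (so PX uses L times and P2 uses 2 times the processes of LS), that one level step costs one time unit on one group, that each finer level takes two steps per coarser step, and that the finest level never waits; communication is ignored. The function name is hypothetical.

```python
def coarse_step_cost(num_levels):
    """Return {scheme: (time, groups)} for one time step of the coarsest level."""
    steps = [2**lev for lev in range(num_levels)]   # level steps per coarse step
    ls = (sum(steps), 1)                             # one group, levels in sequence
    px = (max(steps), num_levels)                    # one group per level
    p2 = (max(steps[-1], sum(steps[:-1])), 2)        # finest level vs all the rest
    return {"LS": ls, "PX": px, "P2": p2}


for num_levels in (3, 6, 9):
    costs = coarse_step_cost(num_levels)
    t_ls = costs["LS"][0]
    for name, (t, groups) in costs.items():
        speedup = t_ls / t
        efficiency = t_ls / (t * groups)
        print(f"L={num_levels} {name}: speedup {speedup:.2f}, "
              f"groups {groups}, efficiency {efficiency:.2f}")
```

In this model both PX and P2 approach a speed up of 2 as the number of levels grows, but P2 reaches it with 2 groups instead of L, which is why it wastes less.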

Comparison to LS scheme

More complicated case (figure: lev0, lev1, lev2 in the t-x plane): now the procs for the finest level have to wait!

More complicated case (figure: lev0, lev1, lev2 in the t-x plane)

GPU acceleration: for system biology, Yamazaki, Igarashi, Neural Networks, 2013; for GW data analysis, Zhihui Du, et al, CQG 29, 235018 (2012)

Put RHS calculation to GPU: for the AMSS-NCKU code, the RHS calculation takes > 80% of the time. The RHS function involves so many variables that even just transferring their addresses is time consuming. So pack these addresses and store them in constant memory (no further transfer during evolution), which saves shared memory at the same time

Put RHS calculation to GPU: keep the data on the GPU until the MPI data transfer between different processes. Use the buffer point method to reduce the MPI transfers for RK4 from 4 times to only 1 time; this also reduces the number of data transfers between GPU and CPU
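The buffer point idea can be checked on a toy problem: if a block receives enough extra points per side in one exchange, it can take all four RK4 stages without further communication, because each stage only spoils as many edge points as the stencil is wide. The sketch uses 1D advection with a 2nd order centered stencil (half-width 1, so 4 buffer points per side); the equation, the stencil and all names are assumptions for illustration, not the AMSS-NCKU implementation.

```python
import math


def rhs_periodic(u, dx):
    """Reference RHS of u_t = -u_x on a periodic grid."""
    n = len(u)
    return [-(u[(i + 1) % n] - u[i - 1]) / (2.0 * dx) for i in range(n)]


def rhs_open(u, dx):
    """Same stencil without boundary data: the two end points stay 0 (invalid)."""
    out = [0.0] * len(u)
    for i in range(1, len(u) - 1):
        out[i] = -(u[i + 1] - u[i - 1]) / (2.0 * dx)
    return out


def rk4_step(u, dx, dt, rhs):
    """One classical RK4 step using the given RHS function."""
    def shifted(k, a):
        return [ui + a * ki for ui, ki in zip(u, k)]
    k1 = rhs(u, dx)
    k2 = rhs(shifted(k1, dt / 2), dx)
    k3 = rhs(shifted(k2, dt / 2), dx)
    k4 = rhs(shifted(k3, dt), dx)
    return [ui + dt / 6 * (a + 2 * b + 2 * c + d)
            for ui, a, b, c, d in zip(u, k1, k2, k3, k4)]


def local_update(u, lo, hi, buf, dx, dt):
    """Update owned points [lo, hi) after a single fill of buf points per side."""
    local = u[lo - buf:hi + buf]   # stands in for the one MPI exchange
    return rk4_step(local, dx, dt, rhs_open)[buf:len(local) - buf]


STAGES, HALF_WIDTH = 4, 1
BUFFER = STAGES * HALF_WIDTH       # buffer points per side

n, dx, dt = 64, 1.0 / 64, 0.005
u = [math.sin(2 * math.pi * i * dx) for i in range(n)]
reference = rk4_step(u, dx, dt, rhs_periodic)

lo, hi = 16, 32
good = local_update(u, lo, hi, BUFFER, dx, dt)
short = local_update(u, lo, hi, BUFFER - 1, dx, dt)
err_good = max(abs(a - b) for a, b in zip(good, reference[lo:hi]))
err_short = max(abs(a - b) for a, b in zip(short, reference[lo:hi]))
print("max error with full buffer:", err_good)
print("max error with buffer one point short:", err_short)
```

By the same count, a stencil that needs 4 ghost points per side would need 16 buffer points per side for one RK4 step; the slides do not state the width actually used.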

Put RHS calculation to GPU: arrange shared memory. Divide the RHS calculation into 8 parts so that the memory requirement of each part can be satisfied by shared memory. For one RHS calculation, copy data from global memory to shared memory once and use shared memory most of the time

Put restrict-prolong to GPU: after putting the RHS on the GPU, the most time consuming part is the Restrict-Prolong interpolation. How to treat this part? The work is ongoing

Test of GPU acceleration on desktop

OpenMP implementation: AMSS-NCKU = Fortran90 + C++. C++ is used for program flow control and memory administration; Fortran90 is used for the main numerical calculation. OpenMP directives are added in the Fortran90 segments

Structure of AMSS-NCKU GPU code: two groups of MPI processes, one for CPU and one for GPU; MPI + OpenMP + CUDA

Test of AMSS-NCKU GPU code: Titan, the top 1 supercomputer in the world (now Tianhe 2); 1024x16 cores + 1024 GPUs

Summary: Challenge from GW detection: AdvLIGO 1:150, ALIA 1:1000. Parallel mesh level calculation method: 2x speed up. GPU implementation for NR: roughly 5x speed up achieved; 30x speed up? In progress. 10x in all is ready for science simulation
