Presentation is loading. Please wait.

Presentation is loading. Please wait.

Optimization of the Parallel Performance of a Global Gyrokinetic Particle Code on NERSC’s CRAY T3E and IBM SP J. N. Leboeuf and V. K. Decyk, UCLA R. Sydora,

Similar presentations


Presentation on theme: "Optimization of the Parallel Performance of a Global Gyrokinetic Particle Code on NERSC’s CRAY T3E and IBM SP J. N. Leboeuf and V. K. Decyk, UCLA R. Sydora,"— Presentation transcript:

1 Optimization of the Parallel Performance of a Global Gyrokinetic Particle Code on NERSC’s CRAY T3E and IBM SP J. N. Leboeuf and V. K. Decyk, UCLA R. Sydora, U. Alberta, Canada NERSC User Training Lectures at June NUG Meeting June 6-7, 2000 ORNL

2 Motivation - Very large scale calculations which are extremely demanding computionally are required when we attempt to model experimental situations - Optimization of performance is necessary to improve efficiency and make parameter scans feasible Outline - Brief model description - Optimization strategy and implementation - Results - Summary

3 UCLA-Canada (UCAN) Global Gyrokinetic Particle Code - Three dimensional - Toroidal - Nonlinear - Cartesian geometry - Covers the whole plasma cross-section=> Global - Adiabatic electrons Designed to study ion microinstabilities and ensuing transport in toroidal magnetic fusion devices of the tokamak type Physics Attributes: Refs: - R. D. Sydora, V. K. Decyk, and J. M. Dawson, Plasma Phys. Control. Fusion 38 (1996) A281-A294 - J. N. Leboeuf et al., Phys. Plasmas 7 (2000)

4 UCLA-Canada (UCAN) Global Gyrokinetic Particle Code Numerical Attributes - Uses low-noise nonlinear delta-f methods to integrate ions orbits - Massively parallel implementation using PLIB library of particle and field management routines Ref: - V. K. Decyk, Computer Physics Communications 87 (1995) MPI for message passing - One dimensional domain decomposition (toroidal or z direction)

5 Global Gyrokinetic Particle Simulations of Turbulence in UCLA’s Electric Tokamak (ET) Massively parallel computers enable modeling of plasma conditions (plasma weather) inside a large scale fusion device like ET Linear Stage Nonlinear Steady State

6 Need for Optimization and Optimization Strategy Follow strategy outlined in “Optimization of particle-in-cell codes for RISC processors”, V. K. Decyk, S. R. Karmesin, A. de Boer, and P. C. Liewer, Computers in Physics 10 (1996) Main elements: - Loop reordering - Data restructuring - Particle sorting The CRAY T3E is notoriously cache starved with an 8K level 1 cache compared to a 64K level 1 cache on the IBM SP => minimizing cache usage is a crucial issue on the T3E Should help even on the IBM SP

7 do 33 l = 1, nblok c nleng = nleng0 ntot = npp(l) - nps(l) + 1 if( ntot.lt. nleng ) nleng = ntot ndiv = float( ntot - 1 ) / float( nleng ) + 1. nlast = ntot - ( ndiv - 1 ) * nleng c do 10 ncon = 1,ndiv nlengl = nleng if( ncon.eq. ndiv ) nlengl = nlast jfst = nps(l) + ( ncon - 1 ) * nleng - 1 c c do 20 k = 1,4 c k2 = k + k c k1 = k2 - 1 c do 20 jj = 1,nlengl c i = jfst + jj c btot = cb1*rmajor / (rmajor + (part(1,i,l) - xxo) ) rhoi = sqrt( 2. * part(11,i,l) * btot ) * wcirv c do 30 k=1,4 k2=k+k k1=k2-1 c OPTIMIZATION: Loop Reordering in Particle Push and Accumulation Routines After do 33 l = 1, nblok c nleng = nleng0 ntot = npp(l) - nps(l) + 1 if( ntot.lt. nleng ) nleng = ntot ndiv = float( ntot - 1 ) / float( nleng ) + 1. nlast = ntot - ( ndiv - 1 ) * nleng c do 10 ncon = 1,ndiv nlengl = nleng if( ncon.eq. ndiv ) nlengl = nlast jfst = nps(l) + ( ncon - 1 ) * nleng - 1 c do 20 k = 1,4 k2 = k + k k1 = k2 - 1 c do 30 jj = 1,nlengl c i = jfst + jj c btot = cb1*rmajor / (rmajor + (part(1,i,l) - xxo) ) rhoi = sqrt( 2. * part(11,i,l) * btot ) * wcirv Before

8 OPTIMIZATION: Loop Reordering in FFT Routines c first transform in z nrz = nxyz/nz do 580 l = 1, indz ns = 2**(l - 1) ns2 = ns + ns km = nzh/ns kmr = km*nrz do 570 m = 1, jblok do 560 k = 1, km k1 = ns2*(k - 1) k2 = k1 + ns do 550 j = 1, ns j1 = j + k1 j2 = j + k2 s = conjg(sct(1+kmr*(j-1))) do 540 i = 1, kxp do 530 n = 1, ny t = s*g(j2,n,i,m) g(j2,n,i,m) = g(j1,n,i,m) - t g(j1,n,i,m) = g(j1,n,i,m) + t 530 continue 540 continue 550 continue 560 continue 570 continue 580 continue c first transform in z nrz = nxyz/nz do 580 l = 1, indz ns = 2**(l - 1) ns2 = ns + ns km = nzh/ns kmr = km*nrz do 570 m = 1, jblok do 560 i = 1, kxp do 550 n = 1, ny do 540 k = 1, km k1 = ns2*(k - 1) k2 = k1 + ns do 530 j = 1, ns j1 = j + k1 j2 = j + k2 s = conjg(sct(1+kmr*(j-1))) t = s*g(j2,n,i,m) g(j2,n,i,m) = g(j1,n,i,m) - t g(j1,n,i,m) = g(j1,n,i,m) + t 530 continue 540 continue 550 continue 560 continue 570 continue 580 continue Before After

9 OPTIMIZATION: Data Restructuring Separate electric field components elx, ely, elz regrouped into single array el(1,...)=elx(...), el(2,...)=ely(...) el(3,...)=elz(...) so that they are now contiguous in memory call pfft3r(g1x,g1xt,bs,br,1,ntpose,mixup,sct, cdddd1 incx+1, incy+1, incz, kstrt, 1 incx, incy, incz, kstrt, 2 ncxv, ncyppv, nczv, 3 nncxppp, nnczp, nnczp1, nblok, nblok, ncypp, 4 ncypp, ncy ) c do 810 l = 1,nblok do 810 k = 1,nczp do 810 j = 1,nncy1 do 810 i = 1,nncx1 elx(i,j,k,l) = g1x(i,j,k,l) el(1,i,j,k,l)=g1x(i,j,k,l) 810 continue c After call pfft3r(g1x,g1xt,bs,br,1,ntpose,mixup,sct, cdddd1 incx+1, incy+1, incz, kstrt, 1 incx, incy, incz, kstrt, 2 ncxv, ncyppv, nczv, 3 nncxppp, nnczp, nnczp1, nblok, nblok, ncypp, 4 ncypp, ncy ) c do 810 l = 1,nblok do 810 k = 1,nczp do 810 j = 1,nncy1 do 810 i = 1,nncx1 elx(i,j,k,l) = g1x(i,j,k,l) 810 continue c Before

10 OPTIMIZATION: Memory Access Control c extt(k,jj,l)= 1 elx(lx,ly,lz,l)*alf1 + elx(lxr,ly,lz,l)*alf2 1 + elx(lx,lyr,lz,l)*alf3 + elx(lxr,lyr,lz,l)*alf4 2 + elx(lx,ly,lzr,l)*alf5 + elx(lxr,ly,lzr,l)*alf6 3 + elx(lx,lyr,lzr,l)*alf7 + elx(lxr,lyr,lzr,l)*alf8 c eytt(k,jj,l)= 1 ely(lx,ly,lz,l)*alf1 + ely(lxr,ly,lz,l)*alf2 1 + ely(lx,lyr,lz,l)*alf3 + ely(lxr,lyr,lz,l)*alf4 2 + ely(lx,ly,lzr,l)*alf5 + ely(lxr,ly,lzr,l)*alf6 3 + ely(lx,lyr,lzr,l)*alf7 + ely(lxr,lyr,lzr,l)*alf8 c eztt(k,jj,l)= 1 elz(lx,ly,lz,l)*alf1 + elz(lxr,ly,lz,l)*alf2 1 + elz(lx,lyr,lz,l)*alf3 + elz(lxr,lyr,lz,l)*alf4 2 + elz(lx,ly,lzr,l)*alf5 + elz(lxr,ly,lzr,l)*alf6 3 + elz(lx,lyr,lzr,l)*alf7 + elz(lxr,lyr,lzr,l)*alf8 c Particle pusher modified to take advantage of data restructuring, so as to minimize memory excursions and improve data locality c dx= 1 el(1,lx,ly,lz,l)*alf1+el(1,lxr,ly,lz,l)*alf2 1 +el(1,lx,lyr,lz,l)*alf3+el(1,lxr,lyr,lz,l)*alf4 dy= 1 el(2,lx,ly,lz,l)*alf1+el(2,lxr,ly,lz,l)*alf2 1 +el(2,lx,lyr,lz,l)*alf3+el(2,lxr,lyr,lz,l)*alf4 dz= 1 el(3,lx,ly,lz,l)*alf1+el(3,lxr,ly,lz,l)*alf2 1 +el(3,lx,lyr,lz,l)*alf3+el(3,lxr,lyr,lz,l)*alf4 c c..... e.s. field c extt(k,jj,l)= dx 2 + el(1,lx,ly,lzr,l)*alf5 + el(1,lxr,ly,lzr,l)*alf6 3 + el(1,lx,lyr,lzr,l)*alf7 + el(1,lxr,lyr,lzr,l)*alf8 c eytt(k,jj,l)= dy 2 + el(2,lx,ly,lzr,l)*alf5 + el(2,lxr,ly,lzr,l)*alf6 3 + el(2,lx,lyr,lzr,l)*alf7 + el(2,lxr,lyr,lzr,l)*alf8 c eztt(k,jj,l)= dz 2 + el(3,lx,ly,lzr,l)*alf5 + el(3,lxr,ly,lzr,l)*alf6 3 + el(3,lx,lyr,lzr,l)*alf7 + el(3,lxr,lyr,lzr,l)*alf8 c Before After

11 OPTIMIZATION: Particle Sorting cc cc initial sort cc call psortp3yz(part,pt,ip,npic,npp,noff,idimp, 1 nnopmax,nblok,ny1,nyzpm1) c c c main loop nsortt=5 c call timera(-1,'total ') c 100 continue c nt = nt + 1 nb = ( nt / nbb ) * nbb nsort=(nt/nsortt)*nsortt call pmove3( part, 1 edges, npp, sbufr, sbufl, rbufr, rbufl, ihole, 2 jsr, jsl, jss, 3 ncz, kstrt, nvp, idimp, nnopmax, nblok, idps, 4 nbmax, ntmax, ierr ) c if(nsort.eq.nt) then call psortp3yz(part,pt,ip,npic,npp,noff,idimp, 1 nnopmax,nblok,ny1,nyzpm1) end if c Sorting Every 5 Time StepsNo Sorting c c c main loop c call timera(-1,'total ') c 100 continue c nt = nt + 1 nb = ( nt / nbb ) * nbb c call pmove3( part, 1 edges, npp, sbufr, sbufl, rbufr, rbufl, ihole, 2 jsr, jsl, jss, 3 ncz, kstrt, nvp, idimp, nnopmax, nblok, idps, 4 nbmax, ntmax, ierr ) c

12 Performance Improvements on NERSC’s CRAY T3E Resulting from Optimizations Improvements by better than a factor of 2 achieved in terms of overall time per particle per step and flops as measured by the ‘pat’ utility on NERSC’s CRAY T3E

13 Performance Improvements on NERSC’s CRAY T3E and IBM SP Resulting from Optimizations Optimization for cache starved T3E still results in ~ 30% performance improvement on SP

14 Performance with Increasing Problem Size and Number of Processors on NERSC’s IBM SP Linear scaling obtained when number of particles increases with number of processors (factor of 2 each at a time) fit

15 Comparison on Performance on Appleseed Mcintosh G4 cluster and IBM SP Equivalent performance up to 8 processors

16 Summary Optimization strategy successfully implemented Optimizations have improved performance on both the T3E and SP With up to 8 processors, performance is comparable on UCLA’s Appleseed cluster of Macintosh G4s and on NERSC’s IBM SP


Download ppt "Optimization of the Parallel Performance of a Global Gyrokinetic Particle Code on NERSC’s CRAY T3E and IBM SP J. N. Leboeuf and V. K. Decyk, UCLA R. Sydora,"

Similar presentations


Ads by Google