Optimization of the Parallel Performance of a Global Gyrokinetic Particle Code on NERSC’s CRAY T3E and IBM SP J. N. Leboeuf and V. K. Decyk, UCLA R. Sydora,

Optimization of the Parallel Performance of a Global Gyrokinetic Particle Code on NERSC’s CRAY T3E and IBM SP J. N. Leboeuf and V. K. Decyk, UCLA R. Sydora, U. Alberta, Canada NERSC User Training Lectures at June NUG Meeting June 6-7, 2000 ORNL

Motivation - Very large scale calculations which are extremely demanding computionally are required when we attempt to model experimental situations - Optimization of performance is necessary to improve efficiency and make parameter scans feasible Outline - Brief model description - Optimization strategy and implementation - Results - Summary

UCLA-Canada (UCAN) Global Gyrokinetic Particle Code - Three dimensional - Toroidal - Nonlinear - Cartesian geometry - Covers the whole plasma cross-section=> Global - Adiabatic electrons Designed to study ion microinstabilities and ensuing transport in toroidal magnetic fusion devices of the tokamak type Physics Attributes: Refs: - R. D. Sydora, V. K. Decyk, and J. M. Dawson, Plasma Phys. Control. Fusion 38 (1996) A281-A294 - J. N. Leboeuf et al., Phys. Plasmas 7 (2000) 1795-1801

UCLA-Canada (UCAN) Global Gyrokinetic Particle Code Numerical Attributes - Uses low-noise nonlinear delta-f methods to integrate ions orbits - Massively parallel implementation using PLIB library of particle and field management routines Ref: - V. K. Decyk, Computer Physics Communications 87 (1995) 87-94 - MPI for message passing - One dimensional domain decomposition (toroidal or z direction)

Global Gyrokinetic Particle Simulations of Turbulence in UCLA’s Electric Tokamak (ET) Massively parallel computers enable modeling of plasma conditions (plasma weather) inside a large scale fusion device like ET Linear Stage Nonlinear Steady State

Need for Optimization and Optimization Strategy Follow strategy outlined in “Optimization of particle-in-cell codes for RISC processors”, V. K. Decyk, S. R. Karmesin, A. de Boer, and P. C. Liewer, Computers in Physics 10 (1996) 290-298 Main elements: - Loop reordering - Data restructuring - Particle sorting The CRAY T3E is notoriously cache starved with an 8K level 1 cache compared to a 64K level 1 cache on the IBM SP => minimizing cache usage is a crucial issue on the T3E Should help even on the IBM SP

do 33 l = 1, nblok c nleng = nleng0 ntot = npp(l) - nps(l) + 1 if( ntot.lt. nleng ) nleng = ntot ndiv = float( ntot - 1 ) / float( nleng ) + 1. nlast = ntot - ( ndiv - 1 ) * nleng c do 10 ncon = 1,ndiv nlengl = nleng if( ncon.eq. ndiv ) nlengl = nlast jfst = nps(l) + ( ncon - 1 ) * nleng - 1 c c do 20 k = 1,4 c k2 = k + k c k1 = k2 - 1 c do 20 jj = 1,nlengl c i = jfst + jj c btot = cb1*rmajor / (rmajor + (part(1,i,l) - xxo) ) rhoi = sqrt( 2. * part(11,i,l) * btot ) * wcirv c do 30 k=1,4 k2=k+k k1=k2-1 c OPTIMIZATION: Loop Reordering in Particle Push and Accumulation Routines After do 33 l = 1, nblok c nleng = nleng0 ntot = npp(l) - nps(l) + 1 if( ntot.lt. nleng ) nleng = ntot ndiv = float( ntot - 1 ) / float( nleng ) + 1. nlast = ntot - ( ndiv - 1 ) * nleng c do 10 ncon = 1,ndiv nlengl = nleng if( ncon.eq. ndiv ) nlengl = nlast jfst = nps(l) + ( ncon - 1 ) * nleng - 1 c do 20 k = 1,4 k2 = k + k k1 = k2 - 1 c do 30 jj = 1,nlengl c i = jfst + jj c btot = cb1*rmajor / (rmajor + (part(1,i,l) - xxo) ) rhoi = sqrt( 2. * part(11,i,l) * btot ) * wcirv Before

OPTIMIZATION: Loop Reordering in FFT Routines c first transform in z nrz = nxyz/nz do 580 l = 1, indz ns = 2**(l - 1) ns2 = ns + ns km = nzh/ns kmr = km*nrz do 570 m = 1, jblok do 560 k = 1, km k1 = ns2*(k - 1) k2 = k1 + ns do 550 j = 1, ns j1 = j + k1 j2 = j + k2 s = conjg(sct(1+kmr*(j-1))) do 540 i = 1, kxp do 530 n = 1, ny t = s*g(j2,n,i,m) g(j2,n,i,m) = g(j1,n,i,m) - t g(j1,n,i,m) = g(j1,n,i,m) + t 530 continue 540 continue 550 continue 560 continue 570 continue 580 continue c first transform in z nrz = nxyz/nz do 580 l = 1, indz ns = 2**(l - 1) ns2 = ns + ns km = nzh/ns kmr = km*nrz do 570 m = 1, jblok do 560 i = 1, kxp do 550 n = 1, ny do 540 k = 1, km k1 = ns2*(k - 1) k2 = k1 + ns do 530 j = 1, ns j1 = j + k1 j2 = j + k2 s = conjg(sct(1+kmr*(j-1))) t = s*g(j2,n,i,m) g(j2,n,i,m) = g(j1,n,i,m) - t g(j1,n,i,m) = g(j1,n,i,m) + t 530 continue 540 continue 550 continue 560 continue 570 continue 580 continue Before After

OPTIMIZATION: Data Restructuring Separate electric field components elx, ely, elz regrouped into single array el(1,...)=elx(...), el(2,...)=ely(...) el(3,...)=elz(...) so that they are now contiguous in memory call pfft3r(g1x,g1xt,bs,br,1,ntpose,mixup,sct, cdddd1 incx+1, incy+1, incz, kstrt, 1 incx, incy, incz, kstrt, 2 ncxv, ncyppv, nczv, 3 nncxppp, nnczp, nnczp1, nblok, nblok, ncypp, 4 ncypp, ncy ) c do 810 l = 1,nblok do 810 k = 1,nczp do 810 j = 1,nncy1 do 810 i = 1,nncx1 elx(i,j,k,l) = g1x(i,j,k,l) el(1,i,j,k,l)=g1x(i,j,k,l) 810 continue c After call pfft3r(g1x,g1xt,bs,br,1,ntpose,mixup,sct, cdddd1 incx+1, incy+1, incz, kstrt, 1 incx, incy, incz, kstrt, 2 ncxv, ncyppv, nczv, 3 nncxppp, nnczp, nnczp1, nblok, nblok, ncypp, 4 ncypp, ncy ) c do 810 l = 1,nblok do 810 k = 1,nczp do 810 j = 1,nncy1 do 810 i = 1,nncx1 elx(i,j,k,l) = g1x(i,j,k,l) 810 continue c Before

OPTIMIZATION: Memory Access Control c extt(k,jj,l)= 1 elx(lx,ly,lz,l)*alf1 + elx(lxr,ly,lz,l)*alf2 1 + elx(lx,lyr,lz,l)*alf3 + elx(lxr,lyr,lz,l)*alf4 2 + elx(lx,ly,lzr,l)*alf5 + elx(lxr,ly,lzr,l)*alf6 3 + elx(lx,lyr,lzr,l)*alf7 + elx(lxr,lyr,lzr,l)*alf8 c eytt(k,jj,l)= 1 ely(lx,ly,lz,l)*alf1 + ely(lxr,ly,lz,l)*alf2 1 + ely(lx,lyr,lz,l)*alf3 + ely(lxr,lyr,lz,l)*alf4 2 + ely(lx,ly,lzr,l)*alf5 + ely(lxr,ly,lzr,l)*alf6 3 + ely(lx,lyr,lzr,l)*alf7 + ely(lxr,lyr,lzr,l)*alf8 c eztt(k,jj,l)= 1 elz(lx,ly,lz,l)*alf1 + elz(lxr,ly,lz,l)*alf2 1 + elz(lx,lyr,lz,l)*alf3 + elz(lxr,lyr,lz,l)*alf4 2 + elz(lx,ly,lzr,l)*alf5 + elz(lxr,ly,lzr,l)*alf6 3 + elz(lx,lyr,lzr,l)*alf7 + elz(lxr,lyr,lzr,l)*alf8 c Particle pusher modified to take advantage of data restructuring, so as to minimize memory excursions and improve data locality c dx= 1 el(1,lx,ly,lz,l)*alf1+el(1,lxr,ly,lz,l)*alf2 1 +el(1,lx,lyr,lz,l)*alf3+el(1,lxr,lyr,lz,l)*alf4 dy= 1 el(2,lx,ly,lz,l)*alf1+el(2,lxr,ly,lz,l)*alf2 1 +el(2,lx,lyr,lz,l)*alf3+el(2,lxr,lyr,lz,l)*alf4 dz= 1 el(3,lx,ly,lz,l)*alf1+el(3,lxr,ly,lz,l)*alf2 1 +el(3,lx,lyr,lz,l)*alf3+el(3,lxr,lyr,lz,l)*alf4 c c..... e.s. field c extt(k,jj,l)= dx 2 + el(1,lx,ly,lzr,l)*alf5 + el(1,lxr,ly,lzr,l)*alf6 3 + el(1,lx,lyr,lzr,l)*alf7 + el(1,lxr,lyr,lzr,l)*alf8 c eytt(k,jj,l)= dy 2 + el(2,lx,ly,lzr,l)*alf5 + el(2,lxr,ly,lzr,l)*alf6 3 + el(2,lx,lyr,lzr,l)*alf7 + el(2,lxr,lyr,lzr,l)*alf8 c eztt(k,jj,l)= dz 2 + el(3,lx,ly,lzr,l)*alf5 + el(3,lxr,ly,lzr,l)*alf6 3 + el(3,lx,lyr,lzr,l)*alf7 + el(3,lxr,lyr,lzr,l)*alf8 c Before After

OPTIMIZATION: Particle Sorting cc cc initial sort cc call psortp3yz(part,pt,ip,npic,npp,noff,idimp, 1 nnopmax,nblok,ny1,nyzpm1) c c...................................................................... c...... main loop..................................................... nsortt=5 c...................................................................... call timera(-1,'total ') c 100 continue c nt = nt + 1 nb = ( nt / nbb ) * nbb nsort=(nt/nsortt)*nsortt call pmove3( part, 1 edges, npp, sbufr, sbufl, rbufr, rbufl, ihole, 2 jsr, jsl, jss, 3 ncz, kstrt, nvp, idimp, nnopmax, nblok, idps, 4 nbmax, ntmax, ierr ) c if(nsort.eq.nt) then call psortp3yz(part,pt,ip,npic,npp,noff,idimp, 1 nnopmax,nblok,ny1,nyzpm1) end if c Sorting Every 5 Time StepsNo Sorting c c...................................................................... c...... main loop..................................................... c...................................................................... call timera(-1,'total ') c 100 continue c nt = nt + 1 nb = ( nt / nbb ) * nbb c call pmove3( part, 1 edges, npp, sbufr, sbufl, rbufr, rbufl, ihole, 2 jsr, jsl, jss, 3 ncz, kstrt, nvp, idimp, nnopmax, nblok, idps, 4 nbmax, ntmax, ierr ) c

Performance Improvements on NERSC’s CRAY T3E Resulting from Optimizations Improvements by better than a factor of 2 achieved in terms of overall time per particle per step and flops as measured by the ‘pat’ utility on NERSC’s CRAY T3E

Performance Improvements on NERSC’s CRAY T3E and IBM SP Resulting from Optimizations Optimization for cache starved T3E still results in ~ 30% performance improvement on SP

Performance with Increasing Problem Size and Number of Processors on NERSC’s IBM SP Linear scaling obtained when number of particles increases with number of processors (factor of 2 each at a time) fit

Comparison on Performance on Appleseed Mcintosh G4 cluster and IBM SP Equivalent performance up to 8 processors

Summary Optimization strategy successfully implemented Optimizations have improved performance on both the T3E and SP With up to 8 processors, performance is comparable on UCLA’s Appleseed cluster of Macintosh G4s and on NERSC’s IBM SP

Optimization of the Parallel Performance of a Global Gyrokinetic Particle Code on NERSC’s CRAY T3E and IBM SP J. N. Leboeuf and V. K. Decyk, UCLA R. Sydora,

Similar presentations

Presentation on theme: "Optimization of the Parallel Performance of a Global Gyrokinetic Particle Code on NERSC’s CRAY T3E and IBM SP J. N. Leboeuf and V. K. Decyk, UCLA R. Sydora,"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Optimization of the Parallel Performance of a Global Gyrokinetic Particle Code on NERSC’s CRAY T3E and IBM SP J. N. Leboeuf and V. K. Decyk, UCLA R. Sydora,

Similar presentations

Presentation on theme: "Optimization of the Parallel Performance of a Global Gyrokinetic Particle Code on NERSC’s CRAY T3E and IBM SP J. N. Leboeuf and V. K. Decyk, UCLA R. Sydora,"— Presentation transcript:

Similar presentations

About project

Feedback