Space Charge with PyHEADTAIL and PyPIC on the GPU Stefan Hegglin and Adrian Oeftiger Space Charge Working Group meeting – 29.10.2015.

2 Space Charge with PyHEADTAIL and PyPIC on the GPU Stefan Hegglin and Adrian Oeftiger Space Charge Working Group meeting – 29.10.2015

3 Overview 1. PIC: Reminder 2. Implementation / Parallelisation Approach 3. Results

4 Motivation Self-consistent space charge models: the particle-in-cell (PIC) algorithm is the dominant time consumer in simulations. Parallelisation is challenging: PIC is a memory-bound algorithm, i.e. few FLOPs per byte.

5 Output We parallelised PIC on the GPU (graphics processing unit). PyPIC: PIC algorithms in a shared Python library, with a 2.5D (slice-by-slice transverse) and a full 3D model. Much higher resolutions become possible, suppressing noise issues (courtesy: F. Kesting, GSI, https://eventbooking.stfc.ac.uk/uploads/spacecharge15/numericalnoisekesting.pdf). Example: on a 128x128 mesh, artificial emittance growth is reduced with more macro-particles.

6 How to Approach Noise Issues? Less noise means longer applicability/validity of simulations. E.g. the SPS injection plateau: 10.8 seconds ≈ 500'000 turns! That is impossible today; instead we typically gain O(10'000 turns) of validity per O(1 week) of simulation time with current software. Convergence study: choose the grid resolution (according to the physics), keep ≥10 macro-particles per grid cell, fix the total number of macro-particles, and evaluate the emittance growth.
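The per-cell rule of the convergence study can be turned into a quick sanity check. This is a hypothetical helper, not part of PyHEADTAIL; the fill fraction is an assumption added here, since in practice only the beam-occupied cells count:

```python
def particles_per_cell(n_mp, nx, ny, n_slices, fill_fraction=1.0):
    """Average number of macro-particles per grid cell.

    fill_fraction estimates which share of the cells the beam
    actually occupies (illustrative assumption, not a PyPIC API).
    """
    return n_mp / (nx * ny * n_slices * fill_fraction)

# e.g. 1'000'000 macro-particles on 20 slices of a 128x128 mesh
avg = particles_per_cell(1_000_000, 128, 128, 20)
```

If `avg` falls below the ≥10 target, either the mesh must be coarsened or the total number of macro-particles increased.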

7 New Available Parameter Space 1'000'000 macro-particles, 20 slices, 128 x 128 mesh size: 152 ms, 134 ms and 110 ms per kick.

8 Poisson Solving with PIC The particle-in-cell algorithm is the standard in the accelerator physics domain. Solving the Poisson equation: finite differences; Hockney: FFT with an (integrated) Green's function for open boundaries; FMM, particle-particle, … See Ji Qiang's talks in the PyHEADTAIL meeting and the Space Charge WG meeting: https://indico.cern.ch/event/433371/

9 PIC – 3 Steps Particle-in-cell algorithm: 1) particles to mesh (p2m): deposit the charges onto the mesh nodes; 2) solve the Poisson equation on the mesh (Hockney's algorithm); 3) mesh to particles (m2p): interpolate the mesh fields to the particles.
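Steps 1 and 3 can be sketched in 1D with NumPy (an illustrative cloud-in-cell stand-in for the GPU kernels, not PyPIC's actual interface; the Poisson solve of step 2 is the subject of the next slide):

```python
import numpy as np

def deposit(x, q, n_nodes, dx):
    """Step 1 (p2m): cloud-in-cell deposition of charges q at
    positions x onto a 1D mesh with node spacing dx."""
    idx = np.floor(x / dx).astype(int)
    frac = x / dx - idx                    # weight for the right node
    rho = np.zeros(n_nodes)
    np.add.at(rho, idx, q * (1 - frac))    # left-node share
    np.add.at(rho, idx + 1, q * frac)      # right-node share
    return rho, idx, frac

def interpolate(field, idx, frac):
    """Step 3 (m2p): linear interpolation of a mesh field back
    to the particle positions."""
    return field[idx] * (1 - frac) + field[idx + 1] * frac
```

Note that the deposition conserves the total charge exactly, which is the property the GPU variants on the later slides must preserve as well.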

10 Hockney's Algorithm Solve the Poisson equation on a structured grid. Green's function: analytical solution for open boundaries. The formal solution is a convolution: O(n²). Trick: implement the convolution with FFTs on a domain of twice the size, reducing the cost to O(n log n).
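A 1D sketch of the trick (NumPy, illustrative only): zero-pad the charge to twice the domain, mirror the Green's function samples onto the padded half, and the cyclic FFT convolution then reproduces the open-boundary sum phi_i = sum_j G(|i-j|) rho_j at O(n log n):

```python
import numpy as np

def mirrored_green(g):
    """Arrange G(0), ..., G(n-1) on the doubled domain so that a
    cyclic convolution reproduces the open-boundary (linear) sum."""
    return np.concatenate([g, [0.0], g[:0:-1]])

def hockney_solve_1d(rho, g):
    """phi_i = sum_j G(|i - j|) rho_j via FFTs of 2x domain size."""
    n = len(rho)
    rho_pad = np.zeros(2 * n)
    rho_pad[:n] = rho                      # zero-padded charge
    spectrum = np.fft.rfft(rho_pad) * np.fft.rfft(mirrored_green(g))
    return np.fft.irfft(spectrum, 2 * n)[:n]   # discard the padded half
```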

11 Integrated Green's Function The Green's function approach has problems when the mesh has a large aspect ratio (the numerical integration uses a constant function value per cell). Integrated Green's function, main idea: integrate the Green's function analytically over each mesh cell, then sum over all cells.
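A 1D analog of the idea (illustrative only, with G(x) = |x| standing in for the actual 2D/3D Green's function): sample the cell average via the analytical antiderivative instead of the cell-centre point value.

```python
import numpy as np

def igf_sample(i, dx):
    """Cell-integrated Green's function sample: the average of
    G(x) = |x| over cell i, from the antiderivative
    F(x) = sign(x) * x^2 / 2 (1D toy model, not PyPIC code)."""
    a, b = (i - 0.5) * dx, (i + 0.5) * dx
    F = lambda x: np.sign(x) * x * x / 2.0
    return (F(b) - F(a)) / dx
```

Away from the singularity the cell average equals the point value (e.g. cell 2 gives exactly 2.0 for dx = 1), while the cell containing x = 0 averages to 0.25 instead of the point value 0 — it is precisely near the singularity, and for strongly stretched cells, that plain point sampling breaks down.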

12 Integrated Green's Function [Plot: error of E_x vs. x, comparing IGF and GF for an aspect ratio of 1:5.] Abell et al., PAC 07, 9850561.

13 GPUs GPU = Graphics Processing Unit: threads run massively in parallel, one concurrent instruction on >1000 cores, suited to large data arrays; but global memory access is expensive. Resources for ABP simulations: CERN: LIU-PS-GPU server with 4x NVIDIA Tesla C2075 cards (mid 2011); CNAF (Bologna): high performance cluster with 7x NVIDIA Tesla K20m (early 2013) and 8x Tesla K40m (late 2013).

14 How to Use the GPU Script: minimal changes for the GPU; how to submit a GPU job (CNAF). Python: GPU data introspection works as flexibly as on the CPU (print(), calculations with GPUArrays, …).

15 Parallelisation Approach Cycle: profiling → identify bottleneck → optimise code → verify functionality → repeat.

16 Different Bottlenecks: CPU vs. GPU CPU: FFT solving is the bottleneck (FFT: O(nx² log nx) vs. p2m: O(nx²)). GPU: particle-to-mesh deposition is the bottleneck.

17 Implementation of the 3 Steps Particle-in-cell algorithm: particles to mesh (p2m): 1) atomicAdd, one thread per particle; 2) parallel sort, one thread per cell. Solve: cuFFT (parallel FFT). Mesh to particles (m2p): one thread per particle.

18 Variant 1 of p2m One thread per particle. Race condition → atomicAdd: properly serialises the memory updates → slow but correct.
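A CPU analog of this variant (NumPy, illustrative): `np.add.at` plays the role of CUDA's `atomicAdd` by serialising conflicting updates, while plain fancy-indexed assignment shows the lost-update behaviour of the race condition:

```python
import numpy as np

def deposit_atomic(cell_idx, q, n_cells):
    """One 'thread' per particle; np.add.at serialises repeated
    indices correctly, as atomicAdd does on the GPU."""
    rho = np.zeros(n_cells)
    np.add.at(rho, cell_idx, q)
    return rho

# the race-condition analog: buffered fancy indexing loses updates
idx = np.array([0, 0, 1])
q = np.array([1.0, 1.0, 2.0])
rho_racy = np.zeros(3)
rho_racy[idx] += q      # cell 0 receives only one of the two charges
```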

19 Variant 2 of p2m One thread per node. Sort the particles by node index (optimises memory access!). Avoids the race condition (no concurrent writes).
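A CPU sketch of this variant (NumPy, illustrative): sort the particles by node index, then reduce each segment, so no two writes ever target the same node concurrently. Here `np.bincount` stands in for the GPU's segmented reduction:

```python
import numpy as np

def deposit_sorted(cell_idx, q, n_cells):
    """One 'thread' per node: sort particles by node index, then
    reduce each contiguous segment into its node."""
    order = np.argsort(cell_idx, kind='stable')   # the sort pass
    return np.bincount(cell_idx[order], weights=q[order],
                       minlength=n_cells)
```

Both variants produce identical charge distributions; they differ only in how memory conflicts are avoided and, on the GPU, in their performance characteristics.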

20 Different Numerical Models 2.5D: slice the bunch into n slices and solve n independent 2D Poisson equations. Approximation: the bunch is very long. CPU: serial; GPU: compute all slices simultaneously. 3D: solve the full 3D bunch on a 3D grid. CPU: not implemented (very slow); GPU: large memory requirements due to Hockney's algorithm.
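The 2.5D model reduces to n independent transverse solves, which can be sketched as (illustrative NumPy only; `solve_2d` is a placeholder for any 2D Poisson solver such as the Hockney solver above):

```python
import numpy as np

def solve_25d(rho_slices, solve_2d):
    """2.5D model: one independent transverse 2D Poisson solve per
    longitudinal slice. Serial loop on the CPU; on the GPU all
    slices are computed simultaneously (e.g. batched transforms)."""
    return np.stack([solve_2d(rho_xy) for rho_xy in rho_slices])
```

The independence of the slices is exactly what makes the 2.5D model map so well onto the GPU: each slice is a separate batch element with no cross-slice communication.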

21 Numeric Parameter Scans: Fixed nx [Plots: fixed mesh sizes 256x256 and 512x512, with x4 and x2 annotations.]

22 Timing: Fully Loaded GPU Parameters The 2.5D model works well at high particle numbers, i.e. at low numbers the GPU is far from fully exploited! Different slopes for CPU vs. GPU (characteristic behaviour). The newer hardware at CNAF is more efficient (x1.8).

23 Timing: CUDA 6 vs. CUDA 7 Speedup of up to x1.5 due to a faster implementation of the sorting algorithm (Thrust 1.8) and an improved cuFFT (2D, K20m @ CNAF).

24 Summary PyHEADTAIL now offers 2.5D (slice-by-slice transverse) and 3D self-consistent direct space charge models (on CPU and GPU). The 3D model allows cross-checking the 2.5D approximations. The GPU gives speedups of ≈13x for large meshes and particle numbers. Wide numeric parameter spaces are available now: larger resolutions help to mitigate noise effects (artefacts such as numerical emittance blow-up), improving the validity of long simulations (real machine time). Next steps: SPS simulations (resonances).


26 Specifications of Used GPU Machines Available machines at CNAF: http://wiki.infn.it/strutture/cnaf/clusterhpc/home

27 Specification of Used CPU Machine LIU-PS-GPU CPU: [specification table shown on slide.]

28 PyPIC on the GPU Standalone Python module: GPU interfacing via PyCUDA/Thrust; flexible 2D/3D (integrated) Green's functions; cuFFT. http://github.com/PyCOMPLETE/PyPIC (new interface under branch: new_pypic_cpu_and_gpu)

29 Timing: Fully Loaded GPU Parameters II On the GPU, particle-to-mesh deposition dominates: for a fixed mesh size, depositing more macro-particles onto the same grid makes memory bandwidth the limit on the speedup.

