Improved pipelining and domain decomposition in QuickPIC Chengkun Huang (UCLA/LANL) and members of FACET collaboration SciDAC COMPASS all hands meeting.


1 Improved pipelining and domain decomposition in QuickPIC Chengkun Huang (UCLA/LANL) and members of FACET collaboration SciDAC COMPASS all hands meeting 2009 LA-UR 09-06300

2 Goals and vision for advanced accelerator modeling under COMPASS
- Help in designing linear colliders based on staging PWFA or LWFA sections
- Develop integrated codes for modeling a staged wakefield "system"
- Develop efficient and high-fidelity modeling for optimizing a single LWFA or PWFA stage
- Develop PIC algorithms for advanced architectures
- Enable routine modeling of existing and future experiments: BELLA, FACET, others
- Real-time steering of experiments?
- Code validation against numerous worldwide experiments
- Code verification

3 QuickPIC Implementation (1D domain)
The driver evolution can be calculated in a 3D moving box, while the plasma response can be solved slice by slice, with ξ being a time-like variable.

4 Exploiting more parallelism: Pipelining
- Pipelining exploits parallelism in a sequential operation stream and can be adopted at various levels.
- Modern CPU designs include an instruction-level pipeline to improve performance by increasing throughput.
- In scientific computation, software-level pipelining is less common because the parallelism in the algorithm is hidden.
- We have implemented a software-level pipeline in QuickPIC to exploit the parallelism in the quasi-static algorithm.

             Instruction pipeline     Software pipeline
Operand      Instruction stream       Plasma slice
Operation    IF, ID, EX, MEM, WB      Plasma/beam update
Stages       5 ~ 31                   1 ~ (# of slices)

[Figure labels: moving window; plasma response]
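The software pipeline described on this slide can be illustrated with a toy schedule. This is a minimal sketch under stated assumptions, not QuickPIC code: `pipeline_schedule`, the notion of a "tick", and the rule that stage s works on 3D time step t during tick t + s are illustrative simplifications, assuming every stage takes the same time and that a stage can start a step as soon as the previous stage has finished it.

```python
def pipeline_schedule(n_stages, n_steps):
    """Return {tick: [(stage, step), ...]} for an idealized software pipeline.

    Assumption (illustrative): stage s owns a fixed block of slices and can
    start time step t only after stage s-1 has finished step t on its own
    block, so stage s works on step t during tick t + s.
    """
    schedule = {}
    for step in range(n_steps):
        for stage in range(n_stages):
            schedule.setdefault(step + stage, []).append((stage, step))
    return schedule


schedule = pipeline_schedule(n_stages=4, n_steps=6)
for tick in sorted(schedule):
    busy = ", ".join(f"stage {s} -> step {t}" for s, t in sorted(schedule[tick]))
    print(f"tick {tick}: {busy}")
```

Once the pipeline has filled, every stage is busy on a different time step in the same tick, which is where the extra parallelism comes from.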

5 Implementation of laser envelope model
- Ponderomotive guiding center approximation: big 3D time step
- Plasma evolution: Maxwell's equations, Lorentz gauge, quasi-static approximation
- This can be pipelined too

6 Scaling to 100,000+ processors and enabling high resolution capability: Pipelining
[Figure: Schematic of pipelining implementation in QuickPIC. Labels: initial plasma slab; beam 1, 2, 3, 4; Stage 1 to Stage 4, each repeating "solve plasma response" and "update beam".]
- Communication overlaps with computation
- Particles leaving a pipeline stage are buffered
- Overall efficiency as high as 85% (2048 processors in 64 pipeline stages)
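One contribution to pipeline efficiency is the time spent filling and draining the stages. The following is a rough sketch of that single effect, assuming equal-cost stages and no communication or buffering overhead; `fill_drain_efficiency` is a hypothetical helper, and it does not attempt to reproduce the 85% figure quoted on the slide, which is a measured overall efficiency.

```python
def fill_drain_efficiency(n_stages, n_steps):
    """Idealized efficiency of a pipeline that runs n_steps through n_stages.

    Assumption (illustrative): each stage takes one tick per step, so the
    whole run needs n_steps + n_stages - 1 ticks, of which each stage is
    busy for n_steps ticks.
    """
    return n_steps / (n_steps + n_stages - 1)


# 64 stages matches the slide; the step counts are arbitrary examples.
for n_steps in (64, 256, 1024, 4096):
    eff = fill_drain_efficiency(n_stages=64, n_steps=n_steps)
    print(f"{n_steps:5d} steps, 64 stages: {eff:.3f}")
```

In this idealized model the fill/drain penalty shrinks as the number of 3D time steps grows relative to the number of stages.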

7 Performance in pipeline mode (with 1D decomposition for the beam)
- Fixed problem size, strong scaling study: the number of processors is increased by increasing the number of pipeline stages.
- In each stage, the number of processors is chosen according to the transverse size of the problem.
- The benchmark shows that pipeline operation can be scaled to 1,000+ processors with substantial throughput improvement.
Feng et al., J. Comput. Phys. 228(15), 5340 (2009)

8 Domain decomposition and pipelining
- Increase throughput through more execution units (similar to CPU design).
- Pipeline stages are separated in time/space, so pipelining can be viewed as a special case of domain decomposition.
- Each pipeline stage can work with an arbitrary domain decomposition (for the beam):
  - 1D decomposition along the propagation direction works well for small problems, but it limits the number of CPUs that can be used.
  - Domain decomposition in the plasma solver is unaffected (no benefit from using 2D decomposition).
  - General 2D or 3D decomposition for the beam requires complicated data redistributions, but is good for load balancing.
  - Matching the domain decomposition for the beam and the plasma is simple and avoids data redistributions.
[Figure labels: 1D decomposition; 2D decomposition]
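The limit that 1D decomposition places on the CPU count can be made concrete with a small calculation. This is a sketch with assumed names and numbers: `local_shape`, `max_useful_procs`, and the 256×256×128 grid are hypothetical and chosen only for illustration, with z taken as the propagation direction.

```python
from math import prod


def local_shape(grid, procs):
    """Largest per-process subdomain when `grid` cells are split over `procs`."""
    return tuple(-(-n // p) for n, p in zip(grid, procs))  # ceiling division


def max_useful_procs(grid, split_axes):
    """Upper bound on processes if each needs at least one cell per split axis."""
    return prod(grid[a] for a in split_axes)


grid = (256, 256, 128)  # hypothetical (x, y, z) beam grid, z = propagation

one_d = (1, 1, 64)   # split along z only
two_d = (1, 8, 64)   # split along y and z

print("1D local shape:", local_shape(grid, one_d),
      "max procs:", max_useful_procs(grid, [2]))
print("2D local shape:", local_shape(grid, two_d),
      "max procs:", max_useful_procs(grid, [1, 2]))
```

With a 1D split the process count can never exceed the number of cells along the propagation direction, while a 2D split multiplies that bound by the number of cells along the second axis.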

9 Pipeline algorithm verification and scaling
[Figure panels: Verification; Parallel scaling. Legend: without pipelining; with pipelining.]
2048×2048×256 grids, 4 particles/cell, 128 cores/stage, smallest domain 2048×16×2.
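The parameters quoted on this slide can be cross-checked with a short calculation. This is an inference rather than something the slide states: it assumes the grid divides evenly into subdomains, that the third grid dimension is the propagation direction, and that the split along it corresponds to pipeline stages.

```python
grid = (2048, 2048, 256)   # from the slide: grid cells in x, y, z
smallest = (2048, 16, 2)   # from the slide: smallest subdomain
cores_per_stage = 128      # from the slide

# Assumption: even division, so the split count along each axis is grid / subdomain.
splits = tuple(g // s for g, s in zip(grid, smallest))

transverse_split = splits[0] * splits[1]  # processes sharing one stage
n_stages = splits[2]                      # assumed: z split = pipeline stages
total_cores = cores_per_stage * n_stages

print("splits per axis:", splits)
print("transverse split:", transverse_split)
print("implied pipeline stages:", n_stages)
print("implied total cores:", total_cores)
```

Under these assumptions the transverse split equals the quoted 128 cores per stage, and the numbers are consistent with 128 pipeline stages and 16,384 cores in total; the slide itself does not state these two figures.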

