GeantV – Parallelism, transport structure and overall performance

1 GeantV – Parallelism, transport structure and overall performance
Andrei Gheata (CERN) for the GeantV development team

2 Outline
October 2016
Motivation & objectives
Implementation
Concurrent services
Tuning knobs and adaptive behavior
Interface to accelerators
NUMA awareness
Global performance and perspectives

3 GeantV – Adapting simulation to modern hardware
Classical simulation finds it hard to approach the full machine potential; GeantV simulation needs to profit at best from all processing pipelines.
Single-event, scalar:
embarrassing parallelism
cache coherence – low
vectorization – low (scalar auto-vectorization only)
Multi-event, vector-aware:
fine-grain parallelism
cache coherence – high
vectorization – high (explicit multi-particle interfaces)
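The contrast between the scalar and the multi-particle interfaces can be sketched as follows; this is an illustrative example (names `TracksSoA`, `StepOne`, `StepMany` are hypothetical, not GeantV API), showing why a structure-of-arrays basket lets the compiler auto-vectorize the stepping loop:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Structure-of-arrays track container: one contiguous array per attribute,
// so the stepping loop below touches memory linearly and can be vectorized.
struct TracksSoA {
  std::vector<double> x, p; // position and momentum, one entry per track
};

// Classical scalar interface: one track per call, no vectorization possible.
inline void StepOne(double &x, double p, double dt) { x += p * dt; }

// Multi-particle interface: a whole basket per call; the loop over the
// SoA arrays is a candidate for compiler auto-vectorization.
inline void StepMany(TracksSoA &t, double dt) {
  for (std::size_t i = 0; i < t.x.size(); ++i) t.x[i] += t.p[i] * dt;
}
```

The same arithmetic runs in both cases; the gain comes purely from the data layout and call granularity.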

4 GeantV concurrency: static thread approach
[Diagram: an input queue of events feeds the basketizer, which filters tracks by logical volume (Vol1, Vol2, Vol3, …, Voln) and by particle type (e+/e-, γ, neutrals); full baskets go to the vector stepper for MT vector/scalar processing – (field) propagator, step limiter, step sampling, physics sampler and post-step physics processes – using the VecGeom navigator on the full or simplified geometry; a coprocessor broker offloads baskets, and outputs plus secondaries are reshuffled back to the scheduler.]

5 Scheduler features
Efficient concurrent basketizers:
filtering tracks by several possible locality criteria
giving reasonably sized vectors all along the simulation
Provide a scalable & balanced workload
Minimize the memory footprint
Minimize the cool-down phase (tails)
Adaptive behavior to maximize performance:
dynamic switch between vector and scalar processing
learning the "important" filters dynamically
adjusting event slots dynamically to control memory
Accommodate additional concurrent processing in the simulation workflow:
hits/digits/kinematics I/O
digitization/reconstruction tasks
A Gheata - GeantV scheduler
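A minimal sketch of the basketizing idea, assuming a mutex-based implementation and illustrative names (`Track`, `Basketizer`): tracks are filtered by a locality key (here the volume id), and a basket is flushed to the work queue once it reaches the target vector size.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <queue>
#include <unordered_map>
#include <vector>

struct Track { int volume; double energy; };

class Basketizer {
public:
  explicit Basketizer(std::size_t vectorSize) : fVectorSize(vectorSize) {}

  // Add a track to the basket of its volume; returns true if a full
  // basket was flushed to the queue of work ready for vector processing.
  bool AddTrack(const Track &t) {
    std::lock_guard<std::mutex> lock(fMutex);
    auto &basket = fPending[t.volume];
    basket.push_back(t);
    if (basket.size() >= fVectorSize) {
      fFullBaskets.push(std::move(basket));
      basket.clear();
      return true;
    }
    return false;
  }

  std::size_t FullBaskets() const { return fFullBaskets.size(); }

private:
  std::size_t fVectorSize;                          // target vector length
  std::mutex fMutex;                                // serializes basketizing
  std::unordered_map<int, std::vector<Track>> fPending; // one basket per volume
  std::queue<std::vector<Track>> fFullBaskets;      // ready for the stepper
};
```

The single mutex here is exactly the fine-grain synchronization point the later slides identify as a scalability limit, which motivates the per-NUMA-node replication.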

6 Data handling challenges
[Diagram: operations on the SoA track containers (fEventV/fParticleV). What we want before doing work on the data: use a compact container A. What we need to do single-threaded: compact A, and move tracks from A to B. What we need to do concurrently: reshuffle A, and basketize from A into B, C, …]

7 And the price to pay…
[Plot: run-time fraction spent in the different parts of GeantV, measured on a 24-core dual-socket Ivy Bridge (IVB) Xeon.]

8 Concurrent services: queues
Important as workload-balancing tools
Several mutex-based and lock-free implementations evaluated
GeantV queues can work at ~10^5 transactions/sec
Lock-free queues do much better than mutex-based ones on macOS + clang (a 50x factor!)
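For reference, a minimal mutex-based work queue of the kind evaluated here; this is a generic sketch, not the GeantV implementation, and the blocking `Pop` is what lock-free variants avoid:

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>

template <typename T>
class WorkQueue {
public:
  void Push(T value) {
    {
      std::lock_guard<std::mutex> lock(fMutex);
      fQueue.push(std::move(value));
    }
    fNotEmpty.notify_one(); // wake one waiting consumer
  }

  // Blocks until an item is available, then returns it in FIFO order.
  T Pop() {
    std::unique_lock<std::mutex> lock(fMutex);
    fNotEmpty.wait(lock, [this] { return !fQueue.empty(); });
    T value = std::move(fQueue.front());
    fQueue.pop();
    return value;
  }

private:
  std::mutex fMutex;
  std::condition_variable fNotEmpty;
  std::queue<T> fQueue;
};
```

Under contention every transaction takes the same lock, which is why the measured gap to lock-free queues can reach the quoted 50x on some platform/compiler combinations.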

9 Scheduler “knobs” (headlines only)
Keep memory under control:
limiting the number of buffered events
prioritizing events that are “mostly” transported
using a watermark limit to clean baskets
Keep the vectors up:
optimize the vector size (too large: too many pending baskets; too small: inefficient vectorization)
trigger postponing tracks, or tracking with scalar algorithms
popularity service: basketize only the “important” volumes
also adjust the basket size dynamically
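The vector/scalar switch mentioned above can be sketched as a simple policy function; the threshold is a hypothetical tuning knob (the names `Mode`, `SelectMode`, `vectorThreshold` are illustrative, not GeantV API):

```cpp
#include <cassert>
#include <cstddef>

enum class Mode { kScalar, kVector };

// Process a basket with the vectorized algorithm only when it holds enough
// tracks to amortize the gather/scatter overhead; otherwise fall back to
// the scalar algorithm rather than waiting for the basket to fill.
inline Mode SelectMode(std::size_t basketSize, std::size_t vectorThreshold) {
  return basketSize >= vectorThreshold ? Mode::kVector : Mode::kScalar;
}
```

The point of the knob is that a too-eager vector mode stalls transport waiting for baskets to fill, while a too-eager scalar mode wastes the SIMD lanes.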

10 Optimization of scheduling parameters
Depends on what needs to be optimized, e.g. memory vs. computing time
A multivariate problem, probably too early to optimize while development is iterative with short cycles
A genetic-algorithm (GA) approach has started to be investigated

11 Monitoring and learning
Real-time monitoring tools based on ROOT have been implemented; very useful to understand the model behavior
Some parameters can be truly adaptive, such as the “important” volumes that can feed vectors
Basketizing only 10% of the volumes in CMS leads to 70% of the transport done in vector mode
[Table: most-stepped CMS volumes – FixedShield, HVQX, ZDC_EMLayer, BeamTube, OQUA, QuadInner, ZDC_EMAbsorber, QuadOuter, QuadCoil, ZDC_EMFiber; per-volume step counts not recovered.]
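The learning step for the “important” volumes could look like the following sketch, assuming step counts accumulated per volume during a warm-up phase (all names are hypothetical): only the top fraction of volumes by step count gets flagged for basketizing.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Returns the names of the most-stepped volumes: sort volumes by step count
// and keep the requested fraction (e.g. 0.1 for the CMS case above).
std::vector<std::string>
SelectPopularVolumes(const std::unordered_map<std::string, long> &stepsPerVolume,
                     double fraction) {
  std::vector<std::pair<std::string, long>> sorted(stepsPerVolume.begin(),
                                                   stepsPerVolume.end());
  std::sort(sorted.begin(), sorted.end(),
            [](const std::pair<std::string, long> &a,
               const std::pair<std::string, long> &b) {
              return a.second > b.second; // descending by step count
            });
  std::size_t count = static_cast<std::size_t>(
      fraction * static_cast<double>(sorted.size()) + 0.5);
  std::vector<std::string> popular;
  for (std::size_t i = 0; i < count && i < sorted.size(); ++i)
    popular.push_back(sorted[i].first);
  return popular;
}
```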

12 Memory control
Memory is determined by the number of tracks “in flight”, itself determined by the number of events “in flight”
Controlling the memory is important for low production cuts, where the number of secondary particles can explode
Currently implemented: a policy deleting empty baskets when a memory watermark is reached; not fully effective, but it keeps the memory constant
Extra levers:
reducing the number of events in flight dynamically (possible with the new event server)
prioritizing the transport of low-energy tracks
[Plot: queued baskets, memory and tracks in flight over time.]
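The watermark policy plus the extra lever can be sketched as below; this is a simplified illustration under assumed names (`SchedulerState`, `ApplyWatermark`, fixed basket size), not the actual GeantV code:

```cpp
#include <cassert>
#include <cstddef>

struct SchedulerState {
  std::size_t memoryUsed;     // bytes currently held by baskets
  std::size_t eventsInFlight; // buffered events being transported
  std::size_t emptyBaskets;   // recyclable baskets holding no tracks
};

// When memory passes the watermark, first release the empty baskets; if
// that is not enough, throttle the number of events in flight.
void ApplyWatermark(SchedulerState &s, std::size_t watermark,
                    std::size_t basketBytes) {
  if (s.memoryUsed <= watermark) return;
  s.memoryUsed -= s.emptyBaskets * basketBytes; // delete empty baskets
  s.emptyBaskets = 0;
  if (s.memoryUsed > watermark && s.eventsInFlight > 1)
    --s.eventsInFlight; // still above the watermark: reduce event injection
}
```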

13 Optimizations for dense physics: reusing tracks
A track interacting in the current step stays in the same volume, so there is no need to re-basketize it: recycle the input basket
Large gain for dense physics, where the basketizer normally becomes fully blocking
A large fraction of tracks can be reused in the same thread, releasing the load
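A sketch of the reuse step, with illustrative names (`Trk`, `ReuseTracks`): the basket is compacted in place, keeping only the live tracks that remained in the current volume, so they can be stepped again by the same thread without touching the contended basketizer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Trk { int volume; bool alive; };

// Compacts the basket in place, keeping tracks that are still alive and
// still inside `volume`; returns how many tracks were reused.
std::size_t ReuseTracks(std::vector<Trk> &basket, int volume) {
  std::size_t kept = 0;
  for (const Trk &t : basket)
    if (t.alive && t.volume == volume) basket[kept++] = t;
  basket.resize(kept); // tracks leaving the volume go to the basketizer instead
  return kept;
}
```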

14 Integration with task-based experiment frameworks
Some experiments (e.g. CMS) have adopted task-based frameworks
Integrating GeantV in a task-based workflow is very important (and now possible)
Several scenarios invoking GeantV as a task are possible, e.g.:
[Diagram: experiment framework pipeline – event generator/reader → particle filter (type, region, energy, …) → full/fast simulation (GeantV) → digitization (MC truth + detector response) → tracking/reconstruction.]

15 Framework: GeantV internal flow in the task approach
[Diagram: a user task injects events into the EventServer; an initial (top-level) task spawns a “branch” in the TBB tree of tasks; each transport task transports one basket for one step (and may be further split into subtasks), reusing tracks to keep locality; the basketizer(s), a concurrent service, collects the transported tracks and enqueues full baskets into the concurrent basket queue; a flow-control task checks “event finished?”, “queue empty?” and the memory threshold, steering the garbage-collector/flushing/prioritizing task (“dump all your baskets”), the I/O task, user scoring and the user digitizers.]

16 Integration with user framework (ongoing R&D)
[Diagram: GeantRunManager drives RunSimulation via TBB tasks – StartRunTask (initialize GeantV if needed, trigger user event injection, start the run), InitTask(s) (configure GeantV, start the event-loop task) and an optional post-simulation EndRunTask spawned via SpawnUserEndRunTask(); the user hooks are fApplication (e.g. CMSSWApplication deriving from GeantApplication) and fGenerator (e.g. CMSSWGenerator deriving from PrimaryGenerator), feeding the event server through NextEvent() and AddTrack(GeantTrack &atrack).]

17 Preliminary TBB results
A first implementation of a task-based approach for GeantV using TBB was deployed: connectivity via FeederTask, concurrency steered by launching InitialTask(s)
Some overheads on Haswell/AVX2, not so obvious on KNL/AVX512
[Plots: scaling on a dual-socket (2 x 8 physical cores) Intel Xeon with AVX2, and on KNL with AVX512.]

18 GeantV beyond the socket: accelerators
The GeantV scheduler can communicate with arbitrary device brokers, either getting work processed in native mode, or processing steps of work and sending the data back to the host
Implemented so far: CUDA broker, KNC offload interface
[Diagram: the multithreaded CPU stepper (geometry, physics, basketizer) exchanges baskets with a GPU broker, a KNC broker (offload) and an MPI broker; a generator feeds the device stepper.]
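The shape of such a broker interface might look like the sketch below; this is purely illustrative (the names `DeviceBroker`, `HostBroker`, `TrackData` are hypothetical, not the GeantV API), with a trivial host-side fallback standing in for a real CUDA or KNC device:

```cpp
#include <cassert>
#include <vector>

struct TrackData { double x, y, z, step; };

// Abstract broker: the scheduler hands over a basket of tracks; the broker
// processes steps on its device and returns the transported tracks.
class DeviceBroker {
public:
  virtual ~DeviceBroker() = default;
  // Returns true if the broker accepted the basket.
  virtual bool AddBasket(const std::vector<TrackData> &basket) = 0;
  // Blocks until the device returns the transported tracks.
  virtual std::vector<TrackData> WaitForResult() = 0;
};

// Host-side fallback used when no accelerator is present: "transports"
// the basket on the CPU (here a stand-in increment of the step length).
class HostBroker : public DeviceBroker {
public:
  bool AddBasket(const std::vector<TrackData> &basket) override {
    fPending = basket;
    return true;
  }
  std::vector<TrackData> WaitForResult() override {
    for (auto &t : fPending) t.step += 1.0; // stand-in for one transport step
    return fPending;
  }
private:
  std::vector<TrackData> fPending;
};
```

Keeping the broker abstract is what lets the scheduler stay agnostic of whether a basket is stepped on the CPU, a GPU, a KNC card or a remote MPI rank.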

19 Topology-aware GeantV
Replicate schedulers on NUMA clusters, with one basketizer per NUMA node
libhwloc is used to detect the topology
Possible to use pinning/NUMA allocators to increase locality
Multi-propagator mode running one or more clusters per quadrant
Loose communication between NUMA nodes at the basketizing step
Implemented, currently being integrated
[Diagram: four per-node scheduler/basketizer pairs (Scheduler0–3, Basketizer0–3), each transporting its own tracks, connected through a global basketizer.]

20 Handling sub-NUMA clustering
Known scalability issues (see next) of full GeantV, due to fine-grain synchronization in re-basketizing
New approach deploying several propagators with sub-NUMA clustering (SNC), now implemented
Objectives: improved scalability at the scale of KNL and beyond; addressing HPC mode with MPI event servers (workload balancing) and non-homogeneous resources
[Diagram: the GeantV run manager uses a NUMA discovery service (libhwloc) to create one GeantV propagator per node/socket, each with its own scheduler and basketizer, tied together by a global basketizer.]

21 Multi-propagator performance
Measurements to be redone

22 GeantV plans for HPC environments
Standard mode (one independent process per node):
always possible, a no-brainer
possible issues with workload balancing (events take different times)
possible issues with output granularity (merging may be required)
Multi-tier mode (event servers):
useful for working with events from file, and for handling merging and workload balancing
communication with the event servers via MPI to get event ids in common files
[Diagram: per-node event feeders (transport on NUMA domains Numa0/Numa1) pull work from an event server node and send results to a merging service.]
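The multi-tier workload-balancing idea can be sketched without the MPI plumbing; in this hypothetical `EventServer`, worker nodes request batches of event ids and the server hands out the next contiguous range from the shared input files:

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Tier-2 event server: balances the workload by distributing event-id
// ranges on demand, so fast nodes simply come back for work more often.
class EventServer {
public:
  EventServer(std::size_t totalEvents, std::size_t batch)
      : fNext(0), fTotal(totalEvents), fBatch(batch) {}

  // Returns the [first, last) event ids for the requesting node;
  // an empty range signals that all events have been handed out.
  std::pair<std::size_t, std::size_t> NextBatch() {
    std::size_t first = fNext;
    std::size_t last = first + fBatch < fTotal ? first + fBatch : fTotal;
    fNext = last;
    return {first, last};
  }

private:
  std::size_t fNext, fTotal, fBatch;
};
```

In the real multi-tier mode the `NextBatch` request/reply would travel over MPI, but the balancing logic is the same.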

23 Validation and performance for LHC setups
Exercise at the scale of LHC experiments (CMS & LHCb)
Full geometry converted to VecGeom + uniform magnetic field
Tabulated physics, fixed 1 MeV
Measuring several cumulative observables in sensitive detectors: energy deposit and particle flux densities for p, π, K
Comparing GeantV single-threaded with the corresponding Geant4 application (Geant4 10.2, special physics list using tabulated physics)
Comparable signal, number of secondaries, total steps and physics steps, within statistical fluctuations
Measured speed-ups of TG4/TGV = 3.5 and TG4/TGV = 2.5 (to be profiled), due to:
1.5 – infrastructure optimizations
2.4 – algorithmic improvements in geometry
3.5 – extra locality/vectorization

24 Future work
SOA->AOS integration
Tuning for many-core
R&D and testing in HPC environments
Adapting to new architectures (POWER8)
Integration with physics and optimization: Runge-Kutta propagator and multiple scattering

25 Conclusions
The GeantV core already delivers a part of the hoped-for performance; many optimization requirements, and we now understand how to handle most of them
More performance to be extracted from vectorization soon
Additional levels of locality (NUMA) are available in modern hardware; topology detection is available in GeantV, currently being integrated
Integration with task-based HEP frameworks is now possible: a TBB-enabled GeantV version is ready
Studying a more efficient use of HPC resources, using a multi-tier approach for better workload balancing
Very promising results in complex applications, with gains from infrastructure simplification, geometry and locality/vectorization

