GeantV – Parallelism, transport structure and overall performance


1 GeantV – Parallelism, transport structure and overall performance
Andrei Gheata (CERN) for the GeantV development team, October 2016

2 Outline
Motivation & objectives
Implementation
Concurrent services
Tuning knobs and adaptive behavior
Interface to accelerators
NUMA awareness
Global performance and perspectives

3 GeantV – Adapting simulation to modern hardware
Classical simulation struggles to approach the full potential of the machine; GeantV needs to profit as much as possible from all processing pipelines.
Single-event scalar transport: embarrassing parallelism, low cache coherence, low vectorization (scalar auto-vectorization only).
Multi-event vector-aware transport: fine-grain parallelism, high cache coherence, high vectorization (explicit multi-particle interfaces).
Compared to the classical approach, GeantV transports vectors of tracks grouped by geometry/physics to gain access to fine-grain parallelism features such as ILP/SIMD, with much better cache coherence.

4 GeantV concurrency: static thread approach
Drawbacks: thread "awareness" (thread id); hard to connect to task-based frameworks.
[Scheduler diagram: the input queue feeds per-volume baskets (Vol1, Vol2, Vol3, ..., Voln, e+/e-, γ) through geometry filters and (fast) physics filters into the vector stepper; charged tracks go through the (field) propagator and step limiter, neutrals are filtered separately; the coprocessor broker, the VecGeom navigator (full or simplified geometry), step sampling and the post-step physics processes run MT vector/scalar processing; secondaries are reshuffled, basketized and sent back to the scheduler.]

5 Scheduler features
Efficient concurrent basketizers: filter tracks by several possible locality criteria and provide "reasonable"-size vectors all along the simulation (a minimal sketch follows this list).
Provide a scalable and balanced workload, minimize the memory footprint and minimize the cool-down phase (tails).
Adaptive behavior to maximize performance: dynamic switching between vector and scalar processing, learning dynamically which filters are "important", and adjusting the number of event slots dynamically to control memory.
Accommodate additional concurrent processing in the simulation workflow: hits/digits/MC truth I/O, digitization/reconstruction tasks.
Notes: "reasonable" is relative to the size of the vector units. The "cool-down" phase is the GeantV regime when most tracks have been transported and the remaining ones are pending in partially filled baskets, requiring flushing/prioritization. Dynamic adaptation was needed because a fixed basket size and "stubborn" vector treatment (as in the initial implementations) hurt either memory or run time; we must be able to build larger vectors for volumes that become important only after some time.
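A minimal sketch of the basketizing idea, assuming simplified placeholder types (Track, Basket and the dispatch callback are illustrative, not the actual GeantV classes): tracks are filtered by a locality criterion, here the volume id, and a full basket is handed over for vector transport.

#include <cstddef>
#include <functional>
#include <mutex>
#include <vector>

struct Track { int volume_id; /* position, momentum, ... */ };
using Basket = std::vector<Track*>;

class Basketizer {
public:
  Basketizer(std::size_t nvolumes, std::size_t vector_size,
             std::function<void(Basket&&)> dispatch)
      : fBaskets(nvolumes), fLocks(nvolumes), fVectorSize(vector_size),
        fDispatch(std::move(dispatch)) {}

  // Called concurrently by transport threads after each step.
  void AddTrack(Track* track) {
    const std::size_t v = static_cast<std::size_t>(track->volume_id);
    std::lock_guard<std::mutex> lock(fLocks[v]);  // per-volume lock: the contention point
    fBaskets[v].push_back(track);
    if (fBaskets[v].size() >= fVectorSize) {      // basket full: dispatch, start a new one
      Basket full;
      full.swap(fBaskets[v]);
      fDispatch(std::move(full));
    }
  }

private:
  std::vector<Basket> fBaskets;             // one open basket per logical volume
  std::vector<std::mutex> fLocks;           // one lock per logical volume
  std::size_t fVectorSize;                  // target ("reasonable") vector size
  std::function<void(Basket&&)> fDispatch;  // hands full baskets to the work queue
};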

6 SOA data handling challenges
[Diagram: operations on the SOA track arrays (fEventV, fParticleV): "use" (what we want before doing work on the data), "compact" and "move" of boundary-crossing tracks (what we need to do, single-threaded), "reshuffle" into charged/neutral, and "basketize" into several baskets concurrently.]
The current approach in GeantV is SOA, used both in basketizing and in transport. This was done to allow easy SIMD, but it is overkill for the performance-critical concurrent operations (see next slide) and introduces a lot of unneeded copy overhead. The "new" approach will instead handle an AOS of track pointers for basketizing, morphing into SOA only what is needed, and only when it is dispatched to geometry/physics (see the sketch below).
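A minimal sketch of that AOS-to-SOA morphing, with illustrative field names rather than the actual GeantV track layout:

#include <cstddef>
#include <vector>

// AOS element: a full track state, kept behind a pointer while basketizing.
struct Track {
  double x, y, z, px, py, pz, e;
};

// SOA view handed to the vectorized geometry/physics interfaces.
struct TrackSoA {
  std::vector<double> x, y, z, px, py, pz, e;

  // Morph the AOS basket into SOA only at dispatch time, instead of keeping
  // the concurrent basketizing itself on top of SOA copies.
  void Gather(const std::vector<Track*>& basket) {
    const std::size_t n = basket.size();
    x.resize(n); y.resize(n); z.resize(n);
    px.resize(n); py.resize(n); pz.resize(n); e.resize(n);
    for (std::size_t i = 0; i < n; ++i) {
      const Track& t = *basket[i];
      x[i] = t.x;   y[i] = t.y;   z[i] = t.z;
      px[i] = t.px; py[i] = t.py; pz[i] = t.pz;
      e[i] = t.e;
    }
  }
};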

7 And the price to pay…
[Plot: run-time fraction spent in the different parts of GeantV (geometry, basketizing, ...) on a 24-core dual-socket Xeon (IVB) with hyperthreading.]
Basketizing is the most important operation generating concurrency cost, and the overhead from data copying is larger than 20%. Q1: can we reduce it? Yes, by handling an AOS. Q2: is the overhead smaller than the benefits? Yes, but only after being demonstrated by benchmarks.
Concurrency on top of SOA is not very efficient because of the long time taken by copying all track fields, which also introduces a lot of false sharing. The observed contention led us to organize the tracks in a different way, in order to reduce the concurrency cost.

8 Concurrent services: queues
We use concurrent queues for aggregating and balancing the workload between workers.
Several mutex-based and lock-free implementations were evaluated; GeantV queues can work at ~10^5 transactions/sec.
Lock-free queues do very well on macOS + clang compared to mutex-based ones (a 50x factor!).
We have evaluated many, but not all, of the available concurrent queues. The required functionality (Push/Pop/TryPop/priority) was added in the wrapper layer; a minimal mutex-based sketch of that wrapper API follows.
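A minimal sketch of such a wrapper, assuming a plain mutex-based backend (the real GeantV wrappers also sit on top of lock-free implementations, and the priority handling shown here is only illustrative):

#include <condition_variable>
#include <deque>
#include <mutex>

template <typename T>
class WorkQueue {
public:
  void Push(const T& item, bool priority = false) {
    {
      std::lock_guard<std::mutex> lock(fMutex);
      if (priority) fQueue.push_front(item);   // priority items jump the line
      else          fQueue.push_back(item);
    }
    fCond.notify_one();
  }

  // Blocking pop: waits until work is available.
  T Pop() {
    std::unique_lock<std::mutex> lock(fMutex);
    fCond.wait(lock, [this] { return !fQueue.empty(); });
    T item = fQueue.front();
    fQueue.pop_front();
    return item;
  }

  // Non-blocking pop: returns false if the queue is empty.
  bool TryPop(T& item) {
    std::lock_guard<std::mutex> lock(fMutex);
    if (fQueue.empty()) return false;
    item = fQueue.front();
    fQueue.pop_front();
    return true;
  }

private:
  std::deque<T> fQueue;
  std::mutex fMutex;
  std::condition_variable fCond;
};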

9 Monitoring to understand behavior
Implemented real-time monitoring tools based on ROOT. They are very useful to understand the behavior of the model and triggered improvements/evolution of the model.
Some parameters can be truly adaptive, such as the "important" volumes that can feed vectors: basketizing only 10% of the volumes in CMS leads to 70% of the transport being done in vector mode. Switching the fraction of vector/scalar work has an impact on performance over a large range.
[Real-time snapshots: frequency of baskets versus basket size for scalar and vector processing, and number of steps per volume (FixedShield, HVQX, ZDC_EMLayer, BeamTube, OQUA, QuadInner, ZDC_EMAbsorber, QuadOuter, QuadCoil, ZDC_EMFiber) versus number of tracks transported.]

10 Memory control
Memory is determined by the number of tracks "in flight", which is in turn determined by the number of events "in flight".
Controlling the memory is important for low production cuts, where the number of secondary particles can explode.
Currently implemented: a policy that deletes empty baskets when a memory watermark is reached. It is not fully effective, but it keeps the memory constant (see the sketch below).
Extra levers (future work): reducing the number of events in flight dynamically (possible with the new event server) and a back burner (waiting queue) for high-energy tracks.
[Plots: queued baskets, memory and tracks in flight versus time.]
High-energy tracks represent a lot of "future" work. They can be put on waiting queues to let low-energy particles fade away when memory is stressed; idle threads may pop from such a queue with priority when memory is fine.
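A minimal sketch of the watermark policy, with illustrative names (the real basket bookkeeping and memory accounting are not reproduced):

#include <cstddef>
#include <vector>

struct Basket { /* container of tracks; empty baskets are kept in a recycling pool */ };

// When resident memory stays below the watermark, empty baskets are recycled;
// above it, the pool of empty baskets is released back to the allocator.
void CheckMemoryWatermark(std::size_t resident_bytes, std::size_t watermark_bytes,
                          std::vector<Basket*>& empty_baskets) {
  if (resident_bytes < watermark_bytes) return;   // memory fine: keep the pool
  for (Basket* basket : empty_baskets) delete basket;
  empty_baskets.clear();                          // under pressure: free the empties
}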

11 Scheduler "knobs" (headlines only)
Keep memory under control: limit the number of buffered events, prioritize events that are "mostly" transported, use the watermark limit to clean baskets.
Keep the vectors up: optimize the vector size (too large: too many pending baskets; too small: inefficient vectorization), trigger postponing of tracks or tracking with scalar algorithms, and also adjust the basket size dynamically.
Popularity service: basketize only the "important" volumes (a minimal sketch follows).
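A minimal sketch of the popularity idea, with an illustrative interface (not the actual GeantV service): steps are counted per volume, and only volumes above a threshold get their own basket (vector mode), the rest staying scalar.

#include <unordered_map>

class PopularityService {
public:
  void AddStep(int volume_id) { ++fSteps[volume_id]; }

  // Decide whether a volume is "important" enough to be basketized.
  bool Basketize(int volume_id, long threshold) const {
    auto it = fSteps.find(volume_id);
    return it != fSteps.end() && it->second >= threshold;
  }

private:
  std::unordered_map<int, long> fSteps;  // steps accumulated per volume
};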

12 Optimization of scheduling parameters
Depends on what needs to be optimized, e.g. memory versus computing time.
It is a multivariate problem, probably too early to optimize: development is iterative, with short cycles.
A genetic-algorithm approach has started to be investigated.

13 Optimizations for “dense” physics: reusing tracks
If a track interacts in the current step, there is no need to re-basketize it (it stays in the same volume): recycle the input basket.
This gives a large gain for dense physics, where the basketizer normally becomes fully blocking; a large fraction of the tracks can be reused in the same thread to release the load (see the sketch below).
"Dense" refers to an interaction length much smaller than the volume size, resulting in several steps in the same volume.
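A minimal sketch of that reuse step; Track and Basketizer are template parameters because the actual GeantV interfaces are not reproduced, and the boundary flag is an assumption:

#include <cstddef>
#include <vector>

template <typename Track, typename Basketizer>
void RecycleOrRebasketize(std::vector<Track*>& basket, Basketizer& basketizer) {
  std::size_t nkept = 0;
  for (Track* track : basket) {
    if (track->boundary)              // crossed into another volume: re-basketize
      basketizer.AddTrack(track);
    else
      basket[nkept++] = track;        // same volume: keep in the input basket
  }
  basket.resize(nkept);               // the reduced basket is stepped again on this thread
}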

14 New development: Integration with task-based experiment frameworks
Some experiments (e.g. CMS) have adopted task-based frameworks, so integrating GeantV in a task-based workflow is very important (and now possible).
Several scenarios invoking GeantV as a task are possible, e.g.: event generator/reader -> particle filter (type, region, energy, …) -> full/fast simulation (GeantV) -> digitization (MC truth + detector response) -> tracking/reconstruction, all inside the experiment framework.

15 Framework: GeantV flow of work in the task approach
[Diagram: a user task injects events into the EventServer; an initial task (the top-level task spawning a "branch" in the TBB tree of tasks) starts a transport task that transports one basket for one step, reusing tracks to keep locality (the transport task may be further split into subtasks); transported tracks go to user scoring and back to the concurrent basketizer(s), which inject full baskets into a concurrent basket queue; a flow-control task (prioritizer) checks "event finished?", "queue empty?" and the memory threshold, can command "dump all your baskets", flushes baskets and prioritizes events, and triggers the I/O task and the user digitizer tasks.]
The initial task spawns a chain of TBB tasks that is basically an interplay between a flow-controller task and a transport task. The transport tasks call the "user actions" tasks during stepping, at the end of each event and at the end of the run.
The transport starts <N> InitialTasks, which create as many GeantV transport chains. The transport tasks in every chain use a common concurrent service for workload balancing, so all TBB chains finish the work at the same time, modulo the last baskets that cannot be split. A hedged sketch of this interplay follows.
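A hedged sketch of that chain mechanism using TBB primitives; the Basket type, TransportOneStep and the use of tbb::concurrent_queue/tbb::task_group are illustrative simplifications, not the actual GeantV task classes.

#include <tbb/concurrent_queue.h>
#include <tbb/task_group.h>

struct Basket { /* tracks grouped by volume, transported for one step */ };
void TransportOneStep(Basket* basket) { /* geometry + physics step, re-basketize outputs */ (void)basket; }

// One "chain": pop a basket from the shared queue, transport it for one step,
// then continue the chain as a new task; the chain ends when the queue is empty
// (in the real scheduler a flow-control task decides whether to flush or stop).
void TransportChain(tbb::concurrent_queue<Basket*>& queue, tbb::task_group& tg) {
  Basket* basket = nullptr;
  if (!queue.try_pop(basket)) return;
  TransportOneStep(basket);
  tg.run([&queue, &tg] { TransportChain(queue, tg); });
}

// Spawn <N> chains sharing one concurrent basket queue, so all chains finish
// at roughly the same time, modulo the last baskets that cannot be split.
void RunTransport(int nchains, tbb::concurrent_queue<Basket*>& queue) {
  tbb::task_group tg;
  for (int i = 0; i < nchains; ++i)
    tg.run([&queue, &tg] { TransportChain(queue, tg); });
  tg.wait();
}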

16 Connection to user framework
Ongoing R&D.
[Diagram: the user framework runs concurrent event-reader tasks (UserGenerator#1..#n with User ReadEventsTask#1..#n); each spawns a UserRunSim task and a Geant::ProcessEvents task; events are prepared to be picked up (through the PrimaryGenerator interface, UserGenerator#n->NextEvent()) by the GeantEventServer; <N> InitTasks pull baskets from the event server/work queue through InitialTask/FlowController chains; a UserEndEventTask runs the post-event step (e.g. steering digitizers) and a UserEndRunTask the post-simulation step (e.g. steering reconstruction) once the run is finished.]
Assume the user framework spawns at most <N> concurrent chains of tasks, each starting by reading events from some external source. To run simulation, each reader spawns a UserRunSim task, preparing to serve its events in the form of a UserGenerator object deriving from the Geant::PrimaryGenerator interface. The events are not immediately pushed by the user into the GeantEventServer, but rather picked up by the GeantRunManager during the Geant::ProcessEvents task, which has to be spawned as a child of each UserRunSim task. The reason for this pick-up mechanism is to have a uniform procedure/interface for injecting events also in the mode where the event loop is steered by GeantV, internally calling an event generator that has a single instance.
The ProcessEvents task is the entry point for GeantV simulation, and it keeps constant (equal to <N>) the number of task chains concurrently running transport (InitTask, registered to the RunManager). This means that TBB can schedule at most <N> transport task chains at a time.
The interaction with the user code is as follows: the flow-controller task detects the end of each event and spawns a UserEndEventTask, which in turn can steer the user event digitization (or whatever other event action). When all primaries are fully transported, the FlowController de-registers the InitTask(s) from the run manager and then exits the corresponding simulation chain. Finishing all simulation chains triggers the spawning of a UserEndRunTask, where the user can steer reconstruction.
Note: in case the user framework has a single 'master control' task steering event reading, simulation, digitization and reconstruction in the same tbb::exec(), it is easier to factorize these steps into a graph of tasks where the completion of one step spawns a task doing the next step. Even if, for example, the digitizer task has alternatives depending on some parameters, we just have to figure out how to handle some flags in order to spawn the right type of task after each event. The alternative is to pause/resume the user 'master control' task at the expense of launching a system thread and putting it to sleep, to be woken up after each event to spawn the next-step task; this looks, however, more complicated.

17 Preliminary TBB results
A first implementation of a task-based approach for GeantV using TBB was deployed.
There are some overheads on Haswell/AVX2, not so obvious on KNL/AVX512: less than 20% performance loss for this first implementation.
[Plots: scalability on a dual-socket 8-physical-core Xeon (AVX2) and on KNL (AVX512).]

18 Topology-aware GeantV
[Diagram: per-NUMA-node transport, each node with its own basketizer and scheduler (Scheduler0..3, Basketizer0..3), connected through a global basketizer.]
Replicate the schedulers on NUMA clusters, with one basketizer per NUMA node; libhwloc is used to detect the topology.
It is possible to use pinning/NUMA allocators to increase locality.
Multi-propagator mode running one or more clusters per quadrant, with loose communication between NUMA nodes at the basketizing step.
Implemented, currently being integrated.

19 Handling sub-node clustering
Known scalability issues of the full GeantV due to synchronization in re-basketizing.
New approach deploying several propagators, clustering resources at the sub-node level. Objectives: improved scalability at the scale of KNL and beyond, addressing both the many-node and the multi-socket (HPC) modes as well as non-homogeneous resources. Implemented recently; a minimal topology-discovery sketch follows.
[Diagram: the GeantV run manager uses a NUMA discovery service (libhwloc) to create several GeantV propagators, each with its own scheduler and basketizer per node/socket, connected through a global basketizer.]
This approach is much closer to multi-processing, but still with sharing (a loose connection). Large common data structures (geometry/physics) may need to be replicated in each NUMA node.
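A minimal sketch of the NUMA discovery step; the libhwloc calls are the real hwloc 2.x API, but the propagator-per-node usage suggested in the trailing comment is only illustrative.

#include <hwloc.h>

int CountNumaNodes() {
  hwloc_topology_t topology;
  hwloc_topology_init(&topology);   // allocate the topology object
  hwloc_topology_load(topology);    // detect the machine topology
  int nnodes = hwloc_get_nbobjs_by_type(topology, HWLOC_OBJ_NUMANODE);
  hwloc_topology_destroy(topology);
  return nnodes > 0 ? nnodes : 1;   // fall back to a single domain on UMA machines
}

// Usage idea: create one propagator (scheduler + basketizer) per NUMA node,
// e.g. for (int node = 0; node < CountNumaNodes(); ++node) CreatePropagator(node);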

20 Multi-propagator performance
The first version revealed a bottleneck in event fetching, which triggered the development of the event server.
Scalability gets better when increasing the number of propagators.
These are not final results, we are still fixing and optimizing: the new version still has a bug in basketizing.
[Plots: speed-up versus number of cores on KNL and on Xeon.]

21 GeantV plans for HPC environments
Future R&D.
Standard mode (one independent process per node): always possible, a no-brainer; possible issues with work balancing (events take different times) and with output granularity (merging may be required).
Multi-tier mode (event servers): useful to work with events from file and to handle merging and workload balancing; communication with the event servers goes via MPI to get event ids in common files (a hedged sketch follows).
[Diagram: an MPI event feeder serves the transport nodes (Node1, Node2, ..., each split into Numa0/Numa1) from an event server on node mod[N], with a merging service collecting the output.]
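A hedged sketch of the multi-tier idea, with rank 0 acting as the event server handing out event ids from a common file; the tags and the integer-id protocol are illustrative, not the actual GeantV MPI layer.

#include <mpi.h>

// Server side (rank 0): answer each request with the next event id, or -1 when drained.
void EventServerLoop(int nevents, int nworkers) {
  int next = 0, done = 0;
  while (done < nworkers) {
    int request;
    MPI_Status st;
    MPI_Recv(&request, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &st);
    int id = (next < nevents) ? next++ : -1;      // -1 means "no more events"
    if (id < 0) ++done;
    MPI_Send(&id, 1, MPI_INT, st.MPI_SOURCE, 1, MPI_COMM_WORLD);
  }
}

// Worker side: request event ids until the server is drained.
void WorkerLoop(int rank) {
  for (;;) {
    int request = rank, id;
    MPI_Send(&request, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    MPI_Recv(&id, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    if (id < 0) break;                            // server drained: stop
    // TransportEvent(id);  // read event <id> from the common file and simulate it
  }
}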

22 Performance measurements for LHC setups: test matrix
Test matrix (semantic changes):
Scheduler        | Geometry                                       | Physics                                 | Magnetic field stepper
Geant4 only      | Legacy G4                                      | Various physics lists                   | Various RK implementations
Geant4 or GeantV | VecGeom 2016 scalar                            | Tabulated physics, scalar physics code  | Helix (fixed field), Cash-Karp Runge-Kutta
GeantV only      | VecGeom 2015, VecGeom 2016 vector, legacy TGeo | Vector physics code                     | Vectorized RK implementation

23 Validation and performance for LHC setups
Exercise at the scale of LHC experiments (CMS & LHCb): full geometry converted to VecGeom, uniform magnetic field, tabulated physics with a fixed 1 MeV cut.
Measuring several cumulative observables in sensitive detectors: energy deposit and particle flux densities for p, π, K.
Comparing GeantV single-threaded with the corresponding Geant4 application (Geant4 10.2, special physics list using tabulated physics): comparable signal, number of secondaries, total steps and physics steps within statistical fluctuations.
TG4/TGV = 3.5 and TG4/TGV = 2.5 for the two setups. Speed-up due to: 1.5 from infrastructure optimizations, 2.4 from algorithmic improvements in geometry, 3.5 from extra locality/marginal basket vectorization (to be profiled).
The results show that the overheads from re-basketizing are under control. The global approach does not yet benefit from significant vectorization throughput; this is anticipated by recent R&D in VecGeom on navigation specialization, which opens these opportunities. Some of the improvements (scalar geometry) can be back-ported to Geant4.

24 Future work
SOA->AOS integration.
Tuning for many-core.
R&D and testing in HPC environments.
Adapting to new architectures (Power8).
Integration with physics and optimization: R-K propagator and multiple scattering.

25 Conclusions
The GeantV core already delivers a part of the hoped-for performance. There are many optimization requirements, and we now understand how to handle most of them; more performance is to be extracted from vectorization soon.
Additional levels of locality (NUMA) are available in modern hardware; topology detection is available in GeantV and is currently being integrated.
Integration with task-based HEP frameworks is now possible: a TBB-enabled GeantV version is ready.
Studying more efficient use of HPC resources, using a multi-tier approach for better workload balancing.
Very promising results in complex applications, with gains from infrastructure simplification, geometry and locality/vectorization.

