Mark Mathis Texas A&M University Darren Kerbyson and Adolfy Hoisie

A Performance Model of non-Deterministic Particle Transport on Large-Scale Systems
Mark Mathis, Texas A&M University
Darren Kerbyson and Adolfy Hoisie, Performance and Architectures Laboratory (PAL), Los Alamos National Laboratory
Presented by: Kei Davis

Performance & Architectures Lab
Performance analysis portfolio at Los Alamos:
  Benchmarking near-to-market advanced systems
  Large-scale simulation: Parsims
  Design of advanced systems
  Application-centric modeling
Developed models of many applications:
  Deterministic transport (Sweep3D, Tycho)
  Hydro code (SAGE)
  Ocean simulation (POP)
  MCNP (described here)
Models are being used in many ways:
  Predicting performance prior to availability
  Comparison of large-scale systems (e.g. ASCI Q vs. the Earth Simulator)
  In procurement of ASCI Purple (expected to be a 100T system in 2004/5)
  During installation of ASCI Q (just completed, 20T Alpha system)

Why Model Performance?
Performance analysis is necessary to evaluate the impact of architectural evolution and innovation. Application modeling provides insight into the achievable performance on current systems, and allows exploration of the performance improvements that can be expected on future systems.

Need to have an expectation
Complex machines and software: single processors, interactions within nodes, interactions between nodes (communication networks), I/O
Large cost for development, deployment and maintenance
Need to know in advance what performance will be:
  Design: lots of system choices
  Implementation: some measurement possible (small scale)
  Procurement: what should we buy? (ASCI Purple)
  Installation: verification of ASCI Q performance
  Maintenance: update of SW and/or HW

MCNP (Monte-Carlo N-Particle)
General-purpose code that can be used for neutron, photon, electron, or coupled transport.
Simulates individual particles (histories) and records aspects (tallies) of their average behavior.
Sequentially, MCNP simulates the requested number of histories on a given input geometry and reports the requested output tallies.
In parallel, MCNP copies the entire input geometry from a master to two or more slaves. Each slave simulates a different set of particles. In each iteration, or cycle, the master merges tallies from all slaves during a rendezvous.
Problem complexity is constrained by the memory available at a single node, because the input geometry is copied to each PE. Hence, parallelism is used to solve the problem faster, rather than to solve a more complex problem in the same amount of time (strong scaling).
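The master/slave cycle described above can be illustrated with a toy sketch. This is not MCNP code: the "physics" is a made-up random walk with a hypothetical 30% absorption probability, the function names are invented for illustration, and the slaves run one after another instead of in parallel. It only shows the structure: the master splits the histories into ranges, each slave tallies its own range, and the master merges the tallies at the rendezvous.

```python
import random

def split_histories(nph, n_slaves):
    """Divide nph histories into n_slaves contiguous, near-equal ranges."""
    base, extra = divmod(nph, n_slaves)
    ranges, start = [], 0
    for i in range(n_slaves):
        count = base + (1 if i < extra else 0)
        ranges.append((start, start + count))
        start += count
    return ranges

def simulate_history(history_id):
    """Toy stand-in for one particle history: collisions before absorption."""
    rng = random.Random(history_id)  # seeded per history, so the split does not change results
    collisions = 0
    while rng.random() > 0.3:        # hypothetical 30% absorption chance per collision
        collisions += 1
    return collisions

def slave_work(first, last):
    """One slave tallies its own range of histories."""
    total = sum(simulate_history(h) for h in range(first, last))
    return {"histories": last - first, "collisions": total}

def run_cycle(nph, p):
    """One cycle with p processes: 1 master and p - 1 slaves."""
    ranges = split_histories(nph, p - 1)                # scatter phase
    partial = [slave_work(a, b) for a, b in ranges]     # work phase (sequential here)
    merged = {"histories": 0, "collisions": 0}          # gather phase / rendezvous
    for tally in partial:
        for key in merged:
            merged[key] += tally[key]
    return merged

if __name__ == "__main__":
    print(run_cycle(1000, 3))
    print(run_cycle(1000, 9))
```

Because every history is seeded by its own index, the merged tally is the same for any number of slaves; only the time to obtain it would change, which is the strong-scaling behavior the slide describes.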

Example Experiment - Criticality
A “critical” system is one in which exactly one of the neutrons produced in a fission reaction goes on to continue the chain reaction. Such a system has a neutron multiplication factor of one, or k_eff = 1. In a “subcritical” system, k_eff < 1, and the chain reaction will die away. If k_eff > 1, the system is “supercritical”. Such a system will produce large amounts of radiation and persistent radioactive contamination. MCNP can be used to simulate the neutron interactions for a given input geometry and calculate k_eff. An example input geometry consists of an insulating cylinder with rods of various types arranged in the middle.
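As a small illustration of the three regimes, the sketch below classifies a given k_eff and shows how an idealized neutron population changes from one generation to the next. The geometric growth n0 * k_eff**g is a textbook simplification used here as an assumption; it is not how MCNP computes k_eff, and the function names are hypothetical.

```python
def classify(k_eff, tol=1e-9):
    """Label a system by its neutron multiplication factor."""
    if abs(k_eff - 1.0) <= tol:
        return "critical"
    return "subcritical" if k_eff < 1.0 else "supercritical"

def population(n0, k_eff, generations):
    """Idealized neutron count per generation: n0 * k_eff**g (simplifying assumption)."""
    return [n0 * k_eff ** g for g in range(generations + 1)]

if __name__ == "__main__":
    for k in (0.95, 1.0, 1.05):
        final = population(1000, k, 50)[-1]
        print(f"k_eff = {k:4.2f}: {classify(k):13s} after 50 generations: {final:.1f}")
```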

Example Input Geometry
[Figure: vertical and horizontal cross-sections of the example input geometry]

Parallel Activity in MCNP
[Figure: parallel activity for one cycle of MCNP across the master and slaves 1 to P-1, showing the scatter phase, the work phase and the gather phase, with stages numbered 1 to 8]
To develop a model, an understanding of the key processing operations and their scaling behavior is required. The figure shows the parallel activity for one cycle of MCNP. An analytical model is obtained from this type of analysis.

Analysis of Parallel Activity
Stage  Source  Action  Quantity             Description
1      Master  bcast   P*8                  particle range to be computed by slaves
2      Master  bcast   229240               update current history
3      Slave   work    Thist*⌈Nph/(P-1)⌉    Thist times the number of particle histories
4      Slave   pt2pt   5512                 task common
5      Slave   pt2pt   320                  tally data
6      Slave   pt2pt   204920               task array 1
7      Slave   pt2pt   48*⌈Nph/(P-1)⌉       task array 2
8      Slave   pt2pt   32                   timing data

Only the main activities are shown.
Stages 1 and 2 correspond to the scatter phase.
Stage 3 is the work phase.
Stages 4-8 correspond to the gather phase.
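The stage table can be turned into a rough per-cycle time estimate. The sketch below is one possible reading, not the authors' model, and it rests on several labeled assumptions: the quantities for the communication stages are message sizes in bytes; the histories per slave are rounded up; a message costs latency plus size over bandwidth, using the 6 µs and 300 MB/s figures quoted for the test-bed; a broadcast takes ceil(log2(P)) message steps; and the master receives the gather messages from the P-1 slaves one after another. Thist = 798 µs is taken from the model parameters.

```python
import math

LATENCY_S = 6e-6         # "typically 6 µs" quoted for the test-bed interconnect
BANDWIDTH_BPS = 300e6    # "~300 MB/s" quoted for the test-bed interconnect
T_HIST_S = 798e-6        # Thist from the model parameters

def t_msg(size_bytes):
    """Assumed cost of one message: latency plus size over bandwidth."""
    return LATENCY_S + size_bytes / BANDWIDTH_BPS

def histories_per_slave(nph, p):
    """Histories handled by the busiest slave (rounding up is an assumption)."""
    return -(-nph // (p - 1))

def cycle_time(nph, p):
    """Estimated seconds per cycle, split by phase."""
    n_slave = histories_per_slave(nph, p)
    depth = math.ceil(math.log2(p))                  # assumed tree depth of a broadcast
    scatter = depth * (t_msg(8 * p) + t_msg(229240))              # stages 1-2
    work = T_HIST_S * n_slave                                     # stage 3
    gather_sizes = [5512, 320, 204920, 48 * n_slave, 32]          # stages 4-8
    gather = (p - 1) * sum(t_msg(s) for s in gather_sizes)        # assumed serialized at the master
    return {"scatter": scatter, "work": work, "gather": gather,
            "total": scatter + work + gather}

if __name__ == "__main__":
    for p in (2, 4, 8, 16, 32, 64, 128):
        t = cycle_time(10000, p)
        print(f"P = {p:3d}  work = {t['work']:.4f} s  gather = {t['gather']:.4f} s  "
              f"total = {t['total']:.4f} s")
```

Under these assumptions the work term shrinks as P grows while the gather term grows roughly linearly with P, so the two eventually cross for a fixed Nph.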

Performance Model (Overview)
Performance is described by analytical expressions.
Top level: [equation not preserved in this transcript]
Elements represent the main processing stages. For example: [equation not preserved in this transcript]
Parameters in the model enable scalability studies, e.g.: P (# PEs), Nph (# histories),

System Model
The system model encapsulates key system characteristics, including:
  Communication (e.g. latency and bandwidth)
  Computational performance (e.g. processor speed)
For example, point-to-point communication can be modeled as a piece-wise linear curve, Tpt2pt(S), where S is the message size in bytes:
  0 ≤ S ≤ 32:      T ~ 5 µs
  64 ≤ S ≤ 1024:   T ~ 5 µs + 15 ns/byte
  S > 1024:        T ~ 10 µs + ... ns/byte

Single-Processor Performance
The single-processor performance can be modeled or measured.
A measured value has several advantages:
  Avoids the need to model compiler optimizations (which are complex!)
  Eliminates the need to model the memory hierarchy
and disadvantages:
  Requires preliminary benchmarking experiments (and access to the system)
  Values are needed for all systems in a comparison

Experimental Test-bed
Compaq AlphaServer ES40:
  32 nodes, each with 4 PEs
  833 MHz EV68 Alpha processors
  64K L1, 8MB L2
  8GB memory per node
Quadrics QsNet interconnect:
  Fat-tree topology
  Low latency (typically 6µs), high bandwidth (~300MB/s)

MCNP Model Parameters
Type         Parameter     Values
System       Lc(S), Bc(S)  5.05µs, 0.0MB/s (S < 64)
                           5.47µs, 78MB/s (64 < S < 512)
                           10.3µs, 294MB/s (S > 512)
System       Tpack(S)      0.12ns (S < 32K)
                           0.16ns (32K < S < 4M)
                           0.67ns (S > 4M)
Application  Nph           100, 500, 1000, 5000, 10000, 50000,
Application  Thist         798µs
S in bytes
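The system parameters above can be encoded as piece-wise lookup functions. The sketch below is an interpretation, not the authors' implementation, and makes these assumptions: MB/s means 10^6 bytes per second; K and M mean 1024 and 1024^2 bytes; the boundary sizes that the table leaves open (exactly 64, 512, 32K and 4M bytes) fall into the upper range; a listed bandwidth of 0.0 MB/s means the message costs latency only; and Tpack(S) is a cost per byte. The function names are hypothetical.

```python
KIB = 1024
MIB = 1024 * 1024

def comm_params(size_bytes):
    """Return (Lc in seconds, Bc in bytes/second) for a message of size_bytes."""
    if size_bytes < 64:
        return 5.05e-6, 0.0
    if size_bytes < 512:
        return 5.47e-6, 78e6
    return 10.3e-6, 294e6

def t_comm(size_bytes):
    """Assumed message time: Lc(S) + S / Bc(S), or Lc(S) alone when Bc is listed as 0."""
    latency, bandwidth = comm_params(size_bytes)
    if bandwidth == 0.0:
        return latency
    return latency + size_bytes / bandwidth

def t_pack(size_bytes):
    """Assumed packing time: a per-byte cost that depends on the message size."""
    if size_bytes < 32 * KIB:
        per_byte = 0.12e-9
    elif size_bytes < 4 * MIB:
        per_byte = 0.16e-9
    else:
        per_byte = 0.67e-9
    return per_byte * size_bytes

if __name__ == "__main__":
    for size in (32, 320, 5512, 204920):
        print(f"S = {size:7d} bytes  t_comm = {t_comm(size) * 1e6:8.2f} µs  "
              f"t_pack = {t_pack(size) * 1e6:8.3f} µs")
```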

Model Validation
The model predicts well: the predicted time is often within 10% of the measured time.
Accuracy generally decreases as the number of PEs grows.
Some work remains to increase model accuracy.

Exploring Performance
Once validated, the model can be used to predict performance, e.g. for new scenarios on the current architecture:
  What if we processed a larger problem?
  What if we used more processors?
[Figures: strong scaling; weak scaling]

Exploring Performance (2)
Can explore performance on possible future architectures:
  What if the network was faster?
  What if the processors were faster?
  What if message packing was faster?
Can predict the performance for possible code modifications:
  What if the “gather” phase was re-implemented using reductions?
[Figures: strong scaling; weak scaling]
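What-if questions of this kind amount to re-evaluating the model with changed parameters. The sketch below is a simplified stand-in for the real model, built on assumptions: stage quantities are treated as message sizes in bytes, a message costs latency plus size over bandwidth (6 µs and 300 MB/s by default), a broadcast takes ceil(log2(P)) steps, and the gather is either serialized over the P-1 slaves or, in the hypothetical `tree_gather` variant, completed in ceil(log2(P)) steps with unchanged message sizes. It does not reproduce the authors' predictions.

```python
def ceil_div(a, b):
    """Integer division rounded up."""
    return -(-a // b)

def cycle_time(nph, p, t_hist=798e-6, latency=6e-6, bandwidth=300e6,
               tree_gather=False):
    """Estimated seconds per cycle under the assumptions stated in the lead-in."""
    n_slave = ceil_div(nph, p - 1)
    depth = (p - 1).bit_length()          # equals ceil(log2(p)) for p >= 2

    def msg(size_bytes):
        return latency + size_bytes / bandwidth

    scatter = depth * (msg(8 * p) + msg(229240))
    work = t_hist * n_slave
    per_slave = sum(msg(s) for s in (5512, 320, 204920, 48 * n_slave, 32))
    gather = (depth if tree_gather else p - 1) * per_slave
    return scatter + work + gather

base = dict(nph=10000, p=32)
scenarios = {
    "baseline": cycle_time(**base),
    "2x faster processors": cycle_time(**base, t_hist=798e-6 / 2),
    "2x faster network": cycle_time(**base, latency=3e-6, bandwidth=600e6),
    "tree-based gather": cycle_time(**base, tree_gather=True),
}

if __name__ == "__main__":
    for name, seconds in scenarios.items():
        print(f"{name:22s} {seconds * 1e3:8.2f} ms")
```

Changing `nph` and `p` together (for example keeping `nph / (p - 1)` fixed) gives a weak-scaling sweep, while holding `nph` fixed and varying `p` gives a strong-scaling sweep.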

Conclusions
Developed an analytical performance model for MCNP.
Validated the model on a large-scale AlphaServer testbed.
  Predicted time is often within 10% of the measured time.
Used the model to explore a number of scenarios.
  Studied strong and weak scaling modes for small and large inputs.
  Predicted performance for improved systems and code.
  Showed that most performance gain will come from increased processor speed.
Illustrated the benefits of developing a performance model of an application.
Part of an on-going effort to model large-scale systems.
