FNAL Geant4 Performance Group Issues and Progress Daniel Elvira for M. Fischler, J. Kowalkowski, M. Paterno.

G4 Performance The LHC experiments are the main G4 customers within HEP, and FNAL (and the FNAL/G4 team) is involved in CMS. It therefore makes sense for the FNAL/G4 performance group to use (primarily) the CMS simulation application to profile Geant4 code:  identify places for improvement,  design and implement improvements,  feed those improvements back to Geant4. For the work presented here: G4 8.1p01, QGSP/QGSP_Bertini (CMSSW180), Z′ (dijets). We use a tool of our own design, SimpleProfiler, to collect detailed call-stack samples. We also use our own tools to manage, analyze, and present the data. Work is in progress to make these tools available to the public.

CHIPS performance analysis CMS reported that a large number of small memory allocations was causing memory fragmentation, and this troubled them. The average t-tbar event in the CMS simulation created and destroyed about 1.6 million G4QHadron objects.  G4QHadron constructors took >1% of program time, which is odd for a constructor.  Constructors of the derived class G4QNucleus took ~2% of program time.

CHIPS code modification & result The problem was one data member:  std::vector<…> * Tb; There was no need for this vector to be on the heap; change the data member to:  std::vector<…> Tb; The result is a 1.5% speed improvement in the whole CMS simulation.

Bertini Performance Analysis 2% (median) of program time is spent in the G4ElementaryParticleCollider constructor. The large spreads across jobs will be discussed later. Partial profiling result for 100 jobs of 100 events each: overall percentage of time taken in functions (excluding children).

Bertini code modification We reorganized G4ElementaryParticleCollider.  Removed 20 of 21 data members that did not need to be part of its state (they were identical for all instances of the class).  Much of the redundant code (across the 20 data members) was replaced by a few templates. The result was a reduction in source code bulk of about 40 printed pages.

Bertini performance improvement Performance increase ~4%. Note that the increase in speed is greater than the initial profiling would suggest: the reduction in object allocation/deletion also benefits other classes. Note: we have a large enough data sample to accurately characterize the differences.

Irregularity in random number use Recall that the Bertini performance analysis showed an unusually large spread in function speed across runs of the same job. We took this to indicate that there might be a reproducibility problem and decided to investigate further. It looks as though there are two distinct groups of measurements.

First observation of irreproducibility 91 runs, each represented by a line on the plot; each line shows the time taken per event. We expect all lines to form one “cable”. We observed that after event 38 the jobs separate into two branches, each of which appears to process the events differently. What is the cause? The effect was highly reproducible for a long period of time; unfortunately it then disappeared, apparently because of changes to the cluster.

Architecture dependence in output Excerpt from a 4 GB Geant4 log file (at line 1,093,802) showing a subtle difference in physics output on different platforms, discovered while trying to understand the irreproducibility problem. For the same KaonPlusInelastic interaction on the TECBackDisk, the AMD run lists #SpawnInStep= 10 secondaries (pi+, pi-, pi+, pi-, proton, pi-, kaon+, gamma, 2 gamma, C11[0.0]), while the Intel run lists #SpawnInStep= 9 (the same list with one fewer gamma). (The numeric energy/momentum columns of the log did not survive transcription.)

Random number usage issue We started looking at the random number generators as the cause of the physics differences. The random number streams are identical on AMD and Intel architectures, but a different number of random numbers is drawn from the generator depending on the architecture, observable already on the first event. This result is completely reproducible: AMD = 59,230,872 random numbers, Intel = 59,511,723 random numbers (a difference of 280,851). We are still investigating the cause.

Plans for the future What we had in mind for the short term (rest of 2008): 1. Perform a major design review of the CHIPS library. 2. Investigate the Intel vs. AMD dependence of the simulation output. 3. Return to profiling, resuming with the newest version of Geant4. 4. Continue work with two interns from Northern Illinois University, who are helping to improve our data collation, analysis, and display tools; the tools will be made public once they are sufficiently robust. But… Fermilab management has temporarily pulled J. Kowalkowski and M. Paterno out of G4 efforts (for at least 2-3 months, effective ~Sep 15th). M. Fischler plus 1 FTE (computer scientist) will undertake (1) shortly, with minimal guidance from M. Paterno.

Plans for the future In the longer term (2009): M. Fischler, J. Kowalkowski, and M. Paterno will resume (2) and (3). The FNAL/G4 team is interested in efforts to support multi-core/multi-threaded programming: making code thread safe and creating code that scales well across multiple CPUs. Manpower is an issue. During this workshop we should discuss a wish list with Gabriele/John. A well-defined long-term program would help allocate FNAL resources to G4.