The optimisation of ALICE code
Federico Carminati, January 19, 2012

Rationale

The HEP code:
- Embarrassingly parallel
- An inextricable mix of branches / integer / float / double
- A "flat" timing distribution: no "hot spots"

We always got away with clock rate; that is no longer possible, and parallelism is here to stay. We cannot claim that we are resource-hungry and then exploit only ~10-50% of the hardware. Just think what that means in terms of money.

Parallelism

[slide: figure from a recent talk by Intel]

If you trust Intel

[slide: figure]

If you trust Intel (2)

[slide: figure]

Why is it so difficult?

- No clear kernel
- C++ code generation / optimisation not well understood
- Most of the technology is only coming out now
- Lack of standards
- Technological risk
- Non-professional coders
- Fast-evolving code
- No control over hardware acquisition

Why is it so difficult (cont.)?

- Amdahl's law sets stringent limits on the results that can be achieved (see the worked example below)
- No "low-level" optimisation alone will yield results
- Heterogeneous parallelism forces multi-level parallelisation
- Essentially the code (all of it!) will have to be re-written
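To make the limit concrete, here is the standard statement of Amdahl's law (a textbook formula, not from the slides). If a fraction p of the code can be parallelised and n processors are available, the speedup is

\[ S(n) = \frac{1}{(1-p) + p/n} \]

Even for p = 0.95 the speedup saturates at S(infinity) = 1/(1-p) = 20, no matter how many cores are thrown at the problem. With a "flat" timing profile and no hot spots, reaching a large p means touching essentially all of the code, which is the point above.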


ALICE strategy (unauthorised)

- Use the LSD-1 to essentially re-write AliRoot
- Use the LSD-2 to extend the parallelism to the Grid; hopefully the major thrust will be on middleware
- Refactor the code in order to expose the maximum of parallelism present at each level
- Keep the code in C++ (no CUDA, OpenCL, etc.)
- Explore the possible use of #pragmas (OpenMP, OpenACC); see the sketch below
- Experiment on all the hardware at hand (OpenLab, but not only)
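As an illustration of the #pragma approach, a minimal sketch of how parallelism could be expressed while staying in plain C++ (Basket and TransportBasket are placeholders, not AliRoot classes):

   #include <vector>

   // Placeholder types: a "basket" of particles and its transport routine.
   struct Basket { /* particles sharing the same volume type */ };
   void TransportBasket(Basket &b) { /* ... transport every track in b ... */ }

   void TransportAll(std::vector<Basket> &baskets) {
      // Parallelism lives only in the pragma: compiled without OpenMP,
      // the same source builds and runs serially, with no CUDA/OpenCL fork.
      #pragma omp parallel for schedule(dynamic)
      for (long i = 0; i < (long)baskets.size(); ++i)
         TransportBasket(baskets[i]);
   }

The same loop could be annotated for accelerators with OpenACC (#pragma acc parallel loop) without leaving C++.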

Timeline

R&D:      May 2012 kick-off; January 2013 work starts.
Phase I:  June 2013 mid-term review; December 2013 end of phase I.
Phase II: June 2014 mid-term review; December 2014 end of phase II.

One example: simulation

The LHC experiments use G4 extensively as their main simulation engine, and they have invested heavily in validation procedures; any new project must be coherent with their frameworks. One of the reasons the experiments develop their own fast MC solutions is that full simulation is too slow for several physics analyses. These fast MCs are not in the G4 framework (different control, different geometries, etc.), but they are becoming coherent with the experiments' frameworks. Given the amount of good work on the G4 physics, it is unthinkable not to capitalise on it.

Event loop and stacking

[slide: diagram of the event loop. The user application pushes primaries onto the stack; the stack manager hands particles to the current transporter, which loops over them using the geometry navigator, the field and the physics processes behind a virtual transporter; the step manager runs the step actions for the selected process and the user step actions, and pushes secondaries back onto the stack.]

Fast and full Monte Carlo

We would like an architecture (via the abstract transporters) where fast and full MC can run together. To make this possible one must have a separate particle stack. However, it was clear from the very beginning in January that the particle stack depends strongly on the constraints of parallelism: multiple threads cannot efficiently update a tree data structure.

Conventional transport

At each step, the navigator *nav has the state of the particle (x, y, z, px, py, pz), the volume instance volume*, etc. We compute the distance to the next boundary with something like

   Dist = nav->DistToOut(volume, x, y, z, px, py, pz);

or the distance to one physics process with, e.g.,

   DistP = nav->DistPhotoEffect(volume, x, y, z, px, py, pz);
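Put together, the conventional pattern is one particle, one step at a time; a minimal sketch of the loop (illustrative names and stub bodies, not the real AliRoot interface):

   #include <algorithm>

   // One particle advanced step by step; geometry and physics each
   // propose a distance and the shortest one decides the step.
   struct Particle { double x, y, z, px, py, pz; bool alive = true; };

   double DistToOut(const Particle &p)       { /* geometry query */ return 1.0; }
   double DistPhotoEffect(const Particle &p) { /* physics query  */ return 2.0; }

   void Transport(Particle &p) {
      while (p.alive) {
         double step = std::min(DistToOut(p), DistPhotoEffect(p));
         p.x += step * p.px;  p.y += step * p.py;  p.z += step * p.pz;
         p.alive = false;   // stub: real code applies the winning process here
      }
   }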


Current situation

We run jobs in parallel, one per core. Nothing wrong with that, except that it does not scale to many cores because it requires too much memory. A multithreaded version may reduce the amount of required memory (say by a factor 2 or 3), but at the expense of performance, and it does not fit well with a hierarchy of processors. So we have a problem, in particular with the way we have designed some data structures, e.g. HepMC.

Can we make progress?

- We need data structures with internal relations only; this can be implemented using pools and indices (see the sketch below)
- When looping over collections, one must avoid navigating large memory areas, which kills the cache
- We must generate vectors of reasonable size, well matched to the degree of parallelism of the hardware and the amount of memory
- We must find a system to avoid the tail effects
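A minimal sketch of the pools-and-indices idea (hypothetical names, not an existing AliRoot class): objects refer to each other by index into one contiguous pool instead of by pointer, so the structure is relocatable and can be scanned linearly, cache-friendly.

   #include <cstdint>
   #include <vector>

   // Tracks reference their mother through an index into the same pool,
   // never through a pointer into a large, scattered memory area.
   struct Track {
      int32_t mother;       // index into TrackPool::fTracks, -1 for a primary
      double  x, y, z;      // position
      double  px, py, pz;   // momentum
   };

   struct TrackPool {
      std::vector<Track> fTracks;          // one contiguous memory block

      int32_t AddTrack(const Track &t) {   // hand out indices, not pointers
         fTracks.push_back(t);
         return (int32_t)fTracks.size() - 1;
      }
      const Track &Mother(int32_t i) const { return fTracks[fTracks[i].mother]; }
   };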

Tails, tails, tails

[slide: figure illustrating the tail effect]

Tails again

[slide: plot of the average number of objects in memory] A killer if one has to wait for the end of col(i) before processing col(i+1).

New transport scheme

[slide: diagram of threads T1-T4 processing baskets of particles] All particles in the same volume type are transported in parallel. Particles entering new volumes, or newly generated ones, are accumulated in the volume basket. Events for which all hits are available are digitized in parallel.

Generations of baskets

When a particle enters a volume or is generated, it is added to the basket of particles for that volume type. The navigator selects the basket with the highest score, using a high/low water mark algorithm. The user has control over the water marks, but the idea is that this should be automatic, as a function of the number of processors and the total amount of memory available. (See interactive demo.) A sketch of the selection policy follows.
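A hedged sketch of what such a selection could look like (the interface and the scoring by basket population are assumptions, not the actual prototype code):

   #include <vector>

   // A basket accumulates the particles for one volume type.
   struct Basket { int nParticles = 0; };

   // Pick the highest-scoring basket; dispatch it when it passes the high
   // water mark, or settle for the low water mark when transporters starve.
   int SelectBasket(const std::vector<Basket> &baskets,
                    int highWater, int lowWater, bool starving) {
      int best = -1, bestScore = 0;
      for (int i = 0; i < (int)baskets.size(); ++i) {
         if (baskets[i].nParticles > bestScore) {
            bestScore = baskets[i].nParticles;
            best = i;
         }
      }
      if (bestScore >= highWater) return best;             // full enough: go now
      if (starving && bestScore >= lowWater) return best;  // keep threads busy
      return -1;                                           // keep accumulating
   }

Tuning highWater and lowWater against the number of processors and the available memory is exactly the knob the slide proposes to automate.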

New transport

At each step, the navigator *nav has the state of the particles (*x, *y, *z, *px, *py, *pz), the volume instances volume**, etc. We compute the distances (array *Dist) to the next boundaries with something like

   nav->DistToOut(volume, x, y, z, px, py, pz, Dist);

or the distances to one physics process with, e.g.,

   nav->DistPhotoEffect(volume, x, y, z, px, py, pz, DistP);

New transport (cont.)

The new transport system implies many changes in the geometry and physics classes. These classes must be vectorized (a lot of work!). Meanwhile we can survive and test the principle by implementing a bridge function like

   void MyNavigator::DistToOut(int n, TGeoVolume **vol, double *x, ...)
   {
      // the "..." elides the remaining coordinate arrays and the output Dist
      for (int i = 0; i < n; i++) {
         Dist[i] = DistToOutOld(vol[i], x[i], ...);
      }
   }

A better solution

[slide: diagram of a pipeline of objects with checkpoint synchronisation] Only one "gap" every N events. This type of solution is required anyhow for pile-up studies.

A better better solution

[slide: diagram with checkpoints] At each checkpoint we have to keep the unfinished objects/events. We can now digitize with parallelism on events, then clear and reuse the slots.


Vectorizing the geometry (ex1)

   Double_t TGeoPara::Safety(Double_t *point, Bool_t in) const
   {
   // computes the closest distance from given point to this shape.
      Double_t saf[3];
      // distance from point to higher Z face
      saf[0] = fZ - TMath::Abs(point[2]);   // Z
      Double_t yt = point[1] - fTyz*point[2];
      saf[1] = fY - TMath::Abs(yt);         // Y
      // cos of angle YZ
      Double_t cty = 1.0/TMath::Sqrt(1.0 + fTyz*fTyz);
      Double_t xt = point[0] - fTxz*point[2] - fTxy*yt;
      saf[2] = fX - TMath::Abs(xt);         // X
      // cos of angle XZ
      Double_t ctx = 1.0/TMath::Sqrt(1.0 + fTxy*fTxy + fTxz*fTxz);
      saf[2] *= ctx;
      saf[1] *= cty;
      if (in) return saf[TMath::LocMin(3,saf)];
      for (Int_t i=0; i<3; i++) saf[i] = -saf[i];
      return saf[TMath::LocMax(3,saf)];
   }

A huge performance gain is expected in this type of code, where the shape constants can be computed outside the loop.

Vectorizing the geometry (ex2)

   G4double G4Cons::DistanceToIn( const G4ThreeVector& p,
                                  const G4ThreeVector& v ) const
   {
     G4double snxt = kInfinity;   // snxt = default return value
     const G4double dRmax = 100*std::min(fRmax1, fRmax2);
     static const G4double halfCarTolerance = kCarTolerance*0.5;
     static const G4double halfRadTolerance = kRadTolerance*0.5;
     G4double tanRMax, secRMax, rMaxAv, rMaxOAv;   // Data for cones
     G4double tanRMin, secRMin, rMinAv, rMinOAv;
     G4double rout, rin;
     G4double tolORMin, tolORMin2, tolIRMin, tolIRMin2;   // `generous' radii squared
     G4double tolORMax2, tolIRMax, tolIRMax2;
     G4double tolODz, tolIDz;
     G4double Dist, s, xi, yi, zi, ri = 0., risec, rhoi2, cosPsi;   // Intersection point vars
     G4double t1, t2, t3, b, c, d;   // Quadratic solver variables
     G4double nt1, nt2, nt3;
     G4double Comp;
     G4ThreeVector Normal;

     // Cone Precalcs
     tanRMin = (fRmin2 - fRmin1)*0.5/fDz;
     secRMin = std::sqrt(1.0 + tanRMin*tanRMin);
     rMinAv  = (fRmin1 + fRmin2)*0.5;
     if (rMinAv > halfRadTolerance) { rMinOAv = rMinAv - halfRadTolerance; }
     else                           { rMinOAv = 0.0; }
     tanRMax = (fRmax2 - fRmax1)*0.5/fDz;
     secRMax = std::sqrt(1.0 + tanRMax*tanRMax);
     rMaxAv  = (fRmax1 + fRmax2)*0.5;
     rMaxOAv = rMaxAv + halfRadTolerance;

     // Intersection with z-surfaces
     tolIDz = fDz - halfCarTolerance;
     tolODz = fDz + halfCarTolerance;
     // ... here starts the real algorithm

A huge performance gain is expected in this type of code, where the shape constants can be computed outside the loop. All of these statements are independent of the particle!
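To make the point concrete, a hedged sketch of what a vectorised version of the TGeoPara::Safety example could look like. TGeoParaV, SafetyV and the array interface are invented for illustration; only the mathematics is taken from the original:

   #include <algorithm>
   #include <cmath>

   // Invented vectorised shape: the particle-independent constants cty and
   // ctx are hoisted out of the loop, and the per-particle work becomes a
   // tight loop over plain arrays, a natural candidate for SIMD.
   struct TGeoParaV {
      double fX, fY, fZ, fTxy, fTxz, fTyz;   // shape parameters

      void SafetyV(int n, const double *x, const double *y, const double *z,
                   bool in, double *saf) const {
         const double cty = 1.0/std::sqrt(1.0 + fTyz*fTyz);
         const double ctx = 1.0/std::sqrt(1.0 + fTxy*fTxy + fTxz*fTxz);
         for (int i = 0; i < n; ++i) {
            const double s0 = fZ - std::abs(z[i]);
            const double yt = y[i] - fTyz*z[i];
            const double s1 = (fY - std::abs(yt))*cty;
            const double xt = x[i] - fTxz*z[i] - fTxy*yt;
            const double s2 = (fX - std::abs(xt))*ctx;
            const double m  = std::min(s0, std::min(s1, s2));
            saf[i] = in ? m : -m;   // same value the scalar version returns
         }
      }
   };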

Vectorizing the physics

This is going to be more difficult when extracting the physics classes from G4. However, important gains are expected in the functions computing the distance to the next interaction point for each process. There is a diversity of interfaces, and we now have sub-branches per particle type.
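As a rough illustration of the kind of function involved (the batched interface and the toy mean free path are invented, not G4 code; the sampling d = -lambda ln(xi) is the textbook recipe):

   #include <cmath>
   #include <random>

   // Toy stand-in for the real cross-section machinery.
   double LambdaPhotoEffect(double energy) { return 10.0/energy; }

   // Invented batched interface: distances to the next photo-effect
   // interaction for n photons, sampled as d = -lambda * ln(xi).
   void DistPhotoEffectV(int n, const double *energy, double *dist,
                         std::mt19937 &rng) {
      std::uniform_real_distribution<double> flat(0.0, 1.0);
      for (int i = 0; i < n; ++i) {
         const double xi = 1.0 - flat(rng);   // uniform in (0,1], avoids log(0)
         dist[i] = -LambdaPhotoEffect(energy[i])*std::log(xi);
      }
   }

Once particles of the same type are batched, the per-type branching moves outside the loop, which is where the gain would come from.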

Plan ahead (no timing yet)

- Continue exploring all concurrency opportunities
- Develop the "virtual transporter" to include a full and a fast option
- Introduce embryonic physics processes (EM) to simulate shower development
- Evaluate the prototype on parallel architectures
- Evaluate different "parallel" languages (OpenMP, CUDA, OpenCL, ...)
- Cooperate with the experiments, for instance with the ATLAS ISF (Integrated Simulation Framework), to put together the fast and full MC
