PROOF on multi-core machines
G. Ganis, CERN / PH-SFT, for the ROOT team
Workshop on Parallelization and MultiCore technologies for LHC, CERN, April 2008


Slide 2: Outline
- Introduction
- Optimizing for local machines: PROOF-Lite
- Some performance results
- Future

Slide 3: ROOT and threads
- Multi-threading is the natural way to exploit multiple cores
- Thread support has long been available in ROOT, but many components cannot be used efficiently from multiple threads:
  - the current CINT
  - containers, files
- Thread safety is ensured via global mutexes, which introduce serialization in many places
- Chain processing and event loops run generic user code, for which thread safety cannot be assumed
- The situation should improve in the future with the new CINT

Slide 4: Using cores to improve I/O
- When reading data, a large fraction of the time is spent decompressing
- This is a case where an additional core can help: decompression is a dedicated task under ROOT's control, and it could already be done now in a separate thread
- Planned for the (hopefully) not-too-distant future
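
To make the idea concrete, here is a minimal sketch, using ROOT's TThread, of overlapping decompression with processing. This is an illustration of the technique, not ROOT's actual implementation; the names DecompressNext and ProcessBaskets are hypothetical.

  // Sketch only: a helper thread decompresses the next basket while the
  // main thread analyzes the current one.
  #include <TThread.h>

  static void *DecompressNext(void *arg)
  {
     // stands in for reading and unzipping the next basket of the tree
     (void) arg;
     return 0;
  }

  void ProcessBaskets(Int_t nBaskets)
  {
     for (Int_t i = 0; i < nBaskets; i++) {
        TThread unzip("unzip", DecompressNext, 0);
        unzip.Run();    // decompress basket i+1 concurrently
        // ... analyze the already decompressed basket i here ...
        unzip.Join();   // basket i+1 is ready for the next iteration
     }
  }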

Slide 5: The ROOT way to exploit multiple resources
- PROOF is the ROOT approach to exploiting multiple resources to reduce the time needed to solve problems that can be formulated as a set of independent tasks, i.e. embarrassingly (ideally) parallel problems
  - e.g. HEP events in TTrees
- Job splitting to address ideal parallelism is an old concept, but PROOF inter-connects many ROOT sessions in such a way that they appear as an extension of the normal ROOT shell, with minimal syntax differences
- Splitting is dynamic, allowing the load to be optimized

Slide 6: The ROOT data model: Trees and Selectors
[Diagram: a Chain of events (branches and leaves, entries 1..n) is processed by a Selector that loops over the events, reading only the needed parts. Begin() creates histograms and defines the output list; Process() applies the preselection and the analysis; Terminate() runs the final analysis (fitting, ...) on the output list.]
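
As a concrete illustration of this model, a minimal selector skeleton might look as follows. The class name and the histogram are hypothetical; selectors generated with TTree::MakeSelector() have the same structure.

  // MySelector.h -- minimal sketch of the Selector model described above
  #include <TSelector.h>
  #include <TTree.h>
  #include <TH1F.h>

  class MySelector : public TSelector {
  public:
     TTree *fChain;   // the TTree/TChain being processed
     TH1F  *fHist;    // example output object

     MySelector() : fChain(0), fHist(0) { }
     virtual void Init(TTree *tree) { fChain = tree; }
     virtual void SlaveBegin(TTree *)
     {
        // create the outputs and register them for merging
        fHist = new TH1F("h", "example", 100, 0., 10.);
        fOutput->Add(fHist);
     }
     virtual Bool_t Process(Long64_t entry)
     {
        fChain->GetEntry(entry);  // read the active branches for this event
        // ... preselection and analysis: fill fHist, ...
        return kTRUE;
     }
     virtual void Terminate()
     {
        // runs back on the client: final analysis (fitting, ...)
     }
     ClassDef(MySelector, 0)
  };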

Slide 7: A typical PROOF session

PROOF processing by dataset name:

  // Open the PROOF session
  root[0] TProof *p = TProof::Open("master")
  // Get a TFileCollection describing your dataset (here assumed to be
  // read from a file f; the source of the collection is site-specific)
  root[1] TFileCollection *fc = (TFileCollection *) f->Get("mydata");
  // Register your dataset (only once)
  root[2] p->RegisterDataSet("mydata", fc);
  // Run your analysis selector mysel.C over "mydata"
  root[3] p->Process("mydata", "mysel.C+")

Local processing versus PROOF processing with a TChain:

  // Get a TFileCollection describing your dataset
  root[0] TFileCollection *fc = (TFileCollection *) f->Get("mydata");
  // Create a TChain
  root[1] TChain *c = new TChain;
  root[2] c->AddFileInfoList(fc->GetList());
  // Local processing: run your analysis selector mysel.C
  root[3] c->Process("mysel.C+")
  // PROOF processing: open a session and redirect the chain to it
  root[4] TProof *p = TProof::Open("master")
  root[5] c->SetProof()
  root[6] c->Process("mysel.C+")

Slide 8: PROOF
- PROOF has been developed with T2/T3 analysis facilities in mind: clusters of O(100) nodes
- Its flexible multi-tier architecture allows it to adapt to very different situations and to scale in both directions:
  - expand to federate clusters, eventually to the Grid (see A. Manafov's talk at PROOF07)
  - shrink to a few machines
- Multi-core is at the extreme: one machine, a lot of CPU power ...
- How does vanilla PROOF perform on multi-core machines?

Slide 9: PROOF in a slide
PROOF: a dynamic approach to end-user HEP analysis on distributed systems, exploiting the intrinsic parallelism of HEP data.
[Diagram: the client sends commands and scripts to a top master; the top master coordinates submasters in different geographical domains, each controlling workers with access to a MSS; the list of output objects (histograms, ...) flows back to the client. Together this forms a PROOF-enabled facility.]

Slide 10: PROOF exploiting multi-cores
- Demo at the Intel Quad launch (Nov 2006)
- Analysis: a search for π0's in ALICE
- Data: 4 GB simulated (fits in memory)
- Additional computing power fully exploited: quite promising!
[Plot: event rate (Evt/s) and throughput (MB/s) for 2, 4 and 8 cores.]

Slide 11: However ...
- ... the analysis was effectively CPU-bound, with quite small outputs (a few 1D histograms)
- What if we are at the opposite extreme (I/O-bound, large outputs)?
- PROOF forum report (Feb 2007): "I have a dual-core 64-bit Intel machine, running SLC 4.3. ... I set up a local PROOF system and made a simple tree that I have filled and analyzed. This is faster on one processor without PROOF than on two with PROOF ..."
- What the problem was:
  - one disk, no special hardware
  - very light events: an extremely I/O-bound analysis
  - quite large output: large overhead from merging and object transfer

Slide 12: Lesson
A rather trivial one:
- Depending on what you do, increasing the available CPU is not the end of the story: the bottleneck can be elsewhere
- An improved I/O system may be needed, e.g. multiple disks, possibly in RAID

Slide 13: PROOF optimizations
- While standard, 3-tier PROOF seems basically OK for certain tasks, there is room for improvement in the case of large outputs
- Target: minimize the number of times output objects are created
- History of an output object in standard PROOF:
  1. each worker creates an output object and streams it out to the master socket
  2. the master re-creates it by streaming it in from the socket
  3. the master merges it into the final version of the object
  4. the master streams the final object out to the client socket
  5. the client re-creates it by streaming it in from the socket
- Is all of this needed locally?
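
Schematically, each hop of this chain goes through ROOT's message streaming. The following sketch uses the public TMessage/TSocket API to show one such hop; it illustrates the cost being discussed, not the actual PROOF internals.

  #include <TMessage.h>
  #include <TSocket.h>
  #include <TH1F.h>

  // Worker side: stream the object out to the master socket
  void SendHist(TSocket *master, TH1F *h)
  {
     TMessage msg(kMESS_OBJECT);
     msg.WriteObject(h);        // serialize the histogram
     master->Send(msg);
  }

  // Master side: re-create the object by streaming it in
  TH1F *RecvHist(TSocket *worker)
  {
     TMessage *msg = 0;
     worker->Recv(msg);         // blocks until a message arrives
     TH1F *h = (TH1F *) msg->ReadObject(msg->GetClass());
     delete msg;
     return h;                  // merged afterwards into the final object
  }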

Slide 14: PROOF optimizations (2)
Not really:
- The master is not needed locally: the client can take over the master functionality
- Communication between processes can be improved:
  - using UNIX domain sockets
  - producing the objects in a shared area, to avoid streaming in/out of sockets, e.g. a file or shared memory
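
The underlying mechanism for the local sockets is standard POSIX. A minimal sketch of setting up such a socket on the server side (an illustration of the mechanism, not the PROOF code; the socket path is hypothetical):

  #include <sys/socket.h>
  #include <sys/un.h>
  #include <unistd.h>
  #include <cstring>

  // A UNIX-domain socket gives same-machine messaging without the TCP stack
  int OpenUnixServer(const char *path)   // e.g. a file in /tmp
  {
     int fd = socket(AF_UNIX, SOCK_STREAM, 0);
     if (fd < 0) return -1;
     sockaddr_un addr;
     memset(&addr, 0, sizeof(addr));
     addr.sun_family = AF_UNIX;
     strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
     unlink(path);                       // remove a stale socket file
     if (bind(fd, (sockaddr *) &addr, sizeof(addr)) < 0 || listen(fd, 8) < 0) {
        close(fd);
        return -1;
     }
     return fd;                          // workers connect(); accept() here
  }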

Slide 15: PROOF-Lite
- PROOF-Lite is a realization of PROOF in 2 tiers
  - the client starts and directly controls the workers
  - communication goes via UNIX sockets
- No daemons needed: workers are started via a call to 'system' and call back the client to establish the connection
- Starts N_CPU workers by default
- Currently available from SVN 'branches/dev/prooflite'; soon in the trunk
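
The user-facing API is unchanged. As a sketch of the expected usage (in ROOT versions where PROOF-Lite is included, an empty, or "lite://", master URL is what selects the local mode):

  // Assumed usage: an empty master URL starts PROOF-Lite with one
  // worker per core; everything else works as in a normal session
  root[0] TProof *p = TProof::Open("")
  root[1] p->Process("mydata", "mysel.C+")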

Slide 16: PROOF-Lite (2)
Additional reasons for PROOF-Lite:
- It can be ported to Windows
  - there is no plan to port the current daemons to Windows
  - needs a substitute for UNIX sockets: use TCP initially
- It can easily be used to test PROOF code locally before submitting to a standard cluster
  - some problems with users' code are difficult to debug directly on the cluster

Slide 17: Merging from files
- A recent addition to PROOF
- Each worker writes its output objects to a file
- The client-master gets just the locations of the files and merges them using optimized merging
- Quite significant improvements in the case of large outputs
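
The file-based merge can be done with ROOT's public TFileMerger class (the same machinery behind hadd). A minimal sketch, with hypothetical file names standing in for the worker outputs:

  #include <TFileMerger.h>

  void MergeWorkerOutputs()
  {
     TFileMerger merger;
     merger.OutputFile("merged.root");   // final output file
     merger.AddFile("worker-0.root");    // partial outputs written
     merger.AddFile("worker-1.root");    // by the PROOF workers
     merger.Merge();                     // object-by-object optimized merge
  }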

Slide 18: Some results: test setup
- Test machine: Intel Xeon (2x2) x 2.66 GHz, 8 GB RAM
- Analysis:
  - event generation: simple events ($ROOTSYS/test/Event.h)
    - small output (TH1 histograms)
    - large output (TTree, ~350 MB compressed; merging via files)
  - processing a TTree from files (~80 MB/file)
    - full dataset: 22 GB; sub-datasets: {2, 4, 6, 7, 8, 9, 10, 12} GB
    - read from the same disk or from separate disks
- Results are averages of {4,10} runs under the same conditions
- Non-PROOF results obtained using the same machinery

Slide 19: Simple scaling
Simple event generation and 1D histogram filling.
[Plot: processing rate vs number of workers, for plain ROOT and standard PROOF.]

Slide 20: Simple scaling, ~large output
Simple event generation, TTree filling, merging via file (output ~350 MB after compression).
[Plot: processing rate vs number of workers, compared with plain ROOT, shown for processing only and including merging; the overhead due to merging is ~30%.]

Slide 21: Scaling when processing a tree
Datasets: 2 GB (fits in memory) and 22 GB.
[Plot: processing rate vs number of workers, compared with plain ROOT, for the 2 GB dataset (no memory refresh) and the 22 GB dataset.]

Slide 22: Hardware impact on scaling
[Plot: courtesy of Neng Xu, Wisconsin, PROOF07, Nov 2007.]

Slide 23: Performance vs fraction of RAM
Reading datasets of increasing size.
[Plot: performance vs dataset size, showing the transition from the all-in-memory regime to the memory-refreshing regime.]

Slide 24: What next?
- PROOF-Lite:
  - further optimizations for merging objects
  - porting to Windows
- Related developments:
  - improve the interface for non-TTree-based analysis, currently based directly on TSelector
    - a TSelector template into which macros can be plugged
    - dedicated macros to instrument the code so that loops run transparently on PROOF
- Continue testing different scenarios to find optimal configurations

Slide 25: Generic, non-data-driven analysis
- New: TProof::Process(const char *selector, Long64_t times)
- Implement the algorithm in a TSelector
[Diagram: the Selector timeline: Begin() creates histograms and defines the output list; Process() runs the analysis for cycles 1...N; Terminate() runs the final analysis (fitting, ...) on the output list.]

  // Open the PROOF session
  root[0] TProof *p = TProof::Open("master")
  // Run 1000 times the analysis defined in the MonteCarlo.C TSelector
  root[1] p->Process("MonteCarlo.C+", 1000)
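
A minimal sketch of what such a selector could look like (a hypothetical example; the point is that the entry argument counts cycles, not tree entries):

  // MonteCarlo.C -- sketch of a non-data-driven selector
  #include <TSelector.h>
  #include <TH1F.h>
  #include <TRandom3.h>

  class MonteCarlo : public TSelector {
  public:
     TH1F    *fHist;
     TRandom3 fRndm;   // NB: each worker should be seeded differently

     MonteCarlo() : fHist(0) { }
     virtual void SlaveBegin(TTree *)
     {
        fHist = new TH1F("mc", "toy distribution", 100, -5., 5.);
        fOutput->Add(fHist);
     }
     virtual Bool_t Process(Long64_t entry)
     {
        // 'entry' is the cycle number, not a tree entry
        fHist->Fill(fRndm.Gaus(0., 1.));
        return kTRUE;
     }
     virtual void Terminate() { /* fit, draw, ... */ }
     ClassDef(MonteCarlo, 0)
  };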

Slide 26: Summary
- PROOF is currently the ROOT way to exploit multi-core machines
- Performance:
  - CPU-bound: already quite good
  - I/O-bound: depends critically on I/O performance, as for all systems
- Handling of large outputs is significantly improved by file-based merging
- A version optimized for multi-core machines is available for testing