PROOF on multi-core machines G. GANIS CERN / PH-SFT for the ROOT team Workshop on Parallelization and MultiCore technologies for LHC, CERN, April 2008.

15/04/2007 G. Ganis, Parall.-MultiCore Workshop

Outline
- Introduction
- Optimizing for local machines: PROOF-Lite
- Some performance results
- Future

ROOT and threads
- Multi-threading is the natural way to exploit cores.
- Thread support has long been available in ROOT, but many components cannot be used efficiently with multiple threads:
  - the current CINT
  - containers, files
- Thread safety is ensured via global mutexes, which introduce serialization in many places.
- Chain processing and looping run generic user code, for which thread safety cannot be assumed.
- The situation should improve in the future with the new CINT.

Using cores to improve I/O
- When reading data, a large fraction of the time is spent decompressing.
- This is a case where additional cores may help: decompression is a dedicated task under the control of ROOT, and it could already be done now in a separate thread.
- Planned for the (hopefully) not too distant future.
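
The idea of decompressing in a separate thread can be sketched with the C++ standard library alone. This is only an illustration under stated assumptions, not ROOT code: the run-length "compression", the `Packed` type and the functions `decompress` and `countChar` are hypothetical stand-ins. While the calling thread processes basket i, a helper thread is already decompressing basket i+1.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <string>
#include <utility>
#include <vector>

// Toy run-length encoding standing in for a compressed buffer (assumption).
using Packed = std::vector<std::pair<char, std::size_t>>;

// Expand one compressed basket into its raw content.
std::string decompress(const Packed& basket) {
    std::string out;
    for (const auto& run : basket) out.append(run.second, run.first);
    return out;
}

// Count occurrences of `wanted` in all baskets. Basket i+1 is decompressed
// in a helper thread while the calling thread scans basket i.
std::size_t countChar(const std::vector<Packed>& baskets, char wanted) {
    std::size_t total = 0;
    if (baskets.empty()) return total;

    // Start decompressing the first basket right away.
    std::future<std::string> next = std::async(
        std::launch::async, [&baskets] { return decompress(baskets[0]); });

    for (std::size_t i = 0; i < baskets.size(); ++i) {
        std::string current = next.get();  // wait for basket i
        if (i + 1 < baskets.size()) {
            // Read ahead: basket i+1 is decompressed concurrently.
            next = std::async(std::launch::async, [&baskets, i] {
                return decompress(baskets[i + 1]);
            });
        }
        for (char c : current) {
            if (c == wanted) ++total;  // the "analysis" of basket i
        }
    }
    return total;
}
```

The processing loop never touches shared mutable state, so no mutex is needed; the only synchronization point is `next.get()`.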

ROOT way to exploit multiple resources
- PROOF is the ROOT approach to exploiting multiple resources to reduce the time needed to solve problems that can be formulated as a set of independent tasks, i.e. embarrassingly (or ideally) parallel problems
  - e.g. HEP events in TTrees
- Job splitting to address ideal parallelism is an old concept, but:
  - PROOF interconnects many ROOT sessions in such a way that they are seen as an extension of the normal ROOT shell, with minimal syntax differences.
  - Splitting is dynamic, which allows the load to be optimized.
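
To illustrate why dynamic splitting helps with load balancing, here is a small simulation in standard C++. It is a sketch under assumptions, not PROOF's actual packetizer: the names `dynamicSplit`, `staticSplitWallTime` and `SplitResult`, the fixed packet size and the per-worker speeds are all hypothetical. Each simulated worker pulls the next packet of events as soon as it becomes free, so faster workers naturally end up processing more events.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

struct SplitResult {
    std::vector<long> eventsPerWorker;  // events processed by each worker
    double wallTime;                    // time at which the last worker ends
};

// Pull-based splitting: the free worker with the earliest availability time
// takes the next packet. secPerEvent[w] is the simulated speed of worker w.
SplitResult dynamicSplit(long nEvents, long packetSize,
                         const std::vector<double>& secPerEvent) {
    SplitResult res{std::vector<long>(secPerEvent.size(), 0), 0.0};
    if (secPerEvent.empty() || packetSize <= 0) return res;

    // Min-heap of (time at which the worker becomes free, worker index).
    using Slot = std::pair<double, std::size_t>;
    std::priority_queue<Slot, std::vector<Slot>, std::greater<Slot>> freeAt;
    for (std::size_t w = 0; w < secPerEvent.size(); ++w) freeAt.push({0.0, w});

    long next = 0;  // first event not yet assigned
    while (next < nEvents) {
        Slot s = freeAt.top();
        freeAt.pop();
        const long n = std::min(packetSize, nEvents - next);
        next += n;
        res.eventsPerWorker[s.second] += n;
        const double done = s.first + n * secPerEvent[s.second];
        res.wallTime = std::max(res.wallTime, done);
        freeAt.push({done, s.second});
    }
    return res;
}

// Static splitting for comparison: equal shares are fixed before starting.
double staticSplitWallTime(long nEvents, const std::vector<double>& secPerEvent) {
    double wall = 0.0;
    const long nWorkers = static_cast<long>(secPerEvent.size());
    for (long w = 0; w < nWorkers; ++w) {
        const long share = nEvents / nWorkers + (w < nEvents % nWorkers ? 1 : 0);
        wall = std::max(wall, share * secPerEvent[static_cast<std::size_t>(w)]);
    }
    return wall;
}
```

With two workers, one of them twice as slow as the other, the pull-based scheme gives the faster worker the larger share, and the simulated wall time comes out lower than with an equal up-front split.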

The ROOT data model: Trees & Selectors
[Diagram: a Chain of trees, each made of branches and leaves; the Selector loops over the events (1, 2, …, n, …, last) and reads only the needed parts of each event.]
Selector:
- Begin(): create histos, …; define the output list
- Process(): preselection; if the event is OK, analysis
- Terminate(): final analysis (fitting, …) on the output list
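
The Begin/Process/Terminate structure can be mimicked in plain C++ to show how the three steps divide the work. This is a hedged stand-in, not ROOT's TSelector interface: the `Event` struct, the `PtSelector` class, its cut and binning, and the `runLocal` driver are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical event content; a real tree would expose branches and leaves.
struct Event {
    double pt;
    int nTracks;
};

class PtSelector {
public:
    // Begin(): create the output objects (here a 10-bin histogram in [0, 10)).
    void Begin() {
        hist_.assign(10, 0);
        sumPt_ = 0.0;
        accepted_ = 0;
        meanPt_ = 0.0;
    }

    // Process(): called once per event; preselection first, then analysis.
    bool Process(const Event& ev) {
        if (ev.nTracks < 2) return false;  // preselection: event not OK
        if (ev.pt >= 0.0 && ev.pt < 10.0) {
            ++hist_[static_cast<std::size_t>(ev.pt)];
        }
        sumPt_ += ev.pt;
        ++accepted_;
        return true;
    }

    // Terminate(): final analysis on the accumulated output.
    void Terminate() {
        meanPt_ = accepted_ > 0 ? sumPt_ / accepted_ : 0.0;
    }

    const std::vector<long>& hist() const { return hist_; }
    long accepted() const { return accepted_; }
    double meanPt() const { return meanPt_; }

private:
    std::vector<long> hist_;
    double sumPt_ = 0.0;
    long accepted_ = 0;
    double meanPt_ = 0.0;
};

// Local event loop: the framework, not the user, drives the three steps.
void runLocal(PtSelector& sel, const std::vector<Event>& chain) {
    sel.Begin();
    for (const Event& ev : chain) sel.Process(ev);
    sel.Terminate();
}
```

Because the user code only sees one event at a time, the driver is free to decide where and in which order events are processed, which is what makes the same selector usable locally and in parallel.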

Typical PROOF session

PROOF processing by name:
    // Open the PROOF session
    root[0] TProof *p = TProof::Open("master")
    // Get a TFileCollection describing your dataset
    root[1] TFileCollection *fc = ->Get("mydata");
    // Register your dataset (only once)
    root[2] p->RegisterDataSet("mydata", fc);
    // Run your analysis selector mysel.C over "mydata"
    root[3] p->Process("mydata", "mysel.C+")

Local processing:
    // Get a TFileCollection describing your dataset
    root[0] TFileCollection *fc = ->Get("mydata");
    // Create a TChain
    root[1] TChain *c = new TChain;
    root[2] c->AddFileInfoList(fc->GetList());
    // Run your analysis selector mysel.C over "mydata"
    root[3] c->Process("mysel.C+")

PROOF processing:
    // Open the PROOF session
    root[4] TProof *p = TProof::Open("master")
    // Process on PROOF
    root[5] c->SetProof()
    root[6] c->Process("mysel.C+")

PROOF
- PROOF was developed with T2/T3 analysis facilities in mind: clusters of O(100) nodes.
- Its flexible multi-tier architecture allows it to adapt to very different situations and to scale in both directions:
  - Expand to federated clusters, eventually to the GRID (see A. Manarof at PROOF07)
  - Shrink to a few machines
- Multi-core is the extreme case: one machine, a lot of CPU power …
- How does vanilla PROOF perform on multi-cores?

PROOF in a slide
PROOF: a dynamic approach to end-user HEP analysis on distributed systems, exploiting the intrinsic parallelism of HEP data.
[Diagram of a PROOF-enabled facility. Labels: client; commands, scripts; list of output objects (histograms, …); master; top master; three submasters, each with workers and MSS; geographical domain.]

PROOF exploiting multi-cores
- Demo at the Intel Quad launch (Nov 2006)
- Analysis: a search for π0's in ALICE
- Data: 4 GB simulated (fits in memory)
- Additional computing power fully exploited: quite promising!
[Plot: Evt/s and MB/s for 2, 4 and 8 cores]

However …
- … the analysis was effectively CPU-bound, with quite small outputs (a few 1D histos).
- What if we are at the opposite extreme (I/O-bound, large outputs)?
- PROOF forum report (Feb 2007): "I have a dual-core 64 bit Intel machine, running SLC 4.3. … I setup local PROOF system and made a simple tree that I have filled and analyzed. This is faster on one processor without PROOF than on two with PROOF …"
- What was the problem:
  - one disk, no special hardware
  - very light events: extremely I/O-bound analysis
  - quite large output: large overhead from merging and object transfer

Lesson
- Rather a trivial one.
- Depending on what you do, increasing the available CPU is not the end of the story: the bottleneck can be elsewhere.
- An improved I/O system may be needed, e.g. multiple disks, possibly in RAID.

PROOF optimizations
- While standard, 3-tier PROOF seems basically OK for certain tasks, there is room for improvement in the case of large outputs.
- Target: minimize the number of times output objects are created.
- History of an output object in standard PROOF:
  - Each worker creates an output object and streams it out to the master socket.
  - The master re-creates it by streaming it in from the socket.
  - The master merges it into the final version of the object.
  - The master streams the final object out to the client socket.
  - The client re-creates it by streaming it in from the socket.
- Is all this needed locally?

PROOF optimizations (2)
- Not really:
  - The master is not needed locally: the client can have master functionality.
  - Communication between processes can be improved:
    - using UNIX sockets
    - producing the objects in a shared area, e.g. a file or shared memory, to avoid streaming in and out of sockets

PROOF Lite
- PROOF Lite is a realization of PROOF in 2 tiers: the client starts and directly controls the workers.
- Communication goes via UNIX sockets.
- No need for daemons: workers are started via a call to 'system' and call back the client to establish the connection.
- Starts N_CPU workers by default.
- Currently available from SVN 'branches/dev/prooflite'; soon in the trunk.

PROOF Lite (2)
Additional reasons for PROOF-Lite:
- Can be ported to Windows
  - There is no plan to port the current daemons to Windows
  - Needs a substitute for UNIX sockets: use TCP initially
- Can easily be used to test PROOF code locally before submitting to a standard cluster
  - Some problems with users' code are difficult to debug directly on the cluster

Merging from files
- Recent addition to PROOF
- Each worker writes its output object to a file
- The client-master gets just the locations of the files and merges them using optimized merging
- Quite significant improvements for the case of large outputs
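
The file-based merging scheme can be sketched in standard C++ as follows. This is a minimal illustration under assumptions, not the PROOF implementation: plain text files stand in for ROOT files, a vector of bin counts stands in for the output object, and the names `Histogram`, `writeWorkerOutput` and `mergeFromFiles` are hypothetical. The point is that only file locations travel from the workers to the client-master, which then merges the contents.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <fstream>
#include <string>
#include <vector>

using Histogram = std::vector<long>;  // bin counts standing in for an output object

// Worker side: write the partial output to a file and return only its location.
std::string writeWorkerOutput(const Histogram& h, const std::string& path) {
    std::ofstream out(path);
    for (long count : h) out << count << '\n';
    return path;
}

// Client-master side: receives the file locations and merges bin by bin.
Histogram mergeFromFiles(const std::vector<std::string>& paths, std::size_t nBins) {
    Histogram merged(nBins, 0);
    for (const std::string& path : paths) {
        std::ifstream in(path);
        long count = 0;
        for (std::size_t bin = 0; bin < nBins && (in >> count); ++bin) {
            merged[bin] += count;
        }
    }
    return merged;
}
```

A worker would call `writeWorkerOutput` once at the end of its processing; the client-master collects the returned paths and calls `mergeFromFiles` once, so the partial objects are never streamed through a socket.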

Some results: test setup
- Test machine:
  - Intel Xeon (2x2) x 2.66 GHz
  - 8 GB RAM
- Analysis:
  - Event generation: simple events ($ROOTSYS/test/Event.h)
    - small output (TH1 histograms)
    - large output (TTree, ~350 MB compressed; merging via files)
  - Processing a TTree from files (~80 MB/file)
    - Full dataset: 22 GB
    - Sub-datasets: {2, 4, 6, 7, 8, 9, 10, 12} GB
    - Read from the same disk or from separate disks
- Results are averages of {4,10} runs under the same conditions
- Non-PROOF results were obtained using the same machinery

Simple scaling
- Simple event generation and 1D histogram filling
[Plot labels: ROOT; Standard PROOF]

Simple scaling, ~large output
- Simple event generation, TTree filling, merging via file (output ~350 MB after compression)
- Overhead due to merging: ~ -30%
[Plot labels: ROOT; processing only; including merging]

Scaling processing a tree
- Datasets: 2 GB (fits in memory), 22 GB
[Plot labels: ROOT; 2 GB, no memory refresh; 22 GB]

Hardware impact on scaling
Courtesy of Neng Xu, Wisconsin (PROOF07, Nov 2007)

Performance vs fraction of RAM
- Reading datasets of increasing size
[Plot labels: all in memory; refreshing memory]

What next?
- PROOF-Lite
  - Further optimizations for merging objects
  - Porting to Windows
- Related developments
  - Improve the interface for non-TTree-based analysis, currently based directly on TSelector
    - TSelector template in which to plug macros
    - Dedicated macros to instrument the code to transparently run loops on PROOF
- Continue testing different scenarios to find optimal configurations

Generic, non-data-driven analysis
- Implement the algorithm in a TSelector
- New TProof::Process(const char *selector, Long64_t times)
[Diagram: Selector along the time axis: Begin() (create histos, …; define output list), Process() (analysis) called 1…N times, Terminate() (final analysis: fitting, …, on the output list)]

    // Open the PROOF session
    root[0] TProof *p = TProof::Open("master")
    // Run 1000 times the analysis defined in the MonteCarlo.C TSelector
    root[1] p->Process("MonteCarlo.C+", 1000)

Summary
- PROOF is currently the ROOT way to exploit multi-cores
- Performance:
  - CPU-bound: already quite good
  - I/O-bound: critically depends on I/O performance, as for all systems
- Handling of large outputs significantly improved by file-based merging
- A version optimized for multi-core machines is available for testing
