Analysis efficiency
Andrei Gheata, ALICE offline week, 03 October 2012
Sources for improving analysis efficiency

The analysis flow involves mixed processing phases per event:
– Reading event data from disk – sequential (!)
– De-serializing the event object hierarchy – sequential (!)
– Processing the event – parallelizable
– Cleaning the event structures – sequential
– Writing the output – sequential but parallelizable
– Merging the outputs – sequential but parallelizable

The efficiency of the analysis job:
– job_eff = (t_ds + t_proc + t_cl) / t_total
– analysis_eff = t_proc / t_total

Time/event for the different phases depends on many factors:
– T_read ~ IOPS × event_size / read_throughput – to be minimized: minimize event size, keep read throughput under control
– T_ds + T_cl ~ event_size × n_branches – to be minimized: minimize event size and complexity
– T_proc = Σ_wagons T_i – to be maximized: maximize the number of wagons and the useful processing
– T_write = output_size / write_throughput – to be minimized

[Figure: per-event timeline of t_read, t_ds, t_proc, t_cl, t_write for events #0..#m, #0..#n, #0..#p across input files, followed by t_merge]
Monitoring analysis efficiency

Instrumentation at the level of TAlienFile and AliAnalysisManager
Collecting timing, data-size transfers and efficiency for the different stages
– Correlated with site, SE, LFN, PFN
Collection of data per subjob, remote or local
– mgr->SetFileInfoLog("fileinfo.log");
– Already in action for LEGO trains
Monitored analysis info

Processed input files:

#################################################################
pfn        /11/60343/578c e1-9cd cfd8b68#AliAOD.root
url        root://xrootd3.farm.particle.cz:1094//11/60343/578c e1-9cd cfd8b68#AliAOD.root
se         ALICE::Prague::SE
image      1
nreplicas  0
openstamp  opentime  runtime  filesize  readsize  throughput
#################################################################
pfn        /13/34934/2ed51c74-618b-11e1-a1cc-63e6dd7c661e#AliAOD.root
url        root://xrootd3.farm.particle.cz:1094//13/34934/2ed51c74-618b-11e1-a1cc-63e6dd7c661e#AliAOD.root
se         ALICE::Prague::SE
image      1
nreplicas  0
openstamp  opentime  runtime  filesize  readsize  throughput

Analysis info:

#summary#########################################################
train_name  train
root_time  root_cpu  init_time  io_mng_time  exec_time
alien_site  CERN
host_name   lxbse13c04.cern.ch
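Assuming the records above are plain "field value" lines separated by `#` rules, a minimal parsing helper could look like the following. The helper itself is hypothetical, not part of the AliRoot monitoring code; only the field names come from the dump above:

```cpp
#include <map>
#include <sstream>
#include <string>

// Parse "field value" lines of one fileinfo.log-style record into a map.
// Lines starting with '#' are record separators and are skipped.
std::map<std::string, std::string> ParseRecord(const std::string& text) {
    std::map<std::string, std::string> fields;
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty() || line[0] == '#') continue;
        std::istringstream ls(line);
        std::string key, value;
        if (ls >> key && std::getline(ls, value)) {
            // drop the whitespace left between key and value
            value.erase(0, value.find_first_not_of(" \t"));
            fields[key] = value;
        }
    }
    return fields;
}
```

With such a map per record, per-file quantities (e.g. readsize over runtime) can be aggregated across subjobs and correlated with the SE and site fields.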
Throughput plots

A simple and intuitive way to present the results
Will allow diagnosing both the infrastructure and the analysis

[Figure: throughput [MB/sec] vs. time [sec] per input file (PFN1–PFN5), with the initialization, I/O and execution phases marked]
A few numbers for an empty analysis

Storage                       Access time                         Read size         L    Job time   Throughput   Job efficiency
Spinning disk (50 MB/s)       13 ms                               270 MB AOD PbPb   –    45.5 s     5.93 MB/s    86.5 %
SSD (266 MB/s)                0.2 ms                              270 MB AOD PbPb   –    39.5 s     6.83 MB/s    94.1 %
Inter-site (7.4 MB/s, JINR)   RTT 63 ms + local disk access (?)   AOD PbPb          200  258 s                   2.5 %
Inter-site (7.4 MB/s, JINR)   RTT 63 ms + local disk access (?)   AOD PbPb          5    46.8 s     0.46 MB/s    13.4 %

L = number of concurrent processes running on the disk storage server

I/O latency is a killer for events with many branches
De-serialization is the determinant for locally available data – it depends on the size, but ALSO on the complexity (number of branches)
The source of problems

Highly fragmented buffer queries over a high-latency network
– A big number of buffers retrieved sequentially
No asynchronous reading or prefetching enabled in xrootd or elsewhere
ROOT provides the mechanism to compact buffers and read them asynchronously: TTreeCache
– Not used until now
– Now added in AliAnalysisManager
Reading improvement – AOD

AOD PbPb, JINR::SE (RTT = 65 ms to CERN)

Cache size           Async. read   Speedup
0 (current status)   –             1
50 MB                No
50 MB                Yes
   MB                No
   MB                Yes
   MB                No
   MB                Yes
   MB                No
   MB                Yes
Reading improvement – AOD

AOD pp, LBL::SE (RTT = 173 ms to CERN)

Cache size           Async. read   Speedup
0 (current status)   –             1
50 MB                No
50 MB                Yes
   MB                Yes           20.12
Reading improvement – MC

ESD pp, CNAF::SE (RTT = 20 ms to CERN)

Cache size                   Async. read   Speedup
0 (current status)           –             1
50 MB, ESD cache only        Yes
   MB, ESD, TK, TR caches    Yes           4.22

ESD pp, CERN::EOS (RTT = 0.3 ms)

Cache size                   Async. read   Speedup
0 (current status)           –             1
50 MB, ESD, TK, TR caches    Yes           1.32
What to do to get it

For AOD or ESD data, nothing:
– The cache is set by default to 100 MB, async read enabled
– The size of the cache can be tuned via: mgr->SetCacheSize(bytes)
For MC, the cache sizes for kinematics and TR will follow the manager setting
– Don't forget to use: mcHandler->SetPreReadMode(AliMCEventHandler::kLmPreRead)
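Put together, a train configuration fragment applying these settings might look as follows. This is a sketch, not verbatim AliRoot code: the explicit cache size is illustrative (it is already the default), and the way the MC handler is attached to the manager is assumed; only SetFileInfoLog, SetCacheSize and SetPreReadMode are quoted from these slides:

```cpp
// In the train setup macro, after creating the analysis manager:
AliAnalysisManager *mgr = new AliAnalysisManager("train");
mgr->SetFileInfoLog("fileinfo.log");    // per-subjob I/O monitoring (previous slides)
mgr->SetCacheSize(100 * 1024 * 1024);   // TTreeCache size in bytes; 100 MB is the default

// For MC: kinematics/TR cache sizes follow the manager setting,
// provided pre-reading is enabled on the MC event handler:
AliMCEventHandler *mcHandler = new AliMCEventHandler();
mcHandler->SetPreReadMode(AliMCEventHandler::kLmPreRead);
mgr->SetMCtruthEventHandler(mcHandler);
```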
To do's

Feed analysis info to the alimonitor DB
– Provide real-time info about analysis efficiency and the status of data flows
– Point out site configuration and dispatching problems
TTreePerfStats-based analysis
– Check how our data structures perform and pin down potential problems