
1 Statistics of CAF usage, Interaction with the GRID Marco MEONI CERN - Offline Week – 11.07.2008

2 Outline
- CAF usage and users' grouping
- Disk monitoring
- Datasets
- CPU fairshare monitoring
- User query
- Conclusions & outlook

3 CERN Analysis Facility
- Cluster of 40 machines, in operation for two years
- 80 CPUs, 8 TB of disk pool
- 35 machines form the PRO partition, 5 the DEV partition
- The head node is the xrootd redirector and the PROOF master
- The other nodes are xrootd data servers and PROOF slaves
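As an illustration of how a user attaches to this setup, here is a minimal ROOT macro sketch; the master URL and the user name are placeholders and are not taken from the slides.

```cpp
// caf_connect.C -- minimal sketch of opening a PROOF session on the CAF.
// The master URL and the user name below are placeholders.
#include <cstdio>
#include "TProof.h"

void caf_connect()
{
   // The head node acts as xrootd redirector and PROOF master;
   // the worker nodes (PROOF slaves) are attached automatically.
   TProof *p = TProof::Open("username@alicecaf.cern.ch");
   if (!p || !p->IsValid()) {
      printf("Could not open a PROOF session\n");
      return;
   }
   // Number of PROOF workers assigned to this session.
   printf("Connected with %d workers\n", p->GetParallel());
}
```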

4 CAF Usage
- Available resources in CAF must be used fairly
- Highest attention to how disks and CPUs are used
- Users are grouped: at present by sub-detector and by physics working group (PWG)
- Users can belong to several groups (PWG has precedence over sub-detector)
- Each group has:
  - a disk space quota, used to stage datasets from AliEn
  - a CPU fairshare target (priority) to regulate concurrent queries

5 CAF Groups

Group     | #Users | Disk quota (GB) | CPU quota (%)
PWG0      |      5 |            1000 |            10
PWG1      |      1 |            1000 |            10
PWG2      |     21 |            1000 |            10
PWG3      |      8 |            1000 |            10
PWG4      |     17 |            1000 |            10
EMCAL     |      1 |               - |            10
HMPID     |      1 |               - |            10
ITS       |      3 |               - |            10
T0        |      1 |               - |            10
MUON      |      3 |               - |            10
PHOS      |      1 |               - |            10
TPC       |      2 |               - |            10
TOF       |      1 |               - |            10
ZDC       |      1 |               - |            10
proofteam |      5 |             100 |            10
testusers |      4 |               0 |            10
marco     |      1 |             200 |            10
COMMON    |      1 |            2000 |            10

Quotas are not absolute. 18 registered groups, ~60 registered users.
165 users have used CAF: please register to groups!

6 Resource Monitoring
- MonALISA ApMon running on each node
- Sends monitoring information every minute
- Default monitoring: load, CPU, memory, swap, disk I/O, network
- Additional information:
  - PROOF and disk server status (xrootd/olbd)
  - Number of PROOF sessions (proofd master)
  - Number of queued staging requests and hosted files (DS manager)
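The real sender is the MonALISA ApMon library; the sketch below is only a schematic stand-in that illustrates the idea of pushing a set of name/value pairs per node once per minute over UDP. The collector host, port and plain-text payload are invented for illustration, and ApMon's actual API and wire format differ.

```cpp
// monitor_sketch.cxx -- schematic stand-in for the per-node monitoring sender.
// It pushes a few name/value pairs to a collector over UDP once per minute.
// The collector host/port and the payload format are invented; the real
// ApMon library has its own API and binary encoding.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdio>
#include <cstdlib>
#include <cstring>

int main()
{
   const char *collector = "127.0.0.1";   // placeholder MonALISA host
   const int   port      = 8884;          // placeholder UDP port

   int fd = socket(AF_INET, SOCK_DGRAM, 0);
   sockaddr_in dst{};
   dst.sin_family = AF_INET;
   dst.sin_port   = htons(port);
   inet_pton(AF_INET, collector, &dst.sin_addr);

   for (;;) {
      // One example metric: the 1-minute load average of this node.
      double load[3] = {0, 0, 0};
      getloadavg(load, 3);

      char msg[128];
      snprintf(msg, sizeof(msg), "cluster=CAF node=lxb6047 load1=%.2f", load[0]);
      sendto(fd, msg, strlen(msg), 0, (sockaddr *)&dst, sizeof(dst));

      sleep(60);   // monitoring information is sent every minute
   }
}
```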

7 Status Table

8 Hosted Files and Disk Usage
[Per-node table of file counts and disk usage (KB) for raw and simulated data on lxb6047-lxb6080 omitted; lxb6054 held no files.]
- #Raw files: 11k, #Sim files: 54k
- Raw on disk: 154 GB, Sim on disk: 4.5 TB
- ESDs from RAW data production are ready to be staged

9 Interaction with the GRID
- Datasets (DS) are used to stage files from AliEn
- A DS is a list of files (usually ESDs or archives) registered by users for processing with PROOF
- Different DSs may share the same physical files
- The staging script issues new staging requests and touches the files every 5 minutes
- Files are uniformly distributed by the xrootd data manager
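A hedged sketch of how such a dataset could be built and registered from ROOT, assuming a PROOF session on CAF is already open; the alien:// paths and the dataset name are placeholders, and the registration options actually used on CAF are not taken from the slides.

```cpp
// register_ds.C -- sketch of building and registering a PROOF dataset
// from AliEn files.  Run with an open CAF session; the alien:// paths
// and the dataset name are placeholders.
#include "TFileCollection.h"
#include "TFileInfo.h"
#include "TGrid.h"
#include "TProof.h"

void register_ds()
{
   // Authenticate against AliEn so the alien:// URLs can be resolved.
   TGrid::Connect("alien://");

   // Collect the files forming the dataset (typically ESDs or archives
   // found with an AliEn query).
   TFileCollection *fc = new TFileCollection("myESDs", "example ESD files");
   fc->Add(new TFileInfo("alien:///alice/sim/2008/run82XX/001/AliESDs.root"));
   fc->Add(new TFileInfo("alien:///alice/sim/2008/run82XX/002/AliESDs.root"));

   // Register the collection; the dataset manager then issues the staging
   // requests that bring the files onto the CAF disk pool.
   gProof->RegisterDataSet("myESDs", fc);

   // List the datasets visible to this user, including staging status.
   gProof->ShowDataSets();
}
```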

10 Dataset Manager
- The DS manager takes care of the quotas at file level
- The physical location of files is regulated by xrootd
- The DS manager daemon sends:
  - the overall number of files
  - the number of new, touched, disappeared and corrupted files
  - staging requests
  - disk utilization for each user and for each group
  - the number of files on each node and the total size

11 Dataset Monitoring
- PWG1 is using 0% of 1 TB
- PWG3 is using 5% of 1 TB

12 Datasets List

Dataset | #Files | Default tree | #Events | Size | Staged
/COMMON/COMMON/ESD5000_part | 1000 | /esdTree | 100000 | 50 GB | 100 %
/COMMON/COMMON/ESD5000_small | 100 | /esdTree | 10000 | 4 GB | 100 %
/COMMON/COMMON/run15034_PbPb | 967 | /esdTree | 939 | 500 GB | 97 %
/COMMON/COMMON/run15035_PbPb | 962 | /esdTree | 952 | 505 GB | 98 %
/COMMON/COMMON/run15036_PbPb | 961 | /esdTree | 957 | 505 GB | 99 %
/COMMON/COMMON/run82XX_part1 | 10000 | /esdTree | 999500 | 289 GB | 99 %
/COMMON/COMMON/run82XX_part2 | 10000 | /esdTree | 922600 | 289 GB | 92 %
/COMMON/COMMON/run82XX_part3 | 10000 | /esdTree | 943100 | 288 GB | 94 %
/COMMON/COMMON/sim_160000_esd | 95 | /esdTree | 9400 | 267 MB | 98 %
/PWG0/COMMON/run30000X_10TeV_0.5T | 2167 | /esdTree | 216700 | 90 GB | 100 %
/PWG0/COMMON/run31000X_0.9TeV_0.5T | 2162 | /esdTree | 216200 | 57 GB | 100 %
/PWG0/COMMON/run32000X_10TeV_0.5T_Phojet | 2191 | /esdTree | 219100 | 83 GB | 100 %
/PWG0/COMMON/run33000X_10TeV_0T | 2191 | /esdTree | 219100 | 108 GB | 100 %
/PWG0/COMMON/run34000X_0.9TeV_0T | 2175 | /esdTree | 217500 | 65 GB | 100 %
/PWG0/COMMON/run35000X_10TeV_0T_Phojet | 2190 | /esdTree | 219000 | 98 GB | 100 %
/PWG0/phristov/kPhojet_k5kG_10000 | 100 | /esdTree | 1100 | 4 GB | 11 %
/PWG0/phristov/kPhojet_k5kG_900 | 97 | /esdTree | 2000 | 4 GB | 20 %
/PWG0/phristov/kPythia6_k5kG_10000 | 99 | /esdTree | 1600 | 4 GB | 16 %
/PWG0/phristov/kPythia6_k5kG_900 | 99 | /esdTree | 1100 | 4 GB | 11 %
/PWG2/COMMON/run82XX_test4 | 10 | /esdTree | 1000 | 297 MB | 100 %
/PWG2/COMMON/run82XX_test5 | 10 | /esdTree | 1000 | 297 MB | 100 %
/PWG2/akisiel/LHC500C0005 | 100 | /esdTree | 97 | 663 MB | 100 %
/PWG2/akisiel/LHC500C2030 | 996 | /esdTree | 995 | 4 GB | 99 %
/PWG2/belikov/40825 | 1355 | /HLTesdTree | 1052963 | 143 GB | 99 %
/PWG2/hricaud/LHC07f_160033DataSet | 915 | /esdTree | 91400 | 2 GB | 99 %
/PWG2/hricaud/LHC07f_160038_root_archiveDataSet | 862 | /esdTree | 86200 | 449 GB | 100 %
/PWG2/jgrosseo/sim_1600XX_esd | 33568 | /esdTree | 3293900 | 103 GB | 98 %
/PWG2/mvala/PDC07_pp_0_9_82xx_1 | 99 | /rsnMVTree | 990000 | 1 GB | 100 %
/PWG2/mvala/RSNMV_PDC06_14TeV | 677 | /rsnMVTree | 6442101 | 24 GB | 100 %
/PWG2/mvala/RSNMV_PDC07_09_part1 | 326 | /rsnMVTree | 2959173 | 5 GB | 100 %
/PWG2/mvala/RSNMV_PDC07_09_part1_new | 326 | /rsnMVTree | 2959173 | 5 GB | 100 %
/PWG2/pganoti/FirstPhys900Field_310000 | 1088 | /esdTree | 108800 | 28 GB | 100 %
/PWG3/arnaldi/PDC07_LHC07g_200314 | 615 | /HLTesdTree | 45000 | 787 MB | 94 %
/PWG3/arnaldi/PDC07_LHC07g_200315 | 594 | /HLTesdTree | 42600 | 744 MB | 95 %
/PWG3/arnaldi/PDC07_LHC07g_200316 | 366 | /HLTesdTree | 30700 | 513 MB | 99 %
/PWG3/arnaldi/PDC07_LHC07g_200317 | 251 | /HLTesdTree | 20100 | 333 MB | 100 %
/PWG3/arnaldi/PDC08_170167_001 | 1 | N/A | | 33 MB | 0 %
/PWG3/arnaldi/PDC08_LHC08t_170165 | 976 | /HLTesdTree | 487000 | 4 GB | 99 %
/PWG3/arnaldi/PDC08_LHC08t_170166 | 990 | /HLTesdTree | 495000 | 4 GB | 100 %
/PWG3/arnaldi/PDC08_LHC08t_170167 | 975 | /HLTesdTree | 424500 | 8 GB | 87 %
/PWG3/arnaldi/myDataSet | 975 | /HLTesdTree | 424500 | 8 GB | 87 %
/PWG4/anju/myDataSet | 946 | /esdTree | 94500 | 27 GB | 99 %
/PWG4/arian/jetjet15-50 | 9817 | /esdTree | 973300 | 630 GB | 99 %
/PWG4/arian/jetjetAbove_50 | 94 | /esdTree | 8000 | 7 GB | 85 %
/PWG4/arian/jetjetAbove_50_real | 958 | /esdTree | 90500 | 73 GB | 94 %
/PWG4/elopez/jetjet15-50_28000x | 7732 | /esdTree | 739800 | 60 GB | 95 %
/PWG4/elopez/jetjet50_r27000x | 8411 | /esdTree | 793100 | 92 GB | 94 %

Jury produced Pt spectrum plots by staging his own DS (run #40825, TPC+ITS, field on). Start staging common DSs of reconstructed runs?
~4.7 GB used out of 6 GB (34 * 200 MB - 10%)
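Once a common dataset is staged, a query against it could look like the following minimal sketch; "MySelector.cxx" is a placeholder TSelector and is not part of the original material.

```cpp
// run_query.C -- sketch of running a PROOF query on a staged common dataset.
// "MySelector.cxx" is a placeholder TSelector; an open CAF session is assumed.
#include "TProof.h"

void run_query()
{
   // Process the esdTree entries of the common dataset; the CPU time
   // consumed by the query counts towards the group's fairshare usage.
   gProof->Process("/COMMON/COMMON/run82XX_part1", "MySelector.cxx+");
}
```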

13 CPU Fairshare
- Usages are retrieved every 5 minutes and averaged every 6 hours
- New priorities are computed by applying a correction formula with values in [α·quota .. β·quota]:
  f(x) = α·q + β·q·exp(k·x), with k = (1/q)·ln(1/4)
- Parameters used: α = 0.5, β = 2
[Plot: priority vs. CPU usage for a group with quota q = 20%; the curve runs from priorityMax = β·q = 40% down towards priorityMin = α·q = 10%.]
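To make the correction concrete, the small self-contained program below implements the formula above with the quoted parameters (α = 0.5, β = 2); the variable names and the explicit clamping to [α·q, β·q] are my reading of the slide, not code from the presentation.

```cpp
// priority.cxx -- sketch of the CPU fairshare correction described above:
//   f(x) = alpha*q + beta*q*exp(k*x),  k = (1/q)*ln(1/4),
// where x is the averaged CPU usage and q the group quota, with the result
// kept inside [alpha*q, beta*q] (the clamping is my interpretation).
#include <algorithm>
#include <cmath>
#include <cstdio>

double Priority(double usage, double quota, double alpha = 0.5, double beta = 2.0)
{
   const double k = std::log(0.25) / quota;                  // k = (1/q)*ln(1/4)
   const double f = alpha * quota + beta * quota * std::exp(k * usage);
   // Keep the new priority inside [alpha*quota, beta*quota].
   return std::min(beta * quota, std::max(alpha * quota, f));
}

int main()
{
   const double q = 0.10;                                    // a 10% CPU quota
   const double usages[] = {0.0, 0.05, 0.10, 0.20, 0.40};
   for (double usage : usages)
      std::printf("usage %4.0f%% -> priority %5.1f%%\n",
                  100 * usage, 100 * Priority(usage, q));
   // At usage == quota the priority equals the quota; with no usage it sits
   // at the maximum, and heavy over-use pushes it towards the minimum.
}
```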

14 Priority Monitoring
- Priorities are used for CPU fairshare and converge to the quotas
- Usages are averaged so that priorities converge gracefully to the quotas
- If there is no competition, users get the maximum number of CPUs
- Only relative priorities are modified!

15 CPU quotas in practice
- Only PWGs + default groups
- default usually has the highest usage

16 Query Monitoring
- When a user query completes, the PROOF master sends statistics:
  - bytes read
  - consumed CPU time (the basis for CPU fairshare)
  - number of processed events
  - user waiting time
- Values are aggregated per user and per group
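On the client side, similar per-query numbers can be inspected through the TQueryResult of the last query; the sketch below shows that analogous view under the assumption that the standard TQueryResult accessors are used, and it is not the code path by which the master ships statistics to MonALISA.

```cpp
// query_stats.C -- sketch of inspecting per-query statistics on the client
// after a query has finished, via the TQueryResult of the last query.
#include <cstdio>
#include "TProof.h"
#include "TQueryResult.h"

void query_stats()
{
   // Result of the most recent query of this session (0 if none).
   TQueryResult *qr = gProof->GetQueryResult();
   if (!qr) return;

   printf("events processed : %lld\n", qr->GetEntries());
   printf("bytes read       : %lld\n", qr->GetBytes());
   printf("CPU time used    : %.1f s\n", qr->GetUsedCPU());
}
```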

17 Query Monitoring (accumulated per interval)

18 Outlook
User session monitoring:
- On average 4-7 sessions in parallel (daytime hours, EU time), with peaks of 15-20 users during the tutorial sessions; running history is missing
- Need to monitor the number of workers per user once load-based scheduling is introduced
Additional monitoring:
- Per single query (disk used and files/sec not implemented yet)
- Network traffic correlation among nodes
- Xrootd activity with the new bulk staging requests
Debug:
- Tool to monitor and kill a hanging session when Reset doesn't work (currently the cluster needs to be restarted)
Hardware:
- New ALICE Mac cluster "ready" (16 workers)
- New IT 8-core machines coming
Training:
- PROOF/CAF is the key setup for interactive user analysis (and more)
- The number of people attending the monthly tutorial is increasing (20 people last week!)

