New CERN CAF facility: parameters, usage statistics, user support Marco MEONI Jan Fiete GROSSE-OETRINGHAUS CERN - Offline Week – 24.10.2008.


1 New CERN CAF facility: parameters, usage statistics, user support
Marco MEONI, Jan Fiete GROSSE-OETRINGHAUS
CERN - Offline Week – 24.10.2008

2 Outline
- New CAF: features
- CAF1 vs CAF2
  - Processing Rate comparison
- Current Statistics
  - Users, Groups
  - Machines, Files, Disks, Datasets, CPUs
- Staging problems
- Conclusions

3 New CAF
Timeline
- 28.09: startup of the new CAF cluster
- 01.10: first day with users on the new cluster
- 07.10: old CAF decommissioned by IT
Usage
- 26 workers instead of 33 (but much faster, see later)
- Head node is « alicecaf » instead of « lxb6046 »
- GSI-based authentication, AliEn certificate needed
  - Announced since July, but many last-minute users had an AliEn account != AFS account or an unknown server certificate
- Dataset clean-up: only the latest data production is staged (First physics - stage 3)
- AF v4-15 meta package redistributed

4 Technical Differences
Cmsd (Cluster Management Service Daemon)
- Why? Olbd is no longer supported
- What? Dynamic load balancing of files and data name-space
- How? The stager daemon benefits from:
  - bulk prepare, which replaces the touch file
  - bulk prepare, which allows files to be "co-located" on the same node
GSI authentication
- Secure communication using user certificates and LDAP-based configuration management

5 Architectural Differences

                     New CAF        Old CAF
Architecture         AMD 64         Intel 32
Machines             13 x 8-core    33 x dual CPU
Space for staging    13 x 2.33 TB   33 x 200 GB
Workers              26 (2/node)    33 (1/node)
Mperf                8570           1307

Why « only » 26 workers?
- You could use 104 if you are alone
- With 26 workers, 4 users can effectively run concurrently
- Estimated average of 8 concurrent users…
Processing units 6.5x faster than old CAF
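The sizing argument on this slide can be checked with a few lines of arithmetic. The sketch below only uses figures from the comparison table; the variable names are illustrative, and it assumes (this is a reading of the slide, not something it states) that one PROOF worker keeps one core busy.

```python
# Figures copied from the comparison table on this slide.
NEW_NODES, CORES_PER_NODE = 13, 8
WORKERS_PER_SESSION = 26            # 2 workers per node
NEW_MPERF, OLD_MPERF = 8570, 1307
NEW_STAGING_TB = 13 * 2.33          # 13 nodes x 2.33 TB
OLD_STAGING_TB = 33 * 0.2           # 33 nodes x 200 GB

total_cores = NEW_NODES * CORES_PER_NODE

# Assumption: one worker occupies one core, so the number of sessions that
# can run without sharing cores is the core count over the session size.
full_speed_sessions = total_cores // WORKERS_PER_SESSION

mperf_ratio = NEW_MPERF / OLD_MPERF

print(f"cores available:        {total_cores}")
print(f"sessions at full speed: {full_speed_sessions}")
print(f"Mperf ratio new/old:    {mperf_ratio:.2f}")
print(f"staging space (TB):     {NEW_STAGING_TB:.2f} new vs {OLD_STAGING_TB:.2f} old")
```

Under that assumption the 104 cores and the 4 concurrent users quoted on the slide both follow directly from 13 x 8 and 104 / 26.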

6 Outline
- CAF2: features
- CAF1 vs CAF2
  - Processing Rate comparison
- Current Statistics
  - Users, Groups
  - Machines, Files, Disks, Datasets, CPUs
- Staging problems
- Conclusions

7 CAF1 vs CAF2 (Processing Rate)
Test Dataset
- First physics (stage 3) pp, Pythia6, 5kG, 10TeV
- /COMMON/COMMON/LHC08c11_10TeV_0.5T
- 1840 files, 276k events
Tutorial task that runs over ESDs and displays Pt distribution
Other comparison test: RAW data reconstruction (Cvetan)

8 Reminder
The test depends on how the files of the dataset used are distributed
Parallel code:
- Creation of workers
- Files validation (workers opening the files)
- Events loop (execution of the selector on the dataset)
Serial code:
- Initialization of PROOF master, session and query objects
- Files look up
- Packetizer (file slices distribution)
- Merging (biggest task)

9
#workers  #events  Size (GB)  Init_time  Proc_time  Ev/s  MB/s  Speedup  Efficiency
33        2k       0.25       0.8s       3s         644   50
          20k      1.35                  17s        1143  77
          120k     8.11                  49s        2423  164
          200k     13.53                 1m23s      2405  163
          276k     18.71                 2m34s      1783  120
26        2k       0.25       0.4s       2s         1062  81    1.6x
          20k      1.35                  6s         3299  225   2.8x
          120k     8.11                  28s        4253  289   1.8x
          200k     13.53                 42s        4743  323   2.0x
          276k     18.71                 55s        4365  340   2.8x
104       2k       0.25       0.9s       2s         848   124   0.8x
          20k      1.35                  5s         3572  244   1.1x     27%
          120k     8.11                  19s        6280  427   1.4x     35%
          200k     13.53                 31s        6365  433   1.3x     32%
          276k     18.71                 45s        6120  417   1.2x     30%
Task executed 5 times and averaged
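The Speedup and Efficiency columns are not defined on the slide. A minimal sketch of one reading that matches several rows: speedup is the ratio of event rates against the next smaller configuration (26 workers vs 33, 104 workers vs 26), and efficiency is that speedup divided by the growth in worker count (104 / 26 = 4). The function names and the selection of rows are illustrative, and the slide's own figures appear rounded or truncated, so recomputed values can differ by a few points.

```python
# Event rates (Ev/s) copied from the table, keyed by number of events.
RATE_33 = {"20k": 1143, "120k": 2423, "200k": 2405}    # old CAF, 33 workers
RATE_26 = {"20k": 3299, "120k": 4253, "200k": 4743}    # new CAF, 26 workers
RATE_104 = {"20k": 3572, "120k": 6280, "200k": 6365}   # new CAF, 104 workers


def speedup(rate, baseline_rate):
    """Ratio of two event rates."""
    return rate / baseline_rate


def efficiency(rate, baseline_rate, workers, baseline_workers):
    """Speedup divided by the factor by which the worker count grew."""
    return speedup(rate, baseline_rate) / (workers / baseline_workers)


for n_events in RATE_33:
    s_26 = speedup(RATE_26[n_events], RATE_33[n_events])
    s_104 = speedup(RATE_104[n_events], RATE_26[n_events])
    e_104 = efficiency(RATE_104[n_events], RATE_26[n_events], 104, 26)
    print(f"{n_events:>5}: 26 vs 33 = {s_26:.2f}x, "
          f"104 vs 26 = {s_104:.2f}x, efficiency = {e_104:.0%}")
```

An efficiency near 30% means that quadrupling the workers buys only a modest gain in event rate, which fits the serial steps (look-up, packetizer, merging) listed on the Reminder slide.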

10 Processing Rate Comparison (1)
The final average rate is the only important information
[Plots: 104 workers, 200k evs; 104 workers, 276k evs]
- The final tail reflects the fact that workers stop working one by one
  - data unevenly distributed
- A longer tail shows a worker overloaded on the last packet(s)
  - at most 3 workers help on the same «slow» packet

11 Processing Rate Comparison (2)
[Plots: Events/sec and MB/sec vs #events; curves for 104 workers, 26 workers, 33 workers]

12 Outline
- CAF2: features
- CAF1 vs CAF2
  - Processing Rate comparison
- Current Statistics
  - Users/Groups
  - Machines, Files, Disks, Datasets, CPUs
- Staging problems
- Conclusions

13 CAF Usage
- Available resources in CAF must be fairly used
- Highest attention to how disks and CPUs are used
- Users are grouped (sub-detectors / physics working groups)
- Each group
  - has a disk space (quota) which is used to stage datasets from AliEn
  - has a CPU fairshare target (priority) to regulate concurrent queries

14 CAF Groups

Group     #Users
PWG0      21 (5)
PWG1      3 (1)
PWG2      39 (21)
PWG3      18 (8)
PWG4      30 (17)
EMCAL     2 (1)
HMPID     1 (1)
ITS       6 (3)
T0        2 (1)
MUON      4 (3)
PHOS      4 (1)
TPC       3 (2)
TOF       1 (1)
TRD       4 (0)
ZDC       1 (1)
VZERO     2 (0)
ACORDE    1 (0)
PMD       3 (0)
DEFAULT

19 registered groups
145 (60) registered users
In brackets () the situation at the previous offline week

15 CAF Status Table

16 Files Distribution
- Nodes with more files can produce tails in processing rate
- Above a defined threshold files are not stored any longer
Min: 1727, Max: 1863, Max difference: 8%

17 Disk Usage
Max: 116, Min: 105, Max difference: 10%
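The "Max difference" figures on the Files Distribution and Disk Usage slides can be reproduced from the quoted extremes. The sketch below assumes the spread is taken relative to the minimum, which the slides do not state; `max_difference` is an illustrative helper name.

```python
def max_difference(minimum, maximum):
    """Spread between the two extremes, relative to the minimum."""
    return (maximum - minimum) / minimum


files_spread = max_difference(1727, 1863)   # files per node
disk_spread = max_difference(105, 116)      # disk usage, units as on the slide

print(f"files per node: {files_spread:.1%}")
print(f"disk usage:     {disk_spread:.1%}")
```

With that convention both values round to the percentages shown on the slides.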

18 Dataset Monitoring
- 28TB disk space for staging
- PWG0: 4TB
- PWG1: 1TB
- PWG2: 1TB
- PWG3: 1TB
- PWG4: 1TB
- ITS: 0.2TB
- COMMON: 2TB

19 CPU Quotas
- The default group is no longer the largest consumer

20 Outline
- CAF2: features
- CAF1 vs CAF2
  - processing rate comparison
- Current Statistics
  - Users, Groups
  - Machines, Files, Disks, Datasets, CPUs
- File Staging
- Conclusions

21 File Stager
- CAF intensively uses 'prepare'
- 0-size files in Castor2 cannot be staged, but replicas are ok
- Check at stager level to avoid spawning infinite prepare on the same empty file
[Flowchart, boxes as extracted:]
- Loop over the replicas (CERN, if any, taken first)
- replica[i] in Castor && size==0?
- Copy replica (API service)
- unable to get online
- replica[i] is not staged?
- Add to StageLIST
- Skip it
- File corrupted. Skip it
- Stage StageLIST
- STOP
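The stager check described on this slide can be sketched as a small selection routine. This is one possible reading of the flowchart, not the actual stager code: the `Replica` fields, the function names, and the way outcomes are reported are all hypothetical, and the "Copy replica (API service)" step is left out.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    site: str        # storage site name, e.g. "CERN"
    in_castor: bool  # replica lives in Castor
    size: int        # size in bytes as reported by the catalogue
    staged: bool     # already online on disk


def order_replicas(replicas):
    """CERN replicas first; the stable sort keeps the rest in order."""
    return sorted(replicas, key=lambda r: r.site != "CERN")


def select_for_staging(files):
    """Split files into (stage_list, skipped, corrupted).

    `files` maps a file name to its list of replicas.
    """
    stage_list, skipped, corrupted = [], [], []
    for name, replicas in files.items():
        for replica in order_replicas(replicas):
            if replica.in_castor and replica.size == 0:
                # An empty Castor copy can never be staged: try the next
                # replica instead of issuing 'prepare' on it forever.
                continue
            if replica.staged:
                skipped.append(name)             # already online
            else:
                stage_list.append((name, replica.site))
            break
        else:
            # No usable replica was found for this file.
            corrupted.append(name)
    return stage_list, skipped, corrupted
```

The returned `stage_list` would then be handed to a single bulk prepare request, which is the point of collecting it before staging anything.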

22 Outline
- CAF2: features
- CAF1 vs CAF2
  - Processing Rate comparison
- Current Statistics
  - Files Distribution
  - Users/Groups
- Staging
- Conclusions

23 CAF Usage
- Subscribe to alice-project-analysis-task-force@cern.ch using CERN SIMBA (http://listboxservices.web.cern.ch/listboxservices)
- Web page at http://aliceinfo.cern.ch/Offline/Analysis/CAF
- CAF tutorial once a month
New CAF
- Faster machines, more space, more fun
- Shaky behavior due to higher user activity is under intensive investigation
Credits
- PROOF Team and IT for the prompt support
  - If (ever) you cannot connect, just drop a mail and wait for… « please try again »

