Published by Brice Poole. Modified over 9 years ago.
1
AMOD Report
Simone Campana, CERN IT-ES
2
Grid Services
A very good week for sites
– No major issues for T1s and T2s
The only one to report is CASTOR@TW
– Tail of problems after a hardware failure
– DB index corrupted; it needs a rebuild and a scheduled downtime
– A typhoon in TW complicated the schedule
3
ATLAS Services: DDM SS
On Saturday there were many SS restarts due to the ARDACallback agent crashing
The problem was not related to Dashboard but to ActiveMQ
– Both ActiveMQ and ARDA Dashboard callbacks are sent by the same agent
Martin B. spotted an issue in an ActiveMQ broker
ActiveMQ callbacks have been disabled on many SS machines
CERN IT has been contacted about the faulty broker
4
ATLAS Services: DDM SS (follow up)
The case needs to be added to the AMOD documentation (or the DDM documentation)
The AMOD needs to be able to see the ActiveMQ monitoring (now certificate protected)
The AMOD needs to be able to log in to the Dashboard machines (was possible, not working now)
DDM SS needs to be protected against this behavior
– Martin has a list of possible improvements
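One possible reading of "protected against this behavior" is that a failure in one callback channel should not take down the agent that also serves the other channel. The sketch below illustrates that idea only. It is not taken from Martin's list or from the DDM code, and the names `send_callbacks`, `dashboard` and `activemq` are hypothetical.

```python
import logging

logger = logging.getLogger("callback_agent")  # hypothetical logger name


def send_callbacks(event, backends):
    """Send `event` to every backend in `backends` (name -> callable).

    A backend that raises is logged and marked as failed, but it does not
    stop delivery to the remaining backends or crash the caller.
    Returns a dict mapping backend name to True (sent) or False (failed).
    """
    results = {}
    for name, send in backends.items():
        try:
            send(event)
            results[name] = True
        except Exception:
            # Isolate the failure: record it and carry on with the others.
            logger.exception("callback backend %r failed", name)
            results[name] = False
    return results


# Illustrative stand-ins for the two channels mentioned in the slides.
delivered = []


def dashboard(event):
    delivered.append(event)


def activemq(event):
    raise ConnectionError("faulty broker")  # simulated broker problem


status = send_callbacks({"dataset": "example"}, {"activemq": activemq, "dashboard": dashboard})
```

With this structure the Dashboard callback is still delivered even though the simulated broker raises, and `status` records which channel failed.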
5
ATLAS Services: DBs
On Sunday afternoon, Online to Offline replication of non-DCS data was “yellow” for 2 hours
This is not an ADC responsibility:
– The P1 shifter should report to the shift leader
– The shift leader should contact the proper people
Something went wrong in this chain
– It is explained in the AMOD twiki, but the AMOD failed to see it
The problem vanished by itself
6
ATLAS Services: schedconfig
There was a “partial” update of schedconfig
– Some queues in IN2P3-CC ended up with “copytool=lcgcp2” and “lfcregister=None”
What happens:
– The pilot uploads to the SE and does not register in LFC (a feature of lcgcp2)
– The Panda server does not register in LFC (since lfcregister=None)
– Both Panda and the pilot believe all is OK and the job finishes successfully
Now we have dark data and Prodsys thinking the task is complete …
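The failure mode above amounts to a combination of two fields in which nobody registers the output in LFC. A minimal sketch of a consistency check for that combination follows. It assumes queues are available as plain dictionaries, which is not how schedconfig is actually stored. The queue names, the `server` value and `other_tool` are illustrative placeholders; only `lcgcp2` and `lfcregister=None` come from the slides.

```python
# Copytools that do NOT register output files in LFC themselves.
# Only lcgcp2 is named in the slides; any further entry would be an assumption.
NON_REGISTERING_COPYTOOLS = {"lcgcp2"}


def registers_nowhere(queue):
    """Return True if neither the pilot nor the Panda server would register in LFC."""
    pilot_skips = queue.get("copytool") in NON_REGISTERING_COPYTOOLS
    # Treat a missing field, an empty string and the literal "None" alike.
    server_skips = queue.get("lfcregister") in (None, "", "None")
    return pilot_skips and server_skips


def find_dark_data_queues(queues):
    """Return the names of queues whose settings would silently produce dark data."""
    return [q["name"] for q in queues if registers_nowhere(q)]


# Hypothetical queue definitions, for illustration only.
example_queues = [
    {"name": "QUEUE_A", "copytool": "lcgcp2", "lfcregister": "server"},
    {"name": "QUEUE_B", "copytool": "lcgcp2", "lfcregister": "None"},
    {"name": "QUEUE_C", "copytool": "other_tool", "lfcregister": "None"},
]

suspect = find_dark_data_queues(example_queues)
```

Here only `QUEUE_B` is flagged, since it is the only one where both the pilot and the server skip registration. A check of this kind could run before an update is accepted, but where it would live (schedconfig, AGIS or elsewhere) is left open, as in the slides.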
7
ATLAS Services: schedconfig (follow up)
Ueda is registering missing files by hand
– 50% of files produced by IN2P3-CC in one week …
– We are lucky Ueda is Ueda … I would take 2 months of holiday.
Schedconfig should protect against this (I am not sure how, or whether and how AGIS can protect), since:
– Human errors happen
– The meaning and behavior of schedconfig fields are not well documented
– We have many queues, many Panda sites and many attributes for each of them
BTW, let’s please push for getting rid of those Panda queues once and for all (see A. Di Girolamo’s thread)
8
ATLAS Services: comp@P1 terminal
Firefox in the comp@P1 terminal crashed during Wednesday night
The shifter tried the restart procedure but did not succeed for 1h
– Unable to connect to any page
– Then he called the AMOD, who could not do much
– But the system magically started to work again
The (unconfirmed) hypothesis is that the conTZole crashed Firefox
– This has happened in the past
– But this time there was at least one other problem
– Ueda suggests running conTZole and all the rest in separate windows
9
Conclusions
Very quiet shift
– My last AMOD was 1 week before the Higgs seminar …
2 night calls (both of them for a good reason)