AMOD Report Doug Benjamin Duke University. Running Jobs last 7 days 120K MC sim Users MC Rec Group.

1 AMOD Report Doug Benjamin Duke University

2 Running Jobs last 7 days [plot; axis mark: 120K; legend: MC sim, Users, MC Rec, Group]

3 DDM activity last 7 days [plot; axis mark: 40 TB]

4 FT3-Pilot and Functional test issues
Ongoing FT3-Pilot problems (GGUS:97359 and GGUS:97419) affect Functional tests to all sites.
o Var area full on a single FT3-Pilot machine
o Functional tests are served by FT3-Pilot
o Rucio testing uses the same machinery
o Immediate issue solved
o Additional resources requested
FT3-Pilot (GGUS:97359): a problem with a cached proxy affected functional tests to all sites. (solved)
Functional Tests to Tier 1 sites stopped for a couple of days – Santa Claus needed to be restarted
o Wednesday network intervention likely the cause (next slide)
o Service restored over the weekend

5 Wednesday Network router upgrades
Wednesday (18-Sep) – various core routers were upgraded
– Outages were planned to be sporadic and brief (finished by 10:00 am).
But... several redundant routers were upgraded simultaneously instead of in series. Net result:
o Site-level monitoring frozen and offline
o Many VMs not accessible
o Lxvoadm – the group of machines used to access critical VMs in ATLAS distributed computing – not accessible until 2 hours after the planned outage time
o Santa Claus – in stopped state, but SLS monitoring was green (after it had been restored)

6 Lost files – AFS issues
TRIUMF – many tens of thousands of files lost during storage system migration; exact extent being determined.
AFS – at ~13:03 on 19-Sept a spurious rm process on /afs/cern.ch/atlas/offline/* removed RW areas, including the panda client areas needed by Hammer Cloud. Computing operations restored the needed area from tape promptly when alerted on 20-Sept. The exact cause of the rm is unknown (INC:388802). ATLAS investigation continues.
o Users and Hammer Cloud affected
o Panda client code removed
o Various areas restored from tape

7 Thanks
Thanks to the ADCOS shifters and experts who helped report and debug the issues during the week.
Thanks to ATLAS central operations for recovering from the unexpected outages on Wednesday.
Thanks to CERN IT staff who helped restore services.
Special thanks to Ale DiG., whose patience with DB is always appreciated, especially on the weekend.

