FAX status. Overview Status of endpoints and redirectors Monitoring Failover Overflow.


1 FAX status

2 Overview: status of endpoints and redirectors, monitoring, failover, overflow

3 Endpoints. Status on Sat. 15 Nov. Got one more site: RO-07-NIPNE. Problems: we are working on CSCS. Not working at all: Nikhef. Flip-flopping: FZK-LCG2 and NDGF-T1.

4 Direct access: expired cert; wrong config; test jobs were unable to get proxy.

5 Upstream redirection

6 Downstream redirection. Redirectors moved to AI (Agile Infrastructure) machines.

7 Moving redirectors. Herve had to move all the EU redirectors to the Agile Infrastructure and simultaneously upgraded them to xrootd 4.0.4. Started with the DE redirector; access rules had to be re-implemented. Continued with two redirectors per day. But the old machines got re-introduced, which confused everybody. A new set of changes is being applied right now. The situation is now clear, but sites need to restart their services because the IPs changed.

8 Monitoring. The machine receiving info from AMQ and passing it to SSB etc. had to move to the Agile Infrastructure. This took much more time than expected, but it is done now. EU sites were moving to sending monitoring data to CERN. The current state may be seen here (thanks to Igor Pelevanyuk): http://dashb-xrootd-comp.cern.ch/cosmic/ATLASmigrationMonitoring/ A lot of effort is still needed to make summary and detailed monitoring match: http://dashb-ai-621.cern.ch/cosmic/DB_ML_Comparator/ Started deeper analysis of Panda job info data transported into Hadoop at CERN. Further improvements in FSB.
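The summary-versus-detailed comparison mentioned on this slide could be sketched roughly as below. This is only an illustration, not the DB_ML_Comparator implementation: the function name, the per-site byte totals, the site labels and the 5% tolerance are all hypothetical.

```python
def compare_streams(summary, detailed, rel_tol=0.05):
    """Flag sites whose summary and detailed totals disagree.

    summary, detailed: dicts mapping site name -> total (e.g. bytes read).
    rel_tol: hypothetical tolerance on the relative difference.
    Returns {site: (summary_total, detailed_total, relative_difference)}.
    """
    mismatches = {}
    for site in sorted(set(summary) | set(detailed)):
        s = summary.get(site, 0)   # a site missing from one stream counts as 0
        d = detailed.get(site, 0)
        ref = max(s, d)
        if ref == 0:
            continue               # nothing reported by either stream
        rel_diff = abs(s - d) / ref
        if rel_diff > rel_tol:
            mismatches[site] = (s, d, rel_diff)
    return mismatches


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    summary = {"SITE_A": 100, "SITE_B": 100, "SITE_C": 10}
    detailed = {"SITE_A": 98, "SITE_B": 50}
    for site, (s, d, diff) in compare_streams(summary, detailed).items():
        print(f"{site}: summary={s} detailed={d} diff={diff:.0%}")
```

Treating a site that is absent from one stream as zero means it always shows up as a mismatch, which is usually what one wants when checking that both monitoring paths cover the same sites.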

9 Cost matrix

10 Overflow. Slowly expanding: BNL is still missing, even though the reverse proxy hardware is there. Queues: ANALY_AGLT2_SL6, ANALY_INFN-T1, ANALY_CONNECT, ANALY_IN2P3-CC, ANALY_BU_ATLAS, ANALY_MPPMU, ANALY_MWT2_SL6, ANALY_DESY-HH, ANALY_OU_OCHEP, ANALY_QMUL_SL6, ANALY_SLAC, ANALY_SFU. Can’t use data from the rest of the EU cloud.

11 Sankey overflow plots - success

12 Sankey overflow plots - failures

13 Overflow - workload

14 Overflow – workload

15 Overflow – job efficiency

16

17 Overflow – CPU efficiency

18 Reactions. Up to now only two sites have noticed the overflows:
- TRIUMF: Jedi sent a lot of jobs to almost all US cloud sites, all reading from TRIUMF. This saturated their proxy (1 Gb/s). They have since made it 2 Gb/s.
- QMUL: Chris Walker noticed 5 Gb/s+ at their NAT gateway, ~10 TB/day. Not a problem for now.
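To put the rates and volumes quoted above on a common scale, here is a small conversion sketch. It assumes decimal units (1 TB = 10^12 bytes, 1 Gb/s = 10^9 bits/s); the function names are only illustrative.

```python
SECONDS_PER_DAY = 86_400


def tb_per_day_to_gbps(tb_per_day):
    """Average rate in Gb/s that corresponds to a daily volume in TB."""
    bits_per_day = tb_per_day * 1e12 * 8
    return bits_per_day / SECONDS_PER_DAY / 1e9


def gbps_to_tb_per_day(gbps):
    """Daily volume in TB moved by a link running flat out at `gbps`."""
    bytes_per_day = gbps * 1e9 / 8 * SECONDS_PER_DAY
    return bytes_per_day / 1e12


if __name__ == "__main__":
    print(f"10 TB/day is about {tb_per_day_to_gbps(10):.2f} Gb/s on average")
    for link in (1, 2, 5):
        print(f"{link} Gb/s sustained is about {gbps_to_tb_per_day(link):.1f} TB/day")
```

Under these assumptions ~10 TB/day averages a little under 1 Gb/s, so if both QMUL figures describe the same traffic, the 5 Gb/s+ reading would be a peak rather than a sustained rate. A saturated 1 Gb/s proxy tops out at roughly 10.8 TB/day.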

19 Failover Jobs per 4 hours

