
1 INFN-T1 site report
Andrea Chierici
On behalf of INFN-T1 staff
HEPiX Spring 2014

2 Outline
Common services
Network
Farming
Storage

3 Common services

4 Cooling problem in March
A problem at the cooling system forced us to switch the whole center off
– Obviously the problem happened on a Sunday at 1am
Took almost a week to completely recover and have our center 100% back on-line
– But the LHC experiments were back on-line after 36h
We learned a lot from this (see separate presentation)

5 New dashboard

6 Example: Facility

7 Installation and configuration
CNAF is seriously evaluating a move to Puppet + Foreman as the common installation and configuration infrastructure
INFN-T1 has historically been a Quattor supporter
New manpower, a wider user base and new activities are pushing us to change
Quattor will stay around as long as needed
– at least 1 year, to allow for the migration of some critical services

8 Heartbleed
No evidence of compromised nodes
Updated OpenSSL and certificates on bastion hosts and critical services (grid nodes, Indico, wiki)
Some hosts were not exposed thanks to the older OpenSSL version installed (see the check sketched below)
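
As an illustration of why the older installs were safe: Heartbleed (CVE-2014-0160) only affects OpenSSL 1.0.1 through 1.0.1f, so a version check is a reasonable first triage step. The snippet below is a minimal sketch of such a check, not the actual procedure used at CNAF; it only inspects the OpenSSL build the local Python interpreter is linked against.

    import ssl

    def heartbleed_exposed(info=ssl.OPENSSL_VERSION_INFO):
        # OpenSSL 1.0.1 through 1.0.1f is affected; the patch letter 'f' is
        # encoded as 6 in OPENSSL_VERSION_INFO. Branches 0.9.8 and 1.0.0
        # never contained the bug, which is why older hosts were not exposed.
        major, minor, fix, patch = info[0], info[1], info[2], info[3]
        return (major, minor, fix) == (1, 0, 1) and patch <= 6

    print(ssl.OPENSSL_VERSION)
    print("potentially exposed" if heartbleed_exposed() else "not in the affected range")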

9 Grid Middleware status
EMI-3 update status
– All core services updated
– All WNs updated
– Some legacy services (mainly UIs) still at EMI-1/2, will be phased out asap

10 Network

11 WAN Connectivity
[Diagram: T1 resources connect through the Cisco 7600 and Nexus routers to GARR Bo1, reaching LHCOPN, LHCONE and General IP; peer T1s shown: RAL, PIC, TRIUMF, BNL, FNAL, TW-ASGC, NDGF, IN2P3, SARA]
40 Gb physical link (4x10Gb) shared for LHCOPN and LHCONE
10 Gb/s for General IP connectivity
Dedicated 10 Gb/s CNAF-FNAL link for CDF (Data Preservation)

12 Current connection model
[Diagram: Internet and LHCOPN/ONE reach the core (Cisco 7600, BD8810, Nexus 7018); disk servers attach at 10 Gb/s, up to 4x10 Gb/s; farming switches serve 20 worker nodes each with 2x10 Gb/s uplinks; old 2009-2010 resources use 4x1 Gb/s]
Core switches and routers are fully redundant (power, CPU, fabrics)
Every switch is connected with load sharing on different port modules
Core switches and routers have a strict SLA (next solar day) for maintenance
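
A back-of-the-envelope reading of the figures in the diagram, as a sketch only: it assumes the 4x1 Gb/s figure refers to the uplink of the older farming switches and it ignores protocol overhead and oversubscription.

    # Nominal uplink bandwidth available per worker node, from the slide above.
    def per_node_gbps(uplinks, gbps_per_uplink, worker_nodes=20):
        return uplinks * gbps_per_uplink / float(worker_nodes)

    print("current farming switches:", per_node_gbps(2, 10), "Gb/s per WN")  # 1.0
    print("2009-2010 resources     :", per_node_gbps(4, 1), "Gb/s per WN")   # 0.2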

13 Farming

14 Computing resources
150K HS06
– Reduced compared to the last workshop
– Old nodes have been phased out (2008 and 2009 tenders)
Whole farm running on SL6
– Supporting a few VOs that still require SL5 via WNoDeS

15 New CPU tender
The 2014 tender is delayed
– Funding issues
– We were running over-pledged resources
Trying to take TCO (energy consumption) into account, not only the sales price (see the sketch after this slide)
Support will cover 4 years
Trying to open it up as much as possible
– Last tender had only 2 bidders
– "Relaxed" support constraints
Would like to have a way to easily share specs, experiences and hints about other sites' procurements
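
For reference, one possible way to fold energy consumption into the comparison. This is an illustrative sketch only: every price, node count, power figure and the PUE below are made-up placeholders, not CNAF tender data; only the 4-year support period comes from the slide.

    # Illustrative TCO over the 4-year support period: purchase price plus the
    # electricity needed to run (and cool) the nodes.
    def tco_eur(purchase_eur, watts_per_node, nodes,
                years=4, pue=1.5, eur_per_kwh=0.15):
        hours = years * 365 * 24
        energy_kwh = watts_per_node * nodes / 1000.0 * hours * pue
        return purchase_eur + energy_kwh * eur_per_kwh

    # Two hypothetical offers: cheaper but power-hungry vs. pricier but efficient
    print("offer A: %.0f EUR over 4 years" % tco_eur(400000, 350, 200))
    print("offer B: %.0f EUR over 4 years" % tco_eur(430000, 280, 200))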

16 Monitoring & Accounting (1)

17 Monitoring & Accounting (2)

18 New activities (from last workshop)
Did not migrate to Grid Engine, we are sticking with LSF
– Mainly an INFN-wide decision
– Manpower
Testing Zabbix as a platform for monitoring computing resources
– More time required
Evaluating APEL as an alternative to DGAS for grid accounting: not done yet

19 New activities
Configure an oVirt cluster to manage service VMs: done
– standard libvirt mini-cluster for backup, with GPFS shared storage
Upgrade LSF to v9
Setup of a new HPC cluster (NVIDIA GPUs + Intel MIC)
Multicore task force
Implement a log analysis system (Logstash, Kibana)
Move some core grid services to an OpenStack infrastructure (the first one will be the site-BDII)
Evaluation of the Avoton CPU (see separate presentation)
Add more VOs to WNoDeS

20 Storage

21 Storage Resources
Disk space: 15 PB (net) on-line
– 4 EMC2 CX3-80 + 1 EMC2 CX4-960 (~1.4 PB) + 80 servers (2x1 Gb/s connections)
– 7 DDN S2A 9950 + 1 DDN SFA 10K + 1 DDN SFA 12K (~13.5 PB) + ~90 servers (10 Gb/s)
– Upgrade of the latest system (DDN SFA 12K) was completed in 1Q 2014
– Aggregate bandwidth: 70 GB/s
Tape library SL8500: ~16 PB on-line with 20 T10KB drives, 13 T10KC drives and 2 T10KD drives
– 7500 x 1 TB tapes, ~100 MB/s of bandwidth per drive
– 2000 x 5 TB tapes, ~200 MB/s of bandwidth per drive
– The 2000 tapes can be "re-used" with the T10KD technology at 8.5 TB per tape (see the arithmetic after this slide)
– Drives are interconnected to the library and servers via a dedicated SAN (TAN); 13 Tivoli Storage Manager HSM nodes access the shared drives
– 1 Tivoli Storage Manager (TSM) server common to all GEMSS instances
– A tender for an additional 3000 x 5 TB/8.5 TB tapes for 2014-2017 is ongoing
All storage systems and disk-servers are on SAN (4 Gb/s or 8 Gb/s)
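
Quick arithmetic on the per-tape figures above. These are nominal, uncompressed maxima and need not match the "~16 PB on line" quoted on the slide; the point is the capacity gained by re-writing the existing media with T10KD.

    # Nominal tape capacity implied by the per-tape figures on this slide.
    t10kb = 7500 * 1.0    # TB on the 1 TB tapes
    t10kc = 2000 * 5.0    # TB on the 5 TB tapes
    reused = 2000 * 8.5   # TB if the same 2000 tapes are re-written with T10KD

    now = t10kb + t10kc       # 17.5 PB raw
    later = t10kb + reused    # 24.5 PB raw after the T10KD re-use

    print("raw capacity today       : %.1f PB" % (now / 1000))
    print("after T10KD re-use       : %.1f PB" % (later / 1000))
    print("gained without new media : %.1f PB" % ((later - now) / 1000))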

22 Storage Configuration
All disk space is partitioned into ~10 GPFS clusters served by ~170 servers
– One cluster per main (LHC) experiment
– GPFS deployed on the SAN implements a fully high-availability system
– The system is scalable to tens of PBs and able to serve thousands of concurrent processes with an aggregate bandwidth of tens of GB/s
GPFS coupled with TSM offers a complete HSM solution: GEMSS
Access to storage is granted through standard interfaces (POSIX, SRM, XRootD and WebDAV)
– FS directly mounted on the WNs

23 Storage research activities
Studies on more flexible and user-friendly methods for accessing storage over the WAN
– Storage federations based on HTTP/WebDAV for ATLAS (production) and LHCb (testing); a client-side sketch follows this slide
– Evaluation of different file systems (Ceph) and storage solutions (EMC2 Isilon over OneFS)
Integration between the GEMSS storage system and XRootD to match the requirements of CMS, ATLAS, ALICE and LHCb, using ad-hoc XRootD modifications
– This is currently in production
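
To give a flavour of what HTTP/WebDAV federation access looks like from the client side, here is a minimal Python sketch. The endpoint URL, file path and proxy location are hypothetical placeholders, and using a VOMS proxy as both client certificate and key with the requests library is an assumption for illustration, not a description of the production setup.

    import requests

    url = "https://webdav.example.cnaf.infn.it/atlas/some/dataset/file.root"  # hypothetical
    proxy = "/tmp/x509up_u1000"  # proxy file used as both certificate and key

    resp = requests.get(url, cert=(proxy, proxy),
                        verify="/etc/grid-security/certificates", stream=True)
    resp.raise_for_status()
    with open("file.root", "wb") as out:
        for chunk in resp.iter_content(chunk_size=1 << 20):  # 1 MiB chunks
            out.write(chunk)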

24 LTDP
Long Term Data Preservation (LTDP) for the CDF experiment
– The FNAL-CNAF data copy mechanism is completed
Copy of the data will follow this timetable:
– end 2013 - early 2014 → all data and MC user-level n-tuples (2.1 PB)
– mid 2014 → all raw data (1.9 PB) + databases
Bandwidth of 10 Gb/s reserved on the transatlantic link CNAF ↔ FNAL
– 940 TB already at CNAF
Code preservation: the CDF legacy software release (SL6) is under test
Analysis framework: in the future, CDF services and analysis computing resources will possibly be instantiated on demand on pre-packaged VMs in a controlled environment
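
To put the reserved bandwidth in perspective, a rough copy-time estimate for the two datasets above. This is a sketch under assumed values (1 PB = 1e15 bytes, 70% average link utilisation), not an official schedule.

    def transfer_days(petabytes, link_gbps=10.0, utilisation=0.7):
        bits = petabytes * 1e15 * 8
        seconds = bits / (link_gbps * 1e9 * utilisation)
        return seconds / 86400.0

    print("user-level n-tuples (2.1 PB): %.0f days" % transfer_days(2.1))
    print("raw data (1.9 PB)           : %.0f days" % transfer_days(1.9))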

