INFN-GRID: Stato ed Organizzazione


1 INFN-GRID: Stato ed Organizzazione
Alessandro Paolini, INFN-CNAF. Meeting of the PON Avviso 1575 Projects with the INFN Grid ROC. Catania, 4 July 2008

2 Primary components of the production grid
The primary components of the Italian Production Grid are: computing and storage resources, access points to the grid, and services. Other elements are just as fundamental to the operation, management and monitoring of the grid: middleware, a monitoring tool, an accounting tool, the management and control infrastructure, and the users.

3 GRID Management Grid management is performed by the Italian Regional Operation Center (ROC). The main activities are: production and testing of the INFNGRID release; deployment of the release to the sites, support to local administrators and site certification; periodic checks of the status of resources and services; support at the Italian level; support at the European level; introduction of new Italian sites into the grid; introduction of new regional VOs into the grid.

4 The Italian Regional Operation Center (ROC)
Operations Coordination Centre (OCC): management and oversight of all operational and support activities. Regional Operations Centres (ROC): providing the core of the support infrastructure, each supporting a number of resource centres within its region. Grid Operator on Duty. Grid User Support (GGUS): at FZK, coordination and management of user support, single point of contact for users. The Italian ROC is one of the 10 existing ROCs in EGEE.

5 Central Management Team (CMT) Shifts
About 20 supporters carry out a checking activity organised in 1 shift per day, from Monday to Friday, with 2 persons per shift, during which a report is compiled: checking the grid status and reporting problems, following them until their solution where possible; certifying sites during the deployment phases; checking the tickets still open and pressing the experts or the site managers to answer and solve them.

6 Service Availability Monitoring (SAM)
SAM jobs are launched every hour and make it possible to spot submission problems, among which batch-system errors, CAs not updated, and replica errors. There are also more specific tests for SRM, SE and LFC.
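As an illustration, the hourly checking cycle can be sketched as a set of probes run against a Computing Element, each reporting OK or ERROR. The probe names, the CE fields and the version threshold below are hypothetical examples, not the actual SAM test suite:

```python
# Minimal sketch of a SAM-style probing cycle (hypothetical probe names
# and CE fields; the real SAM suite is far richer).

def run_probes(ce, probes):
    """Run each probe against a Computing Element and collect results."""
    report = {}
    for name, probe in probes.items():
        try:
            report[name] = "OK" if probe(ce) else "ERROR"
        except Exception as exc:
            report[name] = f"ERROR: {exc}"
    return report

# Example probes mirroring the failure classes named above.
probes = {
    "job-submission": lambda ce: ce["batch_system_up"],
    "ca-certificates": lambda ce: ce["ca_version"] >= 50,  # CA not updated?
    "replica-access": lambda ce: ce["replica_ok"],
}

ce = {"batch_system_up": True, "ca_version": 49, "replica_ok": True}
print(run_probes(ce, probes))
# {'job-submission': 'OK', 'ca-certificates': 'ERROR', 'replica-access': 'OK'}
```

A real deployment would schedule this hourly and open a ticket for any ERROR that persists across runs.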

7 Service Availability Monitoring (SAM)
CE Tests

8 SAM Admin SAM jobs are available for both EGEE production and preproduction sites; each site manager can submit new SAM tests on his own site; each ROC can submit new test jobs to the sites of its own region.

9 GSTAT GSTAT queries the Information System every 5 minutes
The sites and nodes checked are those registered in the GOC DB. Inconsistencies in the published information, and the absence of a service that a site should publish, are reported as errors.
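The consistency check can be illustrated with a minimal sketch: every service a site has registered in the GOC DB should also appear in the Information System, and anything missing is flagged as an error. The site names and service sets below are made-up examples:

```python
# Sketch of a GSTAT-style consistency check: each service registered in
# the GOC DB must also be published in the Information System (BDII).
# Site names and service sets are hypothetical examples.

def missing_services(registered, published):
    """Return, per site, the registered services that are not published."""
    report = {}
    for site, services in registered.items():
        missing = services - published.get(site, set())
        if missing:
            report[site] = sorted(missing)
    return report

goc_db = {"INFN-CATANIA": {"CE", "SE", "sBDII"},
          "INFN-PADOVA": {"CE", "SE"}}
bdii = {"INFN-CATANIA": {"CE", "sBDII"},
        "INFN-PADOVA": {"CE", "SE"}}

print(missing_services(goc_db, bdii))  # {'INFN-CATANIA': ['SE']}
```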

10 Introducing a new site Before entering the grid, each site has to accept several rules of behaviour, described in a Memorandum of Understanding (MoU). The COLG (Grid Local Coordinator) reads and signs it, and faxes the document to INFN-CNAF. Moreover, all sites must provide this alias: This alias will be used to report problems and it will be added to the site managers' mailing list; of course it should include all site managers of your grid site. At this point, IT-ROC registers the site and the site managers in the GOC-DB and creates a supporter-operative group in the ticketing system XOOPS. Site managers have to register in XOOPS so they can be assigned to their supporter-operative groups; each site manager also has to register in the test VOs infngrid and dteam.

11 Introducing a new site Site managers install the middleware, following the instructions distributed by the Release Team (Installation section). When finished, they run some preliminary tests ( --> Test&Cert) and then request certification by their own ROC. IT-ROC opens a ticket to communicate with the site managers during the certification.

12 Memorandum of Understanding
Every site has to: provide computing and storage resources; farm dimensions (at least 10 CPUs) and storage capacity will be agreed with each site. Guarantee sufficient manpower to manage the site: at least 2 persons. Manage the site resources efficiently: middleware installation and upgrades, patch application, and configuration changes as requested by CMT, carried out within the maximum time stated for each operation. Answer tickets within 24 hours (T2) or 48 hours (other sites), from Monday to Friday. Check their own status from time to time. Guarantee continuity of site management and support, also during holiday periods. Participate in SA1/Production-Grid phone conferences and meetings and compile the weekly pre-report. Keep the information in the GOC DB up to date. Enable the test VOs (ops, dteam and infngrid) with a higher priority than the other VOs. Any non-fulfilment noticed by the ROC will be referred to the biweekly INFNGRID phone conferences, then to the COLG, and eventually to the EB.

13 Availability & Reliability
The status of the CE, SE, SRM and sBDII services, as given by the results of the SAM tests, is taken into account. A logical AND is applied across these services, and a logical OR across services of the same type when a site has more than one instance of the same service. A site must be available at least 70% of the time per month (daily availability is measured over 24 hours). The reliability of the site must be at least 75% per month (Reliability = Availability / (Availability + Unscheduled Downtime)). Scheduled downtime periods must be declared in advance in the GOC-DB. Scheduled downtimes count against availability, but not against reliability.
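The AND/OR aggregation and the two monthly targets can be sketched as follows; the service statuses and the fractions of the month are hypothetical example numbers, not real monitoring data:

```python
# Sketch of the availability/reliability rules above; all numbers are
# hypothetical examples.

def site_ok(status):
    """Logical OR across instances of the same service type,
    logical AND across the service types (CE, SE, SRM, sBDII)."""
    return all(any(instances) for instances in status.values())

def availability(up, scheduled_down, unscheduled_down):
    """Fraction of the month the site was up; scheduled downtime
    still counts against availability."""
    return up / (up + scheduled_down + unscheduled_down)

def reliability(avail, unscheduled_down):
    """Reliability = Availability / (Availability + Unscheduled Downtime);
    scheduled downtime does not count against reliability."""
    return avail / (avail + unscheduled_down)

# A site with two SE instances: one working instance is enough (OR).
status = {"CE": [True], "SE": [False, True], "SRM": [True], "sBDII": [True]}
print(site_ok(status))                      # True

# 80% up, 10% scheduled and 10% unscheduled downtime in the month.
a = availability(0.80, 0.10, 0.10)          # ~0.80 -> meets the 70% target
r = reliability(a, 0.10)                    # ~0.89 -> meets the 75% target
print(a >= 0.70, r >= 0.75)                 # True True
```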

14 GRID Services Allow you to use the grid resources:
Resource Broker (RB) / Workload Management System (WMS): responsible for the acceptance of submitted jobs and for sending those jobs to the appropriate resources. Information System (IS): provides information about the grid resources and their status. Virtual Organization Membership Service (VOMS): database for the authentication and authorization of the users. GridICE: monitoring of resources, services and jobs. Home Location Register (HLR): database for the accounting information on the usage of resources. LCG File Catalog (LFC): file catalog. File Transfer Service (FTS): moves files in an efficient and reliable way. MonBox: collector for the local data of R-GMA.

15 General Purpose Services (I)
2 Resource Brokers, 1 Top-Level BDII, 2 VOMS servers + 1 replica each, 1 GridICE server, 1 LCG File Catalog server

16 General Purpose Services (II)
3 WMS, 2 Logging & Bookkeeping, 2 Resource Brokers, 2 Top-Level BDII, 1 MyProxy server, 1 FTS

17 Accounting using DGAS DGAS (Distributed Grid Accounting System) is fully deployed in INFNGrid (13 site HLRs + 1 second-level HLR, in testing). The site HLR is a service designed to manage a set of 'accounts' for the Computing Elements of a given computing site. For each job executed on a Computing Element (or on a local queue), the Usage Record for that job is stored in the database of the site HLR. Each site HLR can: receive Usage Records from the registered Computing Elements; answer site-manager queries such as detailed job-list queries (with many search keys: per user, VO, FQAN, CEId…) and aggregate usage reports (per hour, day, month…, with flexible search criteria); optionally forward Usage Records to the APEL database; optionally forward Usage Records to a VO-specific HLR. [Diagram: usage metering at the resource layer produces job-level info; the site HLR aggregates site information and per-VO (role/group) usage on the site and reports to the GOC.]
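The kind of queries a site HLR answers can be sketched over a toy set of Usage Records; the field names and the records below are hypothetical examples, not the actual DGAS schema:

```python
# Sketch of site-HLR style queries over job Usage Records (hypothetical
# field names; the real DGAS Usage Record carries many more fields).
from collections import defaultdict

records = [
    {"user": "alice", "vo": "atlas", "ce": "ce01", "cpu_hours": 4.0},
    {"user": "bob",   "vo": "cms",   "ce": "ce01", "cpu_hours": 2.5},
    {"user": "alice", "vo": "atlas", "ce": "ce02", "cpu_hours": 1.5},
]

def job_list(records, **keys):
    """Detailed job-list query with search keys (per user, VO, CE...)."""
    return [r for r in records if all(r[k] == v for k, v in keys.items())]

def aggregate(records, by):
    """Aggregate CPU usage grouped by a field, e.g. 'vo' or 'user'."""
    totals = defaultdict(float)
    for r in records:
        totals[r[by]] += r["cpu_hours"]
    return dict(totals)

print(job_list(records, user="alice"))   # the two records for alice
print(aggregate(records, by="vo"))       # {'atlas': 5.5, 'cms': 2.5}
```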

18 Tier1 & Tier2 HLRs 11 site Home Location Registers for the Tier1 and Tier2 sites:
host / site: hlr-t1.cr.cnaf.infn.it INFN-T1; prod-hlr-02.ct.infn.it INFN-CATANIA; prod-hlr-01.pd.infn.it INFN-PADOVA; prod-hlr-01.ba.infn.it INFN-BARI; atlashlr.lnf.infn.it INFN-FRASCATI; t2-hlr-01.lnl.infn.it INFN-LEGNARO; prod-hlr-01.mi.infn.it INFN-MILANO; t2-hlr-01.na.infn.it INFN-NAPOLI (ATLAS, PAMELA); gridhlr.pi.infn.it INFN-PISA; t2-hlr-01.roma1.infn.it INFN-ROMA1, INFN-ROMA1-CMS, INFN-ROMA1-VIRGO; grid005.to.infn.it INFN-TORINO.
2 HLRs for the small and medium sites:
HLR prod-hlr-01.ct.infn.it (INFN-CATANIA), reference for the sites of the central-southern area: ENEA-INFO, INFN-ROMA3, INFN-CAGLIARI, ITB-BARI, INFN-LECCE, SPACI-CS-IA64, INFN-LNS, SPACI-LECCE-IA64, INFN-NAPOLI-CMS, SPACI-NAPOLI, INFN-ROMA2, SPACI-NAPOLI-IA64.
HLR prod-hlr-01.pd.infn.it (INFN-PADOVA), reference for the sites of the central-northern area: CNR-ILC-PISA, INFN-GENOVA, CNR-PROD-PISA, INFN-PARMA, INAF-TRIESTE, INFN-PERUGIA, INFN-CNAF, INFN-TRIESTE, INFN-BOLOGNA, SNS-PISA, INFN-FERRARA, UNIV-PERUGIA, INFN-FIRENZE.

19 VO Dedicated Services (I)

20 VO Dedicated Services (II)

21 Experimental Services
Tests of some components released by the developers, in parallel with SA3. Application of the latest patches, as soon as they are released, to some WMS instances in production, to let the VOs test their compatibility with their tools. CreamCE: in collaboration with some sites where several instances have been installed.

22 Other Services

23 Deployment Status (I) 45 sites in total: 35 active sites
SITE STATUS CNR-ILC-PISA CERTIFIED INFN-PERUGIA CNR-PROD-PISA INFN-PISA ENEA-INFO INFN-ROMA1 ESA-ESRIN INFN-ROMA1-CMS INFN-BARI INFN-ROMA1-VIRGO INFN-BOLOGNA INFN-ROMA2 INFN-CATANIA INFN-ROMA3 INFN-CNAF INFN-T1 INFN-CNAF-LHCB INFN-TORINO INFN-FERRARA INFN-TRIESTE INFN-FRASCATI SNS-PISA INFN-GENOVA SPACI-CS-IA64 INFN-LECCE UNI-PERUGIA INFN-LNL-2 INAF-TRIESTE HW PROBLEMS INFN-LNS INFN-CAGLIARI INFN-MILANO INFN-CASCINA INFN-NAPOLI INFN-FIRENZE Supp. Unavailable INFN-NAPOLI-ATLAS ITB-BARI Cooling Maint. INFN-NAPOLI-CMS SISSA-TRIESTE TESTs ONGOING INFN-NAPOLI-PAMELA SPACI-LECCE-IA64 HW & MW PROBLEMS INFN-PADOVA SPACI-NAPOLI INFN-PARMA SPACI-NAPOLI-IA64 NEW SITE STATUS INFN-NAPOLI-ARGO CANDIDATE. 45 sites in total: 35 active sites, 2 sites undergoing certification; 32 INFN sites, 13 sites from other organisations (CNR, ENEA, ESA, INAF, SPACI, Univ. PG); 3 sites with IA64 architecture (1 active).

24 Release INFNGRID Based on gLite3
We are still in an O.S. transition phase: there are two INFNGRID releases, 3.0 for SL3 and 3.1 for SL4. Several customizations: additional VOs (~20); accounting (DGAS): new profile (HLR server) + additional packages on CE and WN; monitoring (GridICE); Quattor (collaboration with CNAF-T1); dynamic information providers for LSF: corrected configuration, new vomaxjobs (3.1/SL4 WIP); transparent support for MPICH and MPICH-2; GRelC (Grid Relational Catalog); StoRM (Storage Resource Manager); GFAL Java API & NTP. Work in progress: CreamCE; patched MyProxy (long-lived proxy delegation with VOMS extensions); AMGA Web Interface; GSAF (Grid Storage Access Framework); Secure Storage System; gLite for Windows with torque/maui support.

25 Deployment Status (II)
EGEE gLite 3.1 updates: Update 26, Update 25, Update 24, Update 23, Update 22, Update 21, Update 20, Update 19, Update 18, Update 17, Update 16, Update 15, Update 14. INFNGRID gLite 3.1 Update 22/23/24/25/26 (SL4) - 24/06/2008; INFNGRID gLite 3.1 Update 18/19/20/21 (SL4) - 28/04/2008; INFNGRID gLite 3.1 Update 17 (SL4) - 01/04/2008; INFNGRID gLite 3.1 Update 14/15/16 (SL4) - 18/03/2008

26 Deployment Status (III)
EGEE gLite 3.0 updates: Update 43, Update 42, Update 41, Update 40, Update 39, Update 38, Update 37, Update 36, Update 35. INFNGRID gLite 3.0 Update 43 (SL3) - 24/06/2008; INFNGRID gLite 3.0 Update 42 (SL3) - 28/04/2008; INFNGRID gLite 3.0 Update 41 (SL3) - 01/04/2008; INFNGRID gLite 3.0 Update 40 (SL3) - 18/03/2008; INFNGRID gLite 3.0 Update 39 (SL3) - 05/02/2008; INFNGRID gLite 3.0 Update 38 (SL3) - 25/01/2008; INFNGRID gLite 3.0 Update 37 (SL3) - 05/12/2007; INFNGRID gLite 3.0 Update 35/36 (SL3) - 29/11/2007

27 Supported VOs 49 VOs supported: 4 LHC (ALICE, ATLAS, CMS, LHCB)
3 test (DTEAM, OPS, INFNGRID), 20 regional, 1 catch-all VO: GRIDIT, 21 other VOs

28 Regional VOs
Registered users per VO: argo 25, bio 68, compassit 7, compchem 59, cyclops 13, egrid 28, enea 12, enmr.eu 14, euchina 61, euindia 51, eumed 99, gridit 132, inaf 27, infngrid 207, ingv, libi 17, lights.infn.it 16, pamela 19, planck 33, theophys 57, virgo 18. 2376 users are registered in the CDF VO.

29 Introducing a new VO When an experiment asks to enter the grid and form a new VO, a formal request is necessary, followed by some technical steps. Formal part: the needed resources and the economic contribution are agreed between the experiment and the grid Executive Board (EB); pick out the software that will be used and verify that it works; verify the possibility of support in the various INFN-GRID production sites; communicate to IT-ROC the names of the VO managers, the software managers, and the persons responsible for resources and for the support of the experiment software for the users at every site; state the software requisites, the kind of jobs, and the final storage destination (CASTOR, SE, experiment disk server).

30 Introducing a new VO Once the Executive Board (EB) has approved the experiment's request, the technical part begins: IT-ROC will create the VO on its VOMS server (if one does not already exist); IT-ROC will create the VO support group on the ticketing system; the VO manager fills in the VO identity card on the CIC portal; IT-ROC will announce the existence of the new VO and inform the sites how to enable it.

31 HLRmon

32 WMS MONITOR (I)

33 WMS MONITOR (II)

34 Useful links… Italian grid project: http://grid.infn.it/
Italian production grid: SAM: CIC Portal: GSTAT: GridICE: EGEE SA1 Failover: HLR MON: WMS MON:

