
Slide 1: The Grid technology and infrastructure in Italy, present and future. Cristina Vistoli, INFN CNAF, Bologna, Italy

Slide 2: Italian Production Grid: Computing and Storage Resources

Slide 3: Italian Grid Production Services
- Resource Brokers:
  - EGEE/LCG infrastructure open to all EGEE VOs (LHC experiments, Biomed, Magic, Planck, Compchem, ESR):
    - egee-rb-01.cnaf.infn.it
    - grid008g.cnaf.infn.it (DAG enabled)
  - ATLAS VO: egee-rb-02.cnaf.infn.it, egee-rb-05.cnaf.infn.it, egee-rb-06
  - CMS VO: egee-rb-04.cnaf.infn.it, egee-rb-06.cnaf.infn.it
- Replica Location Service for BaBar, Virgo, CDF, Planck and other Italian VOs:
  - datatag2.cnaf.infn.it
  - to be replaced by LFC: lfcserver.cnaf.infn.it
- VOMS server:
  - testbed008.cnaf.infn.it
  - VOs: infngrid, zeus, cdf, planck, compchem
- LDAP server for national VOs (bio, inaf, ingv, gridit, theophys, virgo):
  - grid-vo.cnaf.infn.it

Slide 4: Italian Grid Services
- MyProxy servers:
  - testbed013.cnaf.infn.it
- User Interfaces:
  - UIs are not core services; anyway, a list of Italian UIs is available at http://grid-it.cnaf.infn.it/index.php?userinterface&type=1
- Monitoring (GridICE servers):
  - EGEE/LCG production infrastructure: http://gridice2.cnaf.infn.it:50080/gridice/site/site.php
  - Italian production infrastructure: http://edt002.cnaf.infn.it:50080/gridice/site/site.php
  - ATLAS: http://egee005.infn.it:50080/gridice/site/site.php
  - CMS: http://egee004.cnaf.infn.it:50080/gridice/site/site.php

Slide 5: INFN-GRID middleware release
- Based on LCG
- Several VOs added (about 15)
- Authorization via VOMS + LCAS/LCMAPS
- MPI configuration
- Cert queue for site functional tests
- GridICE includes monitoring of the WNs
- AFS support on the WN
- DAG jobs support
- DGAS accounting system

Slide 6: INFN Production Grid support
The access point for users and site managers at INFN is the INFN Production Grid website: http://grid-it.cnaf.infn.it. There you can find pointers to:
- Documentation (prerequisites, middleware installation, upgrade, testing, using the Grid, current applications and VOs, etc.)
- A software repository
- Monitoring of resources and services (GridICE)
- Tools for the CMT (Central Management Team), site managers and supporters (calendar/downtime manager)
- A trouble ticketing system for problems, advisories and suggestions
- A knowledge base to complement the documentation

Slide 7: Ticketing system
- The INFN-GRID ticketing system is used:
  - by users, to ask questions or report problems;
  - by system managers, to communicate about common grid tasks (e.g. upgrading to a new grid release);
  - by the CMT, to notify system managers of a problem.
- Support groups are "helper" groups:
  - Grid Services support group (RB, RLS, VOMS, GridICE, etc.);
  - VO/Applications support groups (one for each VO: ATLAS, CMS, Alice, LHC, BaBar, CDF, Virgo, ...);
  - Site support groups (one for each site).
- Operative groups:
  - Operative Central Management Team (CMT);
  - Operative Release & Deployment Team.
- Ticket lifecycle (see the sketch below):
  - users create a ticket;
  - supporters/operatives open the ticket;
  - users and/or supporters/operatives update an open ticket;
  - supporters/operatives close the ticket.
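The lifecycle above maps naturally onto a small state machine. The following is a minimal sketch, with illustrative states and roles; it is not the actual OneOrZero (or xHelp) data model:

```python
from enum import Enum

class TicketState(Enum):
    CREATED = "created"   # a user has filed the ticket
    OPEN = "open"         # a supporter/operative has taken it
    CLOSED = "closed"     # resolved

# Allowed transitions and who may perform them (roles are illustrative).
TRANSITIONS = {
    (TicketState.CREATED, "open"): (TicketState.OPEN, {"supporter", "operative"}),
    (TicketState.OPEN, "close"): (TicketState.CLOSED, {"supporter", "operative"}),
}

class Ticket:
    def __init__(self, author: str, text: str):
        self.state = TicketState.CREATED
        self.history = [(author, text)]

    def update(self, who: str, note: str):
        # Users and supporters may append notes while the ticket is not closed.
        if self.state is TicketState.CLOSED:
            raise ValueError("cannot update a closed ticket")
        self.history.append((who, note))

    def transition(self, action: str, role: str):
        key = (self.state, action)
        if key not in TRANSITIONS:
            raise ValueError(f"cannot {action} from state {self.state.name}")
        new_state, allowed_roles = TRANSITIONS[key]
        if role not in allowed_roles:
            raise PermissionError(f"role {role!r} may not {action}")
        self.state = new_state

# Usage: a user files a ticket; a supporter opens, updates and closes it.
t = Ticket("user", "RB egee-rb-01 rejects my proxy")
t.transition("open", "supporter")
t.update("supporter", "proxy expired; please renew it")
t.transition("close", "supporter")
```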

Slide 8: Operations activities
- Operation of the grid infrastructure:
  - Shifts (Monday to Friday, 08:00-20:00): about 20 people from different sites monitor and test the status of the grid resources (STF, others), certify the Italian sites/resources and do VO and user support
  - Management of the grid services (general purpose and VO specific)
  - Participation in CIC-on-Duty shifts
- Support (operations, user and VO):
  - The regional ticketing system, based on OneOrZero, is going to be replaced by a new system (xHelp + Xoops) interfaced to GGUS
  - Support groups are defined and operational
  - Participation in Support-on-Duty shifts
- Release and installation:
  - The current INFN-GRID release is based on LCG. INFN-GRID 2.4.0 contains DAG, DGAS, VOMS, support for the AFS client and GridICE monitoring for the WNs; it is now in the deployment phase on the Italian grid infrastructure
  - Installation and configuration scripts (YAIM) are provided
- Grid Install working group -> Fabric and software management WG

Slide 9: Certification activity: TEST ZONE
- The Central Management Team is responsible for resource centre certification: checking the functionality of a site before joining it to the production grid.
- Although all certification jobs are VO independent, the INFNGRID VO is used to run them.
- In particular, the following are checked (a check-runner sketch follows below):
  - consistency of the GIIS information;
  - local job submission (LRMS);
  - grid submission with Globus (globus-job-run);
  - grid submission with the Resource Broker;
  - Replica Manager functionality;
  - MPI functionality.
- To certify a site, the CMT uses dedicated grid services:
  - RB & BDII: gridit-cert-rb.cnaf.infn.it
- In this way an uncertified site is never exposed through the production grid services.
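The checks above lend themselves to a simple check runner. The sketch below is illustrative only: the CE hostname is hypothetical, and just two of the listed checks are shown (a GIIS LDAP query on the conventional MDS port, and a direct globus-job-run submission); the exact commands and options used by the CMT would come from the certification procedures themselves:

```python
import subprocess

CE = "ce.example.infn.it"  # hypothetical CE under certification

def run(cmd: list[str]) -> bool:
    """Run a command and report success/failure by its exit code."""
    return subprocess.run(cmd, capture_output=True).returncode == 0

CHECKS = {
    # GIIS information consistency: query the site GIIS via LDAP
    # (port 2135 and the base DN follow the Globus MDS convention).
    "giis": lambda: run(["ldapsearch", "-x", "-H", f"ldap://{CE}:2135",
                         "-b", "mds-vo-name=local,o=grid"]),
    # Direct Globus submission of a trivial job.
    "globus": lambda: run(["globus-job-run", CE, "/bin/hostname"]),
}

def certify(site: str) -> bool:
    results = {name: check() for name, check in CHECKS.items()}
    for name, ok in results.items():
        print(f"{site}: {name}: {'OK' if ok else 'FAIL'}")
    return all(results.values())

if __name__ == "__main__":
    certify("INFN-EXAMPLE")
```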

Slide 10: Towards gLite: pre-production and certification
- Pre-production service:
  - 3 sites (CNAF, Padova, Bari) ready
  - CNAF is the pilot site:
    - LCG release: 1 CE, 1 SE, 3 WNs available
    - gLite release: 1 CE, 3 WNs, 1 I/O service, 1 VOMS server, 1 WMS node (formerly RB), 1 double-stack WN, 1 R-GMA server and 1 UI available soon
  - Migration strategies will be tested
  - Stress-test middleware components, in particular the WMS
  - Open to the experiments' applications to verify the middleware
- Certification infrastructure:
  - 4 sites (CNAF, Padova, Roma and Torino)
  - Goal: certify the INFN-GRID release, participating in the certification of the gLite release and adding new, specific middleware components to gLite: StoRM, G-PBox
  - Produce and verify the installation procedures

Slide 11: Towards gLite: migration
- gLite release 1 in the pre-production service, to verify:
  - functionality
  - stability
  - performance
- Pre-production open to experiments
- INFN-GRID/LCG as the fall-back solution
- Production infrastructure migrated to gLite step by step:
  - first, migration from the RB to the gLite WMS;
  - then the CEs/WNs:
    - major sites deploy gLite CEs in parallel with the LCG-2 CEs;
    - some of the smaller sites convert fully to gLite.

Slide 12: Current activities
- Production service quality and stability improvements:
  - Improve monitoring systems: new GridICE release with reactive alarms and notifications
  - Site isolation: a simple mechanism to remove sites is needed
  - Improve the grid management and monitoring structure
- INFN-GRID support infrastructure:
  - VO support (experiments)
  - Site administrators
  - Grid services support
- Accounting and policy system management

Slide 13: Resource center status: Feb. 2005

Slide 14: Resource center status: Mar. 2005

Slide 15: GridAT (Grid Application Test)
The main goal of GridAT is to provide a general and flexible framework for VO application tests on a grid system. It allows a grid site to be tested from the VO's point of view. Results are stored in a central database and can be browsed on a web page, so it will also be used for the certification and test activity (a sketch of the flow follows below).
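As a rough illustration of that flow, here is a minimal sketch in which a VO-defined test runs against a site and its outcome is stored centrally; SQLite stands in for GridAT's central database, and all names and fields are hypothetical:

```python
import sqlite3, time

# Central results store; SQLite stands in for the central database.
db = sqlite3.connect("gridat.db")
db.execute("""CREATE TABLE IF NOT EXISTS results
              (site TEXT, vo TEXT, test TEXT, ok INTEGER, ts REAL)""")

def record(site: str, vo: str, test: str, ok: bool) -> None:
    """Store one test outcome, browsable later per site/VO/test."""
    db.execute("INSERT INTO results VALUES (?,?,?,?,?)",
               (site, vo, test, int(ok), time.time()))
    db.commit()

# A VO-defined test is just a callable returning success/failure.
def vo_software_present() -> bool:
    # e.g. check that the VO's software area is mounted and populated
    return True  # placeholder

record("INFN-CNAF", "cms", "software-area", vo_software_present())
```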

Slide 16: Resources used over a time interval
Default parameters shown: resource occupancy in the last week, for all sites and all VOs. You can select what is shown by changing the parameter criteria: site, virtual organization, time interval, values shown on the bars on/off, and picture size. The legend shows the total resource occupancy (hours).

Slide 17: Number of jobs waiting and jobs running vs. time
Default parameters shown: number of jobs in the last day, for all sites and all VOs. The parameter criteria are: site, virtual organization, time interval, values shown on/off, picture size, and legend on/off. The legend shows the resource occupancy, calculated by integrating the number of running jobs over the time interval (see the sketch below).
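The occupancy figure in the legend can be obtained by integrating the running-jobs curve over the interval. A minimal sketch, with made-up samples and a step-function assumption between samples:

```python
# Resource occupancy as the integral of the running-job count over time.
# samples: (timestamp_hours, running_jobs) pairs, as a monitoring system
# might record them; the values here are made up.
samples = [(0.0, 10), (1.0, 12), (2.5, 8), (4.0, 8), (6.0, 0)]

def occupancy_hours(samples: list[tuple[float, float]]) -> float:
    """Step-wise integration: each sample's job count is assumed to
    hold until the next sample (a right-open step function)."""
    total = 0.0
    for (t0, jobs), (t1, _) in zip(samples, samples[1:]):
        total += jobs * (t1 - t0)
    return total

print(occupancy_hours(samples), "job-hours")  # 10*1 + 12*1.5 + 8*1.5 + 8*2 = 56
```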

Slide 18: Accounting plots: number of jobs, used memory (average job size), CPU hours, wall-time hours

Slide 19: Completed jobs in the last three weeks, per site

Slide 20: Site-specific data: CPU hours in the last three weeks, per VO

Slide 21: Useful links
- INFN Production Grid: http://grid-it.cnaf.infn.it/
- INFN GridICE: http://grid-it.cnaf.infn.it/index.php?grisview&type=1
- INFN test and certification: http://grid-it.cnaf.infn.it/index.php?sitetest&type=1
- INFN support: http://grid-it.cnaf.infn.it/index.php?id=51&type=1
- Contacts:
  - grid-manager@infn.it
  - grid-release@infn.it
  - tickets for operational issues

Slide 22: Workload Manager

Slide 23: Workload Manager. Job submission and cancellation requests are expressed in JDL (see the example below).
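For concreteness, here is what a small JDL description looks like; the attribute names follow the JDL/Glue conventions, while the values and the surrounding request wrapper are illustrative, not the WM's real wire format:

```python
# A minimal JDL (Job Description Language) document, in the ClassAd-based
# syntax used by the EDG/LCG workload management tools.
jdl = """
[
  Executable    = "/bin/hostname";
  Arguments     = "-f";
  StdOutput     = "std.out";
  StdError      = "std.err";
  OutputSandbox = { "std.out", "std.err" };
  Requirements  = other.GlueCEPolicyMaxCPUTime > 60;
  Rank          = -other.GlueCEStateEstimatedResponseTime;
]
"""

from dataclasses import dataclass

@dataclass
class SubmitRequest:
    """Illustrative pairing of a JDL with the submitting identity."""
    user_dn: str   # distinguished name from the user's proxy certificate
    jdl: str

req = SubmitRequest(user_dn="/C=IT/O=INFN/CN=Some User", jdl=jdl)
```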

Slide 24: Workload Manager. Execution requests are held in a queue if no resources are available.

Slide 25: Workload Manager. A repository of resource information is available to the matchmaker, updated via notifications and/or active polling of the sources.

Slide 26: Workload Manager. Matchmaking discovers the best CE using the resource status and the user's preferences (a sketch follows below).
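A minimal sketch of the matchmaking step: the requirements filter the candidate CEs, and a rank expression orders the survivors. The CE records and predicates below are made up; the attribute names only echo the Glue schema:

```python
ces = [
    {"id": "ce01.example.it", "MaxCPUTime": 2880, "WaitingJobs": 12},
    {"id": "ce02.example.it", "MaxCPUTime": 720,  "WaitingJobs": 0},
    {"id": "ce03.example.it", "MaxCPUTime": 4320, "WaitingJobs": 3},
]

# Requirements: a boolean predicate over a CE (the JDL Requirements
# expression plays this role in the real system).
requirements = lambda ce: ce["MaxCPUTime"] >= 1440

# Rank: higher is better; here "fewest waiting jobs", mirroring the
# shorter-queue-first spirit of the default LCG rank expression.
rank = lambda ce: -ce["WaitingJobs"]

def match(ces, requirements, rank):
    candidates = [ce for ce in ces if requirements(ce)]
    if not candidates:
        return None  # the request stays queued until resources appear
    return max(candidates, key=rank)

print(match(ces, requirements, rank)["id"])  # -> ce03.example.it
```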

Slide 27: State of the art a short while ago
- Globus GRAM: the de-facto standard for resource access
- Various job submission systems built on top of it (e.g. Condor-G)
- Used by most Grid projects worldwide (e.g. LCG)
- Allowed interoperability among different Grid systems

Slide 28: Resource access in EGEE
- Italy and INFN have to address the resource access problem in the context of the EGEE project
- Goals:
  - Provide a simple Computing Element that demonstrably allows efficient remote job submission and job control
    - to be used by the "Broker" (Workload Manager) or by a generic client (e.g. an end user)
    - possibly addressing open problems
  - Stick to emerging standards:
    - service-oriented architecture
  - Facilitate the integration of other important software components already implemented or being implemented

Slide 29: Integration of other software components
- Having "control" of the Computing Element makes it easier to integrate other software components to be deployed on the CE
- Relevant examples:
  - Grid accounting: DGAS sensors for resource metering
  - Policy framework: integration of G-PBox
    - for setting site policies
    - for policy evaluation given a submission request
  - Resource monitoring: GridICE
  - Resource reservation and co-allocation (?)

Slide 30: CE architecture
[Diagram: a Web-service client in front of a CE hosting CREAM and CEMon, with an LRMS (LSF, PBS, ...) and its Worker Nodes behind.]
The CE service architecture is composed of two Web services: the Computing Resource Execution And Management (CREAM) service and the CE Monitor (CEMon) service.

Slide 31: CE architecture: interactions
[Diagram: the client exchanging messages with the CE.]
Two flows are shown: asynchronous notifications about job/CE events, and job requests (for a CE working in pull mode).

Slide 32: CEMon interface
[Diagram: a Web-service client talking to CEMon on the CE.]
A client can get CE information synchronously or subscribe for asynchronous notifications. The operations are:
- getTopics()
- getEvent(Topic)
- getInfo()
- subscribe(Subscription)
- unsubscribe(SubscriptionRef)
- pauseSubscription(SubscriptionRef)
- resumeSubscription(SubscriptionRef)
- ping()
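To illustrate how the two interaction styles fit together, here is a sketch of a client driving these operations. CEMonStub is a hypothetical stand-in for a generated SOAP client; only the operation names come from the slide, everything else (URLs, topic names, subscription fields) is assumed:

```python
class CEMonStub:
    """Hypothetical client stub exposing the CEMon operations."""
    def __init__(self, url: str): self.url = url
    def getTopics(self): return ["CE_MONITOR"]           # illustrative
    def getEvent(self, topic): return {"topic": topic}   # illustrative
    def subscribe(self, subscription): return "sub-1"    # illustrative ref
    def unsubscribe(self, ref): pass

cemon = CEMonStub("https://ce.example.it:8443/ce-monitor")

# Synchronous use: periodically poll for the CE information on a topic.
for topic in cemon.getTopics():
    print(cemon.getEvent(topic))

# Asynchronous use: subscribe, let CEMon push notifications to a
# listener URL at the given rate, and unsubscribe when done.
ref = cemon.subscribe({"consumer": "https://client.example.it:9000/",
                       "topic": "CE_MONITOR", "rate": 60})
cemon.unsubscribe(ref)
```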

Slide 33: CEMon
- CEMon is part of gLite release 1
- Integrated with the gLite WMS ISM (Information SuperMarket); see the next slide

Slide 34: CEMon: the Information SuperMarket. The ISM is the repository of resource information available to the matchmaker, updated via notifications and/or active polling of the sources.

Slide 35: CEMon (continued)
- Provides Glue-schema-compliant information about the CE:
  - for CondorC-based CEs (gLite release 1 CEs)
  - for Globus-based CEs (LCG-2-like CEs), still supported by the gLite WMS
- Synchronous CEMon:
  - a client can periodically poll CEMon to get the CE information
- Asynchronous CEMon:
  - static subscriptions, filled into a configuration file by the CE admin:
    - address of the client to be notified
    - what to notify
    - rate of notification sending
    - a "condition" (regular expression): the notification is sent to the client only if the condition is true
      - used to support the pull mode (a sketch follows below)
  - ... and/or the client can send subscription requests to CEMon
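A sketch of the asynchronous-notification decision described above: a static subscription with a regular-expression condition, checked against an event before notifying. All field names, and the Glue attribute used in the condition, are assumptions:

```python
import re

subscription = {
    "consumer": "https://wms.example.it:9000/",
    "rate_seconds": 60,
    # e.g. only notify the consumer when the CE advertises free slots,
    # which is how a pull-mode consumer could be woken up
    "condition": r"GlueCEStateFreeCPUs\s*=\s*[1-9][0-9]*",
}

def should_notify(event_text: str, sub: dict) -> bool:
    """Send the notification only if the condition matches the event."""
    return re.search(sub["condition"], event_text) is not None

print(should_notify("GlueCEStateFreeCPUs = 0", subscription))   # False
print(should_notify("GlueCEStateFreeCPUs = 12", subscription))  # True
```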

Slide 36: CREAM interface
[Diagram: a client, either the Broker (WM) or an end user, talking to the CREAM Web service on the CE.]
CREAM is a Web service accepting job management requests:
- jobSubmit
- jobAssess
- jobCancel
- jobSuspend
- jobResume
- jobStatus
- jobSignal
- getVersion
- ping
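A sketch of a job lifecycle against this interface; CreamStub is a hypothetical stand-in for a generated SOAP client, and only the operation names come from the slide:

```python
class CreamStub:
    """Hypothetical client stub exposing a few CREAM operations."""
    def __init__(self, url: str): self.url = url
    def jobSubmit(self, jdl: str) -> str: return "job-42"      # illustrative id
    def jobStatus(self, job_id: str) -> str: return "RUNNING"  # illustrative
    def jobCancel(self, job_id: str) -> None: pass

cream = CreamStub("https://ce.example.it:8443/cream")

# Submit a job described in JDL, watch its status, then cancel it.
job_id = cream.jobSubmit('[ Executable = "/bin/hostname"; ]')
print(cream.jobStatus(job_id))
cream.jobCancel(job_id)
```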

Slide 37: CREAM and BLAHP
[Diagram: CREAM forwarding the job management commands through BLAHP to the LRMS (LSF, PBS, ...) and its Worker Nodes.]
BLAHP is a lightweight component accepting commands to manage jobs on different LRMSs.

Slide 38: CREAM status
- What is available now:
  - Job submission (a wrapper sketch follows below):
    - a job wrapper is created;
    - the job wrapper is submitted through BLAHP for execution on the underlying batch system (PBS/LSF) on the specified queue;
      - we would like to rely on a daemon version of BLAHP; is this supported?
    - sandbox transfers use GridFTP, so the sandbox must first have been staged to a GridFTP server.
  - Job removal, relying on BLAHP as well.
  - Client command-line tools (e.g. glite-ce-submit), with CEId: : /cream- - ; the request is sent to the CREAM service on the remote CE.
- Code under test and debugging.
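To make the job-wrapper idea concrete, here is a sketch of a wrapper generator: stage the input sandbox in from a GridFTP server with globus-url-copy, run the job, stage the outputs back. The paths, URLs and wrapper contents are illustrative, not CREAM's actual wrapper:

```python
def make_wrapper(executable: str, sandbox_url: str, outputs: list[str]) -> str:
    """Build a shell wrapper around the user executable."""
    stage_out = "\n".join(
        f"globus-url-copy file://$PWD/{f} {sandbox_url}/{f}" for f in outputs)
    return f"""#!/bin/sh
# stage in the input sandbox (previously uploaded by the client)
globus-url-copy {sandbox_url}/input.tar file://$PWD/input.tar
tar xf input.tar
# run the user job
{executable} > std.out 2> std.err
# stage out the output sandbox
{stage_out}
"""

wrapper = make_wrapper("/bin/hostname",
                       "gsiftp://se.example.it/sandbox/job-42",
                       ["std.out", "std.err"])
print(wrapper)
# This script would then be handed to BLAHP for submission to PBS/LSF.
```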

Slide 39: CREAM: next steps
- Implement submission of the other job types: MPI, interactive, checkpointable
- Support submission of clusters of jobs:
  - a requirement coming from many communities
  - start thinking about how to deal with it
- Support the implementation of heterogeneous CEs:
  - a CE encompassing Worker Nodes that are heterogeneous in hardware, software and enforced policies
    - the underlying resource management system must be instructed so that the job is dispatched to a WN matching the specified requirements
  - requiring all WNs "belonging" to the same CE to be homogeneous (the EDG/LCG model) was considered a huge pain by sysadmins:
    - many different homogeneous CEs -> many different batch queues to set up
- Allow "interactive access" to the jobs (e.g. for debugging purposes)

Slide 40: CE standardization
- The issue and the goal:
  - Globus GRAM was the de-facto standard for CE resource access, which favoured interoperability with other Grids (e.g. the EDG WMS was able to submit to US resources)
  - This is no longer the case (Globus GRAM, NorduGrid CE, Unicore CE, CondorC-based CE, ...)
    - competition usually helps to produce good artifacts, but this is a problem for interoperability
  - Standardization is needed:
    - a standard job description language
    - a standard interface

Slide 41: DataGrid Accounting System (DGAS)
The purpose of the DataGrid Accounting System (DGAS) is to implement resource usage metering, accounting and billing in a fully distributed Grid environment. It is conceived to be distributed, secure and extensible. The system is designed so that usage metering, accounting and billing are independent layers.

Slide 42: Usage metering and accounting
The usage of Grid resources by Grid users is registered in appropriate servers, known as HLRs (Home Location Registers), where both users and resources are registered. To achieve scalability there can be many independent HLRs; at least one HLR per VO is foreseen, although a finer granularity is possible. Each HLR keeps a record of every grid job executed by each of its registered users or resources, and can therefore provide usage information at many levels of granularity: per user or resource, per group of users or resources, or per VO (see the sketch below).
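A sketch of the HLR bookkeeping described above: one usage record per grid job, aggregated at several granularities. The record fields are illustrative, not DGAS's actual schema:

```python
from collections import defaultdict

# One usage record per grid job, as an HLR might keep them.
records = [
    {"user": "alice1", "vo": "alice", "resource": "ce01.infn.it", "cpu_h": 4.0},
    {"user": "alice2", "vo": "alice", "resource": "ce02.infn.it", "cpu_h": 1.5},
    {"user": "cms1",   "vo": "cms",   "resource": "ce01.infn.it", "cpu_h": 7.0},
]

def usage_by(key: str) -> dict[str, float]:
    """Aggregate CPU hours per user, per VO or per resource."""
    totals: dict[str, float] = defaultdict(float)
    for r in records:
        totals[r[key]] += r["cpu_h"]
    return dict(totals)

print(usage_by("vo"))        # {'alice': 5.5, 'cms': 7.0}
print(usage_by("resource"))  # {'ce01.infn.it': 11.0, 'ce02.infn.it': 1.5}
```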

Slide 43: Gianduia in gLite: description
- GIANDUIA (Gianduia Is a Nice Distributed Utility Infrastructure for Accounting) collects information about a Grid job from the WMS JobWrapper and makes it available on the CE node (or on the LRMS head node, if different from the CE) to other applications that need to retrieve the job's usage record.
- The Grid-related job information is integrated with the information the LRMS has about the same job, so a complete view of the job is available in a single place, without having to parse many different logs to monitor a job.
- This information is stored on the CE filesystem and can then be used by producers for different accounting or monitoring systems.
- The available information includes, for example: LRMSJobID, GridJobID, UserCertSubject, userProxyCertificate, GridCEId, CPU time, wall-clock time, memory (physical and virtual), number of processors assigned to the job, and all the other records available to the LRMS.
- The log format is similar to the native LRMS one, so it is easy to adapt existing software to use Gianduia (a parsing sketch follows below).
- Gianduia currently works with PBS, LSF and MAUI/Torque.
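Since the log format is said to be similar to the native LRMS one, a consumer can be a simple key=value parser. The line below merely imitates PBS's accounting style; Gianduia's actual field names are not reproduced here:

```python
# An illustrative Gianduia-style record joining the grid view and the
# LRMS view of the same job in one line.
line = ("lrms_id=1234.ce01 grid_id=https://rb.example.it:9000/job42 "
        "user_dn=/C=IT/O=INFN/CN=Some_User cpu_h=3.2 wall_h=4.0 procs=2")

def parse(line: str) -> dict[str, str]:
    """Split whitespace-separated key=value fields into a record
    (maxsplit=1 keeps '=' characters inside values, e.g. in the DN)."""
    return dict(field.split("=", 1) for field in line.split())

rec = parse(line)
print(rec["grid_id"], rec["cpu_h"])  # grid and LRMS data in one record
```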

Slide 44: Metering infrastructure: GIANDUIA

Slide 45: StoRM: Storage Resource Manager
- Disk storage with an SRM 2.1 interface on top of a parallel file system with a POSIX-like interface
- Storage reservation
- VOMS role aware
- G-PBox policy aware
- Implementation and deployment ongoing

Slide 46: Policy management: G-PBox
- Policy management for computing and storage resources in a Grid
- The CE is policy aware
- The WMS is policy aware
- A prototype is available

Slide 47: VOMS
- VO Membership Service:
  - support for VO group and role management
  - based on attribute certificates (see the FQAN sketch below)
- A DataTAG & DataGrid activity, with the admin interface from CERN
- In the INFN production Grid system since last year
- Included in gLite RC1
- Distributed in VDT and used by Grid3
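VOMS expresses group membership and roles as FQANs (Fully Qualified Attribute Names) such as /vo/group/Role=role, carried in the attribute certificate. A minimal parsing sketch, with illustrative FQAN strings:

```python
def parse_fqan(fqan: str) -> dict:
    """Split an FQAN into VO, group path and role."""
    parts = fqan.strip("/").split("/")
    role = None
    groups = []
    for p in parts:
        if p.startswith("Role="):
            role = p.split("=", 1)[1]
        elif not p.startswith("Capability="):
            groups.append(p)
    return {"vo": groups[0], "group": "/" + "/".join(groups), "role": role}

print(parse_fqan("/cms/Role=NULL/Capability=NULL"))
print(parse_fqan("/cms/production/Role=manager"))
# -> {'vo': 'cms', 'group': '/cms/production', 'role': 'manager'}
```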

