
1 The INFN Grid Project: Successful Grid Experiences at Catania
Roberto Barbera, University of Catania and INFN
Workshop CCR, Rimini, 08.05.2007
http://grid.infn.it

2 Outline
Grid @ Catania
– Network connection
– Catania in the Grid infrastructures
– Production site (ALICE Tier-2)
– GILDA
– TriGrid VL
– PI2S2
Management of the site resources
– Goals
– Configuration & policies
– Monitoring
– Usage statistics
Summary & Conclusions

3 Catania Network Connection (2/2)
[Network connection diagram]

4 Usage of the dedicated 1 Gb/s link
[Bandwidth usage plots, 2006 and 2007]
Half of the total available bandwidth exploited
Rather continuous usage (mostly by ALICE)

5 Catania in the EGEE Grid Map

6 Catania in other Grid Infrastructures
[Maps showing Catania in other Grid infrastructures; sites include INFN (Catania, CNAF, Roma3), Poland, GRNet, CNIC, IHEP, MA-GRID, ULAKBIM, CYNET, IUCC, GRNET, Univ. of Tunis, UoM, ERI]

7 The Sicilian e-Infrastructure
~15.000.000 € in 3 years!
~350 FTEs! (2/3 new employees)
More than 2000 CPUs
More than 300 TBytes

8 Catania in the TriGrid VL Grid Map

9 Catania in the GILDA t-Infrastructure

10 Catania Computing Room (1/2)
[3D model of the Catania data center]
Full area: ~200 m²
Area #1: 10 racks, 40 kW UPS/PDU
Area #2: 13 racks, 80 kW UPS/PDU, 80 kW air conditioning with ~100 kW external chiller

11 Catania Computing Room (2/2)
[Photos of Area #1 and Area #2, showing the security system and the fire extinguisher system]

12 INFNGRID Catania Computing and Storage
Computing:
– ~270 cores
– ~280 GB of memory
– LSF 6.1 as LRMS
Storage:
– ~75 TB of raw disk storage (FC-to-SATA)
– DPM over GPFS
This includes the recent delivery of 40 cores and 50 TB for ALICE.

13 GILDA Computing and Storage
Computing:
– 60+ cores
– 96 GB of memory
– Torque+MAUI as LRMS
Storage:
– ~3 TB of raw disk storage
– DPM over GPFS

14 TriGrid VL Computing and Storage (Catania site only)
Computing:
– 96 AMD Opteron 280 cores
– 96 GB of memory
– LSF 6.1 HPC as LRMS
Storage:
– 25 TB of raw disk storage (FC-to-SATA)
– DPM over GPFS
A new tender for ~260 cores and ~55 TB (for all TriGrid VL sites) is expected to start by the end of May.

15 PI2S2 Project Computing and Storage (Catania site only)
Computing:
– 2 IBM BladeCenter H enclosures with 19 IBM LS21 "blades"
– 76 AMD Opteron 2218 rev. F cores, 152 GB of RAM (2 GB/core)
– ~48.8 mW/SpecInt2000 at full load!
– Gigabit Ethernet service network
– Cisco Topspin InfiniBand 4X additional low-latency network for HPC applications
– LSF 6.1 HPC included!
Storage:
– 1 IBM DS4200 storage system (FC-to-SATA technology)
– 25 TB (raw) of storage, expandable up to ~100 TB
– GPFS distributed/parallel file system included!
A new tender for ~1500 cores and ~130 TB (for all PI2S2 sites) is expected to start by the end of May.

16 "Owners", "Users", and "Goals"
Owners of Catania resources:
– INFN (ALICE, CMS, GILDA, Theory Group, TriGrid VL Project)
– Consorzio COMETA (PI2S2 Project)
31 Virtual Organizations authorized at Catania!
– alice, atlas, babar, bio, biomed, cdf, cms, cometa, compchem, dteam, edteam, eela, egrid, enea, esr, euchina, euindia, eumed, gilda, gridit, inaf, infngrid, ingv, lhcb, magic, planck, theophys, trigrid, virgo, zeus
– plus ~20 local users
Goals:
– Give access to everybody in order to maximize the usage of the site
– Let "owners" use "their" resources with zero wait time

17 Wisdom Rules (by direct experience)
1. Be "open" to new technologies (usually they are not the problem; they are rather the solution).
2. Centralize system administration: create a core team of experts from the beginning and carry on with it. Changes can induce disruption.
3. Share experience and knowledge: no single super-expert should be the only one who knows crucial elements of site administration.
4. Foster bi-directional know-how transfer to ensure long-term sustainability: from permanent staff to temporary younger personnel and vice versa.
5. Be wise and far-sighted in your choices: don't adopt things (especially an LRMS) simply because you know them, but because they ensure the greatest possible adaptability, configurability and scalability (see item 1).
6. "Remotize" the control/monitoring of your site as much as possible, so that it can be managed by a smaller number of people.

18 The LRMS choice at Catania: LSF
Schedules the processing of jobs based on rules, policies and priorities (e.g., holding a job until a certain license is available, queue preemption, fair-share, event-driven, calendar-driven, parallel, etc.).
Manages, monitors and balances all jobs until completion (e.g., restarting and requeuing jobs if a system fails or resources become unavailable).
Notifies users on completion and logs statistics (for accounting & reporting).
All this is done transparently to the user!
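These capabilities are easiest to see from the submission side. A minimal sketch, not from the original slides (queue, script and output file names are hypothetical):

    # Submit a job requesting 4 slots on Linux x86 hosts; LSF holds,
    # schedules, dispatches and, if needed, requeues it automatically.
    bsub -q normal -n 4 -R "type==LINUX86" -o job.%J.out ./my_analysis.sh

    # Follow the job through PEND -> RUN -> DONE; LSF notifies the user
    # and logs accounting data on completion.
    bjobs -l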

19 LSF and the Catania "goals"
Centralization of management as a commodity:
– LSF administrator users can perform every administration task
Fault tolerance
Rationalization of the coexistence of different groups of machines belonging to different owners:
– Grid users, with jobs submitted through gLite
– Local users, with jobs submitted through LSF from front-end nodes
Maximize the use of every CPU around the clock, with as few empty slots as possible
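The grid/local split maps naturally onto LSF user groups in the lsb.users file. A sketch, not the actual Catania configuration; the group names follow the USERS fields shown in the queue definitions later in the talk, while the individual account names are hypothetical:

    Begin UserGroup
    GROUP_NAME    GROUP_MEMBER
    alice_pool    (alice001 alice002 alice003)   # gLite pool accounts (hypothetical names)
    lhcb_pool     (lhcb001 lhcb002)              # gLite pool accounts (hypothetical names)
    cms_group     (rossi bianchi)                # local CMS users (hypothetical names)
    End UserGroup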

20 Partitioning the Catania INFNGRID Farm
Group nodes:
– ALICE Production (alifarm11-67)
– CMS Group (cmsfarm01-09)
– GR1 Group (gr1farm1-4)
– TheoPhys Local Group (dz01-06, dt05-16, ar01-12)
From the lsb.hosts file:

    Begin Host
    default   ! () () () () ()        # Default limits
    dz01      1 3.5/5 18 15 () ()     # theogroup frontend limits
    End Host
    ...
    Begin HostGroup
    all_nodes     ( alifarm10 alifarm11 ... alifarm67 )
    gr1farm       ( gr1farm2 gr1farm3 gr1farm4 )
    cmsfarm       ( cmsfarm1 cmsfarm2 ... cmsfarm9 )
    theofarm      ( dz01 ... dt16 )
    theo4gridfarm ( dz02 ar07 ar08 ar09 dt05 )
    ...
    End HostGroup
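After editing lsb.hosts, the partitioning can be activated and inspected with standard LSF commands (a usage sketch, not from the original slides):

    badmin reconfig       # re-reads the lsb.* files without draining running jobs
    bmgroup               # lists the host groups and their members
    bhosts all_nodes      # per-host status and job slots for one host group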

21 Addressing scalability and fault tolerance
Centralized management with LSF:
– Batch admin commands are performed by LSF admin users
– Commands can be run from every host transparently
– Immediate activation after a reconfig action
– Unique shared configuration directory for the config files
Fault tolerance:
– The NFS server host for the shared LSF binaries and config files does not belong to the cluster (external)
– The master list contains 3 hosts for elections in case of a fault:
    LSF_MASTER_LIST="grid012 alifarm14 alifarm15"
– License fault tolerance thanks to the redundant FlexLM hosts at CNAF:
    LSF_LICENSE_FILE="1700@mastercr.cnaf.infn.it:1700@bastion1.cnaf.infn.it:1700@bastion2.cnaf.infn.it"
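The failover setup can be checked at any time with standard LSF commands (a quick sketch, not from the original slides):

    lsid          # prints the cluster name and the host currently elected as master
    lsclusters    # summary view: master host, number of server hosts, status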

22 Addressing optimization and "ownership" issues
Rationalization and job optimization are implemented with LSF scheduling policies:
– Hosts are defined in groups, in order to map them to their "owners"
– Users are defined in groups with distinct submission policies
– Queues are defined for the different kinds of Grid jobs and user groups, with a direct relation to the ownership of the hosts
– The concept of "owned" and "guest" queues is implemented

23 Scheduling Policies
FCFS
– Jobs are dispatched on a first-come-first-served basis
FAIRSHARE
– Jobs are dispatched on a fair-share basis
PREEMPTIVE/PREEMPTABLE
– Jobs may stop, or be stopped by, jobs in lower/higher priority queues
SLA
– Service Level Agreements
EXCLUSIVE
– The job has exclusive use of an execution host
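For reference, a fair-share queue in lsb.queues looks roughly like the sketch below; the queue name and share values are hypothetical, not the Catania configuration:

    Begin Queue
    QUEUE_NAME  = fairq                    # hypothetical queue name
    PRIORITY    = 40
    FAIRSHARE   = USER_SHARES[[alice_pool, 10] [lhcb_pool, 5] [default, 1]]
    DESCRIPTION = Example fair-share queue (sketch only)
    End Queue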

24 Preemption Scheduling Example
Preemption has been added in this way:

    Begin Queue
    QUEUE_NAME = short
    PRIORITY   = 70
    HOSTS      = hostGroupC          # potential conflict
    PREEMPTION = PREEMPTIVE[normal]
    End Queue

    Begin Queue
    QUEUE_NAME = normal
    PRIORITY   = 40
    HOSTS      = hostGroupC          # potential conflict
    PREEMPTION = PREEMPTABLE[short]
    End Queue
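One way to confirm that the two queues are wired together as intended (standard LSF commands, not from the original slides):

    bqueues -l short normal    # the long listing shows PRIO and the PREEMPTION field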

25 How does Preemption Scheduling work?
Preemption has been added to the definition of the queues:
– short (priority 70): PREEMPTION=PREEMPTIVE[normal]
– normal (priority 40): PREEMPTION=PREEMPTABLE[short]
On hostA, with 1 job slot available:
1. A job is submitted to normal
2. The job is dispatched & running
3. A job is submitted to short
4. The normal job is suspended
5. The short job is dispatched & running
6. The short job completes
7. The normal job is resumed
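The same sequence can be watched from the command line; a sketch (the job script names are hypothetical):

    bsub -q normal ./long_job.sh     # step 1: fills hostA's only slot
    bsub -q short  ./urgent_job.sh   # step 3: triggers the preemption
    bjobs -u all                     # the normal job now shows state SSUSP
                                     # (suspended by the system), the short
                                     # job RUN; when the short job finishes,
                                     # the normal job returns to RUN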

26 Implementation of Preemption Scheduling (1/3)
Jobs coming from EGEE users belonging to the LHCb VO:

    Begin Queue
    QUEUE_NAME  = lhcb
    PRIORITY    = 15
    NICE        = 5
    #PJOB_LIMIT = 1.000
    #HJOB_LIMIT = 2
    QJOB_LIMIT  = 30
    PREEMPTION  = PREEMPTABLE[workq alice gr1cmsq theoqs theoqm theoql theoqi]
    INTERACTIVE = NO
    CPULIMIT    = 60:00
    RUNLIMIT    = 96:00
    RES_REQ     = type==LINUX86
    USERS       = lhcb_pool
    JOB_STARTER = /sw/lsf/scripts/jobstarter-lsf-lcg.sh
    HOSTS       = all_nodes+2 theofarm cmsfarm
    DESCRIPTION = LHCb dedicated Grid infinite queue
    End Queue

27 Implementation of Preemption Scheduling (2/3)
Jobs coming from LOCAL users belonging to the CMS group:

    Begin Queue
    QUEUE_NAME  = gr1cmsq
    PRIORITY    = 55
    NICE        = 5
    #PJOB_LIMIT = 1.000
    HJOB_LIMIT  = 2
    PREEMPTION  = PREEMPTIVE
    RES_REQ     = type==LINUX86
    USERS       = cms_group
    INTERACTIVE = NO
    HOSTS       = gr1farm cmsfarm
    DESCRIPTION = Dedicated Queue for jobs of gr1-cms group
    End Queue

28 Implementation of Preemption Scheduling (3/3)
Jobs coming from EGEE users belonging to the ALICE VO:

    Begin Queue
    QUEUE_NAME  = alice
    PRIORITY    = 40
    NICE        = 5
    PREEMPTION  = PREEMPTIVE[lhcb] PREEMPTABLE[gr1cmsq theoqs theoqm theoql theoqi]
    #PJOB_LIMIT = 1.000
    #HJOB_LIMIT = 2
    #QJOB_LIMIT = 65
    INTERACTIVE = NO
    CPULIMIT    = 48:00
    RUNLIMIT    = 72:00
    RES_REQ     = type==LINUX86
    USERS       = alice_pool
    JOB_STARTER = /sw/lsf/scripts/jobstarter-lsf-lcg.sh
    HOSTS       = all_nodes theofarm cmsfarm
    DESCRIPTION = Alice dedicated Grid infinite queue
    End Queue
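Putting the three definitions together (a summary sketch, not part of the original slides):

    # Resulting preemption chain (higher priority preempts lower on shared hosts):
    #   gr1cmsq (55, local CMS)  >  alice (40)  >  lhcb (15)
    # alice can preempt lhcb jobs, and can itself be preempted by the owners'
    # queues (gr1cmsq and the theo* queues). In the lhcb queue, "all_nodes+2"
    # marks all_nodes as the preferred hosts (preference level 2) for dispatch.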

29 Preemption vs. SLA
Preemption pros:
– Every job slot is filled, with suspension actions based on priorities in case of host competition
– Automatic restart of the suspended jobs
– No admin actions needed after configuration
– The farm is full & hot 24/7
Preemption cons:
– Remotely possible conflicts over host memory, caused by incorrect memory management of the (newly) running job
– A particular configuration has to be adapted for parallel jobs [done!]
SLA pros:
– Service Level Agreements based on projects, timelines (what & when) and priorities
– Easy project accounting
– Deadline, velocity and throughput goals can be combined
– Monitoring of progress & tracking of the historical behavior of the SLA
SLA cons:
– Needs an LSF administrator to configure a new SLA for each new project (may be frequent)
– If a VO has reached its SLA slot limit, its jobs cannot run any more even if the farm is empty
– After or during project completion the entire farm may not be fully used
– SLA jobs cannot be preempted
– A goal can be missed because of a misconfiguration
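For comparison with the queue-based preemption above, an LSF service class is defined in lsb.serviceclasses; a minimal sketch with hypothetical names and goal values, not the Catania configuration:

    Begin ServiceClass
    NAME        = aliceSLA                     # hypothetical service class name
    PRIORITY    = 10
    GOALS       = [VELOCITY 30 timeWindow ()]  # keep >= 30 jobs running at all times
    USER_GROUP  = alice_pool
    DESCRIPTION = Example SLA (sketch only)
    End ServiceClass

Jobs are then attached to the service class at submission time with bsub -sla aliceSLA.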

30 ALICE jobs running on TriGrid VL
With the same preemption policies, an ALICE "guest" queue has been created at the Catania site of the TriGrid VL infrastructure:
– When the TriGrid clusters/queues are empty, ALICE jobs can run
– If incoming TriGrid jobs compete for the hosts, the running ALICE jobs are temporarily suspended and restarted after the TriGrid jobs have finished
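A guest queue along these lines could look like the sketch below; this is a hypothetical reconstruction, since the actual TriGrid queue names are not shown in the slides:

    Begin Queue
    QUEUE_NAME  = alice_guest                  # hypothetical name
    PRIORITY    = 10                           # below every TriGrid "owner" queue
    PREEMPTION  = PREEMPTABLE[trigrid_owner]   # hypothetical owner queue name
    USERS       = alice_pool
    DESCRIPTION = Guest queue, runs only while the owners are idle (sketch only)
    End Queue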

31 LSF Monitoring Tool
Monitoring tools are available as packages!

32 Usage of the Catania site (1/3) (all supported VOs – last year)
[Usage plots, 2006 and 2007]

33 Usage of the Catania site (2/3) (all supported VOs – last month)
[Usage plots; CEs shown: one INFNGRID CE and two TriGrid VL CEs]

34 Usage of the Catania site (3/3) (ALICE only – last 6 months)
Globally, the Catania Tier-2 (red circles) is contributing to ALICE like a Tier-1 (black circles).

35 Summary & Conclusions
Catania is a multi-environment Grid site.
The ALICE Tier-2 is one of the most important services offered by the site, but it is NOT the only one: we support tens of other VOs as well as many local users.
Thanks to the implementation of some "wisdom rules", the site exploits all the resources as much as possible, even though they belong to different "owners", and maximizes their usage.

36 Questions…

