Status of PDC’07 and user analysis issues (from admin point of view) L. Betev August 28, 2007.


1 Status of PDC’07 and user analysis issues (from admin point of view) L. Betev August 28, 2007

2 GSI Darmstadt: The ALICE Grid
- Powered by AliEn
- Interfaces to gLite, ARC and (future) OSG WMS
- As of today: 65 entry points (62 sites), 4 continents
  - Africa (1), Asia (4), Europe (53), North America (4)
  - 21 countries, 1 consortium (NDGF)
  - 6 Tier-1 (MSS capacity) sites, 58 Tier-2
- All together: ~5000 CPUs (pledged), 1.5 PB disk, 1.5 PB tape
- Contribution range: from 4 to 1200 CPUs
  - PIII, PIV, Itanium, Xeon, AMD
- All Linux: Mandriva, Suse to Ubuntu, mostly SL3/4, no Gentoo, plus all possible kernel+gcc combinations

3 The ALICE Grid (2): 62 active sites

4 Operation
- ALICE offline is:
  - Hosting the central AliEn services: Grid catalogue, task queue, job handling, authentication, API services, user registration
  - Organising (guided by the requirements of the PWGs) and running the production
  - AliEn site services updates and operation (together with the regional experts)
  - User analysis support
- Sites are:
  - Hosting the VO-boxes (interface to site services)
  - Operating the local services (gLite and site fabric)
  - Providing CPU and storage
- This model:
  - Has been in operation with minor modifications for several years and is working quite well for production
  - Requires minor modification to support a large user community, mostly in the area of user support

5 History of PDCs
- Exercise of the ALICE production model
  - Data production / storage / replication
  - Validation of AliRoot
  - Validation of Grid software and operation
  - User analysis (not yet an integral part of the PDC)
- Since April 2006 the PDC has been running continuously

6 PDC job history: an average of 1500 CPUs running continuously since April 2006

7 PDC job history, zoom on last 2 months: 2900 jobs on average, saturating all available resources

8 Site performance
- Typical operation:
  - Up to 10% of the sites are not in production at any given moment
  - Half of these are undergoing scheduled upgrades
  - The other half have Grid or local services failures
  - T1s are in general more stable than T2s
  - Some T2s are much better than any of the T1s
- Achieving better stability of the services at the computing centres is a top priority of all parties involved
- The central services availability is better than 95%

9 Production status: total of 85,837,100 events as of 26/08/2007, 24:00 hours

10 Sites contributions. Standard distribution: 50/50 T1/T2 contribution

11 Relative contribution - Germany. Standard distribution: 50/50 T1/T2 contribution; 15% of total

12 Efficiencies/debugging
- Workload management for production
  - Under control and near production quality
  - We keep saying that, but this time we really mean it
  - Improvements (speed, stability) are expected with the new gLite version 3.1, still untested
- Support and debugging
  - The overall situation is much less fragile now
  - Substantial improvements in AliEn and monitoring are making the work of the experts supporting the operations easier
  - gLite services at the sites are (mostly) well understood and supported
- User support is still very much in need of improvement
  - The issues with user analysis are often unique and sometimes lead to development of new functionality
  - But at least the response time (if not the solution) is quick

13 General
- The Grid is getting better
  - Running conditions are improving
  - The Grid middleware in general and AliEn in particular are quite stable
  - After long and hard work by the developers
  - Even user analysis, much derided in the past few months, is finally not a painful exercise
- The operation is more streamlined now
  - Better understanding of running conditions and problems by the experts
- We continue with the usual PDC’07 programme
  - Simulation/reconstruction of MC events
  - Validation of new middleware components
  - User analysis
- And in addition the Full Dress Rehearsal (FDR)

14 User analysis issues - short list
- Major issues - February/June 2007
  - Jobs do not start, are lost, or have missing output
  - Input data collections are difficult to handle and impossible to process at once
  - Priorities are not set - a single user can ‘grab’ all resources
  - Unclear definition of storage elements (Disk/MSS)

15 User analysis issues - short list (2)
- What has been done
  - Failover CE for user queue (Grid partition ‘Analysis’)
    - Since 20 June - 100% availability
  - Pre-staging of data (available on spinning media) and creation of xml collections centrally
    - The availability of the pre-staged files is checked periodically
  - More robust central services (see previous slides)
  - Use of a dedicated SE for user files - this will be transparently increased to multiple SEs with quotas
  - Priority mechanism (not the final version) put in place
    - We haven’t had reports of unfair use

16 Job completion chart (user jobs). Standard distribution: 50/50 T1/T2 contribution

17 User analysis issues - current
- Storage availability and consistency
  - Still very few working SEs - common storage solutions are not yet ‘production’ quality
  - The effort is now concentrated on CASTOR2 with xrootd
  - Sites (e.g. GSI) are installing large xrootd pools - these are tested and working
  - With more SEs holding replicas of the data, the Grid will naturally become more stable
- Availability of specific data sets
  - Dependent on the storage capacity in operation
  - Currently TPC RAW data is being replicated to GSI
  - With CASTOR2+xrootd working, the number of events on spinning media will increase 20x
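
The point that more SEs holding replicas make the Grid more stable can be illustrated with a back-of-the-envelope calculation. The sketch below is not part of the talk: it assumes, purely for illustration, that each SE is available independently of the others and with the same probability. The function name and the 0.90 figure are hypothetical, not measurements from the PDC.

```python
def replica_availability(se_availability: float, replicas: int) -> float:
    """Probability that at least one replica of a file is reachable.

    Assumption (hypothetical): SEs fail independently and all have the
    same availability, given as a fraction between 0 and 1.
    """
    if not 0.0 <= se_availability <= 1.0:
        raise ValueError("se_availability must be in [0, 1]")
    if replicas < 1:
        raise ValueError("replicas must be >= 1")
    # The file is unreachable only if every SE holding a replica is down.
    return 1.0 - (1.0 - se_availability) ** replicas


if __name__ == "__main__":
    # Illustrative value only: each SE is up 90% of the time.
    for n in (1, 2, 3):
        print(n, round(replica_availability(0.90, n), 4))
```

Under these assumptions, a file replicated on two SEs that are each up 90% of the time is reachable about 99% of the time. Real SE failures are often correlated, so this is an optimistic upper bound.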

18 User analysis issues - current (2)
- User applications
  - Compatibility of user installation of ROOT, gcc version, OS - a locally compiled application will not necessarily run on the Grid
  - All sites are installed with ‘lowest common denominator’ middleware and packages - currently SLC3, gcc v.3.2, while most users have gcc v.3.4
  - There is no easy way out until the centres migrate to SL(C)4 and gcc v.3.4
  - Meanwhile, the experts are looking into repackaging the Grid apps (most notably gshell)
  - Currently the only solution is to always compile ROOT and the user application with the same compiler before submitting to the Grid
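
The advice to build ROOT and the user application with the same compiler can be turned into a simple pre-submission check. The following is a minimal sketch, not a tool from the ALICE software: the function names are hypothetical, the version strings are supplied by the user, and it assumes that matching major and minor gcc version numbers is an adequate proxy for "the same compiler".

```python
import re


def parse_gcc_version(text: str) -> tuple:
    """Extract (major, minor) from a string such as '3.4.6' or 'gcc v.3.2'."""
    match = re.search(r"(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return int(match.group(1)), int(match.group(2))


def same_compiler(root_gcc: str, app_gcc: str) -> bool:
    """True if ROOT and the user application report the same gcc major.minor.

    Assumption (hypothetical): patch-level differences are ignored.
    """
    return parse_gcc_version(root_gcc) == parse_gcc_version(app_gcc)


if __name__ == "__main__":
    # Versions mentioned on the slide: sites at gcc 3.2, most users at gcc 3.4.
    print(same_compiler("gcc v.3.2", "gcc v.3.4"))
```

A user could run such a check with the version string of the compiler used for the ROOT build and the one used for the analysis code, and rebuild one of them if the check fails.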

