GDA Jun/6, 2004/ZS 1 Next LCG-2 middleware release Zdenek Sekera (for the LCG GD-CT section) GDA 7 Jun 2004.

1 GDA Jun/6, 2004/ZS 1 Next LCG-2 middleware release Zdenek Sekera (for the LCG GD-CT section) GDA 7 Jun 2004

2 GDA Jun/6, 2004/ZS 2 Outline Grid Deployment group: Certification and Testing section. What are we doing? Who are the members? How are we organized? LCG support: who is helping us? What is the Certification process? What is the purpose of that process? What is going to be in the next LCG-2 middleware release? What after RH7.3?

3 GDA Jun/6, 2004/ZS 3 Certification & Testing: what are we doing? Integrate middleware software from several different sources into a homogeneous package. Provide production-quality software. Verify that: it can actually be installed following the installation instructions provided by us; it can be configured to create a proper environment allowing a site to connect to the world-wide LCG grid; it is fully functional as a production system. Our tools: “big” certification testbed (~60 machines); “small” certification testbed (~10 machines); quite extensive set of tests.

4 GDA Jun/6, 2004/ZS 4 LCG Certification Goal Provide reliable releases of the LCG software for production use. We want to make sure that when YOU download LCG software from the LCG deployment Web site, you have a guarantee that: it has been certified; it installs when installed using the installation instructions supplied by LCG; it will work as specified in the various user and system documentation supplied by LCG. If it does NOT, we want to hear from you and we will correct it.

5 GDA Jun/6, 2004/ZS 5 Certification & Testing: section members
Piera BETTINI: GridICE, R-GMA, integration
Jean-Philippe BAUD: DataMgt, GFAL, dCache
Frederique CHOLLET: Testing, test suites development
Gilbert GROSDIDIER: Testing, test suites development
Mila KATZAROVA: WEB redesign
Maarten LITMAATH: VDT, dCache, general debugging
Carlos OSUNA: CVS, Autobuild, Porting
Louis PONCET: CVS, Web, sysadmin, HW, porting
Marco SERRA: CTB architect, integration, debugging
David SMITH: Workload Mgt, debugging
Di QING: Integration, debugging
Zdenek SEKERA: Management and the rest
Plus temporary visitors (1-3 months): E.Slabospitskaja, A.Kirianov, D.Olejnik, M.Sapunov, G.-T. Chiang, H.-L. Shih, M.-H Tsai, others

6 GDA Jun/6, 2004/ZS 6 LCG support Workload Management: primary contact: Massimo Sgaravatto (EDG WP1) Data Management: primary contact: James Casey (now member of GD, formerly EDG WP2) dCache: primary contact: dCache support mailing list R-GMA: primary contact: WP3 mailing list, S.Fisher, S.Traylen

7 GDA Jun/6, 2004/ZS 7 Certification, Testing and Release Cycle [Flow diagram. Inputs: LCG C&T section (add features, fix problems); EGEE and VDT (problems are transmitted to them; they fix problems and deliver new releases). Certification testbed: Integrate, Basic Functionality Tests, Run C&T test suites and site test suites, Run Certification Matrix, Release Candidate tagged; an "errors?" decision after each test step leads to "fix problems" on yes. Experiments integration testbed: "errors?" decision; on yes, fix problems or candidate not acceptable. Deployment: RELEASE, PRE-DEPLOYMENT, certified release tagged, deployment feedback, GENERAL RELEASE.]
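
One way to read the cycle on this slide is as a loop in which any failed check sends the candidate back for fixes. The sketch below is a hypothetical illustration only: the stage names come from the slide, but their exact ordering, the restart-from-integration rule and the `run_cycle` helper are assumptions, not the C&T section's actual tooling.

```python
from typing import Callable, Dict, List

# Stage names taken from the slide; the linear ordering is an assumed reading
# of the flow diagram.
STAGES: List[str] = [
    "integrate",
    "basic functionality tests",
    "C&T and site test suites",
    "certification matrix",
    "release candidate tagged",
    "experiments integration testbed",
    "certified release tagged",
    "pre-deployment",
    "general release",
]


def run_cycle(checks: Dict[str, Callable[[], bool]], max_fixes: int = 10) -> List[str]:
    """Walk the stages in order and return the log of what happened.

    `checks` maps a stage name to a callable returning True when the stage
    passes; stages without a check always pass. A failed stage is logged as
    "fix problems (...)" and the cycle restarts from integration (assumption).
    """
    log: List[str] = []
    fixes = 0
    i = 0
    while i < len(STAGES):
        stage = STAGES[i]
        passed = checks.get(stage, lambda: True)()
        if passed:
            log.append(stage)
            i += 1
        else:
            fixes += 1
            if fixes > max_fixes:
                raise RuntimeError(f"giving up after {max_fixes} fix rounds at {stage!r}")
            log.append(f"fix problems ({stage})")
            i = 0  # back to integration with the fixed packages
    return log
```

Passing a check such as `{"certification matrix": some_test}` shows how one late failure repeats all earlier stages, which is why the slides stress testing on the small testbed first.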

8 GDA Jun/6, 2004/ZS 8 What is “production quality”? It is all of the following in no particular order: availability 24 x 7 performance stability, robustness user friendliness maintainability user support

9 GDA Jun/6, 2004/ZS 9 LCG-2 certification basic grid functionality connectivity grid services security resource brokering data management (replication, catalog) configurability error recovery real world applications site verification suite

10 GDA Jun/6, 2004/ZS 10 LCG-2 May/31 release Consolidation/maintenance activities New VDT 1.1.14 (Globus 2.4.3) Workload Management maintenance Data Management maintenance and features GridICE monitoring improvements New features Data Management lcg-utils - tools requested by experiments GFAL integration “long names” Castor client Possibly dCache integration R-GMA integration Accounting

11 GDA Jun/6, 2004/ZS 11 VDT 1.1.14 1.1.14 == 1.1.13 + a few patches implementing Globus "Advisories" (e.g. OpenSSL security upgrade). It is based on Globus 2.4.3 with 48 patches applied on top, fixing bugs (memory and file descriptor leaks, race conditions, logic errors) and adding needed functionality (gridmapdir, gatekeeper accounting and logfile rotation). Almost all our patches (31) have been submitted by VDT to Globus; many have already been incorporated into Globus 2.4.3, a few are on the to-do list for future releases. It is compatible with VDT 1.1.8-14.edg4 currently used in the production system; the only problem is with round-robin IP address load-balancing (e.g. castorgrid), but there is an easy work-around. It has been running on the CTB for a month without any problems.

12 GDA Jun/6, 2004/ZS 12 Workload Management maintenance (1/3) In two steps: 1. move from lcg2-1-20 to lcg2-1-21-1 (only EDG bug fixes); 2. move from lcg2-1-21-1 to lcg2-1-25-1 (only LCG bug fixes). Gradual testing on the “small” testbed first, so when we installed it on the “big” CTB we knew the upgrade would be painless. lcg2-1-25-1 is the first version that was fully built on the LCG CVS server.

13 GDA Jun/6, 2004/ZS 13 Workload Management maintenance (2/3)
2004-05-06: patch 150 - WMS lcg2_1_21-1
WMS changes with respect to lcg2_1_20: Fix EDG bugzilla bugs:
1992 - UI ignores TCP port range variable for interactive jobs
1997 - edg-job-list-match "error" messages
2357 - Job refusal at NS is reported incorrectly
2440 - edg-job-submit always produces "edglog.log" file in cwd
2469 - Brokerinfo file only lists a file once
2487 - Error occurred during mkdir for reduced part
2493 - OutputSE not working?
2540 - duplicated entries in ACL list
2566 - SocketAgent::close check for wrong no_error return code by close method

14 GDA Jun/6, 2004/ZS 14 Workload Management maintenance (3/3)
2004-05-18: patch 164 - WMS lcg2_1_25-1
WMS changes with respect to lcg2_1_21_1: Fix LCG savannah bugs:
2682 - workload manager ranking queries
2701 - WP1 & GlueCEUniqueID, GlueClusterUniqueID and GlueSubClusterUniqueID
2715 - edg-wl-lm init.d script and lockfile
2792 - edg-wl-renewd cannot handle change of MyProxy host
2909 - Error in edg-job-get-chkpt
2991 - WM crash possible when specifying OutputData
3258 - edg-wl-ns start takes a long time due to unneeded chown -R
3286 - Timezone for --from & --to options of edg-job-status
3372 - Include WMS job id in the job's globus RLS

15 GDA Jun/6, 2004/ZS 15 Data Management maintenance (1/3) The main points of focus of this release are: Upgrade of the GSoap runtime to 2.3 for all C++ clients (needed by ATLAS). Addition of extra methods into catalogs for bulk operations (requested by CMS/POOL). Refactoring of the info system interaction and the printInfo command in Replica Manager (internal request to rewrite a buggy component that caused many error reports). Integration of EDG-SE StorageResource for AFS interaction at RAL.

16 GDA Jun/6, 2004/ZS 16 Data Management maintenance (2/3) So we have ended up with the following new versions edg-replica-manager v1.7.2 edg-local-replica-catalog v2.2.7 edg-replica-metadata-catalog v2.2.7 LRC/RMC C++ clients v2.3.0

17 GDA Jun/6, 2004/ZS 17 Data Management maintenance (3/3)
Bugs Fixed
2858 - RM misbehavior if -d option not used (and default SE not available)
2875 - edg-rm pi with -f option prints bad service endpoints.
2887 - edg-rm does not accept port number in SRM SURL
2890 - edg-rm requires VO directory to be absolute path
2947 - If unknown SE turl protocol is in MDS, edg-rm malfunctions
2996 - POOL (RLS) : Allow guid/pfname/lfname as valid query fields
2998 - POOL (RLS) : Array based getMappings() methods
2999 - POOL (RLS) : setAttributes bulk method
3014 - edg-rm and directory creation against edg-se
3428 - edg-rm cr NPEs if a bad URI is given.
3265 - WP2 C+ clients should use gSoap 2.3
3282 - edg-rm cannot handle certain LFNs that should be accepted.
3296 - edg-rm sets SRM FileStatus to Active, not Running
3300 - edg-rm pi displays httpg URIs as https

18 GDA Jun/6, 2004/ZS 18 GridICE monitoring improvements A new version of edt-sensor (edt_sensor-1.4.17-0) was integrated into LCG-2 and installed on all CTB clients.

19 GDA Jun/6, 2004/ZS 19 GFAL - Grid File Access Library GFAL version 1.3.7 better error handling in MDS interface (avoid core dump when info in MDS is missing or incorrect) several new routines in LRC and RMC interface to support the new lcg_util tools. the interface to the ADS (Rutherford) has been developed but may not be part of the release yet due to insufficient testing

20 GDA Jun/6, 2004/ZS 20 LCG utilities – tools requested by experiments
lcg_util version 1.0.6
We now provide the following 11 methods (C API and CLI):
lcg-aa: add Alias in RMC
lcg-cp: copy file (Atlas)
lcg-cr: copy and register file with optionally specified GUID (Atlas)
lcg-del: delete file on a given SE or all replicas
lcg-gt: get TURL (Atlas)
lcg-lg: get the GUID for a given LFN or SURL
lcg-lr: list all replicas for a file having a specified GUID
lcg-ra: remove Alias in RMC
lcg-rep: replicate files between 2 Storage Elements
lcg-rf: register file with optionally specified GUID (Alice)
lcg-uf: unregister file (Alice)

21 GDA Jun/6, 2004/ZS 21 Castor client supporting “long names” CASTOR-client-1.7.1.4-1.longname was installed on the CTB and will be released with the May LCG-2 upgrade. A potential problem exists: it will be necessary (and prudent) for the future to find a way of synchronizing Castor server/client releases with CERN.

22 GDA Jun/6, 2004/ZS 22 R-GMA integration (1/3) R-GMA is required for accounting and for specific monitoring by some experiments (e.g. CMS). The installation could be done in one of two ways: A. R-GMA people do everything: packaging, testing, distribution. We will not get involved at all. B. R-GMA people will do: packaging, testing, installation and configuration instructions; provide some simple tests and instructions on how to use them to verify the installation; install an R-GMA registry for us so we don't use the RAL production one for testing. We will do: certification for LCG-2, using the supplied install & config instructions; testing on our C&T testbed; inclusion in the LCG-2 distribution, as RPM's to be downloaded by sites (the installation & config instructions would be yours).

23 GDA Jun/6, 2004/ZS 23 R-GMA integration (2/3) The R-GMA group has chosen option B, which was also our preferred solution. In this case the R-GMA packaging has to conform to LCG standards: 1. The RPM's must be relocatable. 2. They should not use any pre- or post-installation scripts. 3. If an environment that is not LCG needs to be included (such as different versions of Java, Tomcat, MySQL etc.), it has to be included in such a way that it doesn't interfere with the deployed LCG-2. 4. Installed software must be tested on a real-life LCG-2 (or the C&T testbed) before it can be released. 5. Installation & configuration must be batch-like, via a script, with no interactive updating of parameters. It is preferable to have one configuration file as a template, which may need manual update on each site, and the installation & configuration script(s) take all the information from that one file. The configuration has to consist of two parts: the proper R-GMA config, and the "system" config (setting up various services that should start on boot etc.). Two clearly separated scripts are required.
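
Rule 5 asks for a batch-like setup driven by one site configuration file, with the R-GMA part and the "system" part handled separately. A minimal sketch of that split follows; the section names, keys and host name are hypothetical and do not reflect R-GMA's actual configuration format.

```python
import configparser

# Hypothetical single site template; every key and value here is an assumption.
SITE_TEMPLATE = """
[rgma]
registry_host = registry.example.org
log_level = info

[system]
start_on_boot = yes
services = tomcat, mysql
"""


def load_site_config(text: str) -> configparser.ConfigParser:
    """Parse the one site file and fail early if a required part is missing."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    missing = [s for s in ("rgma", "system") if not parser.has_section(s)]
    if missing:
        raise ValueError(f"missing config section(s): {', '.join(missing)}")
    return parser


def configure_rgma(cfg: configparser.ConfigParser) -> dict:
    """First script: only the R-GMA settings, no interactive prompts."""
    return dict(cfg["rgma"])


def configure_system(cfg: configparser.ConfigParser) -> dict:
    """Second script: only the system settings (boot-time services)."""
    section = cfg["system"]
    services = [s.strip() for s in section.get("services", "").split(",") if s.strip()]
    return {"start_on_boot": section.getboolean("start_on_boot"), "services": services}
```

Keeping the two functions separate mirrors the "two clearly separated scripts" requirement, while both read from the same file so a site edits values in one place only.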

24 GDA Jun/6, 2004/ZS 24 R-GMA integration (3/3) Current status: we have been working for a long time with the R-GMA developers. List of services that must be published to enable job monitoring: still waiting. We have provided a category called "RGMA" for bug reporting in savannah. New bugs opened (we haven’t finished checking the new rpm’s yet): 3645 - /tmp is not the best place to put logs; 3647 – rgma default log level is debug; 3648 – confusing configuration file; 3655 – edg-rgma-servlets overwrite configuration file. Currently unclear if it can be included in the release; no serious testing yet.

25 GDA Jun/6, 2004/ZS 25 Accounting integration Three weeks ago, we installed one rpm which should do the work; it had bugs. Some patches have been provided since then by one of the R-GMA developers, not by the accounting group. They had to be installed by hand. We have received no new rpm’s since. We do not have any news from the accounting people. We could not test it, obviously. Consequently the accounting package could not be part of the release.

26 GDA Jun/6, 2004/ZS 26 dCache integration dCache includes an SRM 1.1 interface and a diskpool manager. It is needed to provide managed disk space. LCG has been working with the FNAL/DESY developers to integrate their software into LCG-2 for about 3 months now. Current dCache status: the old version with patches has survived a few stress tests, but each dCache server sooner or later gets into a bad state, requiring a reboot; the latest version is not (yet) usable because SRM does not advertise the "gsidcap" protocol; 38 open problems, 13 are major. New dCache rpm’s received over the last weekend.

27 GDA Jun/6, 2004/ZS 27 dCache integration – problem summary
Status of dCache problems (2004/05/18)

                     | Major (*) | Normal | Total
---------------------+-----------+--------+------
Fixed (#)            |         3 |      3 |     6
---------------------+-----------+--------+------
In progress (@)      |         3 |      1 |     4
No news yet          |         5 |     21 |    26
New                  |         5 |      3 |     8
---------------------+-----------+--------+------
Total open           |        13 |     25 |    38
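
The "Total open" row counts only the problems that are in progress, without news, or new; fixed problems are excluded. A small check of that arithmetic, using the numbers from the summary (the dictionary layout and the `totals` helper are illustrative):

```python
# (major, normal) counts per status, copied from the problem summary.
ROWS = {
    "Fixed": (3, 3),
    "In progress": (3, 1),
    "No news yet": (5, 21),
    "New": (5, 3),
}

# Statuses that count as "open"; "Fixed" is deliberately left out.
OPEN_STATUSES = ("In progress", "No news yet", "New")


def totals(rows, statuses):
    """Return (major, normal, total) summed over the given statuses."""
    major = sum(rows[name][0] for name in statuses)
    normal = sum(rows[name][1] for name in statuses)
    return major, normal, major + normal


open_major, open_normal, open_total = totals(ROWS, OPEN_STATUSES)
```

With these inputs the open counts match the slide: 13 major, 25 normal, 38 in total.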

28 GDA Jun/6, 2004/ZS 28 dCache integration – problem list (1/6)
1. *@ RPMs: should be cleaned and automatically released. We should not get TAR files. See also points 12, 13, 14, 16, 28, 35. Packaging almost OK now (2004/05/17)
2. # slow response time on SRM and GridFTP to be investigated (18/2/2004). Fix by David Smith has been incorporated in latest RPMs.
3. path too short (24/2/2004), supposed to be fixed, to be tested, important for GFAL filesystem
4. perror in dcap_url.c (24/2/2004)
5. gfalfs/fuse/dCache integration (24/2/2004)
6. O_TRUNC or overwrite of existing file (25/2/2004)
7. dcau (25/2/2004)
8. pinning to be tested (24/2/2004)

29 GDA Jun/6, 2004/ZS 29 dCache integration – problem list (2/6)
9. grid-map-file conversion (10/03/2004) --> The standard grid-map-file should be used, and any other parameters (e.g. VO root dir) should be put into a separate config. file
10. error message when missing VO directory (10/03/2004)
11. * hang when writing a file and the disk is full (12/03/2004)
12. *# version number (including libdcap) (16/03/2004). Fixed in latest RPMs (2004/05/17)
13. *@ templates should be provided, configuration files should not be overwritten (17/03/2004). Almost OK in latest RPMs (2004/05/17)
14. relocatable RPMs (17/03/2004)
15. file naming (18/03/2004)
16. *# some files still accessed thru their /usr/d-cache name (symbolic links currently needed). Fixed in latest RPMs (2004/05/17)

30 GDA Jun/6, 2004/ZS 30 dCache integration – problem list (3/6)
17. @ host proxy and srm-storage-element-info (29/03/2004). Latest code should fix it, to be tested
18. srmcp and X509_USER_PROXY (currently needs complicated command line options)
19. # pnfs config. scripts need non-interactive mode (31/03/2004). Fixed in latest RPMs (2004/05/17)
20. IOTunnel library for kdcap + port number for kdcap
21. core dump when port not specified (31/03/2004). Now getting obscure error message: "Failed to create a control line"
22. # /opt/grid/gsint/gsint (01/04/2004). If it is not used, it should be removed from the RPM. If it is used, it should be moved to /opt/d-cache or /opt/gsint or... Fixed in latest RPMs (2004/05/17)
23. manual garbage collection (02/04/2004)
24. * missing entry points: dc_chmod, dc_mkdir, dc_rename, dc_rmdir and dc_unlink (02/04/2004)

31 GDA Jun/6, 2004/ZS 31 dCache integration – problem list (4/6)
25. * non working dc_opendir
26. * dCache SRM returns a TURL even if no space available (02/04/2004)
27. dCache totalSpace vs. usedSpace vs. availableSpace (05/04/2004). Feature? We have a work-around (2004/05/17).
28. *# better "srm" script (06/04/2004), please use Maarten's one. Fixed in latest RPMs (2004/05/17)
29. pnfs mountd incompatible with normal mountd
30. getFileMetaData srm://lxshare0282.cern.ch:8443/pnfs/cern.ch/data/cms gives java exception while the directory exists
31. getFileMetaData does not return ownership
32. Admin Guide + Installation Guide

32 GDA Jun/6, 2004/ZS 32 dCache integration – problem list (5/6)
33. dcap User Guide (only a few APIs are currently documented, protocols and port numbers should also be documented)
34. We propose that a hierarchy is implemented to set port numbers: user specified, environment variable, /etc/services, default set at compile time
35. *@ an object should be defined in one and only one RPM. This is currently not the case: dCache and dCache-pool RPM provide same objects. Almost OK in latest RPMs (2004/05/17)
36. * need of reboot after parameter change or sw change. The recipe of restarting all java services does not work.
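
The port-number hierarchy proposed in item 34 can be sketched as a simple fallback chain. This is an illustration of the proposal, not dCache code: the environment variable name `DCAP_PORT`, the service name `dcap` and the default value are assumptions chosen for the example.

```python
import os
import socket
from typing import Mapping, Optional

# Stand-in for the "default set at compile time"; the value is illustrative.
DEFAULT_PORT = 22125


def _validate(port: int) -> int:
    """Reject values outside the valid TCP port range."""
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return port


def resolve_port(
    user_port: Optional[int] = None,
    env: Mapping[str, str] = os.environ,
    env_var: str = "DCAP_PORT",   # hypothetical variable name
    service: str = "dcap",        # hypothetical /etc/services entry
    default: int = DEFAULT_PORT,
) -> int:
    """Pick a port in the proposed order of precedence."""
    # 1. Explicit user choice (e.g. a command line option).
    if user_port is not None:
        return _validate(user_port)
    # 2. Environment variable.
    raw = env.get(env_var)
    if raw:
        return _validate(int(raw))
    # 3. System services database (/etc/services on most Unix systems).
    try:
        return socket.getservbyname(service, "tcp")
    except OSError:
        # 4. Built-in default.
        return default
```

Because every level falls through to the next, a client started without any port setting still gets a usable value instead of failing, which is the behaviour item 21 asks for.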

33 GDA Jun/6, 2004/ZS 33 dCache integration – problem list (6/6)
New since previous list:
37. * many (> ~15) parallel clients causes SRM to hang
38. * dcache-lcg-v1.2.2 SRM does not publish gsidcap protocol (2004/05/17). This makes that version unusable.
39. * SRM put error reporting if the file already exists (2004/04/14)
40. if unsupported protocol given for get/put, request state is failed, but file state remains pending (2004/04/23)
41. * libgsiTunnel.so needs globus_module_activate/deactivate gssapi module to work around a Globus bug (patch available) (2004/05/18)
42. * dcache stop script can leave /pnfs mounted, possibly causing an RPM upgrade or a shutdown to hang
43. * RPMs should come with release notes saying which bugs were fixed, what new functionality exists etc.
44. logfiles must be cleaned up: time stamps and request parameters must be added, harmless errors must be removed
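
A hang under many parallel clients (item 37) is the kind of problem a small concurrency harness can expose. The sketch below is a generic illustration, not the C&T test suite: `client` is a placeholder for whatever single request is being exercised, and the `stress` helper and its result keys are assumptions.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable, Dict


def stress(client: Callable[[int], object], n_clients: int, timeout_s: float) -> Dict[str, int]:
    """Run n_clients calls of `client` concurrently.

    Returns counts of calls that finished ("ok"), raised ("failed"), or did
    not finish before the shared deadline ("hung").
    """
    pool = ThreadPoolExecutor(max_workers=n_clients)
    futures = [pool.submit(client, i) for i in range(n_clients)]
    deadline = time.monotonic() + timeout_s
    result = {"ok": 0, "failed": 0, "hung": 0}
    for future in futures:
        remaining = max(0.0, deadline - time.monotonic())
        try:
            future.result(timeout=remaining)
            result["ok"] += 1
        except FutureTimeout:
            result["hung"] += 1
        except Exception:
            result["failed"] += 1
    # Do not block on calls that are still stuck.
    pool.shutdown(wait=False)
    return result
```

Running it with an increasing `n_clients` would show at which level of parallelism calls start to land in the "hung" bucket.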

34 GDA Jun/6, 2004/ZS 34 dCache integration - conclusion A considerable amount of time (~ 3 months) has already been spent on dCache integration into LCG-2. A significant number of problems remain unresolved, sometimes for many weeks. Support from the dCache developers exists but it is very irregular. Due to the many existing problems the dCache software could not yet be thoroughly tested on the CTB. Consequently it cannot be deployed yet. Probably the best way to finish the dCache integration would be to bring the relevant dCache developers to CERN for some period of time.

35 GDA Jun/6, 2004/ZS 35 DWS: Developers Workstation Syndrome? Developer: It works in my environment so it must work everywhere. Reality: It works in my environment so there is a non-zero probability it will work elsewhere, too.

36 GDA Jun/6, 2004/ZS 36 LCG deployment Web page redesign (1/3) Current status: 1.Official Web page Template - done Documentation management implementation is ready 2.Internal Web page Upload files - done News - under construction Sections' web pages template - under construction

37 GDA Jun/6, 2004/ZS 37 LCG deployment Web page redesign (2/3) Issues: Upload file problems: how are we going to upload html files? They normally consist of more than one file, and the dll offered by "Web Services" at CERN is able to update only one file at a time. Testing possible solutions. Permission problem: setting up permissions to the Documentation directory only for our group (for upload) seems impossible. Need CERN help.

38 GDA Jun/6, 2004/ZS 38 LCG deployment Web page redesign (3/3) Schedule: Documentation management ready by the end of next week (24.05 - 28.05). That will include news management too (add, delete, update news). Sections' web pages template (31.05 - 04.06). Release of the static information on the official web site (07.06 - 11.06). Map with the participating institutes (~18/06). First internal release: middle of June, for internal feedback. Public release: ~end of June.

39 GDA Jun/6, 2004/ZS 39 LCG-2 – what’s next ? We think the LCG-2 is now fairly stable, we have no plans for major middleware upgrades, only the obvious bugfix maintenance. We wish to add new services, hopefully as add-on upgrades: R-GMA (including accounting) dCache VOMS – generate gridmapfile What else do YOU need? Tell us!

40 GDA Jun/6, 2004/ZS 40 What after RH 7.3? In the absence (hopefully temporary) of consensus, we have chosen to port LCG-2 to: (1) RH Enterprise Server 3.0 IA32, the CERN variant. This should be the original RH, recompiled by CERN (license issues), consequently with the “CERN” logo. It should be freely downloadable when certified by CERN. Already well integrated in autobuild. We have started to install a small testbed which we will later connect to the “big” C&T testbed for interoperability testing; we have a WN working (manual installation). (2) RH Enterprise Server 3.0 IA64. Needed to support OpenLab. External OpenLab partners involved (HP, IBM). Most of the work has already been done manually by OpenLab people; work is progressing to integrate it into the autobuild system.

41 GDA Jun/6, 2004/ZS 41 Building system [Diagram labels: CVS server; CVS checkout; list of modules required; building on rh73 i386, cel3 i386 and cel3 ia64; HTTP server; publishing of RPMs and reports of the build.]

42 GDA Jun/6, 2004/ZS 42 After RH 7.3 Porting to platforms other than RH 7.3 has become a much higher priority. We are working on some ports ourselves. We have already started a collaboration with Irish people (they have some experience with non-Linux systems such as IRIX). We will initiate further collaborations with QMUL and LeSC, who offered their resources. We provide anonymous access to our CVS server and will advise on how to set up the build process. We will then introduce all changes into the CVS server for all LCG C&T tested architectures. If we do not have the necessary hardware, we will solicit help in providing access to such resources and help in certification.

43 GDA Jun/6, 2004/ZS 43 After RH 7.3 Issues to consider for testing: HW availability (IRIX, Solaris, others). Interoperability between different O/S. We will start with Worker Nodes, leaving service nodes on IA32 (probably RH 7.3 for now) and adding CE, SE, and other services later.

44 GDA Jun/6, 2004/ZS 44 QUESTIONS ??

45 GDA Jun/6, 2004/ZS 45 Distribution process [Diagram labels: developers; CVS/dev; compilation; CVS/int; certification team; RPMs; RPMs list; RPMs list & configuration; configuration; Web; LCFGng install; manual install. Note on slide: "Details in next slide".]

46 GDA Jun/6, 2004/ZS 46 LCG Certification, Testing and Release Cycle CERTIFICATION TESTING DEPLOYMENT LCG C&T section add features fix problems transmit problems EGEE fix problems new releases VDT fix problems new releases Integrate Basic Functionality Tests Run C&T test suites site test suites Run Certification Matrix Release candidate tagged RELEASE PRE-DEPLOYMENT deployment feedback GENERAL RELEASE EXPERIMENTS INTEGRATION Experiments software installation Testing experiments specific features Certified release tag

47 GDA Jun/6, 2004/ZS 47 EGEE Certification, Testing and Release Cycle CERTIFICATION TESTING SERVICES Integrate Basic Functionality Tests Run tests C&T suites Site suites Run Certification Matrix Release candidate tag RELEASE PRE-PRODUCTION PRODUCTION EXPTS INTEGR Certified release tag DEVELOPMENT & INTEGRATION UNIT & FUNCTIONAL TESTING Dev Tag JRA1 LHC EXPTS MEDICAL OTHER TBD APPS SW Installation DEPLOYMENT PREPARATION Deployment release tag DEPLOY SA1 Production tag
