gLite Status Stephen Burke RAL GridPP 13 - Durham
July 6th 2005, gLite Status
Overview
  gLite releases
  gLite deployment
  WMS
  DMS
  R-GMA
  VOMS
  Outstanding issues
  E&OE!

gLite releases so far
  Release 1.0 on April 5th
    – Released to meet deadline
    – WMS + CE + Fireman + gLite i/o + R-GMA + VOMS
    – AliEn, GAS and package manager gone
    – Several things missing or not working well
        No SE in gLite
    – Documentation is reasonable
  Release 1.1 on May 12th
    – First versions of File Transfer Service (FTS), metadata catalogue
    – Secure file catalogues
    – Bug fixes

Future releases
  Release 1.2 should have been on June 1st
    – Delayed to end of June, now expected late July
        Was expected to be in LCG July release
        Have gLite R-GMA and VOMS as LCG upgrades
  Final gLite release (2.0) for EGEE 1 by end of the year
    – Updated architecture/design/workplan documents
    – Code freeze October (?)
    – Maybe a 1.3 release (August?), time is tight

Timelines
  [Timeline figure. Axis: June 2005, October 2005, November 2005, December 2005, March 2006. Labels: TODAY, Release 1.2, Func. freeze (twice, one marked "?"), Integrated 2.0, Release 2.0, Mid Dec., Xmas vacation, Final Report, Review, End of EGEE 1.]
  Consequences
    – ~2.5 months of development left
    – Probably only 1 or 2 releases between 1.2 and 2.0
    – Focus on consolidation of 1.2 and small improvements as requested by applications
    – Very careful in introducing new services

Release priorities
  Driven by service challenges
    – Especially data management
    – LCG Baseline Services document
  No time to change anything for EGEE 1
  EGEE PTF disbanded
    – Not seen as effective
    – Who collects requirements?
    – Do non-LCG VOs have influence?

gLite deployments - JRA1
  gLite prototype system
    – Used by ARDA team, biomed, some others
    – Very small, basically just CERN
    – Not properly maintained
  JRA1 testing testbed
    – Was CERN, RAL and NIKHEF
    – Two sites + manpower added at Imperial
        One person subtracted at CERN
    – Still small and under-resourced
    – Releases are not sufficiently tested
        928 open bugs in Savannah, 84 critical
        281 ready for test, but no time to test!

gLite deployments - LCG
  Pre-production system now being installed
    – ~8 sites so far, more coming
        None in UK?
    – Currently a pure gLite system
        Role seems to change from week to week!
    – Partly working but many problems
    – Some users allowed in soon (now?)
  Production system
    – Various plans considered
    – LCG 2.6 has R-GMA and VOMS
    – Next steps unclear (to me at least!)

Status as of release 1.1
Workload management
  Broker is a development of the EDG/LCG RB
    – Seems to be largely backward-compatible
    – Main new feature is DAGMAN (composite jobs)
    – Push and pull job submission
    – No web services
  Hybrid info system (CEMON + BDII)
    – Static configuration of WMS-CE relationships
    – Should change to R-GMA (?)
  Condor-C replaces Globus gatekeeper on CE
    – Several security problems
    – Current performance is poor
        Submissions often fail
        Cryptic error messages

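As an aside on the "composite jobs" item: a composite job is a directed acyclic graph (DAG) of sub-jobs, where a sub-job may only start once the sub-jobs it depends on have finished. The sketch below is not gLite JDL or the DAGMan input format; it only illustrates the ordering idea using Python's standard library. The job names and the `submission_order` helper are made up for the example.

```python
from graphlib import TopologicalSorter

# Hypothetical composite job: each key is a sub-job, each value is the set
# of sub-jobs that must finish before it can start. Names are illustrative.
composite_job = {
    "simulate": set(),
    "reconstruct": {"simulate"},
    "analyse": {"reconstruct"},
    "merge_logs": {"simulate", "reconstruct"},
}

def submission_order(dag):
    """Return the sub-jobs in an order that respects every dependency."""
    return list(TopologicalSorter(dag).static_order())

order = submission_order(composite_job)
print(order)
```

Any order returned places "simulate" before "reconstruct", and "reconstruct" before both "analyse" and "merge_logs"; a real scheduler would also run independent sub-jobs in parallel.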
Data Management
  First version of metadata catalogue
    – No command-line clients yet, MySQL only
  Fireman file catalogue
    – Competes with new LCG File Catalogue
    – Various experiment-specific solutions
  gLite i/o
    – Security model still under debate (delegation, file ownership)
    – Doesn't yet work with dCache or DPM SRMs, only Castor!
  FTS, developed for service challenges
    – Point-to-point reliable file transfer
    – No interaction with Fireman catalogue
        No File Placement Service (FPS) yet, hence no replication!
  No Data Scheduler
  Interaction with WMS still under discussion

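As an aside on "point-to-point reliable file transfer": the core idea is that a single copy between two endpoints is retried until it succeeds or a limit is reached. The sketch below shows only that retry loop and says nothing about how FTS itself is implemented. `TransferError`, `reliable_transfer` and the `copy_once` callback are hypothetical names introduced for the example.

```python
import time

class TransferError(Exception):
    """Raised when a single copy attempt fails (hypothetical)."""

def reliable_transfer(copy_once, source, dest, max_attempts=3, delay=0.0):
    """Call copy_once(source, dest) until it succeeds or attempts run out.

    Returns the number of attempts used; re-raises the last error if every
    attempt fails.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            copy_once(source, dest)
            return attempt
        except TransferError as err:
            last_error = err
            if attempt < max_attempts:
                time.sleep(delay * attempt)  # simple linear back-off
    raise last_error
```

A caller passes in whatever function performs one copy attempt. Replication to several sites, which the slide notes is missing without an FPS, would need a layer above this that knows about the catalogue.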
R-GMA
  Should be an information system
    – But both LCG and gLite still use BDII
  New Service Discovery API
    – Still discussing service types and names
  LCG now making substantial use of R-GMA for monitoring, accounting etc.
    – Lots of pressure to fix bugs!
    – Some stability problems, needs more testing
        Not ideal to test in production, but …
    – Seems generally in a good state

Security
  gLite VOMS server now used by LCG
    – Some problems with gLite installation scripts
  WMS and DMS have limited support for VOMS
    – SRM, Condor-C and R-GMA don't yet
  Many test VOMS servers exist, but still not in production
    – Will probably need a long learning period to get the best use of VOMS
    – Not a panacea!
  Security requirements mostly still not being addressed
    – Most date back to the start of EDG
  Many known security vulnerabilities

General
  Error messages, logging and fault-tolerance
    – Still very poor
        Proposal on common error handling by Steve Fisher
  Configuration
    – gLite has a common config tool (python/XML)
    – Underlying config not unified
    – Still complex, fragile and error prone
    – Not clear if LCG will switch
        May get many layers: YAIM -> XML -> m/w specific config files?
  Monitoring
    – Getting better, but all from LCG, not in gLite
  Single points of failure
    – Still have many, but some positive movement

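As an aside on the "many layers" concern: the sketch below shows what a three-step chain of site-level settings -> XML -> service-specific file could look like, and why each layer is one more place for things to go wrong. It does not reproduce the real YAIM or gLite configuration formats. The setting names, the `<parameter>` element and all three functions are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

def parse_site_settings(text):
    """Top layer: parse KEY=value lines into a dict."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip().strip('"')
    return settings

def to_xml(settings):
    """Middle layer: wrap every setting in a <parameter> element."""
    root = ET.Element("config")
    for name, value in sorted(settings.items()):
        ET.SubElement(root, "parameter", name=name, value=value)
    return root

def to_service_config(root, prefix):
    """Bottom layer: emit only the parameters one service cares about."""
    lines = []
    for param in root.iter("parameter"):
        name = param.get("name")
        if name.startswith(prefix):
            lines.append(f"{name[len(prefix):].lower()} = {param.get('value')}")
    return "\n".join(lines)

# Hypothetical site-level settings, not taken from any real site.
site_text = '''
# site-level settings
WMS_HOST="wms.example.org"
WMS_PORT=1234
RGMA_HOST="rgma.example.org"
'''
xml_root = to_xml(parse_site_settings(site_text))
wms_conf = to_service_config(xml_root, "WMS_")
print(wms_conf)
```

A typo in the top-level file only shows up two transformations later, in a file the administrator never wrote by hand. That is the fragility the slide is worried about.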
WMS
  Job submission rate too slow
    – Not tested (?), but probably no change
  Failover (RB goes down -> jobs lost)
    – No change so far
  Bulk job submission
    – Partial support via DAGs
    – Parameterised jobs coming
  Space management on WNs
    – Not being addressed
  Access to output from running jobs
    – Not yet
  Advance reservation
    – Some work, but not yet available
  Interaction with data management (pre-staging)
    – Discussion but nothing yet
  CPU speed, memory etc. requirements not passed to batch system
    – May appear in future
  Job distribution is poor (ERT etc.)
    – Partly addressed by new Glue schema
    – Still no direct support in broker

DMS
  Need a metadata solution
    – Much discussion, seems to be converging
  File catalogue performance, bulk operations
    – Partly addressed by Fireman, LFC
    – LFC seems to have better performance but no bulk operations
  Catalogue replication
    – Oracle replication by LCG
    – gLite working towards local catalogues
  Small files
    – Not being addressed
  Reliable file replication
    – Partly addressed by FTS, need FPS as well
  File pinning
    – Not yet in SRMs or FTS
  Posix file access
    – May be addressed by gLite i/o
    – Security model unclear
  High level data management
    – Not yet (wait for Data Scheduler in 2.0)

Information systems
  Not many issues!
  Glue schema not ideal
    – Minor update just released
    – Maybe new major version in ~1 year?
  Stability, scalability
    – Need to test in production, test systems too small

Security
  VO management, groups and roles
    – Should come with VOMS
  VO policies for CEs
    – Some tools (LCAS, LCMAPS)
    – Needs experience
  ACLs on files
    – Should come with gLite File Access Service (FAS)
    – Not ready yet
    – Need to check security model satisfies sites
    – No support in SRM yet
  No outbound IP access
    – Some discussion, nothing yet
  Secure file management
    – Not needed for HEP, but strong need for biomed
    – Some work, not there yet
  Quotas
    – Some work on measurement
    – Enforcement?
  Vulnerabilities
    – Many known, little work
    – New group (Linda Cornwall)

Summary
  First gLite releases are out, but are buggy and incomplete
  Next release is late, not much time to the end of EGEE 1
  Many long-standing issues not addressed
    – Developers tend to follow their own interests rather than user/sysadmin needs
    – Functionality is less than at the end of EDG!
  Probably still >~ 1 year to get production quality
    – OK for EGEE if EGEE 2 is approved
    – Mismatch with LCG timescale
  LHC experiments are building their own Grids
    – How much of gLite do they need?
  Who decides requirements and priorities?