Summary of TEG outcomes
First cut of a prioritisation/categorisation
Ian Bird, CERN
WLCG Workshop, New York, May 20th 2012

Comments
This is far from being a comprehensive or complete summary.
Not discussed here:
–Directions/decisions that are already taken
Extracted here are essentially:
–Action items
–Items in need of further work/discussion
–Unanswered questions
–A few provocative comments
I have sometimes made strong conclusions from tentative statements …

General needs
Overall strategy
–Robustness and simplicity of use: ➔ move towards “Computing as a Service”, particularly at smaller sites with limited effort
–Implies trivial set-up and configuration of services is essential
–Environments need to be self-describing (or the job must be able to determine its environment) – no complex info publishing or requirements
Better monitoring:
–Network monitoring, including traffic flows etc. Need to correlate with how data management is done.
–Mechanism to do analysis on monitoring data
–Better coordinate dashboards, availability tests, etc.
➔ Set up a WLCG monitoring group to coordinate and oversee this

Data and Storage
Distinguish between tape archives and disk pools
–Data on tape is moved explicitly to a disk pool, not invisibly migrated
TBD: Distinguish between Tier 2s that really provide data storage and those that are merely caches
–The latter could have a simple storage service, especially if http is usable as a protocol (e.g. squid)
–Determine what lower level of service is required at such Tier 2 caches

Data and storage – 2
Data federation with xrootd is a clear direction, for some part of the data
–Later using http?
Essential to have robustness of storage services at a site
–Argument for smaller sites to act as “cache” rather than “storage”
Use of remote I/O (see the sketch below)
–Several usage scenarios, but needs monitoring data to ensure efficiency
–Hopefully most of this is being integrated into xrootd
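To illustrate the remote I/O point, a minimal Python sketch is given below: a job reads through a federation redirector and falls back to an explicit local copy. This is not an agreed WLCG tool; the redirector host and logical file name are hypothetical placeholders, and it assumes PyROOT built with xrootd support plus an xrdcp client on the worker node.

```python
# A minimal sketch, not an agreed WLCG tool: read a file through a federation
# redirector with direct remote I/O, falling back to an explicit local copy.
# The redirector host and logical file name are hypothetical placeholders.
import subprocess

import ROOT  # PyROOT, assuming a ROOT build with xrootd support

REDIRECTOR = "global-redirector.example.org"   # hypothetical federation endpoint
LFN = "/store/data/example/file.root"          # hypothetical logical file name
URL = "root://%s/%s" % (REDIRECTOR, LFN)

def open_via_federation():
    """Try direct remote (WAN) I/O first; otherwise stage a local copy."""
    f = ROOT.TFile.Open(URL)
    if f and not f.IsZombie():
        return f                                   # remote read through the federation
    # Fallback: explicit copy to the worker node, then local I/O
    subprocess.check_call(["xrdcp", "-f", URL, "local_copy.root"])
    return ROOT.TFile.Open("local_copy.root")
```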

Data and storage – 3
And SRM??
–Keep as interface to archives and managed storage
–But the useful functionality has been delineated
–Not there for federated storage with xrootd
–FTS-3 can talk directly to gridftp anyway
–No specific need to replace SRM as an interface
But there may be an interest in cloud storage interfaces at some point (technology watch)
Allow/encourage (??) sites to offer other interfaces

Data and Storage
Conclusions:
–Don’t question use of gridftp for now
–Need all systems to support xrootd fully. Anything actually missing here?
–Eventual use of http is potentially interesting: continue work on plugins and testing at low (?) or high (?) priority (but limited effort?)
–FTS-3 is high priority: follow up on requirements, use for tape → disk movement, use of replicas if the source file is missing
–Storage accounting: EMI StAR, but need an implementation
–I/O benchmarking, requirements, monitoring: to improve I/O performance and clarify the statement of needs to vendors

Open Questions: Access Patterns
Difference between staging data for I/O to and from the WN and:
–I/O over the LAN to local storage
–I/O over the WAN to remote storage
Connected questions:
–What fraction of each file is read? How sparse are sparse reads?
–How well is this fraction known with respect to the type of file and the processing stage?
–Impact of the new vector read (TTreeCache): how many round-trips per GB of data used? (see the sketch below)
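The round-trips-per-GB question could be approached roughly as in the hedged PyROOT sketch below, using ROOT's TTreeCache (vector reads) and the byte/read-call counters of TFile. The file URL and the tree name "Events" are hypothetical; a real study would use experiment files and branch lists.

```python
# Hedged sketch of measuring read calls versus bytes read with ROOT's TTreeCache.
# The file URL and the tree name "Events" are hypothetical examples.
import ROOT

f = ROOT.TFile.Open("root://some-se.example.org//store/data/example/file.root")
tree = f.Get("Events")

tree.SetCacheSize(20 * 1024 * 1024)   # 20 MB TTreeCache
tree.AddBranchToCache("*", True)      # cache all branches; a sparse analysis would list a subset
tree.SetCacheLearnEntries(100)        # let the cache learn the access pattern first

for i in range(tree.GetEntries()):
    tree.GetEntry(i)

bytes_read = f.GetBytesRead()
read_calls = f.GetReadCalls()
if bytes_read:
    print("%d read calls, %.2f GB read, %.1f round-trips per GB"
          % (read_calls, bytes_read / 1e9, read_calls / (bytes_read / 1e9)))
```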

Open Questions: Federation
Repair-only mode
–Can we verify the TEG expected data volume?
–Repair by catalogue–SE comparisons: how does this differ from re-populating by FTS? (see the sketch below)
Caching
–Caching whole files, or caching what has been read?
–Caching and access control? Caching for world-readable (reduced AA) data only?
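The catalogue–SE comparison is essentially a set difference, as in the short sketch below. It is only an illustration of the bookkeeping step, not a proposed tool: the dump file names and their one-entry-per-line format are hypothetical.

```python
# Hedged sketch of the "repair" check: compare a catalogue dump with an SE
# namespace dump and list files needing re-population (via FTS or the federation).
# Dump file names and their format (one entry per line) are hypothetical.

def load_entries(path):
    with open(path) as fh:
        return {line.strip() for line in fh if line.strip()}

catalogue = load_entries("catalogue_dump.txt")      # what the experiment thinks the site has
namespace = load_entries("se_namespace_dump.txt")   # what the SE actually has

missing = catalogue - namespace   # candidates for repair / re-transfer
dark = namespace - catalogue      # "dark" data, unknown to the catalogue

print("%d files to re-populate, %d dark files" % (len(missing), len(dark)))
```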

Open Questions: WN
Staging to the WN
–For read access: local disk I/O is the most efficient alternative, with excellent clients
–For writing: how to stage out data without losing it due to running out of queue time? (see the sketch below)
–Discussion needs input from data access monitoring to understand the role of sparse reads
–Measurements needed to directly compare access strategies
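One possible shape of a stage-out that respects the remaining queue time is sketched below: retries with back-off, a fallback endpoint, and a hard deadline after which the job reports failure instead of silently losing output. The endpoint names and time budget are hypothetical, and it assumes an xrdcp client on the node.

```python
# Hedged sketch of a stage-out that retries and falls back to an alternate SE,
# keeping an eye on the remaining queue time so output is not lost silently.
# Endpoint names and the time budget are hypothetical placeholders.
import subprocess
import time

DESTINATIONS = [
    "root://primary-se.example.org//store/user/output.root",    # hypothetical
    "root://fallback-se.example.org//store/user/output.root",   # hypothetical
]
DEADLINE = time.time() + 30 * 60   # e.g. 30 minutes of queue time left

def stage_out(local_file):
    for dest in DESTINATIONS:
        for attempt in range(3):
            if time.time() > DEADLINE:
                return False   # out of time: report failure rather than run over the queue limit
            try:
                subprocess.check_call(["xrdcp", "-f", local_file, dest])
                return True
            except subprocess.CalledProcessError:
                time.sleep(60 * (attempt + 1))   # simple back-off before retrying
    return False
```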

Open Questions: World-Readable Data with relaxed AA
Expected benefits: fewer round-trips, reduced computational overhead, much improved latency for access to many small files, simplicity for many operations (caching, etc.)
How to manage the transition?
–To be efficient it has to work without moving the data
How will clients be aware of it and avoid the AA costs?
Restricted to a subset of access protocols?
What fraction of the data and processing qualify?
–Results from data access studies needed as input

Data security
Can we agree a model that distinguishes between:
–Read-only data (that can be cached); need to specify how caches are populated
–Written data that needs to be stored
–This model would allow simple AA for read-only data (lower overhead)
Can we agree to distinguish between sites:
–That store and manage data: these need real data management systems
–That cache data for analysis or processing: these might need only off-the-shelf storage (or squids) accessible via xrootd, and would then benefit from using http as transport (see the sketch below)
–Also need to define how such a site (or jobs on a site) move output files to real storage
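The cached, read-only half of this model can be pictured with the short sketch below: world-readable data fetched over plain HTTP through a site squid, with no per-request X.509 handshake. The proxy and data URLs are hypothetical placeholders; only the Python standard library is used.

```python
# Hedged sketch of reading world-readable (relaxed AA) data over plain HTTP
# through a site squid cache. Proxy and data URLs are hypothetical placeholders.
import urllib.request

SITE_SQUID = "http://squid.example-site.org:3128"           # hypothetical site cache
DATA_URL = "http://data.example.org/store/conditions/file"  # hypothetical read-only payload

opener = urllib.request.build_opener(
    urllib.request.ProxyHandler({"http": SITE_SQUID})
)
with opener.open(DATA_URL) as response:
    payload = response.read()   # served from the squid on a cache hit
print("fetched %d bytes via the site cache" % len(payload))
```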

Workload management
glexec:
–Deploy fully in setuid mode. Define the timescale now and follow up.
No further need for the WMS: decommission end 2012?
Pilots:
–Report is too conservative?
–Support streamed submission: requires a modified CE; need to test at scale by 2013 (CE changes have taken years to reach production)
–Common pilot framework? Based on glideinWMS?
–So why do we still need a complex CE? No answer? Is there a simplification to be made?
–The above is “anti-CaaS”?

WLM – 2
Whole node and multi-core
–Complex solution proposed, including new JDL and new CE interfaces, in order to allow experiments to make arbitrary requests
–Why? This goes against “CaaS”?
➔ Simplification: the job wakes up, determines what is available, and runs (see the sketch below)
➔ Why not?
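The "job wakes up, determines what is available, runs" idea amounts to a small probe like the one below: inspect the cores and memory actually granted to the slot and size the payload accordingly. This is a hedged, Linux-specific sketch using only the standard library; the sizing rule (one worker per core, capped by 2 GB per worker) is illustrative, not an agreed value.

```python
# Minimal sketch: the job probes its own slot and sizes the payload itself,
# instead of requesting resources through new JDL/CE interfaces.
# Linux-specific; the thresholds are illustrative, not agreed values.
import os

def available_cores():
    try:
        return len(os.sched_getaffinity(0))   # cores this process may actually use (Linux)
    except AttributeError:
        return os.cpu_count() or 1

def available_memory_mb():
    with open("/proc/meminfo") as fh:
        for line in fh:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1]) // 1024
    return 0

cores = available_cores()
mem_mb = available_memory_mb()
# e.g. one worker per core, but never more than one worker per 2 GB of memory
workers = max(1, min(cores, mem_mb // 2048))
print("running %d worker processes (%d cores, %d MB available)" % (workers, cores, mem_mb))
```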

WLM – 3
CPU pinning + I/O-bound vs CPU-bound jobs
–Why? Is it really practical to think of optimisation at this level?
–Adding complexity for undefined benefit?
–Why expose it at the grid layer?
➔ HEPiX
➔ SFT concurrency project to address CPU efficiency in general

WLM – 4
Virtual CE: better support for “any” LRMS
–A clear, essential need
Virtualisation use cases
–Essentially a site decision
–Consider performance issues
Cloud use cases
–Unresolved issues (AAA, etc.)
–More work is required here
➔ HEPiX and/or WLCG WG

Information system
Really distinguish between:
–“Stable” information needed for service discovery (see the sketch below)
–“Changing” information for monitoring etc.
–No use case at all for information related to job brokering
Need a clear proposal for how to proceed
➔ Set up a small, rapid WG to:
a) Make a clear statement of the status – some work has been done here
b) Define the plan and clarify specific goals
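For the "stable information for service discovery" case, the sketch below shows what such a lookup typically looks like: an ldapsearch against a top-level BDII for storage-element entries. The hostname is hypothetical; port 2170 and the GLUE 1.3 object class and attribute reflect the usual BDII setup, but they are stated here as assumptions rather than a prescription.

```python
# Hedged sketch of service discovery against a top-level BDII via ldapsearch.
# Hostname is hypothetical; port 2170 and the GLUE 1.3 names are assumptions
# about the usual setup, to be checked against the infrastructure documentation.
import subprocess

BDII_HOST = "top-bdii.example.org"   # hypothetical
cmd = [
    "ldapsearch", "-x", "-LLL",
    "-H", "ldap://%s:2170" % BDII_HOST,
    "-b", "o=grid",
    "(objectClass=GlueSE)",          # GLUE 1.3 storage element entries
    "GlueSEUniqueID",
]
print(subprocess.run(cmd, capture_output=True, text=True).stdout)
```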

Databases
Ensure support for COOL/CORAL + server:
–Core support will continue in IT; ideally supplemented by some experiment effort
–POOL no longer supported by IT
Frontier/Squid as a full WLCG service:
–Should be done now; partly already is
–Needs to be added to GOCDB, monitoring, etc.
–Who is responsible?
Hadoop (and NoSQL tools):
–Not specifically a DB issue – broader use cases
–CERN will (does) have a small instance; part of the monitoring strategy
➔ Important to have a forum to share experiences etc.
➔ GDB

Operations & Tools
WLCG service coordination team:
–Should be set up/strengthened
–Should include effort from the entire collaboration
–Clarify roles of other meetings
Strong desire for “Computing as a Service” at smaller sites
Service commissioning/staged rollout:
–Needs to be formalised by WLCG as part of service coordination

Operations & tools – 2
Middleware
–Before investing too much, see how much of the actual middleware still has a long-term future
–Simplify service management (a goal of CaaS); several different recommendations involved
–Simplify software maintenance
➔ This requires continuing work: need to write a statement on software management policy for the future
–Lifecycle model post-EMI, and the new OSG model: proposals very convergent!

Security – Risk Analysis
Highlighted the need for fine-grained traceability
–Essential to contain and investigate incidents, and to prevent recurrence
Aggravating factor for every risk:
–Publicity and press impact arising from security incidents
11 separate risks identified and scored

Security – areas needing work
Fulfil traceability requirements on all services (see the sketch below)
–Sufficient logging for middleware services
–Improve logging of WNs and UIs
–Too many sites simply opt out of incident response: “no data, no investigation -> no work done!”
–Prepare for the future computing model (e.g. private clouds)
Enable appropriate security controls (AuthZ)
–Need to incorporate identity federations
–Enable convenient central banning
People issues:
–Must improve our security patching and practices at the sites
–Collaborate with external communities on incident response and policies
–Building trust has proven extremely fruitful – this needs to continue
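The kind of fine-grained traceability record asked for here is illustrated below: every action tied to who, what, where and when (UTC), written to a local log that can be shipped to the site's logging infrastructure. The field names and format are illustrative, not a mandated WLCG schema.

```python
# Hedged sketch of a per-payload traceability record inside a pilot: who ran
# what, where and when (UTC). Field names are illustrative, not a mandated format.
import json
import logging
import socket
import time

logging.basicConfig(filename="pilot-trace.log", level=logging.INFO,
                    format="%(asctime)s %(message)s")

def trace(event, payload_id, user_dn, **extra):
    record = {
        "event": event,
        "payload_id": payload_id,
        "user_dn": user_dn,
        "host": socket.gethostname(),
        "utc": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }
    record.update(extra)
    logging.info(json.dumps(record))

# Example: record the start of a user payload inside a pilot
trace("payload_start", payload_id="job-0042",
      user_dn="/DC=org/DC=example/CN=Some User", command="run_analysis.sh")
```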

Discussion/work group topics
TEG     | WG / Liaison                  | Purpose
WLM     | HEPiX                         | Liaison(s) with HEPiX (and others) on CPU pinning and “cloud” computing
WLM     | “CE”                          | At least one WG to define CE extensions (and/or alternatives) in more detail: scoping work, defining timescales, testing and deployment plans
Several | IS                            | IS WG to (re-)define requirements, their implementation and deployment
DSM     | Topical storage groups        | e.g. R/O placement layer; SRM alternates; liaison with ROOT I/O WG; separation of R/O & R/W data incl. R/O caches; federation as “repair mechanism”
OPS     | m/w services & configuration  | WGs to review m/w services and m/w configuration tools/mechanisms (not clear how useful now)
OPS     | Coordination                  | Not a WG per se, but a body that will continue and will monitor/coordinate other efforts
OPS     | Service Commissioning         | A “virtual team” created (and disbanded) as required – and with targeted expertise – to validate, commission and trouble-shoot
DB      | “User group”                  | To share experiences
All     | Monitoring                    | Coordinate all monitoring activities, including missing functions (e.g. network traffic), plus monitoring analysis
DSM     | Data access security          | Define/agree the data access/placement security model
All     | HEPiX?                        | Technology watch: storage interfaces, protocols, etc.

Some questions for the workshop
What should be done to approach “Computing as a Service” for sites?
Can we agree a strategy for a CE that does not add complexity but allows pilot factories, etc.?
Can we agree a simplified subset of SRM?
Can we separate archives and disk storage?
Can we distinguish between sites that store and sites that cache data only?
Can we agree a straightforward data security model?
How far can we converge “middleware” across grid infrastructures?
What are disruptive changes that must be done in LS1? (any?)

Need to do in LS1
Testing new concepts at scale:
–FTS-3 scale testing
–On large sites, separation between archives and the placement layer
–Federation: run production with some fraction of data not local (needs good monitoring)
–Test reduced data access: AuthZ requirements
–Testing use of multicore/whole-node environments?

Hello, Good-bye (to be completed…)
Hello: CVMFS, Frontier/Squid, …
Good-bye: POOL, LFC, …, WMS

Effort?
Re-iterate the need for more collaborative activities …