Summary of TEG outcomes: first cut of a prioritisation/categorisation. Ian Bird, CERN. WLCG Workshop, New York, May 20th 2012.

1 Summary of TEG outcomes
First cut of a prioritisation/categorisation
Ian Bird, CERN
WLCG Workshop, New York
May 20th 2012

2 Comments
This is far from being a comprehensive or complete summary
Not discussed here:
– Directions/decisions that are already taken
Extracted here are essentially:
– Action items
– Items in need of further work/discussion
– Unanswered questions
– A few provocative comments
I have sometimes made strong conclusions from tentative statements …
20th May 2012 Ian.Bird@cern.ch

3 General needs
Overall strategy
– Robustness and simplicity of use: move towards “Computing as a Service”, particularly at smaller sites with limited effort
    Implies that trivial set-up and configuration of services is essential
    Environments need to be self-describing (or the job must be able to determine its environment): no complex info publishing or requirements
Better monitoring:
– Network monitoring, including traffic flows etc. Need to correlate with how DM is done.
– Mechanism to do analysis on monitoring data
– Better coordinate dashboards, availability tests, etc.
➔ Set up a WLCG monitoring group to coordinate and oversee this

4 Data and Storage
Distinguish between tape archives and disk pools
– Data on tape is moved explicitly to a disk pool, not invisibly migrated
TBD: Distinguish between Tier 2s that really provide data storage and those that are merely caches
– The latter could have a simple storage service, esp. if http as a protocol is usable (e.g. squid)
– Determine what lower level of service is required at such Tier 2 caches

5 Data and storage – 2
Data federation with xrootd is a clear direction, for some part of the data
– Later using http?
Essential to have robustness of storage services at a site
– Argument for smaller sites to act as “cache” rather than “storage”
Use of remote I/O
– Several use scenarios, but needs monitoring data to ensure efficiency
– Hopefully most of this being integrated into xrootd

6 Data and storage – 3
And SRM??
– Keep as interface to archives and managed storage
    But useful functionality has been delineated
– Not there for federated storage with xrootd
– FTS-3 can talk directly to gridftp anyway
– No specific need to replace SRM as an interface
    But may be an interest in cloud storage interfaces at some point (technology watch)
    Allow/encourage (??) sites to offer other interfaces

7 Data and Storage
Conclusions:
– Don’t question use of gridftp for now
– Need all systems to support xrootd fully
    Anything actually missing here?
– Eventual use of http is potentially interesting
    Continue work on plugins and testing at low (?) or high (?) priority (but limited effort?)
– FTS-3 is high priority
    Follow up on requirements; use for tape-disk movement; use of replicas if source file is missing
– Storage accounting
    EMI StAR, but need an implementation
– I/O benchmarking, requirements, monitoring
    To improve I/O performance and clarify statement of needs to vendors

8 Open Questions: Access Patterns
Difference between staging data to and from the WN for I/O, compared with:
– I/O over the LAN to local storage
– I/O over the WAN to remote storage
Connected questions:
– What fraction of each file is read? How sparse are sparse reads?
– How well is this fraction known with respect to the type of file and the processing stage?
– Impact of new vector read (TTreeCache): how many round-trips per GB of used data?
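The two quantities asked about on this slide (fraction of a file read, round-trips per GB of used data) could be derived from a read trace along the following lines. This is only a sketch: the trace format, a list of (offset, length) pairs per file, and all function names are assumptions made for illustration, not the output of any existing WLCG monitoring tool.

```python
def merge_ranges(ranges):
    """Merge overlapping or adjacent (offset, length) reads into (start, end) spans."""
    spans = sorted((off, off + length) for off, length in ranges if length > 0)
    merged = []
    for start, end in spans:
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous span: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def fraction_read(ranges, file_size):
    """Fraction of the file's bytes touched at least once (re-reads count once)."""
    if file_size <= 0:
        raise ValueError("file_size must be positive")
    unique = sum(min(e, file_size) - min(s, file_size)
                 for s, e in merge_ranges(ranges))
    return unique / file_size


def round_trips_per_gb(n_requests, bytes_used):
    """Client requests per GB (10**9 bytes) of data actually used.

    A vector read carrying many ranges in one request counts as one round-trip.
    """
    if bytes_used <= 0:
        raise ValueError("bytes_used must be positive")
    return n_requests / (bytes_used / 1e9)


# Hypothetical trace: three reads on a 1000-byte file, two of them overlapping.
reads = [(0, 100), (50, 100), (400, 100)]
print(fraction_read(reads, 1000))
print(round_trips_per_gb(2000, 4_000_000_000))
```

Counting re-read bytes only once is a deliberate choice here: it measures how sparse the access is, which is what matters when deciding between staging whole files and reading remotely.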

9 Open Questions: Federation
Repair-only mode
– Can we verify the data volume expected by the TEG?
– Repair by catalogue-SE comparisons: what is the difference to re-populating by FTS?
Caching
– Caching whole files, or only what has been read?
– Caching and access control? Caching for world-readable (reduced AA) data only?

10 Open Questions: WN
Staging to WN
– For read access: local disk I/O is the most efficient alternative, with excellent clients
– For writing: how to stage out data without losing data due to running out of queue time?
– Discussion needs input from data access monitoring to understand the role of sparse reads
– Measurements needed to directly compare access strategies
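One simple way to approach the stage-out question is for the job to reserve part of its remaining queue time for the transfer. The sketch below assumes the job knows its remaining wall time, its output size and a rough bandwidth estimate; the function names and the safety factor of 2 are illustrative assumptions, not an agreed WLCG mechanism.

```python
import math


def stage_out_seconds(output_bytes, bandwidth_bytes_per_s, safety_factor=2.0):
    """Time to reserve for copying the output, padded by a safety factor."""
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return math.ceil(safety_factor * output_bytes / bandwidth_bytes_per_s)


def should_stage_out_now(remaining_queue_s, output_bytes,
                         bandwidth_bytes_per_s, safety_factor=2.0):
    """True once the remaining queue time has shrunk to the reserved budget."""
    budget = stage_out_seconds(output_bytes, bandwidth_bytes_per_s, safety_factor)
    return remaining_queue_s <= budget


# Hypothetical job: 10 GB of output over a 50 MB/s link.
print(stage_out_seconds(10_000_000_000, 50_000_000))
print(should_stage_out_now(3600, 10_000_000_000, 50_000_000))
print(should_stage_out_now(300, 10_000_000_000, 50_000_000))
```

A job would call should_stage_out_now between processing steps and stop producing new output once it returns True. How the job learns its remaining queue time is site-dependent and deliberately left out.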

11 Open Questions: World Readable Data with relaxed AA
Expected benefits: fewer round-trips, reduced computational overhead, much improved latency for access to many small files, simplicity for many operations (caching, etc.)
How to manage the transition?
– To be efficient, it has to work without moving the data
How will clients be aware and suppress AA costs?
Restricted to a subset of access protocols?
What fraction of the data and processing qualify?
– Results from data access studies needed as input

12 Data security
Can we agree a model that distinguishes between:
– Read-only data (that can be cached)
    Need to specify how caches are populated
– Written data that needs to be stored
– This model would allow simple AA for r-o (lower overhead)
Can we agree to distinguish between sites
– That store and manage data
    These need real data management systems
– That cache data for analysis or processing
    These might need only off-the-shelf storage (or squids) accessible via xrootd
    Would benefit then from use of http as transport
    Also would need to define how such a site (or jobs on a site) move output files to real storage

13 Workload management
Glexec:
– Deploy fully in setuid mode. Define timescale now and follow up.
No further need for WMS: decommission end 2012?
Pilots:
– Report is too conservative?
– Support streamed submission:
    Requires modified CE; need to test at scale by 2013 (CE changes have taken years to reach production)
– Common pilot framework? Based on glideinWMS?
– So why do we still need a complex CE? No answer?
    Is there a simplification to be made?
    The above is “anti-CaaS”?

14 WLM – 2
Whole node and multi-core
– Complex solution proposed, including new JDL and new CE interfaces, in order to allow experiments to make arbitrary requests.
– Why? This goes against “CaaS”?
➔ Simplification: job wakes up, determines what is available, runs.
➔ Why not?
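The "job wakes up, determines what is available, runs" idea can be illustrated with a job that inspects its own slot instead of requesting resources through JDL. This is a minimal sketch using only the Python standard library; the function names and the per-worker memory figure are assumptions for illustration, and the memory probe relies on POSIX sysconf values that may be missing on some platforms.

```python
import os


def discover_environment():
    """Return the cores and physical memory visible to this process."""
    if hasattr(os, "sched_getaffinity"):
        # Linux: respects CPU pinning / affinity set by the batch system.
        cores = len(os.sched_getaffinity(0))
    else:
        cores = os.cpu_count() or 1
    mem_bytes = None
    try:
        mem_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    except (ValueError, OSError, AttributeError):
        pass  # memory size not discoverable on this platform
    return {"cores": cores, "mem_bytes": mem_bytes}


def plan_workers(env, mem_per_worker_bytes):
    """Choose a worker count bounded by cores and, if known, by memory."""
    workers = env["cores"]
    if env["mem_bytes"] is not None:
        workers = min(workers, env["mem_bytes"] // mem_per_worker_bytes)
    return max(1, workers)


env = discover_environment()
# Assumed requirement: 2 GiB of memory per worker process.
print(env, plan_workers(env, 2 * 2**30))
```

Note that physical memory is the whole node's, not necessarily the job's share. A real pilot would also need to honour batch-system limits such as cgroups, which this sketch ignores.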

15 WLM – 3
CPU pinning + I/O-bound vs CPU-bound jobs
– Why? Is it really practical to think of optimisation at this level?
– Adding complexity for undefined benefit?
– Why expose it at the grid layer?
➔ HEPiX
➔ SFT concurrency project to address CPU efficiency in general

16 WLM – 4
Virtual CE: better support for “any” LRMS
– Clear essential need
Virtualisation use cases
– Essentially a site decision
– Consider performance issues
Cloud use cases
– Unresolved issues (AAA, etc.)
– More work is required here
HEPiX and/or WLCG WG

17 Information system
Really distinguish between:
– “Stable” information needed for service discovery
– “Changing” information for monitoring etc.
    No use case at all for info related to job brokering
– Need a clear proposal for how to proceed
➔ Set up a small, rapid WG to
a) Make a clear statement of the status (some work has been done here)
b) Define the plan and clarify specific goals.

18 Databases
Ensure support for COOL/CORAL+server:
– Core support will continue in IT; ideally supplemented by some experiment effort
– POOL no longer supported by IT
Frontier/Squid as full WLCG service:
– Should be done now; partly already
– Needs to be added to GOCDB, monitoring etc.
– Who is responsible?
Hadoop (and NoSQL tools):
– Not specifically a DB issue; broader use cases
– CERN will (does) have a small instance; part of monitoring strategy
➔ Important to have a forum to share experiences etc. ➔ GDB

19 Operations & Tools
WLCG service coordination team:
– Should be set up/strengthened
– Should include effort from the entire collaboration
– Clarify roles of other meetings
Strong desire for “Computing as a Service” at smaller sites
Service commissioning/staged rollout
– Needs to be formalised by WLCG as part of service coordination

20 Operations & tools – 2
Middleware
– Before investing too much, see how much actual middleware still has a long-term future
– Simplify service management (goal of CaaS)
    Several different recommendations involved
– Simplify software maintenance
➔ This requires continuing work
Need to write a statement on software management policy for the future
– Lifecycle model post EMI, and new OSG model
    Proposals very convergent!

21 Security – Risk Analysis
Highlighted the need for fine-grained traceability
– Essential to contain and investigate incidents, and prevent re-occurrence
Aggravating factor for every risk:
– Publicity and press impact arising from security incidents
11 separate risks identified and scored

22 Security – areas needing work
Fulfil traceability requirements on all services
– Sufficient logging for middleware services
– Improve logging of WNs and UIs
– Too many sites simply opt out of incident response: “no data, no investigation -> no work done!”
– Prepare for future computing model (e.g. private clouds)
Enable appropriate security controls (AuthZ)
– Need to incorporate identity federations
– Enable convenient central banning
People issues:
– Must improve our security patching and practices at the sites
– Collaborate with external communities for incident response and policies
– Building trust has proven extremely fruitful; needs to continue

23 Discussion/work group topics
TEG | WG / Liaison | Purpose
WLM | HEPiX | Liaison(s) with HEPiX (and others) on CPU pinning and “cloud” computing
WLM | “CE” | At least one WG to define CE extensions (and/or alternatives) in more detail: scoping work, defining timescales, testing and deployment plans
Several | IS | IS WG to (re-)define requirements, their implementation and deployment
DSM | Topical storage groups | e.g. R/O placement layer; SRM alternates; liaison with ROOT I/O WG; separation of R/O & R/W data incl. R/O caches; federation as “repair mechanism”
OPS | m/w services & configuration | WGs to review m/w services and m/w configuration tools / mechanisms (not clear how useful now)
OPS | Coordination | Not a WG per se, but still a body that will continue and will monitor / coordinate other efforts
OPS | Service Commissioning | A “virtual team” created (and disbanded) as required – and with targeted expertise – to validate, commission and trouble-shoot
DB | “user group” | To share experiences
All | Monitoring | Coordinate all monitoring activities, including missing functions (e.g. network traffic), + monitoring analysis
DSM | Data access security | Define/agree data access/placement security model
All | HEPiX? | Technology watch: storage interfaces, protocols, etc., etc.

24 Some questions for the workshop
What should be done to approach “Computing as a Service” for sites?
Can we agree a strategy for a CE that does not add complexity but allows pilot factories, etc.?
Can we agree a simplified subset of SRM?
Can we separate archives and disk storage?
Can we distinguish between sites that store and sites that cache data only?
Can we agree a straightforward data security model?
How far can we converge “middleware” across grid infrastructures?
What are disruptive changes that must be done in LS1? (any?)

25 Need to do in LS1:
Testing new concepts at scale:
– FTS-3 scale testing
– On large sites, separation between archives and placement layer
– Federation: run production with some fraction of data not local
    Needs good monitoring
– Test reduced data access AuthZ requirements
– Testing use of multicore/whole node environments?

26 Hello, Good-bye: (to be completed…)
Hello: CVMFS, Frontier/Squid, …
Good-bye: POOL, LFC, …, WMS

27 Effort?
Re-iterate the need for more collaborative activities …

