WLCG Collaboration Workshop: The 1st 18 months with data & looking to the future

The 1st 18 months with data & looking to the future
Ian Bird, CERN
WLCG Collaboration Workshop; DESY; 11th July 2011
Accelerating Science and Innovation

Outline
- Results from the 1st 18 months with data
- Some lessons learned
- Evolution of WLCG distributed computing
- Relations with EGI & EMI
- Middleware support, etc.

Success!
- WLCG is recognised as having enabled rapid physics output by:
  - the LHCC
  - the CERN Scientific Policy Committee
  - the CERN Council
  - the Resource Review Boards (i.e. the funding agencies)
  - as well as many other places/articles
- http://cern.ch/lcg – repository, dissemination, LCGNews
- Thanks to the huge and sustained collaborative effort by all sites and experiments!

Collaboration: Worldwide resources
- WLCG Collaboration status: Tier 0; 11 Tier 1s; 68 Tier 2 federations
- Additional MoU signatures: ~ a few per year
- Today: >140 sites; >250k CPU cores; >150 PB disk

Speaker notes: This is also a truly worldwide undertaking. WLCG has computing sites in almost every continent, and today provides significant levels of resources – computing clusters, storage (close to 100 PB of disk available to the experiments), and networking. Today we have 49 MoU signatories, representing 34 countries: Australia, Austria, Belgium, Brazil, Canada, China, Czech Rep., Denmark, Estonia, Finland, France, Germany, Hungary, Italy, India, Israel, Japan, Rep. Korea, Netherlands, Norway, Pakistan, Poland, Portugal, Romania, Russia, Slovenia, Spain, Sweden, Switzerland, Taipei, Turkey, UK, Ukraine, USA.

What is WLCG today?
(Slide diagram of the main components:)
- Service coordination, service management, operational security
- World-wide trust federation for CAs and VOs; complete policy framework
- Framework: support processes & tools, common tools, monitoring & accounting
- Collaboration: coordination, management & reporting; common requirements; coordination of resources & funding; Memorandum of Understanding; coordination with service & technology providers
- Physical resources: CPU, disk, tape, networks
- Distributed computing services
- Evolution of implementation and technology

Status of Tier 0
- Data to tape: ~2 PB/month; stored ~15 PB in 2010 (chart: data to tape per month by experiment – ALICE, AMS, ATLAS, CMS, Compass, LHCb)
- p-p data to tape at close to 2 PB/month; heavy-ion peak rate: 225 TB/day
- Data traffic in Tier 0 and to the grid larger than 2010 values: up to 4 GB/s from the DAQs to tape (chart: Tier 0 storage traffic, GB/s)
- Tier 0 storage: accepts data at an average of 2.6 GB/s, peaks > 11 GB/s; serves data at an average of 7 GB/s, peaks > 25 GB/s
- CERN Tier 0 moves > 1 PB of data per day
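As a rough cross-check of the "> 1 PB per day" figure, the quoted average rates can be converted into daily volumes. The snippet below is illustrative arithmetic only; the 2.6 GB/s and 7 GB/s averages come from the slide, the rest is unit conversion.

```python
# Rough cross-check of the Tier 0 figures: convert the quoted average
# accept/serve rates (GB/s) into approximate daily volumes.
SECONDS_PER_DAY = 86_400

accept_gb_per_s = 2.6    # average ingest rate quoted on the slide
serve_gb_per_s = 7.0     # average serving rate quoted on the slide

accepted_tb_per_day = accept_gb_per_s * SECONDS_PER_DAY / 1_000   # ~225 TB/day
served_tb_per_day = serve_gb_per_s * SECONDS_PER_DAY / 1_000      # ~605 TB/day
total_pb_per_day = (accepted_tb_per_day + served_tb_per_day) / 1_000

print(f"accepted: ~{accepted_tb_per_day:.0f} TB/day")
print(f"served:   ~{served_tb_per_day:.0f} TB/day")
print(f"total:    ~{total_pb_per_day:.2f} PB/day")  # ~0.83 PB/day on average; peaks take it past 1 PB/day
```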

Grid Usage
- Use remains consistently high: >1 M jobs/day; ~150k CPU
  (charts: CPU used at Tier 1s + Tier 2s, HS06-hours/month, over the last 12 months; annotations: 1 M jobs/day, 100k CPU-days/day)
- At the end of 2010 we saw all Tier 1 and Tier 2 job slots being filled
- CPU usage is now >> double that of mid-2010 (inset shows the build-up over previous years)
- Large numbers of analysis users: ATLAS, CMS ~800; LHCb, ALICE ~250
- As well as LHC data, large simulation productions are always ongoing

Speaker notes: In terms of the workload on the grid, we see a load in excess of 1 million jobs per day being run (and this is still increasing). This is notable since it is the rate anticipated for a nominal year of data taking, and it translates into significant amounts of computer time. In fact, towards the end of 2010 we reached the situation where all of the available resources in the Tier 1 and Tier 2 centres were often fully occupied; we anticipate this problem becoming more interesting in 2011 and 2012. The other notable success in 2010 was the number of individuals actually using the grid to do physics. This had been a point of concern, as the usability of grids has always been difficult, but the experiments invested a lot of effort to provide seamless integration with their tools. We see some 800 different individuals per month using the service in the large experiments, and some 250 in each of the smaller collaborations.
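The usage figures mix several units (jobs/day, CPU-days/day, HS06-hours/month). A minimal sketch of how they relate, assuming an illustrative average HS06 rating per core and average job length – both hypothetical values, not figures from the slide:

```python
# Illustrative unit relations only; HS06_PER_CORE and AVG_JOB_HOURS are
# assumed values for this sketch, not numbers quoted on the slide.
HS06_PER_CORE = 10.0     # assumed average HS06 rating of one core
AVG_JOB_HOURS = 2.0      # assumed average job wall-clock length

cpu_days_per_day = 100_000   # chart annotation: ~100k CPU-days delivered per day
jobs_per_day = 1_000_000     # slide: >1 M jobs/day

# 100k CPU-days per day corresponds to roughly 100k cores kept busy around the clock.
busy_cores = cpu_days_per_day

# Delivered CPU time expressed in HS06-hours per (30-day) month.
hs06_hours_per_month = cpu_days_per_day * 24 * HS06_PER_CORE * 30
print(f"~{hs06_hours_per_month:.1e} HS06-hours/month")    # ~7.2e+08 with these assumptions

# Sanity check: 1 M jobs/day of ~2 h each need ~2 M core-hours,
# against ~2.4 M core-hours available from ~100k busy cores.
print(jobs_per_day * AVG_JOB_HOURS <= busy_cores * 24)    # True for these assumed numbers
```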

CPU – around the Tiers
- The grid really works: all sites, large and small, can contribute – and their contributions are needed!
- Significant use of Tier 2s for analysis
- Tier 0 usage peaks when the LHC is running – the average is much less
- Jan 2011 was the highest-use month ever … now exceeded

Speaker notes: The distribution of delivered CPU between the various tiers has been according to the design, with Tier 2s providing in excess of 50% of the total. Broken down further, we see that countries deliver close to what they pledged. Thus the grid concept really does work on this scale, and enables institutes worldwide to collaborate, share data, and provide resources towards the common goals of the collaboration. Early worries that people would simply ignore the grid and use CERN for analysis were unfounded – Tier 2s are used extensively and exclusively.

Data transfers
- LHC data transfers, April 2010 – May 2011: rates >> higher than planned/tested
  - Nominal: 1.3 GB/s; achieved: up to 5 GB/s; world-wide: ~10 GB/s per large experiment
- Chart annotations: 2011 data → Tier 1s; 2010 pp data → Tier 1s & re-processing; ALICE HI data → Tier 1s; re-processing of 2010 data; CMS HI data zero suppression & → FNAL

70 / 110 Gb/s!
- Significant levels of network traffic observed in 2010
- Traffic on the OPN up to 70 Gb/s! – ATLAS reprocessing campaigns
- Caused no network problems, but:
  - the reasons are understood (mostly ATLAS data management)
  - data popularity / on-demand placement will improve this

LHC Networking
- Relies on the OPN, GÉANT, US-LHCNet, NRENs & other national & international providers

Successes
- We have a working grid infrastructure
- Experiments have truly distributed models
- This has enabled physics output in a very short time
- Network traffic is in excess of that planned – and the network is extremely reliable
- Significant numbers of people are doing analysis (at Tier 2s)
- In 2010 resources were plentiful; now we start to see contention …
- Support levels are manageable ... just

Lessons learned
- Complexity & sustainability:
  - It took a lot of effort to get to today's level
  - The ongoing effort for support is probably too high
  - Experiments had to invest significant effort to hide grid complexity from users
- Distributed nature of the infrastructure:
  - Was not really understood in the computing models
  - Evolution of data distribution and management
  - Evolution of networking

Lessons: Data
- Computing models were based on the MONARC model of 2000, with a reliance on data placement:
  - Jobs are sent to datasets resident at a site
  - Multiple copies of data are hosted across the infrastructure
  - There was concern that the network would be insufficient or unreliable
- However:
  - A lot of data "placed" at sites was never touched
  - Refreshing large disk caches uses a lot of networking
  - The network is extremely reliable (… with redundancy)

Lessons: Services
- Until now many services have been distributed (databases, data placement, etc.)
  - Because of a lack of trust that the networks would be reliable enough
  - This increased the complexity of the middleware and applications
- Now we see centralization of those services again
  - One failover copy of a database is much simpler than a distributed database schema!

Computing model evolution
- Evolution of the computing models: from hierarchy to mesh (figure: hierarchical tier structure vs. full mesh)

Network evolution – LHCONE
- Evolution of the computing models also requires evolution of the network infrastructure:
  - Enable any Tier 2 or Tier 3 to easily connect to any Tier 1 or Tier 2
  - Use of Open Exchange Points
  - Do not overload the general R&E IP infrastructure with LHC data
- Connectivity to T1s, T2s, and T3s, and to aggregation networks: NRENs, GÉANT, etc.

Change of the data model …
- Data placement will now be based on:
  - Dynamic placement when jobs are sent to a site
  - Data popularity – popular data is replicated, unused data is removed (see the sketch below)
  - Analysis disk becomes a more dynamic cache
- Also start to use remote (WAN) I/O:
  - Fetch a file that is missing from a dataset
  - Read a file remotely over the network
  - Can mean less network traffic
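A minimal sketch of the popularity-driven placement idea above, assuming a simple access-count metric and illustrative thresholds; the class, the dataset names and the threshold values are hypothetical, not taken from any WLCG experiment framework.

```python
# Hypothetical illustration of popularity-based data placement:
# popular datasets gain replicas, untouched ones lose cached copies.
from collections import defaultdict

class PopularityCatalogue:
    def __init__(self, replicate_threshold=100, max_replicas=5):
        self.accesses = defaultdict(int)         # dataset -> accesses this period
        self.replicas = defaultdict(lambda: 1)   # dataset -> current replica count
        self.replicate_threshold = replicate_threshold
        self.max_replicas = max_replicas

    def record_access(self, dataset):
        self.accesses[dataset] += 1

    def rebalance(self):
        """Add replicas for popular datasets, drop cached copies of unused ones."""
        for dataset in set(self.accesses) | set(self.replicas):
            hits = self.accesses.get(dataset, 0)
            replicas = self.replicas[dataset]          # defaults to 1 known copy
            if hits >= self.replicate_threshold and replicas < self.max_replicas:
                self.replicas[dataset] = replicas + 1  # replicate popular data
            elif hits == 0 and replicas > 1:
                self.replicas[dataset] = replicas - 1  # shrink unused cached copies
        self.accesses.clear()                          # start a new popularity period


catalogue = PopularityCatalogue()
for _ in range(150):
    catalogue.record_access("data11_7TeV.AOD.popular")
catalogue.replicas["mc10.AOD.untouched"] = 3           # placed but never read
catalogue.rebalance()
print(catalogue.replicas["data11_7TeV.AOD.popular"])   # 2: one replica added
print(catalogue.replicas["mc10.AOD.untouched"])        # 2: one cached copy removed
```

Real data-management systems of course use richer popularity metrics and site-level constraints; the point here is just the replicate-when-popular / remove-when-unused loop described on the slide.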

Other challenges
- Resource efficiency:
  - Behaviour with resource contention
  - Efficient use – experiments struggle to live within resource expectations; physics is potentially limited by resources now!
- Changing models – to use what we have more effectively:
  - Evolving data management
  - Evolving network model
  - Integrating other federated identity management schemes
- Sustainability:
  - Grid middleware – does it have a future?
  - Sustainability of operations
  - Is (commodity) hardware reliable enough?
- Changing technology:
  - Using "clouds"
  - Other things – NoSQL, etc.
- Move away from "special" solutions

Relationship to EGI and EMI
- MoU between WLCG and EGI is in progress:
  - Has been presented to the WLCG OB; now with the CERN legal department
  - Initially it was too complex and had no clearly explained benefit for WLCG
- Important that NGIs provide the services that WLCG needs; EGI can help in coordinating that
- But note that an EGEE/EGI-style grid is only useful for certain sciences (like WLCG):
  - Some communities don't see it as appropriate
  - Pressure for EGI to adapt – how does that affect WLCG?
- Thus we need a good understanding of our future needs

Middleware support
- Process to be discussed this week …
- Environment (in Europe) is more complex today: EMI – EGI – WLCG
- Needs of WLCG vs. EGI/NGIs and other scientific communities
- Some caution needed …

Technical working group
Consider that:
- Computing models have evolved: a far better understanding of requirements now than 10 years ago; they have even evolved since the large-scale challenges
- Experiments have developed (different!) workarounds to manage weaknesses in the middleware; pilot jobs and central task queues are (almost) ubiquitous
- Operational effort is often too high; many services were not designed for redundancy, fail-over, etc.
- Technology evolves rapidly, and the rest of the world also does (large-scale) distributed computing – we don't need entirely home-grown solutions
- We must be concerned about long-term support and where it will come from

Technical working group
But remember:
- Whatever we do, we must evolve whilst not disrupting the ongoing operation
- We have a grid for a very good reason – we need to integrate the resources provided to us – but we can make use of other technologies
- We have the world's largest (only?) worldwide trust federation and a single sign-on scheme covering both authentication and authorization
- We have developed a very strong set of policies that are exemplars for all other communities trying to use distributed computing
- In parallel we have developed the operational security teams that have brought real benefit to the HEP community
- We have also developed the operational and support frameworks and tools that are able to manage this large infrastructure

Technical working group
WLCG must have an agreed, clear, and documented vision for the future, in order to:
- Better communicate our needs to EMI/EGI, OSG, …
- Be able to improve our middleware stack to address the concerns
- Attempt to re-build common solutions where possible
- Take into account lessons learned (functional, operational, deployment, management, …)
- Understand the long-term support needs
- Focus our efforts where we must (e.g. data management) and use off-the-shelf solutions where possible
- Balance the needs of the experiments and the sites

Group proposed
- To address these issues and start technical discussions on key topics
- Some first discussions already took place in the May and June GDBs – essentially to gauge the importance
- Jointly chaired – Markus and Jeff
- Membership to be discussed:
  - Need to balance experiment and site views
  - Not fixed – different people are needed for different topics
  - Should not be exclusive, but the size needs to be limited
- Should produce proposals for wider discussion and agreement; these agreed proposals should then form the strategy document
- Needs to happen quickly … taking a year over this will not be helpful

Conclusions
- WLCG has built a true distributed infrastructure
- The LHC experiments have used it to rapidly deliver physics results
- Experience with data has initiated new models for the future
- Additional technical discussions are needed to plan the future evolution