WLCG: The 1st year with data & looking to the future
Ian Bird, CERN, WLCG Project Leader
LCG-France; Strasbourg; 30th May 2011


Outline
Status of the WLCG collaboration
– Future of the collaboration
Results from the 1st year with data
– Some lessons learned
Evolution of WLCG distributed computing
– Computing models
– Networks
– Virtualization & clouds
– Job management
– Information service
Relations with EGI & EMI
– Middleware support, etc.

WLCG Collaboration Status: worldwide resources
Tier 0; 11 Tier 1s; 68 Tier 2 federations.
Today: >140 sites, >250k CPU cores, >150 PB disk.
Today we have 49 MoU signatories, representing 34 countries: Australia, Austria, Belgium, Brazil, Canada, China, Czech Rep., Denmark, Estonia, Finland, France, Germany, Hungary, Italy, India, Israel, Japan, Rep. Korea, Netherlands, Norway, Pakistan, Poland, Portugal, Romania, Russia, Slovenia, Spain, Sweden, Switzerland, Taipei, Turkey, UK, Ukraine, USA.

Future of the collaboration
The WLCG collaboration & MoU is the long-term framework for LHC computing.
More MoU signatures:
– several countries are in the process of joining
– several more have expressed intentions
– expect this to continue at a low level: a few per year
This framework will remain (and expand). However, the technical implementation of distributed computing for the LHC will change and evolve:
– new technologies, new models, lessons learned …

What is WLCG today?
Service coordination: service management; operational security; a world-wide trust federation for CAs and VOs; a complete policy framework; support processes & tools; common tools; monitoring & accounting.
Collaboration: coordination, management & reporting; common requirements; coordination of resources & funding; the Memorandum of Understanding; coordination with service & technology providers.
Physical resources: CPU, disk, tape, networks.
Distributed computing services.

1st year of LHC data: Tier 0
Tier 0 storage (disk servers, GB/s):
– accepts data at an average of 2.6 GB/s; peaks >11 GB/s
– serves data at an average of 7 GB/s; peaks >25 GB/s
– CERN Tier 0 moves >1 PB of data per day
Tape at the Tier 0:
– stored ~15 PB in 2010
– pp data to tape at close to 2 PB/month; peak rate 225 TB/day
– ~4 PB to tape during the heavy-ion (HI) run, at >5 GB/s
[Chart: data written to tape (GB/month) by experiment – ALICE, ATLAS, CMS, LHCb (and COMPASS); the 2010 HI run and reprocessing stand out at the 2 PB/month level]
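As a back-of-envelope illustration (a sketch only, not official WLCG accounting; it assumes decimal units and a 30-day month), the tape figures above can be converted between monthly, daily and per-second rates:

```python
# Back-of-envelope conversion of the Tier 0 tape-writing figures quoted above.
# Assumes decimal units (1 PB = 1000 TB = 1e6 GB) and a 30-day month; official
# WLCG accounting may use slightly different conventions.

TB_PER_PB = 1000.0
SECONDS_PER_DAY = 24 * 3600.0

pp_tape_pb_per_month = 2.0    # ~2 PB/month of pp data to tape
peak_tb_per_day = 225.0       # quoted peak rate

avg_tb_per_day = pp_tape_pb_per_month * TB_PER_PB / 30.0
avg_gb_per_s = avg_tb_per_day * 1000.0 / SECONDS_PER_DAY
peak_gb_per_s = peak_tb_per_day * 1000.0 / SECONDS_PER_DAY

print(f"average: {avg_tb_per_day:.0f} TB/day  (~{avg_gb_per_s:.2f} GB/s)")
print(f"peak:    {peak_tb_per_day:.0f} TB/day  (~{peak_gb_per_s:.2f} GB/s)")
# -> average ~67 TB/day (~0.77 GB/s); peak ~2.6 GB/s, i.e. roughly 3x the
#    monthly average, while the >5 GB/s HI rate is well above both.
```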

Grid Usage
Use remains consistently high: >1 M jobs/day; ~150k CPUs; ~100k CPU-days/day delivered.
Large numbers of analysis users: ATLAS, CMS ~800; LHCb, ALICE ~250.
As well as LHC data processing, large simulation productions are always ongoing.
CPU used at Tier 1s + Tier 2s (HS06.hours/month), last 12 months: at the end of 2010 we saw all Tier 1 and Tier 2 job slots being filled, and CPU usage is now well over double that of mid-2010 (the inset shows the build-up over previous years).
In 2010 WLCG delivered CPU time measured in CPU-millennia!

CPU – around the Tiers
The grid really works:
– all sites, large and small, can contribute – and their contributions are needed!
– significant use of Tier 2s for analysis
– Tier 0 usage peaks when the LHC is running; the average is much less
January 2011 was the highest-use month ever … so far.

Data transfers
World-wide: ~10 GB/s per large experiment.
LHC data transfers, April 2010 – May 2011:
– pp data → Tier 1s & re-processing
– CMS HI data zero suppression & → FNAL
– 2011 data → Tier 1s; re-processing of 2010 data
– ALICE HI data → Tier 1s
Rates are much higher than planned/tested: nominal 1.3 GB/s; achieved up to 5 GB/s.

Traffic on the OPN reached up to 70 Gb/s – ATLAS reprocessing campaigns. [Chart annotation: 70 / 110 Gb/s]
Significant levels of network traffic were observed in 2010. This caused no network problems, but:
– the reasons are understood (mostly ATLAS data management)
– data popularity / on-demand placement will improve this

LHC Networking
Relies on:
– OPN, GÉANT, US-LHCNet
– NRENs & other national & international providers

Successes:
– we have a working grid infrastructure
– experiments have truly distributed models
– it has enabled physics output in a very short time
– network traffic is in excess of that planned – and the network is extremely reliable
– significant numbers of people are doing analysis (at Tier 2s)
? In 2010 resources were plentiful; now we start to see contention …
? Support levels are manageable … just

Lessons learned
Complexity & sustainability:
– took a lot of effort to get to the level of today
– ongoing effort for support is probably too high
– experiments had to invest significant effort to hide grid complexity from users
Distributed nature of the infrastructure:
– was not really understood in the computing models
Evolution of data distribution and management
Evolution of networking

Lessons: Data
Computing models were based on the MONARC model of 2000 – a reliance on data placement:
– jobs are sent to datasets resident at a site
– multiple copies of data are hosted across the infrastructure
– there was concern that the network would be insufficient or unreliable
However:
– a lot of data “placed” at sites was never touched
– refreshing large disk caches uses a lot of networking
– the network is extremely reliable (… with redundancy)

Lessons: Services
Until now many services have been distributed – databases, data placement, etc.:
– because of a lack of trust that networks would be reliable enough
– this increased the complexity of the middleware and applications
Now we see centralization of those services again:
– one failover copy of a database is much simpler than a distributed database schema!

Computing model evolution
Evolution of the computing models: from hierarchy to mesh.
[Figure: the original tier hierarchy alongside the emerging mesh of interconnected sites]

Network evolution – LHCONE
Evolution of the computing models also requires evolution of the network infrastructure:
– enable any Tier 2 or Tier 3 to easily connect to any Tier 1 or Tier 2
– use of Open Exchange Points
– do not overload the general R&E IP infrastructure with LHC data
– connectivity to T1s, T2s and T3s, and to aggregation networks: NRENs, GÉANT, etc.

Change of the data model …
Data placement will now be based on:
– dynamic placement when jobs are sent to a site
– data popularity: popular data is replicated, unused data is removed
Analysis disk becomes a more dynamic cache.
We also start to use remote (WAN) I/O:
– fetch a file that is missing from a dataset
– read a file remotely over the network
This can mean less network traffic overall; a small illustration of the fallback logic follows below.
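To make the cache-plus-WAN-fallback idea concrete, here is a minimal sketch. It assumes the XRootD xrdcp client is available on the worker node; the storage endpoint, cache path and helper function are hypothetical and not part of any experiment's actual framework.

```python
# Minimal sketch of the "dynamic cache with WAN fallback" idea described above.
# Assumption: the XRootD command-line client (xrdcp) is installed; the endpoint
# "root://some-tier1.example/" and the paths are purely hypothetical.
import os
import subprocess

CACHE_DIR = "/data/analysis-cache"             # local analysis disk used as a cache
REMOTE_PREFIX = "root://some-tier1.example/"   # hypothetical remote storage endpoint

def open_input(lfn: str) -> str:
    """Return a local path for a logical file name, fetching it over the WAN
    only if it is not already present in the local cache."""
    local_path = os.path.join(CACHE_DIR, os.path.basename(lfn))
    if not os.path.exists(local_path):
        # File missing from the locally placed dataset: fetch it remotely.
        subprocess.run(["xrdcp", REMOTE_PREFIX + lfn, local_path], check=True)
    return local_path
```

Whether to copy the file into the cache or to stream it directly over the network is then a per-workflow choice: copying refreshes the local cache, while direct remote reads avoid using local disk at all.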

Other Challenges
Resource efficiency:
– behaviour under resource contention
– efficient use: experiments struggle to live within resource expectations, and physics is potentially limited by resources now!
Changing models, to use what we have more effectively:
– evolving data management
– evolving network model
– integrating other federated identity management schemes
Sustainability:
– grid middleware – has it a future?
– sustainability of operations
– is (commodity) hardware reliable enough?
Changing technology:
– using “clouds”
– other things: NoSQL, etc.
→ Move away from “special” solutions

Grids → clouds??
We have a grid because:
– we need to collaborate and share resources
– thus we will always have a “grid”
– our network of trust is of enormous value for us and for (e-)science in general
We also need distributed data management:
– that supports very high data rates and throughputs
– we will continually work on these tools
But the rest can be more mainstream (open source, commercial, …):
– we use message brokers more and more for inter-process communication
– virtualisation of our grid sites is happening, with many drivers: power, dependencies, provisioning, …
– remote job submission … could be cloud-like
– there is interest in making use of commercial cloud resources, especially for peak demand (see the sketch below)
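As an illustration of the "commercial clouds for peak demand" point (a sketch with invented thresholds and helper names, not a WLCG policy or tool), the decision of when to burst out to rented capacity can be as simple as:

```python
# Illustrative sketch (not WLCG code): burst to commercial cloud capacity only
# when the pledged grid resources are essentially full. All numbers are made up,
# except the ~150k job-slot order of magnitude quoted earlier in the talk.

PLEDGED_SLOTS = 150_000    # rough order of magnitude of WLCG job slots in 2011
BURST_THRESHOLD = 0.95     # start bursting when >95% of slots are busy
MAX_CLOUD_NODES = 2_000    # cap on rented capacity

def plan_cloud_burst(busy_slots: int, queued_jobs: int, cloud_nodes: int) -> int:
    """Return how many extra cloud nodes to request (0 if none are needed)."""
    utilisation = busy_slots / PLEDGED_SLOTS
    if utilisation < BURST_THRESHOLD or queued_jobs == 0:
        return 0                                  # the grid can absorb the load
    wanted = min(queued_jobs, MAX_CLOUD_NODES - cloud_nodes)
    return max(wanted, 0)

# Example: 149k busy slots, 5k queued jobs, 100 cloud nodes already running.
print(plan_cloud_burst(149_000, 5_000, 100))      # -> 1900
```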

Grid Services
Data Management Services: Storage Element; File Catalogue Service; File Transfer Service; grid file access tools; GridFTP service; Database and DB Replication Services; POOL Object Persistency Service.
Job Management Services: Compute Element; Workload Management Service; VO Agent Service; Application Software Install Service.
Security Services: Certificate Management Service; VO Membership Service; Authentication Service; Authorization Service.
Information Services: Information System; Messaging Service; Site Availability Monitor; Accounting Service; monitoring tools (experiment dashboards, site monitoring).
Experiments invested considerable effort into integrating their software with grid services, and into hiding complexity from users.

Relationship to EGI and EMI
An MoU between WLCG and EGI is in progress:
– it has been presented to the WLCG OB
– it is now with the CERN legal department
– it was initially too complex, and had no clearly explained benefit for WLCG
It is important that NGIs provide the services that WLCG needs:
– EGI can help in coordinating that
An EGEE/EGI-style grid is only useful for certain sciences (like WLCG):
– other communities don’t see it as usable
– there is pressure for EGI to adapt – how does that affect WLCG?
Thus we need a good understanding of our future needs.

Middleware support
The process is (too?) complex:
– WLCG makes requirements to (when asked by) EGI and EMI
– EMI delivers software to EGI (EMI-x releases)
– EGI includes this in the UMD release, and adds more
– WLCG has software not included in EMI or UMD
So WLCG has to take software from WLCG, EMI and EGI repositories. This is probably OK in practice: WLCG specifies minimum versions (as now) and points to the relevant repositories, as in the sketch below.
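For illustration only (the component versions below are invented and this is not an actual WLCG tool), a site could compare its installed middleware against the published baseline minimum versions like this:

```python
# Illustrative sketch only: compare installed middleware versions against a
# "baseline minimum versions" list of the kind WLCG publishes for its sites.
# The version numbers below are made up for the example.

def parse(version: str):
    """Turn '1.8.2' into (1, 8, 2) for a simple numeric comparison."""
    return tuple(int(x) for x in version.split("."))

baseline = {            # hypothetical baseline (minimum) versions
    "dpm": "1.8.0",
    "fts": "2.2.5",
    "cream-ce": "1.6.3",
}

installed = {           # hypothetical versions found on a site
    "dpm": "1.8.2",
    "fts": "2.2.4",
    "cream-ce": "1.6.3",
}

for component, minimum in baseline.items():
    version = installed.get(component, "0")
    status = "OK" if parse(version) >= parse(minimum) else f"UPGRADE (min {minimum})"
    print(f"{component:10s} {version:8s} {status}")
```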

Technical discussions that we need to have:
– Data management (ongoing)
– Virtualisation (HEPiX, etc.)
– Information system
– Security: is there a simpler model than glexec, etc.?
– Job management:
  – pilot jobs mean less need for WMS-like functions (see the sketch below)
  – can we simplify the CE? Performance/scale? Do we still need to describe the batch system? Give a throttle to the site manager?
  – how do batch schedulers evolve in a virtualized infrastructure?
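To make the pilot-job point concrete: a pilot lands in a batch slot and then pulls its real payloads from a central task queue, so WMS-style matchmaking at submission time becomes largely unnecessary. The sketch below is minimal and hypothetical (the queue URL and payload format are invented; it is not any experiment's real framework – ALICE, ATLAS, CMS and LHCb each have their own).

```python
# Minimal pilot-job sketch. The task-queue URL and payload format are
# hypothetical; the point is that the pilot, not a WMS, picks the workload.
import json
import subprocess
import urllib.request

TASK_QUEUE = "https://taskqueue.example.org/getjob"   # hypothetical central queue

def fetch_payload():
    """Ask the central task queue for a payload matched to this worker node."""
    with urllib.request.urlopen(TASK_QUEUE) as resp:
        job = json.load(resp)     # e.g. {"id": 42, "cmd": ["python", "sim.py", "--events", "100"]}
    return job or None

def run_pilot(max_jobs: int = 10) -> None:
    """Run payloads until the queue is empty or a fixed number have been done."""
    for _ in range(max_jobs):
        job = fetch_payload()
        if not job:
            break                 # nothing left to do: the pilot exits and frees the slot
        result = subprocess.run(job["cmd"])
        print(f"payload {job['id']} finished with exit code {result.returncode}")

if __name__ == "__main__":
    run_pilot()
```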

Summary
– WLCG has built a true distributed infrastructure
– the LHC experiments have used it to rapidly deliver physics results
– experience with data has initiated new models for the future
– additional technical discussions are needed to plan future evolution