WLCG after 1 year with data: Prospects for the future
Ian Bird, WLCG Project Leader
openlab Board of Sponsors (BoS) meeting, CERN, 4th May 2011

Overview
- Quick review of WLCG
- Summary of the 1st year with data: achievements, successes, lessons
- Outlook for the next 3 years: what are our challenges?

The LHC Computing Challenge
- Signal/Noise (offline): ~10⁻⁹
- Data volume: high rate × large number of channels × 4 experiments → ~15 PetaBytes of new data each year
- Compute power: event complexity × number of events × thousands of users → ~200k of today's fastest CPUs and ~45 PB of disk storage
- Worldwide analysis & funding: computing is funded locally in major regions and countries, with efficient analysis everywhere → GRID technology
- Today: >250k cores and 100 PB of disk!
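To make the data-volume line concrete, here is a back-of-the-envelope sketch in Python. The event rate, event size and running time are illustrative assumptions of mine, not official WLCG parameters, but they land in the same ballpark as the ~15 PB/year quoted above.

```python
# Rough estimate of annual RAW data volume (illustrative assumptions, not WLCG figures)
events_per_second = 300      # assumed events written to tape per second, per experiment
event_size_mb = 1.5          # assumed average RAW event size in MB
live_seconds_per_year = 7e6  # assumed effective data-taking time per year
experiments = 4

annual_pb = events_per_second * event_size_mb * live_seconds_per_year * experiments / 1e9
print(f"~{annual_pb:.0f} PB of new data per year")  # ~13 PB with these assumptions
```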

WLCG – what and why?
A distributed computing infrastructure to provide the production and analysis environments for the LHC experiments, managed and operated by a worldwide collaboration between the experiments and the participating computer centres. The resources are distributed – for funding and sociological reasons. Our task was to make use of the resources available to us, no matter where they are located.
- Tier-0 (CERN): data recording, initial data reconstruction, data distribution
- Tier-1 (11 centres): permanent storage, re-processing, analysis
- Tier-2 (~130 centres): simulation, end-user analysis

Worldwide resources
- Today: >140 sites, >250k CPU cores, >100 PB of disk
- WLCG Collaboration status: Tier 0; 11 Tier 1s; 68 Tier 2 federations
- Today we have 49 MoU signatories, representing 34 countries: Australia, Austria, Belgium, Brazil, Canada, China, Czech Rep., Denmark, Estonia, Finland, France, Germany, Hungary, Italy, India, Israel, Japan, Rep. of Korea, Netherlands, Norway, Pakistan, Poland, Portugal, Romania, Russia, Slovenia, Spain, Sweden, Switzerland, Taipei, Turkey, UK, Ukraine, USA

1st year of LHC data
- Writing up to 70 TB/day to tape (~70 tapes per day)
- Stored ~15 PB in 2010: ~2 PB/month to tape in pp running, ~4 PB to tape during heavy-ion (HI) running, with >5 GB/s to tape during HI
- Tier 0 storage accepts data at an average of 2.6 GB/s, with peaks >11 GB/s, and serves data at an average of 7 GB/s, with peaks >25 GB/s
- The CERN Tier 0 moves >1 PB of data per day
(Charts: data written to tape in GB/day; disk server throughput in GB/s)
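A quick consistency check on those tape numbers, using only the figures quoted on the slide and decimal units (my arithmetic, not an official accounting):

```python
# 70 TB/day to tape, expressed as a sustained rate and as a monthly volume
TB_PER_DAY = 70
SECONDS_PER_DAY = 86_400

sustained_gb_per_s = TB_PER_DAY * 1_000 / SECONDS_PER_DAY  # ~0.8 GB/s averaged over a day
pb_per_month = TB_PER_DAY * 30 / 1_000                     # ~2.1 PB/month, matching "~2 PB/month" in pp running

print(f"{sustained_gb_per_s:.1f} GB/s sustained, {pb_per_month:.1f} PB/month to tape")
```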

Grid usage
- Large numbers of analysis users: ATLAS, CMS ~800; LHCb, ALICE ~250
- Use remains consistently high: >1 M jobs/day, ~150k CPUs, ~100k CPU-days/day
- As well as LHC data, large simulation productions are always ongoing
- At the end of 2010 we saw all Tier 1 and Tier 2 job slots being filled; CPU usage is now well over double that of mid-2010 (an inset shows the build-up over previous years)
- In 2010 WLCG delivered computing amounting to CPU-millennia!
(Chart: CPU used at Tier 1s + Tier 2s, in HS06·hours/month, over the last 12 months)
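The 2010 total can be sanity-checked against the ~100k CPU-days/day figure above; the arithmetic below is my own order-of-magnitude estimate, not an official accounting number.

```python
# If ~100,000 CPU-days of work are delivered every day, sustained for a year:
cpu_days_per_day = 100_000
days_of_running = 365

cpu_years = cpu_days_per_day * days_of_running / 365.25  # ~100,000 CPU-years
cpu_millennia = cpu_years / 1_000                        # i.e. on the order of 100 CPU-millennia
print(f"~{cpu_millennia:.0f} CPU-millennia delivered over the year")
```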

CPU – around the Tiers
- The grid really works: all sites, large and small, can contribute – and their contributions are needed!
- Significant use of Tier 2s for analysis
- Tier 0 usage peaks when the LHC is running – the average is much less
- January 2011 was the highest-use month ever… so far

Data transfers
(Charts: transfer rates during LHC running, April – September 2010, and traffic on the academic/research networks for Tier 1/2. Annotations: CMS HI data zero suppression → FNAL; 2011 data → Tier 1s; re-processing of 2010 data; ALICE HI data → Tier 1s.)

Successes:
- We have a working grid infrastructure
- Experiments have truly distributed models
- This has enabled physics output in a very short time
- Network traffic is close to that planned – and the network is extremely reliable
- Significant numbers of people are doing analysis (at Tier 2s)
- Today resources are plentiful, and no contention is seen… yet
- Support levels are manageable… just

The LHC schedule now has continuous running, with high integrated luminosity expected (i.e. lots of interesting data). Impacts:
- Resources: funding agencies have been asked to fund more resources in 2012 (an "off" year had previously been expected)
- Upgrades must be pushed back or carried out during running: Oracle 11g, network switches, online clusters, OS versions, etc. This is mostly an issue for accelerator- or experiment-control systems; for WLCG there is NO downtime, ever.
- … and the number of events per collision is much higher than anticipated for now → larger event sizes (hence more data volume) and more processing time

Evolution of requirements

Some areas where openlab partners have contributed to this success… (in no particular order)

Databases
Databases everywhere (LHC, experiments, offline, remote) – large-scale deployment and distributed databases, e.g. Streams for data replication

CPU & performance
- CPU/machines: evaluation of new generations
- Performance optimisation: how to use many-core machines
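As a minimal illustration of the many-core question, the sketch below spreads a stand-in event-processing function across all available cores using only the Python standard library; the function and event counts are hypothetical.

```python
from multiprocessing import Pool, cpu_count

def process_event(event_id: int) -> float:
    """Stand-in for real per-event reconstruction work (purely illustrative)."""
    return sum(i * i for i in range(10_000)) + event_id

if __name__ == "__main__":
    events = range(100_000)
    # Fan independent events out across every core instead of looping on one
    with Pool(processes=cpu_count()) as pool:
        results = pool.map(process_event, events, chunksize=1_000)
    print(f"processed {len(results)} events on {cpu_count()} cores")
```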

Monitoring
- New ways to view monitoring data: Gridmaps now appear everywhere
- This was a good example of tapping into expertise and experience within the company

Networking
- Technology evaluations (e.g. 10 Gb)
- Campus networking and security – essential for physics analysis at CERN

… and some challenges for the future

Challenges:
- Resource efficiency
  – Behaviour under resource contention
  – Efficient use: experiments struggle to live within resource expectations; physics is potentially limited by resources now!
- Changing models – to use what we have more effectively
  – Evolving data management
  – Evolving network model
  – Integrating other federated identity management schemes
- Sustainability
  – Grid middleware – has it a future?
  – Sustainability of operations
  – Is (commodity) hardware reliable enough?
- Changing technology
  – Using "clouds"
  – Other things (NoSQL, etc.) → move away from "special" solutions

Grids → clouds??
We have a grid because:
- We need to collaborate and share resources – thus we will always have a "grid"
- Our network of trust is of enormous value for us and for (e-)science in general
We also need distributed data management:
- It must support very high data rates and throughputs
- We will continually work on these tools
But the rest can be more mainstream (open source, commercial, …):
- We use message brokers more and more for inter-process communication
- Virtualisation of our grid sites is happening, with many drivers: power, dependencies, provisioning, …
- Remote job submission… could be cloud-like
- There is interest in making use of commercial cloud resources, especially for peak demand
→ We should invest effort only where we need to

Virtualisation and clouds
Clearly of great interest. CERN has several threads:
- Service consolidation of "VO-managed services"
- "Kiosk": request a VM via a web interface
- Batch service: tested Platform ISF and OpenNebula; did very large scaling tests
- Very interested in OpenStack, both for cluster management and the storage system; there is potentially a large community behind it, and it could lead towards (de-facto) standards for clouds
Questions:
- Is S3 a possible alternative as a storage interface?
- Can we virtualise (most of) our computing infrastructure? Have far fewer types of hardware purchase? Remove the distinction between CPU and disk servers?
- Do we still need a traditional batch scheduler?
- How easy is it to burst out to commercial clouds?
- How feasible is it to use cloud interfaces for distributed job management between grid (cloud) sites?
- How much grid middleware can we obsolete?
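On the S3 question above, a minimal sketch of what an S3-style storage interface looks like from the client side, written with today's boto3 library; the endpoint URL, bucket, object key and credentials are placeholders, not a real CERN or experiment service.

```python
import boto3

# Client pointed at a hypothetical S3-compatible endpoint (placeholder values)
s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.example.org",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# Store an object, then read it back: the whole interface is essentially PUT/GET on keys
s3.put_object(Bucket="physics-data", Key="run2011/raw/file001.root", Body=b"<file bytes>")
obj = s3.get_object(Bucket="physics-data", Key="run2011/raw/file001.root")
payload = obj["Body"].read()
```

The attraction is the very small interface surface compared with a full storage-management protocol; whether that is sufficient for archive-scale physics storage is exactly the open question posed above.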

Resource efficiency
- Resource contention (see also sustainable operations)
  – We need better "monitoring": we have lots of information, but we really need the ability to mine and analyse monitoring data, within and across services: trends, correlations
  – We need warnings of problems before they happen; can this lead to automated actions/reactions/recovery?
- Efficiency of use
  – Many-core CPUs & other architectures
  – CPU efficiency – do jobs wait for data? How important is it? (CPU is cheap…)
  – Does a virtualised infrastructure help?
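A small sketch of the "mine and analyse monitoring data" idea: load per-service time series, look at trends and cross-service correlations, and flag drifts before they become outages. The file name and column names are hypothetical, and pandas is just one possible tool for this.

```python
import pandas as pd

# Hypothetical export of per-service metrics, one row per sample
df = pd.read_csv("service_metrics.csv", parse_dates=["timestamp"], index_col="timestamp")
# e.g. columns: job_failure_rate, storage_latency_ms, network_errors

hourly = df.resample("1h").mean()         # smooth raw samples into hourly averages
trend = hourly.rolling(window=24).mean()  # 24-hour rolling trend per metric
corr = hourly.corr()                      # cross-service correlation matrix

# Crude early warning: hours where a metric drifts more than 3 sigma from its recent trend
drift = (hourly - trend).abs() > 3 * hourly.rolling(window=24).std()
alerts = hourly[drift].dropna(how="all")

print(corr)
print(alerts.tail())
```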

Evolution of computing models
- Recognise the network as a resource
- Data on demand will augment data pre-placement
- Storage systems will become more dynamic caches
- Allow remote data access – fetch files when needed, with I/O over the WAN
- Network usage will (eventually) increase and become more dynamic (less predictable)

Evolution of data management
A consequence of the computing model evolution:
- Data caching rather than organised data placement
- Distinguish between data archives and data caches
  – Only allow organised access to archives
  – This simplifies interfaces – no need for full SRM
  – Potential to replace archives with commercial back-up solutions (that scale sufficiently!)
- Tools to support:
  – Remote data access (all aspects)
  – Reliable transfer (we have this, but it clearly needs reworking)
  – Cache management
  – Low-latency, high-throughput file access (for reading)
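A minimal sketch of the cache behaviour described above: serve a file from a local cache if it is already there, otherwise fetch it over the WAN and keep a copy. The URL and cache path are illustrative, and a real deployment would sit behind the experiments' data-management and transfer tools rather than plain HTTP.

```python
import os
import urllib.request

CACHE_DIR = "/var/cache/wlcg-demo"  # illustrative local cache location

def open_data(remote_url: str) -> str:
    """Return a local path for remote_url, fetching and caching it on first access."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    local_path = os.path.join(CACHE_DIR, remote_url.rsplit("/", 1)[-1])
    if not os.path.exists(local_path):
        # Cache miss: one remote read over the WAN; all later access is local
        urllib.request.urlretrieve(remote_url, local_path)
    return local_path

path = open_data("https://storage.example.org/run2011/AOD/file042.root")  # hypothetical URL
```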

Network evolution
- Evolution of the computing models also requires evolution of the network infrastructure
- Open exchange points built in carrier-neutral facilities: any connector can connect with their own fibre or using circuits provided by any telecom provider; this enables T2s and T3s to obtain their data from any T1 or T2
- Use of LHCONE will relieve the load on the general R&E IP infrastructure
- LHCONE provides connectivity directly to T1s, T2s and T3s, and to various aggregation networks, such as the European NRENs, GÉANT, etc.

Sustainability: service incidents (outage/degradation)
- Service incidents over the last 2 quarters – any service degradation generates a Service Incident Report (SIR, i.e. a post-mortem):

  Incidents  Type
  11         Infrastructure related
  6          Database problems (some also infrastructure-caused)
  4          Storage related (~all infrastructure-caused)
  2          Network problems

- This illustrates quite strongly that the majority (>~75%) of the problems experienced are not related to the distributed nature of WLCG at all (or to grid middleware)
- How can we make the effect of outages less intrusive? Can we automate recovery (or management)?
- Does the user community have reasonable expectations? (No…)
- This is not unique to WLCG!

Everyone has service failures…
- Some inform their customers… and some don't! Some forgot the part about keeping their customers informed… Where are the SIRs???
- Failures can and do happen, but these incidents raise many questions for cloud services: How safe is my data? Where is it? Privacy? Who checks? Dependencies?

Summary
- WLCG has been a great success and has been key to the rapid delivery of physics from the LHC
- The challenge now is to be more effective and efficient – computing should limit physics as little as possible