
1 Petabyte-scale computing for LHC. Ian Bird, CERN, WLCG Project Leader. ISEF Students, 18 June 2012. Accelerating Science and Innovation.

2 Enter a New Era in Fundamental Science
The start-up of the Large Hadron Collider (LHC), one of the largest and most truly global scientific projects ever undertaken, is the most exciting turning point in particle physics: the exploration of a new energy frontier. The LHC ring is 27 km in circumference and hosts the four main experiments: ALICE, ATLAS, CMS and LHCb.

3 Some history of scale…

  Date         Collaboration size   Data volume, archive technology
  Late 1950s   2-3                  kilobits, notebooks
  1960s        10-15                kB, punch cards
  1970s        ~35                  MB, tape
  1980s        ~100                 GB, tape, disk
  1990s        700-800              TB, tape, disk
  2010s        ~3000                PB, tape, disk

For comparison: in the 1990s the total LEP data set was a few TB and would fit on a single tape today; today, one year of LHC data is ~25 PB. Where does all this data come from? CERN has about 60,000 physical disks to provide about 20 PB of reliable storage.
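A small back-of-envelope calculation shows what "reliable" means here; the per-disk raw size is taken from the CERN data-centre table later in this deck (~1 TiB per disk), and reading the gap as replication and redundancy overhead is an assumption.

```python
# Back-of-envelope: usable vs raw space per disk at CERN (illustrative).
reliable_pb = 20          # "about 20 PB of reliable storage"
physical_disks = 60_000   # "about 60,000 physical disks"
usable_tb_per_disk = reliable_pb * 1_000 / physical_disks
print(f"~{usable_tb_per_disk:.2f} TB of reliable storage per physical disk")  # ~0.33 TB
# With roughly 1 TB of raw capacity per disk (see the data-centre table near the end),
# the factor of ~3 is replication/redundancy and operational headroom, not small disks.
```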

4 (image-only slide, CERN / May 2011)

5 150 million sensors deliver data … 40 million times per second.
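An order-of-magnitude sketch of what such a readout rate would mean if everything were kept is given below; the bytes-per-sensor value is an assumption for illustration, not a slide number.

```python
# Illustrative only: raw rate implied by the slide's numbers before any triggering.
sensors = 150e6               # 150 million sensors
crossings_per_second = 40e6   # read out 40 million times per second
bytes_per_sensor = 1          # assumed; real detectors zero-suppress aggressively
raw_rate_pb_per_s = sensors * crossings_per_second * bytes_per_sensor / 1e15
print(f"~{raw_rate_pb_per_s:.0f} PB/s")   # ~6 PB/s, which is why events must be filtered
```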

6 What is this data?
Raw data:
 – Was a detector element hit?
 – How much energy?
 – What time?
Reconstructed data:
 – Momentum of tracks (4-vectors)
 – Origin
 – Energy in clusters (jets)
 – Particle type
 – Calibration information
 – …

7 Data and Algorithms
HEP data are organised as Events (particle collisions). Simulation, reconstruction and analysis programs process one Event at a time; Events are fairly independent, so processing is trivially parallel. Event-processing programs are composed of a number of Algorithms that select and transform "raw" Event data into "processed" (reconstructed) Event data and statistics. The main data tiers are:
 – RAW (~2 MB/event): detector digitisation; the triggered events recorded by the DAQ
 – ESD/RECO (~100 kB/event): reconstructed, pseudo-physical information such as clusters and track candidates
 – AOD (~10 kB/event): analysis-level physical information: transverse momentum, association of particles, jets, identification of particles
 – TAG (~1 kB/event): classification information relevant for fast event selection
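Since events are independent, the parallelism really is this simple. The sketch below is illustrative Python, not experiment software: the event contents and the reconstruct step are placeholders, and only the per-event sizes come from the slide.

```python
import multiprocessing as mp

# Approximate per-event sizes quoted on the slide.
RAW_BYTES, AOD_BYTES = 2_000_000, 10_000

def reconstruct(event):
    """Stand-in for the chain of Algorithms that turns one RAW event into an AOD."""
    # clustering, track finding, particle identification, ... would run here
    return {"id": event["id"], "bytes": AOD_BYTES}

def main():
    raw_events = [{"id": i, "bytes": RAW_BYTES} for i in range(1_000)]  # fake RAW sample
    # Events are independent, so a plain process pool parallelises the loop trivially.
    with mp.Pool() as pool:
        aods = pool.map(reconstruct, raw_events)
    print(f"reconstructed {len(aods)} events")

if __name__ == "__main__":
    main()
```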

8 Data Handling and Computation for Physics Analysis
(Data-flow diagram: the detector feeds an event filter (selection & reconstruction); raw data and event summary data pass through reconstruction, event reprocessing and event simulation; batch and interactive physics analysis run over the processed data and over analysis objects extracted by physics topic.)

9 The LHC Computing Challenge
 – Signal/Noise: 10^-13 (10^-9 offline)
 – Data volume: high rate × large number of channels × 4 experiments → ~15 PetaBytes of new data each year (22 PB actually recorded in 2011)
 – Compute power: event complexity × number of events × thousands of users → ~200k CPUs and ~45 PB of disk storage (today ~250k CPUs and ~150 PB of disk)
 – Worldwide analysis & funding: computing is funded locally in major regions and countries, and efficient analysis is needed everywhere → GRID technology
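As a sanity check on the data-volume estimate, a back-of-envelope calculation is sketched below; the trigger rate and live time are assumed typical orders of magnitude, not numbers from the slide.

```python
# Rough consistency check of the "~15 PB/year" figure; trigger rate and live time
# are assumed typical values, not numbers from the slide.
raw_event_mb = 2       # RAW event size from the earlier slide
trigger_rate = 300     # assumed: a few hundred recorded events per second per experiment
live_seconds = 1e7     # assumed: ~10^7 seconds of data taking per year
experiments  = 4
pb_per_year = raw_event_mb * 1e6 * trigger_rate * live_seconds * experiments / 1e15
print(f"~{pb_per_year:.0f} PB of RAW data per year")  # ~24 PB, the same order as the slide
```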

10 A collision at LHC (image slide)

11 The Data Acquisition (diagram slide)

12 Tier 0 at CERN: acquisition, first-pass reconstruction, storage & distribution
Nominal rate into the Tier 0: 1.25 GB/s (ions). Achieved in 2011: 400-500 MB/s in routine running, and 4-6 GB/s during the heavy-ion period.

13 WLCG – what and why?
A distributed computing infrastructure to provide the production and analysis environments for the LHC experiments, managed and operated by a worldwide collaboration between the experiments and the participating computer centres. The resources are distributed – for funding and sociological reasons – and our task was to make use of the resources available to us, no matter where they are located.
 – Tier-0 (CERN): data recording, initial data reconstruction, data distribution
 – Tier-1 (11 centres): permanent storage, re-processing, analysis
 – Tier-2 (~130 centres): simulation, end-user analysis
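Purely as an illustration of the division of labour just listed (not any real WLCG data structure), the tier roles can be written down as a small lookup table:

```python
# Illustrative encoding of the tier roles listed above; names and structure are made up.
TIER_ROLES = {
    "Tier-0": {"centres": 1,   "roles": ["data recording", "initial reconstruction", "data distribution"]},
    "Tier-1": {"centres": 11,  "roles": ["permanent storage", "re-processing", "analysis"]},
    "Tier-2": {"centres": 130, "roles": ["simulation", "end-user analysis"]},
}

def tiers_for(activity):
    """Return the tier(s) whose role list includes the given activity."""
    return [tier for tier, info in TIER_ROLES.items() if activity in info["roles"]]

print(tiers_for("simulation"))          # ['Tier-2']
print(TIER_ROLES["Tier-1"]["centres"])  # 11
```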

14 e-Infrastructure

15 WLCG Grid Sites (Tier 0, Tier 1, Tier 2)
Today: >140 sites, >250k CPU cores, >150 PB disk.

16 WLCG Collaboration Status
Tier 0; 11 Tier 1s; 68 Tier 2 federations.
Tier 0: CERN. Tier 1 centres: Lyon/CCIN2P3, Barcelona/PIC, De-FZK, US-FNAL, Ca-TRIUMF, NDGF, US-BNL, UK-RAL, Taipei/ASGC, Amsterdam/NIKHEF-SARA, Bologna/CNAF.
Today we have 49 MoU signatories, representing 34 countries: Australia, Austria, Belgium, Brazil, Canada, China, Czech Rep, Denmark, Estonia, Finland, France, Germany, Hungary, Italy, India, Israel, Japan, Rep. Korea, Netherlands, Norway, Pakistan, Poland, Portugal, Romania, Russia, Slovenia, Spain, Sweden, Switzerland, Taipei, Turkey, UK, Ukraine, USA.

17 Original computing model (diagram slide)

18 From testing to data: independent experiment data challenges
Service Challenges were proposed in 2004 to demonstrate the service aspects:
 – Data transfers for weeks on end
 – Data management
 – Scaling of job workloads
 – Security incidents ("fire drills")
 – Interoperability
 – Support processes
Timeline, 2004-2010:
 – SC1, SC2: basic transfer rates
 – SC3: sustained rates, data management, service reliability
 – SC4: nominal LHC rates, disk → tape tests, all Tier 1s, some Tier 2s
 – CCRC'08: readiness challenge, all experiments, ~full computing models
 – STEP'09: scale challenge, all experiments, full computing models, tape recall + analysis
E.g. DC04 (ALICE, CMS, LHCb) and DC2 (ATLAS) in 2004 saw the first full chain of the computing models on grids. The focus has been on real and continuous production use of the service over several years (simulations since 2003, cosmic-ray data, etc.); the data and service challenges exercised all aspects of the service – not just data transfers, but workloads, support structures, etc.

19 WLCG: data in 2010, 2011, 2012
In 2010+2011 ~38 PB of data were accumulated, and about 30 PB more are expected in 2012. Data rates to tape exceed the original plans: up to 6 GB/s in heavy-ion running (cf. the nominal 1.25 GB/s), with ALICE heavy-ion data into Castor alone above 4 GB/s. 23 PB of data were written in 2011, and in 2012 the rate is about 3 PB/month.
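To put the 2012 figure in context, a simple conversion of the monthly volume into an average sustained rate (assuming a 30-day month) is sketched below.

```python
# Convert "3 PB/month" into an average sustained rate; a 30-day month is assumed.
pb_per_month = 3
seconds_per_month = 30 * 86_400
avg_gb_per_s = pb_per_month * 1e6 / seconds_per_month   # PB -> GB
print(f"~{avg_gb_per_s:.1f} GB/s averaged around the clock")  # ~1.2 GB/s
```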

20 Grid Usage
Large numbers of analysis users: ~1000 each for ATLAS and CMS, ~250 each for LHCb and ALICE. Use remains consistently high: >1.5 M jobs/day and ~150k CPUs in continuous use (about 10^9 HEPSPEC-hours per month). As well as LHC data processing, large simulation productions are always ongoing. The CPU used at Tier 1s + Tier 2s (HS06-hours/month) over the last 24 months shows all Tier 1 and Tier 2 job slots being filled by the end of 2010; CPU usage is now well over double that of mid-2010. In 2011 WLCG delivered ~150 CPU-millennia!
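These numbers hang together, as the small check below shows; the HS06-per-core figure is an assumption (a typical value for CPUs of that era), not a number from the slide.

```python
# Consistency check of the usage figures; HS06-per-core is an assumed typical value.
hs06_hours_per_month = 1e9
hours_per_month = 730
hs06_per_core = 10          # assumed average per core for ~2011-era hardware
cores_busy = hs06_hours_per_month / hours_per_month / hs06_per_core
print(f"~{cores_busy/1e3:.0f}k cores in continuous use")  # ~137k, close to the quoted 150k

cpu_years = 150_000         # ~150k CPUs busy for a full year
print(f"{cpu_years:,} CPU-years = {cpu_years/1_000:.0f} CPU-millennia")
```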

21 Tier usage vs pledges (plot slide)
We use everything we are given!

22 CPU around the Tiers
The grid really works: all sites, large and small, can contribute – and their contributions are needed!

23 Data transfers
Global transfers exceed 10 GB/s (one-day view). Plots show global transfers over the last month and CERN → Tier 1 transfers over the last two weeks.

24 LHC Networking
Relies on:
 – the OPN, GEANT, US-LHCNet
 – NRENs & other national & international providers

25 e-Infrastructure

26 Today's Grid Services
 – Data Management Services: Storage Element, File Catalogue Service, File Transfer Service, Grid file access tools, GridFTP service, Database and DB Replication Services, POOL Object Persistency Service
 – Job Management Services: Compute Element, Workload Management Service, VO Agent Service, Application Software Install Service
 – Security Services: Certificate Management Service, VO Membership Service, Authentication Service, Authorization Service
 – Information Services: Information System, Messaging Service, Site Availability Monitor, Accounting Service, monitoring tools (experiment dashboards, site monitoring)
Experiments invested considerable effort into integrating their software with grid services and hiding the complexity from users.
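The kind of "complexity hiding" referred to here can be pictured as a thin facade over the catalogue, transfer and workload-management services. The sketch below is hypothetical Python, not any real WLCG or experiment API; every class, method and endpoint name is made up for illustration.

```python
# Hypothetical facade (illustration only; not a real WLCG or experiment API).
class GridSession:
    """Hides catalogue lookups and job submission behind two calls."""

    def __init__(self, vo):
        self.vo = vo  # virtual organisation, e.g. "atlas"

    def resolve(self, logical_name):
        # In reality this would query a File Catalogue Service for physical replicas.
        return [f"gsiftp://some-se.example.org/{self.vo}/{logical_name}"]

    def submit(self, executable, input_files):
        # In reality this would hand the job to a Workload Management Service / pilot system.
        replicas = [self.resolve(name)[0] for name in input_files]
        print(f"submitting {executable} with {len(replicas)} input replicas")
        return "job-0001"  # placeholder job identifier

session = GridSession(vo="atlas")
session.submit("reco.sh", ["raw/some_run/file_0001.root"])
```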

27 Technical evolution: background
Consider that:
 – Computing models have evolved: there is a far better understanding of requirements now than 10 years ago, and the models have evolved even since the large-scale challenges
 – Experiments have developed various workarounds to manage shortcomings in the middleware: pilot jobs and central task queues are (almost) ubiquitous
 – Operational effort is often too high: many services were not designed for redundancy, fail-over, etc.
 – Technology evolves rapidly, and the rest of the world also does (large-scale) distributed computing: we don't need entirely home-grown solutions
 – We must be concerned about long-term support and where it will come from

28 Computing model evolution
Evolution of the computing models: from a strict hierarchy to a mesh (diagram slide).

29 Connectivity challenge
It is not just bandwidth: we are a global collaboration, but well-connected countries do better. We need to connect effectively everyone who wants to participate in LHC science: there are large actual and potential communities in the Middle East, Africa, Asia and Latin America, but also on the edges of Europe.

30 Impact of the LHC Computing Grid
WLCG has been leveraged on both sides of the Atlantic to benefit the wider scientific community:
 – Europe: Enabling Grids for E-sciencE (EGEE) 2004-2010; European Grid Infrastructure (EGI) 2010-
 – USA: Open Science Grid (OSG) 2006-2012 (+ extension?)
Many scientific applications: archeology, astronomy, astrophysics, civil protection, computational chemistry, earth sciences, finance, fusion, geophysics, high energy physics, life sciences, multimedia, material sciences, …

31 Spectrum of grids, clouds, supercomputers, etc.
 – Grids: collaborative environment; distributed resources (for political/sociological reasons); commodity hardware (also supercomputers); (HEP) data management; complex interfaces (bug, not feature)
 – Supercomputers: expensive; low-latency interconnects; applications peer reviewed; parallel/coupled applications; traditional interfaces (login); also SC grids (DEISA, Teragrid)
 – Clouds: proprietary (implementation); economies of scale in management; commodity hardware; virtualisation for service provision and for encapsulating the application environment; details of physical resources hidden; simple interfaces (too simple?)
 – Volunteer computing: simple mechanism to access millions of CPUs; difficult if (much) data is involved; control of environment – check; community building, with people involved in science; potential for huge amounts of real work
Many different problems are amenable to different solutions; there is no single right answer. Consider all of these as a combined e-Infrastructure ecosystem: aim for interoperability, combine the resources into a consistent whole, and keep applications agile so they can operate in many environments.

32 Grid vs. Cloud?
Grid: a distributed computing service
 – Integrates distributed resources
 – Global single sign-on (use the same credential everywhere)
 – Enables (virtual) collaboration
Cloud: a large (remote) data centre
 – Economy of scale: centralise resources in large centres
 – Virtualisation: enables dynamic provisioning of resources
The technologies are not exclusive: in the future our collaborative grid sites will use cloud technologies (virtualisation etc.), and we will also use cloud resources to supplement our own.

33 Grids → clouds?
We have a grid because:
 – We need to collaborate and share resources – thus we will always have a "grid"
 – Our network of trust is of enormous value for us and for (e-)science in general
We also need distributed data management:
 – It must support very high data rates and throughputs
 – We will continually work on these tools
But the rest can be more mainstream (open source, commercial, …):
 – We use message brokers more and more for inter-process communication
 – Virtualisation of our grid sites is happening, with many drivers: power, dependencies, provisioning, …
 – Remote job submission … could be cloud-like
 – There is interest in making use of commercial cloud resources, especially for peak demand

34 Clouds & Virtualisation
Several strategies:
Use of virtualisation in the CERN and other computer centres:
 – Lxcloud pilot + CVI → a dynamic virtualised infrastructure (which may include "bare-metal" provisioning)
 – No change to any grid or service interfaces (but new possibilities)
 – Likely based on OpenStack
 – Other WLCG sites are also virtualising their infrastructure
Investigating the use of commercial clouds – "bursting":
 – Additional resources; potential for outsourcing some services?
 – Prototype with the Helix Nebula project; experiments have various activities (with Amazon, etc.)
Can cloud technology replace or supplement some grid services?
 – More speculative: feasibility? timescales?

35 CERN Data Centre Numbers

  Systems                   7,899    Hard disks                 62,023
  Processors               14,972    Raw disk capacity (TiB)    62,660
  Cores                    64,623    Tape capacity (PiB)            47
  Memory (TiB)                165    Ethernet 1 Gb ports        16,773
  Racks                     1,070    Ethernet 10 Gb ports          622
  Power consumption (kW)    2,345

From http://sls.cern.ch/sls/service.php?id=CCBYNUM
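Two figures fall straight out of this table and give a feel for the scale (simple division, no new assumptions):

```python
# Derived directly from the table above.
raw_disk_tib, hard_disks = 62_660, 62_023
print(f"~{raw_disk_tib / hard_disks:.2f} TiB of raw capacity per disk")   # ~1.01 TiB
print(f"~{raw_disk_tib / 1_024:.0f} PiB of raw disk vs 47 PiB of tape")   # ~61 PiB of disk
```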

36 Evolution of capacity: CERN & WLCG (plot slide)

