Grid Computing in High Energy Physics: Challenges and Opportunities
Dr. Ian Bird, LHC Computing Grid Project Leader
Göttingen Tier 2 Inauguration, 13th May 2008
The scales Ian.Bird@cern.ch
High Energy Physics machines and detectors
pp @ √s = 14 TeV; L: 10^34 cm^-2 s^-1 (L: 2·10^32 cm^-2 s^-1)
Muon chambers, calorimeter
2.5 million collisions per second; LVL1: 10 kHz, LVL3: 50–100 Hz; 25 MB/s digitized recording
40 million collisions per second; LVL1: 1 kHz, LVL3: 100 Hz; 0.1 to 1 GB/s digitized recording
LHC: 4 experiments … ready! First physics expected in autumn 2008. Is the computing ready?
The LHC Computing Challenge
Signal/noise: 10^-9
Data volume: high rate × large number of channels × 4 experiments → 15 PetaBytes of new data each year
Compute power: event complexity × number of events × thousands of users → 100k of (today's) fastest CPUs
Worldwide analysis & funding: computing funded locally in major regions & countries; efficient analysis everywhere → GRID technology
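The 15 PB/year figure can be sanity-checked with back-of-envelope arithmetic. The event rate and event size below are illustrative assumptions (typical orders of magnitude, not numbers from the slide):

```python
# Order-of-magnitude check of the "15 PB of new data each year" figure.
# Event rate, event size and running time are illustrative assumptions.
seconds_per_year = 1e7    # assumed effective LHC running time per year
event_rate_hz = 200       # assumed recorded events/s per experiment
event_size_mb = 1.5       # assumed raw event size, MB
experiments = 4

raw_pb = experiments * event_rate_hz * event_size_mb * seconds_per_year / 1e9
# Raw data alone is ~12 PB/yr under these assumptions; with derived and
# simulated data the total comfortably reaches the quoted 15 PB/yr.
print(round(raw_pb, 1))  # → 12.0
```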
A collision at LHC
Luminosity: 10^34 cm^-2 s^-1
40 MHz – a bunch crossing every 25 ns
~20 overlapping events per crossing
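The ~20 overlapping events per crossing follow directly from the luminosity and the bunch-crossing rate; a minimal check, assuming an inelastic pp cross-section of ~70 mb (a value not stated on the slide, used here only for illustration):

```python
# Pile-up estimate: mean interactions per bunch crossing = L * sigma / f.
# sigma_inel ~ 70 mb is an assumed approximate pp inelastic cross-section.
L = 1e34                       # luminosity, cm^-2 s^-1
sigma_cm2 = 70.0 * 1e-27       # 70 mb in cm^2 (1 mb = 1e-27 cm^2)
f_crossing = 40e6              # bunch-crossing rate, Hz (every 25 ns)

interactions_per_second = L * sigma_cm2        # ~7e8 interactions/s
pileup = interactions_per_second / f_crossing  # ~17.5, i.e. ~20 events
print(round(pileup, 1))
```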
The Data Acquisition
Tier 0 at CERN: Acquisition, First-pass reconstruction, Storage & Distribution
Operation is continuous over several months, tightly linking collisions, experiments and computing: if the reliability of the computing is not good enough, it can impact the DAQ and ultimately the quality of the science.
1.25 GB/sec (ions)
Tier 0 – Tier 1 – Tier 2
Tier-0 (CERN): data recording; first-pass reconstruction; data distribution
Tier-1 (11 centres): permanent storage; re-processing; analysis
Tier-2 (>200 centres): simulation; end-user analysis
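The tier roles above can be sketched as a simple data structure. The dictionary layout and helper function are illustrative only; the roles are taken from the slide:

```python
# Toy sketch of the WLCG tier model (roles as listed on the slide).
WLCG_TIERS = {
    "Tier-0": {"sites": 1,
               "roles": ["data recording", "first-pass reconstruction",
                         "data distribution"]},
    "Tier-1": {"sites": 11,
               "roles": ["permanent storage", "re-processing", "analysis"]},
    "Tier-2": {"sites": 200,   # ">200 centres" on the slide
               "roles": ["simulation", "end-user analysis"]},
}

def tiers_for(role):
    """Return which tiers carry a given responsibility."""
    return [t for t, info in WLCG_TIERS.items() if role in info["roles"]]

print(tiers_for("simulation"))  # → ['Tier-2']
```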
Evolution of requirements – ATLAS (or CMS), for the first year at design luminosity:
ATLAS & CMS CTP: 10^7 MIPS, 100 TB disk
“Hoffmann” Review: 7×10^7 MIPS, 1,900 TB disk
Computing TDRs: 55×10^7 MIPS (140 MSi2K), 70,000 TB disk
[Timeline 1994–2008: LHC approved; ATLAS & CMS approved; ALICE approved; LHCb approved; LHC start]
Evolution of CPU Capacity at CERN
Tape & disk requirements: >10 times what CERN alone could provide
[Chart: CPU capacity across accelerator eras – SC (0.6 GeV), PS (28 GeV), ISR (300 GeV), SPS (400 GeV), ppbar (540 GeV), LEP (100 GeV), LEP II (200 GeV), LHC (14 TeV); costs in 2007 Swiss Francs, including infrastructure (computing centre, power, cooling, …) and physics tapes]
Evolution of Grids (1994–2008)
GriPhyN, iVDGL, PPDG → Grid3 → OSG
EU DataGrid → EGEE-1 → EGEE-2 → EGEE-3
LCG-1 → LCG-2 → WLCG
Data Challenges; Service Challenges; Cosmics; First physics
The Worldwide LHC Computing Grid
Purpose: develop, build and maintain a distributed computing environment for the storage and analysis of data from the four LHC experiments; ensure the computing service … and common application libraries and tools
Phase I (2002–05): development & planning
Phase II (2006–2008): deployment & commissioning of the initial services
WLCG Collaboration – MoU Signing Status
The Collaboration: 4 LHC experiments; ~250 computing centres; 12 large centres (Tier-0, Tier-1); 56 federations of smaller “Tier-2” centres; growing to ~40 countries
Grids: EGEE, OSG, NorduGrid
Technical Design Reports (WLCG, 4 experiments): June 2005
Memorandum of Understanding: agreed in October 2005; resources: 5-year forward look
Tier 1: all have now signed
Tier 2 signatories: Australia, Belgium, Canada*, China, Czech Rep.*, Denmark, Estonia, Finland, France, Germany(*), Hungary*, Italy, India, Israel, Japan, JINR, Korea, Netherlands, Norway*, Pakistan, Poland, Portugal, Romania, Russia, Slovenia, Spain, Sweden*, Switzerland, Taipei, Turkey*, UK, Ukraine, USA
Still to sign: Austria, Brazil (under discussion)
* Recent additions
WLCG Service Hierarchy
Tier-0 (the accelerator centre): data acquisition & initial processing; long-term data curation; distribution of data
Tier-1 centres: Canada – TRIUMF (Vancouver); France – IN2P3 (Lyon); Germany – Forschungszentrum Karlsruhe; Italy – CNAF (Bologna); Netherlands – NIKHEF/SARA (Amsterdam); Nordic countries – distributed Tier-1; Spain – PIC (Barcelona); Taiwan – Academia Sinica (Taipei); UK – CLRC (Oxford); US – FermiLab (Illinois), Brookhaven (NY)
Tier-1: “online” to the data acquisition process – high availability; managed mass storage – grid-enabled data service; data-heavy analysis; national, regional support
Tier-2: ~130 centres in ~35 countries; end-user (physicist, research group) analysis – where the discoveries are made; simulation
Recent grid use
Across all grid infrastructures: EGEE, OSG, NorduGrid
The grid concept really works – all contributions, large & small, are essential!
CERN: 11%; Tier 1: 35%; Tier 2: 54%
Recent grid activity
WLCG ran ~44 M jobs in 2007, and the workload has continued to increase: 29 M jobs already in 2008, now running at >300k jobs/day (up from 230k/day)
The distribution of work across Tier-0/Tier-1/Tier-2 illustrates the importance of the grid system: the Tier-2 contribution is around 50%, and >85% of the work is external to CERN
These workloads (reported across all WLCG centres) are at the level anticipated for 2008 data taking
LHCOPN Architecture
Data Transfer out of Tier-0
Target for 2008/2009: 1.3 GB/s
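The 1.3 GB/s target is of the same order as the rate implied by the 15 PB/year figure quoted earlier. A minimal sketch, assuming an effective data-taking time of ~10^7 s per year (an assumption, not a slide figure):

```python
# Relating the annual data volume to a sustained Tier-0 export rate.
# The running time of 1e7 s is an assumed effective data-taking period.
new_data_pb_per_year = 15      # from the "15 PB of new data each year" slide
running_seconds = 1e7          # assumed effective data-taking time per year

avg_rate_gbs = new_data_pb_per_year * 1e6 / running_seconds  # PB -> GB
print(round(avg_rate_gbs, 2))  # → 1.5
```

Averaged over running periods alone, the raw rate under these assumptions is already ~1.5 GB/s, so the 1.3 GB/s export target must be sustained essentially continuously while the machine runs.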
Production Grids
WLCG relies on a production-quality infrastructure. This requires standards of availability/reliability, performance and manageability, and it will be used 365 days a year … (it has been, for several years!)
Tier 1s must store the data for at least the lifetime of the LHC – ~20 years. This is not passive storage: it requires active migration to newer media.
It is vital that we build a fault-tolerant, reliable system that can deal with individual sites being down, and recover.
The EGEE Production Infrastructure
Support structures & processes: Operations Coordination Centre; Regional Operations Centres; Global Grid User Support; EGEE Network Operations Centre (SA2); Operational Security Coordination Team
Test-beds & services: production service; pre-production service; certification test-beds (SA3); training infrastructure (NA4); training activities (NA3)
Security & policy groups: Operations Advisory Group (+NA4); Joint Security Policy Group; EuGridPMA (& IGTF); Grid Security Vulnerability Group
Site Reliability

              Sep 07   Oct 07   Nov 07   Dec 07   Jan 08   Feb 08
All sites      89%      86%      92%      87%      84%
8 best         93%      95%      96%
Above target
(+ >90%)      7 + 2    5 + 4    9 + 2    6 + 4    7 + 3
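The reported percentages are, in essence, the fraction of successful availability probes over a month. A minimal sketch of that computation (the real WLCG monitoring also accounts for scheduled downtime, which is omitted here):

```python
# Site reliability as the fraction of successful monitoring probes.
# Simplified: scheduled downtime handling is omitted.
def reliability(results):
    """results: list of booleans, one per monitoring probe in the period."""
    return sum(results) / len(results) if results else 0.0

# e.g. 89 successful probes out of 100 reproduces the 89% "All sites" figure
probes = [True] * 89 + [False] * 11
print(f"{reliability(probes):.0%}")  # → 89%
```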
Improving Reliability Monitoring Metrics Workshops Data challenges Experience Systematic problem analysis Priority from software developers
Gridmap
Middleware: Baseline Services
The basic baseline services – from the TDR (2005):
Storage Element: Castor, dCache, DPM (StoRM added in 2007); SRM 2.2 deployed in production, Dec 2007
Basic transfer tools: GridFTP, …
File Transfer Service (FTS)
LCG File Catalog (LFC)
LCG data management tools: lcg-utils
POSIX I/O: Grid File Access Library (GFAL)
Synchronised databases T0→T1s: 3D project
Information System: scalability improvements
Compute Elements: Globus/Condor-C; improvements to LCG-CE for scale/reliability; web services (CREAM); support for multi-user pilot jobs (glexec, SCAS)
gLite Workload Management in production
VO Management System (VOMS); VO Boxes; application software installation; job monitoring tools
Focus now is on continuing evolution of reliability, performance, functionality and requirements. For a production grid the middleware must allow us to build fault-tolerant and scalable services: this is more important than sophisticated functionality.
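Of the services above, the LCG File Catalog (LFC) maps a logical file name to its physical replicas at different sites. A toy in-memory sketch of that mapping – the real LFC is a database-backed grid service, and the file and site names below are hypothetical:

```python
# Toy sketch of a replica catalogue (LFN -> physical replicas), in the
# spirit of the LFC. All names below are hypothetical examples.
catalog = {}  # logical file name -> list of storage URLs

def register_replica(lfn, surl):
    """Record that a physical copy of `lfn` exists at `surl`."""
    catalog.setdefault(lfn, []).append(surl)

def list_replicas(lfn):
    """Return all known physical replicas of a logical file."""
    return catalog.get(lfn, [])

register_replica("/grid/atlas/raw/run001.root",
                 "srm://tier1-example.org/atlas/run001.root")
register_replica("/grid/atlas/raw/run001.root",
                 "srm://tier2-example.org/atlas/run001.root")
print(len(list_replicas("/grid/atlas/raw/run001.root")))  # → 2
```

The design point this illustrates: analysis jobs refer only to logical names, so data can be replicated, moved or re-sited without touching user code.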
Database replication
In full production: several GB/day of user data can be sustained to all Tier 1s
~100 DB nodes at CERN and several tens of nodes at Tier 1 sites – a very large distributed database deployment
Used for several applications: experiment calibration data; replicating (central, read-only) file catalogues
LCG depends on two major science grid infrastructures: EGEE (Enabling Grids for E-Science) and OSG (US Open Science Grid). Interoperability & interoperation are vital – significant effort has gone into building the procedures to support them.
Grid infrastructure project co-funded by the European Commission – now in its 2nd phase, with 91 partners in 32 countries
240 sites in 45 countries; 45,000 CPUs; 12 PetaBytes; >5,000 users; >100 VOs; >100,000 jobs/day
Disciplines: archeology, astronomy, astrophysics, civil protection, computational chemistry, earth sciences, finance, fusion, geophysics, high energy physics, life sciences, multimedia, material sciences, …
EGEE: Increasing workloads ⅓ non-LHC
Grid Applications: medical, seismology, chemistry, astronomy, particle physics, fusion
Share of EGEE resources: HEP, 5/07–4/08: 45 million jobs
HEP use of EGEE: May 07 – Apr 08
The next step
Sustainability: Beyond EGEE-II
Need to prepare a permanent, common grid infrastructure
Ensure the long-term sustainability of the European e-infrastructure, independent of short project funding cycles
Coordinate the integration and interaction between National Grid Infrastructures (NGIs)
Operate the European level of the production grid infrastructure for a wide range of scientific disciplines, to link the NGIs
Expand the idea, and address the problems, of the JRU (Joint Research Unit)
EGI – European Grid Initiative (www.eu-egi.org)
EGI Design Study: proposal to the European Commission (started Sept 07), supported by 37 National Grid Initiatives (NGIs)
A 2-year project to prepare the setup and operation of a new organisational model for a sustainable pan-European grid infrastructure after the end of EGEE-3
Summary
We have an operating, production-quality grid infrastructure that:
is in continuous use by all 4 experiments (and many other applications);
is still growing in size – sites, resources (and still to finish the ramp-up for LHC start-up);
demonstrates interoperability (and interoperation!) between 3 different grid infrastructures (EGEE, OSG, NorduGrid);
is becoming more and more reliable;
is ready for LHC start-up.
For the future we must:
learn how to reduce the effort required for operation;
tackle upcoming infrastructure issues (e.g. power, cooling);
manage the migration of underlying infrastructures to longer-term models;
be ready to adapt the WLCG service to new ways of doing distributed computing.