A Grid For Particle Physics: From Testbed to Production. Jeremy Coles, 3rd September 2004, All Hands Meeting – Nottingham, UK


Contents
- Review of GridPP1 and the European Data Grid Project
- The middleware components of the testbed
- Lessons learnt from the project
- Status of the current operational Grid
- Future plans and challenges
- Summary

The LHC: the physics driver
- 40 million collisions per second
- After filtering, ~100 collisions of interest per second
- 1-10 Megabytes of data digitised for each collision = recording rate of 0.1-1 Gigabytes/sec
- ~10^10 collisions recorded each year = ~10 Petabytes/year of data
Data volumes in context (experiments: ALICE, ATLAS, CMS, LHCb):
- 1 Megabyte (1MB): a digital photo
- 1 Gigabyte (1GB) = 1000 MB: a DVD movie
- 1 Terabyte (1TB) = 1000 GB: world annual book production
- 1 Petabyte (1PB) = 1000 TB: annual data production of one LHC experiment
- 1 Exabyte (1EB) = 1000 PB: world annual information production
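The recording rate and annual volume follow directly from the collision rate and event size. A quick check in Python, with the post-filter rate, event size and accelerator running time per year taken as round, assumed values rather than figures from the slide:

```python
# Back-of-the-envelope check of the LHC data rates quoted above.
# Assumed inputs: ~100 selected collisions/s, 1-10 MB per collision,
# and ~1e7 seconds of accelerator running per year.

events_per_second = 100
mb_per_event_range = (1, 10)
seconds_per_year = 1e7

for mb_per_event in mb_per_event_range:
    rate_gb_per_s = events_per_second * mb_per_event / 1000       # MB/s -> GB/s
    volume_pb_per_year = rate_gb_per_s * seconds_per_year / 1e6   # GB -> PB
    print(f"{mb_per_event} MB/event: {rate_gb_per_s:.1f} GB/s, "
          f"{volume_pb_per_year:.0f} PB/year")

# Output: 0.1-1.0 GB/s recording rate, i.e. roughly 1-10 PB/year per experiment.
```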

Data handling [diagram]: the detector's event filter (selection & reconstruction) produces raw data; reconstruction turns raw data into event summary data; event reprocessing, event simulation and batch physics analysis generate processed data and analysis objects (extracted by physics topic) for interactive physics analysis, with the early steps running at CERN.

The UK response: GridPP
- GridPP – A UK Computing Grid for Particle Physics
- 19 UK Universities, CCLRC (RAL & Daresbury) and CERN
- Funded by the Particle Physics and Astronomy Research Council (PPARC)
- GridPP1 – Sept 2001-2004, £17m, "From Web to Grid"
- GridPP2 – Sept 2004-2007, £16(+1)m, "From Prototype to Production"

GridPP1 project structure

The project
- Software: > 65 use cases; 7 major software releases (> 60 in total); > 1,000,000 lines of code
- People: 500 registered users; 12 Virtual Organisations; 21 Certificate Authorities; > 600 people trained; 456 person-years of effort
- Application testbed: ~20 regular sites; > 60,000 jobs submitted (since 09/03, release 2.0); peak > 1,000 CPUs; 6 mass storage systems
- Scientific applications: 5 Earth observation institutes; 10 bio-medical applications; 6 HEP experiments

Contents: The middleware components of the testbed; Lessons learnt from the project

The infrastructure developed
- User Interface (UI): job submission from Python (default), a Java GUI, and APIs (C++, Java, Python); jobs described in JDL
- Resource Broker: built on C++ Condor matchmaking libraries, with Condor-G for submission
- Computing Element: Gatekeeper (Perl script) plus scheduler, in front of the batch workers
- Storage Element: gridFTP access to NFS, tape and Castor back-ends
- Replica catalogue: one per VO (or equivalent)
- Berkeley Database Information Index (BDII)
- AA server (VOMS)
- Logging & Bookkeeping: MySQL DB stores job state information
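To make the submission path concrete, here is a minimal sketch of how a job reached this infrastructure from the User Interface: a JDL description handed to the Resource Broker via the EDG command-line tools. The executable, VO name and option spellings are illustrative and varied between releases.

```python
# Sketch of job submission from the EDG/LCG User Interface: write a JDL
# description, then hand it to the Resource Broker, which matches it to a
# Computing Element and submits via Condor-G.  Details are illustrative.

import subprocess
import tempfile

jdl = """\
Executable    = "/bin/hostname";
Arguments     = "-f";
StdOutput     = "std.out";
StdError      = "std.err";
OutputSandbox = {"std.out", "std.err"};
"""

with tempfile.NamedTemporaryFile("w", suffix=".jdl", delete=False) as f:
    f.write(jdl)
    jdl_path = f.name

# The returned job identifier is later used with edg-job-status and
# edg-job-get-output to track the job and retrieve the output sandbox.
subprocess.run(["edg-job-submit", "--vo", "dteam", jdl_path], check=True)
```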

Integration
Much time was spent on:
- Controlling the direct and indirect interplay of the various integrated components
- Addressing stability issues (often configuration-linked) and bottlenecks in a non-linear system
- Predicting (or failing to predict) where the next bottleneck would appear in the job-processing network
[Diagram labels: information services, (MDS +) BDII or R-GMA; data services, RLS and RC]

The storage element
- The Storage Element manages storage and provides common interfaces to Grid clients; higher-level data management tools use replica catalogues and metadata about files to locate replicas and optimise which replica to use
- Since EDG, work has provided the SE with an SRM 1 interface; SRM 2.1, with added functionality, will be available soon
- The SRM interface is a file-control interface; there is also an interface for publishing information
- Internally, handlers ensure modularity and flexibility (the SE fronts tape or disk storage, access control and file metadata)
Lessons learnt
- Separating file control (e.g. staging, pinning) from data transfer is useful (different nodes give better performance): it can be used for load balancing, redirection, etc., and makes it easy to add new data-transfer protocols
- However, files in the cache must be released by the client or they time out
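The file-control versus data-transfer split, and the pin/release behaviour noted in the lessons, can be sketched as below. This is an illustrative mock, not the real SRM API; the class and method names are invented to mirror the pattern.

```python
# Hypothetical sketch of an SRM-style control interface: stage and pin a file
# through the control channel, move the bytes over a separate transfer
# protocol, then release the pin so the disk cache can be reused.

import time


class StorageElementControl:
    """Stand-in for a Storage Element's file-control (SRM-like) interface."""

    PIN_LIFETIME = 600.0  # seconds; unreleased pins time out, as noted above

    def __init__(self) -> None:
        self._pins = {}  # surl -> pin expiry time

    def prepare_to_get(self, surl: str) -> str:
        # Stage from tape to disk cache and pin the copy; hand back a
        # transfer URL for a separate data-transfer service to act on.
        self._pins[surl] = time.time() + self.PIN_LIFETIME
        return surl.replace("srm://", "gsiftp://")

    def release(self, surl: str) -> None:
        # Client frees the cached copy; otherwise the SE reclaims it on timeout.
        self._pins.pop(surl, None)


def fetch(se: StorageElementControl, surl: str, local_path: str) -> None:
    turl = se.prepare_to_get(surl)        # control channel: stage + pin
    copy_bytes(turl, local_path)          # data channel: e.g. gridFTP; pluggable
    se.release(surl)                      # control channel: release the pin


def copy_bytes(turl: str, local_path: str) -> None:
    ...  # placeholder for the actual transfer protocol (gridFTP in EDG)
```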

Information & monitoring
- Based on the (simple model of the) Grid Monitoring Architecture (GMA) from the GGF
- The Relational Grid Monitoring Architecture (R-GMA) hides the Registry mechanism from the user: the Producer registers on the user's behalf, and the Mediator (in the Consumer) transparently selects the correct Producer(s) to answer a query
- Uses the relational model (the R of R-GMA) to facilitate expression of queries over all the published information
- Users just think in terms of Producers and Consumers (Producer, Registry/Schema, Consumer)
Lessons learnt
- Release working code early
- Distributed software system testing is hard – a private WP3 testbed was very useful
- Automate as much as possible (CruiseControl always runs all tests!)
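A conceptual sketch of the producer/consumer view, using an in-memory SQLite database to stand in for the registry and mediator. This is not the R-GMA API, just the shape of the model: producers publish tuples, consumers write SQL, and the routing stays hidden.

```python
# Conceptual R-GMA-style producer/consumer sketch (not the real API).
# One in-memory relational store stands in for the registry + mediator,
# so users only see Producers and Consumers and plain SQL.

import sqlite3

_virtual_db = sqlite3.connect(":memory:")
_virtual_db.execute("CREATE TABLE ServiceStatus (site TEXT, service TEXT, up INTEGER)")


class Producer:
    def __init__(self, table: str):
        self.table = table  # registration happens on the user's behalf

    def insert(self, *row):
        marks = ",".join("?" for _ in row)
        _virtual_db.execute(f"INSERT INTO {self.table} VALUES ({marks})", row)


class Consumer:
    def query(self, sql: str):
        # The mediator would pick the right producers; here it is one table.
        return _virtual_db.execute(sql).fetchall()


# Sites publish monitoring tuples ...
rgma = Producer("ServiceStatus")
rgma.insert("RAL", "CE", 1)
rgma.insert("Glasgow", "SE", 0)

# ... and a user simply asks a relational question.
print(Consumer().query("SELECT site FROM ServiceStatus WHERE up = 0"))
```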

The security model [diagram]: the user registers (low frequency) with a CA for a long-life user cert and with the VO's VOMS server; voms-proxy-init then creates a short-life proxy cert carrying a short-life authorisation cert (high frequency). Services hold long-life host certs and short-life service and authorisation certs, with CRL updates from the CA. User and service perform mutual authentication and exchange authorisation information, and LCAS (Local Centre Authorisation Service) applies site-local authorisation.
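From the user's side, the high-frequency part of this flow is just two commands. voms-proxy-init and voms-proxy-info are the real tools (the first is named on the slide), but exact options are version-dependent, so treat this as a sketch:

```python
# User-side view of the credential flow: the long-lived certificate signs a
# short-lived proxy that carries VOMS authorisation attributes, which a
# site's LCAS then evaluates.  Options shown are illustrative.

import subprocess

VO = "atlas"  # example VO name

# Contact the VO's VOMS server and create a short-life proxy certificate
# with the VO attributes embedded (prompts for the grid pass phrase).
subprocess.run(["voms-proxy-init", "-voms", VO], check=True)

# Inspect the proxy: identity, remaining lifetime and VOMS attributes.
subprocess.run(["voms-proxy-info", "-all"], check=True)
```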

The security model (2)
Lessons learned
- Be careful collecting requirements (integration is difficult)
- Security must be an integral part of all development, from the start
- Building and maintaining trust between projects and continents takes time
- Integration of security into existing systems is complex
- There must be a dedicated activity dealing with security; EGEE benefited greatly and now has a separate security activity
GridPP contributions
- Authentication: GridPP led the EDG/LCG CA infrastructure (trust)
- Authorisation: VOMS for global policy; LCAS for local site policy; GACL (fine-grained access control) and GridSite for HTTP
- LCG/EGEE security policy led by GridPP

Networking
- A network transfer cost estimation service provides applications and middleware with the costs of data transport: used by Resource Brokers for optimised matchmaking (getAccessCost) and directly by applications (getBestFile)
- GEANT network test campaign: network Quality of Service; high-throughput transfers
- Close collaboration with DANTE: set-up of the testbed, analysis of results, access granted to all internal GEANT monitoring tools
- Network monitoring is a key activity, both for provisioning and to provide accurate aggregate information for global grid schedulers
- The investigations into network QoS have led to a much greater understanding of how to utilise the network to benefit Grid operations
- Benefits resulted from close contact with DANTE and DataTAG, at both technical and management level
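The idea behind getAccessCost and getBestFile can be shown in a few lines: replica selection driven by network-cost estimates. The cost table and function below are invented for illustration; in EDG the costs came from the network monitoring infrastructure.

```python
# Illustrative cost-based replica selection (the names echo getAccessCost /
# getBestFile from the slide; the implementation and numbers are made up).

# Estimated transfer cost in seconds per GB from each storage site to the
# computing site where the job runs -- a stand-in for monitoring data.
COST_S_PER_GB = {
    ("RAL", "Glasgow"): 40.0,
    ("CERN", "Glasgow"): 120.0,
    ("Lyon", "Glasgow"): 90.0,
}


def get_access_cost(replica_site, ce_site, file_size_gb):
    """Estimated cost (seconds) of delivering one replica to one computing site."""
    return COST_S_PER_GB[(replica_site, ce_site)] * file_size_gb


def get_best_file(replica_sites, ce_site, file_size_gb):
    """Choose the replica that is cheapest to reach from the computing site."""
    return min(replica_sites, key=lambda s: get_access_cost(s, ce_site, file_size_gb))


# A Resource Broker (or an application) choosing between three replicas:
print(get_best_file(["RAL", "CERN", "Lyon"], "Glasgow", 2.0))  # -> RAL
```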

Project lessons learnt
- Formation of Task Forces (applications + middleware) was a very important step midway through the project; applications should have played a larger role in architecture discussions from the start
- The Loose Cannons (a team of 5) were crucial to all developments, working across experiments and work packages
- Site certification needs to be improved, and validation needs to be automated and run regularly (see the sketch below); misconfigured sites may cause many failures
- It is important to provide a stable environment to attract users, but at the start get working code out to known users as quickly as possible
- Quality should start at the beginning of the project, for all activities, with defined procedures, standards and metrics
- Security needs to be an integrated part from the very beginning
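As referenced in the site-certification item above, regular automated validation can be as simple as running a trivial test against every site on a schedule and flagging failures. A hypothetical sketch, with the site list, test command and interval all invented:

```python
# Hypothetical automated site-validation loop: periodically run a trivial
# check against every site and report failures, so misconfigured sites are
# spotted before user jobs hit them.  Everything here is illustrative.

import subprocess
import time

SITES = ["RAL", "Manchester", "Glasgow", "Imperial"]  # invented list
CHECK_INTERVAL = 3600  # seconds between validation rounds


def site_ok(site: str) -> bool:
    # Stand-in for "submit a test job to this site's CE and check the result".
    result = subprocess.run(["echo", f"test job for {site}"], capture_output=True)
    return result.returncode == 0


def validation_round() -> None:
    failures = [s for s in SITES if not site_ok(s)]
    ok = len(SITES) - len(failures)
    print(f"{time.ctime()}: {ok}/{len(SITES)} sites OK; failing: {failures or 'none'}")


if __name__ == "__main__":
    while True:
        validation_round()
        time.sleep(CHECK_INTERVAL)
```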

Contents: Status of the current operational Grid

Our grid is working …
- NorthGrid ****: Daresbury, Lancaster, Liverpool, Manchester, Sheffield
- SouthGrid *: Birmingham, Bristol, Cambridge, Oxford, RAL PPD, Warwick
- ScotGrid *: Durham, Edinburgh, Glasgow
- LondonGrid ***: Brunel, Imperial, QMUL, RHUL, UCL

… and is part of LCG
- Rutherford Appleton Laboratory, together with a site in Taipei, is currently providing the Grid Operations Centre; it will also run the UK/Ireland EGEE Regional Operations Centre and Core Infrastructure Centre
- Resources are being used for data challenges
- Within the UK we have some VO/experiment Memoranda of Understanding in place
- The Tier-2 structure is working well

Scale
GridPP prototype Grid:
- > 1,000 CPUs (500 CPUs at the Tier-1 at RAL; > 500 CPUs at 11 sites across the UK, organised in 4 regional Tier-2s)
- > 500 TB of storage
- > 800 simultaneous jobs
Integrated with the international LHC Computing Grid (LCG):
- > 5,000 CPUs
- > 4,000 TB of storage
- > 70 sites around the world
- > 4,000 simultaneous jobs
- Monitored via the Grid Operations Centre (RAL)
[Table of yesterday's snapshot, values not preserved: CPUs, free CPUs, running jobs, waiting jobs, available TB, used TB, max CPU, average CPU, totals; hyperthreading enabled on some sites.]

Past upgrade experience at RAL: previously, utilisation of new resources grew steadily over weeks or months.

Tier-1 update, July 2004 hardware upgrade: with the Grid we see much more rapid utilisation of newly deployed resources.

Contents: Future plans and challenges

Current context of GridPP

GridPP2 management [diagram]: Collaboration Board; Project Management Board; Project Leader and Project Manager; Deployment Board and User Board; Production Manager and Dissemination Officer; liaison with GGF, LCG, EGEE and UK e-Science; Project Map and Risk Register.

There are still challenges
- Middleware validation
- Improving Grid efficiency
- Meeting experiment requirements with the Grid
- Provision of work group computing
- Distributed file (and sub-file) management
- Experiment software distribution
- Provision of distributed analysis functionality
- Production accounting
- Encouraging an open sharing of resources
- Security

Middleware validation is starting to be addressed through a Certification and Testing testbed [diagram]: development and integration (unit and functional testing) produces a developer tag from JRA1; certification testing integrates it, runs basic functionality tests, the C&T and site test suites and the certification matrix, giving a release candidate tag and then a certified release tag; application integration (HEP experiments, bio-medical and other applications) and software installation lead to a deployment release tag; deployment preparation and deployment by SA1 services take it through pre-production to production, with a final production tag.

Work Group Computing

Distributed analysis
1. AliEn (the ALICE Grid) provided a pre-Grid implementation [Perl scripts]
2. ARDA provides a framework for particle physics application middleware

Software distribution
- ATLAS Data Challenge to validate the world-wide computing model
- Packaging, distribution and installation. Scale: one release build takes 10 hours and produces 2.5 GB of files. Complexity: 500 packages, millions of lines of code, 100s of developers and 1000s of users
- The ATLAS collaboration is widely distributed: 140 institutes, all wanting to use the software
- Needs push-button easy installation
[Diagram: Step 1, Monte Carlo data challenges (physics models → Monte Carlo truth data → detector simulation → MC raw data → reconstruction → MC event summary data and MC event tags); Step 2, real data (trigger system, data acquisition and level-3 trigger → raw data → reconstruction → event summary data (ESD) and event tags, with trigger tags, calibration data and run conditions).]

Production accounting: the GOC aggregates data across all sites.

Deployment [diagram]: built on security, a stable fabric and middleware, underpinned by procedures, documentation, metrics, accounting and monitoring, support, and porting to new platforms…

Current status

Grid evolution ("Grevolution") [diagram]: from separate experiments, resources and multiple accounts (BaBar, D0, CDF, ATLAS, CMS, LHCb, ALICE across 19 UK institutes, the RAL Computer Centre and the CERN Computer Centre), through prototype Grids (SAMGrid, BaBarGrid, EDG, GANGA; the UK prototype Tier-1/A centre, UK prototype Tier-2 centres and the CERN prototype Tier-0 centre), towards 'one' production Grid (LCG, EGEE, ARDA; the UK Tier-1/A centre, 4 UK Tier-2 centres and the CERN Tier-0 centre).

Contents: Summary

Summary
- The Large Hadron Collider data volumes make Grid computing a necessity
- GridPP1, with EDG, developed a successful Grid prototype
- GridPP members have played a critical role in most areas: security, workload management, monitoring and operations
- GridPP involvement continues with the Enabling Grids for e-Science in Europe (EGEE) project, driving the federating of Grids
- As we move towards a full production service we face many challenges in areas such as deployment, accounting and true open sharing of resources
- Or, to see a possible analogy of developing a Grid, follow this link!

Useful links
GridPP and LCG:
- GridPP collaboration
- Grid Operations Centre (inc. maps)
- The LHC Computing Grid
Others:
- PPARC
- The EGEE project
- The European Data Grid final review