Massive Computing at CERN and lessons learnt

1 Massive Computing at CERN and lessons learnt
Bob Jones, CERN (Bob.Jones <at> CERN.ch)

2 WLCG – what and why?
A distributed computing infrastructure to provide the production and analysis environments for the LHC experiments.
Managed and operated by a worldwide collaboration between the experiments and the participating computer centres.
The resources are distributed – for funding and sociological reasons.
Our task is to make use of the resources available to us – no matter where they are located.
Ian Bird, CERN

3 What is WLCG today?
Collaboration: coordination & management & reporting; coordination of resources & funding; coordination with service & technology providers; common requirements; Memorandum of Understanding.
Framework: service management; service coordination; operational security; support processes & tools; common tools; monitoring & accounting; world-wide trust federation for CAs and VOs; complete policy framework.
Distributed computing services.
Physical resources: CPU, disk, tape, networks.

4 WLCG data processing model
Tier-0 (CERN): data recording, initial data reconstruction, data distribution
Tier-1 (11 centres): permanent storage, re-processing, analysis
Tier-2 (~130 centres): simulation, end-user analysis
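To make the tier roles concrete, here is a minimal, purely illustrative Python sketch of the fan-out described above. The tier names, counts and roles come from this slide; every identifier in the code is invented for illustration and does not correspond to real WLCG software.

```python
# Illustrative sketch only: a toy encoding of the WLCG tier roles listed above.
# None of these class or function names correspond to real WLCG services or APIs.
from dataclasses import dataclass, field

@dataclass
class Tier:
    name: str
    level: int
    roles: list[str]
    downstream: list["Tier"] = field(default_factory=list)

# ~130 Tier-2 centres doing simulation and end-user analysis (shared list, toy model)
tier2 = [Tier(f"T2-site-{i}", 2, ["simulation", "end-user analysis"]) for i in range(130)]
# 11 Tier-1 centres doing permanent storage, re-processing and analysis
tier1 = [Tier(f"T1-centre-{i}", 1,
              ["permanent storage", "re-processing", "analysis"],
              downstream=tier2) for i in range(11)]
# Tier-0 at CERN records, reconstructs and distributes the data
tier0 = Tier("CERN", 0,
             ["data recording", "initial reconstruction", "data distribution"],
             downstream=tier1)

def distribute(dataset: str, source: Tier) -> None:
    """Toy fan-out: Tier-0 ships a dataset to every Tier-1 for custodial storage."""
    for t1 in source.downstream:
        print(f"{source.name} -> {t1.name}: replicate {dataset}")

distribute("RAW/run-0001", tier0)
```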

5 WLCG Collaboration Status (26 June 2009)
Tier 0; 11 Tier 1s; 64 Tier 2 federations.
Tier 0: CERN. Tier 1 centres: CA-TRIUMF, NDGF, US-FNAL, US-BNL, UK-RAL, DE-FZK, Barcelona/PIC, Lyon/CCIN2P3, Amsterdam/NIKHEF-SARA, Taipei/ASGC, Bologna/CNAF.
Today we have 49 MoU signatories, representing 34 countries: Australia, Austria, Belgium, Brazil, Canada, China, Czech Rep, Denmark, Estonia, Finland, France, Germany, Hungary, Italy, India, Israel, Japan, Rep. Korea, Netherlands, Norway, Pakistan, Poland, Portugal, Romania, Russia, Slovenia, Spain, Sweden, Switzerland, Taipei, Turkey, UK, Ukraine, USA.
Ian Bird, CERN

6 Redundancy meant no interruption
Fibre cut during 2009: redundancy meant no interruption.
Ian Bird, CERN

7 Worldwide resources
>140 sites; ~250k CPU cores; ~100 PB disk

8 Service quality: defined in MoU
The MoU defines key performance and support metrics for Tier 1 and Tier 2 sites.
Reliabilities are an approximation for some of these.
There are also metrics on response times, resources, etc.
The MoU has been an important tool in bringing services to an acceptable level.
Ian Bird, CERN
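As a rough illustration of the kind of reliability figure such metrics track, the sketch below aggregates hypothetical per-site monitoring samples. It is an assumption-laden toy, not the official WLCG availability/reliability calculation; the sample format and numbers are invented.

```python
# Minimal sketch, not the official WLCG algorithm: estimate availability and
# reliability of a site from hypothetical monitoring samples.
from dataclasses import dataclass

@dataclass
class Sample:
    ok: bool                  # did the service test pass in this time slot?
    scheduled_downtime: bool  # was the slot covered by an announced downtime?

def availability(samples: list[Sample]) -> float:
    """Fraction of all time slots in which the site was up."""
    return sum(s.ok for s in samples) / len(samples)

def reliability(samples: list[Sample]) -> float:
    """Like availability, but scheduled downtimes are excluded from the denominator."""
    unscheduled = [s for s in samples if not s.scheduled_downtime]
    return sum(s.ok for s in unscheduled) / len(unscheduled) if unscheduled else 1.0

# Example month: 30 daily slots, two failures, one of them during announced maintenance.
month = [Sample(ok=True, scheduled_downtime=False)] * 28 + \
        [Sample(ok=False, scheduled_downtime=True),
         Sample(ok=False, scheduled_downtime=False)]
print(f"availability = {availability(month):.2%}, reliability = {reliability(month):.2%}")
```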

9 From testing to data
Independent experiment data challenges: e.g. DC04 (ALICE, CMS, LHCb) / DC2 (ATLAS) in 2004 saw the first full chain of the computing models on grids.
Service Challenges proposed in 2004 to demonstrate service aspects: data transfers for weeks on end, data management, scaling of job workloads, security incidents ("fire drills"), interoperability, support processes.
2004: SC1 – basic transfer rates
2005: SC2 – basic transfer rates; SC3 – sustained rates, data management, service reliability
2006: SC4 – nominal LHC rates, disk-tape tests, all Tier 1s, some Tier 2s
2007: focus on real and continuous production use of the service over several years (simulations since 2003, cosmic ray data, etc.)
2008: CCRC’08 – readiness challenge, all experiments, ~full computing models
2009: STEP’09 – scale challenge, all experiments, full computing models, tape recall + analysis
Data and Service Challenges exercised all aspects of the service – not just data transfers, but workloads, support structures etc.
Service testing and data challenges started many years before the accelerator started to produce data. This was important to ensure the functionality and quality of service needed to provide a continuous service. The users (the LHC experiments) were involved at all stages of implementation, deployment and testing.
Ian Bird, CERN
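Since the service challenges were judged on sustaining transfer rates for weeks at a time, here is a small, purely illustrative sketch of that kind of check. The rate target, log format and volumes are invented for the example; they are not actual WLCG figures or tools.

```python
# Illustrative sketch only: did (hypothetical) daily transfer volumes sustain a
# nominal rate target for the whole challenge window?
SECONDS_PER_DAY = 86_400
TARGET_MB_PER_S = 1_300          # assumed example target, not an official figure

daily_volumes_tb = [130, 142, 128, 150, 138, 121, 145]   # made-up week of Tier-0 exports

def daily_rates_mb_s(volumes_tb: list[float]) -> list[float]:
    """Convert TB moved per day into an average MB/s for that day."""
    return [v * 1_000_000 / SECONDS_PER_DAY for v in volumes_tb]

rates = daily_rates_mb_s(daily_volumes_tb)
sustained = all(r >= TARGET_MB_PER_S for r in rates)
print("daily averages (MB/s):", [round(r) for r in rates])
print("target sustained on every day:", sustained)
```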

10 Large scale = long times
LHC, the experiments, and the computing have taken ~20 years to build and commission.
They will run for at least 20 years.
We must be able to rely on long-term infrastructures: global networking, and strong and stable NGIs (or their evolution) that should eventually be self-sustaining.
Long-term sustainability must come out of the current short-term project funding cycles.
LHC uses a continuously running production service that will be required for several decades. It must evolve with technology and be supported by several generations of developers and operational staff.
Is the origins-of-life community aiming for short-term objectives or a long-term service? This will have an impact on the decisions to be made.
Ian Bird, CERN

11 Grids & HEP: Common history
CERN and the HEP community have been involved with grids from the beginning; grids were recognised as a key technology for implementing the LHC computing model.
HEP work with EC-funded EDG/EGEE in Europe and iVDGL/Grid3/OSG etc. in the US has been of clear mutual benefit:
Infrastructure development driven by HEP needs.
Robustness needed by WLCG is benefitting other communities.
Transfer of technology from HEP: Ganga, AMGA, etc. are used by many communities now.
Ian Bird, CERN

12 European Grid Infrastructure
European Data Grid (EDG): explore concepts in a testbed.
Enabling Grids for E-sciencE (EGEE): moving from prototype to production.
European Grid Infrastructure (EGI): routine usage of a sustainable e-infrastructure.

13 European Grid Infrastructure (Status April 2011 – yearly increase)
Disciplines: archeology, astronomy, astrophysics, civil protection, computational chemistry, earth sciences, finance, fusion, geophysics, high energy physics, life sciences, multimedia, material sciences.
13,319 end-users (+9%); 186 VOs (+6%); ~30 active VOs (constant).
Logical CPUs (cores): 207,200 in EGI (+8%), 308,500 in all; 90 MPI sites.
101 PB disk; 80 PB tape.
25.7 million jobs/month; 933,000 jobs/day (+91%); non-HEP users ~3.3 million jobs/month.
320 sites (1.4%); 58 countries (+11.5%).
Notes: figures are from D4.2, with increases computed using the figures from slide 1 (April 2010) unless otherwise specified. For the 13,319 users (+9.5%), the April 2010 reference value is the number of users documented in the EGEE deliverable SA1.2.2, which was considerably higher than the figure reported in slide 1. The jobs/month increase is computed using the average million jobs/month from May 2009 to April 2010.
EGI - The First Year

14 Grids, clouds, supercomputers, etc.
Grids: collaborative environment; distributed resources (political/sociological); commodity hardware; (HEP) data management; complex interfaces (bug, not feature); communities expected to contribute resources.
Supercomputers: scarce; low-latency interconnects; applications peer reviewed; parallel/coupled applications; also supercomputing grids (DEISA/PRACE, TeraGrid/XD).
Clouds: proprietary (implementation); economies of scale in management; commodity hardware; pay-as-you-go usage model; details of physical resources hidden; simple interfaces.
Volunteer computing: simple mechanism to access millions of CPUs; difficult if (much) data is involved; control of environment → check; community building – people involved in science; potential for huge amounts of real work.
How to get access: Grids – could make a community-based request to EGI, which would probably need some contribution of resources from the community itself. Supercomputers – request time from DEISA/PRACE. Clouds – needs money, but some companies (e.g. Amazon) make initial donations of free time for well-justified scientific challenges. Volunteer computing – contact the Citizen Cyberscience Centre, or make a request to the Google Exacycle grant programme.

15 Collaboration with the General Public: Citizen Cyberscience Centre
Philosophy: promote web-based citizen participation in science projects as an appropriate low-cost technology for scientists in the developing world.
Partners: CERN, UN Institute for Training and Research, University of Geneva.
Sponsors: IBM, HP Labs, Shuttleworth Foundation.
Technology: open-source platforms for internet-based distributed collaboration.
Projects: Computing for Clean Water (optimizing nanotube-based water filters by large-scale simulation on volunteer PCs); AfricaMap (volunteer thinking to generate maps of regions of Africa from satellite images, with UNOSAT); a new volunteer project for public participation in LHC collision simulations, using VM technology.
Plans: training workshops in 2011 in India, China, Brazil and South Africa.
Frédéric Hemmer

16 Some more questions to be answered
Computing model: how many computing models exist in the community, and can they all use the same computing infrastructure?
Continuous load or periodic campaigns: how intensely and frequently will the community use the computing infrastructure?
Manpower: do you have enough geeks to port the code and support it? How committed is the community? Are you prepared to contribute and share computing resources?
Speaker notes on the computing model: the Kauffman background note shows there are many theories about the origins of life. When these are implemented as computer applications, do they have different computing models? Are they compatible or different? What are their characteristics? Are they suitable for grids or supercomputers? After speaking to Wim Hordijk from Lausanne, at least one model does fit well with grid computing.
Bob Jones – May 2011

