
Slide 1: Grid Computing in the CMS Experiment at the LHC
José M. Hernández, CIEMAT
Jornada de usuarios de Infraestructuras Grid, 19-20 January 2012, CIEMAT, Madrid

Slide 2: The CMS Experiment at the LHC

- The Large Hadron Collider: p-p collisions, 7 TeV, 40 MHz
- The Compact Muon Solenoid: precision measurements, search for new phenomena

Slide 3: LHC: a challenge for computing

- The Large Hadron Collider at CERN is the largest scientific instrument on the planet
- Unprecedented data handling scale: 40 MHz event rate (~1 GHz collision rate) → ~100 TB/s → online filtering to ~300 Hz (~300 MB/s) → ~3 PB/year (10^7 seconds of data taking per year; see the check below)
- Large computing power is needed to process the data: events are complex, and many interesting signals occur at rates << 1 Hz
- Thousands of scientists around the world access and analyze the data
- Need a computing infrastructure able to store, move around the globe, process, simulate and analyze data at the Petabyte scale [O(10) PB/year]
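A quick back-of-the-envelope check of the rates quoted above; a sketch in which the ~1 MB average raw event size is an assumption chosen to be consistent with 300 Hz → ~300 MB/s:

    # Back-of-the-envelope check of the CMS data rates quoted on the slide.
    event_size_mb = 1.0      # assumed average raw event size (MB)
    rate_hz = 300            # events/s kept by the online filter
    seconds_per_year = 1e7   # effective data-taking time per year

    throughput_mb_s = rate_hz * event_size_mb             # ~300 MB/s
    volume_pb = throughput_mb_s * seconds_per_year / 1e9  # MB -> PB
    print("~%.0f MB/s, ~%.0f PB/year" % (throughput_mb_s, volume_pb))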

Slide 4: The LHC Computing Grid

LCG: 300+ centers, 50+ countries, ~100k CPUs, ~100 PB disk/tape, 10k users

- The LHC Computing Grid provides the distributed computing infrastructure:
  - Computing resources (CPU, storage, networking)
  - Computing services (data and job management, monitoring, etc.)
  - Integrated to provide a single LHC computing service
- Uses Grid technologies:
  - Transparent and reliable access to heterogeneous, geographically distributed computing resources via the Internet
  - High-capacity wide area networking

Slide 5: The CMS Computing Model

- Distributed computing model for data storage, processing and analysis
- Grid technologies (Worldwide LHC Computing Grid, WLCG)
- Tiered architecture of computing resources
- ~20 Petabytes of data (real and simulated) every year
- About 200k jobs (data processing, simulation production and analysis) per day

Slide 6: WLCG network infrastructure

- T0-T1 and T1-T1 interconnected via the LHCOPN (10 Gbps links)
- T1-T2 and T2-T2 traffic uses general-purpose research networks
- A dedicated network infrastructure (LHCONE) is being deployed

Slide 7: Grid services in WLCG

- Middleware providers: gLite/EMI, OSG, ARC
- Global services: data transfer and job management, authentication/authorization, information system
- Compute elements (gateway, local batch system, worker nodes) and storage elements (GridFTP servers, disk servers, mass storage system) at the sites
- Experiment-specific services

Slide 8: CMS Data and Workload Management

- Experiment-specific data and workload management (DMWM) services on top of basic Grid services
- Pilot-based workload management system (WMS)
- Data bookkeeping, location and transfer systems
- Data are pre-located; jobs go to the data (see the sketch below)
- Experiment software pre-installed at the sites

[Architecture diagram: CMS services (production system WMAgent, analysis system CRAB, data bookkeeping and location system DBS, data transfer system PhEDEx) sit on top of Grid services (gLite WMS, file transfer system, CE, SE) and site services (local batch system, mass storage system), serving operators and users.]
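A minimal sketch of the "jobs go to data" placement rule: a job is dispatched only to sites where its input dataset has been pre-located. The dataset-to-site catalog below is hypothetical; in CMS this lookup is served by the data bookkeeping and location system (DBS), not by a Python dict:

    # Toy sketch: match a job to sites that host its input dataset and
    # have free batch slots. The catalog contents are hypothetical.
    dataset_locations = {
        "/MinimumBias/Run2011A/RECO": {"T1_ES_PIC", "T2_ES_CIEMAT"},
        "/QCD_Pt_80/Summer11/AODSIM": {"T2_ES_IFCA"},
    }

    def candidate_sites(dataset, free_slots):
        """Sites that both host `dataset` and have free batch slots."""
        hosts = dataset_locations.get(dataset, set())
        return [site for site in hosts if free_slots.get(site, 0) > 0]

    print(candidate_sites("/MinimumBias/Run2011A/RECO",
                          {"T1_ES_PIC": 120, "T2_ES_CIEMAT": 0}))
    # -> ['T1_ES_PIC']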

Slide 9: CMS Grid Operations - Jobs

- Large-scale data processing and analysis
- ~50k used slots, 300k jobs/day
- Plots correspond to Aug 2011 - Jan 2012

Slide 10: Spanish contribution to CMS computing resources

- Spain contributes ~5% of the CMS computing resources
- PIC Tier-1: ~1/2 of an average Tier-1; 3000 cores, 4 PB disk, 6 PB tape
- IFCA Tier-2: ~2/3 of an average Tier-2 (~3% of T2 resources); 1000 CPUs, 600 TB disk
- CIEMAT Tier-2: ~2/3 of an average Tier-2 (~3% of T2 resources); 1000 cores, 600 TB disk

Slide 11: Contribution from Spanish sites

- ~5% of the total CPU delivered for CMS
- CPU delivered Feb 2011 - Jan 2012

Slide 12: CMS Grid Operations - Data

- Large-scale data replication
- 1-2 GB/s throughput CMS-wide
- ~1 PB/week of data transfers (see the consistency check below)
- Full mesh of 50+ sites (T0, T1s, T2s)

[Plots: production and debug transfer throughput, each peaking at ~1 GB/s.]
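As a quick consistency check, the weekly volume and the sustained rate quoted above agree:

    # ~1 PB/week expressed as a sustained transfer rate.
    seconds_per_week = 7 * 24 * 3600            # 604800 s
    rate_gb_s = 1e15 / seconds_per_week / 1e9   # bytes/week -> GB/s
    print("~%.1f GB/s" % rate_gb_s)             # ~1.7 GB/s, within the 1-2 GB/s range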

Slide 13: Site monitoring/readiness

Slide 14: Lessons learnt

- Porting the production and analysis applications to the Grid was easy:
  - Package the job wrapper and user libraries into the input sandbox
  - Experiment software pre-installed at the sites
  - The job wrapper sets up the environment, runs the job and stages out the output (see the sketch below)
- When running at large scale in WLCG, additional services are needed:
  - Job and data management services on top of the Grid services
  - Data bookkeeping and location
  - Monitoring
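A minimal sketch of such a job wrapper. The payload name, the VO_CMS_SW_DIR default and the stage-out command and URL are illustrative assumptions, not the actual CMS wrapper:

    #!/usr/bin/env python
    # Sketch of a Grid job wrapper: set up the environment, run the
    # payload from the input sandbox, stage the output out to storage.
    import os
    import subprocess
    import sys

    def main():
        # 1. Environment setup, pointing at software pre-installed at the site.
        env = dict(os.environ)
        env.setdefault("VO_CMS_SW_DIR", "/opt/cms")  # site-dependent assumption

        # 2. Run the payload shipped in the input sandbox.
        rc = subprocess.call(["sh", "payload.sh"], env=env)
        if rc != 0:
            return rc  # report the payload failure upstream

        # 3. Stage the output out to the site's storage element (URL illustrative).
        return subprocess.call(["lcg-cp",
                                "file://%s/output.root" % os.getcwd(),
                                "srm://se.example.org/store/user/output.root"])

    if __name__ == "__main__":
        sys.exit(main())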

Slide 15: Lessons learnt

- Monitoring is essential:
  - Multi-layer complex system (experiment, Grid and site layers)
  - Monitor workflows, services and sites
- Experiment services should be robust:
  - Deal with the (inherent) unreliability of the Grid
  - Be prepared for retries and cool-off periods (see the sketch below)
- Pilot-based WMS:
  - gLite BDII and WMS not reliable enough
  - Smaller overhead, node environment verification, global priorities, etc.
  - Isolates users from the Grid
- Grid operations team:
  - Lots of manpower needed to operate the system
  - Central operations team (~20 FTE)
  - Contacts at 50+ sites
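A sketch of the retry-with-cool-off pattern mentioned above; `operation` stands for any flaky Grid call (a transfer, a status query), and the timing parameters are arbitrary:

    # Retry a flaky Grid operation a bounded number of times, cooling off
    # (with exponential back-off) between attempts.
    import time

    def with_retries(operation, max_attempts=3, cooloff_seconds=60.0):
        for attempt in range(1, max_attempts + 1):
            try:
                return operation()
            except Exception as exc:
                if attempt == max_attempts:
                    raise  # give up; let a higher layer reschedule
                print("attempt %d failed (%s); cooling off %.0fs"
                      % (attempt, exc, cooloff_seconds))
                time.sleep(cooloff_seconds)
                cooloff_seconds *= 2  # back off harder each time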

Slide 16: Future developments

- Dynamic data placement/deletion:
  - Most of the pre-located data is not actually accessed much
  - Investigating automatic replication of hot data and deletion of cold data: replicate data when accessed by jobs and cache it locally (see the toy sketch below)
- Remote data access:
  - Jobs go to free slots and access the data remotely
  - CMS has greatly improved read performance over the WAN
  - At the moment only used for fail-over and overflow
- Service to asynchronously copy user data:
  - Remote stage-out from the worker node is a bad idea
- Multi-core processing:
  - More efficient use of multi-core nodes, savings in RAM, far fewer jobs to handle
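A toy sketch of the dynamic placement idea: replicate a dataset to a site when jobs access it there, and delete the least recently used ("cold") replicas once the site's cache budget is exceeded. Purely illustrative, not the CMS implementation:

    # Toy LRU cache of dataset replicas at a site.
    from collections import OrderedDict

    class SiteCache:
        def __init__(self, capacity_tb):
            self.capacity_tb = capacity_tb
            self.replicas = OrderedDict()  # dataset -> size (TB), LRU-ordered

        def access(self, dataset, size_tb):
            """Called whenever a job reads `dataset` at this site."""
            if dataset in self.replicas:
                self.replicas.move_to_end(dataset)  # hot: refresh position
            else:
                self.replicas[dataset] = size_tb    # replicate on first access
            while sum(self.replicas.values()) > self.capacity_tb:
                cold, _ = self.replicas.popitem(last=False)  # delete coldest
                print("deleting cold dataset %s" % cold)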

Slide 17: Future developments

- Virtualization of worker nodes / Cloud computing:
  - Decouple the node OS from the application environment using VMs or chroot
  - Allows the use of opportunistic resources
- CernVM-FS (CVMFS) for distributing the experiment software (see the probe sketch below)
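A small sketch of a worker-node probe for software distributed via CVMFS. /cvmfs/cms.cern.ch is the standard CMS repository mount point; the fallback to a locally installed release area via VO_CMS_SW_DIR is an assumption:

    # Probe for the experiment software area on a worker node.
    import os

    def cms_software_root():
        for path in ("/cvmfs/cms.cern.ch", os.environ.get("VO_CMS_SW_DIR", "")):
            if path and os.path.isdir(path):
                return path
        return None  # no experiment software visible on this node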

Slide 18: Summary

- CMS has been very successful in using the LHC Computing Grid at large scale
- A lot of work went into making the system efficient, reliable and scalable
- Several developments are in the pipeline to make CMS distributed computing more dynamic and transparent

