Computing at CEPC
Xiaomei Zhang, on behalf of the CEPC software & computing group
Nov 6, 2017

Content
- Progress on software framework and management
- Distributed computing status for the R&D phase
- New technologies explored for future challenges
- Summary

Software management
- git is used for distributed version control
- CMake is used as the main build management tool
- The cepcenv toolkit is being developed to simplify:
  - installation of the CEPC offline software
  - set-up of the CEPC software environment and easy use of CMake
- A standard software release procedure is planned, which will include:
  - automated integration tests
  - physics validation
  - final data production

Software framework
- The current CEPC software uses Marlin, adopted from ILC
- A CEPC software framework group has been formed, including the current CEPC software group, IHEPCC, SDU, SYSU, JINR, ... to work on the future CEPC software framework
- Given the uncertain official support of Marlin, candidate frameworks for the future CEPC software are being investigated
- Several existing frameworks have been studied; Gaudi is preferred for its wider community, possible long-term official support, more experts available in hand, and continuing improvements for parallel computing
- An international review meeting is under consideration for the final framework decision

  Framework   User Interface   Community
  Marlin      XML              ILC
  Gaudi       Python, TXT      ATLAS, BES3, Daya Bay, LHCb
  ROOT        ROOT script      PHENIX, ALICE
  ART         FHiCL            Mu2e, NOvA, LArSoft, LBNF
  SNiPER      Python           JUNO, LHAASO
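
For orientation, the sketch below shows what a minimal Gaudi-style Python options file looks like. The sequence name and property values are placeholders, not existing CEPC components; the eventual CEPC configuration style would depend on the framework decision.

```python
# Minimal Gaudi options-file sketch (placeholder names, not CEPC code).
from Gaudi.Configuration import *          # standard Gaudi Python options style
from Configurables import ApplicationMgr, MessageSvc, GaudiSequencer
from GaudiKernel.Constants import INFO

eventSeq = GaudiSequencer("EventSeq")      # would hold CEPC sim/reco algorithms

ApplicationMgr(
    TopAlg=[eventSeq],   # algorithms run once per event
    EvtMax=100,          # number of events to process
    EvtSel="NONE",       # no input event selector in this toy example
)
MessageSvc().OutputLevel = INFO
```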

Computing requirements
- CEPC is expected to produce a very large data volume in its data-taking period, comparable to LHC and Belle II
  - No doubt a single data center cannot meet this challenge
  - Distributed computing is the best way to organize worldwide resources from the CEPC collaboration and other possible contributors
- In the current R&D phase, CEPC simulation needs at least 2K dedicated CPU cores and 1 PB of storage each year
- There is currently no direct funding to meet these requirements and no dedicated computing resources
  - 500 TB of local Lustre storage is available, but it is close to full
- Distributed computing has therefore become the main way to collect free resources for massive simulation at this stage

Distributed computing
- The first prototype of the distributed computing system for CEPC was built in 2014
- The system took into account the current CEPC computing requirements, resource situation and manpower
  - Use existing grid solutions from WLCG as much as possible
  - Keep the system as simple as possible for users and admins
- Distributed computing has now been carrying the full load of CEPC massive simulation for almost three years
- Simulation of 120M signal events, plus 4-fermion and 2-fermion SM background events, has been completed, with 165 TB of data produced
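
As a rough cross-check of the scales quoted on this slide, the back-of-envelope arithmetic below uses only the numbers above (120M events, 165 TB) and the ~400M events planned for this year (see the production slides); the derived per-event size is an average, not an official figure.

```python
# Back-of-envelope estimate from the numbers quoted on this slide.
events_produced = 120e6          # ~120M events simulated so far
storage_used_tb = 165.0          # ~165 TB produced

mb_per_event = storage_used_tb * 1e6 / events_produced    # TB -> MB
print(f"average size per event: {mb_per_event:.2f} MB")   # ~1.4 MB/event

# Extrapolate to the ~400M events planned for this year
planned_events = 400e6
planned_storage_pb = planned_events * mb_per_event / 1e9   # MB -> PB
print(f"projected storage for 400M events: {planned_storage_pb:.2f} PB")  # ~0.5 PB
```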

Resources
- Active sites: 6, from England, Taiwan and Chinese universities (4)
  - QMUL (Queen Mary University of London) and IPAS (Institute of Physics, Academia Sinica) play a major role
  - Cloud technology is used to share free resources from other experiments at IHEP
- Resources: ~2500 CPU cores, shared with other experiments
  - Resource types include Cluster, Grid and Cloud
- Network: 10 Gb/s to the USA and Europe, and to Taiwan and Chinese universities
  - Joining LHCONE is planned to further improve international network connectivity
- Job input and output go directly to and from the remote SE

  Site Name                  CPU Cores
  CLOUD.IHEP-OPENSTACK.cn    96
  CLOUD.IHEP-OPENNEBULA.cn   24
  CLOUD.IHEPCLOUD.cn         200
  GRID.QMUL.uk               1600
  CLUSTER.IPAS.tw            500
  CLUSTER.SJTU.cn            100
  Total (Active)             2520

Current computing model
- With limited manpower and a small scale, the current computing model is kept as simple as possible
- IHEP acts as the central site
  - Event generation (EG) and analysis, plus small-scale simulation
  - Hosts the central storage for all experiment data
  - Hosts the central database for detector geometry
- Remote sites
  - MC production, including Mokka simulation + Marlin reconstruction
- Data flow
  - IHEP -> sites: stdhep files from EG are distributed to sites
  - Sites -> IHEP: output MC data are transferred directly back to IHEP from the jobs
- In the future, with more resources and tasks, the model will be extended to a multi-tier infrastructure with multiple data servers to avoid single points of failure

Central grid services at IHEP (1)
- Job management uses DIRAC (Distributed Infrastructure with Remote Agent Control)
  - Hides the complexity of heterogeneous resources
  - Provides a global job scheduling service
- The central SE is built on StoRM
  - Lustre (/cefs) as its backend
  - The frontend provides SRM, HTTP and xrootd access
  - Exports and shares experiment data with sites
[Slide diagram: DIRAC WMS dispatches jobs to cluster/grid and cloud sites; StoRM serves data from Lustre /cefs; CVMFS S0/S1 distribute software]
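
A minimal sketch of submitting a simulation-style job through the DIRAC Python API is shown below; the wrapper script, sandbox files and target site are placeholders, not the actual CEPC production workflow.

```python
# Minimal DIRAC job-submission sketch (placeholder script and file names).
from DIRAC.Core.Base import Script
Script.parseCommandLine(ignoreErrors=True)   # initialise DIRAC configuration

from DIRAC.Interfaces.API.Dirac import Dirac
from DIRAC.Interfaces.API.Job import Job

job = Job()
job.setName("cepc_sim_example")
job.setExecutable("run_simulation.sh", arguments="job.steer")  # hypothetical wrapper
job.setInputSandbox(["run_simulation.sh", "job.steer"])
job.setOutputSandbox(["*.log"])
job.setDestination("GRID.QMUL.uk")           # one of the active sites listed above

result = Dirac().submitJob(job)
print(result)                                # S_OK dict with the DIRAC job ID on success
```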

Central grid services at IHEP (2)
- A CEPC VO (Virtual Organization) has been created for user authentication on the grid
  - The VOMS server is hosted at IHEP
- Software is distributed via CVMFS (CernVM File System)
  - CVMFS Stratum 0 (S0) and Stratum 1 (S1) have been created at IHEP
  - Simple squid proxies are deployed across the sites
  - S1 replicas in Europe and the U.S. are planned to speed up software access outside China
  - The CNAF S1 has already synchronized the IHEP S0

Production status (1)
- 400M massive-production events are planned for simulation this year
  - 11 four-fermion channels and 1 two-fermion channel have already been finished on distributed computing
  - zzorww_h_udud and zzorww_h_cscs are both simulated and reconstructed; for the others only reconstruction is done
- 30M events have been successfully produced since March
- There is still hard work to do, and more resources are needed to complete it

  Category     Final State      Simulated Events
  4 fermions   zz_h_utut        419200
               zz_h_dtdt        1135600
               zz_h_uu_nodt     481000
               zz_h_cc_nots     482800
               ww_h_cuxx        1709400
               ww_h_uubd        100000
               ww_h_uusd        205800
               ww_h_ccbs
               ww_h_ccds        60600
               zzorww_h_udud    3970000
               zzorww_h_cscs
  2 fermions   e2e2             14908000
  Total                         27542400

Production status (2)
- In total, 385K CEPC jobs were processed in 2017, with 5 sites participating
  - The IHEP cloud contributed 50%, the QMUL site 30% and the IPAS site 15%
- The peak resource usage is ~1300 CPU cores, while the average is only ~400 CPU cores
  - Sites are busy with other local tasks in some periods
- The system is running well, but more resources are needed
  - WLCG has more than 170 sites; we encourage more sites to join us!

New technologies explored
Besides the running system for the current R&D, new technologies are being explored to meet future software needs and possible bottlenecks:
- Elastic integration of private and commercial clouds
- Offline DB access with FroNtier/squid cache technology
- Multi-core scheduling for parallel computing
- HPC federation with DIRAC
- Data federation based on caching for faster data access
- ...

Elastic cloud integration
- Cloud has already become a popular resource in HEP
- Private clouds are well integrated into the CEPC production system
  - Cloud resources can be used elastically, following the real CEPC job demand
  - All the magic is done with VMDIRAC, an extension of DIRAC
- Commercial clouds would be a good potential resource for urgent and CPU-intensive tasks
  - With the support of the Amazon AWS China region, trials have been run successfully with CEPC "sw_sl" simulation jobs
  - The current network connection can support 2000 jobs in parallel
  - The same logic is implemented with the AWS API
  - The cost of producing 100 CEPC events is about 1.22 CNY
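
To illustrate the kind of elastic provisioning that VMDIRAC automates, the sketch below scales worker VMs up and down through the AWS EC2 API (boto3). The region, AMI ID and instance type are assumptions for illustration; in production this is driven by VMDIRAC, not a hand-written script.

```python
# Rough sketch of elastic VM provisioning against the AWS EC2 API (boto3).
import boto3

ec2 = boto3.client("ec2", region_name="cn-north-1")  # AWS China region (assumed)

def scale_up(n_vms):
    """Start n_vms worker VMs that boot, register as pilots, and pull CEPC jobs."""
    return ec2.run_instances(
        ImageId="ami-xxxxxxxx",    # placeholder image with CVMFS + pilot bootstrap
        InstanceType="c4.xlarge",  # placeholder instance type
        MinCount=n_vms,
        MaxCount=n_vms,
    )

def scale_down(instance_ids):
    """Terminate idle worker VMs once the CEPC task queue drains."""
    return ec2.terminate_instances(InstanceIds=instance_ids)
```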

FroNtier/squid for offline DB access
- At the current scale, only one central database is used
  - This would become a bottleneck with a larger number of jobs
- FroNtier/squid, based on cache technology, is being considered
  - FroNtier detects changes in the DB and forwards them to the squid caches
  - This relieves pressure on the central servers by moving it to multi-layer caches
- Status
  - A testbed has been set up; closer work with the CEPC software is needed to provide a transparent interface to CEPC users
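
The caching idea can be illustrated with a plain HTTP client going through a site-local squid proxy; the hostnames and endpoint below are hypothetical, and the real FroNtier client is normally configured inside the experiment software rather than called like this.

```python
# Illustration of DB access through an HTTP cache layer (hypothetical endpoints).
# A FroNtier-style service turns DB queries into cacheable HTTP responses, and
# each site-local squid answers repeated queries without hitting the central DB.
import requests

SQUID_PROXY = {"http": "http://squid.site.example:3128"}          # site cache (assumed)
FRONTIER_URL = "http://frontier.ihep.example:8000/cepc_geometry"  # central service (assumed)

def fetch_geometry(tag):
    """Fetch a detector-geometry payload for a given tag via the squid cache."""
    resp = requests.get(FRONTIER_URL, params={"tag": tag},
                        proxies=SQUID_PROXY, timeout=30)
    resp.raise_for_status()
    return resp.content

# Thousands of jobs requesting the same tag are served from the squid cache,
# so the central database sees roughly one query per distinct payload.
```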

Multi-core support
- Multi-process/multi-thread processing is being considered in the CEPC software framework
  - To best exploit multi-core CPU architectures and improve performance
  - To decrease memory usage per core
- Multi-core scheduling in the distributed computing system is being studied
  - A testbed has been successfully set up
  - Two ways of multi-core scheduling, with different pilot modes, are being investigated
  - Tests showed that the scheduling efficiency of multi-core mode is lower than that of single-core mode and needs to be improved
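
The memory argument can be illustrated with a toy event-parallel worker pool: read-only data loaded before the workers fork is shared between them, lowering memory per core. The event loop and "processing" below are placeholders, not CEPC framework code.

```python
# Toy illustration of event-level parallelism with a worker pool.
import multiprocessing as mp

GEOMETRY = {"layers": 10}   # placeholder for large read-only detector data,
                            # loaded once and shared copy-on-write by the workers

def process_event(event_id):
    # stand-in for simulating/reconstructing one event
    return event_id, len(GEOMETRY)

if __name__ == "__main__":
    events = range(1000)
    with mp.Pool(processes=4) as pool:       # e.g. one worker per core
        results = pool.map(process_event, events)
    print(f"processed {len(results)} events")
```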

HPC federation
- HPC resources are becoming more and more important in HEP data processing
  - They are already used in CEPC detector design
- Many HPC computing centers have been built in recent years at HEP data centers
  - IHEP, JINR, IN2P3, ...
- An HPC federation with DIRAC is planned, to build a "grid" of HPC computing resources
  - Integrating HTC and HPC resources as a whole

Summary
- Software management and the software framework are making progress
- Distributed computing is working well at the current scale of resources
  - But more sites are still needed for the current CEPC massive-simulation tasks
- More advanced technologies are being investigated to meet future challenges and potential bottlenecks
- We thank QMUL and IPAS for their contributions
- We would like to encourage more sites to join the distributed computing effort!

Thank you!