
Status of BESIII Distributed Computing. BESIII Workshop, Sep 2014. Xianghu Zhao, on behalf of the BESIII Distributed Computing Group

Outline
– System and site status
– Performance tests for jobs and system
– Cloud computing: newly added resources
– User support
– Summary

SYSTEM AND SITE STATUS

System Status
GangaBOSS upgrade
– Supports reconstruction using local random trigger files
– Optimized job workflow efficiency (e.g. a job now exits immediately if the random trigger download fails)
– More detailed job logging in the monitoring
– Allows users to set the random seed in the configuration file
DIRAC improvements
– Greatly improved performance of massive DFC metadata queries: query time reduced from 8 s to < 0.5 s (a query sketch follows this slide)
– Fixed bugs causing missing replicas in the File Catalog
– Enabled PBS pilot log retrieval in the DIRAC job monitoring system
DIRAC server will be upgraded before Oct 1st
– From SL5 to SL6
– From gLite to EMI3
– DIRAC version from v6r10pre17 to v6r10p25
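For illustration, a massive DFC metadata query of the kind measured above can be issued with the standard DIRAC File Catalog client roughly as in the minimal sketch below; the metadata keys and the /bes path are assumptions, not the actual BESIII metadata schema.

    # Minimal sketch of a DFC metadata query; the metadata keys and the
    # catalog path are illustrative assumptions.
    from DIRAC.Core.Base.Script import parseCommandLine
    parseCommandLine()  # initialise the DIRAC environment

    from DIRAC.Resources.Catalog.FileCatalogClient import FileCatalogClient

    fc = FileCatalogClient()
    result = fc.findFilesByMetadata({'dataType': 'dst', 'round': 'round04'}, path='/bes')
    if result['OK']:
        print('%d files matched the query' % len(result['Value']))
    else:
        print('Query failed: %s' % result['Message'])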

Improvement of Sim+Rec
The failure rate of simulation + reconstruction jobs increased sharply with the number of concurrently running jobs
– Concurrent random trigger downloads put high pressure on the site gridftp servers
The problem was solved by reading random trigger files directly from the local file system instead of downloading them from the site SE (the idea is sketched after this slide)
The overall job failure rate dropped from more than 5% to less than 1% after the improvement
(Plots: failure rate increasing with the number of running jobs before the change; 100% successful after the change)
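The change amounts to the small decision below. This is only a sketch of the idea, not the actual GangaBOSS code; the directory location, environment variable and LFN layout are assumptions.

    import os
    import subprocess

    # Assumed location of the random trigger files on the site's shared file system
    LOCAL_DIR = os.environ.get('BES_RANDOMTRG_DIR', '/besfs/randomtrg')

    def get_random_trigger(filename, round_name='round04'):
        """Return a path to the random trigger file, downloading only when necessary."""
        local_path = os.path.join(LOCAL_DIR, round_name, filename)
        if os.path.exists(local_path):
            # Local copy available: read it directly, no load on the site gridftp server
            return local_path
        # Otherwise fetch the file from the site SE with the DIRAC data tools
        lfn = '/bes/randomtrg/%s/%s' % (round_name, filename)  # illustrative LFN
        subprocess.check_call(['dirac-dms-get-file', lfn])
        return os.path.abspath(filename)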

CVMFS System
CVMFS is an HTTP-based network file system used to deploy the BOSS software remotely
The current CVMFS server is located at CERN
Available BOSS versions deployed include 6.6.4, 6.6.3 and their p01/p02 patch releases (a worker-node check is sketched after this slide)
A new CVMFS server has been deployed at IHEP
– Some sites are slow to connect to CERN
– It can serve as a backup server
– It can be shared by multiple experiments
(Diagram: CVMFS server repositories are served through a site web proxy with an optional cache; the CVMFS client on each worker node loads data only on access)
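On a worker node the deployment can be checked simply by looking under the CVMFS mount point, roughly as in the sketch below; the repository name (boss.cern.ch) and its directory layout are assumptions used for illustration.

    import os

    # List the BOSS releases visible through CVMFS on this worker node.
    # Repository name and layout are illustrative assumptions.
    CVMFS_BOSS = '/cvmfs/boss.cern.ch'

    if os.path.isdir(CVMFS_BOSS):
        releases = sorted(d for d in os.listdir(CVMFS_BOSS) if d.startswith('6.'))
        print('BOSS releases on CVMFS: %s' % ', '.join(releases))
    else:
        print('CVMFS repository not mounted; check the site web proxy configuration')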

Computing Resources
 #   Contributor           CE Type           CPU Cores    SE Type   SE Capacity   Status
 1   IHEP                  Cluster + Cloud   144          dCache    214 TB        Active
 2   Univ. of CAS          Cluster           152                                  Active
 3   USTC                  Cluster           200 ~ 1280   dCache    24 TB         Active
 4   Peking Univ.          Cluster           100                                  Active
 5   Wuhan Univ.           Cluster           100 ~ 300    StoRM     39 TB         Active
 6   Univ. of Minnesota    Cluster           768          BeStMan   50 TB         Active
 7   JINR                  gLite + Cloud     100 ~ 200    dCache    8 TB          Active
 8   INFN & Torino Univ.   gLite + Cloud     264          StoRM     50 TB         Active
     Total (active)                          1828 ~ 3208            385 TB
 9   Shandong Univ.        Cluster           100                                  In progress
10   BUAA                  Cluster           256                                  In progress
11   SJTU                  Cluster           192                                  In progress
     Total (in progress)                     548

Site Status
Several pressure and performance tests show that most sites are in good condition
A recent pressure test on simulation and reconstruction gave a 99.7% success rate
– PKU was in the middle of a machine-room migration
– Torino suffered from a power cut
(Test sample: psi(4260) hadron decay jobs, 5000 events each)

Data Transfer System
Data transferred from March to July 2014: 85.9 TB in total (a single-file replication sketch follows this slide)
Transfer performance:
Data            Source SE   Destination SE   Peak Speed   Average Speed
randomtrg r04   USTC, WHU   UMN              96 MB/s      76.6 MB/s (6.6 TB/day)
randomtrg r07   IHEP        USTC, WHU        191 MB/s     115.9 MB/s (10.0 TB/day)
Datasets transferred:
Data Type             Data        Data Size   Source SE   Destination SE
DST                   xyz         24.5 TB     IHEP        USTC
                      psippscan   2.5 TB      IHEP        UMN
Random trigger data   round …     … TB        IHEP        USTC, WHU, UMN, JINR
                      round …     … TB        IHEP        USTC, WHU, UMN
                      round …     … TB        IHEP        USTC, WHU, UMN
                      round …     … TB        IHEP        USTC, WHU, UMN
                      round …     … TB        IHEP        USTC, WHU, UMN, JINR
                      round …     … TB        IHEP        USTC, WHU
High quality: > 99% one-time success rate
High transfer speed: ~ 1 Gbps to USTC, WHU, UMN; 300 Mbps to JINR
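These bulk transfers are handled by the BESDIRAC transfer system. As a simpler illustration, a single file can be replicated to another SE with the standard DIRAC data-management API roughly as below; the LFN and the SE names are illustrative, and the DataManager class assumes a reasonably recent DIRAC release.

    # Sketch only: replicate one file to a destination SE and register the new
    # replica in the File Catalog. LFN and SE names are illustrative.
    from DIRAC.Core.Base.Script import parseCommandLine
    parseCommandLine()

    from DIRAC.DataManagementSystem.Client.DataManager import DataManager

    dm = DataManager()
    result = dm.replicateAndRegister('/bes/randomtrg/round04/some_file.raw',
                                     'USTC-USER', sourceSE='IHEP-USER')
    print(result)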

(Transfer monitoring plots: USTC, WHU to UMN at 6.6 TB/day; IHEP to USTC, WHU at 10.0 TB/day; one-time success > 99%)

PERFORMANCE TESTS

System Performance Test
The system performance tests cover
– GangaBOSS job submission
– DIRAC job scheduling
– Job running efficiency
The tests show that the system performance is adequate for the current task load

GangaBOSS Performance
Main steps
– Splitting
– Generating submission files on disk
– Submitting jobs to DIRAC (a submission sketch follows this slide)
Performance is reasonable
– The time is proportional to the number of jobs
– Most of the time is spent communicating with DIRAC
(Chart: time spent on splitting and on generating submission files vs. number of jobs)
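Inside a Ganga session the submission step looks roughly like the sketch below. The Boss application and its attributes are assumptions standing in for the actual GangaBOSS interface; ArgSplitter and the Dirac backend are standard Ganga components.

    # To be run inside a Ganga session with the GangaBOSS plugin loaded.
    # Boss() and its attributes are illustrative assumptions.
    j = Job()
    j.name = 'bes_sim_rec'
    j.application = Boss()                         # assumed GangaBOSS application class
    j.application.optsfile = 'jobOptions_sim.txt'  # assumed attribute for the BOSS job options
    j.splitter = ArgSplitter(args=[[seed] for seed in range(1, 101)])  # 100 subjobs, one seed each
    j.backend = Dirac()                            # submit through (BES)DIRAC
    j.submit()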

DIRAC Performance
Main steps
– Jobs scheduled into the task queue
– Pilot submission to worker nodes
– Jobs accepted by sites
Scheduling performance is reasonable and proportional to the number of jobs (a status-check sketch follows this slide)
– 824 jobs were scheduled in less than one and a half minutes
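Once a batch has been submitted, its progress through the DIRAC states can be followed with the job API, roughly as in the sketch below; the job ID range is illustrative.

    # Sketch: count how many jobs of a batch are in each DIRAC state.
    from DIRAC.Core.Base.Script import parseCommandLine
    parseCommandLine()

    from DIRAC.Interfaces.API.Dirac import Dirac

    dirac = Dirac()
    result = dirac.status(list(range(1000, 1824)))  # e.g. the 824 test jobs
    if result['OK']:
        states = [info['Status'] for info in result['Value'].values()]
        for state in sorted(set(states)):
            print('%-10s %d' % (state, states.count(state)))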

Job Performance
MC production steps
– Simulation
– Random trigger download
– Reconstruction
Performance results
– Depend on the CPU performance of the worker nodes
– Download time is very short and can be ignored, since all sites use their local SEs
– UMN and USTC keep local random trigger files, so no download is needed there
Monitoring based on these measurements can be used for smart scheduling
(Chart: average job execution time (s) per site)

USER SUPPORT

User Support
The system is ready, and more users are welcome to run MC production
IHEP BESIII private users are well supported; more feedback is needed
Users outside IHEP are also supported
– The BESDIRAC client and Ganga are deployed over CVMFS
– Jobs can be submitted from outside IHEP wherever a CVMFS client is installed
– Instructions: SDIRAC_Client_in_CVMFS
New requirement from users
– Production with customized generators is needed

User Job Status
(Plots: user jobs per site; 80 GB of data transferred in total; 98.7% success rate; results transferred back to IHEP from the different sites)

CLOUD COMPUTING

Cloud Computing
Cloud computing provides extremely flexible and extensible resources based on modern virtualization technology
Advantages
– Easier for sites to manage and maintain than local resources
– Easy for users to customize the OS and computing environment
– Resources can expand and shrink in real time according to user demand
Cloud solutions
– Private cloud: OpenStack, OpenNebula, CloudStack, ...
– Commercial or public cloud: Amazon Web Services, Google Compute Engine, Microsoft Azure, ...

Integrating Cloud into BESDIRAC
How it is integrated
– The job scheduling scheme remains unchanged
– Instead of the Site Director used for cluster and grid sites, a VM Scheduler is introduced to support clouds
Workflow (sketched in code after this slide)
– Start a new single-core virtual machine when there are waiting jobs
– One job runs on a virtual machine at a time
– Delete the virtual machine after it has had no jobs for a certain period of time
(Diagram: user tasks enter DIRAC; the Site Director sends pilots to cluster and grid sites, while the VM Scheduler starts pilot VMs on cloud sites)
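The workflow can be summarised by the toy scheduler pass below. This is only a sketch of the logic listed above; the real VM scheduler in BESDIRAC/VMDIRAC is considerably more involved, and every name here is a placeholder.

    IDLE_LIMIT = 30 * 60  # assumed: delete a VM after 30 minutes without a job

    def scheduler_cycle(n_waiting_jobs, running_vms, start_vm, delete_vm, quota):
        """One pass of the VM-scheduler logic sketched on this slide.

        running_vms: list of objects exposing idle_seconds();
        start_vm / delete_vm: callables provided by the cloud endpoint driver.
        """
        # Start one single-core VM per waiting job, staying within the site quota
        for _ in range(min(n_waiting_jobs, max(0, quota - len(running_vms)))):
            start_vm(cores=1)
        # Each VM runs one job at a time; tear down VMs that stayed idle too long
        for vm in list(running_vms):
            if vm.idle_seconds() > IDLE_LIMIT:
                delete_vm(vm)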

Cloud to End Users
Job submission is transparent to end users
– Users can choose cloud sites exactly as they choose other sites (see the sketch after this slide)
A standard image for BOSS production is already provided
– Scientific Linux 6.5
– BOSS software (distributed via CVMFS)
– The image is registered on each cloud site
Users can customize an image for their own special purposes
– Selecting the OS, software, libraries, ...
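Steering a job onto a cloud site uses the same DIRAC job interface as for any other site; in the sketch below the site name is taken from the test-bed table on the next slide, while the executable name is a placeholder.

    # Sketch: submit a job to a cloud site exactly as to any other DIRAC site.
    # run_boss.sh is a placeholder for a real BOSS workflow script.
    from DIRAC.Core.Base.Script import parseCommandLine
    parseCommandLine()

    from DIRAC.Interfaces.API.Dirac import Dirac
    from DIRAC.Interfaces.API.Job import Job

    job = Job()
    job.setName('boss_on_cloud')
    job.setExecutable('run_boss.sh')
    job.setDestination('CLOUD.IHEP-OPENSTACK.cn')
    print(Dirac().submitJob(job))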

Cloud Test Bed
OpenStack and OpenNebula test beds have been built at IHEP
Cloud test beds are also provided by other institutes and universities
– INFN, Torino
– JINR, Dubna
– CERN
Site                       Cloud Manager   CPU Cores   Memory
CLOUD.IHEP-OPENSTACK.cn    OpenStack       24          48 GB
CLOUD.IHEP-OPENNEBULA.cn   OpenNebula      24          48 GB
CLOUD.CERN.ch              OpenStack       20          40 GB
CLOUD.TORINO.it            OpenNebula      …           … GB
CLOUD.JINR.ru              OpenNebula      5           10 GB

Testing on Cloud Resources
Tested with 913 simulation and reconstruction jobs (psi(4260) hadron decay, 5000 events each)
– The job success rate was 100%
– Larger-scale tests are still needed

Cloud Performance
Average time of the different processing steps on the cloud sites
– The performance of the cloud sites is comparable with that of the other sites
(Chart: job running time (s) on the cloud sites)

Preliminary Physics Validation
Comparison between a physical machine and a cloud virtual machine (100,000 events, 59 jobs)
– The same input is used (decay card, random seed)
– The results are highly consistent (one way to quantify such a comparison is sketched after this slide)
– Full validation should be done by the physics group
(Plots compared: P_pi, V_x0, V_y0, V_z0)
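A simple way to quantify the agreement between two samples (not necessarily what was done for these plots) is a two-sample Kolmogorov-Smirnov test on the relevant distributions; in the sketch below randomly generated numbers stand in for the real DST-level quantities such as P_pi.

    # Toy sketch of a quantitative comparison between distributions produced
    # on physical machines and on cloud VMs. Random numbers stand in for the
    # real quantities; this is not the validation actually performed.
    import numpy as np
    from scipy.stats import ks_2samp

    p_pi_physical = np.random.normal(0.5, 0.1, 100000)
    p_pi_cloud = np.random.normal(0.5, 0.1, 100000)

    stat, p_value = ks_2samp(p_pi_physical, p_pi_cloud)
    print('KS statistic = %.4f, p-value = %.3f' % (stat, p_value))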

Future Plan
– Support more new sites
– Upgrade the DIRAC server
– Extend the SE capacity
– Support private production with user-customized generators
– Larger-scale and further tests of the cloud
– More work to make cloud resources easier to manage

Summary
– The distributed computing system is in good condition
– Private user production is ready
– Performance tests show that system and job efficiency are reasonable
– BOSS jobs have been successfully tested on the cloud, showing that cloud resources are a promising option for the BESIII experiment
– More work is needed to bring cloud resources into production

Thanks for your attention! Thanks to the resource contributors! Thanks to all site administrators for their help and participation!