1 Status of BESIII Distributed Computing
BESIII Workshop, Sep 2014
Xianghu Zhao, on behalf of the BESIII Distributed Computing Group

2 Outline
– System and site status
– Performance tests for jobs and system
– Cloud computing: newly added resources
– User support
– Summary

3 SYSTEM AND SITE STATUS

4 System Status
GangaBOSS upgraded to 1.0.7
– Supports reconstruction using local random trigger files
– Optimized job workflow efficiency (e.g. a job now exits immediately if downloading the random trigger files fails)
– More detailed job logging in the monitoring
– Allows users to set the random seed in the configuration file
DIRAC improvements
– Greatly improved DFC metadata massive-query performance: query time reduced from 8 s to < 0.5 s
– Fixed bugs of missing replicas in the File Catalog
– Enabled PBS pilot log retrieval in the DIRAC job monitoring system
DIRAC server will be upgraded before Oct 1st
– From SL5 to SL6
– From gLite to EMI3
– DIRAC version from v6r10pre17 to v6r10p25
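The DFC speed-up matters most for bulk metadata queries. As a rough illustration only (not taken from the slides), such a query through the standard DIRAC FileCatalogClient looks like the sketch below; the metadata fields and the /bes path are assumptions.

```python
# Rough sketch of a DFC metadata query through the DIRAC FileCatalogClient.
# Requires a configured DIRAC client environment and a valid proxy; the
# metadata fields and the /bes path are illustrative assumptions.
from DIRAC.Core.Base import Script
Script.parseCommandLine(ignoreErrors=True)  # initialize the DIRAC client

from DIRAC.Resources.Catalog.FileCatalogClient import FileCatalogClient

fc = FileCatalogClient()
query = {'bossVer': '6.6.4.p02', 'eventType': 'inclusive'}  # hypothetical metadata
result = fc.findFilesByMetadata(query, path='/bes')
if result['OK']:
    print('%d files matched' % len(result['Value']))
else:
    print('Query failed: %s' % result['Message'])
```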

5 Improvement of Sim+Rec
The failure rate of simulation + reconstruction jobs grew sharply with the number of concurrently running jobs
– Concurrent random trigger downloading put high pressure on the site gridftp server
Problem solved by reading the random trigger files directly from the local file system instead of downloading them from the site SE
The overall job failure rate dropped from more than 5% to less than 1% after the improvement
[Plots: failure rate increasing with concurrent jobs before the fix; 100% successful after]
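To make the change concrete, the "local first" access pattern can be sketched as below; the directory layout and the download helper are hypothetical, not the actual GangaBOSS implementation.

```python
# Hypothetical sketch of the "use local random trigger files first" logic;
# the path and the download helper are assumptions for illustration only.
import os

LOCAL_RANTRG_DIR = '/besfs/randomtrg/round07'   # assumed local directory

def get_random_trigger(filename, workdir):
    local_path = os.path.join(LOCAL_RANTRG_DIR, filename)
    if os.path.exists(local_path):
        # Local copy available: no gridftp transfer, no load on the site SE.
        return local_path
    # Fall back to downloading from the site SE (placeholder helper).
    return download_from_se(filename, workdir)

def download_from_se(filename, workdir):
    raise NotImplementedError('site-specific SE download goes here')
```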

6 CVMFS System
CVMFS is a network file system based on HTTP, used to deploy the BOSS software remotely
The current CVMFS server is located at CERN
Available BOSS versions deployed
– BOSS 6.6.4.p02, 6.6.4.p01, 6.6.4, 6.6.3.p01, 6.6.3, 6.6.2
A new CVMFS server is deployed at IHEP
– Some sites are slow to connect to CERN
– Can act as a backup server
– Can be shared by multiple experiments
[Diagram: CVMFS server with repositories, web proxy with optional cache, CVMFS client on the work node; data is loaded only on access]
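On a worker node, the deployed releases simply appear under the /cvmfs mount point once the client is configured. Below is a minimal sketch of how a job wrapper might check what is available; the repository name boss.cern.ch and the directory layout are assumptions for illustration.

```python
# Sketch of a worker-node check for BOSS releases published via CVMFS.
# The repository name and directory layout are assumed, not verified.
import os

CVMFS_REPO = '/cvmfs/boss.cern.ch'

def available_boss_versions(repo=CVMFS_REPO):
    if not os.path.isdir(repo):
        raise RuntimeError('CVMFS repository %s is not mounted' % repo)
    # Assume each published BOSS release sits in its own top-level directory.
    return sorted(name for name in os.listdir(repo)
                  if os.path.isdir(os.path.join(repo, name)))

if __name__ == '__main__':
    print(available_boss_versions())
```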

7 Computing Resources
 #  Contributor          CE Type          CPU Cores    SE Type   SE Capacity  Status
 1  IHEP                 Cluster + Cloud  144          dCache    214 TB       Active
 2  Univ. of CAS         Cluster          152                                 Active
 3  USTC                 Cluster          200 ~ 1280   dCache    24 TB        Active
 4  Peking Univ.         Cluster          100                                 Active
 5  Wuhan Univ.          Cluster          100 ~ 300    StoRM     39 TB        Active
 6  Univ. of Minnesota   Cluster          768          BeStMan   50 TB        Active
 7  JINR                 gLite + Cloud    100 ~ 200    dCache    8 TB         Active
 8  INFN & Torino Univ.  gLite + Cloud    264          StoRM     50 TB        Active
    Total (active)                        1828 ~ 3208            385 TB
 9  Shandong Univ.       Cluster          100                                 In progress
10  BUAA                 Cluster          256                                 In progress
11  SJTU                 Cluster          192                                 In progress
    Total (in progress)                   548

8 Site Status
Several pressure and performance tests show that most sites are in good status
A recent pressure test on simulation and reconstruction gave a 99.7% success rate (10724 jobs, 5000 events each, psi(4260) hadron decay)
– PKU was in the middle of a machine-room migration
– Torino suffered from a power cut

9 Data Transfer System
Data transferred from March to July 2014, 85.9 TB in total

Data           Source SE   Destination SE  Peak Speed  Average Speed
randomtrg r04  USTC, WHU   UMN             96 MB/s     76.6 MB/s (6.6 TB/day)
randomtrg r07  IHEP        USTC, WHU       191 MB/s    115.9 MB/s (10.0 TB/day)

Data Type             Data        Data Size  Source SE  Destination SE
DST                   xyz         24.5 TB    IHEP       USTC
                      psippscan   2.5 TB     IHEP       UMN
Random trigger data   round 02    1.9 TB     IHEP       USTC, WHU, UMN, JINR
                      round 03    2.8 TB     IHEP       USTC, WHU, UMN
                      round 04    3.1 TB     IHEP       USTC, WHU, UMN
                      round 05    3.6 TB     IHEP       USTC, WHU, UMN
                      round 06    4.4 TB     IHEP       USTC, WHU, UMN, JINR
                      round 07    5.2 TB     IHEP       USTC, WHU

High quality: > 99% one-time success rate
High transfer speed: ~ 1 Gbps to USTC, WHU, UMN; 300 Mbps to JINR
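As a plain arithmetic cross-check (not part of the talk), the quoted average speeds are consistent with the daily volumes:

```python
# Plain arithmetic: convert an average rate in MB/s to TB/day (decimal units).
def mb_per_s_to_tb_per_day(rate_mb_s):
    seconds_per_day = 24 * 3600
    return rate_mb_s * seconds_per_day / 1.0e6  # 1 TB = 1e6 MB

print(round(mb_per_s_to_tb_per_day(76.6), 1))   # ~6.6  (USTC, WHU -> UMN)
print(round(mb_per_s_to_tb_per_day(115.9), 1))  # ~10.0 (IHEP -> USTC, WHU)
```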

10 [Transfer monitoring plots]
USTC, WHU → UMN @ 6.6 TB/day
IHEP → USTC, WHU @ 10.0 TB/day
One-time success rate > 99%

11 PERFORMANCE TESTS

12 System Performance Test
The system performance test covers
– GangaBOSS job submission
– DIRAC job scheduling
– Job running efficiency
The tests show that the system performance is adequate for the current task load

13 GangaBOSS Performance
Main steps
– Splitting
– Generating submission files on disk
– Submitting jobs to DIRAC
Performance is reasonable
– Time is proportional to the number of jobs
– Most of the time is spent in communication with DIRAC
[Plots: time spent on splitting and on generating submission files vs. number of jobs]
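For context, GangaBOSS follows the usual Ganga job model of configure, split, submit. The sketch below shows roughly what such a submission script looks like when run inside a Ganga session; the Boss application and splitter names are assumptions for illustration, not the verified GangaBOSS interface.

```python
# Rough sketch of a Ganga-style submission script (run inside a Ganga
# session, where the GPI objects are predefined). The Boss application
# and BossSplitter names are assumptions, not the verified GangaBOSS API.
j = Job()
j.application = Boss()                         # assumed GangaBOSS application
j.application.optsfile = 'jobOptions_sim.txt'  # hypothetical job options file
j.splitter = BossSplitter(evtMaxPerJob=5000)   # assumed splitter, 5000 events/subjob
j.backend = Dirac()                            # submit through DIRAC
j.submit()                                     # split, generate files, submit
```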

14 DIRAC Performance
Main steps
– Job scheduled into the task queue
– Pilot submission to the work nodes
– Job accepted by the sites
The scheduling performance is reasonable and proportional to the number of jobs
– 824 jobs scheduled in less than one and a half minutes

15 Job Performance
MC production steps
– Simulation
– Downloading random trigger files
– Reconstruction
Performance results
– Depend on the work-node CPU performance
– Downloading time is negligible since all sites use their local SEs
– UMN and USTC keep local random trigger files, so no download is needed
Monitoring based on these measurements can be used for smart scheduling
[Plot: average job execution time (s) per step for 10724 jobs]

16 USER SUPPORT

17 User Support
Ready for, and welcoming, more users to do MC production
IHEP BESIII private users are well supported and more feedback is needed
Users outside IHEP are also supported
– The BESDIRAC client and Ganga are deployed over CVMFS
– Jobs can be submitted from outside IHEP on any machine with the CVMFS client installed
– http://docbes3.ihep.ac.cn/~offlinesoftware/index.php/Using_BESDIRAC_Client_in_CVMFS
New requirement from users
– Production with customized generators is needed

18 User Job Status
– 3469 jobs in total
– 80 GB of data transferred
– 98.7% success
Results are transferred back to IHEP from the different sites

19 CLOUD COMPUTING

20 Cloud Computing
Cloud computing provides extremely flexible and extensible resources based on modern virtualization technology
Advantages
– Easier for sites to manage and maintain than local resources
– Easy for users to customize the OS and computing environment
– Resources can expand and shrink in real time according to user demand
Cloud solutions
– Private cloud: OpenStack, OpenNebula, CloudStack …
– Commercial or public cloud: Amazon Web Services, Google Compute Engine, Microsoft Azure …

21 Integrating Cloud into BESDIRAC
How it is integrated
– The job scheduling scheme remains unchanged
– Instead of the Site Director used for cluster and grid sites, a VM Scheduler is introduced to support cloud sites
Workflow
– Start a new single-core virtual machine when there are waiting jobs
– One job runs on one virtual machine at a time
– Delete the virtual machine once it has had no jobs for a certain period of time
[Diagram: user tasks enter DIRAC; the Site Director sends pilots to cluster and grid sites, while the VM Scheduler starts pilot VMs on cloud sites]
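The workflow above can be summarized as a small scheduling loop. The sketch below is illustrative only; the cloud driver calls, the task-queue interface, and the idle timeout are placeholders, not the actual BESDIRAC VM Scheduler code.

```python
# Illustrative sketch of the VM Scheduler workflow described above.
# 'cloud' and 'task_queue' are placeholder interfaces, not BESDIRAC code.
import time

IDLE_LIMIT = 1800      # seconds a VM may stay idle before deletion (assumed)
POLL_INTERVAL = 60     # polling period (assumed)

def schedule_loop(task_queue, cloud):
    idle_since = {}    # vm_id -> timestamp when the VM became idle
    while True:
        if task_queue.waiting_jobs() > 0:
            # Simplified policy: start one new single-core VM per polling
            # cycle while jobs are waiting.
            vm_id = cloud.start_vm(cores=1)
            idle_since.pop(vm_id, None)
        for vm_id in cloud.list_vms():
            if cloud.is_busy(vm_id):
                idle_since.pop(vm_id, None)
                continue
            idle_since.setdefault(vm_id, time.time())
            if time.time() - idle_since[vm_id] > IDLE_LIMIT:
                cloud.delete_vm(vm_id)   # no jobs for a while: release the VM
                idle_since.pop(vm_id)
        time.sleep(POLL_INTERVAL)
```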

22 Cloud to End Users
Transparent to end users at job submission
– Users can choose cloud sites just like any other site
A standard image for BOSS production is already provided
– Scientific Linux 6.5
– BOSS software (distributed via CVMFS)
– The image is registered on each cloud site
Users can customize an image for their own special purposes
– Select the OS, software, libraries …

23 Cloud Test Bed
OpenStack and OpenNebula test beds are built at IHEP
Cloud test beds from other institutes and universities
– INFN, Torino
– JINR, Dubna
– CERN

Site                      Cloud Manager  CPU Cores  Memory
CLOUD.IHEP-OPENSTACK.cn   OpenStack      24         48 GB
CLOUD.IHEP-OPENNEBULA.cn  OpenNebula     24         48 GB
CLOUD.CERN.ch             OpenStack      20         40 GB
CLOUD.TORINO.it           OpenNebula     60         58.5 GB
CLOUD.JINR.ru             OpenNebula     5          10 GB

24 Testing on Cloud Resources
Tested with 913 simulation and reconstruction jobs (psi(4260) hadron decay, 5000 events each)
– The job success rate is 100%
– Larger-scale tests are needed

25 Cloud Performance
Average time of the different processing steps for the cloud sites
– The performance of the cloud sites is comparable with the other sites
[Plot: job running time (s) per cloud site]

26 Preliminary Physics Validation
Comparison between a physical machine and a cloud virtual machine (100,000 events, 59 jobs)
– The same input is used (decay card, random seed)
– The results are highly consistent
– A full validation should be done by the physics group
[Plots: distributions of P_pi, V_x0, V_y0, V_z0]

27 Future Plan
– Support more new sites
– Upgrade the DIRAC server
– Extend the SE capacity
– Support private production with user-customized generators
– Larger-scale and further tests of the cloud
– More work to make cloud resources easier to manage

28 Summary
– The distributed computing system is in good status
– Private user production is ready
– Performance tests show that system and job efficiency are reasonable
– BOSS jobs have been successfully tested on the cloud, showing that cloud resources are a promising choice for the BESIII experiment
– More work is needed to bring cloud resources into production

29 Thanks for your attention! Thanks to the resource contributors! Thanks to all site administrators for their help and participation!

