1 Status of BESIII Distributed Computing. BESIII Workshop, Mar 2015. Xianghu Zhao, on behalf of the BESIII Distributed Computing Group

2 Outline
– System and site status
– Private production status
– Central storage solutions
– Monitoring system
– VM performance study in cloud computing
– Cloud storage
– Summary

3 Resources and Sites
1. CLOUD.IHEP.cn: Cloud, SL6, 264 cores, dCache SE, 214 TB, Active
2. CLUSTER.UCAS.cn: Cluster, SL5, 152 cores, no SE, Active
3. CLUSTER.USTC.cn: Cluster, SL6, 200 ~ 1280 cores, dCache SE, 24 TB, Active
4. CLUSTER.PKU.cn: Cluster, SL5, 100 cores, no SE, Active
5. CLUSTER.WHU.cn: Cluster, SL6, 100 ~ 300 cores, StoRM SE, 39 TB, Active
6. CLUSTER.UMN.us: Cluster, SL5/SL6, 768 cores, BeStMan SE, 50 TB, Active
7. CLUSTER.SJTU.cn: Cluster, 100 cores, no SE, Active
8. GRID.JINR.ru: Grid, SL6, 100 ~ 200 cores, dCache SE, 30 TB, Active
9. GRID.INFN-Torino.it: Grid, SL, 200 cores, StoRM SE, 30 TB, Active
10. CLUSTER.SDU.cn: Cluster, Testing
11. CLUSTER.BUAA.cn: Cluster, Testing
Total: 1864 or more CPU cores, 387 TB of SE capacity
CPU resources amount to about 2000 cores, storage to about 387 TB
Some CPU resources are shared with site local users

4 BOSS Software Deployment
Currently the following BOSS versions are available for distributed computing
– 6.6.2, 6.6.3, 6.6.3.p01, 6.6.4, 6.6.4.p01, 6.6.4.p02, 6.6.4.p03
– Versions 6.6.2, 6.6.3 and 6.6.3.p01 have been updated to accommodate distributed computing
– The verification results can be found under the directory /besfs/users/zhaoxh/verify_dist/boss
The following random trigger files are deployed
– round02, round03, round04, round05, round06, round07

5 BOSS Support
BOSS is already supported by the distributed computing system on sites with SL6.

6 Cloud Status
Cloud computing has been opened to private users.
More storage has been added to the cloud computing nodes
– allowing more virtual machines to run steadily
The database backend of the OpenNebula cloud has been switched from SQLite to MySQL (see the configuration sketch below)
– improving performance
– avoiding the situation where the database stops responding
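For reference, this backend switch is made in OpenNebula's oned.conf. The snippet below is only a sketch: the server, credentials and database name are placeholders that depend on the local installation.

    # oned.conf: use MySQL instead of the default SQLite backend
    # (server, port, user, password and database name are placeholders)
    DB = [ BACKEND = "mysql",
           SERVER  = "localhost",
           PORT    = 0,
           USER    = "oneadmin",
           PASSWD  = "********",
           DB_NAME = "opennebula" ]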

7 PRIVATE PRODUCTION STATUS

8 User Job Status
More users are using distributed computing.
In total, more than 93,000 user jobs have been completed successfully since the last collaboration meeting.

9 User Jobs Data Transfer
7 TB of data has been transferred to IHEP.
Reconstruction jobs need more data transferred than analysis jobs.

10 Improvements for GangaBoss
Job submission to distributed computing has been sped up
– More jobs can be submitted at one time on the lxslc login nodes
– Submission speed is much faster
The way of using custom BOSS packages has been simplified
Support for SL6 and BOSS
– Will be provided soon in the next version

11 New Functions in GangaBoss
Users can specify more than one output file type
– If no file type is specified, the output file will be the one from the last step
Output of .rec files is also supported in reconstruction jobs
– Nothing needs to be changed in the job script
All output files can be downloaded with the “besdirac-dms-dataset-get” command (see the example below)
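A hypothetical invocation is sketched below; the exact argument list of besdirac-dms-dataset-get is an assumption here, so the wiki user guide should be consulted for the real syntax.

    # download the output files of a dataset to the current directory
    # (the dataset name is a placeholder; the argument format is assumed)
    besdirac-dms-dataset-get <dataset_name>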

12 Support
These job types are supported now (a submission sketch follows this slide):
– Simulation
– Simulation + Reconstruction
– Simulation + Reconstruction + Analysis
User custom packages are supported
– Custom generators
– User analysis packages
– …
Detailed user guides have been provided on the wiki
– How to submit a BOSS job to distributed computing
– How to submit different types of BOSS jobs
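For illustration, a GangaBoss submission can be sketched as a short Python script run inside a ganga session (where objects such as Job and Dirac are provided by the GPI). The Boss application object and its attribute names below are assumptions for illustration only; the real interface is described in the wiki guides above.

    # Hypothetical GangaBoss job script (run inside a ganga session).
    # The Boss() application and its attributes are illustrative assumptions,
    # not the exact GangaBoss interface; see the wiki user guide.
    j = Job()
    j.name = 'bes-sim-rec-example'
    j.application = Boss()                      # hypothetical BOSS application object
    j.application.optionsfiles = ['sim.txt',    # simulation job options (placeholder)
                                  'rec.txt']    # reconstruction job options (placeholder)
    j.backend = Dirac()                         # submit through the DIRAC backend
    j.submit()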

13 Plans
Support analysis jobs that run on existing DST files
Full upload of the user package
– Removes the difficulty of working out exactly which files to upload

14 Job Splitter to Choose
There are two kinds of splitters
– Split by run
– Split by event
Split by run is recommended for users
– More sites can be used (currently only UMN supports split-by-event jobs)
– Job running time is shorter than for split-by-event jobs
– Lower storage pressure on sites (UMN has encountered performance problems when there are too many split-by-event jobs)

15 BESDIRAC Task Manager

16 CENTRAL STORAGE SOLUTIONS

17 Data Transfer Using StoRM+Lustre
On Dec. 10th, local users at UMN produced a DsHi dataset of 3.3 TB in 36,328 files. It is difficult to transfer such an amount of data to IHEP by scp or rsync.
This dataset was transferred from UMN to IHEP by our SE transfer system. On the IHEP side, the destination SE is IHEP-STORM (the StoRM+Lustre testbed).
The data is accessible on Lustre right after it is transferred; no upload/download is needed.
The transfer speed is 35 MB/s, and the one-time success rate is > 99%.
This shows the feasibility of transferring data from Lustre at one site to Lustre at another site.
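As a rough cross-check of these numbers: 3.3 TB in 36,328 files is about 90 MB per file, and at a sustained 35 MB/s the full dataset corresponds to roughly 3.3×10^6 MB / 35 MB/s ≈ 9.4×10^4 s, i.e. a little over one day of continuous transfer.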

18 Job Read/Write Using StoRM+Lustre
From Jan. 19th to Mar. 4th, 103k CEPC MC production jobs used StoRM+Lustre as central storage. In total, 11 TB of input data was read from /cefs and 41 TB of output data was written to /cefs, with only 4% failure.
From the user's point of view, jobs read input data from /cefs and write output data to /cefs; data operations (upload and download) are not needed.
[Chart: job outcome and data volumes; 4.01% SE read/write errors; 41 TB of output data written to Lustre, 11 TB of input data read from Lustre]
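For scale, 41 TB of output over 103k jobs is roughly 400 MB written per job, and 11 TB of input is roughly 110 MB read per job, which gives a feel for the per-job load the StoRM+Lustre setup sustained.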

19 MONITORING SYSTEM

20 Site Summary
A site summary page has been added to the monitoring system.
More detailed information will be added.

21 Tests by Submitting Jobs
It is now easier to add a new test (a generic submission sketch follows this slide).
A history graph is also available for each test.
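For illustration, a functional test of this kind can be expressed as a trivial job submitted through the DIRAC Python API. This is a generic sketch rather than the monitoring system's actual test code, the target site name is only an example taken from the site table, and the submission method name can differ between DIRAC versions.

    # Generic sketch of a site test job via the DIRAC API (not the actual
    # BESDIRAC monitoring test; method names may vary with the DIRAC version).
    from DIRAC.Core.Base import Script
    Script.parseCommandLine()                    # initialise the DIRAC environment
    from DIRAC.Interfaces.API.Job import Job
    from DIRAC.Interfaces.API.Dirac import Dirac

    job = Job()
    job.setName('site-functional-test')
    job.setExecutable('/bin/hostname')           # trivial payload: report the worker node
    job.setDestination('CLUSTER.UMN.us')         # run the test against one chosen site
    print(Dirac().submitJob(job))                # returns the DIRAC job ID on success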

22 CLOUD STORAGE

23 Introduction
Cloud storage is suitable for sites without an SE.
It could also support split-by-event jobs for sites which cannot mount all the random trigger files on each computing node.

24 Test for Cloud Storage
The MucuraFS client is deployed on 5 cloud computing testbeds.
Random trigger files of round06 are prepared on the cloud storage.
1000 reconstruction jobs were split by event with run range [30616, 31279], with 10,000 events in each job.
Test results
– High success rate
– The CPU efficiency is much lower and the execution time is much longer for nodes outside the IHEP cloud
[Chart: success rate and CPU efficiency, IHEP cloud vs. outside nodes]
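For scale, these 1000 jobs of 10,000 events each amount to 10 million events reconstructed against the round06 random trigger files served from the cloud storage.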

25 Future Plan
Further strengthen user support
– User tutorials will be provided regularly if needed
– More improvements will be made according to user feedback
– Support analysis jobs with existing DST files
– Upload the full user work area for simplicity and integrity
Make cloud resources easier to manage centrally
Improve the monitoring system
Develop an accounting system
More effort will be put into making the system more robust
– Push usage of the mirror offline database, implementing real-time synchronization
– Consider a redundant central server to avoid a single point of failure

26 Summary
The distributed computing system is in good shape for user jobs.
Private user production is well supported, with several improvements.
In the central storage tests, StoRM+Lustre performed well and could be used for real jobs.
The monitoring system has been upgraded and a new page has been developed.
Cloud storage has been tested and could be an alternative choice for providing random trigger file access.

27 Thanks for your attention! You are welcome to use the system and send your feedback!

