Status of BESIII Computing

1 Status of BESIII Computing
Yan Xiaofei IHEP-CC

2 Outline
1. Local Cluster and Distributed Computing
2. Storage Status
3. Network Status
4. Others

3 HTCondor Cluster
Upgrade to a new HTCondor version
- Upgraded from the previous release to a new one, with bugs fixed and better performance provided
- Completed during the summer maintenance
- The "hung" problem, which used to force the scheduler process to be restarted, has disappeared
Problems (after the upgrade)
- Heavy load on "/ihepbatch/bes" caused jobs to be held: some users run many analysis jobs in that directory, which degrades performance
- A bug concerning the maximum number of jobs in the queue was found in the new version, making the scheduler unstable; as a mitigation, each user is limited to 10,000 jobs
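A hedged sketch of how such a per-user cap can be enforced (the slide does not say which mechanism was used; MAX_JOBS_PER_OWNER is a standard HTCondor scheduler knob, but treating it as the one applied here is an assumption):

# condor_config fragment (sketch, not the site's actual configuration)
# Cap the number of jobs any single owner may have in the queue
MAX_JOBS_PER_OWNER = 10000
# Reload the configuration on the submit host afterwards
$ condor_reconfig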

4 HTCondor Cluster
New feature: bulk job submission
- Submits bulk jobs quickly: shorter submission time and a lower submission failure rate
- Command for BES users: "boss.condor" for batch jobs
- The job option files must already exist:
$ boss.condor -n 300 jobopt%{ProcId}.txt
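A minimal preparation sketch (assumptions: %{ProcId} expands to the per-job index 0..N-1, and the template file name below is hypothetical):

# Pre-generate the 300 per-job option files from a common template;
# in practice each file would carry its own run range or settings
$ for i in $(seq 0 299); do cp jobopt_template.txt "jobopt${i}.txt"; done
# Submit all 300 jobs in one call; job i then reads jobopt{i}.txt
$ boss.condor -n 300 "jobopt%{ProcId}.txt"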

5 Jobs statistics (Jun 2017 - Sep 2017)
BES jobs occupied most of the CPU time:

Experiment   Job count    Walltime (hr)    User count
BES          6,334,436    14,164,509.61    305
JUNO         2,754,221    1,729,473.14     51
DYW          861,264      1,136,054.19     39
HXMT         614,257      793,408.08       11
LHAASO       624,279      699,813.88       41
CEPC         1,208,345    592,382.61       24
ATLAS        406,818      202,126.59       23
CMS          1,096,388    142,736.30       15
YBJ          323,144      140,068.56       36
OTHER        311,202      56,856.52        8

6 SLURM Cluster
Resources
- 1 master node
- 1 accounting & monitoring node
- 16 login nodes
- 131 work nodes: 2,752 CPU cores, 8 GPU cards
Jobs (2017.6~2017.8)
- Jobs: 1,834
- CPU hours: 2,001,800
Research
- hep_job tools: a unified user interface shared with the HTCondor cluster (usage sketch below)
- Malva: a resource monitoring system
- Jasmine: a multi-purpose job test suite
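A hedged usage sketch of the unified interface (assumption: hep_sub and hep_q are the usual hep_job entry points; the script name and options shown are illustrative only):

# Submit a job script through the same front end, whether it lands on SLURM or HTCondor
$ hep_sub myjob.sh
# Query the status of your own jobs
$ hep_q -u $USER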

7 BESIII distributed computing(1)

8 BESIII distributed computing(2)
In the last three months, about 67.5K BESIII jobs were completed on the platform, with 11 sites joining the production. The central DIRAC server was successfully upgraded from v6r15 to v6r18 during the summer maintenance, and the distributed computing client in AFS was upgraded to the latest version accordingly. All the changes are transparent to end users.
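From the user side, that transparency can be checked with the standard DIRAC client commands (a sketch; the AFS environment path and the group name below are hypothetical):

# Source the client environment from AFS as before (path hypothetical)
$ source /afs/ihep.ac.cn/.../dirac_env.sh
# The client reports its (now upgraded) release
$ dirac-version
# Proxies and job submission work exactly as before
$ dirac-proxy-init -g bes_user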

9 BESIII distributed computing(3)
The distributed computing client is expected to migrate from AFS to CVMFS soon: CVMFS is more convenient for users outside IHEP and performs better than AFS. The BOSS software repository used by the sites has been successfully migrated to the newly built CVMFS infrastructure, which runs the latest CVMFS version on new hardware. The CNAF CVMFS Stratum-1 server in Italy now replicates the IHEP Stratum-0 server, which helps facilitate access to BOSS for users from European collaborations.
(Diagram: CVMFS replication topology with IHEP S0, CERN S0, IHEP S1, CNAF S1, and OSG S1, plus proxies serving Asia, Europe, and America.)
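On a site's login or worker node, picking up the replicated repository is a small client-side change (a sketch; the repository and proxy names are hypothetical, while the configuration keys and probe command are standard CVMFS):

# /etc/cvmfs/default.local (sketch)
CVMFS_REPOSITORIES=bes.ihep.ac.cn
CVMFS_HTTP_PROXY="http://squid.example.org:3128"
# Verify that the repository mounts and is reachable
$ cvmfs_config probe bes.ihep.ac.cn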

10 Outline
1. Local Cluster and Distributed Computing
2. Storage Status
3. Network Status
4. Others

11 Storage Usage Guide
Usage:            public data | user/group data, raw data | user data | temporary user data
Age (year):       1 | 2 | 5 | 5,0
BESIII dedicated: Yes | No
Quota:            50GB | 5GB, 50k files | 500GB
Backup:

12 Disk Usage During the Last 3 Months
BESIII computing is the major source of disk throughput in the IHEP cluster: over 90% of the read/write throughput, with a peak read of 15 GB/s and a peak write of 6 GB/s. /besfs disk usage increased by 200 TB after the summer maintenance.
(Chart: /besfs write throughput over the period, with the summer maintenance marked.)

13 Storage Evaluation
- Disk density has increased quickly in the last few years; 8 TB/10 TB disks are now the mainstream products. This saves budget for better storage hardware, and the storage capacity managed by a single server increases.
- Evaluation of new hardware and storage architectures:
  - SSD disk arrays (better IOPS)
  - Storage Area Network (flexibility and sharing of high-end SSD disk arrays)
  - Bonding of 10 Gbit Ethernet (upgrade each server's network from 10 Gb to 20 Gb or more; sketch below)
  - Multipath over Fibre Channel (upgrade the theoretical channel throughput from 16 Gb to 32 Gb)
- Switching from a single path to multiple paths aims to: (a) remove the single point of failure for better stability; (b) double the bandwidth from 10 Gb to 20 Gb, which increases I/O performance.
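A minimal sketch of the 10 GbE bonding idea (a generic Linux bonding fragment; interface names and addresses are illustrative, not the site's actual configuration):

# /etc/sysconfig/network-scripts/ifcfg-bond0 (sketch)
DEVICE=bond0
TYPE=Bond
BONDING_MASTER=yes
BONDING_OPTS="mode=802.3ad miimon=100"   # LACP aggregates the two 10G links
IPADDR=192.0.2.10                        # documentation address, hypothetical
PREFIX=24
ONBOOT=yes
# Each 10G member interface carries MASTER=bond0 and SLAVE=yes, yielding
# roughly 20 Gb aggregate bandwidth plus tolerance of a single link failure.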

14 Outline
1. Local Cluster and Distributed Computing
2. Storage Status
3. Network Status
4. Others

15 Network-WAN
- IHEP-USA (Internet2): IHEP-CSTNet-CERNET-USA, bandwidth 10 Gbps
- IHEP-Europe (GEANT): IHEP-CSTNet-CERNET-London-Europe
- IHEP-Asia: IHEP-CSTNet-HongKong-Asia, bandwidth 2.5 Gbps
- IHEP-*.edu.cn: IHEP-CSTNet-CERNET-University, bandwidth 10 Gbps
(Speaker note, translated: international and domestic links; 1. joining lhc1 qfz, progress; 2. the China-Europe link issue.)

16 New Network Architecture
- Separate the wireless network from the cable network
- Add an interconnect switch to collect the separated networks
- Dual firewalls supported, in both the data center network and the campus network
- Add two DMZs in the campus network: OA services and public cloud services
- IPv6 supported, in both the campus network and the data center network
- Grid data transfer is currently 8770
(Speaker note, translated: replacement issue.)

17 Outline
1. Local Cluster and Distributed Computing
2. Storage Status
3. Network Status
4. Others

18 New Monitoring at IHEP
Motivation
- The various cluster monitoring tools are independent of each other
- Integrating the multiple monitoring data sources can provide more information
- Improve the availability of the computing cluster
Goal
- Correlate the monitoring sources
- Analyze the monitoring data from the various sources
- A unified display system presents health status at multiple levels
- Show the trend of errors and abnormalities

19 Remote Computing Sites Service
Maintenance services for remote computing centers
- The USTC, BUAA, and CLAS sites are running well
- The USTC site added 448 CPU cores and 1.1 PB of Lustre storage
- We set up a central monitoring system for the remote sites

Site Name   CPU Cores   Storage (TB)
USTC        1088        1843
BUAA        160         81
CLAS        224         37

20 Public services
SSO
- Total users: 7036; BES3: 642 (IHEP: 227, others: 415)
IHEP APP
- Supports the phone book, with direct calling
- News, academic events, media, ...
- Personal agenda support will be added later
IHEPBox
- Updated to the latest version
- Total users: 1000+
- Space usage: 3 TB / 154 TB
Vidyo
- Users: 692
- Conferences: 1059

21 Others
- Password self-reset: Helpdesk
- Helpdesk Tel:

22 Summary
- The computing platform keeps running well
- New features have been added for users
- A new storage architecture is under evaluation
- A unified monitoring system has been developed
- The new network architecture is ready

23 Thank you! Questions?

