Status of BESIII Distributed Computing BESIII Collaboration Meeting, Nov 2014 Xiaomei Zhang On Behalf of the BESIII Distributed Computing Group.

Outline
– Major upgrade of the central server
– Private production support
– Central storage solutions
– Cloud computing
– Summary

Resources and Sites

 #   Site Name            Type     CPU Cores   SE Type   SE Capacity   Status
 1   CLOUD.IHEP.cn        Cloud    144         dCache    214 TB        Active
 2   CLUSTER.UCAS.cn      Cluster  152                                 Active
 3   CLUSTER.USTC.cn      Cluster  200 ~ 1280  dCache    24 TB         Active
 4   CLUSTER.PKU.cn       Cluster  100                                 Active
 5   CLUSTER.WHU.cn       Cluster  100 ~ 300   StoRM     39 TB         Active
 6   CLUSTER.UMN.us       Cluster  768         BeStMan   50 TB         Active
 7   CLUSTER.SJTU.cn      Cluster  100 ~ 360                           Active
 8   GRID.JINR.ru         Grid     100 ~ 200   dCache    30 TB         Active
 9   GRID.INFN-Torino.it  Grid     200                                 Active
10   Cluster.SDU.cn       Cluster                                      On the way
11   Cluster.BUAA.cn      Cluster                                      On the way
     Total                         1864 ~ 3504           357 TB

– CPU resources total about 2000 cores; storage about 350 TB
– Site names have been adjusted and classified according to their type
– SJTU is a newly added site; JINR increased its storage to 30 TB
– The local IHEP PBS site has moved to the IHEP cloud site, in preparation for putting the cloud into production

MAJOR UPGRADE FOR CENTRAL SERVER

Major Upgrade
– BESDIRAC Cloud: v1r0 -> v1r2
– DIRAC: v6r10pre17 -> v6r10p25
– gLite -> EMI3 (grid middleware)
– SL5 -> SL6 (OS)
– The database was separated from the main server: this reduces the load on the main server and makes it easier to maintain and easier to upgrade next time
– Large-scale tests (jobs, data query, data transfer, etc.) have shown the new server is in good status

Jobs
– ~1500 jobs running concurrently
– 98.7% success rate

Data File Catalog and Transfers

The Data File Catalog is working fine:
– More than 600,000 files are registered in the file catalog
– Dataset and replica queries have been tested
– A single metadata query takes 0.5 s

The data transfer system is OK:
– Two batches of DST data were transferred from the IHEP SE to the WHU SE: 8 TB of xyz data
– Transfer speed can reach 90+ MB/s
– One-time success rate is above 99%
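For a feel of what the quoted throughput means in practice, the sustained rate can be turned into an expected wall-clock transfer time. A minimal sketch; the `transfer_hours` helper is illustrative, not part of BESDIRAC:

```python
def transfer_hours(size_tb: float, rate_mb_s: float) -> float:
    """Naive transfer-time estimate: data size in TB, sustained rate in MB/s."""
    size_mb = size_tb * 1024 * 1024   # TB -> MB (binary units)
    return size_mb / rate_mb_s / 3600  # seconds -> hours

# The 8 TB xyz batch at the ~90 MB/s quoted above:
print(round(transfer_hours(8, 90), 1))  # ~26 hours
```

This ignores per-file overhead and retries, so real batches run somewhat longer than the estimate.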

PRIVATE PRODUCTION SUPPORT

User Job Status

More than 12,000 user jobs have been successfully done:
– 3,500 were Simulation+Reconstruction+Analysis jobs, with a 98% one-time success rate
– 4,000 were Simulation+Reconstruction jobs with a customized (DIY) generator
– About 7,500 jobs / 63M events ( – )

(Plot: job status of all private user jobs)

Simulation with Customized Generator

Normal simulation is done with the officially published BOSS. For simulation with a customized generator, users run the simulation with their own compiled generator:
– The customized generator needs to be shipped with the jobs
– The customized generator is used instead of the official one

How to use: just specify your own generator library in the GangaBOSS configuration:
j.inputsandbox.append('/InstallArea/x86_64-slc5-gcc43-opt/lib/libBesEvtGenLib.so')

Feedback and answers:
– Can my own lib be added to the configuration automatically? Yes, if needed.
– Submitting a large number of jobs with a big user library (~10 MB) is too slow: currently each job has to upload the big library to the central server. This will soon be improved by keeping just one replica in an SE, shared among the related user jobs.
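The planned improvement can be pictured as a checksum-keyed cache: the user library is uploaded to an SE once, and every subsequent job reuses that replica instead of shipping its own copy. A toy model of the idea; the `SECache` class and the `/bes/user/lib/...` path are hypothetical, not the BESDIRAC implementation:

```python
import hashlib

class SECache:
    """Toy model: share one SE replica of a user library among many jobs."""
    def __init__(self):
        self.replicas = {}  # checksum -> SE path of the stored replica
        self.uploads = 0    # how many actual uploads happened

    def ensure_replica(self, lib_bytes: bytes) -> str:
        checksum = hashlib.sha256(lib_bytes).hexdigest()
        if checksum not in self.replicas:
            self.uploads += 1  # upload only the first time this library is seen
            self.replicas[checksum] = f"/bes/user/lib/{checksum}.so"
        return self.replicas[checksum]

cache = SECache()
lib = b"fake contents of libBesEvtGenLib.so"
paths = [cache.ensure_replica(lib) for _ in range(100)]  # 100 jobs, same library
print(cache.uploads)  # 1 -- uploaded once, not once per job
```

Keying on the content checksum also means a recompiled library is automatically treated as a new upload.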

Simulation+Reconstruction+Analysis (1)

(Diagram: in the Sim+Reco workflow, Simulation and Reconstruction run on the distributed system and produce DST files, which are downloaded and analyzed on the local farm into ROOT files. In the Sim+Reco+Ana workflow, Simulation, Reconstruction and Analysis all run on the distributed system, and only the ROOT files are downloaded to the local farm.)

The three processes can be done sequentially in one job:
– No input needed
– Returns ntuple ROOT files

This job type is highly recommended:
– For users, just one step completes all three processes and gives the final results

Simulation+Reconstruction+Analysis (2)

– For the distributed system, it greatly reduces data movement:
  – No intermediate data movement is needed
  – A ROOT ntuple file is normally much smaller than DST or raw files
– It is a good use case for the distributed computing system
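The saving can be made concrete with rough per-job output sizes. The numbers below are assumptions for illustration only, not measurements from the slides:

```python
# Illustrative per-job output sizes (assumed, not measured):
DST_GB = 2.0      # Sim+Reco returns the intermediate DST files
NTUPLE_GB = 0.05  # Sim+Reco+Ana returns only the final ntuple ROOT file

def moved_gb(n_jobs: int, per_job_gb: float) -> float:
    """Total data shipped from the distributed system back to the local farm."""
    return n_jobs * per_job_gb

jobs = 1000
print(moved_gb(jobs, DST_GB))     # 2000.0 GB moved with Sim+Reco + DST download
print(moved_gb(jobs, NTUPLE_GB))  # 50.0 GB moved with Sim+Reco+Ana
```

Under these assumptions the all-in-one job type moves roughly 40 times less data over the wide-area network.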

Simulation+Reconstruction+Analysis (3)

How to use: specify the sim, rec and ana job option files in the GangaBOSS configuration to start.

Feedback:
– Can the intermediate output also be returned in some cases? It can be done; we will provide a way for you to decide which files are returned.

Physics Validation

Physics validation was done by a physics user:
– psi(4160), 9 decay modes, 200,000 events per mode
– Same splitting and random seeds

Results show that the reconstructed DST data from distributed computing and from the local farm are exactly identical.

(Plot: one of the 9 modes, distributed computing vs. local farm; graph from SUN Xinghua)

Support

Four job types are supported now:
– Simulation (returns rtraw files)
– Simulation + Reconstruction (returns DST files)
– Simulation + Reconstruction + Analysis (returns ntuple ROOT files)
– Simulation with a customized generator (e.g. DIY generator)

Currently supported BOSS versions: 6.6.2, 6.6.3, 6.6.3.p01, 6.6.4, 6.6.4.p01, 6.6.4.p02

Detailed user guides are provided on the twiki:
– How to submit a BOSS job to distributed computing
– How to submit different types of BOSS jobs

Welcome to use the system; your feedback is valuable to us!

CENTRAL STORAGE SOLUTION

Central Storage

Central storage plays a major role in the BESIII computing model:
– Share DST and random trigger data with the sites
– Accept and save MC output from remote sites

Central storage components:
– Lustre (local file system) holds the DST data and random trigger data
– SE (grid Storage Element, dCache) exposes a grid interface for accessing data from outside IHEP

Current data flow:
– Lustre and the SE are completely separated
– Data must be copied manually between Lustre and the SE

Improvement: automate and speed up, or eliminate, data movement between Lustre and the SE by closely uniting them.

dCache + Lustre Solution (Xiaofei Yan)

1. The administrator copies data from Lustre to dCache.
2. Users can access the data from dCache.

(Diagrams: separated model vs. combined model)

Testbed based on the current infrastructure: dCache, with one cache pool on an added 88 TB disk array.

dCache + Lustre Read Test

Transfer of 1 TB of data from the dCache+Lustre SE to the WHU SE:
1. Register the Lustre metadata in the dCache DB (~7 minutes without checksums)
– Average: 83.5 MB/s; peak: 93.1 MB/s
– One-time success rate: 97.0%
– One-time success rate: 100%

StoRM + Lustre Solution

StoRM is a Storage Resource Manager that provides a grid interface to POSIX file systems such as Lustre and GPFS:
– Lightweight architecture
– Widely used in LCG
– StoRM has already been used successfully as the grid storage solution for remote sites in the BESIII distributed system; the WHU SE, based on StoRM, is working well

(Diagram: architecture of StoRM)

The testbed has been set up:
– A single StoRM server with Lustre "/cefs" attached

StoRM + Lustre Read Test

Transfer of 2 TB of data from the StoRM+Lustre SE to the WHU SE:
– One-time success rate: 100%
– Average: 80.9 MB/s; peak: 91.9 MB/s

StoRM + Lustre Write Test

About 7,000 BOSS jobs were used, with 1,300 jobs running at peak:
– 900 GB of job output was written back to the StoRM+Lustre SE (IHEP-STORM)
– At the same time, this output can be seen and read directly from Lustre on the IHEP local farm

With this solution, users don't need to download data from the grid to the local farm for further analysis: grid data = local data.
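"Grid data = local data" works because StoRM serves files straight out of the mounted file system, so a grid SURL and a Lustre path name the same file and the mapping is a pure string rewrite. A sketch of such a mapping; the endpoint name and directory layout are made up for illustration:

```python
def surl_to_lustre(surl: str,
                   endpoint: str = "srm://storm.example.ihep.ac.cn:8444",
                   sa_root: str = "/cefs") -> str:
    """Map a StoRM SURL onto the POSIX path of the same file on Lustre.

    No copy and no cache pool are involved: the SURL prefix is simply
    replaced by the storage-area mount point.
    """
    prefix = f"{endpoint}{sa_root}/"
    if not surl.startswith(prefix):
        raise ValueError("SURL outside the exported storage area")
    return f"{sa_root}/{surl[len(prefix):]}"

surl = "srm://storm.example.ihep.ac.cn:8444/cefs/bes/mc/round05/job_001.dst"
print(surl_to_lustre(surl))  # /cefs/bes/mc/round05/job_001.dst
```

In the dCache model, by contrast, the same lookup would have to go through the cache pool, which is why data written to the grid there is not immediately visible on the local farm.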

Comparison

                 dCache + Lustre                           StoRM + Lustre
Hardware         Needs an extra disk array as cache pool   Just mounts Lustre
Software         Needs extra scripts to keep the SE        No extra development
                 metadata synchronized
Transfer speed   83.5 MB/s                                 80.9 MB/s
Read/Write       Read works but requires metadata          Supports read and write
                 registration; write under development
Data movement    Cache pool <-> Lustre                     None
Security         Grid authentication                       Grid authentication

– The StoRM solution is easier to install and maintain; no extra development is required
– The StoRM solution could be more efficient: no need to register Lustre metadata in advance and no data movement
– StoRM is a promising solution; we will do more tests before making a final decision

CLOUD COMPUTING

Cloud Integration

Distributed computing has integrated cloud resources based on the pilot schema, implementing dynamic scheduling. The cloud resources in use can shrink and grow dynamically according to job requirements.

(Diagram of the VM life cycle:)
– User job submission: the distributed computing system creates VMs (VM1, VM2, …) in the cloud
– The VMs fetch jobs (Job1, Job2, Job3, …) from the distributed computing system and run them
– When no jobs are left, the finished VMs are deleted
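The life cycle above amounts to a small reconciliation loop: the number of VMs tracks the number of waiting jobs, up to a site cap, and falls back to zero when the queue is empty. A toy model; `CloudSite` is a stand-in for illustration, not the BESDIRAC VM scheduler:

```python
class CloudSite:
    """Toy elastic scheduler: VMs exist only while jobs are waiting."""
    def __init__(self, max_vms: int):
        self.max_vms = max_vms  # resource cap granted by the cloud site
        self.vms = 0            # currently running VMs

    def reconcile(self, waiting_jobs: int) -> int:
        """Grow toward one VM per waiting job (capped); shrink to zero when idle."""
        self.vms = min(waiting_jobs, self.max_vms)
        return self.vms

site = CloudSite(max_vms=40)
print(site.reconcile(100))  # burst of jobs -> scale up to the 40-VM cap
print(site.reconcile(10))   # queue drains  -> shrink to 10 VMs
print(site.reconcile(0))    # no jobs       -> all VMs deleted
```

A real scheduler would also damp the shrink step so VMs finish their running payloads before being deleted.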

Cloud Sites

5 cloud sites from Torino, JINR, CERN and IHEP have been set up and connected to the distributed computing system:
– About 320 CPU cores, 400 GB memory, 10 TB disk

Cloud Tests

More than 4,500 jobs have been done with a 96% success rate:
– The failures were caused by lack of disk space
– The disk space in the IHEP cloud will be extended

We expect to run large-scale tests with the support of the Torino cloud site.

Performance and Physics Validation

Performance tests have shown that running times on the cloud sites are comparable with the other production sites:
– Simulation, reconstruction, download of random trigger data

Physics validation has shown that physics results are highly consistent between cluster and cloud sites.

User Support

Cloud usage is transparent to BESIII physics users through the distributed computing system. Users can specify cloud sites just like other sites through GangaBOSS if needed.

Cloud sites will be opened to users after the collaboration meeting:
– Default environment: Scientific Linux 6.5, BOSS software

Given the flexibility of the cloud, users with special requirements are welcome to try it:
– A different OS or software environment from the clusters
– Let us know your requirements

Future Plan

Further strengthen user support:
– User tutorials will be provided regularly if needed
– More improvements will be made according to user feedback

Make cloud resources easier to manage centrally:
– Improve cloud monitoring and configuration to ease the life of the central admins

More efforts to make the system more robust:
– Take care of big input sandboxes (user packages)
– Push usage of the mirror offline database, implementing real-time synchronization
– Consider a redundant central server to avoid a single point of failure

Summary

– The distributed computing system remains in good status after the major upgrade
– Private user production is strongly supported, with two more job types added
– In the central storage tests, StoRM+Lustre was found to be a promising solution
– The cloud will be moved into production before the end of the year

Thanks for your attention! Thank you for your feedback! Thanks to the resource contributors! Thanks to all site administrators for their help and participation!