1 Exploit the massive Volunteer Computing resource for HEP computation
Wenjing Wu, Computer Center, IHEP, China
ISGC 2018

2 Outline
A brief timeline of the @home projects in HEP
Different implementations
Recent developments
Common challenges
Summary

3 Volunteer Computing
Example projects: SETI@home, CAS@home, LHC@home, ATLAS@home
BOINC is a commonly used middleware that harnesses the idle CPU resources of personal computers (desktops, laptops, smartphones)
Computing tasks run at a lower priority (only when CPUs are idle)
Suitable for CPU-intensive, latency-tolerant computing

4 A Brief Timeline of VC in HEP
2004: CERN SixTrack, accelerator simulation
Obstacles at the time: software size, heterogeneous OSes, workflow integration, security
New technologies developed since: VirtualBox/BOINC VM, CVMFS, other BOINC features
2014: IHEP/CERN, event simulation
2016: CERN, event simulation
2017: CERN consolidated all LHC projects

5 Other projects under construction
BelleII and BESIII: development ongoing since 2014; status: prototype in beta test
The motivation: most of the LHC projects need much more computing power than the grid sites can deliver, and Volunteer Computing is a resource of great potential to be exploited

6 Common solutions
Virtualization on non-Linux OSes: VM images are built and dispatched to volunteer computers
CVMFS for software distribution inside the VM (see the sketch below)
Develop a gateway service to bridge the workflows
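
As an illustration of the CVMFS point, here is a minimal Python sketch of what a payload wrapper inside the VM might do before starting the experiment software. The repository name atlas.cern.ch is only an example, and the wrapper itself is an assumption, not part of the actual project code; the check relies on CVMFS mounting on first access via autofs.

```python
import os

# Hypothetical pre-flight check for a payload wrapper inside the VM:
# verify that the CVMFS repository is reachable before running the
# experiment software. atlas.cern.ch is just an example repository.
CVMFS_REPO = "/cvmfs/atlas.cern.ch"

def cvmfs_available(repo=CVMFS_REPO):
    """Listing the directory both triggers the autofs mount and
    verifies that software can be served on demand."""
    try:
        return len(os.listdir(repo)) > 0
    except OSError:
        return False

if not cvmfs_available():
    # The VM image stays small because software is fetched over HTTP
    # and cached locally by CVMFS, but only if the mount works.
    raise SystemExit("CVMFS repository not reachable; aborting payload")
print("CVMFS OK, starting experiment software")
```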

7 Different implementations
One of the challenges is to integrate BOINC into workflows that were designed around Grid Computing:
The WMS (Workload Management System) pilot implementation
GSI authentication required by Grid services vs. the untrusted nature of volunteer computers
Each project has to adopt its own solution to address these issues

8 ATLAS@home
ARC CE acts as the gateway (sketched below):
Fetches jobs from PanDA and forwards them to BOINC
Caches input/output data, downloading from and uploading to the Grid SE
Authentication and communication with Grid services are done on the gateway, sparing the work nodes from storing the authenticators
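
A toy Python sketch of this credential-isolation pattern follows. All function names and data structures are illustrative assumptions, not the actual ARC CE/aCT, PanDA or BOINC interfaces; the point is only that the grid credential never leaves the gateway.

```python
# Toy sketch of the gateway pattern: the grid credential lives only on
# the gateway host, and the volunteer side only ever talks BOINC.

GRID_CREDENTIAL = "x509-proxy-lives-only-on-the-gateway"

def fetch_panda_job(credential):
    # Stand-in for an authenticated PanDA request made by the gateway.
    assert credential == GRID_CREDENTIAL
    return {"id": 42, "transform": "simulate", "input": "EVNT.pool.root"}

def to_boinc_workunit(job):
    # The volunteer receives only the payload description and cached
    # input files; no grid authenticator is ever included.
    return {"wu_name": "ATLAS_%d" % job["id"], "files": [job["input"]]}

job = fetch_panda_job(GRID_CREDENTIAL)   # gateway <-> Grid
wu = to_boinc_workunit(job)              # gateway <-> BOINC
print("published workunit:", wu["wu_name"])
```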

9 CMS@home
Challenge: different workflows: CRAB3 uses "push", but BOINC uses "pull" for jobs
Developed DataBridge as the gateway; it acts as a plugin to CRAB3, receives job descriptions from CRAB3 and stores them in a message queue (see the sketch below)
Ceph buckets are used for input/output data, so BOINC can access them
Stages data in and out between the Grid SE and the Ceph buckets
The Grid credential is stored in DataBridge to interact with Grid services
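
The push-to-pull conversion is the core trick, and a few lines of Python convey it. The in-process queue below is merely a stand-in for DataBridge's message queue, and all names are illustrative.

```python
import queue

job_queue = queue.Queue()  # stand-in for DataBridge's message queue

def crab3_push(job_description):
    # Push side: CRAB3 submits a job; the gateway buffers it in the
    # queue instead of sending it straight to a worker.
    job_queue.put(job_description)

def boinc_pull():
    # Pull side: a volunteer host asks for work when it is idle and
    # receives the next buffered job, if any.
    try:
        return job_queue.get_nowait()
    except queue.Empty:
        return None

crab3_push({"task": "CMS_sim_001", "input": "bucket://in/evt0"})
print(boinc_pull())  # -> the buffered job description
print(boinc_pull())  # -> None (no work queued)
```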

10 (figure-only slide, no transcript text)

11 LHCb
Challenge: merging two different authentication schemes; GSI authentication is closely coupled with the payload running in the DIRAC pilot
Developed WMSSecureGW as the gateway, which acts as:
An intermediate DIRAC service towards the volunteer computers, so that by using a fake proxy a volunteer computer can request jobs and stage data from the gateway
A DIRAC client and pilot towards the production DIRAC services, to fetch jobs and stage data (a sketch of the idea follows)
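
Under loose assumptions, the two-faced gateway idea can be sketched in Python: the volunteer presents a non-GSI placeholder ("fake proxy"), while only the gateway holds a real grid credential. Everything here is illustrative, not the actual WMSSecureGW or DIRAC API.

```python
REAL_GRID_PROXY = "grid-proxy-held-on-gateway-only"
ACCEPTED_FAKE_PROXY = "volunteer-placeholder-proxy"

def fetch_job_from_dirac(proxy):
    # Stand-in for an authenticated call to the production DIRAC WMS,
    # made by the gateway acting as a DIRAC client/pilot.
    assert proxy == REAL_GRID_PROXY
    return {"job_id": 7, "app": "event simulation"}

def volunteer_request_job(fake_proxy):
    # Volunteer-facing side: validate the placeholder credential
    # instead of requiring a real GSI proxy on the untrusted host.
    if fake_proxy != ACCEPTED_FAKE_PROXY:
        raise PermissionError("unknown volunteer credential")
    return fetch_job_from_dirac(REAL_GRID_PROXY)

print(volunteer_request_job(ACCEPTED_FAKE_PROXY))
```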

12 The current scale
(Figure: daily CPU time of successful jobs in LHC@home)
Includes SixTrack, ATLAS, CMS, LHCb and some other HEP applications
Daily CPU usage reaches 25K CPU days; with an average core power of 10 HS06, this is equivalent to a cluster of 400K HS06 (assuming an average CPU efficiency of 60%; worked arithmetic below)
SixTrack gains most of the CPU because it does not require virtualization, which makes it easy for volunteers to start with
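
The 400K HS06 equivalence follows from a short calculation, worked out here on the assumption that the 60% figure is the CPU/walltime efficiency a dedicated cluster would achieve.

```python
cpu_days_per_day = 25_000  # CPU time delivered per calendar day
core_power_hs06 = 10       # average volunteer core power (HS06)
cluster_cpu_eff = 0.60     # assumed CPU/walltime efficiency of a cluster

# 25K fully-busy cores at 10 HS06 each deliver 250K HS06 of CPU time.
delivered_hs06 = cpu_days_per_day * core_power_hs06

# A dedicated cluster running at 60% CPU efficiency needs more
# installed capacity to deliver the same amount of CPU time.
equivalent_cluster_hs06 = delivered_hs06 / cluster_cpu_eff

print(round(equivalent_cluster_hs06))  # ~416667, roughly the 400K HS06 quoted
```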

13 ATLAS@home
(Figure: CPU time of good jobs at all ATLAS sites in a week)
The total CPU time per day is between 300K and 400K CPU days; ATLAS@home remains among the TOP 10 sites

14 ATLAS@home
(Figure: average CPU days per day from good jobs, by site)
Avg. BOINC core power: 11 HS06
BOINC share: 2.26% of the ATLAS total

15 Recent development: lightweight/native mode
Use containers instead of virtual machines (a launch sketch follows)
Use BOINC to backfill busy cluster nodes: the average CPU utilization rate of clusters is between 50% and 70%
ATLAS grid sites use it to exploit extra CPU from fully loaded clusters
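
For the container case, here is a minimal sketch of what a native-mode launch might look like on a Linux host. The image path and payload command are hypothetical; the bind-mount of /cvmfs reflects the software-distribution scheme described earlier.

```python
import subprocess

# Hypothetical native-mode launch: Singularity replaces the
# VirtualBox VM on Linux hosts, with CVMFS bind-mounted inside.
cmd = [
    "singularity", "exec",
    "-B", "/cvmfs:/cvmfs",          # experiment software via CVMFS
    "/images/atlas-runtime.img",    # hypothetical container image
    "sh", "run_atlas_payload.sh",   # hypothetical payload wrapper
]
subprocess.run(cmd, check=True)
```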

16 Lightweight model for ATLAS@home
Job flow: PanDA -> aCT -> ARC CE -> BOINC server -> volunteer hosts
Volunteer host (Windows/Mac): runs the ATLAS app inside a VirtualBox VM
Volunteer host (Linux), cloud nodes, PCs, Linux servers, grid sites: run via Singularity
Volunteer host (SLC/CentOS 6): runs the ATLAS app directly

17 100% wall utilization != 100% CPU utilization, due to job CPU efficiency
(Figure: timeline of one work node; grid jobs 1-2 run for 12 wall hours each, backfill jobs 3-4 fill the CPU hours the grid jobs leave idle)
With jobs 1-2: 100% wall utilization; assuming a job CPU efficiency of 75%, 25% of the CPU is wasted
With jobs 1-4: 200% wall utilization and 100% CPU utilization, with job efficiencies of 75% and 25% (worked numbers below)
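
The numbers on this slide work out as follows, per core over one 12-hour window, using the slide's assumed efficiencies.

```python
# Numeric version of the example above (per core, per 12-hour window).
wall_hours = 12.0

grid_eff = 0.75      # grid job CPU efficiency assumed on the slide
backfill_eff = 0.25  # backfill job soaks up the remainder

grid_cpu = wall_hours * grid_eff                 # 9.0 CPU hours used
wasted_without_backfill = wall_hours - grid_cpu  # 3.0 CPU hours idle

backfill_cpu = wall_hours * backfill_eff         # 3.0 CPU hours reclaimed
total_util = (grid_cpu + backfill_cpu) / wall_hours

print("CPU utilization without backfill: {:.0%}".format(grid_eff))    # 75%
print("CPU utilization with backfill:    {:.0%}".format(total_util))  # 100%
```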

18 Put more jobs on work nodes
Run 2 jobs on each core: 1 grid job at normal priority (pri=20) and 1 BOINC job at the lowest priority (pri=39)
With this priority gap, the Linux scheduler lets the grid job occupy the CPU whenever it needs it, so the BOINC job only gets CPU cycles that the grid job does not use (see the sketch below)
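
A minimal sketch of launching such a backfill process from Python; the payload script name is a placeholder. Note that the pri values quoted on the slide are what `top` displays: displayed priority = 20 + nice, so nice 0 shows as pri 20 and nice 19 as pri 39.

```python
import os
import subprocess

def at_lowest_priority():
    # Runs in the child just before exec: drop to nice 19, the
    # weakest scheduling priority for a normal process.
    os.nice(19)

# Grid job: default nice 0 (top shows pri 20). Backfill BOINC job:
# nice 19 (top shows pri 39), so it only uses leftover CPU cycles.
subprocess.Popen(["./run_boinc_payload.sh"],   # hypothetical wrapper
                 preexec_fn=at_lowest_priority)
```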

19 Experience from ATLAS@home at the BEIJING Tier 2 site
Grid job walltime utilization is 87.8% and grid CPU utilization is 65.6%; BOINC exploits an extra 23% of CPU time, so node CPU utilization reaches about 89% (65.6% + 23%)
More details: BEIJING_BOINC

20 Experience from ATLAS@home at the BEIJING Tier 2 site
Looking at one node: the overall CPU utilization is 98.44% on this node over 24 hours

21 Common Challenges in the future
Discontinued development of the BOINC software
More flexible scheduling for better utilization of diverse resources (availability, power)
Scalability issues: the IO bottleneck has already been hit
Outreach: how to attract more volunteer computers

22 Summary
The application of volunteer computing in HEP started over a decade ago, but it has served the big experiments for only a few years, thanks to the development of key technologies
Each experiment needs its own implementation in order to integrate the VC resource into its existing workflow
Volunteer computing has been providing a very considerable amount of CPU to HEP computing projects
It can be used in a broad range of settings: managing the internal computing devices of Tier 3 sites and institutes, and backfilling clusters

23 Thanks!

