Job Submission Via File Transfer


1 Job Submission Via File Transfer

2 Run Everywhere
  Use all available resources
  Submit locally
  Run globally

3 Foreign Languages
  Using remote resources
  Easy with HTCondor and friendly admins
    Flocking
  Harder otherwise
    Lost jobs
    Unknown queue times
    Different allocation expectations
    Different level of service

4 Paper Over the Differences
  Don’t send user jobs directly
  Make everything act like HTCondor
  Glideins
    Run HTCondor startd as a job/image/container
    Send jobs to it
    Turn remote resources into temporary members of your own HTCondor pool
  See also Annex

5 Glidein Example
  [Diagram: Home and Away sides; components shown: slurm, schedd, startd, shadow*, starter*, job]
  * One shadow and one starter per running job

6 Private Networks
  Machines with no public network access
    HPC
    Cloud
  “Split the schedd”

7 2-Schedd Glidein Example
  [Diagram: Home and Away sides, with a private network; components shown: slurm, schedd, schedd, startd, shadow, starter, job]

8 Really Private Networks
  Centers with limited remote access
    Job submission
    Data transfer
  No general network communication
  Submit via a shared file service

9 Talking Via a File System
  Goal: Support any file sharing mechanism
    NFS, Gluster, GPFS
    Box.com, Google Drive, Dropbox
    GridFTP, xrootd, rsync
    Blog site comments

10 Job Management Primitives
  Submit
    Write job description and input sandbox
  Status (optional)
    Write status description
  Completion
    Write final status description and output sandbox
  Cleanup or Removal
    Delete job description and input sandbox
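
The four primitives above can be sketched as plain file operations on a shared directory. This is only an illustration, not HTCondor's implementation: the file names (request, status.N, input, output) are borrowed from the diagram on slide 12, while the JSON encoding, the function names, and the write-then-rename step are assumptions made for the example.

```python
import json
import shutil
import tempfile
from pathlib import Path
from typing import Iterable, Optional


def _write_json(path: Path, payload: dict) -> None:
    # Assumption: write under a temporary name, then rename, so a reader
    # polling the shared directory never sees a half-written file.
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(payload))
    tmp.replace(path)


def submit(shared: Path, job_id: str, description: dict,
           input_files: Iterable[Path]) -> Path:
    """Submit: write the input sandbox, then the job description."""
    job_dir = shared / job_id
    sandbox = job_dir / "input"
    sandbox.mkdir(parents=True)
    for f in input_files:
        shutil.copy(f, sandbox / f.name)
    # The description goes last, so its presence means the sandbox is ready.
    _write_json(job_dir / "request", description)
    return job_dir


def write_status(job_dir: Path, seq: int, state: str) -> None:
    """Status (optional): write a numbered status description."""
    _write_json(job_dir / f"status.{seq}", {"state": state})


def complete(job_dir: Path, seq: int, exit_code: int,
             output_files: Iterable[Path]) -> None:
    """Completion: write the output sandbox, then the final status."""
    out = job_dir / "output"
    out.mkdir()
    for f in output_files:
        shutil.copy(f, out / f.name)
    _write_json(job_dir / f"status.{seq}",
                {"state": "completed", "exit_code": exit_code})


def latest_status(job_dir: Path) -> Optional[dict]:
    """Return the highest-numbered status file, ignoring temporary files."""
    files = [p for p in job_dir.glob("status.*") if p.suffix[1:].isdigit()]
    if not files:
        return None
    newest = max(files, key=lambda p: int(p.suffix[1:]))
    return json.loads(newest.read_text())


def cleanup(job_dir: Path) -> None:
    """Cleanup or removal: delete the job description and sandboxes."""
    shutil.rmtree(job_dir)


def demo() -> dict:
    """Walk one hypothetical job through all four primitives."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        data = root / "data.txt"
        data.write_text("payload")
        job_dir = submit(root / "shared", "job001", {"cmd": "sleep 1"}, [data])
        write_status(job_dir, 1, "running")
        result = root / "result.txt"
        result.write_text("done")
        complete(job_dir, 2, 0, [result])
        final = latest_status(job_dir)
        cleanup(job_dir)
        return {"final": final, "removed": not job_dir.exists()}
```

Because every primitive is just a file write or delete, the same sketch would apply to any of the sharing mechanisms listed on slide 9, provided the mechanism makes a renamed file visible to the other side.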

11 File-Based Submission Example
  [Diagram: Home and Away sides, with a private network; components shown: slurm, schedd, schedd, startd, shadow, starter, job]

12 File-Based Job Submission
  [Diagram: Schedd A, Schedd B, and a JobXXX folder in Box cloud storage; items shown: request, status.1, status.2, status.3, input, output]

13 And Along Came CMS
  Stretching the boundaries of Run Everywhere
    250,000 cores
    100 sites
  CMS researchers at PIC wanted to run on MareNostrum at BSC

14 BSC Site Setup
  Execute nodes
    No public network access (in or out)
  Login nodes
    No outbound network connections
    Inbound network for ssh and file transfer only
    No long-lived or CPU-intensive programs
  Shared filesystem
    GPFS (IBM General Parallel File System)

15 Find a New Model
  Can’t run a schedd at BSC
    Maybe run as part of the glidein
  CMS likes late binding
    Jobs stay at home schedd until a machine is ready to run them
  Let’s split the starter in two

16 Setting It Up
  Run a startd at PIC (close to BSC)
    Advertises the resources of a set of BSC nodes
    Won’t match until BSC job starts
  Sshfs mount from PIC to BSC’s GPFS
  Submit starter job to BSC’s SLURM
  When a job arrives at PIC startd
    PIC starter writes job to GPFS
    BSC starter reads job from GPFS and runs it
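
The last step above, where one half of the starter writes the job to the shared filesystem and the other half reads and runs it, can be sketched as follows. This is a toy model under stated assumptions, not the actual starter code: a temporary directory stands in for the sshfs-mounted GPFS, both halves run in one process, and the file names (request, stdout, result) and function names are hypothetical.

```python
import json
import subprocess
import sys
import tempfile
import time
from pathlib import Path
from typing import List, Optional, Tuple


def home_side_write_job(shared: Path, job_id: str, argv: List[str]) -> Path:
    """Home half (PIC in the talk): write the job into the shared directory."""
    job_dir = shared / job_id
    job_dir.mkdir(parents=True)
    tmp = job_dir / "request.tmp"
    tmp.write_text(json.dumps({"argv": argv}))
    tmp.replace(job_dir / "request")  # rename so readers see a complete file
    return job_dir


def remote_side_run_one(shared: Path, poll_seconds: float = 0.1,
                        max_polls: int = 50) -> Optional[Path]:
    """Remote half (BSC in the talk): poll for a job, run it, write results."""
    for _ in range(max_polls):
        for request in sorted(shared.glob("*/request")):
            job_dir = request.parent
            if (job_dir / "result").exists():
                continue  # this job already ran
            argv = json.loads(request.read_text())["argv"]
            proc = subprocess.run(argv, capture_output=True, text=True)
            (job_dir / "stdout").write_text(proc.stdout)
            tmp = job_dir / "result.tmp"
            tmp.write_text(json.dumps({"exit_code": proc.returncode}))
            tmp.replace(job_dir / "result")  # written last: marks completion
            return job_dir
        time.sleep(poll_seconds)
    return None


def home_side_collect(job_dir: Path) -> Tuple[int, str]:
    """Home half: read the exit code and output back from the shared directory."""
    result = json.loads((job_dir / "result").read_text())
    return result["exit_code"], (job_dir / "stdout").read_text()


def demo() -> Tuple[int, str]:
    """Run one trivial job through both halves using a temporary directory."""
    with tempfile.TemporaryDirectory() as tmp:
        shared = Path(tmp)
        job_dir = home_side_write_job(
            shared, "job001",
            [sys.executable, "-c", "print('hello from the job')"])
        remote_side_run_one(shared)
        return home_side_collect(job_dir)
```

Neither half opens a network connection to the other; the shared directory is the only channel, which is the property that matters for a site like BSC where execute nodes have no network access.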

17 File System Example
  [Diagram: CERN, PIC, and BSC, with a private network; components shown: slurm, schedd, startd, launcher, shadow, starter, starter, job, sshfs]

18 How Is It Different?
  No changes outside of starter
    Other daemons unaware
  Some features don’t work
    Ssh-to-job
    Chirp
    Streaming output
    Periodic checkpointing

19 Progress
  Done
    Run sleep jobs on 2 BSC nodes
    Jobs started at CERN schedd
  TODO
    Use more BSC nodes
    Larger data transfers
    Fault tolerance
    Run CMS application

20 Questions?

