HTCondor and LSST Stephen Pietrowicz Senior Research Programmer National Center for Supercomputing Applications HTCondor Week May 2-5, 2017.


1 HTCondor and LSST Stephen Pietrowicz Senior Research Programmer National Center for Supercomputing Applications HTCondor Week May 2-5, 2017

2 Large Synoptic Survey Telescope
8.4 meter ground-based telescope on Cerro Pachón in Chile
3.2 gigapixel camera
Transferring 15 terabytes of data nightly, for 10 years
Nightly alert processing at NCSA
Yearly data release processing at NCSA and IN2P3 in France

3 LSST Software Stack
Organized into dozens of custom and third-party packages
Applications framework
Data access
Sky tessellation
Managing task execution

4 How We’ve Used HTCondor So Far
Software stack scaling tests
Alert Processing Simulation
Orchestration/Execution on different sites using templates
Statistics gathering for better insight into how things are running

5 Alert Processing
Changing sky will produce ~10 million alerts nightly; astronomers will be able to subscribe to the alerts they're interested in, which will be produced within 60 seconds of observation
Alert Processing Simulation
Proof of concept of data and job workflows
25 VMs simulating 240 nodes
Two HTCondor clusters: task execution and custom transfer

6 Alert Processing Simulation

7 Orchestration
Sets up and invokes workflow execution, and monitors status
Captures software environment and records versions
Plugins (e.g., DAGMan, Pegasus, simple ssh invocations)
User-specified configuration files
  /home, software stack locations, scratch directories, etc.
Configuration can be complicated for new users

8 Execution
Abstract away the details for glide-ins and for execution
Platform-specific configuration
Substitution of common elements in platform-specific templates, to help eliminate user errors
User specifies the minimum amount of information required
  Site name
  Time allocation for nodes
  Input data and execution script
  DAG generator (DAGMan or Pegasus script)
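As an illustration of the DAG generator step, the sketch below writes a small DAGMan input file from Python. The job names, submit-file names, and the pre/worker/post shape are hypothetical and not taken from the LSST packages; only the `JOB` and `PARENT ... CHILD` keywords are standard DAGMan syntax.

```python
def make_dag(num_workers, pre_submit="preJob.sub",
             worker_submit="worker.sub", post_submit="postJob.sub"):
    """Return DAGMan input text: one pre job, N worker jobs, one post job.

    All job and submit-file names here are hypothetical placeholders.
    """
    if num_workers < 1:
        raise ValueError("need at least one worker job")

    workers = [f"worker{i}" for i in range(num_workers)]

    # Declare every node in the DAG.
    lines = [f"JOB pre {pre_submit}"]
    lines += [f"JOB {name} {worker_submit}" for name in workers]
    lines.append(f"JOB post {post_submit}")

    # Workers run after 'pre'; 'post' runs after all workers finish.
    lines.append("PARENT pre CHILD " + " ".join(workers))
    lines.append("PARENT " + " ".join(workers) + " CHILD post")
    return "\n".join(lines) + "\n"


dag_text = make_dag(3)
print(dag_text)
```

A file with this content could then be handed to `condor_submit_dag`, provided the referenced submit files exist.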

9 Example from sites.xml
Before:
<profile namespace="pegasus" key="auxillary.local">true</profile>
<profile namespace="condor" key="getEnv">True</profile>
<profile namespace="env" key="PEGASUS_HOME" >$PEGASUS_HOME</profile>
<profile namespace="condor" key="requirements">(ALLOCATED_NODE_SET == "$NODE_SET")</profile>
<profile namespace="condor" key="+JOB_NODE_SET">"$NODE_SET"</profile>
<profile namespace="env" key="EUPS_USERDATA" >/scratch/$USERNAME/eupsUserData</profile>
After:
<profile namespace="env" key="PEGASUS_HOME" >/software/middleware/pegasus/current</profile>
<profile namespace="condor" key="requirements">(ALLOCATED_NODE_SET == "srp_478")</profile>
<profile namespace="condor" key="+JOB_NODE_SET">"srp_478"</profile>
<profile namespace="env" key="EUPS_USERDATA" >/scratch/srp/eupsUserData</profile>
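The substitution shown in this example can be reproduced with a few lines of Python. This is a minimal sketch that uses the standard library's `string.Template`, which happens to share the `$NAME` placeholder syntax; the slides do not say which mechanism the LSST execution package actually uses, so treat the choice of `string.Template` as an assumption.

```python
from string import Template

# Template lines copied from the "Before" half of the sites.xml example.
template_text = '''<profile namespace="env" key="PEGASUS_HOME" >$PEGASUS_HOME</profile>
<profile namespace="condor" key="requirements">(ALLOCATED_NODE_SET == "$NODE_SET")</profile>
<profile namespace="condor" key="+JOB_NODE_SET">"$NODE_SET"</profile>
<profile namespace="env" key="EUPS_USERDATA" >/scratch/$USERNAME/eupsUserData</profile>
'''

# Values taken from the "After" half of the example.
values = {
    "PEGASUS_HOME": "/software/middleware/pegasus/current",
    "NODE_SET": "srp_478",
    "USERNAME": "srp",
}

# substitute() raises KeyError if a placeholder has no value, so a
# forgotten setting fails early instead of producing a broken file.
filled = Template(template_text).substitute(values)
print(filled)
```

Using `substitute()` rather than `safe_substitute()` fits the stated goal of eliminating user errors, since an unfilled placeholder never reaches the generated file.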

10 Statistics
The statistics package contains commands to ingest DAGMan log event records into a database. The package's utilities group these records by Condor job ID to give an overview of what happened during the job.

11 Grouped records
Submitted, executing on a node, image size updated, an exception in the Shadow daemon occurs, execution starts on a node, image size updated, and then updated again, disconnected from node, reconnection to node failed, execution starts on a node, and then terminates.
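The grouping step can be sketched as follows. This is not the statistics package's own code: it assumes the classic HTCondor user-log layout, in which each event starts with a header line of the form `NNN (cluster.proc.subproc) date time message`, and it keeps the grouped records in a dictionary rather than a database. The host names in the sample log are hypothetical placeholders.

```python
import re
from collections import defaultdict

# Assumed header shape of a classic HTCondor user-log event, e.g.
#   "001 (478.000.000) 05/02 10:01:00 Job executing on host: ..."
EVENT_HEADER = re.compile(r"^(\d{3}) \((\d+)\.(\d+)\.\d+\) (\S+ \S+) (.*)$")


def group_events(log_text):
    """Map 'cluster.proc' job IDs to (code, timestamp, message) tuples in log order."""
    grouped = defaultdict(list)
    for line in log_text.splitlines():
        match = EVENT_HEADER.match(line)
        if match:  # detail lines and "..." separators are skipped
            code, cluster, proc, timestamp, message = match.groups()
            job_id = f"{int(cluster)}.{int(proc)}"
            grouped[job_id].append((code, timestamp, message))
    return dict(grouped)


# Hypothetical log excerpt with two interleaved jobs.
sample_log = """\
000 (478.000.000) 05/02 10:00:00 Job submitted from host: <hypothetical-submit-host>
...
000 (479.000.000) 05/02 10:00:01 Job submitted from host: <hypothetical-submit-host>
...
001 (478.000.000) 05/02 10:01:00 Job executing on host: <hypothetical-worker>
...
005 (478.000.000) 05/02 10:09:30 Job terminated.
...
"""

events = group_events(sample_log)
for job_id, records in events.items():
    print(job_id, [code for code, _, _ in records])
```

Once events are keyed by job ID, the per-job sequence (submit, execute, terminate, and any failures in between) can be read off directly.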

12 Execution report

13 More information Alert Production Simulator Description: Anim: Simulator: Statistics: Orchestration: Execution: Platform config:

