Five to-dos when moving an application to distributed HTC

Overview
1) Understand your job
2) Take it with you
3) Cache your data
4) Remote I/O
5) Be checkpointable

Understand your job
› Is it ready for HTC?
  - Runs without interaction
› Are the requirements well understood?
  - Input required
  - Execution time
  - Output generated
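
As a hedged illustration, a minimal HTCondor submit file can record those requirements explicitly; all file names and resource numbers below are made up for the example:

    # submit file sketch -- declares input, resources, and output up front
    executable              = my_analysis
    arguments               = input.dat
    transfer_input_files    = input.dat
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    request_cpus            = 1
    request_memory          = 2GB
    request_disk            = 4GB
    output = job.out
    error  = job.err
    log    = job.log
    queue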

Understand your job
› ALL requirements understood?
  - Licenses
  - Network bandwidth
  - Software dependencies
› Based on the requirements, we'll use one or more of the following strategies to get the job running smoothly.

Take it with you
› Your job can run in more places, and therefore potentially access more resources, if it has fewer dependencies.
  - Don't rely on obscure packages being installed.
  - If you require a specific version of something (perl, python, etc.), consider making it part of your job.

Take it with you
› Know what your set of input files is.
  - The remote execute node may not share the same filesystems, so you'll want to bring all the input with you.
› In HTCondor, you can specify the entire list of files to transfer, or a directory (see the sketch below).
› If the number of files is very large but their total size is small, consider creating a tarball containing the needed run-time environment.
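
A hedged sketch of both options; the file, directory, and tarball names are illustrative:

    # Submit file: list individual files and/or whole directories to transfer
    transfer_input_files = params.cfg, data/, runtime.tar.gz

    # On the submit host: bundle many small files into a single tarball first
    tar czf runtime.tar.gz bin/ lib/ perl5/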

Take it with you
› Wrapper scripts can help here:
  - Untar input or otherwise prepare it
  - Locate and verify dependencies
  - Set environment variables
› We use a wrapper-script approach for running Matlab and R jobs on CHTC (a sketch follows).
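
A minimal wrapper-script sketch in the spirit of that approach; the script, tarball, and program names are hypothetical:

    #!/bin/bash
    # run_job.sh -- hypothetical wrapper executed on the execute node
    set -e

    # 1) Unpack the run-time environment shipped with the job
    tar xzf runtime.tar.gz

    # 2) Verify a required dependency is actually present
    if [ ! -x runtime/bin/my_tool ]; then
        echo "missing dependency: runtime/bin/my_tool" >&2
        exit 1
    fi

    # 3) Set environment variables so the job finds its libraries
    export PATH="$PWD/runtime/bin:$PATH"
    export LD_LIBRARY_PATH="$PWD/runtime/lib:$LD_LIBRARY_PATH"

    # 4) Run the real program with whatever arguments HTCondor passed in
    ./my_program "$@"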

Take it with you
› Licensing
› Matlab requires a license to run the interpreter or the compiler, but not to run the results of the compilation.
› Part of the submission process, then, is compiling the Matlab job. This is done on a dedicated, licensed machine, using HTCondor and a custom tool:
     chtc_mcc --mfiles=my_code.m

Take it with you
› Another way to manage licenses is HTCondor's "concurrency limits":
  - The user places this in the submit file:
       concurrency_limits = sw_foo
  - The admin places this in the condor_config:
       SW_FOO_LIMIT = 10
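
As a hedged aside, a single job can also count for more than one unit of a limit, which is useful when it consumes multiple license tokens; the limit name below is the same hypothetical sw_foo:

    # submit file: this job consumes 2 of the 10 available sw_foo tokens
    concurrency_limits = sw_foo:2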

Cache your data
› Let's return for a moment to the compiled Matlab job.
› The job still requires the Matlab runtime libraries.
› As mentioned earlier, let's not assume they will be present everywhere.

Cache your data
› This runtime is the same for every Matlab job.
› Running hundreds of these simultaneously will cause the same runtime to be sent from the submit node to each execute node.
› CHTC solution: squid proxies.

Cache your data
› The CHTC wrapper script fetches the Matlab runtime using HTTP.
› Before doing so, it sets the http_proxy environment variable.
› curl then automatically routes the download through the local squid cache.
› This can also be done with HTCondor's file-transfer plugin mechanism, which supports third-party transfers (including HTTP).
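
A sketch of that fragment of the wrapper script; the proxy host and runtime URL below are placeholders, not CHTC's real addresses:

    # Point HTTP clients at the site's local squid proxy (hostname is hypothetical)
    export http_proxy="http://squid-proxy.example.edu:3128"

    # curl honors http_proxy, so repeated downloads of the same runtime tarball
    # are served from the nearby cache instead of the original web server
    curl -O http://chtc-web.example.edu/matlab-runtime.tar.gz
    tar xzf matlab-runtime.tar.gz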

Cache your data
› The same approach would be taken for any other application that has one or more chunks of data that are "static" across jobs:
  - The R runtime
  - BLAST databases

Remote I/O
› What if I don't know what data my program will access?
› Transferring everything it could possibly need may be too unwieldy and inefficient.
› Consider remote I/O.

Remote I/O
› Files could be fetched on demand, again using HTTP or whatever mechanism suits.
› When running in HTCondor, the condor_chirp tool allows files to be fetched from, and stored back to, the submit machine during the job.
› Also consider an interposition agent, such as Parrot, which allows trapping of I/O calls.

Remote I/O
› In HTCondor, add this to the submit file:
     WantRemoteIO = True
› It is off by default.
› Now the job can execute:
     condor_chirp fetch /home/zmiller/foo bar
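
A hedged wrapper fragment showing on-demand remote I/O with condor_chirp; the data and result file names are illustrative:

    # Fetch an input file from the submit machine only when it is needed
    condor_chirp fetch /home/zmiller/big_table.dat big_table.dat

    # ... run the program against the local copy ...
    ./my_program big_table.dat > results.out

    # Store the results back on the submit machine before the job exits
    condor_chirp put results.out /home/zmiller/results.out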

Remote I/O
› Galaxy assumes a shared filesystem for both programs and data.
› Most HTCondor pools do not have this.
› We initially tried to explicitly transfer all necessary files.
  - This requires additional work to support each application.

Remote I/O
› New approach: Parrot
  - Intercepts the job's I/O calls and redirects them back to the submitting machine
› New job wrapper for HTCondor/Parrot
  - Transfers Parrot to the execute machine and invokes the job under Parrot (a sketch follows)
› Could also be extended to have Parrot cache large input data files
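
As a rough sketch only, such a wrapper might invoke the job under cctools' parrot_run so that paths under /chirp/ are served over the network; the server address and file names are assumptions, not the actual CHTC wrapper:

    #!/bin/bash
    # Hypothetical HTCondor/Parrot wrapper: parrot_run was transferred
    # alongside the job, and the job reads its data through Parrot's
    # /chirp/<server>/<path> namespace instead of a shared filesystem.
    ./parrot_run ./my_program /chirp/submit.example.edu/data/input.dat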

Checkpointing
› Policy on many clusters prevents jobs from running longer than several hours, or maybe a handful of days, before the job is preempted.
› What if your job will not finish in that window, so no progress can be made?
› Make your job checkpointable.

Checkpointing
› HTCondor supports the "standard universe", in which you recompile (relink, actually) your executable.
› Checkpoints are taken automatically when running in this mode; when the job is rescheduled, even on a different machine, it continues from where it left off.

Checkpointing
› condor_compile is the tool used to create checkpointable jobs (example below).
› There are some limitations:
  - No fork()
  - No open sockets
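
A hedged sketch of relinking and submitting a standard-universe job; the source file and binary names are made up:

    # Relink the program through condor_compile so checkpointing is built in
    condor_compile gcc -o my_sim my_sim.c

    # Submit file for the relinked binary
    universe   = standard
    executable = my_sim
    log        = my_sim.log
    queue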

Checkpointing
› HTCondor is also working on integration with DMTCP to do checkpointing.
› Another option is user-space checkpointing: if your job can catch a signal and write its state to a file, it can resume from that file the next time it starts (a sketch follows).
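
A minimal user-space checkpointing sketch as a shell wrapper; the state file, the signal, and the loop are assumptions for illustration, not a prescribed HTCondor interface:

    #!/bin/bash
    # Hypothetical self-checkpointing job: resume from checkpoint.dat if present
    i=0
    [ -f checkpoint.dat ] && i=$(cat checkpoint.dat)

    # On SIGTERM (e.g. preemption), record progress so the next run can resume
    trap 'echo "$i" > checkpoint.dat; exit 1' TERM

    while [ "$i" -lt 1000000 ]; do
        # ... one unit of real work would go here ...
        i=$((i + 1))
    done

    rm -f checkpoint.dat   # finished cleanly; no checkpoint left behind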

Conclusion
› Jobs have many different requirements and patterns of use.
› Using one or more of the ideas above should help you get an application running smoothly at large scale.
› Questions? Please come talk to me during a break.
› Thanks!