Turning science problems into HTC jobs
Wednesday, July 29, 2011
Zach Miller, Condor Team, University of Wisconsin-Madison


Overview
You now know how to run jobs using Condor, create basic workflows using DAGMan, and run a simple BLAST query. Let's put these pieces together to tackle larger problems. This session focuses on how to break down and process large problems.

Review of BLAST Example
First, run BLAST locally:
ssh osg-edu.cs.wisc.edu
/nfs/osg-app/osgedu/blast/bin/blastp -db /nfs/osg-data/osgedu/blast_dbs/yeast.aa -query /nfs/osg-data/osgedu/blast_inputs/query1

Think about running your application remotely
What are the dependencies?
- Architecture
- OS / Linux distro
- Shared libraries
- Input files
- Environment variables
- Scratch space
- Available CPU
- ...
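
Some of these dependencies can be stated in the submit file so Condor only matches suitable machines; a minimal sketch (the attribute values and the environment variable shown are illustrative, not part of the exercise):

  # Restrict where the job may run, and declare inputs and environment.
  requirements = (Arch == "X86_64") && (OpSys == "LINUX")
  transfer_input_files = query1
  environment = "BLASTDB=/nfs/osg-data/osgedu/blast_dbs"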

Running BLAST under Condor
Did we get all dependencies? Are we sized correctly? How long will the job run? How much data will it produce? What kind of load are we putting on various system resources?

Sizing jobs for the grid
If the job is too small, there's too much overhead. If the job is too big, there's potential for "badput". Badput is when a job runs for a while but is preempted and the output thrown away. Rule of thumb: between 1 and 8 hours.

Other Considerations
Besides how long a job will run, consider:
- Memory requirements
- Disk requirements
- Network I/O
- Consumable licenses
Try to identify all types of resources needed.
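
Condor can also be told about these needs directly in the submit file, so jobs only match machines with enough resources; a minimal sketch with placeholder values (memory in MB, disk in KB):

  # The numbers below are placeholders, not recommendations.
  request_memory = 2048
  request_disk   = 1048576
  request_cpus   = 1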

Hands-on Exercise: Processing multiple sequences
You get all queries in one file, and BLAST will accept input files with multiple queries. Try running BLAST with the input file /nfs/osg-data/osgedu/blast_inputs/five_inputs. How long does it take?
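
For example, reusing the local command from the review slide (the time prefix simply measures how long it runs):

  time /nfs/osg-app/osgedu/blast/bin/blastp -db /nfs/osg-data/osgedu/blast_dbs/yeast.aa -query /nfs/osg-data/osgedu/blast_inputs/five_inputs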

Hands-on Exercise
Now, let's try running a much larger input: /nfs/osg-data/osgedu/blast_inputs/large_input. Note: it contains 6298 input sequences and may take 20 minutes or more! (Go ahead and start it while I talk!)
blastp -db yeast.aa -query large_input | grep '^Query=' | nl
Let's think about how we can do this more quickly...

BLAST Input
BLAST input can be split at sequence boundaries. Make a temporary directory in your home directory, and copy the file /nfs/osg-data/osgedu/blast_inputs/five_inputs into it. Open five_inputs and look at the structure. With only 5 sequences, it could be split manually, but...

BLAST Input
But with 6298 sequences, doing this manually is out of the question. We need some sort of automation! Write a script to split apart the input file, or copy /nfs/osg-app/osgedu/helper_scripts/split_file.pl. Use your script first on five_inputs to create five input files:
split_file.pl SMALL_TEST 1 < five_inputs
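
If you write your own splitter, it only needs to start a new output file every N sequences (each sequence begins with a '>' header line). Below is a minimal Perl sketch with the same command-line interface as split_file.pl; the output naming (PREFIX.0, PREFIX.1, ...) is an assumption, not necessarily what the real script does.

  #!/usr/bin/env perl
  # Hypothetical FASTA splitter: writes PREFIX.0, PREFIX.1, ... with
  # SEQS_PER_CHUNK sequences each, reading the combined input on stdin.
  use strict;
  use warnings;

  my ($prefix, $per_chunk) = @ARGV;
  die "usage: $0 PREFIX SEQS_PER_CHUNK < input\n"
      unless defined $prefix && defined $per_chunk && $per_chunk > 0;

  my ($chunk, $count, $fh) = (0, 0, undef);
  while (my $line = <STDIN>) {
      if ($line =~ /^>/) {                  # a '>' line starts a new sequence
          if ($count % $per_chunk == 0) {   # time to start a new output file
              close $fh if $fh;
              open $fh, '>', "$prefix.$chunk" or die "cannot write $prefix.$chunk: $!";
              $chunk++;
          }
          $count++;
      }
      print {$fh} $line if $fh;
  }
  close $fh if $fh;
  print STDERR "wrote $chunk file(s), $count sequence(s)\n";

Used the same way as the course script, "perl my_split.pl SMALL_TEST 1 < five_inputs" should produce five single-sequence files.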

Submitting to Condor
But how do you submit? You could create a submit file for each new input file... or you can use some fancy features in Condor to build a template submit file. Copy /nfs/osg-data/osgedu/blast_template/blast_template.sub to a temporary directory and examine it. Modify it to your needs, or just copy /nfs/osg-data/osgedu/blast_template/blast_template_1.sub.
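
A template submit file for this kind of run might look roughly like the sketch below; the real blast_template_1.sub may differ. It assumes a shared filesystem and split files named SMALL_TEST.0 through SMALL_TEST.4, with $(Process) selecting the piece:

  # Sketch of a template submit file; the file naming is an assumption.
  universe   = vanilla
  executable = /nfs/osg-app/osgedu/blast/bin/blastp
  arguments  = -db /nfs/osg-data/osgedu/blast_dbs/yeast.aa -query SMALL_TEST.$(Process)
  output     = SMALL_TEST.$(Process).out
  error      = SMALL_TEST.$(Process).err
  log        = blast.log
  queue 5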

Submitting to Condor
Now run:
condor_submit blast_template_1.sub
Watch it run with condor_q. Examine the output (*.out). Count the results:
grep '^Query=' *.out | nl

Submitting to Condor
Now do the same with a very large input:
- Create a temporary directory and cd to it
- cp /nfs/osg-data/osgedu/blast_inputs/large_input .
- /nfs/osg-app/osgedu/helper_scripts/split_file.pl BIG_TEST 315 < large_input
- cp /nfs/osg-data/osgedu/blast_template/blast_template_2.sub .
- condor_submit blast_template_2.sub
Again, watch it run:
- condor_q -dag
- condor_q -run

Success?
Problems with this approach:
- Not completely automated
- Requires editing template files
- How do you know when the workflow is done?
- How do you know it was all successful?

Managing your Large Run
Using DAGMan can help here! DAGMan brings the ability to implement several features:
- Final notification when all pieces are complete
- Verification that all results are present
- Filtering or massaging the output into a final form
- Throttling job submission for busy pools (see the example below)
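
For example, throttling can be applied when the DAG is submitted; -maxjobs and -maxidle are standard condor_submit_dag options, and the values here are only illustrative:

  condor_submit_dag -maxjobs 50 -maxidle 100 LARGE_RUN.dag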

Managing your Large Run
Once again, this can be automated. Write a script that generates a .dag file as well as the submit files. The DAG will list all jobs but no relationships between them (see /nfs/osg-data/osgedu/blast_template/example.dag). Optionally, copy this script, which does it for you: /nfs/osg-app/osgedu/helper_scripts/gen_dag.pl
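
Such a DAG is just a flat list of JOB lines, one per split input; a sketch with illustrative node and file names (compare with example.dag):

  JOB piece0 piece0.sub
  JOB piece1 piece1.sub
  JOB piece2 piece2.sub
  # ... one JOB line per piece, and no PARENT/CHILD lines,
  # since the pieces do not depend on each other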

Managing your Large Run
Try splitting the input into 20 pieces: 6298 sequences / 20 ≈ 315 per file.
gen_dag.pl LARGE_RUN 315
Look at what that script produced.
condor_submit_dag LARGE_RUN.dag
Watch it run... How long is each piece taking? How would you change these numbers for an input of one million queries?

Additional Work
Modify your script to create a new node in the DAG that is a child of all other nodes (see the sketch after this list). Make a submit file for that node that runs a script. What should that script do?
- Send a notification?
- Verify results?
- Concatenate results?
- Compress results?
How many of those can you implement?
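
In the DAG file itself, this is one extra JOB line plus a PARENT/CHILD line naming every piece as a parent of the new node; a sketch with illustrative names:

  JOB piece0 piece0.sub
  JOB piece1 piece1.sub
  # ... one JOB line per piece ...
  JOB finalize finalize.sub
  # the real PARENT line lists all of the piece nodes
  PARENT piece0 piece1 CHILD finalize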

Notes on DAGMan
Different nodes in the DAG can be different types of job: vanilla, grid, or local. Make the final node use the LOCAL universe (check the Condor manual for details); you want it to run locally to verify that all the results received are intact. Feel free to use and edit my script: /nfs/osg-app/osgedu/helper_scripts/gen_dag_w_final.pl
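
The submit file for that final node might look roughly like this; the wrapper script name verify_and_combine.sh is hypothetical:

  # Sketch: a local-universe submit file for the final DAG node.
  universe   = local
  executable = verify_and_combine.sh
  output     = finalize.out
  error      = finalize.err
  log        = finalize.log
  queue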

Additional Work
Run the individual pieces on the OSG using Condor-G, instead of in the vanilla universe. Write another script that does everything using the pieces we just wrote (a sketch follows this list):
- Splits the input
- Generates the DAG
- Submits the workflow
- Waits for completion (hint: condor_wait)
- Combines results
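
A sketch of such a driver in Perl is below. Everything in it is an assumption for illustration: it assumes the helper scripts have been copied into the working directory, that the generated DAG is named BIG_RUN.dag, and that the per-piece outputs end in .out.

  #!/usr/bin/env perl
  # Hypothetical end-to-end driver: split, generate DAG, submit, wait, combine.
  use strict;
  use warnings;

  my ($input, $per_chunk, $prefix) = ('large_input', 315, 'BIG_RUN');

  # 1. Split the combined input at sequence boundaries.
  system("./split_file.pl $prefix $per_chunk < $input") == 0
      or die "split failed\n";

  # 2. Generate the per-piece submit files and the DAG.
  system("./gen_dag.pl $prefix $per_chunk") == 0
      or die "DAG generation failed\n";

  # 3. Hand the whole workflow to DAGMan.
  system("condor_submit_dag $prefix.dag") == 0
      or die "condor_submit_dag failed\n";

  # 4. Block until the DAGMan job itself finishes; its user log
  #    is written next to the DAG as <dagfile>.dagman.log.
  system("condor_wait $prefix.dag.dagman.log") == 0
      or die "condor_wait failed\n";

  # 5. Combine the per-piece outputs into one file (naming is assumed).
  system("cat $prefix*.out > ${prefix}_combined.out") == 0
      or die "combining results failed\n";

  print "workflow complete\n";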

Additional Work
Try running five_inputs and large_input against other databases. You'll need to:
- Run a single input as a test. How long did it take?
- Estimate how many compute hours this will take...
- Decide how to split up the input appropriately
- Get access to enough available resources to do this in a reasonable amount of time

Questions?
Questions about splitting, submitting, automating, etc.?
Break Time
Challenges:
- Run large_input against the nr database
- Run jobs on the grid using glidein