CSIU Submission of BLAST jobs via the Galaxy Interface Rob Quick Open Science Grid – Operations Area Coordinator Indiana University.

Presentation transcript:

CSIU Submission of BLAST jobs via the Galaxy Interface
Rob Quick
Open Science Grid – Operations Area Coordinator
Indiana University – Manager, High Throughput Computing
Computational Sciences at Indiana University (CSIU) – VO Manager

2012 Africa Grid School
- Motivation
- What is BLAST?
- Submission to OSG
- Galaxy UI

National Center for Genome Analysis Support (NCGAS)
“The mission of the National Center for Genome Analysis Support is to enable the biological research community of the US to analyze, understand, and make use of the vast amount of genomic information now available. NCGAS focuses particularly on transcriptome- and genome-level assembly, phylogenetics, metagenomics/transcriptomics and community genomics.”

Mason Cluster
Mason at Indiana University
- Large-memory computer cluster (512 GB per node)
- Configured to support data-intensive, high-performance computing tasks for researchers using genome assembly software
- Suitable for assembly of data from next-generation sequencers
- Large-scale phylogenetic software
- Other genome analysis applications that require large amounts of computer memory

What is BLAST?
Basic Local Alignment Search Tool
- One of the most widely used bioinformatics programs
- Algorithm for comparing biological sequence information
- Compares a query sequence to a library of sequences
- Allows comparison of an unknown sequence to known similar genes

BLAST Vitals
Input – Query sequence
- 1 to 70k+ sequences
Output – Plain text, XML, or HTML query report
Application – blastp, blastx, blastn (each 26 MB)
Database – ~35 GB uncompressed
- 13 subsections, each ~2.5 GB
- Updated ~monthly by NCBI

BLAST on OSG
We’ve experimented with several options
- Application
  - Sent with job (non-trivial size)
  - Local installation
  - OASIS (OSG-wide HTTP FS)
- Database
  - Validation and installation job
  - Splitting into smaller DB subsections
  - Reassembly of output
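The "reassembly of output" step can be illustrated with a minimal sketch. It assumes each database subsection has already produced a hit list for one query, sorted by E-value, and that all searches reported E-values against the same effective database size (otherwise the values are not directly comparable). The `Hit` record and `merge_hits` function are hypothetical names for illustration, not part of BLAST or of the workflow described in the slides.

```python
import heapq
from itertools import islice
from typing import Iterable, List, NamedTuple


class Hit(NamedTuple):
    """One alignment hit (hypothetical record; fields chosen for illustration)."""
    query_id: str
    subject_id: str
    evalue: float


def merge_hits(per_segment: Iterable[List[Hit]], top_n: int) -> List[Hit]:
    """Merge per-segment hit lists, each already sorted by E-value,
    and keep the top_n best (lowest E-value) hits overall."""
    merged = heapq.merge(*per_segment, key=lambda h: h.evalue)
    return list(islice(merged, top_n))


# Illustrative values only, not real search results.
seg_a = [Hit("q1", "s1", 1e-30), Hit("q1", "s7", 1e-5)]
seg_b = [Hit("q1", "s4", 1e-12)]
best = merge_hits([seg_a, seg_b], top_n=2)
```

Because each input list is already sorted, `heapq.merge` combines them lazily, which keeps memory use low even when the individual outputs are large.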

Test Case
38k queries – 3 Acanthamoeba RNA-Seq
- Split into 10 query jobs and Condor submission file created
- Tested different submission techniques
  - Galaxy → BOSCO → OSG_XSEDE → Glidein
  - Galaxy → AMQP → OSG_XSEDE → Glidein
  - Pegasus-based workflow
  - Condor-G submission
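Splitting a query set into a fixed number of jobs can be sketched as follows. This is a minimal illustration, assuming the queries are in FASTA format and that a round-robin distribution is acceptable; the function names `read_fasta` and `split_queries` are hypothetical and the slides do not say how the split was actually performed.

```python
from typing import Iterator, List, Tuple


def read_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, seq_lines = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq_lines)
            header, seq_lines = line[1:], []
        elif header is not None:
            seq_lines.append(line)
    if header is not None:
        yield header, "".join(seq_lines)


def split_queries(text: str, n_chunks: int) -> List[str]:
    """Distribute FASTA records round-robin into n_chunks FASTA strings."""
    chunks: List[List[str]] = [[] for _ in range(n_chunks)]
    for i, (header, seq) in enumerate(read_fasta(text)):
        chunks[i % n_chunks].append(f">{header}\n{seq}\n")
    return ["".join(c) for c in chunks]


# Tiny illustrative input; a real run would read the query file from disk.
example = ">a\nAC\nGT\n>b\nGG\n>c\nTT\n"
parts = split_queries(example, 2)
```

Round-robin assignment balances the number of sequences per job but not their total length; if query lengths vary widely, balancing by residue count may give more even run times.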

Some Behavior Issues
Execution Time
- Jobs submitted to the same resource share the DB
- Sometimes 3-4 hours to run 10 queries
Memory Growth
- Memory usage grows over time (leak in blastp?)
- Some sites kill jobs at memory sizes over 2.5 GB
Merging Outputs
- Size of output

Converging on Solution
- Generate segmented BLAST DB and publish on osg-xsede
- Construct workflow using Condor DAG
- BLAST app shipped with job
- BLAST DB downloaded by each job (only the segment necessary)
- Execute with -dbsize to simulate full DB run
- Merged with -xml output as part of the DAG
- Galaxy will submit DAG workflow to local Condor queue, which forwards to osg-xsede
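The shape of such a workflow can be sketched in a few lines. The example below is an assumption-laden illustration, not the authors' implementation: it generates HTCondor DAGMan text with one search node per (query chunk, DB segment) pair and a single merge node that runs after all searches. The submit-file names `blast.sub` and `merge.sub`, the node naming scheme, and the helper functions are hypothetical. The argument string assumes BLAST+ command-line options (`-query`, `-db`, `-dbsize`, `-outfmt 5` for XML, `-out`).

```python
from typing import List


def blast_arguments(query: str, segment: str, effective_db_length: int, out: str) -> str:
    """Arguments for one search against a single DB segment.

    -dbsize sets the effective database length so that statistics are
    computed as if the full database had been searched.
    """
    return (
        f"-query {query} -db {segment} "
        f"-dbsize {effective_db_length} -outfmt 5 -out {out}"
    )


def build_dag(n_queries: int, n_segments: int,
              search_sub: str = "blast.sub", merge_sub: str = "merge.sub") -> str:
    """Return DAGMan text: all search nodes are parents of one merge node."""
    lines: List[str] = []
    search_jobs: List[str] = []
    for q in range(n_queries):
        for s in range(n_segments):
            name = f"blast_q{q}_s{s}"
            search_jobs.append(name)
            lines.append(f"JOB {name} {search_sub}")
            # VARS passes per-node macros into the shared submit file.
            lines.append(f'VARS {name} query="{q}" segment="{s}"')
    lines.append(f"JOB merge {merge_sub}")
    lines.append(f"PARENT {' '.join(search_jobs)} CHILD merge")
    return "\n".join(lines) + "\n"


dag_text = build_dag(n_queries=2, n_segments=3)
```

With the figures quoted earlier in the talk (10 query jobs, 13 database subsections), `build_dag(10, 13)` would produce 130 search nodes feeding one merge node; whether the actual workflow was partitioned this way is not stated in the slides.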

Architecture Flow

Galaxy UI at IU

Galaxy UI at IU (continued)

Galaxy Interaction
BOSCO instance runs on the Galaxy UI server
- DAG is submitted to local Condor queue
- Galaxy node → osg-xsede → glidein factory
- Wait for execution
- Format and delivery of data
Other work on Galaxy node uses local PBS queue

Other Notes
OSG Accounting Project = IU_GALAXY
- 46k cpu/hr testing Sept […]
- […]k queries run in ~6hrs
Targeting this work for publication in a peer-reviewed bioinformatics journal
We will submit this work to Galaxy as a possible branch

Acknowledgements
- Soichi Hayashi
- Carrie Genote
- Le-Shin Wu
- Scott Teige
- Rich LeDuc
- Derek Weitzel
- Bill Barnett