Galaxy based BLAST submission to distributed high throughput computing resources Rob Quick and Soichi Hayashi Open Science Grid Operations Indiana University.

Presentation transcript:

Galaxy based BLAST submission to distributed high throughput computing resources Rob Quick and Soichi Hayashi Open Science Grid Operations Indiana University / Research Technologies Open Science Grid

Galaxy-NCBI-BLAST
Galaxy - an open, web-based platform for data-intensive biomedical research.
NCBI (National Center for Biotechnology Information) - provides access to biomedical and genomic information.
BLAST (Basic Local Alignment Search Tool)
● Popular application among bioinformaticists
● Compares biological sequences
● Identifies unknown sequences
● Discovers related organisms
● Many flavors, chosen according to the query and database format (blastn, blastx, blastp, tblastn, tblastx)
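The five flavors listed above differ in the sequence type of the query and of the database. As a minimal sketch (the table layout and the helper name `programs_for` are illustrative, not part of Galaxy or osg-blast), the choice can be written as a lookup:

```python
# Query type, database type, and what gets translated for each BLAST program.
# The dictionary layout and helper name are illustrative assumptions.
BLAST_PROGRAMS = {
    "blastn": ("nucleotide", "nucleotide", "nothing translated"),
    "blastp": ("protein", "protein", "nothing translated"),
    "blastx": ("nucleotide", "protein", "query translated"),
    "tblastn": ("protein", "nucleotide", "database translated"),
    "tblastx": ("nucleotide", "nucleotide", "query and database translated"),
}


def programs_for(query_type, db_type):
    """Return the BLAST programs that accept this query/database pairing."""
    return sorted(
        name
        for name, (query, database, _note) in BLAST_PROGRAMS.items()
        if query == query_type and database == db_type
    )


if __name__ == "__main__":
    for name, (query, database, note) in BLAST_PROGRAMS.items():
        print(f"{name:8s} query={query:10s} db={database:10s} ({note})")
```

A nucleotide query against a nucleotide database matches two programs: blastn compares the sequences directly, while tblastx compares their translations.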

Input Query (Unknown Organism)
>CHR Chromosome I Sequence
CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACCCACACACACA
CATCCTAACACTACCCTAACACAGCCCTAATCTAACCCTGGCCAACCTGTCTCTCAACTT
ACCCTCCATTACCCTGCCTCCACTCGTTACCCTGTCCCATTCAACCATACCACTCCGAAC
CACCATCCATCCCTCTACTTACTACCACTCACCCACCGTTACCCTCCAATTACCCATATC
CAACCCACTGCCACTTACCCTACCATTACCCTACCATCCACCATGACCTACTCACCATAC
TGTTCTTCTACCCACCATATTGAAACGCTAACAAATGATCGTAAATAACACACACGTGCT
TACCCTACCACTTTATACCACCACCACATGCCATACTCACCCTCACTTGTATACTGATTT
TACGTACGCACACGGATGCTACAGTATATACCATCTCAAACTTACCCTACTCTCAGATTC
CACTTCACTCCATGGCCCATCTCTCACTGAATCAGTACCAAATGCACTCACATCATTATG
CACGGCACTTGCCTCAGCGGTCTATACCCTGTGCCATTTACCCATAACGCCCATCATTAT
CCACATTTTGATATCTATATCTCATTCGGCGGTCCCAAATATTGTATAACTGCCCTTAAT
ACATACGTTATACCACTTTTGCACCATATACTTACCACTCCATTTATATACACTTATGTC
AATATTACAGAAAAATCCCCACAAAAATCACCTAAACATAAAAATATTCTACTTTTCAAC

$ blastn -db mydb -query input_query.fasta -out output.txt -outfmt 1

comp10597_c0_seq1Uextra e
comp10597_c0_seq1Uextra e
comp12438_c0_seq12L e
comp12438_c0_seq22L e

Database Source fasta
>gi| |ref|NC_ | Saccharomyces cerevisiae mitochondrion
TTCATAATTAATTTTTTATATATATATTATATTATAATATTAATTTATATTATAAAAATAATATTTATTATTAAAATATT
TATTCTCCTTTCGGGGTTCCGGCTCCCGTGGCCGGGCCCCGGAATTATTAATTAATAATAAATTATTATTAATAATTATT
TATTATTTTATCATTAAAATATATAAATAAAAAATATTAAAAAGATAAAAAAAATAATGTTTATTCTTTATATAAATTAT
ATATATATATATAATTAATTAATTAATTAATTAATTAATAATAAAAATATAATTATAAATAATATAAATATTATTCTTTA
TTAATAAATATATATTTATATATTATAAAAGTATCTTAATTAATAAAAATAAACATTTAATAATATGAATTATATATTAT
TATTATTATTAATAAAATTATTAATAATAATCAATATGAAATTAATAAAAATCTTATAAAAAAGTAATGAATACTCCTTT
… (150,000 lines)

Blast DB
$ makeblastdb -in yeast.fasta -dbtype nucl -out yeast
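The slide above shows a two-step workflow: build a database from a FASTA file with makeblastdb, then search it with blastn. The following is a minimal Python sketch of that workflow. The helper names (`read_fasta`, `makeblastdb_args`, `blastn_args`) are hypothetical. Only the command-line flags are taken from the slide. The sketch also assumes the same database name is used in both steps, whereas the slide writes `yeast` in one command and `mydb` in the other.

```python
import subprocess


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def makeblastdb_args(fasta_path, db_name, dbtype="nucl"):
    """Argument list for building a BLAST database (flags as on the slide)."""
    return ["makeblastdb", "-in", fasta_path, "-dbtype", dbtype, "-out", db_name]


def blastn_args(db_name, query_path, out_path, outfmt):
    """Argument list for a blastn search (flags as on the slide)."""
    return ["blastn", "-db", db_name, "-query", query_path,
            "-out", out_path, "-outfmt", str(outfmt)]


if __name__ == "__main__":
    # Requires NCBI BLAST+ on PATH and the two FASTA files from the slide.
    with open("input_query.fasta") as handle:
        n_queries = sum(1 for _ in read_fasta(handle))
    print(f"{n_queries} query sequence(s)")
    subprocess.run(makeblastdb_args("yeast.fasta", "yeast"), check=True)
    subprocess.run(blastn_args("yeast", "input_query.fasta", "output.txt", 1),
                   check=True)
```

Passing the commands as argument lists rather than shell strings avoids quoting problems when file names contain spaces.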

Common Blast Databases
National Center for Biotechnology Information (NCBI)
● NCBI NT: Collection of taxonomically diverse, non-redundant and richly annotated sequences.
● NCBI PATNT: Patent database from USPTO or from EU/Japan Patent Agencies via EMBL/DDBJ
● High Throughput Genomic Sequence (HTGS)
Flybase

Galaxy
A popular Web-based platform for data-intensive biomedical research
NCGAS (National Center for Genome Analysis Support) hosts an instance of the Galaxy portal
● IU Mason Cluster (8TB-memory)
● Access to IU DC2 (3.5PB)
● Genome assembly
● Large-scale phylogenetic software
● Blast

Why BLAST on OSG?
● BLAST is CPU intensive (not memory intensive)
● IU/Mason is not an optimal resource for running BLAST
● Growth in data volume will squeeze available resource capacity at NCGAS in the coming years.
● OSG’s opportunistic resources could be used as an alternative to Mason and can provide the necessary capacity.

osg-blast (v2)
● Written in nodejs, using the node-osg & node-htcondor modules
● Can be installed on any OSG submit host via “npm install osg-blast”
● Hosted databases distributed via OASIS (CVMFS)
● Needs to be highly reliable and autonomous
o Handle unexpected issues well
o Figure out the best configuration by itself
o Report site-specific issues to OSG Operations (and recover)
o Clean up after itself (removing temp files, canceling jobs)

osg-blast (v2)
Splits both the input queries and the databases, and runs all jobs in parallel. Results are merged to create a single output sorted by e-value.
Test Stage
● Determine the best input block size
● Detect issues with user input / OSG environment
Main Stage
● Submit all jobs using information gathered during the test stage
● Use -dbsize to correct e-values
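The split, run and merge pattern described on this slide can be sketched in Python. This is an illustration under stated assumptions, not the osg-blast implementation (which is written in nodejs). All function names are hypothetical. The merge step assumes tabular output (`-outfmt 6`), whose eleventh column is the e-value. Because each job searches only one slice of the database, `-dbsize` is passed the length of the full database so that e-values from different jobs remain comparable.

```python
def split_into_blocks(records, block_size):
    """Group query records into blocks of at most block_size items."""
    block = []
    for record in records:
        block.append(record)
        if len(block) == block_size:
            yield block
            block = []
    if block:
        yield block


def job_grid(n_query_blocks, n_db_parts):
    """One job per (query block, database part) pair; all can run in parallel."""
    return [(q, d) for q in range(n_query_blocks) for d in range(n_db_parts)]


def blastn_part_args(db_part, query_block, out_path, full_db_length):
    """blastn arguments for one job; -dbsize carries the full database length."""
    return ["blastn", "-db", db_part, "-query", query_block, "-out", out_path,
            "-outfmt", "6", "-dbsize", str(full_db_length)]


def merge_tabular(outputs):
    """Merge tabular (-outfmt 6) outputs into one list sorted by e-value."""
    rows = []
    for text in outputs:
        for line in text.splitlines():
            if not line.strip() or line.startswith("#"):
                continue
            evalue = float(line.split("\t")[10])  # column 11 is the e-value
            rows.append((evalue, line))
    rows.sort(key=lambda row: row[0])  # stable sort: ties keep input order
    return [line for _evalue, line in rows]


if __name__ == "__main__":
    blocks = list(split_into_blocks(["q1", "q2", "q3", "q4", "q5"], 2))
    print(len(blocks), "query blocks;", len(job_grid(len(blocks), 3)), "jobs")
```

The block size is the quantity the test stage is said to determine. Here it is simply a parameter.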

Conclusions
● We will need more computing resources to run BLAST in the coming years, and OSG’s opportunistic environment can meet that need.
● Galaxy allows the bioinformatics community to use an existing UI to submit BLAST jobs.
● BLAST works well in an HTC environment, and it seems to scale as expected using OSG’s opportunistic resources.
Challenges / Future Goals
● The osg-blast output merger needs to be implemented for other output formats.
● We might need to explore alternatives to CVMFS for hosting BLAST DBs.
● Software passed to the University of Notre Dame for friendly testing.

Acknowledgements
Bill Barnett, Tom Doak, Rich LeDuc (IU)
Ruth Pordes, Chander Sehgal (Fermilab)
Derek Weitzel (UNL)
Mats Rynge (Information Science USC)
Alain Deximo, Kyle Gross, Tom Lee, Vince Neal, Chris Pipes, Elizabeth Prout, Michel Tavares, and Scott Teige (OSG IU)
Contacts
Soichi | Rob Quick