A Study of GeneWise with the Drosophila Adh Region Asta Gindulyte CMSC 838 Presentation Authors: Yi Mo, Moira Regelson, and Mike Sievers Paracel Inc.,

Presentation transcript:

A Study of GeneWise with the Drosophila Adh Region
Asta Gindulyte
CMSC 838 Presentation
Authors: Yi Mo, Moira Regelson, and Mike Sievers
Paracel Inc., Pasadena, CA

CMSC 838T – Presentation
Motivation
- Genome annotation
  - Extraction of biologically relevant knowledge from raw genomic sequence data
- Need faster genome annotation methods
  - DNA sequences are very long (millions of nucleotides)
  - Current methods are computationally too expensive
- Approach/Solution
  - GeneMatcher2 hardware acceleration of GeneWise

Outline
- Motivation
  - Genome annotation
- GeneMatcher2
  - Design
  - ASIC hardware
- Comparison
  - GeneWise algorithm
  - HalfWise algorithm
  - Performance (time, precision)
- Observations
  - Performance improvement
  - Cost effectiveness

Approach
- Problem: make GeneWise run faster
  - "Embarrassingly parallel" algorithm
  - Computationally too expensive when run in parallel on PCs
- Paracel's solution: hardware acceleration
  - Don't change the algorithm
  - Produce an implementation on the GeneMatcher2 supercomputer that works as much like the original software as possible
  - 6LITE algorithm, now also in Wise2

GeneMatcher Architecture

ASIC Hardware
- ASIC: application-specific integrated circuit
  - Designed to speed up dynamic programming algorithms
    - (could be used for Smith-Waterman)
  - Each ASIC board has 3072 processors
  - System has up to 9 boards
  - Cost per board is around $40K

GeneWise Algorithm
- Perform a search of genomic DNA sequence data using a protein HMM
  - Build HMMs from protein families
  - Scan genome using HMM
    - Look for start codon
    - "GT" sequence signals possible 5' splice site
    - "AG" sequence signals possible 3' splice site
  - Dynamic programming used in the scanning process
    - Obtain probability of the most likely path in HMM generating the sequence
    - Obtain alignment by backtracking
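
The dynamic programming step named above (most likely path, then backtracking) is the Viterbi recurrence. The following is a minimal sketch on a toy two-state model, not the GeneWise model itself: the state names, probabilities, and the helper names `viterbi` and `to_log` are all assumptions chosen for illustration.

```python
import math

def to_log(probs):
    """Convert a dict of probabilities to log-probabilities."""
    return {k: math.log(v) for k, v in probs.items()}

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return (log-probability of the most likely path, the path) for obs."""
    # score[t][s]: best log-probability of any state path ending in s at position t
    score = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for t in range(1, len(obs)):
        score.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: score[t - 1][p] + log_trans[p][s])
            score[t][s] = score[t - 1][prev] + log_trans[prev][s] + log_emit[s][obs[t]]
            back[t][s] = prev
    # Backtrack from the best final state to recover the path (the "alignment").
    last = max(states, key=lambda s: score[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return score[-1][last], path

# Hypothetical toy model: GC-rich "coding" state vs. AT-rich "noncoding" state.
states = ("coding", "noncoding")
log_start = to_log({"coding": 0.5, "noncoding": 0.5})
log_trans = {
    "coding": to_log({"coding": 0.9, "noncoding": 0.1}),
    "noncoding": to_log({"coding": 0.1, "noncoding": 0.9}),
}
log_emit = {
    "coding": to_log({"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}),
    "noncoding": to_log({"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}),
}

best_logp, path = viterbi("GCGCATAT", states, log_start, log_trans, log_emit)
print(best_logp, path)
```

Working in log space avoids numerical underflow on long sequences. The table has one cell per (position, state) pair, and each cell depends only on the previous position.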

GeneWise model on GeneMatcher2

HalfWise Algorithm
- Reduce cost by running BLAST to select HMMs with possible hits
- Use these HMMs with GeneWise database search and sequence alignment algorithm
- May miss some genes due to BLAST misses
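
The filter-then-search idea, and the way a filter miss becomes a lost hit, can be sketched with toy stand-ins. This is an illustration only: `seed_filter` (exact k-mer match) stands in for BLAST and `full_score` (best ungapped match count) stands in for GeneWise; the sequences and family names are made up.

```python
def seed_filter(motif, genome, k=4):
    """Cheap pre-filter: does any exact k-mer of the motif occur in the genome?"""
    return any(motif[i:i + k] in genome for i in range(len(motif) - k + 1))

def full_score(motif, genome):
    """Expensive stand-in: best count of matching positions over all ungapped placements."""
    n = len(motif)
    return max(
        sum(a == b for a, b in zip(motif, genome[i:i + n]))
        for i in range(len(genome) - n + 1)
    )

def two_stage_search(models, genome):
    """Run the expensive search only on models that pass the cheap filter."""
    survivors = {name: m for name, m in models.items() if seed_filter(m, genome)}
    return {name: full_score(m, genome) for name, m in survivors.items()}

# Hypothetical data: famC matches the genome at 4 of 5 positions but shares no exact 4-mer.
genome = "ACGTTGCAAGGCTA"
models = {"famA": "TTGCA", "famB": "GGGGG", "famC": "AGCCT"}

hits = two_stage_search(models, genome)
missed_score = full_score(models["famC"], genome)
print(hits, missed_score)
```

Here only `famA` survives the filter, so `famC` is never scored even though the full search would have found a near match. That is the trade-off on the slide: less work, at the risk of missing genes the filter does not flag.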

Evaluation
- Test data set
  - A genomic DNA sequence contig of about 2.9 Mb from the Drosophila Adh region
  - Focus on finding all Pfam (Protein families database of alignments and HMMs) protein profile-HMMs that occur in the Adh genomic sequence

Evaluation: Speed

Evaluation: Score

Evaluation: Sensitivity and Specificity

Observations
- Performance improvement
  - The speedup is several orders of magnitude
    - Makes real target applications possible
  - Accuracy might be improved over the HalfWise algorithm
- Cost effectiveness
  - System used costs around $500K
  - $500K worth of Linux PCs (500 processors at $1K each) would run about 10 times slower
- Weaknesses
  - Cannot modify the algorithm
  - Not enough data to assess scalability
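
The cost comparison can be worked through explicitly. This is a rough sketch that assumes PC throughput scales linearly with the number of machines, which is plausible for an embarrassingly parallel workload but is not stated on the slide; the variable names are illustrative.

```python
# Figures taken from the slide.
system_cost = 500_000   # GeneMatcher2 system, in dollars
pc_cost = 1_000         # one Linux PC processor, in dollars
slowdown = 10           # equal-budget PC cluster runs about 10 times slower

# PCs that the same budget buys.
pcs_for_same_budget = system_cost // pc_cost

# Assumption: linear scaling, so matching the speed needs 10 times as many PCs.
pcs_to_match = pcs_for_same_budget * slowdown
cluster_cost_to_match = pcs_to_match * pc_cost

print(pcs_for_same_budget, pcs_to_match, cluster_cost_to_match)
```

Under that assumption, matching the accelerated system would take roughly 5,000 PCs, or about $5M, which is the sense in which the slide calls the hardware cost effective.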