Interpolated Markov Models for Gene Finding BMI/CS 776 Mark Craven February 2002.

Slides:



Advertisements
Similar presentations
Quick Lesson on dN/dS Neutral Selection Codon Degeneracy Synonymous vs. Non-synonymous dN/dS ratios Why Selection? The Problem.
Advertisements

An Introduction to Bioinformatics Finding genes in prokaryotes.
Ulf Schmitz, Statistical methods for aiding alignment1 Bioinformatics Statistical methods for pattern searching Ulf Schmitz
MNW2 course Introduction to Bioinformatics
1 Profile Hidden Markov Models For Protein Structure Prediction Colin Cherry
1 DNA Analysis Amir Golnabi ENGS 112 Spring 2008.
Gene Finding BCH364C/391L Systems Biology / Bioinformatics – Spring 2015 Edward Marcotte, Univ of Texas at Austin Edward Marcotte/Univ. of Texas/BCH364C-391L/Spring.
Predicting Genes in Mycobacteriophages December 8, In Silico Workshop Training D. Jacobs-Sera.
McPromoter – an ancient tool to predict transcription start sites
Bioinformatics Finding signals and motifs in DNA and proteins Expectation Maximization Algorithm MEME The Gibbs sampler Lecture 10.
Gibbs sampling for motif finding in biological sequences Christopher Sheldahl.
Finding Eukaryotic Open reading frames.
CISC667, F05, Lec18, Liao1 CISC 467/667 Intro to Bioinformatics (Fall 2005) Gene Prediction and Regulation.
Master’s course Bioinformatics Data Analysis and Tools Lecture 12: (Hidden) Markov models Centre for Integrative Bioinformatics.
Sequence Similarity Searching Class 4 March 2010.
Methods of identification and localization of the DNA coding sequences Jacek Leluk Interdisciplinary Centre for Mathematical and Computational Modelling,
Hidden Markov Models I Biology 162 Computational Genetics Todd Vision 14 Sep 2004.
Non-coding RNA William Liu CS374: Algorithms in Biology November 23, 2004.
Transcription factor binding motifs (part I) 10/17/07.
Master’s course Bioinformatics Data Analysis and Tools
Computational Biology, Part 4 Protein Coding Regions Robert F. Murphy Copyright  All rights reserved.
Markov models and applications Sushmita Roy BMI/CS 576 Oct 7 th, 2014.
Lecture 12 Splicing and gene prediction in eukaryotes
Detecting the Domain Structure of Proteins from Sequence Information Niranjan Nagarajan and Golan Yona Department of Computer Science Cornell University.
Journal club 06/27/08. Phylogenetic footprinting A technique used to identify TFBS within a non- coding region of DNA of interest by comparing it to the.
Wellcome Trust Workshop Working with Pathogen Genomes Module 3 Sequence and Protein Analysis (Using web-based tools)
Gene Finding BIO337 Systems Biology / Bioinformatics – Spring 2014 Edward Marcotte, Univ of Texas at Austin Edward Marcotte/Univ. of Texas/BIO337/Spring.
Markov Chain Models BMI/CS 576 Fall 2010.
Gene finding with GeneMark.HMM (Lukashin & Borodovsky, 1997 ) CS 466 Saurabh Sinha.
MNW2 course Introduction to Bioinformatics Lecture 22: Markov models Centre for Integrative Bioinformatics FEW/FALW
Gene Prediction Chengwei Luo, Amanda McCook, Nadeem Bulsara, Phillip Lee, Neha Gupta, and Divya Anjan Kumar.
More on translation. How DNA codes proteins The primary structure of each protein (the sequence of amino acids in the polypeptide chains that make up.
BME 110L / BIOL 181L Computational Biology Tools October 29: Quickly that demo: how to align a protein family (10/27)
Eric C. Rouchka, University of Louisville Sequence Database Searching Eric Rouchka, D.Sc. Bioinformatics Journal Club October.
BME 110L / BIOL 181L Computational Biology Tools February 19: In-class exercise: a phylogenetic tree for that.
Motif finding with Gibbs sampling CS 466 Saurabh Sinha.
Hidden Markov Models BMI/CS 776 Mark Craven March 2002.
DNA sequencing. Dideoxy analogs of normal nucleotide triphosphates (ddNTP) cause premature termination of a growing chain of nucleotides. ACAGTCGATTG ACAddG.
Sequencing a genome and Basic Sequence Alignment
Simple cluster structure of triplet distributions in genetic texts Andrei Zinovyev Institute des Hautes Etudes Scientifique, Bures-sur-Yvette.
Introduction to Bioinformatics Biostatistics & Medical Informatics 576 Computer Sciences 576 Fall 2008 Colin Dewey Dept. of Biostatistics & Medical Informatics.
Construction of Substitution Matrices
10/29/20151 Gene Finding Project (Cont.) Charles Yan.
Pattern Matching Rhys Price Jones Anne R. Haake. What is pattern matching? Pattern matching is the procedure of scanning a nucleic acid or protein sequence.
AdvancedBioinformatics Biostatistics & Medical Informatics 776 Computer Sciences 776 Spring 2002 Mark Craven Dept. of Biostatistics & Medical Informatics.
CZ5226: Advanced Bioinformatics Lecture 6: HHM Method for generating motifs Prof. Chen Yu Zong Tel:
Genes and Genomes. Genome On Line Database (GOLD) 243 Published complete genomes 536 Prokaryotic ongoing genomes 434 Eukaryotic ongoing genomes December.
ORF Calling. Why? Need to know protein sequence Protein sequence is usually what does the work Functional studies Crystallography Proteomics Similarity.
Learning Sequence Motifs Using Expectation Maximization (EM) and Gibbs Sampling BMI/CS 776 Mark Craven
Markov Chain Models BMI/CS 576 Colin Dewey Fall 2015.
Applications of HMMs in Computational Biology BMI/CS 576 Colin Dewey Fall 2010.
Gene Family Size Distributions Brought to You By Your Neighorhood Durand Lab Narayanan Raghupathy Nan Song Rose Hoberman.
(H)MMs in gene prediction and similarity searches.
Finding genes in the genome
1 Mona Singh What is computational biology?. 2 Mona Singh Genome The entire hereditary information content of an organism.
BIOINFORMATICS Ayesha M. Khan Spring 2013 Lec-8.
A knowledge-based approach to integrated genome annotation Michael Brent Washington University.
Substitution Matrices and Alignment Statistics BMI/CS 776 Mark Craven February 2002.
Gene prediction in metagenomic fragments: A large scale machine learning approach Katharina J Hoff, Maike Tech, Thomas Lingner, Rolf Daniel, Burkhard Morgenstern.
1 Gene Finding. 2 “The Central Dogma” TranscriptionTranslation RNA Protein.
ORF Calling.
bacteria and eukaryotes
Markov Chain Models BMI/CS 776
Interpolated Markov Models for Gene Finding
Eukaryotic Gene Finding
More on translation.
Predicting Genes in Actinobacteriophages
Microbial gene identification using interpolated Markov models
Basic Local Alignment Search Tool
Presentation transcript:

Interpolated Markov Models for Gene Finding BMI/CS Mark Craven February 2002

Announcements HW #1 out; due March 11 class accounts ready –quasar-1.biostat.wisc.edu, quasar-2.biostat.wisc.edu class mailing list ready –please check mail regularly and frequently, or forward it to wherever you can do this most easily reading for next week –Bailey & Elkan, The Value of Prior Knowledge in Discovering Motifs with MEME (on-line) –Lawrence et al., Detecting Subtle Sequence Signals: A Gibbs Sampling Strategy for Multiple Alignment (handed out in class) talk tomorrow Bioinformatics Tools to Study Sequence Evolution: Examples from HIV Keith Crandall, Dept. of Zoology, BYU 10am, Thursday 2/28 Biotech Center Auditorium (425 Henry Mall)

Approaches to Finding Genes search by sequence similarity: find genes by looking for matches to sequences that are known to be related to genes search by signal: find genes by identifying the sequence signals involved in gene expression search by content: find genes by statistical properties that distinguish protein-coding DNA from non-coding DNA combined: state-of-the-art systems for gene finding combine these strategies

Gene Finding: Search by Content encoding a protein affects the statistical properties of a DNA sequence –some amino acids are used more frequently than others (Leu more popular than Trp) –different numbers of codons for different amino acids (Leu has 6, Trp has 1) –for a given amino acid, usually one codon is used more frequently than others this is termed codon preference these preferences vary by species

Codon Preference in E. Coli AA codon / Gly GGG 1.89 Gly GGA 0.44 Gly GGU Gly GGC Glu GAG Glu GAA Asp GAU Asp GAC 43.26

Search by Content common way to search by content –build Markov models of coding & noncoding regions –apply models to ORFs or fixed-sized windows of sequence GeneMark [Borodovsky et al.] –popular system for identifying genes in bacterial genomes –uses 5 th order inhomogenous Markov chain models

Reading Frames

a given sequence may encode a protein in any of the six reading frames G C T A C G G A G C T T C G G A G C C G A T G C C T C G A A G C C T C G

Markov Models & Reading Frames consider modeling a given coding sequence for each “word” we evaluate, we’ll want to consider its position with respect to the reading frame we’re assuming G C T A C G G A G C T T C G G A G C G C T A C G reading frame G is in 3 rd codon position C T A C G G G is in 1 st position T A C G G A A is in 2 nd position

A Fifth Order Inhomogenous Markov Chain GCTAC AAAAA TTTTT CTACG CTACA CTACC CTACT start AAAAA TTTTT TACAG TACAA TACAC TACAT GCTAC AAAAA TTTTT CTACG CTACA CTACC CTACT position 1position 2position 3

Selecting the Order of a Markov Chain Model higher order models remember more “history” additional history can have predictive value example: –predict the next word in this sentence fragment “…finish __” (up, it, first, last, …?) –now predict it given more history “Nice guys finish __”

Selecting the Order of a Markov Chain Model but the number of parameters we need to estimate grows exponentially with the order –for modeling DNA we need parameters for an nth order model the higher the order, the less reliable we can expect our parameter estimates to be –estimating the parameters of a 2 nd order homogenous Markov chain from the complete genome of E. Coli, we’d see each word > 72,000 times on average –estimating the parameters of an 8 th order chain, we’d see each word ~ 5 times on average

Interpolated Markov Models the IMM idea: manage this trade-off by interpolating among models of various orders simple linear interpolation: where

Interpolated Markov Models we can make the weights depend on the history –for a given order, we may have significantly more data to estimate some words than others general linear interpolation

The GLIMMER System Salzberg et al., 1998 system for identifying genes in bacterial genomes uses 8 th order, inhomogeneous, interpolated Markov chain models

IMMs in GLIMMER how does GLIMMER determine the values? first, let’s express the IMM probability calculation recursively let be the number of times we see the history in our training set

IMMs in GLIMMER if we haven’t seen more than 400 times, then compare the counts for the following: nth order history + base(n-1)th order history + base use a statistical test ( ) to get a value d indicating our confidence that the distributions represented by the two sets of counts are different

IMMs in GLIMMER putting it all together where

GLIMMER Experiment 8 th order IMM vs. 5 th order Markov model trained on 1168 genes (ORFs really) tested on 1717 annotated (more or less known) genes

Accuracy Metrics true positives (TP) true negatives (TN) false positives (FP) false negatives (FN) positive negative positivenegative predicted actual class

GLIMMER Results TPFNFP GLIMMER 5 th Order