Ab initio gene prediction Genome 559, Winter 2011.


Review: comparing networks; node degree distributions; power law distribution; network motifs (over- and under-representation); randomizing networks while maintaining node degrees.

Ab initio gene prediction method
Define parameters of real genes (based on experimental evidence):
1) Splice donor sequence model
2) Splice acceptor sequence model
3) Intron and exon length distribution
4) Open reading frame requirement in coding exons
5) Requirement that introns maintain reading frame
6) Transcription start and stop models (difficult to predict, often omitted)
Use those parameters to obtain a best interpretation of genes for any region of the genome, from sequence alone.
ab initio = "from the beginning" (i.e. without experimental evidence)

Sites we might want to predict: splice donor site, splice acceptor site, translation start, translation stop. (Some predictors only deal with coding exons; the 5' and 3' ends are harder to predict.)

Open reading frames (random sequence). 61 of 64 codons are not stop codons (61/64 ≈ 0.953, assuming equal nucleotide frequencies). The probability of not having a stop codon in a particular reading frame along L codons of DNA is (61/64)^L, which decays geometrically (and rapidly) as L grows. There are 3 reading frames on each DNA strand.

Long open reading frames are rare in random sequence. Figure: geometric distribution of the distance (in codons) to the first stop codon in random sequence (p = 3/64).
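The geometric decay can be checked numerically. The following is a minimal sketch, assuming equal nucleotide frequencies so that the per-codon stop probability is 3/64; the function names are illustrative and not part of the lecture.

```python
# Assumption: equal nucleotide frequencies, so 3 of the 64 codons are stops.
P_STOP = 3 / 64


def prob_open(length_codons: int) -> float:
    """P(no stop codon in the first `length_codons` codons of one frame)."""
    return (1 - P_STOP) ** length_codons


def prob_first_stop_at(k: int) -> float:
    """Geometric distribution: P(the first stop codon is codon number k), k >= 1."""
    return (1 - P_STOP) ** (k - 1) * P_STOP


if __name__ == "__main__":
    for n in (10, 50, 100, 300):
        print(f"{n:4d} codons with no stop: {prob_open(n):.6f}")
```

Under these assumptions the mean distance to the first stop codon is 1/p = 64/3, about 21 codons, which is why an open reading frame of a few hundred codons is strong evidence for a coding region.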

Splice donor and acceptor information. Figure: per-position information across the exon/intron boundary for the C. elegans splice donor (sums to ~8 bits) and the C. elegans splice acceptor (sums to ~9 bits). Note: these show a log-odds measure of information content compared to background nucleotide frequencies, similar to BLOSUM matrix log-odds.
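One common way to compute such a per-position information measure is the relative entropy of the observed nucleotide frequencies against the background. The sketch below assumes a uniform background and uses made-up frequencies; it is not the C. elegans data from the slide, and the function names are illustrative.

```python
import math

# Assumption: uniform background nucleotide frequencies.
UNIFORM = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def position_information(freqs, background=UNIFORM):
    """Relative entropy (bits) of one column versus the background."""
    return sum(
        p * math.log2(p / background[base])
        for base, p in freqs.items()
        if p > 0  # a zero-frequency base contributes nothing
    )


def site_information(columns, background=UNIFORM):
    """Total information of a site: the sum over its positions."""
    return sum(position_information(col, background) for col in columns)


# Hypothetical 4-position site: one uninformative position, an invariant G,
# an invariant T, and a position split evenly between A and G.
toy_site = [
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0},
    {"A": 0.0, "C": 0.0, "G": 0.0, "T": 1.0},
    {"A": 0.5, "C": 0.0, "G": 0.5, "T": 0.0},
]

if __name__ == "__main__":
    print(site_information(toy_site))  # 0 + 2 + 2 + 1 = 5.0 bits
```

An invariant position contributes the maximum of 2 bits against a uniform background, so a total of 8 to 9 bits corresponds to roughly four or five fully constrained positions' worth of signal.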

Position Specific Score Matrix (PSSM). Figure: a splice donor PSSM with one row per nucleotide (A, C, G, T). Slide the PSSM along the DNA, computing a score at every position. (This is a conceptual example; the real thing would be computed as log-odds values, similar to BLOSUM matrices.)
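Sliding a PSSM along a sequence can be sketched in a few lines. This example builds a log-odds matrix from a handful of hypothetical aligned donor-like sites (with a pseudocount and a uniform background, both assumptions) and scores every window of a toy DNA string; none of the sequences or names come from the lecture.

```python
import math

BASES = "ACGT"


def log_odds_pssm(sites, background=None, pseudocount=1.0):
    """Build a log2-odds PSSM (list of per-position dicts) from aligned sites."""
    if background is None:
        background = {b: 0.25 for b in BASES}  # assumption: uniform background
    total = len(sites) + pseudocount * len(BASES)
    pssm = []
    for i in range(len(sites[0])):
        column = [site[i] for site in sites]
        pssm.append({
            b: math.log2(((column.count(b) + pseudocount) / total) / background[b])
            for b in BASES
        })
    return pssm


def scan(pssm, dna):
    """Score every window of the DNA; scores[i] is the window starting at i."""
    width = len(pssm)
    return [
        sum(pssm[i][dna[start + i]] for i in range(width))
        for start in range(len(dna) - width + 1)
    ]


# Hypothetical aligned sites and a toy sequence containing one good match.
toy_sites = ["AGGTAAGT", "AGGTGAGT", "CGGTAAGT", "AGGTAAGA", "TGGTAAGT"]
pssm = log_odds_pssm(toy_sites)
dna = "CCTTCAGGTAAGTTTCCA"
scores = scan(pssm, dna)
best = max(range(len(scores)), key=scores.__getitem__)

if __name__ == "__main__":
    print(best, dna[best:best + len(pssm)], round(scores[best], 2))
```

In a gene finder, every window scoring above some threshold would be kept as a candidate site rather than only the single best one.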

Intron length distribution (C. elegans). Note: in Drosophila melanogaster and Homo sapiens (and most other species), introns are longer and their length distributions are broader.

Other information that can be used: Splice donor and acceptor must be paired, and the donor must be upstream of the acceptor (duh). Introns in coding regions must maintain the reading frame of the flanking exons. Nucleotide content analysis (e.g. introns tend to be AT rich).

Simple conceptual example. Sites are scored on the basis of PSSM matches to a known splice donor model (schematized in the slide figure). Arrow length reflects quality of match (worse matches not shown). (Plus strand only.)

(example cont.) Add splice acceptor information. Where would you infer introns? (The figure shows one probable interpretation.)

(example cont.) There is a stop codon before the highest-scoring splice donor! Reinterpreted: the stop codon is avoided by using a lower-scoring splice donor.

Real example (end result). Note that this gene has no mRNA sequences (EST and ORFeome tracks empty). This is a pure ab initio prediction.

Hidden Markov Model (HMM). Markov chain: a linear series of states in which each state is dependent only on the previous state. HMM: a model that uses a Markov chain to infer the most likely states in data with unknown states ("hidden" states). A Markov chain has states and transition probabilities, e.g. two states A and B with transition probabilities p_AB (A -> B) and p_BA (B -> A). Implicitly, the probability of staying in state A is 1 - p_AB and the probability of staying in state B is 1 - p_BA.

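Figure: a two-state Markov chain with states A and B and transitions A -> B and B -> A. What will the series of states look like (roughly) for this Markov chain? It will have long stretches of A states, interspersed with short stretches of B states. That pattern arises when the A -> B transition is unlikely and the B -> A transition is likely.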
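A short simulation makes the "long stretches of A, short stretches of B" behaviour concrete. This is a minimal sketch; the transition probabilities used here (0.05 for A -> B and 0.5 for B -> A) are assumed values chosen for illustration, not the values from the slide figure.

```python
import random


def simulate_chain(p_ab, p_ba, n_steps, start="A", seed=0):
    """Simulate a two-state Markov chain and return the path as a string."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    state = start
    path = [state]
    for _ in range(n_steps - 1):
        leave = p_ab if state == "A" else p_ba
        if rng.random() < leave:
            state = "B" if state == "A" else "A"
        path.append(state)
    return "".join(path)


# Assumed illustrative values: leaving A is rare, leaving B is common.
path = simulate_chain(p_ab=0.05, p_ba=0.5, n_steps=10_000)

if __name__ == "__main__":
    print(path[:80])
    print("fraction of A states:", path.count("A") / len(path))
```

With these values the expected run length is 1/p_AB = 20 for A and 1/p_BA = 2 for B, and the long-run fraction of A states is p_BA / (p_AB + p_BA), about 0.91.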

Hidden Markov Model. We have a Markov chain with appropriate states and known transition probabilities (e.g. inferred from experimentally known genes). We have a DNA sequence with unknown states. Find the series of Markov chain states with the maximum likelihood for the DNA sequence. This is solved with the Viterbi algorithm (we won't cover this, but it is another dynamic programming algorithm). See
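Although the lecture does not cover the Viterbi algorithm, a compact sketch may help show what "find the most likely series of states" means in practice. The two-state model below (a GC-rich state "E" and an AT-rich state "I") and all of its probabilities are hypothetical, chosen only to make the example readable; a real gene-finding HMM has many more states.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely state path for `obs` (log-space dynamic programming)."""
    # best[t][s]: log-probability of the best path that ends in state s at position t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((r, best[t - 1][r] + math.log(trans_p[r][s])) for r in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + math.log(emit_p[s][obs[t]])
            back[t][s] = prev
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


# Hypothetical toy parameters (not from the lecture).
STATES = ("E", "I")
START = {"E": 0.5, "I": 0.5}
TRANS = {"E": {"E": 0.9, "I": 0.1}, "I": {"E": 0.1, "I": 0.9}}
EMIT = {
    "E": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
    "I": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
}

if __name__ == "__main__":
    dna = "GCGCGCGCGCATATATATAT"
    print(dna)
    print("".join(viterbi(dna, STATES, START, TRANS, EMIT)))
```

Working in log space avoids numerical underflow on long sequences, since the raw path probabilities shrink multiplicatively with every base.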

Gene prediction HMM states (taken from a Stormo lab paper). Figure labels: coding exon states (three frames); intron states (three frames, according to the codon position they insert into); special first (init) and last (term) coding exon states; splice donor and splice acceptor transitions.

A way to connect the HMM formalism to specifics. Note: these probabilities are qualitative and are intended only to portray the local trends.

Long open reading frames favor exon state

Intron positions and reading frame. The intron can be any length and still produce the same exons. This particular splice is between two codons (0-shifting). The splice position can move and maintain coding frame as long as both positions move coordinately. If only one splice endpoint moves, it may change the reading frame. Figure: exon/intron diagram with the example translation M I L E S D A V I S.
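The frame arithmetic can be illustrated with a small sketch. The exon sequences below encode M I L E S and D A V I S under the standard genetic code; the intron sequence and the function names are hypothetical and chosen only for illustration.

```python
def splice(dna, intron_start, intron_end):
    """Remove the intron dna[intron_start:intron_end] (0-based, end-exclusive)."""
    return dna[:intron_start] + dna[intron_end:]


def intron_phase(exon_bases_before_intron):
    """0 = intron falls between codons; 1 or 2 = intron splits a codon."""
    return exon_bases_before_intron % 3


def frame_preserved(donor_shift, acceptor_shift):
    """Downstream frame is kept when the net change in exonic length is a multiple of 3."""
    return (donor_shift - acceptor_shift) % 3 == 0


EXON1 = "ATGATTCTGGAAAGC"   # M I L E S
INTRON = "GTAAGTTTTTCAG"    # hypothetical intron (starts GT, ends AG)
EXON2 = "GATGCGGTGATTAGC"   # D A V I S
GENE = EXON1 + INTRON + EXON2

donor = len(EXON1)               # first intron base
acceptor = donor + len(INTRON)   # first base of the second exon

if __name__ == "__main__":
    print(splice(GENE, donor, acceptor) == EXON1 + EXON2)   # original splice
    print(intron_phase(donor))                              # 0: between two codons
    print(len(splice(GENE, donor + 1, acceptor + 1)) % 3)   # both moved: still in frame
    print(len(splice(GENE, donor + 1, acceptor)) % 3)       # donor only: frame shifts
```

Moving both endpoints by the same amount changes which bases sit at the junction but leaves the total exonic length, and therefore the downstream frame, unchanged.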

DNA dot matrix comparison of two ab initio gene predictions in related genomes. Figure: Gene A (ab initio model) versus Gene B (ab initio model), with a good exon and a dubious exon marked. Other possible corrections?

After correction of exons 1 and 2