Ab initio gene prediction Genome 559, Winter 2011.



Review
Comparing networks
Node degree distributions
Power law distribution
Network motifs: over- and under-representation
Randomizing networks while maintaining node degrees

Ab initio gene prediction method
ab initio = "from the beginning" (i.e. without experimental evidence for the gene being predicted)
Define parameters of real genes (based on experimental evidence):
1) Splice donor sequence model
2) Splice acceptor sequence model
3) Intron and exon length distribution
4) Open reading frame requirement in coding exons
5) Requirement that introns maintain reading frame
6) Transcription start and stop models (difficult to predict, often omitted)
Use those parameters to obtain a best interpretation of the genes in any region, from genome sequence alone.

Sites we might want to predict
Splice donor site
Splice acceptor site
Translation start
Translation stop
(Some predictors only deal with coding exons; the 5' and 3' ends are harder to predict.)

Open reading frames (random sequence)
61 of 64 codons are not stop codons (61/64 ≈ 0.953, assuming equal nucleotide frequencies).
The probability of finding no stop codon in a particular reading frame across n consecutive codons is (61/64)^n, which decays rapidly as n grows; the distance to the first stop codon follows a geometric distribution.
There are 3 reading frames on each DNA strand.
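The decay described above can be checked numerically. This is a minimal sketch, assuming equal nucleotide frequencies as on the slide; the function names `prob_open` and `prob_first_stop_at` are illustrative and not part of the course material.

```python
# Chance of an open reading frame in random DNA, one reading frame only.
# Assumption: equal nucleotide frequencies, so 3 of the 64 codons are stops.

P_STOP = 3 / 64


def prob_open(n_codons: int) -> float:
    """Probability that n_codons consecutive random codons contain no stop."""
    return (1 - P_STOP) ** n_codons


def prob_first_stop_at(k: int) -> float:
    """Geometric distribution: probability that the first stop is codon k (k >= 1)."""
    return (1 - P_STOP) ** (k - 1) * P_STOP


if __name__ == "__main__":
    for n in (10, 50, 100, 300):
        print(f"{n:4d} codons: P(no stop) = {prob_open(n):.6f}")
```

Under this model the expected distance to the first stop codon is 1/p = 64/3, or about 21 codons, and the chance of 100 stop-free codons in a given frame is already below 1%.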

[Figure: geometric distribution of the distance (in codons) to the first stop codon in random sequence (p = 3/64). Long open reading frames are rare in random sequence.]

Splice donor and acceptor information
[Figure: information content per position across the exon/intron boundary. Donor, C. elegans (sums to ~8 bits); acceptor, C. elegans (sums to ~9 bits).]
Note: these show a log-odds measure of information content compared to background nucleotide frequencies, similar to BLOSUM matrix log-odds.
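One common way to express information content relative to background is the relative entropy of each position, summed over the site. The sketch below assumes that definition and a uniform background; the frequencies are made up for illustration and are not the C. elegans values from the slide.

```python
import math

# Assumption: uniform background nucleotide frequencies.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def position_information(freqs: dict, background: dict = BACKGROUND) -> float:
    """Relative entropy (in bits) of one site position versus background."""
    return sum(
        p * math.log2(p / background[base])
        for base, p in freqs.items()
        if p > 0  # a base that never occurs contributes nothing
    )


def site_information(columns: list) -> float:
    """Total information of a site: the sum over its positions."""
    return sum(position_information(col) for col in columns)


if __name__ == "__main__":
    invariant_g = {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0}  # hypothetical column
    no_preference = dict(BACKGROUND)                        # hypothetical column
    print(position_information(invariant_g))    # an invariant base carries 2 bits
    print(position_information(no_preference))  # background-like column carries 0 bits
```

With a uniform background, a fully conserved position contributes 2 bits, so a site summing to about 8 bits carries roughly the information of four invariant bases.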

Position Specific Score Matrix (PSSM)
[Figure: splice donor PSSM, with rows A, C, G, T.]
Slide the PSSM along the DNA, computing a score at every position.
(This is a conceptual example; the real thing would be computed as log-odds values, similar to BLOSUM matrices.)
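Sliding a PSSM along a sequence can be sketched in a few lines. The motif frequencies below are hypothetical (they are not a real splice donor model), the background is assumed uniform, and `build_pssm` and `scan` are illustrative names.

```python
import math

# Assumption: uniform background nucleotide frequencies.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

# Hypothetical per-position base frequencies for a 4-base motif (not real data).
MOTIF_FREQS = [
    {"A": 0.10, "C": 0.10, "G": 0.70, "T": 0.10},
    {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05},
    {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
    {"A": 0.60, "C": 0.10, "G": 0.20, "T": 0.10},
]


def build_pssm(freqs, background):
    """Convert frequencies into log-odds scores (log2 of observed / background)."""
    return [{b: math.log2(col[b] / background[b]) for b in "ACGT"} for col in freqs]


def scan(seq, pssm):
    """Score every window of the sequence; returns (start, score) pairs."""
    width = len(pssm)
    return [
        (i, sum(pssm[j][seq[i + j]] for j in range(width)))
        for i in range(len(seq) - width + 1)
    ]


if __name__ == "__main__":
    pssm = build_pssm(MOTIF_FREQS, BACKGROUND)
    for start, score in scan("ACGGTAACGT", pssm):
        print(start, round(score, 2))
```

Positive scores mark windows that look more like the motif than like background; a predictor would keep only the positions above some threshold as candidate sites.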

Intron length distribution (C. elegans)
Note: introns in Drosophila melanogaster and Homo sapiens (and most other species) are longer, and their length distributions are broader.

Other information that can be used
Splice donor and acceptor must be paired, and the donor must be upstream of the acceptor (duh).
Introns in coding regions must maintain the reading frame of the flanking exons.
Nucleotide content analysis (e.g. introns tend to be AT rich).

Simple conceptual example
Sites are scored on the basis of PSSM matches to a known splice donor model (schematized below; plus strand only).
[Figure: candidate splice donor sites drawn as arrows. Arrow length reflects quality of match (worse matches not shown).]

(example cont.)
Add splice acceptor information. Where would you infer introns?
[Figure: one probable interpretation.]

(example cont.)
Stop codon before the highest scoring splice donor!
[Figure: reinterpreted gene model, which avoids the stop codon by using a lower scoring splice donor.]

Real example (end result)
[Figure: predicted gene model, with labels 1, 2, 3, 4.]
Note that this gene has no mRNA sequences (EST and ORFeome tracks empty). This is a pure ab initio prediction.

Hidden Markov Model (HMM)
Markov chain: a linear series of states in which each state depends only on the previous state.
HMM: a model that uses a Markov chain to infer the most likely states in data with unknown states ("hidden" states).
A Markov chain has states and transition probabilities:
[Figure: states A and B, with transition probability p_AB from A to B and p_BA from B to A.]
(Implicitly, the probability of staying in state A is 1 - p_AB and the probability of staying in state B is 1 - p_BA.)

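[Figure: Markov chain with states A and B and transition probabilities for A -> B and B -> A.]
What will the series of states look like (roughly) for this Markov chain?
It will have long stretches of A states, interspersed with short stretches of B states. (This is the pattern produced when the A -> B probability is much smaller than the B -> A probability.)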
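A short simulation makes the answer concrete. The transition probabilities below are assumptions chosen for illustration (the slide's actual values are not in the transcript), and `simulate_chain` and `mean_run_length` are illustrative names.

```python
import random


def simulate_chain(p_ab: float, p_ba: float, n_steps: int, seed: int = 0) -> str:
    """Simulate a two-state Markov chain starting in A; returns e.g. 'AAAB...'."""
    rng = random.Random(seed)
    state = "A"
    path = []
    for _ in range(n_steps):
        path.append(state)
        leave = p_ab if state == "A" else p_ba
        if rng.random() < leave:
            state = "B" if state == "A" else "A"
    return "".join(path)


def mean_run_length(path: str, state: str) -> float:
    """Average length of the uninterrupted stretches of the given state."""
    other = "B" if state == "A" else "A"
    runs = [len(run) for run in path.replace(other, " ").split()]
    return sum(runs) / len(runs) if runs else 0.0


if __name__ == "__main__":
    # Assumed values: leaving A is rare, leaving B is common.
    path = simulate_chain(p_ab=0.02, p_ba=0.25, n_steps=10_000)
    print(path[:80])
    print("mean A run:", mean_run_length(path, "A"))
    print("mean B run:", mean_run_length(path, "B"))
```

The time spent in a state before leaving is geometrically distributed, so the expected stretch length is 1/p_AB for A and 1/p_BA for B (50 and 4 steps for the assumed values).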

Hidden Markov Model
We have a Markov chain with appropriate states and known transition probabilities (e.g. inferred from experimentally known genes).
We have a DNA sequence with unknown states.
Find the series of Markov chain states with the maximum likelihood for the DNA sequence.
Solved with the Viterbi algorithm (we won't cover this, but it is another dynamic programming algorithm).
See

Gene Prediction HMM
[Figure: HMM states, taken from Stormo lab paper.]
Coding exon states (three frames)
Intron states (three frames of the codon they insert into)
Special first (init) and last (term) coding exon states
Transitions labeled as splice donor and splice acceptor

A way to connect the HMM formalism to specifics Note – these probabilities are qualitative and are intended only to portray the local trends.

Long open reading frames favor exon state
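The decoding step described earlier (finding the most likely series of states for a DNA sequence) can be sketched with the Viterbi algorithm on a toy model. Everything about this model is an assumption made for illustration: it has only two states, exon (E) and intron (I), with made-up transition and emission probabilities in which introns are AT rich. It is far simpler than the frame-aware gene prediction HMM on the slides.

```python
import math

# Toy model; all numbers are assumed, not estimated from real genes.
STATES = ("E", "I")
START = {"E": 0.5, "I": 0.5}
TRANS = {"E": {"E": 0.9, "I": 0.1}, "I": {"E": 0.1, "I": 0.9}}
EMIT = {
    "E": {"A": 0.20, "C": 0.30, "G": 0.30, "T": 0.20},
    "I": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},  # AT-rich intron
}


def viterbi(seq: str) -> str:
    """Most likely state path for a non-empty DNA sequence (log-space)."""
    # score[s] is the log-probability of the best path ending in state s.
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[t][s] is the best previous state for s at step t + 1
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda r: score[r] + math.log(TRANS[r][s]))
            pointers[s] = prev
            new_score[s] = (
                score[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][base])
            )
        back.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


if __name__ == "__main__":
    dna = "GC" * 5 + "AT" * 6 + "GC" * 5
    print(dna)
    print(viterbi(dna))
```

Working in log-space avoids numerical underflow on long sequences. The low probability of switching states is what keeps the path from flipping at every base, so a state change is only inferred when a long enough stretch of sequence supports it.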

Intron positions and reading frame
[Figure: an intron between two exons that encode the peptide M I L E S D A V I S.]
The intron can be any length and still produce the same exons.
This particular splice is between two codons (0-shifting).
The splice position can move and maintain the coding frame as long as both positions move coordinately.
If only one splice endpoint moves, it may change the reading frame.
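The frame rules above reduce to arithmetic on the length of the removed segment. In this sketch the sequences are hypothetical (one possible set of codons for the peptide in the figure, plus a made-up intron), coordinates are 0-based with the coding sequence assumed to start at index 0, and `splice` and `frame_shift` are illustrative names.

```python
def splice(genomic: str, donor: int, acceptor: int) -> str:
    """Remove the intron genomic[donor:acceptor] and join the flanking exons."""
    return genomic[:donor] + genomic[acceptor:]


def frame_shift(donor: int, acceptor: int, ref_donor: int, ref_acceptor: int) -> int:
    """Shift (0, 1 or 2) of the downstream frame relative to a reference splice.

    0 means the downstream exon is still read in the reference frame.
    """
    return ((acceptor - donor) - (ref_acceptor - ref_donor)) % 3


# Hypothetical sequences, chosen only for illustration.
EXON1 = "ATGATTCTGGAA"        # one possible coding of M I L E
INTRON = "GTAAGTTTTTCAG"      # made-up intron with GT...AG ends
EXON2 = "TCTGATGCTGTTATTTCT"  # one possible coding of S D A V I S

if __name__ == "__main__":
    genomic = EXON1 + INTRON + EXON2
    donor = len(EXON1)               # splice falls between two codons (phase 0)
    acceptor = donor + len(INTRON)

    print(splice(genomic, donor, acceptor) == EXON1 + EXON2)
    # Both endpoints move by one base: the frame is maintained.
    print(frame_shift(donor + 1, acceptor + 1, donor, acceptor))
    # Only the donor moves by one base: the downstream frame changes.
    print(frame_shift(donor + 1, acceptor, donor, acceptor))
```

Moving a single endpoint by a multiple of 3 also returns 0, which is why the slide says a one-sided move "may" change the reading frame rather than that it always does.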

DNA dot matrix comparison of two ab initio gene predictions in related genomes
[Figure: dot matrix with Gene A (ab initio model) and Gene B (ab initio model) on the axes; one exon is labeled "good exon" and another "dubious exon".]
Other possible corrections?

After correction of exons 1 and 2