GTCAGATGAGCAAAGTAGACACTCCAGTAACGCGGTGAGTACATTAA exon intron intergene Find Gene Structures in DNA Intergene State First Exon State Intron State.



Hidden Markov Model for Gene Finding
- Intron, Exon, and Intergenic states
- Exon frame is encoded in the architecture by defining more states
- Exon states have an explicit duration density
- Intron states have geometric duration
- Parameters are trained separately for different levels of GC content (GC content correlates with gene density and with the lengths of exons and introns)
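
The "geometric duration" of the intron states follows from modeling an intron as a single state with a self-loop. As a short sketch, assume the self-loop probability is written p (a symbol introduced here, not taken from the slides) and the state is left with probability 1 - p. The length d of a stay in that state then has the distribution and mean:

```latex
P(d) = p^{\,d-1}(1-p), \quad d = 1, 2, \dots
\qquad
\mathbb{E}[d] = \sum_{d \ge 1} d\, p^{\,d-1}(1-p) = \frac{1}{1-p}
```

Exon lengths do not fit this shape well, which is one reason to give exon states an explicit duration density instead.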

Comparison-based Methods

Cross-species gene finding
[Figure: gene model drawn 5’ to 3’ with Exon1, Intron1, Exon2, Intron2, Exon3, above a human/mouse alignment]
[human] GGTTTT--ATGAGTAAAGTAGACACTCCAGTAACGCGGTGAGTAC----ATTAA
[mouse] C-TCAGGAATGAGCAAAGTCGAC---CCAGTAACGCGGTAAGTACATTAACGA-

Comparison of 1196 orthologous genes (Makalowski et al., 1996)
Sequence identity between genes in human/mouse:
- exons: 84.6%
- protein: 85.4%
- introns: 35%
- 5’ UTRs: 67%
- 3’ UTRs: 69%
27 proteins were 100% identical.

Human-mouse homology
[Figure: human and mouse panels]

Not always: HoxA human-mouse

Twinscan
Twinscan is an augmented version of the Genscan HMM.
[Figure: exon (E) and intron (I) states with their transitions, durations, and emissions over the sequence ACUAUACAGACAUAUAUCAU]

Twinscan Algorithm
1. Align the two sequences (e.g. from human and mouse)
2. Mark each human base as gap ( - ), mismatch ( : ), or match ( | )
New “alphabet”: 4 x 3 = 12 letters
Σ = { A-, A:, A|, C-, C:, C|, G-, G:, G|, U-, U:, U| }
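
Step 2 can be sketched in a few lines of Python. This is a minimal illustration, not Twinscan's own code: the function name `twinscan_symbols` is hypothetical, the two inputs are assumed to be rows of an already computed alignment (equal length, `-` for gaps), and columns that contain no human base are simply skipped.

```python
def twinscan_symbols(human_aln: str, mouse_aln: str) -> list[str]:
    """Pair each human base with a conservation mark.

    '|' = match, ':' = mismatch, '-' = the human base is aligned to a gap.
    """
    if len(human_aln) != len(mouse_aln):
        raise ValueError("aligned sequences must have equal length")
    symbols = []
    for h, m in zip(human_aln, mouse_aln):
        if h == "-":
            continue  # no human base in this column, nothing to mark
        if m == "-":
            mark = "-"
        elif h == m:
            mark = "|"
        else:
            mark = ":"
        symbols.append(h + mark)
    return symbols


# A short illustrative pair of aligned sequences (no gaps in this case).
human = "ACGGCGACGUGCACGU"
mouse = "ACUGUGACGUGCACUU"
print(" ".join(twinscan_symbols(human, mouse)))
```

Each element of the result is one letter of the 12-letter alphabet, so the output can be fed directly to an HMM whose emissions are defined over that alphabet.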

Twinscan Algorithm
3. Run Viterbi using emissions e_k(b), where b ∈ { A-, A:, A|, …, U| }
Note: the emission distributions e_k(b) are estimated from real human/mouse genes
e_I(x|) < e_E(x|): matches are favored in exons
e_I(x-) > e_E(x-): gaps (and mismatches) are favored in introns
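
The effect of step 3 can be seen on a toy model. The sketch below runs Viterbi over a two-state HMM (E = exon, I = intron) whose emissions are defined on the 12-letter alphabet. Every number in it is a made-up assumption chosen only so that matches are likelier in E and gaps and mismatches are likelier in I; the real Twinscan model has many more states and trained parameters.

```python
import math

STATES = ("E", "I")

# Hypothetical parameters, for illustration only.
MARK_PROB = {
    "E": {"|": 0.80, ":": 0.15, "-": 0.05},  # exons: mostly matches
    "I": {"|": 0.40, ":": 0.30, "-": 0.30},  # introns: more gaps/mismatches
}
STAY = 0.9                      # probability of remaining in the same state
START = {"E": 0.5, "I": 0.5}


def log_emit(state: str, symbol: str) -> float:
    # Simplification: the base is uniform (1/4) and independent of the mark.
    return math.log(0.25 * MARK_PROB[state][symbol[1]])


def log_trans(prev: str, cur: str) -> float:
    return math.log(STAY if prev == cur else 1.0 - STAY)


def viterbi(symbols: list[str]) -> str:
    """Return the most probable state path as a string of 'E'/'I'."""
    if not symbols:
        return ""
    score = {s: math.log(START[s]) + log_emit(s, symbols[0]) for s in STATES}
    back = []  # back[t][cur] = best predecessor of cur at step t + 1
    for sym in symbols[1:]:
        new_score, pointers = {}, {}
        for cur in STATES:
            best_prev = max(STATES, key=lambda p: score[p] + log_trans(p, cur))
            pointers[cur] = best_prev
            new_score[cur] = (score[best_prev] + log_trans(best_prev, cur)
                              + log_emit(cur, sym))
        score = new_score
        back.append(pointers)
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


# Five conserved bases followed by ten bases aligned to gaps.
print(viterbi(["A|"] * 5 + ["A-"] * 10))
```

Working in log space avoids numerical underflow on long sequences, which matters for genomic input.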

Example
Human     : ACGGCGACGUGCACGU
Mouse     : ACUGUGACGUGCACUU
Alignment : ||:|:|||||||||:|
Input to Twinscan HMM: A| C| G: G| C: G| A| C| G| U| G| C| A| C| G: U|
Recall: e_E(A|) > e_I(A|) and e_E(A-) < e_I(A-)
Result: likely exon

HMMs for simultaneous alignment and gene finding: Generalized Pair HMMs

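A Pair HMM for alignments
[Figure: state diagram with states BEGIN, M, I, J, END. State M emits an aligned pair with probability P(x_i, y_j), state I emits x_i with probability P(x_i), and state J emits y_j with probability P(y_j). The transition labels did not survive extraction.]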

Generalized Pair HMMs

Exon GPHMM
1. Choose exon lengths (d, e).
2. Generate alignment of length d+e.

Cross-species gene finding
[Figure: human and mouse gene models drawn 5’ to 3’ with Exon1, Intron1, Exon2, Intron2, Exon3, and a conserved non-coding sequence (CNS)]

The SLAM hidden Markov model

Computational complexity
[Figure: complexity expression with factors labeled no. states, max duration, length seq1, length seq2]

Approximate alignment
Reduces the TU factor to hT

Measuring Performance

Example: HoxA2 and HoxA3
[Figure: annotation tracks for SLAM, SGP-2, Twinscan, Genscan, TBLASTX, SLAM CNS, VISTA, RefSeq]

Suffix Trees (a short break from biology)

Suffix Trees
Suffix trees are a method to find all maximal matches between two strings (and much more)
Example: x = dabdac
[Figure: suffix tree of dabdac]

Definition of a Suffix Tree
Definition: For string x = x_1 … x_m, a suffix tree is:
- A rooted tree with m leaves, where leaf i corresponds to the suffix x_i … x_m
- Each edge is labeled with a substring of x
- No two edges out of a node start with the same letter
It follows that every substring of x corresponds to an initial part of a path from the root to a leaf

Naïve Algorithm to Construct a Suffix Tree
1. Initialize tree T: a single root node r
2. Insert special symbol $ at end of x
3. For j = 1 to m:
   - Find longest match of x_j … x_m to T, starting from r
   - Split edge where match stops: new node w
   - Create edge (w, j), and label with unmatched portion of x_j … x_m
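
The steps above can be sketched in Python as follows. This is a quadratic-time illustration under a few assumptions: the class and function names are hypothetical, positions are 0-based rather than 1-based, the input must not already contain `$`, and each edge label is stored as a pair of indices into the text rather than as a copied string.

```python
class Node:
    def __init__(self, suffix_index=None):
        # first letter -> [start, end, child]; the edge label is text[start:end]
        self.children = {}
        self.suffix_index = suffix_index  # set only on leaves


def build_suffix_tree(x: str):
    """Naive construction: insert the suffixes of x + '$' one at a time."""
    text = x + "$"
    root = Node()
    for j in range(len(text)):
        node, pos = root, j
        while True:
            edge = node.children.get(text[pos])
            if edge is None:
                # No edge starts with this letter: hang a new leaf here.
                node.children[text[pos]] = [pos, len(text), Node(j)]
                break
            start, end, child = edge
            k = 0
            while start + k < end and text[start + k] == text[pos + k]:
                k += 1
            if start + k == end:
                # Whole edge matched: continue from the child node.
                node, pos = child, pos + k
                continue
            # Match stops inside the edge: split it at a new node w.
            w = Node()
            w.children[text[start + k]] = [start + k, end, child]
            w.children[text[pos + k]] = [pos + k, len(text), Node(j)]
            node.children[text[start]] = [start, start + k, w]
            break
    return root, text


def count_leaves(node: Node) -> int:
    if not node.children:
        return 1
    return sum(count_leaves(child) for _, _, child in node.children.values())


def contains(root: Node, text: str, pattern: str) -> bool:
    """True if pattern labels an initial part of some root-to-leaf path."""
    node, i = root, 0
    while i < len(pattern):
        edge = node.children.get(pattern[i])
        if edge is None:
            return False
        start, end, child = edge
        for k in range(start, end):
            if i == len(pattern):
                return True
            if text[k] != pattern[i]:
                return False
            i += 1
        node = child
    return True


root, text = build_suffix_tree("dabda")
print(count_leaves(root), contains(root, text, "abd"), contains(root, text, "dad"))
```

Because `$` occurs only at the end, no suffix is a prefix of another, so every insertion ends by creating exactly one new leaf.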

Example of Suffix Tree Construction
x = d a b d a $
1. Insert d a b d a $
2. Insert a b d a $
3. Insert b d a $
4. Insert d a $
5. Insert a $
6. Insert $
[Figure: the tree after each insertion, with leaves numbered 1 to 6]

Memory to Store Suffix Tree
Can store in O(N) memory!
Every edge is labeled with (i, j), where (i, j) denotes x_i … x_j
Tree has O(N) nodes
Proof:
1. Every internal node has at least two children, so # internal nodes ≤ # leaves – 1
2. # leaves = |x|

Faster Construction
Several algorithms build the tree in O(N) time and O(N) memory, with a big constant (~15 bytes/char)
Technical but not deep; outside the scope of this course
Optional: Gusfield, chapter 6

Application: find all matches between x, y
1. Build suffix tree for x, mark nodes with x
2. Insert y in suffix tree, mark all nodes y “passes from” with y
The path label of every node marked with both x and y is a common substring
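
The marking idea can be sketched in Python. To keep the example short it uses an uncompressed suffix trie (one node per character, quadratic space) rather than a true suffix tree; this is a simplification for illustration, and the function name `common_substrings` is hypothetical. The marking logic is the same: every node reached while inserting suffixes of x is tagged `x`, every node reached while inserting suffixes of y is tagged `y`, and nodes carrying both tags spell common substrings.

```python
def common_substrings(x: str, y: str) -> set[str]:
    """Return every non-empty substring shared by x and y."""
    root = {"marks": set(), "children": {}}

    # Insert all suffixes of both strings, tagging each node passed through.
    for tag, s in (("x", x), ("y", y)):
        for j in range(len(s)):
            node = root
            for ch in s[j:]:
                node = node["children"].setdefault(
                    ch, {"marks": set(), "children": {}})
                node["marks"].add(tag)

    # Walk the trie; a node marked by both strings spells a common substring.
    found = set()
    stack = [(root, "")]
    while stack:
        node, label = stack.pop()
        for ch, child in node["children"].items():
            if child["marks"] == {"x", "y"}:
                found.add(label + ch)
                stack.append((child, label + ch))
    return found


shared = common_substrings("dabda", "abada")
longest = max(len(s) for s in shared)
print(sorted(s for s in shared if len(s) == longest))
```

Descending only through doubly marked nodes is safe because any prefix of a common substring is itself a common substring.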

Example of Suffix Tree construction
x = d a b d a $
y = a b a d a $
1. Construct tree for x
2. Insert a b a d a $
3. Insert b a d a $
4. Insert a d a $
5. Insert d a $
6. Insert a $, then insert $
[Figure: the combined tree, with each node marked x, y, or both]

Application: common substrings of k strings
To find the longest common substring of s_1, s_2, …, s_n:
1. Build suffix tree for s_1, …, s_n
2. All nodes labeled {s_i1, …, s_ik} represent a match between s_i1, …, s_ik

Suffix Arrays
Suffix array of ABRACADABRA$ (start position, suffix):
11  $
10  A$
 7  ABRA$
 0  ABRACADABRA$
 3  ACADABRA$
 5  ADABRA$
 8  BRA$
 1  BRACADABRA$
 4  CADABRA$
 6  DABRA$
 9  RA$
 2  RACADABRA$
- Fast search for any specific string by binary search: O(log n) string comparisons
- Used for data compression such as bzip2
- Can be built in O(n) time by first building the suffix tree and then reading off the ordered suffixes with an in-order traversal, but this route:
  - takes too much memory: ~15n bytes
  - is difficult to implement
- Theoretical build in O(n log n) using O(n / sqrt(log n)) extra memory
- How to build it fast in practice is a hot topic
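
A suffix array and its binary search can be sketched in Python. This version builds the array by directly sorting the suffixes, which is simple but far slower than the constructions mentioned above (it is not the O(n) or O(n log n) method); the function names are hypothetical.

```python
def build_suffix_array(text: str) -> list[int]:
    """Start positions of all suffixes of text, in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, sa: list[int], pattern: str) -> list[int]:
    """All start positions of pattern in text, found by two binary searches."""
    m = len(pattern)

    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[first:lo])


text = "ABRACADABRA$"
sa = build_suffix_array(text)
print(sa)
print(find_occurrences(text, sa, "ABRA"))
```

All suffixes that start with the pattern sit in one contiguous block of the array, which is why two boundary searches are enough to report every occurrence.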