Presentation is loading. Please wait.

Presentation is loading. Please wait.

Hidden Markov Model Ed Anderson and Sasha Tkachev.

Similar presentations


Presentation on theme: "Hidden Markov Model Ed Anderson and Sasha Tkachev."— Presentation transcript:

1

2 Hidden Markov Model Ed Anderson and Sasha Tkachev

3 Who Was Markov? Graduate of Saint Petersburg University (1878), where he began a professor in 1886 Applied the method of continued fractions, pioneered by his teacher Pafnuty Chebyshev, to probability theory He proved the central limit theorem under fairly general assumptions Most remembered for his study of Markov chains, sequences of random variables in which the future variable is determined by the present variable but is independent of the way in which the present state arose from its predecessors. This work launched the theory of stochastic processes In 1923 Norbert Weiner became the first to treat rigorously a continuous Markov process. The foundation of a general theory was provided during the 1930s by Andrei Kolmogorov. Excerpted from: http://www-groups.dcs.st-and.ac.uk/~history/Mathematicians/Markov.html Andrei A Markov Born: 14 June 1856 in Ryazan, Russia Died: 20 July 1922 in Petrograd, Russia

4 What is the Hidden Markov Model? Clipped from http://www.nist.gov/dads/HTML/hiddenMarkovModel.html

5 What Makes HMM Useful? Efficiency: The algorithms are simple enough to be performant for real- time speech recognition. Speed is advantageous when dealing with large biological data sets Strong Theoretical Basis Probability distribution must sum to 1. Scores are not influenced by ad-hoc criteria. Scores may be compared across different experiments of varying size and complexity Well suited for analyzing noisy, time-phased or sequentially connected events.

6 What are HMM’s Limitations? Model building is not so easy “Since HMM training algorithms are local optimizers, it pays to build HMMs on pre-aligned data whenever possible… the parameter space may be complex with may spurious local optima than can trap a training algorithm.” 1 Distance between related states must be constant A disadvantage when analyzing distant and arbitrarily spaced items: Amino acids in folded proteins RNA base pairs 1 Eddy, S.R., Profile hidden Markov models, Bioinformatics Review, 1998, Vol. 14, no. 9 1998, pg. 757

7 A Concrete Example Example adapted from http://en.wikipedia.org/wiki/Viterbi_algorithm Can you guess the weather based on a person’s activity? Use the Forward algorithm to calculate the probabilities.

8 How to Avoid False Optima? Is it necessary to calculate every possible path? The Viterbi algorithm can help. Example from http://www.telecom.tuc.gr/~ntsourak/demo_viterbi.htm

9 HMM In Speech Recognition Handling a single word; evaluating each HMM according to the input, using the Viterbi Search Every senone gets a HMM: Adapted from Shir, O. M., Speech Recognition Seminar, 10/15/03 Leiden Institute of Advanced Computer Science UW ONE TWO THREE T AHWN RTHIY 5-state HMM

10 HMM In Speech Recognition Taken from Shir, O. M., Speech Recognition Seminar, 10/15/03 Leiden Institute of Advanced Computer Science time State with best path-score State with path-score < best State without a valid path-score P (t) j = max [P (t-1) a b (t)] iijj i Total path-score ending up at state j at time t State transition probability, i to j Score for state j, given the input at time t

11 HMM in Bioinformatics Sequence profiling Gene finding Protein secondary structure prediction Radiation hybrid mapping Genetic linkage mapping Phylogenetic analysis

12 HMM in Sequence Profiling Review – Lecture 7 Highlights Emission probabilities and transition probabilities

13 HMM in Sequence Profiling Log Odds scores are comparable across different length sequences Taken from lecture 7 slides, apparently from Krogh, “Computational Methods in molecular biology, pages 45-63, Elsevier, 1998.

14 Why HMM for Sequence Analysis? Position-specific scoring methods make intuitive sense. BLAST and FASTA use pair-wise alignment as opposed to profile scoring Profile methods have historically used ad hoc scoring systems. HMM gap penalties a grounded in probability theory. HMMs provide a coherent, probabilistic model. 2 (2) Eddy, Sean R., Profile hidden Markov models, Bioinformatics Review, Vol. 14 no. 9, 1998, pps. 755-763

15 Profile HMM Software ‘Motif’ models have strings of match states separated by a small number of insert states. ‘Profile’ models have insert and delete states associated with each match state.. 3 (3) Eddy, Sean R., Profile hidden Markov models, Bioinformatics Review, Vol. 14 no. 9, 1998, pps. 755-763 (4) Ibid., Figure 3 on page 758. 4

16 HMMER Architecture Both local and global profile alignment. (5) Eddy, Sean R., HMMER User Guide, Version 2.3.2; Oct 2003. http://hmmer.wustl.edu/.http://hmmer.wustl.edu/ 5

17 How Does it Work? Generative models work by recursive enumeration of possible sequences from a finite set of rules. The Plan 7 architecture explicitly models the entire target sequence, regardless of how much of that sequence matches the main model. All alignments to a Plan 7 model are “global” alignments! (6) Adapted from Eddy, Sean R., HMMER User Guide, Version 2.3.2; Oct 2003. http://hmmer.wustl.edu/.http://hmmer.wustl.edu/ 6

18 HMMR Programs 7 hmmalign - align sequences to an HMM profile hmmbuild - build a profile HMM from an alignment hmmcalibrate - calibrate HMM search statistics hmmconvert - convert between profile HMM file formats hmmemit - generate sequences from a profile HMM hmmfetch - retrieve an HMM from an HMM database hmmindex - create a binary SSI index for an HMM database hmmpfam - search one or more sequences against an HMM database hmmsearch - search a sequence database with a profile HMM HMMER’s native alignment format is called Stockholm format, the format of the Pfam protein database that allows extensive markup and annotation. HMMER can read alignments in several common formats, including the output of the CLUSTAL family of programs, Wisconsin/GCG MSF format, the input format for the PHYLIP phylogenetic analysis programs, and “alighed FASTA” format (where the sequences in a FASTA file contain gap symbols, so that they are all the same length). (7) Excerpted from Eddy, Sean R., HMMER User Guide, Version 2.3.2; Oct 2003. http://hmmer.wustl.edu/.http://hmmer.wustl.edu/

19 Building a profile with hmmbuild 8 > hmmbuild globin.hmm globins50.msf hmmbuild - build a hidden Markov model from an alignment HMMER 2.3 (April 2003) Copyright (C) 1992-2003 HHMI/Washington University School of Medicine Freely distributed under the GNU General Public License (GPL) - - - - - - - - - - - - - - - - - - Alignment file: globins50.msf File format: MSF Search algorithm configuration: Multiple domain (hmmls) Model construction strategy: MAP (gapmax hint: 0.50) Null model used: (default) Prior used: (default) Sequence weighting method: G/S/C tree weights New HMM file: globin.hmm - - - - - - - - - - - - - - - - Alignment: #1 Number of sequences: 50 Number of columns: 308 Constructed a profile HMM (length 143) Average score: 189.04 bits Minimum score: -17.62 bits Maximum score: 234.09 bits Std. deviation: 53.18 bits (8) Adapted from Eddy, Sean R., HMMER User Guide, Version 2.3.2; Oct 2003. http://hmmer.wustl.edu/.http://hmmer.wustl.edu/

20 Calibrating the profile 9 > hmmcalibrate globin.hmm hmmcalibrate -- calibrate HMM search statistics HMMER 2.3 (April 2003) Copyright (C) 1992-2003 HHMI/Washington University School of Medicine Freely distributed under the GNU General Public License (GPL) - - - - - - - - - - - - - - - - - - HMM file: globin.hmm Length distribution mean: 325 Length distribution s.d.: 200 Number of samples: 5000 random seed: 1051632537 histogram(s) saved to: [not saved] POSIX threads: 4 - - - - - - - - - - - - - - - - HMM : globins50 mu : -39.897396 lambda : 0.226086 max : -9.567000 (9) Adapted from Eddy, Sean R., HMMER User Guide, Version 2.3.2; Oct 2003. http://hmmer.wustl.edu/.http://hmmer.wustl.edu/

21 Searching the sequence DB 10  Header Section  hmmsearch globin.hmm Artemia.fa hmmsearch - search a sequence database with a profile HMM HMMER 2.3 (April 2003) Copyright (C) 1992-2003 HHMI/Washington University School of Medicine Freely distributed under the GNU General Public License (GPL) - - - - - - - - - - - - - - - - - - HMM file: globin.hmm [globins50] Sequence database: Artemia.fa per-sequence score cutoff: [none] per-domain score cutoff: [none] per-sequence Eval cutoff: <= 10 per-domain Eval cutoff: [none] - - - - - - - - - - - - - - - - Query HMM: globins50 Accession: [none] Description: [none] [HMM has been calibrated; E-values are empirical estimates] (10) Adapted from Eddy, Sean R., HMMER User Guide, Version 2.3.2; Oct 2003. http://hmmer.wustl.edu/.http://hmmer.wustl.edu/

22 Searching the sequence DB (cont.) 11  Sequence Top Hits Section (11) Adapted from Eddy, Sean R., HMMER User Guide, Version 2.3.2; Oct 2003. http://hmmer.wustl.edu/.http://hmmer.wustl.edu/

23 Searching the sequence DB (cont.) 12  Alignment Output Section (12) Adapted from Eddy, Sean R., HMMER User Guide, Version 2.3.2; Oct 2003. http://hmmer.wustl.edu/.http://hmmer.wustl.edu/

24 Searching the sequence DB (cont.) 13  Score Histogram Section (13) Adapted from Eddy, Sean R., HMMER User Guide, Version 2.3.2; Oct 2003. http://hmmer.wustl.edu/.http://hmmer.wustl.edu/

25 Local versus Global Alignment 14  HMMER does not do local (Smith/Waterman) and global (Needleman/Wunsch) style alignments in the same way that most computational biology analysis programs do it.  To HMMER, whether local or global alignments are allowed is part of the model, rather than being accomplished by running a different algorithm.  You must choose what kind of alignments you want to allow when you build the model  By default, hmmbuild builds models which allow alignments that are global with respect to the HMM, local with respect to the sequence, and allows multiple domains to hit per sequence. (13) Adapted from Eddy, Sean R., HMMER User Guide, Version 2.3.2; Oct 2003. http://hmmer.wustl.edu/.http://hmmer.wustl.edu/

26 Experimental Observations My tests on the clipped SH3 Domain sequence in the Krogh paper. 15 The insert gap penalty was small but significant. The number of inserts had a linear, negative affect on the score. Relative to the overall score, the inserts and deletes had a small effect. (15) Krogh, “Computational Methods in molecular biology, pages 45-63, Elsevier, 1998. Avg Log Odds by Domain 8.63 -1.46 14.77


Download ppt "Hidden Markov Model Ed Anderson and Sasha Tkachev."

Similar presentations


Ads by Google