A Practical Introduction to Hidden Markov Models for Bioinformaticians Thomas R. Ioerger Department of Computer Science Texas A&M University
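HMM’s - What are They?
useful for modeling protein/DNA sequence patterns
probabilistic state-transition diagrams
Markov processes - the next state depends only on the current state (independence from earlier history)
“hidden” states
[Figure: two-state coin model (heads, tails) with transition probabilities 0.4 and 0.6, generating HTHTTHTTTHTTHTHHHTHT...; intron/exon model in which each state emits A, G, C, T with its own probabilities (values such as 0.09 and 0.06 are shown), generating AACATGGTACATGTTAG...]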
Historical Context
voice compression, information theory
–how to fit full range of human voice in 8 kb/s of multiplexed/wireless bandwidth?
–encode/decode into sequence of “codebook vectors”
speech = string of phonemes
–probability of what follows what...
statistical sampling, convergence properties (“mixing”)
[Figure: the phonetic string “lets ‘gO tu the ‘pär-tE” matched against candidate words PARTY, POTTY, PATTY]
Essential References
Rabiner paper:
–Rabiner, L.R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77:257-286.
Durbin book:
–Durbin, R., Eddy, S., Krogh, A., and Mitchison, G. (1998). Biological Sequence Analysis. Cambridge University Press.
Anders Krogh
–Ch. 4 in Computational Biology: Pattern Analysis and Machine Learning Methods (1998; Salzberg, Searls, and Kasif, eds.)
–Krogh, Brown, Mian, Sjolander, Haussler (1994). Hidden Markov models in computational biology: applications to protein modeling. Journal of Molecular Biology, 235:1501-1531.
David Haussler (UCSC)
–Sjolander, K., Karplus, K., Brown, M., Hughey, R., Krogh, A., Mian, I.S., and Haussler, D. (1996). Dirichlet mixtures: a method for improved detection of weak but significant protein sequence homology. Computer Applications in the Biosciences, 12:327-345.
Basic Mathematics
Markov chains: prob. of a sequence $S = a_1 a_2 \ldots a_n$
–chain rule (exact): $P(S) = P(a_1)\,P(a_2 \mid a_1)\,P(a_3 \mid a_1 a_2) \cdots P(a_n \mid a_1 \ldots a_{n-1})$
–first-order Markov assumption: $P(S) = P(a_1)\,P(a_2 \mid a_1)\,P(a_3 \mid a_2) \cdots P(a_n \mid a_{n-1})$
–compactly: $P(S) = P(a_1) \prod_{i=2}^{n} P(a_i \mid a_{i-1})$
HMMs: prob. depends on states passed thru
–if known (states $= s_1, s_2, \ldots, s_n$, with the start state $s_1$ taken as given): $P(S) = P(a_1 \mid s_1) \prod_{i=2}^{n} P(s_i \mid s_{i-1})\,P(a_i \mid s_i)$
–if unknown, sum over all possible paths (see the Forward algorithm below)
important for determining how consistent (probable) an observed sequence is, given a model
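The two product formulas above can be checked numerically. The following is a minimal Python sketch; the nucleotide transition table, the two-state exon/intron HMM, and all function names are illustrative assumptions, not values from the slides.

```python
# Toy parameters (illustrative assumptions, not taken from the slides).
INITIAL = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
TRANSITION = {
    "A": {"A": 0.4, "C": 0.2, "G": 0.2, "T": 0.2},
    "C": {"A": 0.2, "C": 0.4, "G": 0.2, "T": 0.2},
    "G": {"A": 0.2, "C": 0.2, "G": 0.4, "T": 0.2},
    "T": {"A": 0.2, "C": 0.2, "G": 0.2, "T": 0.4},
}


def markov_chain_prob(seq, initial, transition):
    """P(S) = P(a_1) * prod_{i>=2} P(a_i | a_{i-1})."""
    prob = initial[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        prob *= transition[prev][cur]
    return prob


# Hypothetical two-state HMM: each state has its own emission table.
STATE_TRANS = {
    "exon": {"exon": 0.9, "intron": 0.1},
    "intron": {"exon": 0.2, "intron": 0.8},
}
EMIT = {
    "exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "intron": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}


def hmm_path_prob(seq, path, state_trans, emit):
    """P(a_1|s_1) * prod_{i>=2} P(s_i|s_{i-1}) P(a_i|s_i); start state is given."""
    prob = emit[path[0]][seq[0]]
    for i in range(1, len(seq)):
        prob *= state_trans[path[i - 1]][path[i]] * emit[path[i]][seq[i]]
    return prob


print(markov_chain_prob("AACG", INITIAL, TRANSITION))
print(hmm_path_prob("ACG", ["exon", "exon", "intron"], STATE_TRANS, EMIT))
```

When the path is not known, the same per-path product has to be summed over every possible state path, which is what the Forward algorithm does efficiently.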
Most probable path: Viterbi algorithm
if the path of states is unknown, it can be recovered
–choose the path that gives the highest probability to the sequence
define $p_{i,j}$ as the probability of the most probable path ending in state $j$ after emitting element $i$
compute recursively:
–suppose you knew $p_{i-1,j}$ for all states up to previous char
–update: $p_{i,k} = P(a_i \mid s_k) \cdot \max_j \left( p_{i-1,j}\, P(s_k \mid s_j) \right)$
dynamic programming and “traceback”
–keep table of state probs, start with 1st char, assign prob to each state, iterate updates...
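The recursion and traceback can be sketched in a few lines of Python. This is a minimal illustration, assuming a hypothetical two-state coin model (a fair state "F" and a biased state "B") whose numbers are made up; it works in log space to avoid underflow and assumes all probabilities are nonzero.

```python
import math

# Hypothetical coin model (illustrative numbers only).
STATES = ("F", "B")
START = {"F": 0.5, "B": 0.5}
TRANS = {"F": {"F": 0.9, "B": 0.1}, "B": {"F": 0.1, "B": 0.9}}
EMIT = {"F": {"H": 0.5, "T": 0.5}, "B": {"H": 0.9, "T": 0.1}}


def viterbi(seq, states, start, trans, emit):
    """Return (most probable state path, its joint probability with seq)."""
    # First column: log p_{1,j} = log P(s_j at start) + log P(a_1 | s_j)
    table = [{s: math.log(start[s]) + math.log(emit[s][seq[0]]) for s in states}]
    backpointers = []
    for sym in seq[1:]:
        prev = table[-1]
        row, ptr = {}, {}
        for k in states:
            # max over predecessor states j of p_{i-1,j} * P(s_k | s_j)
            best = max(states, key=lambda j: prev[j] + math.log(trans[j][k]))
            row[k] = prev[best] + math.log(trans[best][k]) + math.log(emit[k][sym])
            ptr[k] = best
        table.append(row)
        backpointers.append(ptr)
    # Traceback from the best final state.
    last = max(states, key=lambda s: table[-1][s])
    path = [last]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, math.exp(table[-1][last])


path, prob = viterbi("HHHHTTTT", STATES, START, TRANS, EMIT)
print("".join(path), prob)
```

Storing one backpointer per state and position is what makes the traceback possible without re-running the maximization.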
Forward Algorithm
probability of a sequence summed over all possible paths
similar to Viterbi, except replace ‘max’ with probabilistic ‘sum’
–let $q_{i,j}$ be total prob of seeing $a_1 \ldots a_i$ and ending in state $j$
–$q_{i+1,k} = P(a_{i+1} \mid s_k) \cdot \sum_j q_{i,j}\, P(s_k \mid s_j)$
at end of computation, $q_{n,j}$ gives total probability of having generated the whole sequence and ended in state $j$
total probability of sequence given model is $\sum_j q_{n,j}$
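A minimal Python sketch of the forward recursion follows, assuming the same kind of hypothetical two-state coin model (all numbers are made up). It uses plain probabilities, so for long sequences a scaled or log-space version would be needed to avoid underflow.

```python
# Hypothetical coin model (illustrative numbers only).
STATES = ("F", "B")
START = {"F": 0.5, "B": 0.5}
TRANS = {"F": {"F": 0.9, "B": 0.1}, "B": {"F": 0.1, "B": 0.9}}
EMIT = {"F": {"H": 0.5, "T": 0.5}, "B": {"H": 0.9, "T": 0.1}}


def forward(seq, states, start, trans, emit):
    """Total probability of seq, summed over all state paths."""
    # q[j] = probability of the prefix seen so far, ending in state j
    q = {s: start[s] * emit[s][seq[0]] for s in states}
    for sym in seq[1:]:
        q = {
            k: emit[k][sym] * sum(q[j] * trans[j][k] for j in states)
            for k in states
        }
    return sum(q.values())


print(forward("HHTH", STATES, START, TRANS, EMIT))
```

A quick sanity check on any implementation: the forward probabilities of all sequences of a fixed length must add up to 1.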
Backward Algorithm
estimates probability of going through state $j$ at time $i$ (i.e. observation $i$ generated by state $j$)
$P(s_i = j \mid a_1 \ldots a_n) = \dfrac{P(a_1 \ldots a_i,\ s_i = j)\; P(a_{i+1} \ldots a_n \mid s_i = j)}{P(a_1 \ldots a_n)}$
–prefix probability $P(a_1 \ldots a_i,\ s_i = j)$: comes from Forward algorithm (this is $q_{i,j}$)
–denominator $P(a_1 \ldots a_n)$: total prob from Forward algorithm, to normalize
–suffix probability: work backwards from last char, summing probs over states
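The posterior formula above combines a forward table with a backward table. The sketch below is one way to compute it in Python, again assuming a hypothetical two-state coin model with made-up numbers and unscaled probabilities (short sequences only).

```python
# Hypothetical coin model (illustrative numbers only).
STATES = ("F", "B")
START = {"F": 0.5, "B": 0.5}
TRANS = {"F": {"F": 0.9, "B": 0.1}, "B": {"F": 0.1, "B": 0.9}}
EMIT = {"F": {"H": 0.5, "T": 0.5}, "B": {"H": 0.9, "T": 0.1}}


def forward_table(seq, states, start, trans, emit):
    """f[i][j] = P(a_1..a_i, s_i = j)."""
    f = [{s: start[s] * emit[s][seq[0]] for s in states}]
    for sym in seq[1:]:
        prev = f[-1]
        f.append({k: emit[k][sym] * sum(prev[j] * trans[j][k] for j in states)
                  for k in states})
    return f


def backward_table(seq, states, trans, emit):
    """b[i][j] = P(a_{i+1}..a_n | s_i = j); the last column is all ones."""
    b = [{s: 1.0 for s in states}]
    for sym in reversed(seq[1:]):
        nxt = b[0]
        b.insert(0, {j: sum(trans[j][k] * emit[k][sym] * nxt[k] for k in states)
                     for j in states})
    return b


def posterior(seq, states, start, trans, emit):
    """P(s_i = j | a_1..a_n) for every position i and state j."""
    f = forward_table(seq, states, start, trans, emit)
    b = backward_table(seq, states, trans, emit)
    total = sum(f[-1].values())  # P(a_1..a_n)
    return [{s: f[i][s] * b[i][s] / total for s in states}
            for i in range(len(seq))]


for i, column in enumerate(posterior("HHHHTTTT", STATES, START, TRANS, EMIT), 1):
    print(i, round(column["B"], 3))
```

Unlike the Viterbi path, these per-position posteriors account for every path, which is why they are the quantity used during training.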
Training HMM’s
easy if you know the states (e.g. multiple alignment)
–just compute prob. distribution over elements at each site
problem: dealing with unobservable state info
–raw sequences: which element came from which state?
EM (expectation maximization, aka ‘Baum-Welch’)
–uses forward and backward algorithms
–start with random transition probabilities
–compute state probabilities at each position (or the most likely path) under the current parameters
–use best current guesses to estimate hidden information
–update state probabilities based on that; iterate
–converges on transition probabilities that (locally) maximize the likelihood of the observed input (training) sequences
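One EM iteration can be sketched as follows. This is a minimal illustration under stated assumptions: a hypothetical two-state coin model, made-up training strings, unscaled probabilities, and expected (posterior) counts rather than the single most likely path. It is not the procedure of any particular package.

```python
import math


def forward_table(seq, states, start, trans, emit):
    f = [{s: start[s] * emit[s][seq[0]] for s in states}]
    for sym in seq[1:]:
        prev = f[-1]
        f.append({k: emit[k][sym] * sum(prev[j] * trans[j][k] for j in states)
                  for k in states})
    return f


def backward_table(seq, states, trans, emit):
    b = [{s: 1.0 for s in states}]
    for sym in reversed(seq[1:]):
        nxt = b[0]
        b.insert(0, {j: sum(trans[j][k] * emit[k][sym] * nxt[k] for k in states)
                     for j in states})
    return b


def normalize(counts):
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}


def baum_welch_step(seqs, states, symbols, start, trans, emit):
    """One EM update: expected counts (E-step), then renormalize (M-step)."""
    s_cnt = {j: 0.0 for j in states}
    t_cnt = {j: {k: 0.0 for k in states} for j in states}
    e_cnt = {j: {x: 0.0 for x in symbols} for j in states}
    for seq in seqs:
        f = forward_table(seq, states, start, trans, emit)
        b = backward_table(seq, states, trans, emit)
        total = sum(f[-1].values())
        for i, sym in enumerate(seq):
            for j in states:
                gamma = f[i][j] * b[i][j] / total  # P(s_i = j | seq)
                e_cnt[j][sym] += gamma
                if i == 0:
                    s_cnt[j] += gamma
                if i + 1 < len(seq):
                    for k in states:  # expected j -> k transitions at step i
                        t_cnt[j][k] += (f[i][j] * trans[j][k]
                                        * emit[k][seq[i + 1]] * b[i + 1][k] / total)
    return (normalize(s_cnt),
            {j: normalize(t_cnt[j]) for j in states},
            {j: normalize(e_cnt[j]) for j in states})


def log_likelihood(seqs, states, start, trans, emit):
    return sum(math.log(sum(forward_table(s, states, start, trans, emit)[-1].values()))
               for s in seqs)


# Hypothetical starting guess and training data.
STATES, SYMBOLS = ("F", "B"), ("H", "T")
SEQS = ["HHHHTHHH", "TTHTHTTH", "HHHHHHTT"]
start = {"F": 0.6, "B": 0.4}
trans = {"F": {"F": 0.7, "B": 0.3}, "B": {"F": 0.4, "B": 0.6}}
emit = {"F": {"H": 0.5, "T": 0.5}, "B": {"H": 0.7, "T": 0.3}}

history = [log_likelihood(SEQS, STATES, start, trans, emit)]
for _ in range(10):
    start, trans, emit = baum_welch_step(SEQS, STATES, SYMBOLS, start, trans, emit)
    history.append(log_likelihood(SEQS, STATES, start, trans, emit))
print(history[0], history[-1])
```

The log-likelihood should never decrease from one iteration to the next; the optimum reached is a local one and depends on the starting guess.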
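[Figure: estimating per-site probabilities when the states are known vs. unknown]
aligned sequences (states known):
  ACTVHLLRKMP
  ASTIHILRKMA
  ACSVHILKKQP
  ACIVHMLKKMP
–per-site frequencies: site 1: A: 1.0; site 2: C: .75, S: .25; site 3: T: .5, S: .25, I: .25; site 4: V: .75, I: .25
unaligned sequences (states unknown):
  ACTTTSPVVHLLRKMP
  ASTIHILKMA
  MNFIYPQSACSVHILKKQP
  CIVLKKMP
Which letters came from which states? Let the model tell you (using Backward algorithm, based on current params)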
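For the aligned case, the per-site distributions are just column frequencies. A minimal Python sketch using the four aligned sequences from the slide (the function name is an illustrative choice):

```python
from collections import Counter

# The four aligned sequences shown on the slide.
ALIGNED = ["ACTVHLLRKMP", "ASTIHILRKMA", "ACSVHILKKQP", "ACIVHMLKKMP"]


def column_frequencies(aligned):
    """Relative frequency of each residue in each alignment column."""
    n = len(aligned)
    return [
        {residue: count / n for residue, count in Counter(column).items()}
        for column in zip(*aligned)  # zip(*rows) iterates over columns
    ]


for site, freqs in enumerate(column_frequencies(ALIGNED)[:4], start=1):
    print(site, freqs)
```

For the unaligned sequences there is no column to count over, which is why the state assignments have to be estimated from the current model instead.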
Advanced Issues
topology searching
–similar to Bayesian networks? (Chickering, Heckerman, Buntine...)
–“duration”, length (# states), and exit probabilities
higher-order HMM’s (probabilities depend on window of several recent states)
–$P(a_i \mid a_{i-1}, a_{i-2}, a_{i-3})$
Dirichlet priors
–pseudo-counts (because a count of 0 in the training data doesn’t necessarily mean a probability of 0, but Prob=0 kills the whole computation)
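The effect of pseudo-counts can be shown with simple additive smoothing. This Python sketch adds the same constant to every residue, which is a deliberately simplified stand-in for Dirichlet (mixture) priors; the function name and the default pseudocount of 1.0 are assumptions made for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def smoothed_probs(counts, alphabet=AMINO_ACIDS, pseudocount=1.0):
    """Add a pseudocount to every symbol so no probability is exactly 0."""
    total = sum(counts.get(x, 0) for x in alphabet) + pseudocount * len(alphabet)
    return {x: (counts.get(x, 0) + pseudocount) / total for x in alphabet}


# Observed counts at one alignment column: C seen 3 times, S once.
probs = smoothed_probs({"C": 3, "S": 1})
print(probs["C"], probs["S"], probs["A"])
```

With raw frequencies, any sequence carrying an unseen residue at this site would get probability 0 for the entire path; with smoothing it is merely penalized.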
Applications I: Protein Families
PFAM database
–http://pfam.wustl.edu, http://sanger.ac.uk/Software/Pfam
–models for each family available
–files consist of transition and emission probabilities
–3 states per “site”: match, insert, delete
–constructed from multiple sequences (unaligned)
–can use for searching (more sensitive than pairwise homology detection) or alignment
[Figure: profile HMM topology: start state followed by sites 1-5, each with ins (insert) and del (delete) states; residues A, C, D... are emitted at each site]
HMM Software
HMMER
–Sean Eddy; Wash. Univ. St. Louis
–http://hmmer.wustl.edu
–commands: hmmalign, hmmsearch
–ls/fs, calibration
other software: SAM
–Hughey and Krogh (1996), CABIOS, 12:95-107.
–http://www.cse.ucsc.edu/research/compbio/sam.html
Applications II: DNA patterns
Splice junctions: GeneSplicer (Salzberg)
–Pertea, M., Lin, X., and Salzberg, S. (2001). GeneSplicer: a new computational method for splice site prediction. Nucleic Acids Research, 29(5):1185-1190.
[Figure: ...........GT[....................]AG.......... labeled with HMM-donor, HMM-accept, and HMM-coding models]
Gene finding: GLIMMER (Salzberg)
–Salzberg, S., Delcher, A., Kasif, S., and White, O. (1998). Microbial gene identification using interpolated Markov models. Nucleic Acids Research, 26(2):544-548.
–use HMMs to refine classification of ORFs (e.g. GC content)
–build 1 HMM for each of 6 reading frames (different nuc. probabilities)
–interpolated 5th-order Markov chain (depends on preceding 5-mer)
–use combined scores to indicate consistency - help pick right frame
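To make "depends on the preceding 5-mer" concrete, here is a Python sketch of a fixed k-th order Markov chain trained on example sequences and used to score new ones. It is not GLIMMER's interpolation scheme: it uses a single fixed order with additive pseudocounts, and the training strings and function names are made up for illustration.

```python
import math
from collections import defaultdict


def train_kth_order_model(seqs, k, alphabet="ACGT", pseudocount=1.0):
    """Estimate P(base | preceding k-mer) from training sequences."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in seqs:
        for i in range(k, len(seq)):
            counts[seq[i - k:i]][seq[i]] += 1.0

    def prob(context, base):
        ctx = counts.get(context, {})  # unseen contexts fall back to uniform
        total = sum(ctx.values()) + pseudocount * len(alphabet)
        return (ctx.get(base, 0.0) + pseudocount) / total

    return prob


def log_score(seq, prob, k):
    """Log-probability of seq given its first k bases."""
    return sum(math.log(prob(seq[i - k:i], seq[i])) for i in range(k, len(seq)))


# Hypothetical training data; k=2 keeps the toy example small (GLIMMER-style
# models would use k=5 and far more sequence).
model = train_kth_order_model(["ACGACGACGACG"], k=2)
print(log_score("ACGACG", model, k=2), log_score("ACTTCA", model, k=2))
```

Scoring one ORF under a separate model per reading frame and comparing the log scores is the kind of consistency check the slide describes.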
Applications III: Secondary Structure
Goldman, N., Thorne, J.L., and Jones, D.T. (1996). Using evolutionary trees in protein secondary structure prediction and other comparative analyses. JMB, 263:196-208.
old method: amino acid biases; probability based on window of local residues (Chou-Fasman)
new method: HMM
–3 models: helix, strand, coil (each has 20x20 table with transition frequencies between neighbors $a_i \Rightarrow a_{i+1}$)
problem: AA replacement probabilities depend on evolutionary distance
–given a set of training sequences, probs depend on how similar/redundant the sequences are
–not independent sample (homology weighting?)
–correct the probabilities for distance between sequences (% homology) - determines instantaneous substitution “rates”
using the HMMs to predict secondary structure
–input: multiple alignment
–step 1: estimate phylogeny by maximum likelihood
–step 2: probability tables are scaled for PAM distance
–step 3: estimate $\mathrm{Prob}(\text{class} \mid \text{site}_i)$ based on summing over $\mathrm{Prob}(\text{class} \mid \text{site}_{i-1})$ and $\mathrm{Prob}(a_i \mid \text{class})$ and $\mathrm{Prob}(a_i \mid a_{i-1})$, and likewise at site $i+1$ (for consistency)
–takes forward and backward pass to determine $P(\text{class} \mid \text{site}_i)$
Applications IV: Protein Fold Recognition
Bienkowska, Yu, Zarakhovich, Rogers, and Smith (2000). “Protein fold recognition by total alignment probability”. Proteins: Structure, Function, and Genetics, 40(3):451-462.
improved approach to threading/family 3D profiles
model proteins as segments of secondary structure
–build HMM with set of states for each segment (DSSP)
–calculate amino acid probabilities in each (like 3D profile), also function of exposure
$P(\text{model} \mid \text{seq}) = P(\text{seq} \mid \text{model})\,P(\text{model}) / P(\text{seq})$
–P(model) based on SCOP distribution
–P(seq|model)
  method 1: most probable path (Viterbi alg): 64% accuracy
  method 2: sum over all paths (Forward alg): 90% accuracy
[Figure: fold model drawn as a chain of secondary-structure segments: strand, helix, loop, helix, loop, strand]
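The Bayes step can be illustrated with a few lines of Python. The fold names, likelihoods, and priors below are invented for the example, and P(seq) is computed by assuming the listed models are the only candidates.

```python
def posterior_over_models(likelihoods, priors):
    """P(model | seq) = P(seq | model) P(model) / P(seq), over candidate models."""
    joint = {m: likelihoods[m] * priors[m] for m in likelihoods}
    evidence = sum(joint.values())  # P(seq), assuming the candidates are exhaustive
    return {m: joint[m] / evidence for m in joint}


# Hypothetical values: P(seq | model) from the Forward algorithm,
# P(model) from how often each fold occurs in a classification database.
likelihoods = {"fold_A": 1e-30, "fold_B": 4e-30, "fold_C": 5e-31}
priors = {"fold_A": 0.5, "fold_B": 0.25, "fold_C": 0.25}

post = posterior_over_models(likelihoods, priors)
print(max(post, key=post.get), post)
```

Swapping the Forward likelihood for the Viterbi path probability changes only the `likelihoods` input; the slide reports that the sum over all paths gave the higher accuracy.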
Applications V: Fold classification by Support Vector Machine
Jaakkola, T., Diekhans, M., and Haussler, D. (1999). Using the Fisher kernel method to detect remote protein homologies. Proceedings of Conf. on Intelligent Systems for Molecular Biology, 149-158.
Support Vector Machines (SVMs)
–classification based on maximum-margin hyperplanes between positive and negative examples
–efficient computation uses only pairwise distance
[Figure: globins separated from non-globin sequences by a hyperplane; what space is this? what distance metric?]
HMM distance between sequences: Fisher kernel
–re-represent each sequence by vector of “sufficient statistics” (probabilities of internal parameters in model)
–e.g. how likely was state i used by sequence A?
–Fisher score is derivative (sensitivity) of log-likelihood score of sequence with respect to internal parameters $\theta$
  $U_X = \nabla_\theta \log P(X \mid H_1, \theta)$
  gradient = vector of derivatives, one for each parameter
–$\mathrm{dist}(X, X') = \exp\left[ -\frac{1}{2\sigma^2} (U_X - U_{X'})^T (U_X - U_{X'}) \right]$
  a Gaussian function of the squared Euclidean distance between the two score vectors (equals 1 when they are identical), used like a dot-product between two vectors
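Given two Fisher score vectors, the Gaussian expression above is straightforward to evaluate. A minimal Python sketch; the score vectors and the value of sigma are made-up placeholders, and computing real Fisher scores would require the gradient of the HMM log-likelihood.

```python
import math


def fisher_gaussian_kernel(u_x, u_y, sigma):
    """exp(-(U_X - U_X')^T (U_X - U_X') / (2 sigma^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u_x, u_y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


# Hypothetical Fisher score vectors (one entry per model parameter).
u_seq_a = [0.2, -1.0, 0.5]
u_seq_b = [0.1, -0.8, 0.7]
print(fisher_gaussian_kernel(u_seq_a, u_seq_b, sigma=1.0))
```

The value is 1 for identical score vectors and falls toward 0 as they move apart, so it behaves as a similarity that an SVM can use in place of a dot-product.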
Example I: KaiA, N-terminal domain
protein involved in regulating circadian rhythms in cyanobacteria
284 amino acids: 2 domains
–N-terminal domain solved by NMR (Liwang, Vakonakis)
–whole molecule solved together by crystallography (Sacchettini)
N-terminal domain
–no apparent homology to known proteins a priori
  BLAST didn’t reveal family relationship; none >15% identity
–when structure solved, found to be structurally similar to response receiver domains (CheY, FixJ, AmiR...)
  chemotaxis, nitrogen fixation, amidase activity...
  relation detected by structural homology search (DALI)
Using HMMER on KaiA
1. Search PFAM with sequence:
–returns response_reg family
2. Download HMM model for family
3. Run KaiA seq with model through hmmsearch to determine significance
4. Run KaiA seq with representative members of family through hmmalign to build multiple alignment
The response_reg HMM model
  HMMER2.0 [2.3.1]
  NAME  response_reg
  ACC   PF00072
  DESC  Response regulator receiver domain
  LENG  125
  ALPH  Amino
  RF    no
  CS    no
  MAP   yes
  COM   hmmbuild -F HMM_ls.ann SEED.ann
  COM   hmmcalibrate --seed 0 HMM_ls.ann
  NSEQ  54
  DATE  Mon Jun 23 20:37:43 2003
  CKSUM 3355
  GA    -17.0 -17.0
  TC    -17.0 -17.0
  NC    -17.1 -17.1
  XT    -8455 -4 -1000 -1000 -8455 -4 -8455 -4
  NULT  -4 -8455
  NULE  595 -1558 85 338 -294 453 -1158 197
  EVD   -88.885818 0.177915
  HMM        A      C      D      E      F      G      H      I ...
      1   -346     57  -5567  -1768    445   -964   1156    928
      2  -1511  -4502   -477  -2328  -4823  -4003   1597  -1632
      3    501    259   -117    908   1124   -390   -625   4024
  ...
[Slide annotations on the table: “states”, “transition probabilities”]
  >hmmsearch respregls.hmm combined.fa
  hmmsearch - search a sequence database with a profile HMM
  HMMER 2.2g (August 2001)
  Copyright (C) 1992-2001 HHMI/Washington University School of Medicine
  Freely distributed under the GNU General Public License (GPL)
  - - - - - - - - - - - - - - - - - -
  HMM file:                   respregls.hmm [response_reg]
  Sequence database:          combined.fa
  per-sequence score cutoff:  [none]
  per-domain score cutoff:    [none]
  per-sequence Eval cutoff:   <= 10
  per-domain Eval cutoff:     [none]
  - - - - - - - - - - - - - - - -
  Query HMM:   response_reg
  Accession:   PF00072
  Description: Response regulator receiver domain
    [HMM has been calibrated; E-values are empirical estimates]

  Scores for complete sequences (score includes all domains):
  Sequence  Description    Score    E-value   N
  --------  -----------    -----    -------  ---
  3CHY:     CheY           156.2      3e-47    1
  KaiA,     domain 1       -51.3     0.0037    1
  1amx                     -97.6          3    1
[Slide annotation on the E-value column: significant if <0.1]
Multiple Alignment with Response_Reg family members
  >hmmalign respregls.hmm combined3.fa
  DrrD     YAL--NEP.-FDVVILDI-LPV.....HDGWE.ILKS-RESGVNTPV---
  CheY     LNKLQAGG.-YGFVISDWNMPN.....MDGLE.LLKTIRADGAMSALPVL
  NarL     IELAESLD.-PDLILLDLNMPG.....MNGLE.TLDKLREKSLSGRI--V
  FixJ     LAFAPDVR.-NGVLVT-LRMPD.....MSGVE.LLRNLGDLKINIPS--I
  Etr1     LRVVSHEH.--KVVFMDVCMPGvenyqIA-LR.IHEKFTQRHQRPLL--V
  AmiR     FDV----P.-VDVVFTSI-FQN.....RHHDE.IAALLAAGTPRTTL--V
  KaiA     LEYAQTHRdQIDCLILVAANPS.....---FRaVVQQLCFEGVVVPA--I
  #=GC RF  xxxxxxxx.xxxxxxxxxxxxx.....xxxxx.xxxxxxxxxxxxxxxxx

  DrrD     LLTALSDVEYRVKGLN-GADDYLPKPFDLRELIARVRALIRRkSeskstk
  CheY     MVTAEAKKENIIAAAQAGASGYVVKPFTAATLEEKLNKIFEKlGm.....
  NarL     VFSVSNHEEDVVTALKRGADGYLLKDMEPEDLLKALHQAAAGeMvlseal
  FixJ     VITGHGDVPMAVEAMKAGAVDFIEKPFEDTVIIEAIERASEHlV......
  Etr1     ALSGNTDKSTKEKCMSFGLDGVLLKPVSLDNIRDVLSDLLEPrVlye...
  AmiR     ALVEYESPAVLSQIIELECHGVITQPLDAHRVLPVLVSARRIsEemaklk
  KaiA     VVGDRDP---AKEQLYHSAELHLGIH-QLEQLPYQVDAALAEfLrlapve
  #=GC RF  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.x......
[Slide annotation on the #=GC RF line: HMM states]
Summary of KaiA
Weak homology could have been detected using HMMs (through PFAM search)
–HMM model is more general representation of family
–allows more sensitive searches
HMMs allow pairwise alignment to other fold family members
–No reasonable direct pairwise alignment could be constructed using Smith-Waterman
–tried various gap parameters, similarity matrices
–produced “random” gap placement
–yet alignment of CheY and KaiA to model gives meaningful pairwise alignment