Presentation on theme: "Profiles for Sequences"— Presentation transcript:
1 Profiles for Sequences
Hidden Markov Models in Computational Biology
2 Sequence Profiles
Often, sequences are characterized by similarities that are not well captured by matching algorithms.
Examples include identifying genes in the presence of exons and introns, gene features (CpG islands, etc.), and domain profiles in proteins.
For such sequences, Markov chains provide useful abstractions.
3 Markov Chains
State transition matrix: the probability of the weather given the previous day's weather.
States: three states (sunny, cloudy, rainy).
Initial distribution: the probability of the system being in each of the states at time 0.
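A minimal sketch of the weather chain may make the three ingredients concrete. The slide's actual matrix did not survive in the transcript, so every number below is a hypothetical placeholder, and `chain_probability` is an illustrative name.

```python
# All probabilities here are hypothetical; the slide's matrix was not preserved.
STATES = ("sunny", "cloudy", "rainy")

# Initial distribution: probability of each state at time 0.
INITIAL = {"sunny": 0.5, "cloudy": 0.3, "rainy": 0.2}

# State transition matrix: TRANSITION[yesterday][today].
TRANSITION = {
    "sunny":  {"sunny": 0.6, "cloudy": 0.3, "rainy": 0.1},
    "cloudy": {"sunny": 0.3, "cloudy": 0.4, "rainy": 0.3},
    "rainy":  {"sunny": 0.2, "cloudy": 0.3, "rainy": 0.5},
}


def chain_probability(days):
    """Probability of a fully observed weather sequence under the chain."""
    p = INITIAL[days[0]]
    for yesterday, today in zip(days, days[1:]):
        p *= TRANSITION[yesterday][today]
    return p


# P(sunny) * P(sunny | sunny) * P(rainy | sunny)
print(chain_probability(["sunny", "sunny", "rainy"]))
```

Each row of the matrix sums to 1 because tomorrow's weather must be one of the three states.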
4 Hidden Markov Models
Hidden states: the (true) states of a system that may be described by a Markov process (e.g., the weather).
Observable states: the states of the process that are "visible".
5 Hidden Markov Models
Initial distribution: the initial state probability vector.
State transition matrix.
Emission probabilities: the probability of observing a particular observable state given that the model is in a particular hidden state.
6 Hidden Markov Models
[Figure: an HMM with its output probabilities and transition probabilities labeled.]
Observed sequences can be scored if their state transitions are known.
The probability of ACCY along this path is:
0.4 × 0.3 × 0.46 × 0.6 × 0.97 × 0.5 × 0.015 × 0.73 × 0.01 × 1 = 1.76 × 10^-6
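The product on this slide can be checked directly. The sketch below takes the ten factors in the order the slide lists them; which of them are transition probabilities and which are output probabilities cannot be recovered from the transcript, so they are treated as an unlabeled list.

```python
import math

# Factors copied from the slide, in order. Their roles (transition vs. output)
# are not recoverable from the transcript, so no labels are attached.
factors = [0.4, 0.3, 0.46, 0.6, 0.97, 0.5, 0.015, 0.73, 0.01, 1.0]

path_probability = math.prod(factors)

# Summing logs gives the same result and avoids underflow on longer paths.
log_probability = sum(math.log(f) for f in factors)

print(f"{path_probability:.3g}")
```

Working in log space is the usual choice once sequences get longer than a few dozen symbols, since the raw product quickly becomes too small to represent.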
7 Methods for Hidden Markov Models
Scoring problem: given an existing HMM and an observed sequence, what is the probability that the HMM generates the sequence?
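The scoring problem is commonly solved with the forward algorithm, which sums over all hidden paths without enumerating them. The following is a sketch on a toy two-state DNA model; the state names and all probabilities are assumptions made up for illustration, not values from the slides.

```python
# Toy model: every name and number below is a hypothetical illustration.
STATES = ("AT_rich", "GC_rich")
START = {"AT_rich": 0.5, "GC_rich": 0.5}
TRANS = {
    "AT_rich": {"AT_rich": 0.9, "GC_rich": 0.1},
    "GC_rich": {"AT_rich": 0.2, "GC_rich": 0.8},
}
EMIT = {
    "AT_rich": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
    "GC_rich": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
}


def forward_probability(obs, states, start, trans, emit):
    """Return P(obs | model), summed over every hidden state path."""
    # alpha[s] = P(symbols seen so far, current hidden state = s)
    alpha = {s: start[s] * emit[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        alpha = {
            s: emit[s][symbol] * sum(alpha[r] * trans[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())


print(forward_probability("GGCA", STATES, START, TRANS, EMIT))
```

The cost is proportional to the sequence length times the square of the number of states, compared with an exponential number of paths for brute-force enumeration. This sketch has no explicit end state and uses raw probabilities; longer sequences would need scaling or log-space arithmetic.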
8 Methods, contd.
Alignment problem: given a sequence, what is the optimal (most probable) state sequence that the HMM would use to generate it?
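The alignment problem is usually solved with the Viterbi algorithm, a dynamic program that keeps only the best path into each state at each position. Below is a sketch on a toy two-state DNA model; the state names and probabilities are hypothetical and chosen only for illustration.

```python
# Toy model: every name and number below is a hypothetical illustration.
STATES = ("AT_rich", "GC_rich")
START = {"AT_rich": 0.5, "GC_rich": 0.5}
TRANS = {
    "AT_rich": {"AT_rich": 0.9, "GC_rich": 0.1},
    "GC_rich": {"AT_rich": 0.2, "GC_rich": 0.8},
}
EMIT = {
    "AT_rich": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
    "GC_rich": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
}


def viterbi(obs, states, start, trans, emit):
    """Return (most probable state path, joint probability of path and obs)."""
    # best[s] = probability of the best path that ends in state s
    best = {s: start[s] * emit[s][obs[0]] for s in states}
    backpointers = []
    for symbol in obs[1:]:
        step, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda r: best[r] * trans[r][s])
            pointers[s] = prev
            step[s] = best[prev] * trans[prev][s] * emit[s][symbol]
        backpointers.append(pointers)
        best = step
    # Trace back from the best final state to recover the full path.
    last = max(states, key=lambda s: best[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]


path, probability = viterbi("AAAA", STATES, START, TRANS, EMIT)
print(path, probability)
```

The structure mirrors the forward recursion with the sum replaced by a maximum, plus backpointers so that the winning path can be read off at the end.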
9 Methods, contd.
Training problem: how do we estimate the structure and parameters of an HMM from data?
10 HMMs: Some Applications
Gene finding and prediction
Protein profile analysis
Secondary structure prediction
Copy number variation
Characterizing SNPs
12 HMMs: Applications
Classification: classifying observations within a sequence
Order: a DNA sequence is a set of ordered observations
Structure: can be intuitively defined
Measure of success: number of complete exons correctly labeled
Training data: available from various genome annotation projects
13 Hidden Markov Models in Computational Biology: HMMs for Gene Finding
Training: Expectation Maximization (EM)
Parsing: Viterbi algorithm
[Figure: an HMM for unspliced genes. x: non-coding DNA; c: coding state. From Salzberg, Chapter 4, p. 50.]
14 Genefinders: a Comparison
Sn = Sensitivity
Sp = Specificity
Ac = Approximate Correlation
ME = Missing Exons
WE = Wrong Exons
GENSCAN Performance Data
15 Protein Profile HMMs
Motivation: given a single amino acid target sequence of unknown structure, we want to infer the structure of the resulting protein. Use profile similarity.
What is a profile?
Protein families of related sequences and structures
Same function
Clear evolutionary relationship
Patterns of conservation: some positions are more conserved than others
16 HMMs From Alignment
ACA - - - ATG
TCA A C T ATC
ACA C - - AGC
AGA - - - ATC
ACC G - - ATC
(The middle columns are the insertion region.)
[Figure labels: transition probabilities, output probabilities.]
An HMM for a DNA motif alignment. The transitions are shown with arrows whose thickness indicates their probability. In each state, the histogram shows the probabilities of the four bases.
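The output and transition probabilities of such a model can be estimated by counting in the alignment. The sketch below assumes the five rows from the slide, with rows that have no insertion padded by gap characters, and assumes the three middle columns form the insert region; no pseudocounts are applied, which a practical model would normally add.

```python
from collections import Counter

# Rows from the slide; gap padding for rows without an insertion is an assumption.
ALIGNMENT = ["ACA---ATG", "TCAACTATC", "ACAC--AGC", "AGA---ATC", "ACCG--ATC"]
INSERT_COLUMNS = range(3, 6)  # assumed insert region (0-based columns 3..5)


def column_frequencies(rows, col):
    """Relative base frequencies in one alignment column, ignoring gaps."""
    counts = Counter(row[col] for row in rows if row[col] != "-")
    total = sum(counts.values())
    return {base: n / total for base, n in counts.items()}


# One output distribution per match column.
match_emissions = [
    column_frequencies(ALIGNMENT, c)
    for c in range(len(ALIGNMENT[0]))
    if c not in INSERT_COLUMNS
]

# A single insert state pools every residue found in the insert region.
insert_counts = Counter(
    row[c] for row in ALIGNMENT for c in INSERT_COLUMNS if row[c] != "-"
)
n_insert = sum(insert_counts.values())
insert_emissions = {base: n / n_insert for base, n in insert_counts.items()}

# Transitions: a row enters the insert state if it has a residue in the first
# insert column (residues are left-justified within the region in this example).
entering = sum(1 for row in ALIGNMENT if row[INSERT_COLUMNS[0]] != "-")
p_enter_insert = entering / len(ALIGNMENT)
p_stay_insert = (n_insert - entering) / n_insert

print(match_emissions[0], insert_emissions, p_enter_insert, p_stay_insert)
```

With these counts, the first match state emits A with probability 0.8, the insert state emits C with probability 0.4, three of five sequences enter the insert state (0.6), and the insert state loops back on itself with probability 0.4.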
17 HMMs from Alignments
State types: deletion states, matching states, insertion states
Number of matching states = average sequence length in the family
PFAM database of protein families
18 Database Searching
Given an HMM, M, for a sequence family, find all members of the family in a database.
LL score: LL(x) = log P(x|M)
(The LL score is length dependent, so it must be normalized or replaced by a Z-score.)
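A small sketch of the two corrections mentioned on this slide, assuming the LL scores have already been computed elsewhere. The function names are illustrative, and dividing by length and standardizing against a background set are only one plausible reading of "normalize or use Z-score".

```python
import statistics


def length_normalized(ll_score, length):
    """Per-residue log-likelihood, so sequences of different length compare."""
    return ll_score / length


def z_score(score, background_scores):
    """Standard deviations by which `score` exceeds a background sample.

    `background_scores` is assumed to hold scores of unrelated sequences of
    similar length; it needs at least two values.
    """
    mean = statistics.mean(background_scores)
    spread = statistics.stdev(background_scores)
    return (score - mean) / spread


# Hypothetical numbers, purely to show the calls.
print(length_normalized(-50.0, 25))
print(z_score(3.0, [1.0, 2.0, 3.0]))
```

A raw LL score always falls as the sequence gets longer, because every extra residue multiplies in another probability below 1. That is why a threshold on the unnormalized score would favor short sequences.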
19 Querying a Sequence
Suppose I have a query protein sequence and want to know which family it belongs to. Many paths through the model can generate this sequence, so we need to find all of these paths and sum their probabilities.
Consensus sequence: ACAC - - ATC
P(ACACATC) = 0.8×1 × 0.8×1 × 0.8×0.6 × 0.4×0.6 × 1×1 × 0.8×1 × 0.8 = 4.7 × 10^-2
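The consensus-path product can be reproduced by separating the factors into output and transition probabilities. The pairing below is an interpretation of the slide's grouping (each "a×b" read as an output probability followed by the transition leaving that state, with the fourth state taken to be the insert state); the state labels are assumptions.

```python
import math

# Output probability of each symbol of ACACATC in the state that emits it.
# Assumed state order: match, match, match, insert, match, match, match.
emissions = [0.8, 0.8, 0.8, 0.4, 1.0, 0.8, 0.8]

# Transition probability between consecutive states on the path.
transitions = [1.0, 1.0, 0.6, 0.6, 1.0, 1.0]

consensus_probability = math.prod(emissions) * math.prod(transitions)
log_consensus = math.log(consensus_probability)

print(f"{consensus_probability:.3g}")
```

This is the probability of one path only. The total probability of the sequence would add the contributions of any other paths that can also emit ACACATC.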
20 Multiple Alignments
Try every possible path through the model that would produce the target sequences.
Keep the best one and its probability.
Output: a sequence of match, insert and delete states.
Method: Viterbi algorithm (dynamic programming).
21 HMMs from Unaligned Sequences
Baum-Welch expectation-maximization method:
Start with a model whose length matches the average length of the sequences and with random output and transition probabilities.
Align all the sequences to the model.
Use the alignment to alter the output and transition probabilities.
Repeat until the model stops changing.
By-product: a multiple alignment.
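The loop above can be sketched with a small Baum-Welch implementation. To stay short, this sketch trains a generic fully connected HMM rather than the match/insert/delete architecture of a profile HMM, uses unscaled probabilities (adequate only for short sequences), and runs a fixed number of iterations instead of testing for convergence. The number of states, the seed and all function names are assumptions; the training strings are the slide's DNA motif rows with gaps removed.

```python
import math
import random


def normalize(values):
    total = sum(values)
    if total <= 0:  # guard: fall back to a uniform row
        return [1.0 / len(values)] * len(values)
    return [v / total for v in values]


def forward(obs, start, trans, emit):
    n = len(start)
    alpha = [[start[i] * emit[i][obs[0]] for i in range(n)]]
    for sym in obs[1:]:
        prev = alpha[-1]
        alpha.append([emit[j][sym] * sum(prev[i] * trans[i][j] for i in range(n))
                      for j in range(n)])
    return alpha


def backward(obs, trans, emit):
    n = len(trans)
    beta = [[1.0] * n]
    for sym in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, [sum(trans[i][j] * emit[j][sym] * nxt[j] for j in range(n))
                        for i in range(n)])
    return beta


def baum_welch_step(seqs, alphabet, start, trans, emit):
    """One EM update; also returns the log-likelihood under the OLD parameters."""
    n = len(start)
    start_acc = [0.0] * n
    trans_acc = [[0.0] * n for _ in range(n)]
    emit_acc = [[0.0] * len(alphabet) for _ in range(n)]
    log_likelihood = 0.0
    for seq in seqs:
        obs = [alphabet.index(ch) for ch in seq]
        alpha = forward(obs, start, trans, emit)
        beta = backward(obs, trans, emit)
        px = sum(alpha[-1])
        log_likelihood += math.log(px)
        for t, sym in enumerate(obs):
            for i in range(n):
                gamma = alpha[t][i] * beta[t][i] / px  # P(state i at t | seq)
                emit_acc[i][sym] += gamma
                if t == 0:
                    start_acc[i] += gamma
                if t + 1 < len(obs):
                    for j in range(n):  # expected i -> j transition counts
                        trans_acc[i][j] += (alpha[t][i] * trans[i][j]
                                            * emit[j][obs[t + 1]]
                                            * beta[t + 1][j] / px)
    return (normalize(start_acc), [normalize(r) for r in trans_acc],
            [normalize(r) for r in emit_acc], log_likelihood)


def train(seqs, alphabet="ACGT", n_states=3, iterations=20, seed=0):
    rng = random.Random(seed)

    def random_row(k):
        return normalize([rng.random() + 0.1 for _ in range(k)])

    start = random_row(n_states)
    trans = [random_row(n_states) for _ in range(n_states)]
    emit = [random_row(len(alphabet)) for _ in range(n_states)]
    history = []
    for _ in range(iterations):
        start, trans, emit, ll = baum_welch_step(seqs, alphabet, start, trans, emit)
        history.append(ll)
    return start, trans, emit, history


SEQS = ["ACAATG", "TCAACTATC", "ACACAGC", "AGAATC", "ACCGATC"]
start, trans, emit, history = train(SEQS)
print(history[0], history[-1])
```

The log-likelihood recorded at each iteration should never decrease, which is the EM guarantee behind "repeat until the model stops changing". Baum-Welch weights every possible alignment by its posterior probability; committing to the single best alignment at each round, as the slide's wording suggests, is the closely related Viterbi training variant.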
22 PHMM Example
An alignment of 30 short amino acid sequences chopped out of an alignment of the SH3 domain. The shaded areas are the most conserved and were represented by the main states in the HMM. The unshaded area was represented by an insert state.