Probabilistic Sequence Alignment BMI 877 Colin Dewey February 25, 2014

What you've seen thus far
– The pairwise sequence alignment task
– The notion of a "best" alignment: scoring schemes
– Dynamic programming algorithms for efficiently finding a "best" alignment
– Variants of the task: global, local, different gap penalty functions
– Heuristic methods for large-scale alignment (BLAST)

Tasks addressed today
– How can we express the uncertainty of an alignment?
– How can we estimate the parameters for alignment?
– How can we align multiple sequences to each other?

Picking Alignments
[Figure: two alternative global alignments of a D. melanogaster (mel) sequence with its D. pseudoobscura (pse) ortholog, shown in full; asterisks mark matching columns.]
– Alignment 1 summary: 27 mismatches, 12 gaps, 116 spaces
– Alignment 2 summary: 45 mismatches, 4 gaps, 214 spaces

An Elusive Cis-Regulatory Element
[Figure: Drosophila melanogaster polytene chromosomes]
>chr3R:
TGTTGTGTGATGTTGATTTCTTTACGACTCCTATCAAACTAAACCCATAAAGCATTCAAT
TCAAAGCATATACATGTGAAAATCCCAGCGAGAACTCCTTATTAATCCAGCGCAGTCGGC
GGCGGCGGCGCGCAGTCAGCGGTGGCAGCGCAGTATATAAATAAAGTCTTATAAGAAACT
CGTGAGCGAAAGAGAGCGTTTTATTTATGTGCGTCAGCGTCGGCCGCAACAGCGCCGTCA
GCACTGGCAGCGACTGCGAC
Adf1→Antp:06447: binding site for the transcription factor Adf1 (Antp: antennapedia)

The Conservation of Adf1→Antp:06447
[Figure: the two alternative alignments from the previous slide (Alignment 1: 27 mismatches, 12 gaps, 116 spaces; Alignment 2: 45 mismatches, 4 gaps, 214 spaces), with the Adf1→Antp:06447 binding-site region highlighted.]
– In one alignment, the binding-site region mel TGTGCGTCAGCGTCGGCCGCAACAGCG is paired with pse TGTG ACTGCG: 33% identity
– In the other, it is paired with pse TGTGCGCCAGCGTCAGCGCCAGCGCCG: 74% identity

The Polytope
[Figure: the alignment polytope for the two sequences (lengths 260bp and 280bp), annotated with the two alternative alignments of the binding-site region.]
– 364 vertices
– 760 ridges
– 398 facets

Methodological machinery to be used
– Hidden Markov models (HMMs)
  – Viterbi and Forward algorithms
  – Profile HMMs
  – Pair HMMs
  – Connections to classical sequence alignment

Hidden Markov models
– Generative probabilistic models of sequences
– Explicitly model unobserved (hidden) states that "emit" the characters of the observed sequence
– Primary task of interest: infer the hidden states given the observed sequence
– Alignment case: hidden states = alignment

Two HMM random variables
– Observed sequence: $x = x_1 x_2 \dots x_L$
– Hidden state sequence (path): $\pi = \pi_1 \pi_2 \dots \pi_L$
– HMM:
  – Markov chain over the hidden state sequence: $P(\pi_i \mid \pi_1, \dots, \pi_{i-1}) = P(\pi_i \mid \pi_{i-1})$
  – Dependence between $x_i$ and $\pi_i$: each observed character depends only on the hidden state that emits it

The Parameters of an HMM
– As in Markov chain models, we have transition probabilities: $a_{kl} = P(\pi_i = l \mid \pi_{i-1} = k)$, the probability of a transition from state $k$ to state $l$; $\pi$ represents a path (sequence of states) through the model
– Since we've decoupled states and characters, we also have emission probabilities: $e_k(b) = P(x_i = b \mid \pi_i = k)$, the probability of emitting character $b$ in state $k$

A Simple HMM with Emission Parameters
[Figure: an HMM with begin and end states and four emitting states, each with its own emission distribution, e.g. A 0.4, C 0.1, G 0.2, T 0.3 for one state and A 0.1, C 0.4, G 0.4, T 0.1 for another; arrows are labeled with transition probabilities, e.g. 0.8. Callouts highlight the probability of emitting character A in state 2 and the probability of a transition from state 1 to another state.]
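
As a concrete rendering of the generative view, the Python sketch below defines an HMM like the figure's and samples a (path, sequence) pair. The emission tables match the slide's; the topology and values in TRANSITIONS are hypothetical stand-ins, since the figure's actual transition probabilities are not recoverable from this transcript.

```python
import random

# Emission distributions from the example HMM (emitting states 1-4).
EMISSIONS = {
    1: {"A": 0.4, "C": 0.1, "G": 0.2, "T": 0.3},
    2: {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    3: {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    4: {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
}

# Hypothetical transition probabilities (begin = state 0, end = state 5);
# the actual values in the figure are not recoverable.
TRANSITIONS = {
    0: {1: 0.5, 2: 0.5},
    1: {1: 0.2, 3: 0.8},
    2: {2: 0.4, 4: 0.6},
    3: {5: 1.0},
    4: {5: 1.0},
}

def sample(dist):
    """Draw a key from a {key: probability} dict."""
    r, total = random.random(), 0.0
    for key, p in dist.items():
        total += p
        if r < total:
            return key
    return key  # guard against floating-point round-off

def generate():
    """Sample a (path, sequence) pair from the HMM."""
    state, path, seq = 0, [], []
    while True:
        state = sample(TRANSITIONS[state])
        if state == 5:  # end state: stop, emitting nothing
            break
        path.append(state)
        seq.append(sample(EMISSIONS[state]))
    return path, "".join(seq)

print(generate())  # e.g. ([1, 3], 'AT')
```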

Three Important Questions
– How likely is a given sequence? (the Forward algorithm)
– What is the most probable "path" (sequence of hidden states) for generating a given sequence? (the Viterbi algorithm)
– How can we learn the HMM parameters given a set of sequences? (the Forward-Backward / Baum-Welch algorithm)

How Likely is a Given Path and Sequence?
The probability that the path $\pi$ is taken and the sequence $x$ is generated:
$$P(x, \pi) = a_{0\pi_1} \prod_{i=1}^{L} e_{\pi_i}(x_i)\, a_{\pi_i \pi_{i+1}}$$
(assuming begin/end are the only silent states on the path, with $\pi_{L+1}$ the end state)

How Likely Is A Given Path and Sequence?
[Figure: the example HMM again, with a specific path and sequence whose joint probability is computed term by term.]

How Likely is a Given Sequence?
– We usually only observe the sequence, not the path
– To find the probability of a sequence, we must sum over all possible paths: $P(x) = \sum_{\pi} P(x, \pi)$
– But the number of paths can be exponential in the length of the sequence...
– The Forward algorithm enables us to compute this efficiently

How Likely is a Given Sequence: The Forward Algorithm
– A dynamic programming solution
– Subproblem: define $f_k(i)$ to be the probability of generating the first $i$ characters and ending in state $k$
– We want to compute $f_N(L)$, the probability of generating the entire sequence and ending in the end state (state $N$)
– Can define this recursively

The Forward Algorithm
– Because of the Markov property, we don't have to explicitly enumerate every path
– e.g. a state's value $f_l(i)$ is computed from its predecessors' values $f_k(i-1)$
[Figure: the example HMM, with the relevant forward values highlighted.]

The Forward Algorithm
Initialization:
$$f_0(0) = 1$$
the probability that we're in the start state (state 0) and have observed 0 characters from the sequence; $f_k(0) = 0$ for all other states $k$

The Forward Algorithm
Recursion for emitting states ($i = 1 \dots L$):
$$f_l(i) = e_l(x_i) \sum_k f_k(i-1)\, a_{kl}$$
Recursion for silent states:
$$f_l(i) = \sum_k f_k(i)\, a_{kl}$$

The Forward Algorithm
Termination:
$$P(x) = f_N(L) = \sum_k f_k(L)\, a_{kN}$$
the probability that we're in the end state and have observed the entire sequence
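
Putting initialization, recursion, and termination together, here is a minimal runnable Forward-algorithm sketch in Python for an HMM whose only silent states are begin and end. It reuses the example emission tables, with the same hypothetical transition probabilities as in the earlier sketch.

```python
# Forward algorithm: only silent states are begin (0) and end (5).
EMISSIONS = {
    1: {"A": 0.4, "C": 0.1, "G": 0.2, "T": 0.3},
    2: {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    3: {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    4: {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
}
TRANSITIONS = {  # TRANSITIONS[k][l] = a_kl (assumed values)
    0: {1: 0.5, 2: 0.5},
    1: {1: 0.2, 3: 0.8},
    2: {2: 0.4, 4: 0.6},
    3: {5: 1.0},
    4: {5: 1.0},
}
END = 5

def forward(x):
    """Return P(x) = f_N(L): the probability of x, summed over all paths."""
    f = {0: 1.0}  # initialization: f_0(0) = 1
    for c in x:
        # recursion: f_l(i) = e_l(x_i) * sum_k f_k(i-1) * a_kl
        f = {l: e[c] * sum(f[k] * TRANSITIONS.get(k, {}).get(l, 0.0) for k in f)
             for l, e in EMISSIONS.items()}
    # termination: P(x) = sum_k f_k(L) * a_kN
    return sum(f[k] * TRANSITIONS.get(k, {}).get(END, 0.0) for k in f)

print(forward("TAGA"))  # ≈ 1.92e-4 under these stand-in parameters
```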

Forward Algorithm Example
[Figure: the example HMM with its emission tables.]
given the sequence x = TAGA

Forward Algorithm Example
given the sequence x = TAGA
Initialization: $f_0(0) = 1$
Computing other values: e.g. $f_1(1) = e_1(T)\, f_0(0)\, a_{01}$, and so on for each state and position, finishing with $P(\mathrm{TAGA}) = f_N(4)$

Three Important Questions
– How likely is a given sequence?
– What is the most probable "path" for generating a given sequence?
– How can we learn the HMM parameters given a set of sequences?

Finding the Most Probable Path: The Viterbi Algorithm
– Dynamic programming approach, again!
– Subproblem: define $v_k(i)$ to be the probability of the most probable path accounting for the first $i$ characters of $x$ and ending in state $k$
– We want to compute $v_N(L)$, the probability of the most probable path accounting for all of the sequence and ending in the end state
– Can define recursively; can use DP to find efficiently

Finding the Most Probable Path: The Viterbi Algorithm
Initialization:
$$v_0(0) = 1, \qquad v_k(0) = 0 \text{ for } k > 0$$

The Viterbi Algorithm
Recursion for emitting states ($i = 1 \dots L$):
$$v_l(i) = e_l(x_i) \max_k \left[ v_k(i-1)\, a_{kl} \right]$$
Recursion for silent states:
$$v_l(i) = \max_k \left[ v_k(i)\, a_{kl} \right]$$
Keep track of the most probable path: $\mathrm{ptr}_l(i) = \arg\max_k \left[ v_k(i-1)\, a_{kl} \right]$

The Viterbi Algorithm
Termination:
$$P(x, \pi^*) = \max_k \left[ v_k(L)\, a_{kN} \right]$$
Traceback: follow pointers back, starting at $\pi_L^* = \arg\max_k \left[ v_k(L)\, a_{kN} \right]$
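
A matching Viterbi sketch with traceback, under the same assumed example model as the Forward sketch above (emission tables from the slides; transition probabilities are hypothetical stand-ins):

```python
# Viterbi algorithm with traceback for the hypothetical example HMM.
EMISSIONS = {
    1: {"A": 0.4, "C": 0.1, "G": 0.2, "T": 0.3},
    2: {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    3: {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
    4: {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
}
TRANSITIONS = {  # hypothetical transition probabilities, as before
    0: {1: 0.5, 2: 0.5},
    1: {1: 0.2, 3: 0.8},
    2: {2: 0.4, 4: 0.6},
    3: {5: 1.0},
    4: {5: 1.0},
}
END = 5

def viterbi(x):
    """Return (probability, path) for the most probable path generating x."""
    v = {0: 1.0}  # initialization: v_0(0) = 1
    ptr = []      # ptr[i][l] = best predecessor of state l at position i
    for c in x:
        nxt, back = {}, {}
        for l in EMISSIONS:
            # v_l(i) = e_l(x_i) * max_k v_k(i-1) * a_kl
            best_k, best_p = None, 0.0
            for k, p in v.items():
                cand = p * TRANSITIONS.get(k, {}).get(l, 0.0)
                if cand > best_p:
                    best_k, best_p = k, cand
            nxt[l], back[l] = EMISSIONS[l][c] * best_p, best_k
        ptr.append(back)
        v = nxt
    # termination: max_k v_k(L) * a_kN
    last = max(v, key=lambda k: v[k] * TRANSITIONS.get(k, {}).get(END, 0.0))
    prob = v[last] * TRANSITIONS.get(last, {}).get(END, 0.0)
    path = [last]
    for back in reversed(ptr[1:]):  # traceback: follow pointers to the start
        path.append(back[path[-1]])
    return prob, list(reversed(path))

print(viterbi("TAGA"))  # (0.0001536, [1, 1, 1, 3]) under these stand-ins
```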

Forward & Viterbi Algorithms
[Figure: all possible paths through the example HMM for a sequence of length 4.]
– The Forward/Viterbi algorithms effectively consider all possible paths for a sequence
  – Forward: to find the probability of a sequence
  – Viterbi: to find the most probable path

HMM parameter estimation
– Easy if the hidden path is known for each sequence
– In general, the paths are unknown
– The Baum-Welch (Forward-Backward) algorithm is used to compute maximum likelihood estimates
– The Backward algorithm is the analog of the Forward algorithm, computing probabilities of suffixes of a sequence

Learning Parameters: The Baum-Welch Algorithm
Algorithm sketch:
– initialize the parameters of the model
– iterate until convergence:
  – calculate the expected number of times each transition or emission is used
  – adjust the parameters to maximize the likelihood of these expected values (see the formulas below)
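
For reference, the expected counts in the E-step can be written in terms of Forward and Backward values, with the M-step normalizing them. This is the standard formulation (as in Durbin et al.), not notation taken from the slide:

```latex
% E-step: expected number of k -> l transitions (per training sequence x)
A_{kl} = \frac{1}{P(x)} \sum_i f_k(i)\, a_{kl}\, e_l(x_{i+1})\, b_l(i+1)

% E-step: expected number of emissions of character b from state k
E_k(b) = \frac{1}{P(x)} \sum_{i \,:\, x_i = b} f_k(i)\, b_k(i)

% M-step: re-estimate the parameters from the expected counts
a_{kl} \leftarrow \frac{A_{kl}}{\sum_{l'} A_{kl'}} , \qquad
e_k(b) \leftarrow \frac{E_k(b)}{\sum_{b'} E_k(b')}
```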

How can we use HMMs for pairwise alignment?
– What is the observed sequence?
  – one of the two sequences?
  – both sequences?
– What is the hidden path?
  – the alignment

Profile HMM for pairwise alignment
– Select one sequence to be observed (the query)
– The other sequence (the reference) defines the states of the HMM
– Three classes of states:
  – Match: corresponds to aligned positions
  – Delete: positions of the reference that are deleted in the query
  – Insert: positions of the query that are insertions relative to the reference

Profile HMMs
[Figure: a profile HMM with match states m1–m3, insert states i0–i3, delete states d1–d3, plus start and end states.]
– Match states represent key conserved positions
– Insert states account for extra characters in some sequences
– Delete states are silent; they account for characters missing in some sequences
– Insert and match states have emission distributions over sequence characters (e.g. A 0.01, R 0.12, D 0.04, N 0.29, C 0.01, E 0.03, Q 0.02, G 0.01, ...)

Example Profile HMM
[Figure from A. Krogh, "An Introduction to Hidden Markov Models for Biological Sequences": a profile HMM with match states, delete states (silent), and insert states.]

Profile HMM considerations
– Odd asymmetry: we have to pick one sequence as the reference
– Models the conditional probability P(X|Y) of the query sequence (X) given the reference sequence (Y)
– Is there something more natural here? Yes: Pair HMMs
– We will revisit profile HMMs for multiple alignment a bit later

Pair Hidden Markov Models
Each non-silent state emits one or a pair of characters:
– H: homology (match) state, emitting a pair of characters
– I: insert state, emitting a single character of one sequence
– D: delete state, emitting a single character of the other sequence

PHMM Paths = Alignments
sequence 1: AAGCGC
sequence 2: ATGTC

hidden:    B  H  H  I  I  H  D  H  E
observed:     A  A  G  C  G  -  C
              A  T  -  -  G  T  C

Transition Probabilities
Probabilities of moving between states at each step (row: state at step $i$; column: state at step $i+1$):

              H         I    D    E
  B           1-2δ-τ    δ    δ    τ
  H           1-2δ-τ    δ    δ    τ
  I           1-ε-τ     ε    0    τ
  D           1-ε-τ     0    ε    τ

Emission Probabilities
– The homology (H) state emits pairs of characters: a joint distribution over pairs (a, b) with a, b ∈ {A, C, G, T}
– The insertion (I) and deletion (D) states emit single characters, e.g. A 0.3, C 0.2, G 0.3, T 0.2
[Figure: emission tables for the H, I, and D states.]

Pair HMM Viterbi
$v^H(i,j)$, $v^I(i,j)$, $v^D(i,j)$: the probability of the most likely sequence of hidden states generating the length-$i$ prefix of $x$ and the length-$j$ prefix of $y$, with the last state being H, I, or D, respectively. In the standard formulation (following Durbin et al., with $p_{ab}$ the H-state pair emission probabilities and $q_a$ the I/D single-character emission probabilities):
$$v^H(i,j) = p_{x_i y_j} \max \begin{cases} (1-2\delta-\tau)\, v^H(i-1,j-1) \\ (1-\varepsilon-\tau)\, v^I(i-1,j-1) \\ (1-\varepsilon-\tau)\, v^D(i-1,j-1) \end{cases}$$
$$v^I(i,j) = q_{x_i} \max \begin{cases} \delta\, v^H(i-1,j) \\ \varepsilon\, v^I(i-1,j) \end{cases} \qquad v^D(i,j) = q_{y_j} \max \begin{cases} \delta\, v^H(i,j-1) \\ \varepsilon\, v^D(i,j-1) \end{cases}$$
Note: the recurrence relations presented in lecture also allow I→D and D→I transitions, which the formulation above omits.

PHMM Alignment
– Calculate the probability of the most likely alignment
– Traceback, as in Needleman-Wunsch (NW), to obtain the sequence of states giving the highest probability: HIDHHDDIIHH...
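
A compact Python sketch of pair HMM Viterbi with traceback, using the standard recurrences above (so no I↔D transitions, unlike the lecture's variant). The parameter values DELTA, EPSILON, TAU and the emission models Q and p_pair are illustrative assumptions, not values from the lecture.

```python
from math import log

DELTA, EPSILON, TAU = 0.2, 0.3, 0.05               # assumed gap open / extend / stop
Q = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}   # assumed background frequencies q_a

def p_pair(a, b):
    """Assumed H-state pair emission p_ab: favors identical characters."""
    return 0.15 if a == b else 0.03

def pair_viterbi(x, y):
    """Return the state sequence (alignment) of the most probable path."""
    n, m = len(x), len(y)
    NEG = float("-inf")
    # v[s][i][j]: log-probability of the best path ending in state s at (i, j)
    v = {s: [[NEG] * (m + 1) for _ in range(n + 1)] for s in "HID"}
    ptr = {s: [[None] * (m + 1) for _ in range(n + 1)] for s in "HID"}
    v["H"][0][0] = 0.0  # the begin state transitions like H
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:  # H emits the pair (x_i, y_j)
                s, best = max(
                    [("H", v["H"][i-1][j-1] + log(1 - 2*DELTA - TAU)),
                     ("I", v["I"][i-1][j-1] + log(1 - EPSILON - TAU)),
                     ("D", v["D"][i-1][j-1] + log(1 - EPSILON - TAU))],
                    key=lambda c: c[1])
                v["H"][i][j] = best + log(p_pair(x[i-1], y[j-1]))
                ptr["H"][i][j] = s
            if i > 0:            # I emits x_i only
                s, best = max([("H", v["H"][i-1][j] + log(DELTA)),
                               ("I", v["I"][i-1][j] + log(EPSILON))],
                              key=lambda c: c[1])
                v["I"][i][j] = best + log(Q[x[i-1]])
                ptr["I"][i][j] = s
            if j > 0:            # D emits y_j only
                s, best = max([("H", v["H"][i][j-1] + log(DELTA)),
                               ("D", v["D"][i][j-1] + log(EPSILON))],
                              key=lambda c: c[1])
                v["D"][i][j] = best + log(Q[y[j-1]])
                ptr["D"][i][j] = s
    # termination: every state reaches the end state with probability tau,
    # so the argmax over v[s][n][m] is unchanged by that final factor.
    state = max("HID", key=lambda s: v[s][n][m])
    path, i, j = [], n, m
    while i > 0 or j > 0:        # traceback, as in Needleman-Wunsch
        path.append(state)
        prev = ptr[state][i][j]
        if state == "H":
            i, j = i - 1, j - 1
        elif state == "I":
            i -= 1
        else:
            j -= 1
        state = prev
    return "".join(reversed(path))

print(pair_viterbi("AAGCGC", "ATGTC"))  # a state string such as 'HHIIHDH'
```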

Correspondence with Needleman-Wunsch (NW)
NW values ≈ logarithms of Pair HMM Viterbi values
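
One way to see this correspondence (a sketch, not necessarily the lecture's derivation): take logarithms in the H recurrence above, which turns products into sums and yields a NW-style max-plus recurrence.

```latex
% Taking logs of the pair HMM Viterbi recurrence (V = \log v):
V^H(i,j) = \log p_{x_i y_j} + \max
\begin{cases}
\log(1 - 2\delta - \tau) + V^H(i-1, j-1) \\
\log(1 - \varepsilon - \tau) + V^I(i-1, j-1) \\
\log(1 - \varepsilon - \tau) + V^D(i-1, j-1)
\end{cases}
% This has the shape of the NW recurrence, with a substitution score of
% roughly \log p_{ab} plus a constant, and gap costs built from
% \log\delta (gap open) and \log\varepsilon (gap extend).
```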

Posterior Probabilities
– There are similar recurrences for the Forward and Backward values
– From the Forward and Backward values, we can calculate the posterior probability that the path passes through a certain state S after generating the length-$i$ and length-$j$ prefixes:
$$P(\text{state } S \text{ at } (i,j) \mid x, y) = \frac{f_S(i,j)\, b_S(i,j)}{P(x, y)}$$

Uses for Posterior Probabilities
– Sampling of suboptimal alignments
– Posterior probability of pairs of residues being homologous (aligned to each other)
– Posterior probability of a residue being gapped
– Training model parameters (EM)

Posterior Probabilities
[Figure: plot of the posterior probability of each alignment column.]

Parameter Training
– Supervised training
  – given: sequences and correct alignments
  – do: calculate parameter values that maximize the joint likelihood of the sequences and alignments
– Unsupervised training
  – given: sequence pairs, but no alignments
  – do: calculate parameter values that maximize the marginal likelihood of the sequences (summing over all possible alignments); formalized below
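
Stated as objectives (a sketch in the notation above; θ, which collects the transition and emission parameters, and the superscript indexing of training pairs are this sketch's notation):

```latex
% Supervised: maximize the joint likelihood of sequences and known alignments
\hat{\theta} = \arg\max_{\theta} \prod_j P\big(x^{(j)}, y^{(j)}, \pi^{(j)} \mid \theta\big)

% Unsupervised: maximize the marginal likelihood, summing over all alignments
\hat{\theta} = \arg\max_{\theta} \prod_j \sum_{\pi} P\big(x^{(j)}, y^{(j)}, \pi \mid \theta\big)
```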

Multiple Alignment with Profile HMMs
– Given a set of sequences to be aligned:
  – use Baum-Welch to learn the parameters of the model
  – may also adjust the length of the profile HMM during training
– To compute a multiple alignment given the profile HMM:
  – run the Viterbi algorithm on each sequence
  – the Viterbi paths indicate correspondences among the sequences

Multiple Alignment with Profile HMMs
[Figure omitted in transcript.]

More common multiple alignment strategy: Progressive alignment
[Figure: a guide tree; pairwise alignments at the leaves are progressively merged into a single multiple alignment.]
Align TGTAAC and TGTAC:
  TGTAAC
  TGT-AC
Align ATGTC and ATGTGGC:
  ATGT--C
  ATGTGGC
Merge the two pairwise alignments:
  -TGTAAC
  -TGT-AC
  ATGT--C
  ATGTGGC
Then add TGTTAAC to obtain the final alignment:
  -TGTTAAC
  -TGT-AAC
  -TGT--AC
  ATGT---C
  ATGT-GGC

Classification w/ Profile HMMs
– To classify sequences according to family, we can train a profile HMM to model the proteins of each family of interest
– Given a sequence x, use Bayes' rule to make a classification
[Figure: profile HMMs for families such as β-galactosidase, β-glucanase, β-amylase, α-amylase.]
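
Concretely, with one trained profile HMM per family, Bayes' rule gives the following (the family priors $P(c_k)$ are an ingredient this sketch assumes is available or estimated):

```latex
% Posterior probability that query x belongs to family c_k; each
% likelihood P(x | c_k) is computed with the Forward algorithm on
% family k's profile HMM.
P(c_k \mid x) = \frac{P(x \mid c_k)\, P(c_k)}{\sum_j P(x \mid c_j)\, P(c_j)}
```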

PFAM
– A large database of protein families
– Each family has a trained profile HMM
[Figure: example search results with a globin sequence.]

Summary
Probabilistic models for alignment are more powerful than classical combinatorial alignment algorithms:
– they capture uncertainty in alignment
– they allow for principled estimation of parameters
– they are easily used in classification settings