Bioinformatics: Statistical methods for pattern searching
Ulf Schmitz
Bioinformatics and Systems Biology Group

Outline
1. Expectation Maximization Algorithm
2. Markov Models
3. Hidden Markov Models

Expectation Maximization Algorithm
- an algorithm for locating similar sequence patterns in a set of sequences
- the candidate pattern regions are aligned
- an expected scoring matrix, representing the distribution of sequence characters in each column of the alignment, is generated
- the pattern is matched to each sequence, and the scoring-matrix values are updated to maximize the alignment of the matrix to the sequences
- this procedure is repeated until there is no further improvement

Expectation Maximization Algorithm

Start from a preliminary local alignment of the sequences (here, ten sequences of 100 nucleotides each). The aligned windows provide initial estimates of the frequencies of the nucleotides in each motif column; the columns not in the motif provide the background frequencies.

seq1:  … TCAGAATGCAGCATAG …
seq2:  … CGCATAGAGCATAGAC …
seq3:  … ACAGACAAAAAAATAC …
seq4:  … CATAGCAGATACAGCA …
  ⋮
seq10: …

The motif windows above come from the full sequences, e.g.:

seq1: ACATAGACAGTATAGAGAATCAGAATGCAGCATAGCAGCACATAGAGCAGCATAG
seq2: TAGACCATAGACCGATACGCGCATAGAGCATAGACACGATAGCATAGCATAGCAT
seq3: TACAGATCAGCAAGAGCCGACAGACAAAAAAATACGAGCAAAACGAGCATTATCG
seq4: TAGGGGACACAGATACAGACATAGCAGATACAGCATAGACATAGACAGATAGCAG

Expectation Maximization Algorithm

seq1: … TCAGAATGCAGCATAG …
seq2: … CGCATAGAGCATAGAC …
seq3: … ACAGACAAAAAAATAC …
seq4: … CATAGCAGATACAGCA …
  ⋮
seq10: …

From these windows a 4 × 17 matrix of base frequencies is built: one row per base (G, C, A, T), with one background column followed by sixteen site columns. The first column gives the background frequencies in the flanking sequence; site columns 1–16 give the observed base frequencies at each motif position. (The numeric entries of the matrix are not reproduced here.)
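The count-and-frequency table can be sketched in code. This is an illustrative sketch: the window and flanking strings below stand in for the slide's data, whose numeric entries are not reproduced in the transcript.

```python
# Illustrative sketch: build per-column base frequencies for a set of
# candidate 16-nt motif windows, plus one background column estimated
# from the flanking sequence.
from collections import Counter

BASES = "ACGT"

def frequency_matrix(windows, flanking):
    """Return {base: [background_freq, col1_freq, ..., colW_freq]}."""
    width = len(windows[0])
    bg = Counter(flanking)
    total_bg = sum(bg.values())
    matrix = {b: [bg[b] / total_bg] for b in BASES}
    for col in range(width):
        counts = Counter(w[col] for w in windows)
        for b in BASES:
            matrix[b].append(counts[b] / len(windows))
    return matrix

# motif windows taken from the slide; the flanking string is a stand-in
windows = ["TCAGAATGCAGCATAG", "CGCATAGAGCATAGAC",
           "ACAGACAAAAAAATAC", "CATAGCAGATACAGCA"]
flanking = "ACATAGACAGTATAGAGAATCAGCACATAGAGCAGCATAG"
m = frequency_matrix(windows, flanking)   # 4 rows x 17 columns
```

Each column of the resulting table sums to one, as a frequency distribution should.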

Expectation Maximization Algorithm

Each sequence is scanned over all possible site locations to find the most probable location of the site. The EM algorithm consists of two steps, which are repeated consecutively.

step 1 - expectation step
- the column-by-column composition of the site found so far is used to estimate the probability of finding the site at any position in each of the sequences
- these probabilities are used to provide the expected base or amino acid distribution for each column of the site

(figure: a site window, xxxx, is slid along each sequence, oooo…oooo; at each placement A, B, C, … the estimated residue frequencies for each column in the motif give the probability of the motif in this position, multiplied by the background frequencies in the remaining positions)

Expectation Maximization Algorithm

seq1: ACATAGACAGTATAGAGAATCAGAATGCAGCATAGCAGCACATAGAGCAGCATAG

Using the table of column frequencies of each base, the probability that the site starts at position 1 of seq1 is the product of the site-column frequencies for the 16 bases in the window (for A in site position 1, for C in site position 2, and so on for the next 14 positions in the site) multiplied by the background frequencies of the remaining bases (for G in flanking position 1, for A in flanking position 2, and so on for the next 82 flanking positions).

Expectation Maximization Algorithm

seq1: ACATAGACAGTATAGAGAATCAGAATGCAGCATAGCAGCACATAGAGCAGCATAG

This calculation is repeated for every possible site start in the sequence (here positions 1 to 85). The probability of the best location in seq1, say at site k, is the ratio of the site probability at k to the sum of the site probabilities over all positions:

P(site k in seq1) = P(site k, seq1) / (P(site 1, seq1) + P(site 2, seq1) + … + P(site 85, seq1))

The probability of the site location in each sequence is then calculated in this manner.
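The scan-and-normalise step can be sketched as follows. The helper names and the uniform toy matrix are illustrative assumptions, not part of any published tool; with a real frequency matrix the normalised probabilities would peak at the true site.

```python
# Sketch: score every candidate site start k in a sequence, then
# normalise so the site-location probabilities sum to one.
BASES = "ACGT"

def site_probability(seq, k, matrix, width=16):
    """P(site at k): site-column frequencies inside the window,
    background frequencies (column 0) everywhere else."""
    p = 1.0
    for i, base in enumerate(seq):
        if k <= i < k + width:
            p *= matrix[base][i - k + 1]   # site column (i - k + 1)
        else:
            p *= matrix[base][0]           # background column
    return p

def site_location_probabilities(seq, matrix, width=16):
    raw = [site_probability(seq, k, matrix, width)
           for k in range(len(seq) - width + 1)]
    total = sum(raw)
    return [p / total for p in raw]

# toy matrix: background plus 16 site columns, all uniform
matrix = {b: [0.25] * 17 for b in BASES}
seq = "ACATAGACAGTATAGAGAATCAGAATGCAGCATAGCAGCACATAGAGCAGCATAG"
probs = site_location_probabilities(seq, matrix)
```

With the uniform matrix every one of the 40 possible starts in this 55-nt sequence gets equal probability 1/40, which makes the normalisation easy to check.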

Expectation Maximization Algorithm

step 2 - maximization step
- the new counts of bases or amino acids for each position in the site found in step 1 are substituted for the previous set
- each candidate window contributes counts weighted by its site probability: e.g. with P(site 1 in seq1) = 0.01 and P(site 2 in seq1) = 0.02, the window ACATAGACAGTATAGA (starting at position 1) and the window CATAGACAGTATAGAG (starting at position 2) contribute with weights 0.01 and 0.02, respectively

seq1: … TCAGAATGCAGCATAG …
seq2: … CGCATAGAGCATAGAC …
seq3: … ACAGACAAAAAAATAC …
seq4: … CATAGCAGATACAGCA …
  ⋮
seq10: …

Expectation Maximization Algorithm

This procedure is repeated for all other site locations and all other sequences, and a new version of the table of residue frequencies can be built. The expectation and maximization steps are repeated until the estimates of the base frequencies no longer change.

MEME (Multiple EM for Motif Elicitation) is a tool that performs multiple sequence alignments by the EM method.

Expectation Maximization Algorithm

the EM algorithm consists of two steps, which are repeated consecutively:

step 1 - expectation step
- the column-by-column composition of the site found so far is used to estimate the probability of finding the site at any position in each of the sequences
- these probabilities are used to provide the expected base or amino acid distribution for each column of the site

step 2 - maximization step
- the new counts of bases or amino acids for each position in the site found in step 1 are substituted for the previous set

step 1 is then repeated, until the algorithm converges on a solution
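The two alternating steps can be put together in a minimal EM loop. This is an illustrative sketch, not the MEME implementation: the pseudocount of 0.1, the fixed iteration count, and the toy sequences are assumptions for the example, and a uniform background is assumed so that the background factor cancels when the window weights are normalised.

```python
# Minimal EM-style motif search sketch: alternate the expectation step
# (weight every window under the current column frequencies) and the
# maximization step (re-estimate the frequencies from weighted counts).
import random

BASES = "ACGT"

def em_motif(seqs, width, iters=30, seed=0):
    rng = random.Random(seed)
    # random initial column frequencies, normalised per column
    freqs = {b: [rng.random() + 0.5 for _ in range(width)] for b in BASES}
    for j in range(width):
        col = sum(freqs[b][j] for b in BASES)
        for b in BASES:
            freqs[b][j] /= col
    for _ in range(iters):
        counts = {b: [0.1] * width for b in BASES}   # pseudocounts
        for s in seqs:
            # expectation: relative weight of the motif at each window start
            weights = []
            for k in range(len(s) - width + 1):
                p = 1.0
                for j in range(width):
                    p *= freqs[s[k + j]][j]
                weights.append(p)
            total = sum(weights)
            # maximization input: base counts weighted by window probability
            for k, w in enumerate(weights):
                for j in range(width):
                    counts[s[k + j]][j] += w / total
        # maximization: new column frequencies replace the previous set
        for j in range(width):
            col = sum(counts[b][j] for b in BASES)
            for b in BASES:
                freqs[b][j] = counts[b][j] / col
    return freqs

# toy sequences sharing a planted GGGGG motif
seqs = ["ACGTGGGGGTACGT", "TTACGGGGGACGTT", "CGTAGGGGGTTACG"]
freqs = em_motif(seqs, width=5)
```

By construction every column of the returned table remains a probability distribution after each iteration.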

Markov chain models

a Markov chain model is defined by:
- a set of states
  - some states emit symbols
  - other states (e.g. the begin state) are silent
- a set of transitions with associated probabilities
  - the transitions emanating from a given state define a distribution over the possible next states

Markov chain models

given some sequence x of length L, we can ask how probable the sequence is under our model. For any probabilistic model of sequences we can write this probability as

P(x) = P(xL | xL-1, …, x1) · P(xL-1 | xL-2, …, x1) · … · P(x1)

the key property of a (1st-order) Markov chain is that the probability of each xi depends only on xi-1, so this factorises as

P(x) = P(x1) · P(x2 | x1) · P(x3 | x2) · … · P(xL | xL-1)
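The first-order factorisation translates directly into code; the uniform initial and transition tables below are a toy example, under which every length-5 DNA sequence has the same probability.

```python
# Probability of a sequence under a 1st-order Markov chain,
# computed in log space to avoid underflow on long sequences.
import math

def log_prob(seq, initial, trans):
    """log P(x) = log P(x1) + sum over i of log P(x_i | x_{i-1})."""
    lp = math.log(initial[seq[0]])
    for a, b in zip(seq, seq[1:]):
        lp += math.log(trans[a][b])
    return lp

# toy model: uniform initial and transition probabilities
initial = {b: 0.25 for b in "ACGT"}
trans = {a: {b: 0.25 for b in "ACGT"} for a in "ACGT"}
p = math.exp(log_prob("ACGTA", initial, trans))   # 0.25 ** 5
```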

Ulf Schmitz, Statistical methods for aiding alignment14 Markov chain models 1st order Markov chain

Markov chain models

Example application: CpG islands
- CG dinucleotides are rarer in eukaryotic genomes than expected given the independent probabilities of C and G
- but the regions upstream of genes are richer in CG dinucleotides than elsewhere; these are the CpG islands
- CpG islands are useful evidence for finding genes

CpG islands could be predicted with two Markov chains:
- one to represent CpG islands
- one to represent the rest of the genome
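A sketch of the two-chain approach: score a sequence by the log-odds of its transitions under an "island" chain versus a "background" chain, so positive totals favour the island model. The transition probabilities below are made-up illustrative values, not estimates from real genomic data.

```python
# Two-chain CpG predictor sketch: per-transition log-odds score.
import math

def log_odds(seq, island, background):
    """Positive totals favour the CpG-island model."""
    return sum(math.log2(island[a][b] / background[a][b])
               for a, b in zip(seq, seq[1:]))

# illustrative tables: identical except for the row of C,
# where C -> G is common inside islands and rare outside
uniform = {b: 0.25 for b in "ACGT"}
island = {a: dict(uniform) for a in "ACGT"}
background = {a: dict(uniform) for a in "ACGT"}
island["C"] = {"A": 0.2, "C": 0.2, "G": 0.4, "T": 0.2}
background["C"] = {"A": 0.35, "C": 0.3, "G": 0.05, "T": 0.3}

score = log_odds("CGCGCG", island, background)   # CpG-rich: positive
```

Here each C→G transition contributes log2(0.4/0.05) = 3 bits, and the G→C transitions cancel, so the CpG-rich test sequence scores well above zero.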

Ulf Schmitz, Statistical methods for aiding alignment16 Markov chain models

Markov chain models

Selecting the order of a Markov chain model
- higher-order models remember more "history"
- additional history can have predictive value
- example: predict the next word in the sentence fragment "… finish __" (up, it, first, last, …?); now predict it given more history: "Fast guys finish __"

Hidden Markov Models (HMMs)

Hidden state
- we distinguish between the observed parts of a problem and the hidden parts
- in the Markov models considered previously, it is clear which state accounts for each part of the observed sequence
- in a hidden Markov model, there are multiple states that could account for each part of the observed sequence
  - this is the hidden part of the problem
  - states are decoupled from sequence symbols

Hidden Markov models

Markov model: move from state to state according to the transition probability distribution of each state, and emit the states visited.

Hidden Markov model: move from state to state in the same way, but emit a symbol according to each state's emission probability distribution instead.

Hidden Markov Model

(profile-HMM diagram) red squares are match states, green diamonds are insert states, blue circles are delete states; arrows indicate the probability of transition from one state to the next.

Hidden Markov Model

A. Sequence alignment

N * F L S
N K Y L T
Q * W - T A

B. Hidden Markov model for sequence alignment

match states M1–M4 between BEG and END, insert states I0–I4, delete states D1–D4

Probability of the sequence N K Y L T along the path BEG -> M1 -> I1 -> M2 -> M3 -> M4 -> END:

0.33 · 0.05 · 0.33 · 0.05 · 0.33 · 0.05 · 0.33 · 0.05 · 0.33 · 0.05 · 0.5 = 6.1 · 10^-10
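The slide's arithmetic can be checked directly, assuming (as its numbers suggest) an emission probability of 0.33 and a transition probability of 0.05 for each of the five emitted symbols, and a 0.5 transition into END:

```python
# Probability of N K Y L T along BEG -> M1 -> I1 -> M2 -> M3 -> M4 -> END:
# five (emission, transition) pairs, then the final transition into END.
emission, transition, to_end = 0.33, 0.05, 0.5
p = (emission * transition) ** 5 * to_end
print(f"{p:.1e}")   # 6.1e-10
```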

Hidden Markov Models

Three important questions:
1. How likely is a given sequence?
2. What is the most probable "path" for generating a given sequence?
3. How can we learn the HMM parameters given a set of sequences?
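Question 2 is classically answered with the Viterbi algorithm. Below is a minimal sketch, with a hypothetical two-state model (a fair state F and a biased state B, emitting symbols H and T) standing in for a biological one; all the probability values are made-up toy numbers.

```python
# Viterbi: most probable state path for an observation sequence,
# computed by dynamic programming in log space.
import math

def viterbi(obs, states, start, trans, emit):
    V = [{s: math.log(start[s] * emit[s][obs[0]]) for s in states}]
    back = []
    for o in obs[1:]:
        row, ptr = {}, {}
        for s in states:
            # best predecessor for state s at this step
            prev = max(states, key=lambda q: V[-1][q] + math.log(trans[q][s]))
            row[s] = V[-1][prev] + math.log(trans[prev][s] * emit[s][o])
            ptr[s] = prev
        V.append(row)
        back.append(ptr)
    # trace back from the best final state
    state = max(states, key=lambda s: V[-1][s])
    path = [state]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

states = ["F", "B"]                                   # fair / biased
start = {"F": 0.5, "B": 0.5}
trans = {"F": {"F": 0.9, "B": 0.1}, "B": {"F": 0.1, "B": 0.9}}
emit = {"F": {"H": 0.5, "T": 0.5}, "B": {"H": 0.9, "T": 0.1}}
path = viterbi("HHHHH", states, start, trans, emit)   # all 'B'
```

A run of H's is best explained by staying in the biased state throughout, while a run of T's is best explained by the fair state.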

Hidden Markov Models

HMM-based homology searching
- formal probabilistic basis and a consistent theory behind gap and insertion scores
- HMMs are good for profile searches, but bad for alignment (due to the parametrisation of the models)
- HMMs are slow

Tools: HMMER, SAM

Outlook
- Machine learning
- Clustering

Sequence Alignment

Thanks for your attention!