HMMs for alignments & Sequence pattern discovery I519 Introduction to Bioinformatics.



Contents
• Motifs
  – We have seen motifs written as regular expressions
  – Profiles & consensus
• Motif search
  – Sequence motifs represent critical positions that are conserved in evolution, so search algorithms employing motifs may be used to identify more divergent sequences than methods based on global sequence similarity
• PSI-BLAST (similarity search using a PSSM, Position-Specific Scoring Matrix)
• HMM of a protein family (a very brief introduction)

Motifs: Profiles and Consensus
Alignment:
    a G g t a c T t
    C c A t a c g t
    a c g t T A g t
    a c g t C c A t
    C c g t a c g G
Profile (nucleotide counts per column of the alignment):
    A   3 0 1 0 3 1 1 0
    C   2 4 0 0 1 4 0 0
    G   0 1 4 0 0 0 3 1
    T   0 0 0 5 1 0 1 4
Consensus:
    A C G T A C G T
• Line up the patterns by their start indexes $s = (s_1, s_2, \dots, s_t)$
• Construct the profile matrix with the frequencies of each nucleotide in each column
• The consensus nucleotide in each position has the highest score in its column
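
The profile and consensus construction can be sketched in a few lines of Python. This is only an illustration, not code from the lecture: the function names `profile_matrix` and `consensus` are made up here, and ties between nucleotides are broken in A, C, G, T order as an arbitrary choice. The input is the five aligned patterns from this slide.

```python
from collections import Counter

ALPHABET = "ACGT"

def profile_matrix(motifs):
    """Count each nucleotide in every column of equally long motifs."""
    columns = zip(*(m.upper() for m in motifs))
    return [Counter(col) for col in columns]

def consensus(profile):
    """Pick the most frequent nucleotide per column (ties: A, C, G, T order)."""
    return "".join(max(ALPHABET, key=lambda n: col[n]) for col in profile)

# The five aligned patterns from the slide (case is ignored when counting).
motifs = ["aGgtacTt", "CcAtacgt", "acgtTAgt", "acgtCcAt", "CcgtacgG"]
profile = profile_matrix(motifs)

for nucleotide in ALPHABET:
    print(nucleotide, [col[nucleotide] for col in profile])
print("Consensus:", consensus(profile))
```

Dividing each count by the number of patterns would turn the count matrix into a frequency matrix.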

Profile Representation of Protein Families
• Aligned DNA sequences can be represented by a 4 × n profile matrix reflecting the frequencies of nucleotides in every aligned position.
• A protein family can be represented by a 20 × n profile reflecting the frequencies of amino acids in every aligned position.

Profiles and HMMs
• HMMs can also be used for aligning a sequence against a profile representing a protein family.
• A 20 × n profile P corresponds to n sequentially linked match states $M_1, \dots, M_n$ in the profile HMM of P.

Multiple Alignments and Protein Family Classification
• A multiple alignment of a protein family shows how conservation varies along the length of the protein.
• Example: after aligning many globin proteins, biologists recognized that the helical regions of globins are more conserved than the other regions.

What are Profile HMMs?
• A profile HMM is a probabilistic representation of a multiple alignment.
• A given multiple alignment (of a protein family) is used to build a profile HMM.
• This model may then be used to find and score less obvious potential matches of new protein sequences.

Profile HMM
[Figure: a profile HMM]

Building a Profile HMM
• A multiple alignment is used to construct the HMM.
• Assign each column to a Match state in the HMM. Add Insertion and Deletion states.
• Estimate the emission probabilities from the amino acid counts in each column. Different positions in the protein will have different emission probabilities.
• Estimate the transition probabilities between Match, Deletion and Insertion states.
• The HMM is then trained to derive the optimal parameters.
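
A minimal sketch of the column-assignment and emission-estimation steps follows. The slide does not specify the details, so two things here are assumptions: a column becomes a Match state when fewer than half of its entries are gaps, and a pseudocount of 1 is added to every amino acid count so that no probability is zero. The toy alignment and the function names are illustrative only.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def match_columns(alignment, max_gap_fraction=0.5):
    """Indices of columns treated as Match states (assumed rule: < 50% gaps)."""
    n_seqs = len(alignment)
    return [j for j, col in enumerate(zip(*alignment))
            if col.count("-") / n_seqs < max_gap_fraction]

def emission_probabilities(alignment, columns, pseudocount=1.0):
    """Per-Match-state emission probabilities from column counts plus pseudocounts."""
    emissions = []
    for j in columns:
        counts = Counter(seq[j] for seq in alignment if seq[j] != "-")
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        emissions.append({a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS})
    return emissions

# Toy multiple alignment (hypothetical input, '-' marks a gap).
alignment = [
    "VGA--HAGEY",
    "V----NVDEV",
    "VEA--DVAGH",
    "VKG------D",
    "VYS--TYETS",
    "FNA--NIPKH",
    "IAGADNGAGV",
]
columns = match_columns(alignment)
emissions = emission_probabilities(alignment, columns)
print("Match columns:", columns)
print("P(V) in first Match state:", round(emissions[0]["V"], 3))
```

The mostly-gap columns are left to the Insertion states, and transition probabilities would be estimated from counts of Match, Insertion and Deletion moves in the same way.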

States of Profile HMM
• Match states $M_1, \dots, M_n$ (plus begin/end states)
• Insertion states $I_0, I_1, \dots, I_n$
• Deletion states $D_1, \dots, D_n$

Transition Probabilities in Profile HMM
• $\log(a_{MI}) + \log(a_{IM})$ = gap initiation penalty
• $\log(a_{II})$ = gap extension penalty
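
To see how these two terms act as an affine gap penalty, the sketch below scores an insertion of a given length as one initiation term plus one extension term per additional inserted residue. The transition probabilities are hypothetical numbers chosen only for illustration.

```python
import math

def insertion_score(length, a_mi, a_ii, a_im):
    """Log score of entering an insert state, staying for `length` residues, and leaving."""
    initiation = math.log(a_mi) + math.log(a_im)   # gap initiation penalty
    extension = (length - 1) * math.log(a_ii)      # one extension per extra residue
    return initiation + extension

# Hypothetical transition probabilities (not taken from the lecture).
a_mi, a_ii, a_im = 0.05, 0.4, 0.6
for length in (1, 2, 3):
    print(length, round(insertion_score(length, a_mi, a_ii, a_im), 3))
```

Each extra inserted residue changes the score by exactly $\log(a_{II})$, which is why that term plays the role of the gap extension penalty.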

Emission Probabilities in Profile HMM
• Probability of emitting a symbol a at an insertion state $I_j$: $e_{I_j}(a) = p(a)$, where $p(a)$ is the frequency of occurrence of the symbol a in all the sequences.
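
One consequence of setting insert emissions equal to the background frequencies is that the log-odds emission term for an insert state is always zero, so insertions are scored by transitions alone. A small sketch, using made-up sequences and an illustrative function name:

```python
import math
from collections import Counter

def background_frequencies(sequences):
    """Frequency p(a) of each symbol a over all sequences."""
    counts = Counter("".join(sequences))
    total = sum(counts.values())
    return {a: c / total for a, c in counts.items()}

# Hypothetical ungapped sequences, used only to obtain some p(a).
sequences = ["VGAHAGEY", "VNVDEV", "VEADVAGH"]
p = background_frequencies(sequences)

insert_emission = dict(p)  # e_I(a) = p(a) for every insert state
log_odds = {a: math.log(insert_emission[a] / p[a]) for a in p}
print(log_odds)
```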

Profile HMM Alignment
• Define $v^M_j(i)$ as the log-likelihood score of the best path for matching $x_1 \dots x_i$ to the profile HMM, ending with $x_i$ emitted by the state $M_j$.
• $v^I_j(i)$ and $v^D_j(i)$ are defined similarly.

Profile HMM Alignment: Dynamic Programming
$$v^M_j(i) = \log\frac{e_{M_j}(x_i)}{p(x_i)} + \max\begin{cases} v^M_{j-1}(i-1) + \log(a_{M_{j-1},M_j}) \\ v^I_{j-1}(i-1) + \log(a_{I_{j-1},M_j}) \\ v^D_{j-1}(i-1) + \log(a_{D_{j-1},M_j}) \end{cases}$$
$$v^I_j(i) = \log\frac{e_{I_j}(x_i)}{p(x_i)} + \max\begin{cases} v^M_j(i-1) + \log(a_{M_j,I_j}) \\ v^I_j(i-1) + \log(a_{I_j,I_j}) \\ v^D_j(i-1) + \log(a_{D_j,I_j}) \end{cases}$$
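
The recurrences can be turned into a short scoring routine. The sketch below makes several simplifying assumptions that are not on the slide: transition probabilities are the same at every position, the begin state is treated as $M_0$, the transition into the end state is ignored, and insert emissions equal the background so their log-odds term is zero. The deletion-state recurrence, which the slide does not show, is taken to be the standard one: the best of the three states at position j-1 for the same i, with no emission term. All parameter values are hypothetical.

```python
import math

NEG_INF = float("-inf")

def viterbi_profile_score(x, match_emissions, background, log_a):
    """Best log-odds score for aligning sequence x to a profile HMM."""
    n, length = len(match_emissions), len(x)
    # v_[j][i]: best score ending in that state at profile position j after i residues.
    v_m = [[NEG_INF] * (length + 1) for _ in range(n + 1)]
    v_i = [[NEG_INF] * (length + 1) for _ in range(n + 1)]
    v_d = [[NEG_INF] * (length + 1) for _ in range(n + 1)]
    v_m[0][0] = 0.0  # begin state plays the role of M_0
    for j in range(n + 1):
        for i in range(length + 1):
            if j > 0 and i > 0:
                sym = x[i - 1]
                emit = math.log(match_emissions[j - 1][sym] / background[sym])
                v_m[j][i] = emit + max(v_m[j - 1][i - 1] + log_a["MM"],
                                       v_i[j - 1][i - 1] + log_a["IM"],
                                       v_d[j - 1][i - 1] + log_a["DM"])
            if i > 0:
                # Insert emissions equal the background, so the log-odds term is 0.
                v_i[j][i] = max(v_m[j][i - 1] + log_a["MI"],
                                v_i[j][i - 1] + log_a["II"],
                                v_d[j][i - 1] + log_a["DI"])
            if j > 0:
                # Deletion states emit nothing (standard recurrence, assumed here).
                v_d[j][i] = max(v_m[j - 1][i] + log_a["MD"],
                                v_i[j - 1][i] + log_a["ID"],
                                v_d[j - 1][i] + log_a["DD"])
    return max(v_m[n][length], v_i[n][length], v_d[n][length])

# Hypothetical three-position DNA profile favouring the sequence "ACG".
background = {a: 0.25 for a in "ACGT"}
match_emissions = [{a: (0.7 if a == best else 0.1) for a in "ACGT"} for best in "ACG"]
log_a = {k: math.log(v) for k, v in {
    "MM": 0.8, "MI": 0.1, "MD": 0.1,
    "IM": 0.6, "II": 0.3, "ID": 0.1,
    "DM": 0.6, "DI": 0.1, "DD": 0.3}.items()}

for seq in ("ACG", "AG", "ACCG"):
    print(seq, round(viterbi_profile_score(seq, match_emissions, background, log_a), 3))
```

Keeping back-pointers alongside each maximum would recover the alignment itself rather than only its score.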

Paths in Edit Graph and Profile HMM
[Figure: a path through an edit graph and the corresponding path through a profile HMM]

Making a Collection of HMMs for Protein Families
• Use BLAST to separate a protein database into families of related proteins.
• Construct a multiple alignment for each protein family.
• Construct a profile HMM for each family and optimize the parameters of the model (transition and emission probabilities).
• Align the target sequence against each HMM to find the best fit between the target sequence and an HMM.

Application of Profile HMM to Modeling Globin Proteins
• Globins represent a large collection of protein sequences.
• 400 globin sequences were randomly selected from all globins and used to construct a multiple alignment.
• The multiple alignment was used to assign an initial HMM.
• This model was then trained repeatedly, with model lengths chosen randomly between 145 and 170, to obtain an HMM with optimized probabilities.

HMMER package
• Tools for building profile HMMs and for scanning sequences against them (hmmscan)
• HMMER3 (as fast as BLAST)

Sequence Pattern (Motif) Discovery
• Finding patterns in multiple alignments, or in unaligned sequences
• eMotif (a protein pattern database); eBLOCKs
• Gibbs and MEME
  – Both infer patterns in unaligned sequences.
  – The Gibbs program starts with a fixed pattern length W and a random set of pattern locations in the given input sequences (i.e., the initial pattern is random); then one sequence at a time is selected at random and an attempt is made to improve its pattern position.
  – MEME uses many similar concepts, but uses the EM (expectation maximization) method.
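
The Gibbs procedure described above can be sketched as follows. This is a simplified illustration rather than the actual Gibbs program: it assumes DNA input in upper case, adds a pseudocount of 1 to the profile counts, omits any background correction, and "improves" the held-out sequence by sampling a new start position with probability proportional to how well each window fits the profile built from the other sequences. The function name, the iteration count and the example sequences are all hypothetical.

```python
import random
from collections import Counter

ALPHABET = "ACGT"

def gibbs_motif_search(seqs, w, iterations=200, seed=0):
    """Return one motif start position per sequence after Gibbs sampling."""
    rng = random.Random(seed)
    # Random initial locations, i.e. the initial pattern is random.
    starts = [rng.randrange(len(s) - w + 1) for s in seqs]
    for _ in range(iterations):
        held_out = rng.randrange(len(seqs))  # pick one sequence at random
        others = [s[p:p + w] for k, (s, p) in enumerate(zip(seqs, starts))
                  if k != held_out]
        # Profile of the remaining patterns, with a pseudocount of 1 (assumption).
        profile = []
        for col in zip(*others):
            counts = Counter(col)
            total = len(others) + len(ALPHABET)
            profile.append({a: (counts[a] + 1) / total for a in ALPHABET})
        # Score every window of the held-out sequence against the profile.
        s = seqs[held_out]
        weights = []
        for p in range(len(s) - w + 1):
            prob = 1.0
            for k in range(w):
                prob *= profile[k][s[p + k]]
            weights.append(prob)
        # Sample the new position in proportion to its fit to the profile.
        starts[held_out] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts

# Hypothetical input: short DNA sequences that share the pattern ACGTACGT.
seqs = ["TTGACGTACGTCCA", "GGACGTACGTTTAC", "CACGTACGTGGAT", "ATTCCACGTACGT"]
w = 8
starts = gibbs_motif_search(seqs, w)
for s, p in zip(seqs, starts):
    print(p, s[p:p + w])
```

Because the sampler is stochastic, different seeds can settle on different positions; practical runs usually repeat the search from several random starts and keep the best-scoring pattern.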

Utilization of Multiple Alignments
• Residue conservation
  – Jalview
• Subfamilies
  – SCI-PHY
  – FunShift

Readings
• Chapter 6