More on HMMs and Multiple Sequence Alignment BMI/CS 776 Mark Craven March 2002.

Slides:



Advertisements
Similar presentations
Hidden Markov Model in Biological Sequence Analysis – Part 2
Advertisements

Blast to Psi-Blast Blast makes use of Scoring Matrix derived from large number of proteins. What if you want to find homologs based upon a specific gene.
HIDDEN MARKOV MODELS IN COMPUTATIONAL BIOLOGY CS 594: An Introduction to Computational Molecular Biology BY Shalini Venkataraman Vidhya Gunaseelan.
Profile Hidden Markov Models Bioinformatics Fall-2004 Dr Webb Miller and Dr Claude Depamphilis Dhiraj Joshi Department of Computer Science and Engineering.
1 Profile Hidden Markov Models For Protein Structure Prediction Colin Cherry
Hidden Markov Models Ellen Walker Bioinformatics Hiram College, 2008.
Profiles for Sequences
BNFO 602 Multiple sequence alignment Usman Roshan.
Introduction to Bioinformatics Burkhard Morgenstern Institute of Microbiology and Genetics Department of Bioinformatics Goldschmidtstr. 1 Göttingen, March.
Hidden Markov Models Pairwise Alignments. Hidden Markov Models Finite state automata with multiple states as a convenient description of complex dynamic.
. Class 5: Multiple Sequence Alignment. Multiple sequence alignment VTISCTGSSSNIGAG-NHVKWYQQLPG VTISCTGTSSNIGS--ITVNWYQQLPG LRLSCSSSGFIFSS--YAMYWVRQAPG.
Multiple Sequence Comparison.
HIDDEN MARKOV MODELS IN MULTIPLE ALIGNMENT. 2 HMM Architecture Markov Chains What is a Hidden Markov Model(HMM)? Components of HMM Problems of HMMs.
1 Protein Multiple Alignment by Konstantin Davydov.
Multiple Sequence Alignment Algorithms in Computational Biology Spring 2006 Most of the slides were created by Dan Geiger and Ydo Wexler and edited by.
HIDDEN MARKOV MODELS IN MULTIPLE ALIGNMENT
Lecture 9 Hidden Markov Models BioE 480 Sept 21, 2004.
Multiple Sequence alignment Chitta Baral Arizona State University.
BNFO 602 Multiple sequence alignment Usman Roshan.
Phylogenetic Shadowing Daniel L. Ong. March 9, 2005RUGS, UC Berkeley2 Abstract The human genome contains about 3 billion base pairs! Algorithms to analyze.
Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.
Profile Hidden Markov Models PHMM 1 Mark Stamp. Hidden Markov Models  Here, we assume you know about HMMs o If not, see “A revealing introduction to.
Practical multiple sequence algorithms Sushmita Roy BMI/CS 576 Sushmita Roy Sep 23rd, 2014.
Multiple Sequence Alignment
Multiple Sequence Alignments
Multiple sequence alignment methods 1 Corné Hoogendoorn Denis Miretskiy.
Incorporating Bioinformatics in an Algorithms Course Lawrence D’Antonio Ramapo College of New Jersey.
Probabilistic Sequence Alignment BMI 877 Colin Dewey February 25, 2014.
Chapter 5 Multiple Sequence Alignment.
Alignment Statistics and Substitution Matrices BMI/CS 576 Colin Dewey Fall 2010.
Multiple Sequence Alignment BMI/CS 576 Colin Dewey Fall 2010.
Multiple Sequence Alignment CSC391/691 Bioinformatics Spring 2004 Fetrow/Burg/Miller (Slides by J. Burg)
Multiple Alignment Modified from Tolga Can’s lecture notes (METU)
Pair-wise Sequence Alignment What happened to the sequences of similar genes? random mutation deletion, insertion Seq. 1: 515 EVIRMQDNNPFSFQSDVYSYG EVI.
Introduction to Profile Hidden Markov Models
CSCE555 Bioinformatics Lecture 6 Hidden Markov Models Meeting: MW 4:00PM-5:15PM SWGN2A21 Instructor: Dr. Jianjun Hu Course page:
Hidden Markov Models for Sequence Analysis 4
Scoring Matrices Scoring matrices, PSSMs, and HMMs BIO520 BioinformaticsJim Lund Reading: Ch 6.1.
Pairwise Sequence Alignment BMI/CS 776 Mark Craven January 2002.
Motif finding with Gibbs sampling CS 466 Saurabh Sinha.
Hidden Markov Models BMI/CS 776 Mark Craven March 2002.
Multiple Sequence Alignments Craig A. Struble, Ph.D. Department of Mathematics, Statistics, and Computer Science Marquette University.
Bioinformatics Multiple Alignment. Overview Introduction Multiple Alignments Global multiple alignment –Introduction –Scoring –Algorithms.
Multiple alignment: Feng- Doolittle algorithm. Why multiple alignments? Alignment of more than two sequences Usually gives better information about conserved.
Chapter 3 Computational Molecular Biology Michael Smith
HMMs for alignments & Sequence pattern discovery I519 Introduction to Bioinformatics.
Hidden Markov Models A first-order Hidden Markov Model is completely defined by: A set of states. An alphabet of symbols. A transition probability matrix.
BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.
PHMMs for Metamorphic Detection Mark Stamp 1PHMMs for Metamorphic Detection.
Multiple Sequence Alignment Colin Dewey BMI/CS 576 Fall 2015.
COT 6930 HPC and Bioinformatics Multiple Sequence Alignment Xingquan Zhu Dept. of Computer Science and Engineering.
Biocomputation: Comparative Genomics Tanya Talkar Lolly Kruse Colleen O’Rourke.
Learning Sequence Motifs Using Expectation Maximization (EM) and Gibbs Sampling BMI/CS 776 Mark Craven
Pairwise sequence alignment Lecture 02. Overview  Sequence comparison lies at the heart of bioinformatics analysis.  It is the first step towards structural.
Sequence Alignment.
Applications of HMMs in Computational Biology BMI/CS 576 Colin Dewey Fall 2010.
MGM workshop. 19 Oct 2010 Some frequently-used Bioinformatics Tools Konstantinos Mavrommatis Prokaryotic Superprogram.
V diagonal lines give equivalent residues ILS TRIVHVNSILPSTN V I L S T R I V I L P E F S T Sequence A Sequence B Dot Plots, Path Matrices, Score Matrices.
V diagonal lines give equivalent residues ILS TRIVHVNSILPSTN V I L S T R I V I L P E F S T Sequence A Sequence B Dot Plots, Path Matrices, Score Matrices.
Multiple Sequence Alignment (cont.) (Lecture for CS397-CXZ Algorithms in Bioinformatics) Feb. 13, 2004 ChengXiang Zhai Department of Computer Science University.
Heuristic Methods for Sequence Database Searching BMI/CS 776 Mark Craven February 2002.
4.2 - Algorithms Sébastien Lemieux Elitra Canada Ltd.
Substitution Matrices and Alignment Statistics BMI/CS 776 Mark Craven February 2002.
Profile Hidden Markov Models PHMM 1 Mark Stamp. Hidden Markov Models  Here, we assume you know about HMMs o If not, see “A revealing introduction to.
Hidden Markov Models BMI/CS 576
INTRODUCTION TO BIOINFORMATICS
Multiple sequence alignment (msa)
Learning Sequence Motif Models Using Expectation Maximization (EM)
Multiple Sequence Alignment (I)
HIDDEN MARKOV MODELS IN COMPUTATIONAL BIOLOGY
Presentation transcript:

More on HMMs and Multiple Sequence Alignment BMI/CS Mark Craven March 2002

Announcements readings for the week after Spring break –Brown & Botstein, Nature Genetics Supplement –Eisen et al., Proc. National Academy of Sciences –and more

Multiple Sequence Alignment: Task Definition Given –a set of more than 2 sequences –a method for scoring an alignment Do: –determine the correspondences between the sequences such that the similarity score is maximized

Motivation characterizing a set of sequences (e.g. some class of DNA signals) characterizing a protein family –what is conserved –what varies building profiles for searching

Multiple Alignment of SH3 Domain Figure from A. Krogh, An Introduction to Hidden Markov Models for Biological Sequences

The Structure of a Profile HMM i 2 i 3 i 1 i 0 d 1 d 2 d 3 m 1 m 3 m 2 startend

The Structure of a Profile HMM match states: represent mostly conserved positions in the sequence family insert states: represent subsequences that have been inserted in some members of the family delete states: silent states representing subsequences that have been deleted in some members of the family

A Profile HMM Trained for the SH3 Domain Figure from A. Krogh, An Introduction to Hidden Markov Models for Biological Sequences

Model Selection for Profile HMMs we have assumed we are given a model of a specified length; how do we determine this length? heuristic approach –choose an initial length; learn parameters –if more than x-del% of Viterbi paths go through delete state at position k, remove that position from model –if more than x-ins% go through insertions at position k, add new positions to the model –iterate

Classifying Sequences: Three Approaches choose threshold on Pr(x) that allows good discrimination between “positive” cases and “negative” cases –depends on length of x construct a “null” model; run query sequence x through both to see which results in greater Pr(x) construct a set of models for disjoint families; run query sequence x through all models to see which results in highest Pr(x)

Choosing a Threshold Figure from Krogh et al., Journal of Molecular Biology 235, 1994

Modeling Protein Domains with an HMM domain model begin 1 - p p i b p end i e 1 - p p p there are lots of ways we can modify the basic profile HMM architecture for particular modeling tasks – one such case is modeling protein domains

Other Methods: Scoring a Multiple Alignment key issue: how do we assess the quality of a multiple sequence alignment? usually, the assumption is made that the individual columns of an alignment are independent we’ll discuss two methods –sum of pairs (SP) –minimum entropy

Scoring an Alignment: Sum of Pairs compute the sum of the pairwise scores character of the kth sequence in the i th column substitution matrix

Scoring an Alignment: Minimum Entropy basic idea: try to minimize the entropy of each column another way of thinking about it: columns that can be communicated using few bits are good information theory tells us that an optimal code uses bits to encode a message of probability p

Scoring an Alignment: Minimum Entropy the messages in this case are the characters in a given column the entropy of a column is given by: the i th column of an alignment m count of character a in column i probability of character a in column i

Dynamic Programming Approach can find optimal alignments using dynamic programming generalization of methods for pairwise alignment –consider n-dimension matrix for n sequences (instead of 2-dimensional matrix) –each matrix element represents alignment score for n subsequences (instead of 2 subsequences) given n sequences of length L –space complexity is

Dynamic Programming Approach given n sequences of length L –time complexity is if we use SP if column scores can be computed in

Heuristic Alignment Methods since complexity of DP approach is exponential in the number of sequences, heuristic methods are usually used progressive alignment: construct a succession of pairwise alignments – CLUSTALW –star approach –etc. iterative refinement –given a multiple alignment (say from a progessive method) remove a sequence, realign it to profile of other sequences repeat until convergence

Star Alignment Approach given: n sequences to be aligned –pick one sequence as the “center” –for each determine an optimal alignment between and –aggregate pairwise alignments return: multiple alignment resulting from aggregate

Star Alignments: Picking the Center try each sequence as the center, return the best multiple alignment compute all pairwise alignments and select the string that maximizes:

Star Alignments: Aggregating Pairwise Alignments “once a gap, always a gap” shift entire columns when incorporating gaps

Star Alignment Example ATTGCCATT ATGGCCATT ATTGCCATT ATC-CAATTTT ATTGCCATT-- ATCTTC-TT ATTGCCATT ATTGCCGATT ATTGCC-ATT ATTGCCATT ATGGCCATT ATCCAATTTT ATCTTCTT ATTGCCGATT Given:

Star Alignment Example ATTGCCATT ATGGCCATT ATTGCCATT 1. ATC-CAATTTT ATTGCCATT-- ATGGCCATT-- ATC-CAATTTT 2. merging pairwise alignments alignmentpresent pair

Star Alignment Example ATCTTC-TT ATTGCCATT ATTGCCATT-- ATGGCCATT-- ATC-CAATTTT ATCTTC-TT-- 3. ATTGCCGATT ATTGCC-ATT ATTGCC-ATT-- ATGGCC-ATT-- ATC-CA-ATTTT ATCTTC--TT-- ATTGCCGATT-- 4. alignmentpresent pair

Methods for Multiple Sequence Alignment

Probabilistic vs. Other Multiple Alignment Methods conventional methods use uniform substitution scores & gap penalties for all regions of sequences an HMM can score things differently in different regions (e.g. highly conserved vs. other regions)