Chapter 6 Profiles and Hidden Markov Models


The following approaches can also be used to identify distantly related members of a family of protein (or DNA) sequences:
- Position-specific scoring matrix (PSSM)
- Profile hidden Markov model (HMM)
These methods provide a statistical framework in which the probability of residues or nucleotides at specific positions is tested.
Because they are built from a multiple alignment, information on all members of the alignment is retained.

Position-specific scoring matrices

Position      1  2  3  4  5  6
Sequence 1    A  T  G  T  C  G
Sequence 2    A  A  G  A  C  T
Sequence 3    T  A  C  T  C  A
Sequence 4    C  G  G  A  G  G
Sequence 5    A  A  C  C  T  G

[Tables: three matrices with rows A, T, G, C and columns Pos 1-6 (the first also with an "Overall Freq." column); the numeric values are not preserved]
1. Frequencies of observations in each position
2. Normalised to the overall frequencies
3. Converted to log2

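[Table: PSSM with rows A, T, G, C and one column per position; values not preserved]
Match AACTCG to the PSSM: add up the log2 scores of the matching nucleotide at each position and convert the sum back from log2, which gives ~80.
Thus, the sequence AACTCG is about 80 times more likely to fit the PSSM than a random 6-nucleotide sequence.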
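The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the slide author's code: the function names `build_pssm` and `score` are made up for the example, and no pseudocounts are used, so a nucleotide never seen in a column scores minus infinity. Under those assumptions the five sequences from the slide reproduce the factor of roughly 80 for AACTCG.

```python
import math

# The five aligned sequences from the slide.
alignment = ["ATGTCG", "AAGACT", "TACTCA", "CGGAGG", "AACCTG"]
alphabet = "ATGC"

def build_pssm(seqs):
    """Return one dict per column: nucleotide -> log2(column freq / overall freq)."""
    total = sum(len(s) for s in seqs)
    overall = {n: sum(s.count(n) for s in seqs) / total for n in alphabet}
    pssm = []
    for column in zip(*seqs):
        scores = {}
        for n in alphabet:
            freq = column.count(n) / len(column)
            # Assumption: no pseudocounts, so an unseen nucleotide scores -infinity.
            scores[n] = math.log2(freq / overall[n]) if freq > 0 else float("-inf")
        pssm.append(scores)
    return pssm

def score(seq, pssm):
    """Sum the log2 scores of seq, one position at a time."""
    return sum(column[n] for n, column in zip(seq, pssm))

pssm = build_pssm(alignment)
log_odds = score("AACTCG", pssm)
odds = 2 ** log_odds  # convert back from log2 to a likelihood ratio
print(f"log2 score = {log_odds:.2f}, odds = {odds:.1f}")
```

Real profile tools usually add pseudocounts so that a single unseen residue does not veto an otherwise good match.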

Profiles
- Profiles are PSSMs that include gap penalty information
- This is not a trivial problem, and it is incorporated in Position-Specific Iterated (PSI) BLAST
- A normal BLASTP is performed with the query sequence, homologs are obtained, and a multiple alignment is performed
- A profile is built from this alignment
- The profile is used to search the database again, and a new profile is created by adding in the newly identified homologs
- This process is repeated until no new homologs are identified
- PSI-BLAST is a very sensitive approach for finding distant relatives of a family
- High sensitivity can generate a high false-positive count
- Inclusion of false positives can lead to profile drift
- The user can visually inspect the result of each iteration to decide which sequences to include
- Typically 3-5 iterations are sufficient to identify distant homologs

Markov Model and Hidden Markov Model
- A Markov chain describes a series of events or states
- There is a certain probability of moving from one state to the next state
- This is known as the transition probability
- Sequences can also be seen as Markov chains, where the occurrence of a given nucleotide may depend on the preceding nucleotide
- A zero-order Markov model describes a state that is independent of the previous state
- In a first-order Markov model, the state depends on its direct precursor (i.e., di-nucleotide sequences)
- In a second-order Markov model, the state depends on the two preceding states, so three nucleotides are considered together (for example, codons)
- Thus the frequency of transitions in tri-mers may differ between coding and non-coding regions of the genome
- The Markov model is therefore applicable to finding genes in genomes
- In a Markov model all states are observable

1 --P12--> 2 --P23--> 3 --P34--> 4 --P45--> 5
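As a rough illustration of a first-order Markov chain, the sketch below estimates transition probabilities from adjacent nucleotide pairs and uses them to score a short sequence. The training string, the uniform start probabilities, and the function names are hypothetical choices made for this example only.

```python
from collections import Counter, defaultdict

def transition_probabilities(seq):
    """Estimate P(next | current) from the adjacent pairs in seq."""
    pair_counts = Counter(zip(seq, seq[1:]))
    totals = defaultdict(int)
    for (current, _), count in pair_counts.items():
        totals[current] += count
    return {pair: count / totals[pair[0]] for pair, count in pair_counts.items()}

def chain_probability(seq, start_prob, trans):
    """Probability of seq under a first-order chain; unseen transitions count as 0."""
    p = start_prob[seq[0]]
    for pair in zip(seq, seq[1:]):
        p *= trans.get(pair, 0.0)
    return p

training = "ATGCGCGATATGCGC"          # hypothetical training sequence
trans = transition_probabilities(training)
start = {n: 0.25 for n in "ACGT"}     # assumption: uniform start probabilities

print("P(C|G) =", trans[("G", "C")])
print("P(GCG) =", chain_probability("GCG", start, trans))
```

Training one such table on coding sequence and another on non-coding sequence, then comparing the two probabilities for a query, is the basic idea behind Markov-model gene finders.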

Hidden Markov model

[Figure: chain from a begin state to an end state through observable states 1, 2, 3, 4, 5 (transitions P12, P23, P34, P45) with a parallel row of hidden states 2', 3', 4' (transitions P12', P23', P34', P45')]

- A Markov model may consist of observable states and unobservable or "hidden" states
- The hidden states also affect the outcome of the observed states
- In a sequence alignment, a gap is an unobserved state that influences the probability of the next nucleotide
- The probability of going from one state to the next state is called the transition probability
- In DNA, there are four symbols: G, A, T and C (20 in proteins)
- The probability value associated with each symbol in a state is the emission probability
- To calculate the probability that the model produces a particular sequence, the transition and emission probabilities along all possible paths have to be considered

A simple two state example

Emission probability    State 1    State 2
A                       0.80       0.11
C                       0.02       0.08
G                       0.10       0.32
T                       0.08       [value not preserved]

Transition probability, State 1 to State 2: 0.40 (the value used in the product below)

- This particular Markov model has a probability of 0.80 × 0.40 × 0.32 = 0.102 of generating the sequence AG
- This particular model shows that the sequence AT has the highest probability of occurring
- Where do these numbers come from? A Markov model has to be "trained" with examples
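The arithmetic of the two-state example can be checked with a short script. One number is an assumption: the emission probability of T in state 2 is missing from the transcript, and 0.49 is used here only because it makes the four emissions of that state sum to 1, which also agrees with the statement that AT is the most probable sequence.

```python
# Emission probabilities from the slide. The T value for state 2 is an
# assumption (0.49 makes the state-2 emissions sum to 1).
emit_state1 = {"A": 0.80, "C": 0.02, "G": 0.10, "T": 0.08}
emit_state2 = {"A": 0.11, "C": 0.08, "G": 0.32, "T": 0.49}
transition_1_to_2 = 0.40  # taken from the slide's worked product

def dinucleotide_probability(first, second):
    """Emit `first` in state 1, move to state 2, emit `second`."""
    return emit_state1[first] * transition_1_to_2 * emit_state2[second]

p_ag = dinucleotide_probability("A", "G")

# Enumerate all 16 dinucleotides and pick the most probable one.
all_pairs = [a + b for a in "ACGT" for b in "ACGT"]
best = max(all_pairs, key=lambda pair: dinucleotide_probability(pair[0], pair[1]))

print(f"P(AG) = {p_ag:.4f}")
print("most probable dinucleotide:", best)
```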

Training
- The frequencies of occurrence of nucleotides in a multiple sequence alignment are used to calculate the emission and transition probabilities of each symbol at each state
- The trained HMM is then used to test how well a new sequence fits the model
- To use an HMM for gapped sequence alignments, a state can be one of:
  - Match/mismatch (a mismatch is a low-probability match) (observable)
  - Insertion (hidden)
  - Deletion (hidden)

[Figure: profile HMM architecture running from B (begin) to E (end) through match states M1-M3, insert states I1-I4 and delete states D1-D3]

- There is one optimal path from B to E that describes the most probable sequence and the optimal alignment to the multiply aligned sequence family

Viterbi algorithm
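The Viterbi algorithm finds the single most probable path of states for an observed sequence by dynamic programming. The sketch below is a generic textbook version on a hypothetical two-state model (AT-rich versus GC-rich regions), not the match/insert/delete profile architecture; every probability in it is invented for illustration, and all of them are assumed to be non-zero so that logarithms can be taken.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (log probability of the best path, best path as a list of states)."""
    # best[t][s]: log probability of the best path that ends in state s at position t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for symbol in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev, score = max(
                ((p, best[-1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            scores[s] = score + math.log(emit_p[s][symbol])
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return best[-1][last], path

# Hypothetical model parameters, chosen only for illustration.
states = ("AT_rich", "GC_rich")
start_p = {"AT_rich": 0.5, "GC_rich": 0.5}
trans_p = {"AT_rich": {"AT_rich": 0.9, "GC_rich": 0.1},
           "GC_rich": {"AT_rich": 0.1, "GC_rich": 0.9}}
emit_p = {"AT_rich": {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1},
          "GC_rich": {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4}}

log_prob, path = viterbi("GGCCAATT", states, start_p, trans_p, emit_p)
print(path)
```

Working in log space avoids numerical underflow on long sequences, because products of many small probabilities become sums.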

A brief interlude, looking at algorithms…

Bubblesort

[Figure: array drawn as index/value columns, with the index running from 0 to max]

The algorithm:
- Two loops
- The outer loop starts at index max-1 and decrements by 1 with every pass
- The inner loop starts at 0 and increments by 1 up to, but not including, the value of the outer loop
- Compare the values at index and at index+1 in the inner loop
- If value[index] < value[index+1], swap them
- Continue until the outer loop is 1

[Figure: successive index/value snapshots of the array as the inner loop advances during the first pass (outer loop = 9)]
Smallest number is now at the bottom

[Figure: successive index/value snapshots of the array as the inner loop advances during the next pass]
Next smallest number is now at the bottom-1

Python code for Bubblesort algorithm

    import random

    def bubblesort(list_of_numbers):
        # Sorts in place into descending order: after each pass of the
        # outer loop the smallest remaining value has sunk to the end.
        for outer_loop in range(len(list_of_numbers) - 1, 0, -1):
            for index in range(outer_loop):
                if list_of_numbers[index] < list_of_numbers[index + 1]:
                    # Swap the two neighbours.
                    temporary = list_of_numbers[index]
                    list_of_numbers[index] = list_of_numbers[index + 1]
                    list_of_numbers[index + 1] = temporary
        return list_of_numbers

    numbers = list(range(10))   # get a list of numbers from 0 to 9
    random.shuffle(numbers)     # shuffle the numbers
    print("In random order: ", numbers)
    print("In order: ", bubblesort(numbers))

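Quicksort

    def qsort2(L):
        # A list of length 0 or 1 is already sorted.
        if len(L) <= 1:
            return L
        pivot = L[0]
        less = [x for x in L if x < pivot]
        equal = [x for x in L if x == pivot]
        greater = [x for x in L if x > pivot]
        # Sort both partitions recursively and join them around the pivot values.
        return qsort2(less) + equal + qsort2(greater)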

Applications of HMMs
- HMMs include predictive information on insertions and deletions separately
  - Not arbitrary "gap penalties"
- Once HMMs are trained, they can be used to identify distant family members in a database
- Can be used for protein family classification
- Advanced gene and promoter prediction
- Transmembrane protein prediction
- Protein fold recognition
- Nucleosome positions

HMMer (http://hmmer.wustl.edu/), a suite of Linux programs:
- hmmalign: aligns sequences to an HMM profile
- hmmbuild: builds a hidden Markov model from an alignment
- hmmcalibrate: calibrates HMM search statistics
- hmmconvert: converts between profile HMM file formats
- hmmemit: generates sequences from a profile HMM
- hmmfetch: retrieves a specific HMM from an HMM database
- hmmindex: creates an SSI index for an HMM database
- hmmpfam: searches one or more sequences against an HMM database
- hmmsearch: searches a sequence database with a profile HMM

Chapter 7 Protein Motif and Domain Prediction

- A motif is a short conserved sequence ([length not preserved] aa long)
  - E.g. Zn-finger motif
- A domain is longer ([length not preserved] aa in length)
  - E.g. transmembrane domain
- Motifs and domains are often evolutionarily conserved
- Useful for identifying the functions of proteins that show little homology over the full sequence
- Motifs and domains are often identified by PSSMs and HMMs
- Motifs or domains can be stored in a database
- Unknown proteins can be matched to this database to identify motifs and domains and illuminate possible protein functions
- Motifs and domains can be stored as regular expressions ([ST]-X-[RK])
- Or as PSSMs or HMMs

Regular expressions

E-X(2)-[FHM]-X(4)-{P}-L
- Invariant residues are written as plain letters
- Conserved alternatives are given in square [] brackets
- Disallowed residues are given in curly {} brackets
- Nonspecific positions are shown by X
- Repetitions are given by a number in round () brackets

PROSITE
- High number of false negatives
- Database must be continually updated

PSSMs, profiles and HMMs incorporate statistical information and are much more accurate
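The pattern notation above maps almost directly onto the regular expressions of most programming languages. The helper below, `prosite_to_regex`, is a hypothetical name for a minimal translator written for this example; it handles only the elements listed above (plain residues, [], {}, X and repeat counts) and ignores other PROSITE features such as terminus anchors. The test sequences are invented.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.split("-"):
        # Split off an optional repeat count such as X(2) or X(2,4).
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = match.group(1), match.group(2)
        if core in ("X", "x"):
            core = "."                        # any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"    # disallowed residues
        if repeat:
            core += "{" + repeat + "}"
        parts.append(core)
    return "".join(parts)

regex = prosite_to_regex("E-X(2)-[FHM]-X(4)-{P}-L")
print(regex)

hit = re.search(regex, "AAEKKHGGGGALQQ")      # invented test sequence
print(hit.group(0) if hit else "no match")
```

A single substitution is enough to lose a match (for example a P in the disallowed position), which is one reason exact patterns miss true family members.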

PRINTS
- Matches smaller regions of a motif, called "fingerprints", to the query

BLOCKS
- PSSMs or aligned sequences are used to define blocks that are larger than motifs

ProDom
- Database generated with PSI-BLAST

Pfam
- Contains HMMs built from smaller seed alignments from SWISS-PROT and TrEMBL

SMART
- Database of HMMs based on manual structural alignments or PSI-BLAST profiles

Protein family databases

COG (Clusters of Orthologous Groups)
- All-against-all comparison of all sequenced genomes
- If the best fit is obtained in prokaryotes, archaea and eukaryotes, the group is defined as a cluster
- Clusters can be searched to identify the possible function of an unknown protein

ProtoNet
- Pairwise BLAST alignment of all protein sequences in SWISS-PROT
- The query sequence is searched against this database

Finding distant/little conserved motifs

Expectation Maximization
- Use a predicted alignment of the sequences
- Calculate a PSSM
- Iterate over the sequences used and modify the PSSM to better fit each in turn

Gibbs motif sampling
- Use an estimated alignment of all but one sequence
- Calculate a PSSM
- Recalculate the PSSM with the one left-out sequence
- Iterate the process to convergence
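The Gibbs sampling steps can be sketched as follows. This is one common textbook variant, not necessarily the one the slides had in mind: the left-out sequence's motif position is sampled in proportion to its PSSM odds, the background is assumed uniform, a pseudocount of 1 is added to every cell, and a fixed number of iterations stands in for a convergence test. The function name and the toy sequences are hypothetical.

```python
import random

def gibbs_motif_search(seqs, width, iterations=200, seed=0,
                       alphabet="ACGT", pseudocount=1.0):
    """Return one sampled motif start position per sequence."""
    rng = random.Random(seed)
    # Start from a random guess of the motif position in every sequence.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    background = 1.0 / len(alphabet)  # assumption: uniform background
    for _ in range(iterations):
        left_out = rng.randrange(len(seqs))
        # Build column counts from the current sites of all other sequences.
        sites = [s[p:p + width]
                 for i, (s, p) in enumerate(zip(seqs, starts)) if i != left_out]
        counts = [{a: pseudocount for a in alphabet} for _ in range(width)]
        for site in sites:
            for column, residue in enumerate(site):
                counts[column][residue] += 1
        total = len(sites) + pseudocount * len(alphabet)
        # Score every window of the left-out sequence against that PSSM.
        seq = seqs[left_out]
        weights = []
        for p in range(len(seq) - width + 1):
            odds = 1.0
            for column, residue in enumerate(seq[p:p + width]):
                odds *= (counts[column][residue] / total) / background
            weights.append(odds)
        # Sample the new position in proportion to the window odds.
        starts[left_out] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts

# Toy sequences that each contain TATAAT (invented for this example).
sequences = ["GCGTATAATGCC", "TTATAATCGCGG", "CCGGCTATAATG", "GTATAATCCGGC"]
starts = gibbs_motif_search(sequences, width=6)
print([s[p:p + 6] for s, p in zip(sequences, starts)])
```

Because positions are sampled rather than chosen greedily, the procedure can escape poor starting alignments, at the cost of giving slightly different answers for different seeds.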

Weblogo
- Graphical representation of the motif sequence
- Highly conserved residues are shown as larger symbols
- Ambiguity is indicated

[Figure: sequence logo of the helix-turn-helix motif of an E. coli CAP family protein]
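The height of a column in a sequence logo reflects its information content, which is the maximum possible entropy minus the observed entropy of the column. The sketch below computes that quantity for a small nucleotide alignment (the five sequences from the PSSM slide, reused here as an assumption; the logo on this slide is a protein motif). It leaves out the small-sample correction that logo software normally applies.

```python
import math

def information_content(column, alphabet="ACGT"):
    """Bits of information in one alignment column: log2(|alphabet|) - entropy."""
    entropy = 0.0
    for symbol in alphabet:
        freq = column.count(symbol) / len(column)
        if freq > 0:
            entropy -= freq * math.log2(freq)
    return math.log2(len(alphabet)) - entropy

# Nucleotide alignment reused from the PSSM example (an assumption here).
alignment = ["ATGTCG", "AAGACT", "TACTCA", "CGGAGG", "AACCTG"]
bits = [information_content("".join(column)) for column in zip(*alignment)]
for position, value in enumerate(bits, start=1):
    print(f"position {position}: {value:.2f} bits")
```

A fully conserved DNA column reaches the maximum of 2 bits, and a column with all four nucleotides at equal frequency scores 0 bits and would be nearly invisible in the logo.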