Sequence motifs, information content, logos, and HMM’s

Presentation transcript:

Sequence motifs, information content, logos, and HMM’s Morten Nielsen, CBS, BioCentrum, DTU

Outline
  Multiple alignments and sequence motifs
  Weight matrices and consensus sequence
  Sequence weighting
  Low (pseudo) counts
  Information content
  Sequence logos
  Mutual information
  Example from the real world
  HMM’s and profile HMM’s
  TMHMM (trans-membrane protein)
  Gene finding
  Links to HMM packages

Multiple alignment and sequence motifs
  Core
  Consensus sequence
  Weight matrices
  Problems: sequence weights, low counts
  ----------MLEFVVEADLPGIKA--------
  ----------MLEFVVEFALPGIKA--------
  ----------MLEFVVEFDLPGIAA--------
  -------------YLQDSDPDSFQD--------
  ---GSDTITLPCRMKQFINMWQE----------
  ---RNQEERLLADLMQNYDPNLR----------
  -------YDPNLRPAERDSDVVNVSLK------
  ----------NVSLKLTLTNLISLNEREEA---
  ----EREEALTTNVWIEMQWCDYR---------
  ----------WCDYRLRWDPRDYEGLWVLR---
  --LWVLRVPSTMVWRPDIVLEN-----------
  ------------IVLENNVDGVFEVALYCNVL-
  -------------YCNVLVSPDGCIYWLPPAIF
  ---------PPAIFRSACSISVTYFPFDW----
               *********
               FVVEFDLPG   Consensus
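The consensus row can be recovered by taking the most frequent residue in each core column. The following is a minimal Python sketch, not the author's tool. It assumes the nine starred columns correspond to the 9-mer cores listed in `CORES` (read off the alignment by hand) and that ties go to the residue seen first, which is a guess about how the slide's consensus was built.

```python
from collections import Counter

# Core 9-mers read off the starred columns of the alignment (assumption).
CORES = [
    "FVVEADLPG", "FVVEFALPG", "FVVEFDLPG", "YLQDSDPDS", "MKQFINMWQ",
    "LMQNYDPNL", "PAERDSDVV", "LKLTLTNLI", "VWIEMQWCD", "YRLRWDPRD",
    "WRPDIVLEN", "VLENNVDGV", "YCNVLVSPD", "FRSACSISV",
]

def consensus(peptides):
    """Most frequent residue per column; ties go to the residue seen first."""
    columns = zip(*peptides)  # iterate over alignment columns
    return "".join(Counter(col).most_common(1)[0][0] for col in columns)

print(consensus(CORES))
```

Several columns contain ties (for example V and R in the second column), so the consensus depends on the tie-breaking rule. That is one reason a consensus string is a lossy summary of a motif.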

Sequence weighting 1 - Clustering
  The first three sequences of the alignment are homologous and form one cluster (n = 3):
  ----------MLEFVVEADLPGIKA--------
  ----------MLEFVVEFALPGIKA--------
  ----------MLEFVVEFDLPGIAA--------
  Each sequence in a cluster gets weight 1/n (here 1/3)
  Consensus sequence with clustering: YRQELDPLV
  Previous consensus:                 FVVEFDLPG
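The effect of clustering on the consensus can be checked with weighted column counts. This is a sketch under two assumptions: the three homologous sequences each get weight 1/3, and every other sequence keeps weight 1 (the slide does not state the latter explicitly). `CORES` is the hand-extracted list of 9-mer cores.

```python
from collections import defaultdict

CORES = [
    "FVVEADLPG", "FVVEFALPG", "FVVEFDLPG", "YLQDSDPDS", "MKQFINMWQ",
    "LMQNYDPNL", "PAERDSDVV", "LKLTLTNLI", "VWIEMQWCD", "YRLRWDPRD",
    "WRPDIVLEN", "VLENNVDGV", "YCNVLVSPD", "FRSACSISV",
]
# Assumption: one cluster of three homologs (1/3 each), all others weight 1.
WEIGHTS = [1 / 3] * 3 + [1.0] * 11

def weighted_counts(peptides, weights, column):
    """Sum the sequence weights of each residue observed in one column."""
    totals = defaultdict(float)
    for peptide, weight in zip(peptides, weights):
        totals[peptide[column]] += weight
    return dict(totals)

def top_residue(peptides, weights, column):
    """Residue with the largest weighted count in the column."""
    totals = weighted_counts(peptides, weights, column)
    return max(totals, key=totals.get)

print([top_residue(CORES, WEIGHTS, c) for c in range(3)])
```

In the first column F drops from a count of 4 to a weighted count of about 2, so Y (count 3) takes over, in line with the new consensus starting with Y.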

Sequence weighting 2 - (Henikoff & Henikoff)
  $w'_{aa} = 1/(r \cdot s)$
    r: number of different amino acids in a column
    s: number of occurrences of the amino acid in the column
  Normalize so that $\sum w_{aa} = 1$ for each column
  Sequence weight is the sum of $w_{aa}$
  Example (first column):
    F: r = 7 (FYMLPVW), s = 4, w' = 1/28, w = 0.055
    Y: s = 3, w' = 1/21, w = 0.073
    M, P, W: s = 1, w' = 1/7, w = 0.218
    L, V: s = 2, w' = 1/14, w = 0.109
  Peptide    Weight
  FVVEADLPG  0.37
  FVVEFALPG  0.43
  FVVEFDLPG  0.32
  YLQDSDPDS  0.59
  MKQFINMWQ  0.90
  LMQNYDPNL  0.68
  PAERDSDVV  0.75
  LKLTLTNLI  0.85
  VWIEMQWCD  0.84
  YRLRWDPRD  0.51
  WRPDIVLEN  0.71
  VLENNVDGV  0.59
  YCNVLVSPD  0.71
  FRSACSISV  0.75
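The position-based weighting can be sketched in a few lines of Python. This is an illustration of the 1/(r*s) rule as stated on the slide, not the author's implementation. Summing 1/(r*s) over the nine columns appears to reproduce the listed sequence weights for the first peptides (0.37, 0.43). The separate per-residue values w = 0.055 etc. use a normalization that this sketch does not attempt to reproduce.

```python
from collections import Counter

CORES = [
    "FVVEADLPG", "FVVEFALPG", "FVVEFDLPG", "YLQDSDPDS", "MKQFINMWQ",
    "LMQNYDPNL", "PAERDSDVV", "LKLTLTNLI", "VWIEMQWCD", "YRLRWDPRD",
    "WRPDIVLEN", "VLENNVDGV", "YCNVLVSPD", "FRSACSISV",
]

def henikoff_weights(peptides):
    """Sequence weight = sum over columns of 1 / (r * s)."""
    weights = [0.0] * len(peptides)
    for column in zip(*peptides):
        counts = Counter(column)
        r = len(counts)  # number of different residues in this column
        for k, residue in enumerate(column):
            s = counts[residue]  # occurrences of this residue in the column
            weights[k] += 1.0 / (r * s)
    return weights

for peptide, weight in zip(CORES, henikoff_weights(CORES)):
    print(peptide, round(weight, 2))
```

Each column contributes a total weight of 1, so the weights sum to the number of columns (9 here).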

Low count correction
  Limited number of data
  Poor sampling of sequence space
  I is not found at position P1. Does this mean that I is forbidden?
  No! Use Blosum matrix to estimate pseudo frequency of I
  --------MLEFVVEADLPGIKA--------
  --------MLEFVVEFALPGIKA--------
  --------MLEFVVEFDLPGIAA--------
  -----------YLQDSDPDSFQD--------
  -GSDTITLPCRMKQFINMWQE----------
  -RNQEERLLADLMQNYDPNLR----------
  -----YDPNLRPAERDSDVVNVSLK------
  --------NVSLKLTLTNLISLNEREEA---
  --EREEALTTNVWIEMQWCDYR---------
  --------WCDYRLRWDPRDYEGLWVLR---
  LWVLRVPSTMVWRPDIVLEN-----------
  ----------IVLENNVDGVFEVALYCNVL-
  -----------YCNVLVSPDGCIYWLPPAIF
  -------PPAIFRSACSISVTYFPFDW----
             *********
             P1

Low count correction using Blosum matrices
  Every time for instance L/V is observed, I is also likely to occur
  Estimate low (pseudo) count correction using this approach
  As more data are included the pseudo count correction becomes less important
  Blosum62 substitution frequencies
    #   I       L       V
    L   0.1154  0.3755  0.0962
    V   0.1646  0.1303  0.2689
  $N_L = 2$, $N_V = 2$, $N_{eff} = 12$ =>
  $f_I = (2 \cdot 0.1154 + 2 \cdot 0.1646)/12 = 0.05$
  $p_I^* = (N_{eff} \cdot p_I + \beta \cdot f_I)/(N_{eff} + \beta) = (12 \cdot 0 + 10 \cdot 0.05)/(12 + 10) = 0.02$
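The arithmetic above can be reproduced with a short Python sketch. The function names are hypothetical, the substitution frequencies are the two Blosum62 values quoted on the slide, and, as on the slide, only the observed L and V counts enter the pseudo frequency.

```python
# Blosum62 substitution frequencies towards I, as quoted on the slide.
SUBSTITUTION_TO_I = {"L": 0.1154, "V": 0.1646}

def pseudo_frequency(observed_counts, substitution, n_eff):
    """Pseudo frequency of a residue, estimated from the residues observed."""
    return sum(n * substitution[aa] for aa, n in observed_counts.items()) / n_eff

def corrected_frequency(p_observed, f_pseudo, n_eff, beta):
    """Mix observed and pseudo frequencies; beta is the pseudo count weight."""
    return (n_eff * p_observed + beta * f_pseudo) / (n_eff + beta)

f_I = pseudo_frequency({"L": 2, "V": 2}, SUBSTITUTION_TO_I, n_eff=12)
p_I = corrected_frequency(p_observed=0.0, f_pseudo=f_I, n_eff=12, beta=10)
print(round(f_I, 2), round(p_I, 2))
```

As `n_eff` grows relative to `beta`, the corrected frequency approaches the observed one, which is the sense in which the correction becomes less important with more data.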

Information content
  Information and entropy
    Conserved amino acid regions contain a high degree of information (high order == low entropy)
    Variable amino acid regions contain a low degree of information (low order == high entropy)
  Shannon information
    $D = \log_2 N + \sum_i p_i \log_2 p_i$ (for proteins N = 20, for DNA N = 4)
    Conserved residue: $p_A = 1$, $p_{i \ne A} = 0$, so $D = \log_2 N$ (= 4.3 for proteins)
    Variable region: $p_A = 0.05$, $p_C = 0.05$, ..., so $D = 0$
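The two limiting cases can be checked numerically. A minimal sketch of the Shannon information formula, with a hypothetical function name:

```python
import math

def information_content(probabilities, alphabet_size=20):
    """D = log2(N) + sum_i p_i * log2(p_i); zero-probability terms contribute 0."""
    entropy_term = sum(p * math.log2(p) for p in probabilities if p > 0)
    return math.log2(alphabet_size) + entropy_term

conserved = [1.0] + [0.0] * 19   # one residue only
variable = [0.05] * 20           # all residues equally likely
print(information_content(conserved), information_content(variable))
```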

Sequence logo
  MHC class II logo from 10 sequences
  Height of a column equal to D
  Relative height of a letter is $p_A$
  Highly useful tool to visualize sequence motifs
  [Figure: sequence logo with a high information position marked]
  http://www.cbs.dtu.dk/~gorodkin/appl/plogo.html

Frequency matrix
  Frequencies x 100
Pos   A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
  1   2   1   1   1   1   1   1   1   1   4  16   1   6  15   7   1   2   7  18  13
  2   8  19   1   1   7   2   2   2   1   3  15  13   6   2   1   2   2   7   1   8
  3   3   2   7   2   1  17  13   2   1   8  14   3   1   1   7   7   2   0   1   8
  4   8  13  13  14   1   2  13   2   1   2   3   3   1   7   1   3   7   0   1   7
  5   4   1   7   7   7   1   2   2   1  13  15   2   6   6   1   7   2   7   7   4
  6   5   2   8  23   1   6   3   2   1   3   3   2   1   1   1  13   8   0   1  18
  7   2   1   7  13   1   1   2   2   1   8  14   2   6   1  20   7   2   7   1   3
  8   3   7   7   8   7   1   7   8   1   2   8   2   1   1  13   7   2   7   1   7
  9   3   2   7  19   1   6   2   8   1   9   9   2   1   1   1   7   2   0   1  18

More on Logos
  Information content: $D = \sum_i p_i \log_2 (p_i/q_i)$
  Shannon, $q_i = 1/N = 0.05$:
    $D = \sum_i p_i \log_2 p_i - \sum_i p_i \log_2 (1/N) = \log_2 N + \sum_i p_i \log_2 p_i$
  Kullback-Leibler, $q_i$ = background frequency
    V/L/A are more frequent than, for instance, C/H/W
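The difference between the uniform (Shannon) and background-aware (Kullback-Leibler) forms can be illustrated with a small sketch. To keep it short it uses a four-letter DNA alphabet and a made-up, purely hypothetical background distribution; neither comes from the slides.

```python
import math

def relative_entropy(p, q):
    """D = sum_i p_i * log2(p_i / q_i); terms with p_i = 0 contribute 0."""
    return sum(p[a] * math.log2(p[a] / q[a]) for a in p if p[a] > 0)

uniform = {b: 0.25 for b in "ACGT"}
skewed = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}  # hypothetical background

conserved_a = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}
conserved_c = {"A": 0.0, "C": 1.0, "G": 0.0, "T": 0.0}

print(relative_entropy(conserved_a, uniform))   # equals log2(4) with a uniform q
print(relative_entropy(conserved_a, skewed))
print(relative_entropy(conserved_c, skewed))
```

With a non-uniform background, a column conserved on a rare residue scores higher than one conserved on a common residue. This is the point of the C/H/W versus V/L/A remark.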

Mutual information
  ALWGFFPVA
  ILKEPVHGV
  ILGFVFTLT
  LLFGYPVYV
  GLSPTVWLS
  YMNGTMSQV
  GILGFVFTL
  WLSLLVPFV
  FLPSDFFPS
  $I(i,j) = \sum_{aa_i} \sum_{aa_j} P(aa_i, aa_j) \log [ P(aa_i, aa_j) / (P(aa_i) P(aa_j)) ]$
  P(G1) = 2/9 = 0.22, ..
  P(V6) = 4/9 = 0.44, ..
  P(G1,V6) = 2/9 = 0.22
  P(G1)*P(V6) = 8/81 = 0.10
  log(0.22/0.10) > 0
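A sketch of the mutual information between two peptide positions, computed from the nine peptides on the slide. The base of the logarithm is not given on the slide; base 2 is assumed here. Positions are 0-based in the code, so P1 and P6 are indices 0 and 5.

```python
import math
from collections import Counter

PEPTIDES = [
    "ALWGFFPVA", "ILKEPVHGV", "ILGFVFTLT", "LLFGYPVYV", "GLSPTVWLS",
    "YMNGTMSQV", "GILGFVFTL", "WLSLLVPFV", "FLPSDFFPS",
]

def mutual_information(peptides, i, j):
    """I(i,j) = sum over observed pairs of P(a,b) * log2(P(a,b) / (P(a) * P(b)))."""
    n = len(peptides)
    pair_counts = Counter((p[i], p[j]) for p in peptides)
    counts_i = Counter(p[i] for p in peptides)
    counts_j = Counter(p[j] for p in peptides)
    total = 0.0
    for (a, b), count in pair_counts.items():
        p_ab = count / n
        total += p_ab * math.log2(p_ab / ((counts_i[a] / n) * (counts_j[b] / n)))
    return total

print(mutual_information(PEPTIDES, 0, 5))
```

With only nine sequences most residue pairs are seen once, so the estimate is dominated by sampling noise. This is why a comparison against random peptides is a useful control.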

Mutual information
  [Figure: mutual information for 313 binding peptides compared with 313 random peptides]

Weight matrices
  Estimate amino acid frequencies from alignment, including sequence weighting and pseudo counts
  Now a weight matrix is given as $W_{ij} = \log(p_{ij}/q_j)$
  Here i is a position in the motif, and j an amino acid. $q_j$ is the background frequency for amino acid j
  W is an L x 20 matrix, where L is the motif length
  Score sequences to the weight matrix by looking up and adding L values from the matrix
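Building and applying a log-odds matrix can be sketched as follows. The two-position motif and the uniform background are hypothetical illustration data, and the base-2 logarithm is an assumption, since the slide only says "log".

```python
import math

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"
BACKGROUND = {aa: 0.05 for aa in AMINO_ACIDS}  # assumption: uniform background

def weight_matrix(probability_rows, background):
    """W[i][j] = log2(p_ij / q_j); rows must be free of zeros (pseudo counts)."""
    return [
        {aa: math.log2(row[aa] / background[aa]) for aa in AMINO_ACIDS}
        for row in probability_rows
    ]

def score(peptide, matrix):
    """Look up one value per position and add them."""
    if len(peptide) != len(matrix):
        raise ValueError("peptide length must equal motif length")
    return sum(matrix[i][aa] for i, aa in enumerate(peptide))

# Hypothetical motif: position 1 prefers L, position 2 is uninformative.
anchor = {aa: 0.02 for aa in AMINO_ACIDS}
anchor["L"] = 0.62
flat = dict(BACKGROUND)
W = weight_matrix([anchor, flat], BACKGROUND)

print(score("LA", W), score("AA", W))
```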

Example from real life
  10 peptides from MHCpep database
  Bind to the MHC complex
  Relevant for immune system recognition
  Estimate sequence motif and weight matrix
  Evaluate on 528 peptides
  ALAKAAAAM
  ALAKAAAAN
  ALAKAAAAR
  ALAKAAAAT
  ALAKAAAAV
  GMNERPILT
  GILGFVFTM
  TLNAWVKVV
  KLNEPVLLL
  AVVPFIVSV

Example from real life (cont.)
  Raw sequence counting (no sequence weighting, no pseudo count): prediction accuracy 0.45
  Sequence weighting: prediction accuracy 0.5

Example from real life (cont.)
  Sequence weighting and pseudo count: prediction accuracy 0.60
  Motif found on all data (485): prediction accuracy 0.79

Hidden Markov Models
  Weight matrices do not deal with insertions and deletions
  In alignments, this is done in an ad hoc manner by optimizing the two gap penalties, for the first gap and for gap extension
  An HMM is a natural framework in which insertions and deletions are dealt with explicitly

HMM (a simple example)
  Example from A. Krogh
  ACA---ATG
  TCAACTATC
  ACAC--AGC
  AGA---ATC
  ACCG--ATC
  Core of alignment (red): the core region defines the number of states in the HMM
  Non-core part of the alignment (blue): insertion and deletion statistics are derived from it

HMM construction
  ACA---ATG
  TCAACTATC
  ACAC--AGC
  AGA---ATC
  ACCG--ATC
  Gap region:
    5 matches: A, 2xC, T, G (emission probabilities A .2, C .4, G .2, T .2)
    5 transitions in gap region: C out, G out, A-C, C-T, T out
    Out transition 3/5, stay transition 2/5
  [Figure: HMM state diagram annotated with emission and transition probabilities]
  ACA---ATG: 0.8x1x0.8x1x0.8x0.4x1x0.8x1x0.2 = 3.3 x 10^-2
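The counts above can be reproduced by simple frequency counting over the alignment columns. This is a minimal sketch without pseudo counts. It assumes that columns 1-3 and 7-9 form the core and columns 4-6 the gap region. That split is inferred from the alignment rather than stated in the text.

```python
from collections import Counter

ALIGNMENT = ["ACA---ATG", "TCAACTATC", "ACAC--AGC", "AGA---ATC", "ACCG--ATC"]
CORE_COLUMNS = [0, 1, 2, 6, 7, 8]  # assumption: core (match) columns
GAP_COLUMNS = [3, 4, 5]            # assumption: insert region

def frequencies(residues):
    """Relative frequency of each residue in an iterable of residues."""
    counts = Counter(residues)
    total = sum(counts.values())
    return {residue: n / total for residue, n in counts.items()}

# One emission distribution per core column.
match_emissions = [frequencies(seq[c] for seq in ALIGNMENT) for c in CORE_COLUMNS]

# Residues each sequence places in the gap region (empty string if none).
inserted = ["".join(seq[c] for c in GAP_COLUMNS).replace("-", "") for seq in ALIGNMENT]
insert_emissions = frequencies("".join(inserted))

used = [s for s in inserted if s]
p_enter = len(used) / len(ALIGNMENT)     # fraction of sequences using the insert
n_out = len(used)                        # each insertion leaves once
n_stay = sum(len(s) - 1 for s in used)   # extra residues are "stay" transitions
p_out = n_out / (n_out + n_stay)
p_stay = n_stay / (n_out + n_stay)

print(match_emissions[0], insert_emissions, p_enter, p_out, p_stay)
```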

Align sequence to HMM
  ACA---ATG  0.8x1x0.8x1x0.8x0.4x1x0.8x1x0.2 = 3.3 x 10^-2
  TCAACTATC  0.2x1x0.8x1x0.8x0.6x0.2x0.4x0.4x0.4x0.2x0.6x1x1x0.8x1x0.8 = 0.0075 x 10^-2
  ACAC--AGC  = 1.2 x 10^-2
  AGA---ATC  = 3.3 x 10^-2
  ACCG--ATC  = 0.59 x 10^-2
  Consensus:
  ACAC--ATC  = 4.7 x 10^-2
  ACA---ATC  = 13.1 x 10^-2
  Exceptional:
  TGCT--AGG  = 0.0023 x 10^-2
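These probabilities can be reproduced with a small Python sketch of the model. The emission and transition tables below are derived by counting in the five-sequence alignment (core columns 1-3 and 7-9, one insert state in between). They are an assumption about the figure, not a transcription of it, although they do reproduce the products listed above.

```python
# Match-state emissions, one dict per core column (assumed from column counts).
MATCH = [
    {"A": 0.8, "T": 0.2},
    {"C": 0.8, "G": 0.2},
    {"A": 0.8, "C": 0.2},
    {"A": 1.0},
    {"T": 0.8, "G": 0.2},
    {"C": 0.8, "G": 0.2},
]
INSERT = {"A": 0.2, "C": 0.4, "G": 0.2, "T": 0.2}
ENTER_INSERT, SKIP_INSERT = 0.6, 0.4   # transitions out of the third match state
STAY_INSERT, LEAVE_INSERT = 0.4, 0.6   # transitions out of the insert state

def path_probability(aligned):
    """Probability of a 9-column aligned sequence along its path through the model."""
    core = aligned[:3] + aligned[6:]
    inserted = aligned[3:6].replace("-", "")
    probability = 1.0
    for state, residue in zip(MATCH, core):
        probability *= state.get(residue, 0.0)
    if inserted:
        probability *= ENTER_INSERT
        for residue in inserted:
            probability *= INSERT.get(residue, 0.0)
        probability *= STAY_INSERT ** (len(inserted) - 1) * LEAVE_INSERT
    else:
        probability *= SKIP_INSERT
    return probability

for sequence in ["ACA---ATG", "TCAACTATC", "ACA---ATC", "TGCT--AGG"]:
    print(sequence, path_probability(sequence))
```

All other transitions in this model have probability 1, so they are left out of the product.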

Align sequence to HMM - Null model
  Score depends strongly on length
  Null model is a random model. For length L the score is $0.25^L$
  Log-odds score for sequence S: $\log( P(S)/0.25^L )$
  ACA---ATG = 4.9
  TCAACTATC = 3.0
  ACAC--AGC = 5.3
  AGA---ATC = 4.9
  ACCG--ATC = 4.6
  Consensus:
  ACAC--ATC = 6.7
  ACA---ATC = 6.3
  Exceptional:
  TGCT--AGG = -0.97   Note!
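The length correction can be sketched as a one-line function. The natural logarithm is an assumption: the slide only says "Log", but the listed scores appear to match the natural log. The model probabilities are entered as the products given on the previous slide.

```python
import math

def log_odds(p_model, length, p_null_per_residue=0.25):
    """log( P(S) / 0.25^L ), using the natural logarithm (assumed)."""
    return math.log(p_model / p_null_per_residue ** length)

# ACA---ATG: 6 residues, probability taken from the product on the previous slide.
p_aca_atg = 0.8 * 0.8 * 0.8 * 0.4 * 0.8 * 0.2
# TGCT--AGG: 7 residues, reported as 0.0023 x 10^-2 (2.304e-5 before rounding).
p_tgct_agg = 2.304e-5

print(round(log_odds(p_aca_atg, 6), 1), round(log_odds(p_tgct_agg, 7), 2))
```

A negative score means the sequence is better explained by the random model than by the HMM, which is what sets the exceptional sequence apart.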

HMM’s and weight matrices
  Note: in the case of un-gapped alignments, HMM’s become simple weight matrices
  It still might be useful to use an HMM tool package to estimate a weight matrix
    Sequence weighting
    Pseudo counts

Profile HMM’s
  [Annotations on the alignment: Insertion, Deletion]
  EM55_HUMAN WWQGRVEGSSKESAGLIPSPELQEWRVASMAQSAP--SEAPSCSPFGKKKK-YKDKYLAK
  CSKP_HUMAN WWQGKLENSKNGTAGLIPSPELQEWRVACIAMEKTKQEQQASCTWFGKKKKQYKDKYLAK
  KAPB_MOUSE -----PENLLIDHQGYIQVTDFGFAKRVKG------------------------------
  NRC2_NEUCR -----PENILLHQSGHIMLSDFDLSKQSDPGGKPTMIIGKNGTSTSSLPTIDTKSCIANF

  EM55_HUMAN HSSIFDQLDVVSYEEVVRLPAFKRKTLVLIGASGVGRSHIKNALLSQNPEKFVYPVPYTT
  CSKP_HUMAN HNAVFDQLDLVTYEEVVKLPAFKRKTLVLLGAHGVGRRHIKNTLITKHPDRFAYPIPHTT
  KAPB_MOUSE RTWTLCGTPEYLAPEIILSKGYNKAVDWWALGVLIYEMAAGYPPFFADQPIQIYEKIVSG
  NRC2_NEUCR RTNSFVGTEEYIAPEVIKGSGHTSAVDWWTLGILIYEMLYGTTPFKGKNRNATFANILRE

  EM55_HUMAN RPPRKSEEDGKEYHFISTEEMTRNISANEFLEFGSYQGNMFGTKFETVHQIHKQNKIAIL
  CSKP_HUMAN RPPKKDEENGKNYYFVSHDQMMQDISNNEYLEYGSHEDAMYGTKLETIRKIHEQGLIAIL
  KAPB_MOUSE KVRFPSHF-----SSDLKDLLRNLLQVDLTKRFGNLKNGVSDIKTHKWFATTDWIAIYQR
  NRC2_NEUCR DIPFPDHAGAPQISNLCKSLIRKLLIKDENRRLG-ARAGASDIKTHPFFRTTQWALI--R

  EM55_HUMAN NNGVDETLKKLQEAFDQACSSPQWVPVSWVY
  CSKP_HUMAN NNEIDETIRHLEEAVELVCTAPQWVPVSWVY
  KAPB_MOUSE EKCGKEFCEF---------------------
  NRC2_NEUCR ENAVDPFEEFNSVTLHHDGDEEYHSDAYEKR

Profile HMM’s
  All M/D pairs must be visited once

TMHMM (trans-membrane HMM) (Sonnhammer, von Heijne, and Krogh) Model TM length distribution. Power of HMM. Difficult in alignment.

Combination of HMM’s - Gene finding
  x xxxxxxxxATGccc ccc cccTAAxxxxxxxx
  ATG: start codon; TAA: stop codon
  Regions: inter-genic region, region around start codon, coding region, region around stop codon

HMM packages
  HMMER (http://hmmer.wustl.edu/)
    S.R. Eddy, WashU St. Louis. Freely available.
  SAM (http://www.cse.ucsc.edu/research/compbio/sam.html)
    R. Hughey, K. Karplus, A. Krogh, D. Haussler and others, UC Santa Cruz. Freely available to academia, nominal license fee for commercial users.
  META-MEME (http://metameme.sdsc.edu/)
    William Noble Grundy, UC San Diego. Freely available. Combines features of PSSM search and profile HMM search.
  NET-ID, HMMpro (http://www.netid.com/html/hmmpro.html)
    Freely available to academia, nominal license fee for commercial users. Allows HMM architecture construction.

Simple Hmmer command
  hmmbuild --gapmax 0.0 --fast A2.hmmer A2.fsa
  Sequences shown on the slide:
    >HLA-A.0201 16 Example_for_Ligand
    SLLPAIVEL
    YLLPAIVHI
    TLWVDPYEV
    SXPSGGXGV
    GLVPFLVSV
  Output:
    hmmbuild - build a hidden Markov model from an alignment
    HMMER 2.2g (August 2001)
    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    Alignment file:                    A2.fsa
    File format:                       a2m
    Search algorithm configuration:    Multiple domain (hmmls)
    Model construction strategy:       Fast/ad hoc (gapmax 0.0)
    Null model used:                   (default)
    Sequence weighting method:         G/S/C tree weights
    - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    Alignment:           #1
    Number of sequences: 232
    Number of columns:   9
    Determining effective sequence number ... done. [192]
    Weighting sequences heuristically     ... done.
    Constructing model architecture       ... done.
    Converting counts to probabilities    ... done.
    Setting model name, etc.              ... done. [A2.fasta]
    Constructed a profile HMM (length 9)
    Average score:  -6.42 bits
    Minimum score:  -15.47 bits
    Maximum score:  -0.84 bits
    Std. deviation: 2.72 bits

Weight matrix
Pos   A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
  1  -3  -4  -4  -5  -4  -3  -4  -5  -2  -1   1  -4   2   3   1  -3  -3   4   5   1
  2   0   3  -3  -4   2  -2  -3  -4  -3  -2   1   2   2  -3  -4  -3  -3   4  -3   0
  3  -2  -2   1  -2  -4   4   2  -4  -2   0   1  -2  -2  -3   1   0  -2  -4  -3   0
  4   0   2   3   2  -4  -1   2  -3  -2  -3  -3  -2  -3   1  -3  -2   0  -4  -3   0
  5  -1  -4   1   0   2  -3  -3  -4  -3   1   1  -3   2   0  -4   0  -2   4   2  -1
  6  -1  -3   1   4  -3   1  -1  -3  -3  -2  -3  -2  -3  -4  -3   2   1  -4  -4   2
  7  -3  -3   1   2  -4  -3  -2  -4  -3   0   1  -3   2  -3   4   0  -2   4  -3  -2
  8  -2   0   1   0   2  -2   0   0  -3  -3   0  -2  -3  -3   3   0  -2   4  -3   0
  9  -2  -3   1   3  -3   1  -2   0  -3   0   0  -3  -2  -3  -4   0  -2  -4  -3   2