
Published by Rachel Cantrell. Modified over 3 years ago.

1
Sequence motifs, information content, logos, and HMMs
Morten Nielsen, CBS, BioCentrum, DTU

2
Outline
Multiple alignments and sequence motifs
Weight matrices and consensus sequence
– Sequence weighting
– Low (pseudo) counts
Information content
– Sequence logos
– Mutual information
Example from the real world
HMMs and profile HMMs
– TMHMM (trans-membrane protein)
– Gene finding
Links to HMM packages

3
Multiple alignment and sequence motifs
Core
Consensus sequence
Weight matrices
Problems
– Sequence weights
– Low counts
Alignment (column offsets of the original figure are not preserved; ********* marks the 9-residue core):
MLEFVVEADLPGIKA
MLEFVVEFALPGIKA
MLEFVVEFDLPGIAA
YLQDSDPDSFQD
GSDTITLPCRMKQFINMWQE
RNQEERLLADLMQNYDPNLR
YDPNLRPAERDSDVVNVSLK
NVSLKLTLTNLISLNEREEA
EREEALTTNVWIEMQWCDYR
WCDYRLRWDPRDYEGLWVLR---
--LWVLRVPSTMVWRPDIVLEN
IVLENNVDGVFEVALYCNVL
YCNVLVSPDGCIYWLPPAIF
PPAIFRSACSISVTYFPFDW----
Consensus: FVVEFDLPG

4
Sequence weighting 1 - Clustering
Same alignment as on slide 3.
The first three sequences (MLEFVVEADLPGIKA, MLEFVVEFALPGIKA, MLEFVVEFDLPGIAA) are homologous and form one cluster, so each gets weight 1/n = 1/3.
Consensus sequence: YRQELDPLV
Previous consensus: FVVEFDLPG

5
Sequence weighting 2 - (Henikoff & Henikoff)
Peptide    W
FVVEADLPG  0.37
FVVEFALPG  0.43
FVVEFDLPG  0.32
YLQDSDPDS  0.59
MKQFINMWQ  0.90
LMQNYDPNL  0.68
PAERDSDVV  0.75
LKLTLTNLI  0.85
VWIEMQWCD  0.84
YRLRWDPRD  0.51
WRPDIVLEN  0.71
VLENNVDGV  0.59
YCNVLVSPD  0.71
FRSACSISV  0.75
$w_{aa} = 1/(r \cdot s)$
r: number of different amino acids in a column
s: number of occurrences of the amino acid in that column
Normalize so that $\sum w_{aa} = 1$ for each column
Sequence weight is the sum of $w_{aa}$ over the columns
Example, first column: r = 7 (FYMLPVW)
F: s = 4, w' = 1/28, w = [missing]
Y: s = 3, w' = 1/21, w = [missing]
M, P, W: s = 1, w' = 1/7, w = [missing]
L, V: s = 2, w' = 1/14, w = 0.109
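The 1/(r·s) rule can be checked directly against the weight table. The following is a minimal sketch, assuming the 14 core peptides are exactly those listed on the slide; the function name `henikoff_weights` is chosen here for illustration and does not come from the original presentation.

```python
from collections import Counter

# The 14 nine-residue core peptides listed on the slide.
peptides = [
    "FVVEADLPG", "FVVEFALPG", "FVVEFDLPG", "YLQDSDPDS", "MKQFINMWQ",
    "LMQNYDPNL", "PAERDSDVV", "LKLTLTNLI", "VWIEMQWCD", "YRLRWDPRD",
    "WRPDIVLEN", "VLENNVDGV", "YCNVLVSPD", "FRSACSISV",
]

def henikoff_weights(seqs):
    """Position-based weights: a residue contributes 1/(r*s) in each column,
    where r is the number of distinct residues in the column and s is the
    number of times this residue occurs in it."""
    weights = [0.0] * len(seqs)
    for column in zip(*seqs):          # iterate over alignment columns
        counts = Counter(column)
        r = len(counts)
        for idx, aa in enumerate(column):
            weights[idx] += 1.0 / (r * counts[aa])
    return weights

weights = henikoff_weights(peptides)
for pep, w in zip(peptides, weights):
    print(f"{pep} {w:.2f}")
```

Each column contributes a total weight of 1, so the weights of all sequences add up to the number of columns (9 here), which matches the sum of the W column in the table.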

6
Low count correction
Same alignment as on slide 3.
Limited number of data
Poor sampling of sequence space
I is not found at position P1. Does this mean that I is forbidden?
No! Use the Blosum matrix to estimate the pseudo frequency of I at P1

7
Low count correction using Blosum matrices
(Table: Blosum62 substitution frequencies for I, L and V; values not preserved)
Every time, for instance, L or V is observed, I is also likely to occur
Estimate the low (pseudo) count correction using this approach
As more data are included, the pseudo count correction becomes less important
$N_L = 2$, $N_V = 2$, $N_{eff} = 12$ => $f_I = (2 \cdot [\text{missing}] + 2 \cdot 0.1646)/12 = 0.05$
$p_I^* = (N_{eff} \cdot p_I + \beta \cdot f_I)/(N_{eff} + \beta) = (12 \cdot 0 + 10 \cdot 0.05)/(12 + 10) = 0.02$
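The mixing of observed and pseudo frequencies can be reproduced with a few lines. This is a minimal sketch, assuming the weight on the pseudo frequency is 10 (the value that appears in the slide's arithmetic); the function name `corrected_frequency` and the larger `n_eff` of 200 are illustrative choices, not taken from the presentation.

```python
def corrected_frequency(p_obs, f_pseudo, n_eff, beta):
    """Weighted average of the observed frequency and the pseudo frequency.

    n_eff weights the observed data, beta weights the pseudo frequency."""
    return (n_eff * p_obs + beta * f_pseudo) / (n_eff + beta)

# Numbers from the slide: I is never observed (p_obs = 0), f_I = 0.05,
# N_eff = 12 and a pseudo count weight of 10 (assumed to be the lost symbol).
p_star = corrected_frequency(p_obs=0.0, f_pseudo=0.05, n_eff=12, beta=10)
print(round(p_star, 2))

# With more data (hypothetical n_eff = 200) the correction matters less.
p_star_large = corrected_frequency(p_obs=0.0, f_pseudo=0.05, n_eff=200, beta=10)
print(round(p_star_large, 4))
```

The second call illustrates the last bullet of the slide: as the effective number of sequences grows, the corrected frequency moves back toward the observed one.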

8
Information content
Information and entropy
– Conserved amino acid regions contain a high degree of information (high order == low entropy)
– Variable amino acid regions contain a low degree of information (low order == high entropy)
Shannon information: $D = \log_2 N + \sum_i p_i \log_2 p_i$ (for proteins $N = 20$, for DNA $N = 4$)
Conserved residue: $p_A = 1$, $p_{i \neq A} = 0$, so $D = \log_2 N$ (= 4.3 for proteins)
Variable region: $p_A = 0.05$, $p_C = 0.05$, ..., so $D = 0$
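The two limiting cases above can be verified numerically. This is a small sketch; the function name `information_content` is an illustrative choice, and terms with $p_i = 0$ are skipped because $p \log_2 p$ tends to 0 as $p$ tends to 0.

```python
import math

def information_content(freqs, alphabet_size=20):
    """Shannon information D = log2(N) + sum_i p_i * log2(p_i), in bits."""
    entropy_term = sum(p * math.log2(p) for p in freqs if p > 0)
    return math.log2(alphabet_size) + entropy_term

conserved = [1.0] + [0.0] * 19   # one amino acid always observed
variable = [0.05] * 20           # all amino acids equally likely

print(round(information_content(conserved), 1))
print(round(information_content(variable), 1))
```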

9
Sequence logo
Height of a column is equal to D
Relative height of a letter is $p_A$
Highly useful tool to visualize sequence motifs
(Figure: MHC class II logo from 10 sequences, with a high-information position marked)

10
Frequency matrix
(Table: frequencies x 100 per motif position; columns A R N D C Q E G H I L K M F P S T W Y V; values not preserved)

11
More on Logos
Information content: $D = \sum_i p_i \log_2(p_i/q_i)$
Shannon, $q_i = 1/N = 0.05$:
$D = \sum_i p_i \log_2 p_i - \sum_i p_i \log_2(1/N) = \log_2 N + \sum_i p_i \log_2 p_i$
Kullback-Leibler, $q_i$ = background frequency
– V/L/A are more frequent than, for instance, C/H/W
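The relation between the two definitions can be checked numerically: with a uniform background the Kullback-Leibler form reduces to the Shannon form. The sketch below uses a 4-letter alphabet to keep it short; the observed frequencies and the non-uniform background are made-up illustration values, not data from the presentation.

```python
import math

def kl_information(p, q):
    """Kullback-Leibler information D = sum_i p_i * log2(p_i / q_i), in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def shannon_information(p):
    """Shannon information D = log2(N) + sum_i p_i * log2(p_i), in bits."""
    return math.log2(len(p)) + sum(pi * math.log2(pi) for pi in p if pi > 0)

p = [0.7, 0.1, 0.1, 0.1]             # hypothetical observed frequencies
uniform = [0.25] * 4                 # q_i = 1/N
background = [0.4, 0.2, 0.2, 0.2]    # hypothetical non-uniform background

d_uniform = kl_information(p, uniform)
d_background = kl_information(p, background)
print(round(d_uniform, 3), round(shannon_information(p), 3))
print(round(d_background, 3))
```

In this example a residue that is already common in the background contributes less information than it would under a uniform background, which is the reason for using background frequencies in logos.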

12
Mutual information
$I(i,j) = \sum_{aa_i} \sum_{aa_j} P(aa_i, aa_j) \log \frac{P(aa_i, aa_j)}{P(aa_i) P(aa_j)}$
Example for positions P1 and P6 of the following peptides:
ALWGFFPVA
ILKEPVHGV
ILGFVFTLT
LLFGYPVYV
GLSPTVWLS
YMNGTMSQV
GILGFVFTL
WLSLLVPFV
FLPSDFFPS
$P(G_1) = 2/9 = 0.22$, ..
$P(V_6) = 4/9 = 0.44$, ..
$P(G_1, V_6) = 2/9 = 0.22$
$P(G_1) \cdot P(V_6) = 8/81 = 0.10$
$\log(0.22/0.10) > 0$
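The counts behind this example can be reproduced from the nine peptides. The sketch below is an illustration, not the presenter's code: the function name `mutual_information` is made up, positions are 0-based (so P1 is index 0 and P6 is index 5), and the natural logarithm is an assumption because the slide does not state the base.

```python
import math
from collections import Counter

# The nine peptides shown on the slide.
peptides = [
    "ALWGFFPVA", "ILKEPVHGV", "ILGFVFTLT", "LLFGYPVYV", "GLSPTVWLS",
    "YMNGTMSQV", "GILGFVFTL", "WLSLLVPFV", "FLPSDFFPS",
]

def mutual_information(seqs, i, j):
    """Sum over observed residue pairs of P(a,b) * log(P(a,b) / (P(a) * P(b)))."""
    n = len(seqs)
    pair_counts = Counter((s[i], s[j]) for s in seqs)
    counts_i = Counter(s[i] for s in seqs)
    counts_j = Counter(s[j] for s in seqs)
    total = 0.0
    for (a, b), c in pair_counts.items():
        p_ab = c / n
        total += p_ab * math.log(p_ab / ((counts_i[a] / n) * (counts_j[b] / n)))
    return total

p_g1 = sum(s[0] == "G" for s in peptides) / len(peptides)
p_v6 = sum(s[5] == "V" for s in peptides) / len(peptides)
p_g1_v6 = sum(s[0] == "G" and s[5] == "V" for s in peptides) / len(peptides)
mi_p1_p6 = mutual_information(peptides, 0, 5)

print(round(p_g1, 2), round(p_v6, 2), round(p_g1_v6, 2))
print(round(mi_p1_p6, 3))
```

Pairs that never occur together contribute nothing to the sum, so only observed pairs are visited.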

13
Mutual information
(Figure panels: 313 binding peptides; 313 random peptides)

14
Weight matrices
Estimate amino acid frequencies from the alignment, including sequence weighting and pseudo counts
A weight matrix is then given as $W_{ij} = \log(p_{ij}/q_j)$
Here i is a position in the motif, and j an amino acid. $q_j$ is the background frequency for amino acid j.
W is an L x 20 matrix, where L is the motif length
Score a sequence against the weight matrix by looking up and adding L values from the matrix
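The construction and the scoring step can be sketched as follows. This is a simplified illustration under stated assumptions: a uniform background ($q_j = 0.05$), no sequence weighting, and a flat pseudo count of weight `pseudo` in place of the Blosum-based correction described earlier. The function names are made up for this example; the peptides are the ten 9-mers from the real-life example slide.

```python
import math

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

def build_weight_matrix(peptides, pseudo=1.0):
    """Return a list of L dicts with W[i][aa] = log(p_i(aa) / q(aa)).

    Assumes a uniform background and a flat pseudo count (simplification)."""
    q = 1.0 / len(AMINO_ACIDS)
    n = len(peptides)
    matrix = []
    for column in zip(*peptides):      # one column per motif position
        row = {}
        for aa in AMINO_ACIDS:
            p = (column.count(aa) + pseudo * q) / (n + pseudo)
            row[aa] = math.log(p / q)
        matrix.append(row)
    return matrix

def score(matrix, peptide):
    """Look up one value per position and add the L values."""
    return sum(row[aa] for row, aa in zip(matrix, peptide))

training = [
    "ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
    "GMNERPILT", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV",
]
W = build_weight_matrix(training)
print(round(score(W, "ALAKAAAAV"), 2))
print(round(score(W, "WWWWWWWWW"), 2))
```

A peptide that resembles the training set scores higher than one built from residues that are rare at every position.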

15
Example from real life
10 peptides from the MHCpep database
Bind to the MHC complex
Relevant for immune system recognition
Estimate sequence motif and weight matrix
Evaluate on 528 peptides
ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV

16
Example from real life (cont.)
Raw sequence counting
– No sequence weighting
– No pseudo count
– Prediction accuracy 0.45
Sequence weighting
– No pseudo count
– Prediction accuracy 0.5

17
Example from real life (cont.)
Sequence weighting and pseudo count
– Prediction accuracy 0.60
Motif found on all data (485)
– Prediction accuracy 0.79

18
Hidden Markov Models
Weight matrices do not deal with insertions and deletions
In alignments, this is done in an ad hoc manner by optimizing the two gap penalties, one for the first gap and one for gap extension
An HMM is a natural framework where insertions and deletions are dealt with explicitly

19
HMM (a simple example)
ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC
Example from A. Krogh
Core region defines the number of states in the HMM (red)
Insertion and deletion statistics are derived from the non-core part of the alignment (blue)
(Figure label: Core of alignment)

20
HMM construction
(Figure: HMM state diagram with emission probabilities over A, C, G, T in each state; values not preserved)
ACA---ATG
TCAACTATC
ACAC--AGC
AGA---ATC
ACCG--ATC
5 matches: A, 2xC, T, G
5 transitions in the gap region: C out, G out, A-C, C-T, T out
Out transition: 3/5
Stay transition: 2/5
ACA---ATG: $0.8 \times 1 \times 0.8 \times 1 \times 0.8 \times 0.4 \times 1 \times 0.8 \times 1 \times 0.2 = 3.3 \times 10^{-2}$
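The probability of a sequence along its path through the model is the product of the emission and transition probabilities visited. The sketch below multiplies the factors quoted for ACA---ATG and derives the insert-state transition estimates from the five counted transitions; the variable names are illustrative only.

```python
import math

# Factors quoted on the slide for the path of ACA---ATG through the model.
factors_aca_atg = [0.8, 1, 0.8, 1, 0.8, 0.4, 1, 0.8, 1, 0.2]
p_aca_atg = math.prod(factors_aca_atg)
print(f"{p_aca_atg:.2e}")

# Transitions counted in the gap region: C out, G out, A-C, C-T, T out.
transitions = ["out", "out", "stay", "stay", "out"]
p_out = transitions.count("out") / len(transitions)
p_stay = transitions.count("stay") / len(transitions)
print(p_out, p_stay)
```

In practice such products are usually accumulated as sums of logarithms to avoid numerical underflow for long sequences.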

21
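Align sequence to HMM
ACA---ATG: $0.8 \times 1 \times 0.8 \times 1 \times 0.8 \times 0.4 \times 1 \times 0.8 \times 1 \times 0.2 = 3.3 \times 10^{-2}$
TCAACTATC: $0.2 \times 1 \times 0.8 \times 1 \times 0.8 \times 0.6 \times 0.2 \times 0.4 \times 0.4 \times 0.4 \times 0.2 \times 0.6 \times 1 \times 1 \times 0.8 \times 1 \times 0.8$ = [missing] $\times 10^{-2}$
ACAC--AGC = $1.2 \times 10^{-2}$
AGA---ATC = $3.3 \times 10^{-2}$
ACCG--ATC = $0.59 \times 10^{-2}$
Consensus:
ACAC--ATC = $4.7 \times 10^{-2}$
ACA---ATC = $13.1 \times 10^{-2}$
Exceptional:
TGCT--AGG = [missing] $\times 10^{-2}$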

22
Align sequence to HMM - Null model
Score depends strongly on length
Null model is a random model. For length L the score is $0.25^L$
Log-odds score for sequence S: $\log(P(S)/0.25^L)$
ACA---ATG = 4.9
TCAACTATC = 3.0
ACAC--AGC = 5.3
AGA---ATC = 4.9
ACCG--ATC = 4.6
Consensus:
ACAC--ATC = 6.7
ACA---ATC = 6.3
Exceptional:
TGCT--AGG = [missing] (Note!)
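The log-odds scores can be recomputed from the rounded model probabilities given for the same sequences. This is a sketch under one assumption: the natural logarithm is used, because it reproduces the listed values (for example 4.9 for ACA---ATG). The function name `log_odds` is illustrative, and small deviations in the last digit are possible because the input probabilities are rounded.

```python
import math

def log_odds(p_model, length, p_background=0.25):
    """Natural-log odds of the model probability against a random model
    that emits each of the `length` nucleotides with probability 0.25."""
    return math.log(p_model / p_background ** length)

# Rounded model probabilities as listed for the aligned sequences.
model_probabilities = {
    "ACA---ATG": 3.3e-2,
    "ACAC--AGC": 1.2e-2,
    "AGA---ATC": 3.3e-2,
    "ACCG--ATC": 0.59e-2,
    "ACA---ATC": 13.1e-2,
}

scores = {}
for aligned, p in model_probabilities.items():
    length = len(aligned.replace("-", ""))   # gaps do not count toward L
    scores[aligned] = log_odds(p, length)
    print(aligned, round(scores[aligned], 1))
```

Dividing by the null model removes most of the length dependence, so sequences of different lengths become comparable.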

23
HMMs and weight matrices
Note: in the case of un-gapped alignments, HMMs become simple weight matrices
It still might be useful to use an HMM tool package to estimate a weight matrix
– Sequence weighting
– Pseudo counts

24
Profile HMMs
Multiple alignment with insertions and deletions (figure labels: Insertion, Deletion):
EM55_HUMAN WWQGRVEGSSKESAGLIPSPELQEWRVASMAQSAP--SEAPSCSPFGKKKK-YKDKYLAK
CSKP_HUMAN WWQGKLENSKNGTAGLIPSPELQEWRVACIAMEKTKQEQQASCTWFGKKKKQYKDKYLAK
KAPB_MOUSE -----PENLLIDHQGYIQVTDFGFAKRVKG
NRC2_NEUCR -----PENILLHQSGHIMLSDFDLSKQSDPGGKPTMIIGKNGTSTSSLPTIDTKSCIANF
EM55_HUMAN HSSIFDQLDVVSYEEVVRLPAFKRKTLVLIGASGVGRSHIKNALLSQNPEKFVYPVPYTT
CSKP_HUMAN HNAVFDQLDLVTYEEVVKLPAFKRKTLVLLGAHGVGRRHIKNTLITKHPDRFAYPIPHTT
KAPB_MOUSE RTWTLCGTPEYLAPEIILSKGYNKAVDWWALGVLIYEMAAGYPPFFADQPIQIYEKIVSG
NRC2_NEUCR RTNSFVGTEEYIAPEVIKGSGHTSAVDWWTLGILIYEMLYGTTPFKGKNRNATFANILRE
EM55_HUMAN RPPRKSEEDGKEYHFISTEEMTRNISANEFLEFGSYQGNMFGTKFETVHQIHKQNKIAIL
CSKP_HUMAN RPPKKDEENGKNYYFVSHDQMMQDISNNEYLEYGSHEDAMYGTKLETIRKIHEQGLIAIL
KAPB_MOUSE KVRFPSHF-----SSDLKDLLRNLLQVDLTKRFGNLKNGVSDIKTHKWFATTDWIAIYQR
NRC2_NEUCR DIPFPDHAGAPQISNLCKSLIRKLLIKDENRRLG-ARAGASDIKTHPFFRTTQWALI--R
EM55_HUMAN NNGVDETLKKLQEAFDQACSSPQWVPVSWVY
CSKP_HUMAN NNEIDETIRHLEEAVELVCTAPQWVPVSWVY
KAPB_MOUSE EKCGKEFCEF
NRC2_NEUCR ENAVDPFEEFNSVTLHHDGDEEYHSDAYEKR

25
Profile HMMs
All M/D pairs must be visited once

26
TMHMM (trans-membrane HMM)
(Sonnhammer, von Heijne, and Krogh)
Models the TM length distribution. This shows the power of HMMs: it is difficult to do in an alignment.

27
Combination of HMMs - Gene finding
(Figure: sub-models chained along a sequence; x = non-coding, c = coding)
Inter-genic region: x
Region around start codon: xxxxxxxxATGccc (start codon ATG)
Coding region: ccc
Region around stop codon: cccTAAxxxxxxxx (stop codon TAA)

28
HMM packages
HMMER (http://hmmer.wustl.edu/)
– S.R. Eddy, WashU St. Louis. Freely available.
SAM
– R. Hughey, K. Karplus, A. Krogh, D. Haussler and others, UC Santa Cruz. Freely available to academia, nominal license fee for commercial users.
META-MEME
– William Noble Grundy, UC San Diego. Freely available. Combines features of PSSM search and profile HMM search.
NET-ID, HMMpro
– Freely available to academia, nominal license fee for commercial users.
– Allows HMM architecture construction.

29
Simple Hmmer command
hmmbuild --gapmax fast A2.hmmer A2.fsa
hmmbuild - build a hidden Markov model from an alignment
HMMER 2.2g (August 2001)
Alignment file: A2.fsa
File format: a2m
Search algorithm configuration: Multiple domain (hmmls)
Model construction strategy: Fast/ad hoc (gapmax 0.0)
Null model used: (default)
Sequence weighting method: G/S/C tree weights
Alignment: #1
Number of sequences: 232
Number of columns: 9
Determining effective sequence number... done. [192]
Weighting sequences heuristically... done.
Constructing model architecture... done.
Converting counts to probabilities... done.
Setting model name, etc.... done. [A2.fasta]
Constructed a profile HMM (length 9)
Average score: bits
Minimum score: bits
Maximum score: bits
Std. deviation: 2.72 bits
>HLA-A Example_for_Ligand
SLLPAIVEL
>HLA-A Example_for_Ligand
YLLPAIVHI
>HLA-A Example_for_Ligand
TLWVDPYEV
>HLA-A Example_for_Ligand
SXPSGGXGV
>HLA-A Example_for_Ligand
GLVPFLVSV

30
Weight matrix
(Table: weight matrix values per motif position; columns A R N D C Q E G H I L K M F P S T W Y V; values not preserved)


