Entropy, Information contents & Logo plots By Thomas Nordahl Petersen


1 Entropy, Information contents & Logo plots By Thomas Nordahl Petersen

2 Biological information
GTTCTTCGTGTTTATTTTTAGGAAATTGATGA
TTGTTTCTCCTTTTAAAATAGTACTGCTGTTT
TTTACTAACGACACATTGAAGAAATCACTTTG
GATACGCTTACCGTTATCCAGAGCTACAGCGC
TACTAATATGTAATACTTCAGCTCCCCTTAAT
ATTGAGATCTTTTTTAACTAGTTAGGTCTACC
TTCTCCCCTTCTTCATTTTAGCCTGTTTGGAC
TAACATAACTTATTTACATAGTGCCATTGAAC
GATATTTCCCGTTGTGTTAAGGCTGAGAAGAA
TTTTCCCGACCATCAAGACAGGTGATTTATCA
TGCAAAAACTTTTTTTCACAGGGCTAACTTGC
GTTTATTGTGTTTCCACTCAGTTAAAAAACGA
AACGTACTTTAATATTTATAGTACTTCATTCG
AACATGCTATTTTTCATACAGCAACCTCACAT
CTGCACTCATCATTAGATTAGAGGAACATGGA
TACTTTTCTTTATCTAAGCAGCTAACTCAACT
ATCAACATGCTATTGAACTAGAGATCCACCTA
TAACTAACATGACTTTAACAGGGCTAATTTAC
AGTACTAACTAATTAACTTAGAACATTAACAT
GATCACCGTCACATTTATTAGAATTTCAAACG
CAGTGGAATTTTTTTTTCTAGAAATGGTATCG
CTCTATGACCAATAAAAACAGACTGTACTTTC
AAATGGTATTATTTATAACAGTTGAACATTTC
ATAAATATGCGATCAATATAGACCGTTGATAT
ATTTTACTTTTTTTTTTTTAGGAGCTCCAAGA
ATTTATTTCCTTATAATACAGACACGGTTACA
TCGCAATTAATTTTCTAATAGTTTTTCATTTT
GACCATCTTTCTTTTCCCCAGTGCTAAACACG
AACCTTCTTTCTCATTCGTAGATTACTGTTGC
AATTACTAACAGCTGTAATAGCCGACAAATTT
CTCTCTGCGCGTCCAATTTAGCTATACTGTTG
TTGTTTTGTTTTGTCGTACAGTGTTTGGAGAA
AAACTTCCATTTCTTACATAGATCATCGCCAT
TCCTTTCCATAATTTATTCAGCGCTTTGGTAT
CGATTTACTATTTCCATTTAGACGTTGTTCAA
AATTTACTAACAATACTTCAGTTTATAATGGA
TCCTATACTAACAATTTGTAGTTCATAAATAA
(Figure labels: Exon, Intron, Exon)
Multiple alignment of acceptor sites from 268 yeast DNA sequences.
What is the biological signal around the site?
What are the important positions?
How can it be visualized?
Logo plot with Information Content (sequence logo).

3 Entropy - Definition
The entropy of a random variable is a measure of its uncertainty.
In thermodynamics, G = H - TS, where the entropy S of a system is its degree of disorder.

4 Entropy - Definition
Entropy of a distribution of amino acids.
The Shannon entropy: $H(p) = -\sum_a p_a \log_2 p_a$, where p is an amino acid distribution.
H(p) is measured in bits: log2(2) = 1, log2(4) = 2.
Multiple alignment of 3 sequences:
Seq1: A L P K
Seq2: A V P R
Seq3: A I K R
High entropy - high disorder
Low entropy - low disorder

5 Entropy - example
$H(p) = -\sum_a p_a \log_2 p_a$
Multiple alignment of 3 sequences:
Seq1: A L R
Seq2: A V R
Seq3: A I K
Pos1: H(p) = -[1*log2(1)] = 0
Pos2: H(p) = -[1/3*log2(1/3) + 1/3*log2(1/3) + 1/3*log2(1/3)] = 1.58
Pos3: H(p) = -[2/3*log2(2/3) + 1/3*log2(1/3)] = 0.92
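The per-position entropies in this example can be checked with a few lines of Python. This is a minimal sketch; the helper name `column_entropy` is ours, not from the slides.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy in bits of the symbols in one alignment column."""
    counts = Counter(column)
    total = len(column)
    # Each observed symbol contributes -p * log2(p), with p its frequency.
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())

# The three-sequence alignment from the example.
alignment = ["ALR", "AVR", "AIK"]

# Transpose the alignment so that each string is one column (position).
columns = ["".join(seq[i] for seq in alignment) for i in range(len(alignment[0]))]
entropies = [column_entropy(col) for col in columns]

for pos, h in enumerate(entropies, start=1):
    print(f"Pos{pos}: H(p) = {h:.2f} bits")
```

Rounded to two decimals, the three values agree with the hand calculation: 0, 1.58 and 0.92 bits.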

6 Relative Entropy
The Kullback-Leibler distance D
How different is an amino acid distribution p_a from a background distribution q_a, i.e. what is the distance D between them?
$D(p\|q) = \sum_a p_a \log_2(p_a/q_a)$
Normally a background distribution of the amino acids is obtained as frequencies from a large database like UniProt.
Ala (A)        Gln (Q)        Leu (L)        Ser (S) 6.87
Arg (R)        Glu (E)        Lys (K)        Thr (T) 5.46
Asn (N)        Gly (G)        Met (M)        Trp (W) 1.16
Asp (D)        His (H)        Phe (F)        Tyr (Y) 3.07
Cys (C)        Ile (I)        Pro (P)        Val (V) 6.71
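As a hedged illustration of the formula, the sketch below computes D(p||q) for a single alignment column. The function name `kl_divergence` is ours, and the uniform background `q_uniform` is only a stand-in assumption; in practice q would hold database-derived frequencies such as those from UniProt.

```python
import math

def kl_divergence(p, q):
    """D(p||q) in bits; p and q map symbols to probabilities.

    Symbols with p_a = 0 contribute nothing and are skipped.
    """
    return sum(pa * math.log2(pa / q[a]) for a, pa in p.items() if pa > 0)

# Observed distribution at one position: R seen twice, K once.
p = {"R": 2 / 3, "K": 1 / 3}

# ASSUMPTION: a uniform background over the 20 amino acids, used here only
# as a placeholder for real background frequencies.
q_uniform = {aa: 1 / 20 for aa in "ACDEFGHIKLMNPQRSTVWY"}

d = kl_divergence(p, q_uniform)
print(f"D(p||q) = {d:.2f} bits")
```

A distribution compared with itself gives D = 0, and D grows as p departs from the background.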

7 Information content
$D(p\|q) = \sum_a p_a \log_2(p_a/q_a)$
Often the information content is used as a measure of the degree of conservation:
$I = \sum_a p_a \log_2(p_a/q_a)$
A special case is that where all amino acids have the same background frequency: q_a = 1/20.

8 Information content - amino acids
$I = \sum_a p_a \log_2\frac{p_a}{1/20}$
$= \sum_a p_a [\log_2 p_a - \log_2(1/20)]$
$= -H(p) - \sum_a p_a \log_2(1/20)$
$= -H(p) + \sum_a p_a \log_2 20$
$= -H(p) + \log_2 20$ (since $\sum_a p_a = 1$)
$= \sum_a p_a \log_2 p_a + \log_2 20$

9 Information content
$I = -H(p) + 4.32 = \sum_a p_a \log_2 p_a + 4.32$
General formula: $I = \sum_a p_a \log_2 p_a + \log_2 N$, where N is the number of letters in the alphabet.
The information content is at its maximum when the entropy is zero, i.e. at a fully conserved position in a multiple alignment.
Multiple alignment of 3 sequences:
Seq1: A L R
Seq2: A V R
Seq3: A I K
Pos1: I = [1*log2(1)] + 4.32 = 4.32
Pos2: I = [1/3*log2(1/3) + 1/3*log2(1/3) + 1/3*log2(1/3)] + 4.32 = 2.74
Pos3: I = [2/3*log2(2/3) + 1/3*log2(1/3)] + 4.32 = 3.40
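The general formula I = log2(N) - H(p) is easy to verify numerically. The following is a small sketch with a hypothetical helper `information_content`, assuming a uniform background over N letters as on this slide.

```python
import math
from collections import Counter

def information_content(column, alphabet_size=20):
    """Information content in bits of one column: log2(N) - H(p).

    Assumes a uniform background over `alphabet_size` letters.
    """
    counts = Counter(column)
    total = len(column)
    entropy = sum(-(n / total) * math.log2(n / total) for n in counts.values())
    return math.log2(alphabet_size) - entropy

alignment = ["ALR", "AVR", "AIK"]
info = [
    information_content("".join(seq[i] for seq in alignment))
    for i in range(len(alignment[0]))
]

for pos, value in enumerate(info, start=1):
    print(f"Pos{pos}: I = {value:.2f} bits")
```

Passing `alphabet_size=4` gives the corresponding values for nucleotide alignments, where the maximum is 2 bits.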

10 Logo plots - HowTo
Starting from the multiple alignment of acceptor sites shown on slide 2:
Count nucleotides at each position.
Convert the counts to frequencies.
Draw a frequency logo.
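The counting and frequency steps can be sketched in Python as follows. To keep the example short it uses only the first four sequences of the acceptor-site alignment; the variable names are ours.

```python
from collections import Counter

# First four sequences of the acceptor-site alignment (illustration only).
sequences = [
    "GTTCTTCGTGTTTATTTTTAGGAAATTGATGA",
    "TTGTTTCTCCTTTTAAAATAGTACTGCTGTTT",
    "TTTACTAACGACACATTGAAGAAATCACTTTG",
    "GATACGCTTACCGTTATCCAGAGCTACAGCGC",
]

length = len(sequences[0])

# Step 1: count nucleotides at each position (one Counter per column).
counts = [Counter(seq[i] for seq in sequences) for i in range(length)]

# Step 2: convert counts to frequencies; absent nucleotides get 0.0.
freqs = [{nt: c[nt] / len(sequences) for nt in "ACGT"} for c in counts]

# Positions 20 and 21 (1-based) hold the conserved AG of the acceptor site.
print(freqs[19])
print(freqs[20])
```

In a frequency logo, each letter at a position is drawn with a height equal to its frequency, so every column has total height 1.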

11 Logo plots - Information Content
Calculate the information content: $I = \sum_a p_a \log_2 p_a + \log_2 4$. The maximal value is 2 bits.
Sequence logo (figure annotations: "Completely conserved", "~0.5 each").
The total height at a position is the information content, measured in bits.
The height of each letter is proportional to the frequency of that letter.
A logo plot is a visualization of a multiple alignment.
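The rule for letter heights can be written down directly: the stack height is I, and each letter takes the share p_a * I. The sketch below uses a hypothetical helper `logo_heights` and two made-up columns; it leaves out any small-sample correction that logo programs may apply.

```python
import math

def logo_heights(freqs, alphabet_size=4):
    """Return (I, per-letter heights) for one position of a sequence logo.

    I = log2(alphabet_size) - H(p); each letter's height is p_a * I.
    """
    entropy = sum(-p * math.log2(p) for p in freqs.values() if p > 0)
    info = math.log2(alphabet_size) - entropy
    heights = {nt: p * info for nt, p in freqs.items() if p > 0}
    return info, heights

# A completely conserved position: one letter of height 2 bits.
info_cons, heights_cons = logo_heights({"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0})

# Two equally frequent letters: I = 1 bit, so each letter is 0.5 bits tall.
info_half, heights_half = logo_heights({"A": 0.0, "C": 0.0, "G": 0.5, "T": 0.5})

print(info_cons, heights_cons)
print(info_half, heights_half)
```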

12 Programs to make a Logo plot
WebLogo: requires a multiple alignment as input; protein or DNA sequences; more output formats.
Blast2Logo: requires a FASTA file as input; only protein sequences; runs PSI-BLAST and makes a table of frequencies; PDF logo plot.

13 WebLogo - http://weblogo.berkeley.edu/

14 WebLogo - http://weblogo.berkeley.edu/

15 Find important positions
>sp|Q00017|RHA1_ASPAC Rhamnogalacturonan acetylesterase
MKTAALAPLFFLPSALATTVYLAGDSTMAKNGGGSGTNGWGEYLASYLSATVVNDAVAGR
SARSYTREGRFENIADVVTAGDYVIVEFGHNDGGSLSTDNGRTDCSGTGAEVCYSVYDGV
NETILTFPAYLENAAKLFTAKGAKVILSSQTPNNPWETGTFVNSPTRFVEYAELAAEVAG
VEYVDHWSYVDSIYETLGNATVNSYFPIDHTHTSPAGAEVVAEAFLKAVVCTGTSLKSVL
TTTSFEGTCL
What is the next step?
Find homologous sequences - how? Blast or Psi-Blast.
Download sequences.
Make a multiple alignment with ClustalW, Mafft or others, or use the Blast2Logo program.

16 Multiple alignment programs

17 Blast2logo - http://www.cbs.dtu.dk/biotools/Blast2logo-1.0/

18 Important positions
Important positions in proteins are conserved positions => high Information Content.
Conserved for a reason:
Functionally important positions: catalytic residues.
Structurally important positions: maintain the correct fold of the protein.

19 Blast2logo
Runs iterative Blast, i.e. Psi-Blast.
Searches for homologous sequences by use of Position Specific Scoring Matrices (PSSM).
Iteration 1: use the Blosum62 scoring matrix.
Iteration 2: make a profile of the sequences found in iteration 1.
Iteration 3: make a profile of the sequences found in iteration 2.
Iteration 4: calculate amino acid frequencies at each position in the query sequence.
Correct for low counts and weight the sequences such that very similar sequences are down-weighted.
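The slides do not say how Blast2logo corrects for low counts. As one simple illustration of the idea only, the sketch below adds a flat pseudocount to every amino acid before converting counts to frequencies. The function `smoothed_frequencies` and the choice of one pseudocount per letter are our assumptions, not the program's method, and sequence weighting is not shown.

```python
def smoothed_frequencies(counts, alphabet, pseudocount=1.0):
    """Frequencies with a flat additive pseudocount per letter.

    ASSUMPTION: illustrative scheme only; not taken from Blast2logo.
    """
    total = sum(counts.get(a, 0) for a in alphabet) + pseudocount * len(alphabet)
    return {a: (counts.get(a, 0) + pseudocount) / total for a in alphabet}

alphabet = "ACDEFGHIKLMNPQRSTVWY"

# A column observed in only three sequences: R twice, K once.
observed = {"R": 2, "K": 1}
freqs = smoothed_frequencies(observed, alphabet)

# Unseen amino acids now get a small non-zero frequency instead of zero.
print(freqs["R"], freqs["K"], freqs["A"])
```

With few sequences the pseudocounts dominate and pull the column towards the background; with many sequences their effect fades.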

20 Psi-Blast
Iterative Blast: an iterative process to search for remote homologs.
Captures and uses evolutionarily conserved information.
The scoring matrix is refined by use of a gap-free multiple alignment.
Figure (flowchart) labels: Input sequence; Sequence database; Blast; E < threshold; 4 iterations; PSSM; Multiple alignment.
PSSM: Position Specific Scoring Matrix

21 Important positions - counting

22 Blast2logo
Important amino acids: G24, D25 & S26
G89, N91 & D92
Important amino acids: D209 & H212

23 Exercise
Calculate nucleotide frequencies from a multiple alignment of human donor sites.
Calculate entropy and information content.
Draw (by hand) a logo plot.
Use 2 logo plot programs.
Learn to interpret logo and frequency plots.

