Entropy, Information contents & Logo plots By Thomas Nordahl Petersen


1 Entropy, Information contents & Logo plots By Thomas Nordahl Petersen

2 Biological information
Multiple alignment of acceptor sites from 268 yeast DNA sequences (excerpt):
GTTCTTCGTGTTTATTTTTAGGAAATTGATGA
TTGTTTCTCCTTTTAAAATAGTACTGCTGTTT
TTTACTAACGACACATTGAAGAAATCACTTTG
GATACGCTTACCGTTATCCAGAGCTACAGCGC
TACTAATATGTAATACTTCAGCTCCCCTTAAT
ATTGAGATCTTTTTTAACTAGTTAGGTCTACC
TTCTCCCCTTCTTCATTTTAGCCTGTTTGGAC
TAACATAACTTATTTACATAGTGCCATTGAAC
GATATTTCCCGTTGTGTTAAGGCTGAGAAGAA
TTTTCCCGACCATCAAGACAGGTGATTTATCA
TGCAAAAACTTTTTTTCACAGGGCTAACTTGC
GTTTATTGTGTTTCCACTCAGTTAAAAAACGA
AACGTACTTTAATATTTATAGTACTTCATTCG
AACATGCTATTTTTCATACAGCAACCTCACAT
CTGCACTCATCATTAGATTAGAGGAACATGGA
TACTTTTCTTTATCTAAGCAGCTAACTCAACT
ATCAACATGCTATTGAACTAGAGATCCACCTA
TAACTAACATGACTTTAACAGGGCTAATTTAC
AGTACTAACTAATTAACTTAGAACATTAACAT
GATCACCGTCACATTTATTAGAATTTCAAACG
CAGTGGAATTTTTTTTTCTAGAAATGGTATCG
CTCTATGACCAATAAAAACAGACTGTACTTTC
AAATGGTATTATTTATAACAGTTGAACATTTC
ATAAATATGCGATCAATATAGACCGTTGATAT
ATTTTACTTTTTTTTTTTTAGGAGCTCCAAGA
ATTTATTTCCTTATAATACAGACACGGTTACA
TCGCAATTAATTTTCTAATAGTTTTTCATTTT
GACCATCTTTCTTTTCCCCAGTGCTAAACACG
AACCTTCTTTCTCATTCGTAGATTACTGTTGC
AATTACTAACAGCTGTAATAGCCGACAAATTT
CTCTCTGCGCGTCCAATTTAGCTATACTGTTG
TTGTTTTGTTTTGTCGTACAGTGTTTGGAGAA
AAACTTCCATTTCTTACATAGATCATCGCCAT
TCCTTTCCATAATTTATTCAGCGCTTTGGTAT
CGATTTACTATTTCCATTTAGACGTTGTTCAA
AATTTACTAACAATACTTCAGTTTATAATGGA
TCCTATACTAACAATTTGTAGTTCATAAATAA
Figure labels: Exon, Intron, Exon
What is the biological signal around the site?
What are the important positions?
How can it be visualized? With a logo plot based on information content (a sequence logo).

3 Entropy - Definition
The entropy of a random variable is a measure of its uncertainty. In thermodynamics, G = H - TS, where the entropy S of a system is its degree of disorder.

4 Entropy - Definition
Entropy of a distribution of amino acids. The Shannon entropy is $H(p) = -\sum_a p_a \log_2 p_a$, where p is an amino acid distribution. H(p) is measured in bits: $\log_2 2 = 1$, $\log_2 4 = 2$.
Multiple alignment of 3 sequences:
Seq1: A L P K
Seq2: A V P R
Seq3: A I K R
High entropy - high disorder. Low entropy - low disorder.

5 Entropy - example
$H(p) = -\sum_a p_a \log_2 p_a$
Multiple alignment of 3 sequences:
Seq1: A L R
Seq2: A V R
Seq3: A I K
Pos1: H(p) = -[1*log2(1)] = 0
Pos2: H(p) = -[1/3*log2(1/3) + 1/3*log2(1/3) + 1/3*log2(1/3)] = 1.58
Pos3: H(p) = -[2/3*log2(2/3) + 1/3*log2(1/3)] = 0.92
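The per-position entropies of this three-sequence example can be reproduced with a short script. This is a minimal sketch: the function name `column_entropy` is an illustrative choice, and the sum is written as p * log2(1/p), which is algebraically the same as -p * log2(p).

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (in bits) of the letters observed in one alignment column."""
    counts = Counter(column)
    total = len(column)
    # p * log2(1/p) equals -p * log2(p); this form avoids printing "-0.00"
    return sum((n / total) * math.log2(total / n) for n in counts.values())

alignment = ["ALR", "AVR", "AIK"]  # Seq1..Seq3 from the slide
columns = ["".join(seq[i] for seq in alignment) for i in range(len(alignment[0]))]
entropies = [column_entropy(col) for col in columns]

for pos, h in enumerate(entropies, start=1):
    print(f"Pos{pos}: H(p) = {h:.2f} bits")
```

The printed values agree with the hand calculation: 0.00, 1.58 and 0.92 bits.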

6 Relative Entropy
The Kullback-Leibler distance D: how different is an amino acid distribution $p_a$ from a background distribution $q_a$, i.e. what is the distance D between them?
$D(p\|q) = \sum_a p_a \log_2 (p_a / q_a)$
Normally the background distribution of the amino acids is obtained as frequencies from a large database such as UniProt.
Background frequencies (%), shown only where a value is given:
Ala (A)   Gln (Q)   Leu (L)   Ser (S) 6.87
Arg (R)   Glu (E)   Lys (K)   Thr (T) 5.46
Asn (N)   Gly (G)   Met (M)   Trp (W) 1.16
Asp (D)   His (H)   Phe (F)   Tyr (Y) 3.07
Cys (C)   Ile (I)   Pro (P)   Val (V) 6.71
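The relative entropy can be sketched as below. The two-letter distributions `p` and `q` are made-up numbers for illustration, not real amino acid frequencies, and the function name is hypothetical. The example also shows that D(p||q) and D(q||p) generally differ, so D is not a distance in the strict metric sense.

```python
import math

def relative_entropy(p, q):
    """D(p||q) in bits; p and q map each letter to a probability.

    Terms with p_a = 0 contribute nothing; q_a must be > 0 wherever p_a > 0.
    """
    return sum(pa * math.log2(pa / q[a]) for a, pa in p.items() if pa > 0)

# Hypothetical two-letter example (illustrative values only)
p = {"A": 0.75, "B": 0.25}
q = {"A": 0.50, "B": 0.50}

d_pq = relative_entropy(p, q)
d_qp = relative_entropy(q, p)
print(f"D(p||q) = {d_pq:.3f} bits, D(q||p) = {d_qp:.3f} bits")
```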

7 Information content
$D(p\|q) = \sum_a p_a \log_2 (p_a / q_a)$
The information content is often used as a measure of the degree of conservation: $I = \sum_a p_a \log_2 (p_a / q_a)$. A special case is when all amino acids have the same background frequency: $q_a = 1/20$.

8 Information content for amino acids
$I = \sum_a p_a \log_2 \frac{p_a}{1/20}$
$= \sum_a p_a [\log_2 p_a - \log_2 (1/20)]$
$= -H(p) - \sum_a p_a \log_2 (1/20)$
$= -H(p) + \sum_a p_a \log_2 20$
$= -H(p) + \log_2 20$, since $\sum_a p_a = 1$,
where $-H(p) = \sum_a p_a \log_2 p_a$.
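A quick numerical check of this derivation, assuming a uniform background of 1/20: the relative entropy against the uniform background should equal log2(20) minus the entropy. The distribution `p` below is a made-up example over four of the twenty amino acids, and the function names are illustrative.

```python
import math

def kl_against_uniform(p, n_letters=20):
    """Sum of p_a * log2(p_a / q_a) with q_a = 1/n_letters for every letter."""
    q = 1.0 / n_letters
    return sum(pa * math.log2(pa / q) for pa in p if pa > 0)

def entropy(p):
    """Shannon entropy in bits, written as p * log2(1/p)."""
    return sum(pa * math.log2(1.0 / pa) for pa in p if pa > 0)

# Hypothetical distribution: four amino acids observed, the other 16 have p_a = 0
p = [0.5, 0.25, 0.125, 0.125]

lhs = kl_against_uniform(p)
rhs = math.log2(20) - entropy(p)
print(f"sum p log2(p/q) = {lhs:.4f}, log2(20) - H(p) = {rhs:.4f}")
```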

9 Information content
$I = -H(p) + 4.32 = \sum_a p_a \log_2 p_a + 4.32$
General formula: $I = \sum_a p_a \log_2 p_a + \log_2 N$, where N is the number of letters in the alphabet. The information content is at its maximum when the entropy is zero, i.e. at a fully conserved position in a multiple alignment.
Multiple alignment of 3 sequences:
Seq1: A L R
Seq2: A V R
Seq3: A I K
Pos1: I = [1*log2(1)] + 4.32 = 4.32
Pos2: I = [1/3*log2(1/3) + 1/3*log2(1/3) + 1/3*log2(1/3)] + 4.32 = 2.74
Pos3: I = [2/3*log2(2/3) + 1/3*log2(1/3)] + 4.32 = 3.40
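The general formula can be applied to the same three-sequence example in a few lines. This sketch assumes a uniform background over N letters (N = 20 for amino acids); `information_content` is an illustrative name.

```python
import math
from collections import Counter

def information_content(column, alphabet_size=20):
    """I = log2(N) - H(p) in bits, assuming a uniform background over N letters."""
    counts = Counter(column)
    total = len(column)
    entropy = sum((n / total) * math.log2(total / n) for n in counts.values())
    return math.log2(alphabet_size) - entropy

alignment = ["ALR", "AVR", "AIK"]  # Seq1..Seq3 from the slide
ic = [information_content("".join(seq[i] for seq in alignment))
      for i in range(len(alignment[0]))]

for pos, value in enumerate(ic, start=1):
    print(f"Pos{pos}: I = {value:.2f} bits")
```

Passing `alphabet_size=4` gives the DNA case, where the maximum is 2 bits.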

10 Logo plots - HowTo
Starting from the acceptor-site alignment shown on slide 2:
Count nucleotides at each position.
Convert the counts to frequencies.
Draw the frequency logo.
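The counting and frequency steps might look like this in Python. For brevity the sketch uses only the first three sequences of the acceptor-site alignment; `position_frequencies` is an illustrative name, and all sequences are assumed to have equal length with no gaps.

```python
from collections import Counter

# First three sequences of the acceptor-site alignment
seqs = [
    "GTTCTTCGTGTTTATTTTTAGGAAATTGATGA",
    "TTGTTTCTCCTTTTAAAATAGTACTGCTGTTT",
    "TTTACTAACGACACATTGAAGAAATCACTTTG",
]

def position_frequencies(seqs, alphabet="ACGT"):
    """Return one {letter: frequency} dict per alignment column."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    freqs = []
    for i in range(length):
        counts = Counter(s[i] for s in seqs)           # count step
        freqs.append({a: counts[a] / len(seqs) for a in alphabet})  # frequency step
    return freqs

freqs = position_frequencies(seqs)
# Columns 20 and 21 (1-based) hold the conserved AG of the acceptor site
print(freqs[19], freqs[20])
```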

11 Logo plots - Information Content
Calculate the information content: $I = \sum_a p_a \log_2 p_a + \log_2 4$. The maximal value is 2 bits.
Sequence logo (figure annotations: "Completely conserved"; "~0.5 each").
The total height at a position is the information content, measured in bits. The height of each letter is proportional to the frequency of that letter. A logo plot is a visualization of a multiple alignment.
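The letter heights can be sketched as frequency times information content. The function name `logo_heights` is hypothetical and the two example columns are made up. Published logo tools may also apply a small-sample correction, which is left out here.

```python
import math

def logo_heights(freqs, alphabet_size=4):
    """Height of each letter in bits: frequency * information content of the column."""
    entropy = sum(p * math.log2(1.0 / p) for p in freqs.values() if p > 0)
    info = math.log2(alphabet_size) - entropy
    return {letter: p * info for letter, p in freqs.items()}

# A fully conserved column: one letter, 2 bits tall
conserved = logo_heights({"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0})

# A column split evenly between two letters: 1 bit in total, 0.5 bits per letter
split = logo_heights({"A": 0.0, "C": 0.5, "G": 0.0, "T": 0.5})

print(conserved)
print(split)
```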

12 Logo Plot – DNA Splice sites
28-Dec-18. Figure labels: Acceptor site, Donor site, Exon

13 BLAT genome browser "Details"
Exon. Correct splice site?

14 BLAT genome browser "Details"
Donor site: exon ... G | GT ... intron
Acceptor site: intron ... AG | exon ...
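As a small illustration of the GT ... AG pattern, the following sketch checks whether a candidate intron starts with GT and ends with AG. The function name and the example sequences are made up, and the check covers only these two dinucleotides, not the full splice-site signal.

```python
def has_canonical_splice_sites(intron):
    """True if the intron starts with GT (donor) and ends with AG (acceptor)."""
    intron = intron.upper()
    # Require at least 4 bases so GT and AG cannot overlap
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")

# Hypothetical candidate introns
print(has_canonical_splice_sites("GTATGTTTTTAG"))  # starts GT, ends AG
print(has_canonical_splice_sites("GCATGTTTTTAG"))  # starts GC instead of GT
```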

15 Programs to make a Logo plot
WebLogo: requires a multiple alignment as input; handles protein or DNA sequences; offers more output formats.
Blast2Logo: requires a FASTA file as input; protein sequences only; runs PSI-BLAST and makes a table of frequencies; produces a PDF logo plot.

16 WebLogo - http://weblogo.berkeley.edu/

17 WebLogo - http://weblogo.berkeley.edu/

18 Find important positions
>sp|Q00017|RHA1_ASPAC Rhamnogalacturonan acetylesterase
MKTAALAPLFFLPSALATTVYLAGDSTMAKNGGGSGTNGWGEYLASYLSATVVNDAVAGR
SARSYTREGRFENIADVVTAGDYVIVEFGHNDGGSLSTDNGRTDCSGTGAEVCYSVYDGV
NETILTFPAYLENAAKLFTAKGAKVILSSQTPNNPWETGTFVNSPTRFVEYAELAAEVAG
VEYVDHWSYVDSIYETLGNATVNSYFPIDHTHTSPAGAEVVAEAFLKAVVCTGTSLKSVL
TTTSFEGTCL
What is the next step?
Find homologous sequences. How? Blast or PsiBlast.
Download the sequences.
Make a multiple alignment with ClustalW, Mafft or others, or use the Blast2Logo program.

19 Multiple alignment programs

20 Blast2logo - http://www.cbs.dtu.dk/biotools/Blast2logo-1.0/

21 Important positions
Important positions in proteins are conserved positions => high Information Content. They are conserved for a reason:
Functionally important positions: catalytic residues.
Structurally important positions: maintain the correct fold of the protein.

22 Blast2logo
Runs iterative Blast, i.e. Psi-Blast, searching for homologous sequences by use of Position Specific Scoring Matrices (PSSM).
Iteration 1 - use the Blosum62 scoring matrix.
Iteration 2 - make a profile of the sequences found in iteration 1.
Iteration 3 - make a profile of the sequences found in iteration 2.
Iteration 4 - calculate amino acid frequencies at each position in the query sequence. Correct for low counts and weight the sequences such that very similar sequences are down-weighted.
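The slides do not say how Blast2logo corrects for low counts or weights the sequences. As a stand-in, the sketch below uses position-based sequence weights (in the style of Henikoff and Henikoff) and a flat pseudocount. All names, the pseudocount value and the toy alignment are assumptions made for illustration.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def position_based_weights(seqs):
    """Weights that sum to 1; sequences sharing letters with many others get less."""
    weights = [0.0] * len(seqs)
    for i in range(len(seqs[0])):
        counts = Counter(s[i] for s in seqs)
        for k, s in enumerate(seqs):
            # 1 / (number of distinct letters * number of sequences with this letter)
            weights[k] += 1.0 / (len(counts) * counts[s[i]])
    total = sum(weights)
    return [w / total for w in weights]

def weighted_frequencies(column, weights, pseudocount=0.1):
    """Weighted counts (rescaled to the number of sequences) plus a flat pseudocount."""
    n = len(column)
    counts = {a: pseudocount for a in AMINO_ACIDS}
    for residue, w in zip(column, weights):
        counts[residue] += w * n
    total = sum(counts.values())
    return {a: c / total for a, c in counts.items()}

# Toy alignment: two identical sequences and one divergent sequence
seqs = ["ALR", "ALR", "AIK"]
weights = position_based_weights(seqs)
last_column = "".join(s[-1] for s in seqs)
freqs = weighted_frequencies(last_column, weights)
print(weights)
print(freqs["R"], freqs["K"], freqs["A"])
```

The two identical sequences share their weight, so the divergent sequence counts for more than one third, and no amino acid ends up with a frequency of exactly zero.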

23 Psi-Blast
Iterative Blast: an iterative process to search for remote homologs. It captures and uses evolutionarily conserved information; the scoring matrix is refined by use of a gap-free multiple alignment.
Figure labels: Input sequence, Sequence database, Blast, E < threshold, 4 iterations, PSSM, Multiple alignment
PSSM: Position Specific Scoring Matrix

24 Important positions - counting

25 Blast2logo
Important amino acids: G24, D25 & S26; G89, N91 & D92; D209 & H212

26 Blast2logo
Db=nr.70
Important amino acids: G24, D25 & S26; D209 & H212

27 Exercise
Calculate nucleotide frequencies from a multiple alignment of human donor sites.
Calculate entropy and information content.
Draw (by hand) a logo plot.
Learn to interpret logo and frequency plots.

