
Motif characterization: Position weight matrix (PWM), Perceptron and their applications Xuhua Xia




1 Motif characterization: Position weight matrix (PWM), Perceptron and their applications Xuhua Xia xxia@uottawa.ca http://dambe.bio.uottawa.ca

2 Motif characterization

Input: site-specific sequences

Approaches:
–Consensus sequence (the chance of having NNN… increases with increasing number of sequences)
–Frequency profile (the problem of mutation bias)
–PWM, sequence logo, perceptron and Gibbs sampler (cannot detect column association)
–Multiple correspondence analysis

          1234567890123
A4GALT    ATACCATGTCCAA
ACO2      ACAAAATGGCGCC
ACR       GGAGTATGGTTGA
ADM2      CCGCCATGGCCCG
...

3 Consensus sequence

Sequences flanking the initiation codon of 508 CDSs:

          1234567890123
A4GALT    AUACCAUGUCCAA
ACO2      ACAAAAUGGCGCC
ACR       GGAGUAUGGUUGA
ADM2      CCGCCAUGGCCCG
...

Site      A      C      G      U   Cons
1        75    173    171     89   N
2       105    216    144     43   N
3       199     70    212     27   N
4       124    236     83     65   N
5        60    276    143     29   N
6       502      6      0      0   A
7         0      0      0    508   U
8         0      0    508      0   G
9        98     71    274     65   N
10      141    227     89     51   N
11       49    154    221     84   N
12       89    144    196     79   N
13      121    151    134    102   N
Sum    1563   1724   2175   1142   N

Our objective is to find if sites flanking AUG contribute to the start codon recognition. The consensus sequence does not give us the answer.
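The count table and consensus row can be sketched in a few lines. This is a minimal illustration using only the four example sequences shown on the slide (not the full set of 508); the function names and the 0.9 cutoff for calling a consensus base are assumptions made for the example, not the author's rule.

```python
from collections import Counter

# The four example sequences shown on the slide (the full data set has 508).
SEQS = {
    "A4GALT": "AUACCAUGUCCAA",
    "ACO2": "ACAAAAUGGCGCC",
    "ACR": "GGAGUAUGGUUGA",
    "ADM2": "CCGCCAUGGCCCG",
}

def site_counts(seqs):
    """Return one Counter of nucleotides per alignment column."""
    return [Counter(column) for column in zip(*seqs)]

def consensus(counts, min_fraction=0.9):
    """Most common base per site, or N when it falls below min_fraction.

    The 0.9 cutoff is a hypothetical choice for illustration only.
    """
    out = []
    for c in counts:
        base, n = c.most_common(1)[0]
        out.append(base if n / sum(c.values()) >= min_fraction else "N")
    return "".join(out)

counts = site_counts(SEQS.values())
cons = consensus(counts)
print(cons)
```

With so few sequences, only the invariant AUG survives the cutoff, which illustrates why a consensus says little about the flanking sites.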

4 Consensus sequence

Sequences flanking the initiation codon of 508 CDSs:

          1234567890123
A4GALT    AUACCAUGUCCAA
ACO2      ACAAAAUGGCGCC
ACR       GGAGUAUGGUUGA
ADM2      CCGCCAUGGCCCG
...

            Counts                        Frequencies
Site      A      C      G      U        A        C        G        U
1        75    173    171     89   0.1476   0.3406   0.3366   0.1752
2       105    216    144     43   0.2067   0.4252   0.2835   0.0846
3       199     70    212     27   0.3917   0.1378   0.4173   0.0531
4       124    236     83     65   0.2441   0.4646   0.1634   0.1280
5        60    276    143     29   0.1181   0.5433   0.2815   0.0571
6       502      6      0      0   0.9882   0.0118   0.0000   0.0000
7         0      0      0    508   0.0000   0.0000   0.0000   1.0000
8         0      0    508      0   0.0000   0.0000   1.0000   0.0000
9        98     71    274     65   0.1929   0.1398   0.5394   0.1280
10      141    227     89     51   0.2776   0.4469   0.1752   0.1004
11       49    154    221     84   0.0965   0.3031   0.4350   0.1654
12       89    144    196     79   0.1752   0.2835   0.3858   0.1555
13      121    151    134    102   0.2382   0.2972   0.2638   0.2008
Sum    1563   1724   2175   1142   0.2367   0.2611   0.3293   0.1729

RCCaugGCGG (-3R, +4G)

Are the red numbers red herrings? The problem of mutation bias. What background frequencies to use as control?

5 Consensus sequence

Sequences flanking the initiation codon of 508 CDSs (same sequences and site-specific count and frequency table as on slide 4).

S = ACGGTACCACGTT

Likelihood, odds ratio, log-odds, PWMS
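The step from site-specific frequencies to log-odds can be sketched as follows. Each PWM entry is taken as log2(p_ij / p_j), where p_ij is the frequency of nucleotide j at site i and p_j is the background frequency from the Sum row. The pseudocount scheme (a constant `alpha` weighted by the background) is an assumption made for this example; the slides do not state the exact formula, so entries for zero-count cells will not match the published table, while the other entries should agree to about two decimals.

```python
import math

# Site-specific counts (A, C, G, U) for sites 1-13, copied from the slide.
COUNTS = [
    (75, 173, 171, 89), (105, 216, 144, 43), (199, 70, 212, 27),
    (124, 236, 83, 65), (60, 276, 143, 29), (502, 6, 0, 0),
    (0, 0, 0, 508), (0, 0, 508, 0), (98, 71, 274, 65),
    (141, 227, 89, 51), (49, 154, 221, 84), (89, 144, 196, 79),
    (121, 151, 134, 102),
]

def background(counts):
    """Pooled nucleotide frequencies over all sites (the Sum row)."""
    totals = [sum(column) for column in zip(*counts)]
    grand_total = sum(totals)
    return [t / grand_total for t in totals]

def pwm(counts, bg, alpha=0.01):
    """Log2 odds of site frequency over background frequency.

    alpha is a hypothetical pseudocount that keeps zero counts finite.
    """
    matrix = []
    for row in counts:
        n = sum(row)
        matrix.append([
            math.log2(((c + alpha * p) / (n + alpha)) / p)
            for c, p in zip(row, bg)
        ])
    return matrix

bg = background(COUNTS)
W = pwm(COUNTS, bg)
print([round(x, 4) for x in W[0]])  # log-odds for site 1 (A, C, G, U)
```

Using the pooled frequencies as the background is one answer to the "what control?" question raised on slide 4; a different background would shift every entry in a column by a constant.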

6 Position weight matrix (PWM)

Two major purposes of PWM:
–To characterize the sequence pattern (the motif)
–To facilitate the computation of log-odds (or PWM score), e.g., computing the PWMS for ATACCATGTCCAA

Site         A         C         G         U      Std
1      -0.6784    0.3844    0.0329    0.0198   0.4453
2      -0.1939    0.7044   -0.2147   -1.0277   0.7076
3       0.7275   -0.9188    0.3426   -1.6968   1.1214
4       0.0457    0.8320   -1.0080   -0.4328   0.7786
5      -0.9996    1.0578   -0.2247   -1.5941   1.1452
6       2.0617   -4.4258   -9.5877   -9.5881   5.5288
7      -9.5879   -9.5878   -9.5877    2.5313   6.0595
8      -9.5879   -9.5878    1.6025   -9.5881   5.5952
9      -0.2933   -0.8984    0.7124   -0.4328   0.6782
10      0.2309    0.7760   -0.9075   -0.7821   0.8112
11     -1.2910    0.2167    0.4025   -0.0635   0.7626
12     -0.4320    0.1200    0.2295   -0.1519   0.2961
13      0.0105    0.1884   -0.3184    0.2163   0.2459

RCCAUGG

PWMS = -0.6784 - 1.0277 + 0.7275 + … + 0.0105
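The PWMS sum on this slide can be reproduced with a short lookup over the table. This is a minimal sketch: the matrix values are copied from the slide, while the function name `pwms` and the T-to-U conversion are assumptions added so the DNA-style query can be scored against the RNA-labelled columns.

```python
BASES = "ACGU"

# PWM entries for sites 1-13 (columns A, C, G, U), copied from the slide.
PWM = [
    (-0.6784, 0.3844, 0.0329, 0.0198),
    (-0.1939, 0.7044, -0.2147, -1.0277),
    (0.7275, -0.9188, 0.3426, -1.6968),
    (0.0457, 0.8320, -1.0080, -0.4328),
    (-0.9996, 1.0578, -0.2247, -1.5941),
    (2.0617, -4.4258, -9.5877, -9.5881),
    (-9.5879, -9.5878, -9.5877, 2.5313),
    (-9.5879, -9.5878, 1.6025, -9.5881),
    (-0.2933, -0.8984, 0.7124, -0.4328),
    (0.2309, 0.7760, -0.9075, -0.7821),
    (-1.2910, 0.2167, 0.4025, -0.0635),
    (-0.4320, 0.1200, 0.2295, -0.1519),
    (0.0105, 0.1884, -0.3184, 0.2163),
]

def pwms(seq, pwm=PWM):
    """Sum the PWM entry of the observed base at every site."""
    seq = seq.upper().replace("T", "U")
    if len(seq) != len(pwm):
        raise ValueError("sequence length must equal the number of PWM sites")
    return sum(row[BASES.index(base)] for row, base in zip(pwm, seq))

score = pwms("ATACCATGTCCAA")
print(round(score, 4))
```

For the 13-mer on the slide, the thirteen terms add up to roughly 7.25.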

7 PWMS over sites

Figure 5-1. Illustration of scanning the 5'-end of the NCF4 gene (30 bases upstream of the initiation codon ATG and 27 bases downstream of ATG). The highest peak, with PWMS = 12.3897, corresponds to the 13-mer with 5 bases flanking the ATG. PWMS computed with α = 0.01.

12345678901234567890123456789012345678901234567890123456789012345678901234567890
GGACUGGCUGGGCGAGACUCUCCACCUGCUCCCUGGGACCAUCGCCCACCAUGGCUGUGGCCCAGCAGCUGCGGGCCGAG
                                             -------------
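Scanning means scoring every 13-mer window along the sequence and looking for the peak. The sketch below slides the PWM from slide 6 along the NCF4 fragment shown above; the function name `scan` is an assumption. Because the slide-6 table may have been built with a different pseudocount than the figure, the peak height here need not equal 12.3897, although the peak position should be the same.

```python
BASES = "ACGU"

# PWM for the 13 sites around the start codon (columns A, C, G, U), from slide 6.
PWM = [
    (-0.6784, 0.3844, 0.0329, 0.0198), (-0.1939, 0.7044, -0.2147, -1.0277),
    (0.7275, -0.9188, 0.3426, -1.6968), (0.0457, 0.8320, -1.0080, -0.4328),
    (-0.9996, 1.0578, -0.2247, -1.5941), (2.0617, -4.4258, -9.5877, -9.5881),
    (-9.5879, -9.5878, -9.5877, 2.5313), (-9.5879, -9.5878, 1.6025, -9.5881),
    (-0.2933, -0.8984, 0.7124, -0.4328), (0.2309, 0.7760, -0.9075, -0.7821),
    (-1.2910, 0.2167, 0.4025, -0.0635), (-0.4320, 0.1200, 0.2295, -0.1519),
    (0.0105, 0.1884, -0.3184, 0.2163),
]

NCF4 = ("GGACUGGCUGGGCGAGACUCUCCACCUGCUCCCUGGGACC"
        "AUCGCCCACCAUGGCUGUGGCCCAGCAGCUGCGGGCCGAG")

def scan(seq, pwm=PWM):
    """Return (start, window, score) for every window of PWM length."""
    width = len(pwm)
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        score = sum(row[BASES.index(b)] for row, b in zip(pwm, window))
        hits.append((start, window, score))
    return hits

hits = scan(NCF4)
best_start, best_window, best_score = max(hits, key=lambda h: h[2])
print(best_start + 1, best_window, round(best_score, 4))  # 1-based start
```

Windows that lack AUG at sites 6 to 8 pick up the large negative entries of those rows, which is why a single sharp peak stands out.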

8 PWM: position weight matrix

Also called position-specific scoring matrix (PSSM)

Used in:
–Characterizing sequence motifs
  Eukaryotic translation initiation consensus
  Splicing sites
  Branchpoint sites
  Shine-Dalgarno sequences
–Database searches
  PHI-BLAST
  PSI-BLAST
  RPS-BLAST

9 BLAST Programs

Program            Database     Query        Typical uses
BLASTN/MEGABLAST   Nucleotide   Nucleotide   MEGABLAST has a longer word size than BLASTN
BLASTP             Protein      Protein      Query a protein/peptide against a protein database
BLASTX             Protein      Nucleotide   Translate a nucleotide sequence into a "protein" in six frames and search against a protein database
TBLASTN            Nucleotide   Protein      Unannotated nucleotide sequences (e.g., ESTs) are translated in six frames, against which the query protein is searched
TBLASTX            Nucleotide   Nucleotide   6-frame translation of both query and database
PHI-BLAST          Protein      Protein      Pattern-hit initiated BLAST
PSI-BLAST          Protein      Protein      Position-specific iterated BLAST
RPS-BLAST          Protein      Protein      Reverse PSI-BLAST

10 Yeast 5' ss PWM

Table 3: Site-specific frequencies and position weight matrix (PWM) for 275 5' ss. The consensus sequence (UAAAG | GUAUGUU UAAUU) can be obtained from those large site-specific PWM entries, with the most important sites in bold italics. The χ2 test is performed for each site against the background frequencies (A = 0.3279, C = 0.1915, G = 0.2043, and U = 0.2763). The nucleotide sites are labeled with the five exon nucleotides as -5 to -1 and the 12 intron nucleotides as 1 to 12. The PWM is nearly identical when the introns in 5' UTR were excluded.

        Counts                                          PWM
Site    A     C     G     U         χ2           p         A         C         G         U
-5     94    32    57    92     11.798   0.0081088    0.0641   -0.7117    0.0245    0.2792
-4    119    47    48    61     14.117   0.0027505    0.4032   -0.1599   -0.2225   -0.3115
-3    139    38    43    55     39.672   0.0000001    0.6268   -0.4651   -0.3805   -0.4601
-2    138    40    36    61     38.899   0.0000001    0.6164   -0.3915   -0.6355   -0.3115
-1     91    45    88    51     27.270   0.0000052    0.0174   -0.2223    0.6492   -0.5685
1       0     1   274     0   1060.426   0.0000004   -8.1042   -5.4675    2.2855   -8.1044
2       0     9     0   266    658.096   0.0000003   -8.1042   -2.5200   -8.1048    1.8081
3     268     1     2     4    522.754   0.0000003    1.5723   -5.4675   -4.6732   -4.1523
4      17    29     1   228    428.607   0.0000002   -2.3805   -0.8528   -5.5454    1.5859
5       2     0   272     1   1041.047   0.0000004   -5.2765   -8.1049    2.2750   -5.8967
6      10     8     2   255    583.545   0.0000003   -3.1271   -2.6862   -4.6732    1.7472
7      97    18    39   121     55.570   0.0000001    0.1092   -1.5351   -0.5206    0.6734
8      95    54    35    91     11.363   0.0099180    0.0793    0.0397   -0.6759    0.2635
9     123    45    34    73     22.172   0.0000601    0.4508   -0.2223   -0.7175   -0.0534
10    118    41    38    78     17.334   0.0006034    0.3911   -0.3560   -0.5579    0.0418
11    105    33    43    94     17.367   0.0005940    0.2232   -0.6676   -0.3805    0.3101
12     90    44    42    99     12.109   0.0070180    0.0015   -0.2546   -0.4142    0.3847

Ma and Xia 2011
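The per-site χ2 test in Table 3 compares the four observed counts with the counts expected from the background frequencies, with 3 degrees of freedom. Below is a minimal sketch for site -5; the function names are assumptions, and the p-value uses the closed-form upper tail of the χ2 distribution for 3 degrees of freedom so that only the standard library is needed. The result differs slightly from the tabulated 11.798, presumably because the background frequencies quoted in the caption are rounded.

```python
import math

# Background frequencies quoted in the Table 3 caption.
BACKGROUND = {"A": 0.3279, "C": 0.1915, "G": 0.2043, "U": 0.2763}

def chi_square(observed, background=BACKGROUND):
    """Pearson chi-square of observed counts against background expectations."""
    n = sum(observed.values())
    return sum(
        (observed[base] - n * p) ** 2 / (n * p)
        for base, p in background.items()
    )

def chi2_sf_df3(x):
    """Upper-tail probability of a chi-square variate with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

site_minus5 = {"A": 94, "C": 32, "G": 57, "U": 92}  # counts from Table 3
x2 = chi_square(site_minus5)
p = chi2_sf_df3(x2)
print(round(x2, 3), round(p, 4))
```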

11 Yeast 3' ss PWM

Table 4. Site-specific frequencies and position weight matrix (PWM) for 278 3' ss. The consensus sequence (UUUUUUUUAYAG|GCUUC) can be obtained from those large site-specific PWM entries, with the most important sites in bold italics. The χ2 test is performed for each site against the expected background frequencies. The sites are labeled with first exon site as 1.

        Counts                                     PWM
Site    A     C     G     U       χ2        p         A         C         G         U
-12    61    53    34   130     56.0   0         -0.5608    0.0033   -0.7205    0.7648
-11    70    47    20   141     83.2   0         -0.3649   -0.1682   -1.4696    0.8813
-10    79    42    12   145     99.9   0         -0.1926   -0.3285   -2.1802    0.9214
-9     38    30    23   187    219.4   0         -1.2308   -0.8067   -1.2731    1.2867
-8     51    42    27   158    121.6   0         -0.8149   -0.3285   -1.0470    1.0447
-7     91    33    28   126     53.8   0          0.0093   -0.6715   -0.9956    0.7200
-6     95    42    35   106     22.0   0.0001     0.0707   -0.3285   -0.6794    0.4722
-5     93    33    23   129     63.3   0          0.0403   -0.6715   -1.2731    0.7537
-4    136    25    38    79     43.3   0          0.5842   -1.0647   -0.5626    0.0517
-3     12   121     0   145    272.3   0         -2.8223    1.1862   -6.6480    0.9214
-2    277     1     0     0    563.7   0          1.6056   -5.1232   -6.6480   -6.6469
-1      0     0   278     0   1082.7   0         -6.6464   -6.6483    2.2900   -6.6469
1      93    37    73    75      9.7   0.0217     0.0403   -0.5089    0.3691   -0.0226
2      72    64    54    88      8.0   0.0466    -0.3248    0.2729   -0.0619    0.2059
3      90    54    48    86      2.5   0.4771    -0.0065    0.0300   -0.2299    0.1730
4      83    43    54    98      8.7   0.0337    -0.1221   -0.2950   -0.0619    0.3599
5      90    65    37    86     10.6   0.0140    -0.0065    0.2951   -0.6005    0.1730

Ma and Xia 2011


13 PWMS as a proxy of splicing strength

                  5' ss                  3' ss
               NRG        RG         NRG        RG
PWMS Mean    8.8138   11.1978     5.3129    7.1762
PWMS Var.   31.5069    4.8646    13.3017    8.2077
N                44       202         49       229
t                -4.6346              -3.9257
p                 0.0000               0.0001

Table 6. Position weight matrix scores (PWMS, as a proxy for splicing strength) are significantly smaller for splice sites from intron-containing genes (ICGs) whose transcripts failed to recruit U1 snRNPs (NRG, for non-recruiting group) than for those from ICGs whose transcripts bind well to U1 snRNPs (RG, for recruiting group). The pattern is consistent for both 5' ss and 3' ss, based on two-sample t-tests assuming equal variances. Mann-Whitney tests yield the same conclusion.
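The t statistics in Table 6 can be recovered from the summary statistics alone with the usual equal-variance (pooled) two-sample formula. The sketch below does only that; the function name is an assumption, and the p-values are omitted because they need the t distribution, which is not in the Python standard library.

```python
import math

def pooled_t(mean1, var1, n1, mean2, var2, n2):
    """Two-sample t statistic assuming equal variances; returns (t, df)."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / df
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / std_err, df

# NRG versus RG, using the means, variances and sample sizes from Table 6.
t5, df5 = pooled_t(8.8138, 31.5069, 44, 11.1978, 4.8646, 202)   # 5' ss
t3, df3 = pooled_t(5.3129, 13.3017, 49, 7.1762, 8.2077, 229)    # 3' ss
print(round(t5, 4), df5)
print(round(t3, 4), df3)
```

The NRG variance at the 5' ss is much larger than the RG variance, so the equal-variance assumption is worth checking; the caption's note that Mann-Whitney tests agree addresses that concern.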

14 Highly expressed genes should have high splicing efficiency. Lowly expressed genes could have their splicing sites drifting to low efficiency.

Predictions:
(1) Highly transcribed genes should, on average, have introns with greater splicing efficiency.
(2) Lowly transcribed genes should have greater variance in splicing efficiency than highly transcribed genes.

15 PWMS and Gene Expression

16 PWMS and Splicing Mechanisms

Expected PWMS is 0 when there is no site-specific difference in nucleotide frequency distribution.

What does a strongly negative PWMS mean?

5' ss:
–HAC1: -8.8291
–HFM1: -7.3825
–HOP2: -7.8898

3' ss:
–HAC1: -4.4039
–REC102: -3.4464

17 Perceptron

The perceptron, one of the simplest artificial neural networks, was invented in 1957 at the Cornell Aeronautical Laboratory by Frank Rosenblatt (Rosenblatt, 1958).

Perceptrons have been used in bioinformatics research since the 1980s:
–The identification of translational initiation sites in E. coli (Stormo et al., 1982a).
–Characterizing the ATP/GTP-binding motif (Hirst and Sternberg, 1991).
–More recent publications use multi-layer perceptrons, which are more complicated than what we cover here.

18 What perceptron does

Positive sequences:
POS1 ACGT
POS2 GCGC

Negative sequences:
NEG1 AGCT
NEG2 GGCC

Objective: Find a scoring matrix that can distinguish between the two groups (positive and negative) of sequences.

19 Definitions

Table 5-3. The weighting matrix (W) for the fictitious example with two sequences of length 4 in each group, initialized with values of 1. The first row designates sites 1-4.

Base   1   2   3   4
A      1   1   1   1
C      1   1   1   1
G      1   1   1   1
T      1   1   1   1

POS1 ACGT
POS2 GCGC
NEG1 AGCT
NEG2 GGCC

For amino acid sequences, the matrix would be 20 by 4.

20 Iterations and convergence

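21 Post-processing

POS1 ACGT
POS2 GCGC
NEG1 AGCT
NEG2 GGCC

Weighting matrix before post-processing (cells shown as ? did not survive in the transcript):

Base   1   2   3   4
A      0   1   1   1
C      1   1   ?   0
G      0   ?   1   1
T      1   1   1   0

Weighting matrix after post-processing:

Base   1   2   3   4
A      0   0   0   0
C      0   1   ?   0
G      0   ?   1   0
T      0   0   0   0

What is the score for TAAA?

A weight W(Si,j) = 0 means either that there is no data on that cell or that the cell has no discriminant power.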
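The training and post-processing steps for the four-sequence example can be sketched as below. The update rule is an assumption, since the slide on iterations did not survive in the transcript: W starts at 1 everywhere, a positive sequence scoring at or below a threshold of 0 has its cells incremented by 1, and a negative sequence scoring at or above 0 has its cells decremented by 1, until a full pass makes no change. Post-processing is assumed to zero every cell whose base was never observed at that site. Under these assumptions the sketch reproduces every entry that survives in the two tables above.

```python
BASES = "ACGT"
POS = ["ACGT", "GCGC"]
NEG = ["AGCT", "GGCC"]

def score(W, seq):
    """Sum the weights of the bases observed at each site."""
    return sum(W[base][site] for site, base in enumerate(seq))

def train(pos, neg, threshold=0, max_passes=100):
    """Perceptron-style training with +1/-1 updates (assumed rule)."""
    length = len(pos[0])
    W = {b: [1] * length for b in BASES}  # initialized with values of 1
    for _ in range(max_passes):
        changed = False
        for seq in pos:  # positives should score above the threshold
            if score(W, seq) <= threshold:
                for site, base in enumerate(seq):
                    W[base][site] += 1
                changed = True
        for seq in neg:  # negatives should score below the threshold
            if score(W, seq) >= threshold:
                for site, base in enumerate(seq):
                    W[base][site] -= 1
                changed = True
        if not changed:
            return W
    raise RuntimeError("training did not converge")

def post_process(W, seqs):
    """Zero the cells for which the training data contain no observation."""
    seen = {(base, site) for seq in seqs for site, base in enumerate(seq)}
    return {b: [w if (b, i) in seen else 0 for i, w in enumerate(row)]
            for b, row in W.items()}

W = train(POS, NEG)
W_final = post_process(W, POS + NEG)
print(W_final)
print(score(W_final, "TAAA"))
```

TAAA scores 0 because none of its bases was seen at the corresponding site, so the matrix carries no evidence about it in either direction.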

22 Doublet perceptron

Each 10-base sequence (sites 1-10) is recoded as 9 overlapping doublets (doublet sites 1-9).

       1234567890    1  2  3  4  5  6  7  8  9
P1     ACGUAUACGU    AC CG GU UA AU UA AC CG GU
P2     ACGUCUACGU    AC CG GU UC CU UA AC CG GU
P3     ACGUGUACGU    AC CG GU UG GU UA AC CG GU
P4     ACGUUAACGU    AC CG GU UU UA AA AC CG GU
P5     ACGUUCACGU    AC CG GU UU UC CA AC CG GU
P6     ACGUUGACGU    AC CG GU UU UG GA AC CG GU
N1     ACGUAAACGU    AC CG GU UA AA AA AC CG GU
N2     ACGUACACGU    AC CG GU UA AC CA AC CG GU
N3     ACGUAGACGU    AC CG GU UA AG GA AC CG GU
N4     ACGUCAACGU    AC CG GU UC CA AA AC CG GU
N5     ACGUCCACGU    AC CG GU UC CC CA AC CG GU
N6     ACGUCGACGU    AC CG GU UC CG GA AC CG GU
N7     ACGUGAACGU    AC CG GU UG GA AA AC CG GU
N8     ACGUGCACGU    AC CG GU UG GC CA AC CG GU
N9     ACGUGGACGU    AC CG GU UG GG GA AC CG GU
N10    ACGUUUACGU    AC CG GU UU UU UA AC CG GU
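The recoding into overlapping doublets is a one-line sliding window. A minimal sketch follows; the function name `to_doublets` is an assumption.

```python
def to_doublets(seq):
    """Split a sequence into overlapping doublets: sites (1,2), (2,3), ..."""
    return [seq[i:i + 2] for i in range(len(seq) - 1)]

p1 = to_doublets("ACGUAUACGU")
print(" ".join(p1))
```

A sequence of length L gives L - 1 doublets, and the weighting matrix grows from 4 rows to 16, which is one reason a doublet perceptron needs far more training data than a singlet one.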

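23 Doublet Perceptron

Weights by doublet (rows) and doublet site 1-9 (values listed in site order). Rows marked with * retain only eight of their nine values in the transcript, so their site assignment is uncertain.

Doublet   Weights
AA        0  0  0  0  -6  -4.3  0  0  0
AC        0  0  0  0  -4  0  0  0  0
AG        0  0  0  0  -2  0  0  0  0
AU        0  0  0  0  8.33  0  0  0  0
CA *      0  0  0  0  -4  0  0  0
CC *      0  0  0  0  0  0  0  0
CG *      0  0  0  0  0  0  0  0
CU        0  0  0  0  5  0  0  0  0
GA *      0  0  0  0  -0.7  0  0  0
GC *      0  0  0  0  0  0  0  0
GG *      0  0  0  0  0  0  0  0
GU        0  0  0  0  3.33  0  0  0  0
UA        0  0  0  -3.7  6.67  5.67  0  0  0
UC *      0  0  0  5  0  0  0  0
UG        0  0  0  0.33  3.33  0  0  0  0
UU        0  0  0  4  -11  0  0  0  0

A large amount of data is needed to avoid the problem of overfitting.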

24 Gene/Motif Prediction

Objective: given a molecular sequence, find its biological function (preferably in terms of gene ontology).
–Cellular localization
–Biological processes the gene (its product) participates in
–The biological reaction

Related terms:
–Motif: e.g., RccAUGG
–Fingerprint: a set of aligned sequences from which a position weight matrix or the like can be constructed to predict the motif effectively

Gene/Motif prediction methods:
–Position weight matrix
–Perceptrons
–Supervised learning
–Hidden Markov Models (HMMs)
–Neural networks (e.g., self-organizing map or SOM)

