Motif identification with Gibbs Sampler Xuhua Xia


Slide 1: Motif identification with Gibbs Sampler
Xuhua Xia
xxia@uottawa.ca
http://dambe.bio.uottawa.ca

Slide 2: Background
The Gibbs sampler is named after Josiah Willard Gibbs (February 11, 1839 – April 28, 1903), winner of the Copley Medal of the Royal Society of London in 1901.
It is one of the Markov chain Monte Carlo algorithms.
Biological applications:
– Identification of regulatory sequences of genes (Aerts et al., 2005; Coessens et al., 2003; Lawrence et al., 1993; Qin et al., 2003; Thijs et al., 2001; Thijs et al., 2002a; Thijs et al., 2002b; Thompson et al., 2004; Thompson et al., 2003) and functional motifs in proteins (Mannella et al., 1996; Neuwald et al., 1995; Qu et al., 1998)
– Classification of biological images (Samso et al., 2002)
– Pairwise sequence alignment (Zhu et al., 1998) and multiple sequence alignment (Holmes and Bruno, 2001; Jensen and Hein, 2005)

Slide 3: Motif identification by Gibbs sampler
Other outputs of the Gibbs sampler:
– A position weight matrix that can be used to scan other sequences for motifs, and the associated significance tests
– Position weight matrix scores for the identified motifs

Slide 4: Gibbs sampler in motif finding
– Site sampler
– Motif sampler

Slide 5: Algorithm details: Initialization

              1         2         3         4
     1234567890123456789012345678901234567890123
S1   TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT
S2   CCCACGCAGCCGCCCTCCTCCCCGGTCACTGACTGGTCCTG
S3   TCGACCCTCTGAACCTATCAGGGACCACAGTCAGCCAGGCAAG
S4   AAAACACTTGAGGGAGCAGATAACTGGGCCAACCATGACTC
S5   GGGTGAATGGTACTGCTGATTACAACCTCTGGTGCTGC
......
S11  CATGCCCTCAAGTGTGCAGATTGGTCACAGCATTTCAAGG
......

Randomly choose the motif start A_i in each sequence.

F_A: 325   F_C: 316   F_G: 267   F_T: 301   Sum: 1209

Table 7-1. Site-specific distribution of nucleotides from the 29 random motifs of length 6. The second column lists the distribution of nucleotides outside the 29 random motifs.

                Site
Nuc    C0    1    2    3    4    5    6
A     278    8    7    9    6   10    7
C     279    3    8    5   10    6    5
G     230    7    5    6    5    3   11
T     248   11    9    9    8   10    6
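The initialization step can be sketched in Python as follows. This is a minimal illustration, not the author's implementation: the function name `initialize`, the 0-based start positions, and the fixed random seed are assumptions made for the example, and only three of the 29 sequences from the slide are used.

```python
import random
from collections import Counter

def initialize(seqs, width, rng):
    """Pick a random motif start (0-based) in each sequence and tally counts.

    Hypothetical helper for illustration; not taken from the slides.
    """
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    site_counts = [Counter() for _ in range(width)]  # one Counter per motif site
    background = Counter()                           # nucleotides outside the motifs
    for seq, a in zip(seqs, starts):
        for j in range(width):
            site_counts[j][seq[a + j]] += 1
        background.update(seq[:a] + seq[a + width:])
    return starts, site_counts, background

# S1 to S3 as shown on the slide.
seqs = [
    "TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT",
    "CCCACGCAGCCGCCCTCCTCCCCGGTCACTGACTGGTCCTG",
    "TCGACCCTCTGAACCTATCAGGGACCACAGTCAGCCAGGCAAG",
]
starts, site_counts, background = initialize(seqs, 6, random.Random(0))
```

With all 29 sequences, each site's counts would sum to 29 and the background counts would correspond to the C0 column of Table 7-1.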

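Slide 6: Algorithm details: Predictive update

Counts from the remaining 28 motifs, with the motif of S11 left out:

                Site
Nuc    C0    1    2    3    4    5    6
A     279    7    7    9    6   10    7
C     279    3    8    5   10    6    5
G     233    7    4    6    4    3   10
T     250   11    9    8    8    9    6

Counts from all 29 motifs (Table 7-1):

                Site
Nuc    C0    1    2    3    4    5    6
A     278    8    7    9    6   10    7
C     279    3    8    5   10    6    5
G     230    7    5    6    5    3   11
T     248   11    9    9    8   10    6

S11  CATGCCCTCAAGTGTGCAGATTGGTCACAGCATTTCAAGG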
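The leave-one-out step behind these two count tables can be sketched as below. The helper name `leave_out` is hypothetical. The slide does not state the initial motif start of S11; position 11 (AGTGTG, index 10 when counting from 0) is inferred here from the difference between the two tables and should be read as an assumption.

```python
def leave_out(site_counts, background, seq, start, width):
    """Return copies of the counts with one sequence's motif moved to the background.

    `start` is 0-based. Hypothetical helper for illustration.
    """
    new_sites = {nuc: list(row) for nuc, row in site_counts.items()}
    new_bg = dict(background)
    for j in range(width):
        nuc = seq[start + j]
        new_sites[nuc][j] -= 1   # the nucleotide no longer counts at motif site j
        new_bg[nuc] += 1         # it now counts as background (column C0)
    return new_sites, new_bg

# Counts from all 29 motifs (Table 7-1).
site_counts = {
    "A": [8, 7, 9, 6, 10, 7],
    "C": [3, 8, 5, 10, 6, 5],
    "G": [7, 5, 6, 5, 3, 11],
    "T": [11, 9, 9, 8, 10, 6],
}
background = {"A": 278, "C": 279, "G": 230, "T": 248}
s11 = "CATGCCCTCAAGTGTGCAGATTGGTCACAGCATTTCAAGG"

# Assumed current motif of S11: AGTGTG at 1-based position 11.
sites28, bg28 = leave_out(site_counts, background, s11, 10, 6)
```

Under that assumption, `sites28` and `bg28` reproduce the 28-motif table shown on the slide.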

Slide 7: Predictive update: Frequencies

Table 7-3. Site-specific distribution of nucleotide frequencies derived from data in Table 7-2, with  = 0.0001. The second column lists the distribution of nucleotide frequencies outside the 28 random motifs.

                              Site
Nuc    Q0       1       2       3       4       5       6
A    0.2680  0.2501  0.2501  0.3212  0.2145  0.3568  0.2501
C    0.2680  0.1078  0.2856  0.1789  0.3567  0.2145  0.1789
G    0.2238  0.2499  0.1432  0.2143  0.1432  0.1076  0.3566
T    0.2402  0.3922  0.3211  0.2856  0.2856  0.3211  0.2144

Slide 8: Predictive update: PWM

Site-specific frequencies:

                              Site
Nuc    Q0       1       2       3       4       5       6
A    0.2680  0.2501  0.2501  0.3212  0.2145  0.3568  0.2501
C    0.2680  0.1078  0.2856  0.1789  0.3567  0.2145  0.1789
G    0.2238  0.2499  0.1432  0.2143  0.1432  0.1076  0.3566
T    0.2402  0.3922  0.3211  0.2856  0.2856  0.3211  0.2144

Position weight matrix:

                            Site
Nuc       1        2        3        4        5        6
A     -0.0693  -0.0693   0.1811  -0.2228   0.2862  -0.0693
C     -0.9113   0.0637  -0.4042   0.2862  -0.2228  -0.4042
G      0.1102  -0.4469  -0.0434  -0.4469  -0.7327   0.4659
T      0.4907   0.2906   0.1731   0.1731   0.2906  -0.1135

S11  CATGCCCTCAAGTGTGCAGATTGGTCACAGCATTTCAAGG

Odds ratio for CATGCC = e^(-0.9113 - 0.0693 + 0.1731 - 0.4469 - 0.2228 - 0.4042) = 0.153
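A minimal Python sketch of how the position weight matrix and the odds ratio relate to the frequency table, assuming each PWM entry is ln(Q_ij / Q_0) as labelled in the final report later in the deck. The function names are hypothetical. Two cells (A at site 2, T at site 4) are blank in the slide text; they are filled here on the assumption that equal counts give equal frequencies.

```python
import math

# Background frequencies (column Q0) and site-specific frequencies (Table 7-3).
Q0 = {"A": 0.2680, "C": 0.2680, "G": 0.2238, "T": 0.2402}
Q = {
    "A": [0.2501, 0.2501, 0.3212, 0.2145, 0.3568, 0.2501],  # site 2 assumed
    "C": [0.1078, 0.2856, 0.1789, 0.3567, 0.2145, 0.1789],
    "G": [0.2499, 0.1432, 0.2143, 0.1432, 0.1076, 0.3566],
    "T": [0.3922, 0.3211, 0.2856, 0.2856, 0.3211, 0.2144],  # site 4 assumed
}

def pwm_from_freqs(site_freqs, bg_freqs):
    """PWM entry = ln(site frequency / background frequency)."""
    return {n: [math.log(q / bg_freqs[n]) for q in row] for n, row in site_freqs.items()}

def odds_ratio(pwm, kmer):
    """exp of the summed PWM entries along the k-mer."""
    return math.exp(sum(pwm[n][j] for j, n in enumerate(kmer)))

pwm = pwm_from_freqs(Q, Q0)
catgcc = odds_ratio(pwm, "CATGCC")
```

Because the tabulated frequencies are rounded to four decimals, the recomputed PWM entries and the odds ratio for CATGCC agree with the slide only to about three decimals.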

Slide 9: Predictive update

S11  CATGCCCTCAAGTGTGCAGATTGGTCACAGCATTTCAAGG

Table 7-4. Possible locations of the 6-mer motif along S11, together with the corresponding motifs and their position weight matrix scores expressed as odds ratios. The last column lists the odds ratios normalized to have a sum of 1.

Site  6-mer   Odds Ratio  P_Norm
1     CATGCC  0.153       0.004
2     ATGCCC  0.850       0.021
3     TGCCCT  0.664       0.016
4     GCCCTC  0.944       0.023
5     CCCTCA  0.254       0.006
6     CCTCAA  0.843       0.021
7     CTCAAG  0.609       0.015
8     TCAAGT  0.717       0.018
9     CAAGTG  0.613       0.015
10    AAGTGT  0.426       0.011
11    AGTGTG  0.967       0.024
...
35    TCAAGG  1.279       0.032

Number of possible positions: 40 – 6 + 1 = 35. P_Norm is the odds ratio scaled to sum to 1.

– Pick the position with the largest odds ratio, update the A_i value, and generate a new frequency matrix.
– Randomly pick another sequence and update it in the same way to obtain a new frequency matrix and a new A_i value.
– Once all sequences are updated and a new set of A_i values is obtained, compute F.
– Update all the sequences again to obtain a new set of A_i and a new F. If the new F is greater than the old F, replace the old set of A_i values with the new set.
– Repeat until the F value no longer increases or the maximum number of local iterations is reached.
– This (from initialization to this slide) completes one global cycle of iteration.
– Repeat a number of global cycles until F does not increase.
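The scan over S11 can be sketched as below, using the position weight matrix values from the previous slide. The function name `scan` and the 1-based site numbering are choices made for this example; the A entry at site 2 and the T entry at site 4 are blank in the slide text and are filled in here (the first is confirmed by the CATGCC odds-ratio expression, the second is assumed from the equal counts at sites 3 and 4).

```python
import math

PWM = {
    "A": [-0.0693, -0.0693, 0.1811, -0.2228, 0.2862, -0.0693],
    "C": [-0.9113, 0.0637, -0.4042, 0.2862, -0.2228, -0.4042],
    "G": [0.1102, -0.4469, -0.0434, -0.4469, -0.7327, 0.4659],
    "T": [0.4907, 0.2906, 0.1731, 0.1731, 0.2906, -0.1135],  # site 4 assumed
}
S11 = "CATGCCCTCAAGTGTGCAGATTGGTCACAGCATTTCAAGG"

def scan(seq, pwm, width=6):
    """Return (site, k-mer, odds ratio) for every window; sites are 1-based."""
    rows = []
    for i in range(len(seq) - width + 1):
        kmer = seq[i:i + width]
        score = sum(pwm[n][j] for j, n in enumerate(kmer))
        rows.append((i + 1, kmer, math.exp(score)))
    return rows

rows = scan(S11, PWM)
total = sum(odds for _, _, odds in rows)
p_norm = [odds / total for _, _, odds in rows]   # scaled to sum to 1
best_site, best_kmer, best_odds = max(rows, key=lambda r: r[2])
```

The slide selects the position with the largest odds ratio, which is what `max` does here. Gibbs samplers more commonly draw the new position at random in proportion to the normalized values, which in this sketch would be `random.choices(rows, weights=p_norm)[0]`.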

Slide 10: Final Report: Final Frequency

Final site-specific counts:

Site    A    C    G    U
1       3   11    0   15
2       0    0    8   21
3      21    0    8    0
4       0    0    0   29
5      10   18    0    1
6      17    0    1   11

Final site-specific frequencies:

Site      A        C        G        U
1      0.10413  0.37882  0.00092  0.51613
2      0.00112  0.00109  0.27563  0.72217
3      0.72225  0.00109  0.27563  0.00103
4      0.00112  0.00109  0.00092  0.99688
5      0.34451  0.61920  0.00092  0.03537
6      0.58489  0.00109  0.03526  0.37877

Final PWM [ln(Qij/Q0)]:

Site       A         C         G         U
1      -0.93304   0.31199  -5.57384   0.86909
2      -5.46894  -5.54337   0.13202   1.20499
3       1.00364  -5.54337   0.13202  -5.34419
4      -5.46894  -5.54337  -5.57384   1.52737
5       0.26340   0.80335  -5.57384  -1.81131
6       0.79269  -5.54337  -1.92440   0.55966

Slide 11: Motif alignment

Seq  Sequence
1    UCAGAACCAGUUAUAAAUUUAUCAUUUCCUUCUCCACUCCU
2    CCCACGCAGCCGCCCUCCUCCCCGGUCACUGACUGGUCCUG
3    UCGACCCUCUGAACCUAUCAGGGACCACAGUCAGCCAGGCAAG
4    AAAACACUUGAGGGAGCAGAUAACUGGGCCAACCAUGACUC
5    GGGUGAAUGGUACUGCUGAUUACAACCUCUGGUGCUGC
6    AGCCUAGAGUGAUGACUCCUAUCUGGGUCCCCAGCAGGA
7    GCCUCAGGAUCCAGCACACAUUAUCACAAACUUAGUGUCCA
8    CAUUAUCACAAACUUAGUGUCCAUCCAUCACUGCUGACCCU
9    UCGGAACAAGGCAAAGGCUAUAAAAAAAAUUAAGCAGC
10   GCCCCUUCCCCACACUAUCUCAAUGCAAAUAUCUGUCUGAAACGGUUCC
11   CAUGCCCUCAAGUGUGCAGAUUGGUCACAGCAUUUCAAGG
12   GAUUGGUCACAGCAUUUCAAGGGAGAGACCUCAUUGUAAG
13   UCCCCAACUCCCAACUGACCUUAUCUGUGGGGGAGGCUUUUGA
14   CCUUAUCUGUGGGGGAGGCUUUUGAAAAGUAAUUAGGUUUAGC
15   AUUAUUUUCCUUAUCAGAAGCAGAGAGACAAGCCAUUUCUCUUUCCUCCC
23   GAAAAAAAAUAAAUGAAGUCUGCCUAUCUCCGGGCCAGAGCCCCU
24   UGCCUUGUCUGUUGUAGAUAAUGAAUCUAUCCUCCAGUGACU
25   GGCCAGGCUGAUGGGCCUUAUCUCUUUACCCACCUGGCUGU
26   CAACAGCAGGUCCUACUAUCGCCUCCCUCUAGUCUCUG
27   CCAACCGUUAAUGCUAGAGUUAUCACUUUCUGUUAUCAAGUGGCUUCAGC
28   GGGAGGGUGGGGCCCCUAUCUCUCCUAGACUCUGUG
29   CUUUGUCACUGGAUCUGAUAAGAAACACCACCCCUGC

Slide 12: Motif scores

SeqName  Motif   Start  PWMS
S1       UUAUCA  18     493.3101
S2       CGGUCA  22     40.4251
S3       CUAUCA  14     282.6008
S4       AGAUAA  17     16.2174
S5       UGAUUA  16     12.3482
S6       CUAUCU  18     223.8567
S7       UUAUCA  20     493.3101
S8       UUAUCA  2      493.3101
S9       CUAUAA  17     164.6933
S10      CUAUCU  14     223.8567
S11      UGGUCA  21     70.5663
S12      UUGUAA  33     120.2498
S13      UUAUCU  20     390.7660
S14      UUAUCU  2      390.7660
S15      UUAUCA  10     493.3101
......
S27      UUAUCA  19     493.3101
S28      CUAUCU  15     223.8567
S29      UUGUCA  2      206.3393
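The PWMS column appears to be the exponential of the summed final PWM entries for each motif. That relationship is inferred from the numbers and is not stated on the slides, so the sketch below is an assumption checked against a few rows of the table. The name `pwms` is hypothetical.

```python
import math

NUCS = "ACGU"
# Final PWM from the final report: one row per site, columns in A, C, G, U order.
FINAL_PWM = [
    [-0.93304, 0.31199, -5.57384, 0.86909],
    [-5.46894, -5.54337, 0.13202, 1.20499],
    [1.00364, -5.54337, 0.13202, -5.34419],
    [-5.46894, -5.54337, -5.57384, 1.52737],
    [0.26340, 0.80335, -5.57384, -1.81131],
    [0.79269, -5.54337, -1.92440, 0.55966],
]

def pwms(motif, pwm=FINAL_PWM):
    """Assumed score: exp of the summed PWM entries along the motif."""
    total = sum(pwm[site][NUCS.index(nuc)] for site, nuc in enumerate(motif))
    return math.exp(total)

scores = {m: pwms(m) for m in ("UUAUCA", "CUAUCU", "AGAUAA")}
```

For these three motifs the computed values agree with the tabulated PWMS to within rounding of the five-decimal PWM entries.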

Slide 13: Motif sampler output

SeqName  N  1                    2                    3
Seq1     2  10(TTATAA,93.4541)   18(TTATCA,163.6602)
Seq2     1  22(CGGTCA,14.5511)
Seq3     1  14(CTATCA,101.8203)
Seq4     0
Seq5     1  16(TGATTA,12.9266)
Seq6     1  18(CTATCT,90.7790)
Seq7     1  20(TTATCA,163.6602)
Seq8     2  2(TTATCA,163.6602)   24(CCATCA,10.2098)
Seq9     1  17(CTATAA,58.1420)
Seq10    3  14(CTATCT,90.7790)   28(ATATCT,41.4438)   32(CTGTCT,37.7888)
Seq11    1  21(TGGTCA,23.3886)
Seq12    2  3(TGGTCA,23.3886)    33(TTGTAA,38.9024)
Seq13    1  20(TTATCT,145.9129)
Seq14    1  2(TTATCT,145.9129)
Seq15    3  1(TTATTT,33.5700)    10(TTATCA,163.6602)  36(TTCTCT,17.7407)
Seq25    1  17(TTATCT,145.9129)
Seq26    1  15(CTATCG,21.2368)
Seq27    3  19(TTATCA,163.6602)  25(CTTTCT,13.3635)   32(TTATCA,163.6602)
Seq28    1  15(CTATCT,90.7790)
Seq29    2  2(UUGUCA,68.1272)    15(TGATAA,32.0835)

