Motif identification with Gibbs Sampler Xuhua Xia
Slide 2: Background
Named after Josiah Willard Gibbs (February 11, 1839 – April 28, 1903), winner of the Copley Medal of the RSL.
One of the Markov chain Monte Carlo algorithms.
Biological applications:
– Identification of regulatory sequences of genes (Aerts et al., 2005; Coessens et al., 2003; Lawrence et al., 1993; Qin et al., 2003; Thijs et al., 2001; Thijs et al., 2002a; Thijs et al., 2002b; Thompson et al., 2004; Thompson et al., 2003) and functional motifs in proteins (Mannella et al., 1996; Neuwald et al., 1995; Qu et al., 1998)
– Classification of biological images (Samso et al., 2002)
– Pairwise sequence alignment (Zhu et al., 1998) and multiple sequence alignment (Holmes and Bruno, 2001; Jensen and Hein, 2005)
Slide 3: Motif identification by Gibbs sampler
Other outputs of the Gibbs sampler:
– Position weight matrix that can be used to scan other sequences for motifs, and the associated significance tests
– Position weight matrix scores for identified motifs
Slide 4: Gibbs sampler in motif finding
– Site sampler
– Motif sampler
Slide 5: Algorithm details: Initialization
S1  TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT
S2  CCCACGCAGCCGCCCTCCTCCCCGGTCACTGACTGGTCCTG
S3  TCGACCCTCTGAACCTATCAGGGACCACAGTCAGCCAGGCAAG
S4  AAAACACTTGAGGGAGCAGATAACTGGGCCAACCATGACTC
S5  GGGTGAATGGTACTGCTGATTACAACCTCTGGTGCTGC
...
S11 CATGCCCTCAAGTGTGCAGATTGGTCACAGCATTTCAAGG

Randomly choose a motif start A_i in each sequence.

Table 7-1. Site-specific distribution of nucleotides from the 29 random motifs of length 6. The second column lists the distribution of nucleotides outside the 29 random motifs.
(Table columns: Nuc, C, and sites 1 to 6; rows: A, C, G, T; cell values not preserved.)

F_A: 325, F_C: 316, F_G: 267, F_T: 301; Sum: 1209
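The initialization step can be sketched roughly as follows. This is a minimal illustration, not the author's implementation: the function names `init_starts` and `count_matrix` are hypothetical, starts are 0-based, only the six sequences shown on the slide are used (not all 29), and the random seed is arbitrary.

```python
import random
from collections import Counter

def init_starts(seqs, width, rng):
    """Pick one random 0-based motif start A_i per sequence."""
    return [rng.randrange(len(s) - width + 1) for s in seqs]

def count_matrix(seqs, starts, width):
    """Tally nucleotides inside and outside the current motifs.

    Returns (site_counts, background): site_counts[j] counts the
    nucleotides at motif site j; background counts everything
    outside the motif windows.
    """
    site_counts = [Counter() for _ in range(width)]
    background = Counter()
    for s, a in zip(seqs, starts):
        for j in range(width):
            site_counts[j][s[a + j]] += 1
        background.update(s[:a] + s[a + width:])
    return site_counts, background

# The six sequences visible on the slide (S1-S5 and S11).
seqs = [
    "TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT",
    "CCCACGCAGCCGCCCTCCTCCCCGGTCACTGACTGGTCCTG",
    "TCGACCCTCTGAACCTATCAGGGACCACAGTCAGCCAGGCAAG",
    "AAAACACTTGAGGGAGCAGATAACTGGGCCAACCATGACTC",
    "GGGTGAATGGTACTGCTGATTACAACCTCTGGTGCTGC",
    "CATGCCCTCAAGTGTGCAGATTGGTCACAGCATTTCAAGG",
]
WIDTH = 6
rng = random.Random(0)  # arbitrary seed, for repeatability only
starts = init_starts(seqs, WIDTH, rng)
site_counts, background = count_matrix(seqs, starts, WIDTH)
```

Each site column then sums to the number of sequences, and the background counts account for every nucleotide not covered by a motif window.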
Slide 6: Algorithm details: Predictive update
(Two count tables with columns Nuc, C, and sites; rows: A, C, G, T; cell values not preserved.)
S11 CATGCCCTCAAGTGTGCAGATTGGTCACAGCATTTCAAGG
Slide 7: Predictive update: Frequencies
Table 7-3. Site-specific distribution of nucleotide frequencies derived from data in Table 7-2, with [...] = [...]. The second column lists the distribution of nucleotide frequencies outside the 28 random motifs.
(Table columns: Nuc, Q, and sites; rows: A, C, G, T; cell values not preserved.)
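As a rough sketch of how counts might be turned into frequencies: the slide's own pseudocount parameter is not legible in this copy, so the formula below, (count + alpha × background frequency) / (N + alpha), and the default `alpha=0.01` are assumptions chosen only for illustration. The counts are made up and are not the values from Table 7-2.

```python
NUCS = "ACGT"

def to_frequencies(site_counts, background_counts, alpha=0.01):
    """Convert counts to frequencies with a small pseudocount.

    alpha and the exact pseudocount form are assumptions; they keep
    every frequency above zero so that a later log is defined.
    """
    bg_total = sum(background_counts.values())
    q0 = {n: background_counts.get(n, 0) / bg_total for n in NUCS}
    q = []
    for col in site_counts:
        n_motifs = sum(col.values())
        q.append({n: (col.get(n, 0) + alpha * q0[n]) / (n_motifs + alpha)
                  for n in NUCS})
    return q0, q

# Hypothetical counts for a 2-site motif over 28 sequences.
site_counts = [
    {"A": 10, "C": 0, "G": 8, "T": 10},
    {"A": 7, "C": 7, "G": 7, "T": 7},
]
background_counts = {"A": 300, "C": 290, "G": 250, "T": 280}
q0, q = to_frequencies(site_counts, background_counts)
```

With this form each site column still sums to 1, and a nucleotide with zero count (C at the first site here) gets a small positive frequency.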
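Slide 8: Predictive update: PWM
(Frequency table with columns Nuc, Q, and sites, followed by the PWM table; rows: A, C, G, T; cell values not preserved.)
S11 CATGCCCTCAAGTGTGCAGATTGGTCACAGCATTTCAAGG
Odds ratio for CATGCC = e^(PWM score of CATGCC) = 0.153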
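The relation between frequencies, PWM entries, and the odds ratio can be sketched as below, assuming PWM entries of the form ln(Qij/Q0) as given in the final report slide. The function names and the two-site frequencies are hypothetical, not the slide's data.

```python
import math

NUCS = "ACGT"

def build_pwm(q, q0):
    """PWM entry for site j and nucleotide n: ln(q[j][n] / q0[n])."""
    return [{n: math.log(col[n] / q0[n]) for n in NUCS} for col in q]

def pwm_score(pwm, kmer):
    """Sum of the PWM entries selected by the k-mer, one per site."""
    return sum(pwm[j][n] for j, n in enumerate(kmer))

def odds_ratio(pwm, kmer):
    """exp(PWM score), i.e. the product of q/q0 over the sites."""
    return math.exp(pwm_score(pwm, kmer))

# Hypothetical frequencies for a 2-site motif with a uniform background.
q0 = {n: 0.25 for n in NUCS}
q = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
pwm = build_pwm(q, q0)
best = odds_ratio(pwm, "AT")   # (0.7 / 0.25) ** 2
worst = odds_ratio(pwm, "CG")  # (0.1 / 0.25) ** 2
```

Read the other way round, the slide's odds ratio of 0.153 for CATGCC corresponds to a PWM score of ln(0.153), roughly -1.88.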
Slide 9: Predictive update
S11 CATGCCCTCAAGTGTGCAGATTGGTCACAGCATTTCAAGG

Table 7-4. Possible locations of the 6-mer motif along S11, together with the corresponding motifs and their position weight matrix scores expressed as odds ratios. The last column lists the odds ratios normalized to have a sum of 1.

Site  6-mer
1     CATGCC
2     ATGCCC
3     TGCCCT
4     GCCCTC
5     CCCTCA
6     CCTCAA
7     CTCAAG
8     TCAAGT
9     CAAGTG
10    AAGTGT
11    AGTGTG
...
35    TCAAGG
(Columns Odds Ratio and P_Norm: values not preserved.)

Number of possible sites: 40 – 6 + 1 = 35. P_Norm is the odds ratio scaled to sum to 1.

– Pick the 6-mer with the largest odds ratio, update the A_i value, and generate a new frequency matrix.
– Randomly pick another sequence and update it in the same way to obtain a new frequency matrix and a new A_i value.
– Once all sequences are updated and a new set of A_i values is obtained, compute F.
– Update all the sequences again to obtain a new set of A_i values and a new F. If the new F is greater than the old F, replace the old set of A_i values with the new set.
– Repeat until the F value no longer increases or the maximum number of local iterations is reached.

This (from initialization to this slide) completes one global cycle of iteration. Repeat a number of global cycles until F does not increase.
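The window scan for the withheld sequence can be sketched as follows. The PWM here is not the one from the slides: `consensus_pwm` is a hypothetical helper that builds a toy matrix around an assumed consensus (TGGTCA, with frequency 0.7 at each site against a uniform background), purely so the example runs. Starts are 0-based.

```python
import math

def scan_windows(seq, pwm):
    """Score every window of the motif width along seq.

    Returns (kmers, odds, p_norm), where p_norm is the odds ratio
    of each window divided by the sum over all windows.
    """
    width = len(pwm)
    kmers = [seq[i:i + width] for i in range(len(seq) - width + 1)]
    odds = [math.exp(sum(pwm[j][n] for j, n in enumerate(k))) for k in kmers]
    total = sum(odds)
    return kmers, odds, [o / total for o in odds]

def consensus_pwm(consensus, hit=0.7, q0=0.25):
    """Toy PWM (an assumption): ln(q/q0) with q=hit on the consensus."""
    miss = (1 - hit) / 3
    return [{n: math.log((hit if n == c else miss) / q0) for n in "ACGT"}
            for c in consensus]

S11 = "CATGCCCTCAAGTGTGCAGATTGGTCACAGCATTTCAAGG"
pwm = consensus_pwm("TGGTCA")
kmers, odds, p_norm = scan_windows(S11, pwm)

# Slide rule: take the window with the largest odds ratio as the new A_i.
new_start = max(range(len(odds)), key=odds.__getitem__)
```

With this toy matrix the scan yields 35 windows and selects the TGGTCA window. Drawing a window at random with probability P_Norm (for example with `random.choices`) is an alternative to always taking the maximum. The slide does not spell out F; one common form, used by Lawrence et al. (1993), is the sum over sites and nucleotides of the motif count times ln(site frequency / background frequency), which is offered here as an assumption rather than as the slide's definition.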
Slide 10: Final Report: Final Frequency
Final site-specific counts: A C G U
Final site-specific frequencies: A C G U
Final PWM [ln(Qij/Q0)]: A C G U
(Cell values not preserved.)
Slide 11: Motif alignment
Seq
1  UCAGAACCAGUUAUAAAUUUAUCAUUUCCUUCUCCACUCCU
2  CCCACGCAGCCGCCCUCCUCCCCGGUCACUGACUGGUCCUG
3  UCGACCCUCUGAACCUAUCAGGGACCACAGUCAGCCAGGCAAG
4  AAAACACUUGAGGGAGCAGAUAACUGGGCCAACCAUGACUC
5  GGGUGAAUGGUACUGCUGAUUACAACCUCUGGUGCUGC
6  AGCCUAGAGUGAUGACUCCUAUCUGGGUCCCCAGCAGGA
7  GCCUCAGGAUCCAGCACACAUUAUCACAAACUUAGUGUCCA
8  CAUUAUCACAAACUUAGUGUCCAUCCAUCACUGCUGACCCU
9  UCGGAACAAGGCAAAGGCUAUAAAAAAAAUUAAGCAGC
10 GCCCCUUCCCCACACUAUCUCAAUGCAAAUAUCUGUCUGAAACGGUUCC
11 CAUGCCCUCAAGUGUGCAGAUUGGUCACAGCAUUUCAAGG
12 GAUUGGUCACAGCAUUUCAAGGGAGAGACCUCAUUGUAAG
13 UCCCCAACUCCCAACUGACCUUAUCUGUGGGGGAGGCUUUUGA
14 CCUUAUCUGUGGGGGAGGCUUUUGAAAAGUAAUUAGGUUUAGC
15 AUUAUUUUCCUUAUCAGAAGCAGAGAGACAAGCCAUUUCUCUUUCCUCCC
...
23 GAAAAAAAAUAAAUGAAGUCUGCCUAUCUCCGGGCCAGAGCCCCU
24 UGCCUUGUCUGUUGUAGAUAAUGAAUCUAUCCUCCAGUGACU
25 GGCCAGGCUGAUGGGCCUUAUCUCUUUACCCACCUGGCUGU
26 CAACAGCAGGUCCUACUAUCGCCUCCCUCUAGUCUCUG
27 CCAACCGUUAAUGCUAGAGUUAUCACUUUCUGUUAUCAAGUGGCUUCAGC
28 GGGAGGGUGGGGCCCCUAUCUCUCCUAGACUCUGUG
29 CUUUGUCACUGGAUCUGAUAAGAAACACCACCCCUGC
Slide 12: Motif scores
SeqName  Motif
S1       UUAUCA
S2       CGGUCA
S3       CUAUCA
S4       AGAUAA
S5       UGAUUA
S6       CUAUCU
S7       UUAUCA
S8       UUAUCA
S9       CUAUAA
S10      CUAUCU
S11      UGGUCA
S12      UUGUAA
S13      UUAUCU
S14      UUAUCU
S15      UUAUCA
...
S27      UUAUCA
S28      CUAUCU
S29      UUGUCA
(Columns Start and PWMS: values not preserved.)
Slide 13: Motif sampler output
Each entry is written as start(motif, value); the second value in each pair was not preserved.

SeqName  N  1            2            3
Seq1     2  10(TTATAA)   18(TTATCA)
Seq2     1  22(CGGTCA)
Seq3     1  14(CTATCA)
Seq4     0
Seq5     1  16(TGATTA)
Seq6     1  18(CTATCT)
Seq7     1  20(TTATCA)
Seq8     2  2(TTATCA)    24(CCATCA)
Seq9     1  17(CTATAA)
Seq10    3  14(CTATCT)   28(ATATCT)   32(CTGTCT)
Seq11    1  21(TGGTCA)
Seq12    2  3(TGGTCA)    33(TTGTAA)
Seq13    1  20(TTATCT)
Seq14    1  2(TTATCT)
Seq15    3  1(TTATTT)    10(TTATCA)   36(TTCTCT)
...
Seq25    1  17(TTATCT)
Seq26    1  15(CTATCG)
Seq27    3  19(TTATCA)   25(CTTTCT)   32(TTATCA)
Seq28    1  15(CTATCT)
Seq29    2  2(UUGUCA)    15(TGATAA)
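One way to read the start values in this output is as 0-based offsets into each sequence. The short check below tests that reading against three of the sequences shown on the initialization slide; the names `reported` and `motif_at` are hypothetical, and only a few entries are checked.

```python
# Sequences from the initialization slide, paired with the
# (start, motif) entries reported for them by the motif sampler.
reported = {
    "Seq1": ("TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT",
             [(10, "TTATAA"), (18, "TTATCA")]),
    "Seq2": ("CCCACGCAGCCGCCCTCCTCCCCGGTCACTGACTGGTCCTG",
             [(22, "CGGTCA")]),
    "Seq11": ("CATGCCCTCAAGTGTGCAGATTGGTCACAGCATTTCAAGG",
              [(21, "TGGTCA")]),
}

def motif_at(seq, start, motif):
    """True if motif occurs in seq at the given 0-based start."""
    return seq[start:start + len(motif)] == motif

results = {
    name: [motif_at(seq, start, motif) for start, motif in hits]
    for name, (seq, hits) in reported.items()
}
```

Every entry checked here matches under the 0-based reading, so a 1-based position is the listed start plus one.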