Motif identification with Gibbs Sampler

Xuhua Xia

Background
Gibbs sampling is named after Josiah Willard Gibbs (February 11, 1839 – April 28, 1903), winner of the Copley Medal of the Royal Society of London (RSL) in 1901.
It is one of the Markov chain Monte Carlo algorithms.
Biological applications:
- Identification of regulatory sequences of genes (Aerts et al., 2005; Coessens et al., 2003; Lawrence et al., 1993; Qin et al., 2003; Thijs et al., 2001; Thijs et al., 2002a; Thijs et al., 2002b; Thompson et al., 2004; Thompson et al., 2003) and functional motifs in proteins (Mannella et al., 1996; Neuwald et al., 1995; Qu et al., 1998)
- Classification of biological images (Samso et al., 2002)
- Pairwise sequence alignment (Zhu et al., 1998) and multiple sequence alignment (Holmes and Bruno, 2001; Jensen and Hein, 2005)

Main Objective Gibbs sampler
Name       Sequence
YBL059W    GTATGCATAGGCAATAACTTCGGCCTCATACTCAAAGAACACGTTTACTAACATAACTTATTTACATAG
YCL012C    GTATGTATATTTAAGCATTTGAAAATGCAAAGTGCAAACCGTATCAAATTACTAACAGCTGTAATAG
HMRA1_1    GTATGTTTTCATTTCAAGGATAGCCTTTGAATCAATTTACTAACAATACTTCAG
HMRA1_2    GTATGTAATATGAGAATCAAACTTAAATATATCCTATACTAACAATTTGTAG
TAN1       GTATGTCTGTCTGCACACGAATTAGAGTTCTTTAAGTACTAACGATCAAAAGTAATAG
AML1_2     GTAAGTACAGGATATTTTCAACACAGTAACGTAGAATTACTAACTAACACGAAACTTAATAG
QCR10      GTAAGTATCCTATCATATTATGTGAGCTAGAACCGAATTAGTATACTAACATTTATAATACAG
YHR199C-A  GTATGTCACCTCACCGCAACACTCTACCCCCAACCTTCACCCGCACTAATTACTAACACAACCTCAG
YIL156W-B  GTAAGTACCCAAATGAGCTACTAACAACGCATCCGGTAATACTAACAAGAGAAATTGGTTAG
DID4       GTATGTTGTTCTGTATTTGGATCAGTTATTTTAGTGAACATACTAACGTTAATTATTTGAGTTTTTAG
YLR211C    GTAAGTTTTTGAATTATGCCCCCACTTTTTTTTTGTTCATGGTGACTAACATGAATTAG
TAD3_1     GTATGTATATGATTTTTGCTTTTGTATTCATGGAAGTAAACTAACTAGTAAAGTAG
TAD3_2     GTATGTATCTATGAGATCTAACAGAAATCAGGATCAATTAACTAACTTTCAAACATATATAAGTGCAG
YOL047C    GTATGTTGATGTTCCAATAAGCAGATCATGTTTTTTAAGCCGTCATACTAACCGCCTTTGAAG
……         ……

Site sampler
Motif sampler

Rationale of Gibbs sampler
Input: the 14 sequences (YBL059W to YOL047C) listed under Main Objective.
Randomly sample a 6mer from each sequence and record the start positions as Ai: 27, 44, 30, 35, 1, 37, 7, 8, 35, 18, 32, 42, 46, 50.
Progressively pick better 6mers to replace the current ones until there is no better 6mer to pick. The final 6mers represent a motif.
How do I know which 6mer is better than the existing one in each sequence?
Relevant review questions:
- How many possible 6mers are there in YBL059W, which has a sequence length of 69?
- What information is needed to compute a position weight matrix (PWM)?
- Given a PWM, how is a position weight matrix score (PWMS) computed?
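The first review question can be checked directly: a sequence of length n contains n - k + 1 overlapping k-mers. The following is a minimal Python sketch; the function names count_kmers and random_starts are illustrative (not from the slides), and start positions are 1-based to match the Ai values above.

```python
import random

def count_kmers(seq_len, k):
    """Number of overlapping k-mers (possible start positions) in a sequence."""
    return max(seq_len - k + 1, 0)

def random_starts(seqs, k, rng):
    """Draw one random 1-based k-mer start position per sequence."""
    return [rng.randint(1, count_kmers(len(s), k)) for s in seqs]

# First sequence from the slides.
YBL059W = "GTATGCATAGGCAATAACTTCGGCCTCATACTCAAAGAACACGTTTACTAACATAACTTATTTACATAG"

print(count_kmers(len(YBL059W), 6))      # number of candidate 6mers
print(YBL059W[27 - 1:27 - 1 + 6])        # the 6mer at the sampled position 27
print(random_starts([YBL059W], 6, random.Random(1)))
```

For YBL059W this gives 64 candidate 6mers, and the 6mer at position 27 is CATACT, the one used in the predictive update.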

Step 1: Pick up a random 6mer
For each of the 14 sequences listed under Main Objective, pick up a random 6mer (start positions Ai).
From the picked 6mers, tabulate the site-specific counts Cij (nucleotide i at motif site j); from the rest of the sequences, tabulate the global counts Ci:

     Ci1  Ci2  Ci3  Ci4  Ci5  Ci6  Ci
A
C
G
T
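A small sketch of how the two count tables could be filled in Python. The function name and the toy sequences are hypothetical; start positions are 1-based as in Ai, and Ci counts every nucleotide that lies outside the picked k-mers.

```python
BASES = "ACGT"

def site_and_background_counts(seqs, starts, k):
    """Return (cij, ci) for 1-based k-mer start positions.

    cij[base][j]: how often `base` occurs at site j of the picked k-mers.
    ci[base]:     how often `base` occurs outside the picked k-mers.
    """
    cij = {b: [0] * k for b in BASES}
    ci = {b: 0 for b in BASES}
    for seq, start in zip(seqs, starts):
        s = start - 1  # convert to a 0-based index
        for pos, base in enumerate(seq):
            if s <= pos < s + k:
                cij[base][pos - s] += 1
            else:
                ci[base] += 1
    return cij, ci

# Toy input (not from the slides): 3-mers CGT and ACG are picked.
cij, ci = site_and_background_counts(["ACGTACGT", "TTACGTAA"], [2, 3], 3)
for b in BASES:
    print(b, cij[b], ci[b])
```

Each column of Cij sums to the number of sequences, and the Ci column sums to the total sequence length minus k times the number of sequences.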

Predictive update: 1
Sequences: YBL059W, YCL012C, …, YOL047C (as listed under Main Objective).

Counts from the initial 6mers (only the A row is preserved):

     Ci1  Ci2  Ci3  Ci4  Ci5  Ci6  Ci
A    4    5    4    5    3    6    274
C
G
T

Take CATACT in YBL059W out of Cij and put it into Ci, then recompute the table (rows A, C, G, T and a sum row).

PWM:
  pij = Cij / sum_i(Cij)    (Cij divided by the column sum for site j)
  pi  = Ci / sum_i(Ci)      (Ci divided by the sum of the Ci column)
  PWMij = log2(pij/pi)
A pseudocount is used to avoid log2(0).

PWM entries preserved from the slide (site positions not preserved): A 0.4004; C 0.4630, 1.1997; T 0.5144, 0.2514.
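The predictive update and the PWM formula can be sketched in Python as follows. This is an illustration under stated assumptions: the slide does not say how its pseudocount is applied, so the additive pseudocount alpha (added to every cell, with 4 × alpha added to each denominator) is an assumption, and the function names and toy counts are hypothetical.

```python
import math

BASES = "ACGT"

def move_kmer_to_background(cij, ci, kmer):
    """Take one k-mer out of the site counts Cij and put it into Ci."""
    for j, base in enumerate(kmer):
        cij[base][j] -= 1
        ci[base] += 1

def build_pwm(cij, ci, alpha=0.5):
    """PWMij = log2(pij / pi), with an assumed additive pseudocount alpha."""
    k = len(cij[BASES[0]])
    total_bg = sum(ci[b] for b in BASES)
    pwm = {}
    for b in BASES:
        p_i = (ci[b] + alpha) / (total_bg + 4 * alpha)
        row = []
        for j in range(k):
            col_sum = sum(cij[x][j] for x in BASES)
            p_ij = (cij[b][j] + alpha) / (col_sum + 4 * alpha)
            row.append(math.log2(p_ij / p_i))
        pwm[b] = row
    return pwm

# Toy counts (not from the slides): two 2-mers "AC" were picked.
cij = {"A": [2, 0], "C": [0, 2], "G": [0, 0], "T": [0, 0]}
ci = {"A": 5, "C": 5, "G": 5, "T": 5}
move_kmer_to_background(cij, ci, "AC")  # predictive update for one sequence
pwm = build_pwm(cij, ci)
for b in BASES:
    print(b, [round(x, 3) for x in pwm[b]])
```

Positive PWM entries mark nucleotides that are over-represented at a site relative to the background; negative entries mark under-represented ones.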

Predictive update: 2
Scan YBL059W with the PWM illustrated in the previous slide:

YBL059W GTATGCATAGGCAATAACTTCGGCCTCATACTCAAAGAACACGTTTACTAACATAACTTATTTACATAG

Site  6mer    PWMS    OddsRatio  Pnorm
1     GTATGC  -0.672  0.511      0.005
2     TATGCA  0.380   1.462      0.015
3     ATGCAT  -0.146  0.864      0.009
4     TGCATA  1.379   3.970      0.040
…
27    CATACT  -0.524  0.592      0.006   (originally picked)
…
45    TTACTA  2.433   11.392     0.116   (largest odds ratio; the 6mer to replace the originally picked one)
…
64    ACATAG                             (only one value, 0.000, is preserved for this row)

Number of sites: 69 - 6 + 1 = 64.
In this table OddsRatio = exp(PWMS), e.g., exp(2.433) ≈ 11.39, so the tabulated scores are on the natural-log scale.
Pnorm is the odds ratio scaled to sum to 1: Pnorm_i = OddsRatio_i / Sum(OddsRatio).

Two ways to pick a new 6mer:
1. Pick the one with the largest PWMS.
2. Imagine a dart-board with target areas proportional to Pnorm. Randomly throw a dart (using a computer, of course) and pick up the corresponding L-mer. Note that an L-mer with a large Pnorm value has a high chance of being hit.

Take the new 6mer out of Ci and put it into Cij, then perform the predictive update for the 2nd sequence.

Counts with the new 6mer (TTACTA) added to Cij (most Cij entries are not preserved):
- Ci column: A 274, C 131, G 127, T 252, sum 784
- Each Cij column sums to 14
- Preserved Cij fragments: A 4 5 3 7; C 1 2 6
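The scan and the two selection rules can be sketched in Python as below. The function names and the toy PWM are hypothetical. The sketch assumes the odds ratio is log_base ** PWMS; the values tabulated on the slide match log_base = e (for instance exp(2.433) is about 11.39), while a PWM built with log2 would use log_base = 2.

```python
import math
import random

def scan(seq, pwm, k, log_base=2.0):
    """Return (site, kmer, PWMS, odds ratio, Pnorm) for every k-mer; sites are 1-based."""
    rows = []
    for s in range(len(seq) - k + 1):
        kmer = seq[s:s + k]
        pwms = sum(pwm[base][j] for j, base in enumerate(kmer))
        rows.append((s + 1, kmer, pwms, log_base ** pwms))
    total = sum(r[3] for r in rows)
    return [(site, kmer, pwms, odds, odds / total) for site, kmer, pwms, odds in rows]

def pick_best(rows):
    """Rule 1: the k-mer with the largest PWMS."""
    return max(rows, key=lambda r: r[2])

def pick_dart(rows, rng):
    """Rule 2: sample a k-mer with probability proportional to Pnorm."""
    return rng.choices(rows, weights=[r[4] for r in rows], k=1)[0]

# Toy PWM for k = 2 (not from the slides).
toy_pwm = {"A": [1.0, 1.0], "C": [0.0, 0.0], "G": [0.0, 0.0], "T": [-1.0, -1.0]}
rows = scan("AATT", toy_pwm, 2)
for row in rows:
    print(row)
print("best:", pick_best(rows))
print("dart:", pick_dart(rows, random.Random(0)))
```

Rule 1 is greedy and deterministic; rule 2 is the stochastic choice that lets the sampler occasionally accept a lower-scoring k-mer and so escape a poor starting alignment.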

Predictive update: 3
Now update the second sequence as before, i.e., take its current 6mer out of Cij and put it into Ci, then scan with the current PWM to pick up a new 6mer.
Continue with the 3rd, 4th, … sequences and back to the 1st sequence again, many times (possibly millions or billions of times).
When to stop? Use F as a criterion:
- Each time sequences 1 to N have been updated, compute F.
- If a new F is not greater than the previous F, stop and record the final F as F1.Max.
- Go all the way back to the beginning, randomly generate a new set of 6mers, go through the process again, and record the final F as F2.Max.
- Stop after, typically, two new Fi.Max values are not greater than what was obtained before.
Report the results associated with the largest Fi.Max:
- PWM, and PWMS for each final 6mer
- The aligned motifs
- Associated statistics

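Calculation of F
From Cij and Ci, compute pij and pi; then sum Cij × log2(pij/pi) over all nucleotides i and motif sites j:
  F = sum_i sum_j Cij × log2(pij/pi)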
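A minimal Python sketch of this calculation, assuming F is the sum of Cij × log2(pij/pi) over all nucleotides and sites. Cells with Cij = 0 are skipped (they contribute nothing to the sum), and the sketch assumes every Ci is positive; the function name and toy counts are illustrative.

```python
import math

BASES = "ACGT"

def compute_f(cij, ci):
    """F = sum over bases i and sites j of Cij * log2(pij / pi)."""
    k = len(cij[BASES[0]])
    total_bg = sum(ci[b] for b in BASES)
    f = 0.0
    for j in range(k):
        col_sum = sum(cij[b][j] for b in BASES)
        for b in BASES:
            c = cij[b][j]
            if c == 0:
                continue  # a zero count adds nothing to the sum
            p_ij = c / col_sum
            p_i = ci[b] / total_bg  # assumes ci[b] > 0
            f += c * math.log2(p_ij / p_i)
    return f

# Toy counts (not from the slides): 4 sequences, perfectly conserved 2-mer "AA".
conserved = {"A": [4, 4], "C": [0, 0], "G": [0, 0], "T": [0, 0]}
uniform = {"A": [1, 1], "C": [1, 1], "G": [1, 1], "T": [1, 1]}
background = {"A": 10, "C": 10, "G": 10, "T": 10}
print(compute_f(conserved, background))
print(compute_f(uniform, background))
```

A perfectly conserved alignment scores high, whereas site frequencies identical to the background give F = 0, which is why an increasing F indicates a better-defined motif.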

Summary: To find a motif of length L…
1. Randomly pick up an L-mer from each sequence. Record the positions of these L-mers as Ai().
2. Obtain site-specific counts of the four nucleotides (or 20 amino acids) from the L-mers, and global counts from the sequences excluding the L-mers. Record them as Cij() and Ci().
3. Predictive update. This is the key step that picks better L-mers:
   a. Take the L-mer from S1 out of Cij() and put it into Ci().
   b. Generate a PWM from Cij() and Ci().
   c. Use the PWM to scan S1 to obtain the PWMS and odds ratio for all L-mers.
   d. Use one of two methods to choose the new L-mer:
      - the L-mer with the highest PWMS, or
      - use the odds ratios to compute Pnorm. Imagine a dart-board with target areas proportional to Pnorm. Randomly throw a dart and pick up the corresponding L-mer. Note that an L-mer with a large Pnorm value has a high chance of being hit.
   e. Take this new L-mer out of Ci() and put it into Cij().
4. Repeat step 3 with S2, S3, …, SN. Compute F.
5. Repeat steps 3 and 4 with S1, S2, …, SN, computing F each time, until F does not increase. Record the final F as F1.Max.
6. Start from the very beginning, i.e., "Randomly pick up ……", and record the final F as F2.Max.
7. Stop after, typically, 2 rounds of iterations that do not yield a larger Fi.Max.
8. Report the results associated with the largest Fi.Max:
   - PWM, and PWMS for each final 6mer
   - The aligned motifs
   - Associated statistics
9. If you suspect that there might be other motifs, mask the found motif and run the Gibbs sampler again.
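The steps summarized above can be put together in a compact Python sketch of a site sampler. It is an illustration under several assumptions rather than the author's implementation: the additive pseudocount alpha, the log2 scale, the use of the same pseudocounted PWM inside F, and all function names are choices made here. Positions are 0-based internally and returned 1-based.

```python
import math
import random

BASES = "ACGT"

def counts(seqs, starts, k, skip=None):
    """Cij and Ci; the k-mer of sequence `skip` is counted in Ci (predictive update)."""
    cij = {b: [0] * k for b in BASES}
    ci = {b: 0 for b in BASES}
    for n, (seq, s) in enumerate(zip(seqs, starts)):
        for pos, base in enumerate(seq):
            if n != skip and s <= pos < s + k:
                cij[base][pos - s] += 1
            else:
                ci[base] += 1
    return cij, ci

def pwm_from_counts(cij, ci, alpha):
    """PWMij = log2(pij / pi) with an assumed additive pseudocount alpha."""
    k = len(cij["A"])
    n_sites = sum(cij[b][0] for b in BASES)
    n_bg = sum(ci.values())
    p_bg = {b: (ci[b] + alpha) / (n_bg + 4 * alpha) for b in BASES}
    return {b: [math.log2((cij[b][j] + alpha) / (n_sites + 4 * alpha) / p_bg[b])
                for j in range(k)] for b in BASES}

def f_score(seqs, starts, k, alpha):
    """F = sum of Cij * PWMij over all bases and sites."""
    cij, ci = counts(seqs, starts, k)
    pwm = pwm_from_counts(cij, ci, alpha)
    return sum(cij[b][j] * pwm[b][j] for b in BASES for j in range(k))

def one_run(seqs, k, rng, alpha, greedy):
    """Random start, then sweeps over S1..SN until F stops increasing."""
    starts = [rng.randrange(len(s) - k + 1) for s in seqs]
    best_f = f_score(seqs, starts, k, alpha)
    while True:
        trial = list(starts)
        for n, seq in enumerate(seqs):
            cij, ci = counts(seqs, trial, k, skip=n)
            pwm = pwm_from_counts(cij, ci, alpha)
            scores = [sum(pwm[seq[s + j]][j] for j in range(k))
                      for s in range(len(seq) - k + 1)]
            if greedy:  # method 1: highest PWMS
                trial[n] = max(range(len(scores)), key=scores.__getitem__)
            else:       # method 2: dart-board sampling by odds ratio
                odds = [2.0 ** x for x in scores]
                trial[n] = rng.choices(range(len(odds)), weights=odds, k=1)[0]
        f = f_score(seqs, trial, k, alpha)
        if f <= best_f:
            return starts, best_f
        starts, best_f = trial, f

def gibbs_site_sampler(seqs, k, seed=0, alpha=0.5, greedy=False, patience=2):
    """Restart until `patience` consecutive runs fail to improve the best F."""
    rng = random.Random(seed)
    best_starts, best_f = one_run(seqs, k, rng, alpha, greedy)
    misses = 0
    while misses < patience:
        starts, f = one_run(seqs, k, rng, alpha, greedy)
        if f > best_f:
            best_starts, best_f, misses = starts, f, 0
        else:
            misses += 1
    return [s + 1 for s in best_starts], best_f
```

A call such as gibbs_site_sampler(sequences, 6) returns the 1-based start position of the chosen 6mer in each sequence together with the associated F. Because the starting positions and the dart-board choices are random, different seeds may report different alignments, which is the reason for the restarts in steps 6 and 7.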

Final Report: Final Frequency
The final report contains three tables, each with rows A, C, G, U:
- Final site-specific counts
- Final site-specific frequencies
- Final PWM [ln(Qij/Q0)]

Motif alignment

Seq  Sequence
1    UCAGAACCAGUUAUAAAUUUAUCAUUUCCUUCUCCACUCCU
     CCCACGCAGCCGCCCUCCUCCCCGGUCACUGACUGGUCCUG
     UCGACCCUCUGAACCUAUCAGGGACCACAGUCAGCCAGGCAAG
     AAAACACUUGAGGGAGCAGAUAACUGGGCCAACCAUGACUC
     GGGUGAAUGGUACUGCUGAUUACAACCUCUGGUGCUGC
     AGCCUAGAGUGAUGACUCCUAUCUGGGUCCCCAGCAGGA
     GCCUCAGGAUCCAGCACACAUUAUCACAAACUUAGUGUCCA
     CAUUAUCACAAACUUAGUGUCCAUCCAUCACUGCUGACCCU
     UCGGAACAAGGCAAAGGCUAUAAAAAAAAUUAAGCAGC
     GCCCCUUCCCCACACUAUCUCAAUGCAAAUAUCUGUCUGAAACGGUUCC
     CAUGCCCUCAAGUGUGCAGAUUGGUCACAGCAUUUCAAGG
12   GAUUGGUCACAGCAUUUCAAGGGAGAGACCUCAUUGUAAG
     UCCCCAACUCCCAACUGACCUUAUCUGUGGGGGAGGCUUUUGA
     CCUUAUCUGUGGGGGAGGCUUUUGAAAAGUAAUUAGGUUUAGC
     AUUAUUUUCCUUAUCAGAAGCAGAGAGACAAGCCAUUUCUCUUUCCUCCC
     GAAAAAAAAUAAAUGAAGUCUGCCUAUCUCCGGGCCAGAGCCCCU
     UGCCUUGUCUGUUGUAGAUAAUGAAUCUAUCCUCCAGUGACU
     GGCCAGGCUGAUGGGCCUUAUCUCUUUACCCACCUGGCUGU
     CAACAGCAGGUCCUACUAUCGCCUCCCUCUAGUCUCUG
     CCAACCGUUAAUGCUAGAGUUAUCACUUUCUGUUAUCAAGUGGCUUCAGC
     GGGAGGGUGGGGCCCCUAUCUCUCCUAGACUCUGUG
     CUUUGUCACUGGAUCUGAUAAGAAACACCACCCCUGC

Motif scores

SeqName  Motif   Start  PWMS
S1       UUAUCA  18     493.3101
S        CGGUCA
S        CUAUCA
S        AGAUAA
S        UGAUUA
S        CUAUCU
S        UUAUCA
S        UUAUCA
S        CUAUAA
S        CUAUCU
S        UGGUCA
S        UUGUAA
S        UUAUCU
S        UUAUCU
S        UUAUCA
S        UUAUCA
S        CUAUCU
S        UUGUCA

Motif sampler output
Columns: SeqName, N, then sites 1, 2, 3, each given as Start(Motif, PWMS).

SeqName  N  1                   2              3
Seq1        10(TTATAA,93.4541)  18(TTATCA, )
Seq2        22(CGGTCA, )
Seq3        14(CTATCA, )
Seq4
Seq5        16(TGATTA, )
Seq6        18(CTATCT, )
Seq7        20(TTATCA, )
Seq8        2(TTATCA, )         24(CCATCA, )
Seq9        17(CTATAA, )
Seq10       14(CTATCT, )        28(ATATCT, )   32(CTGTCT, )
Seq11       21(TGGTCA, )
Seq12       3(TGGTCA, )         33(TTGTAA, )
Seq13       20(TTATCT, )
Seq14       2(TTATCT, )
Seq15       1(TTATTT, )         10(TTATCA, )   36(TTCTCT, )
…
Seq25       17(TTATCT, )
Seq26       15(CTATCG, )
Seq27       19(TTATCA, )        25(CTTTCT, )   32(TTATCA, )
Seq28       15(CTATCT, )
Seq29       2(UUGUCA, )         15(TGATAA, )

Functional validation of the motif
A shared motif is found. What is it for? How can its function be inferred?
By analogy with genotype, epigenotype, and phenotype:
- Genotype: sum of observable/inferable genetic features
- Epigenotype: sum of (inheritable) DNA modifications that do not affect base pairing during DNA replication
- Phenotype: sum of observable/measurable features dependent on genotype, epigenotype, environment and the interaction among the three
- Statistical validation of genotypic/epigenotypic effects: phenotypic differences can be explained by genotypic/epigenotypic differences
At gene level:
- Genocule: gene sequence with all site-specific signals (promoter, transcription start site, translation initiation signals, splice sites, codons, etc.)
- Cellular environment: all cellular machinery (ribosome, tRNA pool, spliceosome, lysosome, etc.) that can decode/interpret the signals
- Phenocule: characterizable gene features (transcription and translation activities, splicing efficiency, function of gene products, etc.)
- Statistical validation of genoculic effect: phenoculic differences (e.g., in translation initiation efficiency) can be explained by genoculic differences (e.g., in location and strength of Shine-Dalgarno sequences)

A word of caution
Many bioinformatic tools have been developed for identifying genes and regulatory motifs, or for assigning proteins to different cellular locations. These tools report their own performance. For example, a gene will be marked as a gene with a probability of, say, 0.9, and a non-gene will be marked as a gene with a probability of 0.01. If such a bioinformatic tool marks a 300mer as a gene, what is the probability that the 300mer is truly a gene? (Nobody wants to experimentally validate a putative gene that has a low probability of being a true gene.)

Bayes’ Theorem and inverse probability
Two containers: C1 and C2.
The chance of getting a red ball by random sampling is 0.8 for C1 and 0.2 for C2. What is the probability that a ball is from C1 given that it is red (Bred)?
If C1 and C2 are equally likely to be sampled, then the probability is
  p(C1|Bred) = p(Bred|C1)p(C1) / [p(Bred|C1)p(C1) + p(Bred|C2)p(C2)] = (0.8 × 0.5) / (0.8 × 0.5 + 0.2 × 0.5) = 0.8
- p(C1), p(C2): prior probability distribution, p(C1) = p(C2) = 0.5 (uninformative prior, so the outcome will depend on the likelihood only)
- p(Bred|C1), p(Bred|C2): likelihood, p(Bred|C1) = 0.8, p(Bred|C2) = 0.2
- The denominator is the prior (marginal) probability of Bred.
- The posterior probability results from revising the prior by the data.
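The container example is a direct application of Bayes' theorem to a discrete set of hypotheses. A minimal Python sketch follows; the function name posterior is illustrative.

```python
def posterior(priors, likelihoods):
    """Posterior probability of each hypothesis given the observed data.

    priors[i]      = p(Ci)
    likelihoods[i] = p(data | Ci)
    """
    joint = [p * l for p, l in zip(priors, likelihoods)]
    evidence = sum(joint)  # the denominator: probability of the data
    return [j / evidence for j in joint]

# Red ball, containers equally likely, p(Bred|C1) = 0.8, p(Bred|C2) = 0.2.
print(posterior([0.5, 0.5], [0.8, 0.2]))
```

With equal priors the posterior is simply the likelihoods rescaled to sum to 1, which is why p(C1|Bred) equals 0.8 here; with unequal priors the two would differ.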

Bayes’ Theorem and inverse probability
Two containers: C1 and C2.
The chance of getting a red ball by random sampling is 0.9 given C1 and 0.2 given C2. What is the probability that a ball is from C1 given that it is red (Bred)?
If C1 and C2 are equally likely to be sampled, then the probability is
  p(C1|Bred) = p(Bred|C1)p(C1) / [p(Bred|C1)p(C1) + p(Bred|C2)p(C2)] = (0.9 × 0.5) / (0.9 × 0.5 + 0.2 × 0.5) ≈ 0.818
- p(C1), p(C2): prior probability distribution, p(C1) = p(C2) = 0.5 (uninformative prior, so the outcome will depend on the likelihood only)
- p(Bred|C1), p(Bred|C2): likelihood, p(Bred|C1) = 0.9, p(Bred|C2) = 0.2
- The denominator is the prior (marginal) probability of Bred.
Imagine that C1 is a "box" of genes and that C2 is a "box" of non-genes. A gene-predicting program has a false negative probability of 0.1 and a false positive rate of 0.2. Then, with these equal priors:
- If the program reports a sequence as a gene, the probability that the sequence is indeed a gene (i.e., sampled from C1) is 0.45/0.55 ≈ 0.818, and the probability that the sequence is a non-gene is ≈ 0.182 (false positive).
- If the program reports a sequence as a non-gene, the probability that the sequence is indeed a non-gene is (0.8 × 0.5) / (0.8 × 0.5 + 0.1 × 0.5) ≈ 0.889, and the probability that it is a gene is ≈ 0.111 (false negative).

Gene prediction
Imagine that C1 is a "box" of real protein-coding genes and that C2 is a "box" of non-genes. A gene-predicting program behaves as follows:
- If the query sequence is from C1: a probability of 0.9 of reporting Gene, and 0.1 of reporting Non-gene (false negative).
- If the query sequence is from C2: a probability of 0.8 of reporting Non-gene, and 0.2 of reporting Gene (false positive).
Test outcome and associated probabilities, shown here as the probability of each report given the true status:

            Real gene
Report      TRUE   FALSE
Gene        0.9    0.2
Non-gene    0.1    0.8

The same applies to motif predictions.
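How far a report of Gene or Non-gene can be trusted depends on these error rates and on the prior fraction of real genes among the queries. A small Python sketch follows; the function name is illustrative, and the prior of 0.5 used in the example is an assumption carried over from the two-container slides.

```python
def predictive_values(p_gene, sensitivity, false_positive_rate):
    """Return (P(real gene | report Gene), P(non-gene | report Non-gene))."""
    true_pos = p_gene * sensitivity
    false_neg = p_gene * (1 - sensitivity)
    false_pos = (1 - p_gene) * false_positive_rate
    true_neg = (1 - p_gene) * (1 - false_positive_rate)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv

# Equal priors (assumed), P(report Gene | gene) = 0.9, P(report Gene | non-gene) = 0.2.
ppv, npv = predictive_values(0.5, 0.9, 0.2)
print(round(ppv, 3), round(npv, 3))
```

The four intermediate products are the joint probabilities of the 2 × 2 table of true status against report; each predictive value is one cell divided by its row total.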

When most of genome is not coding
Coding (C): P(Gene|C) = 0.9
Fraction of non-coding genome (J for Junk): P(Gene|J) = 0.015

            Real gene
Report      TRUE   FALSE
Gene
Non-gene

Disease diagnosis and prior effect
Disease (D): true positive rate, P(+|D) = 0.9
People with no disease (H for healthy): false positive rate, P(+|H) = 0.015
Inappropriate use of uninformative priors: using P(H) = P(D) = 0.5.
- Sick people are much more likely to take the test than healthy people, so sick people account for almost 100% of the people who actually take the test, i.e., P(D) ≈ 1 instead of …
- People with AIDS are much less likely to take the test than healthy people, so P(D) < 0.036.
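The effect of the prior can be made concrete by computing the probability of disease given a positive test for several values of P(D). In the Python sketch below, the rates 0.9 and 0.015 and the priors 0.5 and 0.036 come from this slide; the prior 0.001 and the function name are hypothetical additions for illustration.

```python
def prob_disease_given_positive(p_disease, true_pos=0.9, false_pos=0.015):
    """P(D | positive test) by Bayes' theorem."""
    sick_and_positive = p_disease * true_pos
    healthy_and_positive = (1 - p_disease) * false_pos
    return sick_and_positive / (sick_and_positive + healthy_and_positive)

# 0.5 is the uninformative prior; 0.001 is a hypothetical low prevalence.
for prior in (0.5, 0.036, 0.001):
    print(prior, round(prob_disease_given_positive(prior), 4))
```

The same test looks far more convincing under the uninformative prior of 0.5 than under a realistic low prior, which is the reason the choice of prior must reflect who actually takes the test.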

Bayesian inference on breast cancer
Population: women aged 40+.
- A woman has a chance of 0.01 of getting breast cancer.
- 80% of those with breast cancer will get a positive mammography.
- 10% of those without breast cancer will also get a positive mammography.
What is the probability that a woman with a positive mammography actually has breast cancer?

Priors           Test result     Joint probability
Cancer 0.01      positive 0.8    0.01 × 0.8 = 0.008
                 negative 0.2    0.01 × 0.2 = 0.002
No cancer 0.99   positive 0.1    0.99 × 0.1 = 0.099
                 negative 0.9    0.99 × 0.9 = 0.891

Posterior: P(cancer | positive) = 0.008 / (0.008 + 0.099) = 0.075; P(no cancer | positive) = 0.099 / (0.008 + 0.099) = 0.925.
