A Very Basic Gibbs Sampler for Motif Detection Frances Tong July 28, 2004 Southern California Bioinformatics Summer Institute.
Published byModified over 4 years ago
Presentation on theme: "A Very Basic Gibbs Sampler for Motif Detection Frances Tong July 28, 2004 Southern California Bioinformatics Summer Institute."— Presentation transcript:
A Very Basic Gibbs Sampler for Motif Detection Frances Tong July 28, 2004 Southern California Bioinformatics Summer Institute
What? What is a motif? What is the biological point? What is the goal? What is a limitation with dynamic programming? What is Gibbs sampling? What is the program going to do? What is the program missing? What is a much better program?
What is a motif? A motif is a sequence pattern that occurs repeatedly in a group of related dna or protein sequences.
What is the biological point? Encoding the “structural motif” of a protein (dna in exons) Prediction of function (protein) Protein binding sites (dna) Transcription binding factors
What is the goal? Perform local multiple sequence alignment to find consensus sequences (motifs)
What is a limitation with dynamic programming? Memory and time complexity issues Size of search space = (L-W+1)^N L = length of a sequence W = width of motif N = number of sequences ex: L=30, W=7, N=10 (30 -7+1)^10 = 1333735776850284124449081472843776 Alignment of even four such sequences will take a few hours ~10^4 seconds
What is Gibbs sampling? Stochastic optimization method Works well with local multiple alignment without gaps (motif searching) Searches for the statistically most probable motifs by sampling random positions instead of going through entire search space
What is the program going to do? 1. Ask user for : a) file containing multiple dna or protein sequences b) motif width c) how many motifs wanted 2. Calculate the background frequencies of A,C,G,T from all the sequences. [0.34951456310679613, 0.17799352750809061, 0.21035598705501618, 0.23300970873786409]
What is the program going to do? 3. Generate random start positions for the motif in each sequence. ex: 10 sequences, 30 bp in length, motif width of 7 start = [2, 6, 9, 14, 5, 7, 20, 20, 6, 22] >> random.uniform(0,ceiling) where ceiling=len(sequence)-width
What is the program going to do? 4. Construct position specific score matrix from all sequences except one. Motif Position 0123456 A 0.600.700.50.1 C 00.90.2 00.30 G 000.70.100.6 T 00000.30.50.2
What is the program going to do? 5. Score the left-out sequence according to the position specific score matrix:
What is the program going to do? Example: Use the position specific matrix and background from before: [A: 0.34951456310679613, C: 0.17799352750809061, G: 0.21035598705501618, T: 0.23300970873786409] Motif Position 0123456 A 0.600.70.20.50.1 C 00.90.20.10.3 0 G 000.70.100.6 T 00000.10.50.2 GATTACA:
What is the program going to do? 6. Randomly generate another start position of the motif for that left-out sequence. 7. Score that sequence with its new start position. 8. Compare this new score with its original score. 9. If newscore >= oldscore, then jump to that new start position, else jump to that new start position with probability =
What is the program going to do? 10. Start all over again with this updated start position with another sequence left out Do this many many times! ~ 1000 iterations Gibbs will converge to a stationary distribution of the start positions => a probable alignment of the multiple sequences
What is the program missing? Doesn’t do reinitializations in the middle to get out of local maxima Doesn’t optimize the width (you have to specify width explicitly) Doesn’t do the Bayesian approach – just frequentist (easier for me and for you to understand!) Doesn’t read in fasta files Doesn’t do error checking! And other things that don’t know they are missing yet!
What is a much better program? Gibbs Motif Sampler http://bayesweb.wadsworth.org/gibbs/gib bs.html http://bayesweb.wadsworth.org/gibbs/gib bs.html AlignAce http://atlas.med.harvard.edu/cgi- bin/alignace.pl http://atlas.med.harvard.edu/cgi- bin/alignace.pl