
1 Discovery of Conserved Sequence Patterns Using a Stochastic Dictionary Model. Authors: Mayetri Gupta & Jun S. Liu. Presented by Ellen Bishop, 12/09/2003

2 “ofallthewordsinthisunsegmentedphrasetherearesomehidden” The challenge is to develop an algorithm that can partition a DNA sequence into meaningful “words”

3 Presentation Outline: Introduction; MobyDick; Stochastic Dictionary-based Data Augmentation (SDDA) Algorithm; Extensions; Results

4 Introduction. Publicly available databases of genome sequences pose new challenges: How is gene expression regulated to meet the needs of specific cell types, and to let cells respond to changes? How can gene regulatory networks be analyzed more efficiently?

5 Gene Regulation. Transcription factors (TFs) play a critical role in gene expression, either enhancing or inhibiting it. Short DNA motifs, roughly 17-30 nucleotides long, often correspond to TF binding sites. Goal: build a model for TF binding sites given a set of DNA sequences thought to be regulated together.

6 MobyDick. A dictionary-building algorithm developed in 2000 by Bussemaker, Li, and Siggia. It decomposes sequences into the most probable set of words: start with a dictionary of single letters; test the concatenation of each pair of words and its frequency; update the dictionary.
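The pair-testing step can be sketched as follows. This is a simplified illustration, not the authors' exact test (MobyDick uses a likelihood-based criterion): here a pair of adjacent words is flagged when its observed count exceeds a multiple of its expected count under independence; the threshold value is an assumption.

```python
from collections import Counter

def overrepresented_pairs(words, ratio_threshold=2.0):
    """Flag adjacent word pairs that occur more often than expected if
    words were drawn independently. A sketch of MobyDick's dictionary
    update; the real algorithm uses a likelihood-based test."""
    n = len(words)
    unigram = Counter(words)
    bigram = Counter(zip(words, words[1:]))
    flagged = []
    for (w1, w2), observed in bigram.items():
        expected = unigram[w1] * unigram[w2] / n  # independence baseline
        if observed / expected > ratio_threshold:
            flagged.append(w1 + w2)  # candidate new dictionary word
    return flagged
```

Each flagged concatenation would then be added to the next dictionary and the segmentation recomputed.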

7 MobyDick results. Tested on the first 10 chapters of Moby Dick: 4,214 unique words, about 1,600 of them repeated. The resulting dictionary had 3,600 unique words and found virtually all 1,600 repeated words.

8 SDDA: Stochastic Dictionary-based Data Augmentation. Stochastic words are represented by a probabilistic word matrix (PWM). Some definitions:
- D = dictionary size
- S = the sequence data, generated by concatenation of words (the original symbol for the data was lost in transcription; S is used here)
- D = {M_1, ..., M_D}, the dictionary of words, including the single letters
- P = (p(M_1), ..., p(M_D)), the word probability vector
- A_k = {A_1k, ..., A_nk} denotes the site indicators for motif M_k (A_ik = 1 or 0)

9 SDDA: more definitions.
- q = 4; A, G, C, T are the first q words in the dictionary
- A partition splits the sequence into parts P_1, ..., P_k, each part corresponding to a dictionary word
- N = total number of words in the partition; N_{M_j} = number of occurrences of word type M_j in the partition
- w_j (j = 1, ..., D) denotes the word lengths
- The D − q motif matrices are denoted Θ_{q+1}, ..., Θ_D
- If the k-th word has width w, its probability matrix is Θ_k = (θ_{1k}, ..., θ_{wk})

10 Probabilistic Word Matrix (columns are motif positions 1-5; the final entries of the C and T rows were lost in transcription):

      1    2    3    4    5
A   .85  .07  .80  .02  .12
C   .05  .78  .07  .01   ?
G   .10  .05  .12  .96  .85
T   .00  .10  .01  .02   ?

P(ACAGG) = .85 × .78 × .80 × .96 × .85 = .4328
P(GCAGA) = .10 × .78 × .80 × .96 × .12 = .0072
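The two products on this slide can be reproduced directly. The missing final C and T entries are omitted below; they are not needed for the two example words.

```python
# Probabilistic word matrix from the slide; list index = motif position.
# The last entries of the C and T rows were lost in transcription and
# are left out (neither example word uses them).
PWM = {
    "A": [0.85, 0.07, 0.80, 0.02, 0.12],
    "C": [0.05, 0.78, 0.07, 0.01],
    "G": [0.10, 0.05, 0.12, 0.96, 0.85],
    "T": [0.00, 0.10, 0.01, 0.02],
}

def word_prob(word, pwm=PWM):
    """Probability of `word` under the PWM: the product of the
    per-position letter probabilities."""
    p = 1.0
    for pos, letter in enumerate(word):
        p *= pwm[letter][pos]
    return p
```

For example, `round(word_prob("ACAGG"), 4)` gives 0.4328 and `round(word_prob("GCAGA"), 4)` gives 0.0072, matching the slide.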

11 General idea of the algorithm. Start with D^(1) = {A, G, C, T} and estimate the likelihood of those 4 words in the dataset. Then look at each pair of letters, say AT: if it is over-represented compared with what D^(1) predicts, add it to the dictionary D^(2), and repeat for all pairs. In general, consider the concatenations of all pairs of words in D^(n) and form a new dictionary D^(n+1) by including those new words that are more abundant than expected by chance.

12 SDDA Algorithm.
1) Partitioning: sample words given the current value of the stochastic word matrix and the word usage probabilities. A recursive summation of probabilities evaluates the partial likelihood up to every point in the sequence:
L_i(Θ) = Σ_k P(S[i−w_k+1 : i] | Θ) · L_{i−w_k}(Θ)
Words are then sampled sequentially backward, starting at the end of the sequence. A word of type k starting at position i is sampled according to the conditional probability
P(A_ik = 1 | A_{i+w_k}, Θ) = P(S[i : i+w_k−1] | Θ_k, P) · L_{i−1}(Θ) / L_{i+w_k−1}(Θ)
If none of the words is selected, the appropriate single-letter word is assumed and i is decremented by 1.
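The recursive summation in step 1 can be sketched for fully deterministic words. This is a minimal illustration: the dictionary and its probabilities are invented for the example, and the scoring of stochastic words through their PWMs is omitted.

```python
def partition_likelihood(seq, word_probs):
    """Forward recursion over all partitions of `seq` into dictionary
    words: L[i] = sum over words w ending at position i of
    p(w) * L[i - len(w)], with L[0] = 1."""
    n = len(seq)
    L = [0.0] * (n + 1)
    L[0] = 1.0
    for i in range(1, n + 1):
        for w, p in word_probs.items():
            k = len(w)
            if k <= i and seq[i - k:i] == w:
                L[i] += p * L[i - k]
    return L
```

With single letters plus a two-letter word "AT" in the dictionary, the likelihood of the sequence "AT" sums two partitions, {A}{T} and {AT}, which is exactly what backward word sampling then draws from.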

13 SDDA Algorithm (continued).
2) Parameter update: given the partition A, update the stochastic word matrix Θ_D and update the word probability vector P by sampling from their posterior distributions.
3) Repeat steps 1 and 2 until convergence, i.e. when the MAP (maximum a posteriori) score stops increasing. This score measures the optimality of the alignment and is computed at each iteration.
4) Increase the dictionary size, D = D + 1, and repeat from step 1, now treating Θ_{D−1} as a known word matrix.
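The word-probability update in step 2 can be sketched as a draw from a Dirichlet posterior. The symmetric Dirichlet prior with pseudo-count 1 is an assumption here; the draw uses the standard gamma construction from the Python standard library.

```python
import random

def sample_word_probs(counts, pseudo=1.0, seed=None):
    """Draw a word-probability vector P from the posterior
    Dirichlet(counts + pseudo): normalize independent
    Gamma(count + pseudo, 1) draws."""
    rng = random.Random(seed)
    draws = [rng.gammavariate(c + pseudo, 1.0) for c in counts]
    total = sum(draws)
    return [d / total for d in draws]
```

Given word counts N_{M_j} from the current partition, the returned vector is a valid probability vector, so steps 1 and 2 can alternate as a Gibbs sampler.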

14 Algorithm Extensions. Phase shift via Metropolis steps; patterns with variable insertions and deletions (gaps); patterns of unknown widths; motif detection in the presence of “low complexity” regions.

15 Phase Shift. If positions 7, 19, 8, 23 give the strongest pattern but the algorithm chooses a_1 = 9, a_2 = 21 early on, then it is likely to also choose a_3 = 10, a_4 = 25: the whole alignment is shifted. Metropolis-step solution: let a = {a_1, ..., a_m} be the starting positions of the motif occurrences. Choose δ = ±1 with probability .5 each and update the motif positions to a + δ with probability min{1, p(a + δ | ·) / p(a | ·)}.
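The Metropolis phase-shift move can be sketched as below. The log-posterior function is supplied by the caller; working in log space is a design choice for numerical stability, and everything else here is an illustrative assumption.

```python
import math
import random

def phase_shift_step(positions, log_post, rng=None):
    """One Metropolis phase-shift move: propose shifting every motif
    start position by the same delta in {-1, +1} (probability 1/2
    each), and accept with probability min(1, p(a + delta) / p(a))."""
    rng = rng or random.Random()
    delta = 1 if rng.random() < 0.5 else -1
    proposal = [a + delta for a in positions]
    log_ratio = log_post(proposal) - log_post(positions)
    if math.log(rng.random()) < min(0.0, log_ratio):
        return proposal  # accepted: all positions shift together
    return positions  # rejected: keep the current positions
```

Because every start position moves by the same delta, the chain can escape a consistently shifted alignment in a single accepted move.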

16 Patterns with gaps / unknown widths. Gaps: an additional recursive sum in the partitioning step (step 1), using i_o, the insertion-opening probability; i_e, the insertion-extension probability; d_o, the deletion-opening probability; and d_e, the deletion-extension probability. Unknown widths: the authors also extended the algorithm to infer the likely pattern width when it is unspecified.

17 Motif detection with “low complexity” regions, e.g. AAAAAAA… or CGCGCGCG… The stochastic dictionary model is expected to control for these by treating the repeats as a series of adjacent words.

18 Results. Two case studies are presented: a simulated dataset with background polynucleotide repeats, and CRP binding sites.

19 Relative performance of SDDA compared to BioProspector (BP) and AlignAce (AA). For each method, the table reports a success rate and a false-positive rate at several EVAL2 values (reconstructed from the flattened transcript; “-” denotes a missing entry):

a)
EVAL2   SDDA Succ  SDDA FP   BP Succ  BP FP   AA Succ  AA FP
 .24      1         .07       1        .02     .3       .43
 .48      .6        .06       .7       0       0        -
 .72      .5        .12       .1       0       0        -
 .96      .7        .02       0        -       0        -

b)
EVAL2   SDDA Succ  SDDA FP   BP Succ  BP FP   AA Succ  AA FP
 .24      1         .03       1        .09     .1       .52
 .48      .9        .12       .7       .01     .1       .62
 .72      .9        .05       .6       0       .1       .36
 .96      1         .03       0        -       0        -

20 Credits.
Slides 6-7: Bussemaker, H.J., Li, H., and Siggia, E.D. (2000), “Building a Dictionary for Genomes: Identification of Presumptive Regulatory Sites by Statistical Analysis,” Proceedings of the National Academy of Sciences USA, 97, 10096-10100.
Slide 9: Liu, J.S., Gupta, M., Liu, X., Mayerhofere, L., and Lawrence, C.E., “Statistical Models for Biological Sequence Motif Discovery,” 1-19.
Slide 14: Lawrence, C.E., Altschul, S.F., Boguski, M.S., Liu, J.S., Neuwald, A.F., and Wootton, J.C. (1993), “Detecting Subtle Sequence Signals: A Gibbs Sampling Strategy for Multiple Alignment,” Science, 262, 208-214.

21 Bibliography.
Bussemaker, H.J., Li, H., and Siggia, E.D. (2000), “Building a Dictionary for Genomes: Identification of Presumptive Regulatory Sites by Statistical Analysis,” Proceedings of the National Academy of Sciences USA, 97, 10096-10100.
Lawrence, C.E., Altschul, S.F., Boguski, M.S., Liu, J.S., Neuwald, A.F., and Wootton, J.C. (1993), “Detecting Subtle Sequence Signals: A Gibbs Sampling Strategy for Multiple Alignment,” Science, 262, 208-214.
Liu, J.S., Gupta, M., Liu, X., Mayerhofere, L., and Lawrence, C.E., “Statistical Models for Biological Sequence Motif Discovery,” 1-19.

