A Statistical Method for Finding Transcription Factor Binding Sites. Authors: Saurabh Sinha and Martin Tompa. Presenter: Christopher Schlosberg, CS598ss.


1 A Statistical Method for Finding Transcription Factor Binding Sites
Authors: Saurabh Sinha and Martin Tompa
Presenter: Christopher Schlosberg, CS598ss

2 Regulation of Gene Expression

3 Difficulties of Motif Finding
- Regulatory sequences do not follow the same orientation as the coding sequence or as each other
- Multiple binding sites might exist for each regulated gene
- Binding sites of a single factor vary widely, and the variations are not well understood

4 Previous & Proposed Methods for Finding Motifs
- Previous methods:
  - Find longer, general motifs
  - Use local search algorithms (Gibbs sampling, expectation maximization, greedy algorithms)
- Proposed method:
  - A TFBS is small enough to use enumerative methods
  - Enumerative statistical methods guarantee global optimality at an affordable cost

5 Proposed Method Highlights
- Allows variations in the binding site instances of a given transcription factor
- Allows motifs to include "spacers"
- Allows overlapping occurrences (in both orientations), which lead to complex dependencies
- Statistical significance of a motif s is based on the frequencies of shorter (more frequent) oligonucleotides
- Uses a Markov chain to model the background genomic distribution
- Uses a z-score to measure statistical significance
- Allows multiple binding sites

6 Characteristics of a Motif
- Any single TFBS has significant variation
- Many motifs have spacers of 1-11 bp
- Variation often occurs as a transition (e.g. purine → purine) rather than a transversion (e.g. pyrimidine → purine)
- Variation occurs less often between a pair of complementary bases
- Indels are uncommon

7 Proposed Motif Definition
- A motif is a string over the alphabet Σ = {A, C, G, T, R, Y, S, W, N}
  - A, C, G, T: DNA bases
  - R: purine (A or G); Y: pyrimidine (C or T); S: strong (C or G); W: weak (A or T); N: spacer (any base)
- The TF database SCPD supports this model of variation
  - Of 50 binding site consensus sequences, 31 fit exactly (62%)
  - Another 10 fit if slight variations are allowed
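To make the alphabet concrete, here is a minimal Python sketch of how a motif over Σ could be checked against a DNA string. The names `CODES`, `matches` and `instances`, and the example motif `ACGNNNRT`, are illustrative assumptions and do not come from the paper.

```python
from itertools import product

# Symbols of the slide's alphabet mapped to the bases they allow.
CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG",    # purine
    "Y": "CT",    # pyrimidine
    "S": "CG",    # strong
    "W": "AT",    # weak
    "N": "ACGT",  # spacer: any base
}

def matches(motif, site):
    """True if the DNA string `site` is an instance of `motif`."""
    return len(motif) == len(site) and all(
        base in CODES[sym] for sym, base in zip(motif, site)
    )

def instances(motif):
    """All concrete DNA strings that the motif stands for."""
    return ["".join(p) for p in product(*(CODES[sym] for sym in motif))]

# Hypothetical motif with a 3 bp spacer.
print(matches("ACGNNNRT", "ACGTTTAT"))
print(instances("AR"))
```

Each N multiplies the number of instances by four, so listing instances explicitly is only sensible for motifs with few degenerate symbols.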

8 Measure of Statistical Significance
- Given a set of coregulated S. cerevisiae genes, the input is the corresponding set of 800 bp upstream sequences, each with its 3' end at the gene's translation start site
- The model must measure, from the input sequences:
  - The absolute number of occurrences N_s of motif s
  - The background genomic distribution
- X is a set of random DNA sequences, equal in number and lengths to the input sequences
  - Generated by a Markov chain of order m
  - Transition probabilities are determined by (m+1)-mer frequencies in the full complement of 6000+ upstream sequences (each 800 bp long)
  - The background model uses m = 3
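The background model can be sketched as follows: transition probabilities of an order-m Markov chain are estimated from (m+1)-mer counts. This is a simplified illustration, not the authors' code. The function name `fit_markov` is an assumption, and the sketch uses raw relative frequencies with no smoothing and ignores the reverse strand.

```python
from collections import Counter, defaultdict

def fit_markov(sequences, m=3):
    """Estimate P(next base | previous m bases) from (m+1)-mer counts."""
    counts = defaultdict(Counter)
    for seq in sequences:
        # Every window of length m+1 contributes one (context, next base) pair.
        for i in range(len(seq) - m):
            counts[seq[i:i + m]][seq[i + m]] += 1
    # Normalize the counts for each context into probabilities.
    return {
        ctx: {base: n / sum(nxt.values()) for base, n in nxt.items()}
        for ctx, nxt in counts.items()
    }

# Toy input; the slide's setting would use the 6000+ upstream sequences with m = 3.
model = fit_markov(["ACGACGACT"], m=2)
print(model["AC"])
```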

9 z-score
- X_s: random variable, the number of occurrences of motif s in X
- E(X_s): expectation; σ(X_s): standard deviation
- z_s: number of standard deviations by which the observed value N_s exceeds its expectation
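Written as a formula, the verbal definition of z_s above corresponds to the following. The slide's own equation did not survive in the transcript, so this is reconstructed from the wording only.

```latex
z_s = \frac{N_s - E(X_s)}{\sigma(X_s)}
```

As a made-up numerical illustration, N_s = 12 with E(X_s) = 4 and σ(X_s) = 2 would give z_s = 4.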

10 Implications
- Possibility of overlap of a motif with itself (in either orientation)
- Previous study of pattern autocorrelation
- Generalized computation of SD, treating the motif as a finite set of strings
- Higher-order Markov chains
- Spacers handled at no extra computational cost
- Handles motifs in either orientation

11 Algorithm
- Enumerate over each input sequence
- Tabulate the number N_s of occurrences of each motif s in either orientation
- Compute the expectation and SD for each motif s such that N_s > 0
- Calculate the z-score
- Rank motifs by z-score
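The steps above can be sketched in Python for the simplest case of exact k-mers, with no degenerate symbols or spacers. The function names are assumptions, and `expectation` and `sd` are caller-supplied placeholders: the paper's actual expectation and variance computations are not reproduced here.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(word):
    """Reverse complement of a DNA string."""
    return word.translate(COMPLEMENT)[::-1]

def count_kmers(sequences, k):
    """Tabulate k-mer counts, merging each word with its reverse complement."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            word = seq[i:i + k]
            # Canonical key: the lexicographically smaller of the two strands.
            counts[min(word, revcomp(word))] += 1
    return counts

def rank_by_z(counts, expectation, sd):
    """Sort motifs by z-score, highest first; motifs with sd == 0 are skipped."""
    scored = []
    for s, n in counts.items():
        if sd(s) > 0:
            scored.append((s, (n - expectation(s)) / sd(s)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

counts = count_kmers(["ACGT"], 2)
# Constant expectation and SD stand in for the real background model.
print(rank_by_z(counts, lambda s: 1.0, lambda s: 0.5))
```

Only motifs that actually occur are ever scored, which matches the slide's restriction to motifs with N_s > 0.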

12 Algorithm Analysis
- For a single motif, complexity is O(c^2 k^2)
  - k: number of nonspacer characters in the motif
  - c: number of instantiations of R, Y, S, W in the motif
- Only modest values of k
- Linear dependence on genome size
- The variance calculation can be trimmed to optimize

13 Number of Occurrences
- Convert motif s into a multiset W
- Add the reverse complement of each string in W
- Motif s occurs at a position in X iff some string in W occurs at the same position
- X_s: number of occurrences (in X) of members of W
- Handling palindromes
- W_i: a member of W
- |W| = T
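A rough Python sketch of this conversion follows. It deliberately uses a plain set instead of the slide's multiset, so a word equal to its own reverse complement is counted once per position; how the paper itself handles palindromes is not shown in this transcript. The names `motif_words` and `count_occurrences` are assumptions.

```python
from itertools import product

CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "N": "ACGT",
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def motif_words(motif):
    """Instantiations of the motif together with their reverse complements."""
    forward = {"".join(p) for p in product(*(CODES[c] for c in motif))}
    reverse = {w.translate(COMPLEMENT)[::-1] for w in forward}
    return forward | reverse

def count_occurrences(motif, seq):
    """Number of positions in seq where some word of W occurs."""
    words = motif_words(motif)
    k = len(motif)
    return sum(seq[i:i + k] in words for i in range(len(seq) - k + 1))

print(sorted(motif_words("AR")))
print(count_occurrences("AR", "AACTT"))
```

Scanning the forward strand against W is enough to cover both orientations, because W already contains the reverse complements.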

14 Number of Occurrences (cont'd)

15 Expectation  Linearity of Expectation
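In the notation of slide 13, linearity of expectation can be written as below. Here X_i, the number of occurrences in X of the member W_i of W, is notation introduced for this sketch; the slide's own equations are not in the transcript.

```latex
E(X_s) = E\!\left(\sum_{i=1}^{T} X_i\right) = \sum_{i=1}^{T} E(X_i)
```

The identity holds even when occurrences of different W_i overlap and are therefore dependent. The variance has no such term-by-term decomposition, since it also involves the covariances between occurrences.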

16 Variance  B term  C term

17 C Term  A term

18 A Term

19 Overlapping Concatenation
- CW (like W) is potentially a multiset
- One-to-one correspondence

20 C Term Simplification

21 A Term Revisited

22 S i1 S i2 Term & Approximation  Kleffe and Borodovsky (1992) Approximation

23 B Term

24 B Term (cont'd)

25 Summary

26 Higher Order Markov Models  Variance calculations remain the same except for S i1 S i2 term  Experimental m = 3

27 Experimental Results & Future Considerations
- 17 coregulated sets of genes
  - Known TF with known binding site consensus
  - In 9 experiments, the known consensus was one of the 3 highest-scoring motifs
- Future topics:
  - Non-centered spacers
  - Enumeration loop optimization
  - Filtering repeats

28 Question
- E(X_s) is more straightforward to calculate than σ(X_s). Under the assumptions given in the paper, name one reason for this complication.
