1 Motifs for Unknown Sites Vasileios Hatzivassiloglou University of Texas at Dallas.


2 Decomposing KL
Back to our profiles.
If we consider the independent marginal distributions P_j and Q_j for each of the n positions j, it can be shown that
D(P||Q) = Σ_{j=1..n} D(P_j||Q_j)
so relative entropy can be measured by position.
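The decomposition can be checked numerically. The following is a minimal sketch, assuming a two-position profile with made-up probabilities and a uniform background; the function names `kl_bits` and `joint` are illustrative and not taken from the slides. It compares the relative entropy of the joint distributions over whole strings with the sum of the per-position values.

```python
import itertools
import math

def kl_bits(p, q):
    """Relative entropy D(p||q) in bits; p and q map outcomes to probabilities."""
    return sum(p[x] * math.log2(p[x] / q[x]) for x in p if p[x] > 0)

def joint(columns):
    """Joint distribution over whole strings, assuming independent positions."""
    dist = {}
    for letters in itertools.product("ACGT", repeat=len(columns)):
        prob = 1.0
        for column, letter in zip(columns, letters):
            prob *= column[letter]
        dist["".join(letters)] = prob
    return dist

# Hypothetical profile P (one dict per position) and uniform background Q.
P = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.25, "C": 0.25, "G": 0.4, "T": 0.1},
]
Q = [{x: 0.25 for x in "ACGT"} for _ in P]

whole = kl_bits(joint(P), joint(Q))                      # D(P||Q) on full strings
by_position = sum(kl_bits(p, q) for p, q in zip(P, Q))   # sum of D(P_j||Q_j)
print(whole, by_position)
```

The two printed values agree up to floating-point rounding, which is what the additive decomposition predicts for independent positions.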

3 What is an acceptable KL value?
Assume perfect patterns and a uniform background distribution
– p = 1 for exactly one nucleotide at each position
– q = 0.25 for each nucleotide everywhere
Then the relative entropy will be
– log_2 4 = 2 bits per position
– 2n bits overall

4 Comparing with the background
Under the uniform assumption, the motif will appear by chance once every 4^n letters.
We expect that a real motif should not match at too many random places, so we want 4^n ≈ Γ (the length of the genome) for one random match.
Therefore, Γ = 4^n = 2^(2n) = 2^D(P||Q).
So a good value for D(P||Q) is log_2 Γ.
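As a rough illustration of this rule of thumb, the sketch below computes the target relative entropy log_2 Γ and the shortest perfect motif length n with 4^n ≥ Γ. The genome length of 12 million letters is an assumed, illustrative value (roughly yeast-sized), and the function names are hypothetical.

```python
import math

def target_relative_entropy_bits(genome_length):
    """Target D(P||Q) in bits: log2 of the genome length."""
    return math.log2(genome_length)

def min_perfect_motif_length(genome_length):
    """Smallest n with 4**n >= genome_length (perfect motif, uniform background)."""
    n = 0
    while 4 ** n < genome_length:
        n += 1
    return n

genome_length = 12_000_000  # assumed illustrative value, not from the slides
print(target_relative_entropy_bits(genome_length))
print(min_perfect_motif_length(genome_length))
```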

5 Site selection by relative entropy
Consider the more general problem with unknown motif instance positions.
Given k sequences and an integer n, find one substring of length n in each of the k sequences such that, for the induced profile A, the relative entropy D(A||B) is maximum.
This problem is provably NP-complete.
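To make the problem statement concrete, here is a minimal exhaustive-search sketch. It assumes the profile A is built from raw column frequencies (no pseudocounts) and that B is a fixed background distribution; the function names and the toy sequences are illustrative. It tries every combination of one window per sequence, so it is only usable on tiny inputs, which is exactly why approximate methods are needed.

```python
import itertools
import math

def windows(seq, n):
    """All substrings of length n, left to right."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]

def relative_entropy(sites, background):
    """D(A||B) in bits, where A is the raw-frequency profile of the sites."""
    k = len(sites)
    total = 0.0
    for column in zip(*sites):              # one column per motif position
        for letter in set(column):
            a = column.count(letter) / k    # observed frequency in this column
            total += a * math.log2(a / background[letter])
    return total

def best_sites_exhaustive(sequences, n, background):
    """Try every choice of one window per sequence and keep the best."""
    choices = itertools.product(*(windows(s, n) for s in sequences))
    best = max(choices, key=lambda sites: relative_entropy(sites, background))
    return best, relative_entropy(best, background)

uniform = {x: 0.25 for x in "ACGT"}
toy_sequences = ["TTACGT", "ACGTCC", "GACGTA"]   # made-up input sharing ACGT
sites, score = best_sites_exhaustive(toy_sequences, 4, uniform)
print(sites, score)
```

With k sequences of length m there are (m-n+1)^k combinations to score, so the running time grows exponentially in k.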

6 Assumptions
Same length n for all motif instances (no indels)
One instance per sequence (in reality, there may be zero or more than one)
Example: Given all genes involved in digestion in yeast, use as input sequences the 1 kb upstream regions from each such gene
– goal is to find transcription factor sites

7 Relative entropy implications
The number of sequences k
– does not affect relative entropy directly
– the proportion of matching nucleotides does
– k indirectly affects our confidence in the estimate of the profile A
The length n
– affects relative entropy
– because of the additive decomposition by position, and the fact that each per-position term is ≥ 0, increasing n never decreases relative entropy
– normalize by dividing by n

8 Addressing NP-completeness
As we mentioned earlier, we have to look at approximate solutions.
Several broad strategies are available
– greedy algorithms (local optimization)
– statistical simulation (Gibbs sampling)

9 Greedy approach
Choose locally optimal candidate motif instance sets
Augment a starting point in the direction that locally seems the most promising
Keep a limited number of candidate solutions at each iteration

10 Hertz-Stormo (1999) algorithm
Carry around a set S of sets; each member S_ij is a candidate solution
Start with each S_1j being each substring of length n from one of the k input sequences (a singleton set)
At step i, add to each S_(i-1)j each of the substrings of length n from input sequence i (alternatively, from all sequences ≥ i)

11 Hertz-Stormo algorithm
After a new string has been added to S_ij, recalculate the profile A and D(A||B)
Prune S by keeping only the d best-scoring S_ij's (this is the heuristic step)
Repeat until i = k

12 Hertz-Stormo example
Input sequences ACTGA, TAGCG, CTTGC and n = 4
Start with S = {{ACTG}, {CTGA}}
Calculate profile A and D(A||B) for each S_1j and keep the d best sets
Expanding S_11 = {ACTG} produces S_21 = {ACTG, TAGC} and S_22 = {ACTG, AGCG}
Possibly also consider S_23 = {ACTG, CTTG} and S_24 = {ACTG, TTGC}
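The steps described on the preceding slides can be sketched in a few lines. This is a simplified illustration of the greedy scheme as the slides describe it, not the original authors' implementation. It assumes raw column frequencies (no pseudocounts), one instance per sequence, and sequences processed in input order, and it recomputes each profile from scratch rather than updating it incrementally. The names `greedy_motif_search` and `keep_best` are hypothetical.

```python
import math

def windows(seq, n):
    """All substrings of length n, left to right."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]

def relative_entropy(sites, background):
    """D(A||B) in bits, where A is the raw-frequency profile of the sites."""
    k = len(sites)
    total = 0.0
    for column in zip(*sites):
        for letter in set(column):
            a = column.count(letter) / k
            total += a * math.log2(a / background[letter])
    return total

def keep_best(candidates, d, background):
    """Heuristic pruning step: keep only the d best-scoring candidate sets."""
    ranked = sorted(candidates,
                    key=lambda c: relative_entropy(c, background),
                    reverse=True)
    return ranked[:d]

def greedy_motif_search(sequences, n, d, background):
    # Step 1: every window of the first sequence is a singleton candidate set.
    candidates = keep_best([[w] for w in windows(sequences[0], n)], d, background)
    # Steps 2..k: extend each kept set with every window of the next sequence.
    for seq in sequences[1:]:
        extended = [c + [w] for c in candidates for w in windows(seq, n)]
        candidates = keep_best(extended, d, background)
    best = candidates[0]
    return best, relative_entropy(best, background)

uniform = {x: 0.25 for x in "ACGT"}
# The input from the example slide, with an assumed beam width d = 2.
print(greedy_motif_search(["ACTGA", "TAGCG", "CTTGC"], 4, 2, uniform))
```

Note that with raw frequencies every singleton set scores the full 2n bits, so pruning at the first step is arbitrary; this is one reason the order of the sequences matters.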

14 Complexity of Hertz-Stormo
There are k steps (assuming one instance per sequence)
At each step, at most d profiles are extended
Each extension involves m-n+1 = O(m) new strings, where m is the length of the input sequences
Profiles and relative entropy can be updated in O(n) time
Total time is O(knmd)

15 Issues with the heuristic approach
Pruning is crucial to keep the number of candidate sets manageable
The order of the sets influences the incrementally constructed profiles and what is kept for later stages (randomization is an option)
How good is it? Hertz and Stormo tested it on 18 genes with 24 known sites; it found 19 of them, plus 3 overlaps

16 Statistical sampling
A very general method for solving difficult problems with many variables that cannot be solved directly, but where partial solutions can be “guessed” and improved
Commonly known as “Monte Carlo” methods (from the Monaco casino) because one of the pioneers of the technique liked gambling

17 Random walks
A random walk is a special kind of stochastic process where the system moves from state to state according to a probability distribution
In optimization problems, we construct random walks where the system moves a marker (representing the current state) randomly, performing some calculations at each step (including where to go next)

18 Uniform random walks
Assume that the state corresponds to a position in physical space in d dimensions
Each step is of the same length (1), following one of the axes
The drunkard problem: If a drunk performs a random walk in a city, will he get back home?
Answer: Yes, with probability 1

19 Gambler’s ruin
If a gambler wins or loses an individual round with probabilities p and q, each time gaining or losing the same amount, what is the probability of ruin (reaching 0 money)?
We assume an opponent with infinite money
This is a one-dimensional random walk
Answer: If p ≤ q, the probability of ruin is 1. If p > q, the probability of ruin is q/p for a gambler who starts with one unit of money (and (q/p)^i when starting with i units).
So +10% odds give a probability of ruin of 91%, and 2:1 odds a probability of 50%.
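The two figures quoted on this slide can be reproduced with a short calculation. The sketch below assumes that "+10% odds" means p/q = 1.1 and "2:1 odds" means p/q = 2, and that the gambler starts with a single unit of money; the function name `ruin_probability` and its `start` parameter are illustrative.

```python
def ruin_probability(p, start=1):
    """Probability of eventual ruin against an infinitely rich opponent.

    p is the chance of winning one round (q = 1 - p is the chance of losing);
    start is the initial capital in units of the fixed stake.
    """
    q = 1.0 - p
    if p <= q:
        return 1.0
    return (q / p) ** start

# Assumed reading of the odds: p/q = 1.1 and p/q = 2, so p = r / (1 + r).
for ratio in (1.1, 2.0):
    p = ratio / (1.0 + ratio)
    print(ratio, round(ruin_probability(p), 2))
```

A larger starting capital shrinks the ruin probability geometrically when p > q, but never changes the outcome when p ≤ q.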

20 What about a drunk bird?
Answer: No
The difference is that the uniform random walk is recurrent in 1 or 2 dimensions but transient in 3 or more dimensions
Recurrent: the walk returns to its starting state with probability 1; transient: there is a positive probability that it never returns
Note that a uniform random walk in one dimension is the discrete analogue of one-dimensional Brownian motion in physics
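The contrast between the drunk walker and the drunk bird can be explored with a small simulation. This is only a sketch: a finite simulation cannot prove recurrence or transience, and the step limit, trial count, and seed below are arbitrary assumptions. It estimates how often a uniform random walk revisits its starting point within a fixed number of steps in 1, 2, and 3 dimensions.

```python
import random

def returns_to_origin(dims, max_steps, rng):
    """Run one uniform random walk; report whether it revisits the origin."""
    position = [0] * dims
    for _ in range(max_steps):
        axis = rng.randrange(dims)            # pick one axis uniformly
        position[axis] += rng.choice((-1, 1)) # unit step along that axis
        if not any(position):                 # all coordinates are zero again
            return True
    return False

def estimate_return_fraction(dims, max_steps, trials, seed=0):
    """Fraction of simulated walks that return within max_steps."""
    rng = random.Random(seed)
    hits = sum(returns_to_origin(dims, max_steps, rng) for _ in range(trials))
    return hits / trials

fractions = {d: estimate_return_fraction(d, max_steps=500, trials=500)
             for d in (1, 2, 3)}
print(fractions)
```

Raising the step limit pushes the estimates for 1 and 2 dimensions toward 1 (very slowly in 2 dimensions), while the 3-dimensional estimate levels off well below 1.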

