
1 Sampling Approaches to Pattern Extraction
(Lecture for CS397-CXZ Algorithms in Bioinformatics)
April 16, 2004
ChengXiang Zhai
Department of Computer Science, University of Illinois, Urbana-Champaign

2 Pattern Extraction: Probabilistic vs. Combinatorial
Problem: Find common patterns (motifs) in sequences S = {s1, …, sN}
Probabilistic
  Motif M = probabilistic model p(S|M)
  M matches every sequence, but with different probabilities
  E.g., M = {p(x|i)}, i = 1, …, w (width), where p(x|i) = probability that symbol x occurs in position i
  Task: Find the best M and the matching positions in each si
  Best = p(S|M) is the highest
Combinatorial
  Motif M = deterministic pattern
  M either matches a sequence or does not
  E.g., M = AT..G
  Task: Find the best M's
  Best = highly frequent
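To make the combinatorial side concrete, here is a minimal sketch that measures how frequent the deterministic pattern AT..G is in a set of sequences. The sequences and the helper name `support` are hypothetical, and "frequency" is assumed to mean the fraction of sequences containing at least one match.

```python
import re

# Hypothetical toy sequences, not taken from the lecture.
sequences = ["CCATCGGA", "ATTAGTTT", "GGGGCCCC", "TTATGCCA"]

# The deterministic pattern AT..G: "." matches any single symbol.
pattern = re.compile(r"AT..G")

def support(pattern, seqs):
    """Fraction of sequences that contain at least one match of the pattern."""
    matched = [s for s in seqs if pattern.search(s)]
    return len(matched) / len(seqs)

frequency = support(pattern, sequences)
```

A combinatorial method would compare this frequency across many candidate patterns, whereas a probabilistic motif assigns every sequence a nonzero or zero probability instead of a yes/no match.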

3 Probabilistic Pattern Extraction
Motif M = probabilistic model of sequences, p(Seq|M)
M matches every sequence, but with different probabilities
E.g., M = {p(x|i)}, i = 1, …, w (width), where p(x|i) = probability that symbol x occurs in position i
Task = Find the best M and the matching positions in each si; "best" = p(S|M) is the highest
Method = Search for the best model
Sampling is an efficient way of searching

4 Position Weighted Matrix (PWM)
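[Figure: chain of positions 1 through w, with transition probabilities 1.0]
Essentially a simple linear HMM
Parameters: qij = p(symbol j | position i)
E.g., q1A = q1G = 0.5; q1C = q1T = 0
  q9C = 1.0; q9A = q9G = q9T = 0
Covers a deterministic pattern such as AT.G as a special case, with a Q matrix over the columns A, T, C, G [matrix entries not recovered in this transcript]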
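As an illustration of the special case, the sketch below encodes AT.G as a PWM and scores two segments. The matrix values are an assumption (the slide's own Q matrix is not available here): fixed positions put probability 1.0 on one symbol, and the wildcard position is assumed to be uniform. The function name `segment_probability` is hypothetical.

```python
# Assumed PWM for the deterministic pattern AT.G (width 4).
# Each entry is one position i; Q[i][j] plays the role of qij.
Q = [
    {"A": 1.0, "T": 0.0, "C": 0.0, "G": 0.0},      # position 1: always A
    {"A": 0.0, "T": 1.0, "C": 0.0, "G": 0.0},      # position 2: always T
    {"A": 0.25, "T": 0.25, "C": 0.25, "G": 0.25},  # position 3: wildcard (assumed uniform)
    {"A": 0.0, "T": 0.0, "C": 0.0, "G": 1.0},      # position 4: always G
]

def segment_probability(segment, pwm):
    """Probability of a segment under the PWM: the product of qij over positions."""
    if len(segment) != len(pwm):
        raise ValueError("segment length must equal the PWM width")
    prob = 1.0
    for i, symbol in enumerate(segment):
        prob *= pwm[i][symbol]
    return prob

p_match = segment_probability("ATCG", Q)  # consistent with AT.G
p_miss = segment_probability("AACG", Q)   # violates the T at position 2
```

A segment that violates a fixed position gets probability zero, which is how the PWM reproduces the match/no-match behavior of the deterministic pattern.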

5 Discovering a PWM from Sequences
Given
  a set of sequences S = {s1, …, sN}
  a pattern width w (e.g., 10)
Discover the most discriminative PWM M, i.e., the M that maximizes p(S|M)/p(S|Background)
  p(S|M) = p(s1|M) … p(sN|M) (roughly!)
  A prior could be incorporated by maximizing the posterior probability of M
How to discover it?
  Using an HMM training algorithm? (Not all observations are relevant)
  Gibbs sampler
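One way to write the objective out, as a sketch under stated assumptions rather than the lecture's own derivation: suppose each sequence sk contains exactly one motif occurrence starting at position ak, and every symbol outside that occurrence is generated by the background under both models, so those terms cancel in the ratio. Writing x_{k,i} for the symbol at relative position i of the occurrence in sk, and p_j for the background probability of symbol j, the log of the ratio becomes:

```latex
\log \frac{p(S \mid M)}{p(S \mid \text{Background})}
  \approx \sum_{k=1}^{N} \sum_{i=1}^{w}
    \log \frac{q_{i,\,x_{k,i}}}{p_{x_{k,i}}}
```

Under these assumptions only the w symbols inside each occurrence contribute, which is why the choice of positions matters as much as the choice of M.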

6 Gibbs Sampler: Basic Idea
Introduce an auxiliary variable ak to record the position of the pattern in sequence sk
Randomly choose initial positions ak
Iterate the following two steps
  Predictive update: use the current positions to estimate the model qij
  Sampling: use the current model to improve the position in one sequence (e.g., ak)
    Take one sequence and compute, for each position, the probability ratio p(x|i)/p(x|Background)
    Sample a position with the ratios as weights
    In general, we get a high-ratio position, but not always the highest
Observations
  If a position is improved, then the model will be improved
  If a model is improved, then all the positions will also be improved
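The two alternating steps can be sketched as follows. This is a minimal illustration, not the lecture's implementation: the function names, the pseudocount of 1.0, the fixed iteration count, and the toy sequences are all assumptions.

```python
import random

ALPHABET = "ACGT"

def estimate_models(seqs, positions, width, skip, pseudo=1.0):
    """Predictive update: estimate motif model q and background p
    from every sequence except the one at index `skip`."""
    q_counts = [{x: pseudo for x in ALPHABET} for _ in range(width)]
    p_counts = {x: pseudo for x in ALPHABET}
    for k, seq in enumerate(seqs):
        if k == skip:
            continue
        start = positions[k]
        for pos, x in enumerate(seq):
            if start <= pos < start + width:
                q_counts[pos - start][x] += 1  # inside the current segment
            else:
                p_counts[x] += 1               # non-matching region
    q = []
    for row in q_counts:
        total = sum(row.values())
        q.append({x: c / total for x, c in row.items()})
    p_total = sum(p_counts.values())
    p = {x: c / p_total for x, c in p_counts.items()}
    return q, p

def position_ratios(seq, width, q, p):
    """Ratio of motif to background probability for each candidate start."""
    ratios = []
    for start in range(len(seq) - width + 1):
        r = 1.0
        for i in range(width):
            x = seq[start + i]
            r *= q[i][x] / p[x]
        ratios.append(r)
    return ratios

def gibbs_sampler(seqs, width, iterations=200, seed=0):
    """Alternate predictive update and sampling; return the final positions."""
    rng = random.Random(seed)
    positions = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iterations):
        for z in range(len(seqs)):
            q, p = estimate_models(seqs, positions, width, skip=z)
            ratios = position_ratios(seqs[z], width, q, p)
            # Sampling step: draw a start with probability proportional to its ratio.
            positions[z] = rng.choices(range(len(ratios)), weights=ratios, k=1)[0]
    return positions

# Hypothetical toy input: each sequence carries a copy of "TTGACA".
demo_seqs = ["ACGTTGACAGG", "CCTTGACATAC", "GTTGACACCGA", "AAGCTTGACAT"]
demo_positions = gibbs_sampler(demo_seqs, width=6)
```

Because positions are sampled rather than chosen greedily, a run can leave a good configuration again; in practice one would typically track the best-scoring configuration seen, which this sketch omits.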

7 Gibbs Sampler: Details of One Iteration
At every step, take one sequence out (e.g., sequence z) for position improvement
Use the rest to estimate two models, qij and pj (background)
  qij is estimated from the matching segments at the current positions
  pj is estimated from all other regions of these sequences (negative model)
For each position i in the sequence taken out, compute the probability ratio
Normalize the ratios to get probabilities, and choose a position stochastically according to these probabilities
Change the current position for sequence z to the new position obtained
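The normalize-and-sample step can be illustrated in isolation. The ratios below are hypothetical numbers, and `sample_position` is an assumed helper name; the draw uses a simple cumulative-probability scan.

```python
import random

def sample_position(ratios, rng):
    """Normalize ratios into probabilities and draw one position from them."""
    total = sum(ratios)
    probs = [r / total for r in ratios]
    u = rng.random()
    cumulative = 0.0
    for pos, pr in enumerate(probs):
        cumulative += pr
        if u < cumulative:
            return pos, probs
    return len(probs) - 1, probs  # guard against floating-point round-off

# Hypothetical ratios for three candidate positions in sequence z.
ratios = [0.5, 8.0, 1.5]
rng = random.Random(0)
_, probs = sample_position(ratios, rng)

# Repeating the draw shows that the best position is favored but not guaranteed.
draws = [sample_position(ratios, rng)[0] for _ in range(1000)]
share_of_best = draws.count(1) / len(draws)
```

Here the normalized probabilities are 0.05, 0.8, and 0.15, so the highest-ratio position is chosen most of the time while the others keep a small chance, which is what lets the sampler escape poor configurations.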

8 Estimation of qij and pj
qij is estimated from the sequence segments at the current "matching positions" a1, …, aN
pj is estimated from the "non-matching regions" of all the sequences (relative frequency)
In general, smoothing is necessary
[Formula not recovered in this transcript; its annotations were "total counts of symbol j in relative position i" and "pseudocounts"]
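One common form of pseudocount smoothing that fits the description above is sketched here. It is an assumption, not necessarily the formula on the original slide. The symbols b_j (pseudocount for symbol j), B (sum of the pseudocounts), and n_j (count of symbol j in the non-matching regions) are introduced for illustration; c_{i,j} is the total count of symbol j in relative position i over the N - 1 sequences that remain after one is taken out.

```latex
q_{ij} = \frac{c_{i,j} + b_j}{(N - 1) + B},
\qquad B = \sum_{j} b_j,
\qquad
p_j = \frac{n_j}{\sum_{j'} n_{j'}}
```

With all b_j = 0 this reduces to plain relative frequencies, which can assign probability zero to a symbol that simply has not been observed yet at a position; the pseudocounts prevent that.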

9 Example of Estimating qij
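[Figure not recovered in this transcript; labels: Z, M, A, G, C, T]
N = 6, W = 10, without smoothing
q1A = 3/5, q2G = 2/5, …, q1G = 0
(The denominators are 5 because one of the 6 sequences, Z, is taken out.)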
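The arithmetic can be reproduced with a small sketch. The five segments below are hypothetical (the slide's alignment is not available here); they were chosen only so that the unsmoothed estimates match the values quoted above. `estimate_q` is an assumed helper name.

```python
from collections import Counter
from fractions import Fraction

def estimate_q(segments):
    """Unsmoothed qij: relative frequency of symbol j at relative position i."""
    width = len(segments[0])
    n = len(segments)
    q = []
    for i in range(width):
        counts = Counter(seg[i] for seg in segments)
        q.append({x: Fraction(counts[x], n) for x in "ATGC"})
    return q

# Hypothetical width-10 segments from the 5 sequences left after removing Z.
segments = [
    "AGCTTAGCCA",
    "AGGTTAGCCA",
    "ATCTTAGCCA",
    "CTCTAAGCCA",
    "TCCTTAGCTA",
]

q = estimate_q(segments)
q1A = q[0]["A"]  # A appears in 3 of 5 segments at position 1
q2G = q[1]["G"]  # G appears in 2 of 5 segments at position 2
q1G = q[0]["G"]  # G never appears at position 1
```

The zero for q1G is the reason smoothing is needed: without it, any candidate segment starting with G would get a ratio of exactly zero.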

10 Example of Computing the Ratios
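[Figure not recovered in this transcript; labels: Z, M, A, T, G, C]
\[ \text{Ratio} = \frac{q_{3G}\, q_{4G}\, q_{5C}\, q_{6T}\, q_{7A}\, q_{8A}\, q_{9G}\, q_{10C}\, q_{11C}\, q_{12A}}{p_G\, p_G\, p_C\, p_T\, p_A\, p_A\, p_G\, p_C\, p_C\, p_A} \]
Select a set of ak's that maximizes the product of these ratios, or F
\[ F = \sum_{1 \le i \le W} \; \sum_{j \in \{A,T,G,C\}} c_{i,j} \log \frac{q_{ij}}{p_j} \]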
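A minimal sketch of computing F for a chosen set of segments follows. It assumes unsmoothed qij (count divided by the number of segments) and a given background distribution; the function name `objective_f`, the uniform background, and the toy segments are hypothetical. Terms with a zero count are skipped, since they contribute nothing to the sum.

```python
import math
from collections import Counter

def objective_f(segments, background):
    """F = sum over positions i and symbols j of c_ij * log(q_ij / p_j)."""
    n = len(segments)
    width = len(segments[0])
    total = 0.0
    for i in range(width):
        counts = Counter(seg[i] for seg in segments)  # c_ij for this position
        for symbol, c in counts.items():
            q = c / n                                  # unsmoothed q_ij
            total += c * math.log(q / background[symbol])
    return total

uniform = {"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25}

# Perfectly conserved segments versus partially conserved ones.
f_conserved = objective_f(["ATG", "ATG", "ATG", "ATG"], uniform)
f_mixed = objective_f(["ATG", "CTG", "ATA", "GTC"], uniform)
```

The conserved alignment scores higher than the mixed one, which matches the intent of F: it rewards positions whose symbol distribution is both sharp and different from the background.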

