1 Pattern Matching in Weighted Sequences
Oren Kapah, Bar-Ilan University
Joint work with: Amihood Amir, Costas S. Iliopoulos, Ely Porat

2 Weighted Sequences
A weighted sequence T, of length n, over an alphabet Σ is a |Σ|×n matrix which contains the probability of each symbol appearing at each position.

Position:  1     2     3    4    5     6
A          0.3   0.25  0    0.4  0.8   0
C          0     0.25  1    0.2  0.05  0.5
G          0.2   0.5   0    0.2  0.1   0
T          0.5   0     0    0.2  0.05  0.5

Also known as a Position Weight Matrix.
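As a concrete illustration (a sketch added here, not part of the original deck), the table above can be stored as a |Σ|×n NumPy matrix; the symbol order and the helper function below are arbitrary choices:

    import numpy as np

    SIGMA = ["A", "C", "G", "T"]          # row order is an arbitrary choice

    # The |Sigma| x n matrix from the slide; column k holds the distribution of position k+1.
    T = np.array([
        # pos: 1     2    3    4    5     6
        [0.30, 0.25, 0.0, 0.4, 0.80, 0.0],   # A
        [0.00, 0.25, 1.0, 0.2, 0.05, 0.5],   # C
        [0.20, 0.50, 0.0, 0.2, 0.10, 0.0],   # G
        [0.50, 0.00, 0.0, 0.2, 0.05, 0.5],   # T
    ])

    def prob(T, symbol, position):
        """Probability that `symbol` appears at `position` (0-based) of the weighted sequence."""
        return T[SIGMA.index(symbol), position]

    assert np.allclose(T.sum(axis=0), 1.0)   # every column is a probability distribution
    print(prob(T, "C", 2))                   # 1.0 -- position 3 of the table is certainly 'C'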

3 Pattern Matching in Weighted Sequences
Problem Definition: Given a threshold probability ε, find all occurrences of the pattern P (|P| = m) in the weighted sequence T (|T| = n), i.e., all positions i where
  ∏_{j=1..m} Pr(T[i+j-1] = P[j]) ≥ ε.
By applying the logarithm this becomes a sum:
  ∑_{j=1..m} log Pr(T[i+j-1] = P[j]) ≥ log ε.
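A direct, quadratic-time check of this definition (a sketch added for illustration; T and SIGMA are the matrix and symbol order from the snippet above, and eps is the threshold ε):

    import math

    def weighted_occurrences(T, P, SIGMA, eps):
        """Report every position i where the product of Pr(T[i+j] = P[j]) over j is at
        least eps, computed as a sum of logarithms as on the slide."""
        n, m = T.shape[1], len(P)
        log_eps = math.log(eps)
        matches = []
        for i in range(n - m + 1):
            total = 0.0
            for j, symbol in enumerate(P):
                p = T[SIGMA.index(symbol), i + j]
                if p == 0.0:
                    total = float("-inf")   # a zero probability can never reach the threshold
                    break
                total += math.log(p)
            if total >= log_eps:
                matches.append(i)
        return matches

    # print(weighted_occurrences(T, "CA", SIGMA, 0.1))   # [2, 3] for the slide-2 matrix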

4 Naïve Algorithm, Bounded Alphabet Size
For each σ in Σ:
1. Construct a vector P_σ, such that P_σ[i] = 1 if σ occurs at position i in P, and P_σ[i] = 0 otherwise.
2. Calculate the sum of probabilities by convolving the row of σ in T with P_σ.
For each text position, sum the results over all σ.
Time: O(n|Σ| log m)
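A sketch of this per-symbol convolution (my reconstruction in Python/NumPy, not the authors' code). np.convolve below computes the window sums directly; replacing it with an FFT-based convolution over m-sized text chunks is what gives the O(n|Σ| log m) bound quoted on the slide:

    import numpy as np

    NEG = -1e9   # stands in for log(0); any window containing it falls below any realistic threshold

    def naive_convolution_matching(T, P, SIGMA, log_eps):
        """Sum the log-probabilities of every alignment using one convolution per symbol."""
        n, m = T.shape[1], len(P)
        logT = np.where(T > 0, np.log(np.where(T > 0, T, 1.0)), NEG)
        scores = np.zeros(n - m + 1)
        for s_idx, sigma in enumerate(SIGMA):
            # P_sigma[i] = 1 iff sigma occurs at position i of the pattern, 0 otherwise.
            P_sigma = np.array([1.0 if c == sigma else 0.0 for c in P])
            # Convolving with the reversed pattern gives, at offset i, sum_j logT[s_idx, i+j] * P_sigma[j].
            scores += np.convolve(logT[s_idx], P_sigma[::-1], mode="valid")
        return [i for i, v in enumerate(scores) if v >= log_eps]

    # print(naive_convolution_matching(T, "CA", SIGMA, np.log(0.1)))   # [2, 3], as above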

5 Matching in Weighted Sequences, Unbounded Alphabet Size
Input: triplets (C, I, P) = (character, index, probability), given whenever P ≠ 0. s = # of triplets.
Applying the naïve algorithm in this case results in an O(n|Σ| log m) = O(nm log m) algorithm. This is worse than the trivial algorithm.

6 Example
Text T, given as triplets (symbol, position, probability):
  position 0: (a, 0, 0.5), (b, 0, 0.5)
  position 1: (a, 1, 0.2), (b, 1, 0.7), (c, 1, 0.1)
  position 2: (a, 2, 0.4), (c, 2, 0.6)
  position 3: (b, 3, 1.0)
  position 4: (a, 4, 0.1), (c, 4, 0.9)
Pattern P: a b c
After taking log10 of each probability:
  position 0: (a, 0, -0.3), (b, 0, -0.3)
  position 1: (a, 1, -0.7), (b, 1, -0.15), (c, 1, -1.0)
  position 2: (a, 2, -0.4), (c, 2, -0.22)
  position 3: (b, 3, 0.0)
  position 4: (a, 4, -1.0), (c, 4, -0.05)
Result R: -0.67, -∞, -0.45, -∞, -∞
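The result row R can be reproduced from the triplet representation with a small brute-force sketch (added for illustration; the slide rounds each log10 value before summing, hence the small differences):

    import math
    from collections import defaultdict

    # Triplets (symbol, text position, probability); anything not listed has probability 0.
    triplets = [
        ("a", 0, 0.5), ("b", 0, 0.5),
        ("a", 1, 0.2), ("b", 1, 0.7), ("c", 1, 0.1),
        ("a", 2, 0.4), ("c", 2, 0.6),
        ("b", 3, 1.0),
        ("a", 4, 0.1), ("c", 4, 0.9),
    ]
    P = "abc"

    prob_at = defaultdict(float)
    for symbol, pos, p in triplets:
        prob_at[(symbol, pos)] = p

    n = 1 + max(pos for _, pos, _ in triplets)
    R = []
    for i in range(n):                        # the slide scores all n text positions
        total = 0.0
        for j, symbol in enumerate(P):
            p = prob_at[(symbol, i + j)]
            total += math.log10(p) if p > 0 else float("-inf")
        R.append(round(total, 2))

    print(R)   # [-0.68, -inf, -0.44, -inf, -inf] ~ the slide's -0.67, -inf, -0.45, -inf, -inf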

7 Step 1: Subset Matching
Observation 1: A weighted match can only appear at positions where a subset match is found.
Step 1a: Build a new text T_s where each text position holds the set of all letters with non-zero probability.
Step 1b: Mark all positions where a subset match is found.
Time: O(s log² s) (Cole & Hariharan, STOC 2002)

8 Example
T (log10 triplets): pos 0: (a,-0.3), (b,-0.3) | pos 1: (a,-0.7), (b,-0.15), (c,-1.0) | pos 2: (a,-0.4), (c,-0.22) | pos 3: (b,0.0) | pos 4: (a,-1.0), (c,-0.05)
P: a b c
T': {a,b}, {a,b,c}, {a,c}, {b}, {a,c}
P': {a}, {b}, {c}
Subset match positions: 0, 2
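For illustration, a brute-force sketch of Step 1 on this example (triplets, P and n come from the slide-6 snippet above; the O(s log² s) bound comes from Cole & Hariharan's subset-matching algorithm, not from this loop):

    def subset_match_positions(triplets, P, n):
        """Positions i where every pattern letter has non-zero probability at its aligned text position."""
        # Step 1a: for each text position, the set of letters with non-zero probability.
        Ts = [set() for _ in range(n)]
        for symbol, pos, p in triplets:
            if p != 0:
                Ts[pos].add(symbol)
        # Step 1b: mark the positions where the (singleton) pattern sets are contained in the text sets.
        m = len(P)
        return [i for i in range(n - m + 1) if all(P[j] in Ts[i + j] for j in range(m))]

    # print(subset_match_positions(triplets, P, n))   # [0, 2]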

9 Step 2: Main Idea
Linearize the input into raw vectors T' and P' of size O(s), such that:
- T' contains the probabilities.
- P' contains 1's and 0's.
Sum the probabilities using convolution.
The linearization is done by shifting, where each symbol is assigned a different shift. The same shifts are used in both the text and the pattern.

10 Example
Shifts: a → 0, b → 3, c → 1 (a triplet with symbol σ at text or pattern position i is moved to linearized position i + shift(σ)).
T' (linearized positions 0-6): (a,-0.3) | (a,-0.7) | (a,-0.4), (c,-1.0) | (b,-0.3), (c,-0.22) | (a,-1.0), (b,-0.15) | (c,-0.05) | (b,0.0)
P': a _ _ c b _ _

11 Step 2: Linearization
Definitions:
- singleton – a position to which only 1 triplet is assigned.
- multiple – a position to which more than 1 triplet is assigned.
Text: replace every singleton with the probability of its triplet; empty and multiple positions are replaced by 0.
Pattern: replace every singleton with 1; empty and multiple positions are replaced by 0.

12 Example
T' (linearized positions 0-6): (a,-0.3) | (a,-0.7) | (a,-0.4), (c,-1.0) | (b,-0.3), (c,-0.22) | (a,-1.0), (b,-0.15) | (c,-0.05) | (b,0.0)
P': a _ _ c b _ _
T'': -0.3  -0.7  0  0  0  -0.05  0
P'':  1     0    0  1  1   0     0
This allows us to sum the probabilities using convolution.
Question: Are we summing the right values?
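A sketch of the linearization and the convolution step on this example (added for illustration; it reuses the triplets and pattern from the slide-6 snippet and the shifts a→0, b→3, c→1 from slide 10; how suitable shifts are chosen is part of the paper and is simply assumed here):

    import math
    from collections import defaultdict

    import numpy as np

    shifts = {"a": 0, "b": 3, "c": 1}        # taken from slide 10

    def linearize(triplets, P, shifts, size):
        """Build T'' (log10-probabilities at singleton positions) and P'' (1 at singleton
        positions); empty and multiple positions become 0 in both vectors."""
        t_slots, p_slots = defaultdict(list), defaultdict(list)
        for symbol, pos, p in triplets:
            t_slots[pos + shifts[symbol]].append(math.log10(p))
        for j, symbol in enumerate(P):
            p_slots[j + shifts[symbol]].append(1.0)
        T2 = np.array([t_slots[k][0] if len(t_slots[k]) == 1 else 0.0 for k in range(size)])
        P2 = np.array([p_slots[k][0] if len(p_slots[k]) == 1 else 0.0 for k in range(size)])
        return T2, P2

    T2, P2 = linearize(triplets, P, shifts, size=7)
    print(np.round(T2, 2))   # [-0.3 -0.7  0.   0.   0.  -0.05  0. ]
    print(P2)                # [ 1.   0.   0.   1.   1.   0.    0. ]

    # Summing by convolution: score[i] = sum_k T2[i+k] * P2[k]. Only singleton-aligned
    # probabilities are counted here; the rest are recovered by the additional shifting
    # sets discussed on slide 14.
    padded = np.concatenate([T2, np.zeros(len(P2))])
    scores = np.convolve(padded, P2[::-1], mode="valid")
    print(np.round(scores[0], 2), np.round(scores[2], 2))   # -0.3 -0.05 at offsets 0 and 2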

13 Step 2: Correctness
Lemma: At any position where a subset match exists, two aligned singletons must originate from the same letter.
Proof: Assume that there is a subset match at position i of the text, and that T'(i+j) and P'(j) are two aligned singletons.

14 Step 2: Completeness
Problem: We did not sum all the probabilities (multiple positions contribute 0)!
Solution: Use a set of O(log s) such shifting sets.
Problem: Using several shifting sets can cause probabilities to be added more than once!
Solution: Zero the probability of a triplet after the first time it appears as a singleton.
Time: O(s log² s)

15 Caution!!! Do Not Delete the Triplet
T' (linearized positions 0-6): (a,-0.3) | (a,-0.7) | (a,-0.4), (c,-1.0) | (b,-0.3), (c,-0.22) | (a,-1.0), (b,-0.15) | (c,-0.05) | (b,0.0)
P': a _ _ c b _ _
T'': -0.3  -0.7  0  0  0  0  0
P'':  1     0    0  1  1  0  0
Deleting (c,-0.22) would cause (b,-0.3) to appear as a singleton!!!
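A bookkeeping sketch for slides 14-15 (an illustration of the stated rule only, not the paper's full construction; triplets comes from the slide-6 snippet and the second shifting set below is a made-up example): a triplet's probability is zeroed after the first shifting set in which it lands on a singleton position, but the triplet itself is never deleted, so it still makes a position "multiple".

    import math
    from collections import defaultdict

    def text_vectors_over_shift_sets(triplets, shift_families, size):
        """Build T'' once per shifting set, zeroing (not deleting!) triplets after their
        first appearance as a singleton, so no probability is added twice."""
        value = {(sym, pos): math.log10(p) for sym, pos, p in triplets}
        zeroed = set()
        vectors = []
        for shifts in shift_families:
            slots = defaultdict(list)
            for sym, pos, _ in triplets:             # zeroed triplets still occupy their slot
                slots[pos + shifts[sym]].append((sym, pos))
            T2 = [0.0] * size
            for k, bucket in slots.items():
                if len(bucket) == 1:                 # singleton position
                    key = bucket[0]
                    T2[k] = 0.0 if key in zeroed else value[key]
                    zeroed.add(key)                  # zero it for the later shifting sets
                # empty and multiple positions stay 0.0
            vectors.append(T2)
        return vectors

    families = [{"a": 0, "b": 3, "c": 1},            # slide 10's shifts
                {"a": 2, "b": 0, "c": 4}]            # hypothetical second shifting set
    for v in text_vectors_over_shift_sets(triplets, families, size=10):
        print([round(x, 2) for x in v])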

16 Hamming Distance – Text Errors, Bounded Alphabet Size
Problem Definition: Given a threshold probability ε, find for each text position i the minimal number of probabilities which, by changing them to 1, yield
  ∏_{j=1..m} Pr(T[i+j-1] = P[j]) ≥ ε.
In the case of errors in the text a match can always be found (changing all m probabilities to 1 gives probability 1); this does not hold for errors in the pattern.
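A brute-force check of this definition (a sketch added for illustration, reusing T and SIGMA from the slide-2 snippet): since changing a probability to 1 simply removes its log from the sum, it is optimal to discard the smallest logs first.

    import math

    def text_error_distance(T, P, SIGMA, eps):
        """For each alignment, the minimal number of aligned probabilities that must be
        changed to 1 so that the product of the remaining ones reaches eps."""
        n, m = T.shape[1], len(P)
        log_eps = math.log(eps)
        dist = []
        for i in range(n - m + 1):
            logs = sorted(math.log(T[SIGMA.index(P[j]), i + j])
                          if T[SIGMA.index(P[j]), i + j] > 0 else float("-inf")
                          for j in range(m))
            k = 0                                   # discard the k smallest logs (change them to 1)
            while sum(logs[k:]) < log_eps:
                k += 1
            dist.append(k)
        return dist

    # print(text_error_distance(T, "AG", SIGMA, 0.5))   # [1, 2, 2, 2, 1] for the slide-2 matrix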

17 Hamming Distance – Text Errors, Algorithm Outline
1. Sort the probabilities in the weighted sequence.
2. Divide the sorted list of probabilities into blocks of size (n|Σ|)^0.5.
3. Calculate the sum of probabilities for each block.
4. For each text location:
   1. Add blocks until the sum goes below the threshold.
   2. Add probabilities from the last block until the sum goes below the threshold.
Time:

18 Unbounded Alphabet Size, Algorithm 1
1. Divide the list of probabilities into blocks of size s^0.5.
2. For each block, calculate the sum of probabilities (shifting).
3. For each text position and each block:
   If a subset match exists, use the result of the shifting algorithm.
   Else, use brute force.
Time: where k is the number of blocks per text position for which there is no subset match.

19 Unbounded Alphabet Size, Algorithm 2
1. Sort the probabilities in the weighted sequence.
2. Divide the list of probabilities into blocks of size s^(2/3).
3. For each block:
   1. Calculate the sum of the non-frequent letters' probabilities: O(s m^(2/3)).
   2. Calculate the sum of the frequent letters' probabilities: O(s^(1/3) m^(1/3) n log m).
4. Continue as in the previous algorithm.
Time: O(s m^(2/3) + s^(1/3) m^(1/3) n log m)

20 Unbounded Alphabet Size, Combined Algorithm
Start with the first algorithm.
If k is small – complete the first algorithm.
Else – apply the second algorithm.

