Mining Long Sequential Patterns in a Noisy Environment Jiong Yang, Wei Wang, Philip S. Yu, Jiawei Han SIGMOD 2002.


1 Mining Long Sequential Patterns in a Noisy Environment Jiong Yang, Wei Wang, Philip S. Yu, Jiawei Han SIGMOD 2002

2 Outline: Introduction, Model, Algorithm, Evaluation, Conclusion

3 Introduction
Pattern discovery in long sequences has many applications.
The common metric used to qualify a significant pattern is support.
Noise usually exists.
– A symbol may be misrepresented by another symbol.
– An occurrence of a pattern then cannot be recognized.
E.g., when the sequence $d_1 d_3 d_4 d_5$ is misrepresented as $d_1 d_2 d_4 d_5$, the pattern $d_1 d_3$ cannot be found.

4 Introduction
– The observed support of a pattern may be lower than its real support.
– Some frequent patterns cannot be discovered because of the noise.
Failing to find a frequent pattern is more critical when the pattern is long.
– Long patterns are more vulnerable to distortion.
If the database contains both noise and long patterns, support is not a suitable measure of significant patterns.

5 Introduction
– E.g., in gene sequence analysis with amino acids as the granularity, the length of a gene expression is usually a few thousand. Noise is common: amino acids mutate with a non-negligible probability.
Compatibility matrix
– A matrix whose elements give the probability that an observed value corresponds to a given true underlying value.
– Each observed symbol is interpreted as an occurrence of a set of symbols with various probabilities.

6 Introduction
An example of the compatibility matrix (rows: true value; columns: observed value):

true \ observed   d1     d2     d3     d4     d5
d1                0.9    0.1    0      0      0
d2                0.05   0.8    0.05   0.1    0
d3                0.05   0      0.7    0.15   0.1
d4                0      0.1    0.1    0.75   0.05
d5                0      0      0.15   0      0.85

For an observed $d_1$: $\mathrm{Prob}(d_1 \mid d_1) = 0.9$, $\mathrm{Prob}(d_2 \mid d_1) = 0.05$, $\mathrm{Prob}(d_3 \mid d_1) = 0.05$, $\mathrm{Prob}(d_4 \mid d_1) = 0$, $\mathrm{Prob}(d_5 \mid d_1) = 0$.
Based on the compatibility matrix, a new measure, called match, is proposed to qualify important patterns.

7 Model
$I = \{d_1, d_2, \ldots, d_m\}$. A sequence (pattern) of length $n$ is an ordered list of $n$ symbols in $I$.
– E.g., $d_1 d_2 d_1$ is a sequence (pattern) of length 3.
Given a sequence $S = s_1 s_2 \ldots s_{l_s}$, a pattern $P = d_1 d_2 \ldots d_{l_p}$ is a subsequence (subpattern) of $S$
– if there exists a list of integers $1 \le i_1 < i_2 < \ldots < i_{l_p} \le l_s$ such that $d_j = s_{i_j}$ for $1 \le j \le l_p$. $S$ is also called a supersequence (superpattern) of $P$.
– E.g., $d_1 d_4 d_5$ is a subpattern of $d_1 d_3 d_4 d_5$.

8 Model
Given $I = \{d_1, d_2, \ldots, d_m\}$, the compatibility matrix is an $m \times m$ matrix.
– Its element $C(d_i, d_j) = \mathrm{Prob}(\text{true\_value} = d_i \mid \text{observed\_value} = d_j)$, where $1 \le i, j \le m$.
– The compatibility matrix is assumed to be provided by an expert in the area.
Given a pattern $P = d_1 d_2 \ldots d_l$ and a sequence $s = d'_1 d'_2 \ldots d'_l$ of the same length, the match of $P$ in $s$, denoted by $M(P,s)$,
– is defined as the conditional probability that $s$ corresponds to an occurrence of $P$.

9 Model
If each observed symbol is generated independently, then $M(P,s) = \mathrm{Prob}(P \mid s) = \prod_{1 \le i \le l} C(d_i, d'_i)$.
– E.g., if $P = d_1 d_2$ and $s = d_1 d_3$, then $M(P,s) = C(d_1,d_1) \times C(d_2,d_3) = 0.9 \times 0.05 = 0.045$.
– Here $P$ is not a subpattern of $s$, but $M(P,s) > 0$.
Given a sequence $S$ of length $l_s$ and a pattern $P$ of length $l_p$ where $l_s \ge l_p$, the match of $P$ in $S$
– is defined as the maximal match of $P$ over every distinct subsequence (of length $l_p$) of $S$,
– i.e. $M(P,S) = \max_{s \text{ is a subpattern of } S} M(P,s)$.
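The two definitions on this slide can be sketched directly in Python. This is only an illustration: it assumes the example compatibility matrix from slide 6, and the names `match_aligned` and `match_brute_force` are hypothetical, not taken from the paper. The brute-force version enumerates every length-$l_p$ subsequence, so it is practical only for short sequences.

```python
from itertools import combinations
from math import prod

SYMBOLS = ["d1", "d2", "d3", "d4", "d5"]
# Example matrix from slide 6: rows are true values, columns are observed values.
ROWS = [
    [0.90, 0.10, 0.00, 0.00, 0.00],
    [0.05, 0.80, 0.05, 0.10, 0.00],
    [0.05, 0.00, 0.70, 0.15, 0.10],
    [0.00, 0.10, 0.10, 0.75, 0.05],
    [0.00, 0.00, 0.15, 0.00, 0.85],
]
# C[true][observed] = Prob(true value | observed value)
C = {t: dict(zip(SYMBOLS, row)) for t, row in zip(SYMBOLS, ROWS)}


def match_aligned(pattern, segment):
    """M(P, s) for equal-length P and s, assuming independent symbols."""
    if len(pattern) != len(segment):
        raise ValueError("pattern and segment must have the same length")
    return prod(C[p][o] for p, o in zip(pattern, segment))


def match_brute_force(pattern, sequence):
    """M(P, S): best aligned match over all length-l_p subsequences of S."""
    candidates = combinations(sequence, len(pattern))
    return max((match_aligned(pattern, s) for s in candidates), default=0.0)


# Slide example: P = d1 d2, s = d1 d3
print(match_aligned(["d1", "d2"], ["d1", "d3"]))
# Noisy sequence from slide 3: pattern d1 d3 still has a non-zero match.
print(match_brute_force(["d1", "d3"], ["d1", "d2", "d4", "d5"]))
```

In the second call the best alignment is $d_1 d_4$, giving $0.9 \times 0.15 = 0.135$, even though $d_1 d_3$ no longer occurs literally in the noisy sequence.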

10 Model
– A sequence $S$ has as many as $\binom{l_s}{l_p}$ distinct subsequences of length $l_p$.
– Dynamic programming computes $M(P,S)$ in $O(l_p \times l_s)$ time.
– An optimization brings this to nearly $O(l_s)$ time.
Given a pattern $P$ and a database $D$ of $N$ sequences, the match of $P$ in $D$ is defined as $M(P,D) = \sum_{S \in D} M(P,S) / N$.
A minimum match threshold $\mathit{min\_match}$ is specified by the user. All patterns that meet the $\mathit{min\_match}$ threshold are called frequent patterns.
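One way to reach the $O(l_p \times l_s)$ bound is a prefix-by-prefix dynamic program. The sketch below is an assumption about how such a program could look, not the paper's own procedure, and it does not include the further optimization to nearly $O(l_s)$. It reuses the example matrix from slide 6; `match_in_sequence` and `match_in_database` are hypothetical names.

```python
SYMBOLS = ["d1", "d2", "d3", "d4", "d5"]
ROWS = [
    [0.90, 0.10, 0.00, 0.00, 0.00],
    [0.05, 0.80, 0.05, 0.10, 0.00],
    [0.05, 0.00, 0.70, 0.15, 0.10],
    [0.00, 0.10, 0.10, 0.75, 0.05],
    [0.00, 0.00, 0.15, 0.00, 0.85],
]
# C[true][observed], example values from slide 6
C = {t: dict(zip(SYMBOLS, row)) for t, row in zip(SYMBOLS, ROWS)}


def match_in_sequence(pattern, sequence):
    """M(P, S) in O(l_p * l_s) time.

    best[j] holds the maximal match of the first j pattern symbols
    against some subsequence of the part of S scanned so far.
    """
    best = [1.0] + [0.0] * len(pattern)
    for observed in sequence:
        # Go right to left so each observed symbol is used at most once.
        for j in range(len(pattern), 0, -1):
            extended = best[j - 1] * C[pattern[j - 1]][observed]
            if extended > best[j]:
                best[j] = extended
    return best[-1]


def match_in_database(pattern, database):
    """M(P, D): average of M(P, S) over all sequences S in D."""
    return sum(match_in_sequence(pattern, s) for s in database) / len(database)


database = [["d1", "d2", "d4", "d5"], ["d1", "d3"]]
print(match_in_database(["d1", "d3"], database))
```

A pattern longer than the sequence gets match 0, because `best[-1]` is never raised above its initial value.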

11 Model
The match model can accommodate misrepresentation of symbols due to noise.
The Apriori property also holds for the match metric.
– The match of a pattern $P$ in a database $D$ $\le$ the match of any subpattern of $P$ in $D$.
In a noise-free environment, the match model reduces to the support model.
– Let the compatibility matrix be an identity matrix ($C(d_i, d_j) = 1$ if $i = j$ and $0$ otherwise).
– The match of a pattern is then equal to its support.
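The reduction to support can be checked on a toy database. This is a minimal sketch under stated assumptions: sequences are plain Python strings, the compatibility function is passed in as a callable, and all function names are hypothetical.

```python
def match_in_sequence(pattern, sequence, compat):
    """M(P, S) via dynamic programming; compat(true, observed) -> probability."""
    best = [1.0] + [0.0] * len(pattern)
    for observed in sequence:
        for j in range(len(pattern), 0, -1):
            best[j] = max(best[j], best[j - 1] * compat(pattern[j - 1], observed))
    return best[-1]


def identity(true_symbol, observed_symbol):
    """Noise-free compatibility: an observed symbol is always the true one."""
    return 1.0 if true_symbol == observed_symbol else 0.0


def is_subsequence(pattern, sequence):
    """True if pattern occurs in sequence in order, gaps allowed."""
    remaining = iter(sequence)
    return all(symbol in remaining for symbol in pattern)


def support(pattern, database):
    """Fraction of sequences that contain the pattern as a subsequence."""
    return sum(is_subsequence(pattern, s) for s in database) / len(database)


def match(pattern, database, compat):
    """M(P, D): average match over the database."""
    return sum(match_in_sequence(pattern, s, compat) for s in database) / len(database)


database = ["abcd", "acd", "bd"]
for pattern in ["a", "ad", "bd", "abd"]:
    # With the identity matrix the two metrics coincide.
    print(pattern, support(pattern, database), match(pattern, database, identity))
```

With the identity matrix each per-sequence match is either 0 or 1, so the average over the database is exactly the fraction of sequences containing the pattern.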

12 Algorithm
Problems to tackle
– The match metric yields a large number of frequent patterns.
– Frequent patterns can be long.
Techniques used: sampling, Chernoff bound estimation, and border collapsing.
Three phases:
– Phase 1: finding the match of individual symbols and sampling
– Phase 2: ambiguous pattern discovery on samples
– Phase 3: border collapsing

13 Algorithm
Phase 1: finding the match of each symbol and sampling
– For a sequence $D_i$ in the database, the match of a symbol $d$ in $D_i$ is $M(d, D_i) = \max_{d_j \in D_i} C(d, d_j)$.
– E.g., if $D_i = d_2 d_3$, then
  $M(d_1, D_i) = \max\{0.1, 0\} = 0.1$
  $M(d_2, D_i) = \max\{0.8, 0.05\} = 0.8$
  $M(d_3, D_i) = \max\{0, 0.7\} = 0.7$
  $M(d_4, D_i) = \max\{0.1, 0.1\} = 0.1$
  $M(d_5, D_i) = \max\{0, 0.15\} = 0.15$
– Draw a sample of the whole database and store it in memory.

true \ observed   d1     d2     d3     d4     d5
d1                0.9    0.1    0      0      0
d2                0.05   0.8    0.05   0.1    0
d3                0.05   0      0.7    0.15   0.1
d4                0      0.1    0.1    0.75   0.05
d5                0      0      0.15   0      0.85
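Phase 1 can be sketched as a single pass that accumulates per-symbol matches and draws a sample. The code below assumes the example matrix above, an in-memory list as the database, and `random.sample` as a stand-in for whatever sampling scheme the authors used; `symbol_matches` and `phase_one` are hypothetical names.

```python
import random

SYMBOLS = ["d1", "d2", "d3", "d4", "d5"]
ROWS = [
    [0.90, 0.10, 0.00, 0.00, 0.00],
    [0.05, 0.80, 0.05, 0.10, 0.00],
    [0.05, 0.00, 0.70, 0.15, 0.10],
    [0.00, 0.10, 0.10, 0.75, 0.05],
    [0.00, 0.00, 0.15, 0.00, 0.85],
]
# C[true][observed], example values from the slide
C = {t: dict(zip(SYMBOLS, row)) for t, row in zip(SYMBOLS, ROWS)}


def symbol_matches(sequence):
    """M(d, D_i) for every symbol d: best compatibility with any observed symbol.

    Assumes the sequence is non-empty.
    """
    return {d: max(C[d][observed] for observed in sequence) for d in SYMBOLS}


def phase_one(database, sample_size, seed=0):
    """Return (match of each symbol in the database, random sample of sequences)."""
    totals = {d: 0.0 for d in SYMBOLS}
    for sequence in database:
        for d, m in symbol_matches(sequence).items():
            totals[d] += m
    symbol_match = {d: total / len(database) for d, total in totals.items()}
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    sample = rng.sample(database, min(sample_size, len(database)))
    return symbol_match, sample


print(symbol_matches(["d2", "d3"]))
```

The per-symbol matches are what Phase 2 later uses as the spread $R$ of a pattern.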

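14 Algorithm
Phase 2: ambiguous pattern discovery on the sample dataset
Chernoff bound estimation
– If $n$ is the size of the sample and $\mu$ is the match of a pattern $P = d_1 d_2 \ldots d_l$ in the sample, then $P$ is
  frequent in the whole database with probability $1-\delta$ if $\mu > \mathit{min\_match} + \varepsilon$,
  infrequent in the whole database with probability $1-\delta$ if $\mu < \mathit{min\_match} - \varepsilon$,
  ambiguous if $\mu \in (\mathit{min\_match} - \varepsilon, \mathit{min\_match} + \varepsilon)$,
  where $\varepsilon$ is the Chernoff error bound for sample size $n$ and confidence $1-\delta$, and $R$ is the spread of $\mu$, $R = \min_{1 \le i \le l} \mathit{match}[d_i]$.
– $\delta$ can be selected by the user, e.g. $\delta = 0.001$, $1-\delta = 99.9\%$.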
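The labelling rule can be sketched as follows. The expression for the error bound did not survive in this transcript; the code assumes the commonly used additive Chernoff (Hoeffding) form $\varepsilon = \sqrt{R^2 \ln(1/\delta) / (2n)}$, which should be checked against the paper. The function names and the example numbers are hypothetical.

```python
import math


def chernoff_epsilon(spread, sample_size, delta):
    """Assumed bound: epsilon = sqrt(R^2 * ln(1/delta) / (2n))."""
    return math.sqrt(spread ** 2 * math.log(1.0 / delta) / (2.0 * sample_size))


def classify(sample_match, min_match, epsilon):
    """Label a pattern from its match in the sample.

    Values exactly on a boundary are treated as ambiguous (a conservative
    choice made for this sketch).
    """
    if sample_match > min_match + epsilon:
        return "frequent"
    if sample_match < min_match - epsilon:
        return "infrequent"
    return "ambiguous"


# Illustrative numbers only: R = 0.5, n = 10,000 sampled sequences, delta = 0.001
eps = chernoff_epsilon(spread=0.5, sample_size=10_000, delta=0.001)
for mu in (0.12, 0.105, 0.08):
    print(mu, classify(mu, min_match=0.1, epsilon=eps))
```

Because $R$ is the smallest symbol match in the pattern, patterns containing a rare symbol get a tighter $\varepsilon$ and are less likely to be ambiguous.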

15 Algorithm Phase 2: ambiguous pattern discovery on the sample dataset –Use an existing algorithm to mine the sample. –For a pattern discovered in the sample, label it as frequent, ambiguous, or infrequent according to Chernoff bound estimation. –Find the border (denoted by FQT) between frequent and ambiguous patterns, and the border (denoted by INFQT) between the ambiguous and infrequent patterns.

16 Algorithm
Phase 2: ambiguous pattern discovery on the sample dataset
FQT = {p | p is frequent $\wedge$ the immediate superpatterns of p are either ambiguous or infrequent}
INFQT = {p | p is ambiguous $\wedge$ the superpatterns of p are all infrequent}

17 Algorithm
Phase 3: Border Collapsing
– Scan the database to count the matches of ambiguous patterns to see whether they are frequent or infrequent.
[Figure: input (infrequent patterns, INFQT, ambiguous patterns, FQT, frequent patterns) -> processing -> output (infrequent patterns, border, frequent patterns)]

18 Algorithm
Phase 3: Border Collapsing
– If memory can hold the counters for all ambiguous patterns, one database scan is enough.
– Sometimes there is a huge number of ambiguous patterns, and the database has to be scanned several times.
  Select a set of ambiguous patterns until memory is filled by their counters, scan the database to get their matches, and collapse the border.
  Repeat the select-scan-collapse procedure until the two borders become one.
  Try to minimize the number of I/O passes needed.
  The ambiguous patterns with high border collapsing power are selected.

19 Algorithm
Phase 3: Border Collapsing
– How to select patterns? Like binary search.

20 Algorithm
Phase 3: Border Collapsing

21 Algorithm
Phase 3: Border Collapsing
– If there are $x$ levels of ambiguous patterns, a level-wise method needs to scan the database $O(x)$ times, while the border collapsing method needs only $O(\log x)$ scans.
– After a collapsing step, the labels (frequent or infrequent) of some previously ambiguous patterns are known, but their matches remain unknown.
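The $O(\log x)$ claim can be illustrated on a deliberately simplified case: a single chain of $x$ nested ambiguous patterns, one per level, where the Apriori property guarantees that once a level is infrequent all longer levels are too. The sketch below is not the paper's selection procedure (which works on a whole lattice and fills memory with many counters per scan); `collapse_chain` and `is_frequent` are hypothetical, with each call to `is_frequent` standing for one database scan.

```python
def collapse_chain(num_levels, is_frequent):
    """Find how many levels of a nested chain are frequent.

    Levels are numbered 1..num_levels, shortest pattern first.
    Returns (number of frequent levels, number of probes used).
    """
    low, high = 0, num_levels  # levels 1..low are frequent, levels above high are not
    probes = 0
    while low < high:
        middle = (low + high + 1) // 2
        probes += 1  # one database scan in this simplified model
        if is_frequent(middle):
            low = middle       # every shorter level is frequent too (Apriori)
        else:
            high = middle - 1  # every longer level is infrequent too
    return low, probes


def level_wise(num_levels, is_frequent):
    """Baseline: probe level by level until the first infrequent one."""
    probes = 0
    for level in range(1, num_levels + 1):
        probes += 1
        if not is_frequent(level):
            return level - 1, probes
    return num_levels, probes


true_border = 13  # hypothetical: levels 1..13 are frequent
print(collapse_chain(16, lambda level: level <= true_border))
print(level_wise(16, lambda level: level <= true_border))
```

Probing the middle level resolves half of the remaining chain whichever way the answer goes, which is why middle patterns have the highest collapsing power in this simplified setting.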

22 Evaluation
Database
– Standard database
  A protein database consisting of 600K sequences of amino acids.
  The average length of a sequence is around 500.
  20 distinct symbols.
– Test databases are generated from the standard database with random noise.
  $\alpha$ controls the degree of noise.
  A symbol $d$ in the standard database remains the same in the test database with probability $1-\alpha$, and changes to each of the other 19 symbols with probability $\alpha/19$.
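The noise model is simple enough to sketch. The code assumes a generic alphabet and Python's `random` module; `add_noise` is a hypothetical name, and the 20-letter alphabet below is only a placeholder for the amino-acid symbols.

```python
import random


def add_noise(sequence, alphabet, alpha, rng):
    """Keep each symbol with probability 1 - alpha; otherwise replace it
    with one of the other symbols, chosen uniformly (alpha / (m - 1) each)."""
    noisy = []
    for symbol in sequence:
        if rng.random() < alpha:
            others = [a for a in alphabet if a != symbol]
            noisy.append(rng.choice(others))
        else:
            noisy.append(symbol)
    return noisy


alphabet = list("ACDEFGHIKLMNPQRSTVWY")  # placeholder 20-symbol alphabet
rng = random.Random(42)                  # fixed seed so the sketch is repeatable
clean = [rng.choice(alphabet) for _ in range(500)]
noisy = add_noise(clean, alphabet, alpha=0.1, rng=rng)
changed = sum(a != b for a, b in zip(clean, noisy))
print(f"{changed} of {len(clean)} symbols changed")
```

Under this model the matching compatibility matrix would have a dominant diagonal and uniform off-diagonal entries, though the slides do not state which matrix was used in the experiments.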

23 Evaluation
Robustness of Match Model
– Mine the standard database
  $R_M$ = {frequent patterns found by the match model}
  $R_S$ = {frequent patterns found by the support model}
  $R_M = R_S$
– Mine the test database: $R'_M$, $R'_S$
Accuracy: $|R'_M \cap R_M| / |R'_M|$, $|R'_S \cap R_S| / |R'_S|$
Completeness: $|R'_M \cap R_M| / |R_M|$, $|R'_S \cap R_S| / |R_S|$
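Both measures are plain set ratios, as the following sketch shows. The pattern sets are made up for illustration and the function names are hypothetical.

```python
def accuracy(found_on_test, found_on_standard):
    """Share of patterns found on the noisy data that are truly frequent."""
    if not found_on_test:
        return 0.0
    return len(found_on_test & found_on_standard) / len(found_on_test)


def completeness(found_on_test, found_on_standard):
    """Share of truly frequent patterns that are still found on the noisy data."""
    if not found_on_standard:
        return 0.0
    return len(found_on_test & found_on_standard) / len(found_on_standard)


# Made-up pattern sets, written as strings of symbols
standard = {"ab", "abc", "bd", "cde"}  # R: mined from the standard database
test = {"abc", "bd", "xy"}             # R': mined from the noisy test database
print(accuracy(test, standard), completeness(test, standard))
```

Accuracy plays the role of precision and completeness the role of recall, with the result on the noise-free database serving as ground truth.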

24 Evaluation
Robustness of Match Model: different noise degrees
Match model: accuracy and completeness are more than 95%.
Support model: vulnerable to the noise.

25 Evaluation
Robustness of Match Model: different pattern lengths
Match model: unaffected by the pattern length.
Support model: degrades as the pattern length grows.

26 Evaluation
Robustness of Match Model: when there is some error in the compatibility matrix
When the error is 10%, the match model can still achieve 88% accuracy and 85% completeness.

27 Evaluation
Sample size
– Patterns whose match falls in the range $(\mathit{min\_match} - \varepsilon, \mathit{min\_match} + \varepsilon)$ are ambiguous.
Larger sample size -> smaller $\varepsilon$ -> fewer ambiguous patterns.

28 Evaluation Spread of Match R –R(P)=minimum match of its involved symbols Longer length, tighter R Higher degree of noise, smaller R

29 Evaluation
Effects of Confidence $1-\delta$
– Previous experiments: $1-\delta = 0.9999$

30 Evaluation Missing Patterns

31 Evaluation
Performance of Border Collapsing Algorithm
– Compared with
  Max-Miner, one of the fastest algorithms for mining long frequent patterns;
  a sampling method that uses level-wise search to finalize the border.
– Experiment results
  CPU time vs. $\mathit{min\_match}$
  Number of database scans vs. $\mathit{min\_match}$
  Number of database scans vs. length of the longest patterns

32 Evaluation Performance of Border Collapsing Algorithm

33 Evaluation
Scalability w.r.t. the number of distinct symbols $m$
– Synthetic database: 100K sequences, average length of 1000
– A larger $m$ leads to fewer frequent patterns.
– A larger $m$ leads to a larger ($m \times m$) compatibility matrix.

34 Evaluation
Scalability w.r.t. the number of distinct symbols

35 Conclusion
In a noisy environment, the symbols observed may differ from the real ones.
A compatibility matrix provides a probabilistic connection from the observation to the underlying true value.
A new metric, match, is proposed to measure significant patterns.
Experimental results show that
– The match model is robust w.r.t. noise.
– The border collapsing algorithm is very efficient for finding long patterns.

36 End ?

