
1 Applying Pruning Techniques to Single- Class Emerging Substring Mining Speaker: Sarah Chan Supervisor: Dr. B. C. M. Kao M.Phil. Probation Talk CSIS DB Seminar Aug 30, 2002

2 Presentation Outline
- Introduction
- The single-class ES mining problem
- Data structure: merged suffix tree
- Algorithms: baseline, s-pruning, g-pruning, l-pruning
- Performance evaluation
- Conclusions

3 Introduction
- Emerging Substrings (ESs)
  - A new type of KDD pattern
  - Substrings whose supports (or frequencies) increase significantly from one class to another (measured by a growth rate)
  - Motivation: Emerging Patterns (EPs) by Dong and Li
- Jumping Emerging Substrings (JESs), a specialization of ESs
  - Substrings which can be found in one class but not in the others

4 Introduction
- Emerging Substrings (ESs)
  - Usefulness
    - Capture sharp contrasts between datasets, or trends over time
    - Provide knowledge for building sequence classifiers
  - Applications (virtually endless): language identification, purchase behavior analysis, financial data analysis, bioinformatics, melody track selection, web-log mining, content-based e-mail processing systems, ...

5 Introduction
- Mining ESs: brute-force approach
  - Enumerate all possible substrings in the database, find their support counts in each class, and check growth rates
  - But a huge sequence database contains millions of sequences (GenBank had 15 million sequences in 2001), and
  - The number of substrings in a sequence grows quadratically with sequence length (a typical human genome has 3 billion characters)
  - Too many candidates
  - Expensive in terms of time (O(|D|^2 n^3)) and memory
  - Other shortcomings: repeated substrings, common substrings, ... (please refer to [seminar020201])
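As a rough illustration of the candidate explosion (a sketch of mine, not code from the talk): even a single sequence of length n already yields n(n+1)/2 candidate substrings before any support counting is done.

```python
def all_substrings(seq):
    # every contiguous substring of one sequence: n(n+1)/2 candidates
    # (a set keeps only the distinct ones)
    return {seq[i:j] for i in range(len(seq)) for j in range(i + 1, len(seq) + 1)}

# a length-4 sequence already produces 10 candidates: a, b, c, d, ab, bc, cd, abc, bcd, abcd
candidates = all_substrings("abcd")
```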

6 Introduction
- Mining ESs: an Apriori-like approach
  - Find frequent substrings and check growth rates
  - E.g. if both abcd and bcde are frequent in D, generate candidate abcde
  - Still requires many database scans
  - A candidate may not be contained in any sequence in D
  - The Apriori property does not hold for ESs: abcde can be an ES even if neither abcd nor bcde is
- We need algorithms which are more efficient and which allow us to filter out ES candidates
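A tiny sketch of why the Apriori property fails for ESs, using made-up two-class datasets (not from the talk): abcde has infinite growth rate even though neither abcd nor bcde clears a growth rate threshold of, say, 3.

```python
def supp(D, s):
    # fraction of sequences in dataset D that contain substring s
    return sum(s in seq for seq in D) / len(D)

def growth_rate(D1, D2, s):
    # supp_D2(s) / supp_D1(s), with 0/0 -> 0 and x/0 -> infinity (x > 0)
    s1, s2 = supp(D1, s), supp(D2, s)
    if s1 == 0:
        return float('inf') if s2 > 0 else 0.0
    return s2 / s1

D1 = ["abcdx", "xbcde"]   # hypothetical opponent class
D2 = ["abcde"]            # hypothetical target class

# abcd and bcde each occur in one D1 sequence, so their growth rate is only 2,
# yet abcde never occurs in D1: it is a (jumping) ES with infinite growth rate.
```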

7 Introduction
- Mining ESs: our approach, a suffix tree-based framework
  - A compact way of storing all substrings, with support counters maintained
  - Deals with suffixes (not substrings) of sequences
  - Does not consider substrings that do not exist in the database
  - Time complexity: O(lg(|Σ|) |D| n^2)
  - Techniques for pruning ES candidates can be easily applied

8 Basic Definitions
- Sequence: an ordered list of symbols over an alphabet Σ
- Class: in a sequence database, each sequence σi has a class label Ci ∈ C, where C is the set of all class labels. A sequence that does not belong to class Ck belongs to the opponent class Ck'.
- Dataset: if database D is associated with m class labels, we can partition D into m datasets, such that all sequences in dataset Di have class label Ci. D = Dk ∪ Dk'.

9 Basic Definitions
- Count and support of string s in dataset D
  - countD(s) = no. of sequences in D that contain s
  - suppD(s) = countD(s) / |D|
- Growth rate of string s from D1 to D2
  - growthRateD1→D2(s) = suppD2(s) / suppD1(s)
  - growth rate = 0 if suppD1(s) = suppD2(s) = 0
  - growth rate = ∞ if suppD1(s) = 0 and suppD2(s) > 0
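These definitions translate directly into code; a minimal sketch (the datasets in the assertions are the four-sequence example classes from a later slide, as read from this transcript):

```python
def count(D, s):
    # no. of sequences in dataset D that contain s as a contiguous substring
    return sum(1 for seq in D if s in seq)

def supp(D, s):
    return count(D, s) / len(D)

def growth_rate(D1, D2, s):
    # growthRate_D1->D2(s) = supp_D2(s) / supp_D1(s),
    # with the slide's conventions: 0/0 -> 0 and x/0 -> infinity (x > 0)
    s1, s2 = supp(D1, s), supp(D2, s)
    if s1 == 0:
        return float('inf') if s2 > 0 else 0.0
    return s2 / s1
```

For example, with D1 = ["abcd", "bd", "a", "c"] and D2 = ["abd", "bc", "cd", "b"], growth_rate(D1, D2, "b") gives (3/4) / (2/4) = 1.5, the value computed on the example slide.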

10 ES and JES
- Emerging Substring (ES): given ρs and ρg, a string s is an ES from Dk' to Dk (or s is an ES of Ck) if both of these hold:
  - support condition: suppDk(s) ≥ ρs
  - growth rate condition: growthRateDk'→Dk(s) ≥ ρg
- Jumping Emerging Substring (JES): an ES with infinite growth rate
  - JES of Ck: suppDk'(s) = 0 and suppDk(s) > 0
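The two conditions can be checked directly; a self-contained sketch (the function names are mine, not from the talk):

```python
def supp(D, s):
    # fraction of sequences in dataset D containing substring s
    return sum(s in seq for seq in D) / len(D)

def is_es(Dk, Dk_opp, s, rho_s, rho_g):
    # s is an ES of class Ck iff supp_Dk(s) >= rho_s (support condition)
    # and growthRate_Dk'->Dk(s) >= rho_g (growth rate condition)
    sk, so = supp(Dk, s), supp(Dk_opp, s)
    if sk < rho_s:
        return False
    growth = float('inf') if so == 0 and sk > 0 else (0.0 if so == 0 else sk / so)
    return growth >= rho_g

def is_jes(Dk, Dk_opp, s, rho_s):
    # a JES is an ES with infinite growth rate:
    # zero support in the opponent class, nonzero (and frequent) in the target class
    return supp(Dk, s) >= rho_s and supp(Dk, s) > 0 and supp(Dk_opp, s) == 0
```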

11 ES and JES
- Example (with ρg = 1.5):
    Class C1: abcd, bd, a, c
    Class C2: abd, bc, cd, b
  - ESs from D2 to D1: a, abc, bcd, abcd
  - ESs from D1 to D2: b, abd

12 ES and JES
- Example (with ρg = 1.5):
    Class C1: abcd, bd, a, c
    Class C2: abd, bc, cd, b
  - ESs from D2 to D1: a, abc, bcd, abcd
  - ESs from D1 to D2: b, abd
  - growthRateD1→D2(b) = (3/4) / (2/4) = 1.5

13 ES and JES
- Example (with ρg = 1.5):
    Class C1: abcd, bd, a, c
    Class C2: abd, bc, cd, b
  - ESs from D2 to D1: a, abc, bcd, abcd
  - ESs from D1 to D2: b, abd
  - JESs are underlined on the slide; of the ESs listed, abc, bcd and abcd (zero support in D2) and abd (zero support in D1) are the jumping ones

14 The ES Mining Problem
- The ES mining problem: given a database D, the set C of all class labels, a support threshold ρs and a growth rate threshold ρg, discover the set of all ESs for each class Cj ∈ C
- The single-class ES mining problem: a target class Ck is specified, and our goal is to discover the set of all ESs of Ck
  - Ck': the opponent class

15 Merged Suffix Tree
- Suffix tree: represents all the substrings of a length-n sequence in O(n) space
- Merged suffix tree
  - Represents all the substrings of all sequences in a dataset Dk in O(|Dk| n) space
  - Each node has a support counter for each dataset
  - Each node is associated with one substring and related to one or more substrings
  - Each edge is denoted by an index range [i_start, i_end)
    - E.g. if σ = abcd, then σ[1, 3) = ab
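The index ranges on the slide are 1-based and end-exclusive; a small sketch of how an edge label [i_start, i_end) maps back to a substring (Python slicing is 0-based, hence the -1 shifts):

```python
def edge_label(sigma, i_start, i_end):
    # edge denoted [i_start, i_end) with 1-based positions, as on the slide:
    # for sigma = "abcd", the range [1, 3) names the substring "ab"
    return sigma[i_start - 1:i_end - 1]
```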

16 Merged Suffix Tree
- Example (tree diagram; at each node, (c1, c2) = (count in Ck, count in Ck'))
    Class Ck: abcd, bd, a, c
    Class Ck': abd, bc, cd, b

17 Merged Suffix Tree
- Example (tree diagram; (c1, c2) = (count in Ck, count in Ck'))
    Class Ck: abcd, bd, a, c
    Class Ck': abd, bc, cd, b
  - countDk(a) = 2, countDk'(a) = 1

18 Merged Suffix Tree
- Example (tree diagram)
    Class Ck: abcd, bd, a, c
    Class Ck': abd, bc, cd, b
  - Node Y is associated with abcd (the concatenation of its edge labels) and related to abc and abcd (all share Y's counters)
  - An implicit node Z is associated with abc

19 Algorithms
- The baseline algorithm: consists of 3 phases
- Three pruning techniques:
  - Support threshold pruning (s-pruning algorithm)
  - Growth rate threshold pruning (g-pruning algorithm)
  - Length threshold pruning (l-pruning algorithm)

20 Baseline Algorithm
1. Construction Phase (C-Phase)
  - A merged tree MT is built from all the sequences of the target class Ck: each suffix sj of each sequence is matched against the substrings in the tree
    - Update the c1 counter for substrings contained in sj (but a sequence should not contribute twice to the same counter)
    - Explicitize implicit nodes when necessary
    - When a mismatch occurs, add a new edge and a new leaf to represent the unmatched part of sj

21 Baseline Algorithm
1. Construction Phase (C-Phase)
  - Example (tree diagram for class Ck): update of the c1 counter, explicitization of an implicit node, and update of edges

22 Baseline Algorithm
1. Construction Phase (C-Phase)
  - Example (tree diagram for class Ck): addition of a new edge and leaf node when a mismatch occurs

23 Baseline Algorithm
2. Update Phase (U-Phase)
  - MT is updated with all the sequences of the opponent class Ck'
    - Only update the c2 counter for substrings already present in the tree; do not introduce any substring that is present only in Dk'
    - Only internal nodes will be added (no new leaf nodes)
  - Resultant tree: MT'

24 Baseline Algorithm
3. eXtraction Phase (X-Phase)
  - All ESs of Ck are extracted by a pre-order traversal of MT'
    - At each node X, we check the values of its counters against ρs and ρg to determine whether its related substrings satisfy both the support and growth rate conditions
    - If the related substrings of a node X cannot fulfill the support condition, the whole subtree rooted at X can be skipped
- Baseline algorithm: C-U-X phases
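The three phases can be sketched compactly on an uncompressed suffix trie rather than a true merged suffix tree (so no implicit nodes, index-range edges, or explicitization; all names are mine): the C-Phase inserts every suffix of every Ck sequence, the U-Phase only walks existing paths to update c2, and the X-Phase extracts ESs by a pre-order traversal that skips subtrees failing the support condition.

```python
class Node:
    def __init__(self):
        self.children = {}   # symbol -> Node
        self.c = [0, 0]      # (c1, c2): sequence counts in Ck and Ck'
        self.seen = [-1, -1] # last sequence id counted per class (no double counting)

def add_sequence(root, seq, cls, seq_id, grow):
    # walk/insert every suffix; each node reached stands for one substring
    for i in range(len(seq)):
        node = root
        for ch in seq[i:]:
            if ch not in node.children:
                if not grow:              # U-Phase: never add substrings absent from Dk
                    break
                node.children[ch] = Node()
            node = node.children[ch]
            if node.seen[cls] != seq_id:  # a sequence contributes once per counter
                node.seen[cls] = seq_id
                node.c[cls] += 1

def mine_es(Dk, Dk_opp, rho_s, rho_g):
    root = Node()
    for i, seq in enumerate(Dk):          # C-Phase: build the tree from the target class
        add_sequence(root, seq, 0, i, grow=True)
    for i, seq in enumerate(Dk_opp):      # U-Phase: update c2 along existing paths only
        add_sequence(root, seq, 1, i, grow=False)
    result, stack = set(), [(root, "")]   # X-Phase: pre-order traversal
    while stack:
        node, prefix = stack.pop()
        for ch, child in node.children.items():
            s = prefix + ch
            supp_k = child.c[0] / len(Dk)
            supp_o = child.c[1] / len(Dk_opp)
            if supp_k < rho_s:
                continue                  # support condition fails: skip whole subtree
            growth = float('inf') if supp_o == 0 else supp_k / supp_o
            if growth >= rho_g:
                result.add(s)
            stack.append((child, s))
    return result
```

A real implementation uses a compressed merged suffix tree so that space stays O(|Dk| n); this trie keeps the logic visible at the cost of one node per substring symbol.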

25 s-Pruning Algorithm
- Observations
  - The c2 counter of each substring σ in MT would be updated in the U-Phase if σ is contained in some sequence in Dk'
  - If σ is infrequent with respect to Dk, it is not qualified to be an ES of Ck, and all its descendant nodes will not even be visited in the X-Phase
- Pruning idea: prune infrequent substrings from MT after the C-Phase

26 s-Pruning Algorithm
- ρs-Pruning Phase (Ps-Phase)
  - Using ρs, all substrings that are infrequent in Dk are pruned by a pre-order traversal of MT
  - Resultant tree: MTs (input to the U-Phase)
- s-pruning algorithm: C-Ps-U-X phases
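On a simple trie, the Ps-Phase amounts to cutting off every subtree whose target-class count is already below the support threshold (a sketch; the dict-based node layout is mine, not the talk's):

```python
def s_prune(node, min_count):
    # node = {'c1': sequence count in Dk, 'children': {symbol: node}}
    # A child whose c1 is below the threshold can be dropped outright:
    # all of its descendants have c1 no larger, so the whole subtree goes.
    node['children'] = {ch: kid for ch, kid in node['children'].items()
                        if kid['c1'] >= min_count}
    for kid in node['children'].values():
        s_prune(kid, min_count)
```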

27 g-Pruning Algorithm
- Observations
  - As sequences in Dk' are added to MT, the c2 counters of some nodes grow
    - The support of these nodes' related substrings in Dk' is monotonically increasing
    - So the ratio of their support in Dk to that in Dk' is monotonically decreasing
  - At some point this ratio may fall below ρg; when that happens, these substrings have lost their candidature for being ESs of Ck

28 g-Pruning Algorithm
- Pruning idea: prune substrings from MT as soon as they are found to fail the growth rate requirement
- ρg-Update Phase (Ug-Phase)
  - When the support count of a substring in Dk' increases, check whether it still satisfies the growth rate condition; if not, prune the substring by path compression or node deletion
  - Supported by the [i_start, i_q, i_end) representation of edges
- g-pruning algorithm: C-Ug-X phases
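The monotonicity argument behind the Ug-Phase check can be sketched as follows (names mine): after the C-Phase a node's count in Dk is fixed, so as its Dk' count grows during the U-Phase, the best growth rate it can ever achieve only shrinks.

```python
def lost_candidature(c1, c2, n_k, n_opp, rho_g):
    # c1: count in Dk (fixed after the C-Phase)
    # c2: current count in Dk' (only grows during the U-Phase)
    # Once supp_Dk / supp_Dk' drops below rho_g, the substring can never
    # become an ES of Ck again, so its node may be pruned immediately.
    if c2 == 0:
        return False                  # growth rate is still infinite
    return (c1 / n_k) / (c2 / n_opp) < rho_g
```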

29 l-Pruning Algorithm
- Observations
  - Longer substrings often have lower support than shorter ones, so they are less likely to fulfill the support condition for ESs
  - It is not desirable to append these longer substrings to the tree in the C-Phase only to prune them in the Ps-Phase (in the s-pruning algorithm)
- Pruning idea: limit the length of substrings added to MT in the tree construction phase

30 l-Pruning Algorithm
- ρl-Construction Phase (Cl-Phase)
  - Only match min(|sj|, ρl) symbols of each suffix against the tree (ignore the remainder), so a smaller MT is built
  - Unlike the previous two pruning approaches, this may result in ES loss
- l-pruning algorithm: Cl-U-X phases
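The Cl-Phase change is essentially a one-line truncation of each suffix before it is matched against the tree (a sketch of mine):

```python
def truncated_suffixes(seq, rho_l):
    # only the first min(|s_j|, rho_l) symbols of each suffix s_j are kept;
    # the remainder is never inserted, so the tree stays smaller
    return [seq[i:i + rho_l] for i in range(len(seq))]
```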

31 Summary of Phases
- Baseline: C-U-X
- s-pruning: C-Ps-U-X (earlier use of ρs)
- g-pruning: C-Ug-X (earlier use of ρg)
- l-pruning: Cl-U-X (addition of ρl)
- The pruning techniques can also be used in combination (e.g. sg-pruning)

32 Performance Evaluation
- Dataset: CI3 (music features in MIDI tracks)

    Class      | No. of sequences | Avg./max. sequence length | No. of distinct symbols
    melody     | 843 (11%)        | 331.0 / 1085              | 29
    non-melody | 6742 (89%)       | 274.9 / 2891              | 61

- Goal: extract ESs from the target class melody (opponent class: non-melody)
- Assumptions: all sequences are pre-stored in memory (appended to a vector, with the starting and ending positions of each sequence recorded)

33 Number of ESs Mined

    ρs    | Min. no. of occurrences | Non-jumping ESs (ρg = 2) | Non-jumping ESs (ρg = 5) | JESs (ρg = ∞)
    0.25% | 3                       | 522,033                  | 38,945                   | 1,222
    0.50% | 5                       | 17,619                   | 7,264                    | 89
    1.00% | 9                       | 8,195                    | 2,535                    | 1
    2.00% | 17                      | 3,799                    | 819                      | 0

34 Take a look at the tree size
- When ρs = 0.50%, ρg = 2:

    Algorithm  | |MT|    | |MTs|           | |MT'|
    baseline   | 416,151 | -               | 542,094
    s-pruning  | 416,151 | 22,582 (-94.6%) | 22,961 (-95.8%)
    g-pruning  | 416,151 | -               | 510,764 (-5.8%)
    sg-pruning | 416,151 | 22,582 (-94.6%) | 18,413 (-96.6%)

35 Baseline Algorithm [C-U-X]
- Performance: the same for all ρs and ρg
- Time: about 35 s

36 s-Pruning Algorithm [C-Ps-U-X]
- Faster than the baseline algorithm by 25-45%
- But the reduction in time is smaller than the reduction in tree size
- Performance: improves as ρs increases; the same for all ρg

37 g-Pruning Algorithm [C-Ug-X]
- When ρg = ∞, faster than the baseline algorithm by 2-5%
- When ρg = 2 or 5, slower than the baseline algorithm by 1-4%
- Performance: improves as ρg increases; the same for all ρs

38 sg-Pruning Algorithm [C-Ps-Ug-X]
- Faster than the baseline, s-pruning and g-pruning algorithms in all cases
- Faster than the baseline algorithm by 31-54% (ρg = 2 or 5) and 47-81% (ρg = ∞)
- Performance: improves as ρs and ρg increase

39 Target Class: Melody (ρg = 2)
- Performance of the algorithms (fastest first): sg-pruning > s-pruning > baseline > g-pruning

40 What If the Target Class Is Non-Melody? (ρg = 2)
- Performance of the algorithms (fastest first): s-pruning > sg-pruning > baseline > g-pruning

41 What If the Target Class Is Non-Melody?
- sg-pruning performs worse than s-pruning
  - Due to the overhead in node creation (g-pruning requires one more index for each edge)
- Not much performance gain with s-pruning (just 3-5%) or sg-pruning (1-3%)
  - Bottleneck: formation of MT (over 93% of the time is spent in the C-Phase)
  - In fact, these pruning techniques are very effective, since much time is saved in the U-Phase: 42-80% (for s-pruning) and 54-85% (for sg-pruning)

42 l-Pruning Algorithm - % Loss of ESs
- Except when ρs = 0.25%, non-jumping ESs are lost only when ρl < 20 (15 in the case of JESs)
- (Charts plot % loss of ESs against ρl for various (ρs, ρg) settings; avg. seq. length = 331, max. seq. length = 1085)

43 l-Pruning Algorithm - % Time Saved
- The time saved becomes noticeable when ρl < 100
- For ρs ≥ 0.50%, over 30% of the time can be saved without any ES loss
- (Charts plot % time saved against ρl for various (ρs, ρg) settings; avg. seq. length = 331, max. seq. length = 1085)

44 To Be Explored...
- ls-pruning
- lg-pruning
- lsg-pruning

45 Conclusions
- ESs of a class are substrings which occur more frequently in that class than in other classes.
- ESs are useful features, as they capture distinguishing characteristics of data classes.
- We have proposed a suffix tree-based framework for mining ESs.

46 Conclusions
- Three basic techniques for pruning ES candidates have been described, and most of them have been shown to be effective.
- Future work: to study whether pruning techniques can be efficiently applied to suffix tree merging algorithms or to other ES mining models.

47 Applying Pruning Techniques to Single- Class Emerging Substring Mining - The End -

