Direct Mining of Discriminative and Essential Frequent Patterns via Model-based Search Tree
Wei Fan, Kun Zhang, Hong Cheng, Jing Gao, Xifeng Yan, Jiawei Han, Philip S. Yu, Olivier Verscheure

Motivation:
Some data do not come as pre-defined feature vectors that can be used to construct predictive models.
Applications:
  Transactional database
  Sequence database
  Graph database
Frequent patterns are good candidates for discriminative features, especially for data with complicated structures.

Why Frequent Patterns?
  A non-linear conjunctive combination of single features
  Increases the expressive and discriminative power of the feature space

Examples: Exclusive OR problem & Solution
[Table: X, Y, C] [Plot: axes x, y; lines L1, L2]
Data is non-linearly separable in (x, y).
mine & transform: map data to higher space
[Table: X, Y, XY, C]
Data is linearly separable in (x, y, xy).

Conventional Frequent Pattern-Based Classification: Two-Step Batch Method
1. Mine frequent patterns;
2. Select most discriminative patterns;
3. Represent data in the feature space using such patterns;
4. Build classification models.
Basic Flows:
[Figure: DataSet -> (mine) -> Frequent Patterns -> (select) -> Mined Discriminative Patterns (F1, F2, F4, ………) -> (represent) -> Data -> any classifiers you can name (ANN, DT, SVM, LR); example decision tree splitting on Petal.Length < 2.45 and Petal.Width < 1.75 over setosa, versicolor, virginica]

Problems of Separated Mine & Select in Batch Method
1. Mine step: Issues of scalability and combinatorial explosion
   Dilemma of setting minsupport
   Promising discriminative candidate patterns?
   Tremendous number of candidate patterns?
2. Select step: Issue of discriminative power

Itemset Mining
5 Datasets: UCI Machine Learning Repository
Scalability Study:
[Table: Datasets; #Pat using MbT sup; Ratio (MbT #Pat / #Pat using MbT sup). Rows: Adult, Chess (+; ~0%), Hypo, Sick, Sonar]
Accuracy of Mined Itemsets

Graph Mining
11 Datasets:
  9 NCI anti-cancer screen datasets, PubChem Project. Positive class: 1% - 8.3%
  2 AIDS anti-viral screen datasets, URL: H1: 3.5%, H2: 1%
Scalability Study
Predictive Quality of Mined Frequent Subgraphs
  AUC of MbT, DT
  MbT vs. Benchmarks
Case Study

Motivation, Problems, Proposed Algorithm, Experiments

[Figure: dataset -> Divide-and-Conquer Based Frequent Pattern Mining (mine & select; Few Data; ……) -> Mined Discriminative Patterns (F1, F2, F4, ………) -> (represent) -> Data -> any classifiers you can name (ANN, DT, SVM, LR); example decision tree splitting on Petal.Length < 2.45 and Petal.Width < 1.75 over setosa, versicolor, virginica]
1. Mine and select most discriminative patterns;
2. Represent data in the feature space using such patterns;
3. Build classification models.

Direct Mining & Selection via Model-based Search Tree
Procedures: as Feature Miner, or be itself as Classifier

Analyses:
1. Scalability of pattern enumeration
   Upper bound
   Scale down ratio
2. Bound on number of returned features
3. Subspace pattern selection
4. Non-overfitting
5. Optimality under exhaustive search

Take Home Message:
1. Highly compact and discriminative frequent patterns can be directly mined through Model-based Search Tree without worrying about combinatorial explosion.
2. Software and datasets are available by contacting the authors.
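
The exclusive-OR example from the motivation slides can be checked numerically. The sketch below is illustrative only and is not taken from the talk: the truth table is the standard XOR table, and the helper names (`separates`, `add_pattern`, `grid_search`) and the weight grid are assumptions chosen for the demonstration. It searches a small grid of linear separators, first in (x, y) and then in (x, y, xy).

```python
from itertools import product

# Standard XOR truth table: (x, y) -> class label c
DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def separates(weights, bias, rows):
    """True if (w . features + bias > 0) matches class 1 on every row."""
    for features, label in rows:
        score = sum(w * f for w, f in zip(weights, features)) + bias
        if (score > 0) != (label == 1):
            return False
    return True

def add_pattern(rows):
    """Append the conjunctive pattern feature xy to each (x, y) row."""
    return [((x, y, x * y), label) for (x, y), label in rows]

def grid_search(rows, dim, grid):
    """Return the first (weights, bias) on the grid that separates rows, else None."""
    for *weights, bias in product(grid, repeat=dim + 1):
        if separates(weights, bias, rows):
            return tuple(weights), bias
    return None

# Hypothetical search grid: -3.0, -2.5, ..., 3.0
GRID = [v / 2 for v in range(-6, 7)]

no_pattern = grid_search(DATA, 2, GRID)
with_pattern = grid_search(add_pattern(DATA), 3, GRID)
print("separator in (x, y):", no_pattern)
print("separator in (x, y, xy):", with_pattern)
```

The grid search only illustrates the point. That no separator exists in (x, y) follows from the four XOR constraints contradicting each other, whereas in (x, y, xy) the function x + y - 2xy - 0.5 is positive exactly on the class-1 rows.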
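
Steps 1 to 3 of the conventional two-step batch method can be sketched on a toy transactional dataset. This is a minimal illustration under stated assumptions, not the authors' implementation: the transactions are made up, brute-force enumeration stands in for a real frequent-pattern miner, and information gain is assumed as the measure of discriminative power. Step 4 (building a classifier on the resulting vectors) is left out.

```python
from itertools import combinations
from math import log2

# Hypothetical toy transactions: (set of items, class label)
DATA = [
    ({"a", "b"}, 1), ({"a", "b", "c"}, 1), ({"a", "b", "d"}, 1),
    ({"a", "c"}, 0), ({"b", "c"}, 0), ({"c", "d"}, 0),
]

def entropy(labels):
    """Binary entropy of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return -sum(q * log2(q) for q in (p, 1 - p) if q > 0)

def mine(rows, min_support, max_len=2):
    """Step 1: brute-force enumeration of itemsets with support >= min_support."""
    items = sorted(set().union(*(t for t, _ in rows)))
    patterns = []
    for k in range(1, max_len + 1):
        for combo in combinations(items, k):
            support = sum(set(combo) <= t for t, _ in rows) / len(rows)
            if support >= min_support:
                patterns.append(frozenset(combo))
    return patterns

def info_gain(pattern, rows):
    """Information gain of splitting rows on presence of the pattern (assumed measure)."""
    labels = [c for _, c in rows]
    hit = [c for t, c in rows if pattern <= t]
    miss = [c for t, c in rows if not pattern <= t]
    n = len(rows)
    return entropy(labels) - len(hit) / n * entropy(hit) - len(miss) / n * entropy(miss)

def select(patterns, rows, top_k):
    """Step 2: keep the top_k patterns by information gain."""
    return sorted(patterns, key=lambda p: info_gain(p, rows), reverse=True)[:top_k]

def represent(rows, patterns):
    """Step 3: one binary feature per selected pattern."""
    return [[int(p <= t) for p in patterns] for t, _ in rows]

frequent = mine(DATA, min_support=0.3)
best = select(frequent, DATA, top_k=2)
features = represent(DATA, best)
print(len(frequent), "frequent patterns; top pattern:", sorted(best[0]))
```

In this sketch the mine step enumerates every candidate before any selection happens, so lowering `min_support` or raising `max_len` grows the candidate set quickly. That is the scalability issue the "Problems of Separated Mine & Select" slide raises.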