Presentation is loading. Please wait.

Presentation is loading. Please wait.

Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Advisor : Dr. Hsu Presenter : Chien-Shing Chen Author: Tie-Yan.

Similar presentations


Presentation on theme: "Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Advisor : Dr. Hsu Presenter : Chien-Shing Chen Author: Tie-Yan."— Presentation transcript:

1 Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Advisor : Dr. Hsu Presenter : Chien-Shing Chen Author: Tie-Yan Liu, Yiming Yang, Hao Wan, Hua-Jun Zeng, Zheng Chen, and Wei-Ying Ma Support Vector Machines Classification with A Very Large-scale Taxonomy SIGKDD, 2004

2 Intelligent Database Systems Lab N.Y.U.S.T. I. M. Outline Motivation Objective Introduction Dataset characteristic Complexity Analysis Effectiveness Analysis Experimental Settings Conclusions Personal Opinion

3 Intelligent Database Systems Lab N.Y.U.S.T. I. M. Motivation very large-scale classification taxonomies Hundreds of thousands of categories Deep hierarchies Skewed category distribution over documents open question whether the state-of-the-art technologies in text categorization evaluation of SVM in web-page classification over the full taxonomy of the Yahoo! categories

4 Intelligent Database Systems Lab N.Y.U.S.T. I. M. Objective scalability and effectiveness 1. a data analysis on the Yahoo! Taxonomy 2. development of a scalable system for large-scale text categorization 3. theoretical analysis and experimental evaluation of SVMs in hierarchical and non-hierarchical setting for classification 4. threshold tuning algorithms with respect to time complexity and accuracy of SVMs

5 Intelligent Database Systems Lab N.Y.U.S.T. I. M. Introduction TC (Text categorization), SVMs, KNN, NB,… in recent years, the scale of TC problems to become larger and larger Answer this question from the views of scalability and effectiveness

6 Intelligent Database Systems Lab N.Y.U.S.T. I. M. SVM flat SVMs, hierarchical SVMs structure of the taxonomy tree

7 Intelligent Database Systems Lab N.Y.U.S.T. I. M. SVM

8 Intelligent Database Systems Lab N.Y.U.S.T. I. M. SVM

9 Intelligent Database Systems Lab N.Y.U.S.T. I. M. SVM Optimal separating hyperplane between the two classes by max the margin between the classes’ closest points

10 Intelligent Database Systems Lab N.Y.U.S.T. I. M. SVM

11 Intelligent Database Systems Lab N.Y.U.S.T. I. M. SVM Multi-class classification basically, SVMs can only solve binary classification problems fit all binary sub-classifiers one-against-all N two-class (true class and false class) one-against-one N(N-1)/2 classifiers

12 Intelligent Database Systems Lab N.Y.U.S.T. I. M. SVM be a set of n labeled training documents linear discriminant function a corresponding classification function as margin of a weight vector

13 Intelligent Database Systems Lab N.Y.U.S.T. I. M. SVM Optimal separation soft-margin multiclass formulation

14 Intelligent Database Systems Lab N.Y.U.S.T. I. M. SVM

15 Intelligent Database Systems Lab N.Y.U.S.T. I. M. SVM

16 Intelligent Database Systems Lab N.Y.U.S.T. I. M. DATABASE-first characteristic The full domain of the Yahoo! Directory 292,216 categories 792,601 documents

17 Intelligent Database Systems Lab N.Y.U.S.T. I. M. DATABASE-second characteristic Over 76% of the Yahoo! Categories have fewer than 5 labeled documents As “rare categories” increases at deeper hierarchy levels 36% are rare categories at deep levels

18 Intelligent Database Systems Lab N.Y.U.S.T. I. M. DATABASE-third characteristic many documents have multiple labels average has 2.23 labels the largest number of labels for a single document is 31

19 Intelligent Database Systems Lab N.Y.U.S.T. I. M. DATABASE Yahoo! Directory into a training set and a testing set with a ratio of 7:3 Remove those categories containing only one labeled document 132,199 categories 492,617 training documents 275,364 testing documents

20 Intelligent Database Systems Lab N.Y.U.S.T. I. M. Complexity and Effectiveness Flat SVMs, with one-against-rest strategy N is the number of training documents M is the number of categories denotes the average training time per SVM model model

21 Intelligent Database Systems Lab N.Y.U.S.T. I. M. Complexity and Effectiveness Hierarchical m i is the number of categories defined at the i-th level j is the size-based rank of the categories n ij is the number of training documents for the j-th category at the i-th level n i1 is the number of training document for the most common category at the i-th level is a level-specific parameter

22 Intelligent Database Systems Lab N.Y.U.S.T. I. M. Complexity and Effectiveness was used to approximate the number of categories at the i-th level

23 Intelligent Database Systems Lab N.Y.U.S.T. I. M. Complexity and Effectiveness

24 Intelligent Database Systems Lab N.Y.U.S.T. I. M. Complexity and Effectiveness

25 Intelligent Database Systems Lab N.Y.U.S.T. I. M. Complexity and Effectiveness

26 Intelligent Database Systems Lab N.Y.U.S.T. I. M. Complexity and Effectiveness For the testing phase of hierarchical SVMs Pachinko-machine search: 從根部做起,每次從當前類中選 擇一個最可能的子類打開,直到遇到葉子為止

27 Intelligent Database Systems Lab N.Y.U.S.T. I. M. Complexity of SVM Classification with Threshold Tuning

28 Intelligent Database Systems Lab N.Y.U.S.T. I. M. Complexity of SVM Classification with Threshold Tuning SCut Optimal performance of the classifier is obtained for the category Fix the per-category thresholds when applying the classifier to new documents in the test set RCut Sort categories by score and assign YES to each of the t top- ranking categories

29 Intelligent Database Systems Lab N.Y.U.S.T. I. M. Effectiveness Analysis Compared to scalability analysis, classification effectiveness is not as clear and predictable be affected by many other factors Potential problems of SVM noisy, imbalanced Can’t expect the performance of hierarchical SVM to be very good

30 Intelligent Database Systems Lab N.Y.U.S.T. I. M. Experimental Results 10 machines, each with four 3GHz CPUs and 4 GB of memory

31 Intelligent Database Systems Lab N.Y.U.S.T. I. M. Experimental Results

32 Intelligent Database Systems Lab N.Y.U.S.T. I. M. Experimental Results

33 Intelligent Database Systems Lab N.Y.U.S.T. I. M. Experimental Results

34 Intelligent Database Systems Lab N.Y.U.S.T. I. M. Experimental Results

35 Intelligent Database Systems Lab N.Y.U.S.T. I. M. Conclusions Text categorization algorithms to very large problems, especially large-scale Web taxonomies

36 Intelligent Database Systems Lab N.Y.U.S.T. I. M. Conclusions Drawback Lower performance in deep level Application combine SVMs with concept hierarchical tree Application to Text, or others domain Pachinko-machine search… Future Work learn SVMs kernel to implement ?


Download ppt "Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Advisor : Dr. Hsu Presenter : Chien-Shing Chen Author: Tie-Yan."

Similar presentations


Ads by Google