Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Text classification based on multi-word with support vector.


1 Text classification based on multi-word with support vector machine
Intelligent Database Systems Lab, 國立雲林科技大學 (National Yunlin University of Science and Technology)
Presenter: Shao-Wei Cheng
Authors: Wen Zhang, Taketoshi Yoshida, Xijin Tang
KBS 2008

2 Outline
Motivation
Objective
Methodology
Experiments
Conclusion
Comments

3 Motivation
Bag of words (BOW) in the vector space model represents a text using the individual words obtained from the given text data set.
But a text representation built from individual words is not interpretable or comprehensible.
And how should the degree of relevance of a multi-word to a document be measured?
Slide example (individual words versus multi-words): last December; U.S. agriculture department; agriculture department; U.S. agriculture

4 Objectives
Propose two strategies to represent documents using the extracted multi-words.
Investigate how effective multi-word text representation is for text classification performance.
Diagram labels (steps numbered 1 to 4 on the slide): Multi-word extraction; Text representation (Decomposition strategy, Combination strategy); Experiment: IG for feature selection, SVM for classification (Linear kernel, Non-linear kernel)

5 Methodology: Multi-word extraction
Repetition pattern extraction
Example sentence: The U.S. agriculture department last December slashed its 12 month of 1987 sugar import quota from the Philippines to 143,780 short tons from 231,660 short tons in 1986.
Extracted multi-words, with the tag pattern shown on the slide:
U.S. agriculture department (NNN)
U.S. agriculture (NN)
agriculture department (NN)
last December (AN)
sugar import quota (NNN)
short tons (AN)
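The tag patterns listed on this slide (NN, NNN, AN) suggest a part-of-speech filter over short word sequences. The following is a minimal sketch of that idea only. The slide does not state the actual extraction rule, so the regular expression, the function name `candidate_multi_words`, the `max_len` limit, and the hand-assigned tags are all assumptions made for illustration.

```python
import re

# Coarse tags assigned by hand for illustration (not produced by a tagger):
# A = adjective, N = noun, O = other. "last" is tagged A to follow the slide.
TAGGED = [
    ("The", "O"), ("U.S.", "N"), ("agriculture", "N"), ("department", "N"),
    ("last", "A"), ("December", "N"), ("slashed", "O"),
]

# Assumed filter: a run of adjectives/nouns that ends in a noun.
PATTERN = re.compile(r"[AN]+N")

def candidate_multi_words(tagged, max_len=3):
    """Return (phrase, tag pattern) pairs for windows of 2..max_len tokens."""
    found = []
    for start in range(len(tagged)):
        for n in range(2, max_len + 1):
            window = tagged[start:start + n]
            if len(window) < n:
                break  # ran past the end of the sentence
            tags = "".join(tag for _, tag in window)
            if PATTERN.fullmatch(tags):
                found.append((" ".join(word for word, _ in window), tags))
    return found

candidates = candidate_multi_words(TAGGED)
for phrase, tags in candidates:
    print(f"{phrase} ({tags})")
```

On this fragment the sketch recovers the four multi-words from the slide that fall inside it, but it also emits "department last December" (NAN), which the slide does not list. A tag filter alone over-generates, so some further criterion is needed on top of it; the slide names repetition pattern extraction without giving details.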

6 Methodology: Text representation
Decomposition strategy
Using the document frequencies
Combination strategy
Slide example (nested multi-words): U.S. agriculture department; agriculture department; U.S. agriculture
Slide example (relevance of a multi-word to a document):
Document: Mickey is a mouse whose name is Mickey
Multi-word: Mickey mouse
Occurrence ratio (OR) = 1
Minimum scope (MS) = 4
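The Mickey example can be reproduced numerically under one plausible reading of the two measures. The slide gives only the values, not the definitions, so the sketch below assumes that the occurrence ratio is the fraction of the multi-word's component words that appear anywhere in the document, and that the minimum scope is the length in tokens of the shortest span containing all of them. The function names are hypothetical.

```python
def occurrence_ratio(multi_word, doc_tokens):
    """Assumed definition: share of the multi-word's words found in the document."""
    words = multi_word.lower().split()
    present = {token.lower() for token in doc_tokens}
    return sum(word in present for word in words) / len(words)

def minimum_scope(multi_word, doc_tokens):
    """Assumed definition: token length of the shortest span that contains
    every word of the multi-word; None if some word never occurs."""
    words = set(multi_word.lower().split())
    tokens = [token.lower() for token in doc_tokens]
    best = None
    for start, token in enumerate(tokens):
        if token not in words:
            continue  # a shortest span always starts on a component word
        seen = set()
        for end in range(start, len(tokens)):
            if tokens[end] in words:
                seen.add(tokens[end])
            if seen == words:
                length = end - start + 1
                if best is None or length < best:
                    best = length
                break
    return best

doc = "Mickey is a mouse whose name is Mickey".split()
ratio = occurrence_ratio("Mickey mouse", doc)
scope = minimum_scope("Mickey mouse", doc)
print(ratio, scope)  # 1.0 4
```

Under these assumed definitions the shortest span is "Mickey is a mouse", which gives MS = 4, and both component words occur, which gives OR = 1. Both values agree with the slide.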

7 Experiments
Data from Reuters-21578
IG for feature selection
SVM for classification
Legend for the result labels:
M = multi-word
C = Combination strategy
D = Decomposition strategy
L = Linear kernel
N = Non-linear kernel
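As an illustration of the feature-selection step, information gain (IG) for a single feature can be computed as the class entropy minus the entropy that remains after splitting the documents on whether the feature is present. The sketch below uses this standard textbook definition on a tiny toy corpus. The documents, the class labels, and the function names are hypothetical and are not taken from the experiments.

```python
from math import log2

def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(docs, labels, feature):
    """IG of a binary feature: H(class) - H(class | feature present/absent)."""
    classes = sorted(set(labels))

    def class_counts(subset):
        return [sum(1 for label in subset if label == c) for c in classes]

    present = [label for doc, label in zip(docs, labels) if feature in doc]
    absent = [label for doc, label in zip(docs, labels) if feature not in doc]
    n = len(labels)
    conditional = (len(present) / n) * entropy(class_counts(present)) \
        + (len(absent) / n) * entropy(class_counts(absent))
    return entropy(class_counts(labels)) - conditional

# Toy corpus: each document is the set of multi-words it contains.
docs = [
    {"sugar import quota", "short tons"},
    {"sugar import quota"},
    {"agriculture department"},
    {"agriculture department", "short tons"},
]
labels = ["sugar", "sugar", "grain", "grain"]

ig_quota = information_gain(docs, labels, "sugar import quota")
ig_tons = information_gain(docs, labels, "short tons")
print(ig_quota, ig_tons)
```

In this toy corpus "sugar import quota" separates the two classes perfectly and gets an IG of 1 bit, while "short tons" appears equally in both classes and gets an IG of 0. Ranking features by this score and keeping the top ones is the usual way IG is applied before training a classifier such as an SVM.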

8 Conclusion
Firstly, a multi-word representation has lower dimensionality than an individual-word representation, yet its performance is acceptable.
Secondly, multi-words are easy to acquire from documents by corpus learning, without the support of a thesaurus, dictionary, or ontology.
Thirdly, a multi-word carries more semantics and is a larger meaningful unit than an individual word.

9 Comments
Advantage: The content of this article and the proposed method are clear.
Drawback: …
Application: Text classification

