
1 Intelligent Database Systems Lab, 國立雲林科技大學 (National Yunlin University of Science and Technology). Advisor: Dr. Hsu. Presenter: Yu Cheng Chen. Authors: Yu-Sheng Lai and Chung-Hsien Wu. "Meaningful term extraction and discriminative term selection in text categorization via unknown-word methodology", ACM Transactions on Asian Language Information Processing, 2002, pages 34-64.

2 Outline: Motivation; Objective; Introduction; System Overview; Term Extraction and Selection; Discriminative Term Selection; Indexing and Classification; Experimental Result; Conclusions; Personal Opinion.

3 Motivation. In text categorization, terms are extracted from documents and used to estimate the textual similarity between documents, so the extracted terms often determine system performance. N-grams are typically employed for textual indexing, but they need comparatively more storage space, an n-gram is not a meaningful unit in linguistics, and they suffer from an inconsistency problem. Unknown words are more domain-specific than traditional words (domain dependency).

4 Objective. Propose a method for extracting meaningful and highly domain-specific unknown words from Chinese text documents.

5 Introduction. There are two main approaches to detecting unknown words. Statistical: some of these methods are restricted to a particular type of unknown word. Rule-based: these methods use a dictionary, need part-of-speech information, and handle only unknown words of limited length.

6 System Overview. [Figure: system overview diagram. Recoverable labels: T1 新聞 (news), T2 體育 (sports), n = 1~8, document j.]

7 System Overview

8 Term Extraction and Selection. Phrase-like unit (PLU): take a frequently occurring word sequence P. If, for each word w_i in P, the preceding words w_1 w_2 … w_i are always followed by the word sequence w_{i+1} w_{i+2} …, then P is probably an unknown word or phrase, for example 陳水扁 (Chen Shui-bian). PLU-based likelihood ratio PLR(p): [formula not preserved in the transcript]. Example frequencies: 陳水扁 250, 陳 1000, 水扁 200.
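The frequencies in the example above (such as 250 for 陳水扁) are occurrence counts of candidate sequences. The following is a minimal sketch of collecting such counts for every sequence of 1 to 8 tokens, assuming the "n=1~8" label on the system overview slide refers to sequence length. The function name count_sequences and the use of single characters as tokens are illustrative assumptions, not the authors' implementation. The PLR formula itself is not reproduced because the transcript does not preserve it.

```python
from collections import Counter


def count_sequences(tokens, max_n=8):
    """Count every contiguous token sequence of length 1..max_n.

    `tokens` is any sequence of strings (characters or segmented words);
    the returned Counter maps the joined sequence to its frequency.
    """
    counts = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of width n across the token sequence.
        for start in range(len(tokens) - n + 1):
            counts["".join(tokens[start:start + n])] += 1
    return counts


# Hypothetical toy text, tokenized per character for illustration only.
toy_counts = count_sequences(list("陳水扁總統與陳水扁"))
print(toy_counts["陳水扁"], toy_counts["陳"], toy_counts["總統"])
```

Counting all windows up to length 8 is quadratic in the worst case for short texts but linear in text length for a fixed max_n, which is why a small upper bound on n keeps the candidate set manageable.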

9 Term Extraction and Selection. A word sequence p of n words is considered an unknown word if n > 1, tf(p) >= c, and either PLR(p) >= 1 - ε or PLR(p) * tf(p) >= d.
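The acceptance conditions above can be sketched as a single predicate. This is a hedged illustration: the grouping "n > 1 and tf >= c and (PLR >= 1 - ε or PLR * tf >= d)" is one reading of the slide, the PLR value is taken as an input because its formula is not preserved in the transcript, and the default thresholds c, eps, and d are hypothetical placeholders rather than values from the paper.

```python
def is_unknown_word(n, tf, plr, c=5, eps=0.1, d=50):
    """Return True if a candidate sequence passes the slide's conditions.

    n   -- number of words in the candidate sequence
    tf  -- term frequency of the candidate
    plr -- PLU-based likelihood ratio, computed elsewhere
    c, eps, d -- thresholds (hypothetical defaults, not from the paper)
    """
    if n <= 1 or tf < c:
        return False
    # Either the ratio is close to 1, or a lower ratio is
    # compensated by a high frequency.
    return plr >= 1 - eps or plr * tf >= d


# Hypothetical candidates: (n, tf, plr)
print(is_unknown_word(3, 250, 0.95))
print(is_unknown_word(1, 1000, 1.0))
```

Passing the thresholds as parameters keeps the sketch usable for the kind of parameter testing described later in the slides.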

10 Further Purification. Some PLUs are useless or interfering, so the extracted terms are purified further: discard stop terms; deal with cross-included terms; reliability degree.

11 Discriminative Term Selection. Here the term "discriminative" indicates utility in distinguishing categories. For example, the term 陳水扁 is used to distinguish the 政治 (politics) and 體育 (sports) classes. For a term t representing category g, the discriminability W(t, g) can be defined as: [formula not preserved in the transcript].

12 Indexing and Classification. Index machine: used for locating keywords in a text, M = (S, I, g, f, s_0, O). For example, "半自動套裝遊程" (semi-automatic package tour).
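The transcript does not define the components of M. One plausible reading, stated here as an assumption, is an Aho-Corasick-style keyword matcher in which S is the state set, I the input alphabet, g a goto function, f a failure function, s_0 the start state, and O the output function. The sketch below implements that standard technique, not necessarily the authors' machine; the keyword list is hypothetical.

```python
from collections import deque


def build_index_machine(keywords):
    """Build goto (g), failure (f), and output (O) tables; state 0 is s_0."""
    goto, output = [{}], [[]]
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(word)
    fail = [0] * len(goto)
    queue = deque(goto[0].values())  # depth-1 states fail to s_0
    while queue:  # breadth-first, so shallower states are finished first
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit keywords that end at the failure state.
            output[nxt] = output[nxt] + output[fail[nxt]]
    return goto, fail, output


def locate_keywords(text, machine):
    """Return (start_index, keyword) for every keyword occurrence in text."""
    goto, fail, output = machine
    state, hits = 0, []
    for pos, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in output[state]:
            hits.append((pos - len(word) + 1, word))
    return hits


# Hypothetical keyword set applied to the slide's example string.
machine = build_index_machine(["半自動", "自動", "套裝", "遊程", "套裝遊程"])
print(locate_keywords("半自動套裝遊程", machine))
```

A machine of this kind scans the text once and reports overlapping and nested keywords together, which suits indexing with terms of varying length.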

13 Indexing and Classification. To improve performance, the vector space model (VSM) is used: a document is represented as a vector, and each component of the vector is a weighted indexing feature. Term weighting for training documents: there are K categories with N_k documents in category k, and D_{k,j} is the jth document in the kth category.

14 Indexing and Classification. Term weighting for training documents: S(w) is a smooth 0-1 function for avoiding the bias problem, and α is a constant.

15 Indexing and Classification. Term weighting for unclassified documents: because the category of an unclassified document is not known, each unclassified document is represented by multiple description vectors, namely K vectors X_k, k = 1…K.

16 Indexing and Classification. Classification function: the vectors of each category are combined into a mean vector. The classification function f_{G_k}(X; A) is: [formula not preserved in the transcript].
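Since the classification formula is not preserved, the following is only a sketch of the general idea: average the training vectors of each category into a mean vector, then score an unclassified document by comparing its k-th description vector X_k with the mean vector of category k. Cosine similarity is a swapped-in stand-in chosen for illustration, not the authors' function f_{G_k}; all names and the toy vectors are hypothetical.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of equally sized vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity; defined as 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def classify(doc_vectors, category_means):
    """Pick the category k whose mean vector best matches X_k.

    doc_vectors[k] is the description vector X_k of the unclassified
    document under category k's term weighting.
    """
    scores = [cosine(x_k, m_k) for x_k, m_k in zip(doc_vectors, category_means)]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy data: K = 2 categories, 3 indexing features (hypothetical values).
training = [
    [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],  # category 0
    [[0.0, 1.0, 1.0], [0.0, 0.5, 1.0]],  # category 1
]
means = [mean_vector(docs) for docs in training]
label, scores = classify([[0.1, 0.9, 0.8], [0.0, 0.9, 0.8]], means)
print(label)
```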

17 Experimental Result. Corpus: Min-Sheng Daily News (MSDN), 44,675 text documents consisting of over 35 million words. Documents from 1997 to April 1997 (start month missing in the transcript) were used for training, and documents from 1999 to July 1999 (start month missing in the transcript) were used for testing. Performance evaluation: [details not preserved in the transcript].

18 Experimental Result

19 Experimental Result. Baseline performance: using the words defined in the dictionary.

20 Experimental Result. Parameter testing: the number of representative terms is varied; the number of terms selected from each category is either constrained or not; the effect of discriminability (Nor) on performance is examined.

21 Experimental Result. Parameter testing: the number of representative terms is varied; the number of terms selected from each category is either constrained or not; the effect of discriminability (Nor) on performance is examined.

22 Experimental Result. Experimental results on the purification process.

23 Experimental Result. Combined approach (unknown word-based).

24 Experimental Result. Comparative performance.

25 Experimental Result. Consistency between training and testing data.

26 Conclusions. We have proposed two new concepts: meaningful term extraction and discriminative term selection. PLUs improve the performance of text categorization. The purification process reduces the dimensionality of the feature space.

27 Personal Opinion. Advantages: the method takes meaningful and discriminative terms into account; the purification process saves time; terms can be extracted automatically and systematically. Applications: ICD9 code classification and so on; it may help with patient records that mix Chinese and English. Limitations: the sparse data problem still needs to be solved.