
Intelligent Database Systems Lab Presenter : JIAN-REN CHEN Authors : Wen Zhang, Taketoshi Yoshida, Xijin Tang 2011.ESWA A comparative study of TF*IDF,


1 Intelligent Database Systems Lab Presenter : JIAN-REN CHEN Authors : Wen Zhang, Taketoshi Yoshida, Xijin Tang 2011.ESWA A comparative study of TF*IDF, LSI and multi-words for text classification

2 Outline: Motivation, Objectives, Methodology, Experiments, Conclusions, Comments

3 Motivation: Although TF*IDF, LSI and multi-word indexing were all proposed long ago, there is no comparative study of these indexing methods, and no results have been reported on their classification performance.

4 Objectives: A comparative study of TF*IDF, LSI and multi-words for text classification.
- information retrieval
- text categorization
Indexing term: semantic quality, statistical quality

5 Methodology - TF*IDF
1) $w_{i,j}$: the weight of term i in document j
2) $N$: the number of documents in the collection
3) $tf_{i,j}$: the term frequency of term i in document j
4) $df_i$: the document frequency of term i in the collection
(Figure: term-document matrix, axes labeled "terms (keywords) of the document collection" and "documents")
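The four quantities above combine in the standard TF*IDF weighting, $w_{i,j} = tf_{i,j} \times \log(N / df_i)$. The following is a minimal sketch assuming that standard form; the natural logarithm, the function name `tfidf_weights` and the toy documents are illustrative assumptions, not taken from the slides.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)  # N: number of documents in the collection
    # df_i: number of documents that contain term i at least once
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # tf_{i,j}: raw count of term i in document j
        # Assumed standard form: w_{i,j} = tf_{i,j} * log(N / df_i)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Hypothetical toy collection of three tokenized documents
docs = [["oil", "price", "oil"], ["trade", "price"], ["oil", "trade", "tariff"]]
w = tfidf_weights(docs)
```

Under this form, a term that occurs in every document gets weight 0, and a term confined to a single document gets the largest IDF factor, log(N).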

6 Methodology - LSI
Given a term-document matrix $X = [x_1, x_2, \ldots, x_n]$ with columns $x_i \in \mathbb{R}^m$, and supposing the rank of $X$ is $r$, LSI decomposes $X$ using SVD as follows:
(Figure: term-document matrix, axes labeled "terms (keywords) of the document collection" and "documents")
1. $X_k = U_k \Sigma_k V_k^T$
2.
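The rank-k approximation used by LSI can be sketched with NumPy's SVD. This is an illustrative sketch, not the authors' implementation: the function name `lsi_rank_k`, the toy matrix, and the choice of taking the columns of $\Sigma_k V_k^T$ as reduced document vectors (a common convention) are assumptions.

```python
import numpy as np

def lsi_rank_k(X, k):
    """X: (m terms x n documents) matrix. Returns (X_k, document vectors)."""
    # Thin SVD: X = U @ diag(s) @ Vt, singular values in descending order
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Keep only the k largest singular values and their singular vectors
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
    X_k = U_k @ np.diag(s_k) @ Vt_k
    # Assumed convention: one k-dimensional row per document
    doc_vectors = (np.diag(s_k) @ Vt_k).T
    return X_k, doc_vectors

# Hypothetical toy matrix: 4 terms (rows) x 3 documents (columns)
X = np.array([[2.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 2.0, 1.0],
              [0.0, 0.0, 3.0]])
X_k, doc_vectors = lsi_rank_k(X, k=2)
```

`X_k` has the same shape as `X` but rank at most k, and each document is reduced from m term dimensions to k latent dimensions.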

7 Methodology - Multi-word
- The length of a multi-word should be between 2 and 6.
- It should occur at least twice in a document.
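The two constraints above can be illustrated with a naive filter over contiguous word n-grams. This is only a sketch: the slide does not describe the extraction procedure itself, so plain n-gram counting is swapped in here, and it assumes that length is measured in words. The function name `candidate_multiwords` and the sample sentence are hypothetical.

```python
from collections import Counter

def candidate_multiwords(tokens, min_len=2, max_len=6, min_freq=2):
    """Return {phrase: count} for contiguous n-grams meeting both constraints."""
    counts = Counter()
    # Length constraint: n-grams of 2 to 6 tokens (assumed to mean words)
    for n in range(min_len, max_len + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    # Frequency constraint: at least twice within this one document
    return {" ".join(g): c for g, c in counts.items() if c >= min_freq}

# Hypothetical single document, already tokenized on whitespace
tokens = "crude oil prices rose as crude oil supply fell".split()
candidates = candidate_multiwords(tokens)
```

For this sample, only "crude oil" passes both filters; every longer n-gram occurs just once.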

8 Experiments - Datasets
Chinese corpus: TanCorpV1.0
- 14,150 documents, 20 categories
- Selected: 1,200 documents (219,115 sentences; 5,468,301 individual words)
- Categories: agriculture, history, politics, economy
English corpus: Reuters-22173 distribution 1.0
- 22,173 documents, 135 categories
- Selected: 2,032 documents (50,837 sentences; 281,111 individual words)
- Categories: crude (520), agriculture (574), trade (514), interest (424)

9 Experiments - Evaluation

10 Experiments - Chinese

11 Experiments - English

12 Experiments - t-test

13 Comparison (columns: information retrieval, text categorization, computation complexity)
- TF*IDF: Chinese; $O(nm)$
- LSI: English; best; $O(n^2 r^3)$
- multi-word: $O(ms^2)$

14 Conclusions
- LSI can produce better indexing in terms of discriminative power.
- LSI and multi-word have better semantic quality than TF*IDF, and TF*IDF has better statistical quality than the other two methods.
- The number of dimensions is still a decisive factor for indexing when different indexing methods are used for classification.

15 Comments
- Advantages: compares TF*IDF, LSI and multi-words
- Disadvantage: semantic quality and statistical quality are assessed only by intuition rather than by theory
- Applications: text mining

