Intelligent Database Systems Lab Presenter : JIAN-REN CHEN Authors : Wen Zhang, Taketoshi Yoshida, Xijin Tang 2011.ESWA A comparative study of TF*IDF,

Intelligent Database Systems Lab
A comparative study of TF*IDF, LSI and multi-words for text classification
Authors: Wen Zhang, Taketoshi Yoshida, Xijin Tang (ESWA, 2011)
Presenter: JIAN-REN CHEN

Outline
- Motivation
- Objectives
- Methodology
- Experiments
- Conclusions
- Comments

Motivation
Although TF*IDF, LSI and multi-word indexing were all proposed long ago, there has been no comparative study of these indexing methods, and no results have been reported on their classification performance.

Objectives
A comparative study of TF*IDF, LSI and multi-words for text classification, on two tasks:
- information retrieval
- text categorization
Indexing term: semantic quality, statistical quality

Methodology - TF*IDF
1) $w_{i,j}$: the weight for term i in document j
2) $N$: the number of documents in the collection
3) $tf_{i,j}$: the term frequency of term i in document j
4) $df_i$: the document frequency of term i in the collection
Figure: term-document matrix, labelled "Terms (keywords) of the document collection" and "documents".
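The weighting formula itself did not survive in the transcript. As a minimal sketch, the code below assumes the common form $w_{i,j} = tf_{i,j} \cdot \log(N / df_i)$, which uses exactly the four symbols defined on the slide; the paper may use a normalized variant. The function name `tfidf_weights` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {term: weight} dict per document.

    Assumed formula (not confirmed by the slide): w_ij = tf_ij * log(N / df_i).
    `docs` is a list of token lists.
    """
    n_docs = len(docs)  # N: number of documents in the collection
    # df_i: number of documents that contain term i at least once
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # tf_ij: raw count of term i in document j
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Hypothetical toy collection, for illustration only
docs = [
    ["oil", "price", "oil"],
    ["trade", "price"],
    ["oil", "trade", "trade"],
]
w = tfidf_weights(docs)
```

Under this assumed form, a term that occurs in every document gets weight 0, because log(N / N) = 0.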

Methodology - LSI
Given a term-document matrix $X = [x_1, x_2, \ldots, x_n]$ with $x_i \in \mathbb{R}^m$, and supposing the rank of $X$ is $r$, LSI decomposes $X$ using SVD as follows:
1. $X_k = U_k \Sigma_k V_k^T$
2.
Figure: term-document matrix, labelled "Terms (keywords) of the document collection" and "documents".
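For reference, the rank-k form on this slide can be related to the full SVD as follows. This uses the standard textbook definitions rather than anything recovered from the slide. The symbols $\sigma_i$, $u_i$, $v_i$, $d$ and $\hat{d}$ are introduced here for illustration, and the last line is the usual fold-in rule, assumed rather than taken from the paper.

```latex
% Full (thin) SVD of the m x n term-document matrix X of rank r
X = U \Sigma V^{T}, \qquad
U \in \mathbb{R}^{m \times r},\;
\Sigma = \operatorname{diag}(\sigma_1, \ldots, \sigma_r),\;
V \in \mathbb{R}^{n \times r},\;
\sigma_1 \ge \cdots \ge \sigma_r > 0

% Rank-k truncation (k <= r): keep only the k largest singular values
X_k = U_k \Sigma_k V_k^{T} = \sum_{i=1}^{k} \sigma_i \, u_i v_i^{T}

% k-dimensional representation of a document vector d in R^m (assumed fold-in rule)
\hat{d} = \Sigma_k^{-1} U_k^{T} d
```

Applied to a column $x_j$ of $X$, the fold-in rule returns the j-th row of $V_k$, so documents in the collection and new documents land in the same k-dimensional space.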

Methodology - Multi-word
- The length of a multi-word should be between 2 and 6.
- It should occur at least twice in a document.
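The two constraints above can be sketched as a simple filter. This is only an illustration: it assumes that length is measured in words and that candidates are contiguous word n-grams, which is a simplification and not necessarily the extraction procedure used in the paper. The function name `multiword_candidates` and the sample sentence are hypothetical.

```python
from collections import Counter


def multiword_candidates(tokens, min_len=2, max_len=6, min_freq=2):
    """Return contiguous word n-grams that satisfy the slide's two constraints.

    Assumptions: length is counted in words, and every contiguous n-gram
    of a document is a candidate.
    """
    counts = Counter(
        tuple(tokens[i:i + n])
        for n in range(min_len, max_len + 1)   # lengths 2..6
        for i in range(len(tokens) - n + 1)    # every start position
    )
    # Keep only n-grams that occur at least min_freq times in the document
    return {gram: c for gram, c in counts.items() if c >= min_freq}


# Hypothetical one-document example
tokens = "crude oil price rose as crude oil price forecasts fell".split()
candidates = multiword_candidates(tokens)
```

Here the surviving candidates are "crude oil", "oil price" and "crude oil price"; a real extractor would likely also prune shorter candidates that are subsumed by a longer one.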

Experiments - Datasets
Chinese corpus: TanCorpV
- 20 categories
- Selected: 1200 documents, 219,115 sentences, 5,468,301 individual words
- Categories: agriculture, history, politics, economy
English corpus: Reuters distribution
- 135 categories
- Selected: 2032 documents, 50,837 sentences, 281,111 individual words
- Categories: crude (520), agriculture (574), trade (514), interest (424)

Experiments - Evaluation

Experiments - Chinese

Experiments - English

Experiments - t-test

Comparison
Method     | Information retrieval | Text categorization | Computation complexity
TF*IDF     | Chinese               |                     | O(nm)
LSI        | English               | best                | O(n^2 r^3)
multi-word |                       |                     | O(ms^2)

Conclusions
- LSI produces indexing with better discriminative power.
- LSI and multi-word have better semantic quality than TF*IDF, while TF*IDF has better statistical quality than the other two methods.
- The number of dimensions remains a decisive factor for indexing, whichever indexing method is used for classification.

Comments
- Advantages: compares TF*IDF, LSI and multi-words.
- Disadvantage: semantic quality and statistical quality are assessed only by the authors' intuition rather than by theory.
- Applications: text mining.