A New Term Weighting Method for Text Categorization
LAN Man
School of Computing, National University of Singapore
16 Apr, 2007

Outline
– Introduction
– Motivation
– Methodology of Research
– Analysis and Proposal of a New Term Weighting Method
– Experimental Research Work
– Contributions and Future Work

Introduction – Background
– Explosive increase of textual information
– Need to organize and access this information in flexible ways
– Text Categorization (TC) is the task of classifying natural-language documents into a predefined set of semantic categories

Introduction – Applications of TC
– Categorize web pages by topic (directories like Yahoo!)
– Customize online newspapers according to a particular user's reading preferences
– Filter spam e-mails and forward incoming e-mails to the right expert by content
– Word sense disambiguation can also be treated as a TC task once word occurrence contexts are viewed as documents and word senses as categories

Introduction – Two Sub-issues of TC
TC is a discipline at the crossroads of machine learning (ML) and information retrieval (IR).
Two key issues in TC:
– Text Representation
– Classifier Construction
This thesis focuses on the first issue.

Introduction – Construction of Text Classifier
Approaches to building a classifier:
– No more than 20 algorithms in common use
– Borrowed from information retrieval: Rocchio
– From machine learning: SVM, kNN, decision tree, naïve Bayes, neural network, linear regression, decision rules, boosting, etc.
Among these, SVM performs best.

Introduction – Construction of Text Classifier
Little room is left to improve performance from the algorithm side:
– Excellent algorithms are few
– The rationale is inherent to each algorithm, and the method is usually fixed for a given algorithm
– Parameter tuning brings limited improvement

Introduction – Text Representation
Texts come in various formats, such as DOC, PDF, PostScript, HTML, etc. Can a computer read them as we do? They must be converted into a compact format so that a classifier or a classifier-building algorithm can recognize and categorize them. This indexing procedure is also called text representation.

Introduction – Vector Space Model
Texts are vectors in the term space.
Assumption: documents that are "close together" in this space are similar in meaning.
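
As an illustration of the model (a sketch, not code from the thesis), the snippet below maps toy documents to term-count vectors over a fixed vocabulary and measures closeness with cosine similarity; the documents and vocabulary are hypothetical.

```python
import math
from collections import Counter

def to_vector(text, vocabulary):
    """Map a document to a vector of raw term counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

def cosine(u, v):
    """Cosine of the angle between two term vectors (0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

vocabulary = ["acquire", "stake", "payout", "dividend"]
d1 = to_vector("acquire stake acquire", vocabulary)
d2 = to_vector("acquire stake", vocabulary)
d3 = to_vector("payout dividend dividend", vocabulary)

print(cosine(d1, d2))  # close together: similar in meaning
print(cosine(d1, d3))  # 0.0: no shared terms
```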

Introduction – Text Representation
Two main issues in text representation:
1. What should a term be? – term type
2. How to weight a term? – term weighting

Introduction – 1. What Should a Term Be?
– Sub-word level: syllables
– Word level: single tokens
– Multi-word level: phrases, sentences, etc.
– Syntactic and semantic level: senses (meanings)

Introduction – 1. What Should a Term Be?
1. Word senses (meanings) [Kehagias 2001]: the same word assumes different meanings in different contexts
2. Term clustering [Lewis 1992]: group words with a high degree of pairwise semantic relatedness
3. Semantic and syntactic representation [Scott & Matwin 1999]: relationships between words, i.e. phrases, synonyms and hypernyms

Introduction – 1. What Should a Term Be? (cont'd)
4. Latent Semantic Indexing [Deerwester 1990]: a feature reconstruction technique
5. Combination approach [Peng 2003]: combine two types of indexing terms, i.e. words and 3-grams
6. Theme Topic Mixture Model [Keller 2004]: a graphical model
7. Keywords from summarization [Li 2003]

Introduction – 1. What Should a Term Be?
In general, higher-level representations did not show good performance in most cases; the word level (bag-of-words) works better.

Introduction – 2. How to Weight a Term?
[Salton 1988] elaborated three considerations:
1. Term occurrences closely represent the content of a document
2. Other factors with discriminating power help pick out the relevant documents from the irrelevant ones
3. The effect of document length should be taken into account

Introduction – 2. How to Weight a Term?
– Simplest method: binary
– Most popular method: tf.idf
– Combinations with information-theoretic or statistical metrics: tf.chi², tf.ig, tf.gr
– Combinations with a linear classifier
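
To make these weighting ideas concrete before the analysis that follows, here is a minimal sketch of the classical tf.idf scheme with cosine normalization, mirroring Salton's three considerations; the three-document corpus and whitespace tokenization are illustrative assumptions, not the thesis setup.

```python
import math
from collections import Counter

docs = [
    "company acquires stake in rival company",
    "dividend payout raised by the company",
    "rival raises dividend",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

# df(t): number of documents that contain term t
df = Counter()
for tokens in tokenized:
    df.update(set(tokens))

def tfidf(tokens):
    tf = Counter(tokens)
    # consideration 1 (content) x consideration 2 (discrimination): tf * log(N / df)
    w = {t: tf[t] * math.log(N / df[t]) for t in tf}
    # consideration 3 (document length): cosine normalization
    norm = math.sqrt(sum(x * x for x in w.values())) or 1.0
    return {t: x / norm for t, x in w.items()}

print(tfidf(tokenized[0]))
```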

Motivation (1)
SVM performs best among the learning algorithms, and Leopold (2002) stated that text representation, rather than the kernel function of the SVM, dominates text categorization performance. Which of the widely used term weighting methods, then, is best for an SVM-based text classifier?

Motivation (2)
Text categorization is a form of supervised learning. The prior information on the membership of training documents in predefined categories is useful for:
– feature selection
– supervised learning of the classifier

Motivation (2)
Supervised term weighting methods adopt this known information and consider the document distribution; they are naturally expected to be superior to the unsupervised (traditional) term weighting methods.

Motivation – Three Questions to Be Addressed
Q1. How can we propose a new, effective term weighting method using the important prior information given by the training data set?
Q2. Which is the best term weighting method for an SVM-based text classifier?

Motivation – Three Questions to Be Addressed
Q3. Do supervised term weighting methods lead to better TC performance than unsupervised ones? What kinds of relationship can we find between term weighting methods and the widely used learning algorithms, i.e. kNN and SVM, given different benchmark data collections?

Motivation – Sub-tasks of This Thesis
First, we will analyze a term's discriminating power for TC and propose a new term weighting method that uses the prior information of the training data set to improve TC performance.

Motivation – Sub-tasks of This Thesis
Second, we will explore term weighting methods for SVM-based text categorization and investigate the best method for an SVM-based text classifier.

Motivation – Sub-tasks of This Thesis
Third, we will extend the research to more general benchmark data sets and learning algorithms, to examine the claimed superiority of supervised term weighting methods and to investigate the relationship between term weighting methods and different learning algorithms. Moreover, the study will be extended to a new application domain, i.e. biomedical literature classification.

Methodology of Research
Machine learning algorithms:
– SVM, kNN
Benchmark data corpora:
– Reuters news corpus
– 20 Newsgroups corpus
– Ohsumed corpus
– 18 Journals corpus
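
As a hedged, modern re-creation of this experimental setup (scikit-learn is a stand-in; the thesis predates it and used its own toolchain), the sketch below runs the basic protocol on the 20 Newsgroups corpus: vectorize documents with a term weighting scheme, train a linear SVM, and score the held-out test split.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

# Standard train/test split of the 20 Newsgroups corpus (downloads on first use).
train = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
test = fetch_20newsgroups(subset="test", remove=("headers", "footers", "quotes"))

# tf.idf representation; comparing weighting schemes means swapping this
# vectorizer for one implementing, e.g., tf.rf.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train.data)
X_test = vectorizer.transform(test.data)

clf = LinearSVC().fit(X_train, train.target)
print("micro-F1:", f1_score(test.target, clf.predict(X_test), average="micro"))
```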

Methodology of Research
Evaluation measures:
– F1
– Breakeven point
Significance tests:
– McNemar's significance test
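
A small sketch of the evaluation machinery named above, under the usual definitions: per-category F1, micro-averaging by pooling the contingency counts, and McNemar's test (with continuity correction) for comparing two classifiers; the counts in the example are hypothetical.

```python
def f1(tp, fp, fn):
    """F1 = harmonic mean of precision and recall for one category."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def micro_f1(per_category):
    """Micro-average: pool tp/fp/fn over all categories, then compute F1."""
    tp = sum(c[0] for c in per_category)
    fp = sum(c[1] for c in per_category)
    fn = sum(c[2] for c in per_category)
    return f1(tp, fp, fn)

def mcnemar(n01, n10):
    """McNemar statistic with continuity correction.

    n01 = test documents classifier A got wrong but B got right; n10 = the
    reverse. Compare against chi^2 with 1 degree of freedom (3.84 ~ p = 0.05).
    """
    if n01 + n10 == 0:
        return 0.0
    return (abs(n01 - n10) - 1) ** 2 / (n01 + n10)

per_category = [(80, 10, 15), (40, 5, 20)]  # hypothetical (tp, fp, fn) rows
print(micro_f1(per_category))
print(mcnemar(n01=30, n10=12))  # about 6.88 > 3.84: significant at the 0.05 level
```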

Analysis and Proposal of a New Term Weighting Method
Three considerations:
1. Term occurrence: binary, tf, ITF, log(tf)
2. Term's discriminating power: idf (also chi², ig (information gain), gr (gain ratio), mi (mutual information), or (odds ratio), etc.)
3. Document length: cosine normalization, linear normalization
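
The four term-occurrence factors under consideration 1 are simple enough to spell out directly; a quick sketch, where tf is the raw count of a term in one document:

```python
import math

def binary(tf):  return 1.0 if tf > 0 else 0.0   # presence/absence only
def raw_tf(tf):  return float(tf)                # term frequency alone
def log_tf(tf):  return math.log(1 + tf)         # dampens very frequent terms
def itf(tf):     return 1 - 1 / (1 + tf)         # inverse term frequency, in [0, 1)

for tf in (0, 1, 5, 50):
    print(tf, binary(tf), raw_tf(tf), log_tf(tf), itf(tf))
```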

Analysis and Proposal – Analysis of Term's Discriminating Power
(Figure: distributions of six example terms t1–t6 across the positive and negative categories.)

Analysis and Proposal – Analysis of Term's Discriminating Power
Assume the six terms have the same tf value. t1, t2 and t3 share the same idf value (idf1); t4, t5 and t6 share another (idf2). Clearly, the six terms contribute differently to the semantics of the documents.

Analysis and Proposal – Term's Discriminating Power: idf
Case 1: t1 contributes more than t2 and t3; t4 contributes more than t5 and t6.
Case 2: t4 contributes more than t1, although idf(t4) < idf(t1).

Analysis and Proposal – Term's Discriminating Power: idf, chi², or
Case 3 (t1, t2, t3): in idf value, t1 = t2 = t3; in chi² value, t1 = t3 > t2; in or value, t1 > t2 > t3.
Case 4 (t1 vs t4): in idf value, t1 > t4; in chi² and or value, t1 < t4.

Analysis and Proposal – A New Term Weighting Method: rf
Intuitive consideration: the more concentrated a high-frequency term is in the positive category rather than in the negative category, the more it contributes to selecting the positive samples from among the negative samples.

Analysis and Proposal – A New Term Weighting Method: rf
rf – relevance frequency:
– rf is related only to the ratio of b to c; it does not involve d
– the base of the log is 2; in case c = 0, c is set to 1
– the final weight of a term is tf * rf
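
A sketch of the proposed weight in code. Following the slide's notation, b is the number of positive training documents containing the term and c the number of negative ones; rf uses only the b/c ratio (d, the negative documents without the term, is ignored), the log base is 2, and c is replaced by 1 when it is 0. The additive constant 2 inside the log follows the published formulation of rf (Lan et al.); since the formula image itself is not in the transcript, treat the exact form here as recovered from the paper rather than from the slide.

```python
import math

def rf(b, c):
    """Relevance frequency: log2(2 + b / max(1, c))."""
    return math.log2(2 + b / max(1, c))

def tf_rf(tf, b, c):
    """Final term weight: term frequency scaled by relevance frequency."""
    return tf * rf(b, c)

# Hypothetical counts: a term concentrated in the positive category (b >> c)
# receives a much larger weight than one spread evenly across both categories.
print(tf_rf(tf=3, b=90, c=10))  # rf = log2(11), concentrated in positives
print(tf_rf(tf=3, b=50, c=50))  # rf = log2(3), evenly spread
```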

Analysis and Proposal – Empirical Observation
(Table: comparison of the idf, rf, chi², or, ig and gr values of four features (acquir, stake, payout, dividend) in category 00_acq of the Reuters corpus; the numeric values are not preserved in the transcript.)

Analysis and Proposal – Empirical Observation
(Table: comparison of the idf, rf, chi², or, ig and gr values of the same four features in category 03_earn of the Reuters corpus; the numeric values are not preserved in the transcript.)

Experiment Set 1 – Exploring the Best Term Weighting Method for SVM-based Text Categorization
Purposes of Experiment Set 1:
1. Explore the best term weighting method for SVM-based text categorization (Q2)
2. Compare tf.rf with various traditional term weighting methods on SVM (Q1)

SoC Presentation Title 2004 MethodsDenotationDescription Traditional Weighting Methods Term frequency binary0: absence, 1: presence tfTerm frequency alone logtflog(1+tf) ITF1-1/(1+tf) tf.idf And variants idfidf alone tf.idfClassical tf * idf logtf.idflog(1+tf) * idf tf. idf_probtf * Probabilistic idf Supervised Weighting Methods tf.chi^2tf * chi square Our new tf.rftf.rfOur new method Experiment Set 1 Experimental Methodology: 10 Methods

Experiment Set 1 – Results on the Reuters Corpus

Experiment Set 1 – Results on a Subset of the 20 Newsgroups Corpus

Experiment Set 1 – Conclusions (1)
– tf.rf consistently performs better
– No significant difference among the three term frequency variants, i.e. tf, logtf and ITF
– No significant difference between tf.idf, logtf.idf and tf.idf_prob

Experiment Set 1 – Conclusions (2)
– The idf and chi² factors, even though they take the document distribution into account, do not improve performance and may even decrease it
– binary and tf.chi² significantly underperform the other methods

Experiment Set 2 – Investigating Supervised Term Weighting Methods and Their Relationship with Learning Algorithms
Purposes of Experiment Set 2:
1. Investigate supervised term weighting methods and their relationship with learning algorithms (Q3)
2. Compare the effectiveness of tf.rf under more general experimental circumstances (Q1)

Experiment Set 2 – Review
Supervised term weighting methods:
– use the prior information on the membership of training documents in predefined categories
Unsupervised term weighting methods:
– do not use this information
– binary, tf, log(1+tf), ITF
– the most popular is tf.idf and its variants: logtf.idf, tf.idf_prob

Experiment Set 2 – Review: Supervised Term Weighting Methods
1. Combination with information-theoretic functions or statistical metrics
– such as chi², information gain, gain ratio, odds ratio, etc.
– used in the feature selection step
– select the most relevant and discriminating features for the classification task, that is, the terms with higher feature selection scores
– the reported results are inconsistent and/or incomplete

Experiment Set 2 – Review: Supervised Term Weighting Methods
2. Interaction with a supervised text classifier
– linear SVM, perceptron, kNN
– a text classifier separates the positive test documents from the negative ones by assigning different scores to the test samples; these scores are believed to be effective in assigning more appropriate weights to terms

Experiment Set 2 – Review: Supervised Term Weighting Methods
3. Based on statistical confidence intervals: ConfWeight
– linear SVM
– compared with tf.idf and tf.gr (gain ratio)
– the results failed to show that supervised methods generally outperform unsupervised ones

Experiment Set 2
Hypothesis: since supervised schemes consider the document distribution, they should perform better than unsupervised ones.
Motivation:
1. Are supervised term weighting methods superior to the unsupervised (traditional) methods?
2. What kinds of relationship hold between term weighting methods and the learning algorithms, i.e. kNN and SVM, given different data collections?

SoC Presentation Title 2004 Unsupervised Term Weighting binary0 or 1 tfTerm frequency tf.idfClassic scheme Supervised Term Weighting tf.rfOwn scheme tf.chi^2Chi^2 tf.igInformation gain tf.orOdds ratio Experiment Set 2 Methodology

Experiment Set 2 – Results on the Reuters Corpus using SVM

Experiment Set 2 – Results on the Reuters Corpus using kNN

Experiment Set 2 – Results on the 20 Newsgroups Corpus using SVM

Experiment Set 2 – Results on the 20 Newsgroups Corpus using kNN

SoC Presentation Title 2004 AlgorithmCorpus#_fea Significance Test Results SVMR15937 (tf.rf, tf, rf) > tf.idf > (tf.ig, tf.chi2, binary) >> tf.or SVM (rf, tf.rf, tf.idf) > tf >> binary >> tf.or >> (tf.ig, tf.chi2) kNNR405 (binary, tf.rf) > tf >> (tf.idf, rf, tf.ig) > tf.chi2 >> tf.or kNN20494 (tf.rf, binary, tf.idf,tf) >> rf >> (tf.or, tf.ig, tf.chi2) Experiment Set 2 McNemar’s Significance Tests

Experiment Set 2 – Effects of Feature Set Size on Algorithms
– For SVM, almost all methods achieved their best performance when inputting the full vocabulary
– For kNN, the best performance was achieved at a smaller feature set size
– Possible reason: different noise resistance

Experiment Set 2 – Conclusions (1)
Q1: Are supervised term weighting methods superior to the unsupervised term weighting methods? Not always.

Experiment Set 2 – Conclusions (1)
Specifically, the three supervised methods based on information theory, i.e. tf.chi², tf.ig and tf.or, perform rather poorly in all experiments. On the other hand, the newly proposed supervised method tf.rf consistently achieved the best performance and outperforms the other methods substantially and significantly.

Experiment Set 2 – Conclusions (2)
Q2: What kinds of relationship hold between term weighting methods and learning algorithms, given different benchmark data collections?
A: The performance of the term weighting methods, especially the three unsupervised methods, is closely related to the learning algorithms and data corpora.

Experiment Set 2 – Conclusions (4)
Summary of supervised methods:
– tf.rf performs consistently better in all experiments
– tf.or, tf.chi² and tf.ig perform consistently the worst in all experiments
– rf alone shows performance comparable to tf.rf, except on Reuters using kNN

Experiment Set 2 – Conclusions (5)
Summary of unsupervised methods:
– tf.idf performs comparably to tf.rf on the uniform-category corpus, whether using SVM or kNN
– binary performs comparably to tf.rf on both corpora using kNN, but rather badly using SVM
– although tf does not perform as well as tf.rf, it performs consistently well in all experiments

Experiment Set 3 – Applications in the Biomedical Domain
Purpose: application to biomedical data collections
Motivation: explosive growth of biomedical research

Experiment Set 3 – Data Corpora
The Ohsumed corpus:
– used by Joachims [1998]
The 18 Journals corpus:
– biochemistry and molecular biology
– journals with top impact factors
– 7903 documents from a two-year span

Experiment Set 3 – Methodology
Four term weighting methods: binary, tf, tf.idf, tf.rf
Evaluation measures:
– micro-averaged breakeven point
– F1

Experiment Set 3 – Results on the Ohsumed Corpus

SoC Presentation Title 2004 schemeMicro-RMicro-PMicro-F1Macro-F1 binary tf tf.idf tf.rf Experiment Set 3 The best performance of SVM with four term weighting schemes on the Ohsumed Corpus

Experiment Set 3 – Comparison Between Our Study and Joachims' Study
– Our linear SVM is more accurate: 68.05% (linear) vs. 60.7% (linear) and 66.1% (RBF)
– The performances of tf.idf are nearly identical: 65.78% vs. 66%
– Our proposed tf.rf performs significantly better than the other methods

Experiment Set 3 – Results on the 18 Journals Corpus

Experiment Set 3 – Results on 3 Subsets of the 18 Journals Corpus

Experiment Set 3 – Conclusions
– The comparison between our results and Joachims' results shows that tf.rf improves classification performance over tf.idf
– tf.rf outperforms the other term weighting methods on both data corpora
– tf.rf improves the performance of biomedical text classification

Contributions of Thesis
1. To propose an effective supervised term weighting method, tf.rf, to improve the performance of text categorization.

Contributions of Thesis
2. To carry out an extensive comparative study of different term weighting methods under controlled conditions.

Contributions of Thesis
3. To give a deep analysis of terms' discriminating power for text categorization, from both qualitative and quantitative aspects.

Contributions of Thesis
4. To investigate the relationships between term weighting methods and learning algorithms given different corpora.

Future Work (1)
1. Extending term weighting methods to feature types other than words

Future Work (2)
2. Applying term weighting methods to other text-related applications

Thanks for your time and suggestions.