Reporter: Shau-Shiang Hung (洪紹祥)  Adviser: Shu-Chen Cheng (鄭淑真)  Date: 99/06/15

- Introduction
- Document preprocessing
- Scoring measures for feature selection
- Classification, performance evaluation, and corpora description
- Experiments
  - Reuters
  - Ohsumed
  - Comparing the results
- Conclusion

- Machine learning (ML) automatically builds a classifier for a certain category by observing the characteristics of a set of documents that have been classified manually under this category.
- The high dimensionality of text categorization (TC) problems makes most ML-based classification algorithms infeasible:
  - Many features could be irrelevant or noisy.
  - Only a small percentage of the words are really meaningful.
- Feature selection is performed to reduce the number of features and avoid overfitting.

- Before performing feature selection (FS), we must transform the documents to obtain a representation suitable for computational use.
- Additionally, we perform two kinds of feature reduction:
  - The first removes stop words (extremely common words such as "the," "and," and "to"), which aren't useful for classification.
  - The second is stemming, which maps words with the same meaning to one morphological form by removing suffixes.
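A minimal sketch of these two reduction steps, using NLTK's English stop-word list and Porter stemmer (the whitespace tokenizer and the sample sentence are illustrative assumptions, not from the slides):

    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    nltk.download("stopwords", quiet=True)  # fetch the stop-word list once

    stop_words = set(stopwords.words("english"))
    stemmer = PorterStemmer()

    def preprocess(text):
        # Lowercase and split on whitespace (a real pipeline would tokenize more carefully).
        tokens = text.lower().split()
        # First reduction: drop stop words such as "the", "and", "to".
        tokens = [t for t in tokens if t not in stop_words]
        # Second reduction: stem, so "classified" and "classifier" share one form.
        return [stemmer.stem(t) for t in tokens]

    print(preprocess("the classifier classified the documents"))
    # e.g. ['classifi', 'classifi', 'document']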

- Information retrieval measures determine word relevance (TF-IDF, compared against in the experiments, is of this kind).
- Information theory measures consider a word's distribution over the different categories:
  - Information gain (IG) takes into account the word's presence or absence in a category.
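The slide gives no formula; a common formulation of information gain for a word w over categories c_1, ..., c_m (written in LaTeX, and an assumption only in the sense that the transcript never states it) is:

    IG(w) = -\sum_{i=1}^{m} P(c_i)\log P(c_i)
            + P(w)\sum_{i=1}^{m} P(c_i \mid w)\log P(c_i \mid w)
            + P(\bar{w})\sum_{i=1}^{m} P(c_i \mid \bar{w})\log P(c_i \mid \bar{w})

Here \bar{w} denotes the word's absence, so a word scores high when knowing whether it is present or absent changes the category distribution.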

- ML measures:
  - To define our measures, we associate with each pair (w, c) the rule w → c: if the word w appears in a document, then that document belongs to category c.
  - We then use measures that have been applied to quantify the quality of the rules induced by an ML algorithm.
  - In this way, we reduce quantifying the importance of a word w in a category c to quantifying the quality of the rule w → c.

- Laplace measure (L): modifies the percentage of success by taking into account the documents in which the word appears.
- Difference (D): establishes a balance between the documents of category c and the documents in other categories that also contain w.
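A toy sketch of how such rule-quality scores can be computed from document counts. The exact normalizations below (the classic Laplace-corrected accuracy and a plain difference of proportions) are assumptions for illustration, not necessarily the paper's definitions:

    def laplace(n_wc, n_w):
        # Laplace-corrected accuracy of the rule w -> c, where n_wc is the number
        # of documents of category c containing w and n_w the number of all
        # documents containing w.
        return (n_wc + 1) / (n_w + 2)

    def difference(n_wc, n_c, n_w, n_total):
        # Balance between w inside and outside category c: the proportion of
        # c-documents containing w minus the proportion of other documents containing w.
        return n_wc / n_c - (n_w - n_wc) / (n_total - n_c)

    # Example: "oil" occurs in 40 of 50 economy documents and in 5 of the 150 others.
    print(laplace(40, 45))              # 41/47 ~= 0.87
    print(difference(40, 50, 45, 200))  # 0.8 - 0.033 ~= 0.77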

- Impurity level (IL): takes into account the number of documents of the category in which the word occurs, and the distribution of the documents over the categories.

- For the classifier, we chose support vector machines (SVMs) because they:
  - have shown better results than other traditional text classifiers;
  - perform well, since they handle examples with many features and deal well with sparse vectors;
  - are binary classifiers that can determine linear or nonlinear threshold functions to separate the documents of one category from those of other categories.
- Disadvantages:
  - They handle missing values poorly.
  - Multiclass classification doesn't perform well directly; being binary classifiers, they need one classifier per category.
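A minimal scikit-learn sketch of this kind of setup (the tiny corpus is hypothetical, and chi-square feature selection merely stands in for the scoring measures discussed above):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.pipeline import Pipeline
    from sklearn.svm import LinearSVC

    docs = ["oil prices rose sharply", "the patient received treatment",
            "stocks fell on trade fears", "symptoms improved after surgery"]
    labels = ["econ", "med", "econ", "med"]

    clf = Pipeline([
        ("vectorize", TfidfVectorizer(stop_words="english")),  # sparse word vectors
        ("select", SelectKBest(chi2, k=5)),                    # keep top-scoring features
        ("svm", LinearSVC()),                                  # linear threshold function
    ])
    clf.fit(docs, labels)
    print(clf.predict(["oil stocks rose"]))  # expected: ['econ']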

- Evaluating performance:
  - Precision (P) quantifies the percentage of documents classified as positive that really belong to the category.
  - Recall (R) quantifies the percentage of the category's documents that are correctly classified as positive.
  - F1 gives the same relevance to both precision and recall.
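In terms of true positives (TP), false positives (FP), and false negatives (FN), the standard definitions (in LaTeX) are:

    P = \frac{TP}{TP + FP}, \qquad
    R = \frac{TP}{TP + FN}, \qquad
    F_1 = \frac{2PR}{P + R}

F_1 is the harmonic mean of P and R, so a classifier scores well only if it keeps both high.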

- The corpora: we used the Reuters and the Ohsumed corpora.
  - Reuters is a set of economic news documents published by Reuters.
  - Ohsumed is a clinically oriented MEDLINE subset consisting of 348,566 references from 270 medical journals published between 1987 and 1991.

- The results show that our proposed measures depend more on certain statistical properties of the corpora, particularly the distribution of the words throughout the categories and of the documents over the categories.
- The ML measures exploit that dependence: for each corpus, at least one of these simple measures performs better than IG and TF-IDF.