Speaker : Shau-Shiang Hung ( 洪紹祥 ) Adviser : Shu-Chen Cheng ( 鄭淑真 ) Date : 99/05/04 1 Qirui Zhang, Jinghua Tan, Huaying Zhou, Weiye Tao, Kejing He, "Machine.

Speaker: Shau-Shiang Hung (洪紹祥) Adviser: Shu-Chen Cheng (鄭淑真) Date: 99/05/04
Qirui Zhang, Jinghua Tan, Huaying Zhou, Weiye Tao, Kejing He, "Machine Learning Methods for Medical Text Categorization," 2009 Pacific-Asia Conference on Circuits, Communications and Systems (PACCS), 2009

Outline
Introduction
Document indexing
Classification Algorithm
Experiments
Conclusion

Introduction
Text categorization (TC) is the process of automatically assigning one or more predefined category labels to text documents. The volume of digital medical information is growing rapidly as networks develop. Handling and organizing it effectively is a problem in the field of medical informatics.

Document indexing
Because classifiers cannot interpret documents directly, the documents must first be transformed into a form that classifiers can process. The vector space model (VSM) is a well-known statistical model for this purpose.

Document indexing
A. Standard Term Frequency Inverse Document Frequency (TFIDF)
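The formula shown on this slide is not part of the transcript. For orientation only, the form of TFIDF most commonly given in the text categorization literature is reproduced here; the symbols below are assumptions and may differ from the notation the speaker used.

```latex
\mathrm{tfidf}(t_k, d_j) = \mathrm{tf}(t_k, d_j) \cdot \log \frac{N}{n_k}
```

Here tf(t_k, d_j) is the number of times term t_k occurs in document d_j, N is the number of documents in the training set, and n_k is the number of training documents in which t_k occurs at least once.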

Document indexing
So that the weights fall in the [0,1] interval and the documents are represented by vectors of equal length, the weights resulting from TFIDF are often normalized by cosine normalization. Slide annotation: the sum of the squared TFIDF values of all keywords in document 1.
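As a rough illustration of cosine normalization, the sketch below divides each TFIDF weight by the square root of the sum of squared TFIDF weights of the same document, which is the quantity the slide annotation describes. The function name `tfidf_vectors`, the toy documents, and the use of tf × log(N/df) as the raw weight are assumptions made for this example, not details taken from the slides.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, cosine-normalized TFIDF vectors) for tokenized docs.

    Raw weight is assumed to be tf * log(N / df), the common textbook form.
    """
    n_docs = len(docs)
    vocab = sorted({term for doc in docs for term in doc})
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        raw = [tf[term] * math.log(n_docs / df[term]) for term in vocab]
        # Denominator: square root of the sum of squared TFIDF weights.
        norm = math.sqrt(sum(w * w for w in raw))
        vectors.append([w / norm if norm > 0 else 0.0 for w in raw])
    return vocab, vectors


# Hypothetical toy corpus, already tokenized.
docs = [["fever", "cough", "fever"], ["cough", "rash"], ["rash", "itch", "itch"]]
vocab, vectors = tfidf_vectors(docs)
```

After normalization every document vector has unit Euclidean length, so long and short documents become directly comparable and each weight lies between 0 and 1.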

Document indexing
B. Improved weighting: Term Frequency, Inverse Document Frequency and Inverted Entropy (TFIDFIE)
In text classification, the importance of a term depends not only on its term frequency but also on its contribution to classification. For example, under TFIDF, Term 1 客房 ("guest room") and Term 2 風景 ("scenery") have the same weight.

Document indexing
To bring out the relation between terms and categories, the distribution across categories of the documents containing a term is also calculated when weighting terms. This distribution can be measured by the information entropy H.
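The sketch below illustrates only the entropy H mentioned on this slide: given how many documents in each category contain a term, it computes the Shannon entropy of that distribution. The function name `category_entropy`, the base-2 logarithm, and the counts are assumptions for illustration. How the paper folds H into the final TFIDFIE weight is not shown in the transcript and is not reproduced here.

```python
import math


def category_entropy(doc_counts):
    """Shannon entropy (base 2) of a term's document distribution over categories.

    doc_counts[i] is the number of documents in category i containing the term.
    """
    total = sum(doc_counts)
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in doc_counts:
        if count > 0:
            p = count / total
            entropy -= p * math.log2(p)
    return entropy


# Two hypothetical terms, each occurring in 12 documents (same document frequency).
concentrated = category_entropy([12, 0, 0, 0])  # all in one category
spread = category_entropy([3, 3, 3, 3])         # evenly spread over four categories
```

Two terms with identical document frequency can thus have very different entropies: a term concentrated in one category has low entropy and is informative for classification, while a term spread evenly over the categories has high entropy and says little about the category.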

Classification Algorithm
A. K-Nearest Neighbor (KNN)
B. Support Vector Machine (SVM)
C. Naïve Bayes (NB)
D. Clonal Selection Algorithm Based on Antibody Density (CSABAD)
Because an immune algorithm by nature distinguishes between self and non-self, it can be used for text categorization.

Classification Algorithm: CSABAD
In text categorization:
Antigen: a training text.
B cell: an individual classifier.
Antibody: the affinity between the individual and the training documents.
The final classifier is composed of many memory B cells.
The cosine of two vectors is used to measure the affinity f(x_i, d_j) between B cell x_i and antigen d_j.
The affinity f(x_i) of B cell x_i with N antigens is defined as the average of all N affinities.
The antibody selection probability P(x_i) is defined as follows: (formula not preserved in the transcript)
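The two affinity definitions on this slide can be sketched as follows, assuming B cells and antigens are both plain lists of term weights of equal length. The names `cosine` and `mean_affinity` and the toy vectors are hypothetical. The selection probability P(x_i) is left out because its formula is not available in the transcript.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_affinity(b_cell, antigens):
    """Average cosine affinity between one B cell and all N antigens."""
    return sum(cosine(b_cell, antigen) for antigen in antigens) / len(antigens)


# Hypothetical B cell and two training documents (antigens).
b_cell = [1.0, 0.0]
antigens = [[1.0, 0.0], [0.0, 1.0]]
affinity = mean_affinity(b_cell, antigens)
```

In this toy case the B cell matches the first antigen exactly and is orthogonal to the second, so its average affinity is halfway between the two extremes.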

Experiments
A. Data collection
OHSUMED is a bibliographical document collection. The experiments use OHSCAL, a single-label subset of OHSUMED whose documents fall into 10 categories.

Experiments
B. Experiment results and analysis
The OHSCAL dataset was randomly divided into a training set and a test set in a 2:1 ratio. To reduce the influence of chance on the results, ten independent experiments were run on OHSCAL.
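A minimal sketch of this protocol, assuming each of the ten experiments is an independent random 2:1 split: the function name `split_two_to_one`, the use of seeds 0 to 9, and the stand-in document list are illustrative choices, not details reported in the slides.

```python
import random


def split_two_to_one(items, seed):
    """Shuffle a copy of items and split it 2:1 into (training, test)."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = (2 * len(shuffled)) // 3
    return shuffled[:cut], shuffled[cut:]


# Stand-in for the document collection: 30 document ids.
documents = list(range(30))

# Ten independent random splits, one per experiment.
runs = [split_two_to_one(documents, seed) for seed in range(10)]
```

A classifier would be trained and evaluated once per split, and the ten scores averaged so that no single lucky or unlucky split dominates the reported result.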

Conclusion
In this paper, we propose an improved term weighting approach called TFIDFIE. It takes into account how the documents in which a term occurs are distributed in the training set. The experiments show that SVM and CSABAD significantly outperform kNN and Naïve Bayes, and that TFIDFIE is more effective than TFIDF. Considering the characteristics of professional medical vocabulary, we will study feature selection for medical text classification in future work.