AUTOMATED TEXT CATEGORIZATION: THE TWO-DIMENSIONAL PROBABILITY MODEL
Abdulaziz Alsharikh

AGENDA
Introduction
Background on ATC
The Two-Dimensional Model
– Probabilities
– Peculiarity
Document Coordinates
Experiments
Results and Analyses
Critiques

INTRODUCTION
What is ATC?
– Build a classifier by observing the properties of a set of pre-classified documents (the Naïve Bayes model): simple to implement, gives remarkable accuracy, and gives directions for SVM parameter tuning.
– The 2DPM starts from hypotheses different from NB's: terms are seen as disjoint events, and documents as unions of these events.
– It also serves as a visualization tool for understanding the relationships between categories, helping users visually audit the classifier and identify suspicious training data.
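As a point of comparison for the 2DPM, a minimal Naïve Bayes baseline of the kind described above; this is only a sketch, not the paper's setup, and the toy documents and labels are invented for illustration:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy pre-classified documents (illustrative only).
train_docs = ["grain exports rise", "interest rates fall", "wheat crop grows"]
train_labels = ["grain", "money-fx", "grain"]

# Bag-of-words counts feeding a multinomial NB classifier.
model = make_pipeline(CountVectorizer(lowercase=True), MultinomialNB())
model.fit(train_docs, train_labels)
print(model.predict(["exports of wheat grain"]))  # expected: ['grain']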

BACKGROUND ON ATC
– A set of categories C and a set of documents D; categorization is a function D × C → {T, F}.
– Multi- vs. single-label categorization.
– CSVi : D → [0,1] gives the degree of membership of a document in category ci.
– Binary categorization: for each category c in C, every document in D is assigned either to c or to its complement.
– The documents in Dtr train the classifier; those in Dte measure its performance.
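In symbols, with a thresholding step linking the two functions (the threshold τi is my notation, not the slide's):

\[
\Phi : D \times C \to \{T, F\},
\qquad
\mathrm{CSV}_i : D \to [0,1],
\qquad
\Phi(d, c_i) = T \iff \mathrm{CSV}_i(d) \ge \tau_i
\]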

THE TWO-DIMENSIONAL MODEL
– Each document is represented as a point on a 2-D Cartesian plane.
– The representation is based on two parameters, presence and expressiveness, which measure the frequency of a term in the documents of each category.
Advantages:
– No explicit need for feature selection to reduce dimensionality.
– Limited space required to store objects compared to NB.
– Lower computational cost for classifier training.
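The slide's coordinate formulas were figures; a sketch of the usual 2DPM construction, assuming a document d evaluated against a category c is averaged over its |d| terms:

\[
X(d) = \frac{1}{|d|} \sum_{t_k \in d} P(t_k \mid c),
\qquad
Y(d) = \frac{1}{|d|} \sum_{t_k \in d} P(t_k \mid \bar{c})
\]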

BASIC PROBABILITIES
Given the set of categories C = {c1, c2, …} and the vocabulary of terms V = {t1, t2, …}, build the sample space of term–category pairs Ω = {(tk, ci)}.

BASIC PROBABILITIES
The probability of a term given a category, and its extension to a set of categories.
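The formulas themselves were images; a plausible reconstruction, assuming the document-frequency (maximum-likelihood) estimates typical of the 2DPM, where N(tk, ci) is the number of training documents of ci containing tk and N(ci) is the number of training documents of ci:

\[
P(t_k \mid c_i) = \frac{N(t_k, c_i)}{N(c_i)},
\qquad
P(t_k \mid C') = \frac{\sum_{c_i \in C'} N(t_k, c_i)}{\sum_{c_i \in C'} N(c_i)}
\]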

BASIC PROBABILITIES
The probability of a set of terms given a category or a set of categories.
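Because the model treats terms as disjoint events (see the Introduction), the probability of a set of terms is additive; a sketch under that assumption:

\[
P(\{t_1, \dots, t_n\} \mid C') = \sum_{k=1}^{n} P(t_k \mid C')
\]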

BASIC PROBABILITIES
The "peculiarity" of a term, given a chosen set of categories:
– It combines the probability of finding the term in the documents of those categories with the probability of not finding it in their complement.
– Presence: how frequently the term occurs in the categories.
– Expressiveness: how distinctive the term is for that set of categories.
– It is useful when computing probabilities over complements of sets.
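The slide's formula is an image; one plausible reading, treating the two probabilities as independent factors (an assumption on my part):

\[
\mathrm{pec}(t_k, C') = P(t_k \mid C') \cdot \bigl(1 - P(t_k \mid \bar{C'})\bigr)
\]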

BASIC PROBABILITIES
The probability of a term given the complementary set of categories.
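Continuing the count-based reconstruction above, the complementary estimate would be taken over the categories outside C':

\[
P(t_k \mid \bar{C'}) = \frac{\sum_{c_i \notin C'} N(t_k, c_i)}{\sum_{c_i \notin C'} N(c_i)}
\]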

BASIC PROBABILITIES
The probability of a term having chosen a set of categories:
– It indicates finding the term in that set of documents while not finding it in the complement of the set (the peculiarity of the previous slide).

2-D REPRESENTATION AND CATEGORIZATION
– To categorize, find the probability of a document's set of terms in a set of categories.
– Breaking the expression into two components makes it possible to plot each document as a point.
– The natural way to assign d to c is to compare the two components; a threshold q improves the separation, and an angular coefficient m sets the slope of the decision line.
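A sketch of that decision step, assuming each document has already been reduced to its two coordinates (x from the chosen categories, y from their complement); the line y = m·x + q is the separator, with m and q tuned on held-out data:

def assign_to_category(x, y, m, q):
    """Assign the document to c iff its point lies below the decision line y = m*x + q."""
    return y < m * x + q

# With m = 1 and q = 0 (the diagonal), a document is accepted when its terms
# are on average more probable under c than under its complement.
print(assign_to_category(x=0.42, y=0.31, m=1.0, q=0.0))  # True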

EXPERIMENTS
Datasets:
– Reuters-21578 (135 potential categories): the first experiment uses the 10 most frequent categories, the second the 90 most frequent.
– Reuters Corpus Volume 1 (RCV1): about 810,000 newswire stories (Aug. 1996 to Aug. 1997), with 30,000 training documents and 36,000 test documents.

EXPERIMENTS
Pre-processing and evaluation:
– Remove all punctuation and convert to lower case.
– Remove the most frequent terms of the English language (stop words).
– K-fold cross validation (k = 5) tunes the FAR (Focused Angular Region) algorithm for the best separation.
– Recall and precision are computed for each category and combined into a single measure (F1).
– Macro-averaging and micro-averaging are computed for each of the previous measures.
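A sketch of that evaluation with scikit-learn; the labels are illustrative. Micro-averaging pools decisions across all categories, while macro-averaging weights every category equally, which is why the two can diverge on imbalanced data:

from sklearn.metrics import f1_score

# Illustrative gold labels and predictions over three categories.
y_true = ["earn", "grain", "earn", "money-fx", "grain", "earn"]
y_pred = ["earn", "earn", "earn", "money-fx", "grain", "grain"]

print("micro-F1:", f1_score(y_true, y_pred, average="micro"))
print("macro-F1:", f1_score(y_true, y_pred, average="macro"))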

ANALYSES AND RESULTS
Comparing the 2DPM with multinomial NB:
– Case 1: NB performs better than the 2DPM.
– Case 2: almost the same, but the macro-average is halved.
– Case 3: as in case 2, but the macro-average increases.
– Case 4: NB performs better than the 2DPM on micro-average but worse on macro-average.

CONCLUSION
– Plotting makes the classifier's decisions easy to understand.
– Rotating the decision line results in better separation of the two classes (the Focused Angular Region algorithm).
– The 2DPM's performance is at least equal to the NB model's.
– The 2DPM is better on macro-averaged F1.

Critiques
– The paper is not clear in directing the reader to the final results.
– It derives the probabilities for many cases but does not show how to use them.
– It focuses on the theoretical side, while the results and analysis make up only about 10% of the paper.
– The main algorithm the paper depends on, the Focused Angular Region, is merely mentioned without enough explanation.

Thank you