How to classify reading passages into predefined categories ASH.


Text Classification
- Assigning subject categories, topics, or genres
- Spam detection
- Authorship identification
- Age/gender identification
- Language identification
- Sentiment analysis
- And so on...

Process
Document -> Pre-processing -> Feature/indexing -> Feature filtering -> Applying classification algorithm -> Performance measure

Pre-processing
Documents -> Removal of stop words -> Stemming -> Pre-processed data
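
A minimal pre-processing sketch in Python (the stop-word list and the suffix stripper below are illustrative stand-ins, not the tools behind the experiments reported later; real pipelines typically use NLTK or spaCy):

```python
# Minimal pre-processing sketch: tokenize, remove stop words, crude stemming.
import re

STOP_WORDS = {"the", "a", "an", "is", "are", "and", "or", "to", "of", "in", "it"}

def crude_stem(word):
    # Strip a few common suffixes; a stand-in for a real stemmer such as Porter.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(document):
    tokens = re.findall(r"[a-z']+", document.lower())      # tokenize
    tokens = [t for t in tokens if t not in STOP_WORDS]     # remove stop words
    return [crude_stem(t) for t in tokens]                   # stem

print(preprocess("The movies are classified into predefined categories"))
# ['movi', 'classifi', 'into', 'predefin', 'categori']
```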

Classification Methods (1)
- Manual classification
  - Used by the original Yahoo! Directory
  - Accurate when the job is done by experts
  - Consistent when the problem size and the team are small
  - Difficult and expensive to scale
  - So we need automatic classification methods for big problems

Classification Methods (2)
- Hand-coded rule-based classifiers
  - Commercial systems have complex query languages
  - Accuracy can be high if a rule has been carefully refined over time by a subject expert
  - Building and maintaining these rules is expensive

Classification Methods (3): Supervised learning
- Given:
  - A document d
  - A fixed set of classes: C = {c1, c2, ..., cJ}
  - A training set D of documents, each with a label in C
- Determine:
  - A learning method or algorithm which will enable us to learn a classifier γ
  - For a test document d, we assign it the class γ(d) ∈ C

Classification Methods (3)
- Supervised learning
  - Naive Bayes (simple, common)
  - k-Nearest Neighbors (simple, powerful)
  - Support-vector machines (new, generally more powerful)
  - ... plus many other methods
- No free lunch: requires hand-classified training data
- But the data can be built up (and refined) by amateurs
- Many commercial systems use a mixture of methods
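
For illustration, a small supervised-learning sketch using scikit-learn (the documents and labels below are invented; any of the other listed methods could be swapped in for MultinomialNB):

```python
# Tiny supervised text-classification sketch with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_docs = [
    "the team won the championship game",
    "parliament passed the new budget law",
    "the striker scored two goals",
    "the senate debated the tax bill",
]
train_labels = ["sports", "politics", "sports", "politics"]

vectorizer = CountVectorizer()                 # bag-of-words features
X_train = vectorizer.fit_transform(train_docs)

clf = MultinomialNB()                          # the learned classifier γ
clf.fit(X_train, train_labels)

X_test = vectorizer.transform(["the goalkeeper saved the game"])
print(clf.predict(X_test))                     # γ(d), e.g. ['sports']
```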

What is the Naïve Bayes Classifier?
- A simple probabilistic classifier based on Bayes' theorem with strong (naïve) independence assumptions.
- Assumption: the features used in the classification are independent of each other.
- Despite the naïve design and oversimplified assumptions, Naïve Bayes performs well in many complex real-world problems.

When to use the Naïve Bayes Text Classifier?
- It is less computationally intensive than most alternatives and requires little training time.
- Use it when you have limited CPU and memory resources.
- Also when training time is a crucial factor.
- Keep in mind that the Naïve Bayes classifier is used as a baseline in much research.

Bayes' Rule Applied to Documents and Classes
- For a document d and a class c:
  P(c | d) = P(d | c) P(c) / P(d)

Naive Bayes Classifiers
- Task: classify a new instance D, described by a tuple of attribute values (x1, x2, ..., xn), into one of the classes cj ∈ C.
- MAP = Maximum A Posteriori: choose the most probable class given the attribute values:
  c_MAP = argmax_{cj ∈ C} P(cj | x1, x2, ..., xn)
        = argmax_{cj ∈ C} P(x1, x2, ..., xn | cj) P(cj)

Underflow Prevention: log space
- Multiplying lots of probabilities, which are between 0 and 1, can result in floating-point underflow.
- The product gets rounded to zero, and our analyses become useless.
- Instead, sum the logs of the probabilities: the class with the highest final un-normalized log-probability score is still the most probable.
- Note that the model is now just a max over a sum of weights.
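
A small sketch of the underflow point (the probabilities below are arbitrary illustrative values):

```python
# Why log space matters: a long product of small probabilities underflows,
# while the equivalent sum of logs stays in a safe numeric range.
import math

probs = [1e-5] * 80          # 80 illustrative per-word probabilities

product = 1.0
for p in probs:
    product *= p
print(product)               # 0.0  (underflowed to zero)

log_score = sum(math.log(p) for p in probs)
print(log_score)             # about -921.0, perfectly usable for argmax
```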

The Naïve Bayes Classifier
- 1) Conditional Independence Assumption: features (term presence) are independent of each other given the class:
  P(x1, ..., xn | c) = P(x1 | c) · P(x2 | c) · ... · P(xn | c)
- 2) Position Independence Assumption: word appearance does not depend on position:
  P(Xi = w | c) = P(Xj = w | c)  for all positions i, j, words w, and classes c

Learning the Model
- Maximum likelihood estimates: simply use the frequencies in the data.
  P(cj) = doc-count(C = cj) / N_doc          (fraction of documents with class cj)
  P(wi | cj) = count(wi, cj) / Σ_w count(w, cj)   (fraction of tokens in class-cj documents that are wi)

Problem with Maximum Likelihood
- What if a particular feature/word does not appear in a particular class? Its estimate is zero, and the whole product becomes zero.
- To avoid this, we use Laplace smoothing, adding 1 to each count:
  P(wi | cj) = (count(wi, cj) + 1) / (Σ_w count(w, cj) + |V|)
  where |V| is the number of terms contained in the vocabulary.
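
A short numerical sketch of the smoothed estimate (the counts below are invented for illustration):

```python
# Laplace (add-one) smoothing vs. the raw maximum-likelihood estimate.
# Invented counts: the word "tokyo" never occurs in this class.
class_word_counts = {"chinese": 5, "beijing": 1, "shanghai": 1, "macao": 1}
vocab_size = 6
total = sum(class_word_counts.values())        # 8 tokens in this class

def mle(word):
    return class_word_counts.get(word, 0) / total

def laplace(word):
    return (class_word_counts.get(word, 0) + 1) / (total + vocab_size)

print(mle("tokyo"))       # 0.0  -> would zero out the whole product
print(laplace("tokyo"))   # 1/14 ≈ 0.0714
print(laplace("chinese")) # 6/14 ≈ 0.4286
```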

Naïve Bayes: Learning Algorithm
- From the training corpus, extract the Vocabulary.
- Calculate the required P(cj) and P(xk | cj) terms:
- For each cj in C do:
  - docsj ← subset of documents for which the target class is cj
  - P(cj) ← |docsj| / |total number of documents|
  - Textj ← single document containing all docs in docsj
  - For each word xk in Vocabulary:
    - nk ← number of occurrences of xk in Textj
    - P(xk | cj) ← (nk + 1) / (n + |Vocabulary|), where n is the total number of word occurrences in Textj

Naïve Bayes: Classifying
- positions ← all word positions in the current document that contain tokens found in Vocabulary
- Return c_NB, where
  c_NB = argmax_{cj ∈ C} [ P(cj) · Π_{i ∈ positions} P(xi | cj) ]
  (computed in log space in practice)
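
Putting the two slides above together, a compact sketch of the learning algorithm and the classification rule (toy training documents invented for illustration; add-one smoothing and log space as described earlier; this is not the code behind the experiments reported later):

```python
# Multinomial Naive Bayes as in the two slides above: priors from document
# counts, add-one smoothed word likelihoods, argmax over log-probabilities.
import math
from collections import Counter, defaultdict

train = [
    ("sports",   "match win goal goal"),
    ("sports",   "team match referee"),
    ("politics", "vote election law"),
    ("politics", "election debate vote vote"),
]

# --- Learning ---------------------------------------------------------
vocab = set()
docs_per_class = Counter()
word_counts = defaultdict(Counter)      # word_counts[c][w] = count of w in Text_c

for c, doc in train:
    docs_per_class[c] += 1
    for w in doc.split():
        vocab.add(w)
        word_counts[c][w] += 1

prior = {c: docs_per_class[c] / len(train) for c in docs_per_class}

def likelihood(w, c):
    # P(w | c) with Laplace (add-one) smoothing
    n = sum(word_counts[c].values())    # total tokens in Text_c
    return (word_counts[c][w] + 1) / (n + len(vocab))

# --- Classifying ------------------------------------------------------
def classify(doc):
    positions = [w for w in doc.split() if w in vocab]
    scores = {
        c: math.log(prior[c]) + sum(math.log(likelihood(w, c)) for w in positions)
        for c in prior
    }
    return max(scores, key=scores.get)

print(classify("goal goal vote"))       # likely 'sports' for this toy data
```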

Two Models
- Model 1: Multivariate Bernoulli
  - Used when the absence of a particular word matters in the problem.
  - Example: spam or adult-content detection.
  - Parameter estimate: P(w | cj) = fraction of documents of topic cj in which word w appears.
  - Holds the 1st assumption of Naïve Bayes (conditional independence).

Two Models
- Model 2: Multinomial = class-conditional unigram
  - Used when multiple occurrences of the words matter a lot in the classification problem.
  - Example: topic classification.
  - Holds the 1st and 2nd assumptions of Naïve Bayes.
  - Uses the same parameters for each position; the result is a bag-of-words model (over tokens, not types).
  - Parameter estimate: P(w | cj) = fraction of times word w appears across all documents of topic cj.
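
A sketch of the practical difference between the two models, using scikit-learn's implementations (toy documents; binary features for the Bernoulli model, raw counts for the multinomial one):

```python
# Multivariate Bernoulli vs. multinomial Naive Bayes in scikit-learn.
# BernoulliNB looks only at presence/absence; MultinomialNB uses counts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

docs = ["cheap pills cheap cheap", "meeting agenda minutes",
        "cheap offer click", "project meeting notes"]
labels = ["spam", "ham", "spam", "ham"]

# Bernoulli model: binary presence/absence features
bern_vec = CountVectorizer(binary=True)
bern = BernoulliNB().fit(bern_vec.fit_transform(docs), labels)

# Multinomial model: term-frequency features
multi_vec = CountVectorizer()
multi = MultinomialNB().fit(multi_vec.fit_transform(docs), labels)

test = ["cheap cheap cheap meeting"]
print(bern.predict(bern_vec.transform(test)))    # presence/absence of terms only
print(multi.predict(multi_vec.transform(test)))  # repeated 'cheap' counts add up
```

With binary features, the repeated "cheap" in the test document counts only once for the Bernoulli model, which is exactly the distinction the slide draws.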

The bag of words representation
- Example document: "I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet."
- The classifier maps the whole document to a class: γ(d) = c
- Naïve Bayes relies on the bag-of-words representation.

The bag of words representation: using a subset of words

The bag of words representation
- The same document reduced to word counts: great: 2, love: 2, recommend: 1, laugh: 1, happy: 1, ...
- γ(d) = c

 Bag of words for document classification
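
A minimal sketch of turning a document into the bag-of-words counts shown above (the example review and vocabulary below are invented):

```python
# Bag of words: throw away order and position, keep per-word counts.
from collections import Counter
import re

def bag_of_words(text, vocabulary=None):
    tokens = re.findall(r"[a-z']+", text.lower())
    if vocabulary is not None:                 # optional: keep only a subset of words
        tokens = [t for t in tokens if t in vocabulary]
    return Counter(tokens)

review = "I love this movie. It's great, I love the humor and would recommend it."
print(bag_of_words(review, vocabulary={"love", "great", "recommend", "happy"}))
# Counter({'love': 2, 'great': 1, 'recommend': 1})
```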

Feature Selection: Why?
- Text collections have a large number of features: 10,000 to 1,000,000 unique words, and more.
- Feature selection:
  - Makes using a particular classifier feasible (some classifiers can't deal with hundreds of thousands of features).
  - Reduces training time (training time for some methods is quadratic or worse in the number of features).
  - Can improve generalization (performance) by eliminating noise features.

Feature selection: how?
- Two ideas:
  - Chi-square test (χ²)
    - Tests whether the occurrence of a specific term and the occurrence of a specific class are independent.
    - Statistical foundation.
    - May select rare, uninformative terms.
  - Mutual information (MI)
    - How much information the value of one categorical variable gives you about the value of another.
    - Clear information-theoretic interpretation.
    - May select very slightly informative frequent terms that are not very useful for classification.
- They're similar, but χ² measures confidence in the association (based on the available statistics), while MI measures the extent of the association (assuming perfect knowledge of the probabilities).

Feature Selection: Frequency
- The simplest feature selection method: just use the most common terms.
- No particular theoretical foundation, but it makes sense why this works: they're the words that can be well estimated and are most often available as evidence.
- In practice, this is often 90% as good as better methods.
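
As an illustration, chi-square feature selection with scikit-learn (toy documents; k = 4 is just an example setting):

```python
# Chi-square feature selection: keep the k terms whose occurrence is most
# strongly dependent on the class label.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["cheap pills buy now", "team wins the final",
        "buy cheap offer now", "the team lost the match"]
labels = ["spam", "sports", "spam", "sports"]

vec = CountVectorizer()
X = vec.fit_transform(docs)

selector = SelectKBest(chi2, k=4)          # keep the 4 highest-scoring terms
X_reduced = selector.fit_transform(X, labels)

kept = selector.get_support(indices=True)
print([vec.get_feature_names_out()[i] for i in kept])
```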

 Example

The denominators are (8 + 6) and (3 + 6) because the lengths of Text_c and Text_not-c are 8 and 3, respectively, and because the constant B (the size of the vocabulary) is 6, as the vocabulary consists of six terms.

Example
- Thus, the classifier assigns the test document to c = China.
- The reason for this classification decision is that the three occurrences of the positive indicator CHINESE in d5 outweigh the occurrences of the two negative indicators JAPAN and TOKYO.
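
The figures above (denominators 8 + 6 and 3 + 6, a six-term vocabulary, test document d5) match the standard worked example in Manning et al., Introduction to Information Retrieval, Chapter 13; a sketch reproducing the computation under the assumption that the training data is that textbook example:

```python
# Worked example: multinomial NB with add-one smoothing on the China/Japan
# training set assumed from IIR Ch. 13 (which these denominators correspond to).
from math import prod

train = [
    ("china",     "Chinese Beijing Chinese"),
    ("china",     "Chinese Chinese Shanghai"),
    ("china",     "Chinese Macao"),
    ("not china", "Tokyo Japan Chinese"),
]
d5 = "Chinese Chinese Chinese Tokyo Japan"

vocab = {w.lower() for _, doc in train for w in doc.split()}    # 6 terms
B = len(vocab)

def score(c):
    docs = [doc.lower().split() for cls, doc in train if cls == c]
    prior = len(docs) / len(train)
    tokens = [w for doc in docs for w in doc]                    # Text_c
    cond = lambda w: (tokens.count(w) + 1) / (len(tokens) + B)   # add-one smoothing
    return prior * prod(cond(w.lower()) for w in d5.split())

for c in ("china", "not china"):
    print(c, round(score(c), 5))
# 'china' (~0.0003) beats 'not china' (~0.0001), so d5 is assigned to China.
```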

Evaluating Categorization
- Evaluation must be done on test data that are independent of the training data.
- Sometimes cross-validation is used (averaging results over multiple training and test splits of the overall data).
- Measures: precision, recall, F1, classification accuracy.
- Classification accuracy: r / n, where n is the total number of test docs and r is the number of test docs correctly classified.
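
A small sketch of the accuracy measure, plus precision/recall/F1 via scikit-learn's metrics (the gold labels and predictions below are invented):

```python
# Classification accuracy r/n, plus macro-averaged precision/recall/F1.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = ["sci", "hum", "sci", "sci", "hum", "hum"]   # invented gold labels
y_pred = ["sci", "hum", "hum", "sci", "hum", "sci"]   # invented predictions

r = sum(t == p for t, p in zip(y_true, y_pred))
n = len(y_true)
print("accuracy = r/n =", r / n)                      # 4/6 ≈ 0.667
print(accuracy_score(y_true, y_pred))                 # same value

p, rec, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")
print(p, rec, f1)
```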

Experiment
- Training data: sets
- Categories: Science vs. Humanity (binary), or Science, Humanity, Social Science, Art, Business (5 categories)
- Test data: using the 10-fold (K-fold) cross-validation method

Two category

Classifier       Accuracy   Precision   Recall   F-measure
NB               92.49%     92.9%       92.5%
RBF network      59.32%     58.5%       59.3%    57.2%
Bagging          92.34%     92.5%       92.3%    92.4%
ZeroR            56.25%     31.6%       56.3%    40.5%
Random Forest    90.47%     91%         90.5%
Lib SVM          94.85%     94.9%

Five category

Classifier       Accuracy   Precision   Recall   F-measure
NB               76.91%     78.5%       76.9%
RBF network      50.08%     25.1%       50.1%    33.4%
Bagging          81.93%     81.6%       81.9%    81.6%
ZeroR            50.08%     25.1%       50.1%    33.4%
Random Forest    76.67%     76.2%       76.7%    75.8%
Lib SVM          88.61%     88.6%                88.5%

Naive Bayes is Not So Naive
- Very fast learning and testing (highly efficient)
- Low storage requirements
- Robust to irrelevant (noise) features and concept drift
