Text Classification, Active/Interactive learning.

Text Categorization
Categorization of documents, based on topics:
– When the topics are known in advance – categorization (supervised)
– When the topics are unknown – clustering (unsupervised)

Supervised learning
Training data – documents with their assigned (true) categories
Documents are represented by feature vectors x = (x_1, x_2, …, x_n)

Document features
– Word frequencies
– Stems/lemmas
– Phrases
– POS tags
– Semantic features (concepts, named entities)
– tf-idf: tf = term frequency, idf = inverse document frequency

TF-IDF weights
TF-IDF definitions:
– tf_ik: #occurrences of term t_k in document D_i
– df_k: #documents which contain t_k
– idf_k: log(d / df_k), where d is the total number of documents
– w_ik: tf_ik · idf_k (term weight)
Intuition: rare words get more weight, common words get less weight.
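A minimal sketch of this weighting in plain Python (the function name and the toy corpus are illustrative, not from the original slides):

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    d = len(docs)                                     # total number of documents
    df = Counter()                                    # df_k: #documents containing term k
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)                             # tf_ik: #occurrences of term k in document i
        weights.append({t: tf[t] * math.log(d / df[t]) for t in tf})
    return weights

docs = [["rare", "word", "word"], ["word", "common"], ["common", "word"]]
print(tfidf_weights(docs))   # "word" appears in every document, so its idf (and weight) is 0
```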

Naïve Bayes classifier
– Tightly tied to text categorization
– Interesting theoretical properties
– A simple example of an important class of learners based on generative models that approximate how the data is produced
– For certain special cases, NB is the best thing you can do

Bayes’ rule
P(h | D) = P(D | h) · P(h) / P(D)
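A tiny numerical illustration of the rule (the numbers are made up for the example, not taken from the slides):

```python
# Hypothetical numbers, only to illustrate Bayes' rule.
p_h = 0.2            # P(h): prior probability that a document is about sports
p_d_given_h = 0.6    # P(D|h): probability of seeing the word "playoffs" in a sports document
p_d = 0.15           # P(D): overall probability of seeing "playoffs" in any document

p_h_given_d = p_d_given_h * p_h / p_d
print(p_h_given_d)   # P(h|D) = 0.8
```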

Maximum a posteriori hypothesis
h_MAP = argmax_{h ∈ H} P(h | D)
      = argmax_{h ∈ H} P(D | h) · P(h) / P(D)
      = argmax_{h ∈ H} P(D | h) · P(h), as P(D) is constant

Maximum likelihood hypothesis
If all hypotheses are a priori equally likely, we only need to consider the P(D | h) term:
h_ML = argmax_{h ∈ H} P(D | h)

Naïve Bayes classifiers
Task: classify a new instance D, described by a tuple of attribute values (x_1, x_2, …, x_n), into one of the classes c_j ∈ C:
c_MAP = argmax_{c_j ∈ C} P(c_j | x_1, x_2, …, x_n) = argmax_{c_j ∈ C} P(x_1, x_2, …, x_n | c_j) · P(c_j)

Naïve Bayes assumption
– P(c_j) can be estimated from the frequency of classes in the training examples.
– P(x_1, x_2, …, x_n | c_j) has O(|X|^n · |C|) parameters and could only be estimated if a very, very large number of training examples was available.
– Naïve Bayes conditional independence assumption: assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities:
P(x_1, x_2, …, x_n | c_j) = ∏_i P(x_i | c_j)

Naïve Bayes for text categorization
– Attributes are text positions, values are words
– Still too many possibilities
– Assume that classification is independent of the positions of the words:
  – Use the same parameters for each position
  – The result is the bag-of-words model (over tokens, not types)
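Under the bag-of-words model a document reduces to its token counts, with word order discarded (a minimal illustration):

```python
from collections import Counter

# Position is ignored; only how often each token occurs matters.
print(Counter("the game was a great game".split()))
# e.g. Counter({'game': 2, 'the': 1, 'was': 1, 'a': 1, 'great': 1})
```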

Naïve Bayes: learning probabilities
From the training corpus, extract the Vocabulary
Calculate the required P(c_j) and P(x_k | c_j) terms:
– For each c_j in C do:
  – docs_j ← the subset of documents for which the target class is c_j
  – P(c_j) ← |docs_j| / (total number of documents)
  – Text_j ← a single document containing all of docs_j concatenated
  – For each word x_k in Vocabulary:
    – n_k ← number of occurrences of x_k in Text_j
    – P(x_k | c_j) ← (n_k + 1) / (n + |Vocabulary|), where n is the total number of word positions in Text_j
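A minimal sketch of this learning procedure (multinomial model with the add-one smoothing shown above; function and variable names are mine, not from the slides):

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels):
    """docs: list of token lists; labels: the class of each document.
    Returns (log P(c), log P(x|c), Vocabulary)."""
    vocab = {w for doc in docs for w in doc}                  # extract Vocabulary
    classes = set(labels)
    log_prior, log_cond = {}, defaultdict(dict)
    for c in classes:
        docs_c = [d for d, y in zip(docs, labels) if y == c]  # docs_j
        log_prior[c] = math.log(len(docs_c) / len(docs))      # P(c_j)
        text_c = Counter(w for d in docs_c for w in d)        # Text_j as one mega-document
        n = sum(text_c.values())
        for w in vocab:                                       # P(x_k|c_j) with add-one smoothing
            log_cond[c][w] = math.log((text_c[w] + 1) / (n + len(vocab)))
    return log_prior, log_cond, vocab
```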

Naïve Bayes: classifying
positions ← all word positions in the current document which contain tokens found in Vocabulary
Return c_NB, where
c_NB = argmax_{c_j ∈ C} P(c_j) · ∏_{i ∈ positions} P(x_i | c_j)

Underflow prevention
Multiplying many small probabilities can underflow floating-point arithmetic. Since log(xy) = log(x) + log(y), it is better to sum the logs of the probabilities instead of multiplying them. The class with the highest final (unnormalized) log score is still the most probable:
c_NB = argmax_{c_j ∈ C} [log P(c_j) + Σ_{i ∈ positions} log P(x_i | c_j)]
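A sketch of the classification step in log space, reusing train_naive_bayes from the sketch above (the toy documents are, again, illustrative only):

```python
def classify(doc, log_prior, log_cond, vocab):
    """Sum log-probabilities instead of multiplying raw probabilities,
    so long documents do not underflow to 0.0."""
    scores = {}
    for c in log_prior:
        score = log_prior[c]                       # log P(c_j)
        for w in doc:
            if w in vocab:                         # only positions whose tokens are in Vocabulary
                score += log_cond[c][w]            # + log P(x_i | c_j)
        scores[c] = score
    return max(scores, key=scores.get)             # c_NB = argmax over classes

docs = [["win", "game", "playoffs"], ["vote", "election"], ["game", "score"]]
labels = ["sports", "politics", "sports"]
model = train_naive_bayes(docs, labels)
print(classify(["playoffs", "game"], *model))      # -> "sports"
```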

Binomial Naïve Bayes
– One feature X_w for each word in the dictionary
– X_w = true in document d if w appears in d
– Naïve Bayes assumption: given the document’s topic, the appearance of one word in the document tells us nothing about the chances that another word appears

Parameter estimation
– Binomial model: P(X_w = true | c_j) = fraction of documents of topic c_j in which word w appears
– Multinomial model: P(X_i = w | c_j) = fraction of times in which word w appears across all documents of topic c_j
  – Can create a mega-document for topic j by concatenating all documents in this topic
  – Use the frequency of w in the mega-document
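A sketch of the two estimators (function names are mine; the toy data is illustrative):

```python
from collections import Counter

def bernoulli_estimate(docs_c, w):
    # Binomial model: fraction of documents of topic c_j in which word w appears
    return sum(w in d for d in docs_c) / len(docs_c)

def multinomial_estimate(docs_c, w):
    # Multinomial model: frequency of w in the "mega-document" built by
    # concatenating all documents of topic c_j
    mega = Counter(tok for d in docs_c for tok in d)
    return mega[w] / sum(mega.values())

docs_c = [["game", "game", "win"], ["vote", "game"]]
print(bernoulli_estimate(docs_c, "game"))    # 1.0  ("game" occurs in both documents)
print(multinomial_estimate(docs_c, "game"))  # 0.6  (3 of the 5 tokens are "game")
```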

Learning probabilities
– Passive learning – learning from the annotated corpus
– Active learning – learning only from the most informative instances (real-time annotation)
– Interactive learning – the annotator can choose to suggest specific features (e.g., “playoffs” to indicate sports) and not just complete instances

{Inter}active learning
Available actions:
– Annotate an instance with a class
– Annotate a feature with a class
– Suggest a new feature and annotate it with a class

Adding a prior to the features’ maximum-likelihood estimate
P(x_k | c_j) = (n_kj + α_kj) / Σ_k′ (n_k′j + α_k′j)
where n_kj is the count of word x_k in the documents of class c_j, α_kj is the prior (pseudo-count) for word x_k under class c_j, and the denominator is the normalization factor (summing over all words).

Adding a prior to the classes’ maximum-likelihood estimate
P(c_j) = (N_j + α_j) / Σ_j′ (N_j′ + α_j′)
where N_j is the number of documents of class c_j, α_j is the prior (pseudo-count) for class c_j, and the denominator is the normalization factor (summing over all classes).
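A small sketch of these prior-smoothed estimates (assuming Dirichlet-style pseudo-counts, with a default pseudo-count of 1.0 for anything not explicitly annotated; the exact prior values used in the original slides are not shown):

```python
def feature_prob(word_counts, alpha):
    """word_counts: {word: count} for one class; alpha: {word: pseudo-count} from annotated features."""
    words = set(word_counts) | set(alpha)
    z = sum(word_counts.get(w, 0) + alpha.get(w, 1.0) for w in words)   # normalization over all words
    return {w: (word_counts.get(w, 0) + alpha.get(w, 1.0)) / z for w in words}

def class_prob(doc_counts, alpha):
    """doc_counts: {class: #labeled documents}; alpha: {class: pseudo-count}."""
    classes = set(doc_counts) | set(alpha)
    z = sum(doc_counts.get(c, 0) + alpha.get(c, 1.0) for c in classes)  # normalization over all classes
    return {c: (doc_counts.get(c, 0) + alpha.get(c, 1.0)) / z for c in classes}
```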

Semi-supervised learning (learning from unlabeled data)
– Learn the model probabilities (P(c_j) and P(x_k | c_j)) using only the priors
– Apply the induced classifier to the unlabeled instances
– Re-estimate the probabilities using the labeled as well as the probabilistically labeled instances (multiplying the counts of the latter by 0.1 to avoid overwhelming the model)
– Possibly iterate this process
– This is actually EM (Expectation-Maximization)
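A sketch of this loop as hard EM (the 0.1 down-weighting follows the slide; using hard pseudo-labels rather than full class posteriors is a simplification, and all names are mine):

```python
import math
from collections import Counter, defaultdict

def train_weighted_nb(examples):
    """examples: list of (tokens, label, weight); multinomial NB with add-one smoothing."""
    vocab = {w for doc, _, _ in examples for w in doc}
    class_weight = defaultdict(float)              # weighted document counts per class
    word_weight = defaultdict(Counter)             # weighted word counts per class
    for doc, label, weight in examples:
        class_weight[label] += weight
        for w in doc:
            word_weight[label][w] += weight
    total = sum(class_weight.values())
    log_prior = {c: math.log(class_weight[c] / total) for c in class_weight}
    log_cond = {c: {w: math.log((word_weight[c][w] + 1) /
                                (sum(word_weight[c].values()) + len(vocab)))
                    for w in vocab}
                for c in class_weight}
    return log_prior, log_cond, vocab

def predict(doc, log_prior, log_cond, vocab):
    return max(log_prior, key=lambda c: log_prior[c] +
               sum(log_cond[c][w] for w in doc if w in vocab))

def semi_supervised_em(labeled, unlabeled, iterations=3):
    """labeled: list of (tokens, label); unlabeled: list of token lists."""
    model = train_weighted_nb([(d, y, 1.0) for d, y in labeled])
    for _ in range(iterations):
        # E-step: label the unlabeled documents with the current classifier
        pseudo = [(d, predict(d, *model), 0.1) for d in unlabeled]   # 0.1 keeps them from overwhelming the model
        # M-step: re-estimate the probabilities from labeled + probabilistically labeled data
        model = train_weighted_nb([(d, y, 1.0) for d, y in labeled] + pseudo)
    return model
```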

Suggesting instances for annotation
Use a weight function for each document d; for example, the entropy-based uncertainty weight:
H(d) = −Σ_{c_j ∈ C} P(c_j | d) · log P(c_j | d)
Then suggest the top D documents.
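A sketch of this uncertainty-based selection; posterior_fn is assumed to return a normalized {class: probability} distribution for a document (e.g. exponentiated and renormalized Naïve Bayes log scores):

```python
import math

def entropy(posterior):
    """posterior: {class: probability} for one document."""
    return -sum(p * math.log(p) for p in posterior.values() if p > 0)

def suggest_instances(docs, posterior_fn, top_d=10):
    """Return the top_d documents the current model is most uncertain about."""
    return sorted(docs, key=lambda d: entropy(posterior_fn(d)), reverse=True)[:top_d]
```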

Suggesting features for annotation
Rank features by information gain; then suggest the top V features, each for the class with which it co-occurs most.
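A sketch using a standard information-gain score over a labeled collection (the exact formulation in the original slides is not shown; all names are mine):

```python
import math
from collections import Counter

def _entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values()) if n else 0.0

def info_gain(docs, labels, word):
    """How much knowing whether `word` occurs reduces class entropy."""
    with_w = [y for d, y in zip(docs, labels) if word in d]
    without_w = [y for d, y in zip(docs, labels) if word not in d]
    n = len(labels)
    return (_entropy(labels)
            - (len(with_w) / n) * _entropy(with_w)
            - (len(without_w) / n) * _entropy(without_w))

def suggest_features(docs, labels, vocab, top_v=10):
    ranked = sorted(vocab, key=lambda w: info_gain(docs, labels, w), reverse=True)
    return ranked[:top_v]   # each would be suggested for the class it co-occurs with most
```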

Results of 3 annotators, comparing active, interactive, and passive learning (from Burr Settles, “Closing the Loop: Fast, Interactive Semi-Supervised Annotation With Queries on Features and Instances”).