Active Learning. Haidong Shi, Nanyi Zeng. November 12, 2008.

Outline
1. Introduction
2. Active learning with different methods
3. Employing EM and pool-based active learning for text classification

Introduction
1. What is active learning?
2. Why is active learning important?
3. Real-life applications

Introduction The primary goal of machine learning is to derive general patterns from a limited amount of data. For most supervised and unsupervised learning tasks, we gather a large quantity of data sampled at random from the underlying population distribution and then induce a classifier or model.

Introduction But this process is rather passive. Often the most time-consuming and costly task is gathering the data. Example: document classification. It is easy to obtain a large pool of unlabeled documents, but it takes people a long time to hand-label thousands of training documents.

Introduction Now, instead of randomly picking documents to be manually labeled for our training set, we want to choose and query documents from the pool carefully. By choosing the training data carefully, we can improve the model's performance much more quickly.

What is active learning? Active learning is the process of guiding the sampling by querying for certain types of instances based on the data we have seen so far.

Why important Labeling training data is not only time-consuming but often very expensive. The less training data we need, the more we save.

Applications
- Text classification
- Web page classification
- Junk mail recognition

Active learning with different methods
1. Neural networks
2. Bayesian rule
3. SVM
No matter which method is used, the core problem is the same.

Active learning with different methods The core problem: how do we select training points actively? In other words, which training points will be most informative to the model?

Applying active learning to neural networks Combined with query by committee. Algorithm:
1. Sample two neural networks from the distribution.
2. When an unlabeled example arrives, use the committee to predict its label.
3. If the members disagree with each other, select the example for labeling.

Applying active learning to neural networks In practice:
- The committee may contain more than two members.
- For classification, count the #(+) and #(-) votes and check whether they are close.
- For regression, use the variance of the outputs as the disagreement criterion.
- Stop when the maximum model variance drops below a set threshold.
A minimal code sketch of this selection rule follows below.
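The slides contain no code; the following is a minimal Python sketch of this query-by-committee selection rule. The committee list, the `.predict` interface, and the disagreement threshold are illustrative assumptions, not part of the original presentation.

```python
import numpy as np

def qbc_select(committee, unlabeled_X, task="classification", threshold=0.5):
    """Return indices of unlabeled points the committee disagrees on most.

    committee   : list of trained models, each with a .predict(X) method
    unlabeled_X : array of shape (n_samples, n_features)
    """
    # Stack each member's predictions: shape (n_members, n_samples)
    preds = np.array([m.predict(unlabeled_X) for m in committee])

    if task == "classification":
        # Assumes labels are non-negative integers.
        # Disagreement = fraction of members differing from the majority vote.
        majority = np.apply_along_axis(
            lambda col: np.bincount(col.astype(int)).argmax(), 0, preds)
        disagreement = (preds != majority).mean(axis=0)
    else:
        # Regression: use the variance of the members' outputs.
        disagreement = preds.var(axis=0)

    # Query the points whose disagreement exceeds the threshold.
    return np.where(disagreement > threshold)[0]
```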

Applying active learning to Bayesian theory Characteristic: build a probabilistic classifier that not only makes classification decisions but also estimates its uncertainty. We try to estimate P(C_i | w), the posterior probability that an example with pattern w belongs to class C_i. P(C_i | w) then directly guides the selection of training data.
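As an illustration (not from the slides), here is a minimal sketch of uncertainty sampling driven by posterior class probabilities; the entropy criterion and all names are assumptions.

```python
import numpy as np

def uncertainty_select(posteriors, n_queries=10):
    """Pick the examples whose posterior P(C_i | w) is most uncertain.

    posteriors : array of shape (n_samples, n_classes); each row sums to 1
    Returns the indices of the n_queries most uncertain examples.
    """
    # Entropy of the predicted class distribution: higher = more uncertain.
    entropy = -np.sum(posteriors * np.log(posteriors + 1e-12), axis=1)
    return np.argsort(entropy)[-n_queries:][::-1]

# Example: three unlabeled examples, two classes (illustrative numbers)
post = np.array([[0.95, 0.05],   # confident
                 [0.55, 0.45],   # uncertain -> good query candidate
                 [0.70, 0.30]])
print(uncertainty_select(post, n_queries=1))   # -> [1]
```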

Applying active learning to SVM The question is again: what is the criterion for uncertainty sampling? We can improve the model by attempting to narrow the existing margin as much as possible. Adding points that lie on or close to the separating hyperplane to the training set will, on average, narrow the margin the most.
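A minimal sketch of this margin-based selection, assuming a linear SVM whose weight vector w and bias b are already trained; the function and parameter names are illustrative.

```python
import numpy as np

def margin_select(w, b, unlabeled_X, n_queries=5):
    """Pick unlabeled points closest to the hyperplane w.x + b = 0.

    w, b        : parameters of an already trained *linear* SVM (assumed given)
    unlabeled_X : array of shape (n_samples, n_features)
    """
    scores = unlabeled_X @ w + b
    # Points strictly inside the margin satisfy |w.x + b| < 1.
    in_margin = np.abs(scores) < 1.0
    if not in_margin.any():
        # Stopping criterion: the margin has been exhausted.
        return np.array([], dtype=int)
    candidates = np.where(in_margin)[0]
    # Among the in-margin points, query those closest to the hyperplane.
    order = np.argsort(np.abs(scores[candidates]))
    return candidates[order[:n_queries]]
```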

Applying active learning to SVM Stopping criterion: stop once all unlabeled data inside the margin have been exhausted. Why? Only unlabeled data within the margin have a large effect on the learner. Labeling an example in the margin may shift the margin so that examples that were previously outside are now inside.

Employing EM and Pool-Based Active Learning for Text Classification Motivation: obtaining labeled training examples for text classification is often expensive, while gathering large quantities of unlabeled examples is very cheap. Here we present techniques for using a large pool of unlabeled documents to improve text classification when labeled training data is sparse.

How data are produced We approach text classification from a Bayesian learning perspective: we assume the documents are generated by a particular parametric model, a mixture of naïve Bayes components, with a one-to-one correspondence between class labels and mixture components.

How data are produced Let c_j denote the jth mixture component, which corresponds to the jth class. Each component c_j is parameterized by a disjoint subset of θ. The likelihood of a document is a sum of total probability over all generative components:

P(d_i | θ) = Σ_j P(c_j | θ) P(d_i | c_j; θ)

How data are produced A document d_i is treated as an ordered list of word events. w_{d_i,k} denotes the word in position k of document d_i, where the subscript of w indexes into the vocabulary V. This is combined with the standard naïve Bayes assumption: words in a document are independent of the other words in the same document, given the class.
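To make the generative model concrete, here is a small sketch (not from the slides) that computes the mixture likelihood of a single document, assuming a multinomial naïve Bayes parameterization; the numbers are purely illustrative.

```python
import numpy as np

def doc_log_likelihood(word_counts, priors, word_probs):
    """log P(d_i | theta) = log sum_j P(c_j | theta) P(d_i | c_j; theta).

    word_counts : length-|V| vector of word counts for document d_i
    priors      : length-|C| vector of mixture weights P(c_j | theta)
    word_probs  : |C| x |V| matrix of word probabilities P(w_t | c_j; theta)
    """
    # Naive Bayes: words are independent given the component, so
    # log P(d_i | c_j) = sum_t N(w_t, d_i) * log P(w_t | c_j)  (up to a length term)
    log_joint = np.log(priors) + np.log(word_probs) @ word_counts
    # Log-sum-exp over components gives the (log) total probability.
    m = log_joint.max()
    return m + np.log(np.exp(log_joint - m).sum())

# Toy example: 2 components, vocabulary of 3 words (illustrative numbers)
priors = np.array([0.5, 0.5])
word_probs = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.2, 0.7]])
print(doc_log_likelihood(np.array([2, 1, 1]), priors, word_probs))
```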

Goal Given these assumptions about how the data are produced, the task of learning a text classifier consists of forming an estimate of θ, written θ̂, based on a training set.

Formula If the task is to classify a test document d_i into a single class, we simply select the class with the highest posterior probability:

argmax_j P(c_j | d_i; θ̂)
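A minimal sketch of this classification rule, assuming multinomial naïve Bayes parameter estimates are given in log space; the toy numbers are purely illustrative.

```python
import numpy as np

def classify(word_counts, log_priors, log_word_probs):
    """Return argmax_j P(c_j | d_i; theta_hat) for a single document.

    word_counts    : length-|V| vector of word counts for document d_i
    log_priors     : length-|C| vector, log P(c_j | theta_hat)
    log_word_probs : |C| x |V| matrix, log P(w_t | c_j; theta_hat)
    """
    # log P(c_j | d_i) = log P(c_j) + sum_t N(w_t, d_i) log P(w_t | c_j) + const
    log_posterior = log_priors + log_word_probs @ word_counts
    return int(np.argmax(log_posterior))

# Toy example: 2 classes, vocabulary of 3 words (illustrative numbers)
log_priors = np.log([0.5, 0.5])
log_word_probs = np.log([[0.7, 0.2, 0.1],
                         [0.1, 0.2, 0.7]])
print(classify(np.array([3, 1, 0]), log_priors, log_word_probs))  # -> 0
```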

EM and Unlabeled data Problem: when naïve Bayes is given only a small set of labeled training data, classification accuracy suffers because the variance in the parameter estimates of the generative model is high.

EM and Unlabeled data Motivation: By augmenting this small labeled set with a large set of unlabeled data and combining the two pools with EM, we can improve the parameter estimates.

Implementation of EM
1. Initialize θ̂ using the labeled data only.
2. E-step: calculate probabilistically-weighted class labels, P(c_j | d_i; θ̂), for every unlabeled document.
3. M-step: calculate a new maximum-likelihood estimate of θ using all the documents, both the labeled ones and the probabilistically labeled unlabeled ones.
4. Iterate the E- and M-steps until the process reaches a fixed point.
A minimal sketch of this loop follows below.
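This is a minimal Python sketch of the EM loop above for a multinomial naïve Bayes model, assuming bag-of-words count matrices; the Laplace smoothing constant and fixed iteration count are simplifications, not details from the slides.

```python
import numpy as np

def semi_supervised_nb(X_lab, y_lab, X_unl, n_classes, n_iter=20, alpha=1.0):
    """EM for naive Bayes over labeled and unlabeled word-count matrices.

    X_lab : (n_lab, |V|) word counts for labeled docs, y_lab their integer labels
    X_unl : (n_unl, |V|) word counts for unlabeled docs
    """
    # Hard labels for the labeled pool, expressed as class-membership weights.
    W_lab = np.eye(n_classes)[y_lab]                       # (n_lab, n_classes)

    def m_step(W_all, X_all):
        # Maximum-likelihood estimates with Laplace smoothing (assumption).
        priors = W_all.sum(axis=0) + alpha
        priors /= priors.sum()
        word_counts = W_all.T @ X_all + alpha              # (n_classes, |V|)
        word_probs = word_counts / word_counts.sum(axis=1, keepdims=True)
        return np.log(priors), np.log(word_probs)

    # Initialize using the labeled data only.
    log_priors, log_word_probs = m_step(W_lab, X_lab)

    for _ in range(n_iter):
        # E-step: probabilistically-weighted class labels for unlabeled docs.
        log_post = log_priors + X_unl @ log_word_probs.T   # (n_unl, n_classes)
        log_post -= log_post.max(axis=1, keepdims=True)
        W_unl = np.exp(log_post)
        W_unl /= W_unl.sum(axis=1, keepdims=True)

        # M-step: re-estimate theta from all documents, labeled and unlabeled.
        log_priors, log_word_probs = m_step(
            np.vstack([W_lab, W_unl]), np.vstack([X_lab, X_unl]))

    return log_priors, log_word_probs
```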

Active learning with EM

Disagreement criterion We measure committee disagreement for each document using Kullback-Leibler divergence to the mean. KL divergence to the mean is the average of the KL divergence between each member's class distribution and the mean of all the members' distributions:

(1/k) Σ_m D_KL( P_m(C | d) || P_mean(C | d) ),  where  P_mean(C | d) = (1/k) Σ_m P_m(C | d)
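A small sketch (not from the slides) of this disagreement measure for one document, given each committee member's class distribution.

```python
import numpy as np

def kl_to_the_mean(member_dists):
    """Committee disagreement for one document: mean KL divergence to the mean.

    member_dists : (k, n_classes) array; row m is member m's P(C | d)
    Returns (1/k) * sum_m KL(P_m || P_mean).
    """
    member_dists = np.asarray(member_dists, dtype=float)
    mean_dist = member_dists.mean(axis=0)
    eps = 1e-12                     # avoid log(0) for zero-probability classes
    kl_each = np.sum(
        member_dists * np.log((member_dists + eps) / (mean_dist + eps)), axis=1)
    return kl_each.mean()

# Two committee members over three classes: mild disagreement (illustrative)
print(kl_to_the_mean([[0.6, 0.3, 0.1],
                      [0.2, 0.5, 0.3]]))   # larger value = more disagreement
```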

END Thank you