Learning with Positive and Unlabeled Examples using Weighted Logistic Regression
Wee Sun Lee, National University of Singapore
Bing Liu, University of Illinois, Chicago

Personalized Web Browser
Learn web pages that are of interest to you!
Information that is available to the browser when it is installed:
– Your bookmarks (or cached documents) – positive examples
– All documents on the web – unlabeled examples!

Direct Marketing
A company has a database with details of its customers – positive examples.
It wants to find people who are similar to its own customers.
It buys a database with details of other people, some of whom may be potential customers – unlabeled examples.

Assumptions
– All examples are drawn independently from a fixed underlying distribution.
– Negative examples are never labeled.
– With a fixed probability α, a positive example is independently left unlabeled.
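
A minimal sketch (illustrative only, not from the slides) of these sampling assumptions in code: examples are drawn i.i.d., negatives are never labeled, and each positive is independently left unlabeled with probability α (called alpha below).

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_pu_sample(X, y, alpha=0.7):
    """Split an i.i.d. sample (X, y), with y in {0, 1}, into labeled positives
    and unlabeled examples under the assumptions above."""
    is_pos = (y == 1)
    # Each positive is independently left unlabeled with probability alpha.
    left_unlabeled = rng.random(len(y)) < alpha
    labeled_pos = is_pos & ~left_unlabeled
    unlabeled = ~labeled_pos            # all negatives plus the hidden positives
    return X[labeled_pos], X[unlabeled]
```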

Are Unlabeled Examples Helpful?
The target function is known to be either x1 < 0 or x2 > 0. Which one is it?
[Figure: 2D plot of labeled positive examples and unlabeled examples (u).]
Not learnable with only positive examples. However, the addition of unlabeled examples makes it learnable.

Related Work
– Denis (1998) showed that function classes learnable in the statistical query model are learnable from positive and unlabeled examples.
– Muggleton (2001) showed that learning from positive examples is possible if the distribution of inputs is known.
– Liu et al. (2002) give sample complexity bounds and an algorithm based on EM.
– Yu et al. (2002) give an algorithm based on SVM.
– …

Approach
Label all unlabeled examples as negative (Denis 1998):
– Negative examples are always labeled negative.
– Positive examples are labeled negative with probability α.
This is training with one-sided noise.
Problem: α is not known.
Also, what if there is some noise on the negative examples – negative examples occasionally labeled positive with a small probability?

Selecting Threshold and Robustness to Noise
Approach: reweight the examples and learn the conditional probability P(Y=1|X).
Weight the examples as follows:
– multiply the negative examples by a weight equal to the number of positive examples, and
– multiply the positive examples by a weight equal to the number of negative examples.

Selecting Threshold and Robustness to Noise
Then P(Y=1|X) > 0.5 when X is a positive example and P(Y=1|X) < 0.5 when X is a negative example, as long as α + β < 1, where
– α is the probability that a positive example is labeled negative, and
– β is the probability that a negative example is labeled positive.
This holds even if some of the examples labeled positive are not actually positive (noise).
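
An illustrative derivation (notation assumed, not taken from the slides) of why the 0.5 threshold is correct under this weighting, for an example x whose true label is deterministic:

```latex
% Let s be the label after the relabeling step, \alpha = P(s=0 \mid y=1),
% \beta = P(s=1 \mid y=0), and \pi = P(y=1).  Weighting each observed positive in
% proportion to P(s=0) and each observed negative in proportion to P(s=1)
% (i.e. by the opposite class counts), the classifier targets
\[
  P_w(s=1 \mid x)
  = \frac{P(s=0)\,P(s=1 \mid x)}{P(s=0)\,P(s=1 \mid x) + P(s=1)\,P(s=0 \mid x)} .
\]
% For a truly positive x, P(s=1 \mid x) = 1-\alpha and P(s=0 \mid x) = \alpha, so
\[
  P_w(s=1 \mid x) > \tfrac{1}{2}
  \;\iff\; (1-\alpha)\,P(s=0) > \alpha\,P(s=1)
  \;\iff\; (1-\alpha)(1-\beta)(1-\pi) > \alpha\beta(1-\pi)
  \;\iff\; \alpha + \beta < 1 ,
\]
% where the middle step expands P(s=0) = \alpha\pi + (1-\beta)(1-\pi) and
% P(s=1) = (1-\alpha)\pi + \beta(1-\pi) and cancels the common \alpha(1-\alpha)\pi
% term.  The argument for a truly negative x is symmetric.
```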

Weighted Logistic Regression
Practical algorithm: reweight the examples and then do logistic regression with a linear function to learn P(Y=1|X).
– Compose the linear function with a sigmoid, then do maximum likelihood estimation.
– Convex optimization problem.
– Will learn the correct conditional probability if it can be represented.
– Minimizes an upper bound on the weighted classification error if it cannot be represented – so it still makes sense.
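
A minimal code sketch of this weighted logistic regression step (assuming scikit-learn; not the authors' implementation). X_pos and X_unl are assumed dense NumPy feature matrices for the labeled positives and the unlabeled examples:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_weighted_lr(X_pos, X_unl, C=1.0):
    # Step 1: treat every unlabeled example as a negative.
    X = np.vstack([X_pos, X_unl])
    s = np.concatenate([np.ones(len(X_pos)), np.zeros(len(X_unl))])

    # Step 2: reweight - each positive gets weight = number of negatives,
    # each negative gets weight = number of positives.
    n_pos, n_neg = len(X_pos), len(X_unl)
    w = np.where(s == 1, n_neg, n_pos).astype(float)

    # Step 3: L2-regularized logistic regression (sigmoid of a linear function),
    # fit by maximum likelihood on the weighted examples.  In scikit-learn, C is
    # the inverse of the regularization strength.
    clf = LogisticRegression(C=C, max_iter=1000)
    clf.fit(X, s, sample_weight=w)
    return clf

# Examples with predicted P(Y=1|X) > 0.5 (i.e. clf.predict(...) == 1) are positive.
```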

Selecting Regularization Parameter
Regularization is important when learning with noise.
Add c times the sum of squared weight values to the cost function as regularization.
How to choose the value of c?
– When both positive and negative examples are available, a validation set can be used to choose c.
– Weighted examples in a validation set could be used to choose c, but it is not clear that this makes sense.

Selecting Regularization Parameter
The performance criterion pr/P(Y=1) can be estimated directly from a validation set as r²/P(f(X)=1), where
– recall r = P(f(X)=1 | Y=1), and
– precision p = P(Y=1 | f(X)=1).
This criterion can be used for
– tuning the regularization parameter c, and
– comparing different algorithms when only positive and unlabeled examples (no negatives) are available.
Its behavior is similar to the commonly used F-score F = 2pr/(p+r):
– reasonable whenever use of the F-score is reasonable.
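
A small sketch (an assumed helper, not from the slides) of estimating pr/P(Y=1) as r²/P(f(X)=1) on a validation split containing only labeled positives (X_val_pos) and unlabeled examples (X_val_unl):

```python
import numpy as np

def pu_score(clf, X_val_pos, X_val_unl):
    # r = P(f(X)=1 | Y=1): fraction of labeled positives predicted positive.
    pred_pos = clf.predict(X_val_pos)
    r = np.mean(pred_pos == 1)
    # P(f(X)=1): fraction of all validation examples predicted positive.
    pred_unl = clf.predict(X_val_unl)
    p_f1 = np.mean(np.concatenate([pred_pos, pred_unl]) == 1)
    return r ** 2 / p_f1 if p_f1 > 0 else 0.0

# Model selection: keep the regularization value whose classifier maximizes
# pu_score on the validation split, then retrain on train + validation data.
```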

Experimental Setup
– 20 Newsgroups dataset: 1 group positive, the other 19 negative.
– Term frequency as features, normalized to length 1.
– Random split: 50% train, 20% validation, 30% test.
– Validation set used to select the regularization parameter from a small discrete set, then retrain on training + validation sets.
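
A rough sketch of this setup (assuming scikit-learn; the specific preprocessing choices and seeds below are assumptions, not the authors' exact pipeline):

```python
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize
from sklearn.model_selection import train_test_split

data = fetch_20newsgroups(subset='all')
docs, y = data.data, np.asarray(data.target)
labels = (y == 0).astype(int)        # e.g. first group positive, the other 19 negative

# Term-count features normalized to unit (L2) length.
X = normalize(CountVectorizer().fit_transform(docs), norm='l2')

# 50% train / 20% validation / 30% test.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, labels, train_size=0.5, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, train_size=0.4, random_state=0)
# 0.4 of the remaining 50% gives the 20% validation split; the rest is the 30% test split.
```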

Results
[Table: F-score averaged over the 20 groups, comparing weighted logistic regression under the Opt, pr/P(Y=1), and Weighted Error selection criteria with S-EM and 1-Class SVM; numeric values omitted from this transcript.]

Conclusions
– Learn from positive and unlabeled examples by learning P(Y=1|X) after setting all unlabeled examples to negative.
  – Reweighting the examples allows a threshold at 0.5 and makes the method tolerant to negative examples that are mislabeled as positive.
– The performance measure pr/P(Y=1) can be estimated from the data.
  – Useful whenever the F-score is a reasonable measure.
  – Can be used to select the regularization parameter.
– Logistic regression with linear functions, combined with these methods, works well on text classification.