Learning at Low False Positive Rate
Scott Wen-tau Yih, Joshua Goodman (Learning for Messaging and Adversarial Problems, Microsoft Research)
Geoff Hulten (Microsoft)

Slide 2
- Regular way: give him lots of positive & negative examples to learn
- Violent way: punch him for false positives
- Collaborative way: train the 2nd using examples that the 1st thinks are spam

Slide 3
- Regular way: give him lots of positive & negative examples to learn
- Violent way: 10%
- Collaborative way: 15% ~ 30%
- Both combined: 20% ~ 40%

Slide 4: Low False-Positive Region
- Improving the whole ROC curve? Fantastic! But...
- We only care about the low false-positive region!
[ROC plot: x-axis from 0 (no good mail caught) to 1 (all good mail caught); y-axis from 0 (no missed spam) to 1 (all spam missed)]

Slide 5: Outline
- Quick review of ROC curves
- Methods to improve spam filters in the low false-positive region
  - 2-stage framework
  - Training with utility
- Experiments
- Related work
- Conclusions

Slide 6: False-Positive vs. False-Negative
- X: ratio of misclassified good mail (FPR)
- Y: ratio of missed spam (FNR)
- A statistical filter assigns scores to messages
  - Change its behavior by choosing different thresholds
[ROC curve plot, axes as on the previous slide]

Slide 7: Properties of ROC Curves
- The ROC curve shows the trade-off between false positives (misclassified good mail) and false negatives (missed spam) at different thresholds
  - θ = 0.5 may not be the best choice
  - Decide θ according to the ROC curve
- The ranking determines the ROC curve, not the absolute scores
- For spam filtering, we only care about how much spam the filter can catch when the false-positive rate is low
  - The cost of missing good mail is much higher than the cost of not catching spam!
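The talk doesn't include code, but as a concrete illustration of picking θ from the ROC curve, here is a minimal sketch using scikit-learn; the function name, the toy labels and scores, and the false-positive budget are all made up for the example.

```python
import numpy as np
from sklearn.metrics import roc_curve

def threshold_at_fpr(y_true, scores, target_fpr):
    """Pick the threshold that catches the most spam while keeping
    the false-positive rate within the given budget; also return the
    false-negative rate (missed spam) at that operating point."""
    # label 1 = spam (positive class), label 0 = good mail
    fpr, tpr, thresholds = roc_curve(y_true, scores)
    within = fpr <= target_fpr        # operating points within budget
    best = np.argmax(tpr[within])     # most spam caught within budget
    return thresholds[within][best], 1.0 - tpr[within][best]

# Toy usage: allow at most 25% of good mail to be misclassified.
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
s = np.array([0.10, 0.20, 0.60, 0.30, 0.40, 0.80, 0.90, 0.95])
theta, fnr = threshold_at_fpr(y, s, target_fpr=0.25)
```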

Slide 8: Outline
- Quick review of ROC curves
- Methods to improve spam filters in the low false-positive region
  - 2-stage framework (the collaborative way)
  - Training with utility (the violent way)
- Experiments
- Related work
- Conclusions

Slide 9: 2-Stage Framework: Idea
- Forget easy good mail and hard spam
  - The messages that have low scores
  - They fall in the high false-positive region
- Try to do a better job on the other messages

Slide 10: Using the 2-Stage Framework
- We train 2 models, one per stage
- Apply the 1st-stage model to all messages
- For messages with scores less than θ, don't change the order or scores
- Re-score and re-order the remaining messages using the 2nd-stage model
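A minimal sketch of this test-time logic, assuming both stages are probabilistic classifiers with a predict_proba method (as in scikit-learn); mapping the 2nd-stage scores into [θ, 1] is one assumed way to keep them above every untouched 1st-stage score, since the slides don't specify how the two score ranges are reconciled.

```python
import numpy as np

def two_stage_scores(stage1, stage2, X, theta=0.5):
    """Score all messages with stage 1; messages below theta keep
    their stage-1 score, the rest are re-scored by stage 2."""
    s1 = stage1.predict_proba(X)[:, 1]   # P(spam) from the 1st stage
    final = s1.copy()
    hard = s1 >= theta                   # the low false-positive region
    if hard.any():
        s2 = stage2.predict_proba(X[hard])[:, 1]
        # Rescale stage-2 scores into [theta, 1] so they still rank
        # above every message left untouched by the 1st stage.
        final[hard] = theta + (1.0 - theta) * s2
    return final
```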

Slide 11: Training the 2-Stage Framework (1/2)
- The naïve way
  - Train the 1st-stage model as usual
  - Score the training messages using the 1st-stage model
  - Use the subset of the training data whose scores are larger than θ as the training data for the 2nd-stage model
- Problem
  - The 1st-stage scores on its own training data tend to be too good, and differ from its scores on unseen data
- Solution: use cross-validation
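A sketch of the cross-validated variant, with logistic regression assumed as the base learner (the talk uses both naïve Bayes and logistic regression); the function name and the θ default are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def train_two_stage(X, y, theta=0.5, cv=5):
    """Train both stages; the 2nd-stage training set is selected by
    out-of-fold 1st-stage scores, which behave like scores on unseen
    mail rather than overly optimistic training scores."""
    stage1 = LogisticRegression(max_iter=1000).fit(X, y)
    # Out-of-fold P(spam): each message is scored by a model that
    # never saw that message during training.
    oof = cross_val_predict(LogisticRegression(max_iter=1000),
                            X, y, cv=cv, method="predict_proba")[:, 1]
    hard = oof > theta
    stage2 = LogisticRegression(max_iter=1000).fit(X[hard], y[hard])
    return stage1, stage2
```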

Slide 12: Why Does 2-Stage Work?
- The 2-stage framework provides a more complex hypothesis space, which may fit the data better
- Suppose the base model is a linear classifier
- Pick the subset of the data in the region you care about
  - Find all messages, good and spam, that are more than, say, 50% likely to be spam according to the first model
- Train a new model on only this data
- At test time, use both models
[Three scatter plots of spam vs. good mail illustrating the two stages]

Slide 13: Training with Utility
- Motivation again: why do we care more about the low false-positive region?
  - The "cost" or "utility" of a false-positive error (misclassified good mail) is much higher
- A common approach is to select the right threshold to get the desired FP/FN trade-off
- A less common approach is to "re-weight" the data (training with utility)
  - It's more important to get negative examples right
  - Duplicate the negative examples 10 times and train the model on the new data
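A minimal sketch of training with utility; a per-example weight of 10 on good mail is equivalent to the literal 10x duplication above for most learners, and the learner, weight value, and function name here are assumptions rather than the talk's exact setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_with_utility(X, y, neg_weight=10.0):
    """Up-weight negative examples (good mail, y == 0) so that
    mistakes on them cost more during training; weighting by w
    has the same effect as duplicating each negative w times."""
    w = np.where(y == 0, neg_weight, 1.0)
    return LogisticRegression(max_iter=1000).fit(X, y, sample_weight=w)
```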

Slide 14: Does It Work for Naïve Bayes?
- Training with utility is usually applied to non-probabilistic learning algorithms, such as SVMs, when the data is highly skewed
- It has been argued that training with utility has no effect on naïve Bayes
  - Only the prior changes
  - Each probability is multiplied by a constant
  - Effectively, the decision hyperplane only shifts; it does not rotate

Slide 15: The Real Results
- In practice, training with utility improves both naïve Bayes and logistic regression filters
- For naïve Bayes, smoothing is the key
  - Training with utility is equivalent to having different smoothing parameters for positive and negative examples
- For logistic regression, the hyperplane may both shift and rotate, even without smoothing
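To unpack the smoothing bullet (a short derivation not on the slide, but it follows from the standard Laplace-smoothed estimate): with smoothing parameter α and vocabulary size V, duplicating every good message k times changes the per-word estimate to

  P(w | good) = (k·c_w + α) / (k·N + α·V) = (c_w + α/k) / (N + (α/k)·V),

where c_w is the count of word w in good mail and N is the total word count in good mail. With α = 0 the factor k cancels, so only the class prior changes and the log-odds shift by a constant: the hyperplane shifts but does not rotate, exactly as the previous slide argues. With α > 0, duplication is the same as smoothing the good class with the smaller parameter α/k, which changes every per-word log-likelihood ratio, so the hyperplane rotates as well.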

Slide 16: Outline
- Quick review of ROC curves
- Methods to improve spam filters in the low false-positive region
  - 2-stage framework
  - Training with utility
- Experiments
- Related work
- Conclusions

Slide 17: Data
- Hotmail Feedback Loop
  - Polls over 100,000 Hotmail users daily
  - Each user is asked to hand-label a sampled message addressed to him as Good or Spam
  - Very roughly 3% of user labels are errors
- Training data (7/1/05 ~ 11/30/05)
  - 5,000 messages per day; 765,000 messages in total
- Testing data (12/1/05 ~ 12/15/05)
  - 10,000 messages per day; 150,000 messages in total
- Features (excluding proprietary features)
  - Subject keywords and body keywords

Slide 18: Logistic Regression (20% improvement)

Slide 19: Naïve Bayes (40% improvement)

Slide 20: Related Work
- 2-stage framework
  - Different from boosting:
    - We specifically focus on the low false-positive region
    - We use cross-validation
    - We combine the classifiers differently
  - A special form of decision list with only 2 layers, or a cascade of classifiers ([Viola & Jones '01], [Roth & Yih '01])
    - Cascades improve overall accuracy or speed up the system
    - Cascades use more features or more complex models in later stages
- Training with utility
  - Has been done before, but not for spam, we believe
  - Typically used for unbalanced data, as opposed to emphasizing a low false-positive rate

Slide 21: Conclusions
- Reduced false negatives by 20% ~ 40% at low false-positive rates
  - Training with utility: 10% for both NB and LR
  - Two-stage filtering works even better: 15% ~ 30%
  - The combination works best: a 40% gain for naïve Bayes and 20% for logistic regression
- Both techniques can potentially be used with most other learning methods
- Key insight: specifically targeting the low false-positive rate at training time yields better results