1 CS546: Machine Learning and Natural Language
Discriminative vs Generative Classifiers
This lecture is based on the (Ng & Jordan, 2002) paper; some slides are based on Tom Mitchell's slides.

2 Outline
- Reminder: Naive Bayes and Logistic Regression (MaxEnt)
- Asymptotic analysis: which one is better if you have an infinite dataset?
- Non-asymptotic analysis: what is the rate of convergence of the parameters? More important: convergence of the expected error
- Empirical evaluation
Why this lecture?
- It is a nice and simple application of the Large Deviation bounds we considered before
- We will analyze specifically NB vs logistic regression, but "hope" that it generalizes to other models (e.g., models for sequence labeling or parsing)

3 Discriminative vs Generative
Training classifiers involves estimating f: X -> Y, or P(Y|X).
Discriminative classifiers (conditional models):
- Assume some functional form for P(Y|X)
- Estimate the parameters of P(Y|X) directly from the training data
Generative classifiers (joint models):
- Assume some functional form for P(X|Y) and P(Y)
- Estimate the parameters of P(X|Y) and P(Y) directly from the training data
- Use Bayes rule to calculate P(Y|X = x), as written out below
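
The Bayes rule step in the generative route, written out for reference (a standard identity, not copied from the slide):

\[
P(Y=y \mid X=x) \;=\; \frac{P(X=x \mid Y=y)\,P(Y=y)}{\sum_{y'} P(X=x \mid Y=y')\,P(Y=y')}.
\]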

4 Naive Bayes
Example: assume Y is boolean and X = (x_1, ..., x_n), where the x_i are binary.
Generative model: Naive Bayes, i.e. P(X, Y) = P(Y) prod_i P(x_i | Y), with parameters estimated from smoothed counts, e.g.
  P(x_i = 1 | Y = y) = (s(x_i = 1, Y = y) + l) / (s(Y = y) + 2l),
where s(.) indicates the size of the corresponding set of training examples and l is a smoothing parameter.
Classify a new example x based on the ratio
  [ P(Y = 1) prod_i P(x_i | Y = 1) ] / [ P(Y = 0) prod_i P(x_i | Y = 0) ]
(you can do it in log scale); a sketch follows below.
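
A minimal sketch of such a Bernoulli Naive Bayes classifier in Python, assuming 0/1 feature vectors in NumPy arrays; the function and variable names are my own, not from the lecture:

import numpy as np

def train_naive_bayes(X, y, l=1.0):
    """Estimate Bernoulli Naive Bayes parameters with add-l smoothing.
    X: (m, n) array of 0/1 features, y: (m,) array of 0/1 labels."""
    m = X.shape[0]
    params = {}
    # smoothed class priors P(Y = c)
    params["prior"] = {c: (np.sum(y == c) + l) / (m + 2 * l) for c in (0, 1)}
    # smoothed conditionals P(x_i = 1 | Y = c), one vector of length n per class
    params["cond"] = {
        c: (X[y == c].sum(axis=0) + l) / (np.sum(y == c) + 2 * l) for c in (0, 1)
    }
    return params

def log_score(params, x):
    """log [ P(Y=1) prod_i P(x_i|Y=1) ] - log [ P(Y=0) prod_i P(x_i|Y=0) ]."""
    score = 0.0
    for c, sign in ((1, +1.0), (0, -1.0)):
        p = params["cond"][c]
        score += sign * (np.log(params["prior"][c])
                         + np.sum(x * np.log(p) + (1 - x) * np.log(1 - p)))
    return score

# Usage: predict 1 if log_score(params, x) > 0, else predict 0.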

5 Naive Bayes vs Logistic Regression
Generative model: Naive Bayes classifies a new example x based on the ratio P(Y = 1 | x) / P(Y = 0 | x), i.e. on the sign of its logarithm.
Logistic regression models the conditional distribution directly:
  P(Y = 1 | x) = 1 / (1 + exp(-(w·x + b))).
Recall: both classifiers are linear, i.e. both decision boundaries have the form w·x + b = 0 (see the derivation below).
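
To see why Naive Bayes is also a linear classifier, its log-odds can be expanded for binary features (a standard derivation, added here for clarity):

\[
\log\frac{P(Y=1\mid x)}{P(Y=0\mid x)}
= \log\frac{P(Y=1)}{P(Y=0)}
+ \sum_{i=1}^{n}\left[ x_i \log\frac{P(x_i=1\mid Y=1)}{P(x_i=1\mid Y=0)}
+ (1-x_i)\log\frac{P(x_i=0\mid Y=1)}{P(x_i=0\mid Y=0)} \right]
= w\cdot x + b
\]

for suitable w and b. This is exactly the linear log-odds form that logistic regression assumes; the two methods differ only in how w and b are estimated.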

What is the difference asymptotically?
Notation: let ε(h_{A,m}) denote the error of the hypothesis learned via algorithm A from m examples.
- If the Naive Bayes model is true (the features really are conditionally independent given the class), both classifiers converge to the same asymptotic error.
- Otherwise, the logistic regression estimator is still consistent: ε(h_{Dis,m}) converges to the minimum error over H, where H is the class of all linear classifiers (formalized below).
- Therefore, it is asymptotically at least as good as the linear classifier selected by the NB algorithm.
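
In this notation, the asymptotic comparison the slide makes can be written as follows (a paraphrase of Ng & Jordan's asymptotic claim, not a formula taken from the slide):

\[
\epsilon(h_{A,m}) := \text{error of the hypothesis learned by algorithm } A \text{ from } m \text{ examples},
\qquad
\epsilon(h_{\mathrm{Dis},\infty}) = \min_{h \in H} \epsilon(h) \;\le\; \epsilon(h_{\mathrm{Gen},\infty}),
\]

where H is the class of all linear classifiers, with equality when the Naive Bayes assumptions hold.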

Rate of convergence: logistic regression
Logistic regression converges to the best linear classifier with a number of examples on the order of n, the input dimensionality. This follows from Vapnik's structural risk bound (the VC dimension of n-dimensional linear separators is n + 1); the bound is sketched below.
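
The bound alluded to has roughly the following shape (constants omitted; stated from the standard uniform-convergence bound rather than copied from the slide): with probability at least 1 - δ,

\[
\epsilon(h_{\mathrm{Dis},m}) \;\le\; \epsilon(h_{\mathrm{Dis},\infty})
+ O\!\left(\sqrt{\frac{n}{m}\log\frac{m}{n} + \frac{1}{m}\log\frac{1}{\delta}}\right),
\]

so on the order of n examples (up to logarithmic factors) suffice to get within any fixed gap of the asymptotic error.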

Rate of convergence: Naive Bayes
We will proceed in 2 stages:
- Consider how fast the parameters converge to their optimal values (we do not actually care about this by itself)
- What we do care about: derive how this translates into convergence of the error to the asymptotic error
The authors also consider a continuous case (where the input is continuous), but it is not very interesting for NLP; however, similar techniques apply.

9 Convergence of Parameters
Lemma (informal statement): m = O(log(n / δ)) training examples suffice so that, with probability at least 1 - δ, every Naive Bayes parameter estimate (each P(x_i | y) and the class prior P(y)) is within a small ε of its asymptotic value.

10 Recall: Chernoff Bound
For i.i.d. random variables X_1, ..., X_m taking values in [0, 1] with mean p, the empirical mean p̂ = (1/m) Σ_i X_i satisfies
  P(|p̂ - p| > ε) ≤ 2 exp(-2 ε² m)
(this is the Hoeffding form of the bound, the one used in the proof below).
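
A quick numerical illustration of this bound for Bernoulli variables, the way it is used in the proof below; the parameter values are arbitrary choices for illustration:

import numpy as np

rng = np.random.default_rng(0)
p, m, eps, trials = 0.3, 200, 0.05, 20_000

# Empirical probability that the sample mean deviates from p by more than eps
samples = rng.binomial(1, p, size=(trials, m))
deviations = np.abs(samples.mean(axis=1) - p) > eps
empirical = deviations.mean()

# Chernoff/Hoeffding bound: P(|p_hat - p| > eps) <= 2 * exp(-2 * eps^2 * m)
bound = 2 * np.exp(-2 * eps**2 * m)
print(f"empirical deviation prob ~ {empirical:.4f}, bound = {bound:.4f}")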

11 Recall: Union Bound
For any events A_1, ..., A_k: P(A_1 ∪ ... ∪ A_k) ≤ Σ_i P(A_i).

12 Proof of Lemma (no smoothing, for simplicity)
- By the Chernoff bound, with probability at least 1 - δ/2 the fraction of positive examples is within ε of P(Y = 1); therefore we have at least a constant fraction of the m examples in each class.
- By the Chernoff bound applied to every feature and class label (2n cases), each conditional estimate deviates from its asymptotic value by more than ε only with small probability.
- We have one event with probability at most δ/2 and 2n further events with small probabilities; their joint probability is not greater than the sum (union bound).
- Solve this for m, and you get m = O(log(n / δ)).
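
Writing the last step out with generic constants (the slide's exact constants are not recoverable here, so take this only as a sketch; c ≤ 1 denotes the fraction of examples guaranteed in each class):

\[
2e^{-2\varepsilon^2 m} + 2n\cdot 2e^{-2\varepsilon^2 c m}
\;\le\; 2(2n+1)\,e^{-2\varepsilon^2 c m}
\;\le\; \delta
\quad\text{whenever}\quad
m \;\ge\; \frac{1}{2c\varepsilon^2}\log\frac{2(2n+1)}{\delta} = O\!\left(\log\frac{n}{\delta}\right).
\]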

Implications
With a number of samples logarithmic in n (not linear, as for logistic regression!) the parameters of h_{Gen,m} approach the parameters of h_{Gen,∞}.
Are we done? Not really: this does not automatically imply that the error ε(h_{Gen,m}) approaches ε(h_{Gen,∞}) at the same rate.

Implications
We need to show that h_{Gen,m} and h_{Gen,∞} "often" agree if their parameters are close. We compare the log-scores given by the two models, i.e. the log of the classification ratio under the estimated parameters and under the asymptotic parameters (written out below): the predictions differ only where these two log-scores have opposite signs.
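
Concretely, the two log-scores can be written as follows (my notation: hats denote parameters estimated from m examples, unhatted quantities their asymptotic values):

\[
\ell_m(x) = \log\frac{\hat P(Y=1)\,\prod_i \hat P(x_i\mid Y=1)}{\hat P(Y=0)\,\prod_i \hat P(x_i\mid Y=0)},
\qquad
\ell_\infty(x) = \log\frac{P(Y=1)\,\prod_i P(x_i\mid Y=1)}{P(Y=0)\,\prod_i P(x_i\mid Y=0)},
\]

and h_{Gen,m} and h_{Gen,∞} disagree on x exactly when these two quantities have opposite signs.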

15 Convergence of Classifiers
The resulting bound involves a quantity G, which is the fraction of points that lie very close to the decision boundary. What is this fraction? See later.

Proof of Theorem (sketch)
By the Lemma, with high probability the parameters of h_{Gen,m} are within ε of those of h_{Gen,∞}. This implies that every term in the log-score sum is also within O(ε) of the corresponding term for h_{Gen,∞}, and hence the two log-scores are close; let τ denote the resulting bound on their difference. So h_{Gen,m} and h_{Gen,∞} can have different predictions on x only if the asymptotic log-score of x is within τ of zero, i.e. only if x lies very close to the decision boundary. The probability of this event is exactly the fraction G from the previous slide.

17 Convergence of Classifiers
G -- what is this fraction? This is somewhat more difficult.

What to do with this theorem
This is easy to prove (no proof here, only intuition): a fraction of the terms in the log-score have large expectation; therefore, the sum also has large expectation.

What to do with this theorem
But this is weaker than what we need:
- We have that the expectation is "large"
- We need that the probability of small values is low
What about Chebyshev's inequality? The terms are not independent... how to deal with it?

Corollary from the theorem
Roughly: if the log-score is rarely very close to zero (i.e. the fraction G is small), then a number of examples logarithmic in n suffices for the Naive Bayes error to approach its asymptotic error.
Is this condition realistic? Yes (e.g., we can show it holds under rather realistic conditions).

Empirical Evaluation (UCI datasets)
Three slides of learning-curve plots (plots not shown): the dashed line is logistic regression, the solid line is Naive Bayes.
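
For completeness, a sketch of how such learning curves can be reproduced today with scikit-learn; the dataset (the built-in breast-cancer data, binarized at the feature medians) and the subsample sizes are my own choices, not the UCI datasets used in the paper:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

X, y = load_breast_cancer(return_X_y=True)
X = (X > np.median(X, axis=0)).astype(int)  # binarize features, as in the discrete NB setting
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for m in (25, 50, 100, 200, len(y_tr)):
    idx = np.random.RandomState(0).choice(len(y_tr), size=m, replace=False)
    nb = BernoulliNB(alpha=1.0).fit(X_tr[idx], y_tr[idx])
    lr = LogisticRegression(max_iter=1000).fit(X_tr[idx], y_tr[idx])
    print(f"m={m:4d}  NB error={1 - nb.score(X_te, y_te):.3f}  "
          f"LR error={1 - lr.score(X_te, y_te):.3f}")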

Summary
- Logistic regression has lower asymptotic error...
- But Naive Bayes needs less data to approach its asymptotic error

First Assignment
I am still checking it; I will let you know by/on Friday.
Note, though: do not perform multiple tests (model selection) on the final test set! It is a form of cheating.

Term Project / Substitution
This Friday I will distribute the first phase -- due after the Spring break.
I will be away for the next 2 weeks:
- The first week (Mar 9 - Mar 15): I will be slow to respond. I will be substituted for this week by: Active Learning (Kevin Small), Indirect Supervision (Alex Klementiev), and a presentation by Ryan Cunningham on Friday.
- The week of Mar 16 - Mar 23: no lectures; work on the project, send questions if needed.