
1 CSE 446 Logistic Regression Winter 2012 Dan Weld Some slides from Carlos Guestrin, Luke Zettlemoyer

2 Gaussian Naïve Bayes P(X_i = x | Y = y_k) = N(μ_ik, σ_ik), where N(μ, σ) is the Gaussian density (1/(σ√(2π))) exp(−(x − μ)²/(2σ²)). P(Y | X) ∝ P(X | Y) P(Y). Sometimes Assume Variance – is independent of Y (i.e., σ_i), – or independent of X_i (i.e., σ_k), – or both (i.e., σ)

3 Generative vs. Discriminative Classifiers Want to Learn: h: X → Y – X – features – Y – target classes Bayes optimal classifier – P(Y|X) Generative classifier, e.g., Naïve Bayes: – Assume some functional form for P(X|Y), P(Y) – Estimate parameters of P(X|Y), P(Y) directly from training data – Use Bayes rule to calculate P(Y|X=x) – This is a ‘generative’ model Indirect computation of P(Y|X) through Bayes rule As a result, can also generate a sample of the data, P(X) = Σ_y P(y) P(X|y) Discriminative classifiers, e.g., Logistic Regression: – Assume some functional form for P(Y|X) – Estimate parameters of P(Y|X) directly from training data – This is the ‘discriminative’ model Directly learn P(Y|X) But cannot obtain a sample of the data, because P(X) is not available P(Y | X) ∝ P(X | Y) P(Y)

4 Univariate Linear Regression h_w(x) = w_1 x + w_0 Loss(h_w) = Σ_{j=1..N} L_2(y_j, h_w(x_j)) = Σ_{j=1..N} (y_j − h_w(x_j))² = Σ_{j=1..N} (y_j − (w_1 x_j + w_0))²

5 Understanding Weight Space Loss(h_w) = Σ_{j=1..N} (y_j − (w_1 x_j + w_0))² h_w(x) = w_1 x + w_0

6 Understanding Weight Space h_w(x) = w_1 x + w_0 Loss(h_w) = Σ_{j=1..N} (y_j − (w_1 x_j + w_0))²

7 Finding Minimum Loss h_w(x) = w_1 x + w_0 Loss(h_w) = Σ_{j=1..N} (y_j − (w_1 x_j + w_0))² Argmin_w Loss(h_w): set ∂Loss(h_w)/∂w_0 = 0 and ∂Loss(h_w)/∂w_1 = 0

8 Unique Solution! h_w(x) = w_1 x + w_0 Argmin_w Loss(h_w): w_1 = (N Σ(x_j y_j) − (Σ x_j)(Σ y_j)) / (N Σ(x_j²) − (Σ x_j)²), w_0 = ((Σ y_j) − w_1 (Σ x_j)) / N
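A minimal sketch of this closed-form fit in NumPy (function and variable names are illustrative, not from the slides):

```python
import numpy as np

def fit_univariate(x, y):
    """Closed-form least-squares fit of h_w(x) = w1*x + w0."""
    N = len(x)
    w1 = (N * np.sum(x * y) - np.sum(x) * np.sum(y)) / (N * np.sum(x**2) - np.sum(x)**2)
    w0 = (np.sum(y) - w1 * np.sum(x)) / N
    return w0, w1

# Example: noisy points around y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])
w0, w1 = fit_univariate(x, y)
```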

9 Could also Solve Iteratively Argmin_w Loss(h_w): w := any point in weight space Loop until convergence For each w_i in w do w_i := w_i − α ∂Loss(w)/∂w_i
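A sketch of that iterative loop for the univariate squared loss (the step size alpha, tolerance, and iteration cap are illustrative choices, not values from the slides):

```python
import numpy as np

def gradient_descent_fit(x, y, alpha=0.01, tol=1e-6, max_iters=10000):
    """Minimize sum_j (y_j - (w1*x_j + w0))^2 by gradient descent."""
    w0, w1 = 0.0, 0.0
    for _ in range(max_iters):
        err = y - (w1 * x + w0)              # residuals
        g0 = -2.0 * np.sum(err)              # dLoss/dw0
        g1 = -2.0 * np.sum(err * x)          # dLoss/dw1
        w0, w1 = w0 - alpha * g0, w1 - alpha * g1
        if max(abs(g0), abs(g1)) < tol:      # stop when the gradient is tiny
            break
    return w0, w1
```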

10 Multivariate Linear Regression h_w(x_j) = w_0 + Σ_i w_i x_j,i = Σ_i w_i x_j,i = wᵀ x_j (taking x_j,0 = 1) Argmin_w Loss(h_w) has a unique solution: w* = (XᵀX)⁻¹ Xᵀ y Problem….
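A sketch of the multivariate solution with NumPy, assuming the rows of X are examples and a leading column of ones supplies w_0; solving the linear system is preferred over forming the inverse explicitly:

```python
import numpy as np

def fit_linear(X, y):
    """Multivariate least squares via the normal equation w = (X^T X)^{-1} X^T y."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend x_0 = 1 for the intercept
    w = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)        # more stable than inverting X^T X
    return w                                        # w[0] is w_0, the rest are w_1..w_n
```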

11 Overfitting Regularize!! Penalize high weights Loss(h_w) = Σ_{j=1..N} (y_j − (w_1 x_j + w_0))² + λ Σ_{i=1..k} w_i² Alternatively…. Loss(h_w) = Σ_{j=1..N} (y_j − (w_1 x_j + w_0))² + λ Σ_{i=1..k} |w_i|
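For the squared (L2) penalty the closed-form solution changes only slightly; a sketch, with lam standing in for the regularization strength λ:

```python
import numpy as np

def fit_ridge(X, y, lam=1.0):
    """L2-regularized least squares: w = (X^T X + lam*I)^{-1} X^T y."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    I = np.eye(Xb.shape[1])
    I[0, 0] = 0.0                                   # conventionally, do not penalize the intercept
    return np.linalg.solve(Xb.T @ Xb + lam * I, Xb.T @ y)
```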

12 Regularization: L1 vs. L2

13 Back to Classification 13 P(edible|X)=1 P(edible|X)=0 Decision Boundary

14 Logistic Regression Learn P(Y|X) directly! Assume a particular functional form, e.g., a step function with P(Y=1|X) = 1 on one side of a threshold and 0 on the other. Not differentiable…

15 Logistic Regression Learn P(Y|X) directly! Assume a particular functional form: the logistic function, aka sigmoid

16 Logistic Function in n Dimensions 16 Sigmoid applied to a linear function of the data: Features can be discrete or continuous!
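The functional form the slide points to is the sigmoid applied to a linear combination of the features, P(Y=1|X) = 1 / (1 + exp(−(w_0 + Σ_i w_i X_i))). A minimal sketch in NumPy (function and variable names are illustrative, not from the slides):

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function, 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w0, w):
    """P(Y=1 | X) for logistic regression: a sigmoid of a linear function of the data."""
    return sigmoid(w0 + X @ w)
```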

17 Understanding Sigmoids (sigmoid plots for w_0 = 0, w_1 = −1; w_0 = −2, w_1 = −1; and w_0 = 0, w_1 = −0.5)

18 Very convenient! This functional form implies a linear classification rule! ©Carlos Guestrin 2005-2009
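Why the rule is linear, as a short derivation from the sigmoid form above:

```latex
\text{Predict } Y=1 \iff P(Y=1\mid X) > P(Y=0\mid X)
\iff \frac{P(Y=1\mid X)}{P(Y=0\mid X)} = \exp\!\Big(w_0 + \sum_i w_i X_i\Big) > 1
\iff w_0 + \sum_i w_i X_i > 0
```

So the decision boundary is the hyperplane w_0 + Σ_i w_i X_i = 0.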

19 Logistic regression more generally Logistic regression in the more general case, where Y ∈ {y_1, …, y_R}: one form for k < R, and one for k = R (normalization, so no weights for this class). Features can be discrete or continuous! ©Carlos Guestrin 2005-2009
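The two equations referred to (not rendered in the transcript) are the standard multiclass generalization, with one weight vector per class and the last class carrying no weights so the probabilities normalize:

```latex
P(Y=y_k \mid X) = \frac{\exp\!\big(w_{k0} + \sum_i w_{ki} X_i\big)}
                       {1 + \sum_{j=1}^{R-1} \exp\!\big(w_{j0} + \sum_i w_{ji} X_i\big)}
\quad \text{for } k < R,
\qquad
P(Y=y_R \mid X) = \frac{1}{1 + \sum_{j=1}^{R-1} \exp\!\big(w_{j0} + \sum_i w_{ji} X_i\big)}
```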

20 Loss Functions: Likelihood vs. Conditional Likelihood Generative (Naïve Bayes) loss function: data likelihood. Discriminative (Logistic Regression) loss function: conditional data likelihood. Discriminative models can’t compute P(x_j | w)! Or, … “They don’t waste effort learning P(X)” Focus only on P(Y|X) – all that matters for classification ©Carlos Guestrin 2005-2009

21 Expressing Conditional Log Likelihood l(w) = Σ_j [ y^j ln P(Y=1|x^j, w) + (1 − y^j) ln P(Y=0|x^j, w) ], where y^j is 1 when the correct answer is 1 (weighting the probability of predicting 1) and (1 − y^j) is 1 when the correct answer is 0 (weighting the probability of predicting 0). ©Carlos Guestrin 2005-2009

22 Expressing Conditional Log Likelihood ln P(Y=0|X,w) = −ln(1 + exp(w_0 + Σ_i w_i X_i)) ln P(Y=1|X,w) = w_0 + Σ_i w_i X_i − ln(1 + exp(w_0 + Σ_i w_i X_i))
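Substituting these into the conditional log likelihood gives the objective that is actually maximized (a standard step, sketched here with x_i^j denoting the i-th feature of the j-th example):

```latex
\ell(w) \;=\; \sum_j \Big[\, y^j \big(w_0 + \textstyle\sum_i w_i x_i^j\big)
\;-\; \ln\!\Big(1 + \exp\big(w_0 + \textstyle\sum_i w_i x_i^j\big)\Big) \Big]
```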

23 Maximizing Conditional Log Likelihood Bad news: no closed-form solution to maximize l(w). Good news: l(w) is a concave function of w! No suboptimal local optima; concave functions are easy to optimize. ©Carlos Guestrin 2005-2009

24 Optimizing Concave Functions: Gradient Ascent Conditional likelihood for Logistic Regression is concave! Find the optimum with gradient ascent. Gradient ascent is the simplest of optimization approaches; e.g., conjugate gradient ascent is much better (see reading). Gradient: ∇_w l(w) = [∂l(w)/∂w_0, …, ∂l(w)/∂w_n]. Update rule, with learning rate η > 0: w ← w + η ∇_w l(w). ©Carlos Guestrin 2005-2009

25 Gradient Ascent for LR Gradient ascent algorithm: iterate until change < ε. Repeat: w_0 ← w_0 + η Σ_j [ y^j − P̂(Y=1 | x^j, w) ]; for i = 1, …, n: w_i ← w_i + η Σ_j x_i^j [ y^j − P̂(Y=1 | x^j, w) ]. ©Carlos Guestrin 2005-2009
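A compact sketch of this update loop in NumPy (the learning rate eta, tolerance, and iteration cap are illustrative choices, not values from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, eta=0.1, tol=1e-5, max_iters=10000):
    """Maximize the conditional log likelihood by gradient ascent.

    X: (m, n) feature matrix; y: (m,) array of 0/1 labels.
    """
    m, n = X.shape
    Xb = np.hstack([np.ones((m, 1)), X])     # x_0 = 1 carries the bias w_0
    w = np.zeros(n + 1)
    for _ in range(max_iters):
        p = sigmoid(Xb @ w)                  # P(Y=1 | x^j, w) for every example
        grad = Xb.T @ (y - p)                # dl/dw_i = sum_j x_i^j (y^j - p^j)
        w += eta * grad
        if np.max(np.abs(eta * grad)) < tol: # stop when the update is tiny
            break
    return w
```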

26 That’s all MCLE. How about MCAP? One common approach is to define priors on w – normal distribution, zero mean, identity covariance – “pushes” parameters towards zero Corresponds to regularization – helps avoid very large weights and overfitting – more on this later in the semester MAP estimate: w* = argmax_w ln [ P(w) Π_j P(y^j | x^j, w) ] ©Carlos Guestrin 2005-2009

27 M(C)AP as Regularization Penalizes high weights; also applicable in linear regression ©Carlos Guestrin 2005-2009

28 Large parameters → Overfitting If the data is linearly separable, the weights go to infinity, which leads to overfitting. Penalizing high weights can prevent overfitting… ©Carlos Guestrin 2005-2009

29 Gradient of M(C)AP ©Carlos Guestrin 2005-2009
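The equations this slide showed are the standard regularized objective and gradient under a zero-mean Gaussian prior on w; sketched here with λ proportional to the inverse prior variance:

```latex
\ell_{MAP}(w) \;=\; \sum_j \ln P(y^j \mid x^j, w) \;-\; \frac{\lambda}{2}\sum_i w_i^2 + \text{const},
\qquad
\frac{\partial \ell_{MAP}(w)}{\partial w_i} \;=\; -\lambda\, w_i \;+\; \sum_j x_i^j\big[\, y^j - \hat{P}(Y=1 \mid x^j, w)\big]
```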

30 MLE vs MAP Maximum conditional likelihood estimate: w* = argmax_w Σ_j ln P(y^j | x^j, w). Maximum conditional a posteriori estimate: w* = argmax_w [ ln P(w) + Σ_j ln P(y^j | x^j, w) ]. ©Carlos Guestrin 2005-2009

31 Logistic regression v. Naïve Bayes Consider learning f: X → Y, where – X is a vector of real-valued features, – Y is boolean Could use a Gaussian Naïve Bayes classifier – assume all X_i are conditionally independent given Y – model P(X_i | Y = y_k) as Gaussian N(μ_ik, σ_i) – model P(Y) as Bernoulli(π, 1 − π) What does that imply about the form of P(Y|X)? Cool!!!! ©Carlos Guestrin 2005-2009

32 Derive form for P(Y|X) for continuous X_i ©Carlos Guestrin 2005-2009

33 Ratio of class-conditional probabilities ©Carlos Guestrin 2005-2009

34 Derive form for P(Y|X) for continuous X_i ©Carlos Guestrin 2005-2009
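The derivation carried out on slides 32-34 is the standard one; a sketch under the class-independent-variance assumption, with π = P(Y=1):

```latex
P(Y=1\mid X)
= \frac{1}{1 + \dfrac{P(Y=0)\,P(X\mid Y=0)}{P(Y=1)\,P(X\mid Y=1)}}
= \frac{1}{1 + \exp\!\Big(\ln\tfrac{1-\pi}{\pi} + \sum_i \ln\tfrac{P(X_i\mid Y=0)}{P(X_i\mid Y=1)}\Big)},
\qquad
\ln\frac{P(X_i\mid Y=0)}{P(X_i\mid Y=1)}
= \frac{\mu_{i0}-\mu_{i1}}{\sigma_i^2}\,X_i \;+\; \frac{\mu_{i1}^2-\mu_{i0}^2}{2\sigma_i^2}
```

Since each class-conditional log ratio is linear in X_i, P(Y=1|X) is a sigmoid of a linear function of X, exactly the logistic regression form.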

35 Gaussian Naïve Bayes vs. Logistic Regression Representation equivalence – but only in a special case!!! (GNB with class-independent variances) But what’s the difference??? LR makes no assumptions about P(X|Y) in learning!!! Loss function!!! – Optimize different functions → obtain different solutions Set of Gaussian Naïve Bayes parameters (feature variance independent of class label) vs. set of Logistic Regression parameters ©Carlos Guestrin 2005-2009

36 Naïve Bayes vs Logistic Regression Consider Y boolean, X_i continuous, X = ⟨X_1, …, X_n⟩ Number of parameters: NB: 4n + 1; LR: n + 1 Estimation method: NB parameter estimates are uncoupled; LR parameter estimates are coupled ©Carlos Guestrin 2005-2009

37 G. Naïve Bayes vs. Logistic Regression 1 Generative and Discriminative classifiers Asymptotic comparison (# training examples → infinity) – when model correct: GNB, LR produce identical classifiers – when model incorrect: LR is less biased – does not assume conditional independence – therefore LR expected to outperform GNB [Ng & Jordan, 2002] ©Carlos Guestrin 2005-2009

38 G. Naïve Bayes vs. Logistic Regression 2 Generative and Discriminative classifiers Non-asymptotic analysis – convergence rate of parameter estimates, n = # of attributes in X Size of training data to get close to infinite data solution GNB needs O(log n) samples LR needs O(n) samples – GNB converges more quickly to its (perhaps less helpful) asymptotic estimates [Ng & Jordan, 2002] 38 ©Carlos Guestrin 2005-2009

39 Some experiments from UCI data sets: Naïve Bayes vs. Logistic Regression ©Carlos Guestrin 2005-2009

40 What you should know about Logistic Regression (LR) Gaussian Naïve Bayes with class-independent variances is representationally equivalent to LR – solution differs because of objective (loss) function In general, NB and LR make different assumptions – NB: features independent given class → assumption on P(X|Y) – LR: functional form of P(Y|X), no assumption on P(X|Y) LR is a linear classifier – decision rule is a hyperplane LR optimized by conditional likelihood – no closed-form solution – concave → global optimum with gradient ascent – maximum conditional a posteriori corresponds to regularization Convergence rates – GNB (usually) needs less data – LR (usually) gets to better solutions in the limit ©Carlos Guestrin 2005-2009

42 Bayesian Learning Use Bayes rule: P(Y | X) = P(X | Y) P(Y) / P(X), i.e., posterior = (data likelihood × prior) / normalization. Or equivalently: P(Y | X) ∝ P(X | Y) P(Y)

43 Naïve Bayes Naïve Bayes assumption: – Features are independent given class: P(X_1, X_2 | Y) = P(X_1 | Y) P(X_2 | Y) – More generally: P(X_1, …, X_n | Y) = Π_i P(X_i | Y) How many parameters now? Suppose X is composed of n binary features
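For boolean Y and n binary features, the parameter counts the question is asking about (a standard tally, stated here rather than on the slide):

```latex
\text{Na\"ive Bayes: } \underbrace{2n}_{P(X_i=1\mid Y=y_k)} + \underbrace{1}_{P(Y=1)} = 2n+1
\qquad\text{vs.}\qquad
\text{no independence assumption: } 2\,(2^n - 1) + 1
```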

44 The Distributions We Love (discrete case) Single event, binary {0, 1}: Bernoulli. Sequence of N trials (N = α_H + α_T): Binomial for binary outcomes, Multinomial for k values. Conjugate priors: Beta (binary), Dirichlet (k values).

45 Learning Gaussian Parameters Maximum Likelihood Estimates: Mean: Variance:

46 Learning Gaussian Parameters Maximum Likelihood Estimates (x^j is the j-th training example; δ(z) = 1 if z is true, else 0): Mean: μ̂_ik = Σ_j δ(Y^j = y_k) x_i^j / Σ_j δ(Y^j = y_k). Variance: σ̂²_ik = Σ_j δ(Y^j = y_k) (x_i^j − μ̂_ik)² / Σ_j δ(Y^j = y_k).

47 Learning Gaussian Parameters Maximum Likelihood Estimates: Mean: Variance:
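A sketch of these estimates in NumPy (assuming X is an (m, n) feature matrix and y an array of class labels; names are illustrative, and .var with its default divisor matches the maximum likelihood estimate):

```python
import numpy as np

def gnb_mle(X, y):
    """Per-class, per-feature MLE of Gaussian mean and variance for Gaussian Naive Bayes."""
    params = {}
    for k in np.unique(y):
        Xk = X[y == k]                       # rows whose label is class k
        mu = Xk.mean(axis=0)                 # mu_hat_{ik} for every feature i
        var = Xk.var(axis=0)                 # sigma_hat^2_{ik} (MLE, divides by the count)
        prior = Xk.shape[0] / X.shape[0]     # estimate of P(Y = y_k)
        params[k] = (mu, var, prior)
    return params
```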

48 What You Need to Know about Naïve Bayes Optimal Decision using Bayes Classifier Naïve Bayes Classifier – What’s the assumption – Why we use it – How do we learn it Text Classification – Bag of words model Gaussian NB – Features still conditionally independent – Features have Gaussian distribution given class

49 What’s (supervised) learning more formally Given: – Dataset: Instances {  x 1 ;t(x 1 ) ,…,  x N ;t(x N )  } e.g.,  x i ;t(x i )  =  (GPA=3.9,IQ=120,MLscore=99);150K  – Hypothesis space: H e.g., polynomials of degree 8 – Loss function: measures quality of hypothesis h  H e.g., squared error for regression Obtain: – Learning algorithm: obtain h  H that minimizes loss function e.g., using closed form solution if available Or greedy search if not Want to minimize prediction error, but can only minimize error in dataset 49

50 Types of (supervised) learning problems, revisited Decision trees, e.g., – dataset: ⟨votes; party⟩ – hypothesis space: – loss function: NB classification, e.g., – dataset: ⟨brain image; {verb v. noun}⟩ – hypothesis space: – loss function: Density estimation, e.g., – dataset: ⟨grades⟩ – hypothesis space: – loss function: ©Carlos Guestrin 2005-2009

51 Learning is (simply) function approximation! The general (supervised) learning problem: – given some data (including features), a hypothesis space, and a loss function – learning is no magic! – simply trying to find a function that fits the data Regression Density estimation Classification (Not surprisingly) Seemingly different problems, very similar solutions…

52 What you need to know about supervised learning Learning is function approximation What functions are being optimized? 52

