0 Lecture 4: Logistic Regression. Machine Learning, CUNY Graduate Center.
1 Today: Bayesian Linear Regression; Logistic Regression; Bayesians v. Frequentists; the linear model for classification.
2 Regularization: penalize large weights. Introduce a penalty term in the loss function: regularized regression (L2-regularization, or ridge regression).
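The slide's equation did not survive transcription; a standard statement of the ridge objective (basis-function notation assumed, not taken from the slide) is:

    E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \left( t_n - \mathbf{w}^\top \phi(\mathbf{x}_n) \right)^2 + \frac{\lambda}{2} \|\mathbf{w}\|_2^2

Larger lambda penalizes large weights more heavily.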
3 More regularization. The penalty term defines the style of regularization: L2-regularization, L1-regularization, or L0-regularization, sketched below. The L0-norm yields the optimal subset of features.
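As a sketch of the three penalty styles (the lambda coefficient is assumed, not from the slide):

    \text{L2:}\ \lambda \sum_j w_j^2 \qquad \text{L1:}\ \lambda \sum_j |w_j| \qquad \text{L0:}\ \lambda \sum_j \mathbb{1}[w_j \neq 0]

The L0 penalty literally counts nonzero weights, which is why minimizing it amounts to subset selection.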
4 Curse of dimensionality. Increasing the dimensionality of the features increases the data requirements exponentially. For example, if a single feature can be accurately approximated with 100 data points, optimizing the joint over two features requires 100 * 100 data points. Models should be small relative to the amount of available data. Dimensionality reduction techniques, such as feature selection, can help: L0-regularization is explicit feature selection, while L1- and L2-regularization approximate feature selection.
5 Bayesians v. Frequentists: what is a probability?
Frequentists: a probability is the likelihood that an event will happen. It is approximated by the ratio of the number of observed events to the number of total events. Assessment is vital to selecting a model. Point estimates are absolutely fine.
Bayesians: a probability is a degree of believability of a proposition. Bayesians require that probabilities be prior beliefs conditioned on data. The Bayesian approach "is optimal", given a good model, a good prior, and a good loss function. Don't worry so much about assessment. If you are ever making a point estimate, you've made a mistake: the only valid probabilities are posteriors based on evidence given some prior.
6 Bayesian Linear Regression. The previous MLE derivation of linear regression uses point estimates for the weight vector, w. Bayesians say, "hold it right there": use a prior distribution over w to estimate the parameters. Alpha is a hyperparameter over w, where alpha is the precision, or inverse variance, of the prior distribution. Now optimize:
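The optimization target referred to above is missing from the transcript; the standard Bayesian form, with a zero-mean Gaussian prior of precision alpha and noise precision beta (notation assumed), is:

    p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1} \mathbf{I}), \qquad p(\mathbf{w} \mid \mathbf{t}, \alpha, \beta) \propto p(\mathbf{t} \mid \mathbf{w}, \beta)\, p(\mathbf{w} \mid \alpha)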
7 Optimize the Bayesian posterior. As usual, it's easier to optimize after a log transform.
9 Optimize the Bayesian posterior. Ignoring terms that do not depend on w, this is an identical formulation to L2-regularization.
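Reconstructing the missing derivation under the same assumptions, the log posterior (up to constants) is:

    \ln p(\mathbf{w} \mid \mathbf{t}) = -\frac{\beta}{2} \sum_{n=1}^{N} \left( t_n - \mathbf{w}^\top \phi(\mathbf{x}_n) \right)^2 - \frac{\alpha}{2} \mathbf{w}^\top \mathbf{w} + \text{const}

Maximizing this is exactly ridge regression with lambda = alpha / beta.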
10 Context. Overfitting is bad. Bayesians vs. Frequentists: is one better? Machine learning uses techniques from both camps.
11 Logistic Regression. A linear model applied to classification. Supervised: target information is available; each data point xi has a corresponding target ti. Goal: identify a function mapping each xi to its target ti.
12 Target Variables. In binary classification, it is convenient to represent ti as a scalar with a range of [0, 1]. Interpret ti as the likelihood that xi is a member of the positive class; this can represent the confidence of a prediction. For K > 2 classes, ti is often represented as a K-element vector, where tij represents the degree of membership in class j and |ti| = 1. E.g., a 5-way classification vector:
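The slide's example vector is missing; an illustrative 5-way target for a point assigned to class 3 might be:

    \mathbf{t}_i = (0, 0, 1, 0, 0) \quad \text{or, with soft membership,} \quad \mathbf{t}_i = (0.1, 0.1, 0.6, 0.1, 0.1)

with entries summing to 1.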
16 Classification approaches.
Generative: models the joint distribution between c and x; has the highest data requirements.
Discriminative: fewer parameters to approximate.
Discriminant function: may still be trained probabilistically, but not necessarily by modeling a likelihood.
18 Relationship between Regression and Classification. Since we're classifying two classes, why not set one class to '0' and the other to '1', then use linear regression? Regression outputs range from -infinity to infinity, while the class labels are 0 and 1. Can use a threshold, e.g.: if y >= 0.5, then class 1; if y < 0.5, then class 2. [Figure: f(x) >= 0.5 maps to Happy/Good/Class A, otherwise Sad/Not Good/Class B.]
19 Odds ratio. Rather than thresholding, we'll relate the regression to the class-conditional probability: the ratio of the odds of the prediction y = 1 versus y = 0. If p(y=1|x) = 0.8 and p(y=0|x) = 0.2, the odds ratio = 0.8/0.2 = 4. Use a linear model to predict the odds rather than a class label.
20 Logit: the log odds-ratio function. The odds ratio ranges from 0 to infinity, while the linear model ranges from -infinity to infinity, so we use a log function. This has the added bonus of dissolving the division, leading to easy manipulation.
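The logit itself, written out (standard form; the slide's equation image is missing):

    \operatorname{logit}(p) = \ln \frac{p(y=1 \mid \mathbf{x})}{p(y=0 \mid \mathbf{x})} = \ln \frac{p}{1-p}

which maps (0, 1) onto (-infinity, infinity), matching the range of the linear model.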
21 Logistic Regression. A linear model used to predict the log-odds ratio of two classes.
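In place of the slide's missing image, the model and its inversion (standard notation assumed):

    \ln \frac{p}{1-p} = \mathbf{w}^\top \mathbf{x} \quad \Longrightarrow \quad p(y=1 \mid \mathbf{x}) = \frac{1}{1 + e^{-\mathbf{w}^\top \mathbf{x}}}

i.e., solving the linear log-odds model for p recovers the logistic (sigmoid) function.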
30 Discriminative Training. Model p(t|x) directly. In the generative formulation, we need to estimate the joint of t and x, but in exchange we get an intuitive regularization technique. For discriminative training, take the derivatives with respect to the weights; be prepared to do this for homework.
31 What's the problem with generative training? Formulated this way, in D dimensions, the discriminative function has D parameters. In the generative case, there are 2D means and D(D+1)/2 covariance values: quadratic growth in the number of parameters. We'd rather have linear growth.
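Plugging a concrete dimensionality into the slide's counts (D = 100 is an illustrative number, not from the slide):

    \text{generative: } 2D + \frac{D(D+1)}{2} = 200 + 5050 = 5250 \text{ parameters}, \qquad \text{discriminative: } D + 1 = 101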
38 Optimization. We know the gradient of the error function, but how do we find the maximum value? Setting the gradient to zero is nontrivial, so we use numerical approximation.
39 Gradient Descent. Take a guess, move in the direction of the negative gradient, then jump again. On a convex function this will converge. Other methods include Newton-Raphson.
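A minimal sketch of batch gradient descent for logistic regression, assuming a design matrix X, binary targets t in {0, 1}, and a hand-picked learning rate (all names illustrative, not the lecture's code):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def logistic_gd(X, t, lr=0.1, n_iters=1000):
        """Minimize the logistic-regression cross-entropy error by batch gradient descent."""
        w = np.zeros(X.shape[1])
        for _ in range(n_iters):
            y = sigmoid(X @ w)             # predicted p(t=1 | x) for every point
            grad = X.T @ (y - t) / len(t)  # gradient of the mean cross-entropy error
            w -= lr * grad                 # step in the negative gradient direction
        return w

Because the cross-entropy error of logistic regression is convex in w, this converges for a suitably small learning rate.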
40 Multi-class discriminant functions. Can extend to multiple classes. Other approaches include constructing K-1 binary classifiers, where each classifier compares cn to not-cn. Computationally simpler, but not without problems.
41 Exponential Model. Logistic regression is a type of exponential model: a linear combination of weights and features produces a probabilistic model.
44 Entropy. A measure of uncertainty, or a measure of "information". High uncertainty equals high entropy. Rare events are more "informative" than common events.
45 Entropy. How much information is received when observing 'x'? If independent, p(x,y) = p(x)p(y), and H(x,y) = H(x) + H(y): the information contained in two unrelated events is equal to their sum.
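The additivity follows directly from taking logs (a standard derivation; the slide's equations are missing):

    h(x, y) = -\log p(x, y) = -\log p(x) - \log p(y) = h(x) + h(y)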
46 Entropy. Binary coding of p(x): -log p(x). "How many bits does it take to represent a value p(x)?" How many "decimal" places? How many binary decimal places? Entropy is the expected value of the observed information.
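Writing out that expectation (the standard definition):

    H(x) = \mathbb{E}[-\log_2 p(x)] = -\sum_{x} p(x) \log_2 p(x)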
47 Examples of entropy. Uniform distributions have higher entropy.
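A worked comparison (illustrative numbers, not from the slide): a uniform distribution over 8 outcomes versus a heavily skewed coin:

    H_{\text{uniform}} = -\sum_{i=1}^{8} \tfrac{1}{8} \log_2 \tfrac{1}{8} = 3 \text{ bits}, \qquad H_{\text{Bern}(0.9)} = -(0.9 \log_2 0.9 + 0.1 \log_2 0.1) \approx 0.47 \text{ bits}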
48 Maximum Entropy. Logistic regression is also known as maximum entropy. Entropy is concave, so maximizing it is a convex optimization with guaranteed convergence. Constrain this optimization to enforce good classification: increase the likelihood of the data while keeping the distribution of weights as even as possible, and include as many useful features as possible.
49 Maximum Entropy with Constraints. (From the Klein and Manning tutorial.)
50 Optimization formulation. If we let the weights represent likelihoods of values for each feature, then for each feature i:
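The per-feature constraint the slide gestures at, in its standard form (notation assumed: f_i is the i-th feature function, p-tilde the empirical distribution):

    \max_{p} H(p) \quad \text{subject to} \quad \mathbb{E}_{p}[f_i] = \mathbb{E}_{\tilde{p}}[f_i] \ \text{ for each feature } i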
51 Solving the MaxEnt formulation. A convex optimization with a concave objective function and linear constraints. Lagrange multipliers yield a dual representation of the maximum likelihood estimation of logistic regression. For each feature i:
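Carrying the Lagrange multipliers through (a standard result, e.g. in the Klein and Manning tutorial), the dual solution is the log-linear model:

    p(y \mid x) = \frac{\exp\left( \sum_i \lambda_i f_i(x, y) \right)}{\sum_{y'} \exp\left( \sum_i \lambda_i f_i(x, y') \right)}

with one multiplier lambda_i per feature playing the role of a logistic-regression weight.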
52 Summary.
Bayesian regularization: introducing a prior over the parameters serves to constrain the weights.
Logistic regression: log odds used to construct a linear model; a formulation with Gaussian class conditionals; discriminative training; gradient descent.
Entropy: logistic regression as maximum entropy.