Review of Lecture Two Linear Regression Normal Equation


1 Review of Lecture Two Linear Regression Normal Equation
Cost Function
Gradient Descent
Normal Equation: θ = (XᵀX)⁻¹Xᵀy
Probabilistic Interpretation
Maximum Likelihood Estimation vs. Linear Regression
Gaussian Distribution of the Data
Generative vs. Discriminative

2 General Linear Regression Methods Important Implications
Recall that θ, a column vector (1 intercept term θ0 plus n parameters), can be obtained from the normal equation θ = (XᵀX)⁻¹Xᵀy. When the columns of X are linearly independent (XᵀX is full rank), the normal equations have a unique solution. Inverting XᵀX amounts to finding a matrix M with M(XᵀX) = I, that is, a matrix equivalent of a numerical reciprocal. Only models with a single output variable can be trained this way.
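The normal-equation solve can be sketched in a few lines of NumPy; the data here are made up for illustration:

```python
import numpy as np

# Illustrative data: fit y = theta0 + theta1 * x by the normal equation.
# The design matrix X gets a leading column of ones for the intercept theta0.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])          # exactly y = 1 + 2x
X = np.column_stack([np.ones_like(x), x])

# theta = (X^T X)^{-1} X^T y  -- unique when X^T X is full rank.
theta = np.linalg.solve(X.T @ X, X.T @ y)   # solve() avoids an explicit inverse
print(theta)                                 # ≈ [1.0, 2.0]
```

Using `np.linalg.solve` rather than forming the inverse explicitly is the usual numerically safer choice.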

3 Maximum Likelihood Estimation
Assume the data are i.i.d. (independently, identically distributed). The likelihood L(θ) = the probability of y given x, parameterized by θ. What is Maximum Likelihood Estimation (MLE)? Choose the parameters θ that maximize the likelihood function, so as to make the training data set as probable as possible.

4 The Connection Between MLE and OLS
Choose parameters θ to maximize the data likelihood; under a Gaussian noise model this is equivalent to minimizing the least-squares cost J(θ).

5 The Equivalence of MLE and OLS
Maximizing the log-likelihood ℓ(θ) turns out to equal a constant minus a multiple of J(θ)!?
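The slide's equations did not survive extraction; assuming the standard Gaussian noise model y⁽ⁱ⁾ = θᵀx⁽ⁱ⁾ + ε⁽ⁱ⁾ with ε⁽ⁱ⁾ ~ N(0, σ²) from the previous lecture, the log-likelihood works out as:

```latex
\ell(\theta) = \log \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma}
    \exp\!\left( -\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2} \right)
  = m \log\frac{1}{\sqrt{2\pi}\,\sigma}
    - \frac{1}{\sigma^2} \cdot \frac{1}{2} \sum_{i=1}^{m} \left( y^{(i)} - \theta^T x^{(i)} \right)^2
```

Since the first term and σ² do not depend on θ, maximizing ℓ(θ) is exactly minimizing J(θ) = ½ Σᵢ (y⁽ⁱ⁾ − θᵀx⁽ⁱ⁾)².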

6 Today’s Content
Logistic Regression: Discrete Output; Connection to MLE
The Exponential Family: Bernoulli; Gaussian
Generalized Linear Models (GLMs)

7 Sigmoid (Logistic) Function
Other functions that smoothly increase from 0 to 1 also exist, but for a couple of good reasons (we will see them next time, with Generalized Linear Models) the choice of the logistic function g(z) = 1/(1 + e⁻ᶻ) is a natural one.
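A minimal sketch of the logistic function and its smooth 0-to-1 behavior:

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z)); maps the reals onto (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))     # 0.5: the midpoint
print(sigmoid(10.0))    # very close to 1
print(sigmoid(-10.0))   # very close to 0
```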


9 Gradient Ascent for MLE of the Logistic Function
Recall the sigmoid g(z). Let's work with just one training example (x, y) to derive the Gradient Ascent rule for maximizing the log-likelihood.
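A sketch of the resulting stochastic gradient-ascent update, θⱼ := θⱼ + α(y − hθ(x))xⱼ, on one made-up training example (the data and learning rate α are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sga_step(theta, x, y, alpha=0.1):
    """One stochastic gradient-ascent step on the log-likelihood for a
    single example (x, y): theta_j += alpha * (y - h(x)) * x_j."""
    h = sigmoid(theta @ x)
    return theta + alpha * (y - h) * x

# Toy run: repeated updates on a positive example push h(x) toward 1.
theta = np.zeros(2)
x = np.array([1.0, 2.0])        # x[0] = 1 is the intercept feature
for _ in range(100):
    theta = sga_step(theta, x, y=1.0)
print(sigmoid(theta @ x))       # close to 1 after repeated updates
```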

10 One Useful Property of the Logistic Function
Its derivative satisfies g′(z) = g(z)(1 − g(z)), which keeps the gradient of the log-likelihood simple.
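The property g′(z) = g(z)(1 − g(z)) can be checked numerically with a central finite difference (the step size eps and sample points are arbitrary choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Numerically check g'(z) = g(z) * (1 - g(z)) at a few points.
eps = 1e-6
for z in [-2.0, 0.0, 1.5]:
    numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
    analytic = sigmoid(z) * (1 - sigmoid(z))
    print(z, numeric, analytic)   # the two columns agree
```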

11 Identical to Least Squares Again?
The resulting update rule θj := θj + α(y − hθ(x))xj looks identical to the LMS rule for linear regression, but it is not the same algorithm: hθ(x) is now the nonlinear sigmoid of θᵀx.


13 Discriminative vs. Generative Algorithms
Discriminative Learning: Either learn p(y|x) directly, or learn a hypothesis hθ that, given x, outputs a label in {1, 0} directly. Logistic regression is an example of a discriminative learning algorithm.
In contrast, Generative Learning: Build the probabilistic distribution of x conditioned on each of the classes, p(x|y=1) and p(x|y=0), respectively. Also build the class priors p(y=1) and p(y=0) (the weights). Use Bayes' rule to compare p(x|y)p(y) for y=1 and y=0, i.e., to see which class is more likely.
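The generative recipe can be sketched as follows; the 1-D Gaussian class-conditionals, their means, and the priors are all illustrative assumptions, not part of the slide:

```python
import numpy as np

# A minimal generative classifier: Gaussian class-conditionals p(x|y)
# plus class priors p(y), compared via Bayes' rule.

def gaussian_pdf(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def predict(x, mu0=0.0, mu1=3.0, sigma=1.0, prior1=0.5):
    # The posterior is proportional to p(x|y) * p(y); the shared p(x) cancels.
    score0 = gaussian_pdf(x, mu0, sigma) * (1 - prior1)
    score1 = gaussian_pdf(x, mu1, sigma) * prior1
    return 1 if score1 > score0 else 0

print(predict(2.9))   # near mu1 -> class 1
print(predict(0.2))   # near mu0 -> class 0
```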

14 Question
For p(y|x; θ), we learn θ in order to maximize p(y|x; θ). When we do so: if y ~ Gaussian, we use Least Squares Regression; if y ∈ {0, 1} ~ Bernoulli, we use Logistic Regression. Why? Any natural reasons?

15 Any Probabilistic, Linear, and General (PLG) Learning Framework?
A web-site visiting problem, as a case for a PLG solution.

16 Generalized Linear Models The Exponential Family
Key ingredients: the natural (distribution) parameter η; the sufficient statistic T(y), often T(y) = y; and the normalization term a(η). A fixed choice of T, a, and b defines a set of distributions parameterized by η; as we vary η we get different distributions within this family (affecting the mean). The Bernoulli and Gaussian distributions, among others, are examples of exponential-family distributions. This is a way of unifying various statistical models, such as linear regression, logistic regression, and Poisson regression, into one framework.
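The terms named on the slide belong to the family's canonical form, whose equation was lost in extraction; it can be written as:

```latex
p(y;\eta) = b(y)\,\exp\bigl(\eta^{T} T(y) - a(\eta)\bigr)
```

Here η is the natural parameter, T(y) the sufficient statistic, and a(η) the log-partition (normalization) term that makes the distribution sum or integrate to 1.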

17 Examples of distributions in the exponential family
Gaussian Bernoulli Binomial Multinomial Chi-square Exponential Poisson Beta

18 Bernoulli
y | x; θ ~ ExpFamily(η); here we choose a, b, and T to have the specific forms that make the distribution Bernoulli. For any fixed x and θ, we hope our algorithm will output hθ(x) = E[y|x; θ] = p(y=1|x; θ) = φ = 1/(1+e⁻η) = 1/(1+e^(−θᵀx)). If you recall that the logistic function has the form 1/(1+e⁻ᶻ), you should now understand why we choose the logistic form for a learning process when the data follow a Bernoulli distribution.
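The Bernoulli distribution can be rewritten into the family's canonical form (a standard derivation, sketched here since the slide's equations were lost):

```latex
p(y;\phi) = \phi^{y}(1-\phi)^{1-y}
          = \exp\!\left( y \log\frac{\phi}{1-\phi} + \log(1-\phi) \right)
```

Matching terms gives η = log(φ/(1−φ)), T(y) = y, a(η) = −log(1−φ) = log(1+e^η), and b(y) = 1; inverting the first relation yields φ = 1/(1+e⁻η), exactly the logistic function.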

19 To Build a GLM
Model p(y|x; θ), where y, given x and θ, follows a distribution in the Exponential Family(η). Given x, our goal is to output E[T(y)|x], i.e., we want h(x) = E[T(y)|x] (note that in most cases, T(y) = y). Think about the relationship between the input x and the parameter η, which we use to define the desired distribution: η = θᵀx (linear, as a design choice); η is a number or a vector.

20 Generalized Linear Models The Exponential Family

Distribution           Link Function   Link                   Mean
Normal                 Identity        θᵀx = μ                μ = θᵀx
Exponential, Gamma     Inverse         θᵀx = μ⁻¹              μ = (θᵀx)⁻¹
Poisson                Log             θᵀx = ln(μ)            μ = exp(θᵀx)
Binomial, Multinomial  Logit           θᵀx = ln(μ/(1−μ))      μ = 1/(1+e^(−θᵀx))
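The mean responses implied by these canonical links can be sketched for an illustrative linear predictor η = θᵀx (the value 0.5 is arbitrary):

```python
import numpy as np

eta = 0.5  # a made-up linear predictor theta^T x

mu_normal  = eta                          # identity link: mu = theta^T x
mu_gamma   = 1.0 / eta                    # inverse link:  mu = (theta^T x)^-1
mu_poisson = np.exp(eta)                  # log link:      mu = exp(theta^T x)
mu_binom   = 1.0 / (1.0 + np.exp(-eta))  # logit link:    mu = sigmoid(theta^T x)

print(mu_normal, mu_gamma, mu_poisson, mu_binom)
```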

21 More precisely… A GLM is a flexible generalization of ordinary least squares regression that relates the random component (the distribution of the response) to the systematic component (the linear predictor) through a function called the link function.

22 Models that deal with correlated data are extensions of GLMs.
The standard GLM assumes that the observations are uncorrelated (i.i.d.). Models that deal with correlated data are extensions of GLMs:
Generalized estimating equations: use population-averaged effects.
Generalized linear mixed models: a type of multilevel (mixed) model, an extension of logistic regression.
Hierarchical generalized linear models: similar to generalized linear mixed models, apart from two distinctions: the random effects can have any distribution in the exponential family, whereas generalized linear mixed models nearly always have normal random effects; and they are computationally less complex than linear mixed models.

23 Summary
A GLM is a flexible generalization of ordinary least squares regression. It generalizes linear regression by allowing the linear model to be related to the output variable via a link function and by allowing the variance of each measurement to be a function of its predicted value. GLMs unify various other statistical models, including linear, logistic, …, and Poisson regressions, under one framework. This allowed us to develop a general algorithm for maximum likelihood estimation in all these models, and it extends naturally to encompass many other models as well. In a GLM, the output is thus assumed to be generated from a particular distribution function of the exponential family.

