
1 Generative Models Rong Jin

2 Statistical Inference
Training Examples → Learning a Statistical Model p(x; θ) → Prediction
- Female: Gaussian distribution N(μ1, σ1)
- Male: Gaussian distribution N(μ2, σ2)
- Prediction: Pr(male | 1.67 m), Pr(female | 1.67 m)

3 Statistical Inference
Training Examples → Learning a Statistical Model p(y|x; θ) → Prediction
- Male: Gaussian distribution N(μ1, σ1)
- Female: Gaussian distribution N(μ2, σ2)
- Prediction: Pr(male | 1.67 m), Pr(female | 1.67 m)

4 Probabilistic Models for Classification Problems
Apply statistical inference methods:
- Given training examples {(x_i, y_i)}
- Assume a parametric model p(y|x; θ)
- Learn the model parameters θ from the training examples using the maximum likelihood approach
- The class of a new instance x is predicted by y* = argmax_y p(y|x; θ)


8 Maximum Likelihood Estimation (MLE)
- Given training examples
- Compute the log-likelihood of the data
- Find the parameters θ that maximize the log-likelihood
In many cases the log-likelihood cannot be maximized in closed form, so MLE requires numerical optimization.
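The slide's equations are not captured in the transcript; in the deck's notation (θ the model parameters, x1, …, xN the i.i.d. training examples), the standard form being described is presumably:

$$
\ell(\theta) = \sum_{i=1}^{N} \log p(x_i; \theta),
\qquad
\hat{\theta} = \arg\max_{\theta}\; \ell(\theta).
$$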




12 Generative Models
- Most probability distributions are joint distributions (i.e., p(x; θ)), not conditional distributions (i.e., p(y|x; θ))
- Using Bayes' rule: p(y|x; θ) ∝ p(x|y; θ) p(y; θ)
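Written out in full, Bayes' rule turns the class-conditional model and the class prior into the posterior needed for classification:

$$
p(y \mid x; \theta)
= \frac{p(x \mid y; \theta)\, p(y; \theta)}{\sum_{y'} p(x \mid y'; \theta)\, p(y'; \theta)}
\;\propto\; p(x \mid y; \theta)\, p(y; \theta).
$$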

13 Generative Models (cont'd)
Treatment of p(x|y; θ):
- Let y ∈ Y = {1, 2, …, c}
- Allocate a separate set of parameters for each class: θ = {θ1, θ2, …, θc}, so that p(x|y; θ) = p(x; θy)
- Data in different classes have different input patterns

14 Generative Models (cont'd)
Parameter space:
- Parameters for the distributions: {θ1, θ2, …, θc}
- Class priors: {p(y=1), p(y=2), …, p(y=c)}
Learn the parameters from training examples using MLE:
- Compute the log-likelihood
- Search for the optimal parameters by maximizing the log-likelihood
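Under the per-class parameterization of the previous slide, the log-likelihood of the training set splits into one term per class plus a term for the priors, so each θy and the priors can be estimated separately (a standard identity, added here for completeness):

$$
\ell = \sum_{i=1}^{N} \Big[ \log p(x_i; \theta_{y_i}) + \log p(y_i) \Big],
\qquad
\hat{p}(y = k) = \frac{N_k}{N},
$$

where N_k is the number of training examples with label k.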


18 Example
Task: predict the gender of individuals based on their heights
- Given 100 height examples of women and 100 height examples of men
- Assume the heights of women and the heights of men follow different Gaussian distributions


20 Example (cont'd)
Gaussian distribution for the height h
Parameter space:
- Gaussian distribution for males: (μm, σm)
- Gaussian distribution for females: (μf, σf)
- Class priors: pm = p(y=male), pf = p(y=female)

21 Example (cont’d)


24 Example (cont'd)
Learn a Gaussian generative model
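As a minimal sketch of this learning step (the data below are made-up illustrative heights, not the 100+100 examples from the slides), the MLE for each class-conditional Gaussian is just the sample mean and standard deviation, and the class priors are the class fractions:

```python
import numpy as np

# Hypothetical training data: heights in meters (illustrative values only).
heights_male = np.array([1.75, 1.82, 1.69, 1.78, 1.85, 1.72])
heights_female = np.array([1.62, 1.58, 1.70, 1.65, 1.55, 1.68])

# MLE of each class-conditional Gaussian: sample mean and standard deviation
# (np.std with the default ddof=0 is exactly the maximum-likelihood estimate).
mu_m, sigma_m = heights_male.mean(), heights_male.std()
mu_f, sigma_f = heights_female.mean(), heights_female.std()

# MLE of the class priors: fraction of training examples in each class.
n_m, n_f = len(heights_male), len(heights_female)
p_m, p_f = n_m / (n_m + n_f), n_f / (n_m + n_f)

print(f"male:   N({mu_m:.3f}, {sigma_m:.3f}^2), prior p_m = {p_m:.2f}")
print(f"female: N({mu_f:.3f}, {sigma_f:.3f}^2), prior p_f = {p_f:.2f}")
```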


27 Example (cont'd)
Predict the gender of an individual given his/her height
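A sketch of the prediction step via Bayes' rule; the fitted parameter values below are invented for illustration and would normally come from the learning step above:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    # Univariate Gaussian density N(x; mu, sigma^2).
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Illustrative fitted parameters (not taken from the slides).
mu_m, sigma_m, p_m = 1.76, 0.07, 0.5
mu_f, sigma_f, p_f = 1.63, 0.06, 0.5

h = 1.67  # height of the individual to classify

# Joint densities p(h | class) * p(class), then normalize with Bayes' rule.
joint_m = gaussian_pdf(h, mu_m, sigma_m) * p_m
joint_f = gaussian_pdf(h, mu_f, sigma_f) * p_f
pr_male = joint_m / (joint_m + joint_f)

print(f"Pr(male | h={h}) = {pr_male:.3f}, Pr(female | h={h}) = {1 - pr_male:.3f}")
```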

28 Decision boundary
Decision boundary h*:
- Predict female when h < h*
- Predict male when h > h*
- Predict at random when h = h*
Where is the decision boundary? It depends on the ratio pm/pf.
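The decision boundary is the height at which the two joint densities are equal. If, for illustration, the two class Gaussians are assumed to share the same variance σ² (an assumption not made explicit on this slide), h* has a closed form that shows the dependence on pm/pf:

$$
p_m\,\mathcal{N}(h^*; \mu_m, \sigma^2) = p_f\,\mathcal{N}(h^*; \mu_f, \sigma^2)
\;\;\Longrightarrow\;\;
h^* = \frac{\mu_m + \mu_f}{2} + \frac{\sigma^2}{\mu_m - \mu_f}\,\ln\frac{p_f}{p_m}.
$$

With μm > μf, increasing pm relative to pf lowers h*, i.e. more heights get classified as male.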

29 Example
The position of the decision boundary h* depends on the ratio pm/pf (the two plots compare the cases pf < pm and pf > pm).


31 Gaussian Generative Model (II)
Inputs contain multiple features.
Example:
- Task: predict whether an individual is overweight based on his/her salary and the number of hours spent watching TV
- Input: (s: salary, h: hours spent watching TV)
- Output: +1 (overweight), -1 (normal)

32 Multi-variate Gaussian Distribution
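The density referenced on this slide is the standard d-dimensional Gaussian:

$$
\mathcal{N}(\mathbf{x}; \boldsymbol{\mu}, \Sigma)
= \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}}
\exp\!\Big( -\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\top} \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \Big),
$$

with mean vector μ ∈ R^d and covariance matrix Σ ∈ R^{d×d}.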



37 Properties of Covariance Matrix
- What happens when the number of data points N < d?
- What is aᵀΣa for any vector a? Σ is a positive semi-definite matrix.
- How many distinct elements does Σ have?
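These properties follow directly from the maximum-likelihood form of the covariance matrix (a short derivation added for completeness):

$$
\hat{\Sigma} = \frac{1}{N} \sum_{i=1}^{N} (\mathbf{x}_i - \hat{\boldsymbol{\mu}})(\mathbf{x}_i - \hat{\boldsymbol{\mu}})^{\top},
\qquad
\mathbf{a}^{\top} \hat{\Sigma}\, \mathbf{a}
= \frac{1}{N} \sum_{i=1}^{N} \big( \mathbf{a}^{\top} (\mathbf{x}_i - \hat{\boldsymbol{\mu}}) \big)^2 \;\ge\; 0 .
$$

So the estimated Σ is always positive semi-definite; it is a sum of N rank-one matrices, so its rank is at most N, and when N < d it is necessarily singular. Since Σ is symmetric, it has d(d+1)/2 distinct elements.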

38 Gaussian Generative Model (II)
Joint distribution p(s, h) of salary (s) and hours of TV watching (h)


40 Multi-variate Gaussian Generative Model
- Inputs with multiple features
- A multi-variate Gaussian distribution for each class
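A minimal sketch of the multivariate case; the two-feature dataset (salary, TV hours) and all values are made up for illustration, and the log-density is evaluated with a determinant/solve rather than an explicit inverse:

```python
import numpy as np

def fit_class(X):
    # MLE for one class: sample mean vector and sample covariance matrix.
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False, bias=True)  # bias=True -> divide by N (the MLE)
    return mu, cov

def log_gaussian(x, mu, cov):
    # Log density of a multivariate Gaussian N(x; mu, cov).
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(cov)
    quad = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + quad)

# Hypothetical training data: columns are (salary in $1000s, TV hours per week).
X_pos = np.array([[40.0, 30.0], [55.0, 28.0], [35.0, 35.0], [60.0, 25.0]])  # overweight (+1)
X_neg = np.array([[50.0, 10.0], [70.0, 8.0], [45.0, 12.0], [65.0, 15.0]])   # normal (-1)

params = {+1: fit_class(X_pos), -1: fit_class(X_neg)}
n_total = len(X_pos) + len(X_neg)
priors = {+1: len(X_pos) / n_total, -1: len(X_neg) / n_total}

# Predict by comparing log p(x | y) + log p(y) across the two classes.
x_new = np.array([52.0, 26.0])
scores = {y: log_gaussian(x_new, *params[y]) + np.log(priors[y]) for y in (+1, -1)}
print("predicted class:", max(scores, key=scores.get))
```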

41 Improve the Multivariate Gaussian Model
How could we improve the model's prediction of overweight?
- Allow multiple modes for each class
- Introduce more attributes of individuals: location, occupation, number of children, house, age, …

42 Problems with Using the Multi-variate Gaussian Generative Model
- Σ is a d×d matrix and contains d(d+1)/2 independent variables
  - d = 100: the number of variables in Σ is 5,050
  - d = 1,000: the number of variables in Σ is 500,500
  - A large parameter space
- Σ can be singular
  - If N < d
  - If two features are linearly correlated
  - Then Σ⁻¹ does not exist


44 Problems with Using the Multi-variate Gaussian Generative Model
- Diagonalize Σ
- Feature independence assumption (Naïve Bayes assumption)

45 Problems with Using the Multi-variate Gaussian Generative Model
- Diagonalize Σ
- Smooth the covariance matrix
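The transcript does not show the smoothing used in the lecture; one common choice (stated here only as an example) is to shrink the estimated covariance toward its diagonal, or to add a small ridge, either of which keeps Σ invertible:

$$
\tilde{\Sigma} = (1 - \lambda)\, \hat{\Sigma} + \lambda\, \mathrm{diag}(\hat{\Sigma}),
\qquad \text{or} \qquad
\tilde{\Sigma} = \hat{\Sigma} + \lambda I,
\qquad \lambda > 0 .
$$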

46 Overfitting Issue
Complex model vs. insufficient training data
Example: consider a classification problem with multiple inputs
- 100 input features
- 5 classes
- 1,000 training examples
The total number of parameters for a full Gaussian generative model is
- 5 class priors → 5 parameters
- 5 mean vectors → 500 parameters
- 5 covariance matrices → 5 × 5,050 = 25,250 parameters
- 25,755 parameters in total → insufficient training data

47 Model Complexity Vs. Data



52 Naïve Bayes Model
- In general, for any generative model we have to estimate p(x|y; θ)
- For x in a high-dimensional space, this probability is hard to estimate
- In the Naïve Bayes model we approximate it by a product of per-feature probabilities: p(x|y; θ) ≈ ∏_j p(x_j|y; θ)


55 Text Categorization
Learn to classify text into predefined categories
- Input x: a document, represented by a vector of word counts, e.g. {(president, 10), (bush, 2), (election, 5), …}
- Output y: whether the document is about politics; +1 for a political document, -1 otherwise

56 Text Categorization
A generative model for text classification (TC)
- Parameter space: p(+) and p(-); p(doc|+; θ) and p(doc|-; θ)
- It is difficult to estimate p(doc|+; θ) and p(doc|-; θ): the typical vocabulary size is ~100,000, so each document is a vector of 100,000 attributes, and a document contains too many words
- A Naïve Bayes approach


59 Text Classification
A Naïve Bayes approach: for a document doc, approximate p(doc|y; θ) by the product of its word probabilities, p(doc|y; θ) ≈ ∏_{w ∈ doc} p(w|y)

60 Text Classification
- The original parameter space: p(+) and p(-); p(doc|+; θ), p(doc|-; θ)
- Parameter space after the Naïve Bayes simplification: p(+) and p(-); {p(w1|+), p(w2|+), …, p(wn|+)}; {p(w1|-), p(w2|-), …, p(wn|-)}

61 Text Classification
Learning parameters from training examples
- Each document is represented by its word counts
- Learn the parameters using maximum likelihood estimation

62 Text Classification


65 The optimal solution that maximizes the likelihood of the training data
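The closed-form solution itself is not in the transcript; for a multinomial Naïve Bayes model it is the standard count-based estimate (stated here under that assumption), where n(w, y) is the number of occurrences of word w in training documents of class y:

$$
\hat{p}(w \mid y) = \frac{n(w, y)}{\sum_{w'} n(w', y)},
\qquad
\hat{p}(y) = \frac{\#\{\text{training documents of class } y\}}{\#\{\text{training documents}\}} .
$$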

66 Text Classification
An Example: Twenty Newsgroups

67 Text Classification
Any problems with the Naïve Bayes text classifier?
- Unseen words
  - A word w is unseen in all training documents: what is the consequence?
  - A word w is unseen only in the documents of one class: what is the consequence?
  - This is related to the overfitting problem
- Any suggestions?
- Solution: a word-class approach
  - Introduce word classes T = {t1, t2, …, tm}
  - Compute p(ti|+) and p(ti|-)
  - When w was unseen in training, replace p(w|y) with p(ti|y) for the word class ti containing w
- Alternative: introduce a prior for the word probabilities (see the sketch below)
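A minimal sketch of a Naïve Bayes text classifier; it uses add-one (Laplace) smoothing as one concrete way of putting a prior on the word probabilities, which also gives unseen words a nonzero probability. The tiny corpus, the words, and the labels are all made up for illustration:

```python
from collections import Counter
import math

# Hypothetical training documents as bags of words (+1 = political, -1 = not).
train = [
    (["president", "election", "vote", "senate"], +1),
    (["bush", "president", "policy", "election"], +1),
    (["game", "score", "team", "season"], -1),
    (["movie", "actor", "film", "score"], -1),
]

word_counts = {+1: Counter(), -1: Counter()}
doc_counts = Counter()
for words, y in train:
    word_counts[y].update(words)
    doc_counts[y] += 1

vocab = {w for words, _ in train for w in words}
V = len(vocab)

def log_p_word(w, y, alpha=1.0):
    # Smoothed estimate of p(w | y): add-one (Laplace) smoothing, so a word
    # unseen in the training documents of class y still gets probability > 0.
    total = sum(word_counts[y].values())
    return math.log((word_counts[y][w] + alpha) / (total + alpha * V))

def classify(words):
    scores = {}
    for y in (+1, -1):
        log_prior = math.log(doc_counts[y] / sum(doc_counts.values()))
        scores[y] = log_prior + sum(log_p_word(w, y) for w in words)
    return max(scores, key=scores.get)

print(classify(["election", "president", "policy"]))  # expected: +1
```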

68 Naïve Bayes Model  This is a terrible approximation

69 Naïve Bayes Model
Why use the Naïve Bayes model?
- We are ultimately interested in p(y|x; θ), not p(x|y; θ)


72 Naïve Bayes Model
- The key quantity for prediction is not p(x|y; θ) itself, but the ratio p(x|y; θ) / p(x|y'; θ)
- Although the Naïve Bayes model does a poor job of estimating p(x|y; θ), it does a reasonably good job of estimating this ratio


75 The Ratio of Likelihoods for Binary Classes
Assume that both classes share the same variance (covariance). Then the Gaussian generative model is a linear model.
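Writing out the log-ratio for two Gaussian classes that share the covariance Σ shows why the classifier becomes linear in x (the quadratic terms cancel):

$$
\log \frac{p(x \mid +; \theta)\, p(+)}{p(x \mid -; \theta)\, p(-)}
= (\mu_+ - \mu_-)^{\top} \Sigma^{-1} x
+ \tfrac{1}{2}\big( \mu_-^{\top} \Sigma^{-1} \mu_- - \mu_+^{\top} \Sigma^{-1} \mu_+ \big)
+ \log \frac{p(+)}{p(-)}
\;=\; \mathbf{w}^{\top} x + b .
$$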

76 Linear Decision Boundary
- A Gaussian generative model (with shared covariance) amounts to finding a linear decision boundary
- Why not directly estimate the decision boundary?

