
1 Generative Models Rong Jin

2 Statistical Inference
Training Examples → Learning a Statistical Model p(x; θ) → Prediction
- Female: Gaussian distribution N(μ1, σ1)
- Male: Gaussian distribution N(μ2, σ2)
- Prediction: Pr(male | 1.67 m), Pr(female | 1.67 m)

3 Statistical Inference
Training Examples → Learning a Statistical Model p(y|x; θ) → Prediction
- Male: Gaussian distribution N(μ1, σ1)
- Female: Gaussian distribution N(μ2, σ2)
- Prediction: Pr(male | 1.67 m), Pr(female | 1.67 m)

4 Probabilistic Models for Classification Problems
Apply statistical inference methods:
- Given training examples {(x_i, y_i)}
- Assume a parametric model p(y|x; θ)
- Learn the model parameters θ from the training examples using the maximum likelihood approach
- The class of a new instance x is predicted by y* = argmax_y p(y|x; θ)


8 Maximum Likelihood Estimation (MLE)
- Given training examples
- Compute the log-likelihood of the data
- Find the parameters θ that maximize the log-likelihood
In many cases the log-likelihood cannot be maximized in closed form, so MLE requires numerical optimization.
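The slide's equations are not captured in the transcript; in the deck's notation (θ the model parameters, x1, …, xN the i.i.d. training examples), the standard form being described is presumably:

$$
\ell(\theta) = \sum_{i=1}^{N} \log p(x_i; \theta),
\qquad
\hat{\theta} = \arg\max_{\theta}\; \ell(\theta).
$$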




12 Generative Models
- Most probability distributions are joint distributions (i.e., p(x; θ)), not conditional distributions (i.e., p(y|x; θ))
- Using Bayes' rule: p(y|x; θ) ∝ p(x|y; θ) p(y; θ)
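Written out in full, Bayes' rule turns the class-conditional model and the class prior into the posterior needed for classification:

$$
p(y \mid x; \theta)
= \frac{p(x \mid y; \theta)\, p(y; \theta)}{\sum_{y'} p(x \mid y'; \theta)\, p(y'; \theta)}
\;\propto\; p(x \mid y; \theta)\, p(y; \theta).
$$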

13 Generative Models (cont'd)
Treatment of p(x|y; θ):
- Let y ∈ Y = {1, 2, …, c}
- Allocate a separate set of parameters for each class: θ = {θ1, θ2, …, θc}, so that p(x|y; θ) = p(x; θy)
- Data in different classes have different input patterns

14 Generative Models (cont'd)
Parameter space:
- Parameters for the distributions: {θ1, θ2, …, θc}
- Class priors: {p(y=1), p(y=2), …, p(y=c)}
Learn the parameters from training examples using MLE:
- Compute the log-likelihood
- Search for the optimal parameters by maximizing the log-likelihood
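Under the per-class parameterization of the previous slide, the log-likelihood of the training set splits into one term per class plus a term for the priors, so each θy and the priors can be estimated separately (a standard identity, added here for completeness):

$$
\ell = \sum_{i=1}^{N} \Big[ \log p(x_i; \theta_{y_i}) + \log p(y_i) \Big],
\qquad
\hat{p}(y = k) = \frac{N_k}{N},
$$

where N_k is the number of training examples with label k.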


18 Example
Task: predict the gender of individuals based on their heights
- Given 100 height examples of women and 100 height examples of men
- Assume the heights of women and the heights of men follow different Gaussian distributions


20 Example (cont'd)
Gaussian distribution for the height h
Parameter space:
- Gaussian distribution for males: (μm, σm)
- Gaussian distribution for females: (μf, σf)
- Class priors: pm = p(y=male), pf = p(y=female)

21 Example (cont’d)


24 Example (cont'd)
Learn a Gaussian generative model
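As a minimal sketch of this learning step (the data below are made-up illustrative heights, not the 100+100 examples from the slides), the MLE for each class-conditional Gaussian is just the sample mean and standard deviation, and the class priors are the class fractions:

```python
import numpy as np

# Hypothetical training data: heights in meters (illustrative values only).
heights_male = np.array([1.75, 1.82, 1.69, 1.78, 1.85, 1.72])
heights_female = np.array([1.62, 1.58, 1.70, 1.65, 1.55, 1.68])

# MLE of each class-conditional Gaussian: sample mean and standard deviation
# (np.std with the default ddof=0 is exactly the maximum-likelihood estimate).
mu_m, sigma_m = heights_male.mean(), heights_male.std()
mu_f, sigma_f = heights_female.mean(), heights_female.std()

# MLE of the class priors: fraction of training examples in each class.
n_m, n_f = len(heights_male), len(heights_female)
p_m, p_f = n_m / (n_m + n_f), n_f / (n_m + n_f)

print(f"male:   N({mu_m:.3f}, {sigma_m:.3f}^2), prior p_m = {p_m:.2f}")
print(f"female: N({mu_f:.3f}, {sigma_f:.3f}^2), prior p_f = {p_f:.2f}")
```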


27 Example (cont'd)
Predict the gender of an individual given his/her height
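A sketch of the prediction step via Bayes' rule; the fitted parameter values below are invented for illustration and would normally come from the learning step above:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    # Univariate Gaussian density N(x; mu, sigma^2).
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Illustrative fitted parameters (not taken from the slides).
mu_m, sigma_m, p_m = 1.76, 0.07, 0.5
mu_f, sigma_f, p_f = 1.63, 0.06, 0.5

h = 1.67  # height of the individual to classify

# Joint densities p(h | class) * p(class), then normalize with Bayes' rule.
joint_m = gaussian_pdf(h, mu_m, sigma_m) * p_m
joint_f = gaussian_pdf(h, mu_f, sigma_f) * p_f
pr_male = joint_m / (joint_m + joint_f)

print(f"Pr(male | h={h}) = {pr_male:.3f}, Pr(female | h={h}) = {1 - pr_male:.3f}")
```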

28 Decision boundary
Decision boundary h*:
- Predict female when h < h*
- Predict male when h > h*
- Predict at random when h = h*
Where is the decision boundary? It depends on the ratio pm/pf.
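The decision boundary is the height at which the two joint densities are equal. If, for illustration, the two class Gaussians are assumed to share the same variance σ² (an assumption not made explicit on this slide), h* has a closed form that shows the dependence on pm/pf:

$$
p_m\,\mathcal{N}(h^*; \mu_m, \sigma^2) = p_f\,\mathcal{N}(h^*; \mu_f, \sigma^2)
\;\;\Longrightarrow\;\;
h^* = \frac{\mu_m + \mu_f}{2} + \frac{\sigma^2}{\mu_m - \mu_f}\,\ln\frac{p_f}{p_m}.
$$

With μm > μf, increasing pm relative to pf lowers h*, i.e. more heights get classified as male.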

29 Example
The position of the decision boundary h* depends on the ratio pm/pf (the two plots compare the cases pf < pm and pf > pm).


31 Gaussian Generative Model (II)
Inputs contain multiple features.
Example:
- Task: predict whether an individual is overweight based on his/her salary and the number of hours spent watching TV
- Input: (s: salary, h: hours spent watching TV)
- Output: +1 (overweight), -1 (normal)

32 Multi-variate Gaussian Distribution
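The density referenced on this slide is the standard d-dimensional Gaussian:

$$
\mathcal{N}(\mathbf{x}; \boldsymbol{\mu}, \Sigma)
= \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}}
\exp\!\Big( -\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\top} \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \Big),
$$

with mean vector μ ∈ R^d and covariance matrix Σ ∈ R^{d×d}.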



37 Properties of Covariance Matrix
- What happens when the number of data points N < d?
- What is aᵀΣa for any vector a? Σ is a positive semi-definite matrix.
- How many distinct elements does Σ have?
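These properties follow directly from the maximum-likelihood form of the covariance matrix (a short derivation added for completeness):

$$
\hat{\Sigma} = \frac{1}{N} \sum_{i=1}^{N} (\mathbf{x}_i - \hat{\boldsymbol{\mu}})(\mathbf{x}_i - \hat{\boldsymbol{\mu}})^{\top},
\qquad
\mathbf{a}^{\top} \hat{\Sigma}\, \mathbf{a}
= \frac{1}{N} \sum_{i=1}^{N} \big( \mathbf{a}^{\top} (\mathbf{x}_i - \hat{\boldsymbol{\mu}}) \big)^2 \;\ge\; 0 .
$$

So the estimated Σ is always positive semi-definite; it is a sum of N rank-one matrices, so its rank is at most N, and when N < d it is necessarily singular. Since Σ is symmetric, it has d(d+1)/2 distinct elements.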

38 Gaussian Generative Model (II)
Joint distribution p(s, h) of salary (s) and hours of TV watching (h)


40 Multi-variate Gaussian Generative Model
- Inputs with multiple features
- A multi-variate Gaussian distribution for each class
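A minimal sketch of the multivariate case; the two-feature dataset (salary, TV hours) and all values are made up for illustration, and the log-density is evaluated with a determinant/solve rather than an explicit inverse:

```python
import numpy as np

def fit_class(X):
    # MLE for one class: sample mean vector and sample covariance matrix.
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False, bias=True)  # bias=True -> divide by N (the MLE)
    return mu, cov

def log_gaussian(x, mu, cov):
    # Log density of a multivariate Gaussian N(x; mu, cov).
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(cov)
    quad = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + quad)

# Hypothetical training data: columns are (salary in $1000s, TV hours per week).
X_pos = np.array([[40.0, 30.0], [55.0, 28.0], [35.0, 35.0], [60.0, 25.0]])  # overweight (+1)
X_neg = np.array([[50.0, 10.0], [70.0, 8.0], [45.0, 12.0], [65.0, 15.0]])   # normal (-1)

params = {+1: fit_class(X_pos), -1: fit_class(X_neg)}
n_total = len(X_pos) + len(X_neg)
priors = {+1: len(X_pos) / n_total, -1: len(X_neg) / n_total}

# Predict by comparing log p(x | y) + log p(y) across the two classes.
x_new = np.array([52.0, 26.0])
scores = {y: log_gaussian(x_new, *params[y]) + np.log(priors[y]) for y in (+1, -1)}
print("predicted class:", max(scores, key=scores.get))
```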

41 Improve the Multivariate Gaussian Model
How could we improve the model's prediction of overweight?
- Allow multiple modes for each class
- Introduce more attributes of individuals: location, occupation, number of children, house, age, …

42 Problems with Using the Multi-variate Gaussian Generative Model
- Σ is a d×d matrix and contains d(d+1)/2 independent variables
  - d = 100: the number of variables in Σ is 5,050
  - d = 1,000: the number of variables in Σ is 500,500
  - A large parameter space
- Σ can be singular
  - If N < d
  - If two features are linearly correlated
  - Then Σ⁻¹ does not exist


44 Problems with Using the Multi-variate Gaussian Generative Model
- Diagonalize Σ
- Feature independence assumption (Naïve Bayes assumption)

45 Problems with Using the Multi-variate Gaussian Generative Model
- Diagonalize Σ
- Smooth the covariance matrix
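The transcript does not show the smoothing used in the lecture; one common choice (stated here only as an example) is to shrink the estimated covariance toward its diagonal, or to add a small ridge, either of which keeps Σ invertible:

$$
\tilde{\Sigma} = (1 - \lambda)\, \hat{\Sigma} + \lambda\, \mathrm{diag}(\hat{\Sigma}),
\qquad \text{or} \qquad
\tilde{\Sigma} = \hat{\Sigma} + \lambda I,
\qquad \lambda > 0 .
$$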

46 Overfitting Issue
Complex model vs. insufficient training data
Example: consider a classification problem with multiple inputs
- 100 input features
- 5 classes
- 1,000 training examples
The total number of parameters for a full Gaussian generative model is
- 5 class priors → 5 parameters
- 5 mean vectors → 500 parameters
- 5 covariance matrices → 5 × 5,050 = 25,250 parameters
- 25,755 parameters in total → insufficient training data

47 Model Complexity Vs. Data



52 Naïve Bayes Model
- In general, for any generative model we have to estimate p(x|y; θ)
- For x in a high-dimensional space, this probability is hard to estimate
- In the Naïve Bayes model we approximate it by a product of per-feature probabilities: p(x|y; θ) ≈ ∏_j p(x_j|y; θ)


55 Text Categorization
Learn to classify text into predefined categories
- Input x: a document, represented by a vector of word counts, e.g. {(president, 10), (bush, 2), (election, 5), …}
- Output y: whether the document is about politics; +1 for a political document, -1 otherwise

56 Text Categorization
A generative model for text classification (TC)
- Parameter space: p(+) and p(-); p(doc|+; θ) and p(doc|-; θ)
- It is difficult to estimate p(doc|+; θ) and p(doc|-; θ): the typical vocabulary size is ~100,000, so each document is a vector of 100,000 attributes, and a document contains too many words
- A Naïve Bayes approach


59 Text Classification
A Naïve Bayes approach: for a document doc, approximate p(doc|y; θ) by the product of its word probabilities, p(doc|y; θ) ≈ ∏_{w ∈ doc} p(w|y)

60 Text Classification
- The original parameter space: p(+) and p(-); p(doc|+; θ), p(doc|-; θ)
- Parameter space after the Naïve Bayes simplification: p(+) and p(-); {p(w1|+), p(w2|+), …, p(wn|+)}; {p(w1|-), p(w2|-), …, p(wn|-)}

61 Text Classification
Learning parameters from training examples
- Each document is represented by its word counts
- Learn the parameters using maximum likelihood estimation

62 Text Classification


65 The optimal solution that maximizes the likelihood of the training data
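The closed-form solution itself is not in the transcript; for a multinomial Naïve Bayes model it is the standard count-based estimate (stated here under that assumption), where n(w, y) is the number of occurrences of word w in training documents of class y:

$$
\hat{p}(w \mid y) = \frac{n(w, y)}{\sum_{w'} n(w', y)},
\qquad
\hat{p}(y) = \frac{\#\{\text{training documents of class } y\}}{\#\{\text{training documents}\}} .
$$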

66 Text Classification
An Example: Twenty Newsgroups

67 Text Classification
Any problems with the Naïve Bayes text classifier?
- Unseen words
  - A word w is unseen in all training documents: what is the consequence?
  - A word w is unseen only in the documents of one class: what is the consequence?
  - This is related to the overfitting problem
- Any suggestions?
- Solution: a word-class approach
  - Introduce word classes T = {t1, t2, …, tm}
  - Compute p(ti|+) and p(ti|-)
  - When w was unseen in training, replace p(w|y) with p(ti|y) for the word class ti containing w
- Alternative: introduce a prior for the word probabilities (see the sketch below)
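A minimal sketch of a Naïve Bayes text classifier; it uses add-one (Laplace) smoothing as one concrete way of putting a prior on the word probabilities, which also gives unseen words a nonzero probability. The tiny corpus, the words, and the labels are all made up for illustration:

```python
from collections import Counter
import math

# Hypothetical training documents as bags of words (+1 = political, -1 = not).
train = [
    (["president", "election", "vote", "senate"], +1),
    (["bush", "president", "policy", "election"], +1),
    (["game", "score", "team", "season"], -1),
    (["movie", "actor", "film", "score"], -1),
]

word_counts = {+1: Counter(), -1: Counter()}
doc_counts = Counter()
for words, y in train:
    word_counts[y].update(words)
    doc_counts[y] += 1

vocab = {w for words, _ in train for w in words}
V = len(vocab)

def log_p_word(w, y, alpha=1.0):
    # Smoothed estimate of p(w | y): add-one (Laplace) smoothing, so a word
    # unseen in the training documents of class y still gets probability > 0.
    total = sum(word_counts[y].values())
    return math.log((word_counts[y][w] + alpha) / (total + alpha * V))

def classify(words):
    scores = {}
    for y in (+1, -1):
        log_prior = math.log(doc_counts[y] / sum(doc_counts.values()))
        scores[y] = log_prior + sum(log_p_word(w, y) for w in words)
    return max(scores, key=scores.get)

print(classify(["election", "president", "policy"]))  # expected: +1
```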

68 Naïve Bayes Model  This is a terrible approximation

69 Naïve Bayes Model
Why use the Naïve Bayes model?
- We are ultimately interested in p(y|x; θ), not p(x|y; θ)


72 Naïve Bayes Model
- The key quantity for prediction is not p(x|y; θ) itself, but the ratio p(x|y; θ) / p(x|y'; θ)
- Although the Naïve Bayes model does a poor job of estimating p(x|y; θ), it does a reasonably good job of estimating this ratio


75 The Ratio of Likelihoods for Binary Classes
Assume that both classes share the same variance (covariance). Then the Gaussian generative model is a linear model.
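Writing out the log-ratio for two Gaussian classes that share the covariance Σ shows why the classifier becomes linear in x (the quadratic terms cancel):

$$
\log \frac{p(x \mid +; \theta)\, p(+)}{p(x \mid -; \theta)\, p(-)}
= (\mu_+ - \mu_-)^{\top} \Sigma^{-1} x
+ \tfrac{1}{2}\big( \mu_-^{\top} \Sigma^{-1} \mu_- - \mu_+^{\top} \Sigma^{-1} \mu_+ \big)
+ \log \frac{p(+)}{p(-)}
\;=\; \mathbf{w}^{\top} x + b .
$$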

76 Linear Decision Boundary
- A Gaussian generative model (with shared covariance) amounts to finding a linear decision boundary
- Why not directly estimate the decision boundary?

