
Slide 1. Ch 1. Introduction
Pattern Recognition and Machine Learning, C. M. Bishop, 2006.
Updated 2007-03-27 by J.-H. Eom (2nd round revision); originally summarized by K.-I. Kim.
Biointelligence Laboratory, Seoul National University, http://bi.snu.ac.kr/

Slide 2. Contents
1.1 Example: Polynomial Curve Fitting
1.2 Probability Theory
  1.2.1 Probability densities
  1.2.2 Expectations and covariances
  1.2.3 Bayesian probabilities
  1.2.4 The Gaussian distribution
  1.2.5 Curve fitting re-visited
  1.2.6 Bayesian curve fitting
1.3 Model Selection

Slide 3. Pattern Recognition
- Training set, target vector
- Training (learning) phase
  - Determine the mapping y(x) from inputs to targets
- Generalization
  - Measured on a test set
- Preprocessing
  - Feature extraction

Slide 4. Supervised, Unsupervised, and Reinforcement Learning
- Supervised learning: with a target vector
  - Classification
  - Regression
- Unsupervised learning: without a target vector
  - Clustering
  - Density estimation
  - Visualization
- Reinforcement learning: maximize a reward
  - Trade-off between exploration and exploitation

Slide 5. 1.1 Example: Polynomial Curve Fitting
- N observations of an input x with corresponding targets t
- Fit the data with a polynomial function: y(x, w) = w_0 + w_1 x + ... + w_M x^M
- Minimize an error function, the sum of squares of the errors:
  E(w) = (1/2) sum_n (y(x_n, w) - t_n)^2
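A minimal sketch of the least-squares polynomial fit described above, assuming synthetic data from a noisy sin(2*pi*x) curve (the running example in Bishop Ch. 1); the data, order M, and noise level are illustrative choices, not from the slides.

```python
# Least-squares polynomial curve fitting on synthetic data (illustrative).
import numpy as np

rng = np.random.default_rng(0)
N, M = 10, 3
x = np.linspace(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)   # noisy targets

# Design matrix with columns x^0, x^1, ..., x^M
Phi = np.vander(x, M + 1, increasing=True)

# w minimizing the sum-of-squares error E(w) = 0.5 * ||Phi w - t||^2
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
print("fitted coefficients:", w)
```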

Slide 6. Model Selection and Over-fitting (1/2)
- Choosing the order M of the polynomial

Slide 7. Model Selection and Over-fitting (2/2)
- Root-mean-square (RMS) error: E_RMS = sqrt(2 E(w*) / N)
- Too large an M → over-fitting
- The more data, the better the generalization
- Over-fitting is a general property of maximum likelihood
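As a rough illustration of the over-fitting behaviour above, the sketch below compares training and test RMS error as M grows; the synthetic data, noise level, and candidate orders are assumptions made only for illustration.

```python
# Training vs. test RMS error for increasing polynomial order M (illustrative).
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 1.0, 10)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.3, size=10)
x_test = np.linspace(0.0, 1.0, 100)
t_test = np.sin(2 * np.pi * x_test) + rng.normal(scale=0.3, size=100)

def fit(x, t, M):
    Phi = np.vander(x, M + 1, increasing=True)
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    return w

def rms_error(w, x, t):
    y = np.vander(x, len(w), increasing=True) @ w
    return np.sqrt(np.mean((y - t) ** 2))   # equals sqrt(2 E(w) / N)

for M in (0, 1, 3, 9):
    w = fit(x_train, t_train, M)
    print(f"M={M}: train {rms_error(w, x_train, t_train):.3f}, "
          f"test {rms_error(w, x_test, t_test):.3f}")
```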

Slide 8. Regularization
- Control the over-fitting phenomenon by adding a penalty term to the error function:
  E~(w) = (1/2) sum_n (y(x_n, w) - t_n)^2 + (lambda/2) ||w||^2
- Known as shrinkage methods in statistics
- A quadratic regularizer gives ridge regression
- In neural networks, this is known as weight decay
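A minimal sketch of the penalized fit: the closed-form ridge-regression solution to the regularized error function above; the data and the lambda values tried are illustrative assumptions.

```python
# Regularized least squares (ridge regression) for polynomial fitting.
import numpy as np

def fit_ridge(x, t, M, lam):
    Phi = np.vander(x, M + 1, increasing=True)
    # Closed-form minimizer of 0.5*||Phi w - t||^2 + 0.5*lam*||w||^2
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(M + 1), Phi.T @ t)

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=10)

for lam in (0.0, np.exp(-18), 1.0):
    w = fit_ridge(x, t, 9, lam)
    print(f"lambda={lam:.2e}: max|w| = {np.max(np.abs(w)):.2f}")
```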

Slide 9. 1.2 Probability Theory
- "What is the overall probability that the selection procedure will pick an apple?"
- "Given that we have chosen an orange, what is the probability that the box we chose was the blue one?"

Slide 10. Rules of Probability (1/2)
- Joint probability: p(X, Y)
- Marginal probability: p(X) = sum_Y p(X, Y)
- Conditional probability: p(Y | X) = p(X, Y) / p(X)

Slide 11. Rules of Probability (2/2)
- Sum rule: p(X) = sum_Y p(X, Y)
- Product rule: p(X, Y) = p(Y | X) p(X)
- Bayes' theorem: p(Y | X) = p(X | Y) p(Y) / p(X)
  - Posterior ∝ likelihood × prior, with p(X) = sum_Y p(X | Y) p(Y) as the normalizing constant
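To make these rules concrete, here is a minimal sketch of the two questions from the fruit-and-box example; the counts (red box: 2 apples, 6 oranges; blue box: 3 apples, 1 orange) and p(red) = 0.4 follow Bishop's Figure 1.9 and should be treated as assumptions for illustration.

```python
# Sum rule, product rule, and Bayes' theorem on the box-and-fruit example.
p_box = {"red": 0.4, "blue": 0.6}
p_fruit_given_box = {
    "red":  {"apple": 2 / 8, "orange": 6 / 8},
    "blue": {"apple": 3 / 4, "orange": 1 / 4},
}

# Sum + product rules: marginal probability of picking an apple
p_apple = sum(p_fruit_given_box[b]["apple"] * p_box[b] for b in p_box)
print("p(apple) =", p_apple)                      # 0.55

# Bayes' theorem: posterior probability of the blue box given an orange
p_orange = 1.0 - p_apple
p_blue_given_orange = p_fruit_given_box["blue"]["orange"] * p_box["blue"] / p_orange
print("p(blue | orange) =", p_blue_given_orange)  # 1/3
```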

Slide 12. (Figure slide: histogram illustration of the probability rules for N = 60 data points.)

Slide 13. 1.2.1 Probability Densities
- Probabilities with respect to continuous variables
- Probability density p(x) over x: p(x ∈ (a, b)) = ∫_a^b p(x) dx, with p(x) ≥ 0 and ∫ p(x) dx = 1
- Sum rule: p(x) = ∫ p(x, y) dy
- Product rule: p(x, y) = p(y | x) p(x)
- Cumulative distribution function: P(z) = ∫_{-∞}^{z} p(x) dx
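A small numeric sketch of these definitions, assuming a unit Gaussian as the density (an illustrative choice): it checks the normalization and evaluates the cumulative distribution by a simple Riemann sum.

```python
# Numerical check of normalization and the CDF for a unit Gaussian density.
import numpy as np

x = np.linspace(-8.0, 8.0, 20001)
p = np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)   # density p(x)
dx = x[1] - x[0]

total = p.sum() * dx                   # ∫ p(x) dx, should be ~1
cdf_at_0 = p[x <= 0.0].sum() * dx      # P(0) = ∫_{-inf}^{0} p(x) dx, ~0.5
print(total, cdf_at_0)
```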

Slide 14. 1.2.2 Expectations and Covariances
- Expectation of f(x): the average value of f(x) under a probability distribution p(x)
  - E[f] = sum_x p(x) f(x) for discrete x, E[f] = ∫ p(x) f(x) dx for continuous x
  - Conditional expectation: E_x[f | y] = sum_x p(x | y) f(x)
- Variance: a measure of how much variability there is in f(x) around its mean
  - var[f] = E[(f(x) - E[f(x)])^2] = E[f(x)^2] - E[f(x)]^2
- Covariance: the extent to which x and y vary together
  - cov[x, y] = E_{x,y}[(x - E[x])(y - E[y])] = E_{x,y}[x y] - E[x] E[y]
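A hedged sketch estimating these quantities by Monte Carlo from samples; the distributions of x and y and the choice f(x) = x^2 are assumptions made purely for illustration.

```python
# Monte Carlo estimates of an expectation, a variance, and a covariance.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=100_000)
y = 0.5 * x + rng.normal(scale=1.0, size=x.size)    # y correlated with x

f = x**2                                            # f(x) = x^2
print("E[f]      ~", f.mean())                      # ≈ 1^2 + 2^2 = 5
print("var[f]    ~", f.var())                       # E[f^2] - E[f]^2
print("cov[x, y] ~", np.mean((x - x.mean()) * (y - y.mean())))  # ≈ 0.5 * var[x] = 2
```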

Slide 15. 1.2.3 Bayesian Probabilities: Frequentist vs. Bayesian
- Likelihood: p(D | w)
- Frequentist view
  - w is considered a fixed parameter determined by an 'estimator'
  - Maximum likelihood: the error function is the negative log likelihood
  - Error bars: obtained from the distribution of possible data sets (e.g. via the bootstrap)
- Bayesian view
  - There is a single data set (the one that is actually observed)
  - The uncertainty in the parameters is expressed as a probability distribution over w
  - Bayes' theorem: p(w | D) = p(D | w) p(w) / p(D)
  - Advantage: the inclusion of prior knowledge arises naturally
  - Leads to less extreme conclusions by incorporating the prior
  - Non-informative priors can be used when little is known

Slide 16. 1.2.3 Bayesian Probabilities: Expansion of Bayesian Applications
- The full Bayesian procedure long saw limited practical application
  - Even though its origins lie in the 18th century
  - It requires marginalizing over the whole parameter space
- Markov chain Monte Carlo (MCMC) sampling methods
  - Computationally intensive
  - Used for small-scale problems
- Highly efficient deterministic approximation schemes
  - e.g. variational Bayes, expectation propagation
  - An alternative to sampling methods
  - Have allowed Bayesian methods to be used in large-scale problems

Slide 17. 1.2.4 The Gaussian Distribution
- Gaussian distribution for a single real-valued variable x:
  N(x | mu, sigma^2) = (2 pi sigma^2)^(-1/2) exp(-(x - mu)^2 / (2 sigma^2))
  - mu: mean, sigma^2: variance, sigma: standard deviation, beta = 1/sigma^2: precision
- D-dimensional multivariate Gaussian distribution for a D-dimensional vector x of continuous variables:
  N(x | mu, Sigma) = (2 pi)^(-D/2) |Sigma|^(-1/2) exp(-(1/2) (x - mu)^T Sigma^(-1) (x - mu))
  - mu: mean vector, Sigma: covariance matrix, |Sigma|: determinant of Sigma
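The densities above can be evaluated directly from the formulas; the sketch below does so for an illustrative univariate and 2-dimensional case (the test points and parameters are arbitrary choices).

```python
# Evaluate the univariate and multivariate Gaussian densities.
import numpy as np

def gauss(x, mu, sigma2):
    return np.exp(-((x - mu) ** 2) / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

def gauss_mv(x, mu, Sigma):
    D = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)      # (x-mu)^T Sigma^{-1} (x-mu)
    norm = (2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

print(gauss(0.0, 0.0, 1.0))                           # ~0.3989
print(gauss_mv(np.zeros(2), np.zeros(2), np.eye(2)))  # ~0.1592
```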

Slide 18. 1.2.4 The Gaussian Distribution: Example (1/2)
- Estimating the unknown parameters mu and sigma^2 by maximum likelihood
- The data points are i.i.d., so the likelihood is p(x | mu, sigma^2) = prod_n N(x_n | mu, sigma^2)
- Maximizing with respect to mu gives the sample mean: mu_ML = (1/N) sum_n x_n
- Maximizing with respect to the variance gives the sample variance: sigma^2_ML = (1/N) sum_n (x_n - mu_ML)^2
- The two estimates can be evaluated sequentially: first mu_ML, then sigma^2_ML
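A minimal sketch of these maximum-likelihood estimates on synthetic i.i.d. Gaussian samples (the true mean and standard deviation below are illustrative assumptions).

```python
# ML estimates of a Gaussian's mean and variance from i.i.d. samples.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=1000)

mu_ml = x.mean()                          # mu_ML = (1/N) sum_n x_n
sigma2_ml = np.mean((x - mu_ml) ** 2)     # sigma^2_ML = (1/N) sum_n (x_n - mu_ML)^2
print(mu_ml, sigma2_ml)                   # ~2.0, ~9.0
```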

Slide 19. 1.2.4 The Gaussian Distribution: Example (2/2)
- Bias phenomenon
  - A limitation of the maximum likelihood approach, related to over-fitting
- Taking expectations over data sets:
  - E[mu_ML] = mu: the mean is correct
  - E[sigma^2_ML] = ((N - 1) / N) sigma^2: the variance is underestimated
  - The unbiased estimate is sigma~^2 = (N / (N - 1)) sigma^2_ML
- (Figure: true mean vs. sample mean for data sets of size N)
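An empirical check of this bias, repeating the ML variance estimate over many small data sets; N, the true variance, and the number of trials are illustrative assumptions.

```python
# Empirical demonstration that E[sigma^2_ML] = (N - 1)/N * sigma^2.
import numpy as np

rng = np.random.default_rng(0)
true_var, N, trials = 4.0, 5, 200_000

samples = rng.normal(scale=np.sqrt(true_var), size=(trials, N))
mu_ml = samples.mean(axis=1, keepdims=True)
sigma2_ml = np.mean((samples - mu_ml) ** 2, axis=1)

print("mean of sigma^2_ML:", sigma2_ml.mean())            # ~3.2
print("expected biased value:", (N - 1) / N * true_var)   # 3.2
```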

Slide 20. 1.2.5 Curve Fitting Re-visited (1/2)
- Goal of the curve fitting problem: prediction of the target variable t given some new input variable x
- Assume a Gaussian distribution for t: p(t | x, w, beta) = N(t | y(x, w), beta^(-1)), where beta is the precision (inverse variance)
- Determine the unknown w and beta by maximum likelihood using the training data {x, t}
- Likelihood for data drawn i.i.d. from this distribution: p(t | x, w, beta) = prod_n N(t_n | y(x_n, w), beta^(-1))
- In log form: ln p(t | x, w, beta) = -(beta/2) sum_n (y(x_n, w) - t_n)^2 + (N/2) ln beta - (N/2) ln(2 pi)

Slide 21. 1.2.5 Curve Fitting Re-visited (2/2)
- Maximizing the likelihood with respect to w = minimizing the sum-of-squares error function (the negative log likelihood)
- Determining the precision by maximum likelihood: 1/beta_ML = (1/N) sum_n (y(x_n, w_ML) - t_n)^2
- Predictive distribution: predictions for new values of x using w_ML and beta_ML,
  p(t | x, w_ML, beta_ML) = N(t | y(x, w_ML), beta_ML^(-1))
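A minimal sketch of this maximum-likelihood fit: w_ML coincides with the least-squares solution and beta_ML is the inverse mean squared residual, giving a Gaussian predictive distribution at a new input (the synthetic data and the query point are illustrative assumptions).

```python
# ML curve fitting with a Gaussian noise model and its predictive distribution.
import numpy as np

rng = np.random.default_rng(0)
N, M = 10, 3
x = np.linspace(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)

Phi = np.vander(x, M + 1, increasing=True)
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)     # w_ML
beta_ml = 1.0 / np.mean((Phi @ w_ml - t) ** 2)     # 1/beta_ML = mean squared residual

x_new = 0.25
mean = (np.vander([x_new], M + 1, increasing=True) @ w_ml)[0]   # y(x_new, w_ML)
print("predictive mean:", mean, "predictive variance:", 1.0 / beta_ml)
```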

Slide 22. Maximum Posterior (MAP)
- Determine w as its most probable value given the data, i.e. by maximizing the posterior
- Introduce a prior, a Gaussian distribution on w: p(w | alpha) = N(w | 0, alpha^(-1) I), where alpha is a hyper-parameter
- Taking the negative logarithm and combining the previous terms, the maximum of the posterior is given by the minimum of
  (beta/2) sum_n (y(x_n, w) - t_n)^2 + (alpha/2) w^T w
- Maximizing the posterior distribution = minimizing the regularized sum-of-squares error function (1.4), with regularization parameter lambda = alpha / beta
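The equivalence above can be checked directly: with a Gaussian prior, the MAP estimate is the ridge-regression solution with lambda = alpha / beta; the values of alpha and beta below are illustrative assumptions.

```python
# MAP estimate of w = regularized least squares with lambda = alpha / beta.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=10)

M, alpha, beta = 9, 5e-3, 11.1
Phi = np.vander(x, M + 1, increasing=True)
lam = alpha / beta
w_map = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M + 1), Phi.T @ t)
print("MAP (ridge) coefficients:", w_map)
```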

Slide 23. 1.2.6 Bayesian Curve Fitting
- Marginalization: a fully Bayesian treatment integrates over w instead of using a point estimate,
  p(t | x, x_train, t_train) = ∫ p(t | x, w) p(w | x_train, t_train) dw
- For this model the predictive distribution is Gaussian, p(t | x, x_train, t_train) = N(t | m(x), s^2(x)), with
  m(x) = beta phi(x)^T S sum_n phi(x_n) t_n
  s^2(x) = beta^(-1) + phi(x)^T S phi(x)
  S^(-1) = alpha I + beta sum_n phi(x_n) phi(x_n)^T
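A minimal sketch of the predictive mean m(x) and variance s^2(x) above, with a polynomial basis phi(x) = (1, x, ..., x^M)^T; the data and the values of alpha and beta are illustrative assumptions.

```python
# Bayesian predictive distribution for polynomial curve fitting.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=10)
M, alpha, beta = 9, 5e-3, 11.1

def phi(xv):
    """Polynomial basis vectors as columns, shape (M+1, n)."""
    return np.vander(np.atleast_1d(xv), M + 1, increasing=True).T

Phi = phi(x)                                            # (M+1, N)
S = np.linalg.inv(alpha * np.eye(M + 1) + beta * Phi @ Phi.T)

def predict(x_new):
    p = phi(x_new)[:, 0]
    mean = beta * p @ S @ (Phi @ t)                     # m(x)
    var = 1.0 / beta + p @ S @ p                        # s^2(x)
    return mean, var

print(predict(0.5))
```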

Slide 24. 1.3 Model Selection
- Proper model complexity → good generalization and the best model
- Measuring generalization performance
  - If data are plentiful, divide them into training, validation, and test sets
  - Otherwise, use cross-validation (a leave-one-out sketch follows this slide)
    - Leave-one-out technique as the extreme case
    - Drawbacks: expensive computation; models with multiple complexity parameters may require many separate training runs
- New measures of performance
  - e.g. Akaike information criterion (AIC), Bayesian information criterion (BIC)
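A hedged sketch of the leave-one-out procedure mentioned above, used here to compare polynomial orders; the synthetic data and candidate orders are assumptions for illustration.

```python
# Leave-one-out cross-validation for choosing the polynomial order M.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=10)

def loo_error(M):
    errors = []
    for i in range(len(x)):
        keep = np.arange(len(x)) != i
        Phi = np.vander(x[keep], M + 1, increasing=True)
        w, *_ = np.linalg.lstsq(Phi, t[keep], rcond=None)
        pred = np.vander(x[i:i + 1], M + 1, increasing=True) @ w
        errors.append((pred[0] - t[i]) ** 2)
    return np.mean(errors)

for M in (1, 3, 5, 9):
    print(f"M={M}: leave-one-out MSE = {loo_error(M):.3f}")
```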

