
1 Lecture Slides for Introduction to Machine Learning 2e. ETHEM ALPAYDIN © The MIT Press, 2010. alpaydin@boun.edu.tr http://www.cmpe.boun.edu.tr/~ethem/i2ml2e

2 CHAPTER 4: Parametric Methods

3 Parametric Estimation X = {x^t}, t = 1,...,N, where x^t ~ p(x). Here x is one-dimensional and the densities are univariate. Parametric estimation: assume a form for p(x|θ) and estimate θ, its sufficient statistics, using X; e.g., p(x) = N(μ, σ²), where θ = {μ, σ²}.

4 Maximum Likelihood Estimation Likelihood of θ given the sample X: l(θ|X) ≡ p(X|θ) = ∏_t p(x^t|θ), t = 1,...,N. Log likelihood: L(θ|X) ≡ log l(θ|X) = ∑_t log p(x^t|θ). Maximum likelihood estimator (MLE): θ* = arg max_θ L(θ|X).

5 Examples: Bernoulli Density Two states, failure/success, x in {0,1}: P(x) = p^x (1 − p)^(1−x). l(p|X) = ∏_t p^(x^t) (1 − p)^(1−x^t). L(p|X) = log l(p|X) = (∑_t x^t) log p + (N − ∑_t x^t) log(1 − p). Setting dL/dp = 0 gives the MLE: p̂ = (∑_t x^t) / N, the sample proportion of successes.
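As a quick sketch (not from the original slides, data are synthetic), the Bernoulli MLE can be checked numerically: the value of p that maximizes the log likelihood over a grid coincides with the sample proportion.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.binomial(1, 0.3, size=200)      # synthetic Bernoulli sample, true p = 0.3

# Closed-form MLE: the sample proportion of successes.
p_mle = x.sum() / len(x)

# Numerical check: maximize the log likelihood over a grid of candidate p values.
grid = np.linspace(0.01, 0.99, 981)
loglik = x.sum() * np.log(grid) + (len(x) - x.sum()) * np.log(1 - grid)
p_grid = grid[np.argmax(loglik)]

print(p_mle, p_grid)                    # both close to 0.3
```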

6 Examples: Multinomial Density K > 2 states, x_i in {0,1}: P(x_1, x_2, ..., x_K) = ∏_i p_i^(x_i). L(p_1, p_2, ..., p_K|X) = log ∏_t ∏_i p_i^(x_i^t), where x_i^t = 1 if experiment t chooses state i and x_i^t = 0 otherwise. MLE: p̂_i = (∑_t x_i^t) / N. Note: the K states are mutually exclusive, so exactly one x_i^t is 1 for each t.
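A minimal sketch of the multinomial MLE with made-up probabilities: each p̂_i is just the fraction of trials that landed in state i.

```python
import numpy as np

rng = np.random.default_rng(0)
K, N = 3, 1000
true_p = np.array([0.2, 0.5, 0.3])

# Each row x^t is a one-of-K indicator vector (exactly one 1 per trial).
x = rng.multinomial(1, true_p, size=N)

p_mle = x.sum(axis=0) / N               # MLE: p_i = (sum_t x_i^t) / N
print(p_mle)                            # close to [0.2, 0.5, 0.3]
```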

7 Gaussian (Normal) Distribution p(x) = N(μ, σ²): p(x) = (1/(√(2π)σ)) exp[−(x − μ)²/(2σ²)]. Given a sample X = {x^t}, t = 1,...,N, with x^t ~ N(μ, σ²), the log likelihood of the Gaussian sample is L(μ, σ|X) = −(N/2) log(2π) − N log σ − ∑_t (x^t − μ)²/(2σ²). (Why the factor N? Because the N samples are independent, the log likelihood is a sum of N per-sample terms.)

8 Gaussian (Normal) Distribution MLE for μ and σ² (exercise: set the derivatives of L to zero): m = (∑_t x^t) / N and s² = (∑_t (x^t − m)²) / N.
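A short sketch of these Gaussian MLEs on synthetic data; note that the MLE of σ² divides by N, which is what NumPy's default ddof=0 computes.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.5, size=1000)   # synthetic sample, mu=2, sigma=1.5

m = x.sum() / len(x)                    # MLE of mu: the sample mean
s2 = ((x - m) ** 2).sum() / len(x)      # MLE of sigma^2: divides by N, not N-1

print(m, s2)                            # close to 2.0 and 1.5**2 = 2.25
print(np.var(x))                        # same as s2 (np.var uses ddof=0 by default)
```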

9 Bias and Variance Unknown parameter θ. Estimator d_i = d(X_i) on sample X_i. Bias: b_θ(d) = E[d] − θ. Variance: E[(d − E[d])²]. If b_θ(d) = 0 for all θ, d is an unbiased estimator of θ. If E[(d − E[d])²] → 0 as N → ∞, d is a consistent estimator of θ.

10 Expected value If the probability distribution of X admits a probability density function f(x), then the expected value can be computed as E[X] = ∫ x f(x) dx. It follows directly from the discrete-case definition that if X is a constant random variable, i.e. X = b for some fixed real number b, then the expected value of X is also b. The expected value of an arbitrary function of X, g(X), with respect to the probability density function f(x) is given by the inner product of f and g: E[g(X)] = ∫ g(x) f(x) dx. http://en.wikipedia.org/wiki/Expected_value

11 Bias and Variance For example, for the sample mean m = (∑_t x^t)/N: E[m] = μ, so m is an unbiased estimator of μ, and Var[m] = σ²/N → 0 as N → ∞, so m is also a consistent estimator.

12 Bias and Variance For example (see pp. 65-66): E[s²] = ((N − 1)/N) σ², so s² is a biased estimator of σ², and (N/(N − 1)) s² is an unbiased estimator of σ². Mean square error: r(d, θ) = E[(d − θ)²] (see p. 66) = (E[d] − θ)² + E[(d − E[d])²] = bias² + variance.
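The bias of s² and the shrinking variance of m can be checked by simulation, as in this hedged sketch (the parameters and sample sizes are arbitrary): averaging each estimator over many samples approximates E[d], so E[s²] ≈ ((N − 1)/N) σ² and Var[m] ≈ σ²/N.

```python
import numpy as np

rng = np.random.default_rng(7)
mu, sigma2, N, M = 0.0, 4.0, 10, 100_000    # M independent samples, each of size N

X = rng.normal(mu, np.sqrt(sigma2), size=(M, N))
m = X.mean(axis=1)                          # sample mean of each sample
s2 = X.var(axis=1)                          # biased variance estimate (divides by N)

print(m.var(), sigma2 / N)                  # Var[m] ~= sigma^2 / N = 0.4
print(s2.mean(), (N - 1) / N * sigma2)      # E[s^2] ~= (N-1)/N * sigma^2 = 3.6
print((N / (N - 1) * s2).mean())            # ~= sigma^2 = 4.0 (unbiased version)
```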

13 Standard Deviation In statistics, the standard deviation is often estimated from a random sample drawn from the population. The most common measure used is the sample standard deviation, defined by s = √( (1/(N − 1)) ∑_{i=1}^{N} (x_i − x̄)² ), where x_1, ..., x_N is the sample (formally, realizations from a random variable X) and x̄ is the sample mean. http://en.wikipedia.org/wiki/Unbiased_estimation_of_standard_deviation

14 Bayes' Estimator Treat θ as a random variable with prior p(θ). Bayes' rule: p(θ|X) = p(X|θ) p(θ) / p(X). Maximum a posteriori (MAP) estimate: θ_MAP = arg max_θ p(θ|X). Maximum likelihood (ML) estimate: θ_ML = arg max_θ p(X|θ). Bayes' estimator: θ_Bayes = E[θ|X] = ∫ θ p(θ|X) dθ.

15 MAP vs ML If p(θ) is a uniform distribution, then θ_MAP = arg max_θ p(θ|X) = arg max_θ p(X|θ) p(θ) / p(X) = arg max_θ p(X|θ) = θ_ML, because p(θ)/p(X) is a constant that does not affect the arg max. Hence θ_MAP = θ_ML under a uniform (flat) prior.

16 Bayes' Estimator: Example If p(θ|X) is normal, then θ_Bayes = θ_MAP, and θ_ML = m. Example: suppose x^t ~ N(θ, σ²) and the prior is θ ~ N(μ₀, σ₀²). Then θ_Bayes = E[θ|X] = [(N/σ²) / (N/σ² + 1/σ₀²)] m + [(1/σ₀²) / (N/σ² + 1/σ₀²)] μ₀. The Bayes' estimator is a weighted average of the prior mean μ₀ and the sample mean m, with weights proportional to their precisions.
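A small sketch of the weighted-average form above (all numbers are invented): with few observations the estimate stays near the prior mean μ₀, and with many it approaches the sample mean m.

```python
import numpy as np

def bayes_estimate(x, sigma2, mu0, sigma0_2):
    """theta_Bayes for x^t ~ N(theta, sigma2) with prior theta ~ N(mu0, sigma0_2)."""
    N, m = len(x), np.mean(x)
    w = (N / sigma2) / (N / sigma2 + 1.0 / sigma0_2)   # weight on the sample mean
    return w * m + (1 - w) * mu0

rng = np.random.default_rng(3)
theta_true, sigma2 = 5.0, 1.0
mu0, sigma0_2 = 0.0, 0.5                               # prior pulls toward 0

for N in (2, 20, 2000):
    x = rng.normal(theta_true, np.sqrt(sigma2), N)
    print(N, bayes_estimate(x, sigma2, mu0, sigma0_2), x.mean())
```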

17 Parametric Classification The discriminant function is g_i(x) = p(x|C_i) P(C_i), or equivalently g_i(x) = log p(x|C_i) + log P(C_i). Assume that the class-conditional densities p(x|C_i) are Gaussian: p(x|C_i) = (1/(√(2π)σ_i)) exp[−(x − μ_i)²/(2σ_i²)], so g_i(x) = −(1/2) log 2π − log σ_i − (x − μ_i)²/(2σ_i²) + log P(C_i), which has the form of the log likelihood of a Gaussian sample.

18 Given the sample X = {x^t, r^t}, t = 1,...,N, where r_i^t = 1 if x^t ∈ C_i and 0 otherwise, the maximum likelihood (ML) estimates are P̂(C_i) = (∑_t r_i^t) / N, m_i = (∑_t r_i^t x^t) / (∑_t r_i^t), and s_i² = (∑_t r_i^t (x^t − m_i)²) / (∑_t r_i^t). The discriminant then becomes g_i(x) = −(1/2) log 2π − log s_i − (x − m_i)²/(2 s_i²) + log P̂(C_i).

19 The first term, −(1/2) log 2π, is a constant, and if the priors are equal, the log P̂(C_i) terms can also be dropped. If we further assume that the variances are equal, g_i(x) becomes g_i(x) = −(x − m_i)², and we choose C_i if |x − m_i| = min_j |x − m_j|, i.e. assign x to the class with the nearest mean.

20 Equal variances: a single boundary at halfway between the means. Figure: likelihood functions and the posteriors with equal priors for two classes when the input is one-dimensional; the variances are equal and the posteriors intersect at one point, which is the threshold of decision.

21 Different variances: two boundaries. Figure: likelihood functions and the posteriors with equal priors for two classes when the input is one-dimensional; the variances are unequal and the posteriors intersect at two points.

22 Exercise For a two-class problem, generate normal samples for the two classes with different variances, then use parametric classification to estimate the discriminant points. Compare these with the theoretical values. (You can use any normal random sample generator.)
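One possible sketch for this exercise (the class parameters, sample sizes, and equal priors are arbitrary choices): fit the ML estimates for each class, then solve g1(x) = g2(x), which is a quadratic in x when the variances differ, and compare the roots with those obtained from the true parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# True (theoretical) parameters -- made-up values for illustration.
mu = [-1.0, 2.0]
sigma = [1.0, 2.0]
N = 500                                  # samples per class (equal priors)

x1 = rng.normal(mu[0], sigma[0], N)
x2 = rng.normal(mu[1], sigma[1], N)

def boundaries(m1, s1, m2, s2, P1=0.5, P2=0.5):
    """Solve g1(x) = g2(x) for Gaussian class-conditionals (quadratic in x)."""
    a = 1.0 / (2 * s2**2) - 1.0 / (2 * s1**2)
    b = m1 / s1**2 - m2 / s2**2
    c = (m2**2 / (2 * s2**2) - m1**2 / (2 * s1**2)
         + np.log(s2 / s1) + np.log(P1 / P2))
    return np.sort(np.roots([a, b, c]))

# ML estimates from the generated samples (std with ddof=0 is the MLE of sigma).
m1, s1 = x1.mean(), x1.std()
m2, s2 = x2.mean(), x2.std()

print("estimated boundaries:  ", boundaries(m1, s1, m2, s2))
print("theoretical boundaries:", boundaries(mu[0], sigma[0], mu[1], sigma[1]))
```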

23 Regression Regression assumes zero-mean Gaussian noise added to the model: r = f(x) + ε with ε ~ N(0, σ²); here the model is linear, g(x|w_1, w_0) = w_1 x + w_0. Hence p(r|x) ~ N(g(x|θ), σ²): the probability of the output given the input. Given a sample X = {x^t, r^t}, t = 1,...,N, the log likelihood is L(θ|X) = log ∏_t p(x^t, r^t) = log ∏_t p(r^t|x^t) + log ∏_t p(x^t).

24 Regression: From LogL to Error Ignore the second term, log ∏_t p(x^t), because it does not depend on our estimator θ. The remaining log likelihood is L(θ|X) = −N log(√(2π) σ) − (1/(2σ²)) ∑_t [r^t − g(x^t|θ)]². Maximizing this is equivalent to minimizing E(θ|X) = (1/2) ∑_t [r^t − g(x^t|θ)]², which gives the least squares estimate.

25 Example: Linear Regression (Exercise!) With g(x|w_1, w_0) = w_1 x + w_0, setting the derivatives of E with respect to w_0 and w_1 to zero gives the two normal equations ∑_t r^t = N w_0 + w_1 ∑_t x^t and ∑_t r^t x^t = w_0 ∑_t x^t + w_1 ∑_t (x^t)², which can be solved for w_0 and w_1.
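A hedged sketch of this exercise (data are synthetic): solve the two normal equations directly and check against NumPy's polyfit.

```python
import numpy as np

rng = np.random.default_rng(5)
N = 50
x = rng.uniform(0, 1, N)
r = 2.0 * x + 0.5 + rng.normal(0, 0.1, N)     # synthetic data: w1 = 2.0, w0 = 0.5

# Normal equations in matrix form: A [w0, w1]^T = y
A = np.array([[N,        x.sum()],
              [x.sum(),  (x ** 2).sum()]])
y = np.array([r.sum(), (x * r).sum()])
w0, w1 = np.linalg.solve(A, y)

print(w1, w0)
print(np.polyfit(x, r, 1))                    # [w1, w0], should agree
```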

26 Regression w_1 = ?, w_0 = ? (see page 36). Figure: linear, second-order, and sixth-order polynomials fitted to the same set of points.

27 Example: Polynomial Regression For a polynomial of order k, g(x^t|w_k, ..., w_1, w_0) = w_k (x^t)^k + ... + w_1 x^t + w_0, the least squares estimate is w = (D^T D)^{-1} D^T r, where D is the N × (k + 1) matrix whose t-th row is [1, x^t, (x^t)², ..., (x^t)^k] and r is the vector of outputs.
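A sketch of this closed-form polynomial fit (the order and data are arbitrary): build the design matrix D and solve the least squares problem; np.polyfit gives the same coefficients in reverse order.

```python
import numpy as np

rng = np.random.default_rng(9)
N, k = 40, 3                                        # sample size and polynomial order
x = rng.uniform(0, 1, N)
r = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, N)   # synthetic targets

D = np.vander(x, k + 1, increasing=True)            # rows [1, x, x^2, ..., x^k]
w, *_ = np.linalg.lstsq(D, r, rcond=None)           # numerically safer than (D^T D)^-1 D^T r

print(w)                                            # [w0, w1, ..., wk]
print(np.polyfit(x, r, k)[::-1])                    # same coefficients
```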

28 Other Error Measures Square error: E(θ|X) = ∑_t [r^t − g(x^t|θ)]². Relative square error: E(θ|X) = ∑_t [r^t − g(x^t|θ)]² / ∑_t [r^t − r̄]². Absolute error: E(θ|X) = ∑_t |r^t − g(x^t|θ)|. ε-sensitive error: E(θ|X) = ∑_t 1(|r^t − g(x^t|θ)| > ε) (|r^t − g(x^t|θ)| − ε), which ignores errors smaller than ε.
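These measures are easy to state in code; the helper functions below are a hedged sketch implementing the same definitions (the example arrays are invented).

```python
import numpy as np

def square_error(r, g):
    return np.sum((r - g) ** 2)

def relative_square_error(r, g):
    return np.sum((r - g) ** 2) / np.sum((r - np.mean(r)) ** 2)

def absolute_error(r, g):
    return np.sum(np.abs(r - g))

def eps_sensitive_error(r, g, eps):
    d = np.abs(r - g)
    return np.sum((d > eps) * (d - eps))       # errors within eps are not counted

r = np.array([1.0, 2.0, 3.0])
g = np.array([1.1, 1.8, 3.4])
print(square_error(r, g), absolute_error(r, g), eps_sensitive_error(r, g, 0.25))
```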

29 Bias and Variance The expected square error at x (see Eq. 4.11 and pp. 66, 76): E[(r − g(x))² | x] = E[(r − E[r|x])² | x] + (E[r|x] − g(x))², i.e. noise + squared error. Taking the expected value over samples X (all of size N and drawn from the same joint density p(x, r)), the squared error term decomposes further (see Eq. 4.17): E_X[(E[r|x] − g(x))² | x] = (E[r|x] − E_X[g(x)])² + E_X[(g(x) − E_X[g(x)])²], i.e. squared error = bias² + variance.

30 Estimating Bias and Variance Samples X_i = {x_i^t, r_i^t}, i = 1,...,M, t = 1,...,N, are used to fit g_i(x), i = 1,...,M. Let ḡ(x) = (1/M) ∑_i g_i(x). Then bias²(g) = (1/N) ∑_t [ḡ(x^t) − f(x^t)]² and variance(g) = (1/(N M)) ∑_t ∑_i [g_i(x^t) − ḡ(x^t)]².
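A sketch of this estimation procedure on a toy problem (the target f, noise level, and model order are all arbitrary choices): draw M samples, fit g_i on each, then evaluate the bias² and variance formulas above on a fixed set of test points.

```python
import numpy as np

rng = np.random.default_rng(11)
f = lambda x: np.sin(2 * np.pi * x)            # underlying function (assumed known here)
M, N, order = 100, 25, 3                       # number of samples, sample size, model order

x_test = np.linspace(0, 1, 50)
G = np.empty((M, len(x_test)))
for i in range(M):
    x = rng.uniform(0, 1, N)
    r = f(x) + rng.normal(0, 0.3, N)           # noisy training sample X_i
    G[i] = np.polyval(np.polyfit(x, r, order), x_test)   # g_i evaluated on test points

g_bar = G.mean(axis=0)                         # average fit over the M samples
bias2 = np.mean((g_bar - f(x_test)) ** 2)
variance = np.mean((G - g_bar) ** 2)
print(bias2, variance)
```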

31 Bias/Variance Dilemma Examples: g_i(x) = 2 (a constant) has no variance but high bias; g_i(x) = (∑_t r_i^t)/N (the average of the responses in sample i) has lower bias, but nonzero variance. As we increase model complexity, bias decreases (a better fit to the data) and variance increases (the fit varies more with the data): the bias/variance dilemma (Geman et al., 1992).

32 Figure: the underlying function f, the fits g_i obtained on different samples, and their average ḡ; the deviation of ḡ from f illustrates bias, and the spread of the g_i around ḡ illustrates variance.

33 Polynomial Regression Figure: polynomial fits of increasing order to the same data; the best fit has minimum error, lower orders underfit, and higher orders overfit.

34 Best fit at the "elbow" of the error curve (Fig. 4.7).

35 Model Selection (1) Cross-validation: measure generalization accuracy by testing on data unused during training, to find the optimal complexity. Regularization: penalize complex models, E' = error on data + λ · model complexity. Structural risk minimization (SRM): find the model that is simplest in terms of order and best in terms of empirical error on the data; model complexity can be measured by, e.g., polynomials of increasing order, VC dimension, ... Minimum description length (MDL): the Kolmogorov complexity of a data set is defined as the shortest description of the data; prefer models that allow a short description.
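A minimal cross-validation sketch for choosing polynomial order (the split sizes, noise level, and candidate orders are all illustrative assumptions): train on one part of the data, measure error on the held-out part, and pick the order with the lowest validation error.

```python
import numpy as np

rng = np.random.default_rng(13)
x = rng.uniform(0, 1, 60)
r = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 60)   # synthetic data set

# Simple hold-out split: two thirds for training, one third for validation.
idx = rng.permutation(len(x))
tr, va = idx[:40], idx[40:]

errors = {}
for order in range(1, 9):
    w = np.polyfit(x[tr], r[tr], order)
    pred = np.polyval(w, x[va])
    errors[order] = np.mean((r[va] - pred) ** 2)     # validation mean squared error

best = min(errors, key=errors.get)
print(errors)
print("selected order:", best)                       # low orders underfit, high orders overfit
```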

36 Model Selection (2) Bayesian model selection: place a prior on models, p(model), and use Bayes' rule, p(model|data) = p(data|model) p(model) / p(data). Discussion: when the prior is chosen so that simpler models get higher probability, the Bayesian approach, regularization, SRM, and MDL are equivalent. Cross-validation is the best approach if there is a large enough validation dataset.

37 Regression example Coefficients increase in magnitude as the order increases: order 1: [-0.0769, 0.0016]; order 2: [0.1682, -0.6657, 0.0080]; order 3: [0.4238, -2.5778, 3.4675, -0.0002]; order 4: [-0.1093, 1.4356, -5.5007, 6.0454, -0.0019]. Compare with Fig. 4.5, p. 78.

