
1 1 Parameter Estimation Shyh-Kang Jeng Department of Electrical Engineering/ Graduate Institute of Communication/ Graduate Institute of Networking and Multimedia, National Taiwan University

2 2 Typical Classification Problem
– Rarely know the complete probabilistic structure of the problem
– Have vague, general knowledge
– Have a number of design samples or training data as representatives of the patterns for classification
– Find some way to use this information to design or train the classifier

3 3 Estimating Probabilities
– Not difficult to estimate the prior probabilities
– Hard to estimate the class-conditional densities
  – the number of available samples always seems too small
  – the problem is especially serious when the dimensionality is large

4 4 Estimating Parameters
– Many problems permit us to parameterize the conditional densities
– This simplifies the problem from estimating an unknown function to estimating a set of parameters
  – e.g., the mean vector and covariance matrix of a multivariate normal distribution

5 5 Maximum-Likelihood Estimation
– Views the parameters as quantities whose values are fixed but unknown
– The best estimate is the one that maximizes the probability of obtaining the samples actually observed
– Nearly always has good convergence properties as the number of samples increases
– Often simpler than alternative methods

6 6 I.I.D. Random Variables
– Separate the data into D_1, ..., D_c
– Samples in D_j are drawn independently according to p(x|ω_j)
– Such samples are independent and identically distributed (i.i.d.) random variables
– Assume p(x|ω_j) has a known parametric form, determined uniquely by a parameter vector θ_j; i.e., write p(x|ω_j) = p(x|ω_j, θ_j)

7 7 Simplification Assumptions
– Samples in D_i give no information about θ_j if i ≠ j
– Can work with each class separately
– Have c separate problems of the same form:
  – use a set D of i.i.d. samples drawn from p(x|θ) to estimate the unknown parameter vector θ

8 8 Maximum-likelihood Estimate
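The estimate itself appears on the slide only as an equation that is not transcribed; a standard statement consistent with the setup on the preceding slides is the following sketch. With D = {x_1, ..., x_n} i.i.d., the log-likelihood and the ML estimate are

    l(\theta) = \ln p(D \mid \theta) = \sum_{k=1}^{n} \ln p(x_k \mid \theta),
    \qquad
    \hat{\theta} = \arg\max_{\theta}\, l(\theta),

and a necessary condition at an interior maximum is \nabla_{\theta}\, l(\hat{\theta}) = \sum_{k=1}^{n} \nabla_{\theta} \ln p(x_k \mid \hat{\theta}) = 0.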

9 9 Maximum-likelihood Estimation

10 10 A Note
– The likelihood p(D|θ), viewed as a function of θ, is not a probability density function of θ
– Its area over the θ-domain has no significance
– The likelihood p(D|θ) can be regarded as the probability of D for a given θ

11 11 Analytical Approach

12 12 MAP Estimators

13 13 Gaussian Case: Unknown μ

14 14 Univariate Gaussian Case: Unknown μ and σ²
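The closed-form solutions for this case (standard results, given here as a sketch since the slide's derivation is not transcribed) are

    \hat{\mu} = \frac{1}{n} \sum_{k=1}^{n} x_k,
    \qquad
    \hat{\sigma}^2 = \frac{1}{n} \sum_{k=1}^{n} (x_k - \hat{\mu})^2.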

15 15 Multivariate Gaussian Case: Unknown μ and Σ
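The multivariate results have the same form, \hat{\mu} = \frac{1}{n}\sum_k x_k and \hat{\Sigma} = \frac{1}{n}\sum_k (x_k - \hat{\mu})(x_k - \hat{\mu})^t. A minimal numerical sketch (NumPy assumed; the function name is ours, not from the slides):

    import numpy as np

    def gaussian_ml_estimates(X):
        """Maximum-likelihood estimates for a multivariate Gaussian.
        X: (n, d) array of i.i.d. samples.
        Returns the sample mean and the ML (biased, 1/n) covariance estimate."""
        n = X.shape[0]
        mu_hat = X.mean(axis=0)            # (1/n) * sum_k x_k
        diff = X - mu_hat
        sigma_hat = diff.T @ diff / n      # (1/n) * sum_k (x_k - mu)(x_k - mu)^t
        return mu_hat, sigma_hat

    # Example on synthetic data
    rng = np.random.default_rng(0)
    X = rng.multivariate_normal(mean=[1.0, -2.0], cov=[[2.0, 0.3], [0.3, 1.0]], size=500)
    mu_hat, sigma_hat = gaussian_ml_estimates(X)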

16 16 Bias, Absolutely Unbiased, and Asymptotically Unbiased
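The fact behind this slide's title, stated concretely: the ML estimate of the mean is unbiased, while the ML estimate of the variance is biased but asymptotically unbiased, since

    \mathcal{E}[\hat{\sigma}^2] = \frac{n-1}{n}\,\sigma^2 \neq \sigma^2,
    \qquad
    \frac{n-1}{n}\,\sigma^2 \to \sigma^2 \ \text{as}\ n \to \infty;

the absolutely unbiased estimator divides by n − 1 instead of n.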

17 17 Model Error
– For a reliable model, the ML classifier can give excellent results
– If the model is wrong, the ML classifier cannot give the best results, even within the assumed set of models

18 18 Bayesian Estimation (Bayesian Learning)
– The answers obtained are in general nearly identical to those from maximum likelihood
– Basic conceptual difference:
  – the parameter vector θ is treated as a random variable
  – the training data are used to convert a distribution on this variable into a posterior probability density

19 19 Central Problem

20 20 Parameter Distribution
– Assume p(x) has a known parametric form with a parameter vector θ of unknown value
– Thus p(x|θ) is completely known
– Information about θ prior to observing the samples is contained in a known prior density p(θ)
– Observations convert p(θ) to a posterior p(θ|D)
  – which should be sharply peaked about the true value of θ

21 21 Parameter Distribution
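The slide's equations are not transcribed; the standard relations for this setup, linking the prior, the posterior, and the resulting class-conditional density, are

    p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \theta)\, p(\theta)\, d\theta},
    \qquad
    p(x \mid D) = \int p(x \mid \theta)\, p(\theta \mid D)\, d\theta.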

22 22 Univariate Gaussian Case: p(μ|D)
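The standard result for this case, assuming p(x|μ) ~ N(μ, σ²) with σ² known and a Gaussian prior p(μ) ~ N(μ_0, σ_0²): the posterior is again Gaussian, p(μ|D) ~ N(μ_n, σ_n²), with

    \mu_n = \frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\,\hat{\mu}_n
          + \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\,\mu_0,
    \qquad
    \sigma_n^2 = \frac{\sigma_0^2\, \sigma^2}{n\sigma_0^2 + \sigma^2},

where \hat{\mu}_n is the sample mean of the n observations.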

23 23 Reproducing Density

24 24 Bayesian Learning

25 25 Dogmatism

26 26 Univariate Gaussian Case: p(x|D)
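Under the same assumptions as on slide 22, the predictive density is again Gaussian:

    p(x \mid D) \sim N(\mu_n,\ \sigma^2 + \sigma_n^2),

i.e., the posterior uncertainty about μ simply adds to the known variance σ².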

27 27 Multivariate Gaussian Case

28 28 Multivariate Gaussian Case

29 29 Multivariate Bayesian Learning

30 30 General Bayesian Estimation

31 31 Recursive Bayesian Learning
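The recursion itself is not transcribed; its standard form, writing D^n = {x_1, ..., x_n} and D^0 for the empty data set, is

    p(\theta \mid D^n) = \frac{p(x_n \mid \theta)\, p(\theta \mid D^{n-1})}{\int p(x_n \mid \theta)\, p(\theta \mid D^{n-1})\, d\theta},
    \qquad
    p(\theta \mid D^0) = p(\theta),

so each new sample updates the previous posterior, which acts as the prior for the next step.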

32 32 Example 1: Recursive Bayes Learning

33 33 Example 1: Recursive Bayes Learning

34 34 Example 1: Bayes vs. ML

35 35 Identifiability
– p(x|θ) is identifiable:
  – the sequence of posterior densities p(θ|D^n) converges to a delta function
  – only one θ causes p(x|θ) to fit the data
– On some occasions, more than one value of θ may yield the same p(x|θ)
  – p(θ|D^n) will peak near all values of θ that explain the data
  – the ambiguity is erased in the integration for p(x|D^n), which converges to p(x) whether or not p(x|θ) is identifiable

36 36 ML vs. Bayes Methods
– Computational complexity
– Interpretability
– Confidence in prior information
  – including the form of the underlying distribution p(x|θ)
– Results differ when p(θ|D) is broad or asymmetric around the estimated θ
  – Bayes methods would exploit such information whereas ML would not

37 37 Classification Errors
– Bayes or indistinguishability error
– Model error
– Estimation error
  – parameters are estimated from a finite sample
  – vanishes in the limit of infinite training data (ML and Bayes would have the same total classification error)

38 38 Invariance and Non-informative Priors
– Guidance in creating priors
– Invariance
  – translation invariance
  – scale invariance
– A prior that is non-informative with respect to an invariance is:
  – much better than accommodating arbitrary transformations in a MAP estimator
  – of great use in Bayesian estimation

39 39 Gibbs Algorithm

40 40 Sufficient Statistics
– Statistic: any function of the samples
– A sufficient statistic s of the samples D:
  – s contains all information relevant to estimating some parameter θ
  – formal definition: p(D|s,θ) is independent of θ
  – the informal characterization above applies if θ can be regarded as a random variable

41 41 Factorization Theorem
– A statistic s is sufficient for θ if and only if P(D|θ) can be written as the product P(D|θ) = g(s,θ) h(D) for some functions g(·,·) and h(·)

42 42 Example: Multivariate Gaussian
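The worked example is not transcribed; a standard instance is a multivariate Gaussian p(x|θ) ~ N(θ, Σ) with known Σ and unknown mean θ. The likelihood factors as the theorem requires,

    p(D \mid \theta) =
    \underbrace{\exp\!\Big[ n\,\theta^{t}\Sigma^{-1} s - \tfrac{n}{2}\,\theta^{t}\Sigma^{-1}\theta \Big]}_{g(s,\theta)}
    \cdot
    \underbrace{c^{n}\exp\!\Big[ -\tfrac{1}{2}\sum_{k} x_k^{t}\Sigma^{-1} x_k \Big]}_{h(D)},
    \qquad
    s = \frac{1}{n}\sum_{k=1}^{n} x_k,

so the sample mean s is a sufficient statistic for θ (here c is the Gaussian normalizing constant).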

43 43 Proof of Factorization Theorem: The “Only if” Part

44 44 Proof of Factorization Theorem: The “if” Part

45 45 Kernel Density
– The factoring of P(D|θ) into g(s,θ) h(D) is not unique
  – if f(s) is any function, then g'(s,θ) = f(s) g(s,θ) and h'(D) = h(D)/f(s) are equivalent factors
– The ambiguity is removed by defining a kernel density that is invariant to such scaling
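The defining formula is not transcribed; one standard form normalizes g over θ, which makes it invariant to the scaling by f(s) described above:

    \bar{g}(s, \theta) = \frac{g(s, \theta)}{\int g(s, \theta')\, d\theta'}.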

46 46 Example: Multivariate Gaussian

47 47 Kernel Density and Parameter Estimation
– Maximum likelihood:
  – maximize g(s,θ)
– Bayesian:
  – if prior knowledge of θ is vague, p(θ) tends to be uniform, and p(θ|D) is approximately the same as the kernel density
  – if p(x|θ) is identifiable, g(s,θ) peaks sharply at some value, and p(θ) is continuous and non-zero there, then p(θ|D) approaches the kernel density

48 48 Sufficient Statistics for Exponential Family

49 49 Error Rate and Dimensionality

50 50 Accuracy and Dimensionality

51 51 Effects of Additional Features
– In practice, beyond a certain point, inclusion of additional features leads to worse rather than better performance
– Sources of difficulty:
  – wrong models
  – the number of design or training samples is finite and thus the distributions are not estimated accurately

52 52 Computational Complexity for Maximum-Likelihood Estimation

53 53 Computational Complexity for Classification

54 54 Approaches for Inadequate Samples
– Reduce the dimensionality
  – redesign the feature extractor
  – select an appropriate subset of the features
  – combine the existing features
– Pool the available data by assuming all classes share the same covariance matrix
– Look for a better estimate for Σ
  – use a Bayesian estimate with a diagonal Σ_0
  – threshold the sample covariance matrix
  – assume statistical independence

55 55 Shrinkage (Regularized Discriminant Analysis)
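The shrinkage estimators themselves are not transcribed; the standard forms, with shrinkage parameters 0 < α < 1 and 0 < β < 1, shrink each class covariance toward the pooled covariance, and the pooled covariance toward a multiple of the identity:

    \Sigma_i(\alpha) = \frac{(1-\alpha)\, n_i\, \Sigma_i + \alpha\, n\, \Sigma}{(1-\alpha)\, n_i + \alpha\, n},
    \qquad
    \Sigma(\beta) = (1-\beta)\, \Sigma + \frac{\beta}{d}\, \operatorname{tr}[\Sigma]\, I.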

56 56 Concept of Overfitting

57 57 Best Representative Point

58 58 Projection Along a Line

59 59 Best Projection to a Line Through the Sample Mean

60 60 Best Representative Direction

61 61 Principal Component Analysis (PCA)
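As a concrete sketch of the computation these slides build up to (the best representative direction and the principal components are the leading eigenvectors of the scatter matrix of the centered data); NumPy assumed, and the function name is ours:

    import numpy as np

    def pca(X, k):
        """Project (n, d) data X onto its k principal components.
        Returns the sample mean, the (d, k) principal directions, and the projections."""
        mean = X.mean(axis=0)
        Xc = X - mean                               # center the data
        scatter = Xc.T @ Xc                         # S = sum_k (x_k - m)(x_k - m)^t
        eigvals, eigvecs = np.linalg.eigh(scatter)  # eigenvalues in ascending order
        order = np.argsort(eigvals)[::-1]           # sort descending
        W = eigvecs[:, order[:k]]                   # top-k eigenvectors as columns
        return mean, W, Xc @ W

    # Example: reduce 5-dimensional data to its 2 principal components
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    mean, W, Z = pca(X, k=2)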

62 62 Concept of Fisher Linear Discriminant

63 63 Fisher Linear Discriminant Analysis
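The criterion developed on these slides has the standard two-class form: with between-class and within-class scatter matrices S_B and S_W and class sample means m_1, m_2, Fisher's linear discriminant chooses the projection direction w that maximizes

    J(w) = \frac{w^{t} S_B\, w}{w^{t} S_W\, w},
    \qquad
    \text{maximized by}\quad w \propto S_W^{-1}(m_1 - m_2).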

64 64 Fisher Linear Discriminant Analysis

65 65 Fisher Linear Discriminant Analysis

66 66 Fisher Linear Discriminant Analysis for Multivariate Normal

67 67 Concept of Multidimensional Discriminant Analysis

68 68 Multiple Discriminant Analysis

69 69 Multiple Discriminant Analysis

70 70 Multiple Discriminant Analysis

71 71 Multiple Discriminant Analysis

72 72 Expectation-Maximization (EM)
– Finds the maximum-likelihood estimate of the parameters of an underlying distribution
  – from a given data set when the data is incomplete or has missing values
– Two main applications:
  – when the data indeed has missing values
  – when optimizing the likelihood function directly is analytically intractable, but the likelihood can be simplified by assuming the existence of (and values for) additional missing or hidden parameters

73 73 Expectation-Maximization (EM)
– Full sample D = {x_1, ..., x_n}, with x_k = {x_kg, x_kb} (observed and missing features)
– Separate the individual features into D_g and D_b
  – D is the union of D_g and D_b
– Form the function Q(θ; θ^i)
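The function being formed (its equation is not transcribed) is the expected complete-data log-likelihood; in a standard notation, with the expectation taken over the missing features D_b given the observed features D_g and the current parameter estimate θ^i,

    Q(\theta;\, \theta^{i}) = \mathcal{E}_{D_b}\!\big[\, \ln p(D_g, D_b;\, \theta) \mid D_g,\ \theta^{i} \,\big].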

74 74 Expectation-Maximization (EM)
begin initialize θ^0, T, i ← 0
  do i ← i + 1
    E step: compute Q(θ; θ^i)
    M step: θ^(i+1) ← arg max_θ Q(θ; θ^i)
  until Q(θ^(i+1); θ^i) − Q(θ^i; θ^(i−1)) ≤ T
  return θ̂ ← θ^(i+1)
end

75 75 Expectation-Maximization (EM)

76 76 Example: 2D Model

77 77 Example: 2D Model

78 78 Example: 2D Model

79 79 Example: 2D Model

80 80 Generalized Expectation-Maximization (GEM)
– Instead of maximizing Q(θ; θ^i), find some θ^(i+1) such that Q(θ^(i+1); θ^i) > Q(θ^i; θ^i); this is also guaranteed to converge
– Convergence will not be as rapid
– Offers great freedom to choose computationally simpler steps
  – e.g., use the maximum-likelihood values of the unknown (missing) values, if they lead to a greater likelihood

81 81 Hidden Markov Model (HMM)
– Used for problems involving a series of decisions
  – e.g., speech or gesture recognition
– Problems in which the state at time t is influenced directly by the state at t−1
– Further reference:
  – L. R. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition, Prentice-Hall, 1993, Chapter 6.

82 82 First Order Markov Models

83 83 First Order Hidden Markov Models

84 84 Hidden Markov Model Probabilities
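In the usual notation, with ω_i(t) denoting hidden state i at step t and v_k(t) the visible symbol k emitted at step t, the two probability tables named on this slide are

    a_{ij} = P\big(\omega_j(t+1) \mid \omega_i(t)\big),
    \qquad
    b_{jk} = P\big(v_k(t) \mid \omega_j(t)\big),

with the normalizations \sum_j a_{ij} = 1 and \sum_k b_{jk} = 1.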

85 85 Hidden Markov Model Computation
– Evaluation problem:
  – given a_ij and b_jk, determine P(V^T | θ)
– Decoding problem:
  – given V^T, determine the most likely sequence of hidden states that led to V^T
– Learning problem:
  – given training observations of visible symbols and the coarse structure of the model, but not the probabilities, determine a_ij and b_jk

86 86 Evaluation

87 87 HMM Forward

88 88 HMM Forward and Trellis

89 89 HMM Forward
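A minimal sketch of the forward recursion, α_j(t) = [Σ_i α_i(t−1) a_ij] · b_j,v(t), summed over the final states to give P(V^T). NumPy assumed; the explicit initial distribution pi is our assumption (the slides' version may instead start from a single known initial state):

    import numpy as np

    def hmm_forward(A, B, pi, obs):
        """Forward algorithm: probability of the visible sequence obs.
        A:   (c, c) transition probabilities a_ij
        B:   (c, m) emission probabilities  b_jk
        pi:  (c,)   initial state distribution
        obs: list of visible-symbol indices v(1), ..., v(T)"""
        alpha = pi * B[:, obs[0]]              # alpha_j(1) = pi_j * b_j,v(1)
        for v in obs[1:]:
            alpha = (alpha @ A) * B[:, v]      # alpha_j(t) = [sum_i alpha_i(t-1) a_ij] * b_j,v(t)
        return alpha.sum()                     # P(V^T) = sum_j alpha_j(T)

    # Toy example: 2 hidden states, 3 visible symbols
    A = np.array([[0.7, 0.3], [0.4, 0.6]])
    B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
    pi = np.array([0.6, 0.4])
    print(hmm_forward(A, B, pi, obs=[0, 2, 1]))

For long sequences the α values underflow, so practical implementations rescale at each step or work in log space.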

90 90 HMM Backward

91 91 HMM Backward

92 92 Example 3: Hidden Markov Model

93 93 Example 3: Hidden Markov Model

94 94 Example 3: Hidden Markov Model

95 95 Left-to-Right Models for Speech

96 96 HMM Decoding

97 97 Problem of Local Optimization
– This decoding algorithm depends only on the single previous time step, not on the full sequence
– It does not guarantee that the resulting path is even allowable

98 98 HMM Decoding

99 99 Example 4: HMM Decoding

100 100 Forward-Backward Algorithm
– Determines the model parameters a_ij and b_jk from an ensemble of training samples
– An instance of a generalized expectation-maximization algorithm
– There is no known method for obtaining the optimal or most likely set of parameters from the data

101 101 Probability of Transition

102 102 Improved Estimate for a_ij
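The re-estimation formulas are not transcribed; the standard Baum-Welch form, written with the forward and backward variables α and β and with γ_ij(t) the probability of a transition from ω_i at t−1 to ω_j at t given the model and the observed sequence V^T, is

    \gamma_{ij}(t) = \frac{\alpha_i(t-1)\, a_{ij}\, b_{j\,v(t)}\, \beta_j(t)}{P(V^{T} \mid \theta)},
    \qquad
    \hat{a}_{ij} = \frac{\sum_{t=1}^{T} \gamma_{ij}(t)}{\sum_{t=1}^{T} \sum_{k} \gamma_{ik}(t)};

the improved estimate of b_jk has the analogous ratio form: the expected number of times state j emits symbol k divided by the expected number of visits to state j.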

103 103 Improved Estimate for b_jk

104 104 Forward-Backward Algorithm (Baum-Welch Algorithm)

