
1 1 Parameter Estimation Shyh-Kang Jeng Department of Electrical Engineering/ Graduate Institute of Communication/ Graduate Institute of Networking and Multimedia, National Taiwan University

2 2 Typical Classification Problem
– Rarely know the complete probabilistic structure of the problem
– Have vague, general knowledge
– Have a number of design samples or training data as representatives of the patterns for classification
– Find some way to use this information to design or train the classifier

3 3 Estimating Probabilities
– Not difficult to estimate the prior probabilities
– Hard to estimate the class-conditional densities
  – the number of available samples always seems too small
  – the problem is especially serious when the dimensionality is large

4 4 Estimating Parameters
– Many problems permit us to parameterize the conditional densities
– This simplifies the problem from estimating an unknown function to estimating a set of parameters
  – e.g., the mean vector and covariance matrix of a multivariate normal distribution

5 5 Maximum-Likelihood Estimation
– Views the parameters as quantities whose values are fixed but unknown
– The best estimate is the one that maximizes the probability of obtaining the samples actually observed
– Nearly always has good convergence properties as the number of samples increases
– Often simpler than alternative methods

6 6 I.I.D. Random Variables
– Separate the data into D_1, ..., D_c
– Samples in D_j are drawn independently according to p(x|ω_j)
– Such samples are independent and identically distributed (i.i.d.) random variables
– Assume p(x|ω_j) has a known parametric form, determined uniquely by a parameter vector θ_j; i.e., write p(x|ω_j) = p(x|ω_j, θ_j)

7 7 Simplification Assumptions
– Samples in D_i give no information about θ_j if i ≠ j
– Can work with each class separately
– Have c separate problems of the same form:
  – use a set D of i.i.d. samples drawn from p(x|θ) to estimate the unknown parameter vector θ

8 8 Maximum-likelihood Estimate
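The estimate itself appears on the slide only as an equation that is not transcribed; a standard statement consistent with the setup on the preceding slides is the following sketch. With D = {x_1, ..., x_n} i.i.d., the log-likelihood and the ML estimate are

    l(\theta) = \ln p(D \mid \theta) = \sum_{k=1}^{n} \ln p(x_k \mid \theta),
    \qquad
    \hat{\theta} = \arg\max_{\theta}\, l(\theta),

and a necessary condition at an interior maximum is \nabla_{\theta}\, l(\hat{\theta}) = \sum_{k=1}^{n} \nabla_{\theta} \ln p(x_k \mid \hat{\theta}) = 0.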

9 9 Maximum-likelihood Estimation

10 10 A Note
– The likelihood p(D|θ), viewed as a function of θ, is not a probability density function of θ
– Its area over the θ-domain has no significance
– The likelihood p(D|θ) can be regarded as the probability of D for a given θ

11 11 Analytical Approach

12 12 MAP Estimators

13 13 Gaussian Case: Unknown μ

14 14 Univariate Gaussian Case: Unknown μ and σ²
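The closed-form solutions for this case (standard results, given here as a sketch since the slide's derivation is not transcribed) are

    \hat{\mu} = \frac{1}{n} \sum_{k=1}^{n} x_k,
    \qquad
    \hat{\sigma}^2 = \frac{1}{n} \sum_{k=1}^{n} (x_k - \hat{\mu})^2.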

15 15 Multivariate Gaussian Case: Unknown μ and Σ
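The multivariate results have the same form, \hat{\mu} = \frac{1}{n}\sum_k x_k and \hat{\Sigma} = \frac{1}{n}\sum_k (x_k - \hat{\mu})(x_k - \hat{\mu})^t. A minimal numerical sketch (NumPy assumed; the function name is ours, not from the slides):

    import numpy as np

    def gaussian_ml_estimates(X):
        """Maximum-likelihood estimates for a multivariate Gaussian.
        X: (n, d) array of i.i.d. samples.
        Returns the sample mean and the ML (biased, 1/n) covariance estimate."""
        n = X.shape[0]
        mu_hat = X.mean(axis=0)            # (1/n) * sum_k x_k
        diff = X - mu_hat
        sigma_hat = diff.T @ diff / n      # (1/n) * sum_k (x_k - mu)(x_k - mu)^t
        return mu_hat, sigma_hat

    # Example on synthetic data
    rng = np.random.default_rng(0)
    X = rng.multivariate_normal(mean=[1.0, -2.0], cov=[[2.0, 0.3], [0.3, 1.0]], size=500)
    mu_hat, sigma_hat = gaussian_ml_estimates(X)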

16 16 Bias, Absolutely Unbiased, and Asymptotically Unbiased
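The fact behind this slide's title, stated concretely: the ML estimate of the mean is unbiased, while the ML estimate of the variance is biased but asymptotically unbiased, since

    \mathcal{E}[\hat{\sigma}^2] = \frac{n-1}{n}\,\sigma^2 \neq \sigma^2,
    \qquad
    \frac{n-1}{n}\,\sigma^2 \to \sigma^2 \ \text{as}\ n \to \infty;

the absolutely unbiased estimator divides by n − 1 instead of n.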

17 17 Model Error
– For a reliable model, the ML classifier can give excellent results
– If the model is wrong, the ML classifier cannot give the best results, even within the assumed set of models

18 18 Bayesian Estimation (Bayesian Learning)
– The answers obtained are in general nearly identical to those from maximum likelihood
– Basic conceptual difference:
  – the parameter vector θ is treated as a random variable
  – the training data are used to convert a distribution on this variable into a posterior probability density

19 19 Central Problem

20 20 Parameter Distribution
– Assume p(x) has a known parametric form with a parameter vector θ of unknown value
– Thus p(x|θ) is completely known
– Information about θ prior to observing the samples is contained in a known prior density p(θ)
– Observations convert p(θ) to a posterior p(θ|D)
  – which should be sharply peaked about the true value of θ

21 21 Parameter Distribution
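The slide's equations are not transcribed; the standard relations for this setup, linking the prior, the posterior, and the resulting class-conditional density, are

    p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \theta)\, p(\theta)\, d\theta},
    \qquad
    p(x \mid D) = \int p(x \mid \theta)\, p(\theta \mid D)\, d\theta.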

22 22 Univariate Gaussian Case: p(μ|D)
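The standard result for this case, assuming p(x|μ) ~ N(μ, σ²) with σ² known and a Gaussian prior p(μ) ~ N(μ_0, σ_0²): the posterior is again Gaussian, p(μ|D) ~ N(μ_n, σ_n²), with

    \mu_n = \frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\,\hat{\mu}_n
          + \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\,\mu_0,
    \qquad
    \sigma_n^2 = \frac{\sigma_0^2\, \sigma^2}{n\sigma_0^2 + \sigma^2},

where \hat{\mu}_n is the sample mean of the n observations.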

23 23 Reproducing Density

24 24 Bayesian Learning

25 25 Dogmatism

26 26 Univariate Gaussian Case: p(x|D)
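Under the same assumptions as on slide 22, the predictive density is again Gaussian:

    p(x \mid D) \sim N(\mu_n,\ \sigma^2 + \sigma_n^2),

i.e., the posterior uncertainty about μ simply adds to the known variance σ².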

27 27 Multivariate Gaussian Case

28 28 Multivariate Gaussian Case

29 29 Multivariate Bayesian Learning

30 30 General Bayesian Estimation

31 31 Recursive Bayesian Learning
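The recursion itself is not transcribed; its standard form, writing D^n = {x_1, ..., x_n} and D^0 for the empty data set, is

    p(\theta \mid D^n) = \frac{p(x_n \mid \theta)\, p(\theta \mid D^{n-1})}{\int p(x_n \mid \theta)\, p(\theta \mid D^{n-1})\, d\theta},
    \qquad
    p(\theta \mid D^0) = p(\theta),

so each new sample updates the previous posterior, which acts as the prior for the next step.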

32 32 Example 1: Recursive Bayes Learning

33 33 Example 1: Recursive Bayes Learning

34 34 Example 1: Bayes vs. ML

35 35 Identifiability
– p(x|θ) is identifiable:
  – the sequence of posterior densities p(θ|D^n) converges to a delta function
  – only one θ causes p(x|θ) to fit the data
– On some occasions, more than one value of θ may yield the same p(x|θ)
  – p(θ|D^n) will peak near all values of θ that explain the data
  – the ambiguity is erased in the integration for p(x|D^n), which converges to p(x) whether or not p(x|θ) is identifiable

36 36 ML vs. Bayes Methods
– Computational complexity
– Interpretability
– Confidence in prior information
  – including the form of the underlying distribution p(x|θ)
– Results differ when p(θ|D) is broad or asymmetric around the estimated θ
  – Bayes methods would exploit such information whereas ML would not

37 37 Classification Errors
– Bayes or indistinguishability error
– Model error
– Estimation error
  – parameters are estimated from a finite sample
  – vanishes in the limit of infinite training data (ML and Bayes would have the same total classification error)

38 38 Invariance and Non-informative Priors
– Guidance in creating priors
– Invariance
  – translation invariance
  – scale invariance
– A prior that is non-informative with respect to an invariance is:
  – much better than accommodating arbitrary transformations in a MAP estimator
  – of great use in Bayesian estimation

39 39 Gibbs Algorithm

40 40 Sufficient Statistics
– Statistic: any function of the samples
– A sufficient statistic s of the samples D:
  – s contains all information relevant to estimating some parameter θ
  – formal definition: p(D|s,θ) is independent of θ
  – the informal characterization above applies if θ can be regarded as a random variable

41 41 Factorization Theorem
– A statistic s is sufficient for θ if and only if P(D|θ) can be written as the product P(D|θ) = g(s,θ) h(D) for some functions g(·,·) and h(·)

42 42 Example: Multivariate Gaussian
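The worked example is not transcribed; a standard instance is a multivariate Gaussian p(x|θ) ~ N(θ, Σ) with known Σ and unknown mean θ. The likelihood factors as the theorem requires,

    p(D \mid \theta) =
    \underbrace{\exp\!\Big[ n\,\theta^{t}\Sigma^{-1} s - \tfrac{n}{2}\,\theta^{t}\Sigma^{-1}\theta \Big]}_{g(s,\theta)}
    \cdot
    \underbrace{c^{n}\exp\!\Big[ -\tfrac{1}{2}\sum_{k} x_k^{t}\Sigma^{-1} x_k \Big]}_{h(D)},
    \qquad
    s = \frac{1}{n}\sum_{k=1}^{n} x_k,

so the sample mean s is a sufficient statistic for θ (here c is the Gaussian normalizing constant).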

43 43 Proof of Factorization Theorem: The “Only if” Part

44 44 Proof of Factorization Theorem: The “if” Part

45 45 Kernel Density
– The factoring of P(D|θ) into g(s,θ) h(D) is not unique
  – if f(s) is any function, then g'(s,θ) = f(s) g(s,θ) and h'(D) = h(D)/f(s) are equivalent factors
– The ambiguity is removed by defining a kernel density that is invariant to such scaling
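The defining formula is not transcribed; one standard form normalizes g over θ, which makes it invariant to the scaling by f(s) described above:

    \bar{g}(s, \theta) = \frac{g(s, \theta)}{\int g(s, \theta')\, d\theta'}.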

46 46 Example: Multivariate Gaussian

47 47 Kernel Density and Parameter Estimation
– Maximum likelihood:
  – maximize g(s,θ)
– Bayesian:
  – if prior knowledge of θ is vague, p(θ) tends to be uniform, and p(θ|D) is approximately the same as the kernel density
  – if p(x|θ) is identifiable, g(s,θ) peaks sharply at some value, and p(θ) is continuous and non-zero there, then p(θ|D) approaches the kernel density

48 48 Sufficient Statistics for Exponential Family

49 49 Error Rate and Dimensionality

50 50 Accuracy and Dimensionality

51 51 Effects of Additional Features
– In practice, beyond a certain point, inclusion of additional features leads to worse rather than better performance
– Sources of difficulty:
  – wrong models
  – the number of design or training samples is finite and thus the distributions are not estimated accurately

52 52 Computational Complexity for Maximum-Likelihood Estimation

53 53 Computational Complexity for Classification

54 54 Approaches for Inadequate Samples
– Reduce the dimensionality
  – redesign the feature extractor
  – select an appropriate subset of the features
  – combine the existing features
– Pool the available data by assuming all classes share the same covariance matrix
– Look for a better estimate for Σ
  – use a Bayesian estimate with a diagonal Σ_0
  – threshold the sample covariance matrix
  – assume statistical independence

55 55 Shrinkage (Regularized Discriminant Analysis)
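The shrinkage estimators themselves are not transcribed; the standard forms, with shrinkage parameters 0 < α < 1 and 0 < β < 1, shrink each class covariance toward the pooled covariance, and the pooled covariance toward a multiple of the identity:

    \Sigma_i(\alpha) = \frac{(1-\alpha)\, n_i\, \Sigma_i + \alpha\, n\, \Sigma}{(1-\alpha)\, n_i + \alpha\, n},
    \qquad
    \Sigma(\beta) = (1-\beta)\, \Sigma + \frac{\beta}{d}\, \operatorname{tr}[\Sigma]\, I.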

56 56 Concept of Overfitting

57 57 Best Representative Point

58 58 Projection Along a Line

59 59 Best Projection to a Line Through the Sample Mean

60 60 Best Representative Direction

61 61 Principal Component Analysis (PCA)
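As a concrete sketch of the computation these slides build up to (the best representative direction and the principal components are the leading eigenvectors of the scatter matrix of the centered data); NumPy assumed, and the function name is ours:

    import numpy as np

    def pca(X, k):
        """Project (n, d) data X onto its k principal components.
        Returns the sample mean, the (d, k) principal directions, and the projections."""
        mean = X.mean(axis=0)
        Xc = X - mean                               # center the data
        scatter = Xc.T @ Xc                         # S = sum_k (x_k - m)(x_k - m)^t
        eigvals, eigvecs = np.linalg.eigh(scatter)  # eigenvalues in ascending order
        order = np.argsort(eigvals)[::-1]           # sort descending
        W = eigvecs[:, order[:k]]                   # top-k eigenvectors as columns
        return mean, W, Xc @ W

    # Example: reduce 5-dimensional data to its 2 principal components
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    mean, W, Z = pca(X, k=2)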

62 62 Concept of Fisher Linear Discriminant

63 63 Fisher Linear Discriminant Analysis
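The criterion developed on these slides has the standard two-class form: with between-class and within-class scatter matrices S_B and S_W and class sample means m_1, m_2, Fisher's linear discriminant chooses the projection direction w that maximizes

    J(w) = \frac{w^{t} S_B\, w}{w^{t} S_W\, w},
    \qquad
    \text{maximized by}\quad w \propto S_W^{-1}(m_1 - m_2).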

64 64 Fisher Linear Discriminant Analysis

65 65 Fisher Linear Discriminant Analysis

66 66 Fisher Linear Discriminant Analysis for Multivariate Normal

67 67 Concept of Multidimensional Discriminant Analysis

68 68 Multiple Discriminant Analysis

69 69 Multiple Discriminant Analysis

70 70 Multiple Discriminant Analysis

71 71 Multiple Discriminant Analysis

72 72 Expectation-Maximization (EM)
– Finds the maximum-likelihood estimate of the parameters of an underlying distribution
  – from a given data set when the data is incomplete or has missing values
– Two main applications:
  – when the data indeed has missing values
  – when optimizing the likelihood function directly is analytically intractable, but the likelihood can be simplified by assuming the existence of (and values for) additional missing or hidden parameters

73 73 Expectation-Maximization (EM)
– Full sample D = {x_1, ..., x_n}, with x_k = {x_kg, x_kb} (observed and missing features)
– Separate the individual features into D_g and D_b
  – D is the union of D_g and D_b
– Form the function Q(θ; θ^i)
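The function being formed (its equation is not transcribed) is the expected complete-data log-likelihood; in a standard notation, with the expectation taken over the missing features D_b given the observed features D_g and the current parameter estimate θ^i,

    Q(\theta;\, \theta^{i}) = \mathcal{E}_{D_b}\!\big[\, \ln p(D_g, D_b;\, \theta) \mid D_g,\ \theta^{i} \,\big].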

74 74 Expectation-Maximization (EM)
begin initialize θ^0, T, i ← 0
  do i ← i + 1
    E step: compute Q(θ; θ^i)
    M step: θ^(i+1) ← arg max_θ Q(θ; θ^i)
  until Q(θ^(i+1); θ^i) − Q(θ^i; θ^(i−1)) ≤ T
  return θ̂ ← θ^(i+1)
end

75 75 Expectation-Maximization (EM)

76 76 Example: 2D Model

77 77 Example: 2D Model

78 78 Example: 2D Model

79 79 Example: 2D Model

80 80 Generalized Expectation-Maximization (GEM)
– Instead of maximizing Q(θ; θ^i), find some θ^(i+1) such that Q(θ^(i+1); θ^i) > Q(θ^i; θ^i); this is also guaranteed to converge
– Convergence will not be as rapid
– Offers great freedom to choose computationally simpler steps
  – e.g., use the maximum-likelihood values of the unknown (missing) values, if they lead to a greater likelihood

81 81 Hidden Markov Model (HMM)
– Used for problems involving a series of decisions
  – e.g., speech or gesture recognition
– Problems in which the state at time t is influenced directly by the state at t−1
– Further reference:
  – L. R. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition, Prentice-Hall, 1993, Chapter 6.

82 82 First Order Markov Models

83 83 First Order Hidden Markov Models

84 84 Hidden Markov Model Probabilities
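In the usual notation, with ω_i(t) denoting hidden state i at step t and v_k(t) the visible symbol k emitted at step t, the two probability tables named on this slide are

    a_{ij} = P\big(\omega_j(t+1) \mid \omega_i(t)\big),
    \qquad
    b_{jk} = P\big(v_k(t) \mid \omega_j(t)\big),

with the normalizations \sum_j a_{ij} = 1 and \sum_k b_{jk} = 1.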

85 85 Hidden Markov Model Computation
– Evaluation problem:
  – given a_ij and b_jk, determine P(V^T | θ)
– Decoding problem:
  – given V^T, determine the most likely sequence of hidden states that led to V^T
– Learning problem:
  – given training observations of visible symbols and the coarse structure of the model, but not the probabilities, determine a_ij and b_jk

86 86 Evaluation

87 87 HMM Forward

88 88 HMM Forward and Trellis

89 89 HMM Forward
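A minimal sketch of the forward recursion, α_j(t) = [Σ_i α_i(t−1) a_ij] · b_j,v(t), summed over the final states to give P(V^T). NumPy assumed; the explicit initial distribution pi is our assumption (the slides' version may instead start from a single known initial state):

    import numpy as np

    def hmm_forward(A, B, pi, obs):
        """Forward algorithm: probability of the visible sequence obs.
        A:   (c, c) transition probabilities a_ij
        B:   (c, m) emission probabilities  b_jk
        pi:  (c,)   initial state distribution
        obs: list of visible-symbol indices v(1), ..., v(T)"""
        alpha = pi * B[:, obs[0]]              # alpha_j(1) = pi_j * b_j,v(1)
        for v in obs[1:]:
            alpha = (alpha @ A) * B[:, v]      # alpha_j(t) = [sum_i alpha_i(t-1) a_ij] * b_j,v(t)
        return alpha.sum()                     # P(V^T) = sum_j alpha_j(T)

    # Toy example: 2 hidden states, 3 visible symbols
    A = np.array([[0.7, 0.3], [0.4, 0.6]])
    B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
    pi = np.array([0.6, 0.4])
    print(hmm_forward(A, B, pi, obs=[0, 2, 1]))

For long sequences the α values underflow, so practical implementations rescale at each step or work in log space.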

90 90 HMM Backward

91 91 HMM Backward

92 92 Example 3: Hidden Markov Model

93 93 Example 3: Hidden Markov Model

94 94 Example 3: Hidden Markov Model

95 95 Left-to-Right Models for Speech

96 96 HMM Decoding

97 97 Problem of Local Optimization
– This decoding algorithm depends only on the single previous time step, not on the full sequence
– It does not guarantee that the resulting path is even allowable

98 98 HMM Decoding

99 99 Example 4: HMM Decoding

100 100 Forward-Backward Algorithm
– Determines the model parameters a_ij and b_jk from an ensemble of training samples
– An instance of a generalized expectation-maximization algorithm
– There is no known method for obtaining the optimal or most likely set of parameters from the data

101 101 Probability of Transition

102 102 Improved Estimate for a_ij
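The re-estimation formulas are not transcribed; the standard Baum-Welch form, written with the forward and backward variables α and β and with γ_ij(t) the probability of a transition from ω_i at t−1 to ω_j at t given the model and the observed sequence V^T, is

    \gamma_{ij}(t) = \frac{\alpha_i(t-1)\, a_{ij}\, b_{j\,v(t)}\, \beta_j(t)}{P(V^{T} \mid \theta)},
    \qquad
    \hat{a}_{ij} = \frac{\sum_{t=1}^{T} \gamma_{ij}(t)}{\sum_{t=1}^{T} \sum_{k} \gamma_{ik}(t)};

the improved estimate of b_jk has the analogous ratio form: the expected number of times state j emits symbol k divided by the expected number of visits to state j.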

103 103 Improved Estimate for b_jk

104 104 Forward-Backward Algorithm (Baum-Welch Algorithm)

