Data Exploration and Pattern Recognition © R. El-Yaniv

1 The KL-Divergence
Let $P$ and $Q$ be distributions over a set $\mathcal{X}$. The Kullback-Leibler (KL) divergence between them is $D(P \| Q) = \sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)}$. The quantity is given in bits (whenever the logarithm is base 2) and measures the dissimilarity between the distributions. Other popular names: cross-entropy, relative entropy, discrimination information. Although the KL-divergence is not a metric (it is not symmetric and does not obey the triangle inequality) and is therefore not a true distance, it is widely used as a "distance" measure between distributions.
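To make the definition concrete, here is a minimal Python sketch (not part of the original slides) that computes $D(P \| Q)$ in bits for two discrete distributions given as arrays over the same support; the function name and the example distributions are illustrative.

```python
import numpy as np

def kl_divergence(p, q):
    """D(p || q) in bits for discrete distributions p, q over the same support.

    Terms with p[i] == 0 contribute 0 (convention 0 * log 0 = 0); if q[i] == 0
    while p[i] > 0, the divergence is infinite.
    """
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    if np.any(q[mask] == 0):
        return np.inf
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

p = [0.5, 0.25, 0.25]
q = [0.4, 0.4, 0.2]
print(kl_divergence(p, q))  # differs from kl_divergence(q, p): not symmetric
print(kl_divergence(q, p))
print(kl_divergence(p, p))  # 0.0: a distribution is at zero "distance" from itself
```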

2 Reminder: Jensen Inequality
Lemma (Jensen's Inequality): If $f$ is a convex function and $X$ is a random variable, then $E[f(X)] \ge f(E[X])$. If $f$ is strictly convex, then equality implies that $X$ is constant (i.e., $X = E[X]$ with probability 1). [Figure: a convex function and a concave function.]
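A quick numerical illustration of the lemma (my own example, not from the slide), using the convex function $f(x) = x^2$ and an exponentially distributed random variable:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=100_000)  # some nonnegative random variable X

f = np.square                                 # f(x) = x^2 is strictly convex
print("E[f(X)] =", f(x).mean())               # about 8 here
print("f(E[X]) =", f(x.mean()))               # about 4: E[f(X)] >= f(E[X])
```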

3 Properties of the KL-Divergence
Lemma (Information Inequality): $D(P \| Q) \ge 0$, with equality iff $P = Q$. Proof. Let $A = \{x : P(x) > 0\}$, the support set of $P$. Then
$-D(P \| Q) = \sum_{x \in A} P(x) \log \frac{Q(x)}{P(x)} \le \log \sum_{x \in A} P(x) \frac{Q(x)}{P(x)} = \log \sum_{x \in A} Q(x) \le \log 1 = 0$,
where the first inequality is Jensen's inequality applied to the logarithm. Since log is strictly concave, we have equality iff $Q(x)/P(x)$ is constant on $A$, i.e., iff $P = Q$.

4 Properties of the KL-Divergence
Sensitive to zero-probability events under $Q$: if $Q(x) = 0$ while $P(x) > 0$, then $D(P \| Q) = \infty$. Not a true distance (not symmetric, violates the triangle inequality). So why use it? We could use, for example, the Euclidean distance, which is a true metric and is not sensitive to zero-probability events. (Partial) answer: the Euclidean distance doesn't have statistical interpretations and doesn't yield optimal solutions. The KL-divergence does! For example, it gives a sharp approximation of binomial probabilities (next slides).

5 Binomial Approximation With KL-Div.
Let $X \sim \text{Binomial}(n, p)$, i.e. $P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$. One method to compute this probability is to use the Stirling approximation $n! \approx \sqrt{2\pi n}\,(n/e)^n$.

6 Binomial Approximation - Cntd.
Writing $q = k/n$ and applying the Stirling approximation to the binomial coefficient gives
$\binom{n}{k} \approx \frac{1}{\sqrt{2\pi n\, q(1-q)}}\, 2^{n H(q)}$,
where $H(q) = -q \log q - (1-q)\log(1-q)$ is the binary entropy (in bits).

7 Binomial Approximation - Cntd.
Substituting back, the probability is governed by the binary KL-divergence $D(q \| p) = q \log\frac{q}{p} + (1-q)\log\frac{1-q}{1-p}$:
$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k} \approx \frac{1}{\sqrt{2\pi n\, q(1-q)}}\, 2^{-n D(q \| p)}$.

8 Binomial Approximation - Cntd.
Example: What is the probability of getting 30 heads when tossing an unbiased coin 100 times? Here $n = 100$, $k = 30$, $p = 1/2$, $q = 0.3$, so $P \approx \frac{1}{\sqrt{2\pi \cdot 100 \cdot 0.21}}\, 2^{-100\, D(0.3 \| 0.5)} \approx 2.3 \times 10^{-5}$.
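The following sketch (assuming the approximation form stated above, with the Stirling correction factor) compares the exact binomial probability with the KL-based estimate for this example:

```python
import math

def binary_kl(q, p):
    """Binary KL-divergence D(q || p) in bits."""
    return q * math.log2(q / p) + (1 - q) * math.log2((1 - q) / (1 - p))

n, k, p = 100, 30, 0.5
q = k / n

exact = math.comb(n, k) * p**k * (1 - p)**(n - k)
approx = 2 ** (-n * binary_kl(q, p)) / math.sqrt(2 * math.pi * n * q * (1 - q))

print(f"exact  = {exact:.3e}")   # ~2.3e-05
print(f"approx = {approx:.3e}")  # very close to the exact value
```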

9 Learning From Observed Data
[Diagram: a taxonomy of learning settings, contrasting hidden vs. observed quantities and unsupervised vs. supervised learning.]

10 Density Estimation
The Bayesian method is optimal (for classification and decision making) but requires that all relevant distributions (prior and class-conditional) are known. Unfortunately, this is rarely the case: we only see data, not distributions. Therefore, in order to use Bayesian classification, we want to learn these distributions from the data (called training data). Supervised learning: we get to see samples from each of the classes "separately" (called tagged samples). Tagged samples are "expensive", so we need to learn the distributions as efficiently as possible. Two methods: parametric (easier) and nonparametric (harder).

11 Parameter Estimation
Suppose we can assume that the relevant densities are of some parametric form. For example, suppose we are pretty sure that $p(x)$ is normal, $N(\mu, \sigma^2)$, without knowing $\mu$ and $\sigma^2$. It then remains to estimate the parameters $\mu$ and $\sigma^2$ from the data. Examples of parameterized densities: Binomial: a sequence of binary trials (1's and 0's), parameterized by the probability $\theta$ of a 1. Normal: each data point $x$ is distributed according to $N(\mu, \Sigma)$; here the parameter vector is $\theta = (\mu, \Sigma)$.

12 Two Methods for Parameter Estimation
We'll study two methods for parameter estimation: Bayesian estimation and maximum likelihood. The two methods are conceptually different. Maximum likelihood: the unknown parameters are fixed; pick the parameters that best "explain" the data. Bayesian estimation: the unknown parameters are random variables sampled from some prior; we use the observed data to revise the prior (obtaining a "sharper" posterior) and choose the best parameters using the standard (and optimal) Bayesian method. But asymptotically the two methods yield the same results.

13 Isolating the Problem
We get to see a training set $\mathcal{D}$ of $n$ data points. Each point in $\mathcal{D}$ belongs to one of $c$ different classes. Suppose the subset $\mathcal{D}_j \subseteq \mathcal{D}$ of i.i.d. points is in class $\omega_j$, so that each point in $\mathcal{D}_j$ is drawn according to the class-conditional $p(x \mid \omega_j)$. We assume that $p(x \mid \omega_j)$ has some known parametric form, given by $p(x \mid \omega_j, \theta_j)$ for some parameter vector $\theta_j$. Thus, we have $c$ separate parameter estimation problems.

14 Maximum Likelihood Estimation
Recall that the likelihood of $\theta$ with respect to the sample $\mathcal{D} = \{x_1, \ldots, x_n\}$ is $p(\mathcal{D} \mid \theta) = \prod_{k=1}^{n} p(x_k \mid \theta)$. The maximum likelihood parameter vector $\hat\theta$ is the one that best supports the data; that is, $\hat\theta = \arg\max_\theta p(\mathcal{D} \mid \theta)$. Analytically, it is often easier to consider the log of the likelihood function (since the log is monotone, the maximum of the log-likelihood is attained at the same $\theta$ as the maximum likelihood). Example: assume that all the points in $\mathcal{D}$ are drawn from some (one-dimensional) normal distribution with some particular (known) variance and unknown mean.

15 Maximum Likelihood - Illustration

16 Maximum Likelihood - Cntd.
If $p(\mathcal{D} \mid \theta)$ is "well-behaved" and, in particular, differentiable, we can find $\hat\theta$ using standard differential calculus. Let $\ell(\theta) = \log p(\mathcal{D} \mid \theta) = \sum_{k=1}^{n} \log p(x_k \mid \theta)$. Then $\hat\theta$ satisfies $\nabla_\theta \ell(\hat\theta) = \sum_{k=1}^{n} \nabla_\theta \log p(x_k \mid \hat\theta) = 0$, and we must verify that the solution is indeed a global maximum.

17 Example - Maximum Likelihood
Suppose we know that each data point $x_k$ is distributed according to a normal distribution with known standard deviation 1 but with unknown mean $\mu$: $\log p(x_k \mid \mu) = -\frac{1}{2}\log(2\pi) - \frac{1}{2}(x_k - \mu)^2$. Differentiating, we have $\frac{\partial}{\partial \mu} \sum_{k} \log p(x_k \mid \mu) = \sum_{k}(x_k - \mu) = 0$. So $\hat\mu = \frac{1}{n}\sum_{k=1}^{n} x_k$, the sample mean.
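A small sketch of this example: simulated data from $N(\mu, 1)$ (the true mean 3.0 is an arbitrary assumed value), checking that a direct numerical maximization of the log-likelihood agrees with the closed-form ML estimate, the sample mean.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
data = rng.normal(loc=3.0, scale=1.0, size=500)   # known sigma = 1, unknown mean

def neg_log_likelihood(mu):
    # -log p(D | mu) for N(mu, 1), dropping the additive constant (n/2) log(2*pi)
    return 0.5 * np.sum((data - mu) ** 2)

res = minimize_scalar(neg_log_likelihood)
print("numerical ML estimate:", res.x)
print("sample mean:          ", data.mean())      # equal up to solver tolerance
```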

18 Example: Normal, Unknown $\mu$ and $\sigma^2$
Suppose each data point is distributed $x_k \sim N(\mu, \sigma^2)$, with both $\mu$ and $\sigma^2$ unknown. We have $\log p(x_k \mid \mu, \sigma^2) = -\frac{1}{2}\log(2\pi\sigma^2) - \frac{(x_k - \mu)^2}{2\sigma^2}$, so setting the derivatives with respect to $\mu$ and $\sigma^2$ to zero gives $\hat\mu = \frac{1}{n}\sum_{k} x_k$ and $\hat\sigma^2 = \frac{1}{n}\sum_{k}(x_k - \hat\mu)^2$.

19 Biased Estimators
In the last example the ML estimator of $\sigma^2$ was $\hat\sigma^2 = \frac{1}{n}\sum_k (x_k - \hat\mu)^2$. This estimate is biased; that is, $E[\hat\sigma^2] = \frac{n-1}{n}\sigma^2 \ne \sigma^2$. Claim: this estimator is asymptotically unbiased (approaches an unbiased estimate as $n \to \infty$; see below). To see the bias it is sufficient to consider one data point (that is, $n = 1$): then $\hat\mu = x_1$ and $\hat\sigma^2 = 0$, regardless of the true $\sigma^2$. Unbiased estimates for $\mu$ and $\sigma^2$ are $\hat\mu = \frac{1}{n}\sum_k x_k$ and $s^2 = \frac{1}{n-1}\sum_k (x_k - \hat\mu)^2$.
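A simulation sketch of the bias claim, with assumed values $\sigma^2 = 4$ and $n = 5$: averaged over many trials, the ML variance estimate concentrates around $\frac{n-1}{n}\sigma^2$, while the $\frac{1}{n-1}$ version concentrates around $\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma2, n, trials = 4.0, 5, 200_000

x = rng.normal(0.0, np.sqrt(sigma2), size=(trials, n))
ml_var = x.var(axis=1, ddof=0)         # ML estimator: divide by n
unbiased_var = x.var(axis=1, ddof=1)   # unbiased estimator: divide by n - 1

print("E[ML estimate]       ~", ml_var.mean())        # ~ (n-1)/n * sigma2 = 3.2
print("E[unbiased estimate] ~", unbiased_var.mean())  # ~ sigma2 = 4.0
```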

20 ML Estimators for Multivariate Normal PDF
A similar (but much more involved) calculation yields the following ML estimators for the multivariate normal density with unknown mean vector $\mu$ and unknown covariance matrix $\Sigma$: $\hat\mu = \frac{1}{n}\sum_{k=1}^{n} x_k$ and $\hat\Sigma = \frac{1}{n}\sum_{k=1}^{n}(x_k - \hat\mu)(x_k - \hat\mu)^T$. The estimator for the mean is unbiased and the estimator for the covariance matrix is biased. An unbiased estimator is $C = \frac{1}{n-1}\sum_{k=1}^{n}(x_k - \hat\mu)(x_k - \hat\mu)^T$.
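A numpy sketch of these estimators on simulated 2-D data (the true mean and covariance below are arbitrary assumed values):

```python
import numpy as np

rng = np.random.default_rng(3)
mu_true = np.array([1.0, -2.0])
cov_true = np.array([[2.0, 0.6], [0.6, 1.0]])
X = rng.multivariate_normal(mu_true, cov_true, size=1000)   # rows are samples

mu_hat = X.mean(axis=0)                       # ML (and unbiased) mean estimate
diff = X - mu_hat
cov_ml = diff.T @ diff / len(X)               # ML covariance (divide by n), biased
cov_unbiased = diff.T @ diff / (len(X) - 1)   # unbiased covariance (divide by n-1)

print(mu_hat)
print(cov_ml)
print(cov_unbiased)
```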

21 Bayesian Parameter Estimation
Here again, the form of the source density $p(x \mid \theta)$ is assumed to be known, but the parameter vector $\theta$ is unknown. We assume that $\theta$ is a random variable. Our initial knowledge (guess) about $\theta$, before observing the data $\mathcal{D}$, is given by the prior $p(\theta)$. We use the sample data $\mathcal{D} = \{x_1, \ldots, x_n\}$ (drawn independently according to $p(x \mid \theta)$) to compute the posterior $p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})}$. Since $\mathcal{D}$ is drawn i.i.d., $p(\mathcal{D} \mid \theta) = \prod_{k=1}^{n} p(x_k \mid \theta)$. Recall that $p(\mathcal{D}) = \int p(\mathcal{D} \mid \theta)\, p(\theta)\, d\theta$ is a normalizing factor.

22 Bayesian Estimation
The prior $p(\theta)$ is typically "broad" or "flat", reflecting the fact that we don't know much about the parameter values. The data we see is more consistent with some values of the parameters than with others, and therefore we expect the posterior $p(\theta \mid \mathcal{D})$ to peak sharply around the more likely values.

23 Bayesian Estimation
Recall that our goal (in the isolated problem) is to estimate the class-conditional density of the $j$th class, given the labeled data $\mathcal{D}_j$ of that class. Using the posterior we compute the class-conditional $p(x \mid \mathcal{D}) = \int p(x \mid \theta)\, p(\theta \mid \mathcal{D})\, d\theta$, the weighted average of $p(x \mid \theta)$ over all possible values of $\theta$.

24 Bayesian Estimation - Example
Suppose we know that the class-conditional p.d.f. is normal with unknown mean: $p(x \mid \mu) \sim N(\mu, \sigma^2)$ with $\sigma^2$ known. Also, suppose the prior is $p(\mu) \sim N(\mu_0, \sigma_0^2)$ with both $\mu_0$ and $\sigma_0^2$ known. We imagine that Nature draws a value for $\mu$ using $p(\mu)$ and then i.i.d. chooses the data points using $p(x \mid \mu)$. We now calculate the posterior $p(\mu \mid \mathcal{D}) \propto p(\mathcal{D} \mid \mu)\, p(\mu) = p(\mu) \prod_{k} p(x_k \mid \mu)$.

25 Bayesian Estimation - Example cntd.
The answer (exercise): $p(\mu \mid \mathcal{D})$ is normal, $N(\mu_n, \sigma_n^2)$. Letting $\hat\mu_n = \frac{1}{n}\sum_k x_k$ denote the sample mean,
$\mu_n = \frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\,\hat\mu_n + \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\,\mu_0, \qquad \sigma_n^2 = \frac{\sigma_0^2 \sigma^2}{n\sigma_0^2 + \sigma^2}$.
Hint: to save algebraic manipulations, note that any p.d.f. whose exponent is quadratic in $\mu$ is normal. Notice that $\mu_n$ is a convex combination of $\hat\mu_n$ and $\mu_0$. The posterior variance $\sigma_n^2$ always decreases monotonically with $n$, and after observing $n$ samples, $\mu_n$ is our "best guess" for $\mu$.
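A sketch of this posterior update with assumed values for $\sigma^2$, $\mu_0$, $\sigma_0^2$ and simulated data; it computes $\mu_n$ as the convex combination above and $\sigma_n^2$ as stated.

```python
import numpy as np

rng = np.random.default_rng(4)
sigma2 = 1.0                # known variance of p(x | mu)
mu0, sigma0_2 = 0.0, 5.0    # prior p(mu) = N(mu0, sigma0_2), both known

data = rng.normal(2.0, np.sqrt(sigma2), size=20)   # "Nature" drew mu = 2 here
n, sample_mean = len(data), data.mean()

w = n * sigma0_2 / (n * sigma0_2 + sigma2)               # weight on the data
mu_n = w * sample_mean + (1 - w) * mu0                   # posterior mean
sigma_n_2 = sigma0_2 * sigma2 / (n * sigma0_2 + sigma2)  # posterior variance

print(mu_n, sigma_n_2)   # mu_n close to the sample mean; sigma_n_2 shrinks with n
```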

26 Bayesian Estimation - Example cntd.
After determining the posterior it remains to calculate the class-conditional $p(x \mid \mathcal{D}) = \int p(x \mid \mu)\, p(\mu \mid \mathcal{D})\, d\mu$. The result (exercise) is again normal: $p(x \mid \mathcal{D}) \sim N(\mu_n, \sigma^2 + \sigma_n^2)$, where $\mu_n$ and $\sigma_n^2$ are as in the previous slide.
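A Monte Carlo sketch of this class-conditional: draw $\mu$ from the posterior, then $x$ from $p(x \mid \mu)$, and compare the empirical variance with $\sigma^2 + \sigma_n^2$ (the posterior parameters below are assumed for illustration, e.g. from the previous sketch).

```python
import numpy as np

rng = np.random.default_rng(5)
sigma2 = 1.0                  # known variance of p(x | mu)
mu_n, sigma_n_2 = 1.9, 0.05   # assumed posterior parameters

mus = rng.normal(mu_n, np.sqrt(sigma_n_2), size=200_000)   # mu ~ p(mu | D)
xs = rng.normal(mus, np.sqrt(sigma2))                      # x ~ p(x | mu)

print("empirical mean, variance:", xs.mean(), xs.var())
print("predicted mean, variance:", mu_n, sigma2 + sigma_n_2)
```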

27 Another Example: Prob. of Sun Rising
Question: What is the probability that the sun will rise tomorrow? Laplace's (Bayesian) answer: Assume that each day the sun rises with probability $\theta$ (a Bernoulli process) and that $\theta$ is distributed uniformly in $[0, 1]$. Suppose there were $n$ sun rises so far. What is the probability of an $(n+1)$st rise? Denote the data set by $\mathcal{D} = \{x_1, \ldots, x_n\}$, where $x_i = 1$ for every observed day (the sun rose each day).

28 Prob. of Sun Rising - Cntd.
We have $p(\mathcal{D} \mid \theta) = \theta^n$ and $p(\mathcal{D}) = \int_0^1 \theta^n\, d\theta = \frac{1}{n+1}$. Therefore,
$P(x_{n+1} = 1 \mid \mathcal{D}) = \int_0^1 \theta\, p(\theta \mid \mathcal{D})\, d\theta = \frac{\int_0^1 \theta^{n+1}\, d\theta}{\int_0^1 \theta^{n}\, d\theta} = \frac{n+1}{n+2}$.
This is called Laplace's law of succession. Notice that ML gives $\hat\theta = \frac{n}{n} = 1$.
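A numerical sketch of the calculation, integrating over the uniform prior with scipy (the value of $n$ is an arbitrary choice):

```python
from scipy.integrate import quad

n = 10   # n sunrises observed so far

# P(x_{n+1} = 1 | D) = (integral of theta^(n+1)) / (integral of theta^n), uniform prior
num, _ = quad(lambda t: t ** (n + 1), 0.0, 1.0)
den, _ = quad(lambda t: t ** n, 0.0, 1.0)

print("Bayesian (Laplace):", num / den)          # (n+1)/(n+2) = 11/12 ~ 0.9167
print("closed form:       ", (n + 1) / (n + 2))
print("ML estimate:        1.0")                 # theta_hat = n/n = 1
```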

29 Maximum Likelihood vs. Bayesian
ML and Bayesian estimation are asymptotically equivalent: they yield the same class-conditional densities when the size of the training data grows to infinity. ML is typically computationally easier: e.g., consider the case where the p.d.f. is "nice" (i.e., differentiable). In ML we need to do (multidimensional) differentiation, and in Bayesian estimation, (multidimensional) integration. ML is often easier to interpret: it returns the single best model (parameter), whereas Bayesian estimation gives a weighted average of models. But for finite training data (and given a reliable prior) Bayesian estimation is more accurate (it uses more of the information). Bayesian estimation with a "flat" prior is essentially ML; with asymmetric and broad priors the methods lead to different solutions.

