Nonparametric methods Parzen window and nearest neighbor


1 Nonparametric methods Parzen window and nearest neighbor

2 Introduction
In previous lectures we assumed that
- the exact distribution of the data is known,
- or at least the shape of the distribution is known.
But in real-life tasks
- the exact distribution is never known,
- and assuming that its shape takes a special form (e.g. Gaussian) is also mostly unrealistic – this results in a mismatch between the real and the assumed distribution, which we earlier called the "modelling error".
Today: methods that do not use a parametric curve (such as a Gaussian) to describe the shape of the distribution
This is why they are called "non-parametric methods"

3 Nonparametric methods
They are called this way because they do not use any parametric curve to describe the shape of the distribution
However, the name is misleading, as they still have some meta-parameters!
They directly use the instances of the training data set to create a probability estimate
- This is why this approach is also known as instance-based learning
- There is no modelling assumption about the shape of the distribution!
In the training phase the training samples are simply stored
- This is why it is also called "lazy learning"
- Processing happens in the testing phase – this is their main drawback
First type of methods: approximate p(x | ci)
Second type of methods: directly approximate P(ci | x)

4 Basic idea
At a given point x we will estimate p(x) from the density of the data points falling into a small region R around x
More samples in R → larger estimated probability density

5 Example
How can we approximate p(x) for any x∈R1 or x∈R2?
To estimate P[x∈R1] and P[x∈R2], we divide the number k of samples that fall in the given region R by the total number n of all samples
P[x∈R1] = 6/20
P[x∈R2] = 6/20
Should our estimates for p(x) be equal? No, because R2 is wider than R1
So the estimate will be inversely proportional to the region size V
Altogether, our estimate will be p(x) ≈ k/(nV)
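To make the p(x) ≈ k/(nV) idea concrete, here is a minimal Python sketch. The sample values and the two region boundaries are invented for illustration (the actual points only appear in the slide's figure); they are chosen so that each region again contains 6 of the 20 samples.

```python
# Minimal sketch of the basic estimate p(x) ≈ k/(nV) for 1-D data.
# The sample values and the two regions are made up for illustration.
import numpy as np

samples = np.array([1.1, 1.3, 1.4, 1.6, 1.8, 1.9,        # cluster inside R1
                    3.0, 3.5, 4.1, 4.4, 4.9, 5.6,        # spread over R2
                    6.2, 6.8, 7.1, 7.5, 8.0, 8.3, 9.0, 9.4])
n = len(samples)

def density_estimate(region_lo, region_hi):
    """Estimate p(x) inside [region_lo, region_hi] as k / (n * V)."""
    k = np.sum((samples >= region_lo) & (samples <= region_hi))
    V = region_hi - region_lo            # "volume" = interval length in 1-D
    return k / (n * V)

print(density_estimate(1.0, 2.0))   # narrow region R1: 6 samples, V = 1 -> 0.3
print(density_estimate(3.0, 6.0))   # wider region R2:  6 samples, V = 3 -> 0.1
```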

6 Formal derivation
The probability ρ that a point x falls inside region R is ρ = ∫_R p(x′) dx′
Let's suppose that we draw n samples x1,…,xn from p(x). Then the probability that exactly k points fall in R is given by the binomial distribution:
P(k) = (n choose k) ρ^k (1 − ρ)^(n−k)
From the value of k we can create an estimate for ρ as follows:
- The expected value of this distribution with respect to k is E[k] = nρ. Moreover, the distribution is very sharply peaked around its mean, especially when n is large.
- Based on this, we can use the estimate k ≈ nρ → ρ ≈ k/n
More formally, it can also be shown that k/n is the ML estimate of ρ
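The ML claim can be verified with a short derivation; this is the standard maximum-likelihood argument, spelled out here for completeness rather than taken from the slide:

```latex
% Maximize the binomial likelihood of observing k hits in n draws w.r.t. rho.
\begin{align*}
L(\rho) &= \binom{n}{k}\,\rho^{k}(1-\rho)^{n-k} \\
\frac{\partial}{\partial \rho}\log L(\rho)
        &= \frac{k}{\rho} - \frac{n-k}{1-\rho} = 0
        \quad\Longrightarrow\quad \hat{\rho} = \frac{k}{n}
\end{align*}
```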

7 Formal derivation 2
Next, let's assume that p(x) is continuous and that region R is so small that p(x) is approximately constant in R:
ρ = ∫_R p(x′) dx′ ≈ p(x)·V
where x is in R and V is the volume of R
Also, remember from the previous slide that ρ ≈ k/n
Putting this all together results in the estimate p(x) ≈ k/(nV)

8 Summary of derivation
We obtained the same result, p(x) ≈ k/(nV), that we obtained intuitively earlier
Our estimate is the average of the true density over R
And we estimate p(x) with the same value everywhere within R, so our estimate assumes that p(x) is constant in R

9 Interpretation as a histogram
Our probability estimate is very similar to a histogram: If the regions do not overlap and cover the whole range of the values then our estimate is basically a (normalized) histogram

10 The accuracy of the estimation
How accurate is our approximation?
It was built on two assumptions:
- it becomes more and more accurate as n → ∞
- it becomes more and more accurate as V → 0
So, as we have more and more samples we can shrink V to be smaller and smaller, but we must be careful that it still contains samples, otherwise our estimate will be p(x) = 0
In theory, if we can get an infinite number of samples, then our estimate will converge to the real distribution if we simultaneously increase the number of samples n and decrease the volume V, but carefully, so that k also stays large enough
We discuss this more precisely on the next slide

11 The accuracy of the estimation 2
Let's create a series of estimates
p_n(x) = k_n/(n·V_n)
where n is the number of samples, V_n is the volume of the region used in the nth estimate, and k_n is the number of samples that fall in this volume
It can be shown that p_n(x) → p(x) if the following 3 conditions are satisfied:
- V_n → 0 as n → ∞
- k_n → ∞ as n → ∞
- k_n/n → 0 as n → ∞

12 The accuracy of the estimation 3
Two examples of how to satisfy these conditions:
In the first case we define V_n as some function of n, for example V_n = 1/√n, and calculate k_n from the data
- This will be the basis of the Parzen window approach
A second possibility is to specify k_n as a function of n, for example k_n = √n, and let V_n depend on the actual data
- This will give the basis of the nearest neighbor approach
For both of the above choices of V_n or k_n the estimate will converge to p(x) as n goes to infinity, although it's hard to tell how they will behave in a finite-sample case

13 Demonstration

14 Density estimation in practice
For a real-life task the number of samples n is always finite
So we cannot increase n, only decrease V
But we must be careful, because if V becomes too small then p(x) will become 0 for many x values, as the region will contain no samples
Therefore, in practice we must find a compromise for V
- It should not be too small, so that it contains enough samples (k should not become 0)
- V should not be too large either, because then our estimate will be inaccurate (it will be too smooth, e.g. it will be constant within R)
We will find the optimal value for V or k based on the actual data

15 Two practical approaches
Parzen windows: we fix the volume V and calculate k from the data
Nearest neighbors: we fix the value of k and calculate V from the data
Let's start with the Parzen window approach: we fix the size and shape of region R
We choose it to be a d-dimensional hypercube with side length h, so its volume is V = h^d

16 Parzen windows
So, n is given and V is fixed, so all we have to do is to center the region R at x, count the number k of points which fall into it, and then we can calculate p(x) ≈ k/(nV)
We want to formalize this, so we define a window function
φ(u) = 1 if |u_j| ≤ 1/2 for all j = 1,…,d, and 0 otherwise

17 Parzen windows 2
Let's center the window function on the point x where we want to estimate p(x), and let's see how it behaves on our training samples x1,…,xn
So, φ((x − xi)/h) = 1 if xi falls within the hypercube of side h centered at x, and 0 otherwise

18 Parzen window estimate
We can simply count the number of samples that fall within the hypercube with side h centered at x as
k = Σ_{i=1..n} φ((x − xi)/h)
Recall that p(x) ≈ k/(nV) and V = h^d
So our estimate will be
p(x) = (1/n) Σ_{i=1..n} (1/h^d) φ((x − xi)/h)
Let's check that this is a correct density function:
∫ p(x) dx = (1/n) Σ_{i=1..n} (1/h^d) ∫ φ((x − xi)/h) dx = (1/n) Σ_{i=1..n} (1/h^d) · h^d = 1

19 Parzen window – another interpretation
So far, we fixed x and varied i to see which of the xi samples fall within the hypercube centered on x, so that φ((x − xi)/h) = 1
Let's turn it around and analyze how a given xi contributes to the estimate of p(x)
We see that φ((x − xi)/h), viewed as a function of x, is simply a function that gives 1 for all x values that are close enough to xi, and 0 otherwise

20 Parzen window as a sum of functions
Now, if we look at our estimate again,
p(x) = (1/n) Σ_{i=1..n} (1/h^d) φ((x − xi)/h)
we see that we can easily calculate it by fitting hypercubes on all the training instances x1,…,xn
So p(x) is just a sum of n box-like functions, each with height 1/(n·h^d)
Let's see an example!

21 Parzen window - example
We have seven samples D = {2, 3, 4, 8, 10, 11, 12}
n = 7, h = 3, d = 1
To obtain our estimate we simply have to sum 7 boxes, each positioned on one of the seven points
The height of the boxes is 1/(n·h^d) = 1/(7·3) = 1/21
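A minimal Python sketch of this box-window estimate, using the slide's data D and h = 3; the evaluation grid and the printed checks are my own additions:

```python
# Hypercube (box) Parzen estimate for D = {2, 3, 4, 8, 10, 11, 12}, h = 3, d = 1.
import numpy as np

D = np.array([2, 3, 4, 8, 10, 11, 12], dtype=float)
n, h, d = len(D), 3.0, 1

def box_window(u):
    """Hypercube window: 1 if |u| <= 1/2 (per dimension), else 0."""
    return (np.abs(u) <= 0.5).astype(float)

def parzen_box(x):
    """p(x) = (1/n) * sum_i (1/h^d) * phi((x - x_i)/h)."""
    return np.mean(box_window((x - D) / h)) / h**d

xs = np.linspace(0, 14, 281)
p = np.array([parzen_box(x) for x in xs])

print(parzen_box(3.0))   # samples 2, 3, 4 fall in the box -> 3/21 ≈ 0.143
print(np.trapz(p, xs))   # numerical integral, should be close to 1
```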

22 Drawbacks of the hypercube Parzen window
As long as xi is within the hypercube around x, its contribution to p(x) will be the same, independent of its distance from x
The same is true for the samples outside the hypercube – they give a contribution of 0, no matter how far or close they are to x
As a consequence, the estimate of p(x) is not smooth

23 Generalized window functions
The estimate of p(x) is
p(x) = (1/n) Σ_{i=1..n} (1/h^d) φ((x − xi)/h)
Instead of the hypercube, let's try to use other window functions that gradually decrease as we move away from their center
Functions with this shape are frequently called "kernel functions"
Will we still get a valid density estimate?
∫ p(x) dx = 1 is satisfied if ∫ φ(u) du = 1
So if φ(u) is a valid density function, then p(x) is also valid!

24 Generalized window functions 2
Notice that the window function is no longer counting the number of samples within some region R
Instead, it now calculates a weighted average over all the samples, with the weight decreasing with the distance of the given sample from x
However, it can still be shown that under proper conditions p_n(x) → p(x)
A typical choice for φ(u) is the Gaussian density
It solves both drawbacks of the "box" window:
- Samples closer to x receive a larger weight
- The estimate of p(x) will be smooth

25 Gaussian window function - example
Let's return to our previous example
D = {2, 3, 4, 8, 10, 11, 12}, n = 7, h = 1, d = 1
The estimate for p(x) will be the sum of 7 Gaussians, each centered on one of the sample points, and each scaled by 1/7
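The Gaussian-window version of the same example, as a small Python sketch (again, the evaluation grid is my choice, not part of the slide):

```python
# Gaussian-window Parzen estimate for D = {2, 3, 4, 8, 10, 11, 12}, h = 1, d = 1.
import numpy as np

D = np.array([2, 3, 4, 8, 10, 11, 12], dtype=float)
n, h = len(D), 1.0

def gauss_window(u):
    """Standard Gaussian kernel, integrates to 1."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def parzen_gauss(x):
    """p(x): a sum of n Gaussians, each centered on a sample, scaled by 1/(n*h)."""
    return np.mean(gauss_window((x - D) / h)) / h

xs = np.linspace(-2, 16, 361)
p = np.array([parzen_gauss(x) for x in xs])
print(np.trapz(p, xs))   # ≈ 1 (the tails outside the grid are negligible)
```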

26 Experimental evaluation
We will draw samples from a known distribution
We will vary n and h, and see what estimates we get
We will use 2 types of distributions:
- a normal density
- a mixture of a triangular and a uniform distribution

27 Example 1: Gaussian distribution

28 Example 2: Mixture distribution

29 The effect of window width h
If we select too small an h, we superimpose sharp "impulses" on the data samples, and the estimate won't be smooth enough
If we select too large an h, the estimate will be too smooth
The optimal value of h is task-specific and hard to guess in advance
However, we can optimize it by experimenting on a validation data set
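As a sketch of what "experimenting on a validation data set" could look like in practice (this procedure and the toy data are my own illustration, not from the slides), one can pick the h that maximizes the held-out log-likelihood:

```python
# Sketch: choose h by maximizing the log-likelihood of a held-out set.
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(loc=5.0, scale=2.0, size=200)   # toy training samples
valid = rng.normal(loc=5.0, scale=2.0, size=100)   # toy validation samples

def parzen_gauss(x, data, h):
    """Gaussian Parzen estimate of p(x); x is an array of query points."""
    u = (x[:, None] - data[None, :]) / h
    return np.mean(np.exp(-0.5 * u**2), axis=1) / (h * np.sqrt(2.0 * np.pi))

best_h, best_ll = None, -np.inf
for h in [0.05, 0.1, 0.2, 0.5, 1.0, 2.0, 5.0]:
    ll = np.sum(np.log(parzen_gauss(valid, train, h) + 1e-300))
    if ll > best_ll:
        best_h, best_ll = h, ll
print("selected h:", best_h)
```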

30 An example of the decision boundaries
For a very small h the classification will be perfect on the training data; however, the decision boundary is very complex and won't generalize well to the test data
For a larger h the training data is not perfectly classified; however, the decision boundary is smoother and more likely to generalize to unseen data

31 Parzen window - summary
Advantages
- No assumption about the shape of the distribution
- In theory it converges to the real distribution if the number of samples goes to infinity
Disadvantages
- The optimal value for h is difficult to find (it is called a non-parametric method, but it does have a parameter! The choice of φ(u) can also be regarded as a parameter)
- May need a large number of samples for an accurate estimate
- Requires a lot of memory to store all the samples and a lot of computation to calculate p(x) (remember, it is a lazy learner!)
One solution is to fit Gaussians on clusters of the data instead of on all the sample instances → and we arrive at GMMs

32 K Nearest Neighbors (k-NN)
Remember that we approximated p(x) as p(x) ≈ k/(nV)
In the Parzen window approach we fixed V
Now we fix k, and select V so that it contains exactly k points
This will be the k nearest neighbors, or k-NN approach for short
The k-NN approach seems to be a good solution for the "optimal window size" problem:
- Center a cell on x and let it grow until it captures k samples
- These k samples will be the k nearest neighbors of x
- The window size will change dynamically
- If the samples are locally dense, then V will be small, and we obtain a more precise estimate
- If the samples are sparse, then V is larger and the estimate is smoother
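A minimal 1-D sketch of the k-NN density estimate, reusing the earlier sample set (the choice of data and of k = 3 is mine, just for illustration):

```python
# k-NN density estimate in 1-D: grow a symmetric interval around x until it
# contains exactly k samples, then return p(x) ≈ k / (n * V).
import numpy as np

D = np.array([2, 3, 4, 8, 10, 11, 12], dtype=float)
n, k = len(D), 3

def knn_density(x):
    """V is the length of the smallest symmetric interval around x
    containing the k nearest samples."""
    r = np.sort(np.abs(D - x))[k - 1]   # distance to the k-th nearest neighbor
    V = 2.0 * r                         # interval [x - r, x + r] in 1-D
    return k / (n * V)

print(knn_density(3.0))    # dense region  -> small V -> larger estimate
print(knn_density(6.0))    # sparse region -> large V -> smaller estimate
```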

33 K Nearest Neighbors (k-NN)
OK, now we can tune V, but how to choose k?
Earlier we said that theoretically k_n = √n is a good choice (the estimate converges to the real distribution)
Although in practice it is too large for a large n
Problem: for finite n the k-NN estimate is not a valid probability distribution
Not even close…
Example: d = 1, n = 1, k = 1: the estimate is p(x) = 1/(2|x − x1|), whose integral diverges instead of being 1

34 K Nearest Neighbors (k-NN)
For a larger k the estimate is better, but it is still not a valid distribution
For a small k the estimates are very "spiky"

35 Experimental evaluation
We will draw samples from a known distribution
We will vary n and k, and see what estimates we get
We will use 2 types of distributions:
- a normal density
- a mixture of a triangular and a uniform distribution

36 Example

37 Distribution estimation with k-NN
For infinite n we could create a series of estimates that converges to the real distribution
But in practice n is finite, and the estimate is not a valid distribution:
- Its integral is not 1
- It may look very spiky
- Even in regions without samples the estimate is far from zero
But for the Bayes decision rule we do not necessarily need p(x) if we can estimate P(ci|x) directly

38 Estimating P(ci|x) with k-NN
Let's place a cell of volume V around x and capture k samples
Our estimate for p(x) will be p(x) ≈ k/(nV)
To create a class-specific estimate we should ignore all other classes but ci
So our estimate will be p(x, ci) ≈ ki/(nV), where ki is the number of the captured samples that belong to ci
How to estimate P(ci|x) from this?
P(ci|x) = p(x, ci)/p(x) ≈ (ki/(nV)) / (k/(nV)) = ki/k
So our estimate will simply be the ratio of samples within the cell that belong to ci!
According to the Bayes decision rule, we select the class that has the largest number of samples within the cell
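A small Python sketch of the resulting k-NN classification rule P(ci|x) ≈ ki/k; the toy 2-D data and labels are invented for illustration:

```python
# k-NN classification: count class labels among the k nearest training samples.
import numpy as np

X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.3],    # class 0
              [3.0, 3.2], [3.1, 2.9], [2.8, 3.0]])   # class 1
y = np.array([0, 0, 0, 1, 1, 1])
k = 3

def knn_classify(x):
    """Return (predicted class, estimated P(ci|x) = ki/k for each class)."""
    dist = np.linalg.norm(X - x, axis=1)           # Euclidean distances
    nearest = np.argsort(dist)[:k]                 # indices of the k nearest
    counts = np.bincount(y[nearest], minlength=2)  # ki for each class
    return np.argmax(counts), counts / k           # pick the largest ki

print(knn_classify(np.array([1.1, 1.0])))   # expected: class 0
print(knn_classify(np.array([2.5, 2.5])))   # expected: class 1 (nearer cluster 1)
```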

39 How good is k-NN? Let's denote the best possible error rate – the Bayes error – by E
Let's examine the case of k = 1
It can be shown that even for k = 1, as n → ∞, the error rate of the 1-NN rule is never worse than 2E
So it is surprisingly good
Though it is hard to tell anything about its performance for a finite n
The decision boundaries of 1-NN can be represented by Voronoi cells: it can represent quite complex decision surfaces

40 How to select k in practice
Theoretically, a larger value of k gives better estimates if n is infinite
But the k neighbors should also be close to x
- This is possible when n is infinite
- But impossible in practice, when n is finite
So we should find a compromise for k
- Too small a k: the decision boundary will be complex and noisy (i.e. too sensitive to the actual data points)
- Too large a k: the decision boundary will be over-smoothed
We have to tune k experimentally (see the sketch below)
So, again, it is not true that nonparametric methods have no parameters!
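One simple way to tune k experimentally, as mentioned above (the hold-out procedure and toy data below are my own sketch, not prescribed by the slides):

```python
# Sketch: pick k by classification accuracy on a held-out validation set.
import numpy as np

rng = np.random.default_rng(1)
X_train = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y_train = np.array([0] * 100 + [1] * 100)
X_val = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y_val = np.array([0] * 50 + [1] * 50)

def knn_predict(x, k):
    dist = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dist)[:k]
    return np.bincount(y_train[nearest], minlength=2).argmax()

best_k, best_acc = None, -1.0
for k in [1, 3, 5, 9, 15, 25]:
    acc = np.mean([knn_predict(x, k) == t for x, t in zip(X_val, y_val)])
    if acc > best_acc:
        best_k, best_acc = k, acc
print("selected k:", best_k, "validation accuracy:", best_acc)
```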

41 K-NN - summary
Advantages
- No assumption about the shape of the distribution
- Very simple, and works surprisingly well in practice
Disadvantages
- The optimal value for k has to be found experimentally
- May need a large number of samples for an accurate estimate
- Requires a lot of memory to store all the samples and a lot of computation
Cost of computing the distance between two samples: O(d)
Cost of finding one nearest neighbor: O(nd)
Cost of finding k nearest neighbors: O(knd)
Remember that this computation is done at test time and not at train time (lazy learning)

42 Speeding up K-NN Approach #1:
We can discard the samples that do not influence the decision boundary, so we can reduce the storage and computation costs

43 Speeding up K-NN Approach #2:
Speed up the search for the nearest neighbor by storing the samples in a search tree
- This reduces the computation time
- But it requires a complex search tree method
- These methods usually return only an approximation (i.e. instead of finding the nearest neighbor they find only a "close enough" neighbor), so the decision boundaries might change slightly
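As an illustration of the search-tree idea, here is a sketch using SciPy's kd-tree; the slides do not name a specific data structure or library, so this is just one possible choice:

```python
# kd-tree based nearest-neighbor search: build the tree once, query it at test time.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(2)
X = rng.normal(size=(10000, 5))        # toy training samples
tree = cKDTree(X)                       # built once, at "training" time

query = rng.normal(size=(5,))
dist, idx = tree.query(query, k=3)              # exact 3 nearest neighbors
dist_a, idx_a = tree.query(query, k=3, eps=0.5) # approximate search: neighbors
                                                # may be up to (1+eps) farther
                                                # than the true nearest ones
print(idx, idx_a)
```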

44 Finding the distance function
So far we assumed that the distance function that defines the distance between two samples is given
But it can also be varied, so we can consider it as a parameter as well
The standard distance function is the Euclidean distance:
d(x, y) = sqrt( Σ_{j=1..d} (x_j − y_j)² )
It treats all dimensions (i.e. all features) as equally important
However, some features may be more important than others
Moreover, noise-like features can destroy the classification!
In this case we can apply a weighted distance
d_w(x, y) = sqrt( Σ_{j=1..d} w_j (x_j − y_j)² )
where the weights w_j are learned from the data
And, of course, we may apply quite different distance functions as well… for example, consider the case of discrete features
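A small sketch of such a weighted distance; the inverse-variance weighting used here is only one simple, hypothetical way to set the weights from the data (the slides do not prescribe how the weights are learned):

```python
# Weighted Euclidean distance with weights derived from the training data.
import numpy as np

def weighted_euclidean(x, y, w):
    """d_w(x, y) = sqrt( sum_j w_j * (x_j - y_j)^2 )."""
    return np.sqrt(np.sum(w * (x - y) ** 2))

# Example: set the weights to 1 / feature variance, which down-weights
# features with a large, possibly noise-like spread.
X_train = np.array([[1.0, 100.0], [1.2, 250.0], [0.8, 40.0], [1.1, 180.0]])
w = 1.0 / np.var(X_train, axis=0)

print(weighted_euclidean(X_train[0], X_train[1], w))
print(weighted_euclidean(X_train[0], X_train[1], np.ones(2)))  # unweighted
```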

