
1 COMP 2208 Dr. Long Tran-Thanh ltt08r@ecs.soton.ac.uk University of Southampton K-Nearest Neighbour

2 Classification Classification sits in the agent's perception-behaviour loop with the environment: perception categorizes the inputs and updates the belief model; the belief model updates the decision-making policy; decision making then drives behaviour.

3 Generalisation Training data: (X1, Y1), (X2, Y2), (X3, Y3), ..., (Xn, Yn). Unseen data: (X, ?). Our goal: good performance on both the training data and never-seen-before data.

4 Overfitting Overfitting: high accuracy on the training data, but low-quality predictions on new data. [Figure: training data vs. testing data, with correct and incorrect predictions marked.]

5 Training vs. testing How do we know the generalisation power of the model? Does it predict well? How do we minimise the generalisation error? Idea: use some of the data for training and keep the rest for testing. Advantage: we can measure the generalisation power. Issue: data held out for testing is wasted (it cannot be used for training), which is a big issue when the dataset is small.
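
A minimal sketch of this idea, assuming the data sits in NumPy arrays X (features) and y (labels); the split fraction, the shuffling, and the helper name are illustrative choices, not part of the slides.

```python
import numpy as np

def train_test_split(X, y, test_fraction=0.25, seed=0):
    """Shuffle the data and hold out a fraction of it for testing."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))          # random order of the data points
    n_test = int(len(X) * test_fraction)   # size of the held-out test set
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]
```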

6 Training vs. testing Idea 2: why not swap the data sets and rerun the algorithm? Objective: minimise the average generalisation error (plus the training error). K-fold cross validation: split the data points into K disjoint (i.e., non-overlapping) partitions; use (K-1) of them for training and the K-th one for testing; repeat this K times, so that each partition is the testing data exactly once.
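
A minimal sketch of K-fold cross validation, assuming a classifier object with hypothetical fit/predict methods; the error measure here is the plain misclassification rate.

```python
import numpy as np

def k_fold_error(model, X, y, k=5, seed=0):
    """Average test error over K folds, each fold used as test data exactly once."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)   # K disjoint partitions
    errors = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model.fit(X[train_idx], y[train_idx])             # train on K-1 partitions
        pred = model.predict(X[test_idx])                  # test on the remaining one
        errors.append(np.mean(pred != y[test_idx]))        # misclassification rate
    return np.mean(errors)
```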

7 Lazy learning So far: we train the system on training data, fix its parameters after training, then test it on other data. Lazy learning: there is no training phase. We evaluate a data point at the time it arrives (similar to online learning), by comparing the new data point with the ones already stored in the system. Typically there are no (global) parameters to set (not always true). A lazy learning algorithm: k-nearest neighbour.

8 The intuition Humans usually categorise and generalise based on how similar a new object is to other known things: e.g., dog, cat, desk, ... But this is not always obvious.

9 The main challenge We need to be able to quantify the degree of similarity. Idea: define some geometric representation of the data points, so that degree of similarity = (geometric) distance between the points. How do we define the metric?

10 A geographic example

11

12 In many cases, we only have some local information. Idea: physically close locations are likely to be similar.

13 Nearest neighbour We classify based on the nearest known data point. Intuition: the most similar known point should dominate the decision. Hungarian proverb: watch her mother before you marry a girl.

14 Voronoi diagram Partition the space into sub-spaces, based on distance to the known points. We use this partitioning to classify: every location in a cell gets the label of the known point that owns the cell.
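
A minimal sketch of the nearest-neighbour rule, assuming the data points are rows of a NumPy array; the Voronoi cells are never built explicitly, they are implicit in always picking the closest stored point.

```python
import numpy as np

def nearest_neighbour_predict(X_train, y_train, x_new):
    """Label a new point with the label of its closest training point (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x_new, axis=1)  # distance to every stored point
    return y_train[np.argmin(dists)]                 # label of the nearest one
```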

15 Another geographic example The towns of Baarle-Hertog (Belgium) and Baarle-Nassau (The Netherlands) have a very complicated border

16 Another geographic example

17 People’s nationality: blue = Dutch, red = Belgian

18 When nearest neighbour is wrong

19 K-nearest neighbour We choose the K closest known neighbours and use majority voting to determine what the label should be. In our second example: K = 5. Choose the 5 nearest locations; if 3 out of 5 are Belgian, our prediction is also Belgian (and Dutch otherwise).
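
A minimal sketch of k-nearest neighbour with majority voting, assuming Euclidean distance; how ties are broken by Counter.most_common is an implementation detail, not something the slides specify.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    """Predict the majority label among the k closest training points."""
    dists = np.linalg.norm(X_train - x_new, axis=1)   # distance to every stored point
    nearest = np.argsort(dists)[:k]                   # indices of the k closest ones
    votes = Counter(y_train[i] for i in nearest)      # count labels among the neighbours
    return votes.most_common(1)[0][0]                 # majority vote
```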

20 K-nearest neighbour with K = 5

21 kNN: Questions How large should K be? What distance metric should be used? Anything other than majority voting?

22 Setting the value of K Unfortunately, there is no general rule. If K is too small, you over-fit to noise in the data (i.e., the output is noisy). If K is too big, you lose all the structural detail (e.g., if K = N you predict all-Netherlands). In practice you try multiple values. A possible heuristic: run kNN with multiple values of K at the same time, and use cross validation to identify the best one.
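
A minimal sketch of that heuristic, reusing the knn_predict helper sketched at slide 19; the candidate values of K and the number of folds are illustrative assumptions.

```python
import numpy as np

def choose_k(X, y, candidate_ks=(1, 3, 5, 7, 9), n_folds=5, seed=0):
    """Pick the K with the lowest average misclassification rate across the folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    best_k, best_err = None, np.inf
    for k in candidate_ks:
        fold_errs = []
        for i in range(n_folds):
            test = folds[i]
            train = np.concatenate([folds[j] for j in range(n_folds) if j != i])
            # knn_predict is the k-nearest-neighbour sketch from slide 19
            preds = [knn_predict(X[train], y[train], X[t], k) for t in test]
            fold_errs.append(np.mean(np.array(preds) != y[test]))
        if np.mean(fold_errs) < best_err:
            best_k, best_err = k, np.mean(fold_errs)
    return best_k
```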

23 Distance metrics Geography-based problems: Euclidean distance, where distance = physical distance between 2 locations. In some cases the (Euclidean) distance is not obvious, because data points have multiple dimensions. Example: data point = a sales person, who has an age and a number of sold items. Age is between 0 and 100 (unit = 1); sold items is between 0 and 1 million (unit = 1000). If we put this into Euclidean space directly, the latter dominates the former.

24 Distance metrics Idea: we need to normalise the data when we put it into Euclidean space. Normalisation = rescale the data so that all the dimensions are comparable, e.g., make every dimension lie between 0 and 1. What to do when the data is categorical (e.g., "Good", "OK", "Terrible")? Practical consensus: we don't use kNN in these cases. Categorical data describes membership of a group: the groups are distinct, and may be represented with a code number, but they cannot be ranked. Examples: country, sex, species, behaviour. kNN is typically good at using continuous data (e.g., location coordinates) to predict categorical outcomes (e.g., country) or continuous outcomes (e.g., home property price).
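
A minimal sketch of min-max normalisation before computing distances; rescaling each column to [0, 1] is one way to realise the rescaling the slide describes.

```python
import numpy as np

def min_max_normalise(X):
    """Rescale each column (dimension) of X to lie between 0 and 1."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # avoid division by zero on constant columns
    return (X - lo) / span

# Example: age (0-100) and items sold (0-1,000,000) become comparable after rescaling.
X = np.array([[25.0,  10_000.0],
              [60.0, 900_000.0],
              [40.0, 250_000.0]])
X_scaled = min_max_normalise(X)
```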

25 Other distance metrics? Data points = distributions of real data (e.g., a data stream): probability-distance metrics such as the Hellinger distance, the Kolmogorov distance, etc. Other non-Euclidean distances: the L1 (taxicab) distance.
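
A minimal sketch contrasting the Euclidean (L2) and taxicab (L1) distances between two points; either could be swapped into the knn_predict sketch above in place of np.linalg.norm.

```python
import numpy as np

def euclidean_distance(a, b):
    """Straight-line (L2) distance between two points."""
    return np.sqrt(np.sum((a - b) ** 2))

def taxicab_distance(a, b):
    """L1 distance: sum of absolute coordinate differences."""
    return np.sum(np.abs(a - b))
```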

26 Anything other than majority voting? Choose the median, or the average. Weighted majority voting: how do we set the weights? Do we prefer the closer neighbours, or the more distant ones? Weight the neighbours by their importance. Many other, more exotic voting rules exist (see social choice theory).
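
A minimal sketch of weighted majority voting, assuming (as one common choice, not stated on the slide) that closer neighbours get larger weights via inverse distance.

```python
import numpy as np
from collections import defaultdict

def weighted_knn_predict(X_train, y_train, x_new, k=5, eps=1e-9):
    """Each of the k nearest neighbours votes with weight 1 / distance."""
    dists = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(dists)[:k]
    scores = defaultdict(float)
    for i in nearest:
        scores[y_train[i]] += 1.0 / (dists[i] + eps)  # closer points carry more weight
    return max(scores, key=scores.get)
```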

27 Summary: kNN kNN is a simple example of lazy learning (instance-based, memory-based). kNN doesn't require a training stage: we use the training data itself to classify. Things to consider: a distance metric; how many neighbours we look at; and a method for predicting the output from the neighbouring local points.

