
1 COMP 2208 Dr. Long Tran-Thanh ltt08r@ecs.soton.ac.uk University of Southampton K-Nearest Neighbour

2 Classification Classification sits in the agent's perception-behaviour loop with the environment: perception categorizes the inputs and updates the belief model; the belief model updates the decision-making policy; decision making then drives behaviour.

3 Generalisation Training data: (X1, Y1), (X2, Y2), (X3, Y3), ..., (Xn, Yn). Unseen data: (X, ?). Our goal: good performance on both the training data and never-seen-before data.

4 Overfitting Overfitting: high accuracy on the training data, but low-quality predictions on new data. [Figure: training data vs. testing data, with correct and incorrect predictions marked.]

5 Training vs. testing How do we know the generalisation power of the model? Does it predict well? How do we minimise the generalisation error? Idea: use some of the data for training and keep the rest for testing. Advantage: we can measure the generalisation power. Issue: data held out for testing is wasted (it cannot be used for training), which is a big issue when the dataset is small.
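
A minimal sketch of this idea, assuming the data sits in NumPy arrays X (features) and y (labels); the split fraction, the shuffling, and the helper name are illustrative choices, not part of the slides.

```python
import numpy as np

def train_test_split(X, y, test_fraction=0.25, seed=0):
    """Shuffle the data and hold out a fraction of it for testing."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))          # random order of the data points
    n_test = int(len(X) * test_fraction)   # size of the held-out test set
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]
```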

6 Training vs. testing Idea 2: why not swap the data sets and rerun the algorithm? Objective: minimise the average generalisation error (plus the training error). K-fold cross validation: split the data points into K disjoint (i.e., non-overlapping) partitions; use (K-1) of them for training and the K-th one for testing; repeat this K times, so that each partition is the testing data exactly once.
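
A minimal sketch of K-fold cross validation, assuming a classifier object with hypothetical fit/predict methods; the error measure here is the plain misclassification rate.

```python
import numpy as np

def k_fold_error(model, X, y, k=5, seed=0):
    """Average test error over K folds, each fold used as test data exactly once."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)   # K disjoint partitions
    errors = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model.fit(X[train_idx], y[train_idx])             # train on K-1 partitions
        pred = model.predict(X[test_idx])                  # test on the remaining one
        errors.append(np.mean(pred != y[test_idx]))        # misclassification rate
    return np.mean(errors)
```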

7 Lazy learning So far: we train the system on training data, fix its parameters after training, then test it on other data. Lazy learning: there is no training phase. We evaluate a data point at the time it arrives (similar to online learning), by comparing the new data point with the ones already stored in the system. Typically there are no (global) parameters to set (not always true). A lazy learning algorithm: k-nearest neighbour.

8 The intuition Humans usually categorise and generalise based on how similar a new object is to other known things: e.g., dog, cat, desk, ... But this is not always obvious.

9 The main challenge We need to be able to quantify the degree of similarity. Idea: define some geometric representation of the data points, so that degree of similarity = (geometric) distance between the points. How do we define the metric?

10 A geographic example

11

12 In many cases, we only have some local information. Idea: physically close locations are likely to be similar.

13 Nearest neighbour We classify based on the nearest known data point. Intuition: the most similar known point should dominate the decision. Hungarian proverb: watch her mother before you marry a girl.

14 Voronoi diagram Partition the space into sub-spaces, based on distance to the known points. We use this partitioning to classify: every location in a cell gets the label of the known point that owns the cell.
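
A minimal sketch of the nearest-neighbour rule, assuming the data points are rows of a NumPy array; the Voronoi cells are never built explicitly, they are implicit in always picking the closest stored point.

```python
import numpy as np

def nearest_neighbour_predict(X_train, y_train, x_new):
    """Label a new point with the label of its closest training point (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x_new, axis=1)  # distance to every stored point
    return y_train[np.argmin(dists)]                 # label of the nearest one
```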

15 Another geographic example The towns of Baarle-Hertog (Belgium) and Baarle-Nassau (The Netherlands) have a very complicated border

16 Another geographic example

17 People’s nationality: blue = Dutch, red = Belgian

18 When nearest neighbour is wrong

19 K-nearest neighbour We choose the K closest known neighbours and use majority voting to determine what the label should be. In our second example: K = 5. Choose the 5 nearest locations; if 3 out of 5 are Belgian, our prediction is also Belgian (and Dutch otherwise).
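
A minimal sketch of k-nearest neighbour with majority voting, assuming Euclidean distance; how ties are broken by Counter.most_common is an implementation detail, not something the slides specify.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    """Predict the majority label among the k closest training points."""
    dists = np.linalg.norm(X_train - x_new, axis=1)   # distance to every stored point
    nearest = np.argsort(dists)[:k]                   # indices of the k closest ones
    votes = Counter(y_train[i] for i in nearest)      # count labels among the neighbours
    return votes.most_common(1)[0][0]                 # majority vote
```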

20 K-nearest neighbour with K = 5

21 kNN: Questions How large should K be? What distance metric should be used? Anything other than majority voting?

22 Setting the value of K Unfortunately, there is no general rule. If K is too small, you over-fit to noise in the data (i.e., the output is noisy). If K is too big, you lose all the structural detail (e.g., if K = N you predict all-Netherlands). In practice you try multiple values. A possible heuristic: run kNN with multiple values of K at the same time, and use cross validation to identify the best one.
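
A minimal sketch of that heuristic, reusing the knn_predict helper sketched at slide 19; the candidate values of K and the number of folds are illustrative assumptions.

```python
import numpy as np

def choose_k(X, y, candidate_ks=(1, 3, 5, 7, 9), n_folds=5, seed=0):
    """Pick the K with the lowest average misclassification rate across the folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    best_k, best_err = None, np.inf
    for k in candidate_ks:
        fold_errs = []
        for i in range(n_folds):
            test = folds[i]
            train = np.concatenate([folds[j] for j in range(n_folds) if j != i])
            # knn_predict is the k-nearest-neighbour sketch from slide 19
            preds = [knn_predict(X[train], y[train], X[t], k) for t in test]
            fold_errs.append(np.mean(np.array(preds) != y[test]))
        if np.mean(fold_errs) < best_err:
            best_k, best_err = k, np.mean(fold_errs)
    return best_k
```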

23 Distance metrics Geography-based problems: Euclidean distance, where distance = physical distance between 2 locations. In some cases the (Euclidean) distance is not obvious, because data points have multiple dimensions. Example: data point = a sales person, who has an age and a number of sold items. Age is between 0 and 100 (unit = 1); sold items is between 0 and 1 million (unit = 1000). If we put this into Euclidean space directly, the latter dominates the former.

24 Distance metrics Idea: we need to normalise the data when we put it into Euclidean space. Normalisation = rescale the data so that all the dimensions are comparable, e.g., make every dimension lie between 0 and 1. What to do when the data is categorical (e.g., "Good", "OK", "Terrible")? Practical consensus: we don't use kNN in these cases. Categorical data describes membership of a group: the groups are distinct, and may be represented with a code number, but they cannot be ranked. Examples: country, sex, species, behaviour. kNN is typically good at using continuous data (e.g., location coordinates) to predict categorical outcomes (e.g., country) or continuous outcomes (e.g., home property price).
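
A minimal sketch of min-max normalisation before computing distances; rescaling each column to [0, 1] is one way to realise the rescaling the slide describes.

```python
import numpy as np

def min_max_normalise(X):
    """Rescale each column (dimension) of X to lie between 0 and 1."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # avoid division by zero on constant columns
    return (X - lo) / span

# Example: age (0-100) and items sold (0-1,000,000) become comparable after rescaling.
X = np.array([[25.0,  10_000.0],
              [60.0, 900_000.0],
              [40.0, 250_000.0]])
X_scaled = min_max_normalise(X)
```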

25 Other distance metrics? Data points = distributions of real data (e.g., a data stream): probability-distance metrics such as the Hellinger distance, the Kolmogorov distance, etc. Other non-Euclidean distances: the L1 (taxicab) distance.
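
A minimal sketch contrasting the Euclidean (L2) and taxicab (L1) distances between two points; either could be swapped into the knn_predict sketch above in place of np.linalg.norm.

```python
import numpy as np

def euclidean_distance(a, b):
    """Straight-line (L2) distance between two points."""
    return np.sqrt(np.sum((a - b) ** 2))

def taxicab_distance(a, b):
    """L1 distance: sum of absolute coordinate differences."""
    return np.sum(np.abs(a - b))
```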

26 Anything other than majority voting? Choose the median, or the average. Weighted majority voting: how do we set the weights? Do we prefer the closer neighbours, or the more distant ones? Weight the neighbours by their importance. Many other, more exotic voting rules exist (see social choice theory).
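
A minimal sketch of weighted majority voting, assuming (as one common choice, not stated on the slide) that closer neighbours get larger weights via inverse distance.

```python
import numpy as np
from collections import defaultdict

def weighted_knn_predict(X_train, y_train, x_new, k=5, eps=1e-9):
    """Each of the k nearest neighbours votes with weight 1 / distance."""
    dists = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(dists)[:k]
    scores = defaultdict(float)
    for i in nearest:
        scores[y_train[i]] += 1.0 / (dists[i] + eps)  # closer points carry more weight
    return max(scores, key=scores.get)
```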

27 Summary: kNN kNN is a simple example of lazy learning (instance-based, memory-based). kNN doesn't require a training stage: we use the training data itself to classify. Things to consider: a distance metric; how many neighbours we look at; and a method for predicting the output from the neighbouring local points.

