Data Mining – Algorithms: Instance-Based Learning

1 Data Mining – Algorithms: Instance-Based Learning
Chapter 4, Section 4.7. Some of this was covered in Chapter 3.

2 Instance Based Representation
The concept is not really represented explicitly (except via the examples themselves). Training examples are merely stored (kind of like "rote learning"). Answers are given by finding the training example(s) most similar to the test instance, at testing time. This has been called "lazy learning" – no work is done until an answer is needed.
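
A minimal sketch of the "lazy" idea (my own illustration, not from the book): training merely stores the examples, and all of the work happens when an answer is requested. The similarity function is left as a parameter – the following slides show how it can be computed.

    class LazyLearner:
        def __init__(self, similarity):
            self.similarity = similarity   # how "most similar" is judged
            self.instances = []            # nothing is built up front

        def train(self, examples):
            # "Rote learning": simply remember the (attributes, answer) pairs
            self.instances.extend(examples)

        def predict(self, test):
            # All of the work happens at testing time: answer with the class of
            # the single most similar stored example
            best = max(self.instances, key=lambda inst: self.similarity(test, inst[0]))
            return best[1]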

3 Instance Based – Finding Most Similar Example
Nearest neighbor – each new instance is compared to all stored instances, with a "distance" calculated for each attribute of each instance. The class of the nearest-neighbor instance is used as the prediction <see next slide and come back>. The per-attribute distances are combined – city block or Euclidean ("as the crow flies"). Higher powers increase the influence of large differences … see two slides forward.
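
A small sketch of the two combinations mentioned above, plus the general Minkowski form that shows how higher powers increase the influence of large per-attribute differences (attribute values are assumed to be numeric and already normalized):

    def city_block(a, b):
        # Sum of absolute per-attribute differences ("Manhattan" distance)
        return sum(abs(x - y) for x, y in zip(a, b))

    def euclidean(a, b):
        # Square root of the sum of squared per-attribute differences
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    def minkowski(a, b, p):
        # Larger p lets big differences on a single attribute dominate
        return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)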

4 Nearest Neighbor
[Figure: a two-dimensional scatter of training instances labeled x, y, and z, with the test instance T plotted among them.]
When making a prediction, find the K most similar previous examples or cases (relative to the current one) and use those examples' "answers" to make the new prediction. The picture illustrates finding the most similar instances in 2D – two attributes – which translates easily to any number of attributes (though 120 dimensions are hard to draw). The prediction can be a vote of the k nearest neighbors, a weighted vote, or (if the prediction is numeric) the average or weighted average of the k nearest. Adaptation could be used to make up for differences between the case being predicted and the nearest neighbors. This is what I do!!
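
A sketch of finding the K most similar stored examples and voting, reusing the euclidean function from the sketch above (representing instances as (attribute_vector, class_label) pairs is my assumption):

    from collections import Counter

    def k_nearest(test, instances, k, dist=euclidean):
        # Rank all stored instances by distance to the test instance
        ranked = sorted(instances, key=lambda inst: dist(test, inst[0]))
        return ranked[:k]

    def predict_class(test, instances, k):
        # Unweighted vote of the k nearest neighbors
        votes = Counter(label for _, label in k_nearest(test, instances, k))
        return votes.most_common(1)[0][0]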

5 Example Distance Metrics
Attributes        A    B    C    Sum
Test              5    5    5
Train 1           6    4    9
Train 2           7    3    3
Train 3           5    5    10
City Block 1      1    1    4    6
City Block 2      2    2    2    6
City Block 3      0    0    5    5
Euclidean 1       1    1    16   18
Euclidean 2       4    4    4    12
Euclidean 3       0    0    25   25
The City Block rows show the per-attribute absolute differences between Test and each Train instance; the Euclidean rows show the squared differences. The Euclidean sums are left un-square-rooted, which does not change which training instance is nearest.
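
A quick check of the sums in the table, reusing city_block from the earlier sketch (attribute values as reconstructed in the table above; the Euclidean figures are sums of squared differences, square root omitted):

    test = (5, 5, 5)
    training = [("Train 1", (6, 4, 9)), ("Train 2", (7, 3, 3)), ("Train 3", (5, 5, 10))]
    for name, attrs in training:
        squared = sum((x - y) ** 2 for x, y in zip(test, attrs))
        print(name, city_block(test, attrs), squared)   # 6 18, then 6 12, then 5 25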

6 More Similarity/Distance
Normalization is necessary – as discussed in Chapter 2; the distance/similarity function is simpler if the data is normalized in advance (e.g. a $10 difference in household income is not significant, while a 1.0 difference in GPA is big). Nominal attributes are frequently treated as all or nothing – a complete match or no match at all: match → similarity = highest possible value, or distance = 0; no match → similarity = 0, or distance = highest possible value. Nominals that are actually ordered ought to be treated differently (e.g. with partial matches) – "mild" should be a better match to "hot" than "cool" is!
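
A sketch of a per-attribute distance that follows these conventions – numeric attributes assumed already normalized to the 0–1 range, nominal attributes matched all-or-nothing (the partial-match refinement for ordered nominals is left out):

    def attribute_distance(x, y):
        # Numeric attributes: absolute difference on the normalized 0-1 scale
        if isinstance(x, (int, float)) and isinstance(y, (int, float)):
            return abs(x - y)
        # Nominal attributes: all or nothing
        return 0.0 if x == y else 1.0

    def instance_distance(a, b):
        # City-block combination of the per-attribute distances
        return sum(attribute_distance(x, y) for x, y in zip(a, b))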

7 Missing Values Frequently treated as the maximum possible distance to ANY other value. For numerics, the maximum distance depends on the value being compared to. E.g. if values range from 0 to 1 and a missing value is compared to .9, the maximal possible distance is .9; if compared to .3, the maximal possible distance is .7; if compared to .5, the maximal possible distance is .5.
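
A sketch of that rule for one numeric attribute on a 0–1 scale (using None to stand for a missing value is my assumption):

    def numeric_distance_with_missing(x, y):
        # Both missing: assume the worst case, the full range
        if x is None and y is None:
            return 1.0
        # One missing: the largest distance the missing value could possibly have
        if x is None:
            return max(y, 1.0 - y)
        if y is None:
            return max(x, 1.0 - x)
        return abs(x - y)

    # Matches the slide: distance to .9 is .9, to .3 is .7, to .5 is .5
    print(numeric_distance_with_missing(None, 0.9),
          numeric_distance_with_missing(None, 0.3),
          numeric_distance_with_missing(None, 0.5))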

8 Dealing with Noise Noise is something that makes a task harder (e.g. real noise makes listening/hearing harder; noise in data transmission makes communication more difficult). Noise in learning means incorrect values for attributes, including the class, or an unrepresentative instance. In instance-based learning, one approach to dealing with noise is to use a greater number of neighbors, so that the prediction is not led astray by a single incorrect or unusual example.

9 K-nearest neighbor Can combine "opinions" by having the K nearest neighbors vote for the prediction to make. Or, more sophisticated, a weighted k-vote: an instance's vote is weighted by how close it is to the test instance – a closer neighbor is weighted more heavily than a farther one. WEKA allows you to choose the distance weighting as 1 – dist or 1 / dist.
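
A sketch of the weighted vote with both weighting schemes mentioned above (the small epsilon that guards 1 / dist against a zero distance is my addition):

    def weighted_vote(neighbors, scheme="1-dist"):
        # neighbors is a list of (distance, class_label) pairs for the k nearest
        totals = {}
        for dist, label in neighbors:
            if scheme == "1-dist":
                weight = 1.0 - dist            # assumes distances fall in the 0-1 range
            else:
                weight = 1.0 / (dist + 1e-9)   # "1/dist"
            totals[label] = totals.get(label, 0.0) + weight
        return max(totals, key=totals.get)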

10 Effect of Distance Weighting Scheme
Distance        .1    .2    .3    .4    .5    .6    .7    .8    .9
Vote, 1 – dist  .9    .8    .7    .6    .5    .4    .3    .2    .1
Vote, 1 / dist  10    5     3.3   2.5   2     1.7   1.4   1.2   1.1
1 – dist is smoother; 1 / dist gives a lot more credit to instances that are very close.

11 Let's try WEKA Experiment with K and weighting on Basketball (discretize).

12 K-nearest, Numeric Prediction
Take the average of the values of the k nearest neighbors, OR a weighted average of the k nearest, weighted by distance.
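
A sketch of the numeric case, here using the 1 / dist weighting (the epsilon guard is again my addition; dropping the weights gives the plain average):

    def predict_numeric(neighbors):
        # neighbors is a list of (distance, numeric_value) pairs for the k nearest
        weights = [1.0 / (dist + 1e-9) for dist, _ in neighbors]
        values = [value for _, value in neighbors]
        return sum(w * v for w, v in zip(weights, values)) / sum(weights)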

13 Weighted Similarity / Distance
The distance/similarity function should weight different attributes differently – the key task is determining those weights. The next slide sketches a general wrapper approach (see Chapter 6, pp. 195-196).
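
A weighted city-block distance of the kind being described; where the per-attribute weights come from is the subject of the next slide:

    def weighted_distance(a, b, weights):
        # Each attribute's contribution is scaled by its learned weight
        return sum(w * abs(x - y) for w, x, y in zip(weights, a, b))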

14 Learning weights Divide the training data into training and validation (a sort of pre-test) data. Until it is time to stop: loop through the validation data; predict, and see whether the prediction succeeds or not; compare the validation instance to the training instances used to predict; attributes that lead to a correct prediction have their weights increased, and attributes that lead to an incorrect prediction have their weights decreased; re-normalize the weights to avoid any chance of overflow. The time to stop varies with the scheme. The amount of weight adjustment may depend on the size of the distance differences between the validation and training instances, or on the amount of error if the prediction is numeric.
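
A rough sketch of that loop under simplifying assumptions of my own (1-nearest-neighbor prediction with the weighted_distance above, a fixed adjustment step, a fixed number of passes as the stopping rule, and "attribute led to the prediction" judged by whether its values are close); the book's scheme on pp. 195-196 differs in its details:

    def learn_weights(train, validation, step=0.05, passes=10):
        # train and validation are lists of (attribute_vector, class_label) pairs,
        # with numeric attributes normalized to the 0-1 range
        n_attrs = len(train[0][0])
        weights = [1.0 / n_attrs] * n_attrs
        for _ in range(passes):
            for v_attrs, v_label in validation:
                # Predict with the nearest training instance under the current weights
                nearest_attrs, nearest_label = min(
                    train, key=lambda t: weighted_distance(v_attrs, t[0], weights))
                correct = (nearest_label == v_label)
                for i, (x, y) in enumerate(zip(v_attrs, nearest_attrs)):
                    close = abs(x - y) < 0.5
                    # Increase weights on attributes that helped (matched on a correct
                    # prediction, or differed on an incorrect one); decrease the rest
                    if (correct and close) or (not correct and not close):
                        weights[i] += step
                    else:
                        weights[i] = max(0.0, weights[i] - step)
                # Re-normalize so the weights cannot grow without bound
                total = sum(weights)
                if total > 0:
                    weights = [w / total for w in weights]
        return weights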

15 Learning re: Instances
It may not be necessary to save all instances – very typical instances may not all need to be saved. One strategy – classify during training, and only keep instances that are misclassified. Problem – this will accumulate noisy or idiosyncratic examples. More sophisticated – keep records of how often examples lead to correct and incorrect predictions and discard those that have poor performance (details on Aha's method, pp. 194-195). An in-between strategy – weight instances based on their previous success or failure (which I am experimenting with). Some approaches actually do some generalization.
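
A sketch of the first strategy only (classify during training and keep an instance only when the instances kept so far misclassify it), reusing predict_class from the earlier sketch; this is the simple variant, not Aha's record-keeping method:

    def build_case_base(training, k=1):
        case_base = []
        for attrs, label in training:
            # Keep the instance only if the current case base gets it wrong
            if not case_base or predict_class(attrs, case_base, k) != label:
                case_base.append((attrs, label))
        return case_base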

16 Class Exercise Let's run WEKA IBk on japanbank with K=3.
Pretty good – 85% correct in 10-fold cross validation

17 End Section 4.7

