1
Instance Based Learning
Soongsil University Intelligent Systems Lab.
2
Content
Motivation: Eager Learning, Lazy Learning, Instance-Based Learning
k-Nearest Neighbour Learning (kNN)
Distance-Weighted k-NN
Locally Weighted Regression (LWR)
Radial Basis Functions (RBF)
Case-Based Reasoning (CBR)
Summary
3
Instance-based learning
One way of solving the task of approximating discrete- or real-valued target functions. We have training examples (xn, f(xn)), n = 1..N. Key idea: just store the training examples; when a test example is given, find the closest matches.
4
Motivation: Eager Learning
The learning task: approximate a target function f through a hypothesis h on the basis of training examples.
EAGER learning: as soon as the training examples and the hypothesis space are received, the search for the best hypothesis begins.
Training phase: given training examples D = {<xi, f(xi)>} and a hypothesis space H, search for the best hypothesis h in H.
Processing phase: for every new instance xq, return h(xq).
Example: radial basis function networks.
5
Motivation: Lazy Algorithms
Training examples are stored and left "sleeping". Generalisation beyond these examples is postponed until new instances must be classified. Every time a new query instance is encountered, its relationship to the previously stored examples is examined in order to compute the value of the target function for this new instance.
6
Motivation: Instance-Based Learning
Instance-based algorithms can establish a new local approximation for every new instance.
Training phase: given training examples D = {<xi, f(xi)>}.
Processing phase: given an instance xq, search for the best local hypothesis h and return h(xq).
Examples: Nearest Neighbour algorithm, Distance-Weighted Nearest Neighbour, Locally Weighted Regression, ...
7
Motivation: Instance-Based Learning
How are the instances represented? How can we measure the similarity of the instances? How can the estimate f̂(xq) be computed?
8
Content Motivation k-Nearest Neighbour Learning (kNN)
Eager Learning Lazy Learning Instance-Based Learning k-Nearest Neighbour Learning (kNN) Distance-Weighted k-NN Locally Weighted Regression (LWR) Radial Basis Functions (RBF) Case-Based Reasoning (CBR) Summary
9
10
Nearest Neighbour Algorithm
Idea: all instances correspond to points in the n-dimensional space ℝⁿ; assign the value of the nearest neighbouring instance to the new instance.
Representation: let x = <a1(x), a2(x), ..., an(x)> be an instance, where ar(x) denotes the value of the r-th attribute of instance x.
Target function: discrete-valued or real-valued. We may also write xir instead of ar(xi).
11
1-Nearest Neighbor. Four things make a memory-based learner:
1. A distance metric: Euclidean.
2. How many nearby neighbors to look at? One.
3. A weighting function (optional): unused.
4. How to fit with the local points? Just predict the same output as the nearest neighbor.
12
Nearest Neighbour Algorithm
How is f̂(xq) formed?
Discrete-valued target function f : ℝⁿ → V, where V is a set of s classes (e.g., red, black, yellow, ...): f̂(xq) := f(xn), where xn is the nearest neighbour of xq.
Continuous (real-valued) target function: let xn be the nearest neighbour of xq; then f̂(xq) := f(xn).
13
Nearest Neighbour Algorithm
1-Nearest Neighbour: given a query instance xq, first locate the nearest training example xn, then set f̂(xq) := f(xn).
k-Nearest Neighbour: first locate the k nearest training examples. If the target function is discrete-valued, take a vote among the k nearest neighbours (e.g., X, X, O, O, X, O, X, X → X). If the target function is real-valued, take the mean of the f values of the k nearest neighbours.
14
How to choose k
Averaging over k points is more reliable when there is noise in the attributes, noise in the class labels, or the classes partially overlap.
Large k: less sensitive to noise (particularly class noise); better probability estimates for discrete classes; larger training sets allow larger values of k.
Small k: captures the fine structure of the problem space better; may be necessary with small training sets.
A balance must be struck between large and small k. As the training set approaches infinity and k grows large, kNN becomes Bayes optimal: if p(x) > 0.5 then predict 1, else 0 (where p(x) is the probability that x belongs to class 1).
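As a concrete illustration of this trade-off, the sketch below picks k by leave-one-out cross-validation on a toy dataset. It is not part of the slides; the function names, the candidate values of k and the random data are assumptions for the example.

```python
# Hypothetical sketch: choosing k by leave-one-out cross-validation
# on a small numeric dataset (names and data are illustrative only).
import numpy as np

def knn_predict(X_train, y_train, x_query, k):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    values, counts = np.unique(y_train[nearest], return_counts=True)
    return values[np.argmax(counts)]

def loocv_error(X, y, k):
    """Leave-one-out error rate of k-NN on (X, y)."""
    errors = 0
    for i in range(len(X)):
        mask = np.arange(len(X)) != i                  # hold out example i
        pred = knn_predict(X[mask], y[mask], X[i], k)
        errors += int(pred != y[i])
    return errors / len(X)

# Pick the k with the lowest leave-one-out error among a few candidates.
X = np.random.rand(30, 2)                              # toy data: 30 points, 2 attributes
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)              # toy labels
best_k = min([1, 3, 5, 7], key=lambda k: loocv_error(X, y, k))
print("chosen k:", best_k)
```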
15
k-Nearest Neighbor. Four things make a memory-based learner:
1. A distance metric: Euclidean.
2. How many nearby neighbors to look at? k.
3. A weighting function (optional): unused.
4. How to fit with the local points? Just predict the average output among the k nearest neighbors.
16
k-Nearest Neighbour
Idea: if we choose k = 1, the algorithm assigns to f̂(xq) the value f(xi), where xi is the training instance nearest to xq. For larger values of k, the algorithm assigns the most common value among the k nearest training examples.
How can f̂(xq) be established? Let x1, ..., xk denote the k instances from the training examples that are nearest to xq. Then
  f̂(xq) ← argmax over v in V of Σ_{i=1..k} δ(v, f(xi)),
where δ(a, b) = 1 if a = b and δ(a, b) = 0 otherwise.
17
k-Nearest Neighbour Algorithm
Training algorithm: for each training example <x, f(x)>, add the example to the list training_examples.
Classification algorithm: given a query instance xq to be classified, let x1, ..., xk denote the k instances from training_examples that are nearest to xq, and return
  f̂(xq) ← argmax over v in V of Σ_{i=1..k} δ(v, f(xi)),
where δ(a, b) = 1 if a = b and δ(a, b) = 0 otherwise.
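A minimal Python sketch of the algorithm above (store the examples, then vote among the k nearest under Euclidean distance). The helper names and the toy examples are illustrative assumptions, not from the slides.

```python
# A minimal sketch of the k-NN algorithm described above.
# training_examples is simply a stored list of (x, f(x)) pairs.
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_classify(training_examples, x_q, k=3):
    """Return the most common f(x) value among the k nearest stored examples."""
    neighbours = sorted(training_examples, key=lambda ex: euclidean(ex[0], x_q))[:k]
    votes = Counter(f_x for _, f_x in neighbours)   # sum of delta(v, f(xi)) per class v
    return votes.most_common(1)[0][0]

# Training just stores the examples.
training_examples = [((1.0, 1.0), "X"), ((1.2, 0.8), "X"), ((5.0, 5.1), "O")]
print(knn_classify(training_examples, (1.1, 0.9), k=3))   # -> "X"
```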
18
The distance between examples
We need a measure of distance in order to know which examples are the neighbours. Assume that we have T attributes for the learning problem; then one example point x has elements xt, t = 1, ..., T. The distance between two points xi, xj is often defined as the Euclidean distance:
  d(xi, xj) = sqrt( Σ_{t=1..T} (xit − xjt)² )
Speaker notes: if we are going to try to match a test example with the training set, we will need some function that tells us which examples are closest to the new instance. k-nearest neighbour methods are mainly suitable for problems with numerical data (typically real, floating-point numbers); decision tree learners are not very good at dealing with continuous data (why is that?). Instance-based learners can learn quickly from quite small datasets. Here x represents the test instance. Ask: how would you measure the distance between examples? What would be the obvious choice?
19
Similarity and Dissimilarity Between Objects
Distances are the normally used measures of dissimilarity.
Minkowski distance (a generalization): d(xi, xj) = ( Σ_{t=1..T} |xit − xjt|^q )^(1/q).
If q = 2, d is the Euclidean distance; if q = 1, d is the Manhattan distance.
Weighted distance: each attribute t is given a weight wt inside the sum.
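A small sketch of these distance measures; the vectors and attribute weights below are purely illustrative.

```python
# Sketch of the Minkowski family of distances (q=2: Euclidean, q=1: Manhattan),
# with optional per-attribute weights.
import numpy as np

def minkowski(x_i, x_j, q=2, w=None):
    """Weighted Minkowski distance between two attribute vectors."""
    x_i, x_j = np.asarray(x_i, float), np.asarray(x_j, float)
    w = np.ones_like(x_i) if w is None else np.asarray(w, float)
    return (w * np.abs(x_i - x_j) ** q).sum() ** (1.0 / q)

a, b = [1.0, 2.0, 3.0], [4.0, 0.0, 3.0]
print(minkowski(a, b, q=2))                 # Euclidean: sqrt(9 + 4 + 0)
print(minkowski(a, b, q=1))                 # Manhattan: 3 + 2 + 0 = 5
print(minkowski(a, b, q=2, w=[1, 0.5, 2]))  # weighted Euclidean
```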
20
Voronoi Diagram
Example figure: classification of a query point by 1-NN and by 5-NN, and the Voronoi diagram of the training set.
The decision surface induced by a 1-Nearest Neighbour algorithm for a typical set of training examples: the convex polyhedron surrounding each training example indicates the region of query points whose classification will be completely determined by that training example.
21
Characteristics of Inst-b-Learning
An instance-based learner is a lazy learner and does all the work when the test example is presented. This is opposed to so-called eager learners, which build a parameterised, compact model of the target. It produces a local approximation to the target function (a different one for each test instance).
Speaker notes: many of the other machine learning algorithms we have looked at are eager; they try to learn a description of the target concept from all the previous training examples. Naturally, instance-based learners need plenty of examples to be effective. A possible drawback of instance-based learning is that most of the computation happens when the test example is presented, so efficiency in finding the examples that match the test case is very important.
22
When to consider Nearest Neighbour algorithms?
Instances map to points in ℝⁿ. Not more than, say, 20 attributes per instance. Lots of training data.
Advantages: training is very fast; can learn complex target functions; we don't lose information.
Disadvantages: ? (we will see them shortly...)
23
[Figure: eight paintings labelled one to eight; painting eight is marked "?"] More Mondrians.
Speaker notes: I hope you don't think that I am trivialising machine learning by using the Mondrian example; I chose it because it illustrates some of the algorithms very clearly. I could just as well be talking about medical diagnosis (e.g. estimated incidence of failure to progress in labour) or other problems. The question is: is painting 8 a genuine Mondrian?
24
Training data and test instance
[Table: columns Number, Lines, Line types, Rectangles, Colours, Mondrian?; recovered row fragments: 1, 6, 10, 4, No; 2, 8, 5, 3, 7, Yes; 14. The test instance is example 8.]
Speaker notes: OK, here is the data. Now let's calculate the distance between this example and all the others (worked on the overhead projector). So the distance between examples 8 and 7 is sqrt(...) = sqrt(27) = 5.2. Think a moment: does this seem sensible to you? Isn't the calculation being skewed by the large values of the rectangle data relative to the other data?
25
Distances of test instance from training data
[Table: for each training example (1 to 7), its Mondrian? label and its distance from the test instance; the distance values are not recoverable from this transcript; partially recovered labels: example 1: No, example 3: Yes.]
Classification: 1-NN → No, 3-NN → Yes, 5-NN → Yes, 7-NN → No.
Speaker notes: think a moment! Does this seem sensible to you? Isn't the calculation being skewed by the large values of the rectangle data relative to the other data?
26
Keep data in normalised form
One way to normalise the data ar(x) to a′r(x) is
  a′r(x) = ( ar(x) − mean_r ) / sd_r,
where mean_r and sd_r are the average and standard deviation of attribute r over the training data. The test instance would then have to be normalised using the training-data average and standard deviation as well.
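A minimal sketch of this normalisation step, assuming the statistics are computed on the training data only; the attribute values are illustrative placeholders.

```python
# Sketch of the normalisation above: standardise each attribute using the mean
# and standard deviation computed on the TRAINING data only.
import numpy as np

X_train = np.array([[6., 10., 4.], [8., 5., 3.]])   # illustrative attribute values
x_test  = np.array([7., 7., 14.])                   # illustrative test instance

mu    = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

X_train_norm = (X_train - mu) / sigma
x_test_norm  = (x_test - mu) / sigma                # same mu/sigma as the training data
```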
27
Normalised training data
Test instance
28
Distances of test instance from training data
[Table of distances after normalisation: example 1: 2.517 (No); example 2: 3.644; example 3: 2.395 (Yes); example 4: 3.164; example 5: 3.472; example 6: 3.808; example 7: 3.490.]
Classification before vs. after normalisation: 1-NN: No → Yes; 3-NN: Yes → Yes; 5-NN: Yes → No; 7-NN: No → No.
Speaker notes: here are the distances. Before revealing the classification table: what would a 1-nearest-neighbour classifier predict for the test instance? What would a 3-nearest-neighbour classifier predict?
29
Content Motivation k-Nearest Neighbour Learning (kNN)
Eager Learning Lazy Learning Instance-Based Learning k-Nearest Neighbour Learning (kNN) Distance-Weighted k-NN Locally Weighted Regression (LWR) Radial Basis Functions (RBF) Case-Based Reasoning (CBR) Summary
30
Difficulties with k-nearest neighbour algorithms
We have to calculate the distance of the test case from all training cases. There may be irrelevant attributes amongst the attributes (the curse of dimensionality).
Speaker notes: if there are a large number of examples this is a big computational overhead and can lead to delays in making the decision (one of the drawbacks of lazy learning). There are algorithms that try to index the examples cleverly so that one can find the k nearest cases without doing so much calculation (one of these methods uses kd-trees). Another active research area is approximate nearest-neighbour algorithms, which can be much faster but are of course less accurate. Irrelevant attributes cause a lot of problems: they lead to poor predictive power and also make the time needed to produce an answer much greater; this is the curse of dimensionality. Using statistical methods to remove irrelevant attributes is therefore important (e.g. principal-component analysis, or attribute selection by cross-validation).
31
What if the target function is real valued?
The k-nearest neighbour algorithm would just calculate the mean of the k nearest neighbours In the Mondrian case we were trying to predict a yes/no answer for each new test instance. There are many problems where we want a real number as the answer. So in that case we just compute the mean of the k-nearest neighbours (now k can be an even number).
32
Distance-Weighted KNN
The contribution of each neighbour is weighted according to its distance to the query point xq; we might want to give nearer neighbours a heavier weight, e.g.
  wi = 1 / d(xq, xi)².
To accommodate the case where the query point xq exactly matches one of the training instances xi, so that the denominator d(xq, xi)² is zero, we assign f̂(xq) := f(xi) in this case (i.e., if there is a training example that exactly matches the query point, its value is returned directly).
33
Distance-Weighted KNN
For discrete-valued target functions:
  f̂(xq) ← argmax over v in V of Σ_{i=1..k} wi δ(v, f(xi)), with wi = 1 / d(xq, xi)²,
  where V is the set of s classes (e.g., red, black, yellow, ...) and δ(a, b) = 1 if a = b, δ(a, b) = 0 otherwise.
For real-valued target functions:
  f̂(xq) ← ( Σ_{i=1..k} wi f(xi) ) / ( Σ_{i=1..k} wi ).
If all training examples are used instead of only the k nearest ones, this is known as Shepard's method.
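A minimal sketch of distance-weighted k-NN for a real-valued target using the weights wi = 1/d(xq, xi)² above; the function name and toy data are illustrative.

```python
# Sketch of distance-weighted k-NN regression with weights w_i = 1 / d(x_q, x_i)^2.
import numpy as np

def weighted_knn_regress(X, y, x_q, k=3):
    d = np.linalg.norm(X - x_q, axis=1)
    nearest = np.argsort(d)[:k]
    d_k, y_k = d[nearest], y[nearest]
    if np.any(d_k == 0):                 # query coincides with a training point
        return y_k[d_k == 0].mean()      # assign f(x_i) directly
    w = 1.0 / d_k ** 2
    return np.sum(w * y_k) / np.sum(w)   # weighted mean of the k nearest f values

X = np.array([[0., 0.], [1., 0.], [0., 1.], [3., 3.]])
y = np.array([1.0, 2.0, 2.0, 10.0])
print(weighted_knn_regress(X, y, np.array([0.1, 0.1]), k=3))
```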
34
Remarks on k-Nearest Neighbour Algorithm
PROBLEM: the measurement of the distance between two instances considers every attribute, so even irrelevant attributes can influence the approximation. EXAMPLE: n = 20, but only 2 attributes are relevant.
SOLUTION: weight each attribute differently when calculating the distance between two neighbours, i.e. stretch the relevant axes in Euclidean space: shorten the axes that correspond to less relevant attributes and lengthen the axes that correspond to more relevant attributes.
PROBLEM: how to determine automatically which weight belongs to which attribute? Cross-validation, leave-one-out (see the next lecture).
35
Remarks on k-Nearest Neighbour Algorithm 2
ADVANTAGES: the training phase is processed very fast; can learn complex target functions; robust to noisy training data; quite effective when a sufficiently large set of training data is provided.
DISADVANTAGES: the algorithm delays all processing until a new query is received, so significant computation can be required at query time (efficient memory indexing helps); processing a query is slow; sensitive to the curse of dimensionality (irrelevant attributes).
INDUCTIVE BIAS: the assumption that the classification of an instance will be most similar to the classification of other instances that are nearby in Euclidean distance.
36
Content Motivation k-Nearest Neighbour Learning (kNN)
Eager Learning Lazy Learning Instance-Based Learning k-Nearest Neighbour Learning (kNN) Distance-Weighted k-NN Locally Weighted Regression (LWR) Radial Basis Functions (RBF) Case-Based Reasoning (CBR) Summary
37
Generalizing k-nearest neighbor to continuous outputs
The version of k-nearest neighbors we have already seen works well for discrete outputs. How would we generalize this to predict continuous outputs ? Ideas?
38
Locally Weighted Regression
Local means using nearby points (i.e. a nearest-neighbours approach), based solely on the training data near the query point. Weighted means we value points based upon how far away they are from the query point. Regression means approximating a function. This is an instance-based learning method.
The idea: whenever you want to predict the value for a sample, build a local model of the function (using a linear function, a quadratic, a neural network, etc.), use the model to predict the output value, then throw the model away.
39
Locally Weighted Regression
IDEA: a generalization of the Nearest Neighbour algorithm. It constructs an explicit approximation to f over a local region surrounding xq, using nearby or distance-weighted training examples to form the local approximation.
Local: the function is approximated based solely on the training data near the query point.
Weighted: the contribution of each training example is weighted by its distance from the query point.
Regression: means approximating a real-valued target function.
40
How Locally Weighted Regression Works
Unweighted averaging using springs: the strengths of the springs are equal in the unweighted case, and the position of the horizontal line minimizes the sum of the energy stored in the springs.
Locally weighted averaging using springs: the springs are not equal; the spring constant of each spring is given by K(d(xi, q)). Note that the locally weighted average emphasizes points close to the query point, and produces an answer (the height of the horizontal line) that is closer to the height of the points near the query point than in the unweighted case.
41
Example of Locally Weighted Learning
The figure contains, in the upper graphic, the set of data points (x, y) (blue dots), the query point (green line), the local linear model (red line) and the prediction (yellow dot). The middle graphic shows the activation area of the model. The corresponding weighting kernel (receptive field) is shown in the bottom graphic.
42
How Locally Weighted Regression Works
43
Fits using different types of local models for three and five data points.
Nearest neighbor Weighted average Locally Weighted regression
44
Locally weighted linear regression
In the following: x is an instance; D = {<xi, f(xi)>} is the set of training examples; ai(x) is the value of the i-th attribute of instance x; the weights wi form our hypothesis; f is the target function; f̂ is our approximation to the target function.
45
Locally weighted linear regression
In this case we use a linear model to do the local approximation:
  f̂(x) = w0 + w1 a1(x) + ... + wn an(x).
Suppose we aim to minimize the total squared error over the training set:
  E = ½ Σ_{x ∈ D} ( f(x) − f̂(x) )².
Recall the gradient descent rule we used for checkers for this purpose:
  Δwj = η Σ_{x ∈ D} ( f(x) − f̂(x) ) aj(x),
where η is a small number (the learning rate).
46
Locally weighted linear regression
Now we adjust it to work with the present situation. Define the error for a query instance xq; possible criteria are:
  E1(xq) = ½ Σ over the k nearest neighbours of xq of ( f(x) − f̂(x) )²   (minimise the squared error over the kNN set);
  E2(xq) = ½ Σ over all x in D of ( f(x) − f̂(x) )² K(d(xq, x))   (use a kernel function K to decrease the error contribution based on distance);
  E3(xq) = ½ Σ over the k nearest neighbours of xq of ( f(x) − f̂(x) )² K(d(xq, x))   (combine both).
The new version of the gradient descent rule becomes:
  Δwj = η Σ over the k nearest neighbours of xq of K(d(xq, x)) ( f(x) − f̂(x) ) aj(x).
The third error criterion is a good approximation to the second one, and it has the advantage that the computational cost is independent of the total number of training examples.
47
Locally Weighted Linear Regression
We might approximate the target function in the neighborhood surrounding xq using a linear function, a quadratic function, a multilayer neural network, or some other functional form.
Using a linear function to approximate f:
  f̂(x) = w0 + w1 a1(x) + ... + wn an(x).
Recall the gradient descent rule from chapter 4:
  Δwj = η Σ_x ( f(x) − f̂(x) ) aj(x).
48
Locally Weighted Regression
PROCEDURE: given a new query xq, construct an approximation f̂ that fits the training examples in the neighbourhood surrounding xq. This approximation is used to calculate f̂(xq), which is output as the estimated target value for the query instance. The description of f̂ may change, because a different local approximation is calculated for each query instance.
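A minimal sketch of this procedure for locally weighted linear regression. For brevity it minimises the kernel-weighted squared error in closed form instead of by the gradient rule above; the Gaussian kernel width tau, the function name and the toy data are assumptions for the example.

```python
# Locally weighted linear regression: for each query x_q, fit a linear model that
# minimises the Gaussian-kernel-weighted squared error over the training set.
import numpy as np

def lwr_predict(X, y, x_q, tau=1.0):
    """Predict f(x_q) with a locally (kernel-) weighted linear fit."""
    A = np.hstack([np.ones((len(X), 1)), X])          # add constant attribute a0 = 1
    a_q = np.hstack([1.0, x_q])
    d2 = np.sum((X - x_q) ** 2, axis=1)
    K = np.exp(-d2 / (2 * tau ** 2))                  # kernel weight per training example
    W = np.diag(K)
    w = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)     # weighted least-squares weights
    return a_q @ w                                    # local model, used once and discarded

X = np.linspace(0, 10, 50).reshape(-1, 1)
y = np.sin(X).ravel()
print(lwr_predict(X, y, np.array([2.5]), tau=0.5))    # close to sin(2.5)
```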
49
Evaluation Locally Weighted Regression
ADVANTAGES: pointwise approximation of a complex target function; earlier local models have no influence on new ones.
DISADVANTAGES: the quality of the result depends on the choice of the local function form, the choice of the kernel function K, and the choice of the hypothesis space H; sensitivity to relevant and irrelevant attributes.
50
Content Motivation k-Nearest Neighbour Learning (kNN)
Eager Learning Lazy Learning Instance-Based Learning k-Nearest Neighbour Learning (kNN) Distance-Weighted k-NN Locally Weighted Regression (LWR) Radial Basis Functions (RBF) Case-Based Reasoning (CBR) Summary
51
Radial Basis Function (RBF) Networks
An RBF neural network has an input layer, a hidden layer and an output layer. The neurons in the hidden layer contain Gaussian transfer functions whose outputs are inversely proportional to the distance from the centre of the neuron (i.e., the farther a data point is from a neuron's centre, the less influence it has on the result).
52
In terms of method, RBF networks are similar to K-Means clustering and to PNN (Probabilistic Neural Network) / GRNN (Generalized Regression Neural Network) networks. The main difference: PNN/GRNN networks have one neuron for each point in the training file, whereas RBF networks have a variable number of neurons that is usually much smaller than the number of training points. For problems with small to medium sized training sets, PNN/GRNN are usually more accurate than RBF networks, but PNN/GRNN networks are impractical for large training sets.
53
How RBF networks work: although the implementation is very different, in terms of strategy RBF neural networks are conceptually similar to K-Nearest Neighbor (k-NN) models. The basic idea is that the predicted target value of an item is likely to be about the same as that of other items that have close values of the predictor variables.
54
Radial-Basis Function Networks
RBFs represent local receptors, as illustrated below, where each green point is a stored vector used in one RBF. In an RBF network, one hidden layer uses neurons with RBF activation functions describing local receptors; one output node is then used to linearly combine the outputs of the hidden neurons. The output for the red vector is "interpolated" using the three green vectors, where each vector gives a contribution that depends on its weight and on its distance from the red point; in the picture the three contributions have weights w1, w2 and w3.
55
In an MLP (figure)
56
MLP vs RBFN: in an RBFN (figure)
57
Radial Basis Function Network
A kind of supervised neural network. The design of the NN is viewed as a curve-fitting problem:
  Learning: find the surface in multidimensional space that best fits the training data.
  Generalization: use this multidimensional surface to interpolate the test data.
58
New model: f(x) = w1 h1(x) + w2 h2(x) + w3 h3(x), where h1(x) = 1, h2(x) = x, h3(x) = x².
59
Radial Basis Function Network
Approximate the function with a linear combination of radial basis functions:
  f(x) = Σ_{j=1..m} wj hj(x).
h(x) is usually a Gaussian function:
  hj(x) = exp( −‖x − cj‖² / rj² ),
where cj is the centre of a region and rj is the width of the receptive field.
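A minimal sketch of this forward pass: Gaussian hidden units followed by a linear output layer. The centres, radii and weights below are illustrative placeholders, not learned values.

```python
# Sketch of an RBF network forward pass: a linear combination of Gaussian basis
# functions h_j(x) = exp(-||x - c_j||^2 / r_j^2).
import numpy as np

def rbf_forward(x, centres, radii, weights, w0=0.0):
    d2 = np.sum((centres - x) ** 2, axis=1)        # squared distance to each centre
    h = np.exp(-d2 / radii ** 2)                   # hidden-layer activations
    return w0 + np.dot(weights, h)                 # linear output layer

centres = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
radii   = np.array([0.5, 0.5, 0.5])
weights = np.array([1.0, -2.0, 0.5])
print(rbf_forward(np.array([0.9, 1.1]), centres, radii, weights))
```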
60
HIDDEN NEURON MODEL
Each hidden neuron computes hj(‖x − cj‖): its output depends on the distance of the input x = (x1, ..., xm) from the centre cj. cj is called the centre of the region and σj the spread; centre and spread are the parameters of the neuron.
61
RBF ARCHITECTURE
One hidden layer with RBF activation functions, and an output layer with a linear activation function. ‖x − c‖ is the distance of x = (x1, ..., xm) from the vector c. (Figure: inputs x1, ..., xm feed the hidden RBF units, whose outputs are combined with weights w1, ..., wm1 to produce the output y.)
62
Three layers:
  Input layer: source nodes that connect the network to its environment.
  Hidden layer: hidden units provide a set of basis functions (high dimensionality).
  Output layer: a linear combination of the hidden functions.
63
RBF Network Architecture
64
Weight = RBF(distance)
The further a neuron is from the point being evaluated, the less influence it has.
65
Radial Basis Function: different types of radial basis functions could be used, but the most common is the Gaussian function h(x) = exp( −‖x − c‖² / r² ), where c is the centre and r the radius (spread).
66
Radial Basis Function: if there is more than one predictor variable, then the RBF function has as many dimensions as there are variables. (Figure: three neurons in a space with two predictor variables, X and Y; Z is the value coming out of the RBF functions.)
67
The best predicted value for the new point is found by summing the output values of the RBF functions multiplied by weights computed for each neuron.
68
The radial basis function for a neuron has a center and a radius (also called a spread).
The radius may be different for each neuron and, in RBF networks, may even be different in each dimension.
69
Training RBF Networks. The following parameters are determined by the training process:
  the number of neurons in the hidden layer;
  the coordinates of the centre of each hidden-layer RBF function;
  the radius (spread) of each RBF function in each dimension;
  the weights applied to the RBF function outputs as they are passed to the summation layer.
70
Designing an RBF network requires selection of the radial basis function width parameter and of the number of radial basis neurons.
71
Number of radial basis neurons
Chosen by the designer. The maximum number of neurons = the number of inputs; the minimum number is experimentally determined. More neurons give a more complex network but a smaller tolerance.
72
Various learning strategies
How the centers of the radial-basis functions of the network are specified. Fixed centers selected at random Self-organized selection of centers Supervised selection of centers
73
Fixed centers selected at random (1)
Learning strategy: the RBFs of the hidden units are fixed, and the locations of the centers may be chosen randomly from the training data set. We can use different values of centers and widths for each radial basis function, so experimentation with the training data is needed.
74
Fixed centers selected at random (2)
Only the output-layer weights need to be learned. Main problem: a large training set is required for a satisfactory level of performance.
75
Self-organized selection of centers (1)
The centers are chosen by means of clustering; the output weights are learned by supervised learning with the LMS (Least Mean Square) algorithm. This is hybrid learning:
  self-organized learning to estimate the centers of the RBFs in the hidden layer;
  supervised learning to estimate the linear weights of the output layer.
(In short: the centers are determined by clustering, but the output weights by supervised learning.)
76
Self-organized selection of centers (2)
k-means clustering: initialization, sampling, similarity matching, updating, continuation. A bare-bones sketch of this loop follows below.
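The sketch referenced above: a bare-bones k-means loop for choosing the centres. The function name, the number of centres and the random data are assumptions for the example.

```python
# Self-organized centre selection with a minimal k-means loop
# (initialization, assignment/similarity matching, updating, repeat).
import numpy as np

def kmeans_centres(X, m, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=m, replace=False)]   # initialization
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        assign = d.argmin(axis=1)                            # similarity matching
        for j in range(m):                                   # updating
            if np.any(assign == j):
                centres[j] = X[assign == j].mean(axis=0)
    return centres

X = np.random.rand(100, 2)
print(kmeans_centres(X, m=5))
```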
77
Supervised selection of centers
All free parameters of the network (centers, spreads and output weights) are adjusted by a supervised learning process: error-correction learning using the LMS algorithm.
78
Radial functions
Gaussian RBF: h(x) = exp( −‖x − c‖² / r² ), where c is the center and r the radius; it monotonically decreases with distance from the center.
Multiquadric RBF: h(x) = sqrt( ‖x − c‖² + r² ); it monotonically increases with distance from the center.
79
Gaussian RBF vs. multiquadric RBF (plots)
80
Least Squares Training data : {(x1, y1), (x2, y2), …, (xp, yp)}
Minimize the sum-squared error: E = Σ_{i=1..p} ( yi − f(xi) )².
81
Example: sample points (noisy) from the curve y = x: {(1, 1.1), (2, 1.8), (3, 3.1)}. Linear model: f(x) = w1 h1(x) + w2 h2(x), where h1(x) = 1, h2(x) = x. Estimate the coefficients w1, w2.
82
f(x) = x
83
New model: f(x) = w1 h1(x) + w2 h2(x) + w3 h3(x), where h1(x) = 1, h2(x) = x, h3(x) = x².
84
If the model absorbs all the noise: overfitting.
If it is too flexible, it will fit the noise; if it is too inflexible, it will miss the target.
85
The optimal weight vector
Model: f(x) = Σ_{j=1..m} wj hj(x).
Sum-squared-error cost function (to be minimized), with a weight penalty term added:
  C = Σ_{i=1..p} ( yi − f(xi) )² + Σ_{j=1..m} λj wj²,
where the λj are regularization parameters. The optimal weight vector is then
  w = ( Hᵀ H + Λ )⁻¹ Hᵀ y,
where H is the design matrix and Λ = diag(λ1, ..., λm).
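A minimal sketch of this regularized solution, assuming the optimal weights are w = (HᵀH + Λ)⁻¹Hᵀy as above; the Gaussian design matrix, the radius r and the λ values are illustrative choices.

```python
# Regularized least-squares output weights for an RBF model:
# minimise sum_i (y_i - f(x_i))^2 + sum_j lambda_j * w_j^2,
# whose minimiser is w = (H^T H + Lambda)^(-1) H^T y, with H[i, j] = h_j(x_i).
import numpy as np

def rbf_design_matrix(X, centres, r):
    d2 = np.sum((X[:, None, :] - centres[None, :, :]) ** 2, axis=2)
    return np.exp(-d2 / r ** 2)

def optimal_weights(H, y, lambdas):
    Lam = np.diag(lambdas)
    return np.linalg.solve(H.T @ H + Lam, H.T @ y)

X = np.random.rand(40, 1)
y = np.sin(2 * np.pi * X).ravel() + 0.1 * np.random.randn(40)
centres = X[::8]                                   # a few training points as centres
H = rbf_design_matrix(X, centres, r=0.3)
w = optimal_weights(H, y, lambdas=np.full(len(centres), 1e-3))
```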
89
Design matrix: H is the p × m matrix with entries Hij = hj(xi), i.e. the j-th basis function evaluated at the i-th training input.
90
Example: sample points (noisy) from the curve y = x: {(1, 1.1), (2, 1.8), (3, 3.1)}. Linear model: f(x) = w1 h1(x) + w2 h2(x), where h1(x) = 1, h2(x) = x. Estimate the coefficients w1, w2.
91
For the data {(1, 1.1), (2, 1.8), (3, 3.1)} and the model f(x) = w1 h1(x) + w2 h2(x), where h1(x) = 1 and h2(x) = x, the least-squares solution is w1 = 0, w2 = 1, i.e. f(x) = 0*1 + 1*x.
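The same result can be checked numerically with ordinary least squares; this short sketch (not from the slides) reproduces w1 = 0, w2 = 1 for the data above.

```python
# Reproducing the example above: for {(1, 1.1), (2, 1.8), (3, 3.1)} and the basis
# h1(x) = 1, h2(x) = x, ordinary least squares gives w1 = 0 and w2 = 1, i.e. f(x) = x.
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.1, 1.8, 3.1])
H = np.column_stack([np.ones_like(x), x])     # design matrix: columns h1(x)=1, h2(x)=x
w, *_ = np.linalg.lstsq(H, y, rcond=None)
print(w)                                      # -> approximately [0., 1.]
```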
92
f(x) = x
93
It should have an extra term, x².
New model: f(x) = w1 h1(x) + w2 h2(x) + w3 h3(x), where h3(x) = x².
94
New model: f(x) = w1 h1(x) + w2 h2(x) + w3 h3(x), where h1(x) = 1, h2(x) = x, h3(x) = x².
95
Radial Basis Function (RBF) Networks
Each prototype node computes a distance-based kernel function (a Gaussian is common). The prototype nodes form a hidden layer in a neural network. The top layer is trained with a simple delta rule to get the outputs, so the prototype nodes learn weightings for each class. RBF networks are thus a blend of the instance-based method and the neural network method.
96
Radial Basis Functions
Function to be learned:
  f̂(x) = w0 + Σ_{u=1..k} wu Ku( d(xu, x) ).
One common choice for Ku( d(xu, x) ) is the Gaussian:
  Ku( d(xu, x) ) = exp( −d²(xu, x) / (2σu²) ).
This is a global approximation to the target function, in terms of a linear combination of local approximations. It is related to distance-weighted regression, but "eager" instead of "lazy".
97
Radial Basis Function Networks
ai(x) are the attributes describing instance x. The first layer computes the various Ku(d(xu, x)); the second layer computes a linear combination of the first-layer unit values. A hidden unit's activation is close to 0 if x isn't near xu.
98
MLP vs RBFN: approximation
  MLP: a global network; all inputs cause an output.
  RBF: a local network; only inputs near a receptive field produce an activation; can give a "don't know" output.
99
MLP vs RBFN (comparison)
  MLP: global hyperplane; EBP (error back-propagation); local minima; smaller number of hidden neurons; shorter computation time; longer learning time.
  RBFN: local receptive field; LMS; serious local minima; larger number of hidden neurons; longer computation time; shorter learning time.
100
Content Motivation k-Nearest Neighbour Learning (kNN)
Eager Learning Lazy Learning Instance-Based Learning k-Nearest Neighbour Learning (kNN) Distance-Weighted k-NN Locally Weighted Regression (LWR) Radial Basis Functions (RBF) Case-Based Reasoning (CBR) Summary
101
Case-based reasoning (CBR)
CBR shares the first two principles of instance-based methods and locally weighted regression (it is lazy, and it classifies new queries by analysing similar instances), but the instances are represented by a richer symbolic description, and the methods used for retrieval are correspondingly more elaborate. CBR is an advanced instance-based learning method applied to more complex instance objects; objects may include complex structural descriptions of cases and adaptation rules. CBR doesn't just have a fixed database of past cases: it retains new examples once their classification is known. Like k-nearest neighbour, it matches new cases to find similar past cases; however, it doesn't use Euclidean distance measures. It tends to match using semantic nets, which allows diagrammatic matching. It is also lazy, like k-nearest neighbour. The power comes from the organisation and content of the cases themselves.
Empirical psychological studies have indicated that re-use of past cases is a major factor in human problem solving and reasoning. Partly because CBR is modelled on human problem solving, it is more understandable to human beings; this is important, as it means the CBR system (how the matching is done, how cases are adapted, etc.) can be altered by experts or in consultation with experts. CBR can also deal with very complex forms of knowledge.
102
CBR cannot use Euclidean distance measures
Must define distance measures for those complex objects instead (e.g. semantic nets). CBR tries to model human problem-solving: it uses past experience (cases) to solve new problems and retains the solutions to new problems. CBR is an ongoing area of machine learning research with many applications.
103
Applications of CBR
  Design: landscape, building, mechanical, conceptual design of aircraft sub-systems.
  Planning: repair schedules.
  Diagnosis: medical.
  Adversarial reasoning: legal.
Speaker notes: the use of CBR in medical diagnosis is a very big research area (there are already CBR systems that diagnose heart defects, hearing disease and other conditions). For adversarial reasoning, there are systems that not only find similar legal precedents but can also create arguments based on cases to support particular positions.
104
CBR process (the R4 model: Retrieve, Reuse, Revise, Retain)
[Figure: a new case is matched against the case base, which also holds knowledge and adaptation rules; the closest case is retrieved; if it matches exactly it is reused, otherwise it is adapted (revised); the suggested solution may later be retained as a new case.]
Speaker notes:
1. The new test case comes in.
2. It is matched against the case base (note that the system has more than a case base: it also has prior knowledge and adaptation rules).
3. This process is called Retrieve.
4. Does the new case match one of the existing cases perfectly, or does one of the existing cases have to be adapted?
5. So we either reuse or revise.
6. We now have a recommended solution.
7. If later facts suggest that the recommendation wasn't in agreement with the subsequent facts or situation, then the case is retained as a new case in the case base. This is the learning phase; it may also add rules to the knowledge base that the eventual facts suggest.
This model of case-based reasoning is called the R4 model.
105
CBR example: Property pricing
Test instance Put this slide on the OHP
106
How rules are generated
There is no unique way of doing it. Here is one possibility: examine cases and look for ones that are almost identical.
  From cases 1 and 2, Rule 1: if recep-rooms changes from 2 to 1, then reduce the price by £5,000.
  From cases 3 and 4, Rule 2: if Type changes from semi to terraced, then reduce the price by £7,000.
107
Matching: comparing the test instance (case 5) against the stored cases.
  matches(5, 1) = 3
  matches(5, 2) = 3 (max cost: £25,000)
  matches(5, 3) = 2
  matches(5, 4) = 1
Estimated price of case 5: £25,000.
Speaker notes: the matching items in case 1 are the number of bedrooms, the number of floors and the condition; the matching items in case 2 include the number of reception rooms. Suppose for the sake of argument that we have a rule in the rule base saying that (within reason) the number of reception rooms is more important than the condition. So we would say that case 2 was the best match and estimate the price of property 5 as £25,000.
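A minimal sketch of this kind of attribute-matching count; the attribute names and case values below are illustrative and do not reproduce the slide's table.

```python
# matches(q, c) simply counts how many attribute values the query case and a
# stored case have in common; the best-scoring case is retrieved (and may then
# be adapted with rules such as Rule 1 and Rule 2 above).
def matches(query, case):
    return sum(1 for attr in query if attr in case and query[attr] == case[attr])

case_base = {
    1: {"bedrooms": 3, "floors": 2, "recep_rooms": 1, "condition": "good"},
    2: {"bedrooms": 2, "floors": 2, "recep_rooms": 2, "condition": "poor"},
}
query = {"bedrooms": 3, "floors": 2, "recep_rooms": 2, "condition": "good"}

scores = {cid: matches(query, c) for cid, c in case_base.items()}
best = max(scores, key=scores.get)            # closest case; may then be adapted
print(scores, "best match:", best)
```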
108
Adapting: reverse rule 2.
  Reversed rule 2: if Type changes from terraced to semi, then increase the price by £7,000.
  Applying reversed rule 2, the new estimate of the price of property 5 is £32,000.
Speaker notes: adapting may involve adapting rules in the rule base and then applying them to the closest matching case. Thus we finally suggest that property 5 should fetch £32,000.
109
Learning: so far we have a new case and an estimated price; nothing has been added to the case base yet. If we later find that the house with location code 8 sold for £35,000, then the case would be added, and we could add a new rule: if location changes from 7 to 8, increase the price by £3,000.
110
Problems with CBR How should cases be represented?
How should cases be indexed for fast retrieval? How can good adaptation heuristics be developed? When should old cases be removed? These questions are still actively pursued research topics.
111
Advantages A local approximation is found for each test case
Knowledge is in a form understandable to human beings. Fast to train.
Speaker notes: unlike many other induction methods, we are not trying to learn a single hypothesis that is consistent with all the training data. The knowledge is in a form much like how human beings would think about the problem, which means it is relatively easy for experts to add to the knowledge base. Like nearest neighbour, CBR can be useful with relatively few cases.
112
Lazy and Eager Learning
Lazy: wait for the query before generalizing: kNN, locally weighted regression, CBR.
Eager: generalize before seeing the query: RBF networks.
Differences: computation time; global vs. local approximations to the target function. Even when they use the same hypothesis space H, a lazy learner can effectively represent more complex functions (e.g., consider H = linear functions).
113
Summary Differences and advantages
KNN algorithm: the most basic instance-based method. Locally weighted regression: generalization of KNN. RBF networks: blend of instance-based method and neural network method. Case-based reasoning
114
Lazy and Eager Learning
Lazy: wait for query before generalizing k-Nearest Neighbour, Case based reasoning Eager: generalize before seeing query RBF Networks, ID3, … Does it matter? Eager learner must create global approximation Lazy learner can create many local approximations
115
The End