Instance Based Learning

Presentation transcript:

Instance Based Learning (Machine Learning 21431)

Outline
- K-Nearest Neighbor
- Locally weighted learning
- Local linear models
- Radial basis functions

Literature & Software
- T. Mitchell, "Machine Learning", chapter 8, "Instance-Based Learning"
- "Locally Weighted Learning", Christopher Atkeson, Andrew Moore, Stefan Schaal, ftp:/ftp.cc.gatech.edu/pub/people/cga/air.html
- R. Duda et al., "Pattern Recognition", chapter 4, "Non-Parametric Techniques"
- Netlab toolbox: k-nearest neighbor classification, radial basis function networks

When to Consider Nearest Neighbors
- Instances map to points in R^N
- Less than 20 attributes per instance
- Lots of training data
Advantages:
- Training is very fast
- Can learn complex target functions
- Does not lose information
Disadvantages:
- Slow at query time
- Easily fooled by irrelevant attributes

Instance Based Learning
Key idea: just store all training examples <x_i, f(x_i)>.
Nearest neighbor: given a query instance x_q, first locate the nearest training example x_n, then estimate f*(x_q) = f(x_n).
K-nearest neighbor:
- Given x_q, take a vote among its k nearest neighbors (if the target function is discrete-valued)
- Take the mean of the f values of the k nearest neighbors (if real-valued): f*(x_q) = Σ_{i=1..k} f(x_i) / k
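As an illustration of the slide above, here is a minimal k-nearest-neighbor sketch in Python/NumPy (not from the original slides); the function name and the use of Euclidean distance are assumptions.

```python
import numpy as np

def knn_predict(X_train, y_train, x_q, k=3, classification=True):
    """k-NN prediction for a single query point x_q (assumes Euclidean distance)."""
    dists = np.linalg.norm(X_train - x_q, axis=1)    # distance to every stored example
    nearest = np.argsort(dists)[:k]                  # indices of the k nearest neighbors
    if classification:
        # discrete-valued target: majority vote among the k neighbors
        values, counts = np.unique(y_train[nearest], return_counts=True)
        return values[np.argmax(counts)]
    # real-valued target: mean of the neighbors' f values
    return y_train[nearest].mean()
```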

Voronoi Diagram [Figure: query point q_f and its nearest neighbor q_i]

3-Nearest Neighbors [Figure: query point q_f and its 3 nearest neighbors: 2 x, 1 o]

7-Nearest Neighbors [Figure: query point q_f and its 7 nearest neighbors: 3 x, 4 o]

Behavior in the Limit
Consider p(x), the probability that instance x is labeled positive (1) versus negative (0).
Nearest neighbor: as the number of training instances → ∞, approaches the Gibbs algorithm.
Gibbs algorithm: with probability p(x) predict 1, else 0.
K-nearest neighbors: as the number of training instances → ∞ and k grows large, approaches the Bayes optimal classifier.
Bayes optimal: if p(x) > 0.5 predict 1, else 0.

Nearest Neighbor (continuous) [three figure slides]

Locally Weighted Regression
- Forms an explicit approximation f*(x) for the region surrounding the query point x_q
- Fit a linear function to the k nearest neighbors, or fit a quadratic function
- Produces a piecewise approximation of f
- Squared error over the k nearest neighbors: E(x_q) = Σ_{x_i ∈ k nearest neighbors} (f*(x_i) - f(x_i))^2
- Distance-weighted error over all neighbors: E(x_q) = Σ_i (f*(x_i) - f(x_i))^2 K(d(x_i, x_q))

Locally Weighted Regression
- Regression means approximating a real-valued target function
- The residual is the error f*(x) - f(x) in approximating the target function
- The kernel function is the function of distance used to determine the weight of each training example; in other words, the kernel function is the function K such that w_i = K(d(x_i, x_q))

Distance Weighted k-NN
Give more weight to neighbors closer to the query point:
f*(x_q) = Σ_{i=1..k} w_i f(x_i) / Σ_{i=1..k} w_i
where w_i = K(d(x_q, x_i)) and d(x_q, x_i) is the distance between x_q and x_i.
Instead of only the k nearest neighbors, all training examples can be used (Shepard's method).
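A minimal sketch of the distance-weighted prediction above, in Python/NumPy; the inverse-squared-distance kernel and the function name are assumptions, not from the slides.

```python
import numpy as np

def distance_weighted_knn(X_train, y_train, x_q, k=5, eps=1e-12):
    """Distance-weighted k-NN regression: f*(x_q) = sum_i w_i f(x_i) / sum_i w_i."""
    dists = np.linalg.norm(X_train - x_q, axis=1)
    nearest = np.argsort(dists)[:k]
    # inverse-squared-distance kernel w_i = 1 / d(x_q, x_i)^2 (eps avoids division by zero)
    w = 1.0 / (dists[nearest] ** 2 + eps)
    return np.sum(w * y_train[nearest]) / np.sum(w)
```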

Distance Weighted Average
Weighting the data: f*(x_q) = Σ_i f(x_i) K(d(x_i, x_q)) / Σ_i K(d(x_i, x_q))
The relevance of a data point (x_i, f(x_i)) is measured by the distance d(x_i, x_q) between the query x_q and the input vector x_i.
Weighting the error criterion: E(x_q) = Σ_i (f*(x_q) - f(x_i))^2 K(d(x_i, x_q))
The best estimate f*(x_q) minimizes the cost E(x_q), therefore ∂E(x_q)/∂f*(x_q) = 0.

Kernel Functions

Distance Weighted NN: K(d(x_q, x_i)) = 1 / d(x_q, x_i)^2

Distance Weighted NN: K(d(x_q, x_i)) = 1 / (d_0 + d(x_q, x_i))^2

Distance Weighted NN: K(d(x_q, x_i)) = exp(-(d(x_q, x_i) / σ_0)^2)
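For concreteness, the three kernels on the preceding slides could be written as below (a sketch; the function names and parameter defaults are assumptions).

```python
import numpy as np

def kernel_inverse_square(d):
    """K(d) = 1 / d^2"""
    return 1.0 / d**2

def kernel_shifted_inverse_square(d, d0=0.1):
    """K(d) = 1 / (d0 + d)^2, which stays finite at d = 0"""
    return 1.0 / (d0 + d)**2

def kernel_gaussian(d, sigma0=1.0):
    """K(d) = exp(-(d / sigma0)^2)"""
    return np.exp(-(d / sigma0)**2)
```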

Example: Mexican Hat, f(x1, x2) = sin(x1) sin(x2) / (x1 x2) [Figure: approximation]

Example: Mexican Hat [Figure: residual]

Locally Weighted Linear Regression
- Local linear function: f*(x) = w_0 + Σ_n w_n x_n
- Error criterion: E = Σ_i (w_0 + Σ_n w_n x_{i,n} - f(x_i))^2 K(d(x_i, x_q))
- Gradient descent: Δw_n = η Σ_i (f(x_i) - f*(x_i)) x_{i,n} K(d(x_i, x_q))
- Least squares solution: w = ((WX)^T WX)^{-1} (WX)^T W f(X), where W is the diagonal matrix with W_ii = K(d(x_i, x_q))^{1/2}, X is the NxM matrix whose rows are the x_i, and f(X) is the vector whose i-th element is f(x_i)
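A sketch of locally weighted linear regression via weighted least squares, as in the slide above; Python/NumPy, with a Gaussian kernel and a function name that are assumptions rather than part of the slides.

```python
import numpy as np

def lwlr_predict(X_train, y_train, x_q, sigma=1.0):
    """Locally weighted linear regression: fit w near x_q, then evaluate f*(x_q)."""
    dists = np.linalg.norm(X_train - x_q, axis=1)
    K = np.exp(-(dists / sigma) ** 2)                        # kernel weights K(d(x_i, x_q))
    Xb = np.hstack([np.ones((len(X_train), 1)), X_train])    # prepend bias column for w_0
    W = np.sqrt(K)[:, None]
    # solve the weighted least squares problem min_w || W (Xb w - y) ||^2
    w, *_ = np.linalg.lstsq(W * Xb, np.sqrt(K) * y_train, rcond=None)
    return np.array([1.0, *x_q]) @ w                         # evaluate the local model at x_q
```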

Curse of Dimensionality
- Imagine instances described by 20 attributes, but only 2 are relevant to the target function
- Curse of dimensionality: nearest neighbor is easily misled when the instance space is high-dimensional
- One approach: stretch the j-th axis by weight z_j, where z_1, ..., z_n are chosen to minimize prediction error
- Use cross-validation to automatically choose the weights z_1, ..., z_n
- Note: setting z_j to zero eliminates this dimension altogether (feature subset selection)

Linear Global Models
The model is linear in the parameters b_k, which can be estimated with a least squares algorithm:
f^(x_i) = Σ_{k=1..D} b_k x_{ik}, or in matrix form F(X) = X b
where x_i = (x_1, ..., x_D)_i, i = 1..N, with D the input dimension and N the number of data points.
Estimate the b_k by minimizing the error criterion E = Σ_{i=1..N} (f^(x_i) - y_i)^2:
(X^T X) b = X^T y
b = (X^T X)^{-1} X^T y
or element-wise: b_k = Σ_{m=1..D} [(X^T X)^{-1}]_{km} Σ_{n=1..N} x_{nm} y_n
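A short sketch of the global least squares fit described above (Python/NumPy; the synthetic data and names are illustrative assumptions).

```python
import numpy as np

def fit_global_linear(X, y):
    """Solve the normal equations b = (X^T X)^{-1} X^T y (via lstsq for numerical stability)."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b

# illustrative usage on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
b = fit_global_linear(X, y)   # should be close to [1.5, -2.0, 0.5]
```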

Linear Regression Example

Linear Local Models
Estimate the parameters b_k such that they locally (near the query point x_q) match the training data, either by weighting the data:
w_i = K(d(x_i, x_q))^{1/2}, and transforming z_i = w_i x_i, v_i = w_i y_i
or by weighting the error criterion:
E = Σ_{i=1..N} (x_i^T b - y_i)^2 K(d(x_i, x_q))
which is still linear in b, with least squares solution b = ((WX)^T WX)^{-1} (WX)^T W y, where W = diag(w_1, ..., w_N).

Linear Local Model Example [Figure: kernel K(x, x_q) around the query point x_q = 0.35; local linear model f^(x) = b_1 x + b_0, giving f^(x_q) = 0.266]

Linear Local Model Example

Design Issues in Local Regression
- Local model order (constant, linear, quadratic)
- Distance function d
  - feature scaling: d(x, q) = (Σ_{j=1..d} m_j (x_j - q_j)^2)^{1/2} (see the sketch below)
  - irrelevant dimensions: m_j = 0
- Kernel function K
- Smoothing parameter: bandwidth h in K(d(x, q) / h)
  - h = |m|: global bandwidth
  - h = distance to the k-th nearest neighbor point
  - h = h(q): depending on the query point
  - h = h_i: depending on the stored data points
See the paper by Atkeson et al. [1996], "Locally Weighted Learning".
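A small sketch of the scaled distance from the list above (Python/NumPy; the function name is an assumption).

```python
import numpy as np

def scaled_distance(x, q, m):
    """Scaled Euclidean distance d(x, q) = sqrt(sum_j m_j (x_j - q_j)^2);
    setting m_j = 0 removes dimension j (feature subset selection)."""
    return np.sqrt(np.sum(m * (x - q) ** 2))
```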

Radial Basis Function Network
- Global approximation to the target function in terms of a linear combination of local approximations
- Used, e.g., for image classification
- Similar to a back-propagation neural network, but the activation function is Gaussian rather than sigmoid
- Closely related to distance-weighted regression, but "eager" instead of "lazy"

Radial Basis Function Network
[Figure: network with input layer x_i, kernel functions K_n, and output f(x) with linear parameters w_n]
Kernel functions: K_n(d(x_n, x)) = exp(-d(x_n, x)^2 / (2 σ_n^2))
f(x) = w_0 + Σ_{n=1..k} w_n K_n(d(x_n, x))

Training Radial Basis Function Networks
How to choose the center x_n for each kernel function K_n?
- scatter them uniformly across the instance space
- use the distribution of the training instances (clustering)
How to train the weights? (a sketch follows this list)
- Choose the mean x_n and variance σ_n^2 for each K_n by non-linear optimization or EM
- Hold the K_n fixed and use linear regression to compute the optimal weights w_n
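A minimal sketch of the two-stage training just described: pick centers from the training data, then solve a linear least squares problem for the output weights. Python/NumPy; the class name, the shared Gaussian width, and the random choice of centers are assumptions for illustration.

```python
import numpy as np

class RBFNetwork:
    """RBF network: f(x) = w_0 + sum_n w_n exp(-||x - c_n||^2 / (2 sigma^2)).
    Centers are a random subset of the training instances; the shared width sigma
    and the fitting procedure are illustrative assumptions."""

    def __init__(self, n_centers=10, sigma=1.0, seed=0):
        self.n_centers, self.sigma, self.seed = n_centers, sigma, seed

    def _design(self, X):
        # Gaussian kernel activations plus a constant column for w_0
        d2 = ((X[:, None, :] - self.centers[None, :, :]) ** 2).sum(axis=2)
        Phi = np.exp(-d2 / (2 * self.sigma ** 2))
        return np.hstack([np.ones((len(X), 1)), Phi])

    def fit(self, X, y):
        rng = np.random.default_rng(self.seed)
        idx = rng.choice(len(X), size=self.n_centers, replace=False)
        self.centers = X[idx]                                             # centers from the training data
        self.w, *_ = np.linalg.lstsq(self._design(X), y, rcond=None)      # linear weights by least squares
        return self

    def predict(self, X):
        return self._design(X) @ self.w
```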

Radial Basis Network Example
[Figure: two kernels K_1, K_2 combined with local linear models w_1 x + w_0 and w_3 x + w_2]
K_1(d(x_1, x)) = exp(-d(x_1, x)^2 / (2 σ_1^2))
f^(x) = K_1 (w_1 x + w_0) + K_2 (w_3 x + w_2)

Lazy and Eager Learning
Lazy: wait for the query before generalizing
- k-nearest neighbors, locally weighted regression
Eager: generalize before seeing the query
- radial basis function networks, decision trees, back-propagation, LOLIMOT
An eager learner must create a single global approximation; a lazy learner can create many local approximations. If they use the same hypothesis space H, the lazy learner can effectively represent more complex functions (e.g., H = linear functions).

Lab 3
- Distance weighted average
- Cross-validation for the optimal kernel width σ
- Leave-one-out cross-validation: f*(x_q) = Σ_{i≠q} f(x_i) K(d(x_i, x_q)) / Σ_{i≠q} K(d(x_i, x_q))
- Cross-validation for feature subset selection
- Neural network
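A sketch of leave-one-out cross-validation for choosing the kernel width, matching the formula above; Python/NumPy, with the Gaussian kernel, function names, and candidate widths as assumptions.

```python
import numpy as np

def loo_error(X, y, sigma):
    """Leave-one-out squared error of the distance-weighted average with a Gaussian kernel."""
    n, err = len(X), 0.0
    for q in range(n):
        d = np.linalg.norm(X - X[q], axis=1)
        K = np.exp(-(d / sigma) ** 2)
        K[q] = 0.0                                  # exclude the held-out point (i != q)
        pred = np.sum(K * y) / np.sum(K)
        err += (pred - y[q]) ** 2
    return err / n

def select_kernel_width(X, y, candidates=(0.05, 0.1, 0.2, 0.5, 1.0)):
    """Pick the kernel width sigma with the lowest leave-one-out error."""
    return min(candidates, key=lambda s: loo_error(X, y, s))
```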