kNN & Naïve Bayes Hongning Wang CS@UVa
Today's lecture: instance-based classifiers (k nearest neighbors, a non-parametric learning algorithm) and model-based classifiers (the Naïve Bayes classifier, a generative, parametric learning algorithm).
How to classify this document? Documents by vector space representation. [Figure: documents plotted in vector space with classes Sports, Politics, Finance and an unlabeled query document marked "?".]
Let's check the nearest neighbor. Are you confident about this?
Let's check more nearest neighbors: ask the k nearest neighbors and let them vote.
Probabilistic interpretation of kNN: approximate the Bayes decision rule in a subset of the data around the testing point. Let $V$ be the volume of the $m$-dimensional ball around $x$ containing its $k$ nearest neighbors, $N$ the total number of instances, $N_1$ the total number of instances in class 1, and $k_1$ the number of nearest neighbors from class 1. Then $p(x)V = \frac{k}{N}$, so $p(x) = \frac{k}{NV}$; likewise $p(x|y=1) = \frac{k_1}{N_1 V}$ and $p(y=1) = \frac{N_1}{N}$. With Bayes rule: $p(y=1|x) = \frac{p(x|y=1)\,p(y=1)}{p(x)} = \frac{\frac{k_1}{N_1 V} \cdot \frac{N_1}{N}}{\frac{k}{NV}} = \frac{k_1}{k}$, i.e., simply counting the nearest neighbors from class 1.
kNN is close to optimal: asymptotically, the error rate of 1-nearest-neighbor classification is less than twice the Bayes error rate. The 1NN decision boundary is the Voronoi tessellation of the training instances, a non-parametric estimation of the posterior distribution.
Components in kNN: a distance metric (Euclidean distance / cosine similarity), how many nearby neighbors to look at (k), and instance look-up (efficiently searching for nearby points); see the sketch below.
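Putting these components together, a minimal sketch of kNN prediction (the helper name knn_predict and the dense numpy representation are our illustrative assumptions, not from the lecture):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=5):
    """Label x by majority vote among its k most similar training documents."""
    # Cosine similarity between the query and every training document
    sims = (X_train @ x) / (np.linalg.norm(X_train, axis=1) * np.linalg.norm(x) + 1e-12)
    top_k = np.argsort(-sims)[:k]                # indices of the k most similar docs
    votes = Counter(y_train[i] for i in top_k)   # vote among the neighbors' labels
    return votes.most_common(1)[0][0]
```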
Effect of k: the choice of k influences the "smoothness" of the resulting classifier. [Figure: decision boundaries for k=1 and k=5; the k=1 boundary is far more jagged.]
Effect of k: large k gives a smooth decision boundary; small k gives a complicated decision boundary. [Figure: error vs. model complexity, with smaller k corresponding to higher complexity; error on the training set keeps decreasing as k shrinks, while error on the testing set first falls and then rises.]
Efficient instance look-up: recall MP1, where the Yelp_small data set has 629K reviews for training and 174K reviews for testing, with a vocabulary of 15K. The complexity of brute-force kNN is $O(NMV)$, where $N$ is the training corpus size, $M$ the testing corpus size, and $V$ the feature size.
Efficient instance look-up, exact solutions: build an inverted index for the text documents, a special mapping from each word to the list of documents containing it (see the sketch below). [Figure: dictionary terms such as "information", "retrieval", "retrieved", "is", "helpful", each pointing to a postings list of documents like Doc1, Doc2.] The speed-up is limited when the average document length is large. Alternatively, parallelize the computation with Map-Reduce: map the training/testing data onto different reducers, then merge the nearest k neighbors from the reducers.
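A sketch of how an inverted index narrows the search to candidate documents sharing at least one term with the query (the function names and set-of-terms representation are our illustrative assumptions):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: list of term sets -> mapping term -> list of doc ids (postings)."""
    index = defaultdict(list)
    for doc_id, terms in enumerate(docs):
        for term in terms:
            index[term].append(doc_id)
    return index

def candidate_docs(index, query_terms):
    """Union of postings: only these docs can have nonzero term overlap with the query."""
    cand = set()
    for term in query_terms:
        cand.update(index.get(term, []))
    return cand
```

This also shows why the speed-up is limited for long documents: longer documents mean longer postings lists, and the candidate set approaches the whole corpus.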
Efficient instance look-up, approximate solution: locality sensitive hashing, i.e., similar documents (likely) receive the same hash value h(x). Construct the hash function such that similar items map to the same "buckets" with high probability. This can be learning-based (learn the hash function from annotated examples, e.g., must-link and cannot-link constraints) or based on random projection.
Random projection approximates the cosine similarity between vectors: $h_r(x) = \mathrm{sgn}(x \cdot r)$, where $r$ is a random unit vector. Each $r$ defines one hash function, i.e., one bit in the hash value. [Figure: documents $D_x$ and $D_y$ separated by angle $\theta$, with random directions $r_1$, $r_2$, $r_3$; a bit disagrees only when the corresponding random hyperplane falls between the two vectors.]
Random projection has a provable approximation error: $P\big(h_r(x) = h_r(y)\big) = 1 - \frac{\theta(x, y)}{\pi}$.
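A numpy sketch of random-projection signatures under these definitions (the function name and the packing of bits into integer bucket ids are our illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def lsh_signatures(X, b=16):
    """X: (N, V) document matrix -> (N,) integer bucket ids from b random projections."""
    R = rng.standard_normal((X.shape[1], b))  # b random projection directions
    bits = (X @ R) >= 0                       # sgn(x . r) as one boolean bit per direction
    return bits @ (1 << np.arange(b))         # pack the b bits into a single integer
```

Two documents land in the same bucket with probability $(1 - \theta/\pi)^b$, so a larger $b$ makes buckets purer at the cost of more misses (typically handled by repeating with several hash tables).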
Effectiveness of random projection: on a data set of 1.2M images with 1000 dimensions, it yields roughly a 1000x speed-up.
Weight the nearby instances: when the data distribution is highly skewed, frequent classes might dominate the majority vote, since they occur more often among the k nearest neighbors simply because they have large volume. Solution: weight the neighbors in voting, e.g., $w(x, x_i) = \frac{1}{|x - x_i|}$ or $w(x, x_i) = \cos(x, x_i)$, as in the sketch below.
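A small sketch of similarity-weighted voting (the input format, a list of (label, similarity) pairs for the k neighbors, is our illustrative assumption):

```python
from collections import defaultdict

def weighted_vote(neighbors):
    """neighbors: list of (label, similarity) pairs for the k nearest documents."""
    score = defaultdict(float)
    for label, sim in neighbors:
        score[label] += sim          # w(x, x_i) = cos(x, x_i): closer neighbors count more
    return max(score, key=score.get)
```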
Summary of kNN: instance-based learning (no training phase; assign a label to a testing case by its nearest neighbors), non-parametric (approximates the Bayes decision boundary in a local region), and efficient computation via locality sensitive hashing with random projection.
Recall the optimal Bayes decision boundary: $f(X) = \arg\max_y P(y|X)$. [Figure: joint density $p(X, y)$ plotted as the two curves $p(X|y=0)\,p(y=0)$ and $p(X|y=1)\,p(y=1)$ over $X$; the optimal decision boundary lies where they cross, with false-negative and false-positive regions on either side.]
Estimating the optimal classifier: $f(X) = \arg\max_y P(y|X) = \arg\max_y P(X|y)\,P(y)$, the class conditional density times the class prior. With $V$ binary features, the class conditional density has $|Y| \times (2^V - 1)$ parameters and the class prior has $|Y| - 1$, so reliable estimation requires $|D| \gg |Y| \times (2^V - 1)$. [Table: example document-term matrix with $V$ binary features such as "text", "information", "identify", "mining", "mined", "is", "useful", "to", "from", "apple", "delicious" for documents D1-D3 with label Y.]
We need to simplify this. Assume features are conditionally independent given the class label: by the chain rule, $p(x_1, x_2 | y) = p(x_2 | x_1, y)\,p(x_1 | y)$, and the assumption sets $p(x_2 | x_1, y) = p(x_2 | y)$. E.g., $p(\text{'white house'}, \text{'obama'} \mid \text{political news}) = p(\text{'white house'} \mid \text{political news}) \times p(\text{'obama'} \mid \text{political news})$. This does not mean 'white house' is independent of 'obama'!
Conditional vs. marginal independence: features are not necessarily marginally independent of each other, e.g., $p(\text{'white house'} \mid \text{'obama'}) > p(\text{'white house'})$. However, once we know the class label, the features become independent of each other: knowing it is already political news, observing 'obama' tells us little about the occurrence of 'white house'.
Naïve Bayes classifier: $f(X) = \arg\max_y P(y|X) = \arg\max_y P(X|y)\,P(y)$ (by Bayes rule) $= \arg\max_y P(y) \prod_{i=1}^{|d|} P(x_i | y)$ (by the independence assumption). The class conditional density now has $|Y| \times (V - 1)$ parameters and the class prior $|Y| - 1$, versus $|Y| \times (2^V - 1)$ before: computationally feasible. [Figure: graphical model with class $y$ as the parent of feature nodes $x_1, x_2, x_3, \ldots, x_V$.]
Estimating parameters with the maximum likelihood estimator: $P(x_i | y) = \frac{\sum_d \sum_j \delta(x_{d_j} = x_i,\; y_d = y)}{\sum_d \delta(y_d = y)}$ and $P(y) = \frac{\sum_d \delta(y_d = y)}{\sum_d 1}$, where $\delta(\cdot)$ is the indicator function, i.e., count how often each feature co-occurs with each class and normalize. See the sketch below.
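A counting sketch of these estimators over binary word features (the dictionary-based representation and the function name are our illustrative assumptions):

```python
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """docs: list of word collections; labels: list of class ids -> (prior, cond)."""
    class_count = Counter(labels)                 # numerator of P(y)
    word_count = defaultdict(Counter)             # word_count[y][w]: docs of class y containing w
    for words, y in zip(docs, labels):
        word_count[y].update(set(words))          # binary features: count each word once per doc
    prior = {y: n / len(docs) for y, n in class_count.items()}
    cond = {y: {w: c / class_count[y] for w, c in wc.items()}
            for y, wc in word_count.items()}
    return prior, cond
```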
Enhancing Naïve Bayes for text classification I: the frequency of words in a document matters, so use $P(X|y) = \prod_{i=1}^{|d|} P(x_i | y)^{c(x_i, d)}$. In log space, $f(X) = \arg\max_y \log P(y|X) = \arg\max_y \big[\log P(y) + \sum_{i=1}^{|d|} c(x_i, d) \log P(x_i | y)\big]$, with class bias $\log P(y)$, feature vector $c(x_i, d)$, and model parameters $\log P(x_i | y)$. Essentially, this estimates $|Y|$ different language models!
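A log-space scoring sketch of this decision rule (built on the hypothetical train_nb output above; unseen words are skipped here, anticipating the smoothing discussion below):

```python
import math

def nb_log_score(doc_counts, prior_y, cond_y):
    """doc_counts: {word: c(x_i, d)}; prior_y: P(y); cond_y: {word: P(word|y)}."""
    score = math.log(prior_y)                  # class bias log P(y)
    for word, count in doc_counts.items():
        if word in cond_y:                     # unseen words need smoothing (see below)
            score += count * math.log(cond_y[word])
    return score
```

The predicted class is then the $y$ maximizing this score.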
Enhancing Naïve Bayes for text classification, binary case: $f(X) = \mathrm{sgn}\big(\log \frac{P(y=1|X)}{P(y=0|X)}\big) = \mathrm{sgn}\big(\log \frac{P(y=1)}{P(y=0)} + \sum_{i=1}^{|d|} c(x_i, d) \log \frac{P(x_i | y=1)}{P(x_i | y=0)}\big) = \mathrm{sgn}(w^T \bar{x})$, where $w = \big(\log \frac{P(y=1)}{P(y=0)}, \log \frac{P(x_1|y=1)}{P(x_1|y=0)}, \ldots, \log \frac{P(x_v|y=1)}{P(x_v|y=0)}\big)$ and $\bar{x} = (1, c(x_1, d), \ldots, c(x_v, d))$. A linear model with vector space representation? We will come back to this topic later.
Enhancing Naïve Bayes for text classification II: usually the features are not conditionally independent, i.e., $p(X|y) \neq \prod_{i=1}^{|d|} P(x_i | y)$. Relax the conditional independence assumption with N-gram language models: $p(X|y) = \prod_{i=1}^{|d|} P(x_i \mid x_{i-1}, \ldots, x_{i-N+1}, y)$.
Enhancing Naïve Bayes for text classification III: sparse observation. If $\sum_d \sum_j \delta(x_{d_j} = x_i,\; y_d = y) = 0$, the MLE gives $p(x_i | y) = 0$; then, no matter what values the other features take, $p(x_1, \ldots, x_i, \ldots, x_V | y) = 0$. The fix is smoothing the class conditional density; all the smoothing techniques we discussed for language models are applicable here.
Maximum a posteriori estimator: add pseudo instances with priors $q(y)$ and $q(x, y)$, which can be estimated from a related corpus or manually tuned. The MAP estimator for Naïve Bayes is $P(x_i | y) = \frac{\sum_d \sum_j \delta(x_{d_j} = x_i,\; y_d = y) + M q(x_i, y)}{\sum_d \delta(y_d = y) + M q(y)}$, where $M$ is the number of pseudo instances.
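A sketch of one common special case of this estimator, additive (Laplace) smoothing of a binary feature with a uniform prior (the function name and arguments are our illustrative assumptions):

```python
def smoothed_cond(word_count_y, n_docs_y, vocab, m=1.0):
    """P(x_i = 1 | y) with m pseudo instances split evenly between x_i = 1 and x_i = 0."""
    return {w: (word_count_y.get(w, 0) + m) / (n_docs_y + 2 * m)
            for w in vocab}
```

Every word in the vocabulary now gets a nonzero conditional probability, so a single unseen word no longer zeroes out the whole document's likelihood.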
Summary of Naïve Bayes: the optimal Bayes classifier, Naïve Bayes with independence assumptions, and parameter estimation in Naïve Bayes (maximum likelihood estimator; smoothing is necessary).
Today's reading: Introduction to Information Retrieval. Chapter 13, Text classification and Naive Bayes: 13.2 Naive Bayes text classification; 13.4 Properties of Naive Bayes. Chapter 14, Vector space classification: 14.3 k nearest neighbor; 14.4 Linear versus nonlinear classifiers.