Nearest Neighbors in High-Dimensional Data – The Emergence and Influence of Hubs. Miloš Radovanović, Alexandros Nanopoulos, Mirjana Ivanović. ICML 2009.


Nearest Neighbors in High-Dimensional Data – The Emergence and Influence of Hubs. Miloš Radovanović, Alexandros Nanopoulos, Mirjana Ivanović. ICML 2009. Presented by Feng Chen.

Outline
– The Emergence of Hubs
  – Skewness in Simulated Data
  – Skewness in Real Data
– The Influence of Hubs
  – Influence on Classification
  – Influence on Clustering
– Conclusion

The Emergence of Hubs
What is a hub?
– A point that is close to many other points, e.g., a point near the center of a cluster.
k-occurrences N_k(x): a measure of "hubness"
– The number of times x occurs among the k-NNs of all other points in D.
– N_k(x) = |R_kNN(x)|
[Figure: example points p0–p5, with p0 a hub]
N_2(p0) = 4, N_2(p4) = 3, N_2(p5) = 0
|R_2NN(p0)| = |{p1, p2, p3, p4}| = 4
|R_2NN(p4)| = |{p0, p1, p3}| = 3
|R_2NN(p5)| = |{}| = 0
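For concreteness, the definition can be computed directly; below is a minimal sketch using scikit-learn, where the data distribution, sample size, and k are arbitrary illustrative choices.

```python
# A minimal sketch of computing N_k(x); the data distribution, sample
# size, and k are arbitrary choices for illustration.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
D = rng.normal(size=(1000, 50))  # 1000 points in 50 dimensions
k = 5

# Ask for k+1 neighbors because each point is its own nearest neighbor.
nn = NearestNeighbors(n_neighbors=k + 1).fit(D)
_, idx = nn.kneighbors(D)

# N_k(x): how many times each point appears in the k-NN lists of others.
N_k = np.bincount(idx[:, 1:].ravel(), minlength=len(D))
print("max N_k:", N_k.max(), "mean N_k:", N_k.mean())  # mean is always k
```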

The Emergence of Hubs
Main Result: As dimensionality increases, the distribution of k-occurrences becomes considerably skewed, and hub points (points with very high k-occurrences) emerge.
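This trend is easy to check on simulated data; the sketch below, assuming i.i.d. uniform points, estimates the sample skewness of N_k as d grows.

```python
# A sketch of the skewness trend on simulated data, assuming i.i.d.
# uniform points; skew() is the sample skewness from SciPy.
import numpy as np
from scipy.stats import skew
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
n, k = 2000, 5
for d in (3, 20, 100):
    X = rng.uniform(size=(n, d))
    idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)[1]
    N_k = np.bincount(idx[:, 1:].ravel(), minlength=n)
    print(f"d={d:3d}  skewness of N_k = {skew(N_k):.2f}")
```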

The Emergence of Hubs
Skewness in Real Data

The Causes of Skewness
Existing theoretical results show that high-dimensional points lie approximately on a hypersphere centered at the data-set mean. Further results show that the distribution of distances to the data-set mean has a non-negligible variance for any finite d. This implies that a non-negligible number of points are closer to the data-set mean, and therefore closer on average to all other points, a tendency that is amplified by high dimensionality.
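A small simulation, assuming i.i.d. standard normal data, illustrates the point: the mean distance to the data-set mean grows with d while its absolute spread stays roughly constant, so the variance remains non-negligible at any finite d.

```python
# A sketch of the geometric intuition, assuming i.i.d. standard normal
# data: the mean distance to the data-set mean grows with d, while its
# absolute standard deviation stays roughly constant, i.e. non-negligible.
import numpy as np

rng = np.random.default_rng(0)
for d in (3, 20, 100):
    X = rng.normal(size=(5000, d))
    dist = np.linalg.norm(X - X.mean(axis=0), axis=1)
    print(f"d={d:3d}  mean dist={dist.mean():6.2f}  std={dist.std():.2f}")
```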


Outline
– The Emergence of Hubs
  – Skewness in Simulated Data
  – Skewness in Real Data
– The Influence of Hubs
  – Influence on Classification
  – Influence on Clustering
– Conclusion

Influence of Hubs on Classification
"Good" Hubs vs. "Bad" Hubs
– A hub x is "good" if most of the points that have x among their first k-NNs share x's class label.
– Otherwise, x is called a bad hub.

Influence of Hubs on Classification
Consider the impact of hubs on the kNN classifier.
– Revision: increase the weight of good hubs and reduce the weight of bad hubs.
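A minimal sketch of this weighting idea follows: each training point's vote is scaled by exp(-h_b(x)), with h_b(x) its standardized bad k-occurrence, in the spirit of the scheme above. The function names and setup are illustrative, not the authors' implementation.

```python
# A minimal sketch of hubness-aware vote weighting, along the lines
# described here: each training point's vote is scaled by exp(-h_b(x)),
# where h_b(x) is its standardized bad k-occurrence. Function names and
# the setup are illustrative, not the paper's code.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def bad_hubness_weights(X, y, k):
    idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)[1][:, 1:]
    bad = np.zeros(len(X))
    for i, neighbors in enumerate(idx):
        for j in neighbors:
            if y[j] != y[i]:       # x_j is a "bad" k-occurrence
                bad[j] += 1
    h_b = (bad - bad.mean()) / (bad.std() + 1e-12)  # standardize
    return np.exp(-h_b)

def weighted_knn_predict(Xtr, ytr, w, Xte, k, n_classes):
    idx = NearestNeighbors(n_neighbors=k).fit(Xtr).kneighbors(Xte)[1]
    votes = np.zeros((len(Xte), n_classes))
    for row, neighbors in enumerate(idx):
        for j in neighbors:
            votes[row, ytr[j]] += w[j]   # weighted, not uniform, vote
    return votes.argmax(axis=1)
```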

Influence of Hubs on Clustering
Defining SC, the silhouette coefficient, a metric for the quality of clusters.

Influence of Hubs on Clustering
– Outliers affect intra-cluster distance: it increases.
– Hubs affect inter-cluster distance: it decreases.
[Figure: an outlier point vs. a hub point]

Influence of Hubs on Clustering
Relative SC: the ratio of the SC of hubs (or outliers) to the SC of normal points.
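The sketch below, assuming k-means clusters and treating the top 10% of points by N_k(x) as hubs (both illustrative choices), compares per-point silhouette values of hubs against the remaining points.

```python
# A sketch of the comparison behind "relative SC", assuming k-means
# clusters and taking the top 10% of points by N_k(x) as hubs; both
# choices are illustrative. silhouette_samples is from scikit-learn.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 100))
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X)

k = 5
idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)[1]
N_k = np.bincount(idx[:, 1:].ravel(), minlength=len(X))

sc = silhouette_samples(X, labels)      # per-point silhouette values
hubs = N_k >= np.quantile(N_k, 0.9)     # top 10% by k-occurrences
print("mean SC of hubs:", sc[hubs].mean())
print("mean SC of other points:", sc[~hubs].mean())
```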

Conclusion
This is the first paper to study the pattern of hubs in high-dimensional data. Although it is only a preliminary examination, it demonstrates that the phenomenon of hubs may be significant to many fields of machine learning, data mining, and information retrieval.

Introduction
Known effects of high dimensionality:
– Distance concentration
– What else?
This paper focuses on the following issues:
– What is special about nearest neighbors in a high-dimensional space?
– What is the impact of these special characteristics on machine learning?

Outline
– Motivation
– The Skewness of k-occurrences
– Influence on Classification
– Influence on Clustering
– Conclusion

Dim ↑ ⇒ max N_k(x) ↑
– In R^1, the maximum N_1(x) = 2
– In R^2, the maximum N_1(x) = 6
– …
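A quick numerical check of the one-dimensional bound, assuming uniformly random points on a line:

```python
# A quick numerical check of the 1-D case, assuming uniformly random
# points on a line: a point can be the nearest neighbor of at most its
# two immediate neighbors, so max N_1(x) <= 2.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
x = rng.uniform(size=(10000, 1))
nn1 = NearestNeighbors(n_neighbors=2).fit(x).kneighbors(x)[1][:, 1]
print("max N_1 in R^1:", np.bincount(nn1, minlength=len(x)).max())
```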

The Skewness of k-occurrences