1 Nearest Neighbors in High-Dimensional Data: The Emergence and Influence of Hubs
Miloš Radovanović¹, Alexandros Nanopoulos², Mirjana Ivanović¹
¹Department of Mathematics and Informatics, Faculty of Science, University of Novi Sad, Serbia
²Institute of Computer Science, University of Hildesheim, Germany

2 Introduction
The curse of dimensionality: distance concentration
The tendency of distances between all pairs of points in high-dimensional data to become almost equal
Affects the meaningfulness of nearest neighbors, indexing, classification, and regression [Beyer 1999, Aggarwal 2001, François 2007]
We study a related phenomenon, which concerns k-NN directed graphs
June 15, 2009, ICML'09, Miloš Radovanović

3 k-occurrences
Nk(x), the number of k-occurrences of point x, is the number of times x occurs among the k nearest neighbors of all other points in a data set
Nk(x) is the in-degree of node x in the k-NN digraph
It has been observed that the distribution of Nk can become skewed, resulting in the emergence of hubs: points with high Nk
Music retrieval [Aucouturier 2007]
Speech recognition [Doddington 1998]
Fingerprint identification [Hicklin 2005]
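The definition above can be sketched in a few lines of NumPy. This is a synthetic illustration on random data, not the authors' code; the helper name `k_occurrences` is ours:

```python
import numpy as np

def k_occurrences(X, k):
    """N_k(x): how many times each point appears among the k nearest
    neighbors of the other points, i.e. the in-degree in the k-NN
    digraph, using Euclidean distance."""
    n = len(X)
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T   # squared distances
    np.fill_diagonal(d2, np.inf)                   # no self-neighbors
    nk = np.zeros(n, dtype=int)
    for i in range(n):
        for j in np.argsort(d2[i])[:k]:
            nk[j] += 1                             # i lists j as a k-NN
    return nk

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
nk = k_occurrences(X, k=10)
print(nk.sum())   # each of the 200 points casts exactly 10 votes
```

Since every point contributes k votes, Nk always sums to n·k; skewness concerns how unevenly those votes are distributed.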

4 k-occurrences
What causes the skewness of Nk?
An artefact of the data? Are some songs more similar to others? Do some people have fingerprints or voices that are harder to distinguish from other people's?
Specifics of the modeling algorithms?
An inadequate choice of features?
Or something more general?

5–10 (figure slides; no text in transcript)

11 The Causes of Skewness
Distance concentration: the ratio between a measure of spread (e.g., standard deviation) and a measure of magnitude (e.g., expectation) of distances converges to 0 as dimensionality increases
High-dimensional data points approximately lie on a sphere centered at the data set mean [Beyer 1999, Aggarwal 2001]
The distribution of distances to the data set mean always has non-negligible variance [Demartines 1994, François 2007]
Hence points closer to the data set mean are expected to exist, even in high dimensions
Points closer to the data set mean tend to be closer to all other points (regardless of dimensionality), and this tendency is amplified by high dimensionality
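Distance concentration is easy to observe empirically. A minimal sketch, using i.i.d. uniform points in the unit cube as a stand-in for high-dimensional data:

```python
import numpy as np

rng = np.random.default_rng(0)

def concentration_ratio(d, n=300):
    """Std/mean of pairwise Euclidean distances among n i.i.d. uniform
    points in [0, 1]^d; this ratio shrinks toward 0 as d grows."""
    X = rng.random((n, d))
    sq = (X ** 2).sum(axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0.0)
    dist = np.sqrt(d2[np.triu_indices(n, k=1)])   # unique pairs only
    return dist.std() / dist.mean()

low_d, high_d = concentration_ratio(3), concentration_ratio(300)
print(low_d, high_d)   # the ratio drops sharply from d=3 to d=300
```

The spread stays roughly constant while the mean distance grows with √d, so their ratio vanishes, which is exactly the concentration effect described above.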

12 Skewness in Real Data
Important factors for real data: dependent attributes and grouping (clustering)
50 data sets from well-known repositories (UCI, Kent Ridge)
Euclidean and cosine distances, as appropriate
Measurements:
SN10 – the standardized 3rd moment of N10
The Spearman correlation between N10 and distance from the data set mean
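The SN10 measurement is the standardized third moment of the N10 distribution. A small sketch of the statistic (the toy inputs are hypothetical):

```python
import numpy as np

def standardized_skewness(nk):
    """Standardized 3rd moment E[(Nk - mu)^3] / sigma^3 of an Nk
    distribution; large positive values indicate a long right tail,
    i.e. a few strong hubs."""
    nk = np.asarray(nk, dtype=float)
    mu, sigma = nk.mean(), nk.std()
    return ((nk - mu) ** 3).mean() / sigma ** 3

print(standardized_skewness([4, 5, 6]))             # symmetric: 0
print(standardized_skewness([5, 5, 5, 5, 5, 55]))   # one hub: positive
```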

13 1. Dependent Attributes
Skewness of Nk depends on intrinsic dimensionality
dmle – the MLE estimate of intrinsic dimensionality
Over the 50 data sets: Corr(d, SN10) = 0.62, Corr(dmle, SN10) = 0.80
Shuffling the elements of each attribute raises intrinsic dimensionality to the embedding dimensionality while keeping the attribute distributions [François 2007]

14 (figure slide; no text in transcript)

15 1. Dependent Attributes
The effect of dimensionality reduction (figure)

16 2. Grouping (Clustering)
Hubs are in the proximity of cluster centers
Measurement: the Spearman correlation between N10 and distance from the closest cluster mean
K-means clustering, with the number of clusters chosen to maximize this correlation
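The relation between hubness and cluster centers can be illustrated on synthetic data (this sketch uses known cluster means rather than K-means, and a simple rank correlation that ignores tie correction):

```python
import numpy as np

def rank_corr(a, b):
    """Spearman-style rank correlation: Pearson correlation of ranks
    (ties broken arbitrarily, which is fine for an illustration)."""
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    return np.corrcoef(ra, rb)[0, 1]

rng = np.random.default_rng(0)
# two well-separated Gaussian clusters in 50 dimensions
centers = np.stack([np.zeros(50), np.full(50, 5.0)])
X = np.vstack([c + rng.standard_normal((100, 50)) for c in centers])

# N_10: in-degree in the 10-NN digraph
sq = (X ** 2).sum(axis=1)
d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
np.fill_diagonal(d2, np.inf)
nk = np.bincount(np.argsort(d2, axis=1)[:, :10].ravel(), minlength=len(X))

# distance from each point to its closest cluster mean
dc = np.linalg.norm(X[:, None, :] - centers[None], axis=2).min(axis=1)
print(rank_corr(nk, dc))   # negative: hubs sit near cluster centers
```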

17 (figure slide; no text in transcript)

18 Hubs and Outliers
In high dimensions, points with low Nk can be considered distance-based outliers
They are far away from the other points in the data set / in their cluster
Their existence is caused by high dimensionality

19 Hubs and Outliers
Same bullets as the previous slide, with an illustrating figure (k = 20)

20 Hubs and Outliers
Hubs can even be considered probabilistic outliers (figure: hubs vs. outliers)

21 Classification
Based on labels, k-occurrences can be split into:
"Bad" k-occurrences, BNk(x)
"Good" k-occurrences, GNk(x)
Nk(x) = BNk(x) + GNk(x)
"Bad" hubs can appear
How do "bad" hubs originate?
What is the influence of ("bad") hubs on classification algorithms?

22 How do "bad" hubs originate?
Measurements:
The normalized sum of all BN10 in the data set
The correlation between BN10 and N10
CAV – the Cluster Assumption Violation coefficient
Cluster Assumption (CA): most pairs of points in a cluster should be of the same class [Chapelle 2006]
CAV = a / (a + b), where
a = the number of pairs of points with different classes in the same cluster
b = the number of pairs of points with the same class and cluster
K-means clustering
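The CAV coefficient follows directly from its definition. A minimal sketch on a hypothetical labeling where one cluster is pure and the other mixes two classes:

```python
from itertools import combinations

def cav(classes, clusters):
    """Cluster Assumption Violation: a / (a + b), where a counts pairs
    in the same cluster with different classes, and b counts pairs in
    the same cluster with the same class."""
    a = b = 0
    for i, j in combinations(range(len(classes)), 2):
        if clusters[i] == clusters[j]:
            if classes[i] != classes[j]:
                a += 1
            else:
                b += 1
    return a / (a + b)

classes  = [0, 0, 0, 0, 1, 1]
clusters = [0, 0, 0, 1, 1, 1]
print(cav(classes, clusters))   # cluster 1 mixes classes: CAV = 1/3
```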

23 How do "bad" hubs originate?
Observations and answers:
High dimensionality and skewness of Nk do not automatically induce "badness": no correlation between the normalized BN10 sum and d, dmle, or SN10
"Bad" hubs originate from a combination of high dimensionality and violation of the CA:
Corr(normalized BN10 sum, CAV) = 0.85
Corr(dmle, ·) = 0.39

24 Influence on the k-NN Classifier
"Bad" hubs provide erroneous class information to many other points
We introduce standardized "bad" hubness: hB(x) = (BNk(x) – μBNk) / σBNk
During majority voting, the vote of each neighbor x is weighted by exp(–hB(x))
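The weighting scheme above can be sketched as follows. The neighbor labels, BNk values, and distribution parameters here are hypothetical, chosen so the weighting flips a plain majority vote:

```python
import numpy as np

def hubness_weighted_vote(neighbor_labels, bn_k, mu, sigma):
    """Weighted k-NN vote: each neighbor's vote is scaled by
    exp(-h_B(x)), where h_B(x) = (BN_k(x) - mu) / sigma is the
    standardized 'bad' hubness, so bad hubs count for less."""
    h_b = (bn_k - mu) / sigma
    w = np.exp(-h_b)
    classes = np.unique(neighbor_labels)
    scores = [w[neighbor_labels == c].sum() for c in classes]
    return classes[int(np.argmax(scores))]

# 5-NN vote: three label-1 neighbors are bad hubs, two label-0
# neighbors are not; plain majority says 1, the weighted vote says 0.
labels = np.array([1, 1, 1, 0, 0])
bn_k   = np.array([9.0, 8.0, 9.0, 0.0, 0.0])
print(hubness_weighted_vote(labels, bn_k, mu=2.0, sigma=3.0))
```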

25 (figure slide; no text in transcript)

26 Influence on SVMs
RBF (Gaussian) kernel: K(x, y) = exp(–γ ||x – y||²)
Nk, BNk, GNk in kernel space are exactly the same as in the original space
We progressively remove points from the training sets (10-fold CV), both in order of decreasing BNk and at random
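The invariance of Nk under the RBF kernel follows because the induced feature-space distance, ||φ(x) – φ(y)||² = 2 – 2K(x, y), is a strictly increasing function of the input-space distance, so every point's neighbor ranking is unchanged. A quick numerical check on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 8))
gamma = 0.5

# squared Euclidean distances in input space
sq = (X ** 2).sum(axis=1)
d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T

# squared distances in RBF feature space:
# ||phi(x)-phi(y)||^2 = K(x,x) + K(y,y) - 2K(x,y) = 2 - 2*exp(-gamma*d2)
d2_kernel = 2.0 - 2.0 * np.exp(-gamma * d2)

# t -> 2 - 2*exp(-gamma*t) is strictly increasing, so each row's
# neighbor ordering (hence N_k, BN_k, GN_k) is identical in both spaces
same = np.array_equal(np.argsort(d2, axis=1), np.argsort(d2_kernel, axis=1))
print(same)
```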

27 "Bad" hubs can be good support vectors (figure)

28 Influence on AdaBoost + CART
AdaBoost assigns weights to training points, to be considered by the weak learners
Weights are initially equal (1/n)
Both hubs and outliers can harm AdaBoost
Standardized hubness: h(x) = (Nk(x) – μNk) / σNk
We set the initial weight of each training point x to 1/(1 + |h(x)|), normalized by the sum over all x
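The initialization above can be sketched directly; the toy Nk values are hypothetical, with one anti-hub (Nk = 0) and one hub (Nk = 40):

```python
import numpy as np

def hubness_aware_initial_weights(nk):
    """Initial AdaBoost weights: standardized hubness
    h(x) = (N_k(x) - mu) / sigma, weight 1 / (1 + |h(x)|), then
    normalized to sum to 1, so both hubs and outliers (anti-hubs)
    start with reduced influence."""
    nk = np.asarray(nk, dtype=float)
    h = (nk - nk.mean()) / nk.std()
    w = 1.0 / (1.0 + np.abs(h))
    return w / w.sum()

nk = np.array([0, 4, 5, 5, 6, 40])   # one anti-hub, one strong hub
w = hubness_aware_initial_weights(nk)
print(w.argmax(), w.argmin())        # ordinary point up, hub down
```

Points whose Nk sits near the mean keep the largest weights; the hub receives the smallest.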

29 (figures for k = 20 and k = 40)

30 (figures for k = 20 and k = 40)

31 Clustering
Distance-based clustering objectives:
Minimize within-cluster distance
Maximize between-cluster distance
Skewness of Nk affects both objectives:
Outliers do not cluster well because of high within-cluster distance
Hubs also do not cluster well, but because of low between-cluster distance

32 Clustering
Silhouette coefficient (SC), for the i-th point:
ai = the average distance to points from its own cluster (within-cluster distance)
bi = the minimum average distance to points from the other clusters (between-cluster distance)
SCi = (bi – ai) / max(ai, bi)
SCi is in the range [–1, 1]; higher is better
The SC of a set of points is the average of SCi over every point i in the set
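The silhouette definition above translates directly into code. A minimal sketch on two well-separated one-dimensional clusters:

```python
import numpy as np

def silhouette(X, labels):
    """Per-point silhouette: a_i is the mean distance to the point's
    own cluster, b_i the smallest mean distance to any other cluster,
    SC_i = (b_i - a_i) / max(a_i, b_i), in [-1, 1]."""
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    sc = np.empty(len(X))
    for i, c in enumerate(labels):
        same = (labels == c)
        same[i] = False                        # exclude the point itself
        a = D[i, same].mean()
        b = min(D[i, labels == o].mean() for o in set(labels) - {c})
        sc[i] = (b - a) / max(a, b)
    return sc

X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
labels = np.array([0, 0, 0, 1, 1, 1])
sc = silhouette(X, labels)
print(sc.mean())   # well-separated clusters give a mean SC near 1
```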

33 (figure slide; no text in transcript)

34 Information Retrieval
Retrieving the documents most similar to a query document
Hubs harm precision
For a document x from data set D and a query q, with distance d(x, q) in [0, 1], we increase d(x, q) for every x such that h(x) > 2
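The exact adjustment formula did not survive in this transcript. As a purely illustrative stand-in (not the paper's formula), one penalty consistent with the description blends a hub's distance toward the maximum value 1, with a hypothetical shrink factor depending on h(x):

```python
def penalized_distance(d, h, threshold=2.0):
    """Illustrative hub penalty: leave non-hubs (h <= 2) alone; for
    hubs, push the distance toward 1 so they stop dominating rankings.
    The blend factor alpha = 1 / (1 + h) is a hypothetical choice.
    d: original distance in [0, 1]; h: standardized hubness h(x)."""
    if h <= threshold:
        return d
    alpha = 1.0 / (1.0 + h)
    return alpha * d + (1.0 - alpha) * 1.0

print(penalized_distance(0.3, h=1.0))   # non-hub: unchanged
print(penalized_distance(0.3, h=4.0))   # hub: distance increased
```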

35 Information Retrieval
Results for k = 10 and k = 1 (figures)
Bag-of-words representation
Cosine distance
Leave-one-out cross-validation

36 Conclusion
Skewness of Nk is an under-studied phenomenon that can have a strong impact
Future work:
A theoretical study of the impact on distance-based ML
Examining the possible impact on non-distance-based ML
Seeding iterative clustering algorithms
Outlier detection
Reverse k-NN queries
Time series
Using the skewness of Nk to estimate intrinsic dimensionality

37 Thank You

