
Slide 1
Hubness in the Context of Feature Selection and Generation

Miloš Radovanović¹, Alexandros Nanopoulos², Mirjana Ivanović¹
¹ Department of Mathematics and Informatics, Faculty of Science, University of Novi Sad, Serbia
² Institute of Computer Science, University of Hildesheim, Germany

Slide 2
k-occurrences (N_k)
- N_k(x), the number of k-occurrences of point x, is the number of times x occurs among the k nearest neighbors of all other points in a data set
- N_k(x) is the in-degree of node x in the k-NN digraph
- It has been observed that the distribution of N_k can become skewed, resulting in the emergence of hubs: points with high N_k
  - Music retrieval [Aucouturier 2007]
  - Speech recognition [Doddington 1998]
  - Fingerprint identification [Hicklin 2005]
FGSIR'10, July 23, 2010
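In code, N_k is simply the in-degree count of a brute-force k-NN search. A minimal numpy sketch (the function name and the synthetic data are ours, not from the talk):

```python
import numpy as np

def k_occurrences(X, k):
    """N_k(x): how often each point appears among the k nearest
    neighbors of the other points, i.e. the in-degree in the
    k-NN digraph, via brute-force Euclidean distances."""
    n = len(X)
    # pairwise squared Euclidean distances
    d = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(d, np.inf)         # a point is not its own neighbor
    knn = np.argsort(d, axis=1)[:, :k]  # k nearest neighbors of each point
    return np.bincount(knn.ravel(), minlength=n)

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))      # 200 points in 50 dimensions
Nk = k_occurrences(X, 10)
# Every point casts exactly k "votes", so the mean of N_k is k;
# hubness shows up as a heavy right tail (large maximum).
print(Nk.mean(), Nk.max())
```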

Slide 3
Skewness of N_k
What causes the skewness of N_k?
- An artefact of the data? Are some songs more similar to others? Do some people have fingerprints or voices that are harder to distinguish from other people's?
- Specifics of the modeling algorithms? An inadequate choice of features?
- Something more general?

Slide 4
(figure-only slide)

Slide 5
Contributions / Outline
- Demonstrate the phenomenon: skewness in the distribution of k-occurrences
- Explain its main reasons
  - Not an artifact of the data
  - Not specifics of the models (inadequate features, etc.)
  - A new aspect of the "curse of dimensionality"
- Impact on feature selection and generation

Slide 6
Outline
- Demonstrate the phenomenon
- Explain its main reasons
- Impact on feature selection and generation
- Conclusions

Slide 7
Collection of 23 real text data sets
- S_{N_k} is the standardized 3rd moment of N_k
- If S_{N_k} = 0 there is no skew; positive (negative) values signify right (left) skew
- High skewness indicates hubness
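The statistic S_{N_k} used on this slide is the standardized third moment; a small sketch using the plain population estimator (the talk does not specify the exact estimator):

```python
import numpy as np

def skewness(x):
    """Standardized third moment: S = E[(x - mu)^3] / sigma^3."""
    x = np.asarray(x, dtype=float)
    mu, sigma = x.mean(), x.std()
    return np.mean((x - mu) ** 3) / sigma ** 3

# A symmetric sample has S = 0; a long right tail gives S > 0.
print(skewness([1, 2, 3]))        # 0.0
print(skewness([1, 1, 1, 10]))    # positive: right skew
```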

Slide 8
Collection of 14 real UCI data sets + microarray data

Slide 9
Outline
- Demonstrate the phenomenon
- Explain its main reasons
- Impact on feature selection and generation
- Conclusions

Slide 10
Where are the hubs located?
- Spearman correlation between N_10 and distance from the data-set mean
- Hubs are closer to the data center
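This measurement is easy to reproduce on synthetic data. A self-contained sketch (the spearman helper is a bare rank-correlation implementation; the data are i.i.d. Gaussian, not the talk's real data sets):

```python
import numpy as np

def spearman(a, b):
    """Spearman rho: Pearson correlation of the rank vectors
    (plain argsort ranks; adequate when ties are rare)."""
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    return np.corrcoef(ra, rb)[0, 1]

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 100))            # high-dimensional Gaussian
d = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
np.fill_diagonal(d, np.inf)
N10 = np.bincount(np.argsort(d, axis=1)[:, :10].ravel(), minlength=len(X))
dist = np.linalg.norm(X - X.mean(axis=0), axis=1)
rho = spearman(N10, dist)
# Hubs sit closer to the data mean, so the correlation is negative.
print(rho)
```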

Slide 11
Centrality and its amplification
- Hubs arise due to centrality: vectors closer to the center tend to be closer to all other vectors, and are thus more frequent k-NN
- Centrality is amplified by dimensionality: for point A closer to the center than point B, the difference ∑_x sim(A,x) − ∑_x sim(B,x) grows with dimensionality (figure)

Slide 12
Concentration of similarity
- Concentration: as dimensionality grows to infinity, the ratio between the standard deviation of pairwise similarities (distances) and their expectation shrinks to zero
  - Minkowski distances [François 2007, Beyer 1999, Aggarwal 2001]
  - Meaningfulness of nearest neighbors?
- Analytical proof for cosine similarity [Radovanović 2010]
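The concentration effect can be observed directly (a sketch on uniform random data; the dimensionalities are chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
ratios = {}
for dim in (3, 30, 300):
    X = rng.random((200, dim))                 # 200 uniform random points
    i, j = np.triu_indices(200, k=1)
    d = np.linalg.norm(X[i] - X[j], axis=1)    # all pairwise distances
    ratios[dim] = d.std() / d.mean()           # relative spread
# The std/mean ratio shrinks steadily as dimensionality grows.
print(ratios)
```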

Slide 13
The hyper-sphere view
- Most vectors are about equidistant from the center and from each other, and lie on the surface of a hyper-sphere
- Few vectors lie in the inner part of the hyper-sphere, closer to its center, and thus closer to all others
- This is expected for large but finite dimensionality, since the spread of distances √V, though small relative to the expectation E, remains non-negligible
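A quick empirical check of the hyper-sphere view on standard Gaussian data: the mean distance from the center grows like √d while its spread stays roughly constant, so points crowd into a thin, but not infinitely thin, shell:

```python
import numpy as np

rng = np.random.default_rng(0)
for dim in (10, 1000):
    norms = np.linalg.norm(rng.standard_normal((5000, dim)), axis=1)
    # Mean distance from the center grows like sqrt(dim), while the
    # standard deviation stays near 1/sqrt(2): a thin spherical shell
    # with finite, non-negligible thickness.
    print(dim, round(norms.mean(), 2), round(norms.std(), 2))
```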

Slide 14
What happens with real data?
- Real text data are usually clustered (a mixture of distributions)
- Cluster with k-means (#clusters = 3 × Cls)
- Compare with the Spearman correlation between N_10 and the distance from the data/cluster center
- Generalization of the hyper-sphere view to clusters
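The clustering step can be sketched with a plain Lloyd's k-means on synthetic clustered data (a minimal illustration; the initialization, data, and helper names are ours, not the talk's setup):

```python
import numpy as np

def kmeans(X, init_centers, iters=30):
    """Plain Lloyd's algorithm; `init_centers` are the starting centers."""
    centers = init_centers.copy()
    for _ in range(iters):
        labels = np.argmin(
            np.linalg.norm(X[:, None] - centers[None, :], axis=-1), axis=1)
        for c in range(len(centers)):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels, centers

# Three well-separated Gaussian clusters in 50 dimensions.
rng = np.random.default_rng(1)
X = np.vstack([rng.standard_normal((60, 50)) + 8 * rng.standard_normal(50)
               for _ in range(3)])
# Seed the centers with one point from each true cluster for a stable demo.
labels, centers = kmeans(X, X[[0, 60, 120]])
dist = np.linalg.norm(X - centers[labels], axis=1)  # distance to own center

# N_10 and its rank correlation with the distance to the cluster center.
d = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
np.fill_diagonal(d, np.inf)
N10 = np.bincount(np.argsort(d, axis=1)[:, :10].ravel(), minlength=len(X))
ranks = lambda a: np.argsort(np.argsort(a))
rho = np.corrcoef(ranks(N10), ranks(dist))[0, 1]
# In clustered data, hubs lie near their cluster's center: rho < 0.
print(rho)
```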

Slide 15
UCI data

Slide 16
Can dimensionality reduction help?
- Skewness persists until the intrinsic dimensionality is reached (figure)

Slide 17
UCI data

Slide 18
Outline
- Demonstrate the phenomenon
- Explain its main reasons
- Impact on feature selection and generation
- Conclusions

Slide 19
"Bad" hubs as obstinate results
- Based on information about classes, k-occurrences can be distinguished into:
  - "Bad" k-occurrences, BN_k(x)
  - "Good" k-occurrences, GN_k(x)
- N_k(x) = BN_k(x) + GN_k(x)
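The decomposition N_k = GN_k + BN_k follows directly from the labels. A minimal sketch (function name and random labels are ours, for illustration only):

```python
import numpy as np

def good_bad_occurrences(X, y, k):
    """Split k-occurrences by label agreement:
    N_k(x) = GN_k(x) + BN_k(x)."""
    n = len(X)
    d = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    knn = np.argsort(d, axis=1)[:, :k]
    good = np.zeros(n, dtype=int)
    bad = np.zeros(n, dtype=int)
    for i in range(n):
        for j in knn[i]:          # x_j is among x_i's k nearest neighbors
            if y[i] == y[j]:
                good[j] += 1      # "good" occurrence: labels agree
            else:
                bad[j] += 1       # "bad" occurrence: label mismatch
    return good, bad

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
y = rng.integers(0, 2, size=100)  # random labels, purely for illustration
good, bad = good_bad_occurrences(X, y, 5)
# The two parts always sum back to N_k.
print(good.sum(), bad.sum())
```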

Slide 20
How do "bad" hubs originate?
- Cluster Assumption (CA): most pairs of vectors in a cluster should be of the same class [Chapelle 2006]
- "Bad" hubs originate from a combination of high dimensionality and violation of the CA
- The mixture also matters: high dimensionality and skewness of N_k do not automatically induce "badness"

Slide 21
Skewness of N_k vs. #features
- Skewness stays relatively constant
- It drops abruptly when the intrinsic dimensionality is reached
- Further feature selection may incur loss of information

Slide 22
Badness vs. #features
- Similar observations
- When the intrinsic dimensionality is reached, the BN_k ratio increases
- The representation ceases to reflect the information provided by the labels well

Slide 23
Feature generation
- When adding features to bring new information into the data:
  - The representation will ultimately increase S_{N_k} and thus produce hubs
  - The reduction of the BN_k ratio "flattens out" fairly quickly, limiting the usefulness of adding new features for expressing the "ground truth"
- If classifier error rate is used instead of the BN_k ratio, the results are similar

Slide 24
Conclusion
- Research in feature selection/generation has paid little attention to the fact that in intrinsically high-dimensional data, hubs will:
  - Result in an uneven distribution of cluster-assumption violation (hubs will be generated that attract more label mismatches with neighboring points)
  - Result in an uneven distribution of responsibility for classification or retrieval error among data points
- Investigating further the interaction between hubness and different notions of CA violation promises important new insights into feature selection/generation

Slide 25
Thank You!
Alexandros Nanopoulos
nanopoulos@ismll.de
