Spectral Approaches to Nearest Neighbor Search [FOCS 2014]

1 Spectral Approaches to Nearest Neighbor Search [FOCS 2014]
Robert Krauthgamer (Weizmann Institute). Joint with Amirali Abdullah, Alexandr Andoni, and Ravi Kannan. Simons Institute, Nov. 2016.

2 Nearest Neighbor Search (NNS)
Preprocess: a set P of n points in ℝ^d.
Query: given a query point q, report a point p* ∈ P with the smallest distance to q.
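As a concrete baseline, here is a minimal NumPy sketch of the exact problem solved by a linear scan (the function name and toy data are illustrative):

import numpy as np

def nearest_neighbor(P, q):
    """Exact NNS by linear scan: return the point of P closest to q.
    P is an (n, d) array of data points, q is a (d,) query vector."""
    dists = np.linalg.norm(P - q, axis=1)   # n Euclidean distances, O(n*d) work
    return P[np.argmin(dists)]

# toy usage
rng = np.random.default_rng(0)
P = rng.normal(size=(100, 8))
q = rng.normal(size=8)
p_star = nearest_neighbor(P, q)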

3 Motivation
Generic setup: points model objects (e.g. images); distance models a (dis)similarity measure.
Application areas: machine learning (k-NN rule), signal processing, vector quantization, bioinformatics, etc.
Distance can be: Hamming, Euclidean, edit distance, earth-mover distance, …

4 Curse of Dimensionality
All exact algorithms degrade rapidly with the dimension d:
Full indexing: query time O(d · log n), space n^O(d) (Voronoi diagram size).
No indexing (linear scan): query time O(n · d).

5 Approximate NNS
Given a query point q, report p′ ∈ P s.t. ||p′ − q|| ≤ c · min_{p*∈P} ||p* − q||.
c ≥ 1: approximation factor.
Randomized: return such a p′ with probability ≥ 90%.
Heuristic perspective: gives a set of candidates (hopefully small).

6 NNS algorithms
It's all about space partitions!
Low-dimensional: [Arya-Mount'93], [Clarkson'94], [Arya-Mount-Netanyahu-Silverman-Wu'98], [Kleinberg'97], [Har-Peled'02], [Arya-Fonseca-Mount'11], …
High-dimensional: [Indyk-Motwani'98], [Kushilevitz-Ostrovsky-Rabani'98], [Indyk'98, '01], [Gionis-Indyk-Motwani'99], [Charikar'02], [Datar-Immorlica-Indyk-Mirrokni'04], [Chakrabarti-Regev'04], [Panigrahy'06], [Ailon-Chazelle'06], [Andoni-Indyk'06], [Andoni-Indyk-Nguyen-Razenshteyn'14], [Andoni-Razenshteyn'15]

7 Low-dimensional
kd-trees, …
c = 1 + ε; runtime: ε^(-O(d)) · log n

8 High-dimensional Locality-Sensitive Hashing
Crucial use of random projections.
Johnson-Lindenstrauss Lemma: project to a random subspace of dimension O(ε^(-2) · log n) for a 1+ε approximation.
Runtime: n^(1/c) for c-approximation.
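A minimal sketch of such a Gaussian random projection, in the spirit of the JL lemma; the target-dimension constant (8/ε²) and the names are illustrative, not tuned:

import numpy as np

def jl_project(P, eps, seed=0):
    """Project the rows of P to m = O(eps^-2 * log n) dimensions using a
    Gaussian matrix; pairwise distances are preserved within (1 +/- eps)
    with high probability (the constant 8 is an illustrative choice)."""
    rng = np.random.default_rng(seed)
    n, d = P.shape
    m = int(np.ceil(8 * np.log(n) / eps**2))
    R = rng.normal(scale=1.0 / np.sqrt(m), size=(d, m))
    return P @ R, R    # keep R so queries can be projected the same way

# usage: P_low, R = jl_project(P, eps=0.2); a query q is then projected as q @ R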

9 Practice
Data-aware partitions: optimize the partition to your dataset.
PCA-tree [Sproull'91, McNames'01, Verma-Kpotufe-Dasgupta'09]
randomized kd-trees [Silpa-Anan-Hartley'08, Muja-Lowe'09]
spectral/PCA/semantic/WTA hashing [Weiss-Torralba-Fergus'08, Wang-Kumar-Chang'09, Salakhutdinov-Hinton'09, Yagnik-Strelow-Ross-Lin'11]

10 Practice vs Theory
Data-aware projections often outperform (vanilla) random-projection methods, but with no guarantees (of correctness or performance).
JL is in general optimal [Alon'03, Jayram-Woodruff'11], even for some NNS setups! [Andoni-Indyk-Patrascu'06]
Why do data-aware projections outperform random projections?
Is there an algorithmic framework to study this phenomenon?

11 Plan for the rest
Model
Two spectral algorithms
Conclusion

12 Our model
"Low-dimensional signal + large noise" inside a high-dimensional space.
Signal: P ⊂ U for a subspace U ⊂ ℝ^d of dimension k ≪ d.
Data: each point of P is perturbed by full-dimensional Gaussian noise N_d(0, σ²·I_d).

13 Model properties
Data: P̃ = P + G, where each point in P has at least unit norm.
Query: q̃ = q + g_q, s.t.:
||q − p*|| ≤ 1 for the "nearest neighbor" p*
||q − p|| ≥ 1 + ε for everybody else
Noise entries are N(0, σ²) with σ ≈ 1/d^(1/4) (up to a factor poly(ε^(-1) k log n)).
Claim: the exact nearest neighbor is still the same.
The noise is large: it has magnitude σ·√d ≈ d^(1/4) ≫ 1, and the top k dimensions of P̃ capture only sub-constant mass.
JL would not work: after the noise, the gap is very close to 1.
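A sketch that instantiates this model in NumPy; the parameter values, helper name, and exact placement of the points are illustrative assumptions (e.g., the unit-norm condition on P is not explicitly enforced here):

import numpy as np

def planted_instance(n=1000, d=256, k=8, eps=0.2, seed=0):
    """Signal: n points in a random k-dim subspace U of R^d, with one planted
    neighbor at distance 1 from the query and all others at distance >= 1+eps;
    data and query are then perturbed by full-dimensional Gaussian noise with
    sigma ~ d^(-1/4)."""
    rng = np.random.default_rng(seed)
    U = np.linalg.qr(rng.normal(size=(d, k)))[0]      # orthonormal basis of U
    q_low = rng.normal(size=k)
    dirs = rng.normal(size=(n, k))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    radii = np.concatenate(([1.0], 1.0 + eps + rng.uniform(0, 1, n - 1)))
    P_low = q_low + radii[:, None] * dirs             # distances to q as required
    P, q = P_low @ U.T, q_low @ U.T                   # lift the signal into R^d
    sigma = d ** -0.25                                # noise level ~ d^(-1/4)
    P_noisy = P + rng.normal(scale=sigma, size=(n, d))
    q_noisy = q + rng.normal(scale=sigma, size=d)
    return P_noisy, q_noisy, U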

14 Algorithms via PCA
Find the "signal subspace" U? Then we can project everything to U and solve NNS there.
Use Principal Component Analysis (PCA)? ≈ extract the top direction(s) from the SVD, e.g., the k-dimensional subspace S that minimizes Σ_{p∈P} d²(p, S).
If PCA removes the noise "perfectly", we are done: S = U, and we can reduce to k-dimensional NNS.
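A minimal sketch of this step via the SVD (uncentered PCA, matching the minimization above); any k-dimensional NNS routine could then be run on the projected coordinates:

import numpy as np

def top_k_subspace(P, k):
    """Top-k right singular vectors of P: an orthonormal basis of the
    k-dimensional subspace S minimizing sum over p in P of d^2(p, S)."""
    _, _, Vt = np.linalg.svd(P, full_matrices=False)
    return Vt[:k].T                # shape (d, k)

def project_onto(P, S):
    """Coordinates of the rows of P in the subspace spanned by S's columns."""
    return P @ S                   # shape (n, k): run a k-dim NNS on this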

15 NNS performance as if we are in k dimensions, for the full model?
Best we can hope for: the dataset contains a "worst-case" k-dimensional instance.
Effectively reduce the dimension d to k.
Spoiler: yes.

16 PCA under noise fails
Does PCA find the "signal subspace" U under noise?
PCA minimizes Σ_{p∈P̃} d²(p, S): good only on "average", not in the "worst case".
Weak signal directions are overpowered by noise directions: a typical noise direction contributes Σ_{i=1}^{n} g_i² ≈ n·σ² = Θ(n·σ²), where g_i is the noise of the i-th point along that direction.
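A small numeric illustration of this effect (the sizes and the planted weak direction are arbitrary choices, not from the slides): a signal direction carried by only a few points has less mass than the ≈ n·σ² contributed by a typical noise direction, so plain PCA will not pick it up.

import numpy as np

rng = np.random.default_rng(1)
n, d = 2000, 4096
sigma = d ** -0.25                       # sigma ~ d^(-1/4), as in the model
P = np.zeros((n, d))
P[:, 0] = 1.0                            # strong signal direction: all n points
P[:20, 1] = 1.0                          # weak signal direction: only 20 points
noisy = P + rng.normal(scale=sigma, size=(n, d))

print("weak signal mass   :", 20.0)              # squared norm along e_2
print("typical noise mass :", n * sigma**2)      # ~31 here, already larger
s = np.linalg.svd(noisy, compute_uv=False)
print("top squared singular values:", s[:3] ** 2)  # 2nd/3rd are noise directions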

17 1st Algorithm: intuition
Extract "well-captured points": points whose signal lies mostly inside the top PCA space; this should work for a large fraction of the points.
Iterate on the rest.

18 Iterative PCA
To make this work:
Nearly no noise in S: ensure S is close to U; S is determined by the heavy-enough spectral directions (its dimension may be less than k).
Capture only points whose signal lies fully in S; "well-captured" = distance to S is explained by noise only.
Algorithm (a code sketch follows below):
Find the top PCA subspace S.
C = points well-captured by S.
Build an NNS data structure on {C projected onto S}.
Iterate on the remaining points, P ∖ C.
Query: query each NNS data structure separately.
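A hedged NumPy sketch of this preprocessing loop and its query. The capture threshold, the rule for picking "heavy-enough" directions, and the linear scan used as the inner k-dimensional NNS structure are placeholders, not the exact choices from the analysis:

import numpy as np

def build_iterative_pca(P, k, capture_thresh):
    """Return a list of (basis S, points projected onto S, original indices).
    Each round: take the top-(<= k) PCA subspace of the remaining points,
    capture the points close enough to it, project them, iterate on the rest."""
    structures, remaining = [], np.arange(len(P))
    while remaining.size:
        R = P[remaining]
        _, s, Vt = np.linalg.svd(R, full_matrices=False)
        # keep only heavy-enough spectral directions (placeholder rule)
        l = max(1, min(k, int(np.sum(s**2 >= s[0]**2 / (2 * k)))))
        S = Vt[:l].T                                      # (d, l) basis
        dist = np.linalg.norm(R - (R @ S) @ S.T, axis=1)  # distance to S
        captured = dist <= capture_thresh
        if not captured.any():
            captured[:] = True                            # avoid an empty round
        structures.append((S, R[captured] @ S, remaining[captured]))
        remaining = remaining[~captured]
    return structures

def query_iterative_pca(structures, q):
    """Query each per-round structure (here: a linear scan) and keep the best."""
    best_id, best_dist = None, np.inf
    for S, proj, ids in structures:
        dists = np.linalg.norm(proj - q @ S, axis=1)
        j = int(np.argmin(dists))
        if dists[j] < best_dist:
            best_id, best_dist = ids[j], dists[j]
    return best_id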

19 Simpler model
Assume small noise: p̃_i = p_i + α_i, where ||α_i|| ≪ ε (the α_i can even be adversarial).
Algorithm, as before: find the top-k PCA subspace S; C = points well-captured by S, namely those with d(p̃, S) ≤ 2α; build NNS on {C projected onto S}; iterate on the remaining points P ∖ C; query each NNS separately.
Claim 1: if p* is captured into C, we will find it in the corresponding NNS, since for any captured p̃:
||p̃_S − q̃_S|| = ||p̃ − q̃|| ± 4α = ||p − q|| ± 5α
Claim 2: the number of iterations is O(log n):
Σ_{p̃∈P̃} d²(p̃, S) ≤ Σ_{p̃∈P̃} d²(p̃, U) ≤ n·α², so for at most a 1/4-fraction of the points d²(p̃, S) ≥ 4α²; hence a constant fraction is captured in each iteration.

20 Analysis of the general model
The noise is larger, so we must use the fact that it is random: the "signal" should be stronger than the "noise" (on average). Use random matrix theory.
P̃ = P + G, where G is a random n × d matrix with entries N(0, σ²); all of its singular values satisfy λ² ≤ σ²·n ≈ n/√d.
P has rank ≤ k and squared Frobenius norm ≥ n, so its important directions have λ² ≥ Ω(n/k), and we can ignore directions with λ² ≪ ε·n/k.
The important signal directions are stronger than the noise!

21 Closeness of subspaces?
Trickier than for singular values: the top singular vector is not stable under perturbation! It is stable only if the second singular value is much smaller.
How do we even define "closeness" of subspaces?
To the rescue: Wedin's sin-theta theorem, with
sin θ(S, U) = max_{x∈S, ||x||=1} min_{y∈U} ||x − y||

22 Wedin's sin-theta theorem
Developed by [Davis-Kahan'70], [Wedin'72].
Theorem: Consider P̃ = P + G, let S be the top-l subspace of P̃, and let U be the k-dimensional space containing P. Then:
sin θ(S, U) ≤ ||G|| / λ_l(P)
This is another way to see why we need to take only directions with sufficiently heavy singular values.
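A small NumPy sketch of this quantity for subspaces given by orthonormal bases: it is the sine of the largest principal angle, i.e. the top singular value of the part of S orthogonal to U.

import numpy as np

def sin_theta(S, U):
    """sin(theta(S, U)) for matrices S (d x l) and U (d x k) with orthonormal
    columns: the maximum, over unit x in span(S), of the distance from x to
    span(U), which equals the top singular value of (I - U U^T) S."""
    residual = S - U @ (U.T @ S)
    return np.linalg.svd(residual, compute_uv=False)[0]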

23 Additional issue: Conditioning
After an iteration, the noise is not random anymore: the non-captured points might be "biased" by the capturing criterion.
Fix: estimate the top PCA subspace from a small sample of the data.
This might be needed purely for the analysis, but it does not sound like a bad idea in practice either.

24 Performance of Iterative PCA
Can prove there are O(√(d·log n)) iterations; in each, we run NNS in a space of dimension ≤ k.
Overall query time: O((1/ε)^O(k) · √d · log^(3/2) n).
Reduced to O(√(d·log n)) instances of k-dimensional NNS!

25 2nd Algorithm: PCA-tree
Closer to the algorithms used in practice.
Find the top PCA direction v.
Partition into slabs ⊥ v (of width ≈ ε/k).
Snap the points to the ⊥ hyperplanes.
Recurse on each slab.
Query: follow all tree paths that may contain p*.

26 Two algorithmic modifications
Centering: need to use centered PCA (subtract the average); otherwise errors from the perturbations accumulate.
Sparsification: need to sparsify the set of points in each node of the tree; otherwise we can get a "dense" cluster: not enough variance in the signal, lots of noise.
The algorithm otherwise stays as before (a sketch follows below): find the top PCA direction v, partition into slabs ⊥ v, snap points to the ⊥ hyperplanes, recurse on each slab; to query, follow all tree paths that may contain p*.
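A hedged recursive sketch of the PCA-tree with the centering modification; the slab width, leaf size, and the stopping rule for degenerate clusters are illustrative choices, and sparsification is omitted:

import numpy as np

def build_pca_tree(X, ids=None, slab_width=0.1, leaf_size=32):
    """PCA-tree sketch: center, take the top centered-PCA direction v, cut the
    points into slabs perpendicular to v, snap each slab's points onto its
    central hyperplane, and recurse on every slab."""
    if ids is None:
        ids = np.arange(len(X))
    if len(ids) <= leaf_size:
        return {"leaf": True, "ids": ids, "points": X}
    mean = X.mean(axis=0)                        # centering (modification 1)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    v = Vt[0]                                    # top centered PCA direction
    t = (X - mean) @ v                           # coordinates along v
    slabs = np.floor(t / slab_width).astype(int)
    if np.unique(slabs).size == 1:               # degenerate "dense" cluster:
        return {"leaf": True, "ids": ids, "points": X}   # stop here (the slides sparsify instead)
    children = {}
    for s in np.unique(slabs):
        m = slabs == s
        # snap: move each point onto the central hyperplane of its slab
        snapped = X[m] - np.outer(t[m] - (s + 0.5) * slab_width, v)
        children[int(s)] = build_pca_tree(snapped, ids[m], slab_width, leaf_size)
    return {"leaf": False, "mean": mean, "v": v,
            "width": slab_width, "children": children}

To query, one would descend into every child slab whose interval can intersect a ball of the relevant radius around the query's coordinate along v, and check all the leaves reached.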

27 Analysis
An "extreme" version of the Iterative PCA algorithm: we just use the top PCA direction, which is guaranteed to contain signal!
Main lemma: the tree depth is ≤ 2k, because each discovered direction is close to U, snapping acts like orthogonalizing with respect to each one, and there cannot be too many such directions.
Query runtime: O(k/ε)^(2k).
Overall it performs like an O(k·log k)-dimensional NNS!

28 Wrap-up
Why do data-aware projections outperform random projections? Is there an algorithmic framework to study this phenomenon?
Recent development: a data-aware worst-case algorithm [Andoni-Razenshteyn'15].
Here: a model of "low-dimensional signal + large noise"; we get performance like NNS in a low-dimensional space, via the "right" adaptation of PCA.
Immediate questions: other, less-structured signal/noise models? Algorithms with runtime dependent on the spectrum?
Broader question: an analysis that explains the empirical success?

