# Dimension Reduction in the Hamming Cube (and its Applications) Rafail Ostrovsky UCLA (joint works with Rabani; and Kushilevitz and Rabani)


2 http://www.cs.ucla.edu/~rafail/ PLAN
- Problem formulations
- Communication complexity game
- What really happened? (dimension reduction)
- Solutions to 2 problems
  - ANN
  - k-clustering
- What’s next?

3 Problem statements
- Johnson-Lindenstrauss lemma: n points in a high-dimensional Hilbert space can be embedded into an O(log n)-dimensional subspace with small distortion.
- Q: how do we do it for the Hamming cube?
- (We show how to avoid the impossibility result of [Charikar-Sahai].)

4 Many different formulations of ANN
- ANN: “approximate nearest neighbor” search
- (Many applications in computational geometry, biology/stringology, IR, and other areas.)
- Here are different formulations:

5 Approximate Searching
- Motivation: given a DB of “names” and a user with a “target” name, find whether any of the DB names are “close” to the target name, without doing a linear scan.
- (Figure: a DB holding the names Jon, Alice, Bob, Eve, Panconesi, Kate, Fred; the user's query is “A.Panconesi?”.)

6 Geometric formulation
- Nearest Neighbor Search (NNS): given N blue points (and a distance function, say Euclidean distance in $\mathbb{R}^d$), store all these points somehow.

7 Data structure question
- Given a new red point, find the closest blue point.
- Naive solution 1: store the blue points “as is” and, when given a red point, measure the distances to all blue points.
- Q: can we do better?
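
Naive solution 1 amounts to a linear scan. The following is a minimal sketch, assuming Euclidean distance and points stored as tuples of floats; the function name `nearest_neighbor` and the sample points are illustrative and not from the talk.

```python
from math import dist  # Euclidean distance, available since Python 3.8


def nearest_neighbor(blue_points, red_point):
    """Return (index, distance) of the blue point closest to red_point.

    This is the naive linear scan: one distance computation per blue point.
    """
    best_index, best_distance = -1, float("inf")
    for index, blue in enumerate(blue_points):
        d = dist(blue, red_point)
        if d < best_distance:
            best_index, best_distance = index, d
    return best_index, best_distance


# Hypothetical toy data: three blue points and one red query point.
blue = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
index, distance = nearest_neighbor(blue, (0.9, 1.2))
```

Each query costs time proportional to N times d, which is the cost the rest of the talk tries to avoid.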

8 Can we do better?
- Easy in small dimensions (Voronoi diagrams)
- “Curse of dimensionality” in high dimensions…
- [KOR]: can get a good “approximate” solution efficiently!

9 Hamming Cube Formulation for ANN
- Given a DB of N blue n-bit strings, process them somehow. Given an n-bit red string, find its ANN in the hypercube $\{0,1\}^n$.
- Naive solution 2: pre-compute all (exponentially many) answers (but we want small data structures!).
- (Figure: example 8-bit strings 00101011, 01011001, 11101001, 10110110, 11010101, 11011000, 10101010, 10101111, 11010100.)

10 Clustering problem that I’ll discuss in detail
- K-clustering

11 An example of clustering: find “centers”
- Given N points in $\mathbb{R}^d$

12 A clustering formulation
- Find cluster “centers”

13 Clustering formulation
- The “cost” is the sum of distances from each point to its closest center

14 Main technique
- First, as a communication game
- Second, interpreted as a dimension reduction

15 COMMUNICATION COMPLEXITY GAME
- Given two players, Alice and Bob:
  - Alice is secretly given a string x
  - Bob is secretly given a string y
- They want to estimate the Hamming distance between x and y with small communication (and small error), provided that they have common randomness.
- How can they do it? (Say |x| = |y| = N.)
- Much easier: how do we check that x = y?
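
For the "much easier" question, one standard approach with common randomness is to compare random parities. The sketch below is an illustration under that assumption, not necessarily the protocol shown in the talk; `parity_bit`, `probably_equal`, and the seeded generator standing in for the shared random string are all hypothetical names.

```python
import random


def parity_bit(bits, mask):
    """XOR of the bits selected by mask (inner product mod 2)."""
    return sum(b & m for b, m in zip(bits, mask)) % 2


def probably_equal(x, y, repetitions=40, seed=0):
    """Randomized equality test using shared randomness.

    Both players derive the same masks from the shared seed. Each player
    sends one parity bit per mask, so communication is `repetitions` bits.
    If x != y, a uniformly random mask exposes the difference with
    probability exactly 1/2, so a wrong "equal" answer has probability
    2 ** -repetitions.
    """
    shared = random.Random(seed)  # stands in for the common random string
    for _ in range(repetitions):
        mask = [shared.randint(0, 1) for _ in x]
        if parity_bit(x, mask) != parity_bit(y, mask):
            return False  # the strings certainly differ
    return True


x = [0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0]
y = list(x)
y[4] ^= 1  # flip a single bit
same = probably_equal(x, x)
different = probably_equal(x, y)
```

The answer "not equal" is always correct; only "equal" can be wrong, and only with exponentially small probability in the number of repetitions.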

16 Main lemma: an abstract game
- How can Alice and Bob estimate the Hamming distance between X and Y with small communication complexity (CC)?
- We assume Alice and Bob share randomness.
- (Figure: Alice holds $X_1 X_2 X_3 X_4 \dots X_n$; Bob holds $Y_1 Y_2 Y_3 Y_4 \dots Y_n$.)

17 A simpler question
- To estimate the Hamming distance between X and Y (within a factor of $(1+\epsilon)$) with small CC, it is sufficient for Alice and Bob, for any L, to be able to distinguish between:
  - $H(X,Y) \le L$, or
  - $H(X,Y) > (1+\epsilon)L$
- Q: why does sampling not work?
- (Figure: Alice holds $X_1 X_2 X_3 X_4 \dots X_n$; Bob holds $Y_1 Y_2 Y_3 Y_4 \dots Y_n$.)

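18 Alice and Bob pick the SAME n-bit blue R
- Each bit of R is 1 independently with probability $1/(2L)$.
- Alice outputs the XOR of the bits of X selected by R; Bob outputs the XOR of the bits of Y selected by the same R.
- (Figure: X = 01010001010 and Y = 01011101010 are each masked with R = 01000100100, and the selected bits are XORed to a single 0/1 output.)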

19 What is the difference in probabilities?
- Compare the case $H(X,Y) \le L$ with the case $H(X,Y) > (1+\epsilon)L$.
- (Figure: X = 01010001010 and Y = 01011101010, each masked with R = 01000100100 and XORed to a 0/1 output.)
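
The question can be worked out directly. The derivation below is a sketch under the assumption that each bit of R is 1 with probability $p = 1/(2L)$ and that $L \ge 2$; the symbol $h$ for $H(X,Y)$ is introduced here for brevity. The two output bits differ exactly when R selects an odd number of the $h$ positions where X and Y disagree, so

$$
\Pr[\text{outputs differ}] \;=\; \sum_{j\ \mathrm{odd}} \binom{h}{j} p^{j} (1-p)^{h-j} \;=\; \frac{1-(1-2p)^{h}}{2} \;=\; \frac{1-(1-1/L)^{h}}{2}.
$$

This is increasing in $h$, which gives

$$
h \le L:\ \Pr \le \frac{1-(1-1/L)^{L}}{2},
\qquad
h > (1+\epsilon)L:\ \Pr > \frac{1-(1-1/L)^{(1+\epsilon)L}}{2}.
$$

The gap between the two bounds is

$$
\frac{(1-1/L)^{L}\,\bigl(1-(1-1/L)^{\epsilon L}\bigr)}{2} \;\ge\; \frac{1-e^{-\epsilon}}{8},
$$

using $(1-1/L)^{L} \ge 1/4$ for $L \ge 2$ and $(1-1/L)^{\epsilon L} \le e^{-\epsilon}$. For small $\epsilon$ the gap is therefore on the order of $\epsilon$: a constant-size difference in probability that repetition can amplify.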

20 How do we amplify?
- (Figure: X, Y and R as before, each producing a 0/1 XOR output.)

21 How do we amplify?
- Repeat, with many independent R's drawn from the same distribution!
- (Figure: X, Y and R as before, each producing a 0/1 XOR output.)

22 A refined game with small communication
- How can Alice and Bob distinguish between:
  - $H(X,Y) \le L$, or
  - $H(X,Y) > (1+\epsilon)L$
- Pick $O(\epsilon^{-2} \log N)$ R's with the correct distribution.
- Alice: for each R, XOR the subset of the $X_i$ selected by R.
- Bob: for each R, XOR the same subset of the $Y_i$.
- Compare the outputs, i.e. compare the results of this linear transformation.
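
The refined game can be simulated end to end. The sketch below is illustrative: the function names, the parameter values (n = 200, L = 20, epsilon = 1, 1000 masks), and the midpoint threshold rule are assumptions made for the demo, not constants taken from the talk.

```python
import random


def make_masks(n, L, count, seed):
    """Shared randomness: `count` masks, each bit 1 with probability 1/(2L)."""
    rng = random.Random(seed)
    p = 1.0 / (2 * L)
    return [[1 if rng.random() < p else 0 for _ in range(n)] for _ in range(count)]


def sketch(bits, masks):
    """One parity bit per mask: a linear map over GF(2) into a small cube."""
    return [sum(b & m for b, m in zip(bits, mask)) % 2 for mask in masks]


def differing_fraction(sketch_x, sketch_y):
    return sum(a != b for a, b in zip(sketch_x, sketch_y)) / len(sketch_x)


def odd_parity_probability(h, L):
    """Chance that one test separates two strings at Hamming distance h."""
    return (1 - (1 - 1 / L) ** h) / 2


def looks_close(sketch_x, sketch_y, L, eps):
    """Answer 'H <= L' if the observed fraction is below the midpoint
    between the expected fractions at distance L and at (1 + eps) * L."""
    low = odd_parity_probability(L, L)
    high = odd_parity_probability((1 + eps) * L, L)
    return differing_fraction(sketch_x, sketch_y) <= (low + high) / 2


def flip_prefix(bits, count):
    """Copy of bits with the first `count` positions flipped."""
    return [b ^ 1 if i < count else b for i, b in enumerate(bits)]


n, L, eps = 200, 20, 1.0
masks = make_masks(n, L, count=1000, seed=7)
data_rng = random.Random(1)
x = [data_rng.randint(0, 1) for _ in range(n)]
close_verdict = looks_close(sketch(x, masks), sketch(flip_prefix(x, 5), masks), L, eps)
far_verdict = looks_close(sketch(x, masks), sketch(flip_prefix(x, 80), masks), L, eps)
```

Only the sketches need to be exchanged, so the communication is one bit per mask regardless of n.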

23 Dimension Reduction in the Hamming Cube [OR]
- For each L, we can pick O(log N) R's and boost the probabilities!
- Key property: we get an embedding from the large cube to a small cube that preserves ranges around L very well.

24 Dimension Reduction in the Hamming Cube [OR]
- For each L, we can pick O(log N) R's and boost the probabilities!
- Key property: we get an embedding from the large cube to a small cube that preserves ranges around L.
- Key idea in applications: we can build an inverse lookup table for the small cube!

25 Applications
- Applications of the dimension reduction in the Hamming cube:
  - For ANN in the Hamming cube and $\mathbb{R}^d$
  - For k-clustering

26 Application to ANN in the Hamming Cube
- For each possible L, build a “small cube” and project the original DB into it.
- Pre-compute an inverse table for each entry of the small cube.
- Why is this efficient?
- How do we answer any query?
- How do we navigate between different L?
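
A toy version of the inverse-table idea for one fixed L is sketched below. It is a simplification under stated assumptions: the table stores, for every vertex of the small cube, the index of the DB point whose projection is nearest, whereas the actual [KOR] structure and its thresholds may differ. All names and parameters (`build_inverse_table`, m = 8 masks, 20 DB strings of 64 bits) are hypothetical.

```python
import random
from itertools import product


def make_masks(n, L, m, seed):
    """m shared masks, each bit 1 with probability 1/(2L)."""
    rng = random.Random(seed)
    p = 1.0 / (2 * L)
    return [[1 if rng.random() < p else 0 for _ in range(n)] for _ in range(m)]


def sketch(bits, masks):
    """Project an n-bit string to an m-bit vertex of the small cube."""
    return tuple(sum(b & r for b, r in zip(bits, mask)) % 2 for mask in masks)


def hamming(a, b):
    return sum(u != v for u, v in zip(a, b))


def build_inverse_table(db, masks):
    """Pre-compute an answer for every one of the 2^m small-cube vertices."""
    db_sketches = [sketch(point, masks) for point in db]
    table = {}
    for vertex in product((0, 1), repeat=len(masks)):
        table[vertex] = min(
            range(len(db)), key=lambda i: hamming(vertex, db_sketches[i])
        )
    return table


def query(table, masks, q):
    """Answer a query with one projection and one table lookup."""
    return table[sketch(q, masks)]


n, L, m = 64, 8, 8
rng = random.Random(5)
db = [[rng.randint(0, 1) for _ in range(n)] for _ in range(20)]
masks = make_masks(n, L, m, seed=11)
table = build_inverse_table(db, masks)
answer = query(table, masks, db[3])
```

The table has 2^m entries, which stays polynomial in N when m is O(log N); that is the reason the small cube makes pre-computing every answer affordable, in contrast to naive solution 2.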

27 Putting it all together: user’s private approximate search from a DB
- Each projection consists of O(log N) R's. The user picks many such projections for each L-range. That defines all the embeddings.
- Now, the DB builds inverse lookup tables for each projection, as new DBs for each L.
- The user can now “project” its query into the small cube and use binary search on L.

28 MAIN THM [KOR]
- Can build a poly-size data structure to do ANN for high-dimensional data in time polynomial in d and poly-log in N:
  - for the Hamming cube
  - $L_1$
  - $L_2$
  - square of the Euclidean distance
- [IM] had a similar result, with a slightly weaker guarantee.

29 Dealing with $\mathbb{R}^d$
- Project to random lines, choose “cut” points…
- Well, not exactly… we need “navigation”

30 Clustering
- Huge number of applications (IR, mining, analysis of statistical data, biology, automatic taxonomy formation, web, topic-specific data collections, etc.)
- Two independent issues:
  - Representation of data
  - Forming “clusters” (many incomparable methods)

31 Representation of data: examples
- Latent semantic indexing yields points in $\mathbb{R}^d$ with $\ell_2$ distance (distance indicating similarity).
- The min-wise permutation approach (Broder et al.) yields points in the Hamming metric.
- Many other representations from the IR literature lead to other metrics, including the edit-distance metric on strings.
- Recent news: [OR-95] showed that we can embed the edit-distance metric into $\ell_1$ with small distortion: distortion = $\exp(\sqrt{\log n \log\log n})$.

32 Geometric Clustering: examples
- Min-sum clustering in $\mathbb{R}^d$: form clusters such that the sum of intra-cluster distances is minimized.
- K-clustering: pick k “centers” in the ambient space. The cost is the sum of distances from each data point to the closest center.
- Agglomerative clustering (form clusters below some distance threshold).
- Q: which is better?
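
The k-clustering cost defined above is easy to state in code. This is a minimal sketch for the Hamming cube; the function names and the four sample points are illustrative.

```python
def hamming(a, b):
    """Number of coordinates in which two bit vectors differ."""
    return sum(u != v for u, v in zip(a, b))


def k_clustering_cost(points, centers):
    """Sum over all points of the distance to the closest center."""
    return sum(min(hamming(p, c) for c in centers) for p in points)


# Hypothetical toy instance in {0,1}^3 with k = 2 centers.
points = [(0, 0, 0), (0, 0, 1), (1, 1, 1), (1, 1, 0)]
centers = [(0, 0, 0), (1, 1, 1)]
cost = k_clustering_cost(points, centers)
```

Evaluating a given set of centers is cheap; the hard part is searching over the possible centers, which need not be data points.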

33 Methods are (in general) incomparable

34 Min-SUM

35 2-Clustering

36 A k-clustering problem: notation
- N: number of points
- d: dimension
- k: number of centers

37 About k-clustering
- When k is fixed, this is easy for small d.
- [Kleinberg, Papadimitriou, Raghavan]: NP-complete for k=2 for the cube.
- [Drineas, Frieze, Kannan, Vempala, Vinay]: NP-complete for $\mathbb{R}^d$ for the square of the Euclidean distance.
- When k is not fixed, this is facility location (Euclidean k-median).
- For fixed d but growing k, a PTAS was given by [Arora, Raghavan, Rao] (using dynamic programming).
- (This talk) [OR]: PTAS for fixed k, arbitrary d.

38 Common tools in geometric PTAS
- Dynamic programming
- Sampling [Schulman, AS, DLVK]
- [DFKVV] use SVD
- Embeddings/dimension reduction seem useless because:
  - Too many candidate centers
  - May introduce new centers

39 [OR] k-clustering result
- A PTAS for fixed k:
  - Hamming cube $\{0,1\}^d$
  - $\ell_1^d$
  - $\ell_2^d$ (Euclidean distance)
  - Square of the Euclidean distance

40 Main ideas
- For 2-clustering, finding a good partition is as good as solving the problem.
- Switch to the cube.
- Try partitions in the embedded low-dimensional data set.
- Given a partition, compute the centers and the cost in the original data set.
- Embedding/dimension reduction is used to reduce the number of partitions.
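
The step "given a partition, compute the centers and the cost" is simple in the Hamming cube, because the point minimizing the sum of Hamming distances to a cluster is its coordinate-wise majority. The sketch below illustrates only that step; the function names and the toy partition are hypothetical, and ties are broken toward 0 as an arbitrary choice.

```python
def hamming(a, b):
    return sum(u != v for u, v in zip(a, b))


def majority_center(cluster):
    """Coordinate-wise majority vote; ties go to 0 (either choice is optimal)."""
    dimension = len(cluster[0])
    return tuple(
        1 if 2 * sum(point[i] for point in cluster) > len(cluster) else 0
        for i in range(dimension)
    )


def partition_cost(clusters):
    """Cost of a partition: each cluster is charged to its own best center."""
    total = 0
    for cluster in clusters:
        center = majority_center(cluster)
        total += sum(hamming(point, center) for point in cluster)
    return total


# Hypothetical 2-partition of five points in {0,1}^3.
cluster_a = [(0, 0, 0), (0, 0, 1), (0, 1, 0)]
cluster_b = [(1, 1, 1), (1, 1, 0)]
center_a = majority_center(cluster_a)
center_b = majority_center(cluster_b)
cost = partition_cost([cluster_a, cluster_b])
```

This is why a good partition is as good as a solution: once the partition is fixed, the optimal centers follow in time linear in the input.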

41 Stronger property of [OR] dimension reduction
- Our random linear transformation preserves ranges!

42 THE ALGORITHM

43 The algorithm yet again
- Guess the 2-center distance
- Map to the small cube
- Partition in the small cube
- Measure the partition in the big cube
- THM: gets within $(1+\epsilon)$ of optimal.
- Disclaimer: a PTAS is (almost) never practical; this shows “feasibility only”, and more ideas are needed for a practical solution.

44 Dealing with k>2
- The apex of a tournament is a node of maximum out-degree.
- Fact: the apex has a path of length at most 2 to every node.
- Every point is assigned an apex of center “tournaments”:
  - Guess all (k choose 2) center distances
  - Embed into (k choose 2) small cubes
  - Guess the center projections in the small cubes
  - For every point, for every pair of centers, define a “tournament”: which center is closer in the projection
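
The tournament fact can be checked on small examples. The sketch below builds a random tournament, finds a node of maximum out-degree, and verifies that it reaches every other node by a path of length at most 2. The representation (`beats[i][j]` is true when i beats j) and all function names are assumptions for illustration; how the talk's algorithm uses the apex is not reproduced here.

```python
import random
from itertools import combinations


def random_tournament(n, seed):
    """Orient every pair of nodes one way or the other at random."""
    rng = random.Random(seed)
    beats = [[False] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        if rng.random() < 0.5:
            beats[i][j] = True
        else:
            beats[j][i] = True
    return beats


def apex(n, beats):
    """A node of maximum out-degree (first one found on ties)."""
    return max(range(n), key=lambda i: sum(beats[i][j] for j in range(n)))


def reaches_within_two(n, beats, source):
    """True if source reaches every other node by a path of length 1 or 2."""
    for target in range(n):
        if target == source or beats[source][target]:
            continue
        if not any(beats[source][mid] and beats[mid][target] for mid in range(n)):
            return False
    return True


n = 6
beats = random_tournament(n, seed=3)
top = apex(n, beats)
ok = reaches_within_two(n, beats, top)
```

The check succeeds for every tournament: if some node beat the apex and also beat everything the apex beats, it would have strictly larger out-degree.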

45 Conclusions
- Dimension reduction in the cube makes it possible to deal with a huge number of “incomparable” attributes.
- Embeddings of other metrics into the cube allow fast ANN for other metrics.
- Real applications still require considerable additional ideas.
- Fun area to work in.
