
1 Latent Semantic Indexing (mapping onto a smaller space of latent concepts) Paolo Ferragina Dipartimento di Informatica Università di Pisa Reading 18

2 Speeding up cosine computation
What if we could take our vectors and "pack" them into fewer dimensions (say 50,000 → 100) while preserving distances?
Now: O(nm) to compute cos(d,q) for all docs d.
Then: O(km + kn), where k << n, m.
Two methods: "Latent Semantic Indexing" and random projection.
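As a rough worked example of this cost gap (m ≈ 50,000 terms and k ≈ 100 come from the slide; the corpus size n below is an assumed figure, used only for illustration):

```python
# Back-of-the-envelope cost comparison for scoring a query against all docs.
# m and k come from the slide (50,000 -> 100); n = 1,000,000 is an assumed
# corpus size, for illustration only.
m = 50_000       # vocabulary size (original dimensions)
k = 100          # reduced dimensions
n = 1_000_000    # assumed number of documents

full_cost = n * m              # O(nm): cosine of q against every doc in m dims
reduced_cost = k * m + k * n   # O(km + kn): project q, then score in k dims

print(f"full:    {full_cost:.2e} operations")
print(f"reduced: {reduced_cost:.2e} operations")
print(f"speed-up ~ {full_cost / reduced_cost:.0f}x")
```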

3 Briefly
LSI is data-dependent: create a k-dim subspace by eliminating redundant axes, pulling together "related" axes (hopefully car and automobile).
Random projection is data-independent: choose a k-dim subspace that guarantees, with high probability, good stretching properties between any pair of points.
What about polysemy?

4 Notions from linear algebra
Matrix A, vector v
Matrix transpose (A^t)
Matrix product
Rank
Eigenvalue λ and eigenvector v: Av = λv
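As a quick numerical illustration of the last definition (a minimal numpy sketch, not part of the original slides), one can check Av = λv for a small symmetric matrix:

```python
import numpy as np

# A small symmetric matrix: its eigenvalues are real and its eigenvectors
# can be chosen orthonormal, which is the setting used on the next slides.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

eigenvalues, eigenvectors = np.linalg.eigh(A)  # eigh: for symmetric matrices

# Verify A v = lambda v for each (eigenvalue, eigenvector) pair.
for lam, v in zip(eigenvalues, eigenvectors.T):
    assert np.allclose(A @ v, lam * v)
    print(f"lambda = {lam:.4f}: A v == lambda v holds")
```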

5 Overview of LSI
Pre-process docs using a technique from linear algebra called Singular Value Decomposition (SVD).
Create a new (smaller) vector space.
Queries are handled (faster) in this new space.

6 Singular-Value Decomposition
Recall the m × n matrix of terms × docs, A. A has rank r ≤ m, n.
Define the term-term correlation matrix T = AA^t. T is a square, symmetric m × m matrix. Let P be the m × r matrix of eigenvectors of T.
Define the doc-doc correlation matrix D = A^t A. D is a square, symmetric n × n matrix. Let R be the n × r matrix of eigenvectors of D.
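A minimal numpy sketch of these two correlation matrices and their eigenvectors (the small random term-doc matrix is just a placeholder for a real one):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 6, 4                      # tiny example: 6 terms, 4 docs
A = rng.random((m, n))           # placeholder term-doc matrix
r = np.linalg.matrix_rank(A)     # rank r <= min(m, n)

T = A @ A.T                      # term-term correlation, m x m, symmetric
D = A.T @ A                      # doc-doc correlation, n x n, symmetric

# Eigenvectors of the symmetric matrices; keep those of the r largest eigenvalues.
tw, P = np.linalg.eigh(T)        # columns of P: eigenvectors of T
dw, R = np.linalg.eigh(D)        # columns of R: eigenvectors of D
P = P[:, np.argsort(tw)[::-1][:r]]   # m x r
R = R[:, np.argsort(dw)[::-1][:r]]   # n x r

# The non-zero eigenvalues of T and D coincide
# (they are the squared singular values of A).
print(np.sort(tw)[::-1][:r])
print(np.sort(dw)[::-1][:r])
```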

7 A's decomposition
Given P (for T, m × r) and R (for D, n × r), formed by orthonormal columns (unit dot-products), it turns out that A = P Σ R^t, where Σ is an r × r diagonal matrix containing the singular values of A (the square roots of the eigenvalues of T = AA^t) in decreasing order.
Dimensions: A is m × n, P is m × r, Σ is r × r, R^t is r × n.
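A small sketch checking this factorization with numpy's SVD routine, which returns exactly the P, Σ, R^t of the slide with the singular values already in decreasing order (A is again a placeholder matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 6, 4
A = rng.random((m, n))           # placeholder term-doc matrix

# Thin SVD: P is m x r, sigma holds the r singular values (decreasing),
# Rt is r x n, with r = min(m, n) here.
P, sigma, Rt = np.linalg.svd(A, full_matrices=False)

# Columns of P and rows of Rt are orthonormal, and A = P diag(sigma) Rt.
assert np.allclose(P.T @ P, np.eye(P.shape[1]))
assert np.allclose(Rt @ Rt.T, np.eye(Rt.shape[0]))
assert np.allclose(A, P @ np.diag(sigma) @ Rt)

# The squared singular values are the eigenvalues of T = A A^t.
eigs_T = np.sort(np.linalg.eigvalsh(A @ A.T))[::-1][:len(sigma)]
assert np.allclose(sigma**2, eigs_T)
print("A = P Sigma R^t verified")
```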

8 Dimensionality reduction
For some k << r, zero out all but the k biggest singular values in Σ [the choice of k is crucial].
Denote by Σ_k this new version of Σ, having rank k. Typically k is about 100, while r (A's rank) is > 10,000.
This gives A_k = P Σ_k R^t. The columns of P and rows of R^t beyond the k-th become useless, because they are multiplied by the 0-columns/0-rows of Σ_k; hence A_k is effectively the product of an m × k matrix, a k × k diagonal matrix, and a k × n matrix.
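A minimal sketch of the truncation step, continuing the numpy example above (k = 2 is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 6, 4, 2
A = rng.random((m, n))                      # placeholder term-doc matrix

P, sigma, Rt = np.linalg.svd(A, full_matrices=False)

# Zero out all but the k largest singular values ...
sigma_k = sigma.copy()
sigma_k[k:] = 0.0
A_k = P @ np.diag(sigma_k) @ Rt             # rank-k approximation of A

# ... which is the same as keeping only the first k columns/rows:
A_k_small = P[:, :k] @ np.diag(sigma[:k]) @ Rt[:k, :]   # (m x k)(k x k)(k x n)
assert np.allclose(A_k, A_k_small)
print("rank of A_k:", np.linalg.matrix_rank(A_k))       # -> k
```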

9 Guarantee
A_k is a pretty good approximation to A: relative distances are (approximately) preserved.
Of all m × n matrices of rank k, A_k is the best approximation to A wrt the following measures:
min_{B, rank(B)=k} ||A − B||_2 = ||A − A_k||_2 = σ_{k+1}
min_{B, rank(B)=k} ||A − B||_F^2 = ||A − A_k||_F^2 = σ_{k+1}^2 + σ_{k+2}^2 + ... + σ_r^2
where the Frobenius norm is ||A||_F^2 = σ_1^2 + σ_2^2 + ... + σ_r^2.
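These identities are easy to check numerically; a small sketch continuing the example above:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 6, 4, 2
A = rng.random((m, n))

P, sigma, Rt = np.linalg.svd(A, full_matrices=False)
A_k = P[:, :k] @ np.diag(sigma[:k]) @ Rt[:k, :]

# Spectral norm of the error equals the (k+1)-th singular value.
assert np.isclose(np.linalg.norm(A - A_k, ord=2), sigma[k])

# Squared Frobenius norm of the error equals the sum of the remaining
# squared singular values.
assert np.isclose(np.linalg.norm(A - A_k, ord='fro')**2, np.sum(sigma[k:]**2))

# And ||A||_F^2 is the sum of all squared singular values.
assert np.isclose(np.linalg.norm(A, ord='fro')**2, np.sum(sigma**2))
print("best rank-k approximation identities verified")
```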

10 Reduction
X_k = Σ_k R^t is the doc matrix, k × n, hence reduced to k dimensions.
Since we are interested in doc/query correlations, consider D = A^t A = (P Σ R^t)^t (P Σ R^t) = (Σ R^t)^t (Σ R^t).
Approximating Σ with Σ_k, we get A^t A ≈ X_k^t X_k (both are n × n matrices).
We use X_k to define how to project A and q: from X_k = Σ_k R^t, substitute R^t = Σ^{-1} P^t A, so we get X_k = Σ_k Σ^{-1} P^t A. In fact, Σ_k Σ^{-1} P^t = P_k^t (dropping the all-zero rows), which is a k × m matrix.
This means that to reduce a doc/query vector it is enough to multiply it by P_k^t, thus paying O(km) per doc/query.
The cost of sim(q,d), for all d, is O(kn + km) instead of O(mn).
Recall that R, P are formed by orthonormal eigenvectors of the matrices D, T.
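A minimal sketch of the whole projection pipeline (the term-doc matrix and the query are placeholders; the variable names mirror the slide's notation):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 6, 4, 2
A = rng.random((m, n))               # placeholder term-doc matrix
q = rng.random(m)                    # placeholder query vector (m terms)

P, sigma, Rt = np.linalg.svd(A, full_matrices=False)
Pk_t = P[:, :k].T                    # P_k^t: k x m projection matrix
X_k = np.diag(sigma[:k]) @ Rt[:k, :] # Sigma_k R^t: k x n reduced doc matrix

# Projecting docs and query costs O(km) per vector.
docs_k = Pk_t @ A                    # k x n (equals X_k, up to numerics)
q_k = Pk_t @ q                       # k-dim query
assert np.allclose(docs_k, X_k)

# Cosine similarity of q against all docs in the reduced space: O(kn).
def cosines(query, docs):
    docs_norm = docs / np.linalg.norm(docs, axis=0)
    return (query / np.linalg.norm(query)) @ docs_norm

print("reduced-space scores:", cosines(q_k, docs_k))
print("full-space scores:   ", cosines(q, A))
```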

11 What are the concepts?
The c-th concept is the c-th row of P_k^t (which is k × m). Denote it by P_k^t[c]; its size is m = #terms.
P_k^t[c][i] = strength of association between the c-th concept and the i-th term.
Projected document: d'_j = P_k^t d_j, where d'_j[c] = strength of concept c in d_j.
Projected query: q' = P_k^t q, where q'[c] = strength of concept c in q.
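A short sketch of how one might inspect these concepts; the toy vocabulary and the matrix values are invented purely for illustration:

```python
import numpy as np

terms = ["car", "automobile", "engine", "banana", "fruit", "apple"]
rng = np.random.default_rng(1)
m, n, k = len(terms), 5, 2
A = rng.random((m, n))                 # placeholder term-doc matrix

P, sigma, Rt = np.linalg.svd(A, full_matrices=False)
Pk_t = P[:, :k].T                      # k x m: one row per concept

# For each concept, list the terms with the strongest association.
for c in range(k):
    strongest = np.argsort(-np.abs(Pk_t[c]))[:3]
    print(f"concept {c}:", [(terms[i], round(float(Pk_t[c, i]), 3)) for i in strongest])

# Strength of each concept in a projected document.
d0 = A[:, 0]
print("concept strengths in doc 0:", Pk_t @ d0)
```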

12 Random Projections Paolo Ferragina Dipartimento di Informatica Università di Pisa (Slides only!)

13 An interesting math result
Lemma (Johnson-Lindenstrauss, '82). Let P be a set of n distinct points in m dimensions. Given ε > 0, there exists a function f : P → IR^k such that for every pair of points u, v in P it holds that
(1 − ε) ||u − v||^2 ≤ ||f(u) − f(v)||^2 ≤ (1 + ε) ||u − v||^2,
where k = O(ε^{-2} log n).
Such an f is called a JL-embedding. Setting v = 0 we also get a bound on f(u)'s stretching!
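To get a feel for the bound, a tiny sketch plugging in numbers (the constant in the O(·) is taken to be 1 purely for illustration; real constants depend on the construction and on the failure probability):

```python
import math

# Illustrative only: assume k ~ (1/eps^2) * ln(n), with constant 1.
def jl_dimension(n_points: int, eps: float) -> int:
    return math.ceil(math.log(n_points) / eps**2)

for n_points in (10_000, 1_000_000):
    for eps in (0.5, 0.1):
        print(f"n = {n_points:>9}, eps = {eps}: k ~ {jl_dimension(n_points, eps)}")
```

The key takeaway is that k grows only logarithmically with the number of points but quadratically with 1/ε.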

14 What about the cosine-distance?
Write u · v = (||u||^2 + ||v||^2 − ||u − v||^2) / 2, and the analogous identity for f(u) · f(v). Substituting the JL bound for ||f(u) − f(v)||^2, together with the bounds on f(u)'s and f(v)'s stretching (obtained by setting v = 0 in the lemma), shows that the dot product, and hence the cosine, is preserved up to an additive error of O(ε (||u||^2 + ||v||^2)).

15 How to compute a JL-embedding?
Set R = (r_{i,j}) to be a random m × k matrix whose components are independent random variables with E[r_{i,j}] = 0 and Var[r_{i,j}] = 1 (e.g., standard Gaussians, or ±1 each with probability 1/2).
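A minimal sketch of this construction with an empirical check of the distance bound from the lemma. The scaling by 1/√k and the two example distributions are standard choices, stated here as assumptions rather than as the slide's exact definition:

```python
import numpy as np

rng = np.random.default_rng(42)
n_points, m = 100, 5_000         # n points in m dimensions
eps = 0.2
k = 500                          # reduced dimension

X = rng.random((n_points, m))    # placeholder data points (rows)

# Random m x k matrices with i.i.d. entries, E = 0 and Var = 1:
# Gaussian N(0,1), or +-1 each with probability 1/2.
R_gauss = rng.standard_normal((m, k))
R_sign = rng.choice([-1.0, 1.0], size=(m, k))

def jl_embed(points, R):
    # f(u) = (1/sqrt(k)) u R  -- the scaling keeps expected squared norms unchanged.
    return points @ R / np.sqrt(R.shape[1])

for name, R in [("gaussian", R_gauss), ("+-1", R_sign)]:
    Y = jl_embed(X, R)
    u, v = X[0], X[1]
    fu, fv = Y[0], Y[1]
    ratio = np.sum((fu - fv)**2) / np.sum((u - v)**2)
    print(f"{name}: ||f(u)-f(v)||^2 / ||u-v||^2 = {ratio:.3f} (target 1 +- {eps})")
```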

16 Finally...
Random projections hide large constants: k ≈ (1/ε)^2 · log n, so it may be large... but they are simple and fast to compute.
LSI is intuitive and may scale to any k; it is optimal under various metrics, but costly to compute (though good libraries are now available).

