 # 1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 4 March 30, 2005


1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 4 March 30, 2005 http://www.ee.technion.ac.il/courses/049011

2 Spectral Methods in Information Retrieval

3 Outline. Motivation: synonymy and polysemy; Latent Semantic Indexing (LSI); Singular Value Decomposition (SVD); LSI via SVD; why LSI works; HITS and SVD.

4 Synonymy and Polysemy. Synonymy: multiple terms with (almost) the same meaning, e.g., cars, autos, vehicles; harms recall. Polysemy: a term with multiple meanings, e.g., java (programming language, coffee, island); harms precision.

5 Traditional Solutions. Query expansion. Synonymy: OR on all synonyms, found through manual or automatic use of thesauri; with too few synonyms recall is still low, with too many precision suffers. Polysemy: AND on the term and additional specializing terms, e.g., +java +"programming language"; with terms that are too broad precision is still low, with terms that are too narrow recall suffers.

6 Syntactic Space. D: document collection, $|D| = n$. T: term space, $|T| = m$. A: $m \times n$ term-document matrix (rows are terms, columns are documents). $A_{t,d}$: “weight” of t in d (e.g., TF-IDF). $A^T A$: pairwise document similarities. $AA^T$: pairwise term similarities.
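
A minimal sketch of the two similarity matrices on this slide, assuming NumPy. The four terms, three documents, and raw counts (standing in for TF-IDF weights) are made up for illustration.

```python
import numpy as np

# Hypothetical toy term-document matrix A: m = 4 terms (rows), n = 3 documents (columns).
# Raw counts stand in for the TF-IDF weights mentioned on the slide.
terms = ["car", "auto", "java", "coffee"]
A = np.array([
    [2.0, 0.0, 0.0],   # car
    [1.0, 1.0, 0.0],   # auto
    [0.0, 0.0, 3.0],   # java
    [0.0, 0.0, 1.0],   # coffee
])

doc_sim = A.T @ A    # n x n: entry (d, d') is the inner product of documents d and d'
term_sim = A @ A.T   # m x m: entry (t, t') is the inner product of terms t and t'

print(doc_sim)
print(term_sim)
```

Both products are symmetric; documents 0 and 1 are similar only through the shared term "auto", and document 2 shares no term with either.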

7 Syntactic Indexing. Index keys: terms. Limitations: synonymy ((near-)identical rows); polysemy; space inefficiency (the matrix is usually not full rank). Gap between syntax and semantics: the information need is semantic, but the index and the query are syntactic.

8 Semantic Space. C: concept space, $|C| = r$. B: $r \times n$ concept-document matrix (rows are concepts, columns are documents). $B_{c,d}$: “weight” of c in d. Going from A to B is a change of basis; compare to wavelet and Fourier transforms.

9 Latent Semantic Indexing (LSI) [Deerwester et al. 1990]. Index keys: concepts. Documents and queries: mixtures of concepts. Given a query, finds the most similar documents. Bridges the syntax-semantics gap. Space-efficient: concepts are orthogonal and the matrix is full rank. Questions: What is the concept space? What is the transformation from the syntactic space to the semantic space? How do we filter out “noise concepts”?

10 Singular Values. A: $m \times n$ real matrix. Definition: $\sigma \ge 0$ is a singular value of A if there exists a pair of nonzero vectors u, v such that $Av = \sigma u$ and $A^T u = \sigma v$; u and v are called singular vectors. Ex: $\sigma = \|A\|_2 = \max_{\|x\|_2 = 1} \|Ax\|_2$, with corresponding singular vectors $v = x$ (the unit vector that maximizes $\|Ax\|_2$) and $u = Ax / \|A\|_2$. Note: $A^T A v = \sigma^2 v$ and $AA^T u = \sigma^2 u$, so $\sigma^2$ is an eigenvalue of both $A^T A$ and $AA^T$, v is an eigenvector of $A^T A$, and u is an eigenvector of $AA^T$.
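
The defining relations can be checked numerically. The sketch below assumes NumPy and a random test matrix; it takes the top singular triple from `numpy.linalg.svd` and measures how well it satisfies $Av = \sigma u$, $A^T u = \sigma v$, and the two eigenvector relations.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))          # arbitrary test matrix (assumption)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
sigma, u, v = s[0], U[:, 0], Vt[0]       # largest singular value and its vectors

# Residuals of the defining relations; all should be ~0.
res_Av = np.linalg.norm(A @ v - sigma * u)
res_ATu = np.linalg.norm(A.T @ u - sigma * v)

# v is an eigenvector of A^T A and u of A A^T, both with eigenvalue sigma^2.
res_eig_v = np.linalg.norm(A.T @ A @ v - sigma**2 * v)
res_eig_u = np.linalg.norm(A @ A.T @ u - sigma**2 * u)

# The largest singular value equals the spectral norm ||A||_2.
spectral_norm = np.linalg.norm(A, 2)

print(res_Av, res_ATu, res_eig_v, res_eig_u, sigma - spectral_norm)
```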

11 Singular Value Decomposition (SVD). Theorem: For every $m \times n$ real matrix A, there exists a singular value decomposition $A = U \Sigma V^T$, where: $\sigma_1 \ge \dots \ge \sigma_p \ge 0$ ($p = \min(m,n)$) are the singular values of A; $\Sigma = \mathrm{Diag}(\sigma_1,\dots,\sigma_p)$ is an $m \times n$ matrix; U is a column-orthogonal $m \times m$ matrix ($U^T U = I$); V is a column-orthogonal $n \times n$ matrix ($V^T V = I$).
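
As an illustration of the theorem, the following sketch (assuming NumPy and a random $4 \times 6$ matrix) computes a full SVD, rebuilds the rectangular middle factor from the vector of singular values, and reconstructs A.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 4, 6
A = rng.standard_normal((m, n))          # arbitrary test matrix (assumption)

# full_matrices=True returns square U (m x m) and V^T (n x n);
# s holds the p = min(m, n) singular values in non-increasing order.
U, s, Vt = np.linalg.svd(A, full_matrices=True)

# Embed the singular values on the diagonal of an m x n matrix.
Sigma = np.zeros((m, n))
Sigma[:len(s), :len(s)] = np.diag(s)

reconstructed = U @ Sigma @ Vt
print(np.allclose(reconstructed, A))
```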

12 Singular Values vs. Eigenvalues. $A = U \Sigma V^T$. $\sigma_1,\dots,\sigma_p$: singular values of A; $\sigma_1^2,\dots,\sigma_p^2$: eigenvalues of $A^T A$ and $AA^T$. $u_1,\dots,u_m$: columns of U; orthonormal basis of $\mathbb{R}^m$; left singular vectors of A; eigenvectors of $AA^T$. $v_1,\dots,v_n$: columns of V; orthonormal basis of $\mathbb{R}^n$; right singular vectors of A; eigenvectors of $A^T A$.

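13 Economy SVD. Let $r = \max\{i : \sigma_i > 0\}$, so that $\sigma_{r+1} = \dots = \sigma_p = 0$ and $\mathrm{rank}(A) = r$. $u_1,\dots,u_r$: left singular vectors; $v_1,\dots,v_r$: right singular vectors. Keeping only these, $A = U \Sigma V^T$ with U of size $m \times r$, $\Sigma$ of size $r \times r$, and $V^T$ of size $r \times n$; hence $U^T A = \Sigma V^T$.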

14 LSI as SVD. $U^T A = \Sigma V^T$. $u_1,\dots,u_r$: concept basis. $B = \Sigma V^T$: LSI matrix. $A_d$: d-th column of A. $B_d$: d-th column of B. $B_d = U^T A_d$, so $B_d[c] = u_c^T A_d$.
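
A small sketch of the LSI matrix, assuming NumPy and a made-up rank-2 term-document matrix. It builds the economy SVD by dropping numerically zero singular values (the `1e-10` cutoff is an assumption) and checks that projecting the documents onto the concept basis, $U^T A$, gives the same matrix as $\Sigma V^T$.

```python
import numpy as np

# Hypothetical 4-term x 3-document matrix of rank 2 (documents 0 and 1 are identical).
A = np.array([
    [1.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 2.0],
    [0.0, 0.0, 1.0],
])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
r = int(np.sum(s > 1e-10))                 # numerical rank; the cutoff is an assumption
U_r, s_r, Vt_r = U[:, :r], s[:r], Vt[:r]   # economy SVD: m x r, r, r x n

B = U_r.T @ A                  # documents expressed in the concept basis u_1..u_r
B_alt = np.diag(s_r) @ Vt_r    # the same matrix written as Sigma V^T

# Column d of B is the concept-space representation of document d.
print(r, B.shape)
print(np.allclose(B, B_alt))
```

Because the columns of A lie in the span of $u_1,\dots,u_r$, the change of basis keeps all pairwise document inner products: `B.T @ B` equals `A.T @ A` up to rounding.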

15 Noisy Concepts. $B = U^T A = \Sigma V^T$, so $B_d[c] = \sigma_c\, v_c[d]$. If $\sigma_c$ is small, then $B_d[c]$ is small for all d (since $|v_c[d]| \le 1$). Let k = largest i such that $\sigma_i$ is “large”. For all $c = k+1,\dots,r$ and for all d, c is a low-weight concept in d. Main idea: filter out all concepts $c = k+1,\dots,r$. Space-efficient: number of index terms = k (vs. r or m). Better retrieval: noisy concepts are filtered out across the board.

16 Low-rank SVD. $B = U^T A = \Sigma V^T$. $U_k = (u_1,\dots,u_k)$, $V_k = (v_1,\dots,v_k)$, $\Sigma_k$ = upper-left $k \times k$ sub-matrix of $\Sigma$. $A_k = U_k \Sigma_k V_k^T$ and $B_k = \Sigma_k V_k^T$, with $\mathrm{rank}(A_k) = \mathrm{rank}(B_k) = k$.
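
The sketch below builds the rank-k matrices on a made-up corpus and uses them for retrieval, assuming NumPy. Two choices are assumptions rather than something the slides specify: folding a query q into concept space as $U_k^T q$, and ranking by cosine similarity. In the toy corpus, "car" and "auto" never co-occur but both co-occur with "engine".

```python
import numpy as np

# Hypothetical corpus: rows are terms, columns are documents d0..d3.
terms = ["car", "auto", "engine", "java", "coffee"]
A = np.array([
    [1.0, 0.0, 0.0, 0.0],   # car
    [0.0, 1.0, 0.0, 0.0],   # auto
    [1.0, 1.0, 0.0, 0.0],   # engine
    [0.0, 0.0, 1.0, 1.0],   # java
    [0.0, 0.0, 1.0, 1.0],   # coffee
])

k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, Sigma_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k]

A_k = U_k @ Sigma_k @ Vt_k   # rank-k approximation of A (m x n)
B_k = Sigma_k @ Vt_k         # documents in k-dimensional concept space (k x n)

def cosine(x, y):
    """Cosine similarity; defined as 0 when either vector is numerically zero."""
    nx, ny = np.linalg.norm(x), np.linalg.norm(y)
    return 0.0 if nx < 1e-12 or ny < 1e-12 else float(x @ y / (nx * ny))

q = np.array([1.0, 0.0, 0.0, 0.0, 0.0])   # query "car" in term space
q_k = U_k.T @ q                            # query folded into concept space (assumption)

raw_scores = [cosine(q, A[:, d]) for d in range(A.shape[1])]
lsi_scores = [cosine(q_k, B_k[:, d]) for d in range(A.shape[1])]
print(raw_scores)
print(lsi_scores)
```

In term space the query "car" has zero similarity to d1, which only mentions "auto" and "engine". In the 2-dimensional concept space d0 and d1 collapse onto the same concept, so d1 is retrieved as well, while the java/coffee documents stay orthogonal to the query.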

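17 Low Dimensional Embedding. Frobenius norm: $\|A\|_F = \sqrt{\sum_{i,j} A_{i,j}^2}$. Fact: $\|A - A_k\|_F^2 = \sum_{i=k+1}^{r} \sigma_i^2$. Therefore, if $\|A - A_k\|_F$ is small, then for “most” d, d', $\langle A^k_d, A^k_{d'} \rangle \approx \langle A_d, A_{d'} \rangle$, where $A^k_d$ is the d-th column of $A_k$. $A_k$ preserves pairwise similarities among documents, so it is at least as good as A for retrieval.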
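
A standard property of the truncated SVD is that the squared Frobenius error of $A_k$ equals the sum of the squared discarded singular values. The sketch below, assuming NumPy and a random test matrix, checks this numerically and also measures how far the pairwise document similarities $A_k^T A_k$ drift from $A^T A$.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((8, 6))          # arbitrary test matrix (assumption)
k = 3

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]

err_sq = np.linalg.norm(A - A_k, "fro") ** 2   # ||A - A_k||_F^2
tail_sq = float(np.sum(s[k:] ** 2))            # sum of discarded sigma_i^2

# Change in the matrix of pairwise document similarities.
# A^T A - A_k^T A_k = sum_{i>k} sigma_i^2 v_i v_i^T, whose Frobenius norm
# is sqrt(sum_{i>k} sigma_i^4).
sim_gap = np.linalg.norm(A.T @ A - A_k.T @ A_k, "fro")
sim_gap_expected = float(np.sqrt(np.sum(s[k:] ** 4)))

print(err_sq, tail_sq)
print(sim_gap, sim_gap_expected)
```

When the discarded singular values are small, both quantities are small, which is the sense in which the embedding keeps document similarities roughly intact.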

18 Why is LSI Better? [Papadimitriou et al. 1998] [Azar et al. 2001]. LSI summary: documents are embedded in a low-dimensional space ($m \to k$); pairwise similarities are preserved; more space-efficient. But why is retrieval better? Synonymy; polysemy.

19 Generative Model. T: term space, $|T| = m$. A concept c: a distribution on T. C: concept space, $|C| = k$. C': space of all convex combinations of concepts. D: distribution on $C' \times \mathbb{N}$. A corpus model: $M = (T, C', D)$. A document d is generated as follows: sample (w, n) according to D; then repeat n times: sample a concept c from C according to w, and sample a term t from T according to c.
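
The two-level sampling procedure can be sketched with the Python standard library. Everything concrete here is hypothetical: the two toy concepts, their term probabilities, and the stand-in for D (a uniformly chosen pure topic and a length between 5 and 10).

```python
import random

# Hypothetical toy concepts: each concept is a distribution over terms.
CONCEPTS = [
    {"car": 0.5, "auto": 0.3, "engine": 0.2},      # concept 0
    {"java": 0.6, "coffee": 0.3, "island": 0.1},   # concept 1
]

def sample_w_and_n(rng):
    """Stand-in for the distribution D: returns a pure-topic mixture w and a length n."""
    topic = rng.randrange(len(CONCEPTS))
    w = [1.0 if c == topic else 0.0 for c in range(len(CONCEPTS))]
    n = rng.randint(5, 10)
    return w, n

def generate_document(rng):
    """Generate one document: sample (w, n), then n times sample a concept and a term."""
    w, n = sample_w_and_n(rng)
    doc = []
    for _ in range(n):
        c = rng.choices(range(len(CONCEPTS)), weights=w)[0]   # concept c ~ w
        terms = list(CONCEPTS[c])
        probs = list(CONCEPTS[c].values())
        doc.append(rng.choices(terms, weights=probs)[0])      # term t ~ c
    return w, doc

rng = random.Random(0)
corpus = [generate_document(rng) for _ in range(20)]
for w, doc in corpus[:3]:
    print(w, doc)
```

With one-hot mixtures every document draws all of its terms from a single concept; replacing `sample_w_and_n` with one that returns a general convex combination gives mixed-topic documents.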

20 Simplifying Assumptions. A: $m \times n$ term-document matrix, representing n instantiations of the model. $D_c$: documents whose topic is the concept c. $T_c$: terms in supp(c). Assumptions: (1) every document has a single topic (C' = C); (2) for every two concepts c, c', $\|c - c'\| \ge 1 - \epsilon$; (3) the probability of every term under a concept c is at most some constant $\tau$.

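21 LSI Works. Theorem [Papadimitriou et al. 1998]: Given the above assumptions, with high probability, for every two documents d, d': if d, d' have the same topic, then $A^k_d$ and $A^k_{d'}$ are nearly parallel; if d, d' have different topics, then $A^k_d$ and $A^k_{d'}$ are nearly orthogonal. Here $A^k_d$ denotes the d-th column of $A_k$.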

22 Proof. For simplicity, assume $\epsilon = 0$. Want to show: (1) if d, d' are on the same topic, $A^k_d$ and $A^k_{d'}$ are in the same direction; (2) if d, d' are on different topics, $A^k_d$ and $A^k_{d'}$ are orthogonal. A has non-zeroes only in blocks $B_1,\dots,B_k$, where $B_c$ is the sub-matrix of A with rows in $T_c$ and columns in $D_c$ (not to be confused with the LSI matrix B). $A^T A$ is a block-diagonal matrix with blocks $B_1^T B_1,\dots,B_k^T B_k$. The (i,j)-th entry of $B_c^T B_c$ is the term similarity between the i-th and j-th documents on the concept c. $B_c^T B_c$: adjacency matrix of a bipartite (multi-)graph $G_c$ on $D_c$.

23 Proof (cont.). $G_c$ is a “random” graph, so the first and second eigenvalues of $B_c^T B_c$ are well separated. For all c, c', the second eigenvalue of $B_c^T B_c$ is smaller than the first eigenvalue of $B_{c'}^T B_{c'}$. Hence the top k eigenvalues of $A^T A$ are the principal eigenvalues of $B_c^T B_c$ for $c = 1,\dots,k$. Let $u_1,\dots,u_k$ be the corresponding left singular vectors of A (eigenvectors of $AA^T$ with the same eigenvalues); each $u_c$ is supported on $T_c$. For every document d on topic c, $A_d$ is orthogonal to all of $u_1,\dots,u_k$ except $u_c$, so $A^k_d$ is a scalar multiple of $u_c$.
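
The block argument can be seen on a tiny deterministic example, assuming NumPy. The matrix below is hypothetical: two topics with disjoint term sets, with the second eigenvalue of each block below the principal eigenvalue of the other, so the top $k = 2$ singular directions pick one concept per block.

```python
import numpy as np

# Hypothetical term-document matrix with two disjoint topics.
# Block 1: terms 0-2 x documents 0-2.  Block 2: terms 3-4 x documents 3-4.
A = np.zeros((5, 5))
A[:3, :3] = [[2, 1, 1], [1, 2, 1], [1, 1, 2]]   # singular values 4, 1, 1
A[3:, 3:] = [[2, 1], [1, 2]]                    # singular values 3, 1

k = 2
U, s, Vt = np.linalg.svd(A)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]        # rank-k approximation

def cosine(x, y):
    """Cosine similarity of two nonzero vectors."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

same_topic_raw = cosine(A[:, 0], A[:, 1])       # documents 0 and 1 in term space
same_topic = cosine(A_k[:, 0], A_k[:, 1])       # same pair after rank-k projection
diff_topic = cosine(A_k[:, 0], A_k[:, 3])       # documents from different topics

print(same_topic_raw, same_topic, diff_topic)
```

In term space the two same-topic documents have cosine 5/6; after projection onto the top two singular directions they become exactly parallel, while documents from different topics remain orthogonal.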

24 Extensions [Azar et al. 2001]. A more general generative model. Also explains the improved treatment of polysemy.

25 Computing SVD. Compute the singular values of A by computing the eigenvalues of $A^T A$. Compute U and V by computing the eigenvectors of $AA^T$ and $A^T A$, respectively. Running time is not too good: $O(m^2 n + m n^2)$, which is not practical for huge corpora. There are sub-linear time algorithms for estimating $A_k$ [Frieze, Kannan, Vempala 1998].
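
A rough sketch of the eigenvalue route, assuming NumPy. The helper name `svd_via_eigh` and its tolerance are made up for illustration. One deliberate deviation from the slide: instead of a second eigendecomposition of $AA^T$, the left singular vectors are obtained as $u_i = A v_i / \sigma_i$, which keeps the signs of $u_i$ and $v_i$ consistent. Forming $A^T A$ squares the condition number, so this is for illustration rather than production use.

```python
import numpy as np

def svd_via_eigh(A, tol=1e-10):
    """Economy SVD of A from the eigendecomposition of A^T A (illustrative only)."""
    eigvals, V = np.linalg.eigh(A.T @ A)       # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]          # reorder to descending
    eigvals, V = eigvals[order], V[:, order]
    sigma = np.sqrt(np.clip(eigvals, 0.0, None))
    keep = sigma > tol                         # drop numerically zero singular values
    sigma, V = sigma[keep], V[:, keep]
    U = (A @ V) / sigma                        # u_i = A v_i / sigma_i, column by column
    return U, sigma, V

rng = np.random.default_rng(4)
A = rng.standard_normal((6, 4))                # arbitrary test matrix (assumption)

U, sigma, V = svd_via_eigh(A)
reference = np.linalg.svd(A, compute_uv=False)
print(sigma)
print(reference)
```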

26 HITS and SVD. A: adjacency matrix of a web (sub-)graph G. a: authority vector. h: hub vector. a is the principal eigenvector of $A^T A$; h is the principal eigenvector of $AA^T$. Therefore, a and h give $A_1$, the rank-1 SVD of A. Generalization: using $A_k$, we can get k authority and hub vectors, corresponding to other topics in G.
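
The connection can be checked on a small graph, assuming NumPy. The 4-node graph, the iteration count, and the use of plain power iteration with 2-norm normalization are all assumptions made for this sketch; the resulting authority and hub vectors are compared with the top right and left singular vectors of A.

```python
import numpy as np

# Hypothetical 4-node web graph: A[i, j] = 1 if page i links to page j.
A = np.array([
    [0.0, 1.0, 1.0, 0.0],   # 0 -> 1, 0 -> 2
    [0.0, 0.0, 1.0, 0.0],   # 1 -> 2
    [0.0, 0.0, 0.0, 0.0],   # 2 has no out-links
    [0.0, 0.0, 1.0, 0.0],   # 3 -> 2
])

def hits(A, iterations=100):
    """Power iteration: hubs point to good authorities, authorities are pointed to by good hubs."""
    a = np.ones(A.shape[1])
    h = np.ones(A.shape[0])
    for _ in range(iterations):
        h = A @ a                 # hub score: sum of authority scores of linked pages
        h /= np.linalg.norm(h)
        a = A.T @ h               # authority score: sum of hub scores of linking pages
        a /= np.linalg.norm(a)
    return a, h

a, h = hits(A)
U, s, Vt = np.linalg.svd(A)

# Singular vectors are defined up to a joint sign flip, so compare absolute values.
print(a, np.abs(Vt[0]))
print(h, np.abs(U[:, 0]))

A_1 = s[0] * np.outer(h, a)       # rank-1 approximation built from hub and authority vectors
```

Page 2 comes out as the top authority and page 0 as the top hub, and `A_1` matches the rank-1 truncated SVD of A.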

27 End of Lecture 4
