# Algorithms for Large Data Sets: Ziv Bar-Yossef, Lecture 6, May 7, 2006

1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 6 May 7, 2006 http://www.ee.technion.ac.il/courses/049011

2 Principal Eigenvector Computation
- $E$: $n \times n$ matrix
- $|\lambda_1| > |\lambda_2| \ge |\lambda_3| \ge \dots \ge |\lambda_n|$: eigenvalues of $E$
  - Suppose $\lambda_1 > 0$
- $v_1,\dots,v_n$: corresponding eigenvectors
  - Eigenvectors form a basis
  - Suppose $\|v_1\|_2 = 1$
- Input:
  - The matrix $E$
  - A unit vector $u$ that is not in $\mathrm{span}(v_2,\dots,v_n)$
- Goal: compute $\lambda_1$ and $v_1$

3 The Power Method

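4 Why Does It Work?
- Theorem: as $t \to \infty$, $w \to \pm v_1$ (where $w$ is the power method's iterate after $t$ steps)
- Convergence rate: proportional to $(\lambda_2 / \lambda_1)^t$
- The larger the "spectral gap" $|\lambda_1| - |\lambda_2|$, the faster the convergence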
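The body of slide 3 is not included in this transcript. The following is a minimal sketch of the standard power iteration (repeatedly multiply by $E$ and renormalize), which may differ in detail from the version on the slide. The function name `power_method`, the fixed iteration count, and the 2 × 2 example matrix are illustrative assumptions.

```python
import math

def power_method(E, u, iterations=100):
    """Estimate the principal eigenvalue and eigenvector of a square matrix E.

    E is a list of rows; u is a starting vector with a non-zero component
    along the principal eigenvector v_1.
    """
    n = len(E)
    w = list(u)
    for _ in range(iterations):
        # Multiply by E, then renormalize to unit Euclidean length.
        x = [sum(E[i][j] * w[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(xi * xi for xi in x))
        w = [xi / norm for xi in x]
    # Since w has unit norm, w^T E w estimates lambda_1.
    Ew = [sum(E[i][j] * w[j] for j in range(n)) for i in range(n)]
    lam = sum(wi * ewi for wi, ewi in zip(w, Ew))
    return lam, w

# Hypothetical example: eigenvalues 3 and 1, principal eigenvector (1, 1)/sqrt(2).
E = [[2.0, 1.0], [1.0, 2.0]]
lam, w = power_method(E, [1.0, 0.0])
```

For this matrix $\lambda_2 / \lambda_1 = 1/3$, so the error in `w` shrinks by roughly a factor of 3 per iteration, and `lam` approaches 3.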

5 Spectral Methods in Information Retrieval

6 Outline
- Motivation: synonymy and polysemy
- Latent Semantic Indexing (LSI)
- Singular Value Decomposition (SVD)
- LSI via SVD
- Why does LSI work?
- HITS and SVD

7 Synonymy and Polysemy
- Synonymy: multiple terms with (almost) the same meaning
  - Ex: cars, autos, vehicles
  - Harms recall
- Polysemy: a term with multiple meanings
  - Ex: java (programming language, coffee, island)
  - Harms precision

8 Traditional Solutions
- Query expansion
  - Synonymy: OR on all synonyms
    - Manual/automatic use of thesauri
    - Too few synonyms: recall still low
    - Too many synonyms: harms precision
  - Polysemy: AND on term and additional specializing terms
    - Ex: +java +"programming language"
    - Too broad terms: precision still low
    - Too narrow terms: harms recall

9 Syntactic Indexing
- $D$: document collection, $|D| = n$
- $T$: term space, $|T| = m$
- $A$: $m \times n$ term-document matrix (rows: terms, columns: documents)
- $A_{t,d}$: "weight" of $t$ in $d$ (e.g., TFIDF)
- $A^T A$: pairwise document similarities
- $A A^T$: pairwise term similarities
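A small sketch of the two similarity matrices, assuming raw term counts as the weights (rather than TFIDF) and a made-up three-document corpus. The helper names `transpose` and `matmul` are illustrative.

```python
def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(X, Y):
    Yt = transpose(Y)
    return [[sum(a * b for a, b in zip(row, col)) for col in Yt] for row in X]

terms = ["car", "auto", "engine", "java"]
# Hypothetical corpus; columns: d0 = "car engine", d1 = "auto engine", d2 = "java"
A = [
    [1, 0, 0],  # car
    [0, 1, 0],  # auto
    [1, 1, 0],  # engine
    [0, 0, 1],  # java
]

doc_sim = matmul(transpose(A), A)    # n x n: A^T A
term_sim = matmul(A, transpose(A))   # m x m: A A^T
```

Here `term_sim[0][1]` (car vs. auto) is 0 because the two synonyms never co-occur in a document, which is the synonymy problem in matrix form.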

10 Latent Semantic Indexing (LSI) [Deerwester et al. 1990]
- $C$: concept space, $|C| = r$
- $B$: $r \times n$ concept-document matrix (rows: concepts, columns: documents)
- Documents & query: "mixtures" of concepts
- Given a query, finds the most similar documents
- Bridges the syntax-semantics gap

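11 Fourier Transform
- Time domain: a signal written as a weighted sum of base signals (figure: 3 × one base signal + 1.1 × another)
- Frequency domain: the same signal represented by its coefficients per frequency (figure: 3 and 1.1)
- Compact discrete representation
- Effective for noise removal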

12 Latent Semantic Indexing
- Documents, queries ~ signals
  - Vectors in $\mathbb{R}^m$
- Concepts ~ base signals
  - Orthonormal basis of span(columns of $A$)
- Semantic indexing of a document ~ Fourier transform of a signal
  - Representation of document in concept basis
- Advantages
  - Space-efficient
  - Better handling of synonymy and polysemy
  - Removal of "noise"

13 Open Questions
- How to choose the concept basis?
- How to transform the syntactic index into a semantic index?
- How to filter out "noisy concepts"?

14 Singular Values
- $A$: $m \times n$ real matrix
- Definition: $\sigma \ge 0$ is a singular value of $A$ if there exists a pair of vectors $u, v$ s.t. $Av = \sigma u$ and $A^T u = \sigma v$. $u$ and $v$ are called singular vectors.
- Ex: $\sigma = \|A\|_2 = \max_{\|x\|_2 = 1} \|Ax\|_2$.
  - Corresponding singular vectors: $x$ that maximizes $\|Ax\|_2$ and $y = Ax / \|A\|_2$.
- Note: $A^T A v = \sigma^2 v$ and $A A^T u = \sigma^2 u$
  - $\sigma^2$ is an eigenvalue of $A^T A$ and $A A^T$
  - $v$ is an eigenvector of $A^T A$
  - $u$ is an eigenvector of $A A^T$
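A small worked example of the definition, using a hypothetical 2 × 2 matrix that is not from the lecture:

$$
A = \begin{pmatrix} 1 & 1 \\ 0 & 0 \end{pmatrix}, \quad
v = \tfrac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \quad
u = \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \quad
\sigma = \sqrt{2}
$$

$$
Av = \begin{pmatrix} \sqrt{2} \\ 0 \end{pmatrix} = \sigma u, \qquad
A^T u = \begin{pmatrix} 1 \\ 1 \end{pmatrix} = \sigma v
$$

$$
A^T A = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}, \quad A^T A v = 2v = \sigma^2 v; \qquad
A A^T = \begin{pmatrix} 2 & 0 \\ 0 & 0 \end{pmatrix}, \quad A A^T u = 2u = \sigma^2 u
$$

So $v$ is an eigenvector of $A^T A$ and $u$ is an eigenvector of $A A^T$, both with eigenvalue $\sigma^2 = 2$. In this example $\sigma$ also equals $\|A\|_2$.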

15 Singular Value Decomposition (SVD)
- Theorem: For every $m \times n$ real matrix $A$, there exists a singular value decomposition $A = U \Sigma V^T$
  - $\sigma_1 \ge \dots \ge \sigma_r > 0$ ($r = \mathrm{rank}(A)$): singular values of $A$
  - $\Sigma = \mathrm{Diag}(\sigma_1,\dots,\sigma_r)$
  - $U$: column-orthonormal $m \times r$ matrix ($U^T U = I$)
  - $V$: column-orthonormal $n \times r$ matrix ($V^T V = I$)

16 Singular Values vs. Eigenvalues
- $A = U \Sigma V^T$
- $\sigma_1,\dots,\sigma_r$: singular values of $A$
  - $\sigma_1^2,\dots,\sigma_r^2$: non-zero eigenvalues of $A^T A$ and $A A^T$
- $u_1,\dots,u_r$: columns of $U$
  - Orthonormal basis for span(columns of $A$)
  - Left singular vectors of $A$
  - Eigenvectors of $A A^T$
- $v_1,\dots,v_r$: columns of $V$
  - Orthonormal basis for span(rows of $A$)
  - Right singular vectors of $A$
  - Eigenvectors of $A^T A$

17 LSI as SVD
- $A = U \Sigma V^T \Rightarrow U^T A = \Sigma V^T$
- $u_1,\dots,u_r$: concept basis
- $B = \Sigma V^T$: LSI matrix (semantic index)
- $A_d$: $d$-th column of $A$
- $B_d$: $d$-th column of $B$
- $B_d = U^T A_d$
- $B_d[c] = u_c^T A_d$

18 Noisy Concepts
- $B = U^T A = \Sigma V^T$
- $B_d[c] = \sigma_c v_d[c]$, where $v_d[c] = V_{d,c}$ is the $d$-th entry of $v_c$
- If $\sigma_c$ is small, then $B_d[c]$ is small for all $d$
- $k$ = largest $i$ s.t. $\sigma_i$ is "large"
- For all $c = k+1,\dots,r$ and for all $d$, $c$ is a low-weight concept in $d$
- Main idea: filter out all concepts $c = k+1,\dots,r$
  - Space efficient: # of index terms = $k$ (vs. $r$ or $m$)
  - Better retrieval: noisy concepts are filtered out across the board
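The slide leaves "large" undefined. One possible reading, sketched below, keeps every concept whose singular value is at least a fixed fraction of $\sigma_1$. The function name `choose_k`, the `ratio` parameter, and its default of 0.1 are assumptions made for illustration, not a rule from the lecture.

```python
def choose_k(singular_values, ratio=0.1):
    """Return the largest k such that sigma_k >= ratio * sigma_1.

    singular_values must be sorted in non-increasing order.
    """
    threshold = ratio * singular_values[0]
    k = 0
    for sigma in singular_values:
        if sigma < threshold:
            break
        k += 1
    return k

# Hypothetical spectrum: two strong concepts, two noisy ones.
k = choose_k([5.0, 4.2, 0.3, 0.1])
```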

19 Low-rank SVD
- $B = U^T A = \Sigma V^T$
- $U_k = (u_1,\dots,u_k)$
- $V_k = (v_1,\dots,v_k)$
- $\Sigma_k$ = upper-left $k \times k$ sub-matrix of $\Sigma$
- $A_k = U_k \Sigma_k V_k^T$
- $B_k = \Sigma_k V_k^T$
- $\mathrm{rank}(A_k) = \mathrm{rank}(B_k) = k$
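A minimal sketch of computing $B_k$ on a toy corpus. It finds the top $k$ singular triplets by power iteration on $A^T A$ with deflation, which is one simple way to do it and not necessarily the method intended in the lecture. The names `top_k_svd` and `cosine`, the starting vectors, the iteration count, and the 5 × 4 matrix are all illustrative assumptions.

```python
import math

def matvec(M, x):
    return [sum(a * b for a, b in zip(row, x)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def cosine(x, y):
    return dot(x, y) / math.sqrt(dot(x, x) * dot(y, y))

def top_k_svd(A, k, iterations=500):
    """Return [(sigma_c, u_c, v_c)] for the k largest singular values of A."""
    At = transpose(A)
    n = len(A[0])
    triplets = []
    for c in range(k):
        v = [1.0 + 0.1 * (i + c) for i in range(n)]  # arbitrary starting vector
        for _step in range(iterations):
            # Deflation: remove components along right singular vectors found so far.
            for _sigma, _u, prev_v in triplets:
                proj = dot(v, prev_v)
                v = [a - proj * b for a, b in zip(v, prev_v)]
            v = matvec(At, matvec(A, v))  # one power step on A^T A
            norm = math.sqrt(dot(v, v))
            v = [a / norm for a in v]
        Av = matvec(A, v)
        sigma = math.sqrt(dot(Av, Av))    # sigma = ||A v||_2
        u = [a / sigma for a in Av]       # u = A v / sigma
        triplets.append((sigma, u, v))
    return triplets

# Hypothetical corpus; columns: d0 = "car engine", d1 = "auto engine",
# d2 = "java coffee", d3 = "java"
A = [
    [1, 0, 0, 0],  # car
    [0, 1, 0, 0],  # auto
    [1, 1, 0, 0],  # engine
    [0, 0, 1, 1],  # java
    [0, 0, 1, 0],  # coffee
]
k = 2
svd = top_k_svd(A, k)
# B_k = Sigma_k V_k^T: column d holds document d's weight on each of the k concepts.
B_k = [[sigma * v[d] for d in range(len(A[0]))] for sigma, _u, v in svd]
cols_A = transpose(A)
cols_B = transpose(B_k)
```

In term space, d0 ("car engine") and d1 ("auto engine") have cosine similarity 0.5. In the 2-dimensional concept space their columns of `B_k` point in essentially the same direction, while d0 and d2 stay essentially orthogonal.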

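20 Low Dimensional Embedding
- Theorem: If [expression missing from the transcript] is small, then for "most" $d, d'$, [expression missing from the transcript].
- $A_k$ preserves pairwise similarities among documents $\Rightarrow$ at least as good as $A$ for retrieval.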

21 Why is LSI Better? [Papadimitriou et al. 1998] [Azar et al. 2001]
- LSI summary
  - Documents are embedded in low dimensional space ($m \to k$)
  - Pairwise similarities are preserved
  - More space-efficient
- But why is retrieval better?
  - Synonymy
  - Polysemy

22 Generative Model
- A corpus model $M = (T, C, W, D)$
  - $T$: term space, $|T| = m$
  - $C$: concept space, $|C| = k$
    - Concept: distribution over terms
  - $W$: topic space
    - Topic: distribution over concepts
  - $D$: document distribution
    - Distribution over $W \times \mathbb{N}$
- A document $d$ is generated as follows:
  - Sample a topic $w$ and a length $n$ according to $D$
  - Repeat $n$ times:
    - Sample a concept $c$ from $C$ according to $w$
    - Sample a term $t$ from $T$ according to $c$
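The sampling procedure above can be sketched as follows. The dictionary-based encoding of the model, the function name `generate_document`, and the toy concepts, topics, and probabilities are all hypothetical; only the three sampling steps come from the slide.

```python
import random

def generate_document(concepts, topics, doc_dist, rng):
    """Sample one document (a list of terms) from a corpus model M = (T, C, W, D).

    concepts: concept name -> {term: probability}     (C, distributions over T)
    topics:   topic name   -> {concept: probability}  (W, distributions over C)
    doc_dist: (topic name, length) -> probability     (D, a distribution over W x N)
    """
    # Sample a topic w and a length n according to D.
    pairs = list(doc_dist)
    topic_name, length = rng.choices(pairs, weights=[doc_dist[p] for p in pairs], k=1)[0]
    topic = topics[topic_name]
    doc = []
    for _ in range(length):
        # Sample a concept c according to w, then a term t according to c.
        concept_name = rng.choices(list(topic), weights=list(topic.values()), k=1)[0]
        concept = concepts[concept_name]
        term = rng.choices(list(concept), weights=list(concept.values()), k=1)[0]
        doc.append(term)
    return doc

# Toy model with single-concept topics and disjoint term supports.
concepts = {
    "vehicles": {"car": 0.4, "auto": 0.4, "engine": 0.2},
    "coffee": {"java": 0.5, "coffee": 0.5},
}
topics = {"pure-vehicles": {"vehicles": 1.0}, "pure-coffee": {"coffee": 1.0}}
doc_dist = {("pure-vehicles", 5): 0.5, ("pure-coffee", 3): 0.5}

doc = generate_document(concepts, topics, doc_dist, random.Random(0))
```

Because each toy topic puts all of its weight on one concept, every generated document draws its terms from a single concept's support.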

23 Simplifying Assumptions
- Every document has a single topic ($W = C$)
- For every two concepts $c, c'$, $\|c - c'\| \ge 1 - \epsilon$
- The probability of every term under a concept $c$ is at most some constant.

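24 LSI Works
- $A$: $m \times n$ term-document matrix, representing $n$ documents generated according to the model
- $A^k_d$: $d$-th column of the rank-$k$ approximation $A_k$
- Theorem [Papadimitriou et al. 1998]: With high probability, for every two documents $d, d'$,
  - If topic($d$) = topic($d'$), then $A^k_d$ and $A^k_{d'}$ are nearly parallel
  - If topic($d$) ≠ topic($d'$), then $A^k_d$ and $A^k_{d'}$ are nearly orthogonal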

25 Proof
- For simplicity, assume $\epsilon = 0$
- Want to show:
  - If topic($d$) = topic($d'$), $A^k_d \parallel A^k_{d'}$
  - If topic($d$) ≠ topic($d'$), $A^k_d \perp A^k_{d'}$
- $D_c$: documents whose topic is the concept $c$
- $T_c$: terms in supp($c$)
  - Since $\|c - c'\| = 1$, $T_c \cap T_{c'} = \emptyset$
- $A$ has non-zeroes only in blocks $B_1,\dots,B_k$, where $B_c$ is the sub-matrix of $A$ with rows in $T_c$ and columns in $D_c$
- $A^T A$ is a block diagonal matrix with blocks $B_1^T B_1,\dots,B_k^T B_k$
- $(i,j)$-th entry of $B_c^T B_c$: term similarity between the $i$-th and $j$-th documents whose topic is the concept $c$
- $B_c^T B_c$: adjacency matrix of a bipartite (multi-)graph $G_c$ on $D_c$

26 Proof (cont.)
- $G_c$ is a "random" graph
  - First and second eigenvalues of $B_c^T B_c$ are well separated
- For all $c, c'$, the second eigenvalue of $B_c^T B_c$ is smaller than the first eigenvalue of $B_{c'}^T B_{c'}$
  - Top $k$ eigenvalues of $A^T A$ are the principal eigenvalues of $B_c^T B_c$ for $c = 1,\dots,k$
- Let $u_1,\dots,u_k$ be the corresponding eigenvectors
- For every document $d$ on topic $c$, $A_d$ is orthogonal to all of $u_1,\dots,u_k$ except $u_c$
  - $A^k_d$ is a scalar multiple of $u_c$

27 Extensions [Azar et al. 2001]
- A more general generative model
- Also explains the improved treatment of polysemy

28 Computing SVD
- Compute the singular values of $A$ by computing the eigenvalues of $A^T A$
- Compute $U$ and $V$ by computing the eigenvectors of $A A^T$ and $A^T A$, respectively
- Running time not too good: $O(m^2 n + m n^2)$
  - Not practical for huge corpora
- Sub-linear time algorithms for estimating $A_k$ [Frieze, Kannan, Vempala 1998]

29 HITS and SVD
- $A$: adjacency matrix of a web (sub-)graph $G$
- $a$: authority vector
- $h$: hub vector
- $a$ is the principal eigenvector of $A^T A$
- $h$ is the principal eigenvector of $A A^T$
- Therefore: $a$ and $h$ give $A_1$, the rank-1 SVD of $A$
- Generalization: using $A_k$, we can get $k$ authority and hub vectors, corresponding to other topics in $G$.
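A minimal sketch of the usual HITS iteration, which is not spelled out on the slide: alternate $a \leftarrow A^T h$ and $h \leftarrow A a$ with normalization, which amounts to power iteration on $A^T A$ and $A A^T$. The convention `adj[i][j] == 1` for a link from page $i$ to page $j$, the function names, and the 4-page graph are illustrative assumptions.

```python
import math

def normalize(x):
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x]

def hits(adj, iterations=100):
    """Return (authority, hub) vectors for adjacency matrix adj."""
    n = len(adj)
    a = [1.0] * n
    h = [1.0] * n
    for _ in range(iterations):
        # a <- A^T h: authority of j sums the hub scores of pages linking to j.
        a = normalize([sum(adj[i][j] * h[i] for i in range(n)) for j in range(n)])
        # h <- A a: hub score of i sums the authority of pages that i links to.
        h = normalize([sum(adj[i][j] * a[j] for j in range(n)) for i in range(n)])
    return a, h

# Hypothetical graph: page 0 links to pages 2 and 3; page 1 links to page 2.
adj = [
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
a, h = hits(adj)
```

Page 2 ends up with the highest authority score and page 0 with the highest hub score. Pages 0 and 1 have no in-links, so their authority is 0.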

30 End of Lecture 6
