1 Algorithms for Large Data Sets
Ziv Bar-Yossef
Lecture 6, May 7, 2006
http://www.ee.technion.ac.il/courses/049011

2 Principal Eigenvector Computation
- E: n × n matrix
- |λ_1| > |λ_2| ≥ |λ_3| ≥ … ≥ |λ_n|: eigenvalues of E
- Suppose λ_1 > 0
- v_1,…,v_n: corresponding eigenvectors
- Eigenvectors form a basis
- Suppose ||v_1||_2 = 1
- Input: the matrix E, and a unit vector u which is not in span(v_2,…,v_n)
- Goal: compute λ_1 and v_1

3 The Power Method
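The algorithm box on this slide was an image that did not survive extraction. As a stand-in, here is a minimal Python/NumPy sketch of power iteration under the previous slide's assumptions (λ_1 > 0, |λ_1| > |λ_2|, start vector u not in span(v_2,…,v_n)); the iteration cap and tolerance are illustrative choices, not part of the slides.

```python
import numpy as np

def power_method(E, u, iters=1000, tol=1e-10):
    """Estimate the principal eigenvalue lambda_1 and eigenvector v_1 of E."""
    w = u / np.linalg.norm(u)
    lam = 0.0
    for _ in range(iters):
        w_next = E @ w
        lam = w @ w_next                      # Rayleigh quotient: estimate of lambda_1
        w_next /= np.linalg.norm(w_next)      # renormalize to avoid over/underflow
        if np.linalg.norm(w_next - w) < tol:  # w has converged (up to sign)
            return lam, w_next
        w = w_next
    return lam, w

# Toy example: eigenvalues 3 and 1, principal eigenvector (1, 1)/sqrt(2)
E = np.array([[2.0, 1.0],
              [1.0, 2.0]])
lam, v = power_method(E, np.array([1.0, 0.0]))
```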

4 Why Does It Work?
- Theorem: As t → ∞, w_t → ±v_1
- Convergence rate: proportional to (λ_2/λ_1)^t
- The larger the "spectral gap" |λ_1| − |λ_2|, the faster the convergence.
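A quick empirical check of the claimed rate, on a toy diagonal matrix chosen only for illustration: with eigenvalues 3 and 2, the error to v_1 should shrink by a factor of about λ_2/λ_1 = 2/3 per iteration.

```python
import numpy as np

E = np.diag([3.0, 2.0])            # lambda_1 = 3, lambda_2 = 2
v1 = np.array([1.0, 0.0])          # principal eigenvector of E
w = np.array([1.0, 1.0]) / np.sqrt(2.0)
prev = np.linalg.norm(w - v1)
for t in range(1, 8):
    w = E @ w
    w /= np.linalg.norm(w)
    err = np.linalg.norm(w - v1)
    print(t, err, err / prev)      # ratio approaches 2/3 = lambda_2/lambda_1
    prev = err
```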

5 Spectral Methods in Information Retrieval

6 Outline
- Motivation: synonymy and polysemy
- Latent Semantic Indexing (LSI)
- Singular Value Decomposition (SVD)
- LSI via SVD
- Why does LSI work?
- HITS and SVD

7 Synonymy and Polysemy
- Synonymy: multiple terms with (almost) the same meaning
  - Ex: cars, autos, vehicles
  - Harms recall
- Polysemy: a term with multiple meanings
  - Ex: java (programming language, coffee, island)
  - Harms precision

8 Traditional Solutions
- Query expansion
- Synonymy: OR on all synonyms
  - Manual/automatic use of thesauri
  - Too few synonyms: recall still low
  - Too many synonyms: harms precision
- Polysemy: AND on the term and additional specializing terms
  - Ex: +java +"programming language"
  - Too broad terms: precision still low
  - Too narrow terms: harms recall

9 Syntactic Indexing
- D: document collection, |D| = n
- T: term space, |T| = m
- A_{t,d}: "weight" of term t in document d (e.g., TF-IDF)
- A^T A: pairwise document similarities
- A A^T: pairwise term similarities
(Diagram: A is an m × n matrix; rows = terms, columns = documents.)
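A tiny numerical illustration (made-up raw counts in place of TF-IDF; the term and document labels are hypothetical): the Gram matrices A^T A and A A^T hold exactly these pairwise similarities.

```python
import numpy as np

# Rows = terms, columns = documents; entries are illustrative raw counts.
#             d1  d2  d3
A = np.array([[2,  0,  1],   # "car"
              [1,  0,  1],   # "auto"
              [0,  3,  0],   # "java"
              [0,  1,  0]],  # "island"
             dtype=float)

doc_sim  = A.T @ A   # n x n: entry (d, d') = syntactic similarity of documents d, d'
term_sim = A @ A.T   # m x m: entry (t, t') = similarity of terms t, t'
```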

10 Latent Semantic Indexing (LSI) [Deerwester et al. 1990]
- C: concept space, |C| = r
- Documents & query: "mixtures" of concepts
- Given a query, finds the most similar documents
- Bridges the syntax-semantics gap
(Diagram: B is an r × n matrix; rows = concepts, columns = documents.)

11 Fourier Transform
(Diagram: a time-domain signal decomposed as 3 × one base sinusoid + 1.1 × another; the frequency domain records just the coefficients 3 and 1.1.)
- Compact discrete representation
- Effective for noise removal
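The slide's plots are lost in this transcript; the sketch below reproduces the idea in code. A signal built as 3 × one base sinusoid plus 1.1 × another is described by just two frequency-domain coefficients, and small (noise) coefficients can be zeroed. The specific frequencies, noise level, and threshold are arbitrary choices for illustration.

```python
import numpy as np

t = np.linspace(0.0, 1.0, 256, endpoint=False)
clean = 3.0 * np.sin(2 * np.pi * 5 * t) + 1.1 * np.sin(2 * np.pi * 12 * t)
noisy = clean + 0.1 * np.random.randn(t.size)

coeffs = np.fft.rfft(noisy)                          # frequency-domain representation
coeffs[np.abs(coeffs) < 0.5 * (t.size / 2)] = 0.0    # drop low-weight (noise) frequencies
denoised = np.fft.irfft(coeffs, n=t.size)            # back to the time domain
```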

12 Latent Semantic Indexing
- Documents, queries ~ signals
  - Vectors in R^m
- Concepts ~ base signals
  - Orthonormal basis of span(columns of A)
- Semantic indexing of a document ~ Fourier transform of a signal
  - Representation of the document in the concept basis
- Advantages
  - Space-efficient
  - Better handling of synonymy and polysemy
  - Removal of "noise"

13 Open Questions
- How to choose the concept basis?
- How to transform the syntactic index into a semantic index?
- How to filter out "noisy" concepts?

14 Singular Values
- A: m × n real matrix
- Definition: σ ≥ 0 is a singular value of A if there exists a pair of vectors u, v s.t. Av = σu and A^T u = σv. u and v are called singular vectors.
- Ex: σ = ||A||_2 = max_{||x||_2 = 1} ||Ax||_2. Corresponding singular vectors: the x that maximizes ||Ax||_2, and y = Ax / ||A||_2.
- Note: A^T A v = σ^2 v and A A^T u = σ^2 u
  - σ^2 is an eigenvalue of A^T A and of A A^T
  - v is an eigenvector of A^T A
  - u is an eigenvector of A A^T
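A numerical check of the definition and of the "Note", using NumPy's SVD on an arbitrary small matrix:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [0.0, 2.0]])                       # arbitrary 3x2 example
U, s, Vt = np.linalg.svd(A, full_matrices=False)
sigma, u, v = s[0], U[:, 0], Vt[0]

assert np.allclose(A @ v, sigma * u)             # A v = sigma u
assert np.allclose(A.T @ u, sigma * v)           # A^T u = sigma v
assert np.allclose(A.T @ A @ v, sigma**2 * v)    # v: eigenvector of A^T A
assert np.allclose(A @ A.T @ u, sigma**2 * u)    # u: eigenvector of A A^T
assert np.isclose(sigma, np.linalg.norm(A, 2))   # sigma_1 = ||A||_2
```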

15 Singular Value Decomposition (SVD)
- Theorem: For every m × n real matrix A, there exists a singular value decomposition A = U Σ V^T
  - σ_1 ≥ … ≥ σ_r > 0 (r = rank(A)): singular values of A
  - Σ = Diag(σ_1,…,σ_r)
  - U: column-orthonormal m × r matrix (U^T U = I)
  - V: column-orthonormal n × r matrix (V^T V = I)
(Diagram: A = U × Σ × V^T.)
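The theorem exercised through np.linalg.svd. One caveat: NumPy's "thin" SVD returns r = min(m, n) singular triples rather than r = rank(A); the two coincide for the generic full-rank matrix used here.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((5, 3))                         # generic real m x n matrix, full rank
U, s, Vt = np.linalg.svd(A, full_matrices=False)

assert np.allclose(U.T @ U, np.eye(3))         # U is column-orthonormal
assert np.allclose(Vt @ Vt.T, np.eye(3))       # V is column-orthonormal
assert np.all(np.diff(s) <= 0) and s[-1] > 0   # sigma_1 >= ... >= sigma_r > 0
assert np.allclose(U @ np.diag(s) @ Vt, A)     # A = U Sigma V^T
```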

16 Singular Values vs. Eigenvalues
- A = U Σ V^T
- σ_1,…,σ_r: singular values of A
- σ_1^2,…,σ_r^2: non-zero eigenvalues of A^T A and A A^T
- u_1,…,u_r: columns of U
  - Orthonormal basis for span(columns of A)
  - Left singular vectors of A
  - Eigenvectors of A A^T
- v_1,…,v_r: columns of V
  - Orthonormal basis for span(rows of A)
  - Right singular vectors of A
  - Eigenvectors of A^T A

17 LSI as SVD
- A = U Σ V^T  ⇒  U^T A = Σ V^T
- u_1,…,u_r: concept basis
- B = Σ V^T: LSI matrix (semantic index)
- A_d: d-th column of A
- B_d: d-th column of B
- B_d = U^T A_d
- B_d[c] = u_c^T A_d
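The slide's identities verified in code, with a random matrix standing in for a real term-document matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((6, 4))                          # 6 terms, 4 documents
U, s, Vt = np.linalg.svd(A, full_matrices=False)

B = U.T @ A                                     # semantic index, one column per document
assert np.allclose(B, np.diag(s) @ Vt)          # B = Sigma V^T

d, c = 2, 0
assert np.allclose(B[:, d], U.T @ A[:, d])      # B_d = U^T A_d
assert np.isclose(B[c, d], U[:, c] @ A[:, d])   # B_d[c] = u_c^T A_d
```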

18 Noisy Concepts
- B = U^T A = Σ V^T
- B_d[c] = σ_c · v_c[d]
- If σ_c is small, then B_d[c] is small for all d
- k = largest i s.t. σ_i is "large"
- For all c = k+1,…,r and for all d, c is a low-weight concept in d
- Main idea: filter out all concepts c = k+1,…,r
  - Space efficient: # of index terms = k (vs. r or m)
  - Better retrieval: noisy concepts are filtered out across the board

19 Low-rank SVD
- B = U^T A = Σ V^T
- U_k = (u_1,…,u_k)
- V_k = (v_1,…,v_k)
- Σ_k = upper-left k × k sub-matrix of Σ
- A_k = U_k Σ_k V_k^T
- B_k = Σ_k V_k^T
- rank(A_k) = rank(B_k) = k
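A sketch of the truncation (the helper name rank_k_svd is ours, not from the slides):

```python
import numpy as np

def rank_k_svd(A, k):
    """Return (A_k, B_k) = (U_k Sigma_k V_k^T, Sigma_k V_k^T)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k, S_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]
    return U_k @ S_k @ Vt_k, S_k @ Vt_k

rng = np.random.default_rng(2)
A = rng.random((8, 5))
A_k, B_k = rank_k_svd(A, k=2)
assert np.linalg.matrix_rank(A_k) == 2 and B_k.shape == (2, 5)
```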

20 Low Dimensional Embedding
- Theorem: If the discarded singular values σ_{k+1},…,σ_r are small, then for "most" document pairs d, d′, the similarity computed from A_k is close to the similarity computed from A.
- ⇒ A_k preserves pairwise similarities among documents at least as well as A for retrieval.

21 Why is LSI Better? [Papadimitriou et al. 1998] [Azar et al. 2001]
- LSI summary
  - Documents are embedded in a low-dimensional space (m → k)
  - Pairwise similarities are preserved
  - More space-efficient
- But why is retrieval better?
  - Synonymy
  - Polysemy

22 Generative Model
- A corpus model M = (T, C, W, D)
  - T: term space, |T| = m
  - C: concept space, |C| = k
    - Concept: distribution over terms
  - W: topic space
    - Topic: distribution over concepts
  - D: document distribution
    - Distribution over W × N
- A document d is generated as follows (see the sketch below):
  - Sample a topic w and a length n according to D
  - Repeat n times:
    - Sample a concept c from C according to w
    - Sample a term t from T according to c
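A toy sampler for the generation procedure above. The Dirichlet-random concepts, the fixed topic, and the fixed document length are illustrative assumptions (in the model, the topic and length are themselves drawn from D).

```python
import numpy as np

rng = np.random.default_rng(3)
m, k = 6, 2                                    # |T| = 6 terms, |C| = 2 concepts
concepts = rng.dirichlet(np.ones(m), size=k)   # each concept: a distribution over terms

def generate_document(topic, n):
    """topic: a distribution over concepts; n: document length."""
    doc = []
    for _ in range(n):
        c = rng.choice(k, p=topic)             # sample a concept according to the topic
        t = rng.choice(m, p=concepts[c])       # sample a term according to the concept
        doc.append(t)
    return doc

doc = generate_document(topic=np.array([0.7, 0.3]), n=20)
```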

23 Simplifying Assumptions
- Every document has a single topic (W = C)
- For every two concepts c, c′: ||c − c′|| ≥ 1 − ε
- The probability of every term under a concept c is at most some small constant.

24 LSI Works
- A: m × n term-document matrix, representing n documents generated according to the model
- Theorem [Papadimitriou et al. 1998]: With high probability, for every two documents d, d′:
  - If topic(d) = topic(d′), then A_d^k and A_d′^k are (nearly) parallel
  - If topic(d) ≠ topic(d′), then A_d^k and A_d′^k are (nearly) orthogonal
  (A_d^k: the d-th column of A_k)

25 Proof
- For simplicity, assume ε = 0
- Want to show:
  - If topic(d) = topic(d′), then A_d^k ∥ A_d′^k
  - If topic(d) ≠ topic(d′), then A_d^k ⊥ A_d′^k
- D_c: documents whose topic is the concept c
- T_c: terms in supp(c)
- Since ||c − c′|| = 1, T_c ∩ T_c′ = Ø
- A has non-zeroes only in blocks B_1,…,B_k, where B_c: sub-matrix of A with rows in T_c and columns in D_c
- A^T A is a block-diagonal matrix with blocks B_1^T B_1,…,B_k^T B_k
- (i,j)-th entry of B_c^T B_c: term similarity between the i-th and j-th documents whose topic is the concept c
- B_c^T B_c: adjacency matrix of a bipartite (multi-)graph G_c on D_c

26 Proof (cont.)
- G_c is a "random" graph
- ⇒ The first and second eigenvalues of B_c^T B_c are well separated
- For all c, c′, the second eigenvalue of B_c^T B_c is smaller than the first eigenvalue of B_c′^T B_c′
- ⇒ The top k eigenvalues of A^T A are the principal eigenvalues of B_c^T B_c for c = 1,…,k
- Let u_1,…,u_k be the corresponding eigenvectors
- For every document d on topic c, A_d is orthogonal to all of u_1,…,u_k except u_c
- ⇒ A_d^k is a scalar multiple of u_c

27 Extensions [Azar et al. 2001]
- A more general generative model
- Also explains the improved treatment of polysemy

28 Computing SVD
- Compute the singular values of A by computing the eigenvalues of A^T A
- Compute U, V by computing the eigenvectors of A^T A and A A^T
- Running time not too good: O(m^2 n + m n^2)
  - Not practical for huge corpora
- Sub-linear time algorithms for estimating A_k [Frieze, Kannan, Vempala 1998]
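The slide's recipe in code, for a small dense matrix. One shortcut relative to the slide: U is recovered as u_i = A v_i / σ_i (equivalent, by the definition on slide 14, to taking the eigenvectors of A A^T, but cheaper than a second eigen-decomposition). Fine at this scale; as the slide notes, not for huge corpora.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.random((5, 3))                                # small, full-rank example

eigvals, V = np.linalg.eigh(A.T @ A)                  # eigenvalues of A^T A, ascending
order = np.argsort(eigvals)[::-1]
sigmas = np.sqrt(np.clip(eigvals[order], 0.0, None))  # sigma_i = sqrt(eigenvalue), guard tiny negatives
V = V[:, order]                                       # right singular vectors

U = (A @ V) / sigmas                                  # left singular vectors: u_i = A v_i / sigma_i

assert np.allclose(U @ np.diag(sigmas) @ V.T, A)      # A = U Sigma V^T
```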

29 HITS and SVD
- A: adjacency matrix of a web (sub-)graph G
- a: authority vector
- h: hub vector
- a is the principal eigenvector of A^T A
- h is the principal eigenvector of A A^T
- Therefore: a and h give A_1, the rank-1 SVD of A
- Generalization: using A_k, we can get k authority and hub vectors, corresponding to other topics in G.
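A minimal illustration on a made-up 4-page link graph (the graph is ours, chosen only for illustration). Since A is nonnegative, the principal singular vectors can be taken nonnegative; np.abs fixes the arbitrary sign returned by the SVD.

```python
import numpy as np

# A[i, j] = 1 iff page i links to page j
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 0],
              [0, 0, 1, 0]], dtype=float)

U, s, Vt = np.linalg.svd(A)
h = np.abs(U[:, 0])                   # hub scores:       principal eigenvector of A A^T
a = np.abs(Vt[0])                     # authority scores: principal eigenvector of A^T A
A1 = s[0] * np.outer(U[:, 0], Vt[0])  # A_1, the rank-1 SVD of A, determined by h and a
```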

30 End of Lecture 6
