1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 6 May 7, 2006
1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 6 May 7, 2006 http://www.ee.technion.ac.il/courses/049011
2 Principal Eigenvector Computation. $E$: $n \times n$ matrix. $|\lambda_1| > |\lambda_2| \ge |\lambda_3| \ge \dots \ge |\lambda_n|$: eigenvalues of $E$. Suppose $\lambda_1 > 0$. $v_1, \dots, v_n$: corresponding eigenvectors. Eigenvectors form a basis. Suppose $\|v_1\|_2 = 1$. Input: the matrix $E$; a unit vector $u$, which is not in $\mathrm{span}(v_2, \dots, v_n)$. Goal: compute $\lambda_1$ and $v_1$.
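One standard way to reach this goal is power iteration: repeatedly multiply by $E$ and renormalize. The sketch below assumes that method and the conditions stated on the slide; the function name `power_iteration`, the stopping rule, and the 2 × 2 example matrix are illustrative choices, not taken from the lecture.

```python
import numpy as np

def power_iteration(E, u, num_iters=1000, tol=1e-12):
    """Estimate the dominant eigenvalue and unit eigenvector of E.

    Assumes |lambda_1| > |lambda_2|, lambda_1 > 0, and that u has a
    non-zero component along v_1.
    """
    x = u / np.linalg.norm(u)
    lam = 0.0
    for _ in range(num_iters):
        y = E @ x
        norm_y = np.linalg.norm(y)
        if norm_y == 0.0:
            raise ValueError("iterate collapsed to the zero vector")
        x_next = y / norm_y
        lam_next = x_next @ E @ x_next   # Rayleigh quotient estimate of lambda_1
        converged = abs(lam_next - lam) < tol
        x, lam = x_next, lam_next
        if converged:
            break
    return lam, x

# Small symmetric example with eigenvalues 3 and 1.
E = np.array([[2.0, 1.0],
              [1.0, 2.0]])
lam1, v1 = power_iteration(E, np.array([1.0, 0.0]))
```

The error in the eigenvector estimate shrinks roughly like $(|\lambda_2| / |\lambda_1|)^t$ after $t$ iterations, which is why the gap between the first two eigenvalues matters.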
6 Outline. Motivation: synonymy and polysemy. Latent Semantic Indexing (LSI). Singular Value Decomposition (SVD). LSI via SVD. Why LSI works. HITS and SVD.
7 Synonymy and Polysemy. Synonymy: multiple terms with (almost) the same meaning. Ex: cars, autos, vehicles. Harms recall. Polysemy: a term with multiple meanings. Ex: java (programming language, coffee, island). Harms precision.
8 Traditional Solutions. Query expansion. Synonymy: OR on all synonyms; manual/automatic use of thesauri; too few synonyms: recall still low; too many synonyms: harms precision. Polysemy: AND on term and additional specializing terms. Ex: +java +"programming language". Too broad terms: precision still low; too narrow terms: harms recall.
9 Syntactic Indexing. $D$: document collection, $|D| = n$. $T$: term space, $|T| = m$. $A$: $m \times n$ matrix (rows: terms, columns: documents). $A_{t,d}$: “weight” of $t$ in $d$ (e.g., TFIDF). $A^T A$: pairwise document similarities. $A A^T$: pairwise term similarities.
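As a small illustration of the syntactic index, the sketch below builds a term-document matrix for a made-up three-document corpus and forms both similarity matrices. The corpus is hypothetical, and raw term counts stand in for TFIDF weights.

```python
import numpy as np

# Hypothetical toy corpus (not from the lecture).
docs = [
    "car auto car",
    "auto vehicle",
    "java coffee java",
]
terms = sorted({t for d in docs for t in d.split()})
m, n = len(terms), len(docs)

# A[t, d] = weight of term t in document d (plain term frequency here,
# standing in for TFIDF).
A = np.zeros((m, n))
for d, text in enumerate(docs):
    for t in text.split():
        A[terms.index(t), d] += 1.0

doc_sim = A.T @ A    # n x n: inner products between document columns
term_sim = A @ A.T   # m x m: inner products between term rows
```

Documents 0 and 1 share only the term "auto", so their similarity is small, and document 2 has similarity zero with both: the syntactic index cannot see that "car", "auto" and "vehicle" are related.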
10 Latent Semantic Indexing (LSI) [Deerwester et al. 1990]. $C$: concept space, $|C| = r$. $B$: $r \times n$ matrix (rows: concepts, columns: documents). Documents & query: “mixtures” of concepts. Given a query, finds the most similar documents. Bridges the syntax-semantics gap.
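11 Fourier Transform. [Figure: in the time domain, a signal shown as 3 × (one base signal) + 1.1 × (another base signal); in the frequency domain, the coefficients 3 and 1.1 plotted against frequency.] Compact discrete representation. Effective for noise removal.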
12 Latent Semantic Indexing. Documents, queries ~ signals: vectors in $\mathbb{R}^m$. Concepts ~ base signals: orthonormal basis of span(columns of $A$). Semantic indexing of a document ~ Fourier transform of a signal: representation of the document in the concept basis. Advantages: space-efficient; better handling of synonymy and polysemy; removal of “noise”.
13 Open Questions. How to choose the concept basis? How to transform the syntactic index into a semantic index? How to filter out “noisy concepts”?
14 Singular Values. $A$: $m \times n$ real matrix. Definition: $\sigma \ge 0$ is a singular value of $A$ if there exists a pair of vectors $u, v$ s.t. $Av = \sigma u$ and $A^T u = \sigma v$. $u$ and $v$ are called singular vectors. Ex: $\sigma = \|A\|_2 = \max_{\|x\|_2 = 1} \|Ax\|_2$. Corresponding singular vectors: $x$ that maximizes $\|Ax\|_2$ and $y = Ax / \|A\|_2$. Note: $A^T A v = \sigma^2 v$ and $A A^T u = \sigma^2 u$. $\sigma^2$ is an eigenvalue of $A^T A$ and $A A^T$; $v$ is an eigenvector of $A^T A$; $u$ is an eigenvector of $A A^T$.
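The relations on this slide can be checked numerically. The sketch below uses `np.linalg.svd` on an arbitrary random matrix (an assumption made only for illustration) and verifies each identity for the largest singular value.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))   # arbitrary m x n example matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)
sigma, u, v = s[0], U[:, 0], Vt[0, :]

# Defining relations of a singular value with its singular vectors.
assert np.allclose(A @ v, sigma * u)
assert np.allclose(A.T @ u, sigma * v)

# The largest singular value equals the spectral norm ||A||_2.
assert np.isclose(sigma, np.linalg.norm(A, 2))

# sigma^2 is an eigenvalue of A^T A (eigenvector v) and of A A^T (eigenvector u).
assert np.allclose(A.T @ A @ v, sigma**2 * v)
assert np.allclose(A @ A.T @ u, sigma**2 * u)
```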
15 Singular Value Decomposition (SVD). Theorem: For every $m \times n$ real matrix $A$, there exists a singular value decomposition $A = U \Sigma V^T$. $\sigma_1 \ge \dots \ge \sigma_r > 0$ ($r = \mathrm{rank}(A)$): singular values of $A$. $\Sigma = \mathrm{Diag}(\sigma_1, \dots, \sigma_r)$. $U$: column-orthonormal $m \times r$ matrix ($U^T U = I$). $V$: column-orthonormal $n \times r$ matrix ($V^T V = I$).
16 Singular Values vs. Eigenvalues. $A = U \Sigma V^T$. $\sigma_1, \dots, \sigma_r$: singular values of $A$. $\sigma_1^2, \dots, \sigma_r^2$: non-zero eigenvalues of $A^T A$ and $A A^T$. $u_1, \dots, u_r$: columns of $U$; orthonormal basis for span(columns of $A$); left singular vectors of $A$; eigenvectors of $A A^T$. $v_1, \dots, v_r$: columns of $V$; orthonormal basis for span(rows of $A$); right singular vectors of $A$; eigenvectors of $A^T A$.
17 LSI as SVD. $A = U \Sigma V^T \Rightarrow U^T A = \Sigma V^T$. $u_1, \dots, u_r$: concept basis. $B = \Sigma V^T$: LSI matrix (semantic index). $A_d$: $d$-th column of $A$. $B_d$: $d$-th column of $B$. $B_d = U^T A_d$. $B_d[c] = u_c^T A_d$.
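A minimal numerical sketch of the semantic index, assuming a random non-negative matrix as a stand-in for a real term-document matrix: it builds $B = \Sigma V^T$ and confirms that it equals $U^T A$, entry by entry.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((6, 4))            # hypothetical 6 terms x 4 documents

U, s, Vt = np.linalg.svd(A, full_matrices=False)
r = np.linalg.matrix_rank(A)
U, s, Vt = U[:, :r], s[:r], Vt[:r, :]   # keep the r = rank(A) non-zero part

B = np.diag(s) @ Vt               # LSI matrix: r concepts x n documents
assert np.allclose(B, U.T @ A)    # same matrix, computed from the concept basis

d, c = 2, 0
assert np.isclose(B[c, d], U[:, c] @ A[:, d])   # B_d[c] = u_c^T A_d
```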
18 Noisy Concepts. $B = U^T A = \Sigma V^T$. $B_d[c] = \sigma_c \, v_c[d]$, where $v_c[d]$ is the $d$-th entry of $v_c$. If $\sigma_c$ is small, then $B_d[c]$ is small for all $d$. $k$ = largest $i$ s.t. $\sigma_i$ is “large”. For all $c = k+1, \dots, r$ and for all $d$, $c$ is a low-weight concept in $d$. Main idea: filter out all concepts $c = k+1, \dots, r$. Space efficient: # of index terms = $k$ (vs. $r$ or $m$). Better retrieval: noisy concepts are filtered out across the board.
19 Low-rank SVD. $B = U^T A = \Sigma V^T$. $U_k = (u_1, \dots, u_k)$. $V_k = (v_1, \dots, v_k)$. $\Sigma_k$ = upper-left $k \times k$ sub-matrix of $\Sigma$. $A_k = U_k \Sigma_k V_k^T$. $B_k = \Sigma_k V_k^T$. $\mathrm{rank}(A_k) = \mathrm{rank}(B_k) = k$.
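The truncation can be sketched in a few lines. The helper name `truncated_lsi` and the random test matrix are assumptions made for illustration.

```python
import numpy as np

def truncated_lsi(A, k):
    """Return (A_k, B_k): the rank-k approximation of A and the k-concept index."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k, S_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]
    B_k = S_k @ Vt_k          # k x n semantic index
    A_k = U_k @ B_k           # m x n rank-k approximation
    return A_k, B_k

rng = np.random.default_rng(2)
A = rng.random((8, 5))        # hypothetical 8 terms x 5 documents
A_k, B_k = truncated_lsi(A, k=2)
```

Because $U_k$ has orthonormal columns, $A_k^T A_k = B_k^T B_k$: the $k \times n$ index $B_k$ gives the same pairwise document similarities as the full $m \times n$ matrix $A_k$, which is where the space saving comes from.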
20 Low Dimensional Embedding. Theorem: If […] is small, then for “most” $d, d'$, […]. Consequence: $A_k$ preserves pairwise similarities among documents, so it is at least as good as $A$ for retrieval.
21 Why is LSI Better? [Papadimitriou et al. 1998] [Azar et al. 2001]. LSI summary: documents are embedded in a low dimensional space ($m \to k$); pairwise similarities are preserved; more space-efficient. But why is retrieval better? Synonymy. Polysemy.
22 Generative Model. A corpus model $M = (T, C, W, D)$. $T$: term space, $|T| = m$. $C$: concept space, $|C| = k$; a concept is a distribution over terms. $W$: topic space; a topic is a distribution over concepts. $D$: document distribution, a distribution over $W \times \mathbb{N}$. A document $d$ is generated as follows: sample a topic $w$ and a length $n$ according to $D$; then repeat $n$ times: sample a concept $c$ from $C$ according to $w$, and sample a term $t$ from $T$ according to $c$.
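The generation procedure can be sketched directly. In the code below the topic and the length are passed in as arguments rather than drawn from a document distribution $D$, and the two-concept model is a hypothetical example; both are simplifications made for illustration.

```python
import numpy as np

def generate_document(rng, concepts, topic, length):
    """Sample term counts for one document.

    concepts: k x m array, row c is concept c's distribution over terms.
    topic:    length-k distribution over concepts.
    length:   number of terms n in the document.
    """
    k, m = concepts.shape
    counts = np.zeros(m)
    for _ in range(length):
        c = rng.choice(k, p=topic)          # sample a concept according to the topic
        t = rng.choice(m, p=concepts[c])    # sample a term according to the concept
        counts[t] += 1
    return counts

rng = np.random.default_rng(3)
# Hypothetical model: k = 2 concepts over m = 4 terms with disjoint supports.
concepts = np.array([[0.5, 0.5, 0.0, 0.0],
                     [0.0, 0.0, 0.5, 0.5]])
topic = np.array([1.0, 0.0])                # a "pure" topic: always concept 0
doc = generate_document(rng, concepts, topic, length=20)
```

The returned count vector is one column of the term-document matrix $A$.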
23 Simplifying Assumptions. Every document has a single topic ($W = C$). For every two concepts $c, c'$, $\|c - c'\| \ge 1 - \epsilon$. The probability of every term under a concept $c$ is at most some constant.
24 LSI Works. $A$: $m \times n$ term-document matrix, representing $n$ documents generated according to the model. Theorem [Papadimitriou et al. 1998]: With high probability, for every two documents $d, d'$: if topic($d$) = topic($d'$), then […]; if topic($d$) ≠ topic($d'$), then […].
25 Proof. For simplicity, assume $\epsilon = 0$. Want to show: if topic($d$) = topic($d'$), then $A^k_d \parallel A^k_{d'}$; if topic($d$) ≠ topic($d'$), then $A^k_d \perp A^k_{d'}$ (here $A^k_d$ is the $d$-th column of $A_k$). $D_c$: documents whose topic is the concept $c$. $T_c$: terms in supp($c$). Since $\|c - c'\| = 1$, $T_c \cap T_{c'} = \emptyset$. $A$ has non-zeroes only in blocks $B_1, \dots, B_k$, where $B_c$ is the sub-matrix of $A$ with rows in $T_c$ and columns in $D_c$. $A^T A$ is a block diagonal matrix with blocks $B_1^T B_1, \dots, B_k^T B_k$. The $(i,j)$-th entry of $B_c^T B_c$ is the term similarity between the $i$-th and $j$-th documents whose topic is the concept $c$. $B_c^T B_c$: adjacency matrix of a bipartite (multi-)graph $G_c$ on $D_c$.
26 Proof (cont.). $G_c$ is a “random” graph. First and second eigenvalues of $B_c^T B_c$ are well separated. For all $c, c'$, the second eigenvalue of $B_c^T B_c$ is smaller than the first eigenvalue of $B_{c'}^T B_{c'}$. Top $k$ eigenvalues of $A^T A$ are the principal eigenvalues of $B_c^T B_c$ for $c = 1, \dots, k$. Let $u_1, \dots, u_k$ be the corresponding left singular vectors of $A$ (eigenvectors of $A A^T$); $u_c$ is supported on $T_c$. For every document $d$ on topic $c$, $A_d$ is orthogonal to all of $u_1, \dots, u_k$ except for $u_c$. Hence $A^k_d$ is a scalar multiple of $u_c$.
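The $\epsilon = 0$ case of the argument can be observed on synthetic data. The sketch below assumes a hypothetical two-concept model with disjoint term supports and fixed document length 50; it checks that, after rank-$k$ truncation, same-topic documents become parallel and different-topic documents become orthogonal.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical corpus with epsilon = 0: two concepts with disjoint term supports,
# every document generated from a single concept.
concepts = np.array([[0.4, 0.3, 0.3, 0.0, 0.0, 0.0],
                     [0.0, 0.0, 0.0, 0.2, 0.5, 0.3]])
topics = [0, 0, 0, 0, 1, 1, 1, 1, 1]             # topic of each document
A = np.column_stack(
    [rng.multinomial(50, concepts[c]) for c in topics]
).astype(float)                                   # 6 terms x 9 documents

k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

def cosine(x, y):
    """Cosine of the angle between two vectors."""
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

same = cosine(A_k[:, 0], A_k[:, 1])     # two documents on topic 0
diff = cosine(A_k[:, 0], A_k[:, 4])     # documents on different topics
raw_same = cosine(A[:, 0], A[:, 1])     # same pair before truncation
```

In the raw matrix the same-topic pair is merely similar, since sampling noise makes the count vectors differ; truncation removes that noise and leaves each document as a multiple of its topic's singular vector.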
27 Extensions [Azar et al. 2001]. A more general generative model. Also explains the improved treatment of polysemy.
28 Computing SVD. Compute the singular values of $A$ by computing the eigenvalues of $A^T A$. Compute $U$ and $V$ by computing the eigenvectors of $A A^T$ and $A^T A$, respectively. Running time not too good: $O(m^2 n + m n^2)$. Not practical for huge corpora. Sub-linear time algorithms for estimating $A_k$ [Frieze, Kannan, Vempala 1998].
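A teaching sketch of the eigenvalue route, with one deliberate deviation from the slide: $V$ and the singular values come from the eigendecomposition of $A^T A$, but $U$ is then obtained as $u_i = A v_i / \sigma_i$ rather than from a second eigendecomposition of $A A^T$, so that the signs of each $(u_i, v_i)$ pair are consistent. The function name `svd_via_gram` and the tolerance are illustrative assumptions.

```python
import numpy as np

def svd_via_gram(A, tol=1e-10):
    """Thin SVD of A from the eigendecomposition of A^T A."""
    evals, evecs = np.linalg.eigh(A.T @ A)      # ascending eigenvalues, orthonormal eigenvectors
    order = np.argsort(evals)[::-1]             # reorder to descending
    evals, V = evals[order], evecs[:, order]
    keep = evals > tol * evals[0]               # drop (numerically) zero eigenvalues
    sigma = np.sqrt(evals[keep])
    V = V[:, keep]
    U = (A @ V) / sigma                         # u_i = A v_i / sigma_i
    return U, sigma, V

rng = np.random.default_rng(5)
A = rng.standard_normal((6, 4))
U, sigma, V = svd_via_gram(A)
```

Forming $A^T A$ squares the condition number of the problem, so library routines such as `np.linalg.svd` are preferable in practice; the sketch only mirrors the relation stated on the slide.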
29 HITS and SVD. $A$: adjacency matrix of a web (sub-)graph $G$. $a$: authority vector. $h$: hub vector. $a$ is the principal eigenvector of $A^T A$. $h$ is the principal eigenvector of $A A^T$. Therefore: $a$ and $h$ give $A_1$, the rank-1 SVD of $A$. Generalization: using $A_k$, we can get $k$ authority and hub vectors, corresponding to other topics in $G$.
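The correspondence can be checked on a small graph. The sketch below runs the usual HITS mutual-reinforcement updates on a hypothetical 4-page graph (the graph, the function name `hits`, and the fixed iteration count are assumptions) and compares the result with the top singular vectors returned by `np.linalg.svd`.

```python
import numpy as np

def hits(A, num_iters=200):
    """Authority and hub vectors of adjacency matrix A (A[i, j] = 1 for edge i -> j)."""
    n = A.shape[0]
    a = np.ones(n) / np.sqrt(n)
    h = np.ones(n) / np.sqrt(n)
    for _ in range(num_iters):
        a = A.T @ h                 # authorities are pointed to by good hubs
        a /= np.linalg.norm(a)
        h = A @ a                   # hubs point to good authorities
        h /= np.linalg.norm(h)
    return a, h

# Hypothetical 4-page graph.
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [0, 1, 1, 0]], dtype=float)
a, h = hits(A)

U, s, Vt = np.linalg.svd(A)
A_1 = s[0] * np.outer(h, a)         # rank-1 SVD of A built from the HITS vectors
```

Up to sign, `a` matches the top right singular vector and `h` the top left singular vector, so `A_1` equals the leading term $\sigma_1 u_1 v_1^T$ of the SVD.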