Latent Semantic Analysis
An Example

d1: Romeo and Juliet.
d2: Juliet: O happy dagger!
d3: Romeo died by dagger.
d4: “Live free or die”, that’s New-Hampshire’s motto.
d5: Did you know, New-Hampshire is in New-England.

q: dies, dagger

Which documents should be returned, and how should they be ranked?
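As a concrete sketch, the five documents above can be turned into a term-document matrix. The variable names and the conflation of "died"/"dies" into the stem "die" are illustrative assumptions, not part of the original slides:

```python
import numpy as np

# Assumed vocabulary after simple stemming/stop-word removal (illustrative).
terms = ["romeo", "juliet", "happy", "dagger", "live", "die", "free", "new-hampshire"]
docs = {
    "d1": ["romeo", "juliet"],
    "d2": ["juliet", "happy", "dagger"],
    "d3": ["romeo", "die", "dagger"],
    "d4": ["live", "free", "die", "new-hampshire"],
    "d5": ["new-hampshire"],
}

# Rows = terms, columns = documents; each entry is a term frequency.
A = np.array([[doc.count(t) for doc in docs.values()] for t in terms], dtype=float)
print(A.shape)  # (8, 5)
```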
Eigenvectors and Eigenvalues

Let A be an n × n matrix. If x is an n-dimensional vector, then the matrix-vector product Ax is well-defined, and the result is again an n-dimensional vector. In general, multiplication by a matrix changes the direction of a non-zero vector x, unless the vector is special and we have Ax = λx for some scalar λ. Such a vector x is called an eigenvector of A, and λ is the corresponding eigenvalue.
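The eigenvector property can be checked numerically. A minimal sketch, using a small symmetric matrix chosen purely for illustration:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# np.linalg.eig returns eigenvalues and a matrix whose columns are eigenvectors.
eigenvalues, eigenvectors = np.linalg.eig(A)

lam, x = eigenvalues[0], eigenvectors[:, 0]

# Multiplication by A only rescales an eigenvector: Ax = lambda * x.
assert np.allclose(A @ x, lam * x)
print(eigenvalues)  # the eigenvalues of [[2,1],[1,2]] are 3 and 1
```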
Matrix Decomposition

Let S be the matrix with the eigenvectors of A as columns. Let Λ be the diagonal matrix with the eigenvalues of A on the diagonal. Then A = S Λ S⁻¹. If A is symmetric, then S⁻¹ = Sᵀ, so A = S Λ Sᵀ.
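This decomposition can be verified for a small symmetric matrix. A sketch; `np.linalg.eigh` is used here because it is intended for symmetric matrices and returns an orthogonal eigenvector matrix:

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])          # symmetric

lams, S = np.linalg.eigh(A)          # columns of S are eigenvectors of A
Lam = np.diag(lams)                  # diagonal matrix of eigenvalues

# A = S Lam S^T, and S^{-1} = S^T because A is symmetric.
assert np.allclose(A, S @ Lam @ S.T)
assert np.allclose(np.linalg.inv(S), S.T)
```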
Singular Value Decomposition

Let A be an m × n matrix with real entries and m > n. Consider the n × n square matrix B = AᵀA.
– B is symmetric.
– It has been shown that the eigenvalues of such matrices (AᵀA) are non-negative.
– Since they are non-negative, we can write them in decreasing order as squares of non-negative real numbers: σ₁² ≥ … ≥ σₙ².

For some index r (possibly n) the first r numbers σ₁, …, σᵣ are positive, whereas the rest are zero. Let x₁, …, xᵣ be corresponding eigenvectors of B, and set:

S₁ = [x₁, …, xᵣ]
y₁ = (1/σ₁)Ax₁, …, yᵣ = (1/σᵣ)Axᵣ
S₂ = [y₁, …, yᵣ]

We can show that A = S₂ Σ S₁ᵀ, where Σ is diagonal and the values along its diagonal are σ₁, …, σᵣ, which are called the singular values of A. If we denote S₂ by S and S₁ by U, we have A = S Σ Uᵀ.
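The claimed relationships can be checked numerically. NumPy's `np.linalg.svd` computes the same factorization, returning the three factors in the order of the slide's A = S Σ Uᵀ (NumPy calls them `u`, `s`, `vh`). The random matrix below is only an illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))       # m > n

# In the slide's notation: A = S Sigma U^T.
S2, sigmas, U_T = np.linalg.svd(A, full_matrices=False)

# Singular values are non-negative and come out in decreasing order.
assert np.all(sigmas >= 0)
assert np.all(sigmas[:-1] >= sigmas[1:])

# They are the square roots of the eigenvalues of B = A^T A.
eigs_B = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]
assert np.allclose(sigmas**2, eigs_B)

# The factorization reconstructs A exactly.
assert np.allclose(A, S2 @ np.diag(sigmas) @ U_T)
```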
Example

d1: Romeo and Juliet.
d2: Juliet: O happy dagger!
d3: Romeo died by dagger.
d4: “Live free or die”, that’s New-Hampshire’s motto.
d5: Did you know, New-Hampshire is in New-England.

q: dies, dagger
Latent Concepts

Latent Semantic Indexing (LSI) is a method for discovering hidden concepts in document data. Each document and term (word) is then expressed as a vector with elements corresponding to these concepts.
– Each element in a vector gives the degree of participation of the document or term in the corresponding concept.

The goal is not to describe the concepts verbally, but to represent the documents and terms in a unified way that exposes
– document-document,
– document-term, and
– term-term similarities, which are otherwise hidden.
Matrix Σ

Matrix A can be written: A = S Σ Uᵀ. Let's "neglect" the last three singular values of Σ as being too "small"... Also, keep just two columns of S, obtaining S₂, and two rows of Uᵀ, obtaining U₂ᵀ. Matrix A is then approximated as: A₂ = S₂ Σ₂ U₂ᵀ. In general: Aₖ = Sₖ Σₖ Uₖᵀ, where a good value for k is determined empirically.
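This truncation can be sketched with NumPy. The random matrix is illustrative; the key fact (the Eckart-Young theorem, not stated on the slide) is that Aₖ is the best rank-k approximation of A in the Frobenius norm:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 5))

S, sigmas, U_T = np.linalg.svd(A, full_matrices=False)

k = 2
# Keep k columns of S, the top k singular values, and k rows of U^T.
A_k = S[:, :k] @ np.diag(sigmas[:k]) @ U_T[:k, :]

# The approximation error equals the discarded singular "mass":
# ||A - A_k||_F = sqrt(sigma_{k+1}^2 + ... + sigma_n^2).
err = np.linalg.norm(A - A_k, "fro")
assert np.isclose(err, np.sqrt(np.sum(sigmas[k:] ** 2)))
```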
Matrices Σ₂, S₂, U₂
Representing Documents, Terms, and Queries

– Represent documents by the column vectors of U₂ᵀ.
– Represent terms by the row vectors of S₂.
– Represent queries by the centroid vector of their terms.
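Putting the pieces together, the whole pipeline can be sketched on the running example. The term-document matrix, the stemming of "died"/"dies" to "die", and the use of cosine similarity for ranking are assumptions of this sketch (the slides do not fix a weighting or similarity measure), but d3, which contains both query terms, should rank near the top:

```python
import numpy as np

terms = ["romeo", "juliet", "happy", "dagger", "live", "die", "free", "new-hampshire"]
# Term-document matrix: rows = terms, columns = d1..d5 (died/dies stemmed to die).
A = np.array([
    [1, 0, 1, 0, 0],   # romeo
    [1, 1, 0, 0, 0],   # juliet
    [0, 1, 0, 0, 0],   # happy
    [0, 1, 1, 0, 0],   # dagger
    [0, 0, 0, 1, 0],   # live
    [0, 0, 1, 1, 0],   # die
    [0, 0, 0, 1, 0],   # free
    [0, 0, 0, 1, 1],   # new-hampshire
], dtype=float)

S, sigmas, U_T = np.linalg.svd(A, full_matrices=False)
k = 2
term_vecs = S[:, :k]      # terms as row vectors of S_2
doc_vecs = U_T[:k, :].T   # documents as column vectors of U_2^T

# Query "die, dagger": centroid of the two term vectors.
q = (term_vecs[terms.index("die")] + term_vecs[terms.index("dagger")]) / 2

# Rank documents by cosine similarity to the query.
sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
ranking = np.argsort(-sims)
print([f"d{i + 1}" for i in ranking])
```

Note that d1 (Romeo and Juliet) can receive a non-zero score even though it contains neither query term: the latent space groups it with the dagger/die documents, which is exactly the hidden similarity LSI is meant to expose.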