# Latent Semantic Analysis

## Presentation on theme: "Latent Semantic Analysis"— Presentation transcript:

Latent Semantic Analysis

An Example d1 : Romeo and Juliet. d2 : Juliet: O happy dagger!
d3 : Romeo died by dagger. d4 : “Live free or die”, that’s the New-Hampshire’s motto. d5 : Did you know, New-Hampshire is in New-England. q: dies, dagger Which document should be returned and how the ranking should be?

Eigenvectors and Eigenvalues
Let A be an n × n matrix. If x is an n-dimensional vector, then the matrix-vector product Ax is well-defined, and the result is again an n-dimensional vector. In general, multiplication by a matrix changes the direction of a non-zero vector x, unless the vector is special and we have that Ax =  x for some scalar .

Matrix Decomposition Let S be the matrix with eigenvectors of A as columns. Let  be the diagonal matrix with the eigenvalues of A on the diagonal. Then A = SS-1 If A is symmetric then we have S-1=ST A = SST

Singular Value Decomposition
Let A be an m × n matrix with entries being real numbers and m > n. Consider the n × n square matrix B = ATA. B is symmetric it has been shown that the eigenvalues of such (ATA) matrices are non-negative. Since they are non-negative we can write them in decreasing order as squares of non-negative real numbers: 12 > > n2 For some index r (possibly n) the first r numbers are positive whereas the rest are zero. S1 = [x1, , xr] y1=(1/1)Ax yr=(1/r)Axr S2 = [y1, ..., yr] We can show that A = S2  S1T  is diagonal and the values along the diagonal are 1, , n which are called singular values. If we denote S2 by S and S1 by U we have A = S  UT

Example d1 : Romeo and Juliet. d2 : Juliet: O happy dagger!
d3 : Romeo died by dagger. d4 : “Live free or die”, that’s the New-Hampshire’s motto. d5 : Did you know, New-Hampshire is in New-England. q: dies, dagger

Document-term matrix

Latent Concepts Latent Semantic Indexing (LSI) is a method for discovering hidden concepts in document data. Each document and term (word) is then expressed as a vector with elements corresponding to these concepts. Each element in a vector gives the degree of participation of the document or term in the corresponding concept. Goal is not to describe the concepts verbally, but to be able to represent the documents and terms in a unified way for exposing document-document, document-term, and term-term similarities which are otherwise hidden…

Matrix  Matrix A can be written: A = SUT
Let's "neglect" the last three singular values of  as being too "small"... Also, just keep two columns from S obtaining S2 and two rows from UT obtaining U2T Matrix A is approximated as: A2 = S2U2T In general: Ak = SkUkT where a good value for k is determined empirically.

Matrices 2, S2, U2

Representing Documents, Terms, and Queries
Represent documents by the column vectors of U2T Represent terms by the row vectors S2 Represent queries by the centroid vector of their terms

Geometry

Similar presentations