1
**Latent Semantic Analysis**

2
**An Example**

d1 : Romeo and Juliet.
d2 : Juliet: O happy dagger!
d3 : Romeo died by dagger.
d4 : “Live free or die”, that’s the New-Hampshire’s motto.
d5 : Did you know, New-Hampshire is in New-England.

q: dies, dagger

Which documents should be returned, and how should they be ranked?

3
**Eigenvectors and Eigenvalues**

Let A be an n × n matrix. If x is an n-dimensional vector, then the matrix-vector product Ax is well-defined, and the result is again an n-dimensional vector. In general, multiplication by a matrix changes the direction of a non-zero vector x, unless the vector is special and we have $Ax = \lambda x$ for some scalar $\lambda$. Such a vector x is called an eigenvector of A, and $\lambda$ is the corresponding eigenvalue.
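A small numerical check can make this concrete. The sketch below uses a hypothetical 2 × 2 matrix chosen only for illustration (it does not come from the slides) and compares a vector that is merely rescaled by A with one whose direction changes.

```python
import numpy as np

# Hypothetical 2 x 2 matrix, chosen only for illustration.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# x is an eigenvector of A: multiplying by A only rescales it.
x = np.array([1.0, 1.0])
Ax = A @ x                    # equals 3 * x
lam = Ax[0] / x[0]            # the scaling factor (eigenvalue)

# A generic vector, by contrast, changes direction.
v = np.array([1.0, 0.0])
Av = A @ v                    # [2, 1] is not a multiple of v


def is_parallel(u, w):
    """True if the 2-D vectors u and w lie on the same line through the origin."""
    return bool(np.isclose(u[0] * w[1] - u[1] * w[0], 0.0))


print(is_parallel(x, Ax), is_parallel(v, Av))
```

The helper `is_parallel` is an illustrative name; it uses the 2-D cross product, so it only applies to two-dimensional vectors.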

4
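**Matrix Decomposition**

Let S be the matrix with the eigenvectors of A as columns. Let $\Lambda$ be the diagonal matrix with the eigenvalues of A on the diagonal. Then, provided S is invertible, $A = S \Lambda S^{-1}$.

If A is symmetric, the eigenvectors can be chosen orthonormal, so that $S^{-1} = S^T$ and $A = S \Lambda S^T$.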
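The symmetric case can be checked numerically. This is a minimal sketch on a hypothetical symmetric matrix (not from the slides), using NumPy's `numpy.linalg.eigh`, which returns orthonormal eigenvectors for symmetric input.

```python
import numpy as np

# Hypothetical symmetric matrix, chosen only for illustration.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh is intended for symmetric (Hermitian) matrices: it returns the
# eigenvalues in ascending order and orthonormal eigenvectors as columns.
eigenvalues, S = np.linalg.eigh(A)
Lam = np.diag(eigenvalues)

# Rebuild A from its eigenvectors and eigenvalues.
A_rebuilt = S @ Lam @ S.T

print(np.allclose(A_rebuilt, A))           # decomposition reproduces A
print(np.allclose(S.T @ S, np.eye(2)))     # columns of S are orthonormal
```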

5
**Singular Value Decomposition**

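Let A be an m × n matrix with real entries and m > n. Consider the n × n square matrix $B = A^T A$. B is symmetric, and it has been shown that the eigenvalues of such ($A^T A$) matrices are non-negative. Since they are non-negative, we can write them in decreasing order as squares of non-negative real numbers:

$$\sigma_1^2 \ge \sigma_2^2 \ge \dots \ge \sigma_n^2$$

For some index r (possibly n) the first r numbers are positive whereas the rest are zero. Let $x_1, \dots, x_r$ be orthonormal eigenvectors of B corresponding to $\sigma_1^2, \dots, \sigma_r^2$, and define

$$S_1 = [x_1, \dots, x_r], \qquad y_i = \frac{1}{\sigma_i} A x_i \quad (i = 1, \dots, r), \qquad S_2 = [y_1, \dots, y_r]$$

We can show that $A = S_2 \Sigma S_1^T$, where $\Sigma$ is diagonal and the values along its diagonal are $\sigma_1, \dots, \sigma_r$ (the positive values among $\sigma_1, \dots, \sigma_n$), which are called singular values. If we denote $S_2$ by S and $S_1$ by U, we have $A = S \Sigma U^T$.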
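The construction above can be followed step by step in code. The sketch below is illustrative only: the 4 × 3 matrix and the tolerance `tol` used to decide which eigenvalues count as positive are assumptions, not part of the slides.

```python
import numpy as np

# Hypothetical 4 x 3 matrix (m > n), chosen only for illustration.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

# B = A^T A is symmetric with non-negative eigenvalues.
B = A.T @ A
eigvals, eigvecs = np.linalg.eigh(B)       # eigenvalues in ascending order

# Reorder so the eigenvalues are decreasing.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Keep the r strictly positive eigenvalues (tol is an assumed threshold).
tol = 1e-10
r = int(np.sum(eigvals > tol))
sigma = np.sqrt(eigvals[:r])               # singular values

S1 = eigvecs[:, :r]                        # columns x_1, ..., x_r
S2 = (A @ S1) / sigma                      # columns y_i = (1 / sigma_i) A x_i
Sigma = np.diag(sigma)

A_rebuilt = S2 @ Sigma @ S1.T
print(np.allclose(A_rebuilt, A))
```

In practice one would normally call `numpy.linalg.svd` directly; forming $A^T A$ explicitly is shown here only because it mirrors the derivation.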

6
**Example**

d1 : Romeo and Juliet.
d2 : Juliet: O happy dagger!
d3 : Romeo died by dagger.
d4 : “Live free or die”, that’s the New-Hampshire’s motto.
d5 : Did you know, New-Hampshire is in New-England.

q: dies, dagger

7
**Document-term matrix**
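As a rough sketch, a count matrix for d1 to d5 could be built as follows. The vocabulary, the dropping of all other words, and the mapping of "died" and "dies" to "die" are assumptions made for illustration and are not taken from the slides. Rows are terms and columns are documents, which matches the later convention of reading terms from rows and documents from columns.

```python
import re
import numpy as np

docs = {
    "d1": "Romeo and Juliet.",
    "d2": "Juliet: O happy dagger!",
    "d3": "Romeo died by dagger.",
    "d4": "\"Live free or die\", that's the New-Hampshire's motto.",
    "d5": "Did you know, New-Hampshire is in New-England.",
}

# Assumed vocabulary: every other word is treated as a stop word.
terms = ["romeo", "juliet", "happy", "dagger",
         "live", "die", "free", "new-hampshire"]

# Assumed, hand-written normalization standing in for a real stemmer.
normalize = {"died": "die", "dies": "die"}


def tokenize(text):
    """Lowercase, split into (possibly hyphenated) words, normalize."""
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
    return [normalize.get(t, t) for t in tokens]


# A[i, j] counts how often term i occurs in document j.
A = np.zeros((len(terms), len(docs)), dtype=int)
for j, text in enumerate(docs.values()):
    for token in tokenize(text):
        if token in terms:
            A[terms.index(token), j] += 1

print(A)
```

With the same helper, the query "dies, dagger" tokenizes to the two vocabulary terms `die` and `dagger`.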

8
**Latent Concepts**

Latent Semantic Indexing (LSI) is a method for discovering hidden concepts in document data. Each document and term (word) is then expressed as a vector with elements corresponding to these concepts. Each element in a vector gives the degree of participation of the document or term in the corresponding concept. The goal is not to describe the concepts verbally, but to represent documents and terms in a unified way that exposes document-document, document-term, and term-term similarities which are otherwise hidden.

9
**Matrix A can be written: $A = S \Sigma U^T$**

Let's "neglect" the last three singular values of $\Sigma$ as being too "small", obtaining $\Sigma_2$. Also, keep just two columns from S, obtaining $S_2$, and two rows from $U^T$, obtaining $U_2^T$. Matrix A is then approximated as:

$$A_2 = S_2 \Sigma_2 U_2^T$$

In general, $A_k = S_k \Sigma_k U_k^T$, where a good value for k is determined empirically.
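The truncation can be sketched with NumPy. The matrix below is an assumed term-by-document count matrix for d1 to d5 (vocabulary and preprocessing are illustrative choices, not given in the slides). Note the naming: what the slides call S and $U^T$ correspond to the first and third return values of `numpy.linalg.svd`.

```python
import numpy as np

# Assumed term-by-document counts for d1..d5; rows are the terms
# romeo, juliet, happy, dagger, live, die, free, new-hampshire.
A = np.array([[1, 0, 1, 0, 0],
              [1, 1, 0, 0, 0],
              [0, 1, 0, 0, 0],
              [0, 1, 1, 0, 0],
              [0, 0, 0, 1, 0],
              [0, 0, 1, 1, 0],
              [0, 0, 0, 1, 0],
              [0, 0, 0, 1, 1]], dtype=float)

# Slides' S  -> left singular vectors, slides' U^T -> right singular vectors.
S, sigma, Ut = np.linalg.svd(A, full_matrices=False)

k = 2                                  # number of singular values to keep
S_k = S[:, :k]                         # first k columns of S
Sigma_k = np.diag(sigma[:k])           # k largest singular values
Ut_k = Ut[:k, :]                       # first k rows of U^T

A_k = S_k @ Sigma_k @ Ut_k             # rank-k approximation of A

# The Frobenius error comes entirely from the discarded singular values.
error = np.linalg.norm(A - A_k)
print(sigma, error)
```

This matrix has five singular values, so keeping k = 2 discards the last three, as on the slide.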

10
**Matrices $\Sigma_2$, $S_2$, $U_2$**

11
**Representing Documents, Terms, and Queries**

Represent documents by the column vectors of $U_2^T$.
Represent terms by the row vectors of $S_2$.
Represent queries by the centroid vector of their terms.
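A minimal sketch of these three representations follows, with documents ranked against the query by cosine similarity. The count matrix, the vocabulary, the mapping of "dies" to "die", and the use of cosine similarity as the ranking score are all assumptions for illustration. Other treatments of LSI scale the vectors by the singular values; that variant is not shown here.

```python
import numpy as np

terms = ["romeo", "juliet", "happy", "dagger",
         "live", "die", "free", "new-hampshire"]
docs = ["d1", "d2", "d3", "d4", "d5"]

# Assumed term-by-document counts (rows follow `terms`, columns follow `docs`).
A = np.array([[1, 0, 1, 0, 0],
              [1, 1, 0, 0, 0],
              [0, 1, 0, 0, 0],
              [0, 1, 1, 0, 0],
              [0, 0, 0, 1, 0],
              [0, 0, 1, 1, 0],
              [0, 0, 0, 1, 0],
              [0, 0, 0, 1, 1]], dtype=float)

# Slides' S and U^T are the first and third outputs of numpy's svd.
S, sigma, Ut = np.linalg.svd(A, full_matrices=False)
k = 2

term_vecs = S[:, :k]          # one row per term (rows of S_2)
doc_vecs = Ut[:k, :].T        # one row per document (columns of U_2^T)

# Query "dies, dagger", with "dies" normalized to "die" (assumption).
query_terms = ["die", "dagger"]
q = term_vecs[[terms.index(t) for t in query_terms]].mean(axis=0)  # centroid


def cosine(u, v):
    """Cosine of the angle between vectors u and v."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


scores = {d: cosine(q, doc_vecs[j]) for j, d in enumerate(docs)}
ranking = sorted(docs, key=lambda d: scores[d], reverse=True)
print(ranking)
```

The signs of individual singular vectors are not unique, but a sign flip affects the term and document coordinates of a dimension together, so the cosine scores are unaffected.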

12
**Geometry**
