4 That is, A_k is the optimal approximation, in terms of the approximation error measured by the Frobenius norm, among all matrices of rank at most k. This forms the basis of LSI (Latent Semantic Indexing) in information retrieval.
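The optimality statement can be checked numerically. The following is a minimal sketch in Python with NumPy; the random matrix `A`, the choice `k = 2`, and the helper name `truncated_svd_approx` are illustrative assumptions, not part of the slides. It builds A_k from the k largest singular triplets and compares the Frobenius error with the square root of the sum of the squared discarded singular values.

```python
import numpy as np

def truncated_svd_approx(A, k):
    """Return the rank-k matrix A_k built from the k largest singular triplets of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Arbitrary test matrix (assumption: any real matrix would do here).
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 5))
k = 2

A_k = truncated_svd_approx(A, k)
s = np.linalg.svd(A, compute_uv=False)

# Frobenius error of the truncated SVD.
err = np.linalg.norm(A - A_k, "fro")
# Square root of the sum of the squared discarded singular values.
expected = np.sqrt(np.sum(s[k:] ** 2))

# Another rank-k matrix for comparison; it should not beat A_k.
B = rng.standard_normal((6, k)) @ rng.standard_normal((k, 5))
err_other = np.linalg.norm(A - B, "fro")

print(np.linalg.matrix_rank(A_k), np.isclose(err, expected), err <= err_other)
```

Comparing against a single random rank-k matrix is only a sanity check, not a proof of optimality.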
6 Applications of SVD Pseudoinverse Range, null space and rank Matrix approximation Other examples http://en.wikipedia.org/wiki/Singular_value_decomposition
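Three of the applications listed above (rank, range and null space, pseudoinverse) can be read off a single SVD. The sketch below uses Python with NumPy; the function name `svd_summary`, the tolerance `tol`, and the example matrix are assumptions made for illustration.

```python
import numpy as np

def svd_summary(A, tol=1e-10):
    """Return (rank, range basis, null-space basis, pseudoinverse) of A via its SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=True)
    # Singular values above a relative threshold count as nonzero.
    rank = int(np.sum(s > tol * s[0]))
    range_basis = U[:, :rank]        # columns span the range of A
    null_basis = Vt[rank:, :].T      # columns span the null space of A
    # Pseudoinverse: invert only the nonzero singular values.
    pinv = Vt[:rank, :].T @ np.diag(1.0 / s[:rank]) @ U[:, :rank].T
    return rank, range_basis, null_basis, pinv

# Example matrix (assumption): the second row is twice the first, so the rank is 2.
A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [1.0, 0.0, 1.0]])

rank, range_basis, null_basis, pinv = svd_summary(A)
print(rank, null_basis.shape)
print(np.allclose(A @ null_basis, 0.0), np.allclose(A @ pinv @ A, A))
```

The last line checks that A sends the null-space basis to zero and that the result satisfies the Moore-Penrose condition A A⁺ A = A.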
7 LSI (Latent Semantic Indexing) Introduction Latent Semantic Indexing LSI Query Updating An example
8 Problem Introduction Traditional term-matching methods do not work well in information retrieval. We want to capture the concepts instead of the words. Concepts are reflected in the words. However, one term may have multiple meanings (polysemy), and different terms may have the same meaning (synonymy).
9 LSI (Latent Semantic Indexing) The LSI approach tries to overcome the deficiencies of term-matching retrieval by treating the unreliability of observed term-document association data as a statistical problem. The goal is to find effective models to represent the relationship between terms and documents. Hence a set of terms, which is by itself incomplete and unreliable, will be replaced by some set of entities which are more reliable indicants.
10 LSI, the Method Start from the document-term matrix M. Decompose M by SVD: M = U Σ V^T. Approximate M using the truncated SVD: M ≈ M_k = U_k Σ_k V_k^T.
11 LSI, the Method (cont.) Each row and column of M is mapped into the k-dimensional LSI space by the SVD.
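The method can be sketched end to end on a toy corpus in Python with NumPy. Everything specific here is an assumption for illustration: the five documents, the raw term counts, the choice `k = 2`, and the convention of scaling coordinates by the singular values (documents as rows of U_k Σ_k, terms as rows of V_k Σ_k). Other presentations use the unscaled rows of U_k and V_k.

```python
import numpy as np

# Hypothetical toy corpus (not from the slides).
docs = [
    "car engine repair",
    "automobile engine repair",
    "car automobile dealer",
    "fruit apple banana",
    "apple banana smoothie",
]

vocab = sorted({w for d in docs for w in d.split()})
col = {w: j for j, w in enumerate(vocab)}

# Document-term matrix M: one row per document, one column per term (raw counts).
M = np.zeros((len(docs), len(vocab)))
for i, d in enumerate(docs):
    for w in d.split():
        M[i, col[w]] += 1.0

def lsi_coordinates(M, k):
    """Truncated SVD of M; returns k-dimensional document and term coordinates."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    U_k, s_k, V_k = U[:, :k], s[:k], Vt[:k, :].T
    doc_coords = U_k * s_k    # rows of U_k Sigma_k (assumed scaling convention)
    term_coords = V_k * s_k   # rows of V_k Sigma_k (assumed scaling convention)
    return doc_coords, term_coords, (U_k, s_k, V_k)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

k = 2
doc_coords, term_coords, (U_k, s_k, V_k) = lsi_coordinates(M, k)

print(doc_coords.shape, term_coords.shape)
# Documents 0 and 1 share a topic; documents 0 and 3 do not.
print(cosine(doc_coords[0], doc_coords[1]), cosine(doc_coords[0], doc_coords[3]))
```

In this corpus the two topics share no terms, so the two retained dimensions separate them cleanly; real collections are rarely this tidy.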
12 Query A query q is also mapped into this space. Similarity is then compared in the new space. Intuition: Dimension reduction through LSI brings together “related” axes in the vector space.
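A hedged sketch of query mapping in Python with NumPy follows. The mapping formula on the original slide is not available, so this uses one common convention as an assumption: with a document-term matrix M ≈ U_k Σ_k V_k^T, a query term vector q is projected as q V_k, which places it in the same coordinates as the document rows U_k Σ_k. Some presentations divide by Σ_k as well, matching the rows of U_k instead. The corpus, the query, and `k = 2` are hypothetical.

```python
import numpy as np

# Hypothetical toy corpus and query (not from the slides).
docs = [
    "car engine repair",
    "automobile engine repair",
    "car automobile dealer",
    "fruit apple banana",
    "apple banana smoothie",
]
query = "automobile"

vocab = sorted({w for d in docs for w in d.split()})
col = {w: j for j, w in enumerate(vocab)}

def term_vector(text):
    """Raw term-count vector over the corpus vocabulary; unknown words are ignored."""
    v = np.zeros(len(vocab))
    for w in text.split():
        if w in col:
            v[col[w]] += 1.0
    return v

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Document-term matrix: one row per document.
M = np.vstack([term_vector(d) for d in docs])

k = 2
U, s, Vt = np.linalg.svd(M, full_matrices=False)
V_k = Vt[:k, :].T
doc_coords = U[:, :k] * s[:k]   # documents in LSI space (rows of U_k Sigma_k)

q = term_vector(query)
q_k = q @ V_k                   # query mapped into the same space (assumed convention)

raw_sims = [cosine(q, M[i]) for i in range(len(docs))]
lsi_sims = [cosine(q_k, doc_coords[i]) for i in range(len(docs))]

for d, r, l in zip(docs, raw_sims, lsi_sims):
    print(f"{d!r}: term-matching={r:.2f}, LSI={l:.2f}")
```

Document 0 ("car engine repair") shares no term with the query, so term matching gives it zero similarity, while in the reduced space it lands on the same axis as the other vehicle documents.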
18 How to set the value of k? LSI is useful only if k << n. If k is too large, the reduced space doesn't capture the underlying latent semantic space; if k is too small, too much is lost. There is no principled way of determining the best k.
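The "too much is lost" side of this tradeoff can at least be made visible. The sketch below (Python with NumPy; the function name `relative_error_by_k` and the random test matrix are assumptions) computes the relative Frobenius error of the rank-k truncation for every k directly from the singular values. It does not choose k; it only shows how much of the matrix each truncation discards.

```python
import numpy as np

def relative_error_by_k(M):
    """Relative Frobenius error ||M - M_k||_F / ||M||_F for k = 1 .. min(M.shape)."""
    s = np.linalg.svd(M, compute_uv=False)
    total = np.sum(s ** 2)
    # Squared error of the rank-k truncation = sum of the squared discarded singular values.
    tail = np.clip(total - np.cumsum(s ** 2), 0.0, None)
    return np.sqrt(tail / total)

# Arbitrary nonnegative matrix standing in for a document-term matrix (assumption).
rng = np.random.default_rng(1)
M = rng.integers(0, 3, size=(8, 6)).astype(float)

errors = relative_error_by_k(M)
for k, e in enumerate(errors, start=1):
    print(f"k={k}: relative error {e:.3f}")
```

The error never increases with k and reaches zero at full rank, which is why reconstruction error alone cannot identify the best k for retrieval.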
19 How well does LSI work? Effectiveness of LSI compared to regular term-matching depends on the nature of the documents. Typical improvement: 0 to 30% better precision. The advantage is greater for texts in which synonymy and ambiguity are more prevalent. Best when recall is high. Costs of LSI might outweigh the improvement: SVD is computationally expensive, so LSI is of limited use for really large document collections, and an inverted index is not possible.
20 References Mini tutorial on the Singular Value Decomposition: http://www.cs.brown.edu/research/ai/dynamics/tutorial/Postscript/SingularValueDecomposition.ps Basics of linear algebra: http://www.stanford.edu/class/cs229/section/section_lin_algebra.pdf