Latent Semantic Analysis


1 Latent Semantic Analysis

2 Problem Introduction Traditional term-matching methods don't work well in information retrieval. We want to capture concepts rather than individual words. Concepts are reflected in the words, but the mapping is not one-to-one: one term may have multiple meanings, and different terms may have the same meaning.

3 The Problem Two problems arise when using the vector space model:
Synonymy: there are many ways to refer to the same object (e.g., car and automobile), which leads to poor recall.
Polysemy: most words have more than one distinct meaning (e.g., model, python, chip), which leads to poor precision.

4 The Problem Example: vector space model (from Lillian Lee).
One text uses: auto, engine, bonnet, tyres, lorry, boot.
Another uses: car, emissions, hood, make, model, trunk.
A third uses: make, hidden, Markov, model, emissions, normalize.
Synonymy: the first two will have a small cosine, but they are related.
Polysemy: the last two will have a large cosine, but they are not truly related.
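
Below is a minimal sketch (toy count vectors mirroring the word groups above, not part of the original slides) of how raw term matching yields these cosines:

import numpy as np

# Vocabulary order for the toy vectors (hypothetical data).
vocab = ["auto", "engine", "bonnet", "tyres", "lorry", "boot",
         "car", "emissions", "hood", "make", "model", "trunk",
         "hidden", "markov", "normalize"]
doc1 = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=float)  # British car text
doc2 = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0], dtype=float)  # American car text
doc3 = np.array([0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1], dtype=float)  # hidden Markov model text

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(doc1, doc2))  # 0.0: related texts with no shared terms (synonymy -> poor recall)
print(cosine(doc2, doc3))  # 0.5: unrelated texts sharing "make", "model", "emissions" (polysemy -> poor precision)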

5 LSI (Latent Semantic Indexing)
The LSI approach tries to overcome the deficiencies of term-matching retrieval by treating the unreliability of observed term-document association data as a statistical problem. The goal is to find effective models to represent the relationship between terms and documents. Hence a set of terms, which is by itself incomplete and unreliable, will be replaced by some set of entities that are more reliable indicants. Terms that did not appear in a document may still be associated with that document. LSI derives uncorrelated index factors that might be considered artificial concepts.

6 Some History Latent Semantic Indexing was developed at Bellcore (now Telcordia) in the late 1980s (1988). It was patented in 1989.

7 Some History The first papers about LSI:
Dumais, S. T., Furnas, G. W., Landauer, T. K. and Deerwester, S. (1988). "Using latent semantic analysis to improve information retrieval." In Proceedings of CHI'88: Conference on Human Factors in Computing, New York: ACM.
Deerwester, S., Dumais, S. T., Landauer, T. K., Furnas, G. W. and Harshman, R. A. (1990). "Indexing by latent semantic analysis." Journal of the American Society for Information Science, 41(6).
Foltz, P. W. (1990). "Using Latent Semantic Indexing for Information Filtering." In R. B. Allen (Ed.), Proceedings of the Conference on Office Information Systems, Cambridge, MA.

8 LSA But first: what is the difference between LSI and LSA?
LSI refers to using the technique for indexing or information retrieval. LSA refers to everything else.

9 LSA Idea (Deerwester et al.):
“We would like a representation in which a set of terms, which by itself is incomplete and unreliable evidence of the relevance of a given document, is replaced by some other set of entities which are more reliable indicants. We take advantage of the implicit higher-order (or latent) structure in the association of terms and documents to reveal such relationships.”

10 SVD (Singular Value Decomposition)
How do we learn the concepts from the data? SVD is applied to the term-document matrix to derive the latent semantic structure model. What is SVD?

11 SVD Basics
Singular Value Decomposition of the t x d term-document matrix X (terms by documents):
X = T0 S0 D0^T, where T0 is t x m, S0 is the m x m diagonal matrix of singular values, and D0^T is m x d.
Selecting only the first k singular values gives the reduced model:
X̂ = T S D^T, where T is t x k, S is k x k, and D^T is k x d.

12 SVD Basics II A rank-reduced Singular Value Decomposition (SVD) is performed on the matrix: all but the k highest singular values are set to 0. This produces a k-dimensional approximation of the original matrix (in the least-squares sense), which is the "semantic space". Similarities between entities are then computed in the semantic space, usually with the cosine measure.
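
A short numpy sketch of this on a made-up matrix (the toy data and variable names are assumptions; numpy's svd returns D^T directly):

import numpy as np

X = np.random.rand(8, 5)                          # toy t x d term-document matrix (t=8, d=5)
T, s, Dt = np.linalg.svd(X, full_matrices=False)  # full SVD: X = T @ diag(s) @ Dt
k = 2                                             # keep only the k largest singular values
X_hat = T[:, :k] @ np.diag(s[:k]) @ Dt[:k, :]     # best rank-k approximation (least-squares sense)

doc_vecs = Dt[:k, :].T * s[:k]                    # document coordinates in the k-dim semantic space

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(doc_vecs[0], doc_vecs[1]))           # similarity of documents 0 and 1 in the semantic space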

13 SVD SVD of the term-by-document matrix X: X = T0 S0 D0^T.
If the singular values in S0 are ordered by size and we keep only the first k largest values, we get a reduced model X̂ = T S D^T. X̂ doesn't exactly match X, and it gets closer as more and more singular values are kept (illustrated below). This is what we want: we don't want a perfect fit, since we believe some of the 0's in X should be 1's and vice versa. The reduced model reflects the major associative patterns in the data and ignores the smaller, less important influences and noise.
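
A quick illustration of that point on toy data (not from the slides): the approximation error shrinks as more singular values are kept.

import numpy as np

X = np.random.rand(8, 5)                          # toy term-document matrix
T, s, Dt = np.linalg.svd(X, full_matrices=False)

for k in range(1, len(s) + 1):
    X_hat = T[:, :k] @ np.diag(s[:k]) @ Dt[:k, :]
    print(k, round(float(np.linalg.norm(X - X_hat)), 4))  # Frobenius error decreases as k grows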

14 Fundamental Comparison Quantities from the SVD Model
Comparing two terms: the dot product between two row vectors of X̂ reflects the extent to which the two terms have a similar pattern of occurrence across the set of documents; it can be computed from the corresponding rows of TS.
Comparing two documents: the dot product between two column vectors of X̂; it can be computed from the corresponding rows of DS.
Comparing a term and a document: the value of the corresponding individual cell of X̂.
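
A rough sketch of these three comparisons on toy data (variable names are assumptions; TS and DS follow the notation above):

import numpy as np

X = np.random.rand(8, 5)                              # toy t x d term-document matrix
T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
k = 2
T, S, D = T0[:, :k], np.diag(s0[:k]), D0t[:k, :].T    # factors of the reduced model X̂ = T S D^T

TS = T @ S                                            # rows give term coordinates
DS = D @ S                                            # rows give document coordinates
X_hat = T @ S @ D.T                                   # reduced-model matrix

print(TS[0] @ TS[1])                                  # compare term 0 with term 1
print(DS[0] @ DS[1])                                  # compare document 0 with document 1
print(X_hat[0, 1])                                    # compare term 0 with document 1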

15 LSI Paper example (index terms in italics)

16 Term-document matrix

17 Latent Semantic Indexing
Singular Value Decomposition of the term-document matrix, X = T0 S0 D0^T; selecting the first k singular values gives the reduced model X̂ = T S D^T (same decomposition diagram as slide 11).

18 T0

19 S0

20 D0

21 SVD with minor terms dropped
The rows of TS and DS define coordinates for terms and documents in the latent space.

22 Terms Graphed in Two Dimensions

23 Documents and Terms

24 Change in Text Correlation

25 Summary Some Issues SVD algorithm complexity is O(n^2 k^3), where:
n = number of terms
k = number of dimensions in the semantic space (typically small, ~50 to 350)
For a stable document collection, the SVD only has to be run once. For dynamic document collections, the SVD might need to be rerun, but new documents can also be "folded in" (sketched below).
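
A rough sketch of folding in, using the commonly cited formula d_hat = q^T T S^-1 (q is the new document's term vector; the formula is an assumption, since the slide does not spell it out):

import numpy as np

X = np.random.rand(8, 5)                              # toy t x d term-document matrix
T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
k = 2
T, s = T0[:, :k], s0[:k]                              # reduced term matrix T and singular values

q = np.random.rand(8)                                 # term vector of a new, unseen document
d_hat = (q @ T) / s                                   # fold-in: the new document's row for D

D = D0t[:k, :].T                                      # existing document matrix D of the reduced model
D_new = np.vstack([D, d_hat])                         # D with the folded-in document appended, no SVD rerun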

26 Summary Some issues Finding the optimal dimension for the semantic space:
Precision and recall improve as the dimension is increased up to an optimum, then slowly decrease until performance matches the standard vector model.
Run the SVD once with a large dimension, say k = 1000; smaller dimensions <= k can then be tested cheaply (sketched below).
This works well in many tasks, but there is still room for research.
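
A sketch of this strategy on toy data (real collections would use a sparse, truncated SVD at the large dimension):

import numpy as np

X = np.random.rand(200, 100)                          # toy term-document matrix (real k might be ~1000)
T0, s0, D0t = np.linalg.svd(X, full_matrices=False)   # run once at the largest dimension

for k in (25, 50, 100):                               # any smaller dimension is then just a slice
    DS = D0t[:k, :].T * s0[:k]                        # document coordinates at dimension k
    # ... evaluate retrieval precision/recall with DS here ...
    print(k, DS.shape)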

27 Summary Some issues SVD assumes normally distributed data, but term occurrence is not normally distributed.
For this reason the matrix entries are weights, not raw counts; the weights may be approximately normally distributed even when the counts are not (one common weighting scheme is sketched below).
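
One common choice in the LSI literature is log-entropy weighting; the sketch below assumes that scheme, since the slide does not name one.

import numpy as np

counts = np.random.randint(0, 5, size=(8, 5)).astype(float)  # toy t x d raw term counts
n_docs = counts.shape[1]

gf = counts.sum(axis=1, keepdims=True)                # global frequency of each term
p = counts / np.maximum(gf, 1e-12)                    # share of each term's occurrences per document
with np.errstate(divide="ignore", invalid="ignore"):
    plogp = np.where(p > 0, p * np.log(p), 0.0)
global_w = 1.0 + plogp.sum(axis=1, keepdims=True) / np.log(n_docs)  # entropy-based global weight in [0, 1]
local_w = np.log1p(counts)                            # log(1 + count) local weight

X = local_w * global_w                                # weighted matrix that would be fed to the SVD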

