Latent Semantic Analysis


1 Latent Semantic Analysis
Keith Trnka

2 Latent Semantic Indexing
- Application of Latent Semantic Analysis (LSA) to Information Retrieval
- Motivations
  - Unreliable evidence
  - Synonymy: many words refer to the same object
    - Affects recall
  - Polysemy: many words have multiple meanings
    - Affects precision

3 LSI Motivation (example)
Source: Deerwester et al. 1990

4 LSA Solution
- Terms are overly noisy
  - analogous to overfitting in the Term-by-Document matrix
- Terms and documents should be represented by vectors in a “latent” semantic space
- LSI essentially infers knowledge from co-occurrence of terms
- Assume “errors” (sparse data, non-co-occurrences) are normal and account for them

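5 LSA Methods
- Start with a Term-by-Document matrix (A, like fig. 15.5)
- Optionally weight cells
- Apply Singular Value Decomposition: A = T S D^T
  - t = # of terms
  - d = # of documents
  - n = min(t, d)
  - A is t × d, T is t × n, S is n × n (diagonal, holding the singular values), D is d × n
- Approximate using k (semantic) dimensions: A_k = T_k S_k D_k^T, keeping only the k largest singular values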
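The decomposition and truncation steps on this slide can be sketched with NumPy. This is a minimal illustration, not the presenter's code: the five-term vocabulary, the counts in `A`, and the helper name `truncated_svd` are all invented for the example, and no cell weighting is applied.

```python
import numpy as np

# Toy term-by-document count matrix A (t = 5 terms, d = 4 documents).
# The vocabulary and counts are invented purely for illustration.
terms = ["human", "computer", "interface", "graph", "tree"]
A = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
], dtype=float)


def truncated_svd(A, k):
    """Return (T_k, s_k, D_k) such that A is approximated by T_k @ diag(s_k) @ D_k.T."""
    # full_matrices=False gives T as t x n and Dt as n x d, with n = min(t, d).
    T, s, Dt = np.linalg.svd(A, full_matrices=False)
    return T[:, :k], s[:k], Dt[:k, :].T


T_k, s_k, D_k = truncated_svd(A, k=2)
A_k = T_k @ np.diag(s_k) @ D_k.T  # rank-2 approximation of A

print("singular values kept:", s_k)
print("rank-2 approximation:\n", np.round(A_k, 2))
```

Each row of `T_k` is a term vector and each row of `D_k` is a document vector in the k-dimensional space.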

6 LSA Methods (cont’d)
- The k-dimensional approximation is chosen so that its Euclidean distance from A is minimized (hence, a least squares method)
- Each row of T gives one term's weights on the semantic dimensions
- Likewise, each row of D gives one document's weights
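The least-squares property can be checked numerically. The sketch below assumes a small random matrix in place of a real term-by-document matrix, and the helper names are hypothetical. It compares the Frobenius (Euclidean) error of each rank-k truncation with the value predicted from the discarded singular values.

```python
import numpy as np

# A small random matrix stands in for a term-by-document matrix (assumption).
rng = np.random.default_rng(0)
A = rng.random((6, 5))

T, s, Dt = np.linalg.svd(A, full_matrices=False)


def rank_k(k):
    """Rank-k approximation built from the k largest singular values."""
    return T[:, :k] @ np.diag(s[:k]) @ Dt[:k, :]


def frobenius_error(k):
    """Euclidean (Frobenius) distance between A and its rank-k approximation."""
    return np.linalg.norm(A - rank_k(k), ord="fro")


ks = range(1, len(s) + 1)
errors = [frobenius_error(k) for k in ks]
# For the truncated SVD, the error equals the root of the summed squares
# of the singular values that were dropped.
predicted = [np.sqrt(np.sum(s[k:] ** 2)) for k in ks]

for k, e, p in zip(ks, errors, predicted):
    print(f"k={k}  error={e:.4f}  predicted={p:.4f}")
```

The error shrinks as k grows and reaches zero when all n dimensions are kept.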

7 LSA Application
- Querying for Information Retrieval:
  - query is a pseudo-document: weighted sum over all terms in query of rows of T
  - compare similarity to all documents in D using cosine similarity measure
- Document similarity: vector comparison of rows of D
- Term similarity: vector comparison of rows of T
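A hedged sketch of the querying step follows. It assumes a toy matrix, and it uses one common fold-in convention (query weights times T_k, divided by the singular values); the scaling applied before the cosine comparison varies between descriptions of LSI. The names `fold_in_query` and `cosine` are hypothetical helpers.

```python
import numpy as np

# Toy term-by-document matrix; vocabulary and counts are invented.
terms = ["human", "computer", "interface", "graph", "tree"]
A = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
], dtype=float)

k = 2
T, s, Dt = np.linalg.svd(A, full_matrices=False)
T_k, s_k, D_k = T[:, :k], s[:k], Dt[:k, :].T


def fold_in_query(term_weights):
    """Pseudo-document: weighted sum of rows of T_k, rescaled by 1/s_k (assumed convention)."""
    return (term_weights @ T_k) / s_k


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Query containing the single term "interface".
query = np.zeros(len(terms))
query[terms.index("interface")] = 1.0

q_hat = fold_in_query(query)
scores = [cosine(q_hat, doc_vec) for doc_vec in D_k]  # one score per document
for j, score in enumerate(scores):
    print(f"doc {j}: {score:.3f}")
```

In this toy matrix, document 1 does not contain "interface" but still scores as high as document 0, because it shares co-occurring terms with it. Documents 2 and 3 score near zero.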

8 LSA Application (cont’d)
- Choosing k is difficult
  - commonly k = 100, 150, 300 or so
  - overfitting (superfluous dimensions) vs. underfitting (not enough dimensions)
- What are the k semantic dimensions?
  - undefined
- k performance

9 LSI Performance
Source: Dumais 1997

10 Considerations of LSA
- Conceptually high recall: query and document terms may be disjoint
- Polysemes not handled well
- LSI: Unsupervised/completely automatic
- Language independent
  - CL-LSI: Cross-Language LSI
  - weakly trained
- Computational complexity is high
  - optimization: random sampling methods
- Formal Linear Algebra foundations
- Models language acquisition in children

11 Fun Applications of LSA
- Synonyms on TOEFL
  - Train on a corpus of newspapers, Grolier’s encyclopedia, children’s reading
  - Use Term similarity and random guessing for unknown terms
  - Best results at k = 400
- The model got 51.5 correct, or 64.4% (52.5% corrected for guessing). By comparison a large sample of applicants to U.S. colleges from non-English speaking countries who took tests containing these items averaged 51.6 items correct, or 64.5% (52.7% corrected for guessing).
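The synonym test and the "corrected for guessing" figures can be illustrated as follows. This sketch makes two assumptions that are not stated on the slide: each item has four answer choices (the reported percentages are consistent with that), and the standard correction (p - chance) / (1 - chance) is used. The two-dimensional term vectors are made up; in practice they would be rows of T.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def pick_synonym(stem_vec, option_vecs):
    """Return the index of the answer option most similar to the stem word."""
    sims = [cosine(stem_vec, v) for v in option_vecs]
    return int(np.argmax(sims))


def corrected_for_guessing(proportion_correct, n_options=4):
    """Rescale a raw score so that chance performance maps to 0 (n_options is assumed)."""
    chance = 1.0 / n_options
    return (proportion_correct - chance) / (1.0 - chance)


# Hypothetical term vectors for one test item.
stem = np.array([0.9, 0.1])
options = [
    np.array([0.1, 0.9]),
    np.array([0.8, 0.3]),
    np.array([-0.5, 0.2]),
    np.array([0.0, 1.0]),
]
best = pick_synonym(stem, options)

lsa = corrected_for_guessing(0.644)      # raw LSA score from the slide
humans = corrected_for_guessing(0.645)   # raw test-taker score from the slide
print(best, round(lsa * 100, 1), round(humans * 100, 1))
```

Under these assumptions the correction reproduces the 52.5% and 52.7% figures quoted above.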

12 Subject-Matter Knowledge
- Introduction to Psychology multiple choice tests from New Mexico State University and the University of Colorado, Boulder
- LSA scored significantly below the average, although it received a passing grade
- LSA has trouble with knowledge not in the corpus

13 Text Coherence/Comprehension
- Kintsch and colleagues
  - Convert a text to a logical representation and test for coherence
  - Slow, by hand
- LSA
  - Compute the cosine similarity between each sentence or paragraph (window of discourse) and the next one
  - Predicted comprehension with r = .93
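The LSA coherence measure can be sketched as the average cosine between adjacent windows of discourse. The sentence vectors below are hypothetical. In a real setup each would be derived from the LSA term vectors of the words in the sentence, for example by summing them, which is an assumption here.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def coherence(sentence_vecs):
    """Mean cosine between each sentence vector and the one that follows it."""
    sims = [cosine(a, b) for a, b in zip(sentence_vecs[:-1], sentence_vecs[1:])]
    return float(np.mean(sims)), sims


# Hypothetical sentence vectors in a 2-dimensional semantic space.
smooth_text = [np.array([1.0, 0.0]), np.array([0.9, 0.1]), np.array([0.8, 0.2])]
jumpy_text = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 0.0])]

smooth_score, _ = coherence(smooth_text)
jumpy_score, _ = coherence(jumpy_text)
print(f"smooth: {smooth_score:.3f}  jumpy: {jumpy_score:.3f}")
```

A text whose consecutive sentences stay on topic scores close to 1, while one that switches topic at every sentence scores close to 0.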

14 Essay Grading

15 Summary
- Latent Semantic Analysis has a wide variety of useful applications
  - Useful in cognitive psychology research
  - Useful as a generic IR technique
  - Possibly useful for lazy TAs?

16 References (Papers, etc.)
- (Manning and Schutze)
- Using LSI for Information Retrieval, Information Filtering and Other Things (Dumais 1997)
- Indexing by Latent Semantic Analysis (Deerwester et al. 1990)
- Automatic Cross-Language IR Using LSI (Dumais et al. 1996)
- (Landauer et al.)
  - An Introduction to Latent Semantic Analysis (1998)
  - A Solution to Plato’s Problem… (1997)
  - How Well Can Passage Meaning be Derived without Using Word Order? … (1997)

17 References (Books)
- NLP Related
  - Foundations of Statistical NLP, Manning and Schutze 2003
- SVD Related
  - Linear Algebra, Strang 1998
  - Matrix Computations (3rd ed.), Golub and Van Loan 1996

18 LSA Example/The End

