Latent Semantic Analysis


Latent Semantic Analysis
Keith Trnka

Latent Semantic Indexing
Application of Latent Semantic Analysis (LSA) to Information Retrieval
Motivation: term matching is unreliable evidence
- Synonymy: many words refer to the same object (affects recall)
- Polysemy: one word has multiple meanings (affects precision)

LSI Motivation (example) Source: Deerwester et al. 1990

LSA Solution
- Raw terms are overly noisy, analogous to overfitting on the term-by-document matrix
- Represent terms and documents as vectors in a "latent" semantic space
- LSI essentially infers knowledge from the co-occurrence of terms
- Assume "errors" (sparse data, missing co-occurrences) are normal and account for them

LSA Methods
- Start with a term-by-document matrix A (like fig. 15.5), with t = # of terms and d = # of documents
- Optionally weight the cells
- Apply the Singular Value Decomposition: A = T S D^T, where T is t x n, S is an n x n diagonal matrix of singular values, D is d x n, and n = min(t, d)
- Approximate A using only the k largest singular values (the k semantic dimensions): A ≈ A_k = T_k S_k D_k^T
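The following is a minimal sketch of this decomposition in Python with NumPy. The toy term-by-document matrix, its term labels, and the choice k = 2 are illustrative assumptions, not values from the slides.

```python
import numpy as np

# Toy term-by-document matrix A: rows = terms, columns = documents.
A = np.array([
    [1, 1, 0, 0],   # "ship"
    [0, 1, 1, 0],   # "boat"
    [1, 0, 0, 1],   # "ocean"
    [0, 0, 1, 1],   # "voyage"
], dtype=float)

# Thin SVD: A = T @ diag(S) @ D^T, with n = min(t, d) singular values.
T, S, Dt = np.linalg.svd(A, full_matrices=False)

# Rank-k approximation A_k = T_k S_k D_k^T keeps only the k largest
# singular values -- the k "semantic" dimensions.
k = 2
T_k, S_k, D_k = T[:, :k], S[:k], Dt[:k, :].T
A_k = T_k @ np.diag(S_k) @ D_k.T   # best rank-k approximation of A
```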

LSA Methods (cont'd)
- A_k is the rank-k matrix that minimizes the Euclidean (Frobenius) distance to A; hence LSA is a least-squares method
- Each row of T measures how strongly a term associates with each semantic dimension
- Likewise, each row of D relates a document to the semantic dimensions
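Continuing the sketch above, the rows of T_k and D_k give term and document vectors in the k-dimensional latent space. Scaling the rows by the singular values before comparing them is one common convention, assumed here rather than taken from the slide.

```python
def cosine(u, v):
    """Cosine similarity between two vectors."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

term_vecs = T_k * S_k   # row i: term i in the k-dimensional latent space
doc_vecs = D_k * S_k    # row j: document j in the latent space

# Terms that occur in similar contexts end up close together,
# e.g. "ship" (row 0) vs. "boat" (row 1):
sim_ship_boat = cosine(term_vecs[0], term_vecs[1])
```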

LSA Application
- Querying for Information Retrieval: the query is treated as a pseudo-document, a weighted sum of the rows of T over the terms in the query; its similarity to every document in D is computed with the cosine measure (see the sketch below)
- Document similarity: vector comparison of rows of D
- Term similarity: vector comparison of rows of T
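A sketch of query folding-in, continuing from the code above. One standard convention (following Deerwester et al. 1990) maps the query's raw term vector q into the latent space as q T_k S_k^-1 and then ranks documents by cosine against the rows of D_k; the query itself is made up for illustration.

```python
# Query "ship ocean" as a raw term vector over the 4 toy terms.
q = np.array([1, 0, 1, 0], dtype=float)

# Fold the query in as a pseudo-document in the k-dimensional space.
q_hat = q @ T_k @ np.diag(1.0 / S_k)

# Rank all documents by cosine similarity to the pseudo-document.
scores = [cosine(q_hat, D_k[j]) for j in range(D_k.shape[0])]
ranking = np.argsort(scores)[::-1]   # indices of best-matching docs first
```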

LSA Application (cont'd)
- Choosing k is difficult; commonly k = 100, 150, 300 or so
- Trade-off: overfitting (superfluous dimensions) vs. underfitting (not enough dimensions)
- What are the k semantic dimensions? They are undefined, so in practice performance is measured as a function of k
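One crude way to eyeball candidate values of k is to check how much of A's squared Frobenius norm each truncation retains. This heuristic is an assumption of this sketch, not something the slides prescribe; in IR practice k is usually tuned against retrieval performance directly.

```python
# Cumulative share of squared singular values retained by each rank k.
energy = np.cumsum(S**2) / np.sum(S**2)
for k_try, retained in enumerate(energy, start=1):
    print(f"k = {k_try}: {retained:.1%} of squared Frobenius norm retained")
```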

LSI Performance Source: Dumais 1997

Considerations of LSA
- Conceptually high recall: a query can match a document even when their terms are disjoint
- Polysemes are not handled well
- LSI is unsupervised/completely automatic and language independent
- CL-LSI: Cross-Language LSI, weakly trained
- Computational complexity is high; one optimization is random sampling methods
- Formal linear algebra foundations
- Models language acquisition in children

Fun Applications of LSA
Synonyms on TOEFL
- Train on a corpus of newspapers, Grolier's encyclopedia, and children's reading material
- Use term similarity, with random guessing for unknown terms
- Best results at k = 400
- The model got 51.5 items correct, or 64.4% (52.5% corrected for guessing). By comparison, a large sample of applicants to U.S. colleges from non-English speaking countries who took tests containing these items averaged 51.6 items correct, or 64.5% (52.7% corrected for guessing).
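A toy version of the synonym task, reusing the tiny latent space built earlier: given a stem term, pick the candidate whose latent vector is most similar. The indices and candidates are illustrative; the real experiment used a large corpus and k = 400.

```python
def synonym_choice(stem_idx, candidate_idxs):
    """Pick the candidate term whose latent vector is closest to the stem's."""
    sims = [cosine(term_vecs[stem_idx], term_vecs[c]) for c in candidate_idxs]
    return candidate_idxs[int(np.argmax(sims))]

# Which of terms 1..3 ("boat", "ocean", "voyage") is most like "ship" (0)?
best = synonym_choice(0, [1, 2, 3])
```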

Subject-Matter Knowledge
- Introduction to Psychology multiple-choice tests from New Mexico State University and the University of Colorado, Boulder
- LSA scored significantly below the student average, although it received a passing grade
- LSA has trouble with knowledge that is not in its corpus

Text Coherence/Comprehension
- Kintsch and colleagues: convert a text to a logical representation and test it for coherence; slow, done by hand
- LSA: compute the cosine between each sentence or paragraph (a window of discourse) and the next one
- Predicted comprehension with r = .93
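A sketch of that coherence measure on the toy latent space: represent each sentence as the sum of its terms' latent vectors, then take the cosine between consecutive sentences. The sentence contents here are invented for illustration.

```python
# Each sentence as a list of term indices into the toy vocabulary.
sentences = [[0, 2], [1, 3], [2, 3]]

# Sentence vector = sum of its terms' latent vectors.
sent_vecs = [term_vecs[idx].sum(axis=0) for idx in sentences]

# Coherence: cosine between each window of discourse and the next one.
coherence = [cosine(sent_vecs[i], sent_vecs[i + 1])
             for i in range(len(sent_vecs) - 1)]
mean_coherence = float(np.mean(coherence))
```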

Essay Grading

Summary
- Latent Semantic Analysis has a wide variety of useful applications
- Useful in cognitive psychology research
- Useful as a generic IR technique
- Possibly useful for lazy TAs?

References (Papers, etc.)
- http://www-nlp.stanford.edu/fsnlp/ir/fsnlp-slides-ir.pdf (Manning and Schutze)
- http://www.cs.utk.edu/~lsi/
- Using LSI for Information Retrieval, Information Filtering and Other Things (Dumais 1997)
- Indexing by Latent Semantic Analysis (Deerwester et al. 1990)
- Automatic Cross-Language IR Using LSI (Dumais et al. 1996)
- http://lsa.colorado.edu/ (Landauer et al.)
  - An Introduction to Latent Semantic Analysis (1998)
  - A Solution to Plato's Problem... (1997)
  - How Well Can Passage Meaning be Derived without Using Word Order? ... (1997)

References (Books)
NLP related:
- Foundations of Statistical NLP (Manning and Schutze 2003)
SVD related:
- Linear Algebra (Strang 1998)
- Matrix Computations, 3rd ed. (Golub and Van Loan 1996)

LSA Example/The End