Web Search and Text Mining Lecture 5

Outline
- Review of VSM
- More on LSI through SVD
- Term relatedness
- Probabilistic LSI

Vector Space Model: Pros
- Automatic selection of index terms
- Partial matching of queries and documents (handles the case where no document contains all search terms)
- Ranking according to similarity score (handles large result sets; see the sketch below)
- Term weighting schemes (improve retrieval performance)
- Various extensions: document clustering, relevance feedback (modifying the query vector)
- Geometric foundation
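
A minimal sketch of similarity-score ranking in the vector space model (the toy documents and the simple tf-idf weighting are my own choices for illustration, not taken from the lecture):

```python
# Rank documents against a query by cosine similarity of tf-idf vectors.
import math
from collections import Counter

docs = ["ship ocean voyage", "ocean voyage", "voyage trip", "trip day"]
query = "ocean trip"

N = len(docs)
vocab = sorted({t for d in docs for t in d.split()})
df = Counter(t for d in docs for t in set(d.split()))  # document frequency

def tfidf(text):
    tf = Counter(text.split())
    return [tf[t] * math.log(N / df[t]) for t in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

q_vec = tfidf(query)
for d in sorted(docs, key=lambda d: cosine(tfidf(d), q_vec), reverse=True):
    print(round(cosine(tfidf(d), q_vec), 3), d)
```

Note the partial-matching behavior: "ocean voyage" still scores well even though it contains only one of the two query terms.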

Problems with Lexical Semantics
- Ambiguity and association in natural language
- Polysemy: words often have a multitude of meanings and different types of usage (more severe in very heterogeneous collections)
- The vector space model is unable to discriminate between different meanings of the same word

Problems with Lexical Semantics
- Synonymy: different terms may have identical or similar meanings (weaker: words indicating the same topic)
- No associations between words are made in the vector space representation (see the toy illustration below)
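
To make the synonymy problem concrete, here is a toy example (the vocabulary and weights are assumptions of mine, not the lecture's): a document about "car insurance" and a query about "automobile insurance" match only on their shared term, because "car" and "automobile" occupy orthogonal axes of the term space.

```python
import numpy as np

vocab = ["car", "automobile", "insurance"]
doc = np.array([1.0, 0.0, 1.0])   # "car insurance"
qry = np.array([0.0, 1.0, 1.0])   # "automobile insurance"

cos = doc @ qry / (np.linalg.norm(doc) * np.linalg.norm(qry))
print(cos)  # 0.5 -- only "insurance" contributes; the synonym pair adds nothing
```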

Latent Semantic Indexing (LSI)
- Perform a low-rank approximation of the document-term matrix (typical rank: a few hundred; see the empirical results below)
- General idea:
  - Map documents (and terms) to a low-dimensional representation
  - Design the mapping such that the low-dimensional space reflects semantic associations (the latent semantic space)
  - Compute document similarity based on the inner product in this latent semantic space
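
A minimal sketch of the underlying computation (toy matrix of my own; for this sketch I assume rows are terms and columns are documents): the truncated SVD gives the best rank-k approximation of A in the Frobenius norm, and Σ_k V_k^T gives the documents' coordinates in the latent space.

```python
import numpy as np

A = np.array([[1, 0, 1, 0],   # rows = terms, columns = documents
              [1, 1, 0, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 1]], dtype=float)

k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # best rank-k approximation of A

docs_k = np.diag(s[:k]) @ Vt[:k, :]          # documents in the latent space
print(np.round(A_k, 2))
print(np.round(docs_k, 2))
```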

Goals of LSI
- Similar terms map to similar locations in the low-dimensional space
- Noise reduction by dimension reduction

Latent Semantic Analysis
[Figure: illustrative example of a latent semantic space, courtesy of Susan Dumais]

Performing the maps
- Each row and column of A gets mapped into the k-dimensional LSI space by the SVD.
- Claim: this is not only the mapping with the best (Frobenius-error) approximation to A, but it in fact improves retrieval.
- A query q is also mapped into this space, by q_k = Σ_k^{-1} U_k^T q.
- Note that the mapped query is NOT a sparse vector.
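
Continuing the toy example above (the data and the cosine ranking are assumptions for illustration), query folding and retrieval in the latent space look like this:

```python
import numpy as np

A = np.array([[1, 0, 1, 0],   # rows = terms, columns = documents
              [1, 1, 0, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 1]], dtype=float)
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)

q = np.array([1.0, 0.0, 0.0, 1.0])           # sparse query in term space
q_k = np.diag(1.0 / s[:k]) @ U[:, :k].T @ q  # folded query: dense k-dim vector
docs_k = Vt[:k, :]                           # document coordinates on the same scale

sims = (docs_k.T @ q_k) / (np.linalg.norm(docs_k, axis=0) * np.linalg.norm(q_k))
print(np.argsort(-sims))                     # document indices ranked by cosine
```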

Empirical evidence
- Experiments on TREC 1/2/3 (Dumais)
- Lanczos SVD code due to Berry (available on netlib) used in these experiments
- Running times of about one day on tens of thousands of docs (old data)
- Dimensions: various values reported (under 200 reported unsatisfactory)
- Generally expect recall to improve; what about precision?

Empirical evidence (cont.)
- Precision at or above median TREC precision
- Top scorer on almost 20% of TREC topics
- Slightly better on average than the straight vector space model
- Effect of dimensionality: [table: Dimensions vs. Precision]

What does LSI compute?
- The "dimensionality" of a corpus is the number of distinct topics represented in it.
- Under some modeling assumptions: if A has a rank-k approximation of low Frobenius error, then there are no more than k distinct topics in the corpus ("Latent semantic indexing: A probabilistic analysis").

LSI has many other applications
- In many settings in pattern recognition and retrieval, we have a feature-object matrix.
  - For text, the terms are features and the docs are objects.
  - Could be opinions and users …
- This matrix may be redundant in dimensionality; we can work with a low-rank approximation.
- If entries are missing (e.g., users' opinions), they can be recovered if the dimensionality is low (see the sketch below).
- A powerful general analytical technique
- A close, principled analog to clustering methods
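
As a sketch of the missing-entries point (synthetic data and method chosen by me; this is a simple iterative-imputation form of low-rank matrix completion, not a procedure prescribed by the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
true = rng.standard_normal((20, 2)) @ rng.standard_normal((2, 15))  # rank-2 matrix
mask = rng.random(true.shape) < 0.7       # ~70% of entries observed

X = np.where(mask, true, 0.0)             # unobserved entries start at 0
k = 2
for _ in range(100):
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    low_rank = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    X = np.where(mask, true, low_rank)    # keep observed entries, re-impute the rest

print(np.abs(X - true)[~mask].max())      # reconstruction error on held-out entries
```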