ISP 433/633 Week 10 Vocabulary Problem & Latent Semantic Indexing Partly based on G.Furnas SI503 slides.



Synonymy Same meaning, different words Access, retrieval, look-up

Polysemy Same word, different meaning E.g. bank –River side –Financial institution

Exercise 1 Get out a piece of paper List exemplars (members) of a category for 30 sec Category is...

Flowers

Extended Free Recall is not Easy Brain is NOT built that way! Consequences: Trouble for generating synonyms for query terms Recall problem or Precision problem?

Exercise 2 On a piece of paper, write the name you would give to a Web site that tells about interesting activities occurring in the Albany area –E.g. this site would tell you what is interesting to do on Friday or Saturday night –Make the site name 20 characters or less.

Lexical Variability Very low probability of match in almost every circumstance you examine (p = ) Severe consequences for lexically based access to information or functionality –Performance problems with lexically based IR I.e. querying Precision problem or Recall problem??? Note: The generation problem (a few slides back) contributes to you underestimating the lexical variability problem - you can’t generate many of the alternatives, so you think the variability is lower than it is.

Solutions for Lexical Variability Problem Direct Manipulation –Windows, Icons, Menus, Pointers (WIMP) –Recognition rather than recall –But limited number of items it can work for before navigation gets to be an issue Controlled Vocabulary –Adaptive burden (learning) on users Adaptive Indexing –Adaptive burden on the system Semantic Indexing –E.g., NLP-based retrieval, Latent Semantic Indexing

Adaptive Index An index that Learns from its mistakes Adds links for words it used to miss on Orders results by popularity for the given query Comments: Learns about the most needed words first In the natural context of their use Success requires a sufficient density of usage across the population...
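
The adaptive index described above can be sketched in a few lines. This is only an illustration under stated assumptions: the class name AdaptiveIndex, its methods, and the idea of counting confirmed query-to-item matches are hypothetical choices for this sketch, not something the slides specify.

```python
from collections import defaultdict


class AdaptiveIndex:
    """Toy adaptive index (hypothetical design, not from the slides)."""

    def __init__(self):
        # query word -> {item: how many times users confirmed this match}
        self.links = defaultdict(lambda: defaultdict(int))

    def lookup(self, word):
        """Return items linked to `word`, most popular first."""
        counts = self.links.get(word, {})
        return sorted(counts, key=lambda item: (-counts[item], item))

    def record_success(self, word, item):
        """Call when a user who typed `word` ended up choosing `item`.

        This is the learning step: a word the index used to miss on
        gains a link, and repeated use raises the item's rank.
        """
        self.links[word][item] += 1


index = AdaptiveIndex()
print(index.lookup("look-up"))  # [] (no links yet, so this is a miss)

index.record_success("look-up", "retrieval")
index.record_success("look-up", "access")
index.record_success("look-up", "retrieval")
print(index.lookup("look-up"))  # ['retrieval', 'access']
```

The sketch also shows why a sufficient density of usage matters: a word only gets a link after at least one user has missed on it and then found the item some other way.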

Recall Vector Space Model
D1 = “computer information retrieval”
D2 = “computer retrieval”
Q1 = “information, retrieval”
Term order: (computer, information, retrieval)
D1 = (1, 1, 1)
D2 = (1, 0, 1)
Q1 = (0, 1, 1)
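
As a quick check of the vectors on this slide, the sketch below rebuilds them from the three texts and compares the query with each document. Cosine similarity is assumed here as the similarity measure (the slide does not name one), and the helper names to_vector and cosine are made up for this example.

```python
import math

VOCAB = ["computer", "information", "retrieval"]


def to_vector(text):
    """Binary term vector over VOCAB: 1 if the term occurs, else 0."""
    words = set(text.replace(",", " ").lower().split())
    return [1 if term in words else 0 for term in VOCAB]


def cosine(a, b):
    """Cosine of the angle between two term vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


d1 = to_vector("computer information retrieval")
d2 = to_vector("computer retrieval")
q1 = to_vector("information, retrieval")

print(d1, d2, q1)                # [1, 1, 1] [1, 0, 1] [0, 1, 1]
print(round(cosine(q1, d1), 3))  # 0.816
print(round(cosine(q1, d2), 3))  # 0.5
```

Under this measure D1 ranks above D2 for Q1, because D1 contains both query terms and D2 only one.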

Problem with Term Space Index terms contain only a fraction of the terms that users may use in a query –Synonymy –Hard for the user to come up with alternative terms –Index terms may be quite different from the user’s terms High lexical variability –Precision problem or recall problem?

Problem with Term Space (cont.) Terms are associated with unrelated documents –Polysemy –One solution is Controlled Vocabulary Bank -> financial institution Expensive and restrictive –Another solution is to add more terms to the Boolean query Bank AND finance Hard to come up with more terms Terms may not be in the index
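
The "Bank AND finance" workaround can be tried on a tiny inverted index. The three documents and the function names below are invented for illustration; the sketch only assumes that a Boolean AND is the intersection of the terms' posting sets.

```python
from collections import defaultdict

# Hypothetical mini collection, made up for this example.
DOCS = {
    1: "bank raises mortgage rates",
    2: "river bank erosion after flood",
    3: "finance ministers meet bank chiefs",
}


def build_inverted_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def boolean_and(index, *terms):
    """Documents containing every term (intersection of posting sets)."""
    postings = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*postings) if postings else set()


index = build_inverted_index(DOCS)
print(boolean_and(index, "bank"))             # {1, 2, 3}
print(boolean_and(index, "bank", "finance"))  # {3}
print(boolean_and(index, "bank", "loan"))     # set()
```

Adding "finance" drops the river document, but it also drops document 1, which is about banking and simply never uses that word. A term that is not in the index at all, such as "loan" here, returns nothing.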

Problem with Term Space (cont.) Terms are considered independent from each other (Orthogonal term dimensions) –Not the case –Too many dimensions

Semantic Space Space of meanings Both terms and documents are data points Reduced number of dimensions Avoid Synonymy problem Avoid Polysemy problem [Figure: axes Meaning 1, Meaning 2, Meaning 3; points T1=(1, 1, 1), Q1=(0, 1, 1), D2=(1, 0, 1)]

How? Meanings are hidden (latent semantics) Take advantage of dependency among terms –Occurrence of some patterns of words gives clue as to the likely occurrence of others “Bank, mortgage” co-present -> occurrence of “Finance” –These correlated words are close to each other in semantic space –Semantically unrelated words are far away “River” will be far away, meaning of “Bank” will be quite clear

Original Term Document Matrix

Latent Semantic Indexing (LSI) Transform the term-document matrix to get some orthogonal factors –Using SVD –Uncover underlying independent meanings Map terms, queries and documents to this factor space Do the rest as in the vector space model –Compute similarity between query and document
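
The LSI steps on this slide can be sketched with NumPy. Everything concrete here is an assumption made for the example: the four toy documents, raw term counts as matrix entries, k = 2 factors, the common fold-in rule q_hat = q^T U_k S_k^{-1} for queries, and cosine as the similarity measure.

```python
import numpy as np

# Hypothetical toy collection, made up for this sketch.
TERMS = ["mortgage", "loan", "finance", "river", "water"]
DOCS = ["mortgage finance", "loan finance", "river water", "river"]


def term_document_matrix(docs, terms):
    """Rows are terms, columns are documents; entries are raw counts."""
    A = np.zeros((len(terms), len(docs)))
    for j, text in enumerate(docs):
        for word in text.split():
            A[terms.index(word), j] += 1
    return A


def lsi(A, k):
    """Truncated SVD: keep the k largest singular values and their factors."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # term factors, factor strengths, document coordinates (one row per doc)
    return U[:, :k], s[:k], Vt[:k, :].T


def fold_in(q, U_k, s_k):
    """Map a term-space vector into the k-dimensional factor space."""
    return (q @ U_k) / s_k


def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


A = term_document_matrix(DOCS, TERMS)
U_k, s_k, D_k = lsi(A, k=2)

# Query containing the single term "mortgage".
q = np.zeros(len(TERMS))
q[TERMS.index("mortgage")] = 1.0
q_hat = fold_in(q, U_k, s_k)

sims = [cosine(q_hat, D_k[j]) for j in range(len(DOCS))]
for doc, sim in zip(DOCS, sims):
    print(f"{sim:+.2f}  {doc}")
```

In this toy collection the query "mortgage" lands next to both finance documents in factor space, including "loan finance", which shares no word with the query and would score 0 in the original term space. Real collections are far less cleanly separated, so the choice of k and of term weighting matters much more than it does here.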

Semantic Space

Results

Administrivia Volunteer to pick up, distribute, collect and return teaching evaluation forms?