Latent Semantic Analysis Hongning Wang

Recap: vector space model
Represent both doc and query by concept vectors
– Each concept defines one dimension
– K concepts define a high-dimensional space
– Element of vector corresponds to concept weight, e.g., d = (x_1, ..., x_k), where x_i is the "importance" of concept i
Measure relevance by the distance between the query vector and the document vector in this concept space

Recap: what is a good "basic concept"?
– Orthogonal: linearly independent basis vectors, "non-overlapping" in meaning, no ambiguity
– Weights can be assigned automatically and accurately
Existing solutions:
– Terms or N-grams, i.e., bag-of-words
– Topics, i.e., topic model (we will come back to this later)

Recap: TF weighting
Two views of document length:
– A doc is long because it is verbose
– A doc is long because it has more content
Raw TF is inaccurate:
– Document length variation
– "Repeated occurrences" are less informative than the "first occurrence"
– Relevance does not increase proportionally with the number of term occurrences
Generally penalize long docs, but avoid over-penalizing: pivoted length normalization (a sketch follows).
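One common pivoted length normalization form (a sketch; one standard variant, not necessarily the exact formula on the original slide):

    norm(d) = (1 - s) + s * |d| / avgdl
    TF_pivoted(t, d) = TF(t, d) / norm(d)

where |d| is the length of document d, avgdl is the average document length in the collection, and s in [0, 1] controls how strongly long documents are penalized.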

Recap: IDF weighting
– Discounts terms that occur in many documents: non-linear scaling with the total number of docs in the collection
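A common IDF variant consistent with these annotations:

    IDF(t) = 1 + log(N / DF(t))

where N is the total number of docs in the collection and DF(t) is the number of docs containing term t; the logarithm gives the non-linear scaling.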

Recap: TF-IDF weighting
"Salton was perhaps the leading computer scientist working in the field of information retrieval during his time." (Wikipedia)
Gerard Salton Award: the highest achievement award in IR
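Combining the two recaps, the overall weight of term t in document d is the product (one standard formulation):

    w(t, d) = TF(t, d) * IDF(t)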

Recap: cosine similarity
– Documents and the query are unit TF-IDF vectors; relevance is the cosine of the angle between them
[Figure: documents D1, D2 and a query plotted as unit vectors in a two-dimensional TF-IDF space with axes "Sports" and "Finance"]
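A minimal sketch of cosine ranking over TF-IDF vectors; the two-term vocabulary and the vector values are hypothetical, mirroring the figure:

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all-zero)."""
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return float(u @ v / (nu * nv))

# Hypothetical TF-IDF vectors over the vocabulary ["sports", "finance"].
d1 = np.array([0.9, 0.1])      # D1: mostly about sports
d2 = np.array([0.2, 0.8])      # D2: mostly about finance
query = np.array([1.0, 0.0])   # a query about sports

# Rank documents by cosine similarity to the query.
ranked = sorted([("D1", d1), ("D2", d2)], key=lambda x: -cosine(query, x[1]))
for name, d in ranked:
    print(name, round(cosine(query, d), 3))
```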

Recap: disadvantages of VS model
– Assumes term independence
– Assumes query and document are represented in the same way
– Lacks "predictive adequacy": arbitrary term weighting, arbitrary similarity measure
– Lots of parameter tuning!

VS model in practice
Document and query are represented by term vectors
– Terms are not necessarily orthogonal to each other
– Synonymy: car vs. automobile
– Polysemy: fly (action vs. insect)

Choosing basis for VS model
A concept space is preferred, so that the semantic gap can be bridged
[Figure: documents D1-D5 and a query placed in a three-dimensional concept space with axes "Sports", "Education", and "Finance"]

How to build such a space
Automatic term expansion
– Construction of a thesaurus, e.g., WordNet
– Clustering of words
Word sense disambiguation
– Dictionary-based: the relation between a pair of words in text should be similar to the relation in the dictionary's description
– Explore word usage context

How to build such a space
Latent Semantic Analysis
– Assumption: there is some underlying latent semantic structure in the data that is partially obscured by the randomness of word choice with respect to retrieval
– It means: the observed term-document association data is contaminated by random noise

How to build such a space
Solution: low-rank matrix approximation
– The observed term-document matrix is viewed as the *true* concept-document matrix plus random noise over the word selection in each document (a sketch follows)
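A minimal sketch of low-rank approximation via truncated SVD; the toy term-document counts are hypothetical:

```python
import numpy as np

# Toy term-document matrix (terms x docs); hypothetical raw counts.
C = np.array([
    [1, 1, 0, 0],   # "car"
    [0, 1, 1, 0],   # "automobile"
    [0, 0, 1, 1],   # "stock"
    [1, 0, 0, 1],   # "market"
], dtype=float)

# Rank-k approximation via truncated SVD: C ~ U_k Sigma_k V_k^T.
U, s, Vt = np.linalg.svd(C, full_matrices=False)
k = 2
C_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# C_k is the closest rank-k matrix to C in the Frobenius norm
# (Eckart-Young theorem); LSA treats it as the de-noised
# concept-document association.
print(np.round(C_k, 2))
```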

Latent Semantic Analysis (LSA)

Basic concepts in linear algebra

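LSA rests on the singular value decomposition (SVD); a brief sketch of the standard definitions. For an m x n term-document matrix C:

    C = U Σ V^T

– U (m x m) has orthonormal columns, the left singular vectors (term space)
– Σ (m x n) is diagonal with singular values σ_1 ≥ σ_2 ≥ ... ≥ 0
– V (n x n) has orthonormal columns, the right singular vectors (document space)

Truncating to the k largest singular values gives C_k = U_k Σ_k V_k^T, the best rank-k approximation to C in the Frobenius norm (Eckart-Young theorem).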

Latent Semantic Analysis (LSA)
Map to a lower-dimensional space
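A minimal sketch of the mapping, reusing the hypothetical toy matrix from above; documents and terms each get k-dimensional latent coordinates:

```python
import numpy as np

# Toy term-document matrix from the earlier sketch (hypothetical counts).
C = np.array([[1, 1, 0, 0],
              [0, 1, 1, 0],
              [0, 0, 1, 1],
              [1, 0, 0, 1]], dtype=float)

U, s, Vt = np.linalg.svd(C, full_matrices=False)
k = 2

# Latent (concept-space) coordinates:
# document j -> column j of Sigma_k @ V_k^T; term i -> row i of U_k @ Sigma_k.
doc_latent = np.diag(s[:k]) @ Vt[:k, :]    # shape (k, n_docs)
term_latent = U[:, :k] @ np.diag(s[:k])    # shape (n_terms, k)
print(np.round(doc_latent.T, 2))           # one k-dim row per document
```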


Geometric interpretation of LSA
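One standard geometric reading: similarities among documents in the reduced space are inner products of their latent coordinates, since

    C_k^T C_k = V_k Σ_k (U_k^T U_k) Σ_k V_k^T = (Σ_k V_k^T)^T (Σ_k V_k^T)

so comparing documents by dot product (or cosine) of their k-dimensional coordinates reproduces the similarities induced by the smoothed matrix C_k.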

Latent Semantic Analysis (LSA)
[Figure: 2-D visualization of terms and documents in the latent space; HCI-related and graph-theory-related items form separate clusters]

What are those dimensions in LSA?
– Principal component analysis
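The link to principal component analysis: the SVD of C diagonalizes both co-occurrence matrices,

    C C^T = U Σ^2 U^T    (term-term co-occurrence)
    C^T C = V Σ^2 V^T    (document-document co-occurrence)

so the LSA dimensions are the principal directions of variation in term and document co-occurrence, which is the eigen-decomposition view listed under "What you should know" below.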

Latent Semantic Analysis (LSA)
What we have achieved via LSA:
– Terms/documents that are closely associated are placed near one another in this new space
– Terms that do not occur in a document may still be close to it, if that is consistent with the major patterns of association in the data
– A good choice of concept space for the VS model!

LSA for retrieval

LSA for retrieval
Example query q: "human computer interaction"
[Figure: the query placed in the 2-D latent space near the HCI cluster, away from the graph-theory cluster]
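A minimal sketch of the standard fold-in step: the query, treated as a short document, is projected into the latent space and compared to documents there. The toy matrix and vocabulary are hypothetical:

```python
import numpy as np

# Toy term-document matrix over ["human", "computer", "interaction", "graph"].
C = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 1, 1, 0],
              [0, 0, 0, 2]], dtype=float)
U, s, Vt = np.linalg.svd(C, full_matrices=False)
k = 2

docs_k = (np.diag(s[:k]) @ Vt[:k, :]).T        # docs as rows in latent space
q = np.array([1.0, 1.0, 1.0, 0.0])             # "human computer interaction"
q_k = np.diag(1.0 / s[:k]) @ U[:, :k].T @ q    # fold-in: q_k = Sigma_k^-1 U_k^T q

# Rank documents by cosine similarity in the latent space.
sims = docs_k @ q_k / (np.linalg.norm(docs_k, axis=1) * np.linalg.norm(q_k))
print(np.round(sims, 3))
```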

Discussions
We will come back to this later!

LSA beyond text
– Collaborative filtering: low-rank approximation of the user-item rating matrix

LSA beyond text
– Eigenfaces: the same decomposition applied to face images

LSA beyond text
– The "cat neuron" from a deep neural network: one of the neurons in an artificial neural network, trained on still frames from unlabeled YouTube videos, learned to detect cats

What you should know
– Assumption in LSA
– Interpretation of LSA: low-rank matrix approximation; eigen-decomposition of co-occurrence matrices for documents and terms
– LSA for IR

Today's reading
– Introduction to Information Retrieval, Chapter 18: "Matrix decompositions and latent semantic indexing" (the whole chapter!)
– Deerwester, Scott C., et al. "Indexing by Latent Semantic Analysis." Journal of the American Society for Information Science 41.6 (1990): 391-407.