Latent Semantic Indexing Debapriyo Majumdar Information Retrieval – Spring 2015 Indian Statistical Institute Kolkata.

Vector space model
- Each term represents a dimension; documents are vectors in the term space.
- Term-document matrix: a very sparse matrix whose entries are scores of the terms in the documents (Boolean → count → weight).
- The query is also a vector in the term space.
- Vector similarity: cosine of the angle between the vectors.
- What is the problem? "car" ~ "automobile", but in the vector space model each term is a different dimension.
(Recap slide; it shows an example term-document matrix with documents d1-d5, a query q and the terms car, automobile, engine, search.)
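
A minimal numpy sketch of cosine retrieval in the term space; the toy matrix below is hypothetical and only mirrors the slide's vocabulary.

import numpy as np

# Hypothetical toy term-document matrix (terms x documents); the counts are
# made up for illustration and are not the ones on the slide.
terms = ["car", "automobile", "engine", "search"]
A = np.array([
    [1., 2., 0., 0., 0.],   # car
    [0., 0., 3., 1., 0.],   # automobile
    [2., 1., 1., 1., 0.],   # engine
    [0., 0., 0., 0., 2.],   # search
])

# Query "car engine" as a vector in the same term space
q = np.array([1., 0., 1., 0.])

# Cosine similarity between the query and every document (column of A)
scores = (q @ A) / (np.linalg.norm(A, axis=0) * np.linalg.norm(q))
print(scores)   # documents that only say "automobile" get no credit for "car"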

Synonyms in different dimensions
- "Car" and "automobile" are synonyms, but they are different dimensions.
- The same situation arises for terms belonging to similar concepts.
- Goal: can we map synonyms (similar concepts) to the same dimensions automatically?
(The slide plots documents and the query against the car and automobile axes of the example matrix.)

Linear algebra review
- Rank of a matrix: the number of linearly independent columns (or rows).
- If A is an m × n matrix, then rank(A) ≤ min(m, n).
(Exercise on the slide: what is the rank of the example term-document matrix with rows car, automobile, engine, search and columns d1-d5?)
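
A quick numerical check with numpy; the matrix is hypothetical, chosen so that the car and automobile rows are linearly dependent.

import numpy as np

# Hypothetical 4 x 5 term-document matrix in which the "automobile" row is
# twice the "car" row, so the rows are linearly dependent.
A = np.array([
    [1., 2., 0., 0., 0.],   # car
    [2., 4., 0., 0., 0.],   # automobile
    [2., 1., 1., 1., 0.],   # engine
    [0., 0., 0., 0., 2.],   # search
])

print(np.linalg.matrix_rank(A))   # 3, strictly less than min(4, 5) = 4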

Linear algebra review
- A square matrix M is called orthogonal if its rows and columns are orthonormal vectors:
  – each column (row) has norm 1,
  – any two distinct columns (rows) have dot product 0.
- For a square matrix A, if there is a non-zero vector v such that Av = λv for some scalar λ, then v is called an eigenvector of A and λ is the corresponding eigenvalue.

Singular value decomposition
If A is an m × n matrix with rank r, then there exists a factorization

    A = U Σ V^T

where U (m × m) and V (n × n) are orthogonal, and Σ (m × n) is a diagonal-like matrix: Σ = (σij) with σii = σi for i = 1, …, r, the singular values of A, all other entries of Σ zero, and σ1 ≥ σ2 ≥ … ≥ σr > 0. The columns of U are the left singular vectors of A (and the columns of V the right singular vectors).
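
A short numpy illustration of the decomposition and its properties (random matrix, economy-size SVD):

import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 4))                           # any m x n matrix

# full_matrices=False gives the "economy" SVD (U: m x min(m,n), Vt: min(m,n) x n)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

print(np.all(s[:-1] >= s[1:]))                   # singular values are non-increasing
print(np.allclose(A, U @ np.diag(s) @ Vt))       # A = U Sigma V^T
print(np.allclose(U.T @ U, np.eye(U.shape[1])))  # columns of U are orthonormal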

Singular value decomposition
(Diagram on the slide: the dimensions of the factorization, A (m × n) = U (m × m) Σ (m × n) V^T (n × n), with the singular values σ1, …, σr on the diagonal of Σ, and the reduced form that keeps only the first r singular triplets: U (m × r) Σ (r × r) V^T (r × n).)

Matrix diagonalization for a symmetric matrix
If A is an m × n matrix with rank r, consider C = A A^T. Then

    C = A A^T = (U Σ V^T)(U Σ V^T)^T = U Σ V^T V Σ^T U^T = U (Σ Σ^T) U^T = U Σ² U^T

- C has rank r.
- Σ² is a diagonal matrix with entries σi², for i = 1, …, r.
- The columns of U are the eigenvectors of C, and the σi² are the corresponding eigenvalues of C.
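
A quick numerical check of this correspondence with numpy (random matrix; eigenvectors are compared up to sign):

import numpy as np

rng = np.random.default_rng(1)
A = rng.random((5, 7))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

C = A @ A.T                                          # symmetric 5 x 5 matrix
eigvals, eigvecs = np.linalg.eigh(C)                 # eigenvalues in ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # sort descending

print(np.allclose(eigvals, s**2))                # eigenvalues of AA^T are sigma_i^2
print(np.allclose(np.abs(eigvecs), np.abs(U)))   # eigenvectors = left singular vectors (up to sign)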

SVD of the term-document matrix
- Documents are vectors in the m-dimensional term space.
- But we would expect far fewer concepts than terms to be associated with the collection: m terms, k concepts, with k << m.
- Ignore all but the first k singular values and singular vectors.
- This gives a low-rank approximation Ak = Uk Σk Vk^T, where Uk is m × k, Σk is k × k and Vk^T is k × n.
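
A small helper that computes the truncated decomposition with numpy (a sketch on a hypothetical toy matrix, not the lecture's code):

import numpy as np

def truncated_svd(A, k):
    """Keep only the first k singular values/vectors: A_k = U_k Sigma_k V_k^T."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

A = np.array([
    [1., 2., 0., 0., 0.],
    [0., 0., 3., 1., 0.],
    [2., 1., 1., 1., 0.],
    [0., 0., 0., 0., 2.],
])
Uk, sk, Vtk = truncated_svd(A, k=2)
Ak = Uk @ np.diag(sk) @ Vtk           # rank-k approximation of A
print(np.linalg.matrix_rank(Ak))      # 2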

Low-rank approximation
- Ak has rank k.
- Now compute the cosine similarity of the query q with the columns of Ak.
- Computationally, these are still m-dimensional vectors.
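
In code this is the same cosine computation as before, only against the columns of Ak (a sketch; the small epsilon guards against all-zero columns):

import numpy as np

def scores_against_Ak(Ak, q, eps=1e-12):
    """Cosine similarity of an m-dimensional query against each column of A_k.
    Note: the document vectors here are still m-dimensional."""
    return (q @ Ak) / (np.linalg.norm(Ak, axis=0) * np.linalg.norm(q) + eps)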

Retrieval in the concept space
- Retrieval in the term space (cosine): both q and d are m-dimensional vectors (m = number of terms).
- Term space (m dimensions) → concept space (k dimensions): use the first k singular vectors.
  – Query: q → Uk^T q (k × m times m × 1 gives k × 1).
  – Document: d → Uk^T d (k × m times m × 1 gives k × 1).
- Compute the cosine similarity in the concept space.
- Other variants: map using (Uk Σk)^T.
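
A minimal sketch of this folding step (function and variable names are mine, not the lecture's):

import numpy as np

def fold_in(Uk, x):
    """Map an m-dimensional term vector (query or document) to the
    k-dimensional concept space: x -> U_k^T x."""
    return Uk.T @ x

def cosine(a, b, eps=1e-12):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)

# usage: score(q, d) = cosine(fold_in(Uk, q), fold_in(Uk, d))
# variant from the slide: use (Uk @ np.diag(sk)).T instead of Uk.T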

How to find the optimal low rank?
- Primarily intuitive:
  – the assumption is that a document collection has exactly k concepts,
  – there is no systematic method to find the optimal k,
  – experimental results are not very consistent.

HOW DOES LSI WORK? (Bast & Majumdar, SIGIR 2005)

Spectral retrieval – general framework
- Term space: term-document matrix A (m × n) with m terms and n documents, query q (m × 1); cosine similarities in the term space.
- Dimension reduction to the concept space with a matrix L (k × m): L·A (k × n) has k concepts and n documents, L·q is k × 1; cosine similarities in the concept space.
- LSI: take the singular value decomposition A = U Σ V^T (m × n = (m × r)(r × r)(r × n)), let Uk be the first k columns of U, and set L = Uk^T (k × m).
- LSI and LSI-based retrieval methods are called "spectral retrieval".
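
A tiny sketch of LSI's choice of the reduction matrix L (names are mine):

import numpy as np

def lsi_projection(A, k):
    """LSI's reduction matrix L = U_k^T (k x m)."""
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    return U[:, :k].T

# concept-space representations: docs = L @ A (k x n), query = L @ q (k x 1)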

Spectral retrieval as document "expansion"
- Spectral retrieval can be viewed as multiplying the term-document matrix by a term-term "expansion matrix": for example, a 0-1 expansion matrix over the terms car, auto, engine, search that adds "car" to every document in which "auto" is present.
- The LSI expansion matrix is Uk Uk^T (in the slide's example, L = U2 U2^T, projecting to 2 dimensions).
- An ideal expansion matrix should have
  – high scores for intuitively related terms,
  – low scores for intuitively unrelated terms.
- The expansion matrix depends heavily on the subspace dimension!

Why document "expansion"?
- With L = U3 U3^T (projecting to 3 dimensions) the same example yields a different expansion matrix: the expansion matrix depends heavily on the subspace dimension!
- Finding the optimal number of dimensions k remained an open problem.
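
A small sketch of this view in numpy (names are mine): the rank-k expansion matrix and the expanded document representations.

import numpy as np

def expansion_matrix(A, k):
    """m x m term-term "expansion" matrix U_k U_k^T for subspace dimension k."""
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    Uk = U[:, :k]
    return Uk @ Uk.T

# Multiplying the expansion matrix with the term-document matrix adds weight to
# terms related to those that actually occur (e.g. some weight for "car" in a
# document that only contains "auto"); the result equals the rank-k matrix A_k.
# A_expanded = expansion_matrix(A, k) @ A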

Relatedness curves
- How do the entries of the expansion matrix depend on the dimension k of the subspace?
- Plot the (i, j)-th entry of the expansion matrix T = L^T L = Uk Uk^T against the dimension k.
- This entry is the cumulative dot product of the i-th and j-th rows of U (the matrix of left singular vectors) over the first k columns.
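
Computed directly, the relatedness curve for a term pair is just a cumulative sum (a sketch, with hypothetical row indices):

import numpy as np

def relatedness_curve(U, i, j):
    """Entry (i, j) of U_k U_k^T as a function of k = 1, 2, ...:
    the cumulative dot product of rows i and j of U."""
    return np.cumsum(U[i, :] * U[j, :])

# usage: curve = relatedness_curve(U, i=0, j=1)
# curve[k-1] is the expansion matrix entry for subspace dimension k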

Types of relatedness curves
- There are three main types. (The slide shows example curves of the expansion matrix entry against the subspace dimension for the term pairs node/vertex, logic/logics and logic/vertex.)
- No single dimension is appropriate for all term pairs.
- But the shape of the curve indicates the term-term relationship!

Curves for related terms
- We call two terms perfectly related if they have an identical co-occurrence pattern.
- For perfectly related terms the shape of the curve is proven: up, then down; after a slight perturbation of the matrix the change is provably small, and it degrades gradually with more perturbation. (The slide shows the three corresponding curves of the expansion matrix entry against the subspace dimension.)
- The point of fall-off is different for every term pair, but we can calculate it.

Curves for unrelated terms
- Co-occurrence graph: terms are vertices, with an edge between two terms if they co-occur.
- We call two terms perfectly unrelated if no path connects them in the graph.
- For perfectly unrelated terms the proven shape of the curve stays at zero; after a slight perturbation the change is provably small, and with more perturbation the curves for unrelated terms randomly oscillate around zero. (Again the slide shows the corresponding curves against the subspace dimension.)

TN: the non-negativity test
1. Normalize the term-document matrix so that the theoretical point of fall-off is the same for all term pairs.
2. Discard the parts of the curves after this point.
3. For each term pair: if the curve is never negative before this point, set the entry in the expansion matrix to 1, otherwise to 0.
A simple 0-1 classification that produces a sparse expansion matrix! (On the slide: related term pairs get entry 1, unrelated pairs get entry 0.)
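
A naive sketch of step 3 in numpy (the normalization of step 1 and the fall-off dimension `cutoff` are assumed given; this is not the authors' implementation):

import numpy as np

def tn_expansion_matrix(U, cutoff):
    """0-1 expansion matrix from the non-negativity test: entry (i, j) is 1
    iff the relatedness curve of the pair stays non-negative up to `cutoff`."""
    m = U.shape[0]
    T = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            curve = np.cumsum(U[i, :cutoff] * U[j, :cutoff])
            T[i, j] = 1.0 if np.all(curve >= 0) else 0.0
    return T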

TS: the smoothness test
1. Again, discard the part of the curves after the theoretical point of fall-off (the same for every term pair, after normalization).
2. For each term pair compute the smoothness of its curve (= 1 if very smooth, → 0 as the number of turns increases).
3. If the smoothness is above some threshold, set the entry in the expansion matrix to 1, otherwise to 0.
Again a 0-1 classification that produces a sparse expansion matrix! (On the slide: related term pairs get entry 1, unrelated pairs get entry 0.)
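
A toy stand-in for the smoothness score (the slide does not spell out the exact measure, so the one below, based on counting direction changes, is only an assumption):

import numpy as np

def smoothness(curve):
    """1.0 for a curve that never changes direction; tends to 0 as the number
    of turns grows. A placeholder for the measure used in the paper."""
    turns = np.sum(np.diff(np.sign(np.diff(curve))) != 0)
    return 1.0 / (1.0 + turns)

def ts_entry(U, i, j, cutoff, threshold=0.5):
    """0-1 entry of the expansion matrix from the smoothness test."""
    curve = np.cumsum(U[i, :cutoff] * U[j, :cutoff])
    return 1.0 if smoothness(curve) >= threshold else 0.0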

Experimental results (average precision on the Time collection, 425 docs, 3882 terms)

  COS      63.2%   baseline: cosine similarity in the term space
  LSI*     62.8%   Latent Semantic Indexing, Dumais et al.
  LSI-RN*  58.6%   term-normalized LSI, Ding et al.
  CORR*    59.1%   correlation-based LSI, Dupret et al.
  IRR*     62.2%   Iterative Residual Rescaling, Ando & Lee 2001
  TN       64.9%   non-negativity test
  TS       64.1%   smoothness test

* The numbers for LSI, LSI-RN, CORR and IRR are for the best subspace dimension!

Experimental results (average precision)

  Collection                     COS     LSI*    LSI-RN*  CORR*   IRR*    TN      TS
  Time (425 docs, 3882 terms)    63.2%   62.8%   58.6%    59.1%   62.2%   64.9%   64.1%
  Reuters (5701 terms)           36.2%   32.0%   37.0%    32.3%   --      41.9%   42.9%
  Ohsumed                        13.2%    6.9%   13.0%    10.9%   --      14.4%   15.3%

* The numbers for LSI, LSI-RN, CORR and IRR are for the best subspace dimension!

Asymmetric term-term relations (Bast, Dupret, Majumdar & Piwowarski, 2006)
- Related terms: fruit – apple.
- Up to some dimension k', the curve for fruit–apple lies above the curve for apple–apple.
- So until dimension k', apple is more related to fruit than to apple itself.
- This is an asymmetric relation: fruit is more general than apple. (The slide plots the curves for fruit–apple, fruit–fruit and apple–apple against k.)

Examples (more general – less general)

  Fruit – Apple          Car – Opel
  Space – Solar          Restaurant – Dish
  India – Gandhi         Fashion – Trousers
  Restaurant – Waiter    Metal – Zinc
  Sweden – Stockholm     India – Delhi
  Church – Priest        Opera – Vocal
  Metal – Aluminum       Fashion – Silk
  Saudi – Sultan         Fish – Shark

Sources and acknowledgements
- Introduction to Information Retrieval by Manning, Raghavan and Schütze.
- Bast and Majumdar: Why spectral retrieval works. SIGIR 2005.
  – Some slides are adapted from the talk by Hannah Bast.