1 Latent Semantic Indexing
Debapriyo Majumdar
Information Retrieval – Spring 2015
Indian Statistical Institute Kolkata

2 Vector space model (recall)
Each term represents a dimension; documents are vectors in the term space. The term-document matrix is a very sparse matrix whose entries are scores of the terms in the documents (Boolean → count → weight). The query is also a vector in the term space.

            d1    d2    d3    d4    d5    q
car         0.5   0     0     0     0     1
automobile  0.2   0.8   0     0.2   0     0
engine      0.7   0.6   0.9   0     0.5   0
search      0     0     0.7   0.5   0.8   0

Vector similarity: cosine of the angle between the vectors. What is the problem? car ~ automobile, but in the vector space model each term is a different dimension.
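As a small illustration (not from the slides), the cosine scores for this toy matrix can be computed in a few lines of NumPy; the variable and function names are ours:

```python
import numpy as np

# Toy term-document matrix from the slide: rows = terms, columns = d1..d5.
A = np.array([
    [0.5, 0.0, 0.0, 0.0, 0.0],   # car
    [0.2, 0.8, 0.0, 0.2, 0.0],   # automobile
    [0.7, 0.6, 0.9, 0.0, 0.5],   # engine
    [0.0, 0.0, 0.7, 0.5, 0.8],   # search
])
q = np.array([1.0, 0.0, 0.0, 0.0])   # query containing only "car"

def cosine_scores(A, q):
    """Cosine of the angle between q and each document (each column of A)."""
    return (A.T @ q) / (np.linalg.norm(A, axis=0) * np.linalg.norm(q) + 1e-12)

print(np.round(cosine_scores(A, q), 2))
# d2 contains "automobile" but not "car", so it scores 0 for this query --
# exactly the synonym problem discussed on the next slide.
```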

3 Synonyms in different dimensions
"Car" and "automobile" are synonyms, but they are different dimensions; the same happens for terms belonging to similar concepts. Goal: can we map synonyms (similar concepts) to the same dimensions automatically?
(Figure: d1, d2 and the query q plotted in the car / automobile / engine space, using the same term-document matrix as on the previous slide.)

4 Linear algebra review
Rank of a matrix: the number of linearly independent columns (or rows). If A is an m × n matrix, rank(A) ≤ min(m, n).
Example: what is the rank of the following 4 × 5 matrix?

            d1   d2   d3   d4   d5
car         1    2    0    0    1
automobile  1    2    0    0    0
engine      1    2    1    0.2  0
search      0    0    1    0.2  0.8
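To check the answer, NumPy can compute the (numerical) rank directly; this is just a sanity check, not part of the original slides:

```python
import numpy as np

# The 4 x 5 example matrix from the slide (rows: car, automobile, engine, search).
M = np.array([
    [1, 2, 0, 0,   1  ],
    [1, 2, 0, 0,   0  ],
    [1, 2, 1, 0.2, 0  ],
    [0, 0, 1, 0.2, 0.8],
])
print(np.linalg.matrix_rank(M))   # 3: row 4 = (row 3 - row 2) + 0.8 * (row 1 - row 2)
```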

5 Linear algebra review
A square matrix M is called orthogonal if its rows and columns are orthonormal vectors:
– each column (row) has norm 1
– any two distinct columns (rows) have dot product 0
For a square matrix M, if there is a non-zero vector v such that Mv = λv for some scalar λ, then v is called an eigenvector of M and λ is the corresponding eigenvalue.

6 Singular value decomposition
If A is an m × n matrix with rank r, then there exists a factorization

    A = U Σ V^T

where U (m × m) and V (n × n) are orthogonal and Σ (m × n) is a diagonal-like matrix: Σ = (σ_ij) with σ_ii = σ_i for i = 1, …, r the singular values of A, all other entries of Σ zero, and σ_1 ≥ σ_2 ≥ … ≥ σ_r > 0. The columns of U are the left singular vectors of A (and the columns of V the right singular vectors).
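A quick sketch of what this looks like with NumPy (the random matrix is just a stand-in for A; nothing here is specific to the lecture's data):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((4, 5))                    # any m x n matrix, here m = 4, n = 5

U, s, Vt = np.linalg.svd(A)               # U: 4x4 orthogonal, Vt: 5x5 orthogonal
Sigma = np.zeros_like(A)                  # 4x5 "diagonal-like" matrix
Sigma[:len(s), :len(s)] = np.diag(s)

print(np.allclose(A, U @ Sigma @ Vt))     # True: A = U Sigma V^T
print(np.allclose(U.T @ U, np.eye(4)))    # True: U is orthogonal
print(s)                                  # singular values in decreasing order
```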

7 Singular value decomposition
(Diagram: A (m × n) = U (m × m) · Σ (m × n) · V^T (n × n), with σ_1, …, σ_r on the diagonal of Σ and zeros elsewhere; in the reduced form only the first r columns of U, the r × r block of Σ and the first r rows of V^T are kept: A = U_r (m × r) · Σ_r (r × r) · V_r^T (r × n).)

8 Matrix diagonalization for a symmetric matrix
If A is an m × n matrix with rank r, consider C = AA^T. Then

    C = U Σ² U^T

where C has rank r, Σ² is a diagonal matrix with entries σ_i² for i = 1, …, r, the columns of U are the eigenvectors of C, and the σ_i² are the corresponding eigenvalues of C.

9 SVD of the term-document matrix
Documents are vectors in the m-dimensional term space, but we expect the collection to involve far fewer concepts: m terms, k concepts, with k << m. So ignore all but the first k singular values and singular vectors. This gives the low-rank approximation

    A_k = U_k Σ_k V_k^T

where U_k is m × k, Σ_k is k × k and V_k^T is k × n.
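A minimal sketch of this truncation with NumPy (the helper name rank_k_approximation is ours):

```python
import numpy as np

def rank_k_approximation(A, k):
    """Keep only the first k singular values / vectors: A_k = U_k Sigma_k V_k^T."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]     # m x k, k, k x n
    return U_k @ np.diag(s_k) @ Vt_k

rng = np.random.default_rng(1)
A = rng.random((100, 80))                 # stand-in for an m x n term-document matrix
A_k = rank_k_approximation(A, k=10)
print(A_k.shape, np.linalg.matrix_rank(A_k))   # (100, 80) 10
```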

10 Low-rank approximation
A_k has rank k. Now compute the cosine similarity of the query q with the columns of A_k. Computationally, however, these are still m-dimensional vectors.

11 Retrieval in the concept space
Retrieval in the term space (cosine): both q and d are m-dimensional vectors (m = number of terms).
Term space (m) → concept space (k): use the first k singular vectors.
– Query: q → U_k^T q  ((k × m)(m × 1) = k × 1)
– Document: d → U_k^T d  ((k × m)(m × 1) = k × 1)
Then compute the cosine similarity in the concept space.
Other variants: map using (U_k Σ_k)^T.
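A hedged sketch of concept-space retrieval along these lines, reusing the toy matrix from slide 2 (function and variable names are ours):

```python
import numpy as np

def concept_space_scores(A, q, k):
    """Map documents and the query into the k-dimensional concept space via U_k^T
    and rank documents by cosine similarity there."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k = U[:, :k]                        # m x k
    docs_c = U_k.T @ A                    # k x n: each column is a document in concept space
    q_c = U_k.T @ q                       # k-dimensional query in concept space
    return (docs_c.T @ q_c) / (np.linalg.norm(docs_c, axis=0) * np.linalg.norm(q_c) + 1e-12)

A = np.array([[0.5, 0.0, 0.0, 0.0, 0.0],   # car
              [0.2, 0.8, 0.0, 0.2, 0.0],   # automobile
              [0.7, 0.6, 0.9, 0.0, 0.5],   # engine
              [0.0, 0.0, 0.7, 0.5, 0.8]])  # search
q = np.array([1.0, 0.0, 0.0, 0.0])          # query: "car"
print(np.round(concept_space_scores(A, q, k=2), 2))
# Unlike in the plain term space, d2 (which contains "automobile" but not "car")
# now gets a non-zero score.
```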

12 How to find the optimal low rank?
The choice of k is primarily intuitive:
– it assumes that the document collection has exactly k concepts
– there is no systematic method to find the optimal k
– experimental results are not very consistent

13 HOW DOES LSI WORK? (Bast & Majumdar, SIGIR 2005)

14 Spectral retrieval – general framework
Term-document matrix A (m × n): m terms, n documents; query q (m × 1). In the term space we compute cosine similarities between q and the columns of A.
Dimension reduction to the concept space: pick a mapping L (k × m). Then L·A (k × n) represents the n documents over k concepts, L·q (k × 1) is the mapped query, and we compute cosine similarities in the concept space.
For LSI: take the singular value decomposition A = U Σ V^T (U is m × r, Σ is r × r, V^T is r × n in the reduced form), let U_k be the first k columns of U, and set L = U_k^T (k × m).
LSI and other LSI-based retrieval methods are called "spectral retrieval".

15 Spectral retrieval as document "expansion"
A 0-1 expansion matrix over the terms car, auto, engine, search:

    1 1 0 0
    1 1 0 0
    0 0 1 0
    0 0 0 1

Multiplying it with the document vector (0, 1, 1, 0) — the document contains auto and engine — yields the expanded document (1, 1, 1, 0).

16 Spectral retrieval as document "expansion"
Same 0-1 expansion matrix and document as before: the off-diagonal 1 in row "car", column "auto" means "add car if auto is present", which is exactly what turns (0, 1, 1, 0) into (1, 1, 1, 0).

17 Spectral retrieval as document "expansion"
An ideal expansion matrix should have
– high scores for intuitively related terms
– low scores for intuitively unrelated terms
The LSI expansion matrix is U_k U_k^T. Projecting to 2 dimensions, L = U_2^T is

     0.42  0.51  0.66  0.37
     0.33  0.43 -0.08 -0.84

and the expansion matrix U_2 U_2^T (rows/columns: car, auto, engine, search) is

     0.29  0.36  0.25 -0.12
     0.36  0.44  0.30 -0.17
     0.25  0.30  0.44  0.30
    -0.12 -0.17  0.30  0.84

Applied to the document (0, 1, 1, 0) it gives (0.61, 0.74, 0.74, 0.13): car is added because auto is present. The expansion matrix depends heavily on the subspace dimension!

18 Why document "expansion"
Projecting to 3 dimensions instead, L = U_3^T is

     0.42  0.51  0.66  0.37
     0.33  0.43 -0.08 -0.84
    -0.80  0.59  0.06 -0.01

and the expansion matrix U_3 U_3^T (rows/columns: car, auto, engine, search) is

     0.93 -0.12  0.20 -0.11
    -0.12  0.80  0.34 -0.18
     0.20  0.34  0.44  0.30
    -0.11 -0.18  0.30  0.84

Applied to the same document (0, 1, 1, 0) it now gives (0.08, 1.13, 0.78, 0.12). The expansion matrix depends heavily on the subspace dimension, and finding the optimal number of dimensions k remained an open problem.
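To illustrate the dependence on k, here is a hedged sketch that builds an LSI expansion matrix U_k U_k^T and applies it to a 0-1 document vector; the toy matrix is the one from slide 4, not necessarily the data behind the numbers on these slides:

```python
import numpy as np

def lsi_expansion_matrix(A, k):
    """LSI expansion matrix T_k = U_k U_k^T (m x m) for a term-document matrix A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k = U[:, :k]
    return U_k @ U_k.T

A = np.array([[1, 2, 0, 0,   1  ],   # car      (toy data, illustrative only)
              [1, 2, 0, 0,   0  ],   # auto
              [1, 2, 1, 0.2, 0  ],   # engine
              [0, 0, 1, 0.2, 0.8]])  # search
d = np.array([0, 1, 1, 0])           # document containing auto and engine

for k in (2, 3):
    print(k, np.round(lsi_expansion_matrix(A, k) @ d, 2))
# Compare the two expanded vectors: the expansion matrix depends on the
# subspace dimension k.
```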

19 Relatedness curves
How do the entries of the expansion matrix depend on the dimension k of the subspace? Plot the (i, j)-th entry of the expansion matrix T = L^T L = U_k U_k^T against the dimension k. This is the cumulative dot product of the i-th and j-th rows of U (the matrix of left singular vectors).
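A small sketch of such a curve as a cumulative dot product (names are ours):

```python
import numpy as np

def relatedness_curve(U, i, j):
    """Entry (i, j) of U_k U_k^T as a function of k: the cumulative dot product
    of the i-th and j-th rows of U (one value per subspace dimension k)."""
    return np.cumsum(U[i, :] * U[j, :])

rng = np.random.default_rng(2)
A = rng.random((6, 8))                       # stand-in term-document matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)
curve = relatedness_curve(U, 0, 1)
k = 3
print(np.isclose(curve[k - 1], (U[:, :k] @ U[:, :k].T)[0, 1]))   # True
```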

20 Types of relatedness curves
Three main types (plots of the expansion matrix entry against the subspace dimension, 0 to 600): node/vertex, logic/logics, logic/vertex. No single dimension is appropriate for all term pairs, but the shape of the curve indicates the term-term relationship!

21 Curves for related terms
We call two terms perfectly related if they have an identical co-occurrence pattern (figure: a block-structured term-document matrix in which the rows of term 1 and term 2 have the same pattern). The shape of the curve is proven for perfectly related terms: up, then down; after a slight perturbation of the matrix the change is provably small, and with more perturbation (realistic data) the up-then-down shape is still visible. The point of fall-off is different for every term pair, but we can calculate it.

22 Curves for unrelated terms
Co-occurrence graph: terms are the vertices, with an edge between two terms if they co-occur. We call two terms perfectly unrelated if no path connects them in this graph. The shape of the curve is proven for perfectly unrelated terms: it stays at zero; after a slight perturbation the change is provably small, and with more perturbation the curves for unrelated terms randomly oscillate around zero.

23 TN: the non-negativity test
1. Normalize the term-document matrix so that the theoretical point of fall-off is the same for all term pairs.
2. Discard the parts of the curves after this point.
3. For each term pair: if the curve is never negative before this point, set the entry in the expansion matrix to 1, otherwise to 0.
A simple 0-1 classification that produces a sparse expansion matrix: related terms get entry 1, unrelated terms get entry 0.
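A minimal sketch of step 3 only, assuming the matrix has already been normalized and `cutoff` is the common fall-off dimension (the normalization itself is part of the paper and not reproduced here; all names are ours):

```python
import numpy as np

def tn_expansion_matrix(U, cutoff):
    """Non-negativity test (sketch): entry (i, j) is 1 iff the relatedness curve
    of terms i and j never goes negative before the common fall-off point."""
    m = U.shape[0]
    T = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            curve = np.cumsum(U[i, :cutoff] * U[j, :cutoff])
            T[i, j] = 1.0 if curve.min() >= 0 else 0.0
    return T                                  # a sparse 0-1 expansion matrix
```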

24 TS: the smoothness test
1. Again, discard the part of the curves after the theoretical point of fall-off (the same for every term pair, after normalization).
2. For each term pair compute the smoothness of its curve (= 1 if very smooth, → 0 as the number of turns increases).
3. If the smoothness is above some threshold, set the entry in the expansion matrix to 1, otherwise to 0.
Again a 0-1 classification that produces a sparse expansion matrix: related terms (smoothness 0.82 and 0.69 in the example curves) get entry 1, unrelated terms (smoothness 0.07) get entry 0.
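The slides do not spell out the exact smoothness formula, so the measure below is only one plausible stand-in (net change divided by total variation: 1 for a monotone curve, approaching 0 as the curve turns more often); the threshold is likewise illustrative:

```python
import numpy as np

def smoothness(curve):
    """Illustrative smoothness score in [0, 1]; NOT necessarily the exact
    definition used by Bast & Majumdar (2005)."""
    diffs = np.diff(curve)
    total_variation = np.abs(diffs).sum()
    return 1.0 if total_variation == 0 else abs(diffs.sum()) / total_variation

def ts_entry(truncated_curve, threshold=0.5):
    """Smoothness test (sketch): 1 if the truncated curve is smooth enough, else 0."""
    return 1.0 if smoothness(truncated_curve) >= threshold else 0.0
```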

25 Experimental results (average precision)

Collection                      COS     LSI*    LSI-RN*  CORR*   IRR*    TN      TS
Time (425 docs, 3882 terms)     63.2%   62.8%   58.6%    59.1%   62.2%   64.9%   64.1%

* The numbers for LSI, LSI-RN, CORR and IRR are for the best subspace dimension!
COS = baseline, cosine similarity in term space; LSI = Latent Semantic Indexing (Dumais et al. 1990); LSI-RN = term-normalized LSI (Ding et al. 2001); CORR = correlation-based LSI (Dupret et al. 2001); IRR = Iterative Residual Rescaling (Ando & Lee 2001); TN = non-negativity test; TS = smoothness test.

26 Experimental results (average precision)

Collection                           COS     LSI*    LSI-RN*  CORR*   IRR*    TN      TS
Time (425 docs, 3882 terms)          63.2%   62.8%   58.6%    59.1%   62.2%   64.9%   64.1%
Reuters (21578 docs, 5701 terms)     36.2%   32.0%   37.0%    32.3%   —       41.9%   42.9%
Ohsumed (233445 docs, 99117 terms)   13.2%    6.9%   13.0%    10.9%   —       14.4%   15.3%

* The numbers for LSI, LSI-RN, CORR and IRR are for the best subspace dimension!

27 Asymmetric term-term relations
Related terms: fruit – apple.
– Up to some dimension k' the curve fruit–apple lies above the curve apple–apple.
– So up to dimension k', apple is more related to fruit than to apple itself.
– Asymmetric relation: fruit is more general than apple.
(Figure: the relatedness curves fruit–apple, fruit–fruit and apple–apple plotted against the dimension k.)
Bast, Dupret, Majumdar & Piwowarski, 2006
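A heuristic sketch of this idea (function and parameter names are ours, and the exact criterion in the paper may differ): compare the cross curve of the two terms with the self curve of the second term over the first k' dimensions.

```python
import numpy as np

def is_more_general(U, i, j, k_prime):
    """Sketch: term i is taken to be more general than term j if, over the first
    k_prime dimensions, the cross curve (i, j) stays above the self curve (j, j)."""
    cross = np.cumsum(U[i, :k_prime] * U[j, :k_prime])
    self_j = np.cumsum(U[j, :k_prime] ** 2)
    return bool(np.all(cross >= self_j))
```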

28 Examples

More general   Less general      More general   Less general
Fruit          Apple             Car            Opel
Space          Solar             Restaurant     Dish
India          Gandhi            Fashion        Trousers
Restaurant     Waiter            Metal          Zinc
Sweden         Stockholm         India          Delhi
Church         Priest            Opera          Vocal
Metal          Aluminum          Fashion        Silk
Saudi          Sultan            Fish           Shark

29 Sources and acknowledgements
– IR Book by Manning, Raghavan and Schuetze: http://nlp.stanford.edu/IR-book/
– Bast and Majumdar: Why spectral retrieval works. SIGIR 2005 (some slides are adapted from the talk by Hannah Bast)

