What is missing? Reasons that ideal effectiveness is hard to achieve:
1. Users' inability to describe queries precisely.
2. Document representation loses information.
3. The same term may have multiple meanings, and different terms may have similar meanings.
4. The similarity function used may not be good enough.
5. The importance/weight of a term in representing a document or query may be inaccurate.

Some improvements
- Query expansion techniques (for 1)
  - Relevance feedback
    - Vector model
    - Probabilistic model
  - Co-occurrence analysis (local and global thesauri)
- Improving the quality of terms (for 2, 3 and 5)
  - Latent Semantic Indexing
  - Phrase detection

Dimensionality Reduction: insight through
- Principal Components Analysis
- KL Transform
- Neural Networks

Latent Semantic Indexing
- Classic IR might lead to poor retrieval due to:
  - unrelated documents might be included in the answer set
  - relevant documents that do not contain at least one index term are not retrieved
  - Reasoning: retrieval based on index terms is vague and noisy
- The user's information need is more related to concepts and ideas than to index terms
- A document that shares concepts with another document known to be relevant might be of interest

Latent Semantic Indexing
- Creates a modified vector space
- Captures transitive co-occurrence information
  - If docs A & B don't share any words with each other, but both share lots of words with doc C, then A & B will be considered similar (see the sketch below)
  - Handles polysemy (Adam's apple) & synonymy
- Simulates query expansion and document clustering (sort of)
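
The transitive co-occurrence effect can be checked on a tiny synthetic corpus. The sketch below (the corpus, document labels, and variable names are my own illustration, not from the slides) uses numpy's SVD: docs A and B share no terms, but both co-occur with C, so their cosine in the rank-2 LSI space is high even though it is 0 in the raw term space.

```python
import numpy as np

# Toy term-document matrix. Terms are rows, documents are columns.
# docs: A = {car}, B = {automobile}, C = {car, automobile}, D = {elephant, zoo}, E = {zoo}
M = np.array([
    [1, 0, 1, 0, 0],   # car
    [0, 1, 1, 0, 0],   # automobile
    [0, 0, 0, 1, 0],   # elephant
    [0, 0, 0, 1, 1],   # zoo
], dtype=float)

def cosine(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

# In the raw term space, A and B share no terms, so their cosine is 0.
print("raw cosine(A, B):", cosine(M[:, 0], M[:, 1]))

# Rank-2 LSI: documents are rows of V_k scaled by the singular values in S_k.
U, s, Vt = np.linalg.svd(M, full_matrices=False)
k = 2
docs_lsi = Vt[:k, :].T * s[:k]          # shape (num_docs, k)

# A and B both co-occur with C, so they land on the same "vehicle" concept axis;
# their cosine in the reduced space is high (1.0 in this toy case), while A vs. D stays 0.
print("LSI cosine(A, B):", cosine(docs_lsi[0], docs_lsi[1]))
print("LSI cosine(A, D):", cosine(docs_lsi[0], docs_lsi[3]))
```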

A motivating example
- Suppose we have keywords
  - car, automobile, driver, elephant
- We want queries on car to also get docs about drivers, but not about elephants
  - Need to realize that driver and car are related while elephant is not
- When you scrunch down the dimensions, small differences get glossed over, and you get the desired behavior

Latent Semantic Indexing
- Definitions
  - Let t be the total number of index terms
  - Let N be the number of documents
  - Let (M_ij) be a term-document matrix with t rows and N columns
  - To each element of this matrix is assigned a weight w_ij associated with the pair [k_i, d_j]
  - The weight w_ij can be based on a tf-idf weighting scheme (a small construction sketch follows below)
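
A minimal sketch of building the t x N weighted term-document matrix (M_ij) with tf-idf weights w_ij. This is my own illustration, assuming scikit-learn is available; the toy documents are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "feedback controller design",
    "observer based feedback control",
    "polynomial realization of transfer functions",
]

vectorizer = TfidfVectorizer()
# TfidfVectorizer returns an N x t document-term matrix; transpose it so that
# terms are rows and documents are columns, matching the slide's convention.
M = vectorizer.fit_transform(docs).T.toarray()

print("t x N =", M.shape)                     # (number of index terms, number of documents)
print(vectorizer.get_feature_names_out())     # the index terms k_i
```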

Everything You Always Wanted to Know About LSI, and More
- Singular Value Decomposition (SVD): convert the term-document matrix into 3 matrices U, S and V
- Reduce dimensionality: throw out low-order rows and columns
- Recreate matrix: multiply to produce an approximate term-document matrix
- Use the new matrix to process queries

Latent Semantic Indexing
- The matrix (M_ij) can be decomposed into 3 matrices (singular value decomposition) as follows:
  - (M_ij) = (U) (S) (V)^T
    - (U) is the matrix of eigenvectors derived from (M)(M)^T
    - (V)^T is the transpose of the matrix of eigenvectors derived from (M)^T (M)
    - (S) is an r x r diagonal matrix of singular values, where r is the rank of (M_ij) (r <= min(t, N))
- Singular values are the positive square roots of the eigenvalues of (M)(M)^T (equivalently, of (M)^T (M)) (a quick numerical check follows below)
- For the special case where M is a square symmetric matrix, S is the diagonal eigenvalue matrix and U and V are eigenvector matrices
- U and V are orthogonal matrices
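
The claims on this slide are easy to verify numerically. The sketch below is my own check on a random stand-in matrix (the data is assumed, not from the lecture's example).

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.random((9, 8))                     # a t x N term-document-like matrix

U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Singular values are the positive square roots of the eigenvalues of M M^T.
eigvals = np.linalg.eigvalsh(M @ M.T)      # eigenvalues in ascending order
print(np.allclose(np.sort(s**2), eigvals[-len(s):]))    # True

# U and V have orthonormal columns: U^T U = I and V^T V = I.
print(np.allclose(U.T @ U, np.eye(U.shape[1])))         # True
print(np.allclose(Vt @ Vt.T, np.eye(Vt.shape[0])))      # True
```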

Latent Semantic Indexing
- The key idea is to map documents and queries into a lower-dimensional space (i.e., one composed of higher-level concepts, which are fewer in number than the index terms)
- Retrieval in this reduced concept space might be superior to retrieval in the space of index terms

Latent Semantic Indexing
- In the matrix (S), select only the k largest singular values
- Keep the corresponding columns in (U) and (V)^T
- The resultant matrix is called (M)_k and is given by
  - (M)_k = (U)_k (S)_k (V)_k^T
  - where k, k < r, is the dimensionality of the concept space
- The parameter k should be
  - large enough to allow fitting the characteristics of the data
  - small enough to filter out the non-relevant representational details
- This is the classic over-fitting issue (see the sketch below)
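
A minimal sketch of the rank-k approximation (M)_k = U_k S_k V_k^T and how its reconstruction error shrinks as k grows. The stand-in matrix is random (my own illustration, not the lecture's example).

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.random((9, 8))                      # stand-in for a t x N term-document matrix

U, s, Vt = np.linalg.svd(M, full_matrices=False)

for k in (2, 4, 6):
    M_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # keep only the k largest singular values
    err = np.linalg.norm(M - M_k, "fro")          # Frobenius norm of what was discarded
    print(f"k={k}: reconstruction error {err:.3f}")
```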

Computing an Example
- Let (M_ij) be given by the term-document matrix shown on the slide
- Compute the matrices (U), (S), and (V)^T

Example
- Term-document matrix: 9 terms (controllability, observability, realization, feedback, controller, observer, transfer function, polynomial, matrices) across 8 documents (chapters ch2-ch9); the numeric entries and the resulting U (9x7), S (7x7), and V^T (7x8) matrices are shown on the slide
- This happens to be a rank-7 matrix, so only 7 dimensions are required
- Singular values = square roots of the eigenvalues of A A^T

- Truncating to k=2 gives U2 (9x2), S2 (2x2), V2 (8x2); the numeric values are shown on the slide
- U2 * S2 * V2^T is a 9x8 matrix that approximates the original term-document matrix

- The slide compares the full reconstruction U S V^T = U7 S7 V7^T with the rank-k approximations Uk Sk Vk^T for k=6 (one component ignored), k=4 (3 components ignored), and k=2 (5 components ignored)
- What should be the value of k?

Coordinate transformation inherent in LSI
- M = U S V^T
- Mapping of keywords into LSI space is given by the rows of U S
  - For k=2, the slide tabulates the (LSIx, LSIy) coordinates of each keyword (controllability, observability, realization, feedback, controller, observer, transfer function, polynomial, matrices) and plots them, together with document ch3, in the 2-dimensional LSI space
- Mapping of a doc d = [w1 ... wt] into LSI space is given by d U S^-1
  - The base keywords of the doc are first mapped to LSI keywords and then differentially weighted by S^-1 (see the sketch below)
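
A minimal sketch of the two mappings on this slide: keywords into LSI space via U_k S_k, and a document d into LSI space via d U_k S_k^-1. The matrix is a random stand-in and the variable names are my own; only the formulas come from the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.random((9, 8))                        # stand-in for the 9-term x 8-doc matrix

U, s, Vt = np.linalg.svd(M, full_matrices=False)
k = 2
U_k, S_k = U[:, :k], np.diag(s[:k])

term_coords = U_k @ S_k                       # one row of (LSIx, LSIy) per keyword
print("keyword coordinates:\n", term_coords)

d = M[:, 2]                                   # take an existing column as a sample document
doc_coords = d @ U_k @ np.linalg.inv(S_k)     # d U_k S_k^-1: term weights differentially scaled by S^-1
print("document coordinates:", doc_coords)
```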

Medline data from Berry’s paper

Querying
- To query for "feedback controller", the query vector would be q = [0 0 0 1 1 0 0 0 0]' (' indicates transpose), since feedback and controller are the 4th and 5th terms in the index, and no other terms are selected.
- Let q be the query vector. The document-space vector corresponding to q is given by q'*U2*inv(S2) = Dq, i.e., the centroid of the terms in the query (with scaling). For the feedback controller query, the resulting Dq is shown on the slide.
- To find the best document match, we compare the Dq vector against all the document vectors in the 2-dimensional V2 space. The document vector that is nearest in direction to Dq is the best match. The cosine values for the eight document vectors against the query vector are listed on the slide (see the sketch below).
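
A minimal sketch of this query procedure: fold the query into the k=2 document space with Dq = q' U2 inv(S2), then rank documents by cosine against the rows of V2. The matrix here is random illustrative data, not the slide's actual numbers.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.random((9, 8))                        # stand-in for the 9-term x 8-doc matrix

U, s, Vt = np.linalg.svd(M, full_matrices=False)
k = 2
U2, S2, V2 = U[:, :k], np.diag(s[:k]), Vt[:k, :].T   # V2 has one row per document

q = np.zeros(9)
q[3] = q[4] = 1.0                             # "feedback controller": 4th and 5th index terms
Dq = q @ U2 @ np.linalg.inv(S2)               # fold the query into the 2-d document space

def cosine(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

scores = [cosine(Dq, V2[j]) for j in range(V2.shape[0])]
ranking = np.argsort(scores)[::-1]            # best matches first
print("cosines:", np.round(scores, 3))
print("ranked docs:", ranking)
```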

Matches within the .40 cosine threshold, where K is the number of singular values used (results plotted on the slide).

Latent Ranking (a la text)
- The user query can be modelled as a pseudo-document in the original (M) matrix
- Assume the query is modelled as the document numbered 0 in the (M) matrix
- The matrix (M)_k^T (M)_k quantifies the relationship between any two documents in the reduced concept space
- The first row of this matrix provides the rank of all the documents with regard to the user query (represented as the document numbered 0)
- This is an inefficient way to do it

Folding docs: convert new documents into LSI space using the d U S^-1 method.
Folding terms: find the vectors for new terms as a weighted sum of the docs in which they occur.
Practical issue: how often do you re-compute the SVD when terms or documents are added to the collection?
- Folding is a cheaper solution but will worsen quality over time (a small folding sketch follows below).
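
A minimal sketch of folding a new document and a new term into an existing k=2 LSI space without recomputing the SVD. The data and names are my own illustration; the folding formulas follow the slide (d U S^-1 for documents, and the analogous weighted combination of document vectors for terms).

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.random((9, 8))                        # existing term-document matrix

U, s, Vt = np.linalg.svd(M, full_matrices=False)
k = 2
U_k, S_k, V_k = U[:, :k], np.diag(s[:k]), Vt[:k, :].T

# Folding a new document: its term-weight vector d (length 9) is mapped by d U_k S_k^-1.
d_new = rng.random(9)
doc_folded = d_new @ U_k @ np.linalg.inv(S_k)
print("folded doc coords:", doc_folded)

# Folding a new term: its occurrence vector over the existing docs (length 8) is mapped
# as a weighted combination of the document vectors, scaled by S_k^-1.
t_new = rng.random(8)
term_folded = t_new @ V_k @ np.linalg.inv(S_k)
print("folded term coords:", term_folded)
```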

Summary of LSI
- Latent semantic indexing provides an interesting conceptualization of the IR problem
- No stemming needed, spelling errors tolerated
- Can do true conceptual retrieval
  - Retrieval of documents that do not share any keywords with the query!

The best fit for the feedback controller query vector is with the second document, which is Chapter 3. The sixth document, or Chapter 7, is also a good match. A query for "feedback realization" yields a query vector Dq and cosine values (shown on the slide); the best matches for feedback realization are Chapters 4 and 6.