1 Latent Concepts and the Number of Orthogonal Factors in Latent Semantic Analysis Georges Dupret


2 Abstract We seek insight into Latent Semantic Indexing by establishing a method to identify the optimal number of factors in the reduced matrix for representing a keyword. By examining the precision, we find that lower-ranked dimensions identify related terms and higher-ranked dimensions discriminate between the synonyms.

3 Introduction The task of retrieving the documents relevant to a user query in a large text database is complicated by the fact that different authors use different words to express the same ideas or concepts. Methods related to Latent Semantic Analysis interpret the variability associated with the expression of a concept as noise, and use linear algebra techniques to isolate the perennial concept from the variable noise.

4 Introduction In LSA, the SVD (singular value decomposition) technique is used to decompose a term-by-document matrix into a set of orthogonal factors. Retaining a large number of factors gives an approximation close to the original term-by-document matrix but keeps too much noise. On the other hand, if too many factors are discarded, the information loss is too large. The objective is to identify the optimal number of orthogonal factors.

5 Bag-of-words representation Each document is replaced by a vector of its attributes, which are usually the keywords present in the document. This representation can be used to retrieve documents relevant to a user query: a vector representation is derived from the query in the same way as for regular documents and then compared with the database using a suitable measure of distance or similarity.
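As an illustration of the bag-of-words representation and query matching, here is a minimal NumPy sketch; the toy documents and the use of cosine similarity are assumptions for illustration, not taken from the slides.

    import numpy as np

    # Toy corpus: each document is reduced to the bag of keywords it contains.
    docs = ["cat chases mouse", "mouse eats cheese", "dog chases cat"]
    vocab = sorted({w for d in docs for w in d.split()})

    # D x N document-by-keyword count matrix A.
    A = np.array([[d.split().count(w) for w in vocab] for d in docs], dtype=float)

    # The query is vectorized the same way and compared with cosine similarity.
    q = np.array(["cat mouse".split().count(w) for w in vocab], dtype=float)
    sims = (A @ q) / (np.linalg.norm(A, axis=1) * np.linalg.norm(q) + 1e-12)
    print(np.round(sims, 3))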

6 Latent Semantic Analysis LSI is one of the few methods that successfully overcome the vocabulary noise problem because it takes synonymy and polysemy into account. Not accounting for synonymy leads to underestimating the similarity between related documents, while not accounting for polysemy leads to erroneously finding similarities. The idea behind LSI is to reduce the dimension of the IR problem by projecting the D × N document-by-attribute matrix A onto an adequate subspace of lower dimension.

7 Latent Semantic Analysis SVD of the D × N matrix A: A = U Δ V^T (1), where U and V are orthogonal matrices and Δ is a diagonal matrix with elements σ_1, …, σ_p, with p = min(D, N) and σ_1 ≧ σ_2 ≧ … ≧ σ_p-1 ≧ σ_p. The closest rank-k matrix A(k), k < p, is obtained by retaining only the k largest singular values: A(k) = U(k) × Δ(k) × V(k)^T (2).
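A minimal NumPy sketch of Eqs. (1) and (2), computing the SVD of a small matrix and its rank-k approximation; the matrix entries are random and only for illustration.

    import numpy as np

    A = np.random.default_rng(0).random((6, 4))      # toy D x N matrix

    # Eq. (1): A = U Delta V^T, singular values in decreasing order.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)

    # Eq. (2): rank-k approximation keeping only the k largest singular values.
    k = 2
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    print("approximation error:", np.linalg.norm(A - A_k))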

8 Latent Semantic Analysis Then we compare documents in a k-dimensional subspace based on A(k). The projection of the original document representation gives A^T × U(k) × Δ(k)^-1 = V(k) (3), and the same operation on the query vector Q gives Q(k) = Q^T × U(k) × Δ(k)^-1 (4). The closest document to query Q is identified by a dissimilarity function d_k(·,·) (5).
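A sketch of the folding-in of Eqs. (3) and (4) followed by a nearest-document search; it assumes the usual term-by-document layout of A (terms as rows) and uses cosine similarity in place of the unspecified dissimilarity d_k, which is one common choice rather than the one the slides necessarily intend.

    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.random((8, 5))                 # toy term-by-document matrix (8 terms, 5 documents)

    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = 2
    U_k, S_k_inv = U[:, :k], np.diag(1.0 / s[:k])

    # Eq. (3): document coordinates in the k-dimensional subspace (equal to V(k)).
    docs_k = A.T @ U_k @ S_k_inv

    # Eq. (4): fold the query vector Q (term weights) into the same subspace.
    Q = rng.random(A.shape[0])
    Q_k = Q @ U_k @ S_k_inv

    # Eq. (5): rank documents; here the most similar document by cosine similarity.
    cos = docs_k @ Q_k / (np.linalg.norm(docs_k, axis=1) * np.linalg.norm(Q_k) + 1e-12)
    print("closest document:", int(np.argmax(cos)))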

9 Covariance Method Advantage: able to handle databases of several hundred thousand documents. If D is the number of documents, A_d the vector representing the d-th document and Ā the mean of these vectors, the covariance matrix is written C = (1/D) Σ_d (A_d − Ā)(A_d − Ā)^T (6). This matrix being symmetric, its singular value decomposition can be written C = V Δ V^T (7).

10 Covariance Method Reducing Δ to the k most significant singular values, we can project the keyword space onto a k-dimensional subspace: (A|Q) → (A|Q) V(k) Δ(k) = (A(k)|Q(k)) (8).
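A sketch of the covariance method of Eqs. (6)-(8); the 1/D normalisation in Eq. (6) and the reading of the scaling factor in Eq. (8) as Δ(k) are assumptions, since the transcript does not spell them out.

    import numpy as np

    rng = np.random.default_rng(2)
    A = rng.random((100, 20))                        # D documents x N keywords (toy data)

    # Eq. (6): covariance of the document vectors around their mean.
    A_bar = A.mean(axis=0)
    C = (A - A_bar).T @ (A - A_bar) / A.shape[0]

    # Eq. (7): C is symmetric, so its SVD doubles as an eigendecomposition C = V Delta V^T.
    V, delta, _ = np.linalg.svd(C)

    # Eq. (8): keep the k most significant singular values and project documents and a query.
    k = 5
    Q = rng.random(A.shape[1])
    A_k = A @ V[:, :k] @ np.diag(delta[:k])
    Q_k = Q @ V[:, :k] @ np.diag(delta[:k])
    print(A_k.shape, Q_k.shape)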

11 Embedded Concepts Sending the covariance matrix onto a subspace of fewer dimensions implies a loss of information. It can be interpreted as the merging of keyword meanings into a more general concept, e.g. "cat" and "mouse" → "mammal" → "animal". How many singular values are necessary for a keyword to be correctly distinguished from all others in the dictionary? What is the definition of "correctly distinguished"?

12 Correlation Method The correlation matrix S of A is defined from the covariance matrix C by S_αβ = C_αβ / √(C_αα C_ββ) (9). Using the correlation rather than the covariance matrix results in a different weighting of correlated keywords, the justification of the model remaining otherwise identical.
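A sketch of Eq. (9), rescaling the covariance matrix into the correlation matrix; the formula S_αβ = C_αβ / √(C_αα C_ββ) is the standard correlation-from-covariance definition, assumed here to be what the slide intends.

    import numpy as np

    rng = np.random.default_rng(3)
    A = rng.random((100, 20))                # D documents x N keywords (toy data)
    C = np.cov(A, rowvar=False)              # covariance over keywords

    # Eq. (9): divide each entry by the product of the keyword standard deviations.
    std = np.sqrt(np.diag(C))
    S = C / np.outer(std, std)
    print(np.allclose(np.diag(S), 1.0))      # every keyword has unit self-correlation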

13 Keyword Validity By the properties of the SVD, S can be written as a weighted sum of rank-one matrices: S = Σ_i σ_i v_i v_i^T (10). The rank-k approximation S(k) of S can be written S(k) = Σ_{i=1..k} σ_i v_i v_i^T (11), with k ≦ N and S(N) = S.

14 Keyword Validity We make the following argument: the k-order approximation of the correlation matrix correctly represents a given keyword only if this keyword is more correlated to itself than to any other attribute. For a given keyword α this condition is written S(k)_αα > S(k)_αβ for all β ≠ α (12). A keyword is said to be "valid" at rank k if k−1 is the largest value for which Eq. (12) is not verified; k is then the validity rank of the keyword.
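A sketch of the validity-rank computation implied by Eqs. (11) and (12): rebuild S(k) from the truncated decomposition and return one more than the largest rank at which the keyword fails to be more correlated with itself than with any other keyword. The toy correlation matrix is random; in the paper it comes from the actual term statistics.

    import numpy as np

    rng = np.random.default_rng(4)
    A = rng.random((200, 30))                        # D documents x N keywords (toy data)
    C = np.cov(A, rowvar=False)
    std = np.sqrt(np.diag(C))
    S = C / np.outer(std, std)                       # correlation matrix, Eq. (9)

    V, delta, _ = np.linalg.svd(S)                   # S is symmetric: SVD = eigendecomposition

    def validity_rank(alpha):
        # Validity rank of keyword alpha: one more than the largest k at which Eq. (12) fails.
        last_fail = 0
        for k in range(1, len(delta) + 1):
            S_k = V[:, :k] @ np.diag(delta[:k]) @ V[:, :k].T   # rank-k approximation, Eq. (11)
            off_diag = np.delete(S_k[alpha], alpha)
            if S_k[alpha, alpha] <= off_diag.max():            # Eq. (12) not verified at rank k
                last_fail = k
        return last_fail + 1

    print("validity rank of keyword 0:", validity_rank(0))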

15 Experiments Data: REUTERS (21,578 articles) and TREC5 (131,896 articles, generating 1,822,531 "documents") databases. Pre-processing: using the Porter algorithm to stem words; removing keywords appearing in more or fewer documents than two user-specified thresholds; mapping documents to vectors using TF-IDF.
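A sketch of this pre-processing pipeline using NLTK's Porter stemmer and scikit-learn's TfidfVectorizer; the example documents, the frequency thresholds and the exact TF-IDF weighting scheme are assumptions, since the slides do not fix them.

    from nltk.stem import PorterStemmer
    from sklearn.feature_extraction.text import TfidfVectorizer

    stemmer = PorterStemmer()

    def stem_text(text):
        # Stem every token with the Porter algorithm.
        return " ".join(stemmer.stem(tok) for tok in text.lower().split())

    docs = ["South Africa exports gold", "African networks expand", "Gold prices rise"]
    stemmed = [stem_text(d) for d in docs]

    # Drop keywords appearing in too many or too few documents (illustrative thresholds),
    # then map each document to a TF-IDF vector.
    vectorizer = TfidfVectorizer(min_df=1, max_df=0.9)
    A = vectorizer.fit_transform(stemmed)            # D x N sparse matrix
    print(A.shape)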

16 First Experiment The claim: a given keyword is correctly represented by a rank-k approximation of the correlation matrix if k is at least equal to the validity rank of the keyword. Experiment method: 1. Select a keyword, for example africa, and extract all the documents containing it. Produce a new copy of these documents, in which we replace the selected keyword by a new one, e.g. replace africa by afrique (French).

17 First Experiment Add these new documents to the original database, and the keyword afrique to the vocabulary. Compute the correlation matrix and the SVD of this extended, new database. Note that afrique and africa are perfect synonyms. Send the original database to the new subspace and issue a query for afrique. We hope to find documents containing africa first.
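A toy sketch of the corpus construction for this experiment (the document texts are invented); the correlation matrix and SVD would then be recomputed on the extended corpus, as described above.

    def extend_with_synonym(docs, keyword, synonym):
        # Copy every document containing the keyword, replacing it by the artificial synonym.
        copies = [" ".join(synonym if tok == keyword else tok for tok in d.split())
                  for d in docs if keyword in d.split()]
        return docs + copies

    docs = ["africa exports gold", "networks expand fast", "gold from africa"]
    print(extend_with_synonym(docs, "africa", "afrique"))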

18 First Experiment Figure 1: Keyword africa replaced by afrique. Curves corresponding to ranks 450 and 500 start with null precision and remain below the curves of lower validity ranks.

19 First Experiment Table 1: Keywords Characteristics

20 First Experiment Note: the drop in precision after the validity rank is reached is small, but still present. As usual, the validity rank is precisely defined: the drop in precision is observed as soon as the rank reaches 400.

21 First Experiment Figure 3: Keyword network replaced by reseau. Precision is 100% up to the validity rank and deteriorates drastically beyond it.

22 First Experiment This experiment shows the relation between the "concept" associated with afrique (or any other concept) and the actual keyword. For low-rank approximations S(k), augmenting the number of orthogonal factors helps identify the "concept" common to both afrique and africa, while orthogonal factors beyond the validity rank help distinguish between the keyword and its synonym.

23 Second Experiment Figure 4: Ratios R and hit = N/G for keyword afrique. Validity rank is 428.

24 Third Experiment Figure 5: vocabulary of 1201 keywords in REUTERS db. Figure 6: vocabulary of 2486 keywords in TREC db.

25 Conclusion We examined the dependence of the latent semantic structure on the number of orthogonal factors in the context of the Correlation Method. We analyzed the claim that LSA provides a method to take account of synonymy. We propose a method to determine the number of orthogonal factors for which a given keyword best represents an associated concept. Further directions might include the extension to multiple-keyword queries.