Latent semantic space: Iterative scaling improves precision of inter-document similarity measurement. Rie Kubota Ando.

Similar presentations
Information retrieval – LSI, pLSI and LDA

Microarray Data Analysis (Lecture for CS397-CXZ Algorithms in Bioinformatics) March 19, 2004 ChengXiang Zhai Department of Computer Science University.
Face Recognition Ying Wu Electrical and Computer Engineering Northwestern University, Evanston, IL
Dimensionality Reduction PCA -- SVD
Comparison of information retrieval techniques: Latent semantic indexing (LSI) and Concept indexing (CI) Jasminka Dobša Faculty of organization and informatics,
What is missing? Reasons that ideal effectiveness hard to achieve: 1. Users’ inability to describe queries precisely. 2. Document representation loses.
Evaluating Search Engine
Hinrich Schütze and Christina Lioma
DIMENSIONALITY REDUCTION BY RANDOM PROJECTION AND LATENT SEMANTIC INDEXING Jessica Lin and Dimitrios Gunopulos Ângelo Cardoso IST/UTL December
A shot at Netflix Challenge Hybrid Recommendation System Priyank Chodisetti.
Latent Semantic Indexing via a Semi-discrete Matrix Decomposition.
Dimensional reduction, PCA
Vector Space Information Retrieval Using Concept Projection Presented by Zhiguo Li
1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 4 March 30, 2005
1/ 30. Problems for classical IR models Introduction & Background(LSI,SVD,..etc) Example Standard query method Analysis standard query method Seeking.
IR Models: Latent Semantic Analysis. IR Model Taxonomy Non-Overlapping Lists Proximal Nodes Structured Models U s e r T a s k Set Theoretic Fuzzy Extended.
The Terms that You Have to Know! Basis, Linear independent, Orthogonal Column space, Row space, Rank Linear combination Linear transformation Inner product.
Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 18: Latent Semantic Indexing 1.
1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 6 May 7, 2006
Dimension of Meaning Author: Hinrich Schutze Presenter: Marian Olteanu.
Probabilistic Latent Semantic Analysis
Other IR Models Non-Overlapping Lists Proximal Nodes Structured Models Retrieval: Adhoc Filtering Browsing U s e r T a s k Classic Models boolean vector.
1 Linear Methods for Classification Lecture Notes for CMPUT 466/551 Nilanjan Ray.
Presented By Wanchen Lu 2/25/2013
1 Vector Space Model Rong Jin. 2 Basic Issues in A Retrieval Model How to represent text objects What similarity function should be used? How to refine.
Latent Semantic Indexing Debapriyo Majumdar Information Retrieval – Spring 2015 Indian Statistical Institute Kolkata.
Indices Tomasz Bartoszewski. Inverted Index Search Construction Compression.
Document retrieval Similarity –Vector space model –Multi dimension Search –Range query –KNN query Query processing example.
No. 1 Classification and clustering methods by probabilistic latent semantic indexing model A Short Course at Tamkang University Taipei, Taiwan, R.O.C.,
On Scaling Latent Semantic Indexing for Large Peer-to-Peer Systems Chunqiang Tang, Sandhya Dwarkadas, Zhichen Xu University of Rochester; Yahoo! Inc. ACM.
Latent Semantic Indexing: A probabilistic Analysis Christos Papadimitriou Prabhakar Raghavan, Hisao Tamaki, Santosh Vempala.
June 5, 2006University of Trento1 Latent Semantic Indexing for the Routing Problem Doctorate course “Web Information Retrieval” PhD Student Irina Veredina.
SINGULAR VALUE DECOMPOSITION (SVD)
Learning Theory Reza Shadmehr LMS with Newton-Raphson, weighted least squares, choice of loss function.
ECE 8443 – Pattern Recognition ECE 8423 – Adaptive Signal Processing Objectives: ML and Simple Regression Bias of the ML Estimate Variance of the ML Estimate.
MACHINE LEARNING 8. Clustering. Motivation Based on E ALPAYDIN 2004 Introduction to Machine Learning © The MIT Press (V1.1) 2  Classification problem:
1 CSC 594 Topics in AI – Text Mining and Analytics Fall 2015/16 6. Dimensionality Reduction.
LATENT SEMANTIC INDEXING BY SINGULAR VALUE DECOMPOSITION
Analyzing Expression Data: Clustering and Stats Chapter 16.
Web Search and Text Mining Lecture 5. Outline Review of VSM More on LSI through SVD Term relatedness Probabilistic LSI.
Link Distribution on Wikipedia [0407]KwangHee Park.
10.0 Latent Semantic Analysis for Linguistic Processing References : 1. “Exploiting Latent Semantic Information in Statistical Language Modeling”, Proceedings.
CS Statistical Machine learning Lecture 12 Yuan (Alan) Qi Purdue CS Oct
14.0 Linguistic Processing and Latent Topic Analysis.
Latent Semantic Analysis (LSA) Jed Crandall 16 June 2009.
A Collapsed Variational Bayesian Inference Algorithm for Latent Dirichlet Allocation Yee W. Teh, David Newman and Max Welling Published on NIPS 2006 Discussion.
No. 1 Classification Methods for Documents with both Fixed and Free Formats by PLSI Model* 2004International Conference in Management Sciences and Decision.
Bayesian Semi-Parametric Multiple Shrinkage
Information Retrieval: Models and Methods
Document Clustering Based on Non-negative Matrix Factorization
LECTURE 11: Advanced Discriminant Analysis
Information Retrieval: Models and Methods
Latent Semantic Indexing
Efficient Estimation of Word Representation in Vector Space
LSI, SVD and Data Management
Jianping Fan Dept of CS UNC-Charlotte
Lecture 21 SVD and Latent Semantic Indexing and Dimensional Reduction
Probabilistic Models with Latent Variables
SMEM Algorithm for Mixture Models
Presented by Nagesh Adluru
Learning Theory Reza Shadmehr
CS246: Latent Dirichlet Analysis
Topic Models in Text Processing
CS 430: Information Discovery
Mathematical Foundations of BME
Learning to Rank with Ties
Restructuring Sparse High Dimensional Data for Effective Retrieval
Lecture 16. Classification (II): Practical Considerations
Latent Semantic Analysis
Probabilistic Surrogate Models
Presentation transcript:

Latent semantic space: Iterative scaling improves precision of inter-document similarity measurement. Rie Kubota Ando. In Proceedings of the 23rd Annual International ACM SIGIR Conference (SIGIR 2000), 2000. Presenter: 游斯涵

Introduction. Several studies apply modified or generalized SVD to improve the precision of similarity measurement:
SDD (semi-discrete decomposition), T. G. Kolda and D. P. O'Leary: proposed to reduce the storage and computational costs of LSI.
R-SVD (Riemannian SVD), E. P. Jiang and M. W. Berry: allows user feedback to be integrated into LSI models.
Theoretical analyses of LSI include MDS (multidimensional scaling), Bayesian regression models, and probabilistic models.

Introduction: the problem with SVD. With SVD, the topics underlying outlier documents tend to be lost when a lower number of dimensions is chosen. The information discarded by dimensionality reduction comes from two sources: outlier documents and minor terms. (Outlier documents are documents very different from all other documents.) The idea of this paper: do not treat outlier documents as "noise"; all documents are assumed to be equally important. The goal is to eliminate the noise coming from minor terms without eliminating the influence of the outlier documents.
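To make this failure mode concrete, here is a minimal sketch with toy data of my own (not from the paper) showing how truncated SVD can erase an outlier document's topic:

```python
import numpy as np

# Rows are documents, columns are terms (same convention as the slides).
# Docs 0-3 share one topic; doc 4 is an outlier about an unrelated topic.
D = np.array([
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [3, 3, 0, 0],
    [2, 2, 0, 0],
    [0, 0, 3, 2],   # outlier document
], dtype=float)

U, s, Vt = np.linalg.svd(D, full_matrices=False)
k = 1  # aggressive dimensionality reduction
D_k = U[:, :k] * s[:k] @ Vt[:k, :]

print(np.round(D_k, 2))
# The outlier's row is reconstructed as (near) zero: its topic is "lost",
# exactly the effect the paper sets out to avoid.
```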

Introduction (figure slide; image not reproduced).

Comparison with SVD. Same: both try to find a smaller set of basis vectors for a reduced space. Different: the proposed method scales the length of each residual vector, and it treats documents and terms in a nonsymmetric way.

Algorithm-basis vector creation. Input: term-document matrix D (m x n: the m rows are documents, the n columns are terms) and scaling factor q. Output: basis vectors b_1, b_2, ...
Set the residual matrix R_1 = D. For i = 1, 2, ... (until reaching some criterion):
rescale each residual row r_j by its length to the power q, forming the scaled matrix S_i = diag(|r_1|^q, ..., |r_m|^q) R_i (an m x m diagonal matrix times the m x n residual matrix);
set b_i = the first unit eigenvector of S_i^T S_i;
remove the new direction from every residual: R_{i+1} = R_i - R_i b_i b_i^T.
End for.

Algorithm-basis vector creation (continued). In matrix form, the residual update is R_{i+1} = R_i - R_i b_i b_i^T: an (m x n) matrix minus the product of the (m x n) residual matrix, the (n x 1) vector b_i, and the (1 x n) vector b_i^T.
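A minimal numpy sketch of this basis-creation loop (my reconstruction under the conventions above; the function name is mine, and the stopping criterion is simplified to a fixed number of vectors k):

```python
import numpy as np

def create_basis(D, q, k):
    """Iteratively extract k basis vectors; before each extraction, every
    residual document vector (row) is rescaled by its length to the power q."""
    R = np.array(D, dtype=float)                     # residual matrix, m x n
    basis = []
    for _ in range(k):
        lengths = np.linalg.norm(R, axis=1)          # |r_j| for each row
        S = (lengths ** q)[:, None] * R              # scaled residuals, m x n
        # First unit eigenvector of S^T S = first right singular vector of S.
        _, _, Vt = np.linalg.svd(S, full_matrices=False)
        b = Vt[0]                                    # unit vector in term space
        basis.append(b)
        R = R - np.outer(R @ b, b)                   # strip the b component
    return np.array(basis)                           # k x n
```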

Algorithm-document vector creation (dimensionality reduction). Each document vector (a row of D, length n) is projected onto the k basis vectors: D_hat = D B, where B = [b_1, ..., b_k] is n x k, so the reduced matrix D_hat is m x k. There are two important parameters in this algorithm: q (the scaling factor) and k (the number of dimensions).
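Using the create_basis sketch above, document-vector creation is just a projection (q and k are picked arbitrarily here):

```python
B = create_basis(D, q=2, k=2)   # k x n basis matrix (rows are b_1 .. b_k)
D_hat = D @ B.T                 # m x k reduced document vectors
```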

Example: find the first unit eigenvector of S^T S for the scaled residual matrix (the worked numerical matrices on this slide were not preserved).

example Find it’s eigenvector

Probabilistic model. The document vectors, expressed in terms of the basis vectors, are assumed to follow a Gaussian distribution, i.e. a multivariate normal (MVN) distribution.

Probabilistic model. The log-likelihood of the document vectors reduced to dimension k is computed following Ding's model, and the number of dimensions is chosen to maximize it. One term of the formula is negligible because it changes only slowly with k. (The slide's equation was not preserved.)
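Since the slide's formula did not survive extraction, the following is only a generic stand-in, explicitly not Ding's exact model: it fits an MVN to the k-dimensional reduced vectors and scores each candidate k by log-likelihood, which conveys the shape of the selection procedure:

```python
import numpy as np
from scipy.stats import multivariate_normal

def select_k_by_loglik(D, basis, k_max):
    """Pick k maximizing a Gaussian log-likelihood of the reduced vectors.
    NOTE: a stand-in for Ding's formula, not the paper's exact computation."""
    best_k, best_ll = 1, -np.inf
    for k in range(1, k_max + 1):
        X = D @ basis[:k].T                        # m x k reduced vectors
        mu = X.mean(axis=0)
        # Small ridge keeps the sample covariance non-singular.
        cov = np.atleast_2d(np.cov(X, rowvar=False)) + 1e-6 * np.eye(k)
        ll = multivariate_normal.logpdf(X, mean=mu, cov=cov).sum()
        if ll > best_ll:
            best_k, best_ll = k, ll
    return best_k
```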

Parameter settings: the scaling factor q is set from 1 to 10 in increments of 1; the number of dimensions is selected by the log-likelihood criterion.

Experiment. Test data: TREC collections, 20 topics, 684 documents in total. The documents are divided into disjoint parts: two test pools (pool 1 and pool 2) of 15 document sets each, plus training data. Each document set contains between 31 and 126 documents and covers between 6 and 20 topics.

Baseline algorithms. Three algorithms are compared: (1) SVD, taking the left singular vectors as basis vectors; (2) the raw term-document matrix without any basis conversion (term frequency); (3) the algorithm proposed in this paper.

Evaluation assumption: similarity should be higher for any document pair relevant to the same topic (an intra-topic pair) than for a cross-topic pair. (The slide illustrated this with example similarity values 60, 62.2, and 67.7.)
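One way to operationalize this assumption (my formulation; the paper's exact precision computation may differ) is to check how often intra-topic pairs score above cross-topic pairs under cosine similarity:

```python
import itertools
import numpy as np

def intra_topic_separation(vecs, topics):
    """Fraction of (intra-topic, cross-topic) comparisons in which the
    intra-topic pair has the higher cosine similarity (1.0 is perfect)."""
    unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    sims, same = [], []
    for i, j in itertools.combinations(range(len(vecs)), 2):
        sims.append(unit[i] @ unit[j])
        same.append(topics[i] == topics[j])
    sims, same = np.array(sims), np.array(same)
    intra, cross = sims[same], sims[~same]
    return (intra[:, None] > cross[None, :]).mean()
```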

Evaluation measures.
Preservation rate (of document length): the fraction of document-vector length that survives the reduction.
Reduction rate (the larger, the better): 1 - preservation rate.
Dimensional reduction rate (the larger, the better): 1 - (# of dimensions / max # of dimensions).
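A direct transcription of the three measures (function names are mine; I assume the preservation rate aggregates total vector length, which the slide does not spell out):

```python
import numpy as np

def preservation_rate(D, D_hat):
    """Fraction of total document-vector length kept after the reduction."""
    return np.linalg.norm(D_hat, axis=1).sum() / np.linalg.norm(D, axis=1).sum()

def reduction_rate(D, D_hat):              # larger is better
    return 1.0 - preservation_rate(D, D_hat)

def dimensional_reduction_rate(k, k_max):  # larger is better
    return 1.0 - k / k_max
```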

Dimension selection. Three methods are compared: the log-likelihood method; the training-based method, which chooses the dimension whose preservation rate is closest to the average preservation rate observed on training data; and a random-guess-based method.
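The training-based method reduces to a one-liner on top of the preservation_rate sketch above (again my formulation, with target_rate standing in for the training-set average):

```python
def select_k_by_training(D, basis, target_rate):
    """Pick the k whose preservation rate is closest to the target rate
    estimated from training data."""
    ks = range(1, len(basis) + 1)
    return min(ks, key=lambda k: abs(preservation_rate(D, D @ basis[:k].T)
                                     - target_rate))
```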

Results. The precision of similarity measurement improves by up to 17.8% over the baseline (results chart not preserved).

Results. The dimensional reduction rate is 43% higher than SVD's on average, and the algorithm shows a 35.8% higher (length) reduction rate than SVD.

Results (additional chart not preserved).

Conclusion. The proposed algorithm achieves higher precision of similarity measurement (up to 17.8% better) with a higher reduction rate (43% higher) than the baseline algorithms. As future work, the scaling factor could be made dynamic to further improve performance.