Latent semantic space: Iterative scaling improves precision of inter-document similarity measurement. Rie Kubota Ando.


1 Latent semantic space: Iterative scaling improves precision of inter-document similarity measurement
Rie Kubota Ando. Latent semantic space: Iterative scaling improves precision of inter-document similarity measurement. In the 23rd Annual International ACM SIGIR Conference (SIGIR 2000), 2000. Presenter: 游斯涵

2 Introduction
Some studies apply modified or generalized SVD (to improve the precision of similarities):
SDD (semidiscrete decomposition), T. G. Kolda and D. P. O'Leary: proposed to reduce the storage and computational costs of LSI.
R-SVD (Riemannian SVD), E. P. Jiang and M. W. Berry: user feedback can be integrated into LSI models.
Theoretical accounts of LSI: MDS (multidimensional scaling), Bayesian regression models, probabilistic models.

3 Introduction: the problem with SVD
With SVD, the topics underlying outlier documents (documents very different from the other documents) tend to be lost as we choose a lower number of dimensions. What the dimensional reduction discards comes from two sources: outlier documents and minor terms. The idea of this paper: do not treat outlier documents as "noise"; all documents are assumed to be equally important. Try to eliminate the noise that comes from minor terms without eliminating the influence of outlier documents.

4 Introduction

5 Comparison with SVD
Same: trying to find a smaller set of basis vectors for a reduced space.
Different: scale the length of each residual vector; treat documents and terms in a nonsymmetrical way.

6 Algorithm: basis vector creation
Input: term-document matrix D (m×n: m terms, n documents), scaling factor q. Output: basis vectors b_1, b_2, …
R := D
For (i = 1; until reaching some stopping criterion; i = i+1)
  scale each residual vector (column of R) by its length raised to the power q
  b_i := the first unit eigenvector of R Rᵀ (an m×m matrix; R is m×n)
  R := R − b_i (b_iᵀ R)  (subtract the component along b_i from every residual)
End for

7 Algorithm: basis vector creation (cont.)
(Slide illustrates the matrix dimensions of the residual update: each side is an m×n matrix.)

8 Algorithm: document vector creation
Dimension reduction: stack the first k basis vectors into an m×k matrix B; each m-dimensional document vector is mapped to the k-dimensional vector Bᵀd, so the k×n reduced document matrix is Bᵀ times the m×n matrix D. There are two important variables in this algorithm: q (the scaling factor) and k (the number of dimensions).
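The basis-creation loop of slide 6 and the document projection of slide 8 can be sketched in NumPy. This is a hedged reconstruction, not the paper's reference implementation: the exact rescaling rule and stopping criterion are not given on the slides, so here each residual column is scaled by its length raised to the power q and the loop simply runs for k iterations; the function names are my own.

```python
import numpy as np

def irr_basis(D, q=2.0, k=5):
    """Sketch of the iterative-rescaling basis creation (slides 6-8).
    D is an m x n term-document matrix (columns = documents),
    q is the scaling factor, k the number of basis vectors."""
    R = D.astype(float).copy()
    basis = []
    for _ in range(k):
        # Rescale each residual (column) by its length to the power q,
        # amplifying documents that are still poorly represented.
        norms = np.linalg.norm(R, axis=0)
        S = R * (norms ** q)
        # First unit eigenvector of S S^T = leading left singular vector of S.
        U, _, _ = np.linalg.svd(S, full_matrices=False)
        b = U[:, 0]
        basis.append(b)
        # Remove the component along b from every residual vector.
        R = R - np.outer(b, b @ R)
    return np.column_stack(basis)   # m x k matrix B

def reduce_documents(D, B):
    """Project each document onto the k basis vectors (slide 8)."""
    return B.T @ D                   # k x n reduced document vectors
```

Because each deflation step leaves the residuals orthogonal to the basis vector just extracted, the returned basis vectors are mutually orthonormal, as with SVD.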

9 Example: find the eigenvector of the matrix (shown on the slide).

10 example Find it’s eigenvector

11 Probabilistic model
Over the basis vectors, the data is assumed to follow a Gaussian distribution, i.e. a multivariate normal (MVN) distribution.

12 Probabilistic model
The log-likelihood for the document vectors reduced to dimension k is computed as in Ding's probabilistic model; the dimension is chosen to maximize this log-likelihood. One term of the expression is negligible because it changes slowly with k.

13 Parameter settings
Scaling factor q: set from 1 to 10 in increments of 1. Dimension: selected by log-likelihood.

14 Experiment
Test data: TREC collections, 20 topics; the total number of documents is 684. The training data and test data are disjoint. The test data consists of two pools, each containing 15 document sets; each set ranges from 31 to 126 documents and covers from 6 to 20 topics.

15 Baseline algorithms
Three algorithms are compared: (1) SVD, taking the left singular vectors as the basis vectors; (2) the term-document matrix without any basis conversion (raw term frequencies); (3) this paper's algorithm.

16 Evaluation
Assumption: similarity should be higher for any document pair relevant to the same topic (an intra-topic pair). (Slide chart shows the values 60, 62.2, and 67.7.)
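The evaluation idea above can be sketched as follows. This is a hedged illustration of the intra-topic-pair assumption, not the paper's exact precision measure: rank all document pairs by cosine similarity and check what fraction of the top-ranked pairs are intra-topic; the function name and interface are my own.

```python
import numpy as np
from itertools import combinations

def pair_precision(doc_vecs, topics, n_top):
    """Rank all document pairs by cosine similarity and return the
    fraction of the n_top highest-ranked pairs that are intra-topic.
    doc_vecs: k x n matrix (one column per document); topics: labels."""
    n = doc_vecs.shape[1]
    unit = doc_vecs / np.linalg.norm(doc_vecs, axis=0)
    sims = unit.T @ unit                      # n x n cosine similarities
    pairs = sorted(combinations(range(n), 2),
                   key=lambda p: sims[p], reverse=True)
    top = pairs[:n_top]
    return sum(topics[i] == topics[j] for i, j in top) / n_top
```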

17 Evaluation
Preservation rate (of document vector length).
Reduction rate (the larger, the better): 1 − preservation rate.
Dimensional reduction rate (the larger, the better): 1 − (# of dimensions / max # of dimensions).
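The three rates can be written out directly. One assumption is needed, since the slide only says "(document length)": here the preservation rate is taken to be the fraction of a document vector's length that survives projection onto the k basis vectors.

```python
import numpy as np

def preservation_rate(d, B):
    # Assumed definition: fraction of the document vector's length
    # preserved by projection onto the k basis vectors in B (m x k).
    return np.linalg.norm(B.T @ d) / np.linalg.norm(d)

def reduction_rate(d, B):
    # Slide 17: reduction rate (larger is better) = 1 - preservation rate.
    return 1.0 - preservation_rate(d, B)

def dimensional_reduction_rate(k, max_k):
    # Slide 17: 1 - (# of dimensions / max # of dimensions).
    return 1.0 - k / max_k
```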

18 Selection of dimension
Log-likelihood method (slide 12). Training-based method: choose the dimension that makes the preservation rate closest to the average preservation rate on the training data. Random guess-based method.
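The training-based method amounts to a one-line nearest-match search. This sketch assumes the candidate dimensions have been mapped to their average preservation rates; the function name and its dict interface are my own.

```python
def choose_dimension(avg_rates, target_rate):
    """Training-based selection (slide 18): avg_rates maps each candidate
    dimension k to its average preservation rate; pick the k whose rate
    is closest to target_rate, the average preservation rate observed
    on the training data."""
    return min(avg_rates, key=lambda k: abs(avg_rates[k] - target_rate))
```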

19 Results (chart): precision improvement of 17.8%.

20 Results
The dimension reduction rate is 43% higher than SVD on average; this algorithm also shows a 35.8% higher reduction rate than SVD.

21 Results

22 Conclusion
This algorithm achieved higher precision of similarity measurement (up 17.8%) with a higher reduction rate (43% higher) than the baseline algorithms. Future work: the scaling factor could be made dynamic to further improve performance.

