Outline:
– Problems for classical IR models
– Introduction & background (LSI, SVD, etc.)
– Example
– Standard query method
– Analysis of the standard query method
– Seeking the best
– Experimental results
– SVR vs. IRR
– SVR conclusion
– Future work
Problems for classical IR models; LSI; SVD
Problems for classical IR models:
– Synonymy: various words and phrases refer to the same concept (lowers recall).
– Polysemy: individual words have more than one meaning (lowers precision).
– Independence: no significance is given to two terms that frequently appear together.
Latent Semantic Analysis

General idea:
– Map documents (and terms) to a low-dimensional representation.
– Design the mapping so that the low-dimensional space reflects semantic associations (the latent semantic space).
– Compute document similarity from the inner product in the latent semantic space.

Goals:
– Similar terms map to similar locations in the low-dimensional space.
– Noise reduction by dimension reduction.
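The general idea above can be sketched with a truncated SVD. This is a minimal illustration with a toy term-by-document matrix (the data and the choice of k = 2 are invented for the example, not from the slides):

```python
import numpy as np

# Toy term-by-document matrix: rows = terms, columns = documents.
A = np.array([
    [1, 1, 0, 0],   # "car"
    [1, 0, 1, 0],   # "auto"  (synonym of "car")
    [0, 0, 1, 1],   # "flower"
    [0, 1, 0, 1],   # "petal"
], dtype=float)

# Full SVD, then keep only the top-k singular triplets.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]

# Documents in the k-dimensional latent semantic space
# (columns of S_k V_k^T).
docs_latent = np.diag(sk) @ Vtk

# Document-document similarity = inner products in the latent space.
sim = docs_latent.T @ docs_latent
```

In the latent space, documents sharing no literal terms can still come out similar when their terms co-occur across the collection, which is how LSI addresses the synonymy problem listed earlier.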
Vector Model

Given a set of documents and a finite set of terms, every document d can be represented as a vector of term weights, and likewise the query q. The similarity of query q and document d is then computed in this term space. Given a threshold, all documents with similarity above the threshold are retrieved.
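A minimal sketch of threshold retrieval in the vector model, using cosine similarity (a common choice; the slides do not specify the similarity function, and the weights and threshold below are toy values):

```python
import numpy as np

# Toy document vectors over three terms (rows = documents).
D = np.array([[2.0, 0.0, 1.0],    # d1
              [0.0, 1.0, 1.0],    # d2
              [1.0, 1.0, 0.0]])   # d3
q = np.array([1.0, 0.0, 1.0])     # query vector in the same term space

# Cosine similarity between q and each document row.
sims = D @ q / (np.linalg.norm(D, axis=1) * np.linalg.norm(q))

# Retrieve every document whose similarity exceeds the threshold.
threshold = 0.5
retrieved = np.where(sims > threshold)[0]
```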
SVD and low-rank approximations

The SVD factors A into U S V^T, where U (V) is the orthogonal matrix containing the left (right) singular vectors of A, and S is the diagonal matrix containing the singular values of A, ordered non-increasingly; the rank k of A is the number of non-zero singular values. Truncate the SVD by keeping n ≤ k terms: the result A_n is the "best" matrix among all rank-n matrices with respect to the spectral and Frobenius norms. This optimality property is very useful in, e.g., Principal Component Analysis (PCA), LSI, etc.
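The optimality of the truncation (the Eckart–Young theorem) can be checked numerically: the spectral-norm error of the rank-n truncation equals the (n+1)-th singular value, and the Frobenius error is the root of the sum of the squared discarded singular values. A small sketch with random data:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))   # arbitrary matrix for illustration

U, s, Vt = np.linalg.svd(A, full_matrices=False)
n = 2
# Rank-n truncation A_n = U_n S_n V_n^T.
An = U[:, :n] @ np.diag(s[:n]) @ Vt[:n, :]

# Eckart-Young: spectral error = sigma_{n+1};
# Frobenius error = sqrt(sum of the remaining sigma^2).
err_2 = np.linalg.norm(A - An, 2)
err_F = np.linalg.norm(A - An, 'fro')
```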
Experimental setup: the TREC-4 data set (http://trec.nist.gov/). We randomly chose 5,305 documents and tested with 20 queries. Stemming (the Porter stemmer, http://www.tartarus.org/~martin/PorterStemmer/) and stop-word removal (http://www.lextek.com/manuals/onix/stopwords1.html) were applied. The resulting term-by-document matrix was of dimension 16,571 × 5,305 and was determined through the SVD to have full rank 5,305.
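The preprocessing pipeline described above (tokenize, drop stop words, stem, count) can be sketched as follows. This is only an illustration: the stop-word list is abbreviated, and `crude_stem` is a trivial suffix-stripping stand-in for the actual Porter stemmer used in the experiments:

```python
import re
from collections import Counter

# Abbreviated stop-word list (the experiments used the full Onix list).
STOPWORDS = {"the", "a", "of", "and", "is", "are", "in", "to"}

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def crude_stem(word):
    # Toy stemmer: strips a few common suffixes; stands in for Porter.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

docs = ["The cars are racing", "A car raced in the race"]
vocab, counts = {}, []
for doc in docs:
    terms = [crude_stem(w) for w in tokenize(doc) if w not in STOPWORDS]
    counts.append(Counter(terms))
    for t in terms:
        vocab.setdefault(t, len(vocab))

# Term-by-document count matrix (rows = terms, columns = documents).
matrix = [[counts[j].get(t, 0) for j in range(len(docs))] for t in vocab]
```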
The evaluation measure T is the area between the interpolated recall-precision (IRP) curve and the horizontal (recall) axis; it represents the average interpolated precision over the full range [0, 1] of recall.
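A sketch of computing T as described above: interpolate precision so that it is non-increasing over recall (the standard interpolation takes, at each recall level, the maximum precision at any higher recall), then integrate over recall. The precision/recall points below are toy values, not from the experiments:

```python
import numpy as np

# Toy recall-precision points (recall must span [0, 1]).
recall    = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
precision = np.array([1.0, 0.8, 0.6, 0.7, 0.4, 0.3])

# Interpolated precision at recall r = max precision at recall >= r.
interp = np.array([precision[i:].max() for i in range(len(precision))])

# T = area under the interpolated curve (trapezoidal rule).
T = float(np.sum((interp[1:] + interp[:-1]) / 2 * np.diff(recall)))
```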
LSI vs. IRR: both start from a weighted term-by-document matrix A; LSI decomposes it via the SVD into U and V^T (eigenvectors and eigenvalues), while IRR additionally applies rescaling.
In IRR, the term-by-document matrix is turned into a term-by-sentence matrix before computing U and V^T, and every document is put in as a query to compute the similarity.
Fig 5: SVD of a 2×2 matrix.
Mathematical analysis showed that:
– The results of version A and version B differ by a factor of S², where S is the diagonal matrix of singular values in the dimension-reduced model.
– The retrieval results of version B and version B′ are always identical if the Equivalency Principle is satisfied.
– Version B (B′) should be a better option than version A.
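The S² relationship in the first point can be verified numerically. The exact definitions of versions A and B are assumed here, following a common LSI formulation in which version A scores q against the document coordinates V_k while version B weights both query and documents by the singular values (all data below is toy):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((8, 5))   # toy term-by-document matrix
q = rng.random(8)        # toy query vector

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 3
Uk, Sk, Vk = U[:, :k], np.diag(s[:k]), Vt[:k, :].T

# Assumed formulations:
#   Version A: compare q^T U_k with rows of V_k (no singular values).
#   Version B: compare q^T U_k S_k with rows of V_k S_k.
score_A = (q @ Uk) @ Vk.T
score_B = (q @ Uk @ Sk) @ (Vk @ Sk).T

# Version B equals version A with an S^2 factor inserted in the
# latent space, matching the analysis above.
score_B_check = (q @ Uk) @ Sk**2 @ Vk.T
```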
Experiments on the standardized TREC data set confirmed that:
– Using SVR in addition to conventional LSI improves retrieval by 5.9% over using conventional LSI alone.
– SVR is computationally as efficient as the best standard query method (version B).
– SVR performs better than IRR.
Future work:
– Applying SVR to other fields of IR, such as image retrieval and video/audio retrieval.
– Seeking mathematical justification of SVR, including the relationship between the optimal rescaling factor S_exp and the characteristics of a particular data set.