1/ 30 [title slide; no text captured in the transcript]

– Problems for classical IR models
– Introduction & background (LSI, SVD, etc.)
– Example
– Standard query method
– Analysis of the standard query method
– Seeking the best
– Experimental results: SVR vs. IRR
– SVR
– Conclusion
– Future work
2/ 30

Problems for classical IR models
LSI
SVD
3/ 30

Problems for classical IR models
– Synonymy: various words and phrases refer to the same concept (lowers recall).
– Polysemy: individual words have more than one meaning (lowers precision).
– Independence: no significance is given to two terms that frequently appear together.
4/ 30

Latent Semantic Analysis
General idea:
– Map documents (and terms) to a low-dimensional representation.
– Design the mapping such that the low-dimensional space reflects semantic associations (latent semantic space).
– Compute document similarity based on the inner product in the latent semantic space.
Goals:
– Similar terms map to similar locations in the low-dimensional space.
– Noise reduction by dimensionality reduction.
5/ 30

Vector Model
– A finite set of documents and a finite set of terms.
– Every document d_j is represented as a vector of term weights; the query q is represented the same way.
– Similarity of query q and document d_j: based on the angle θ between the two vectors, sim(q, d_j) = (q · d_j) / (|q| |d_j|).
– Given a threshold, all documents with similarity above the threshold are retrieved.
6/ 30
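
A minimal numpy sketch of this retrieval rule; the toy weight matrix and the 0.5 threshold are illustrative assumptions, not values from the slides.

    import numpy as np

    def retrieve(doc_vectors, query, threshold=0.5):
        # Rank documents by cosine similarity to the query and keep those
        # above the threshold, as described on this slide.
        sims = (doc_vectors @ query) / (
            np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query))
        return np.where(sims > threshold)[0], sims

    # Toy term-by-document weights: 3 documents over a 4-term vocabulary.
    docs = np.array([[1.0, 0.0, 2.0, 0.0],
                     [0.0, 1.0, 0.0, 1.0],
                     [1.0, 1.0, 0.0, 0.0]])
    q = np.array([1.0, 0.0, 1.0, 0.0])
    hits, sims = retrieve(docs, q)
    print(hits, np.round(sims, 3))   # only document 0 passes the 0.5 threshold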

SVD and low-rank approximations
Truncate the SVD by keeping the top k terms (k ≤ ρ, where ρ is the rank of A, i.e., the number of non-zero singular values): A_k = U_k Σ_k V_k^T.
– U_k (V_k): orthogonal matrix containing the top k left (right) singular vectors of A.
– Σ_k: diagonal matrix containing the top k singular values of A, ordered non-increasingly.
A_k is the "best" matrix among all rank-k matrices with respect to the spectral and Frobenius norms. This optimality property is very useful in, e.g., Principal Component Analysis (PCA), LSI, etc.
7/ 30
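
A short numpy sketch of the truncation just described; it also checks the Eckart-Young optimality numerically (the random matrix and k = 2 are arbitrary illustrative choices).

    import numpy as np

    def truncated_svd(A, k):
        # Keep the top-k singular triplets: A_k = U_k @ diag(s_k) @ V_k^T.
        U, s, Vt = np.linalg.svd(A, full_matrices=False)  # s is non-increasing
        return U[:, :k], s[:k], Vt[:k, :]

    rng = np.random.default_rng(0)
    A = rng.standard_normal((8, 5))
    Uk, sk, Vtk = truncated_svd(A, k=2)
    A_k = Uk @ np.diag(sk) @ Vtk

    # Eckart-Young: the Frobenius error of the rank-k truncation equals the
    # root-sum-square of the discarded singular values.
    s = np.linalg.svd(A, compute_uv=False)
    print(np.linalg.norm(A - A_k, "fro"), np.sqrt(np.sum(s[2:] ** 2)))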

8/ 30 – 14/ 30 [no text captured in the transcript]

TREC-4 data set: 5305 randomly chosen documents, tested with 20 queries. Stemming (Porter stemmer) and stop-word removal were used. The term-by-document matrix was of dimension 16,571 × 5305 and was determined through the SVD process to have full rank (5305).
15/ 30
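
A sketch of how such a term-by-document matrix can be built; the NLTK Porter stemmer and the tiny stop-word list here are illustrative implementation choices, not necessarily what was used in the experiments.

    import numpy as np
    from nltk.stem.porter import PorterStemmer

    STOPWORDS = {"the", "a", "of", "and", "to", "in", "is"}  # tiny illustrative list
    stemmer = PorterStemmer()

    def term_document_matrix(docs):
        # Tokenize, drop stop words, apply the Porter stemmer, then count.
        tokenized = [[stemmer.stem(w) for w in d.lower().split() if w not in STOPWORDS]
                     for d in docs]
        vocab = sorted({t for toks in tokenized for t in toks})
        index = {t: i for i, t in enumerate(vocab)}
        A = np.zeros((len(vocab), len(docs)))   # terms x documents
        for j, toks in enumerate(tokenized):
            for t in toks:
                A[index[t], j] += 1.0
        return A, vocab

    A, vocab = term_document_matrix(["the cats sat", "a cat runs", "dogs run in the park"])
    print(A.shape, vocab)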

The evaluation measure T is the area between the interpolated recall-precision (IRP) curve and the horizontal recall axis; it represents the average interpolated precision over the full recall range [0, 1].
16/ 30
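
A hedged sketch of computing T; the standard interpolation rule (interpolated precision at recall r is the maximum precision at any recall ≥ r) is assumed here, since the slides do not spell it out.

    import numpy as np

    def t_measure(recall, precision, grid_points=101):
        # Interpolated precision at recall r: the maximum precision achieved
        # at any recall level >= r (0 if recall r is never reached).
        recall, precision = np.asarray(recall), np.asarray(precision)
        grid = np.linspace(0.0, 1.0, grid_points)
        interp = np.array([precision[recall >= r].max() if np.any(recall >= r) else 0.0
                           for r in grid])
        # T = area between the IRP curve and the recall axis over [0, 1].
        return np.trapz(interp, grid)

    # Toy (recall, precision) points from one ranked result list.
    print(t_measure([0.2, 0.4, 0.6, 0.8, 1.0], [1.0, 0.8, 0.6, 0.5, 0.4]))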

[Figure] LSI: a weighted term-by-document matrix A is decomposed by SVD into U, Σ, and V^T (singular values/vectors, i.e., eigenvalues/eigenvectors); IRR adds a rescaling step.
17/ 30

[Figure] IRR: the term-by-document matrix is turned into a term-by-sentence matrix and factored into U, Σ, and V^T; every document is then posed as a query to compute similarities.
18/ 30
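
A minimal LSI sketch tying the pipelines above to code: factor the matrix, fold a query into the k-dimensional latent space, and rank documents by cosine similarity there. The folding q_k = S_k^{-1} U_k^T q is the common LSI convention, assumed here since the slides only show diagrams.

    import numpy as np

    def lsi_fit(A, k):
        # Rank-k LSI factors of the term-by-document matrix A.
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        return U[:, :k], s[:k], Vt[:k, :]

    def lsi_query(Uk, sk, Vtk, q):
        # Fold the query into latent space (q_k = inv(S_k) @ U_k^T @ q),
        # then rank documents (rows of V_k) by cosine similarity.
        qk = (Uk.T @ q) / sk
        docs = Vtk.T                       # one row per document
        sims = (docs @ qk) / (np.linalg.norm(docs, axis=1) * np.linalg.norm(qk))
        return np.argsort(-sims), sims

    A = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]], dtype=float)
    Uk, sk, Vtk = lsi_fit(A, k=2)
    order, sims = lsi_query(Uk, sk, Vtk, q=np.array([1.0, 0.0, 0.0, 1.0]))
    print(order, np.round(sims, 3))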

19/ 30 – 20/ 30 [no text captured in the transcript]

Fig. 5: SVD of a 2×2 matrix.
21/ 30
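
As a concrete companion to Fig. 5, here is the SVD of an arbitrary 2×2 matrix computed with numpy (the matrix is an illustrative choice, not the one in the figure): any 2×2 matrix factors into a rotation/reflection, a diagonal scaling, and another rotation/reflection.

    import numpy as np

    A = np.array([[3.0, 1.0],
                  [1.0, 3.0]])            # arbitrary 2x2 example, not from the slides
    U, s, Vt = np.linalg.svd(A)
    print("singular values:", s)          # 4 and 2 for this symmetric matrix
    print("reconstruction:\n", U @ np.diag(s) @ Vt)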

22/ 30 [no text captured in the transcript]

Mathematical analysis showed that:
– The difference between the results of version A and version B is a factor of S², with S being the diagonal matrix of singular values in the dimension-reduced model.
– The retrieval results from version B and version B′ are always identical if the Equivalency Principle is satisfied.
– Version B (B′) should be a better option than version A.
23/ 30
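
To make the S² factor concrete, here is a hedged sketch: assume version A scores documents against unscaled latent coordinates while version B scales both sides by S_k, so the raw scores differ exactly by the diagonal factor S_k². The precise definitions of versions A, B, and B′ are on slides not captured here, so this illustrates the scaling argument rather than the exact formulation.

    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.random((6, 4))                 # toy term-by-document matrix
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = 2
    Uk, Sk, Vtk = U[:, :k], np.diag(s[:k]), Vt[:k, :]

    q = rng.random(6)                      # toy query
    qk = Uk.T @ q                          # latent query coordinates

    scores_A = Vtk.T @ qk                  # "version A": unscaled latent space
    scores_B = (Vtk.T @ Sk) @ (Sk @ qk)    # "version B": both sides scaled by S_k

    # The two score vectors differ exactly by the diagonal factor S_k^2.
    print(np.allclose(scores_B, Vtk.T @ np.diag(s[:k] ** 2) @ qk))  # True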

Experiments on a standardized TREC data set confirmed that:
– Using SVR in addition to conventional LSI improves over conventional LSI alone by 5.9%.
– SVR is computationally as efficient as the best standard query method (version B).
– SVR performs better than IRR.
24/ 30
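
Going only by the name and the mention below of an optimal rescaling factor S_exp, one plausible reading of SVR is that it replaces each kept singular value s_i by s_i**alpha for a tuned exponent alpha before scoring; the SVR slides themselves are not in the transcript, so treat this as an explicitly assumed interpretation, not the authors' definition.

    import numpy as np

    def svr_scores(A, q, k, alpha):
        # Hypothetical SVR: LSI with the kept singular values rescaled as
        # s**alpha before query folding and document scoring.
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        sk = s[:k] ** alpha
        qk = sk * (U[:, :k].T @ q)         # query in the rescaled latent space
        docs = Vt[:k, :].T * sk            # documents in the rescaled latent space
        return (docs @ qk) / (np.linalg.norm(docs, axis=1) * np.linalg.norm(qk))

    A = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]], dtype=float)
    q = np.array([1.0, 0.0, 0.0, 1.0])
    for alpha in (0.5, 1.0, 2.0):          # the "rescaling factor" being sought
        print(alpha, np.round(svr_scores(A, q, k=2, alpha=alpha), 3))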

Future work:
– Apply SVR to other fields of IR, such as image retrieval and video/audio retrieval.
– Seek mathematical justification for SVR, including the relationship between the optimal rescaling factor S_exp and the characteristics of a particular data set.
25/ 30