Restructuring Sparse High Dimensional Data for Effective Retrieval


1 Restructuring Sparse High Dimensional Data for Effective Retrieval
C. L. Isbell and P. Viola, NIPS vol. 11. Summarized by Seong-woo Chung.

2 Introduction
- The task in text retrieval is to find the subset of a collection of documents that is relevant to a user's information request, usually expressed as a set of words.
- Proposes a topic-based model for the generation of words in documents: each document is generated by the interaction of a set of independent hidden random variables called topics.
- Constructs a set of linear operators that extract the independent topic structure of documents.
- Compares several existing methods against the derived method, which is unsupervised.
(C) 2001, SNU CSE Biointelligence Lab

3 Introduction (Continued)
- When a topic is active, it causes words to appear in documents; the final set of observed words results from a linear combination of topics.
- Individual words are only weak indicators of the underlying topics.
- The task is to discover from data those collections of words that best predict the (unknown) underlying topics.

4 Vector Space Model
- The similarity between two documents d_i and d_j under the VSM is their inner product d_i^T · d_j (for a query q, d_i^T · q).
- While the word-document matrix has a very large number of potential entries, it is sparsely populated (in practice, non-zero elements make up about 2% of the total).
- Any text retrieval system must overcome the fundamental difficulty that the presence or absence of a word is insufficient to determine relevance:
  - Synonymy - e.g. "car" vs. "automobile" (false negatives)
  - Polysemy - e.g. "apple" is both a fruit and a computer company (false positives)
- Solution - LSI?
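As a toy illustration (the vocabulary, counts, and query below are invented for this sketch), the VSM inner-product similarity can be written directly in NumPy; note how a query for "car" fails to match a document that only contains "automobile" - the synonymy problem the slide describes:

```python
import numpy as np

# Toy term-document matrix D: rows = vocabulary terms, columns = documents.
# Entries here are raw term counts; real systems typically use tf-idf weights.
vocab = ["car", "automobile", "engine", "apple", "fruit"]
D = np.array([
    [2, 0, 0],   # "car"        appears in doc 0
    [0, 3, 0],   # "automobile" appears in doc 1
    [1, 1, 0],   # "engine"     appears in docs 0 and 1
    [0, 0, 2],   # "apple"      appears in doc 2
    [0, 0, 1],   # "fruit"      appears in doc 2
], dtype=float)

q = np.array([1, 0, 0, 0, 0], dtype=float)  # query: "car"

scores = D.T @ q  # similarity of document i to the query: d_i^T q
# Doc 0 matches, but doc 1 ("automobile") scores 0: a false negative
# caused by synonymy.
```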

5 Latent Semantic Indexing
- LSI constructs a smaller document matrix that retains only the most important information from the original, using the Singular Value Decomposition (SVD).
- The SVD of a matrix D is USV^T, where U and V have orthonormal columns and S is diagonal.
- U contains the eigenvectors of D · D^T; the diagonal entries of S are the square roots of the corresponding eigenvalues (the singular values).

6 Latent Semantic Indexing (Continued)
- LSI represents documents as linear combinations of orthogonal features.
- Each document is projected into a lower-dimensional space as U_k^T d, where S_k and U_k retain only the largest k singular values and their corresponding singular vectors (a smaller representation that captures the most variation).
- Queries are also projected into this space, so the relevance of the documents to a query q is D^T U_k U_k^T q.
- It is hoped that these features represent meaningful underlying "topics" present in the collection.
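The LSI projection above can be sketched in a few lines of NumPy; the term-document matrix here is random stand-in data, since the actual collection is not given:

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.random((50, 20))          # term-document matrix: 50 terms, 20 docs
k = 5                             # number of retained latent dimensions

U, s, Vt = np.linalg.svd(D, full_matrices=False)
Uk = U[:, :k]                     # top-k left singular vectors (the "features")

D_hat = Uk.T @ D                  # documents in the k-dimensional latent space
q = rng.random(50)                # a query expressed in term space
q_hat = Uk.T @ q                  # query projected into the same space

relevance = D_hat.T @ q_hat       # equals D^T U_k U_k^T q
```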

7 Optimal Projections for Retrieval
- We could attempt a supervised approach, searching for a matrix P such that D^T P P^T q yields large values for the documents in D that are known to be relevant to a particular query q.
- Given a collection of documents D and queries Q, for each query we are told which documents are relevant.
- Use this information to construct an optimal P such that D^T P P^T Q = R, where R_ij equals 1 if document i is relevant to query j, and 0 otherwise.
- Compare the optimal axes with LSI's projection axes.
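One way to sketch this supervised construction (an assumption on my part - the slide does not say how P is found) is to solve for the combined operator M = P P^T by least squares using pseudoinverses, M = (D^T)^+ R Q^+, on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(1)
n_terms, n_docs, n_queries = 40, 15, 8
D = rng.random((n_terms, n_docs))      # term-document matrix
Q = rng.random((n_terms, n_queries))   # term-query matrix
R = (rng.random((n_docs, n_queries)) > 0.8).astype(float)  # relevance labels

# Least-squares solution of D^T M Q = R for M = P P^T,
# via pseudoinverses: M = (D^T)^+ R Q^+.
M = np.linalg.pinv(D.T) @ R @ np.linalg.pinv(Q)
R_hat = D.T @ M @ Q   # predicted relevance scores
```

Because there are more terms than documents or queries here, D^T has a right inverse and Q a left inverse, so R_hat reproduces R exactly; with realistic collection sizes the solution is only a least-squares fit.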

8 Optimal Projections for Retrieval (Continued)
(Figure: distributions of projection values, not preserved in this transcript.)
- The supervised optimal axes yield high-kurtosis (sparse) projections; the unsupervised LSI axes yield low-kurtosis projections.

9 Independent Components of Documents
- Project individual words onto the ICA space (this amounts to projecting the identity matrix onto that space).
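The slide presupposes an ICA decomposition of the document data, which is not reproduced here. As a self-contained sketch of the technique itself, the following is a minimal one-unit FastICA with deflation (tanh contrast) on synthetic mixed signals standing in for topic-generated data - a sketch, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(2)

# Two independent "topic" sources mixed into observed data.
t = np.linspace(0, 8 * np.pi, 2000)
S = np.vstack([np.sin(t), np.sign(np.sin(1.7 * t))])  # sources, shape (2, n)
A = np.array([[1.0, 0.6], [0.4, 1.0]])                # mixing matrix
X = A @ S                                             # observed mixtures

# Whiten: center, then decorrelate to unit variance.
X = X - X.mean(axis=1, keepdims=True)
cov = X @ X.T / X.shape[1]
eigval, eigvec = np.linalg.eigh(cov)
Z = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T @ X

# One-unit FastICA with deflation: extract components one at a time.
W = np.zeros((2, 2))
for i in range(2):
    w = rng.standard_normal(2)
    w /= np.linalg.norm(w)
    for _ in range(200):
        wz = w @ Z
        # Fixed-point update: E[z g(w^T z)] - E[g'(w^T z)] w, with g = tanh.
        w_new = (Z * np.tanh(wz)).mean(axis=1) - (1 - np.tanh(wz) ** 2).mean() * w
        w_new -= W[:i].T @ (W[:i] @ w_new)   # deflation: stay orthogonal
        w_new /= np.linalg.norm(w_new)
        converged = abs(w_new @ w) > 1 - 1e-9
        w = w_new
        if converged:
            break
    W[i] = w

recovered = W @ Z   # estimated independent components (up to sign/order)
```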

10 Topic Centered Representations
- The three groups of documents are used to drive the discovery of two sets of words.
- One set distinguishes the relevant documents from documents in general, a form of global clustering: f(G_k - B_k).
- The other set distinguishes the weakly-related documents from the relevant documents: -f(M_k - G_k).
- This leaves only a set of closely related documents.
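The two word sets can be sketched as follows, under stated assumptions: the group mean vectors G_k, M_k, B_k are random stand-ins, and f is taken to be a keep-the-largest-entries sparsifier (the slide leaves f unspecified):

```python
import numpy as np

rng = np.random.default_rng(3)
n_terms = 100

# Mean word-count vectors for the three document groups (toy stand-ins):
# G: documents relevant to topic k, M: weakly related, B: background.
G = rng.random(n_terms)
M = rng.random(n_terms)
B = rng.random(n_terms)

def f(x, frac=0.1):
    """Sparsifying nonlinearity (an assumption here): keep only the
    largest entries of x, zeroing the rest."""
    out = np.zeros_like(x)
    keep = np.argsort(x)[-int(len(x) * frac):]
    out[keep] = x[keep]
    return out

# Words that separate relevant documents from documents in general:
global_words = f(G - B)
# Words shared with weakly-related documents, to be subtracted out:
local_words = -f(M - G)

topic_operator = global_words + local_words   # one row of a linear operator
```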

11 Experiments
Methods compared:
- Baseline
- LSI
- Documents as Clusters
- Relevant Documents as Clusters
- ICA
- Topic Clustering

12 Experiments (Continued)

13 Discussion
- Described typical dimensionality reduction techniques used in text retrieval and showed that these techniques make strong assumptions about the form of the projection axes.
- Characterized another set of assumptions and derived an algorithm that enjoys significant computational and space advantages.
