Restructuring Sparse High Dimensional Data for Effective Retrieval

Similar presentations
Text mining Gergely Kótyuk Laboratory of Cryptography and System Security (CrySyS) Budapest University of Technology and Economics

Eigen Decomposition and Singular Value Decomposition
Covariance Matrix Applications
Text Databases Text Types
Latent Semantic Analysis
Biointelligence Laboratory, Seoul National University
Dimensionality Reduction PCA -- SVD
Comparison of information retrieval techniques: Latent semantic indexing (LSI) and Concept indexing (CI) Jasminka Dobša Faculty of organization and informatics,
What is missing? Reasons that ideal effectiveness hard to achieve: 1. Users’ inability to describe queries precisely. 2. Document representation loses.
Hinrich Schütze and Christina Lioma
DIMENSIONALITY REDUCTION BY RANDOM PROJECTION AND LATENT SEMANTIC INDEXING Jessica Lin and Dimitrios Gunopulos Ângelo Cardoso IST/UTL December
Principal Component Analysis
1 Latent Semantic Indexing Jieping Ye Department of Computer Science & Engineering Arizona State University
Dimensional reduction, PCA
Vector Space Information Retrieval Using Concept Projection Presented by Zhiguo Li
Dimension reduction : PCA and Clustering Slides by Agnieszka Juncker and Chris Workman.
Singular Value Decomposition in Text Mining Ram Akella University of California Berkeley Silicon Valley Center/SC Lecture 4b February 9, 2011.
TFIDF-space  An obvious way to combine TF-IDF: the coordinate of document in axis is given by  General form of consists of three parts: Local weight.
1/ 30. Problems for classical IR models Introduction & Background(LSI,SVD,..etc) Example Standard query method Analysis standard query method Seeking.
IR Models: Latent Semantic Analysis. IR Model Taxonomy Non-Overlapping Lists Proximal Nodes Structured Models U s e r T a s k Set Theoretic Fuzzy Extended.
The Terms that You Have to Know! Basis, Linear independent, Orthogonal Column space, Row space, Rank Linear combination Linear transformation Inner product.
SLIDE 1IS 240 – Spring 2007 Prof. Ray Larson University of California, Berkeley School of Information Tuesday and Thursday 10:30 am - 12:00.
Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 18: Latent Semantic Indexing 1.
Paper Summary of: Modelling Retrieval and Navigation in Context by: Massimo Melucci Ahmed A. AlNazer May 2008 ICS-542: Multimedia Computing – 072.
Latent Semantic Analysis (LSA). Introduction to LSA Learning Model Uses Singular Value Decomposition (SVD) to simulate human learning of word and passage.
E.G.M. Petrakis, Dimensionality Reduction: Given N vectors in n dims, find the k most important axes to project them; k is user defined (k < n). Applications:
DATA MINING LECTURE 7 Dimensionality Reduction PCA – SVD
1 CS 430 / INFO 430 Information Retrieval Lecture 9 Latent Semantic Indexing.
Homework Define a loss function that compares two matrices (say mean square error) b = svd(bellcore) b2 = b$u[,1:2] %*% diag(b$d[1:2]) %*% t(b$v[,1:2])
Summarized by Soo-Jin Kim
Chapter 2 Dimensionality Reduction. Linear Methods
1 Vector Space Model Rong Jin. 2 Basic Issues in A Retrieval Model How to represent text objects What similarity function should be used? How to refine.
Latent Semantic Indexing Debapriyo Majumdar Information Retrieval – Spring 2015 Indian Statistical Institute Kolkata.
Automated Essay Grading Resources: Introduction to Information Retrieval, Manning, Raghavan, Schutze (Chapter 06 and 18) Automated Essay Scoring with e-rater.
Latent Semantic Analysis Hongning Wang Recap: vector space model Represent both doc and query by concept vectors – Each concept defines one dimension.
CpSc 881: Information Retrieval. 2 Recall: Term-document matrix This matrix is the basis for computing the similarity between documents and queries. Today:
Latent Semantic Indexing: A probabilistic Analysis Christos Papadimitriou Prabhakar Raghavan, Hisao Tamaki, Santosh Vempala.
Text Categorization Moshe Koppel Lecture 12:Latent Semantic Indexing Adapted from slides by Prabhaker Raghavan, Chris Manning and TK Prasad.
June 5, 2006University of Trento1 Latent Semantic Indexing for the Routing Problem Doctorate course “Web Information Retrieval” PhD Student Irina Veredina.
ECE 8443 – Pattern Recognition LECTURE 10: HETEROSCEDASTIC LINEAR DISCRIMINANT ANALYSIS AND INDEPENDENT COMPONENT ANALYSIS Objectives: Generalization of.
SINGULAR VALUE DECOMPOSITION (SVD)
Gene Clustering by Latent Semantic Indexing of MEDLINE Abstracts Ramin Homayouni, Kevin Heinrich, Lai Wei, and Michael W. Berry University of Tennessee.
1 CSC 594 Topics in AI – Text Mining and Analytics Fall 2015/16 6. Dimensionality Reduction.
Latent Semantic Indexing and Probabilistic (Bayesian) Information Retrieval.
1 Latent Concepts and the Number Orthogonal Factors in Latent Semantic Analysis Georges Dupret
V. Clustering, Artificial Intelligence Lab, Seunghee Lee. Text: Text mining, Pages 82-93.
Web Search and Text Mining Lecture 5. Outline Review of VSM More on LSI through SVD Term relatedness Probabilistic LSI.
Principal Component Analysis and Linear Discriminant Analysis for Feature Reduction Jieping Ye Department of Computer Science and Engineering Arizona State.
Learning Kernel Classifiers 1. Introduction Summarized by In-Hee Lee.
Matrix Factorization & Singular Value Decomposition Bamshad Mobasher DePaul University Bamshad Mobasher DePaul University.
Information Bottleneck Method & Double Clustering + α Summarized by Byoung Hee, Kim.
Unsupervised Learning II Feature Extraction
A Self-organizing Semantic Map for Information Retrieval Xia Lin, Dagobert Soergel, Gary Marchionini presented by Yi-Ting.
Principal Component Analysis (PCA)
PREDICT 422: Practical Machine Learning
Document Clustering Based on Non-negative Matrix Factorization
LECTURE 11: Advanced Discriminant Analysis
School of Computer Science & Engineering
Principal Component Analysis (PCA)
Machine Learning Dimensionality Reduction
Principal Component Analysis
PCA vs ICA vs LDA.
Aapo Hyvärinen and Ella Bingham
Dimension reduction : PCA and Clustering
A Fast Fixed-Point Algorithm for Independent Component Analysis
CS 430: Information Discovery
Ch 3. Linear Models for Regression (2/2) Pattern Recognition and Machine Learning, C. M. Bishop, Previously summarized by Yung-Kyun Noh Updated.
Latent semantic space: Iterative scaling improves precision of inter-document similarity measurement Rie Kubota Ando. Latent semantic space: Iterative.
Latent Semantic Analysis
Presentation transcript:

Restructuring Sparse High Dimensional Data for Effective Retrieval
C. L. Isbell and P. Viola, NIPS vol. 11
Summarized by Seong-woo Chung, 2001.12.21

Introduction
- The task in text retrieval is to find the subset of a collection of documents relevant to a user's information request, usually expressed as a set of words.
- The paper proposes a topic-based model for the generation of words in documents.
- Each document is generated by the interaction of a set of independent hidden random variables called topics.
- A set of linear operators is constructed to extract the independent topic structure of documents.
- Several existing methods are compared with the derived method, which is unsupervised.

Introduction (Continued)
- When a topic is active, it causes words to appear in documents.
- The final set of observed words results from a linear combination of topics.
- Individual words are only weak indicators of the underlying topics.
- The task is to discover from data those collections of words that best predict the (unknown) underlying topics.
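To make this generative view concrete, here is a minimal numpy sketch (not from the paper; the dimensions, sparsity levels, and exponential weights are invented for illustration) in which each document's word counts arise as a linear combination of a few active topics:

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, n_topics, n_docs = 1000, 20, 50

# Each topic puts weight on a small set of characteristic words
# (5% of the vocabulary here, an arbitrary choice).
topics = rng.exponential(1.0, size=(n_words, n_topics))
topics *= rng.random((n_words, n_topics)) < 0.05

# Each document activates only a few of the independent topics.
activations = rng.exponential(1.0, size=(n_topics, n_docs))
activations *= rng.random((n_topics, n_docs)) < 0.15

# Observed word-document matrix: a linear combination of active topics.
D = topics @ activations
print(f"fraction of non-zero entries: {np.count_nonzero(D) / D.size:.1%}")
```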

Vector Space Model
- In the VSM, the similarity between two documents is their inner product $d_i^T d_j$ (and between a document and a query, $d_i^T q$).
- While the word-document matrix has a very large number of potential entries, it is sparsely populated (in practice, non-zero elements make up about 2% of the total).
- Any text retrieval system must overcome the fundamental difficulty that the presence or absence of a word is insufficient to determine relevance:
  - Synonymy, e.g. car and automobile (causes false negatives)
  - Polysemy, e.g. "apple" is both a fruit and a computer company (causes false positives)
- Solution: LSI?
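For illustration, here is a minimal sketch of VSM retrieval (the tiny vocabulary and documents are invented): build a term-document matrix and rank documents by their inner product with the query. It also demonstrates the synonymy problem mentioned above:

```python
import numpy as np

vocab = ["car", "automobile", "apple", "fruit", "computer"]
# Columns are documents, rows are word counts (a tiny dense stand-in
# for what would be a large sparse matrix in practice).
D = np.array([
    [2, 0, 0],   # car
    [0, 1, 0],   # automobile
    [0, 0, 3],   # apple
    [0, 0, 1],   # fruit
    [1, 0, 2],   # computer
])

q = np.array([1, 0, 0, 0, 0])  # query: "car"
scores = D.T @ q               # inner-product similarity d_i^T q
print(scores)  # the "automobile" document scores 0: a false negative
```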

Latent Semantic Indexing
- LSI constructs a smaller document matrix that retains only the most important information from the original, using the Singular Value Decomposition (SVD).
- The SVD of a matrix D is $U S V^T$, where U and V have orthonormal columns and S is diagonal.
- U contains the eigenvectors of $D D^T$; the diagonal of S holds the corresponding singular values (the square roots of the eigenvalues of $D D^T$).

Latent Semantic Indexing (Continued)
- LSI represents documents as linear combinations of orthogonal features.
- Each document is projected into a lower-dimensional space, $\hat{d}_i = U_k^T d_i$, where $U_k$ and $S_k$ contain only the largest k singular values and the corresponding eigenvectors (a smaller representation that captures the most variation).
- Queries are also projected into this space, so the relevance of documents to a query is $\hat{D}^T \hat{q} = D^T U_k U_k^T q$.
- It is hoped that these features represent meaningful underlying "topics" present in the collection.
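A short numpy sketch of the LSI pipeline described above (a minimal illustration under assumed matrix sizes and an arbitrary k, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(1)
D = rng.poisson(0.02, size=(1000, 200)).astype(float)  # sparse word-doc counts

# Truncated SVD: keep only the k largest singular values/vectors.
U, s, Vt = np.linalg.svd(D, full_matrices=False)
k = 50
Uk = U[:, :k]

# Project documents and a query into the k-dimensional latent space.
D_hat = Uk.T @ D            # k x n_docs
q = rng.poisson(0.02, size=1000).astype(float)
q_hat = Uk.T @ q

scores = D_hat.T @ q_hat    # equivalent to D^T Uk Uk^T q
print(scores.argsort()[::-1][:5])  # top-5 documents for the query
```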

Optimal Projections for Retrieval
- We could attempt a supervised approach, searching for a matrix P such that $D^T P P^T q$ takes large values for the documents in D that are known to be relevant to a particular query q.
- Given a collection of documents D and queries Q, for each query we are told which documents are relevant.
- This information is used to construct an optimal P such that $D^T P P^T Q = R$, where $R_{ij}$ equals 1 if document i is relevant to query j, and 0 otherwise.
- The optimal axes are then compared with LSI's projection axes.
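One possible way to realize this on toy data (an assumed construction for illustration, not the paper's method): since $D^T M Q = R$ is linear in $M = P P^T$, a small M can be fit by vectorized least squares, and a P recovered from the positive eigen-directions of M:

```python
import numpy as np

rng = np.random.default_rng(2)
n_words, n_docs, n_queries = 40, 15, 10
D = rng.random((n_words, n_docs))
Q = rng.random((n_words, n_queries))
R = (rng.random((n_docs, n_queries)) < 0.2).astype(float)  # relevance labels

# vec(D^T M Q) = (Q^T kron D^T) vec(M); solve for M by least squares.
A = np.kron(Q.T, D.T)                      # (n_docs*n_queries) x (n_words^2)
m, *_ = np.linalg.lstsq(A, R.reshape(-1, order="F"), rcond=None)
M = m.reshape((n_words, n_words), order="F")

# Recover P from the symmetric part of M, keeping positive eigen-directions.
w, V = np.linalg.eigh((M + M.T) / 2)
keep = w > 1e-8
P = V[:, keep] * np.sqrt(w[keep])

# How well does D^T P P^T Q reproduce the relevance matrix R?
print(np.abs(D.T @ P @ P.T @ Q - R).mean())
```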

Optimal Projections for Retrieval (Continued)
[Figure: comparison of the optimal projection axes (high kurtosis, supervised) with LSI's axes (low kurtosis, unsupervised)]

Independent Components of Documents
- Project individual words onto the ICA space (this amounts to projecting the identity matrix onto that space).
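A brief sketch of this projection using scikit-learn's FastICA as a stand-in ICA implementation (an assumption; the paper predates this library and its exact ICA algorithm may differ):

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(3)
D = rng.poisson(0.05, size=(500, 300)).astype(float)  # word-document matrix

# Learn independent components of the document collection,
# treating each document as an observation over words.
ica = FastICA(n_components=20, random_state=0)
ica.fit(D.T)

# Projecting the identity matrix projects each individual word
# into the ICA space: row w gives word w's topic coordinates.
word_coords = ica.transform(np.eye(D.shape[0]))
print(word_coords.shape)  # (n_words, n_components)
```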

Topic Centered Representations
- The three groups of documents are used to drive the discovery of two sets of words.
- One set distinguishes the relevant documents from documents in general, a form of global clustering: $f(G_k - B_k)$.
- The other set distinguishes the weakly-related documents from the relevant documents: $-f(M_k - G_k)$.
- This leaves only a set of closely related documents.
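A hedged sketch of how such operators might be formed; the definitions of $G_k$, $B_k$, $M_k$ as neighborhood means and the choice of f as a positive-part threshold are assumptions for illustration, not specified in this summary:

```python
import numpy as np

def f(x):
    # Assumed thresholding nonlinearity: keep only positive differences.
    return np.maximum(x, 0.0)

rng = np.random.default_rng(4)
D = rng.random((200, 100))          # word-document matrix
k = 7                               # build an operator centered on document k

# Rank all documents by similarity to document k.
sims = D.T @ D[:, k]
order = sims.argsort()[::-1]
G_k = D[:, order[:10]].mean(axis=1)    # closely related documents
M_k = D[:, order[10:50]].mean(axis=1)  # weakly related documents
B_k = D.mean(axis=1)                   # documents in general

# Words that separate the relevant documents from the background,
# minus words that only mark the weakly-related neighborhood.
operator_k = f(G_k - B_k) - f(M_k - G_k)
scores = D.T @ operator_k            # score all documents against topic k
print(scores.argsort()[::-1][:5])
```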

Experiments
Methods compared:
- Baseline
- LSI
- Documents as Clusters
- Relevant Documents as Clusters
- ICA
- Topic Clustering

Experiments (Continued)
[Results figure not reproduced in the transcript]

Discussion
- Described the typical dimensionality reduction techniques used in text retrieval and showed that these techniques make strong assumptions about the form of the projection axes.
- Characterized another set of assumptions and derived an algorithm that enjoys significant computational and space advantages.