Latent Semantic Indexing for the Routing Problem
Doctorate course "Web Information Retrieval"
PhD student: Irina Veredina
University of Trento, June 5, 2006


Index
1. The Problem
2. The Concept
3. Advantages and Drawbacks
4. LSI and VSM: comparison
5. LSI and Routing Problem
6. Conclusions

1. The Problem
The Vector Space Model (VSM), long a standard in IR, has a flaw: it ignores both the order of terms and the associations between them!

1. The Problem (cont.)
The document-by-term matrix is sufficient to represent the collection. But some of the information contained there can actually hinder the process of document retrieval!
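As a toy illustration of such a matrix (a hypothetical three-document collection, invented here and not taken from the slides), each entry simply counts how often a term occurs in a document:

```python
import numpy as np

# Hypothetical toy collection: 3 documents over a 7-term vocabulary.
docs = ["car engine repair", "car auto repair", "latent semantic indexing"]
vocab = sorted({w for d in docs for w in d.split()})

# Term-by-document matrix: A[i, j] = count of vocab[i] in docs[j].
A = np.array([[d.split().count(t) for d in docs] for t in vocab])
print(vocab)
print(A)
```

Rows for near-synonyms like "car" and "auto" stay distinct here; that independence of terms is exactly what LSI later tries to overcome.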

2. The Concept
The solution: a smaller, more tractable representation of terms and documents, retaining only the most important information from the original matrix, may actually improve both the quality and the speed of the retrieval system.

2. The Concept (cont.)
Latent Semantic Indexing (LSI) is a technique that projects queries and documents into a space with "latent" semantic dimensions.

2. The Concept (cont.)
LSI is a method for dimensionality reduction: a high-dimensional space is represented in a low-dimensional one (often two- or three-dimensional).

2. The Concept (cont.)
LSI is the application of a particular mathematical technique, Singular Value Decomposition (SVD), to a word-by-document matrix. SVD (and hence LSI) is a least-squares method.

2. The Concept (cont.)
How does SVD work? SVD takes the matrix A and represents it as A′ in a lower-dimensional space such that the "distance" between the two matrices is minimized:
Δ = ||A − A′||₂
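A minimal NumPy sketch of this rank-k approximation (a random toy matrix stands in for a real word-by-document matrix, and the Frobenius norm is used for the distance):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 4))                       # a small word-by-document matrix

# Full SVD: A = U @ diag(s) @ Vt, singular values s in decreasing order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                        # latent dimensions to keep
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # rank-k reconstruction

# By the Eckart-Young theorem, A_k minimizes ||A - A'|| over all rank-k
# matrices; the Frobenius error equals the norm of the discarded values.
err_k2 = np.linalg.norm(A - A_k)
err_k1 = np.linalg.norm(A - U[:, :1] @ np.diag(s[:1]) @ Vt[:1, :])
print(err_k2, err_k1)                        # keeping fewer dimensions only hurts
```

Truncating to k dimensions keeps the directions with the largest singular values, which is why the small amount of retained information is the "most important" part of the matrix.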

3. Advantages and Drawbacks
Advantages of LSI:
- Synonymy (the same underlying concept can be described using different terms)
- Polysemy (words that have more than one meaning)
- Dependence (performance can be improved by adding common phrases as search items)

3. Advantages and Drawbacks (cont.)
Drawbacks of LSI:
- Storage (the SVD representation is lower-dimensional but dense, so it can require more space than the sparse original matrix)
- Efficiency (with LSI the query must be compared to every document in the collection)
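That per-document comparison can be sketched as follows. The fold-in step q_k = Σ_k⁻¹ U_kᵀ q is the standard way to project a query into the latent space, but the toy matrix and query below are invented for illustration:

```python
import numpy as np

# Toy term-by-document matrix (4 terms x 3 docs) and a query; both invented.
A = np.array([[1, 1, 0],
              [1, 0, 0],
              [0, 1, 1],
              [0, 0, 1]], dtype=float)
q = np.array([1.0, 1.0, 0.0, 0.0])            # query using terms 0 and 1

k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
docs_k = Vt[:k, :].T                          # row j = document j in latent space
q_k = np.diag(1.0 / s[:k]) @ U[:, :k].T @ q   # fold the query into that space

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# The efficiency cost in question: one similarity per document in the collection.
scores = [cos(q_k, d) for d in docs_k]
best = int(np.argmax(scores))                 # document 0 matches the query exactly
```

Because the latent vectors are dense, there is no inverted-index shortcut: every column contributes a nonzero score, which is the storage and efficiency cost the slide describes.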

4. LSI and VSM: comparison
Two collections of data: MED and CISI.
- MED: LSI improves average precision from .45 to .51, with the largest benefits found at high recall.
- CISI: no significant difference between LSI and VSM is found.

5. LSI and Routing Problem
The routing problem is just a special case of the classification problem, since there are only two groups of documents: relevant and nonrelevant.

5. LSI and Routing Problem (cont.)
To test the performance of LSI when applied to the routing task, the technique of cross-validation is used.

5. LSI and Routing Problem (cont.)
Cross-validation: the strategy is to remove one document at a time from the collection, and then use the remaining documents to predict the relevance of the missing document. Precision and recall are used to evaluate the performance.
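A schematic of this leave-one-out protocol (the nearest-neighbour rule below is only a placeholder classifier of my own, not the method the slides evaluate):

```python
import numpy as np

def leave_one_out(X, labels, predict):
    """Remove each document in turn; predict its label from the rest."""
    hits = []
    for i in range(len(labels)):
        keep = [j for j in range(len(labels)) if j != i]
        hits.append(predict(X[keep], labels[keep], X[i]) == labels[i])
    return float(np.mean(hits))               # accuracy over the collection

# Placeholder rule (an assumption, not the slides' method): label of the
# nearest remaining document by cosine similarity.
def nearest(X_train, y_train, x):
    sims = X_train @ x / (np.linalg.norm(X_train, axis=1) * np.linalg.norm(x))
    return y_train[int(np.argmax(sims))]

# Two clearly separated toy groups: relevant (1) and nonrelevant (0).
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
y = np.array([1, 1, 0, 0])
accuracy = leave_one_out(X, y, nearest)
```

In the routing setting X would hold the documents' latent-space vectors and y their relevance judgments; precision and recall can be accumulated inside the same loop instead of plain accuracy.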

5. LSI and Routing Problem (cont.)
Results: LSI does not greatly improve performance over the vector space model for the routing problem, although the difference is measurable:

Evaluation method | VSM | LSI
Avg. precision    |  –  |  –
Avg. recall       |  –  |  –

5. LSI and Routing Problem (cont.)
To obtain a significant improvement in retrieval performance, LSI can be used in conjunction with statistical classification.

5. LSI and Routing Problem (cont.)
The general statistical classification problem: a population consists of two or more groups, and there exists a training sample for which the class of each element is known and a test sample for which the class is unknown. The goal is to produce a classification rule that predicts the class of the unknown elements.
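The train/test structure of this setup can be sketched with a deliberately simple rule (nearest class centroid); the TDA discriminant-analysis method the slides report is more sophisticated, and this stand-in only illustrates the shape of the problem:

```python
import numpy as np

# Hypothetical training sample: the class of each element is known.
train = np.array([[0.9, 0.1], [1.0, 0.0], [0.1, 0.9], [0.0, 1.0]])
classes = np.array([0, 0, 1, 1])

# Classification rule: assign a point to the nearest class centroid.
centroids = np.array([train[classes == c].mean(axis=0) for c in (0, 1)])

def classify(x):
    return int(np.argmin(np.linalg.norm(centroids - x, axis=1)))

# Test sample: class unknown until the rule is applied.
test = np.array([[0.8, 0.2], [0.2, 0.8]])
predicted = [classify(x) for x in test]
```

For routing, the two groups are the relevant and nonrelevant documents, and the rule is learned on their LSI representations rather than on raw term counts.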

5. LSI and Routing Problem (cont.)
Results: the performance is significantly improved:

Evaluation method | VSM | LSI | TDA
Avg. precision    |  –  |  –  |  –
Avg. recall       |  –  |  –  |  –

TDA: a method for text-based discriminant analysis.

6. Conclusions
LSI addresses the problem of term independence by re-expressing the term-document matrix in a new coordinate system, capturing the most significant components of the term-association structure.

Thank You!