Presentation transcript:

Slide 1: Evaluation of Vector Space Models Obtained by Latent Semantic Indexing
Göteborg, 26 Jan 2004
Leif Grönqvist
- Växjö University (Mathematics and Systems Engineering)
- GSLT (Graduate School of Language Technology)
- Göteborg University (Department of Linguistics)

Slide 2: Outline of the talk
- Vector space models in IR (a short reminder from the last seminar)
  - The traditional model
  - Latent semantic indexing (LSI)
- Singular value decomposition (SVD)
- Evaluation
  - Why
  - How & data sources

Slide 3: The traditional vector model
- One dimension for each index term
- A document is a vector in a very high-dimensional space
- The similarity between a document d and a query q is the cosine of the angle between their vectors:
  sim(d, q) = (d · q) / (|d| |q|)
- This gives us a degree of similarity instead of the yes/no answer of basic keyword search
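A minimal sketch of this ranking-by-cosine, not taken from the talk itself: the term-document matrix, the terms, and the query are invented purely for illustration.

```python
import numpy as np

# Toy term-document matrix: rows = index terms, columns = documents.
# Terms and counts are made up for illustration only.
terms = ["tennis", "match", "election", "party"]
X = np.array([
    [3, 0, 0],   # tennis
    [2, 1, 0],   # match
    [0, 0, 4],   # election
    [0, 1, 2],   # party
], dtype=float)

def cosine(a, b):
    """Cosine similarity between two vectors (0 if either is all zeros)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# A query is represented in the same term space as the documents.
query = np.array([1, 1, 0, 0], dtype=float)   # "tennis match"

# Rank documents by cosine similarity to the query.
scores = [cosine(X[:, j], query) for j in range(X.shape[1])]
print(scores)   # documents sharing query terms score highest
```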

Slide 4: The traditional vector model, cont.
- Assumption used: all terms are unrelated
  - Could be fixed partially by using different weights for each term
- Still, we have far more dimensions than we want
  - How should we decide on the index terms?
  - The similarity between terms is always 0
  - Very similar documents may have similarity ≈ 0 if they:
    - use a different vocabulary
    - don't use the index terms
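The talk does not commit to a particular weighting scheme; as one standard choice, here is a log-tf times idf weighting sketched on the same toy matrix as above.

```python
import numpy as np

# Toy term-document matrix from the previous sketch.
X = np.array([
    [3, 0, 0],
    [2, 1, 0],
    [0, 0, 4],
    [0, 1, 2],
], dtype=float)

n_docs = X.shape[1]
df = np.count_nonzero(X, axis=1)        # document frequency per term
idf = np.log(n_docs / df)               # rare terms get higher weight
tf = np.log1p(X)                        # dampened term frequency
X_weighted = tf * idf[:, np.newaxis]    # weighted term-document matrix
print(X_weighted.round(2))
```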

Slide 5: Latent semantic indexing (LSI)
- Similar to factor analysis
- The number of dimensions can be chosen as we like
- We make some kind of projection from the vector space with one dimension per term down to the smaller dimensionality
- Each dimension is a mix of terms
  - It is usually impossible to attach a meaning to an individual dimension

Slide 6: LSI, cont.
- The similarity between vectors is the cosine, just as before
- It is now meaningful to calculate similarities between all terms and/or documents
- How can we do the projection? There are several ways:
  - Singular value decomposition (SVD)
  - Random indexing
  - Neural nets, factor analysis, etc.

Slide 7: Why SVD?
- I prefer SVD since:
- Michael W. Berry, 1992: "… This important result indicates that A_k is the best rank-k approximation (in a least squares sense) to the matrix A."
- Leif, 2003: What Berry says is that SVD gives the best projection from n to k dimensions, that is, the projection that preserves distances in the best possible way.
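Berry's statement is the Eckart–Young result; written in standard SVD notation (U, Σ, V here correspond to the T₀, S₀, D₀ introduced on the following slides), it can be summarised as:

```latex
% Truncated SVD: keep only the k largest singular values.
A = U \Sigma V^{T}, \qquad A_k = U_k \Sigma_k V_k^{T}
% Eckart–Young: A_k is the best rank-k approximation of A in the
% least-squares (Frobenius norm) sense.
\| A - A_k \|_F \;=\; \min_{\operatorname{rank}(B) \le k} \| A - B \|_F
```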

Slide 8: A small example input to SVD

Slide 9: What SVD gives us
- X = T₀ S₀ D₀ᵀ, where X, T₀, S₀, and D₀ are matrices
- T₀ holds the term vectors, S₀ is a diagonal matrix of singular values, and D₀ holds the document vectors
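A minimal numpy sketch of this decomposition, again on the toy matrix from the earlier sketches; numpy's U, s, Vᵀ play the roles of T₀, S₀, D₀ᵀ.

```python
import numpy as np

# Toy term-document matrix (terms x documents), invented for illustration.
X = np.array([
    [3, 0, 0],
    [2, 1, 0],
    [0, 0, 4],
    [0, 1, 2],
], dtype=float)

# Thin SVD: X = T0 @ S0 @ D0.T in the notation of the slide.
T0, s, D0t = np.linalg.svd(X, full_matrices=False)
S0 = np.diag(s)          # singular values on the diagonal, largest first
D0 = D0t.T

# Sanity check: the factors reproduce X up to floating point error.
assert np.allclose(T0 @ S0 @ D0.T, X)
print(s)                 # singular values in decreasing order
```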

Slide 10: Using the SVD
- The matrices make it easy to project term and document vectors into an m-dimensional space (m ≤ min(terms, docs)) using ordinary linear algebra
- We can select m simply by keeping the corresponding rows/columns of T₀, S₀, and D₀
- It is possible to calculate a new (approximated) X – it will still be a t × d matrix
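Continuing the sketch above: keep m dimensions, rebuild the approximated X, and fold a query into the reduced space. The fold-in formula q̂ = qᵀ T_m S_m⁻¹ is the standard LSI recipe; the query vector itself is made up.

```python
import numpy as np

# Same toy matrix and factors as in the previous sketch.
X = np.array([[3, 0, 0], [2, 1, 0], [0, 0, 4], [0, 1, 2]], dtype=float)
T0, s, D0t = np.linalg.svd(X, full_matrices=False)

m = 2                                    # chosen number of dimensions
Tm, Sm, Dm = T0[:, :m], np.diag(s[:m]), D0t[:m, :].T

# Rank-m approximation of X: still a (terms x docs) matrix.
X_m = Tm @ Sm @ Dm.T

# Fold a query (the made-up "tennis match" query) into the m-dim space.
q = np.array([1, 1, 0, 0], dtype=float)
q_hat = q @ Tm @ np.linalg.inv(Sm)       # query coordinates in LSI space

# Documents live in the rows of Dm; compare with cosine as before.
def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

print([cosine(q_hat, Dm[j]) for j in range(Dm.shape[0])])
```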

Slide 11: Some applications
- Automatic generation of a domain-specific thesaurus
- Keyword extraction from documents
- Finding sets of similar documents in a collection
- Finding documents related to a given document or a set of terms

Slide 12: An example based on newspaper articles

Nearest terms to "stefan edberg":
  edberg 0.918, cincinnatis 0.887, edbergs 0.883, världsfemman 0.883, stefans 0.883, tennisspelarna 0.863, stefan 0.861, turneringsseger 0.859, queensturneringen, växjöspelaren 0.852, grästurnering 0.847

Nearest terms to "bengt johansson":
  johansson 0.852, johanssons 0.704, bengt 0.678, centerledare 0.674, miljöcentern 0.667, landsbygdscentern 0.667, implikationer 0.645, ickesocialistisk 0.643, centerledaren 0.627, regeringsalternativet, vagare 0.616
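Lists like these are produced by ranking all term vectors by cosine against the vector of the query term. A toy sketch of the mechanism, reusing the truncated factors from above; real output like the slide's of course requires a large newspaper corpus and vocabulary, and scaling term vectors by the singular values is just one common convention.

```python
import numpy as np

# Toy vocabulary and matrix again; stand-ins for the newspaper data.
terms = ["tennis", "match", "election", "party"]
X = np.array([[3, 0, 0], [2, 1, 0], [0, 0, 4], [0, 1, 2]], dtype=float)
T0, s, _ = np.linalg.svd(X, full_matrices=False)
m = 2
term_vecs = T0[:, :m] * s[:m]            # rows: terms in the m-dim space

def nearest_terms(query_term, k=3):
    """Rank all terms by cosine similarity to the given term."""
    q = term_vecs[terms.index(query_term)]
    sims = term_vecs @ q / (np.linalg.norm(term_vecs, axis=1) * np.linalg.norm(q))
    order = np.argsort(-sims)
    return [(terms[i], round(float(sims[i]), 3)) for i in order[:k]]

print(nearest_terms("tennis"))           # e.g. [('tennis', 1.0), ('match', ...), ...]
```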

Slide 13: Evaluation
- We need evaluation metrics to be able to improve the model!
- How can we evaluate millions of vectors?
  - "Similar terms have vectors with a high cosine" – but what counts as similar?
- It seems impossible to evaluate the model objectively…
- Possible solution: look at specific applications! They may be much easier to evaluate

Slide 14: Applications using the model
- Vector models may be evaluated using:
  - a typical IR test suite of queries, documents, and relevance information
  - texts with lists of manually selected keywords (multiword units included)
  - selected terms in a thesaurus (with multiword units)
  - the Test of English as a Foreign Language (TOEFL), which tests the ability to select synonyms from a set of alternatives
- There is still subjectivity, but the more the vector model improves these applications, the better it is!
- Let's look in detail at the first application
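For the TOEFL-style evaluation, an item is answered by picking the alternative whose vector lies closest to the target word; here is a sketch under the assumption that word vectors from a trained space are available (the random vectors below are only placeholders).

```python
import numpy as np

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def toefl_item(target, alternatives, vectors):
    """Pick the alternative whose vector has the highest cosine to the target."""
    return max(alternatives, key=lambda w: cosine(vectors[target], vectors[w]))

# Placeholder vectors; in a real evaluation they come from the trained LSI space.
rng = np.random.default_rng(0)
vocab = ["enormous", "huge", "tiny", "red", "appropriate"]
vectors = {w: rng.normal(size=50) for w in vocab}

guess = toefl_item("enormous", ["huge", "tiny", "red", "appropriate"], vectors)
# Accuracy over a whole item set = fraction of items answered correctly.
```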

Slide 15: An IR testbed
- There are such testbeds for English, but Swedish poses problems of its own
- Swedish is quite different from English:
  - compounds written without spaces
  - "new" letters (åäö)
  - more complex morphology
  - a different set of stop words
  - …

Slide 16: A new Swedish test collection
- A group in Borås is building it:
  - Per Ahlgren
  - Johan Eklund
  - Leif Grönqvist
- It will contain:
  - documents
  - topics
  - relevance judgments

Slide 17: Document collection
- Newspaper articles from GP and HD
- Same year as the TT data in CLEF
- Articles totalling 40 MTokens
- Good to have more than one newspaper: same content, different authors (not always)
- About 10% of my newspaper article collection
- Copyright is a problem

Slide 18: Topics
- Borrowed from CLEF
- 52 of the 90 topics, but not the most difficult ones
- Examples:
  - "Filmer av bröderna Kaurismäki" (Films by the Kaurismäki brothers)
    - Description: "Sök efter information om filmer som regisserats av någon av de båda bröderna Aki och Mika Kaurismäki." (Search for information about films directed by either of the two brothers Aki and Mika Kaurismäki.)
    - Narrative: "Relevanta dokument namnger en eller flera titlar på filmer som regisserats av Aki eller Mika Kaurismäki." (Relevant documents name one or more titles of films directed by Aki or Mika Kaurismäki.)
  - "Finlands första EU-kommissionär" (Finland's first EU Commissioner)
    - Description: "Vem utsågs att vara den första EU-kommissionären för Finland i Europeiska unionen?" (Who was appointed to be Finland's first Commissioner in the European Union?)
    - Narrative: "Ange namnet på Finlands första EU-kommissionär. Relevanta dokument kan också nämna sakområdena för den nya kommissionärens uppdrag." (Give the name of Finland's first EU Commissioner. Relevant documents may also mention the policy areas of the new Commissioner's portfolio.)

Slide 19: Relevance judgments
- Only a subset of the documents is judged for each topic
  - Selected by earlier experiments
  - Similar approach to TREC and CLEF
- 100 documents from each of 5 strategies: 100 ≤ N ≤ 500
  - Important to include both relevant and irrelevant documents
- A scale of relevance proposed by Sormunen:
  - Irrelevant (0)
  - Marginally relevant (1)
  - Fairly relevant (2)
  - Highly relevant (3)
- Manually annotated

Slide 20: Relevance definitions

Id | Tag                 | Description
0  | Irrelevant          | The document does not contain any information about the topic.
1  | Marginally relevant | The document only points to the topic. It does not contain any other information, with respect to the topic, than the description of the topic.
2  | Fairly relevant     | The document contains more information than the description of the topic, but the presentation is not exhaustive. In the case of a topic with several aspects, only some of the aspects are covered by the document.
3  | Highly relevant     | The document discusses all of the themes of the topic. In the case of a topic with several aspects, all or most of the aspects are covered by the document.

Slide 21: Statistics
- Some difficult topics got very few relevant documents

Slide 22: Statistics per relevance category

Slide 23: Evaluation metrics
- Recall & precision are problematic:
  - Ranked lists – how much better is position 1 than positions 5 and 10? How long should the lists be?
  - Relevance scale – how much better is "highly relevant" than "fairly relevant"?
  - What about the unknown documents that were not judged?
- Idea: different user types need different evaluation metrics
- Too many unknowns lead to a need for more manual judgments…

Slide 24: The End!
- Thank you for listening
- Questions?