The Vector Space Model …and applications in Information Retrieval

Part 1: Introduction to the Vector Space Model

Overview
– The Vector Space Model (VSM) is a way of representing documents through the words that they contain.
– It is a standard technique in Information Retrieval.
– The VSM allows decisions to be made about which documents are similar to each other and to keyword queries.

How it works: Overview
– Each document is broken down into a word frequency table.
– The tables are called vectors and can be stored as arrays.
– A vocabulary is built from all the words in all documents in the system.
– Each document is represented as a vector against the vocabulary.

Example
Document A – "A dog and a cat."

  a   dog  and  cat
  2    1    1    1

Document B – "A frog."

  a   frog
  1    1

Example, continued
– The vocabulary contains all words used: a, dog, and, cat, frog
– The vocabulary needs to be sorted: a, and, cat, dog, frog

Example, continued
Vocabulary order: (a, and, cat, dog, frog)
– Document A: "A dog and a cat." Vector: (2, 1, 1, 1, 0)
– Document B: "A frog." Vector: (1, 0, 0, 0, 1)
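A minimal Python sketch of this vectorization (function and variable names are illustrative, not from the slides):

```python
from collections import Counter

def build_vocabulary(documents):
    """Collect every word used in any document, sorted alphabetically."""
    vocab = set()
    for doc in documents:
        vocab.update(doc.lower().replace(".", "").split())
    return sorted(vocab)

def to_vector(text, vocab):
    """Represent a piece of text as a frequency vector against the vocabulary."""
    counts = Counter(text.lower().replace(".", "").split())
    return [counts[word] for word in vocab]

docs = ["A dog and a cat.", "A frog."]
vocab = build_vocabulary(docs)               # ['a', 'and', 'cat', 'dog', 'frog']
print([to_vector(d, vocab) for d in docs])   # [[2, 1, 1, 1, 0], [1, 0, 0, 0, 1]]
```

The same to_vector function serves for queries, as the next slide shows.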

Queries
Queries can be represented as vectors in the same way as documents:
– Dog = (0, 0, 0, 1, 0)
– Frog = (0, 0, 0, 0, 1)
– Dog and frog = (0, 1, 0, 1, 1)

Similarity measures
– There are many different ways to measure how similar two documents are, or how similar a document is to a query.
– The cosine measure is a very common similarity measure.
– Using a similarity measure, a set of documents can be compared to a query and the most similar document returned.

The cosine measure
For two vectors d and d′, the cosine similarity between d and d′ is given by:

  sim(d, d′) = (d · d′) / (|d| × |d′|)

– Here d · d′ is the dot (inner) product of d and d′, calculated by multiplying corresponding frequencies together and summing the results.
– The cosine measure computes the cosine of the angle between the vectors in a high-dimensional space: vectors pointing the same way score 1, vectors with no terms in common score 0.

Example
Let d = (2, 1, 1, 1, 0) and d′ = (0, 0, 0, 1, 0)
– d · d′ = 2×0 + 1×0 + 1×0 + 1×1 + 0×0 = 1
– |d| = √(2² + 1² + 1² + 1² + 0²) = √7 ≈ 2.646
– |d′| = √(0² + 0² + 0² + 1² + 0²) = √1 = 1
– Similarity = 1 / (2.646 × 1) ≈ 0.378

Let d = (1, 0, 0, 0, 1) and d′ = (0, 0, 0, 1, 0)
– Similarity = 0 (the vectors share no non-zero terms)
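A short sketch of the cosine measure that reproduces these numbers (names are illustrative):

```python
import math

def cosine_similarity(d1, d2):
    """Cosine of the angle between two frequency vectors."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm1 * norm2)

print(cosine_similarity([2, 1, 1, 1, 0], [0, 0, 0, 1, 0]))  # 0.3779...
print(cosine_similarity([1, 0, 0, 0, 1], [0, 0, 0, 1, 0]))  # 0.0
```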

Ranking documents
– A user enters a query.
– The query is compared to all documents using a similarity measure.
– The user is shown the documents in decreasing order of similarity to the query.
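Putting the earlier sketches together, a minimal ranking loop might look like this (reusing the illustrative to_vector and cosine_similarity functions defined above):

```python
def rank(query, documents, vocab):
    """Return (score, document) pairs in decreasing order of similarity."""
    q_vec = to_vector(query, vocab)
    scored = [(cosine_similarity(q_vec, to_vector(d, vocab)), d)
              for d in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

for score, doc in rank("dog", docs, vocab):
    print(f"{score:.3f}  {doc}")
# 0.378  A dog and a cat.
# 0.000  A frog.
```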

VSM variations

Vocabulary
Stopword lists:
– Commonly occurring words are unlikely to give useful information and may be removed from the vocabulary to speed processing.
– Stopword lists contain frequent words to be excluded.
– Stopword lists need to be used carefully, e.g. "to be or not to be" (see the sketch below).
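A toy illustration of the caution above, with an assumed (not authoritative) stopword list:

```python
# A tiny illustrative stopword list -- real systems use curated lists.
STOPWORDS = {"a", "an", "and", "or", "not", "the", "to", "be"}

def remove_stopwords(words):
    """Drop common words that carry little retrieval value."""
    return [w for w in words if w not in STOPWORDS]

print(remove_stopwords("a dog and a cat".split()))     # ['dog', 'cat']
print(remove_stopwords("to be or not to be".split()))  # [] -- the query vanishes entirely
```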

Term weighting
– Not all words are equally useful.
– A word is most likely to be highly relevant to document A if it is:
  – Infrequent in other documents
  – Frequent in document A
– The cosine measure needs to be modified to reflect this.

Normalised term frequency (tf)
– A normalised measure of the importance of a word to a document is its frequency divided by the maximum frequency of any term in the document.
– This is known as the tf factor.
– Document A: raw frequency vector: (2, 1, 1, 1, 0); tf vector: (1, 0.5, 0.5, 0.5, 0)
– This stops large documents from scoring higher simply because they repeat terms more often.
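The normalisation in a few lines (illustrative names):

```python
def normalised_tf(vector):
    """Divide each raw frequency by the maximum frequency in the document."""
    max_freq = max(vector)
    return [f / max_freq if max_freq else 0.0 for f in vector]

print(normalised_tf([2, 1, 1, 1, 0]))  # [1.0, 0.5, 0.5, 0.5, 0.0]
```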

Inverse document frequency (idf)
– A calculation designed to make rare words more important than common words.
– The idf of word i is given by:

  idf_i = log(N / n_i)

– where N is the number of documents and n_i is the number of documents that contain word i.
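A sketch of the idf computation on the two-document example (assuming a natural logarithm; some variants use log base 2 or add smoothing):

```python
import math

def idf(term_index, doc_vectors):
    """idf_i = log(N / n_i): rare terms score high, ubiquitous terms score 0."""
    n_i = sum(1 for vec in doc_vectors if vec[term_index] > 0)
    return math.log(len(doc_vectors) / n_i) if n_i else 0.0

vecs = [[2, 1, 1, 1, 0], [1, 0, 0, 0, 1]]
print(idf(0, vecs))  # 'a' is in both documents:  log(2/2) = 0.0
print(idf(3, vecs))  # 'dog' is in one document:  log(2/1) = 0.693...
```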

tf-idf
– The tf-idf weighting scheme multiplies the weight of each word in each document by its tf factor and its idf factor.
– Different weighting schemes are usually used for query vectors.
– Different variants of tf-idf are also in use (see the sketch below).
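Combining the earlier sketches, one common tf-idf variant applied to the running example (illustrative, reusing the normalised_tf and idf functions from above):

```python
def tf_idf(doc_vectors):
    """Weight each normalised tf value by the idf of its term."""
    return [[tf * idf(i, doc_vectors)
             for i, tf in enumerate(normalised_tf(vec))]
            for vec in doc_vectors]

for row in tf_idf([[2, 1, 1, 1, 0], [1, 0, 0, 0, 1]]):
    print([round(w, 3) for w in row])
# [0.0, 0.347, 0.347, 0.347, 0.0]   <- 'a' is weighted down to zero
# [0.0, 0.0, 0.0, 0.0, 0.693]
```

Ranking then proceeds exactly as before, but with tf-idf vectors in place of raw frequency vectors.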