Ranking in Information Retrieval Systems
Prepared by: Mariam John, CSE 6392, 03/23/2006

Introduction
Basic assumptions:
- A document is a bag of words.
- A keyword query is a small set of words.
The three main ranking models in IR:
- Vector Space Model
- Probabilistic IR Model
- Language Model

Factors that impact Ranking in IR
- Relative frequency of occurrence of query keywords in a document.
- Proximity of query keywords within a document (this cannot be captured by a bag-of-words model; it needs a sequence-of-words model).
- Specificity/importance of query keywords. E.g., given the query (Microsoft, Corporation), Microsoft is more specific than Corporation.

Factors that impact Ranking in IR
- Links give documents more structure (unlike earlier IR systems, which conformed to the bag-of-words model). This is most relevant in the web context.
- Popularity of a page with respect to the relevance of the query.
  - Look at the popularity of pages and links to gauge people's opinion, instead of relying on other ad hoc methods.
  - All the previous methods are approximations of this notion of popularity.

Vector Space Model
- Used in the earliest IR systems; uses the relative frequency and specificity of query keywords to build a ranking function.
- Extends the idea of a vector space to language in order to express the simple heuristic of the bag-of-words model.
- The idea is to view the documents as a set of points in a very high-dimensional space.

Relative Frequency in the Vector Space Model
Consider a matrix where each column represents a distinct word in the English language and each row represents a document/page.
How do we represent which words a document contains?
A Boolean matrix is used to model the vector space, where:
- '0' means the word does not occur in that document
- '1' means the word does occur in that document
This information can be used to find the relative frequency of query keywords.
[Figure: Boolean term-document matrix; rows are documents 1..N, columns are words w1, w2, ..., w14]
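As a sketch of this representation (the tiny corpus and all variable names here are invented for illustration, not taken from the slides):

```python
# Build a Boolean term-document matrix: rows are documents, columns are words.
docs = [
    "microsoft corporation released a new product",
    "the corporation announced quarterly earnings",
    "a new search engine ranks documents",
]

# Vocabulary: one column per distinct word across the collection.
vocab = sorted({w for d in docs for w in d.split()})

# matrix[i][j] = 1 if word vocab[j] occurs in document i, else 0.
matrix = [[1 if w in set(d.split()) else 0 for w in vocab] for d in docs]

for row in matrix:
    print(row)
```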

Specificity in the Vector Space Model
E.g., consider the query (Microsoft, Corporation). We assume that Microsoft is more specific than Corporation.
The number of documents in which a word appears is inversely proportional to the specificity of the word. This is called 'Inverse Document Frequency (IDF)'.

Inverse Document Frequency
Let the document frequency df(w) be the number of documents in which 'w' appears.
- Inverse Document Frequency: IDF(w) = N / df(w), where N is the total number of documents.
This is too strong a definition, and we need to dampen it because some words can be more or less important than others.
- Using a logarithmic dampener: IDF(w) = log(N / df(w)).
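A minimal sketch of the dampened definition, assuming the same whitespace tokenization and invented corpus as above:

```python
import math

def idf(word, docs):
    """log(N / df(w)): N documents total, df(w) of them contain the word."""
    n = len(docs)
    df = sum(1 for d in docs if word in set(d.split()))
    return math.log(n / df) if df else 0.0

docs = [
    "microsoft corporation released a new product",
    "the corporation announced quarterly earnings",
    "a new search engine ranks documents",
]
print(idf("corporation", docs))  # common word  -> lower IDF
print(idf("microsoft", docs))    # rarer word   -> higher IDF
```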

Inverse Document Frequency
IDF is associated with every word, not with a (word, document) pair, so it cannot be captured by the matrix representation shown earlier (that matrix captures the term occurrences of query keywords within each document).
Use a separate vector, with one entry per word, to capture the Inverse Document Frequency.

Term Frequency
Term Frequency is defined for a (word, document) pair. Given a word 'w' and a document 'd':
- Term Frequency TF(w, d) = (number of times 'w' occurs in 'd') / (number of words in 'd').
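A direct translation of this definition into a sketch (whitespace tokenization is a simplifying assumption):

```python
def tf(word, doc):
    """Term frequency: occurrences of the word divided by document length."""
    words = doc.split()
    return words.count(word) / len(words)

print(tf("corporation", "microsoft corporation released a new product"))  # 1/6
```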

Scoring function
A mathematical function defining the score of a document with respect to a query.
Assume a query is a bunch of keywords; let Q = {w1, w2, ..., wk} be a query.
On what arguments/domain does the scoring function operate?
- Documents and the words in queries.

Scoring function
A vector in 2-dimensional space is represented as ax + by. Similarly, vectors can be represented in a high-dimensional space.
The coefficients 'a' and 'b' define the vector: if we move 'a' units along the x-axis and 'b' units along the y-axis, we reach the tip of the vector ax + by.
[Figure: the vector ax + by in the plane]

Vector
A vector is an ordered list of coefficients in which each item defines the strength of the vector in that particular dimension.
Each word represents a dimension in the vector space model. If there are 'n' words, this is an n-dimensional vector.
Each document is a vector in this high-dimensional vector space.

Vector Space Model
The coefficients of each document vector correspond to the term weights:
- Term Frequency – tells us how important a word is within a document.
- Inverse Document Frequency – tells us how many documents contain a word, and hence how specific/important it is across the collection.
Score(q, d) can be computed two ways: 1) use the angle between the two vectors; 2) use the distance between the tips of the two vectors.
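Combining the two factors, a common choice is to weight each coefficient by TF × IDF; a self-contained sketch under the same whitespace-tokenization assumption:

```python
import math

def tfidf_vector(doc, docs):
    """Sparse document vector mapping word -> TF(w, d) * IDF(w)."""
    words = doc.split()
    vec = {}
    for w in set(words):
        tf = words.count(w) / len(words)                  # term frequency
        df = sum(1 for d in docs if w in set(d.split()))  # document frequency
        vec[w] = tf * math.log(len(docs) / df)            # dampened IDF
    return vec

docs = [
    "microsoft corporation released a new product",
    "the corporation announced quarterly earnings",
    "a new search engine ranks documents",
]
print(tfidf_vector(docs[0], docs))
```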

Scoring function
If we treat a query as a small document, it too can be thought of as a vector.
Only the words in the query will have a '1' entry; all other words will have a '0' entry.
[Figure: query vector q alongside document vectors d1, d2, d3]

Geometric Interpretation of Score
Intuitively, if the score of a document is high, the document should be close to the query. There are two ways of finding this score:
- Find the angle between the query vector and the document vector.
- Find the distance between the tips of the document vector and the query vector, and pick the closest.
Do these two methods produce the same result? Maybe not.

Geometric Interpretation of Score
This is because the vectors have arbitrary lengths.
- SOLUTION: Make the vectors the same length.
Why do vectors have different lengths?
- If one document has 20 words and another has 2 words, their lengths will differ.
How can we make the vectors the same length?
- Using a unit sphere: rescale all vectors so that all of them sit on the surface of the unit sphere.

Normalizing Vectors
Once all the vectors sit on the surface of the unit sphere, sorting them by their angles from the query vector and sorting them by their distances from it give the same result.
So use the angle measurement after normalizing all vectors to the same length.
This entire process is called 'TF-IDF ranking with cosine similarity'.
[Figure: normalized vectors q, d1, d2, d3 on the unit sphere]
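A sketch of the normalization step, assuming documents are represented as sparse word-to-weight dicts as in the earlier sketch:

```python
import math

def normalize(vec):
    """Rescale a sparse vector to unit length so it sits on the unit sphere."""
    length = math.sqrt(sum(x * x for x in vec.values()))
    return {w: x / length for w, x in vec.items()} if length else vec
```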

Scoring function
A small angle is good (it means the document is very similar to the query).
A small angle means a large cosine. Hence, replace 'angle' with 'cosine' so that now large means good.
The cosine is convenient in scoring functions because the dot product of unit vectors equals the cosine of the angle between them.

Calculating the Score
Let q be the query vector and d a document vector. To find the score of a document:
- Normalize the vectors.
- Find the dot product of the document vector and the query vector.
[Figure: term-document matrix with document rows 1..N and word columns w1, w2, ..., wm, together with the query vector Q and the IDF vector]

Calculating the Score of a document
Normalizing vectors: divide each vector by its length, e.g., d / ||d||.
Finding the dot product of the vectors:
- For every word in the query, find the corresponding entry in the document vector 'd', take the product of the two entries, and sum them all up to get the score: score(q, d) = Σ_w q_w · d_w.
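Putting the two steps together, a sketch of the dot-product scoring over normalized sparse vectors (iterating only over the query's words, as described above; the toy vectors are invented and approximately unit length):

```python
def score(query_vec, doc_vec):
    """Dot product over the query's words only; for unit-length vectors
    this equals the cosine of the angle between query and document."""
    return sum(q_w * doc_vec.get(w, 0.0) for w, q_w in query_vec.items())

q = {"microsoft": 0.8, "corporation": 0.6}
d = {"microsoft": 0.5, "corporation": 0.1, "earnings": 0.86}
print(score(q, d))  # 0.8*0.5 + 0.6*0.1 = 0.46
```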

Conclusion
Given a collection of documents and an incoming query, how will you find the top-k documents?
- Preprocess: build the required data structures ahead of time.
Concerns with this approach:
- Do you run the query against all documents and words, especially when there are lots of sparse entries?
- How do you take the ranking function and speed up this implementation?
- How will you create these data structures? Do you create entries for all words, producing a sparse matrix, or not?
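One standard answer to the sparsity concern, though not spelled out in the slides, is an inverted index: score only documents that contain at least one query word, and keep the top k with a heap. A sketch:

```python
import heapq
from collections import defaultdict

def build_index(doc_vecs):
    """Inverted index: word -> list of (doc_id, weight) postings."""
    index = defaultdict(list)
    for doc_id, vec in enumerate(doc_vecs):
        for w, weight in vec.items():
            index[w].append((doc_id, weight))
    return index

def top_k(query_vec, index, k):
    """Accumulate scores only for documents that appear in some posting list."""
    scores = defaultdict(float)
    for w, q_weight in query_vec.items():
        for doc_id, d_weight in index.get(w, []):
            scores[doc_id] += q_weight * d_weight
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])
```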