Comparing and Ranking Documents

Comparing and Ranking Documents
Once our search engine has retrieved a set of documents, we may want to:
Rank them by relevance
– Which are the best fit to my query?
– This involves determining what the query is about and how well the document answers it.
Compare them
– "Show me more like this."
– This involves determining what the document is about.

Determining Relevance by Keyword
The typical web query consists entirely of keywords.
Retrieval can be binary: a keyword is either present or absent.
A more sophisticated approach looks at degree of relatedness: how much does this document reflect what the query is about?
Simple strategies:
– How many times does the word occur in the document?
– How close is it to the head of the document?
– If there are multiple keywords, how close together are they?

Keywords for Relevance Ranking
Count: repetition is an indication of emphasis
– Very fast (usually available in the index)
– A reasonable heuristic
– Unduly influenced by document length
– Can be "stuffed" by web designers
Position: lead paragraphs summarize content
– Requires more computation
– Also a reasonable heuristic
– Less influenced by document length
– Harder to "stuff"; only a few keywords can appear near the beginning

Keywords for Relevance Ranking (cont.)
Proximity for multiple keywords
– Requires even more computation
– Obviously relevant only if the query has multiple keywords
– Effectiveness of the heuristic varies with the information need; typically either excellent or not very helpful at all
– Very hard to "stuff"
All keyword methods
– Are computationally simple and adequately fast
– Are effective heuristics
– Typically perform as well as in-depth natural language methods for standard search
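
A minimal sketch of these three heuristics, assuming a plain-text document and a list of query keywords; the function name, the tokenization, and the 500-character notion of the document "lead" are illustrative choices, not from the slides:

```python
import itertools
import re

def keyword_scores(text, keywords, lead_chars=500):
    """Illustrative count, position, and proximity heuristics for query keywords."""
    tokens = re.findall(r"\w+", text.lower())
    keywords = [k.lower() for k in keywords]

    # Count: repetition as a signal of emphasis (biased toward long documents).
    counts = {k: tokens.count(k) for k in keywords}

    # Position: does the keyword appear in the lead of the document?
    lead = set(re.findall(r"\w+", text[:lead_chars].lower()))
    in_lead = {k: k in lead for k in keywords}

    # Proximity: smallest token span containing at least one occurrence of every keyword.
    positions = [[i for i, t in enumerate(tokens) if t == k] for k in keywords]
    proximity = None
    if keywords and all(positions):
        proximity = min(max(combo) - min(combo)
                        for combo in itertools.product(*positions))

    return counts, in_lead, proximity
```

A lower proximity value means the query terms occur closer together; a real engine would precompute most of this in the index rather than rescanning the text at query time.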

Comparing Documents
"Find me more like this one" really means that we are using the document itself as a query.
This requires some conception of what the document is about overall, which depends on the context of the query.
We need to:
– Characterize the entire content of this document
– Discriminate between this document and others in the corpus

Comparing Documents (cont.)
Two very general approaches:
– statistical
– semantic
We will discuss semantic approaches more in text mining.
The statistical approach still focuses on keywords:
– To what extent does each term characterize this document?
– To what extent does each term discriminate this document from other documents?

Characterizing a Document: Term Frequency
A document can be treated as a sequence of words; each word characterizes that document to some extent.
Once stop words have been eliminated, the most frequent words tend to be what the document is about.
Therefore f_{k,d}, the number of occurrences of word k in document d, will be an important measure. It is also called the term frequency.
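
A small illustrative sketch of this measure; the tokenization and the stop list here are stand-ins, not the slides' definitions:

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "in", "to", "and", "is", "it"}  # illustrative stop list

def term_frequencies(document):
    """f_{k,d}: number of occurrences of each word k in document d, stop words removed."""
    tokens = re.findall(r"\w+", document.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)
```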

Characterizing a Document: Document Frequency
What makes this document distinct from others in the corpus?
The terms which discriminate best are not those which occur with high frequency across the whole corpus!
Therefore D_k, the number of documents in which word k occurs, will also be an important measure. It is also called the document frequency.
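
The corpus-level counterpart, sketched the same way; it reuses term_frequencies from the previous sketch:

```python
from collections import Counter

def document_frequencies(corpus):
    """D_k: number of documents in the corpus in which word k occurs at least once."""
    df = Counter()
    for document in corpus:
        df.update(set(term_frequencies(document)))  # each term counted once per document
    return df
```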

TF*IDF
This can all be summarized as follows. Words are the best discriminators when they:
– occur often in this document (term frequency)
– don't occur in a lot of other documents (document frequency)
One very common measure of the importance of a word to a document is TF*IDF: term frequency * inverse document frequency.
There are multiple formulas for actually computing this; the book gives Robertson and Jones. The underlying concept is the same in all of them.
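
For illustration, here is one common variant, tf * log(N/df), rather than the specific Robertson and Jones weighting the book presents; it builds on the two sketches above:

```python
import math

def tf_idf_weights(document, corpus):
    """TF*IDF weight for every term of one document, using the tf * log(N/df) variant."""
    tf = term_frequencies(document)      # term frequency within this document
    df = document_frequencies(corpus)    # document frequency across the corpus
    n_docs = len(corpus)
    # Assumes the document is drawn from the corpus, so every term has df >= 1.
    return {term: freq * math.log(n_docs / df[term]) for term, freq in tf.items()}
```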

Describing an Entire Document
So what is a document about?
TF*IDF: we can simply list keywords in order of their TF*IDF values.
The document is about all of them to some degree: it sits at some point in a vector space of meaning.
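
For example, sorting those weights yields a rough keyword summary of a document; the three-document corpus below is purely a stand-in:

```python
corpus = [
    "the dog bites the man",
    "the man bites the dog",
    "the cat chased the cat in the sun",
]
weights = tf_idf_weights(corpus[2], corpus)
print(sorted(weights, key=weights.get, reverse=True))
# ['cat', 'chased', 'sun'] -- 'cat' ranks first: frequent here, absent elsewhere
```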

Vector Space
Any corpus has a defined set of terms (the index).
These terms define a knowledge space.
Every document is somewhere in that knowledge space: it is or is not about each of those terms.
Consider each term as a vector (an axis of the space). Then:
– We have an n-dimensional vector space, where n is the number of terms (very large!)
– Each document is a point in that vector space
The document's position in this vector space can be treated as what the document is about.
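
A sketch of that mapping, assuming the index is simply the sorted set of all corpus terms and the coordinates are the TF*IDF weights from above (the function name is illustrative):

```python
def document_vector(document, corpus, index=None):
    """Place a document at a point in the n-dimensional term space of the corpus."""
    if index is None:
        index = sorted({term for doc in corpus for term in term_frequencies(doc)})
    weights = tf_idf_weights(document, corpus)
    return [weights.get(term, 0.0) for term in index]
```

In practice n is very large and almost every coordinate is zero, so real systems store only the nonzero entries.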

Similarity Between Documents
How similar are two documents?
Measures of association
– How much do the feature sets overlap?
– Dice coefficient: size of the intersection relative to the total number of terms, i.e. overlap modified for length
– Simple matching coefficient: also takes exclusions (terms absent from both documents) into account
Cosine similarity
– Similarity of the angle between the two document vectors
– Not sensitive to vector length
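
A minimal sketch of two of these measures: cosine similarity on the vectors built above, and the Dice coefficient on the raw term sets (function names are illustrative):

```python
import math

def cosine_similarity(v1, v2):
    """Similarity of the angle between two document vectors; insensitive to their length."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm if norm else 0.0

def dice_coefficient(doc1, doc2):
    """Overlap of the two term sets, normalized by their combined size."""
    terms1, terms2 = set(term_frequencies(doc1)), set(term_frequencies(doc2))
    if not terms1 or not terms2:
        return 0.0
    return 2 * len(terms1 & terms2) / (len(terms1) + len(terms2))
```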

Bag of Words
All of these techniques are what are known as bag of words approaches.
– Keywords are treated in isolation.
– The difference between "man bites dog" and "dog bites man" is non-existent.
In text mining we will discuss linguistic approaches which pay attention to semantics.
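
A one-line illustration of that limitation, using the term_frequencies sketch from earlier:

```python
print(term_frequencies("man bites dog") == term_frequencies("dog bites man"))  # True: same bag of words
```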