Ranked Retrieval INST 734 Module 3 Doug Oard

Agenda Ranked retrieval ➤ Similarity-based ranking Probability-based ranking

What’s a Model? Model –A simplification that describes something complex –A particular way of “looking at things” Computational model –A simplification of reality that facilitates computation

Similarity-Based Queries Treat the query as if it were a document –Create a query bag-of-words Find the similarity of each document –Using the coordination measure, for example Rank order the documents by similarity –Most similar to the query first Surprisingly, this works pretty well! –Especially for very short queries
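A minimal sketch of this idea in Python, assuming whitespace tokenization and the coordination measure (count of shared query terms); the function names and example documents are illustrative, not from the slides.

```python
def coordination_rank(query, documents):
    """Rank documents by how many distinct query terms they contain."""
    query_bag = set(query.lower().split())                 # query as a bag of words
    scored = []
    for doc_id, text in documents.items():
        doc_bag = set(text.lower().split())                # document as a bag of words
        scored.append((doc_id, len(query_bag & doc_bag)))  # shared terms = coordination
    # Most similar to the query first
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

docs = {
    1: "the rabbit ate the contaminated lettuce",
    2: "retrieval of contaminated fallout reports",
    3: "siberia is interesting",
}
print(coordination_rank("contaminated retrieval", docs))
# [(2, 2), (1, 1), (3, 0)]
```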

Counting Terms Terms tell us about documents –If “rabbit” appears a lot, it may be about rabbits Documents tell us about terms –“the” is in every document -- not discriminating Documents are most likely described well by rare terms that occur in them frequently –Higher “term frequency” is stronger evidence –Low “document frequency” makes it stronger still

A Partial Solution: TF*IDF High TF is evidence of meaning Low DF is evidence of term importance –Equivalently high “IDF” Multiply them to get a “term weight” Add up the weights for each query term
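A minimal TF*IDF sketch, assuming raw term frequency and the common log(N/df) form of IDF; the slides do not fix a particular weighting variant, so treat the exact formulas here as illustrative.

```python
import math

def tfidf_rank(query, documents):
    """Score each document by summing TF*IDF weights over the query terms."""
    n_docs = len(documents)
    tokenized = {doc_id: text.lower().split() for doc_id, text in documents.items()}

    # Document frequency: in how many documents does each term appear?
    df = {}
    for terms in tokenized.values():
        for term in set(terms):
            df[term] = df.get(term, 0) + 1

    scores = {}
    for doc_id, terms in tokenized.items():
        score = 0.0
        for term in query.lower().split():
            tf = terms.count(term)                    # term frequency in this document
            if tf and df.get(term):
                idf = math.log(n_docs / df[term])     # rare terms get higher weight
                score += tf * idf                     # term weight = TF * IDF
        scores[doc_id] = score
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
```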

TF*IDF Example (table not reproduced: TF*IDF term weights for nuclear, fallout, siberia, contaminated, interesting, complicated, information, and retrieval across four documents) Query: contaminated retrieval Result: documents ranked 2, 3, 1, 4

The Document Length Effect Document lengths vary in many collections Long documents have an unfair advantage –They use a lot of terms So they get more matches than short documents –They use the same terms repeatedly So they have much higher term frequencies Two strategies –Adjust term frequencies for document length –Divide the documents into equal “passages”

Passage Retrieval Break long documents up somehow –On chapter or section boundaries –On topic boundaries (“text tiling”) –Overlapping 300-word passages (“sliding window”) Use the best passage’s rank as the document’s rank
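A minimal sliding-window sketch in Python. The 300-word window and 150-word overlap follow the slide’s example; the scoring function is left abstract and the helper names are illustrative.

```python
def sliding_passages(text, size=300, step=150):
    """Yield overlapping passages of roughly `size` words each."""
    words = text.split()
    if len(words) <= size:
        yield " ".join(words)          # short document: one passage covers it
        return
    for start in range(0, len(words) - step, step):
        yield " ".join(words[start:start + size])

def best_passage_score(query, text, score_fn):
    """Score every passage and let the best passage stand in for the document."""
    return max(score_fn(query, passage) for passage in sliding_passages(text))
```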

“Cosine” Normalization Compute the length of each document vector –Multiply each weight by itself –Add all the resulting values –Take the square root of that sum Divide each weight by that length
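A minimal sketch of the length computation and normalization just described, applied to a {term: weight} vector.

```python
import math

def cosine_normalize(weights):
    """Divide each weight by the vector's Euclidean length."""
    # Multiply each weight by itself, add the results, take the square root
    length = math.sqrt(sum(w * w for w in weights.values()))
    if length == 0.0:
        return dict(weights)           # all-zero vector: nothing to normalize
    # Divide each weight by that length
    return {term: w / length for term, w in weights.items()}

print(cosine_normalize({"contaminated": 2.0, "retrieval": 1.0}))
# {'contaminated': 0.894..., 'retrieval': 0.447...}  (length was sqrt(5))
```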

Cosine Normalization Example (table not reproduced: the same term weights as the TF*IDF example, plus a Length column, with each weight divided by its document’s vector length) Query: contaminated retrieval Result: documents ranked 2, 4, 1, 3 (compare to 2, 3, 1, 4 without normalization)

Why Call It “Cosine”? (Figure: two document vectors d1 and d2 drawn from a common origin, with the angle θ between them)

Formally … with document vector $\vec{d}_j$ and query vector $\vec{q}$, the score is the inner product divided by the length normalization:
$$ \mathrm{sim}(d_j, q) \;=\; \frac{\vec{d}_j \cdot \vec{q}}{\lVert \vec{d}_j \rVert \, \lVert \vec{q} \rVert} \;=\; \frac{\sum_i w_{i,j}\, w_{i,q}}{\sqrt{\sum_i w_{i,j}^2}\;\sqrt{\sum_i w_{i,q}^2}} $$

Interpreting the Cosine Measure Think of the query and the document as vectors –Query normalization does not change the ranking –The square root does not change the ranking Similarity is the cosine of the angle between the two vectors –Small angle = very similar –Large angle = little similarity Passes some key sanity checks –Depends on the pattern of word use, but not on length –Every document is most similar to itself

“Okapi BM-25” Term Weights
$$ w_{i,j} \;=\; \underbrace{\frac{(k_1+1)\,\mathrm{tf}_{i,j}}{k_1\!\left[(1-b) + b\,\frac{\mathrm{len}_j}{\overline{\mathrm{len}}}\right] + \mathrm{tf}_{i,j}}}_{\text{TF component}} \;\times\; \underbrace{\log\frac{N - \mathrm{df}_i + 0.5}{\mathrm{df}_i + 0.5}}_{\text{IDF component}} $$
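A minimal sketch of the BM25 weight for one term in one document. The constants k1 = 1.2 and b = 0.75 are conventional defaults, and the “+1” inside the log follows a common non-negative IDF variant; neither comes from the slide.

```python
import math

def bm25_weight(tf, df, doc_len, avg_doc_len, n_docs, k1=1.2, b=0.75):
    """Okapi BM25 weight = saturating, length-adjusted TF times an IDF."""
    # TF component: grows with tf but saturates; longer-than-average documents
    # are pushed down via the (1 - b + b * len/avg_len) factor
    tf_component = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    # IDF component: rare terms (low df) get higher weight
    idf_component = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
    return tf_component * idf_component
```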

Summary Goal: find documents most similar to the query Compute normalized document term weights –Some combination of TF, DF, and Length Sum the weights for each query term –In linear algebra, this is an “inner product” operation

Agenda Ranked retrieval Similarity-based ranking ➤ Probability-based ranking