Information Retrieval Basic Document Scoring

Similarity between binary vectors. A document is a binary vector: X, Y in {0,1}^V, one component per vocabulary term. Score: the overlap measure |X ∩ Y|. What's wrong with it?

Normalization. Dice coefficient (w.r.t. the average # of terms): Dice(X,Y) = 2|X ∩ Y| / (|X| + |Y|). Jaccard coefficient (w.r.t. the possible terms): Jaccard(X,Y) = |X ∩ Y| / |X ∪ Y|. Cosine measure: Cos(X,Y) = |X ∩ Y| / sqrt(|X| · |Y|). The cosine measure is less sensitive to document sizes. As distances, Jaccard satisfies the triangle inequality, Dice does not.
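The three measures above can be sketched directly on set representations of binary documents (a minimal illustration; the document contents are made up):

```python
# Dice, Jaccard and cosine overlap measures on binary documents,
# represented as sets of terms.
import math

def dice(x, y):
    # Overlap normalized by the average number of terms.
    return 2 * len(x & y) / (len(x) + len(y))

def jaccard(x, y):
    # Overlap normalized by the number of possible terms.
    return len(x & y) / len(x | y)

def cosine(x, y):
    # Overlap normalized by the geometric mean of the sizes.
    return len(x & y) / math.sqrt(len(x) * len(y))

d1 = {"paolo", "loves", "marylin"}
d2 = {"marylin", "loves", "cinema"}
print(dice(d1, d2), jaccard(d1, d2), cosine(d1, d2))
```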

What's wrong with this doc-similarity? Overlap matching doesn't consider: Term frequency in a document: if a document talks more of t, then t should be weighted more. Term scarcity in the collection: "of" is commoner than "baby bed". Length of documents: scores should be normalized.

A famous "weight": tf x idf. The weight of term t in doc d is tf_{t,d} × idf_t, where tf_{t,d} = (# occurrences of t in d) / |d| is the frequency of term t in doc d, and idf_t = log(n / n_t), with n_t = # docs containing term t and n = # docs in the indexed collection. Sometimes we smooth the absolute term frequency.
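A minimal sketch of this weighting, using exactly the definitions above, tf = #occ / |d| and idf = log(n / n_t) (the toy documents are made up):

```python
# tf-idf weight of a term in a document, over a tiny toy collection.
import math
from collections import Counter

docs = [["a", "rose", "is", "a", "rose"], ["a", "tulip", "is", "red"]]
n = len(docs)
n_t = Counter(t for d in docs for t in set(d))  # #docs containing term t

def tf_idf(t, d):
    tf = d.count(t) / len(d)         # relative frequency of t in d
    idf = math.log(n / n_t[t])       # scarcity of t in the collection
    return tf * idf

# "rose" occurs only in docs[0], so it gets a positive weight there;
# "is" occurs in every doc, so its idf = log(1) = 0.
print(tf_idf("rose", docs[0]), tf_idf("is", docs[0]))
```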

Term-document matrices (real valued). Note that the entries can be > 1! The bag-of-words view implies that the doc "Paolo loves Marylin" is indistinguishable from the doc "Marylin loves Paolo". A doc is a vector of tf × idf values, one component per term. Even with stemming, we may have 20,000+ dimensions.

A graphical example. Postulate: documents that are "close together" in the vector space talk about the same things. Euclidean distance is sensitive to vector length! [Figure: docs d1–d5 plotted in the space of terms t1–t3, with angle φ between two doc vectors.] Euclidean distance vs. cosine similarity (which does not satisfy the triangle inequality).

Doc-Doc similarity. sim = cosine of the angle between the doc vectors. Recall that ||u − v||² = ||u||² + ||v||² − 2·(u·v): if the vectors are normalized, the squared Euclidean distance equals 2 − 2·(u·v), so ranking by Euclidean distance and ranking by dot product (cosine) coincide.
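The identity above can be checked numerically: for unit-length vectors the squared distance reduces to 2 − 2·(u·v), so the two rankings agree (a small sketch with made-up vectors):

```python
# Check ||u - v||^2 = 2 - 2*(u.v) for unit-length vectors.
import math

def norm(u):
    # Scale a vector to unit length.
    s = math.sqrt(sum(x * x for x in u))
    return [x / s for x in u]

u = norm([1.0, 2.0, 0.0])
v = norm([2.0, 1.0, 1.0])
dot = sum(a * b for a, b in zip(u, v))           # cosine of the angle
dist2 = sum((a - b) ** 2 for a, b in zip(u, v))  # squared Euclidean dist
assert abs(dist2 - (2 - 2 * dot)) < 1e-9
print(dot, dist2)
```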

Query-Doc similarity. We choose the dot product as the measure of proximity between a document and the query (seen as a very short doc). Note: it is 0 if they are orthogonal, i.e. share no words. The MG book proposes a different weighting scheme.

MG book: |d| may be precomputed, and |q| may be considered a constant; hence, the normalization does not depend on the query.

Vector spaces and other operators. The vector space is OK for bag-of-words queries, and a clean metaphor for similar-document queries. It is not a good combination with operators: Boolean, wild-card, positional, proximity. It powered the first generation of search engines, invented before the "spamming" of web search.

Relevance Feedback: Rocchio. The user sends his/her query Q. The search engine returns its best results. The user marks some results as relevant and resubmits the query plus the marked results (i.e., repeats the query!). The search engine exploits this refined knowledge of the user's need to return more relevant results.
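The classic way to exploit the marked results is the Rocchio query update. The slide does not give the formula, so this sketch uses the standard one, q' = αq + β·centroid(relevant) − γ·centroid(non-relevant), with conventional illustrative weights:

```python
# Rocchio relevance-feedback update over term-weight vectors.

def rocchio(q, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    dim = len(q)
    def centroid(docs):
        if not docs:
            return [0.0] * dim
        return [sum(d[i] for d in docs) / len(docs) for i in range(dim)]
    cr, cn = centroid(relevant), centroid(nonrelevant)
    # Negative components are conventionally clipped to 0.
    return [max(0.0, alpha * q[i] + beta * cr[i] - gamma * cn[i])
            for i in range(dim)]

# The refined query drifts toward the docs marked as relevant.
q_new = rocchio([1.0, 0.0], relevant=[[0.0, 1.0]], nonrelevant=[])
print(q_new)
```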

Information Retrieval Efficient cosine computation

Find the top-k documents for Q. IR solves the k-nearest-neighbor problem for each query. For short queries, standard indexes are optimized; for long queries, or for doc-to-doc similarity, we are in high-dimensional spaces: locality-sensitive hashing…

Encoding document frequencies. #docs(t) is useful to compute the IDF; the per-posting term count is useful to compute tf_{t,d}. Unary code is very effective for tf_{t,d}.
abacus, in 8 docs: (1,2) (7,3) (83,1) (87,2) …
aargh, in 12 docs: (1,1) (5,1) (13,1) (17,1) …
acacia, in 35 docs: (7,1) (8,2) (40,1) (97,3) …
Each posting is a (docID, tf) pair: the list's document count gives the IDF, the second component of each pair gives the TF.

Computing a single cosine. Accumulate the component-wise sum. Three problems: # sums per doc = # vocabulary terms; # sim-computations = # documents; # accumulators = # documents.

Two tricks. Sum only over the terms in Q (those with w_{t,q} ≠ 0). Compute sim only for the docs appearing in some inverted list (those with w_{t,d} ≠ 0).
abacus, in 8 docs: (1,2) (7,3) (83,1) (87,2) …
aargh, in 12 docs: (1,1) (5,1) (13,1) (17,1) …
acacia, in 35 docs: (7,1) (8,2) (40,1) (97,3) …
On the query "aargh abacus" we would only use accumulators 1, 5, 7, 13, 17, …, 83, 87, … We could even restrict to the docs in the intersection!
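The two tricks can be sketched as a term-at-a-time scorer over the toy index above: accumulators are opened only for docs in the posting lists of the query terms (the idf values are made up):

```python
# Term-at-a-time scoring: iterate only over query terms, and open an
# accumulator only for docs appearing in their posting lists.
index = {                                 # term -> [(docID, tf), ...]
    "abacus": [(1, 2), (7, 3), (83, 1), (87, 2)],
    "aargh":  [(1, 1), (5, 1), (13, 1), (17, 1)],
}

def score(query, index, idf):
    acc = {}  # accumulators: only docs sharing a term with the query
    for t in query:
        for doc, tf in index.get(t, []):
            acc[doc] = acc.get(doc, 0.0) + tf * idf[t]
    return sorted(acc.items(), key=lambda x: -x[1])

idf = {"abacus": 2.0, "aargh": 1.5}       # illustrative idf values
results = score(["aargh", "abacus"], index, idf)
print(results)
```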

Advanced #1: Approximate rank-search. Preprocess: assign to each term its m best documents (according to tf-idf). This takes lots of preprocessing. Result: a "preferred list" of answers for each term. Search: for a q-term query, take the union of the q preferred lists; call this set S, where |S| ≤ mq. Compute cosines from the query only to the docs in S, and choose the top k. We need to pick m > k for this to work well empirically.
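The search phase can be sketched as follows, assuming the per-term preferred lists and the query-doc scores are available (all names and values here are illustrative):

```python
# Approximate rank-search over the union of per-term preferred lists.
import heapq

def preferred_search(query, preferred, scores, k):
    # preferred: term -> its m best docIDs (precomputed offline)
    S = set()
    for t in query:
        S |= set(preferred.get(t, []))   # |S| <= m * q
    # scores stands in for the full query-doc cosine computation.
    return heapq.nlargest(k, S, key=lambda d: scores[d])

preferred = {"ir": [1, 4, 9], "web": [4, 2, 7]}     # m = 3
scores = {1: 0.2, 2: 0.9, 4: 0.8, 7: 0.1, 9: 0.5}  # made-up cosines
print(preferred_search(["ir", "web"], preferred, scores, k=2))
```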

Advanced #2: Clustering. [Figure: a query point among cluster leaders and their followers in the vector space.]

Advanced #2: Clustering. Recall that docs are T-dimensional vectors. Pre-processing phase on n docs: pick L docs at random, and call these the leaders; for each other doc, pre-compute its nearest leader. The docs attached to a leader are its followers; likely, each leader has ~ n/L followers. Process a query as follows: given the query Q, find its nearest leader, then seek the k nearest docs from among that leader's followers.
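The pre-processing and query phases above can be sketched as follows, using cosine as the proximity measure (the toy documents are made up):

```python
# Leader/follower clustering: random leaders, each doc attached to its
# nearest leader; a query is routed to its nearest leader's followers.
import math
import random

def cos(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def build(docs, L, seed=0):
    random.seed(seed)
    leaders = random.sample(range(len(docs)), L)
    followers = {l: [] for l in leaders}
    for i, d in enumerate(docs):
        best = max(leaders, key=lambda l: cos(d, docs[l]))
        followers[best].append(i)
    return leaders, followers

def query(q, docs, leaders, followers, k):
    lead = max(leaders, key=lambda l: cos(q, docs[l]))
    cand = followers[lead]              # only ~ n/L docs are scored
    return sorted(cand, key=lambda i: -cos(q, docs[i]))[:k]

docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
leaders, followers = build(docs, L=2)
print(query([1.0, 0.0], docs, leaders, followers, k=2))
```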

Advanced #3: Pruning. Classic approach: scan the docs and compute their sim(d,q). Accumulator approach: all sim(d,q) are computed in parallel. Build an accumulator array containing sim(d,q) for all d. Exploit the inverted lists: for each t in Q and for each d in IL_t, compute tf-idf_{t,d} and add it to sim(d,q). We re-structure the computation: the terms of Q are considered by decreasing IDF (i.e., smaller ILs first), and the documents in each IL are ordered by decreasing TF. This way, when a term is picked and its IL is scanned, the tf-idf contributions are computed in decreasing order, so that we can apply some pruning.

Advanced #4: Not only term weights. Current search engines use various measures for estimating the relevance of a page w.r.t. a query: Relevance(Q,d) = h(d, t_1, t_2, …, t_q). PageRank [Google] is one of these methods, and denotes a relevance that takes into account the hyperlinks of the Web graph (more later). Google ≈ tf-idf PLUS PageRank (PLUS other weights). The Google toolbar suggests that PageRank is crucial.

Advanced #4: The fancy-hits heuristic. Preprocess: Fancy_hits(t) = the m docs of IL(t) with the highest tf-idf weight; sort IL(t) by decreasing PR weight. Idea: a document that scores high should be in FH or near the front of IL. Search for a q-term query: First, FH: take the docs common to the query terms' FH lists, compute their scores and keep the top-k docs. Then, IL: scan the ILs and check the docs common to the ILs and the FHs; compute their scores and possibly insert them into the current top-k. Stop when m docs have been checked or the PR score goes below some threshold.

Advanced #5: High-dimensional spaces. Binary vectors are easier to manage. Map a unit vector u to a bit: h_r(u) = sign(r · u), where r is drawn at random from the unit sphere. The hash h is locality sensitive: the closer two vectors are, the likelier they agree on the bit. Map u to the sketch (h_1(u), h_2(u), …, h_k(u)), and repeat g times to control the error. If A and B are sets, min-hashing plays the analogous role, with collision probability related to their Jaccard similarity.

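One standard instance of such a locality-sensitive map is sign random projection (SimHash). The sketch below assumes Gaussian random directions, a common choice not stated on the slide:

```python
# SimHash: each random direction r contributes one bit sign(r . u);
# nearly parallel vectors agree on most bits, opposite ones on few.
import random

def simhash(u, planes):
    return tuple(1 if sum(r_i * u_i for r_i, u_i in zip(r, u)) >= 0 else 0
                 for r in planes)

random.seed(42)
k, dim = 16, 3
planes = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(k)]

u = [1.0, 0.2, 0.0]
v = [1.0, 0.25, 0.05]    # nearly parallel to u
w = [-1.0, 0.0, 0.3]     # nearly opposite to u
agree = lambda a, b: sum(x == y for x, y in zip(a, b))
print(agree(simhash(u, planes), simhash(v, planes)),
      agree(simhash(u, planes), simhash(w, planes)))
```

With 16 bits the sketch of v agrees with u's on almost every position, while w's agrees on very few, which is exactly the locality-sensitive property.
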
Information Retrieval Recommendation Systems

Recommendations. We have a list of restaurants with positive and negative ratings from some users. Which restaurant(s) should I recommend to Dave?

Basic Algorithm. Recommend the most popular restaurants, say by # positive votes minus # negative votes. But what if Dave does not like Spaghetti?

Smart Algorithm. Basic idea: find the person "most similar" to Dave according to cosine similarity (i.e., Estie), and then recommend something this person likes. Perhaps recommend Straits Cafe to Dave. But do you want to rely on one person's opinions?
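The smart algorithm can be sketched as nearest-neighbor collaborative filtering over +1/−1/0 rating vectors (the users, restaurants and ratings below are invented for illustration, apart from the names on the slide):

```python
# Recommend by finding the user most cosine-similar to Dave, then
# suggesting an item that user liked and Dave has not rated yet.
import math

ratings = {            # +1 like, -1 dislike, 0 unrated
    "Dave":  [ 1, -1,  0,  0],
    "Estie": [ 1, -1,  1,  0],
    "Bob":   [-1,  1,  0,  1],
}
items = ["Il Fornaio", "Spaghetti!", "Straits Cafe", "Zao"]

def cos(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

me = ratings["Dave"]
best = max((u for u in ratings if u != "Dave"),
           key=lambda u: cos(me, ratings[u]))
recs = [items[i] for i, r in enumerate(ratings[best])
        if r > 0 and me[i] == 0]
print(best, recs)
```

Here Estie agrees with Dave on every co-rated item, so she is the nearest neighbor and her liked-but-unrated-by-Dave restaurant is recommended, mirroring the slide's example.
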